# Feature Engineering
**Theory Question**

Q.1 What is a parameter?

→ In feature engineering, a parameter is a user-defined setting that controls how a data transformation is applied during preprocessing. Feature engineering converts raw data into meaningful features that improve the performance of machine learning models, and transformation techniques such as scaling, encoding & imputing use parameters to customize how they process data (for example, the fill strategy of an imputer or the output range of a scaler). In the broader machine learning sense, the term also refers to the values a model learns from data, such as the weights of a linear model.
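
As a small illustration of a transformation parameter, the sketch below uses Scikit-learn's SimpleImputer, whose `strategy` parameter selects how missing values are filled. The toy column is an assumption made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature column with one missing value (illustrative data).
X = np.array([[1.0], [2.0], [np.nan], [9.0]])

# The `strategy` parameter controls how the gap is filled.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)

print(mean_filled.ravel())    # gap replaced by the mean of 1, 2 and 9
print(median_filled.ravel())  # gap replaced by the median of 1, 2 and 9
```

The same data produces different features depending only on the parameter value, which is why such settings are worth recording alongside the model.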

Q.2 What is correlation? What does negative correlation mean?

→
1. Correlation:
- Correlation is a statistical measure that describes the strength & direction of the relationship between two variables. It shows how one variable tends to change as the other changes, without implying that one causes the other.
- The correlation coefficient (Pearson's r) is usually denoted by r & ranges between -1 & +1.
- +1 = perfect positive correlation
- 0 = no linear correlation
- -1 = perfect negative correlation.
2. Negative Correlation:
- A negative correlation means that as one variable increases, the other decreases.
- Ex.: If study time & number of errors have a negative correlation, it means that more study time tends to go together with fewer errors.
- The correlation coefficient (r) in this case is less than 0; the closer it is to -1, the stronger the negative correlation.
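
To make the coefficient concrete, here is a minimal sketch that computes Pearson's r directly from its definition. The `pearson_r` helper and the study-time data are hypothetical and exist only for this illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. errors made on a test.
study_hours = [1, 2, 3, 4, 5]
errors = [9, 7, 6, 4, 2]

r = pearson_r(study_hours, errors)
print(round(r, 3))  # a negative value close to -1
```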

Q.3 Define Machine Learning. What are the main components in Machine Learning?

→ Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on building systems that can learn from data & make decisions or predictions without being explicitly programmed. In simple terms, ML enables computers to learn patterns from historical data & improve performance over time.

The main components in ML are as follows:
1. Dataset
2. Features (Input Variables)
3. Labels (Target Variable)
4. Model (Algorithm)
5. Training
6. Evaluation
7. Prediction/Inference

Q.4 How does loss value help in determining whether the model is good or not?

→ The loss value is a numerical measure of how far a machine learning model's predictions are from the actual (target) values. During training, the optimizer tries to minimize it, so a loss that keeps decreasing indicates that the model is learning. A low loss on the training data alone is not enough, though: a good model also has a low loss on validation or test data it has not seen. A training loss that is much lower than the validation loss usually signals overfitting.
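
As one possible illustration, the sketch below computes mean squared error (a common regression loss) for two sets of made-up predictions. The `mse` helper and all the numbers are assumptions for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2, 4, 6, 8]
good_predictions = [2.1, 3.9, 6.2, 7.8]  # close to the targets
poor_predictions = [5, 5, 5, 5]          # ignores the inputs entirely

loss_good = mse(y_true, good_predictions)
loss_poor = mse(y_true, poor_predictions)
print(loss_good, loss_poor)  # the better predictions give the smaller loss
```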

Q.5 What are continuous and categorical variables?

→
1. Continuous Variables:
- Continuous variables are numeric variables that can take any value within a range.
- They can take infinitely many possible values, including decimals & fractions.
- They represent measurable quantities & can be used in arithmetic operations.
- Ex.: Height (165.4 cm), Temperature (36.7°C), Income (₹50,000.75).
2. Categorical Variables:
- Categorical variables represent groups or categories.
- They can take on a limited number of distinct values (also called "labels" or "classes").
- The two types of categorical variables are nominal (no natural order) & ordinal (ordered categories).
- Ex.: Gender (Male, Female, Other), Color (Red, Blue, Green).

Q.6 How do we handle categorical variables in Machine Learning? What are the common techniques?

→ Categorical variables contain labels or categories (e.g., Gender, Color, Country) that must be converted into a numerical format so that machine learning models can process them.

Common techniques for handling categorical variables in Machine Learning are as follows:
1. Label Encoding
2. One-Hot Encoding
3. Ordinal Encoding
4. Target Encoding
5. Binary Encoding / Hash Encoding
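
Two of the techniques above can be sketched with pandas alone. The column names, categories & the `size_order` mapping are assumptions chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],  # nominal: no natural order
    "size": ["S", "M", "L", "M"],               # ordinal: S < M < L
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: map each level to its rank (the order is supplied by us).
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot)
print(df[["size", "size_encoded"]])
```

One-hot encoding suits nominal variables because it does not impose an order, while the explicit mapping preserves the order of an ordinal variable.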

Q.7 What do you mean by training and testing a dataset?

→ In machine learning, a dataset is usually split into two main parts: training and testing sets. This helps evaluate how well a model performs on new, unseen data.
1. Training Dataset:
- The training dataset is the portion of data used to teach the model.
- The model learns patterns, relationships, and rules from this data.
- This phase involves adjusting internal parameters (like weights) to minimize error (loss).
- Ex.: Training a house price prediction model using 80% of the available house data.
2. Testing Dataset:
- The testing dataset is the portion of data used to evaluate the trained model.
- It contains new data that the model has never seen before.
- Helps measure how well the model generalizes to real-world scenarios.
- Ex.: After training, the model predicts house prices on the remaining 20% of the data.

Q.8 What is sklearn.preprocessing?

→ sklearn.preprocessing is a module in Scikit-learn (a popular Python machine learning library) that provides tools for transforming & preparing data before feeding it into a machine learning model.
- It helps in scaling, encoding, normalizing & transforming data so that machine learning models can process it effectively.
- It is essential for building accurate & efficient machine learning models.

Q.9 What is a Test set?

→ A test set is a portion of the dataset that is kept separate from the training process & is used to evaluate the final performance of a trained machine learning model. It plays a critical role in determining how well the model will perform in the real world.

Key Characteristics of Test Set:
- Not used during training: The model has never seen this data before.
- Used for evaluation only: It helps assess how well the model performs on unseen or real-world data.
- Represents future data: Acts as a simulation of how the model will behave in practical use.

Q.10 How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

→ In Python, we commonly use Scikit-learn's train_test_split function to split the dataset into training & testing sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Approaching a machine learning problem involves a structured process. Here's a common step-by-step approach:
1. Problem Understanding
2. Data Collection
3. Data Cleaning
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Splitting the Data
7. Model Selection
8. Model Training
9. Model Evaluation
10. Model Tuning & Optimization
11. Deployment
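
The middle steps of this workflow can be sketched end to end as follows. This is a minimal sketch, assuming a synthetic dataset from `make_regression` as a stand-in for data collection & a plain linear model; real projects would add cleaning, EDA & tuning.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data collection stand-in: a synthetic regression dataset.
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# Splitting the data: hold out 20% for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature engineering + model selection: scaling followed by linear regression.
model = make_pipeline(StandardScaler(), LinearRegression())

# Model training: the scaler and the regressor are fitted on training data only.
model.fit(X_train, y_train)

# Model evaluation on data the model has not seen.
score = r2_score(y_test, model.predict(X_test))
print(f"R^2 on the test set: {score:.3f}")
```

Wrapping the scaler & the model in one pipeline keeps the test data out of the preprocessing fit.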

Q.11 Why do we have to perform EDA before fitting a model to the data?

→ EDA (Exploratory Data Analysis) is a crucial step in the machine learning pipeline that helps you understand the structure, patterns & quality of your data before applying any model. It ensures that your model is trained on clean, meaningful & well-understood data.

Key Reasons to Perform EDA Before Model Fitting:
1. Understand Data Distribution
2. Detect Missing or Incorrect Values
3. Identify Outliers & Anomalies
4. Reveal Relationships Between Features
5. Feature Selection & Engineering
6. Understand Class Balance
7. Choose the Right Model or Preprocessing.
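
A few of these checks can be sketched with pandas. The DataFrame below is made up for illustration, & the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Made-up dataset with one missing age and one suspicious income.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "income": [30000, 42000, 58000, 61000, 900000, 45000],
    "purchased": ["no", "no", "yes", "yes", "yes", "no"],
})

summary = df.describe()                        # distribution of numeric columns
missing = df.isna().sum()                      # missing values per column
class_counts = df["purchased"].value_counts()  # class balance of the target

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(summary)
print(missing)
print(class_counts)
print(outliers)
```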

Q.12 What is correlation?

→ Correlation is a statistical measure that describes the strength & direction of the relationship between two variables. It's widely used in data analysis to identify patterns & dependencies before building machine learning models.
- It tells you how one variable changes with respect to another.
- The correlation is represented by a correlation coefficient (r), which ranges from -1 to +1.
- Types of Correlation: Positive Correlation, Negative Correlation & No Correlation.

Q.13 What does negative correlation mean?

→ A negative correlation means that as one variable increases, the other decreases & vice versa. This relationship is important in understanding inverse dependencies in data.
- Represented by a correlation coefficient (r) less than 0 (ranges between -1 & 0).
- The closer r is to -1, the stronger the negative correlation.

Q.14 How can you find correlation between Variables in Python?

→ In Python, we can use Pandas or NumPy to compute the correlation between variables in a dataset: pandas.DataFrame.corr() returns the pairwise correlation matrix of a DataFrame's columns (Pearson by default), & numpy.corrcoef() returns the correlation matrix of the arrays passed to it. Visualizing the matrix with Seaborn's heatmap() makes the relationships between features easier to read before modeling.

Common methods for finding the correlation between variables in Python are given below:
- Using the Pandas .corr() method
- Using NumPy corrcoef()
- Visualizing correlation with a heatmap (optional but helpful).
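
The first two methods can be sketched as follows, using a small made-up DataFrame. The column names & values are assumptions for this example; the heatmap call is left as a comment because it needs Seaborn & a plotting backend.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5],
    "errors": [9, 7, 6, 4, 2],
    "score": [52, 60, 71, 80, 93],
})

# Pandas: full pairwise correlation matrix (Pearson by default).
corr_matrix = df.corr()
print(corr_matrix)

# NumPy: correlation between two specific variables.
r = np.corrcoef(df["study_hours"], df["errors"])[0, 1]
print(r)

# Optional visualization (requires seaborn and matplotlib):
# import seaborn as sns
# sns.heatmap(corr_matrix, annot=True)
```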

Q.15 What is causation? Explain difference between correlation and causation with an example.

→ Causation means that a change in one variable directly produces a change in another variable. In other words, if variable A changes, it causes variable B to change.
- Correlation means two variables change together, but it does not by itself imply that one causes the other.
- Causal relationships usually show up as a correlation, but not all correlations are causal.
- Ex. (causation): Heating water raises its temperature until it boils; the heat causes the boiling.
- Ex. (correlation without causation): Ice cream sales & sunburn cases rise together in summer, but neither causes the other; hot weather drives both.
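
The difference can also be seen in a small simulation. In this sketch, a hypothetical `temperature` variable drives two other simulated variables that never influence each other; they still end up strongly correlated, & the correlation largely disappears once the shared cause is subtracted out. All variable names & coefficients are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# A common cause and two variables that each depend on it, not on each other.
temperature = rng.normal(size=n)
ice_cream_sales = temperature + 0.3 * rng.normal(size=n)
sunburn_cases = temperature + 0.3 * rng.normal(size=n)

# Strong correlation, even though neither variable causes the other.
r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

# After removing the common cause, what remains is independent noise.
r_adjusted = np.corrcoef(
    ice_cream_sales - temperature, sunburn_cases - temperature
)[0, 1]

print(round(r_raw, 2), round(r_adjusted, 2))
```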

Q.16 What is an Optimizer? What are different types of optimizers? Explain each with an example.

→ An optimizer is a key component in machine learning, especially in deep learning, that is used to adjust the model's parameters (like weights) in order to minimize the loss function & improve model performance. It helps the model learn by updating weights to reduce prediction errors.

Here are the most commonly used optimizers:
1. Gradient Descent (GD):
- It calculates the gradient of the loss function for the entire dataset & updates weights accordingly.
- This type of optimizer is simple & theoretically effective, but it is slow for large datasets.
2. Stochastic Gradient Descent (SGD):
- It updates weights using a single data point (sample) at a time; in practice, a small mini-batch is often used instead.
- Each update is much cheaper than in batch GD, but the updates are noisy.
- Ex.:

In [None]:
import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # model: an existing torch.nn.Module

3. Momentum:
- It adds a momentum term that accelerates SGD in the relevant direction & dampens oscillations.
- It helps the optimizer move faster along narrow valleys of the loss surface toward the minimum.
- Ex.:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

4. RMSProp (Root Mean Square Propagation):
- It addresses AdaGrad's continually shrinking learning rate by using an exponentially decaying moving average of squared gradients.
- It is often used for Recurrent Neural Networks (RNNs) & time-series tasks.
- Ex.:

In [None]:
import tensorflow as tf
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)

5. Adam (Adaptive Moment Estimation):
- It combines Momentum & RMSProp: it keeps moving averages of the gradients (momentum) & of the squared gradients to adapt the learning rate for each parameter.
- It typically converges fast, works well on most problems & is one of the most commonly used optimizers in deep learning.
- Ex.:

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
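
To show what these update rules do without any deep learning library, here is a minimal sketch of plain gradient descent & of the momentum update on a one-dimensional toy loss f(w) = (w - 3)^2. The loss, learning rate & step count are assumptions; the momentum form (velocity accumulates gradients, then scales by the learning rate) is one common formulation.

```python
def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=300):
    """Plain gradient descent: step against the gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum_descent(w, lr=0.1, momentum=0.9, steps=300):
    """Gradient descent with momentum: step along a running sum of gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(w)
        w -= lr * velocity
    return w

w_gd = gradient_descent(0.0)
w_momentum = momentum_descent(0.0)
print(w_gd, w_momentum)  # both end up close to the minimum at w = 3
```

On such a simple loss both variants reach the minimum; the benefit of momentum shows up mainly on elongated, poorly conditioned loss surfaces.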

Q.17 What is sklearn.linear_model?

→ sklearn.linear_model is a module in the Scikit-learn library that contains classes & functions used to implement linear models for regression & classification tasks. In simple terms, it provides tools to fit linear algorithms to your data, such as Linear Regression, Logistic Regression, Ridge Regression & Lasso.
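
As a brief illustration of a classifier from this module, the sketch below fits LogisticRegression on a tiny made-up dataset in which larger inputs belong to class 1. The data is an assumption chosen only to keep the example readable.

```python
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: small values are class 0, large values are class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# Predict the class of the smallest and the largest training input.
pred = clf.predict([[0.0], [3.0]])
print(pred)
print(clf.coef_, clf.intercept_)  # learned weight and bias of the linear model
```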

Q.18 What does model.fit() do? What arguments must be given?

→ The model.fit() function is used to train a machine learning model using a given dataset. It tells the model to learn the patterns in the training data by adjusting its internal parameters (like weights) to minimize the error (loss).

_"model.fit(X_train, y_train)"_
- X_train: The input features (independent variables)
- y_train: The target labels or outputs (dependent variable)
- After calling fit(), the model is trained & ready to make predictions using .predict().

In [2]:
from sklearn.linear_model import LinearRegression
X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]
model = LinearRegression()
model.fit(X_train, y_train)

Q.19 What does model.predict() do? What arguments must be given?

→ The .predict() method is used after training a machine learning model with 'fit()'. It allows the model to make predictions on new or unseen data based on what it has learned.

_"predictions = model.predict(X_test)"_
- X_test: The input features (independent variables) for which you want predictions.
- The method returns the predicted outputs (y_pred) for each input row in X_test.

In [3]:
from sklearn.linear_model import LinearRegression
X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]
model = LinearRegression()
model.fit(X_train, y_train)
X_test = [[5], [6]]
predictions = model.predict(X_test)
print(predictions)

[10. 12.]


Q.20 What are continuous and categorical variables?

→ In machine learning, variables (also called features or attributes) are typically classified into two main types: continuous & categorical.
1. Continuous Variables:
- Continuous variables are numeric variables that can take any value within a range.
- They can take infinitely many possible values, including decimals & fractions.
- They represent measurable quantities & can be used in arithmetic operations.
- Ex.: Height (165.4 cm), Temperature (36.7°C), Income (₹50,000.75).
2. Categorical Variables:
- Categorical variables represent groups or categories.
- They can take on a limited number of distinct values (also called "labels" or "classes").
- The two types of categorical variables are nominal (no natural order) & ordinal (ordered categories).
- Ex.: Gender (Male, Female, Other), Color (Red, Blue, Green).

Q.21 What is feature scaling? How does it help in Machine Learning?

→ Feature scaling is a preprocessing technique in machine learning used to normalize or standardize the range of independent variables (features) in your dataset. In simple terms, feature scaling ensures that all features are on the same scale, especially when they have different units or ranges.

Feature scaling matters in ML for the following reasons:
- Improves model performance: especially for algorithms sensitive to feature magnitude.
- Speeds up training: gradient-based optimization converges faster.
- Avoids dominance: prevents large-valued features from dominating smaller ones.
- It is especially important for K-Nearest Neighbors (KNN), Support Vector Machines (SVM), regularized or gradient-descent-trained Linear / Logistic Regression, Principal Component Analysis (PCA) & Neural Networks. Tree-based models such as decision trees & random forests are largely insensitive to feature scale.
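
The dominance effect can be sketched with a distance calculation of the kind KNN relies on. The three people, their ages & their incomes are made up for this example, & min-max scaling is applied by hand with NumPy.

```python
import numpy as np

# Columns: age (years), income. Rows: persons a, b, c (made-up values).
people = np.array([
    [25.0, 50000.0],  # a
    [60.0, 50500.0],  # b: very different age, similar income
    [26.0, 51000.0],  # c: almost the same age, somewhat different income
])

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# Unscaled: income differences swamp age differences, so b looks closer to a.
d_ab, d_ac = dist(people[0], people[1]), dist(people[0], people[2])

# Min-max scale each column to [0, 1] so both features contribute comparably.
col_min, col_max = people.min(axis=0), people.max(axis=0)
scaled = (people - col_min) / (col_max - col_min)
d_ab_scaled, d_ac_scaled = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(d_ab, d_ac)                # b is nearer to a before scaling
print(d_ab_scaled, d_ac_scaled)  # c is nearer to a after scaling
```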

Q.22 How do we perform scaling in Python?

→ Feature scaling in Python is typically done using Scikit-learn's preprocessing module, which provides several easy-to-use scalers like:
1. StandardScaler
- Scales features so they have mean = 0 & standard deviation = 1.
2. MinMaxScaler
- Scales features to a fixed range, [0, 1] by default.
3. RobustScaler
- Scales using the median & IQR, making it robust to outliers.
4. Normalizer
- Scales each data sample (row) to unit norm.
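
A minimal sketch of the first two scalers on a toy column is shown below; the data is made up. In a real project the scaler would normally be fitted on the training set only & then applied to the test set with transform().

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature column (illustrative data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.ravel())
print(X_minmax.ravel())
```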

Q.23 What is sklearn.preprocessing?

→ sklearn.preprocessing is a module in Scikit-learn (a popular Python machine learning library) that provides tools for transforming & preparing data before feeding it into a machine learning model.
- It helps in scaling, encoding, normalizing & transforming data so that machine learning models can process it effectively.
- It is essential for building accurate & efficient machine learning models.

Q.24 How do we split data for model fitting (training and testing) in Python?

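→ In Python, we commonly use Scikit-learn's train_test_split function to split the dataset into training & testing sets. The test_size argument sets the fraction of rows held out for testing, & random_state fixes the shuffle so that the split is reproducible.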

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5]]
y = [10, 20, 30, 40, 50]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

X_train: [[5], [1], [4]]
X_test: [[3], [2]]
y_train: [50, 10, 40]
y_test: [30, 20]


Q.25 Explain data encoding?

→ Data encoding is the process of converting categorical data into numerical form so that machine learning models can process it. Most ML models work with numbers, not text, so encoding translates human-readable categories into a machine-readable format. Choosing the right encoding method depends on the type of categorical variable (nominal vs ordinal) & the machine learning algorithm being used.

Types of Data Encoding Techniques:
1. Label Encoding
2. One-Hot Encoding
3. Ordinal Encoding
4. Binary Encoding
5. Target Encoding
6. Hash Encoding

Encoding is important for the following reasons:
- Most algorithms (like Logistic Regression, SVM, KNN) require numeric input.
- Encoding helps convert non-numeric labels (like "Male", "Red", "India") into numbers without losing meaning.
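
Two of the listed techniques, label encoding & target encoding, can be sketched as follows. The city & price data is made up; LabelEncoder is documented by Scikit-learn mainly for encoding target labels, & the simple group-mean target encoding shown here would leak target information if computed on data that is later used for evaluation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data: a categorical feature and a numeric target.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "price": [50, 80, 70, 90, 100],
})

# Label encoding: each category gets an integer (assigned in sorted order).
encoder = LabelEncoder()
df["city_label"] = encoder.fit_transform(df["city"])

# Target encoding: replace each category by the mean target of its group.
df["city_target"] = df.groupby("city")["price"].transform("mean")

print(list(encoder.classes_))
print(df)
```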