# Feature Engineering

## Assignment Questions

## 1.What is a parameter?

### A parameter is a value inside a model that influences how the model makes predictions. These values are not set manually — the algorithm automatically adjusts them to fit the data better.

### 🧠 Examples:
1. Linear Regression
Model:

   y=wx+b

- w (weight) and b (bias) are parameters.
- During training, the algorithm adjusts w and b to minimize the prediction error.
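
As a rough illustration of the points above, the sketch below fits scikit-learn's `LinearRegression` to a few made-up points that follow y = 2x + 1 and reads back the learned parameters. The data and variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly (illustrative only)
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, 1)
y = 2.0 * x.ravel() + 1.0

model = LinearRegression()
model.fit(x, y)  # training estimates w and b from the data

w = model.coef_[0]    # learned weight (slope)
b = model.intercept_  # learned bias (intercept)
print(f"w = {w:.2f}, b = {b:.2f}")
```

Neither `w` nor `b` is typed in by hand; both are recovered from the data during `fit`.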

## 2.What is correlation? What does negative correlation mean?

### Correlation measures the relationship between two variables — how they move together.

- It ranges from -1 to +1.
  - +1: Perfect positive correlation (both go up together)
  - 0: No linear correlation
  - -1: Perfect negative correlation (one goes up, the other goes down)

### 🔻 What is Negative Correlation?
- Negative correlation means that when one variable increases, the other tends to decrease.

## 3.Define Machine learning. What are the main components in machine learning? 

### Machine Learning (ML) is a branch of Artificial Intelligence where computers learn patterns from data and make decisions or predictions without being explicitly programmed.

🧩 Main Components in Machine Learning:
- 1.Data
   - Raw information used to train and test the model.
   - Example: customer age, income, product ratings.

- 2.Model
   - The mathematical structure that learns from data and makes predictions.
   - Example: Linear Regression, Decision Tree.

- 3.Algorithm
   - The method or process used to train the model.
   - Example: Gradient Descent, k-NN, Random Forest algorithm.

- 4.Loss Function
   - Measures how far the model’s prediction is from the actual value.
   - Example: Mean Squared Error (MSE).

- 5.Training
   - The process of feeding data into the model to learn patterns and adjust parameters.

## 4.How does loss value help in determining whether the model is good or not?

### The loss value tells us how far off the model’s predictions are from the actual values.

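- Low loss = Model predictions are close to the actual values → ✅ Good model
- High loss = Model predictions are far off → ❌ Poor model
- Loss should be checked on data the model has not seen: a low training loss with a much higher test loss is a sign of overfitting, not of a good model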
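
To make the comparison concrete, here is a minimal sketch that computes Mean Squared Error for two sets of predictions. The numbers and the helper name `mean_squared_error` are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up numbers for illustration
actual = [3.0, 5.0, 7.0, 9.0]
close_predictions = [2.9, 5.2, 6.8, 9.1]  # small errors
far_predictions = [1.0, 8.0, 4.0, 12.0]   # large errors

low_loss = mean_squared_error(actual, close_predictions)
high_loss = mean_squared_error(actual, far_predictions)
print(f"loss of close predictions: {low_loss:.3f}")
print(f"loss of far predictions:   {high_loss:.3f}")
```

The first set of predictions gives a much smaller loss than the second, which matches the low-loss and high-loss cases described above.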

## 5.What are continuous and categorical variables?

### 🔢 Continuous Variables
- Variables that can take any numeric value within a range.
- They are measurable.

Examples:
- Height (e.g., 172.5 cm)
- Temperature (e.g., 36.7°C)
- Salary (e.g., ₹45,500)

### 🔠 Categorical Variables
- Variables that represent groups or categories.
- They are not numeric (or are treated as labels).

Examples:
- Gender (Male, Female)
- Color (Red, Blue, Green)
- City (Delhi, Mumbai, Chennai)

## 6. How do we handle categorical variables in machine learning? What are the common techniques?

### 1. Label Encoding
- Converts each category into a unique number.
- Example:
Color = [Red, Green, Blue] → Red=0, Green=1, Blue=2
- ✅ Simple
- ⚠️ May create a false order (not suitable for non-ordinal categories)

### 2. One-Hot Encoding
- Creates a new column for each category with 0 or 1.
- Example:
  Color = Red → [1, 0, 0]
  Color = Blue → [0, 0, 1]
- ✅ No false order
- ⚠️ Increases data size if many categories

### 3. Ordinal Encoding
- Used when categories have a natural order.
- Example:
  Size = [Small, Medium, Large] → Small=1, Medium=2, Large=3

### 4. Binary Encoding / Target Encoding / Frequency Encoding
- Advanced techniques used when dealing with high-cardinality data (many unique values).
- Example: Encoding based on how often each category appears.
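
The first three techniques can be sketched with scikit-learn's encoders, as below. The sample values are made up. Note that scikit-learn assigns codes in sorted order and starts counting at 0, so the numbers differ from the hand-written mappings above, and that `LabelEncoder` is intended mainly for target labels.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

colors = ["Red", "Green", "Blue", "Green"]
sizes = [["Small"], ["Large"], ["Medium"]]  # 2D: one row per sample

# Label encoding: one integer per category (classes are sorted alphabetically)
label_encoder = LabelEncoder()
color_labels = label_encoder.fit_transform(colors)
print(label_encoder.classes_, color_labels)

# One-hot encoding: one 0/1 column per category (columns: Blue, Green, Red)
one_hot = OneHotEncoder().fit_transform([[c] for c in colors]).toarray()
print(one_hot)

# Ordinal encoding: pass the order explicitly so Small < Medium < Large
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal_encoder.fit_transform(sizes)
print(size_codes.ravel())
```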

## 7.What do you mean by training and testing a dataset?

### 📚 Training Dataset
- This is the data used to teach the model.
- The model learns patterns from this data.
- Contains both inputs (features) and outputs (labels).

### 🧪 Testing Dataset
- This is the data used to check how well the model performs.
- It is never shown to the model during training.
- Helps us measure accuracy, precision, etc., on unseen data.

## 8.What is sklearn.preprocessing?

### Real-world data is often unclean or unfit for direct use in ML models.
### sklearn.preprocessing helps in scaling, encoding, and transforming features.

| Function               | What It Does                                 | Example Use                  |
| ---------------------- | -------------------------------------------- | ---------------------------- |
| `StandardScaler()`     | Standardizes features (mean = 0, std = 1)    | For algorithms like SVM      |
| `MinMaxScaler()`       | Scales features to a range (e.g., 0 to 1)    | For neural networks          |
| `LabelEncoder()`       | Converts labels to numbers                   | Encode class labels          |
| `OneHotEncoder()`      | Converts categories to binary vectors        | For nominal categorical data |
| `Binarizer()`          | Converts values to 0 or 1 based on threshold | For thresholding features    |
| `PolynomialFeatures()` | Adds interaction/power terms                 | For non-linear relationships |


## 9. What is a test set?

### A test set is a portion of the dataset that is used to evaluate the performance of a trained machine learning model.

### 🧠 Example:
If you have 1000 rows of data:
 - Training set: 80% (800 rows) → used to train the model
 - Test set: 20% (200 rows) → used to test the model

## 10.How do we split data for model fitting (training and testing) in Python? How do you approach a machine learning problem?

### 🧪 Example:

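```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X = features, y = target (a built-in dataset is used here as a placeholder)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```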


### 🧠 How Do You Approach a Machine Learning Problem?
🪜 Step-by-Step Approach:
1.Understand the Problem
  - What is being predicted? (classification, regression, etc.)
  - What is the goal?

2.Collect Data
  - From CSV, databases, APIs, etc.

3.Explore and Clean Data
  - Handle missing values, duplicates, incorrect types
  - Understand data distributions (EDA)

4.Preprocess Data
  - Encode categorical variables (Label/One-Hot)
  - Scale numerical features
  - Split into training and test sets

5.Choose a Model
  - Linear Regression, Decision Tree, SVM, etc., depending on the problem

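6.Train the Model
  - Use training data to fit the model

7.Evaluate the Model
  - Measure performance on the test set and refine the features or the model if needed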
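
The steps above can be sketched end to end on a small built-in dataset. This is only one possible arrangement: the Iris dataset, the choice of `LogisticRegression`, and the `max_iter` value are assumptions made for the example, and the data collection and cleaning steps are skipped because the dataset is already clean.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Problem and data: a classification task on a built-in dataset (stand-in for real data)
X, y = load_iris(return_X_y=True)

# Preprocess: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Choose and train: scale the features, then fit a classifier on the training rows only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Check performance on rows the model has not seen
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into training.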

## 11.Why do we have to perform EDA before fitting a model to the data?

### Exploratory Data Analysis (EDA) is the process of exploring and understanding the dataset using statistics and visualizations before applying machine learning models.

1.Understand the Data
  - Know what features (columns) exist
  - Check data types (numerical, categorical)

2.Find Missing or Incorrect Data
  - Identify and handle missing values, outliers, or wrong entries

3.Detect Patterns and Relationships
  - Spot useful trends and correlations
  - Helps in selecting meaningful features

4.Feature Selection & Engineering
  - Decide which features to keep, combine, or transform

5.Choose the Right Model
  - Based on the type of data (e.g., linear vs. non-linear patterns)

6.Improve Model Accuracy
  - Clean, well-understood data leads to better training and predictions
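
A few of these checks can be sketched with pandas as follows. The DataFrame is made up for illustration; a real project would load its own data instead.

```python
import pandas as pd

# Made-up data with one missing salary and one non-numeric column
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [40000.0, 50000.0, None, 70000.0, 82000.0],
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})

print(df.dtypes)                  # data type of each column
missing_counts = df.isna().sum()  # missing values per column
print(missing_counts)
print(df.describe())              # summary statistics for numeric columns
print(df["City"].value_counts())  # frequency of each category

# Correlations between numeric columns only
numeric_corr = df.select_dtypes("number").corr()
print(numeric_corr)
```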

## 12.What is correlation?

### Correlation is a statistical measure that shows the relationship between two variables — how they change together.

🔁 Types of Correlation:
- Positive correlation:
  As one variable increases, the other also increases.
  👉 Example: Height and weight

- Negative correlation:
  As one variable increases, the other decreases.
  👉 Example: Speed and travel time

- Zero correlation:
  No linear relationship between the variables.
  👉 Example: Shoe size and intelligence

### Range of the correlation coefficient (r):
   -1 ≤ r ≤ +1

## 13.What does negative correlation mean?

### Negative correlation means that when one variable increases, the other decreases — they move in opposite directions.

### 🔢 Correlation Coefficient Range:
- Negative correlation values range from 0 to -1:
  - -1 → perfect negative correlation
  - -0.5 → moderate negative correlation
  - 0 → no correlation
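
As a small worked example, the sketch below computes the Pearson correlation coefficient by hand for speed and travel time over a fixed distance. The helper name `pearson_r` and the 120 km distance are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Time to cover a fixed 120 km at different speeds (illustrative numbers)
speed_kmh = [40, 60, 80, 100, 120]
travel_time_h = [120 / s for s in speed_kmh]

r = pearson_r(speed_kmh, travel_time_h)
print(f"r = {r:.3f}")  # negative: time falls as speed rises
```

The result is strongly negative but not exactly -1, because travel time falls with speed along a curve rather than a straight line.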

## 14.How can you find correlation between variables in python?

```python
import pandas as pd

# Sample DataFrame
data = {
    'Age': [25, 30, 35, 40],
    'Salary': [40000, 50000, 60000, 70000],
    'Experience': [2, 5, 7, 10]
}

df = pd.DataFrame(data)

# Correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
```

Output:

```
                 Age    Salary  Experience
Age         1.000000  1.000000    0.997054
Salary      1.000000  1.000000    0.997054
Experience  0.997054  0.997054    1.000000
```


## 15.What is causation? Explain the difference between correlation and causation with an example.

| Aspect    | **Correlation**         | **Causation**                           |
| --------- | ----------------------- | --------------------------------------- |
| Meaning   | Variables move together | One variable directly affects the other |
| Direction | No direction implied    | Has a clear cause-effect direction      |
| Proof     | Easy to calculate       | Harder to prove                         |


### 🧠 Example:
✅ Correlation (but not causation):
  - Ice cream sales and drowning incidents both increase in summer.
  - They are positively correlated, but ice cream doesn't cause drowning.

✅ Causation:
  - Smoking causes lung disease.
  - Proven by experiments and studies — it's a cause-effect relationship.

## 16.What is an optimizer? What are different types of optimizers? Explain each with an example.

### An optimizer is an algorithm that adjusts the model’s parameters (like weights and biases) during training to minimize the loss function.

### Types of optimizers.

### 1. Gradient Descent (GD)
- Basic optimizer that uses the full dataset to compute gradients.
- Updates parameters in the opposite direction of the gradient of the loss.

### 2. Stochastic Gradient Descent (SGD)
- Uses only one data point at a time to update parameters.
- Each update is cheap to compute, but the updates are noisy (the loss fluctuates more).

### 3. Mini-Batch Gradient Descent
- Compromise between GD and SGD.
- Uses a small batch of data (e.g., 32 or 64 samples) at a time.
- Most commonly used in practice.

### 4. Adam (Adaptive Moment Estimation)
- Combines ideas from Momentum and RMSprop.
- Maintains moving averages of the gradients and of their squares, and uses them to adapt the step size for each parameter.
- Usually converges quickly with little tuning and handles sparse gradients well.
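
The first three optimizers differ only in how many samples feed each update, which the following minimal sketch tries to show for a one-feature linear model. The function name `train_linear`, the learning rate, and the epoch count are assumptions chosen for this toy problem. Adam is not implemented here.

```python
import random

def train_linear(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by gradient descent on mean squared error.

    batch_size = len(xs) -> full-batch gradient descent
    batch_size = 1       -> stochastic gradient descent
    anything in between  -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of MSE with respect to w and b on this batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Made-up data that follows y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_gd, b_gd = train_linear(xs, ys, batch_size=len(xs))  # Gradient Descent
w_sgd, b_sgd = train_linear(xs, ys, batch_size=1)      # Stochastic Gradient Descent
w_mb, b_mb = train_linear(xs, ys, batch_size=2)        # Mini-Batch Gradient Descent
print(w_gd, b_gd)
print(w_sgd, b_sgd)
print(w_mb, b_mb)
```

On this noise-free data all three variants should end up close to w = 2 and b = 1. On real, noisy data the stochastic variants would keep fluctuating around the minimum unless the learning rate is reduced over time.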

## 17.What is sklearn.linear_model?

### sklearn.linear_model is a module in Scikit-learn that provides linear models for both regression and classification tasks.

### 🧠 What are Linear Models?
Linear models make predictions using a linear relationship between input features and the target variable.

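## 18.What does model.fit() do? What arguments must be given?

model.fit() is the function that trains your machine learning model using the given data.

It learns patterns from the training data by adjusting the model’s parameters (like weights).

For supervised models, two arguments must be given: the training features (X_train, a 2D array with one row per sample) and the matching target values (y_train). Unsupervised estimators such as clustering models need only the features.

```python
# model is any scikit-learn estimator; X_train and y_train come from train_test_split
model.fit(X_train, y_train)
```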


## 19.What does model.predict() do? What arguments must be given?

### model.predict() is used to make predictions using a trained machine learning model.

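### It takes only the input features (for example X_test, with the same columns that were used in fit()) and returns the model’s predicted outputs. No target values are passed.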
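
A minimal sketch of the fit-then-predict pattern, using `LogisticRegression` from `sklearn.linear_model`. The hours-studied data is made up for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied (feature) and pass/fail (label)
X_train = [[1], [2], [3], [4], [6], [7], [8], [9]]  # 2D: one row per sample
y_train = [0, 0, 0, 0, 1, 1, 1, 1]                  # 1D: one label per sample

model = LogisticRegression()
model.fit(X_train, y_train)  # fit needs features and labels

X_new = [[0], [10]]          # predict needs features only
predictions = model.predict(X_new)
print(predictions)
```

The returned array has one prediction per row of `X_new`.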

## 20.What are continuous and categorical variables?

🔢 Continuous Variables:
- Can take any numeric value within a range (including decimals).
- They are measurable quantities.

📌 Examples:
- Age (e.g., 25.6 years)
- Temperature (e.g., 37.2°C)
- Salary (e.g., ₹45,000.50)

🔠 Categorical Variables:
- Represent groups or categories.
- Can be labels or names, not numeric by nature.
- Often need to be encoded before using in ML models.

📌 Examples:
- Gender (Male, Female)
- Color (Red, Blue, Green)
- Country (India, USA, Japan)

## 21.What is feature scaling? How does it help in machine learning?

### Feature Scaling is the process of normalizing or standardizing the range of features (input variables) in your dataset.

🚀 How It Helps in ML:
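- Speeds up convergence for models trained with gradient descent, since all features contribute gradients on a similar scale.
- Ensures fair treatment of all features in distance-based algorithms such as k-NN and SVM.
- Prevents bias toward features that merely have large numeric values (e.g., salary in rupees vs. age in years).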

## 22.How do we perform scaling in Python?

### 🛠️ 1. Standardization (Mean = 0, Std = 1)
### 🛠️ 2. Normalization (Scale between 0 and 1)
### 🛠️ 3. Robust Scaling (Uses median and IQR, good for outliers)
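
The three approaches listed above map onto three scikit-learn classes. The sketch below applies each to a single made-up feature that contains one outlier; the values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (made-up values)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(X)  # mean 0, std 1
normalized = MinMaxScaler().fit_transform(X)      # min -> 0, max -> 1
robust = RobustScaler().fit_transform(X)          # (x - median) / IQR

print(standardized.ravel())
print(normalized.ravel())
print(robust.ravel())
```

In a real workflow the scaler is usually fitted on the training set only (`fit` or `fit_transform`) and then applied to the test set with `transform`, so that test data does not influence the scaling.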

## 23. What is sklearn.preprocessing?

### sklearn.preprocessing is a module in Scikit-learn that provides tools to prepare and transform raw data before training a machine learning model.

### 🧰 Common Functions in sklearn.preprocessing:

| Function/Class         | What It Does                                   | Example Use                    |
| ---------------------- | ---------------------------------------------- | ------------------------------ |
| `StandardScaler()`     | Standardizes features (mean = 0, std = 1)      | For algorithms like SVM, LR    |
| `MinMaxScaler()`       | Scales features between 0 and 1                | For neural networks            |
| `LabelEncoder()`       | Converts labels (target) to numbers            | For classification tasks       |
| `OneHotEncoder()`      | Converts categorical variables to binary       | For categorical input features |
| `Binarizer()`          | Converts values to 0 or 1                      | For binary thresholding        |
| `PolynomialFeatures()` | Creates polynomial and interaction terms       | For non-linear relationships   |
| `RobustScaler()`       | Scales using median and IQR (handles outliers) | When data has outliers         |


## 24.How do we split data for model fitting (training and testing) in Python?

### We use train_test_split() from sklearn.model_selection to divide the data into training and testing sets.

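```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X = features, y = target (a built-in dataset is used here as a placeholder)
X, y = load_iris(return_X_y=True)

# test_size=0.2 keeps 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```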


## 25.Explain data encoding?

### Data encoding is the process of converting categorical (non-numeric) data into numeric format so that it can be used by machine learning models.

### 🎯 Why Encoding is Needed:
- Most ML algorithms operate on numbers, so categorical data like ["Red", "Green", "Blue"] can’t be processed directly.
- Encoding converts these into numbers or vectors that models can understand.