# **Feature Engineering**

1. **What is a parameter?**

    A **parameter** is a **variable that the learning algorithm learns automatically from the training data**.  
    These values define how a model makes predictions and are adjusted during the **training process** to minimize error or loss.

    In simple terms:  
    > **Parameters are the internal configuration values of a model that are learned during training.**

    **Examples**

    | Model Type | Example Parameters | Description |
    |-------------|-------------------|--------------|
    | **Linear Regression** | Weights (β₁, β₂, …) and intercept (β₀) | Define the slope and position of the regression line |
    | **Logistic Regression** | Coefficients and bias | Define how input features affect the probability of a class |
    | **Neural Networks** | Weights and biases of neurons | Adjusted during backpropagation to minimize loss |
    | **Decision Trees** | Split thresholds | Determine how data is divided at each node |


    **Example (Linear Regression)**

    For a simple linear model:

    $$
    y = \beta_0 + \beta_1 x + \varepsilon
    $$

    - $\beta_0$ and $\beta_1$ are **parameters** learned from the data.  
    - $\varepsilon$ is the **error term** (not a parameter, just noise).
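
As a rough sketch of parameters being learned, the snippet below fits a straight line to synthetic data with NumPy. The data, the noise level, and the "true" values (intercept 3, slope 2) are assumptions made up for illustration.

```python
import numpy as np

# Synthetic data from y = 3 + 2x + noise (values chosen for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept]
beta_1, beta_0 = np.polyfit(x, y, deg=1)
print(f"learned intercept: {beta_0:.2f}, learned slope: {beta_1:.2f}")
```

The learned values should land close to 3 and 2, but not exactly on them, because the noise term is not something the model can recover.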

2. **What is correlation? What does negative correlation mean?**
   
   **Correlation** is a statistical measure that describes the **strength and direction of a relationship between two variables**.
   
   It tells us **how changes in one variable are associated with changes in another**.
   
   The correlation coefficient (usually denoted as **r**) ranges from **-1 to +1**.
   
   **Types of Correlation**
   
  | Type | Range of r | Meaning | Example |
  |------|-------------|----------|----------|
  | **Positive Correlation** | `0 < r ≤ +1` | As one variable increases, the other also increases. | Height vs. Weight |
  | **Negative Correlation** | `-1 ≤ r < 0` | As one variable increases, the other decreases. | Price vs. Demand |
  | **No Correlation** | `r ≈ 0` | No linear relationship between variables. | Shoe size vs. IQ |
   
   **What Does Negative Correlation Mean?**
   
   A **negative correlation** means that:
   > When one variable increases, the other tends to decrease.

   
   It represents an **inverse relationship**.
   
   **Example:**
   - As the **price** of a product increases, the **demand** usually decreases.  
     → This is a **negative correlation**.
   
   If we calculate and get `r = -0.85`, it means:
   - The relationship is **strong** (since |r| is close to 1)
   - The direction is **negative** (one goes up, the other goes down)
   
   * * *
   
   **Formula (Pearson Correlation Coefficient)**
   
   $$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
$$
   
   Where:
   - $ x_i, y_i $ → individual data points  
   - $ \bar{x}, \bar{y} $ → mean of each variable
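
The formula can be checked term by term with a short sketch. The helper name `pearson_r` and the price/demand numbers are hypothetical, chosen only to illustrate a negative correlation.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed directly from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the mean of x
    dy = y - y.mean()          # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

price = [10, 12, 14, 16, 18]     # made-up prices
demand = [95, 90, 80, 72, 60]    # made-up demand

r = pearson_r(price, demand)
print(round(r, 3))
```

The result is strongly negative and agrees with `np.corrcoef(price, demand)[0, 1]`.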

3. **Define Machine Learning. What are the main components in Machine Learning?**


  **Machine Learning (ML)** is a branch of **Artificial Intelligence (AI)** that enables systems to **learn automatically from data** and **improve their performance** without being explicitly programmed.

  In simple terms:

  > Machine Learning is about teaching computers to **find patterns** in data and **make decisions or predictions** based on those patterns.

  ---

  **Example**

  When you use:
  - **Netflix** → recommends movies you might like  
  - **Gmail** → filters spam emails  
  - **Amazon** → suggests products based on your browsing history  

  All these systems use **machine learning models** that have **learned** from past data.

  ---

  **Main Components of Machine Learning**

  Machine Learning systems are built using several key components:

  1. **Data**
  - The **foundation** of machine learning.
  - Represents past observations, measurements, or examples.
  - Can be:
    - **Labeled data** (used in **supervised learning**)
    - **Unlabeled data** (used in **unsupervised learning**)
    - **Partially labeled data**, a mix of labeled and unlabeled examples (used in **semi-supervised learning**)

  Example: A dataset of customer purchases, medical records, or images.

  ---

  2. **Model**
  - The **mathematical representation** that learns patterns from data.
  - The model tries to **map inputs to outputs**.

  Example:
  - Linear Regression model: $ y = w_1x + b $
  - Neural Network model: multiple layers of weights and biases.

  ---

  3. **Parameters**
  - **Internal variables** of the model that are learned from training data.
  - Adjusted to minimize prediction error.

  Example:
  - In linear regression: slope (**w₁**) and intercept (**b**) are parameters.

  ---

  4. **Learning (or Training) Algorithm**
  - The process that updates model parameters using training data.
  - The goal is to **reduce the loss/error function**.

  Example:
  - Gradient Descent — adjusts weights step-by-step to minimize loss.

  ---

  5. **Loss Function (or Cost Function)**
  - Measures how well or poorly the model performs.
  - Quantifies the difference between **predicted output** and **actual output**.

  Example:
  $$
  \text{MSE (Mean Squared Error)} = \frac{1}{n} \sum (y_{pred} - y_{true})^2
  $$

  ---

  6. **Evaluation Metrics**
  - Used to assess the model’s performance on unseen (test) data.

  Example:
  - Accuracy, Precision, Recall, F1-Score, RMSE, etc.

  ---

  7. **Prediction (or Inference)**
  - The final step where the trained model is used to make predictions on new, unseen data.

  Example:
  - Predicting house prices based on size and location.

  ---


  | Component | Description | Example |
  |------------|--------------|----------|
  | **Data** | Information used for learning | Customer purchase records |
  | **Model** | Mathematical structure | Linear regression line |
  | **Parameters** | Learnable weights | Coefficients in regression |
  | **Learning Algorithm** | Method to update parameters | Gradient descent |
  | **Loss Function** | Measures prediction error | Mean squared error |
  | **Evaluation Metrics** | Performance indicators | Accuracy, F1-score |
  | **Prediction** | Output on new data | Predicting future sales |

4. **How does loss value help in determining whether the model is good or not?**

In Machine Learning, the **loss value** (also called **cost** or **error**) is a numerical measure of **how well or poorly a model’s predictions match the actual data**.

It summarizes the **difference between the predicted values and the true values** on whichever dataset it is computed on (training, validation, or test).

A **smaller loss value** indicates that the model’s predictions are closer to the true outputs, while a **larger loss value** means the model is making bigger errors.

---

**Purpose of the Loss Function**

The **loss function** helps guide the learning process of a model.  
During training, the learning algorithm (like gradient descent) tries to **minimize this loss** by adjusting the model’s parameters (weights and biases).

This process continues until the loss value stops decreasing or reaches a small enough value, indicating that the model fits the training data well. Whether it also generalizes has to be checked on validation or test data.

---

**Example**

For a regression problem, one common loss function is the **Mean Squared Error (MSE):**

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred},i} - y_{\text{true},i})^2
$$

- $ y_{\text{pred},i} $: predicted value  
- $ y_{\text{true},i} $: actual value  
- $ n $: number of data points

A lower MSE means predictions are closer to the actual values.
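
A small worked example of the MSE formula, using made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values (made up)
y_pred = np.array([2.5, 5.0, 8.0, 9.0])   # model predictions (made up)

# Squared errors are 0.25, 0, 1, 0; their mean is 1.25 / 4
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # 0.3125
```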

---

**How the Loss Value Indicates Model Quality**

| Loss Value | Interpretation |
|-------------|----------------|
| **High loss** | The model is performing poorly; predictions are far from the true values. |
| **Low loss** | The model is performing well; predictions are close to the true values. |
| **Constant or increasing loss** | The model may not be learning, possibly due to poor model design, wrong learning rate, or noisy data. |
| **Very low loss but poor test accuracy** | The model may be overfitting — it has learned the training data too well but generalizes poorly. |

---

**Example in Practice**

Suppose a regression model has:
- **Training loss = 0.02**
- **Validation loss = 0.03**

This indicates the model performs well both on training and unseen data.

However, if:
- **Training loss = 0.01**
- **Validation loss = 0.25**

Then the model likely **overfits**, meaning it performs well on training data but poorly on new data.
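
The gap between training and validation loss can be reproduced with a small sketch. The synthetic sine data and the choice of an unrestricted `DecisionTreeRegressor` are assumptions for illustration, not part of the example above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)   # noisy target

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unrestricted tree can memorize the training points, noise included
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_loss = mean_squared_error(y_train, model.predict(X_train))
val_loss = mean_squared_error(y_val, model.predict(X_val))
print(f"training loss = {train_loss:.3f}, validation loss = {val_loss:.3f}")
```

The training loss is essentially zero while the validation loss is clearly larger, which is the overfitting pattern described above. Limiting the tree (for example with `max_depth`) would typically narrow the gap.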

5. **What are continuous and categorical variables?**

In Machine Learning and Statistics, variables represent the features or attributes of data that we analyze or use to train models.  
They can broadly be classified into two main types: **Continuous** and **Categorical** variables.

---

1. Continuous Variables:
A **continuous variable** is a numerical variable that can take **any value within a given range**.  
The values are **measurable** and can include fractions or decimals.

Examples:
- Height of a person (e.g., 170.5 cm)
- Weight of an object (e.g., 65.3 kg)
- Temperature (e.g., 36.6°C)
- Time taken to complete a task (e.g., 12.45 seconds)

Characteristics:
- Infinite possible values within a range.  
- Represented using **real numbers**.  
- Operations like addition, subtraction, and averaging are meaningful.  

Example in Data:
| Person | Height (cm) | Weight (kg) |
|---------|--------------|-------------|
| A | 172.5 | 68.4 |
| B | 160.2 | 55.7 |

---

2. Categorical Variables:
A **categorical variable** (also called **qualitative variable**) represents data that can be **divided into categories or groups**.  
The values are **labels or names**, not numbers with mathematical meaning.

Types of Categorical Variables:
1. **Nominal Variables** – Categories with **no inherent order**  
   - Examples: Gender (Male/Female), Color (Red/Blue/Green), Country (India/USA/UK)
2. **Ordinal Variables** – Categories with a **specific order or ranking**  
   - Examples: Education Level (High School < Bachelor’s < Master’s < PhD), Customer Satisfaction (Low < Medium < High)

Example in Data:
| Student | Gender | Education Level |
|----------|---------|----------------|
| X | Male | Bachelor’s |
| Y | Female | Master’s |

6. **How do we handle categorical variables in Machine Learning? What are the common techniques?**

Categorical variables represent discrete values or categories, such as "red", "blue", "green", or "male", "female". Most machine learning algorithms require numerical input, so categorical variables must be converted into a numerical format.  

**Common Techniques:**  

**1. Label Encoding**  
- Assigns a unique integer to each category.  
- Example: `Red → 0, Blue → 1, Green → 2`  
- Suitable for ordinal variables where the order matters (e.g., `Low → 0, Medium → 1, High → 2`).  
- **Caution:** For nominal data, this may introduce an unintended ordinal relationship.  

**2. One-Hot Encoding**  
- Converts each category into a new binary column (0 or 1).  
- Example for color:  

| Red | Blue | Green |
|-----|------|-------|
| 1   | 0    | 0     |
| 0   | 1    | 0     |
| 0   | 0    | 1     |

- Commonly used for nominal variables.  
- Increases dimensionality for high-cardinality features.  
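
One way to produce such columns is `pandas.get_dummies`. This is a minimal sketch on a made-up column.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# One binary column per category; columns come out in sorted order
one_hot = pd.get_dummies(df["Color"], dtype=int)
print(one_hot)
```

Each row contains exactly one `1`. The columns appear as `Blue, Green, Red` rather than in the order of the table above.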

**3. Ordinal Encoding**  
- Similar to label encoding but preserves meaningful order in categories.  
- Example:

| Size  | Encoded |
|-------|---------|
| Small | 1       |
| Medium| 2       |
| Large | 3       |

- Useful for features with natural ordering.  
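
A sketch with scikit-learn's `OrdinalEncoder`, assuming the order Small < Medium < Large. Note that its codes start at 0, not at 1 as in the table above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["Medium"], ["Small"], ["Large"], ["Small"]])

# Pass the category order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())  # [1. 0. 2. 0.]
```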

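**4. Binary Encoding**  
- Assigns each category an integer, writes that integer in binary, and stores each bit in its own column.  
- Example for colors (`Red=0, Blue=1, Green=2`), which needs two bit columns:  

| Color | Integer | Bit 1 | Bit 0 |
|-------|---------|-------|-------|
| Red   | 0       | 0     | 0     |
| Blue  | 1       | 0     | 1     |
| Green | 2       | 1     | 0     |

- Reduces dimensionality compared to one-hot encoding for high-cardinality features: `n` categories need about `log2(n)` columns instead of `n`.  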

**5. Frequency / Count Encoding**  
- Replaces each category with its frequency or count in the dataset.  
- Example:  

| Color | Frequency |
|-------|-----------|
| Red   | 50        |
| Blue  | 30        |
| Green | 20        |

- Useful for tree-based algorithms.  
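
A minimal pandas sketch of count and frequency encoding on a made-up column:

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Red", "Green", "Red", "Blue"])

counts = colors.value_counts()                    # Red: 3, Blue: 2, Green: 1
count_encoded = colors.map(counts)                # replace each category by its count
freq_encoded = colors.map(counts / len(colors))   # or by its relative frequency
print(count_encoded.tolist())  # [3, 2, 3, 1, 3, 2]
```

Categories that happen to have the same count become indistinguishable after this encoding.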

**6. Target Encoding (Mean Encoding)**  
- Replaces categories with the mean of the target variable for each category.  
- Example (predicting sales):  

| Color | Average Sales |
|-------|---------------|
| Red   | 250           |
| Blue  | 180           |
| Green | 200           |

- Powerful but prone to overfitting; usually requires cross-validation or smoothing.  
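
The sketch below shows one common form of smoothing, in which each category mean is blended with the global mean. The data and the smoothing strength `m` are hypothetical values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green"],
    "Sales": [260, 240, 170, 190, 200],
})

global_mean = df["Sales"].mean()                          # 212.0
stats = df.groupby("Color")["Sales"].agg(["mean", "count"])

# Smoothed mean: categories with few rows are pulled toward the global mean.
# m is a hypothetical smoothing strength chosen for illustration.
m = 2
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["Color_encoded"] = df["Color"].map(smoothed)
print(df)
```

In a real workflow the category statistics would be computed on the training folds only, so that the target values of the rows being encoded do not leak into their own feature.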

**7. Embedding Layers (for Deep Learning)**  
- Represent categories as dense vectors in a lower-dimensional space.  
- Example:  

| Color | Embedding Vector      |
|-------|---------------------|
| Red   | [0.1, 0.3, 0.7]     |
| Blue  | [0.2, 0.6, 0.4]     |
| Green | [0.9, 0.1, 0.5]     |

- Learns relationships between categories during training.  

**8. Hashing Encoding**  
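- Maps categories to a fixed number of buckets using a hash function, so no category-to-index dictionary has to be stored.  
- Different categories can collide in the same bucket, especially when there are few buckets.  
- Example (3 buckets; the assignments below are illustrative, since the actual bucket depends on the hash function):  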

| Color | Hash Bucket |
|-------|-------------|
| Red   | 0           |
| Blue  | 1           |
| Green | 2           |
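
A minimal sketch of the idea using Python's `hashlib`. The helper `hash_bucket` and the choice of MD5 are assumptions for illustration. Libraries such as scikit-learn use their own hash functions.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 3) -> int:
    """Map a category to a bucket with a stable hash (MD5, chosen for illustration)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

colors = ["Red", "Blue", "Green", "Purple"]
buckets = {color: hash_bucket(color) for color in colors}
print(buckets)
```

With four categories and three buckets, at least two categories must share a bucket, which is the collision trade-off of this technique.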


7. **What do you mean by training and testing a dataset?**

In Machine Learning, we use datasets to teach the model patterns and then evaluate its performance. To do this effectively, we split the data into two main parts: **training** and **testing**.  

**1. Training a Dataset**  
- **Definition:** Training a dataset means using a portion of your data to **teach the model** the relationships, patterns, and features in the data.  
- During training, the model learns from the input data (features) and the correct outputs (labels) to minimize errors.  
- Example: If you are predicting house prices, the model sees features like size, location, and age, and learns to predict the price.  
- The process involves adjusting the model’s parameters (weights in neural networks) based on the training data.  

**2. Testing a Dataset**  
- **Definition:** Testing a dataset means using a separate portion of your data to **evaluate how well the model performs** on unseen data.  
- The testing data is **not shown to the model during training**, which helps check if the model can generalize beyond what it learned.  
- Example: You give the model new house features and check how close its predicted prices are to the actual prices.  

**Why We Split Data**  
- Detect **overfitting:** If you train and test on the same data, a model that has merely memorized the training examples still looks accurate, and its failure on new data goes unnoticed.  
- Evaluate **generalization:** Testing ensures the model works well on unseen, real-world data.  

**Typical Split Ratios**  
- 70% training, 30% testing (common)  
- 80% training, 20% testing (for larger datasets)  
- Sometimes, a third **validation set** is used to fine-tune hyperparameters:  
  - 60% training, 20% validation, 20% testing  
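
A 60/20/20 split can be sketched with two calls to `train_test_split`. The toy data below is made up.

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First hold out 20% as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```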
  

8. **What is sklearn.preprocessing?**

`sklearn.preprocessing` is a **module in the Scikit-learn library** in Python that provides **tools to prepare your data** before feeding it into a machine learning model. Preprocessing helps improve model performance, speed up training, and ensure features are on comparable scales.  

**Key Purposes of sklearn.preprocessing:**  
- Scale numerical features  
- Encode categorical features  
- Work alongside `sklearn.impute` (e.g., `SimpleImputer`), the separate module that handles missing values  
- Transform features into a format suitable for machine learning algorithms  

**Common Classes and Functions in sklearn.preprocessing:**  

**1. StandardScaler**  
- Standardizes features by removing the mean and scaling to unit variance.  
- Formula: `z = (x - mean) / std`  
- Useful for algorithms like SVM, KNN, or logistic regression.  

**2. MinMaxScaler**  
- Scales features to a fixed range, usually [0, 1].  
- Formula: `x_scaled = (x - min) / (max - min)`  
- Useful when features have different units.  
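
The two formulas above can be checked against scikit-learn on a tiny made-up feature. The scalers are fitted on the training data only and then applied to new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[6.0]])

standard = StandardScaler().fit(X_train)   # learns mean and std from training data
minmax = MinMaxScaler().fit(X_train)       # learns min and max from training data

z = standard.transform(X_train).ravel()    # (x - 3) / sqrt(2)
scaled = minmax.transform(X_train).ravel() # 0, 0.25, 0.5, 0.75, 1
scaled_test = minmax.transform(X_test).ravel()

print(z)
print(scaled)
print(scaled_test)  # [1.25]: unseen values can fall outside [0, 1]
```

`StandardScaler` uses the population standard deviation (dividing by `n`), which is why the divisor here is `sqrt(2)`.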

**3. RobustScaler**  
- Scales features using the median and interquartile range (IQR).  
- Less sensitive to outliers compared to StandardScaler.  

**4. Normalizer**  
- Scales individual samples to have unit norm (length = 1).  
- Often used for text classification or clustering.  

**5. OneHotEncoder**  
- Converts categorical variables into a one-hot numeric array.  
- Example:  

| Color | Red | Blue | Green |
|-------|-----|------|-------|
| Red   | 1   | 0    | 0     |
| Blue  | 0   | 1    | 0     |
| Green | 0   | 0    | 1     |

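**6. LabelEncoder**  
- Converts target labels (`y`) into integers from `0` to `n_classes - 1`; for input features, `OrdinalEncoder` or `OneHotEncoder` is the intended tool.  
- Classes are numbered in sorted order, so for example: `Blue → 0, Green → 1, Red → 2`  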

**7. PolynomialFeatures**  
- Generates polynomial and interaction features from existing features.  
- Useful for linear models to capture non-linear relationships.  
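
A minimal sketch of what degree-2 expansion produces for a single made-up sample with two features `a` and `b`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])   # one sample with features a=2, b=3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly)   # columns: 1, a, b, a^2, a*b, b^2 -> [[1. 2. 3. 4. 6. 9.]]
```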

**8. FunctionTransformer**  
- Apply a custom function to transform your data.  

**Summary:**  
`sklearn.preprocessing` is essential for preparing your dataset—scaling, normalizing, encoding, or transforming—so your machine learning models work efficiently and accurately.


9. **What is a Test set?**

A **test set** is a **subset of your dataset** that is used to **evaluate the performance of a trained machine learning model**. It is **separate from the training set**, which is used to teach the model.  

**Key Points about a Test Set:**  

**1. Purpose**  
- To measure how well the model **generalizes** to new, unseen data.  
- Helps detect **overfitting**, where the model performs well on training data but poorly on unseen data.  

**2. How it is Created**  
- Typically, the dataset is split into:  
  - **Training set:** 70–80% of the data  
  - **Test set:** 20–30% of the data  
  - Optionally, a **validation set** can be used for hyperparameter tuning.  

**3. Characteristics**  
- The test set **must not be used during training**.  
- Should be **representative of real-world data** to ensure meaningful evaluation.  

**4. Evaluation Metrics**  
- For regression: Mean Squared Error (MSE), R² score  
- For classification: Accuracy, Precision, Recall, F1-score  

**5. Example**  

Suppose we have 1000 data points:  

| Dataset Split | Number of Samples |
|---------------|-----------------|
| Training Set  | 800             |
| Test Set      | 200             |

- Model learns patterns from the **training set (800 samples)**.  
- Model is then tested on the **test set (200 samples)** to check performance on unseen data.

10. **How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?**

In Machine Learning, we **split the dataset** into a **training set** (to train the model) and a **test set** (to evaluate the model). This is commonly done using `train_test_split` from `sklearn.model_selection`.  

**1. Using train_test_split**  

```python
from sklearn.model_selection import train_test_split

# Example dataset
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]  # Features
y = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]          # Labels

# Split into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training Features:", X_train)
print("Testing Features:", X_test)
```
**Parameters:**
* `test_size`: Proportion of the dataset to include in the test split (e.g., 0.2 = 20%)
* `train_size`: Optional, proportion for the training set
* `random_state`: Seed for the shuffle, so the split is reproducible
* `shuffle`: Whether to shuffle the data before splitting (default is `True`)

**Approaching a Machine Learning Problem**
1. Define the Problem
- Understand the goal (e.g., classification, regression, clustering)
- Identify inputs (features) and outputs (target)

2. Collect and Explore Data
- Gather dataset from sources
- Perform Exploratory Data Analysis (EDA):
    - Check distributions, missing values, outliers
    - Visualize relationships between features and target

3. Preprocess Data
- Handle missing values
- Encode categorical variables (LabelEncoder, OneHotEncoder)
- Scale or normalize features (StandardScaler, MinMaxScaler)

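4. Split Data
- Divide dataset into training and testing sets (and optionally validation set)
- Fit the scalers and encoders from step 3 on the training set only, then apply them to the other sets, so that no information from the test set leaks into training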

5. Choose a Model
- Select a model based on the problem:
    - Regression: Linear Regression, Random Forest, XGBoost
    - Classification: Logistic Regression, SVM, Decision Tree
    - Clustering: K-Means, DBSCAN

6. Train the Model
- Fit the model using the training set

7. Evaluate the Model
- Use the test set to measure performance
- Metrics examples:
    - Regression: Mean Squared Error, R² score
    - Classification: Accuracy, Precision, Recall, F1-score

8. Tune Hyperparameters
- Use cross-validation or grid search (GridSearchCV) to improve performance

9. Deploy or Interpret Results
- Apply model to new data or integrate into applications
- Interpret feature importance or model behavior if needed
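
The steps above can be sketched end to end as follows. The choice of the built-in iris dataset, logistic regression, the `C` grid, and `stratify=y` are assumptions for illustration. A real project would substitute its own data and models.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a small built-in classification dataset
X, y = load_iris(return_X_y=True)

# Step 4: split first so that preprocessing is fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: preprocessing and model chained in one pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 6 and 8: train with a cross-validated search over regularization strength
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 7: evaluate once on the held-out test set
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```

Putting the scaler inside the pipeline means each cross-validation fold refits it on its own training portion, which avoids leakage during tuning.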

11. **Why do we have to perform EDA before fitting a model to the data?**

**Why Perform Exploratory Data Analysis (EDA) Before Fitting a Model**  

Exploratory Data Analysis (EDA) is a critical step in any machine learning workflow. It involves examining and visualizing your dataset to understand its main characteristics before building a model.  

**Reasons for Performing EDA:**  

**1. Understanding Data Distribution**  
- Helps identify how features are distributed (normal, skewed, uniform, etc.).  
- Important for choosing the right algorithms and preprocessing steps.  

**2. Detecting Missing Values**  
- Missing or null values can cause errors during model training.  
- EDA helps identify missing data and decide whether to impute, drop, or fill values.  

**3. Identifying Outliers**  
- Outliers can distort model predictions, especially in regression.  
- EDA visualizations (boxplots, scatterplots) help detect outliers and handle them appropriately.  

**4. Understanding Feature Relationships**  
- Correlation analysis shows relationships between features and with the target variable.  
- Helps in feature selection and avoiding multicollinearity.  

**5. Detecting Data Imbalances**  
- In classification problems, EDA helps check if classes are imbalanced.  
- Imbalanced data may require resampling techniques like SMOTE or class weighting.  

**6. Guiding Preprocessing Steps**  
- Scaling, normalization, encoding, and feature engineering decisions are informed by EDA.  

**7. Informing Model Choice**  
- Based on data type, distribution, and relationships, you can choose models likely to perform well.  

**Note**
EDA ensures you **understand your data** thoroughly, catch potential issues early, and make informed decisions about preprocessing and model selection. Skipping EDA may lead to poor model performance, overfitting, or misleading results.  
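
A few of these checks can be sketched with pandas. The DataFrame below is made up so that it contains one missing value, one outlier, and imbalanced classes.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, one outlier, and imbalanced classes
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38, 29, 120],
    "income": [30, 42, 65, 70, 48, 55, 35, 60],
    "label": ["no", "no", "no", "yes", "no", "no", "no", "yes"],
})

print(df.describe())                                      # distributions at a glance
missing = df.isna().sum()                                 # missing values per column
class_share = df["label"].value_counts(normalize=True)    # class balance
correlations = df[["age", "income"]].corr()               # pairwise Pearson correlation

# Flag outliers in "age" with the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(missing, class_share, outliers, sep="\n\n")
```

Here the IQR rule flags only the row with `age = 120`. Whether such a value is an error or a genuine observation is a judgment that the plots and domain knowledge have to support.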


12. **What is correlation?**

**Correlation in Statistics and Machine Learning**  

**Definition:**  
Correlation is a statistical measure that describes the **strength and direction of a linear relationship** between two variables. It quantifies how changes in one variable are associated with changes in another.  

**Types of Correlation:**  

**1. Positive Correlation**  
- When one variable increases, the other also increases.  
- Example: Height and weight — generally, taller people weigh more.  
- Correlation coefficient (r) is between 0 and +1.  

**2. Negative Correlation**  
- When one variable increases, the other decreases.  
- Example: Number of hours spent watching TV and exam scores — more TV, lower scores.  
- Correlation coefficient (r) is between -1 and 0.  

**3. No Correlation**  
- No linear relationship exists between the variables.  
- Example: Shoe size and intelligence — usually independent.  
- Correlation coefficient (r) is around 0.  

**Correlation Coefficient (r):**  
- Quantifies correlation: ranges from **-1 to +1**.  
  - `r = 1` → perfect positive correlation  
  - `r = -1` → perfect negative correlation  
  - `r = 0` → no linear correlation  

**Why Correlation is Important:**  
- Helps **identify relationships** between features.  
- Useful for **feature selection** in machine learning (e.g., remove highly correlated features to avoid multicollinearity).  
- Guides understanding of **data patterns** before modeling.  

13. **What does negative correlation mean?**

**Negative Correlation**  

**Definition:**  
Negative correlation occurs when **one variable increases while the other decreases**. It represents an **inverse relationship** between two variables.  

**Key Points:**  
- The correlation coefficient (r) is **between -1 and 0**.  
  - `r = -1` → perfect negative correlation (exact inverse relationship)  
  - `r = 0` → no linear correlation  
- As one variable goes up, the other tends to go down.  

**Examples:**  
1. **Number of hours spent watching TV vs. exam scores**  
   - More TV → lower scores → negative correlation.  
2. **Temperature vs. heating bill**  
   - Higher temperature → lower heating bill → negative correlation.  

**Why It Matters in Machine Learning:**  
- Helps identify features that are inversely related to the target.  
- Can guide feature selection or transformation.  
- Important for understanding relationships between variables before modeling.  

**Visual Representation:**  

| Variable X | Variable Y |
|------------|------------|
| 1          | 10         |
| 2          | 8          |
| 3          | 6          |
| 4          | 4          |
| 5          | 2          |

- As X increases, Y decreases → negative correlation.  
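
For the table above, the coefficient can be computed directly. Because the points lie exactly on a decreasing line, the result is a perfect negative correlation.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 8, 6, 4, 2])   # y = 12 - 2x, an exact decreasing line

r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))  # -1.0
```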


14. **How can you find correlation between variables in Python?**

In Python, you can calculate correlation between variables using **Pandas** or **NumPy**. The most common method is the **Pearson correlation coefficient**, which measures linear relationships.  

**1. Using Pandas `corr()` Method**

```python
import pandas as pd

# Example dataset
data = {
    'Hours_Studied': [2, 4, 6, 8, 10],
    'Scores': [50, 60, 65, 80, 90],
    'Hours_TV': [10, 8, 6, 4, 2]
}

df = pd.DataFrame(data)

# Correlation between all variables
correlation_matrix = df.corr()
print(correlation_matrix)

# Correlation between two specific variables
corr_hours_scores = df['Hours_Studied'].corr(df['Scores'])
print("Correlation between Hours_Studied and Scores:", corr_hours_scores)
```

Output:

```
               Hours_Studied    Scores  Hours_TV
Hours_Studied       1.000000  0.990148 -1.000000
Scores              0.990148  1.000000 -0.990148
Hours_TV           -1.000000 -0.990148  1.000000
Correlation between Hours_Studied and Scores: 0.9901475429766743
```

**2. Using NumPy `corrcoef()` Function**

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([50, 60, 65, 80, 90])

# Calculate correlation coefficient matrix
corr_matrix = np.corrcoef(x, y)
print(corr_matrix)

# The correlation coefficient between x and y
corr_xy = corr_matrix[0, 1]
print("Correlation between x and y:", corr_xy)
```

Output:

```
[[1.         0.99014754]
 [0.99014754 1.        ]]
Correlation between x and y: 0.9901475429766743
```


15. **What is causation? Explain difference between correlation and causation with an example.**

Causation (or causal relationship) occurs when **a change in one variable directly causes a change in another variable**. It implies a **cause-and-effect relationship**.  

**Key Points of Causation:**  
- Causation is **stronger than correlation** because it shows that one event actually influences another.  
- Establishing causation usually requires controlled experiments or advanced statistical methods, not just observational data.  

---

**Correlation vs. Causation**  

| Aspect             | Correlation                                   | Causation                                   |
|-------------------|-----------------------------------------------|--------------------------------------------|
| Definition        | Measures the strength and direction of a relationship between two variables | One variable directly affects or causes a change in another |
| Nature            | Statistical association, can be positive, negative, or zero | Cause-and-effect relationship |
| Evidence Needed   | Observational data or statistical computation | Controlled experiments, intervention, or causal inference techniques |
| Interpretation    | Does not imply one variable causes the other | Implies direct impact of one variable on another |

**Example:**  
- **Correlation Example:**  
  - Ice cream sales and drowning incidents may be positively correlated (both increase in summer).  
  - But buying ice cream **does not cause** drowning.  
- **Causation Example:**  
  - Smoking **causes** lung cancer. Scientific studies have shown a direct cause-effect relationship.  

**Key Takeaway:**  
- **Correlation ≠ Causation.**  
- Correlation tells you variables move together; causation tells you one variable **directly influences** the other.  
- Confusing correlation with causation can lead to incorrect conclusions and poor decision-making.  
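
The ice cream example can be simulated under a hypothetical model in which temperature drives both variables and neither affects the other. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical model: temperature drives both quantities; neither affects the other
temperature = rng.normal(size=n)
ice_cream = temperature + 0.5 * rng.normal(size=n)
drownings = temperature + 0.5 * rng.normal(size=n)

raw_corr = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(v, z):
    """Part of v not explained by a straight-line fit on z."""
    slope, intercept = np.polyfit(z, v, deg=1)
    return v - (slope * z + intercept)

# Correlation after removing the effect of temperature from both variables
partial_corr = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"correlation after controlling for temperature: {partial_corr:.2f}")
```

The raw correlation is high, yet it nearly vanishes once the common cause is accounted for. This is the pattern a confounding variable produces.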



16. **What is an Optimizer? What are different types of optimizers? Explain each with an example**

An **optimizer** is an algorithm used to **update the parameters (weights and biases) of a machine learning model** during training in order to **minimize the loss function**.  

**Purpose:**  
- The optimizer helps the model **learn from data** by adjusting its parameters to reduce errors.  
- It determines **how quickly and in which direction** the weights are updated.  

**How it Works:**  
1. The model makes predictions using current weights.  
2. The **loss function** calculates the error between predictions and actual values.  
3. The optimizer updates the weights in the **direction that minimizes the loss**.  
4. This process repeats over many iterations, grouped into epochs (full passes over the training data), until the loss stops improving.  
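
The four steps can be written out by hand for a one-feature linear model. This is a sketch of plain full-batch gradient descent in NumPy, with made-up data and a learning rate chosen for illustration. It is not the mini-batch variant that Keras optimizers implement.

```python
import numpy as np

# Made-up data that follows y = 2x + 3 exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 3.0

w, b = 0.0, 0.0          # initial parameters
learning_rate = 0.05

for epoch in range(2000):
    y_pred = w * x + b                      # 1. predict with current parameters
    error = y_pred - y
    loss = np.mean(error ** 2)              # 2. MSE loss
    grad_w = 2.0 * np.mean(error * x)       # 3. gradients of the loss
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w             #    step against the gradient
    b -= learning_rate * grad_b             # 4. repeat

print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.6f}")
```

The parameters converge to roughly `w = 2` and `b = 3`. A noticeably larger learning rate (above about 0.08 for this data) would make the updates diverge instead.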

**Example in Python using SGD (Stochastic Gradient Descent):**  

```python
from tensorflow.keras.optimizers import SGD

# Create an SGD optimizer with learning rate 0.01
optimizer = SGD(learning_rate=0.01)
```

**Different Types of Optimizers in Machine Learning**  

1. **SGD (Stochastic Gradient Descent)**  
   - Updates weights using one sample or mini-batch at a time.  
   - Simple and widely used.  
   ```python
   from tensorflow.keras.optimizers import SGD
   optimizer = SGD(learning_rate=0.01)
   ```

2. **Momentum**  
   - Adds a fraction of the previous update to the current update to accelerate convergence and reduce oscillations.
   ```python
   from tensorflow.keras.optimizers import SGD

   optimizer = SGD(learning_rate=0.01, momentum=0.9)
   ```
3. **Adagrad (Adaptive Gradient Algorithm)**  
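   - Adapts the learning rate of each parameter individually by dividing by the root of its accumulated squared gradients; parameters that receive few or small updates keep a larger effective learning rate, which suits sparse data.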
   ```python
   from tensorflow.keras.optimizers import Adagrad
   optimizer = Adagrad(learning_rate=0.01)
   ```
4. **RMSProp (Root Mean Square Propagation)**  
   - Adaptive learning rate with decay; handles non-stationary objectives well (e.g., RNNs).
   ```python
   from tensorflow.keras.optimizers import RMSprop
   optimizer = RMSprop(learning_rate=0.001, rho=0.9)
   ```

5. **Adam (Adaptive Moment Estimation)**  
   - Combines momentum and RMSProp; tracks first and second moments of gradients; fast and stable.  
   ```python
   from tensorflow.keras.optimizers import Adam
   optimizer = Adam(learning_rate=0.001)
   ```

6. **Nadam (Nesterov-accelerated Adam)**  
   - Adam optimizer with Nesterov momentum for slightly faster convergence.  
   ```python
   from tensorflow.keras.optimizers import Nadam
   optimizer = Nadam(learning_rate=0.001)
   ```


17. **What is sklearn.linear_model?**

**`sklearn.linear_model` in Python**  

**Definition:**  
`sklearn.linear_model` is a module in **scikit-learn** that contains **linear models** for regression and classification.  
Linear models assume a **linear relationship** between input features and the target variable.  

**Common Models in `sklearn.linear_model`:**  

1. **LinearRegression**  
   - Predicts a continuous target using a linear combination of input features.  
   - Example: Predicting house prices.  

2. **LogisticRegression**  
   - Used for binary or multiclass classification.  
   - Outputs probabilities and predicts classes using a sigmoid or softmax function.  

3. **Ridge Regression** (`Ridge`)  
   - Linear regression with **L2 regularization** to prevent overfitting.  

4. **Lasso Regression** (`Lasso`)  
   - Linear regression with **L1 regularization**, which can shrink some coefficients exactly to zero (feature selection).  

5. **ElasticNet**  
   - Combines L1 and L2 regularization.  

6. **SGDClassifier / SGDRegressor**  
   - Linear models optimized via **stochastic gradient descent**.  
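To make the difference between the regularized models listed above more concrete, here is a small sketch on synthetic data. The data, the `alpha` values, and the variable names are assumptions chosen for illustration: only the first of three features actually influences the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # three synthetic features
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```

With this setup, Lasso is expected to zero out the two irrelevant features, while Ridge typically leaves them small but nonzero.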

**Example: Linear Regression using `sklearn.linear_model`**  

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([5, 7, 9, 11, 13])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print(y_pred)
```

Output:

```
[7.]
```


18. **What does model.fit() do? What arguments must be given?**

`model.fit()` is the method used to **train a machine learning model** on the provided dataset. During fitting, the model **learns patterns from the input features (X) to predict the target variable (y)** by adjusting its internal parameters (weights and biases).  

**In scikit-learn:**  

**Basic Syntax:**  
```python
model.fit(X, y)
```

**Arguments:**

* `X` → Input features (array-like, shape `[n_samples, n_features]`)
* `y` → Target variable (array-like, shape `[n_samples]`, for regression or classification)

**Optional parameters (depending on the model):**

* `sample_weight` → Array of weights for each sample (if needed)

Example:
```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([5, 7, 9, 11, 13])

model = LinearRegression()
model.fit(X, y)  # Train the model
```
* After this, the model has learned the coefficients and intercept.
* You can now use model.predict(X_test) to make predictions.
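As a quick check of what `fit()` has learned, the fitted values can be read from the model's attributes. The sketch below reuses the example data (which follows `y = 2x + 3`); the `weights` array is a made-up illustration of the optional `sample_weight` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([5, 7, 9, 11, 13])   # generated by y = 2x + 3

model = LinearRegression()
model.fit(X, y)
print(model.coef_)        # learned slope, close to 2
print(model.intercept_)   # learned intercept, close to 3

# Optional per-sample weights: here the last two samples count three times as much
weights = np.array([1, 1, 1, 3, 3])
weighted_model = LinearRegression().fit(X, y, sample_weight=weights)
```

Because this data lies exactly on a line, the weighted fit recovers the same slope and intercept; weights only change the result when the points disagree.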
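**In Keras (TensorFlow):**

`fit()` additionally accepts training-loop arguments such as the number of epochs, the batch size, and optional validation data:

```python
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))
```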


19. **What does model.predict() do? What arguments must be given?**

`model.predict()` is a method used to **make predictions using a trained machine learning model**. After the model has been trained using `model.fit()`, it can take new input data and output predicted values or classes.  

**In scikit-learn:**  

**Basic Syntax:**  
```python
predictions = model.predict(X_new)
```
**Arguments:**

* `X_new` → Input features for which predictions are to be made (array-like, shape `[n_samples, n_features]`)

Example:
```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Training data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([5, 7, 9, 11, 13])

# Train the model
model = LinearRegression()
model.fit(X, y)

# New data for prediction
X_new = np.array([[6], [7]])
predictions = model.predict(X_new)
print(predictions)  # Output: predicted values for X_new
```
* Output will be continuous values for regression models or class labels for classification models.
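For a classifier, the same call returns class labels rather than continuous values. The following sketch uses a tiny made-up dataset; `predict_proba()` is shown alongside `predict()` because many scikit-learn classifiers, including `LogisticRegression`, also expose class probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: small values belong to class 0, large values to class 1
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

X_new = np.array([[1.5], [5.5]])           # must be 2-D: (n_samples, n_features)
labels = clf.predict(X_new)                # one class label per row
probabilities = clf.predict_proba(X_new)   # one probability per class, per row
print(labels)
print(probabilities)
```

Note that `X_new` keeps the two-dimensional shape used during training, even when predicting for a single feature.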

20. **What are continuous and categorical variables?**

**1. Continuous Variables**  
- **Definition:** Variables that can take **any numerical value** within a range.  
- Often measured quantities that can be fractional or decimal.  
- **Examples:**  
  - Height (e.g., 170.5 cm)  
  - Weight (e.g., 65.2 kg)  
  - Temperature (e.g., 36.6°C)  
- **Use Case in ML:** As a target, a continuous variable makes the task a regression problem; as a feature, it may need scaling or normalization.  

---

**2. Categorical Variables**  
- **Definition:** Variables that represent **discrete categories or labels**.  
- They do not have a natural numerical order (unless ordinal).  
- **Examples:**  
  - Gender (Male, Female)  
  - Color (Red, Blue, Green)  
  - Type of vehicle (Car, Bike, Bus)  
- **Use Case in ML:** Classification tasks; often encoded using techniques like **One-Hot Encoding** or **Label Encoding**.  

**Key Difference:**

| Feature Type      | Nature           | Values Example        | ML Use Case      |
|------------------|-----------------|---------------------|----------------|
| Continuous        | Numerical       | 0.5, 2.3, 100       | Regression, Scaling |
| Categorical       | Discrete/Labels | Red, Blue, Car       | Classification, Encoding |
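In practice, a first pass at separating the two kinds of columns can be done by data type. The sketch below assumes pandas is available and uses a made-up `DataFrame`; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170.5, 182.0, 165.3],
    "weight_kg": [65.2, 80.1, 58.7],
    "color": ["Red", "Blue", "Green"],
    "vehicle": ["Car", "Bike", "Bus"],
})

# Numeric columns are treated as continuous, everything else as categorical
continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```

This is only a heuristic: a numeric column that stores category codes (for example, a postal code) is still categorical and has to be flagged by hand.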

21. **What is feature scaling? How does it help in Machine Learning?**

Feature scaling is the process of **normalizing or standardizing the range of independent variables (features)** in a dataset.  
It ensures that all features contribute equally to the learning process, preventing models from being biased toward features with larger numerical values.  

**Why Feature Scaling Is Needed**

In many Machine Learning algorithms, the distance between data points or the magnitude of feature values affects how the model learns. For example, algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and gradient-descent-based models are sensitive to differences in feature scales.

If one feature (e.g., salary, with values in the tens of thousands) has a much larger range than another (e.g., age in years), such a model may give more weight to the larger-valued feature, even if it is not more important.


**How Feature Scaling Helps in Machine Learning**

* Can Improve Model Accuracy: Prevents large-scale features from dominating the model's learning in scale-sensitive algorithms.

* Faster Convergence: Algorithms using gradient descent (like Linear Regression or Neural Networks) converge faster when features are scaled.

* Better Distance Calculations: Distance-based algorithms (KNN, K-Means, SVM) perform better when all features are on a similar scale.

* Avoids Numerical Instability: Prevents computational issues when features have very large or very small values.
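The effect on distance calculations can be seen on a handful of made-up rows. In this sketch the people, their ages, and their salaries are invented for illustration, and `nearest_to_first` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), salary; each row is a hypothetical person
X = np.array([
    [25, 50000],   # person A (the query point)
    [55, 50500],   # person B: similar salary, very different age
    [26, 52000],   # person C: similar age, slightly different salary
    [40, 60000],
    [35, 40000],
], dtype=float)

def nearest_to_first(data):
    # Euclidean distance from row 0 to every other row; return the nearest row's index
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(X)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X))
print(nearest_raw, nearest_scaled)
```

On the raw data the salary column dominates, so person B (index 1) is the nearest neighbor of A despite the 30-year age gap. After standardization both features contribute comparably and person C (index 2) becomes the nearest.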


22. **How do we perform scaling in Python?**

Feature scaling in Python is commonly done using the **`scikit-learn`** library, which provides several preprocessing classes for different scaling techniques.


**1. Min-Max Scaling (Normalization)**  
This scales all values between a specific range (usually 0 to 1).

```python
from sklearn.preprocessing import MinMaxScaler

# Example data
data = [[10], [20], [30], [40], [50]]

# Initialize scaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)
```

**2. Standardization (Z-score Scaling)**

This method rescales data so that it has a mean of 0 and a standard deviation of 1.
```python
from sklearn.preprocessing import StandardScaler

# Example data
data = [[10], [20], [30], [40], [50]]

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)
```

**3. Robust Scaling**

This scaling technique uses the median and interquartile range (IQR), making it robust to outliers.
```python
from sklearn.preprocessing import RobustScaler

# Example data
data = [[10], [20], [3000], [40], [50]]

# Initialize scaler
scaler = RobustScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)
```

**4. MaxAbs Scaling**

This scales each feature by dividing by its maximum absolute value, keeping the sign of the data intact.
```python
from sklearn.preprocessing import MaxAbsScaler

# Example data
data = [[-10], [0], [10], [20]]

# Initialize scaler
scaler = MaxAbsScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)
```
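For reference, the four scalers above correspond to the following per-feature transformations, assuming their default settings ($\mu$ and $\sigma$ are the feature's mean and standard deviation, and $Q_1$, $Q_3$ its first and third quartiles):

$$
\begin{aligned}
\text{Min-Max:} \quad & x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
\text{Standardization:} \quad & x' = \frac{x - \mu}{\sigma} \\
\text{Robust:} \quad & x' = \frac{x - \text{median}(x)}{Q_3 - Q_1} \\
\text{MaxAbs:} \quad & x' = \frac{x}{\max |x|}
\end{aligned}
$$

As a worked example with the data `[10, 20, 30, 40, 50]`: for $x = 20$, min-max scaling gives $(20 - 10) / (50 - 10) = 0.25$, and standardization gives $(20 - 30) / \sqrt{200} \approx -0.71$.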

**5. Using fit(), transform(), and fit_transform()**

* fit() — Learns the scaling parameters (mean, std, min, max, etc.) from the data.

* transform() — Applies the learned scaling to the data.

* fit_transform() — Performs both fit() and transform() in one step.
```python
from sklearn.preprocessing import StandardScaler

data = [[10], [20], [30], [40], [50]]
scaler = StandardScaler()
scaler.fit(data)           # Learn parameters
scaled_data = scaler.transform(data)  # Apply scaling
```
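The split between `fit()` and `transform()` matters most when there is a separate test set: the scaling parameters are usually learned from the training rows only and then reused on the test rows. A minimal sketch, assuming a made-up feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)   # made-up feature matrix

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(scaler.mean_)          # equals the column means of X_train
print(X_test_scaled.shape)
```

Calling `fit_transform()` on the test set instead would let test-set statistics leak into preprocessing.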


23. **What is sklearn.preprocessing?**

The **`sklearn.preprocessing`** module in the **Scikit-learn** library provides various tools and techniques to **transform raw data into a suitable format** for Machine Learning models.  
It helps in **scaling, encoding, and normalizing** data so that models can learn effectively; missing-value imputation is handled by the companion **`sklearn.impute`** module.

**Key Functions and Classes in `sklearn.preprocessing`**

1. **Scaling and Normalization**
   - `StandardScaler` – Standardizes features by removing the mean and scaling to unit variance.  
   - `MinMaxScaler` – Scales features to a specific range (default 0 to 1).  
   - `RobustScaler` – Uses median and IQR for scaling (robust to outliers).  
   - `MaxAbsScaler` – Scales each feature by its maximum absolute value.  
   - `Normalizer` – Normalizes samples individually to have unit norm (used in text and clustering).

2. **Encoding Categorical Variables**
   - `LabelEncoder` – Encodes labels (target values) with integer values (0, 1, 2, …).  
   - `OneHotEncoder` – Converts categorical values into a one-hot numeric array.  
   - `OrdinalEncoder` – Encodes categorical features as integers with an assigned order.

3. **Imputing Missing Values** (these classes live in `sklearn.impute`, not `sklearn.preprocessing`)
   - `SimpleImputer` – Replaces missing values with the mean, median, most frequent value, or a constant.  
   - `KNNImputer` – Fills missing values using the mean value from nearest neighbors.

4. **Generating Polynomial Features**
   - `PolynomialFeatures` – Expands input features into polynomial combinations to model nonlinear relationships.

5. **Binarization and Thresholding**
   - `Binarizer` – Converts numerical features into binary (0/1) based on a threshold.

**Example: Scaling and Encoding**

```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np

# Example numeric data
X_numeric = np.array([[10, 20, 30], [20, 30, 40], [30, 40, 50]])

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)
print(X_scaled)

# Example categorical data
X_categorical = np.array([['red'], ['green'], ['blue']])

# Encoding
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical).toarray()
print(X_encoded)
```
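The remaining transformers from the list can be tried in the same way. This sketch applies three of them to a single made-up row; the threshold and degree are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, Normalizer, PolynomialFeatures

X = np.array([[3.0, 4.0]])

# Degree-2 polynomial expansion; columns: x0, x1, x0^2, x0*x1, x1^2
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Values above the threshold become 1, all others become 0
binary = Binarizer(threshold=3.5).fit_transform(X)

# Each row is divided by its Euclidean (L2) length
unit = Normalizer(norm="l2").fit_transform(X)

print(poly)    # [[ 3.  4.  9. 12. 16.]]
print(binary)  # [[0. 1.]]
print(unit)    # [[0.6 0.8]]
```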

24. **How do we split data for model fitting (training and testing) in Python?**

In Machine Learning, we split the dataset into **training** and **testing** sets to evaluate how well a model generalizes to unseen data.

- **Training set** → Used to train (fit) the model.  
- **Testing set** → Used to evaluate the model’s performance.

**Common Function Used**

Scikit-learn provides the function **`train_test_split()`** from the `sklearn.model_selection` module to split data efficiently.

**Syntax**

```python
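train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
```

* arrays: The input data (features and target).
* test_size: Fraction or number of samples to include in the test split (e.g., 0.2 means 20%). If both `test_size` and `train_size` are left as `None`, the test fraction defaults to 0.25.
* train_size: Fraction or number of samples to include in the training split; by default, the complement of `test_size`.
* random_state: Seed for the shuffling, which makes the split reproducible.
* stratify: If given (usually the target `y`), the split preserves the class proportions in both sets.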
* shuffle: Whether to shuffle data before splitting (default: True).
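A short usage sketch of the parameters above, on a made-up dataset of 10 samples with two balanced classes. Passing the target to the `stratify` parameter of `train_test_split()` keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # 10 samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # two balanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
print(y_test)                        # one sample from each class
```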

25. **What is data encoding?**

**Data Encoding**

Data encoding is the process of **converting categorical or textual data into numerical form** so that machine learning algorithms can interpret and process it.  
Most ML algorithms (like regression, SVM, and neural networks) can only work with **numerical input**, not text or categories.

---

**Why Encoding is Important**

1. Machine learning models require numerical data.  
2. Encoding converts labels or categories into numbers.  
3. It preserves useful information while making the data model-ready.

---
**Data Encoding Techniques**

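**1. Label Encoding**  
Converts each category into a unique integer value.  
Best suited to **target labels** or to **ordinal** data whose integer codes match the natural order; on nominal features the integers imply an order that does not exist.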

---

**2. One-Hot Encoding**  
Creates **binary columns (0 or 1)** for each category.  
Used when categorical data is **nominal** (no natural order).

---

**3. Ordinal Encoding**  
Assigns integer values to categories based on their **rank or order**.  
Commonly used for data with levels like "Low", "Medium", "High".
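The first three techniques can be compared side by side with scikit-learn. The sketch below uses a made-up column of levels; note that `LabelEncoder` assigns integers in sorted label order, whereas `OrdinalEncoder` accepts an explicit ranking through its `categories` argument.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

levels = np.array([["Low"], ["High"], ["Medium"], ["Low"]])

# Label encoding: codes follow sorted label order (High=0, Low=1, Medium=2)
label_codes = LabelEncoder().fit_transform(levels.ravel())

# Ordinal encoding with an explicit rank: Low=0, Medium=1, High=2
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform(levels)

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder().fit_transform(levels).toarray()

print(label_codes)            # [1 0 2 1]
print(ordinal_codes.ravel())  # [0. 2. 1. 0.]
print(one_hot.shape)          # (4, 3)
```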

---

**4. Binary Encoding**  
Combines the properties of label and one-hot encoding.  
First converts categories to integers, then represents those integers as **binary digits**.  
Efficient for handling **high-cardinality categorical data**.
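The two steps can be written out by hand with pandas. This is a hand-rolled sketch for illustration, not the implementation of any particular encoding library; the integer assignment (order of first appearance) and the `bit_` column names are assumptions.

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Yellow", "Red"])

# Step 1: map each category to an integer (here, in order of first appearance)
codes, categories = pd.factorize(colors)

# Step 2: write each integer in binary, one column per bit
n_bits = max(1, int(len(categories) - 1).bit_length())
binary = pd.DataFrame(
    {f"bit_{b}": (codes >> b) & 1 for b in reversed(range(n_bits))}
)

print(binary)
```

Four categories fit into two bit columns here, whereas one-hot encoding would need four columns.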

---

**5. Frequency Encoding**  
Replaces each category with the **relative frequency (proportion)** of its occurrence in the dataset.  
Useful when categorical data has many unique values.

---

**6. Target Encoding**  
Replaces each category with the **mean of the target variable** for that category.  
Common in supervised learning tasks like classification and regression.

---

**7. Hash Encoding (Feature Hashing)**  
Uses a **hash function** to convert categories into a fixed number of numeric columns.  
Efficient for datasets with a **large number of categories**.
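In scikit-learn, feature hashing is available as `FeatureHasher` in `sklearn.feature_extraction`. A minimal sketch on made-up city names, with `n_features=8` chosen arbitrarily:

```python
from sklearn.feature_extraction import FeatureHasher

cities = ["Paris", "Tokyo", "Lima", "Paris"]

hasher = FeatureHasher(n_features=8, input_type="string")
# Each sample must be an iterable of strings, so every value is wrapped in a list
hashed = hasher.transform([[city] for city in cities]).toarray()

print(hashed.shape)   # (4, 8)
print(hashed)
```

The same category always lands in the same column, but different categories may collide when `n_features` is small.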

---

**8. Count Encoding**  
Similar to frequency encoding but replaces categories with the **count** (not normalized frequency) of their occurrences.

---

**9. Helmert Encoding**  
A contrast coding scheme that compares each category level with the mean of the subsequent levels (or of the previous levels, in the reverse variant).  
Used in some statistical modeling approaches.

---

**10. Leave-One-Out Encoding (LOO Encoding)**  
A variation of target encoding where the current row’s target is excluded from the mean calculation to avoid data leakage.
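The count-based and target-based encodings above can be sketched with plain pandas. The data and column names below are made up for illustration; in a real project these statistics would normally be computed on the training set only.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 0, 0, 1],
})

group = df.groupby("city")["target"]
counts = group.transform("count")   # rows per category
sums = group.transform("sum")       # target total per category

df["count_enc"] = counts            # count encoding
df["freq_enc"] = counts / len(df)   # frequency encoding (share of rows)
df["target_enc"] = sums / counts    # target encoding (mean target per category)
# Leave-one-out: remove the current row's target before averaging.
# A category that appears only once has no other rows, so the result is NaN.
df["loo_enc"] = (sums - df["target"]) / (counts - 1)

print(df)
```

The `NaN` for single-occurrence categories is usually replaced with a fallback such as the global target mean.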

---

**11. Base-N Encoding**  
Encodes categories using **base-N representations** (e.g., base-2, base-3).  
It’s a generalization of binary encoding.

