01. What is a parameter?

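Ans:- A parameter is a placeholder in a function definition that specifies what data the function expects to receive. In machine learning, the term also refers to the internal values a model learns from data, such as the weights of a linear model.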
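
As a small illustration of the function sense of the term, the sketch below shows a parameter in a definition and the argument passed for it at call time. The function name `greet` and the value `"Asha"` are hypothetical.

```python
# `name` is the parameter: a placeholder declared in the function definition.
def greet(name):
    return f"Hello, {name}!"

# "Asha" is the argument: the actual value supplied for the parameter.
message = greet("Asha")
print(message)  # Hello, Asha!
```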

02. What is correlation?

Ans:- **Correlation** measures **how strongly two sets of data are related**. The idea comes from statistics; in practice it is usually computed in code.



What does negative correlation mean?

Ans:- A negative correlation means that as one variable increases, the other tends to decrease. In practice this relationship is measured in code, often with a data analysis library.

03. Define Machine Learning. What are the main components in Machine Learning?

Ans:- Machine Learning is a branch of artificial intelligence (AI) that enables computers to learn from data and make decisions or predictions without being explicitly programmed for every specific task.

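Instead of following hard-coded rules, ML models find patterns in data and use those patterns to make predictions or decisions. The main components are the data, the model, a loss function that measures prediction error, and an optimizer that adjusts the model to reduce that error.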

04. How does loss value help in determining whether the model is good or not?


Ans:- The loss value (also called cost) is a key indicator of how good or bad your machine learning model is at making predictions. It tells you how far off your model's predictions are from the actual values.
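
To make this concrete, here is a minimal sketch that compares two sets of predictions using mean squared error, one common loss. The numbers and the helper name `mse` are made up for illustration.

```python
import numpy as np

# Hypothetical true values and the predictions of two candidate models.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.5, 5.5, 7.0, 8.0])   # close to the truth
pred_b = np.array([0.0, 9.0, 4.0, 12.0])  # far from the truth

def mse(y, y_hat):
    """Mean squared error: the average of the squared prediction errors."""
    return float(np.mean((y - y_hat) ** 2))

loss_a = mse(y_true, pred_a)
loss_b = mse(y_true, pred_b)
print(loss_a, loss_b)  # the smaller value marks the better fit on this data
```

A lower loss means the predictions are closer to the actual values. Comparing the loss on training data with the loss on held-out data also shows whether the model generalizes.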



05. What are continuous and categorical variables?

Ans:-

### 📊 **Continuous vs. Categorical Variables**

In data analysis and machine learning, understanding the **type of variable** you're working with is crucial. Two main types are:

---

### 1. 🔢 **Continuous Variables**

These are **numeric variables** that can take on **any value within a range**. They are typically measured rather than counted and, in theory, have infinitely many possible values.

#### ✅ Characteristics:

* Measurable quantities
* Can have decimals
* Values can be ordered and compared

#### 📌 Examples:

* Height (e.g., 170.5 cm)
* Weight (e.g., 65.2 kg)
* Temperature (e.g., 98.6°F)
* Income (e.g., \$45,500.75)

---

### 2. 🔠 **Categorical Variables**

These represent **categories or groups**. They are **not numerical** (or treated as labels even if numbers are used).

#### ✅ Characteristics:

* Represent types or categories
* Often limited to a fixed set of values
* Can be nominal or ordinal

#### 🧩 Types:

* **Nominal**: No natural order
  e.g., `["Red", "Blue", "Green"]`, `["Dog", "Cat", "Bird"]`

* **Ordinal**: Ordered categories
  e.g., `["Low", "Medium", "High"]`, `["Beginner", "Intermediate", "Expert"]`

#### 📌 Examples:

* Gender (`Male`, `Female`)
* City (`New York`, `London`, `Tokyo`)
* Yes/No (`0`, `1`)
* Education level (`High School`, `Bachelor's`, `Master's`)

---

### 🧠 Why It Matters in Machine Learning:

* Continuous variables are typically used **as-is** in models.
* Categorical variables often need to be **encoded** (e.g., one-hot encoding or label encoding) before being used in ML models.
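
A rough way to separate the two kinds of columns in practice is to look at their dtypes. The sketch below assumes a small, made-up pandas DataFrame; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two continuous columns and two categorical columns.
df = pd.DataFrame({
    "height_cm": [170.5, 162.0, 181.3],
    "income": [45500.75, 38000.00, 52000.50],
    "city": ["New York", "London", "Tokyo"],
    "education": ["High School", "Bachelor's", "Master's"],
})

# Numeric columns are candidates for continuous features.
continuous_cols = df.select_dtypes(include="number").columns.tolist()
# Non-numeric columns are candidates for categorical features.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```

The dtype is only a heuristic: a column of `0`/`1` codes is numeric but still categorical, so a manual check is usually needed.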



06. How do we handle categorical variables in Machine Learning? What are the common techniques?

Ans:-  Handling **categorical variables** correctly is essential in machine learning, because most ML algorithms can't directly process non-numeric data. You need to **convert categories into numbers** while preserving meaning (where relevant).

---

### 🛠️ **Common Techniques to Handle Categorical Variables**

#### 1. **Label Encoding**

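* **What it does**: Assigns a unique integer to each category. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order, not in their logical order.
* **Best for**: Target labels, or ordinal variables only after checking that the assigned numbers match the intended order.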

```python
from sklearn.preprocessing import LabelEncoder

data = ['Low', 'Medium', 'High']
encoder = LabelEncoder()
encoded = encoder.fit_transform(data)
print(encoded)  # [1 2 0] because classes are sorted: High=0, Low=1, Medium=2
```

> ⚠️ Not ideal for **nominal** data (e.g., `['Red', 'Blue', 'Green']`) because ML models may misinterpret the numbers as having order.

---

#### 2. **One-Hot Encoding**

* **What it does**: Creates binary columns (0 or 1) for each category.
* **Best for**: Nominal variables (where order does **not** matter).

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
encoded = pd.get_dummies(df, dtype=int)  # dtype=int gives 0/1; pandas 2.x defaults to True/False
print(encoded)
```

**Output:**

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
```

> ✅ Widely used. Most models (like logistic regression, tree-based models) work well with this.

---

#### 3. **Ordinal Encoding**

* **What it does**: Assigns ordered integers to ordered categories (like label encoding).
* **Best for**: When the categories have a clear order (e.g., `['Low', 'Medium', 'High']`).

```python
from sklearn.preprocessing import OrdinalEncoder

data = [['Low'], ['Medium'], ['High']]
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(data)
```
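
The difference between the two encoders is easiest to see side by side. This sketch reuses the `Low`/`Medium`/`High` example; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ["Low", "Medium", "High"]

# LabelEncoder sorts the categories, so the codes follow alphabetical order.
label_codes = LabelEncoder().fit_transform(levels)
print(label_codes)  # [1 2 0]

# OrdinalEncoder with an explicit category list keeps Low < Medium < High.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform([[v] for v in levels])
print(ordinal_codes.ravel())  # [0. 1. 2.]
```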

---

#### 4. **Frequency Encoding**

* **What it does**: Replaces each category with its frequency count in the dataset.
* **Best for**: High-cardinality variables (many unique values).

```python
df['Category_freq'] = df['Category'].map(df['Category'].value_counts())
```

---

#### 5. **Target Encoding (Mean Encoding)**

* **What it does**: Replaces a category with the average target value for that category.
* **Best for**: Categorical features with many levels in supervised learning.

> ⚠️ Risk of **data leakage** if not done properly (use cross-validation).
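
A minimal sketch of the idea with pandas, assuming a made-up `city` feature and a binary `bought` target (both names are hypothetical):

```python
import pandas as pd

# Hypothetical training data: a categorical feature and a binary target.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "bought": [1, 0, 1, 1, 0, 1],
})

# Mean target per category, learned from the training rows only.
city_means = train.groupby("city")["bought"].mean()
global_mean = train["bought"].mean()

# Apply the mapping to new rows; unseen categories fall back to the global mean.
new_rows = pd.DataFrame({"city": ["A", "B", "D"]})
new_rows["city_target_enc"] = new_rows["city"].map(city_means).fillna(global_mean)
print(new_rows)
```

Encoding the training rows with means computed from those same rows leaks the target into the feature. Computing the means out-of-fold, as in cross-validation, is one way to avoid that.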

---

### 🔍 Summary Table:

| Technique          | Best For            | Handles Order?            | Creates New Columns? |
| ------------------ | ------------------- | ------------------------- | -------------------- |
| Label Encoding     | Target labels       | ❌ No (alphabetical codes) | ❌ No                 |
| One-Hot Encoding   | Nominal             | ❌ No                      | ✅ Yes                |
| Ordinal Encoding   | Ordinal             | ✅ Yes                     | ❌ No                 |
| Frequency Encoding | High-cardinality    | ❌ No                      | ❌ No                 |
| Target Encoding    | Supervised problems | ❌ No (unless designed to) | ❌ No                 |

---

### 🚧 When Choosing a Technique:

* Use **one-hot encoding** for low-cardinality nominal features.
* Use **ordinal encoding** with an explicit category order for ordered categories.
* Use **target/frequency encoding** for high-cardinality features (e.g., countries, product IDs).




07. What do you mean by training and testing a dataset?

Ans:-

### 🧠 **Training vs. Testing**

| Term             | Meaning                                        | Purpose                       |
| ---------------- | ---------------------------------------------- | ----------------------------- |
| **Training Set** | Data used to **teach** the model               | Learn patterns                |
| **Test Set**     | Data used to **evaluate** the model's accuracy | Check performance on new data |

* **Training** = Model **learns**
* **Testing** = Model is **tested** on unseen data

> Goal: Good performance on the test set, close to the performance on the training set, means the model is **generalizing** well.
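
The sketch below illustrates the two roles on scikit-learn's built-in iris dataset. The choice of dataset and of `LogisticRegression` is an assumption made only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)             # the model learns from the training set only

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)  # performance on data the model has not seen
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A test accuracy far below the training accuracy would suggest the model has memorized the training data rather than learned general patterns.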



08. What is sklearn.preprocessing?

Ans:- **`sklearn.preprocessing`** is a module in **scikit-learn** (a popular Python machine learning library) that provides tools to **prepare and transform your data** before feeding it into machine learning models.

---

### 🔧 What Does It Do?

* **Scale features** (e.g., normalize or standardize)
* **Encode categorical variables** (e.g., label encoding, one-hot encoding)
* **Generate polynomial features**
* **Impute missing values** (handled by the separate `sklearn.impute` module, not by `sklearn.preprocessing` itself)
* **Other transformations** (like binarizing data)

---

### ⚡ Common Classes and Functions in `sklearn.preprocessing`:

| Tool                 | Purpose                                         |
| -------------------- | ----------------------------------------------- |
| `StandardScaler`     | Scale features to have mean=0, std=1            |
| `MinMaxScaler`       | Scale features to a given range (0 to 1)        |
| `LabelEncoder`       | Convert labels to numeric values                |
| `OneHotEncoder`      | Convert categorical variables to binary vectors |
| `PolynomialFeatures` | Create polynomial and interaction features      |
| `Binarizer`          | Convert numerical data to binary (0 or 1)       |

---

### Example: Standardizing Features

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X is your feature matrix
```

---

### Why Use It?

* Many ML algorithms perform better when data is **scaled** or **properly encoded**.
* Preprocessing improves model **convergence** and **accuracy**.




09. What is a Test set?

Ans:-

### 🧪 **Test Set**

A **test set** is a portion of your dataset that you **set aside and don’t use during training**. Its main purpose is to **evaluate how well your trained machine learning model performs on new, unseen data**.

---

### Key Points:

* Used **after training** to measure model accuracy.
* Helps check if the model **generalizes well** or just memorized training data.
* Usually represents **20-30%** of the total dataset.

---

### Simple analogy:

If training data is the “practice exam,” the test set is the **“final exam”** that evaluates real understanding.



10. How do we split data for model fitting (training and testing) in Python?

Ans:- Here’s how you **split data into training and testing sets in Python** using **scikit-learn**:

---

### Using `train_test_split` from `sklearn.model_selection`

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)  # Your feature data (list, NumPy array, or pandas DataFrame)
y = np.array([0, 1] * 5)          # Your labels/targets

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Now:
# X_train, y_train -> data used to train the model
# X_test, y_test -> data used to test/evaluate the model
```

---

### Parameters:

* `test_size=0.2`: 20% of data goes to the test set.
* `random_state=42`: Ensures the split is reproducible (same every run).

---

### Quick summary:

* Use **`train_test_split`** to divide your dataset.
* Train on the **training set**.
* Evaluate on the **test set**.
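
For classification problems, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses the built-in iris dataset as stand-in data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_train))  # [40 40 40]
print(np.bincount(y_test))   # [10 10 10]
```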



How do you approach a Machine Learning problem?

Ans:-



### 🧩 **Step-by-Step Approach to a Machine Learning Problem**

1. **Understand the Problem**

   * What’s the goal? (e.g., classification, regression)
   * What kind of data do you have?

2. **Collect Data**

   * Gather all relevant data from sources (databases, files, APIs).

3. **Explore and Analyze Data**

   * Understand data types, distributions, missing values.
   * Visualize relationships and detect anomalies.

4. **Prepare and Clean Data**

   * Handle missing values, remove duplicates, fix errors.
   * Encode categorical variables, normalize/scale features.

5. **Split Data**

   * Divide data into training, validation, and testing sets.

6. **Select and Train a Model**

   * Choose suitable ML algorithms (e.g., decision trees, neural networks).
   * Train models on the training set.

7. **Evaluate Model Performance**

   * Use metrics like accuracy, precision, recall, RMSE on validation/test sets.
   * Compare different models and tune hyperparameters.

8. **Improve the Model**

   * Feature engineering (create new features).
   * Try different algorithms, tune hyperparameters.

9. **Deploy the Model**

   * Put the model into production for real-world use.

10. **Monitor and Maintain**

    * Track model performance over time.
    * Update model with new data if needed.

---

### 📝 Summary

| Step                     | Purpose                           |
| ------------------------ | --------------------------------- |
| Understand the problem   | Define goal and data needs        |
| Collect data             | Get raw data                      |
| Explore & clean data     | Prepare data for training         |
| Split data               | Separate for unbiased evaluation  |
| Train model              | Learn from data                   |
| Evaluate & improve model | Measure and enhance accuracy      |
| Deploy & monitor         | Use model in real life and update |
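
Several of these steps can be sketched in a few lines with scikit-learn. The example assumes the built-in wine dataset and a scaled logistic regression, chosen only for illustration. Deployment and monitoring are out of scope here.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data: a small built-in dataset stands in for your own data.
X, y = load_wine(return_X_y=True)

# Split data: hold out a test set for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Prepare and train: chaining the steps fits the scaler on training data only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# Evaluate: measure accuracy on the held-out data.
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```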


11. Why do we have to perform EDA before fitting a model to the data?

Ans:- Here's why **Exploratory Data Analysis (EDA)** is important before fitting a machine learning model:

---

### 🔍 **Why Perform EDA Before Modeling?**

1. **Understand Your Data**

   * Discover data types, distributions, and ranges.
   * Identify patterns and relationships between variables.

2. **Detect Data Quality Issues**

   * Find missing values, outliers, or errors that could hurt model performance.
   * Decide how to handle those issues (e.g., imputation, removal).

3. **Feature Selection & Engineering**

   * Identify important features or create new ones to improve model accuracy.
   * Drop irrelevant or redundant variables.

4. **Choose the Right Model & Techniques**

   * Knowing data characteristics helps pick suitable algorithms (e.g., linear vs. tree-based).
   * Decide if scaling or encoding is needed.

5. **Prevent Garbage-In Garbage-Out (GIGO)**

   * Bad data leads to bad models. EDA helps ensure data quality, so the model learns meaningful patterns.
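
A few pandas calls cover most of these checks. The sketch below runs them on a small made-up DataFrame with one missing value and one implausible age. The 1.5 × IQR rule used for outliers is one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing income and one suspicious age.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 230],
    "income": [42000.0, np.nan, 58000.0, 61000.0, 39000.0],
    "city": ["London", "Tokyo", "London", "New York", "Tokyo"],
})

print(df.dtypes)                  # data type of each column
print(df.describe())              # ranges and distributions of numeric columns
missing = df.isna().sum()         # missing values per column
print(missing)
print(df["city"].value_counts())  # category frequencies

# Flag values far outside the interquartile range as potential outliers.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```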



12. What is correlation?

Ans:- **Correlation** is a statistical measure that describes the **strength and direction of the relationship** between two variables.

---

### Key points:

* It shows how one variable **changes when the other changes**.
* The correlation coefficient ranges from **-1 to +1**:

  * **+1** means perfect positive correlation (both increase together).
  * **0** means no linear correlation (the variables may still be related in a non-linear way).
  * **-1** means perfect negative correlation (one increases, the other decreases).

---

### Example:

* Height and weight usually have a **positive correlation** (taller people tend to weigh more).
* Time spent watching TV and exam scores might have a **negative correlation** (more TV, lower scores).
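
For reference, the most widely used coefficient is Pearson's $r$. For $n$ paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ (notation introduced here for illustration), it is defined as:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is positive when the two variables tend to be above or below their means together, and negative when they tend to move in opposite directions.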




13. What does negative correlation mean?

Ans:- Negative correlation = as one variable goes up, the other goes down.

14. How can you find correlation between variables in Python?

Ans:- You can find correlation between variables in Python easily using **pandas** or **NumPy**. Here’s how:

---

### Using **pandas**

```python
import pandas as pd

# Sample data
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [5, 4, 3, 2, 1],
    'Z': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)

# Calculate correlation matrix
corr_matrix = df.corr()

print(corr_matrix)
```

This will output the correlation coefficients between all pairs of variables:

```
     X    Y    Z
X  1.0 -1.0  1.0
Y -1.0  1.0 -1.0
Z  1.0 -1.0  1.0
```

---

### Using **NumPy** for correlation between two variables:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

corr = np.corrcoef(x, y)[0, 1]
print(corr)  # approximately -1.0 (perfect negative correlation)
```
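
Both examples above compute Pearson correlation, which measures linear relationships. As a hedged sketch, pandas can also compute the rank-based Spearman correlation through the `method` argument of `corr()`. The data below is made up so that `y` always increases with `x`, but not along a straight line.

```python
import pandas as pd

# y grows with x, but not along a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]    # linear relationship
spearman = df.corr(method="spearman").loc["x", "y"]  # monotonic (rank-based) relationship
print(pearson, spearman)
```

Here the Spearman value is 1 because the ranks agree perfectly, while the Pearson value is below 1 because the relationship is not linear.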



15. What is causation? Explain difference between correlation and causation with an example.

Ans:-

**Causation** means that one event **directly causes** another to happen. In other words, a change in variable A **produces** a change in variable B.

---

### Difference Between **Correlation** and **Causation**

| Aspect              | Correlation                                                      | Causation                                   |
| ------------------- | ---------------------------------------------------------------- | ------------------------------------------- |
| Meaning             | Variables **move together** (linked)                             | One variable **directly affects** the other |
| Direction           | No cause-effect implied                                          | Clear cause and effect relationship         |
| Example             | Ice cream sales ↑ and drowning incidents ↑                       | Smoking **causes** lung cancer              |
| Can it be spurious? | Yes, can be coincidence or caused by a third factor (confounder) | No, causal relationship is direct           |

---

### Example:

* **Correlation**:
  Ice cream sales and drowning rates both increase during summer.
  They are **correlated** because both go up together, but ice cream sales **don’t cause** drowning.

* **Causation**:
  Smoking **causes** lung cancer. This is a direct cause-effect relationship backed by evidence.
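
The ice cream example can be reproduced with a small simulation. Everything here is an assumption for illustration: the coefficients, the noise levels, and the helper name `residuals`. Temperature drives both quantities, and neither affects the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: temperature is the confounder behind both quantities.
temperature = rng.uniform(10, 35, size=1000)
ice_cream_sales = 20 * temperature + rng.normal(0, 40, size=1000)
drownings = 0.3 * temperature + rng.normal(0, 1, size=1000)

# The raw correlation is clearly positive even though there is no causal link.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, confounder):
    """Remove the linear effect of the confounder from values."""
    slope, intercept = np.polyfit(confounder, values, 1)
    return values - (slope * confounder + intercept)

# After removing the effect of temperature, the correlation is close to zero.
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, adjusted for temperature: {adjusted_corr:.2f}")
```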





16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

Ans:-

### What is an **Optimizer** in Machine Learning?

An **optimizer** is an algorithm or method used to **adjust the parameters** (like weights in neural networks) of a model during training to **minimize the loss function** (error).

* It helps the model **learn** by finding the best parameters that reduce prediction errors.
* The process is called **optimization**.

---

### Common Types of Optimizers & Their Explanation

---

#### 1. **Gradient Descent (GD)**

* The most basic optimizer.
* It updates parameters by moving **opposite to the gradient** of the loss function.
* Uses the **whole training dataset** to compute gradients each step.

**Example:**
For a function $f(w)$, the update rule is:
$w = w - \eta \cdot \nabla f(w)$
where $\eta$ = learning rate.

---

#### 2. **Stochastic Gradient Descent (SGD)**

* Similar to GD, but updates parameters using **one training example at a time**.
* Faster per update but noisier steps.
* Good for large datasets.

---

#### 3. **Mini-batch Gradient Descent**

* A compromise between GD and SGD.
* Updates parameters using a **small batch** of data at each step.
* Balances speed and stability.

---

#### 4. **Momentum**

* Helps accelerate SGD by adding a fraction of the previous update to the current update.
* Smooths updates and speeds up convergence, especially in ravines.

**Update rule:**
$v = \gamma v + \eta \nabla f(w)$
$w = w - v$
where $\gamma$ is the momentum term.

---

#### 5. **AdaGrad**

* Adapts learning rate for each parameter individually.
* Larger updates for infrequent parameters, smaller for frequent ones.
* Good for sparse data.

---

#### 6. **RMSProp**

* Improves AdaGrad by using a moving average of squared gradients.
* Prevents learning rate from shrinking too much.
* Popular in training deep networks.

---

#### 7. **Adam (Adaptive Moment Estimation)**

* Combines ideas of Momentum and RMSProp.
* Maintains moving averages of gradients and squared gradients.
* One of the most widely used optimizers today.

---

### Summary Table

| Optimizer        | Description                                          | Use Case                              |
| ---------------- | ---------------------------------------------------- | ------------------------------------- |
| Gradient Descent | Full dataset gradient, slow but stable               | Small datasets                        |
| Stochastic GD    | One sample per update, noisy but fast                | Large datasets                        |
| Mini-batch GD    | Small batches per update, balance of speed/stability | Most deep learning models             |
| Momentum         | Adds velocity to updates, accelerates convergence    | Deep networks with complex landscapes |
| AdaGrad          | Adaptive learning rates per parameter                | Sparse data                           |
| RMSProp          | Adaptive learning rates with moving average          | Recurrent Neural Networks, Deep nets  |
| Adam             | Combines momentum and RMSProp, adaptive              | Most popular, works well generally    |
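
The gradient descent and momentum update rules given earlier can be written out directly. The sketch below applies them to a toy loss $f(w) = (w - 3)^2$. The loss, the starting point, and the values of `eta` and `gamma` are assumptions picked for illustration.

```python
def grad(w):
    """Gradient of the example loss f(w) = (w - 3)**2, which is minimized at w = 3."""
    return 2 * (w - 3)

eta = 0.1  # learning rate

# Plain gradient descent: w = w - eta * grad(w)
w_gd = 0.0
for _ in range(300):
    w_gd = w_gd - eta * grad(w_gd)

# Momentum: v = gamma * v + eta * grad(w), then w = w - v
gamma = 0.9  # momentum term
w_mom, v = 0.0, 0.0
for _ in range(300):
    v = gamma * v + eta * grad(w_mom)
    w_mom = w_mom - v

print(w_gd, w_mom)  # both values end up close to 3
```

On a one-dimensional bowl like this, momentum overshoots and oscillates before settling. Its advantage shows up on elongated, ravine-like loss surfaces.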

---

### Example (Using Adam in TensorFlow)

```python
import tensorflow as tf

model = ...  # define your model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=10)
```



17. What is sklearn.linear_model ?

Ans:- **`sklearn.linear_model`** is a module in **scikit-learn** that contains implementations of various **linear models** for regression and classification tasks.

---

### What Does It Provide?

* Algorithms that model the relationship between a dependent variable and one or more independent variables **assuming a linear relationship**.
* Common models include:

  * **Linear Regression** (for continuous output)
  * **Logistic Regression** (for binary/multi-class classification)
  * **Ridge Regression**, **Lasso Regression** (regularized linear models)
  * **ElasticNet** (combines Ridge and Lasso)
  * **Perceptron**, **SGDClassifier**, etc.

---

### Why Use `sklearn.linear_model`?

* Simple, interpretable models.
* Good baseline models for regression/classification.
* Often fast and efficient for large datasets.
* Supports regularization to prevent overfitting.

---

### Example: Linear Regression

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)  # Train the model
predictions = model.predict(X_test)  # Predict on test data
```
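
For a version that runs on its own, the sketch below fits the same model to made-up data that follows $y = 2x + 1$ exactly, so the learned slope and intercept can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)     # slope close to 2, intercept close to 1
prediction = model.predict(np.array([[6.0]]))
print(prediction)                        # close to 13
```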

---

### Summary

| Model                | Task           | Description                       |
| -------------------- | -------------- | --------------------------------- |
| `LinearRegression`   | Regression     | Predict continuous values         |
| `LogisticRegression` | Classification | Predict categorical classes       |
| `Ridge`, `Lasso`     | Regression     | Linear models with regularization |
| `SGDClassifier`      | Classification | Linear classifier with SGD        |



18. What does model.fit() do? What arguments must be given?

Ans:-

The `fit()` method **trains the machine learning model** using your data. It means the model **learns the relationship** between the input features and the target labels by adjusting its internal parameters.

---

### What arguments does it take?

The most common arguments are:

* `X` — **Feature matrix:** The input data (usually a 2D array or DataFrame), where each row is a sample and each column is a feature.
* `y` — **Target vector:** The labels or values corresponding to each sample in `X`.

---

### Example:

```python
model.fit(X_train, y_train)
```

Here:

* `X_train` is your training data (features).
* `y_train` is the target/output for those samples.

---

### Additional optional arguments (depending on the model):

* `sample_weight` — To give different weights to training samples.
* `epochs`, `batch_size` — In deep learning libraries.
* Others specific to some algorithms.





19. What does model.predict() do? What arguments must be given?

Ans:-

The `predict()` method **uses the trained model to make predictions** on new input data.

* After training (`fit()`), `predict()` applies the learned patterns to unseen data.
* It outputs the **predicted labels or values** depending on the task (classification or regression).

---

### What arguments does it take?

* `X` — The input features (new data you want predictions for). Usually a 2D array or DataFrame with the same number of features as the training data.

---

### Example:

```python
predictions = model.predict(X_test)
```

Here:

* `X_test` is new/unseen data.
* `predictions` will contain the model’s output (e.g., predicted class labels or continuous values).

---

### Summary

| Method      | Purpose                       | Required Argument             |
| ----------- | ----------------------------- | ----------------------------- |
| `predict()` | Generate predictions on input | `X` (features for prediction) |
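
A common stumbling block with scikit-learn estimators is passing a 1D array to `predict()`. The sketch below uses made-up data and a `LinearRegression` model to show how a single-feature input is reshaped into the expected 2D form.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0]])  # 3 samples, 1 feature
y_train = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X_train, y_train)

new_values = np.array([4.0, 5.0])   # 1D array: not a valid input for predict()
X_new = new_values.reshape(-1, 1)   # 2D array with shape (2, 1): one row per sample
predictions = model.predict(X_new)
print(predictions.shape)  # (2,)
```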



20. What are continuous and categorical variables?

Ans:-

### **Continuous Variables**

* Variables that can take **any numeric value** within a range.
* Usually measured quantities.
* Examples:

  * Height (e.g., 170.5 cm)
  * Temperature (e.g., 22.3°C)
  * Age (if measured precisely)

---

### **Categorical Variables**

* Variables that represent **categories or groups**.
* Usually have a **fixed set of possible values** (labels).
* Examples:

  * Gender (Male, Female)
  * Color (Red, Blue, Green)
  * Type of vehicle (Car, Bike, Truck)




21. What is feature scaling? How does it help in Machine Learning?

Ans:-

**Feature scaling** is the process of **normalizing or standardizing** the range of independent variables (features) in your data.

---

### Why do we need it?

* Different features may have **different units and scales** (e.g., age in years vs. income in thousands).
* Many ML algorithms **work better or converge faster** when features are on a similar scale.
* Prevents features with larger scales from **dominating** the learning process.

---

### Common Methods of Feature Scaling:

| Method                                | What it does                                      | Result example                     |
| ------------------------------------- | ------------------------------------------------- | ---------------------------------- |
| **Normalization** (Min-Max Scaling)   | Scales features to a fixed range, usually \[0, 1] | Age from 18–90 scaled to 0–1       |
| **Standardization** (Z-score scaling) | Centers features to mean=0, std=1                 | Feature values like -1.2, 0.3, 1.5 |
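
Written as formulas for a single feature $x$ (the symbols $x_{\min}$, $x_{\max}$, $\mu$ and $\sigma$ are introduced here for illustration), the two methods are:

$$
x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x_{\text{std}} = \frac{x - \mu}{\sigma}
$$

where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the feature, $\mu$ is its mean, and $\sigma$ is its standard deviation.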

---

### How it helps in ML:

* **Gradient-based algorithms** (like logistic regression, neural networks) converge faster.
* Algorithms relying on **distance metrics** (like KNN, SVM) perform better.
* Avoids bias toward features with larger scales.

---

### Example in Python (StandardScaler):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```



22. How do we perform scaling in Python?

Ans:- You can easily perform feature scaling in Python using **scikit-learn’s preprocessing module**. Here are the two most common methods:

---

### 1. **Standardization (StandardScaler)**

Scales data to have **mean = 0** and **standard deviation = 1**.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X is your feature matrix (numpy array or DataFrame)
```

---

### 2. **Normalization (MinMaxScaler)**

Scales data to a fixed range, usually **0 to 1**.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```

---

### How it works:

* `.fit_transform()` computes scaling parameters (like mean, std, min, max) from your data and applies the scaling.
* For new data, use `.transform()` only, to apply the same scaling learned.
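
The difference between the two calls can be sketched as follows, using made-up single-feature data. The scaler learns its statistics from `X_train` and reuses them for `X_new`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_new = np.array([[40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from X_train
X_new_scaled = scaler.transform(X_new)          # reuses the training mean and std

print(scaler.mean_)   # [20.]
print(X_new_scaled)
```

Calling `fit_transform()` on the new data instead would recompute the statistics. The new values would then no longer be on the same scale as the training data.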

---

### Example with sample data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[10, 200],
              [15, 300],
              [20, 400]])

# Standardization
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X)

# Normalization
scaler_minmax = MinMaxScaler()
X_norm = scaler_minmax.fit_transform(X)

print("Standardized data:\n", X_std)
print("Normalized data:\n", X_norm)
```


23. What is sklearn.preprocessing?


Ans:- **`sklearn.preprocessing`** is a module in **scikit-learn** that provides tools to **prepare and transform your data** before training machine learning models.

---

### What does it offer?

* **Scaling:** Standardize or normalize features (e.g., `StandardScaler`, `MinMaxScaler`)
* **Encoding:** Convert categorical data to numeric (e.g., `OneHotEncoder`, `LabelEncoder`)
* **Binarizing:** Convert data to binary values (e.g., `Binarizer`)
* **Generating polynomial features:** (e.g., `PolynomialFeatures`)
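* **Normalization utilities:** Rescale individual samples to unit norm (e.g., `Normalizer`). Missing-value imputation lives in the separate `sklearn.impute` module.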

---

### Why use it?

* Many machine learning algorithms require data in a certain format or scale.
* Preprocessing improves model performance and training speed.

---

### Example usage:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```



24. How do we split data for model fitting (training and testing) in Python?

Ans:- You can split data for training and testing in Python using **scikit-learn’s** `train_test_split` function.

Here’s how:

```python
from sklearn.model_selection import train_test_split

# X = features, y = target labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

* `test_size=0.2` means 20% of data goes to testing, 80% to training.
* `random_state=42` makes the split reproducible.

After this, you train your model on `X_train, y_train` and evaluate on `X_test, y_test`.



25. Explain data encoding?

Ans:-

**Data encoding** is the process of converting **categorical variables** (non-numeric data) into a **numeric format** so that machine learning algorithms can work with them.

---

### Why encode data?

* Most ML algorithms only understand numbers, not categories like "Red," "Blue," or "Male," "Female."
* Encoding transforms categories into numbers without losing their meaning.

---

### Common Encoding Techniques:

| Technique            | Description                                 | Example                          |
| -------------------- | ------------------------------------------- | -------------------------------- |
| **Label Encoding**   | Assigns each category a unique integer      | Red → 0, Blue → 1, Green → 2     |
| **One-Hot Encoding** | Creates binary columns for each category    | Red → \[1,0,0], Blue → \[0,1,0]  |
| **Ordinal Encoding** | Similar to label encoding but assumes order | Small → 0, Medium → 1, Large → 2 |

---

### Example: One-Hot Encoding with pandas

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

encoded = pd.get_dummies(df['Color'], dtype=int)  # dtype=int gives 0/1; pandas 2.x defaults to True/False
print(encoded)
```

Output:

```
   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
```

