## **Assignment Questions**

1. What is a parameter?

2. What is correlation?

3. What does negative correlation mean?

4. Define Machine Learning. What are the main components in Machine Learning?

5. How does loss value help in determining whether the model is good or not?

6. What are continuous and categorical variables?

7. How do we handle categorical variables in Machine Learning? What are the common techniques?

8. What do you mean by training and testing a dataset?

9. What is sklearn.preprocessing?

10. What is a test set?

11. How do we split data for model fitting (training and testing) in Python?

12. How do you approach a Machine Learning problem?

13. Why do we have to perform EDA before fitting a model to the data?

14. How can you find correlation between variables in Python?

15. What is causation? Explain the difference between correlation and causation with an example.

16. What is an optimizer? What are different types of optimizers? Explain each with an example.

17. What is sklearn.linear_model?

18. What does model.fit() do? What arguments must be given?

19. What does model.predict() do? What arguments must be given?

20. What is feature scaling? How does it help in Machine Learning?

21. How do we perform scaling in Python?

22. Explain data encoding.

## **Q1. What is a parameter?**
A parameter is a numerical value inside a model that the algorithm learns during training.  
Example: In linear regression `y = mx + c`, **m** and **c** are parameters.
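
As a minimal sketch of what "learned during training" means, the snippet below fits scikit-learn's `LinearRegression` to a few made-up points that lie exactly on `y = 2x + 1`. The data values are illustrative assumptions; after fitting, the learned slope and intercept are available as `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data generated from y = 2x + 1 (assumed values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

m = model.coef_[0]      # learned slope
c = model.intercept_    # learned intercept
print(f"m = {m:.2f}, c = {c:.2f}")
```

The values of `m` and `c` are not set by hand; the algorithm estimates them from the data.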


## **Q2. What is correlation?**
Correlation measures how strongly, and in which direction, two variables are linearly related. It is usually expressed as a coefficient between -1 and +1.

## **Q3. What does negative correlation mean?**
Negative correlation means the two variables move in opposite directions:
- One variable increases (↑)  
- The other variable decreases (↓)
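
A small sketch of a negative correlation, using NumPy's `corrcoef`. The temperature and heating-cost numbers are invented for illustration only.

```python
import numpy as np

# Illustrative values (assumed): as temperature rises, heating cost falls
temperature = np.array([5, 10, 15, 20, 25])
heating_cost = np.array([200, 160, 130, 90, 60])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r = np.corrcoef(temperature, heating_cost)[0, 1]
print(f"r = {r:.3f}")  # a value near -1 indicates a strong negative correlation
```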

## **Q4. Define Machine Learning. What are the main components in Machine Learning?**
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions **without being explicitly programmed**.

In simple terms:

**Machine Learning = Learning patterns from data + Making predictions**

---

## **Main Components in Machine Learning**

### **1. Data**
The raw information used by the model to learn.  
Examples: images, numbers, text, sensor data.

### **2. Model**
A mathematical function that learns patterns from data.  
Examples: Linear Regression, Decision Trees, Neural Networks.

### **3. Features**
The input variables used to make predictions.  
Example: For house price prediction → size, rooms, location.

### **4. Labels / Target**
The actual output values the model tries to learn.  
Example: House price = ₹50,00,000.

### **5. Training**
The process where the model learns from the data by adjusting parameters.

### **6. Loss Function**
A metric that shows how wrong the model's predictions are.  
Examples: MSE, Cross-Entropy.

### **7. Optimization Algorithm**
Used to reduce the loss function and improve learning.  
Example: Gradient Descent.

### **8. Evaluation**
Checking model performance on unseen (test) data.  
Metrics: Accuracy, Precision, Recall, RMSE.

### **9. Prediction / Inference**
Using the trained model to predict outcomes on new data.

## **Q5. How does loss value help in determining whether the model is good or not?**

The **loss value** tells us **how far the model's predictions are from the actual target values**.

- A **low loss value** means the model is performing well.
- A **high loss value** means the model is performing poorly.

During training:
- The model tries to **minimize the loss** using algorithms like Gradient Descent.
- If the loss is **consistently decreasing**, the model is learning properly.
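- If the loss is **not decreasing**, the model is not improving (for example, it is underfitting or the learning rate is poorly chosen).
- If the training loss keeps falling while the loss on validation data rises, the model is **overfitting**.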

**In simple terms:**  
Loss value is the **score of the model's mistakes**.  
**Lower score = Better model.**
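
To make the idea concrete, here is a minimal sketch that computes Mean Squared Error (MSE) for two hypothetical sets of predictions. The target values and both "models" are assumptions chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
preds_a = [2.5, 5.0, 7.5]   # hypothetical model A: close to the targets
preds_b = [1.0, 8.0, 4.0]   # hypothetical model B: far from the targets

loss_a = mse(y_true, preds_a)
loss_b = mse(y_true, preds_b)
print(f"Model A loss: {loss_a:.3f}, Model B loss: {loss_b:.3f}")
```

Model A has the lower loss, so on this data its predictions are the better ones.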


## **Q6. What are continuous and categorical variables?**
- **Continuous variables:** Numeric values that can take any value within a range (salary, height, temperature)  
- **Categorical variables:** Values that fall into a fixed set of labels or groups (gender, city, color)


## **Q7. How do we handle categorical variables in Machine Learning? What are the common techniques?**

Categorical variables contain **text labels** instead of numbers (e.g., "Male", "Female", "Red", "Blue").  
Machine Learning models **cannot use text directly**, so we convert them into numbers.  
This process is known as **Encoding**.

### ✅ Common Techniques to Handle Categorical Variables:

#### 1. **Label Encoding**
- Converts categories into numbers like 0, 1, 2, ...
- Example:  
  `["Red", "Blue", "Green"] → [0, 1, 2]`
- Works well for **ordinal** data.

#### 2. **One-Hot Encoding**
- Creates new columns with 0/1 values.
- Example:  
  "Color" → `Color_Red`, `Color_Blue`, `Color_Green`
- Preferred for **nominal** data.

#### 3. **Ordinal Encoding**
- Used when categories have a **natural order**.
- Example:  
  `["Low", "Medium", "High"] → [1, 2, 3]`

#### 4. **Target Encoding**
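- Replaces each category with the **average target value** for that category.
- Useful for **high-cardinality** features and often paired with **tree-based models**; the averages should be computed on training data only to avoid leaking the target.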

#### 5. **Frequency Encoding**
- Replace categories with how frequently they appear.
- Example:  
  `{"Red": 50, "Blue": 30, "Green": 20}`

---

### **In short:**
We convert text categories into numeric values so ML models can understand them.  
Common methods: **Label Encoding, One-Hot Encoding, Ordinal Encoding, Target Encoding, Frequency Encoding.**
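
The sketch below shows three of these techniques on a tiny made-up DataFrame (the column names and values are assumptions). It uses `pandas.get_dummies` for one-hot encoding, scikit-learn's `OrdinalEncoder` with an explicit category order, and a simple `value_counts` mapping for frequency encoding. Note that `OrdinalEncoder` numbers categories from 0, so Low/Medium/High become 0/1/2 rather than 1/2/3.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data (assumed)
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue"],
    "Size": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding for the nominal column: one 0/1 column per category
one_hot = pd.get_dummies(df["Color"], prefix="Color")

# Ordinal encoding with an explicit order: Low < Medium < High
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_encoded = encoder.fit_transform(df[["Size"]]).ravel()

# Frequency encoding: replace each color with how often it appears
color_freq = df["Color"].map(df["Color"].value_counts())

print(one_hot)
print(size_encoded)
print(color_freq.tolist())
```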


## **Q8. What do you mean by training and testing a dataset?**

In Machine Learning, a dataset is usually divided into **two parts**:

---

## ✅ 1. Training Dataset
- This is the **data used to teach the model**.
- The model learns patterns, relationships, and rules from this data.
- Example:  
  If you're building a model to predict house prices, the training data includes houses and their prices.

---

## ✅ 2. Testing Dataset
- This data is **not shown to the model during training**.
- It is used to **check how well the model performs** on new, unseen data.
- Helps evaluate the model's accuracy, performance, and generalization.

---

### 📌 Why do we split the dataset?
If we train and test the model on the **same data**, the score will look overly optimistic, because the model may simply have memorized that data, and it can still fail on real-world inputs.

So we split the dataset (commonly 80% training, 20% testing) to ensure the model:

- **Learns** from training data  
- **Is evaluated** on testing data  

---

### ✔️ In simple words:
- **Training data = learn**  
- **Testing data = check performance**


## **Q9. What is sklearn.preprocessing?**
A module used for:
- scaling  
- normalization  
- encoding  
- feature transformations

## **Q10. What is a test set?**
The portion of the dataset used ONLY to evaluate the model; it is never used for training.


## **Q11. How do we split data for model fitting (training and testing) in Python?**

We usually use the `train_test_split()` function from **scikit-learn** to split a dataset into:
- **Training data**
- **Testing data**

This helps us train the model on one part and evaluate it on another unseen part.

---

### ✅ Example Code in Python (Google Colab)

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample dataset
data = {
    "Age": [22, 25, 47, 52, 46, 56, 23, 34],
    "Salary": [20000, 25000, 47000, 52000, 46000, 56000, 21000, 34000],
    "Buy": [0, 0, 1, 1, 1, 1, 0, 1]
}

df = pd.DataFrame(data)

# Splitting features (X) and target (y)
X = df[["Age", "Salary"]]
y = df["Buy"]

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Display output
X_train, X_test, y_train, y_test
```


## **Q12. How do you approach a Machine Learning problem?**

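1. Understand the problem
2. Collect data
3. Clean the data
4. Perform EDA (Exploratory Data Analysis)
5. Engineer features
6. Select a model
7. Train the model
8. Evaluate the model
9. Deploy the model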

## **Q13. Why do we have to perform EDA before fitting a model to the data?**

**EDA (Exploratory Data Analysis)** is the process of exploring and understanding the data *before* building a Machine Learning model.

It helps us identify important patterns, detect errors, and decide how to clean or prepare the data.

---

## ✅ Reasons Why EDA Is Important:

### **1. Understand the data**
- Check shapes, summary statistics, missing values, duplicates.
- Helps you know what type of model will fit best.

### **2. Detect outliers and noise**
- Outliers can negatively affect model performance.
- EDA helps visualize and handle them.

### **3. Identify missing values**
- Missing data usually has to be imputed or removed.
- Many models cannot handle missing values at all, and others perform poorly with them.

### **4. Check data distributions**
- Helps decide transformations (log transform, normalization, scaling).

### **5. Understand relationships between variables**
- Using correlation, scatter plots, heatmaps.
- Helps identify important features for the model.

### **6. Prevent incorrect assumptions**
- EDA ensures you don't blindly trust the raw dataset.

---

## ✔️ In simple words:
EDA tells you:
- **What your data looks like**
- **What problems exist in the data**
- **How to clean and prepare it**
- **Which kinds of models are likely to suit it**

**Without EDA, your model may perform badly or give wrong results.**
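
A first EDA pass can be sketched with a few pandas calls. The DataFrame below is invented for illustration and deliberately contains a missing value, a duplicated row, and a suspiciously large salary.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed) with a missing age, a duplicate row, and an outlier
df = pd.DataFrame({
    "Age": [22, 25, 47, np.nan, 46, 25],
    "Salary": [20000, 25000, 47000, 52000, 460000, 25000],
})

print(df.shape)        # number of rows and columns
print(df.describe())   # summary statistics; the max Salary hints at an outlier

missing = df.isna().sum()           # missing values per column
duplicates = df.duplicated().sum()  # number of fully duplicated rows
print(missing)
print(duplicates)

print(df.corr())       # pairwise correlation between the numeric columns
```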


## **Q14. How can you find correlation between variables in Python?**

Correlation tells us **how strongly two variables are related**.  
In Python, we usually calculate correlation using **Pandas**.

---

## ✅ 1. Using `df.corr()`
This gives the pairwise correlation matrix (Pearson by default) for the numeric columns.

### **Example Code**
```python
import pandas as pd

# Sample dataset
data = {
    "Age": [22, 25, 30, 35, 40],
    "Salary": [20000, 25000, 30000, 40000, 50000],
    "Experience": [1, 2, 3, 5, 7]
}

df = pd.DataFrame(data)

# Find correlation
correlation_matrix = df.corr()
correlation_matrix
```


## **Q15. What is causation? Explain the difference between correlation and causation with an example.**

---

**Causation** means **one variable directly causes a change in another variable**.  
In simple words:  
**A → causes → B**

Example:  
More hours of study **cause** higher marks.

---

## ✅ Correlation vs Causation

### **Correlation**
- Two variables **move together**.
- But one does **NOT necessarily cause** the other to change.
- It may be due to coincidence or a hidden factor.

### **Causation**
- One variable **directly affects** the other.
- There is a **cause-and-effect relationship**.

---

## ⭐ Example to Understand the Difference

### **Example 1: Correlation (No causation)**
Ice cream sales ↑  
Drowning cases ↑  

These two are correlated because both increase in summer.  
But **buying ice cream does NOT cause drowning**.

The hidden factor = **temperature (summer)**.

---

### **Example 2: Causation**
Amount of fertilizer ↑  
Crop yield ↑  

Here, adding fertilizer **directly causes** the crop yield to increase.

---

## ✔️ In simple words:
- **Correlation** = two things happen together.  
- **Causation** = one thing happens *because* of the other.
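
The ice cream example can be simulated as a rough sketch. All numbers below are synthetic assumptions: temperature drives both quantities, and neither influences the other. The two series come out strongly correlated, but once the effect of temperature is removed from each (by taking residuals of a straight-line fit), the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hidden factor (synthetic): temperature drives both variables
temperature = rng.uniform(10, 35, n)
ice_cream = 5.0 * temperature + rng.normal(0, 10, n)
drownings = 0.3 * temperature + rng.normal(0, 1, n)

# Raw correlation between ice cream sales and drownings
raw_corr = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the effect of temperature from both
partial_corr = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, after controlling for temperature: {partial_corr:.2f}")
```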


## **Q16. What is an Optimizer? What are different types of optimizers? Explain each with an example.**

---

An **optimizer** is an algorithm that adjusts the model's **weights** to **minimize the loss function** during training.

In simple words:  
**Optimizer = helps the model learn faster and better.**

---


### 1. **Gradient Descent**
- Calculates the gradient (slope) of the loss function.
- Updates weights in the opposite direction of the gradient.

Formula:  
`new_weight = old_weight - learning_rate * gradient`
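
The update rule above can be sketched as a small runnable loop. The loss function `(w - 3)^2`, the starting point, and the helper name `gradient_descent` are assumptions chosen for illustration; the minimum of this loss is at `w = 3`.

```python
def gradient_descent(grad_fn, start, learning_rate=0.1, steps=100):
    """Repeatedly apply: new_weight = old_weight - learning_rate * gradient."""
    weight = start
    for _ in range(steps):
        weight = weight - learning_rate * grad_fn(weight)
    return weight

# Illustrative loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
final_weight = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(f"{final_weight:.4f}")  # approaches 3, the minimum of the loss
```

A learning rate that is too large can make the updates overshoot, while one that is too small makes convergence slow.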

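#### Example:
```python
# Single update step (pseudo example with illustrative values)
weight, lr, gradient = 0.5, 0.1, 2.0
weight = weight - lr * gradient  # 0.5 - 0.1 * 2.0 = 0.3
```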


## **Q17. What is sklearn.linear_model?**

A scikit-learn module containing linear models such as:

- `LinearRegression`
- `LogisticRegression`
- `Ridge`
- `Lasso`
- `SGDRegressor`
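
As a brief sketch, the estimators in this module share the same `fit` interface. The example below fits `LinearRegression` and `Ridge` to the same made-up data (points on `y = 2x`, an assumption for illustration) and compares the learned slopes; `alpha=1.0` is simply Ridge's default penalty strength.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Illustrative data on the line y = 2x (assumed values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

linear = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha sets the L2 penalty strength

linear_slope = linear.coef_[0]
ridge_slope = ridge.coef_[0]
print(f"LinearRegression slope: {linear_slope:.3f}")
print(f"Ridge slope:            {ridge_slope:.3f}")  # shrunk toward zero by the penalty
```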

## **Q18. What does `model.fit()` do? What arguments must be given?**

---

`model.fit()` is the function used to **train the machine learning model**.

For an iteratively trained model such as a Keras neural network, it does the following:

1. **Feeds training data to the model**
2. **Calculates loss**
3. **Updates weights using the optimizer**
4. **Repeats the process for many epochs**
5. **Learns the best patterns from the data**

In simple words:  
`model.fit()` = **Train the model using X (inputs) and y (labels)**.

---

### **1. X (Features/Input Data)**
The data used to make predictions.  
Example: Age, Salary, Experience.

### **2. y (Target/Labels)**
The correct output values.  
Example: 0/1 for classification.

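### **3. epochs**
How many times the entire dataset is passed through the model. This is a Keras argument (default 1); scikit-learn's `fit()` does not take it.

### **4. batch_size**
How many samples are processed before updating weights. This is also Keras-specific (default 32). Strictly speaking, only **X** and **y** must be given.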

---

## 🔹 Optional (Common) Arguments

| Argument | Meaning |
|---------|---------|
| `validation_data` | Helps check accuracy on unseen data during training |
| `callbacks` | Early stopping, saving model, etc. |
| `verbose` | Controls how training output is displayed |
| `shuffle` | Whether to shuffle the data |
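
For comparison, scikit-learn estimators take only the training inputs and targets in `fit()`; there are no `epochs` or `batch_size` arguments. A minimal sketch with an assumed toy dataset:

```python
from sklearn.linear_model import LogisticRegression

# Toy data (assumed): one feature, binary target
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

model = LogisticRegression()
fitted = model.fit(X_train, y_train)   # only X and y are required

# fit() returns the estimator itself; learned attributes end with "_"
print(fitted is model)
print(model.coef_, model.intercept_)
```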

---

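## ✅ Example in Keras (Google Colab)

```python
# Assumes `model` is an already compiled Keras model and that
# X_train / y_train come from an earlier train/test split.
model.fit(
    X_train,               # features
    y_train,               # target
    epochs=10,             # number of passes over the data
    batch_size=32,         # samples per weight update
    validation_split=0.2,  # hold out 20% of the training data for validation
    shuffle=True           # shuffle the training data before each epoch
)
```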


## **Q19. What does `model.predict()` do? What arguments must be given?**

---

`model.predict()` is used **after training** to make predictions on new or unseen data.

It takes the input features **X** and returns the model's output.

In simple words:  
`model.predict()` = **Use the trained model to make predictions.**

---

1. Takes **input data**
2. Passes it through the trained model
3. Applies learned weights
4. Outputs the prediction  
   - Regression → numeric value  
   - Classification → class labels (scikit-learn `predict`) or, typically, probabilities (Keras `predict`, scikit-learn `predict_proba`)

---

### **1. X (Input data / features)**
The only required argument.

Example:
- A list of numbers  
- A numpy array  
- A DataFrame  
- A single sample or multiple samples

---

## 🔹 Optional Arguments

| Argument | Purpose |
|----------|---------|
| `batch_size` | Predict in batches for large datasets |
| `verbose` | Controls progress output |
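
In scikit-learn, `predict()` takes only the input features, and classifiers such as `LogisticRegression` also offer `predict_proba()` for probabilities. The sketch below uses an assumed toy dataset; note that the input is 2D, with one row per sample.

```python
from sklearn.linear_model import LogisticRegression

# Toy data (assumed): one feature, binary target
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]
model = LogisticRegression().fit(X_train, y_train)

X_new = [[0.5], [2.5]]                       # 2D: one row per sample
labels = model.predict(X_new)                # predicted class labels
probabilities = model.predict_proba(X_new)   # one probability per class

print(labels)
print(probabilities)
```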

---

## ✅ Example in Keras (Google Colab)

```python
# Predict using test data (assumes `model` has already been trained)
predictions = model.predict(X_test)

# Display the first five results
predictions[:5]
```


## **Q20. What is Feature Scaling? How does it help in Machine Learning?**

---

**Feature Scaling** is the process of transforming all numeric features to the **same scale**  
so that no variable dominates others just because of larger values.

Examples:
- Age → 21, 25, 40  
- Salary → 20,000; 50,000; 90,000  

Salary is much larger in magnitude, so the model may give it more importance.  
Feature scaling fixes this problem.

---

### ✔️ 1. Helps the model learn faster  
Algorithms like Gradient Descent converge **much faster** when features are on the same scale.

### ✔️ 2. Prevents dominance of large-value features  
Large numerical values do not overshadow smaller valued features after scaling.

### ✔️ 3. Essential for distance-based algorithms  
Models like:
- KNN  
- K-means  
- SVM  

use distances between points. Without scaling, the feature with the largest range dominates the distance, which distorts the results.

### ✔️ 4. Improves accuracy and stability  
Scaled features make training more stable and often improve model performance.
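
The effect on distance-based methods can be sketched with four made-up people (the ages, salaries, and the helper `nearest_to_first` are assumptions for illustration). Before scaling, the nearest neighbour of person A is decided almost entirely by salary; after Min-Max scaling, age counts as well and the answer changes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: Age, Salary (illustrative values)
people = np.array([
    [25, 50000],   # A (the query point)
    [60, 52000],   # B: similar salary, very different age
    [27, 58000],   # C: similar age, somewhat different salary
    [45, 90000],   # D
], dtype=float)

def nearest_to_first(points):
    """Index of the row closest (Euclidean distance) to row 0."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(people)
scaled_nearest = nearest_to_first(MinMaxScaler().fit_transform(people))

print(raw_nearest)     # index of A's nearest neighbour on raw data
print(scaled_nearest)  # index of A's nearest neighbour after scaling
```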

---

## ✅ Common Feature Scaling Techniques

### **1. Min-Max Scaling (Normalization)**  
Converts values to a range **0 to 1**.

Formula:  
`X_scaled = (X - X_min) / (X_max - X_min)`

Example:
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform([[10], [20], [30]])
scaled
```


## **Q21. How do we perform scaling in Python?**

We use preprocessing tools from **scikit-learn** such as:
- `MinMaxScaler` (Normalization)
- `StandardScaler` (Standardization)

These scalers transform numeric features into a similar range so the model can learn better.
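
As a minimal sketch of the `StandardScaler` option listed above, the snippet below standardizes two made-up columns (the values are assumptions) so that each has a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Age, Salary (illustrative values)
data = np.array([
    [20, 20000],
    [30, 40000],
    [40, 60000],
    [50, 80000],
], dtype=float)

scaler = StandardScaler()
standardized = scaler.fit_transform(data)   # (x - mean) / std, per column

print(standardized)
print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```

In a real workflow the scaler is usually fitted on the training data only and then applied to the test data with `scaler.transform(...)`, so that no information from the test set leaks into training.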

---

## ✅ 1. Min-Max Scaling (Normalization)
Scales values between **0 and 1**.

### **Example Code**
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = {"Age": [20, 30, 40, 50], "Salary": [20000, 40000, 60000, 80000]}
df = pd.DataFrame(data)

# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)

scaled_data
```


## **Q22. Explain data encoding.**

Data encoding is the process of converting categorical (non-numeric) data into a numeric format so that Machine Learning models can understand and use it. Most ML algorithms work only with numbers, so encoding is necessary when your dataset contains labels like colors, names, categories, or yes/no values. There are several common encoding techniques. Label Encoding converts each category into a unique integer (for example, Red = 0, Blue = 1, Green = 2). It is simple but may accidentally introduce an order where none exists. One-Hot Encoding avoids this by creating separate binary columns for each category (e.g., Color_Red, Color_Blue, Color_Green), making it suitable for nominal data with no order. Ordinal Encoding is used when the categories have a natural order such as Low < Medium < High. Encoding helps algorithms interpret categorical features correctly and improves the performance of machine learning models.