
# Module 10.5: Practice Sheet of Module 9 and 10

**Topic:** Data Preprocessing, Feature Engineering, Linear and Logistic Regression

This notebook contains short practice questions for:

- **Module 09:** Data Preprocessing and Feature Engineering Part 2
- **Module 10:** Linear Regression and Logistic Regression

Write your answers in the provided markdown and code cells. You can duplicate this notebook for multiple attempts.

## Module 09: Data Preprocessing and Feature Engineering Part 2

Topics:
- Outlier detection and handling
- Feature transformation (polynomial, binning)
- Feature construction and domain-driven feature creation

### Q1. Outlier Detection and Handling

You collected daily website traffic data:

```python
traffic = [120, 135, 140, 125, 138, 142, 130, 900]
```

1. Detect the outlier using the **IQR method**. You can either show calculations or explain the idea.
2. Suggest **one** appropriate way to handle this outlier and justify it in one line.
3. Give an example of a real-life situation where this outlier should **not** be removed.

In [1]:
# Optional: calculate IQR to detect the outlier
import numpy as np

traffic = np.array([120, 135, 140, 125, 138, 142, 130, 900])
traffic

array([120, 135, 140, 125, 138, 142, 130, 900])

In [15]:
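import pandas as pd  # required for pd.DataFrame below

# Reuses the `traffic` array from the previous cell
df = pd.DataFrame(traffic, columns=["Traffic"])

# Quartiles and interquartile range
Q1 = df.quantile(0.25)
print("Q1:")
print(Q1)
Q3 = df.quantile(0.75)
print("Q3:")
print(Q3)
IQR = Q3 - Q1
print("IQR:")
print(IQR)

# Fences: values outside [low, high] are flagged as outliers
low = Q1 - 1.5 * IQR
print("Low:")
print(low)
high = Q3 + 1.5 * IQR
print("High:")
print(high)

# Boolean mask keeps the frame's shape, so non-outliers show as NaN
outliers = df[(df < low) | (df > high)]
print("Outliers:")
print(outliers)

# Handle the outlier by capping (clipping) values to the fences
df_copy = df.copy()
df_copy["Traffic"] = df_copy["Traffic"].clip(lower=low["Traffic"], upper=high["Traffic"])
df_copy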

Q1:
Traffic    128.75
Name: 0.25, dtype: float64
Q3:
Traffic    140.5
Name: 0.75, dtype: float64
IQR:
Traffic    11.75
dtype: float64
Low:
Traffic    111.125
dtype: float64
High:
Traffic    158.125
dtype: float64
Outliers:
   Traffic
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7    900.0


   Traffic
0  120.000
1  135.000
2  140.000
3  125.000
4  138.000
5  142.000
6  130.000
7  158.125


Use the code cell above if you want to quickly check quartiles and IQR. Explain your reasoning below.

**Your answer for Q1:**

1. The IQR method ...

### Q2. Polynomial Transformation

You are predicting house prices using the number of rooms. The scatter plot shows a clear **curved** relationship.

1. Explain why adding polynomial features such as `rooms**2` might improve the model.

Short conceptual answer only. No coding required.
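
Optional: the idea can be sketched in code. The block below is a minimal illustration on made-up data (the `rooms` and `price` values are hypothetical, not taken from the course), assuming scikit-learn's `PolynomialFeatures` is used to add the squared term. It compares R squared for a straight-line fit with a fit that also sees `rooms**2`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: price grows with the square of the room count
rooms = np.arange(1, 9).reshape(-1, 1)
price = 50 + 3 * rooms.ravel() ** 2

# Straight-line fit using rooms only
linear_fit = LinearRegression().fit(rooms, price)
r2_linear = linear_fit.score(rooms, price)

# Add rooms**2 as a second column, then fit the same linear model
rooms_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(rooms)
poly_fit = LinearRegression().fit(rooms_poly, price)
r2_poly = poly_fit.score(rooms_poly, price)

print("R squared (rooms only):", round(r2_linear, 3))
print("R squared (rooms and rooms**2):", round(r2_poly, 3))
```

The model is still linear in its coefficients; only the input has gained a curved column.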

**Your answer for Q2:**

### Q3. Binning or Discretization

A continuous variable is given:

```python
ages = [18, 20, 45, 67, 72, 23]
```

1. Create **three bins** such as `young`, `middle`, `old` and assign each age to a bin.
2. State **one benefit** and **one drawback** of using binning in a machine learning model.

In [2]:
# Optional: try binning using pandas
import pandas as pd

ages = [18, 20, 45, 67, 72, 23]
ages_series = pd.Series(ages, name="Age")
ages_series


   Age
0   18
1   20
2   45
3   67
4   72
5   23


In [19]:
ages_df = pd.DataFrame(ages_series)
ages_df['Age_bin'] = pd.cut(ages_df['Age'], bins=[0,30,55,80], labels=['young','middle','old'])
print(ages_df[['Age','Age_bin']])

   Age Age_bin
0   18   young
1   20   young
2   45  middle
3   67     old
4   72     old
5   23   young


**Your answer for Q3 (bins, benefit, drawback):**

### Q4. Domain-Driven Feature Construction

A food delivery dataset includes the following features:

- `distance_km`
- `order_time`
- `delivery_time`

Your task:

1. Propose **two new features** that might help predict **delivery delay**.
2. For each new feature, give **one sentence** explaining why it can be useful.

Hint: think about duration, rush hour, peak time and so on.
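
Optional: one way to try out ideas from the hint with pandas. This is only a sketch. The `orders` table, its timestamps, and the rush-hour windows (12:00 to 13:59 and 18:00 to 20:59) are assumptions made up for illustration, and `duration_min` and `is_rush_hour` are example names, not required answers.

```python
import pandas as pd

# Hypothetical orders; timestamps and distances are made up for illustration
orders = pd.DataFrame({
    "distance_km": [2.0, 5.5, 1.2],
    "order_time": pd.to_datetime(
        ["2024-01-05 12:10", "2024-01-05 15:00", "2024-01-05 19:45"]
    ),
    "delivery_time": pd.to_datetime(
        ["2024-01-05 12:40", "2024-01-05 15:35", "2024-01-05 20:30"]
    ),
})

# Time between order and delivery, in minutes
orders["duration_min"] = (
    orders["delivery_time"] - orders["order_time"]
).dt.total_seconds() / 60

# Flag orders placed during the assumed rush-hour windows
hour = orders["order_time"].dt.hour
orders["is_rush_hour"] = hour.between(12, 13) | hour.between(18, 20)

print(orders[["distance_km", "duration_min", "is_rush_hour"]])
```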

**Your answer for Q4:**

## Module 10: Linear Regression and Logistic Regression

Topics:
- Concept of regression and line fitting
- Cost function, gradient descent and optimization
- Model evaluation metrics: R squared, MAE, RMSE
- Assumptions and limitations of linear regression
- Transition from regression to classification with the sigmoid function

### Q5. Concept of Regression and Line Fitting

You are predicting exam scores based on hours studied.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]
```

1. In your own words, describe what **line fitting** means in linear regression.
2. What does the **slope** of the line represent in this context, in plain language?

In [3]:
# Optional: fit a simple linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
scores = np.array([50, 55, 65, 70, 80])

model = LinearRegression()
model.fit(hours, scores)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

Slope: 7.500000000000001
Intercept: 41.5


**Your answer for Q5:**

### Q6. Cost Function and Gradient Descent

Answer in simple language:

1. What does the **cost function** such as mean squared error measure in a regression model?
2. What does **gradient descent** do to the value of the cost function step by step?
3. Why is using a very **large learning rate** risky when running gradient descent?
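
Optional: a minimal gradient descent sketch to experiment with before answering. It reuses the hours and scores data from Q5 and minimizes mean squared error for a line `w * hours + b`. The helper names (`mse`, `run_gradient_descent`) and the two learning rates (`0.01` and `0.1`) are assumptions chosen for this illustration.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([50, 55, 65, 70, 80], dtype=float)

def mse(w, b):
    """Mean squared error of the line w * hours + b."""
    return np.mean((w * hours + b - scores) ** 2)

def run_gradient_descent(learning_rate, steps):
    """Return final w, b and the cost recorded after every step."""
    w, b = 0.0, 0.0  # start from a flat line at zero
    costs = [mse(w, b)]
    for _ in range(steps):
        error = w * hours + b - scores
        # Partial derivatives of MSE with respect to w and b
        grad_w = 2 * np.mean(error * hours)
        grad_b = 2 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        costs.append(mse(w, b))
    return w, b, costs

# Small learning rate: many small steps
w, b, costs = run_gradient_descent(learning_rate=0.01, steps=5000)
print("Slope:", w, "Intercept:", b)
print("Cost at start:", costs[0], "Cost at end:", costs[-1])

# Larger learning rate on the same data, only 20 steps
_, _, costs_big = run_gradient_descent(learning_rate=0.1, steps=20)
print("Cost with learning rate 0.1:", costs_big[0], "->", costs_big[-1])
```

Compare how the recorded cost changes from the first step to the last step in each run.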

**Your answer for Q6:**

### Q7. Regression Metrics Interpretation

A regression model produced the following metrics:

- R squared = 0.75
- MAE = 4.2
- RMSE = 7.6

1. Explain what each of these numbers tells you about the model.
2. Which metric **penalizes big errors more**, and why?
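
Optional: a small experiment for the second part. The two error arrays below are hypothetical and chosen so that both have the same average absolute error. The helpers `mae_of` and `rmse_of` are illustrative names written with plain NumPy.

```python
import numpy as np

# Hypothetical prediction errors from two models (made up for illustration)
errors_even = np.array([2.0, 2.0, 2.0, 2.0])   # four medium errors
errors_spiky = np.array([0.0, 0.0, 0.0, 8.0])  # one large error

def mae_of(errors):
    """Mean absolute error."""
    return np.mean(np.abs(errors))

def rmse_of(errors):
    """Root mean squared error."""
    return np.sqrt(np.mean(errors ** 2))

for name, errors in [("even", errors_even), ("spiky", errors_spiky)]:
    print(name, "MAE:", mae_of(errors), "RMSE:", rmse_of(errors))
```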


In [23]:
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# True and predicted values
true_vals = np.array([10, 12, 15, 20])
pred_vals = np.array([11, 13, 14, 19])

# Evaluation metrics
r2 = r2_score(true_vals, pred_vals)
mae = mean_absolute_error(true_vals, pred_vals)
rmse = np.sqrt(mean_squared_error(true_vals, pred_vals))  # square root converts MSE to RMSE

# Print results
print("R squared:", r2)
print("MAE:", mae)
print("RMSE:", rmse)

# Short interpretation
# R² ≈ 0.93 → model explains about 93% of the variance in the true values
# MAE = 1.0 → predictions are off by 1 unit on average
# RMSE = 1.0 → equal to MAE, so every error has the same size and none is large


R squared: 0.9295154185022027
MAE: 1.0
RMSE: 1.0


**Your answer for Q7:**

### Q8. Assumptions and Residual Patterns

A residual plot for a linear regression model shows two things:

- A clear **curved** pattern in the residuals.
- The spread of residuals becomes larger for bigger values of `x`.

1. Name **two** linear regression assumptions that are probably being violated.
2. For each assumption, explain in **one sentence** why this is a problem for the model.
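
Optional: a sketch of how a curved residual pattern can arise. It fits a straight line to made-up data that actually follows `x**2` and prints the residuals. The data are hypothetical, `np.polyfit` is just one convenient way to fit the line, and the example covers only the first symptom (curvature), not the growing spread.

```python
import numpy as np

# Hypothetical data with a curved (quadratic) relationship
x = np.arange(1, 10, dtype=float)
y = x ** 2

# Fit a straight line; polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Look at how the sign of the residuals changes as x increases
for xi, ri in zip(x, residuals):
    print(f"x={xi:.0f}  residual={ri:+.2f}")
```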

**Your answer for Q8:**

### Q9. From Linear Regression to Logistic Regression

You want to predict whether a customer will buy a product, where `0` means No and `1` means Yes.

1. Why is **linear regression** not a good choice for this classification problem?
2. What role does the **sigmoid function** play in logistic regression?
3. If the sigmoid output is `0.81`, what is the predicted class when the decision threshold is `0.5`?
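
Optional: a minimal sketch of the sigmoid and a decision threshold. The `z_values` array stands in for the linear score that logistic regression computes before applying the sigmoid. Its values are made up, and treating a probability exactly equal to the threshold as class `1` is an assumption of this sketch.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical linear scores (made up for illustration)
z_values = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z_values)

# Convert probabilities to classes with a 0.5 threshold
threshold = 0.5
classes = (probs >= threshold).astype(int)

print("Probabilities:", np.round(probs, 3))
print("Classes:", classes)
```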

**Your answer for Q9:**

### Q10. Decision Threshold and Trade-offs

A hospital uses a logistic regression model to detect a risky health condition. The current decision threshold is **0.5**.

1. If the hospital wants to **reduce false negatives**, that is, avoid missing patients who truly have the condition, should the threshold go **up** or **down**?
2. Explain your answer in **one sentence**.
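
Optional: a sketch for exploring the trade-off. The labels and predicted probabilities below are hypothetical, and `count_errors` is an illustrative helper, not part of any library. It counts false negatives and false positives at a few thresholds.

```python
import numpy as np

# Hypothetical true labels (1 = has the condition) and model probabilities
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.6, 0.35, 0.4, 0.2, 0.45, 0.55, 0.1])

def count_errors(threshold):
    """Return (false negatives, false positives) at the given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    false_neg = int(np.sum((y_true == 1) & (y_pred == 0)))
    false_pos = int(np.sum((y_true == 0) & (y_pred == 1)))
    return false_neg, false_pos

for threshold in (0.3, 0.5, 0.7):
    fn, fp = count_errors(threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```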

**Your answer for Q10:**

---

✅ You have reached the end of the practice sheet.

You can now:
- Review your answers.
- Run the optional code cells.