## **Assignment 2: Supervised Learning: Regression Models and Performance Metrics**

**Question 1:** What is Simple Linear Regression (SLR)? Explain its purpose.

**Answer:** **Simple Linear Regression (SLR)** is a **statistical method** used to model the relationship between two variables, one independent variable (X) and one dependent variable (Y), by fitting a straight line to the observed data.

🔹 **Definition**

In simple terms, SLR estimates how much the dependent variable $Y$ changes when the independent variable $X$ changes by one unit.

The mathematical form of the model is:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

where:

* $Y$ = dependent (response) variable
* $X$ = independent (predictor) variable
* $\beta_0$ = intercept (value of $Y$ when $X = 0$)
* $\beta_1$ = slope (rate of change of $Y$ with respect to $X$)
* $\epsilon$ = random error term (captures variability not explained by $X$)

🔹 **Purpose of Simple Linear Regression**

The main goals of SLR are:

1. **Prediction:**  
Estimate or predict the value of $Y$ for a given value of $X$.  
Example: Predicting a student's exam score based on study hours.

2. **Understanding relationships:**  
Determine whether and how strongly two variables are linearly related.  
Example: Analyzing whether advertising spending influences sales.

3. **Quantifying effect:**  
The slope $\beta_1$ quantifies how much $Y$ changes on average for a one-unit increase in $X$.

🔹 **Assumptions of SLR**

1. Linearity: The relationship between $X$ and $Y$ is linear.

2. Independence: Observations are independent.

3. Homoscedasticity: The variance of errors is constant.

4. Normality: The residuals (errors) are normally distributed.

🔹 **Example**

Suppose we study how **hours studied (X)** affect **exam scores (Y)**.
Data might show that each additional hour of study increases the score by 5 points, leading to the regression equation:

$$\text{Score} = 40 + 5 \times \text{Hours studied}$$

So, if a student studies for 6 hours, the predicted score is $40 + 5(6) = 70$.
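
The arithmetic in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `predict_score` is an illustrative choice, and the default intercept (40) and slope (5) are the hypothetical values from the example, not estimates from real data.

```python
def predict_score(hours_studied, intercept=40.0, slope=5.0):
    """Return the predicted exam score for a given number of study hours.

    The defaults are the illustrative coefficients from the example:
    Score = 40 + 5 * Hours studied.
    """
    return intercept + slope * hours_studied


# Predicted score for a student who studies 6 hours
print(predict_score(6))  # 70.0
```

Calling the function with a different number of hours shows the slope at work: each extra hour adds 5 points to the prediction.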


---

**Question 2:** What are the key assumptions of Simple Linear Regression?

**Answer:** The **key assumptions of Simple Linear Regression (SLR)** ensure that the model's estimates are valid, reliable, and interpretable. Violating these assumptions can lead to incorrect conclusions or biased predictions.

Here are the **five commonly listed assumptions** (the fifth applies only to multiple regression):

🔹 **1. Linearity**

- **Meaning:** The relationship between the independent variable $X$ and the dependent variable $Y$ must be linear.

- **In other words:** The change in $Y$ should be proportional to the change in $X$.

- **How to check:** Draw a scatter plot of $X$ vs. $Y$. The data points should roughly form a straight-line pattern.

🔹 **2. Independence of Errors**

- **Meaning:** The residuals (errors), that is, the differences between observed and predicted values, must be **independent** of each other.

- **Why it matters:** If errors are correlated (e.g., in time-series data), predictions and statistical tests become unreliable.

- **How to check:** Use the **Durbin-Watson test** (especially for time-series data). A statistic close to 2 suggests no first-order autocorrelation.
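
To make the Durbin-Watson check concrete, here is a minimal sketch that computes the statistic directly from its definition: the sum of squared differences between successive residuals divided by the sum of squared residuals. The statistic lies between 0 and 4; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation. The two residual series are made up for illustration, and the helper name `durbin_watson` is our own choice here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residual series (illustrative only)
trending = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]     # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours flip sign

print(round(durbin_watson(trending), 3))     # 0.286 -> positive autocorrelation
print(round(durbin_watson(alternating), 3))  # 3.333 -> negative autocorrelation
```

In practice a statistics library would also provide critical values for a formal test; this sketch only shows how the statistic itself behaves.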

🔹 **3. Homoscedasticity (Constant Variance of Errors)**

- **Meaning:** The variance of the residuals should be constant across all levels of $X$.

- **Why it matters:** If the spread of residuals increases or decreases with $X$ (called heteroscedasticity), it can distort standard errors and confidence intervals.

- **How to check:** Plot residuals vs. predicted values. They should form a random scatter (not a funnel shape).
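
When a plot is not convenient, a rough numeric stand-in is to compare the residual variance for low fitted values with the residual variance for high fitted values. The sketch below does exactly that. It is not a formal test, the helper `spread_ratio` is a hypothetical name, and the data are made up to show a funnel shape.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def spread_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by
    the residual variance in the lower half."""
    pairs = sorted(zip(fitted, residuals))  # order by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return variance(high) / variance(low)


# Made-up data: the residual spread grows with the fitted value
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]

ratio = spread_ratio(fitted, residuals)
print(f"variance ratio (high / low): {ratio:.1f}")
```

A ratio close to 1 is consistent with constant variance, while a ratio far from 1, as in this made-up data, hints at heteroscedasticity and is a reason to look at the residual plot more closely.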

🔹 **4. Normality of Errors**

- **Meaning:** The residuals should be **approximately normally distributed**.

- **Why it matters:** This assumption is important for hypothesis testing and constructing confidence intervals.

- **How to check:** Use a **histogram** or **Q-Q plot** of residuals. The histogram should look roughly bell-shaped, and the points on the Q-Q plot should fall close to a straight line.

🔹 **5. No (or minimal) Multicollinearity**

- In **simple** linear regression (only one independent variable), this isn't an issue.

- In **multiple** regression, it means that independent variables should not be highly correlated with each other.

---

**Question 3:**  Write the mathematical equation for a simple linear regression model and explain each term.

**Answer:** The **mathematical equation** for a **Simple Linear Regression (SLR)** model is:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

**Explanation of Each Term**

| **Term** | **Name** | **Meaning / Role** |
| --- | --- | --- |
| **Y** | Dependent Variable | The variable we want to **predict or explain**. <br>Example: Marks scored, house price, etc. |
| **X** | Independent Variable | The variable used to **predict Y**. <br>Example: Hours studied, area of the house, etc. |
| **β₀ (Beta-zero)** | Intercept | The **value of Y when X = 0**. <br>It represents the baseline or starting value. |
| **β₁ (Beta-one)** | Slope or Regression Coefficient | Shows how much **Y changes** for a **one-unit increase in X**. <br>If β₁ = 5, then for every 1-unit increase in X, Y increases by 5 units. |
| **ε (Epsilon)** | Error Term (estimated by the residual) | Represents the **difference** between the **actual** Y value and the value given by the line β₀ + β₁X. <br>It captures randomness or factors not included in the model. |


**Example**

Suppose we are predicting **marks (Y)** based on **hours studied (X)**:

$$Y = 40 + 5X + \epsilon$$

Here:

* **β₀ = 40** → Base marks if no hours are studied.
* **β₁ = 5** → For every extra hour studied, marks increase by 5.
* **ε** → Random error due to factors like mood, environment, or luck.
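
The role of the error term can be illustrated with a small simulation. The sketch below builds the systematic part 40 + 5X from the example and adds a random error to each value. The noise standard deviation (`NOISE_SD = 3.0`), the list of hours, and the random seed are assumptions made only for illustration.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

BETA_0 = 40.0   # intercept from the example
BETA_1 = 5.0    # slope from the example
NOISE_SD = 3.0  # assumed standard deviation of the error term (hypothetical)

hours = [1, 2, 3, 4, 5, 6]

# Systematic part of the model: beta_0 + beta_1 * X
expected_marks = [BETA_0 + BETA_1 * x for x in hours]

# Observed marks: systematic part plus a random error for each student
errors = [random.gauss(0.0, NOISE_SD) for _ in hours]
observed_marks = [m + e for m, e in zip(expected_marks, errors)]

for x, expected, observed in zip(hours, expected_marks, observed_marks):
    print(f"hours={x}  expected={expected:.1f}  observed={observed:.1f}")
```

The expected marks lie exactly on the line, while the observed marks scatter around it. That scatter is what ε represents.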

---

**Question 4:**  Provide a real-world example where simple linear regression can be applied.

**Answer:** Here's a clear real-world example of how Simple Linear Regression (SLR) can be applied.

🎯 **Example: Predicting House Prices Based on Size**

**Scenario:**

A real estate company wants to **predict the price of a house** based on its **size (in square feet)**.

**Variables:**

- **Dependent variable (Y):** House price (in ₹ or \$)

- **Independent variable (X):** Size of the house (in square feet)

**Regression Model:**

$$\text{Price} = \beta_0 + \beta_1 (\text{Size}) + \epsilon$$

**Where:**

* $\beta_0$: The predicted price of a house with size = 0 (intercept)
* $\beta_1$: The average increase in house price for every additional square foot
* $\epsilon$: Random error (captures factors like location, condition, or neighborhood quality)

---

**Question 5:** What is the method of least squares in linear regression?

**Answer:** The **method of least squares in linear regression** is a mathematical technique used to find the **best-fitting line** through a set of data points.

🔹 **Concept:**

It works by minimizing the **sum of the squares of the differences** (called **residuals**) between the **observed values** (actual data) and the **predicted values** (values given by the regression line).

🔹 **Formula:**

For a simple linear regression model:

$$y = a + bx$$

Where:

* $y$ = dependent variable (predicted value)
* $x$ = independent variable (input)
* $a$ = intercept
* $b$ = slope of the line

The method of least squares minimizes this function:

$$S = \sum (y_i - (a + bx_i))^2$$

Where:

* $y_i$ = actual (observed) value of $y$ for the $i$-th data point

* $a + bx_i$ = predicted value for the $i$-th data point

* $S$ = sum of squared residuals

🔹 **Goal:**

Find values of $a$ and $b$ that make $S$ as small as possible, meaning the line fits the data points as closely as possible.

🔹 **In simple terms:**

It's like drawing a line through a scatter plot so that the total squared vertical distance between the points and the line is as small as possible. The distances are squared so that positive and negative deviations do not cancel each other out and so that larger errors are penalized more heavily.
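
For a single predictor, minimizing $S$ has a well-known closed-form solution: $b = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $a = \bar{y} - b\bar{x}$. The sketch below applies these formulas to a small illustrative data set and checks that a slightly shifted line gives a larger $S$. The function names are our own, not part of any library.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) minimizing S = sum((y_i - (a + b * x_i)) ** 2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx           # slope
    a = y_mean - b * x_mean   # intercept
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """S for the line y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Small illustrative data set
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_fit(xs, ys)
s_fit = sum_squared_residuals(xs, ys, a, b)
s_shifted = sum_squared_residuals(xs, ys, a + 0.1, b)

print("intercept a:", round(a, 4))                # 2.2
print("slope b:", round(b, 4))                    # 0.6
print("S at fitted line:", round(s_fit, 4))       # 2.4
print("S at shifted line:", round(s_shifted, 4))  # 2.45
```

Any other choice of intercept or slope gives a larger sum of squared residuals than the fitted line, which is exactly what "least squares" means.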

---

**Question 6:** What is Logistic Regression? How does it differ from Linear Regression?

**Answer:** **Logistic Regression** is a **statistical method** used for **binary classification** problems, where the output (dependent variable) has **two possible** outcomes, such as:

* ✅ Yes / No
* ✅ Pass / Fail
* ✅ 1 / 0

Despite its name, **Logistic Regression is used for classification, not regression.**

It predicts the **probability** that a given input belongs to a particular class using the **logistic (sigmoid) function**, which outputs values between 0 and 1.

🔹 **The Logistic Function (Sigmoid Function):**


$$P (Y = 1|X) = \frac{1}{1 + e^{-(b_0 + b_1x)}}$$


Where:

* $P(Y = 1 \mid X)$ = probability that output is 1 (positive class)

* $b_0$ = intercept

* $b_1$ = coefficient (slope)

* $e$ = Euler's number (~2.718)

If $P > 0.5$, we classify the outcome as 1 (positive), otherwise 0 (negative).
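
The sigmoid formula and the 0.5 threshold rule can be sketched in plain Python as follows. The coefficients `B0 = -4.0` and `B1 = 1.0` are hypothetical values chosen for a pass/fail example based on hours studied. They are not fitted to any data, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, b0, b1):
    """P(Y = 1 | x) for a logistic regression model with one feature."""
    return sigmoid(b0 + b1 * x)


def classify(x, b0, b1, threshold=0.5):
    """Return 1 if the predicted probability exceeds the threshold, else 0."""
    return 1 if predict_probability(x, b0, b1) > threshold else 0


# Hypothetical coefficients for a pass/fail model based on hours studied
B0, B1 = -4.0, 1.0

for hours in [2, 6]:
    p = predict_probability(hours, B0, B1)
    print(f"hours={hours}  P(pass)={p:.3f}  class={classify(hours, B0, B1)}")
```

With these assumed coefficients, 2 hours of study gives a probability of about 0.119 (class 0) and 6 hours gives about 0.881 (class 1).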

🔹 **How Logistic Regression Differs from Linear Regression:**


| **Aspect** | **Linear Regression** | **Logistic Regression** |
| --- | --- | --- |
| **Purpose** | Predicts a continuous numeric value | Predicts a categorical outcome (usually binary) |
| **Output Range** | Any real number (−∞ to +∞) | Probability between 0 and 1 |
| **Equation Form** | $y = b_0 + b_1x$ | $P(Y=1 \mid x) = \frac{1}{1 + e^{-(b_0 + b_1x)}}$ |
| **Error Metric** | Mean Squared Error (MSE) | Log Loss or Cross-Entropy |
| **Relationship Modeled** | Linear relationship between x and y | Sigmoid (S-shaped) relationship between x and the predicted probability |
| **Use Case Examples** | Predicting house prices, sales, temperature, etc. | Predicting spam/not spam, disease/no disease, pass/fail, etc. |
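
The "Error Metric" row can be made concrete with a short sketch of log loss (binary cross-entropy), which averages $-[y \log p + (1 - y)\log(1 - p)]$ over all samples. The labels and probabilities below are made up, and the function assumes every probability lies strictly between 0 and 1.

```python
import math


def log_loss(y_true, p_pred):
    """Average binary cross-entropy for 0/1 labels and predicted
    probabilities (each probability must be strictly between 0 and 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and two sets of predicted probabilities
labels = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]  # confident and correct
bad_predictions = [0.1, 0.9, 0.2, 0.8]   # confident and wrong

print(round(log_loss(labels, good_predictions), 3))
print(round(log_loss(labels, bad_predictions), 3))
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is why log loss suits probability outputs better than MSE.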


---

**Question 7:** Name and briefly describe three common evaluation metrics for regression models.

**Answer:** Here are **three common evaluation metrics** used to measure the performance of **regression models.**

🔹 **1. Mean Absolute Error (MAE)**

**Definition:**
MAE measures the **average absolute difference** between the actual values and the predicted values.

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

**Meaning:**
It tells us, on average, how far the predictions are from the actual values, in the same units as the target variable.

**Example:**
If MAE = 5, the model's predictions are off by about 5 units on average.


🔹 **2. Mean Squared Error (MSE)**

**Definition:**
MSE measures the **average of the squared differences** between actual and predicted values.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Meaning:**
It penalizes **larger errors** more than smaller ones because errors are squared.
A lower MSE means better model performance. Because the errors are squared, MSE is expressed in squared units of the target variable.

🔹 **3. R-squared (Coefficient of Determination)**

**Definition:**
R² measures how much of the **variance in the dependent variable** is explained by the model.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

**Meaning:**

* $R^2 = 1$: Perfect fit

* $R^2 = 0$: Model explains none of the variance

It shows the **goodness of fit** of the regression model.
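
The three formulas above translate directly into code. The sketch below implements them in plain Python and evaluates them on a small illustrative set of actual and predicted values. The function names are our own, and libraries such as scikit-learn provide equivalent metrics.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Illustrative actual values and model predictions
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

print("MAE:", round(mean_absolute_error(actual, predicted), 2))  # 0.64
print("MSE:", round(mean_squared_error(actual, predicted), 2))   # 0.48
print("R^2:", round(r_squared(actual, predicted), 2))            # 0.6
```

MAE and MSE are error measures (lower is better), while R² is a goodness-of-fit measure (higher is better).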

---

**Question 8:** What is the purpose of the R-squared metric in regression analysis?

**Answer:** **Purpose of R-squared Metric in Regression Analysis**

**R-squared (R²)**, also known as the **coefficient of determination**, measures how well the **regression model explains the variability** of the dependent variable (**Y**) based on the independent variable(s) (**X**).

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

**Key Points**

* **R² value typically ranges from 0 to 1:**
  * **0** → Model explains none of the variation in Y.
  * **1** → Model perfectly explains all the variation in Y.
  * It can be negative when a model predicts worse than simply using the mean of Y, for example on new data.
* It represents the **proportion of variance** in the dependent variable that is **explained by the independent variable(s)**.
* Higher **R²** indicates a **better fit** to the data being evaluated.

**Example:**

If a regression model has **R² = 0.85**, it means:

85% of the variation in the dependent variable is explained by the model, and the remaining 15% is due to other unexplained factors (errors or noise).
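
The "explained" and "unexplained" shares can be computed by splitting the total variation in Y into two parts. The sketch below assumes a small illustrative data set and uses the least-squares line for those five points (y = 2.2 + 0.6x), for which the split is exact. For predictions from other kinds of models the two parts need not add up to the total.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Predictions from the least-squares line for these points
y_hat = [2.2 + 0.6 * x for x in xs]
y_mean = sum(ys) / len(ys)

# Total variation in Y around its mean
ss_total = sum((y - y_mean) ** 2 for y in ys)
# Variation left unexplained by the line (squared residuals)
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, y_hat))
# Variation captured by the line (fitted values around the mean)
ss_explained = sum((p - y_mean) ** 2 for p in y_hat)

r2 = 1 - ss_residual / ss_total

print(f"explained share:   {ss_explained / ss_total:.2f}")  # 0.60
print(f"unexplained share: {ss_residual / ss_total:.2f}")   # 0.40
print(f"R^2:               {r2:.2f}")                       # 0.60
```

Here R² = 0.60, so 60% of the variation in Y is explained by the line and the remaining 40% is left in the residuals.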

---


In [1]:
'''Question 9. Write Python code to fit a simple linear regression model using scikit-learn
and print the slope and intercept.'''


# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
# X = independent variable (reshape required for sklearn)
# Y = dependent variable
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
Y = np.array([2, 4, 5, 4, 5])

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, Y)

# Print the slope (coefficient) and intercept
print("Slope (β₁):", model.coef_[0])
print("Intercept (β₀):", model.intercept_)

# Optional: print predicted values
Y_pred = model.predict(X)
print("\nPredicted values:", Y_pred)



Slope (β₁): 0.6
Intercept (β₀): 2.2

Predicted values: [2.8 3.4 4.  4.6 5.2]


**Question 10:** How do you interpret the coefficients in a simple linear regression model?

**Answer:** In a **simple linear regression model**, the relationship between a dependent variable $y$ and an independent variable $x$ is expressed as:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Here's how to interpret the **coefficients:**

**1. Intercept** $(\beta_0)$

* Also called the **constant term**.

* It represents the **predicted value** of $y$ when $x = 0$.

* In other words, it's the starting point or baseline of the regression line on the y-axis.

* Example: If $\beta_0 = 5$, it means that when $x = 0$, the predicted value of $y$ is 5.

**2. Slope** $(\beta_1)$

* Represents the **change in** $y$ for a **one-unit increase in** $x$.

* It shows the **direction and size** of the relationship:

  * If $\beta_1 > 0$: $y$ increases as $x$ increases (positive relationship).

  * If $\beta_1 < 0$: $y$ decreases as $x$ increases (negative relationship).

Example: If $\beta_1 = 2$, it means for every one-unit increase in $x$, $y$ is predicted to increase by 2 units.
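
Both interpretations can be checked numerically. The sketch below uses the illustrative values from the examples (an intercept of 5 and a slope of 2) and also tries a negative slope. The function name `predict` is simply a label chosen for this example.

```python
def predict(x, beta_0, beta_1):
    """Predicted y on the line y = beta_0 + beta_1 * x."""
    return beta_0 + beta_1 * x


BETA_0 = 5.0  # illustrative intercept
BETA_1 = 2.0  # illustrative slope

# Intercept: the prediction when x = 0
print(predict(0, BETA_0, BETA_1))  # 5.0

# Slope: change in the prediction for a one-unit increase in x
print(predict(4, BETA_0, BETA_1) - predict(3, BETA_0, BETA_1))  # 2.0

# A negative slope makes the prediction fall as x increases
print(predict(4, BETA_0, -2.0) - predict(3, BETA_0, -2.0))  # -2.0
```

The one-unit difference is the same whichever starting value of x is used, because the model is a straight line.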
