# Supervised Learning: Regression Models and Performance Metrics

### Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

### **Simple Linear Regression (SLR)**

**Definition:**
Simple Linear Regression is a **statistical method** used to study the relationship between **two variables**:

* one **independent variable (X)** and
* one **dependent variable (Y)**.

It does this by fitting a **straight line (best fit line)** through the data points.

This line is called the **regression line** and is represented by the **equation:**

$$
Y = a + bX
$$

Where:

* $Y$: Dependent variable (output/predicted value)
* $X$: Independent variable (input/predictor)
* $a$: Intercept (value of Y when X = 0)
* $b$: Slope (rate of change in Y for each unit change in X)



### **Purpose of Simple Linear Regression**

1. **Prediction:**
   To predict the value of one variable (Y) based on the known value of another (X).

   > Example: Predicting a student’s score (Y) from study hours (X).

2. **Understanding Relationships:**
   To determine how strongly and in what direction (positive or negative) X influences Y.

   > Example: Higher advertising spending (X) leading to higher sales (Y).

3. **Trend Analysis:**
   To analyze trends over time, such as predicting future prices, growth, or demand.

4. **Quantifying Effects:**
   To measure the magnitude of the impact of one variable on another (through the slope $b$).





### Question 2: What are the key assumptions of Simple Linear Regression?

### **Key Assumptions of Simple Linear Regression (SLR)**

To ensure that the regression results are **valid and reliable**, certain assumptions must be satisfied. These are the **five key assumptions** of Simple Linear Regression:



### **1. Linearity**

* The relationship between the **independent variable (X)** and the **dependent variable (Y)** must be **linear**.
* This means each one-unit change in X is associated with the same change in Y, whatever the starting value of X.

 **Example:**
Each extra hour of study should add roughly the same number of marks, rather than marks growing exponentially or following a curve.



### **2. Independence of Errors**

* The residuals (errors) should be **independent** of each other.
* No correlation should exist between error terms.

 **Example:**
The error for one observation should not depend on the error for another observation. This assumption is most often violated in time series data, where consecutive errors tend to be correlated.


### **3. Homoscedasticity**

* The **variance of errors** should be **constant** across all levels of X.
* In other words, the spread of residuals should remain the same for all predicted values.

 **Violation:** When errors increase or decrease as X changes (heteroscedasticity).

 **Example:**
If the scatter of marks around the regression line is about the same for students who study 2 hours as for those who study 10 hours, the assumption holds.



### **4. Normality of Errors**

* The residuals (difference between observed and predicted Y) should follow a **normal distribution**.
* This is important for hypothesis testing and confidence intervals.

 **Example:**
A histogram or Q–Q plot of residuals should look approximately normal (bell-shaped).

### **5. No (or minimal) Multicollinearity**

* This is an assumption of **multiple** regression rather than SLR: it requires that the independent variables are not highly correlated with one another.
* In **simple** linear regression there is only **one independent variable**, so multicollinearity cannot arise. It is listed here for completeness.
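
Some of these assumptions can be checked numerically from the residuals of a fitted line. The following is a minimal sketch in plain Python using made-up study-hours data. The helper names (`fit_line`, `durbin_watson`, `variance`) and the half-split comparison of residual spread are illustrative choices, not a formal test procedure.

```python
# Hypothetical data: hours studied and marks scored (made up for illustration).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [36, 40, 46, 51, 54, 61, 65, 70]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest uncorrelated errors."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


a, b = fit_line(hours, marks)
residuals = [y - (a + b * x) for x, y in zip(hours, marks)]

# Independence: a Durbin-Watson value far from 2 hints at correlated errors.
dw = durbin_watson(residuals)

# Homoscedasticity: compare residual spread for high X against low X.
half = len(residuals) // 2
spread_ratio = variance(residuals[half:]) / variance(residuals[:half])

print(f"Durbin-Watson statistic: {dw:.2f}")
print(f"Residual variance ratio (upper half / lower half): {spread_ratio:.2f}")
```

With only eight points these numbers are rough indicators at best; in practice a residual plot and a Q-Q plot are usually inspected alongside them.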





### Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

### **Mathematical Equation of a Simple Linear Regression Model**

The general equation for **Simple Linear Regression (SLR)** is:

$$
Y = a + bX + e
$$

---

### **Explanation of Each Term:**

| **Term** | **Meaning**                    | **Description / Role**                                                                                                                       |
| -------- | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
| **Y**    | Dependent Variable             | The **outcome** or **response** we are trying to predict. <br>Example: Marks scored by a student.                                            |
| **X**    | Independent Variable           | The **predictor** or **input** variable used to explain changes in Y. <br>Example: Hours of study.                                           |
| **a**    | Intercept (Constant term)      | The value of Y when X = 0. <br>It represents the **starting point** of the regression line on the Y-axis.                                    |
| **b**    | Slope (Regression Coefficient) | Represents the **rate of change** in Y for every one-unit increase in X. <br>It shows how strongly X influences Y.                           |
| **e**    | Error term                     | The part of Y that the line $a + bX$ does not explain; it accounts for randomness or unexplained variation in the data. <br>Its estimate from a fitted model, the **residual**, is the difference between the observed and predicted values of Y. |

---

### **Example:**

Let’s say we are predicting **student marks (Y)** based on **hours studied (X):**

$$
Y = 25 + 4X + e
$$

**Interpretation:**

* **Intercept (a = 25):**
  If a student studies 0 hours, they are expected to score **25 marks**.
* **Slope (b = 4):**
  For every additional hour studied, marks increase by **4 points**.
* **Error term (e):**
  Represents random factors like mood, health, or exam difficulty that affect marks.
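
To see what the error term does in practice, the sketch below simulates marks from the example model Y = 25 + 4X + e and then re-estimates the intercept and slope from the simulated data. The noise level (standard deviation 3), the sample size, the seed, and the 0 to 10 range of study hours are all assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_A, TRUE_B = 25.0, 4.0   # intercept and slope from the example equation
NOISE_SD = 3.0               # assumed spread of the error term e
N = 200                      # assumed number of simulated students

# Simulate hours studied (0 to 10) and marks = a + b*X + e.
hours = [random.uniform(0, 10) for _ in range(N)]
marks = [TRUE_A + TRUE_B * x + random.gauss(0, NOISE_SD) for x in hours]

# Re-estimate the coefficients by least squares.
x_bar = sum(hours) / N
y_bar = sum(marks) / N
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks)) / sum(
    (x - x_bar) ** 2 for x in hours
)
intercept = y_bar - slope * x_bar

print(f"Estimated intercept: {intercept:.2f} (true value {TRUE_A})")
print(f"Estimated slope:     {slope:.2f} (true value {TRUE_B})")
```

The estimates land near 25 and 4 but not exactly on them, because every simulated mark includes a random error.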




### Question 4: Provide a real-world example where simple linear regression can be applied.

### **Real-World Example of Simple Linear Regression (SLR)**

#### **Example: Predicting House Prices Based on Size**


### **Scenario:**

A real estate agent wants to **predict the price of a house (Y)** based on its **size in square feet (X)**.
She collects data from several houses in a locality.

| **House Size (sq. ft)** | **House Price (₹ in lakhs)** |
| ----------------------- | ---------------------------- |
| 800                     | 40                           |
| 1000                    | 50                           |
| 1200                    | 60                           |
| 1500                    | 75                           |
| 1800                    | 90                           |


### **Applying Simple Linear Regression:**

We model the relationship between **house size (X)** and **price (Y)** using:

$$
Y = a + bX
$$

Fitting the table above by least squares gives the regression equation:

$$
Y = 0 + 0.05X
$$

### **Interpretation:**

* **Intercept (a = 0):**
  The fitted line passes through the origin, so the model assigns no base price that is independent of size. The intercept is an extrapolation to X = 0, far outside the observed sizes (800 to 1800 sq. ft), so it should not be read literally.

* **Slope (b = 0.05):**
  For every **1 sq. ft increase** in house size, the **price increases by ₹0.05 lakhs (₹5,000)**.

* **Prediction Example:**
  If a house is **2000 sq. ft**,
  $$
  Y = 0 + 0.05(2000) = \text{₹}100 \text{ lakhs}
  $$
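
The least-squares fit for the five houses in the table can be computed directly from the closed-form formulas. This is a sketch in plain Python with illustrative variable names. Because every price in the table is exactly 0.05 times the size, the fitted slope is 0.05 and the intercept is 0, up to floating-point rounding.

```python
# Data from the table: house size (sq. ft) and price (₹ lakhs).
sizes = [800, 1000, 1200, 1500, 1800]
prices = [40, 50, 60, 75, 90]

n = len(sizes)
mean_size = sum(sizes) / n
mean_price = sum(prices) / n

# Slope = co-variation of (size, price) divided by variation of size.
slope = sum((x - mean_size) * (y - mean_price) for x, y in zip(sizes, prices)) / sum(
    (x - mean_size) ** 2 for x in sizes
)
intercept = mean_price - slope * mean_size

# Predicted price for a 2000 sq. ft house.
predicted = intercept + slope * 2000

print(f"Slope: {slope:.3f} lakhs per sq. ft")
print(f"Intercept: {intercept:.3f} lakhs")
print(f"Predicted price for 2000 sq. ft: {predicted:.1f} lakhs")
```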


### **Purpose in Real Life:**

* Helps **buyers/sellers** estimate fair market prices.
* Aids **builders/investors** in planning and pricing strategies.
* Useful in **economic forecasting** and **valuation models**.



### Question 5: What is the method of least squares in linear regression?

### **Method of Least Squares in Linear Regression**



### **Definition:**

The **Method of Least Squares** is a **mathematical technique** used in linear regression to find the **best-fitting line** through a set of data points by **minimizing the sum of the squared errors (residuals)**.

In simple words —
It finds the line that makes the **difference between the observed and predicted values as small as possible**.

### **Mathematical Explanation:**

The **regression line** is represented as:

Ŷ = a + bX


where

* $Y$ = actual (observed) value,
* $a$ = intercept,
* $b$ = slope,
* $X$ = independent variable,
* $\hat{Y} = a + bX$ = predicted value of Y.

The **error (residual)** for each observation is:

e = Y - Ŷ


The **Least Squares Method** minimizes the **sum of the squares of these residuals**, i.e.,

Minimize Σ(Y - Ŷ)²



### **Formulas to Estimate Coefficients:**

b = [ n(ΣXY) - (ΣX)(ΣY) ] / [ n(ΣX²) - (ΣX)² ]


a = Ȳ - bX̄

Where:

* $n$ = number of observations
* $\bar{X}$, $\bar{Y}$ = means of X and Y respectively
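
For reference, here is a sketch of where these formulas come from, written with per-observation subscripts $i = 1, \dots, n$ (the subscripts and the symbol $S$ are notation introduced here). The sum of squared residuals is treated as a function of $a$ and $b$, and its partial derivatives are set to zero:

$$
\begin{aligned}
S(a, b) &= \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum Y_i = na + b \sum X_i \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} X_i (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum X_i Y_i = a \sum X_i + b \sum X_i^2
\end{aligned}
$$

Dividing the first equation by $n$ gives $a = \bar{Y} - b\bar{X}$. Substituting this into the second equation and solving for $b$ gives $b = \dfrac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - (\sum X_i)^2}$, which is the slope formula stated above.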



### **Step-by-Step Process:**

1. Collect data for variables $X$ and $Y$.
2. Calculate $\bar{X}$ and $\bar{Y}$.
3. Compute slope $b$ using the above formula.
4. Compute intercept $a = \bar{Y} - b\bar{X}$.
5. Form the regression equation $\hat{Y} = a + bX$.
6. Use the equation to predict values of $Y$.


### **Example:**

| X (Study Hours) | Y (Marks) |
| --------------- | --------- |
| 2               | 40        |
| 4               | 50        |
| 6               | 60        |
| 8               | 70        |

Calculated Regression Equation:
Ŷ = 30 + 5X


Interpretation:

* **Intercept (a = 30):** Base marks = 30
* **Slope (b = 5):** Each additional study hour increases marks by 5

This line minimizes the **total squared error** between actual marks and predicted marks.
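
The coefficient formulas can be applied to the four rows of the example table directly. The sketch below uses plain Python; the variable names are illustrative.

```python
# Data from the example table.
X = [2, 4, 6, 8]        # study hours
Y = [40, 50, 60, 70]    # marks

n = len(X)
sum_x = sum(X)                               # ΣX  = 20
sum_y = sum(Y)                               # ΣY  = 220
sum_xy = sum(x * y for x, y in zip(X, Y))    # ΣXY = 1200
sum_x2 = sum(x * x for x in X)               # ΣX² = 120

# Slope and intercept from the least-squares formulas.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Sum of squared residuals for the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

print(f"b = {b}, a = {a}, SSE = {sse}")  # b = 5.0, a = 30.0, SSE = 0.0
```

The sum of squared residuals is zero here only because the four points lie exactly on a line; with real data it would be positive.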



### Question 6: What is Logistic Regression? How does it differ from Linear Regression?


**Definition:**
**Logistic Regression** is a **supervised machine learning algorithm** used for **classification problems** — that is, when the dependent variable (**Y**) is **categorical** (e.g., Yes/No, 0/1, True/False).

It predicts the **probability** that a given input belongs to a certain class.


### **Mathematical Form:**

Instead of fitting a straight line, Logistic Regression fits a **sigmoid (S-shaped) curve**:

$$
P(Y = 1) = \frac{1}{1 + e^{-(a + bX)}}
$$




### **Explanation of Terms:**

| Symbol       | Meaning                                             |
| ------------ | --------------------------------------------------- |
| **P(Y = 1)** | Probability that Y = 1 (event occurs)               |
| **e**        | Euler’s number (~2.718), the base of the natural logarithm |
| **a**        | Intercept                                           |
| **b**        | Coefficient (shows how X affects the log-odds of Y) |
| **X**        | Independent variable                                |


### **Purpose:**

* To **classify** observations into **two categories** (binary logistic regression); multinomial extensions handle more than two.
* To **predict probabilities** of outcomes (e.g., 0.8 → 80% chance of success).


### **Example:**

Predicting whether a student **passes (1)** or **fails (0)** based on **hours studied (X)**.

If the output probability is $P(Y = 1) = 0.85$,
it means the student has an **85% chance of passing**.
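
A small sketch of how the sigmoid turns a linear score into a probability. The coefficients `a = -4.0` and `b = 1.0` are hypothetical values chosen for illustration, not estimates from any dataset, and the 0.5 cut-off is a common default rather than a fixed rule.

```python
import math


def pass_probability(hours, a=-4.0, b=1.0):
    """P(Y = 1) = 1 / (1 + e^-(a + bX)) for hypothetical coefficients a, b."""
    z = a + b * hours              # linear score (the log-odds)
    return 1.0 / (1.0 + math.exp(-z))


for hours in [0, 2, 4, 6, 8]:
    p = pass_probability(hours)
    label = "pass" if p >= 0.5 else "fail"   # 0.5 is a common default threshold
    print(f"{hours} hours -> P(pass) = {p:.3f} -> predicted: {label}")
```

With these coefficients the probability is exactly 0.5 at 4 hours, where the score a + bX is zero, and it rises toward 1 as study hours increase.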


### **Difference Between Linear and Logistic Regression**

| **Basis**                  | **Linear Regression**                       | **Logistic Regression**                             |
| -------------------------- | ------------------------------------------- | --------------------------------------------------- |
| **Purpose**                | Predicts a **continuous** numeric value     | Predicts a **categorical** outcome (classification) |
| **Dependent Variable (Y)** | Continuous (e.g., marks, salary)            | Binary or categorical (e.g., yes/no, 0/1)           |
| **Equation**               | $Y = a + bX$                                | $P(Y = 1) = \frac{1}{1 + e^{-(a + bX)}}$            |
| **Output Range**           | From $-\infty$ to $+\infty$                 | Between **0 and 1** (probabilities)                 |
| **Type of Relationship**   | Linear relationship between X and Y         | S-shaped (sigmoid) relationship between X and the probability; linear in the log-odds |
| **Use Case Example**       | Predicting house prices, sales, temperature | Predicting spam emails, pass/fail, disease presence |


### Question 7: Name and briefly describe three common evaluation metrics for regression models.


When we build a regression model (like Linear or Multiple Regression), we need to **evaluate how well** it predicts the dependent variable (**Y**).
Here are **three commonly used evaluation metrics**:


### **1. Mean Absolute Error (MAE)**

**Formula:**
$$
MAE = \frac{1}{n} \sum |Y - \hat{Y}|
$$

**Plain text version:**

```
MAE = (1/n) * Σ|Y - Ŷ|
```

**Description:**

* Measures the **average absolute difference** between actual and predicted values.
* It tells us **how far off**, on average, the model’s predictions are.
* Smaller MAE → better model accuracy.

**Example:**
If MAE = 2.5 → on average, the model’s predictions are **2.5 units away** from actual values.
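
A minimal sketch of the MAE calculation in plain Python. The actual and predicted values are made up for illustration.

```python
# Hypothetical actual and predicted values (made up for illustration).
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 12]

# Mean Absolute Error: average of |Y - Ŷ|.
mae = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

print("MAE:", mae)  # MAE: 1.25
```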

### **2. Mean Squared Error (MSE)**

**Formula:**
$$
MSE = \frac{1}{n} \sum (Y - \hat{Y})^2
$$

**Plain text version:**

```
MSE = (1/n) * Σ(Y - Ŷ)²
```

**Description:**

* Calculates the **average of squared errors** between actual and predicted values.
* Penalizes **larger errors** more than smaller ones because of squaring.
* Lower MSE indicates better model performance.

**Example:**
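If MSE = 4, the root mean squared error (RMSE) is **√4 = 2**, so a typical prediction error is about 2 units of Y. Because of the squaring, RMSE is never smaller than MAE and is pulled upward by large errors.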
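
A minimal sketch of MSE and its square root (RMSE) in plain Python, on made-up values. The single large error of 3 contributes 9 of the 11 total squared-error units, which shows how squaring emphasizes large misses; the mean absolute error of the same values is only 1.25.

```python
import math

# Hypothetical actual and predicted values (made up for illustration).
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 12]

# Mean Squared Error: average of (Y - Ŷ)².
mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

# Root Mean Squared Error: square root of MSE, in the same units as Y.
rmse = math.sqrt(mse)

print("MSE:", mse)           # MSE: 2.75
print(f"RMSE: {rmse:.3f}")   # RMSE: 1.658
```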

### **3. R-squared (Coefficient of Determination)**

**Formula:**
$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

**Plain text version:**

```
R² = 1 - (SSres / SStot)
```

Where:

* `SSres` = Σ(Y - Ŷ)² → Sum of squared residuals
* `SStot` = Σ(Y - Ȳ)² → Total variation in Y

**Description:**

* Measures how much of the **variation in Y** is explained by the model.
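* For a least-squares fit with an intercept, evaluated on the data it was fitted to, R² ranges from **0 to 1**:

  * **1** → Perfect prediction
  * **0** → Model explains nothing (no better than predicting the mean of Y)
* R² can be negative when the model predicts worse than the mean of Y, for example on new data.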

**Example:**
If R² = 0.85 → 85% of the variation in Y is explained by the model.
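
A minimal sketch of the R² calculation in plain Python, following the formula above. The actual and predicted values are made up for illustration.

```python
# Hypothetical actual and predicted values (made up for illustration).
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 12]

mean_actual = sum(actual) / len(actual)

ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))  # Σ(Y - Ŷ)²
ss_tot = sum((y - mean_actual) ** 2 for y in actual)                   # Σ(Y - Ȳ)²

r_squared = 1 - ss_res / ss_tot

print("SSres:", ss_res)          # SSres: 11
print("SStot:", ss_tot)          # SStot: 20.0
print(f"R²: {r_squared:.2f}")    # R²: 0.45
```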


### Question 8: What is the purpose of the R-squared metric in regression analysis?


### **Definition:**

**R-squared (R²)** — also called the **Coefficient of Determination** — is a **statistical measure** that shows how well the regression model explains the **variation** in the dependent variable (**Y**) based on the independent variable(s) (**X**).

### **Formula:**

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

**Plain text version:**

```
R² = 1 - (SSres / SStot)
```

Where:

* `SSres` = Σ(Y - Ŷ)² → Sum of Squared Residuals (unexplained variation)
* `SStot` = Σ(Y - Ȳ)² → Total Sum of Squares (total variation in Y)


### **Purpose of R-squared:**

1. **Measures Goodness of Fit:**
   It tells how well the regression line **fits** the data points.

   * Higher R² → better model fit.
   * Lower R² → poor fit (model fails to explain variation).

2. **Explains Variability:**
   It represents the **percentage of variation in the dependent variable** that can be explained by the independent variable(s).

   > Example: R² = 0.80 means 80% of the variation in Y is explained by X, and 20% is unexplained (due to random error or factors not in the model).

3. **Model Performance Comparison:**
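   R² helps compare **different regression models** fitted to the same dataset with the same dependent variable; the one with the higher R² fits that data better. Because R² never decreases when predictors are added, models with different numbers of predictors are better compared using adjusted R².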

### **Interpretation Example:**

If the regression model gives:
$$
R^2 = 0.92
$$

It means:

> “92% of the variation in the dependent variable (Y) is explained by the independent variable (X), while only 8% is due to random noise or other unaccounted factors.”
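
One way to build intuition for the formula is to see R² as a comparison against the baseline of always predicting the mean of Y. The sketch below, on made-up values, evaluates three sets of predictions; the function name `r_squared` is illustrative.

```python
def r_squared(actual, predicted):
    """R² = 1 - SSres / SStot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


actual = [1, 2, 3, 4]  # hypothetical observed values

print(r_squared(actual, [1, 2, 3, 4]))          # 1.0  (perfect predictions)
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (always predicting the mean)
print(r_squared(actual, [4, 3, 2, 1]))          # -3.0 (worse than the mean)
```

The last case shows that predictions worse than the mean baseline produce a negative value, which can happen when a model is evaluated on data it was not fitted to.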




### Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.



```python
# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (Hours studied vs Marks scored)
X = np.array([[2], [4], [6], [8], [10]])   # Independent variable
Y = np.array([40, 50, 60, 70, 80])         # Dependent variable

# Create a Linear Regression model
model = LinearRegression()

# Fit the model with data
model.fit(X, Y)

# Get the slope (coefficient) and intercept
slope = model.coef_[0]
intercept = model.intercept_

# Print the results
print("Slope (b):", slope)
print("Intercept (a):", intercept)

# Predict marks for a student who studied 7 hours
predicted = model.predict([[7]])
print("Predicted Marks for 7 hours of study:", predicted[0])
```

**Output:**

```
Slope (b): 4.999999999999999
Intercept (a): 30.000000000000007
Predicted Marks for 7 hours of study: 65.0
```
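
The slope prints as 4.999999999999999 rather than 5 because of floating-point rounding during the fit, not because the least-squares slope for this data differs from 5. A brief sketch of two common ways to handle this when comparing or reporting results, using the values from the output above:

```python
import math

slope = 4.999999999999999       # value printed in the output above
intercept = 30.000000000000007  # value printed in the output above

# Compare with a tolerance instead of using ==.
print(math.isclose(slope, 5.0))        # True
print(math.isclose(intercept, 30.0))   # True

# Round when reporting.
print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")  # Slope: 5.00, Intercept: 30.00
```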


### Question 10: How do you interpret the coefficients in a simple linear regression model?

### **General Equation:**

$$
\hat{Y} = a + bX
$$

**Plain text version:**

```
Ŷ = a + bX
```

Where:

* **Ŷ** → Predicted value of the dependent variable
* **a** → Intercept (constant term)
* **b** → Slope (coefficient of X)
* **X** → Independent variable


### **Interpretation of Coefficients:**

#### **1. Intercept (a):**

* The **intercept** represents the **predicted value of Y when X = 0**.
* It’s the point where the regression line **crosses the Y-axis**.
* It gives the **baseline value** of Y before X has any effect.

**Example:**
If the equation is

```
Ŷ = 30 + 5X
```

Then:

* Intercept (a = 30):
  When `X = 0` (no study hours), the predicted marks are **30**.


#### **2. Slope (b):**

* The **slope (coefficient)** represents the **change in Y** for a **one-unit change in X**.
* It shows the **strength and direction** of the relationship:

  * If **b > 0** → Y increases as X increases (**positive relationship**)
  * If **b < 0** → Y decreases as X increases (**negative relationship**)

**Example:**
From the same equation

```
Ŷ = 30 + 5X
```

* Slope (b = 5):
  For every additional **1 hour of study**, the marks increase by **5 points** on average.
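
Both interpretations can be checked by evaluating the example equation at a few points. This is a small sketch; the function name `predict` is illustrative.

```python
def predict(x, a=30, b=5):
    """Predicted marks from the example equation Ŷ = 30 + 5X."""
    return a + b * x


print(predict(0))               # 30 -> the intercept: predicted marks at X = 0
print(predict(4) - predict(3))  # 5  -> the slope: change for one extra hour
print(predict(9) - predict(8))  # 5  -> the same change at any starting point
```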




```python
# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data: (X = Hours studied, Y = Marks scored)
X = np.array([[2], [4], [6], [8], [10]])  # Independent variable
Y = np.array([40, 50, 60, 70, 80])        # Dependent variable

# Create and fit the model
model = LinearRegression()
model.fit(X, Y)

# Extract coefficients
intercept = model.intercept_
slope = model.coef_[0]

# Print the coefficients
print("Intercept (a):", intercept)
print("Slope (b):", slope)

# Display the regression equation
print(f"\nRegression Equation: Ŷ = {intercept:.2f} + {slope:.2f}X")

# Interpret the coefficients
print("\nInterpretation:")
print(f"→ The intercept (a = {intercept:.2f}) means that when X = 0, the predicted marks are {intercept:.2f}.")
print(f"→ The slope (b = {slope:.2f}) means that for every additional 1 hour of study, marks increase by {slope:.2f} points on average.")
```

**Output:**

```
Intercept (a): 30.000000000000007
Slope (b): 4.999999999999999

Regression Equation: Ŷ = 30.00 + 5.00X

Interpretation:
→ The intercept (a = 30.00) means that when X = 0, the predicted marks are 30.00.
→ The slope (b = 5.00) means that for every additional 1 hour of study, marks increase by 5.00 points on average.
```
