## Supervised Learning: Regression Models and Performance Metrics

## Question 1 : What is Simple Linear Regression (SLR)? Explain its purpose.

What is Simple Linear Regression (SLR)?

Simple Linear Regression (SLR) is one of the most basic and widely used statistical techniques for modeling the relationship between two continuous variables: one independent variable (X) and one dependent variable (Y).

In SLR, we try to model the relationship between X and Y using a straight line equation of the form:

$$
Y = b_0 + b_1X + \varepsilon
$$

where:

- Y = Dependent (output) variable
- X = Independent (input) variable
- b0 = Intercept (value of Y when X = 0)
- b1 = Slope (change in Y for a one-unit change in X)
- ε = Error term (difference between the observed value and the value given by the line)

This equation describes a straight line. When b0 and b1 are chosen to minimize the squared differences between the actual data points and the line, the result is known as the line of best fit.

Purpose of Simple Linear Regression

The main purposes of SLR are:

Prediction

It helps in predicting the value of the dependent variable (Y) based on the given independent variable (X).

Example: Predicting a person’s weight (Y) based on their height (X).

Understanding Relationships

SLR helps to identify and measure how strongly two variables are related.

Example: Determining whether there is a positive or negative relationship between temperature and ice-cream sales.

Trend Analysis

It is used to analyze trends and patterns over time or across data points.

Example: Predicting sales growth with time.



## Question 2: What are the key assumptions of Simple Linear Regression?

Here are the main assumptions of SLR explained in detail:

1. Linearity

The relationship between the independent variable (X) and the dependent variable (Y) must be linear.

This means that a change in X leads to a proportional change in Y.

Mathematically:

$$
Y = b_0 + b_1X + \varepsilon
$$


Example: If study hours (X) increase, marks (Y) increase in a roughly straight-line pattern.

Check: Use a scatter plot of X vs Y — the points should form a roughly straight line.

2. Independence of Errors (No Autocorrelation)

The residuals (errors) should be independent of each other.

In other words, the error term for one observation should not influence another.

Example: In time series data, if one day’s error affects the next day’s, this violates independence.

Check: Use the Durbin-Watson test to detect autocorrelation.

3. Homoscedasticity (Constant Variance of Errors)

The residuals should have constant variance across all levels of X.

This means the spread of residuals should be the same throughout the line of best fit.

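If violated: The errors are heteroscedastic (their variance changes with X). The coefficient estimates remain unbiased, but standard errors, confidence intervals, and hypothesis tests become unreliable.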

Check: Plot residuals vs predicted values — the spread should look random and even.

4. Normality of Errors

The residuals (differences between actual and predicted Y values) should be normally distributed.

This assumption ensures valid hypothesis tests and confidence intervals.

Check: Use a histogram or Q-Q plot of the residuals. The histogram should look roughly bell-shaped, and the points on the Q-Q plot should fall close to a straight line.

5. No Multicollinearity (only relevant in multiple regression)

In Simple Linear Regression, there is only one independent variable, so multicollinearity does not apply.

However, in Multiple Linear Regression, independent variables must not be highly correlated.

6. No Significant Outliers

The dataset should not contain extreme outliers, as they can distort the regression line.

Outliers can heavily influence the slope and intercept values.

Check: Use boxplots or scatter plots to detect outliers.
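
Some of the checks above can be sketched numerically. The following is a minimal illustration, not a full diagnostic workflow: it fits a line with NumPy to a small example dataset, computes the residuals, and evaluates the Durbin-Watson statistic directly from its definition (values near 2 suggest little first-order autocorrelation). The helper name `durbin_watson` and the data are assumptions made for this example.

```python
import numpy as np

# Small illustrative dataset (assumed for this sketch)
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

# Fit a straight line; np.polyfit returns [slope, intercept] for degree 1
b1, b0 = np.polyfit(X, Y, 1)

# Residuals = actual values minus predicted values
residuals = Y - (b0 + b1 * X)

def durbin_watson(e):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print("Residuals:", np.round(residuals, 3))
print("Durbin-Watson:", round(float(durbin_watson(residuals)), 3))
```

With only five points this is purely illustrative. Residual plots and Q-Q plots remain the usual way to inspect homoscedasticity and normality.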

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.


Mathematical Equation of Simple Linear Regression (SLR)

The general equation for a simple linear regression model is:
$$
Y = b_0 + b_1X + \varepsilon
$$

Explanation of Each Term:

Y — Dependent Variable (Response Variable)

This is the outcome or target variable that we are trying to predict or explain.

Example: In a study predicting exam scores based on study hours, Y = exam score.

X — Independent Variable (Predictor Variable)

This is the input or explanatory variable used to predict the dependent variable.

Example: X = number of hours studied.

𝑏0— Intercept (Constant Term)

It represents the value of Y when X = 0.

In other words, it’s where the regression line crosses the Y-axis.

Example: If b0=20, it means even with 0 hours of study, a student is expected to score 20 marks.

𝑏1 — Slope (Regression Coefficient)

It represents the rate of change of Y with respect to X.

For every 1 unit increase in X, Y changes by b1 units.

Example: If b1=5, then for every extra hour of study, the score increases by 5 marks.

ε — Error Term (Residual)

This represents the difference between the actual and predicted values of Y.

It accounts for the variation in Y that cannot be explained by X.

Mathematically:

$$
\varepsilon = Y_{\text{actual}} - Y_{\text{predicted}}
$$
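
To make the terms concrete, the short sketch below plugs the example values used above (b0 = 20, b1 = 5) into the equation. The function name `predict_score` and the observed score of 52 are hypothetical, chosen only to show how a residual is computed.

```python
def predict_score(hours, b0=20.0, b1=5.0):
    """Predicted exam score from the line Y = b0 + b1 * X."""
    return b0 + b1 * hours

hours_studied = 6
predicted = predict_score(hours_studied)   # 20 + 5 * 6 = 50

# Hypothetical observed score for the same student
actual = 52.0
residual = actual - predicted              # actual minus predicted = 2.0

print("Predicted score:", predicted)
print("Residual:", residual)
```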



## Question 4: Provide a real-world example where simple linear regression can be applied.


Real-World Example of Simple Linear Regression (Theory Answer)

Simple Linear Regression can be applied in many real-world situations where we want to study and predict the relationship between two continuous variables — one independent variable (X) and one dependent variable (Y).

Example: Predicting House Prices Based on Area

One of the most common applications of Simple Linear Regression is in the real estate industry, where it is used to predict the price of a house (Y) based on its area or size (X).

In this case:

Dependent variable (Y): House Price

Independent variable (X): Area of the house (in square feet or square meters)

By applying regression analysis to historical data, we can obtain a regression equation of the form:

$$
Y = b_0 + b_1X + \varepsilon
$$


This equation helps estimate the expected price of a house for a given area.

Purpose of Using SLR in This Case:

Prediction: To estimate the price of a house based on its size.

Understanding Relationship: To understand how house size affects its price.
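
As a rough sketch of this example, the code below fits a line to a handful of made-up (area, price) pairs using NumPy. The numbers are synthetic, generated from an assumed rule of price = 50,000 + 150 per square foot, so they do not describe any real market. Real data would contain noise, and the fit would not be exact.

```python
import numpy as np

# Synthetic areas in square feet (assumed values for illustration)
area = np.array([800, 1000, 1200, 1500, 2000], dtype=float)

# Synthetic prices generated from an assumed rule: 50,000 + 150 per sq ft
price = 50_000 + 150 * area

# Fit a straight line; np.polyfit returns [slope, intercept] for degree 1
b1, b0 = np.polyfit(area, price, 1)

# Estimate the price of a 1,300 sq ft house from the fitted line
estimate = b0 + b1 * 1300

print("Slope (price per sq ft):", round(float(b1), 2))
print("Intercept:", round(float(b0), 2))
print("Estimated price for 1300 sq ft:", round(float(estimate), 2))
```

Because the synthetic prices lie exactly on a line, the fit recovers the assumed slope and intercept up to floating-point rounding.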

## Question 5: What is the method of least squares in linear regression?

Method of Least Squares in Linear Regression

The Method of Least Squares is a fundamental mathematical technique used in linear regression to find the best-fitting line through a set of data points.
It works by minimizing the sum of the squares of the differences (errors) between the observed values and the predicted values produced by the regression line.

1. Concept:

In Simple Linear Regression, the relationship between the dependent variable (Y) and the independent variable (X) is given by:
$$
Y = b_0 + b_1X + \varepsilon
$$

where:

𝑌 = actual (observed) value

𝑏0+𝑏1𝑋 = predicted value from regression line

ε = error (difference between actual and predicted value)

The goal is to find the values of 𝑏0(intercept) and 𝑏1(slope) that minimize the total error across all data points.

2. Principle of Least Squares:

The error (residual) for each observation is:

$$
e_i = Y_i - (b_0 + b_1X_i)
$$

The Method of Least Squares minimizes the sum of squared errors (SSE):

$$
S = \sum (Y_i - b_0 - b_1X_i)^2
$$

By squaring the errors, we:

Avoid negative signs cancelling positive ones.

Penalize larger errors more heavily than smaller ones, so the fitted line avoids large misses.

The values of 𝑏0 and 𝑏1 are chosen such that this sum S is as small as possible.

3. Formulas for the Regression Coefficients:

From calculus (minimizing S with respect to 𝑏0 and 𝑏1), we get:
$$
b_1 = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
$$

$$
b_0 = \bar{Y} - b_1\bar{X}
$$

Where:

$$
n = \text{number of data points}
$$

$$
\bar{X} = \text{mean of } X \text{ values}
$$

$$
\bar{Y} = \text{mean of } Y \text{ values}
$$
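
The calculus step can be spelled out briefly. Setting the partial derivatives of S with respect to b0 and b1 to zero gives the two normal equations:

$$
\frac{\partial S}{\partial b_0} = -2\sum (Y_i - b_0 - b_1X_i) = 0
\quad\Rightarrow\quad
\sum Y = n b_0 + b_1 \sum X
$$

$$
\frac{\partial S}{\partial b_1} = -2\sum X_i (Y_i - b_0 - b_1X_i) = 0
\quad\Rightarrow\quad
\sum XY = b_0 \sum X + b_1 \sum X^2
$$

Dividing the first equation by n gives the formula for b0, and substituting that result into the second equation and solving for b1 gives the slope formula.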

4. Interpretation:

The slope (𝑏1) represents how much Y changes for a one-unit change in X.

The intercept (𝑏0) represents the value of Y when X = 0.

The resulting line Y=b0+b1X is called the line of best fit because it minimizes the total squared error.
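
The closed-form formulas above translate directly into a few lines of NumPy. This is a minimal sketch rather than a production routine. The function name `least_squares_fit` is an assumption for this example, and the data are the five points used in the scikit-learn code for Question 9.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) from the closed-form least squares formulas."""
    n = len(x)
    b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        n * np.sum(x ** 2) - np.sum(x) ** 2
    )
    b0 = np.mean(y) - b1 * np.mean(x)
    return b0, b1

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

b0, b1 = least_squares_fit(X, Y)
print("Slope (b1):", round(float(b1), 4))
print("Intercept (b0):", round(float(b0), 4))
```

For these five points the sums are ΣX = 15, ΣY = 20, ΣXY = 66, and ΣX² = 55, so b1 = (330 − 300) / (275 − 225) = 0.6 and b0 = 4 − 0.6 × 3 = 2.2.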

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?


What is Logistic Regression?

Logistic Regression is a statistical method used for classification problems, where the dependent variable is categorical rather than continuous. It is mainly used to predict the probability of an event occurring, such as whether an email is spam or not, whether a customer will buy a product, or whether a student will pass or fail an exam.

Although it has the term “regression” in its name, logistic regression is used mainly as a classification algorithm: it models the probability of a class, and that probability is then converted into a class label.

Purpose of Logistic Regression

The main goal of logistic regression is to estimate the probability that a given input belongs to a particular category. It converts the linear regression output into a probability value that always lies between 0 and 1, making it suitable for binary classification tasks (like yes/no, 0/1, true/false).

Mathematical Model

Logistic Regression is based on the logistic (sigmoid) function, which produces an S-shaped curve. The equation is:
$$
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(b_0 + b_1X)}}
$$

In this equation:

P(Y=1∣X) represents the probability that the output Y belongs to class 1 given the input X.

𝑏0 is the intercept.

𝑏1 is the coefficient of the independent variable X.

𝑒 is the base of the natural logarithm (approximately 2.718).

This formula ensures that the output will always be a value between 0 and 1.

Decision Rule

Once the probability value is obtained, we apply a decision threshold:

If the probability is greater than or equal to 0.5, the model predicts class 1 (e.g., yes, success, positive).

If the probability is less than 0.5, the model predicts class 0 (e.g., no, failure, negative).

This threshold can be adjusted depending on the problem’s requirements.

Example

Suppose we use logistic regression to predict whether a student will pass or fail based on the number of study hours. The model might predict a probability of 0.8 for passing. Since 0.8 is greater than 0.5, the model classifies the student as “Pass.”

Difference Between Logistic Regression and Linear Regression

Although both models look similar, they are used for different purposes.
Linear Regression is used when the dependent variable is continuous, such as predicting house prices, sales, or temperature. It provides a numeric value as output and follows a straight-line relationship between the variables.

In contrast, Logistic Regression is used when the dependent variable is categorical, such as predicting whether an email is spam or not. Instead of giving a continuous value, it provides a probability between 0 and 1 using the sigmoid function. The relationship between the input variable and the output probability is non-linear, forming an S-shaped curve rather than a straight line.

In simple terms, linear regression predicts “how much,” while logistic regression predicts “which category.”


## Question 7: Name and briefly describe three common evaluation metrics for regression models.

When evaluating a **regression model**, we need to measure how close the predicted values are to the actual values.  
Three of the most common evaluation metrics used are **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, and **Root Mean Squared Error (RMSE)**.

---

### **1. Mean Absolute Error (MAE)**

**Definition:**  
MAE measures the **average magnitude of the errors** between the predicted and actual values.  
It does not consider the direction of the errors (positive or negative).

**Formula:**

$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|
$$

Where:  
- $Y_i$ = actual value  
- $\hat{Y}_i$ = predicted value  
- $n$ = number of data points  

**Interpretation:**  
MAE gives the average error in the same units as the target variable.  
A **lower MAE** indicates a better model.

---

### **2. Mean Squared Error (MSE)**

**Definition:**  
MSE measures the **average of the squared differences** between the actual and predicted values.  
It penalizes larger errors more than smaller ones.

**Formula:**

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
$$

**Interpretation:**  
A smaller MSE value indicates that the model’s predictions are closer to the actual data.  
Because the errors are squared, MSE is sensitive to outliers.

---

### **3. Root Mean Squared Error (RMSE)**

**Definition:**  
RMSE is the **square root of the Mean Squared Error (MSE)**.  
It can be read as the **typical size of a prediction error**; when the errors average to zero, it equals their standard deviation.

**Formula:**

$$
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}
$$

**Interpretation:**  
RMSE provides an error measure in the same units as the dependent variable.  
A **lower RMSE** means the model predictions are more accurate.

---

### **Summary**

- **MAE** → Average size of prediction errors.  
- **MSE** → Average of squared errors (penalizes larger errors).  
- **RMSE** → Square root of MSE, interpretable in original units.  

A good regression model aims to **minimize MAE, MSE, and RMSE** for better accuracy.
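
The three metrics can be computed directly from their formulas. The sketch below uses the five data points and the fitted line from Question 9 (slope 0.6, intercept 2.2). The variable names are assumptions for this example.

```python
import numpy as np

# Actual values and the X values they were observed at
y_true = np.array([2, 4, 5, 4, 5], dtype=float)
x = np.array([1, 2, 3, 4, 5], dtype=float)

# Predictions from the fitted line Y = 2.2 + 0.6 * X
y_pred = 2.2 + 0.6 * x

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # square root of MSE, in the units of Y

print("MAE :", round(float(mae), 4))
print("MSE :", round(float(mse), 4))
print("RMSE:", round(float(rmse), 4))
```

Here the errors are -0.8, 0.6, 1.0, -0.6, and -0.2, so MAE = 3.2 / 5 = 0.64, MSE = 2.4 / 5 = 0.48, and RMSE = √0.48 ≈ 0.693.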


## Question 8: What is the purpose of the R-squared metric in regression analysis?

### **Definition**

**R-squared (R²)**, also known as the **Coefficient of Determination**, is a statistical measure used to evaluate how well a **regression model** explains the variability of the dependent variable.  
It represents the **proportion of the variance** in the dependent variable that can be **explained by the independent variable(s)** in the model.


### **Formula**

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

Where:  
- $SS_{res} = \sum (Y_i - \hat{Y}_i)^2$ → **Residual Sum of Squares** (unexplained variation)  
- $SS_{tot} = \sum (Y_i - \bar{Y})^2$ → **Total Sum of Squares** (total variation in data)  
- $Y_i$ = actual values  
- $\hat{Y}_i$ = predicted values  
- $\bar{Y}$ = mean of actual values  



### **Interpretation**

- **R² = 1 (or 100%)** → Perfect fit. The regression model explains all the variability in the data.  
- **R² = 0** → The model explains none of the variability; it does no better than the mean.  
- **Higher R² values** indicate that the model fits the data better.

For example,  
if $R^2 = 0.85$, it means that **85% of the variation** in the dependent variable can be explained by the model, while **15%** is due to random noise or other factors not included in the model.



### **Purpose of R-squared**

1. **Measure of Goodness of Fit:**  
   It helps determine how well the regression line fits the observed data.

2. **Model Evaluation:**  
   R² is used to compare different models — a model with a higher R² generally fits the data better.

3. **Explained Variance:**  
   It quantifies how much of the variation in the dependent variable is captured by the independent variables.



### **Limitations**

- A high R² does **not always mean** the model is good; it could be **overfitting** the data.  
- R² **cannot determine causation** — only the strength of the relationship.  
- Adding more variables never decreases R² on the training data, even if they are irrelevant (this is why **Adjusted R²** is also used).



### **Conclusion**

R-squared is a key metric in regression analysis that shows **how well the model explains the data’s variability**.  
A higher R² indicates a better model fit, but it should always be interpreted carefully alongside other metrics like **RMSE** or **MAE** to ensure the model is both accurate and reliable.
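
The formula can be checked with a short NumPy sketch. It reuses the five data points and the fitted line from Question 9 (slope 0.6, intercept 2.2). The function name `r_squared` is an assumption for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from the definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = np.array([2, 4, 5, 4, 5], dtype=float)
x = np.array([1, 2, 3, 4, 5], dtype=float)

# Predictions from the fitted line Y = 2.2 + 0.6 * X
y_pred = 2.2 + 0.6 * x
r2 = r_squared(y_true, y_pred)

# Baseline: always predicting the mean of Y gives R^2 = 0
r2_mean = r_squared(y_true, np.full_like(y_true, y_true.mean()))

print("R^2 of fitted line  :", round(float(r2), 4))
print("R^2 of mean baseline:", round(float(r2_mean), 4))
```

For this data SS_res = 2.4 and SS_tot = 6, so R² = 1 − 2.4 / 6 = 0.6: the line explains 60% of the variation in Y.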


## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept

```python
# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Example dataset
# Independent variable (X) - must be 2D for sklearn
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# Dependent variable (Y)
Y = np.array([2, 4, 5, 4, 5])

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, Y)

# Print the slope (coefficient) and intercept
print("Slope (b1):", model.coef_[0])
print("Intercept (b0):", model.intercept_)
```

Output:

```
Slope (b1): 0.6
Intercept (b0): 2.2
```
