# Supervised Learning: Regression Models and Performance Metrics | Assignment Solution

This notebook provides the complete solution for the Regression Models and Performance Metrics Assignment (DA-AG-008).

## Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

**Simple Linear Regression (SLR)** is a statistical method used to model the relationship between two variables: one **independent variable** (or predictor, $X$) and one **dependent variable** (or response, $Y$).

* **Purpose:** The primary purpose of SLR is to **find the best-fitting straight line** through the data that allows us to predict the value of the dependent variable ($Y$) based on the value of the independent variable ($X$). It helps us quantify the strength and direction of the linear relationship between these two variables.

## Question 2: What are the key assumptions of Simple Linear Regression?

SLR models rely on four key assumptions (often remembered by the acronym **LINE**: Linearity, Independence, Normality, Equal variance) for the estimates and statistical tests to be valid:

1.  **Linearity:** The relationship between the independent variable ($X$) and the mean of the dependent variable ($Y$) must be **linear**.
2.  **Independence of Errors (or Observations):** The residuals (errors) must be independent of each other. This is particularly important for time-series data where consecutive errors might be correlated (**autocorrelation**).
3.  **Normality of Errors:** The residuals must be **normally distributed** with a mean of zero.
4.  **Equality of Variance (Homoscedasticity):** The variance of the residuals must be constant for all levels of the independent variable. In simple terms, the spread of the data points around the regression line should be roughly the same across the entire range of $X$.
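
Some of these assumptions can be screened numerically from the residuals. The sketch below is a minimal illustration, assuming a hypothetical list `residuals` (made-up values, not taken from the assignment data). It checks that the residuals average to roughly zero and computes the Durbin-Watson statistic, a common screen for first-order autocorrelation.

```python
# Hypothetical residuals (observed minus fitted), in observation order.
# These values are invented for illustration only.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3, -0.2]


def durbin_watson(e):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    denominator = sum(value ** 2 for value in e)
    return numerator / denominator


mean_residual = sum(residuals) / len(residuals)
dw = durbin_watson(residuals)

print(f"Mean residual: {mean_residual:.3f}")
print(f"Durbin-Watson: {dw:.3f}")
```

The statistic lies between 0 and 4. Values near 2 suggest little first-order autocorrelation, values well below 2 point toward positive autocorrelation, and values well above 2 toward negative autocorrelation.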

## Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

The mathematical equation for a simple linear regression model is:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

| Term | Explanation |
| :--- | :--- |
| $\mathbf{Y}$ | **Dependent Variable** (or Response). This is the variable we are trying to predict or explain. |
| $\mathbf{\beta_0}$ | **Intercept** (or Bias). This is the predicted value of $Y$ when the independent variable $X$ is zero. |
| $\mathbf{\beta_1}$ | **Slope** (or Coefficient for $X$). This represents the change in $Y$ for every one-unit increase in $X$. |
| $\mathbf{X}$ | **Independent Variable** (or Predictor/Feature). This is the variable used to explain or predict $Y$. |
| $\mathbf{\epsilon}$ | **Error Term** (or Disturbance). This represents the difference between the observed value of $Y$ and the value given by the true line $\beta_0 + \beta_1 X$. It accounts for all other factors that affect $Y$. Its estimate from a fitted model is called the residual. |
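
To see how the terms combine, the sketch below generates a few observations from the equation. The values of `beta_0`, `beta_1`, and `sigma` are assumptions chosen for illustration, and the error term is drawn from a normal distribution with mean zero.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

beta_0 = 50.0   # assumed intercept
beta_1 = 5.0    # assumed slope
sigma = 2.0     # assumed standard deviation of the error term


def simulate(x):
    """Return one observation Y = beta_0 + beta_1 * x + epsilon."""
    epsilon = random.gauss(0.0, sigma)
    return beta_0 + beta_1 * x + epsilon


for x in [1, 2, 3]:
    systematic = beta_0 + beta_1 * x   # the part explained by X
    y = simulate(x)                    # the observed value, including noise
    print(f"X={x}: systematic={systematic:.1f}, observed Y={y:.2f}, "
          f"error={y - systematic:+.2f}")
```

The systematic part is the same every time for a given $X$; only the error term changes from one draw to the next.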

## Question 4: Provide a real-world example where simple linear regression can be applied.

A common real-world application of simple linear regression is modeling the relationship between **Study Hours and Exam Scores**.

* **Independent Variable ($X$):** Number of hours a student spends studying for an exam.
* **Dependent Variable ($Y$):** The score the student achieves on that exam.

SLR could be used to determine if there is a statistically significant linear relationship and to predict a student's likely exam score based on their study time.

## Question 5: What is the method of least squares in linear regression?

The **Method of Least Squares** is the technique used to find the optimal values for the coefficients ($\beta_0$ and $\beta_1$) in a linear regression model.

* **Mechanism:** It works by minimizing the **Sum of Squared Errors (SSE)**, which is the sum of the squared vertical distances (residuals) between the actual data points and the regression line.
* **Goal:** By minimizing SSE, the method selects the "best-fitting" line: the one for which the total squared vertical distance from the data points to the line is as small as possible.
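
For a single predictor, the minimizing coefficients have a closed form: $\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below applies these formulas to made-up data (an assumption for illustration) and compares the SSE at the fitted line with the SSE of a slightly different line.

```python
# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [52, 61, 64, 71, 74]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Closed-form least-squares estimates for one predictor.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_1 = s_xy / s_xx
beta_0 = y_bar - beta_1 * x_bar


def sse(b0, b1):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


print(f"beta_1 = {beta_1:.2f}, beta_0 = {beta_0:.2f}")
print(f"SSE at the least-squares fit: {sse(beta_0, beta_1):.2f}")
print(f"SSE with a nudged slope:      {sse(beta_0, beta_1 + 0.5):.2f}")
```

Any change to the slope or intercept away from the least-squares values increases the SSE, which is what "best-fitting" means here.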

## Question 6: What is Logistic Regression? How does it differ from Linear Regression?

**Logistic Regression** is a statistical method that models the probability of a categorical (usually binary) outcome as a function of one or more independent variables. It differs from SLR as follows:

| Feature | Simple Linear Regression (SLR) | Logistic Regression |
| :--- | :--- | :--- |
| **Problem Type** | **Regression** (Predicting continuous values) | **Classification** (Predicting categorical/binary outcomes) |
| **Dependent Variable (Y)** | Continuous (e.g., Price, Height, Score) | Categorical (usually Binary, e.g., Yes/No, 0/1, Spam/Not Spam) |
| **Output** | A straight line that predicts a raw value. | A probability (between 0 and 1) that an observation belongs to a particular class. |
| **Transformation** | None. Uses a direct linear equation. | Uses the **Sigmoid Function** (or logistic function) to map the output of the linear equation into a probability. |

**In short, Logistic Regression uses the linear equation of SLR but wraps it in a sigmoid function to transform the output into a probability, making it suitable for classification tasks.**
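
The sigmoid step can be sketched in a few lines. The coefficients `beta_0` and `beta_1` below are assumed values for illustration, not fitted to any data, and the 0.5 cutoff is a common default rather than a requirement.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Assumed coefficients for illustration (not fitted to any data).
beta_0 = -4.0
beta_1 = 1.0

for hours in [1, 4, 8]:
    z = beta_0 + beta_1 * hours     # linear part, same form as in SLR
    p = sigmoid(z)                  # probability of the positive class
    label = 1 if p >= 0.5 else 0    # classify with a 0.5 threshold
    print(f"hours={hours}: z={z:+.1f}, p={p:.3f}, predicted class={label}")
```

The linear part `z` can take any real value, while `p` always stays strictly between 0 and 1, which is why it can be read as a probability.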

## Question 7: Name and briefly describe three common evaluation metrics for regression models.

1.  **Mean Absolute Error (MAE):**
    * **Description:** This is the average of the absolute differences between the predicted values and the actual values.
    * **Interpretation:** It gives a clear, interpretable measure of the average error in the same units as the dependent variable. It treats all errors equally.

2.  **Mean Squared Error (MSE):**
    * **Description:** This is the average of the squared differences between the predicted and actual values.
    * **Interpretation:** Because errors are squared, this metric heavily penalizes **larger errors (outliers)**, making it sensitive to models that produce significant deviations. Its units are squared.

3.  **Root Mean Squared Error (RMSE):**
    * **Description:** This is the square root of the MSE.
    * **Interpretation:** Like MAE, it returns the error in the original units of the dependent variable, making it more interpretable than MSE. Because it is a monotonic transformation of MSE, it still penalizes large errors more heavily than MAE does.
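
The three metrics can be computed directly from their definitions. The sketch below uses made-up actual and predicted values (an assumption for illustration), with one deliberately large error to show how squaring changes the picture.

```python
import math

# Made-up actual and predicted values for illustration only.
y_true = [50, 60, 70, 80]
y_pred = [52, 58, 75, 70]

# Prediction errors: 2, -2, 5, -10
errors = [pred - actual for actual, pred in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE comes out larger than MAE here because the single error of 10 dominates the squared terms.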

## Question 8: What is the purpose of the R-squared metric in regression analysis?

The **R-squared** metric (also known as the **Coefficient of Determination**) measures the **proportion of the variance in the dependent variable ($Y$) that is predictable from the independent variable(s) ($X$)**.

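* **Value Range:** For a least-squares fit with an intercept, evaluated on the data it was fitted to, it ranges from 0 to 1 (or 0% to 100%). On held-out data it can be negative when the model predicts worse than the mean of $Y$.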
* **Purpose:** It serves as a measure of the **goodness of fit** of the model.
    * An R-squared of **0.80** means that 80% of the variation in $Y$ is explained by the model (i.e., by $X$).
    * An R-squared close to **1** indicates that the model explains most of the variability in the response data and fits the data well.
    * An R-squared close to **0** indicates that the model does not explain much of the variability, and the prediction is no better than simply using the mean of $Y$.
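
R-squared can be computed as $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of $Y$ from its mean. The sketch below uses made-up values (an assumption for illustration); the helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for paired actual and predicted values."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values for illustration only.
y_true = [50, 60, 70, 80]
y_pred = [52, 58, 75, 70]

print(f"Model:           {r_squared(y_true, y_pred):.3f}")
print(f"Always the mean: {r_squared(y_true, [65, 65, 65, 65]):.3f}")
print(f"Perfect fit:     {r_squared(y_true, y_true):.3f}")
```

Predicting the mean of $Y$ for every observation gives an R-squared of 0, and a perfect fit gives 1, which matches the two ends of the interpretation above.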

## Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

We'll use synthetic data representing the relationship between Study Hours ($X$) and Exam Scores ($Y$).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Prepare synthetic data for Study Hours (X) and Exam Scores (y)
# X must be a 2D array (features)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
# y is a 1D array (target)
y = np.array([55, 60, 65, 70, 75, 80, 85, 90, 95, 100])

# 2. Initialize the Simple Linear Regression model
model = LinearRegression()

# 3. Fit the model to the data
model.fit(X, y)

# 4. Print the learned parameters (slope and intercept)
print("--- Python Code Output ---")
# The coefficient (slope) for the single feature is at index 0
print(f"Slope (Coefficient, β1): {model.coef_[0]:.2f}")
# The intercept (bias)
print(f"Intercept (Bias, β0): {model.intercept_:.2f}")
```

Output:

```
--- Python Code Output ---
Slope (Coefficient, β1): 5.00
Intercept (Bias, β0): 50.00
```


## Question 10: How do you interpret the coefficients in a simple linear regression model?

The interpretation of the coefficients ($\beta_0$ and $\beta_1$) is direct and follows the mathematical form of the equation: $Y = \beta_0 + \beta_1 X$.

1.  **Interpretation of the Slope ($\beta_1$):**
    * The slope represents the **expected (average) change in the dependent variable ($Y$) for every one-unit increase in the independent variable ($X$)**.
    * *Example (using output from Q9):* If the slope is $\mathbf{5.00}$, it means that for every additional hour a student studies ($X$), their predicted exam score ($Y$) increases by $\mathbf{5.00}$ points.

2.  **Interpretation of the Intercept ($\beta_0$):**
    * The intercept represents the **predicted value of the dependent variable ($Y$) when the independent variable ($X$) is equal to zero.**
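    * *Example (using output from Q9):* If the intercept is $\mathbf{50.00}$, it means a student who studies for zero hours ($X=0$) is still predicted to achieve a score of $\mathbf{50.00}$ points ($Y$). Because the Q9 data cover 1 to 10 study hours, this value is an extrapolation slightly outside the observed range.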
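
Both interpretations can be checked by plugging values into the fitted line. The sketch below hard-codes the slope and intercept printed in Q9; the helper name `predict_score` is hypothetical.

```python
slope = 5.0        # beta_1 from the Q9 output
intercept = 50.0   # beta_0 from the Q9 output


def predict_score(hours):
    """Predicted exam score for a given number of study hours."""
    return intercept + slope * hours


# Prediction at X = 0 equals the intercept.
print(f"Predicted score at 0 hours: {predict_score(0):.2f}")

# One extra hour changes the prediction by exactly the slope.
difference = predict_score(4) - predict_score(3)
print(f"Change from 3 to 4 hours:   {difference:.2f}")
```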
