
# Lab 7: Linear Regression

## Overview
In this lab, we will:
1. Develop a simple linear regression model manually using mathematical computations.
2. Extend it to multiple linear regression using matrix algebra.
3. Visualize both the regression line and residual plots.

### Prerequisites
Make sure you are familiar with:
- Basic Python programming.
- Concepts like correlation and variance.
- Linear algebra (especially matrix operations).

### Required Libraries
We will use the following libraries:
- `pandas` for data manipulation.
- `numpy` for numerical operations.
- `matplotlib` and `seaborn` for visualization.


### Introduction to Linear Regression

Linear regression is one of the most fundamental and widely used techniques in statistical modeling and machine learning. It is used to model the relationship between one or more independent variables (predictors) and a dependent variable (response) by fitting a linear equation to the observed data.

The general form of a linear regression model is:

\\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon
\\]

Where:
- \\( y \\): Dependent variable (target or response).
- \\( \beta_0 \\): Intercept (the value of \\( y \\) when all predictors are 0).
- \\( \beta_1, \beta_2, \dots, \beta_n \\): Regression coefficients; each gives the expected change in \\( y \\) for a one-unit increase in the corresponding predictor, holding the other predictors fixed.
- \\( x_1, x_2, \dots, x_n \\): Independent variables (features or predictors).
- \\( \epsilon \\): Error term (captures the variation in \\( y \\) that the predictors cannot explain).
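
As a minimal sketch of how the terms in this equation combine, the snippet below generates a few observations from a model with two predictors. The coefficient values, the noise level, and the random inputs are hypothetical and chosen only for illustration; they are not taken from the lab's dataset.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 1.0                       # intercept
betas = np.array([0.5, -2.0])      # beta_1, beta_2

rng = np.random.default_rng(0)     # fixed seed so the run is repeatable
X = rng.uniform(0, 10, size=(5, 2))       # 5 observations, 2 predictors
epsilon = rng.normal(0, 0.1, size=5)      # error term

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + epsilon, for all rows at once
y = beta_0 + X @ betas + epsilon

# The same computation written out term by term for the first observation
y_first = beta_0 + betas[0] * X[0, 0] + betas[1] * X[0, 1] + epsilon[0]
print(np.isclose(y[0], y_first))
```

The matrix product `X @ betas` is only a compact way of writing the sum over predictors; the term-by-term line shows the same arithmetic for a single row.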

Linear regression can be categorized into two types:
1. **Simple Linear Regression**: Models the relationship between a single independent variable and the dependent variable.
2. **Multiple Linear Regression**: Models the relationship between multiple independent variables and the dependent variable.

### Key Assumptions of Linear Regression:
1. **Linearity**: The relationship between predictors and the target variable is linear.
2. **Independence**: Observations are independent of each other.
3. **Homoscedasticity**: The variance of residuals is constant across all levels of the independent variable(s).
4. **Normality**: Residuals are normally distributed.

Linear regression is widely used in various fields, including economics, engineering, biology, and social sciences, due to its simplicity, interpretability, and efficiency for small to moderate-sized datasets.


In [2]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
data = pd.read_csv(url)

# Display first few rows
data.head()


   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4



### Exploratory Data Analysis
We will analyze the dataset before building regression models.


In [3]:

# Check for missing values
print(data.isnull().sum())

# Statistical summary
print(data.describe())

# Visualize correlations
sns.pairplot(data, x_vars=["total_bill"], y_vars="tip", height=5, aspect=0.8)
plt.show()


total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000




### Simple Linear Regression

We will predict the `tip` based on the `total_bill`. The formula for simple linear regression is:

\\[ y = \beta_0 + \beta_1 x \\]

Where:
- \\( \beta_0 \\): Intercept (the value of \\( y \\) when \\( x = 0 \\))
- \\( \beta_1 \\): Slope of the regression line (the rate of change of \\( y \\) with respect to \\( x \\))



In [None]:

# Define variables
X = data["total_bill"].values
y = data["tip"].values

# Compute coefficients manually
n = len(X)
X_mean = np.mean(X)
y_mean = np.mean(y)

# Slope (beta_1)
numerator = np.sum((X - X_mean) * (y - y_mean))
denominator = np.sum((X - X_mean)**2)
beta_1 = numerator / denominator

# Intercept (beta_0)
beta_0 = y_mean - beta_1 * X_mean

print(f"Slope (beta_1): {beta_1}")
print(f"Intercept (beta_0): {beta_0}")

# Predict values
y_pred = beta_0 + beta_1 * X
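
The closed-form estimates computed in the cell above can be cross-checked against a library fit. The sketch below does this on a small hypothetical dataset (not the tips data) using `np.polyfit`, which for `deg=1` returns the slope followed by the intercept.

```python
import numpy as np

# Hypothetical toy data, used only to keep the example self-contained
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form estimates, same formulas as in the cell above
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit returns coefficients from highest degree to lowest
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(np.allclose([slope, intercept], [slope_np, intercept_np]))
```

If the two results disagree, the usual suspects are a swapped numerator and denominator or an intercept computed before the slope.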



### Visualize the Regression Line
We will plot the regression line along with the data points.


In [None]:
# Plot the data points and regression line
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.title("Simple Linear Regression")
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.legend()
plt.show()


### Residual Analysis

Residuals represent the differences between the actual and predicted values in a regression model. They are calculated using the formula:

\\[
e_i = y_i - \hat{y}_i
\\]

Where:
- \\( e_i \\): The residual for the \\( i \\)-th observation (an estimate of the unobserved error term \\( \epsilon \\)).
- \\( y_i \\): The actual value of the dependent variable for the \\( i \\)-th observation.
- \\( \hat{y}_i \\): The predicted value of the dependent variable for the \\( i \\)-th observation.

Residual analysis helps evaluate the performance of a regression model by examining patterns in the residuals. Ideally, residuals should be randomly distributed with no discernible pattern.
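
Two properties are worth checking before reading a residual plot: for a least-squares fit that includes an intercept, the residuals sum to zero and are uncorrelated with the predictor. The sketch below verifies both on a small hypothetical dataset; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Least-squares fit with an intercept (closed-form formulas)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# e_i = y_i - y_hat_i
residuals = y - (intercept + slope * x)

# Both quantities should be zero up to floating-point error
print(residuals.sum())
print(np.sum(residuals * (x - x.mean())))
```

Because these two sums vanish by construction, they say nothing about model quality. What the plot can reveal is structure beyond them, such as curvature or a spread that widens with the predicted value.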

In [None]:

# Calculate residuals
residuals = y - y_pred

# Plot residuals: 
# Create a scatter plot of predicted values vs residuals, where the x-axis represents predicted values (y_pred),
# and the y-axis represents the residuals (y - y_pred).
plt.scatter(y_pred, residuals, color='purple')

# Add a horizontal line at y=0: This line helps to visually identify if the residuals are evenly distributed around 0.
plt.axhline(y=0, color='red', linestyle='--')

plt.title("Residual Plot")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()


### Multiple Linear Regression

We will extend the regression model to include multiple features: `total_bill` and `size`. Using matrix algebra, the formula for estimating the coefficients is:

\\[
\beta = (X^T X)^{-1} X^T y
\\]

Where:
- \\( X \\): Matrix of features (including a column of ones for the intercept).
- \\( y \\): Target variable (vector of observed values).
- \\( \beta \\): Vector of coefficients (including the intercept and slopes).

This formula calculates the optimal regression coefficients by minimizing the sum of squared residuals.
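
The explicit inverse matches the formula directly, but the same coefficients can also be obtained without forming \\( (X^T X)^{-1} \\). The sketch below compares three routes on synthetic data: the explicit inverse, `np.linalg.solve` applied to the normal equations, and `np.linalg.lstsq` applied to \\( X \\) and \\( y \\). The data and the "true" coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Design matrix: a column of ones plus two hypothetical features
X = np.c_[np.ones(n), rng.uniform(0, 10, size=(n, 2))]
true_beta = np.array([1.0, 2.0, -0.5])          # assumed for illustration
y = X @ true_beta + rng.normal(0, 0.1, size=n)

# 1. Normal equation with an explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solve (X^T X) beta = X^T y without computing the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. Least-squares solver working on X directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data like this the three agree. The solver-based routes are generally preferred when features are strongly correlated, since inverting a nearly singular matrix amplifies rounding error.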


In [None]:
# Define multiple features:
# X_multi is a matrix containing the features. Here, we are using "total_bill" and "size" columns from the 'data' DataFrame.
# We use .values to convert these into a numpy array.
X_multi = data[["total_bill", "size"]].values

# y_multi is the target variable (dependent variable), here it's the "tip" column.
y_multi = data["tip"].values

# Add a column of ones to X for the intercept term:
# With this column in place, the first entry of beta is the intercept (beta_0)
# and the remaining entries are the slopes for total_bill and size.
X_multi = np.c_[np.ones(X_multi.shape[0]), X_multi]  # Adds a column of 1s to X_multi for the intercept term.

# Compute coefficients using matrix algebra:
# This step calculates the regression coefficients (betas) using the normal equation: 
# β = (X.T * X)^(-1) * X.T * y
# It uses matrix operations to solve for the coefficients that minimize the least squares error.
beta = np.linalg.inv(X_multi.T @ X_multi) @ X_multi.T @ y_multi

# Print the coefficients (including intercept):
print(f"Coefficients: {beta}")

# Predict values:
# Now that we have the coefficients (beta), we can predict the target variable 'tip' using the formula:
# y_pred_multi = X * beta, where X is the matrix of input features, and beta is the vector of coefficients.
y_pred_multi = X_multi @ beta

### Visualize the Multiple Regression Surface
With two predictors the fitted model is a plane rather than a line, so we will plot the regression surface in 3D along with the data points.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


# Create a grid of values for "total_bill" and "size"
total_bill_range = np.linspace(X_multi[:, 1].min(), X_multi[:, 1].max(), 50)
size_range = np.linspace(X_multi[:, 2].min(), X_multi[:, 2].max(), 50)

# Create a meshgrid for the features
total_bill_grid, size_grid = np.meshgrid(total_bill_range, size_range)

# Predict the values of the target variable (tip) for each combination of total_bill and size
X_grid = np.c_[np.ones(total_bill_grid.size), total_bill_grid.ravel(), size_grid.ravel()]
y_pred_grid = X_grid @ beta

# Reshape predicted values back into a grid
y_pred_grid = y_pred_grid.reshape(total_bill_grid.shape)

# Create a 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot the actual data points
ax.scatter(X_multi[:, 1], X_multi[:, 2], y_multi, color='blue', label='Actual data')

# Plot the regression surface
ax.plot_surface(total_bill_grid, size_grid, y_pred_grid, color='red', alpha=0.5)

# Labels and title
ax.set_xlabel('Total Bill')
ax.set_ylabel('Size')
ax.set_zlabel('Tip')
ax.set_title('Multiple Regression (Total Bill vs Size vs Tip)')

# Show the plot
plt.show()



### Residual Analysis for Multiple Linear Regression


In [None]:

# Calculate residuals
residuals_multi = y_multi - y_pred_multi

# Plot residuals
plt.scatter(y_pred_multi, residuals_multi, color='green')
plt.axhline(y=0, color='red', linestyle='--')
plt.title("Residual Plot (Multiple Regression)")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()



### Summary
In this lab, we:
1. Manually implemented a simple linear regression model using mathematical formulas.
2. Extended it to multiple linear regression using matrix algebra.
3. Visualized the regression line and performed residual analysis.


# Practice Questions: Linear Regression

### **Question 1: Simple Linear Regression**
**Task:**  
You have a dataset with two columns: `hours_studied` and `exam_score`. Develop a simple linear regression model to predict the `exam_score` based on the `hours_studied`.  
**Sample data**
data = {'hours_studied': [1, 2, 3, 4, 5], 'exam_score': [50, 55, 60, 65, 70]}

1. Implement simple linear regression using the normal equation:  
   \\[ \beta = (X^T X)^{-1} X^T y \\]
2. Plot the regression line on a scatter plot.
3. Calculate and plot the residuals (difference between actual and predicted values).
4. Evaluate the model's performance by calculating the R-squared value.

**Hints:**
- Add a column of ones to `X` for the intercept term.
- Use `numpy.linalg.inv()` to compute the matrix inverse.
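
R-squared is not defined earlier in the lab, so here is a minimal sketch of one common definition, \\( R^2 = 1 - SS_{res} / SS_{tot} \\). The helper name `r_squared` and the input values are hypothetical and unrelated to the question's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot (hypothetical helper, not part of the lab)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, y_true))       # perfect predictions give 1.0
print(r_squared(y_true, [6.0] * 4))    # always predicting the mean gives 0.0
```

The two calls mark the reference points of the scale: a model that reproduces every value scores 1, and a model no better than the mean scores 0.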

---

### **Question 2: Multiple Linear Regression with Two Features**
**Task:**  
You have a dataset with three columns: `years_of_experience`, `education_level` (numerically coded), and `salary`. Develop a multiple linear regression model to predict the `salary` based on `years_of_experience` and `education_level`.  

**Sample data**
data = {'years_of_experience': [1, 2, 3, 4, 5],
        'education_level': [1, 2, 3, 4, 5],
        'salary': [40000, 45000, 50000, 55000, 60000]}

1. Implement multiple linear regression using the normal equation:  
   \\[ \beta = (X^T X)^{-1} X^T y \\]
2. Plot a 3D regression surface showing how salary is predicted by `years_of_experience` and `education_level`.
3. Plot the residuals for this model.
4. Interpret the coefficients to understand the effect of `years_of_experience` and `education_level` on `salary`.

**Hints:**
- Add a column of ones to `X` for the intercept term.
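- Use `numpy.linalg.inv()` for solving the coefficients. In the sample data above, `years_of_experience` and `education_level` are identical, so \\( X^T X \\) is singular and has no inverse; change one of the two columns or use `numpy.linalg.lstsq()` instead.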
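
Before inverting \\( X^T X \\), it is worth checking that the columns of \\( X \\) are linearly independent. The sketch below runs that check on this question's sample data, where the two predictor columns hold the same values, and falls back to `np.linalg.lstsq`. The explicit `rcond=1e-10` cutoff is an assumption made here for illustration.

```python
import numpy as np

years = np.array([1, 2, 3, 4, 5], dtype=float)
education = np.array([1, 2, 3, 4, 5], dtype=float)   # same values as years
salary = np.array([40000, 45000, 50000, 55000, 60000], dtype=float)

X = np.c_[np.ones(len(years)), years, education]

# lstsq reports the effective rank; rank < number of columns means
# X^T X is singular and the normal equation has no unique solution
beta, _, rank, _ = np.linalg.lstsq(X, salary, rcond=1e-10)
print(rank, X.shape[1])

# lstsq still returns a solution (the minimum-norm one) that fits the data
print(np.allclose(X @ beta, salary))
```

With duplicated columns, the individual coefficients for the two predictors are not identifiable: any split of the combined effect between them gives the same predictions, so they should not be interpreted separately.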

---

### **Question 3: Data Preprocessing and Multiple Linear Regression**
**Task:**  
You have a dataset with the following columns: `age`, `income`, `num_of_children`, and `monthly_spend`. The goal is to predict `monthly_spend`. Before building the regression model, perform the necessary data preprocessing steps, including handling missing values, scaling features, and encoding categorical variables (if any).  

**Sample data**
data = {'age': [25, 30, 35, 40, 45],
        'income': [50000, 60000, 55000, 70000, 75000],
        'num_of_children': [2, 3, 1, 4, 2],
        'monthly_spend': [200, 250, 220, 280, 300]}

1. Handle missing values (e.g., impute with the mean or drop).
2. Scale the numerical features (e.g., Min-Max scaling or Z-score normalization).
3. Implement a multiple linear regression model to predict `monthly_spend`.
4. Visualize the residuals to check the model’s assumptions.
5. Report the model’s performance metrics (e.g., Mean Squared Error).

**Hints:**
- For scaling, you can manually apply normalization using basic arithmetic:  
  \\[ X_{\text{scaled}} = \frac{X - \text{mean}(X)}{\text{std}(X)} \\]
- Use the normal equation to fit the regression model.
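
The preprocessing and evaluation steps can be sketched with NumPy alone. The example below imputes a missing value with the column mean, applies the Z-score formula from the hint, and defines a mean squared error helper. The array values, the inserted `NaN`, and the helper name `mean_squared_error` are hypothetical and are not the question's data.

```python
import numpy as np

# Hypothetical feature matrix with one missing entry
X = np.array([[25.0, 50000.0],
              [30.0, np.nan],
              [35.0, 55000.0],
              [40.0, 70000.0]])

# 1. Impute missing values with the column mean (NaNs ignored)
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

# 2. Z-score scaling: subtract the column mean, divide by the column std
X_scaled = (X_filled - X_filled.mean(axis=0)) / X_filled.std(axis=0)

def mean_squared_error(y_true, y_pred):
    """Average of squared differences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

print(X_scaled.mean(axis=0))   # close to 0 for every column
print(X_scaled.std(axis=0))    # close to 1 for every column
print(mean_squared_error([1, 2, 3], [1, 2, 5]))
```

Note that `np.std` uses the population formula (`ddof=0`) by default, while `pandas` uses the sample formula (`ddof=1`); either works for scaling as long as it is applied consistently.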

---