# Lecture 03: Regression

**Slides:** `03_Regression.pdf`

## What you will learn
- Linear regression with scikit-learn
- Model evaluation and residual analysis
- Regularization intuition (Ridge-style)

## Notes
This notebook uses the UCI Bike Sharing dataset (daily rental counts) and emphasizes interpreting the fitted model.

## How to use this notebook
1. Run the **Setup** cell below (it will detect the repository root and set paths).
2. Run cells top-to-bottom. If a cell takes too long, skim it and continue; the goal is to learn the workflow, not to optimize runtime.

In [None]:
# --- Setup (run this first) ---
from __future__ import annotations

import os
import sys
from pathlib import Path
from typing import Optional

def _find_repo_root(start: Optional[Path] = None) -> Path:
    """Find repo root by walking upwards and looking for common markers."""
    start = (start or Path.cwd()).resolve()
    for p in [start] + list(start.parents):
        if (p / "pyproject.toml").exists() and (p / "src").exists():
            return p
        if (p / "slides").exists() and (p / "notebooks").exists():
            return p
    return start

PROJECT_ROOT = _find_repo_root()
os.chdir(PROJECT_ROOT)

# Make `import aml_course` work without installing the package.
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

# Common paths used in the course.
DATA_DIR = PROJECT_ROOT / "data"
FIGURES_DIR = PROJECT_ROOT / "pictures"
MODELS_DIR = PROJECT_ROOT / "models"

DATA_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

print(f"📁 Project root: {PROJECT_ROOT}")
print(f"📦 Data dir:     {DATA_DIR}")
print(f"🖼️  Figures dir:  {FIGURES_DIR}")
print(f"🤖 Models dir:   {MODELS_DIR}")


## Part 1: Dataset and setup

We load a dataset, perform basic preprocessing, and prepare features/targets for modeling.

In [None]:
import requests
import zipfile
import io
import pandas as pd
# URL of the UCI Bike Sharing dataset archive
url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases"
    "/00275/Bike-Sharing-Dataset.zip"
)
# Send an HTTP GET request to download the archive; fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()

In [None]:
# Create a ZipFile object from the response content
zip_file = zipfile.ZipFile(io.BytesIO(response.content))
# Open 'day.csv' (daily counts); the archive also contains 'hour.csv' (hourly counts)
csv_file = zip_file.open('day.csv')
# Read the CSV data
data = pd.read_csv(csv_file)

In [None]:
# Print the first 5 rows of the data table
data.head()

In [None]:
data['mnth'].value_counts()

In [None]:
from sklearn.model_selection import train_test_split
# Drop the 'dteday' column
data = data.drop('dteday', axis=1)
# Split the data into predictors and target
X = data.drop(['cnt', 'casual', 'registered'], axis=1)
y = data['cnt']
# Train-test split (80% train, 20% test; fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

## Part 2: Baseline model: linear regression

We fit a linear regression model and evaluate it using standard regression metrics.
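
As a minimal sketch of what "standard regression metrics" means here, the snippet below computes MSE, RMSE, and R-squared by hand on a tiny made-up example. The helper name `regression_metrics` and the toy arrays are illustrative assumptions, not part of the course code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))           # mean squared error
    rmse = float(np.sqrt(mse))                     # same units as the target
    ss_res = float(np.sum(residuals ** 2))         # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Toy values chosen only for illustration
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 9.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

RMSE is in the same units as the target (here, rentals per day), which is why the notebook later divides it by the mean of the target to get a rough relative error.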

In [None]:
from sklearn.linear_model import LinearRegression
# Create a LinearRegression object
regressor = LinearRegression()
# Train the model
regressor.fit(X_train, y_train)

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

# Making predictions
y_pred = regressor.predict(X_test)

# Comparing actual result to the predicted result
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.sample(10)

# Visualizing comparison result
df1.plot(kind='bar', figsize=(7,5))

# Save to PDF
plt.savefig("pictures/bike_pred_bar.pdf")



In [None]:
regressor.coef_, regressor.intercept_

In [None]:
xx = X_test.reset_index(drop=True).loc[0].values
yy = y_test.reset_index(drop=True).loc[0]

In [None]:
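import numpy as np
# Reproduce the model's prediction for the first test row by hand:
# dot product of coefficients and features, plus the intercept
yy_pred = np.dot(regressor.coef_, xx) + regressor.intercept_
yy_pred, yy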

In [None]:
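class MyLinearRegression:
    """Ordinary least squares with an intercept (one possible solution)."""

    def __init__(self):
        self.beta = None  # [intercept, coef_1, ..., coef_p] after fitting

    @staticmethod
    def _design(X):
        # Prepend a column of ones so that beta[0] acts as the intercept
        return np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])

    def fit(self, X, y):
        # beta minimizes ||design(X) @ beta - y||^2
        y = np.asarray(y, dtype=float)
        self.beta, *_ = np.linalg.lstsq(self._design(X), y, rcond=None)
        return self

    def predict(self, X):
        return self._design(X) @ self.beta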

my_regressor = MyLinearRegression()
my_regressor.fit(X_train, y_train)
y_pred = my_regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.sample(10)
df1.plot(kind='bar', figsize=(7,5));

In [None]:
# Make predictions
y_train_pred = my_regressor.predict(X_train)
y_test_pred = my_regressor.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse_train = np.mean((y_train - y_train_pred)**2)
mse_test = np.mean((y_test - y_test_pred)**2)

In [None]:
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

In [None]:
rmse_train / np.mean(y_train)

In [None]:
rmse_test / np.mean(y_test)

In [None]:
import numpy as np
from sklearn.metrics import r2_score


# Calculate R-squared
r_squared = r2_score(y_test, y_test_pred)

print(f"R-squared: {r_squared}")

In [None]:
import matplotlib.pyplot as plt

# Plot actual vs predicted values
plt.scatter(y_test, y_test_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# Plot a diagonal line (perfect predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')

plt.title(f'R-squared: {r_squared:.2f}')
plt.show()

## Part 3: Diagnostics: residuals and model fit

Residual analysis is a powerful way to see where a regression model succeeds or fails. We'll visualize residuals and compare actual vs. predicted values.
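
A small sketch of why the shape of the residuals matters more than their average: for a least-squares fit with an intercept, the training residuals have zero mean and are uncorrelated with the fitted values by construction. The data below is synthetic and chosen only for illustration; it is not the bike dataset.

```python
import numpy as np

# Synthetic linear data with Gaussian noise (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=200)

# Fit y ~ b0 + b1 * x by least squares
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
residuals = y - fitted

# Both quantities are zero up to floating-point error on the training data
mean_residual = residuals.mean()
corr_with_fitted = np.corrcoef(fitted, residuals)[0, 1]
print(f"mean residual:          {mean_residual:.2e}")
print(f"corr(fitted, residual): {corr_with_fitted:.2e}")
```

Because these summaries vanish on the training set, the useful signal is in the pattern: curvature, a funnel shape, or isolated outliers in a residual plot, and a nonzero mean residual on held-out data.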

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Residuals plot
plt.figure(figsize=(8, 6))
# plt.subplot(1, 2, 1)
sns.histplot(y_test - y_test_pred, bins=30, kde=True)
plt.title('Distribution of Test Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.savefig('pictures/test_residual.pdf')


In [None]:
import matplotlib.pyplot as plt

# Plot actual vs predicted values
plt.scatter(y_test, y_test_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# Plot a diagonal line (perfect predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')

plt.title(f'R-squared: {r_squared:.2f}')
plt.show()


## Part 4: Regularization intuition (Ridge-style)

We implement a simplified Ridge-like regression to highlight how adding a penalty term can reduce overfitting and improve generalization.
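
For reference, the standard ridge formulation can be sketched as follows. The symbols are assumptions not defined elsewhere in this notebook: $X$ is the design matrix, $y$ the target vector, $\beta$ the coefficient vector, and $\lambda \ge 0$ the penalty strength (playing the role of `lambda_` in the class below).

```latex
\hat{\beta}_{\text{ridge}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 ,
\qquad
\hat{\beta}_{\text{ridge}} = \left( X^\top X + \lambda I \right)^{-1} X^\top y .
```

Setting $\lambda = 0$ recovers ordinary least squares, while larger values shrink the coefficients toward zero. In practice the intercept is usually left out of the penalty.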

In [None]:
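class MyRidgeRegression:
    """Closed-form ridge regression, intercept not penalized (one possible solution)."""

    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_
        self.beta = None  # [intercept, coef_1, ..., coef_p] after fitting

    @staticmethod
    def _design(X):
        # Prepend a column of ones so that beta[0] acts as the intercept
        return np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])

    def fit(self, X, y):
        A = self._design(X)
        penalty = self.lambda_ * np.eye(A.shape[1])
        penalty[0, 0] = 0.0  # do not shrink the intercept
        # Solve (A^T A + penalty) beta = A^T y
        self.beta = np.linalg.solve(A.T @ A + penalty, A.T @ np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        return self._design(X) @ self.beta
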
my_regressor = MyRidgeRegression()
my_regressor.fit(X_train, y_train)
y_pred = my_regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.sample(10)
df1.plot(kind='bar', figsize=(7,5));


In [None]:
# Make predictions
y_train_pred = my_regressor.predict(X_train)
y_test_pred = my_regressor.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse_train = np.mean((y_train - y_train_pred)**2)
mse_test = np.mean((y_test - y_test_pred)**2)

In [None]:
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

In [None]:
rmse_train / np.mean(y_train)

In [None]:
rmse_test / np.mean(y_test)