# Regression and Classification with Linear Models, Decision Trees, and Cross-Validation

**Objective:** Demonstrate end-to-end ML workflows for regression and classification, including regularization, hyperparameter tuning, cross-validation, and overfitting analysis.

**Author:** ML Intern Tutorial  
**Date:** 2025  
**Estimated Time:** 6-8 hours

## 📋 Quickstart

Before running this notebook, ensure you have the required packages installed:

```bash
pip install numpy==1.24.3 pandas==2.0.3 scikit-learn==1.3.0 matplotlib==3.7.2 seaborn==0.12.2 jupyter==1.0.0
```

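Then run all cells from top to bottom. Note that `fetch_california_housing` downloads the dataset on first use, so the first run needs an internet connection.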

## Table of Contents
1. [Introduction and Goals](#introduction)
2. [Setup and Imports](#setup)
3. [Data Loading and EDA](#eda)
4. [Train/Validation/Test Split](#split)
5. [Data Preprocessing](#preprocessing)
6. [Baseline Models](#baseline)
7. [Regression: California Housing](#regression)
8. [Classification: Breast Cancer](#classification)
9. [Conclusions and Next Steps](#conclusions)

## 1. Introduction and Goals <a id='introduction'></a>

### Goals of This Notebook

This notebook demonstrates comprehensive machine learning workflows covering:

1. **Regression Task:** Predict California housing prices using Linear Regression, Ridge, Lasso, and Decision Trees
2. **Classification Task:** Diagnose breast cancer (malignant vs. benign) using Logistic Regression and Decision Trees
3. **Key Concepts:**
   - Regularization techniques (L1, L2) to prevent overfitting
   - Cross-validation for robust model evaluation
   - Hyperparameter tuning using GridSearchCV
   - Learning curves and validation curves for bias-variance analysis
   - Feature importance interpretation

### What You'll Learn

- How to build end-to-end ML pipelines
- When and why to use regularization
- How to detect and mitigate overfitting/underfitting
- Best practices for model selection and evaluation

## 2. Setup and Imports <a id='setup'></a>

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress all warnings to keep the output clean (note: this also hides convergence warnings)
warnings.filterwarnings('ignore')

# Scikit-learn: datasets
from sklearn.datasets import fetch_california_housing, load_breast_cancer

# Scikit-learn: preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scikit-learn: models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.dummy import DummyRegressor, DummyClassifier

# Scikit-learn: model selection and evaluation
from sklearn.model_selection import (
    cross_val_score, 
    KFold, 
    StratifiedKFold,
    GridSearchCV,
    learning_curve,
    validation_curve
)

# Scikit-learn: metrics
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report,
    roc_curve
)

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

print("✓ All libraries imported successfully!")
print(f"Random state set to: {RANDOM_STATE}")

## 3. Data Loading and EDA <a id='eda'></a>

We'll load two datasets:
1. **California Housing** - Regression task (predict median house value)
2. **Breast Cancer Wisconsin** - Classification task (diagnose tumor)

### 3.1 California Housing Dataset (Regression)

In [None]:
# Load California Housing dataset
housing = fetch_california_housing(as_frame=True)
df_housing = housing.frame

print("California Housing Dataset")
print("=" * 50)
print(f"Shape: {df_housing.shape}")
print(f"\nFeatures: {list(housing.feature_names)}")
print(f"\nTarget: {housing.target_names[0]}")
print(f"\nTarget description: Median house value (in $100,000s)")
print("\nFirst 5 rows:")
display(df_housing.head())

In [None]:
# Statistical summary
print("Statistical Summary:")
display(df_housing.describe())

# Check for missing values
print(f"\nMissing values: {df_housing.isnull().sum().sum()}")
if df_housing.isnull().sum().sum() == 0:
    print("✓ No missing values found!")

In [None]:
# Visualize target distribution and feature correlations
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Target distribution
axes[0].hist(df_housing['MedHouseVal'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($100k)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Distribution of Target Variable', fontsize=13, fontweight='bold')
axes[0].axvline(df_housing['MedHouseVal'].mean(), color='red', 
                linestyle='--', label=f"Mean: {df_housing['MedHouseVal'].mean():.2f}")
axes[0].legend()

# Correlation heatmap
corr_matrix = df_housing.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, ax=axes[1], cbar_kws={'label': 'Correlation'})
axes[1].set_title('Feature Correlation Matrix', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

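print("\nFeature correlations with target (MedHouseVal), strongest positive first:")
target_corr = corr_matrix['MedHouseVal'].sort_values(ascending=False)
# Drop the target's self-correlation (always 1.0) by label rather than by position
print(target_corr.drop('MedHouseVal'))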

### 3.2 Breast Cancer Dataset (Classification)

In [None]:
# Load Breast Cancer dataset
cancer = load_breast_cancer(as_frame=True)
df_cancer = cancer.frame

print("Breast Cancer Wisconsin Dataset")
print("=" * 50)
print(f"Shape: {df_cancer.shape}")
print(f"\nNumber of features: {len(cancer.feature_names)}")
print(f"\nTarget classes: {cancer.target_names}")
print(f"  0 = malignant (cancerous)")
print(f"  1 = benign (non-cancerous)")
print(f"\nClass distribution:")
print(df_cancer['target'].value_counts().sort_index())
print(f"\nClass balance: {df_cancer['target'].value_counts(normalize=True).sort_index()}")
print("\nFirst 5 rows (showing first 10 columns):")
display(df_cancer.iloc[:5, :10])

In [None]:
# Statistical summary (first 10 features for brevity)
print("Statistical Summary (first 10 features):")
display(df_cancer.iloc[:, :10].describe())

# Check for missing values
print(f"\nMissing values: {df_cancer.isnull().sum().sum()}")
if df_cancer.isnull().sum().sum() == 0:
    print("✓ No missing values found!")

In [None]:
# Visualize class distribution and top feature correlations
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Class distribution
class_counts = df_cancer['target'].value_counts().sort_index()
axes[0].bar(['Malignant', 'Benign'], class_counts.values, 
            color=['#e74c3c', '#2ecc71'], edgecolor='black', alpha=0.7)
axes[0].set_ylabel('Count', fontsize=11)
axes[0].set_title('Class Distribution', fontsize=13, fontweight='bold')
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v + 5, str(v), ha='center', fontweight='bold')

# Correlation of the 10 "mean" features with the target
# (the "error" and "worst" features are left out for a cleaner plot)
mean_features = [col for col in df_cancer.columns if 'mean' in col]
mean_features.append('target')
corr_with_target = df_cancer[mean_features].corr()['target'].drop('target').sort_values()
# With exactly 10 mean features, the 5 lowest plus the 5 highest cover all of them
top_corr = pd.concat([corr_with_target.head(5), corr_with_target.tail(5)])
colors = ['red' if x < 0 else 'green' for x in top_corr.values]
top_corr.plot(kind='barh', ax=axes[1], color=colors, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Correlation with Target', fontsize=11)
axes[1].set_title('Mean-Feature Correlations with Target', fontsize=13, fontweight='bold')
axes[1].axvline(0, color='black', linewidth=0.8)

plt.tight_layout()
plt.show()

## 4. Train/Validation/Test Split <a id='split'></a>

### Split Strategy

We'll use a **70/15/15 split**:
- **70% Training:** For model training
- **15% Validation:** For hyperparameter tuning and model selection
- **15% Test:** For final unbiased performance evaluation

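**Rationale:** 
- 70% training leaves most of the data for fitting the models
- 15% validation supports hyperparameter tuning and model selection: roughly 3,100 rows for California Housing, but only about 85 for Breast Cancer, so validation scores on the latter are noisier
- 15% test is reserved for the final assessment (touched only once)
- A 60/20/20 split is a common alternative when a small dataset needs larger validation and test sets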

**Important:** We stratify the classification split to maintain class balance across sets.
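
The arithmetic behind the two-stage split can be sketched as follows: holding out 30% and then halving that hold-out leaves 0.30 × 0.50 = 0.15 of the data each for validation and test. The helper name `three_way_split` and the synthetic data are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.15, test_frac=0.15, stratify=False, seed=42):
    """Hypothetical helper: split into train/validation/test in two stages."""
    holdout = val_frac + test_frac  # e.g. 0.30 of the full data
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=holdout, random_state=seed,
        stratify=y if stratify else None,
    )
    # test_frac / holdout is the share of the hold-out that becomes the test set
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=test_frac / holdout, random_state=seed,
        stratify=y_tmp if stratify else None,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data: 1000 rows, binary labels with roughly 30% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.3).astype(int)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo, stratify=True)
print("Set sizes (train, val, test):", len(X_tr), len(X_va), len(X_te))
print("Positive share per set:", round(y_tr.mean(), 3), round(y_va.mean(), 3), round(y_te.mean(), 3))
```

With `stratify=True`, the positive-class share printed for each set should be close to the share in the full data.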

In [None]:
# Split California Housing (Regression)
X_housing = df_housing.drop('MedHouseVal', axis=1)
y_housing = df_housing['MedHouseVal']

# First split: 70% train, 30% temp
X_train_h, X_temp_h, y_train_h, y_temp_h = train_test_split(
    X_housing, y_housing, test_size=0.30, random_state=RANDOM_STATE
)

# Second split: 15% validation, 15% test (from the 30% temp)
X_val_h, X_test_h, y_val_h, y_test_h = train_test_split(
    X_temp_h, y_temp_h, test_size=0.50, random_state=RANDOM_STATE
)

print("California Housing Split:")
print(f"  Training:   {X_train_h.shape[0]} samples ({X_train_h.shape[0]/len(X_housing)*100:.1f}%)")
print(f"  Validation: {X_val_h.shape[0]} samples ({X_val_h.shape[0]/len(X_housing)*100:.1f}%)")
print(f"  Test:       {X_test_h.shape[0]} samples ({X_test_h.shape[0]/len(X_housing)*100:.1f}%)")
print(f"  Total:      {len(X_housing)} samples\n")

In [None]:
# Split Breast Cancer (Classification) - with stratification
X_cancer = df_cancer.drop('target', axis=1)
y_cancer = df_cancer['target']

# First split: 70% train, 30% temp (stratified)
X_train_c, X_temp_c, y_train_c, y_temp_c = train_test_split(
    X_cancer, y_cancer, test_size=0.30, random_state=RANDOM_STATE, stratify=y_cancer
)

# Second split: 15% validation, 15% test (stratified from the 30% temp)
X_val_c, X_test_c, y_val_c, y_test_c = train_test_split(
    X_temp_c, y_temp_c, test_size=0.50, random_state=RANDOM_STATE, stratify=y_temp_c
)

print("Breast Cancer Split:")
print(f"  Training:   {X_train_c.shape[0]} samples ({X_train_c.shape[0]/len(X_cancer)*100:.1f}%)")
print(f"  Validation: {X_val_c.shape[0]} samples ({X_val_c.shape[0]/len(X_cancer)*100:.1f}%)")
print(f"  Test:       {X_test_c.shape[0]} samples ({X_test_c.shape[0]/len(X_cancer)*100:.1f}%)")
print(f"  Total:      {len(X_cancer)} samples\n")

# Verify stratification
print("Class distribution (stratification check):")
print(f"  Original:   {y_cancer.value_counts(normalize=True).sort_index().values}")
print(f"  Training:   {y_train_c.value_counts(normalize=True).sort_index().values}")
print(f"  Validation: {y_val_c.value_counts(normalize=True).sort_index().values}")
print(f"  Test:       {y_test_c.value_counts(normalize=True).sort_index().values}")
print("✓ Class proportions maintained across splits!")

## 5. Data Preprocessing <a id='preprocessing'></a>

### Why Scaling Matters

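**Linear and Logistic Regression** benefit from feature scaling because:
- Iterative solvers (used by Logistic Regression and Lasso) converge faster on scaled features
- L1/L2 penalties act on coefficient magnitudes, so they treat features evenly only when the features share a scale
- Coefficients become directly comparable

Unregularized Linear Regression gives the same predictions with or without scaling; only its coefficients change.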

**Decision Trees** don't require scaling (they make splits based on thresholds, not distances).
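
As a quick check of this claim, the sketch below fits the same tree on raw and standardized copies of a synthetic dataset and compares the predictions. The data and variable names are illustrative assumptions. Agreement is expected (up to floating-point effects) because standardization preserves the ordering of each feature's values, which is all a threshold split depends on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with deliberately mismatched feature scales
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_demo = X_demo * np.array([1.0, 1000.0, 50.0])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same tree settings, fitted once on raw and once on standardized features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo_scaled, y_demo)

same = np.allclose(tree_raw.predict(X_demo), tree_scaled.predict(X_demo_scaled))
print(f"Predictions match on raw vs. scaled features: {same}")
```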

**Critical:** We fit the scaler on training data only and transform all sets to prevent data leakage.
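
One way to keep this guarantee during cross-validation, sketched below on synthetic data with an arbitrarily chosen `Ridge` model (both are assumptions for illustration), is to place the scaler inside a scikit-learn pipeline. `cross_val_score` then refits the scaler on the training portion of every fold, so statistics from the held-out fold never reach it.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set (illustrative only)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)

# Scaler and model are fitted together, separately within each CV fold
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")

print(f"R² per fold: {scores.round(3)}")
print(f"Mean R²: {scores.mean():.3f}")
```

The same pipeline object can be passed to `GridSearchCV`, in which case parameters are addressed with the step-name prefix (for example `ridge__alpha`).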

In [None]:
# Scale California Housing features
scaler_housing = StandardScaler()

# Fit on training data only!
X_train_h_scaled = scaler_housing.fit_transform(X_train_h)
X_val_h_scaled = scaler_housing.transform(X_val_h)
X_test_h_scaled = scaler_housing.transform(X_test_h)

print("California Housing - Feature Scaling:")
print(f"  Original mean (train): {X_train_h.mean().mean():.3f}")
print(f"  Original std (train):  {X_train_h.std().mean():.3f}")
print(f"  Scaled mean (train):   {X_train_h_scaled.mean():.6f}")
print(f"  Scaled std (train):    {X_train_h_scaled.std():.3f}")
print("✓ Features standardized (mean≈0, std≈1)\n")

In [None]:
# Scale Breast Cancer features
scaler_cancer = StandardScaler()

# Fit on training data only!
X_train_c_scaled = scaler_cancer.fit_transform(X_train_c)
X_val_c_scaled = scaler_cancer.transform(X_val_c)
X_test_c_scaled = scaler_cancer.transform(X_test_c)

print("Breast Cancer - Feature Scaling:")
print(f"  Original mean (train): {X_train_c.mean().mean():.3f}")
print(f"  Original std (train):  {X_train_c.std().mean():.3f}")
print(f"  Scaled mean (train):   {X_train_c_scaled.mean():.6f}")
print(f"  Scaled std (train):    {X_train_c_scaled.std():.3f}")
print("✓ Features standardized (mean≈0, std≈1)")

## 6. Baseline Models <a id='baseline'></a>

### Why Baselines Matter

Baselines provide context for model performance:
- **Regression baseline:** Predicts the mean of training targets
- **Classification baseline:** Predicts the majority class

Any model worse than the baseline is essentially useless!
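
To make the regression baseline concrete: R² = 1 − SS_res / SS_tot, and a mean predictor has SS_res = SS_tot on the data it was fitted on. Its R² there is therefore 0, and its RMSE equals the standard deviation of the target. The sketch below checks this on made-up target values (an assumption for illustration).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets; the features are irrelevant to a mean predictor
rng = np.random.default_rng(42)
y_demo = rng.normal(loc=2.0, scale=1.0, size=200)
X_demo = np.zeros((200, 1))

dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)

r2 = r2_score(y_demo, pred)
rmse = np.sqrt(mean_squared_error(y_demo, pred))

print(f"R² on fitting data:   {r2:.6f}")
print(f"RMSE on fitting data: {rmse:.4f}")
print(f"Std of target:        {y_demo.std():.4f}")
```

On a held-out set the baseline R² is usually slightly negative, because the training mean rarely matches the held-out mean exactly.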

In [None]:
# Regression baseline: mean predictor
baseline_reg = DummyRegressor(strategy='mean')
baseline_reg.fit(X_train_h_scaled, y_train_h)
y_pred_baseline_reg = baseline_reg.predict(X_val_h_scaled)

baseline_rmse = np.sqrt(mean_squared_error(y_val_h, y_pred_baseline_reg))
baseline_mae = mean_absolute_error(y_val_h, y_pred_baseline_reg)
baseline_r2 = r2_score(y_val_h, y_pred_baseline_reg)

print("Regression Baseline (Mean Predictor):")
print(f"  RMSE: {baseline_rmse:.4f}")
print(f"  MAE:  {baseline_mae:.4f}")
print(f"  R²:   {baseline_r2:.4f}")
print(f"\nInterpretation: Always predicting {y_train_h.mean():.2f} gives R²={baseline_r2:.4f}")
print("Any useful model should significantly outperform this!\n")

In [None]:
# Classification baseline: majority class predictor
baseline_clf = DummyClassifier(strategy='most_frequent')
baseline_clf.fit(X_train_c_scaled, y_train_c)
y_pred_baseline_clf = baseline_clf.predict(X_val_c_scaled)

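baseline_acc = accuracy_score(y_val_c, y_pred_baseline_clf)
# Note: f1_score treats class 1 (benign) as the positive class by default,
# so a baseline that always predicts benign still gets a fairly high F1
baseline_f1 = f1_score(y_val_c, y_pred_baseline_clf)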

print("Classification Baseline (Majority Class Predictor):")
print(f"  Accuracy: {baseline_acc:.4f}")
print(f"  F1 Score: {baseline_f1:.4f}")
print(f"\nInterpretation: Always predicting class {y_train_c.mode()[0]} gives {baseline_acc:.2%} accuracy")
print("This is because the dataset is imbalanced (~63% benign)")
print("A good model should achieve much higher accuracy and F1!")
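
Because `f1_score` treats class 1 (benign) as the positive class by default, a majority-class baseline can score a fairly high F1 despite learning nothing. The sketch below contrasts this with the F1 for the malignant class, obtained via `pos_label=0`. It uses the full dataset rather than this notebook's validation split, an assumption made to keep it self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Majority-class predictor: always outputs the most frequent label (1 = benign)
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)

f1_benign = f1_score(y_demo, pred, pos_label=1)
# zero_division=0 silences the warning raised when a class is never predicted
f1_malignant = f1_score(y_demo, pred, pos_label=0, zero_division=0)

print(f"F1 with benign as positive:    {f1_benign:.3f}")
print(f"F1 with malignant as positive: {f1_malignant:.3f}")
```

For a diagnostic task, the malignant-class score is arguably the more informative of the two, since missing a malignant tumor is the costlier error.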

## 7. Regression: California Housing <a id='regression'></a>

We'll train and compare:
1. **Linear Regression** (no regularization)
2. **Ridge Regression** (L2 regularization)
3. **Lasso Regression** (L1 regularization)
4. **Decision Tree Regressor**
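
To illustrate how the two penalties differ before training on the real data, the sketch below fits Ridge and Lasso on synthetic data in which only 3 of 10 features carry signal (the data and `alpha` values are illustrative assumptions). The L1 penalty tends to set uninformative coefficients exactly to zero, whereas the L2 penalty only shrinks them. Note that scikit-learn scales the two objectives differently (Lasso divides the squared error by twice the sample count, Ridge does not), so equal `alpha` values do not imply equal regularization strength.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which influence the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=2.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(f"Ridge coefficients equal to zero: {n_zero_ridge} of {ridge.coef_.size}")
print(f"Lasso coefficients equal to zero: {n_zero_lasso} of {lasso.coef_.size}")
```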

### 7.1 Train Initial Models

In [None]:
# Initialize models
models_reg = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=RANDOM_STATE),
    'Lasso Regression': Lasso(alpha=0.1, random_state=RANDOM_STATE),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=RANDOM_STATE)
}

# Train and evaluate each model
results_reg = {}

for name, model in models_reg.items():
    model.fit(X_train_h_scaled, y_train_h)
    y_val_pred = model.predict(X_val_h_scaled)
    rmse = np.sqrt(mean_squared_error(y_val_h, y_val_pred))
    mae = mean_absolute_error(y_val_h, y_val_pred)
    r2 = r2_score(y_val_h, y_val_pred)
    results_reg[name] = {'RMSE': rmse, 'MAE': mae, 'R²': r2}

# Display results
results_df_reg = pd.DataFrame(results_reg).T
results_df_reg = results_df_reg.sort_values(by='R²', ascending=False)

print("Regression Model Performance on Validation Set:")
display(results_df_reg)

### 7.2 Hyperparameter Tuning with GridSearchCV