<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Bike%20Sharing/Bike%20Sharing%20-%20Regression%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bike Sharing - Regression Example

This notebook demonstrates the **Universal ML Workflow** applied to a **regression problem** - predicting continuous numerical values instead of categories.

## Learning Objectives

By the end of this notebook, you will be able to:
- Apply neural networks to **regression** (predicting continuous values)
- Understand key differences between regression and classification:
  - Output layer: Linear activation vs. Softmax/Sigmoid
  - Loss function: MSE/MAE vs. Cross-entropy
  - Metrics: MAE, R² vs. Accuracy, Precision, Recall
- Handle mixed feature types (categorical + numerical) for regression problems
- Use **Hyperband** for efficient hyperparameter tuning
- Apply **Dropout + L2 regularisation** to prevent overfitting

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [UCI Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) |
| **Problem Type** | Regression |
| **Target Variable** | `cnt` - Total bike rental count |
| **Data Type** | Structured Tabular (Mixed Categorical & Numerical) |
| **Features** | Weather, date/time, and environmental variables |
| **Target Range** | 22 to 8,714 rentals per day |

---

## Technique Scope

This notebook uses only techniques from **Chapters 1–4** of *Deep Learning with Python* (Chollet, 2021). This means:

| Technique | Status | Rationale |
|-----------|--------|-----------|
| **Dense layers (DNN)** | ✓ Used | Core building block (Ch. 3-4) |
| **Dropout** | ✓ Used | Regularisation technique (Ch. 4) |
| **L2 regularisation** | ✓ Used | Weight penalty (Ch. 4) |
| **Early stopping** | ✗ Not used | Introduced in Ch. 7 |
| **CNN** | ✗ Not used | Introduced in Ch. 8 |
| **RNN/LSTM** | ✗ Not used | Introduced in Ch. 10 |

We demonstrate that **Dropout + L2 regularisation** alone can effectively prevent overfitting without requiring early stopping.

---

## Regression vs. Classification

| Aspect | Regression | Classification |
|--------|------------|----------------|
| **Output** | Continuous number (e.g., 542 bikes) | Category (e.g., "spam") |
| **Output Activation** | Linear (none) | Softmax/Sigmoid |
| **Loss Function** | MSE, MAE, Huber | Cross-entropy |
| **Metrics** | MAE, R² | Accuracy, F1, AUC |
| **Class Weights** | Not applicable | Used for imbalanced data |
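
To make the "Output Activation" row concrete, here is a minimal NumPy sketch of what each activation does to the same raw layer outputs. The input values are made up for illustration, and the helper names (`linear`, `sigmoid`, `softmax`) are hypothetical stand-ins rather than the Keras implementations.

```python
import numpy as np

def linear(z):
    """Identity: what a Dense layer with no activation returns."""
    return z

def sigmoid(z):
    """Squashes each value independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turns a vector into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

raw_outputs = np.array([-2.0, 0.5, 3.0])  # made-up pre-activation values

regression_output = linear(raw_outputs)   # unbounded, usable as a count
binary_probs = sigmoid(raw_outputs)       # each value is a probability
class_probs = softmax(raw_outputs)        # one distribution over 3 classes

print("linear :", regression_output)
print("sigmoid:", np.round(binary_probs, 3))
print("softmax:", np.round(class_probs, 3))
```

Only the linear output can represent a value such as 542 bikes, which is why the regression models in this notebook leave the final `Dense` layer without an activation.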

---

## 1. Defining the Problem and Assembling a Dataset

The first step in any machine learning project is to clearly define the problem and understand the data.

**Problem Statement:** Predict the total number of bike rentals for a given day based on weather and calendar features.

**Why this problem is interesting:**
- **Business value:** Bike sharing companies need to plan bike distribution across stations
- **Operational planning:** Accurate demand prediction helps with maintenance scheduling
- **Feature richness:** Multiple factors (weather, holidays, seasons) affect demand

**Data Source:** This dataset contains 2 years (2011-2012) of daily bike rental data from the Capital Bikeshare system in Washington, D.C.

## 2. Choosing a Measure of Success

### Regression Metrics

For regression problems, we use different metrics than classification:

| Metric | Formula | Interpretation | When to Use |
|--------|---------|----------------|-------------|
| **MAE** | Mean(\|y - ŷ\|) | Average error in original units | Primary metric - interpretable |
| **R²** | 1 - SS_res/SS_tot | Proportion of variance explained (at most 1; can be negative on held-out data) | Secondary metric - normalised comparison |

**Why MAE as our primary metric?**
- It's interpretable: "On average, we're off by X bikes"
- It's less sensitive to outliers than MSE/RMSE, because errors are not squared
- It's in the same units as the target variable
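
As a small illustration of the outlier point, the sketch below compares MAE with RMSE on two made-up sets of prediction errors that differ only in one extreme day. The numbers are invented for the example and do not come from the bike sharing data.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a vector of (actual - predicted) values."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root mean squared error of the same vector."""
    return float(np.sqrt(np.mean(np.square(errors))))

typical = np.array([10.0, 10.0, 10.0, 10.0, 10.0])        # every day off by 10 bikes
with_outlier = np.array([10.0, 10.0, 10.0, 10.0, 500.0])  # one badly missed day

print(f"typical      -> MAE={mae(typical):.1f}, RMSE={rmse(typical):.1f}")
print(f"with outlier -> MAE={mae(with_outlier):.1f}, RMSE={rmse(with_outlier):.1f}")
```

Both metrics agree when all errors are equal, but the single large miss pulls RMSE up far more than MAE because it is squared before averaging.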

**Why R² as secondary metric?**
- Provides a normalised, unit-free view (1 is a perfect fit, 0 matches always predicting the mean) useful for comparing across different datasets
- Shows what proportion of variance the model explains
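
The two formulas in the table can be checked by hand. The following is a minimal sketch on toy targets (the values are invented); it also shows that R² is exactly 0 for a constant mean prediction and drops below 0 for a model that does worse than the mean.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean(|y - y_hat|), in the units of the target."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([100.0, 200.0, 300.0, 400.0])        # toy rental counts
good_pred = np.array([110.0, 190.0, 310.0, 390.0])     # off by 10 each day
mean_pred = np.full_like(y_true, y_true.mean())        # always predict 250
bad_pred = y_true[::-1].copy()                         # predictions in reverse order

for name, pred in [("good", good_pred), ("mean", mean_pred), ("bad", bad_pred)]:
    print(f"{name:>4}: MAE={mae(y_true, pred):6.1f}  R2={r2(y_true, pred):6.3f}")
```

scikit-learn's `mean_absolute_error` and `r2_score`, used later in the notebook, compute the same quantities.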

### References

- Chai, T. and Draxler, R.R. (2014) 'Root mean square error (RMSE) or mean absolute error (MAE)?', *Geoscientific Model Development*, 7(3), pp. 1247–1250.

- Willmott, C.J. and Matsuura, K. (2005) 'Advantages of the mean absolute error (MAE) over the root mean square error (RMSE)', *Climate Research*, 30(1), pp. 79–82.

## 3. Deciding on an Evaluation Protocol

### Hold-Out vs K-Fold Cross-Validation

The choice between hold-out and K-fold depends on **dataset size** and **computational cost**:

| Dataset Size | Recommended Method | Rationale |
|--------------|-------------------|-----------|
| < 1,000 | K-Fold (K=5 or 10) | High variance with small hold-out sets |
| 1,000 – 10,000 | K-Fold or Hold-Out | Either works; K-fold more robust |
| > 10,000 | Hold-Out | Sufficient data; K-fold computationally expensive |

### Data Split Strategy (This Notebook)

With only **731 samples**, this dataset falls in the smallest band of the table above (and far below the 10,000-sample hold-out threshold), so we use **K-Fold Cross-Validation**:

```
Original Data (731 samples)
├── Test Set (10% = ~73 samples) - Final evaluation only
└── Training Pool (90% = ~658 samples)
    └── 5-Fold Cross-Validation
        ├── Fold 1: Train on folds 2-5, validate on fold 1
        ├── Fold 2: Train on folds 1,3-5, validate on fold 2
        ├── ...
        └── Fold 5: Train on folds 1-4, validate on fold 5
```

**Why K-Fold for small datasets?**
- Reduces variance in performance estimates
- Every sample in the training pool is used for validation exactly once and for training in the other folds
- More reliable model comparison

**Important:** For regression, we don't use `stratify` (no classes to balance). We use random shuffling.
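
The split diagram above can be sketched with plain NumPy index arithmetic. This is an illustration only: the notebook itself uses scikit-learn's `train_test_split` and `KFold`, and the rounding rule for the test-set size below is an assumption, so the real split may differ by a sample.

```python
import numpy as np

N_SAMPLES = 731
TEST_FRACTION = 0.1
N_FOLDS = 5

rng = np.random.default_rng(204)
indices = rng.permutation(N_SAMPLES)  # random shuffle, no stratification

# Hold out the test set first (rounding rule is an assumption of this sketch)
n_test = int(round(N_SAMPLES * TEST_FRACTION))
test_idx = indices[:n_test]
pool_idx = indices[n_test:]

# Cut the training pool into N_FOLDS nearly equal validation folds
folds = np.array_split(pool_idx, N_FOLDS)

for k, val_idx in enumerate(folds, start=1):
    # Train on every fold except the k-th
    train_idx = np.concatenate([f for j, f in enumerate(folds, start=1) if j != k])
    print(f"Fold {k}: train={len(train_idx)}, validate={len(val_idx)}")

print(f"Test set: {len(test_idx)} samples, never touched during cross-validation")
```

Each sample in the pool lands in exactly one validation fold, while the test indices stay outside the loop entirely.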

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning*. 2nd edn. New York: Springer.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

## 4. Preparing Your Data

### 4.1 Import Libraries and Set Random Seed

We set random seeds for reproducibility - this makes repeated runs of the notebook as consistent as possible, although GPU kernels and library versions can still introduce small run-to-run differences.

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Load and Explore the Dataset

Let's download the bike sharing data from Google Drive and examine its structure.

In [None]:
# Load data directly from Google Drive
GDRIVE_FILE_ID = '1H2d1H5ASQis4FcJSyh1REvSTdU2TVerk'
DATA_URL = f'https://drive.google.com/uc?id={GDRIVE_FILE_ID}&export=download'

df = pd.read_csv(DATA_URL)

print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Examine the dataset
df.describe()

In [None]:
# Define feature columns
NUMERICAL_VARIABLES = ['temp', 'atemp', 'hum', 'windspeed']
CATEGORICAL_VARIABLES = ['season', 'holiday', 'weekday', 'workingday', 'weathersit']

TARGET_VARIABLE = 'cnt'

# Examine target variable distribution
print(f"Target variable: {TARGET_VARIABLE}")
print(f"  Min: {df[TARGET_VARIABLE].min():,}")
print(f"  Max: {df[TARGET_VARIABLE].max():,}")
print(f"  Mean: {df[TARGET_VARIABLE].mean():,.0f}")
print(f"  Std: {df[TARGET_VARIABLE].std():,.0f}")

### 4.3 Define Features and Target

In [None]:
features = df[NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES]
target = df[TARGET_VARIABLE]

print(f"Features shape: {features.shape}")
print(f"Target shape: {target.shape}")

### 4.4 Split Data into Train and Test Sets

We reserve 10% of the data for final testing. Note: For regression, we don't use `stratify` (no classes to balance).

In [None]:
TEST_SIZE = 0.1

X_train_raw, X_test_raw, y_train_full, y_test = train_test_split(
    features, target, 
    test_size=TEST_SIZE, 
    shuffle=True, random_state=SEED
)

print(f"Training set: {len(X_train_raw):,} samples")
print(f"Test set: {len(X_test_raw):,} samples")

### 4.5 Preprocessing with ColumnTransformer

We use `ColumnTransformer` to apply different preprocessing to different feature types:

```
ColumnTransformer
├── Categorical Features → One-Hot Encoding
│   (Creates binary columns for each category)
└── Numerical Features → Standard Scaling
    (Mean=0, Std=1)
```

**Important:** We fit the preprocessor only on training data to prevent data leakage.
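
To show what "fit on training data only" means in practice, here is a hand-rolled NumPy sketch of the two transformations on toy values. The arrays are invented, and the `one_hot` helper is a hypothetical stand-in for `OneHotEncoder(handle_unknown="ignore")`, which likewise encodes an unseen category as an all-zero row.

```python
import numpy as np

# --- Numerical feature: standard scaling ---
train_temp = np.array([0.2, 0.4, 0.6, 0.8])  # toy training values
test_temp = np.array([0.5, 1.0])             # toy test values

# Statistics come from the training data only
mean, std = train_temp.mean(), train_temp.std()

train_scaled = (train_temp - mean) / std
test_scaled = (test_temp - mean) / std       # reuse training statistics

# --- Categorical feature: one-hot encoding ---
train_season = np.array([1, 2, 3, 1])
test_season = np.array([2, 4])               # 4 never appears in training

categories = np.unique(train_season)         # learned from training data only

def one_hot(values, categories):
    """One column per known category; unknown values give an all-zero row."""
    return (values[:, None] == categories[None, :]).astype(float)

print("scaled train:", np.round(train_scaled, 3))
print("scaled test :", np.round(test_scaled, 3))
print("one-hot test:\n", one_hot(test_season, categories))
```

The test set is transformed with the training mean, standard deviation, and category list, so nothing about the test distribution leaks into the model's inputs.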

In [None]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer([
    ('one-hot-encoder', OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_VARIABLES),
    ('standard_scaler', StandardScaler(), NUMERICAL_VARIABLES)
])

# Fit on training data only (prevent data leakage)
preprocessor.fit(X_train_raw)

# Transform both sets
X_train_full = preprocessor.transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)

# Convert target to numpy arrays
y_train_full = y_train_full.values
y_test = y_test.values

print(f"Preprocessed training shape: {X_train_full.shape}")
print(f"Preprocessed test shape: {X_test.shape}")

## 5. Developing a Model That Does Better Than a Baseline

Before building complex models, we need to establish **baseline performance**. This gives us a reference point to know if our model is actually learning something useful.

### 5.1 Examine Target Distribution

In [None]:
# =============================================================================
# DATA-DRIVEN ANALYSIS: Dataset Size & Target Distribution
# =============================================================================

# Dataset size analysis
n_samples = len(df)
HOLDOUT_THRESHOLD = 10000  # Use hold-out if samples > 10,000

# Target distribution analysis
target_mean = target.mean()
target_std = target.std()
target_range = target.max() - target.min()

print("=" * 60)
print("DATA-DRIVEN CONFIGURATION")
print("=" * 60)
print(f"\n1. DATASET SIZE: {n_samples:,} samples")
print(f"   Threshold: {HOLDOUT_THRESHOLD:,} samples")
print(f"   Decision: {'Hold-Out' if n_samples > HOLDOUT_THRESHOLD else 'K-Fold Cross-Validation'}")

print(f"\n2. TARGET DISTRIBUTION:")
print(f"   Mean: {target_mean:,.0f} bikes/day")
print(f"   Std: {target_std:,.0f} bikes/day")
print(f"   Range: {target.min():,} to {target.max():,}")

print("\n" + "=" * 60)
print("PRIMARY METRIC: MAE (interpretable in original units)")
print("VALIDATION: 5-Fold Cross-Validation")
print("=" * 60)

### 5.2 Calculate Baseline Metrics

**Regression Baselines:**
- **Mean Baseline:** Always predict the mean of training data
- This gives us a reference MAE to beat
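
As a rough illustration of the mean baseline, the sketch below learns the mean from toy training targets and scores it on both the training targets and a separate held-out set. The numbers are invented; scoring on held-out targets is shown as one reasonable option for a like-for-like comparison with validation MAE, not as the notebook's exact procedure.

```python
import numpy as np

y_train = np.array([100, 200, 300, 400])  # toy training targets (integers, like `cnt`)
y_val = np.array([150, 300, 450])         # toy held-out targets

# The "model" is a single number learned from the training targets
baseline = y_train.mean()

# Subtracting a float keeps the arithmetic in floating point
train_mae = float(np.mean(np.abs(y_train - baseline)))
val_mae = float(np.mean(np.abs(y_val - baseline)))

print(f"Baseline prediction: {baseline:.0f} bikes")
print(f"MAE on training targets: {train_mae:.1f}")
print(f"MAE on held-out targets: {val_mae:.1f}")
```

Any trained model should achieve a clearly lower MAE than this constant prediction to be considered useful.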

In [None]:
# Baseline: always predict the mean
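baseline_prediction = y_train_full.mean()
# Build a float array explicitly: np.full_like would inherit the integer dtype
# of the targets and silently truncate the mean
baseline_mae = mean_absolute_error(y_train_full, np.full(y_train_full.shape, baseline_prediction))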

print(f"Baseline (always predict mean = {baseline_prediction:,.0f}):")
print(f"  MAE: {baseline_mae:,.0f} bikes")

### 5.3 Configure K-Fold Cross-Validation

Since our dataset has only 731 samples (below the 10,000 threshold), we use **5-Fold Cross-Validation** instead of a simple hold-out validation split.

In [None]:
# =============================================================================
# K-FOLD CROSS-VALIDATION SETUP
# =============================================================================
N_FOLDS = 5

kfold = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

print(f"K-Fold Configuration:")
print(f"  Number of folds: {N_FOLDS}")
print(f"  Training pool: {X_train_full.shape[0]:,} samples")
print(f"  Samples per fold: ~{X_train_full.shape[0] // N_FOLDS:,}")
print(f"  Test set (held out): {X_test.shape[0]:,} samples")

# For initial model development, we use the first fold
# Final evaluation will use all folds
first_fold = list(kfold.split(X_train_full))[0]
train_idx, val_idx = first_fold

X_train = X_train_full[train_idx]
X_val = X_train_full[val_idx]
y_train = y_train_full[train_idx]
y_val = y_train_full[val_idx]

print(f"\nFirst fold (for initial development):")
print(f"  Training: {X_train.shape[0]:,} samples")
print(f"  Validation: {X_val.shape[0]:,} samples")

### 5.4 Configure Training Parameters

**Key training settings for regression:**
- **Optimiser:** Adam - adaptive learning rate optimiser
- **Loss:** MSE (Mean Squared Error) - standard loss for regression
- **Output Activation:** Linear (none) - allows any real number output
- **Primary Metric:** MAE - computed separately after training
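
To see what a linear output trained with MSE amounts to, here is a minimal NumPy sketch of linear regression fitted by gradient descent on synthetic data. It uses plain gradient descent rather than Adam to keep the update rule visible, and the data, learning rate, and step count are all assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data: y = X @ true_w + true_b
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y = X @ true_w + true_b

# A single linear unit: prediction = X @ w + b (no activation)
w = np.zeros(3)
b = 0.0
learning_rate = 0.1
losses = []

for _ in range(500):
    pred = X @ w + b
    err = pred - y
    losses.append(float(np.mean(err ** 2)))  # MSE loss

    # Gradients of the MSE with respect to w and b
    grad_w = 2.0 * (X.T @ err) / len(y)
    grad_b = 2.0 * err.mean()

    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("learned w:", np.round(w, 3), " learned b:", round(b, 3))
print(f"MSE: first step {losses[0]:.3f} -> last step {losses[-1]:.2e}")
```

A Keras model with one `Dense(1)` layer and no activation has the same functional form; only the optimiser differs.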

In [None]:
INPUT_DIMENSION = X_train.shape[1]
OUTPUT_DIMENSION = 1  # Regression: single continuous output

OPTIMIZER = 'adam'
LOSS_FUNC = 'mse'  # Mean Squared Error for regression

# Training metrics (tracked by Keras during training)
# Note: MAE (our primary metric) is also tracked for monitoring
METRICS = ['mae']

print(f"Input dimension: {INPUT_DIMENSION}")
print(f"Output dimension: {OUTPUT_DIMENSION}")

In [None]:
# Single-Layer Perceptron (no hidden layers) - Linear Regression Baseline
# Note: No activation on output layer = linear output for regression
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
slp_model.add(Dense(OUTPUT_DIMENSION))  # Linear activation (default)
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# =============================================================================
# TRAINING CONFIGURATION
# =============================================================================

BATCH_SIZE = 32  # Smaller batch size for small dataset (731 samples)

# We use DIFFERENT epoch counts for different training phases:
#
# EPOCHS_BASELINE (100): For SLP and unregularised DNN
#   - SLP converges quickly (simple model)
#   - Unregularised DNN: 100 epochs clearly shows overfitting
#
# EPOCHS_REGULARIZED (150): For DNN with Dropout + L2
#   - Regularisation slows down learning
#   - With regularisation, longer training is SAFER (lower, though not zero, overfitting risk)

EPOCHS_BASELINE = 100      # SLP and DNN (no regularisation)
EPOCHS_REGULARIZED = 150   # DNN with Dropout + L2

In [None]:
# Train the Single-Layer Perceptron (Linear Regression)
history_slp = slp_model.fit(
    X_train, y_train, 
    batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
    validation_data=(X_val, y_val),
    verbose=0
)
val_score_slp = slp_model.evaluate(X_val, y_val, verbose=0)

In [None]:
# Display SLP validation metrics
preds_slp_val = slp_model.predict(X_val, verbose=0).flatten()
mae_slp_val = mean_absolute_error(y_val, preds_slp_val)
r2_slp_val = r2_score(y_val, preds_slp_val)

print('MAE (Validation): {:.0f} bikes (baseline={:.0f})'.format(mae_slp_val, baseline_mae))
print('R² (Validation): {:.4f}'.format(r2_slp_val))
print(f'\nMAE: {mae_slp_val:.0f}  ← Primary Metric')

In [None]:
def plot_training_history(history, title=None):
    """
    Plot training and validation metrics over epochs for regression.
    Plots: (1) Loss (MSE), (2) MAE
    
    Parameters:
    -----------
    history : keras History object
        Training history from model.fit()
    title : str, optional
        Model name to display in plot titles
    """
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    epochs = range(1, len(history.history['loss']) + 1)
    title_suffix = f' ({title})' if title else ''

    # Plot 1: Loss (MSE)
    axs[0].plot(epochs, history.history['loss'], 'b-', label='Training', linewidth=1.5)
    axs[0].plot(epochs, history.history['val_loss'], 'r-', label='Validation', linewidth=1.5)
    axs[0].set_title(f'Loss (MSE){title_suffix}')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('MSE')
    axs[0].legend()
    axs[0].grid(alpha=0.3)

    # Plot 2: MAE
    axs[1].plot(epochs, history.history['mae'], 'b-', label='Training', linewidth=1.5)
    axs[1].plot(epochs, history.history['val_mae'], 'r-', label='Validation', linewidth=1.5)
    axs[1].set_title(f'MAE{title_suffix}')
    axs[1].set_xlabel('Epochs')
    axs[1].set_ylabel('MAE (bikes)')
    axs[1].legend()
    axs[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

In [None]:
# Plot SLP training history
plot_training_history(history_slp, title='SLP Baseline (Linear Regression)')

## 6. Scaling Up: Developing a Model That Overfits

The next step in the Universal ML Workflow is to build a model with **enough capacity to overfit**. If a model can't overfit, it may be too simple to learn the patterns in the data.

**Strategy:** Add hidden layers to capture non-linear relationships between features and bike demand.

**No regularisation applied:** We intentionally train this model **without any regularisation** to observe overfitting behaviour.

---

### Architecture Design Decisions

**Why 64 neurons in the hidden layer?**

This balances capacity and efficiency for our small dataset (731 samples):
- **Too few (e.g., 8):** May not capture non-linear patterns in weather/demand relationships
- **Too many (e.g., 256):** High overfitting risk with small data
- **64 neurons:** Reasonable capacity for tabular regression tasks

**Why only 1 hidden layer instead of 2-3?**

Per the **Universal ML Workflow**, the goal of this step is to demonstrate that the model *can* overfit—proving it has sufficient capacity to capture the underlying patterns. Once overfitting is observed:

1. **Capacity is proven sufficient:** If the model overfits, it can learn the training data's complexity
2. **No need for more depth:** Adding layers would increase overfitting further without benefit
3. **Regularise, don't expand:** The next step (Section 7) is to *reduce* overfitting through regularisation

*"The right question is not 'How many layers?' but 'Can it overfit?' If yes, regularise. If no, add capacity."*
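
One rough way to reason about capacity is to count trainable parameters and compare them with the number of training samples. The sketch below does that arithmetic; the input width of 22 is an assumption (18 one-hot columns plus 4 numerical features), and the notebook prints the real value as `INPUT_DIMENSION`.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

N_FEATURES = 22      # assumed width after preprocessing, see lead-in
HIDDEN_UNITS = 64

slp_total = dense_params(N_FEATURES, 1)             # linear baseline
hidden_layer = dense_params(N_FEATURES, HIDDEN_UNITS)
output_layer = dense_params(HIDDEN_UNITS, 1)
dnn_total = hidden_layer + output_layer

print(f"SLP parameters: {slp_total}")
print(f"DNN parameters: {hidden_layer} + {output_layer} = {dnn_total}")
```

Under that assumption the one-hidden-layer network has more parameters than there are rows in the training pool, which is consistent with it having enough capacity to overfit.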

### 6.1 Build a Deep Neural Network (DNN)

In [None]:
# Deep Neural Network (1 hidden layer, no regularisation for overfitting demo)
dnn_model = Sequential(name='Deep_Neural_Network')
dnn_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
dnn_model.add(Dense(64, activation='relu'))
dnn_model.add(Dense(OUTPUT_DIMENSION))  # Linear activation for regression
dnn_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

dnn_model.summary()

In [None]:
# Train the Deep Neural Network (without regularisation to demonstrate overfitting)
history_dnn = dnn_model.fit(
    X_train, y_train, 
    batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
    validation_data=(X_val, y_val), 
    verbose=0
)
val_score_dnn = dnn_model.evaluate(X_val, y_val, verbose=0)

In [None]:
# Plot DNN training history (expect overfitting: val_loss increasing)
plot_training_history(history_dnn, title='DNN - No Regularisation')

In [None]:
# Display DNN validation metrics
preds_dnn_val = dnn_model.predict(X_val, verbose=0).flatten()
mae_dnn_val = mean_absolute_error(y_val, preds_dnn_val)
r2_dnn_val = r2_score(y_val, preds_dnn_val)

print('MAE (Validation): {:.0f} bikes (baseline={:.0f})'.format(mae_dnn_val, baseline_mae))
print('R² (Validation): {:.4f}'.format(r2_dnn_val))
print(f'\nMAE: {mae_dnn_val:.0f}  ← Primary Metric')

## 7. Regularising Your Model and Tuning Hyperparameters

Now we address the overfitting observed in Section 6 by adding **regularisation**. We use two complementary techniques:

| Technique | How it works | Effect |
|-----------|--------------|--------|
| **Dropout** | Randomly drops neurons during training | Acts like ensemble averaging |
| **L2 (Weight Decay)** | Adds penalty for large weights to loss | Keeps weights small |
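
The two rows of the table can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the Keras source: the `dropout` helper uses the usual "inverted dropout" scaling so the expected activation is unchanged, and `l2_penalty` adds the strength times the sum of squared weights to the loss.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    if rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l2_penalty(weights, strength):
    """Extra loss term: strength * sum of squared weights."""
    return strength * float(np.sum(weights ** 2))

rng = np.random.default_rng(0)

hidden = np.ones(10_000)                 # toy hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)

weights = np.array([[1.0, -2.0],
                    [0.5, 0.0]])         # toy weight matrix
penalty = l2_penalty(weights, strength=0.01)

print("values after dropout:", np.unique(dropped))
print(f"mean activation after dropout: {dropped.mean():.3f}")
print(f"L2 penalty added to the loss: {penalty:.4f}")
```

Dropout is only active during training; at prediction time the layer passes activations through unchanged.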

Using **Hyperband** for efficient hyperparameter tuning.
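
Hyperband builds on successive halving: train many configurations briefly, keep the best fraction, and give the survivors a larger epoch budget. The sketch below shows that core idea only. It is a simplified illustration, not Keras Tuner's exact bracket schedule, and `toy_val_loss` is a hypothetical stand-in for "train for this many epochs and return the validation loss".

```python
import math

def toy_val_loss(lr, epochs):
    """Hypothetical objective: best near lr=1e-3, improves with more epochs."""
    return (math.log10(lr) + 3) ** 2 + 1.0 / epochs

def successive_halving(configs, min_epochs=2, max_epochs=20, factor=3):
    """Keep the top 1/factor of configs each round, multiplying the budget."""
    epochs = min_epochs
    survivors = list(configs)
    while True:
        scored = sorted(survivors, key=lambda lr: toy_val_loss(lr, epochs))
        print(f"epochs={epochs:>2}: evaluated {len(scored)} configs")
        if len(scored) == 1 or epochs >= max_epochs:
            return scored[0], epochs
        keep = max(1, len(scored) // factor)
        survivors = scored[:keep]
        epochs = min(epochs * factor, max_epochs)

# Nine made-up learning rates to choose between
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 5e-4, 2e-3, 7e-3, 2e-4]

best_lr, epochs_used = successive_halving(learning_rates)
print(f"Best learning rate: {best_lr} (final budget: {epochs_used} epochs)")
```

Most of the compute goes to the few promising configurations, which is why this family of methods is cheaper than training every candidate for the full budget.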

### 7.1 Hyperband Search

In [None]:
# Hyperband Model Builder for Regression
def build_model_hyperband(hp):
    """
    Build Bike Sharing model with FIXED architecture (1 hidden layer, 64 neurons).
    Same architecture as Section 6 DNN - only tunes regularisation and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # L2 regularisation strength
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')

    # Fixed architecture: 1 hidden layer with 64 neurons (same as Section 6)
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(l2_reg)))
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))

    # Output layer for regression (linear activation)
    model.add(layers.Dense(OUTPUT_DIMENSION))

    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
# Objective: minimise validation loss (MSE)
tuner = kt.Hyperband(
    build_model_hyperband,
    objective='val_loss',
    max_epochs=20,
    factor=3,
    seed=SEED,  # reuse the notebook seed so the sampled trials are repeatable
    directory='bike_hyperband',
    project_name='bike_tuning',
    overwrite=True
)

print("Tuning objective: val_loss (MSE)")
print("(Note: Final evaluation uses MAE as primary metric)")

# Run Hyperband search
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=BATCH_SIZE
)

In [None]:
# Get best hyperparameters
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout')}")
print(f"  Learning Rate: {best_hp.get('lr'):.6f}")

### 7.2 Sanity Check and Final Retraining

After finding the best hyperparameters, we follow a two-step process:

1. **Sanity Check:** Retrain with the best hyperparameters using training and validation data to visually confirm the model is not overfitting. This validates that Hyperband found hyperparameters that generalise well.

2. **Final Refit:** Combine training and validation sets and retrain without validation. Since the hyperparameters have been validated, we maximise the data available for the final model.

---

#### Why This Two-Step Approach?

| Step | Purpose | Validation Data |
|------|---------|-----------------|
| **Sanity Check** | Confirm hyperparameters prevent overfitting | ✓ Used for monitoring |
| **Final Refit** | Maximise training data for production model | ✗ Merged into training |

Once the sanity check confirms no overfitting, we can confidently combine all available data for the final model.

In [None]:
# =============================================================================
# STEP 1: SANITY CHECK - Retrain with validation to confirm no overfitting
# =============================================================================

# Extract the number of epochs from the best trial
best_trial = tuner.oracle.get_best_trials(num_trials=1)[0]
best_epochs = best_trial.best_step + 1  # best_step is 0-indexed

print("=" * 60)
print("SANITY CHECK: Retraining with Validation")
print("=" * 60)
print(f"Training for {best_epochs} epochs (matched from Hyperband's best trial)")
print("Purpose: Visually confirm the hyperparameters prevent overfitting\n")

# Build a fresh model with the best hyperparameters
sanity_model = tuner.hypermodel.build(best_hp)

history_sanity = sanity_model.fit(
    X_train, y_train,
    epochs=best_epochs,
    batch_size=BATCH_SIZE,
    validation_data=(X_val, y_val),  # Include validation for monitoring
    verbose=0
)

print("Sanity check training complete.")

In [None]:
# Plot sanity check training history (with validation curves)
plot_training_history(history_sanity, title=f'Sanity Check - Best Hyperparameters ({best_epochs} epochs)')

# Verify no overfitting: validation loss should not increase significantly
val_losses = history_sanity.history['val_loss']
min_val_loss_epoch = val_losses.index(min(val_losses)) + 1
final_val_loss = val_losses[-1]
min_val_loss = min(val_losses)

print(f"\nSanity Check Results:")
print(f"  Minimum validation loss: {min_val_loss:.4f} at epoch {min_val_loss_epoch}")
print(f"  Final validation loss: {final_val_loss:.4f}")
if final_val_loss <= min_val_loss * 1.1:  # Within 10% of minimum
    print("  ✓ No significant overfitting detected - hyperparameters are validated")
else:
    print("  ⚠ Some overfitting detected - consider adjusting epochs")

In [None]:
# =============================================================================
# STEP 2: FINAL REFIT - Combine data and retrain for production
# =============================================================================

# Combine training and validation sets for final model
X_combined = np.vstack([X_train, X_val])
y_combined = np.concatenate([y_train, y_val])

print("=" * 60)
print("FINAL REFIT: Training on Combined Data")
print("=" * 60)
print(f"Training data: {X_train.shape[0]:,} samples")
print(f"Validation data: {X_val.shape[0]:,} samples")
print(f"Combined data: {X_combined.shape[0]:,} samples")
print(f"  → {(X_combined.shape[0] / X_train.shape[0] - 1) * 100:.1f}% more training data")

# Build and train final model on combined data
print(f"\nRetraining for {best_epochs} epochs on combined data...")

best_model = tuner.hypermodel.build(best_hp)

best_model.fit(
    X_combined, y_combined,
    epochs=best_epochs,
    batch_size=BATCH_SIZE,
    verbose=0
    # No validation_data - merged into training
    # No plotting needed - sanity check already validated the hyperparameters
)

print("\n✓ Final model training complete on combined dataset.")

In [None]:
# =============================================================================
# K-FOLD CROSS-VALIDATION EVALUATION
# =============================================================================
def evaluate_with_kfold(build_fn, X, y, kfold, epochs, batch_size):
    """
    Evaluate a model using K-Fold cross-validation.
    
    Parameters:
    -----------
    build_fn : callable
        Function that returns a compiled Keras model
    X, y : array-like
        Full training data
    kfold : KFold
        Configured KFold splitter
    epochs : int
        Training epochs per fold
    batch_size : int
        Batch size for training
    
    Returns:
    --------
    dict : Mean and std of metrics across folds
    """
    fold_metrics = {'mae': [], 'r2': []}
    
    for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
        # Split data for this fold
        X_train_fold = X[train_idx]
        X_val_fold = X[val_idx]
        y_train_fold = y[train_idx]
        y_val_fold = y[val_idx]
        
        # Build fresh model for each fold
        model = build_fn()
        
        # Train
        model.fit(
            X_train_fold, y_train_fold,
            validation_data=(X_val_fold, y_val_fold),
            epochs=epochs, batch_size=batch_size, verbose=0
        )
        
        # Evaluate
        preds = model.predict(X_val_fold, verbose=0).flatten()
        fold_metrics['mae'].append(mean_absolute_error(y_val_fold, preds))
        fold_metrics['r2'].append(r2_score(y_val_fold, preds))
        
        print(f"  Fold {fold+1}: MAE={fold_metrics['mae'][-1]:.0f}")
    
    return {
        'mae_mean': np.mean(fold_metrics['mae']),
        'mae_std': np.std(fold_metrics['mae']),
        'r2_mean': np.mean(fold_metrics['r2']),
        'r2_std': np.std(fold_metrics['r2'])
    }

# Build function using best hyperparameters
def build_best_model():
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(best_hp.get('l2_reg'))))
    model.add(layers.Dropout(best_hp.get('dropout')))
    model.add(layers.Dense(OUTPUT_DIMENSION))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=best_hp.get('lr')),
        loss=LOSS_FUNC, metrics=METRICS
    )
    return model

print("Evaluating best model with 5-Fold Cross-Validation...")
kfold_results = evaluate_with_kfold(
    build_best_model, X_train_full, y_train_full, 
    kfold, EPOCHS_REGULARIZED, BATCH_SIZE
)

print("\n" + "=" * 50)
print("K-FOLD CROSS-VALIDATION RESULTS")
print("=" * 50)
print(f"MAE:  {kfold_results['mae_mean']:.0f} ± {kfold_results['mae_std']:.0f} bikes")
print(f"R²:   {kfold_results['r2_mean']:.4f} ± {kfold_results['r2_std']:.4f}")
print("=" * 50)

### 7.3 Final Model Training and Test Evaluation

Now we train the final model on the **entire training pool** (all folds combined) and evaluate on the held-out test set.

In [None]:
# =============================================================================
# FINAL MODEL: Train on ALL training data, evaluate on test set
# =============================================================================
# Build final model with best hyperparameters
final_model = build_best_model()

# Train on entire training pool
final_model.fit(
    X_train_full, y_train_full,
    epochs=EPOCHS_REGULARIZED,
    batch_size=BATCH_SIZE,
    verbose=0
)

# Evaluate on held-out test set
preds_test = final_model.predict(X_test, verbose=0).flatten()

test_mae = mean_absolute_error(y_test, preds_test)
test_r2 = r2_score(y_test, preds_test)

print('=' * 50)
print('FINAL TEST SET RESULTS')
print('=' * 50)
print(f'MAE (Test): {test_mae:.0f} bikes  ← Primary Metric')
print(f'R² (Test): {test_r2:.4f}')
print(f'\nBaseline MAE: {baseline_mae:.0f} bikes')
print(f'Improvement over baseline: {(baseline_mae - test_mae) / baseline_mae * 100:.1f}%')

In [None]:
# Visualise predictions vs actual
fig, axs = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Predicted vs Actual
axs[0].scatter(y_test, preds_test, alpha=0.6, edgecolor='k', linewidth=0.5)
axs[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect prediction')
axs[0].set_xlabel('Actual Bike Rentals')
axs[0].set_ylabel('Predicted Bike Rentals')
axs[0].set_title('Predicted vs Actual (Test Set)')
axs[0].legend()
axs[0].grid(alpha=0.3)

# Plot 2: Residuals
residuals = y_test - preds_test
axs[1].scatter(preds_test, residuals, alpha=0.6, edgecolor='k', linewidth=0.5)
axs[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axs[1].set_xlabel('Predicted Bike Rentals')
axs[1].set_ylabel('Residual (Actual - Predicted)')
axs[1].set_title('Residual Plot (Test Set)')
axs[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

---

## 8. Results Summary

The following dynamically generated table compares all models trained in this notebook.

In [None]:
# =============================================================================
# RESULTS SUMMARY
# =============================================================================

# Create results DataFrame
results = pd.DataFrame({
    'Model': ['Naive Baseline', 'SLP (Linear)', 'DNN (No Reg)', f'DNN (K-Fold, {N_FOLDS} folds)', 'DNN (Dropout + L2) - Test'],
    'MAE': [baseline_mae, mae_slp_val, mae_dnn_val, kfold_results['mae_mean'], test_mae],
    'R²': [0.0, r2_slp_val, r2_dnn_val, kfold_results['r2_mean'], test_r2],
    'Dataset': ['Train pool', 'Fold 1', 'Fold 1', 'K-Fold CV', 'Test']
})

print("=" * 80)
print("MODEL COMPARISON - RESULTS SUMMARY")
print("=" * 80)
print(f"Primary Metric: MAE (regression task)")
print("=" * 80)
print(results.to_string(index=False, float_format='{:.2f}'.format))
print("=" * 80)
print(f"\nKey Observations:")
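# Count the rows that beat the baseline instead of asserting it unconditionally
n_beating = int((results['MAE'].iloc[1:] < baseline_mae).sum())
print(f"  - {n_beating} of {len(results) - 1} model rows have a lower MAE than the naive baseline ({baseline_mae:.0f} bikes)")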
print(f"  - Final model trained on entire training pool ({X_train_full.shape[0]:,} samples)")
print(f"  - K-Fold CV MAE: {kfold_results['mae_mean']:.0f} ± {kfold_results['mae_std']:.0f} bikes")
print(f"  - Final test MAE: {test_mae:.0f} bikes, R²: {test_r2:.4f}")

---

## 9. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | 731 samples | **K-Fold (5 folds)** | Kohavi (1995) |
| **Primary Metric** | Regression | Continuous target | **MAE** | Interpretable in original units |
| **Class Weights** | Classification only | N/A (regression) | **Not applicable** | - |
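
To make the K-Fold choice concrete, the sketch below splits 731 sample indices into 5 folds in plain Python. The helper name `kfold_indices` is hypothetical and is not the notebook's implementation; it only illustrates that each sample is used for validation exactly once, which is what makes the averaged metric more stable than a single small hold-out split.

```python
import random

def kfold_indices(n_samples, n_folds, seed=42):
    """Yield (train_idx, val_idx) pairs. Hypothetical helper for illustration."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # The first (n_samples % n_folds) folds receive one extra sample
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(731, 5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)  # [147, 146, 146, 146, 146]
```

In practice the sample count passed in would be the size of the training pool rather than the full dataset, since the test set is held back before cross-validation.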

### Lessons Learned

1. **Regression Uses Different Metrics:** MAE and R² replace accuracy/F1-score. The naive baseline predicts the mean.

2. **Same Workflow Applies:** The Universal ML Workflow carries over to regression unchanged. Only the output layer (a single unit with linear activation), the loss (MSE), and the evaluation metrics differ from a classification model.

3. **K-Fold for Small Datasets:** With 731 samples (below the 10,000-sample threshold), K-Fold provides more robust metric estimates than a single hold-out split.

4. **Feature Engineering Matters:** Proper preprocessing of categorical (one-hot) and numerical (scaling) features is essential.

5. **Regularisation Limits Overfitting:** L2 + Dropout keeps overfitting in check without early stopping, for regression tasks just as for classification.

6. **Maximise Data for Final Model:** After hyperparameter tuning, we combine training and validation sets for the final model. The validation set's job is done (model selection), so we use all available data to maximise learning.

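7. **R² Interpretation:** Values close to 1.0 indicate the model explains most of the variance in the target; values near 0 indicate it does no better than always predicting the mean; negative values indicate it does worse than the mean.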
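
The behaviour of MAE and R² described in the lessons above can be checked by hand, including the fact that R² turns negative when predictions are worse than the mean. The sketch below uses made-up rental counts (not values from the dataset) and hypothetical helper names `mae` and `r2`; the notebook itself relies on scikit-learn's `mean_absolute_error` and `r2_score`.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot is the error of the mean predictor."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values chosen for easy arithmetic (assumption, not real data)
y_true = [1000, 2000, 3000, 4000, 5000]
predictions = {
    'naive (mean)': [mean(y_true)] * len(y_true),
    'good model':   [1100, 1900, 3200, 3900, 5100],
    'bad model':    [5000, 4000, 3000, 2000, 1000],
}

for label, y_pred in predictions.items():
    print(f"{label:>12}: MAE={mae(y_true, y_pred):.0f}, R²={r2(y_true, y_pred):.3f}")
# naive (mean): MAE=1200, R²=0.000
#   good model: MAE=120, R²=0.992
#    bad model: MAE=2400, R²=-3.000
```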

### Regression vs Classification Summary

| Aspect | Regression (This Notebook) | Classification |
|--------|---------------------------|----------------|
| **Output** | Continuous value | Discrete classes |
| **Activation** | Linear (none) | Sigmoid/Softmax |
| **Loss** | MSE | Cross-entropy |
| **Metrics** | MAE, R² | Accuracy, F1, AUC |
| **Baseline** | Mean prediction | Majority class |
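
The activation and loss rows of the table can be illustrated with a few lines of plain Python. This is a minimal sketch with toy numbers and hypothetical helper names (`mse`, `sigmoid`, `binary_cross_entropy`); Keras provides its own implementations, which the notebook uses through `loss='mse'`.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the regression loss used in this notebook."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def sigmoid(z):
    """Squash a raw score into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of the true class labels."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Regression: the linear output is compared directly with the target value
reg_loss = mse([3000.0, 4500.0], [3100.0, 4300.0])
print(reg_loss)  # 25000.0  -> (100**2 + 200**2) / 2

# Classification: raw scores pass through a sigmoid before the loss is applied
probs = [sigmoid(z) for z in (2.0, -1.5)]
clf_loss = binary_cross_entropy([1, 0], probs)
print(round(clf_loss, 4))
```

The regression loss is expressed in squared target units (bikes²), which is one reason the notebook reports MAE rather than MSE as its primary metric.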

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Fanaee-T, H. and Gama, J. (2014) 'Event labelling combining ensemble detectors and background knowledge', *Progress in Artificial Intelligence*, 2(2–3), pp. 113–127.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning*. 2nd edn. New York: Springer.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

---

## Appendix: Modular Helper Functions

For cleaner code organisation, you can wrap the model building and training patterns into reusable functions.

In [None]:
# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================

def build_regression_model(input_dim, hidden_units=None, dropout=0.0, l2_reg=0.0,
                           optimizer='adam', loss='mse',
                           metrics=None, name=None):
    """
    Build a regression neural network.

    Parameters:
    -----------
    input_dim : int
        Number of input features
    hidden_units : list of int, optional
        Neurons per hidden layer, e.g., [64] or [128, 64]
        None or [] creates a single-layer perceptron (linear regression)
    dropout : float
        Dropout rate (0.0 to 0.5); 0.0 disables dropout
    l2_reg : float
        L2 regularisation strength; 0.0 disables the weight penalty
    optimizer : str or keras.optimizers.Optimizer
        Optimiser name or instance
    loss : str
        Loss function name (mse, mae, huber)
    metrics : list, optional
        Metrics to track during training; defaults to ['mae']
    name : str, optional
        Model name

    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    """
    # Resolve defaults here rather than using mutable default arguments
    hidden_units = hidden_units or []
    metrics = ['mae'] if metrics is None else list(metrics)

    model = Sequential(name=name)
    model.add(layers.Input(shape=(input_dim,)))

    kernel_reg = regularizers.l2(l2_reg) if l2_reg > 0 else None

    # Hidden layers: Dense + optional Dropout after each one
    for units in hidden_units:
        model.add(Dense(units, activation='relu', kernel_regularizer=kernel_reg))
        if dropout > 0:
            model.add(Dropout(dropout))

    # Output layer for regression: one unit, linear activation (no activation argument)
    model.add(Dense(1))

    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    return model


def train_model(model, X_train, y_train, X_val, y_val,
                batch_size=32, epochs=100, verbose=0):
    """
    Train a model and return training history.
    
    Parameters:
    -----------
    model : keras.Model
        Compiled Keras model
    X_train, y_train : array-like
        Training data and labels
    X_val, y_val : array-like
        Validation data and labels
    batch_size : int
        Training batch size
    epochs : int
        Number of training epochs
    verbose : int
        Verbosity mode
        
    Returns:
    --------
    keras.callbacks.History : Training history object
    """
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        batch_size=batch_size, 
        epochs=epochs,
        verbose=verbose
    )


def evaluate_regression_model(model, X, y_true):
    """
    Evaluate regression model.
    
    Parameters:
    -----------
    model : keras.Model
        Trained Keras model
    X : array-like
        Input features
    y_true : array-like
        True target values
        
    Returns:
    --------
    dict : Dictionary containing metrics (mae, r2)
    """
    y_pred = model.predict(X, verbose=0).flatten()
    
    metrics = {
        'mae': mean_absolute_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred),
    }
    
    return metrics


# =============================================================================
# USAGE EXAMPLES
# =============================================================================
# 
# # Build models
# slp = build_regression_model(INPUT_DIMENSION, name='SLP')
# dnn = build_regression_model(INPUT_DIMENSION, hidden_units=[64], name='DNN')
# dnn_reg = build_regression_model(INPUT_DIMENSION, hidden_units=[64], 
#                                  dropout=0.3, l2_reg=0.001, name='DNN_Regularized')
# 
# # Train
# history = train_model(dnn, X_train, y_train, X_val, y_val)
# 
# # Evaluate
# metrics = evaluate_regression_model(dnn, X_val, y_val)
# print(f"MAE: {metrics['mae']:.0f} bikes")