<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/ASL%20Sign%20Language/ASL%20Sign%20Language%20-%20Image%20Classification%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ASL Sign Language - Image Classification Example

This notebook demonstrates the **Universal ML Workflow** applied to image classification using American Sign Language (ASL) hand gesture images.

## Learning Objectives

By the end of this notebook, you will be able to:
- Load and extract image data from a zip archive
- Preprocess images for neural network input: **Colour → Grayscale → Flatten**
- Build neural networks for multi-class **image classification**
- Apply the Universal ML Workflow to computer vision problems
- Evaluate classification performance with accuracy and confusion matrices

---

## Technique Scope

| Aspect | What We Use | What We Don't Use (Yet) |
|--------|-------------|------------------------|
| **Architecture** | Dense layers only | CNNs, pooling, feature extractors |
| **Regularisation** | L2 + Dropout | Early stopping, data augmentation |
| **Optimiser** | Adam | SGD with momentum, learning rate schedules |
| **Tuning** | Hyperband | Bayesian optimisation, neural architecture search |

> **Note**: Dense networks applied to flattened images serve as a baseline. CNNs (Chapter 8) are the standard approach for image classification and would preserve spatial structure.

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | ASL Sign Language Dataset (3 classes: A, B, C) |
| **Problem Type** | Multi-Class Classification |
| **Data Balance** | Perfectly Balanced (3,000 samples per class) |
| **Data Type** | Unstructured (Images) |
| **Preprocessing** | Resize to 32×32 → Grayscale → Flatten (1024 features) |

### Image Preprocessing Pipeline

```
Original Image        Grayscale          Flattened
[H x W x 3]    →    [32 x 32]    →    [1024]
(Colour RGB)        (Single channel)   (1D array for NN)
```

**Why flatten?** Dense neural networks expect 1D input vectors. We sacrifice spatial relationships for simplicity. (CNNs preserve spatial structure but are covered in Chapter 8.)
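
As a minimal sketch of the pipeline above, the snippet below pushes a random stand-in image through the grayscale and flatten steps using NumPy only. The luminance weights are an assumption chosen for illustration (the notebook itself relies on `skimage.color.rgb2gray`), and the random array stands in for a real photo that has already been resized.

```python
import numpy as np

# Stand-in for a resized RGB image with values in [0, 1] (hypothetical data).
rng = np.random.default_rng(0)
rgb_image = rng.random((32, 32, 3))

# Weighted sum over the colour axis; these weights are an assumption for illustration.
luma_weights = np.array([0.2125, 0.7154, 0.0721])
gray_image = rgb_image @ luma_weights      # shape (32, 32)

# Flatten row by row into the 1D vector a Dense layer expects.
flat_vector = gray_image.reshape(-1)       # shape (1024,)

print(rgb_image.shape, gray_image.shape, flat_vector.shape)
```

Note that pixel `(row, col)` ends up at position `row * 32 + col` in the flat vector, so vertical neighbours sit 32 positions apart. This is the spatial information a dense layer no longer sees directly.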

---

## Code Reuse Philosophy

This notebook follows a **"Same Code, Different Data"** philosophy. The core ML pipeline remains consistent across different classification tasks:

```
┌─────────────────────────────────────────────────────────────────┐
│                    UNIVERSAL ML PIPELINE                        │
├─────────────────────────────────────────────────────────────────┤
│  Data Loading → Preprocessing → Train/Val/Test Split → Model   │
│  → Baseline → Overfitting → Regularisation → Evaluation        │
└─────────────────────────────────────────────────────────────────┘
```

**What changes:** Data source, preprocessing, number of output classes  
**What stays the same:** Model architecture pattern, training loop, evaluation code

---

## 1. Defining the Problem and Assembling a Dataset

**Problem Statement:** Classify images of hand gestures into ASL letters (A, B, or C).

**Why this matters:**
- ASL recognition can help bridge communication gaps
- Image classification is foundational to computer vision
- This simplified 3-class problem demonstrates key concepts before tackling the full 26-letter alphabet

## 2. Choosing a Measure of Success

### Data-Driven Metric Selection

| Criterion | This Dataset | Decision |
|-----------|--------------|----------|
| **Class Balance** | Equal across 3 classes | Balanced |
| **Number of Classes** | 3 (A, B, C) | Multi-class |
| **Primary Metric** | Accuracy | Standard for balanced multi-class |
| **Secondary Metrics** | Precision, Recall, AUC | Per-class performance |

**Why a 3:1 threshold?** As a rule of thumb in this notebook, we compare the sizes of the largest and smallest classes:
- **Balanced data (< 3:1 ratio):** When classes are roughly equal, accuracy is meaningful and interpretable
- **Imbalanced data (> 3:1 ratio):** Accuracy becomes misleading; F1-Score provides a fairer evaluation
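
To make the second point concrete, here is a small worked example on hypothetical data: a 9:1 imbalanced label set and a "model" that always predicts the majority class. The helper `accuracy_and_macro_f1` is written for this illustration only and is not part of the notebook's pipeline.

```python
import numpy as np

def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for integer class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    f1_scores = []
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, float(np.mean(f1_scores))

# Hypothetical 9:1 imbalanced labels and a model that always predicts class 0.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)

acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```

Accuracy looks strong here even though the minority class is never detected, while macro F1 drops below 0.5. With three equally sized classes, as in this dataset, that failure mode does not arise.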

### References

- Branco, P., Torgo, L. and Ribeiro, R.P. (2016) 'A survey of predictive modelling on imbalanced domains', *ACM Computing Surveys*, 49(2), pp. 1–50.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

**Decision:** Since the dataset is balanced, **Accuracy** is the primary metric.

## 3. Deciding on an Evaluation Protocol

### Data-Driven Protocol Selection

The choice between hold-out validation and k-fold cross-validation is a trade-off between estimate stability and computational cost. For model selection, k-fold cross-validation is widely used because it averages performance over multiple splits; Kohavi (1995) reports stratified 10-fold cross-validation as a strong general default for model selection. However, k-fold cross-validation requires training the model k times, which can be computationally expensive, especially for larger datasets and heavier models.

| Situation        | Recommended method                                   | Rationale |
|-----------------|--------------------------------------------------------|-----------|
| Smaller datasets | Stratified k-fold cross-validation (commonly 5 or 10 folds) | A single hold-out split may be unstable when the validation set is small; averaging across folds typically provides a more reliable model-selection signal. |
| Larger datasets  | Hold-out validation + separate test set                | With sufficient data, a single validation split is often adequate while avoiding the k× training cost of k-fold; a held-out test set supports final unbiased reporting. |


*Note:* This table summarises a rule-of-thumb stability–cost trade-off rather than fixed numeric cut-offs.


### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning: data mining, inference, and prediction*. 2nd edn. New York: Springer.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

**Decision:** With ~9,000 samples, below the 10,000-sample rule of thumb used in this notebook, stratified **K-Fold Cross-Validation** gives a more stable performance estimate than a single hold-out split.

```
Original Data (~9,000 samples)
├── Test Set (10% = ~900 samples) - Final evaluation only
└── Training Pool (90% = ~8,100 samples)
    └── 5-Fold Stratified Cross-Validation
        ├── Fold 1: Train on folds 2-5, validate on fold 1
        ├── Fold 2: Train on folds 1,3-5, validate on fold 2
        ├── ...
        └── Fold 5: Train on folds 1-4, validate on fold 5
```
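
The fold structure above can be sketched with NumPy alone. This is an illustrative stand-in, not the notebook's method (the notebook uses scikit-learn's `StratifiedKFold`); the function `stratified_folds` and the label array `y_pool` are hypothetical.

```python
import numpy as np

def stratified_folds(y_raw, n_folds=5, seed=204):
    """Assign each sample to one of n_folds so every fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y_raw), dtype=int)
    for cls in np.unique(y_raw):
        # Shuffle the indices of this class, then deal them out evenly across folds.
        idx = rng.permutation(np.where(y_raw == cls)[0])
        for fold, chunk in enumerate(np.array_split(idx, n_folds)):
            fold_of[chunk] = fold
    return fold_of

# Hypothetical training pool: 2,700 samples for each of 3 classes (~8,100 in total).
y_pool = np.repeat([0, 1, 2], 2700)
fold_of = stratified_folds(y_pool)

for fold in range(5):
    val_idx = np.where(fold_of == fold)[0]
    train_idx = np.where(fold_of != fold)[0]
    print(f"Fold {fold + 1}: train={len(train_idx)}, val={len(val_idx)}, "
          f"val class counts={np.bincount(y_pool[val_idx])}")
```

Each sample is used for validation exactly once and for training in the other four folds, which is why the averaged score is less sensitive to one lucky or unlucky split.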

## 4. Preparing Your Data

### 4.1 Import Libraries

In [None]:
import os
import zipfile
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from PIL import Image
from skimage.color import rgb2gray
from skimage.transform import resize

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
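# Import from tensorflow.keras throughout so all layers and models come from the same Keras build
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout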

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

# ============================================================
# RANDOM SEED - Set once, use everywhere
# ============================================================
SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Download and Extract Dataset

In [None]:
# ============================================================
# DATASET CONFIGURATION
# ============================================================
GDRIVE_FILE_ID = '1Df0wlpvKUSD12RAYihBI-slI1zbz-Vjj'
DATA_URL = f'https://drive.google.com/uc?id={GDRIVE_FILE_ID}&export=download'
ZIP_FILE = 'asl.zip'
EXTRACT_DIR = 'asl'

# Image configuration
RESIZE = (32, 32)  # Resize to 32x32 - balance between preserving detail and dimensionality
SAMPLE_SIZE = 3000  # Max samples per class

# Class names
CLASS_NAMES = ['A', 'B', 'C']

In [None]:
# Download dataset from Google Drive
import gdown

if not os.path.exists(ZIP_FILE):
    print(f"Downloading ASL dataset from Google Drive...")
    gdown.download(DATA_URL, ZIP_FILE, quiet=False)
else:
    print(f"Dataset already downloaded: {ZIP_FILE}")

# Extract the zip file
if not os.path.exists(EXTRACT_DIR):
    print(f"Extracting {ZIP_FILE}...")
    with zipfile.ZipFile(ZIP_FILE, 'r') as zip_ref:
        zip_ref.extractall('.')
    print(f"Extracted to: {EXTRACT_DIR}")
else:
    print(f"Directory already exists: {EXTRACT_DIR}")

### 4.3 Load and Preprocess Images

In [None]:
def load_image(file_path, target_size=(32, 32), grayscale=True):
    """
    Load and preprocess an image file.

    Args:
        file_path: Path to image file
        target_size: Target size for resizing (height, width)
        grayscale: Convert to grayscale if True

    Returns:
        Preprocessed image as a float numpy array with values in [0, 1]
    """
    # Load with PIL and force 3 channels so grayscale or RGBA files do not break the resize
    with Image.open(file_path) as img:
        img_array = np.array(img.convert('RGB'))

    # Resize using skimage (consistent with other notebooks); uint8 input is rescaled to [0, 1]
    img_resized = resize(img_array, (*target_size, 3), anti_aliasing=True)

    # Convert to grayscale if specified
    if grayscale:
        img_resized = rgb2gray(img_resized)

    return img_resized

In [None]:
# Load images from directory structure
images, labels = [], []

for class_name in CLASS_NAMES:
    class_dir = os.path.join(EXTRACT_DIR, class_name)
    if not os.path.exists(class_dir):
        print(f"Warning: Directory not found: {class_dir}")
        continue

    files = sorted(os.listdir(class_dir))[:SAMPLE_SIZE]  # sort so the selected subset is reproducible
    print(f"Loading class '{class_name}': {len(files)} images")

    for file in files:
        file_path = os.path.join(class_dir, file)
        try:
            img = load_image(file_path, target_size=RESIZE, grayscale=True)
            images.append(img)
            labels.append(class_name)
        except Exception as e:
            print(f"Error loading {file_path}: {e}")

print(f"\nTotal images loaded: {len(images)}")

In [None]:
# Convert to numpy arrays
X = np.array(images)
y_labels = np.array(labels)

# Flatten images: (N, 32, 32) -> (N, 1024)
X = X.reshape((X.shape[0], -1))

# Create label mapping
label_to_idx = {name: idx for idx, name in enumerate(CLASS_NAMES)}
y_raw = np.array([label_to_idx[label] for label in y_labels])

# One-hot encode labels
y = to_categorical(y_raw)

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Number of classes: {y.shape[1]}")

### 4.4 Verify Class Balance

In [None]:
# Check class distribution
unique, counts = np.unique(y_raw, return_counts=True)

print("Class Distribution:")
for class_idx, count in zip(unique, counts):
    print(f"  {CLASS_NAMES[class_idx]}: {count} ({100*count/len(y_raw):.1f}%)")

# Calculate imbalance ratio
imbalance_ratio = max(counts) / min(counts)
print(f"\nImbalance Ratio: {imbalance_ratio:.2f}:1")
print(f"Decision: {'Use Accuracy (balanced)' if imbalance_ratio < 3 else 'Use F1-Score (imbalanced)'}")

### 4.5 Train/Test Split

In [None]:
# ============================================================
# TRAIN/TEST SPLIT (90%/10%)
# ============================================================
TEST_SIZE = 0.10

# Split features, one-hot labels and raw integer labels in a single call
# so that all three stay aligned on the same train/test indices.
X_train_full, X_test, y_train_full, y_test, y_train_full_raw, y_test_raw = train_test_split(
    X, y, y_raw,
    test_size=TEST_SIZE,
    stratify=y_raw,
    random_state=SEED,
    shuffle=True
)

print(f"Training + Validation: {X_train_full.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")

### 4.6 Normalise Features

In [None]:
# ============================================================
# NORMALISE PIXEL VALUES [0, 1]
# ============================================================
# Note: skimage's resize already returns floats in [0, 1], so no rescaling is needed here.
# We only cast to float32 and then print the range to confirm it.
X_train_full = X_train_full.astype('float32')
X_test = X_test.astype('float32')

# Verify normalisation
print(f"Feature range: [{X_train_full.min():.3f}, {X_train_full.max():.3f}]")

### 4.7 Configure K-Fold Cross-Validation

Since our dataset has ~9,000 samples (below the 10,000-sample rule of thumb used in this notebook), we use **5-Fold Stratified Cross-Validation** instead of a simple hold-out validation split.

In [None]:
# ============================================================
# K-FOLD CROSS-VALIDATION SETUP
# ============================================================
N_FOLDS = 5

# Use StratifiedKFold to preserve class balance in each fold
skfold = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

print(f"K-Fold Configuration:")
print(f"  Number of folds: {N_FOLDS}")
print(f"  Training pool: {X_train_full.shape[0]:,} samples")
print(f"  Samples per fold: ~{X_train_full.shape[0] // N_FOLDS:,}")
print(f"  Test set (held out): {X_test.shape[0]:,} samples")

# For initial model development, we use the first fold
# Final evaluation will use all folds
first_fold = list(skfold.split(X_train_full, y_train_full.argmax(axis=1)))[0]
train_idx, val_idx = first_fold

X_train = X_train_full[train_idx]
X_val = X_train_full[val_idx]
y_train = y_train_full[train_idx]
y_val = y_train_full[val_idx]

# Keep raw labels for train set (for class weights)
y_train_raw = y_train.argmax(axis=1)

print(f"\nFirst fold (for initial development):")
print(f"  Training: {X_train.shape[0]:,} samples")
print(f"  Validation: {X_val.shape[0]:,} samples")

### 4.8 Visualise Sample Images

In [None]:
# Display sample images from each class
fig, axes = plt.subplots(1, 3, figsize=(10, 4))
fig.suptitle('Sample Images (32×32 Grayscale)', fontsize=14)

for class_idx, class_name in enumerate(CLASS_NAMES):
    # Get first sample of this class
    sample_idx = np.where(y_train_raw == class_idx)[0][0]

    ax = axes[class_idx]
    # Reshape flattened image back to 2D
    img = X_train[sample_idx].reshape(RESIZE)
    ax.imshow(img, cmap='gray')
    ax.axis('off')
    ax.set_title(f"Letter '{class_name}'", fontsize=12)

plt.tight_layout()
plt.show()

## 5. Developing a Model That Does Better Than a Baseline

**Baseline for 3-class balanced problem:** 33.3% accuracy (random guessing)
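
A quick simulation illustrates where the 33.3% figure comes from. The labels below are synthetic and the sample count is deliberately larger than the real dataset so the estimate is stable; both are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(204)
n_classes, n_samples = 3, 90_000   # large hypothetical sample so the estimate is stable

y_true = np.repeat(np.arange(n_classes), n_samples // n_classes)   # balanced labels
random_guess = rng.integers(0, n_classes, size=n_samples)          # uniform random guesses
always_a = np.zeros(n_samples, dtype=int)                          # always predict class 0 ('A')

random_acc = np.mean(random_guess == y_true)
constant_acc = np.mean(always_a == y_true)
print(f"random guessing: {random_acc:.3f}, constant prediction: {constant_acc:.3f}")
```

On balanced data, uniform guessing and always predicting one class land at about the same accuracy, so either can serve as the floor a trained model has to beat.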

In [None]:
# ============================================================
# MODEL CONFIGURATION
# ============================================================
INPUT_DIMENSION = X_train.shape[1]  # 1024 features (32x32)
OUTPUT_CLASSES = y_train.shape[1]   # 3 classes

OPTIMIZER = 'adam'
LOSS_FUNC = 'categorical_crossentropy'
METRICS = ['accuracy']

# Training configuration
# Batch Size Selection:
# - Large datasets (>10,000 samples): Use 512 for efficient GPU utilisation
# - Small datasets (<10,000 samples): Use 32-64 for better gradient estimates
# ASL has ~9,000 samples → Use batch size 64 (below threshold, prioritise gradient quality)
BATCH_SIZE = 64
EPOCHS_BASELINE = 100
EPOCHS_REGULARIZED = 150

print(f"Input Dimension: {INPUT_DIMENSION}")
print(f"Output Classes: {OUTPUT_CLASSES}")
print(f"Batch Size: {BATCH_SIZE}")

In [None]:
# ============================================================
# ESTABLISH BASELINE
# ============================================================
# For balanced 3-class classification, random guessing = 33.3%
baseline_accuracy = 1.0 / OUTPUT_CLASSES

print(f"Baseline Accuracy (random guessing): {baseline_accuracy:.2f}")

In [None]:
# ============================================================
# CLASS WEIGHTS - Not needed for balanced dataset
# ============================================================
# Note: ASL dataset is perfectly balanced (3,000 samples per class)
# Class weights are not necessary for balanced datasets.
# We keep this cell for consistency with other notebooks and
# to demonstrate when class weights would be used.

print("Class Distribution (balanced - no class weights needed):")
for class_idx, class_name in enumerate(CLASS_NAMES):
    count = np.sum(y_train_raw == class_idx)
    print(f"  {class_name}: {count} ({100*count/len(y_train_raw):.1f}%)")

In [None]:
# ============================================================
# SINGLE LAYER PERCEPTRON (SLP) - Simplest possible model
# ============================================================
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(Dense(OUTPUT_CLASSES, activation='softmax', input_shape=(INPUT_DIMENSION,)))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# Train SLP
slp_history = slp_model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS_BASELINE,
    validation_data=(X_val, y_val),
    verbose=0
)

slp_val_acc = slp_model.evaluate(X_val, y_val, verbose=0)[1]
print(f"SLP Validation Accuracy: {slp_val_acc:.4f} (baseline: {baseline_accuracy:.2f})")

In [None]:
# ============================================================
# PLOT TRAINING HISTORY
# ============================================================
def plot_training_history(history, title='Training History'):
    """Plot training and validation loss/accuracy curves."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Loss
    axes[0].plot(history.history['loss'], 'b-', label='Training Loss')
    axes[0].plot(history.history['val_loss'], 'r-', label='Validation Loss')
    axes[0].set_title('Training and Validation Loss')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Accuracy
    axes[1].plot(history.history['accuracy'], 'b-', label='Training Accuracy')
    axes[1].plot(history.history['val_accuracy'], 'r-', label='Validation Accuracy')
    axes[1].set_title('Training and Validation Accuracy')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Accuracy')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()

plot_training_history(slp_history, 'Single Layer Perceptron')

## 6. Scaling Up: Developing a Model That Overfits

Adding a hidden layer to learn more complex features for distinguishing hand gestures.

**No regularisation applied:** We intentionally train this model **without any regularisation** (no dropout, no L2, no early stopping) to observe overfitting behaviour.

---

### Architecture Design Decisions

**Why 64 neurons in the hidden layer?**

This is a practical starting point that balances capacity and efficiency:
- **Too few (e.g., 16):** May not have enough capacity to distinguish subtle hand gesture differences
- **Too many (e.g., 512):** Increases overfitting risk and training time without proportional benefit
- **64 neurons:** A common choice that provides sufficient capacity for most classification tasks

**Why only 1 hidden layer instead of 2-3?**

Per the **Universal ML Workflow**, the goal of this step is to demonstrate that the model *can* overfit—proving it has sufficient capacity to capture the underlying patterns. Once overfitting is observed:

1. **Capacity is proven sufficient:** If the model overfits, it can learn the training data's complexity
2. **No need for more depth:** Adding layers would increase overfitting further without benefit
3. **Regularise, don't expand:** The next step (Section 7) is to *reduce* overfitting through regularisation

*"The right question is not 'How many layers?' but 'Can it overfit?' If yes, regularise. If no, add capacity."*
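
One rough way to see why a single 64-unit hidden layer may already have room to overfit is to count its trainable parameters and compare that with the roughly 6,480 training images in one fold. The helper `dense_params` is a hypothetical name used only for this arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

input_dim, hidden, n_classes = 1024, 64, 3

# Single layer perceptron: inputs connect straight to the 3 softmax outputs.
slp_params = dense_params(input_dim, n_classes)

# Multi-layer perceptron: one hidden layer of 64 units, then the output layer.
mlp_params = dense_params(input_dim, hidden) + dense_params(hidden, n_classes)

print(f"SLP parameters: {slp_params:,}")
print(f"MLP parameters: {mlp_params:,}")
```

The MLP has about ten times more parameters than there are training images in a fold, which is consistent with expecting overfitting before any regularisation is applied.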

In [None]:
# ============================================================
# MULTI-LAYER PERCEPTRON (MLP) - Standard architecture
# ============================================================
HIDDEN_NEURONS = 64

mlp_model = Sequential(name='Multi_Layer_Perceptron')
mlp_model.add(Dense(HIDDEN_NEURONS, activation='relu', input_shape=(INPUT_DIMENSION,)))
mlp_model.add(Dense(OUTPUT_CLASSES, activation='softmax'))
mlp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

mlp_model.summary()

In [None]:
# Train MLP
mlp_history = mlp_model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS_BASELINE,
    validation_data=(X_val, y_val),
    verbose=0
)

mlp_val_acc = mlp_model.evaluate(X_val, y_val, verbose=0)[1]
print(f"MLP Validation Accuracy: {mlp_val_acc:.4f} (baseline: {baseline_accuracy:.2f})")
print(f"Improvement over SLP: {(mlp_val_acc - slp_val_acc)*100:.2f} percentage points")

In [None]:
plot_training_history(mlp_history, 'Multi-Layer Perceptron (1 Hidden Layer)')

## 7. Regularising Your Model and Tuning Hyperparameters

Using **Hyperband** for efficient hyperparameter tuning with L2 regularisation and Dropout.

### Why Hyperband?

**Hyperband** is more efficient than grid search because it:
1. Starts training many configurations for a few epochs
2. Eliminates poor performers early
3. Allocates more resources to promising configurations
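
The three steps above can be sketched as a single round of successive halving, the building block that Hyperband repeats with different starting budgets. This is a simplified illustration under stated assumptions, not Keras Tuner's implementation: `successive_halving`, `toy_score`, the starting budget of 2 epochs and the 27 configurations are all hypothetical.

```python
import math

def successive_halving(configs, score_fn, min_epochs=2, max_epochs=50, factor=3):
    """One simplified bracket: train briefly, keep the top 1/factor, train the survivors longer."""
    survivors, epochs, schedule = list(configs), min_epochs, []
    while True:
        ranked = sorted(survivors, key=lambda cfg: score_fn(cfg, epochs), reverse=True)
        schedule.append((len(ranked), epochs))
        if len(ranked) == 1 or epochs >= max_epochs:
            return ranked[0], schedule
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs = min(max_epochs, epochs * factor)

def toy_score(cfg, epochs):
    """Hypothetical validation accuracy: best near a learning rate of 1e-3, improving with epochs."""
    closeness = 1.0 - abs(cfg["log10_lr"] + 3.0) / 2.0
    return closeness * (1.0 - math.exp(-epochs / 10.0))

# 27 hypothetical configurations with learning rates spaced evenly on a log scale from 1e-4 to 1e-2.
configs = [{"log10_lr": -4.0 + 2.0 * i / 26} for i in range(27)]
best, schedule = successive_halving(configs, toy_score)

total_epochs = sum(n * e for n, e in schedule)
print("rounds (configs, epochs):", schedule)
print("best log10(lr):", best["log10_lr"])
print("epochs spent:", total_epochs, "vs exhaustive:", len(configs) * 50)
```

In this toy setting the ranking of configurations does not change with more epochs, so early elimination is safe. On real models a slow starter can be discarded too early, which is one reason Hyperband runs several brackets with different trade-offs between the number of configurations and the initial budget.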

### Regularisation Strategy

| Technique | Purpose | How It Works |
|-----------|---------|-------------|
| **L2 Regularisation** | Prevent large weights | Adds penalty term to loss |
| **Dropout** | Prevent co-adaptation | Randomly zeros neurons during training |
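
Both techniques in the table can be written in a few lines of NumPy. The functions below are illustrative sketches with hypothetical names, not the Keras layers themselves; they assume the L2 penalty is the coefficient times the sum of squared weights, and that dropout rescales the surviving activations during training so their expected value is unchanged.

```python
import numpy as np

def l2_penalty(weights, l2_reg):
    """Penalty added to the loss: l2_reg times the sum of squared weights."""
    return l2_reg * np.sum(np.square(weights))

def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the rest (training-time behaviour)."""
    if rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(204)
weights = np.ones((2, 2))
activations = np.ones((1, 8))

print("L2 penalty:", l2_penalty(weights, l2_reg=0.01))
print("After dropout:", inverted_dropout(activations, rate=0.5, rng=rng))
```

At inference time dropout is switched off and activations pass through unchanged, which is why the rescaling is applied during training.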

In [None]:
# ============================================================
# HYPERBAND MODEL BUILDER
# ============================================================
def build_model_hyperband(hp):
    """
    Build ASL model with FROZEN architecture (1 hidden layer, 64 neurons).
    Tunes: L2 regularisation, Dropout rate, Learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # Hyperparameters to tune
    l2_reg = hp.Float('l2_reg', min_value=1e-5, max_value=1e-2, sampling='log')
    dropout_rate = hp.Float('dropout_rate', min_value=0.0, max_value=0.5, step=0.1)
    learning_rate = hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')

    # Hidden layer with L2 regularisation
    model.add(layers.Dense(
        HIDDEN_NEURONS,
        activation='relu',
        kernel_regularizer=regularizers.l2(l2_reg)
    ))
    model.add(layers.Dropout(dropout_rate))

    # Output layer
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=LOSS_FUNC,
        metrics=METRICS
    )

    return model

In [None]:
# ============================================================
# CONFIGURE AND RUN HYPERBAND TUNER
# ============================================================
tuner = kt.Hyperband(
    build_model_hyperband,
    objective='val_accuracy',
    max_epochs=50,
    factor=3,
    directory='asl_hyperband',
    project_name='asl_tuning',
    overwrite=True
)

# Run search
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=BATCH_SIZE,
    verbose=0
)

In [None]:
# Get best hyperparameters
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout_rate')}")
print(f"  Learning Rate: {best_hp.get('learning_rate'):.6f}")

### Sanity Check and Final Retraining

After finding the best hyperparameters, we follow a two-step process:

1. **Sanity Check:** Retrain with the best hyperparameters using training and validation data to visually confirm the model is not overfitting. This validates that Hyperband found hyperparameters that generalise well.

2. **Final Refit:** Combine training and validation sets and retrain without validation. Since the hyperparameters have been validated, we maximise the data available for the final model.

---

#### Why This Two-Step Approach?

| Step | Purpose | Validation Data |
|------|---------|-----------------|
| **Sanity Check** | Confirm hyperparameters prevent overfitting | ✓ Used for monitoring |
| **Final Refit** | Maximise training data for production model | ✗ Merged into training |

Once the sanity check confirms no overfitting, we can confidently combine all available data for the final model.

In [None]:
# =============================================================================
# STEP 1: SANITY CHECK - Retrain with validation to confirm no overfitting
# =============================================================================

# Extract the number of epochs from the best trial
best_trial = tuner.oracle.get_best_trials(num_trials=1)[0]
best_epochs = best_trial.best_step + 1  # best_step is 0-indexed

print("=" * 60)
print("SANITY CHECK: Retraining with Validation")
print("=" * 60)
print(f"Training for {best_epochs} epochs (matched from Hyperband's best trial)")
print("Purpose: Visually confirm the hyperparameters prevent overfitting\n")

# Build a fresh model with the best hyperparameters
sanity_model = tuner.hypermodel.build(best_hp)

history_sanity = sanity_model.fit(
    X_train, y_train,
    epochs=best_epochs,
    batch_size=BATCH_SIZE,
    validation_data=(X_val, y_val),  # Include validation for monitoring
    verbose=0
)

print("Sanity check training complete.")

In [None]:
# Plot sanity check training history (with validation curves)
plot_training_history(history_sanity, title=f'Sanity Check - Best Hyperparameters ({best_epochs} epochs)')

# Verify no overfitting: validation loss should not increase significantly
val_losses = history_sanity.history['val_loss']
min_val_loss_epoch = val_losses.index(min(val_losses)) + 1
final_val_loss = val_losses[-1]
min_val_loss = min(val_losses)

print(f"\nSanity Check Results:")
print(f"  Minimum validation loss: {min_val_loss:.4f} at epoch {min_val_loss_epoch}")
print(f"  Final validation loss: {final_val_loss:.4f}")
if final_val_loss <= min_val_loss * 1.1:  # Within 10% of minimum
    print("  ✓ No significant overfitting detected - hyperparameters are validated")
else:
    print("  ⚠ Some overfitting detected - consider adjusting epochs")

In [None]:
# =============================================================================
# STEP 2: FINAL REFIT - Combine data and retrain for production
# =============================================================================

# Combine training and validation sets for final model
X_combined = np.vstack([X_train, X_val])
y_combined = np.vstack([y_train, y_val])  # One-hot encoded labels

print("=" * 60)
print("FINAL REFIT: Training on Combined Data")
print("=" * 60)
print(f"Training data: {X_train.shape[0]:,} samples")
print(f"Validation data: {X_val.shape[0]:,} samples")
print(f"Combined data: {X_combined.shape[0]:,} samples")
print(f"  → {(X_combined.shape[0] / X_train.shape[0] - 1) * 100:.1f}% more training data")

# Build and train final model on combined data
print(f"\nRetraining for {best_epochs} epochs on combined data...")

best_model = tuner.hypermodel.build(best_hp)

best_model.fit(
    X_combined, y_combined,
    epochs=best_epochs,
    batch_size=BATCH_SIZE,
    verbose=0
    # No validation_data - merged into training
    # No plotting needed - sanity check already validated the hyperparameters
)

print("\n✓ Final model training complete on combined dataset.")

In [None]:
# ============================================================
# K-FOLD CROSS-VALIDATION EVALUATION
# ============================================================
def evaluate_with_kfold(build_fn, X, y, y_raw, skfold, epochs, batch_size):
    """
    Evaluate a model using Stratified K-Fold cross-validation.
    """
    fold_metrics = {'accuracy': [], 'precision': [], 'recall': []}

    for fold, (train_idx, val_idx) in enumerate(skfold.split(X, y_raw)):
        X_train_fold = X[train_idx]
        X_val_fold = X[val_idx]
        y_train_fold = y[train_idx]
        y_val_fold = y[val_idx]
        y_val_raw_fold = y_raw[val_idx]

        # Build fresh model for each fold
        model = build_fn()

        # Train (no class weights - dataset is balanced)
        model.fit(
            X_train_fold, y_train_fold,
            validation_data=(X_val_fold, y_val_fold),
            epochs=epochs, batch_size=batch_size,
            verbose=0
        )

        # Evaluate
        preds = model.predict(X_val_fold, verbose=0).argmax(axis=1)
        fold_metrics['accuracy'].append(accuracy_score(y_val_raw_fold, preds))
        fold_metrics['precision'].append(precision_score(y_val_raw_fold, preds, average='macro'))
        fold_metrics['recall'].append(recall_score(y_val_raw_fold, preds, average='macro'))

        print(f"  Fold {fold+1}: Accuracy={fold_metrics['accuracy'][-1]:.4f}")

    return {
        'acc_mean': np.mean(fold_metrics['accuracy']),
        'acc_std': np.std(fold_metrics['accuracy']),
        'prec_mean': np.mean(fold_metrics['precision']),
        'prec_std': np.std(fold_metrics['precision']),
        'rec_mean': np.mean(fold_metrics['recall']),
        'rec_std': np.std(fold_metrics['recall'])
    }

# Build function using best hyperparameters
def build_best_model():
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))
    model.add(layers.Dense(HIDDEN_NEURONS, activation='relu',
                           kernel_regularizer=regularizers.l2(best_hp.get('l2_reg'))))
    model.add(layers.Dropout(best_hp.get('dropout_rate')))
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=best_hp.get('learning_rate')),
        loss=LOSS_FUNC, metrics=METRICS
    )
    return model

print("Evaluating best model with 5-Fold Stratified Cross-Validation...")
kfold_results = evaluate_with_kfold(
    build_best_model, X_train_full, y_train_full, y_train_full_raw,
    skfold, EPOCHS_REGULARIZED, BATCH_SIZE
)

print("\n" + "=" * 50)
print("K-FOLD CROSS-VALIDATION RESULTS")
print("=" * 50)
print(f"Accuracy:  {kfold_results['acc_mean']:.4f} ± {kfold_results['acc_std']:.4f}")
print(f"Precision: {kfold_results['prec_mean']:.4f} ± {kfold_results['prec_std']:.4f}")
print(f"Recall:    {kfold_results['rec_mean']:.4f} ± {kfold_results['rec_std']:.4f}")
print("=" * 50)

## 8. Final Evaluation

Train the final model on the **entire training pool** and evaluate on the held-out test set.
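
The accuracy, precision and recall reported in this section can all be read off a confusion matrix. The matrix below is made up for illustration and does not come from the ASL model. It follows the scikit-learn convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes (A, B, C).
cm = np.array([
    [290,   5,   5],
    [ 10, 280,  10],
    [  0,  20, 280],
])

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
recall_per_class = correct / cm.sum(axis=1)      # correct / all samples that truly belong to the class
precision_per_class = correct / cm.sum(axis=0)   # correct / all samples predicted as the class

print(f"accuracy:        {accuracy:.4f}")
print(f"macro recall:    {recall_per_class.mean():.4f}")
print(f"macro precision: {precision_per_class.mean():.4f}")
```

Macro averaging gives each class equal weight, so on a balanced test set macro recall equals overall accuracy, while macro precision can differ slightly because the model does not predict every class equally often.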

In [None]:
# ============================================================
# FINAL MODEL: Train on ALL training data, evaluate on test set
# ============================================================
# Build final model with best hyperparameters
final_model = build_best_model()

# Train on entire training pool (no class weights - dataset is balanced)
final_model.fit(
    X_train_full, y_train_full,
    epochs=EPOCHS_REGULARIZED,
    batch_size=BATCH_SIZE,
    verbose=0
)

# Evaluate on held-out test set
y_pred_proba = final_model.predict(X_test, verbose=0)
y_pred = y_pred_proba.argmax(axis=1)

# Calculate metrics
test_accuracy = accuracy_score(y_test_raw, y_pred)
test_precision = precision_score(y_test_raw, y_pred, average='macro')
test_recall = recall_score(y_test_raw, y_pred, average='macro')
test_auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr', average='macro')

print("="*50)
print("FINAL TEST SET RESULTS")
print("="*50)
print(f"Accuracy:  {test_accuracy:.4f} (baseline: {baseline_accuracy:.2f})")
print(f"Precision: {test_precision:.4f}")
print(f"Recall:    {test_recall:.4f}")
print(f"AUC:       {test_auc:.4f}")
print("="*50)

In [None]:
# ============================================================
# CONFUSION MATRIX
# ============================================================
cm = confusion_matrix(y_test_raw, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=CLASS_NAMES)

fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Blues')
plt.title('Confusion Matrix - Test Set')
plt.show()

# Print detailed breakdown
print("\nConfusion Matrix Breakdown:")
for i, class_name in enumerate(CLASS_NAMES):
    correct = cm[i, i]
    total = cm[i, :].sum()
    print(f"  {class_name}: {correct}/{total} correct ({100*correct/total:.1f}%)")

## Model Comparison Summary

In [None]:
# ============================================================
# MODEL COMPARISON
# ============================================================
print("\nModel Comparison:")
print("="*60)
print(f"{'Model':<35} {'Accuracy':>15} {'Dataset':>10}")
print("-"*60)
print(f"{'Baseline (random)':<35} {baseline_accuracy:>15.4f} {'N/A':>10}")
print(f"{'Single Layer Perceptron':<35} {slp_val_acc:>15.4f} {'Fold 1':>10}")
print(f"{'Multi-Layer Perceptron':<35} {mlp_val_acc:>15.4f} {'Fold 1':>10}")
kfold_acc = f"{kfold_results['acc_mean']:.4f}±{kfold_results['acc_std']:.4f}"
print(f"{'Regularised (L2+Dropout) K-Fold':<35} {kfold_acc:>15} {'5-Fold CV':>10}")
print(f"{'Final Model - Test Set':<35} {test_accuracy:>15.4f} {'Test':>10}")
print("="*60)
print("\nKey Observations:")
min_acc = min(slp_val_acc, mlp_val_acc, kfold_results['acc_mean'], test_accuracy)
print(f"  - Lowest model accuracy: {min_acc:.2%} vs random baseline {baseline_accuracy:.2%}")
print(f"  - Final model trained on combined train+val data ({X_train_full.shape[0]:,} samples)")
print(f"  - K-Fold CV accuracy: {kfold_results['acc_mean']:.2%} +/- {kfold_results['acc_std']:.2%}")

---

## Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | ~9,000 samples | **K-Fold (5 folds)** | Kohavi (1995) |
| **Primary Metric** | Balanced multi-class | Equal distribution | Accuracy | Standard choice |
| **Class Weights** | Imbalance > 3:1 | 1:1 (balanced) | Not needed | He and Garcia (2009) |
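
The rules of thumb in the table can be written as a small helper. This is a minimal sketch: the function name `choose_protocol` and its return format are hypothetical and not part of the notebook. Only the 10,000-sample and 3:1 cut-offs come from the table above.

```python
from collections import Counter


def choose_protocol(labels, holdout_threshold=10_000, imbalance_threshold=3.0):
    """Hypothetical helper applying the table's rules of thumb to a list of labels."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    # Ratio of the largest class to the smallest class (1.0 means perfectly balanced)
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "n_samples": n_samples,
        "validation": "hold-out" if n_samples > holdout_threshold else "k-fold",
        "imbalance_ratio": imbalance_ratio,
        "use_class_weights": imbalance_ratio > imbalance_threshold,
    }


# Three balanced classes with 3,000 samples each, as in this dataset
labels = ["A"] * 3000 + ["B"] * 3000 + ["C"] * 3000
decision = choose_protocol(labels)
print(decision)
```

For this dataset the helper selects K-Fold and no class weights, matching the choices in the table.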

### Lessons Learned

1. **K-Fold for Medium Datasets:** With ~9,000 samples (below the 10,000 threshold), K-Fold cross-validation gives a more robust performance estimate than a single hold-out split, because every sample is used for validation exactly once.

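2. **K-Fold Reports Mean +/- Std:** Unlike hold-out (a single number), K-Fold yields one score per fold, so we can report the mean together with the standard deviation across folds as a measure of how much the estimate varies. This spread is not a formal confidence interval.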

3. **No Class Weights for Balanced Data:** Since each class has exactly 3,000 samples, class weights are unnecessary.

4. **Image Preprocessing:** Resize → Grayscale → Flatten → Normalise converts images to vectors for dense networks.

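5. **Flattening Loses Spatial Information:** A dense network has no built-in notion of which pixels are neighbours, so it would learn equally well from any fixed shuffling of the pixel order. Spatial relationships have to be learned from scratch rather than being exploited by the architecture.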

6. **Maximise Data for Final Model:** Once K-Fold validation and hyperparameter tuning are complete, the validation data has done its job (model selection), so the final model is trained on the entire training pool (training + validation data) to maximise learning.

7. **High Accuracy on ASL:** The 3-class problem (A, B, C) is relatively easy for neural networks, most likely because the three hand shapes are visually distinct.
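
The mean +/- std summary from K-Fold can be turned into a rough interval. The sketch below uses made-up fold accuracies (not results from this notebook) and a Student's t critical value for 5 folds. The interval is only approximate, because the folds share most of their training data and their scores are therefore not independent.

```python
import math
import statistics

# Illustrative fold accuracies (made-up numbers, not results from this notebook)
fold_acc = [0.97, 0.98, 0.96, 0.98, 0.97]

k = len(fold_acc)
mean = statistics.mean(fold_acc)
std = statistics.stdev(fold_acc)  # sample standard deviation (divides by k - 1)

# Two-sided 95% critical value of Student's t with k - 1 = 4 degrees of freedom
T_CRIT_95_DF4 = 2.776
half_width = T_CRIT_95_DF4 * std / math.sqrt(k)

print(f"mean ± std    : {mean:.4f} ± {std:.4f}")
print(f"approx. 95% CI: [{mean - half_width:.4f}, {mean + half_width:.4f}]")
```

Note that the interval half-width differs from the standard deviation itself: the std describes the spread of fold scores, while the interval describes the uncertainty of their mean.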

### Next Steps for Better Performance

- **Expand to full alphabet** (26 classes) - more challenging
- **Use CNNs** (Chapter 8) - preserves spatial structure
- **Higher resolution** - 32×32 is still relatively small
- **Data augmentation** - artificial variations for robustness
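
As one possible form of data augmentation, small random translations can be applied to the 2-D grayscale images before they are flattened. This is a minimal NumPy sketch under stated assumptions: the function name `random_shift`, the `max_shift` value, and the random stand-in batch are all hypothetical and not part of the notebook.

```python
import numpy as np


def random_shift(images, max_shift=2, rng=None):
    """Shift each (H, W) image by up to max_shift pixels in y and x, padding with zeros."""
    rng = np.random.default_rng() if rng is None else rng
    shifted = np.zeros_like(images)
    n, h, w = images.shape
    for i in range(n):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # Part of the source image that stays inside the frame after the shift
        src = images[i, max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        shifted[i,
                max(0, dy):max(0, dy) + src.shape[0],
                max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted


# Stand-in batch of four 32x32 grayscale images with values in [0, 1]
rng = np.random.default_rng(0)
batch = rng.random((4, 32, 32))

augmented = random_shift(batch, max_shift=2, rng=rng)
flat = augmented.reshape(len(augmented), -1)  # flatten afterwards, as in the pipeline
print(augmented.shape, flat.shape)
```

Augmentation of this kind would normally be applied to training data only, so that validation and test scores still reflect unmodified images.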

---

## Appendix: Modular Helper Functions

For cleaner code organisation, you can wrap the model building and training patterns into reusable functions.

In [None]:
# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================

def build_image_classifier(input_dim, num_classes, hidden_units=None, dropout=0.0, l2_reg=0.0,
                           optimizer='adam', learning_rate=None, name=None):
    """
    Build a multi-class image classification neural network.

    Parameters:
    -----------
    input_dim : int
        Number of input features (flattened image size)
    num_classes : int
        Number of output classes
    hidden_units : list of int, optional
        Neurons per hidden layer, e.g., [64] or [128, 64]
        None or [] creates a single-layer perceptron
    dropout : float
        Dropout rate (0.0 to 0.5)
    l2_reg : float
        L2 regularisation strength
    optimizer : str
        Optimiser name
    learning_rate : float, optional
        Custom learning rate (uses default if None)
    name : str, optional
        Model name

    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    """
    model = keras.Sequential(name=name)
    model.add(layers.Input(shape=(input_dim,)))

    hidden_units = hidden_units or []
    kernel_reg = keras.regularizers.l2(l2_reg) if l2_reg > 0 else None

    for units in hidden_units:
        model.add(layers.Dense(units, activation='relu', kernel_regularizer=kernel_reg))
        if dropout > 0:
            model.add(layers.Dropout(dropout))

    # Output layer for multi-class classification
    model.add(layers.Dense(num_classes, activation='softmax'))

    # Resolve the optimiser by name so that `optimizer` is honoured
    # even when a custom learning rate is supplied
    opt = keras.optimizers.get(optimizer)
    if learning_rate is not None:
        opt.learning_rate = learning_rate

    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model


def train_model(model, X_train, y_train, X_val, y_val,
                batch_size=64, epochs=100, verbose=0):
    """
    Train a model and return training history.

    Parameters:
    -----------
    model : keras.Model
        Compiled Keras model
    X_train, y_train : array-like
        Training data and labels
    X_val, y_val : array-like
        Validation data and labels
    batch_size : int
        Training batch size
    epochs : int
        Number of training epochs
    verbose : int
        Verbosity mode

    Returns:
    --------
    keras.callbacks.History : Training history object
    """
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        batch_size=batch_size,
        epochs=epochs,
        verbose=verbose
    )


def evaluate_multiclass_model(model, X, y_true_onehot, class_names=None):
    """
    Evaluate multi-class classification model.

    Parameters:
    -----------
    model : keras.Model
        Trained Keras model
    X : array-like
        Input features
    y_true_onehot : array-like
        True labels (one-hot encoded)
    class_names : list, optional
        Names of classes (accepted for convenience; not used in the metric computation)

    Returns:
    --------
    dict : Dictionary containing all metrics
    """
    y_pred_proba = model.predict(X, verbose=0)
    y_pred = y_pred_proba.argmax(axis=1)
    y_true = y_true_onehot.argmax(axis=1)

    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='macro'),
        'recall': recall_score(y_true, y_pred, average='macro'),
        'auc': roc_auc_score(y_true_onehot, y_pred_proba, multi_class='ovr', average='macro'),
    }

    return metrics


# =============================================================================
# USAGE EXAMPLES
# =============================================================================
#
# # Build models
# slp = build_image_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, name='SLP')
# mlp = build_image_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, hidden_units=[64], name='MLP')
# mlp_reg = build_image_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, hidden_units=[64],
#                                  dropout=0.3, l2_reg=0.001, learning_rate=0.001,
#                                  name='MLP_Regularized')
#
# # Train
# history = train_model(mlp, X_train, y_train, X_val, y_val)
#
# # Evaluate
# metrics = evaluate_multiclass_model(mlp, X_val, y_val, CLASS_NAMES)
# print(f"Accuracy: {metrics['accuracy']:.4f}")