<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Fashion%20MNIST/Fashion%20MNIST%20-%20TFDS%20Gray-Scaled%20Image%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fashion MNIST - TFDS Grayscale Image Example

This notebook demonstrates the **Universal ML Workflow** applied to **multi-class image classification** using grayscale images from TensorFlow Datasets.

## Learning Objectives

By the end of this notebook, you will be able to:
- Work with **grayscale images** (no colour conversion needed)
- Handle **10-class image classification**
- Understand **image preprocessing** for dense neural networks (resize + flatten)
- Use **TensorFlow Datasets** for standard benchmarks
- Use **Hyperband** for efficient hyperparameter tuning
- Apply **Dropout + L2 regularisation** to prevent overfitting

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [TensorFlow Datasets - fashion_mnist](https://www.tensorflow.org/datasets/catalog/fashion_mnist) |
| **Problem Type** | Multi-Class Classification (10 classes) |
| **Classes** | T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot |
| **Data Balance** | Perfectly Balanced (7,000 per class) |
| **Image Size** | 28×28 grayscale (resized to 32×32) |
| **Preprocessing** | Resize → Flatten → Normalise |

---

## Technique Scope

This notebook uses only techniques from **Chapters 1–4** of *Deep Learning with Python* (Chollet, 2021). This means:

| Technique | Status | Rationale |
|-----------|--------|-----------|
| **Dense layers (DNN)** | ✓ Used | Core building block (Ch. 3-4) |
| **Dropout** | ✓ Used | Regularisation technique (Ch. 4) |
| **L2 regularisation** | ✓ Used | Weight penalty (Ch. 4) |
| **Early stopping** | ✗ Not used | Introduced in Ch. 7 |
| **CNN** | ✗ Not used | Introduced in Ch. 8 |
| **RNN/LSTM** | ✗ Not used | Introduced in Ch. 10 |

**Note:** For image classification, Convolutional Neural Networks (CNNs) would typically be preferred. Here we use **Dense layers only** to demonstrate the Universal ML Workflow and regularisation techniques. Images are flattened to 1D vectors before being fed to the network.

---

## Why Fashion MNIST?

Fashion MNIST is a drop-in replacement for classic MNIST digit classification:

| Aspect | MNIST (Digits) | Fashion MNIST |
|--------|----------------|---------------|
| **Classes** | 10 digits (0-9) | 10 clothing items |
| **Difficulty** | Essentially "solved" (~99.8%) | More challenging (~92-94% with CNN) |
| **Realism** | Handwritten digits | Real product images |
| **Image size** | 28×28 grayscale | 28×28 grayscale |

---

## 1. Defining the Problem and Assembling a Dataset

The first step in any machine learning project is to clearly define the problem and understand the data.

**Problem Statement:** Classify grayscale images of fashion items into 10 categories.

**Why this matters:**
- **E-commerce automation:** Fashion classification enables automated product tagging, reducing manual effort for retailers with millions of SKUs
- **Visual search:** Users can photograph an item and find similar products—requires accurate category classification as a first step
- **Inventory management:** Automated classification helps track stock across categories

**Why this problem is interesting:**
- **Standard benchmark:** Fashion MNIST is widely used for evaluating classification algorithms
- **Balanced classes:** Enables straightforward accuracy comparison
- **Moderate difficulty:** More challenging than digit MNIST, but achievable with basic techniques

**Data Source:** TensorFlow Datasets provides pre-split train/test sets directly.

## 2. Choosing a Measure of Success

### Metric Selection Based on Class Imbalance

| Imbalance Ratio | Classification | Primary Metric | Rationale |
|-----------------|----------------|----------------|-----------|
| ≤ 1.5:1 | Balanced | **Accuracy** | Classes roughly equal |
| 1.5:1 – 3:1 | Mild Imbalance | **Accuracy** | Majority class < 75% |
| > 3:1 | Moderate/Severe | **F1-Score** | Accuracy becomes misleading |

**Why these thresholds?**
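- **3:1 ratio:** In a two-class problem, a 3:1 ratio means the majority class makes up 75% of the data. Beyond that point, a naive classifier that always predicts the majority class scores at least 75% accuracy while ignoring the minority class entirely
- **Balanced or mildly imbalanced data (≤ 3:1):** Accuracy remains meaningful and interpretable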
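
The 75% figure comes from simple arithmetic in the two-class case: with a majority-to-minority ratio of r:1, always predicting the majority class scores r / (r + 1). A minimal sketch (the helper name `majority_baseline_accuracy` is illustrative and not part of this notebook):

```python
def majority_baseline_accuracy(ratio):
    """Accuracy of always predicting the majority class in a two-class
    problem whose majority:minority ratio is `ratio`:1."""
    return ratio / (ratio + 1.0)

# Naive accuracy at a few imbalance ratios
for ratio in (1.0, 1.5, 3.0, 9.0):
    acc = majority_baseline_accuracy(ratio)
    print(f"{ratio:>4.1f}:1 -> naive accuracy {acc:.0%}")
```

At 3:1 the naive score already reaches 75%, which is why accuracy alone stops being informative beyond that ratio.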

### References

- Branco, P., Torgo, L. and Ribeiro, R.P. (2016) 'A survey of predictive modelling on imbalanced domains', *ACM Computing Surveys*, 49(2), pp. 1–50.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

**Fashion MNIST is perfectly balanced** (7,000 samples per class), so **Accuracy** is our primary metric.

We also report **Top-K Accuracy** - was the correct class among the top K predictions? This is useful for multi-class problems where similar items (e.g., Pullover vs. Coat) may be easily confused.
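
To make the definition concrete, here is a small sketch on made-up scores for three samples and four classes. The function name `top_k_accuracy_demo` and all numbers are illustrative assumptions, not results from this notebook:

```python
import numpy as np

def top_k_accuracy_demo(y_true, y_proba, k):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    top_k = np.argsort(y_proba, axis=1)[:, -k:]             # indices of the k largest scores
    hits = (top_k == np.asarray(y_true)[:, None]).any(axis=1)
    return hits.mean()

demo_true = np.array([2, 0, 3])
demo_proba = np.array([
    [0.10, 0.20, 0.60, 0.10],   # true class 2 is ranked 1st
    [0.30, 0.40, 0.20, 0.10],   # true class 0 is ranked 2nd
    [0.50, 0.30, 0.15, 0.05],   # true class 3 is ranked last
])

top1 = top_k_accuracy_demo(demo_true, demo_proba, 1)
top2 = top_k_accuracy_demo(demo_true, demo_proba, 2)
print(f"Top-1: {top1:.2%}, Top-2: {top2:.2%}")
```

The second sample counts as a miss for Top-1 but as a hit for Top-2, which is exactly the "near miss" behaviour Top-K is meant to expose.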

## 3. Deciding on an Evaluation Protocol

### Hold-Out vs K-Fold Cross-Validation

| Dataset Size | Recommended Method | Rationale |
|--------------|-------------------|-----------|
| < 1,000 | K-Fold (K=5 or 10) | High variance with small hold-out sets |
| 1,000 – 10,000 | K-Fold or Hold-Out | Either works; K-fold more robust |
| > 10,000 | Hold-Out | Sufficient data; K-fold computationally expensive |
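
This notebook uses hold-out, so K-fold never appears in the code. For reference, a minimal sketch of the small-dataset alternative using scikit-learn's `StratifiedKFold` on synthetic data (the array names and sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 8))       # 100 synthetic samples, 8 features
y_small = np.repeat(np.arange(4), 25)     # 4 balanced classes, 25 samples each

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_small, y_small), start=1):
    # A model would be fitted on train_idx and scored on val_idx here
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

Every sample is used for validation exactly once, which reduces the variance of the estimate at the cost of training the model K times.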

### Data Split Strategy (This Notebook)

```
Original Data (70,000 samples) → Hold-Out Selected
├── Test Set (10%) - Final evaluation only
└── Training Pool (90%)
    ├── Training Set (~81%) - Model training
    └── Validation Set (~9%) - Hyperparameter tuning
```

**Important:** We use the `stratify` parameter of `train_test_split` to maintain class proportions in all splits.
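
The two-stage split can be sketched on synthetic labels to check that it yields the 81% / 9% / 10% proportions with equal class counts in every subset. The names `demo_labels` and `demo_idx` are illustrative; the notebook's own split appears in Sections 4.4 and 5.2:

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_labels = np.repeat(np.arange(10), 1000)   # 10 classes x 1,000 synthetic labels
demo_idx = np.arange(len(demo_labels))

# Stage 1: hold out 10% as the test set
pool_idx, test_idx = train_test_split(
    demo_idx, test_size=0.10, stratify=demo_labels, random_state=0)

# Stage 2: hold out 10% of the remaining pool as the validation set
train_idx, val_idx = train_test_split(
    pool_idx, test_size=0.10, stratify=demo_labels[pool_idx], random_state=0)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    share = len(idx) / len(demo_labels)
    per_class = np.bincount(demo_labels[idx], minlength=10)
    print(f"{name:>5}: {len(idx):>5} samples ({share:.0%}), "
          f"per-class counts {per_class.min()}-{per_class.max()}")
```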

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

## 4. Preparing Your Data

### 4.1 Import Libraries and Set Random Seed

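We set random seeds for reproducibility, so that repeated runs of the notebook produce closely matching results. Exact repeatability is not guaranteed: GPU kernels and data-loading order can still introduce small variations.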

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from skimage.transform import resize

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import tensorflow_datasets as tfds

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Load Dataset from TensorFlow Datasets

TensorFlow Datasets provides convenient access to standard ML benchmarks.

In [None]:
# Dataset configuration
DATASET = 'fashion_mnist'
RESIZE = (32, 32)  # Resize 28x28 to 32x32 - balance between detail and dimensionality

# Class names for Fashion MNIST
CLASS_NAMES = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

print(f"Loading {DATASET} from TensorFlow Datasets...")

In [None]:
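# Load dataset (shuffle_files=False keeps the read order deterministic;
# shuffling happens later in train_test_split with a fixed seed)
ds = tfds.load(DATASET, split='all', shuffle_files=False)

# Extract images and labels
images, labels = [], []
for entry in ds: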
    image, label = entry['image'], entry['label']
    
    # Convert to numpy (grayscale has 1 channel)
    image = image.numpy()[:, :, 0]
    label = label.numpy()
    
    # Resize image to reduce dimensions
    image = resize(image, RESIZE, anti_aliasing=True)
    
    images.append(image)
    labels.append(label)

print(f"Loaded {len(images):,} images")
print(f"Original size: 28×28, Resized to: {RESIZE[0]}×{RESIZE[1]}")

In [None]:
# Display sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(images[i], cmap='gray')
    ax.set_title(CLASS_NAMES[labels[i]])
    ax.axis('off')
plt.suptitle('Sample Fashion MNIST Images (resized to 32×32)', fontsize=12)
plt.tight_layout()
plt.show()

### 4.3 Prepare Features and Labels

**Image Preprocessing for Dense Networks:**

```
28×28 Image → Resize to 32×32 → Flatten to 1024 features → Normalise [0, 1]
```

**Why flatten?** Dense layers expect 1D input. Without CNNs (which preserve spatial structure), we treat each pixel as an independent feature.

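**Why resize to 32×32?** Expanding from 784 (28×28) to 1024 (32×32) features:
- Provides a standard power-of-2 dimension for neural networks
- Keeps the detail of the original images, since nothing is discarded by downsampling
- Adds no new information: interpolation only spreads the existing pixel values over more features, at the cost of a roughly 31% larger input layer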
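
The flattening step itself is a plain reshape. A minimal sketch on random stand-in data (the array `demo_batch` is synthetic, not Fashion MNIST) shows the shape change and that the operation is reversible:

```python
import numpy as np

rng = np.random.default_rng(0)
demo_batch = rng.random((5, 32, 32))                  # 5 synthetic "images" with values in [0, 1)

flat = demo_batch.reshape(demo_batch.shape[0], -1)    # (5, 32, 32) -> (5, 1024)
restored = flat.reshape(-1, 32, 32)                   # undo the flattening

print(flat.shape)                                     # (5, 1024)
print(np.array_equal(restored, demo_batch))           # True
```

No pixel values are lost by flattening. What is lost is the explicit notion of which pixels are neighbours, because a dense layer treats the 1024 inputs as an unordered feature vector.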

In [None]:
# Convert to numpy arrays and flatten
X = np.array(images)
X = X.reshape((X.shape[0], -1))  # Flatten: (N, 32, 32) -> (N, 1024)

print(f"Feature shape: {X.shape}")
print(f"Features per image: {X.shape[1]} ({RESIZE[0]}×{RESIZE[1]} = {RESIZE[0]*RESIZE[1]})")

In [None]:
# Encode labels
label_encoder = LabelEncoder()
label_encoder.fit(labels)

# One-hot encode for multi-class classification
y = to_categorical(label_encoder.transform(labels))

print(f"Label shape: {y.shape}")
print(f"Number of classes: {y.shape[1]}")

### 4.4 Split Data into Train and Test Sets

In [None]:
TEST_SIZE = 0.10

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, 
    stratify=labels,  # Maintain class balance
    shuffle=True, random_state=SEED
)

print(f"Training pool: {len(X_train_full):,} samples")
print(f"Test set: {len(X_test):,} samples")

In [None]:
# Rescale using the training pool's maximum only, so that no test-set
# statistic leaks into preprocessing and both sets share the same scale
pixel_max = X_train_full.max()
X_train_full = X_train_full / pixel_max
X_test = X_test / pixel_max

print(f"Pixel value range: [{X_train_full.min():.2f}, {X_train_full.max():.2f}]")

## 5. Developing a Model That Does Better Than a Baseline

Before building complex models, we need to establish **baseline performance**.
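
One common reference point is the majority-class baseline. As a hedged sketch, scikit-learn's `DummyClassifier` computes it directly; the labels below are synthetic stand-ins for a balanced 10-class problem, not the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

demo_y = np.repeat(np.arange(10), 50)      # 10 balanced classes, 50 synthetic labels each
demo_X = np.zeros((len(demo_y), 1))        # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent").fit(demo_X, demo_y)
majority_acc = dummy.score(demo_X, demo_y)
print(f"Majority-class baseline: {majority_acc:.2%}")
```

With perfectly balanced classes this coincides with random guessing (1 / number of classes), so either figure can serve as the floor that a trained model must beat.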

### 5.1 Examine Class Distribution

In [None]:
# Count samples per class
counts = np.sum(y, axis=0)
print("Samples per class:")
for i, (name, count) in enumerate(zip(CLASS_NAMES, counts)):
    print(f"  {i}: {name}: {int(count):,}")

In [None]:
# =============================================================================
# DATA-DRIVEN ANALYSIS: Dataset Size & Imbalance
# =============================================================================

# Dataset size analysis
n_samples = len(X)
n_classes = y.shape[1]
HOLDOUT_THRESHOLD = 10000

# Imbalance analysis
imbalance_ratio = counts.max() / counts.min()
IMBALANCE_THRESHOLD = 3.0

use_holdout = n_samples > HOLDOUT_THRESHOLD
use_accuracy = imbalance_ratio <= IMBALANCE_THRESHOLD

print("=" * 60)
print("DATA-DRIVEN CONFIGURATION")
print("=" * 60)
print(f"\n1. DATASET SIZE: {n_samples:,} samples")
print(f"   Threshold: {HOLDOUT_THRESHOLD:,} samples")
print(f"   Decision: {'Hold-Out' if use_holdout else 'K-Fold Cross-Validation'}")

print(f"\n2. CLASS IMBALANCE: {imbalance_ratio:.2f}:1 ratio")
print(f"   Threshold: {IMBALANCE_THRESHOLD:.1f}:1")
print(f"   Decision: {'Accuracy (balanced)' if use_accuracy else 'F1-Score (imbalanced)'}")

print(f"\n3. NUMBER OF CLASSES: {n_classes}")

print("\n" + "=" * 60)
PRIMARY_METRIC = 'accuracy' if use_accuracy else 'f1'
print(f"PRIMARY METRIC: {PRIMARY_METRIC.upper()}")
print("=" * 60)

In [None]:
# Baseline accuracy (random guessing with balanced classes)
baseline = 1.0 / n_classes

print(f"Baseline accuracy (random guess): {baseline:.2%}")

### 5.2 Create Validation Set

In [None]:
VALIDATION_SIZE = 0.10  # 10% of training pool

X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, 
    test_size=VALIDATION_SIZE, 
    stratify=y_train_full.argmax(axis=1),
    shuffle=True, random_state=SEED
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Validation set: {X_val.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")

### 5.3 Configure Training Parameters

In [None]:
INPUT_DIMENSION = X_train.shape[1]
OUTPUT_CLASSES = y_train.shape[1]

OPTIMIZER = 'adam'
LOSS_FUNC = 'categorical_crossentropy'  # Multi-class classification

# Training metrics
METRICS = ['accuracy']

print(f"Input dimension: {INPUT_DIMENSION}")
print(f"Output classes: {OUTPUT_CLASSES}")

In [None]:
# Single-Layer Perceptron (no hidden layers) - Baseline
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
slp_model.add(Dense(OUTPUT_CLASSES, activation='softmax'))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# =============================================================================
# TRAINING CONFIGURATION
# =============================================================================

# Batch Size Selection:
# - Large datasets (>10,000 samples): Use 512 for efficient GPU utilisation
# - Small datasets (<10,000 samples): Use 32-64 for better gradient estimates
# Fashion MNIST has 70,000 samples → Use batch size 512
BATCH_SIZE = 512

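# Epoch strategy:
# EPOCHS_BASELINE (100): For SLP and unregularised DNN
# EPOCHS_REGULARIZED (150): Intended for a DNN with Dropout + L2 trained on a
#   fixed schedule (regularisation slows convergence, so more epochs are needed).
#   Section 7 retrains for Hyperband's matched epoch count instead, so this
#   constant is kept for reference only.

EPOCHS_BASELINE = 100
EPOCHS_REGULARIZED = 150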

In [None]:
# Train the Single-Layer Perceptron
history_slp = slp_model.fit(
    X_train, y_train, 
    batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
    validation_data=(X_val, y_val),
    verbose=0
)
val_score_slp = slp_model.evaluate(X_val, y_val, verbose=0)

In [None]:
# Display SLP validation metrics
preds_slp_val = slp_model.predict(X_val, verbose=0).argmax(axis=1)
acc_slp_val = accuracy_score(y_val.argmax(axis=1), preds_slp_val)

print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(acc_slp_val, baseline))
print(f'\nAccuracy: {acc_slp_val:.2%}  ← Primary Metric')

In [None]:
def plot_training_history(history, title=None):
    """
    Plot training and validation metrics over epochs.
    Plots: (1) Loss, (2) Accuracy
    """
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    epochs = range(1, len(history.history['loss']) + 1)
    title_suffix = f' ({title})' if title else ''

    # Plot 1: Loss
    axs[0].plot(epochs, history.history['loss'], 'b-', label='Training', linewidth=1.5)
    axs[0].plot(epochs, history.history['val_loss'], 'r-', label='Validation', linewidth=1.5)
    axs[0].set_title(f'Loss{title_suffix}')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('Loss')
    axs[0].legend()
    axs[0].grid(alpha=0.3)

    # Plot 2: Accuracy
    axs[1].plot(epochs, history.history['accuracy'], 'b-', label='Training', linewidth=1.5)
    axs[1].plot(epochs, history.history['val_accuracy'], 'r-', label='Validation', linewidth=1.5)
    axs[1].set_title(f'Accuracy{title_suffix}')
    axs[1].set_xlabel('Epochs')
    axs[1].set_ylabel('Accuracy')
    axs[1].legend()
    axs[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

In [None]:
# Plot SLP training history
plot_training_history(history_slp, title='SLP Baseline')

## 6. Scaling Up: Developing a Model That Overfits

The next step is to build a model with **enough capacity to overfit**. If a model can't overfit, it may be too simple to learn the patterns in the data.

**No regularisation applied:** We intentionally train this model **without any regularisation** (no dropout, no L2, no early stopping) to observe overfitting behaviour.
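
Overfitting shows up as validation loss turning upward while training loss keeps falling. A small sketch of how that turning point could be read from a history dictionary; the helper `summarise_overfitting` and the loss values are made up for illustration:

```python
def summarise_overfitting(history_dict):
    """Return the 1-indexed epoch with the lowest validation loss and the
    final gap between validation and training loss."""
    val_loss = history_dict["val_loss"]
    turn_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    final_gap = val_loss[-1] - history_dict["loss"][-1]
    return turn_epoch, final_gap

# Made-up curves: training loss keeps falling, validation loss turns upward
toy_history = {
    "loss":     [0.90, 0.60, 0.45, 0.35, 0.28, 0.22],
    "val_loss": [0.95, 0.70, 0.58, 0.55, 0.57, 0.61],
}
turn_epoch, final_gap = summarise_overfitting(toy_history)
print(f"Validation loss bottoms out at epoch {turn_epoch}; final gap = {final_gap:.2f}")
```

After training, the same function can be applied to the `.history` dictionary of a Keras `History` object.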

---

### Architecture Design Decisions

**Why 64 neurons in the hidden layer?**

This is a practical starting point that balances capacity and efficiency:
- **Too few (e.g., 16):** May not have enough capacity to learn complex visual patterns
- **Too many (e.g., 512):** Increases overfitting risk and training time without proportional benefit
- **64 neurons:** A common choice that provides sufficient capacity for most classification tasks on flattened images
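
To put "capacity" in numbers, the parameter count of a dense stack can be worked out by hand: each layer has inputs × units weights plus one bias per unit. A short sketch, assuming the 1024-feature input used in this notebook (the helper `dense_param_count` is illustrative):

```python
def dense_param_count(layer_sizes):
    """Total weights + biases for a stack of fully connected layers.
    `layer_sizes` lists the input width followed by each layer's units."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

slp_params = dense_param_count([1024, 10])        # no hidden layer
dnn_params = dense_param_count([1024, 64, 10])    # one hidden layer of 64 units
print(f"SLP: {slp_params:,} parameters")          # SLP: 10,250 parameters
print(f"DNN: {dnn_params:,} parameters")          # DNN: 66,250 parameters
```

These totals should match the "Total params" line printed by `model.summary()` for the corresponding models.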

**Why only 1 hidden layer instead of 2-3?**

Per the **Universal ML Workflow**, the goal of this step is to demonstrate that the model *can* overfit—proving it has sufficient capacity to capture the underlying patterns. Once overfitting is observed:

1. **Capacity is proven sufficient:** If the model overfits, it can learn the training data's complexity
2. **No need for more depth:** Adding layers would increase overfitting further without benefit
3. **Regularise, don't expand:** The next step (Section 7) is to *reduce* overfitting through regularisation, not to add more capacity

If this 1-layer model *couldn't* overfit (training and validation loss both plateau high), we would then add more layers. But since it does overfit, the architecture is adequate.

*"The right question is not 'How many layers?' but 'Can it overfit?' If yes, regularise. If no, add capacity."*

### 6.1 Build a Deep Neural Network (DNN)

In [None]:
# Deep Neural Network (1 hidden layer, no regularisation)
dnn_model = Sequential(name='Deep_Neural_Network')
dnn_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
dnn_model.add(Dense(64, activation='relu'))
dnn_model.add(Dense(OUTPUT_CLASSES, activation='softmax'))
dnn_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

dnn_model.summary()

In [None]:
# Train the Deep Neural Network
history_dnn = dnn_model.fit(
    X_train, y_train, 
    batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
    validation_data=(X_val, y_val), 
    verbose=0
)
val_score_dnn = dnn_model.evaluate(X_val, y_val, verbose=0)

In [None]:
# Plot DNN training history
plot_training_history(history_dnn, title='DNN - No Regularisation')

In [None]:
# Display DNN validation metrics
preds_dnn_val = dnn_model.predict(X_val, verbose=0).argmax(axis=1)
acc_dnn_val = accuracy_score(y_val.argmax(axis=1), preds_dnn_val)

print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(acc_dnn_val, baseline))
print(f'\nAccuracy: {acc_dnn_val:.2%}  ← Primary Metric')

## 7. Regularising Your Model and Tuning Hyperparameters

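Now we address overfitting by adding **Dropout + L2 regularisation**.

We use **Hyperband** to tune the L2 strength, dropout rate, and learning rate efficiently: it trains many configurations for a few epochs each and gives a larger epoch budget only to the most promising ones.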
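
The core idea behind Hyperband is successive halving. The sketch below is a deliberately simplified illustration with a synthetic scoring function; it does not reproduce Keras Tuner's exact bracket schedule, and `successive_halving` and `toy_score` are hypothetical names:

```python
import random

def successive_halving(configs, score_fn, min_epochs=2, factor=3, max_epochs=20):
    """Score every surviving configuration at the current epoch budget,
    keep the best 1/factor of them, then multiply the budget by factor."""
    survivors, epochs, rounds = list(configs), min_epochs, []
    while True:
        ranked = sorted(survivors, key=lambda c: score_fn(c, epochs), reverse=True)
        rounds.append((epochs, len(ranked)))
        if len(ranked) == 1 or epochs >= max_epochs:
            return ranked[0], rounds
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs = min(epochs * factor, max_epochs)

def toy_score(config, epochs):
    """Stand-in for 'train for `epochs` and return val_accuracy' (synthetic)."""
    return 1.0 - abs(config["dropout"] - 0.3) - 0.5 / epochs

random.seed(0)
candidates = [{"dropout": round(random.uniform(0.0, 0.5), 2)} for _ in range(9)]
best_config, rounds = successive_halving(candidates, toy_score)
print(rounds)        # (epoch budget, configurations evaluated) for each round
print(best_config)
```

Most of the compute goes to a handful of survivors rather than being spread evenly across all nine candidates, which is where the efficiency gain comes from.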

### 7.1 Hyperband Search

In [None]:
# Hyperband Model Builder
def build_model_hyperband(hp):
    """
    Build Fashion MNIST model with FIXED architecture (1 hidden layer, 64 neurons).
    Only tunes regularisation and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # L2 regularisation strength
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')

    # Fixed architecture: 1 hidden layer with 64 neurons
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(l2_reg)))
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))

    # Output layer for multi-class classification
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))

    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
tuner = kt.Hyperband(
    build_model_hyperband,
    objective='val_accuracy',
    max_epochs=20,
    factor=3,
    directory='fashion_mnist_hyperband',
    project_name='fashion_mnist_tuning',
    overwrite=True
)

print("Tuning objective: val_accuracy")

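# Run Hyperband search
# Note: Hyperband assigns each trial its own epoch budget (up to max_epochs),
# which takes precedence over the `epochs` argument passed here.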
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=BATCH_SIZE
)

In [None]:
# Get best hyperparameters
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout')}")
print(f"  Learning Rate: {best_hp.get('lr'):.6f}")

# =============================================================================
# CRITICAL: Extract the number of epochs from the best trial
# =============================================================================
# Hyperband trains different configurations for different numbers of epochs.
# The best trial achieved its performance at a SPECIFIC epoch count.
# We must retrain for the SAME number of epochs to avoid mismatch.

best_trial = tuner.oracle.get_best_trials(num_trials=1)[0]
best_epochs = best_trial.best_step + 1  # best_step is 0-indexed

print(f"\n>>> Best trial was trained for {best_epochs} epochs <<<")
print(f"    (This is the epoch count we'll use for retraining)")

# Build a fresh model with the best hyperparameters
opt_model = tuner.hypermodel.build(best_hp)
opt_model.summary()

### 7.2 Retraining with Full Data and Matched Epochs

After finding the best hyperparameters, we retrain with two important considerations:

1. **Matched Epochs:** Use the exact number of epochs that produced the best validation score during Hyperband tuning
2. **Combined Training Data:** Merge training and validation sets to maximise the data available for the final model

---

#### Why Combine Training and Validation Sets?

Once hyperparameter tuning is complete, the validation set has served its purpose (model selection). For the final model:

| Approach | Training Data | Benefit |
|----------|--------------|---------|
| Keep validation separate | ~81% of original | Can monitor overfitting during retrain |
| **Combine train + validation** | **~90% of original** | **Maximises data for final model** |

We combine because:
- **More data = better generalisation:** The model learns from more examples
- **Validation set's job is done:** It was used for hyperparameter selection, not needed for final training
- **Standard practice:** This is the recommended approach in production ML pipelines (Chollet, 2021)

---

#### The Epoch Mismatch Problem

Hyperband finds hyperparameters optimal at a **specific epoch count**. If we retrain for a different number of epochs, the hyperparameters may no longer be optimal.

We extract `best_trial.best_step + 1` to match the exact epoch count where Hyperband found the best validation score.
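
One caveat worth keeping in mind: matching the epoch count does not match the number of gradient updates, because each epoch over the combined set contains more batches. A back-of-the-envelope sketch, assuming the 81% / 9% split of 70,000 samples and a placeholder epoch count:

```python
import math

BATCH = 512                       # matches BATCH_SIZE in this notebook
n_train, n_val = 56_700, 6_300    # assumed sizes: 81% and 9% of 70,000 samples
example_epochs = 20               # placeholder; the real value comes from best_epochs

steps_train = math.ceil(n_train / BATCH)                 # updates per epoch during tuning
steps_combined = math.ceil((n_train + n_val) / BATCH)    # updates per epoch when retraining

print(f"Tuning:     {steps_train} steps/epoch -> {steps_train * example_epochs:,} updates")
print(f"Retraining: {steps_combined} steps/epoch -> {steps_combined * example_epochs:,} updates")
```

The difference is about 12% more updates per epoch, which is usually small enough that the matched epoch count remains a reasonable choice.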

> *"Combine your data to maximise learning, match your epochs to honour what Hyperband discovered."*

In [None]:
# =============================================================================
# RETRAIN WITH FULL DATA AND MATCHED EPOCHS
# =============================================================================

# Combine training and validation sets for final model
X_combined = np.vstack([X_train, X_val])
y_combined = np.vstack([y_train, y_val])  # One-hot encoded labels

print(f"Training data: {X_train.shape[0]:,} samples")
print(f"Validation data: {X_val.shape[0]:,} samples")
print(f"Combined data: {X_combined.shape[0]:,} samples")
print(f"  → {(X_combined.shape[0] / X_train.shape[0] - 1) * 100:.1f}% more training data")

print(f"\nRetraining with best hyperparameters for {best_epochs} epochs...")
print(f"(Matching the epoch count from Hyperband's best trial)")

history_opt = opt_model.fit(
    X_combined, y_combined,  # Train on combined data
    epochs=best_epochs,       # CRITICAL: Use matched epochs!
    batch_size=BATCH_SIZE,
    verbose=0
    # Note: No validation_data - we've merged it into training
)

print(f"\nTraining complete on combined dataset.")

In [None]:
# =============================================================================
# PLOT TRAINING METRICS (Training only - no validation data)
# =============================================================================
def plot_training_only(history, title=None):
    """
    Plot training metrics over epochs (no validation).
    Used when training on combined train+val data.
    """
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    epochs = range(1, len(history.history['loss']) + 1)
    title_suffix = f' ({title})' if title else ''

    # Plot 1: Loss
    axs[0].plot(epochs, history.history['loss'], 'b-', label='Training', linewidth=1.5)
    axs[0].set_title(f'Training Loss{title_suffix}')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('Loss')
    axs[0].legend()
    axs[0].grid(alpha=0.3)

    # Plot 2: Accuracy
    axs[1].plot(epochs, history.history['accuracy'], 'b-', label='Training', linewidth=1.5)
    axs[1].set_title(f'Training Accuracy{title_suffix}')
    axs[1].set_xlabel('Epochs')
    axs[1].set_ylabel('Accuracy')
    axs[1].legend()
    axs[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

# Plot training history for the final model (trained on combined data)
plot_training_only(history_opt, title=f'Final Model - Combined Data ({best_epochs} epochs)')

In [None]:
# =============================================================================
# VALIDATION SET STATUS
# =============================================================================
print("=" * 60)
print("VALIDATION SET STATUS")
print("=" * 60)
print("\nThe validation set has been merged into training data.")
print("This maximises the data available for the final model.")
print("\nValidation metrics from hyperparameter tuning (before merge):")
print(f"  - These were used to select the best hyperparameters")
print(f"  - The final model is evaluated on the TEST set only")
print("=" * 60)

### 7.3 Final Model Evaluation on Test Set

In [None]:
# Final evaluation on test set
preds_test = opt_model.predict(X_test, verbose=0)
preds_test_labels = preds_test.argmax(axis=1)
y_test_labels = y_test.argmax(axis=1)

test_accuracy = accuracy_score(y_test_labels, preds_test_labels)
test_precision = precision_score(y_test_labels, preds_test_labels, average='macro')
test_recall = recall_score(y_test_labels, preds_test_labels, average='macro')
test_f1 = f1_score(y_test_labels, preds_test_labels, average='macro')

print('=' * 50)
print('FINAL TEST SET RESULTS')
print('=' * 50)
print(f'Accuracy (Test): {test_accuracy:.4f}  ← Primary Metric')
print(f'Precision (Test, macro): {test_precision:.4f}')
print(f'Recall (Test, macro): {test_recall:.4f}')
print(f'F1-Score (Test, macro): {test_f1:.4f}')
print(f'\nBaseline: {baseline:.4f}')

In [None]:
# Confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
cm = confusion_matrix(y_test_labels, preds_test_labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=CLASS_NAMES)
disp.plot(ax=ax, cmap='Blues', values_format='d', xticks_rotation=45)
plt.title('Confusion Matrix - Test Set')
plt.tight_layout()
plt.show()

In [None]:
# Top-K Accuracy
def top_k_accuracy(y_true, y_pred_proba, k):
    """Calculate top-K accuracy: correct class in top K predictions."""
    top_k_preds = np.argsort(y_pred_proba, axis=1)[:, -k:]
    correct = sum(y_true[i] in top_k_preds[i] for i in range(len(y_true)))
    return correct / len(y_true)

print("Top-K Accuracy (Test Set):")
for k in [1, 2, 3]:
    top_k = top_k_accuracy(y_test_labels, preds_test, k)
    print(f"  Top-{k}: {top_k:.2%}")

---

## 8. Results Summary

In [None]:
# Results summary
results = pd.DataFrame({
    'Model': ['Random Baseline', 'SLP (No Hidden)', 'DNN (No Regularisation)', 'DNN (Dropout + L2) - Test'],
    'Accuracy': [baseline, acc_slp_val, acc_dnn_val, test_accuracy],
    'Dataset': ['N/A', 'Validation', 'Validation', 'Test']
})

print("=" * 70)
print("MODEL COMPARISON - RESULTS SUMMARY")
print("=" * 70)
print(f"Primary Metric: ACCURACY (balanced dataset, {n_classes} classes)")
print("=" * 70)
print(results.to_string(index=False, float_format='{:.4f}'.format))
print("=" * 70)
print(f"\nKey Observations:")
print(f"  - Lowest model accuracy: {min(acc_slp_val, acc_dnn_val, test_accuracy):.2%} vs random baseline {baseline:.2%}")
print(f"  - Final model trained on combined train+val data ({X_combined.shape[0]:,} samples)")
print(f"  - Test accuracy: {test_accuracy:.2%}, Top-3: {top_k_accuracy(y_test_labels, preds_test, 3):.2%}")

---

## 9. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | 70,000 samples | Hold-Out | Kohavi (1995) |
| **Accuracy vs F1-Score** | > 3:1 imbalance | 1.00:1 ratio | Accuracy | He and Garcia (2009) |
| **Batch Size** | > 10,000 samples | 70,000 samples | 512 | Efficient GPU utilisation |

### Lessons Learned

1. **Image Preprocessing for DNNs:** Without CNNs, images must be flattened to 1D vectors. This sacrifices spatial relationships but enables use of dense layers.

2. **Grayscale Simplifies Processing:** No colour conversion needed—single channel per pixel reduces dimensionality (3x fewer features than RGB).

3. **Balanced Classes → Accuracy:** With 10 equal-sized classes, accuracy is a valid primary metric. No class weights needed.

4. **Top-K Accuracy for Multi-Class:** When similar items may be confused (Pullover vs. Coat), Top-K shows how often the correct class is in the top K predictions.

5. **Capacity Before Regularisation:** Build a model that overfits first (Section 6). If it can't overfit, add more capacity. Only then regularise (Section 7).

6. **Maximise Data for Final Model:** After hyperparameter tuning, we combine training and validation sets for the final model. The validation set's job is done (model selection), so we use all available data to maximise learning.

7. **Regularisation Enables Longer Training:** With proper regularisation, the model can train for more epochs with a lower risk of overfitting. *"Regularisation buys you the freedom to train longer."*

8. **Technique Scope:** Dense layers only (Ch. 1-4). CNNs (Ch. 8) typically reach ~92-94% accuracy on this dataset by preserving spatial structure.

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

- Xiao, H., Rasul, K. and Vollgraf, R. (2017) 'Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms', *arXiv preprint arXiv:1708.07747*.

---

## Appendix: Modular Helper Functions

For cleaner code organisation, you can wrap the model building and training patterns into reusable functions.

In [None]:
# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================

def build_multiclass_classifier(input_dim, num_classes, hidden_units=None, dropout=0.0, l2_reg=0.0,
                                 optimizer='adam', learning_rate=None, name=None):
    """
    Build a multi-class image classification neural network.
    
    Parameters:
    -----------
    input_dim : int
        Number of input features (flattened image size)
    num_classes : int
        Number of output classes
    hidden_units : list of int, optional
        Neurons per hidden layer, e.g., [64] or [128, 64]
    dropout : float
        Dropout rate (0.0 to 0.5)
    l2_reg : float
        L2 regularisation strength
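    optimizer : str or keras optimizer
        Optimizer used when learning_rate is None (default 'adam')
    learning_rate : float, optional
        If given, an Adam optimizer with this learning rate replaces `optimizer`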
    name : str, optional
        Model name
        
    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    """
    model = Sequential(name=name)
    model.add(layers.Input(shape=(input_dim,)))
    
    hidden_units = hidden_units or []
    kernel_reg = regularizers.l2(l2_reg) if l2_reg > 0 else None
    
    for units in hidden_units:
        model.add(Dense(units, activation='relu', kernel_regularizer=kernel_reg))
        if dropout > 0:
            model.add(Dropout(dropout))
    
    model.add(Dense(num_classes, activation='softmax'))
    
    if learning_rate is not None:
        opt = keras.optimizers.Adam(learning_rate=learning_rate)
    else:
        opt = optimizer
    
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model


def train_model(model, X_train, y_train, X_val, y_val,
                batch_size=512, epochs=100, verbose=0):
    """Train a model and return training history."""
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        batch_size=batch_size, 
        epochs=epochs,
        verbose=verbose
    )


def evaluate_multiclass(model, X, y_true_onehot, class_names=None):
    """
    Evaluate multi-class classification model.
    
    Returns:
    --------
    dict : Dictionary containing accuracy, precision, recall, f1, and top-k accuracy
    """
    y_pred_proba = model.predict(X, verbose=0)
    y_pred = y_pred_proba.argmax(axis=1)
    y_true = y_true_onehot.argmax(axis=1)
    
    # Top-K accuracy
    def top_k_acc(y_true, y_pred_proba, k):
        top_k_preds = np.argsort(y_pred_proba, axis=1)[:, -k:]
        return sum(y_true[i] in top_k_preds[i] for i in range(len(y_true))) / len(y_true)
    
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='macro'),
        'recall': recall_score(y_true, y_pred, average='macro'),
        'f1': f1_score(y_true, y_pred, average='macro'),
        'top_2_accuracy': top_k_acc(y_true, y_pred_proba, 2),
        'top_3_accuracy': top_k_acc(y_true, y_pred_proba, 3),
    }
    
    return metrics


# =============================================================================
# USAGE EXAMPLES
# =============================================================================
# 
# # Build models
# slp = build_multiclass_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, name='SLP')
# mlp = build_multiclass_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, 
#                                    hidden_units=[64], name='MLP')
# mlp_reg = build_multiclass_classifier(INPUT_DIMENSION, OUTPUT_CLASSES,
#                                        hidden_units=[64], dropout=0.3, l2_reg=0.001,
#                                        learning_rate=0.001, name='MLP_Regularized')
# 
# # Train
# history = train_model(mlp, X_train, y_train, X_val, y_val)
# 
# # Evaluate
# metrics = evaluate_multiclass(mlp, X_val, y_val)
# print(f"Accuracy: {metrics['accuracy']:.4f}, Top-3: {metrics['top_3_accuracy']:.4f}")