<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Imagenette/Imagenette%20-%20TFDS%20Color%20Image%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imagenette - TFDS Colour Image Example

This notebook demonstrates the **Universal ML Workflow** for multi-class image classification using TensorFlow Datasets (TFDS).

## Learning Objectives

By the end of this notebook, you will be able to:
- Load image datasets from **TensorFlow Datasets (TFDS)**
- Preprocess colour images: Resize → Grayscale → Flatten → Normalise
- Handle 10-class image classification with Top-K accuracy
- Apply Hyperband for hyperparameter tuning on image data

---

## Technique Scope

| Aspect | What We Use | What We Don't Use (Yet) |
|--------|-------------|------------------------|
| **Architecture** | Dense layers only | CNNs, pooling, feature extractors |
| **Regularisation** | L2 + Dropout | Early stopping, data augmentation |
| **Optimiser** | Adam | SGD with momentum, learning rate schedules |
| **Tuning** | Hyperband | Bayesian optimisation, neural architecture search |

> **Note**: Dense networks applied to flattened images serve as a baseline. CNNs (Chapter 8) are the standard approach for image classification and would significantly improve performance on this challenging dataset.

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [TensorFlow Datasets - imagenette/160px](https://www.tensorflow.org/datasets/catalog/imagenette) |
| **Problem Type** | Multi-Class Classification (10 classes) |
| **Classes** | Tench, English springer, Cassette player, Chain saw, Church, French horn, Garbage truck, Gas pump, Golf ball, Parachute |
| **Data Balance** | Nearly Balanced |
| **Total Images** | ~13,000 images |
| **Preprocessing** | Resize to 32×32 → Grayscale → Flatten (1024 features) |

### Imagenette vs Fashion MNIST

| Aspect | Imagenette | Fashion MNIST |
|--------|------------|---------------|
| **Original Images** | 160×160 colour | 28×28 grayscale |
| **Number of Classes** | 10 | 10 |
| **Image Content** | Real-world photos | Product photos of clothing (centred, plain background) |
| **Difficulty** | Harder (high variation) | Moderate |
| **Dataset Size** | ~13,000 | 70,000 |

---

## Code Reuse Philosophy

This notebook follows a **"Same Code, Different Data"** philosophy. The core ML pipeline remains consistent across different classification tasks:

```
┌─────────────────────────────────────────────────────────────────┐
│                    UNIVERSAL ML PIPELINE                        │
├─────────────────────────────────────────────────────────────────┤
│  Data Loading → Preprocessing → Train/Val/Test Split → Model   │
│  → Baseline → Overfitting → Regularisation → Evaluation        │
└─────────────────────────────────────────────────────────────────┘
```

**What changes:** Data source, preprocessing, number of output classes  
**What stays the same:** Model architecture pattern, training loop, evaluation code

---

## 1. Defining the Problem and Assembling a Dataset

**Problem:** Classify images of 10 different objects from the Imagenette dataset - a smaller subset of ImageNet designed for faster experimentation.

**Why this matters:**
- **Object recognition foundation:** Real-world object classification is the basis for autonomous vehicles, robotics, and visual search
- **Transfer learning testbed:** Imagenette is used to quickly test architectures before scaling to full ImageNet
- **Research accessibility:** Enables ML practitioners without GPU clusters to experiment with image classification

**Why Imagenette?** ImageNet has 1000 classes and millions of images, making it slow to iterate on. Imagenette provides a 10-class subset that's large enough to be challenging but small enough for rapid prototyping.

**Why TFDS?** TensorFlow Datasets provides easy access to common ML datasets with consistent APIs, automatic caching, and preprocessing utilities.

## 2. Choosing a Measure of Success

### Data-Driven Metric Selection

| Criterion | This Dataset | Decision |
|-----------|--------------|----------|
| **Class Balance** | ~Equal across 10 classes | Balanced |
| **Number of Classes** | 10 | Multi-class |
| **Primary Metric** | Accuracy | Standard for balanced multi-class |
| **Secondary Metrics** | Top-K Accuracy | Additional insight for multi-class |

**Why these thresholds?**
- **3:1 ratio**: When the majority class makes up 75% or more of the samples (3:1 in a two-class problem), a naive classifier that always predicts it achieves high accuracy while ignoring minority classes
- **Balanced data (< 3:1):** Accuracy is meaningful and interpretable
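The arithmetic behind the 3:1 rule can be checked directly. The sketch below uses made-up class counts (not Imagenette's) and a hypothetical helper, `majority_class_accuracy`, to show what a classifier that always predicts the most frequent class would score.

```python
def majority_class_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    return max(class_counts) / sum(class_counts)


# Two classes at a 3:1 imbalance (illustrative counts)
print(majority_class_accuracy([300, 100]))   # 0.75

# Ten equally sized classes (illustrative counts)
print(majority_class_accuracy([100] * 10))   # 0.1
```

With ten balanced classes the naive score collapses to 10%, which is why plain accuracy remains informative for this dataset.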

### References

- Branco, P., Torgo, L. and Ribeiro, R.P. (2016) 'A survey of predictive modelling on imbalanced domains', *ACM Computing Surveys*, 49(2), pp. 1–50.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

**Decision:** Since the dataset is balanced, **Accuracy** is the primary metric. **Top-K Accuracy** shows if the correct class was among the model's top K predictions.

## 3. Deciding on an Evaluation Protocol

### Data-Driven Protocol Selection

| Criterion | This Dataset | Decision |
|-----------|--------------|----------|
| **Sample Size** | ~13,000 images | Large |
| **Threshold** | > 10,000 | Use Hold-Out |
| **Protocol** | Train/Validation/Test | 81%/9%/10% split (10% test, then 10% of the remainder for validation) |

**Why 10,000 as a practical threshold?**
- Below 10,000 samples, hold-out validation has higher variance (Kohavi, 1995)
- Above 10,000, statistical estimates from hold-out are reliable
- Deep learning models are expensive to train; K-fold multiplies cost by K (Chollet, 2021)

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning: data mining, inference, and prediction*. 2nd edn. New York: Springer.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

**Decision:** With ~13,000 samples, **Hold-Out validation** is appropriate.
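One way to see why sample size matters for hold-out is the binomial standard error of an accuracy estimate, `sqrt(p * (1 - p) / n)`, where `n` is the size of the held-out set. The sketch below assumes independent samples and uses a hypothetical accuracy of 0.5 (the value at which this variance is largest); the helper name is illustrative.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples held-out items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Hypothetical accuracy of 0.5, evaluated on held-out sets of different sizes
for n in (100, 1_000, 10_000):
    half_width = 1.96 * accuracy_standard_error(0.5, n)
    print(f"held-out n={n:>6}: approx. 95% interval = +/- {half_width:.3f}")
```

The interval shrinks with the square root of the held-out size, so a tenfold larger evaluation set narrows it by a factor of about 3.2.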

## 4. Preparing Your Data

### 4.1 Import Libraries and Load TFDS Dataset

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

from skimage.color import rgb2gray
from skimage.transform import resize

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
import tensorflow_datasets as tfds
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

# ============================================================
# RANDOM SEED - Set once, use everywhere
# ============================================================
SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Configuration

In [None]:
# ============================================================
# DATASET CONFIGURATION
# ============================================================
DATASET = 'imagenette/160px'
RESIZE = (32, 32)  # Resize to 32x32 - balance between preserving detail and dimensionality
GRAY_SCALE = True  # Convert to grayscale for simplicity

# ============================================================
# CLASS NAMES
# ============================================================
CLASS_NAMES = [
    'Tench', 'English springer', 'Cassette player', 'Chain saw', 'Church',
    'French horn', 'Garbage truck', 'Gas pump', 'Golf ball', 'Parachute'
]

### 4.3 Load and Preprocess Data

In [None]:
# Load dataset from TensorFlow Datasets
# Combine train and validation splits for custom splitting
ds_train = tfds.load(DATASET, split='train', shuffle_files=True)
ds_val = tfds.load(DATASET, split='validation', shuffle_files=True)

# Process images from both splits
images, labels = [], []

for ds in [ds_train, ds_val]:
    for entry in ds:
        image, label = entry['image'].numpy(), entry['label'].numpy()
        
        # Resize to target size
        image = resize(image, (*RESIZE, 3), anti_aliasing=True)
        
        # Convert to grayscale if specified
        if GRAY_SCALE:
            image = rgb2gray(image)
        
        images.append(image)
        labels.append(label)

print(f"Loaded {len(images)} images")

In [None]:
# Convert to numpy arrays
X = np.array(images)
y_raw = np.array(labels)

# Flatten images: (N, 32, 32) -> (N, 1024)
X = X.reshape((X.shape[0], -1))

# One-hot encode labels
y = to_categorical(y_raw)

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Number of classes: {y.shape[1]}")
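To make the Grayscale → Flatten steps concrete, here is a dependency-free sketch on a tiny 2×2 image. It assumes the luminance weights documented for `skimage.color.rgb2gray` (0.2125, 0.7154, 0.0721); `to_gray` and `flatten` are illustrative helpers, not part of the notebook's pipeline.

```python
# Luminance weights as documented for skimage.color.rgb2gray (assumed here)
R_WEIGHT, G_WEIGHT, B_WEIGHT = 0.2125, 0.7154, 0.0721


def to_gray(rgb_image):
    """Collapse each (r, g, b) pixel into one luminance value."""
    return [[R_WEIGHT * r + G_WEIGHT * g + B_WEIGHT * b for (r, g, b) in row]
            for row in rgb_image]


def flatten(gray_image):
    """Concatenate image rows into a single feature vector."""
    return [pixel for row in gray_image for pixel in row]


# 2x2 toy image: white, black, pure red, pure green
tiny = [
    [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
features = flatten(to_gray(tiny))
print(len(features))            # 4 features for a 2x2 image

# Feature counts before and after preprocessing a 160x160 colour image
print(160 * 160 * 3, 32 * 32)   # 76800 1024
```

The same two steps applied to a 32×32 image produce the 1024-element vectors used in the rest of the notebook.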

### 4.4 Verify Class Balance

In [None]:
# Check class distribution
unique, counts = np.unique(y_raw, return_counts=True)

print("Class Distribution:")
for class_idx, count in zip(unique, counts):
    print(f"  {CLASS_NAMES[class_idx]}: {count} ({100*count/len(y_raw):.1f}%)")

# Calculate imbalance ratio
imbalance_ratio = max(counts) / min(counts)
print(f"\nImbalance Ratio: {imbalance_ratio:.2f}:1")
print(f"Decision: {'Use Accuracy (balanced)' if imbalance_ratio < 3 else 'Use F1-Score (imbalanced)'}")

### 4.5 Train/Test Split

In [None]:
# ============================================================
# TRAIN/TEST SPLIT (90%/10%)
# ============================================================
TEST_SIZE = 0.10

# Split features, one-hot labels, and raw integer labels in a single call
# so the three arrays stay aligned row-for-row.
(X_train_full, X_test,
 y_train_full, y_test,
 y_train_full_raw, y_test_raw) = train_test_split(
    X, y, y_raw,
    test_size=TEST_SIZE,
    stratify=y_raw,
    random_state=SEED,
    shuffle=True
)

print(f"Training + Validation: {X_train_full.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")

### 4.6 Normalise Features

In [None]:
# ============================================================
# NORMALISE PIXEL VALUES [0, 1]
# ============================================================
# Note: skimage's resize() already returns floats in [0, 1] for uint8 input,
# so no rescaling is applied here; we only cast to float32 and check the range
X_train_full = X_train_full.astype('float32')
X_test = X_test.astype('float32')

# Verify normalisation
print(f"Feature range: [{X_train_full.min():.3f}, {X_train_full.max():.3f}]")

### 4.7 Train/Validation Split

In [None]:
# ============================================================
# TRAIN/VALIDATION SPLIT
# ============================================================
# Use 10% of training pool for validation (consistent with other notebooks)
VALIDATION_SIZE = 0.10

X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full,
    test_size=VALIDATION_SIZE,
    stratify=y_train_full.argmax(axis=1),
    random_state=SEED,
    shuffle=True
)

# Keep raw labels for train set (for class weights)
y_train_raw = y_train.argmax(axis=1)

print(f"Training: {X_train.shape[0]} samples")
print(f"Validation: {X_val.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")
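Because the validation set is carved out of the 90% training pool rather than the full dataset, the effective fractions are not 80/10/10. A quick check of the arithmetic, using a hypothetical helper `nested_split_fractions`:

```python
def nested_split_fractions(test_size, validation_size):
    """Effective (train, val, test) fractions when validation is taken from the training pool."""
    train_pool = 1.0 - test_size
    val = train_pool * validation_size
    train = train_pool - val
    return train, val, test_size


train_frac, val_frac, test_frac = nested_split_fractions(0.10, 0.10)
print(f"train={train_frac:.2%}, val={val_frac:.2%}, test={test_frac:.2%}")
# train=81.00%, val=9.00%, test=10.00%
```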

### 4.8 Visualise Sample Images

In [None]:
# Display sample images from each class
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
fig.suptitle('Sample Images (32×32 Grayscale)', fontsize=14)

for class_idx in range(10):
    # Get first sample of this class
    sample_idx = np.where(y_train_raw == class_idx)[0][0]
    
    ax = axes[class_idx // 5, class_idx % 5]
    # Reshape flattened image back to 2D
    img = X_train[sample_idx].reshape(RESIZE)
    ax.imshow(img, cmap='gray')
    ax.axis('off')
    ax.set_title(CLASS_NAMES[class_idx], fontsize=9)

plt.tight_layout()
plt.show()

## 5. Developing a Model That Does Better Than a Baseline

**Baseline for 10-class balanced problem:** 10% accuracy (random guessing)
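A related sanity check concerns the loss rather than the accuracy: a model that assigns equal probability to all 10 classes has a categorical cross-entropy of ln(10) ≈ 2.30, so an untrained network's first-epoch loss is typically in that neighbourhood. The sketch below computes this for a single sample; the function is a simplified, illustrative version of the loss, not the Keras implementation.

```python
import math


def categorical_cross_entropy(true_index, predicted_probs):
    """Cross-entropy for one sample whose true class is true_index."""
    return -math.log(predicted_probs[true_index])


uniform_guess = [1.0 / 10] * 10   # equal probability for each of 10 classes
loss = categorical_cross_entropy(3, uniform_guess)
print(f"{loss:.4f} vs ln(10) = {math.log(10):.4f}")
```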

In [None]:
# ============================================================
# MODEL CONFIGURATION
# ============================================================
INPUT_DIMENSION = X_train.shape[1]  # 1024 features (32x32)
OUTPUT_CLASSES = y_train.shape[1]   # 10 classes

OPTIMIZER = 'adam'
LOSS_FUNC = 'categorical_crossentropy'
METRICS = ['accuracy']

# Training configuration
# Batch Size Selection:
# - Large datasets (>10,000 samples): Use 512 for efficient GPU utilisation
# - Small datasets (<10,000 samples): Use 32-64 for better gradient estimates
# Imagenette has ~13,000 samples → Use batch size 512
BATCH_SIZE = 512
EPOCHS_BASELINE = 100
EPOCHS_REGULARIZED = 150

print(f"Input Dimension: {INPUT_DIMENSION}")
print(f"Output Classes: {OUTPUT_CLASSES}")
print(f"Batch Size: {BATCH_SIZE}")

In [None]:
# ============================================================
# ESTABLISH BASELINE
# ============================================================
# For balanced 10-class classification, random guessing = 10%
baseline_accuracy = 1.0 / OUTPUT_CLASSES

print(f"Baseline Accuracy (random guessing): {baseline_accuracy:.2f}")

In [None]:
# ============================================================
# CLASS WEIGHTS (for balanced training)
# ============================================================
weights = compute_class_weight('balanced', classes=np.unique(y_train_raw), y=y_train_raw)
CLASS_WEIGHTS = dict(enumerate(weights))

print("Class Weights (sample):")
for class_idx in [0, 1, 2]:
    print(f"  {CLASS_NAMES[class_idx]}: {CLASS_WEIGHTS[class_idx]:.4f}")
print("  ...")
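For reference, scikit-learn's `'balanced'` mode computes each weight as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on made-up labels; `balanced_class_weights` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Illustrative labels: class 0 is twice as common as classes 1 and 2
toy_labels = [0] * 6 + [1] * 3 + [2] * 3
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

When every class has the same count, all weights equal 1.0, so on a nearly balanced dataset such as this one the weights have only a small effect.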

In [None]:
# ============================================================
# SINGLE LAYER PERCEPTRON (SLP) - Simplest possible model
# ============================================================
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(Dense(OUTPUT_CLASSES, activation='softmax', input_shape=(INPUT_DIMENSION,)))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# Train SLP
slp_history = slp_model.fit(
    X_train, y_train,
    class_weight=CLASS_WEIGHTS,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS_BASELINE,
    validation_data=(X_val, y_val),
    verbose=0
)

slp_val_acc = slp_model.evaluate(X_val, y_val, verbose=0)[1]
print(f"SLP Validation Accuracy: {slp_val_acc:.4f} (baseline: {baseline_accuracy:.2f})")

In [None]:
# ============================================================
# PLOT TRAINING HISTORY
# ============================================================
def plot_training_history(history, title='Training History'):
    """Plot training and validation loss/accuracy curves."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Loss
    axes[0].plot(history.history['loss'], 'b-', label='Training Loss')
    axes[0].plot(history.history['val_loss'], 'r-', label='Validation Loss')
    axes[0].set_title('Training and Validation Loss')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Accuracy
    axes[1].plot(history.history['accuracy'], 'b-', label='Training Accuracy')
    axes[1].plot(history.history['val_accuracy'], 'r-', label='Validation Accuracy')
    axes[1].set_title('Training and Validation Accuracy')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Accuracy')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()

plot_training_history(slp_history, 'Single Layer Perceptron')

## 6. Scaling Up: Developing a Model That Overfits

Adding a hidden layer to learn more complex features for distinguishing between 10 diverse object classes.

**No regularisation applied:** We intentionally train this model **without any regularisation** (no dropout, no L2, no early stopping) to observe overfitting behaviour.

---

### Architecture Design Decisions

**Why 64 neurons in the hidden layer?**

This is a practical starting point that balances capacity and efficiency:
- **Too few (e.g., 16):** May not have enough capacity to distinguish 10 diverse object classes
- **Too many (e.g., 512):** Increases overfitting risk and training time without proportional benefit
- **64 neurons:** A common choice that provides sufficient capacity for most classification tasks

**Why only 1 hidden layer instead of 2-3?**

Per the **Universal ML Workflow**, the goal of this step is to demonstrate that the model *can* overfit—proving it has sufficient capacity to capture the underlying patterns. Once overfitting is observed:

1. **Capacity is proven sufficient:** If the model overfits, it can learn the training data's complexity
2. **No need for more depth:** Adding layers would increase overfitting further without benefit
3. **Regularise, don't expand:** The next step (Section 7) is to *reduce* overfitting through regularisation

*"The right question is not 'How many layers?' but 'Can it overfit?' If yes, regularise. If no, add capacity."*
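To put a number on "capacity", it can help to count trainable parameters. A dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes the 1024-feature input and 10 classes used in this notebook; `dense_params` is an illustrative helper.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a dense layer: a weight matrix plus one bias per unit."""
    return n_inputs * n_units + n_units


INPUTS, HIDDEN, CLASSES = 1024, 64, 10

slp_params = dense_params(INPUTS, CLASSES)
mlp_params = dense_params(INPUTS, HIDDEN) + dense_params(HIDDEN, CLASSES)

print(f"SLP: {slp_params:,} parameters")   # SLP: 10,250 parameters
print(f"MLP: {mlp_params:,} parameters")   # MLP: 66,250 parameters
```

These totals should match the figures reported by `model.summary()` for the two architectures.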

In [None]:
# ============================================================
# MULTI-LAYER PERCEPTRON (MLP) - Standard architecture
# ============================================================
HIDDEN_NEURONS = 64

mlp_model = Sequential(name='Multi_Layer_Perceptron')
mlp_model.add(Dense(HIDDEN_NEURONS, activation='relu', input_shape=(INPUT_DIMENSION,)))
mlp_model.add(Dense(OUTPUT_CLASSES, activation='softmax'))
mlp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

mlp_model.summary()

In [None]:
# Train MLP
mlp_history = mlp_model.fit(
    X_train, y_train,
    class_weight=CLASS_WEIGHTS,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS_BASELINE,
    validation_data=(X_val, y_val),
    verbose=0
)

mlp_val_acc = mlp_model.evaluate(X_val, y_val, verbose=0)[1]
print(f"MLP Validation Accuracy: {mlp_val_acc:.4f} (baseline: {baseline_accuracy:.2f})")
print(f"Improvement over SLP: {(mlp_val_acc - slp_val_acc)*100:.2f} percentage points")

In [None]:
plot_training_history(mlp_history, 'Multi-Layer Perceptron (1 Hidden Layer)')

## 7. Regularising Your Model and Tuning Hyperparameters

Using **Hyperband** for efficient hyperparameter tuning with L2 regularisation and Dropout.

### Why Hyperband?

**Hyperband** is more efficient than grid search because it:
1. Starts training many configurations for a few epochs
2. Eliminates poor performers early
3. Allocates more resources to promising configurations
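The three steps above describe successive halving, the building block of Hyperband. The following is a minimal sketch of a single bracket under simplifying assumptions: `toy_score` is a made-up stand-in for "validation accuracy after training for this many epochs", and the real Keras Tuner implementation runs several brackets with its own bookkeeping.

```python
import math


def successive_halving(configs, score_fn, min_epochs=2, factor=3):
    """Keep the top 1/factor of configs each round, giving survivors a larger budget."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        # Score every surviving config at the current epoch budget
        ranked = sorted(survivors, key=lambda cfg: score_fn(cfg, epochs), reverse=True)
        # Eliminate poor performers, then raise the budget for the next round
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs *= factor
    return survivors[0]


def toy_score(config, epochs):
    """Hypothetical score: best near lr=1e-2, improving slightly with more epochs."""
    return -abs(math.log10(config["lr"]) + 2) + 0.01 * epochs


candidates = [{"lr": lr} for lr in
              (1e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 1.0)]
best = successive_halving(candidates, toy_score)
print(best)
```

With nine candidates and `factor=3`, the sketch scores nine configurations at 2 epochs, three at 6 epochs, and returns the single survivor.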

### Regularisation Strategy

| Technique | Purpose | How It Works |
|-----------|---------|-------------|
| **L2 Regularisation** | Prevent large weights | Adds penalty term to loss |
| **Dropout** | Prevent co-adaptation | Randomly zeros neurons during training |
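Both rows of the table can be written out in a few lines. The sketch below is a plain-Python illustration, not the Keras implementation: it assumes the L2 penalty is `l2_reg * sum(w^2)` (as for `regularizers.l2`) and that dropout rescales surviving activations by `1 / (1 - rate)` during training, as the Keras `Dropout` layer does.

```python
import random


def l2_penalty(weights, l2_reg):
    """Penalty term added to the loss: l2_reg times the sum of squared weights."""
    return l2_reg * sum(w * w for w in weights)


def inverted_dropout(activations, rate, rng):
    """Training-time dropout: zero each unit with probability `rate`, rescale the rest."""
    if rate <= 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


# Larger weights are penalised quadratically
print(l2_penalty([0.5, -1.0, 2.0], l2_reg=0.01))   # approximately 0.0525

# With rate=0.5, each unit is either dropped (0.0) or doubled (2.0)
rng = random.Random(204)
dropped = inverted_dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng)
print(dropped)
```

The rescaling keeps the expected activation unchanged, which is why no adjustment is needed at inference time when dropout is switched off.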

In [None]:
# ============================================================
# HYPERBAND MODEL BUILDER
# ============================================================
def build_model_hyperband(hp):
    """
    Build Imagenette model with FROZEN architecture (1 hidden layer, 64 neurons).
    Tunes: L2 regularisation, Dropout rate, Learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))
    
    # Hyperparameters to tune
    l2_reg = hp.Float('l2_reg', min_value=1e-5, max_value=1e-2, sampling='log')
    dropout_rate = hp.Float('dropout_rate', min_value=0.0, max_value=0.5, step=0.1)
    learning_rate = hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')
    
    # Hidden layer with L2 regularisation
    model.add(layers.Dense(
        HIDDEN_NEURONS,
        activation='relu',
        kernel_regularizer=regularizers.l2(l2_reg)
    ))
    model.add(layers.Dropout(dropout_rate))
    
    # Output layer
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))
    
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    
    return model

In [None]:
# ============================================================
# CONFIGURE AND RUN HYPERBAND TUNER
# ============================================================
tuner = kt.Hyperband(
    build_model_hyperband,
    objective='val_accuracy',
    max_epochs=50,
    factor=3,
    directory='imagenette_hyperband',
    project_name='imagenette_tuning',
    overwrite=True
)

# Run search
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=BATCH_SIZE,
    class_weight=CLASS_WEIGHTS,
    verbose=0
)

In [None]:
# ============================================================
# GET BEST HYPERPARAMETERS AND BEST MODEL DIRECTLY
# ============================================================
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]

print("Best Hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout_rate'):.2f}")
print(f"  Learning Rate: {best_hp.get('learning_rate'):.6f}")

# Get the best model directly - already trained at optimal epochs for these hyperparameters
best_model = tuner.get_best_models(num_models=1)[0]
best_model.summary()

### 7.2 Using the Best Model Directly

Rather than rebuilding and retraining from scratch, we retrieve the best model directly from the tuner using `tuner.get_best_models()`. This approach avoids the **epoch mismatch problem**:

---

#### The Epoch Mismatch Problem

Hyperband uses **successive halving** - most configurations train for few epochs, only top performers get more:

```
Illustrative bracket with max_epochs=50, factor=3 (config counts are schematic; Keras Tuner computes its own per bracket):
Round 1: 81 configs × ~2 epochs  → Keep top 27
Round 2: 27 configs × ~6 epochs  → Keep top 9
Round 3:  9 configs × ~17 epochs → Keep top 3
Round 4:  3 configs × ~50 epochs → Select best
```

The best hyperparameters were found optimal at a **specific epoch count** (e.g., 50 epochs). If we rebuild and retrain for a different number of epochs (e.g., 150), the hyperparameters may no longer be optimal - **this is the epoch mismatch problem**.
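The per-round epoch budgets quoted above (roughly 2, 6, 17 and 50) follow from dividing `max_epochs` by successive powers of `factor`. The sketch below assumes each budget is rounded up to a whole epoch; `round_epochs` is an illustrative helper rather than a Keras Tuner function.

```python
import math


def round_epochs(max_epochs, factor, n_rounds):
    """Epoch budget per round: max_epochs / factor^(rounds remaining), rounded up."""
    return [math.ceil(max_epochs / factor ** (n_rounds - 1 - r))
            for r in range(n_rounds)]


budgets = round_epochs(max_epochs=50, factor=3, n_rounds=4)
print(budgets)   # [2, 6, 17, 50]
```

Only configurations that survive every round are ever trained for the full 50 epochs, which is the budget their hyperparameters were ultimately judged at.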

---

#### Clean Solution: Use Best Model Directly

Instead of rebuilding, we use `tuner.get_best_models(num_models=1)[0]` to retrieve the model that **already achieved the best validation performance** during tuning. This model:

- Has weights trained at the optimal epoch count for its hyperparameters
- Achieved the best validation accuracy during the Hyperband search
- Avoids any mismatch between tuning epochs and final epochs

| Approach | Epochs Match? | Issue |
|----------|---------------|-------|
| ~~Rebuild + retrain for 150 epochs~~ | ✗ No | Hyperparameters may be suboptimal at 150 epochs |
| **Use best model directly** | ✓ Yes | Model already trained at optimal epochs |

> *"Use the model that actually achieved the best performance, not a rebuilt version that might perform differently."*

In [None]:
# The best model is already trained - evaluate on validation set
best_val_acc = best_model.evaluate(X_val, y_val, verbose=0)[1]
print(f"Best Model Validation Accuracy: {best_val_acc:.4f}")
print(f"Improvement over MLP: {(best_val_acc - mlp_val_acc)*100:.2f} percentage points")

In [None]:
# Note: Training history plot is not available when using get_best_models()
# The best model was retrieved directly from the tuner, which doesn't preserve
# the training history. To visualise training curves, you would need to either:
# 1. Use TensorBoard callbacks during tuning, or
# 2. Retrain the model (but this risks the epoch mismatch problem)
#
# For this notebook, we skip the training history plot since we're using
# the best model directly to ensure optimal performance.
print("Training history not available when using get_best_models() directly.")

## 8. Final Evaluation

Evaluate the best model on the held-out test set.

In [None]:
# ============================================================
# TOP-K ACCURACY FUNCTION
# ============================================================
def top_k_accuracy(y_true, y_pred_proba, k):
    """
    Calculate Top-K accuracy: was the true class in the model's top K predictions?
    
    Args:
        y_true: True class labels (integer indices)
        y_pred_proba: Predicted probabilities (N x num_classes)
        k: Number of top predictions to consider
    
    Returns:
        Top-K accuracy score
    """
    top_k_preds = np.argsort(y_pred_proba, axis=1)[:, -k:]
    correct = sum(y_true[i] in top_k_preds[i] for i in range(len(y_true)))
    return correct / len(y_true)
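A small hand-checkable example may help when reading Top-K numbers. The sketch below re-implements the idea in plain Python (no NumPy) on three made-up predictions; `top_k_hit` and `toy_top_k_accuracy` are illustrative names, separate from the notebook's `top_k_accuracy`.

```python
def top_k_hit(true_label, probs, k):
    """True if the true label is among the k highest-probability classes."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return true_label in ranked[:k]


def toy_top_k_accuracy(labels, all_probs, k):
    """Fraction of samples whose true label appears in the top k predictions."""
    hits = sum(top_k_hit(y, p, k) for y, p in zip(labels, all_probs))
    return hits / len(labels)


# Three made-up samples over three classes
toy_probs = [
    [0.6, 0.3, 0.1],   # true class 0: correct at k=1
    [0.2, 0.5, 0.3],   # true class 2: second guess, correct at k=2
    [0.1, 0.2, 0.7],   # true class 1: second guess, correct at k=2
]
toy_labels = [0, 2, 1]

print(toy_top_k_accuracy(toy_labels, toy_probs, k=1))   # one of three correct
print(toy_top_k_accuracy(toy_labels, toy_probs, k=2))   # 1.0
```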

In [None]:
# ============================================================
# TEST SET EVALUATION
# ============================================================
# Get predictions
y_pred_proba = best_model.predict(X_test, verbose=0)
y_pred = y_pred_proba.argmax(axis=1)

# Calculate metrics
test_accuracy = accuracy_score(y_test_raw, y_pred)
top_2_accuracy = top_k_accuracy(y_test_raw, y_pred_proba, k=2)
top_3_accuracy = top_k_accuracy(y_test_raw, y_pred_proba, k=3)

print("="*50)
print("FINAL TEST SET RESULTS")
print("="*50)
print(f"Top-1 Accuracy: {test_accuracy:.4f} (baseline: {baseline_accuracy:.2f})")
print(f"Top-2 Accuracy: {top_2_accuracy:.4f}")
print(f"Top-3 Accuracy: {top_3_accuracy:.4f}")
print("="*50)

In [None]:
# ============================================================
# CONFUSION MATRIX
# ============================================================
cm = confusion_matrix(y_test_raw, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=CLASS_NAMES)

fig, ax = plt.subplots(figsize=(12, 10))
disp.plot(ax=ax, cmap='Blues', xticks_rotation=45)
plt.title('Confusion Matrix - Test Set')
plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# PER-CLASS ACCURACY
# ============================================================
print("\nPer-Class Accuracy:")
print("-"*40)
for class_idx in range(OUTPUT_CLASSES):
    class_mask = y_test_raw == class_idx
    class_correct = (y_pred[class_mask] == class_idx).sum()
    class_total = class_mask.sum()
    class_acc = class_correct / class_total
    print(f"{CLASS_NAMES[class_idx]:<20}: {class_acc:.2f} ({class_correct}/{class_total})")
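The per-class figures printed above are the same quantity as per-class recall, and can be read straight off the confusion matrix: the diagonal entry divided by its row total (scikit-learn puts true labels on rows and predictions on columns). A sketch on a made-up 3×3 matrix, with `per_class_recall` as an illustrative helper:

```python
def per_class_recall(cm):
    """Diagonal entry over row total for each class (rows = true labels)."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(cm)]


# Made-up confusion matrix for three classes, 10 test samples each
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 5, 5],
]
recalls = per_class_recall(toy_cm)
print(recalls)   # [0.8, 0.6, 0.5]
```

Off-diagonal entries in a row show where that class's errors go; here the third class is most often confused with the second.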

---

## 9. Results Summary

The following dynamically generated table compares all models trained in this notebook.

In [None]:
# =============================================================================
# RESULTS SUMMARY
# =============================================================================

# Create results DataFrame
results = pd.DataFrame({
    'Model': ['Naive Baseline', 'SLP (No Hidden)', 'MLP (No Regularisation)', 'MLP (Dropout + L2)', 'MLP (Dropout + L2) - Test'],
    'Accuracy': [baseline_accuracy, slp_val_acc, mlp_val_acc, best_val_acc, test_accuracy],
    'Top-3 Acc': [3 / OUTPUT_CLASSES, np.nan, np.nan, np.nan, top_3_accuracy],  # Top-3 measured only on the final test; NaN = not measured
    'Dataset': ['N/A', 'Validation', 'Validation', 'Validation', 'Test']
})

print("=" * 70)
print("MODEL COMPARISON - RESULTS SUMMARY")
print("=" * 70)
print(f"Primary Metric: ACCURACY ({OUTPUT_CLASSES}-class balanced dataset)")
print("=" * 70)
print(results.to_string(index=False, float_format='{:.4f}'.format))
print("=" * 70)
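print("\nKey Observations:")
print(f"  - Random-guess baseline: {baseline_accuracy:.2%}; lowest validation accuracy among trained models: {min(slp_val_acc, mlp_val_acc, best_val_acc):.2%}")
print(f"  - Validation accuracy, unregularised → regularised MLP: {mlp_val_acc:.4f} → {best_val_acc:.4f} ({(best_val_acc - mlp_val_acc)*100:+.2f} percentage points)")
print(f"  - Final test accuracy: {test_accuracy:.4f}, Top-3: {top_3_accuracy:.4f}")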

---

## 10. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | ~13,000 samples | **Hold-Out** | Kohavi (1995) |
| **Accuracy vs F1-Score** | > 3:1 imbalance | ~1:1 ratio | **Accuracy** | He and Garcia (2009) |
| **Batch Size** | > 10,000 samples | ~13,000 samples | **512** | GPU efficiency |

### Lessons Learned

1. **TFDS Simplifies Data Loading:** `tfds.load()` handles download, caching, and parsing automatically, making it easy to access standard benchmarks.

2. **Top-K Accuracy for Multi-Class:** When the top prediction is wrong, Top-K accuracy shows whether the correct class was still among the model's K highest-probability guesses, which separates near misses from outright failures.

3. **Image Preprocessing Trade-offs:** Resize → Grayscale → Flatten sacrifices spatial information for simplicity. CNNs (Chapter 8) would preserve this structure.

4. **Real-World Images are Harder:** Imagenette contains real photographs with high variation in lighting, backgrounds, and poses - more challenging than the small, centred product images of Fashion MNIST.

5. **Dense Network Limitations:** Dense networks treat each pixel as a separate input feature with no built-in notion of which pixels are neighbours, so any spatial relationship has to be learned from scratch. This limits performance on image tasks.

6. **Regularisation Prevents Overfitting:** L2 regularisation + Dropout control overfitting without requiring early stopping (Ch. 7).

7. **Batch Size Selection:** With ~13,000 samples (above 10,000 threshold), batch size 512 provides efficient GPU utilisation while maintaining good gradient estimates.
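The dense-network limitation in point 5 can be illustrated with a small experiment: if the same fixed shuffle is applied to an image's pixels and to a dense unit's weights, the unit's output is unchanged. In other words, nothing in the layer depends on which pixels were neighbours. The sketch below uses random made-up values and an illustrative `dense_unit` helper.

```python
import random


def dense_unit(pixels, weights, bias=0.0):
    """Pre-activation output of one dense unit: weighted sum of inputs plus bias."""
    return sum(p * w for p, w in zip(pixels, weights)) + bias


rng = random.Random(204)
pixels = [rng.random() for _ in range(16)]          # a flattened 4x4 "image"
weights = [rng.uniform(-1, 1) for _ in range(16)]   # one unit's weights

# Apply the same fixed shuffle to pixel positions and to their weights
order = list(range(16))
rng.shuffle(order)
shuffled_pixels = [pixels[i] for i in order]
shuffled_weights = [weights[i] for i in order]

original = dense_unit(pixels, weights)
shuffled = dense_unit(shuffled_pixels, shuffled_weights)
print(abs(original - shuffled) < 1e-9)   # True: pixel layout carries no meaning for the unit
```

A convolutional layer would not pass this test, because its filters are applied to local neighbourhoods and therefore depend on the spatial arrangement.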

### Next Steps for Better Performance

- **Use CNNs** (Chapter 8) - preserves spatial structure, learns hierarchical features
- **Higher resolution** - 32×32 still loses significant detail from 160×160 originals
- **Transfer learning** - use pre-trained models (ResNet, EfficientNet)
- **Data augmentation** - artificially increase training data variety

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Howard, J. (2019) 'Imagenette: A smaller subset of 10 easily classified classes from Imagenet', *fast.ai*.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

---

## Appendix: Modular Helper Functions

For cleaner code organisation, you can wrap the model building and training patterns into reusable functions.

In [None]:
# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================

def build_image_classifier(input_dim, num_classes, hidden_units=None, dropout=0.0, l2_reg=0.0,
                           optimizer='adam', learning_rate=None, name=None):
    """
    Build a multi-class image classification neural network.
    
    Parameters:
    -----------
    input_dim : int
        Number of input features (flattened image size)
    num_classes : int
        Number of output classes
    hidden_units : list of int, optional
        Neurons per hidden layer, e.g., [64] or [128, 64]
    dropout : float
        Dropout rate (0.0 to 0.5)
    l2_reg : float
        L2 regularisation strength
    learning_rate : float, optional
        Custom learning rate
    name : str, optional
        Model name
        
    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    """
    model = Sequential(name=name)
    model.add(layers.Input(shape=(input_dim,)))
    
    hidden_units = hidden_units or []
    kernel_reg = regularizers.l2(l2_reg) if l2_reg > 0 else None
    
    for units in hidden_units:
        model.add(Dense(units, activation='relu', kernel_regularizer=kernel_reg))
        if dropout > 0:
            model.add(Dropout(dropout))
    
    model.add(Dense(num_classes, activation='softmax'))
    
    if learning_rate is not None:
        opt = keras.optimizers.Adam(learning_rate=learning_rate)
    else:
        opt = optimizer
    
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model


def train_with_class_weights(model, X_train, y_train, X_val, y_val,
                              batch_size=512, epochs=100, verbose=0):
    """Train model with automatic class weight computation."""
    y_train_raw = y_train.argmax(axis=1)
    weights = compute_class_weight('balanced', classes=np.unique(y_train_raw), y=y_train_raw)
    class_weights = dict(enumerate(weights))
    
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        batch_size=batch_size, 
        epochs=epochs,
        class_weight=class_weights,
        verbose=verbose
    )


def evaluate_multiclass_image(model, X, y_true_onehot, y_true_raw=None):
    """
    Evaluate multi-class image classification model with Top-K accuracy.
    
    Returns:
    --------
    dict : Dictionary containing accuracy and top-k accuracy
    """
    y_pred_proba = model.predict(X, verbose=0)
    y_pred = y_pred_proba.argmax(axis=1)
    
    if y_true_raw is None:
        y_true_raw = y_true_onehot.argmax(axis=1)
    
    # Top-K accuracy
    def top_k_acc(y_true, y_pred_proba, k):
        top_k_preds = np.argsort(y_pred_proba, axis=1)[:, -k:]
        return sum(y_true[i] in top_k_preds[i] for i in range(len(y_true))) / len(y_true)
    
    metrics = {
        'accuracy': accuracy_score(y_true_raw, y_pred),
        'top_2_accuracy': top_k_acc(y_true_raw, y_pred_proba, 2),
        'top_3_accuracy': top_k_acc(y_true_raw, y_pred_proba, 3),
    }
    
    return metrics


# =============================================================================
# USAGE EXAMPLES
# =============================================================================
# 
# # Build models
# slp = build_image_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, name='SLP')
# mlp = build_image_classifier(INPUT_DIMENSION, OUTPUT_CLASSES, 
#                              hidden_units=[64], name='MLP')
# mlp_reg = build_image_classifier(INPUT_DIMENSION, OUTPUT_CLASSES,
#                                  hidden_units=[64], dropout=0.3, l2_reg=0.001,
#                                  learning_rate=0.001, name='MLP_Regularized')
# 
# # Train with class weights
# history = train_with_class_weights(mlp, X_train, y_train, X_val, y_val)
# 
# # Evaluate with Top-K accuracy
# metrics = evaluate_multiclass_image(mlp, X_val, y_val)
# print(f"Accuracy: {metrics['accuracy']:.4f}, Top-3: {metrics['top_3_accuracy']:.4f}")