<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Movie%20Review/Movie%20Review%20-%20NLP%20Binary%20Classification%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Review - NLP Binary Classification Example

This notebook demonstrates the **Universal ML Workflow** applied to a binary NLP classification problem using movie review sentiment data.

## Learning Objectives

By the end of this notebook, you will be able to:
- Apply the Universal ML Workflow to a **binary** text classification problem
- Understand key differences between **binary** and **multi-class** classification
- Convert text data to numerical features using **TF-IDF vectorisation**
- Build and train deep neural networks for **binary classification**
- Use **Hyperband** for efficient hyperparameter tuning
- Apply **Dropout + L2 regularisation** to prevent overfitting
- Evaluate model performance using appropriate metrics (Accuracy, Precision, Recall, AUC)

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [NLTK Movie Review Dataset](https://www.kaggle.com/datasets/nltkdata/movie-review) |
| **Problem Type** | Binary Classification (2 classes) |
| **Classes** | Positive, Negative |
| **Data Balance** | Nearly Balanced (~51% Positive, ~49% Negative) |
| **Data Type** | Unstructured Text (Movie Reviews) |
| **Input Features** | TF-IDF Vectors (up to 5,000 features, unigrams + bigrams) |
| **Output** | Sentiment: Positive (1) or Negative (0) |

---

## Technique Scope

This notebook uses only techniques from **Chapters 1–4** of *Deep Learning with Python* (Chollet, 2021). This means:

| Technique | Status | Rationale |
|-----------|--------|----------|
| **Dense layers (MLP/DNN)** | ✓ Used | Core building block (Ch. 3-4) |
| **Dropout** | ✓ Used | Regularisation technique (Ch. 4) |
| **L2 regularisation** | ✓ Used | Weight penalty (Ch. 4) |
| **Early stopping** | ✗ Not used | Introduced in Ch. 7 |
| **CNN** | ✗ Not used | Introduced in Ch. 8 |
| **RNN/LSTM** | ✗ Not used | Introduced in Ch. 10 |

We aim to show that **Dropout + L2 regularisation** can keep overfitting under control without an early-stopping callback.

---

## Binary vs. Multi-Class Classification

| Aspect | Binary (This Notebook) | Multi-Class (Twitter Examples) |
|--------|------------------------|-------------------------------|
| **Output neurons** | 1 neuron | N neurons (one per class) |
| **Output activation** | Sigmoid (outputs 0-1 probability) | Softmax (outputs N probabilities summing to 1) |
| **Loss function** | Binary cross-entropy | Categorical cross-entropy |
| **Label format** | Single value: 0 or 1 | One-hot vector: [1,0,0], [0,1,0], etc. |
| **Prediction** | Threshold at 0.5 | argmax of probabilities |
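
The link between the two output layers can be checked numerically. The sketch below uses plain Python and two helper names chosen for illustration (`sigmoid`, `softmax`). It shows that a two-class softmax over the logits `[0, z]` gives class 1 the same probability as a sigmoid applied to `z`, which is why a single sigmoid neuron is enough for a binary problem.

```python
import math

def sigmoid(z):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Softmax over a list of logits; the outputs sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

z = 1.3                                   # arbitrary example logit
p_sigmoid = sigmoid(z)                    # binary head: one neuron
p_softmax = softmax([0.0, z])[1]          # two-class softmax, probability of class 1
predicted_class = int(p_sigmoid > 0.5)    # binary prediction: threshold at 0.5

print(f"sigmoid: {p_sigmoid:.6f}, softmax: {p_softmax:.6f}, class: {predicted_class}")
```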

---

## 1. Defining the Problem and Assembling a Dataset

The first step in any machine learning project is to clearly define the problem and understand the data.

**Problem Statement:** Given a movie review, predict whether the sentiment is positive or negative.

**Why this matters:** Sentiment analysis of movie reviews helps:
- Studios understand audience reception
- Streaming platforms recommend content
- Critics aggregate opinions at scale

**Key difference from multi-class sentiment:** This is a simpler problem with only two outcomes. Binary classification is often more robust and easier to interpret than multi-class alternatives.

**Data Source:** This dataset contains movie reviews labelled as positive or negative from the NLTK corpus.

## 2. Choosing a Measure of Success

### Metric Selection Based on Class Imbalance

The choice of evaluation metric depends on **class imbalance**. We use practical guidelines derived from the literature:

| Imbalance Ratio | Classification | Primary Metric | Rationale |
|-----------------|----------------|----------------|----------|
| ≤ 1.5:1 | Balanced | **Accuracy** | Classes roughly equal |
| 1.5:1 – 3:1 | Mild Imbalance | **Accuracy** | Majority class < 75% |
| > 3:1 | Moderate/Severe | **F1-Score** | Accuracy becomes misleading |

**For this dataset:** With ~51:49 class distribution, the imbalance ratio is ~1.04:1 (essentially balanced). We use **Accuracy** as the primary metric.
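
To see why the threshold matters, the following sketch compares a trivial "always predict the majority class" strategy on a 9:1 toy label set and on a 51:49 label set. The label lists are synthetic and only mirror the ratios discussed above; `accuracy_and_f1` is a hypothetical helper, not part of the notebook.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# 9:1 imbalance: the minority class (label 1) is never predicted
y_imbalanced = [0] * 90 + [1] * 10
acc_imbalanced, f1_imbalanced = accuracy_and_f1(y_imbalanced, [0] * 100)

# ~51:49 split, as in this dataset: always predict the majority class (label 1)
y_balanced = [1] * 51 + [0] * 49
acc_balanced, _ = accuracy_and_f1(y_balanced, [1] * 100)

print(f"9:1   -> accuracy {acc_imbalanced:.2f}, minority F1 {f1_imbalanced:.2f}")
print(f"51:49 -> accuracy {acc_balanced:.2f}")
```

On the imbalanced toy set the trivial strategy looks strong on accuracy while its F1 for the minority class is zero. On the balanced set the same strategy is visibly close to chance, so accuracy stays informative.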

### References

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

*Note: Even with balanced data, we still track Precision, Recall, and AUC for a complete picture.*
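
AUC is the least intuitive of these metrics, so here is a minimal sketch of one way to read it: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The helper `pairwise_auc` and the four toy scores are assumptions for illustration; the notebook itself relies on the Keras and scikit-learn implementations.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of samples, so this
    is only meant for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"AUC: {toy_auc:.2f}")
```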

## 3. Deciding on an Evaluation Protocol

### Hold-Out vs K-Fold Cross-Validation

The choice between hold-out and K-fold depends on **dataset size** and **computational cost**:

| Dataset Size | Recommended Method | Rationale |
|--------------|-------------------|----------|
| < 1,000 | K-Fold (K=5 or 10) | High variance with small hold-out sets |
| 1,000 – 10,000 | K-Fold or Hold-Out | Either works; K-fold more robust |
| > 10,000 | Hold-Out | Sufficient data; K-fold computationally expensive |
| Deep Learning | Hold-Out (preferred) | Training cost prohibitive for K iterations |
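
The cost argument in the table comes from how K-fold works: every sample is used for validation exactly once, so the model has to be trained K times. The sketch below generates the fold indices in plain Python; `k_fold_indices` is a hypothetical helper written for this illustration (scikit-learn's `KFold` is the usual choice in practice).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for K contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train on {len(train_idx)} samples, validate on {val_idx}")
```

With hold-out, only one of these train/validate rounds is run, which is why it is preferred when a single training run is already expensive.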

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

### Data Split Strategy (This Notebook)

```
Original Data (~65,000 samples) → Hold-Out Selected
├── Test Set (10%) - Final evaluation only
└── Training Pool (90%)
    ├── Training Set (81%) - Model training
    └── Validation Set (9%) - Hyperparameter tuning
```

**Important:** We use the `stratify` parameter to maintain class proportions in all splits.
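
The percentages in the diagram follow from applying two 10% splits in sequence. A small arithmetic sketch (the helper name `split_fractions` is hypothetical):

```python
def split_fractions(test_size, validation_size):
    """Fractions of the ORIGINAL data that end up in each split.

    The validation split is taken from the training pool, i.e. from the
    data that remains after the test set has been removed.
    """
    pool = 1.0 - test_size
    validation = pool * validation_size
    train = pool - validation
    return {"train": train, "validation": validation, "test": test_size}

fractions = split_fractions(test_size=0.1, validation_size=0.1)
for name, fraction in fractions.items():
    print(f"{name:>10}: {fraction:.0%}")
```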

## 4. Preparing Your Data

### 4.1 Import Libraries and Set Random Seed

We set random seeds for reproducibility: fixing the seeds makes repeated runs start from the same data splits and weight initialisation, although some GPU operations can still introduce small run-to-run differences.

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import TfidfVectorizer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Load and Explore the Dataset

Let's download the movie review data from Google Drive and examine its structure.

In [None]:
# Load data directly from Google Drive
GDRIVE_FILE_ID = '17zquac-Q4viIEs1hSwHmlBIFYXncn9IB'
DATA_URL = f'https://drive.google.com/uc?id={GDRIVE_FILE_ID}&export=download'

reviews = pd.read_csv(DATA_URL)
reviews = reviews[['text', 'tag']]

reviews.head()

### 4.3 Split Data into Train and Test Sets

We reserve 10% of the data for final testing. The `stratify` parameter ensures that each split maintains the same class proportions as the original dataset.
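
As a rough picture of what stratification does, the sketch below splits each class separately and then merges the results. It is a simplification written for illustration, not scikit-learn's implementation, and the label names `"pos"` and `"neg"` are assumed.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split indices so that each class contributes ~test_size of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                          # shuffle within the class
        n_test = round(len(idxs) * test_size)      # per-class test quota
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return sorted(train_idx), sorted(test_idx)

labels = ["pos"] * 51 + ["neg"] * 49               # toy labels, ~51:49 as in the dataset
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=204)
test_pos = sum(labels[i] == "pos" for i in test_idx)
print(f"Test set: {len(test_idx)} samples, {test_pos} positive")
```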

In [None]:
TEST_SIZE = 0.1

(text_train, text_test, 
 tag_train, tag_test) = train_test_split(reviews['text'], reviews['tag'], 
                                         test_size=TEST_SIZE, stratify=reviews['tag'],
                                         shuffle=True, random_state=SEED)

### 4.4 Text Vectorisation with TF-IDF

Neural networks require numerical input, but reviews are text. We use **TF-IDF (Term Frequency-Inverse Document Frequency)** to convert text to numbers.

**Our settings:**
- `max_features=5000`: Keep only the 5,000 most frequent terms (ranked by term frequency across the training corpus)
- `ngram_range=(1, 2)`: Include both single words and word pairs (bigrams)

These settings are consistent with our other NLP notebooks, demonstrating that the same preprocessing pipeline works across different text classification problems.
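
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what `TfidfVectorizer` uses with its default settings. Tokenisation is simplified to whitespace splitting, and the three toy documents and helper names are made up for illustration.

```python
import math
from collections import Counter

def ngrams(text, max_n=2):
    """Return all n-grams up to max_n words from whitespace tokens."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf_vectors(docs, max_n=2):
    """Return (list of {term: weight} dicts, {term: idf}) for the documents."""
    term_counts = [Counter(ngrams(doc, max_n)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for counts in term_counts:
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})  # L2-normalise
    return vectors, idf

docs = ["a great film", "a dull film", "great acting"]
vectors, idf = tfidf_vectors(docs)
print(f"idf('film') = {idf['film']:.3f}, idf('dull') = {idf['dull']:.3f}")
print(sorted(vectors[0]))
```

Terms that occur in many documents (`film`) receive a lower IDF than rare ones (`dull`), so rare, discriminative terms carry more weight in the final vector.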

In [None]:
MAX_FEATURES = 5000
NGRAMS = 2

tfidf = TfidfVectorizer(ngram_range=(1, NGRAMS), max_features=MAX_FEATURES)
tfidf.fit(text_train)

X_train, X_test = tfidf.transform(text_train).toarray(), tfidf.transform(text_test).toarray()

### 4.5 Encode Labels

For **binary classification**, we encode labels as single values (0 or 1), not one-hot vectors:
- Negative → 0
- Positive → 1

This is simpler than multi-class encoding and works with sigmoid output activation.

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(reviews['tag'])

# For binary classification: single 0/1 labels (not one-hot)
y_train = label_encoder.transform(tag_train)
y_test = label_encoder.transform(tag_test)

print(f"Label mapping: {dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))}")

## 5. Developing a Model That Does Better Than a Baseline

Before building complex models, we need to establish **baseline performance**. This gives us a reference point to know if our model is actually learning something useful.

### 5.1 Examine Class Distribution

Let's look at how the sentiment classes are distributed:

In [None]:
counts = reviews.groupby(['tag']).count()
counts.reset_index(inplace=True)

counts

In [None]:
# =============================================================================
# DATA-DRIVEN ANALYSIS: Dataset Size & Imbalance
# =============================================================================

# Dataset size analysis (for hold-out vs K-fold decision)
n_samples = len(reviews)
HOLDOUT_THRESHOLD = 10000  # Use hold-out if samples > 10,000 (Kohavi, 1995; Chollet, 2021)

# Imbalance analysis (for metric selection)
majority_class = counts['text'].max()
minority_class = counts['text'].min()
imbalance_ratio = majority_class / minority_class
IMBALANCE_THRESHOLD = 3.0  # Use F1-Score if ratio > 3.0 (He & Garcia, 2009)

# Determine evaluation strategy and metric
use_holdout = n_samples > HOLDOUT_THRESHOLD
use_f1 = imbalance_ratio > IMBALANCE_THRESHOLD

print("=" * 60)
print("DATA-DRIVEN CONFIGURATION")
print("=" * 60)
print(f"\n1. DATASET SIZE: {n_samples:,} samples")
print(f"   Threshold: {HOLDOUT_THRESHOLD:,} samples (Kohavi, 1995)")
print(f"   Decision: {'Hold-Out' if use_holdout else 'K-Fold Cross-Validation'}")

print(f"\n2. CLASS IMBALANCE: {imbalance_ratio:.2f}:1 ratio")
print(f"   Threshold: {IMBALANCE_THRESHOLD:.1f}:1 (He & Garcia, 2009)")
print(f"   Decision: {'F1-Score (imbalanced)' if use_f1 else 'Accuracy (balanced)'}")

print("\n" + "=" * 60)
PRIMARY_METRIC = 'f1' if use_f1 else 'accuracy'
print(f"PRIMARY METRIC: {PRIMARY_METRIC.upper()}")
print("=" * 60)

### 5.2 Calculate Baseline Metrics

**Naive Baseline (Majority Class):** If we always predict the most common class (positive), we get ~51% accuracy. This is our accuracy baseline.

**Balanced Accuracy Baseline:** For binary classification, a random classifier achieves 50% balanced accuracy.
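
Both baselines can be reproduced on a toy label set with the same ~51:49 ratio. The sketch shows that a majority-class predictor scores 51% plain accuracy but only 50% balanced accuracy, because balanced accuracy averages the recall of each class. `balanced_accuracy` is a hypothetical helper that mirrors what `balanced_accuracy_score` computes.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

toy_true = [1] * 51 + [0] * 49      # synthetic labels, ~51% positive
toy_pred = [1] * 100                # always predict the majority class

plain_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_balanced = balanced_accuracy(toy_true, toy_pred)
print(f"Accuracy: {plain_accuracy:.2f}, balanced accuracy: {toy_balanced:.2f}")
```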

In [None]:
# Find the majority class
majority_class_name = counts.loc[counts['text'].idxmax(), 'tag']
baseline = counts['text'].max() / counts['text'].sum()

# Balanced accuracy baseline (random classifier = 50% for binary)
balanced_accuracy_baseline = 0.5

print(f"Majority class: {majority_class_name}")
print(f"Baseline accuracy (majority class): {baseline:.2f}")
print(f"Balanced accuracy baseline (random): {balanced_accuracy_baseline:.2f}")

### 5.3 Create Validation Set

We split off a portion of the training data for validation. This will be used to:
- Evaluate model performance during hyperparameter tuning
- Compare models without touching the test set

In [None]:
VALIDATION_SIZE = 0.1

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                                  test_size=VALIDATION_SIZE, stratify=y_train,
                                                  shuffle=True, random_state=SEED)

### 5.4 Configure Training Parameters

**Key training settings for binary classification:**
- **Optimiser:** Adam - adaptive learning rate optimiser
- **Loss:** Binary cross-entropy - standard loss for binary classification
- **Output:** 1 neuron with sigmoid activation (outputs probability 0-1)
- **Prediction:** Apply threshold at 0.5 to convert probability to class
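
For reference, binary cross-entropy on a handful of predictions can be computed by hand. The sketch below is plain Python; the clipping constant `eps` and the example probabilities are assumptions for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

example_labels = [1, 0, 1]
example_probs = [0.9, 0.2, 0.4]          # sigmoid outputs of a hypothetical model

example_loss = binary_cross_entropy(example_labels, example_probs)
example_preds = [int(p > 0.5) for p in example_probs]   # threshold at 0.5

print(f"Loss: {example_loss:.4f}, predictions: {example_preds}")
```

A confident correct prediction (0.9 for a positive) adds little to the loss, while the third sample (0.4 for a positive) is both penalised more heavily and misclassified at the 0.5 threshold.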

In [None]:
INPUT_DIMENSION = X_train.shape[1]
OUTPUT_DIMENSION = 1  # Binary classification: single output neuron

OPTIMIZER = 'adam'
LOSS_FUNC = 'binary_crossentropy'  # Binary classification loss

# Training metrics
METRICS = ['accuracy', 
           tf.keras.metrics.Precision(name='precision'), 
           tf.keras.metrics.Recall(name='recall'),
           tf.keras.metrics.AUC(name='auc')]

In [None]:
# Single-Layer Perceptron (no hidden layers)
# For binary: 1 output neuron with sigmoid activation
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
slp_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# =============================================================================
# TRAINING CONFIGURATION
# =============================================================================

BATCH_SIZE = 512

# Epoch budgets for the different training phases:
#
# EPOCHS_BASELINE (100): SLP and unregularised DNN (Sections 5-6)
# EPOCHS_REGULARIZED (150): reference budget for a DNN with Dropout + L2.
#   Regularisation tends to slow learning, so regularised models usually need
#   more epochs. Section 7 does not use this value: it retrains for the epoch
#   count found by Hyperband (best_epochs).

EPOCHS_BASELINE = 100      # SLP and DNN (no regularisation)
EPOCHS_REGULARIZED = 150   # Reference only; Section 7 uses best_epochs instead

### 5.5 Handle Class Imbalance with Class Weights

Even though the data is nearly balanced, we still compute class weights for consistency with our other notebooks. With ~51:49 distribution, the weights will be close to 1.0 for both classes.
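
The `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. A quick sketch on a synthetic 51:49 label list shows why the resulting weights stay close to 1.0 (`balanced_class_weights` is a hypothetical helper for illustration):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

toy_labels = [1] * 51 + [0] * 49          # synthetic labels, ~51% positive
toy_weights = balanced_class_weights(toy_labels)
print({cls: round(w, 4) for cls, w in toy_weights.items()})
```

The minority class receives a weight slightly above 1 and the majority class slightly below 1, so the effect on the loss is small for this dataset.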

In [None]:
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
CLASS_WEIGHTS = dict(enumerate(weights))

print(f"Class weights: {CLASS_WEIGHTS}")
print("(Values close to 1.0 indicate balanced classes)")

In [None]:
# Train the Single-Layer Perceptron
history_slp = slp_model.fit(X_train, y_train, 
                            class_weight=CLASS_WEIGHTS,
                            batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
                            validation_data=(X_val, y_val),
                            verbose=0)
val_score_slp = slp_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
# Display SLP validation metrics
print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_slp[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_slp[1]))
print('Recall (Validation): {:.2f}'.format(val_score_slp[2]))
print('AUC (Validation): {:.2f}'.format(val_score_slp[3]))

# For binary: threshold predictions at 0.5
preds_slp_val = (slp_model.predict(X_val, verbose=0) > 0.5).astype('int32').flatten()
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val, preds_slp_val), balanced_accuracy_baseline))

In [None]:
def plot_training_history(history, title=None):
    """
    Plot training and validation metrics over epochs.
    Plots: (1) Loss, (2) Accuracy
    """
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    epochs = range(1, len(history.history['loss']) + 1)
    title_suffix = f' ({title})' if title else ''

    # Plot 1: Loss
    axs[0].plot(epochs, history.history['loss'], 'b-', label='Training', linewidth=1.5)
    axs[0].plot(epochs, history.history['val_loss'], 'r-', label='Validation', linewidth=1.5)
    axs[0].set_title(f'Loss{title_suffix}')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('Loss')
    axs[0].legend()
    axs[0].grid(alpha=0.3)

    # Plot 2: Accuracy
    axs[1].plot(epochs, history.history['accuracy'], 'b-', label='Training', linewidth=1.5)
    axs[1].plot(epochs, history.history['val_accuracy'], 'r-', label='Validation', linewidth=1.5)
    axs[1].set_title(f'Accuracy{title_suffix}')
    axs[1].set_xlabel('Epochs')
    axs[1].set_ylabel('Accuracy')
    axs[1].legend()
    axs[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

In [None]:
# Plot SLP training history
plot_training_history(history_slp, title='SLP Baseline')

## 6. Scaling Up: Developing a Model That Overfits

The next step in the Universal ML Workflow is to build a model with **enough capacity to overfit**. If a model can't overfit, it may be too simple to learn the patterns in the data.

**Strategy:** Add a hidden layer with 64 neurons to increase model capacity.

**No regularisation applied:** We intentionally train without regularisation to observe overfitting behaviour.

---

### Architecture Design Decisions

**Why 64 neurons in the hidden layer?**

This is a practical starting point that balances capacity and efficiency:
- **Too few (e.g., 16):** May not have enough capacity to learn complex sentiment patterns
- **Too many (e.g., 512):** Increases overfitting risk and training time without proportional benefit
- **64 neurons:** A common choice that provides sufficient capacity for most text classification tasks

**Why only 1 hidden layer instead of 2-3?**

Per the **Universal ML Workflow**, the goal of this step is to demonstrate that the model *can* overfit—proving it has sufficient capacity to capture the underlying patterns. Once overfitting is observed:

1. **Capacity is proven sufficient:** If the model overfits, it can learn the training data's complexity
2. **No need for more depth:** Adding layers would increase overfitting further without benefit
3. **Regularise, don't expand:** The next step (Section 7) is to *reduce* overfitting through regularisation

*"The right question is not 'How many layers?' but 'Can it overfit?' If yes, regularise. If no, add capacity."*
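
One way to quantify the jump in capacity is to count trainable parameters. For a Dense layer this is `inputs × units + units` (weights plus biases). The sketch assumes the TF-IDF vocabulary reaches the full 5,000 features; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Total weights + biases for a stack of Dense layers.

    layer_sizes lists the input dimension followed by the units of each layer,
    e.g. [5000, 64, 1] for input -> Dense(64) -> Dense(1).
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

slp_params = dense_param_count([5000, 1])        # single-layer perceptron
dnn_params = dense_param_count([5000, 64, 1])    # one hidden layer of 64 units

print(f"SLP: {slp_params:,} parameters")
print(f"DNN: {dnn_params:,} parameters")
```

These totals should match the "Total params" line printed by `model.summary()` when the input dimension is 5,000.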

### 6.1 Build a Deep Neural Network (DNN)

Let's add a hidden layer with 64 neurons and ReLU activation:

In [None]:
# Deep Neural Network (1 hidden layer, no regularisation for overfitting demo)
# For binary: 1 output neuron with sigmoid
dnn_model = Sequential(name='Deep_Neural_Network')
dnn_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
dnn_model.add(Dense(64, activation='relu'))
dnn_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
dnn_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

dnn_model.summary()

In [None]:
# Train the Deep Neural Network (without regularisation)
history_dnn = dnn_model.fit(X_train, y_train, 
                            class_weight=CLASS_WEIGHTS,
                            batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
                            validation_data=(X_val, y_val), 
                            verbose=0)
val_score_dnn = dnn_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
# Plot DNN training history (expect overfitting: val_loss increasing)
plot_training_history(history_dnn, title='DNN - No Regularisation')

In [None]:
# Display DNN validation metrics
print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_dnn[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_dnn[1]))
print('Recall (Validation): {:.2f}'.format(val_score_dnn[2]))
print('AUC (Validation): {:.2f}'.format(val_score_dnn[3]))

preds_dnn_val = (dnn_model.predict(X_val, verbose=0) > 0.5).astype('int32').flatten()
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val, preds_dnn_val), balanced_accuracy_baseline))

## 7. Regularising Your Model and Tuning Hyperparameters

Now we address the overfitting observed in Section 6 by adding **regularisation**. We use two complementary techniques:

| Technique | How it works | Effect |
|-----------|--------------|--------|
| **Dropout** | Randomly drops neurons during training | Acts like ensemble averaging, reduces co-adaptation |
| **L2 (Weight Decay)** | Adds penalty for large weights to loss | Keeps weights small, smoother decision boundaries |

We use **Hyperband** for efficient hyperparameter tuning.
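
Both techniques in the table are simple to express directly. The sketch below is plain Python rather than Keras: the L2 penalty is the regularisation strength times the sum of squared weights, and dropout is shown in its "inverted" form, where surviving activations are scaled by `1 / (1 - rate)` during training so that their expected value is unchanged. Helper names and example numbers are assumptions for illustration.

```python
import random

def l2_penalty(weights, l2_reg):
    """Term added to the loss: l2_reg * sum of squared weights."""
    return l2_reg * sum(w * w for w in weights)

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(204)

penalty = l2_penalty([1.0, -2.0], l2_reg=0.01)           # 0.01 * (1 + 4)
dropped = inverted_dropout([1.0] * 10, rate=0.5, rng=rng)

print(f"L2 penalty: {penalty:.3f}")
print(f"After dropout: {dropped}")
```

At inference time dropout is switched off, so no rescaling is needed when predicting.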

### 7.1 Hyperband Search

In [None]:
# Hyperband Model Builder for Binary Classification
def build_model_hyperband(hp):
    """
    Build Movie Review model with FIXED architecture (1 hidden layer, 64 neurons).
    Same architecture as Section 6 DNN - only tunes regularisation and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # L2 regularisation strength
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')

    # Fixed architecture: 1 hidden layer with 64 neurons
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(l2_reg)))
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))

    # Output layer for binary classification
    model.add(layers.Dense(OUTPUT_DIMENSION, activation='sigmoid'))

    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
# For balanced binary classification, we can use val_accuracy or val_auc
TUNING_OBJECTIVE = 'val_accuracy' if PRIMARY_METRIC == 'accuracy' else 'val_auc'

tuner = kt.Hyperband(
    build_model_hyperband,
    objective=kt.Objective(TUNING_OBJECTIVE, direction='max'),  # both candidate metrics are maximised
    max_epochs=20,
    factor=3,
    directory='movie_review_hyperband',
    project_name='movie_review_tuning',
    overwrite=True
)

print(f"Tuning objective: {TUNING_OBJECTIVE}")

# Run Hyperband search
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=BATCH_SIZE,
    class_weight=CLASS_WEIGHTS
)

In [None]:
# Get best hyperparameters
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout')}")
print(f"  Learning Rate: {best_hp.get('lr'):.6f}")

# =============================================================================
# CRITICAL: Extract the number of epochs from the best trial
# =============================================================================
best_trial = tuner.oracle.get_best_trials(num_trials=1)[0]
best_epochs = best_trial.best_step + 1  # best_step is 0-indexed

print(f"\n>>> Best trial reached its best validation score at epoch {best_epochs} <<<")
print(f"    (This is the epoch count we'll use for retraining)")

# Build a fresh model with the best hyperparameters
opt_model = tuner.hypermodel.build(best_hp)
opt_model.summary()

### 7.2 Retraining with Matched Epochs

After extracting the best hyperparameters, we retrain the model using the **exact same number of epochs** that Hyperband used for the best trial.

---

#### The Epoch Mismatch Problem

Hyperband uses **successive halving** - most configurations train for few epochs, only top performers get more:

```
Hyperband with max_epochs=20, factor=3 (most aggressive bracket; config counts are illustrative):
Round 1: n configs   × ~3 epochs → Keep top ~1/3
Round 2: n/3 configs × ~7 epochs → Keep top ~1/3
Round 3: n/9 configs × 20 epochs → Select best
```

The best hyperparameters were found optimal at a **specific epoch count** (e.g., 7 epochs). If we rebuild and retrain for a different number of epochs (e.g., 150), the hyperparameters may no longer be optimal.
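
The per-round epoch budgets can be approximated with a few lines of arithmetic. This is a sketch of the successive-halving idea under one common formulation (each earlier round gets `max_epochs` divided by another power of `factor`, rounded up). The exact bracket schedule used by Keras Tuner may differ in detail, and `successive_halving_epochs` is a hypothetical helper.

```python
import math

def successive_halving_epochs(max_epochs, factor):
    """Epoch budget per round, from the cheapest round to the full budget."""
    rounds = 1
    while factor ** rounds <= max_epochs:   # how many times can we divide by factor?
        rounds += 1
    return [math.ceil(max_epochs / factor ** (rounds - 1 - r)) for r in range(rounds)]

epoch_budgets = successive_halving_epochs(max_epochs=20, factor=3)
print(f"Epochs per round: {epoch_budgets}")
```

Most configurations are only ever judged on the first, cheapest budget, which is why the epoch count attached to the winning trial matters when retraining.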

---

#### Solution: Match the Epoch Count

We extract `best_step` from the best trial (0-indexed epoch where best validation score occurred) and retrain for exactly `best_step + 1` epochs:

| Approach | Epochs Match? | Issue |
|----------|---------------|-------|
| ~~Rebuild + retrain for 150 epochs~~ | No | Hyperparameters may be suboptimal at 150 epochs |
| ~~Use get_best_models() directly~~ | Yes | No training history available for plotting |
| **Rebuild + retrain for best_epochs** | Yes | Best of both: matched epochs + training history |

> *"Use the epoch count that Hyperband determined was optimal for these hyperparameters."*
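
The `best_step + 1` conversion is the same as locating the best epoch in a validation curve and converting its 0-based index to an epoch count. A minimal sketch with made-up validation accuracies (the values and the helper `best_epoch_count` are purely illustrative):

```python
def best_epoch_count(val_scores, mode="max"):
    """Number of epochs to train so that training stops at the best score.

    val_scores[i] is the validation score after epoch i + 1, so the
    0-based index of the best score plus one is the epoch count.
    """
    pick = max if mode == "max" else min
    best_step = pick(range(len(val_scores)), key=lambda i: val_scores[i])
    return best_step + 1

example_val_accuracy = [0.71, 0.76, 0.79, 0.78, 0.77]   # hypothetical curve
example_best_epochs = best_epoch_count(example_val_accuracy)
print(f"Retrain for {example_best_epochs} epochs")
```

Use `mode="min"` when the tracked quantity is a loss rather than an accuracy.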

In [None]:
# =============================================================================
# RETRAIN WITH MATCHED EPOCHS
# =============================================================================
print(f"Retraining with best hyperparameters for {best_epochs} epochs...")
print(f"(Matching the epoch count from Hyperband's best trial)")

history_opt = opt_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    class_weight=CLASS_WEIGHTS,  # same weighting as during the Hyperband search
    epochs=best_epochs,  # CRITICAL: Use matched epochs!
    batch_size=BATCH_SIZE,
    verbose=0
)

val_score_opt = opt_model.evaluate(X_val, y_val, verbose=0)[1:]
print(f"\nValidation Accuracy: {val_score_opt[0]:.4f}")

In [None]:
# Plot training history for the optimised model
# Now available because we retrained with matched epochs!
plot_training_history(history_opt, title=f'DNN - Dropout + L2 ({best_epochs} epochs)')

In [None]:
preds_opt_val = (opt_model.predict(X_val, verbose=0) > 0.5).astype('int32').flatten()

print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_opt[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_opt[1]))
print('Recall (Validation): {:.2f}'.format(val_score_opt[2]))
print('AUC (Validation): {:.2f}'.format(val_score_opt[3]))
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val, preds_opt_val), balanced_accuracy_baseline))

### 7.3 Final Model Evaluation on Test Set

Now we evaluate our best model on the held-out test set.

In [None]:
# Final evaluation on test set
test_score = opt_model.evaluate(X_test, y_test, verbose=0)[1:]
preds_test = (opt_model.predict(X_test, verbose=0) > 0.5).astype('int32').flatten()

print('=' * 50)
print('FINAL TEST SET RESULTS')
print('=' * 50)
print(f'Accuracy (Test): {test_score[0]:.4f} (baseline={baseline:.4f})  ← Primary Metric')
print(f'Precision (Test): {test_score[1]:.4f}')
print(f'Recall (Test): {test_score[2]:.4f}')
print(f'AUC (Test): {test_score[3]:.4f}')
print(f'Balanced Accuracy (Test): {balanced_accuracy_score(y_test, preds_test):.4f}')

In [None]:
# Display confusion matrix for test predictions
fig, ax = plt.subplots(figsize=(8, 6))
cm = confusion_matrix(y_test, preds_test)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - Test Set Predictions')
plt.tight_layout()
plt.show()

# Print per-class metrics
print("\nPer-Class Recall:")
for i, class_name in enumerate(label_encoder.classes_):
    class_mask = y_test == i
    class_recall = (preds_test[class_mask] == i).mean()
    print(f"  {class_name}: {class_recall:.2%} ({class_mask.sum()} samples)")

---

## 8. Results Summary

The following table compares all models trained in this notebook.

In [None]:
# =============================================================================
# RESULTS SUMMARY
# =============================================================================

results = pd.DataFrame({
    'Model': ['Naive Baseline', 'SLP (No Hidden)', 'DNN (No Regularisation)', 'DNN (Dropout + L2)', 'DNN (Dropout + L2) - Test'],
    'Accuracy': [baseline, val_score_slp[0], val_score_dnn[0], val_score_opt[0], test_score[0]],
    'AUC': [0.5, val_score_slp[3], val_score_dnn[3], val_score_opt[3], test_score[3]],
    'Dataset': ['N/A', 'Validation', 'Validation', 'Validation', 'Test']
})

print("=" * 70)
print("MODEL COMPARISON - RESULTS SUMMARY")
print("=" * 70)
print(f"Primary Metric: ACCURACY (imbalance ratio: {imbalance_ratio:.2f}:1 - balanced)")
print("=" * 70)
print(results.to_string(index=False, float_format='{:.4f}'.format))
print("=" * 70)
print(f"\nKey Observations:")
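trained_accuracy = results.loc[results['Dataset'] != 'N/A', 'Accuracy']
print(f"  - All trained models beat the naive baseline ({baseline:.2%}): {(trained_accuracy > baseline).all()}")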
print(f"  - Final test accuracy: {test_score[0]:.4f}")

---

## 9. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | ~65,000 samples | Hold-Out | Kohavi (1995); Chollet (2021) |
| **Accuracy vs F1-Score** | > 3:1 imbalance | ~1.04:1 ratio | Accuracy | He and Garcia (2009) |

### Binary vs. Multi-Class Classification

| Aspect | Binary (This Notebook) | Multi-Class (Twitter Examples) |
|--------|------------------------|-------------------------------|
| **Output neurons** | 1 neuron | N neurons |
| **Output activation** | Sigmoid | Softmax |
| **Loss function** | Binary cross-entropy | Categorical cross-entropy |
| **Label format** | Single 0/1 value | One-hot vector |
| **Prediction** | Threshold at 0.5 | argmax |

### Lessons Learned

1. **Binary Classification is Simpler:** With only two classes, the model architecture and loss function are more straightforward than multi-class.

2. **Class Weights Optional for Balanced Data:** With ~51:49 class distribution, class weights have minimal impact (weights ≈ 1.0 for both classes).

3. **Same TF-IDF Pipeline Works:** The text preprocessing approach (TF-IDF with bigrams) applies equally well to binary and multi-class NLP problems.

4. **Accuracy is Appropriate for Balanced Data:** With imbalance ratio < 3:1, accuracy is a meaningful primary metric.

5. **Regularisation Prevents Overfitting:** Combining Dropout + L2 regularisation controls overfitting effectively.

6. **Code Reusability:** The core workflow and code patterns are nearly identical to the multi-class Twitter notebooks—only the output layer, loss function, and label encoding differ.

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

---

## Appendix: Modular Helper Functions

For cleaner code organisation, you can wrap the model building and training patterns into reusable functions.

In [None]:
# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================

def build_binary_nlp_classifier(input_dim, hidden_units=None, dropout=0.0, l2_reg=0.0,
                                 optimizer='adam', learning_rate=None, name=None):
    """
    Build a binary NLP classification neural network.
    
    Parameters:
    -----------
    input_dim : int
        Number of input features (TF-IDF vector dimension)
    hidden_units : list of int, optional
        Neurons per hidden layer, e.g., [64] or [128, 64]
    dropout : float
        Dropout rate (0.0 to 0.5)
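    l2_reg : float
        L2 regularisation strength
    optimizer : str or keras optimizer
        Optimiser used when learning_rate is None (default 'adam')
    learning_rate : float, optional
        If given, an Adam optimiser with this learning rate is used instead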
    name : str, optional
        Model name
        
    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    """
    model = Sequential(name=name)
    model.add(layers.Input(shape=(input_dim,)))
    
    hidden_units = hidden_units or []
    kernel_reg = regularizers.l2(l2_reg) if l2_reg > 0 else None
    
    for units in hidden_units:
        model.add(Dense(units, activation='relu', kernel_regularizer=kernel_reg))
        if dropout > 0:
            model.add(Dropout(dropout))
    
    # Binary output
    model.add(Dense(1, activation='sigmoid'))
    
    if learning_rate is not None:
        opt = keras.optimizers.Adam(learning_rate=learning_rate)
    else:
        opt = optimizer
    
    metrics = ['accuracy', 
               tf.keras.metrics.Precision(name='precision'),
               tf.keras.metrics.Recall(name='recall'),
               tf.keras.metrics.AUC(name='auc')]
    
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=metrics)
    return model


def train_with_class_weights(model, X_train, y_train, X_val, y_val,
                              batch_size=512, epochs=100, verbose=0):
    """Train model with automatic class weight computation."""
    weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weights = dict(enumerate(weights))
    
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        batch_size=batch_size, 
        epochs=epochs,
        class_weight=class_weights,
        verbose=verbose
    )


def evaluate_binary_nlp(model, X, y_true, threshold=0.5):
    """
    Evaluate binary NLP classification model.
    
    Returns:
    --------
    dict : Dictionary containing all binary classification metrics
    """
    y_pred_proba = model.predict(X, verbose=0).flatten()
    y_pred = (y_pred_proba > threshold).astype('int32')
    
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'auc': roc_auc_score(y_true, y_pred_proba),
        'balanced_accuracy': balanced_accuracy_score(y_true, y_pred),
    }
    
    return metrics


# =============================================================================
# USAGE EXAMPLES
# =============================================================================
# 
# # Build models
# slp = build_binary_nlp_classifier(INPUT_DIMENSION, name='SLP')
# mlp = build_binary_nlp_classifier(INPUT_DIMENSION, hidden_units=[64], name='MLP')
# mlp_reg = build_binary_nlp_classifier(INPUT_DIMENSION, hidden_units=[64], 
#                                       dropout=0.3, l2_reg=0.001, learning_rate=0.001)
# 
# # Train with class weights
# history = train_with_class_weights(mlp, X_train, y_train, X_val, y_val)
# 
# # Evaluate
# metrics = evaluate_binary_nlp(mlp, X_val, y_val)
# print(f"Accuracy: {metrics['accuracy']:.4f}, AUC: {metrics['auc']:.4f}")