<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Twitter%20US%20Airline%20Sentiment/Twitter%20US%20Airline%20Sentiment%20-%20NLP%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter US Airline Sentiment - NLP Example

This notebook demonstrates the **Universal ML Workflow** applied to a multi-class NLP classification problem using Twitter airline sentiment data.

## Learning Objectives

By the end of this notebook, you will be able to:
- Apply the Universal ML Workflow to an NLP text classification problem
- Convert text data to numerical features using **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorization
- Handle **imbalanced classes** using class weights during training
- Build and train deep neural networks for **multi-class classification**
- Use **Hyperband** for efficient hyperparameter tuning
- Evaluate model performance using appropriate metrics for imbalanced data (Balanced Accuracy, Precision, Recall, AUC)

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [Kaggle Twitter US Airline Sentiment](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment) |
| **Problem Type** | Multi-Class Classification (3 classes) |
| **Data Balance** | Imbalanced (Negative: ~63%, Neutral: ~21%, Positive: ~16%) |
| **Data Type** | Unstructured Text (Tweets) |
| **Input Features** | TF-IDF Vectors (5000 features, unigrams and bigrams) |
| **Output** | Sentiment: Negative, Neutral, or Positive |
| **Imbalance Handling** | Class Weights during Training |

---

## 1. Defining the Problem and Assembling a Dataset

The first step in any machine learning project is to clearly define the problem and understand the data.

**Problem Statement:** Given a tweet about a US airline, predict the sentiment (Negative, Neutral, or Positive).

**Why this matters:** Airlines can use sentiment analysis to:
- Identify unhappy customers quickly and respond to complaints
- Track brand perception over time
- Discover common pain points in customer experience

**Data Source:** This dataset contains tweets about major US airlines, collected via Twitter's API and labeled by human annotators.

## 2. Choosing a Measure of Success

### Metric Selection Based on Class Imbalance

The choice of evaluation metric depends on **class imbalance**. We use practical guidelines derived from the literature:

| Imbalance Ratio | Classification | Primary Metric | Rationale |
|-----------------|----------------|----------------|-----------|
| ≤ 1.5:1 | Balanced | **Accuracy** | Classes roughly equal |
| 1.5:1 – 3:1 | Mild Imbalance | **Accuracy** | Majority class < 75% |
| > 3:1 | Moderate/Severe | **F1-Score** | Accuracy becomes misleading |

**Why these thresholds?**
- **3:1 ratio**: In a two-class problem, a 3:1 ratio means the majority class makes up 75% of the data, so a naive classifier that always predicts it reaches 75% accuracy while ignoring the minority class
- **F1-Score**: Harmonic mean of precision and recall, effective for imbalanced data (He and Garcia, 2009)
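
To make the threshold concrete, the sketch below scores a classifier that always predicts the majority class, using the class counts reported for this dataset in Section 5.1. The helper name `majority_baseline_scores` is an illustrative assumption and not part of the notebook's code.

```python
def majority_baseline_scores(counts):
    """Score a classifier that always predicts the most frequent class.

    counts: dict mapping class name -> number of samples.
    Returns (imbalance_ratio, accuracy, macro_f1).
    """
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    imbalance_ratio = counts[majority] / min(counts.values())

    # Every sample is predicted as the majority class, so accuracy
    # equals the majority class share.
    accuracy = counts[majority] / total

    # Per-class F1: only the majority class has any true positives.
    f1_per_class = []
    for name in counts:
        if name == majority:
            precision = counts[majority] / total  # all predictions are this class
            recall = 1.0                          # every majority sample is found
            f1_per_class.append(2 * precision * recall / (precision + recall))
        else:
            f1_per_class.append(0.0)              # never predicted, so F1 is 0
    macro_f1 = sum(f1_per_class) / len(f1_per_class)
    return imbalance_ratio, accuracy, macro_f1


counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
ratio, acc, macro_f1 = majority_baseline_scores(counts)
print(f"imbalance ratio: {ratio:.2f}:1")
print(f"accuracy: {acc:.3f}, macro F1: {macro_f1:.3f}")
```

The gap between the two scores is the point: accuracy rewards the naive classifier for the majority class alone, while macro-averaged F1 penalizes it for the two classes it never predicts.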

### References

- Branco, P., Torgo, L. and Ribeiro, R.P. (2016) 'A survey of predictive modeling on imbalanced domains', *ACM Computing Surveys*, 49(2), pp. 1–50.

- Brownlee, J. (2020) *A gentle introduction to imbalanced classification*. Available at: https://machinelearningmastery.com/what-is-imbalanced-classification/ (Accessed: 20 January 2025).

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Luque, A., Carrasco, A., Martín, A. and de las Heras, A. (2019) 'The impact of class imbalance in classification performance metrics based on the binary confusion matrix', *Pattern Recognition*, 91, pp. 216–231.

*Note: The 3:1 threshold is a practical guideline, not a strict academic standard. The literature suggests metric choice depends on domain-specific costs of errors.*

## 3. Deciding on an Evaluation Protocol

### Hold-Out vs K-Fold Cross-Validation

The choice between hold-out and K-fold depends on **dataset size** and **computational cost**:

| Dataset Size | Recommended Method | Rationale |
|--------------|-------------------|-----------|
| < 1,000 | K-Fold (K=5 or 10) | High variance with small hold-out sets |
| 1,000 – 10,000 | K-Fold or Hold-Out | Either works; K-fold more robust |
| > 10,000 | Hold-Out | Sufficient data; K-fold computationally expensive |
| Deep Learning | Hold-Out (preferred) | Training cost prohibitive for K iterations |

**Why 10,000 as a practical threshold?**
- Below 10,000 samples, hold-out validation has higher variance (Kohavi, 1995)
- Above 10,000, statistical estimates from hold-out are reliable
- Deep learning models are expensive to train; K-fold multiplies cost by K (Chollet, 2021)

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning: data mining, inference, and prediction*. 2nd edn. New York: Springer.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

- Pedregosa, F. et al. (2011) 'Scikit-learn: machine learning in Python', *Journal of Machine Learning Research*, 12, pp. 2825–2830. Available at: https://scikit-learn.org/stable/modules/cross_validation.html (Accessed: 20 January 2025).

*Note: The 10,000 threshold is a practical guideline. For computationally cheap models, K-fold is preferred regardless of size.*

### Data Split Strategy (This Notebook)

```
Original Data (14,640 samples) → Hold-Out Selected
├── Test Set (10%) - Final evaluation only
└── Training Pool (90%)
    ├── Training Set (81%) - Model training
    └── Validation Set (9%) - Monitoring & hyperparameter tuning
```

**Important:** We use the `stratify` parameter of `train_test_split` to maintain class proportions in all splits.
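
The 81% and 9% figures follow from applying a 10% validation split to the 90% training pool. A minimal sketch of that arithmetic, using exact fractions to avoid floating-point noise (the constant names mirror the ones used later in the notebook):

```python
from fractions import Fraction

TEST_SIZE = Fraction(1, 10)        # 10% held out for the final test
VALIDATION_SIZE = Fraction(1, 10)  # 10% of the remaining training pool

training_pool = 1 - TEST_SIZE
validation_share = training_pool * VALIDATION_SIZE
training_share = training_pool - validation_share

print(f"test: {float(TEST_SIZE):.0%}, "
      f"validation: {float(validation_share):.0%}, "
      f"training: {float(training_share):.0%}")
```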

## 4. Preparing Your Data

### 4.1 Import Libraries and Set Random Seed

We set random seeds for reproducibility, so that repeated runs of the notebook produce the same results (or, on hardware with non-deterministic operations such as some GPUs, very similar ones).

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import TfidfVectorizer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner for hyperparameter search
!pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Load and Explore the Dataset

Let's load the Twitter airline sentiment data and examine its structure.

In [3]:
tweets = pd.read_csv('Tweets.csv', sep=',')
tweets = tweets[['text', 'airline_sentiment']]

tweets.head()

| | text | airline_sentiment |
|---|------|-------------------|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials t... | positive |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral |
| 3 | @VirginAmerica it's really aggressive to blast... | negative |
| 4 | @VirginAmerica and it's a really big bad thing... | negative |


### 4.3 Split Data into Train and Test Sets

We reserve 10% of the data for final testing. The `stratify` parameter ensures that each split maintains the same class proportions as the original dataset - critical for imbalanced data.

In [4]:
TEST_SIZE = 0.1

(tweets_train, tweets_test, 
 sentiment_train, sentiment_test) = train_test_split(tweets['text'], tweets['airline_sentiment'], 
                                                     test_size=TEST_SIZE, stratify=tweets['airline_sentiment'],
                                                     shuffle=True, random_state=SEED)

### 4.4 Text Vectorization with TF-IDF

Neural networks require numerical input, but tweets are text. We use **TF-IDF (Term Frequency-Inverse Document Frequency)** to convert text to numbers.

**How TF-IDF works:**
- **TF (Term Frequency):** How often a word appears in a document
- **IDF (Inverse Document Frequency):** Downweights words that appear in many documents (like "the", "is")
- **TF-IDF = TF × IDF:** Words that are frequent in a document but rare overall get high scores

**Our settings:**
- `max_features=5000`: Keep only the 5000 terms with the highest frequency across the training corpus
- `ngram_range=(1, 2)`: Include both single words (unigrams) and word pairs (bigrams) like "great service"
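
To see the weighting in action, here is a from-scratch sketch on three made-up documents. It assumes whitespace tokenization and uses the smoothed IDF and L2 normalization that scikit-learn applies by default; it does not reproduce `TfidfVectorizer`'s tokenizer or the `max_features` cut-off, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, max_n=2):
    """Compute L2-normalised TF-IDF dicts for non-empty, whitespace-tokenised docs."""
    term_counts = [Counter(ngrams(doc.lower().split(), max_n)) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)

    # Smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for counts in term_counts:
        raw = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors, idf


# Hypothetical mini-corpus, for illustration only.
docs = ["great service today", "flight delayed again", "great flight"]
vectors, idf = tfidf_vectors(docs)
print(sorted(vectors[0], key=vectors[0].get, reverse=True)[:3])
```

In this toy corpus "great" appears in two of the three documents, so it receives a lower IDF than "service" or the bigram "great service", which each appear in only one.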

In [5]:
MAX_FEATURES = 5000
NGRAMS = 2

tfidf = TfidfVectorizer(ngram_range=(1, NGRAMS), max_features=MAX_FEATURES)
tfidf.fit(tweets_train)

X_train, X_test = tfidf.transform(tweets_train).toarray(), tfidf.transform(tweets_test).toarray()

### 4.5 Encode Labels as One-Hot Vectors

For multi-class classification with softmax output, we need to convert categorical labels to one-hot encoded vectors:
- Negative → [1, 0, 0]
- Neutral → [0, 1, 0]
- Positive → [0, 0, 1]
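
The index order comes from sorting the class names alphabetically, which is also what `LabelEncoder` does. A dependency-free sketch of the same mapping (the helper `one_hot_encode` is an illustrative stand-in, not the notebook's implementation):

```python
def one_hot_encode(labels):
    """Map string labels to one-hot rows; classes are sorted alphabetically."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows


sample_labels = ["neutral", "positive", "negative", "negative"]
classes, rows = one_hot_encode(sample_labels)
for label, row in zip(sample_labels, rows):
    print(label, row)
```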

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(tweets['airline_sentiment'])

y_train = to_categorical(label_encoder.transform(sentiment_train))
y_test = to_categorical(label_encoder.transform(sentiment_test))

## 5. Developing a Model That Does Better Than a Baseline

Before building complex models, we need to establish **baseline performance**. This gives us a reference point to know if our model is actually learning something useful.

### 5.1 Examine Class Distribution

Let's look at how the sentiment classes are distributed:

In [8]:
counts = tweets.groupby(['airline_sentiment']).count()
counts.reset_index(inplace=True)

counts

| | airline_sentiment | text |
|---|-------------------|------|
| 0 | negative | 9178 |
| 1 | neutral | 3099 |
| 2 | positive | 2363 |


In [None]:
# =============================================================================
# DATA-DRIVEN ANALYSIS: Dataset Size & Imbalance
# =============================================================================

# Dataset size analysis (for hold-out vs K-fold decision)
n_samples = len(tweets)
HOLDOUT_THRESHOLD = 10000  # Use hold-out if samples > 10,000 (Kohavi, 1995; Chollet, 2021)

# Imbalance analysis (for metric selection)
majority_class = counts['text'].max()
minority_class = counts['text'].min()
imbalance_ratio = majority_class / minority_class
IMBALANCE_THRESHOLD = 3.0  # Use F1-Score if ratio > 3.0 (He & Garcia, 2009)

# Determine evaluation strategy and metric
use_holdout = n_samples > HOLDOUT_THRESHOLD
use_f1 = imbalance_ratio > IMBALANCE_THRESHOLD

print("=" * 60)
print("DATA-DRIVEN CONFIGURATION")
print("=" * 60)
print(f"\n1. DATASET SIZE: {n_samples:,} samples")
print(f"   Threshold: {HOLDOUT_THRESHOLD:,} samples (Kohavi, 1995)")
print(f"   Decision: {'Hold-Out' if use_holdout else 'K-Fold Cross-Validation'}")

print(f"\n2. CLASS IMBALANCE: {imbalance_ratio:.2f}:1 ratio")
print(f"   Threshold: {IMBALANCE_THRESHOLD:.1f}:1 (He & Garcia, 2009)")
print(f"   Decision: {'F1-Score (imbalanced)' if use_f1 else 'Accuracy (balanced)'}")

print("\n" + "=" * 60)
PRIMARY_METRIC = 'f1' if use_f1 else 'accuracy'
print(f"PRIMARY METRIC: {PRIMARY_METRIC.upper()}")
print("=" * 60)

### 5.2 Calculate Baseline Metrics

**Naive Baseline (Majority Class):** If we always predict "negative", we get ~63% accuracy. This is our accuracy baseline.

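**Balanced Accuracy Baseline:** Balanced accuracy is the mean of per-class recall. A classifier that ignores its input, whether it guesses uniformly at random or always predicts one class, scores 33.3% on three classes. This is a more meaningful reference point for imbalanced data.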
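
The difference between the two baselines is easiest to see on a small example. The sketch below uses ten made-up labels with a similar skew and a hand-written `balanced_accuracy` helper (an illustrative stand-in for scikit-learn's `balanced_accuracy_score`):

```python
def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Toy labels for illustration: 6 negative, 2 neutral, 2 positive.
y_true = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
always_negative = ["negative"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
balanced = balanced_accuracy(y_true, always_negative)
print(f"accuracy: {accuracy:.2f}")
print(f"balanced accuracy: {balanced:.2f}")
```

Plain accuracy credits the constant prediction with the majority share, while balanced accuracy averages a recall of 1 for "negative" with a recall of 0 for the other two classes.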

In [9]:
baseline = counts[counts['airline_sentiment']=='negative']['text'].values[0] / counts['text'].sum()

baseline

0.6269125683060109

In [None]:
# Balanced accuracy baseline: a constant prediction of class 0 ("negative")
# scores recall 1.0 on that class and 0.0 on the other two, i.e. 1/3 overall
balanced_accuracy_baseline = balanced_accuracy_score(y_train.argmax(axis=1), np.zeros(len(y_train)))

print(f"Baseline accuracy (majority class): {baseline:.2f}")
print(f"Balanced accuracy baseline (constant prediction): {balanced_accuracy_baseline:.2f}")

### 5.3 Create Validation Set

We split off a portion of the training data for validation. This will be used to monitor training progress and evaluate model performance without touching the test set.

In [None]:
VALIDATION_SIZE = 0.1

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                                  test_size=VALIDATION_SIZE, stratify=y_train,
                                                  shuffle=True, random_state=SEED)

### 5.4 Configure Training Parameters

**Key training settings:**
- **Optimizer:** RMSprop - adaptive learning rate optimizer that works well for most problems
- **Loss:** Categorical cross-entropy - standard loss for multi-class classification
- **Metrics:** Categorical accuracy, precision, recall, and AUC are tracked during training. No early-stopping callback is configured; every model trains for the full `EPOCHS`.
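
As a reminder of what the loss measures, the sketch below computes softmax probabilities and categorical cross-entropy for a single sample by hand. The logits are hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract max for stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Cross-entropy for one sample: -sum(target * log(probability))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)


probs = softmax([2.0, 0.5, -1.0])                          # hypothetical logits
loss_correct = categorical_crossentropy([1, 0, 0], probs)  # true class: negative
loss_wrong = categorical_crossentropy([0, 0, 1], probs)    # true class: positive
print([round(p, 3) for p in probs], round(loss_correct, 3), round(loss_wrong, 3))
```

The loss is small when the model puts most of its probability on the true class and grows quickly when it does not.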

In [None]:
INPUT_DIMENSION = X_train.shape[1]
OUTPUT_CLASSES = y_train.shape[1]

OPTIMIZER = 'rmsprop'
LOSS_FUNC = 'categorical_crossentropy'
METRICS = ['categorical_accuracy', 
           tf.keras.metrics.Precision(name='precision'), 
           tf.keras.metrics.Recall(name='recall'),
           tf.keras.metrics.AUC(name='auc', multi_label=True)]

In [14]:
# Single-Layer Perceptron (no hidden layers)
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(Dense(OUTPUT_CLASSES, activation='softmax', input_shape=(INPUT_DIMENSION,)))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

Model: "Single_Layer_Perceptron"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 3)                 15003     
                                                                 
Total params: 15,003
Trainable params: 15,003
Non-trainable params: 0
_________________________________________________________________


In [None]:
batch_size = 512
EPOCHS = 100

### 5.5 Handle Class Imbalance with Class Weights

To handle imbalanced classes, we compute **class weights** that give more importance to minority classes during training:
- **Negative (majority):** Lower weight (~0.53)
- **Neutral:** Medium weight (~1.57)
- **Positive (minority):** Higher weight (~2.06)

This makes errors on minority classes "cost more", encouraging the model to learn them better.
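
The "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`. A small sketch applying that formula to the full-dataset counts from Section 5.1 (the notebook computes its weights on the training subset, so the later decimals differ slightly):

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * count) for name, count in counts.items()}


weights = balanced_class_weights({"negative": 9178, "neutral": 3099, "positive": 2363})
for name, weight in weights.items():
    print(f"{name}: {weight:.2f}")
```

With these weights each class contributes the same total weight to the loss, because `count * weight` equals `n_samples / n_classes` for every class.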

In [16]:
labels = np.argmax(y_train, axis=1)
weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)
CLASS_WEIGHTS = dict(enumerate(weights))

CLASS_WEIGHTS

{0: 0.5317352220103514, 1: 1.5748285599031868, 2: 2.064516129032258}

In [None]:
# Train the Single-Layer Perceptron
history_slp = slp_model.fit(X_train, y_train, 
                            class_weight=CLASS_WEIGHTS,
                            batch_size=batch_size, epochs=EPOCHS, 
                            validation_data=(X_val, y_val),
                            verbose=0)
val_score_slp = slp_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_slp[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_slp[1]))
print('Recall (Validation): {:.2f}'.format(val_score_slp[2]))
print('AUC (Validation): {:.2f}'.format(val_score_slp[3]))

In [None]:
preds = slp_model.predict(X_val, verbose=0).argmax(axis=1)

print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val.argmax(axis=1), preds), balanced_accuracy_baseline))

In [None]:
def plot_training_history(history, primary_metric='f1'):
    """
    Plot training and validation metrics over epochs.
    Always plots: (1) Loss, (2) Primary metric (F1 or Accuracy)

    Parameters:
    -----------
    history : keras History object
        Training history from model.fit()
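    primary_metric : str
        'f1' for an F1-Score derived from the Keras precision/recall metrics (pooled over
        all classes at a 0.5 threshold, so it can differ from macro-averaged F1), or
        'accuracy' for categorical_accuracy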
    """
    fig, axs = plt.subplots(1, 2, sharex='all', figsize=(15, 5))
    epochs = range(1, len(history.history['loss']) + 1)

    # Plot 1: Loss (always)
    ax = axs[0]
    ax.plot(epochs, history.history['loss'], 'b.-', label='Training Loss')
    ax.plot(epochs, history.history['val_loss'], 'r.-', label='Validation Loss')
    ax.set_xlim([0, len(epochs)])
    ax.set_title('Training and Validation Loss')
    ax.set_xlabel('Epochs')
    ax.set_ylabel('Loss')
    ax.legend()
    ax.grid()

    # Plot 2: Primary metric (F1 or Accuracy)
    ax = axs[1]
    if primary_metric == 'f1':
        # Compute F1 from precision and recall: F1 = 2 * (P * R) / (P + R)
        train_precision = np.array(history.history['precision'])
        train_recall = np.array(history.history['recall'])
        train_f1 = 2 * (train_precision * train_recall) / (train_precision + train_recall + 1e-7)

        val_precision = np.array(history.history['val_precision'])
        val_recall = np.array(history.history['val_recall'])
        val_f1 = 2 * (val_precision * val_recall) / (val_precision + val_recall + 1e-7)

        ax.plot(epochs, train_f1, 'b.-', label='Training F1-Score')
        ax.plot(epochs, val_f1, 'r.-', label='Validation F1-Score')
        ax.set_title('Training and Validation F1-Score')
        ax.set_ylabel('F1-Score')
    else:
        # Use categorical accuracy
        ax.plot(epochs, history.history['categorical_accuracy'], 'b.-', label='Training Accuracy')
        ax.plot(epochs, history.history['val_categorical_accuracy'], 'r.-', label='Validation Accuracy')
        ax.set_title('Training and Validation Accuracy')
        ax.set_ylabel('Accuracy')

    ax.set_xlim([0, len(epochs)])
    ax.set_xlabel('Epochs')
    ax.legend()
    ax.grid()

    plt.tight_layout()
    plt.show()

In [None]:
# Plot SLP training history
plot_training_history(history_slp, primary_metric=PRIMARY_METRIC)

## 6. Scaling Up: Developing a Model That Overfits

The next step in the Universal ML Workflow is to build a model with **enough capacity to overfit**. If a model can't overfit, it may be too simple to learn the patterns in the data.

**Strategy:** Add hidden layers and neurons to increase model capacity.

**No regularization applied:** We intentionally train this model **without any regularization** (no dropout, no L2, no early stopping) to observe overfitting behavior. In the training plots, you should see:
- Training loss continues to decrease
- Validation loss starts increasing after some epochs (overfitting)

This demonstrates why regularization (Section 7) is necessary.
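
One simple way to read such a plot programmatically is to find the epoch where validation loss bottoms out. The sketch below uses made-up loss values, not results from this notebook, and the helper name is an assumption for illustration.

```python
def overfitting_onset(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Made-up loss curves for illustration only.
train_loss = [1.00, 0.80, 0.65, 0.52, 0.41, 0.33, 0.26]
val_loss = [1.02, 0.85, 0.74, 0.70, 0.71, 0.75, 0.81]

best_epoch = overfitting_onset(val_loss)
gap = val_loss[-1] - train_loss[-1]
print(f"validation loss bottoms out at epoch {best_epoch}; final gap = {gap:.2f}")
```

On a real run, the same helper can be applied to `history.history['val_loss']` from the object returned by `model.fit()`.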

### 6.1 Build a Deep Neural Network (DNN)

Let's add a hidden layer with 64 neurons and ReLU activation:

In [None]:
# Deep Neural Network (1 hidden layer, no dropout for overfitting demo)
dnn_model = Sequential(name='Deep_Neural_Network')
dnn_model.add(Dense(64, activation='relu', input_shape=(INPUT_DIMENSION,)))
dnn_model.add(Dense(OUTPUT_CLASSES, activation='softmax'))
dnn_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

dnn_model.summary()

In [None]:
# Train the Deep Neural Network (without early stopping to demonstrate overfitting)
history_dnn = dnn_model.fit(X_train, y_train, 
                            class_weight=CLASS_WEIGHTS,
                            batch_size=batch_size, epochs=EPOCHS, 
                            validation_data=(X_val, y_val), 
                            verbose=0)
val_score_dnn = dnn_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
# Plot DNN training history
plot_training_history(history_dnn, primary_metric=PRIMARY_METRIC)

In [None]:
# Display DNN validation metrics
print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_dnn[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_dnn[1]))
print('Recall (Validation): {:.2f}'.format(val_score_dnn[2]))
print('AUC (Validation): {:.2f}'.format(val_score_dnn[3]))

preds_dnn = dnn_model.predict(X_val, verbose=0).argmax(axis=1)
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val.argmax(axis=1), preds_dnn), balanced_accuracy_baseline))

## 7. Regularizing Your Model and Tuning Hyperparameters

Now we address the overfitting observed in Section 6 by adding **regularization**. We use two complementary techniques:

| Technique | How it works | Effect |
|-----------|--------------|--------|
| **Dropout** | Randomly drops neurons during training | Acts like ensemble averaging, reduces co-adaptation |
| **L2 (Weight Decay)** | Adds penalty for large weights to loss | Keeps weights small, smoother decision boundaries |
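
Both techniques are short to express by hand. The sketch below shows an L2 penalty (strength times the sum of squared weights) and inverted dropout (survivors rescaled by `1 / (1 - rate)`), using plain Python lists and hypothetical values; it illustrates the arithmetic only and is not how Keras implements its layers internally.

```python
import random


def l2_penalty(weights, strength):
    """L2 term added to the loss: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


layer_weights = [0.5, -1.0, 2.0]                 # hypothetical layer weights
penalty = l2_penalty(layer_weights, strength=0.01)
print(f"L2 penalty: {penalty:.4f}")              # 0.01 * (0.25 + 1.0 + 4.0)

rng = random.Random(204)
print(inverted_dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
```

Rescaling the surviving activations keeps their expected sum unchanged, which is why dropout can simply be switched off at inference time.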

We use **Hyperband** to search efficiently for good regularization strengths and a good learning rate.

### Why Hyperband?

**Hyperband** is more efficient than grid search because it:
1. Starts training many configurations for a few epochs
2. Eliminates poor performers early
3. Allocates more resources to promising configurations
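
The three steps above describe successive halving, the routine at the core of Hyperband. The following toy sketch runs a single bracket of it with a hypothetical scoring function in place of real training; full Hyperband runs several such brackets with different starting budgets, and this is not Keras Tuner's implementation.

```python
import math


def successive_halving(configs, score_fn, min_epochs=2, max_epochs=20, factor=3):
    """Keep the top 1/factor of configs each round while growing the budget.

    score_fn(config, epochs) returns a validation score; higher is better.
    Returns the winning config and the list of (epochs, n_configs) rounds.
    """
    survivors = list(configs)
    epochs = min_epochs
    rounds = []
    while True:
        rounds.append((epochs, len(survivors)))
        ranked = sorted(survivors, key=lambda c: score_fn(c, epochs), reverse=True)
        if len(ranked) == 1 or epochs >= max_epochs:
            return ranked[0], rounds
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs = min(max_epochs, epochs * factor)


def toy_score(dropout_rate, epochs):
    """Hypothetical stand-in for training: the score peaks at dropout 0.3."""
    quality = 1.0 - abs(dropout_rate - 0.3)
    return quality * (1.0 - math.exp(-epochs / 5.0))


candidates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
best, rounds = successive_halving(candidates, toy_score)
print(best, rounds)
```

Nine candidates get a short budget, the best three get three times as many epochs, and only the winner receives the largest budget, so most of the compute goes to promising configurations.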

### 7.1 Hyperband Search

In [None]:
# Hyperband Model Builder for Multi-Class Twitter Airline Classification
def build_model_hyperband(hp):
    """
    Build Twitter Airline model with FROZEN architecture (2 layers: 64 -> 32 neurons).
    Tunes regularization (Dropout + L2) and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # L2 regularization strength (shared across layers)
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')

    # Fixed architecture: 2 hidden layers with 64 and 32 neurons
    # Layer 1: 64 neurons with L2 regularization
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(l2_reg)))
    drop_0 = hp.Float('drop_0', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(drop_0))

    # Layer 2: 32 neurons with L2 regularization
    model.add(layers.Dense(32, activation='relu',
                           kernel_regularizer=regularizers.l2(l2_reg)))
    drop_1 = hp.Float('drop_1', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(drop_1))

    # Output layer for multi-class classification
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))

    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
# Use appropriate objective based on PRIMARY_METRIC
# Note: For F1, we use AUC as tuning objective (good proxy for imbalanced data)
# Final evaluation still uses F1-Score as the primary metric
TUNING_OBJECTIVE = 'val_categorical_accuracy' if PRIMARY_METRIC == 'accuracy' else 'val_auc'

tuner = kt.Hyperband(
    build_model_hyperband,
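    # State the direction explicitly; both candidate objectives are maximized
    objective=kt.Objective(TUNING_OBJECTIVE, direction='max'),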
    max_epochs=20,
    factor=3,
    directory='twitter_airline_hyperband',
    project_name='twitter_airline_tuning'
)

print(f"Tuning objective: {TUNING_OBJECTIVE}")

# Run Hyperband search
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=batch_size,
    class_weight=CLASS_WEIGHTS
)

In [None]:
# Get best hyperparameters and build best model
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters:")
print(f"  L2 Regularization: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Layer 1: {best_hp.get('drop_0')}")
print(f"  Dropout Layer 2: {best_hp.get('drop_1')}")
print(f"  Learning Rate: {best_hp.get('lr'):.6f}")

opt_model = tuner.hypermodel.build(best_hp)
opt_model.summary()

### 7.2 Retrain with Optimized Hyperparameters

Now that we have the best hyperparameters from Hyperband search, we:

1. **Build a fresh model** with the optimized L2 strength, dropout rates, and learning rate
2. **Retrain from scratch** with full epochs (not the limited epochs used during search)

**Why no early stopping?**

With proper regularization (Dropout + L2), the model should **not overfit** even when trained for the full number of epochs. This is the key insight:
- **Section 6 (no regularization):** Model overfits → validation loss increases
- **Section 7 (with regularization):** Model doesn't overfit → validation loss stays low

The training plots should show that both training and validation loss converge together, demonstrating that **Dropout + L2 alone prevent overfitting**.

In [None]:
# Train the best model (regularization via Dropout + L2, no early stopping needed)
history_opt = opt_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=EPOCHS,
    batch_size=batch_size,
    class_weight=CLASS_WEIGHTS,
    verbose=1
)
val_score_opt = opt_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
# Plot optimized model training history
plot_training_history(history_opt, primary_metric=PRIMARY_METRIC)

In [None]:
preds_opt = opt_model.predict(X_val, verbose=0)

print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_opt[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_opt[1]))
print('Recall (Validation): {:.2f}'.format(val_score_opt[2]))
print('AUC (Validation): {:.2f}'.format(val_score_opt[3]))
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val.argmax(axis=1), preds_opt.argmax(axis=1)), balanced_accuracy_baseline))

### 7.3 Final Model Evaluation on Test Set

Now we evaluate our best model on the held-out test set that was never used during training or tuning.

In [None]:
# Final evaluation on test set
test_score = opt_model.evaluate(X_test, y_test, verbose=0)[1:]
preds_test = opt_model.predict(X_test, verbose=0)

print('=' * 50)
print('FINAL TEST SET RESULTS')
print('=' * 50)
print('Accuracy (Test): {:.2f} (baseline={:.2f})'.format(test_score[0], baseline))
print('Precision (Test): {:.2f}'.format(test_score[1]))
print('Recall (Test): {:.2f}'.format(test_score[2]))
print('AUC (Test): {:.2f}'.format(test_score[3]))
print('Balanced Accuracy (Test): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_test.argmax(axis=1), preds_test.argmax(axis=1)), balanced_accuracy_baseline))

In [None]:
# Display confusion matrix for test predictions
fig, ax = plt.subplots(figsize=(8, 6))
cm = confusion_matrix(y_test.argmax(axis=1), preds_test.argmax(axis=1))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - Test Set Predictions')
plt.tight_layout()
plt.show()

# Print per-class recall (share of each true class that was predicted correctly)
print("\nPer-Class Recall:")
for i, class_name in enumerate(label_encoder.classes_):
    class_mask = y_test.argmax(axis=1) == i
    class_recall = (preds_test.argmax(axis=1)[class_mask] == i).mean()
    print(f"  {class_name.capitalize()}: {class_recall:.2%} recall ({class_mask.sum()} samples)")

---

## 8. Results Summary

The following dynamically-generated table compares all models trained in this notebook.

In [None]:
# =============================================================================
# RESULTS SUMMARY - Dynamically Generated
# =============================================================================
from sklearn.metrics import f1_score

# Calculate metrics for each model
preds_slp = slp_model.predict(X_val, verbose=0).argmax(axis=1)
preds_dnn_val = dnn_model.predict(X_val, verbose=0).argmax(axis=1)
preds_opt_val = opt_model.predict(X_val, verbose=0).argmax(axis=1)
preds_test_final = preds_test.argmax(axis=1)
y_val_labels = y_val.argmax(axis=1)
y_test_labels = y_test.argmax(axis=1)

# Calculate F1 scores (macro-averaged for multi-class)
f1_slp = f1_score(y_val_labels, preds_slp, average='macro')
f1_dnn = f1_score(y_val_labels, preds_dnn_val, average='macro')
f1_opt = f1_score(y_val_labels, preds_opt_val, average='macro')
f1_test = f1_score(y_test_labels, preds_test_final, average='macro')

# Create results DataFrame
results = pd.DataFrame({
    'Model': ['Naive Baseline', 'SLP (No Hidden)', 'DNN (No Dropout)', 'Optimized (Tuned)', 'Optimized (Test)'],
    'Accuracy': [baseline, val_score_slp[0], val_score_dnn[0], val_score_opt[0], test_score[0]],
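    # Naive baseline F1: always predict class 0 ("negative") on the validation labels
    'F1-Score': [f1_score(y_val_labels, np.zeros_like(y_val_labels), average='macro'),
                 f1_slp, f1_dnn, f1_opt, f1_test],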
    'Dataset': ['N/A', 'Validation', 'Validation', 'Validation', 'Test']
})

print("=" * 65)
print("MODEL COMPARISON - RESULTS SUMMARY")
print("=" * 65)
print(f"Primary Metric: {PRIMARY_METRIC.upper()} (imbalance ratio: {imbalance_ratio:.2f}:1)")
print("=" * 65)
print(results.to_string(index=False, float_format='{:.4f}'.format))
print("=" * 65)
print(f"\nKey Observations:")
beats_baseline = (results['Accuracy'].iloc[1:] > baseline).all()
print(f"  - All models outperform naive baseline ({baseline:.2%} accuracy): {beats_baseline}")
print(f"  - Best validation F1-Score: {max(f1_slp, f1_dnn, f1_opt):.4f}")
print(f"  - Final test F1-Score: {f1_test:.4f}")

---

## 9. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | 14,640 samples | Hold-Out | Kohavi (1995); Chollet (2021) |
| **Accuracy vs F1-Score** | > 3:1 imbalance | 3.88:1 ratio | F1-Score | He and Garcia (2009) |

### Lessons Learned

1. **Data-Driven Metric Selection:** With imbalance ratio > 3:1, we use F1-Score instead of Accuracy to ensure fair evaluation across all classes.

2. **Data-Driven Evaluation Protocol:** With > 10,000 samples and deep learning, hold-out validation provides reliable estimates while being computationally efficient.

3. **Class Imbalance Handling:** Using class weights during training improves performance on minority classes.

4. **Simple Models Can Work Well:** With informative features such as TF-IDF, a model with no hidden layers can be competitive; compare the SLP's F1-Score with the deeper models in the results table.

5. **Regularization Prevents Overfitting:** The unregularized DNN showed overfitting; combining **Dropout + L2 regularization** controls this without needing early stopping.

6. **Complementary Regularization:** Dropout (ensemble-like effect) and L2 (weight penalty) work together to prevent overfitting while allowing full training.

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning: data mining, inference, and prediction*. 2nd edn. New York: Springer.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

---

## Appendix: Modular Helper Functions

For cleaner code organization, you can wrap the model building and training patterns into reusable functions. Below are the modular versions of the code used in this notebook.

In [None]:
# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================
# The following functions encapsulate the model building and training patterns
# used throughout this notebook. You can use these for cleaner code organization.

def deep_neural_network(hidden_layers=0, hidden_neurons=64, activation='relu',
                        dropout=0.0, input_dimension=2, output_dimension=1,
                        optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'],
                        name=None):
    """
    Create a deep neural network with configurable architecture.
    
    Parameters:
    -----------
    hidden_layers : int
        Number of hidden layers (0 for single-layer perceptron)
    hidden_neurons : int
        Number of neurons per hidden layer
    activation : str
        Activation function for hidden layers ('relu', 'tanh', etc.)
    dropout : float
        Dropout rate (0.0 to 1.0) applied after each hidden layer
    input_dimension : int
        Number of input features
    output_dimension : int
        Number of output classes (1 for binary, >1 for multi-class)
    optimizer : str
        Optimizer name ('rmsprop', 'adam', 'sgd', etc.)
    loss : str
        Loss function name
    metrics : list
        List of metrics to track during training
    name : str, optional
        Model name for identification
        
    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    
    Example:
    --------
    # model = deep_neural_network(
    #     hidden_layers=2, hidden_neurons=64, activation='relu', dropout=0.25,
    #     input_dimension=5000, output_dimension=3,
    #     optimizer='rmsprop', loss='categorical_crossentropy',
    #     metrics=['categorical_accuracy'], name='My_Model'
    # )
    """
    model = Sequential(name=name)
    # Declare the input shape once, so no Dense layer needs an input_shape argument
    # (passing input_shape=None to later layers can raise a TypeError in Keras 2)
    model.add(keras.Input(shape=(input_dimension,)))

    for _ in range(hidden_layers):
        model.add(Dense(hidden_neurons, activation=activation))
        if dropout > 0:
            model.add(Dropout(dropout))

    # Output layer: sigmoid for a single binary output, softmax for multi-class
    output_activation = 'sigmoid' if output_dimension == 1 else 'softmax'
    model.add(Dense(output_dimension, activation=output_activation))

    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    return model


def train_deep_neural_network(model, X_train, y_train, X_val, y_val,
                              class_weights=None, batch_size=32, epochs=100):
    """
    Train a deep neural network and return results.
    
    Parameters:
    -----------
    model : keras.Model
        Compiled Keras model to train
    X_train, y_train : array-like
        Training data and labels
    X_val, y_val : array-like
        Validation data and labels
    class_weights : dict, optional
        Class weights for imbalanced data
    batch_size : int
        Training batch size
    epochs : int
        Number of training epochs
        
    Returns:
    --------
    tuple : (history, val_score)
        - history: Training history object
        - val_score: Validation metrics (excluding loss)
    
    Example:
    --------
    # history, val_score = train_deep_neural_network(
    #     model, X_train, y_train, X_val, y_val,
    #     class_weights=CLASS_WEIGHTS, batch_size=512, epochs=100
    # )
    # print(f'Validation Accuracy: {val_score[0]:.2f}')
    """
    history = model.fit(
        X_train, y_train,
        class_weight=class_weights,
        batch_size=batch_size, 
        epochs=epochs,
        validation_data=(X_val, y_val),
        verbose=0
    )
    val_score = model.evaluate(X_val, y_val, verbose=0)[1:]
    
    return history, val_score