<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Sentiment%20Analysis%20of%20US%20Airline%20Tweets%20-%20A%20Write-Up%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of US Airline Tweets Using Deep Neural Network

---

## Introduction

### Problem Definition

Sentiment analysis, a key task in natural language processing (NLP), involves identifying and categorizing opinions expressed in a piece of text to determine the writer's sentiment—whether positive, neutral, or negative. This study focuses on analyzing the sentiment of tweets related to US airlines, aiming to classify them into these three categories. Sentiment analysis on social media data, such as tweets, presents unique challenges due to the informal language, use of slang, emojis, sarcasm, and the brevity of the messages. For instance, a tweet might convey a sentiment that is not straightforward, making it difficult for traditional methods to accurately capture the intended emotional tone.

### Motivation

In today's digital age, social media platforms like Twitter have become crucial channels for customers to voice their opinions and experiences with companies, including airlines. Understanding customer sentiment from these platforms allows airlines to gain real-time insights into customer satisfaction and potential issues. This ability to swiftly analyze and respond to customer feedback is vital, as studies have shown that 67% of customers use social media for customer service. Effective sentiment analysis can enable airlines to proactively address problems, enhance customer satisfaction, and build stronger brand loyalty. The goal of this study is to develop a reliable and scalable method for automating sentiment analysis, tailored to the unique characteristics of social media data, which can help airlines improve their services and gain a competitive edge.

### Dataset

The dataset used in this study is the "Twitter US Airline Sentiment" dataset from Kaggle. This dataset consists of 14,640 tweets directed at various US airlines, each labeled as positive, neutral, or negative. The distribution of sentiment labels is approximately 16% positive, 63% negative, and 21% neutral, highlighting a significant class imbalance. The tweets have already undergone basic preprocessing steps, such as the removal of duplicates and irrelevant information. The class imbalance in this dataset poses a challenge for model training, requiring careful selection of evaluation metrics and techniques to ensure that the models perform robustly across all classes.

### Constraints and Methodological Focus

This assignment is guided by the principles outlined in *Deep Learning with Python* by François Chollet, specifically adhering to "The Universal Workflow of Machine Learning." Within this framework, we are constrained by the requirement to use only Dense layers, Dropout layers, and L1/L2 regularization techniques in our neural network models. We are restricted from using more advanced techniques such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or methods like Early Stopping. Despite these constraints, our exploration will focus on various dense neural network architectures, including wider, deeper, and regularized models, to determine their effectiveness in sentiment classification. This exploration will provide insights into how different architectural choices impact model performance, particularly in handling the class imbalance present in the dataset.

### Objectives

The primary objectives of this study are:

1. **Data Preprocessing:** To preprocess the text data, converting it into a numerical format that can be effectively utilized by machine learning models. This involves techniques like tokenization, TF-IDF vectorization, and handling class imbalance.

2. **Model Development and Exploration:** To develop and train dense neural network models for sentiment classification, exploring various network architectures. This includes experimenting with wider and deeper models, as well as applying Dropout and L1/L2 regularization techniques to understand their impact on model performance.

3. **Performance Evaluation:** To rigorously evaluate the performance of these models using a range of metrics, including accuracy, F1 Score, and AUC, with a particular focus on understanding how these models handle the class imbalance in the dataset.

4. **Model Optimization:** To investigate and implement techniques such as Dropout and L1/L2 regularization to prevent overfitting and improve the generalization of the models, given the constraints of not using CNNs, RNNs, or Early Stopping.

5. **Architectural Insights:** To analyze and compare the performance of different neural network architectures, providing insights into the strengths and limitations of each approach within the given constraints.

These objectives aim to provide a comprehensive understanding of the effectiveness of dense neural networks in handling sentiment analysis tasks, while also exploring how different architectural choices impact performance.

---

## Methodology

## 1. Data Loading and Preprocessing

### Data Loading

When running in Google Colab, we mount Google Drive and create a directory there for the dataset; outside Colab, the dataset is saved to the current working directory. In both cases we use the `gdown` library to download the dataset from the given URL, so the same notebook runs in either environment.

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

!pip -q install gdown==4.6.0
import gdown

# Detect if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
    from google.colab import drive
    drive.mount('/content/drive')
    base_path = "/content/drive/MyDrive/Neural Networks/Twitter US Airline Sentiment/"
    # Create necessary directories to store the dataset
    os.makedirs(base_path, exist_ok=True)
except ImportError:
    IN_COLAB = False
    base_path = "./"

# Download the dataset from the given URL
URL = "https://drive.google.com/file/d/15XHy_PdD6Q2aa6n-pnWmSFGCv1oK9vWA/view?usp=sharing"
DOWNLOAD_FILE_PATH = "https://drive.google.com/uc?export=download&id=" + URL.split("/")[-2]
gdown.download(DOWNLOAD_FILE_PATH, base_path + "Tweets.csv", quiet=True)

### Data Preprocessing

The dataset is loaded into a pandas DataFrame. We use scikit-learn's `TfidfVectorizer` to convert the tweet text into TF-IDF feature vectors, fitting the vocabulary on the training split only so that no test-set statistics leak into the features. The sentiment labels are label-encoded as integers and then one-hot encoded. We cap the vocabulary at 5,000 features and include both unigrams and bigrams (`ngram_range=(1, 2)`), so the representation captures individual terms as well as short phrases.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.utils import to_categorical

# Load the data from the CSV file into a pandas DataFrame
file_path = os.path.join(base_path, "Tweets.csv")
tweets = pd.read_csv(file_path)[['text', 'airline_sentiment']]

# Label Encoding to convert sentiment labels into numerical form
label_encoder = LabelEncoder()
y = to_categorical(label_encoder.fit_transform(tweets['airline_sentiment']))

# Split data into training and test sets (validation data is carved out of the training set later)
X_train, X_test, y_train, y_test = train_test_split(tweets['text'], y, test_size=0.2, stratify=y, random_state=42)

# Define the TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()
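
To make the effect of `ngram_range=(1, 2)` concrete, the sketch below vectorizes three made-up tweets (hypothetical text, not drawn from the dataset) and lists the unigrams and bigrams that end up in the vocabulary. It is only an illustration of the setting, with no `max_features` cap applied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical tweets, not taken from the dataset)
toy_tweets = [
    "flight delayed again",
    "great flight crew",
    "delayed flight no refund",
]

# ngram_range=(1, 2) keeps single words and adjacent word pairs
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_tweets)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
toy_unigrams = [term for term in toy_vocab if " " not in term]
toy_bigrams = [term for term in toy_vocab if " " in term]

print("Unigrams:", toy_unigrams)
print("Bigrams: ", toy_bigrams)
print("Matrix shape:", toy_matrix.shape)
```

Note that "flight delayed" and "delayed flight" become separate features, which is the kind of word-order information that unigrams alone cannot represent.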

---

## 2. Choosing a Measure of Success

### Exploring Class Imbalance

Before proceeding with model development, it is crucial to understand the class distribution to address any potential imbalance. The class counts printed below show a significant imbalance, with a majority of tweets being negative (63%), followed by neutral (21%), and positive (16%). This imbalance will be taken into account when selecting evaluation metrics and techniques to ensure robust performance.

In [None]:
class_counts = tweets['airline_sentiment'].value_counts()
print(class_counts)

### Metrics

We track multiple metrics to comprehensively evaluate model performance:

| Metric | Purpose | When to Use |
|--------|---------|-------------|
| **Accuracy** | Overall correctness | Balanced datasets |
| **F1-Score (macro)** | Balance of precision & recall across all classes | **Primary metric** for imbalanced data |
| **AUC** | Discrimination ability across thresholds | Ranking quality; used for hyperparameter tuning |

**Why F1-Score as Primary, AUC for Tuning?**

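- **Macro F1-Score** weights every class equally, so weak performance on the minority classes lowers it directly. This is critical for our dataset, where negative tweets outnumber positive ones by roughly 3.88:1
- **AUC** is threshold-independent and tends to change more gradually between configurations than F1, which makes it a steadier objective for hyperparameter search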
- We report both metrics for final evaluation
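
As a rough illustration of what AUC measures, the sketch below computes a binary AUC as the fraction of positive/negative pairs that the scores rank correctly, counting ties as half. The helper `binary_auc` and the toy scores are hypothetical; the macro AUC reported in this notebook averages this kind of one-vs-rest value across the three classes.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

toy_labels_auc = [1, 1, 0, 0, 0]
auc_ranked = binary_auc(toy_labels_auc, [0.9, 0.4, 0.6, 0.2, 0.1])
auc_constant = binary_auc(toy_labels_auc, [1.0, 1.0, 1.0, 1.0, 1.0])

print(auc_ranked)    # 5 of the 6 pairs are ranked correctly
print(auc_constant)  # constant scores tie on every pair
```

A model that gives every sample the same score ties on every pair and lands at 0.5, which is why a majority-class predictor cannot score above chance on this metric.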

### Naive Baseline

To provide a reference point for model performance, we establish a naive baseline using the most frequent class. Given the class imbalance, the naive baseline would predict every tweet as the majority class, which is "negative". The naive baseline results show an accuracy of 0.63, an F1 Score of 0.26, and an AUC of 0.50. These results highlight the need for a more sophisticated model to better capture the sentiment distribution.

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train_tfidf, y_train.argmax(axis=1))
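# Score on the test set so the numbers are comparable with the later models
y_dummy_pred = dummy_clf.predict(X_test_tfidf)
y_test_labels = y_test.argmax(axis=1)

naive_accuracy = accuracy_score(y_test_labels, y_dummy_pred)
naive_f1 = f1_score(y_test_labels, y_dummy_pred, average='macro')
# num_classes keeps all three one-hot columns even though only one class is ever predicted
naive_auc = roc_auc_score(y_test, to_categorical(y_dummy_pred, num_classes=y_test.shape[1]),
                          average='macro', multi_class='ovo')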

print(f'Naive Baseline - Accuracy: {naive_accuracy:.2f}, F1 Score: {naive_f1:.2f}, AUC: {naive_auc:.2f}')
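
The baseline figures can also be reproduced by hand. The sketch below assumes the rounded 63% / 21% / 16% class split quoted in the text (not the exact dataset counts) and works out accuracy and macro F1 for a classifier that always predicts the majority class.

```python
# Illustrative class counts per 100 tweets, from the approximate split in the text
toy_counts = {"negative": 63, "neutral": 21, "positive": 16}
toy_total = sum(toy_counts.values())
toy_majority = max(toy_counts, key=toy_counts.get)

def f1_always_majority(label):
    """F1 for `label` when every tweet is predicted as the majority class."""
    if label != toy_majority:
        return 0.0  # never predicted, so recall and F1 are both zero
    precision = toy_counts[label] / toy_total  # every prediction is this label
    recall = 1.0                               # every true member is recovered
    return 2 * precision * recall / (precision + recall)

toy_accuracy = toy_counts[toy_majority] / toy_total
toy_macro_f1 = sum(f1_always_majority(label) for label in toy_counts) / len(toy_counts)

print(f"accuracy = {toy_accuracy:.2f}, macro F1 = {toy_macro_f1:.2f}")
```

This prints `accuracy = 0.63, macro F1 = 0.26`: the majority class alone scores an F1 of about 0.77, and the two zeros for the other classes pull the macro average down.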

### Class Weights

Given the significant class imbalance observed, it is necessary to apply class weights during model training to ensure the model treats each class fairly. Class weights adjust the importance of each class in the loss function, giving more weight to minority classes. This helps the model focus on correctly predicting these underrepresented classes, improving overall performance.

The `compute_class_weight` function from sklearn is used to calculate the weights for each class. These weights are then converted into a dictionary format, which can be passed directly to the Keras model during training.

In [None]:
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight(class_weight='balanced',
                                      classes=np.unique(np.argmax(y_train, axis=1)),
                                      y=np.argmax(y_train, axis=1))

# Convert the class weights to a dictionary format required by Keras
class_weight = dict(enumerate(class_weights))

# Print class weights
print(f"Computed Class Weights: {class_weight}")
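
For reference, the `'balanced'` option follows the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula with NumPy to a small made-up label vector (a hypothetical 6:3:1 imbalance, not the real data) to show how the weights behave.

```python
import numpy as np

# Hypothetical integer labels with a 6:3:1 imbalance (not the real dataset)
toy_labels = np.array([0] * 6 + [1] * 3 + [2] * 1)

toy_class_counts = np.bincount(toy_labels)
n_samples, n_classes = toy_labels.size, toy_class_counts.size

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
toy_balanced = n_samples / (n_classes * toy_class_counts)
toy_class_weight = dict(enumerate(toy_balanced.tolist()))
print(toy_class_weight)

# Weight times count is identical for every class
print(toy_balanced * toy_class_counts)
```

Because weight times count is the same for every class, each class contributes equally to the weighted loss regardless of how many samples it has.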

---

## 3. Deciding on an Evaluation Protocol

### Hold-Out vs K-Fold Cross-Validation

The choice between hold-out and K-fold depends on **dataset size** and **computational cost**:

| Dataset Size | Recommended Method | Rationale |
|--------------|-------------------|-----------|
| < 1,000 | K-Fold (K=5 or 10) | High variance with small hold-out sets |
| 1,000 – 10,000 | K-Fold or Hold-Out | Either works; K-fold more robust |
| > 10,000 | Hold-Out | Sufficient data; K-fold computationally expensive |

### Data Split Strategy (This Notebook)

With **14,640 samples** (above the 10,000 threshold), we use **Hold-Out validation**:

```
Original Data (14,640 samples) → Hold-Out Selected
├── Test Set (20%) - Final evaluation only
└── Training Pool (80%)
    ├── Training Set (~72%) - Model training
    └── Validation Set (~8%) - Hyperparameter tuning
```

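**Important:** We pass the `stratify` parameter to `train_test_split` to maintain class proportions in the test split and in the validation split used for tuning. Keras's `validation_split` argument, used in the `fit` calls, instead takes the last 10% of the training array without stratifying; because `train_test_split` has already shuffled the rows, this is still a random sample.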
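
The 72% / 8% / 20% split in the diagram can be sketched as two successive stratified calls to `train_test_split`. The data below are synthetic stand-ins (1,000 random rows with a 63/21/16 class mix), and the variable names are illustrative rather than the ones used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 samples with a 63/21/16 class mix
rng = np.random.default_rng(42)
toy_y = np.repeat([0, 1, 2], [630, 210, 160])
toy_X = rng.normal(size=(toy_y.size, 4))

# Step 1: hold out 20% as the test set
X_pool, X_te, y_pool, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42)

# Step 2: carve 10% of the pool (8% overall) out as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_pool, y_pool, test_size=0.1, stratify=y_pool, random_state=42)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(labels), np.round(np.bincount(labels) / len(labels), 2))
```

Each split should show roughly the same class proportions as the full data, which is the property stratification is meant to guarantee.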

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

---

## 4. Developing a Model that Does Better than a Naive Baseline

### Baseline Model

We establish a baseline model consisting of a single dense layer with softmax activation and no hidden layers (a single-layer perceptron, SLP). The model is compiled with categorical cross-entropy loss and evaluated using categorical accuracy, macro F1 score, and AUC. It shows the performance achievable without hidden layers and gives a reference point for gauging the improvement offered by more complex models.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC, F1Score

# Define model parameters
OUTPUT_CLASSES = y_train.shape[1]
LOSS = 'categorical_crossentropy'
METRICS = ['categorical_accuracy', F1Score(name='f1_score', average='macro'), AUC(name='auc', multi_label=True)]

# Build baseline model (Single Layer Perceptron - no hidden layers)
baseline = Sequential([
    Dense(OUTPUT_CLASSES, activation='softmax', input_shape=(X_train_tfidf.shape[1],))
])

# Compile the baseline model
baseline.compile(optimizer=Adam(learning_rate=0.005),
                 loss=LOSS,
                 metrics=METRICS)

# Train baseline model
baseline_history = baseline.fit(X_train_tfidf, y_train, batch_size=512, epochs=100,
                                validation_split=0.1, verbose=0, class_weight=class_weight)

# Evaluate the baseline model
baseline_scores = baseline.evaluate(X_test_tfidf, y_test, verbose=0)
print(f'Baseline Model - Test Accuracy: {baseline_scores[1]:.2f}, F1 Score: {baseline_scores[2]:.2f}, AUC: {baseline_scores[3]:.2f}')

The baseline SLP model achieved strong performance (accuracy ~0.80, F1-Score ~0.72, AUC ~0.91), significantly outperforming the naive baseline. This demonstrates that TF-IDF features capture meaningful sentiment signals even without hidden layers.

### Plot Baseline Model Training History

The helper function below visualises training and validation loss over epochs, helping us identify convergence and potential overfitting.

In [None]:
import matplotlib.pyplot as plt

def plot_training_history(history, monitor='loss'):
    loss, val_loss = history.history[monitor], history.history['val_' + monitor]
    epochs = range(1, len(loss) + 1)
    plt.plot(epochs, loss, 'b.', label='Training ' + monitor)
    plt.plot(epochs, val_loss, 'r.', label='Validation ' + monitor)
    plt.xlim([0, len(loss)])
    plt.title('Training and Validation ' + monitor)
    plt.xlabel('Epochs')
    plt.ylabel(monitor)
    plt.legend()
    plt.grid()
    plt.show()

plot_training_history(baseline_history, monitor='loss')

The training history shows both training and validation loss decreasing and converging. This is expected for an SLP: with no hidden layers it has limited capacity, so it is less prone to overfitting than the networks with hidden layers that follow. The model learns a linear decision boundary that generalises well.

**Comparison with Naive Baseline:**

| Model | Accuracy | F1-Score | AUC |
|-------|----------|----------|-----|
| Naive (majority class) | 0.63 | 0.26 | 0.50 |
| SLP Baseline | ~0.80 | ~0.72 | ~0.91 |

The SLP dramatically outperforms the naive baseline, confirming that our TF-IDF features contain useful signal. Next, we add hidden layers to see if we can do better.

---

## 5. Scaling Up: Developing a Model that Overfits

### More Complex Model

We build a model with one hidden layer of 64 units and train it without regularization to see whether it can overfit the data. Monitoring the training and validation loss shows when overfitting begins, which tells us whether the model has enough capacity to learn the patterns in the data and informs the choice of regularization techniques in the next step.

In [None]:
BATCH_SIZE = 512

# Build overfitting model with one hidden layer
overfit = Sequential([
    Dense(64, activation="relu", input_shape=(X_train_tfidf.shape[1],)),
    Dense(OUTPUT_CLASSES, activation="softmax")
])

# Compile the overfitting model with updated learning rate
overfit.compile(optimizer=Adam(learning_rate=0.0001),
                loss=LOSS,
                metrics=METRICS)

# Train overfitting model
overfit_history = overfit.fit(X_train_tfidf, y_train, batch_size=BATCH_SIZE, epochs=200,
                              validation_split=0.1, verbose=0, class_weight=class_weight)

# Access validation accuracy and other metrics from history
overfit_val_accuracy = overfit_history.history['val_categorical_accuracy'][-1]
overfit_val_f1_score = overfit_history.history['val_f1_score'][-1]
overfit_val_auc = overfit_history.history['val_auc'][-1]

print(f'Validation Accuracy: {overfit_val_accuracy:.2f}, F1 Score: {overfit_val_f1_score:.2f}, AUC: {overfit_val_auc:.2f}')
plot_training_history(overfit_history, monitor='loss')

**Overfitting Confirmed:** The training history shows the classic overfitting pattern:
- Training loss continues to decrease (model memorises training data)
- Validation loss increases after ~110 epochs (model fails to generalise)

This is exactly what we wanted to see! It confirms that:
1. A single hidden layer with 64 neurons has **sufficient capacity** to fit the data
2. **Regularisation is needed** to prevent overfitting
3. We should NOT add more capacity (wider/deeper)—we should regularise instead

> *"If your model can overfit, you have enough capacity. The solution is regularisation, not more neurons."* — Universal ML Workflow principle
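
To read the turning point off the history rather than the plot, one option is to locate the epoch with the lowest validation loss. The helper below, `best_epoch`, is a hypothetical diagnostic shown on a made-up loss curve; it is used only to inspect a finished run, not to stop training early.

```python
import numpy as np

def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    return int(np.argmin(val_loss)) + 1

# Hypothetical validation-loss curve: falls, bottoms out, then rises again
toy_val_loss = [0.90, 0.70, 0.60, 0.55, 0.58, 0.63, 0.70]
print(best_epoch(toy_val_loss))  # prints 4
```

Applied to the run above, the call would be `best_epoch(overfit_history.history['val_loss'])`.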

---

## 6. Regularizing Your Model and Tuning Your Hyperparameters

### Regularization Techniques

We identified that the model with a single 64-unit hidden layer overfits the data. To address this, we incorporate two regularization techniques:

| Technique | How it works | Effect |
|-----------|--------------|--------|
| **Dropout** | Randomly drops neurons during training | Acts like ensemble averaging, reduces co-adaptation |
| **L2 (Weight Decay)** | Adds penalty for large weights to loss | Keeps weights small, smoother decision boundaries |
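
Both techniques are simple to write down. The NumPy sketch below is a simplified illustration rather than the Keras implementation: `l2_penalty` computes the term added to the loss (strength times the sum of squared weights), and `inverted_dropout` zeroes a random fraction of activations and rescales the rest by `1 / (1 - rate)`. The function names and toy arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, strength):
    """L2 term added to the loss: strength * sum of squared weights."""
    return strength * np.sum(np.square(weights))

def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

toy_kernel = np.array([[0.5, -1.0], [2.0, 0.0]])
toy_penalty = l2_penalty(toy_kernel, strength=0.01)
print(toy_penalty)  # 0.01 * (0.25 + 1 + 4 + 0), approximately 0.0525

toy_activations = np.ones((1000, 64))
dropped = inverted_dropout(toy_activations, rate=0.5, rng=rng)
print(dropped.mean())  # close to 1.0: rescaling preserves the expected activation
```

The rescaling is what allows dropout to be switched off at inference time without changing the expected scale of the activations.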

### Hyperparameter Tuning Using Hyperband

To find the optimal hyperparameters efficiently, we use **Hyperband** - an adaptive resource allocation algorithm that eliminates poor configurations early. Hyperband is particularly well-suited for deep learning because training epochs are a natural "resource" to allocate adaptively.

| Method | How it works | Pros | Cons |
|--------|--------------|------|------|
| **Grid Search** | Tries all combinations exhaustively | Thorough, reproducible | Exponentially expensive |
| **Random Search** | Samples random combinations | More efficient than grid | Still trains all configs to completion |
| **Hyperband** | Early stopping of poor performers | Very efficient for deep learning | May discard slow starters prematurely |
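
The core idea behind Hyperband, successive halving, can be sketched in a few lines. The function below is a simplified, hypothetical illustration of a single bracket: it is not Keras Tuner's internal schedule, and the real algorithm runs several brackets with different starting budgets.

```python
def successive_halving_schedule(n_configs, min_epochs, max_epochs, factor=3):
    """List (configs_kept, epochs_each) for every rung of one bracket."""
    schedule = []
    epochs = min_epochs
    while n_configs >= 1 and epochs <= max_epochs:
        schedule.append((n_configs, epochs))
        n_configs //= factor   # keep the best 1/factor of the configurations
        epochs *= factor       # give the survivors `factor` times more epochs
    return schedule

# Hypothetical bracket: 9 starting configurations, 2 epochs at the first rung
toy_schedule = successive_halving_schedule(9, min_epochs=2, max_epochs=20)
for kept, epochs in toy_schedule:
    print(f"{kept} configs x {epochs} epochs")
```

In this toy bracket each rung costs 18 epochs, 54 in total, compared with 162 epochs if all nine configurations were trained for 18 epochs each.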

We tune the following hyperparameters:
- **L2 regularization strength** (1e-5 to 1e-2)
- **Dropout rate** (0.0 to 0.5)
- **Learning rate** (1e-4 to 1e-2)

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Install and import Keras Tuner
%pip install -q -U keras-tuner
import keras_tuner as kt

# Store dimensions for use in model builder
INPUT_DIMENSION = X_train_tfidf.shape[1]

# Hyperband Model Builder
def build_model_hyperband(hp):
    """
    Build sentiment analysis model with FIXED architecture (1 hidden layer, 64 neurons).
    Only tunes regularisation and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # L2 regularisation strength
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')

    # Fixed architecture: 1 hidden layer with 64 neurons
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(l2_reg)))
    
    # Tunable dropout
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))

    # Output layer for multi-class classification
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))

    # Tunable learning rate
    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
# Objective: maximize validation AUC (good for imbalanced data)
tuner = kt.Hyperband(
    build_model_hyperband,
    objective='val_auc',
    max_epochs=20,
    factor=3,
    directory='airline_sentiment_hyperband',
    project_name='airline_sentiment_tuning',
    overwrite=True
)

print("Tuning objective: val_auc")
print("(Note: F1-Score is also tracked for final evaluation)")

# Create validation split from training data
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train_tfidf, y_train, test_size=0.1, stratify=y_train, random_state=42
)

# Run Hyperband search
tuner.search(
    X_train_split, y_train_split,
    validation_data=(X_val_split, y_val_split),
    epochs=20,
    batch_size=BATCH_SIZE,
    class_weight=class_weight
)

In [None]:
# Get best hyperparameters
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout')}")
print(f"  Learning Rate: {best_hp.get('lr'):.6f}")

# Build the best model
opt_model = tuner.hypermodel.build(best_hp)
opt_model.summary()

### Retrain with Optimised Hyperparameters

Now that we have the best hyperparameters from Hyperband search, we:

1. **Build a fresh model** with the optimised L2 strength, dropout rate, and learning rate
2. **Retrain from scratch** with extended epochs to ensure full convergence

**Why Train the Regularised Model Longer than the Search?**

Hyperband evaluates each configuration for at most 20 epochs, which is enough to rank configurations but not necessarily enough for a regularised model to converge. Regularisation tends to slow down learning:

| Technique | Effect on Learning |
|-----------|-------------------|
| **Dropout** | Randomly masks neurons each batch → each gradient update uses only partial network information |
| **L2 penalty** | Penalises large weights → pulls weights toward zero, counteracting part of each gradient step |

Both techniques deliberately impede the optimisation process, so the model takes noisier, more constrained steps toward the solution. This is the *price* we pay for overfitting protection.

| Model | Epochs | Why This Number? |
|-------|--------|------------------|
| **SLP (baseline)** | 100 | Simple model, converges quickly |
| **DNN (no regularisation)** | 200 | Enough to clearly demonstrate overfitting |
| **DNN (with Dropout + L2)** | 150 | Well beyond the 20-epoch search budget; gives the regularised model time to converge |

> *"Regularisation adds noise and constraints that slow down learning. In exchange for protection against overfitting, the model needs more iterations to converge."*

In [None]:
EPOCHS_REGULARIZED = 150

# Train with extended epochs
regularized_history = opt_model.fit(
    X_train_tfidf, y_train,
    validation_split=0.1,
    epochs=EPOCHS_REGULARIZED,
    batch_size=BATCH_SIZE,
    class_weight=class_weight,
    verbose=0
)

# Evaluate on test set
regularized_scores = opt_model.evaluate(X_test_tfidf, y_test, verbose=0)
print(f'Regularized Model - Test Accuracy: {regularized_scores[1]:.2f}, F1 Score: {regularized_scores[2]:.2f}, AUC: {regularized_scores[3]:.2f}')

# Plot regularized model training history
plot_training_history(regularized_history, monitor='loss')

**Regularisation Works:** The training history shows that validation loss now stabilises instead of increasing. The gap between training and validation loss is smaller, indicating better generalisation.

The regularised model achieves:
- Similar or slightly better accuracy than the unregularised DNN
- Improved F1-Score (better minority class performance)
- Stable validation metrics (no overfitting)

This confirms our approach: **regularise the existing architecture rather than adding more capacity**.

---

## 7. Exploring Different Neural Network Architectures

To further explore and optimize the model, we experimented with different architectures:

- **Wider models:** These have more units in the hidden layers, allowing them to capture more complex patterns in the data.
- **Deeper models:** These include additional hidden layers, enabling the model to learn hierarchical representations of the data.
- **Narrower models:** These have fewer units in the hidden layers, promoting simplicity and reducing the risk of overfitting.
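
To put the capacity of each variant in perspective, the sketch below counts trainable parameters for stacks of Dense layers (weights plus biases). It assumes 5,000 TF-IDF inputs and 3 output classes, as configured earlier; the helper `dense_param_count` is illustrative and not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters in a stack of Dense layers (weights + biases)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Assumes 5,000 TF-IDF inputs and 3 output classes, as configured earlier
architectures = {
    "SLP baseline":  [5000, 3],
    "Narrower (32)": [5000, 32, 3],
    "Base (64)":     [5000, 64, 3],
    "Deeper (64x2)": [5000, 64, 64, 3],
    "Wider (128)":   [5000, 128, 3],
}
for name, sizes in architectures.items():
    print(f"{name:14s} {dense_param_count(sizes):>8,d} parameters")
```

Almost all parameters sit in the first layer, so halving or doubling its width roughly halves or doubles the model, whereas the second hidden layer in the deeper variant adds only about 4,000 parameters.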

In [None]:
# =============================================================================
# ARCHITECTURE EXPLORATION WITH HYPERBAND
# =============================================================================

def build_wider_model(hp):
    """Wider model: 128 neurons in hidden layer."""
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')
    model.add(layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(l2_reg)))
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))
    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss=LOSS, metrics=METRICS)
    return model

def build_deeper_model(hp):
    """Deeper model: 2 hidden layers with 64 neurons each."""
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(l2_reg)))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(l2_reg)))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))
    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss=LOSS, metrics=METRICS)
    return model

def build_narrower_model(hp):
    """Narrower model: 32 neurons in hidden layer."""
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')
    model.add(layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(l2_reg)))
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(OUTPUT_CLASSES, activation='softmax'))
    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss=LOSS, metrics=METRICS)
    return model

print("Architecture variants defined:")
print("  - Wider:    128 neurons (1 hidden layer)")
print("  - Deeper:   64 neurons  (2 hidden layers)")
print("  - Narrower: 32 neurons  (1 hidden layer)")

### Training and Evaluating Different Architectures

We trained the various model architectures defined above, including wider, deeper, and narrower models, to explore their performance.

In [None]:
# =============================================================================
# WIDER MODEL - Hyperband Tuning
# =============================================================================
tuner_wider = kt.Hyperband(
    build_wider_model,
    objective='val_auc',
    max_epochs=20,
    factor=3,
    directory='airline_wider_hyperband',
    project_name='wider_tuning',
    overwrite=True
)

tuner_wider.search(X_train_split, y_train_split, validation_data=(X_val_split, y_val_split),
                   epochs=20, batch_size=BATCH_SIZE, class_weight=class_weight)

wider_best_hp = tuner_wider.get_best_hyperparameters(num_trials=1)[0]
print(f"Wider Model - Best: L2={wider_best_hp.get('l2_reg'):.6f}, Dropout={wider_best_hp.get('dropout')}, LR={wider_best_hp.get('lr'):.6f}")

In [None]:
# Train wider model with best hyperparameters
wider_model = tuner_wider.hypermodel.build(wider_best_hp)
wider_history = wider_model.fit(X_train_tfidf, y_train, validation_split=0.1, 
                                 epochs=EPOCHS_REGULARIZED, batch_size=BATCH_SIZE,
                                 class_weight=class_weight, verbose=0)

wider_scores = wider_model.evaluate(X_test_tfidf, y_test, verbose=0)
print(f'Wider Model - Test Accuracy: {wider_scores[1]:.2f}, F1 Score: {wider_scores[2]:.2f}, AUC: {wider_scores[3]:.2f}')
plot_training_history(wider_history, monitor='loss')

In [None]:
# =============================================================================
# DEEPER MODEL - Hyperband Tuning
# =============================================================================
tuner_deeper = kt.Hyperband(
    build_deeper_model,
    objective='val_auc',
    max_epochs=20,
    factor=3,
    directory='airline_deeper_hyperband',
    project_name='deeper_tuning',
    overwrite=True
)

tuner_deeper.search(X_train_split, y_train_split, validation_data=(X_val_split, y_val_split),
                    epochs=20, batch_size=BATCH_SIZE, class_weight=class_weight)

deeper_best_hp = tuner_deeper.get_best_hyperparameters(num_trials=1)[0]
print(f"Deeper Model - Best: L2={deeper_best_hp.get('l2_reg'):.6f}, Dropout={deeper_best_hp.get('dropout')}, LR={deeper_best_hp.get('lr'):.6f}")

In [None]:
# Train deeper model with best hyperparameters
deeper_model = tuner_deeper.hypermodel.build(deeper_best_hp)
deeper_history = deeper_model.fit(X_train_tfidf, y_train, validation_split=0.1, 
                                   epochs=EPOCHS_REGULARIZED, batch_size=BATCH_SIZE,
                                   class_weight=class_weight, verbose=0)

deeper_scores = deeper_model.evaluate(X_test_tfidf, y_test, verbose=0)
print(f'Deeper Model - Test Accuracy: {deeper_scores[1]:.2f}, F1 Score: {deeper_scores[2]:.2f}, AUC: {deeper_scores[3]:.2f}')
plot_training_history(deeper_history, monitor='loss')

In [None]:
# =============================================================================
# NARROWER MODEL - Hyperband Tuning
# =============================================================================
tuner_narrower = kt.Hyperband(
    build_narrower_model,
    objective='val_auc',
    max_epochs=20,
    factor=3,
    directory='airline_narrower_hyperband',
    project_name='narrower_tuning',
    overwrite=True
)

tuner_narrower.search(X_train_split, y_train_split, validation_data=(X_val_split, y_val_split),
                      epochs=20, batch_size=BATCH_SIZE, class_weight=class_weight)

narrower_best_hp = tuner_narrower.get_best_hyperparameters(num_trials=1)[0]
print(f"Narrower Model - Best: L2={narrower_best_hp.get('l2_reg'):.6f}, Dropout={narrower_best_hp.get('dropout')}, LR={narrower_best_hp.get('lr'):.6f}")

In [None]:
# Train narrower model with best hyperparameters
narrower_model = tuner_narrower.hypermodel.build(narrower_best_hp)
narrower_history = narrower_model.fit(X_train_tfidf, y_train, validation_split=0.1, 
                                       epochs=EPOCHS_REGULARIZED, batch_size=BATCH_SIZE,
                                       class_weight=class_weight, verbose=0)

narrower_scores = narrower_model.evaluate(X_test_tfidf, y_test, verbose=0)
print(f'Narrower Model - Test Accuracy: {narrower_scores[1]:.2f}, F1 Score: {narrower_scores[2]:.2f}, AUC: {narrower_scores[3]:.2f}')
plot_training_history(narrower_history, monitor='loss')

### Models Performance Comparison Table

The summary table below compares all model architectures on the held-out test set. For each model, we report:
- **Performance metrics:** Accuracy, F1-Score, and AUC
- **Tuned hyperparameters:** Dropout rate, L2 regularization strength, and learning rate (found via Hyperband)
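
Accuracy alone can be misleading on an imbalanced dataset, which is why F1 is reported next to it. The plain-Python sketch below uses hypothetical labels and assumes macro averaging; the F1 metric compiled into the notebook's models may average differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical test set: a classifier that always predicts the majority class
y_true = ["negative"] * 8 + ["neutral"] + ["positive"]
y_pred = ["negative"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, macro F1={f1:.2f}")  # accuracy=0.80, macro F1=0.30
```

The always-negative classifier scores 0.80 on accuracy but only about 0.30 on macro F1, because the two minority classes contribute an F1 of zero each.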

In [None]:
import numpy as np  # needed for the np.nan placeholders in the baseline row
import pandas as pd

# Extracting the performance metrics and hyperparameters from Hyperband tuning
models_performance = {
    "Model": ["Baseline", "Regularized (64)", "Wider (128)", "Deeper (64×2)", "Narrower (32)"],
    "Accuracy": [baseline_scores[1], regularized_scores[1], wider_scores[1], deeper_scores[1], narrower_scores[1]],
    "F1 Score": [baseline_scores[2], regularized_scores[2], wider_scores[2], deeper_scores[2], narrower_scores[2]],
    "AUC": [baseline_scores[3], regularized_scores[3], wider_scores[3], deeper_scores[3], narrower_scores[3]],
    "Dropout": [np.nan, best_hp.get('dropout'), wider_best_hp.get('dropout'), deeper_best_hp.get('dropout'), narrower_best_hp.get('dropout')],
    "L2 Reg": [np.nan, best_hp.get('l2_reg'), wider_best_hp.get('l2_reg'), deeper_best_hp.get('l2_reg'), narrower_best_hp.get('l2_reg')],
    "Learning Rate": [np.nan, best_hp.get('lr'), wider_best_hp.get('lr'), deeper_best_hp.get('lr'), narrower_best_hp.get('lr')]
}

# Creating a DataFrame for the performance comparison table
models_performance_df = pd.DataFrame(models_performance)
print(models_performance_df.to_string(index=False))

---

## Analysis of Results

The performance of various neural network models is summarised in the comparison table above. Key observations:

### Baseline Model (SLP)

- The baseline Single Layer Perceptron (no hidden layers) achieves roughly 80% test accuracy, demonstrating that even a simple linear classifier can capture much of the signal in TF-IDF features.
- Training converges smoothly without overfitting, as expected for a model with limited capacity.

### Overfitting Model (Section 5)

- Adding a hidden layer (64 neurons) increases model capacity.
- The training history shows clear overfitting: validation loss increases after ~110 epochs while training loss continues to decrease.
- This confirms the model has **sufficient capacity** to memorise the training data, justifying the need for regularisation.
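
The onset of overfitting can be read off a training history programmatically. The sketch below uses hypothetical loss curves in the same dictionary layout that a Keras `History` object exposes through its `.history` attribute; the helper name is illustrative.

```python
def overfitting_onset(val_losses):
    """Return the 1-based epoch with the lowest validation loss.

    Epochs after this point improve training loss only, which is the
    usual signature of overfitting.
    """
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Hypothetical curves: training loss keeps falling, validation loss turns upward
history = {
    "loss":     [0.90, 0.70, 0.55, 0.45, 0.38, 0.33],
    "val_loss": [0.92, 0.75, 0.66, 0.64, 0.67, 0.72],
}

onset = overfitting_onset(history["val_loss"])
final_gap = history["val_loss"][-1] - history["loss"][-1]
print(f"validation loss bottoms out at epoch {onset}; final gap = {final_gap:.2f}")
```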

### Regularised Model (64 neurons + Dropout + L2)

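- Hyperband selected the regularisation hyperparameters (dropout rate, L2 strength, learning rate) automatically. These are the best of the sampled configurations, not a guaranteed optimum.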
- The regularised model shows improved generalisation: validation loss stabilises instead of increasing.
- F1-Score improves over the baseline, indicating better performance on minority classes.
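
A minimal sketch of what the two regularisers do, written in plain Python rather than Keras. The function names and values are illustrative and not part of the notebook. The L2 term uses the usual `l2_reg * sum(w**2)` form, and dropout is the inverted variant that rescales the surviving activations during training.

```python
import random


def l2_penalty(weights, l2_reg):
    """Extra loss term that grows with the squared size of the weights."""
    return l2_reg * sum(w * w for w in weights)


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


# Hypothetical weights: 0.01 * (0.25 + 1.0 + 4.0) = 0.0525
penalty = l2_penalty([0.5, -1.0, 2.0], l2_reg=0.01)

# Dropping half of 1000 unit activations keeps their mean close to 1.0
rng = random.Random(0)
dropped = inverted_dropout([1.0] * 1000, rate=0.5, rng=rng)
mean_activation = sum(dropped) / len(dropped)

print(f"L2 penalty: {penalty:.4f}")
print(f"mean activation after dropout: {mean_activation:.2f}")
```

The rescaling is what lets the same network run without dropout at inference time: the expected activation is unchanged.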

### Architecture Variants

| Architecture | Observation |
|--------------|-------------|
| **Wider (128)** | Marginal improvement over 64 neurons; extra capacity not needed |
| **Deeper (64×2)** | Similar performance; hierarchical features don't help for TF-IDF |
| **Narrower (32)** | Slight decrease; 32 neurons may underfit slightly |
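
One way to put the capacity differences in perspective is to count trainable parameters. The sketch below assumes a hypothetical 5,000-dimensional TF-IDF input (the notebook's actual vocabulary size may differ) and three output classes.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


N_FEATURES = 5000  # hypothetical TF-IDF vocabulary size
N_CLASSES = 3      # positive, neutral, negative

architectures = {
    "Baseline":         [N_FEATURES, N_CLASSES],
    "Regularized (64)": [N_FEATURES, 64, N_CLASSES],
    "Wider (128)":      [N_FEATURES, 128, N_CLASSES],
    "Deeper (64×2)":    [N_FEATURES, 64, 64, N_CLASSES],
    "Narrower (32)":    [N_FEATURES, 32, N_CLASSES],
}

param_counts = {name: dense_param_count(sizes) for name, sizes in architectures.items()}
for name, count in param_counts.items():
    print(f"{name:18s} {count:>9,d}")
```

Under this assumption almost all parameters sit in the first layer, so the second hidden layer of the deeper variant adds only about 1% more parameters than the 64-neuron model, while the wider variant doubles them.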

### Key Insight

The baseline already achieves statistical power on TF-IDF features, and a single 64-neuron hidden layer is enough to overfit the training data. Adding further complexity (wider or deeper networks) therefore yields diminishing returns. **Regularisation** (Dropout + L2) is more valuable than architectural changes for this task.

> *Once a model has enough capacity to overfit, the remedy is regularisation, not more capacity* (paraphrasing the workflow in Chollet, 2021).

---

## Conclusions

This study explored dense neural network architectures for sentiment analysis on US airline tweets, following the Universal ML Workflow from Chollet (2021).

### Key Findings

1. **Simple models work well:** A Single Layer Perceptron achieved ~80% accuracy, demonstrating that TF-IDF features already capture strong sentiment signals.

2. **Overfitting is easy to achieve:** A single hidden layer (64 neurons) was sufficient to overfit the training data, confirming adequate model capacity.

3. **Regularisation > Architecture:** Dropout and L2 regularisation improved generalisation more than architectural changes (wider/deeper/narrower). This aligns with the principle: *"Regularise, don't expand."*

4. **Hyperband is efficient:** Automated hyperparameter tuning found good regularisation settings without exhaustive grid search.

5. **Class weights matter:** For this imbalanced dataset (3.88:1 ratio), class weights were essential for learning minority classes effectively.
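
As an illustration of the class-weight finding, the sketch below computes "balanced" weights, the same heuristic scikit-learn uses for `class_weight='balanced'`: `n_samples / (n_classes * class_count)`. The label counts are hypothetical, chosen so that the largest class is 3.88 times the smallest, on the assumption that this is what the 3.88:1 ratio refers to. The notebook's own `class_weight` dictionary may have been computed differently.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution with a 3.88:1 ratio between largest and smallest class
labels = [0] * 388 + [1] * 130 + [2] * 100

weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```

With these weights an error on the rarest class costs 3.88 times as much as an error on the most common one, which offsets the imbalance in the loss.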

### Limitations

- Constrained to Dense layers only (no CNNs/RNNs/Transformers)
- TF-IDF loses word order information
- No early stopping used (per assignment constraints)
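
The word-order limitation is easy to demonstrate. The sketch below assumes unigram features and simple whitespace tokenisation; the sentences are made up, and the notebook's vectoriser settings (for example n-grams) are not shown here and may soften the effect.

```python
from collections import Counter


def bag_of_words(text):
    """Unigram counts with lowercasing and whitespace tokenisation."""
    return Counter(text.lower().split())


# Two hypothetical tweets with opposite meanings but identical word counts
tweet_a = "the flight was not good but cheap"
tweet_b = "the flight was good but not cheap"

same_representation = bag_of_words(tweet_a) == bag_of_words(tweet_b)
print(same_representation)  # True
```

Because TF-IDF weights are a function of these counts, both tweets map to the same feature vector and no Dense model on top can tell them apart.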

### Future Work

- **Early stopping** and **learning rate scheduling** could improve training efficiency
- **LIME** or **SHAP** for model interpretability
- **Pre-trained embeddings** (Word2Vec, BERT) could capture semantic relationships better than TF-IDF
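
As a sketch of the early-stopping idea, the patience rule can be written in a few lines of plain Python. The function name, loss values, and patience setting are hypothetical. In Keras the same behaviour is available through the `EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` epochs have passed without a new
    lowest validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses over 8 epochs
val_losses = [0.95, 0.80, 0.71, 0.68, 0.70, 0.74, 0.79, 0.85]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

Here training would stop at epoch 6 instead of running all 8 epochs, and the weights from epoch 4 would be kept.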

---

## References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', in *Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI)*, vol. 2, pp. 1137–1145.