---
title: "DTSA 5511 Introduction to Machine Learning: Deep Learning"
subtitle: "Week 4: Natural Language Processing with Disaster Tweets"
author:
    - name: Andrew Simms
      affiliation: University of Colorado Boulder

date: today

format:
    html:
        mainfont: "system-ui, sans-serif"
        monofont: "ui-monospace, 'Cascadia Code', 'Source Code Pro', Menlo, Consolas, 'DejaVu Sans Mono', monospace"
        theme: [cosmo, custom.scss]
        html-math-method: "katex"
        highlight-style: monokai
        fig-format: svg
        code-fold: true
        code-line-numbers: true
        embed-resources: true
        self-contained-math: true
        toc-title: 'Table of Contents'
        toc-location: left
        toc-depth: 5
        toc-expand: 1
        number-sections: true
        table-of-contents: true
    ipynb:
        number-sections: true


bibliography: ../ref.bib

---


# Problem Description

This project builds a Recurrent Neural Network model for the Natural Language Processing with Disaster Tweets competition, hosted on Kaggle [@kaggle_nlp], with the objective of developing a machine learning model that can accurately classify tweets as disaster-related or not. The dataset consists of 10,000 manually labeled tweets, creating a binary classification task where tweets are labeled `1` for disaster-related content and `0` for non-disaster-related content.

To achieve this goal, this project will leverage PyTorch [@pytorch] to design and implement a Recurrent Neural Network (RNN), a neural architecture well-suited for processing sequential text data. The trained model will generate predictions that will be submitted to Kaggle for evaluation.

This project will address the following research questions:

| Research Area          | Question                                                                                                         |
| :---                   | :---                                                                                                             |
| Data Preparation       | How should text data be preprocessed to maximize model performance?                                              |
| Model Building         | How do we implement an RNN model in PyTorch?                                                                     |
| Hyperparameter Tuning  | What static hyperparameters should be defined and what dynamic hyperparameters should be tuned?                  |
| Model Performance      | What performance metrics should be used?<br/>How do the models perform during training, validation, and testing? |
| Improvement Strategies | What methods can be used to further enhance model performance?                                                   |

: Project Research Questions {#tbl-project-questions}

Beyond answering these questions, the project aims to address the technical challenges related to RNN models, including mitigating overfitting, handling exploding gradients, and balancing model complexity with prediction accuracy.

The workflow for this research is summarized in @fig-proj-work. The process begins with exploratory data analysis and preprocessing, followed by model training. Hyperparameter tuning is iteratively applied to refine model performance, culminating in the development of final models for Kaggle submission.


```{mermaid}
%%| label: fig-proj-work
%%| fig-cap: RNN Project Workflow

flowchart LR
    EDA["<div style='line-height:1.0;'>Exploratory<br>Data<br>Analysis</div>"]
    --> Clean["<div style='line-height:1.0;'>Clean<br>Original<br>Data</div>"]
    --> BuildModel["<div style='line-height:1.0;'>Build<br>RNN<br>Model</div>"]
    --> Train["<div style='line-height:1.0;'>Train<br>Model</div>"]
    --> Tune["<div style='line-height:1.0;'>Tune<br>Hyperparameters</div>"]
    --> OutputFinal["<div style='line-height:1.0;'>Final<br>Models</div>"]
    --> Submit["<div style='line-height:1.0;'>Submit<br>Results</div>"]
    Tune --> Train
```


# Exploratory Data Analysis


For the defined binary classification task, the supplied training and test data contain the tweet
text, a location, a keyword, and an id; the training data also includes a class label. Each training
tweet is labeled as either disaster-related (`1`) or not (`0`). The primary input feature is `text`
(the original tweet content), supplemented by fields like `keyword` and `location`.

## Training Data Columns and Types


In [None]:
#| label: tbl-train-info
#| tbl-cap: Data Columns and Types of Training Data

import pandas as pd

train_df = pd.read_csv("../data/train.csv")
train_df.info()

@tbl-train-info details the available data. Of note are the large number of missing values in the
`location` column and the small number of missing values in the `keyword` column. `id` and `target`
are numeric, while the other three columns are text.


## Training Data Sample


In [None]:
#| label: tbl-train-head
#| tbl-cap: Sample of Training Data

train_df.loc[train_df['location'].notna()].head()

@tbl-train-head shows a subset of the data. Data in the `keyword` column appears to be somewhat
standardized, while data in the `location` and `text` columns appears to be original, free-form
input from the user.

## Distribution of Target Values

Binary classification models are generally easier to train and evaluate when the two classes are
roughly balanced in the training data. The balance is checked here by counting the occurrences of
each `target` value.


In [None]:
#| label: fig-target-hist
#| fig-cap: Histogram of `target` Values in Training Data

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()


plt.figure(figsize=(7, 2.5))
sns.countplot(x='target', data=train_df)
plt.xlabel("Target")
plt.ylabel("Count")
plt.title("Count of Target Values")
plt.show()

@fig-target-hist shows that the classes are unequally represented, with more negative (`0`) than
positive (`1`) examples. This imbalance is important and must be accounted for during model training
and validation.
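
To put a number on the imbalance, the class proportions can be computed directly from the labels. The following is a minimal sketch using only the standard library; the helper name `class_proportions` and the five example labels are illustrative assumptions, not values from the competition data.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels for illustration only (not the real dataset)
example_targets = [0, 0, 0, 1, 1]
print(class_proportions(example_targets))  # {0: 0.6, 1: 0.4}
```

With the real data, the same information should be available from `train_df['target'].value_counts(normalize=True)`.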


## Sample of Positive Tweets


In [None]:
#| label: tbl-train-sample-positive
#| tbl-cap: Sample of Positive Training Data

train_df.loc[train_df['target'] == 1].head()

In @tbl-train-sample-positive, positive tweets appear to have some relation to the disaster they are
describing. The content appears to have multiple complex words and hashtags.

## Sample of Negative Tweets


In [None]:
#| label: tbl-train-sample-negative
#| tbl-cap: Sample of Negative Training Data

train_df.loc[train_df['target'] == 0].head()

In @tbl-train-sample-negative, negative tweets have content that appears irrelevant to a disaster.
This content appears relatively generic and not specific to any event or location.

## Word Count

To identify whether data cleaning is necessary, the content of the tweets (`text` column) is
visualized below. Each row is split on whitespace into a list of words, and the words from all rows
are combined into a single list and counted.


In [None]:
import numpy as np
from collections import Counter

def plot_word_horizontal_bar_chart(df, column, top_n=10, figsize=None):
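    """
    Plot a horizontal bar chart of the most common words in a DataFrame column.

    :param df: The input DataFrame.
    :param column: The name of the text column to analyze.
    :param top_n: Number of most common words to display.
    :param figsize: Optional (width, height) of the figure in inches.
    """
    # Count word frequencies in the DataFrame that was passed in
    words = dataframe_to_word_list(df, column)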
    word_counts = Counter(words)
    most_common = word_counts.most_common(top_n)

    # Split words and their counts
    labels, counts = zip(*most_common)

    # print(labels[:10])
    # print(counts[:10])

    # # Plot the horizontal bar chart
    if figsize is None:
        figsize=(3, 6)
    plt.figure(figsize=figsize)
    plt.barh(labels, counts)
    plt.xlabel("Count")
    # plt.ylabel("Words")
    plt.title(f"Top {top_n} Words in {column}")
    plt.gca().invert_yaxis()  # Invert the y-axis to display the highest count at the top
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

def dataframe_to_word_list(df, text_column):
    """
    Convert a DataFrame column of text into a list of words.

    :param df: The input DataFrame.
    :param text_column: The name of the column containing text data.
    :return: A list of words.
    """
    # Tokenize each row into words and flatten into a single list
    words = df[text_column].str.split().explode().tolist()
    # Keep only non-empty strings; missing rows surface as float NaN after explode
    return [word for word in words if isinstance(word, str) and word]

:::{.column-screen-inset}


In [None]:
#| label: fig-wordcount-original
#| fig-cap: Original Data Word Count by Column
#| fig-subcap:
#|   - '`text` Column'
#|   - '`location` Column'
#|   - '`keyword` Column'
#| layout-ncol: 3


plot_word_horizontal_bar_chart(train_df, 'text', top_n=25)
plot_word_horizontal_bar_chart(train_df.loc[train_df['location'].notna()], 'location', top_n=25)
plot_word_horizontal_bar_chart(train_df, 'keyword', top_n=25)

:::

The word count charts in @fig-wordcount-original reveal a mixture of relevant and possibly unnecessary words and characters in each column. Some of these characters may be removed to clean up the input to the model.

# Data Cleaning

Based on the word count visualizations in @fig-wordcount-original, it appears that removing common stop words may affect the model's performance. This process will be implemented using the NLTK Python package [@py_lib_nltk]. To allow the data cleaning level to be treated as a hyperparameter and to track different preprocessing configurations, the cleaned datasets will be labeled with an `a<count>` suffix, where the first cleaned dataset will be designated as `a1`.


## Data Level `a1`: Removing Stop Words


In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def filter_stop_words(df, column):
    def remove_stop_words(text):
        """Remove English stop words from a whitespace-delimited string."""
        words = str(text).split()
        # Missing values become the string "nan" after str(); drop them
        words = [word for word in words if word != "nan"]
        return " ".join([word for word in words if word.lower() not in stop_words])

    # Apply the filtering function to the specified column
    df[column] = df[column].apply(remove_stop_words)

    return df

train_df_a1 = train_df.copy()

train_df_a1 = filter_stop_words(train_df_a1, 'text')
train_df_a1 = filter_stop_words(train_df_a1, 'location')
train_df_a1 = filter_stop_words(train_df_a1, 'keyword')

:::{.column-screen-inset}


In [None]:
#| label: fig-wordcount-a1
#| fig-cap: A1 Data Word Count by Column
#| fig-subcap:
#|   - '`text` Column'
#|   - '`location` Column'
#|   - '`keyword` Column'
#| layout-ncol: 3


plot_word_horizontal_bar_chart(train_df_a1, 'text', top_n=25)
plot_word_horizontal_bar_chart(train_df_a1, 'location', top_n=25)
plot_word_horizontal_bar_chart(train_df_a1, 'keyword', top_n=25)

:::

## Data Level `a2`: Removing Unnecessary Characters

It may be beneficial to further clean the data. We apply some high-level techniques to normalize the
input data.


In [None]:
import re

def clean_df_text_column(df, column):
    def clean_row(tweet):
        words = tweet.split()
        cleaned_words = []
        for word in words:
            # Remove URLs
            # word = re.sub(r'http\S+|www\S+', '[URL]', word)

            # Replace user mentions (@username) with a placeholder
            # word = re.sub(r'@\w+', '[USER]', word)

            # Remove hashtags but keep the word (e.g., "#earthquake" → "earthquake")
            # word = re.sub(r'#(\w+)', r'\1', word)

            # Remove unwanted characters (e.g., punctuation)
            word = re.sub(r'[^\w\s]', '', word)

            # Remove dashes
            word = re.sub('-', '', word)

            # Remove extra spaces (if any remain)
            word = word.strip()

            # Add the cleaned word to the list if it's not empty
            if word and len(word) > 1:
                cleaned_words.append(word.lower())

        return " ".join(cleaned_words)

    df[column] = df[column].apply(clean_row)

    return df

train_df_a2 = train_df_a1.copy()

train_df_a2 = clean_df_text_column(train_df_a2, 'text')
train_df_a2 = clean_df_text_column(train_df_a2, 'location')
train_df_a2 = clean_df_text_column(train_df_a2, 'keyword')

:::{.column-screen-inset}


In [None]:
#| label: fig-wordcount-a2
#| fig-cap: A2 Data Word Count by Column
#| fig-subcap:
#|   - '`text` Column'
#|   - '`location` Column'
#|   - '`keyword` Column'
#| layout-ncol: 3


plot_word_horizontal_bar_chart(train_df_a2, 'text', top_n=25)
plot_word_horizontal_bar_chart(train_df_a2, 'location', top_n=25)
plot_word_horizontal_bar_chart(train_df_a2, 'keyword', top_n=25)

:::
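
To make the effect of the `a2` cleaning concrete, the sketch below applies the same punctuation-stripping regular expression, lowercasing, and minimum-length filter to a single string. It is a condensed, standalone illustration rather than the pipeline itself: the function name `clean_text` and the sample tweet are made up for this example, and in the actual workflow stop words are removed first at level `a1`.

```python
import re


def clean_text(tweet: str) -> str:
    """Strip punctuation, lowercase, and drop tokens of length 1."""
    cleaned_words = []
    for word in tweet.split():
        # Remove every character that is not a word character or whitespace
        word = re.sub(r"[^\w\s]", "", word).strip()
        if len(word) > 1:
            cleaned_words.append(word.lower())
    return " ".join(cleaned_words)


# Hypothetical tweet used only for illustration
sample = "A forest fire near La Ronge, Sask. #Canada http://t.co/abc123"
print(clean_text(sample))
# forest fire near la ronge sask canada httptcoabc123
```

Note that because URL handling is commented out in the cleaning function, a link collapses into a single token (`httptcoabc123` here) rather than being removed or replaced with a placeholder.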

## Final Cleaning Results

Final cleaning results from data level `a2` are detailed below.

### Text Content

To determine if there are differences between cleaned positive and negative tweets a sample of
randomly selected tweets from each class is output below.

#### Positive Tweets


In [None]:
#| label: tbl-cleaned-positive
#| tbl-cap: Sample of Cleaned Positive Training Data

# train_df_a2[train_df_a2['target'] == 1].loc[train_df_a2['location'].str.len > 1].head()
train_df_a2[
    (train_df_a2['target'] == 1) &
    (train_df_a2['location'].str.len() > 1) &
    (train_df_a2['keyword'].str.len() > 1)
].sample(frac=1.0).head()

#### Negative Tweets


In [None]:
#| label: tbl-cleaned-negative
#| tbl-cap: Sample of Cleaned Negative Training Data

# train_df_a2[train_df_a2['target'] == 0].head()
# train_df_a2[train_df_a2['target'] == 0].loc[train_df_a2['location'].notna()].head()
train_df_a2[
    (train_df_a2['target'] == 0) &
    (train_df_a2['location'].str.len() > 1) &
    (train_df_a2['keyword'].str.len() > 1)
].sample(frac=1.0).head()

In the randomly selected tweets there are some differences between the two classes, but after the
cleaning the differences are not readily apparent from a content perspective. Both positive and
negative tweets have some level of text that is not readily comprehensible.

### Visualizations


In [None]:
def count_unique_words(input_df, column):
    # Create a set to store unique words
    unique_words = set()

    # Iterate through each row in the column
    for text in input_df[column]:
        if isinstance(text, str):  # Ensure the entry is a string
            words = text.split()  # Split into words on whitespace
            unique_words.update(words)  # Add words to the set

    # Return the size of the set
    return len(unique_words)

In [None]:
results = [
    {
        "class": "original",
        "column": "text",
        "count": count_unique_words(train_df, 'text')
    },
    {
        "class": "a1",
        "column": "text",
        "count": count_unique_words(train_df_a1, 'text')
    },
    {
        "class": "a2",
        "column": "text",
        "count": count_unique_words(train_df_a2, 'text')
    },
    {
        "class": "original",
        "column": "keyword",
        "count": count_unique_words(train_df, 'keyword')
    },
    {
        "class": "a1",
        "column": "keyword",
        "count": count_unique_words(train_df_a1, 'keyword')
    },
    {
        "class": "a2",
        "column": "keyword",
        "count": count_unique_words(train_df_a2, 'keyword')
    },
    {
        "class": "original",
        "column": "location",
        "count": count_unique_words(train_df, 'location')
    },
    {
        "class": "a1",
        "column": "location",
        "count": count_unique_words(train_df_a1, 'location')
    },
    {
        "class": "a2",
        "column": "location",
        "count": count_unique_words(train_df_a2, 'location')
    },
]

results_df = pd.DataFrame(results)

results_df = results_df.rename({
"class": "Data Level",
"count": "Count",
"column": "Data Column",
}, axis="columns")

def plot_count_hist(input_df, column):
    plt.figure(figsize=(4, 3))

    # Create the barplot
    ax = sns.barplot(
        data=input_df.loc[input_df["Data Column"] == column],
        y="Count",
        x="Data Column",
        hue="Data Level",
    )

    # Customize the legend position to appear below the plot
    plt.legend(
        title="Data Level",  # Optional: Add a title to the legend
        loc="upper center",  # Center the legend horizontally
        bbox_to_anchor=(0.5, -0.25),  # Adjust the vertical position below the plot
        ncol=3,  # Display the legend entries in three columns (optional for compactness)
        frameon=False,  # Remove the legend border (optional)
    )

    plt.xlabel(None)
    plt.tight_layout()  # Adjust layout to prevent overlap
    plt.show()


:::{.column-screen-inset}

:::{.grid}

:::{.g-col-4}


In [None]:
#| label: fig-text-count
#| fig-cap: Count of unique values in text throughout cleaning process

plot_count_hist(results_df, "text")

:::

:::{.g-col-4}


In [None]:
#| label: fig-location-count
#| fig-cap: Count of unique values in location throughout cleaning process

plot_count_hist(results_df, "location")

:::

:::{.g-col-4}


In [None]:
#| label: fig-keyword-count
#| fig-cap: Count of unique values in keyword throughout cleaning process

plot_count_hist(results_df, "keyword")

:::

:::

:::

In @fig-text-count the number of unique values decreases at each data processing step. For the `text` column, removing stop words slightly reduces the unique value count, while the text cleaning step removes a significant number of values. `location` in @fig-location-count follows a similar pattern, but the number of words removed by the stop word cleaning is higher than for `text`. `keyword` in @fig-keyword-count shows no change from the cleaning, suggesting that the keyword format is already constrained at input and that the original data is in a standardized state.


| Data Level | Description                                                                       |
| :---       | :----                                                                             |
| Original   | Original data without modifications                                               |
| a1         | Stop words removed                                                                |
| a2         | a1, plus lowercased, whitespace stripped, punctuation removed, and tokens shorter than 2 characters removed |

: Data Level Descriptions {#tbl-data-lvl-descrip}

We will now process the training and test data using the above functions and save them to Parquet
files as input for testing:


In [None]:
from pathlib import Path
data_path = Path("../data/preprocessed").resolve()
data_path.mkdir(exist_ok=True, parents=True)

train_data_path = Path(data_path, "train")
train_data_path.mkdir(exist_ok=True)
train_raw_filename = Path(train_data_path, "train_raw.parquet")
train_a1_filename = Path(train_data_path, "train_a1.parquet")
train_a2_filename = Path(train_data_path, "train_a2.parquet")

train_df.to_parquet(train_raw_filename)
train_df_a1.to_parquet(train_a1_filename)
train_df_a2.to_parquet(train_a2_filename)

test_data_path = Path(data_path, "test")
test_data_path.mkdir(exist_ok=True)
test_raw_filename = Path(test_data_path, "test_raw.parquet")
test_a1_filename = Path(test_data_path, "test_a1.parquet")
test_a2_filename = Path(test_data_path, "test_a2.parquet")

test_df = pd.read_csv("../data/test.csv")
test_df_a1 = test_df.copy()
test_df_a1 = filter_stop_words(test_df_a1, 'text')
test_df_a1 = filter_stop_words(test_df_a1, 'location')
test_df_a1 = filter_stop_words(test_df_a1, 'keyword')

test_df_a2 = test_df_a1.copy()
test_df_a2 = clean_df_text_column(test_df_a2, 'text')
test_df_a2 = clean_df_text_column(test_df_a2, 'location')
test_df_a2 = clean_df_text_column(test_df_a2, 'keyword')

test_df.to_parquet(test_raw_filename)
test_df_a1.to_parquet(test_a1_filename)
test_df_a2.to_parquet(test_a2_filename)

## Tokenization

A further preparation step is converting the text into a numerical representation that a neural
network can work with. This can be implemented in multiple ways; one popular option is the
[Transformers](https://pypi.org/project/transformers/) Python library developed by
@transformers_python_lib. As detailed by [AkariAsai on GitHub](https://github.com/AkariAsai/pytorch-pretrained-BERT/blob/master/README.md) there are many options for pretrained transformer models. For this project we will use the `bert-base-uncased` and `bert-base-cased` tokenizers via the [`PreTrainedTokenizer`](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) class in the Transformers library.


In [None]:
#| eval: false
#| code-fold: false
#| lst-label: lst-tokenizer
#| lst-cap: Tokenizer Data Loader Implementation
import pandas as pd
from transformers import PreTrainedTokenizer


def preprocess_dataframe(
    df: pd.DataFrame,
    tokenizer: PreTrainedTokenizer,
    text_max_length: int,
    keyword_max_length: int,
    location_max_length: int,
):
    ids, tokens, attentions, targets = [], [], [], []

    max_length = text_max_length + keyword_max_length + location_max_length

    df["keyword"] = df["keyword"].fillna("")
    df["location"] = df["location"].fillna("")

    for _, row in df.iterrows():
        # Tokenize each component and add special tokens to indicate type
        text_tokens = tokenizer.encode(
            row["text"],
            add_special_tokens=True,
            truncation=True,
            max_length=text_max_length,
            padding="max_length",
            return_tensors="pt",
        ).tolist()[0]
        keyword_tokens = tokenizer.encode(
            row["keyword"],
            add_special_tokens=True,
            truncation=True,
            max_length=keyword_max_length,
            padding="max_length",
            return_tensors="pt",
        ).tolist()[0]
        location_tokens = tokenizer.encode(
            row["location"],
            add_special_tokens=True,
            truncation=True,
            max_length=location_max_length,
            padding="max_length",
            return_tensors="pt",
        ).tolist()[0]

        # Combine tokens
        combined_tokens = text_tokens + keyword_tokens + location_tokens

        # Create initial attention mask with 1s for all tokens
        attention_mask = [1] * len(combined_tokens)

        # Pad tokens and attention mask to max_length
        padding_length = max_length - len(combined_tokens)
        combined_tokens += [tokenizer.pad_token_id] * padding_length
        attention_mask += [0] * padding_length

        # Update attention mask to set positions with padding tokens to 0
        attention_mask = [
            0 if token == tokenizer.pad_token_id else mask
            for token, mask in zip(combined_tokens, attention_mask)
        ]

        # Collect processed data
        ids.append(row["id"])
        tokens.append(combined_tokens)
        attentions.append(attention_mask)
        try:
            targets.append(row["target"])
        except KeyError:
            continue

    # Return targets if they exist
    if len(targets) > 0:
        return pd.DataFrame(
            {"id": ids, "tokens": tokens, "attention": attentions, "target": targets}
        )
    else:
        return pd.DataFrame({"id": ids, "tokens": tokens, "attention": attentions})
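
The relationship between padded token ids and the attention mask can be shown without loading a tokenizer. The sketch below is a simplified illustration under stated assumptions: `pad_and_mask` is a hypothetical helper, the token ids are arbitrary example values, and a padding id of `0` is assumed.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token ids to max_length and build the matching mask."""
    ids = list(token_ids)[:max_length]
    # Real tokens get a mask value of 1
    mask = [1] * len(ids)
    # Padding positions get the pad id and a mask value of 0
    padding_length = max_length - len(ids)
    return ids + [pad_id] * padding_length, mask + [0] * padding_length


# Arbitrary example ids, not produced by a real tokenizer
ids, mask = pad_and_mask([101, 2543, 102], max_length=6)
print(ids)   # [101, 2543, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Every sequence ends up with the same length, and the mask records which positions hold real tokens so that later steps can ignore the padding.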

# Recurrent Neural Network Model

<!--

Using PyTorch [@pytorch] we define a Recurrent Neural Network (RNN) model to solve this text
classification problem. The model has the following properties and is optimal for this problem
because. Potential hyper parameters to ture are the embedding dim and the hidden dim. This model
outputs a number between 0 and 1 that will need to be converted by the validate or test into a
binary 1 or zero.

-->


To solve the disaster tweet classification problem, an RNN model was implemented using PyTorch
[@pytorch]. This model is specifically designed to process and classify sequential text data while
also leveraging additional features, such as `keyword` and `location`, to improve accuracy. The
architecture has the following key components and characteristics:

## Embedding Layers

The model uses three separate embedding layers (`text_embedding`, `keyword_embedding`, and
`location_embedding`) to convert the categorical input (`text`, `keyword`, and `location`) into
dense, low-dimensional vectors.

- Embedding Dimension: The size of these dense representations is a tunable hyperparameter (`embedding_dim`, default: 128).
- Padding Index: A padding index of 0 maps the padding token to a zero vector that is not updated
  during training, so positions added to bring sequences to a uniform length carry no learned signal.

## Recurrent Layer (LSTM)

The embeddings are concatenated into a combined representation, which serves as input to an LSTM (Long Short-Term Memory) layer, first described by @LSTM_paper. For this project the input dimension can be tuned as a hyperparameter through the embedding dimension:

* Input Dimension: The concatenated embeddings have a dimensionality of $\text{Embedding Dimension} \times 3$, as all three embeddings are combined.
* Hidden Dimension: The LSTM layer processes the input and outputs a hidden state with size `hidden_dim`. To reduce the number of hyperparameters, this value is fixed at $\text{Input Dimension} \times 2$.
* Batch Processing: The `batch_first=True` argument ensures the input tensors are structured as (batch size, sequence length, feature size).

## Fully Connected Layer

The final hidden state of the LSTM, corresponding to the last time step, is passed through a fully connected (`fc`) layer to reduce the dimensionality to the output size of 1.

## Sigmoid Activation

The output of the fully connected layer is passed through a sigmoid activation function, which scales the predictions to a range of $[0, 1]$. These outputs represent the probability of a tweet being disaster-related. For validation or testing, these probabilities are thresholded (e.g., $\geq 0.5$) to classify the output as either 1 or 0.
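
The conversion from a raw model output to a class label can be sketched with the standard library alone. The function names `sigmoid` and `classify` and the example logits are illustrative assumptions; the 0.5 threshold matches the example given above.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(logit: float, threshold: float = 0.5) -> int:
    """Return 1 (disaster) if the probability meets the threshold, else 0."""
    return 1 if sigmoid(logit) >= threshold else 0


# Example logits chosen for illustration
print(classify(3.0))   # 1
print(classify(-2.0))  # 0
```

A logit of exactly `0.0` maps to a probability of 0.5 and is therefore classified as `1` under the $\geq 0.5$ rule.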

## `RNNWithMultiInput` Class Definition


In [None]:
#| eval: true
#| code-fold: false
#| lst-label: lst-rnn-multi
#| lst-cap: RNN PyTorch Model

import torch
import torch.nn as nn

class RNNWithMultiInput(nn.Module):
    def __init__(
        self,
        vocab_size,
        use_attention=False,
        embedding_dim=128,
        hidden_dim=256,
        output_dim=1,
    ):
        super(RNNWithMultiInput, self).__init__()
        self.use_attention = use_attention
        self.text_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.keyword_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.location_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        self.lstm = nn.LSTM(
            embedding_dim * 3,
            hidden_dim,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(
        self,
        text_input_ids,
        text_attention_mask,
        keyword_input_ids,
        keyword_attention_mask,
        location_input_ids,
        location_attention_mask,
    ):
        # Embedding layers
        text_emb = self.text_embedding(text_input_ids)
        keyword_emb = self.keyword_embedding(keyword_input_ids)
        location_emb = self.location_embedding(location_input_ids)

        if self.use_attention is True:
            text_emb = text_emb * text_attention_mask.unsqueeze(-1)
            keyword_emb = keyword_emb * keyword_attention_mask.unsqueeze(-1)
            location_emb = location_emb * location_attention_mask.unsqueeze(-1)

        # Combine embeddings
        combined_emb = torch.cat((text_emb, keyword_emb, location_emb), dim=2)

        # Pass through LSTM
        lstm_out, _ = self.lstm(combined_emb)
        last_hidden_state = lstm_out[:, -1, :]

        # Fully connected layer
        logits = self.fc(last_hidden_state)
        # Squeeze only the output dimension so a batch of size 1 keeps shape (1,)
        return self.sigmoid(logits).squeeze(-1)

## Hyperparameter Tuning

Key hyperparameters for optimization include:

* Embedding Dimension (`embedding_dim`): Controls the size of the feature space for text representation.
* Hidden Dimension (`hidden_dim`): Determines the capacity of the LSTM layer to capture sequential patterns.
* Batch Size and Learning Rate: While not part of the architecture, these parameters significantly influence training efficiency and model performance.

### Attention Mechanism Hyperparameter

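The model includes an optional attention masking step (enabled by the `use_attention` flag). If enabled, each embedding is multiplied element-wise by its attention mask, which zeroes the embedding vectors at padded positions so that only real tokens contribute to the LSTM input. This is a masking operation rather than a learned attention layer.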
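
The effect of multiplying embeddings by a mask can be illustrated with plain Python lists. This is a toy sketch, not the PyTorch operation used in the model: `apply_mask` is a hypothetical helper, and the embedding values are made up.

```python
def apply_mask(embeddings, mask):
    """Zero out the embedding vector at every position where mask is 0."""
    return [
        [value * m for value in vector]
        for vector, m in zip(embeddings, mask)
    ]


# Three positions with 2-dimensional embeddings; the last position is padding
embeddings = [[0.5, -1.0], [0.25, 2.0], [0.1, 0.3]]
mask = [1, 1, 0]
print(apply_mask(embeddings, mask))
# [[0.5, -1.0], [0.25, 2.0], [0.0, 0.0]]
```

In the model, the same idea is expressed as `text_emb * text_attention_mask.unsqueeze(-1)`, which broadcasts the mask across the embedding dimension.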

## Rationale for Architecture

This architecture is well-suited for this problem because:

1. The LSTM layer efficiently captures sequential dependencies in textual data, which is critical for understanding the context within tweets.
2. By incorporating separate embeddings for `keyword` and `location`, the model leverages additional information beyond the tweet text, potentially improving classification accuracy.
3. The flexibility to enable or disable attention mechanisms provides adaptability for datasets with varying levels of noise or irrelevant data.

<!--

# Training

The training process focuses on optimizing key aspects of model performance, including learning rate
scheduling, early stopping criteria, embedding size, tokenizer selection, and data cleaning levels.
The objective is to balance model complexity with minimizing training and validation loss to ensure
robust generalization. The top three models from this process will be submitted to Kaggle for final
evaluation.

Data Driven questions:

1. How should we change the learning rate during training
1. Should we stop early
2. What embedding size is optimal
3. What tokenizer should we use
4. What cleaning level should we use

The goal is to find an optimal balance between complexity, training loss and validation loss

Based on thiese results we will pass the 3 best models to Kaggle for final scores.

-->

# Training

The training process for this project begins with an 80/20 train-test split of the dataset, ensuring
a robust and reliable evaluation of model performance. A critical consideration in training a neural
network is the selection of an appropriate validation metric. Given the binary classification nature
of the task—where tweets are labeled as disaster-related (`1`) or non-disaster-related
(`0`)—traditional metrics such as accuracy, precision, recall, and F1 score must be carefully
evaluated in the context of class imbalance.
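
One way to perform an 80/20 split while preserving the class ratio in both parts is a stratified split. The sketch below is an assumption about how such a split could be done, since the exact procedure is not described here; `stratified_split`, the seed, and the example labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split sample indices so each class keeps the same validation fraction."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the validation share
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced labels: 60 negatives and 40 positives
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 80 20
```

Because each class is split separately, the validation set in this example contains 12 negatives and 8 positives, matching the 60/40 ratio of the full set.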

For this dataset, **F1 score** is chosen as the primary evaluation metric. This decision is driven
by the inherent class imbalance in the disaster-related tweets, where false positives (non-disaster
tweets incorrectly classified as disasters) and false negatives (disaster tweets missed by the
model) both carry significant consequences. The F1 score, being the harmonic mean of precision and
recall, provides a balanced measure that accounts for both types of error, ensuring the model
optimizes performance in a way that minimizes the impact of misclassifications.
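
As a worked illustration of the metric, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The function name `f1_score` and the example counts are illustrative assumptions and are not results from this project.

```python
def f1_score(true_positive: int, false_positive: int, false_negative: int) -> float:
    """Compute F1 as the harmonic mean of precision and recall."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: precision = 0.8, recall = 80 / 120
score = f1_score(true_positive=80, false_positive=20, false_negative=40)
print(round(score, 4))  # 0.7273
```

True negatives do not appear in the formula, which is why F1 is less flattered than accuracy by a large number of correctly rejected non-disaster tweets.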

With the evaluation metric established, we proceed to the core aspects of the training process,
which focus on optimizing the model’s performance. Several key data-driven questions guide this
process:

1. **Baseline Determination**: How does the model perform with default hyperparameter settings?
2. **Tokenizer Selection**: Which tokenizer should be employed to preprocess the textual data effectively, accounting for nuances such as tokenization of hashtags, mentions, and special characters?
3. **Data Cleaning**: What level of data preprocessing (e.g., removal of stopwords, punctuation, or special characters) is optimal for this task to ensure high-quality input data?
4. **Embedding Layers**: What embedding dimension best balances the model's ability to capture semantic information without increasing computational complexity unnecessarily?

To answer these questions, a separate set of models is trained for each question, with each step
building on the results of the previous one. At each comparison step, multiple embedding dimensions
are modeled to see whether this affects the outcome.

The goal is to fine-tune these hyperparameters and preprocessing steps to strike a balance between model complexity and predictive accuracy, ensuring the best possible performance on unseen data. Following this optimization process, the top three models—based on their performance in training and validation—will be selected for final evaluation and submission to Kaggle.


In [None]:
import duckdb

con = duckdb.connect()
query = "SELECT * FROM read_parquet('../train_stats_f1/*.parquet', union_by_name=True)"
df = con.execute(query).fetchdf()

df['Positive Ratio'] = (df['Validation True Positive'] + df['Validation False Positive']) / (df['Validation True Positive'] + df['Validation True Negative'] + df['Validation False Positive'] + df['Validation False Negative'])


df["Epoch"] = df["Epoch"].astype(int)

In [None]:
def plot_vs_embedding_dim(df, y_col, hue="Embedding Dimensions", show_legend=True, ylim=None):
    plt.figure(figsize=(5.5, 3))
    sns.lineplot(data=df, x="Epoch", y=y_col, hue=hue, palette="deep", legend=show_legend)

    # Customize legend
    if show_legend:
        plt.legend(
            title="\n".join(hue.split(" ")),
            loc='center left',  # Adjust to the right of the plot
            bbox_to_anchor=(1, 0.5),  # Position to the right
            frameon=False  # Remove background and border
        )

    if ylim is not None:
        plt.ylim((0, ylim))

    plt.tight_layout()
    plt.show()

## Baseline Models

### Training Loss and F1 Score


In [None]:
df_baseline = df.loc[
    (df["Comparison Type"] == 'Baseline')
]

:::{.column-screen-inset-right}


In [None]:
#| label: fig-baseline-f1
#| fig-cap: Baseline Model - Training and Validation F1 Score
#| fig-subcap:
#|   - Training F1 Score
#|   - Validation F1 Score
#| layout-ncol: 2


plot_vs_embedding_dim(
    df_baseline,
    "Training F1 Score",
    show_legend=False,
    ylim=1.0
)
plot_vs_embedding_dim(
    df_baseline,
    "Validation F1 Score",
    ylim=1.0,
)

### Learning Rate and Compute Time


In [None]:
#| label: fig-baseline-other-stats
#| fig-cap: Baseline Model - Learning Rate and Compute Time
#| fig-subcap:
#|   - Learning Rate
#|   - Compute Time Per Epoch [s]
#| layout-ncol: 2


plot_vs_embedding_dim(
    df_baseline,
    "Learning Rate",
    show_legend=False,
)
plot_vs_embedding_dim(
    df_baseline,
    "Compute Time",
)

:::

@fig-baseline-f1-1 shows that, across embedding dimensions, the training F1 score improves consistently as the number of epochs increases. Models with lower embedding dimensions, however, converge to a lower final training F1 score, suggesting that these dimensions limit the model's capacity. Similarly, in @fig-baseline-f1-2, validation F1 scores also increase with training epochs. However, an upper limit is observed: embedding dimensions above 16 do not show a significant difference in validation performance.

Based on these trends, training for 50 epochs appears sufficient for models with embedding dimensions greater than 8 to achieve training stability and reach their performance plateau.

@fig-baseline-other-stats-1 visualizes the learning rate progression over training epochs. The learning rate is adjusted dynamically using a `ReduceLROnPlateau` scheduler based on validation F1 scores, as shown in the code snippet below:


In [None]:
#| eval: false
#| code-fold: false
#| lst-label: lst-scheduler
#| lst-cap: Learning Rate Scheduler

optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="max",
)
# ...
scheduler.step(val_f1)

Models with higher embedding dimensions tend to trigger the scheduler earlier, indicating that these models converge more rapidly. While this is advantageous for reducing the risk of overfitting, it does not necessarily translate into improved validation performance beyond the observed upper limit.
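
The plateau rule can be illustrated without PyTorch. The following is a simplified pure-Python sketch, not PyTorch's implementation: it assumes `mode="max"` with a relative improvement threshold and ignores cooldown and any minimum learning rate. The function name `plateau_schedule`, the example scores, and `patience=2` are hypothetical, chosen only to keep the trace short; the starting rate of 0.001 matches the start learning rate listed in the final model specifications.

```python
def plateau_schedule(scores, lr=1e-3, factor=0.1, patience=2, threshold=1e-4):
    """Return the learning rate in effect after each epoch's score is seen.

    Simplified reduce-on-plateau rule for a metric that should increase:
    the rate is multiplied by `factor` once the score has failed to beat
    the best value for more than `patience` consecutive epochs.
    """
    best = float("-inf")
    bad_epochs = 0
    history = []
    for score in scores:
        if score > best * (1 + threshold):
            # Meaningful improvement: remember it and reset the counter
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # Plateau detected: shrink the learning rate and start counting again
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation F1 scores that stall after the third epoch
val_f1_scores = [0.60, 0.70, 0.72, 0.72, 0.71, 0.72, 0.73]
lr_history = plateau_schedule(val_f1_scores)
for epoch, (score, lr) in enumerate(zip(val_f1_scores, lr_history), start=1):
    print(f"epoch {epoch}: val F1 = {score:.2f}, lr = {lr:.0e}")
```

In this trace the rate stays at 0.001 through epoch 5 and drops by a factor of ten at epoch 6, after three epochs without improvement. A model whose validation F1 stalls sooner reaches that drop sooner, which is the behavior the learning rate curves reflect.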

@fig-baseline-other-stats-2 examines the compute times per epoch. Models with embedding dimensions below 256 maintain consistent training times of approximately 2 seconds per epoch. However, as embedding dimensions increase, so do computation times. Given that larger models fail to deliver meaningful performance improvements, it is computationally efficient to select a smaller model that balances performance and training cost effectively.


## Tokenizer Comparison


In [None]:
df_bert_uncased = df.loc[
    (df['Comparison Type'] == 'Tokenizer') & (df['Tokenizer'] == 'bert-base-uncased')
]

df_bert_cased = df.loc[
    (df['Comparison Type'] == 'Tokenizer') & (df['Tokenizer'] == 'bert-base-cased')
]

:::{.column-screen-inset-right}


In [None]:
#| label: fig-cased-vs-uncased-training-loss
#| fig-cap: Uncased vs Cased Tokenizer - Training F1 Score
#| fig-subcap:
#|   - Uncased Tokenizer Training F1 Score
#|   - Cased Tokenizer Training F1 Score
#| layout-ncol: 2

plot_vs_embedding_dim(df_bert_uncased, "Training F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_bert_cased, "Training F1 Score", ylim=1.0)

In [None]:
#| label: fig-cased-vs-uncased-f1-score
#| fig-cap: Uncased vs Cased Tokenizer - Validation F1 Score
#| fig-subcap:
#|   - Uncased Tokenizer Validation F1 Score
#|   - Cased Tokenizer Validation F1 Score
#| layout-ncol: 2

plot_vs_embedding_dim(df_bert_uncased, "Validation F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_bert_cased, "Validation F1 Score", ylim=1.0)

:::

@fig-cased-vs-uncased-training-loss and @fig-cased-vs-uncased-f1-score show no significant differences in performance between the cased and uncased tokenizers, as measured by training and validation F1 scores. However, since the original dataset retains case sensitivity, preserving this feature through a cased tokenizer aligns with the original characteristics of the data. This decision ensures that potentially meaningful information encoded in capitalization is retained.
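
To make the capitalization point concrete, the sketch below uses a plain whitespace split as a stand-in for a tokenizer (it is not the BERT tokenizer used in this project) and two made-up example tweets. It shows how lowercasing merges tokens that differ only in case.

```python
# Hypothetical example tweets, not taken from the competition data
tweets = [
    "FIRE spreading near the US border",
    "this mixtape is fire, trust us",
]


def vocabulary(texts, lowercase):
    """Collect the set of whitespace-separated tokens, optionally lowercased."""
    vocab = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        # Drop commas so "fire," and "fire" count as the same token
        vocab.update(text.replace(",", "").split())
    return vocab


cased_vocab = vocabulary(tweets, lowercase=False)
uncased_vocab = vocabulary(tweets, lowercase=True)
print("cased tokens:", len(cased_vocab))
print("uncased tokens:", len(uncased_vocab))
print("lost by lowercasing:", sorted(cased_vocab - uncased_vocab))
```

With case preserved, `FIRE` and `fire`, and `US` and `us`, remain distinct tokens that a model can in principle learn to treat differently; after lowercasing they collapse into one entry each.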


## Data Level Comparison


In [None]:
df_data_original = df.loc[
    (df['Comparison Type'] == 'Data Level') & (df['Data Level'] == 'original')
]

df_data_a1 = df.loc[
    (df['Comparison Type'] == 'Data Level') & (df['Data Level'] == 'a1')
]

df_data_a2 = df.loc[
    (df['Comparison Type'] == 'Data Level') & (df['Data Level'] == 'a2')
]


:::{.column-screen-inset}


In [None]:
#| label: fig-data-level-training-f1
#| fig-cap: Training F1 Score by Data Levels
#| fig-subcap:
#|   - Original Data - Training F1 Score
#|   - A1 Data - Training F1 Score
#|   - A2 Data - Training F1 Score
#| layout-ncol: 3

plot_vs_embedding_dim(df_data_original, "Training F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a1, "Training F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a2, "Training F1 Score", ylim=1.0)

In [None]:
#| label: fig-data-level-f1-score
#| fig-cap: Validation F1 Score by Data Levels
#| fig-subcap:
#|   - Original Data - Validation F1 Score
#|   - A1 Data - Validation F1 Score
#|   - A2 Data - Validation F1 Score
#| layout-ncol: 3

plot_vs_embedding_dim(df_data_original, "Validation F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a1, "Validation F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a2, "Validation F1 Score", ylim=1.0)

:::

@fig-data-level-training-f1 and @fig-data-level-f1-score indicate no measurable difference in F1 scores across the data preprocessing levels. Because altering the data level changes how the tokenizer processes the input, it adds complexity without yielding performance benefits. Based on this finding, the original unaltered data is used as input to subsequent models.

## 1-3 Embedding Layers


In [None]:
df_data_one_layer = df.loc[
    (df['Comparison Type'] == 'Layers') & (df['Embedding Layers'] == 1)
]

df_data_two_layer = df.loc[
    (df['Comparison Type'] == 'Layers') & (df['Embedding Layers'] == 2)
]

df_data_three_layer = df.loc[
    (df['Comparison Type'] == 'Layers') & (df['Embedding Layers'] == 3)
]


:::{.column-screen-inset}


In [None]:
#| label: fig-layers-training-f1
#| fig-cap: Training F1 Score by Embedding Layers (1-3)
#| fig-subcap:
#|   - 1 Embedding Layer - Training F1 Score
#|   - 2 Embedding Layers - Training F1 Score
#|   - 3 Embedding Layers - Training F1 Score
#| layout-ncol: 3

plot_vs_embedding_dim(df_data_one_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_two_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_three_layer, "Training F1 Score", ylim=1.0)

In [None]:
#| label: fig-layers-validation-f1
#| fig-cap: Validation F1 Score by Embedding Layers (1-3)
#| fig-subcap:
#|   - 1 Embedding Layer - Validation F1 Score
#|   - 2 Embedding Layers - Validation F1 Score
#|   - 3 Embedding Layers - Validation F1 Score
#| layout-ncol: 3

plot_vs_embedding_dim(df_data_one_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_two_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_three_layer, "Validation F1 Score", ylim=1.0)

:::


## 4-6 Embedding Layers


In [None]:
df_data_one_layer = df.loc[
    (df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 4)
]

df_data_two_layer = df.loc[
    (df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 5)
]

df_data_three_layer = df.loc[
    (df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 6)
]


:::{.column-screen-inset}


In [None]:
#| label: fig-big-layers-f1
#| fig-cap: Training F1 Score by Embedding Layers (4-6)
#| fig-subcap:
#|   - 4 Embedding Layers - Training F1 Score
#|   - 5 Embedding Layers - Training F1 Score
#|   - 6 Embedding Layers - Training F1 Score
#| layout-ncol: 3

plot_vs_embedding_dim(df_data_one_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_two_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_three_layer, "Training F1 Score", ylim=1.0)

In [None]:
#| label: fig-big-layers-val-f1-score
#| fig-cap: Validation F1 Score by Embedding Layers (4-6)
#| fig-subcap:
#|   - 4 Embedding Layers - Validation F1 Score
#|   - 5 Embedding Layers - Validation F1 Score
#|   - 6 Embedding Layers - Validation F1 Score
#| layout-ncol: 3

plot_vs_embedding_dim(df_data_one_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_two_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_three_layer, "Validation F1 Score", ylim=1.0)

Increasing the number of embedding layers has a subtle but observable effect on the validation F1 score. Models with more embedding layers also introduce dropout, which helps mitigate overfitting by regularizing the network. This regularization contributes to greater training stability, as evidenced by reduced fluctuations in F1 scores across epochs. While the performance gains are marginal, the added stability of deeper models may provide a slight advantage for tasks requiring consistent and reliable predictions.

:::

## Final Models


In [None]:
df_final = df.loc[
    df['Comparison Type'] == 'Final'
]

:::{.column-screen-inset-right}


In [None]:
#| label: fig-big-layers-training-loss-2
#| fig-cap: Final Model F1 Scores
#| layout-ncol: 2
#| fig-subcap:
#|   - Training F1 Score
#|   - Validation F1 Score

plot_vs_embedding_dim(df_final, "Training F1 Score", hue = "Embedding Layers", show_legend=False)
plot_vs_embedding_dim(df_final, "Validation F1 Score", hue = "Embedding Layers")

:::

In the final output models, we observe a steady increase in the training F1 score across all models, while the validation F1 score stabilizes after ~30 epochs. Notably, models with 2 and 6 embedding layers achieve higher validation F1 scores compared to those with 4 embedding layers. This suggests that embedding layer depth plays a nuanced role in model performance, with certain configurations better capturing the underlying patterns in the data. All models demonstrate the ability to fit the input data effectively and achieve stability within the specified 50 epochs.


# Results

## Kaggle Screenshot

![Kaggle Results](./week_4_kaggle_scores_2.png){#fig-kaggle-res}

## Results Analysis


In [None]:
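# Keep only the final epoch; copy so adding columns does not raise SettingWithCopyWarning
df_final = df_final.loc[df_final['Epoch'] == 50].copy()
# Kaggle public leaderboard scores, assumed to be listed in the same row order as df_final
public_scores = [0.75973, 0.73490, 0.73337]
df_final["Public Score"] = public_scores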

In [None]:
#| label: fig-final-scores-combined-public
#| fig-cap: Final Model Scores

# Reshape the data for a grouped bar plot
df_melted = df_final.melt(
    id_vars="Embedding Layers",
    value_vars=["Public Score"],
    var_name="Score Type",
    value_name="Score",
)

# Create the grouped bar plot
plt.figure(figsize=(10, 3))
ax = sns.barplot(data=df_melted, x="Embedding Layers", y="Score", hue="Score Type")
for container in ax.containers:
    ax.bar_label(container, fmt="%.3f")
plt.ylim((0, 0.85))
plt.title("Scores by Embedding Layers")
plt.show()

In [None]:
#| label: tbl-final-public-scores
#| tbl-cap: Final Model Scores

pd.set_option('display.max_rows', 500)
df_melted

In @fig-final-scores-combined-public and @tbl-final-public-scores, the final Kaggle scores for each model are visualized and tabulated. These scores represent the performance of the models on the public leaderboard, providing a benchmark for comparison. Based on these results, the model with 2 embedding layers achieved the highest public score, suggesting that simpler architectures may offer better generalization for this task. This outcome aligns with observations from the validation F1 scores, where models with fewer embedding layers demonstrated comparable or superior performance relative to more complex configurations. The remaining specifications shared by these models are listed in @tbl-final-specs.

### Final Model Specifications


In [None]:
#| label: tbl-final-specs
#| tbl-cap: Final Model Specifications

from IPython.display import display, Markdown

df_final['Start Learning Rate'] = 0.001
# df_final holds only the final epoch, so its learning rate is the ending rate
df_final['End Learning Rate'] = df_final['Learning Rate']

df_specs = (
    df_final[
        [
            "Learning Optimization",
            "Start Learning Rate",
            "End Learning Rate",
            "Specified Epochs",
            "Batch Size",
            "Data Level",
            "Vocab Size",
            "Tokenizer",
            "Embedding Dimensions",
            "Hidden Dimensions",
        ]
    ]
    .iloc[0]
    .T
)

# Convert series into df
df_specs = df_specs.reset_index()
df_specs.columns = ["Specification", "Value"]

display(Markdown(df_specs.to_markdown(index=False)))

### Additional Final Model Statistics


In [None]:
#| label: fig-final-scores-combined-computed
#| fig-cap: Final Model Scores - Validation Accuracy, Precision, and Recall

# Reshape the data for a grouped bar plot
df_melted = df_final.melt(
    id_vars="Embedding Layers",
    value_vars=["Validation Accuracy Score", "Validation Precision Score", "Validation Recall Score"],
    var_name="Score Type",
    value_name="Score",
)

# Create the grouped bar plot
plt.figure(figsize=(10, 3))
ax = sns.barplot(data=df_melted, x="Embedding Layers", y="Score", hue="Score Type")
for container in ax.containers:
    ax.bar_label(container, fmt="%.3f")
plt.ylim((0, 0.85))
plt.title("Scores by Embedding Layers")
plt.show()

In [None]:
#| label: tbl-final-specs-additional
#| tbl-cap: Final Model Additional Statistics

pd.set_option('display.max_rows', 500)
df_melted

In addition to the public Kaggle scores, the models were validated using accuracy, precision, recall, and F1 score. Accuracy, precision, and recall are shown in @fig-final-scores-combined-computed and @tbl-final-specs-additional. These metrics provide a comprehensive view of model performance and highlight differences in how the models handle the classification task.

1. **Accuracy:**
   - The model with 2 embedding layers achieved the highest validation accuracy (0.768221), outperforming both the 4-layer and 6-layer models.
   - While the accuracy decreases slightly with an increase in embedding layers, the difference is
     small.

2. **Precision:**
   - Precision is highest for the 2-layer model (0.781197), indicating its strength in minimizing false positives.
   - As the number of embedding layers increases, precision declines, with the 6-layer model scoring the lowest (0.730475).

3. **Recall:**
   - Recall improves with more embedding layers, with the 6-layer model achieving the highest score (0.699413). This suggests that models with more embedding layers are better at capturing true positives, albeit at the expense of increased false positives.
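
The trade-off described above follows from how each metric is built from confusion-matrix counts. The sketch below uses hypothetical counts (not results from this project) and an illustrative helper, `classification_metrics`; its `positive_ratio` entry mirrors the `Positive Ratio` column computed earlier from the validation counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of all tweets that the model flags as disaster-related
        "positive_ratio": (tp + fp) / total,
    }


# Hypothetical counts on 200 tweets, 100 of which are true disasters
conservative = classification_metrics(tp=60, fp=15, tn=85, fn=40)
permissive = classification_metrics(tp=70, fp=30, tn=70, fn=30)

for name, metrics in [("conservative", conservative), ("permissive", permissive)]:
    summary = ", ".join(f"{key} = {value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

The permissive model flags more tweets as disasters, which raises recall from 0.600 to 0.700 while precision falls from 0.800 to 0.700. This is the same direction of change observed between the 2-layer and 6-layer models.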




# Conclusion

## Project Summary

This project explored the classification of disaster-related tweets using Recurrent Neural Networks (RNNs) built with PyTorch. Various configurations, including embedding dimensions, tokenizer choices, and data cleaning levels, were systematically evaluated to tune the hyperparameters. The results showed that the 2-layer embedding model achieved the highest overall performance, with a validation accuracy of 0.768221, precision of 0.781197, recall of 0.670088, F1 score of 0.721389, and a public Kaggle score of 0.75973. The findings underscore the value of simplicity in model architecture, with more complex configurations yielding diminishing returns.

## Lessons Learned

1. **Model Architecture and Complexity:**
   Increasing embedding layers and introducing dropout led to marginal stability improvements but did not significantly enhance validation F1 scores. Simpler architectures performed comparably or better in most cases.

2. **Tokenizer and Data Cleaning:**
   The cased tokenizer demonstrated equivalent performance to the uncased version, justifying the retention of the original data's case sensitivity. Altering data levels disrupted tokenizer behavior without providing measurable benefits.

3. **Layer Size:**
   Models with fewer embedding layers converged faster and triggered the learning rate scheduler earlier, suggesting efficiency advantages in training. All models reached stability within 50 epochs, highlighting the importance of early stopping to reduce computational overhead.

4. **Evaluation Metrics:**
   The F1 score proved to be the most informative metric for this dataset, balancing precision and recall effectively. Relying solely on accuracy or public Kaggle scores would have overlooked critical trade-offs in model performance.

## Areas for Improvement / Future Work

1. **Feature Engineering:**
   Incorporating additional features, such as sentiment analysis scores or tweet metadata, could enhance the model's ability to capture nuanced patterns in the data.

2. **Architectural Changes:**
   Future experiments could include transformer-based architectures, such as BERT or GPT, to assess whether advanced models outperform RNNs for this task.

3. **Generalization Analysis:**
   While this project focused on disaster-related tweets, extending the dataset to include non-disaster events could help test the model's generalization capabilities across broader text classification domains.