---
title: "NLP Text Classification Summary Report"
date: today
date-format: long
author: "Steven Ndung'u"
format:
  html:
    toc: true
    toc-depth: 2
    toc-location: left
    page-layout: full
    theme:
          light: flatly
          dark: darkly
    number-sections: false
    highlighting: true
    smooth-scroll: true
    code-fold: true
    highlight-style: github
    self-contained: true
execute:
    echo: true
    warning: false
    enable: true

title-block-banner: true

---

```{=html}
<style type="text/css">

h1.title {
  font-size: 0px;
  color: White;
  text-align: center;
}
h4.author { /* Header 4 - the author and date headers use this too */
    font-size: 16px;
  font-family: "Source Sans Pro Semibold", Times, serif;
  color: Red;
  text-align: center;
}
h4.date { /* Header 4 - the author and date headers use this too */
  font-size: 16px;
  font-family: "Source Sans Pro Semibold", Times, serif;
  color: Red;
  text-align: center;
}
</style>
```


------------------------------------------------------------------------
:::{.column-page}

::: {style="text-align:center"}
<h2>Text Classification Challenge</h2>
:::

<br>


In [None]:
#| echo: false
#| code-fold: false
#| 
###################################################

###################################################
#$Env:QUARTO_PYTHON = "C:\Users\P307791\Anaconda3\python.exe"
import os
# PYTHONHASHSEED must be "random" or an integer; set here it only affects child processes
os.environ['PYTHONHASHSEED'] = '0'
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
#pio.renderers.default = "notebook"
import torch.nn as nn
from scipy import stats

from IPython.display import display, Markdown, HTML
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)

### Problem Statement

The task is to build a text classification model that accurately predicts whether a given movie review expresses a positive or negative sentiment. Sentiment analysis is a critical task in NLP with applications in marketing, customer feedback, social media monitoring, and more. Accurately classifying sentiments can provide valuable insights into customer opinions and help businesses make data-driven decisions.

### Why This Task is Important

Understanding customer sentiment through text data is crucial for businesses and organizations to respond effectively to customer needs and preferences. By automating the sentiment analysis process, companies can efficiently analyze vast amounts of data, identify trends, and make informed strategic decisions. For this challenge, we will use the IMDb dataset, a widely used benchmark in sentiment analysis, to train and evaluate our model.

### Dataset Description

The dataset used for this challenge is the IMDb movie reviews dataset, which contains 50,000 reviews labeled as either positive or negative. This dataset is balanced, with an equal number of positive and negative reviews, making it ideal for training and evaluating sentiment analysis models.

- **Columns:**
  - `review`: The text of the movie review.
  - `sentiment`: The sentiment label (`positive` or `negative`).

The IMDb dataset provides a real-world scenario where understanding sentiment can offer insights into public opinion about movies, directors, and actors, as well as broader trends in the entertainment industry.
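The column layout and the class balance described above can be checked with a few lines of pandas. The snippet below is a minimal sketch on a four-row toy frame (the rows are invented for illustration). The same calls should apply to the full data once it is loaded from `IMDB Dataset.csv`.

```python
import pandas as pd

# Toy stand-in for the real data: same two columns, invented rows.
toy = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "A masterpiece", "Fell asleep"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Share of each label; a balanced dataset has about 0.5 for both classes.
label_share = toy["sentiment"].value_counts(normalize=True)
is_balanced = bool((label_share - 0.5).abs().max() < 0.01)

# Binary encoding used later in this report: positive -> 1, negative -> 0.
toy["label"] = (toy["sentiment"] == "positive").astype(int)

print(label_share)
print("balanced:", is_balanced)
```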

### Approach

Transformers have revolutionized NLP by allowing models to consider the context of a word based on surrounding words, enabling better understanding and performance on various tasks, including sentiment analysis. Their ability to transfer learning from massive datasets and adapt to specific tasks makes them highly effective for text classification.
 



<br>

### Data Exploration and Preprocessing

The data is generally clean; the only preprocessing required is removing special characters and converting the text to lowercase.
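The `clean_text` helper used below is imported from `utils` and is not shown in this report. As a rough illustration of the preprocessing described above, a cleaning function might look like the following sketch. The name `clean_text_sketch` is hypothetical, and stripping HTML tags such as `<br />` is an assumption added here, not something confirmed by the original helper.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for utils.clean_text (the real helper is not shown)."""
    text = re.sub(r"<[^>]+>", " ", text)    # assumption: drop HTML tags such as <br />
    text = text.lower()                     # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # remove special characters and digits
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text_sketch("Great movie!<br /><br />10/10, would WATCH again."))
# great movie would watch again
```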

#### Preview the data:

In [None]:
#| echo: false
#| code-fold: false
#| 
import pandas as pd
import numpy as np
import re, os, random, torch
import seaborn as sns
from collections import defaultdict
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from IPython.display import display, Markdown, HTML
from utils import *
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
df = pd.read_csv('IMDB Dataset.csv')
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x=='positive' else 0)
df['review'] = df['review'].apply(clean_text)

In [None]:
#| echo: false
#| code-fold: false
#|
res = df.head()

html_table = res.to_html(index=True)

# Wrap in a scrollable div
scrollable_table = f"""
<div style="height: 400px; width: 100%; overflow-x: auto; overflow-y: auto;">
    {html_table}
</div>
"""
# Display the scrollable table
display(HTML(scrollable_table))

### Model Implementation


In [None]:
import pickle
from collections import Counter
from torch.utils.data import TensorDataset
from utils import *


# Load IMDb Dataset
df = pd.read_csv('IMDB Dataset.csv')
#print('IMDB Dataset.csv data loaded ...')
# Preprocess the dataset
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)  # Convert sentiments to binary
df['review'] = df['review'].apply(clean_text)

# Simple tokenization process
def tokenize(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters and numbers
    tokens = text.split()
    return tokens

# Build Vocabulary
def build_vocab(reviews):
    vocab = Counter()
    for review in reviews:
        tokens = tokenize(review)
        vocab.update(tokens)
    
    vocab = {word: i+3 for i, (word, _) in enumerate(vocab.most_common())}  # Start index from 3
    vocab['<PAD>'] = 0  # Padding token
    vocab['<UNK>'] = 1  # Unknown token
    vocab['[CLS]'] = 2  # [CLS] token
    return vocab

# Convert text to numerical indices
def text_to_indices(text, vocab, max_len):
    tokens = ['[CLS]'] + tokenize(text)
    indices = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    if len(indices) < max_len:
        indices += [vocab['<PAD>']] * (max_len - len(indices))  # Padding
    else:
        indices = indices[:max_len]  # Truncate if longer than max_len
    return indices

# Maximum sequence length (adjusted to include [CLS] token)
max_len = 401  # max_len (400) + 1 for [CLS] token

# Tokenize and encode the dataset
if os.path.exists('tokenized_reviews_vocab.npy') and os.path.exists('vocab.pkl'):
    tokenized_reviews = np.load('tokenized_reviews_vocab.npy')
    with open('vocab.pkl', 'rb') as f:
        vocab = pickle.load(f)
else:
    reviews = df['review'].tolist()
    vocab = build_vocab(reviews)
    # Convert reviews to input IDs
    tokenized_reviews = [text_to_indices(review, vocab, max_len) for review in reviews]
    np.save('tokenized_reviews_vocab.npy', np.array(tokenized_reviews))
    with open('vocab.pkl', 'wb') as f:
        pickle.dump(vocab, f)

# Convert arrays into PyTorch tensors
inputs_input_ids = torch.tensor(tokenized_reviews).to(device)
labels = torch.tensor(df['sentiment'].values).to(device)

#print('Tokenization and vocabulary building complete ...')

# Split the dataset into training, validation, and test sets (70%, 15%, 15%)
train_inputs, valid_test_inputs, train_labels, valid_test_labels = train_test_split(
    inputs_input_ids, labels, test_size=0.3, random_state=100, shuffle=True
)
valid_inputs, test_inputs, valid_labels, test_labels = train_test_split(
    valid_test_inputs, valid_test_labels, test_size=0.5, random_state=100, shuffle=True
)

# Create DataLoader
train_dataset = TensorDataset(train_inputs, train_labels)
valid_dataset = TensorDataset(valid_inputs, valid_labels)
test_dataset = TensorDataset(test_inputs, test_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

#print('Data loaders complete ...')

##### Architecture Overview:
@fig-methodology illustrates the proposed workflow model (Transformer-based architecture) designed for IMDb sentiment classification:



::: {.column-page layout-ncol=1}

![](text_classification.png){#fig-methodology}


Proposed Transformer-based architecture for IMDb sentiment classification.
:::

The transformer-based sentiment analysis model is a sophisticated deep learning architecture designed to effectively capture and classify the sentiment expressed in text data. It leverages the power of transformer encoder blocks, which employ multi-head self-attention mechanisms to understand complex relationships between words, regardless of their distance in the sentence. The model initially processes the input text by converting it into numerical representations, known as token IDs. These tokens are then embedded into a continuous vector space where semantic meaning can be captured. To preserve the context of word order, positional embeddings are added to the token embeddings.

The core of the model consists of multiple transformer encoder layers, each containing two main sublayers: multi-head self-attention and position-wise feed-forward neural network. The multi-head self-attention sublayer allows the model to weigh the importance of different words based on their relevance to the overall sentiment. The feed-forward neural network applies non-linear transformations to capture complex patterns in the input data.

After passing through the transformer encoder stack, the model selects the hidden state corresponding to the [CLS] token, which serves as a summary representation of the entire sequence. This representation is then fed into a linear classification layer that maps it to one logit per class; a softmax converts the logits into class probabilities. The final output of the model is a prediction of the sentiment expressed in the input text, either positive or negative.
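The pipeline described above (token embeddings plus positional embeddings, an encoder stack, then a linear layer on the [CLS] position) can be sketched with PyTorch's built-in encoder modules. This is not the `TransformerForSequenceClassification` class from `utils`, which is not shown here. The class name `TinyClsTransformer` and all sizes are illustrative assumptions.

```python
import torch
from torch import nn

class TinyClsTransformer(nn.Module):
    """Illustrative sketch only; uses nn.TransformerEncoder as a stand-in encoder."""

    def __init__(self, vocab_size, max_len, hidden_size=64, num_heads=4,
                 num_layers=2, intermediate_size=128, dropout=0.1,
                 num_labels=2, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.token_emb = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_id)
        self.pos_emb = nn.Embedding(max_len, hidden_size)  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=intermediate_size, dropout=dropout,
            activation="gelu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len) with [CLS] at position 0
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions.unsqueeze(0))
        pad_mask = input_ids == self.pad_id          # True where padded
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.classifier(x[:, 0])              # logits from the [CLS] state

model = TinyClsTransformer(vocab_size=50, max_len=16).eval()
toy_ids = torch.tensor([[2, 5, 6, 0, 0], [2, 7, 8, 9, 3]])  # 2 = [CLS], 0 = <PAD>
with torch.no_grad():
    logits = model(toy_ids)
print(logits.shape)  # torch.Size([2, 2])
```

The padding mask keeps attention from reading `<PAD>` positions, which matters here because every review is padded or truncated to a fixed length.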

Why This Architecture is Ideal for Text Classification (IMDb Dataset):

The Transformer-based architecture is ideal for IMDb sentiment classification due to its ability to handle variable-length sequences, efficiently capturing long-range dependencies across the text. The self-attention mechanism allows the model to focus on sentiment-bearing words and understand their contextual meaning by attending to all parts of the sequence. The use of the [CLS] token for sequence-level classification encodes the overall sentiment of the entire review, making it effective for binary sentiment prediction. Layer normalization and residual connections stabilize the training of deep architectures, while feed-forward networks introduce non-linearity, helping the model learn complex patterns in the data. Multi-head attention provides diverse perspectives on the input, enhancing the richness of the representations, and positional embeddings maintain the sequence structure. Regularization techniques, such as dropout, prevent overfitting, and the flexibility of hyperparameters allows the model to adapt to different datasets and resources. This architecture is also scalable and aligns with popular models like BERT, offering potential for future transfer learning. 
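To make the self-attention step more concrete, the following is a minimal single-head sketch of scaled dot-product attention with a padding mask. It is a generic illustration rather than the implementation in `utils`. The function name and tensor shapes are assumptions chosen for the example.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, pad_mask=None):
    """q, k, v: (batch, seq_len, d). pad_mask: (batch, seq_len), True at padding."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, seq, seq)
    if pad_mask is not None:
        # Block every query from attending to padded key positions.
        scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # 2 sequences, 4 tokens, 8 features
mask = torch.tensor([[False, False, True, True],
                     [False, False, False, False]])
out, attn = scaled_dot_product_attention(x, x, x, mask)
print(out.shape, attn[0, 0])
```

Multi-head attention runs several such operations in parallel on different learned projections of the input and concatenates the results.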

### Training and Evaluation

To choose the hyperparameters, a limited grid search is performed over a sampled set of hyperparameter combinations. For each combination, the model is trained on the training set and its accuracy is computed on a separate validation set. The combination with the highest validation accuracy is selected as the best set. Once it is identified, the model is retrained on the training data with those hyperparameters. Finally, the model is evaluated on an unseen test set to obtain the final test accuracy. Because the test set plays no part in training or hyperparameter selection, this accuracy estimates how well the model generalizes rather than how well it fits the training and validation data.
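The list `sampled_combinations` used in the code below is not defined in this report. One plausible way to build such a list is to enumerate a grid and draw a random subset, as in the sketch below. The candidate values, the sample size, and the seed are illustrative assumptions. Only the order of the fields follows the tuple unpacking in `hyperparameter_search`.

```python
import itertools
import random

# Hypothetical search space; the values actually searched are not shown.
search_space = {
    "lr": [1e-4, 5e-4, 1e-3],
    "num_hidden_layers": [2, 4, 8],
    "num_attention_heads": [4, 8],
    "hidden_size": [128, 256],
    "intermediate_size": [256, 512],
    "hidden_dropout_prob": [0.1, 0.3],
    "activation_function": ["relu", "gelu"],
}

# Full Cartesian grid, keeping only combinations where the hidden size
# can be split evenly across the attention heads.
full_grid = [
    combo for combo in itertools.product(*search_space.values())
    if combo[3] % combo[2] == 0
]

random.seed(100)  # reproducible sample
sampled_combinations_sketch = random.sample(full_grid, k=10)

print(len(full_grid), "combinations in the grid,",
      len(sampled_combinations_sketch), "sampled")
```

Sampling keeps the cost manageable: each configuration requires a full training run, so evaluating the whole grid is often impractical.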


In [None]:
# Training function

def train(model, train_loader, valid_loader, optimizer, scheduler, loss_fn, device, epochs, early_stopping_patience=5):
    train_losses = []
    val_losses = []
    train_accuracies = []
    val_accuracies = []
    
    best_accuracy = 0  # Track the best validation accuracy
    patience_counter = 0  # Counter for early stopping
    
    for epoch in range(epochs):
        model.train()
        running_loss = 0
        correct = 0
        total = 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        
        train_loss = running_loss / len(train_loader)
        train_acc = 100 * correct / total
        train_losses.append(train_loss)
        train_accuracies.append(train_acc)
        
        # Evaluate on the validation set
        val_loss, val_acc = evaluate(model, valid_loader, loss_fn, device)
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)
        
        scheduler.step()

        # Checkpoint the best model so far (the early-stopping break below is disabled)
        if val_acc > best_accuracy:
            best_accuracy = val_acc
            patience_counter = 0
            torch.save(model.state_dict(), 'best_model.pth')
            
            print(f"Epoch {epoch+1}: Best model saved with accuracy: {best_accuracy:.2f}%")
        # else:
        #     patience_counter += 1
        #     if patience_counter >= early_stopping_patience:
        #         print("Early stopping triggered.")
        #         break
        
        print(f"Epoch [{epoch+1}/{epochs}], Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")
    
        
    return train_losses, val_losses, train_accuracies, val_accuracies

# Evaluation function
def evaluate(model, data_loader, loss_fn, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    avg_loss = total_loss / len(data_loader)
    accuracy = 100 * correct / total
    return avg_loss, accuracy

def hyperparameter_search(hyperparameter_combinations, results_df):
    for i, params in enumerate(hyperparameter_combinations):
        print(f"\nRunning hyperparameter set {i+1}/{len(hyperparameter_combinations)}")
        
        # Unpack hyperparameters
        lr, num_hidden_layers, num_attention_heads, hidden_size, intermediate_size, hidden_dropout_prob, activation_function = params
        
        # Prepare DataLoaders (batch size is fixed at 32, not part of the search)
        train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
        valid_loader = DataLoader(valid_dataset, batch_size=32)
        
        # Initialize the model with current hyperparameters
        config = TransformerConfig(
            vocab_size=len(vocab),
            hidden_size=hidden_size,
            num_attention_heads=num_attention_heads,
            num_hidden_layers=num_hidden_layers,
            intermediate_size=intermediate_size,
            hidden_dropout_prob=hidden_dropout_prob,
            max_position_embeddings=max_len,
            num_labels=2,
            activation_function=activation_function
        )
        model = TransformerForSequenceClassification(config).to(device)
        
        # Initialize optimizer and loss function
        optimizer = optim.Adam(model.parameters(), lr=lr)
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
        loss_fn = nn.CrossEntropyLoss()
        
        # Train the model
        epochs = 100 
        
        train_losses, val_losses, train_accuracies, val_accuracies = train(model, train_loader, valid_loader, optimizer, scheduler, loss_fn, device, epochs, early_stopping_patience=5)
        
        # Get the best validation accuracy
        best_val_accuracy = max(val_accuracies)
        best_val_loss = min(val_losses)
        
        # Save the results
        current_result = pd.DataFrame([{
            'lr': lr,
            'num_hidden_layers': num_hidden_layers,
            'num_attention_heads': num_attention_heads,
            'hidden_size': hidden_size,
            'intermediate_size': intermediate_size,
            'hidden_dropout_prob': hidden_dropout_prob,
            'activation_function': activation_function,
            'val_loss': best_val_loss,
            'val_accuracy': best_val_accuracy
         }])

        results_df = pd.concat([results_df, current_result], ignore_index=True)
        
        # Keep this configuration if it is the best so far. train() has already
        # written this run's best-epoch weights to 'best_model.pth', so reload them
        # instead of overwriting the checkpoint with the last-epoch weights.
        if best_val_accuracy == results_df['val_accuracy'].max():
            model.load_state_dict(torch.load('best_model.pth', map_location=device))
            results_df.to_csv('results_df.csv', index=False)
            print(f"New best model saved with validation accuracy: {best_val_accuracy:.2f}%")

            # Plot loss and accuracy curves
            plot_curves(train_losses, val_losses, train_accuracies, val_accuracies)
            # Evaluate the reloaded best-epoch checkpoint on the test set
            test_loss, test_acc = evaluate(model, test_loader, loss_fn, device)
            print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.2f}%")

        
    return results_df

# DataFrame to store hyperparameters and validation scores:
results_df = pd.DataFrame(columns=[
    'lr',
    'num_hidden_layers',
    'num_attention_heads',
    'hidden_size',
    'intermediate_size',
    'hidden_dropout_prob',
    'activation_function',
    'val_loss',
    'val_accuracy'
])

#results_df = hyperparameter_search(sampled_combinations, results_df)
# Sort the results by validation accuracy
#sorted_results = results_df.sort_values(by='val_accuracy', ascending=False)

# Display the top 5 configurations
#print("Top 5 Hyperparameter Configurations:")
#print(sorted_results.head(5))

columns = [
    'lr',
    'num_hidden_layers',
    'num_attention_heads',
    'hidden_size',
    'intermediate_size',
    'hidden_dropout_prob',
    'activation_function'
]

df_hyperparams = pd.DataFrame(sampled_combinations, columns=columns)

Sampled hyperparameters:


::: {.callout-important}
The sampled configurations cover only a small part of the possible search space. The search can be extended, for example by trying further activation functions, experimenting with different loss functions, or tuning additional settings such as learning-rate schedules and optimizer types. Exploring a broader range of hyperparameters may identify configurations that perform better.
:::


In [None]:
#| echo: false
#| code-fold: false
#|


html_table = df_hyperparams.to_html(index=True)

# Wrap in a scrollable div
scrollable_table = f"""
<div style="height: 400px; width: 100%; overflow-x: auto; overflow-y: auto;">
    {html_table}
</div>
"""
# Display the scrollable table
display(HTML(scrollable_table))

### Prediction and Inference


::: {.callout-note}

Based on the limited grid search I obtained a best test accuracy of 88%.
:::


In [None]:
# loss_fn = nn.CrossEntropyLoss()
# test_loss, test_acc = evaluate(model, test_loader, loss_fn, device)

# Initialize the model and load the saved state
config = TransformerConfig(
          vocab_size=len(vocab),
          hidden_size=256,
          num_attention_heads=8,
          num_hidden_layers=8,
          intermediate_size=512,
          hidden_dropout_prob=0.1,
          max_position_embeddings=max_len,
          num_labels=2,
          activation_function='gelu'
      )
model = TransformerForSequenceClassification(config).to(device)
model.load_state_dict(torch.load('best_model.pth', map_location=torch.device('cpu')))
model = model.eval()



# - vocab: the vocabulary used for tokenizing text
# - max_len: the maximum length of the tokenized review
# - class_names: list of class names (e.g., ['negative', 'positive'])
# - model: the trained Transformer model



# Tokenize and convert the review to input IDs
def preprocess_review(review_text, vocab, max_len):
    # Tokenize the review text the same way as during training
    tokens = review_text.lower()
    tokens = re.sub(r'[^a-zA-Z\s]', '', tokens).split()  # Basic tokenization
    # Prepend [CLS], as in text_to_indices: the classifier reads position 0
    tokens = ['[CLS]'] + tokens
    # Convert tokens to IDs using the vocabulary
    token_ids = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    # Pad or truncate to max_len
    if len(token_ids) < max_len:
        token_ids += [vocab['<PAD>']] * (max_len - len(token_ids))  # Padding
    else:
        token_ids = token_ids[:max_len]
    return torch.tensor([token_ids]).to(device)

# Predict the sentiment of a review
def predict_sentiment(review_text):
    # Preprocess the review
    input_ids = preprocess_review(review_text, vocab, max_len)
    
    # Make prediction
    with torch.no_grad():
        logits = model(input_ids)
        probabilities = torch.softmax(logits, dim=1)
        _, prediction = torch.max(probabilities, dim=1)
    
    return int(prediction), probabilities

class_names = ['Negative', 'Positive']
# Print results for a single review
def Print_Results(review_text_row):
    selected_review = df.review.iloc[review_text_row]  # Get the review text from the DataFrame
    prediction, probabilities = predict_sentiment(selected_review)
    sentiment = class_names[prediction]
    
    # Display the prediction result
    print(f"Selected Review: {selected_review}")
    print(f"Predicted Sentiment: {sentiment}")
    print(f"Confidence: {probabilities[0][prediction].item() * 100:.2f}%")

# Example usage with reviews from the dataset:
print('#######################################')
print('##          Example 1                ##')
print('#######################################')
print('\n')
Print_Results(2)
print('\n')
print('#######################################')
print('##          Example 2                ##')
print('#######################################')
print('\n')
Print_Results(480)

Example of model training and validation curves:

![](Train_valid_curves_v4.png)



### Model Deployment

The model is deployed at: [deployed text classification model](https://textclassificationdemo.streamlit.app/).

@fig-negative and @fig-positive show two examples taken from the [app](https://textclassificationdemo.streamlit.app/).

::: {.column-page layout-ncol=1}

![Negative](example1.png){#fig-negative}

![Positive](example2.png){#fig-positive}

Model predictions for two example reviews from the IMDb dataset.
:::

More details can be found in my [GitHub repository](https://github.com/stevenndungu/text_classification):

 - [Flask App script](https://github.com/stevenndungu/text_classification/blob/main/flask_app.py)

 - [Streamlit App script](https://github.com/stevenndungu/text_classification/blob/main/streamlit_app.py)


:::
