# Step 1: Data Preparation

1.1 Loading the Dataset

For simplicity, let's stick with the IMDB dataset, which can be loaded using PyTorch's torchtext library.

### General Description of the IMDB Dataset:

**Task: Binary Sentiment Classification

Objective: Classify the sentiment of a movie review as either positive (label = 1) or negative (label = 0).
Number of Classes: 2 (Positive, Negative)
Dataset Size:
Training Set: 25,000 reviews
Test Set: 25,000 reviews
Total: 50,000 reviews

**Dataset Characteristics:

Review Content: The reviews are in plain text, consisting of movie reviews written by users. These reviews vary in length and writing style, making the dataset a good testbed for text preprocessing, sentiment analysis, and text classification.
Balanced Dataset: The dataset contains an equal number of positive and negative reviews, making it balanced for binary classification tasks.
Positive reviews: 25,000
Negative reviews: 25,000
Review Length: The average length of reviews can vary, but some reviews are quite long (even exceeding 1,000 words). Typically, longer reviews tend to have more diverse opinions.

In [14]:
import torch
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

# Load IMDB dataset
train_iter, test_iter = IMDB(split=('train', 'test'))

# Tokenizer
tokenizer = get_tokenizer('basic_english')

# Function to yield tokenized text
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# Build vocabulary from training data
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Reset iterators (as they are exhausted after building vocab)
train_iter, test_iter = IMDB(split=('train', 'test'))

In [15]:
# Convert IMDB iterators to lists
train_list = list(train_iter)
test_list = list(test_iter)

print(train_list[0])
print("Training cases: ", len(train_list))

print(test_list[0])
print("Test cases: ", len(test_list))

(1, 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between

# 1.2 Text Preprocessing and DataLoader

Convert the text data into tokenized sequences and pad them to a fixed length.

In [16]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data.dataset import random_split

# Set sequence length
max_length = 200

# Text processing function: convert text to numerical tokens and pad
def text_pipeline(text):
    tokens = vocab(tokenizer(text))
    if len(tokens) > max_length:
        return tokens[:max_length]
    else:
        return tokens + [0] * (max_length - len(tokens))

# Label processing function: convert label to binary (pos=1, neg=0)
def label_pipeline(label):
    return 1 if label == 1 else 0

# Prepare the data iterators
def collate_batch(batch):
    text_list, label_list = [], []
    for label, text in batch:
        label_list.append(label_pipeline(label))
        text_list.append(torch.tensor(text_pipeline(text), dtype=torch.int64))
    return torch.stack(text_list), torch.tensor(label_list, dtype=torch.float32)

train_dataloader = DataLoader(train_list, batch_size=64, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_list, batch_size=64, shuffle=False, collate_fn=collate_batch)
print("finish")


finish


### Add Pre-trained Embedding

We plan to use pre-trained word embeddings from GloVe. We will fine-tune these embeddings for the movie reviews dataset.

In [23]:
#Use pre-trained word embeddings from GloVe. 

import numpy as np
# Path to the GloVe file
glove_file = 'glove.6B\glove.6B\glove.6B.100d.txt'  # Change path as needed

# Initialize an empty dictionary
embeddings_index = {}

# Load the GloVe vectors
with open(glove_file, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Define the embedding dimension (this should match your GloVe vectors, e.g., 100 for 100d GloVe)
vocab_size = len(vocab)
embedding_dim = 100

# Initialize the embedding matrix (vocab_size x embedding_dim)
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Fill the embedding matrix with GloVe vectors
for word, i in train_dataloader:
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

### Step 3: Building the Feedforward Neural Network Model

**3.1 Defining the Model Architecture in PyTorch

Build a simple feedforward neural network using pre-trained embeddings.

In [22]:
import torch.nn as nn

class FFNN_TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super(FFNN_TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = True
        self.fc1 = nn.Linear(embedding_dim * max_length, output_dim)
        self.relu = nn.ReLU()
    
    def forward(self, text):
        embedded = self.embedding(text)
        embedded = embedded.view(embedded.size(0), -1)  # Flatten
        output = self.relu(self.fc1(embedded))
        return output

# Instantiate the model
embedding_dim = 100
vocab_size = len(vocab)
output_dim = 1  # Binary classification

print("vocab_size, embedding_dim, hidden_dim, output_dim", vocab_size, embedding_dim, output_dim)
model = FFNN_TextClassifier(vocab_size, embedding_dim, output_dim)


vocab_size, embedding_dim, hidden_dim, output_dim 100683 100 1


### Step 4: Training the Model

**4.1 Define Loss Function, Optimizer, and Training Loop

In [24]:
import torch.optim as optim

# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()  # Binary classification, so we use binary cross-entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
def train(model, dataloader, criterion, optimizer):
    model.train()
    total_loss, total_acc = 0, 0
    for text, labels in dataloader:
        optimizer.zero_grad()
        output = model(text)
        loss = criterion(output.squeeze(), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    train_loss = train(model, train_dataloader, criterion, optimizer)
    print(f'Epoch: {epoch+1}, Training Loss: {train_loss:.8f}')


Epoch: 1, Training Loss: 0.56040600
Epoch: 2, Training Loss: 0.43289931
Epoch: 3, Training Loss: 0.38790478
Epoch: 4, Training Loss: 0.37351632
Epoch: 5, Training Loss: 0.36886769
Epoch: 6, Training Loss: 0.36614848
Epoch: 7, Training Loss: 0.36528832
Epoch: 8, Training Loss: 0.36512598
Epoch: 9, Training Loss: 0.36373957
Epoch: 10, Training Loss: 0.36340582


### Step 5: Model Evaluation

**5.1 Evaluating the Model on the Test Set

In [21]:
import sklearn
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(model, dataloader, criterion):
    model.eval()
    total_loss, total_acc = 0, 0
    all_preds, all_labels = [], []
    with torch.no_grad():
        for text, labels in dataloader:
            output = model(text)
            loss = criterion(output.squeeze(), labels)
            preds = torch.round(torch.sigmoid(output))
            all_preds.extend(preds.numpy())
            all_labels.extend(labels.numpy())
            #print("all_preds", all_preds)
            #print("all_labels", all_labels)
            total_loss += loss.item()
    #print(sum(all_labels))
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')
    return total_loss / len(dataloader), accuracy, precision, recall, f1

# Evaluate the model
test_loss, test_accuracy, precision, recall, f1 = evaluate(model, test_dataloader, criterion)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}')


Test Loss: 0.5645, Test Accuracy: 0.8396
Precision: 0.8687, Recall: 0.8000, F1 Score: 0.8330
