Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data using different approaches to classification
- stress test your model (to some extent)


Name: __Abhishek Rao__

This is about the __provided__ movie review data set.

1. Where did you get the data from? The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)?
The data was collected from IMDB reviews of movies

3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)
Ans: DEV-Total Reviews: 200
Total Tokens: 47158

      TRAIN-Total Reviews: 1600
Total Tokens: 369136

4. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc)
The data was collected from IMDB reviews movies

5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
The authors of the data are the people that watched movies and had polarizing views on it.
6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)?

Ans:

        DEV- Sentiment
                1    105
                0     95

        TRAIN-Sentiment
                1    804
                0    796
7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)?

Ans:

      DEV-Vocabulary Size: 11000

      TRAIN- Vocabulary Size: 44008
8. (1 pt) How big is the overlap between the vocabulary for the train and dev set?

Ans:

      Vocabulary Overlap Size: 6973

      Overlap as a percentage of Training Vocabulary: 15.84%

      Overlap as a percentage of Development Vocabulary: 63.39%

In [23]:
import pandas as pd
from math import log

import pandas as pd

train_file_path = 'movie_reviews_train.txt'
train_data = pd.read_csv(train_file_path, sep="\t", header=None, names=['ID', 'Review', 'Sentiment'])
train_data['Tokens'] = train_data['Review'].apply(lambda x: x.lower().split())
train_vocab = set(word for review in train_data['Tokens'] for word in review)

dev_file_path = 'movie_reviews_dev.txt'
dev_data = pd.read_csv(dev_file_path, sep="\t", header=None, names=['ID', 'Review', 'Sentiment'])
dev_data['Tokens'] = dev_data['Review'].apply(lambda x: x.lower().split())
dev_vocab = set(word for review in dev_data['Tokens'] for word in review)

vocab_overlap = train_vocab.intersection(dev_vocab)
overlap_size = len(vocab_overlap)

print(f"Vocabulary Overlap Size: {overlap_size}")
print(f"Overlap as a percentage of Training Vocabulary: {100 * overlap_size / len(train_vocab):.2f}%")
print(f"Overlap as a percentage of Development Vocabulary: {100 * overlap_size / len(dev_vocab):.2f}%")


Vocabulary Overlap Size: 6973
Overlap as a percentage of Training Vocabulary: 15.84%
Overlap as a percentage of Development Vocabulary: 63.39%


In [24]:
import re
import nltk
nltk.download('punkt')
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from collections import Counter
import time
from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords = stopwords.words('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
# The following function reads a data-file and splits the contents by tabs.
# The first column is an ID, and thus is discarded. The second column consists of the actual reviews data.
# The third column is the true label for each data point.

# The function returns two objects - a list of all reviews, and a numpy array of labels.
# You will need to use this function later.

def get_lists(input_file):
    f=open(input_file, 'r')
    lines = [line.split('\t')[1:] for line in f.readlines()]
    X = [row[0] for row in lines]
    y=np.array([int(row[1]) for row in lines])
    return X, y


# If the vocabulary argument is supplied, then the function should only convert the input corpus
# to feature vectors using the provided vocabulary and the max_features argument (if not None).
# In this case, the function should return feature vectors and the supplied vocabulary.

# If the max_features parameter is set to None, then all words in the corpus should be used.
# If the max_features parameter is specified (say, k),
# then only use the k most frequent words in the corpus to build your vocabulary.

# The function should return two things.

# The first object should be a numpy array of shape (n_documents, vocab_size),
# which contains the TF-IDF feature vectors for each document.

# The second object should be a dictionary of the words in the vocabulary,
# mapped to their corresponding index in alphabetical sorted order.
from collections import defaultdict
import math

def get_tfidf_vectors(token_lists, max_features=None, vocabulary=None):

    doc_freqs = defaultdict(int)

    lowercase = [token_list.lower() for token_list in token_lists]
    token_lists = lowercase

    # Create a vocabulary from the corpus if it's not provided
    if vocabulary is None:
        vocab = set()
        for token_list in token_lists:
            tokens = nltk.word_tokenize(token_list)
            for token in tokens:
                if token not in vocab:
                    vocab.add(token)
        vocab = sorted(vocab)

        # Limit the vocabulary size if max_features is specified
        if max_features is not None:
            vocab = vocab[:max_features]
    else:
        vocab = vocabulary

    n_docs = len(token_lists)
    vocab_size = len(vocab)
    vocab_index = {term: i for i, term in enumerate(vocab)}

    # Calculate term frequencies and document frequencies in one pass
    tf = np.zeros((n_docs, vocab_size), dtype=np.float32)
    doc_freqs = np.zeros(vocab_size, dtype=np.float32)
    for i, doc in enumerate(token_lists):
        tokens = doc.split()
        term_freqs = Counter(tokens)
        doc_length = len(tokens)
        for term in term_freqs:
            if term in vocab_index:
                j = vocab_index[term]
                # tf[i, j] = term_freqs[term] / doc_length
                # doc_freqs[j] += 1
                tf_scale = 1 + math.log(term_freqs[term]) # Scale term freq within doc
                tf_scale /= doc_length
                tf[i, j] = tf_scale
                doc_freqs[j] += 1

    # Calculate inverse document frequencies
    idf = np.log(n_docs / (1 + doc_freqs)) + 1

    # Calculate TF-IDF
    tfidf = tf * idf[None, :]
    # epsilon = 1e-8  # Small constant to avoid division by zero
    # tfidf_norm = np.linalg.norm(tfidf, axis=1, keepdims=True)
    # tfidf_norm[tfidf_norm == 0] = epsilon  # Replace zero norms with epsilon
    # tfidf = tfidf / tfidf_norm

    return tfidf, vocab_index


We will now compare the runtime of our Tf-Idf implementation to the `sklearn` implementation. Call the respective functions with appropriate arguments in the code block below.

In [44]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
TEST_FILE = "movie_reviews_test.txt"

train_corpus, y_train = get_lists(TRAIN_FILE)

# First we will use our custom vectorizer to convert words to features, and time it.


start = time.time()
X_train_custom, vocab_custom = get_tfidf_vectors(train_corpus)
end = time.time()

print("Time taken: ", end-start, " seconds")

# Next we will use sklearn's TfidfVectorizer to load in the data, and time it.

start = time.time()
vectorizer = TfidfVectorizer()
X_train_sklearn = vectorizer.fit_transform(train_corpus)
end = time.time()

print("Time taken: ", end-start, " seconds")

Time taken:  4.23777437210083  seconds
Time taken:  0.42064905166625977  seconds


NOTE: Ideally, your vectorizer should be within one order of magnitude of the sklearn implementation.

In [45]:
def calculate_sparsity(X):
    num_zero = np.sum(X == 0, axis=1)  # Count zeros in each row (review)
    num_features = X.shape[1]  # Total number of features
    sparsity_percentages = (num_zero / num_features) * 100  # Percentage of zeros in each review
    avg_sparsity_percentage = np.mean(sparsity_percentages)  # Average across all reviews
    return avg_sparsity_percentage


In [46]:
print ("Length of vocab_custom: ", len(vocab_custom))
print ("Length of sklearn vocab: ", len(vectorizer.vocabulary_))
print ("Sparsity of custom features: ", calculate_sparsity(X_train_custom))
print ("Sparsity of sklearn features: ", calculate_sparsity(X_train_sklearn))

Length of vocab_custom:  27172
Length of sklearn vocab:  22601
Sparsity of custom features:  99.56511988443987


  print ("Sparsity of sklearn features: ", calculate_sparsity(X_train_sklearn))


Sparsity of sklearn features:  99.39285042697226


1. How large is the vocabulary generated by your vectorizer?<br> **Vocabulary size of custom vectorizer: 27172**
2. How large is the vocabulary generated by the `sklearn` TfidfVectorizer?<br> **Vocabulary size of sklearn TfidfVectorizer: 22601**
3. Where might these differences be coming from?<br> **The differences in vocabulary sizes could have come from several factors including differences in tokenization methods, handling of stop words, and the treatment of term frequencies (including the handling of rare terms).**
4. What steps did you take to ensure your vectorizer is optimized for best possible runtime?<br> **both term frequencies and doc freq were computed in one pass, ans used numpy array operations insead of python defaults; Utilized Python's collections.Counter to efficiently count term frequencies across documents**
5. How sparse are your custom features (average percentage of features per review that are zero)?<br> **Sparsity of custom features: 99.56%**
6. How sparse are the TfidfVectorizer's features?<br> **99.40%**

NOTE: if you set the lowercase option to False, the sklearn vectorizer should have a vocabulary of around 50k words/tokens.

**Logistic Regression**

Now, we will compare how our custom features stack up against sklearn's TfidfVectorizer, by training two separate Logistic Regression classifiers - one on each set of feature vectors. Then load the test set, and convert it to two sets of feature vectors, one using our custom vectorizer (to do this, provide the vocabulary as a function argument), and one using sklearn's Tfidf (use the same object as before to transform the test inputs). For both classifiers, print the average accuracy on the test set and the F1 score.

In [28]:

lr_custom = LogisticRegression()
lr_custom.fit(X_train_custom, y_train)

# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

test_corpus, y_test = get_lists(TEST_FILE)

X_test_custom, _ = get_tfidf_vectors(test_corpus, vocabulary=vocab_custom)
y_pred_custom = lr_custom.predict(X_test_custom)

# Print the accuracy of your model on the test data

accuracy_custom = f1_score(y_test, y_pred_custom)
print("F1 Score of custom vectorizer: ", accuracy_custom)


lr_sklearn = LogisticRegression()
lr_sklearn.fit(X_train_sklearn, y_train)

X_test_sklearn = vectorizer.transform(test_corpus)
y_pred_sklearn = lr_sklearn.predict(X_test_sklearn)

accuracy_sklearn = f1_score(y_pred_sklearn, y_pred_custom)
print("F1 Score of sklearn's TfidfVectorizer: ", accuracy_sklearn)


F1 Score of custom vectorizer:  0.8035714285714286
F1 Score of sklearn's TfidfVectorizer:  0.9411764705882354


NOTE: we're expecting to see a F1 score of around 80% using both your custom features and the sklearn features.

Finally, repeat the process (training and testing), but this time, set the max_features argument to 1000 for both our custom vectorizer and sklearn's Tfidfvectorizer. Report average accuracy and F1 scores for both classifiers.

In [29]:

X_train_custom_1000, vocab_custom_1000 = get_tfidf_vectors(train_corpus, max_features=1000)
X_test_custom_1000, _ = get_tfidf_vectors(test_corpus, vocabulary=vocab_custom_1000, max_features=1000)


lr_custom_1000 = LogisticRegression()
lr_custom_1000.fit(X_train_custom_1000, y_train)
y_pred_custom_1000 = lr_custom_1000.predict(X_test_custom_1000)


# Print the accuracy of your model on the test data


f1_custom_1000 = f1_score(y_test, y_pred_custom_1000)
print("F1 score of custom vectorizer with max_features=1000: ", f1_custom_1000)


# Now repeat the above steps, but this time using features extracted by sklearn's Tfidfvectorizer

vectorizer_1000 = TfidfVectorizer(max_features=1000)
X_train_sklearn_1000 = vectorizer_1000.fit_transform(train_corpus)
X_test_sklearn_1000 = vectorizer_1000.transform(test_corpus)

lr_sklearn_1000 = LogisticRegression()
lr_sklearn_1000.fit(X_train_sklearn_1000, y_train)
y_pred_sklearn_1000 = lr_sklearn_1000.predict(X_test_sklearn_1000)

f1_sklearn_1000 = f1_score(y_test, y_pred_sklearn_1000)
print("F1 score of sklearn's TfidfVectorizer with max_features=1000: ", f1_sklearn_1000)

F1 score of custom vectorizer with max_features=1000:  0.6887417218543046
F1 score of sklearn's TfidfVectorizer with max_features=1000:  0.8036529680365296


1. Is there a stark difference between the two vectorizers with 1000 features?<br>**Yes**
2. Use sklearn's documentation for the Tfidfvectorizer to figure out what may be causing the performance difference (or lack thereof).<br>**idf smoothing, stop words and feature scaling?**

NOTE: Irrespective of your conclusions, both implementations should be above 60% F1 Score.


----
1. Using PyTorch, implement a feedforward neural network to do sentiment analysis. This model should take sparse vectors of length 10000 as input (note this is 10000, not 1000), and have a single output with the sigmoid activation function. The number of hidden layers, and intermediate activation choices are up to you, but please make sure your model does not take more than ~1 minute to train.
2. Evaluate the model using PyTorch functions for average accuracy, area under the ROC curve and F1 scores (see [torcheval](https://pytorch.org/torcheval/stable/)) using both vectorizers, with max_features set to 10000 in both cases.

In [30]:
import torch
import torch.nn as nn

# if torch.backends.mps.is_available():
# 	device = torch.device("mps")
if torch.cuda.is_available():
	device = torch.device("cuda")
else:
	device = torch.device("cpu")

In [31]:
class feedforward(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):

        out = self.fc1(X)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

    def predict(self, X):
        self.eval()
        with torch.no_grad():
            output = self.forward(X)
            predicted = (output > 0.5).float()
        return predicted

In [32]:
# Load the data using custom and sklearn vectors

X_train_custom = torch.from_numpy(X_train_custom).float().to(device)
X_train_sklearn = torch.from_numpy(X_train_sklearn.toarray()).float().to(device)
y_train = torch.from_numpy(y_train).float().to(device)

In [33]:
# Create a feedforward neural network model
# Adam is an appropriate optimizer for this task


input_size_custom = X_train_custom.shape[1]
input_size_sklearn = X_train_sklearn.shape[1]
hidden_size = 128
output_size = 1

model_custom = feedforward(input_size_custom, hidden_size, output_size).to(device)
model_sklearn = feedforward(input_size_sklearn, hidden_size, output_size).to(device)

criterion = nn.BCELoss()
optimizer_custom = torch.optim.Adam(model_custom.parameters())
optimizer_sklearn = torch.optim.Adam(model_sklearn.parameters())

In [34]:
# Train the model for 50 epochs on both custom and sklearn vectors


num_epochs = 50
for epoch in range(num_epochs):
    # Custom vectors
    model_custom.train()
    optimizer_custom.zero_grad()
    outputs = model_custom(X_train_custom)
    loss = criterion(outputs.squeeze(), y_train)
    loss.backward()
    optimizer_custom.step()

for epoch in range(num_epochs):
    # Sklearn vectors
    model_sklearn.train()
    optimizer_sklearn.zero_grad()
    outputs = model_sklearn(X_train_sklearn)
    loss = criterion(outputs.squeeze(), y_train)
    loss.backward()
    optimizer_sklearn.step()

In [35]:
!pip install torcheval



In [36]:
# Evaluate the model using custom and sklearn vectors

model_custom.eval()
model_sklearn.eval()


from torcheval.metrics.functional import binary_f1_score, binary_auroc, binary_accuracy
# Test the model using custom and sklearn vectors
# Evaluate the model and report the score using Binary F1 score, Binary AUROC and Binary accuracy

X_test_custom = torch.from_numpy(X_test_custom).float().to(device)
y_test = torch.from_numpy(y_test).float().to(device)
X_test_sklearn = torch.from_numpy(X_test_sklearn.toarray()).float().to(device)

y_pred_custom = model_custom.predict(X_test_custom).squeeze()
y_pred_sklearn = model_sklearn.predict(X_test_sklearn).squeeze()

f1_custom = binary_f1_score(y_pred_custom, y_test)
auroc_custom = binary_auroc(y_pred_custom, y_test)
accuracy_custom = binary_accuracy(y_pred_custom, y_test)

f1_sklearn = binary_f1_score(y_pred_sklearn, y_test)
auroc_sklearn = binary_auroc(y_pred_sklearn, y_test)
accuracy_sklearn = binary_accuracy(y_pred_sklearn, y_test)

print("Custom Vectors:")
print(f"Binary F1 Score: {f1_custom:.4f}")
print(f"Binary AUROC: {auroc_custom:.4f}")
print(f"Binary Accuracy: {accuracy_custom:.4f}")

print("\nSklearn Vectors:")
print(f"Binary F1 Score: {f1_sklearn:.4f}")
print(f"Binary AUROC: {auroc_sklearn:.4f}")
print(f"Binary Accuracy: {accuracy_sklearn:.4f}")

Custom Vectors:
Binary F1 Score: 0.7841
Binary AUROC: 0.7489
Binary Accuracy: 0.7550

Sklearn Vectors:
Binary F1 Score: 0.7861
Binary AUROC: 0.7910
Binary Accuracy: 0.7850
