Homework 3: Sentiment Analysis
----

The following instructions apply to all notebooks and `.py` files you submit for this homework.

Due date: April 15th, 2024 11:59 PM (EST)

Total Points: (105)
- Task 0: 5 points
- Task 1: 10 points
- Task 2: 20 points
- Task 3: 25 points
- Task 4: 40 points (question in LSTM_EncDec.ipynb)

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data using different approaches to classification
- stress test your model (to some extent)


Allowed python modules:
- `numpy`, `matplotlib`, `keras`, PyTorch (`torch`), `nltk`, `pandas`, scikit-learn (`sklearn`), `seaborn`, and all built-in Python libraries (e.g. `math` and `string`)
- if you would like to use a library not on this list, please check with us on Campuswire first.
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook.
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__.
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should.

Names & Sections
----
Names: __YOUR NAMES HERE__

Task 0: Name, References, Reflection (5 points)
---

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
- https://docs.python.org/3/tutorial/datastructures.html
    - Read about the basics and syntax for data structures in Python.

AI Collaboration
---
Following the *Policy on the use of Generative AI* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for, including to improve language clarity in the written sections.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort?
2. What was/were the most challenging part(s) of the assignment?
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why?
4. Briefly reflect on how your partnership functioned--who did which tasks, how was the workload on each of you individually as compared to the previous homeworks, etc.

Task 1: Provided Data Write-Up (10 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. One sentence per question is usually enough; if more explanation is necessary, do give it.

This is about the __provided__ movie review data set.

1. Where did you get the data from? The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)?
3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)
4. (1 pt) What is your data? (e.g. newswire, tweets, books, blogs, etc.)
5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)?
7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)?
8. (1 pt) How big is the overlap between the vocabulary for the train and dev set?
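
One way to gather the counts asked for above is sketched below on a tiny made-up corpus. Whitespace tokenization and the toy `train`/`dev` lists are assumptions for illustration only; your real numbers depend on the tokenizer you choose (for example `nltk.word_tokenize`) and on the provided files.

```python
from collections import Counter

# Toy (review, label) pairs standing in for the real train/dev files (hypothetical data).
train = [("a great movie", 1), ("a dull movie", 0), ("great acting", 1)]
dev = [("dull plot", 0), ("great movie", 1)]


def describe(dataset):
    """Return basic statistics for a list of (text, label) pairs."""
    # Whitespace tokenization is an assumption made to keep the sketch short.
    tokens = [tok for text, _ in dataset for tok in text.split()]
    return {
        "n_reviews": len(dataset),
        "n_tokens": len(tokens),
        "vocab": set(tokens),
        "label_counts": Counter(label for _, label in dataset),
    }


train_stats = describe(train)
dev_stats = describe(dev)
# Vocabulary overlap: word types that occur in both splits.
shared_vocab = train_stats["vocab"] & dev_stats["vocab"]

print(train_stats["n_reviews"], train_stats["n_tokens"], len(train_stats["vocab"]))
print(dict(train_stats["label_counts"]))
print(len(shared_vocab))
```

Reporting the overlap both as a raw count and as a fraction of the dev vocabulary makes it easier to judge how many dev words the model will never have seen in training.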

Task 2: Train a Logistic Regression Model (20 points)
----
1. Implement a custom function that takes a corpus read from a dataset and returns its Tf-Idf feature vectors together with the vocabulary (see `get_tfidf_vectors` below).
2. Compare your implementation to `sklearn`'s TfidfVectorizer (imported below) by timing both on the provided datasets using the time module.
3. Using each set of features, and `sklearn`'s implementation of `LogisticRegression`, train a machine learning model to predict sentiment on the given dataset.

In [2]:
import nltk
#nltk.download('punkt')
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from collections import Counter
import time
from nltk.corpus import stopwords

#nltk.download('stopwords')
stopwords = stopwords.words('english')

In [None]:
# The following function reads a data-file and splits the contents by tabs.
# The first column is an ID, and thus is discarded. The second column consists of the actual reviews data.
# The third column is the true label for each data point.

# The function returns two objects - a list of all reviews, and a numpy array of labels.
# You will need to use this function later.

def get_lists(input_file):
    # Use a context manager so the file is closed once it has been read.
    with open(input_file, 'r') as f:
        lines = [line.split('\t')[1:] for line in f]
    X = [row[0] for row in lines]
    y = np.array([int(row[1]) for row in lines])
    return X, y

# Fill in the following function to take a corpus (list of reviews) as input,
# extract TfIdf values and return an array of features and the vocabulary.

# If the vocabulary argument is supplied, then the function should only convert the input corpus
# to feature vectors using the provided vocabulary and the max_features argument (if not None).
# In this case, the function should return feature vectors and the supplied vocabulary.

# If the max_features parameter is set to None, then all words in the corpus should be used.
# If the max_features parameter is specified (say, k),
# then only use the k most frequent words in the corpus to build your vocabulary.

# The function should return two things.

# The first object should be a numpy array of shape (n_documents, vocab_size),
# which contains the TF-IDF feature vectors for each document.

# The second object should be a dictionary of the words in the vocabulary,
# mapped to their corresponding index in alphabetical sorted order.

def get_tfidf_vectors(token_lists, max_features=None, vocabulary=None):

    #YOUR CODE HERE

    pass
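
The vocabulary contract described in the comments above (keep the k most frequent words when `max_features` is set, then index the kept words in alphabetical order) can be illustrated on a toy corpus. This is only a sketch: the helper name `build_vocabulary`, the pre-tokenized input, and the alphabetical tie-breaking between equally frequent words are assumptions, not part of the assignment.

```python
from collections import Counter


def build_vocabulary(token_lists, max_features=None):
    """Map each kept word to its index in alphabetically sorted order (hypothetical helper)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Rank by descending corpus frequency; break ties alphabetically (an assumption).
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    kept = ranked if max_features is None else ranked[:max_features]
    # Indices follow alphabetical order of the kept words, not frequency order.
    return {word: index for index, word in enumerate(sorted(kept))}


docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
full_vocab = build_vocabulary(docs)
small_vocab = build_vocabulary(docs, max_features=2)
print(full_vocab)
print(small_vocab)
```

Keeping the word-to-index mapping in a dictionary means the same mapping can later be passed back in through the `vocabulary` argument, so train and test vectors share one column layout.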


We will now compare the runtime of our Tf-Idf implementation to the `sklearn` implementation. Call the respective functions with appropriate arguments in the code block below.

In [4]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
TEST_FILE = "movie_reviews_test.txt"

train_corpus, y_train = get_lists(TRAIN_FILE)

# First we will use our custom vectorizer to convert words to features, and time it.

###### YOUR CODE HERE #######

# print("Time taken: ", end-start, " seconds")

# Next we will use sklearn's TfidfVectorizer to load in the data, and time it.

###### YOUR CODE HERE #######


# print("Time taken: ", end-start, " seconds")

NOTE: Ideally, your vectorizer should be within one order of magnitude of the sklearn implementation.
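
The timing pattern and the "one order of magnitude" check can be sketched as follows. The two workloads are placeholders (hypothetical stand-ins for your vectorizer and `sklearn`'s), and `time.perf_counter` is used here because it is a monotonic, high-resolution clock; `time.time` works as well.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    end = time.perf_counter()
    return result, end - start


# Placeholder workloads standing in for the two vectorizers (hypothetical).
def custom_vectorizer(corpus):
    return [len(text.split()) for text in corpus]


def reference_vectorizer(corpus):
    return [text.count(" ") + 1 for text in corpus]


corpus = ["a short review"] * 10_000
custom_out, custom_time = time_call(custom_vectorizer, corpus)
reference_out, reference_time = time_call(reference_vectorizer, corpus)

print("Time taken (custom): ", custom_time, " seconds")
print("Time taken (reference): ", reference_time, " seconds")
# "Within one order of magnitude" means the slower run takes at most about 10x as long.
if reference_time > 0:
    print("Ratio (custom / reference):", custom_time / reference_time)
```

A single run can be noisy, so repeating each measurement a few times and comparing the best or median time gives a fairer comparison.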

In [5]:
# Any additional code needed to answer questions below.


1. How large is the vocabulary generated by your vectorizer?<br> **YOUR ANSWER HERE**
2. How large is the vocabulary generated by the `sklearn` TfidfVectorizer?<br> **YOUR ANSWER HERE**
3. Where might these differences be coming from?<br> **YOUR ANSWER HERE**
4. What steps did you take to ensure your vectorizer is optimized for best possible runtime?<br> **YOUR ANSWER HERE**
5. How sparse are your custom features (average percentage of features per review that are zero)?<br> **YOUR ANSWER HERE**
6. How sparse are the TfidfVectorizer's features?<br> **YOUR ANSWER HERE**
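
Sparsity as defined in question 5 (the average percentage of zero-valued features per review) can be computed as in this sketch. The 3x4 matrix is made up, and plain lists are used to keep the example dependency-free; for a dense NumPy array `X`, the same quantity is `(X == 0).mean() * 100`.

```python
def average_sparsity(matrix):
    """Average percentage of zero entries per row of a dense feature matrix."""
    per_row = [
        100.0 * sum(1 for value in row if value == 0) / len(row)
        for row in matrix
    ]
    return sum(per_row) / len(per_row)


# Made-up feature matrix: one row per review, one column per vocabulary word.
features = [
    [0.0, 0.5, 0.0, 0.0],  # 75% zeros
    [0.3, 0.0, 0.0, 0.7],  # 50% zeros
    [0.0, 0.0, 0.0, 0.0],  # 100% zeros
]
sparsity = average_sparsity(features)
print(sparsity)  # 75.0
```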

NOTE: if you set the lowercase option to False, the sklearn vectorizer should have a vocabulary of around 50k words/tokens.

**Logistic Regression**

Now we will compare our custom features against `sklearn`'s `TfidfVectorizer` by training two separate Logistic Regression classifiers, one on each set of feature vectors. Then load the test set and convert it to two sets of feature vectors: one using our custom vectorizer (to do this, provide the vocabulary as a function argument), and one using `sklearn`'s `TfidfVectorizer` (use the same fitted object as before to transform the test inputs). For both classifiers, print the average accuracy and the F1 score on the test set.
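
For reference, the two metrics requested above are computed from the confusion counts as shown in this dependency-free sketch. The label lists are made up; in the notebook, `sklearn.metrics.f1_score` (already imported) and the classifier's `score` method report the same quantities for binary labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions for six reviews.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
f1_value = f1(y_true, y_pred)
print("Accuracy:", acc)
print("F1:", f1_value)
```

Accuracy and F1 coincide here only because the toy example has one false positive and one false negative; on imbalanced predictions they can differ substantially, which is why both are reported.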

In [6]:
# First use sklearn's LogisticRegression classifier to do sentiment analysis using your custom feature vectors:

###### YOUR CODE HERE #######

# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

###### YOUR CODE HERE #######

# Print the accuracy of your model on the test data

###### YOUR CODE HERE #######

# Now repeat the above steps, but this time using features extracted by sklearn's TfidfVectorizer

###### YOUR CODE HERE #######


NOTE: we're expecting to see an F1 score of around 80% using both your custom features and the sklearn features.

Finally, repeat the process (training and testing), but this time set the `max_features` argument to 1000 for both our custom vectorizer and sklearn's `TfidfVectorizer`. Report average accuracy and F1 scores for both classifiers.

In [7]:
###### YOUR CODE HERE #######

# First use sklearn's LogisticRegression classifier to do sentiment analysis using your custom feature vectors:

###### YOUR CODE HERE #######


# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

###### YOUR CODE HERE #######


# Print the accuracy of your model on the test data

###### YOUR CODE HERE #######

# Now repeat the above steps, but this time using features extracted by sklearn's TfidfVectorizer

###### YOUR CODE HERE #######


1. Is there a stark difference between the two vectorizers with 1000 features?<br>**YOUR ANSWER HERE**
2. Use sklearn's documentation for the `TfidfVectorizer` to figure out what may be causing the performance difference (or lack thereof).<br>**YOUR ANSWER HERE**

NOTE: Irrespective of your conclusions, both implementations should be above 60% F1 Score.

Task 3: Train a Feedforward Neural Network Model (25 points)
----
1. Using PyTorch, implement a feedforward neural network to do sentiment analysis. This model should take sparse vectors of length 10000 as input (note this is 10000, not 1000), and have a single output with the sigmoid activation function. The number of hidden layers, and intermediate activation choices are up to you, but please make sure your model does not take more than ~1 minute to train.
2. Evaluate the model using PyTorch functions for average accuracy, area under the ROC curve and F1 scores (see [torcheval](https://pytorch.org/torcheval/stable/)) using both vectorizers, with max_features set to 10000 in both cases.

In [8]:
import torch
import torch.nn as nn

# if torch.backends.mps.is_available():
#     device = torch.device("mps")
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [9]:
class feedforward(nn.Module):
    def __init__(self):
        super().__init__()

        ###### YOUR CODE HERE #######

    def forward(self, X):
        ###### YOUR CODE HERE #######
        pass

    def predict(self, X):
        ###### YOUR CODE HERE #######
        pass

In [10]:
# Load the data using custom and sklearn vectors

###### YOUR CODE HERE #######


In [11]:
# Create a feedforward neural network model
# you may use any activation function on the hidden layers
# you should use binary cross-entropy as your loss function
# Adam is an appropriate optimizer for this task


###### YOUR CODE HERE #######
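
As a reminder of what the loss named in the comments above measures, here is a small numeric sketch of a sigmoid output, the mean binary cross-entropy over one batch, and the thresholding a `predict` method would typically apply. The logits and labels are made up, and the 0.5 cutoff is an assumption; in PyTorch, `torch.nn.BCELoss` computes the same mean loss from sigmoid outputs.

```python
import math


def sigmoid(z):
    """Squash a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probabilities, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probabilities, labels)
    ]
    return sum(losses) / len(losses)


# Made-up raw model outputs (logits) and gold labels for three reviews.
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]

probabilities = [sigmoid(z) for z in logits]
# Turn probabilities into hard labels; the 0.5 threshold is an assumption.
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
loss = binary_cross_entropy(probabilities, labels)

print(predictions)
print(loss)
```

The loss stays positive even when every thresholded prediction is correct, because it penalizes low confidence as well as wrong answers.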


In [12]:
# Train the model for 50 epochs on both custom and sklearn vectors


###### YOUR CODE HERE #######

In [13]:
!pip install torcheval

# Evaluate the model using custom and sklearn vectors

###### YOUR CODE HERE #######


from torcheval.metrics.functional import binary_f1_score
from torcheval.metrics import BinaryAUROC, BinaryAccuracy

# Test the model using custom and sklearn vectors
# Evaluate the model and report the score using Binary F1 score, Binary AUROC and Binary accuracy

###### YOUR CODE HERE #######




NOTE: As in the last task, we're expecting to see an F1 score of over 60% using both your custom features and the sklearn features.

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).