# Assignment 2b: Word2Vec Representations (10 Marks)

## Due: March 17, 2022

Welcome to Assignment 2b of the course. This week we will learn about vector representations for words and how we can use them to solve the sentiment analysis task that we discussed in the previous assignment.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    data_dir = "/content/gdrive/MyDrive/Colab Notebooks/PlakshaNLP/Assignment2b/data/SST-2"
except:
    data_dir = "/datadrive/t-kabir/work/repos/PlakshaNLP/source/Assignment2b/data/SST-2"

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Install required libraries
!pip install numpy
!pip install pandas
!pip install nltk
!pip install torch
!pip install tqdm
!pip install matplotlib
!pip install seaborn
!pip install gensim



In [None]:
# We start by importing libraries that we will be making use of in the assignment.
import string
import tqdm
import numpy as np
import pandas as pd
import torch
import gensim
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download("punkt")
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

As in the previous assignment, we will work with the Stanford Sentiment dataset (SST-2). Below we load the dataset into memory.

In [None]:
# We can use pandas to load the datasets
train_df = pd.read_csv(f"{data_dir}/train.tsv", sep = "\t")
test_df = pd.read_csv(f"{data_dir}/dev.tsv", sep = "\t")

print(f"Number of Training Examples: {len(train_df)}")
print(f"Number of Test Examples: {len(test_df)}")

Number of Training Examples: 67349
Number of Test Examples: 872


In [None]:
# View a sample of the dataset
train_df.head()

Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates som...,1
3,remains utterly satisfied to remain the same t...,0
4,on the worst revenge-of-the-nerds clichés the ...,0


## Task 0: Warm Up Exercise (2 Marks)

To start, we ask you to re-implement some functions from Assignment 1. You will implement the preprocessing pipeline and vocabulary building functions again, as well as some new but related functions. Details about each function are given in its docstring.

### Task 0.1: Preprocessing Pipeline (1 Mark)

Implement the preprocessing pipeline as we did in the previous assignment. This time, however, we only convert the text to lower case and remove punctuation.

We are not doing any stemming this time because we will be using pre-trained word representations, and, as discussed in the lectures, stemming often produces word forms that do not exist in common dictionaries and so may have no pre-trained vector.

We are also skipping stop word removal this time. Removing stop words can hurt the structural integrity of a sentence, and the choice of stop words is subjective and depends on the task at hand. For example, the stop word list that we used last time contained the word `not`, and removing it can change the sentiment of a sentence, e.g. I did not like this movie -> I did like this movie. In this assignment we will explore more sophisticated ways of handling stop words than directly removing them from the text.

In [None]:
def preprocess_pipeline(text):
    """
    Given a piece of text, applies preprocessing techniques:
    converting to lower case and removing punctuation.

    Apply the functions in the following order:
    1. to_lower_case
    2. remove_punctuations

    Inputs:
    - text (str) : A python string containing text to be pre-processed

    Returns:
    - text_preprocessed (str) : Resulting string after applying preprocessing
    
    Note: You may implement the functions for the two steps separately in this cell
            or just write all the code in this function; we leave that up to you.
    """
    text_preprocessed = text.lower()
    # Delete every ASCII punctuation character in a single pass
    text_preprocessed = text_preprocessed.translate(str.maketrans('', '', string.punctuation))
    return text_preprocessed

In [None]:

def evaluate_string_test_cases(test_case_input,
                        test_case_func_output,
                        test_case_exp_output):
  
    print(f"Input: {test_case_input}")
    print(f"Function Output: {test_case_func_output}")
    print(f"Expected Output: {test_case_exp_output}")

    if test_case_func_output == test_case_exp_output:
        print("Test Case Passed :)")
        print("**********************************\n")
        return True
    else:
        print("Test Case Failed :(")
        print("**********************************\n")
        return False

print("Running Sample Test Cases")
print("Sample Test Case 1:")
test_case = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal!"
test_case_answer = "mr and mrs dursley of number four privet drive were proud to say that they were perfectly normal"
test_case_student_answer = preprocess_pipeline(test_case)
assert evaluate_string_test_cases(test_case, test_case_student_answer, test_case_answer)

print("Sample Test Case 2:")
test_case = "\"Little tyke,\" chortled Mr. Dursley as He left the house."
test_case_answer = "little tyke chortled mr dursley as he left the house"
test_case_student_answer = preprocess_pipeline(test_case)
assert evaluate_string_test_cases(test_case, test_case_student_answer, test_case_answer)


Running Sample Test Cases
Sample Test Case 1:
Input: Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal!
Function Output: mr and mrs dursley of number four privet drive were proud to say that they were perfectly normal
Expected Output: mr and mrs dursley of number four privet drive were proud to say that they were perfectly normal
Test Case Passed :)
**********************************

Sample Test Case 2:
Input: "Little tyke," chortled Mr. Dursley as He left the house.
Function Output: little tyke chortled mr dursley as he left the house
Expected Output: little tyke chortled mr dursley as he left the house
Test Case Passed :)
**********************************



In [None]:
## Preprocess the dataset

train_df["sentence"] = train_df["sentence"].apply(lambda x : preprocess_pipeline(x))
test_df["sentence"] = test_df["sentence"].apply(lambda x : preprocess_pipeline(x))

### Task 0.2: Create Vocabulary (0.25 Marks)

Implement the `create_vocab` function below as you did last time. Do not forget to use `nltk.tokenize.word_tokenize` to tokenize the text into words.

In [None]:
def create_vocab(documents):
    """
    Given a list of documents each represented as a string,
    create a word vocabulary containing all the words that occur
    in these documents.
    (0.25 Marks)

    Inputs:
        - documents (list) : A list with each element as a string representing a
                            document.

    Returns:
        - vocab (list) : A **sorted** list containing all unique words in the
                        documents

    Example Input: ['john likes to watch movies mary likes movies too',
                  'mary also likes to watch football games']

    Expected Output: ['also',
                    'football',
                    'games',
                    'john',
                    'likes',
                    'mary',
                    'movies',
                    'to',
                    'too',
                    'watch']


    Hint: `nltk.tokenize.word_tokenize` function may come in handy

    """
        
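    from nltk import word_tokenize
    from tqdm import tqdm  # the top-level `import tqdm` binds the module, which is not callable

    # Collect unique tokens in a set instead of concatenating one large list
    vocab = set()
    for document in tqdm(documents):
        vocab.update(word_tokenize(document))

    return sorted(vocab) # Don't change this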

In [None]:
def evaluate_list_test_cases(test_case_input,
                        test_case_func_output,
                        test_case_exp_output):
  
    print(f"Input: {test_case_input}")
    print(f"Function Output: {test_case_func_output}")
    print(f"Expected Output: {test_case_exp_output}")

    if test_case_func_output == test_case_exp_output:
        print("Test Case Passed :)")
        print("**********************************\n")
        return True
    else:
        print("Test Case Failed :(")
        print("**********************************\n")
        return False


print("Running Sample Test Cases")
print("Sample Test Case 1:")

test_case = ["john likes to watch movies mary likes movies too",
              "mary also likes to watch football games"]
test_case_answer = ['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
test_case_student_answer = create_vocab(test_case)
assert evaluate_list_test_cases(test_case, test_case_student_answer, test_case_answer)

print("Sample Test Case 2:")

test_case = ["We all live in a yellow submarine.",
             "Yellow submarine, yellow submarine!!"
             ]
test_case_answer = ['!', ',', '.', 'We', 'Yellow', 'a', 'all', 'in', 'live', 'submarine', 'yellow']
test_case_student_answer = create_vocab(test_case)
assert evaluate_list_test_cases(test_case, test_case_student_answer, test_case_answer)



Running Sample Test Cases
Sample Test Case 1:
Input: ['john likes to watch movies mary likes movies too', 'mary also likes to watch football games']
Function Output: ['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
Expected Output: ['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
Test Case Passed :)
**********************************

Sample Test Case 2:
Input: ['We all live in a yellow submarine.', 'Yellow submarine, yellow submarine!!']
Function Output: ['!', ',', '.', 'We', 'Yellow', 'a', 'all', 'in', 'live', 'submarine', 'yellow']
Expected Output: ['!', ',', '.', 'We', 'Yellow', 'a', 'all', 'in', 'live', 'submarine', 'yellow']
Test Case Passed :)
**********************************



In [None]:
# Create vocabulary from training data
train_documents = train_df["sentence"].values.tolist()
train_vocab = create_vocab(train_documents)

### Task 0.3: Get Word Frequencies (0.75 Marks)

We define the normalized frequency of a word `w` in a corpus as:

p(w) = Number of occurrences of `w` in all documents / Total number of occurrences of all words in all documents

Note that this is the same as the unigram probabilities discussed in Assignment 2a and in the lectures.
Word frequencies are useful because they help us recognize both the most common words, which in most cases are stop words, and the rare words that occur in the documents. Later we will use word frequencies to create sentence representations, but for now just implement the `get_word_frequencies` function below.
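As a small worked example of the definition above, the sketch below computes p(w) for a toy corpus. The corpus and the `toy_` names are made up for illustration, and plain whitespace splitting stands in for `word_tokenize`.

```python
from collections import Counter

# Hypothetical toy corpus, already lower-cased and free of punctuation
toy_documents = ["the movie was good", "the plot was thin"]

# Count every token across all documents (whitespace split, an assumption)
toy_counts = Counter(token for doc in toy_documents for token in doc.split())
toy_total = sum(toy_counts.values())

# Normalized frequency p(w) = count(w) / total number of tokens
toy_word2freq = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_word2freq["the"])   # 2 occurrences out of 8 tokens
print(toy_word2freq["good"])  # 1 occurrence out of 8 tokens
```

The frequencies over the whole vocabulary always sum to 1, which is a quick sanity check for your own implementation.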

In [None]:
def get_word_frequencies(documents):
    """
    Gets the normalized frequency of each word w i.e. 
    p(w) =  #num_of_occurences_of_w / #total_occurences_of_all_words
    present in documents
    
    Inputs:
        - documents(list): A list of documents
    
    Returns:
        - word2freq(dict): A dictionary containing words as keys
                           and values as their corresponding frequencies
    
    """
    from collections import Counter
    from nltk import word_tokenize
    from tqdm import tqdm

    word2freq = {}
    counts = Counter()
    print('Preparing. . . . ')
    # Count every token in a single pass over the corpus
    for document in tqdm(documents):
        counts.update(word_tokenize(document))
    total = sum(counts.values())
    print('Finalizing. .. .')
    # Normalize each count by the total number of tokens
    for word in tqdm(counts):
        word2freq[word] = counts[word] / total

    return word2freq

In [None]:
def check_dicts_same(dict1, dict2):
    if not isinstance(dict1, dict):
        print("Your function output is not a dictionary!")
        return False
    if len(dict1) != len(dict2):
        return False
    
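    for key in dict1:
        if key not in dict2:
            return False
        val1 = dict1[key]
        val2 = dict2[key]
        if isinstance(val1, float) and isinstance(val2, float):
            # Compare floats with a relative tolerance instead of exact equality
            if not np.allclose(val1, val2, 1e-4):
                return False
        elif val1 != val2:
            return False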
    
    return True
    
print("Running Sample Test Case 1")
sample_documents = [
    'john likes to watch movies mary likes movies too',
    'mary also likes to watch football games'
]
actual_word2freq = {'john': 0.0625,
                     'likes': 0.1875,
                     'to': 0.125,
                     'watch': 0.125,
                     'movies': 0.125,
                     'mary': 0.125,
                     'too': 0.0625,
                     'also': 0.0625,
                     'football': 0.0625,
                     'games': 0.0625}

output_word2freq = get_word_frequencies(sample_documents)
print(f"Input Documents: {sample_documents}")
print(f"Output Word Frequencies: {output_word2freq}")
print(f"Expected Word Frequencies: {actual_word2freq}")

assert check_dicts_same(output_word2freq, actual_word2freq)
print("****************************************\n")

print("Running Sample Test Case 2")
sample_documents = [
    'We all live in a yellow submarine.',
    'Yellow submarine, yellow submarine!!'
]
actual_word2freq = {'We': 0.06666666666666667,
                    'all': 0.06666666666666667,
                    'live': 0.06666666666666667,
                    'in': 0.06666666666666667,
                    'a': 0.06666666666666667,
                    'yellow': 0.13333333333333333,
                    'submarine': 0.2,
                    '.': 0.06666666666666667,
                    'Yellow': 0.06666666666666667,
                    ',': 0.06666666666666667,
                    '!': 0.13333333333333333}

output_word2freq = get_word_frequencies(sample_documents)
print(f"Input Documents: {sample_documents}")
print(f"Output Word Frequencies: {output_word2freq}")
print(f"Expected Word Frequencies: {actual_word2freq}")

assert check_dicts_same(output_word2freq, actual_word2freq)
print("****************************************\n")


Running Sample Test Case 1
Input Documents: ['john likes to watch movies mary likes movies too', 'mary also likes to watch football games']
Output Word Frequencies: {'games': 0.0625, 'too': 0.0625, 'football': 0.0625, 'likes': 0.1875, 'john': 0.0625, 'watch': 0.125, 'also': 0.0625, 'mary': 0.125, 'movies': 0.125, 'to': 0.125}
Expected Word Frequencies: {'john': 0.0625, 'likes': 0.1875, 'to': 0.125, 'watch': 0.125, 'movies': 0.125, 'mary': 0.125, 'too': 0.0625, 'also': 0.0625, 'football': 0.0625, 'games': 0.0625}
****************************************

Running Sample Test Case 2
Input Documents: ['We all live in a yellow submarine.', 'Yellow submarine, yellow submarine!!']
Output Word Frequencies: {',': 0.06666666666666667, 'We': 0.06666666666666667, 'all': 0.06666666666666667, 'a': 0.06666666666666667, 'Yellow': 0.06666666666666667, 'submarine': 0.2, 'yellow': 0.13333333333333333, 'live': 0.06666666666666667, 'in': 0.06666666666666667, '!': 0.13333333333333333, '.': 0.066666666666666

## Task 1: Word2Vec Representations

In this task you will learn how to use word2vec to obtain vector representations for words, and then how to use them to create sentence/document level vector representations. We will be using the popular [gensim](https://radimrehurek.com/gensim/) package, which has great support for vector space models and supports various popular word embedding methods like word2vec, fasttext, LSA etc. For the purposes of this assignment we will be working with the word2vec vectors pretrained on the Google News corpus, which contains about 100 billion tokens. Below we provide a tutorial on how to use gensim for obtaining these word vectors.

We start by downloading the pretrained word2vec vectors and creating a `gensim.models.KeyedVectors` object. The download has a size of about 2GB, so it might take a few minutes to download and load.

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')



The `wv` object has a bunch of methods that we can use to obtain vector representations of words, finding similar words etc. We start with how to obtain vectors for words using it, which can be done using the `get_vector` method as demonstrated below.

In [None]:
# Cache the loaded vectors locally so that later runs can skip the download
# wv.save('word2vec-google-news-300.kv')

wv = gensim.models.KeyedVectors.load('word2vec-google-news-300.kv')

In [None]:
word = "bad"
vector = wv.get_vector(word)
print(f"Word : {word}")
print(f"Length of the vector: {len(vector)}")
print(f"Vector:")
print(vector)

Word : bad
Length of the vector: 300
Vector:
[ 0.06298828  0.12451172  0.11328125  0.07324219  0.03881836  0.07910156
  0.05078125  0.171875    0.09619141  0.22070312 -0.04150391 -0.09277344
 -0.02209473  0.14746094 -0.21582031  0.15234375  0.19238281 -0.05078125
 -0.11181641 -0.3203125   0.00506592  0.15332031 -0.02563477 -0.0234375
  0.36328125  0.20605469  0.04760742 -0.02624512  0.09033203  0.00457764
 -0.15332031  0.06591797  0.3515625  -0.12451172  0.03015137  0.16210938
  0.00242615 -0.02282715  0.02978516  0.00531006  0.25976562 -0.22460938
  0.29492188 -0.18066406  0.07910156  0.02282715  0.12109375 -0.17382812
 -0.03735352 -0.06933594 -0.21972656  0.1875     -0.03320312 -0.06225586
 -0.04492188  0.11621094 -0.23339844 -0.11669922  0.09814453 -0.11962891
  0.13964844  0.28710938 -0.26953125 -0.05493164  0.03112793 -0.05029297
  0.1328125  -0.01831055 -0.37695312 -0.06298828  0.12597656 -0.07910156
 -0.04467773  0.10400391 -0.41210938  0.22851562 -0.07080078  0.24511719
  0.064

You can also obtain the vector using square bracket notation, i.e. `wv["bad"]`.

In [None]:
word = "bad"
vector = wv[word]
print(f"Word : {word}")
print(f"Length of the vector: {len(vector)}")
print(f"Vector:")
print(vector)

Word : bad
Length of the vector: 300
Vector:
[ 0.06298828  0.12451172  0.11328125  0.07324219  0.03881836  0.07910156
  0.05078125  0.171875    0.09619141  0.22070312 -0.04150391 -0.09277344
 -0.02209473  0.14746094 -0.21582031  0.15234375  0.19238281 -0.05078125
 -0.11181641 -0.3203125   0.00506592  0.15332031 -0.02563477 -0.0234375
  0.36328125  0.20605469  0.04760742 -0.02624512  0.09033203  0.00457764
 -0.15332031  0.06591797  0.3515625  -0.12451172  0.03015137  0.16210938
  0.00242615 -0.02282715  0.02978516  0.00531006  0.25976562 -0.22460938
  0.29492188 -0.18066406  0.07910156  0.02282715  0.12109375 -0.17382812
 -0.03735352 -0.06933594 -0.21972656  0.1875     -0.03320312 -0.06225586
 -0.04492188  0.11621094 -0.23339844 -0.11669922  0.09814453 -0.11962891
  0.13964844  0.28710938 -0.26953125 -0.05493164  0.03112793 -0.05029297
  0.1328125  -0.01831055 -0.37695312 -0.06298828  0.12597656 -0.07910156
 -0.04467773  0.10400391 -0.41210938  0.22851562 -0.07080078  0.24511719
  0.064

Also note that the word2vec model might not have vectors for all words. You can check for Out of Vocabulary (OOV) words using the `in` operator as shown in the code block below.

In [None]:
print("book" in wv)
print("blastoise" in wv)

True
False


Just looking at the vectors we cannot really gain any insights about them; it is the relation between the vectors of different words that is much easier to interpret. The `wv` object has a `most_similar` method that, for a given word, finds the words most similar to it by computing the cosine similarity between their vectors.
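To make the notion of cosine similarity concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors in `toy_vectors` are invented for illustration and are not real word2vec values; `cosine_similarity` is a helper defined here, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors", chosen only for illustration
toy_vectors = {
    "bad": [1.0, 0.0, 0.2],
    "terrible": [0.9, 0.1, 0.3],
    "book": [0.0, 1.0, 0.1],
}

for word in ("terrible", "book"):
    score = cosine_similarity(toy_vectors["bad"], toy_vectors[word])
    print(word, round(score, 3))
```

Vectors that point in nearly the same direction score close to 1, while nearly orthogonal vectors score close to 0, regardless of their lengths.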

In [None]:
# Running the below 4 cells were crashing the 

In [None]:
wv.most_similar("bad",topn=5)

In [None]:
wv.most_similar("king",topn=5)

You can see that we obtain very reasonable similar words in both examples. We can also use `most_similar` to do the analogy comparison that was discussed in class, e.g. man : king :: woman : ?
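The idea behind the analogy query can be sketched with vector arithmetic on toy data: add the "positive" vectors, subtract the "negative" ones, and look for the nearest remaining word. The 2-dimensional vectors and the `toy_analogy` helper below are hypothetical, and this is a simplification rather than gensim's exact procedure.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [1.0, 0.8],
    "apple": [-1.0, 0.0],
}

def toy_analogy(positive, negative):
    # Add the "positive" vectors and subtract the "negative" ones
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    # Rank the words that were not part of the query by cosine similarity
    candidates = [w for w in toy_vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, toy_vectors[w]))

print(toy_analogy(positive=["woman", "king"], negative=["man"]))
```

Here `woman + king - man` lands exactly on the toy vector for "queen"; with real embeddings the match is only approximate, which is why a nearest-neighbour search is needed.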

In [None]:
wv.most_similar(positive=['woman', 'king'], negative=['man'], topn = 1)

In [None]:
wv.most_similar(positive=['woman', 'father'], negative=['man'], topn = 1)

### Task 1.1 Sentence representations using Word2Vec : Bag of Words Methods (1 Mark)

Now that we know how to obtain the vector of each word, how can we obtain a vector representation for a sentence or a document? One of the simplest ways is to add the vectors of all the words in the sentence to obtain the sentence vector. This is also called the Bag of Words approach. Can you think of why? Last time, when we discussed bag of words features for a sentence, the feature vector contained the count of each word occurring in the sentence. This can be thought of as adding the one-hot vectors of all the words in the sentence. Hence, adding the word2vec vectors of each word in the sentence can also be viewed as a bag of words representation.

Implement the `get_bow_sent_vec` function below, which takes in a sentence and adds the word2vec vectors of each word occurring in the sentence to obtain the sentence vector. In practice it also helps to divide the sum of word vectors by the number of words to normalize the representation.
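The claim that count features are a sum of one-hot vectors can be checked on a tiny example. The vocabulary and sentence below are made up for illustration, and `one_hot` is a helper defined only for this sketch.

```python
# Hypothetical sorted vocabulary and a mapping from word to index
toy_vocab = ["good", "movie", "the", "very"]
word2idx = {word: idx for idx, word in enumerate(toy_vocab)}

def one_hot(word):
    """Vector with a 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(toy_vocab)
    vec[word2idx[word]] = 1
    return vec

tokens = "very very good movie".split()

# Summing the one-hot vectors of all tokens yields the per-word counts
bow = [0] * len(toy_vocab)
for token in tokens:
    bow = [b + o for b, o in zip(bow, one_hot(token))]

print(bow)  # [1, 1, 0, 2]
```

Replacing each one-hot vector with the word's dense word2vec vector gives the sentence representation used in this task.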

In [None]:
def get_bow_sent_vec(sentence, wv):
    """
    Obtains the vector representation of a sentence by adding the word vectors
    for each word occuring in the sentence (and dividing by the number of words) i.e
    
    v(s) = sum_{w \in s}(v(w)) / N(s)
    where N(s) is the number of words in the sentence,
    v(w) is the word2vec representation for word w
    and v(s) is the obtained vector representation of sentence s
    
    Inputs:
        - sentence (str): A string containing the sentence to be encoded
        - wv (gensim.models.keyedvectors.KeyedVectors) : A gensim word vector model object.
        
    Returns:
        - sentence_vec (np.ndarray): A numpy array containing the vector representation
        of the sentence
        
    Note : Not all the words might be present in `wv` so you will need to check for that,
          and only add vectors for the words that are present. Also while normalization
          divide by the number of words for which a word vector was actually present in `wv`
    
    Important Note: In case no word in the sentence is present in `wv`, return an all zero vector!

    """
    from nltk import word_tokenize
    sentence_vec = np.zeros(300)

    # Keep only the tokens that have a vector in `wv`
    known_tokens = [word for word in word_tokenize(sentence) if word in wv]
    for word in known_tokens:
        sentence_vec += wv[word]

    # Normalize by the number of words that actually contributed a vector
    if len(known_tokens) > 0:
        sentence_vec /= len(known_tokens)

    return sentence_vec

In [None]:
print("Running Sample Test Case 1")
sample_sentence ='john likes watching movies mary likes movies too'
sentence_vec = get_bow_sent_vec(sample_sentence, wv)
expected_sent_vec = np.array([ 0.03330994,  0.11713409,  0.00738525,  0.24951172, -0.0202179 ])
print(f"Input Sentence: {sample_sentence}")
print(f"First five elements of output vector: {sentence_vec[:5]}")
print(f"Expected first five elements of output vector: {expected_sent_vec}")
assert np.allclose(sentence_vec[:5], expected_sent_vec, 1e-4) 
print("Sample Test Case Passed")
print("*******************************\n")

print("Running Sample Test Case 2")
sample_sentence ='We all live in a yellow submarine.'
sentence_vec = get_bow_sent_vec(sample_sentence, wv)
expected_sent_vec = np.array([-0.08424886,  0.14601644,  0.0727946 ,  0.09978231, -0.02655029])
print(f"Input Sentence: {sample_sentence}")
print(f"First five elements of output vector: {sentence_vec[:5]}")
print(f"Expected first five elements of output vector: {expected_sent_vec}")
assert np.allclose(sentence_vec[:5], expected_sent_vec, 1e-4) 
print("Sample Test Case Passed")
print("*******************************\n")

print("Running Sample Test Case 3")
sample_sentence ='blastoise pikachu charizard'
sentence_vec = get_bow_sent_vec(sample_sentence, wv)
expected_sent_vec = np.array([0.,  0.,  0. ,  0., 0.])
print(f"Input Sentence: {sample_sentence}")
print(f"First five elements of output vector: {sentence_vec[:5]}")
print(f"Expected first five elements of output vector: {expected_sent_vec}")
assert np.allclose(sentence_vec[:5], expected_sent_vec, 1e-4) 
print("Sample Test Case Passed")
print("*******************************\n")


Running Sample Test Case 1
Input Sentence: john likes watching movies mary likes movies too
First five elements of output vector: [ 0.03330994  0.11713409  0.00738525  0.24951172 -0.0202179 ]
Expected first five elements of output vector: [ 0.03330994  0.11713409  0.00738525  0.24951172 -0.0202179 ]
Sample Test Case Passed
*******************************

Running Sample Test Case 2
Input Sentence: We all live in a yellow submarine.
First five elements of output vector: [-0.08424886  0.14601644  0.0727946   0.09978231 -0.02655029]
Expected first five elements of output vector: [-0.08424886  0.14601644  0.0727946   0.09978231 -0.02655029]
Sample Test Case Passed
*******************************

Running Sample Test Case 3
Input Sentence: blastoise pikachu charizard
First five elements of output vector: [0. 0. 0. 0. 0.]
Expected first five elements of output vector: [0. 0. 0. 0. 0.]
Sample Test Case Passed
*******************************



### Task 1.2 Sentence representations using Word2Vec : Inverse Frequency Weighted Sum Method (1 Mark)

Instead of directly adding the vectors of all the words in the sentence, we can do something slightly better that tends to work very well in practice. [Arora et al. 2017](https://openreview.net/pdf?id=SyK00v5xx) propose the following method for computing a sentence embedding from word vectors:

<img src="https://i.ibb.co/vwzHXHy/sent-embed.jpg" alt="sent-embed" border="0">

Here v_w is the vector representation of the word w, p(w) is the normalized frequency of the word w, |s| is the number of words in the sentence, and `a` is a constant with a typical value between 1e-4 and 1e-3.

Intuitively, we take a weighted sum of all the word vectors, where the weight a / (a + p(w)) decreases as the frequency p(w) of the word increases. This ensures that very frequent words, which are often stop words like "the" and "I", are given lower weight when constructing the sentence vector. `a` acts as a smoothing constant: when p(w) = 0 the weight is still finite (exactly 1).
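To get a feel for the weighting term a / (a + p(w)), the short sketch below evaluates it for a few frequencies. The words and frequency values are invented for illustration, and `inverse_freq_weight` is a helper name used only here.

```python
def inverse_freq_weight(p_w, a=1e-3):
    """Weight a / (a + p(w)) applied to a word vector before summing."""
    return a / (a + p_w)

# Hypothetical normalized frequencies, from very common to unseen
toy_freqs = [("the", 0.05), ("movie", 0.005), ("labored", 0.00005), ("unseen", 0.0)]

for word, p_w in toy_freqs:
    print(f"{word:>8}: p(w) = {p_w:<8} weight = {inverse_freq_weight(p_w):.4f}")
```

Frequent words receive weights close to 0, rare words receive weights close to 1, and a word with p(w) equal to `a` receives a weight of exactly 0.5.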


In [None]:
def get_weighted_bow_sent_vec(sentence, wv, word2freq, a = 1e-4):
    """
    Obtains the vector representation of a sentence by taking a frequency-weighted
    sum of the word vectors for each word occurring in the sentence (and dividing by the number of words) i.e
    
    v(s) = (sum_{w \in s} a / (a + p(w)) * (v(w))) / N(s)
    
    Inputs:
        - sentence (str): A string containing the sentence to be encoded
        - wv (gensim.models.keyedvectors.KeyedVectors) : A gensim word vector model object.
        - word2freq (dict): A dictionary with words as keys and their frequency in the
                            entire training dataset as values
        - a (float): Smoothing constant
        
    Returns:
        - sentence_vec (np.ndarray): A numpy array containing the vector representation
        of the sentence
    
    Important Note: In case no word in the sentence is present in `wv`, return an all zero vector!
    
    Hint: If a word is not present in the `word2freq` dictionary, you can consider frequency
          of that word to be zero
        
    """
    
    sentence_vec = np.zeros(300)

    from nltk import word_tokenize
    # Keep only the tokens that have a vector in `wv`
    known_tokens = [word for word in word_tokenize(sentence) if word in wv]
    for word in known_tokens:
        # Words missing from `word2freq` are treated as having frequency zero
        weight = a / (a + word2freq.get(word, 0.0))
        sentence_vec += wv[word] * weight

    # Normalize by the number of words that actually contributed a vector
    if len(known_tokens) > 0:
        sentence_vec /= len(known_tokens)
    return sentence_vec

In [None]:
print("Running Sample Test Case 1")
sample_sentence ='john likes watching movies mary likes movies too'
sample_word2freq = {
    "john" : 0.001,
    "likes": 0.01,
    "watching" : 0.01,
    "movies": 0.05,
    "mary" : 0.001,
    "too": 0.1
}
sentence_vec = get_weighted_bow_sent_vec(sample_sentence, wv, sample_word2freq)
expected_sent_vec = np.array([-0.00384654,  0.00208942,  0.00010824,  0.00648482, -0.00236967])
print(f"Input Sentence: {sample_sentence}")
print(f"First five elements of output vector: {sentence_vec[:5]}")
print(f"Expected first five elements of output vector: {expected_sent_vec}")
assert np.allclose(sentence_vec[:5], expected_sent_vec, 1e-4) 
print("Sample Test Case Passed")
print("*******************************\n")

print("Running Sample Test Case 2")
sample_sentence ='We all live in a yellow submarine.'
sentence_vec = get_weighted_bow_sent_vec(sample_sentence, wv, word2freq = {}, a = 1e-3)
expected_sent_vec = np.array([-0.08424886,  0.14601644,  0.0727946 ,  0.09978231, -0.02655029])
print(f"Input Sentence: {sample_sentence}")
print(f"First five elements of output vector: {sentence_vec[:5]}")
print(f"Expected first five elements of output vector: {expected_sent_vec}")
assert np.allclose(sentence_vec[:5], expected_sent_vec, 1e-4) 
print("Sample Test Case Passed")
print("*******************************\n")

print("Running Sample Test Case 3")
sample_sentence ='blastoise pikachu charizard'
sentence_vec = get_weighted_bow_sent_vec(sample_sentence, wv, word2freq = {}, a = 1e-3)
expected_sent_vec = np.array([0.,  0.,  0. ,  0., 0.])
print(f"Input Sentence: {sample_sentence}")
print(f"First five elements of output vector: {sentence_vec[:5]}")
print(f"Expected first five elements of output vector: {expected_sent_vec}")
assert np.allclose(sentence_vec[:5], expected_sent_vec, 1e-4) 
print("Sample Test Case Passed")
print("*******************************\n")


Running Sample Test Case 1
Input Sentence: john likes watching movies mary likes movies too
First five elements of output vector: [-0.00384654  0.00208942  0.00010824  0.00648482 -0.00236967]
Expected first five elements of output vector: [-0.00384654  0.00208942  0.00010824  0.00648482 -0.00236967]
Sample Test Case Passed
*******************************

Running Sample Test Case 2
Input Sentence: We all live in a yellow submarine.
First five elements of output vector: [-0.08424886  0.14601644  0.0727946   0.09978231 -0.02655029]
Expected first five elements of output vector: [-0.08424886  0.14601644  0.0727946   0.09978231 -0.02655029]
Sample Test Case Passed
*******************************

Running Sample Test Case 3
Input Sentence: blastoise pikachu charizard
First five elements of output vector: [0. 0. 0. 0. 0.]
Expected first five elements of output vector: [0. 0. 0. 0. 0.]
Sample Test Case Passed
*******************************



Now that you have implemented the sentence vector functions, let's obtain sentence vectors for all the sentences in our training and test sets. This will take a few minutes.

In [None]:
train_documents = train_df["sentence"].values.tolist()
test_documents = test_df["sentence"].values.tolist()
train_vocab = create_vocab(train_documents)
train_word2freq = get_word_frequencies(train_documents)

train_bow_vectors = np.array([
    get_bow_sent_vec(document, wv)
    for document in train_documents
])
test_bow_vectors = np.array([
    get_bow_sent_vec(document, wv)
    for document in test_documents
])

train_w_bow_vectors = np.array([
    get_weighted_bow_sent_vec(document, wv, train_word2freq, a = 1e-3)
    for document in train_documents
])
test_w_bow_vectors = np.array([
    get_weighted_bow_sent_vec(document, wv, train_word2freq, a = 1e-3)
    for document in test_documents
])

Preparing. . . . 


100%|██████████| 67349/67349 [03:31<00:00, 318.44it/s]


Finalizing. .. .


100%|██████████| 14704/14704 [30:53<00:00,  7.93it/s]


In [None]:
# # Save the data to reduce execution time on the next run
# np.save('train_documents.npy', train_documents)
# np.save('test_documents.npy', test_documents)
# np.save('train_vocab.npy', train_vocab)
# np.save('train_word2freq.npy', train_word2freq)

In [None]:
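# Reload the cached data (only works if the files were saved in an earlier run)
train_documents = np.load('train_documents.npy')
test_documents = np.load('test_documents.npy')
# allow_pickle=True is needed for arrays that store Python objects
train_vocab = np.load('train_vocab.npy', allow_pickle=True)
# Assuming train_word2freq is a dict: .item() unwraps the 0-d object array back into it
train_word2freq = np.load('train_word2freq.npy', allow_pickle=True).item()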
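
Caching a Python dictionary with `np.save` needs a little care when it is loaded back. The following is a minimal sketch of such a round trip, using a small made-up frequency dictionary and a temporary directory (both are assumptions for illustration, not the assignment's data):

```python
import os
import tempfile

import numpy as np

# Hypothetical toy dictionary standing in for a word -> frequency mapping
toy_word2freq = {"movie": 12, "good": 7, "bad": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_word2freq.npy")

    # np.save wraps the dict in a 0-dimensional object array and pickles it
    np.save(path, toy_word2freq)

    # Pickled object arrays are only loaded when allow_pickle=True is passed;
    # .item() unwraps the 0-dimensional array back into the original dict
    loaded = np.load(path, allow_pickle=True).item()

print(type(loaded).__name__, loaded == toy_word2freq)
```

A list of strings, by contrast, is stored as a regular string array and can be loaded without `allow_pickle`, although it comes back as a NumPy array rather than a Python list.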


## Task 2: Train a Sentiment Classifier using Sentence Vectors

This part will be just like Assignment 1, but instead of the Bag of Word features we defined last time to train the classifier, we will use the sentence vectors obtained from word2vec.

### Define a Custom Dataset class

In [None]:
from torch.utils.data import Dataset, DataLoader

class SST2Dataset(Dataset):
    
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
    
    

### Task 2.1: Define the Logistic Regression Model (1 Mark)

Like last time, define a Logistic Regression model that takes a sentence vector as input and predicts the label.

In [None]:
import torch
import torch.nn as nn

class LogisticRegressionModel(nn.Module):
    
    def __init__(self, d_input):
        """
        Define the architecture of a Logistic Regression classifier.
        You will need to define two components, one will be the linear layer using
        nn.Linear, and a sigmoid activation function for the output.

        Inputs:
          - d_input (int): The dimensionality or number of features in each input. 
                            This will be required to define the linear layer

        Hint: Recall that in logistic regression we obtain a single probability
        value for each input that denotes how likely the input is to belong
        to the positive class
        """
        #Need to call the constructor of the parent class
        super(LogisticRegressionModel, self).__init__()

        self.linear_layer = nn.Linear(d_input, 1)
        self.sigmoid_layer = nn.Sigmoid()


  
    def forward(self, x):
        """
        Passes the input `x` through the layers in the network and returns the output

        Inputs:
          - x (torch.tensor): A torch tensor of shape [batch_size, d_input] representing the batch of inputs

        Returns:
          - output (torch.tensor): A torch tensor of shape [batch_size,] obtained after passing the input to the network

        """
        output = self.linear_layer(x)
        output = self.sigmoid_layer(output)
        

        return output.squeeze(-1) # Question: Why do squeeze() here? 
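
To see what the final `squeeze(-1)` changes, here is a small NumPy sketch of the same computation. The weights `w`, the bias `b`, and the inputs are made-up values for illustration, not the model's parameters. A linear layer with one output unit yields shape `[batch_size, 1]`, whereas the labels have shape `[batch_size]`:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps real-valued scores to probabilities in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

batch_size, d_input = 4, 5
rng = np.random.default_rng(0)
x = rng.random((batch_size, d_input))   # hypothetical batch of inputs
w = rng.normal(size=(d_input, 1))       # hypothetical weights, one output unit
b = np.zeros(1)                         # hypothetical bias

probs_2d = sigmoid(x @ w + b)           # shape (4, 1)
probs_1d = probs_2d.squeeze(-1)         # shape (4,)
labels = np.array([1.0, 0.0, 1.0, 1.0]) # shape (4,)

print(probs_2d.shape, probs_1d.shape, labels.shape)
# Without squeezing, elementwise operations broadcast (4, 1) against (4,) into (4, 4)
print((probs_2d - labels).shape, (probs_1d - labels).shape)
```

The unsqueezed predictions silently broadcast against the labels into a `(4, 4)` array, which is not the per-example comparison we want. `nn.BCELoss` likewise expects predictions and targets of the same shape.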

In [None]:
print("Running Sample Test Cases")
torch.manual_seed(42)
d_input = 5
sample_lr_model = LogisticRegressionModel(d_input = d_input)
print(f"Sample Test Case 1: Testing linear layer input and output sizes, for d_input = {d_input}")
in_features = sample_lr_model.linear_layer.in_features
out_features = sample_lr_model.linear_layer.out_features

print(f"Number of Input Features: {in_features}")
print(f"Number of Output Features: {out_features}")
print(f"Expected Number of Input Features: {d_input}")
print(f"Expected Number of Output Features: {1}")
assert in_features == d_input and out_features == 1

print("**********************************\n")
d_input = 24
sample_lr_model = LogisticRegressionModel(d_input = d_input)
print(f"Sample Test Case 2: Testing linear layer input and output sizes, for d_input = {d_input}")
in_features = sample_lr_model.linear_layer.in_features
out_features = sample_lr_model.linear_layer.out_features

print(f"Number of Input Features: {in_features}")
print(f"Number of Output Features: {out_features}")
print(f"Expected Number of Input Features: {d_input}")
print(f"Expected Number of Output Features: {1}")
assert in_features == d_input and out_features == 1
print("**********************************\n")

print(f"Sample Test Case 3: Checking if the model gives correct output")
test_input = torch.rand(d_input)
model_output = sample_lr_model(test_input)
model_output_np = model_output.detach().numpy()
expected_output = 0.6298196315765381
print(f"Model Output: {model_output_np}")
print(f"Expected Output: {expected_output}")

assert np.allclose(model_output_np, expected_output, 1e-5)
print("**********************************\n")

print(f"Sample Test Case 4: Checking if the model gives correct output")
test_input = torch.rand(4, d_input)
model_output = sample_lr_model(test_input)
model_output_np = model_output.detach().numpy()
expected_output = np.array([0.5503339, 0.5428218, 0.561816,  0.51846  ])
print(f"Model Output: {model_output_np}")
print(f"Expected Output: {expected_output}")

assert model_output_np.shape == expected_output.shape and np.allclose(model_output_np, expected_output, 1e-5)
print("**********************************\n")


Running Sample Test Cases
Sample Test Case 1: Testing linear layer input and output sizes, for d_input = 5
Number of Input Features: 5
Number of Output Features: 1
Expected Number of Input Features: 5
Expected Number of Output Features: 1
**********************************

Sample Test Case 2: Testing linear layer input and output sizes, for d_input = 24
Number of Input Features: 24
Number of Output Features: 1
Expected Number of Input Features: 24
Expected Number of Output Features: 1
**********************************

Sample Test Case 3: Checking if the model gives correct output
Model Output: 0.6298196315765381
Expected Output: 0.6298196315765381
**********************************

Sample Test Case 4: Checking if the model gives correct output
Model Output: [0.5503339 0.5428218 0.561816  0.51846  ]
Expected Output: [0.5503339 0.5428218 0.561816  0.51846  ]
**********************************



### Task 2.2: Training and Evaluating the Model (5 Marks)

Write the training and evaluation functions, like last time, to train and evaluate the sentiment classification model. You will need to write the entire functions on your own this time. You can refer to the code in Assignment 1.
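
As a reminder of what the loss measures, here is a minimal pure-Python sketch of the mean binary cross-entropy over a batch. The helper name `binary_cross_entropy` and the example probabilities are illustrative assumptions; in the assignment you should use `nn.BCELoss` rather than this function.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Penalize low probability on the true class: -log(p) if y = 1, -log(1 - p) if y = 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# A model that always outputs 0.5 versus one that is confidently correct
uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
print(uninformed, confident)
```

A model that outputs 0.5 for every example has a loss of ln 2 (about 0.693), so a freshly initialized model usually starts near that value, and the loss should fall as the predictions become more confident and correct.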

In [None]:
import torch
import torch.nn as nn
from torch.optim import Adam

def train(model, train_dataloader,
          lr = 1e-3, num_epochs = 20,
          device = "cpu"):

    """
    Runs the training loop. Define the loss function as BCELoss like the last time
    and the optimizer as Adam, and train for `num_epochs` epochs.

    Inputs:
        - model (LogisticRegressionModel): A classifier model to be trained
        - train_dataloader (torch.utils.data.DataLoader): A dataloader defined over the training dataset
        - lr (float): The learning rate for the optimizer
        - num_epochs (int): Number of epochs to train the model for.
        - device (str): Device to train the model on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - model (LogisticRegressionModel): Model after completing the training
        - epoch_loss (float) : Loss value corresponding to the final epoch
    """
    # Transfer the model to specified device
    model = model.to(device)

    # Step 1: Define the Binary Cross Entropy loss function
    loss_fn = nn.BCELoss()
  
    # Step 2: Define Adam Optimizer
    optimizer = Adam(model.parameters(),lr=lr)


    # Iterate over `num_epochs`
    for epoch in range(num_epochs):
        epoch_loss = 0 # We can use this to keep track of how the loss value changes as we train the model.
        # Iterate over each batch using the `train_dataloader`
        for train_batch in tqdm.tqdm(train_dataloader):
            # Zero out any gradients stored in the previous steps
            optimizer.zero_grad()

            # Unwrap the batch to get features and labels
            features, labels = train_batch

            # Most nn modules and loss functions assume the inputs are of type Float, so convert both features and labels to floats
            features = features.float()
            labels = labels.float()

            # Transfer the features and labels to device
            features = features.to(device)
            labels = labels.to(device)


            # Step 3: Feed the input features to the model to get predictions
            preds = model(features)
            # Step 4: Compute the loss and perform backward pass
            loss = loss_fn(preds,labels)
            loss.backward()

            # Step 5: Take optimizer step
            optimizer.step()
            # Store loss value for tracking
            epoch_loss += loss.item()

        epoch_loss = epoch_loss / len(train_dataloader)
        print(f"Epoch {epoch} completed.. Average Loss: {epoch_loss}")

    return model, epoch_loss
    

In [None]:
torch.manual_seed(42)
print("Training on 100 data points for sanity check")
sample_documents = train_df["sentence"].values.tolist()[:100]
sample_labels = train_df["label"].values.tolist()[:100]
sample_features = np.array([get_bow_sent_vec(document, wv) for document in sample_documents])
sample_dataset = SST2Dataset(sample_features, sample_labels)
sample_dataloader = DataLoader(sample_dataset, batch_size=64)
sample_lr_model = LogisticRegressionModel(d_input = len(sample_features[0]))

sample_lr_model, loss = train(sample_lr_model, sample_dataloader,
      lr = 1e-2, num_epochs = 10,
      device = "cpu")

expected_loss = 0.5364882349967957
print(f"Final Loss Value: {loss}")
print(f"Expected Loss Value: {expected_loss}")

Training on 100 data points for sanity check


100%|██████████| 2/2 [00:00<00:00, 12.24it/s]


Epoch 0 completed.. Average Loss: 0.6949431896209717


100%|██████████| 2/2 [00:00<00:00, 282.25it/s]


Epoch 1 completed.. Average Loss: 0.6701762676239014


100%|██████████| 2/2 [00:00<00:00, 164.23it/s]


Epoch 2 completed.. Average Loss: 0.6490162014961243


100%|██████████| 2/2 [00:00<00:00, 166.67it/s]


Epoch 3 completed.. Average Loss: 0.6294075548648834


100%|██████████| 2/2 [00:00<00:00, 181.27it/s]


Epoch 4 completed.. Average Loss: 0.6111842691898346


100%|██████████| 2/2 [00:00<00:00, 164.13it/s]


Epoch 5 completed.. Average Loss: 0.5942146182060242


100%|██████████| 2/2 [00:00<00:00, 391.68it/s]


Epoch 6 completed.. Average Loss: 0.5783675312995911


100%|██████████| 2/2 [00:00<00:00, 184.94it/s]


Epoch 7 completed.. Average Loss: 0.5635259449481964


100%|██████████| 2/2 [00:00<00:00, 435.27it/s]


Epoch 8 completed.. Average Loss: 0.5495925843715668


100%|██████████| 2/2 [00:00<00:00, 576.14it/s]

Epoch 9 completed.. Average Loss: 0.536488264799118
Final Loss Value: 0.536488264799118
Expected Loss Value: 0.5364882349967957





Don't worry if the loss values do not match exactly, but you should see a decreasing trend, and the final value should be of the same order of magnitude.

In [None]:


def evaluate(model, test_dataloader, threshold = 0.5, device = "cpu"):
    
    """
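    Evaluates `model` on test dataset

    Inputs:
        - model (LogisticRegressionModel): Trained model to be evaluated
        - test_dataloader (torch.utils.data.DataLoader): A dataloader defined over the test dataset
        - threshold (float): Probability above which an example is predicted as positive (label 1)
        - device (str): Device to run the evaluation on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - accuracy (float): Average of the per-batch accuracies over the test dataset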
    """    

    model.to(device)
    model = model.eval() # Set model to evaluation mode
    accuracy = 0
    
    # Wrapping inference in `torch.no_grad` ensures no gradients are calculated while running the model,
    # which makes the computation faster and uses less memory
    with torch.no_grad():
      for test_batch in test_dataloader:
        features, labels = test_batch
        features = features.float().to(device)
        labels = labels.float().to(device)

        # Step 1: Get probability predictions from the model and store it in `pred_probs`
        pred_probs = model(features)

        # Convert predictions and labels to numpy arrays from torch tensors as they are easier to operate for computing metrics
        pred_probs = pred_probs.detach().cpu().numpy()
        labels = labels.detach().cpu().numpy()

        # Step 2: Get accuracy of predictions and store it in `batch_accuracy`
        # Threshold the probabilities to get 0/1 predictions
        predictions = (pred_probs > threshold).astype(int)

        # Fraction of predictions in this batch that match the labels
        batch_accuracy = float((predictions == labels).mean())


        accuracy += batch_accuracy

      # Divide by number of batches to get average accuracy
      accuracy = accuracy / len(test_dataloader)

      return accuracy
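
Note that averaging per-batch accuracies is not always the same as the accuracy over all examples: when the last batch is smaller, each of its examples counts for more. The sketch below illustrates this with made-up counts for 100 examples split into batches of 64 and 36 (the numbers of correct predictions are assumptions chosen for illustration):

```python
# Illustrative counts: 100 examples split into batches of 64 and 36
batch_sizes = [64, 36]
batch_correct = [46, 26]  # hypothetical number of correct predictions per batch

# Unweighted mean of per-batch accuracies
mean_of_batches = sum(c / n for c, n in zip(batch_correct, batch_sizes)) / len(batch_sizes)

# Accuracy over all examples, where every example counts equally
overall = sum(batch_correct) / sum(batch_sizes)

print(mean_of_batches, overall)
```

The two values differ only slightly here (roughly 0.7205 versus 0.72), and they coincide whenever all batches have the same size.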


In [None]:
print(f"Testing the sample model on 100 examples for sanity check")
torch.manual_seed(42)
sample_documents = test_df["sentence"].values.tolist()[:100]
sample_labels = test_df["label"].values.tolist()[:100]
sample_features = np.array([get_bow_sent_vec(document, wv) for document in sample_documents])

sample_dataset = SST2Dataset(sample_features,
                            sample_labels)

sample_dataloader = DataLoader(sample_dataset, batch_size = 64)
accuracy = evaluate(sample_lr_model, sample_dataloader, device ="cpu")
expected_accuracy = 0.7204861111111112
print(f"Accuracy: {accuracy}")
print(f"Expected Accuracy: {expected_accuracy}")


Testing the sample model on 100 examples for sanity check
Accuracy: 0.7204861111111112
Expected Accuracy: 0.7204861111111112


Now that you have implemented the training and evaluation functions, we will train and evaluate two different models and compare their performance. The two models are:

- Logistic Regression with Bag of Word2vec features
- Logistic Regression with Weighted Bag of Word2vec features

In [None]:
print(f"Training and Evaluating Logistic Regression with Bag of Word2vec features")
device = "cuda" if torch.cuda.is_available() else "cpu"

train_labels = train_df["label"].values.tolist()
test_labels = test_df["label"].values.tolist()

train_dataset = SST2Dataset(train_bow_vectors, train_labels)
train_loader = DataLoader(train_dataset, batch_size = 64)

test_dataset = SST2Dataset(test_bow_vectors, test_labels)
test_loader = DataLoader(test_dataset, batch_size = 64)

lr_bow_model = LogisticRegressionModel(
    d_input = wv.vector_size,
)

lr_bow_model, loss = train(lr_bow_model, train_loader,
      lr = 1e-2, num_epochs = 10,
      device = device)

test_accuracy = evaluate(
    lr_bow_model, test_loader,
    device = device
)

print(f"Test Accuracy: {test_accuracy}")

Training and Evaluating Logistic Regression with Bag of Word2vec features


100%|██████████| 1053/1053 [00:02<00:00, 524.34it/s]


Epoch 0 completed.. Average Loss: 0.40442544707262507


100%|██████████| 1053/1053 [00:01<00:00, 547.97it/s]


Epoch 1 completed.. Average Loss: 0.37194564222497023


100%|██████████| 1053/1053 [00:01<00:00, 568.37it/s]


Epoch 2 completed.. Average Loss: 0.3695470960306646


100%|██████████| 1053/1053 [00:01<00:00, 556.84it/s]


Epoch 3 completed.. Average Loss: 0.36882112299048775


100%|██████████| 1053/1053 [00:01<00:00, 571.14it/s]


Epoch 4 completed.. Average Loss: 0.3685299224182185


100%|██████████| 1053/1053 [00:01<00:00, 571.39it/s]


Epoch 5 completed.. Average Loss: 0.36839686528003907


100%|██████████| 1053/1053 [00:01<00:00, 551.26it/s]


Epoch 6 completed.. Average Loss: 0.3683308569567609


100%|██████████| 1053/1053 [00:01<00:00, 541.46it/s]


Epoch 7 completed.. Average Loss: 0.3682959823668399


100%|██████████| 1053/1053 [00:01<00:00, 555.80it/s]


Epoch 8 completed.. Average Loss: 0.3682767879447819


100%|██████████| 1053/1053 [00:01<00:00, 564.78it/s]

Epoch 9 completed.. Average Loss: 0.36826538582529666
Test Accuracy: 0.8087053571428572





In [None]:
print(f"Training and Evaluating Logistic Regression with Weighted Bag of Word2vec features")
device = "cuda" if torch.cuda.is_available() else "cpu"

train_labels = train_df["label"].values.tolist()
test_labels = test_df["label"].values.tolist()

train_dataset = SST2Dataset(train_w_bow_vectors, train_labels)
train_loader = DataLoader(train_dataset, batch_size = 64)

test_dataset = SST2Dataset(test_w_bow_vectors, test_labels)
test_loader = DataLoader(test_dataset, batch_size = 64)

lr_bow_model = LogisticRegressionModel(
    d_input = wv.vector_size,
)

lr_bow_model, loss = train(lr_bow_model, train_loader,
      lr = 1e-2, num_epochs = 10,
      device = device)

test_accuracy = evaluate(
    lr_bow_model, test_loader,
    device = device
)

print(f"Test Accuracy: {test_accuracy}")

Training and Evaluating Logistic Regression with Weighted Bag of Word2vec features


100%|██████████| 1053/1053 [00:01<00:00, 544.59it/s]


Epoch 0 completed.. Average Loss: 0.4211177029925534


100%|██████████| 1053/1053 [00:01<00:00, 554.67it/s]


Epoch 1 completed.. Average Loss: 0.3881261649153285


100%|██████████| 1053/1053 [00:01<00:00, 554.81it/s]


Epoch 2 completed.. Average Loss: 0.385535904173611


100%|██████████| 1053/1053 [00:01<00:00, 553.38it/s]


Epoch 3 completed.. Average Loss: 0.3847372835999088


100%|██████████| 1053/1053 [00:01<00:00, 557.69it/s]


Epoch 4 completed.. Average Loss: 0.3844115064609424


100%|██████████| 1053/1053 [00:01<00:00, 548.44it/s]


Epoch 5 completed.. Average Loss: 0.3842606441615767


100%|██████████| 1053/1053 [00:01<00:00, 555.56it/s]


Epoch 6 completed.. Average Loss: 0.3841837417309554


100%|██████████| 1053/1053 [00:01<00:00, 550.96it/s]


Epoch 7 completed.. Average Loss: 0.38414362674368524


100%|██████████| 1053/1053 [00:01<00:00, 565.55it/s]


Epoch 8 completed.. Average Loss: 0.38411985124051856


100%|██████████| 1053/1053 [00:01<00:00, 544.34it/s]

Epoch 9 completed.. Average Loss: 0.38410562988643066
Test Accuracy: 0.7933035714285713





The first thing you may notice is that these models train substantially faster than the models in Assignment 1, because the sentence representations are now much smaller: 300 dimensions, compared to the vocabulary size of around 10k last time!
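
The size difference can be made concrete by counting the trainable parameters of a logistic regression model in each setting. This is a rough sketch: the helper name is hypothetical, and the vocabulary size of 10,000 is only the approximate "around 10k" figure, not an exact count.

```python
def logistic_regression_param_count(d_input):
    # One weight per input feature plus a single bias term
    return d_input + 1

word2vec_dim = 300        # sentence vector size used in this assignment
bow_vocab_size = 10_000   # approximate vocabulary size from Assignment 1 (assumed)

dense_params = logistic_regression_param_count(word2vec_dim)
sparse_params = logistic_regression_param_count(bow_vocab_size)
print(dense_params, sparse_params)
```

Each training step therefore touches roughly thirty times fewer weights, and each input batch is correspondingly smaller.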

Both models reach roughly 80% test accuracy (about 81% with the unweighted and 79% with the weighted features), which is close to what we got with Bag of Words features in Assignment 1. We do not see much improvement because both methods still take a (weighted) sum of the individual word vectors to obtain the sentence vector, and therefore fail to encode word order or how the words combine in a sentence. For example, consider the following two sentences:

- it was a good movie adapted from a bad book
- it was a bad movie adapted from a good book

Both sentences contain exactly the same words, so both methods assign them identical vector representations. The model therefore cannot distinguish between their sentiments and will give the same prediction for both.
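
This can be checked with a small sketch. It uses random 4-dimensional toy vectors in place of the real word2vec embeddings and a plain whitespace tokenizer, so the names `toy_vectors` and `mean_sentence_vector` are assumptions for illustration rather than the functions from this assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
sentence_a = "it was a good movie adapted from a bad book".split()
sentence_b = "it was a bad movie adapted from a good book".split()

# Hypothetical 4-dimensional embeddings, one random vector per distinct word
toy_vectors = {
    word: rng.normal(size=4)
    for word in sorted(set(sentence_a + sentence_b))
}

def mean_sentence_vector(tokens):
    # Average the word vectors; the order of the tokens plays no role
    return np.mean([toy_vectors[token] for token in tokens], axis=0)

vec_a = mean_sentence_vector(sentence_a)
vec_b = mean_sentence_vector(sentence_b)
print(np.allclose(vec_a, vec_b))
```

The same holds if each word vector is first multiplied by a weight that depends only on the word itself, since swapping two words does not change which weighted vectors are summed.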

In the next assignments, we shall see how to learn more contextual representations of sentences that can help us solve the task much more effectively.
