
In [None]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit
rm -rf hw2
git clone https://github.com/mit-6864/hw2.git

Cloning into 'hw2'...


In [None]:
import sys
sys.path.append("/content/hw2")

import csv
import itertools as it
import numpy as np
import sklearn.decomposition
from tqdm import tqdm

import lab_util

# Introduction

In this notebook, you will find code scaffolding for the implementation portion of Homework 1. There are certain parts of the scaffolding marked with `# Your code here!` comments where you can fill in code to perform the specified tasks. You should be able to complete this assignment without changing any of the scaffolding code, just writing code to fill in the scaffolding and run experiments. Make sure to read the text between cells for certain implementation details. Please submit the notebook with all code cells running.

This notebook can be done independently of the handout and will be graded based on the code and cell outputs. However, certain questions on the handout require you to design and perform experiments to evaluate the methods used here. There is space at the end of the notebook for you to carry out these experiments.


## Dataset

We're going to be working with a dataset of product reviews. The following cell loads the dataset and splits it into training, validation, and test sets.

In [None]:
data = []
n_positive = 0
n_disp = 0
with open("/content/hw2/reviews.csv") as reader:
  csvreader = csv.reader(reader)
  next(csvreader)
  for id, review, label in csvreader:
    label = int(label)

    # hacky class balancing
    if label == 1:
      if n_positive == 2000:
        continue
      n_positive += 1
    if len(data) == 4000:
      break

    data.append((review, label))

    if n_disp > 5:
      continue
    n_disp += 1
    print("review:", review)
    print("rating:", label, "(good)" if label == 1 else "(bad)")
    print()

print(f"Read {len(data)} total reviews.")
np.random.seed(0)
np.random.shuffle(data)
reviews, labels = zip(*data)
train_reviews, train_labels = reviews[:3000], labels[:3000]
val_reviews, val_labels = reviews[3000:3500], labels[3000:3500]
test_reviews, test_labels = reviews[3500:], labels[3500:]

# Preliminaries: Word-document representations

We start by constructing the bag-of-words matrix (look at `/content/hw2/lab_util.py` in the file browser on the left if you want to see how this works).
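
To make the idea concrete, here is a minimal sketch of a bag-of-words matrix built by hand. It is not the implementation in `lab_util.py`: the helper `toy_bag_of_words` and its whitespace tokenization are assumptions made purely for illustration.

```python
import numpy as np

def toy_bag_of_words(docs):
    # Hypothetical helper: lowercase, split on whitespace, count tokens.
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    word_to_index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)), dtype=int)
    for row, doc in enumerate(docs):
        for word in doc.lower().split():
            counts[row, word_to_index[word]] += 1
    return vocab, counts

toy_vocab, toy_counts = toy_bag_of_words(["good good cookie", "bad cookie"])
print(toy_vocab)   # vocabulary in sorted order
print(toy_counts)  # one row per document, one column per vocabulary word
```

Each row is a document and each column a vocabulary word, which matches the `documents x vocabulary` shape printed for `bow_matrix`.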

In [None]:
vectorizer = lab_util.CountVectorizer()
vectorizer.fit(train_reviews)
bow_matrix = vectorizer.transform(train_reviews)
print(f"BoW matrix is {bow_matrix.shape[0]} x {bow_matrix.shape[1]}")

BoW matrix is 3000 x 2006


In class, we've seen that reweighting raw counts can give more informative representations. Implement the TF-IDF transform below.

Note: In lecture, we multiplied the raw term frequencies by the IDFs to get the TF-IDF matrix (`tfidf = tf * idf`). Feel free to experiment with other transformations, such as `log(1 + tf)` as the measure of term frequency.
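
As a small worked example of the `tf * idf` product, the sketch below uses a toy term-document matrix. It assumes one common definition, `idf = ln(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term; the names `toy_counts`, `toy_idf`, and `toy_tfidf` are illustrative only, and other IDF variants (for example, smoothed ones) exist.

```python
import numpy as np

# Toy term-document counts: 2 terms (rows) x 3 documents (columns).
toy_counts = np.array([[2, 0, 1],
                       [1, 1, 1]])

n_docs = toy_counts.shape[1]
doc_freq = (toy_counts > 0).sum(axis=1)    # number of documents containing each term
toy_idf = np.log(n_docs / doc_freq)        # natural log; assumes every term occurs somewhere
toy_tfidf = toy_counts * toy_idf[:, None]  # scale each term's row by its idf

print(toy_idf)
print(toy_tfidf)
```

The second term appears in every document, so its IDF is `ln(3/3) = 0` and its whole row is zeroed out: a word that occurs everywhere carries no information for telling documents apart.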

In [None]:
class TfidfFeaturizer:
    def fit(self, matrix):
        # `matrix` is a `|V| x |D|` matrix of raw counts, where `|V|` is the
        # vocabulary size and `|D|` is the number of documents in the corpus.
        # This function should create the inverse document frequency (idf) matrix
        # for the given term-document matrix.

        self.idf = None # Your code here!

    def transform_tfidf(self, matrix):
        # `matrix` is a `|V| x |D|` matrix of raw counts, where `|V|` is the
        # vocabulary size and `|D|` is the number of documents in the corpus. This
        # function should (nondestructively) return a version of `matrix` with the
        # TF-IDF transform applied.

        # Your code here!
        raise NotImplementedError

td_matrix = bow_matrix.T
featurizer = TfidfFeaturizer()
featurizer.fit(td_matrix)
tfidf_matrix = featurizer.transform_tfidf(td_matrix)
print(f"TF-IDF matrix is {tfidf_matrix.shape[0]} x {tfidf_matrix.shape[1]}")

#### Sanity check 1
The following cell should print `True` if your `transform_tfidf` function is implemented properly. (*Hint: in our implementation, we use the natural logarithm (base $e$) when computing inverse document frequency.*)

In [None]:
DEBUG_sc1_matrix = np.array([[3,1,0,3,0],
                             [0,2,0,0,1],
                             [7,8,2,0,1],
                             [1,9,8,1,0]])
DEBUG_gt = np.array([[1.53247687, 0.51082562, 0.        , 1.53247687, 0.        ],
                     [0.        , 1.83258146, 0.        , 0.        , 0.91629073],
                     [1.56200486, 1.78514841, 0.4462871 , 0.        , 0.22314355],
                     [0.22314355, 2.00829196, 1.78514841, 0.22314355, 0.        ]])
debug = TfidfFeaturizer()
debug.fit(DEBUG_sc1_matrix)
print(np.allclose(debug.transform_tfidf(DEBUG_sc1_matrix), DEBUG_gt))

#### Linear models on BoW and TF-IDF features

Now we have two feature representations, BoW and TF-IDF. Let's first see how effective these features are for the sentiment classification task.

Below, implement two logistic regression models to classify the reviews, using BoW and TF-IDF features respectively. Feel free to use the `scikit-learn` library, which provides `sklearn.linear_model.LogisticRegression`. Report the training and test accuracy of both models.

Note: For the TF-IDF classifier, we only fit the IDF matrix to the training data. (Think about why you might not want a separate IDF for the test set!)
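
For reference, the basic `scikit-learn` fit-and-score workflow looks roughly like the sketch below. It runs on synthetic two-class data rather than the review features, and all `toy_*` names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic, well-separated two-class data (not the review features).
toy_X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
                   rng.normal(+3.0, 1.0, size=(100, 2))])
toy_y = np.array([0] * 100 + [1] * 100)

toy_clf = LogisticRegression(max_iter=1000)  # a larger max_iter helps the solver converge
toy_clf.fit(toy_X, toy_y)
toy_train_acc = toy_clf.score(toy_X, toy_y)  # mean accuracy on the given data
print(f"toy training accuracy: {toy_train_acc:.3f}")
```

Calling `score` on held-out features gives the test accuracy in the same way. Those features must be produced with the statistics fitted on the training set.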


In [None]:
from sklearn.linear_model import LogisticRegression

def train_and_eval(train_X, train_y, test_X, test_y):
    # Create and train a model that takes as input a feature
    # representation of the training data and outputs a sentiment label.
    # Make sure to report the training and test accuracy of your model.
    # Hint: changing max_iter might be helpful

    # Your code here!
    raise NotImplementedError

print('Logistic regression with bag-of-words features')
# Your code here!

print('Logistic regression with tf-idf features')
# Your code here!

Let's look at what the model learns about sentiment. For both models, display the top 5 most positive and negative weights, as well as their corresponding words.

Hint: look at `/content/hw2/lab_util.py` for how to convert between token indices and words.
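
One way to rank weights is with `np.argsort`. The sketch below uses made-up weights and words; in the real task, the weights would come from the fitted model (for `LogisticRegression`, the `coef_` attribute) and the words from the tokenizer.

```python
import numpy as np

# Hypothetical weights and vocabulary, standing in for a fitted model's
# coefficients and the tokenizer's index-to-word mapping.
toy_weights = np.array([0.5, -1.2, 2.0, 0.1, -0.3])
toy_words = ["ok", "awful", "great", "fine", "meh"]

order = np.argsort(toy_weights)  # ascending: most negative first
most_negative = [(toy_words[i], toy_weights[i]) for i in order[:2]]
most_positive = [(toy_words[i], toy_weights[i]) for i in order[::-1][:2]]
print(most_positive)
print(most_negative)
```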

In [None]:
# Your code here!

# LSA: Word representations via matrix factorization

In class, we've seen that the above approaches can lead to high-dimensional representations. To alleviate this, we can use latent semantic analysis (LSA).

First, implement the function `learn_reps_lsa` that computes word representations via latent semantic analysis. The `sklearn.decomposition` or `np.linalg` packages may be useful.
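
As a reminder of the underlying linear algebra, the sketch below computes a truncated SVD of a toy matrix with `np.linalg.svd`. Scaling the left singular vectors by the singular values is one common convention for word representations, and using the left singular vectors alone is another; the choice here is an assumption, and the matrix and `toy_*` names are illustrative.

```python
import numpy as np

toy_matrix = np.array([[2., 0., 1., 0.],
                       [0., 1., 0., 3.],
                       [1., 1., 0., 0.]])

# Thin SVD: toy_matrix = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

k = 2
toy_reps = U[:, :k] * S[:k]  # one k-dimensional row per word (row of toy_matrix)
print(toy_reps.shape)        # (3, 2)
```

Keeping only the top `k` singular directions gives the best rank-`k` approximation of the matrix in the least-squares sense, which is why the truncated factors can serve as compact representations.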

In [None]:
def learn_reps_lsa(matrix, rep_size):
    # `matrix` is a `|V| x |D|` matrix, where `|V|` is the number of words in the
    # vocabulary and |D| is the number of training reviews. This function should
    # return a `|V| x rep_size` matrix with each row corresponding to a word
    # representation. The `sklearn.decomposition` package may be useful.

    # Your code here!
    raise NotImplementedError

#### Sanity check 2
The following cell contains a simple sanity check for your `learn_reps_lsa` implementation: it should print `True` if your `learn_reps_lsa` function is implemented equivalently to one of our solutions.

In [None]:
DEBUG_sc2_matrix = np.array([[1,0,0,2,1,3,5],
                             [2,0,0,0,0,4,0],
                             [0,3,4,1,8,6,6],
                             [1,4,5,0,0,0,0]])

DEBUG_reps = learn_reps_lsa(DEBUG_sc2_matrix, 3)
DEBUG_gt1 = np.array([[ -4.92017554,  -2.85465774,   1.18575453],
                      [ -2.14977584,  -1.19987977,   3.37221899],
                      [-12.62664695,   0.10890093,  -1.32131745],
                      [ -2.69216011,   5.66453534,   1.33728063]])
DEBUG_gt2 = np.array([[-0.35188159, -0.44213061,  0.29358929],
                      [-0.15374788, -0.18583789,  0.83495136],
                      [-0.90303377,  0.01686662, -0.32715426],
                      [-0.19253817,  0.87732566,  0.3311067 ]])

print(np.allclose(np.abs(DEBUG_reps), np.abs(DEBUG_gt1)) or np.allclose(np.abs(DEBUG_reps), np.abs(DEBUG_gt2)))

Let's look at some representations:

In [None]:
# LSA reps for term-document matrix
# Feel free to change the rep size!
reps = learn_reps_lsa(td_matrix, 500)
words = ["good", "bad", "cookie", "jelly", "dog", "the", "3"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

How do the given similar words change if we apply LSA to the TF-IDF matrix instead?

In [None]:
# Feel free to change the rep size!
reps_tfidf = learn_reps_lsa(tfidf_matrix, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

Now that we have some representations, let's see if we can do something useful with them.

Below, implement a feature function that represents a document as the sum of its
learned word embeddings.

The remaining code trains a logistic regression model on a set of *labeled* reviews; we're interested in seeing how much representations learned from *unlabeled* reviews, or reviews without any additional human annotation, can improve classification.
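
Because each row of the count matrix records how often each word occurs in a review, the sum of a review's word embeddings can be written as a single matrix product. A toy sketch, with a hypothetical embedding table `toy_word_reps`:

```python
import numpy as np

toy_counts = np.array([[2, 0, 1],       # review 0: word 0 twice, word 2 once
                       [0, 1, 0]])      # review 1: word 1 once
toy_word_reps = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])  # hypothetical 2-d embedding per word

# Row i of the product is sum_j toy_counts[i, j] * toy_word_reps[j].
toy_doc_feats = toy_counts @ toy_word_reps
print(toy_doc_feats)
```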

In [None]:
import sklearn.linear_model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def word_featurizer(xs):
    # normalize
    return xs / np.sqrt((xs ** 2).sum(axis=1, keepdims=True))

def lsa_featurizer(xs):
    # This function takes in a matrix in which each row contains the word counts
    # for the given review. It should return a matrix in which each row contains
    # the learned feature representation of each review (e.g. the sum of LSA
    # word representations).

    feats = None # Your code here!

    # normalize
    return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

def combo_featurizer(xs):
    return np.concatenate((word_featurizer(xs), lsa_featurizer(xs)), axis=1)

def train_model(featurizer, xs, ys):
    xs_featurized = featurizer(xs)
    model = sklearn.linear_model.LogisticRegression(penalty=None, max_iter=1000)
    model.fit(xs_featurized, ys)
    return model

def eval_model(model, featurizer, xs, ys):
    xs_featurized = featurizer(xs)
    pred_ys = model.predict(xs_featurized)
    return np.mean(pred_ys == ys)

def training_experiment(name, featurizer, n_train):
    print(f"{name} features, {n_train} examples")
    train_xs = vectorizer.transform(train_reviews[:n_train])
    train_ys = train_labels[:n_train]
    test_xs = vectorizer.transform(test_reviews)
    test_ys = test_labels
    model = train_model(featurizer, train_xs, train_ys)
    acc = eval_model(model, featurizer, test_xs, test_ys)
    print(acc, '\n')
    return acc

In [None]:
# this will run a training experiment with all 3k examples in training set
n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

# Word2vec: word representations via neural model

In this section, we'll train a word embedding model with a word2vec-style objective rather than a matrix factorization objective. This requires a little more work; we've provided scaffolding for a PyTorch model implementation below.
If you don't have much PyTorch experience, there are some tutorials [here](https://pytorch.org/tutorials/) which may be useful. You may also find the classes `nn.Embedding` and `nn.EmbeddingBag` useful.

Note: We will be implementing a CBOW model; that is, given a word's context, we will predict the central word.
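
The sketch below illustrates the building blocks of a CBOW-style forward pass on toy sizes: an embedding table with a padding row, a sum over the context, and a linear layer followed by a log-softmax. It is an illustration under assumptions, not the reference solution: the sizes, the choice of summing instead of averaging, and all `toy_*` names are made up here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab_size, toy_dim = 5, 3
toy_pad = toy_vocab_size  # assumption: the padding id is one past the last real word id

# One extra row for the padding token; its embedding is initialized to zero
# and receives no gradient updates.
toy_embed = nn.Embedding(toy_vocab_size + 1, toy_dim, padding_idx=toy_pad)
toy_out = nn.Linear(toy_dim, toy_vocab_size)

# Two contexts of four word ids each; the first is padded on the left.
toy_context = torch.tensor([[toy_pad, toy_pad, 1, 2],
                            [0, 1, 3, 4]])
toy_hidden = toy_embed(toy_context).sum(dim=1)              # n_batch x toy_dim
toy_log_probs = F.log_softmax(toy_out(toy_hidden), dim=-1)  # n_batch x toy_vocab_size
print(toy_log_probs.shape)
```

Since the padding row is all zeros, padded positions contribute nothing to the summed context vector.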

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as torch_data

class Word2VecModel(nn.Module):
    # A torch module implementing a word2vec predictor. The `forward` function
    # should take a batch of context word ids as input and predict the word
    # in the middle of the context as output, as in the CBOW model from lecture.
    # Hint: look at how padding is handled in lab_util.get_ngrams when
    # initializing `ctx`: vocab_size is used as the padding token for contexts
    # near the beginning and end of sequences. If you use an embedding module
    # in your Word2Vec implementation, make sure to account for this extra
    # padding token in the input dimension and include the `padding_idx` kwarg.

    def __init__(self, vocab_size, embedding_size, padding_idx=2006):
        super().__init__()

        # Your code here!
        raise NotImplementedError

    def forward(self, context):
        # Context is an `n_batch x n_context` matrix of integer word ids.
        # In this case, n_context = 2 * window_size where window_size is defined
        # in lab_util.py. This is because each word has both left and right context.
        # This function should return an `n_batch x vocab_size` matrix with
        # element i, j being the (possibly log) probability of the middle word
        # in context i being word j.

        # Your code here!
        raise NotImplementedError

Train the model using the function below. Note that we use an [Adam optimizer](https://arxiv.org/abs/1412.6980), a variant of SGD that uses momentum and per-parameter adaptive step sizes.
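
For readers new to PyTorch, here is a minimal sketch of the usual optimization step (compute the loss, zero the gradients, backpropagate, update) on a toy regression problem. The model, data, learning rate, and `nn.MSELoss` are chosen only for this toy problem and are not the ones to use for Word2Vec.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
toy_x = torch.randn(64, 3)
toy_y = toy_x @ torch.tensor([[1.0], [-2.0], [0.5]])  # synthetic linear targets

toy_model = nn.Linear(3, 1)
toy_opt = optim.Adam(toy_model.parameters(), lr=0.05)
toy_loss_fn = nn.MSELoss()  # regression loss for this toy problem only

toy_losses = []
for _ in range(200):
    toy_loss = toy_loss_fn(toy_model(toy_x), toy_y)
    toy_opt.zero_grad()  # clear gradients left over from the previous step
    toy_loss.backward()  # backpropagate to fill in .grad on each parameter
    toy_opt.step()       # Adam parameter update
    toy_losses.append(toy_loss.item())

print(toy_losses[0], toy_losses[-1])
```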



In [None]:
def learn_reps_word2vec(corpus, window_size, rep_size, n_epochs, n_batch):
    # This method takes in a corpus of training sentences. It returns a matrix of
    # word embeddings with the same structure as used in the previous section of
    # the assignment. (You can extract this matrix from the parameters of the
    # Word2VecModel.)

    tokenizer = lab_util.Tokenizer()
    tokenizer.fit(corpus)
    tokenized_corpus = tokenizer.tokenize(corpus)

    ngrams = lab_util.get_ngrams(tokenized_corpus, window_size, pad_idx=2006)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # prefer the Colab GPU, fall back to CPU
    model = Word2VecModel(tokenizer.vocab_size, rep_size).to(device)
    opt = optim.Adam(model.parameters(), lr=0.001)

    loader = torch_data.DataLoader(ngrams, batch_size=n_batch, shuffle=True)

    # What loss function should we use for Word2Vec?
    loss_fn = None  # Your code here!

    losses = []  # Potentially useful for debugging (loss should go down!)
    for epoch in tqdm(range(n_epochs)):
        epoch_loss = 0
        for context, label in loader:
            # As described above, `context` is a batch of context word ids, and
            # `label` is a batch of target word ids (the middle words to predict).

            # Here, perform a forward pass to compute predictions for the model.
            # Your code here!
            preds = None


            # Now finish the backward pass and gradient update.
            # Remember, you need to compute the loss, zero the gradients
            # of the model parameters, perform the backward pass, and
            # update the model parameters.
            # Your code here!
            loss = None


            epoch_loss += loss.item()
        losses.append(epoch_loss)

    # Hint: you want to return a `vocab_size x embedding_size` numpy array
    embedding_matrix = None  # Your code here!

    return embedding_matrix

In [None]:
# Feel free to change the hyperparameters!
# Use the function you just wrote to learn Word2Vec embeddings:
reps_word2vec = learn_reps_word2vec(train_reviews, 2, 500, 10, 100)

After training the embeddings, we can try to visualize the embedding space to see if it makes sense. First, we can take any word in the space and check its closest neighbors.
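
A nearest-neighbor lookup can be sketched as follows, assuming cosine similarity as the notion of closeness (the helper in `lab_util.py` may use a different measure). The embeddings and the function `toy_nearest` are hypothetical.

```python
import numpy as np

toy_words = ["good", "great", "bad", "awful"]
toy_reps = np.array([[1.0, 0.1],
                     [0.9, 0.2],
                     [-1.0, 0.1],
                     [-0.8, 0.3]])  # hypothetical 2-d embeddings

def toy_nearest(query_index, reps, n=1):
    # Cosine similarity equals the dot product of L2-normalized rows.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # exclude the query word itself
    return np.argsort(-sims)[:n]

print([toy_words[i] for i in toy_nearest(0, toy_reps)])
```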

In [None]:
lab_util.show_similar_words(vectorizer.tokenizer, reps_word2vec, show_tokens)

We can also cluster the embedding space. Clustering in 4 or more dimensions is hard to visualize, and even clustering in 2 or 3 can be difficult because there are so many words in the vocabulary. One thing we can try to do is assign cluster labels and qualitatively look for an underlying pattern in the clusters.
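
As a quick illustration of what the cluster labels mean, the sketch below runs k-means on two obviously separated groups of hypothetical 2-d points. Fixing `random_state` and `n_init` here is a choice made for reproducibility of the toy example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical, well-separated groups of 2-d "embeddings".
toy_points = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                       [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_cluster_ids = toy_kmeans.fit_predict(toy_points)
print(toy_cluster_ids)  # cluster ids are arbitrary; only the grouping matters
```

The numeric ids carry no meaning on their own, so when inspecting real clusters it is the set of words sharing an id that is worth looking at.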

In [None]:
from sklearn.cluster import KMeans

indices = KMeans(n_clusters=10).fit_predict(reps_word2vec)
zipped = list(zip(range(vectorizer.tokenizer.vocab_size), indices))
np.random.shuffle(zipped)
zipped = zipped[:100]
zipped = sorted(zipped, key=lambda x: x[1], reverse=True)
for token, cluster_idx in zipped:
  word = vectorizer.tokenizer.token_to_word[token]
  print(f"{word}: {cluster_idx}")

Finally, we can use the trained word embeddings to construct vector representations of full reviews. One common approach is to simply average all the word embeddings in the review to create an overall embedding. Implement the `w2v_featurizer` function below to do this.
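
One detail worth noticing: if the review vectors are L2-normalized afterwards, averaging and summing the word embeddings give the same result, because dividing by a positive word count does not change a vector's direction. A toy check, with hypothetical embeddings and an illustrative helper `l2_normalize`:

```python
import numpy as np

toy_counts = np.array([[2, 0, 1],
                       [0, 3, 1]])
toy_word_reps = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])  # hypothetical embeddings

toy_sum = toy_counts @ toy_word_reps
toy_mean = toy_sum / toy_counts.sum(axis=1, keepdims=True)  # assumes no empty review

def l2_normalize(rows):
    return rows / np.sqrt((rows ** 2).sum(axis=1, keepdims=True))

# Dividing by a positive word count does not change the direction.
print(np.allclose(l2_normalize(toy_sum), l2_normalize(toy_mean)))
```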

In [None]:
def w2v_featurizer(xs):
    # This function takes in a matrix in which each row contains the word counts
    # for the given review. It should return a matrix in which each row contains
    # the average Word2Vec embedding of each review (hint: this will be very
    # similar to `lsa_featurizer` from above, just using Word2Vec embeddings
    # instead of LSA).

    feats = None # Your code here!

    # normalize
    return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

training_experiment("word2vec", w2v_featurizer, 3000)
print()

# Experiments for HW1

Below, you can implement experiments to answer the experimental questions in the HW1 handout. Please label each code cell with its relevant question part.

In [None]:
# Your code here!