<a href="https://colab.research.google.com/github/weissercn/Eta_to_Ap_Gamma_Search/blob/master/Good_Copy_of_6864_hw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit 
rm -rf 6864-hw1
git clone https://github.com/lingo-mit/6864-hw1.git

Cloning into '6864-hw1'...


In [0]:
import sys
sys.path.append("/content/6864-hw1")

import csv
import itertools as it
import numpy as np
np.random.seed(0)

import lab_util

## Introduction

In this lab, you'll explore three different ways of using unlabeled text data to learn pretrained word representations. Your lab report will describe the effects of different modeling decisions (representation learning objective, context size, etc.) on both qualitative properties of learned representations and their effect on a downstream prediction problem.

**General lab report guidelines**

Homework assignments should be submitted in the form of a research report. (We'll be providing a place to upload them before the due date, but are still sorting out some logistics.) Please upload PDFs, with a maximum of four single-spaced pages. (If you want, you can use the [Association for Computational Linguistics style files](http://acl2020.org/downloads/acl2020-templates.zip).) Reports should have one section for each part of the homework assignment below. Each section should describe the details of your code implementation and include whatever charts / tables are necessary to answer the set of questions at the end of the corresponding homework part.



We're going to be working with a dataset of product reviews. It looks like this:

In [3]:
data = []
n_positive = 0
n_disp = 0
with open("/content/6864-hw1/reviews.csv") as reader:
  csvreader = csv.reader(reader)
  next(csvreader)
  for id, review, label in csvreader:
    label = int(label)

    # hacky class balancing
    if label == 1:
      if n_positive == 2000:
        continue
      n_positive += 1
    if len(data) == 4000:
      break

    data.append((review, label))
    
    if n_disp > 5:
      continue
    n_disp += 1
    print("review:", review)
    print("rating:", label, "(good)" if label == 1 else "(bad)")
    print()

print(f"Read {len(data)} total reviews.")
np.random.shuffle(data)
reviews, labels = zip(*data)
train_reviews = reviews[:3000]
train_labels = labels[:3000]
val_reviews = reviews[3000:3500]
val_labels = labels[3000:3500]
test_reviews = reviews[3500:]
test_labels = labels[3500:]

review: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
rating: 1 (good)

review: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
rating: 0 (bad)

review: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother an

We've provided a little bit of helper code for reading in the dataset; your job is to implement the learning!

## Part 1: word representations via matrix factorization

First, we'll construct the term-document matrix (look at `/content/6864-hw1/lab_util.py` in the file browser on the left if you want to see how this works).

In [4]:
vectorizer = lab_util.CountVectorizer()
vectorizer.fit(train_reviews)
td_matrix = vectorizer.transform(train_reviews).T
print(f"TD matrix is {td_matrix.shape[0]} x {td_matrix.shape[1]}")

TD matrix is 2006 x 3000
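
To make the layout of this matrix concrete, here is a minimal sketch that builds a term-document count matrix by hand on a toy corpus. The whitespace tokenizer and the `build_td_matrix` helper are assumptions made for illustration; `lab_util.CountVectorizer` may tokenize differently and may handle rare words in its own way.

```python
import numpy as np

def build_td_matrix(docs):
    # Hypothetical helper: whitespace tokenization, vocabulary in sorted order.
    vocab = sorted({word for doc in docs for word in doc.split()})
    word_to_row = {word: i for i, word in enumerate(vocab)}
    # Rows index words, columns index documents.
    matrix = np.zeros((len(vocab), len(docs)), dtype=int)
    for col, doc in enumerate(docs):
        for word in doc.split():
            matrix[word_to_row[word], col] += 1
    return vocab, matrix

toy_docs = ["good dog food", "bad dog", "good good cookie"]
toy_vocab, toy_td = build_td_matrix(toy_docs)
print(toy_vocab)
print(toy_td)
```

Each row is one word and each column is one document, which matches the `|V| x n` orientation produced by the `.T` in the cell above.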


Next, implement a function that computes word representations via latent semantic analysis:

In [0]:
def learn_reps_lsa(matrix, rep_size):
  # `matrix` is a `|V| x n` matrix, where `|V|` is the number of words in the
  # vocabulary. This function should return a `|V| x rep_size` matrix with each
  # row corresponding to a word representation. The `sklearn.decomposition`
  # package may be useful.

  from sklearn.decomposition import TruncatedSVD
  # Rank-`rep_size` truncated SVD; a fixed seed keeps the result reproducible.
  svd = TruncatedSVD(n_components=rep_size, n_iter=7, random_state=42)
  svd.fit(matrix)
  # Each row of the result is the reduced representation of one word.
  return svd.transform(matrix)
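
For intuition about what the function returns, the sketch below uses plain NumPy on a small random matrix (not the review data). With an exact SVD $X = U \Sigma V^\top$, projecting the rows of $X$ onto the top `k` right singular vectors gives the same result as scaling the top `k` left singular vectors by their singular values. The names `toy_X` and `k` are illustrative, and `TruncatedSVD` uses an approximate solver by default, so its output should only be expected to agree with this up to sign and numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.poisson(1.0, size=(6, 10)).astype(float)  # toy |V| x n count matrix
k = 3

U, s, Vt = np.linalg.svd(toy_X, full_matrices=False)

# Route 1: scale each of the top-k left singular vectors by its singular value.
reps_from_U = U[:, :k] * s[:k]
# Route 2: project every row of the matrix onto the top-k right singular vectors.
reps_from_V = toy_X @ Vt[:k].T

print(reps_from_U.shape)
print(np.allclose(reps_from_U, reps_from_V))
```

Whether to keep the singular-value scaling or to use the unscaled left singular vectors as word vectors is a modeling choice that can change nearest-neighbor results.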

In [6]:
def learn_reps_lsa_test(matrix, rep_size):
  # Scratch experiment for writeup question 3: compare the left singular
  # vectors of the LSA output and of the raw matrix with those of the
  # corresponding word-word products.

  from scipy import linalg
  from sklearn.decomposition import TruncatedSVD
  svd = TruncatedSVD(n_components=rep_size, n_iter=7, random_state=42)
  svd.fit(matrix)
  res = svd.transform(matrix)
  U, s, Vh = linalg.svd(res)                             # reduced representation
  U1, s1, Vh1 = linalg.svd(matrix)                       # raw term-document matrix
  U2, s2, Vh2 = linalg.svd(np.matmul(res, res.transpose()))
  U3, s3, Vh3 = linalg.svd(np.matmul(matrix, matrix.transpose()))
  print("U", U, U2)
  print("U", U1, U3)
  print(U.shape, U1.shape, U2.shape, U3.shape)

  # Returns the SVD hyperparameters, not the singular vectors.
  return svd.get_params()
svd_params = learn_reps_lsa_test(td_matrix, 500)

U [[-6.89031937e-01 -6.72372892e-01  1.78411037e-01 ...  2.54319531e-04
   2.03106361e-04 -2.07221953e-03]
 [-2.31553598e-02  5.20963554e-02 -1.21153239e-02 ...  1.25874572e-03
  -3.12207734e-03  4.00310656e-03]
 [-3.83301968e-02  4.29225155e-02 -1.93856196e-02 ...  1.27053012e-03
  -1.40084451e-03  8.97751537e-04]
 ...
 [-3.17231578e-04  6.50245704e-04 -1.52675912e-03 ...  9.68889507e-01
   1.86446326e-03 -7.28153447e-04]
 [-4.42263586e-04  2.14287700e-04  4.98282343e-04 ...  4.26442928e-04
   9.81787254e-01 -1.22439074e-03]
 [-2.76302849e-03 -1.39797955e-02  5.86256685e-03 ...  5.39120990e-04
  -4.45575283e-03  7.95273211e-01]] [[-6.89031937e-01  6.72372892e-01  1.78411037e-01 ... -1.05230763e-03
  -6.17478003e-03  8.48553559e-04]
 [-2.31553598e-02 -5.20963554e-02 -1.21153239e-02 ...  5.89750501e-03
  -1.76333031e-02  2.69489740e-02]
 [-3.83301968e-02 -4.29225155e-02 -1.93856196e-02 ... -6.03945671e-03
  -1.36362303e-02 -1.74960448e-02]
 ...
 [-3.17231578e-04 -6.50245704e-04 -1.52675

Let's look at some representations:

In [7]:
reps = learn_reps_lsa(td_matrix, 500)
words = ["good", "bad", "cookie", "jelly", "dog", "the", "4"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

good 47
  . 1.056
  a 1.101
  but 1.121
  , 1.152
  the 1.157
bad 201
  . 1.396
  taste 1.417
  but 1.434
  a 1.435
  i 1.449
cookie 504
  nana's 0.796
  cookies 1.025
  oreos 1.282
  bars 1.382
  bites 1.405
jelly 351
  twist 1.091
  cardboard 1.227
  peanuts 1.400
  advertised 1.405
  plastic 1.468
dog 925
  food 1.049
  pet 1.066
  pets 1.069
  switched 1.205
  foods 1.228
the 36
  . 0.331
  <unk> 0.366
  of 0.395
  and 0.403
  to 0.422
4 292
  1 1.047
  6 1.121
  70 1.135
  stevia 1.196
  concentrated 1.245
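
A nearest-neighbor listing like the one above can be produced by ranking every row of the representation matrix by its distance to the query row. The sketch below uses cosine distance; this is an assumption, since the metric used by `lab_util.show_similar_words` is not shown here, and `nearest_neighbors` is a hypothetical helper rather than part of the lab code.

```python
import numpy as np

def nearest_neighbors(reps, query_row, n=5):
    # Hypothetical helper: rank rows of `reps` by cosine distance to one row.
    # Assumes no row of `reps` is all zeros.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    dists = 1.0 - unit @ unit[query_row]
    order = np.argsort(dists)
    order = order[order != query_row][:n]   # skip the query itself
    return [(int(i), float(dists[i])) for i in order]

toy_word_reps = np.array([[1.0, 0.0],
                          [0.9, 0.1],
                          [0.0, 1.0],
                          [-1.0, 0.0]])
toy_neighbors = nearest_neighbors(toy_word_reps, 0, n=2)
print(toy_neighbors)
```

Row 1 points in almost the same direction as the query and comes first; row 2 is orthogonal to the query and has distance 1.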


We've been operating on the raw count matrix, but in class we discussed several reweighting schemes aimed at making LSA representations more informative. 

Here, implement the TF-IDF transform and see how it affects learned representations.

In [0]:
def transform_tfidf(matrix):
  # `matrix` is a `|V| x |D|` matrix of raw counts, where `|V|` is the
  # vocabulary size and `|D|` is the number of documents in the corpus. This
  # function (nondestructively) returns a version of `matrix` with the
  # TF-IDF transform applied, together with the `|V| x 1` column of IDF
  # weights so that the same weights can be reused on new documents.

  V, D = matrix.shape

  # Document frequency: the number of documents each word appears in.
  doc_freq = np.count_nonzero(matrix, axis=1)
  # Inverse document frequency, shaped as a column so it broadcasts over rows.
  idf = np.log(D / doc_freq).reshape(-1, 1)

  return matrix * idf, idf
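
As a worked example of the weighting, the sketch below applies the same formula, $\mathrm{idf}(w) = \log(|D| / \mathrm{df}(w))$, to a three-word toy matrix. The `toy_` names are illustrative only. A word that occurs in every document gets weight $\log 1 = 0$, so its row is zeroed out, while rarer words are scaled up.

```python
import numpy as np

toy_counts = np.array([
    [1, 1, 1, 1],   # appears in every document (like "the")
    [2, 0, 0, 0],   # appears in one document
    [0, 1, 1, 0],   # appears in two documents
], dtype=float)

n_docs = toy_counts.shape[1]
toy_doc_freq = np.count_nonzero(toy_counts, axis=1)
toy_idf = np.log(n_docs / toy_doc_freq).reshape(-1, 1)
toy_tfidf = toy_counts * toy_idf   # a new array; toy_counts is left unchanged

print(toy_idf.ravel())
print(toy_tfidf)
```

Note that a word with a document frequency of zero would cause a division by zero in this formula, so a variant with smoothing may be needed if such rows can occur.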

How does this change the learned similarity function?

In [9]:
td_matrix_tfidf, idf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 1000)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

good 47
  . 1.028
  a 1.082
  but 1.103
  and 1.138
  the 1.143
bad 201
  . 1.379
  taste 1.412
  but 1.422
  a 1.425
  i 1.440
cookie 504
  nana's 0.988
  cookies 1.282
  oreos 1.545
  moist 1.551
  bars 1.553
jelly 351
  twist 1.312
  softer 1.530
  cardboard 1.541
  plum 1.624
  advertised 1.634
dog 925
  food 1.050
  pets 1.162
  pet 1.182
  foods 1.257
  dogs 1.311
the 36
  . 0.263
  <unk> 0.328
  and 0.341
  of 0.362
  to 0.383
4 292
  1 1.053
  6 1.132
  stevia 1.210
  70 1.264
  5 1.322


Now that we have some representations, let's see if we can do something useful with them.

Below, implement a feature function that represents a document as the sum of its
learned word embeddings.

The remaining code trains a logistic regression model on a set of *labeled* reviews; we're interested in seeing how much representations learned from *unlabeled* reviews improve classification.

In [33]:
def word_featurizer(xs):
  # normalize
  return xs / np.sqrt((xs ** 2).sum(axis=1, keepdims=True))

def lsa_featurizer(xs):
  # This function takes in a matrix in which each row contains the word counts
  # for the given review. It should return a matrix in which each row contains
  # the learned feature representation of each review (e.g. the sum of LSA 
  # word representations).
    
  # xs has shape (n_examples, n_words); scale each word's counts by its IDF
  xs_tfidf = (xs.T * idf).T
  # TF-IDF-weighted sum of the LSA word representations for each review
  feats = np.dot(xs_tfidf, reps_tfidf)
  # Unweighted alternative: feats = np.dot(xs, reps)

  # normalize each row to unit L2 norm
  return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

def combo_featurizer(xs):
  return np.concatenate((word_featurizer(xs), lsa_featurizer(xs)), axis=1) # (2,3) and (2,3) arrays concatenated to (2,6)

def train_model(featurizer, xs, ys):
  import sklearn.linear_model
  xs_featurized = featurizer(xs)
  model = sklearn.linear_model.LogisticRegression()
  model.fit(xs_featurized, ys)
  return model

def eval_model(model, featurizer, xs, ys):
  xs_featurized = featurizer(xs)
  pred_ys = model.predict(xs_featurized)
  print("test accuracy", np.mean(pred_ys == ys))

def training_experiment(name, featurizer, n_train):
  print(f"{name} features, {n_train} examples")
  train_xs = vectorizer.transform(train_reviews[:n_train])
  train_ys = train_labels[:n_train]
  test_xs = vectorizer.transform(test_reviews)
  test_ys = test_labels
  model = train_model(featurizer, train_xs, train_ys)
  eval_model(model, featurizer, test_xs, test_ys)
  print()

training_experiment("word", word_featurizer, 3000)
training_experiment("lsa", lsa_featurizer, 3000)
training_experiment("combo", combo_featurizer, 3000)

word features, 3000 examples
test accuracy 0.784

lsa features, 3000 examples
test accuracy 0.792

combo features, 3000 examples
test accuracy 0.81
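
The featurizer relies on the fact that multiplying a matrix of word counts by the matrix of word representations yields, for each document, the count-weighted sum of its word vectors. The sketch below checks this on toy data; all `toy_` names are illustrative and the vectors are made up.

```python
import numpy as np

toy_embeddings = np.array([[1.0, 0.0],    # word 0
                           [0.0, 2.0],    # word 1
                           [3.0, 3.0]])   # word 2
toy_doc_counts = np.array([[2, 0, 1],     # document A: word 0 twice, word 2 once
                           [0, 1, 0]])    # document B: word 1 once

# One matrix product sums the embeddings of every token in each document.
toy_feats = toy_doc_counts @ toy_embeddings
explicit_a = 2 * toy_embeddings[0] + 1 * toy_embeddings[2]
print(toy_feats)
print(np.allclose(toy_feats[0], explicit_a))

# Row-wise L2 normalization, as in the featurizers above.
toy_unit = toy_feats / np.linalg.norm(toy_feats, axis=1, keepdims=True)
print(np.linalg.norm(toy_unit, axis=1))
```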



**Part 1: Lab writeup**

Part 1 of your lab report should discuss any implementation details that were important to filling out the code above. Then, use the code to set up experiments that answer the following questions:

1. Qualitatively, what do you observe about nearest neighbors in representation space? (E.g. what words are most similar to _the_, _dog_, _3_, and _good_?)

2. How does the size of the LSA representation affect this behavior?


3. Recall that we can compute the word co-occurrence matrix
   $W_{tt} = W_{td} W_{td}^\top$. What can you prove about the relationship between the
   left singular vectors of $W_{td}$ and $W_{tt}$? Do you observe this behavior
   with your implementation of `learn_reps_lsa`? Why or why not?

4. Do learned representations help with the review classification problem? What
   is the relationship between the number of labeled examples and the effect of
   word embeddings?
   
5. What is the relationship between the size of the word embeddings and their usefulness for the classification task?
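
One way to sanity-check an answer to question 3 numerically: writing the thin SVD as $W_{td} = U \Sigma V^\top$ gives $W_{tt} = W_{td} W_{td}^\top = U \Sigma V^\top V \Sigma U^\top = U \Sigma^2 U^\top$, so the two matrices should share left singular vectors, with the singular values of $W_{tt}$ equal to the squares of those of $W_{td}$. The sketch below checks this with an exact SVD on a random toy matrix. It assumes distinct singular values, so that each vector is determined up to sign, and it says nothing about how an approximate solver such as `TruncatedSVD` behaves.

```python
import numpy as np

rng = np.random.default_rng(0)
W_td = rng.normal(size=(5, 8))        # toy term-document matrix
W_tt = W_td @ W_td.T                  # toy word co-occurrence matrix

U_td, s_td, _ = np.linalg.svd(W_td, full_matrices=False)
U_tt, s_tt, _ = np.linalg.svd(W_tt)

# Singular values of W_tt are the squared singular values of W_td.
print(np.allclose(s_tt, s_td ** 2))

# Matching columns agree up to sign, so |U_td^T U_tt| should be the identity.
overlap = np.abs(U_td.T @ U_tt)
print(np.allclose(overlap, np.eye(5), atol=1e-6))
```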

## Part 2: word representations via language modeling

In this section, we'll train a word embedding model with a word2vec-style objective rather than a matrix factorization objective. This requires a little more work; we've provided scaffolding for a PyTorch model implementation below.
(If you've never used PyTorch before, there are some tutorials [here](https://pytorch.org/tutorials/). You're also welcome to implement these experiments in
any other framework of your choosing.)

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as torch_data

class Word2VecModel(nn.Module):
  # A torch module implementing a word2vec predictor. The `forward` function
  # should take a batch of context word ids as input and predict the word 
  # in the middle of the context as output, as in the CBOW model from lecture.

  def __init__(self, vocab_size, embed_dim):
      super().__init__()

      self.vocab_size = vocab_size
      self.embed_dim  = embed_dim

      self.V = nn.Linear(vocab_size, embed_dim, bias=False)
      self.U   = nn.Linear(embed_dim, vocab_size, bias=False)

      # Your code here!
      pass

  def convert_average_input(self, context):
      # `context` is an `n_batch x n_context` matrix of integer word ids.
      # Returns an `n_batch x vocab_size` matrix whose rows are the averaged
      # one-hot vectors of the context words.
      n_batch = len(context)
      n_context = len(context[0])
      x = torch.zeros([n_batch, self.vocab_size])
      for iexample, example in enumerate(context):
        # Accumulate one word at a time so that a repeated id is counted each
        # time; a single fancy-indexed `+=` may count a repeated id only once.
        for word in example:
          x[iexample, word] += 1. / n_context

      return x

  def forward(self, context):
      # Context is an `n_batch x n_context` matrix of integer word ids
      # this function should return a set of scores for predicting the word 
      # in the middle of the context

      # first create sparse matrix, then average n_context
      x = self.convert_average_input(context)
      # multiply by V matrix
      x = self.V(x)
      # multiply by U matrix
      x = self.U(x)
      return x

  def get_embedding(self):
        
      return self.V.weight
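
The model above multiplies an averaged bag-of-words vector by a weight matrix. The NumPy sketch below illustrates why that is equivalent to looking up one embedding row per context word and averaging the rows. Here `toy_E` stands in for a `vocab_size x embed_dim` embedding matrix (the transpose of `self.V.weight`), and all names are illustrative. In PyTorch, the lookup route corresponds to using `nn.Embedding`, which avoids building the dense `n_batch x vocab_size` input.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embed_dim = 7, 3
toy_E = rng.normal(size=(toy_vocab_size, toy_embed_dim))  # one row per word
toy_context = np.array([[1, 4, 4, 6],
                        [0, 2, 3, 5]])                    # n_batch x n_context ids

# Dense route: averaged bag-of-words vector times the embedding matrix.
bow = np.zeros((toy_context.shape[0], toy_vocab_size))
for row, ids in enumerate(toy_context):
    np.add.at(bow[row], ids, 1.0 / len(ids))   # accumulates repeated ids
dense = bow @ toy_E

# Lookup route: index the embedding rows directly and average them.
lookup = toy_E[toy_context].mean(axis=1)

print(dense.shape)
print(np.allclose(dense, lookup))
```

The first context contains word 4 twice; `np.add.at` counts it twice, whereas a plain fancy-indexed `+=` in NumPy would apply the update only once for a repeated index.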



In [0]:
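def convert_label(label, vocab_size):
      # `label` is a length-`n_batch` vector of integer word ids. Returns the
      # corresponding `n_batch x vocab_size` one-hot matrix. (Unused below:
      # `nn.CrossEntropyLoss` takes integer class ids directly.)
      n_batch = len(label)
      x = np.zeros((n_batch, vocab_size))
      for iword, word in enumerate(label):
          x[iword, word] += 1.

      return x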

def learn_reps_word2vec(corpus, window_size, rep_size, n_epochs, n_batch):
  # This method takes in a corpus of training sentences. It returns a matrix of
  # word embeddings with the same structure as used in the previous section of 
  # the assignment. (You can extract this matrix from the parameters of the 
  # Word2VecModel.)

  tokenizer = lab_util.Tokenizer()
  tokenizer.fit(corpus)
  tokenized_corpus = tokenizer.tokenize(corpus)

  ngrams = lab_util.get_ngrams(tokenized_corpus, window_size)

  #device = torch.device('cuda')  # run on colab gpu
  model = Word2VecModel(tokenizer.vocab_size, rep_size) #.to(device)
  opt = optim.Adam(model.parameters(), lr=0.001)
  criterion = nn.CrossEntropyLoss() 

      

  loader = torch_data.DataLoader(ngrams, batch_size=n_batch, shuffle=True)

  for epoch in range(n_epochs):
    print("epoch ", epoch)
    for context, label in loader:
      # as described above, `context` is a batch of context word ids, and
      # `label` is a batch of predicted word labels

      opt.zero_grad()
      output = model(context)
      #label_conv = convert_label(label, tokenizer.vocab_size) # cross entropy does this for me

      
      loss = criterion(output, label) 
      loss.backward()
      opt.step()
      # Your code here!

  torch.save(model.state_dict(), '2_25')

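  # Note: `model.get_embedding()` returns the weight tensor of `self.V`, which
  # has shape `embedding_size x vocab_size`; the caller converts it to a
  # `vocab_size x embedding_size` numpy array.
  embedding_matrix = model.get_embedding()
  return embedding_matrix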
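
The training loop consumes pairs of context word ids and a center word id. The implementation of `lab_util.get_ngrams` is not shown here, so the following is only a sketch of what such pairs could look like; `make_cbow_pairs` is a hypothetical stand-in, and the real helper may treat sentence boundaries differently (for example by padding).

```python
def make_cbow_pairs(token_ids, window_size):
    # Hypothetical stand-in for an n-gram extractor: for each position with a
    # full window on both sides, pair the surrounding ids with the center id.
    pairs = []
    for center in range(window_size, len(token_ids) - window_size):
        left = token_ids[center - window_size:center]
        right = token_ids[center + 1:center + 1 + window_size]
        pairs.append((left + right, token_ids[center]))
    return pairs

toy_pairs = make_cbow_pairs([10, 11, 12, 13, 14, 15], window_size=2)
print(toy_pairs)
# [([10, 11, 13, 14], 12), ([11, 12, 14, 15], 13)]
```

With `window_size=2`, each context holds four ids, which is consistent with the `n_batch x n_context` input described in `Word2VecModel.forward`.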

In [18]:
reps_word2vec_tr = learn_reps_word2vec(train_reviews, 2, 500, 10, 100)
# Detach from the autograd graph and transpose to `vocab_size x embedding_size`
reps_word2vec = reps_word2vec_tr.detach().numpy().T

epoch  0
epoch  1
epoch  2
epoch  3
epoch  4
epoch  5
epoch  6
epoch  7
epoch  8
epoch  9


In [19]:
print(type(vectorizer.tokenizer), type(reps_word2vec), type(show_tokens))
print(reps_word2vec.shape)


<class 'lab_util.Tokenizer'> <class 'numpy.ndarray'> <class 'list'>
(2006, 500)


After training the embeddings, we can try to visualize the embedding space to see if it makes sense. First, we can take any word in the space and check its closest neighbors.

In [20]:
lab_util.show_similar_words(vectorizer.tokenizer, reps_word2vec, show_tokens)

good 47
  great 0.868
  decent 0.982
  bad 1.082
  disgusting 1.092
  fantastic 1.112
bad 201
  good 1.082
  awful 1.092
  overpowering 1.147
  terrible 1.175
  acidic 1.221
cookie 504
  range 1.322
  covered 1.326
  berry 1.338
  lover 1.340
  bisquick 1.365
jelly 351
  bears 1.193
  pork 1.255
  sized 1.267
  san 1.278
  coffees 1.289
dog 925
  cat 0.844
  baby 1.028
  cats 1.218
  breed 1.222
  son 1.253
the 36
  mrs 1.073
  our 1.269
  their 1.283
  my 1.287
  amazon's 1.336
4 292
  2 0.792
  5 0.978
  10 1.045
  3 1.096
  75 1.114


We can also cluster the embedding space. Clustering in 4 or more dimensions is hard to visualize, and even clustering in 2 or 3 can be difficult because there are so many words in the vocabulary. One thing we can try to do is assign cluster labels and qualitatively look for an underlying pattern in the clusters.

In [21]:
from sklearn.cluster import KMeans

indices = KMeans(n_clusters=10).fit_predict(reps_word2vec)
zipped = list(zip(range(vectorizer.tokenizer.vocab_size), indices))
np.random.shuffle(zipped)
zipped = zipped[:100]
zipped = sorted(zipped, key=lambda x: x[1])
for token, cluster_idx in zipped:
  word = vectorizer.tokenizer.token_to_word[token]
  print(f"{word}: {cluster_idx}")

gives: 0
anywhere: 0
claims: 0
left: 0
took: 0
came: 0
citrus: 1
horrible: 1
test: 1
serious: 1
palatable: 1
lot: 1
much: 1
tough: 1
twist: 1
nutritious: 1
difficult: 1
low: 1
bottle: 2
complete: 2
worth: 2
chewy: 3
reduced: 3
about: 3
tin: 3
such: 3
tasting: 3
maybe: 3
decaffeinated: 3
bears: 3
like: 3
than: 3
worst: 3
i: 4
unfortunately: 4
she: 4
trouble: 4
barely: 4
fell: 4
second: 4
today: 4
described: 4
baby: 5
watching: 5
mother: 5
date: 5
puppy: 5
toddler: 5
guests: 5
co: 5
health: 5
weight: 5
sunflower: 6
irish: 6
potato: 6
paste: 6
dip: 6
dressing: 6
mustard: 6
chunks: 6
allergy: 6
peanuts: 6
chocolates: 6
sesame: 6
egg: 6
cereal: 6
mint: 6
grain: 6
spinach: 6
beef: 6
placed: 7
paying: 7
disappointed: 7
contacted: 7
decided: 7
craving: 7
remaining: 7
wait: 8
enjoy: 8
become: 8
write: 8
figure: 8
remember: 8
speak: 8
state: 8
fit: 8
touch: 8
lid: 9
kind: 9
stuff: 9
flavour: 9
beverage: 9
thats: 9
clear: 9
quantity: 9
design: 9
thing: 9
switch: 9
shape: 9
shows: 9
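
When looking for patterns, it can be easier to read the words grouped under each cluster label than as one long list. The sketch below shows one way to do that; `group_by_cluster` is a hypothetical helper, and the toy input reuses a few word and label pairs from the output above.

```python
from collections import defaultdict

def group_by_cluster(words, cluster_ids):
    # Hypothetical helper: collect words under their cluster label.
    groups = defaultdict(list)
    for word, cluster_id in zip(words, cluster_ids):
        groups[int(cluster_id)].append(word)
    return dict(sorted(groups.items()))

toy_words = ["baby", "mustard", "puppy", "egg", "toddler"]
toy_clusters = [5, 6, 5, 6, 5]
toy_groups = group_by_cluster(toy_words, toy_clusters)
for cluster_id, members in toy_groups.items():
    print(cluster_id, ", ".join(members))
```

Cluster labels from `KMeans` are arbitrary and can change between runs; passing a fixed `random_state` makes the assignment reproducible.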


Finally, we can use the trained word embeddings to construct vector representations of full reviews. One common approach is to simply average all the word embeddings in the review to create an overall embedding. Implement the transform function in Word2VecFeaturizer to do this.

In [36]:
def word2vec_featurizer(xs):
  # `xs` holds one row of word counts per review. Weight the counts by IDF,
  # then take the weighted sum of word2vec embeddings for each review.
  xs = (xs.T * idf).T

  print(xs.shape, reps_word2vec.shape)
  feats = np.dot(xs, reps_word2vec)

  # L2-normalize each row, so summing and averaging give the same features
  return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

training_experiment("word2vec", word2vec_featurizer, 3000)

word2vec features, 3000 examples
(3000, 2006) (2006, 500)
(500, 2006) (2006, 500)
test accuracy 0.804
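
The text describes averaging word embeddings, while the featurizer computes a weighted sum. Because each row is L2-normalized afterwards, the two give identical features: dividing a row by its token count only rescales it. The sketch below checks this on toy values; the names and numbers are illustrative only.

```python
import numpy as np

toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_review_counts = np.array([[1.0, 2.0, 1.0]])   # one review, four tokens

summed = toy_review_counts @ toy_vectors
averaged = summed / toy_review_counts.sum(axis=1, keepdims=True)

def l2_normalize(x):
    # Scale each row to unit Euclidean length.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

print(summed, averaged)
print(np.allclose(l2_normalize(summed), l2_normalize(averaged)))
```

Both forms depend only on the word counts, so any two reviews with the same counts get the same features regardless of word order.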



**Part 2: Lab writeup**

Part 2 of your lab report should discuss any implementation details that were important to filling out the code above. Then, use the code to set up experiments that answer the following questions:

1. Qualitatively, what do you observe about nearest neighbors in representation space? (E.g. what words are most similar to _the_, _dog_, _3_, and _good_?) How well do word2vec representations correspond to your intuitions about word similarity?

2. One important parameter in word2vec-style models is context size. How does changing the context size affect the kinds of representations that are learned?

3. How do results on the downstream classification problem compare to 
   part 1?

4. What are some advantages and disadvantages of learned embedding representations, relative to the featurization done in part 1?

5. What are some potential problems with constructing a representation of the review by averaging the embeddings of the individual words?