<a href="https://colab.research.google.com/github/dbamman/nlp21/blob/main/HW3/HW_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 3: PyTorch and CNNs

In this homework, you will begin exploring PyTorch, a neural network library that will be used throughout the remainder of the semester. This homework will focus on convolutional neural networks (CNNs).



In [None]:
import sys, argparse
import numpy as np
import re
import nltk
import csv
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from tqdm import tqdm
from collections import Counter

#Sets random seeds for reproducibility
seed=159259
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

In [None]:
!python -m nltk.downloader punkt

When looking up PyTorch documentation, it may be useful to know which version of torch you are running.


In [None]:
print(torch.__version__)

# **IMPORTANT**: GPU is not enabled by default

If the next block of code raises the error "ValueError: Expected a cuda device, but got: cpu", you must switch runtime environments:

Go to Runtime > Change runtime type > Hardware accelerator > GPU

In [None]:
device = torch.cuda.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))
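
For reference, code that actually places a model or a batch on the GPU usually goes through a `torch.device` object, which is distinct from the `torch.cuda.device` context manager used in the check above. The following is a minimal sketch under that assumption; the names `dev`, `layer`, and `batch` are illustrative and are not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

# torch.device accepts "cpu" as well as "cuda", so this line runs on any runtime.
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(3, 2).to(dev)      # hypothetical toy layer, moved to the chosen device
batch = torch.randn(4, 3).to(dev)    # inputs must live on the same device as the layer
out = layer(batch)

print(tuple(out.shape), out.device.type)
```

The rest of this notebook does not move tensors explicitly, so this pattern only matters if you decide to train on the GPU.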

# Data Processing

Let's begin by loading our datasets and the 50-dimensional GloVe word embeddings.

In [None]:
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.train
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.dev
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/glove.6B.50d.50K.txt

In [None]:
trainingFile = "acl.train"
devFile = "acl.dev"

In [None]:
labels = {'APPLICATIONS': 11,
 'CSSCA': 23,
 'DIALOGUE': 12,
 'DISCOURSE': 13,
 'ETHICS': 8,
 'GENERATION': 9,
 'GREEN': 15,
 'GROUNDING': 18,
 'IE': 6,
 'INTERPRET': 10,
 'IR': 22,
 'LEXSEM': 7,
 'LING': 24,
 'MLCLASS': 1,
 'MLLM': 16,
 'MT': 4,
 'MULTILING': 3,
 'OTHER': 25,
 'PHON': 5,
 'QA': 17,
 'RESOURCES': 14,
 'SA': 21,
 'SENTSEM': 0,
 'SPEECH': 19,
 'SUMM': 2,
 'SYNTAX': 20}

In [None]:
def get_batches(x, y, xType, batch_size=12):
    batches_x=[]
    batches_y=[]
    for i in range(0, len(x), batch_size):
        batches_x.append(xType(x[i:i+batch_size]))
        batches_y.append(torch.LongTensor(y[i:i+batch_size]))
    
    return batches_x, batches_y
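
To see what this batching produces, here is a small sketch that applies the same slicing to made-up data (5 examples, batch size 2). The toy values are assumptions for illustration only.

```python
import torch

# Toy data: 5 examples with 3 features each, and 5 integer labels.
x = [[float(i)] * 3 for i in range(5)]
y = [0, 1, 2, 1, 0]
batch_size = 2

# Same slicing pattern as get_batches: consecutive chunks of `batch_size` items.
batches_x = [torch.FloatTensor(x[i:i + batch_size]) for i in range(0, len(x), batch_size)]
batches_y = [torch.LongTensor(y[i:i + batch_size]) for i in range(0, len(y), batch_size)]

batch_shapes = [tuple(b.shape) for b in batches_x]
print(batch_shapes)  # [(2, 3), (2, 3), (1, 3)]
```

Note that the final batch is smaller whenever the number of examples is not a multiple of the batch size.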
        

In [None]:
PAD_INDEX = 0             # reserved for padding words
UNKNOWN_INDEX = 1         # reserved for unknown words
SEP_INDEX = 2

data_lens = []

def read_embeddings(filename, vocab_size=50000):
  """
  Utility function, loads in the `vocab_size` most common embeddings from `filename`
  
  Arguments:
  - filename:     path to file
                  the embedding dimension is inferred from the first line of the file
  - vocab_size:   maximum number of embeddings to load

  Returns 
  - embeddings:   torch.FloatTensor matrix of size (vocab_size x word_embedding_dim)
  - vocab:        dictionary mapping word (str) to index (int) in embedding matrix
  """

  # get the embedding size from the first embedding
  with open(filename, encoding="utf-8") as file:
    word_embedding_dim = len(file.readline().split(" ")) - 1

  vocab = {}

  embeddings = np.zeros((vocab_size, word_embedding_dim))
  with open(filename, encoding="utf-8") as file:
    for idx, line in enumerate(file):

      if idx + 2 >= vocab_size:
        break

      cols = line.rstrip().split(" ")
      val = np.array(cols[1:], dtype=float)  # parse the vector components as floats
      word = cols[0]
      embeddings[idx + 2] = val
      vocab[word] = idx + 2
  
  # a FloatTensor is a multidimensional matrix
  # that contains 32-bit floats in every entry
  # https://pytorch.org/docs/stable/tensors.html
  return torch.FloatTensor(embeddings), vocab
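
The index layout that `read_embeddings` produces can be illustrated on two made-up lines in GloVe text format. This is only a sketch: `toy_lines`, `toy_vocab`, and `toy_embeddings` are hypothetical names, and the vectors are invented.

```python
import numpy as np

PAD_INDEX, UNKNOWN_INDEX = 0, 1   # same reserved slots as in the notebook

# Two made-up lines in GloVe text format: a word followed by its vector.
toy_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]

dim = len(toy_lines[0].split(" ")) - 1
toy_embeddings = np.zeros((len(toy_lines) + 2, dim))
toy_vocab = {}
for idx, line in enumerate(toy_lines):
    cols = line.rstrip().split(" ")
    toy_vocab[cols[0]] = idx + 2                              # real words start at index 2
    toy_embeddings[idx + 2] = np.array(cols[1:], dtype=float)

print(toy_vocab)                   # {'the': 2, 'cat': 3}
print(toy_embeddings[PAD_INDEX])   # [0. 0. 0.]
```

Rows 0 and 1 stay all-zero because they are reserved for padding and unknown words. As written in the notebook, the first word in the file also receives index 2, the same value as `SEP_INDEX`.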




# Logistic regression

First, let's code up logistic regression in PyTorch so you can see how the general framework works, and also get a sense of baseline performance that we can compare a CNN against.
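
One detail worth knowing before reading the model code: `nn.CrossEntropyLoss` expects raw scores (logits) and applies the log-softmax itself, which is why a classifier's `forward()` can return the output of a linear layer directly. The sketch below checks this on random toy values; the sizes (4 examples, 3 classes) are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                  # 4 examples, 3 classes (toy sizes)
targets = torch.LongTensor([0, 2, 1, 2])

# Built-in loss applied directly to the logits.
loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# The same quantity by hand: mean negative log-probability of the true class.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), targets].mean()

print(torch.allclose(loss_builtin, loss_manual))
```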

In [None]:
def get_vocab(filename, max_words=10000):
    unigram_counts=Counter()
    with open(filename) as file:    
        for line in file:
            cols=line.rstrip().split("\t")
            idd = cols[0]
            label = cols[1]
            title = cols[2]
            abstract = cols[3]
            strr="%s %s" % (title, abstract)
            words=nltk.word_tokenize(strr)

            for word in words:
                word=word.lower()
                unigram_counts[word]+=1

    vocab={}
    for k,v in unigram_counts.most_common(max_words):
        vocab[k]=len(vocab)
    return vocab
        

In [None]:
class LogisticRegressionClassifier(nn.Module):

   def __init__(self, input_dim, output_dim):
      super().__init__()
      self.linear = torch.nn.Linear(input_dim, output_dim)
 
    
   def forward(self, input): 
      x1 = self.linear(input)
      return x1

   def evaluate(self, x, y):

      self.eval()
      corr = 0.
      total = 0.
      with torch.no_grad():
        for x, y in zip(x, y):
          y_preds=self.forward(x)
          for idx, y_pred in enumerate(y_preds):
              prediction=torch.argmax(y_pred)
              if prediction == y[idx]:
                corr += 1.
              total+=1                          
      return corr/total



## Average Embedding Representation
Let's train a logistic regression classifier where the input is the average GloVe embedding for all words in a paper's title and abstract.
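
As a small illustration of the averaging step, the sketch below uses a made-up three-word vocabulary with 2-dimensional vectors. `toy_vocab`, `toy_embs`, and `tokens` are hypothetical, and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

toy_vocab = {"neural": 0, "machine": 1, "translation": 2}
toy_embs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
tokens = ["neural", "machine", "translation", "zzz"]   # "zzz" is out of vocabulary

# Collect vectors for known words only, then average them.
known = [toy_embs[toy_vocab[t]] for t in tokens if t in toy_vocab]
avg_emb = np.mean(known, axis=0) if known else np.zeros(toy_embs.shape[1])

print(avg_emb)  # [1. 1.]
```

The fallback to a zero vector covers the edge case of a document with no known words, which would otherwise cause a division by zero.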

In [None]:
def read_glove_data(filename, vocab, embs):
    data=[]
    data_labels=[]
    with open(filename) as file:
        for line in file:
            avg_emb=np.zeros(50)
            cols=line.rstrip().split("\t")
            idd = cols[0]
            label = cols[1]
            title = cols[2]
            abstract = cols[3]
            strr="%s %s" % (title, abstract)
            words=nltk.word_tokenize(strr)
            avg_counter = 0.
            for word in words:
                word=word.lower()
                if word in vocab:
                    avg_emb += embs[vocab[word]].numpy()
                    avg_counter += 1.
            # guard against documents with no in-vocabulary words
            if avg_counter > 0:
                avg_emb /= avg_counter

            data.append(avg_emb)
            data_labels.append(labels[label])
    return data, data_labels 

In [None]:
embs, glove_vocab = read_embeddings("glove.6B.50d.50K.txt")
avg_train_x, avg_train_y=read_glove_data(trainingFile, glove_vocab, embs)
avg_dev_x, avg_dev_y=read_glove_data(devFile, glove_vocab, embs)

In [None]:
avg_trainX, avg_trainY=get_batches(avg_train_x, avg_train_y, xType=torch.FloatTensor)
avg_devX, avg_devY=get_batches(avg_dev_x, avg_dev_y, xType=torch.FloatTensor)

In [None]:
logreg=LogisticRegressionClassifier(50, len(labels))
optimizer = torch.optim.Adam(logreg.parameters(), lr=0.001, weight_decay=1e-5)
losses = []
cross_entropy=nn.CrossEntropyLoss()

num_labels=len(labels)

for epoch in range(200):
    logreg.train()
    
    for x, y in zip(avg_trainX, avg_trainY):
        y_pred=logreg.forward(x)
        loss = cross_entropy(y_pred.view(-1, num_labels), y.view(-1))
        losses.append(loss.item())  # store a plain float, not the tensor and its graph
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    dev_accuracy=logreg.evaluate(avg_devX, avg_devY)
    if epoch % 5 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))


## BOW Representation
Feel free to fill in your bag-of-words implementation in `read_bow_data()` to see how the logistic regression classifier works with a different featurization. (You are not required to do anything within this BOW representation section; we provide the structure in case you'd like to explore how your BOW logistic regression model from the last homework could be implemented in PyTorch.)

In [None]:
def read_bow_data(filename, vocab):
    data=[]
    data_labels=[]
    with open(filename) as file:
        for line in file:
            cols=line.rstrip().split("\t")
            idd = cols[0]
            label = cols[1]
            title = cols[2]
            abstract = cols[3]
            strr="%s %s" % (title, abstract)
            bow=np.zeros(len(vocab))

            '''
            Insert your bow code here to store the featurization in the bow variable 
            
            '''

            data.append(bow)

            data_labels.append(labels[label])
    return data, data_labels 
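
For orientation, here is one possible shape a count-based bag-of-words featurization could take, shown on a toy vocabulary. This is a sketch under stated assumptions, not the expected solution: `toy_vocab` and `text` are made up, and a whitespace split stands in for `nltk.word_tokenize`.

```python
import numpy as np

toy_vocab = {"parsing": 0, "neural": 1, "the": 2}
text = "The neural parser improves neural parsing"

bow = np.zeros(len(toy_vocab))
for word in text.lower().split():     # whitespace split stands in for nltk.word_tokenize
    if word in toy_vocab:
        bow[toy_vocab[word]] += 1     # count features; assign 1 instead for binary features

print(bow)  # [1. 2. 1.]
```

Whether counts or binary indicators work better is an empirical question you can test on the development data.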



In [None]:
bow_vocab=get_vocab(trainingFile)
bow_train_x, bow_train_y=read_bow_data(trainingFile, bow_vocab)
bow_dev_x, bow_dev_y=read_bow_data(devFile, bow_vocab)

In [None]:
bow_trainX, bow_trainY=get_batches(bow_train_x, bow_train_y, xType=torch.FloatTensor)
bow_devX, bow_devY=get_batches(bow_dev_x, bow_dev_y, xType=torch.FloatTensor)

In [None]:
logreg=LogisticRegressionClassifier(len(bow_vocab), len(labels))
optimizer = torch.optim.Adam(logreg.parameters(), lr=0.001, weight_decay=1e-5)
losses = []
cross_entropy=nn.CrossEntropyLoss()

num_labels=len(labels)

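for epoch in range(200):
    logreg.train()  # evaluate() switches to eval mode, so switch back each epoch

    for x, y in zip(bow_trainX, bow_trainY):
        y_pred=logreg.forward(x)
        loss = cross_entropy(y_pred.view(-1, num_labels), y.view(-1))
        losses.append(loss.item())  # store a plain float, not the tensor and its graph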
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    dev_accuracy=logreg.evaluate(bow_devX, bow_devY)
    if epoch % 5 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))


# Deliverable 1. CNN 

Now let's create our CNN.

In [None]:
def read_data(filename, vocab, labels):
    """
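    :param filename: the name of the file
    :param vocab: dictionary mapping word (str) to index (int)
    :param labels: dictionary mapping label name (str) to label id (int)
    :return: (data, data_labels), where data is a list of padded word-index lists
    and data_labels is the list of corresponding label ids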
    """    
    data = []
    data_labels = []
    file = open(filename)
    for line in file:
        cols = line.split("\t")
        idd = cols[0]
        label = cols[1]
        title = cols[2]
        abstract = cols[3]
        w_int = []
        for w in nltk.word_tokenize(title.lower()):
            if w in vocab:
                w_int.append(vocab[w])
            else:
                w_int.append(UNKNOWN_INDEX)
        w_int.append(SEP_INDEX)
        w_int.append(SEP_INDEX)
        for w in nltk.word_tokenize(abstract.lower()):
            if w in vocab:
                w_int.append(vocab[w])
            else:
                w_int.append(UNKNOWN_INDEX)
        data_lens.append(len(w_int))
        if len(w_int) < 549:
            w_int.extend([PAD_INDEX] * (549 - len(w_int)))
        if len(w_int) < 550:
          data.append((w_int))
          data_labels.append(labels[label])
    file.close()
    return data, data_labels
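
The length handling at the end of `read_data` can be summarized in a few lines: sequences shorter than 549 tokens are padded with `PAD_INDEX`, and sequences longer than 549 are dropped. The helper `pad_or_drop` below is hypothetical and exists only to illustrate that behavior.

```python
PAD_INDEX = 0
MAX_LEN = 549   # the fixed length used in read_data


def pad_or_drop(w_int, max_len=MAX_LEN):
    """Hypothetical helper: pad short sequences, return None for over-long ones."""
    if len(w_int) > max_len:
        return None
    return w_int + [PAD_INDEX] * (max_len - len(w_int))


short = pad_or_drop([5, 6, 7])
exact = pad_or_drop(list(range(549)))
too_long = pad_or_drop(list(range(600)))

print(len(short), len(exact), too_long)  # 549 549 None
```

Dropping long documents means the number of examples returned can be smaller than the number of lines in the file.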

In [None]:
embs, cnn_vocab = read_embeddings("glove.6B.50d.50K.txt")

In [None]:
cnn_train_x, cnn_train_y = read_data(trainingFile, cnn_vocab, labels)
cnn_dev_x, cnn_dev_y = read_data(devFile, cnn_vocab, labels)


In [None]:
cnn_trainX, cnn_trainY=get_batches(cnn_train_x, cnn_train_y, torch.LongTensor)
cnn_devX, cnn_devY=get_batches(cnn_dev_x, cnn_dev_y, torch.LongTensor)


In [None]:
class CNNClassifier(nn.Module):

   def __init__(self, params, pretrained_embeddings):
      super().__init__()
      self.seq_len = params["max_seq_len"]
      self.num_labels = params["label_length"]
      
      '''
      Initialize the following layers according to the hw spec
      '''
      self.embeddings = ...

      # convolution over 1 word
      self.conv_1 = nn.Conv1d(...)
      self.pool_1 = nn.MaxPool1d(...)

      # convolution over 2 words    
      self.conv_2 = nn.Conv1d(...)
      self.pool_2 = nn.MaxPool1d(...)
        
      # convolution over 3 words
      self.conv_3 = nn.Conv1d(...)
      self.pool_3 = nn.MaxPool1d(...)
        
      self.fc = ...


    
   def forward(self, input): 
      #embeds the input sequences
      x0 = self.embeddings(input)
      #changes dimensions to be consistent with conv1d
      x0 = x0.permute(0, 2, 1)

      '''
      Create the hidden representations according to the hw spec
      '''
      #Apply the one-word convolution, tanh, and pool
      x1 = ...
    
      #Apply the two-word convolution, tanh, and pool
      x2 = ...
        
      #Apply the three-word convolution, tanh, and pool
      x3 = ...

      #Concatenates the output of all 3 convolution layers
      combined=...

      #Connects the combined output to the fully-connected layer
      out = ...
      return out.squeeze()

   def evaluate(self, x, y):
      
      self.eval()
      corr = 0.
      total = 0.

      with torch.no_grad():

        for x, y in zip(x, y):
          y_preds=self.forward(x)
          for idx, y_pred in enumerate(y_preds):
              prediction=torch.argmax(y_pred)
              if prediction == y[idx]:
                corr += 1.
              total+=1                          
      return corr/total
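
If the tensor shapes in `forward()` are unclear, the following sketch traces them through a single convolution and pooling step on random data. It is not the layer configuration from the hw spec: `n_filters = 8` and pooling over the whole sequence are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

batch_size, emb_dim, seq_len = 4, 50, 549
n_filters = 8                                    # hypothetical; not taken from the hw spec

x = torch.randn(batch_size, seq_len, emb_dim)    # shape an embedding layer would return
x = x.permute(0, 2, 1)                           # -> (batch, emb_dim, seq_len) for Conv1d

# A width-2 convolution sees 2 adjacent words at a time, so the output is 1 shorter.
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters, kernel_size=2)
h = torch.tanh(conv(x))                          # -> (batch, n_filters, seq_len - 1)

# Pooling with a kernel as wide as the sequence keeps one max value per filter.
pool = nn.MaxPool1d(kernel_size=h.shape[2])
pooled = pool(h)                                 # -> (batch, n_filters, 1)

print(tuple(h.shape), tuple(pooled.shape))       # (4, 8, 548) (4, 8, 1)
```

`nn.Conv1d` treats the second dimension as channels, which is why the embedded input is permuted before the convolution.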



In [None]:
embs, cnn_vocab = read_embeddings("glove.6B.50d.50K.txt")
cnnmodel = CNNClassifier(params={"max_seq_len": 549, "label_length": len(labels)}, pretrained_embeddings=embs)
optimizer = torch.optim.Adam(cnnmodel.parameters(), lr=0.001, weight_decay=1e-5)
losses = []
cross_entropy=nn.CrossEntropyLoss()

num_epochs=25
best_dev_acc = 0.

for epoch in range(num_epochs):
    cnnmodel.train()

    for x, y in zip(cnn_trainX, cnn_trainY):
      y_pred = cnnmodel.forward(x)
      loss = cross_entropy(y_pred.view(-1, cnnmodel.num_labels), y.view(-1))
      losses.append(loss.item())  # store a plain float so the loss can be plotted later
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
    dev_accuracy=cnnmodel.evaluate(cnn_devX, cnn_devY)
    if epoch % 1 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
        if dev_accuracy > best_dev_acc:
          torch.save(cnnmodel.state_dict(), 'best-cnnmodel-parameters.pt')
          best_dev_acc = dev_accuracy

cnnmodel.load_state_dict(torch.load('best-cnnmodel-parameters.pt'))
print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))

# Model Exploration

## Loss Examination
To debug your model and ensure it is updating correctly, it may be helpful to visualize your training loss. The following code plots the loss recorded after each training batch (one point per batch, across all epochs). The loss should decrease as the model trains and eventually converge. If your training loss is not decreasing, you might not be initializing your model or writing your forward() pass correctly.

In [None]:
import matplotlib.pyplot as plt
plt.plot(range(len(losses)), losses)
plt.title("Training Loss over Time")
plt.show()
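
Because `losses` gains one entry per batch, the raw curve can be noisy. One option is to average the per-batch values within each epoch before plotting. The sketch below does this on made-up numbers; `batch_losses` and `batches_per_epoch` are hypothetical stand-ins (in this notebook the latter would correspond to `len(cnn_trainX)`).

```python
import numpy as np

# Made-up per-batch losses: 3 epochs of 4 batches each.
batch_losses = [2.0, 1.8, 1.9, 1.7, 1.5, 1.4, 1.6, 1.5, 1.2, 1.1, 1.3, 1.2]
batches_per_epoch = 4

# One row per epoch, then the mean of each row.
epoch_losses = np.array(batch_losses).reshape(-1, batches_per_epoch).mean(axis=1)

print(epoch_losses)
```

The reshape assumes every epoch has the same number of batches, which holds here because the batches are built once and reused.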

# Deliverable 2: Explore NLP articles
Now that you have your CNN trained, let's go ahead and make predictions for all of the 7,188 abstracts in our full dataset of NLP papers published between 2013 and 2020.

In [None]:
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.all.tsv

In [None]:
def read_prediction_data(filename, vocab):
    """
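    :param filename: the name of the file
    :param vocab: dictionary mapping word (str) to index (int)
    :return: (data, data_dates), where data is a list of padded word-index lists
    and data_dates is the list of corresponding publication years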
    """    
    data = []
    data_dates = []
    file = open(filename)
    for line in file:
        cols = line.split("\t")
        idd = cols[0]
        year = int(cols[1])
        title = cols[2]
        abstract = cols[3]
        w_int = []
        for w in nltk.word_tokenize(title.lower()):
            # map unknown words to UNKNOWN_INDEX
            if w in vocab:
                w_int.append(vocab[w])
            else:
                w_int.append(UNKNOWN_INDEX)
        w_int.append(SEP_INDEX)
        w_int.append(SEP_INDEX)
        for w in nltk.word_tokenize(abstract.lower()):
            # map unknown words to UNKNOWN_INDEX
            if w in vocab:
                w_int.append(vocab[w])
            else:
                w_int.append(UNKNOWN_INDEX)
        data_lens.append(len(w_int))
        if len(w_int) < 549:
            w_int.extend([PAD_INDEX] * (549 - len(w_int)))
        if len(w_int) < 550:
          data.append((w_int))
          data_dates.append(year)
    file.close()
    return data, data_dates

In [None]:
predictFile="acl.all.tsv"
cnn_test_x, cnn_test_dates = read_prediction_data(predictFile, cnn_vocab)
cnn_predictX, cnn_predictDates=get_batches(cnn_test_x, cnn_test_dates, torch.LongTensor, batch_size=256)

In [None]:
reverse_labels={labels[k]:k for k in labels}

Now let's make predictions on all of that data with your trained `cnnmodel`.

In [None]:
cnnmodel.eval()  # make sure the model is in inference mode before predicting
with torch.no_grad():

  all_dates=[]
  all_preds=[]
  for x, y in zip(cnn_predictX, cnn_predictDates):
    y_preds=cnnmodel.forward(x)
    for idx, y_pred in enumerate(y_preds):
        prediction=int(torch.argmax(y_pred))
        all_dates.append(int(y[idx]))
        all_preds.append(prediction)

What are the most frequent categories among our predictions?

In [None]:
from collections import Counter
cat_counts=Counter()
for val in all_preds:
  cat_counts[val]+=1

for k,v in cat_counts.most_common():
  print(v, reverse_labels[k])

Now let's plot the frequency with which any given category appears over time by aggregating those predictions by the year in which the corresponding papers were published.

In [None]:
minYear=min(all_dates)
maxYear=max(all_dates)
counts=np.zeros((maxYear-minYear+1, len(labels)))
for year, pred in zip(all_dates, all_preds):
  counts[year-minYear][pred]+=1
counts=counts/np.sum(counts,axis=1)[:, np.newaxis]
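
The aggregation above turns raw counts into per-year proportions, so each row sums to 1. A toy version with invented years and predictions makes the arithmetic easy to check; all `toy_*` names are hypothetical.

```python
import numpy as np

toy_dates = [2013, 2013, 2014, 2014, 2014, 2015]
toy_preds = [0, 1, 1, 1, 0, 2]
n_labels = 3

min_year, max_year = min(toy_dates), max(toy_dates)
toy_counts = np.zeros((max_year - min_year + 1, n_labels))
for year, pred in zip(toy_dates, toy_preds):
    toy_counts[year - min_year][pred] += 1     # row = year offset, column = label id

# Divide each row by its total so that every year sums to 1.
toy_props = toy_counts / toy_counts.sum(axis=1, keepdims=True)

print(toy_props)
```

Normalizing per year matters because the number of papers differs from year to year, so raw counts would mostly reflect the growth of the field.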

In [None]:
import matplotlib.pyplot as plt

def plot_category(cats, labels):
  for cat in cats:
    data=list(counts[:,labels[cat]])
    # use the year range found in the data rather than hard-coded years
    plt.plot(range(minYear, maxYear+1), data)
  plt.legend(cats)
  plt.show()


In [None]:
plot_category(["MT", "SA", "GENERATION", "ETHICS"], labels)

Should we trust these results as reflecting trends in the attention the ACL community gives to these topics? Think about the potential biases that might exist in this method and the results, especially given your experience in creating this dataset. Explore this model and data with whatever methods you think would help your argument (e.g., try plotting a confusion matrix over the development data to see which classes are being confused, or examine the data points that have the highest-confidence wrong predictions). How would you go about interrogating this method to know whether to trust these findings? Submit your answer to this question (fewer than 200 words) as a PDF on Gradescope.
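
As a starting point for the confusion-matrix suggestion, the sketch below builds one from lists of gold and predicted label ids. The helper `confusion_matrix` and the toy labels are assumptions for illustration; with the real data you would pass the development labels and the model's predictions.

```python
import numpy as np


def confusion_matrix(gold, pred, n_labels):
    """Hypothetical helper: rows are gold labels, columns are predicted labels."""
    cm = np.zeros((n_labels, n_labels), dtype=int)
    for g, p in zip(gold, pred):
        cm[g, p] += 1
    return cm


toy_gold = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(toy_gold, toy_pred, n_labels=3)

print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```

Off-diagonal cells show which pairs of classes are confused, and the diagonal divided by each row sum gives per-class recall.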
