<a href="https://colab.research.google.com/github/sunnyshen321/info159/blob/main/HW_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SETUP

In [5]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
[K     |▏                               | 10kB 27.0MB/s eta 0:00:01[K     |▍                               | 20kB 34.1MB/s eta 0:00:01[K     |▌                               | 30kB 23.2MB/s eta 0:00:01[K     |▊                               | 40kB 26.9MB/s eta 0:00:01[K     |▉                               | 51kB 25.8MB/s eta 0:00:01[K     |█                               | 61kB 28.4MB/s eta 0:00:01[K     |█▏                              | 71kB 25.0MB/s eta 0:00:01[K     |█▍                              | 81kB 26.3MB/s eta 0:00:01[K     |█▌                              | 92kB 22.9MB/s eta 0:00:01[K     |█▊                              | 102kB 23.5MB/s eta 0:00:01[K     |██                              | 112kB 23.5MB/s eta 0:00:01[K     |██                              | 

In [6]:
import torch
import torch.nn as nn
import transformers
import sys, argparse
import numpy as np
import re
import random
from tqdm import tqdm
from collections import Counter
from transformers import BertTokenizer, BertModel
from collections import Counter

#Sets random seeds for reproducibility
seed=0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

In [7]:
print(torch.__version__)
print(transformers.__version__)

1.7.0+cu101
4.3.3


# **IMPORTANT**: GPU is not enabled by default

You must switch runtime environments if your output of the next block of code has an error saying "ValueError: Expected a cuda device, but got: cpu"

Go to Runtime > Change runtime type > Hardware accelerator > GPU

In [8]:
# if this cell prints "Running on cpu", you must switch runtime environments
# go to Runtime > Change runtime type > Hardware accelerator > GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))

Running on cuda


# BERT Classification Example (Nothing for you to Implement here)

Before you implement anything, here's an example of a classification model using BERT and the [Transformers](https://huggingface.co/transformers/) python library from Huggingface. This model is trained using the data you annotated in Homework 1 to do the same topic classification you did in Homework 3. Note that it gets a higher accuracy score then either the CNN or logistic regression model that we tried in Homework 3. BERT tops out here at an accuracy of around 0.637 on the dev data.

Running the cells below will train this BERT-based classifier - this takes a while, so feel free to stop it running.

In [9]:
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.train
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.dev

--2021-02-25 02:26:52--  https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1027009 (1003K) [text/plain]
Saving to: ‘acl.train’


2021-02-25 02:26:52 (37.2 MB/s) - ‘acl.train’ saved [1027009/1027009]

--2021-02-25 02:26:52--  https://raw.githubusercontent.com/dbamman/nlp21/main/HW3/acl.dev
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 359826 (351K) [text/plain]
Saving to: ‘acl.dev’


2021-02-25 02:26:52 (39.6 MB/s) - ‘acl.dev’ saved [359826/359826]



In [10]:
trainingFile = "acl.train"
devFile = "acl.dev"

labels = {'APPLICATIONS': 11,
 'CSSCA': 23,
 'DIALOGUE': 12,
 'DISCOURSE': 13,
 'ETHICS': 8,
 'GENERATION': 9,
 'GREEN': 15,
 'GROUNDING': 18,
 'IE': 6,
 'INTERPRET': 10,
 'IR': 22,
 'LEXSEM': 7,
 'LING': 24,
 'MLCLASS': 1,
 'MLLM': 16,
 'MT': 4,
 'MULTILING': 3,
 'OTHER': 25,
 'PHON': 5,
 'QA': 17,
 'RESOURCES': 14,
 'SA': 21,
 'SENTSEM': 0,
 'SPEECH': 19,
 'SUMM': 2,
 'SYNTAX': 20}

def read_acl_data(filename, labels):
 
    data = []
    data_labels = []
    file = open(filename)
    for line in file:
        cols = line.split("\t")
        idd = cols[0]
        label = cols[1]
        title = cols[2]
        abstract = cols[3]

        data.append("%s %s" % (title, abstract))
        data_labels.append(labels[label])
        
    file.close()
    return data, data_labels

In [11]:
class BERTClassifier(nn.Module):

   def __init__(self, params):
      super().__init__()
        
      self.tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False, do_basic_tokenize=False)
      self.bert = BertModel.from_pretrained("bert-base-cased")
        
      self.num_labels = params["label_length"]

      self.fc = nn.Linear(768, self.num_labels)

   def get_batches(self, all_x, all_y, batch_size=32, max_toks=256):
            
      """ Get batches for input x, y data, with data tokenized according to the BERT tokenizer 
      (and limited to a maximum number of WordPiece tokens """

      batches_x=[]
      batches_y=[]
        
      for i in range(0, len(all_x), batch_size):

            current_batch=[]

            x=all_x[i:i+batch_size]

            batch_x = self.tokenizer(x, padding=True, truncation=True, return_tensors="pt", max_length=max_toks)
            batch_y=all_y[i:i+batch_size]

            batches_x.append(batch_x.to(device))
            batches_y.append(torch.LongTensor(batch_y).to(device))
            
      return batches_x, batches_y
  

   def forward(self, batch_x): 
    
      bert_output = self.bert(input_ids=batch_x["input_ids"],
                         attention_mask=batch_x["attention_mask"],
                         token_type_ids=batch_x["token_type_ids"],
                         output_hidden_states=True)

      # We're going to represent an entire document just by its [CLS] embedding (at position 0)
      # And use the *last* layer output (layer -1)
      # as a result of this choice, this embedding will be optimized for this purpose during the training process.
      
      bert_hidden_states = bert_output['hidden_states']

      out = bert_hidden_states[-1][:,0,:]

      out = self.fc(out)

      return out.squeeze()

   def evaluate(self, batch_x, batch_y):
      
      self.eval()
      corr = 0.
      total = 0.

      with torch.no_grad():

         for x, y in zip(batch_x, batch_y):
            y_preds = self.forward(x)
            for idx, y_pred in enumerate(y_preds):
              prediction=torch.argmax(y_pred)
              if prediction == y[idx]:
                corr += 1.
              total+=1                          
      return corr/total


In [12]:
train_x, train_y = read_acl_data(trainingFile, labels)
dev_x, dev_y = read_acl_data(devFile, labels)

In [9]:
bert_model = BERTClassifier(params={"label_length": len(labels)})
bert_model.to(device)

batch_x, batch_y = bert_model.get_batches(train_x, train_y)
dev_batch_x, dev_batch_y = bert_model.get_batches(dev_x, dev_y)

optimizer = torch.optim.Adam(bert_model.parameters(), lr=1e-5)
cross_entropy=nn.CrossEntropyLoss()

num_epochs=30
best_dev_acc = 0.

for epoch in range(num_epochs):
    bert_model.train()

    # Train
    for x, y in zip(batch_x, batch_y):
      y_pred = bert_model.forward(x)
      loss = cross_entropy(y_pred.view(-1, bert_model.num_labels), y.view(-1))
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
    
    # Evaluate
    dev_accuracy=bert_model.evaluate(dev_batch_x, dev_batch_y)
    if epoch % 1 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
        if dev_accuracy > best_dev_acc:
          torch.save(bert_model.state_dict(), 'best-model-parameters.pt')
          best_dev_acc = dev_accuracy

bert_model.load_state_dict(torch.load('best-model-parameters.pt'))
print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=213450.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=435779157.0, style=ProgressStyle(descri…


Epoch 0, dev accuracy: 0.117
Epoch 1, dev accuracy: 0.177
Epoch 2, dev accuracy: 0.411
Epoch 3, dev accuracy: 0.480
Epoch 4, dev accuracy: 0.509
Epoch 5, dev accuracy: 0.546
Epoch 6, dev accuracy: 0.566
Epoch 7, dev accuracy: 0.566
Epoch 8, dev accuracy: 0.580
Epoch 9, dev accuracy: 0.603
Epoch 10, dev accuracy: 0.606
Epoch 11, dev accuracy: 0.614
Epoch 12, dev accuracy: 0.637
Epoch 13, dev accuracy: 0.617
Epoch 14, dev accuracy: 0.620
Epoch 15, dev accuracy: 0.614
Epoch 16, dev accuracy: 0.617
Epoch 17, dev accuracy: 0.623
Epoch 18, dev accuracy: 0.626
Epoch 19, dev accuracy: 0.631
Epoch 20, dev accuracy: 0.617
Epoch 21, dev accuracy: 0.623
Epoch 22, dev accuracy: 0.626
Epoch 23, dev accuracy: 0.637
Epoch 24, dev accuracy: 0.634
Epoch 25, dev accuracy: 0.626
Epoch 26, dev accuracy: 0.631
Epoch 27, dev accuracy: 0.614
Epoch 28, dev accuracy: 0.629
Epoch 29, dev accuracy: 0.614

Best Performing Model achieves dev accuracy of : 0.637


# OBJECT NUMBER Probe

Now let's move on to "Probing" the representations in BERT's layers. We'll explore this using a simple task called "Object Number", which tries to predict whether the direct object of the main verb in a sentence is singular (NN, label=1) or plural (NNS, label=0).  For more on this probe, see Conneau et al. (2018), [What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties](https://arxiv.org/pdf/1805.01070.pdf).

In [13]:
# download the data
!wget https://raw.githubusercontent.com/facebookresearch/SentEval/master/data/probing/obj_number.txt

--2021-02-25 02:27:02--  https://raw.githubusercontent.com/facebookresearch/SentEval/master/data/probing/obj_number.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8947485 (8.5M) [text/plain]
Saving to: ‘obj_number.txt’


2021-02-25 02:27:03 (140 MB/s) - ‘obj_number.txt’ saved [8947485/8947485]



In [14]:
def read_probe_data(filename):
    labels={"NNS":0, "NN":1}

    train=[]
    dev=[]

    with open(filename) as file:
      for line in file:
          cols = line.split("\t")
          split = cols[0]
          label = cols[1]
          text = cols[2].rstrip()

          if split == "tr":
            train.append((text, labels[label]))
          elif split == "va":
            dev.append((text, labels[label]))

    np.random.shuffle(train)
    np.random.shuffle(dev)

    train_x = []
    train_y = []
    
    dev_x = []
    dev_y = []

    for text, label in train[:2000]:
      train_x.append(text)
      train_y.append(label)
    
    for text, label in dev[:2000]:
      dev_x.append(text)
      dev_y.append(label)

    return train_x, train_y, dev_x, dev_y

In [15]:
#Sets random seeds for reproducibility
seed=0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

# Load the data for the probe
probe_train_x, probe_train_y, probe_dev_x, probe_dev_y=read_probe_data("obj_number.txt")

print("Label(y)\tSentence(x)")
print("----------------------------------------------------------------------------------------")
for i in range(5):
  print("%s\t\t%s" % (probe_train_y[i], probe_train_x[i]))

Label(y)	Sentence(x)
----------------------------------------------------------------------------------------
1		I untied my apron , pulled it off , and tossed it onto the counter .
0		Sally handled her chopsticks like an expert as she popped a piece of fried Calamari into her mouth .
0		I continue my chores with a spring in my step .
1		She had found a large central staircase , and moved up it unerringly .
1		Because those lungs would require oxygen every day until I died .


## Baseline

Since we are working with a new dataset, let's implement a simple baseline to give us some context for how well we can expect our probe to perform. A good baseline to try first is to always predict the most common label in the training data:

In [16]:
counts=Counter()
for l in probe_train_y:
  counts[l]+=1

most_common=counts.most_common(1)[0][0]

cor=tot=0.
for l in probe_dev_y:
  if l == most_common:
    cor+=1
  tot+=1

print("Baseline accuracy: %.3f" % (cor/tot))


Baseline accuracy: 0.504


## Implementing the Probe

Your job in this homework is to implement a probe for object number: for each of the 12 layers in BERT, train a classifier to predict whether the direct object of the main verb in an input sentence is a singular or plural noun. We provide a copy of the BERT classifier below; you will need to make some minor changes to it.  Keep in mind that a probe is designed to uncover what BERT has learned about linguistic structure only given its pretraining as a language model; your model cannot update the BERT parameters.

You may find it useful to refer to Huggingface's documentation on BERT here: https://huggingface.co/transformers/model_doc/bert.html.

In [23]:
class BERTLayerClassifier(nn.Module):

   def __init__(self, layer_id, num_labels):
      #* Do *not* change this function *
      super().__init__()
        
      self.tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False, do_basic_tokenize=False)
      self.bert = BertModel.from_pretrained("bert-base-cased")
      
      self.layer_id = layer_id
      self.num_labels = num_labels

      self.fc = nn.Linear(768, self.num_labels)

      #####
      # Since we're probing what BERT has learned from its language modeling objective,
      # we need to *not* update the BERT parameters during training.
      #
      # Do *not* change this
      #
      for param in self.bert.parameters():
        param.requires_grad = False
      #
      ####
      self.train()

   def get_batches(self, all_x, all_y, batch_size=32, max_toks=256):
            
      """ Get batches for input x, y data, with data tokenized according to the BERT tokenizer 
      (and limited to a maximum number of WordPiece tokens 
      * Do *not* change this function *
      """

      batches_x=[]
      batches_y=[]
        
      for i in range(0, len(all_x), batch_size):

            current_batch=[]

            x=all_x[i:i+batch_size]

            batch_x = self.tokenizer(x, padding=True, truncation=True, return_tensors="pt", max_length=max_toks)
            batch_y=all_y[i:i+batch_size]

            batches_x.append(batch_x.to(device))
            batches_y.append(torch.LongTensor(batch_y).to(device))
            
      return batches_x, batches_y
  

   def forward(self, batch_x): 
      bert_output = self.bert(input_ids=batch_x["input_ids"],
                         attention_mask=batch_x["attention_mask"],
                         token_type_ids=batch_x["token_type_ids"],
                         output_hidden_states=True)
      
      bert_hidden_states = bert_output['hidden_states']

      """ Insert your code here """
      bert_layer_output = bert_hidden_states[self.layer_id][:,0,:]
      # for i in range(len(bert_hidden_states)):
      #   bert_layer_output.append(bert_hidden_states[i][:,0,:])
      """ Insert your code here """

      out = self.fc(bert_layer_output)

      return out.squeeze()

   def evaluate(self, batch_x, batch_y):
      #* Do *not* change this function *
      
      self.eval()
      corr = 0.
      total = 0.

      with torch.no_grad():

         for x, y in zip(batch_x, batch_y):
            y_preds = self.forward(x)
            for idx, y_pred in enumerate(y_preds):
              prediction=torch.argmax(y_pred)
              if prediction == y[idx]:
                corr += 1.
              total+=1
      self.train()                    
      return corr/total

Aside from any change you make to the BERTLayerClassifier module, the rest of your code should be implemented in the `runProbes` function below.

Within the `runProbes` function, you'll need to:

* Create one classifier for each of BERT's 12 layers.

* Train each classifier for ***5*** epochs using the training data and labels for the Object Number probe that have been loaded above.

* Evaluate the classifier using the dev data and labels for the Object Number probe.

* Return your results on the dev data in the dictionary format specified in `runProbes'.

The output of `runProbes` should be a dictionary of 12 accuracies, one for each layer, for layers 0 to 11.

When you're finished, you should be able to give an answer to the following question: What layer in BERT is most encoding information on syntactic objecthood?


**Tips**:

* Set your learning rate to 0.01.  When updating the BERT parameters for a task, the learning rate should be set to a small number (e.g., 1e-5, as in the ACL classification example above); but when you're *not* updating BERT (as is the case here), you should use a larger learning rate (such as 0.01, 0.001, etc.):

    `optimizer = torch.optim.Adam(<your_model>.parameters(), lr=0.01)`

* You should be training *separate* classifiers for each layer; make sure the updates you make to the parameters for a classifier on one layer while training do not affect the parameters for another layer.

* BERT learns representations for each token in it's input. Under the hood, the `BertTokenizer` object used in `BERTLayerClassifier` adds special tokens to the beginning and end of each input, such that the sentence `Sally handled her chopsticks` is transformed into `[CLS] Sally handled her chopsticks [SEP]`. When using BERT to make sentence-level classifications, the [CLS] token is often treated as a representation of the entire sentence. Your classifier for layer ***L*** should only use the "[CLS]" token representation from layer ***L*** for making a prediction.  

* Remember to tell pytorch to use the gpu for your models with the `.to(device)` function. Without using the gpu your code will run much more slowly!  

* You'll want to use the functions `get_batches` and `evaluate` that have been implemented in `BERTLayerClassifier`.


In [24]:
def runProbes():
  """
  Your function should return the `dev_accuracy_per_layer' dictionary.
  This dictionary should be filled with 12 entries.
  The keys of this dictionary should be the numbers 0 through 11, each of which
  represents the ID of one of the 12 BERT layers.
  For each BERT layer, update this dictionary (initial values are set to 0) 
  with your classification accuracy on the dev data for the Object Number task
  using the corresponding BERT layer.
  """
  dev_accuracy_per_layer = {0:0., 1:0., 2:0., 3:0., 4:0., 5:0.,
                            6:0., 7:0., 8:0., 9:0., 10:0., 11:0.,}

  """ Insert your code here """

  for layer_id in range(12):
    bert_layer = BERTLayerClassifier(layer_id, len(labels))
    bert_layer.to(device)

    batch_x, batch_y = bert_layer.get_batches(probe_train_x, probe_train_y)
    dev_batch_x, dev_batch_y = bert_layer.get_batches(probe_dev_x, probe_dev_y)
    optimizer = torch.optim.Adam(bert_layer.parameters(), lr=0.01)
    cross_entropy=nn.CrossEntropyLoss()

    num_epochs = 5
    for epoch in range(num_epochs):
      bert_layer.train()

      #Train
      for x, y in zip(batch_x, batch_y):
        y_pred = bert_layer.forward(x)
        loss = cross_entropy(y_pred.view(-1, bert_layer.num_labels), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

      #Evaluate
      dev_accuracy=bert_layer.evaluate(dev_batch_x, dev_batch_y)
      if epoch % 1 == 0:
          print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
          if dev_accuracy > dev_accuracy_per_layer[layer_id]:
            torch.save(bert_layer.state_dict(), 'best-model-parameters.pt')
            dev_accuracy_per_layer[layer_id] = dev_accuracy

  return dev_accuracy_per_layer

In [25]:
# Execute this cell to run your probes and save your results.
dev_accuracy_per_layer = runProbes()

Epoch 0, dev accuracy: 0.495
Epoch 1, dev accuracy: 0.495
Epoch 2, dev accuracy: 0.495
Epoch 3, dev accuracy: 0.495
Epoch 4, dev accuracy: 0.495
Epoch 0, dev accuracy: 0.611
Epoch 1, dev accuracy: 0.613
Epoch 2, dev accuracy: 0.634
Epoch 3, dev accuracy: 0.639
Epoch 4, dev accuracy: 0.644
Epoch 0, dev accuracy: 0.618
Epoch 1, dev accuracy: 0.665
Epoch 2, dev accuracy: 0.615
Epoch 3, dev accuracy: 0.704
Epoch 4, dev accuracy: 0.700
Epoch 0, dev accuracy: 0.619
Epoch 1, dev accuracy: 0.650
Epoch 2, dev accuracy: 0.670
Epoch 3, dev accuracy: 0.674
Epoch 4, dev accuracy: 0.684
Epoch 0, dev accuracy: 0.642
Epoch 1, dev accuracy: 0.657
Epoch 2, dev accuracy: 0.676
Epoch 3, dev accuracy: 0.700
Epoch 4, dev accuracy: 0.697
Epoch 0, dev accuracy: 0.678
Epoch 1, dev accuracy: 0.649
Epoch 2, dev accuracy: 0.705
Epoch 3, dev accuracy: 0.711
Epoch 4, dev accuracy: 0.736
Epoch 0, dev accuracy: 0.632
Epoch 1, dev accuracy: 0.676
Epoch 2, dev accuracy: 0.696
Epoch 3, dev accuracy: 0.642
Epoch 4, dev a

# Export your results

In [26]:
# Print out your accuracy for each layer here

for layer_id in dev_accuracy_per_layer:
  print(f"Accuracy for layer {layer_id}: {dev_accuracy_per_layer[layer_id]}")

Accuracy for layer 0: 0.4955
Accuracy for layer 1: 0.6445
Accuracy for layer 2: 0.7035
Accuracy for layer 3: 0.6845
Accuracy for layer 4: 0.6995
Accuracy for layer 5: 0.736
Accuracy for layer 6: 0.6955
Accuracy for layer 7: 0.685
Accuracy for layer 8: 0.7045
Accuracy for layer 9: 0.7115
Accuracy for layer 10: 0.675
Accuracy for layer 11: 0.6745


In [27]:
# Run this cell to store your accuracy for each layer to a file called 'dev_accuracies.txt'
# Your file should have 12 lines.
# When you submit your homework, upload this saved file to Gradescope along with your notebook.

def save_accuracies(dev_accuracy_per_layer, output_file):
  with open(output_file, 'w') as f:
    for layer_id in dev_accuracy_per_layer:
      f.write(f"{layer_id}\t{dev_accuracy_per_layer[layer_id]}\n")

save_accuracies(dev_accuracy_per_layer, output_file='devAccuracies.txt')