Adapted from the `BertClassification` [notebook](https://github.com/dbamman/anlp21/blob/main/9.neural/BertClassification_TODO.ipynb) in David Bamman's Fall 2021 Applied NLP course, [INFO 256](https://people.ischool.berkeley.edu/~dbamman/info256.html).

Trains several `BERT` models of different sizes on annotated `details` samples and saves the best-performing checkpoint of each size for further analysis. On the `dev` set, the most consistent model is `BERT Medium`.

Give this notebook access to the data in your ANLP21 folder so we can train and evaluate BERT on the `details` data. (Note that you are only granting this access to yourself as you execute this notebook.)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install transformers



In [3]:
from transformers import BertModel, BertTokenizer
import torch
from tqdm import tqdm
import torch.nn as nn
import numpy as np
import random
import time

Double-check that this notebook is running on the GPU (this should print "Running on cuda").

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))

Running on cuda


In [5]:
def read_labels(filename):
    labels={}
    with open(filename) as file:
        for line in file:
            cols = line.split("\t")
            label = cols[0]
            if label not in labels:
                labels[label]=len(labels)
    return labels

In [6]:
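def read_data(filename, labels, max_data_points=None):
    """
    :param filename: path to a tab-separated file with one example per line (label, then text)
    :param labels: dict mapping each label string to its integer id
    :param max_data_points: if set, keep only this many examples after shuffling
    :return: tuple (texts, label ids), shuffled in unison
    """
    data = []
    data_labels = []
    with open(filename) as file:
        for line in file:
            # strip the trailing newline so it does not end up in the text
            cols = line.rstrip("\n").split("\t")
            label = cols[0]
            text = cols[1]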
            
            data.append(text)
            data_labels.append(labels[label])
            

    # shuffle the data
    tmp = list(zip(data, data_labels))
    random.shuffle(tmp)
    data, data_labels = zip(*tmp)
    
    if max_data_points is None:
        return data, data_labels
    
    return data[:max_data_points], data_labels[:max_data_points]

In [7]:
labels=read_labels("/content/drive/MyDrive/ANLP21/details/train.tsv")
print(labels)
assert len(labels) == 2

{'detail': 0, 'not_detail': 1}
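To make the expected file layout concrete, here is a minimal sketch of how label ids get assigned. It assumes each line of `train.tsv` holds a label and a text separated by a tab; the helper name `read_labels_from` and the sample rows are invented for illustration and are not taken from the `details` data.

```python
import io

# Hypothetical rows in the assumed "label<TAB>text" layout
sample_tsv = (
    "detail\tThe door was painted a dull green.\n"
    "not_detail\tShe left the house.\n"
    "detail\tRust flaked from the hinges.\n"
)

def read_labels_from(handle):
    """Same logic as read_labels, but reading from an open file-like object."""
    labels = {}
    for line in handle:
        label = line.split("\t")[0]
        if label not in labels:
            # ids are assigned in order of first appearance
            labels[label] = len(labels)
    return labels

sample_labels = read_labels_from(io.StringIO(sample_tsv))
print(sample_labels)
```

Because ids follow first appearance, the mapping depends on the row order of the training file, which is why the same `labels` dict is reused when reading `dev.tsv`.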


In [8]:
train_x, train_y=read_data("/content/drive/MyDrive/ANLP21/details/train.tsv", labels)

In [9]:
dev_x, dev_y=read_data("/content/drive/MyDrive/ANLP21/details/dev.tsv", labels)

In [10]:
def evaluate(model, batches_x, batches_y):
    model.eval()
    corr = 0.
    total = 0.
    with torch.no_grad():
        # distinct loop names avoid shadowing the batch lists passed in
        for x, y in zip(batches_x, batches_y):
            y_preds=model.forward(x)
            for idx, y_pred in enumerate(y_preds):
                prediction=torch.argmax(y_pred)
                if prediction == y[idx]:
                    corr += 1.
                total+=1
    return corr/total

In [11]:
class BERTClassifier(nn.Module):

    
    def __init__(self, params):
        super().__init__()
    
        self.model_name=params["model_name"]
        self.tokenizer = BertTokenizer.from_pretrained(self.model_name, do_lower_case=params["doLowerCase"], do_basic_tokenize=False)
        self.bert = BertModel.from_pretrained(self.model_name)
        
        self.num_labels = params["label_length"]

        self.fc = nn.Linear(params["embedding_size"], self.num_labels)

    def get_batches(self, all_x, all_y, batch_size=32, max_toks=256):

        """ Get batches for input x, y data, with data tokenized according to the BERT tokenizer
        (and limited to a maximum number of WordPiece tokens) """

        batches_x=[]
        batches_y=[]

        for i in range(0, len(all_x), batch_size):

            # read_data returns tuples, so convert each slice to a list of strings
            x=list(all_x[i:i+batch_size])

            batch_x = self.tokenizer(x, padding=True, truncation=True, return_tensors="pt", max_length=max_toks)
            batch_y=all_y[i:i+batch_size]

            batches_x.append(batch_x.to(device))
            batches_y.append(torch.LongTensor(batch_y).to(device))
            
        return batches_x, batches_y
  

    def forward(self, batch_x): 
    
        bert_output = self.bert(input_ids=batch_x["input_ids"],
                         attention_mask=batch_x["attention_mask"],
                         token_type_ids=batch_x["token_type_ids"],
                         output_hidden_states=True)

        bert_hidden_states = bert_output['hidden_states']

        # We're going to represent an entire document just by its [CLS] embedding (at position 0)
        out = bert_hidden_states[-1][:,0,:]

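        out = self.fc(out)

        # Return logits with shape (batch_size, num_labels); calling squeeze() here
        # would drop the batch dimension for a batch that contains a single example
        return out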

Now let's train BERT on this data. A practical note for this environment: if you encounter an out-of-memory error:

* Reset the notebook (Runtime > Factory reset runtime) and execute all cells from the beginning.
* If your `max_toks` (passed to the tokenizer as `max_length`) is high, try reducing the `batch_size` in `get_batches` above.

Even on a GPU, BERT can take a long time to train, so you might experiment first with a smaller `max_data_points` in `read_data` above before running on the full training data.
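As a rough illustration of the batch-size trade-off, the sketch below mirrors the slicing in `get_batches` (`range(0, len(all_x), batch_size)`) without touching the tokenizer. The helper name `batch_sizes` and the dataset size of 300 are assumptions made for the example.

```python
def batch_sizes(num_examples, batch_size=32):
    """Sizes of the slices produced by stepping through the data in batch_size strides."""
    return [min(batch_size, num_examples - start)
            for start in range(0, num_examples, batch_size)]

# 300 is a hypothetical dataset size chosen only for illustration
sizes_32 = batch_sizes(300, batch_size=32)
sizes_16 = batch_sizes(300, batch_size=16)
print(len(sizes_32), "batches, last one holds", sizes_32[-1], "examples")
print(len(sizes_16), "batches, last one holds", sizes_16[-1], "examples")
```

Halving `batch_size` roughly doubles the number of optimizer steps per epoch while reducing how many examples sit on the GPU at once.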

In [13]:
def train_and_evaluate(bert_model_name, model_filename, train_x, train_y, dev_x, dev_y, labels, embedding_size=768, doLowerCase=None):

  start_time=time.time()
  bert_model = BERTClassifier(params={"doLowerCase": doLowerCase, "model_name": bert_model_name, "embedding_size":embedding_size, "label_length": len(labels)})
  bert_model.to(device)

  batch_x, batch_y = bert_model.get_batches(train_x, train_y)
  dev_batch_x, dev_batch_y = bert_model.get_batches(dev_x, dev_y)

  optimizer = torch.optim.Adam(bert_model.parameters(), lr=1e-5)
  cross_entropy=nn.CrossEntropyLoss()

  num_epochs=5
  best_dev_acc = 0.

  for epoch in range(num_epochs):
      bert_model.train()

      # Train
      for x, y in tqdm(list(zip(batch_x, batch_y))):
          y_pred = bert_model.forward(x)
          loss = cross_entropy(y_pred.view(-1, bert_model.num_labels), y.view(-1))
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
      
      # Evaluate
      dev_accuracy=evaluate(bert_model, dev_batch_x, dev_batch_y)
      if epoch % 1 == 0:
          print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
          if dev_accuracy > best_dev_acc:
              torch.save(bert_model.state_dict(), model_filename)
              best_dev_acc = dev_accuracy

  bert_model.load_state_dict(torch.load(model_filename))
  torch.save(bert_model.state_dict(), "/content/drive/MyDrive/ANLP21/details/models/" + model_filename + ".pt")
  print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))
  print("Time: %.3f seconds ---" % (time.time() - start_time))
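The checkpoint rule in `train_and_evaluate` can be isolated as a small sketch. It assumes only what the loop above shows: a checkpoint is written when the dev accuracy is strictly greater than the best seen so far. The helper name `best_epoch` and the accuracy values are hypothetical.

```python
def best_epoch(dev_accuracies):
    """Index and value of the checkpoint that a strict '>' comparison would keep."""
    best_idx, best_acc = None, 0.0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best_acc:  # ties do not overwrite the earlier checkpoint
            best_idx, best_acc = epoch, acc
    return best_idx, best_acc

# Hypothetical per-epoch dev accuracies with a tie for the best value
kept = best_epoch([0.50, 0.70, 0.60, 0.70])
print(kept)
```

With a tie, the earlier epoch's weights are the ones that get reloaded and copied to Drive. If the dev accuracy never rose above 0, no checkpoint would be written and the final `torch.load` would have nothing to read.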

#### BERT Tiny
https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip

Best accuracy: .657
Training time: ~17 seconds

In [14]:
train_and_evaluate("google/bert_uncased_L-2_H-128_A-2", "details-uncased_L-2_H-128_A-2", train_x, train_y, dev_x, dev_y, labels, embedding_size=128, doLowerCase=True)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/382 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/16.9M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/bert_uncased_L-2_H-128_A-2 were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 10/10 [00:00<00:00, 13.50it/s]


Epoch 0, dev accuracy: 0.600


100%|██████████| 10/10 [00:00<00:00, 20.41it/s]


Epoch 1, dev accuracy: 0.629


100%|██████████| 10/10 [00:00<00:00, 21.54it/s]


Epoch 2, dev accuracy: 0.629


100%|██████████| 10/10 [00:00<00:00, 20.37it/s]


Epoch 3, dev accuracy: 0.657


100%|██████████| 10/10 [00:00<00:00, 20.43it/s]


Epoch 4, dev accuracy: 0.629

Best Performing Model achieves dev accuracy of : 0.657
Time: 17.178 seconds ---
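The dev accuracies in the run above move in steps of about 0.029, which hints at a small dev set. The sketch below searches for the smallest dev-set size consistent with the printed values, assuming they are formatted with `%.3f` as in `train_and_evaluate`. This is an inference from the logs, not something the notebook states, and any multiple of the result would fit equally well; the helper name `smallest_consistent_size` is hypothetical.

```python
def smallest_consistent_size(printed, max_size=500):
    """Smallest n such that every printed accuracy equals '%.3f' % (k / n) for some integer k."""
    for n in range(1, max_size + 1):
        reachable = {"%.3f" % (k / n) for k in range(n + 1)}
        if all(p in reachable for p in printed):
            return n
    return None

# Dev accuracies as printed in the run above
printed_accuracies = ["0.600", "0.629", "0.657"]
n = smallest_consistent_size(printed_accuracies)
print(n, "examples; one example is worth about %.3f accuracy" % (1 / n))
```

If the dev set really is this small, a gap of a few points between two models corresponds to only one or two examples, which is worth keeping in mind when comparing the runs.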


#### BERT Mini
https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-256_A-4.zip

Best accuracy: .829
Training time: ~11 seconds

In [15]:
train_and_evaluate("google/bert_uncased_L-4_H-256_A-4", "details-uncased_L-4_H-256_A-4", train_x, train_y, dev_x, dev_y, labels, embedding_size=256, doLowerCase=True)

Some weights of the model checkpoint at google/bert_uncased_L-4_H-256_A-4 were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 10/10 [00:01<00:00,  5.64it/s]


Epoch 0, dev accuracy: 0.571


100%|██████████| 10/10 [00:01<00:00,  6.21it/s]


Epoch 1, dev accuracy: 0.829


100%|██████████| 10/10 [00:01<00:00,  6.24it/s]


Epoch 2, dev accuracy: 0.771


100%|██████████| 10/10 [00:01<00:00,  6.25it/s]


Epoch 3, dev accuracy: 0.714


100%|██████████| 10/10 [00:01<00:00,  6.25it/s]


Epoch 4, dev accuracy: 0.686

Best Performing Model achieves dev accuracy of : 0.829
Time: 11.126 seconds ---


#### BERT Small
https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip

Best accuracy: .714
Training time: ~24 seconds

In [16]:
train_and_evaluate("google/bert_uncased_L-4_H-512_A-8", "details-uncased_L-4_H-512_A-8", train_x, train_y, dev_x, dev_y, labels, embedding_size=512, doLowerCase=True)

Some weights of the model checkpoint at google/bert_uncased_L-4_H-512_A-8 were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 10/10 [00:03<00:00,  2.59it/s]


Epoch 0, dev accuracy: 0.600


100%|██████████| 10/10 [00:03<00:00,  2.59it/s]


Epoch 1, dev accuracy: 0.714


100%|██████████| 10/10 [00:03<00:00,  2.59it/s]


Epoch 2, dev accuracy: 0.629


100%|██████████| 10/10 [00:03<00:00,  2.58it/s]


Epoch 3, dev accuracy: 0.657


100%|██████████| 10/10 [00:03<00:00,  2.58it/s]


Epoch 4, dev accuracy: 0.657

Best Performing Model achieves dev accuracy of : 0.714
Time: 24.123 seconds ---


#### BERT Medium
https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-512_A-8.zip

Best accuracy: .800
Training time: ~45 seconds

In [31]:
train_and_evaluate("google/bert_uncased_L-8_H-512_A-8", "details-uncased_L-8_H-512_A-8", train_x, train_y, dev_x, dev_y, labels, embedding_size=512, doLowerCase=True)

Some weights of the model checkpoint at google/bert_uncased_L-8_H-512_A-8 were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 10/10 [00:07<00:00,  1.26it/s]


Epoch 0, dev accuracy: 0.771


100%|██████████| 10/10 [00:07<00:00,  1.29it/s]


Epoch 1, dev accuracy: 0.629


100%|██████████| 10/10 [00:07<00:00,  1.29it/s]


Epoch 2, dev accuracy: 0.771


100%|██████████| 10/10 [00:07<00:00,  1.29it/s]


Epoch 3, dev accuracy: 0.800


100%|██████████| 10/10 [00:07<00:00,  1.28it/s]


Epoch 4, dev accuracy: 0.771

Best Performing Model achieves dev accuracy of : 0.800
Time: 45.445 seconds ---


#### BERT Base

Best accuracy: .714
Training time: ~130 seconds

In [14]:
train_and_evaluate("bert-base-cased", "details-bert-base-cased", train_x, train_y, dev_x, dev_y, labels, embedding_size=768, doLowerCase=False)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 10/10 [00:23<00:00,  2.32s/it]


Epoch 0, dev accuracy: 0.571


100%|██████████| 10/10 [00:21<00:00,  2.19s/it]


Epoch 1, dev accuracy: 0.629


100%|██████████| 10/10 [00:22<00:00,  2.21s/it]


Epoch 2, dev accuracy: 0.686


100%|██████████| 10/10 [00:22<00:00,  2.21s/it]


Epoch 3, dev accuracy: 0.714


100%|██████████| 10/10 [00:22<00:00,  2.22s/it]


Epoch 4, dev accuracy: 0.714

Best Performing Model achieves dev accuracy of : 0.714
Time: 130.274 seconds ---
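To compare the five runs side by side, the sketch below tabulates the per-epoch dev accuracies copied from the logs above and reports each model's best and mean value. Using the mean over epochs as a proxy for consistency is an assumption made here, not a criterion stated in the notebook.

```python
from statistics import mean

# Per-epoch dev accuracies copied from the training logs above
runs = {
    "BERT Tiny":   [0.600, 0.629, 0.629, 0.657, 0.629],
    "BERT Mini":   [0.571, 0.829, 0.771, 0.714, 0.686],
    "BERT Small":  [0.600, 0.714, 0.629, 0.657, 0.657],
    "BERT Medium": [0.771, 0.629, 0.771, 0.800, 0.771],
    "BERT Base":   [0.571, 0.629, 0.686, 0.714, 0.714],
}

# (best epoch accuracy, mean accuracy across epochs) per model
summary = {name: (max(accs), mean(accs)) for name, accs in runs.items()}
for name, (best, avg) in summary.items():
    print("%-12s best=%.3f  mean=%.3f" % (name, best, avg))

highest_mean = max(summary, key=lambda name: summary[name][1])
highest_peak = max(summary, key=lambda name: summary[name][0])
print("Highest mean:", highest_mean, "| highest single epoch:", highest_peak)
```

Under this reading, the model with the highest single-epoch accuracy and the model with the highest average across epochs are not the same, so the choice of criterion matters when picking a checkpoint for further analysis.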
