#### Finetuning a BERT Model for Sentiment Classification

We will take the original `BERT` model, pretrained on the masked language modeling task, and `finetune` it for `sentiment classification` on the Stanford Sentiment Treebank (SST) and CFIMDB datasets. The `pretrained` BERT model takes an input sequence of integer token ids and outputs a corresponding sequence of contextualized encoded vectors. A special `[CLS]` token is prepended to the input sequence, and the encoded output vector corresponding to this token serves as a `pooled representation` of the entire sequence. A feedforward network can then use this pooled representation to perform sentence classification. All parameters of the combined model (BERT plus the feedforward classifier) can be trained together to optimize it for the sentence classification task. This process is called `finetuning` because it adapts the pretrained parameters of BERT to the specialized task.
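To make the role of the `[CLS]` vector concrete, here is a minimal sketch of a classification head that reads the encoded vector at position 0 and maps it to class logits. The class name `ClsClassifierHead`, the dropout rate, and the choice of two classes are assumptions made for illustration, and a random tensor stands in for the BERT output so the snippet runs without downloading any weights. Only the hidden size of 768 is taken from `bert-base`.

```python
import torch
import torch.nn as nn


class ClsClassifierHead(nn.Module):
    """Feedforward classifier applied to the encoded vector at position 0 ([CLS])."""

    def __init__(self, hidden_size: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: (batch, seq_len, hidden_size), one vector per input token
        cls_vector = encoded[:, 0, :]                  # (batch, hidden_size)
        return self.linear(self.dropout(cls_vector))   # (batch, num_classes) logits


# Stand-in for the encoder output (assumption): random values, hidden size 768.
torch.manual_seed(0)
encoded = torch.randn(4, 16, 768)
labels = torch.tensor([0, 1, 1, 0])  # hypothetical binary sentiment labels

head = ClsClassifierHead(hidden_size=768, num_classes=2)
logits = head(encoded)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()  # with a real encoder attached, gradients would also reach its weights
print(logits.shape, loss.item())
```

In full finetuning, `encoded` would come from the BERT encoder and a single optimizer would update both the encoder and the head. Freezing the encoder and training only the head is a cheaper alternative, but it is not what this notebook describes.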

In [1]:
import torch
from transformers import BertTokenizer, BertModel
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import random
from tqdm import tqdm
import psutil
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mtanzids[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

#### We will use the WordPiece tokenizer and the pretrained BERT model provided by the Hugging Face `transformers` library. First, let's try out the tokenizer.

In [14]:
# load the pretrained WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# use it on a test sentence
sentence = "Yay, I'm excited to try out this BERT model from Huggingface!"
tokens_subwords = tokenizer.tokenize(sentence)
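# encode without special tokens; by default encode() also adds [CLS] and [SEP]
tokens_idx = tokenizer.encode(sentence, add_special_tokens=False)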
idx_to_tokens = tokenizer.convert_ids_to_tokens(tokens_idx)
decoded_sentence = tokenizer.decode(tokens_idx)
print("Original sentence: ", sentence)
print("Subword tokens: ", tokens_subwords)
print("Encoded sentence: ", tokens_idx)
print("Idx back to tokens: ", idx_to_tokens)
print("Decoded sentence: ", decoded_sentence)

# let's also take a look at all the special tokens
print("\nSpecial tokens with their integer id:")
special_tokens = tokenizer.all_special_tokens
for t in special_tokens:
    print(t, " <--> ", tokenizer.convert_tokens_to_ids(t))

Original sentence:  Yay, I'm excited to try out this BERT model from Huggingface!
Subword tokens:  ['ya', '##y', ',', 'i', "'", 'm', 'excited', 'to', 'try', 'out', 'this', 'bert', 'model', 'from', 'hugging', '##face', '!']
Encoded sentence:  [8038, 2100, 1010, 1045, 1005, 1049, 7568, 2000, 3046, 2041, 2023, 14324, 2944, 2013, 17662, 12172, 999]
Idx back to tokens:  ['ya', '##y', ',', 'i', "'", 'm', 'excited', 'to', 'try', 'out', 'this', 'bert', 'model', 'from', 'hugging', '##face', '!']
Decoded sentence:  yay, i'm excited to try out this bert model from huggingface!

Special tokens with their integer id:
[PAD]  <-->  0
[CLS]  <-->  101
[SEP]  <-->  102
[MASK]  <-->  103
[UNK]  <-->  100
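The special-token ids printed above are what turn a raw id sequence into model input. The sketch below shows one way to wrap each sequence in `[CLS]` and `[SEP]`, pad the batch to a common length with `[PAD]`, and build the matching attention mask. The helper `build_batch` is hypothetical and written in plain Python for illustration. In practice, calling the tokenizer directly on a list of sentences with `padding=True` should produce the same kind of output.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # ids taken from the special-token table above


def build_batch(sequences):
    """Wrap each id sequence in [CLS] ... [SEP], then right-pad to a common length."""
    wrapped = [[CLS_ID] + list(seq) + [SEP_ID] for seq in sequences]
    width = max(len(w) for w in wrapped)
    input_ids = [w + [PAD_ID] * (width - len(w)) for w in wrapped]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(w) + [0] * (width - len(w)) for w in wrapped]
    return input_ids, attention_mask


batch = [
    [7568, 2000, 3046],                        # "excited to try"
    [2023, 14324, 2944, 2013, 17662, 12172],   # "this bert model from hugging ##face"
]
input_ids, attention_mask = build_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Because `[CLS]` always lands at index 0, the classifier can read the pooled representation from the same position for every sentence in the batch, regardless of sentence length.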


#### Note that because we're using the `uncased` version of the tokenizer, all input text is lowercased before it is split into subwords.
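The subword splits in the output above (`'ya', '##y'` and `'hugging', '##face'`) come from WordPiece's greedy longest-match-first rule: starting at the beginning of a word, take the longest prefix found in the vocabulary, then repeat on the remainder, marking continuation pieces with `##`. Below is a simplified sketch of that rule. The function `wordpiece` and the tiny `toy_vocab` are assumptions for illustration. The real `bert-base-uncased` vocabulary has 30,522 entries, and the library tokenizer also handles punctuation splitting and other normalization first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to reproduce the splits seen in the output above.
toy_vocab = {"ya", "##y", "hug", "##ging", "hugging", "##face", "excited"}

for w in ["Yay", "Huggingface", "excited", "xyz"]:
    print(w, "->", wordpiece(w.lower(), toy_vocab))
```

Note that `hugging` wins over `hug` followed by `##ging` because the longest match is tried first, and a word with no matching pieces collapses to a single `[UNK]`.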