#### Finetuning a BERT Model for Sentiment Classification

We will take the original `BERT` model, pretrained on the masked language modeling task, and `finetune` it for `sentiment classification` on the Stanford Sentiment Treebank (SST) and CFIMDB datasets. The `pretrained` BERT model takes an input sequence of integer token ids and outputs a corresponding sequence of contextualized encoded vectors (768-dimensional). A special `[CLS]` token is prepended to the input sequence, and the corresponding encoded output vector of this token serves as a `pooled representation` of the entire sequence. A feedforward network can then use this pooled representation vector to perform a sentence classification task. All parameters in the combined model (BERT plus the feedforward classifier) can be trained together to optimize it for sentence classification. This process is called `finetuning` because it adapts the pretrained parameters of BERT to the specialized task.

In [1]:
import torch
from transformers import BertTokenizer, BertModel
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import csv
from tqdm import tqdm
import psutil
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: tanzids. Use `wandb login --relogin` to force relogin


True

#### We will use the WordPiece tokenizer and the pre-trained BERT provided by the Hugging Face transformers library. First, let's try out the tokenizer.

In [2]:
# load the pretrained WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# use it on a test sentence
sentence = "Yay, I'm excited to try out this BERT model from Huggingface!"
tokens_subwords = tokenizer.tokenize(sentence)
tokens_idx = tokenizer.encode(sentence)
idx_to_tokens = tokenizer.convert_ids_to_tokens(tokens_idx)
decoded_sentence = tokenizer.decode(tokens_idx)
print("Original sentence: ", sentence)
print("Subword tokens: ", tokens_subwords)
print("Encoded sentence: ", tokens_idx)
print("Idx back to tokens: ", idx_to_tokens)
print("Decoded sentence: ", decoded_sentence)

# let's also take a look at all the special tokens
print("\nSpecial tokens with their integer id:")
special_tokens = tokenizer.all_special_tokens
for t in special_tokens:
    print(t," <--> " ,tokenizer.convert_tokens_to_ids(t))

Original sentence:  Yay, I'm excited to try out this BERT model from Huggingface!
Subword tokens:  ['ya', '##y', ',', 'i', "'", 'm', 'excited', 'to', 'try', 'out', 'this', 'bert', 'model', 'from', 'hugging', '##face', '!']
Encoded sentence:  [101, 8038, 2100, 1010, 1045, 1005, 1049, 7568, 2000, 3046, 2041, 2023, 14324, 2944, 2013, 17662, 12172, 999, 102]
Idx back to tokens:  ['[CLS]', 'ya', '##y', ',', 'i', "'", 'm', 'excited', 'to', 'try', 'out', 'this', 'bert', 'model', 'from', 'hugging', '##face', '!', '[SEP]']
Decoded sentence:  [CLS] yay, i'm excited to try out this bert model from huggingface! [SEP]

Special tokens with their integer id:
[UNK]  <-->  100
[SEP]  <-->  102
[PAD]  <-->  0
[CLS]  <-->  101
[MASK]  <-->  103
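
The subword split shown above (`yay` becoming `ya` and `##y`) comes from WordPiece's greedy longest-match-first lookup. The following is a minimal sketch of that lookup for a single word, not the library's implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary is tiny compared with the roughly 30k entries of the real one.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to reproduce two splits from the output above.
toy_vocab = {"ya", "##y", "hugging", "##face", "excited"}

print(wordpiece_tokenize("yay", toy_vocab))
print(wordpiece_tokenize("huggingface", toy_vocab))
print(wordpiece_tokenize("excited", toy_vocab))
```

Because the longest matching prefix is taken first, a word that is in the vocabulary as a whole (such as `excited`) is never split.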


#### Note that since we're using the `uncased` version of the tokenizer, everything becomes lowercase.

Now let's load the SST dataset from file and package it inside a PyTorch dataset class. Each data instance is a sentence-sentiment pair. There are 5 sentiment labels: NEGATIVE (0), SOMEWHAT NEGATIVE (1), NEUTRAL (2), SOMEWHAT POSITIVE (3), POSITIVE (4).

In [2]:
def load_data_sst(split="train"):
    if split == "test":
        filename = "data/ids-sst-test-student.csv"    
        data = []
        with open(filename, 'r') as f:
            for record in csv.DictReader(f, delimiter='\t'):
                sent = record['sentence'].lower().strip()
                sent_id = record['id'].lower().strip()
                data.append((sent,sent_id))
        return data          
    else:
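        if split == "train":
            filename = "data/ids-sst-train.csv"
        elif split == "dev":
            filename = "data/ids-sst-dev.csv"
        else:
            # fail early instead of hitting an undefined `filename` below
            raise ValueError(f"unknown split: {split!r}")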
        data = []
        labels = []
        with open(filename, 'r') as f:
            for record in csv.DictReader(f, delimiter='\t'):
                sent = record['sentence'].lower().strip()
                sent_id = record['id'].lower().strip()
                label = int(record['sentiment'].strip())
                data.append((sent,label,sent_id))
                labels.append(label)
        label_distribution = Counter(labels)        
        return data, label_distribution

In [4]:
sst_train, train_label_distribution = load_data_sst(split="train")
sst_dev, dev_label_distribution = load_data_sst(split="dev")

print(f"Number of training examples: {len(sst_train)}")
print(f"Train Label distribution: {train_label_distribution}")
print(f"Number of dev examples: {len(sst_dev)}")
print(f"Dev Label distribution: {dev_label_distribution}")


Number of training examples: 8544
Train Label distribution: Counter({3: 2322, 1: 2218, 2: 1624, 4: 1288, 0: 1092})
Number of dev examples: 1101
Dev Label distribution: Counter({1: 289, 3: 279, 2: 229, 4: 165, 0: 139})
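
The label counts above are not uniform, so a useful reference point for any accuracy figure is the score of a classifier that always predicts the most frequent label. The sketch below computes that baseline from the printed counts; `majority_baseline` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def majority_baseline(label_counts):
    """Return (most frequent label, accuracy of always predicting it)."""
    label, count = label_counts.most_common(1)[0]
    return label, count / sum(label_counts.values())


# Counts copied from the output above.
train_counts = Counter({3: 2322, 1: 2218, 2: 1624, 4: 1288, 0: 1092})
dev_counts = Counter({1: 289, 3: 279, 2: 229, 4: 165, 0: 139})

for name, counts in [("train", train_counts), ("dev", dev_counts)]:
    label, acc = majority_baseline(counts)
    print(f"{name}: majority label = {label}, baseline accuracy = {acc:.3f}")
```

On the dev set the majority label is 1 (SOMEWHAT NEGATIVE), and always predicting it gives roughly 26% accuracy.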


In [3]:
class SSTDataset(Dataset):
    def __init__(self, data, max_length=128):
        self.data = data
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
    
    # collate function for padding the sentences to the same length and creating attention masks
    def collate_fn(self, batch):
        sents = [x[0] for x in batch]
        labels = [x[1] for x in batch]
        encoded = self.tokenizer.batch_encode_plus(sents, add_special_tokens=True, padding='max_length', truncation=True, max_length=self.max_length, return_tensors='pt')
        input_idx = encoded['input_ids']
        attn_mask = encoded['attention_mask']   
        #token_type_idx = encoded['token_type_ids'] # don't need this since we only have one sentence
        labels = torch.tensor(labels)
        return input_idx, labels, attn_mask
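
To make the padding and attention-mask step in `collate_fn` concrete, here is a pure-Python sketch of what `padding='max_length'` produces for a small batch. `pad_batch` is a hypothetical helper, and its truncation is deliberately naive: it simply cuts the list, which is an assumption of this sketch rather than the tokenizer's exact behaviour.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Pad or truncate each id list to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]  # naive truncation (assumption of this sketch)
        num_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * num_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * num_pad)
    return input_ids, attention_mask


# [CLS] = 101, [SEP] = 102 and [PAD] = 0, as listed by the tokenizer earlier;
# the remaining ids are taken from the encoded example sentence.
batch = [[101, 8038, 2100, 102], [101, 7568, 102]]
ids, mask = pad_batch(batch, max_length=6)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.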

#### Now we define our sentiment classifier model.

In [4]:
class BERTSentimentClassifier(torch.nn.Module):
    def __init__(self, hidden_size=768, num_classes=5, dropout_rate=0.1, finetune=False):
        super().__init__()
        # load pretrained BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = torch.nn.Dropout(dropout_rate)
        # define classifier head
        self.classifier_head = torch.nn.Linear(hidden_size, num_classes)

        for param in self.bert.parameters():
            if finetune:
                # make all parameters of BERT model trainable if we're finetuning
                param.requires_grad = True
            else:
                # freeze all parameters of BERT model if we're not finetuning
                param.requires_grad = False

    def forward(self, input_idx, labels, attn_mask):
        # compute BERT encodings
        bert_output = self.bert(input_idx, attention_mask=attn_mask)
        # extract the `[CLS]` encoding (first element of the sequence)
        bert_output = bert_output.last_hidden_state # shape: (batch_size, sequence_length, hidden_size)
        cls_encoding = bert_output[:, 0] # shape: (batch_size, hidden_size)
        # apply dropout 
        cls_encoding = self.dropout(cls_encoding) 
        # compute classifier logits
        logits = self.classifier_head(cls_encoding)  # shape: (batch_size, num_classes)
        # compute loss
        loss = F.cross_entropy(logits, labels)

        return logits, loss
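
The loss returned by `forward` is the cross-entropy between the classifier logits and the target labels, averaged over the batch by default. The sketch below computes it for a single example in plain Python, as an illustration rather than a replacement for `F.cross_entropy`. It also shows the loss of a model that outputs equal logits for every class, which is a handy reference when reading training logs.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# With all-equal logits the prediction is uniform, so the loss is log(num_classes).
print(f"5 classes, uniform logits: {cross_entropy([0.0] * 5, target=2):.3f}")
print(f"2 classes, uniform logits: {cross_entropy([0.0] * 2, target=0):.3f}")
print(f"confident and correct:     {cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0], target=2):.3f}")
```

A uniform guess costs log(5), about 1.609, with five classes and log(2), about 0.693, with two, so training losses should fall below those values once the model has learned anything.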

In [5]:
# training loop
def train(model, optimizer, train_dataloader, val_dataloader, scheduler=None, device="cpu", num_epochs=10, val_every=1, save_every=None, log_metrics=None):
    avg_loss = 0
    train_acc = 0
    val_loss = 0
    val_acc = 0
    model.train()
    for epoch in range(num_epochs):
        num_correct = 0
        num_total = 0
        pbar = tqdm(train_dataloader, desc="Epochs")
        for batch in pbar:
            inputs, targets, attn_mask = batch
            # move batch to device
            inputs, targets, attn_mask = inputs.to(device), targets.to(device), attn_mask.to(device)
            # forward pass
            logits, loss = model(inputs, targets, attn_mask)
            # reset gradients
            optimizer.zero_grad()
            # backward pass
            loss.backward()
            # optimizer step
            optimizer.step()
            avg_loss = 0.9* avg_loss + 0.1*loss.item()
            B, _ = inputs.shape
            y_pred = logits.argmax(dim=-1).view(-1) # shape (B,)
            num_correct += y_pred.eq(targets.view(-1)).sum().item()            
            num_total += B
            train_acc = num_correct / num_total        
            
            pbar.set_description(f"Epoch {epoch + 1}, EMA Train Loss: {avg_loss:.3f}, Train Accuracy: {train_acc: .3f}, Val Loss: {val_loss: .3f}, Val Accuracy: {val_acc: .3f}")  

            if log_metrics:
                metrics = {"Batch loss" : loss.item(), "Moving Avg Loss" : avg_loss, "Val Loss": val_loss}
                log_metrics(metrics)

        if scheduler is not None:
            scheduler.step()
        
        if val_every is not None:
            if epoch%val_every == 0:
                # compute validation loss
                val_loss, val_acc = validation(model, val_dataloader, device=device)
                pbar.set_description(f"Epoch {epoch + 1}, EMA Train Loss: {avg_loss:.3f}, Train Accuracy: {train_acc: .3f}, Val Loss: {val_loss: .3f}, Val Accuracy: {val_acc: .3f}") 

        if save_every is not None:
            if (epoch+1) % save_every == 0:
                save_model_checkpoint(model, optimizer, epoch, avg_loss)

def validation(model, val_dataloader, device="cpu"):
    model.eval()
    val_losses = torch.zeros(len(val_dataloader))
    with torch.no_grad():
        num_correct = 0
        num_total = 0
        for i,batch in enumerate(val_dataloader):
            inputs, targets, attn_mask = batch
            inputs, targets, attn_mask = inputs.to(device), targets.to(device), attn_mask.to(device)
            logits, loss = model(inputs, targets, attn_mask)
            B, _ = inputs.shape
            y_pred = logits.argmax(dim=-1).view(-1) # shape (B,)
            num_correct += y_pred.eq(targets.view(-1)).sum().item()            
            num_total += B
            val_losses[i] = loss.item()
    model.train()
    val_loss = val_losses.mean().item()
    val_accuracy = num_correct / num_total
    return val_loss, val_accuracy


def save_model_checkpoint(model, optimizer, epoch=None, loss=None, filename=None):
    # Save the model and optimizer state_dict
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }

    # Save the checkpoint to a file
    if filename:
        torch.save(checkpoint, filename)
    else:
        torch.save(checkpoint, 'sentiment_classifier_checkpoint.pth')
    print("Saved model checkpoint!")


def load_model_checkpoint(model, optimizer, filename=None):
    if filename:
        checkpoint = torch.load(filename)
    else:
        checkpoint = torch.load('sentiment_classifier_checkpoint.pth')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    model.train()
    print("Loaded model from checkpoint!")
    return model, optimizer     
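
The training loop reports an exponential moving average (EMA) of the batch loss, `avg_loss = 0.9 * avg_loss + 0.1 * loss`, starting from zero. The sketch below reproduces that update in plain Python and compares it with a bias-corrected variant. Both helper names and the constant loss values are hypothetical, and the corrected version is not something the notebook uses.

```python
def ema(values, decay=0.9):
    """Exponential moving average with the same update as the training loop."""
    avg = 0.0
    history = []
    for v in values:
        avg = decay * avg + (1 - decay) * v
        history.append(avg)
    return history


def ema_bias_corrected(values, decay=0.9):
    """Same average, divided by (1 - decay**t) to undo the zero initialisation."""
    avg = 0.0
    history = []
    for t, v in enumerate(values, start=1):
        avg = decay * avg + (1 - decay) * v
        history.append(avg / (1 - decay ** t))
    return history


losses = [1.6] * 10  # hypothetical constant batch loss
print([round(x, 3) for x in ema(losses)[:3]])
print([round(x, 3) for x in ema_bias_corrected(losses)[:3]])
```

The uncorrected average starts far below the true loss and catches up over the first few dozen batches. Since `avg_loss` is initialised once before the epoch loop, this only affects the start of the first epoch.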

#### Now let's train a model without finetuning the BERT weights, i.e. we keep the BERT weights frozen and only learn the weights for the classifier head.

In [9]:
B = 32
max_length = 128
learning_rate = 5e-4
DEVICE = "cuda"

train_dataset = SSTDataset(sst_train, max_length=max_length)
val_dataset = SSTDataset(sst_dev, max_length=max_length)
train_dataloader = DataLoader(train_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=train_dataset.collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=val_dataset.collate_fn)

# model with finetuning disabled
model = BERTSentimentClassifier(finetune=False).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler =  torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
#model, optimizer = load_model_checkpoint(model, optimizer)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in transformer network: {num_params/1e6} M")
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

Total number of parameters in transformer network: 109.486085 M
RAM used: 972.72 MB
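
The parameter count above can be broken down with simple arithmetic: the classifier head is a single linear layer on top of the 768-dimensional `[CLS]` vector, and everything else belongs to the BERT encoder. The sketch below does that split, with `linear_params` as a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features


hidden_size = 768
total_5_class = 109_486_085  # from the output above (109.486085 M)

head_5_class = linear_params(hidden_size, 5)
head_2_class = linear_params(hidden_size, 2)
encoder = total_5_class - head_5_class

print(f"classifier head (5 classes): {head_5_class}")
print(f"classifier head (2 classes): {head_2_class}")
print(f"BERT encoder:                {encoder}")
print(f"total with a 2-class head:   {encoder + head_2_class}")
print(f"head share of all parameters: {head_5_class / total_5_class:.6f}")
```

With the BERT weights frozen, only the 3,845 head parameters receive gradient updates, a tiny fraction of the model.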


In [10]:
# create a W&B run
run = wandb.init(
    project="BERT_Sentiment_Classification", 
    config={
        "learning_rate": learning_rate, 
        "epochs": 30,
        "batch_size": B, 
        "corpus": "Stanford Sentiment Tree"},)   

def log_metrics(metrics):
    wandb.log(metrics)

In [11]:
train(model, optimizer, train_dataloader, val_dataloader, device=DEVICE, num_epochs=30, save_every=50, val_every=1, log_metrics=log_metrics) 

Epoch 1, EMA Train Loss: 1.372, Train Accuracy:  0.364, Val Loss:  0.000, Val Accuracy:  0.000: 100%|██████████| 267/267 [00:30<00:00,  8.75it/s]
Epoch 2, EMA Train Loss: 1.327, Train Accuracy:  0.436, Val Loss:  1.378, Val Accuracy:  0.417: 100%|██████████| 267/267 [00:29<00:00,  9.04it/s]
Epoch 3, EMA Train Loss: 1.268, Train Accuracy:  0.449, Val Loss:  1.282, Val Accuracy:  0.436: 100%|██████████| 267/267 [00:29<00:00,  9.00it/s]
Epoch 4, EMA Train Loss: 1.272, Train Accuracy:  0.456, Val Loss:  1.273, Val Accuracy:  0.431: 100%|██████████| 267/267 [00:30<00:00,  8.73it/s]
Epoch 5, EMA Train Loss: 1.277, Train Accuracy:  0.464, Val Loss:  1.246, Val Accuracy:  0.452: 100%|██████████| 267/267 [00:30<00:00,  8.90it/s]
Epoch 6, EMA Train Loss: 1.254, Train Accuracy:  0.472, Val Loss:  1.232, Val Accuracy:  0.473: 100%|██████████| 267/267 [00:29<00:00,  8.94it/s]
Epoch 7, EMA Train Loss: 1.209, Train Accuracy:  0.480, Val Loss:  1.252, Val Accuracy:  0.440: 100%|██████████| 267/267 [00

In [12]:
wandb.finish()

wandb: ERROR Control-C detected -- Run data was not synced


#### The validation accuracy reaches about 45% with no finetuning. Now let's train the model with finetuning.

In [15]:
B = 32
max_length = 128
learning_rate = 1e-4
DEVICE = "cuda"

train_dataset = SSTDataset(sst_train, max_length=max_length)
val_dataset = SSTDataset(sst_dev, max_length=max_length)
train_dataloader = DataLoader(train_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=train_dataset.collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=val_dataset.collate_fn)

# model with finetuning enabled
model = BERTSentimentClassifier(finetune=True).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler =  torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
#model, optimizer = load_model_checkpoint(model, optimizer)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in transformer network: {num_params/1e6} M")
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

Total number of parameters in transformer network: 109.483778 M
RAM used: 1530.46 MB


In [26]:
# create a W&B run
run = wandb.init(
    project="BERT_Sentiment_Classification", 
    config={
        "learning_rate": learning_rate, 
        "epochs": 10,
        "batch_size": B, 
        "corpus": "Stanford Sentiment Tree"},)   

def log_metrics(metrics):
    wandb.log(metrics)

In [17]:
train(model, optimizer, train_dataloader, val_dataloader, device=DEVICE, num_epochs=10, save_every=50, val_every=1, log_metrics=log_metrics) 

Epochs:   0%|          | 0/267 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
../at

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
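
The assertion `t >= 0 && t < n_classes` in the trace above is raised by the cross-entropy kernel when a target label falls outside the range of the classifier's outputs. One possible cause, offered here as a guess rather than a diagnosis, is a head with fewer outputs than the dataset has labels: the parameter count printed above (109.483778 M) is what a 2-class head would give, while SST labels run from 0 to 4. The sketch below shows a cheap check to run on the CPU before training; `check_label_range` is a hypothetical helper.

```python
def check_label_range(labels, num_classes):
    """Raise a readable error if any label is outside [0, num_classes)."""
    bad = sorted({y for y in labels if not 0 <= y < num_classes})
    if bad:
        raise ValueError(
            f"labels {bad} are outside the valid range 0..{num_classes - 1}; "
            f"check that num_classes matches the dataset"
        )


check_label_range([0, 1, 2, 3, 4], num_classes=5)  # passes silently

try:
    check_label_range([0, 1, 2, 3, 4], num_classes=2)
except ValueError as err:
    print(err)
```

After a device-side assert the CUDA context is generally left in an unusable state, so restarting the kernel before re-running the training cell is usually necessary.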


#### With finetuning, the validation accuracy has increased to just over 50%.

#### Now let's try the CFIMDB dataset, which contains movie reviews with two labels: positive and negative.

In [6]:
def load_data_cfimdb(split="train"):
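    if split == "train":
        filename = "data/ids-cfimdb-train.csv"
    elif split == "dev":
        filename = "data/ids-cfimdb-dev.csv"
    else:
        # fail early instead of hitting an undefined `filename` below
        raise ValueError(f"unknown split: {split!r}")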
    data = []
    labels = []
    with open(filename, 'r') as f:
        for record in csv.DictReader(f, delimiter='\t'):
            sent = record['sentence'].lower().strip()
            sent_id = record['id'].lower().strip()
            label = int(record['sentiment'].strip())
            data.append((sent,label,sent_id))
            labels.append(label)
    label_distribution = Counter(labels)        
    return data, label_distribution

In [7]:
cfimdb_train, train_label_distribution = load_data_cfimdb(split="train")
cfimdb_dev, dev_label_distribution = load_data_cfimdb(split="dev")

print(f"Number of training examples: {len(cfimdb_train)}")
print(f"Train Label distribution: {train_label_distribution}")
print(f"Number of dev examples: {len(cfimdb_dev)}")
print(f"Dev Label distribution: {dev_label_distribution}")


Number of training examples: 1707
Train Label distribution: Counter({0: 856, 1: 851})
Number of dev examples: 245
Dev Label distribution: Counter({0: 123, 1: 122})


#### Now let's train the model, without and with finetuning the BERT base.

In [24]:
B = 32
max_length = 128
learning_rate = 5e-4
DEVICE = "cuda"

train_dataset = SSTDataset(cfimdb_train, max_length=max_length)
val_dataset = SSTDataset(cfimdb_dev, max_length=max_length)
train_dataloader = DataLoader(train_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=train_dataset.collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=val_dataset.collate_fn)

# model with finetuning disabled
model = BERTSentimentClassifier(num_classes=2, finetune=False).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler =  torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
#model, optimizer = load_model_checkpoint(model, optimizer)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in transformer network: {num_params/1e6} M")
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

Total number of parameters in transformer network: 109.483778 M
RAM used: 1429.49 MB


In [27]:
train(model, optimizer, train_dataloader, val_dataloader, device=DEVICE, num_epochs=20, save_every=50, val_every=1, log_metrics=log_metrics) 

Epoch 1, EMA Train Loss: 0.653, Train Accuracy:  0.568, Val Loss:  0.000, Val Accuracy:  0.000: 100%|██████████| 54/54 [00:06<00:00,  8.05it/s]
Epoch 2, EMA Train Loss: 0.607, Train Accuracy:  0.680, Val Loss:  0.628, Val Accuracy:  0.710: 100%|██████████| 54/54 [00:05<00:00,  9.04it/s]
Epoch 3, EMA Train Loss: 0.577, Train Accuracy:  0.719, Val Loss:  0.591, Val Accuracy:  0.743: 100%|██████████| 54/54 [00:05<00:00,  9.04it/s]
Epoch 4, EMA Train Loss: 0.579, Train Accuracy:  0.737, Val Loss:  0.564, Val Accuracy:  0.747: 100%|██████████| 54/54 [00:06<00:00,  8.47it/s]
Epoch 5, EMA Train Loss: 0.533, Train Accuracy:  0.757, Val Loss:  0.540, Val Accuracy:  0.759: 100%|██████████| 54/54 [00:05<00:00,  9.00it/s]
Epoch 6, EMA Train Loss: 0.545, Train Accuracy:  0.772, Val Loss:  0.527, Val Accuracy:  0.739: 100%|██████████| 54/54 [00:06<00:00,  8.95it/s]
Epoch 7, EMA Train Loss: 0.498, Train Accuracy:  0.753, Val Loss:  0.515, Val Accuracy:  0.776: 100%|██████████| 54/54 [00:06<00:00,  8.

In [28]:
wandb.finish()




Run history:
Batch loss,▇▇▇█▆▅▆▅▅▄▄▃▆▃▄▄▃▃▄▃▂▆▃▃▂▃▂▄▃▂▃▁▅▁▃▃▂▄▂▄
Moving Avg Loss,▆█▇▇▆▆▅▅▅▄▄▄▅▄▃▃▃▃▃▂▃▃▃▂▂▃▂▃▂▂▁▁▂▁▂▁▂▁▂▂
Val Loss,▁▁████▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆

Run summary:
Batch loss,0.41897
Moving Avg Loss,0.44104
Val Loss,0.45394


#### The validation accuracy reaches close to 79%. Now let's try finetuning.

In [8]:
B = 32
max_length = 128
learning_rate = 4e-6
DEVICE = "cuda"

train_dataset = SSTDataset(cfimdb_train, max_length=max_length)
val_dataset = SSTDataset(cfimdb_dev, max_length=max_length)
train_dataloader = DataLoader(train_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=train_dataset.collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=B, shuffle=True, pin_memory=True, num_workers=2, collate_fn=val_dataset.collate_fn)

# model with finetuning enabled
model = BERTSentimentClassifier(num_classes=2, finetune=True, dropout_rate=0.2).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler =  torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
#model, optimizer = load_model_checkpoint(model, optimizer)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in transformer network: {num_params/1e6} M")
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

Total number of parameters in transformer network: 109.483778 M
RAM used: 1064.91 MB


In [30]:
# create a W&B run
run = wandb.init(
    project="BERT_Sentiment_Classification", 
    config={
        "learning_rate": learning_rate, 
        "epochs": 5,  # matches num_epochs in the train() call below
        "batch_size": B, 
        "corpus": "CFIMDB"},)   

In [13]:
train(model, optimizer, train_dataloader, val_dataloader, device=DEVICE, num_epochs=5, save_every=50, val_every=1) 

Epoch 1, EMA Train Loss: 0.023, Train Accuracy:  0.995, Val Loss:  0.000, Val Accuracy:  0.000: 100%|██████████| 54/54 [00:18<00:00,  2.90it/s]
Epoch 2, EMA Train Loss: 0.023, Train Accuracy:  0.995, Val Loss:  0.376, Val Accuracy:  0.906: 100%|██████████| 54/54 [00:18<00:00,  2.91it/s]
Epoch 3, EMA Train Loss: 0.013, Train Accuracy:  0.996, Val Loss:  0.363, Val Accuracy:  0.918: 100%|██████████| 54/54 [00:18<00:00,  2.87it/s]
Epoch 4, EMA Train Loss: 0.010, Train Accuracy:  0.997, Val Loss:  0.400, Val Accuracy:  0.914: 100%|██████████| 54/54 [00:18<00:00,  2.92it/s]
Epoch 5, EMA Train Loss: 0.006, Train Accuracy:  0.998, Val Loss:  0.376, Val Accuracy:  0.918: 100%|██████████| 54/54 [00:18<00:00,  2.93it/s]


In [11]:
wandb.finish()

#### With finetuning, the validation accuracy has reached around 92%.