In [None]:
Sure! Here are the detailed steps involved in fine-tuning a BERT model:

Preprocess the data:

Tokenize the input data: Tokenize the raw input data into a sequence of tokens that can be fed into the BERT model. This involves splitting the text into words and converting each word into a numerical token that corresponds to a specific entry in the model's vocabulary.
Truncate and pad the token sequences: BERT models have a fixed input length, so the input sequences must be truncated or padded to the appropriate length. Common choices include truncating sequences to a fixed maximum length or padding sequences with special tokens to a fixed length.
Convert the input data into tensors: Convert the token sequences into tensors that can be fed into the BERT model.
Load the pre-trained BERT model:

Load the pre-trained BERT model: Load the pre-trained BERT model from a pre-trained checkpoint.
Add a classification layer: Add a new classification layer on top of the pre-trained BERT model that can be fine-tuned to classify the new input data.
Define the optimizer:

Choose an optimizer: Choose an optimizer that will be used to update the weights of the model during training. Popular choices include stochastic gradient descent (SGD), Adam, and Adagrad.
Define the learning rate: Set the learning rate for the optimizer, which controls how quickly the weights of the model are updated during training.
Define the loss function:

Choose a loss function: Choose a loss function that will be used to measure the difference between the predicted and actual labels. Common choices include cross-entropy loss and binary cross-entropy loss.
Train the model:

Iterate over the training data: Iterate over the training data in batches of fixed size.
Forward pass: Pass each batch through the BERT model and compute the predicted output.
Compute the loss: Compute the loss between the predicted output and the true labels using the defined loss function.
Backward pass: Compute the gradients of the loss with respect to the model parameters and perform a backward pass through the network.
Update the weights: Use the optimizer to update the weights of the model based on the computed gradients.
Repeat until convergence: Repeat these steps until the loss stops decreasing or the model converges to a satisfactory level of accuracy.
Evaluate the model:

Iterate over the validation or test data: Iterate over the held-out validation or test data in batches of fixed size.
Forward pass: Pass each batch through the BERT model and compute the predicted output.
Compute the accuracy: Compute the accuracy of the model by comparing the predicted output to the true labels.
Repeat for all batches: Repeat these steps for all batches in the validation or test data.
Report the final accuracy: Report the final accuracy of the model on the held-out data.
Tune hyperparameters:

Experiment with different hyperparameters: Try different values for hyperparameters such as the learning rate, batch size, and number of epochs to see how they affect the performance of the model.
Choose the best hyperparameters: Choose the hyperparameters that result in the best performance on the validation data.
Predict on new data:

Preprocess the new data: Preprocess the new data using the same tokenization and padding techniques used during training.
Feed the data through the model: Pass the preprocessed data through the fine-tuned BERT model and compute the predicted output.
Convert the output to labels: Convert the predicted output to class labels based on a threshold or

# Fine-Tuning BERT for Sentiment Analysis on the IMDb Dataset

### Introduction

In this project we will be fine tuning a transformer model for the **Multilabel text classification** problem.

### Importing the necessary liberies
​
In this project, we will be mainly using pytorch to fine-tune the model, other liberies include [**huggingface transformers**](https://huggingface.co/docs/transformers/index) library to load the BERT model, and **numpy**, **pandas** and **sklearn** to preprocess as well as analyze the data 

In [None]:
!pip install torch transformers datasets kaggle
import torch

import numpy as np
import pandas as pd
import transformers

from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, BertTokenizer
from sklearn.model_selection import train_test_split
from torch import nn
from sklearn.metrics import accuracy_score

### Preprocessing the Data
​
Preprocess the input data to prepare it for training. This includes tasks such as tokenization, truncation, and padding to convert the raw input data into the format expected by the BERT model.

In [105]:
!wget -O test.csv https://drive.google.com/uc?export=download&id=1Q3o616NCQLvyciX1Yt5WWBBdZPUxV8LM

--2023-04-23 03:13:30--  https://drive.google.com/uc?export=download
Resolving drive.google.com (drive.google.com)... 74.125.126.113, 74.125.126.100, 74.125.126.101, ...
Connecting to drive.google.com (drive.google.com)|74.125.126.113|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2023-04-23 03:13:30 ERROR 400: Bad Request.



In [None]:
imdb_dataset = pd.read_csv('IMDB Dataset.csv')

imdb_dataset['sentiment'] = imdb_dataset['sentiment'].replace('positive', 1).replace('negative', 0)
print("Dataset count")
print(imdb_dataset['sentiment'].value_counts())

train_data, test_data = train_test_split(imdb_dataset, test_size=0.2)

print(f"""Train: {len(train_data)}, Test: {len(test_data)}""")

train_data.head()

Dataset count
1    25000
0    25000
Name: sentiment, dtype: int64
Train: 40000, Test: 10000


Unnamed: 0,review,sentiment
42163,Ho humm - - - More of nothing. If you are a lo...,0
30758,"In the immortal ""Shaun of the Dead"", we are in...",1
46850,"It's schmaltzy, but then what else did you exp...",0
44550,"This film has an interesting plot, but the act...",0
23295,This film quite literally has every single act...,1


In [95]:
imdb_dataset['review'].apply(lambda x: len(x)).idxmin()

imdb_dataset['review'][27521]

'Read the book, forget the movie!'

In [86]:
max_len = 1000
batch_size = 32 # GLUE
epochs = 10
learning_rate = 0.00001 #5e-5, 4e-5, 3e-5, and 2e-5

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

device

device(type='cuda')

In [None]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        review = str(self.data['review'].to_numpy()[idx])
        
        sentiment = self.data['sentiment'].to_numpy()[idx]
        
        tokenized_review  = tokenizer.encode_plus(review,max_length=max_len, add_special_tokens= True,
                                                  pad_to_max_length= True,return_attention_mask=True,
                                                  return_token_type_ids=True,return_tensors='pt')
        
        return {
          'review': review,
          'input_ids': tokenized_review['input_ids'].flatten(),
          'attention_mask': tokenized_review['attention_mask'].flatten(),
          'sentiments': torch.tensor(sentiment, dtype=torch.long),
          'token_type_ids': tokenized_review["token_type_ids"]
        }


In [None]:
train_dataset = CustomDataset(train_data)
test_dataset = CustomDataset(test_data)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, num_workers=2)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, num_workers=2)

In [None]:
model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(model_name)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:

# tokenizer = BertTokenizerFast.from_pretrained(model_name)
# model = BertForSequenceClassification.from_pretrained(model_name)
bert = BertModel.from_pretrained(model_name)

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
class MovieBERT(nn.Module):
  def __init__(self):
    super(MovieBERT, self).__init__()
    self.bert = bert
    self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

  def forward(self, input_ids, attention_mask):
    outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
    last_hidden_state = outputs.last_hidden_state

    logits = self.classifier(last_hidden_state[:, 0, :])

    return logits

model = MovieBERT()
model = model.to(device)

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)
loss_fn = nn.CrossEntropyLoss().to(device)

batch_length = len(next(iter(train_dataloader))['input_ids'])
batch_length

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


4

In [None]:
for epoch in range(epochs):
  model.train()

  train_loss = 0.0
  train_acc = 0.0
  
  batch_count = 0

  for batch in train_dataloader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    sentiments = batch['sentiments'].to(device)
    token_type_ids = batch['token_type_ids'].to(device)

    optimizer.zero_grad()

    outputs = model(input_ids, attention_mask)

    loss = loss_fn(outputs, sentiments)

    loss.backward()
    optimizer.step()

    # compute acc
    preds = torch.argmax(outputs, axis=-1)
    acc = accuracy_score(sentiments.cpu().numpy(), preds.cpu().numpy())
    train_loss += loss.item() * batch_size
    train_acc += acc * batch_size
    
    if batch_count % 10 == 0:
      print(f'Batch {batch_count} / {len(train_dataloader)} - Loss: {loss.item():.4f}')

    batch_count += 1

  train_loss /= len(train_dataset)
  train_acc /= len(train_dataset)

  print(f'Epoch {epoch+1} / {epochs} - Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}')


In [32]:
model.eval()  # set model to evaluation mode
correct = 0
total = 0
with torch.no_grad():  # disable gradient calculation for inference
    for batch in test_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        sentiments = batch['sentiments'].to(device)

        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs.data, 1)

        total += sentiments.size(0)
        correct += (predicted == sentiments).sum().item()

print('Accuracy of the model on the test set: %d %%' % (100 * correct / total))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Accuracy of the model on the test set: 92 %


In [70]:
# model(input_ids, attention_mask)

# batch['input_ids']
# batch['attention_mask']

# test_movie_review = "Really Amazing, just epic and awesome to watch, no other words.  A bold showing on the theory and practice of collective oligarchies. A consumerist masterpiece that shows the function of a capitalist world built off of cinema and film industry. In every moment the screen fills up with action, not only from the main character. It should win an Oscar for its screenplay. "
# test_movie_review = "Christian Bale is great, Russell Crowe has a good moment, Hemsworth is good but this movie cannot be saved from a terrible script, bad direction and stupid humour that overstays its welcome."
# test_movie_review = "Fine seems to cover it, Disney has become fine, everything is fine. It has got to the point where if spending your time aimlessly going through your phone for 2+ hours or watching a marvel movie is a bigger waste of time."
test_movie_review = "This film is an emotional rollercoaster with some of the coolest superhero plot lines ever drawn up. It's straight up the most epic Marvel film that will probably ever be created. I don't see how Marvel could ever top this, but getting to see these characters all together at least one last time was a reward all on its own."

test_movie_review_encoded = tokenizer.encode_plus(test_movie_review,max_length=max_len, add_special_tokens= True,
                                                  pad_to_max_length= True,return_attention_mask=True,
                                                  return_token_type_ids=True,return_tensors='pt')

test_movie_review_encoded

input_ids = test_movie_review_encoded['input_ids'].to(device)
attention_mask = test_movie_review_encoded['attention_mask'].to(device)

# input_ids.shape, attention_mask.shape
predicted_outputs = model(input_ids, attention_mask)
# _, predicted = 
_, prediction = torch.max(predicted_outputs.data, 1)

prediction, predicted_outputs.data

(tensor([1], device='cuda:0'), tensor([[-4.6858,  4.3056]], device='cuda:0'))

In [33]:
tokens = input_ids[0]
tokens = tokenizer.convert_ids_to_tokens(tokens)
text = tokenizer.convert_tokens_to_string(tokens)

text

'[CLS] the coming attractions to " the order " make it seem like a decent horror mystery / thriller , but what we get is a plot that has potential to be excellent all thrown together to form a pile of garbage . < br / > < br / > first off the whole movie consists of terrible dialogue and god awful special affects . the acting was also nothing to be proud of , but keath ledger ( i think i spelled that right . ) saved the movie in this category . < br / > < br / > for heaven \' s sake : don \' t see this movie ! [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

In [None]:
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'epoch': epoch,
    'loss': loss
}, 'imdb_BERT_10K_train_10_epoch_model.pth')

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
checkpoint = torch.load('/content/drive/My Drive/462/imdb_BERT_10K_train_10_epoch_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']