# 1 - Fine-tuning DistilBERT

In this notebook, we fine-tune DistilBERT for sentiment analysis on a dataset of restaurant reviews. DistilBERT is a smaller, faster, and lighter version of BERT (Bidirectional Encoder Representations from Transformers), an encoder-based transformer model introduced by Google in 2018. According to the DistilBERT paper, it is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance, making it a practical choice for many natural language processing (NLP) tasks, including sentiment analysis.

Our goal is to classify each review as negative, neutral, or positive. This involves loading the pre-trained DistilBERT model, preparing the dataset, and fine-tuning the model on our specific task.




For further reading on BERT and DistilBERT, refer to the original BERT paper [here](https://arxiv.org/abs/1810.04805) and the DistilBERT paper [here](https://arxiv.org/abs/1910.01108).


## 1.1 - Knowledge distillation

Model distillation is a technique used to compress the knowledge of a large, complex model (often referred to as the "teacher") into a smaller, more efficient model (known as the "student"). This process involves training the student model to replicate the behavior of the teacher model. The key advantage of model distillation is that it enables the student model to achieve high performance levels, comparable to the teacher model, but with significantly reduced computational resources and faster inference times. DistilBERT is a product of this distillation process, derived from BERT, where it captures the essence of what BERT learns but in a more compact and efficient form.

![Knowledge distillation flow chart](https://editor.analyticsvidhya.com/uploads/30818Knowledge%20Distillation%20Flow%20Chart%201.2.jpg)


Ref: https://www.analyticsvidhya.com/blog/2022/01/knowledge-distillation-theory-and-end-to-end-case-study/
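
To make the teacher/student idea concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the common formulation in which the student is trained to match the teacher's temperature-softened output distribution. The function names, the temperature, and the example logits are illustrative assumptions, and this is not the exact DistilBERT recipe (the paper combines a distillation loss with masked language modelling and a cosine embedding loss).

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures (an assumption of this sketch).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return (temperature ** 2) * kl

teacher = [4.0, 1.0, -2.0]       # hypothetical teacher logits
good_student = [3.5, 1.2, -1.8]  # close to the teacher
poor_student = [-1.0, 0.5, 3.0]  # far from the teacher

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
print(f"close student: {loss_good:.4f}, distant student: {loss_poor:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, which is what pushes the student towards the teacher's behavior.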






In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from torch.utils.data import Dataset, DataLoader, random_split
from transformers import DistilBertModel, DistilBertTokenizerFast

# 2 - Preparing dataset

We'll use a cleaned version of this restaurant review dataset from Kaggle: https://www.kaggle.com/datasets/joebeachcapital/restaurant-reviews.

In [None]:
!wget https://raw.githubusercontent.com/swajayresources/Fine_Tuning_DistilBERT_For_Sentence_Classification-/main/restaurant_reviews.csv

--2024-03-10 13:50:02--  https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/datasets/restaurant_reviews.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2861025 (2.7M) [text/plain]
Saving to: ‘restaurant_reviews.csv’


2024-03-10 13:50:02 (67.4 MB/s) - ‘restaurant_reviews.csv’ saved [2861025/2861025]



In [None]:
# Load the dataset
df = pd.read_csv('restaurant_reviews.csv')

# Map sentiments to numerical labels
sentiment_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}
df['Rating'] = df['Rating'].map(sentiment_mapping)

In [None]:
df.head()

|   | Review | Rating |
|---|--------|--------|
| 0 | The ambience was good food was quite good . ha... | 2 |
| 1 | Ambience is too good for a pleasant evening. S... | 2 |
| 2 | A must try.. great food great ambience. Thnx f... | 2 |
| 3 | Soumen das and Arun was a great guy. Only beca... | 2 |
| 4 | Food is good.we ordered Kodi drumsticks and ba... | 2 |


In [None]:
# Display the first few rows of the dataframe
print(df.head())

# Display statistics about the dataset
print("\nDataset Statistics:")
print(df['Rating'].value_counts())

                                              Review  Rating
0  The ambience was good food was quite good . ha...       2
1  Ambience is too good for a pleasant evening. S...       2
2  A must try.. great food great ambience. Thnx f...       2
3  Soumen das and Arun was a great guy. Only beca...       2
4  Food is good.we ordered Kodi drumsticks and ba...       2

Dataset Statistics:
2    6331
0    2428
1    1192
Name: Rating, dtype: int64


## 2.2 - PyTorch Dataset and Dataloader

As usual with PyTorch projects, we extend the Dataset class to tailor our data for model training. This approach allows us to preprocess text for DistilBERT, including tokenization and mapping sentiments to numerical labels. It's a necessary step to ensure data compatibility with the model's expectations, facilitating efficient learning and prediction.

The **max_length** parameter caps the number of tokens in each input sequence. Batching requires every sequence in a batch to have the same length, so sequences longer than max_length are truncated (tokens beyond the limit are dropped) and shorter ones are padded with [PAD] tokens. We use 512, the maximum sequence length DistilBERT supports; a smaller value speeds up training at the cost of cutting off more of the longest reviews.
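
The effect of truncation and padding can be illustrated without the tokenizer. The sketch below uses a hypothetical helper, `pad_or_truncate`, and hand-written token ids; unlike the real tokenizer, which keeps the closing [SEP] token when truncating, this toy version simply cuts the tail.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the tail
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill with pad_id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_seq = [101, 2204, 2833, 102]                   # [CLS] good food [SEP]
long_seq = [101] + list(range(2000, 2010)) + [102]   # 12 made-up ids

ids_s, mask_s = pad_or_truncate(short_seq, max_length=6)
ids_l, mask_l = pad_or_truncate(long_seq, max_length=6)
print(ids_s, mask_s)
print(ids_l, mask_l)
```

Both results have length 6: the short sequence gains two padding positions with a mask of 0, while the long one loses everything after its sixth token.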


In [None]:
class ReviewDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length):
        self.dataset = pd.read_csv(csv_file)
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Map sentiments to numerical labels
        self.label_dict = {'negative': 0, 'neutral': 1, 'positive': 2}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        review_text = self.dataset.iloc[idx, 0]  # Assuming reviewText is the first column
        sentiment = self.dataset.iloc[idx, 1]  # Assuming sentiment is the second column
        labels = self.label_dict[sentiment]  # Convert sentiment to numerical label

        # Tokenize the review text
        encoding = self.tokenizer.encode_plus(
          review_text,
          add_special_tokens=True,  # Add [CLS] at the start and [SEP] at the end
          max_length=self.max_length,
          return_token_type_ids=False,
          padding='max_length',
          return_attention_mask=True,
          return_tensors='pt',
          truncation=True
        )

        return {
          'review_text': review_text,
          'input_ids': encoding['input_ids'].flatten(),
          'attention_mask': encoding['attention_mask'].flatten(),  # padding mask (1 = token, 0 = [PAD]), not attention weights
          'labels': torch.tensor(labels, dtype=torch.long)
        }

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
review_dataset = ReviewDataset('restaurant_reviews.csv', tokenizer, 512)


In [None]:
review_dataset[0]

{'review_text': 'The ambience was good food was quite good . had Saturday lunch which was cost effective . Good place for a sate brunch. One can also chill with friends and or parents. Waiter Soumen Das was really courteous and helpful.',
 'input_ids': tensor([  101,  1996,  2572, 11283,  5897,  2001,  2204,  2833,  2001,  3243,
          2204,  1012,  2018,  5095,  6265,  2029,  2001,  3465,  4621,  1012,
          2204,  2173,  2005,  1037,  2938,  2063,  7987,  4609,  2818,  1012,
          2028,  2064,  2036, 10720,  2007,  2814,  1998,  2030,  3008,  1012,
         15610,  2061, 27417,  8695,  2001,  2428,  2457, 14769,  1998, 14044,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0, 

In [None]:
tokenizer.decode(review_dataset[0]['input_ids'])

'[CLS] the ambience was good food was quite good. had saturday lunch which was cost effective. good place for a sate brunch. one can also chill with friends and or parents. waiter soumen das was really courteous and helpful. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [

In [None]:
# Split the dataset into training and test sets (80/20)
train_size = int(0.8 * len(review_dataset))
test_size = len(review_dataset) - train_size
train_dataset, test_dataset = random_split(review_dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

In [None]:
# Show number of batches
len(train_loader), len(test_loader)

(498, 125)
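
The batch counts shown above can be reproduced by hand. This small check assumes that every row in the CSV carries one of the three labels, so the dataset size equals the sum of the class counts printed earlier.

```python
import math

n_reviews = 6331 + 2428 + 1192      # class counts from value_counts() above
train_size = int(0.8 * n_reviews)
test_size = n_reviews - train_size
batch_size = 16

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the number of batches is the ceiling of size / batch_size.
train_batches = math.ceil(train_size / batch_size)
test_batches = math.ceil(test_size / batch_size)
print(n_reviews, train_size, test_size, train_batches, test_batches)
```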

# 3 - Fine-tuning with a custom classifier layer



## 3.1 - Understanding pooled_output

- **Aggregated Representation**: In models like BERT, each input token is transformed into a high-dimensional vector representing the token in context. These vectors are the "hidden states/representations". The pooled_output is usually derived from these hidden representations but is intended to represent the entire sequence's meaning or relevant features in a single vector.
- **Derived from [CLS] Token**: For BERT and similar models, the pooled_output is often obtained by applying an additional dense layer with a non-linear activation function to the hidden state corresponding to the first token ([CLS]). This token's hidden state is designed to capture the context of the entire sequence, making it suitable for tasks requiring a fixed-size representation of variable-length input (like classification). Note that `DistilBertModel` itself returns only the per-token hidden states, so the custom model in this section builds the pooled output explicitly.
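
The tensor shapes involved are easier to follow on random data. The sketch below uses a random tensor as a stand-in for DistilBERT's hidden states (no model is downloaded); the batch size and sequence length are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size, num_labels = 4, 12, 768, 3

# Stand-in for DistilBERT's last hidden state: one vector per token.
hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Keep only the first position ([CLS]) of every sequence.
cls_vector = hidden_state[:, 0]            # (batch_size, hidden_size)

# Dense layer + non-linearity produces the pooled representation.
pre_classifier = nn.Linear(hidden_size, hidden_size)
pooled_output = torch.relu(pre_classifier(cls_vector))

# A final linear layer maps the pooled vector to one logit per class.
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(pooled_output)

print(cls_vector.shape, pooled_output.shape, logits.shape)
```

Whatever the sequence length, the pooled vector has a fixed size of 768 per example, which is what allows a plain linear layer to act as the classifier.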


In [None]:
class CustomDistilBertForSequenceClassification(nn.Module):
    def __init__(self, num_labels=3):
        super().__init__()
        self.distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.pre_classifier = nn.Linear(768, 768)  # DistilBERT's hidden size is 768
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        distilbert_output = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = distilbert_output[0]  # (batch_size, sequence_length, hidden_size)
        pooled_output = hidden_state[:, 0]  # representation of the [CLS] token (first token)
        pooled_output = self.pre_classifier(pooled_output)
        pooled_output = torch.relu(pooled_output)
        pooled_output = self.dropout(pooled_output)  # regularization
        logits = self.classifier(pooled_output)  # (batch_size, num_labels)
        return logits


In [None]:
model = CustomDistilBertForSequenceClassification()

In [None]:
# Inspect DistilBERT
print(model.distilbert)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

## 3.2 - Fine-tuning

Fine-tuning updates all of the model's weights on our dataset, including those in the transformer layers, not just the classifier head. This lets the pre-trained representations adapt to the nuances of our sentiment analysis task. We use the AdamW optimizer with a learning rate of 5e-5, train for 10 epochs, and print the loss every 100 batches to monitor progress.
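
The loss of a single batch is noisy, so a running mean over the epoch is often easier to read when monitoring progress. Below is a minimal sketch of such a tracker; `RunningMean` is a hypothetical helper written for illustration and the loss values are made up.

```python
class RunningMean:
    """Tracks the mean of the values seen so far (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `n` is the number of examples the value was averaged over,
        # so batches of different sizes are weighted correctly.
        self.total += value * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")

meter = RunningMean()
# Made-up per-batch losses with their batch sizes (the last batch is smaller).
for batch_loss, batch_size in [(0.9, 16), (0.7, 16), (0.5, 16), (0.3, 8)]:
    meter.update(batch_loss, n=batch_size)

print(f"mean loss so far: {meter.mean:.4f}")
```

In a training loop, one could call `meter.update(loss.item(), n=labels.size(0))` after each step and print `meter.mean` at the end of every epoch.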



In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()  # create the loss once, outside the loop

model.train()
for epoch in range(10):
    for i, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

        # Report the loss of the current batch every 100 batches
        if (i + 1) % 100 == 0:
            print(f"Epoch {epoch + 1}, Batch {i + 1}, Loss: {loss.item():.4f}")


Epoch 1, Batch 100, Loss: 0.6302
Epoch 1, Batch 200, Loss: 0.6793
Epoch 1, Batch 300, Loss: 0.7018
Epoch 1, Batch 400, Loss: 0.3867
Epoch 2, Batch 100, Loss: 0.8255
Epoch 2, Batch 200, Loss: 0.6897
Epoch 2, Batch 300, Loss: 0.2856
Epoch 2, Batch 400, Loss: 0.4991
Epoch 3, Batch 100, Loss: 0.3853
Epoch 3, Batch 200, Loss: 0.5139
Epoch 3, Batch 300, Loss: 0.6422
Epoch 3, Batch 400, Loss: 0.5322
Epoch 4, Batch 100, Loss: 0.6440
Epoch 4, Batch 200, Loss: 0.5079
Epoch 4, Batch 300, Loss: 0.3834
Epoch 4, Batch 400, Loss: 0.2443
Epoch 5, Batch 100, Loss: 0.4784
Epoch 5, Batch 200, Loss: 0.4022
Epoch 5, Batch 300, Loss: 0.3897
Epoch 5, Batch 400, Loss: 0.4921
Epoch 6, Batch 100, Loss: 0.3934
Epoch 6, Batch 200, Loss: 0.3426
Epoch 6, Batch 300, Loss: 0.3082
Epoch 6, Batch 400, Loss: 0.3592
Epoch 7, Batch 100, Loss: 0.2508
Epoch 7, Batch 200, Loss: 0.2450
Epoch 7, Batch 300, Loss: 0.5378
Epoch 7, Batch 400, Loss: 0.8272
Epoch 8, Batch 100, Loss: 0.3176
Epoch 8, Batch 200, Loss: 0.8763
Epoch 8, B

## 3.3 - Evaluation


In [None]:
model.eval()
total_correct = 0
total = 0
for batch in test_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    with torch.inference_mode():
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
    predictions = torch.argmax(logits, dim=1)
    total_correct += (predictions == labels).sum().item()
    total += predictions.size(0)

print(f'Test Accuracy: {total_correct / total:.4f}')


Test Accuracy: 0.8338
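
Because the classes are imbalanced (6331 positive, 2428 negative, and 1192 neutral reviews), overall accuracy can hide weak performance on the rarer classes. A per-class breakdown helps here. The sketch below computes per-class recall in plain Python; the helper name and the labels and predictions are made up for illustration and are not results from this model.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def per_class_recall(y_true, y_pred, num_classes=3):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return [hits[c] / totals[c] if totals[c] else float("nan")
            for c in range(num_classes)]

# Made-up labels and predictions, for illustration only.
y_true = [2, 2, 2, 2, 2, 2, 0, 0, 0, 1]
y_pred = [2, 2, 2, 2, 2, 2, 0, 0, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for name, r in zip(LABELS, recall):
    print(f"{name:>8}: recall {r:.2f}")
```

In this toy example the accuracy is 0.80 even though the only neutral review is misclassified. To apply the same check to the evaluation loop above, one would collect `predictions` and `labels` from each batch into two lists.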


In [None]:
def predict_sentiment(review_text, model, tokenizer, max_length = 512):
    """
    Predicts the sentiment of a given review text.

    Args:
    - review_text (str): The review text to analyze.
    - model (torch.nn.Module): The fine-tuned sentiment analysis model.
    - tokenizer (PreTrainedTokenizer): The tokenizer for encoding the text.
    - max_length (int): The maximum sequence length for the model.

    Returns:
    - sentiment (str): The predicted sentiment label ('negative', 'neutral', 'positive').
    """

    # Ensure the model is in evaluation mode
    model.eval()

    # Tokenize the input text
    encoding = tokenizer.encode_plus(
          review_text,
          add_special_tokens=True,
          max_length=max_length,
          return_token_type_ids=False,
          padding='max_length',
          return_attention_mask=True,
          return_tensors='pt',
          truncation=True
    )

    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Move tensors to the same device as the model
    model_device = next(model.parameters()).device
    input_ids = input_ids.to(model_device)
    attention_mask = attention_mask.to(model_device)

    with torch.inference_mode():
        # Forward pass, get logits
        logits = model(input_ids=input_ids, attention_mask=attention_mask)

    # Extract the highest scoring output
    prediction = torch.argmax(logits, dim=1).item()

    # Map prediction to label
    label_dict = {0: 'negative', 1: 'neutral', 2: 'positive'}
    sentiment = label_dict[prediction]

    return sentiment


In [None]:
# Test
review_1 = "We ordered from Papa Johns a so-called pizza... what to say? I'd rather eat a piece of dry cardboard, calling this pizza is an insult to Italians! "
review_2 = "I guess PizzaHut is decent but far from the Italian pizza. This is not going to blow you away, but still quite ok in the end."
review_3 = "Gino's pizza is what authentical Neapolian pizza tastes like, highly recommended."

print(predict_sentiment(review_1, model, tokenizer))
print(predict_sentiment(review_2, model, tokenizer))
print(predict_sentiment(review_3, model, tokenizer))


negative
negative
positive


## 3.4 - Freezing the base model



In [None]:
# Freeze DistilBERT parameters
for param in model.distilbert.parameters():
    param.requires_grad = False

# Re-run the training loop; only the pre_classifier and classifier layers are updated now
# ...
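
The effect of freezing can be checked on a small stand-in model. In the sketch below, `TinyClassifier` is a hypothetical toy module whose `base` layer plays the role of `model.distilbert`; the layer sizes and the random batch are arbitrary.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: `base` plays the role of model.distilbert."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 3)

    def forward(self, x):
        return self.classifier(torch.relu(self.base(x)))

toy = TinyClassifier()

# Freeze the base, in the same way as model.distilbert above.
for param in toy.base.parameters():
    param.requires_grad = False

# Give the optimizer only the parameters that still need gradients.
trainable = [p for p in toy.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-5)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in toy.parameters())
print(f"trainable parameters: {n_trainable} / {n_total}")

# One optimization step: the frozen weights must stay unchanged.
before = toy.base.weight.clone()
inputs, targets = torch.randn(4, 8), torch.tensor([0, 1, 2, 1])
loss = nn.functional.cross_entropy(toy(inputs), targets)
loss.backward()
optimizer.step()
print("base unchanged:", torch.equal(before, toy.base.weight))
```

Training only the classification head is much cheaper per step, since no gradients are computed for the base model, but the pre-trained representations can no longer adapt to the task.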

## 3.5 - Using HuggingFace Transformers

Hugging Face's Transformers library significantly streamlines the process of fine-tuning models like DistilBERT for specific tasks, such as sequence classification. It provides two key abstractions: DistilBertForSequenceClassification and the Trainer.

- **DistilBertForSequenceClassification** is a convenience model that comes pre-configured with a classification head on top of the DistilBERT model, abstracting away the need to manually add and configure the final layers for sequence classification tasks. This allows for an efficient setup where the model is ready to be fine-tuned on a specific dataset without requiring in-depth knowledge of the model's internal architecture.

- The **Trainer** class further simplifies training by encapsulating the common steps: batching, optimization, checkpointing, logging, and evaluation. It provides a straightforward way to train and evaluate models with minimal boilerplate code. TrainingArguments allows for easy customization of the training process, such as learning rate schedules, batch sizes, and the number of epochs.

In [None]:
%pip install accelerate -U

In [None]:
from transformers import DistilBertForSequenceClassification

# Load DistilBertForSequenceClassification, a DistilBERT model pre-configured for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory for model checkpoints
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset (a PyTorch Dataset, as here, or a 🤗 Dataset)
    eval_dataset=test_dataset,           # evaluation dataset
)

# Train the model
trainer.train()
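
The `warmup_steps=500` argument above controls the learning-rate schedule. The sketch below gives a rough picture of what that means, assuming the Trainer's default settings (a linear schedule with a peak learning rate of 5e-5), a single device, and the 7,960-review training split created earlier. `linear_warmup_decay` is a hypothetical helper written for illustration, not a library function.

```python
import math

def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

train_size = 7960                              # int(0.8 * 9951) reviews
steps_per_epoch = math.ceil(train_size / 8)    # per_device_train_batch_size=8
total_steps = steps_per_epoch * 3              # num_train_epochs=3

for step in (0, 250, 500, total_steps):
    lr = linear_warmup_decay(step, peak_lr=5e-5, warmup_steps=500,
                             total_steps=total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions the run has 2,985 optimization steps, so the warmup covers roughly the first sixth of training before the learning rate starts to decay.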
