<a href="https://colab.research.google.com/github/sakhawat3003/Finetune-BanglaBERT-with-BANEmo-for-Sentiment-Analysis/blob/main/Finetune_BanglaBERT_with_BANEmo_for_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

I recently published the BANEmo dataset in an IEEE paper [*(link)*](https://ieeexplore.ieee.org/document/11171926). It contains Bangla comments annotated with emotion labels. To demonstrate its practical use, we are going to fine-tune the transformer-based *BanglaBERT* model for Bangla sentiment classification. After fine-tuning, we will publish the model on the Hugging Face Hub so that anyone can load it.

The dataset contains about 15k Bangla comments, meticulously labelled by multiple annotators. Based on its sentiment, each comment was assigned one of several labels: *Happiness, Sadness, Disgust, Anger, Fear, Surprise, Sarcasm*, etc. The dataset is imbalanced, so we will keep only the two largest, similarly sized categories, *Happiness* and *Sadness*.

## Import Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
import torch
from tqdm.auto import tqdm

In [None]:
import random
torch.manual_seed(42)
random.seed(42)
np.random.seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

In [None]:
bangla_comments=pd.read_csv(filepath_or_buffer="/content/drive/MyDrive/Dataset/Bangla Comments.csv")
bangla_comments.head()

Unnamed: 0,Comments,Ankon,Rizvi,Sourov,FinalTag
0,"বিশ্ববিদ্যালয়ে শিক্ষক রাজনীতি, ছাত্র রাজনীতি ...",Sadness,Sadness,Sadness,Sadness
1,পাকিস্তান যেই তালিকায় থাকে ওই তালিকা আমরা এমনে...,Anger,Anger,Anger,Anger
2,"সিংগাপুরের সাথে ভারত, পাকিস্তানের তুলনা কেন? আ...",Sarcasm,Sarcasm,Sarcasm,Sarcasm
3,আমাদের দেশের শিক্ষা প্রতিষ্ঠানগুলোতে পড়ালেখার...,Sadness,Sadness,Sadness,Sadness
4,আমাদের দেশের শিক্ষা প্রতিষ্ঠান রাজনীতিতে প্রথম...,Disgust,Disgust,Disgust,Disgust


The table shows the label assigned by each annotator (the *Ankon*, *Rizvi*, and *Sourov* columns). The *FinalTag* column holds the label that received the most votes from the annotators, and it is the one we use.
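
As a small illustration of majority voting, here is a minimal sketch. The helper `majority_label` is hypothetical, and the tie-breaking rule (first label seen wins) is an assumption made for this example; it is not necessarily how ties were resolved in BANEmo.

```python
from collections import Counter

def majority_label(votes):
    """Return the label with the most votes; ties go to the label seen first."""
    counts = Counter(votes)
    return counts.most_common(1)[0][0]

# A row like the first one in the table: all three annotators agree
unanimous = majority_label(["Sadness", "Sadness", "Sadness"])

# A hypothetical row where one annotator disagrees
split_vote = majority_label(["Anger", "Disgust", "Anger"])

print(unanimous, split_vote)
```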

Let's take a look at the column names.

In [None]:
bangla_comments.columns

Index(['  Comments', 'Ankon', 'Rizvi', 'Sourov', 'FinalTag'], dtype='object')

First, we need to clean up the column names and keep only the columns that are necessary.

In [None]:
bangla_comments.columns=bangla_comments.columns.str.strip()
bangla_comments.columns

Index(['Comments', 'Ankon', 'Rizvi', 'Sourov', 'FinalTag'], dtype='object')

In [None]:
bangla_comments01=bangla_comments[['Comments','FinalTag']]
bangla_comments01.head()

Unnamed: 0,Comments,FinalTag
0,"বিশ্ববিদ্যালয়ে শিক্ষক রাজনীতি, ছাত্র রাজনীতি ...",Sadness
1,পাকিস্তান যেই তালিকায় থাকে ওই তালিকা আমরা এমনে...,Anger
2,"সিংগাপুরের সাথে ভারত, পাকিস্তানের তুলনা কেন? আ...",Sarcasm
3,আমাদের দেশের শিক্ষা প্রতিষ্ঠানগুলোতে পড়ালেখার...,Sadness
4,আমাদের দেশের শিক্ষা প্রতিষ্ঠান রাজনীতিতে প্রথম...,Disgust


Check the number of comments in each sentiment class.

In [None]:
bangla_comments01['FinalTag'].value_counts()

FinalTag,count
Sadness,4180
Happiness,4130
Disgust,3441
Anger,1752
Fear,787
Surprise,352
Undefined,202
Sarcasm,155


We will keep the *Happiness* and *Sadness* classes for binary classification.

In [None]:
bangla_comments02=bangla_comments01[bangla_comments01['FinalTag'].isin(['Happiness','Sadness'])]

In [None]:
bangla_comments02['FinalTag'].value_counts()

FinalTag,count
Sadness,4180
Happiness,4130


Create a new *labels* column that maps Happiness to 1 and Sadness to 0.

In [None]:
bangla_comments02=bangla_comments02.copy()
bangla_comments02.loc[:,'labels']=bangla_comments02['FinalTag'].map({'Happiness':1, 'Sadness':0})
bangla_comments02.head()

Unnamed: 0,Comments,FinalTag,labels
0,"বিশ্ববিদ্যালয়ে শিক্ষক রাজনীতি, ছাত্র রাজনীতি ...",Sadness,0
3,আমাদের দেশের শিক্ষা প্রতিষ্ঠানগুলোতে পড়ালেখার...,Sadness,0
5,দুঃখজনক হলেও সত্য যে আমাদের দেশের বিশ্ববিদ্যাল...,Sadness,0
10,বিশ্বসেরা বিশ্ববিদ্যালয়গুলোতে ছাত্ররা পড়ালেখ...,Sadness,0
12,"শিক্ষাই আলো,,অনেক দিন ধরে জ্বলছে তো, তেল ফুরিয...",Sadness,0


Reset the index of the dataset.

In [None]:
bangla_comments02=bangla_comments02.reset_index(drop=True)

In [None]:
bangla_comments02.head()

Unnamed: 0,Comments,FinalTag,labels
0,"বিশ্ববিদ্যালয়ে শিক্ষক রাজনীতি, ছাত্র রাজনীতি ...",Sadness,0
1,আমাদের দেশের শিক্ষা প্রতিষ্ঠানগুলোতে পড়ালেখার...,Sadness,0
2,দুঃখজনক হলেও সত্য যে আমাদের দেশের বিশ্ববিদ্যাল...,Sadness,0
3,বিশ্বসেরা বিশ্ববিদ্যালয়গুলোতে ছাত্ররা পড়ালেখ...,Sadness,0
4,"শিক্ষাই আলো,,অনেক দিন ধরে জ্বলছে তো, তেল ফুরিয...",Sadness,0


In [None]:
bangla_comments02['labels'].value_counts()

labels,count
0,4180
1,4130


We don't need the *FinalTag* column. We can remove this column.

In [None]:
bangla_comments02.drop(columns=['FinalTag'], inplace=True)

## Split the Dataset into Train and Validation Sets

In [None]:
from sklearn.model_selection import train_test_split

train_df, validation_df=train_test_split(bangla_comments02, test_size=0.1, random_state=42)

In [None]:
validation_df['labels'].value_counts()

labels,count
1,421
0,410


## Load Dataset using Huggingface

We need to save the train set and the validation set in CSV format so that we can load them with the *load_dataset* function from Hugging Face.

In [None]:
train_df.to_csv('train.csv', index=False)
validation_df.to_csv('validation.csv', index=False)

Now we have train.csv and validation.csv in our Colab working directory.
Once saved, we can load them.

In [None]:
from datasets import load_dataset
raw_datasets=load_dataset('csv', data_files={'train':'train.csv', 'validation':'validation.csv'})


In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['Comments', 'labels'],
        num_rows: 7479
    })
    validation: Dataset({
        features: ['Comments', 'labels'],
        num_rows: 831
    })
})

Check out the first comment in the train and validation datasets.

In [None]:
print(raw_datasets['train']['Comments'][0])

সাব্বাশ! শুনে মনটা ভরে গেলো।


In [None]:
print(raw_datasets['validation']['Comments'][0])

দেশের অর্থনীতিতে অবদান রাখতে চায় মনে হয়, তাই এতো পরিশ্রম করছে 


## Load the Necessary Libraries

In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [None]:
import evaluate

## Load Transformer Model and Tokenizer

In [None]:
transformer_model='sagorsarker/bangla-bert-base'
tokenizer=AutoTokenizer.from_pretrained(transformer_model)


## Function for Tokenization

Define a function to tokenize each comment in the train and validation datasets.

In [None]:
def tokenize_function(dataset):
  # Ensure all comments are strings to avoid type errors
  comments_as_strings = [str(comment) if comment is not None else "" for comment in dataset['Comments']]
  return tokenizer(comments_as_strings, truncation=True)

#Dataset.map will apply the tokenize_function across all the rows of each split in the dataset
tokenized_dataset=raw_datasets.map(tokenize_function, batched=True)

Map:   0%|          | 0/7479 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/831 [00:00<?, ? examples/s]

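Hugging Face's *Dataset.map()* method applies our function across the dataset. Setting *batched=True* lets the tokenizer function process many sentences at once instead of one by one, which speeds up tokenization and is the recommended way to use Hugging Face tokenizers. Note the warning in the output above: because this checkpoint defines no maximum length, *truncation=True* has no effect here. Passing an explicit *max_length* (BERT models typically accept at most 512 tokens) would be needed to actually truncate long comments.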
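
To make the idea of batched mapping concrete, here is a toy sketch in plain Python. Both `map_batched` and `toy_tokenize` are hypothetical stand-ins written for this illustration; they are not the library's implementation. The point is only that, with batching, the function receives a dictionary of lists (one list per column) and returns new columns for the whole slice.

```python
def toy_tokenize(batch):
    # Hypothetical whitespace "tokenizer": returns one new column for the whole batch
    return {"num_tokens": [len(str(text).split()) for text in batch["Comments"]]}

def map_batched(columns, fn, batch_size=2):
    """Apply fn to slices of a column dict and collect the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; new ones are added alongside them
    return {**columns, **new_columns}

data = {"Comments": ["one two three", "four five", "six"], "labels": [1, 0, 1]}
result = map_batched(data, toy_tokenize)
print(result["num_tokens"])
```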

Check out the tokenized dataset. As you can see, three new features have been added: *'input_ids', 'token_type_ids', 'attention_mask'*.

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['Comments', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7479
    })
    validation: Dataset({
        features: ['Comments', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 831
    })
})

## Data Collator

The data collator pads the tokenized sentences to the length of the longest sentence in each batch.
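
The padding step can be sketched in a few lines of plain Python. This is an illustration of the idea, not the collator's actual code; `pad_batch` is a hypothetical helper, the token IDs are made up, and a padding ID of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized sentences of different lengths
padded = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padded length depends on the longest sentence in each batch, different batches can have different widths, which avoids padding every sentence to a global maximum.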

In [None]:
data_collator=DataCollatorWithPadding(tokenizer=tokenizer)

## Prepare data for training

Before writing our training loop, we need to define a few objects. The first are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our *tokenized_dataset*, to take care of some things that the Hugging Face *Trainer* would otherwise do for us automatically. Specifically:

1. Remove the columns the model does not expect (like *Comments*).
2. Keep the *labels* column name as it is, because the model expects an argument named *labels*.
3. Set the format of the datasets so they return PyTorch tensors instead of lists.
4. Optionally remove the *token_type_ids* column, since we have only one comment per row. Keeping it does no harm.

In [None]:
tokenized_dataset=tokenized_dataset.remove_columns(['Comments','token_type_ids'])
tokenized_dataset.set_format('torch')

In [None]:
tokenized_dataset['train'].column_names

['labels', 'input_ids', 'attention_mask']

## Data Loader

The *DataLoader* shuffles the training set, groups examples into batches (8 per batch with batch_size=8) instead of feeding one sentence at a time, and uses the data collator to pad each batch before it is fed to the model.
Each batch is a dictionary with keys such as:

* input_ids → tokenized sentence IDs

* attention_mask → indicates which tokens are real vs padding

* token_type_ids → segment IDs for sentence pairs (not needed for our Bangla dataset, so we removed this column above)

* labels → sentiment labels (0 or 1)

In [None]:
train_dataloader=DataLoader(dataset=tokenized_dataset['train'], shuffle=True, batch_size=8, collate_fn=data_collator)
evaluation_dataloader=DataLoader(dataset=tokenized_dataset['validation'], shuffle=False, batch_size=8, collate_fn=data_collator)  # no shuffling needed for evaluation

Let's have a look at the first batch from the training dataloader.

In [None]:
for batch in train_dataloader:
  break
{k: v.shape for k,v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 29]),
 'attention_mask': torch.Size([8, 29])}

## Build the Model

Now that we’re completely finished with data preprocessing, let’s turn to the model. We load the pretrained checkpoint with a sequence classification head and *num_labels=2*, one label per sentiment class.

In [None]:
model=AutoModelForSequenceClassification.from_pretrained(transformer_model, num_labels=2)

model.safetensors:   0%|          | 0.00/660M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make sure that everything will go smoothly during training, we pass our first batch of training data to this model. The batch contains 8 sentences, and the model returns 2 logits for each sentence: one for Sadness (label 0) and one for Happiness (label 1).

In [None]:
outputs=model(**batch)
print(outputs)

SequenceClassifierOutput(loss=tensor(0.6591, grad_fn=<NllLossBackward0>), logits=tensor([[-0.3682,  0.3030],
        [-0.1694,  0.0511],
        [-0.1947, -0.1816],
        [-0.3923, -0.3053],
        [-0.0563, -0.3472],
        [-0.4819, -0.6968],
        [-0.0133,  0.0332],
        [-0.4421, -0.3834]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
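
The logits printed above are unnormalized scores. A softmax turns each pair into probabilities, and the larger one gives the predicted class. The sketch below applies this to the first row of the output; `softmax` is a small helper written for illustration. Since the classification head is still untrained, the probabilities stay close to 0.5, which is consistent with a loss near ln 2 ≈ 0.693.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of logits from the output above: [Sadness score, Happiness score]
probs = softmax([-0.3682, 0.3030])
predicted = probs.index(max(probs))  # 0 = Sadness, 1 = Happiness

print(probs, predicted)
```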


We will want to use the GPU if we have access to one. To do this, we define a device we will put our model and our batches on.

In [None]:
device=torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
device

device(type='cuda')

## Optimizer

We’re just missing two things: an *optimizer* and a *learning rate scheduler*. Since we are replicating by hand what the *Trainer* does, we will use the same defaults. The optimizer used by the Trainer is *AdamW*, which is *Adam* with decoupled *weight decay regularization*.
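
To show what the weight decay adds to Adam, here is a minimal sketch of one update for a single scalar parameter. The function `adamw_step` is hypothetical and written only for illustration; the hyperparameter values are assumptions chosen to resemble common defaults, and the real optimizer operates on tensors with additional options.

```python
import math

def adamw_step(param, grad, m, v, step, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One illustrative AdamW update for a single scalar parameter."""
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient
    param = param * (1 - lr * weight_decay)
    # Adam moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the early steps
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Adam step scaled by the running gradient magnitude
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# With a zero gradient, only the weight decay changes the parameter
decayed, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, step=1)
# With a positive gradient, the parameter also moves against the gradient
updated, _, _ = adamw_step(1.0, grad=1.0, m=0.0, v=0.0, step=1)
print(decayed, updated)
```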

In [None]:
optimizer=AdamW(model.parameters(),lr=0.00005)

## Learning Rate Scheduler

Finally, the learning rate scheduler used by default is just a linear decay from the maximum value 0.00005 to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader). The Trainer uses three epochs by default, so we will follow that.
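
The step count and the linear decay described above can be reproduced with plain arithmetic. The sketch below assumes the 7,479 training rows and the batch size of 8 used earlier; `linear_lr` is a hypothetical helper written for illustration, not the library's scheduler.

```python
import math

def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batches_per_epoch = math.ceil(7479 / 8)  # the last, smaller batch still counts as a step
total_steps = 3 * batches_per_epoch      # 3 epochs
print(total_steps)  # 2805

# Learning rate at the start, halfway through, and at the end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```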

In [None]:
from transformers import get_scheduler

In [None]:
num_epochs=3
num_training_steps=num_epochs*len(train_dataloader)
print(num_training_steps)

2805


In [None]:
learning_scheduler=get_scheduler('linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

## Train the Model

We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library. Caution: the training will take a while.

In [None]:
progress_bar=tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
  for batch in train_dataloader:
    batch={k: v.to(device) for k,v in batch.items()}
    outputs=model(**batch)
    loss=outputs.loss
    loss.backward()

    optimizer.step()
    learning_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)


## Model Evaluation

After we train our transformer model on the train dataset, we need to evaluate the performance of our model on the validation set.

We will use a metric provided by the *Evaluate* library. Metrics can actually accumulate batches for us as we go over the prediction loop with the method *add_batch()*. Once we have accumulated all the batches, we can get the final result with *metric.compute()*. Here’s how to implement all of this in an evaluation loop.
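
The accumulate-then-compute pattern can be illustrated without any library. The class `BinaryMetrics` below is a hypothetical stand-in that mimics the *add_batch()* / *compute()* flow for accuracy and binary F1 (with class 1 as the positive class); it is a sketch of the idea, not the *Evaluate* implementation.

```python
class BinaryMetrics:
    """Accumulates predictions batch by batch, then reports accuracy and F1 for class 1."""

    def __init__(self):
        self.tp = self.fp = self.fn = self.tn = 0

    def add_batch(self, predictions, references):
        for pred, ref in zip(predictions, references):
            if pred == 1 and ref == 1:
                self.tp += 1
            elif pred == 1 and ref == 0:
                self.fp += 1
            elif pred == 0 and ref == 1:
                self.fn += 1
            else:
                self.tn += 1

    def compute(self):
        total = self.tp + self.fp + self.fn + self.tn
        denom = 2 * self.tp + self.fp + self.fn
        return {
            "accuracy": (self.tp + self.tn) / total,
            "f1": 2 * self.tp / denom if denom else 0.0,
        }

metrics = BinaryMetrics()
metrics.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])  # first batch
metrics.add_batch(predictions=[0, 0], references=[1, 0])              # second batch
scores = metrics.compute()
print(scores)
```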

In [None]:
accuracy=evaluate.load('accuracy')
f1_score=evaluate.load('f1')

model.eval()
for batch in evaluation_dataloader:
  batch={k: v.to(device) for k,v in batch.items()}
  with torch.no_grad():
    outputs=model(**batch)

  predictions=torch.argmax(outputs.logits, dim=-1)
  accuracy.add_batch(predictions=predictions, references=batch['labels'])
  f1_score.add_batch(predictions=predictions, references=batch['labels'])


In [None]:
print(f'Accuracy: {accuracy.compute()}')
print(f'F-1 Score: {f1_score.compute()}')

Accuracy: {'accuracy': 0.8351383874849578}
F-1 Score: {'f1': 0.839766081871345}


## Save the Model

In [None]:
model.save_pretrained("bangla_sentiment_model")
tokenizer.save_pretrained("bangla_sentiment_model")

('bangla_sentiment_model/tokenizer_config.json',
 'bangla_sentiment_model/special_tokens_map.json',
 'bangla_sentiment_model/vocab.txt',
 'bangla_sentiment_model/added_tokens.json',
 'bangla_sentiment_model/tokenizer.json')

Both the model and tokenizer are saved into a folder named *bangla_sentiment_model*.

The folder is created in the current working directory (where the Python script or notebook is running).

Inside it, we will find files like:

* config.json → model configuration

* model.safetensors → model weights

* tokenizer.json / vocab.txt → tokenizer vocabulary

* special_tokens_map.json → tokenizer special tokens

### Check the location

This will show you all the files saved.

In [None]:
import os
os.listdir("bangla_sentiment_model")

['special_tokens_map.json',
 'tokenizer_config.json',
 'config.json',
 'vocab.txt',
 'model.safetensors',
 'tokenizer.json']

### Load the saved model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bangla_sentiment_model")
tokenizer = AutoTokenizer.from_pretrained("bangla_sentiment_model")

### Save the model in Google drive

In [None]:
model.save_pretrained("/content/drive/MyDrive/bangla_sentiment_model")
tokenizer.save_pretrained("/content/drive/MyDrive/bangla_sentiment_model")

('/content/drive/MyDrive/bangla_sentiment_model/tokenizer_config.json',
 '/content/drive/MyDrive/bangla_sentiment_model/special_tokens_map.json',
 '/content/drive/MyDrive/bangla_sentiment_model/vocab.txt',
 '/content/drive/MyDrive/bangla_sentiment_model/added_tokens.json',
 '/content/drive/MyDrive/bangla_sentiment_model/tokenizer.json')

### Load the model from google drive

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("/content/drive/MyDrive/bangla_sentiment_model")
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/bangla_sentiment_model")

The tokenizer you are loading from '/content/drive/MyDrive/bangla_sentiment_model' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


### Evaluate again

Evaluate the loaded model again to check that the performance is still the same.

In [None]:
device=torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
device

device(type='cuda')

In [None]:
!pip install evaluate
import evaluate

In [None]:
accuracy=evaluate.load('accuracy')
f1_score=evaluate.load('f1')

model.eval()
for batch in evaluation_dataloader:
  batch={k: v.to(device) for k,v in batch.items()}
  with torch.no_grad():
    outputs=model(**batch)

  predictions=torch.argmax(outputs.logits, dim=-1)
  accuracy.add_batch(predictions=predictions, references=batch['labels'])
  f1_score.add_batch(predictions=predictions, references=batch['labels'])

### Model Accuracy

In [None]:
print(f'Accuracy: {accuracy.compute()}')
print(f'F-1 Score: {f1_score.compute()}')

Accuracy: {'accuracy': 0.8351383874849578}
F-1 Score: {'f1': 0.839766081871345}


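The scores match those obtained before saving, so the model was saved and reloaded correctly. Our fine-tuned BanglaBERT model reaches an accuracy of about 83.5% and an F1 score of about 0.84 on the validation set.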

## Publish the Model to Hugging Face

In this section, we will log in to Hugging Face and push our model to the Hugging Face Hub. We first need to create a repository on the Hub where the model will be stored.

### Install Hugging Face tools

In [None]:
!pip install huggingface_hub



### Log in to Hugging Face

In [None]:
from huggingface_hub import notebook_login
notebook_login()


### Create a Model Repository

We can either:

* Create a new repo manually at huggingface.co/new, or

* Do it programmatically:

In [None]:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id="bangla-sentiment-banglabert", private=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


RepoUrl('https://huggingface.co/sakhawat-hossen/bangla-sentiment-banglabert', endpoint='https://huggingface.co', repo_type='model', repo_id='sakhawat-hossen/bangla-sentiment-banglabert')

### Upload the Saved Model

Since we saved our model in Google Drive (/content/drive/MyDrive/bangla_sentiment_model), we can upload that folder:

In [None]:
api = HfApi()

api.upload_folder(
    folder_path="/content/drive/MyDrive/bangla_sentiment_model",
    repo_id="sakhawat-hossen/bangla-sentiment-banglabert"
)


CommitInfo(commit_url='https://huggingface.co/sakhawat-hossen/bangla-sentiment-banglabert/commit/33ed3e41755a7a17dea1d390aa168dafdbf4fab5', commit_message='Upload folder using huggingface_hub', commit_description='', oid='33ed3e41755a7a17dea1d390aa168dafdbf4fab5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sakhawat-hossen/bangla-sentiment-banglabert', endpoint='https://huggingface.co', repo_type='model', repo_id='sakhawat-hossen/bangla-sentiment-banglabert'), pr_revision=None, pr_num=None)

### Reload From Hub

Once our Bangla sentiment analysis model is uploaded to the Hugging Face Hub, anyone can load it directly:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sakhawat-hossen/bangla-sentiment-banglabert")
tokenizer = AutoTokenizer.from_pretrained("sakhawat-hossen/bangla-sentiment-banglabert")


### Check the model output

In [None]:
import torch
text = "আজ আমি খুব খুশি।"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()

print("Prediction:", "Happiness 😀" if prediction == 1 else "Sadness 😢")

Prediction: Happiness 😀
