<a href="https://colab.research.google.com/github/suyeshrimal/FineTuning-BERT/blob/main/FineTuningBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data cleaning and tokenization

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

1. AutoTokenizer -> Loads a pretrained tokenizer that converts raw text into token IDs suitable for Transformer models.
2. AutoModelForSequenceClassification -> Loads a pretrained Transformer model (e.g., BERT, RoBERTa) with a classification head on top.
3. DataCollatorWithPadding -> A utility that pads the sequences in each batch to the same length when batches are assembled during training.
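
To make the padding idea in item 3 concrete, here is a minimal plain-Python sketch of padding a batch to its longest sequence. It only illustrates the concept and is not the library's implementation; the helper name `pad_batch`, the sample token IDs, and the pad ID of 0 are assumptions chosen for illustration.

```python
# Illustrative sketch only: mimics the idea of dynamic padding,
# not the actual DataCollatorWithPadding implementation.
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks a padding position
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical token-ID sequences of different lengths
batch = pad_batch([[101, 1996, 3095, 102], [101, 3071, 102]])
print(batch["input_ids"])       # [[101, 1996, 3095, 102], [101, 3071, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps short batches small.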

In [2]:
# Example Dataset

data_dict = {
    "text": [
        "  The staff was very kind and attentive to my needs!!!  ",
        "The waiting time was too long, and the staff was rude. Visit us at http://hospitalreviews.com",
        "The doctor answered all my questions...but the facility was outdated.   ",
        "The nurse was compassionate & made me feel comfortable!! :) ",
        "I had to wait over an hour before being seen.  Unacceptable service! #frustrated",
        "The check-in process was smooth, but the doctor seemed rushed. Visit https://feedback.com",
        "Everyone I interacted with was professional and helpful.  "
    ],
    "label": ["positive", "negative", "neutral", "positive", "negative", "neutral", "positive"]
}


In [4]:
# Convert the dictionary to a DataFrame
data = pd.DataFrame(data_dict)
data

Unnamed: 0,text,label
0,The staff was very kind and attentive to my ...,positive
1,"The waiting time was too long, and the staff w...",negative
2,The doctor answered all my questions...but the...,neutral
3,The nurse was compassionate & made me feel com...,positive
4,I had to wait over an hour before being seen. ...,negative
5,"The check-in process was smooth, but the docto...",neutral
6,Everyone I interacted with was professional an...,positive


In [5]:
# Cleaning the data
import re

def clean_data(text):
  text = text.lower().strip()
  text = re.sub(r"http\S+", "", text)  # Remove URLs
  text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation and other special characters
  return text

data["cleaned_text"] = data["text"].apply(clean_data)
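
Because `clean_data` deletes punctuation outright, words joined only by punctuation get merged (for example, `questions...but` becomes `questionsbut`). The sketch below is one possible variant, not part of the original notebook: the name `clean_data_spaced` is hypothetical, and it replaces punctuation with a space and collapses whitespace as the last step.

```python
import re

def clean_data_spaced(text):
    """Hypothetical variant of clean_data that preserves word boundaries."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)      # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with a space
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace last
    return text

sample = "The doctor answered all my questions...but the facility was outdated.   "
print(clean_data_spaced(sample))
# the doctor answered all my questions but the facility was outdated
```

Stripping whitespace at the end also removes the trailing space that is otherwise left behind when a URL at the end of a review is deleted.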

In [7]:
# Encode the labels as integers
data["label"] = data["label"].astype("category").cat.codes
# Codes follow alphabetical category order: 0 -> negative, 1 -> neutral, 2 -> positive
data

Unnamed: 0,text,label,cleaned_text
0,The staff was very kind and attentive to my ...,2,the staff was very kind and attentive to my needs
1,"The waiting time was too long, and the staff w...",0,the waiting time was too long and the staff wa...
2,The doctor answered all my questions...but the...,1,the doctor answered all my questionsbut the fa...
3,The nurse was compassionate & made me feel com...,2,the nurse was compassionate made me feel comf...
4,I had to wait over an hour before being seen. ...,0,i had to wait over an hour before being seen ...
5,"The check-in process was smooth, but the docto...",1,the checkin process was smooth but the doctor ...
6,Everyone I interacted with was professional an...,2,everyone i interacted with was professional an...
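
The codes in the table above follow the alphabetical order of the label strings. A plain-Python sketch of the same mapping is shown below; the names `label2id` and `id2label` are illustrative, and an explicit dictionary like this is one way to keep the mapping visible when decoding predictions later.

```python
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral", "positive"]

# Sorted unique labels receive consecutive integer codes.
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
id2label = {idx: label for label, idx in label2id.items()}
encoded = [label2id[label] for label in labels]

print(label2id)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)   # [2, 0, 1, 2, 0, 1, 2]
```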


In [8]:
# Tokenize the cleaned text with a Hugging Face tokenizer
# Load the BERT tokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenization(text):
  return tokenizer(text, truncation=True, padding="max_length", max_length=128)

# Applying tokenization
data["tokenized"] = data["cleaned_text"].apply(tokenization)


In [10]:
data

Unnamed: 0,text,label,cleaned_text,tokenized
0,The staff was very kind and attentive to my ...,2,the staff was very kind and attentive to my needs,"[input_ids, token_type_ids, attention_mask]"
1,"The waiting time was too long, and the staff w...",0,the waiting time was too long and the staff wa...,"[input_ids, token_type_ids, attention_mask]"
2,The doctor answered all my questions...but the...,1,the doctor answered all my questionsbut the fa...,"[input_ids, token_type_ids, attention_mask]"
3,The nurse was compassionate & made me feel com...,2,the nurse was compassionate made me feel comf...,"[input_ids, token_type_ids, attention_mask]"
4,I had to wait over an hour before being seen. ...,0,i had to wait over an hour before being seen ...,"[input_ids, token_type_ids, attention_mask]"
5,"The check-in process was smooth, but the docto...",1,the checkin process was smooth but the doctor ...,"[input_ids, token_type_ids, attention_mask]"
6,Everyone I interacted with was professional an...,2,everyone i interacted with was professional an...,"[input_ids, token_type_ids, attention_mask]"


In [11]:
# Extract tokenized features
data["input_ids"] = data["tokenized"].apply(lambda x:x["input_ids"])
data["attention_mask"] = data["tokenized"].apply(lambda x:x["attention_mask"])

data = data.drop(columns=["tokenized"])

data

Unnamed: 0,text,label,cleaned_text,input_ids,attention_mask
0,The staff was very kind and attentive to my ...,2,the staff was very kind and attentive to my needs,"[101, 1996, 3095, 2001, 2200, 2785, 1998, 2012...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..."
1,"The waiting time was too long, and the staff w...",0,the waiting time was too long and the staff wa...,"[101, 1996, 3403, 2051, 2001, 2205, 2146, 1998...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,The doctor answered all my questions...but the...,1,the doctor answered all my questionsbut the fa...,"[101, 1996, 3460, 4660, 2035, 2026, 3980, 8569...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..."
3,The nurse was compassionate & made me feel com...,2,the nurse was compassionate made me feel comf...,"[101, 1996, 6821, 2001, 29353, 2081, 2033, 251...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ..."
4,I had to wait over an hour before being seen. ...,0,i had to wait over an hour before being seen ...,"[101, 1045, 2018, 2000, 3524, 2058, 2019, 3178...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
5,"The check-in process was smooth, but the docto...",1,the checkin process was smooth but the doctor ...,"[101, 1996, 4638, 2378, 2832, 2001, 5744, 2021...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..."
6,Everyone I interacted with was professional an...,2,everyone i interacted with was professional an...,"[101, 3071, 1045, 11835, 2098, 2007, 2001, 265...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ..."
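
The `attention_mask` column marks real tokens with 1 and padding with 0. The following plain-Python sketch shows how fixed-length padding and its mask relate. It is a simplification and not the tokenizer's code: the helper `pad_to_max_length` is hypothetical, and unlike the real tokenizer its truncation does not keep the closing special token.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad `token_ids` to exactly `max_length` entries."""
    ids = token_ids[:max_length]   # naive truncation (simplified)
    mask = [1] * len(ids)          # 1 for every real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical short sequence padded to length 8
ids, mask = pad_to_max_length([101, 1996, 3095, 102], max_length=8)
print(ids)   # [101, 1996, 3095, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```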


# Fine-tuning

In [12]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_data)
test_dataset = Dataset.from_pandas(test_data)

# Removing unnecessary columns
train_dataset = train_dataset.remove_columns(["text","cleaned_text"])
test_dataset = test_dataset.remove_columns(["text","cleaned_text"])

train_dataset


Dataset({
    features: ['label', 'input_ids', 'attention_mask', '__index_level_0__'],
    num_rows: 5
})
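
The `num_rows: 5` above follows from the split arithmetic. A small sketch, assuming scikit-learn rounds the test share up and gives the remaining rows to the training set:

```python
import math

n_samples = 7
test_size = 0.2

# Assumption: the test count is rounded up, the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 5 2
```

With only two test rows, any evaluation metric computed on this split is a very coarse estimate.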

In [15]:
%pip install --upgrade transformers




In [21]:
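# DataCollatorWithPadding pads each batch and converts it to tensors.
# The inputs here are already padded to max_length=128, so no extra padding is added.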

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
  learning_rate=0.0002,  # 2e-4; BERT fine-tuning more commonly uses 2e-5 to 5e-5
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  num_train_epochs=5,
  output_dir="./results",
  logging_dir="./logs",
  report_to="none",
  save_strategy="epoch",
)
# Load pre-trained BERT model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)

# Train the model
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


TrainOutput(global_step=5, training_loss=0.8788734436035156, metrics={'train_runtime': 99.9186, 'train_samples_per_second': 0.25, 'train_steps_per_second': 0.05, 'total_flos': 1644458860800.0, 'train_loss': 0.8788734436035156, 'epoch': 5.0})
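
The `global_step=5` in the output can be reproduced by hand. The sketch below assumes a single device and no gradient accumulation; the `logging_steps` value of 500 is assumed to be the `TrainingArguments` default, which would explain why the "Step, Training Loss" table is empty.

```python
import math

num_rows = 5       # training examples
batch_size = 16    # per_device_train_batch_size
num_epochs = 5     # num_train_epochs

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

logging_steps = 500  # assumed default logging interval
logged_rows = total_steps // logging_steps

print(steps_per_epoch, total_steps, logged_rows)  # 1 5 0
```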

In [23]:
# Evaluating
from sklearn.metrics import accuracy_score, f1_score

# Generate predictions
predictions = trainer.predict(test_dataset)
preds = predictions.predictions.argmax(-1)
labels = test_dataset['label']

# Calculate metrics
accuracy = accuracy_score(labels, preds)
f1 = f1_score(labels, preds, average='weighted')

print(f"Accuracy: {accuracy}, F1 Score: {f1}")

Accuracy: 0.5, F1 Score: 0.5
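
To show how accuracy and the weighted F1 score are computed, here is a plain-Python sketch of both metrics. The `y_true` and `y_pred` values are hypothetical and are not the notebook's actual predictions; they are just one two-example case that yields 0.5 for both metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    total = 0.0
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of this class
        total += f1 * support / len(y_true)
    return total

# Hypothetical two-example test set: one correct, one wrong prediction
y_true = [0, 1]
y_pred = [0, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))  # 0.5 0.5
```

With two test examples, accuracy can only be 0.0, 0.5, or 1.0, so the reported score says little about how well the model generalizes.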
