# Fine-tune a Hugging Face Transformer on a Custom Dataset
In this experiment, we fine-tune a transformer model (BERT) on the SMS Spam Collection dataset.

In [1]:
!pip install transformers[torch]
!pip install accelerate -U



## Mount Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## 1. Imports

In [3]:
import pandas as pd
import numpy as np
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from transformers import TrainingArguments, Trainer
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer, BertForSequenceClassification

## 2. Load the dataset
Here we use the SMS Spam Collection dataset from [Kaggle](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset), which contains SMS messages labeled as either ham (legitimate) or spam.

In [4]:
data = pd.read_csv("/content/drive/MyDrive/Sms Spam Collection/spam.csv", encoding='latin-1')
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [5]:
data['v1'].value_counts()

ham     4825
spam     747
Name: v1, dtype: int64

In [6]:
df = pd.DataFrame()

In [7]:
df["text"] = data["v2"]
df["label"] = data["v1"]
df.head()

| | text | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |


First, encode the string labels as integers.

In [8]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'label' column
df['label'] = label_encoder.fit_transform(df['label'])

# ham = 0, spam = 1
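
The `ham = 0, spam = 1` mapping follows from how `LabelEncoder` orders classes: it sorts the unique labels and numbers them from zero. A minimal pure-Python sketch of that behavior (the helper name `encode_labels` is hypothetical and not part of scikit-learn):

```python
def encode_labels(labels):
    """Map each distinct label to an integer, following the sorted order of the labels."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

# The first five labels from df.head()
encoded, mapping = encode_labels(["ham", "ham", "spam", "ham", "ham"])
print(mapping)  # {'ham': 0, 'spam': 1}
print(encoded)  # [0, 0, 1, 0, 0]
```

Because the order is alphabetical, renaming a class could silently change which integer means spam, so it is worth checking the mapping before interpreting predictions.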

In [9]:
df.head()

| | text | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |


In [10]:
X = list(df["text"])
y = list(df["label"])

Split the dataset into training and validation sets. Stratifying on `y` keeps the ham/spam ratio the same in both splits.

In [11]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, stratify=y)

In [12]:
print(len(X_train))
print(len(y_train))
print(len(X_valid))
print(len(y_valid))

4457
4457
1115
1115
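
To illustrate what `stratify=y` does, here is a simplified sketch that samples the same fraction from each class separately. It is not scikit-learn's actual algorithm; the function name `stratified_split` and the fixed seed are assumptions made for the example. The class counts come from the `value_counts()` output above.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_size of each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# 4825 ham (label 0) and 747 spam (label 1)
labels = [0] * 4825 + [1] * 747
train_idx, test_idx = stratified_split(labels)

def spam_ratio(indices):
    """Fraction of the given examples that are spam."""
    return sum(labels[i] for i in indices) / len(indices)

print(round(spam_ratio(train_idx), 3), round(spam_ratio(test_idx), 3))  # 0.134 0.134
```

This sketch rounds per class, so its split sizes can differ by a sample from the 4457/1115 printed above.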


A transformer cannot take raw text as input, so the text must first be converted into numbers:

1. Tokenize the sentence.
2. Map each token to its index in the vocabulary.
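
The two steps can be sketched with a toy whitespace tokenizer. This is only an illustration: `BertTokenizer` uses WordPiece subword splitting and a much larger vocabulary, and the `vocab` ids and the `toy_encode` helper below are hypothetical.

```python
# Toy vocabulary with made-up ids (not BERT's real ones)
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "i": 4, "like": 5, "you": 6}

def toy_encode(sentence):
    """Lowercase, split on whitespace, wrap in [CLS]/[SEP], then look up each id."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids

tokens, ids = toy_encode("I like you")
print(tokens)  # ['[CLS]', 'i', 'like', 'you', '[SEP]']
print(ids)     # [2, 4, 5, 6, 3]
```

Words missing from the vocabulary fall back to `[UNK]` here; a subword tokenizer would instead split them into smaller known pieces.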

In [13]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sample_data = ["I like you","Alex!, play the music."]
tokenizer(sample_data, padding=True, truncation=True, max_length=512)


{'input_ids': [[101, 1045, 2066, 2017, 102, 0, 0, 0, 0], [101, 4074, 999, 1010, 2377, 1996, 2189, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}
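
In the output above, `101` and `102` are the `[CLS]` and `[SEP]` ids and `0` is the padding id. With `padding=True`, the shorter sentence is padded to the length of the longest one in the batch, and `attention_mask` marks real tokens with 1 and padding with 0. A minimal sketch that reproduces this (the `pad_batch` helper is hypothetical, not a `transformers` function):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded ids copied from the tokenizer output above
batch = pad_batch([
    [101, 1045, 2066, 2017, 102],
    [101, 4074, 999, 1010, 2377, 1996, 2189, 1012, 102],
])
print(batch["input_ids"][0])       # [101, 1045, 2066, 2017, 102, 0, 0, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```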

In [14]:
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_valid_tokenized = tokenizer(X_valid, padding=True, truncation=True, max_length=512)

In [15]:
X_train_tokenized.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

Next, wrap the tokenized inputs in a PyTorch `Dataset` so that the `Trainer` can consume them.

In [16]:
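class Dataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings (and optional labels) for use with Trainer."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One tensor per tokenizer field (input_ids, token_type_ids, attention_mask)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Explicit None check: a truthiness test would fail for array-like labels
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])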

In [17]:
train_dataset = Dataset(X_train_tokenized, y_train)
valid_dataset = Dataset(X_valid_tokenized, y_valid)

In [18]:
train_dataset[4]

{'input_ids': tensor([  101,  1045,  2228,  2049,  2521,  2062,  2084,  2008,  2021,  2424,
          2041,  1012,  4638,  8224,  7341,  2005,  1037,  2173,  2013,  2115,
         19568,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

## 3. Model Instantiation and Training
We now fine-tune the pretrained BERT model on our dataset.

In [19]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define a `compute_metrics` function that reports the accuracy, precision, recall, and F1 score of the model whenever the trainer runs evaluation.

In [20]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1-score": f1}

Define the training arguments and the trainer.

In [21]:
args = TrainingArguments(
    output_dir="output",
    num_train_epochs=20,
    per_device_train_batch_size=8,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics
)

In [22]:
trainer.train()

| Step | Training Loss |
|---|---|
| 500 | 0.0872 |
| 1000 | 0.0448 |
| 1500 | 0.0164 |
| 2000 | 0.0069 |
| 2500 | 0.0 |
| 3000 | 0.0 |
| 3500 | 0.0 |
| 4000 | 0.0 |
| 4500 | 0.0 |
| 5000 | 0.0 |



TrainOutput(global_step=11160, training_loss=0.006957207079967278, metrics={'train_runtime': 3954.826, 'train_samples_per_second': 22.54, 'train_steps_per_second': 2.822, 'total_flos': 1.09023149121096e+16, 'train_loss': 0.006957207079967278, 'epoch': 20.0})
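
The `global_step` value can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the variable names are only for illustration.

```python
import math

train_samples = 4457   # len(X_train) printed earlier
batch_size = 8         # per_device_train_batch_size
epochs = 20            # num_train_epochs
train_runtime = 3954.826  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_samples * epochs / train_runtime

print(steps_per_epoch, total_steps)  # 558 11160
print(round(samples_per_second, 2))  # 22.54
```

Both numbers agree with the `global_step` and `train_samples_per_second` reported in the `TrainOutput`.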

## 4. Model Evaluation

In [23]:
trainer.evaluate()

{'eval_loss': 0.09834945946931839,
 'eval_accuracy': 0.9928251121076234,
 'eval_precision': 0.9795918367346939,
 'eval_recall': 0.9664429530201343,
 'eval_f1-score': 0.9729729729729729,
 'eval_runtime': 12.263,
 'eval_samples_per_second': 90.924,
 'eval_steps_per_second': 11.416,
 'epoch': 20.0}
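
To make these metrics concrete, the sketch below recomputes them from confusion counts for the spam class. The counts are not logged by the notebook; they are inferred here as the values consistent with the reported metrics on the 1115 validation messages, so treat them as an assumption.

```python
# Confusion counts inferred from the reported metrics (assumption, not logged output)
tp, fp, fn = 144, 3, 5   # spam caught, ham flagged as spam, spam missed
total = 1115             # size of the validation split
tn = total - tp - fp - fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9928 0.9796 0.9664 0.973
```

Under this reading, the model misclassifies 8 of the 1115 validation messages. Because only about 13% of messages are spam, precision and recall say more about spam detection than accuracy alone.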

## 5. Inference

In [24]:
np.set_printoptions(suppress=True)

In [26]:
text = "Congratulations! You’ve won our grand prize. Go to google.com to claim now! Even if you did enter a contest, it’s best to try to contact the company directly before clicking any links in text messages."
inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt').to('cuda')
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().detach().numpy()
print(predictions)

[[0.00000398 0.99999607]]


In [30]:
pred = np.argmax(predictions, axis=1)
if pred == 0:
  print("Ham")
else:
  print("Spam")

Spam
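
The inference cells apply softmax to the model's logits and then take the argmax. A small standard-library sketch of that step follows; the logit values are hypothetical, since the notebook prints only the resulting probabilities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (ham, spam)
probs = softmax([-6.0, 6.5])
label = "Ham" if probs.index(max(probs)) == 0 else "Spam"
print(label)  # Spam
```

Softmax does not change which class has the highest score, so the argmax could be taken on the logits directly; the probabilities are mainly useful for reading the model's confidence.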


**Save model**

In [25]:
trainer.save_model('/content/drive/MyDrive/Colab Notebooks/Spam Detection/spam_detection_bert')

In [32]:
loaded_model = BertForSequenceClassification.from_pretrained("/content/drive/MyDrive/Colab Notebooks/Spam Detection/spam_detection_bert")
loaded_model.to('cuda')

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [33]:
text = "Congratulations! You’ve won our grand prize. Go to google.com to claim now! Even if you did enter a contest, it’s best to try to contact the company directly before clicking any links in text messages."
inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt').to('cuda')
outputs = loaded_model(**inputs)  # use the model reloaded from disk
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().detach().numpy()
print(predictions)

[[0.00000398 0.99999607]]


In [34]:
pred = np.argmax(predictions, axis=1)
if pred == 0:
  print("Ham")
else:
  print("Spam")

Spam


In [35]:
text = "Your Google account will expire later today. Please verify your login details at google.com/12 to prevent your account being deleted. Text messages asking you to verify other accounts are extremely suspicious. Companies with these accounts are unlikely to ever message you asking for these details."
inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt').to('cuda')
outputs = loaded_model(**inputs)  # use the model reloaded from disk
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().detach().numpy()
print(predictions)

[[0.00000082 0.99999917]]


In [36]:
pred = np.argmax(predictions, axis=1)
if pred == 0:
  print("Ham")
else:
  print("Spam")

Spam


In [37]:
text = "Just checking in to see if you're still on for the club meeting tomorrow at 7 PM. Let me know if there are any changes."
inputs = tokenizer(text, padding = True, truncation = True, return_tensors='pt').to('cuda')
outputs = loaded_model(**inputs)  # use the model reloaded from disk
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().detach().numpy()
print(predictions)

[[0.99999964 0.00000033]]


In [38]:
pred = np.argmax(predictions, axis=1)
if pred == 0:
  print("Ham")
else:
  print("Spam")

Ham


## 6. Conclusion
We fine-tuned the BERT model on the SMS Spam Collection dataset and reached a validation accuracy of about 99.3%, with an F1 score of about 0.97 on the spam class.