In [1]:
import pandas as pd
import numpy as np

In [2]:
# For this example, we will use the "Hate Speech and Offensive Language" dataset on Kaggle
# read csv file
data = pd.read_csv('labeled_data.csv')

In [3]:
# check how many tweets are in the dataset
len(data)

24783

In [4]:
# check out the dataset
data.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [4]:
# We are interested in the class and text columns
# class labels: 0 - hate speech, 1 - offensive language, 2 - neither (3 classes in total)
# we will rename the tweet column as "text" and class column as "label"
data_all = data[['tweet', 'class']].copy()
data_all.rename(columns={'tweet': 'text', 'class': 'label'}, inplace=True)

In [5]:
data_all.head()

Unnamed: 0,text,label
0,!!! RT @mayasolovely: As a woman you shouldn't...,2
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,1
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,1
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,1
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,1


In [5]:
# we divide data into stratified train and test splits with 80% and 20% ratios, respectively
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
X = data_all.drop('label', axis=1)
y = data_all.label

for train_index, test_index in skf.split(X, y):
    data_train = data_all.iloc[train_index]
    data_test = data_all.iloc[test_index]
    break

In [11]:
len(data_train)

19826

In [6]:
len(data_test)

4957

In [6]:
# We can also spare an evaluation set for training
# we divide data into stratified train and evaluation splits with 80% and 20% ratios, respectively
skf = StratifiedKFold(n_splits=5)
X = data_train.drop('label', axis=1)
y = data_train.label

for train_index, test_index in skf.split(X, y):
    data_train = data_train.iloc[train_index]
    data_eval = data_train.iloc[test_index]
    break

In [7]:
len(data_train)

15860

In [8]:
len(data_eval)

3966

In [8]:
# We will use HuggingFace's implementation of the base RoBERTa model: https://huggingface.co/roberta-base 
# We are dealing with tweets, so max token size of 256 should be enough for training
from Custom_LLM_Models import Transformer_LLM
llm_model = Transformer_LLM('roberta-base', label_no=3, max_token_size=256)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

In [9]:
# make sure to utilize GPU(s) for finetuning the language model
import torch
print(torch.cuda.is_available())

True


In [11]:
# train the transformer model with custom learning rate(lr), number of epochs (epoch_no) and batch size  
llm_model.train(data_train, data_eval, lr=1e-5, epoch_no=1, batch_size=16, save_dir="roberta_finetuned_model")

  0%|          | 0/16 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15860
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 199
  Number of trainable parameters = 124647939
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
0,No log,0.451117,0.85174


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3966
  Batch size = 16
Saving model checkpoint to roberta_finetuned_model\checkpoint-199
Configuration saved in roberta_finetuned_model\checkpoint-199\config.json
Model weights saved in roberta_finetuned_model\checkpoint-199\pytorch_model.bin
tokenizer config file saved in roberta_finetuned_model\checkpoint-199\tokenizer_config.json
Special tokens file saved in roberta_finetuned_model\checkpoint-199\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to roberta_finetuned_model
Configuration saved in roberta_finetuned_model\config.json
Model weights saved in robe

FINETUNING is complete. Model is saved to C:\Users\umit9\Desktop\Github projectsroberta_finetuned_model


In [12]:
# you can predict the labels of the new test set using the trained model
# notice only the text column in the test dataset is provided for the function
predictions = llm_model.predict_labels(data_test[['text']])



  0%|          | 0/5 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 4957
  Batch size = 16


In [13]:
# you can compute the common classification metrics: [accuracy, f1, recall, precision]
eval_metrics = llm_model.evaluate(data_test['label'].values, predictions)
print(eval_metrics)

{'accuracy': 0.8642323986282026, 'f1': 0.5516236828510586, 'recall': 0.5550414016570672, 'precision': 0.5505369100948146}


  _warn_prf(average, modifier, msg_start, len(result))


In [16]:
# you can also load a finetuned model by providing a loading path to the Transformer_LLM object
from Custom_LLM_Models import Transformer_LLM
llm_model = Transformer_LLM('roberta-base', label_no=3, max_token_size=128, load_dir="roberta_finetuned_model/")

loading file vocab.json
loading file merges.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file roberta_finetuned_model/config.json
Model config RobertaConfig {
  "_name_or_path": "roberta_finetuned_model/",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_t

In [18]:
# Without training, you can predict the labels of the new test set using the loaded finetuned model
# notice only the text column in the test dataset is provided for the function
predictions = llm_model.predict_labels(data_test[['text']])

  0%|          | 0/5 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 4957
  Batch size = 8


In [19]:
# you can again compute the common classification metrics: [accuracy, f1, recall, precision]
eval_metrics = llm_model.evaluate(data_test['label'].values, predictions)
print(eval_metrics)

{'accuracy': 0.8642323986282026, 'f1': 0.5516236828510586, 'recall': 0.5550414016570672, 'precision': 0.5505369100948146}


  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# That's all.