<a href="https://colab.research.google.com/github/thayeylolu/cyberbullying/blob/main/notebooks/hugging_face_fine_tuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Download libraries need to pretain the transformer

- transformers: contains the transformer to use
- datasets: to preprocess that data as a Datset type
- evaluate : to calculate the accuracy metric 

In [1]:
!pip install transformers datasets evaluate 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 30.2 MB/s 
[?25hCollecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
[K     |████████████████████████████████| 451 kB 52.4 MB/s 
[?25hCollecting evaluate
  Downloading evaluate-0.3.0-py3-none-any.whl (72 kB)
[K     |████████████████████████████████| 72 kB 115 kB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 58.3 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 55.0 MB/s 
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
[K     |█████

### Import libraries

In [19]:
from google.colab import files

In [2]:
import torch, random
from google.colab import files
import numpy as np
import pandas as pd
import time
from datasets import Dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
import os 
from transformers import AutoTokenizer
from sklearn import preprocessing
from transformers import TrainingArguments
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback


### Set random seed to ensure code reproducuiblity

In [3]:
np.random.seed(3040)
torch.manual_seed(3040)
random.seed(3040)


### Load data from google drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')
url = "drive/MyDrive/NLP/train_data.csv"

Mounted at /content/drive


### Processing
- define a label encoder to transform `cyberbullying-type` to numbers
- import the csv
- drop nan values by the records
- drop `unnamed :0` column
- select only  `cyberbullying_type` and `clean_txt_emoji` columns


In [5]:
le = preprocessing.LabelEncoder()
df = pd.read_csv(url)
df = df.dropna()
df = df.drop(columns = ["Unnamed: 0"])
df = df[['cyberbullying_type', 'clean_txt_emoji']]



### get unique `cyberbullying_type` and convert to a list

In [6]:
label_names = list(df['cyberbullying_type'].unique())
label_names

['not_cyberbullying',
 'gender',
 'religion',
 'other_cyberbullying',
 'age',
 'ethnicity']

- Assign `cyberbullying_type` to a new column named `label`
- fit and transform `label` with defined label transformer
- Assign `cleN_text_emoji` to a new column named `text` 
- call random seeds
- split data into train, test and validate 


In [7]:
df['label'] = df['cyberbullying_type']
df['label'] = le.fit_transform(df['label'])
df['text'] = df['clean_txt_emoji']

np.random.seed(3040)
torch.manual_seed(3040)
random.seed(3040)

train, test_valid = train_test_split(df,train_size= 0.70, random_state= 3040)
test, valid = train_test_split(test_valid,train_size= .50, random_state= 3040)



- call random seeds
- conver the trian,text, validate dataset to a `DataSet` type and remove unneccesary columns
- use GPU to train if it is available, else use a cpu
- initialize a tokenizer `DistilBertTokenizer` that uses `uncased` tokenizer. This means the tokenizer is trained on lower case words.
- define a `tokenize_function` to tokenize the texts and pad them with the maximun length of words in the `text` column
- map the `tokenize_function` to the train and valid dataset
- define the pretrained model by setting the number of classes to the lenght of unique classes in `labels`
- define the evaluation metirc and set it to accuracy


In [8]:
np.random.seed(3040)
torch.manual_seed(3040)
random.seed(3040)


train_dataset = Dataset.from_pandas(train).remove_columns(["__index_level_0__",  'clean_txt_emoji'])
valid_dataset = Dataset.from_pandas(valid).remove_columns(["__index_level_0__",  'clean_txt_emoji'])
test_dataset = Dataset.from_pandas(test).remove_columns(["__index_level_0__", 'clean_txt_emoji'])


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_valid = valid_dataset.map(tokenize_function, batched=True)

modelDistilBert = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=6)

metric = evaluate.load("accuracy")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

  0%|          | 0/34 [00:00<?, ?ba/s]

  0%|          | 0/8 [00:00<?, ?ba/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

- set the random seeds
- define a function `compute_metric` that calculates the accuracy of the predictions. It first converts the predictions to logits, the returns the accuracy score
- create a `TrainingArguments` class which contains all the hyperparameters to can tune the training process.
- Create a Trainer object: `trainer` with deined model, training arguments, training and test datasets, and evaluation function.


In [15]:
np.random.seed(3040)
torch.manual_seed(3040)
random.seed(3040)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="output",
    seed=3040,
    report_to = "none",
    evaluation_strategy="steps",
    label_names = label_names,
    save_total_limit=1,


)

trainer = Trainer(
    model=modelDistilBert,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    compute_metrics=compute_metrics,
 
)

PyTorch: setting up devices


In [16]:
start_t = time.time()
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, cyberbullying_type. If text, cyberbullying_type are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 33144
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 12429
  Number of trainable parameters = 66958086


Step,Training Loss
500,1.714
1000,1.7242
1500,1.7364
2000,1.7217
2500,1.7206
3000,1.7315
3500,1.7364
4000,1.7356
4500,1.7189
5000,1.7231


Saving model checkpoint to output/checkpoint-500
Configuration saved in output/checkpoint-500/config.json
Model weights saved in output/checkpoint-500/pytorch_model.bin
Saving model checkpoint to output/checkpoint-1000
Configuration saved in output/checkpoint-1000/config.json
Model weights saved in output/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to output/checkpoint-1500
Configuration saved in output/checkpoint-1500/config.json
Model weights saved in output/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to output/checkpoint-2000
Configuration saved in output/checkpoint-2000/config.json
Model weights saved in output/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to output/checkpoint-2500
Configuration saved in output/checkpoint-2500/config.json
Model weights saved in output/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to output/checkpoint-3000
Configuration saved in output/checkpoint-3000/config.json
Model weights saved in output/check

Step,Training Loss
500,1.714
1000,1.7242
1500,1.7364
2000,1.7217
2500,1.7206
3000,1.7315
3500,1.7364
4000,1.7356
4500,1.7189
5000,1.7231


Saving model checkpoint to output/checkpoint-8000
Configuration saved in output/checkpoint-8000/config.json
Model weights saved in output/checkpoint-8000/pytorch_model.bin
Saving model checkpoint to output/checkpoint-8500
Configuration saved in output/checkpoint-8500/config.json
Model weights saved in output/checkpoint-8500/pytorch_model.bin
Saving model checkpoint to output/checkpoint-9000
Configuration saved in output/checkpoint-9000/config.json
Model weights saved in output/checkpoint-9000/pytorch_model.bin
Saving model checkpoint to output/checkpoint-9500
Configuration saved in output/checkpoint-9500/config.json
Model weights saved in output/checkpoint-9500/pytorch_model.bin
Saving model checkpoint to output/checkpoint-10000
Configuration saved in output/checkpoint-10000/config.json
Model weights saved in output/checkpoint-10000/pytorch_model.bin
Saving model checkpoint to output/checkpoint-10500
Configuration saved in output/checkpoint-10500/config.json
Model weights saved in outp

TrainOutput(global_step=12429, training_loss=1.728004052274733, metrics={'train_runtime': 4836.5151, 'train_samples_per_second': 20.559, 'train_steps_per_second': 2.57, 'total_flos': 1.3172437963063296e+16, 'train_loss': 1.728004052274733, 'epoch': 3.0})

In [17]:

trainer.save_model("output")
end_t = time.time()
elapsed_time = end_t - start_t
print('Time taken to Train the model:', elapsed_time, 'seconds')

Saving model checkpoint to output
Configuration saved in output/config.json
Model weights saved in output/pytorch_model.bin


Time taken to Train the model: 5276.4992582798 seconds


In [24]:
# select random checkpoint
files.download('output/checkpoint-9000/pytorch_model.bin')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [21]:
'5'

'5'

### Prediciton

In [None]:
#fine_tuned_model = DistilBertForSequenceClassification.from_pretrained("output/checkpoint-3500")

In [None]:
# next step: follow this:
#https://discuss.huggingface.co/t/predicting-on-new-text-with-fine-tuned-multi-label-model/13046

from transformers import pipeline
#clf = pipeline("text-classification", tokenizer = tokenizer, model=fine_tuned_model)


In [None]:
#answer = clf("black people keep flooding our counrites. they shoudl go")
#answer