<a href="https://colab.research.google.com/github/rayaneghilene/BERT_Hate_Classification/blob/main/Twitter_Arabic_BERT_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of the performance of BERT classifiers on Arabic text data

Note: the model requires more storage to process Arabic data than it does for English or French.
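
The note above does not say where the extra cost comes from. One plausible contributor, assuming a byte-level tokenizer such as the one `roberta-base` uses, is that Arabic letters take two bytes each in UTF-8 while ASCII letters take one, so the same number of characters yields more byte-level units before any merges are applied. A minimal sketch of that difference (the helper name `utf8_bytes_per_char` is illustrative, not part of the notebook):

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)

arabic = "كراهية"   # the Arabic word used as a test prompt later in this notebook
english = "hatred"  # ASCII word of the same length, for comparison

print(len(arabic), len(arabic.encode("utf-8")))    # 6 characters, 12 bytes
print(len(english), len(english.encode("utf-8")))  # 6 characters, 6 bytes
print(utf8_bytes_per_char(arabic), utf8_bytes_per_char(english))
```

This is only one possible factor; padding every sequence to the longest one in the batch (`padding=True`) multiplies whatever per-example growth the tokenizer introduces.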

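Modify the following variables to match your data:
* **DATAPATH**: the path to your dataset (a CSV file with `text` and `label` columns) in your environment
* **NUMBER_OF_LABELS**: the number of distinct classes in the `label` column (2 in this notebook)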

In [8]:
DATAPATH = '/content/Arabic_Tweets_dataset.csv'
NUMBER_OF_LABELS = 2

## Data  

In [9]:
# Install the required libraries
!pip install -q accelerate==0.21.0 --progress-bar off
!pip install -q peft==0.4.0 --progress-bar off
!pip install -q bitsandbytes==0.40.2 --progress-bar off
!pip install -q transformers==4.31.0 --progress-bar off
!pip install -q trl==0.4.7 --progress-bar off


In [10]:
!pip install nltk --progress-bar off
!pip install keras --progress-bar off
!pip install tensorflow --progress-bar off
!pip install tensorflow_hub --progress-bar off
!pip install transformers --progress-bar off



In [11]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import log_loss

In [38]:
df = pd.read_csv(DATAPATH)
df = df.dropna()
df['label'] = df['label'].astype(int)
df.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | _ ابن مخيمات احسن ماكون ابن متعة 😂😂😂😂 | 1 |
| 1 | _ ابن مخيمات احسن ماكون ابن متعة 😂😂😂😂 | 1 |
| 2 | يقهروني ذي النوعية من اولاد شوارع ودي امسكهم و... | 1 |
| 3 | _ _ لانهم بدون اصول أولاد شوارع فبهذا يحاولون ... | 1 |
| 4 | صحت يديك على هالتغريده اشوفها ضاغطه جرذان واجد... | 1 |
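
The preview above shows the same tweet at indices 0 and 1, and every visible row has label 1, so it may be worth checking for duplicate texts and class balance before training. Below is a dependency-free sketch on a few made-up rows (the rows and the helper name `summarize` are illustrative assumptions, not taken from the dataset). With pandas, `df.duplicated(subset='text')` and `df['label'].value_counts()` give the same information.

```python
from collections import Counter

def summarize(rows):
    """Return (label_counts, duplicate_texts) for an iterable of (text, label) pairs."""
    rows = list(rows)
    label_counts = Counter(label for _, label in rows)
    text_counts = Counter(text for text, _ in rows)
    duplicates = sorted(text for text, n in text_counts.items() if n > 1)
    return label_counts, duplicates

# Hypothetical toy rows standing in for (text, label) pairs from the CSV.
sample = [("tweet a", 1), ("tweet a", 1), ("tweet b", 0), ("tweet c", 1)]
label_counts, duplicates = summarize(sample)
print(dict(label_counts))  # {1: 3, 0: 1}
print(duplicates)          # ['tweet a']
```

Duplicates matter here because a random split can place copies of the same tweet in both the training and validation sets, which inflates validation accuracy.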


In [20]:
# #creation of mapped labels
# df = pd.read_csv(DATAPATH)
# max_features = 10000
# sequence_length = 250
# vectorize_layer = tf.keras.layers.TextVectorization(
#     max_tokens=max_features,
#     output_mode='int',
#     output_sequence_length=sequence_length)

# vectorize_layer.adapt(df['text'])
# label_mapping = {subtype: label for label, subtype in enumerate(df['label'].unique())}
# num_classes = len(label_mapping)

# batch_size = 16
# def vectorize_text(text, label):
#     return vectorize_layer(text), label

# df['label'] = df['label'].map(label_mapping)

# columns = ['text', 'label']
# new_df = df[columns].copy()

# new_df1, nan = train_test_split(new_df, test_size=0.9999, stratify=df['label'], random_state=42)


# csv_file_path = '/content/arabic_dataset.csv'
# new_df1.to_csv(csv_file_path, index=False)

## I - RoBERTa

In [22]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

df = pd.read_csv(DATAPATH)

df = df.dropna()
df['label'] = df['label'].astype(int)



train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True, return_tensors="tf")
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True, return_tensors="tf")

train_labels = tf.convert_to_tensor(list(train_df['label']))
val_labels = tf.convert_to_tensor(list(val_df['label']))

model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=NUMBER_OF_LABELS)

optimizer = Adam(learning_rate=5e-5)
loss_fn = SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))

# Shuffle individual examples before batching so that batch membership changes each epoch
train_dataset = train_dataset.shuffle(buffer_size=1000).batch(16)
val_dataset = val_dataset.batch(64)

model.fit(train_dataset, epochs=3, validation_data=val_dataset)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x78a4148e4d00>
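
The order of `shuffle` and `batch` in a `tf.data` pipeline changes what gets shuffled. Applied after batching, `shuffle` reorders whole batches and each batch keeps the same members every epoch; applied before batching, it mixes individual examples. A plain-Python sketch of the difference, using lists in place of a `tf.data.Dataset` (the helper names are illustrative):

```python
import random

def batch(items, size):
    """Group consecutive items into lists of length `size` (the last may be shorter)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def batch_then_shuffle(items, size, rng):
    batches = batch(list(items), size)
    rng.shuffle(batches)  # reorders batches; membership is unchanged
    return batches

def shuffle_then_batch(items, size, rng):
    items = list(items)
    rng.shuffle(items)    # mixes individual examples first
    return batch(items, size)

rng = random.Random(42)
data = list(range(8))
print(batch_then_shuffle(data, 4, rng))  # some order of [0, 1, 2, 3] and [4, 5, 6, 7]
print(shuffle_then_batch(data, 4, rng))  # batches can mix examples from anywhere
```

Unlike this sketch, `tf.data` shuffles through a fixed-size buffer, so the mixing is only complete when `buffer_size` is at least as large as the number of elements being shuffled.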

In [30]:
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')



custom_prompt = "كراهية"  # Arabic for "hatred"


inputs = tokenizer(custom_prompt, return_tensors="tf")
outputs = model(inputs)  # pass input_ids together with the attention mask
predictions = tf.nn.softmax(outputs.logits, axis=-1).numpy()
predicted_class_index = tf.argmax(predictions, axis=-1).numpy()[0]

predicted_class_name = "contains hate speech" if predicted_class_index == 1 else "normal discourse"

# Print the result
print("Predicted Class Name:", predicted_class_name)

Predicted Class Name: contains hate speech
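
The cell above turns the model's logits into a label by applying a softmax and taking the index of the largest probability. A minimal pure-Python sketch of that step, using made-up logits rather than real model output (the label names mirror the cell above, where index 1 means hate speech):

```python
import math

LABELS = {0: "normal discourse", 1: "contains hate speech"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (label name, probability) for the most likely class."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return LABELS[index], probs[index]

# Hypothetical logits for a two-class model, not produced by the notebook.
name, confidence = predict_label([-0.3, 1.2])
print(name, round(confidence, 3))
```

Because softmax preserves the ordering of its inputs, the argmax of the logits and of the probabilities agree; the softmax is only needed when the confidence itself is reported.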


## II - ELECTRA

In [39]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import ElectraTokenizerFast, TFElectraForSequenceClassification
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

df = pd.read_csv(DATAPATH)

df = df.dropna()
df['label'] = df['label'].astype(int)

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
tokenizer = ElectraTokenizerFast.from_pretrained('google/electra-base-discriminator')
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True, return_tensors="tf")
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True, return_tensors="tf")

train_labels = tf.convert_to_tensor(list(train_df['label']))
val_labels = tf.convert_to_tensor(list(val_df['label']))

model = TFElectraForSequenceClassification.from_pretrained('google/electra-base-discriminator', num_labels=NUMBER_OF_LABELS)

optimizer = Adam(learning_rate=5e-5)
loss_fn = SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))

# Shuffle individual examples before batching so that batch membership changes each epoch
train_dataset = train_dataset.shuffle(buffer_size=1000).batch(16)
val_dataset = val_dataset.batch(64)

model.fit(train_dataset, epochs=3, validation_data=val_dataset)

Some layers from the model checkpoint at google/electra-base-discriminator were not used when initializing TFElectraForSequenceClassification: ['discriminator_predictions']
- This IS expected if you are initializing TFElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x78a3c87f38e0>
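
Accuracy is the only metric passed to `model.compile` in this notebook, while the import cell also brings in `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics`. As a sketch of what those functions compute in the binary case, here is a dependency-free version applied to made-up labels (the function name and the sample labels are illustrative assumptions; class 1 is treated as the positive, hate-speech class):

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels; in the notebook, y_pred would come from the argmax
# of the model's logits on the validation set.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(binary_prf(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Per-class precision and recall are more informative than accuracy when one label dominates the dataset.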

## III - DistilBERT

In [40]:
import tensorflow as tf
import pandas as pd
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from sklearn.model_selection import train_test_split

# Data preprocessing
df = pd.read_csv(DATAPATH)
df = df.dropna()

df['label'] = df['label'].astype(int)
data_texts = df["text"].to_list()
data_labels = df["label"].to_list()
# Fix the seed so the split matches the one used for the other models
train_texts, test_texts, train_labels, test_labels = train_test_split(data_texts, data_labels, test_size=0.2, random_state=42)

# Tokenisation
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

# Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

# DistilBERT Model
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=NUMBER_OF_LABELS)

# Optimizer and loss function
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])


train_dataset = train_dataset.batch(16)
test_dataset = test_dataset.batch(16)

# Training
model.fit(train_dataset, epochs=3, validation_data=test_dataset)

# Save the model
model.save_pretrained("./distilbert-fine-tuned")

# Load the fine-tuned model
model = TFDistilBertForSequenceClassification.from_pretrained("./distilbert-fine-tuned")


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['label'] = df['label'].astype(int)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
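You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.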

Epoch 1/3
Epoch 2/3
Epoch 3/3


Some layers from the model checkpoint at ./distilbert-fine-tuned were not used when initializing TFDistilBertForSequenceClassification: ['dropout_171']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./distilbert-fine-tuned and are newly initialized: ['dropout_191']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
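
The three training cells differ mainly in the checkpoint they load; the learning rate, epoch count, and training batch size are the same. One way to make that visible is to collect the settings in one place. The sketch below only records values that appear in the cells above; the `Experiment` dataclass and its field names are illustrative, not part of the notebook:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    name: str
    checkpoint: str
    learning_rate: float = 5e-5
    epochs: int = 3
    train_batch_size: int = 16
    val_batch_size: int = 64
    shuffle: bool = True

EXPERIMENTS = [
    Experiment("RoBERTa", "roberta-base"),
    Experiment("ELECTRA", "google/electra-base-discriminator"),
    # The DistilBERT cell uses a validation batch size of 16 and does not shuffle.
    Experiment("DistilBERT", "distilbert-base-uncased",
               val_batch_size=16, shuffle=False),
]

for exp in EXPERIMENTS:
    print(f"{exp.name:<11} {exp.checkpoint:<36} lr={exp.learning_rate} epochs={exp.epochs}")
```

Listing the settings side by side also makes it easier to see which differences between runs come from the model and which come from the data pipeline.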
