# Arabic Hate Speech - Detecting Offensive Language


#### Fine-Grained Hate Speech Detection on Arabic Twitter
**Disclaimer: Some examples have offensive language and hate speech!**


Detecting offensive language and hate speech is important for online safety and content moderation. Studies suggest that the presence of hate speech may be connected to hate crimes (Hate Speech Watch, 2014).

The dataset is the largest annotated corpus of Arabic tweets that is not biased towards specific topics, genres, or dialects. Each tweet was judged for offensiveness by 3 annotators through crowdsourcing. Offensive tweets were further classified into one of the hate speech types: Race, Religion, Ideology, Disability, Social Class, and Gender. Annotators also judged whether a tweet contains vulgar language or violence.

Hate speech is defined as any kind of offensive language (insults, slurs, threats, encouraging violence, impolite language, etc.) that targets a person or a group of people based on common characteristics such as race/ethnicity/nationality, religion/belief, ideology, disability/disease, social class, gender, etc.


Hate Speech types in our dataset are:
HS1 (race/ethnicity/nationality).
HS2 (religion/belief).
HS3 (ideology).
HS4 (disability/disease).
HS5 (social class).
HS6 (gender).


The corpus contains ~13K tweets in total: 35% are offensive and 11% are hate speech. Vulgar and violent tweets represent 1.5% and 0.7% of the whole corpus.


#### This task consists of 3 subtasks:

**Subtask A:** Detect whether a tweet is offensive or not.
Labels for this task are: OFF (Offensive) or NOT_OFF (Not Offensive).
Example: الله يلعنه على هالسؤال (May God curse him for this question!)

**Subtask B:** Detect whether a tweet has hate speech or not.
Labels are: HS (Hate Speech) or NOT_HS (Not Hate Speech).
Subtask B is more challenging than Subtask A because only 11% of the tweets are labeled as hate speech.
Example: أنتم شعب متخلف (You are a retarded people)
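
To illustrate why the 11% figure makes Subtask B harder, the sketch below scores a trivial baseline that always predicts NOT_HS on a synthetic set of 100 labels (11 HS, 89 NOT_HS). The label counts are made up to mirror the stated ratio, and macro-averaged F1 is used here as one common choice for imbalanced labels, not necessarily the metric used to rank systems.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Synthetic labels mirroring the 11% / 89% ratio (not real data)
y_true = ["HS"] * 11 + ["NOT_HS"] * 89
y_pred = ["NOT_HS"] * 100  # baseline that never predicts hate speech

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, ["HS", "NOT_HS"])
print(f"accuracy={accuracy:.2f}, macro F1={score:.2f}")
```

The baseline looks strong on accuracy while finding no hate speech at all, which is why accuracy alone is a poor guide for this subtask.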


**Subtask C:** Detect the fine-grained type of hate speech.
Labels are: HS1 (Race), HS2 (Religion), HS3 (Ideology), HS4 (Disability), HS5 (Social Class), and HS6 (Gender).
Each tweet receives a single hate speech type label based on the majority vote of the 3 annotators. When there is no majority label, the final label is determined by a domain expert.
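
The label aggregation described above can be sketched as follows. This is only an illustration of majority voting with an expert fallback; the function name `majority_label` and the `expert_label` argument are hypothetical, and the dataset creators' actual tooling is not described here.

```python
from collections import Counter


def majority_label(votes, expert_label=None):
    """Return the label chosen by more than half of the annotators.

    If no label has a strict majority, fall back to `expert_label`
    (the decision of a domain expert).
    """
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count > len(votes) / 2:
        return top_label
    return expert_label


# Two of three annotators agree, so the majority label wins
print(majority_label(["HS1", "HS1", "HS6"]))
# All three disagree, so the expert's label is used
print(majority_label(["HS1", "HS2", "HS6"], expert_label="HS2"))
```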


Data has been split into 70% for training, 10% for development, and 20% for testing.
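
The files provided for the task are already split, so no splitting is needed to follow this notebook. For readers who want to reproduce a 70/10/20 partition on their own data, here is a minimal sketch using a seeded shuffle; the function name `split_indices` and the seed are assumptions, and this is not necessarily how the organizers produced their split.

```python
import random


def split_indices(n, train_frac=0.7, dev_frac=0.1, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/dev/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n * train_frac)
    n_dev = round(n * dev_frac)
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]  # remainder, about 20%
    return train, dev, test


train_idx, dev_idx, test_idx = split_indices(100)
print(len(train_idx), len(dev_idx), len(test_idx))
```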

>### **Dataset**
>Download training data from: https://alt.qcri.org/resources/OSACT2022/OSACT2022-sharedTask-train.txt
>
>Download development data from: https://alt.qcri.org/resources/OSACT2022/OSACT2022-sharedTask-dev.txt
>
>Download test data from: https://alt.qcri.org/resources/OSACT2022/OSACT2022-sharedTask-test-tweets.txt
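
Once downloaded, the training and development files can be read as tab-separated text. The sketch below assumes one tweet per line with six tab-separated fields, using the column names this notebook assigns later (`id`, `text`, `subtask_a`, `subtask_b`, `subtask_c1`, `subtask_c2`). The two sample rows and the values in the last two columns are invented placeholders, not real data.

```python
import csv
import io

COLUMNS = ["id", "text", "subtask_a", "subtask_b", "subtask_c1", "subtask_c2"]


def read_labeled_tweets(handle):
    """Parse tab-separated lines into dicts keyed by COLUMNS."""
    # QUOTE_NONE keeps quotation marks inside tweets as ordinary characters
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Placeholder rows standing in for an opened training file
sample = io.StringIO(
    "1\tمثال أول\tNOT_OFF\tNOT_HS\tplaceholder_c1\tplaceholder_c2\n"
    "2\tمثال ثان\tOFF\tHS1\tplaceholder_c1\tplaceholder_c2\n"
)
rows = read_labeled_tweets(sample)
print(rows[1]["subtask_a"], rows[1]["subtask_b"])
```

With a real file, `sample` would be replaced by something like `open(path, encoding="utf-8", newline="")`.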

# Necessary Imports

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install emoji
!pip install pyarabic
!pip install tkseem
#!pip install Data_Fetching
import nltk
nltk.download('punkt')

Collecting emoji
  Downloading emoji-2.9.0-py2.py3-none-any.whl (397 kB)
Installing collected packages: emoji
Successfully installed emoji-2.9.0
Collecting pyarabic
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
Installing collected packages: pyarabic
Successfully installed pyarabic-0.6.15
Collecting tkseem
  Downloading tkseem-0.0.3-py3-none-any.whl (30.9 MB)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# Standard library
import os
import re
import string

# Text processing
import emoji
import nltk
import tkseem as tk
from gensim.models import FastText, Word2Vec
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, word_tokenize
from pyarabic.araby import normalize_ligature, strip_tashkeel

# Data handling and classical ML
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import SVC
from tqdm import tqdm

# Deep learning (all Keras names come from tensorflow.keras for consistency)
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# **Subtask A**

# Helper Functions

In [4]:
nltk.download("stopwords")
STOP_WORDS = set(nltk.corpus.stopwords.words("arabic"))

def tokenize_tweet(tweet):
    # create a TweetTokenizer object
    tknzr = TweetTokenizer()
    # tokenize the tweet
    tokens = tknzr.tokenize(tweet)
    return tokens

def remove_extra_spaces(words):
    """Removes extra whitespaces at the beginning and at the end of each word in a list"""
    cleaned_words = []
    for word in words:
        cleaned_word = ' '.join(word.split()).strip()
        cleaned_words.append(cleaned_word)
    return cleaned_words

def remove_urls(lst):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return [re.sub(pattern, '', item).strip() for item in lst if re.sub(pattern, '', item).strip() != '']

def remove_user_mentions(words):
    """Removes user mentions (@user) from a list of words"""
    cleaned_words = []
    for word in words:
        if not word.startswith('@'):
            cleaned_words.append(word)
    return cleaned_words

def remove_punctuation(lst):
    """Removes punctuation from a list of strings, including single punctuation characters"""
    translator = str.maketrans('', '', string.punctuation+'؟')
    result = []
    for item in lst:
        # Remove all punctuation characters
        item = item.translate(translator)
        # Remove any remaining single punctuation characters
        if item != '':
          result.append(item)
    return result

def remove_numbers(lst):
    """Removes numbers from a list of strings"""
    pattern = re.compile(r'\d+')
    return [re.sub(pattern, '', item) for item in lst if re.sub(pattern, '', item).strip() != '']

def remove_emojis(words):
    """Removes emojis from a list of words"""
    cleaned_words = []
    for word in words:
        cleaned_word = ''.join(c for c in word if c not in emoji.EMOJI_DATA)
        if cleaned_word != '':
            cleaned_words.append(cleaned_word)
    return cleaned_words

def remove_foreign_language(lst):
    pattern = re.compile(r'[^\u0600-\u06ff]+')
    return [re.sub(pattern, "", item) for item in lst if re.sub(pattern, "", item) != '']

def remove_tashkeel(lst):
    return [normalize_ligature(strip_tashkeel(word)) for word in lst]

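def remove_repeated_chars(lst):
    """Collapses runs of 3+ identical characters to two (e.g. elongated words)"""
    pattern = re.compile(r"(\w)\1{2,}")
    # Filter on the collapsed word: filtering on a copy with the whole run deleted
    # would silently drop tokens made up of a single repeated character.
    collapsed = [re.sub(pattern, r"\1\1", item).strip() for item in lst]
    return [item for item in collapsed if item != '']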

def remove_stop_words(lst):

    result = []
    for word in lst:
        if word not in STOP_WORDS:
            result.append(word)
    return result

def form_sentence(words):
    """Forms a sentence from a list of words"""
    sentence = ' '.join(words)
    return sentence

def clean_tweet(tweet, mode="ml"):
    """
    Cleans a single tweet and returns it as a space-joined string.

    Both modes tokenize the tweet and strip extra whitespace, URLs, user
    mentions, punctuation, numbers, emojis, and non-Arabic characters.
    mode="ml" additionally removes tashkeel (diacritics), collapses repeated
    characters, and removes stop words.
    """
    # steps shared by both modes
    words = tokenize_tweet(tweet)
    words = remove_extra_spaces(words)
    words = remove_urls(words)
    words = remove_user_mentions(words)
    words = remove_punctuation(words)
    words = remove_numbers(words)
    words = remove_emojis(words)
    words = remove_foreign_language(words)

    if mode == "ml":
        # extra normalization used by the "ml" pipeline
        words = remove_tashkeel(words)
        words = remove_repeated_chars(words)
        words = remove_stop_words(words)

    # form a new sentence
    return form_sentence(words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
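
To make the effect of the cleaning steps concrete, here is a compact, self-contained approximation that uses only `re`. It is a simplified stand-in, not the pipeline above: it works on the whole string instead of token lists, and it skips emoji handling via the `emoji` package, tashkeel removal, and stop-word filtering. The name `quick_clean` and the sample tweet are made up for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ARABIC_RE = re.compile(r"[^\u0600-\u06ff\s]+")  # anything outside the Arabic block
REPEAT_RE = re.compile(r"(\w)\1{2,}")  # a character repeated 3+ times


def quick_clean(tweet):
    """Strip URLs, mentions, and non-Arabic characters; shorten elongations."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = NON_ARABIC_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)  # keep at most two repeats
    return " ".join(text.split())  # normalize whitespace


print(quick_clean("@user هههههه مرحبا https://t.co/abc 123!!"))
```

Note that Arabic punctuation such as ؟ lies inside the `\u0600-\u06ff` range, so this sketch keeps it; the notebook's `remove_punctuation` strips that character explicitly.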


In [5]:
def tokenize_and_pad_tweets(data, datatype='train', max_words=None, max_seq_len=None, model_path='./Models/tokenizer_model.pkl'):
    """Tokenizes tweets and pads the sequences to a common length.

    Args:
        data (iterable of str): The cleaned tweets.
        datatype (str): 'train' fits and saves a new tokenizer; 'test' reuses the saved one.
        max_words (int, optional): Vocabulary size passed to the tokenizer.
        max_seq_len (int, optional): Padding length; required when datatype is 'test'.
        model_path (str): Tokenizer file location, relative to the project folder.

    Returns:
        tuple:
            numpy.ndarray: An array of padded sequences.
            int: The vocabulary size.
            int: The maximum sequence length.
            Tokenizer: The tokenizer object used for the tokenization.
    """
    # __file__ is not defined inside a notebook, so fall back to the working directory
    try:
        PROJECT_PATH = os.path.realpath(os.path.dirname(__file__))
    except NameError:
        PROJECT_PATH = os.getcwd()
    model_path = os.path.join(PROJECT_PATH, model_path)

    if max_words is None:
        tokenizer = tk.WordTokenizer()
    else:
        tokenizer = tk.WordTokenizer(vocab_size=max_words)

    if datatype == 'train':
        # The tokenizer trains from a text file, so write one tweet per line first
        path = os.path.join(PROJECT_PATH, './Data/tokenizer.txt')
        df = pd.DataFrame(data, columns=['tweet'])
        df.to_csv(path, sep='\n', header=False, index=False)
        tokenizer.train(path)

        sequences = [tokenizer.encode(sentence) for sentence in data]
        max_seq_len = max(len(seq) for seq in sequences)
        tokenizer.save_model(model_path)
    elif datatype == 'test' and max_seq_len is not None:
        # Reuse the tokenizer fitted on the training data
        tokenizer.load_model(model_path)
        sequences = [tokenizer.encode(sentence) for sentence in data]
    else:
        raise ValueError("datatype must be 'train', or 'test' with max_seq_len set")

    vocab_size = tokenizer.vocab_size
    sequences = pad_sequences(sequences, maxlen=max_seq_len, value=0, padding='post')
    return sequences, vocab_size, max_seq_len, tokenizer
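
For readers unfamiliar with post-padding, the sketch below mimics what `pad_sequences(..., padding='post', value=0)` does to a batch of token-id lists, in plain Python. It is a simplified stand-in, not the Keras implementation: it returns nested lists instead of a NumPy array, and the name `pad_post` is hypothetical. Sequences longer than `maxlen` keep their last `maxlen` ids, which matches Keras's default `truncating='pre'`.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Right-pad every sequence with `value` up to `maxlen` (a positive int).

    When `maxlen` is None, the length of the longest sequence is used.
    """
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop ids from the front if too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


# Token ids here are arbitrary example values
print(pad_post([[5, 3], [7, 8, 9, 4]]))
print(pad_post([[1, 2, 3, 4, 5]], maxlen=3))
```

This is why the training sequence length has to be passed along when encoding the dev and test sets: every split must be padded to the same width before it reaches the model.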


# Main Class

In [6]:
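# The test file only has 'id' and 'text' columns, so the Subtask A labels are read from a separate file
test_labels_path = '/content/drive/MyDrive/hate-speech/Data/subtask_A_labels.txt'
with open(test_labels_path, 'r', encoding='utf-8') as file:
    test_labels = file.readlines()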

In [7]:
class TextClassifier:
  def __init__(self, folder_path,label_name):
    self.folder_path = folder_path
    self.X_train, self.X_test, self.y_train, self.y_test, self.X_dev, self.y_dev = None, None, None, None, None, None
    self.model = None
    self.label_name = label_name
    self.max_seq_length  = None
    self.vocab_size = None

  def tokenize_and_pad_tweets(self, data, datatype='train', max_words=None, max_seq_len=None, model_path='/content/tokenizer_model.pkl'):
    """Encodes tweets with a tkseem WordTokenizer and post-pads them with zeros.

    'train' fits a tokenizer on `data` and saves it to `model_path`;
    'test' reloads that tokenizer and pads to the given `max_seq_len`.
    """
    sequences = []  # stays empty if neither branch below applies

    if max_words is None:
        tokenizer = tk.WordTokenizer()
    else:
        tokenizer = tk.WordTokenizer(vocab_size=max_words)
        self.vocab_size = max_words

    if datatype == 'train':
        # The tokenizer trains from a text file, so write one tweet per line first
        path = './tokenizer.txt'
        df = pd.DataFrame(data, columns=['tweet'])
        df.to_csv(path, sep='\n', header=False, index=False)
        tokenizer.train(path)
        sequences = [tokenizer.encode(sentence) for sentence in data]
        max_seq_len = max(len(seq) for seq in sequences)
        sequences = pad_sequences(sequences, maxlen=max_seq_len, value=0, padding='post')
        tokenizer.save_model(model_path)
    elif datatype == 'test' and max_seq_len is not None:
        # Reuse the tokenizer fitted on the training data
        tokenizer.load_model(model_path)
        sequences = [tokenizer.encode(sentence) for sentence in data]
        sequences = pad_sequences(sequences, maxlen=max_seq_len, value=0, padding='post')

    return sequences, max_seq_len, tokenizer

  def preprocess_data(self):
    # column names are 'id', 'text', 'subtask_a', 'subtask_b', 'subtask_c1', 'subtask_c2'
    labeled_columns = ['id', 'text', 'subtask_a', 'subtask_b', 'subtask_c1', 'subtask_c2']
    train_data_path = self.folder_path + '/train.txt'
    test_data_path = self.folder_path + '/test.txt'
    dev_data_path = self.folder_path + '/dev.txt'

    with open(train_data_path, 'r', encoding='utf-8') as file:
        train_tweets = file.readlines()
    self.train_data = pd.DataFrame([tweet.strip().split('\t') for tweet in train_tweets], columns=labeled_columns)
    with open(test_data_path, 'r', encoding='utf-8') as file:
        test_tweets = file.readlines()
    self.test_data = pd.DataFrame([tweet.strip().split('\t') for tweet in test_tweets], columns=['id', 'text'])
    with open(dev_data_path, 'r', encoding='utf-8') as file:
        dev_tweets = file.readlines()
    self.dev_data = pd.DataFrame([tweet.strip().split('\t') for tweet in dev_tweets], columns=labeled_columns)

    # Clean the tweet text; test.txt has no label columns, so y_test is not built here
    self.X_train = self.train_data.apply(lambda x: clean_tweet(x['text'], "ml"), axis=1)
    self.y_train = self.train_data.apply(lambda x: x[self.label_name], axis=1)
    self.X_test = self.test_data.apply(lambda x: clean_tweet(x['text'], "ml"), axis=1)
    self.X_dev = self.dev_data.apply(lambda x: clean_tweet(x['text'], "ml"), axis=1)
    self.y_dev = self.dev_data.apply(lambda x: x[self.label_name], axis=1)

    # Fit the tokenizer on the training tweets, then reuse it (and the training
    # sequence length) for the test and dev sets
    self.X_train, max_seq_length, _ = self.tokenize_and_pad_tweets(self.X_train, 'train', 1000)
    self.max_seq_length = max_seq_length
    self.X_test, _, _ = self.tokenize_and_pad_tweets(self.X_test, 'test', 1000, self.max_seq_length)
    self.X_dev, _, _ = self.tokenize_and_pad_tweets(self.X_dev, 'test', 1000, self.max_seq_length)

    # Convert labels to numerical values
    lb = LabelBinarizer()
    self.y_train = lb.fit_transform(self.y_train)
    self.y_dev = lb.transform(self.y_dev)

  def train(self, model_type):
    if model_type == 'SVM':
        self.model = SVC()
    elif model_type == 'NaiveBayes':
        self.model = MultinomialNB()
    elif model_type == 'RandomForest':
        self.model = RandomForestClassifier()
    elif model_type == 'LSTM':
        self.model = Sequential()
        # learned embedding over the padded token ids
        self.model.add(Embedding(input_dim=self.vocab_size, output_dim=150, input_length=self.max_seq_length))
        self.model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
        self.model.add(Dense(1, activation='sigmoid'))
        self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    else:
        print('Invalid model type')
        return

    if model_type == 'LSTM':
        # Keras models accept validation data and epoch/batch settings
        self.model.fit(self.X_train, self.y_train, validation_data=(self.X_dev, self.y_dev), epochs=10, batch_size=16)
    else:
        # scikit-learn estimators take only features and a 1-D label array
        self.model.fit(self.X_train, self.y_train.ravel())

  def test(self):
    # test.txt carries no labels, so return the predictions for scoring elsewhere
    predictions = self.model.predict(self.X_test)
    return predictions

  def run(self, model_type):
    self.preprocess_data()
    self.train(model_type)
    #self.test()


In [8]:
# Add the Folder that contains Data
folder_path = "/content/drive/MyDrive/hate-speech/Data"
#label names are 'id', 'text', 'subtask_a', 'subtask_b', 'subtask_c1', 'subtask_c2'
label_name = 'subtask_a'
classifier = TextClassifier(folder_path,label_name)
classifier.run('LSTM')  # Replace with the desired model type

Training WordTokenizer ...
Saving as pickle file ...
Loading as pickle file ...
Loading as pickle file ...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
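
The run above stops after training. As a possible next step, the sketch below shows how sigmoid outputs could be turned into OFF/NOT_OFF labels and compared with label lines such as those read into `test_labels`. Several things here are assumptions: the label file is taken to hold one label per line, the 0.5 threshold is a conventional default, and the probabilities are assumed to be flattened into a plain list first (for example with `predictions.ravel().tolist()`). `LabelBinarizer` sorts class names, so with these two labels NOT_OFF maps to 0 and OFF to 1. The numbers are invented for illustration.

```python
def probabilities_to_labels(probabilities, threshold=0.5,
                            positive="OFF", negative="NOT_OFF"):
    """Map each probability to the positive label when it reaches the threshold."""
    return [positive if p >= threshold else negative for p in probabilities]


def accuracy(gold_lines, predicted):
    """Fraction of predictions that match the gold labels (newlines stripped)."""
    gold = [line.strip() for line in gold_lines]
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted labels differ in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented values standing in for model outputs and label-file lines
probs = [0.9, 0.2, 0.6, 0.4]
gold_lines = ["OFF\n", "NOT_OFF\n", "NOT_OFF\n", "NOT_OFF\n"]

predicted = probabilities_to_labels(probs)
print(predicted, accuracy(gold_lines, predicted))
```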
