# Data Preprocessing
The first step in our hate speech detection model involves collecting and preprocessing the data. This step is crucial as the quality and cleanliness of the data significantly affect the model's performance.

We use the provided dataset (HateSpeechDetection.csv), which contains text data labeled as hate speech (1) or not hate speech (0).

In [1]:
import re
import pandas as pd
import numpy as np

data = pd.read_csv("HateSpeechDetection.csv")
data

Unnamed: 0,Text,Label
0,Damn I thought they had strict gun laws in Ger...,0
1,I dont care about what it stands for or anythi...,0
2,It's not a group it's an idea lol,0
3,So it's not just America!,0
4,The dog is a spectacular dancer considering he...,0
...,...,...
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17596 entries, 0 to 17595
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    17596 non-null  object
 1   Label   17596 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 275.1+ KB


In [3]:
data['Label'].value_counts()

Label
0    10496
1     7100
Name: count, dtype: int64

The dataset has 17,596 rows and two columns, 'Text' and 'Label'. Of these, 10,496 rows have label '0' (no hate speech) and 7,100 rows have label '1' (hate speech). The text must be cleaned before the next steps. The cleaning tasks are:
1. Removing extra spaces
2. Removing usernames
3. Removing hashtags
4. Handling contractions
5. Lowercasing
6. Removing punctuation
7. Removing URLs
8. Removing short words
9. Lemmatization
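
As a rough illustration of how these steps behave, the sketch below applies only the regex-based ones to a made-up sentence. Contraction handling and lemmatization are left out because they need third-party packages. The sample text and the helper name `quick_clean` are assumptions for illustration and are not part of the dataset or the pipeline.

```python
import re

def quick_clean(text):
    """Hypothetical helper: the regex-only cleaning steps, in list order."""
    text = re.sub(r"\s+", " ", text)            # 1. collapse runs of whitespace
    text = re.sub(r"@\S+", "", text)            # 2. drop @usernames
    text = text.replace("#", "")                # 3. drop the '#' but keep the tag text
    text = text.lower()                         # 5. lowercase
    text = re.sub(r"[^\w\s]", "", text)         # 6. strip punctuation
    text = re.sub(r"http\S+|www\S+", "", text)  # 7. drop URLs (punctuation already gone)
    # 8. drop words of one or two characters, but keep numbers
    words = [w for w in text.split() if len(w) > 2 or w.isnumeric()]
    return " ".join(words)

sample = "@someone   Check THIS out: https://example.com/page #BigNews, it is 5 pm!"
cleaned = quick_clean(sample)
print(cleaned)  # check this out bignews 5
```

Note that the short-word filter also drops words such as "it" and "is", which is why the cleaned text reads as a bag of content words rather than a sentence.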

In [5]:
pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289.9/289.9 kB 5.5 MB/s eta 0:00:00
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.7/110.7 kB 15.0 MB/s eta 0:00:00
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24


In [7]:
import nltk
import contractions
from nltk.stem import WordNetLemmatizer

# Resources needed for tokenization (punkt) and lemmatization (wordnet, omw-1.4).
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Create the lemmatizer once rather than on every call.
lemmatizer = WordNetLemmatizer()

def data_cleaning(text):
    # Remove extra spaces: collapse any run of whitespace (\s+) into a single space.
    text = re.sub(r'\s+', ' ', text)

    # Remove usernames: a handle such as "@user" is not a word that carries meaning, so drop it.
    text = re.sub(r"@\S+", "", text)

    # Remove hashtags: the words in a hashtag often carry useful context about the text,
    # so only the '#' symbol is removed and the tag text is kept.
    # Multi-word hashtags are written without spaces and therefore stay joined.
    text = re.sub(r'#', '', text)

    # Handle contractions: expand shortened forms (e.g., "don't" -> "do not", "I'm" -> "I am")
    # so that the full meaning of each word is available to the model.
    text = contractions.fix(text)

    # Lowercase: the model should treat "Hate" and "hate" as the same word.
    text = text.lower()

    # Remove punctuation: delete every character that is neither a word character
    # (\w: letters, digits, underscore) nor whitespace (\s); '^' negates the set.
    text = re.sub(r'[^\w\s]', '', text)

    # Remove URLs: URLs add no information when analysing text word by word.
    # Punctuation was stripped above, so a URL is now a single token starting with
    # "http" or "www" (e.g., "httpsexamplecom"); the whole token is removed.
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove short words: keep words longer than two characters, and keep numbers.
    text = ' '.join([word for word in text.split() if len(word) > 2 or word.isnumeric()])

    # Lemmatization: reduce each word to its WordNet base form (e.g., "laws" -> "law").
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

    return text

data['Text'] = data['Text'].apply(data_cleaning)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [8]:
data

Unnamed: 0,Text,Label
0,damn thought they had strict gun law germany,0
1,not care about what stand for anything it conn...,0
2,not group idea lol,0
3,not just america,0
4,the dog spectacular dancer considering ha two ...,0
...,...,...
17591,find rat nicer and cleaner than most chinese,1
17592,check out this niggar they hit thing like wild...,1
17593,this country ha become absolute shamble the am...,0
17594,aged 16 antisemitism bad aged 18 antisemitism ...,1


# Tokenization:


In [9]:
from nltk.tokenize import word_tokenize

data['Tokens']=data['Text'].apply(word_tokenize)

In [10]:
data

Unnamed: 0,Text,Label,Tokens
0,damn thought they had strict gun law germany,0,"[damn, thought, they, had, strict, gun, law, g..."
1,not care about what stand for anything it conn...,0,"[not, care, about, what, stand, for, anything,..."
2,not group idea lol,0,"[not, group, idea, lol]"
3,not just america,0,"[not, just, america]"
4,the dog spectacular dancer considering ha two ...,0,"[the, dog, spectacular, dancer, considering, h..."
...,...,...,...
17591,find rat nicer and cleaner than most chinese,1,"[find, rat, nicer, and, cleaner, than, most, c..."
17592,check out this niggar they hit thing like wild...,1,"[check, out, this, niggar, they, hit, thing, l..."
17593,this country ha become absolute shamble the am...,0,"[this, country, ha, become, absolute, shamble,..."
17594,aged 16 antisemitism bad aged 18 antisemitism ...,1,"[aged, 16, antisemitism, bad, aged, 18, antise..."


# Splitting training and test data

In [12]:
from sklearn.model_selection import train_test_split

X=data['Tokens']
y=data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
y_train.value_counts()

Label
0    8402
1    5674
Name: count, dtype: int64

# Embedding:
Embedding in the context of deep learning and natural language processing (NLP) is a way of representing words or phrases as dense vectors in a continuous vector space. These vectors capture semantic meanings and relationships between words. Embeddings transform the sparse, high-dimensional data of words into a lower-dimensional space, where similar words have similar vector representations.
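
To make "similar words have similar vector representations" concrete, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and their values are made up for illustration; real embeddings are learned from data and have many more dimensions.

```python
import math

# Toy 3-dimensional "embeddings" with made-up values, for illustration only.
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, 0.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
print(f"good vs great: {sim_close:.3f}")
print(f"good vs table: {sim_far:.3f}")
```

With these assumed values, the pair "good" and "great" scores much higher than "good" and "table", which is the behaviour a trained embedding is expected to show for related and unrelated words.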

# Word2Vec Skip-Gram
The skip-gram variant of Word2Vec predicts the context words that surround a target word. The model is trained to maximize the probability of those context words given the target word.
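
The training examples behind skip-gram can be pictured as (target, context) pairs taken from a sliding window. The sketch below builds such pairs for one short token list; the helper name `skipgram_pairs` is hypothetical, and library implementations such as gensim add further details (for example subsampling and negative sampling) that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["not", "group", "idea", "lol"]
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbours; a larger window, such as the value 6 used in this notebook, pairs each word with more distant words as well.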

In [19]:
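from gensim.models import Word2Vec
import numpy as np

# Train the skip-gram model (sg=1) once, on the training tokens only,
# so that the train and test embeddings share the same vector space.
w2v_model = Word2Vec(sentences=X_train, vector_size=200, window=6, min_count=1, workers=4, sg=1)

def get_word2vec_embeddings(tokens, word_vectors):
    # Average the vectors of the tokens that are in the vocabulary.
    embeddings = [word_vectors[word] for word in tokens if word in word_vectors]
    if embeddings:
        return np.mean(embeddings, axis=0)
    # No known token: fall back to a zero vector of the same size.
    return np.zeros(word_vectors.vector_size)

def word2vec_embedding_sg(texts, word_vectors):
    # One averaged vector per document.
    return np.array([get_word2vec_embeddings(tokens, word_vectors) for tokens in texts])

X_train_w2v = word2vec_embedding_sg(X_train, w2v_model.wv)
X_test_w2v = word2vec_embedding_sg(X_test, w2v_model.wv)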

# Handling Imbalanced data:
The training-label counts shown earlier (`y_train.value_counts()`) reveal an imbalance in the 'Label' column: there are 8,402 instances of label '0' but only 5,674 instances of label '1'. This can significantly affect model training, because a model trained on this data may be biased towards the majority class (non-hate speech) and may not perform as well at identifying hate speech.
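
The size of the imbalance can be worked out directly from the two training counts. The sketch below is only arithmetic on those reported numbers; the variable names are chosen for illustration.

```python
# Training-label counts reported by y_train.value_counts().
counts = {0: 8402, 1: 5674}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts[0] / counts[1]  # majority count / minority count

for label, share in shares.items():
    print(f"label {label}: {counts[label]} rows ({share:.1%})")
print(f"majority-to-minority ratio: {imbalance_ratio:.2f}")

# A classifier that always predicts the majority class would already reach
# this accuracy on the training labels without detecting any hate speech.
baseline_accuracy = max(counts.values()) / total
print(f"always-predict-0 accuracy: {baseline_accuracy:.1%}")
```

This is why accuracy alone is a weak guide here, and why precision, recall, and F1 for the hate speech class are worth reporting alongside it.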

# SMOTE:
The Synthetic Minority Oversampling Technique (SMOTE) is a way to oversample the minority class. Simply adding duplicate records of the minority class often doesn't add any new information to the model. SMOTE instead synthesizes new instances from the existing data.
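
The core idea of synthesizing an instance is interpolation between a minority-class point and one of its minority-class neighbours. The sketch below shows that single step on made-up 2-D points; it is a simplified illustration, not the imbalanced-learn implementation, and the helper name `smote_like_sample` is hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the line segment between two minority points."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two made-up 2-D minority-class points. A real run would pick `neighbor`
# from the k nearest minority-class neighbours of `point`.
point = [1.0, 2.0]
neighbor = [3.0, 6.0]

synthetic = smote_like_sample(point, neighbor, rng)
print(synthetic)
```

Because the new point lies between two real minority examples, it is similar to them without being an exact copy, which is what distinguishes this approach from plain duplication.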

In [20]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
X_smote, y_smote = smote.fit_resample(X_train_w2v,y_train)

print("Oversampled dataset shape using SMOTE:\n", y_smote.value_counts())

Oversampled dataset shape using SMOTE:
 Label
1    8402
0    8402
Name: count, dtype: int64


# Logistic Regression Model:
Logistic regression is a statistical model used for binary classification tasks. It estimates the probability that a given input belongs to a certain class.
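
The probability estimate comes from passing a weighted sum of the input features through the sigmoid function. The sketch below shows that calculation for one input; the weights, bias, and feature values are made up for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Made-up weights, bias, and features, for illustration only.
weights = [1.5, -2.0]
bias = 0.25
features = [1.0, 0.5]

p = predict_proba(features, weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
print(f"P(class 1) = {p:.3f}, predicted label = {label}")
```

In this notebook the feature vector would be the 200-dimensional averaged Word2Vec embedding of a text, and the weights are learned during `fit`.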

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score

# Train a Logistic Regression model
lr_clf = LogisticRegression()
lr_clf.fit(X_smote, y_smote)

# Make predictions
y_pred = lr_clf.predict(X_test_w2v)

# Evaluate the classifier
print("Logistic Regression")
# Precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)


# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# ROC-AUC (computed here from the hard 0/1 predictions rather than predicted probabilities)
roc_auc = roc_auc_score(y_test, y_pred)
print("ROC-AUC Score:", roc_auc)

#f1_score
f1 = f1_score(y_test, y_pred)
print("f1:", f1)

Logistic Regression
Precision: 0.7743853469531525
Recall: 0.9046283309957924
Accuracy: 0.7413068181818182
ROC-AUC Score: 0.7024478808751645
f1: 0.6049237983587339


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Hyperparameter Tuning

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score

# Train a Logistic Regression model with stronger L2 regularization (C=0.1; the default is C=1.0)
lr_clf = LogisticRegression(penalty='l2',C=0.1,solver='lbfgs')
lr_clf.fit(X_smote, y_smote)

# Make predictions
y_pred = lr_clf.predict(X_test_w2v)

# Evaluate the classifier
print("Logistic Regression")
# Precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)


# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# ROC-AUC (computed here from the hard 0/1 predictions rather than predicted probabilities)
roc_auc = roc_auc_score(y_test, y_pred)
print("ROC-AUC Score:", roc_auc)

#f1_score
f1 = f1_score(y_test, y_pred)
print("f1:", f1)

Logistic Regression
Precision: 0.7083824843610367
Recall: 0.7664796633941094
Accuracy: 0.8201136363636363
ROC-AUC Score: 0.7598396406750871
f1: 0.7466157205240176
