The dataset is taken from Kaggle: [Hate Speech and Offensive Language Detection](https://www.kaggle.com/datasets/thedevastator/hate-speech-and-offensive-language-detection)

# Importing Dataset

In [1]:
# Automatically reload helper.py whenever it changes, so edits take effect without restarting the kernel
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv("../dataset/archive/train.csv")
df.head()

Unnamed: 0,count,hate_speech_count,offensive_language_count,neither_count,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


# Exploring Dataset

In [3]:
df.sample(20)

Unnamed: 0,count,hate_speech_count,offensive_language_count,neither_count,class,tweet
11571,3,0,3,0,1,If she wears these to dinner then she's paying...
21313,3,0,2,1,1,"Tell your bae ""ho tre testicoli"" it means ""I l..."
4574,3,0,3,0,1,@RoseGoldBenzo duh...lol. Me tweeting about ea...
4164,3,0,3,0,1,@MorganSmith_20 unfortunately you are correct ...
13741,3,0,3,0,1,Over bitches trying to act like I own them a p...
15853,3,0,0,3,2,RT @JoleenDoreen: I was in a Kik group once. B...
656,3,0,3,0,1,# That son of a bitch moment when it rains and...
9874,3,1,2,0,1,Holy fuck they some bomb ass bitches at Buc-ee's
3559,3,3,0,0,0,@JPantsdotcom @Todd__Kincannon @the__realtony ...
12598,3,0,3,0,1,Lol bitches be hella mad about they exs like w...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   count                     24783 non-null  int64 
 1   hate_speech_count         24783 non-null  int64 
 2   offensive_language_count  24783 non-null  int64 
 3   neither_count             24783 non-null  int64 
 4   class                     24783 non-null  int64 
 5   tweet                     24783 non-null  object
dtypes: int64(5), object(1)
memory usage: 1.1+ MB


In [5]:
df["class"].value_counts()

class
1    19190
2     4163
0     1430
Name: count, dtype: int64

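The dataset is heavily imbalanced: class 1 accounts for 19,190 of the 24,783 tweets (about 77%), class 2 for 4,163 (about 17%) and class 0 for only 1,430 (about 6%). Judging from the count columns in the samples above, class 0 corresponds to hate speech, class 1 to offensive language and class 2 to neither.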
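The shares can be checked with a few lines of plain Python. This is only a sketch: the `counts` dictionary is copied by hand from the `value_counts()` output above and is not read from `df`.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 19190, 2: 4163, 0: 1430}
total = sum(counts.values())

# Share of each class as a percentage of all tweets.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(shares)

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(round(imbalance_ratio, 1))
```

With a majority class this dominant, plain accuracy can look high even when the minority class is rarely predicted correctly, so per-class metrics are worth inspecting as well.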

In [6]:
df.isnull().sum()

count                       0
hate_speech_count           0
offensive_language_count    0
neither_count               0
class                       0
tweet                       0
dtype: int64

In [7]:
df["tweet"][1]

'!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!'

In [8]:
df["tweet"][100]

'"@ClicquotSuave: LMAOOOOOOOOOOO this nigga @Krillz_Nuh_Care http://t.co/AAnpSUjmYI" &lt;bitch want likes for some depressing shit..foh'

In [9]:
df["tweet"][200]

'"@NewsomeJade: I ain\'t never seen a bitch so obsessed with they nigga&#128514;" I\'m obsessed with mine &#128529;'

In [10]:
old_df = df.copy()

# Preprocessing Data

Preprocessing Techniques:

Before training a machine learning model, the raw tweets need to be cleaned. Typical steps are removing URLs, usernames/handles, numbers and punctuation, followed by tokenization, stop word removal and stemming or lemmatization. Which steps are needed depends on the analysis.

### Removing Retweets (RT)

Retweeted tweets contain the marker "RT", which carries no information about the content, so we remove it.

In [11]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_retweets_rt(x))
df["tweet"]

0        !!! @mayasolovely: As a woman you shouldn't co...
1        !!!!! @mleew17: boy dats cold...tyga dwn bad f...
2        !!!!!!! @UrKindOfBrand Dawg!!!! @80sbaby4life:...
3        !!!!!!!!! @C_G_Anderson: @viva_based she look ...
4        !!!!!!!!!!!!! @ShenikaRoberts: The shit you he...
                               ...                        
24778    you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like I ain...
24781                youu got wild bitches tellin you lies
24782    ~~Ruffled | Ntac Eileen Dahlia - Beautiful col...
Name: tweet, Length: 24783, dtype: object
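The implementation of `helper.remove_retweets_rt` is not shown in this notebook. A minimal sketch of one possible approach, assuming the marker always appears as a standalone upper-case token, could look like this:

```python
import re

# Hypothetical stand-in for helper.remove_retweets_rt (the real helper is not shown).
# Matches "RT" as a standalone, upper-case token plus any whitespace after it.
RT_PATTERN = re.compile(r"\bRT\b\s*")


def remove_retweets_rt(text: str) -> str:
    """Drop the retweet marker while leaving words such as 'START' untouched."""
    return RT_PATTERN.sub("", text)


print(remove_retweets_rt("!!! RT @mayasolovely: As a woman"))
```

The word boundaries (`\b`) matter here: without them the pattern would also eat the letters "RT" inside longer upper-case words.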

### Removing Emojis

In [12]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_emojis(x))
df["tweet"]

0        !!! @mayasolovely: As a woman you shouldn't co...
1        !!!!! @mleew17: boy dats cold...tyga dwn bad f...
2        !!!!!!! @UrKindOfBrand Dawg!!!! @80sbaby4life:...
3        !!!!!!!!! @C_G_Anderson: @viva_based she look ...
4        !!!!!!!!!!!!! @ShenikaRoberts: The shit you he...
                               ...                        
24778    you's a muthaf***in lie @LifeAsKing: @20_Pearl...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like I ain...
24781                youu got wild bitches tellin you lies
24782    ~~Ruffled | Ntac Eileen Dahlia - Beautiful col...
Name: tweet, Length: 24783, dtype: object
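In this dataset emojis show up as numeric HTML entities such as `&#128514;`, and the output above shows that `&#8220;` was removed as well. A rough sketch of a helper with that behaviour (an assumption, since `helper.remove_emojis` itself is not shown) is:

```python
import re

# Hypothetical stand-in for helper.remove_emojis (the real helper is not shown).
# Removes numeric HTML entities such as "&#128514;"; named entities like "&lt;"
# are deliberately left alone in this sketch.
ENTITY_PATTERN = re.compile(r"&#\d+;")


def remove_emojis(text: str) -> str:
    return ENTITY_PATTERN.sub("", text)


print(remove_emojis("lie &#8220;@LifeAsKing"))
```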

### Removing URLs

In [13]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_html_links(x))
df["tweet"]

0        !!! @mayasolovely: As a woman you shouldn't co...
1        !!!!! @mleew17: boy dats cold...tyga dwn bad f...
2        !!!!!!! @UrKindOfBrand Dawg!!!! @80sbaby4life:...
3        !!!!!!!!! @C_G_Anderson: @viva_based she look ...
4        !!!!!!!!!!!!! @ShenikaRoberts: The shit you he...
                               ...                        
24778    you's a muthaf***in lie @LifeAsKing: @20_Pearl...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like I ain...
24781                youu got wild bitches tellin you lies
24782    ~~Ruffled | Ntac Eileen Dahlia - Beautiful col...
Name: tweet, Length: 24783, dtype: object
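None of the rows displayed above contains a link, so the effect is not visible there. As an illustration only, a URL-stripping helper could be sketched as follows (the pattern is an assumption, not the code in `helper.py`):

```python
import re

# Hypothetical stand-in for helper.remove_html_links (the real helper is not shown).
# Matches anything starting with http://, https:// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_html_links(text: str) -> str:
    return URL_PATTERN.sub("", text)


print(remove_html_links("this http://t.co/AAnpSUjmYI now"))
```

Because the pattern stops only at whitespace, punctuation glued to the end of a link (for example a closing quote) is removed together with it.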

### Lowercasing

In [14]:
df["tweet"] = df["tweet"].apply(lambda x: x.lower())
df["tweet"]

0        !!! @mayasolovely: as a woman you shouldn't co...
1        !!!!! @mleew17: boy dats cold...tyga dwn bad f...
2        !!!!!!! @urkindofbrand dawg!!!! @80sbaby4life:...
3        !!!!!!!!! @c_g_anderson: @viva_based she look ...
4        !!!!!!!!!!!!! @shenikaroberts: the shit you he...
                               ...                        
24778    you's a muthaf***in lie @lifeasking: @20_pearl...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like i ain...
24781                youu got wild bitches tellin you lies
24782    ~~ruffled | ntac eileen dahlia - beautiful col...
Name: tweet, Length: 24783, dtype: object

### Remove Usernames

In [15]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_usernames(x))
df["tweet"]

0        !!! : as a woman you shouldn't complain about ...
1        !!!!! : boy dats cold...tyga dwn bad for cuffi...
2        !!!!!!!  dawg!!!! : you ever fuck a bitch and ...
3                      !!!!!!!!! :  she look like a tranny
4        !!!!!!!!!!!!! : the shit you hear about me mig...
                               ...                        
24778    you's a muthaf***in lie :   right! his tl is t...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like i ain...
24781                youu got wild bitches tellin you lies
24782    ~~ruffled | ntac eileen dahlia - beautiful col...
Name: tweet, Length: 24783, dtype: object
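The output above is consistent with a helper that deletes every `@handle` and leaves the surrounding punctuation in place. A minimal sketch under that assumption (the real `helper.remove_usernames` is not shown):

```python
import re

# Hypothetical stand-in for helper.remove_usernames (the real helper is not shown).
# A handle is "@" followed by letters, digits or underscores.
USERNAME_PATTERN = re.compile(r"@\w+")


def remove_usernames(text: str) -> str:
    return USERNAME_PATTERN.sub("", text)


print(remove_usernames("!!! @mayasolovely: as a woman"))
```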

### Remove Numbers

In [16]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_numbers(x))
df["tweet"]

0        !!! : as a woman you shouldn't complain about ...
1        !!!!! : boy dats coldtyga dwn bad for cuffin d...
2        !!!!!!!  dawg!!!! : you ever fuck a bitch and ...
3                      !!!!!!!!! :  she look like a tranny
4        !!!!!!!!!!!!! : the shit you hear about me mig...
                               ...                        
24778    you's a muthaf***in lie :   right! his tl is t...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!! dat nigguh like i aint ...
24781                youu got wild bitches tellin you lies
24782    ~~ruffled | ntac eileen dahlia - beautiful col...
Name: tweet, Length: 24783, dtype: object
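Note that in the output above periods disappear as well ("cold...tyga" becomes "coldtyga"), which suggests that the pattern in `helper.remove_numbers` also matches dots. The sketch below is a simpler, hypothetical variant that removes digits only:

```python
import re

# Hypothetical, digits-only variant of helper.remove_numbers
# (the real helper is not shown and appears to remove periods too).
DIGIT_PATTERN = re.compile(r"\d+")


def remove_numbers(text: str) -> str:
    return DIGIT_PATTERN.sub("", text)


print(remove_numbers("in the 1st place"))
```

Removing digits inside words leaves fragments such as "st" from "1st", which is what later shows up in the cleaned tweets.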

### Removing Punctuation

In [17]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_punctuations(x))
df["tweet"]

0          as a woman you shouldnt complain about clean...
1          boy dats coldtyga dwn bad for cuffin dat hoe...
2          dawg  you ever fuck a bitch and she start to...
3                                   she look like a tranny
4          the shit you hear about me might be true or ...
                               ...                        
24778    yous a muthafin lie    right his tl is trash  ...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like i aint fu...
24781                youu got wild bitches tellin you lies
24782    ruffled  ntac eileen dahlia  beautiful color c...
Name: tweet, Length: 24783, dtype: object
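A common way to strip punctuation in Python is `str.translate` with `string.punctuation`. The following sketch assumes ASCII punctuation only and is not necessarily what `helper.remove_punctuations` does:

```python
import string

# Hypothetical stand-in for helper.remove_punctuations (the real helper is not shown).
# The translation table maps every ASCII punctuation character to None (deletion).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuations(text: str) -> str:
    return text.translate(PUNCT_TABLE)


print(repr(remove_punctuations("!!! : as a woman you shouldn't")))
```

Deleting characters rather than replacing them with spaces joins contractions ("shouldn't" becomes "shouldnt") and leaves the original spaces behind, which is why a whitespace clean-up step follows.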

### Remove Unwanted Whitespace

In [18]:
tst = "@nameis_as-13 623 wew 3242nnk                                                  wrwrwrs124124 0.233232424242412133131313133 100000000000000.1111"
helper.remove_unwanted_whitespaces(tst)

'@nameis_as-13 623 wew 3242nnk wrwrwrs124124 0.233232424242412133131313133 100000000000000.1111'

In [19]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_unwanted_whitespaces(x))
df["tweet"]


0         as a woman you shouldnt complain about cleani...
1         boy dats coldtyga dwn bad for cuffin dat hoe ...
2         dawg you ever fuck a bitch and she start to c...
3                                   she look like a tranny
4         the shit you hear about me might be true or i...
                               ...                        
24778    yous a muthafin lie right his tl is trash now ...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like i aint fu...
24781                youu got wild bitches tellin you lies
24782    ruffled ntac eileen dahlia beautiful color com...
Name: tweet, Length: 24783, dtype: object
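Collapsing whitespace is usually a one-line regular expression. In the output above some rows still start with a space, so the real helper apparently does not trim the ends; the hypothetical sketch below adds a `strip()` for that:

```python
import re

# Hypothetical variant of helper.remove_unwanted_whitespaces (the real helper is not shown).
WHITESPACE_PATTERN = re.compile(r"\s+")


def remove_unwanted_whitespaces(text: str) -> str:
    # Collapse runs of spaces, tabs and newlines into one space, then trim both ends.
    return WHITESPACE_PATTERN.sub(" ", text).strip()


print(repr(remove_unwanted_whitespaces("  as a   woman \n you ")))
```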

### Correcting Spelling Mistakes

Since this is social media text, many tweets are likely to contain spelling mistakes, so we try to correct them.

In [20]:
helper.correct_spelling_mistakes("ceertain conditionas")

'certain conditions'

In [21]:
# df["tweet"] = df["tweet"].apply(lambda x: helper.correct_spelling_mistakes(x))
# df["tweet"]

### Tokenization

Word-level tokenization can capture the nuances of language more finely. It's beneficial if the classification task relies on understanding the specific words and phrases used in the text to determine whether it contains hate speech.

Sentence-level tokenization can be useful if the classification task focuses more on the overall context of the text rather than specific words.

For our model we will use word-level tokenization because:

1. **Granularity of Information**: Tweets are short and often contain specific words or phrases that are indicative of hate speech. Word-level tokenization lets the model work at that granularity and identify hate speech from the presence of specific words or combinations of words.

2. **Contextual Understanding**: Hate speech can manifest in various ways, and the context of specific words or phrases is crucial for accurate classification. Word-level tokens are the units from which such context can be built (for example, word combinations), which can improve the model's ability to differentiate between hate speech and non-hateful language.

3. **Flexibility in Handling Short Texts**: Tweets are often short and may not contain complete sentences. Word-level tokenization handles such texts well by breaking them down into individual words, so meaningful features can still be extracted even when sentence structure is missing.



In [22]:
df["tweet"] = df["tweet"].apply(lambda x: helper.tokenization(x))
df["tweet"]

0        [as, a, woman, you, shouldnt, complain, about,...
1        [boy, dats, coldtyga, dwn, bad, for, cuffin, d...
2        [dawg, you, ever, fuck, a, bitch, and, she, st...
3                             [she, look, like, a, tranny]
4        [the, shit, you, hear, about, me, might, be, t...
                               ...                        
24778    [yous, a, muthafin, lie, right, his, tl, is, t...
24779    [youve, gone, and, broke, the, wrong, heart, b...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781        [youu, got, wild, bitches, tellin, you, lies]
24782    [ruffled, ntac, eileen, dahlia, beautiful, col...
Name: tweet, Length: 24783, dtype: object

### Stop words removal

In [23]:
df["tweet"] = df["tweet"].apply(lambda x: helper.stopwords_removal(x))
df["tweet"]

0        [woman, shouldnt, complain, cleaning, house, m...
1        [boy, dats, coldtyga, dwn, bad, cuffin, dat, h...
2        [dawg, ever, fuck, bitch, start, cry, confused...
3                                     [look, like, tranny]
4        [shit, hear, might, true, might, faker, bitch,...
                               ...                        
24778    [yous, muthafin, lie, right, tl, trash, mine, ...
24779    [youve, gone, broke, wrong, heart, baby, drove...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781             [youu, got, wild, bitches, tellin, lies]
24782    [ruffled, ntac, eileen, dahlia, beautiful, col...
Name: tweet, Length: 24783, dtype: object
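Stop word removal amounts to filtering the token list against a set of very common words. The sketch below uses a tiny, hand-picked set for illustration; `helper.stopwords_removal` presumably relies on a full stop word list, which is not shown here.

```python
# Illustrative subset only; a real stop word list is much longer.
STOP_WORDS = frozenset({"a", "about", "and", "as", "be", "me", "or", "she", "the", "you"})


def stopwords_removal(tokens: list[str]) -> list[str]:
    # Keep only tokens that are not in the stop word set.
    return [token for token in tokens if token not in STOP_WORDS]


print(stopwords_removal(["as", "a", "woman", "you", "shouldnt", "complain", "about"]))
```

Because punctuation was removed earlier, contractions such as "shouldnt" no longer match entries like "shouldn't" in a standard list and therefore survive the filter.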

### Lemmatization

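Lemmatization reduces each word to its dictionary form (lemma). Note that the rows displayed below are identical to the stop word removal output, so it is worth verifying that `helper.lemmatize` has the intended effect.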

In [24]:
df["tweet"] = df["tweet"].apply(lambda x: helper.lemmatize(x))
df["tweet"]

0        [woman, shouldnt, complain, cleaning, house, m...
1        [boy, dats, coldtyga, dwn, bad, cuffin, dat, h...
2        [dawg, ever, fuck, bitch, start, cry, confused...
3                                     [look, like, tranny]
4        [shit, hear, might, true, might, faker, bitch,...
                               ...                        
24778    [yous, muthafin, lie, right, tl, trash, mine, ...
24779    [youve, gone, broke, wrong, heart, baby, drove...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781             [youu, got, wild, bitches, tellin, lies]
24782    [ruffled, ntac, eileen, dahlia, beautiful, col...
Name: tweet, Length: 24783, dtype: object

In [25]:
" ".join(df["tweet"][100])

'lmaooooooooooo nigga bitch want likes depressing shitfoh'

In [26]:
df["tweet"] = df["tweet"].apply(lambda x: " ".join(x))

In [27]:
df["class"].unique()

array([2, 1, 0], dtype=int64)

# Data Splitting

In [52]:
X = df.drop("class",axis=1)
y = df["class"]

In [53]:
X

Unnamed: 0,count,hate_speech_count,offensive_language_count,neither_count,tweet
0,3,0,0,3,woman shouldnt complain cleaning house man alw...
1,3,0,3,0,boy dats coldtyga dwn bad cuffin dat hoe st place
2,3,0,3,0,dawg ever fuck bitch start cry confused shit
3,3,0,2,1,look like tranny
4,6,0,6,0,shit hear might true might faker bitch told ya
...,...,...,...,...,...
24778,3,0,2,1,yous muthafin lie right tl trash mine bible sc...
24779,3,0,1,2,youve gone broke wrong heart baby drove rednec...
24780,3,0,3,0,young buck wan na eat dat nigguh like aint fuc...
24781,6,0,6,0,youu got wild bitches tellin lies


In [54]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=41)
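Because the classes are imbalanced, a stratified split may be worth considering so that the rare class is represented in both parts. The sketch below uses toy arrays (`X_toy`, `y_toy` are made up and only mimic the class shares); it is a possible variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same imbalance as the dataset (77% / 17% / 6%).
y_toy = np.array([1] * 77 + [2] * 17 + [0] * 6)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=41, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Applied to the notebook's data, the equivalent change would be passing `stratify=y` to the existing call; the reported accuracy would then differ from the value shown further down.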

# Feature Extraction / Text Vectorization

This step converts each tweet into a numeric feature vector that the model can work with.

In [55]:
df.head()

Unnamed: 0,count,hate_speech_count,offensive_language_count,neither_count,class,tweet
0,3,0,0,3,2,woman shouldnt complain cleaning house man alw...
1,3,0,3,0,1,boy dats coldtyga dwn bad cuffin dat hoe st place
2,3,0,3,0,1,dawg ever fuck bitch start cry confused shit
3,3,0,2,1,1,look like tranny
4,6,0,6,0,1,shit hear might true might faker bitch told ya


In [56]:
# binary=True because we only care whether a given word occurs in a tweet,
# not how many times it occurs.

cv = CountVectorizer(binary=True,max_features=1000)
X_train_trf = cv.fit_transform(X_train["tweet"])
X_test_trf = cv.transform(X_test["tweet"])

In [40]:
len(cv.vocabulary_)

1000
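To make the effect of `binary=True` concrete, here is a small sketch on two made-up documents (`docs` is illustrative and not part of the dataset). It compares the default counts with the binary presence indicators:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "good" appears twice in the first one.
docs = ["good good movie", "bad movie"]

# Default behaviour: each cell holds the number of occurrences.
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True: each cell is 1 if the word occurs at all, otherwise 0.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary: bad, good, movie.
print(counts)
print(binary)
```

`max_features=1000` additionally restricts the vocabulary to the 1,000 terms with the highest frequency across the training corpus, which is why `len(cv.vocabulary_)` is exactly 1000.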

In [64]:
# X_train_trf[0].toarray()

# Model Building

In [42]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [43]:
xgb = XGBClassifier()
xgb.fit(X_train_trf,y_train)
y_pred = xgb.predict(X_test_trf)
accuracy_score(y_test,y_pred)

0.9043776477708292

In [63]:
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=5)

accuracy_list = []

for train_idx, test_idx in k_fold.split(X):
    # Fit a fresh vectorizer on each training fold so no test-fold vocabulary leaks in
    cv = CountVectorizer(binary=True, max_features=1000)

    # KFold yields positional indices, so select rows with iloc
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    X_train_trf = cv.fit_transform(X_train["tweet"])
    X_test_trf = cv.transform(X_test["tweet"])

    xgb = XGBClassifier()
    xgb.fit(X_train_trf, y_train)
    y_pred = xgb.predict(X_test_trf)

    # Compute the fold accuracy once and reuse it
    fold_accuracy = accuracy_score(y_test, y_pred)
    print(f"accuracy is {fold_accuracy}")
    accuracy_list.append(fold_accuracy)

print(f"Mean Accuracy is {np.mean(accuracy_list)}")

accuracy is 0.8721000605204761
accuracy is 0.8763364938470849
accuracy is 0.9211216461569498
accuracy is 0.9188861985472155
accuracy is 0.9130347054075868
Mean Accuracy is 0.9002958208958628
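The fold accuracies range from about 0.87 to 0.92. `KFold` without shuffling cuts the data into consecutive blocks, so part of that spread may reflect the order of the rows in the file. One possible alternative is a shuffled, stratified split wrapped in a pipeline. The sketch below runs on a made-up toy corpus and uses `LogisticRegression` purely as a lightweight stand-in for `XGBClassifier`; none of the names or data come from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for df["tweet"] and df["class"] (hypothetical data).
texts = np.array(["good day"] * 10 + ["bad day"] * 10 + ["ugly night"] * 5)
labels = np.array([0] * 10 + [1] * 10 + [2] * 5)

# The pipeline refits the vectorizer inside every fold, so no test-fold
# vocabulary leaks into training.
model = make_pipeline(
    CountVectorizer(binary=True, max_features=1000),
    LogisticRegression(max_iter=1000),  # stand-in; the notebook uses XGBClassifier
)

# shuffle=True breaks any ordering in the file; stratification keeps the
# class proportions similar in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=41)
scores = cross_val_score(model, texts, labels, cv=skf, scoring="accuracy")

print(scores)
print(scores.mean())
```

On the real data the same structure would take `df["tweet"]` and `df["class"]` as inputs, with `XGBClassifier()` as the final pipeline step.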
