The dataset is taken from Kaggle: [Link](https://www.kaggle.com/datasets/thedevastator/hate-speech-and-offensive-language-detection)

In [56]:
# Automatically re-import helper.py whenever it changes, so edited functions are picked up without restarting the kernel
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [57]:
df = pd.read_csv("../dataset/archive/train.csv")
df.head()

| | count | hate_speech_count | offensive_language_count | neither_count | class | tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


# Exploring Dataset

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   count                     24783 non-null  int64 
 1   hate_speech_count         24783 non-null  int64 
 2   offensive_language_count  24783 non-null  int64 
 3   neither_count             24783 non-null  int64 
 4   class                     24783 non-null  int64 
 5   tweet                     24783 non-null  object
dtypes: int64(5), object(1)
memory usage: 1.1+ MB


In [59]:
df["class"].value_counts()

class
1    19190
2     4163
0     1430
Name: count, dtype: int64

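This dataset is imbalanced: class 1 accounts for 19,190 of the 24,783 tweets (about 77%), class 2 for 4,163 (about 17%), and class 0 for only 1,430 (about 6%).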
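
To put numbers on the imbalance, the counts above can be turned into percentages and into inverse-frequency class weights. This is a minimal sketch: the counts are copied from the `value_counts()` output, and the weighting formula `n_samples / (n_classes * class_count)` is one common heuristic, not something this notebook applies. The label meanings in the comments (0 = hate speech, 1 = offensive language, 2 = neither) are an assumption, although they agree with the per-annotator count columns shown by `df.head()`.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
# Assumed label meanings: 0 = hate speech, 1 = offensive language, 2 = neither.
class_counts = pd.Series({1: 19190, 2: 4163, 0: 1430}, name="count")

# Share of each class as a percentage of all tweets.
class_share = (class_counts / class_counts.sum() * 100).round(1)
print(class_share)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
# Rarer classes receive proportionally larger weights.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
print(class_weights.round(2))
```

Weights like these could later be passed to a classifier that accepts per-class weights; resampling the minority classes is another option.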

In [60]:
df.isnull().sum()

count                       0
hate_speech_count           0
offensive_language_count    0
neither_count               0
class                       0
tweet                       0
dtype: int64

In [61]:
df["tweet"][1]

'!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!'

In [62]:
old_df = df.copy()

# Preprocessing Data

Preprocessing Techniques:

Before training machine learning models, we should apply standard text preprocessing techniques such as removing URLs, usernames/handles, and special characters/punctuation, removing stop words, tokenization, and stemming/lemmatization. Which of these steps to apply depends on the requirements of the analysis.

### Lowercasing

In [63]:
df["tweet"] = df["tweet"].apply(lambda x: x.lower())
df["tweet"]

0        !!! rt @mayasolovely: as a woman you shouldn't...
1        !!!!! rt @mleew17: boy dats cold...tyga dwn ba...
2        !!!!!!! rt @urkindofbrand dawg!!!! rt @80sbaby...
3        !!!!!!!!! rt @c_g_anderson: @viva_based she lo...
4        !!!!!!!!!!!!! rt @shenikaroberts: the shit you...
                               ...                        
24778    you's a muthaf***in lie &#8220;@lifeasking: @2...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like i ain...
24781                youu got wild bitches tellin you lies
24782    ~~ruffled | ntac eileen dahlia - beautiful col...
Name: tweet, Length: 24783, dtype: object
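
An equivalent, vectorized way to lowercase a text column is the pandas `.str.lower()` accessor. The sketch below uses a small made-up Series (`tweets` is a hypothetical stand-in for `df["tweet"]`). Unlike `lambda x: x.lower()`, the accessor passes missing values through instead of raising, although `df.isnull().sum()` shows there are none in this dataset.

```python
import pandas as pd

# Hypothetical stand-in for df["tweet"].
tweets = pd.Series(["!!! RT @UserName: Boy DATS cold", "Already lower"])

# Vectorized lowercasing; no Python-level lambda needed.
lowered = tweets.str.lower()
print(lowered.tolist())
```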

### Remove Usernames

In [64]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_usernames(x))
df["tweet"]

0        !!! rt : as a woman you shouldn't complain abo...
1        !!!!! rt : boy dats cold...tyga dwn bad for cu...
2        !!!!!!! rt  dawg!!!! rt : you ever fuck a bitc...
3                   !!!!!!!!! rt :  she look like a tranny
4        !!!!!!!!!!!!! rt : the shit you hear about me ...
                               ...                        
24778    you's a muthaf***in lie &#8220;:   right! his ...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like i ain...
24781                youu got wild bitches tellin you lies
24782    ~~ruffled | ntac eileen dahlia - beautiful col...
Name: tweet, Length: 24783, dtype: object
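
`helper.py` is not shown in this notebook, so the following is only a guess at what `helper.remove_usernames` might look like: a regular expression that deletes `@` followed by word characters (letters, digits, underscore). It reproduces the first row of the output above, where the handle disappears but the trailing colon stays.

```python
import re

# "@" followed by one or more word characters, e.g. "@c_g_anderson".
USERNAME_RE = re.compile(r"@\w+")

def remove_usernames(text: str) -> str:
    """Delete Twitter-style handles; surrounding punctuation is left alone."""
    return USERNAME_RE.sub("", text)

print(remove_usernames("!!! rt @mayasolovely: as a woman"))
```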

### Remove Numbers

In [65]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_numbers(x))
df["tweet"]

0        !!! rt : as a woman you shouldn't complain abo...
1        !!!!! rt : boy dats coldtyga dwn bad for cuffi...
2        !!!!!!! rt  dawg!!!! rt : you ever fuck a bitc...
3                   !!!!!!!!! rt :  she look like a tranny
4        !!!!!!!!!!!!! rt : the shit you hear about me ...
                               ...                        
24778    you's a muthaf***in lie &#;:   right! his tl i...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!! dat nigguh like i aint ...
24781                youu got wild bitches tellin you lies
24782    ~~ruffled | ntac eileen dahlia - beautiful col...
Name: tweet, Length: 24783, dtype: object
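
The output above shows that `helper.remove_numbers` strips more than digits: `cold...tyga` became `coldtyga` and `eat!!.. dat` became `eat!! dat`, so runs of periods are apparently removed as well, and the HTML entity `&#8220;` is left as the fragment `&#;`. The sketch below is a hypothetical digits-only version, not the actual helper; it leaves periods to the punctuation step.

```python
import re

# One or more consecutive digits.
DIGITS_RE = re.compile(r"\d+")

def remove_numbers(text: str) -> str:
    """Delete every run of digits; all other characters are kept."""
    return DIGITS_RE.sub("", text)

print(repr(remove_numbers("lie &#8220; right 2 go")))  # digits gone, "&#;" remains
print(repr(remove_numbers("boy dats cold...tyga")))    # no digits, so unchanged
```

Decoding HTML entities before this step (for example with the standard library's `html.unescape`) would avoid leaving fragments such as `&#;` behind.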

### Remove Punctuation

In [66]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_punctuations(x))
df["tweet"]

0         rt  as a woman you shouldnt complain about cl...
1         rt  boy dats coldtyga dwn bad for cuffin dat ...
2         rt  dawg rt  you ever fuck a bitch and she st...
3                              rt   she look like a tranny
4         rt  the shit you hear about me might be true ...
                               ...                        
24778    yous a muthafin lie    right his tl is trash  ...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like i aint fu...
24781                youu got wild bitches tellin you lies
24782    ruffled  ntac eileen dahlia  beautiful color c...
Name: tweet, Length: 24783, dtype: object
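
The output above is consistent with simply deleting every ASCII punctuation character (for example, `shouldn't` becomes `shouldnt` and the spaces around a removed `:` are kept). A minimal sketch of that approach, assuming `string.punctuation` as the character set; the real `helper.remove_punctuations` may differ:

```python
import string

# Translation table that maps every ASCII punctuation character to "deleted".
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuations(text: str) -> str:
    """Delete ASCII punctuation; whitespace and letters are untouched."""
    return text.translate(_PUNCT_TABLE)

print(repr(remove_punctuations("!!! rt : as a woman you shouldn't")))
```

Note that `string.punctuation` covers ASCII only, so typographic characters such as curly quotes would survive this version.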

### Remove URLs

In [67]:
urls = "name is www.fb.com is bets"
helper.remove_urls(urls)

'name is  is bets'

In [68]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_urls(x))
df["tweet"]

0         rt  as a woman you shouldnt complain about cl...
1         rt  boy dats coldtyga dwn bad for cuffin dat ...
2         rt  dawg rt  you ever fuck a bitch and she st...
3                              rt   she look like a tranny
4         rt  the shit you hear about me might be true ...
                               ...                        
24778    yous a muthafin lie    right his tl is trash  ...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like i aint fu...
24781                youu got wild bitches tellin you lies
24782    ruffled  ntac eileen dahlia  beautiful color c...
Name: tweet, Length: 24783, dtype: object
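
The rows displayed here are identical to those after the punctuation step. One likely reason is ordering: once `:`, `/` and `.` have been removed, a link no longer looks like a URL and cannot be matched. The sketch below uses a hypothetical pattern (the real `helper.remove_urls` is not shown) that reproduces the `www.fb.com` test above, and then illustrates the ordering problem on a made-up tweet.

```python
import re
import string

# Anything starting with http://, https:// or www. up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text: str) -> str:
    """Delete URL-like tokens from the text."""
    return URL_RE.sub("", text)

print(repr(remove_urls("name is www.fb.com is bets")))

raw = "check this http://t.co/abc now"  # made-up example tweet
no_punct = raw.translate(str.maketrans("", "", string.punctuation))

print(repr(remove_urls(raw)))       # URL removed when run on the raw text
print(repr(remove_urls(no_punct)))  # "httptcoabc" survives after punctuation removal
```

Running URL removal before the number and punctuation steps would avoid leaving such fused fragments in the text.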

### Remove Unwanted Whitespace

In [69]:
tst = "@nameis_as-13 623 wew 3242nnk                                                  wrwrwrs124124 0.233232424242412133131313133 100000000000000.1111"
helper.remove_unwanted_whitespaces(tst)

'@nameis_as-13 623 wew 3242nnk wrwrwrs124124 0.233232424242412133131313133 100000000000000.1111'

In [70]:
df["tweet"] = df["tweet"].apply(lambda x: helper.remove_unwanted_whitespaces(x))
df["tweet"]

0         rt as a woman you shouldnt complain about cle...
1         rt boy dats coldtyga dwn bad for cuffin dat h...
2         rt dawg rt you ever fuck a bitch and she star...
3                                rt she look like a tranny
4         rt the shit you hear about me might be true o...
                               ...                        
24778    yous a muthafin lie right his tl is trash now ...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like i aint fu...
24781                youu got wild bitches tellin you lies
24782    ruffled ntac eileen dahlia beautiful color com...
Name: tweet, Length: 24783, dtype: object
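
A common way to normalize whitespace is to collapse every run of whitespace into a single space. The sketch below is an assumption about how `helper.remove_unwanted_whitespaces` could be written, with one deliberate difference: the first rows of the output above still appear to start with a single space, so the helper presumably does not trim the ends, whereas this version also calls `.strip()`.

```python
import re

def remove_unwanted_whitespaces(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(repr(remove_unwanted_whitespaces(" rt   she look \t like\n")))
```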

### Tokenization

In [73]:
df["tweet"] = df["tweet"].apply(lambda x: helper.tokenization(x))
df["tweet"]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Saurav\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


0        [rt, as, a, woman, you, shouldnt, complain, ab...
1        [rt, boy, dats, coldtyga, dwn, bad, for, cuffi...
2        [rt, dawg, rt, you, ever, fuck, a, bitch, and,...
3                         [rt, she, look, like, a, tranny]
4        [rt, the, shit, you, hear, about, me, might, b...
                               ...                        
24778    [yous, a, muthafin, lie, right, his, tl, is, t...
24779    [youve, gone, and, broke, the, wrong, heart, b...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781        [youu, got, wild, bitches, tellin, you, lies]
24782    [ruffled, ntac, eileen, dahlia, beautiful, col...
Name: tweet, Length: 24783, dtype: object
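
The download log above shows NLTK's `punkt` resource being fetched, so `helper.tokenization` presumably relies on NLTK; the split of `wanna` into `wan`, `na` in row 24780 is typical of its Treebank-style word tokenizer. As a dependency-free stand-in, not the helper itself, the sketch below tokenizes on whitespace. That is sufficient once punctuation has been removed, but it does not split contractions such as `wanna`.

```python
from typing import List

def tokenization(text: str) -> List[str]:
    """Split on any whitespace; leading/trailing whitespace yields no empty tokens."""
    return text.split()

print(tokenization(" young buck wanna eat "))
```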

### Remove Stop Words

In [76]:
df["tweet"] = df["tweet"].apply(lambda x: helper.stopwords_removal(x))
df["tweet"]

0        [rt, woman, shouldnt, complain, cleaning, hous...
1        [rt, boy, dats, coldtyga, dwn, bad, cuffin, da...
2        [rt, dawg, rt, ever, fuck, bitch, start, cry, ...
3                                 [rt, look, like, tranny]
4        [rt, shit, hear, might, true, might, faker, bi...
                               ...                        
24778    [yous, muthafin, lie, right, tl, trash, mine, ...
24779    [youve, gone, broke, wrong, heart, baby, drove...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781             [youu, got, wild, bitches, tellin, lies]
24782    [ruffled, ntac, eileen, dahlia, beautiful, col...
Name: tweet, Length: 24783, dtype: object
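
Stop word removal is a membership filter over the token list. The sketch below uses a tiny hand-written set (`STOP_WORDS` is hypothetical; the helper more likely uses a full list such as NLTK's English stop words) and reproduces the first row of the output above.

```python
from typing import Iterable, List

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = frozenset({"a", "about", "and", "as", "is", "me", "the", "to", "you"})

def stopwords_removal(tokens: Iterable[str]) -> List[str]:
    """Keep only the tokens that are not in STOP_WORDS."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = ["rt", "as", "a", "woman", "you", "shouldnt", "complain", "about", "cleaning"]
print(stopwords_removal(tokens))
```

Contractions such as `shouldnt` and `youve` survive in the output above, presumably because punctuation removal had already stripped the apostrophes that stop word lists usually expect. The retweet marker `rt` also survives; it could be added to the set if it carries no signal for the task.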

### Lemmatization

In [79]:
df["tweet"] = df["tweet"].apply(lambda x: helper.lemmatize(x))
df["tweet"]

0        [rt, woman, shouldnt, complain, cleaning, hous...
1        [rt, boy, dats, coldtyga, dwn, bad, cuffin, da...
2        [rt, dawg, rt, ever, fuck, bitch, start, cry, ...
3                                 [rt, look, like, tranny]
4        [rt, shit, hear, might, true, might, faker, bi...
                               ...                        
24778    [yous, muthafin, lie, right, tl, trash, mine, ...
24779    [youve, gone, broke, wrong, heart, baby, drove...
24780    [young, buck, wan, na, eat, dat, nigguh, like,...
24781             [youu, got, wild, bitches, tellin, lies]
24782    [ruffled, ntac, eileen, dahlia, beautiful, col...
Name: tweet, Length: 24783, dtype: object
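
The rows displayed here are identical to those after stop word removal; for example, the plural `lies` in row 24781 is unchanged, although a noun lemmatizer would normally be expected to reduce it. It may be worth checking that `helper.lemmatize` is applied to each token rather than to the list as a whole. The sketch below shows the per-token pattern. To stay runnable without NLTK data, it takes the lemmatizer as a callable and defaults to a toy lookup table (`TOY_LEMMAS` and `toy_lemma` are hypothetical names, not part of any library).

```python
from typing import Callable, Dict, List

# Hypothetical toy lookup standing in for a real lemmatizer.
TOY_LEMMAS: Dict[str, str] = {"lies": "lie", "hearts": "heart", "babies": "baby"}

def toy_lemma(word: str) -> str:
    """Return the toy lemma if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

def lemmatize(tokens: List[str], lemma_fn: Callable[[str], str] = toy_lemma) -> List[str]:
    """Apply lemma_fn to every token and return a new list."""
    return [lemma_fn(tok) for tok in tokens]

print(lemmatize(["youu", "got", "wild", "tellin", "lies"]))

# With NLTK installed and the "wordnet" corpus downloaded, a real lemmatizer
# could be plugged in instead of the toy one:
#   from nltk.stem import WordNetLemmatizer
#   lemmatize(tokens, WordNetLemmatizer().lemmatize)
```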