# Data Preprocessing
The first step in building our hate speech detection model is collecting and preprocessing the data. This step is crucial because the quality and cleanliness of the data significantly affect the model's performance.

We use the provided dataset (HateSpeechDetection.csv), which contains text data labeled as hate speech (1) or not hate speech (0).

In [2]:
import re
import pandas as pd
import numpy as np

data = pd.read_csv("HateSpeechDetection.csv")
data.head(15)


Unnamed: 0,Text,Label
0,Damn I thought they had strict gun laws in Ger...,0
1,I dont care about what it stands for or anythi...,0
2,It's not a group it's an idea lol,0
3,So it's not just America!,0
4,The dog is a spectacular dancer considering he...,0
5,If ppl dont wear masks you complain... If ppl ...,0
6,We should send them All back to africa,1
7,Checking to see if it's whataboutism from the ...,0
8,"As a european, I approve this.",0
9,"Idk which of these groups to join, they're all...",0


In [3]:
data.tail(10)

Unnamed: 0,Text,Label
17586,africans should be in a zoo with the other ani...,1
17587,I wouldn't want my daughter to marry a Chink,1
17588,not anymore. they are like the shitty utube no...,1
17589,"Yet still, better for a 750cc to seek a win (o...",0
17590,people this is new thread due to the impending...,0
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17596 entries, 0 to 17595
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    17596 non-null  object
 1   Label   17596 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 275.1+ KB


After inspecting the dataset, we can see that the raw text needs to be cleaned and transformed into a format suitable for our model. This involves several sub-steps:

Remove extra spaces: Collapse every run of whitespace in the text into a single space.

In [5]:
def remove_extra_spaces(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r'\s+', ' ', text)

data['Text'] = data['Text'].apply(remove_extra_spaces)
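
As a quick check of the whitespace step, the sketch below applies the same pattern to a made-up string. The `normalize_spaces` name, the sample text, and the `.strip()` call are assumptions added for illustration; the cell above does not trim leading or trailing whitespace.

```python
import re

def normalize_spaces(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space,
    # then trim both ends. The .strip() is an addition, not part of the
    # original cell.
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample with repeated spaces, a tab, and a newline
sample = "  So   it's\tnot just\n America!  "
cleaned = normalize_spaces(sample)
print(cleaned)
```

Without `.strip()`, the result would keep a single leading and trailing space, which may or may not matter depending on the tokenizer used later.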


Remove usernames: Like a URL, a username is not a word that carries meaning, so it gives the model no valuable information. We therefore remove it.

In [6]:
def remove_username(text):
    return re.sub(r'@[^ ]+', '', text)

data['Text'] = data['Text'].apply(remove_username)
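
One thing worth noting is that deleting a username from the middle of a sentence leaves two adjacent spaces behind. The sketch below, using a made-up sample string, illustrates this and shows one possible remedy: running the whitespace step again after the removal. The re-ordering is a suggestion, not something the notebook itself does.

```python
import re

def remove_username(text):
    # Same pattern as the cell above: '@' plus everything up to the next space
    return re.sub(r'@[^ ]+', '', text)

def remove_extra_spaces(text):
    # Same pattern as the earlier whitespace step
    return re.sub(r'\s+', ' ', text)

# Hypothetical sample; the username is invented for illustration
sample = "thanks @some_user for the link"

without_user = remove_username(sample)
# The two spaces that surrounded the username are now adjacent
print(repr(without_user))

# Assumption: normalizing whitespace again (and trimming) tidies the result
recleaned = remove_extra_spaces(without_user).strip()
print(repr(recleaned))
```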


Remove hashtags: Hashtags are hard to interpret, but they usually carry useful information about the context and content of a text. The difficulty is that the words in a hashtag run together without spaces. Here we remove only the `#` symbol and keep the tag text itself.

In [7]:
def remove_hashtags(text):
    return re.sub(r'#', '', text)

data['Text'] = data['Text'].apply(remove_hashtags)
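
The cell above keeps the hashtag text but leaves its words joined together. As a possible extension, and not part of the original pipeline, the sketch below splits hashtags written in CamelCase at lowercase-to-uppercase boundaries. The function names and sample text are hypothetical, and all-lowercase hashtags are left unsplit because this heuristic has no boundary to detect.

```python
import re

def split_hashtag(match):
    # match.group(1) is the tag text without the leading '#'
    tag = match.group(1)
    # Insert a space wherever a lowercase letter or digit is followed
    # by an uppercase letter, e.g. "StopHate" -> "Stop Hate"
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', tag)

def expand_hashtags(text):
    # Replace each '#tag' with its split-up text, dropping the '#'
    return re.sub(r'#(\w+)', split_hashtag, text)

# Hypothetical sample with one CamelCase and one lowercase hashtag
sample = "Great day #StopHateSpeech #peace"
expanded = expand_hashtags(sample)
print(expanded)
```

Splitting lowercase hashtags would need a dictionary-based word segmenter, which is beyond what this sketch attempts.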

Data after Cleaning

In [8]:
data.head(15)

Unnamed: 0,Text,Label
0,Damn I thought they had strict gun laws in Ger...,0
1,I dont care about what it stands for or anythi...,0
2,It's not a group it's an idea lol,0
3,So it's not just America!,0
4,The dog is a spectacular dancer considering he...,0
5,If ppl dont wear masks you complain... If ppl ...,0
6,We should send them All back to africa,1
7,Checking to see if it's whataboutism from the ...,0
8,"As a european, I approve this.",0
9,"Idk which of these groups to join, they're all...",0


In [9]:
data.tail(15)

Unnamed: 0,Text,Label
17581,i do not understand why so many Indians have a...,0
17582,Traaaniiieeesss should be eliiiimiiiinateeed,1
17583,ALL CURRY MUNCHERS SHOULD DIE - this was the m...,0
17584,this government is full of rubbish,0
17585,i dislike bisexual people,1
17586,africans should be in a zoo with the other ani...,1
17587,I wouldn't want my daughter to marry a Chink,1
17588,not anymore. they are like the shitty utube no...,1
17589,"Yet still, better for a 750cc to seek a win (o...",0
17590,people this is new thread due to the impending...,0
