# Data Preprocessing
The first step in our hate speech detection model involves collecting and preprocessing the data. This step is crucial as the quality and cleanliness of the data significantly affect the model's performance.

We use the provided dataset (HateSpeechDetection.csv), which contains text data labeled as hate speech (1) or not hate speech (0).

In [18]:
import re
import pandas as pd
import numpy as np

data = pd.read_csv("HateSpeechDetection.csv")
data.head(10)


Unnamed: 0,Text,Label
0,Damn I thought they had strict gun laws in Ger...,0
1,I dont care about what it stands for or anythi...,0
2,It's not a group it's an idea lol,0
3,So it's not just America!,0
4,The dog is a spectacular dancer considering he...,0
5,If ppl dont wear masks you complain... If ppl ...,0
6,We should send them All back to africa,1
7,Checking to see if it's whataboutism from the ...,0
8,"As a european, I approve this.",0
9,"Idk which of these groups to join, they're all...",0


In [19]:
data.tail(10)

Unnamed: 0,Text,Label
17586,africans should be in a zoo with the other ani...,1
17587,I wouldn't want my daughter to marry a Chink,1
17588,not anymore. they are like the shitty utube no...,1
17589,"Yet still, better for a 750cc to seek a win (o...",0
17590,people this is new thread due to the impending...,0
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0


In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17596 entries, 0 to 17595
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    17596 non-null  object
 1   Label   17596 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 275.1+ KB


Inspecting the dataset shows that the raw text must be cleaned and transformed into a format suitable for our model. This involves several sub-steps:

Removing Extra Spaces: Normalize the spacing in the text to remove any extra spaces.

In [21]:
def remove_extra_spaces(text):
    # re.sub replaces each run of one or more whitespace characters (\s+) with a single space.
    return re.sub(r'\s+', ' ', text)
data['Text'] = data['Text'].apply(remove_extra_spaces)
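
To make the effect of this step concrete, here is a minimal sketch on a made-up string. The sample text and the extra `strip()` call are illustrative assumptions, not part of the notebook. Note that `\s+` also matches tabs and newlines, and that leading or trailing whitespace is reduced to a single space rather than removed.

```python
import re

def remove_extra_spaces(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r'\s+', ' ', text)

sample = "  so   messed\tup \n saying  "  # hypothetical input
collapsed = remove_extra_spaces(sample)
trimmed = collapsed.strip()  # optional extra step, not in the original pipeline

print(repr(collapsed))  # a single leading and trailing space survive
print(repr(trimmed))    # no leading or trailing space
```

If edge whitespace matters for later steps, chaining `.strip()` after the substitution is one simple option.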



Remove usernames: A username (an @-mention) is not a word that carries meaning, so, like a URL, it adds no useful signal to the text. We therefore remove it.

In [22]:
def remove_username(text):
    return re.sub(r"@\S+", "", text)
# The pattern "@\S+" matches an '@' followed by one or more non-whitespace characters (\S); '+' means one or more repetitions of the preceding element.

data['Text'] = data['Text'].apply(remove_username)
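
The sketch below shows what the `@\S+` pattern does on made-up strings, and compares it with a narrower `@\w+` variant. The variant and all sample strings are assumptions for illustration; the notebook itself uses `@\S+`. Because `\S` matches any non-whitespace character, punctuation attached to a mention and the domain part of an e-mail address are removed as well.

```python
import re

sample = "@user123 thanks, cc @some_one!"  # hypothetical input

# Pattern used in the notebook: '@' plus everything up to the next whitespace.
removed_greedy = re.sub(r"@\S+", "", sample)

# Narrower alternative (an assumption): '@' plus word characters only,
# so trailing punctuation such as '!' is left in place.
removed_word = re.sub(r"@\w+", "", sample)

# Side effect of the notebook's pattern on an e-mail address.
email = re.sub(r"@\S+", "", "mail bob@example.com today")

print(repr(removed_greedy))
print(repr(removed_word))
print(repr(email))
```

Either way, the removal leaves stray spaces behind, so a whitespace normalization pass after this step may be worthwhile.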



Remove Hashtags: Hashtags are hard to interpret, but they usually carry useful information about the context and content of a text. The difficulty is that the words in a hashtag are run together without spaces. We therefore strip only the "#" symbol and keep the hashtag text as a single token.

In [23]:
def remove_hashtags(text):
    return re.sub(r'#', '', text)
# Replace the "#" character with an empty string; the hashtag's text itself is kept.

data['Text'] = data['Text'].apply(remove_hashtags)
data.tail()

Unnamed: 0,Text,Label
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0
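
Since the words inside a hashtag run together, one possible refinement is to split CamelCase hashtags into separate words before lowercasing. The following is only a sketch of that idea: the helper `split_hashtags` and its regular expression are hypothetical and not part of the notebook, it handles ASCII letters and digits only, and all-lowercase hashtags cannot be split this way.

```python
import re

# Hypothetical helper: pieces of a hashtag are either an acronym
# (capitals not followed by a lowercase letter), a capitalized or
# lowercase word, or a run of digits.
_PIECE = re.compile(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+')

def split_hashtags(text):
    def _split(match):
        # match.group(1) is the hashtag text without the leading '#'.
        return ' '.join(_PIECE.findall(match.group(1)))
    return re.sub(r'#(\w+)', _split, text)

examples = ["#BlackLivesMatter now", "#COVID19 rules", "#stopthehate"]
results = [split_hashtags(t) for t in examples]
print(results)
```

This would have to run before the lowercasing step, because the capital letters are the only cue for where one word ends and the next begins.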


Lowercasing: Convert all text to lowercase to ensure uniformity, as the model should treat "Hate" and "hate" as the same word.

In [24]:
def text_lower(text):
    return text.lower()
data['Text'] = data['Text'].apply(text_lower)

Removing Punctuation: Strip out punctuation to focus on the words themselves.

In [25]:
def remove_punctuation(text):
    # [^\w\s] matches any single character that is neither a word character nor whitespace:
    #   \w: any word character (letters, digits, and underscore; in Python 3 this includes Unicode letters and digits).
    #   \s: any whitespace character, such as a space, tab, or newline.
    #   ^ inside the brackets negates the set.
    # Every matching character is removed; underscores are kept because \w matches them.
    return re.sub(r'[^\w\s]', '', text)

data['Text'] = data['Text'].apply(remove_punctuation)
data.tail()

Unnamed: 0,Text,Label
17591,i find rats nicer and cleaner than most chinese,1
17592,check out this niggar they hit things like wil...,1
17593,this country has become an absolute shambles t...,0
17594,me aged 16 antisemitism is bad me aged 18 an...,1
17595,so messed up saying blacks dont deserve rights,0
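
A small sketch of what this pattern does to individual tokens, using a made-up sentence (the sample and the follow-up whitespace pass are assumptions for illustration). Apostrophes and hyphens are deleted without inserting a space, so the two halves of a word fuse, while deleting a standalone symbol such as "=" leaves two adjacent spaces behind.

```python
import re

def remove_punctuation(text):
    # Delete every character that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', text)

sample = "aged 16 = anti-semitism is bad; don't snake_case!"  # hypothetical input
stripped = remove_punctuation(sample)

# Optional follow-up (not in the notebook at this point): collapse the
# double space left where the standalone '=' used to be.
normalized = re.sub(r'\s+', ' ', stripped)

print(repr(stripped))
print(repr(normalized))
```

The underscore in `snake_case` survives because `\w` includes it.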


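Remove URLs: A URL carries no information when we analyze a text word by word, so we remove it. Because punctuation has already been stripped at this point, a URL now appears as a single run of characters beginning with "http" or "www", which the pattern below still matches.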

In [26]:
def remove_url(text):
    return re.sub(r'http\S+|www\S+', '', text)
# Removes every run of non-whitespace characters (\S+) that begins with "http" (which also covers "https") or "www".

data['Text'] = data['Text'].apply(remove_url)
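
The prefix-based pattern is deliberately loose, and it can also match inside ordinary words. The sketch below compares it with a stricter variant that requires a word boundary and either a scheme (`http://`, `https://`) or a literal `www.`. The stricter pattern is an assumption, not the notebook's method, and it only works on text that still contains its punctuation, so it would have to run before the punctuation step.

```python
import re

sample = "awwww cute, see https://example.com/a?b=1 now"  # hypothetical raw text

# Loose pattern (matches the same strings as the one used in the notebook).
loose = re.sub(r'http\S+|www\S+', '', sample)

# Stricter variant (an assumption): needs '://' or 'www.' after a word boundary.
strict = re.sub(r'\b(?:https?://|www\.)\S+', '', sample)

print(repr(loose))   # the URL is gone, but "awwww" is damaged too
print(repr(strict))  # only the URL is removed
```

Both versions leave a double space where the URL was, which a whitespace normalization pass can clean up.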

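Tokenization: Split each cleaned text into a list of word tokens with NLTK's word_tokenize (this needs NLTK's Punkt tokenizer data to be available locally).

In [27]: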
from nltk.tokenize import word_tokenize
data['Text'] = data['Text'].apply(word_tokenize)
data

Unnamed: 0,Text,Label
0,"[damn, i, thought, they, had, strict, gun, law...",0
1,"[i, dont, care, about, what, it, stands, for, ...",0
2,"[its, not, a, group, its, an, idea, lol]",0
3,"[so, its, not, just, america]",0
4,"[the, dog, is, a, spectacular, dancer, conside...",0
...,...,...
17591,"[i, find, rats, nicer, and, cleaner, than, mos...",1
17592,"[check, out, this, niggar, they, hit, things, ...",1
17593,"[this, country, has, become, an, absolute, sha...",0
17594,"[me, aged, 16, antisemitism, is, bad, me, aged...",1
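
For reference, the cleaning steps can be combined into a single function. This is a minimal sketch under stated assumptions, not the notebook's implementation: the name `preprocess` and the sample string are hypothetical, URLs and usernames are removed first, whitespace is normalized last, and `str.split()` stands in for `word_tokenize` as a simplification so the example runs without NLTK.

```python
import re

def preprocess(text):
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs while they are still intact
    text = re.sub(r'@\S+', '', text)             # drop @-mentions
    text = text.replace('#', '')                 # keep hashtag text, drop the symbol
    text = text.lower()                          # treat "Hate" and "hate" alike
    text = re.sub(r'[^\w\s]', '', text)          # strip punctuation
    text = re.sub(r'\s+', ' ', text).strip()     # normalize whitespace last
    return text.split()                          # simple stand-in for word_tokenize

sample = "@bob Check   THIS out: https://example.com #NoHate!!"  # hypothetical input
tokens = preprocess(sample)
print(tokens)
```

Running whitespace normalization at the end cleans up the gaps that the earlier removals leave behind.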
