# Data Preprocessing
The first step in our hate speech detection model involves collecting and preprocessing the data. This step is crucial as the quality and cleanliness of the data significantly affect the model's performance.

We use the provided dataset (HateSpeechDetection.csv), which contains text data labeled as hate speech (1) or not hate speech (0).

In [21]:
import re
import pandas as pd
import numpy as np

data = pd.read_csv("HateSpeechDetection.csv")
data.head(15)


Unnamed: 0,Text,Label
0,Damn I thought they had strict gun laws in Ger...,0
1,I dont care about what it stands for or anythi...,0
2,It's not a group it's an idea lol,0
3,So it's not just America!,0
4,The dog is a spectacular dancer considering he...,0
5,If ppl dont wear masks you complain... If ppl ...,0
6,We should send them All back to africa,1
7,Checking to see if it's whataboutism from the ...,0
8,"As a european, I approve this.",0
9,"Idk which of these groups to join, they're all...",0


In [22]:
data.tail(10)

Unnamed: 0,Text,Label
17586,africans should be in a zoo with the other ani...,1
17587,I wouldn't want my daughter to marry a Chink,1
17588,not anymore. they are like the shitty utube no...,1
17589,"Yet still, better for a 750cc to seek a win (o...",0
17590,people this is new thread due to the impending...,0
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0


In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17596 entries, 0 to 17595
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    17596 non-null  object
 1   Label   17596 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 275.1+ KB


After inspecting the dataset, we can see that the raw text must be cleaned and transformed into a format suitable for our model. This involves several sub-steps:

Removing Extra Spaces: Normalize the spacing in the text to remove any extra spaces.

In [24]:
def remove_extra_spaces(text):
    # Replace each run of one or more whitespace characters (\s+) with a single space.
    return re.sub(r'\s+', ' ', text)
data['Text'] = data['Text'].apply(remove_extra_spaces)
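
Note that collapsing whitespace runs still leaves one leading or trailing space when the text begins or ends with whitespace. A minimal sketch, assuming a hypothetical helper named normalize_spaces (not part of the original notebook), adds strip() to handle that case:

```python
import re

def normalize_spaces(text):
    # Hypothetical variant of remove_extra_spaces: collapse whitespace runs,
    # then trim the single space that may remain at either end.
    return re.sub(r'\s+', ' ', text).strip()

print(repr(re.sub(r'\s+', ' ', '  so   messed\tup \n')))  # ' so messed up '
print(repr(normalize_spaces('  so   messed\tup \n')))     # 'so messed up'
```

In this notebook the later short-word step splits and rejoins the text, which removes such stray spaces anyway, so the extra strip() is optional here.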



Remove usernames: Like a URL (removed in a later step), a username carries no useful information because it is not a word with meaning of its own. We therefore remove every @mention.

In [25]:
def remove_username(text):
    # "@\S+" matches an '@' followed by one or more non-whitespace characters (\S),
    # i.e. a whole @mention up to the next space.
    return re.sub(r"@\S+", "", text)

data['Text'] = data['Text'].apply(remove_username)



Remove Hashtags: Hashtags are hard to interpret, but they usually contain useful information about the context and content of a text. The difficulty is that the words in a hashtag are written one after the other, without spaces. We therefore strip only the "#" character and keep the tag text.

In [26]:
def remove_hashtags(text):
    # Delete only the '#' character; the tag text itself is kept.
    return re.sub(r'#', '', text)

data['Text'] = data['Text'].apply(remove_hashtags)
data.tail()

Unnamed: 0,Text,Label
17591,I find rats nicer and cleaner than most Chinese,1
17592,"Check out this niggar, they hit things like wi...",1
17593,"this country has become an absolute shambles, ...",0
17594,Me aged 16 = anti-Semitism is bad Me aged 18 =...,1
17595,so messed up saying blacks don't deserve rights,0
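
Because the cell above deletes only the '#' character, a tag such as #BlackLivesMatter stays a single token. If the run-together words matter, one option is to split camel-case tags at capital letters before lowercasing. The sketch below is an assumption, not part of the original pipeline: split_hashtags is a hypothetical helper, and all-lowercase tags cannot be split this way.

```python
import re

# One "word" inside a tag: an optional capital plus a lowercase run,
# an all-caps run, or a run of digits.
_WORD = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+')

def split_hashtags(text):
    def _split(match):
        parts = _WORD.findall(match.group(1))
        # Fall back to the raw tag text if no word pieces are found.
        return ' '.join(parts) if parts else match.group(1)
    return re.sub(r'#(\w+)', _split, text)

print(split_hashtags('support #BlackLivesMatter now'))  # support Black Lives Matter now
```

This would have to run before the lowercasing step, since the capital letters are the only word boundaries available.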


Handling Contractions

Handling contractions in text is an important step in text preprocessing, especially for tasks like hate speech detection where understanding the full meaning of the words is crucial. Contractions are shortened forms of words or combinations of words created by omitting certain letters and sounds (e.g., "don't" for "do not", "I'm" for "I am"). 

In [27]:
import contractions
data['Text']=data['Text'].apply(lambda x:contractions.fix(x))
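
To illustrate what expansion does, here is a minimal dictionary-based sketch. It is not how the contractions package is implemented: CONTRACTION_MAP and expand_contractions are hypothetical names, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately small mapping; a real library covers far more forms.
CONTRACTION_MAP = {
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "can't": "cannot",
}

# Match any key as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lowercased match so "I'm" and "i'm" share one entry.
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("I'm sure they don't care"))  # i am sure they do not care
```

The replacements come out in lowercase, which is harmless here because the next step lowercases the whole text.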

Lowercasing: Convert all text to lowercase to ensure uniformity, as the model should treat "Hate" and "hate" as the same word.

In [28]:
def text_lower(text):
    return text.lower()
data['Text'] = data['Text'].apply(text_lower)

Removing Punctuation: Strip out punctuation to focus on the words themselves.

In [29]:
def remove_punctuation(text):
    # \w matches any word character (letters, digits, and underscore; Unicode-aware for str patterns).
    # \s matches any whitespace character, such as a space, tab, or newline.
    # [^\w\s] therefore matches any character that is neither ('^' negates the set), and it is removed.
    return re.sub(r'[^\w\s]', '', text)

data['Text'] = data['Text'].apply(remove_punctuation)
data.tail()

Unnamed: 0,Text,Label
17591,i find rats nicer and cleaner than most chinese,1
17592,check out this niggar they hit things like wil...,1
17593,this country has become an absolute shambles t...,0
17594,me aged 16 antisemitism is bad me aged 18 an...,1
17595,so messed up saying blacks do not deserve rights,0


Remove URLs: URLs carry no information for a word-level analysis of the text, so we remove them.

In [30]:
def remove_url(text):
    # Remove any run of non-whitespace characters that starts with "http"
    # (which also covers "https") or "www".
    return re.sub(r'http\S+|www\S+', '', text)

data['Text'] = data['Text'].apply(remove_url)

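Removing Short Words: Keep only words of three or more characters, dropping very short tokens such as "a", "is", and "to".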

In [31]:
data['Text'] = data['Text'].apply(lambda x: ' '.join([word for word in x.split() if len(word) >= 3]))

In [32]:
data

Unnamed: 0,Text,Label
0,damn thought they had strict gun laws germany,0
1,not care about what stands for anything its co...,0
2,not group idea lol,0
3,not just america,0
4,the dog spectacular dancer considering has two...,0
...,...,...
17591,find rats nicer and cleaner than most chinese,1
17592,check out this niggar they hit things like wil...,1
17593,this country has become absolute shambles the ...,0
17594,aged antisemitism bad aged antisemitism does n...,1
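
The cleaning steps up to this point can be sketched as a single function that applies them in the same order as the cells above. This is an illustrative condensation, not the notebook's own code: clean_text is a hypothetical name, and the contraction step is left out so the sketch needs only the standard library.

```python
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)            # collapse whitespace runs
    text = re.sub(r'@\S+', '', text)            # drop @mentions
    text = text.replace('#', '')                # keep tag text, drop '#'
    text = text.lower()                         # lowercase
    text = re.sub(r'[^\w\s]', '', text)         # strip punctuation
    text = re.sub(r'http\S+|www\S+', '', text)  # drop what is left of URLs
    # Keep words of three or more characters; split/join also tidies spacing.
    return ' '.join(w for w in text.split() if len(w) >= 3)

print(clean_text('@user Check   this #BigNews at https://t.co/xyz, it is SO bad!'))
# check this bignews bad
```

Because punctuation is stripped before URL removal, the URL reaches the last regex as the single token "httpstcoxyz", which the pattern still matches.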


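Lemmatization: Reduce each word to its dictionary form (lemma), so that, for example, "laws" becomes "law".

In [33]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# The stopwords corpus used in the next cell must also be installed (nltk.download('stopwords')).
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemmatizer = WordNetLemmatizer()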

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\balui\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\balui\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\balui\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [34]:
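# Build the stopword set once; the original looked the list up again for every token.
stop_words = set(stopwords.words('english'))

def lemmatizers(text):
    # Lemmatize every non-stopword token; stopwords pass through unchanged.
    tokens = nltk.word_tokenize(text)
    words = [t if t in stop_words else lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(words)
data['Text'] = data['Text'].apply(lemmatizers)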

In [35]:
data

Unnamed: 0,Text,Label
0,damn thought they had strict gun law germany,0
1,not care about what stand for anything its con...,0
2,not group idea lol,0
3,not just america,0
4,the dog spectacular dancer considering has two...,0
...,...,...
17591,find rat nicer and cleaner than most chinese,1
17592,check out this niggar they hit thing like wild...,1
17593,this country has become absolute shamble the a...,0
17594,aged antisemitism bad aged antisemitism does n...,1


Text Vectorization:
Vectorization is the process of converting text into numerical representations. The TextVectorization layer standardizes the text, tokenizes it, and converts it into integer sequences that can be used as input for a deep learning model.

In [36]:
X = data['Text']
y = data[data.columns[1]].values
X

0             damn thought they had strict gun law germany
1        not care about what stand for anything its con...
2                                       not group idea lol
3                                         not just america
4        the dog spectacular dancer considering has two...
                               ...                        
17591         find rat nicer and cleaner than most chinese
17592    check out this niggar they hit thing like wild...
17593    this country has become absolute shamble the a...
17594    aged antisemitism bad aged antisemitism does n...
17595                messed saying black not deserve right
Name: Text, Length: 17596, dtype: object

In [37]:
y

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

Vocabulary Size (max_tokens=10000):

By setting max_tokens to 10,000, we limit the vocabulary to the 10,000 most frequent words in the dataset. This helps in reducing the computational complexity and memory usage while retaining the most important words for the task.

Sequence Length (output_sequence_length=300):

The output_sequence_length parameter ensures that all text sequences have the same length (300 tokens in this case). Shorter sequences are padded with zeros, and longer sequences are truncated. This uniformity is necessary for efficient batch processing and model training.

Integer Token Indices (output_mode='int'):

The output_mode='int' setting indicates that the output will be integer indices of tokens. This is a common approach in NLP tasks, where each unique token in the vocabulary is assigned a unique integer index.
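
To make the three parameters concrete, the following pure-Python sketch mimics the behaviour described above on a toy scale. It is only an illustration under simplifying assumptions (whitespace tokenization, no text standardization). build_vocabulary and vectorize are hypothetical helpers, not Keras functions, and tie-breaking between equally frequent words may differ from the real layer.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for unknown words,
    # so only max_tokens - 2 slots remain for real words.
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def vectorize(text, vocabulary, sequence_length):
    index = {word: i for i, word in enumerate(vocabulary)}
    # Unknown words map to 1; long texts are truncated, short ones zero-padded.
    ids = [index.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ['not just america', 'not group idea lol']
vocab = build_vocabulary(texts, max_tokens=4)
print(vocab[:3])                                          # ['', '[UNK]', 'not']
print(vectorize('not america', vocab, sequence_length=4)) # [2, 1, 0, 0]
```

With max_tokens=4 only two real words fit in the vocabulary, so "america" falls back to the unknown index 1, and the two trailing zeros are padding.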

In [38]:
from tensorflow.keras.layers import TextVectorization
vectorizer = TextVectorization(max_tokens=10000,
                               output_sequence_length=300,
                               output_mode='int')
vectorizer.adapt(X.values)
vectorized_text = vectorizer(X.values)
vectorized_text

<tf.Tensor: shape=(17596, 300), dtype=int64, numpy=
array([[ 414,  251,    8, ...,    0,    0,    0],
       [   5,  157,   28, ...,    0,    0,    0],
       [   5,  172,  249, ...,    0,    0,    0],
       ...,
       [  11,   51,   57, ...,    0,    0,    0],
       [2125, 1532,  144, ...,    0,    0,    0],
       [2424,  168,   24, ...,    0,    0,    0]], dtype=int64)>

In [39]:
vectorizer.get_vocabulary()

['',
 '[UNK]',
 'the',
 'and',
 'are',
 'not',
 'you',
 'that',
 'they',
 'for',
 'have',
 'this',
 'people',
 'with',
 'all',
 'but',
 'like',
 'can',
 'woman',
 'just',
 'their',
 'them',
 'was',
 'will',
 'black',
 'what',
 'would',
 'who',
 'about',
 'there',
 'from',
 'your',
 'get',
 'because',
 'fucking',
 'should',
 'when',
 'one',
 'she',
 'think',
 'more',
 'want',
 'how',
 'out',
 'white',
 'why',
 'being',
 'know',
 'our',
 'these',
 'muslim',
 'country',
 'some',
 'men',
 'even',
 'make',
 'her',
 'has',
 'fuck',
 'only',
 'hate',
 'say',
 'than',
 'were',
 'need',
 'really',
 'time',
 'now',
 'gay',
 'here',
 'see',
 'any',
 'most',
 'good',
 'those',
 'shit',
 'other',
 'then',
 'never',
 'thing',
 'does',
 'right',
 'look',
 'way',
 'many',
 'did',
 'been',
 'jew',
 'man',
 'life',
 'going',
 'had',
 'much',
 'world',
 'too',
 'its',
 'his',
 'could',
 'love',
 'into',
 'year',
 'off',
 'which',
 'take',
 'over',
 'back',
 'always',
 'day',
 'also',
 'him',
 'got',
 're

In [47]:
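import tensorflow
# Pair each vectorized text with its label.
dataset = tensorflow.data.Dataset.from_tensor_slices((vectorized_text, y))
# A buffer of 18000 exceeds the 17596 rows, so the shuffle covers the whole dataset.
dataset = dataset.shuffle(18000)
dataset.as_numpy_iterator().next()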

(array([6055,    1, 1397, 5935, 1883,   57,  910,  205, 5018,  681, 7764,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

In [41]:
data.iloc[0]

Text     damn thought they had strict gun law germany
Label                                               0
Name: 0, dtype: object