# Hate Speech Detection

<img src="images/hate.webp">

Hate speech is a serious problem that shows up daily on social media platforms such as Twitter and Facebook, often in politically charged discussions. If you want to learn how to train a hate speech detection model, this project is for you: I will walk you through the task using machine learning in Python.

In [41]:
import re
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Requires the NLTK stopword corpus: nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

warnings.filterwarnings('ignore')

# Show up to 100 columns when displaying DataFrames
pd.set_option('display.max_columns', 100)

In [42]:
train = pd.read_csv('data/train.csv')
print("Training Set:", train.shape, len(train))
test = pd.read_csv('data/test.csv')
print("Test Set:", test.shape, len(test))

Training Set: (31962, 3) 31962
Test Set: (17197, 2) 17197


In [43]:
train.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


## Data Cleaning

In [44]:
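import re

def clean_text(text):
    """Lowercase a tweet and strip mentions, URLs, a leading "rt", and punctuation."""
    text = text.lower()
    # Alternatives, in order: @mentions | any character other than letters,
    # digits, spaces, tabs | scheme://... URLs | a leading "rt" | "http" fragments
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
    return text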

train['tweet'] = train['tweet'].apply(clean_text)
test['tweet'] = test['tweet'].apply(clean_text)
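
To see what the cleaning step does, here is a small check on a made-up tweet. The sample string is an assumption, not a row from the dataset, and `clean_text` is repeated so the snippet runs on its own.

```python
import re

def clean_text(text):
    # Same cleaning function as in the cell above, repeated for a standalone run.
    text = text.lower()
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
    return text

# Hypothetical sample tweet, not taken from the dataset.
sample = "@user Thanks for the #lyft credit! http://t.co/abc"
cleaned = clean_text(sample)

# Splitting on whitespace hides the spaces left behind by removed tokens.
print(cleaned.split())  # ['thanks', 'for', 'the', 'lyft', 'credit']
```

The mention and the URL disappear entirely, while the hashtag loses only its `#`. The function does not collapse whitespace, so the cleaned string can keep leading, trailing, or doubled spaces.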

In [45]:
# import neattext.functions as nfx

# def clean_text(tweet):
#     tweet = nfx.normalize(tweet)
#     tweet = nfx.remove_emails(tweet)
#     tweet = nfx.remove_urls(tweet)
#     tweet = nfx.remove_special_characters(tweet)
#     tweet = nfx.remove_stopwords(tweet)
#     tweet = nfx.remove_punctuations(tweet)
#     tweet = nfx.remove_numbers(tweet)
#     tweet = nfx.remove_emojis(tweet)
#     tweet = nfx.remove_html_tags(tweet)
#     return tweet

# train['tweet'] = train['tweet'].apply(clean_text)

## Handling Imbalanced Data

In [46]:
from sklearn.utils import resample

# Split the training data by class
train_majority = train[train.label==0]
train_minority = train[train.label==1]
# Sample the minority class with replacement until it matches the majority size
train_minority_upsampled = resample(train_minority,
                                    replace=True,
                                    n_samples=len(train_majority),
                                    random_state=123)
train_upsampled = pd.concat([train_minority_upsampled, train_majority])
train_upsampled['label'].value_counts()

1    29720
0    29720
Name: label, dtype: int64
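
Upsampling adds no new tweets; it only repeats existing minority rows. Assuming `label` takes only the values 0 and 1, the counts above imply 31962 - 29720 = 2242 original minority rows repeated up to 29720. A minimal sketch on a toy frame (hypothetical data and variable names) makes this visible:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame (hypothetical data): six majority rows, two minority rows.
toy = pd.DataFrame({"label": [0] * 6 + [1] * 2,
                    "tweet": [f"t{i}" for i in range(8)]})
toy_majority = toy[toy.label == 0]
toy_minority = toy[toy.label == 1]

# Draw minority rows with replacement until both classes have the same size.
toy_minority_up = resample(toy_minority, replace=True,
                           n_samples=len(toy_majority), random_state=123)
toy_balanced = pd.concat([toy_minority_up, toy_majority])

counts = toy_balanced["label"].value_counts().to_dict()
distinct_minority = toy_minority_up["tweet"].nunique()
print(counts, distinct_minority)
```

Both classes end up with six rows, but the minority side still contains at most two distinct tweets.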

## Creating a Pipeline

In [47]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Bag-of-words counts -> TF-IDF weighting -> linear classifier trained with SGD
pipeline_sgd = Pipeline([
    ('vect', CountVectorizer(preprocessor=clean_text)),
    ('tfidf', TfidfTransformer()),
    ('sgd', SGDClassifier()),
])
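
The `vect` and `tfidf` steps together do what scikit-learn's `TfidfVectorizer` does in a single step. The sketch below checks this on a three-document toy corpus (the documents are made up, and default settings are assumed for both variants):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only for comparing the two feature extractors.
docs = ["i love this", "i hate this", "love love love"]

# Two steps: raw token counts, then TF-IDF weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)

# One step: TfidfVectorizer combines both stages.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

Keeping the two steps separate, as the pipeline here does, is mostly a matter of taste; it makes it easy to drop the TF-IDF step and train on raw counts.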

In [48]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train_upsampled['tweet'],
                                                    train_upsampled['label'],random_state = 42)

In [49]:
pipeline_sgd.fit(x_train, y_train)

In [50]:
y_pred = pipeline_sgd.predict(x_test)
from sklearn.metrics import f1_score
f1_score(y_test, y_pred)

0.9650945909938716
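
One caveat about this score: the minority class was upsampled before `train_test_split`, so copies of the same tweet can land in both `x_train` and `x_test`, which may make the F1 score optimistic. A minimal sketch of the alternative order, splitting first and upsampling only the training part, is shown below on toy data (the frame and all variable names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for the training frame (hypothetical data, 16:4 imbalance).
toy = pd.DataFrame({
    "tweet": [f"neutral tweet {i}" for i in range(16)]
             + [f"flagged tweet {i}" for i in range(4)],
    "label": [0] * 16 + [1] * 4,
})

# 1. Hold out the test rows first, keeping the class ratio in both parts.
train_part, test_part = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=42)

# 2. Upsample the minority class inside the training part only.
majority = train_part[train_part.label == 0]
minority = train_part[train_part.label == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=123)
train_balanced = pd.concat([majority, minority_up])

# No original row appears on both sides of the split.
shared_rows = set(train_balanced.index) & set(test_part.index)
print(train_balanced["label"].value_counts().to_dict(), len(shared_rows))
```

With this order the test part keeps the original class imbalance, so a score measured on it is closer to what the model would see on unseen tweets.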

In [51]:
from joblib import dump
dump(pipeline_sgd, 'pipeline_model.joblib')

['pipeline_model.joblib']
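
The saved file can be read back with `joblib.load` and used for prediction. The sketch below runs the round trip on a tiny stand-in pipeline (the training texts, labels, and file name are assumptions for illustration) and checks that the restored model predicts the same labels:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only to have a fitted pipeline to save.
texts = ["have a great day", "what a lovely morning",
         "you are awful", "i cannot stand you people"]
labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])
toy_pipeline.fit(texts, labels)

# Save to a temporary directory and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)
    restored = load(path)

new_tweets = ["have a lovely day", "you people are awful"]
original_pred = toy_pipeline.predict(new_tweets)
restored_pred = restored.predict(new_tweets)
print((original_pred == restored_pred).all())
```

For `pipeline_model.joblib` itself, note that the pipeline was built with `preprocessor=clean_text`. Pickled functions are stored by reference, so the script that loads the file most likely needs the same `clean_text` defined or imported before calling `load`.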