## Hate Speech Recognition

Hate speech is speech or expression that denigrates a person or group on the basis of (alleged) membership in a social group identified by attributes such as race, ethnicity, gender, sexual orientation, religion, age, or physical or mental disability.

##### Hate speech detection is usually treated as a sentiment classification task. For this hate speech detection model, I will use Twitter data.

In [26]:
# Importing essential Libraries
import pandas as pd 
import re 

from sklearn.utils import resample
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [2]:
# Reading train and test datasets
train= pd.read_csv('train.csv')
test= pd.read_csv('test.csv')

In [4]:
# Checking the shape of Training and Test Datasets
print("Shape of Train Set:- ", train.shape)
print("Shape of Test Set:- ", test.shape)

Shape of Train Set:-  (31962, 3)
Shape of Test Set:-  (17197, 2)


In [5]:
# Getting a glimpse of train dataset
train.head()

|   | id | label | tweet |
|---|----|-------|-------|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


### 1. Data Cleaning

Convert all text to lowercase and remove unnecessary text such as user mentions, punctuation, and URLs.

In [7]:
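def clean_text(df, text_field):
    # Note: this modifies df in place and returns the same DataFrame.
    df[text_field] = df[text_field].str.lower()
    # Strip @mentions, non-alphanumeric characters, URLs, and a leading "rt".
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"
    df[text_field] = df[text_field].apply(lambda text: re.sub(pattern, "", text))
    return df

train_clean = clean_text(train, 'tweet')
test_clean = clean_text(test, 'tweet')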
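
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same regular expression to a single string. The helper `clean_tweet` and the sample tweets are hypothetical and are not part of the notebook.

```python
import re

# Same pattern as in clean_text above, compiled once for reuse.
PATTERN = re.compile(
    r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"
)

def clean_tweet(text):
    """Lowercase one tweet and strip mentions, punctuation, and URLs."""
    return PATTERN.sub("", text.lower())

sample = "@user Thanks for the #lyft credit!"  # hypothetical tweet
cleaned = clean_tweet(sample)
print(repr(cleaned))  # ' thanks for the lyft credit'

with_url = clean_tweet("check https://example.com now")  # hypothetical tweet
print(repr(with_url))  # 'check  now'
```

Note that the mention is removed but the space after it stays, and that only the `#` of a hashtag is dropped while the word itself is kept. Collapsing the leftover whitespace would need an extra step that the notebook does not perform.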

### 2. Handling Imbalanced data

In [18]:
train_clean["label"].value_counts()

0    29720
1     2242
Name: label, dtype: int64

Hate speech tweets (label 1) are far fewer than the others (2,242 versus 29,720), so we upsample the minority class to balance the training data.
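
The following sketch uses only the class counts printed above to show why this imbalance matters. It considers a hypothetical baseline that labels every tweet as 0; that baseline is an illustration and not something the notebook trains.

```python
# Class counts reported by value_counts() above.
n_negative = 29720  # label 0
n_positive = 2242   # label 1
total = n_negative + n_positive

# Share of hate speech tweets in the training set.
positive_share = n_positive / total

# Hypothetical baseline that predicts 0 for every tweet:
# right on all negatives, wrong on all positives.
baseline_accuracy = n_negative / total

# Its F1 on the positive class is 0 because it finds no true positives.
true_positives = 0
false_positives = 0
false_negatives = n_positive
baseline_f1 = 2 * true_positives / (
    2 * true_positives + false_positives + false_negatives
)

print(f"positive share:    {positive_share:.3f}")     # 0.070
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.930
print(f"baseline F1:       {baseline_f1:.3f}")        # 0.000
```

A model can therefore reach about 93% accuracy without detecting a single hate speech tweet, which is one reason to rebalance the classes and to score with F1 rather than accuracy.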

In [19]:
train_majority= train_clean[train_clean.label==0]
train_minority= train_clean[train_clean.label==1]
upsampled= resample(train_minority, replace=True, n_samples=len(train_majority), random_state=12)

train_upsampled= pd.concat([upsampled, train_majority])
train_upsampled['label'].value_counts()

1    29720
0    29720
Name: label, dtype: int64
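
Upsampling with `replace=True` means drawing minority rows at random, with repetition, until both classes have the same size. The sketch below shows the idea with the standard library on toy data; it is not how `sklearn.utils.resample` is implemented, and all names and rows are hypothetical.

```python
import random

# Toy stand-ins for the two classes (hypothetical data, not the tweets).
majority = [("tweet_a", 0), ("tweet_b", 0), ("tweet_c", 0),
            ("tweet_d", 0), ("tweet_e", 0), ("tweet_f", 0)]
minority = [("tweet_x", 1), ("tweet_y", 1)]

rng = random.Random(12)  # fixed seed so the draw is repeatable

# Draw with replacement until the minority matches the majority in size.
upsampled = rng.choices(minority, k=len(majority))

balanced = upsampled + majority
counts = {label: sum(1 for _, y in balanced if y == label) for label in (0, 1)}
print(counts)  # {0: 6, 1: 6}

# Every upsampled row is a copy of an original minority row.
print(set(upsampled) <= set(minority))  # True
```

The balanced set contains no new information about the minority class, only repeated copies of the same rows.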

### 3. Creating a pipeline for reproducibility

In [23]:
# Bag-of-words counts -> tf-idf weights -> linear classifier trained with SGD
pipeline_sgd= Pipeline([('vect', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                       ('sgd', SGDClassifier())])
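
As a rough illustration of what the first two pipeline steps compute, the sketch below builds tf-idf vectors by hand. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation. Splitting on whitespace is a simplification of `CountVectorizer`'s tokenizer, and the three documents are hypothetical.

```python
import math
from collections import Counter

docs = ["love this day", "hate this", "love love"]  # hypothetical cleaned tweets

# Step 1 (the role of CountVectorizer): term counts per document.
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted({term for c in counts for term in c})

# Step 2 (the role of TfidfTransformer): weight counts by smoothed IDF.
n_docs = len(docs)
doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(count):
    """Return the L2-normalised tf-idf vector for one document."""
    raw = [count[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(c) for c in counts]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents, such as `day` here, receive a larger IDF than terms shared by several documents, so they contribute more to the vector the classifier sees.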

#### Training the model

In [29]:
X_train, X_test, y_train, y_test= train_test_split(train_upsampled['tweet'],
                                                  train_upsampled['label'],
                                                  random_state= 1)
model= pipeline_sgd.fit(X_train, y_train)
y_predict= model.predict(X_test)
f1_score(y_test, y_predict)

0.966640604639041
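
One caveat about this score: the split above is made after upsampling, so copies of the same minority tweet can land in both `X_train` and `X_test`, and the F1 may then be optimistic. A minimal sketch of the alternative order, splitting first and upsampling only the training part, is shown below on a hypothetical toy frame. It is not the notebook's method and no score is reported for it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy frame standing in for train_clean.
df = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(20)],
    "label": [1] * 4 + [0] * 16,
})

# 1. Hold out a test set first, keeping the class ratio with stratify.
train_part, test_part = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=1
)

# 2. Upsample the minority class inside the training part only.
majority = train_part[train_part.label == 0]
minority = train_part[train_part.label == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=12)
train_balanced = pd.concat([minority_up, majority])

# 3. No held-out tweet appears in the training data.
overlap = set(train_balanced["tweet"]) & set(test_part["tweet"])
print(len(overlap))  # 0
```

With this order the test set keeps the original class ratio, so the F1 measured on it reflects tweets the model has not seen in any form.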