# Hate Detection: LSTM

## Author: Rami Abulfadl

A Bi-LSTM network is built as a binary classifier for hate detection using the [Twitter hate speech](https://www.kaggle.com/vkrahul/twitter-hate-speech?select=train_E6oV3lV.csv) dataset, which can be downloaded from Kaggle. In this dataset, tweets are identified as hateful by internet users and compiled by Hatebase.org, based on [Davidson et al.](https://arxiv.org/pdf/1703.04009.pdf)


###  **1. Importation**
This section imports the libraries needed to build the model, as well as the dataset itself.



**1.1.** Importing the necessary libraries

In [None]:
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
import tensorflow as tf
from keras.layers import LSTM
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer
from imblearn.over_sampling import SMOTE
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix




**1.2.** Importing the CSV files of the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [None]:
train = pd.read_csv("/content/drive/My Drive/TFMColab/hate/train.csv")
val = pd.read_csv("/content/drive/My Drive/TFMColab/hate/dev.csv")
test = pd.read_csv("/content/drive/My Drive/TFMColab/hate/test.csv")

###  **2. Preprocessing**

The Keras tokenizer is fitted on the training tweets and then used to convert every tweet into a sequence of integers. Only the most frequent words are kept, up to a vocabulary of 10,000 words. Text is not converted to lower case, since capitalization may carry information, for example when a user is shouting. Each sequence is then padded or truncated to 25 tokens.

Punctuation (except the apostrophe, which is not in the filter string), tabs, and line breaks are filtered from the corpus.

In [None]:
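max_words = 10000    # vocabulary size kept by the tokenizer
maxlen = 25          # number of tokens per tweet after padding/truncating
embedding_dim = 100  # dimension of the GloVe vectors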

tokenizer = Tokenizer(num_words=max_words,lower=False, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

#tokenizer fitting on training tweets
tokenizer.fit_on_texts(list(train['Phrase']))
word_index = tokenizer.word_index
print('There are %s unique tokens.' % len(word_index))

#apply the tokenizer on the 3 datasets partitions
train_X = tokenizer.texts_to_sequences(train['Phrase'])
test_X = tokenizer.texts_to_sequences(test['Phrase'])
val_X = tokenizer.texts_to_sequences(val['Phrase'])

#pad or truncate every sequence to maxlen, giving a 2-D numpy array
train_X = pad_sequences(train_X, maxlen = maxlen)
test_X = pad_sequences(test_X, maxlen = maxlen)
val_X = pad_sequences(val_X, maxlen = maxlen)

train_y = train['sentiment_values']
test_y = test['sentiment_values']
val_y = val['sentiment_values']

There are 39745 unique tokens.
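
To make the filtering and indexing step more concrete, the following is a minimal pure-Python sketch of the idea, not the Keras implementation itself. The helper names `simple_word_sequence` and `build_word_index` and the toy tweets are assumptions made for illustration. It shows that case is preserved and that apostrophes survive the filter.

```python
from collections import Counter

# Same filter string as in the cell above (note: it contains no apostrophe).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=False):
    """Replace every filtered character by a space, then split on whitespace."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()

def build_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent, 0 stays free for padding."""
    counts = Counter(w for t in texts for w in simple_word_sequence(t))
    # sorted() is stable, so ties keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Hypothetical tweets, not taken from the dataset.
toy_tweets = ["@user STOP it, don't!", "stop it now", "it is fine"]
toy_index = build_word_index(toy_tweets)

print(simple_word_sequence(toy_tweets[0]))  # ['user', 'STOP', 'it', "don't"]
print(toy_index["STOP"], toy_index["stop"])  # two different indices
```

Because `lower=False`, "STOP" and "stop" end up as two separate vocabulary entries, which is the behaviour the preprocessing above relies on.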


In [None]:
train_X.shape


(25533, 25)
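
The shape above follows from padding every tweet to `maxlen` tokens. As a rough illustration, the NumPy sketch below mimics the default behaviour of `pad_sequences` (zeros added at the front, long sequences cut at the front). The function name `pad_pre` and the toy sequences are assumptions for the example.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen`
    items of long ones ('pre' padding and 'pre' truncating)."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]
        out[row, -len(trimmed):] = trimmed
    return out

toy = [[5, 2, 9], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy, maxlen=4))
# [[0 5 2 9]
#  [3 4 5 6]]
```

With `maxlen = 25`, tweets longer than 25 tokens lose their first tokens, and shorter tweets are filled with the reserved index 0.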

The GloVe embedding file glove.6B.100d is loaded next. It contains pre-trained 100-dimensional word vectors for a vocabulary of 400k words, trained on 6B tokens from the Wikipedia 2014 and Gigaword 5 corpora. The embedding matrix is then prepared and used to initialize the weights of the embedding layer in the Bi-LSTM model.



In [None]:
#path to the GloVe file
EMBEDDING_FILE = '/content/drive/My Drive/TFMColab/glove_6B_100d.txt'

def coefs_fetcher(word,*arr): 
    return word, np.asarray(arr, dtype='float32')

#construct the embedding index from the glove file

embeddings_index = {}
with open(EMBEDDING_FILE, encoding="utf8") as f:
    for line in f:
        word, coefs = coefs_fetcher(*line.split(" "))
        embeddings_index[word] = coefs
            
# embedding matrix: row i holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in the embedding index keep an all-zero row
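
Since the glove.6B vocabulary is lower-cased while the tokenizer above keeps the original case, it may be worth checking how many vocabulary words actually receive a pre-trained vector. The sketch below is one possible check. The helper `embedding_coverage` and the toy dictionaries are assumptions for illustration, and no coverage figure for the real data is implied.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the fraction of vocabulary words with a pre-trained vector,
    looked up as-is and after lower-casing."""
    total = len(word_index)
    exact = sum(1 for w in word_index if w in embeddings_index)
    lowered = sum(1 for w in word_index if w.lower() in embeddings_index)
    return exact / total, lowered / total

# Toy stand-ins for the real word_index and embeddings_index.
toy_word_index = {"it": 1, "STOP": 2, "stop": 3, "Love": 4}
toy_embeddings = {"it": None, "stop": None, "love": None}  # values are irrelevant here

exact, lowered = embedding_coverage(toy_word_index, toy_embeddings)
print(f"exact match: {exact:.0%}, after lower-casing: {lowered:.0%}")
# exact match: 50%, after lower-casing: 100%
```

Called as `embedding_coverage(word_index, embeddings_index)` in the notebook, a large gap between the two numbers would indicate that many capitalized words are being mapped to zero vectors.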

### **3. Build LSTM Model**

The model consists of an embedding layer followed by a 128-unit Bi-LSTM layer and a global max-pooling layer. The result then passes through 2 hidden layers with ReLU activation functions, of 40 and 20 units respectively, each followed by dropout. Lastly, the final probability is given by the output layer with a sigmoid activation function for binary classification: 1 for a hateful tweet, 0 for a neutral one.

In [None]:
model = Sequential()
model.add(Embedding(len(word_index) + 1, embedding_dim, weights = [embedding_matrix]))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(GlobalMaxPool1D())
model.add(Dense(40, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(20, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy',keras.metrics.Precision(), keras.metrics.Recall(), keras.metrics.TruePositives()])
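
As a sanity check on the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes the standard LSTM and dense layer parameter formulas and the vocabulary size printed by the tokenizer above (39745 tokens). The helper names are made up for the example, and the total is expected, not guaranteed, to agree with `model.summary()`.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_rows = 39745 + 1   # len(word_index) + 1, from the tokenizer output
embedding_dim = 100
lstm_units = 128

counts = {
    "embedding": vocab_rows * embedding_dim,
    "bi_lstm": 2 * lstm_params(embedding_dim, lstm_units),  # forward + backward
    "dense_40": dense_params(2 * lstm_units, 40),           # max-pooling keeps 256 features
    "dense_20": dense_params(40, 20),
    "output": dense_params(20, 1),
}

for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {sum(counts.values()):,}")  # total: 4,220,217
```

Almost all parameters sit in the embedding layer, which is initialized from GloVe rather than learned from scratch.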

For regularization, dropout with p=0.5 is applied after each hidden layer in the section above. In addition, early stopping is set up by monitoring the validation loss, and the best model is saved to disk during training.


In [None]:
#save the model after every epoch in which the validation accuracy improves
#(with metrics=['accuracy'], TF 2.x logs this metric as 'val_accuracy', not 'val_acc')
saveBestModel = keras.callbacks.ModelCheckpoint('/content/drive/My Drive/TFMColab/best_model.hdf5', monitor='val_accuracy', verbose=0, save_best_only=True, save_weights_only=False, mode='auto')
#setting early stopping with a patience of 2 epochs on the validation loss
earlyStopping = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=2, verbose=0, mode='auto')
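
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea rather than the Keras callback itself. The function `stopping_epoch` and the loss values are invented for the example and are not the losses of this notebook.

```python
def stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch after which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

made_up_losses = [0.30, 0.25, 0.24, 0.26, 0.27, 0.20]
print(stopping_epoch(made_up_losses, patience=2))  # 5
```

In this toy run the best loss is reached at epoch 3, two epochs without improvement follow, and training stops after epoch 5 even though epoch 6 would have been better. That is the trade-off of a short patience.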



### **4. Fit the model**

Class weights are set before fitting the model to deal with the class imbalance: errors on the hateful class (1) are weighted 10 times more heavily than errors on the neutral class (0).

In [None]:
class_weight = {0: 1.,
                1: 10.
                }
batch_size = 100
epochs = 25
model.fit(train_X, train_y, batch_size=batch_size,class_weight=class_weight, epochs=epochs, validation_data=(val_X, val_y), callbacks=[saveBestModel, earlyStopping])

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25


<tensorflow.python.keras.callbacks.History at 0x7f9aff47be80>
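
The effect of the class weights on the loss can be illustrated with a small NumPy sketch. It assumes that each sample's binary cross-entropy is multiplied by the weight of its true class before averaging. The function `weighted_bce` and the two predictions are made up for the example.

```python
import numpy as np

def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    y_true = np.asarray(y_true, dtype="float64")
    y_prob = np.clip(np.asarray(y_prob, dtype="float64"), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))

y_true = [0, 1]
y_prob = [0.8, 0.2]  # made-up predictions, both equally wrong

plain = weighted_bce(y_true, y_prob, {0: 1.0, 1: 1.0})
weighted = weighted_bce(y_true, y_prob, {0: 1.0, 1: 10.0})
print(plain, weighted)
```

Both predictions are equally far from their label, but with the weights used above the missed hateful tweet contributes ten times as much to the loss, which pushes the model towards higher recall on the minority class.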


### **5. Evaluate model results with test data**

In [None]:
loss, accuracy, precision, recall, true_positives = model.evaluate(test_X, test_y, batch_size=batch_size)



In [None]:
#calculating the F1 score as the harmonic mean of precision and recall
f1_score = 2 * precision * recall / (precision + recall)

#the evaluation metrics
print('The Accuracy is:',accuracy)
print('The f1 score is:',f1_score)
print('The Precision is:',precision)
print('The Recall is:',recall)
print('The Loss is:',loss)

The Accuracy is: 0.9363096952438354
The f1 score is: 0.6194029688361286
The Precision is: 0.5407165884971619
The Recall is: 0.7248908281326294
The Loss is: 0.24641041457653046


### **6. Extract False Positives and False Negatives**



The trained model predicts the labels of the test partition in order to compare them with the actual tweet labels. Then the confusion matrix is constructed.

In [None]:
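#predict_classes was removed from recent TensorFlow releases; thresholding the sigmoid output at 0.5 gives the same labels
predict_y = (model.predict(test_X, batch_size=batch_size) > 0.5).astype("int32")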

In [None]:
confusion_matrix(test_y, predict_y)

array([[2833,  141],
       [  63,  166]])
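
The metrics reported in section 5 can be recovered from this matrix. The short check below uses the four counts shown above and relies on the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
# Values copied from the confusion matrix above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
tn, fp = 2833, 141
fn, tp = 63, 166

precision = tp / (tp + fp)   # share of predicted hateful tweets that are hateful
recall = tp / (tp + fn)      # share of hateful tweets that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.5407 recall=0.7249 f1=0.6194 accuracy=0.9363
```

These values agree with the output of `model.evaluate` above, and they show why accuracy alone is misleading here: 2974 of the 3203 test tweets are neutral, so a high accuracy is compatible with a precision of only about 54% on the hateful class.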

The following function builds pandas DataFrames of the false positives and the false negatives.

In [None]:
def getFP_FN(test_X, test_y, pred_y):
    FP_text = []
    FP_index = []
    FN_text = []
    FN_index = []
    for i in range(len(test_y)):
        if(pred_y[i]==1 and test_y[test_y.index[i]]==0):
            FP_text.append(test['Phrase'][test_y.index[i]])
            FP_index.append(test_y.index[i])
        elif(pred_y[i]==0 and test_y[test_y.index[i]]==1):
            FN_text.append(test['Phrase'][test_y.index[i]])
            FN_index.append(test_y.index[i])
    d_FP = {'FP_text':FP_text,'FP_index':FP_index}
    df_FP = pd.DataFrame(d_FP)
    d_FN = {'FN_text':FN_text,'FN_index':FN_index}
    df_FN = pd.DataFrame(d_FN)
            
    return df_FP,df_FN

In [None]:
#build the false positive/negative DataFrames and save them as CSV files
df_FP,df_FN = getFP_FN(test_X, test_y, predict_y)
df_FP.to_csv('/content/drive/My Drive/TFMColab/hate/hateFP.csv', index=True)
df_FN.to_csv('/content/drive/My Drive/TFMColab/hate/hateFN.csv', index=True)
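
The same selection can also be expressed with boolean masks instead of an explicit loop. The sketch below is an alternative illustration, not the function used in this notebook. The name `fp_fn_positions` and the toy labels are assumptions for the example.

```python
import numpy as np

def fp_fn_positions(y_true, y_pred):
    """Return the positions of the false positives and the false negatives."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()   # flattens an (n, 1) model output
    fp = np.flatnonzero((y_pred == 1) & (y_true == 0))
    fn = np.flatnonzero((y_pred == 0) & (y_true == 1))
    return fp, fn

toy_true = [0, 1, 1, 0, 1]
toy_pred = [[1], [1], [0], [0], [0]]  # column shape, like the model output
fp_pos, fn_pos = fp_fn_positions(toy_true, toy_pred)
print(fp_pos.tolist(), fn_pos.tolist())  # [0] [2, 4]
```

In the notebook, the returned positions could then be used to select the rows, for example with `test.iloc[fp_pos]`, assuming the rows of `test` are in the same order as the predictions.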

In [None]:
df_FP

| | FP_text | FP_index |
|---|---|---|
| 0 | #stupidity makes me more than even #negligen... | 8 |
| 1 | it's a firework!! weeheeeee~ ððððð... | 19 |
| 2 | @user but it's your fault you have to use it t... | 28 |
| 3 | sick verbal irony of the #left: equaling homo... | 99 |
| 4 | @user , shocked by your ignorance | 148 |
| ... | ... | ... |
| 136 | @user when a former liberal warrior becomes a... | 3099 |
| 137 | cue the violins #sososad | 3103 |
| 138 | #whatson @user @user @user @user @user please... | 3162 |
| 139 | @user .. #new#year.. | 3171 |


In [None]:
df_FN

| | FN_text | FN_index |
|---|---|---|
| 0 | fox new just coming out and saying it bluntly.... | 47 |
| 1 | @user thank you!! the power of #social #media!... | 63 |
| 2 | decolonizing the curriculum: the only way thro... | 113 |
| 3 | yay! except #ellen made comments so should we... | 125 |
| 4 | show me your tits, idiot! | 205 |
| ... | ... | ... |
| 58 | in queue with basket of food shopping, 50's gu... | 2925 |
| 59 | @user i have the ssn,address,home network info... | 2933 |
| 60 | why? please explain to us why we're, "bad." | 3006 |
| 61 | goodbye 2016.... i definitely hope we leave be... | 3133 |
