# **SENTIMENT ANALYSIS USING LSTM**

The objective of this task is to detect hate speech in tweets. For simplicity, we say a tweet contains hate speech if it expresses racist or sexist sentiment. The task is therefore to distinguish racist or sexist tweets from all other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not, the objective is to predict the labels on the test dataset.

In [None]:
#importing libraries
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Embedding,LSTM,Dense,Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import seaborn as sns

In [None]:
#importing the dataset
ds = pd.read_csv("../input/twitter-sentiment-analysis-hatred-speech/train.csv")
ds.head()

In [None]:
#checking for null values
ds.isnull().sum()


**NO NULL VALUES FOUND**

In [None]:
#defining dependent and independent vectors
#taking only the tweet text for prediction
x = ds.iloc[:,2:3]
y = ds['label']

In [None]:
ds['label'].value_counts()

In [None]:
#plotting the number of tweets in each class (0 = not racist/sexist, 1 = racist/sexist)
sns.countplot(x = 'label',data = ds)

**AS YOU CAN SEE, LABEL 0 HAS ~30,000 TWEETS AND LABEL 1 HAS ~2,500 TWEETS, SO THE CLASSES ARE IMBALANCED**
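
With roughly 30,000 tweets in class 0 and 2,500 in class 1, a classifier that always predicts 0 already scores a high accuracy. The short sketch below works out that baseline. The counts are the approximate figures quoted above, not exact values from the dataset, and the variable names are illustrative only.

```python
# Approximate class counts quoted above (not exact dataset figures).
n_negative = 30_000  # label 0: not racist/sexist
n_positive = 2_500   # label 1: racist/sexist

total = n_negative + n_positive

# Share of the minority (racist/sexist) class.
positive_share = n_positive / total

# Accuracy of a trivial classifier that always predicts label 0.
majority_baseline_accuracy = n_negative / total

print(f"positive share: {positive_share:.3f}")
print(f"always-0 baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.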

In [None]:
#Text Cleaning and preprocessing

#creating the stemmer and the stopword set once, outside the loop
#(requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords'))
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

cleaned = []
for i in range(0,len(ds)):
    
    #replacing every character other than (a-z) and (A-Z) with a space
    text = re.sub('[^a-zA-Z]',' ', x['tweet'][i])
    
    #converting all words into lower case
    text = text.lower()
    
    #tokenizing on whitespace
    text = text.split()
    
    #stemming and removing stopwords
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    cleaned.append(text)

In [None]:
#cleaned text
cleaned[:5]
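
To see what the regex, lower-casing and stop-word steps do to a single tweet, here is a small sketch on a made-up example. The tweet, the `basic_clean` helper and the tiny stop-word set are all hypothetical stand-ins (the notebook uses NLTK's full English list), and stemming is left out so the sketch runs without NLTK.

```python
import re

# Small stand-in stop-word set; the notebook uses NLTK's full English list.
toy_stop_words = {"i", "can", "t", "this", "is", "so"}

def basic_clean(tweet, stop_words):
    """Keep letters only, lower-case, split on whitespace, drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', tweet)
    tokens = letters_only.lower().split()
    return ' '.join(tok for tok in tokens if tok not in stop_words)

# Made-up tweet containing a mention, a hashtag, punctuation and digits.
toy_tweet = "@user I can't believe #this!! 100% WRONG"
print(basic_clean(toy_tweet, toy_stop_words))  # prints: user believe wrong
```

Note that the apostrophe in "can't" is replaced by a space, so the word splits into "can" and "t", both of which are then dropped as stop words.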

**DATA IS NOW READY FOR ONE HOT ENCODING**

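Our goal here is to feed the text to the LSTM through an embedding layer. The embedding layer expects each tweet as a sequence of integer word indices, and `one_hot` produces that format: despite its name, it maps each word to an integer index smaller than `vocab_size` by hashing, rather than building one-hot vectors, so two different words can occasionally share an index.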
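
The idea can be illustrated without Keras. The sketch below hashes each word to an integer index and then left-pads the sequences with zeros to a common length. It is an illustration under stated assumptions, not the Keras implementation: `word_to_index`, `encode` and `pad_pre` are hypothetical helpers, and MD5 is used only to get a hash that is stable across runs.

```python
import hashlib

def word_to_index(word, vocab_size):
    """Map a word to a stable integer in [1, vocab_size - 1]; 0 is kept for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(text, vocab_size):
    """Turn a cleaned, space-separated text into a list of word indices."""
    return [word_to_index(word, vocab_size) for word in text.split()]

def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]

# Two made-up cleaned tweets of different lengths.
toy_tweets = ["user believ wrong", "love thi day so much"]
encoded = [encode(tweet, 5000) for tweet in toy_tweets]
padded = pad_pre(encoded)

for row in padded:
    print(row)
```

After padding, every row has the same length, which is what lets the rows be stacked into a single matrix for the embedding layer.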

In [None]:
#taking dictionary size 5000
vocab_size = 5000

#one hot encoding
one_hot_dir = [one_hot(words,vocab_size) for words in cleaned]

#all rows must have the same length, so padding is applied
#this adds 0s at the start of the shorter rows
embedded_layer = pad_sequences(one_hot_dir,padding = 'pre')
embedded_layer

**OUR MATRIX IS NOW READY FOR THE LSTM**

In [None]:
#converting into numpy arrays.
x = np.array(embedded_layer)
y = np.array(y)

In [None]:
#splitting the Dataset into Train and Test set
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [None]:
from tensorflow.keras import regularizers
#creating model using LSTM
model = Sequential()

#embedding each word index into a 64-dimensional vector
model.add(Embedding(vocab_size,64,input_length = len(embedded_layer[0])))
#model.add(Dropout(0.4))

#adding an LSTM layer with 128 units
model.add(LSTM(128))
model.add(Dropout(0.4))

#adding output layer 
model.add(Dense(1,activation="sigmoid"))

#compiling the model
model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])

#summary of model
model.summary()

#training the model (the test split doubles as validation data here)
model.fit(x_train, y_train, validation_data = (x_test,y_test), epochs = 5, batch_size = 32)
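
The parameter counts that `model.summary()` reports can be checked by hand. The sketch below does the arithmetic for the three trainable layers, assuming the standard LSTM formulation with four gates and a bias term per gate (the Keras defaults). The variable names are illustrative, and the Dropout layer is omitted because it has no parameters.

```python
vocab_size = 5000     # rows of the embedding matrix
embedding_dim = 64    # size of each word vector
lstm_units = 128      # size of the LSTM hidden state

# Embedding: one vector of length embedding_dim per vocabulary index.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense output layer: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print("embedding:", embedding_params)
print("lstm:     ", lstm_params)
print("dense:    ", dense_params)
print("total:    ", total_params)
```

Most of the parameters sit in the embedding matrix, so `vocab_size` and the embedding dimension are the main levers on model size.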

In [None]:
#predicting and getting accuracy
y_pred = model.predict(x_test)
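#thresholding the predicted probabilities at 0.5 and flattening to a 1-D array of 0/1 labels
y_pred = (y_pred > 0.5).astype(int).ravel()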
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

In [None]:
#getting confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
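
Because the classes are imbalanced, the confusion matrix says more than accuracy alone. The sketch below derives precision, recall and F1 for the positive class from a confusion matrix in scikit-learn's layout (rows are true labels, columns are predicted labels). The counts are made up for illustration and are not results from this model.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows = true labels, columns = predicted labels, i.e. [[TN, FP], [FN, TP]].
cm = [[5900, 60],
      [200, 240]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)

# Of the tweets flagged as racist/sexist, how many really are?
precision = tp / (tp + fp)

# Of the truly racist/sexist tweets, how many were flagged?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

In this made-up example accuracy is about 0.96 while recall on the positive class is only about 0.55, which is the kind of gap that accuracy alone would hide.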