<a href="https://colab.research.google.com/github/satyhim/Projects/blob/main/NLP_Sentiment_Analysis_Tweeter_using_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment Analysis: Twitter Data

#Problem Statement:

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has racist or sexist sentiment associated with it. So the task is to separate racist or sexist tweets from all other tweets.


Formally, given a training sample of tweets and labels, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not, your objective is to predict the labels on the given test dataset.

In [147]:
import pandas as pd
import numpy as np
import re
import nltk
import warnings
import seaborn as sns
import matplotlib.pyplot as plt


In [148]:
df_train=pd.read_csv('/content/drive/MyDrive/Python/Project/Sentiment_analysis/Train_tweets.csv')
df_test=pd.read_csv('/content/drive/MyDrive/Python/Project/Sentiment_analysis/Test_tweets.csv')


In [149]:
df_train.shape

(31962, 3)

In [150]:
df_train.columns

Index(['id', 'label', 'tweet'], dtype='object')

In [151]:
df_test.shape

(17197, 2)

In [152]:
df_test.columns

Index(['id', 'tweet'], dtype='object')

In [153]:
df_train.isna().sum()

id       0
label    0
tweet    0
dtype: int64

In [154]:
df_test.isna().sum()

id       0
tweet    0
dtype: int64

In [155]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [156]:
df_train.describe()

Unnamed: 0,id,label
count,31962.0,31962.0
mean,15981.5,0.070146
std,9226.778988,0.255397
min,1.0,0.0
25%,7991.25,0.0
50%,15981.5,0.0
75%,23971.75,0.0
max,31962.0,1.0


In [157]:
df_train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [158]:
df_train['label'].unique()

array([0, 1])

Now we will check out some non-racist/sexist tweets:

In [159]:
df_train[df_train['label']==0].head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


Now we will check out some racist/sexist tweets:

In [160]:
df_train[df_train['label']==1].head(10)

Unnamed: 0,id,label,tweet
13,14,1,@user #cnn calls #michigan middle school 'buil...
14,15,1,no comment! in #australia #opkillingbay #se...
17,18,1,retweet if you agree!
23,24,1,@user @user lumpy says i am a . prove it lumpy.
34,35,1,it's unbelievable that in the 21st century we'...
56,57,1,@user lets fight against #love #peace
68,69,1,ð©the white establishment can't have blk fol...
77,78,1,"@user hey, white people: you can call people '..."
82,83,1,how the #altright uses &amp; insecurity to lu...
111,112,1,@user i'm not interested in a #linguistics tha...


Now we can see there are many words and characters which are not required in the tweets for our analysis.

Let's check the label distribution in the dataset:

In [161]:
df_train['label'].value_counts()

0    29720
1     2242
Name: label, dtype: int64

So 29,720 tweets are labelled non-racist/sexist and only 2,242 are labelled racist/sexist. That is about 7% of the training set, so the classes are heavily imbalanced.
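To put numbers on the imbalance, here is a small sketch that starts from the two counts reported above. The weighting formula is the common "balanced" heuristic (total / (number of classes × class count)); it is shown as one possible option and is not something this notebook applies.

```python
# Counts copied from the value_counts() output above.
counts = {0: 29720, 1: 2242}
total = sum(counts.values())

# Share of each class in the training set.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

print(shares)
print(class_weight)
```

One consequence worth keeping in mind: a model that always predicts label 0 would already reach roughly 93% accuracy on this data, so accuracy alone says little about how well the minority class is detected.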

#Data Cleaning

The train and test datasets need exactly the same preprocessing. In this notebook the cleaning steps below are applied to each dataset separately; the two are then combined into a single DataFrame and later split back into train and test data.

In [162]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Given below is a user-defined function to remove unwanted text patterns from tweets.

In [163]:
def remove_pattern(input_txt, pattern):
    #Remove every match of the pattern in a single substitution
    return re.sub(pattern, '', input_txt)

**1. Removing Twitter Handles (@user)**

In [164]:
df_train['tidy_tweet'] = np.vectorize(remove_pattern)(df_train['tweet'], r"@[\w]*")
df_train

Unnamed: 0,id,label,tweet,tidy_tweet
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide: society now #motivation
...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,ate isz that youuu?ððððððð...
31958,31959,0,to see nina turner on the airwaves trying to...,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...","#sikh #temple vandalised in in #calgary, #wso..."
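
To make the effect of the handle pattern concrete, here is a minimal sketch on two strings based on the output above (the first one is shortened). The names `HANDLE` and `samples` exist only for this illustration.

```python
import re

# Same pattern as in the cell above: "@" followed by zero or more word characters.
HANDLE = re.compile(r"@[\w]*")

samples = [
    "@user @user thanks for #lyft credit",
    "bihday your majesty",
]

# Replace each handle with the empty string; other text is left untouched.
cleaned = [HANDLE.sub("", s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The spaces that surrounded the handles are left behind, which is harmless here because later steps split on whitespace. Note also that the `*` quantifier lets the pattern match a bare "@" with nothing after it.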


In [165]:
df_test['tidy_tweet'] = np.vectorize(remove_pattern)(df_test['tweet'], r"@[\w]*")
df_test

Unnamed: 0,id,tweet,tidy_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone to see th...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","3rd #bihday to my amazing, hilarious #nephew..."
...,...,...,...
17192,49155,thought factory: left-right polarisation! #tru...,thought factory: left-right polarisation! #tru...
17193,49156,feeling like a mermaid ð #hairflip #neverre...,feeling like a mermaid ð #hairflip #neverre...
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...,#hillary #campaigned today in #ohio((omg)) &am...
17195,49158,"happy, at work conference: right mindset leads...","happy, at work conference: right mindset leads..."


**2. Removing Punctuations, Numbers, and Special Characters**

Here we will replace everything except letters and the '#' character with spaces. The regular expression "[^a-zA-Z#]" matches any single character that is not an ASCII letter or '#'.
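
As a quick check of what this character class does, the sketch below applies it to one sample string with Python's `re` module. The trailing "2016" is added purely for illustration and is not from the dataset.

```python
import re

tweet = "we won!!! love the land!!! #allin #cavs 2016"

# Every character that is not an ASCII letter or '#' becomes a single space,
# so the string keeps its original length.
cleaned = re.sub(r"[^a-zA-Z#]", " ", tweet)

print(repr(cleaned))
```

Because the class only lists `a-z` and `A-Z`, accented letters and emoji are replaced as well, just like digits and punctuation.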

In [166]:
df_train['tidy_tweet'] = df_train['tidy_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)
df_train.head(10)


Unnamed: 0,id,label,tweet,tidy_tweet
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can t use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...,huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...,camping tomorrow danny
7,8,0,the next school year is the year for exams.ð...,the next school year is the year for exams ...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,we won love the land #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...,welcome here i m it s so #gr


In [167]:
df_test['tidy_tweet'] = df_test['tidy_tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)
df_test.head(10)


Unnamed: 0,id,tweet,tidy_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone to see th...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your #acne #altwaystohe...
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",rd #bihday to my amazing hilarious #nephew...
5,31968,choose to be :) #momtips,choose to be #momtips
6,31969,something inside me dies ð¦ð¿â¨ eyes nes...,something inside me dies eyes nes...
7,31970,#finished#tattoo#inked#ink#loveitâ¤ï¸ #â¤ï¸...,#finished#tattoo#inked#ink#loveit # ...
8,31971,@user @user @user i will never understand why...,i will never understand why my dad left me...
9,31972,#delicious #food #lovelife #capetown mannaep...,#delicious #food #lovelife #capetown mannaep...


**3. Removing Short Words**

We have to be a little careful in choosing the length of the words we want to remove. So I have decided to remove all words of length 3 or less. For example, terms like "hmm" and "oh" are of very little use, so it is better to get rid of them.
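
The filter can be tried on a single sentence first. In this sketch, `drop_short_words` and its `min_len` parameter are hypothetical names; the rule itself (keep words longer than 3 characters) is the one described above.

```python
def drop_short_words(text, min_len=4):
    # Keep only whitespace-separated words with at least `min_len` characters.
    return " ".join(w for w in text.split() if len(w) >= min_len)

sample = "i can t use cause they do not offer vans"
print(drop_short_words(sample))
```

The example also shows a trade-off of this threshold: "not" has three characters and is removed, so negations are lost along with filler words.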

In [168]:
df_train['tidy_tweet'] = df_train['tidy_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
df_train

Unnamed: 0,id,label,tweet,tidy_tweet
0,1,0,@user when a father is dysfunctional and is s...,when father dysfunctional selfish drags kids i...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks #lyft credit cause they offer wheelchai...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model love take with time
4,5,0,factsguide: society now #motivation,factsguide society #motivation
...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,that youuu
31958,31959,0,to see nina turner on the airwaves trying to...,nina turner airwaves trying wrap herself mantl...
31959,31960,0,listening to sad songs on a monday morning otw...,listening songs monday morning work
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...",#sikh #temple vandalised #calgary #wso condemns


In [169]:
df_test['tidy_tweet'] = df_test['tidy_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
df_test

Unnamed: 0,id,tweet,tidy_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone #birds #mov...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways heal your #acne #altwaystoheal #heal...
3,31966,is the hp and the cursed child book up for res...,cursed child book reservations already where w...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",#bihday amazing hilarious #nephew ahmir uncle ...
...,...,...,...
17192,49155,thought factory: left-right polarisation! #tru...,thought factory left right polarisation #trump...
17193,49156,feeling like a mermaid ð #hairflip #neverre...,feeling like mermaid #hairflip #neverready #fo...
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...,#hillary #campaigned today #ohio used words li...
17195,49158,"happy, at work conference: right mindset leads...",happy work conference right mindset leads cult...


In [170]:
df = pd.concat([df_train, df_test], ignore_index=True, sort=False)
df.shape

(49159, 4)

In [171]:
df.isna().sum()

id                0
label         17197
tweet             0
tidy_tweet        0
dtype: int64

In [172]:
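#Note: the test set has no labels; this fills them with a placeholder value (1), not with real labels
df['label'] = df['label'].fillna(df['label'].max())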

In [173]:
df.isna().sum()

id            0
label         0
tweet         0
tidy_tweet    0
dtype: int64

In [174]:
df

Unnamed: 0,id,label,tweet,tidy_tweet
0,1,0.0,@user when a father is dysfunctional and is s...,when father dysfunctional selfish drags kids i...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks #lyft credit cause they offer wheelchai...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model love take with time
4,5,0.0,factsguide: society now #motivation,factsguide society #motivation
...,...,...,...,...
49154,49155,1.0,thought factory: left-right polarisation! #tru...,thought factory left right polarisation #trump...
49155,49156,1.0,feeling like a mermaid ð #hairflip #neverre...,feeling like mermaid #hairflip #neverready #fo...
49156,49157,1.0,#hillary #campaigned today in #ohio((omg)) &am...,#hillary #campaigned today #ohio used words li...
49157,49158,1.0,"happy, at work conference: right mindset leads...",happy work conference right mindset leads cult...


In [175]:
df_1 = df.iloc[:31962,:]

In [176]:
df_2 = df.iloc[31962:,:]
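
An alternative to splitting by row position is to split on whether a label is present. The sketch below assumes the missing test labels are left as NaN rather than filled, which differs from the cells above; the tiny `train` and `test` frames are hypothetical stand-ins for `df_train` and `df_test`.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
train = pd.DataFrame({"id": [1, 2, 3], "label": [0, 1, 0], "tweet": ["a", "b", "c"]})
test = pd.DataFrame({"id": [4, 5], "tweet": ["d", "e"]})

combined = pd.concat([train, test], ignore_index=True, sort=False)

# Rows that came from the test set have no label, so their label stays NaN.
labelled = combined[combined["label"].notna()]
unlabelled = combined[combined["label"].isna()]

print(labelled.shape, unlabelled.shape)
```

Splitting this way does not depend on the hard-coded row count 31962, and it keeps the unlabelled rows clearly distinguishable from real class-1 examples.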

In [177]:
df_1

Unnamed: 0,id,label,tweet,tidy_tweet
0,1,0.0,@user when a father is dysfunctional and is s...,when father dysfunctional selfish drags kids i...
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks #lyft credit cause they offer wheelchai...
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in ...,#model love take with time
4,5,0.0,factsguide: society now #motivation,factsguide society #motivation
...,...,...,...,...
31957,31958,0.0,ate @user isz that youuu?ðððððð...,that youuu
31958,31959,0.0,to see nina turner on the airwaves trying to...,nina turner airwaves trying wrap herself mantl...
31959,31960,0.0,listening to sad songs on a monday morning otw...,listening songs monday morning work
31960,31961,1.0,"@user #sikh #temple vandalised in in #calgary,...",#sikh #temple vandalised #calgary #wso condemns


In [178]:
df_2

Unnamed: 0,id,label,tweet,tidy_tweet
31962,31963,1.0,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
31963,31964,1.0,@user #white #supremacists want everyone to s...,#white #supremacists want everyone #birds #mov...
31964,31965,1.0,safe ways to heal your #acne!! #altwaystohe...,safe ways heal your #acne #altwaystoheal #heal...
31965,31966,1.0,is the hp and the cursed child book up for res...,cursed child book reservations already where w...
31966,31967,1.0,"3rd #bihday to my amazing, hilarious #nephew...",#bihday amazing hilarious #nephew ahmir uncle ...
...,...,...,...,...
49154,49155,1.0,thought factory: left-right polarisation! #tru...,thought factory left right polarisation #trump...
49155,49156,1.0,feeling like a mermaid ð #hairflip #neverre...,feeling like mermaid #hairflip #neverready #fo...
49156,49157,1.0,#hillary #campaigned today in #ohio((omg)) &am...,#hillary #campaigned today #ohio used words li...
49157,49158,1.0,"happy, at work conference: right mindset leads...",happy work conference right mindset leads cult...


We can see the difference between the raw tweets and the cleaned tweets (tidy_tweet) quite clearly. Only words longer than three characters have been retained, and the noise (numbers, punctuation, and special characters) has been removed.

In [179]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [180]:
X_train = df_1['tidy_tweet']
X_test = df_2['tidy_tweet']

y_train = df_1['label']
y_test = df_2['label']

In [181]:
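#One-hot encoding the target column
#Note: get_dummies creates one column per distinct value, so y_test (all placeholder 1.0 values) ends up with a single column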
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)
y_test.head()

Unnamed: 0,1.0
31962,1
31963,1
31964,1
31965,1
31966,1


In [182]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
#The maximum number of words to keep (most frequent)
MAX_NB_WORDS = 10000
#Max number of words in each tweet (defined here, but the padding step below uses 44)
MAX_SEQUENCE_LENGTH = 100


# Initialize and fit the tokenizer
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True, split=' ')
tokenizer.fit_on_texts(X_train)

In [183]:
#Use that tokenizer to transform the text messages in the training and test sets
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_seq[10]

[1072, 2323, 1152, 1653, 9053, 16, 869, 71, 105, 98, 122]
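
The integers above are word ranks. As a rough illustration of the idea, the sketch below builds a frequency-ranked index in plain Python. It is a simplified stand-in written for this explanation, not the Keras implementation, and `texts`, `word_index` and `num_words` here are toy values.

```python
from collections import Counter

texts = ["love this land", "love this", "hate this"]

# Rank words by frequency. Index 0 is reserved for padding, so ranks start at 1.
freq = Counter(w for t in texts for w in t.lower().split())
word_index = {w: i for i, (w, _) in enumerate(freq.most_common(), start=1)}

# Only words whose index is below num_words are kept when encoding.
num_words = 3

def to_sequence(text):
    # Unknown or too-rare words are silently dropped.
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

sequences = [to_sequence(t) for t in texts]
print(word_index)
print(sequences)
```

Two practical consequences follow from this scheme: words that appear only in the test tweets are dropped unless an out-of-vocabulary token is configured, and a limit of `num_words` keeps the `num_words - 1` most frequent words.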

In [184]:
#Pad the sequences so each sequence is the same length
X_train_seq_padded = pad_sequences(X_train_seq, maxlen=44)
X_test_seq_padded = pad_sequences(X_test_seq, maxlen=44)

X_train_seq_padded[10]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
       1072, 2323, 1152, 1653, 9053,   16,  869,   71,  105,   98,  122],
      dtype=int32)
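
The zeros sit on the left because `pad_sequences` pads (and truncates) at the start of a sequence by default. A minimal pure-Python sketch of that behaviour follows; `pad_pre` is a hypothetical helper written for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep at most the last `maxlen` items, then left-pad with `value`.
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([1072, 2323, 1152], 5)
long = pad_pre([1, 2, 3, 4, 5, 6], 4)

print(short)
print(long)
```

With this default, a tweet longer than the chosen length loses its first tokens rather than its last ones.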

In [185]:
print('X_train_seq_padded:', X_train_seq_padded.shape)
print('X_test_seq_padded:', X_test_seq_padded.shape)

print('y_train:', y_train.shape)
print('y_test:', y_test.shape)

X_train_seq_padded: (31962, 44)
X_test_seq_padded: (17197, 44)
y_train: (31962, 2)
y_test: (17197, 1)
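
The shapes of `y_train` and `y_test` differ because `pd.get_dummies` creates one column per value that actually occurs, and `y_test` contains only the value 1.0. One way to get a fixed number of columns, sketched here with NumPy and a hypothetical `one_hot` helper, is to index into an identity matrix:

```python
import numpy as np

def one_hot(labels, num_classes=2):
    # Row i of the identity matrix is the one-hot vector for class i.
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype="float32")[labels]

mixed = one_hot([0, 1, 0])
single_class = one_hot([1, 1, 1])

print(mixed.shape, single_class.shape)
```

Both arrays have two columns, even when only one class is present in the input.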


#Model Building

In [186]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding, SpatialDropout1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping


model=Sequential()
model.add(Embedding(input_dim=MAX_NB_WORDS, output_dim=64, input_length=X_train_seq_padded.shape[1]))
#input_dim: Size of the vocabulary.
#output_dim: Dimension of the dense embedding.

model.add(SpatialDropout1D(0.4))
model.add(LSTM(128, activation='relu', dropout=0.2, recurrent_dropout=0.2, return_sequences=True))

model.add(LSTM(128, activation='relu', dropout=0.2, recurrent_dropout=0.2, return_sequences=True))

model.add(LSTM(128, activation='relu', dropout=0.2, recurrent_dropout=0.2))

#Note: this layer has 3 output units, while the one-hot labels above have 2 columns
model.add(Dense(3, activation='softmax'))

model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 44, 64)            640000    
                                                                 
 spatial_dropout1d_3 (Spatia  (None, 44, 64)           0         
 lDropout1D)                                                     
                                                                 
 lstm_9 (LSTM)               (None, 44, 128)           98816     
                                                                 
 lstm_10 (LSTM)              (None, 44, 128)           131584    
                                                                 
 lstm_11 (LSTM)              (None, 128)               131584    
                                                                 
 dense_3 (Dense)             (None, 3)                 387       
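
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are hypothetical and the layer sizes are read from the model definition above.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [
    embedding_params(10000, 64),
    lstm_params(64, 128),
    lstm_params(128, 128),
    lstm_params(128, 128),
    dense_params(128, 3),
]
print(counts, sum(counts))
```

The first LSTM is smaller than the other two because its input is the 64-dimensional embedding, while the later layers receive 128-dimensional hidden states.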
                                                      

In [187]:
#Compile the model
model.compile(optimizer=Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#Adding an early stopping
es = EarlyStopping(monitor='val_accuracy', 
                   mode='max', 
                   patience=4, #Stop training if the validation accuracy does not improve for 4 consecutive epochs
                   restore_best_weights=True)
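
To see what `patience=4` means in practice, here is a simplified pure-Python sketch of the stopping rule. It is an illustration of the logic only, not the Keras callback itself; the function name and the validation scores are made up.

```python
def early_stopping_trace(val_scores, patience=4):
    # Returns (best_epoch, last_epoch_run), both 0-indexed.
    best, best_epoch, wait = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            # A strict improvement resets the counter.
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

scores = [0.90, 0.93, 0.92, 0.93, 0.91, 0.92, 0.95]
result = early_stopping_trace(scores)
print(result)
```

In this made-up run, training stops before the final score of 0.95 is ever reached, because four epochs in a row failed to beat 0.93. With `restore_best_weights=True`, the weights from the best epoch seen so far are the ones kept.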

In [188]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('mode.chained_assignment', None)
palette=sns.color_palette('magma')
sns.set(palette=palette)

#Fit the RNN
history = model.fit(X_train_seq_padded, y_train, 
                    batch_size=32, epochs=100, callbacks =[es],
                    validation_data=(X_test_seq_padded, y_test))

Epoch 1/100


ValueError: ignored
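
The full error message is not preserved in this output, so its cause can only be inferred. The shapes printed earlier point to a likely explanation: `y_train` has 2 columns, `y_test` has 1 column, and the final Dense layer has 3 units, whereas `categorical_crossentropy` expects targets and predictions to have the same number of columns. Since the test set has no real labels, one possible way forward (an assumption, not what this notebook does) is to use a 2-unit output layer and hold out part of the labelled training data for validation. The stratified split below is a NumPy-only sketch with a hypothetical helper name and toy labels.

```python
import numpy as np

def stratified_holdout(labels, val_fraction=0.2, seed=0):
    # Pick `val_fraction` of the indices of each class for validation.
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    val_idx = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = max(1, int(round(len(idx) * val_fraction)))
        val_idx.extend(idx[:n_val])
    val_idx = np.sort(np.array(val_idx))
    train_idx = np.setdiff1d(np.arange(len(labels)), val_idx)
    return train_idx, val_idx

# Toy labels with a 90/10 class split.
labels = np.array([0] * 90 + [1] * 10)
train_idx, val_idx = stratified_holdout(labels)

print(len(train_idx), len(val_idx), labels[val_idx].sum())
```

The resulting index arrays could then be used to slice the padded sequences and the one-hot labels, with the validation slice passed as `validation_data` to `model.fit`.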

In [189]:
df_1.shape

(31962, 4)

In [190]:
df_2.shape

(17197, 4)

In [191]:
df_2.isna().sum()

id            0
label         0
tweet         0
tidy_tweet    0
dtype: int64