# Final Masters Project

## Name: Sreekanth Palagiri, Student ID: R00184198

## Project Topic: Evaluation of Ensemble Approach for Sentiment Analysis on a Small Dataset

## Notebook 1: Trainer LSTM


### **Mount Google Drive**

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!ls "gdrive/My Drive/Colab Notebooks/Masters Project"

'Airline Tweets dataset'  'Sentence Polarity Dataset'
 glove.6B.300d.txt	   VMDataset


### **Load Data and Preprocess**

In [3]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/sentimentpolarity.csv")
print(df.groupby(['label']).size())
df.head()

label
0    1000
1    1000
dtype: int64


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | [ferrera] has the charisma of a young woman wh... | 1 |
| 1 | both flawed and delayed , martin scorcese's ga... | 1 |
| 2 | for his first attempt at film noir , spielberg... | 1 |
| 3 | easily one of the best and most exciting movie... | 1 |
| 4 | this director's cut -- which adds 51 minutes -... | 0 |


**Preprocessor: strip HTML tags, lowercase, and remove all special characters except emoticons**

In [4]:
import re

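def preprocessor(text):
    # Strip HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse other characters to spaces, then append the
    # emoticons with the "-" nose removed.
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text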

print(df['text'][19])
print(preprocessor(df['text'][19]))

the only fun part of the movie is playing the obvious game . you try to guess the order in which the kids in the house will be gored . 
the only fun part of the movie is playing the obvious game you try to guess the order in which the kids in the house will be gored 
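
The sample row above contains no HTML tags or emoticons, so it only exercises the punctuation branch. As a rough illustration, the sketch below runs the same logic on a made-up string (not taken from the dataset) to show that tags are dropped and emoticons are moved to the end of the text.

```python
import re

def preprocessor(text):
    # Same logic as the notebook's preprocessor.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text

# Hypothetical input, invented for this example.
sample = "<b>Loved</b> it :-) but the ending... :("
cleaned = preprocessor(sample)
print(repr(cleaned))  # 'loved it but the ending :) :('
```

Because the emoticons are appended after the cleaned text, their original position in the sentence is lost; only their presence is kept.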


In [5]:
df['text'] = df['text'].apply(preprocessor)

### **Separate Into Train and Test Sets**

In [6]:
from sklearn.model_selection import train_test_split

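# shuffle=False keeps the file order: the first 85% of rows train, the last 15% test.
# random_state has no effect when shuffle=False.
df_train, df_test, sentiment_train, sentiment_test = train_test_split(df['text'], df['label'], 
                                                                      random_state=1, test_size=0.15, 
                                                                      shuffle=False)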


print('Length of train set:',len(df_train),'Length of test set:',len(df_test))

Length of train set: 1700 Length of test set: 300


### **LSTM Model**

**Define and Fit Tokenizer**

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer

numwords= 20000

t = Tokenizer(num_words=numwords)
t.fit_on_texts(df_train)
word_index= t.word_index

**Convert text to sequences for further processing**



In [8]:
train_sequences = t.texts_to_sequences(df_train)
test_sequences = t.texts_to_sequences(df_test)
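
For readers unfamiliar with the Keras `Tokenizer`, the following is a simplified pure-Python sketch of the two steps above. It assumes whitespace splitting only (the real class also lowercases and filters punctuation), and the helper names `fit_word_index` and `to_sequences` are invented for this illustration. It shows the two properties that matter later: indices start at 1 in descending frequency order, and words that are unseen or whose index is not below `num_words` are silently dropped.

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by descending frequency; index 0 stays free for padding.
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def to_sequences(texts, word_index, num_words):
    # Keep only known words whose index is below num_words; drop the rest.
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_train = ["good movie good plot", "bad movie"]   # made-up corpus
toy_index = fit_word_index(toy_train)
toy_seqs = to_sequences(["good bad ending"], toy_index, num_words=4)
print(toy_index)  # {'good': 1, 'movie': 2, 'plot': 3, 'bad': 4}
print(toy_seqs)   # [[1]]
```

In the toy run, "bad" has index 4 and is cut by `num_words=4`, while "ending" never appeared in the training texts. This is why the tokenizer is fitted on `df_train` only: test-set words outside the training vocabulary simply disappear from the sequences.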

**Pad Sequences**

In [9]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_seq_length = 500  # pad or truncate every review to 500 tokens
embed_dim = 300       # must match the 300-d GloVe vectors

train_sequences = pad_sequences(train_sequences, maxlen=max_seq_length)
test_sequences = pad_sequences(test_sequences, maxlen=max_seq_length)

print('Shape of training data tensor:',train_sequences.shape)
print('Shape of test data tensor:',test_sequences.shape)

Shape of training data tensor: (1700, 500)
Shape of test data tensor: (300, 500)
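
`pad_sequences` is called here with its defaults, which pad and truncate at the start of each sequence (`padding='pre'`, `truncating='pre'`) using the value 0. The sketch below imitates that behaviour in plain Python; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    # Left-pad short sequences; keep the last `maxlen` tokens of long ones.
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

demo = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

Since the embedding layer defined later does not set `mask_zero`, the LSTM also steps through the padded positions, so a much smaller `max_seq_length` may be worth trying for single-sentence inputs.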


In [10]:
X_train=train_sequences
Y_train=sentiment_train

X_test=test_sequences
Y_test=sentiment_test

**Download GloVe Embeddings**

In [None]:
# One-time download; uncomment to fetch the vectors and move them to Drive.
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip "glove.6B.zip"
#!mv glove* "/content/gdrive/My Drive/Colab Notebooks/Masters Project"

**Create Embedding Matrix**

In [None]:
# Map each GloVe word to its 300-d vector.
embeddings_index = {}
with open('/content/gdrive/My Drive/Colab Notebooks/Masters Project/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Row i holds the GloVe vector for the word with tokenizer index i;
# words missing from GloVe keep an all-zero row.
embedding_matrix = np.zeros((numwords, embed_dim))
for word, i in word_index.items():
    if i < numwords:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)

(20000, 300)
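
The printed shape does not say how many rows were actually filled. Words without a GloVe entry start as zero vectors, and because the preprocessor keeps apostrophes, tokens such as "scorcese's" may fail to match. A quick coverage check along the following lines could quantify this; `embedding_coverage` is a hypothetical helper and the toy dictionaries are made up for the example.

```python
def embedding_coverage(word_index, embeddings_index, num_words):
    # Only words with an index below num_words get a row in the matrix.
    candidates = [w for w, i in word_index.items() if i < num_words]
    if not candidates:
        return 0.0, []
    missing = [w for w in candidates if w not in embeddings_index]
    covered = len(candidates) - len(missing)
    return covered / len(candidates), missing

# Toy stand-ins for the notebook's word_index and embeddings_index.
toy_word_index = {"movie": 1, "scorcese's": 2, "plot": 3, "rare": 4}
toy_embeddings = {"movie": [0.1], "plot": [0.2], "rare": [0.3]}

coverage, missing = embedding_coverage(toy_word_index, toy_embeddings, num_words=4)
print(round(coverage, 3), missing)  # 0.667 ["scorcese's"]
```

In the notebook the same call would take `word_index`, `embeddings_index`, and `numwords` as arguments.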


**Define and Fit the LSTM Model, with a Callback to Save the Best-Performing Model**

In [None]:
from tensorflow.keras import models, layers, callbacks

model_lstm = models.Sequential()

model_lstm.add(layers.Embedding(numwords,embed_dim,input_length =max_seq_length))
model_lstm.add(layers.LSTM(64,dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(layers.Dense(64,activation='relu'))
model_lstm.add(layers.Dropout(0.3))
model_lstm.add(layers.Dense(2, activation='softmax'))

filepath="/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/model_lstm.h5"
checkpoint3 = callbacks.ModelCheckpoint (filepath, monitor='val_accuracy', verbose=1, 
                                         save_best_only=True, save_weights_only=False, mode='auto')
callbacks_list3 = [checkpoint3]

model_lstm.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 300)          6000000   
_________________________________________________________________
lstm (LSTM)                  (None, 64)                93440     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
Total params: 6,097,730
Trainable params: 6,097,730
Non-trainable params: 0
_________________________________________________________________
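
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The arithmetic below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and dense layers; the variable names are chosen for this example.

```python
vocab_size, embed_dim, lstm_units, dense_units, n_classes = 20000, 300, 64, 64, 2

# Embedding: one embed_dim vector per vocabulary index.
embedding_params = vocab_size * embed_dim
# LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Dense layers: weights plus one bias per output unit.
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
# 6000000 93440 4160 130 6097730
```

Roughly 98% of the parameters sit in the embedding layer, which is trained on only 1,700 examples once `trainable` is set to `True`.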


In [None]:
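# Initialise the embedding layer with the GloVe matrix and let it fine-tune.
model_lstm.layers[0].set_weights([embedding_matrix])
model_lstm.layers[0].trainable = True

# Integer labels (0/1) with a 2-unit softmax output, hence the sparse loss.
model_lstm.compile(optimizer='rmsprop',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

# Note: the test set doubles as the validation set here, so the checkpoint
# callback selects the saved model using test data.
history = model_lstm.fit(X_train, 
                     Y_train,
                     batch_size=32,
                     epochs=20,
                     shuffle=True,
                     callbacks=callbacks_list3,
                     verbose=2,
                     validation_data=(X_test,Y_test))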

Epoch 1/20
54/54 - 86s - loss: 0.6697 - accuracy: 0.5765 - val_loss: 0.6617 - val_accuracy: 0.6033

Epoch 00001: val_accuracy improved from -inf to 0.60333, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/model_lstm.h5
Epoch 2/20
54/54 - 83s - loss: 0.5200 - accuracy: 0.7471 - val_loss: 0.7166 - val_accuracy: 0.6367

Epoch 00002: val_accuracy improved from 0.60333 to 0.63667, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/model_lstm.h5
Epoch 3/20
54/54 - 83s - loss: 0.3958 - accuracy: 0.8229 - val_loss: 0.5881 - val_accuracy: 0.7400

Epoch 00003: val_accuracy improved from 0.63667 to 0.74000, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/model_lstm.h5
Epoch 4/20
54/54 - 83s - loss: 0.3098 - accuracy: 0.8688 - val_loss: 0.6124 - val_accuracy: 0.7033

Epoch 00004: val_accuracy did not improve from 0.74000
Epoch 5/

**Train and Test Scores**

In [None]:
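# evaluate() returns [loss, accuracy], so each line prints both values.
# These scores use the final-epoch weights still in memory, not the saved checkpoint.
print('Train Accuracy Score:',model_lstm.evaluate(X_train, Y_train))
print('Test Accuracy Score:',model_lstm.evaluate(X_test, Y_test))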

Train Accuracy Score: [2.2730979253537953e-05, 1.0]
Test Accuracy Score: [2.0930142402648926, 0.753333330154419]


**Save Tokenizer**

In [None]:
import io
import json

tokenizer_json = t.to_json()
with io.open('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
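
One detail worth knowing when reloading this file: `to_json()` already returns a JSON string, so passing it through `json.dumps` stores a JSON-encoded string rather than a JSON object. The standard-library sketch below shows the round trip, with a small made-up dictionary standing in for the tokenizer configuration.

```python
import json

# Stand-in for t.to_json(): a JSON string, not a dict.
inner = json.dumps({"config": {"num_words": 20000}})
# What the cell above writes to disk: the string encoded a second time.
on_disk = json.dumps(inner, ensure_ascii=False)

restored = json.loads(on_disk)   # first decode returns the original string
config = json.loads(restored)    # second decode returns the dictionary

print(type(restored).__name__, config["config"]["num_words"])  # str 20000
```

A loader would therefore call `json.load` on the file once and, presumably, pass the resulting string to `tokenizer_from_json` from `tensorflow.keras.preprocessing.text`, which expects a JSON string.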

**Load the Saved Best Model and Score the Test Set**

In [13]:
from tensorflow import keras

model_lstm=keras.models.load_model('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Sentence Polarity Dataset/Models/model_lstm.h5')



In [18]:
from sklearn import metrics

Y_prob=model_lstm.predict(X_test)
Y_pred = Y_prob.argmax(axis=-1)
print('F1 Score:',metrics.f1_score(Y_test,Y_pred),
      'Precision:',metrics.precision_score(Y_test,Y_pred),
      'Recall:',metrics.recall_score(Y_test,Y_pred),
      'Accuracy:',metrics.accuracy_score(Y_test,Y_pred))

F1 Score: 0.7914110429447851 Precision: 0.7633136094674556 Recall: 0.821656050955414 Accuracy: 0.7733333333333333
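
The four scores are consistent with a single confusion matrix on the 300 test rows. The counts below are inferred from the reported precision and recall (129/169 and 129/157); the notebook does not print them, so they should be read as a reconstruction rather than as notebook output.

```python
# Inferred counts for the positive class (label 1); not printed by the notebook.
tp, fn, fp, tn = 129, 28, 40, 103

precision = tp / (tp + fp)                      # share of predicted positives that are correct
recall = tp / (tp + fn)                         # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(precision, recall, f1, accuracy)
```

Under this reconstruction the model predicts the positive class 169 times against 157 actual positives, so it leans slightly towards positive predictions, which fits recall being higher than precision.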
