# Final Masters Project

## Name: Sreekanth Palagiri, Student ID: R00184198

## Project Topic: Evaluation of Ensemble Approach for Sentiment Analysis on a Small Dataset

## Notebook 1: Trainer LSTM


### **Mount google drive**

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!ls "gdrive/My Drive/Colab Notebooks/Masters Project"

'Airline Tweets dataset'  'Sentence Polarity Dataset'
 glove.6B.300d.txt	   VMDataset


### **Load Data and Preprocess**

In [3]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/Export_loop-sentiment-pos-neg-train_05112020000000.csv")
print(df.groupby(['label']).size())
df.head()

label
Negative     887
Positive    1013
dtype: int64


Unnamed: 0,label,text
0,Negative,No one cares about marketing slides - a techni...
1,Positive,Are all three hosts providing storage/capacity...
2,Negative,would loved to had managed to get down to the ...
3,Negative,Vending machine at work is out of Dasani water...
4,Positive,"RT @VMwareEdu: Paul Maritz, CEO and President ..."


**Preprocessor: remove HTML tags and all special characters except emoticons**

In [4]:
import re

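def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of other characters with a space,
    # then append the emoticons with the "nose" (-) removed
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text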

print(df['text'][19])
print(preprocessor(df['text'][19]))

Kristina,  Any updates from your side ? I volunteer for beta test :)  - really need that app running as workaround are driving me nuts ....
kristina any updates from your side i volunteer for beta test really need that app running as workaround are driving me nuts :)
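
To make the behaviour of `preprocessor` concrete, the sketch below reruns the same logic on two made-up strings (the inputs are illustrative and not taken from the dataset). It shows that emoticons are collected and appended at the end of the text, and that no separating space is added when the cleaned text ends in a word character.

```python
import re

def preprocessor(text):
    # Same logic as the notebook's preprocessor, written with raw strings.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text

# Made-up inputs, for illustration only.
sample_trailing = preprocessor("<b>Great</b> release :-) but slow ;(")
sample_inline = preprocessor("good :) day")

print(repr(sample_trailing))  # 'great release but slow :) ;('
print(repr(sample_inline))    # 'good day:)'
```

In the second case the emoticon is glued to the last word, so a downstream tokenizer may see `day:)` rather than two separate tokens.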


In [5]:
df['text'] = df['text'].apply(preprocessor)

### **Label Encode the Sentiment**

In [6]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df['Sentiment']=le.fit_transform(df['label'])
df.head(10)

Unnamed: 0,label,text,Sentiment
0,Negative,no one cares about marketing slides a technica...,0
1,Positive,are all three hosts providing storage capacity...,1
2,Negative,would loved to had managed to get down to the ...,0
3,Negative,vending machine at work is out of dasani water...,0
4,Positive,rt vmwareedu paul maritz ceo and president of ...,1
5,Positive,had few folks ask if you're interested johnny ...,1
6,Positive,get notified of the latest vsan patch releases...,1
7,Negative,end of general support is 3 12 2020 6 5 and 6 ...,0
8,Negative,placed 4th in funrun today in the 17 39 age gr...,0
9,Positive,yup guys being currently under nda know this f...,1


In [7]:
le.classes_

array(['Negative', 'Positive'], dtype=object)
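
`LabelEncoder` sorts the class names, which is why `Negative` maps to 0 and `Positive` to 1. A minimal sketch on a toy label list (not the project data) shows the mapping and how predictions can be decoded back to label names:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# Toy labels for illustration; the notebook fits on df['label'].
encoded = le.fit_transform(['Negative', 'Positive', 'Positive', 'Negative'])

# classes_ is sorted, so the position of each class is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
decoded = [str(label) for label in le.inverse_transform([1, 0])]

print(list(encoded))  # [0, 1, 1, 0]
print(mapping)        # {'Negative': 0, 'Positive': 1}
print(decoded)        # ['Positive', 'Negative']
```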

### **Separate Into Train and Test Sets**

In [8]:
from sklearn.model_selection import train_test_split

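# shuffle=False keeps the original row order, so the last 15% of rows form the
# test set and random_state has no effect on the split.
df_train, df_test, sentiment_train, sentiment_test = train_test_split(
    df['text'], df['Sentiment'], random_state=1, test_size=0.15, shuffle=False)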


print('Length of train set:',len(df_train),'Length of test set:',len(df_test))

Length of train set: 1615 Length of test set: 285
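
With `shuffle=False` the split is purely positional. The sketch below reproduces the sizes on a stand-in list of 1,900 row positions, and then shows a stratified, shuffled split as one possible alternative. The alternative is an assumption for comparison, not what this notebook uses, and the label list is rebuilt from the class counts printed earlier rather than read from the dataset.

```python
from sklearn.model_selection import train_test_split

rows = list(range(1900))  # stand-in for the 1,900 row positions

# Positional split, as in the notebook: the last 15% of rows become the test set.
train_rows, test_rows = train_test_split(rows, test_size=0.15, shuffle=False)
print(len(train_rows), len(test_rows))  # 1615 285
print(test_rows[0], test_rows[-1])      # 1615 1899

# Hypothetical alternative: shuffle and keep the class ratio in both parts.
labels = [0] * 887 + [1] * 1013         # class counts from the groupby output
s_train, s_test, y_s_train, y_s_test = train_test_split(
    rows, labels, test_size=0.15, random_state=1, stratify=labels)
positive_share = sum(y_s_test) / len(y_s_test)
print(round(positive_share, 2))
```

If the CSV rows are ordered in any systematic way (by date or by source, for example), the positional split can give a test set whose class balance differs from the training set, which is what stratification is meant to avoid.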


### **LSTM Model**

**Define and Fit Tokenizer**

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer

numwords= 20000

t = Tokenizer(num_words=numwords)
t.fit_on_texts(df_train)
word_index= t.word_index
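
One point worth checking: `Tokenizer` is created here with its default `filters` argument, which strips most punctuation, including the characters that make up emoticons. The sketch below is a pure-Python emulation of that default filtering, assuming the documented default filter string; it is not the Keras implementation itself, and `emulate_default_tokenization` is a hypothetical helper name.

```python
# Default `filters` string documented for the Keras Tokenizer (assumed unchanged).
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def emulate_default_tokenization(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, replace every filtered character with a space, split on spaces."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return [tok for tok in text.lower().translate(table).split(' ') if tok]

tokens_smile = emulate_default_tokenization("driving me nuts :)")
tokens_grin = emulate_default_tokenization("love it :D")

print(tokens_smile)  # ['driving', 'me', 'nuts']
print(tokens_grin)   # ['love', 'it', 'd']
```

If the emoticons kept by `preprocessor` are meant to reach the model, passing `filters=''` to `Tokenizer` is one option to consider, since the text has already been cleaned at that point.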

**Convert text to sequences for further processing**



In [10]:
train_sequences = t.texts_to_sequences(df_train)
test_sequences = t.texts_to_sequences(df_test)

**Pad Sequences**

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

mylen = np.vectorize(len)
max_seq_length= 500
embed_dim = 300

train_sequences = pad_sequences(train_sequences,maxlen =max_seq_length)
test_sequences = pad_sequences(test_sequences,maxlen =max_seq_length)

print('Shape of training data tensor:',train_sequences.shape)
print('Shape of test data tensor:',test_sequences.shape)

Shape of training data tensor: (1615, 500)
Shape of test data tensor: (285, 500)
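
By default `pad_sequences` pads with zeros on the left and, for sequences longer than `maxlen`, drops tokens from the front. The NumPy sketch below mimics that default behaviour on toy sequences; `pad_pre` is a hypothetical helper written for illustration, not the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Sketch of pad_sequences defaults: padding='pre', truncating='pre', value=0."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if not seq:
            continue
        kept = seq[-maxlen:]          # keep only the last maxlen tokens
        out[row, -len(kept):] = kept  # right-align, leaving zeros on the left
    return out

toy = [[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded.tolist())  # [[0, 5, 8, 2], [0, 0, 0, 7], [3, 4, 5, 6]]

# Inspecting the length distribution can help choose maxlen.
lengths = np.array([len(s) for s in toy])
print(lengths.max())    # 6
```

Running the same length check on the unpadded `train_sequences` would show whether 500 is much larger than needed for short texts such as tweets; if so, a smaller `max_seq_length` would likely shorten training time.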


In [12]:
X_train=train_sequences
Y_train=sentiment_train

X_test=test_sequences
Y_test=sentiment_test

**Download GloVe embeddings**

In [13]:
# One-time setup: uncomment to download, unzip and move the vectors to Drive
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip "glove.6B.zip"
#!mv glove* "/content/gdrive/My Drive/Colab Notebooks/Masters Project"

**Create Embedding Matrix**

In [14]:
embeddings_index = {}
# Each line holds a word followed by the components of its 300-dimensional vector
with open('/content/gdrive/My Drive/Colab Notebooks/Masters Project/glove.6B.300d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs

In [15]:
embedding_matrix= np.zeros((numwords , embed_dim))
for word, i in word_index.items():
  if i < numwords:
    # If this word is contained in the downloaded GloVe vectors,
    # copy its vector into row i; otherwise the row stays all zeros.
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)

(20000, 300)
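
Words that are missing from GloVe keep an all-zero row in `embedding_matrix`. A quick coverage check, sketched below on a toy vocabulary with made-up 4-dimensional vectors, shows how the share of covered words could be measured; the words and vectors are illustrative assumptions only.

```python
import numpy as np

embed_dim = 4   # toy dimension (the notebook uses 300)
numwords = 6    # toy vocabulary cap (the notebook uses 20000)

word_index = {'the': 1, 'vsan': 2, 'cluster': 3, 'zzzunknown': 4}
embeddings_index = {
    'the': np.ones(embed_dim, dtype='float32'),
    'cluster': np.full(embed_dim, 0.5, dtype='float32'),
}

# Same filling loop as in the notebook.
embedding_matrix = np.zeros((numwords, embed_dim))
for word, i in word_index.items():
    if i < numwords:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

in_vocab = [w for w, i in word_index.items() if i < numwords]
missing = [w for w in in_vocab if w not in embeddings_index]
coverage = 1 - len(missing) / len(in_vocab)

print(coverage)  # 0.5
print(missing)   # ['vsan', 'zzzunknown']
```

On a domain-specific corpus, low coverage would suggest that many product names and abbreviations start training from zero vectors.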


**Define and fit the LSTM model, with a callback to save the best-performing model**

In [16]:
from tensorflow.keras import models, layers, callbacks

model_lstm = models.Sequential()

model_lstm.add(layers.Embedding(numwords,embed_dim,input_length =max_seq_length))
model_lstm.add(layers.LSTM(64,dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(layers.Dense(64,activation='relu'))
model_lstm.add(layers.Dropout(0.3))
model_lstm.add(layers.Dense(2, activation='softmax'))

filepath="/content/gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/Models/model_lstm.h5"
checkpoint3 = callbacks.ModelCheckpoint (filepath, monitor='val_accuracy', verbose=1, 
                                         save_best_only=True, save_weights_only=False, mode='auto')
callbacks_list3 = [checkpoint3]

model_lstm.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 300)          6000000   
_________________________________________________________________
lstm (LSTM)                  (None, 64)                93440     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
Total params: 6,097,730
Trainable params: 6,097,730
Non-trainable params: 0
_________________________________________________________________
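
The parameter counts in the summary can be checked by hand. The short calculation below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights and a bias) and dense layers, with the sizes defined in this notebook.

```python
numwords, embed_dim = 20000, 300
lstm_units, dense_units, n_classes = 64, 64, 2

# Embedding: one embed_dim vector per vocabulary index.
embedding_params = numwords * embed_dim
# LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Dense layers: weights plus one bias per output unit.
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params + output_params

print(embedding_params)  # 6000000
print(lstm_params)       # 93440
print(dense_params)      # 4160
print(output_params)     # 130
print(total_params)      # 6097730
```

The embedding layer holds about 98% of the parameters, which is worth keeping in mind given that the training set has only 1,615 examples.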


In [17]:
model_lstm.layers[0].set_weights([embedding_matrix])
model_lstm.layers[0].trainable = True

model_lstm.compile(optimizer='rmsprop',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(X_train, 
                     Y_train,
                     batch_size=32,
                     epochs=20,
                     shuffle=True,
                     callbacks=callbacks_list3,
                     verbose=2,
                     validation_data=(X_test,Y_test))

Epoch 1/20
51/51 - 88s - loss: 0.6487 - accuracy: 0.6093 - val_loss: 0.5982 - val_accuracy: 0.7053

Epoch 00001: val_accuracy improved from -inf to 0.70526, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/Models/model_lstm.h5
Epoch 2/20
51/51 - 85s - loss: 0.5325 - accuracy: 0.7412 - val_loss: 0.8710 - val_accuracy: 0.5825

Epoch 00002: val_accuracy did not improve from 0.70526
Epoch 3/20
51/51 - 83s - loss: 0.4177 - accuracy: 0.8025 - val_loss: 0.5214 - val_accuracy: 0.7474

Epoch 00003: val_accuracy improved from 0.70526 to 0.74737, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/Models/model_lstm.h5
Epoch 4/20
51/51 - 83s - loss: 0.3188 - accuracy: 0.8737 - val_loss: 0.6734 - val_accuracy: 0.6702

Epoch 00004: val_accuracy did not improve from 0.74737
Epoch 5/20
51/51 - 84s - loss: 0.2394 - accuracy: 0.9028 - val_loss: 0.6464 - val_accuracy: 0.7263

Epoch 00005: val_accuracy did not improve from 0.74737
Epoch 6/20

**Train and Test Scores** (`evaluate` returns `[loss, accuracy]`)

In [20]:
print('Train Accuracy Score:',model_lstm.evaluate(X_train, Y_train))
print('Test Accuracy Score:',model_lstm.evaluate(X_test, Y_test))

Train Accuracy Score: [8.791703294264153e-06, 1.0]
Test Accuracy Score: [2.441232442855835, 0.7052631378173828]
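
These scores come from the model as it stands in memory after the last epoch, while the checkpoint file holds the epoch with the best `val_accuracy`. The sketch below replays the `save_best_only` rule on the five `val_accuracy` values visible in the training log above; the log is cut off after epoch 5, so later epochs are not represented here.

```python
# val_accuracy for epochs 1-5, copied from the training log.
val_accuracy = [0.7053, 0.5825, 0.7474, 0.6702, 0.7263]

best_so_far = float('-inf')
saved_epochs = []
for epoch, acc in enumerate(val_accuracy, start=1):
    # save_best_only=True: write the model only when the monitored value improves.
    if acc > best_so_far:
        best_so_far = acc
        saved_epochs.append(epoch)

print(saved_epochs)  # [1, 3]
print(best_so_far)   # 0.7474
```

This matches the "improved" messages in the log for epochs 1 and 3. The gap between a training accuracy of 1.0 and a test accuracy of about 0.71, together with the high test loss, suggests the final-epoch model has overfitted.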


**Save Tokenizer**

In [19]:
import io
import json

tokenizer_json = t.to_json()
with io.open('/content/gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/Models/tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

### **Predict and Evaluate Metrics**

In [21]:
from tensorflow import keras

model_lstm=keras.models.load_model('/content/gdrive/My Drive/Colab Notebooks/Masters Project/VMDataset/Models/model_lstm.h5')



In [22]:
from sklearn import metrics

Y_prob=model_lstm.predict(X_test)
Y_pred = Y_prob.argmax(axis=-1)
print('F1 Score:',metrics.f1_score(Y_test,Y_pred),
      'Precision:',metrics.precision_score(Y_test,Y_pred),
      'Recall:',metrics.recall_score(Y_test,Y_pred),
      'Accuracy:',metrics.accuracy_score(Y_test,Y_pred))

F1 Score: 0.7567567567567569 Precision: 0.8 Recall: 0.717948717948718 Accuracy: 0.7473684210526316
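
The four reported scores are consistent with a single confusion matrix on the 285 test examples. The counts below are inferred from the printed precision, recall and accuracy; the notebook does not print them, so treat them as a reconstruction (`metrics.confusion_matrix(Y_test, Y_pred)` would give the actual values).

```python
# Inferred counts (positive class = 1); a reconstruction, not notebook output.
tn, fp, fn, tp = 101, 28, 44, 112

precision = tp / (tp + fp)                    # share of predicted positives that are correct
recall = tp / (tp + fn)                       # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(f1, 4), round(precision, 4), round(recall, 4), round(accuracy, 4))
# 0.7568 0.8 0.7179 0.7474
```

Under this reconstruction the model misses 44 of 156 positive examples and mislabels 28 of 129 negative ones, so its errors lean towards false negatives.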
