# Final Masters Project

## Name: Sreekanth Palagiri, Student ID: R00184198

## Project Topic: Evaluation of Ensemble Approach for Sentiment Analysis on a Small Dataset

## Notebook 1: Trainer LSTM


### **Mount Google Drive**

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls "gdrive/My Drive/Colab Notebooks/Masters Project"

'Airline Tweets dataset'  'Sentence Polarity Dataset'
 glove.6B.300d.txt	   VMDataset


### **Load Data and Preprocess**

In [None]:
import pandas as pd
import numpy as np

df=pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/airlinecomplaint.csv")
print(df.groupby(['label']).size())
df.head()

label
0    1700
1    1700
dtype: int64


| | tweet | label |
|---|---|---|
| 0 | @united UA maintenance issues strike again. Fl... | 0 |
| 1 | With @AirCanada taking away more @ACAltitude b... | 1 |
| 2 | @DudePerfect @AmericanAir that's a lot of miss... | 1 |
| 3 | Oh @AmericanAir we promise to try hard not to ... | 1 |
| 4 | @AmericanAir liked to me that I couldn't chang... | 0 |


**Preprocessor to Remove all special characters except emoticons**

In [None]:
import re

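def preprocessor(text):
    # Strip HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :), ;-( or =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of other characters with a space, then
    # append the emoticons (without the '-' nose) at the end.
    text = re.sub(r"[^A-Za-z0-9']+", ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text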

print(df['tweet'][1])
print(preprocessor(df['tweet'][1]))

With @AirCanada taking away more @ACAltitude benefits, which airline should I switch to in 2015? @AmericanAir? @United? #flyerstalk
with aircanada taking away more acaltitude benefits which airline should i switch to in 2015 americanair united flyerstalk
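The sample tweet above contains no emoticons, so the emoticon branch of the preprocessor is not exercised. The following minimal sketch applies the same emoticon pattern to a made-up tweet. The sample text and the helper name `extract_emoticons` are illustrative assumptions and are not part of the dataset or the notebook.

```python
import re

# Same emoticon pattern as in the preprocessor: eyes, optional nose, mouth.
EMOTICON_PATTERN = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

def extract_emoticons(text):
    """Return the emoticons found in text, with the '-' nose removed."""
    return [e.replace('-', '') for e in re.findall(EMOTICON_PATTERN, text)]

# Hypothetical tweet, used only to illustrate the pattern.
sample = "Thanks @united :-) but my bag is lost :("
print(extract_emoticons(sample))  # [':)', ':(']
```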


In [None]:
df['tweet'] = df['tweet'].apply(preprocessor)

### **Separate Into Train and Test Sets**

In [None]:
from sklearn.model_selection import train_test_split

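# shuffle=False keeps the original row order (first 80% train, last 20% test),
# so random_state has no effect here.
df_train, df_test, sentiment_train, sentiment_test = train_test_split(
    df['tweet'], df['label'], random_state=1, test_size=0.20, shuffle=False)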


print('Length of train set:',len(df_train),'Length of test set:',len(df_test))

Length of train set: 2720 Length of test set: 680
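With `shuffle=False` the split amounts to slicing the data in its stored order. The sketch below illustrates that idea in plain Python. The helper `ordered_split` is a hypothetical stand-in and not the scikit-learn implementation.

```python
def ordered_split(items, test_size):
    """Split a sequence without shuffling: leading items train, trailing items test."""
    n_test = round(len(items) * test_size)
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

rows = list(range(3400))  # stand-in for the 3400 tweets
train_rows, test_rows = ordered_split(rows, test_size=0.20)
print(len(train_rows), len(test_rows))  # 2720 680
print(train_rows[-1], test_rows[0])     # 2719 2720
```

Because the order is preserved, the class balance of each part depends on how the CSV is ordered, so it may be worth checking the label counts of both parts.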


### **LSTM Model**

**Define and Fit Tokenizer**

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

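numwords = 20000  # texts_to_sequences keeps only words with index < numwords

t = Tokenizer(num_words=numwords)
t.fit_on_texts(df_train)  # the vocabulary is built from the training tweets only
word_index = t.word_index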

**Convert text to sequences for further processing**



In [None]:
train_sequences = t.texts_to_sequences(df_train)
test_sequences = t.texts_to_sequences(df_test)
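As a rough illustration of what fitting the tokenizer and converting texts to sequences does, the sketch below ranks words by frequency and maps each text to a list of integer indices. It is a simplified stand-in, not the Keras implementation: the function names and sample texts are assumptions, and the real `Tokenizer` also lowercases and filters punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unseen words and indices >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

# Hypothetical training texts, used only for illustration.
train_texts = ["flight delayed again", "great flight great crew"]
toy_word_index = build_word_index(train_texts)
print(toy_word_index)
# "tomorrow" was never seen during fitting, so it is dropped.
print(to_sequences(["great flight tomorrow"], toy_word_index, num_words=20000))  # [[2, 1]]
```

Note that the default `filters` argument of the Keras `Tokenizer` includes characters such as `:`, `;`, `=`, `(` and `)`, so emoticons kept by the preprocessor are likely to be stripped again at this stage unless `filters` is changed.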

**Pad Sequences**

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_seq_length = 500  # every sequence is padded or truncated to this length
embed_dim = 300       # dimensionality of the GloVe vectors

train_sequences = pad_sequences(train_sequences, maxlen=max_seq_length)
test_sequences = pad_sequences(test_sequences, maxlen=max_seq_length)

print('Shape of training data tensor:',train_sequences.shape)
print('Shape of test data tensor:',test_sequences.shape)

Shape of training data tensor: (2720, 500)
Shape of test data tensor: (680, 500)
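By default `pad_sequences` pads with zeros at the front and, for sequences that are too long, keeps the last `maxlen` items. The plain-Python sketch below mimics that default behaviour for a single sequence. The helper `pad_pre` is a hypothetical name used only for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```

Tweets are short, so with a length of 500 most of each row consists of padding zeros. A smaller maximum length based on the observed tweet lengths may reduce training time.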


In [None]:
X_train=train_sequences
Y_train=sentiment_train

X_test=test_sequences
Y_test=sentiment_test

**Download GloVe embeddings** (commented out after first execution)

In [None]:
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip "glove.6B.zip"
#!mv glove* "/content/gdrive/My Drive/Colab Notebooks/Masters Project"

**Create Embedding Matrix**

In [None]:
embeddings_index = {}
# Each line holds a word followed by the 300 components of its vector.
with open('/content/gdrive/My Drive/Colab Notebooks/Masters Project/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
embedding_matrix = np.zeros((numwords, embed_dim))
for word, i in word_index.items():
    if i < numwords:
        # If the word has a GloVe vector, copy it into row i;
        # words without one keep an all-zero row.
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)

(20000, 300)
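Tweets contain many handles and hashtags that may have no GloVe vector, in which case their rows stay at zero. A quick coverage check can show how much of the vocabulary is actually initialised. The sketch below uses toy three-dimensional vectors with made-up values, and the helper names `parse_glove_line` and `coverage` are assumptions for illustration.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (word, vector)."""
    values = line.split()
    return values[0], [float(v) for v in values[1:]]

def coverage(word_index, embeddings_index, num_words):
    """Fraction of words with index < num_words that have a pretrained vector."""
    vocab = [w for w, i in word_index.items() if i < num_words]
    found = sum(1 for w in vocab if w in embeddings_index)
    return found / len(vocab)

# Toy stand-ins for GloVe lines; the numbers are made up.
toy_lines = ["flight 0.1 0.2 0.3", "delayed -0.4 0.5 0.6"]
toy_embeddings = dict(parse_glove_line(line) for line in toy_lines)
toy_vocab = {"flight": 1, "delayed": 2, "americanair": 3, "flyerstalk": 4}
print(coverage(toy_vocab, toy_embeddings, num_words=20000))  # 0.5
```

In the notebook the same check could be run with `word_index`, `embeddings_index` and `numwords`.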


**Define and fit the LSTM model, with a callback to save the best-performing model**

In [None]:
from tensorflow.keras import models, layers, callbacks

model_lstm = models.Sequential()

model_lstm.add(layers.Embedding(numwords,embed_dim,input_length =max_seq_length))
model_lstm.add(layers.LSTM(64,dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(layers.Dense(64,activation='relu'))
model_lstm.add(layers.Dropout(0.3))
model_lstm.add(layers.Dense(2, activation='softmax'))

filepath="/content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/Models/model_lstm.h5"
checkpoint3 = callbacks.ModelCheckpoint (filepath, monitor='val_accuracy', verbose=1, 
                                         save_best_only=True, save_weights_only=False, mode='auto')
callbacks_list3 = [checkpoint3]

model_lstm.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 300)          6000000   
_________________________________________________________________
lstm (LSTM)                  (None, 64)                93440     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
Total params: 6,097,730
Trainable params: 6,097,730
Non-trainable params: 0
_________________________________________________________________
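The parameter counts in the summary can be reproduced by hand with the standard formulas for embedding, LSTM and dense layers. The short sketch below does this as a sanity check. The helper function names are illustrative only.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return (input_dim + 1) * units

layer_params = [
    embedding_params(20000, 300),  # 6000000
    lstm_params(300, 64),          # 93440
    dense_params(64, 64),          # 4160
    dense_params(64, 2),           # 130
]
print(layer_params, sum(layer_params))  # total: 6097730
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind when it is made trainable on only 2720 tweets.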


In [None]:
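# Initialise the embedding layer with the GloVe matrix and keep it trainable
# so the vectors are fine-tuned on the tweets.
model_lstm.layers[0].set_weights([embedding_matrix])
model_lstm.layers[0].trainable = True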

model_lstm.compile(optimizer='rmsprop',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(X_train, 
                     Y_train,
                     batch_size=32,
                     epochs=20,
                     shuffle=True,
                     callbacks=callbacks_list3,
                     verbose=2,
                     validation_data=(X_test,Y_test))

Epoch 1/20
85/85 - 63s - loss: 0.6581 - accuracy: 0.6040 - val_loss: 0.5666 - val_accuracy: 0.7176

Epoch 00001: val_accuracy improved from -inf to 0.71765, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/Models/model_lstm.h5
Epoch 2/20
85/85 - 58s - loss: 0.5371 - accuracy: 0.7342 - val_loss: 0.5215 - val_accuracy: 0.7250

Epoch 00002: val_accuracy improved from 0.71765 to 0.72500, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/Models/model_lstm.h5
Epoch 3/20
85/85 - 58s - loss: 0.4325 - accuracy: 0.8033 - val_loss: 0.5698 - val_accuracy: 0.7456

Epoch 00003: val_accuracy improved from 0.72500 to 0.74559, saving model to /content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/Models/model_lstm.h5
Epoch 4/20
85/85 - 58s - loss: 0.3528 - accuracy: 0.8515 - val_loss: 0.5595 - val_accuracy: 0.7529

Epoch 00004: val_accuracy improved from 0.74559 to 0.75294, saving model
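The checkpoint messages in the log follow a simple rule: the model is saved only when the monitored value improves on the best one seen so far. The sketch below replays that rule on the validation accuracies of the first four epochs shown above. The function `epochs_saved` is a hypothetical helper and not part of Keras.

```python
def epochs_saved(val_accuracies):
    """Return the (1-based) epochs at which a save-best-only checkpoint would fire."""
    best = float('-inf')
    saved = []
    for epoch, val_acc in enumerate(val_accuracies, start=1):
        if val_acc > best:  # higher accuracy counts as an improvement
            best = val_acc
            saved.append(epoch)
    return saved, best

# Validation accuracies of epochs 1 to 4, taken from the log above.
logged = [0.7176, 0.7250, 0.7456, 0.7529]
print(epochs_saved(logged))  # ([1, 2, 3, 4], 0.7529)

# Hypothetical run in which epoch 2 does not improve, so it is not saved.
print(epochs_saved([0.70, 0.68, 0.72]))  # ([1, 3], 0.72)
```

The file on disk therefore holds the weights of the best epoch, while `model_lstm` in memory holds those of the last epoch. This presumably explains why the scores of the reloaded model later in the notebook differ from an evaluation run directly after training.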

**Train and Test Scores**

In [None]:
# evaluate() returns [loss, accuracy] for the compiled model.
print('Train [loss, accuracy]:',model_lstm.evaluate(X_train, Y_train))
print('Test [loss, accuracy]:',model_lstm.evaluate(X_test, Y_test))

Train [loss, accuracy]: [4.6182878577383235e-05, 1.0]
Test [loss, accuracy]: [2.340956449508667, 0.7441176176071167]


**Save Tokenizer**

In [None]:
import io
import json

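# to_json() already returns a JSON string, so json.dumps below stores it as a
# JSON-encoded string; read it back with json.load before rebuilding the tokenizer.
tokenizer_json = t.to_json()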
with io.open('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/Models/tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
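Since the tokenizer JSON string is passed through `json.dumps` once more before being written, reading it back takes one `json.load` to recover the string. The sketch below shows this round trip with the standard library only. The tokenizer content is a made-up stand-in and an in-memory buffer replaces the file on Drive.

```python
import io
import json

# Stand-in for the string returned by t.to_json(); the content is made up.
tokenizer_json = json.dumps({"class_name": "Tokenizer", "config": {"num_words": 20000}})

buffer = io.StringIO()  # in-memory replacement for tokenizer.json
buffer.write(json.dumps(tokenizer_json, ensure_ascii=False))

buffer.seek(0)
restored = json.load(buffer)   # first decode: back to the original JSON string
config = json.loads(restored)  # second decode: the actual dictionary
print(type(restored).__name__, config["config"]["num_words"])  # str 20000
```

In Keras, the string recovered by the first decode should be what `tokenizer_from_json` from `tensorflow.keras.preprocessing.text` expects as input.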

### **Predict and Evaluate Metrics**

In [None]:
from tensorflow import keras

model_lstm=keras.models.load_model('/content/gdrive/My Drive/Colab Notebooks/Masters Project/Airline Tweets dataset/Models/model_lstm.h5')



In [None]:
from sklearn import metrics

Y_prob=model_lstm.predict(X_test)
Y_pred = Y_prob.argmax(axis=-1)
print('F1 Score:',metrics.f1_score(Y_test,Y_pred),
      'Precision:',metrics.precision_score(Y_test,Y_pred),
      'Recall:',metrics.recall_score(Y_test,Y_pred),
      'Accuracy:',metrics.accuracy_score(Y_test,Y_pred))

F1 Score: 0.7732962447844228 Precision: 0.7453083109919572 Recall: 0.8034682080924855 Accuracy: 0.7602941176470588
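The four scores are all derived from the confusion-matrix counts of the 680 test tweets. The sketch below recomputes them from such counts. The counts are not notebook output: they were inferred from the reported scores (for example, a recall of 278/346 and a precision of 278/373) and should be read as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Counts inferred from the reported scores (assumption, not notebook output).
scores = binary_metrics(tp=278, fp=95, fn=68, tn=239)
for name, value in scores.items():
    print(name, round(value, 4))
# precision 0.7453, recall 0.8035, f1 0.7733, accuracy 0.7603
```

Under these counts the model recovers about 80% of the tweets labelled 1, while roughly one in four of its positive predictions is wrong.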
