### **Starting with Milestone 2**

**In the last part of Milestone 1 (part2b) we concluded that we would take the LSTM model with pre-trained GloVe embeddings forward and improve on it in this milestone. That model was the most basic LSTM, with the following architecture:**<br>
- input layer of size 76 (x_train.shape[1])<br>
- embedding layer with input_dim = size of the vocabulary, output_dim = 200, weights = GloveEmbeddingMatrix, input_length = x_train.shape[1] and trainable = True<br>
- LSTM layer with 100 units and return_sequences = False<br>
- output dense layer with units = number of labels in the target, i.e. 74 (y_train.shape[1])<br><br>
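To make the embedding step concrete, here is a minimal sketch of what an embedding layer does with a padded sequence of token ids. The 4-row, 3-dimensional matrix, the `embed` helper and the choice of id 0 as padding are all made up for illustration; the real layer uses the GloVe matrix and sequences of length 76.

```python
# Toy embedding lookup: each token id selects one row of the embedding matrix.
# The matrix is made up; treating id 0 as padding is an assumption of this toy.
toy_embedding_matrix = [
    [0.0, 0.0, 0.0],   # id 0: padding
    [0.1, 0.2, 0.3],   # id 1
    [0.4, 0.5, 0.6],   # id 2
    [0.7, 0.8, 0.9],   # id 3
]

def embed(token_ids, matrix):
    """Map a sequence of token ids to a sequence of embedding vectors."""
    return [matrix[i] for i in token_ids]

padded_sequence = [2, 3, 1, 0, 0]          # length 5 here instead of 76
embedded = embed(padded_sequence, toy_embedding_matrix)

print(len(embedded), len(embedded[0]))     # sequence length, embedding size
```

With the real data the same lookup turns a (76,) sequence of ids into a (76, 200) sequence of vectors, which is what the LSTM layer consumes.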


In this milestone we will improve the architecture and then tune its hyper-parameters. **This is the plan:**<br>
**First we will improve upon the architecture:**<br>
- turn on return sequences<br>
- add some regularization such as dropouts<br>
- introduce bi-directional layer<br>
- add more LSTM units<br><br>

**After deciding the architecture we will tune for better hyper-parameters:**<br>
- try a different optimizer<br>
- find a suitable learning rate<br>
- add callbacks so that we can train for more epochs while guarding against overfitting
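As a rough illustration of the last point, the sketch below shows the patience logic behind an early-stopping callback in plain Python. The helper name `best_epoch_with_patience` and the loss values are hypothetical; in Keras this role would typically be played by the `EarlyStopping` callback, and the sketch does not claim to reproduce its exact behaviour.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 3.
losses = [1.9, 1.5, 1.3, 1.2, 1.25, 1.3, 1.4, 1.5]
stop_epoch, best_epoch = best_epoch_with_patience(losses, patience=3)
print(stop_epoch, best_epoch)
```

The point of the design is that we can ask for many epochs up front and let the validation loss decide when to stop.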



In [None]:
import sys
sys.path.append('/content/drive/MyDrive/Automatic Ticket Assignment')
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input, GlobalMaxPool1D, Flatten
from Model.DLModelTuningAndEvaluation import *

**First we will load GloVeEmbeddingMatrix, train set, train labels, test set and test labels that we saved in Milestone 1 part2b:**

In [None]:
# USE THIS SAVEPATH IF RUNNING IN GOOGLE COLAB, GIVE THE PATH WHERE YOU WANT TO SAVE
SAVEPATH = '/content/drive/MyDrive/Automatic Ticket Assignment/DataFiles/Milestone2/'
# *************************** --------------------------************************************
# SAVEPATH = 'DataFiles/Milestone2/'

embedding_matrix = np.load(SAVEPATH+'GloveEmbeddingMatrix.npy')
x_train = np.load(SAVEPATH+'xtrain.npy')
y_train = np.load(SAVEPATH+'ytrain.npy')
x_test = np.load(SAVEPATH+'xtest.npy')
y_test = np.load(SAVEPATH+'ytest.npy')

print('JUST TO RECALL\n','train set:',x_train.shape)
print('train labels:',y_train.shape)
print('test set:',x_test.shape)
print('test labels:',y_test.shape,'\n')
print('embedding matrix:',embedding_matrix.shape)

JUST TO RECALL
 train set: (6432, 76)
train labels: (6432, 74)
test set: (1608, 76)
test labels: (1608, 74) 

embedding matrix: (19235, 200)


**MODEL 2:**<BR>
**So we will start improving upon the model architecture using the above plan:**<br>
**Turn on return sequences**<br>
When we set return_sequences = True, the LSTM layer returns its hidden state at every timestep (at every word of the sequence). Every sequence has 76 words, so the LSTM output shape becomes (None, 76, 100) instead of (None, 100) with return_sequences = False. The output layer expects one vector per sample, so we need to collapse the time dimension before the output layer. **This can be done in two ways:** with a **Flatten layer** feeding the dense output layer, or with a pooling layer; we will keep whichever gives the better result.<br>
**With return_sequences = True, the hidden state at every timestep is passed on, so a following layer (for example another LSTM layer) receives the whole sequence rather than only the final state.**<br><br>
**Continuing from model1b in the last part, we will train each model for 15 epochs to gauge any improvement. First we flatten the LSTM output and pass it to the dense output layer:**
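The two ways of collapsing the time dimension can be sketched on a toy example in plain Python. The 3 x 2 values and the helper names `flatten` and `global_max_pool` are made up for illustration; the real per-sample shape is 76 x 100.

```python
# Toy LSTM output for ONE sample: 3 timesteps x 2 units.
sequence_output = [
    [0.1, 0.9],   # hidden state at timestep 0
    [0.7, 0.2],   # hidden state at timestep 1
    [0.4, 0.5],   # hidden state at timestep 2
]

def flatten(seq):
    """Concatenate all timesteps: (timesteps, units) -> (timesteps * units,)."""
    return [value for step in seq for value in step]

def global_max_pool(seq):
    """Keep the maximum of each unit over time: (timesteps, units) -> (units,)."""
    return [max(column) for column in zip(*seq)]

flat = flatten(sequence_output)
pooled = global_max_pool(sequence_output)
print(len(flat), flat)       # 6 [0.1, 0.9, 0.7, 0.2, 0.4, 0.5]
print(len(pooled), pooled)   # 2 [0.7, 0.9]
```

Flattening keeps every value, so the next layer sees 76 * 100 = 7600 features; pooling keeps one value per unit, so it sees only 100.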

In [None]:
# model2 architecture
input = Input(shape=(x_train.shape[1],),batch_size=None)
model2 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model2 = LSTM(units=100, return_sequences=True)(model2) # return_sequences=True
model2 = Flatten()(model2) # flatten (76, 100) into 7600 features before the dense output layer
out = Dense(y_train.shape[1], activation="softmax")(model2)
model2 = Model(input, out)
model2.summary()
print("\n")

# train model2
train_model(model2, x_train, y_train, x_test, y_test, ep=15, bs=16)#66.6

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 76, 200)           3847000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 76, 100)           120400    
_________________________________________________________________
flatten_1 (Flatten)          (None, 7600)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 74)                562474    
Total params: 4,529,874
Trainable params: 4,529,874
Non-trainable params: 0
_________________________________________________________________



<tensorflow.python.keras.engine.functional.Functional at 0x7f703f656c88>

**MODEL 3:**<br>Using a **GlobalMaxPooling layer** instead of Flatten to reduce the LSTM output before it goes into the output layer:

In [None]:
# model3 architecture
input = Input(shape=(x_train.shape[1],),batch_size=None)
model3 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model3 = LSTM(units=100, return_sequences=True)(model3) # return_sequences=true 
model3 = GlobalMaxPool1D()(model3) # global max pooling over time: (76, 100) -> (100,)
out = Dense(y_train.shape[1], activation="softmax")(model3)
model3 = Model(input, out)
model3.summary()
print("\n")

# train model3
train_model(model3, x_train, y_train, x_test, y_test, ep=15, bs=16)#68.6

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 76, 200)           3847000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 76, 100)           120400    
_________________________________________________________________
global_max_pooling1d (Global (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 74)                7474      
Total params: 3,974,874
Trainable params: 3,974,874
Non-trainable params: 0
_________________________________________________________________



<tensorflow.python.keras.engine.functional.Functional at 0x7f70393f5a20>

**Pooling gives better validation accuracy and loss than flattening. The dense layer after pooling also has far fewer parameters to train (7,474 instead of 562,474), which means faster training. Hence we will keep the pooling layer in the architecture.**<br><br>
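The difference in parameter counts can be checked by hand. The small helper below (`dense_params` is a name chosen for this sketch) counts the weights and biases of a dense layer and reproduces the `dense` rows of the two model summaries.

```python
def dense_params(inputs, outputs):
    """Weights plus one bias per output unit."""
    return inputs * outputs + outputs

timesteps, lstm_units, n_labels = 76, 100, 74

after_flatten = dense_params(timesteps * lstm_units, n_labels)  # model 2
after_pooling = dense_params(lstm_units, n_labels)              # model 3

print(after_flatten)                   # 562474
print(after_pooling)                   # 7474
print(after_flatten - after_pooling)   # 555000
```

The 555,000 parameters saved in the output layer are exactly the gap between the two totals (4,529,874 and 3,974,874).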
**MODEL 4:**<br>
Next up is **Regularisation:**<br>
Adding recurrent dropout to the LSTM layer and a dropout layer after pooling, both with a 10% dropout rate:
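For intuition, here is a minimal sketch of what a standard dropout layer does during training, written in plain Python. The `dropout` helper and the fixed seed are assumptions of this sketch. It covers only the dropout layer after pooling; recurrent dropout applies the same idea to the recurrent state inside the LSTM and is not reproduced here.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)            # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.1, rng=rng)
print(dropped)
```

Scaling the surviving values keeps the expected activation unchanged, so nothing has to be rescaled at prediction time, when dropout is switched off.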

In [None]:
# model4 architecture
input = Input(shape=(x_train.shape[1],),batch_size=None)
model4 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model4 = LSTM(units=100, return_sequences=True,recurrent_dropout=0.1)(model4) #recurrent dropout
model4 = GlobalMaxPool1D()(model4)
model4 = Dropout(0.1)(model4) #dropout layer
out = Dense(y_train.shape[1], activation="softmax")(model4)
model4 = Model(input, out)
model4.summary()
print("\n")

# train model4
train_model(model4, x_train, y_train, x_test, y_test, ep=15, bs=16)#69.09

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 76, 200)           3847000   
_________________________________________________________________
lstm_3 (LSTM)                (None, 76, 100)           120400    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 74)                7474      
Total params: 3,974,874
Trainable params: 3,974,874
Non-trainable params: 0
_________________________________________________

<tensorflow.python.keras.engine.functional.Functional at 0x7f703881b390>

**MODEL 5:**<br>
Increasing the dropout rate to **20%**

In [None]:
# model5 architecture 
input = Input(shape=(x_train.shape[1],),batch_size=None)
model5 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model5 = LSTM(units=100, return_sequences=True,recurrent_dropout=0.2)(model5) #recurrent dropout
model5 = GlobalMaxPool1D()(model5)
model5 = Dropout(0.2)(model5) #dropout layer
out = Dense(y_train.shape[1], activation="softmax")(model5)
model5 = Model(input, out)
model5.summary()
print("\n")

# train model5
train_model(model5, x_train, y_train, x_test, y_test, ep=15, bs=16)

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 76, 200)           3847000   
_________________________________________________________________
lstm_4 (LSTM)                (None, 76, 100)           120400    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 100)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 74)                7474      
Total params: 3,974,874
Trainable params: 3,974,874
Non-trainable params: 0
_________________________________________________

<tensorflow.python.keras.engine.functional.Functional at 0x7f70372c0eb8>

**Slightly better performance with a dropout rate of 20%. Hence we will go ahead with model 5 and build on it in the upcoming models.**<br><br>
**MODEL 6:**<br>
**Adding Bi-directional Layer:**<br>
The Bidirectional wrapper duplicates the LSTM layer and runs the two copies side by side. The input sequence is fed as it is to the first copy, while a reversed copy of the input is fed to the second.<br>In simpler words, the **bi-directional layer processes the input in two directions: past to future and future to past.** The idea is to give the model better context for the word at any given timestep: because both the sequence and its reverse are processed, the layer has context not only from the previous words but also from the words that come later in the sequence.<br>**Since we wrap an LSTM with 100 units in a bi-directional layer, the two directions together have 200 units in our case.**
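The two-direction idea can be illustrated with a toy recurrent "cell" in plain Python. The running sum below is only a stand-in for an LSTM, and both helper names are made up for this sketch. What matters is how the forward and backward states are paired per timestep.

```python
def running_sum(sequence):
    """Stand-in for a recurrent cell: the state at step t sums inputs 0..t."""
    states, total = [], 0
    for x in sequence:
        total += x
        states.append(total)
    return states

def bidirectional(sequence):
    """Run the toy cell forward and backward, then pair the states per timestep."""
    forward = running_sum(sequence)
    # Process the reversed input, then flip the states back into original order.
    backward = running_sum(sequence[::-1])[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

print(bidirectional([1, 2, 3, 4]))
# [[1, 10], [3, 9], [6, 7], [10, 4]]
```

At every timestep the first value summarises the inputs up to that point and the second summarises the inputs from that point to the end. Pairing them doubles the feature size, which is why the layer output grows from 100 to 200.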

In [None]:
# model6 architecture
input = Input(shape=(x_train.shape[1],),batch_size=None)
model6 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model6 = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.2))(model6) #adding bi-directional layer
model6 = GlobalMaxPool1D()(model6)
model6 = Dropout(0.2)(model6)
out = Dense(y_train.shape[1], activation="softmax")(model6)
model6 = Model(input, out)
model6.summary()
print("\n")

# train model6
train_model(model6, x_train, y_train, x_test, y_test, ep=15, bs=16)#68.59

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 76, 200)           3847000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 76, 200)           240800    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 200)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 74)                14874     
Total params: 4,102,674
Trainable params: 4,102,674
Non-trainable params: 0
_________________________________________________

<tensorflow.python.keras.engine.functional.Functional at 0x7fa664150278>

**More or less the same performance as model 5, but it takes about DOUBLE the time to train, which is expected since we have double the LSTM units. Since there is no gain in performance, we retain model 5.**<br><br>
**Model 7:**<br>
We rejected model 6 and retained model 5, so model 7 keeps the architecture of model 5 **but with more LSTM units (150 instead of 100)**
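To see how the number of units affects model size, the sketch below uses the usual LSTM parameter count: four gates, each with input weights, recurrent weights and a bias. The helper name `lstm_params` is chosen for this sketch. The results match the LSTM rows of the model summaries in this notebook.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

embedding_dim = 200
print(lstm_params(embedding_dim, 100))       # 120400, model 5
print(lstm_params(embedding_dim, 150))       # 210600, model 7
print(2 * lstm_params(embedding_dim, 100))   # 240800, model 6 (bi-directional)
```

Because of the recurrent weights the count grows faster than linearly in the number of units: going from 100 to 150 units adds about 75% more LSTM parameters.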


In [None]:
# model7 architecture
input = Input(shape=(x_train.shape[1],),batch_size=None)
model7 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model7 = LSTM(units=150, return_sequences=True, recurrent_dropout=0.2)(model7)
model7 = GlobalMaxPool1D()(model7)
model7 = Dropout(0.2)(model7)
out = Dense(y_train.shape[1], activation="softmax")(model7)
model7 = Model(input, out)
model7.summary()
print("\n")

# train model7
train_model(model7, x_train, y_train, x_test, y_test, ep=16, bs=16)#68.16

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 76, 200)           3847000   
_________________________________________________________________
lstm (LSTM)                  (None, 76, 150)           210600    
_________________________________________________________________
global_max_pooling1d (Global (None, 150)               0         
_________________________________________________________________
dropout (Dropout)            (None, 150)               0         
_________________________________________________________________
dense (Dense)                (None, 74)                11174     
Total params: 4,068,774
Trainable params: 4,068,774
Non-trainable params: 0
___________________________________________________

<tensorflow.python.keras.engine.functional.Functional at 0x7faf5ba42128>

**Again, not much performance difference from model 5, but training time roughly doubled.** We have covered all the architecture steps. **We select model 5 as the architecture to take forward.**<br><br>  Now we will tune the hyper-parameters, starting with **trying a different optimizer:**<br> 
Until now we have been using the Adam optimizer; let's see the performance with RMSProp. As stated above, model 5 is our chosen architecture and from now on we will work on model 5.
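For reference, the difference between the two optimizers can be sketched on a one-parameter toy problem. The code follows the textbook update rules on the loss f(w) = w**2. The function names, learning rate and step count are illustrative assumptions, and this is not the code path used by `train_model`.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

def run_rmsprop(w, steps, lr=0.05, rho=0.9, eps=1e-7):
    v = 0.0
    for _ in range(steps):
        g = grad(w)
        v = rho * v + (1.0 - rho) * g * g       # moving average of squared gradients
        w -= lr * g / (math.sqrt(v) + eps)
    return w

def run_adam(w, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-7):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g       # moving average of gradients
        v = beta2 * v + (1.0 - beta2) * g * g   # moving average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)          # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_rmsprop = run_rmsprop(1.0, steps=200)
w_adam = run_adam(1.0, steps=200)
print(w_rmsprop, w_adam)
```

Both rules scale the step by a running estimate of the squared gradient. Adam additionally keeps a moving average of the gradient itself, which acts like momentum.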

In [None]:
# model5 architecture 
input = Input(shape=(x_train.shape[1],),batch_size=None)
model5 = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix], input_length=x_train.shape[1], trainable=True)(input)
model5 = LSTM(units=100, return_sequences=True,recurrent_dropout=0.2)(model5)
model5 = GlobalMaxPool1D()(model5)
model5 = Dropout(0.2)(model5)
out = Dense(y_train.shape[1], activation="softmax")(model5)
model5 = Model(input, out)
model5.summary()
print("\n")

# train model5
train_model(model5, x_train, y_train, x_test, y_test, optm='rmsprop', ep=15, bs=16) #rmsprop optimizer

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         [(None, 76)]              0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 76, 200)           3847000   
_________________________________________________________________
lstm_6 (LSTM)                (None, 76, 100)           120400    
_________________________________________________________________
global_max_pooling1d_6 (Glob (None, 100)               0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 74)                7474      
Total params: 3,974,874
Trainable params: 3,974,874
Non-trainable params: 0
_________________________________________________

<tensorflow.python.keras.engine.functional.Functional at 0x7fa659f3f390>

**Not much performance difference from Adam, hence we will stay with Adam.**
### In the next part we will tune the learning rate and train for more epochs with callbacks to further avoid overfitting. 