# GRU Prototype With Attention

Previous notebook we implemented a [single layer GRU](https://github.com/sv650s/sb-capstone/blob/master/2019-07-22-GRU_prototype.ipynb) without attention

For this notebook, we will implement a 1 layer bidirectional GRU network with attention and 3 dense layer architecture


As before, I am using some utility functions so I don't have copy so much code around. Source code for the modules are here:
* [dict_util](https://github.com/sv650s/sb-capstone/blob/master/util/dict_util.py)
* [plot_util](https://github.com/sv650s/sb-capstone/blob/master/util/plot_util.py)
* [keras_util](https://github.com/sv650s/sb-capstone/blob/master/util/keras_util.py)
* [file_util](https://github.com/sv650s/sb-capstone/blob/master/util/file_util.py)

In [0]:
from google.colab import drive
import sys
drive.mount('/content/drive')
# add this to sys patch so we can import utility functions
DRIVE_DIR = 'drive/My Drive/Springboard/capstone'
sys.path.append(DRIVE_DIR)


%tensorflow_version 2.x


import tensorflow as tf
# checl to make sure we are using GPU here
tf.test.gpu_device_name()

try:
  %tensorflow_version 2.x  # Colab only.
except Exception:
  pass


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers.normalization import BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers.convolutional import Conv1D, MaxPooling1D
from tensorflow.keras.layers.embeddings import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorlfow.keras.layers import CuDNNGRU


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report


import pandas as pd
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
import pickle
from datetime import datetime
from sklearn.metrics import confusion_matrix, classification_report
import os
import seaborn as sns
import matplotlib.pyplot as plt
import logging


import util.dict_util as du
import util.plot_util as pu
import util.file_util as fu
import util.keras_util as ku
import util.report_util as ru



logging.basicConfig(level=logging.ERROR)

%matplotlib inline
sns.set()


DATE_FORMAT = '%Y-%m-%d'
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
# DATA_DIR = "dataset/feature_files"
MODEL_NAME = "LSTM"
FEATURE_COLUMN = "star_rating"
REVIEW_COLUMN = "review_body"


if DEBUG:
  DATA_FILE = f'{DRIVE_DIR}/review_body-word2vec-df_none-ngram_none-89-100-nolda.csv'
else:
  DATA_FILE = f"{DRIVE_DIR}/amazon_reviews_us_Wireless_v1_00-200k-preprocessed.csv"
  # DATA_FILE = f"{DRIVE_DIR}/amazon_reviews_us_Wireless_v1_00-preprocessed-110k.csv"

MODEL_NAME = "GRUbi"
ARCHITECTURE = "1_attention"


GRU_DIM = 250 # total GRU units


# length of our embedding - 300 is standard
EMBED_SIZE = 300
EPOCHS  = 50
BATCH_SIZE = 128
PATIENCE = 4


# From EDA, we know that 90% of review bodies have 100 words or less, 
# we will use this as our sequence length
MAX_SEQUENCE_LENGTH = 100


directory, INBASENAME = fu.get_dir_basename(DATA_FILE)
DESCRIPTION = f"{MODEL_NAME}-{ARCHITECTURE}-nobatch-{INBASENAME}-sampling_none-{FEATURE_COLUMN}"


Using TensorFlow backend.


In [0]:
df = pd.read_csv(f"{DATA_FILE}")

## Preprocessing

*  Preprocessing data file and create the right inputs for Keras models
     *   Features:
        * tokenize
        * pad features into sequence
     *   Labels:
       *  one hot encoder
* split between training and testing

See [keras_util](https://github.com/sv650s/sb-capstone/blob/master/util/keras_util.py) for souce code

In [0]:
X_train, X_test, y_train, y_test, tokenizer = \
                                  ku.preprocess_file(data_df=df, 
                                                      feature_column=FEATURE_COLUMN, 
                                                      label_column=LABEL_COLUMN, 
                                                      max_sequence_length=MAX_SEQUENCE_LENGTH)

One hot enocde label data...
Splitting data into training and test sets...


In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


Vocabulary size=40788
Number of Documents=84032
Max Sequence Length: 186


# Building Our GRU Model

In [0]:
# from keras.engine.topology import Layer
# from keras import backend as K
# import keras


# class AttentionLayer(Layer):
    
#     def __init__(self, step_dim,
#                  W_regularizer=None, b_regularizer=None,
#                  W_constraint=None, b_constraint=None,
#                  bias=True, **kwargs):
        
#         """
#         Keras Layer that implements an Attention mechanism for temporal data.
#         Supports Masking.
#         Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
#         # Input shape
#             3D tensor with shape: `(samples, steps, features)`.
#         # Output shape
#             2D tensor with shape: `(samples, features)`.
#         :param kwargs:
#         Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
#         The dimensions are inferred based on the output shape of the RNN.
#         """
        
#         self.supports_masking = True
#         self.init = keras.initializers.get('glorot_uniform')

#         self.W_regularizer = keras.regularizers.get(W_regularizer)
#         self.b_regularizer = keras.regularizers.get(b_regularizer)

#         self.W_constraint = keras.constraints.get(W_constraint)
#         self.b_constraint = keras.constraints.get(b_constraint)

#         self.bias = bias
#         self.step_dim = step_dim
#         self.features_dim = 0
#         super(AttentionLayer, self).__init__(**kwargs)
        

#     def build(self, input_shape):
#         assert len(input_shape) == 3

#         self.W = self.add_weight((input_shape[-1],),
#                                  initializer=self.init,
#                                  name='{}_W'.format(self.name),
#                                  regularizer=self.W_regularizer,
#                                  constraint=self.W_constraint)
#         self.features_dim = input_shape[-1]

#         if self.bias:
#             self.b = self.add_weight((input_shape[1],),
#                                      initializer='zero',
#                                      name='{}_b'.format(self.name),
#                                      regularizer=self.b_regularizer,
#                                      constraint=self.b_constraint)
#         else:
#             self.b = None

#         self.built = True
        

#     def compute_mask(self, input, input_mask=None):
#         # do not pass the mask to the next layers
#         return None

    
#     def call(self, x, mask=None):
#         # TF backend doesn't support it
#         # eij = K.dot(x, self.W) 
#         # features_dim = self.W.shape[0]
#         # step_dim = x._keras_shape[1]

#         features_dim = self.features_dim
#         step_dim = self.step_dim

#         eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), 
#                               K.reshape(self.W, (features_dim, 1))),
#                         (-1, step_dim))

#         if self.bias:
#             eij += self.b

#         eij = K.tanh(eij)

#         a = K.exp(eij)

#         # apply mask after the exp. will be re-normalized next
#         if mask is not None:
#             # Cast the mask to floatX to avoid float64 upcasting in theano
#             a *= K.cast(mask, K.floatx())

#         # in some cases especially in the early stages of training the sum may be almost zero
#         a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
#         a = K.expand_dims(a)
#         weighted_input = x * a
        
#         return K.sum(weighted_input, axis=1)

    
#     def compute_output_shape(self, input_shape):
#         return input_shape[0],  self.features_dim
    
    
#     def get_config(self):
#         config = {'step_dim': self.step_dim}
#         base_config = super(AttentionLayer, self).get_config()
#         return dict(list(base_config.items()) + list(config.items()))

In [0]:
vocab_size = len(tokenizer.word_counts)+1


model = Sequential()
model.add(Embedding(vocab_size, EMBED_SIZE, input_length=max_sequence_length))
model.add(Bidirectional(CuDNNGRU(GRU_DIM*2, return_sequences=True)))
model.add(AttentionLayer(MAX_SEQUENCE_LENGTH))
model.add(Dense(GRU_DIM*2, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(GRU_DIM, activation='relu'))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['categorical_accuracy'])


W0729 05:29:48.871664 139623069144960 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0729 05:29:48.877236 139623069144960 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0729 05:29:48.881768 139623069144960 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0729 05:29:50.668831 139623069144960 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0729 05:29:50.681504 

In [0]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 186, 300)          12236700  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 186, 1000)         2406000   
_________________________________________________________________
attention_layer_1 (Attention (None, 1000)              1186      
_________________________________________________________________
dense_1 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_1 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 250)               125250    
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 1255      
Total para

In [0]:
# reduce learning rate if we sense a plateau
reduce_lr = ReduceLROnPlateau(monitor='val_loss', 
                              factor=0.4,
                              patience=PATIENCE, 
                              min_lr=0.00001,
                             mode='auto')
early_stop = EarlyStopping(monitor='val_loss', 
                           patience=2, 
                           mode='auto', 
                           verbose=1,
                          restore_best_weights=True)

mw = ku.ModelWrapper(model, MODEL_NAME, LABEL_COLUMN, DATA_FILE, 
                     embedding=EMBED_SIZE,
                     tokenizer=tokenizer,
                     description=DESCRIPTION)


network_history = mw.fit(X_train, y_train,
                      batch_size=BATCH_SIZE,
                      epochs=EPOCHS,
                      verbose=1,
                      validation_split=0.2,
                      callbacks=[reduce_lr, early_stop])

W0729 05:29:50.922225 139623069144960 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 67225 samples, validate on 16807 samples
Epoch 1/50
 4992/67225 [=>............................] - ETA: 3:09 - loss: 1.2816 - acc: 0.5445

KeyboardInterrupt: ignored

## Evaluating the Model

In [0]:
importlib.reload(pu)

scores = mw.evaluate(x_test, y_test)
print("Accuracy: %.2f%%" % (mw.scores[1]*100))



pu.plot_network_history(mw.network_history, "categorical_accuracy", "val_categorical_accuracy")
plt.show()

print("\nConfusion Matrix")
print(mw.confusion_matrix)

print("\nClassification Report")
print(mw.classification_report)

fig = plt.figure(figsize=(5,5))
pu.plot_roc_auc(mw.name, mw.roc_auc, mw.fpr, mw.tpr)

## Save off filees

In [0]:
importlib.reload(ku)
importlib.reload(ru)

mw.save(DRIVE_DIR, append_report=True)
report = mw.get_report().to_df()
print(report.tail())

cr = json.loads(report.classification_report.values[0])
print(f'\n\nOverall score: {ru.calculate_metric(cr)}')

In [0]:
print(datetime.now())