<a href="https://colab.research.google.com/github/sv650s/sb-capstone/blob/master/2019_08_09_TF2_biGRU_1Layer_prototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Layer biDirectional GRU with Attention using TensorFlow 2

In our [Deep Learning Summary](https://github.com/sv650s/sb-capstone/blob/master/2019_07_30_deep_learning_summary.ipynb) notebook, we determined that a 1-layer bidirectional GRU with attention gave us the best classification results on our Amazon Reviews dataset. In this [notebook](https://github.com/sv650s/sb-capstone/blob/master/2019_08_02_biGRU_1layer_random_embedding_vs_pretrained.ipynb), we also determined that, with this model, random embeddings actually give us better results than pre-trained embeddings.

Our previous implementation used standalone Keras.

We will re-implement it using Keras on TensorFlow 2.0 beta so that we can package up the model and deploy it on GCP. The following were updated in the code:

* sklearn's OneHotEncoder -> tf.one_hot
* keras Tokenizer -> tf Keras Tokenizer
* keras sequence -> tf Keras sequence
* all layers that previously used keras reference impl now use tf's keras implementation
* previously, we used Keras' Sequential model - I had trouble saving the model when using this, so I replaced it with the Keras functional API to create the model

Since we will be loading models and then using them to do inference behind a REST API, we will save our model, reload it, and run some quick predictions to make sure that we can re-create this network on GCP.


TODO:

* TF recommends that we use the Dataset api for large datasets. This notebook does not use that yet

In [1]:
from google.colab import drive
import sys
drive.mount('/content/drive')
DRIVE_DIR = "drive/My Drive/Springboard/capstone"

# add this to sys.path so we can import utility functions
sys.path.append(DRIVE_DIR)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
!pip install -q tensorflow==2.0.0-beta1

In [3]:
try:
  %tensorflow_version 2.x  # Colab only.
except Exception:
  pass


`%tensorflow_version` only switches the major version: `1.x` or `2.x`.
You set: `2.x  # Colab only.`. This will be interpreted as: `2.x`.


TensorFlow 2.x selected.


In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, GRU, Bidirectional, Embedding
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence



import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

# import our utility functions
import util.keras_util as ku
import util.dict_util as du
import util.plot_util as pu


import logging

logging.basicConfig(level=logging.INFO)

# check to see if we are using GPU - must be placed at the beginning of the program
tf.debugging.set_log_device_placement(True)
print("GPU Available: ", tf.test.is_gpu_available())

print(f'Tensorflow Version: {tf.version.VERSION}')
print(f'Keras Version: {tf.keras.__version__}')


In [0]:
DATE_FORMAT = '%Y-%m-%d'
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
DATA_FILE = f"{DRIVE_DIR}/data/amazon_reviews_us_Wireless_v1_00-preprocessed-110k.csv"
LABEL_COLUMN = "star_rating"
FEATURE_COLUMN = "review_body"

In [0]:
import pandas as pd

df = pd.read_csv(DATA_FILE)

In [0]:

# X_train, X_test, y_train, y_test, tokenizer, max_sequence_length = \
#                                   ku.preprocess_file(data_df=df, 
#                                                       feature_column=FEATURE_COLUMN, 
#                                                       label_column=LABEL_COLUMN, 
#                                                       keep_percentile=0.99)

# Pre-process our data


In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer


labels = df[LABEL_COLUMN]
features = df[FEATURE_COLUMN]






In [0]:
# couldn't find a TF version of this
from sklearn.model_selection import train_test_split

features_train, features_test, y_train, y_test = train_test_split(features, labels, random_state=1)


In [10]:
# TF expects the indices to be 0-based - since our star ratings start at 1,
# we subtract 1 to create a 0-based index
y_train_encoded = tf.one_hot(y_train.apply(lambda x: x-1).tolist(), depth=5, axis=-1)
y_test_encoded = tf.one_hot(y_test.apply(lambda x: x-1).tolist(), depth=5, axis=-1)
print(y_train[:5])
print(type(y_train_encoded))
print(y_train_encoded.shape)
y_train_encoded[:5]

Executing op OneHot in device /job:localhost/replica:0/task:0/device:GPU:0
7993     4
2317     5
50296    5
31341    5
28373    5
Name: star_rating, dtype: int64
<class 'tensorflow.python.framework.ops.EagerTensor'>
(84032, 5)
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:GPU:0


<tf.Tensor: id=13, shape=(5, 5), dtype=float32, numpy=
array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]], dtype=float32)>
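
The label shift used above can be illustrated without TensorFlow. The following is a minimal NumPy sketch, not the notebook's code: the helper names `stars_to_one_hot` and `one_hot_to_stars` are hypothetical. It shows how 1-based star ratings map to 0-based one-hot rows and how to map an argmax back to a star rating.

```python
import numpy as np

NUM_CLASSES = 5  # star ratings 1..5


def stars_to_one_hot(stars):
    """Shift 1-based star ratings to 0-based indices and one-hot encode them."""
    indices = np.asarray(stars) - 1
    return np.eye(NUM_CLASSES, dtype=np.float32)[indices]


def one_hot_to_stars(encoded):
    """Recover star ratings from one-hot rows (or softmax outputs) via argmax."""
    return np.argmax(encoded, axis=1) + 1


# first five training labels printed above
encoded = stars_to_one_hot([4, 5, 5, 5, 5])
print(encoded)
print(one_hot_to_stars(encoded))
```

Because of this shift, the class indices that `np.argmax` returns later in the notebook are 0-based: index 4 corresponds to a 5-star review.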

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(lower=True, oov_token="<UNK>")
tokenizer.fit_on_texts(features_train)

x_train_sequences = tokenizer.texts_to_sequences(features_train)
x_test_sequences = tokenizer.texts_to_sequences(features_test)


vocab_size = len(tokenizer.word_index)
                 
print(f'Vocabulary size={vocab_size}')
print(f'Number of Documents={tokenizer.document_count}')

Vocabulary size=40789
Number of Documents=84032
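
To see why the embedding layer later uses `VOCAB_SIZE + 1` rows, it helps to look at how the word index is laid out. The sketch below is a simplified pure-Python imitation, not the Keras implementation: it splits on whitespace only, and `build_word_index` and `texts_to_sequences` are hypothetical helpers. It assumes the same layout as the Keras Tokenizer, where index 0 is never assigned to a word and the OOV token takes index 1.

```python
from collections import Counter

OOV_TOKEN = "<UNK>"


def build_word_index(texts):
    """Assign indices by descending frequency; index 0 is left unused for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = [OOV_TOKEN] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}


def texts_to_sequences(texts, word_index):
    """Map each word to its index, falling back to the OOV index for unseen words."""
    oov = word_index[OOV_TOKEN]
    return [[word_index.get(w, oov) for w in t.lower().split()] for t in texts]


word_index = build_word_index(["great phone case", "great price"])
sequences = texts_to_sequences(["great new case"], word_index)

# +1 because the largest index equals len(word_index) and index 0 is reserved
embedding_rows = len(word_index) + 1
print(word_index, sequences, embedding_rows)
```

The unseen word "new" maps to the OOV index, and the largest index equals `len(word_index)`, so an embedding table needs one extra row to cover index 0.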


In [12]:
from tensorflow.keras.preprocessing import sequence


# from our previous notebook, we saw that 99% of our reviews have 184 or fewer
# words - we will use 200 as a round number
MAX_FEATURES = 200

X_train = sequence.pad_sequences(x_train_sequences, 
                                 maxlen=MAX_FEATURES,
                                padding='post',
                                truncating='post')
X_test = sequence.pad_sequences(x_test_sequences, 
                                 maxlen=MAX_FEATURES,
                                padding='post',
                                truncating='post')

X_test[:1]

array([[  10,  609, 2585,  671,  438,  247,    3,  324,    6,   32,  204,
         341, 1415,  644,  343,  148,  243,  273,   55,  129,  279,  909,
          28,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 
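
The effect of `padding='post'` and `truncating='post'` is visible in the output above: the token ids come first and zeros fill the rest of the row. A minimal pure-Python sketch of that behavior for a single sequence follows; `pad_post` is a hypothetical helper for illustration, not part of Keras.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the first `maxlen` tokens, then right-pad with `value`."""
    truncated = list(seq[:maxlen])
    return truncated + [value] * (maxlen - len(truncated))


short = pad_post([10, 609, 2585], maxlen=5)        # shorter than maxlen: zeros appended
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # longer than maxlen: tail dropped
print(short)
print(long)
```

With post-truncation, any words beyond the first 200 in a review are discarded, which affects only the longest 1% of reviews according to the comment above.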

# Implement our Attention Layer

In [0]:
from tensorflow.keras import backend as K


# define our attention layer for later
class AttentionLayer(layers.Layer):

    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):

        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        """

        self.supports_masking = True
        self.init = keras.initializers.get('glorot_uniform')

        self.W_regularizer = keras.regularizers.get(W_regularizer)
        self.b_regularizer = keras.regularizers.get(b_regularizer)

        self.W_constraint = keras.constraints.get(W_constraint)
        self.b_constraint = keras.constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(AttentionLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        
        print(f'build.input_shape {input_shape}')
        print(f'build.input_shape[-1] {input_shape[-1]}')

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        
        print(f'build.features_dim {self.features_dim}')

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # TF backend doesn't support it
        # eij = K.dot(x, self.W)
        # features_dim = self.W.shape[0]
        # step_dim = x._keras_shape[1]

        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))),
                        (-1, step_dim))
        
        print(f'call.eij {eij}')

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        
        print(f'call.weighted_input {weighted_input}')

        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim

    def get_config(self):
        config = {'step_dim': self.step_dim}
        base_config = super(AttentionLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
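
The arithmetic inside `call` can be easier to follow on a toy input. The following NumPy sketch mirrors the same steps (score each time step, apply `tanh`, exponentiate, normalize over steps, take the weighted sum) under the assumption that no mask is passed. The function name `attention_pool` and the toy sizes are illustrative only.

```python
import numpy as np


def attention_pool(x, W, b, eps=1e-7):
    """Collapse (samples, steps, features) to (samples, features) with learned weights."""
    scores = np.tanh(x @ W + b)                    # (samples, steps): one score per step
    a = np.exp(scores)
    a = a / (a.sum(axis=1, keepdims=True) + eps)   # normalize over time steps
    pooled = (x * a[..., None]).sum(axis=1)        # weighted sum over time steps
    return pooled, a


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3))   # 2 samples, 4 steps, 3 features (toy sizes)
W = rng.normal(size=(3,))        # one weight per feature, like self.W
b = np.zeros(4)                  # one bias per time step, like self.b

pooled, weights = attention_pool(x, W, b)
print(pooled.shape, weights.sum(axis=1))
```

Each row of `weights` sums to approximately 1, so the layer output is a weighted average of the GRU outputs across the 200 time steps.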

In [14]:

from tensorflow.keras.layers import GRU
from tensorflow.keras import layers

MODEL_NAME = "TF2-biGRU_1layer_attention"
EMBED_SIZE = 300
EPOCHS  = 50
BATCH_SIZE = 128
VOCAB_SIZE = len(tokenizer.word_index)
GRU_DIM = 250 # base size; each GRU direction below uses GRU_DIM*2 units


# use functional syntax to create our graph
inp = layers.Input(shape=(MAX_FEATURES, ))
x = Embedding(VOCAB_SIZE + 1, EMBED_SIZE, trainable=True)(inp)
x = Bidirectional(GRU(units=GRU_DIM*2, return_sequences=True))(x)
x = AttentionLayer(MAX_FEATURES)(x)
x = Dense(GRU_DIM*2, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(GRU_DIM, activation='relu')(x)
outp = Dense(5, activation='softmax')(x)

model = keras.models.Model(inputs=inp, outputs=outp)

model.compile(loss='categorical_crossentropy', 
              optimizer=keras.optimizers.Adam(), 
              metrics=['accuracy'])

Executing op RandomUniform in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op Sub in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op Mul in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op Add in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op VarIsInitializedOp in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op LogicalNot in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op Assert in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op RandomUniform in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Sub in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Add in device /job:localhost/replica:0/task:0/device:GPU:

In [15]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 300)          12237000  
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 1000)         2406000   
_________________________________________________________________
attention_layer (AttentionLa (None, 1000)              1200      
_________________________________________________________________
dense (Dense)                (None, 500)               500500    
_________________________________________________________________
dropout (Dropout)            (None, 500)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               125250
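
The parameter counts in the summary above can be checked by hand. The sketch below reproduces them from the layer sizes. The GRU formula is an assumption: it presumes three gates, each with an input kernel, a recurrent kernel, and two bias vectors (the `reset_after=True` layout), which matches the reported total.

```python
VOCAB_SIZE = 40789      # from the tokenizer output earlier in the notebook
EMBED_SIZE = 300
MAX_FEATURES = 200
GRU_UNITS = 500         # GRU_DIM * 2, per direction

embedding = (VOCAB_SIZE + 1) * EMBED_SIZE

# Assumption: 3 gates, each with input kernel, recurrent kernel, and 2 bias vectors
gru_one_direction = 3 * (EMBED_SIZE * GRU_UNITS + GRU_UNITS * GRU_UNITS + 2 * GRU_UNITS)
bidirectional = 2 * gru_one_direction

attention = 2 * GRU_UNITS + MAX_FEATURES   # W (one per feature) + b (one per step)
dense = 2 * GRU_UNITS * 500 + 500          # weights + biases
dense_1 = 500 * 250 + 250

print(embedding, bidirectional, attention, dense, dense_1)
```

The embedding table accounts for most of the parameters, which follows directly from the vocabulary size.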

In [16]:
early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1, restore_best_weights=True)

# mw = ku.ModelWrapper(model, 
#                      MODEL_NAME,
#                      LABEL_COLUMN,
#                      DATA_FILE,
#                      embedding=EMBED_SIZE,
#                      tokenizer=tokenizer,
#                      description="TF2 1 layer biDirectional GRU")

history = model.fit(X_train, 
       y_train_encoded,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        verbose=1,
        validation_split=0.2,
        callbacks=[early_stop])

Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:GPU:0


W0813 06:48:08.758161 140396788017024 deprecation.py:323] From /tensorflow-2.0.0b1/python3.6/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Fill in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op LogicalNot in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Assert in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/repli
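
The `EarlyStopping` callback above stops training once `val_loss` has failed to improve for `patience=2` consecutive epochs. A simplified pure-Python model of that rule is sketched below. It ignores options such as `min_delta`, the function `early_stopping_epoch` is hypothetical, and the loss values are made up for illustration rather than taken from this run.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch    # training halts here
    return len(val_losses) - 1, best_epoch  # ran out of epochs without stopping


# Hypothetical validation losses, not taken from this notebook's run
stop_epoch, best_epoch = early_stopping_epoch([0.95, 0.88, 0.86, 0.87, 0.90, 0.91])
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model ends up with the weights from `best_epoch` rather than from the epoch where training stopped.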

In [17]:
scores = model.evaluate(X_test, y_test_encoded)
print(f'Model Accuracy: {scores[1]}')

Executing op __inference_keras_scratch_graph_7956 in device <unspecified>
Model Accuracy: 0.67641282081604


In [18]:
# make some predictions so we can test our model later when we load it back in

import numpy as np

y_predicted = np.argmax(model.predict(X_test[:20]), axis=1)
y_predicted

Executing op __inference_keras_scratch_graph_32250 in device <unspecified>


array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])

# Save off files so we can load later

In [19]:
# save the model file to load later
MODEL_FILE = f'{DRIVE_DIR}/models/amazon_reviews_us_Wireless_v1_00-preprocessed-110k-TF2-biGRU_1layer_attention-186-star_rating-model.h5'
model.save(MODEL_FILE)


Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0


In [0]:
# save the weights so we can reload this
WEIGHTS_FILE = f'{DRIVE_DIR}/models/amazon_reviews_us_Wireless_v1_00-preprocessed-110k-TF2-biGRU_1layer_attention-186-star_rating-weights.h5'
model.save_weights(WEIGHTS_FILE)


In [0]:
# save to json so we don't have to re-create the architecture
MODEL_JSON_FILE = f'{DRIVE_DIR}/models/amazon_reviews_us_Wireless_v1_00-preprocessed-110k-TF2-biGRU_1layer_attention-186-star_rating-model_json.h5'
model_json = model.to_json()
with open(MODEL_JSON_FILE, 'w') as json_file:
  json_file.write(model_json)


In [0]:
import pickle

# save off tokenizer file
TOKENIZER_FILE = f'{DRIVE_DIR}/models/tf2-tokenizer.pkl'
with open(TOKENIZER_FILE, 'wb') as tokenizer_file:
  pickle.dump(tokenizer, tokenizer_file)

In [26]:
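# save network history file
# pickling the History object itself raised a TypeError,
# so we pickle its `history` dict of per-epoch metrics instead
HISTORY_FILE = f'{DRIVE_DIR}/models/amazon_reviews_us_Wireless_v1_00-preprocessed-110k-TF2-biGRU_1layer_attention-186-star_rating-history.pkl'
with open(HISTORY_FILE, 'wb') as history_file:
  pickle.dump(history.history, history_file)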

# Loading the models

We are going to load the models and re-create our networks in a couple of different ways.

After that, we are going to run the same prediction as before. Results should match the predictions from the trained model above:

```
array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])
```

## First - let's load the model with weights and then run the same predictions as before to see if we get the same values

In [27]:
model_loaded = keras.models.load_model(MODEL_FILE, custom_objects={'AttentionLayer': AttentionLayer})

build.input_shape (None, 200, 1000)
build.input_shape[-1] 1000
build.features_dim 1000
call.eij Tensor("attention_layer_1/Reshape_2:0", shape=(None, 200), dtype=float32)
call.weighted_input Tensor("attention_layer_1/mul:0", shape=(None, 200, 1000), dtype=float32)


In [28]:
np.argmax(model_loaded.predict(X_test[:20]), axis=1)

Executing op __inference_keras_scratch_graph_37273 in device <unspecified>


array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])

## Second - let's load it using our json config and weights

We will run the same tests as before and see how this looks.

In [29]:
import json

with open(MODEL_JSON_FILE) as json_file:
  json_config = json_file.read()
model_from_json = keras.models.model_from_json(json_config, custom_objects={'AttentionLayer': AttentionLayer})
model_from_json.load_weights(WEIGHTS_FILE)

build.input_shape (None, 200, 1000)
build.input_shape[-1] 1000
build.features_dim 1000
call.eij Tensor("attention_layer_2/Reshape_2:0", shape=(None, 200), dtype=float32)
call.weighted_input Tensor("attention_layer_2/mul:0", shape=(None, 200, 1000), dtype=float32)


In [30]:
np.argmax(model_from_json.predict(X_test[:20]), axis=1)

Executing op __inference_keras_scratch_graph_38131 in device <unspecified>


array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])
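
Rather than comparing the printed arrays by eye, the check can be made explicit. The sketch below hard-codes the class indices from the three prediction outputs in this notebook so that it runs on its own. In the notebook itself, one would presumably compare `y_predicted` against the two `np.argmax(...)` results directly.

```python
import numpy as np

# Class indices copied from the three prediction outputs in this notebook
original = np.array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])
from_h5 = np.array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])
from_json = np.array([4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4, 4, 3])

# Fail loudly if either reload path changes the predicted classes
assert np.array_equal(original, from_h5), "full-model reload changed predictions"
assert np.array_equal(original, from_json), "JSON + weights reload changed predictions"
print("all three models agree on", len(original), "samples")
```

Both reload paths reproduce the original predictions on these 20 samples, which is the behavior we need before serving the model from GCP.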