# CNN 2-Layer Prototype

Based on the [single layer notebook](https://github.com/sv650s/sb-capstone/blob/master/2019-06-23-CNN_prototype.ipynb), we will add a couple convolution layers to see if we get better results

Because the previous notebooks were rather large, I put some of the common code for gather metrics and plotting into utility modules that are loaded below

Source code for the modules are here:
* [dict_util](https://github.com/sv650s/sb-capstone/blob/master/util/dict_util.py)
* [plot_util](https://github.com/sv650s/sb-capstone/blob/master/util/plot_util.py)
* [keras_util](https://github.com/sv650s/sb-capstone/blob/master/util/keras_util.py)
* [file_util](https://github.com/sv650s/sb-capstone/blob/master/util/file_util.py)

In [0]:
from google.colab import drive
import sys
drive.mount('/content/drive')
# add this to sys patch so we can import utility functions
DRIVE_DIR = 'drive/My Drive/Springboard/capstone'
sys.path.append(DRIVE_DIR)


%tensorflow_version 2.x


import tensorflow as tf
# checl to make sure we are using GPU here
tf.test.gpu_device_name()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers.normalization import BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers.convolutional import Conv1D
from tensorflow.keras.layers.convolutional import MaxPooling1D
from tensorflow.keras.layers.embeddings import Embedding
from tensorflow.keras.utils.vis_utils import model_to_dot
import pandas as pd
from IPython.display import SVG
import pickle
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report
import os
import matplotlib.pyplot as plt
import seaborn as sns
import logging

# custom utilities
import util.dict_util as du
import util.plot_util as pu
import util.keras_util as ku
import util.file_util as fu
import util.report_util as ru

logging.basicConfig(level=logging.ERROR)

%matplotlib inline
sns.set()

# check to see if we are using GPU - must be placed at the beginning of the program
tf.debugging.set_log_device_placement(True)
print("GPU Available: ", tf.test.is_gpu_available())
print(f'Tensorflow Version: {tf.version.VERSION}')
print(f'Keras Version: {tf.keras.__version__}')

DRIVE_DIR = "drive/My Drive/Springboard/capstone"
DATE_FORMAT = '%Y-%m-%d'
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
FEATURE_COLUMN = "star_rating"
REVIEW_COLUMN = "review_body"

DEBUG = False

MODEL_NAME = "CNN"
FEATURE_SET_NAME = "random_embedding"
if DEBUG:
  DATA_FILE = f'{DRIVE_DIR}/review_body-word2vec-df_none-ngram_none-89-100-nolda.csv'
  MODEL_NAME = f'test-{MODEL_NAME}'
else:
  DATA_FILE = f"{DRIVE_DIR}/amazon_reviews_us_Wireless_v1_00-200k-preprocessed.csv"

directory, INBASENAME = fu.get_dir_basename(DATA_FILE)

# first layer filter
FILTER1 = 100
FILTER2 = 200
FILTER2 = 300
# Network Settings
KERNEL_SIZE1 = 3
KERNEL_SIZE2 = 2
KERNEL_SIZE3 = 1

# length of our embedding - 300 is standard
EMBED_SIZE = 300
EPOCHS  = 50
BATCH_SIZE = 128
PATIENCE = 4

# From EDA, we know that 90% of review bodies have 100 words or less, 
# we will use this as our sequence length
MAX_SEQUENCE_LENGTH = 100

Using TensorFlow backend.


In [0]:
# load data file
df = pd.read_csv(f"{DATA_FILE}")

# extract feature and label columns
rating = df[FEATURE_COLUMN]
reviews = df[REVIEW_COLUMN]

# Preprocessing

Same as previous notebooks but consolidated to one cell to make thing easier to understand

Features:
* tokenize our review body - this gives us about 50k words
* then we will pad the sequences to length 186 since this encapsulates 99% of the lenght of our training data

Labels:
* one hot encode our star rating labels (y)


In [0]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder


# pre-process our lables
# one hot encode our star ratings since Keras/TF requires this for the labels
y = OneHotEncoder().fit_transform(rating.values.reshape(len(rating), 1)).toarray()


# split our data into train and test sets
reviews_train, reviews_test, y_train, y_test = train_test_split(reviews, y, random_state=1)


# Pre-process our features (review body)
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(reviews_train)
# tokenize both our training and test data
train_sequences = t.texts_to_sequences(reviews_train)
test_sequences = t.texts_to_sequences(reviews_test)

print("Vocabulary size={}".format(len(t.word_counts)))
print("Number of Documents={}".format(t.document_count))


# pad our reviews to the max sequence length
X_train = sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_test = sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

vocab_size = len(t.word_counts)+1

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


Vocabulary size=40788
Number of Documents=84032
Max Sequence Length: 186


## Build Our 2 Layer Model

* we will use embedding size of 300 since this gave us slight improvement from previous notebook for class 1 and 2

In [0]:
model = Sequential()
model.add(Embedding(vocab_size, EMBED_SIZE, input_length=MAX_SEQUENCE_LENGTH))
model.add(Conv1D(filters=FILTER1, kernel_size=KERNEL_SIZE1, padding='same', activation='relu'))
model.add(Conv1D(filters=FILTER2, kernel_size=KERNEL_SIZE2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])


W0724 15:14:17.724280 140554893887360 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0724 15:14:17.761323 140554893887360 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0724 15:14:17.768372 140554893887360 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0724 15:14:17.839851 140554893887360 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3976: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

W0724 15:14:17.898248 140554893887360 deprecation_wrapp

In [0]:
# reduce learning rate if we sense a plateau
reduce_lr = ReduceLROnPlateau(monitor='val_loss', 
                              factor=0.4,
                              patience=PATIENCE, 
                              min_lr=0.00001,
                             mode='auto')

early_stop = EarlyStopping(monitor='val_loss', patience=PATIENCE, verbose=1, restore_best_weights=True)

ARCHITECTURE = "100x200"
DESCRIPTION = "2 Layer CNN 100x200 Filters"


mw2 = ku.ModelWrapper(model, MODEL_NAME, 
                     ARCHITECTURE,
                     FEATURE_SET_NAME
                     LABEL_COLUMN, DATA_FILE, 
                     embedding=EMBED_SIZE,
                     tokenizer=tokenizer,
                     description=DESCRIPTION)

network_history = mw2.fit(X_train, y_train,
                      batch_size=BATCH_SIZE,
                      epochs=EPOCHS,
                      verbose=1,
                      validation_split=0.2,
                      callbacks=[early_stop, reduce_lr])


print(model.summary())

W0724 15:14:18.360604 140554893887360 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0724 15:14:18.477739 140554893887360 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.



Train on 67225 samples, validate on 16807 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 00003: early stopping


## Evaluate our 2 Layer Model

* look at accuracy scores
* epoch vs loss and accuarcy
* confusion matrix
* ROC/AUC plot

In [0]:
importlib.reload(pu)

scores = mw2.evaluate(x_test, y_test)
print("Accuracy: %.2f%%" % (mw2.scores[1]*100))



pu.plot_network_history(mw2.network_history, "categorical_accuracy", "val_categorical_accuracy")
plt.show()

print("\nConfusion Matrix")
print(mw2.confusion_matrix)

print("\nClassification Report")
print(mw2.classification_report)

fig = plt.figure(figsize=(5,5))
pu.plot_roc_auc(mw2.name, mw2.roc_auc, mw2.fpr, mw2.tpr)

print(f'Score: {ru.calculate_metric(mw2.crd)}')

Accuracy: 67.10%


## Save off files for our 2 layer model

In [0]:
mw2.save(DRIVE_DIR, append_report=True)


# Build a 3 Layer CNN with max pooling at the end

In [0]:
model3 = Sequential()
model3.add(Embedding(VOCAB_SIZE, EMBED_SIZE, input_length=MAX_SEQUENCE_LENGTH))
model3.add(Conv1D(filters=FILTER1, kernel_size=KERNEL_SIZE1, padding='same', activation='relu'))
model3.add(Conv1D(filters=FILTER2, kernel_size=KERNEL_SIZE2, padding='same', activation='relu'))
model3.add(Conv1D(filters=FILTER3, kernel_size=KERNEL_SIZE3, padding='same', activation='relu'))
model3.add(MaxPooling1D(pool_size=2))
model3.add(Flatten())
model3.add(Dense(250, activation='relu'))
model3.add(Dense(5, activation='softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])


In [0]:
print(model3.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 186, 300)          12236700  
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 186, 100)          90100     
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 186, 100)          30100     
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 186, 100)          30100     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 93, 100)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 9300)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 250)               2325250   
__________

In [0]:
# this is ame as before
# early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

ARCHITECTURE = "100x200"
DESCRIPTION = "2 Layer CNN 100x200 Filters"


mw3 = ku.ModelWrapper(model, MODEL_NAME, 
                      ARCHITECTURE,
                      FEATURE_SET_NAME,
                      LABEL_COLUMN, DATA_FILE, 
                     embedding=EMBED_SIZE,
                     tokenizer=tokenizer,
                     description=DESCRIPTION)

network_history3 = mw.fit(X_train, y_train,
                      batch_size=BATCH_SIZE,
                      epochs=EPOCHS,
                      verbose=1,
                      validation_split=0.2,
                      callbacks=[early_stop, reduce_lr])

Train on 67225 samples, validate on 16807 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 00004: early stopping


## Evaluate our 3 Layer Model

* look at accuracy scores
* epoch vs loss and accuarcy
* confusion matrix
* ROC/AUC plot

In [0]:
importlib.reload(pu)

scores = mw3.evaluate(x_test, y_test)
print("Accuracy: %.2f%%" % (mw3.scores[1]*100))

pu.plot_network_history(mw3.network_history, "categorical_accuracy", "val_categorical_accuracy")
plt.show()

print("\nConfusion Matrix")
print(mw3.confusion_matrix)

print("\nClassification Report")
print(mw3.classification_report)

fig = plt.figure(figsize=(5,5))
pu.plot_roc_auc(mw3.name, mw3.roc_auc, mw3.fpr, mw3.tpr)


print(f'Score: {ru.calculate_metric(mw3.crd)}')

Accuracy: 67.10%


# Comparing Our 2 Architectures

In [0]:
print(f'2 Layer Score: {ru.calculate_metric(mw2.crd)}')
print(f'3 Layer Score: {ru.calculate_metric(mw3.crd)}')

## Save our 3 Layer Model

In [0]:
mw.save(DRIVE_DIR, append_report=True)

In [0]:
print(datetime.now())