<a href="https://colab.research.google.com/github/sv650s/sb-capstone/blob/master/2019_07_30_deep_learning_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Summary Notebook

We ran a number of deep learning models and saved the results of each run to a CSV file. We are now going to compare all of the models and see which ones performed best.

Criteria for evaluation:

* I will concentrate mainly on the following star ratings, in order: 2 and 4. These are generally our problematic classes: 2 has the fewest examples, and 4 tends to be misclassified as 5. After that, we will look at star rating 3
* We will look at the following metrics for these classes:
  * AUC - this tells us how well the model is able to separate out the various classes
  * From classification report:
    * precision - when the model predicts a class, how likely that prediction is to be correct
    * recall - out of all reviews that truly belong to a class, what fraction the model identifies. I value this less than precision. As we saw in previous notebooks, there is generally a tradeoff between recall and precision: as recall increases, precision generally decreases in our models. I would rather have a model whose predictions we can trust than one that simply flags a large volume of the problem classes
    * F1 score - this number takes into account both precision and recall
    
    
Hopefully, by looking at these numbers, one model will emerge as the best of those we ran
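
To make the precision and recall definitions above concrete, here is a minimal sketch in plain Python. The counts are the star rating 2 figures from the 1 layer bidirectional GRU confusion matrix shown near the end of this notebook; the helper name `precision_recall` is my own and is not part of the notebook's utilities.

```python
def precision_recall(true_positives, predicted_count, actual_count):
    """Return (precision, recall) for a single class from raw counts."""
    # precision: of everything predicted as this class, how much was right
    precision = true_positives / predicted_count if predicted_count else 0.0
    # recall: of everything that truly is this class, how much was found
    recall = true_positives / actual_count if actual_count else 0.0
    return precision, recall

# star rating 2, biGRU_1layer_attention: 60 correct out of 175 predictions,
# with 1871 true 2-star reviews in the test set
precision, recall = precision_recall(60, 175, 1871)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A model like this one is trustworthy when it does predict 2 stars, but it rarely does so, which is the tradeoff described in the list above.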
    



In [2]:
from google.colab import drive
import sys
drive.mount('/content/drive')
DRIVE_DIR = "drive/My Drive/Springboard/capstone"

# add this to sys.path so we can import utility functions
sys.path.append(DRIVE_DIR)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import pandas as pd
import json
import util.dict_util as du


In [0]:
# load report file from all of our deep learning models
REPORT_FILE = f"{DRIVE_DIR}/reports/2019-07-31-dl_protype-report.csv"
report = pd.read_csv(REPORT_FILE, quotechar="'")
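
The `quotechar="'"` argument above suggests that the JSON-valued columns (`roc_auc`, `classification_report`, `confusion_matrix`) are wrapped in single quotes so that their commas and double quotes survive CSV parsing. That is my reading of the call, not something the report file confirms. A minimal standard-library sketch of the idea, using a made-up two-column row:

```python
import csv
import io
import json

# hypothetical one-row report: the JSON field is wrapped in single quotes
raw = "model_name,roc_auc\nexample_model,'{\"auc_1\": 0.95, \"auc_2\": 0.86}'\n"

reader = csv.DictReader(io.StringIO(raw), quotechar="'")
row = next(reader)
# the embedded commas and double quotes are preserved inside the field
roc_auc = json.loads(row["roc_auc"])
print(roc_auc["auc_2"])
```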

## First Let's get a list of all models that we ran

A couple of models (for example, CNN_1layerEmbedding32 and CNN_1layerEmbedding300) ran the same architecture with different embedding sizes, so we will have to look at both the model name and the embedding

In [5]:
report[["model_name", "embedding"]].sort_values(["model_name"])

Unnamed: 0,model_name,embedding
7,CNN_1layerEmbedding300,300.0
6,CNN_1layerEmbedding32,32.0
12,CNN_2layer,300.0
13,CNN_3layer,300.0
15,CNN_3layer_maxpooling_15epoch,300.0
14,CNN_3layer_maxpooling_earlystop,300.0
8,DNN_128_128,word2vec
3,DNN_128_128_batchnorm,word2vec
10,DNN_340,word2vec
9,DNN_340_batchnorm,word2vec


## Now let's look at accuracy

I'm not going to weight accuracy heavily, as this number is computed across all classes. Since reviews lean heavily toward 5 stars, the majority class dominates the overall figure, but it's still an interesting data point to look at

For a review of the dataset, please see the [exploratory data analysis notebook](https://github.com/sv650s/sb-capstone/blob/master/amazon_review-eda.ipynb)
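
To put a number on that skew, here is a rough sketch of the accuracy a trivial "always predict 5 stars" rule would get. The per-class counts are the row sums of the confusion matrices at the end of this notebook (the counts for ratings 1 and 2 also appear as `1_support` and `2_support`); the variable names are my own.

```python
# true review counts per star rating (row sums of the confusion matrices)
support = {1: 3968, 2: 1871, 3: 2531, 4: 4752, 5: 14889}

total = sum(support.values())
majority_class = max(support, key=support.get)
# accuracy of a "model" that predicts the majority class for every review
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

Under these counts the baseline lands at roughly 53%, which is useful context for the 0.67 to 0.68 accuracies in the table below.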


Data here is sorted in descending order

**1 layer bidirectional GRU with attention has the highest accuracy score out of all the models**

In [22]:
report[["model_name", "embedding", "accuracy"]].sort_values(["accuracy"], ascending=False)

Unnamed: 0,model_name,embedding,accuracy
1,biGRU_1layer_attention,300.0,0.679019
0,GRU_1layer,300.0,0.677591
12,CNN_2layer,300.0,0.67702
4,biGRU_2layer_attention,300.0,0.676877
14,CNN_3layer_maxpooling_earlystop,300.0,0.675806
13,CNN_3layer,300.0,0.674806
5,LSTM_1layer,300.0,0.672771
2,DNN_384_384,word2vec,0.66899
8,DNN_128_128,word2vec,0.668525
3,DNN_128_128_batchnorm,word2vec,0.667918


## ROC AUC

Next, we are going to look at AUC for our problem classes

AUC is the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate. It generally tells us how well the model is able to separate out our classes: the closer the number is to 1, the better the model is able to discern between the various classes.
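
One way to build intuition for a per-class AUC: it equals the probability that a randomly chosen review of that class receives a higher score than a randomly chosen review of any other class. The sketch below computes it that way in plain Python. The scores are made up for illustration, and `one_vs_rest_auc` is a hypothetical helper, not the function used to produce the report.

```python
def one_vs_rest_auc(scores, is_positive):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# made-up predicted probabilities for "this review is 2 stars"
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
is_two_star = [True, True, True, False, False, False]
auc = one_vs_rest_auc(scores, is_two_star)
print(round(auc, 3))
```

The pairwise loop is quadratic, so it is only meant for small illustrations like this one.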

Again, we are going to sort the data in descending order, by the most difficult star ratings first.

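**Again, 1 layer bidirectional GRU with attention is the winner here. AUC for class 2 is about 0.002 (roughly 0.2 percentage points) higher than its 2 layer counterpart. Among the models shown, its AUC is the highest for ratings 2, 4, and 5; for rating 1 several models score slightly higher, and for rating 3 the 2 layer bidirectional GRU is marginally ahead**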

In [21]:

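# collect one dict of AUC values per model, then build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0)
auc_rows = []
for index, row in report.iterrows():
  roc_auc = json.loads(row.roc_auc)
  roc_auc["model_name"] = f'{row.model_name}-{row.embedding}'
  auc_rows.append(roc_auc)
auc_df = pd.DataFrame(auc_rows)
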
auc_df.set_index("model_name").sort_values(["auc_2", "auc_3", "auc_4"], ascending=False)


model_name,auc_1,auc_2,auc_3,auc_4,auc_5,auc_macro,auc_micro
biGRU_1layer_attention-300.0,0.95401,0.863699,0.831802,0.757932,0.89285,0.860079,0.913316
biGRU_2layer_attention-300.0,0.954302,0.861583,0.832197,0.756094,0.890222,0.8589,0.911958
GRU_1layer-300.0,0.952725,0.86096,0.824941,0.750702,0.890541,0.855994,0.911163
CNN_3layer-300.0,0.9551,0.860194,0.823187,0.752341,0.891452,0.856476,0.910376
CNN_3layer_maxpooling_earlystop-300.0,0.955349,0.859686,0.826831,0.749974,0.889777,0.856343,0.911021
CNN_2layer-300.0,0.953798,0.858409,0.82904,0.754654,0.891029,0.857407,0.911872
LSTM_1layer-300.0,0.949565,0.854624,0.825177,0.745389,0.885501,0.852072,0.908309
DNN_384_384-word2vec,0.948823,0.854195,0.825289,0.747416,0.888107,0.852787,0.90822
DNN_128_128_batchnorm-word2vec,0.949632,0.854177,0.822947,0.747401,0.888097,0.852471,0.908077
CNN_1layerEmbedding300-300.0,0.944266,0.851232,0.802395,0.724836,0.870334,0.838637,0.899501


## Classification Report - precision and recall

Here we will look at precision and recall for our various classes. I am going to order the models by precision, then recall, and by class importance (i.e., difficulty): star rating 2, then 3, then 4

We are only going to display the top 5 results as the table gets quite wide
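
As a small illustration of what "precision, then recall" ordering means, the sketch below sorts three of the models from the table below using plain tuples. It only mirrors the first two sort keys, and the variable names are my own.

```python
# (model, 2_precision, 2_recall) for three of the models in the table below
rows = [
    ("LSTM_1layer-300.0", 0.288026, 0.095136),
    ("biGRU_1layer_attention-300.0", 0.342857, 0.032068),
    ("CNN_2layer-300.0", 0.319767, 0.029396),
]
# tuples compare element by element, so recall only breaks ties in precision
ranked = sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)
top_model = ranked[0][0]
print(top_model)
```

Because recall is only a tie-breaker here, a model with high precision and very low recall can still land at the top of the list.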

**Again, 1 layer bidirectional GRU with attention shows up at the top of the list**

Let's see how this model did when we look at each specific class:

* 1
  * precision - a little bit over the 75th percentile
  * recall - around the 25th percentile
* 2
  * precision - highest precision of all models
  * recall - a little bit lower than the 25th percentile
* 3
  * precision - somewhere between the 25th and 50th percentile
  * recall - highest recall of all models
* 4
  * precision - above the 75th percentile, just below the maximum
  * recall - below the 25th percentile
* 5
  * precision - around the 50th percentile
  * recall - above the 75th percentile, close to the maximum
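
The percentile placements above can be read off the `describe()` tables below by comparing each value with the 25%, 50%, and 75% cut points. A minimal sketch of that lookup, using a hypothetical helper `quartile_band` and values copied from those tables:

```python
from bisect import bisect_right

def quartile_band(value, q25, q50, q75):
    """Name the quartile band a value falls into, given describe() cut points."""
    bands = ["below 25th", "25th to 50th", "50th to 75th", "above 75th"]
    return bands[bisect_right([q25, q50, q75], value)]

# biGRU_1layer_attention 1_precision against the 1_precision quartiles
band_p1 = quartile_band(0.648007, 0.593671, 0.615689, 0.638632)
# biGRU_1layer_attention 2_recall against the 2_recall quartiles
band_r2 = quartile_band(0.032068, 0.035442, 0.052729, 0.082309)
print(band_p1, "|", band_r2)
```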

In [0]:
# create our dataframe with classification report data
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
cr_rows = []
for index, row in report.iterrows():
  cr = {}
  cr = du.add_dict_to_dict(cr, json.loads(row.classification_report))
  cr["model_name"] = f'{row.model_name}-{row.embedding}'
  cr_rows.append(cr)
cr_df = pd.DataFrame(cr_rows)


In [31]:
# let's look at some overall precision statistics - I don't want to see average precisions
import re

cols = [col for col in cr_df.columns if re.search(r'\d_precision', col)]
cr_df[cols].describe()


Unnamed: 0,1_precision,2_precision,3_precision,4_precision,5_precision
count,16.0,16.0,16.0,16.0,16.0
mean,0.624006,0.258688,0.347518,0.425423,0.768891
std,0.046691,0.074283,0.023435,0.036229,0.014714
min,0.552041,0.0,0.286286,0.327026,0.746219
25%,0.593671,0.25084,0.338714,0.41217,0.758441
50%,0.615689,0.270945,0.351005,0.420537,0.769293
75%,0.638632,0.286292,0.36216,0.45067,0.777872
max,0.731681,0.342857,0.378297,0.467841,0.796923


In [32]:
# let's look at some overall recall statistics
cols = [col for col in cr_df.columns if re.search(r'\d_recall', col)]
cr_df[cols].describe()


Unnamed: 0,1_recall,2_recall,3_recall,4_recall,5_recall
count,16.0,16.0,16.0,16.0,16.0
mean,0.792186,0.077553,0.259358,0.259957,0.902403
std,0.089137,0.071795,0.04814,0.048299,0.039544
min,0.520917,0.0,0.192578,0.184133,0.765397
25%,0.790028,0.035442,0.220605,0.235713,0.900351
50%,0.806403,0.052729,0.248716,0.25372,0.910813
75%,0.841115,0.082309,0.292375,0.261837,0.925333
max,0.883569,0.28915,0.372975,0.400042,0.932568


In [34]:
# get only precision and recall columns
cols = [col for col in cr_df.columns if re.search(r"\d_precision", col) or re.search(r"\d_recall", col)]  
# add model name so we can use it as keys
cols.append("model_name")
cr_df[cols].set_index("model_name").sort_values(["2_precision", 
                                           "2_recall", 
                                           "3_precision", 
                                           "3_recall",
                                           "4_precision", 
                                           "4_recall",
                                          ], ascending=False).iloc[:5].T


model_name,biGRU_1layer_attention-300.0,CNN_2layer-300.0,DNN_128_128-word2vec,LSTM_1layer-300.0,CNN_3layer_maxpooling_earlystop-300.0
1_precision,0.648007,0.623523,0.59468,0.635507,0.580848
1_recall,0.790575,0.837702,0.825032,0.798387,0.883569
2_precision,0.342857,0.319767,0.290698,0.288026,0.285714
2_recall,0.032068,0.029396,0.052994,0.095136,0.044896
3_precision,0.340303,0.369026,0.345727,0.362433,0.378297
3_recall,0.372975,0.241011,0.207031,0.291979,0.249309
4_precision,0.465426,0.446816,0.420021,0.462229,0.439189
4_recall,0.22096,0.251052,0.254259,0.184133,0.259891
5_precision,0.769775,0.757877,0.764483,0.751515,0.783327
5_recall,0.928807,0.92565,0.912045,0.932568,0.904963


## Classification Report - F1 Score

F1 score takes into account both precision and recall. It is their harmonic mean: F1 = 2 * (precision * recall) / (precision + recall)
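
As a worked example of the formula, the sketch below plugs in the star rating 2 precision and recall values reported in this notebook for the two models compared later. The function name `f1_score` is my own and is not the scikit-learn function of the same name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# star rating 2 precision and recall from the tables in this notebook
f1_bigru = f1_score(0.342857, 0.032068)  # biGRU_1layer_attention
f1_cnn = f1_score(0.232788, 0.28915)     # CNN_3layer_maxpooling_15epoch
print(round(f1_bigru, 3), round(f1_cnn, 3))
```

The harmonic mean is pulled toward the smaller of the two inputs, so a very low recall drags F1 down even when precision is comparatively high.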

Again, we are going to display the top 5 results

Here we actually see that the 3 layer CNN with max pooling between the CNN layers is the best model. Note that we did not use early stopping and ran the model for 15 epochs. For star rating 2, its precision is lower at ~23%; however, its recall is ~29%, which is much higher than the 3% recall we saw in the 1 layer bidirectional GRU with attention

This model's F1 scores are interesting compared to the other models:
* 1 - lowest of all models
* 2 - highest of all models
* 3 - 50th percentile
* 4 - somewhere between the 75th and 100th percentile
* 5 - lowest of all models


This model seems to do well in our difficult classes but does poorly in the classes that other models handle well

In [33]:
# let's look at some statistics regarding F1 scores so we get a sense of range of numbers from various models
cols = [col for col in cr_df.columns if re.search(r'\d_f1', col)]
cr_df[cols].describe()


Unnamed: 0,1_f1-score,2_f1-score,3_f1-score,4_f1-score,5_f1-score
count,16.0,16.0,16.0,16.0,16.0
mean,0.692308,0.106255,0.293804,0.318357,0.829619
std,0.024973,0.066872,0.02841,0.026463,0.01427
min,0.608568,0.0,0.250699,0.263356,0.780842
25%,0.69078,0.062535,0.27065,0.301809,0.827745
50%,0.697281,0.088206,0.298045,0.319121,0.832248
75%,0.704311,0.12489,0.310818,0.332269,0.836344
max,0.714916,0.257926,0.355891,0.369023,0.841846


In [26]:
# uncomment this block if you only want to look at f1 columns
# # get columns with f1 scores
# cols = [col for col in cr_df.columns if "f1" in col]
# # add model_name
# cols.append("model_name")
# cr_df[cols].set_index("model_name").sort_values(["2_f1-score", 
#                                            "3_f1-score", 
#                                            "4_f1-score",
#                                           ], ascending=False).T


cr_df.set_index("model_name").sort_values(["2_f1-score", 
                                           "3_f1-score", 
                                           "4_f1-score",
                                          ], ascending=False).iloc[:5].T



model_name,CNN_3layer_maxpooling_15epoch-300.0,CNN_1layerEmbedding300-300.0,CNN_1layerEmbedding32-32.0,LSTM_1layer-300.0,CNN_3layer-300.0
1_f1-score,0.608568,0.699803,0.68786,0.707696,0.703182
1_precision,0.731681,0.648934,0.71518,0.635507,0.590644
1_recall,0.520917,0.759325,0.66255,0.798387,0.8687
1_support,3968.0,3968.0,3968.0,3968.0,3968.0
2_f1-score,0.257926,0.207792,0.200822,0.143029,0.118844
2_precision,0.232788,0.275618,0.279847,0.288026,0.249147
2_recall,0.28915,0.166756,0.156601,0.095136,0.078033
2_support,1871.0,1871.0,1871.0,1871.0,1871.0
3_f1-score,0.298106,0.308486,0.301021,0.323414,0.320812
3_precision,0.286286,0.323097,0.318623,0.362433,0.353641


## Compare Confusion Matrix for our 2 models


The 1 layer bidirectional GRU with attention heavily favors predicting star rating 5, with star rating 1 the next most predicted class

The CNN without early stopping still favors star rating 5, but the next most predicted class is rating 4. Its remaining predictions are spread fairly evenly across star ratings 1, 2, and 3
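
The statements above come from the column sums of each confusion matrix. The sketch below turns those sums into shares for the 1 layer bidirectional GRU, using the matrix printed in the cell below. It assumes rows are true ratings and columns are predicted ratings, which is consistent with the second row summing to the `2_support` value of 1871.

```python
# biGRU_1layer_attention confusion matrix from the cell below
# (rows: true star rating 1-5, columns: predicted star rating 1-5)
cm = [
    [3137, 62, 476, 40, 253],
    [843, 60, 603, 132, 233],
    [452, 42, 944, 475, 618],
    [147, 7, 516, 1050, 3032],
    [262, 4, 235, 559, 13829],
]

predicted_counts = [sum(col) for col in zip(*cm)]  # column sums
total = sum(predicted_counts)
predicted_share = [count / total for count in predicted_counts]

for star, share in enumerate(predicted_share, start=1):
    print(f"predicted {star} stars: {share:.1%}")
```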






In [71]:
# build the confusion matrix DataFrame directly from the stored JSON list of rows
# (DataFrame.append was removed in pandas 2.0; dtype=float matches the output below)
cm_bigru_1layer = pd.DataFrame(
    json.loads(report.loc[report.model_name == "biGRU_1layer_attention", "confusion_matrix"].values[0]),
    dtype=float)
cm_bigru_1layer

Unnamed: 0,0,1,2,3,4
0,3137.0,62.0,476.0,40.0,253.0
1,843.0,60.0,603.0,132.0,233.0
2,452.0,42.0,944.0,475.0,618.0
3,147.0,7.0,516.0,1050.0,3032.0
4,262.0,4.0,235.0,559.0,13829.0


In [77]:
cm_bigru_1layer.sum(axis='index')

0     4841.0
1      175.0
2     2774.0
3     2256.0
4    17965.0
dtype: float64

In [72]:
# build the confusion matrix DataFrame directly from the stored JSON list of rows
# (DataFrame.append was removed in pandas 2.0; dtype=float matches the output below)
cm_cnn_3layer_maxpooling_15epoch = pd.DataFrame(
    json.loads(report.loc[report.model_name == "CNN_3layer_maxpooling_15epoch", "confusion_matrix"].values[0]),
    dtype=float)
cm_cnn_3layer_maxpooling_15epoch

Unnamed: 0,0,1,2,3,4
0,2067.0,1060.0,480.0,152.0,209.0
1,409.0,541.0,513.0,231.0,177.0
2,178.0,425.0,787.0,736.0,405.0
3,60.0,149.0,529.0,1901.0,2113.0
4,111.0,149.0,440.0,2793.0,11396.0


In [78]:
cm_cnn_3layer_maxpooling_15epoch.sum(axis='index')

0     2825.0
1     2324.0
2     2749.0
3     5813.0
4    14300.0
dtype: float64

# Conclusion

The 3 layer CNN model seems to strike a good balance between precision and recall on classes 2 and 4, and performs roughly the same on class 3. Even so, I still think the 1 layer bidirectional GRU with attention is the better model: its precision was much higher in classes 2 and 4, even though it is very conservative in predicting those classes and therefore has low recall.

