* The best deep learning model turned out to be an LSTM (with balanced class weights) with the following architecture:
    * Embedding input layer with 300 features
    * Spatial Dropout layer - 20%
    * LSTM nodes: 64
    * Output Layer - 5 nodes with softmax activation
    * Class Weight: balanced (calculated using sklearn compute_class_weight method)
    * Loss function: categorical cross entropy
    * Optimizer: adam
    * Accuracy Scorer: Categorical Accuracy
* Input for this model is 200k samples of our pre-processed review body
* With only 200k samples, LSTM did not beat Logistic Regression: LSTM (~ .33) scored about 0.13 lower than Logistic Regression with balanced weights (~ .46)
* By using Random Under Sampling, we were able to improve the model score by 0.05. Generally, 2-star reviews are the rarest class, accounting for around 5% of our data. Under sampling balances the classes by cutting every class down to the size of the rarest one, so we would retain only about 25% (5 classes * 5%) of our data. For our full dataset of 9 million samples, this would leave us with only about 2.2 million samples. Because of this drastic reduction in samples, I'm not sure this will be a viable approach.
* Further investigation: we did see a plateau in model performance for Logistic Regression at around 500k samples. Perhaps by training LSTM on the full dataset (9 million samples), it will be able to exceed the performance of Logistic Regression?
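
The `balanced` class weights listed in the architecture follow the heuristic documented for scikit-learn, `n_samples / (n_classes * class_count)`. Below is a minimal pure-Python sketch of that formula. The function name `balanced_class_weights` and the label counts are illustrative assumptions, not our actual data or code:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}


# Hypothetical star-rating distribution (illustrative only, not our real counts)
example_labels = [1] * 10 + [2] * 5 + [3] * 10 + [4] * 20 + [5] * 55

weights = balanced_class_weights(example_labels)
for star, weight in weights.items():
    print(f"{star}-star weight: {weight:.2f}")
```

Rare classes get proportionally larger weights, so each class contributes roughly equally to the loss. Note that Keras expects the `class_weight` dictionary to be keyed by class index (0-4 here), so star ratings would need to be shifted by one before passing it to `fit`.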


# Sampling for LSTM Models (Balanced Weights)

LSTM seems to be our best deep learning model so far. Since we have class imbalance in our dataset, I experimented with a few sampling techniques to see if we can improve our model performance.

### Techniques that we used:
* Random Under Sampling - randomly removes majority class samples until all classes are evenly distributed in our sample data. Since this technique removes a large share of the samples, I had to start with a dataset of 1 million entries to end up with a dataset of around 200k
* Random Over Sampling - creates a balanced class distribution in our dataset by duplicating minority class samples
* ADASYN - a synthetic over sampling technique that generates new samples by interpolating between minority class samples, with an emphasis on creating samples near minority class samples that are wrongly classified by a KNN classifier
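
To make the two random sampling techniques concrete, here is a minimal sketch using only the standard library. It is not the imbalanced-learn implementation (that library provides `RandomUnderSampler` and `RandomOverSampler`). The helper name `random_resample` and the label distribution are assumptions for illustration:

```python
import random
from collections import Counter, defaultdict


def random_resample(labels, mode, seed=42):
    """Return sample indices with every class resampled to the same size.

    mode="under": keep a random subset of each class, sized to the rarest class.
    mode="over":  add random duplicates to each class, sized to the largest class.
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")

    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    sizes = [len(idxs) for idxs in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    selected = []
    for idxs in by_class.values():
        if mode == "under":
            # Sample without replacement down to the rarest class size
            selected.extend(rng.sample(idxs, target))
        else:
            # Keep all originals and pad with random duplicates
            extra = rng.choices(idxs, k=target - len(idxs))
            selected.extend(idxs + extra)
    return selected


# Hypothetical distribution where 2-star is the rarest class at 5% (illustrative only)
labels = [1] * 10 + [2] * 5 + [3] * 10 + [4] * 20 + [5] * 55

under = random_resample(labels, "under")
over = random_resample(labels, "over")
print("under:", dict(Counter(labels[i] for i in under)), f"kept {len(under) / len(labels):.0%}")
print("over: ", dict(Counter(labels[i] for i in over)), f"size {len(over)}")
```

With a rarest class at 5% and 5 classes, under sampling keeps 5 * 5% = 25% of the rows, which is why starting from 1 million entries leaves a dataset in the low hundreds of thousands.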

### Sampling Result
* Random Under Sampling did the most to improve model performance, bringing our score up by around 0.05 to .39. However, this technique reduces our dataset significantly, so I have concerns about its viability
* Neither over sampling method beat our model with no sampling at all
    * ADASYN actually did about 0.03 worse than our baseline, whereas random over sampling did roughly the same as no sampling

Documentation from imbalanced-learn: https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html


In [None]:
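# Keep only the LSTM (balanced weights) runs; cast sample_size to object so NaN can be filled with a string
# (np.object was removed from NumPy, so the builtin object is used instead)
lstms = report[report.model_name == "LSTMB"].copy().astype({"sample_size": object})
# Label each run as "<sampling_type>-<sample_size>", using "none" when a value is missing
lstms["display_name"] = lstms[["sample_size", "sampling_type"]].fillna("none").apply(lambda x: x.sampling_type + "-" + str(x.sample_size), axis=1)
# Reference score from ml_best, repeated on every row so it plots as a flat line (the LRB score below)
lstms["ml_score"] = ml_best.iloc[0].eval_metric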

plt.figure(figsize=(20, 5))
_ = sns.lineplot(data = lstms, y = "eval_metric", x = "display_name", sort=False, label="Sampling Score")
_ = sns.lineplot(data = lstms, y = "ml_score", x = "display_name", sort=False, color="r", label="LRB Score")
_ = plt.title("Sampling Scores")
_ = plt.xlabel("Sampling Type + Sample Size")
_ = plt.ylabel("Score")

lstms[["display_name", "eval_metric"]]

# Confusion Matrix

For LSTM with random under sampling:
* 2-star ratings are most commonly misclassified as 1-star
* 3-star ratings are most commonly misclassified as 1-star or 4-star
* 4-star ratings are most commonly misclassified as 5-star
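
Observations like these can be read off a confusion matrix programmatically. The sketch below assumes rows are true star ratings and columns are predicted star ratings (the scikit-learn convention). The helper `top_misclassifications` and the matrix values are hypothetical, not our actual results:

```python
def top_misclassifications(cm, top_n=2):
    """For each true class (row), list the most frequent wrong predictions."""
    result = {}
    for true_idx, row in enumerate(cm):
        # Collect (count, predicted_star) for every off-diagonal cell in this row
        wrong = [
            (count, pred_idx + 1)
            for pred_idx, count in enumerate(row)
            if pred_idx != true_idx
        ]
        wrong.sort(reverse=True)
        result[true_idx + 1] = [star for count, star in wrong[:top_n] if count > 0]
    return result


# Hypothetical 5x5 confusion matrix (rows = true stars, columns = predicted stars)
cm = [
    [70, 15, 8, 4, 3],
    [30, 40, 20, 6, 4],
    [12, 18, 45, 20, 5],
    [3, 5, 15, 50, 27],
    [2, 2, 4, 17, 75],
]

confusions = top_misclassifications(cm)
for star, confused_with in confusions.items():
    print(f"{star}-star most often misclassified as: {confused_with}")
```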

In [None]:
for idx, row in report_best.iterrows():
    print(f'\n{row.model_name} Sample Size: {row.sample_size} Sampling: {row.sampling_type}')
    print("Confusion Matrix")
    cm = json.loads(row.confusion_matrix)
    print(pd.DataFrame(cm, index=np.arange(1, 6), columns=np.arange(1, 6)))


## Classification Report Histogram

* Recall for our baseline (LRB) is much higher for minority classes, although the tradeoff is lower recall for 1-star and 5-star reviews
* Precision is roughly the same between LSTM (balanced weights) and Logistic Regression, with LSTM using random under sampling doing the best out of all models
* Interestingly, recall for 3-star reviews, which is generally worse than recall for 4-star reviews under our traditional ML models, is actually better under LSTM
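
For reference, the per-class precision and recall compared above can be derived directly from a confusion matrix: precision divides the diagonal cell by its column sum, recall divides it by its row sum. This is a minimal sketch on a 3-class toy matrix. The helper `precision_recall` and the numbers are assumptions for illustration, with rows as true classes and columns as predicted classes:

```python
def precision_recall(cm):
    """Return {class: (precision, recall)} from a square confusion matrix."""
    n = len(cm)
    scores = {}
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum: everything predicted as k
        actual_k = sum(cm[k])  # row sum: everything that truly is k
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        scores[k + 1] = (precision, recall)
    return scores


# Hypothetical 3-class matrix (rows = true class, columns = predicted class)
toy_cm = [
    [8, 2, 0],
    [3, 5, 2],
    [1, 1, 18],
]

scores = precision_recall(toy_cm)
for cls, (precision, recall) in scores.items():
    print(f"class {cls}: precision={precision:.3f} recall={recall:.3f}")
```

A class can have high recall but low precision when the model over-predicts it, which is the kind of tradeoff class weighting tends to introduce for minority classes.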


In [None]:
report_best_histo = report_best.copy()
# Fill missing sampling types with "none" so the concatenated label is never NaN
report_best_histo["display_name"] = report_best_histo.model_name + "-" + report_best_histo.sampling_type.fillna("none")
pu.plot_score_histograms(report_best_histo, version=2, label="display_name")


# Conclusion


* Overall, LSTM using only 200k samples did not do as well as our traditional ML model, logistic regression. However, as we saw in previous notebooks, logistic regression plateaus at around 500k samples. If we train LSTM on the full dataset, we may be able to improve model performance enough to beat our logistic regression model. We will have to try this
* With random under sampling, LSTM was able to improve by about 0.05. However, this significantly reduces our dataset, so I'm not sure it's a realistic solution to our class imbalance problem

