# Use simple techniques for feature Engineering and to do multi-class classification to predict review ratings based on the Amazon Reviews dataset

<b>Objective for this exercise:</b>
    * Establish NLP prediction accuracy baseline using simple ML models
    * Explore different permutation of feature engineering techniques, data, and classification algorithms
    * Compare accuracy of preduction using the following information:
        * Product Title
        * Review Headline
        * Review Body
    * (If time allows) see if using only helpful reviews to train improves our accuracy for our predictions - this reduces our 110k dataset to 35k


<b>Feature Engineering Techniques:</b>
    * bag of words
    * TF-IDF
    * Topic Modeling
    
    
<b>Classification:</b>
    * Logistic Regression Classification
    * K-nearest Neighbors Classification
    * Radius Neighbors Classification - document suggests the Radius Neighbors might be a better fit if our data is no uniform. From our exploratory data analysis, we see that most reviews skew towards 4 or 5 stars
    
    
    
<b>Data:</b>

Data used in this notebooks came from Amazon reviews dataset - Wirless category. First it was converted from tsv to csv. Then it was pre-processed in the previous notebook using various text processing techniques. For details, please see: [amazon_review_preprocessing.ipynb](amazon_review_preprocessing.ipynb)


Example of how to do this:
```
python preprocess_amazon.py -l INFO -r -o dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-smallout.csv dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-smallin.csv
```


<b>Memory Requirement:</b>

| File | Python Memory |
|------|---------------|
| amazon_reviews_us_Wireless_v1_00-tinyout.csv | 20 - 26 GB |



<b>Code:</b>

I found that this notebook was getting way too big and hard to manage so I moved most of the code that runs the models along with utility functions have been moved to a python class [ClassifierRunner.py](models/ClassifierRunner.py). Unit tests for this is in [test_classifier_runner.py](models/test_classifier_runnyer.py)

In [58]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from models.ClassifierRunner import ClassifierRunner
import logging

logging.basicConfig(format='%(asctime)s %(name)s.%(funcName)s %(levelname)s - %(message)s', level=logging.INFO)

%matplotlib inline

In [33]:
# set global variables

# I'm finding that running these models on my laptop takes forever and they are not finishing so I'm going to start
# with a really small file just to validate my code
#
# datafile was generated from amazon_review_preprocessing.ipynb - this file has 1k reviews randomly chosen
# from original file
KEEP_COLUMNS = ["product_title", "helpful_votes", "review_headline", "review_body", "star_rating"]
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
DATE_FORMAT = '%Y-%m-%d'
OUTCOME_COLUMN = "star_rating"


# Configuration
DATA_FILE = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-tinyout.csv"
NEIGHBORS = [5] # default
# NEIGHBORS = [1, 3, 5, 7, 9, 11]

# Radius for RadiusNeighbor
# RADII = [5.0] # this is the lowest number I tried that was able to find a neighbor for review_headline
RADII = [30.0] # this is the lowest number I tried that was able to find a neighbor for review_headline
# RADII = [5.0, 7.0, 9.0, 11.0, 13.0]

# logistic regression settings
C= [1.0] # default
# C = [0.2, 0.4, 0.6, 0.8, 1.0]

N_JOBS=6
LR_ITER=500
FEATURE_COLUMN = "review_headline"
ENABLE_KNN = True
ENABLE_TFIDF = True

# model flags
ENABLE_RN = True
ENABLE_LR = True
ENABLE_BOW = True


WRITE_TO_CSV = True
OUTFILE = "amazon_review_classifier_simple.csv"


In [34]:
# read in DF
df = pd.read_csv(DATA_FILE)[KEEP_COLUMNS]
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22414 entries, 0 to 22413
Data columns (total 5 columns):
product_title      22414 non-null object
helpful_votes      22414 non-null int64
review_headline    22414 non-null object
review_body        22414 non-null object
star_rating        22414 non-null int64
dtypes: int64(2), object(3)
memory usage: 875.6+ KB
None


Unnamed: 0,product_title,helpful_votes,review_headline,review_body,star_rating
0,tfy universal car headrest mount holder portab...,0,good enough,serves purpose loud whoever sitting seat attached,3
1,iccker art nylon hair paint brush tools set bl...,0,five stars,works really well samsung s6 otterbox defender...,5
2,jbl gx series coaxial car loudspeakers certifi...,1,speakers did not sound well thought,speakers did not sound well thought would jbls...,2
3,otium screen protectors,0,really easy install included guide,absoultely perfect included install guide make...,5
4,apple watch stand vtin aluminum alloy build ho...,0,love,heres lot like stand apple watch very modern s...,5


### <font color="red">Should I include add these along with the word vectors as part of the feature set?</font>

In [35]:
# let's get some data on our text

def wc(x:str):
    return len(str(x).split())

df["pt_wc"] = df.product_title.apply(wc)
df["rh_wc"] = df.review_headline.apply(wc)
df["rb_wc"] = df.review_body.apply(wc)
df.describe()

Unnamed: 0,helpful_votes,star_rating,pt_wc,rh_wc,rb_wc
count,22414.0,22414.0,22414.0,22414.0,22414.0
mean,0.904792,3.895333,15.907335,2.956322,25.9797
std,8.709008,1.465474,9.716846,1.915684,41.441713
min,0.0,1.0,1.0,1.0,1.0
25%,0.0,3.0,9.0,2.0,8.0
50%,0.0,5.0,14.0,2.0,15.0
75%,0.0,5.0,20.0,4.0,28.0
max,868.0,5.0,92.0,21.0,1133.0


In [36]:
# Set up different dataframes for training

# outcome
Y = df["star_rating"]
X = df[FEATURE_COLUMN]

# Bag of Words - Generate Feature Vectors

In [37]:
# TODO: try different parameters for CountVectorizers?
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(X.array)
vocab = cv.get_feature_names()
# print(f"vocab: {vocab}")
bag_pd = pd.DataFrame(cv_matrix.toarray(), columns=vocab)

# split results into training and test set
bag_X_train, bag_X_test, bag_Y_train, bag_Y_test = train_test_split(bag_pd, Y, random_state=1)

print(f"training set size {len(bag_X_train)}")
print(f"test set size {len(bag_X_test)}")

training set size 16810
test set size 5604


In [10]:
# explore the data
print(len(vocab))
bag_pd.head()

20541


Unnamed: 0,00,000,0000,000hz,000hzsensitivity,000mah,001,002,003,004,...,zoomed,zooming,zooms,zperia,zr,zte,zumo,zune,zuzo,zx4
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Set up BoW models

In [50]:
cr_bow = ClassifierRunner(write_to_csv=WRITE_TO_CSV)

if ENABLE_BOW:
    if ENABLE_KNN:
        for neighbor in NEIGHBORS:
            neigh = KNeighborsClassifier(n_neighbors=neighbor, n_jobs=N_JOBS)
            cr_bow.addModel(neigh, 
                        bag_X_train, 
                        bag_Y_train, 
                        bag_X_test, 
                        bag_Y_test, 
                        name="KNN", 
                        dataset="BoW", 
                        parameters={"n_jobs": N_JOBS,
                                   "n_neighbors": neighbor})

    if ENABLE_RN:
        for radius in RADII:
            rnc = RadiusNeighborsClassifier(radius=radius, n_jobs=N_JOBS)
            cr_bow.addModel(neigh, 
                        bag_X_train, 
                        bag_Y_train, 
                        bag_X_test, 
                        bag_Y_test, 
                        name="RN", 
                        dataset="BoW", 
                        parameters={"n_jobs": N_JOBS,
                                   "radius": radius})
            
    if ENABLE_LR:
        for c in C:
            lr = LogisticRegression(random_state=0, solver='lbfgs',
                                    multi_class='auto',
                                    max_iter=LR_ITER, n_jobs=N_JOBS, C=c)
            cr_bow.addModel(lr, 
                        bag_X_train, 
                        bag_Y_train, 
                        bag_X_test, 
                        bag_Y_test, 
                        name="LR", 
                        dataset="BoW", 
                        parameters={"n_jobs": N_JOBS,
                                   "c": c,
                                   "max_iter": LR_ITER})



2019-05-18 13:49:45,311 models.ClassifierRunner.$(funcName)s INFO - Initializing models.ClassifierRunner
2019-05-18 13:49:45,313 models.ClassifierRunner.$(funcName)s INFO - write to csv: True
2019-05-18 13:49:45,314 models.ClassifierRunner.$(funcName)s INFO - outfile: 2019-05-18-models.ClassifierRunner-report.csv


In [47]:
report_bow = cr_bow.runModels()
reprot_bow.head()

2019-05-18 13:39:11,860 models.ClassifierRunner.$(funcName)s INFO - Running model: KNN
	with data: BoW
	and parameters: {'n_jobs': 6, 'n_neighbors': 5}
2019-05-18 13:39:11,861 models.ClassifierRunner.$(funcName)s INFO - Start training: 2019-05-18 13:39:11
2019-05-18 13:39:18,786 models.ClassifierRunner.$(funcName)s INFO - End training: 2019-05-18 13:39:18
2019-05-18 13:39:18,787 models.ClassifierRunner.$(funcName)s INFO - End Scoring: 2019-05-18 13:39:18
2019-05-18 13:42:01,063 models.ClassifierRunner.$(funcName)s INFO - End predict: 2019-05-18 13:42:01
2019-05-18 13:42:01,064 models.ClassifierRunner.$(funcName)s INFO - Training time (min): 0.1
2019-05-18 13:42:01,064 models.ClassifierRunner.$(funcName)s INFO - Scoring time (min): 0.0
2019-05-18 13:42:01,065 models.ClassifierRunner.$(funcName)s INFO - Predict time (min): 2.7
2019-05-18 13:42:01,089 models.ClassifierRunner.$(funcName)s INFO - Finished running model: KNN
	with data: BoW
	and parameters: {'n_jobs': 6, 'n_neighbors': 5}	st

Unnamed: 0,1_f1-score,1_precision,1_recall,1_support,2_f1-score,2_precision,2_recall,2_support,3_f1-score,3_precision,...,test_examples,test_features,total_time_min,train_examples,train_features,train_time_min,weighted avg_f1-score,weighted avg_precision,weighted avg_recall,weighted avg_support
0,0.656018,0.573852,0.765646,751.0,0.337308,0.458333,0.266846,371.0,0.425532,0.584416,...,5604.0,5646.0,2.8,16810.0,5646.0,0.1,0.647042,0.64871,0.659886,5604.0
1,0.656018,0.573852,0.765646,751.0,0.337308,0.458333,0.266846,371.0,0.425532,0.584416,...,5604.0,5646.0,2.7,16810.0,5646.0,0.1,0.647042,0.64871,0.659886,5604.0
2,0.707731,0.670238,0.749667,751.0,0.363964,0.548913,0.272237,371.0,0.439716,0.603896,...,5604.0,5646.0,4.4,16810.0,5646.0,4.4,0.674834,0.679572,0.705924,5604.0


# TFIDF - default settings

### Feature Generation

In [53]:
# TODO: play with min_df and max_df
# TODO: play with variations of ngram
tv = TfidfVectorizer(min_df=0., max_df=1., ngram_range=(1,3), use_idf=True)
tv_matrix = tv.fit_transform(X.array)
vocab = tv.get_feature_names()
tv_pd = pd.DataFrame(np.round(tv_matrix.toarray(), 2), columns=vocab)

# split results into training and test set
tv_X_train, tv_X_test, tv_Y_train, tv_Y_test = train_test_split(tv_pd, Y, random_state=1)

print(f"training set size {len(tv_X_train)}")
print(f"test set size {len(tv_X_test)}")

training set size 16810
test set size 5604


### Set Up Models

In [59]:
cr_tfidf = ClassifierRunner(write_to_csv=WRITE_TO_CSV)


if ENABLE_TFIDF:
    if ENABLE_KNN:
        for neighbor in NEIGHBORS:
            neigh = KNeighborsClassifier(n_neighbors=neighbor, n_jobs=N_JOBS)
            cr_tfidf.addModel(neigh, 
                        tv_X_train, 
                        tv_Y_train, 
                        tv_X_test, 
                        tv_Y_test, 
                        name="KNN", 
                        dataset="TFIDF", 
                        parameters={"n_jobs": N_JOBS,
                                   "n_neighbors": neighbor})

    if ENABLE_RN:
        for radius in RADII:
            rnc = RadiusNeighborsClassifier(radius=radius, n_jobs=N_JOBS)
            cr_tfidf.addModel(neigh, 
                        tv_X_train, 
                        tv_Y_train, 
                        tv_X_test, 
                        tv_Y_test, 
                        name="RN", 
                        dataset="TFIDF", 
                        parameters={"n_jobs": N_JOBS,
                                   "radius": radius})
            
    if ENABLE_LR:
        for c in C:
            lr = LogisticRegression(random_state=0, solver='lbfgs',
                                    multi_class='auto',
                                    max_iter=LR_ITER, n_jobs=N_JOBS, C=c)
            cr_tfidf.addModel(lr, 
                        tv_X_train, 
                        tv_Y_train, 
                        tv_X_test, 
                        tv_Y_test, 
                        name="LR", 
                        dataset="TFIDF", 
                        parameters={"n_jobs": N_JOBS,
                                   "c": c,
                                   "max_iter": LR_ITER})

2019-05-18 13:55:34,079 models.ClassifierRunner.$(funcName)s INFO - Initializing models.ClassifierRunner
2019-05-18 13:55:34,080 models.ClassifierRunner.$(funcName)s INFO - write to csv: True
2019-05-18 13:55:34,081 models.ClassifierRunner.$(funcName)s INFO - outfile: 2019-05-18-models.ClassifierRunner-report.csv


In [None]:
report_tfidf = cr_tfidf.runModels()
report_tfidf.head()

2019-05-18 13:55:35,906 models.ClassifierRunner.$(funcName)s INFO - Running model: KNN
	with data: TFIDF
	and parameters: {'n_jobs': 6, 'n_neighbors': 5}
2019-05-18 13:55:35,909 models.ClassifierRunner.$(funcName)s INFO - Start training: 2019-05-18 13:55:35
2019-05-18 13:56:33,224 models.ClassifierRunner.$(funcName)s INFO - End training: 2019-05-18 13:56:33
2019-05-18 13:56:33,239 models.ClassifierRunner.$(funcName)s INFO - End Scoring: 2019-05-18 13:56:33


In [11]:
# # TODO: play with min_df and max_df
# # TODO: play with variations of ngram
# tv = TfidfVectorizer(min_df=0., max_df=1., ngram_range=(1,3), use_idf=True)
# tv_matrix = tv.fit_transform(X.array)
# vocab = tv.get_feature_names()
# tv_pd = pd.DataFrame(np.round(tv_matrix.toarray(), 2), columns=vocab)

# # split results into training and test set
# tv_X_train, tv_X_test, tv_Y_train, tv_Y_test = train_test_split(tv_pd, Y, random_state=1)

# print(f"training set size {len(tv_X_train)}")
# print(f"test set size {len(tv_X_test)}")

# Data Visualization For Our Results

In [None]:
# # visualize some data
# sns.set(font_scale=2)
# sns.set_context(font_scale=3)
# f, ax = plt.subplots(6, 2, figsize=(20,50))
# plt.tight_layout(pad=2, h_pad=5)

# # KNN Graphs


# # total time by neighbor
# sns.lineplot(x="neighbors", y="total_time_min", data=knn_results_pd, marker='o', color='b', ax=ax[0, 0])
# ax[0, 0].set_title("KNN BoW Total Time (minutes)")

# # score by neighbor
# sns.lineplot(x="neighbors", y="score", data=knn_results_pd, marker='o', color='b', ax=ax[0, 1])
# ax[0, 1].set_title("KNN BoW Score")

# # total time by neighbor
# sns.lineplot(x="neighbors", y="total_time_min", data=knn_tv_results_pd, marker='o', color='b', ax=ax[1, 0])
# ax[1, 0].set_title("KNN TFIDF Total Time (minutes)")

# # score by neighbor
# sns.lineplot(x="neighbors", y="score", data=knn_tv_results_pd, marker='o', color='b', ax=ax[1, 1])
# ax[1, 1].set_title("KNN TFIDF Score")


# # Radius Neighbor Graphs

# # total time by radius
# sns.lineplot(x="radius", y="total_time_min", data=rn_results_pd, marker='o', color='g', ax=ax[2, 0])
# ax[2, 0].set_title("Radius BoW Total Time (minutes)")

# # score by radius
# sns.lineplot(x="radius", y="score", data=rn_results_pd, marker='o', color='g', ax=ax[2, 1])
# ax[2, 1].set_title("Radius BoW Score")

# # total time by radius
# sns.lineplot(x="radius", y="total_time_min", data=rn_tv_results_pd, marker='o', color='g', ax=ax[3, 0])
# ax[3, 0].set_title("Radius TFIDF Total Time (minutes)")

# # score by radius
# sns.lineplot(x="radius", y="score", data=rn_tv_results_pd, marker='o', color='g', ax=ax[3, 1])
# ax[3, 1].set_title("Radius TFIDF Score")


# # Logistic Regression Graphs

# # total time by c
# sns.lineplot(x="c", y="total_time_min", data=lr_results_pd, marker='o', color='c', ax=ax[4, 0])
# ax[4, 0].set_title("Logistic Regression BoW Total Time (minutes)")

# # score by c
# sns.lineplot(x="c", y="score", data=lr_results_pd, marker='o', color='c', ax=ax[4, 1])
# ax[4, 1].set_title("Logistic BoW Regression Score")


# # total time by c
# sns.lineplot(x="c", y="total_time_min", data=lr_tv_results_pd, marker='o', color='c', ax=ax[5, 0])
# ax[5, 0].set_title("Logistic Regression TFIDF Total Time (minutes)")

# # score by c
# sns.lineplot(x="c", y="score", data=lr_tv_results_pd, marker='o', color='c', ax=ax[5, 1])
# ax[5, 1].set_title("Logistic TFIDF Regression Score")

