# Simple feature engineering and multi-class classification to predict review ratings from the Amazon Reviews dataset

Feature Engineering Techniques:
    * bag of words
    * TF-IDF
    
    
Classification:
    * Logistic Regression
    * K-nearest neighbors
    
    
Objective for this exercise:
    * Establish NLP prediction accuracy baseline using simple ML models
    * Explore different permutations of feature engineering techniques, data, and classification algorithms
    * Compare prediction accuracy using each of the following text fields:
        * Product Title
        * Review Headline
        * Review Body
    * (If time allows) Check whether training only on helpful reviews improves prediction accuracy; this reduces the dataset from about 150k reviews to 35k
    
    
Data used in this notebook has already been pre-processed in the previous notebook. For details, see: [amazon_review_preprocessing.ipynb](amazon_review_preprocessing.ipynb)

```
python preprocess_amazon.py -l INFO -r -o dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-smallout.csv dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-smallin.csv
```

In [8]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.model_selection import train_test_split



%matplotlib inline

In [2]:
# set global variables

# DATA_FILE was generated by amazon_review_preprocessing.ipynb from a random sample of the
# original file; it contains roughly 150k reviews (see the df.info() output below)
DATA_FILE = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-smallout.csv"
KEEP_COLUMNS = ["product_title", "helpful_votes", "review_headline", "review_body", "star_rating"]

FEATURE_COLUMN = "review_headline"

df = pd.read_csv(DATA_FILE)[KEEP_COLUMNS]
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149467 entries, 0 to 149466
Data columns (total 5 columns):
product_title      149467 non-null object
helpful_votes      149467 non-null int64
review_headline    149467 non-null object
review_body        149467 non-null object
star_rating        149467 non-null int64
dtypes: int64(2), object(3)
memory usage: 5.7+ MB
None


Unnamed: 0,product_title,helpful_votes,review_headline,review_body,star_rating
0,lifeproof fre iphone 5c waterproof case retail...,0,five stars,everything great thank,5
1,soundbeats universal hv 800 wireless music a2d...,0,five stars,perfect sound like lg,5
2,li amefuel lite series,0,perfect,finished using trip loved able keep tablet ope...,5
3,synthetic leather galaxy s6 sleeve thin,0,exactly described,quality looks good prompt postage cannot argue...,5
4,luxury air vent car mount universal smartphone...,0,five stars,device holder awesome holds phone securely no ...,5


In [3]:
# compute word counts for each text column

def wc(x: str) -> int:
    """Return the number of whitespace-separated tokens in x."""
    return len(str(x).split())

df["pt_wc"] = df.product_title.apply(wc)
df["rh_wc"] = df.review_headline.apply(wc)
df["rb_wc"] = df.review_body.apply(wc)
df.describe()

Unnamed: 0,helpful_votes,star_rating,pt_wc,rh_wc,rb_wc
count,149467.0,149467.0,149467.0,149467.0,149467.0
mean,0.918229,3.892578,15.984291,2.959048,26.372617
std,11.283936,1.463551,9.857662,1.890387,45.961244
min,0.0,1.0,1.0,1.0,1.0
25%,0.0,3.0,9.0,2.0,8.0
50%,0.0,5.0,14.0,2.0,15.0
75%,0.0,5.0,20.0,4.0,29.0
max,2186.0,5.0,96.0,23.0,3062.0
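
The objectives mention training on helpful reviews only. The following is a minimal sketch of that filter, not the notebook's actual implementation. It assumes a review counts as "helpful" when it has at least one helpful vote; the `MIN_HELPFUL_VOTES` threshold and the toy dataframe are hypothetical. The summary above shows a 75th percentile of 0 for `helpful_votes`, so a one-vote threshold would keep fewer than a quarter of the rows, which is roughly in line with the 35k figure, but the exact cutoff remains an assumption.

```python
import pandas as pd

# Toy stand-in for the review dataframe; all values are invented for illustration
toy_df = pd.DataFrame({
    "review_headline": ["five stars", "broke quickly", "perfect", "not bad"],
    "helpful_votes": [0, 3, 0, 1],
    "star_rating": [5, 1, 5, 4],
})

# Assumption: a review is "helpful" if it has at least this many helpful votes
MIN_HELPFUL_VOTES = 1

# Keep only the helpful reviews and renumber the rows
helpful_df = toy_df[toy_df["helpful_votes"] >= MIN_HELPFUL_VOTES].reset_index(drop=True)
print(f"{len(helpful_df)} of {len(toy_df)} reviews kept")
```

Applied to the real data, the same boolean mask would be built on `df["helpful_votes"]` before defining `X` and `Y`.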


In [4]:
# Set up different dataframes for training

# outcome
Y = df["star_rating"]
X = df[FEATURE_COLUMN]

# Bag of Words

In [5]:
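# min_df=0. and max_df=1. keep every term regardless of document frequency
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(X.array)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
vocab = cv.get_feature_names_out()
# print(f"vocab: {vocab}")
# note: toarray() materializes a dense matrix, which is memory-hungry at this size
bag_pd = pd.DataFrame(cv_matrix.toarray(), columns=vocab)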

In [6]:
# explore the data
print(len(vocab))
bag_pd.head()

15200


Unnamed: 0,00,000,000mah,001,00522,00a,00pm,00s,01,010,...,zoomback,zoombak,zooom,zooooooo,zr,zre,zte,zumo,zune,zzx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# split results into training and test set
bag_X_train, bag_X_test, bag_Y_train, bag_Y_test = train_test_split(bag_pd, Y, random_state=1)
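
One possible alternative to splitting a dense DataFrame: `train_test_split` also accepts SciPy sparse matrices, so the output of `fit_transform` can be split directly without calling `toarray()`. The sketch below uses invented toy data, and the `stratify` argument is an addition that the cell above does not use.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Invented toy data: four positive and four negative headlines
toy_headlines = [
    "five stars", "great product", "works great", "love it",
    "one star", "terrible product", "does not work", "waste of money",
]
toy_ratings = [5, 5, 5, 5, 1, 1, 1, 1]

# fit_transform returns a sparse matrix; only non-zero counts are stored
sparse_counts = CountVectorizer().fit_transform(toy_headlines)

# Split the sparse matrix directly; stratify keeps the rating mix similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    sparse_counts, toy_ratings, random_state=1, stratify=toy_ratings
)
print(X_train.shape, X_test.shape, issparse(X_train))
```

scikit-learn estimators such as `KNeighborsClassifier` accept sparse input, so the split matrices can be passed to `fit` and `predict` as they are.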



In [None]:
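# train a K-nearest neighbors classifier

# TODO: use GridSearchCV to test different values of n_neighbors
neigh = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
# %time runs each statement once and keeps `output` in the namespace;
# %timeit would repeat the call several times and discard the assignment
%time neigh.fit(bag_X_train, bag_Y_train)
%time output = neigh.predict(bag_X_test)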
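
The TODO above could be approached along these lines. This is a minimal sketch on invented toy data, not a tuned setup: the candidate values for `n_neighbors`, the 3-fold cross-validation, and the `Pipeline` step names (`bow`, `knn`) are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented toy data, repeated so that every cross-validation fold sees both classes
toy_headlines = [
    "five stars", "great product", "works great", "love it",
    "one star", "terrible product", "does not work", "waste of money",
] * 3
toy_ratings = [5, 5, 5, 5, 1, 1, 1, 1] * 3

# The pipeline refits the vectorizer inside each fold, so the vocabulary
# is learned only from that fold's training portion
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical candidate values; a real search would cover a wider range
param_grid = {"knn__n_neighbors": [1, 3, 5]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_headlines, toy_ratings)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the full dataset, a search like this multiplies the KNN prediction cost by the number of candidates and folds, so running it on a subsample first may be more practical.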


# TF-IDF

In [7]:
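tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(X.array)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
vocab = tv.get_feature_names_out()
# round the TF-IDF weights to two decimals for readability
tv_pd = pd.DataFrame(np.round(tv_matrix.toarray(), 2), columns=vocab)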
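
Logistic Regression is listed among the classifiers at the top of this notebook. As a hedged sketch of how it could be combined with TF-IDF features (not the notebook's actual implementation), the example below trains on invented three-class toy data. Two choices differ from the cells above and are assumptions: the vectorizer is fitted on the training split only, and the TF-IDF matrix is kept sparse rather than converted to a DataFrame.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented toy data with three rating classes (5, 3, 1)
toy_headlines = [
    "five stars", "great product", "works great", "love it",
    "three stars", "okay product", "works okay", "average quality",
    "one star", "terrible product", "does not work", "waste of money",
] * 2
toy_ratings = ([5] * 4 + [3] * 4 + [1] * 4) * 2

train_text, test_text, y_train, y_test = train_test_split(
    toy_headlines, toy_ratings, random_state=1, stratify=toy_ratings
)

# Fit the vocabulary and IDF weights on the training text only,
# then reuse them to transform the test text
toy_tv = TfidfVectorizer()
train_tfidf = toy_tv.fit_transform(train_text)
test_tfidf = toy_tv.transform(test_text)

# LogisticRegression handles multi-class targets and sparse input directly
clf = LogisticRegression(max_iter=1000)
clf.fit(train_tfidf, y_train)

predictions = clf.predict(test_tfidf)
accuracy = accuracy_score(y_test, predictions)
print(f"toy accuracy: {accuracy:.2f}")
```

The accuracy on this toy data says nothing about the real dataset; the point is only the shape of the workflow.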