# Predicting the Helpfulness of Amazon Reviews
### Keane Johnson and Tucker Anderson

This notebook builds a model that predicts the helpfulness of Amazon reviews. It uses the [2015 Amazon Review dataset](http://jmcauley.ucsd.edu/data/amazon/index.html), compiled by Julian McAuley, associate professor in the Computer Science department at the University of California, San Diego.

The dataset contains Amazon product reviews spanning May 1996 to July 2014, and includes ratings, text, helpfulness votes, descriptions, category information, price, brand, and image features. It is broken into smaller subsets, organized by product category.

This notebook focuses on the Home and Kitchen sub-category, and uses the aforementioned features to predict helpfulness.

## Outline
- Import Libraries
- Load and Prepare Dataset
- Exploratory Data Analysis
- Baseline Model (Naive Bayes)
- Neural Net Models

## Import Libraries

In [1]:
# load packages
import gzip
import json
import os
import wget

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Load and Prepare Dataset

In [2]:
# load dataset - download directly from source, save to data directory

file_name = "data/reviews_Home_and_Kitchen_5.json.gz"
output_dir = "data"
url = "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz"

if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

if not os.path.isfile(file_name):
    file_name = wget.download(url, out=output_dir)
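
The `wget` package used above is a third-party dependency. If it is not installed, the same download-once pattern can be sketched with the standard library alone. The helper name `download_if_missing` is hypothetical and not part of the original notebook; it assumes the local file should be named after the last path component of the URL.

```python
import os
import urllib.request


def download_if_missing(url, output_dir):
    """Download url into output_dir unless the file is already present.

    Hypothetical helper, not part of the original notebook.
    """
    os.makedirs(output_dir, exist_ok=True)
    # name the local file after the last path component of the URL
    target = os.path.join(output_dir, os.path.basename(url))
    if not os.path.isfile(target):
        urllib.request.urlretrieve(url, target)
    return target
```

With this helper, the cell above would reduce to `file_name = download_if_missing(url, output_dir)`.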

In [3]:
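import ast

# helper functions to parse data from compressed json into pandas DF
def parse(path):
    # open in text mode so each line is a str, and close the file when done
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            # literal_eval only accepts Python literals, unlike eval,
            # which would execute any code found in the file
            yield ast.literal_eval(line)

def get_dataframe(path):
    # key each parsed review by its row number
    df = {i: d for i, d in enumerate(parse(path))}
    return pd.DataFrame.from_dict(df, orient='index')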


# helper function to pull out total helpful votes
def get_helpful_votes(helpful):
    helpful_votes, _ = helpful
    return helpful_votes


# helper function to pull out total votes (helpful and unhelpful)
def get_total_votes(helpful):
    _, total_votes = helpful
    return total_votes


# helper function to calculate helpfulness as a fraction between 0 and 1
# (reviews with no votes are assigned 0 to avoid dividing by zero)
def calculate_helpful_perc(helpful):
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        return 0
    return helpful_votes / total_votes
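
One consequence of the zero-vote rule in `calculate_helpful_perc` is that a review nobody voted on and a review every voter found unhelpful both score 0. The sketch below restates the function compactly and runs it on a few `[helpful, total]` pairs to show this. The `[0, 5]` and `[3, 4]` pairs are made up for illustration; the others mirror values that appear in the dataset.

```python
def calculate_helpful_perc(helpful):
    # helpful is a [helpful_votes, total_votes] pair
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        return 0
    return helpful_votes / total_votes


# [0, 5] and [3, 4] are made-up pairs, used only for illustration
pairs = [[0, 0], [0, 5], [11, 11], [3, 4]]
scores = [calculate_helpful_perc(pair) for pair in pairs]

for pair, score in zip(pairs, scores):
    print(pair, "->", score)
```

Because the first two pairs are indistinguishable by score alone, keeping the total vote count alongside the score makes it possible to separate unvoted reviews from unhelpful ones later.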

In [4]:
# create dataframe
df = get_dataframe(file_name)

# parse helpful column into new columns of helpful_votes, total_votes, helpful_perc
df['helpful_votes'] = df['helpful'].apply(get_helpful_votes)
df['total_votes'] = df['helpful'].apply(get_total_votes)
df['helpful_perc'] = df['helpful'].apply(calculate_helpful_perc)

# take a peek at the data
df.sample(20)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime | helpful_votes | total_votes | helpful_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 540594 | A3GDWPIZDJB9T3 | B00G9C002G | H. Selby "online shopper" | [0, 1] | I am surprised at how highly rated this was. ... | 2.0 | Juice goes everywhere | 1403654400 | 06 25, 2014 | 0 | 1 | 0.0 |
| 350879 | A2W33TFHCANZPV | B0039RH4AK | M. Agnone | [0, 0] | This is okay....just okay. A little disappoin... | 3.0 | Maytex Garden Flight PEVA Shower Curtain | 1359504000 | 01 30, 2013 | 0 | 0 | 0.0 |
| 269477 | AT2M67P5I23VI | B001GAQKMU | B. Brown | [0, 0] | I bought this on the recommendation of America... | 5.0 | This is awesome... | 1401753600 | 06 3, 2014 | 0 | 0 | 0.0 |
| 432213 | A2WZNM833CJH82 | B00558UWEQ | Grandma | [11, 11] | I really like this littleDirt Devil Quick Flex... | 4.0 | Nice Compact Vac for Light Duty Use | 1357603200 | 01 8, 2013 | 11 | 11 | 1.0 |
| 257592 | A34UVV757IKPVB | B001CJ2Y5M | justsomeguy | [28, 28] | I have had this thing on my wishlist forever. ... | 5.0 | Been needing this for a loooong time. | 1322438400 | 11 28, 2011 | 28 | 28 | 1.0 |
| 410830 | AJCSYHWI8U2TF | B004M17KB0 | counterfugue | [0, 0] | Stainless steel, sturdy, fast heat time, good ... | 5.0 | Perfect | 1400976000 | 05 25, 2014 | 0 | 0 | 0.0 |
| 161148 | A3NNC9336VD5VG | B000HK03HI | rascales | [0, 0] | This is a solid and well made product with goo... | 5.0 | Horse face says.... | 1380499200 | 09 30, 2013 | 0 | 0 | 0.0 |
| 438300 | A1KXDXSCX05LBT | B005DN65WG | Valkyrie | [0, 0] | The cover fits well and has a soft outer surfa... | 5.0 | Cover | 1383091200 | 10 30, 2013 | 0 | 0 | 0.0 |
| 350845 | A1XWIWNEB38BO0 | B0039PMJW0 | julie "julie" | [0, 0] | This does what you think it does. Simple and ... | 5.0 | works | 1346457600 | 09 1, 2012 | 0 | 0 | 0.0 |
| 179589 | A4DXE5ZVH1RBW | B000MDHH06 | J. Adamo | [1, 1] | Bought this one before I knew about slow juice... | 3.0 | First Juicer | 1389398400 | 01 11, 2014 | 1 | 1 | 1.0 |
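
The three derived columns come from calling a Python function once per row. On a frame of this size, a vectorized version may be faster. The following is a minimal sketch on a small stand-in frame, not the notebook's own method; the name `toy` and the `[3, 4]` row are made up for illustration.

```python
import pandas as pd

# tiny stand-in for the review frame; the [3, 4] row is made up
toy = pd.DataFrame({"helpful": [[0, 1], [0, 0], [11, 11], [3, 4]]})

# expand the [helpful, total] pairs into two integer columns in one step
votes = pd.DataFrame(
    toy["helpful"].tolist(),
    index=toy.index,
    columns=["helpful_votes", "total_votes"],
)
toy = toy.join(votes)

# divide only where votes exist; reviews without votes keep 0.0
has_votes = toy["total_votes"] > 0
toy["helpful_perc"] = 0.0
toy.loc[has_votes, "helpful_perc"] = (
    toy.loc[has_votes, "helpful_votes"] / toy.loc[has_votes, "total_votes"]
)

print(toy)
```

The boolean mask keeps the same convention as `calculate_helpful_perc`: rows with zero total votes are assigned 0 rather than producing a division error or `NaN`.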


## Exploratory Data Analysis

## Baseline Model (Naive Bayes)