# Amazon Review Helpfulness Prediction 

## Problem Statement
In this project, we try to address the bias in how Amazon ranks review helpfulness, which is currently based on the number of upvotes a review has received. We will use machine learning techniques to build a model that classifies a review as helpful or not helpful. The final outcome of the project is a measure of how well we can predict whether a new review is helpful.

For this problem, we will use the Home and Kitchen dataset, which contains around 346,355 reviews. The dataset is available on Julian McAuley's Amazon data page: http://jmcauley.ucsd.edu/data/amazon/links.html

## Data Analysis

In [4]:
# Importing the relevant dependencies
import numpy as np
import pandas as pd
import gzip
import math, time, random, datetime

# data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
# Note: matplotlib 3.6+ renames this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')


In [6]:
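# Loading the Home and Kitchen dataset, which has been downloaded to ../data/raw
import ast

def parse(path):
    # Each line holds one review as a dict literal; literal_eval parses it
    # without executing arbitrary code, unlike eval
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)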

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i +=1
    return pd.DataFrame.from_dict(df, orient='index')

data = getDF('../data/raw/reviews_Home_and_Kitchen_5.json.gz')    
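To sanity-check a line-per-record loader like the one above without the full download, a small gzip file can be written and read back. This is only a minimal sketch: the two sample records, the temporary file, and the `parse_safe` name are made up for illustration, and `ast.literal_eval` is used here as a safer stand-in for `eval`, which is an assumption rather than the notebook's own choice.

```python
import ast
import gzip
import os
import tempfile

# Two made-up records in the same one-dict-per-line layout (illustrative only)
sample_lines = [
    "{'reviewerID': 'A1', 'helpful': [0, 0], 'reviewText': 'Works fine', 'overall': 5.0}",
    "{'reviewerID': 'A2', 'helpful': [3, 4], 'reviewText': 'Broke quickly', 'overall': 2.0}",
]

def parse_safe(path):
    """Yield one dict per line; literal_eval only accepts Python literals."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            yield ast.literal_eval(line)

# Write the sample to a temporary .gz file, then load it back
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample_reviews.json.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8') as handle:
    handle.write('\n'.join(sample_lines) + '\n')

records = list(parse_safe(tmp_path))
print(len(records), records[1]['helpful'])
```

Because the generator yields one record at a time, the same function can also be used to peek at the first few reviews of the real file before building the full DataFrame.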

In [8]:
# Let's look at the first few rows of the data
data.head(5)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | APYOBQE6M18AA | 615391206 | Martin Schwartz | [0, 0] | My daughter wanted this book and the price on ... | 5.0 | Best Price | 1382140800 | 10 19, 2013 |
| 1 | A1JVQTAGHYOL7F | 615391206 | Michelle Dinh | [0, 0] | I bought this zoku quick pop for my daughterr ... | 5.0 | zoku | 1403049600 | 06 18, 2014 |
| 2 | A3UPYGJKZ0XTU4 | 615391206 | mirasreviews | [26, 27] | There is no shortage of pop recipes available ... | 4.0 | Excels at Sweet Dessert Pops, but Falls Short ... | 1367712000 | 05 5, 2013 |
| 3 | A2MHCTX43MIMDZ | 615391206 | M. Johnson "Tea Lover" | [14, 18] | This book is a must have if you get a Zoku (wh... | 5.0 | Creative Combos | 1312416000 | 08 4, 2011 |
| 4 | AHAI85T5C2DH3 | 615391206 | PugLover | [0, 0] | This cookbook is great. I have really enjoyed... | 4.0 | A must own if you own the Zoku maker... | 1402099200 | 06 7, 2014 |


As mentioned on the jmcauley site, the fields are described as follows:
1. reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
2. asin - ID of the product, e.g. 0000013714
3. reviewerName - name of the reviewer
4. helpful - helpfulness rating of the review, e.g. 2/3.
5. reviewText - text of the review
6. overall - rating of the product
7. summary - summary of the review
8. unixReviewTime - time of the review (unix time)
9. reviewTime - time of the review (raw)
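To make the field descriptions concrete, the sketch below unpacks one record. The values are copied from the third row of the preview above. It assumes that the first element of `helpful` counts helpful votes and the second counts total votes, and that `unixReviewTime` is expressed in seconds since the epoch in UTC.

```python
from datetime import datetime, timezone

# Values copied from the third row of the data preview
record = {
    'helpful': [26, 27],
    'overall': 4.0,
    'unixReviewTime': 1367712000,
    'reviewTime': '05 5, 2013',
}

# 'helpful' is a two-element list: [helpful votes, total votes]
helpful_votes, total_votes = record['helpful']

# Convert the unix timestamp (assumed UTC, in seconds) to a calendar date
review_date = datetime.fromtimestamp(record['unixReviewTime'], tz=timezone.utc).date()

print(helpful_votes, total_votes)
print(review_date.isoformat())
```

The converted date agrees with the raw `reviewTime` string of the same row, which suggests the two time fields carry the same information.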


As per the problem statement, let's extract the information we need:
1. 'reviewText', which will be used to generate the features.
2. 'helpful', which is an array of helpful ratings and total ratings. We will split it into helpful_ratings and total_ratings, and use the helpful ratings as the target labels to predict.
3. 'overall', as one of the features. We will check whether there is any correlation between overall and helpful_ratings, and whether it helps improve the performance of the model.
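One possible way to turn the two vote counts into a label and to check the correlation with `overall` is sketched below. The rows are made up, and both the helpfulness ratio and the 0.5 cutoff (`HELPFUL_THRESHOLD`) are assumptions chosen for illustration; the notebook does not state how the final labels are defined.

```python
import pandas as pd

# Made-up rows in the same shape as the extracted columns (illustrative only)
toy = pd.DataFrame({
    'overall': [5.0, 4.0, 1.0, 3.0, 5.0, 2.0],
    'helpful_ratings': [0, 26, 3, 0, 14, 1],
    'total_ratings': [0, 27, 10, 0, 18, 4],
})

# Only reviews that received at least one vote have a defined ratio
voted = toy[toy['total_ratings'] > 0].copy()
voted['helpful_ratio'] = voted['helpful_ratings'] / voted['total_ratings']

# Hypothetical cutoff: label a review helpful when most voters found it so
HELPFUL_THRESHOLD = 0.5
voted['is_helpful'] = (voted['helpful_ratio'] > HELPFUL_THRESHOLD).astype(int)

# Pearson correlation between the star rating and the helpfulness ratio
correlation = voted['overall'].corr(voted['helpful_ratio'])

print(voted[['overall', 'helpful_ratio', 'is_helpful']])
print(round(correlation, 3))
```

Filtering out reviews with zero total votes avoids a division by zero, but it also discards reviews nobody voted on, so whether to keep them is a modelling decision in its own right.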

In [12]:
# Extracting the useful columns from the data
df = data.loc[:, ['helpful', 'reviewText', 'overall']]

# Split the helpful into helpful_ratings and total_ratings
df['helpful_ratings'] = df['helpful'].apply(lambda x: x[0])
df['total_ratings'] = df['helpful'].apply(lambda x: x[1])

# Delete helpful from df
del df['helpful']

# Check if there are any null values
print(df.isnull().sum())

reviewText         0
overall            0
helpful_ratings    0
total_ratings      0
dtype: int64


In [13]:
# Check the df statistics
df.describe()

| | overall | helpful_ratings | total_ratings |
|---|---|---|---|
| count | 551682.0 | 551682.0 | 551682.0 |
| mean | 4.316655 | 3.497348 | 3.939469 |
| std | 1.110749 | 76.539142 | 77.801556 |
| min | 1.0 | 0.0 | 0.0 |
| 25% | 4.0 | 0.0 | 0.0 |
| 50% | 5.0 | 0.0 | 0.0 |
| 75% | 5.0 | 1.0 | 2.0 |
| max | 5.0 | 52176.0 | 52861.0 |


There are clearly some outliers in the data: the maximum values of helpful_ratings and total_ratings (52,176 and 52,861) are far above their 75th percentiles (1 and 2). Let's move on to exploratory visualization to find some insights on this.
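Before plotting, the skew can also be quantified numerically. The sketch below uses a made-up series of vote counts with one extreme value. The Tukey fence (Q3 + 1.5 × IQR) is a common rule of thumb, used here as an assumption rather than as the notebook's own outlier criterion.

```python
import pandas as pd

# Made-up vote counts with one extreme value, mimicking the skew in describe()
votes = pd.Series([0, 0, 0, 0, 1, 1, 2, 3, 5, 52000], name='total_ratings')

# Share of reviews that received no votes at all
zero_share = (votes == 0).mean()

# Upper quantiles show how far the maximum sits from the bulk of the data
upper_quantiles = votes.quantile([0.5, 0.9, 0.99])

# Tukey's rule: flag values above Q3 + 1.5 * IQR as outliers
q1, q3 = votes.quantile(0.25), votes.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = votes[votes > upper_fence]

print(zero_share)
print(upper_quantiles)
print(outliers.tolist())
```

With heavily skewed counts like these, a log-scaled axis or a capped range usually makes the plots in the next section easier to read.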

## Exploratory Visualization 