### Tasks In The Notebook:

1. Calculates the average rating given by each reviewer.
2. Calculates the number of reviews by each reviewer.
3. Converts date to YYYYMMDD format.
4. Sort the date in ascending order.
5. Gets the first date when the Reviewer first rated a product.
6. Merges previous data frames.
7. Calculate the helpfulness ration.
8. Weights the helpfulness ratio.
9. Merges the helpfulness ratio to get the final result.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import plotly.plotly as py
#py.set_credentials_file(username='raoshaheryarkhan', api_key='KswnKD2cSzUMp6zxf25p')
import plotly.figure_factory as ff
import math
from datetime import datetime
import sys
import os

In [2]:
project_path = "PycharmProjects/Amazon-Mining"
# make sure to use position 1
sys.path.insert(1, project_path)


In [3]:
os.chdir("/Users\RSK\PycharmProjects\Amazon-Mining")
from src.data.json_loader import JSONLoader

In [4]:
# file_path corresponds to the file of the .gz file which contains the JSON file. 
os.chdir("/Users\RSK")
file_path = 'Documents\DataMining/reviews_Electronics_5.json.gz'
loader = JSONLoader()

In [5]:
# data is a Pandas Dataframe object.
data = loader.load_data(file_path)

#### Data Has Been Loaded

###### Calculating Avg rating given by each Reviewer.

In [6]:
meanReview = data[['reviewerID','overall']].groupby(['reviewerID'],as_index=False).mean()
meanReview = meanReview.rename(columns = {'overall':'Avgrating'})

###### Calculating number of Reviews given by each reviewer and joining it with mean rating given by each reviewer.

In [7]:
reviewerCount = data[['reviewerID','asin']].groupby(['reviewerID'], as_index=False).count()
reviewerCount = reviewerCount.rename(columns = {'asin':'reviewCount'})
reviewer = pd.merge(reviewerCount, meanReview, on = "reviewerID")

###### Below function converts the date in data to format of YYYYMMDD e.g., 20171101 . its 1st Nov 2017

In [8]:
def to_YYYYMMDD(row):   
    datetime_object = datetime.strptime(row['reviewTime'], '%m %d, %Y').date()
    return datetime.strftime(datetime_object, "%Y%m%d")

In [9]:
data['date(YYYYMMDD)'] = data.apply (lambda row: to_YYYYMMDD (row),axis=1)

###### loading the interested columns into the other data frame for better processing

In [10]:
reviewerData = data[['reviewerID','reviewerName','date(YYYYMMDD)']].copy()

###### Now sorting the values in ascending order on YYYYMMDD format.

In [11]:
reviewerData = reviewerData.groupby('reviewerID').apply(lambda x: x.sort_values('date(YYYYMMDD)'))

###### Get the first record of each Employee. as now first record contains the date on which Reviewer first gave review

###### These del reviewerID column is just to remove redundant column. which will cause issue when we reset index.

In [12]:
del reviewerData['reviewerID']
temp = reviewerData
temp = temp.groupby('reviewerID').head(1)

###### level_1 was the column generated as a result of sorting and reset index. so getting rid of it

In [13]:
temp = temp.reset_index()
del temp['level_1']

###### merging date wise result with previous data frame.

In [14]:
reviewer_df = pd.merge(temp, reviewer, on = "reviewerID")

###### Aamna's code till next 4 cell

In [15]:
data['helpfulness_ratio']=(data['helpful'].str[0])/(data['helpful'].str[1])

###### Code to extract helpful votes(num) and total votes(den)

In [16]:
help_df=pd.DataFrame(data['helpful'])
help_df['helpful']=help_df['helpful'].astype(str)
help_df['a'] = help_df['helpful'].apply(lambda x: x.split(',')[0])
help_df['b'] = help_df['helpful'].apply(lambda x: x.split(',')[1])
help_df['num']=help_df['a'].map(lambda x: x.lstrip('['))
help_df['deno']=help_df['b'].map(lambda x: x.rstrip(']'))
help_df['deno']=help_df['deno'].astype(int)
help_df['deno']=help_df['num'].astype(int)

###### calculated mean of total votes(den) and divided each den with this mean to generate weights/scale Multiplied each helpfulness_ratio in the main dataframe called 'data' with this scale to get helpfulness score

In [17]:
mean=help_df['deno'].mean()
mean
help_df['scale']= help_df['deno']/mean
#help_df['final_ratio']= help_df['helpful_ratio']*help_df['scale']
help_df
#help_df['helpful_ratio']= help_df['num']/help_df['deno']
data['helpfulness_score']=data['helpfulness_ratio']*help_df['scale']

###### Since this score has a wide range of values, I have normalized it to get the values bw 0 and 1

In [18]:

data['helpfulness_score']=(data['helpfulness_score']-data['helpfulness_score'].min())/(data['helpfulness_score'].max()-data['helpfulness_score'].min())

###### Taking mean of all helfulness score per Reviewer.

In [19]:
meanHelpfulnessScore = data[['reviewerID','helpfulness_score']].groupby(['reviewerID'],as_index=False).mean()
meanHelpfulnessScore = meanHelpfulnessScore.rename(columns = {'helpfulness_score':'Avghelfulness'})

###### Joining Helpfulness with the previous data frame to get our final data frame.

In [20]:
reviewer_df = pd.merge(reviewer_df, meanHelpfulnessScore, on = "reviewerID")

###### The final result

In [21]:
reviewer_df

Unnamed: 0,reviewerID,reviewerName,date(YYYYMMDD),reviewCount,Avgrating,Avghelfulness
0,A000715434M800HLCENK9,DP,20140519,5,3.200000,0.000033
1,A00101847G3FJTWYGNQA,Cang Cheng,20130919,6,4.666667,0.000025
2,A00166281YWM98A3SVD55,Band Mom,20130417,5,4.800000,0.000200
3,A0046696382DWIPVIWO0K,"Fernando Molinas ""Fernando Molinas""",20130621,5,4.200000,0.000025
4,A00472881KT6WR48K907X,Matthew Wright,20130128,7,4.571429,0.000067
5,A00473363TJ8YSZ3YAGG9,Thomas Rogers,20130708,9,3.666667,
6,A005721627VX5W2COKKK2,Purav Patel,20121227,7,4.714286,0.000067
7,A00700212KB3K0MVESPIY,sandra conklin,20131125,6,5.000000,0.000033
8,A007780739H92ZWKGVWGJ,Ron Burnette,20130509,6,4.666667,
9,A00814373OZEDXYMXP04T,Nathan Johnson,20130705,6,4.833333,0.000033
