# Capstone Project 2: Recommender System

## Data Cleaning and Wrangling:

The dataset used for this recommender system consists of reviews of electronics products sold on Amazon. It was compiled by Julian McAuley, a professor at UCSD, and is available at the following website, with the following citation:

http://jmcauley.ucsd.edu/data/amazon/ 

###### Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering  
R. He, J. McAuley  
WWW, 2016   

The raw dataset contains the Amazon Standard Identification Number (ASIN) of each product, the rating, the helpfulness votes for each review, the review text, the date of the review, the reviewer name and ID, and a summary of the review. The dataset is saved in JSON format and the file is 1.4 GB. This is too large to upload to GitHub, so the data will be stored and accessed locally. The necessary packages and libraries are imported first.

In [1]:
import pandas as pd
import numpy as np
import os

The Python os module allows for accessing the path where the data file is stored. The dataset will be read into a pandas DataFrame.

In [2]:
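#storing the path to the data file
#note: HOMEPATH is only defined on Windows; os.path.expanduser('~') is a portable alternative
IMPORT_PATH = os.path.join(os.environ['HOMEPATH'], 'data', 'raw', 'amazon_electronics.json')
#the file holds one JSON record per line, hence lines=True
df = pd.read_json(IMPORT_PATH, lines=True)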
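
Since the file is about 1.4 GB, reading it in a single call can need a lot of memory. The following is a minimal sketch of a chunked alternative, assuming the file is in JSON Lines format as `lines=True` implies. The helper name `load_reviews`, its `columns` filter, and the default chunk size are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def load_reviews(path, columns=None, chunksize=100_000):
    """Load a JSON Lines file chunk by chunk and return a single DataFrame.

    Hypothetical helper: the name, the `columns` filter, and the default
    chunk size are assumptions made for this sketch.
    """
    # With lines=True and a chunksize, read_json yields DataFrames lazily
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    chunks = []
    for chunk in reader:
        # Keeping only the requested columns lowers peak memory use
        chunks.append(chunk if columns is None else chunk[columns])
    return pd.concat(chunks, ignore_index=True)
```

For example, `load_reviews(IMPORT_PATH, columns=['asin', 'reviewerID', 'overall'])` would load only those three columns.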

In [3]:
#observing the DataFrame
df.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 528881469 | [0, 0] | 5 | We got this GPS for my husband who is an (OTR)... | 06 2, 2013 | AO94DHGC771SJ | amazdnu | Gotta have GPS! | 1370131200 |
| 1 | 528881469 | [12, 15] | 1 | I'm a professional OTR truck driver, and I bou... | 11 25, 2010 | AMO214LNFCEI4 | Amazon Customer | Very Disappointed | 1290643200 |
| 2 | 528881469 | [43, 45] | 3 | Well, what can I say.  I've had this unit in m... | 09 9, 2010 | A3N7T0DY83Y4IG | C. A. Freeman | 1st impression | 1283990400 |
| 3 | 528881469 | [9, 10] | 2 | Not going to write a long review, even thought... | 11 24, 2010 | A1H8PY3QHMQQA0 | Dave M. Shaw "mack dave" | Great grafics, POOR GPS | 1290556800 |
| 4 | 528881469 | [0, 0] | 1 | I've had mine for a year and here's what we go... | 09 29, 2011 | A24EV6RXELQZ63 | Wayne Smith | Major issues, only excuses for support | 1317254400 |


The dataset loaded correctly and all columns are present. Let's inspect the DataFrame's info to see the data types and check for missing values.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1689188 entries, 0 to 1689187
Data columns (total 9 columns):
asin              1689188 non-null object
helpful           1689188 non-null object
overall           1689188 non-null int64
reviewText        1689188 non-null object
reviewTime        1689188 non-null object
reviewerID        1689188 non-null object
reviewerName      1664458 non-null object
summary           1689188 non-null object
unixReviewTime    1689188 non-null int64
dtypes: int64(2), object(7)
memory usage: 116.0+ MB


The dataset is large, with nearly 1.7 million records. The only field with missing values is reviewerName: 24,730 entries (about 1.5%) are null. Since reviewerID identifies each reviewer and has no missing values, it will be used instead and the reviewerName column will be dropped. Time is also recorded twice, as unixReviewTime and as reviewTime. The unixReviewTime column will be dropped and reviewTime will be converted to the datetime data type.
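
The missing-value check read off the `info()` output can also be computed directly. The sketch below runs on a small made-up sample rather than the real data, and `missing_summary` is a hypothetical helper name chosen for illustration.

```python
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column.

    Hypothetical helper; only columns with at least one missing value are kept.
    """
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary[summary["missing"] > 0]


# Made-up sample that mimics the pattern seen above: IDs complete, names partly missing
sample = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A3", "A4"],
    "reviewerName": ["x", None, "y", None],
})
result = missing_summary(sample)
print(result)
```

Applied to the full DataFrame, `missing_summary(df)` should list only reviewerName if the `info()` output above is representative.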

In [5]:
#dropping unneeded columns
df_noName = df.drop(['reviewerName', 'unixReviewTime'], axis=1)

#converting reviewTime to datetime datatype
df_noName['reviewTime'] = pd.to_datetime(df_noName['reviewTime'])
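
Without a `format` argument, pandas has to infer the layout of strings such as `06 2, 2013`. Passing the format explicitly removes the guesswork and is typically faster on large columns. The sketch below assumes every reviewTime value follows the month, day, year layout seen in the first rows. It uses two values copied from the preview above and also checks that reviewTime and unixReviewTime agree, which is what makes it safe to keep only one of them.

```python
import pandas as pd

# Two rows copied from the preview of the raw data
sample = pd.DataFrame({
    "reviewTime": ["06 2, 2013", "11 25, 2010"],
    "unixReviewTime": [1370131200, 1290643200],
})

# Explicit format: month, day (not necessarily zero-padded), comma, four-digit year
parsed = pd.to_datetime(sample["reviewTime"], format="%m %d, %Y")

# Unix timestamps are seconds since 1970-01-01 UTC
from_unix = pd.to_datetime(sample["unixReviewTime"], unit="s")

# Compare calendar days only, in case the Unix values carry a time of day
same_day = bool((parsed.dt.normalize() == from_unix.dt.normalize()).all())
print(parsed.dt.strftime("%Y-%m-%d").tolist(), same_day)
```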

In [6]:
#observing changes
df_noName.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | summary |
|---|---|---|---|---|---|---|---|
| 0 | 528881469 | [0, 0] | 5 | We got this GPS for my husband who is an (OTR)... | 2013-06-02 | AO94DHGC771SJ | Gotta have GPS! |
| 1 | 528881469 | [12, 15] | 1 | I'm a professional OTR truck driver, and I bou... | 2010-11-25 | AMO214LNFCEI4 | Very Disappointed |
| 2 | 528881469 | [43, 45] | 3 | Well, what can I say.  I've had this unit in m... | 2010-09-09 | A3N7T0DY83Y4IG | 1st impression |
| 3 | 528881469 | [9, 10] | 2 | Not going to write a long review, even thought... | 2010-11-24 | A1H8PY3QHMQQA0 | Great grafics, POOR GPS |
| 4 | 528881469 | [0, 0] | 1 | I've had mine for a year and here's what we go... | 2011-09-29 | A24EV6RXELQZ63 | Major issues, only excuses for support |


Each entry in the helpful column holds two values: the first is the number of readers who found the review helpful, and the second is the total number of readers who voted on it. These two values will be split into separate columns and converted to an integer data type, and the original helpful column will be dropped.

In [7]:
#splitting the helpful column
df_noName['foundHelpful'] = [h1 for h1, h2 in df_noName['helpful']]
df_noName['totalHelpful'] = [h2 for h1, h2 in df_noName['helpful']]

#converting into integer (astype returns a new object, so the result is assigned back)
df_noName[['foundHelpful','totalHelpful']] = df_noName[['foundHelpful','totalHelpful']].astype('int')

#dropping the helpful column
df_help_split = df_noName.drop('helpful', axis=1)
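
The list comprehensions above walk the helpful column twice. One possible alternative, sketched here on three rows copied from the preview, expands the two-element lists into both columns in a single pass. The names `votes` and `result` are illustrative only.

```python
import pandas as pd

# Three rows copied from the preview; each 'helpful' entry is [found helpful, total votes]
sample = pd.DataFrame({
    "asin": ["528881469", "528881469", "528881469"],
    "helpful": [[0, 0], [12, 15], [43, 45]],
})

# Expand each two-element list into two columns, aligned on the original index
votes = pd.DataFrame(
    sample["helpful"].tolist(),
    columns=["foundHelpful", "totalHelpful"],
    index=sample.index,
).astype("int64")

# Replace the original list column with the two integer columns
result = sample.drop(columns="helpful").join(votes)
print(result)
```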

In [8]:
#observing DataFrame
df_help_split.head()

| | asin | overall | reviewText | reviewTime | reviewerID | summary | foundHelpful | totalHelpful |
|---|---|---|---|---|---|---|---|---|
| 0 | 528881469 | 5 | We got this GPS for my husband who is an (OTR)... | 2013-06-02 | AO94DHGC771SJ | Gotta have GPS! | 0 | 0 |
| 1 | 528881469 | 1 | I'm a professional OTR truck driver, and I bou... | 2010-11-25 | AMO214LNFCEI4 | Very Disappointed | 12 | 15 |
| 2 | 528881469 | 3 | Well, what can I say.  I've had this unit in m... | 2010-09-09 | A3N7T0DY83Y4IG | 1st impression | 43 | 45 |
| 3 | 528881469 | 2 | Not going to write a long review, even thought... | 2010-11-24 | A1H8PY3QHMQQA0 | Great grafics, POOR GPS | 9 | 10 |
| 4 | 528881469 | 1 | I've had mine for a year and here's what we go... | 2011-09-29 | A24EV6RXELQZ63 | Major issues, only excuses for support | 0 | 0 |


In [9]:
#observing the updated data types
df_help_split.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1689188 entries, 0 to 1689187
Data columns (total 8 columns):
asin            1689188 non-null object
overall         1689188 non-null int64
reviewText      1689188 non-null object
reviewTime      1689188 non-null datetime64[ns]
reviewerID      1689188 non-null object
summary         1689188 non-null object
foundHelpful    1689188 non-null int64
totalHelpful    1689188 non-null int64
dtypes: datetime64[ns](1), int64(3), object(4)
memory usage: 103.1+ MB


Little preparation was needed, as the dataset was well maintained and had almost no missing values. The cleaned and wrangled dataset will be exported as a CSV file and used for exploratory data analysis.

In [10]:
#exporting the cleaned DataFrame to a csv file
EXPORT_PATH = os.path.join(os.environ['HOMEPATH'], 'data', 'amazon_cleaned.csv')
df_help_split.to_csv(EXPORT_PATH)
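
A CSV file does not store data types, so some of the cleaning can be undone when the file is read back. By default the row index is written as an extra unnamed column, dates come back as plain strings, and identifier-like values with leading zeros can be read as integers. The sketch below shows one way to guard against this. It uses an in-memory buffer and made-up sample values instead of the real file, and the `dtype` and `parse_dates` choices are assumptions about how the EDA notebook might reload the data.

```python
import io

import pandas as pd

# Made-up sample values with the same column types as the cleaned DataFrame
cleaned = pd.DataFrame({
    "asin": ["0000000001", "B000000001"],
    "overall": [5, 1],
    "reviewTime": pd.to_datetime(["2013-06-02", "2010-11-25"]),
    "foundHelpful": [0, 12],
    "totalHelpful": [0, 15],
})

# index=False avoids writing the row index as an extra unnamed column
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Read asin as text to keep leading zeros, and parse the dates again
restored = pd.read_csv(buffer, dtype={"asin": str}, parse_dates=["reviewTime"])
print(restored.dtypes)
```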