# Data Science For Business Project

# Step 1

## Preprocessing


In [1]:
# Load file (we keep the git repo as light as possible by only hosting the .gz's)
!rm -f *.json
!gunzip -c amazon_step1.json.gz > amazon_step1.json

import pandas as pd
import numpy as np

# A first look at the data
df1 = pd.read_json('amazon_step1.json', lines=True)
df1.head()

| | asin | category | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B000J4HXUC | Sports_and_Outdoors | [1, 1] | 5 | It's a .50 Caliber Ammo Can. That largely sums... | 01 5, 2014 | A3QRW0UJPKIAX7 | Grant Fritchey | Clean and Exactly as Advertised | 1388880000 |
| 1 | 0983393214 | Books | [0, 0] | 5 | This was a very good book. It kept me excited ... | 06 23, 2013 | A2SEIOM4H06WTH | TJ | Great read! | 1371945600 |
| 2 | B003G4FVMY | Grocery_and_Gourmet_Food | [0, 0] | 5 | If you love coconut the way I do, you can't go... | 05 19, 2013 | A3GDEXMU9587JX | K. Parsley "kindlekat" | If you love coconut, get this coffee | 1368921600 |
| 3 | B00F9VRNF0 | Cell_Phones_and_Accessories | [0, 0] | 5 | I recently switched from the Galaxy S3 to the ... | 04 25, 2014 | ASP3J2NEHDN4E | ChriS | Superior Protection!!! | 1398384000 |
| 4 | B00D5OZQUC | Amazon_Instant_Video | [0, 0] | 5 | Good show,looks like the gap from season 2 to ... | 11 4, 2013 | A1EDBI6TBKP9CO | Grants Book Trade | Love the show, thanks for putting Season 3 on ... | 1383523200 |


In [2]:
df1.dtypes

asin              object
category          object
helpful           object
overall            int64
reviewText        object
reviewTime        object
reviewerID        object
reviewerName      object
summary           object
unixReviewTime     int64
dtype: object

In [3]:
num_total_samples = len(df1.index)
num_total_features = len(df1.columns)
print("Number of features:", num_total_features)
print("Number of samples:", num_total_samples)

Number of features: 10
Number of samples: 96000


In [4]:
# Number of non-missing entries in each row
num_valid_entries_per_sample = df1.count(axis=1)

# A sample is complete when none of its features is missing
num_complete_samples = (num_valid_entries_per_sample == num_total_features).sum()

percentage_damaged_samples = 1 - num_complete_samples / num_total_samples
print('Number of damaged samples:', num_total_samples - num_complete_samples)
print('Percentage of damaged Samples:', np.around(100 * percentage_damaged_samples, decimals=1), '%')

Number of damaged samples: 994
Percentage of damaged Samples: 1.0 %


It appears that some data is missing.<br />
Let's look at the number of valid entries for each feature:

In [5]:
num_valid_entries_per_feature = df1.count(axis=0).sort_values()
print(num_valid_entries_per_feature)

reviewerName      95006
asin              96000
category          96000
helpful           96000
overall           96000
reviewText        96000
reviewTime        96000
reviewerID        96000
summary           96000
unixReviewTime    96000
dtype: int64


Only `reviewerName` has missing values (994 of 96,000 entries).
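
`DataFrame.count` only counts non-missing entries, which is why it can be used to locate gaps. The following is a minimal sketch on a made-up frame (the toy values and the name `toy` are assumptions for illustration, not part of the dataset) showing how `count` relates to `isna`:

```python
import pandas as pd

# Toy frame with one missing reviewerName (illustrative values only)
toy = pd.DataFrame({
    "reviewerName": ["Ann", None, "Bob"],
    "overall": [5, 4, 3],
})

valid_per_feature = toy.count(axis=0)          # non-missing entries per column
missing_per_feature = toy.isna().sum()         # missing entries per column
complete_rows = toy.notna().all(axis=1).sum()  # rows without any missing entry

print(valid_per_feature)
print(missing_per_feature)
print("Complete rows:", complete_rows)
```

For every column, the valid and missing counts add up to the number of rows, so either view can be used to find incomplete features.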
### asin
`asin` (Amazon Standard Identification Number) is a unique identifier for each product. As a pure identifier it can only add noise to our future model, so we drop it.

In [6]:
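# Pass the axis by keyword: the positional form is no longer accepted by recent pandas versions
df1 = df1.drop("asin", axis=1)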

### category

`Category`: TODO

### helpful
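`helpful` needs to be converted from the format $[a, b]$ to a numerical value.<br />
We set $helpful = \dfrac{a}{b}$, where $a$ is the number of helpful votes and $b$ the total number of votes. If $b = 0$ (no votes), we set $helpful = 1$.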

In [7]:
def ratio(row):
    if row[1] == 0:
        return 1
    else:
        return row[0] / row[1]
df1["helpful"] = df1["helpful"].apply(ratio)
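
To make the zero-vote convention concrete, here is a small sketch that applies the same rule as `ratio` to a few hand-written vote pairs. The name `helpful_ratio` and the sample pairs are illustrative assumptions and are not taken from the dataset.

```python
def helpful_ratio(votes):
    """Return helpful_votes / total_votes, or 1.0 when there are no votes."""
    helpful_votes, total_votes = votes
    if total_votes == 0:
        # Same convention as the ratio function above: no votes counts as 1
        return 1.0
    return helpful_votes / total_votes

# Made-up [helpful_votes, total_votes] pairs
sample_votes = [[1, 1], [0, 0], [3, 5], [0, 4]]
ratios = [helpful_ratio(v) for v in sample_votes]
print(ratios)
```

Note that under this convention a review with no votes gets the same value as a review that every voter found helpful.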

### overall
TODO

### reviewText
TODO<br />
We add the feature $length(reviewText)$

In [21]:
# Print one full review to sanity-check the computed length
print(df1["reviewText"].iloc[1])
# Number of characters in each review
df1["reviewTextLength"] = df1["reviewText"].str.len()
df1.head(100)

This was a very good book. It kept me excited about what was gonna happen next. I read the whole series.


| | category | helpful | overall | reviewText | reviewerName | summary | unixReviewTime | reviewTextLength |
|---|---|---|---|---|---|---|---|---|
| 0 | Sports_and_Outdoors | 1.000000 | 5 | It's a .50 Caliber Ammo Can. That largely sums... | Grant Fritchey | Clean and Exactly as Advertised | 0.967145 | 530 |
| 1 | Books | 1.000000 | 5 | This was a very good book. It kept me excited ... | TJ | Great read! | 0.934786 | 104 |
| 2 | Grocery_and_Gourmet_Food | 1.000000 | 5 | If you love coconut the way I do, you can't go... | K. Parsley "kindlekat" | If you love coconut, get this coffee | 0.929008 | 170 |
| 3 | Cell_Phones_and_Accessories | 1.000000 | 5 | I recently switched from the Galaxy S3 to the ... | ChriS | Superior Protection!!! | 0.985306 | 1645 |
| 4 | Amazon_Instant_Video | 1.000000 | 5 | Good show,looks like the gap from season 2 to ... | Grants Book Trade | Love the show, thanks for putting Season 3 on ... | 0.956909 | 148 |
| 5 | Baby | 1.000000 | 5 | I'm ordering more of these spoons, as one just... | A. Jordan | Perfect for babies | 0.765891 | 375 |
| 6 | Tools_and_Home_Improvement | 1.000000 | 4 | We installed some sensor lights in the house f... | Dr. Oceanfront "Oceanfront" | Nice little sensor light, good for power fail... | 0.979528 | 1648 |
| 7 | Digital_Music | 0.600000 | 5 | It was 1979 and the B-52s were on Saturday Nig... | R. M. Ettinger "rme1963" | Defining An Era | 0.326069 | 627 |
| 8 | Pet_Supplies | 1.000000 | 5 | My Manchester terrier eats so fast she chokes ... | Grider "Just Horses" | Perfect Solution For Dogs That Gobble Food Too... | 0.712399 | 369 |
| 9 | Movies_and_TV | 1.000000 | 2 | This movie starts off with some clever and fun... | CelticWomanFanPiano | Reverse Cinderella just doesn't cut it. | 0.736999 | 779 |


### reviewTime & reviewerID
Since `reviewTime` is redundant with `unixReviewTime`, we drop this feature.<br />
We also drop the feature `reviewerID`, since `reviewerName` is available for all but about 1% of the samples (994 of 96,000).

In [8]:
df1 = df1.drop(["reviewTime", "reviewerID"], axis=1)

### reviewerName
TODO

### summary
TODO

### unixReviewTime
We rescale `unixReviewTime` to the range $[0, 1]$ using a min-max scaler.

In [15]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df1["unixReviewTime"] = min_max_scaler.fit_transform(df1["unixReviewTime"].values.reshape(-1, 1))
print("Mean of unixReviewTime:", df1["unixReviewTime"].mean())
df1.head()

Mean of unixReviewTime: 0.8616993423587033


| | category | helpful | overall | reviewText | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|
| 0 | Sports_and_Outdoors | 1.0 | 5 | It's a .50 Caliber Ammo Can. That largely sums... | Grant Fritchey | Clean and Exactly as Advertised | 0.967145 |
| 1 | Books | 1.0 | 5 | This was a very good book. It kept me excited ... | TJ | Great read! | 0.934786 |
| 2 | Grocery_and_Gourmet_Food | 1.0 | 5 | If you love coconut the way I do, you can't go... | K. Parsley "kindlekat" | If you love coconut, get this coffee | 0.929008 |
| 3 | Cell_Phones_and_Accessories | 1.0 | 5 | I recently switched from the Galaxy S3 to the ... | ChriS | Superior Protection!!! | 0.985306 |
| 4 | Amazon_Instant_Video | 1.0 | 5 | Good show,looks like the gap from season 2 to ... | Grants Book Trade | Love the show, thanks for putting Season 3 on ... | 0.956909 |
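
As a sanity check on what the scaler computes: with its default settings, min-max scaling maps each value $x$ to $\dfrac{x - x_{min}}{x_{max} - x_{min}}$, so the earliest review becomes 0 and the latest becomes 1. The sketch below applies this formula with NumPy to the five raw timestamps shown in the first table. It is only an illustration: the minimum and maximum here come from those five values rather than from the full dataset, so the results differ from the `unixReviewTime` column above.

```python
import numpy as np

# Raw timestamps of the first five reviews (illustrative subset only)
timestamps = np.array(
    [1388880000, 1371945600, 1368921600, 1398384000, 1383523200],
    dtype=float,
)

# Min-max scaling: (x - min) / (max - min), which maps min -> 0 and max -> 1
scaled = (timestamps - timestamps.min()) / (timestamps.max() - timestamps.min())
print(scaled)
```

The scaling is monotonic, so the chronological order of the reviews is preserved.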


# Step 2
## Preprocessing

In [10]:
!gunzip -c amazon_step23.json.gz > amazon_step23.json
# TODO
#df2 = pd.read_json('amazon_step23.json.gz', lines=True)
#df2.head()

# Step 3
## Preprocessing