# 1. Install Dependencies

In [None]:
!pip install simplejson



# 2. Import Libraries

In [None]:
import os
import random
import re

# simplejson is used as a drop-in replacement for the standard json module
import simplejson as json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 3. Parse Data

In [None]:
def parse(path):
  # yield one parsed JSON object per line; the file is closed when iteration ends
  with open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

In [None]:
def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')
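The two helpers above read a file that stores one JSON object per line. As a hedged alternative, pandas can usually read this layout directly with `pd.read_json(..., lines=True)`. The sketch below runs on a tiny temporary file; the sample records and the file name are made up for illustration, and the call has not been tried here on the full Video Games files.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records in the same one-object-per-line layout (illustrative only).
sample_records = [
    {"asin": "0439381673", "overall": 1.0},
    {"asin": "B00002NDRR", "overall": 5.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")  # hypothetical file name
    with open(sample_path, "w") as handle:
        for record in sample_records:
            handle.write(json.dumps(record) + "\n")

    # lines=True parses each line as a separate JSON object.
    # dtype=False asks pandas not to infer column dtypes, so ID-like
    # strings are left as strings.
    sample_df = pd.read_json(sample_path, lines=True, dtype=False)

print(sample_df.shape)
```

For files as large as the ones used here, memory use and speed of the two approaches may differ, so it is worth timing both before switching.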

# 4. Load Data

In [None]:
review = getDF('/content/drive/My Drive/Applied_ML/data/Video_Games.json')


In [None]:
meta = getDF('/content/drive/My Drive/Applied_ML/data/meta_Video_Games.json')

# drop the 'fit' and 'date' columns: the dataset author provides no description
# for them, so their contents cannot be interpreted reliably
meta = meta.drop(columns=['fit', 'date'])

In [None]:
review.head(5)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,1.0,True,"06 9, 2014",A21ROB4YDOZA5P,439381673,Mary M. Clark,I used to play this game years ago and loved i...,Did not like this,1402272000,,,
1,3.0,True,"05 10, 2014",A3TNZ2Q5E7HTHD,439381673,Sarabatya,The game itself worked great but the story lin...,Almost Perfect,1399680000,,,
2,4.0,True,"02 7, 2014",A1OKRM3QFEATQO,439381673,Amazon Customer,I had to learn the hard way after ordering thi...,DOES NOT WORK WITH MAC OS unless it is 10.3 or...,1391731200,15.0,,
3,1.0,True,"02 7, 2014",A2XO1JFCNEYV3T,439381673,ColoradoPartyof5,The product description should state this clea...,does not work on Mac OSX,1391731200,11.0,,
4,4.0,True,"01 16, 2014",A19WLPIRHD15TH,439381673,Karen Robinson,I would recommend this learning game for anyon...,Roughing it,1389830400,,,
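
The `reviewTime` and `unixReviewTime` columns in the preview above appear to describe the same moment in two forms. A minimal sketch of converting the integer timestamps to datetimes follows; the values are copied from the first two preview rows, and the `reviewDate` column name is a hypothetical choice, not part of the dataset.

```python
import pandas as pd

# Values copied from the first two rows of the review preview.
times = pd.DataFrame({
    "reviewTime": ["06 9, 2014", "05 10, 2014"],
    "unixReviewTime": [1402272000, 1399680000],
})

# Interpret the integers as seconds since the Unix epoch (UTC).
# 'reviewDate' is a hypothetical column name used only in this sketch.
times["reviewDate"] = pd.to_datetime(times["unixReviewTime"], unit="s")

print(times)
```

Working with a real datetime column makes later sorting and grouping by date less error-prone than working with the raw string.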


The **review** dataset contains $2,565,349$ reviews with the following fields:

>**reviewerID** - reviewer ID
>
> **asin** - product ID 
>
> **reviewerName** - reviewer name 
>
> **vote** - no. of votes for the review, indicating its helpfulness 
>
> **style** - dictionary of product attributes 
>
> **reviewText** - review statement
>
> **overall** - product rating provided by the reviewer 
>
> **summary** - review summary 
>
> **unixReviewTime** - unix time of the review 
>
> **reviewTime** - raw time of the review 
>
> **image** - images posted by reviewer


In [None]:
meta.head(5)

Unnamed: 0,category,tech1,description,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,price,asin,imageURL,imageURLHighRes,details
0,"[Video Games, PC, Games]",,[],Reversi Sensory Challenger,[],,Fidelity Electronics,[],"[>#2,623,937 in Toys &amp; Games (See Top 100 ...",[],Toys &amp; Games,,,42000742,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,"[Video Games, Xbox 360, Games, </span></span><...",,[Brand new sealed!],Medal of Honor: Warfighter - Includes Battlefi...,[B00PADROYW],,by\n \n EA Games,[],"[>#67,231 in Video Games (See Top 100 in Video...","[B0050SY5BM, B072NQJCW5, B000TI836G, B002SRSQ7...",Video Games,,"\n\t\t\t\t\t\t\t\t\t\t\t\t<span class=""vertica...",78764343,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,"[Video Games, Retro Gaming & Microconsoles, Su...",,[],street fighter 2 II turbo super nintendo snes ...,[],,Nintendo,[],"[>#134,433 in Video Games (See Top 100 in Vide...",[],Video Games,,$0.72,276425316,[],[],
3,"[Video Games, Xbox 360, Accessories, Controlle...",,[MAS's Pro Xbox 360 Stick (Perfect 360 Stick) ...,Xbox 360 MAS STICK,[],,by\n \n MAS SYSTEMS,[Original PCB used from Xbox 360 Control Pad (...,"[>#105,263 in Video Games (See Top 100 in Vide...",[],Video Games,,,324411812,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
4,"[Video Games, PC, Games, </span></span></span>...",,"[Phonics Alive! 3, The Speller teaches student...",Phonics Alive! 3: The Speller,[],,by\n \n Advanced Software Pty. Ltd.,"[Grades 2-12, Spelling Program, Teaches Spelli...","[>#92,397 in Video Games (See Top 100 in Video...",[B000BCZ7U0],Video Games,,,439335310,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


The **meta** dataset contains $84,819$ rows with the following fields:

> **asin** - ID of the product
>
> **title** - name of the product
>
> **feature** - bullet-point format features of the product
>
> **description** - description of the product
>
> **price** - price in US dollars (at time of crawl)
>
> **imageURL** - URL of the product image
>
> **imageURLHighRes** - URL of the high-resolution product image
>
> **also_buy**, **also_view** - related products (also bought, also viewed)
>
> **rank** - sales rank information
>
> **brand** - brand name
>
> **category** - list of categories the product belongs to
>
> **tech1** - the first technical detail table of the product
>
> **tech2** - the second technical detail table of the product
>
> **similar_item** - similar product table

# 5. Cleaning

## 5.1 Handling Duplicates

Since the meta dataset contains product information, we expect every value in the 'asin' column (the product ID) to be unique. However, as the counts below show, this is not the case:

In [None]:
print('Number of unique values in asin for meta dataset: {:,}'.format(meta['asin'].nunique()))
print('Total number of rows in meta dataset: {:,}'.format(len(meta)))

Number of unique values in asin for meta dataset: 71,911
Total number of rows in meta dataset: 84,819


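Hence, we need to remove the duplicate rows. Rows are compared on their string representation because several columns hold lists, which pandas cannot hash; the trailing `.astype(str)` also leaves every column of `meta` stored as strings.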

In [None]:
meta = meta.loc[meta.astype(str).drop_duplicates().index].astype(str)
meta.head()


Unnamed: 0,category,tech1,description,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,price,asin,imageURL,imageURLHighRes,details
0,"['Video Games', 'PC', 'Games']",,[],Reversi Sensory Challenger,[],,Fidelity Electronics,[],"['>#2,623,937 in Toys &amp; Games (See Top 100...",[],Toys &amp; Games,,,42000742,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,
1,"['Video Games', 'Xbox 360', 'Games', '</span><...",,['Brand new sealed!'],Medal of Honor: Warfighter - Includes Battlefi...,['B00PADROYW'],,by\n \n EA Games,[],"['>#67,231 in Video Games (See Top 100 in Vide...","['B0050SY5BM', 'B072NQJCW5', 'B000TI836G', 'B0...",Video Games,,"\n\t\t\t\t\t\t\t\t\t\t\t\t<span class=""vertica...",78764343,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,
2,"['Video Games', 'Retro Gaming & Microconsoles'...",,[],street fighter 2 II turbo super nintendo snes ...,[],,Nintendo,[],"['>#134,433 in Video Games (See Top 100 in Vid...",[],Video Games,,$0.72,276425316,[],[],
3,"['Video Games', 'Xbox 360', 'Accessories', 'Co...",,"[""MAS's Pro Xbox 360 Stick (Perfect 360 Stick)...",Xbox 360 MAS STICK,[],,by\n \n MAS SYSTEMS,['Original PCB used from Xbox 360 Control Pad ...,"['>#105,263 in Video Games (See Top 100 in Vid...",[],Video Games,,,324411812,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,
4,"['Video Games', 'PC', 'Games', '</span></span>...",,"['Phonics Alive! 3, The Speller teaches studen...",Phonics Alive! 3: The Speller,[],,by\n \n Advanced Software Pty. Ltd.,"['Grades 2-12', 'Spelling Program', 'Teaches S...","['>#92,397 in Video Games (See Top 100 in Vide...",['B000BCZ7U0'],Video Games,,,439335310,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,


In [None]:
print('\n\nAfter dropping duplicates, we have')
print('\nNumber of unique values in asin for meta dataset: {:,}'.format(meta['asin'].nunique()))
print('Total number of rows in meta dataset: {:,}'.format(len(meta)))



After dropping duplicates, we have

Number of unique values in asin for meta dataset: 71,911
Total number of rows in meta dataset: 71,911
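
The de-duplication cell above casts the whole frame to strings. A minimal sketch of a variant that uses the string view only to locate repeated rows, and then indexes back into the original frame so that column types are preserved, is shown below. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for meta: 'category' holds Python lists, which are unhashable.
toy_meta = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],
    "category": [["Video Games", "PC"], ["Video Games", "PC"], ["Video Games"]],
    "price": [9.99, 9.99, None],
})

# Compare rows on their string form, but filter the original frame.
is_repeat = toy_meta.astype(str).duplicated()
deduped = toy_meta[~is_repeat]

print(len(toy_meta), "->", len(deduped))
print(deduped.dtypes)
```

With this variant, list-valued columns stay lists and missing prices stay missing, which may matter for later null checks.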


Next, we count the cases where the same reviewer has reviewed the same video game more than once:

In [None]:
t = review[['reviewerID', 'asin']]
f"Number of duplicated reviews: {len(t[t.duplicated()]):,}"

'Number of duplicated reviews: 75,954'

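To handle these duplicates, we first sort the review dataset by reviewTime and then keep only the latest (last) review within each group of duplicates. Note that reviewTime is a string of the form "MM D, YYYY", so this sort is lexicographic rather than strictly chronological; sorting by unixReviewTime would give true chronological order.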

In [None]:
review = review.sort_values(by=['reviewTime'])
review = review.loc[review.astype(str).drop_duplicates(subset=['reviewerID', 'asin'], keep='last').index].astype(str).reset_index()
review.head()

Unnamed: 0,index,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,22370,5.0,False,"01 1, 2000",AYT4C9HQ5EEJ3,B00002CF9G,William G. Leaming,I've played Caesar for sometime so was a litle...,Highly Addictive! A Must Have,946684800,7,,
1,25386,5.0,False,"01 1, 2000",A2UNE0FPB7UPJ,B00002NDRR,Andrew,I just recieved my copy of FS2000 an I was ama...,AMAZING!,946684800,5,,
2,16332,5.0,False,"01 1, 2000",A26IQJUNT6OR80,B00001LDCK,Jamie S. Anderson,The graphics in this game are absolutely incre...,Intense!,946684800,8,{'Edition:': ' Standard'},
3,23601,5.0,False,"01 1, 2000",A261TLAGXR52NH,B00002CF8U,THOR (Global Gamer Reviewer/Previewer),GTA2 is set in a futuristic city where you try...,Just read it!,946684800,2,{'Format:': ' Video Game'},
4,16466,5.0,False,"01 1, 2000",AMENNPIINM03J,B00000K4MC,John,It's so realalistic! It's practice to be respo...,This game is amazing!,946684800,7,{'Platform:': ' PC'},


In [None]:
f"After dropping duplicates, number of reviews reduced from 2,565,349 to {len(review):,}, a reduction of {2565349-len(review):,} reviews"


'After dropping duplicates, number of reviews reduced from 2,565,349 to 2,489,395, a reduction of 75,954 reviews'
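
Which review counts as the "last" one depends on the sort key. The sketch below, on made-up toy data, compares keeping the last duplicate after sorting on the `reviewTime` string with keeping it after sorting on the integer `unixReviewTime`. The number of rows removed is the same either way; only the surviving row differs.

```python
import pandas as pd

# Toy reviews (made-up values): reviewer R1 reviews product P1 twice.
toy_reviews = pd.DataFrame({
    "reviewerID": ["R1", "R1", "R2"],
    "asin": ["P1", "P1", "P1"],
    "reviewTime": ["12 25, 2013", "01 5, 2014", "03 1, 2014"],
    "unixReviewTime": [1387929600, 1388880000, 1393632000],
    "overall": [2.0, 5.0, 4.0],
})

# String sort: "01 5, 2014" sorts before "12 25, 2013", so the 2013 review ends up last.
by_string = toy_reviews.sort_values("reviewTime").drop_duplicates(
    subset=["reviewerID", "asin"], keep="last")

# Integer sort: rows are in chronological order, so the 2014 review ends up last.
by_unix = toy_reviews.sort_values("unixReviewTime").drop_duplicates(
    subset=["reviewerID", "asin"], keep="last")

kept_by_string = by_string.loc[by_string["reviewerID"] == "R1", "overall"].item()
kept_by_unix = by_unix.loc[by_unix["reviewerID"] == "R1", "overall"].item()
print(kept_by_string, kept_by_unix)
```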

The review and meta datasets are related to each other through 'asin', the product ID, so we first check that this key has no missing values in either dataset.

In [None]:
f"Number of rows in review dataset with missing values in asin: {review['asin'].isna().sum()}"

'Number of rows in review dataset with missing values in asin: 0'

In [None]:
f"Number of rows in meta dataset with missing values in asin: {meta['asin'].isna().sum()}"

'Number of rows in meta dataset with missing values in asin: 0'

## 5.2 Left Join

To make sense of the data at hand and to draw insights that will help us later in model building, we merge the review dataset with the meta dataset using a **LEFT JOIN** on 'asin', the product ID. The result contains all records from the review dataset together with the matching records from the meta dataset.

**Note:** This discards product IDs (and their metadata) that are present in the meta dataset but absent from the review dataset. These products have no reviews in the review dataset, so discarding them is reasonable.

In [None]:
df = review.merge(meta, on='asin', how='left')
df.head()

Unnamed: 0,index,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,category,tech1,description,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,price,imageURL,imageURLHighRes,details
0,22370,5.0,False,"01 1, 2000",AYT4C9HQ5EEJ3,B00002CF9G,William G. Leaming,I've played Caesar for sometime so was a litle...,Highly Addictive! A Must Have,946684800,7,,,"['Video Games', 'PC', 'Games', '</span></span>...",,['Pharaoh is a strategic city-building game se...,Pharaoh - PC,"['B00004TFLJ', 'B00004TJ2N']",,by\n \n Vivendi Universal,[],"['>#41,983 in Video Games (See Top 100 in Vide...","['B00006FXDV', 'B000C05XRI', 'B00004TFLJ', 'B0...",Video Games,,,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}
1,25386,5.0,False,"01 1, 2000",A2UNE0FPB7UPJ,B00002NDRR,Andrew,I just recieved my copy of FS2000 an I was ama...,AMAZING!,946684800,5,,,"['Video Games', 'PC', 'Games', '</span></span>...",,['Microsoft continues its 17-year tradition of...,Microsoft Flight Simulator 2000 Professional - PC,[],,by\n \n Microsoft,[],"['>#42,893 in Video Games (See Top 100 in Vide...","['B000096L71', 'B00002NDRL', 'B001DPZE84', 'B0...",Video Games,,\n\t\t ...,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}
2,16332,5.0,False,"01 1, 2000",A26IQJUNT6OR80,B00001LDCK,Jamie S. Anderson,The graphics in this game are absolutely incre...,Intense!,946684800,8,{'Edition:': ' Standard'},,"['Video Games', 'PC', 'Games', '</span></span>...",,['<i>Homeworld</i> is the next evolution of re...,Homeworld - PC,['B00004T77G'],,by\n \n Vivendi Universal,[],"['>#39,346 in Video Games (See Top 100 in Vide...","['B00K6ZUOQE', 'B000063EKR', 'B000QIBWDA', 'B0...",Video Games,,\n\t\t ...,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}
3,23601,5.0,False,"01 1, 2000",A261TLAGXR52NH,B00002CF8U,THOR (Global Gamer Reviewer/Previewer),GTA2 is set in a futuristic city where you try...,Just read it!,946684800,2,{'Format:': ' Video Game'},,"['Video Games', 'PC', 'Games', '</span></span>...",,"[""The sequel to the ever-popular car jacking g...",Grand Theft Auto 2 - PC,[],,by\n \n Rockstar Games,[],"['>#49,434 in Video Games (See Top 100 in Vide...",['B00001ZUL7'],Video Games,,\n\t\t ...,[],[],{}
4,16466,5.0,False,"01 1, 2000",AMENNPIINM03J,B00000K4MC,John,It's so realalistic! It's practice to be respo...,This game is amazing!,946684800,7,{'Platform:': ' PC'},,"['Video Games', 'PC', 'Games', '</span></span>...",,['Experience the challenges and pulse-pounding...,Roller Coaster Tycoon - PC,"['B0000695GX', 'B00008K2Y6', 'B01M5BXF54', 'B0...",,by\n \n Atari,"['The rest is up to you', 'Construct, demolish...","['>#13,429 in Video Games (See Top 100 in Vide...","['B00006471Z', 'B0000695GX', 'B01M5BXF54', 'B0...",Video Games,,"\n\t\t\t\t\t\t\t\t\t\t\t\t<span class=""vertica...",['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}


As a sanity check, we can see that our merged dataset has no duplicate rows:

In [None]:
f"Number of duplicate rows in merged data frame: {len(df[df.duplicated()])}"

'Number of duplicate rows in merged data frame: 0'

Furthermore,

In [None]:
print('Number of rows in review dataset equals number of rows in merged dataframe? {}'.format(len(df) == len(review)))

Number of rows in review dataset equals number of rows in merged dataframe? True
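
The row counts match because each product ID appears at most once in the de-duplicated meta dataset, so a left join cannot multiply review rows. As a hedged sketch, pandas can enforce this assumption at merge time through the `validate` argument; the toy frames below are made up for illustration.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "reviewerID": ["R1", "R2", "R3"],
    "asin": ["P1", "P1", "P9"],
})
toy_meta = pd.DataFrame({"asin": ["P1", "P2"], "title": ["Game One", "Game Two"]})

# validate="many_to_one" checks that 'asin' is unique on the right-hand side.
merged = toy_reviews.merge(toy_meta, on="asin", how="left", validate="many_to_one")

# With a repeated product ID on the right, the same call raises MergeError.
dup_meta = pd.concat([toy_meta, toy_meta.iloc[[0]]], ignore_index=True)
try:
    toy_reviews.merge(dup_meta, on="asin", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```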


## 5.3 Handling Inconsistency

However, our merged data frame does contain some rows with a missing value in 'title'. This needs an explanation, since the meta dataset itself reports no missing values for 'title':


In [None]:
f"Number of rows in meta dataset with missing values in title: {meta['title'].isna().sum()}"

'Number of rows in meta dataset with missing values in title: 0'

In [None]:
f"Number of rows in merged dataframe with missing values in title: {df['title'].isna().sum()}"

'Number of rows in merged dataframe with missing values in title: 2224'

These rows can be interpreted as follows:

>Some product IDs in the review dataset are not present in the meta dataset. The rows with a missing 'title' in the merged data frame arise precisely from such product IDs.

Since our meta dataset does not contain any information for these products (not even their title), it would be reasonable to drop these entries.
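
The interpretation above can be checked directly. The sketch below, on made-up toy frames, uses the `indicator` option of `merge` to label rows that found no match, and compares that with a plain membership test on the product IDs.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "reviewerID": ["R1", "R2", "R3"],
    "asin": ["P1", "P1", "P9"],
})
toy_meta = pd.DataFrame({"asin": ["P1", "P2"], "title": ["Game One", "Game Two"]})

# indicator=True adds a '_merge' column; unmatched review rows are 'left_only'.
checked = toy_reviews.merge(toy_meta, on="asin", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# The same rows found without merging: review product IDs absent from meta.
absent_from_meta = ~toy_reviews["asin"].isin(toy_meta["asin"])

print(len(unmatched), int(absent_from_meta.sum()), int(checked["title"].isna().sum()))
```

If the three counts agree on the real data, every missing title is explained by a product ID that has no metadata.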

In [None]:
df = df.dropna(subset=['title'], how='any', axis=0)
df.head()

Unnamed: 0,index,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,category,tech1,description,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,price,imageURL,imageURLHighRes,details
0,22370,5.0,False,"01 1, 2000",AYT4C9HQ5EEJ3,B00002CF9G,William G. Leaming,I've played Caesar for sometime so was a litle...,Highly Addictive! A Must Have,946684800,7,,,"['Video Games', 'PC', 'Games', '</span></span>...",,['Pharaoh is a strategic city-building game se...,Pharaoh - PC,"['B00004TFLJ', 'B00004TJ2N']",,by\n \n Vivendi Universal,[],"['>#41,983 in Video Games (See Top 100 in Vide...","['B00006FXDV', 'B000C05XRI', 'B00004TFLJ', 'B0...",Video Games,,,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}
1,25386,5.0,False,"01 1, 2000",A2UNE0FPB7UPJ,B00002NDRR,Andrew,I just recieved my copy of FS2000 an I was ama...,AMAZING!,946684800,5,,,"['Video Games', 'PC', 'Games', '</span></span>...",,['Microsoft continues its 17-year tradition of...,Microsoft Flight Simulator 2000 Professional - PC,[],,by\n \n Microsoft,[],"['>#42,893 in Video Games (See Top 100 in Vide...","['B000096L71', 'B00002NDRL', 'B001DPZE84', 'B0...",Video Games,,\n\t\t ...,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}
2,16332,5.0,False,"01 1, 2000",A26IQJUNT6OR80,B00001LDCK,Jamie S. Anderson,The graphics in this game are absolutely incre...,Intense!,946684800,8,{'Edition:': ' Standard'},,"['Video Games', 'PC', 'Games', '</span></span>...",,['<i>Homeworld</i> is the next evolution of re...,Homeworld - PC,['B00004T77G'],,by\n \n Vivendi Universal,[],"['>#39,346 in Video Games (See Top 100 in Vide...","['B00K6ZUOQE', 'B000063EKR', 'B000QIBWDA', 'B0...",Video Games,,\n\t\t ...,['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}
3,23601,5.0,False,"01 1, 2000",A261TLAGXR52NH,B00002CF8U,THOR (Global Gamer Reviewer/Previewer),GTA2 is set in a futuristic city where you try...,Just read it!,946684800,2,{'Format:': ' Video Game'},,"['Video Games', 'PC', 'Games', '</span></span>...",,"[""The sequel to the ever-popular car jacking g...",Grand Theft Auto 2 - PC,[],,by\n \n Rockstar Games,[],"['>#49,434 in Video Games (See Top 100 in Vide...",['B00001ZUL7'],Video Games,,\n\t\t ...,[],[],{}
4,16466,5.0,False,"01 1, 2000",AMENNPIINM03J,B00000K4MC,John,It's so realalistic! It's practice to be respo...,This game is amazing!,946684800,7,{'Platform:': ' PC'},,"['Video Games', 'PC', 'Games', '</span></span>...",,['Experience the challenges and pulse-pounding...,Roller Coaster Tycoon - PC,"['B0000695GX', 'B00008K2Y6', 'B01M5BXF54', 'B0...",,by\n \n Atari,"['The rest is up to you', 'Construct, demolish...","['>#13,429 in Video Games (See Top 100 in Vide...","['B00006471Z', 'B0000695GX', 'B01M5BXF54', 'B0...",Video Games,,"\n\t\t\t\t\t\t\t\t\t\t\t\t<span class=""vertica...",['https://images-na.ssl-images-amazon.com/imag...,['https://images-na.ssl-images-amazon.com/imag...,{}


Now, after handling duplicates and inconsistent rows, our data frame is reasonably clean.

In [None]:
f"Number of reviews after handling duplicates: {len(df):,}, an overall reduction of {2565349-len(df):,}"

'Number of reviews after handling duplicates: 2,487,171, an overall reduction of 78,178'

## 5.4 10-core Subset

However, our data frame still contains video games that have very few reviews, as well as reviewers who have reviewed very few games.


In [None]:
df['reviewerID'].value_counts()

A3V6Z4RCDGRC44    835
AJKWF4W7QD4NS     778
A3W4D8XOGLWUN5    705
A2TCG2HV1VJP6V    627
A2QHS1ZCIQOL7E    466
                 ... 
A1KD0XG1A7LWL1      1
A2VTJMGN21OS3O      1
A2DXIA72UUP1UK      1
A14U4442HIO6HH      1
A1NDVMP5L1464T      1
Name: reviewerID, Length: 1539732, dtype: int64

In [None]:
df['asin'].value_counts()

B00HTK1NCS    6462
B004RMK57U    5069
B00KKAQYXM    4359
B00JJNQG98    3962
B003ZSP0WW    3960
              ... 
B016PG5LGK       1
B000EFVGHC       1
B00BKC9PZS       1
B00U0A8QBK       1
B005GT2AX0       1
Name: asin, Length: 71909, dtype: int64

To limit the sparsity we might face later on and to stay within our computational limits, we set a threshold on the minimum number of reviews per video game and on the minimum number of reviews per reviewer in our data frame.

To accomplish this, we take the subset of the data frame in which each video game has at least 10 reviews and each reviewer has provided at least 10 reviews. The resulting subset is called the **10-core** dataset, as defined by the author of the Amazon Review Data repository.

In [None]:
# build 10-core subset: alternately drop products and reviewers with fewer
# than 10 reviews until every product and every reviewer has at least 10
while True:
  productCounts = df['asin'].value_counts()
  reviewerCounts = df['reviewerID'].value_counts()

  if productCounts.min() < 10:
    # products with fewer than 10 remaining reviews
    leastReviewedProducts = productCounts[productCounts < 10].index
    df = df[~df['asin'].isin(leastReviewedProducts)]
    print('Removed products\t\t\t Remaining Number of Reviews: {:,}'.format(len(df)))

  elif reviewerCounts.min() < 10:
    # reviewers with fewer than 10 remaining reviews
    leastReviewsBy = reviewerCounts[reviewerCounts < 10].index
    df = df[~df['reviewerID'].isin(leastReviewsBy)]
    print('Removed reviewers\t\t\t Remaining Number of Reviews: {:,}'.format(len(df)))

  else:
    print('\n\n----')
    print('Obtained 10-core subset')
    break

Removed products			 Remaining Number of Reviews: 2,348,836
Removed reviewers			 Remaining Number of Reviews: 230,116
Removed products			 Remaining Number of Reviews: 172,402
Removed reviewers			 Remaining Number of Reviews: 136,480
Removed products			 Remaining Number of Reviews: 125,639
Removed reviewers			 Remaining Number of Reviews: 116,375
Removed products			 Remaining Number of Reviews: 112,551
Removed reviewers			 Remaining Number of Reviews: 108,935
Removed products			 Remaining Number of Reviews: 107,575
Removed reviewers			 Remaining Number of Reviews: 106,210
Removed products			 Remaining Number of Reviews: 105,462
Removed reviewers			 Remaining Number of Reviews: 104,700
Removed products			 Remaining Number of Reviews: 104,460
Removed reviewers			 Remaining Number of Reviews: 104,166
Removed products			 Remaining Number of Reviews: 103,981
Removed reviewers			 Remaining Number of Reviews: 103,810
Removed products			 Remaining Number of Reviews: 103,721
Removed reviewers			 

In [None]:
print("No. of rows in review dataset: {:,}".format(len(review)))
print("No. of rows in 10-core subset: {:,}".format(len(df)))


No. of rows in review dataset: 2,489,395
No. of rows in 10-core subset: 103,362
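
The loop above alternates between dropping products and dropping reviewers until neither step removes anything. A compact sketch of the same idea as a reusable function is given below. The function name `k_core` and its parameters are hypothetical, and unlike the cell above it applies both filters in each round; since a k-core does not depend on the order of removals, it should arrive at the same subset, though this has only been exercised on the toy data shown.

```python
import pandas as pd


def k_core(frame, k, user_col="reviewerID", item_col="asin"):
    """Repeatedly drop rows whose item or user appears in fewer than k rows."""
    while True:
        # Per-row count of how many rows share the same item / the same user.
        item_sizes = frame.groupby(item_col)[item_col].transform("size")
        user_sizes = frame.groupby(user_col)[user_col].transform("size")
        keep = (item_sizes >= k) & (user_sizes >= k)
        if keep.all():
            return frame
        frame = frame[keep]


# Toy data (made-up IDs): P3 has one review, and removing it leaves R3 with one.
toy = pd.DataFrame({
    "reviewerID": ["R1", "R1", "R2", "R2", "R3", "R3"],
    "asin": ["P1", "P2", "P1", "P2", "P1", "P3"],
})
core = k_core(toy, k=2)
print(len(toy), "->", len(core))
```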


Sanity check:

In [None]:
df['asin'].value_counts()

B00JK00S0S    255
B00GODZYNA    243
B00BGA9YZK    227
B00BGA9Y3W    226
B00KVR4HEC    221
             ... 
B000069675     10
B01H2DKHSM     10
B00009VE6E     10
B000006OVJ     10
B000GHLBUA     10
Name: asin, Length: 3784, dtype: int64

In [None]:
df['reviewerID'].value_counts()

A3V6Z4RCDGRC44    417
AJKWF4W7QD4NS     355
A29BQ6B90Y1R5F    307
A2QHS1ZCIQOL7E    271
A119Q9NFGVOEJZ    193
                 ... 
A3UGHKTPO1C5IZ     10
A8GWAPQEW7VYU      10
ATIVK9XUANIUE      10
A10IJF7UD9I86G     10
AT2L9P49VHVY5      10
Name: reviewerID, Length: 5921, dtype: int64

In [None]:
len(df[df.duplicated()])

0

We are now ready to carry out data exploration on this dataset.

# Cleaned Dataset for Data Visualization

In [None]:
df.to_csv(path_or_buf='/content/drive/My Drive/Applied_ML/data/10_core_Video_Games.csv', index=False)

Data visualization continued in 'video_games_data_visualization.ipynb'
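
One caveat when reading the saved CSV back: if a column of product IDs happens to contain only digits, `pd.read_csv` infers it as an integer column and leading zeros are lost. A minimal sketch of guarding against this with an explicit `dtype` follows; the two IDs are illustrative values, and whether the 10-core file is affected has not been checked here.

```python
import io

import pandas as pd

# Illustrative digit-only IDs with leading zeros.
toy = pd.DataFrame({"asin": ["0439381673", "0078764343"], "overall": [1.0, 3.0]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default type inference turns the IDs into integers.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the column as text keeps the IDs exactly as written.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"asin": str})

print(inferred["asin"].tolist())
print(as_text["asin"].tolist())
```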

# References

1. [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html)
>Justifying recommendations using distantly-labeled reviews and fine-grained aspects
>
>Jianmo Ni, Jiacheng Li, Julian McAuley
>
>Empirical Methods in Natural Language Processing (EMNLP), 2019



2. Download 10_core_Video_Games [CSV file](https://drive.google.com/file/d/1Mg4PivasbqapPq5ov5cZvZa69Xpt5jXU/view?usp=sharing)
