## Data Description
- In this notebook I will use **Amazon product data**.
- This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

- This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

## Citation

- **Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering**
  - R. He, J. McAuley
  - WWW, 2016
  - [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf)

- **Image-based recommendations on styles and substitutes**
  - J. McAuley, C. Targett, J. Shi, A. van den Hengel
  - SIGIR, 2015
  - [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf)

In [3]:
# fn = 'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games.json.gz'

# import pandas as pd
# import os
# print(os.path.isfile(fn))
# df = pd.read_json(fn, lines=True, compression='gzip')
# df.head()

In [4]:
%sh curl -O 'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games.json.gz'

# spark = SparkSession.builder.appName('ops').getOrCreate()
# df = spark.read.json(fn)

In [5]:
%fs ls "file:/databricks/driver"

path,name,size
file:/databricks/driver/conf/,conf/,4096
file:/databricks/driver/derby.log,derby.log,724
file:/databricks/driver/reviews_Video_Games.json.gz,reviews_Video_Games.json.gz,386419180
file:/databricks/driver/logs/,logs/,4096
file:/databricks/driver/ganglia/,ganglia/,4096
file:/databricks/driver/eventlogs/,eventlogs/,4096


In [6]:
# define path to file
path = 'file:/databricks/driver/reviews_Video_Games.json.gz'

# load the gzipped JSON-lines file using sqlContext and rename "overall" to "rating"
# (the "header" and "inferSchema" options only apply to CSV, so they are omitted here)
df = (sqlContext.read
      .format("json")
      .load(path)
      .withColumnRenamed("overall", "rating"))

# display in table format
display(df)

asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0078764343,"List(1, 1)",5.0,"I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.","07 7, 2013",AB9S9279OZ3QO,Alan,Good game and Beta access!!,1373155200
0078764343,"List(0, 0)",5.0,I want to start off by saying I have never played the Call of Duty games. This is only the second first person shooter game that I have own. I think it is a lot of fun. Has good graphics and nice story line. It does take some skill to get through the levels. I think all players can enjoy this game. There are three levels to choose from based on your skill level. If your looking for first person shooter game that has current military type play than this is a good buy.,"08 24, 2013",A24SSUT5CSW8BH,Kindle Customer,Love the game,1377302400
0078764343,"List(0, 0)",4.0,this will be my second medal of honor I love how the incorporate real life military stories in the game great,"07 4, 2013",AK3V0HEBJMQ7J,"Miss Kris ""Krissy""",MOH nice,1372896000
043933702X,"List(0, 0)",5.0,"great game when it first came out, and still a great game","07 10, 2014",A10BECPH7W8HM7,"GMC ""Old Time Modeler""",Five Stars,1404950400
043933702X,"List(0, 0)",5.0,this is the first need for speed I bought years and years ago. I lost it so I bought this for a trip down memory lane. Pretty tame by todays games. It brought back memories of fun times.,"12 4, 2013",A2PRV9OULX1TWP,grimi,memory lane,1386115200
043933702X,"List(0, 0)",1.0,Doesn't load after installation. Tried it on 3 different computers. Waited a while to receive it... was kind of excited,"04 17, 2013",AE7GUHCDQQ4UI,Imperfection91,>.<,1366156800
043933702X,"List(0, 0)",5.0,i love this game i have had the best fun with this game if it does not work try it on a laptopif u like racing game you might like this.It is kinda old so be expecting that but other wise other than graphics it is grait,"07 19, 2013",A48ABFDDRMKI8,me,Best racing game,1374192000
0439339960,"List(0, 0)",3.0,"Gift for my granddaughter. Likes the game a lot, but prefers to just listen to the music & sounds. We use it at my house on a separate computer (just for Lana). Money well spent--provides hours of amusement & education.","11 1, 2010",A26B0P6K95SIKW,Linda G. Baudoin,Lana's opinion,1288569600
0439339987,"List(0, 0)",5.0,"I have reviewed this software many times and I still think it isawesome.I continue to order more and more.The kids at my school love, love it.","04 25, 2013",AZ3UWOC8QSO6C,"Amazon Customer ""shoppnmama""",excellent game,1366848000
0439342260,"List(2, 2)",4.0,"I am an Ice Cream Truck Vendor (I lease out 20+ Trucks), and an the owner of a small business that distributes Good Humor Ice Cream in Dayton, Ohio. I purchased this program (as I do most Ice Cream related stuff) just to see what it was all about. I was pleasantly surprised... as there are exercises that allow you to run your own business within the game. You have to restock, and consider expenses... and you either win or you can loose the day, in business, depending on your decisions. Great tool (game) to teach your kids business and some responsibility. It was also fun!","12 19, 2012",A182S3ANC0W7DL,James,Teach Business to Kids & Adults,1355875200


I will drop three columns, *reviewTime*, *unixReviewTime*, and *helpful*, because we won't need them in our analysis.

In [8]:
df_1 = df.drop('reviewTime', 'unixReviewTime', 'helpful')
df_1.head()

In [9]:
df_1.printSchema()

* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewText - text of the review
* rating - rating of the product
* summary - summary of the review

In [11]:
# caching 
newdf = df_1.cache()

As we can see, both reviewerID and asin are strings. The ALS algorithm needs numeric user and item IDs, so we have to convert them first.
I will use the StringIndexer class to map each distinct string to a numeric index.
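
To make the indexing step concrete, here is a minimal pure-Python sketch of the idea behind StringIndexer: each distinct string gets a numeric index, and more frequent strings get smaller indices. The helper name `fit_string_index` and the toy IDs are hypothetical, and breaking ties alphabetically is a simplifying assumption. It illustrates the concept and is not Spark's implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first (hypothetical helper)."""
    counts = Counter(values)
    # sort by descending frequency, then alphabetically to break ties (an assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# toy product IDs, not taken from the dataset
asins = ["B002", "B001", "B002", "B003", "B002", "B001"]
asin_to_index = fit_string_index(asins)
asin_index = [asin_to_index[a] for a in asins]

print(asin_to_index)
print(asin_index)
```
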

In [13]:
from pyspark.ml.feature import StringIndexer

# map each distinct asin (product ID) to a numeric index
indexer_1 = StringIndexer(inputCol="asin", outputCol="asin_index").fit(newdf)
df_ind = indexer_1.transform(newdf)

# map each distinct reviewerID to a numeric index
indexer_2 = StringIndexer(inputCol="reviewerID", outputCol="reviewerID_index").fit(df_ind)
df_ind = indexer_2.transform(df_ind)
df_ind.show()


## Data exploration

In [15]:
display(df_ind.head(10))

asin,rating,reviewText,reviewerID,reviewerName,summary,asin_index,reviewerID_index
0078764343,5.0,"I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.",AB9S9279OZ3QO,Alan,Good game and Beta access!!,30148.0,419986.0
0078764343,5.0,I want to start off by saying I have never played the Call of Duty games. This is only the second first person shooter game that I have own. I think it is a lot of fun. Has good graphics and nice story line. It does take some skill to get through the levels. I think all players can enjoy this game. There are three levels to choose from based on your skill level. If your looking for first person shooter game that has current military type play than this is a good buy.,A24SSUT5CSW8BH,Kindle Customer,Love the game,30148.0,8343.0
0078764343,4.0,this will be my second medal of honor I love how the incorporate real life military stories in the game great,AK3V0HEBJMQ7J,"Miss Kris ""Krissy""",MOH nice,30148.0,6306.0
043933702X,5.0,"great game when it first came out, and still a great game",A10BECPH7W8HM7,"GMC ""Old Time Modeler""",Five Stars,26586.0,802086.0
043933702X,5.0,this is the first need for speed I bought years and years ago. I lost it so I bought this for a trip down memory lane. Pretty tame by todays games. It brought back memories of fun times.,A2PRV9OULX1TWP,grimi,memory lane,26586.0,394255.0
043933702X,1.0,Doesn't load after installation. Tried it on 3 different computers. Waited a while to receive it... was kind of excited,AE7GUHCDQQ4UI,Imperfection91,>.<,26586.0,206347.0
043933702X,5.0,i love this game i have had the best fun with this game if it does not work try it on a laptopif u like racing game you might like this.It is kinda old so be expecting that but other wise other than graphics it is grait,A48ABFDDRMKI8,me,Best racing game,26586.0,34891.0
0439339960,3.0,"Gift for my granddaughter. Likes the game a lot, but prefers to just listen to the music & sounds. We use it at my house on a separate computer (just for Lana). Money well spent--provides hours of amusement & education.",A26B0P6K95SIKW,Linda G. Baudoin,Lana's opinion,48587.0,733307.0
0439339987,5.0,"I have reviewed this software many times and I still think it isawesome.I continue to order more and more.The kids at my school love, love it.",AZ3UWOC8QSO6C,"Amazon Customer ""shoppnmama""",excellent game,43319.0,43328.0
0439342260,4.0,"I am an Ice Cream Truck Vendor (I lease out 20+ Trucks), and an the owner of a small business that distributes Good Humor Ice Cream in Dayton, Ohio. I purchased this program (as I do most Ice Cream related stuff) just to see what it was all about. I was pleasantly surprised... as there are exercises that allow you to run your own business within the game. You have to restock, and consider expenses... and you either win or you can loose the day, in business, depending on your decisions. Great tool (game) to teach your kids business and some responsibility. It was also fun!",A182S3ANC0W7DL,James,Teach Business to Kids & Adults,40207.0,298989.0


A fundamental task in many statistical analyses is to characterize the location and variability of a data set. A further characterization of the data includes skewness and kurtosis for the target variable **rating**. 
  - **Skewness** is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
  - **Kurtosis** is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.
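
As a rough illustration of the two definitions above, the following pure-Python sketch computes skewness and excess kurtosis from central moments on toy data. It uses the population form (dividing by n). I believe Spark's `skewness` and `kurtosis` functions follow the same convention, with kurtosis reported as excess kurtosis, but treat that as an assumption and check it against the Spark documentation. The function names and the toy lists are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment of a list of numbers (population form, divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Population kurtosis minus 3, so that a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # toy data, evenly spread around the mean
left_skewed = [1, 4, 5, 5, 5, 5]   # toy data with a long tail toward low values

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(left_skewed), excess_kurtosis(left_skewed))
```

The symmetric list has zero skewness and a negative excess kurtosis (light tails), while the second list has negative skewness because its tail extends toward the low values.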

In [17]:
from pyspark.sql.functions import col, skewness, kurtosis

df_ind.select(skewness('rating'), kurtosis('rating')).show()

As we can see, the **rating** target variable has negative skewness: the distribution has a longer tail toward low ratings, so the **mean** is pulled below the **median** and the **mode**. Its kurtosis value is less than 3. Note, however, that Spark's `kurtosis` function reports excess kurtosis (kurtosis minus 3), so the reference value for a normal (mesokurtic) distribution is 0 rather than 3: a value below 0 means the data are lighter-tailed than a normal distribution, that is, extreme values occur less often, and a value above 0 means heavier tails.

In [19]:
print((df_ind.count(), len(df_ind.columns)))

- We have **1,324,753** observations and 8 columns (the six remaining original columns plus the two index columns).

## Checking missing data

In [22]:
from pyspark.sql import functions as f

# count the null values in each column (f.sum avoids shadowing Python's built-in sum)
df_ind.select(*(f.sum(f.col(c).isNull().cast("int")).alias(c) for c in df_ind.columns)).show()


The only column with missing data is reviewerName, which we can drop because we won't use it anywhere.

In [24]:
from pyspark.sql import functions as f 

review_dist_df = df_ind.groupBy('rating').agg(f.count(df_ind.rating).alias('Count'))
display(review_dist_df)


rating,Count
1.0,152840
4.0,260260
3.0,124370
2.0,77513
5.0,709770


In [25]:
# count the reviews and average the rating for each (product, review summary) pair
product_ids_with_avg_ratings_df = df_ind.groupBy('asin_index', 'summary').agg(f.count('rating').alias('count'), f.avg('rating').alias('average'))

print('product_ids_with_avg_ratings_df: ')
display(product_ids_with_avg_ratings_df.head(10))

asin_index,summary,count,average
3313.0,The Beer Baron,1,4.0
38721.0,"I liked it, but ....",1,4.0
4206.0,Good buy,1,5.0
4206.0,quick and easy,1,5.0
3355.0,Horrible Product! DO NOT BUY!!!!!,1,1.0
31737.0,Good for the price,1,3.0
4136.0,Does its job its the original,1,5.0
4136.0,GBM WC,1,4.0
2847.0,Very Effective and worth it!,1,4.0
5414.0,MegaMan - mega fun,1,5.0


In [26]:
# keep the groups with at least 50 reviews, sorted by ascending average rating
products_with_more_50_review = product_ids_with_avg_ratings_df.filter(f.col('count')>=50)\
  .sort('average')

print('products_with_more_50_review: ')
display(products_with_more_50_review.head(10))

asin_index,summary,count,average
3.0,Bingo Blitz,188,3.7180851063829783
3.0,bingo blitz,139,4.158273381294964
3.0,fun game,53,4.245283018867925
0.0,Despicable Me: Minion Rush,74,4.3108108108108105
2.0,fun game,63,4.333333333333333
0.0,good game,71,4.380281690140845
0.0,Minion Rush,88,4.386363636363637
2.0,good game,92,4.423913043478261
2.0,fun,93,4.505376344086022
6.0,fun,87,4.505747126436781


## Split the data

## Creating a Training Set
We break df_ind into three parts:

* A training set (DataFrame), which we will use to train models
* A validation set (DataFrame), which we will use to choose the best model
* A test set (DataFrame), which we will use to estimate how well the chosen model performs

To randomly split the dataset into multiple groups, we can use the PySpark randomSplit() transformation. randomSplit() takes a list of weights and a seed and returns multiple DataFrames.
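
The weighted random split described above can be sketched in plain Python as follows. Each row is assigned independently to one split according to the weights, so the resulting sizes are only approximately 60/20/20. The function `random_split` and the integer rows are hypothetical stand-ins; this is a conceptual sketch, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # cumulative upper bounds, e.g. roughly [0.6, 0.8, 1.0] for weights [0.6, 0.2, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # the last split also catches any draw left over from rounding error
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(10000))  # stand-in for the rows of a DataFrame
train, validation, test = random_split(rows, [0.6, 0.2, 0.2], seed=42)
print(len(train), len(validation), len(test))
```

Fixing the seed makes the split reproducible, and because the sizes are approximate it is worth printing the counts of each part after splitting.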

In [29]:
# use 60% of our data for training, 20% for validation, and leave 20% for testing

seed = 42
(split_60_df, split_a_20_df, split_b_20_df) = df_ind.randomSplit([0.6, 0.2, 0.2], seed = seed)

In [30]:
# cache the resulting datasets
training_df = split_60_df.cache()
validation_df = split_a_20_df.cache()
test_df = split_b_20_df.cache()

print('Training: {0}, validation: {1}, test: {2}\n'.format(
  training_df.count(), validation_df.count(), test_df.count())
)
display(training_df.head(3))
display(validation_df.head(3))
display(test_df.head(3))

asin,rating,reviewText,reviewerID,reviewerName,summary,asin_index,reviewerID_index
0078764343,5.0,"I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.",AB9S9279OZ3QO,Alan,Good game and Beta access!!,30148.0,419986.0
0078764343,5.0,I want to start off by saying I have never played the Call of Duty games. This is only the second first person shooter game that I have own. I think it is a lot of fun. Has good graphics and nice story line. It does take some skill to get through the levels. I think all players can enjoy this game. There are three levels to choose from based on your skill level. If your looking for first person shooter game that has current military type play than this is a good buy.,A24SSUT5CSW8BH,Kindle Customer,Love the game,30148.0,8343.0
043933702X,1.0,Doesn't load after installation. Tried it on 3 different computers. Waited a while to receive it... was kind of excited,AE7GUHCDQQ4UI,Imperfection91,>.<,26586.0,206347.0


In [31]:
print((training_df.count(), len(training_df.columns)))
print((validation_df.count(), len(validation_df.columns)))
print((test_df.count(), len(test_df.columns)))


## Prediction algorithms

In [33]:
from pyspark.ml.recommendation import ALS

als = ALS(maxIter=10, regParam=0.01, 
          userCol="reviewerID_index", itemCol="asin_index", ratingCol="rating",
          coldStartStrategy="drop",
          implicitPrefs=False)

In [34]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import isnan

reg_eval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

ranks = [4, 5]
errors = []
models = []
min_error = float('inf')
best_rank = -1

for i, rank in enumerate(ranks):
  # Set the rank for this iteration
  als.setRank(rank)
  # Create the model with these parameters.
  model = als.fit(training_df)
  # Run the model to create a prediction. Predict against the validation_df.
  predict_df = model.transform(validation_df)

  # Remove NaN values from prediction (due to SPARK-14489)
  predicted_ratings_df = predict_df.filter(~isnan(predict_df.prediction))

  # Run the previously created RMSE evaluator, reg_eval, on the predicted_ratings_df DataFrame
  error = reg_eval.evaluate(predicted_ratings_df)
  errors.append(error)
  models.append(model)
  print('For rank %s the RMSE is %s' % (rank, error))
  # Remember the position of the rank with the lowest RMSE so far
  if error < min_error:
    min_error = error
    best_rank = i

als.setRank(ranks[best_rank])
print('The best model was trained with rank %s' % ranks[best_rank])
my_model = models[best_rank]
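
The selection logic in the loop above can be illustrated without Spark. The sketch below computes RMSE over toy (actual, predicted) pairs for each candidate rank, drops NaN predictions first, and keeps the rank with the lowest error. The helper names and the numbers are hypothetical and are not model output. It also shows why the NaN filter matters: a single NaN prediction turns the whole RMSE into NaN.

```python
from math import sqrt, isnan

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

def drop_nan_predictions(pairs):
    """Keep only the pairs whose prediction is a number."""
    return [(a, p) for a, p in pairs if not isnan(p)]

# toy (actual rating, predicted rating) pairs per candidate rank, not real model output
predictions_by_rank = {
    4: [(5.0, 4.0), (3.0, 3.0), (1.0, float("nan"))],
    5: [(5.0, 4.5), (3.0, 3.5), (1.0, float("nan"))],
}

# RMSE per rank after removing NaN predictions, then pick the rank with the lowest error
errors_by_rank = {rank: rmse(drop_nan_predictions(pairs))
                  for rank, pairs in predictions_by_rank.items()}
best = min(errors_by_rank, key=errors_by_rank.get)

print(errors_by_rank, best)
print(rmse(predictions_by_rank[4]))  # NaN, because one prediction is NaN
```
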


In [35]:
import itertools
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
def computeRmse(model, data):
    """
    Compute RMSE (Root Mean Squared Error) of the model's predictions on data.
    """
    predictions = model.transform(data)
    rmse = evaluator.evaluate(predictions)
    print("Root-mean-square error = " + str(rmse))
    return rmse

# train models and evaluate them on the validation set

ranks = [4, 5]
lambdas = [0.05]
numIters = [30]
bestModel = None
bestValidationRmse = float("inf")
bestRank = 0
bestLambda = -1.0
bestNumIter = -1

for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
    # coldStartStrategy="drop" removes NaN predictions for users or items
    # that were not seen during training; they would otherwise make the RMSE NaN
    als = ALS(rank=rank, maxIter=numIter, regParam=lmbda, numUserBlocks=10, numItemBlocks=10, implicitPrefs=False,
              userCol="reviewerID_index", itemCol="asin_index", seed=42, ratingCol="rating", nonnegative=True,
              coldStartStrategy="drop")
    model = als.fit(training_df)

    # evaluate on the validation set; the test set stays untouched for the final estimate
    validationRmse = computeRmse(model, validation_df)
    print("RMSE (validation) = %f for the model trained with " % validationRmse +
          "rank = %d, lambda = %.2f, and numIter = %d." % (rank, lmbda, numIter))
    # keep the model with the lowest validation RMSE so far
    if validationRmse < bestValidationRmse:
        bestModel = model
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lmbda
        bestNumIter = numIter

model = bestModel