# Using PySpark to Create an Amazon Review Recommendation System 



This Jupyter Notebook contains code to create a recommendation system for Amazon user reviews on specific products using PySpark.  It was created as a final project for the class INFO 607: Applied Database Technologies at Drexel University.  The data was downloaded from FIXME: data source.  

Additional documentation on this project can be found at the GitHub repository [here](https://github.com/zachcarlson/ProductRecommender).

## Configuration

We recommend running this notebook in Google Colab using a local runtime and your GPU.  The links below explain how to set up this configuration (see also this [Stack Overflow thread](https://stackoverflow.com/questions/51002045/how-to-make-jupyter-notebook-to-run-on-gpu) on running Jupyter Notebook on a GPU):
- [Local Runtime](https://research.google.com/colaboratory/local-runtimes.html)
- [Utilizing GPU](https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d)

Configure your input directory below:

In [None]:
#INPUT_DIRECTORY = "/content/drive/MyDrive/Grad School/INFO 607/ProductRecommender/data/" #for google mount
INPUT_DIRECTORY = "ProductRecommender/data/" #for jupyter notebook

### Google Colab Hosted Runtime

**NOTE**: Due to the limited resources available for Google Colab's Free Tier, this notebook might not run for you if you are running it in Google Drive using a Hosted Runtime.  We recommend using a Google Colab Local Runtime.  However, if you have Colab Pro/Pro+, this notebook *might* work and you can uncomment the cells below to continue with that particular configuration.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

The cell below may take 1-2 minutes to execute:

In [None]:
# %%capture 
# #prevent large printout with %%capture

# #Download Java
# !apt-get install openjdk-8-jdk-headless -qq > /dev/null

# #Install Apache Spark 3.2.1 with Hadoop 3.2, get zipped folder
# !wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

# #Unzip folder
# !tar xvf spark-3.2.1-bin-hadoop3.2.tgz

# #Install findspark, pyspark 3.2.1
# !pip install -q findspark
# !pip install pyspark==3.2.1

# #Set variables
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "spark-3.2.1-bin-hadoop3.2"

### Google Colab Local Runtime

We recommend using a local Jupyter Notebook because it is much faster for a free user.  However, it requires some additional configuration; follow the tutorial [here](https://changhsinlee.com/install-pyspark-windows-jupyter/).

In [None]:
import findspark
findspark.init()

## Load Packages

In [None]:
import pandas as pd
import pyspark.sql.functions as F

## Data Acquisition, Preprocessing

### Import Data

In [None]:
#create SparkSession and SparkContext objects
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate()


In [None]:
#Import data
file_path = INPUT_DIRECTORY + "ratings_electronics.csv"
ratings = spark.read.csv(file_path, header=False, inferSchema=True)
ratings.show(5)

+--------------+----------+---+----------+
|           _c0|       _c1|_c2|       _c3|
+--------------+----------+---+----------+
| AKM1MP6P0OYPR|0132793040|5.0|1365811200|
|A2CX7LUOHB2NDG|0321732944|5.0|1341100800|
|A2NWSAGRHCP8N5|0439886341|1.0|1367193600|
|A2WNBOD3WNDNKT|0439886341|3.0|1374451200|
|A1GI0U4ZRJA8WN|0439886341|1.0|1334707200|
+--------------+----------+---+----------+
only showing top 5 rows



In [None]:
ratings.count()

7824482

### Pre-processing

#### **Rename columns**

In [None]:
ratings = ratings.withColumnRenamed("_c0", "reviewerID") \
                  .withColumnRenamed("_c1", "productID") \
                  .withColumnRenamed("_c2", "rating") \
                  .withColumnRenamed("_c3", "timestamp")
ratings.show(5)

+--------------+----------+------+----------+
|    reviewerID| productID|rating| timestamp|
+--------------+----------+------+----------+
| AKM1MP6P0OYPR|0132793040|   5.0|1365811200|
|A2CX7LUOHB2NDG|0321732944|   5.0|1341100800|
|A2NWSAGRHCP8N5|0439886341|   1.0|1367193600|
|A2WNBOD3WNDNKT|0439886341|   3.0|1374451200|
|A1GI0U4ZRJA8WN|0439886341|   1.0|1334707200|
+--------------+----------+------+----------+
only showing top 5 rows



#### **Check datatypes**

In [None]:
ratings.printSchema()

root
 |-- reviewerID: string (nullable = true)
 |-- productID: string (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



We need `reviewerID` and `productID` to be integers for the ALS algorithm.  We'll create separate lookup tables for `reviewers` and `products` that pair each original string ID with a new integer ID.  At the end of the pre-processing section, we'll join these tables back onto the ratings.

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

reviewers = ratings.select("reviewerID").distinct().coalesce(1)
reviewers.show(5)

+--------------------+
|          reviewerID|
+--------------------+
|A06983862QXQ79V19...|
|A10123371OF8W0NAB...|
|      A10616E4HZB41X|
|      A10AKE9TAADHVV|
|      A10AZ52KX1UM1N|
+--------------------+
only showing top 5 rows



In [None]:
reviewers = reviewers.withColumn("userID", monotonically_increasing_id()).persist()
reviewers.show(5)

+--------------------+------+
|          reviewerID|userID|
+--------------------+------+
|A06983862QXQ79V19...|     0|
|A10123371OF8W0NAB...|     1|
|      A10616E4HZB41X|     2|
|      A10AKE9TAADHVV|     3|
|      A10AZ52KX1UM1N|     4|
+--------------------+------+
only showing top 5 rows



In [None]:
products = ratings.select("productID").distinct().coalesce(1)
products = products.withColumn("product_ID", monotonically_increasing_id()).persist()
products.show(5)

+----------+----------+
| productID|product_ID|
+----------+----------+
|7793224531|         0|
|9966694242|         1|
|9967222247|         2|
|9985975413|         3|
|9990950369|         4|
+----------+----------+
only showing top 5 rows
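
A note on why `coalesce(1)` is used before `monotonically_increasing_id()`: the generated IDs are guaranteed to be unique and increasing, but not consecutive.  Spark documents the layout as the partition index in the upper 31 bits and the row position within the partition in the lower 33 bits.  The plain-Python sketch below only simulates that layout (the helper `simulated_ids` and the partition sizes are hypothetical, not part of PySpark) to show that a single partition yields 0, 1, 2, ... while a second partition would jump past the 32-bit integer range that ALS expects for its ID columns.

```python
def simulated_ids(partition_sizes):
    """Simulate the documented ID layout: (partition index << 33) + row position."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row_position in range(size):
            ids.append((partition_index << 33) + row_position)
    return ids

single_partition = simulated_ids([5])    # one partition, as after coalesce(1)
two_partitions = simulated_ids([3, 2])   # hypothetical two-partition layout

print(single_partition)  # [0, 1, 2, 3, 4]
print(two_partitions)    # [0, 1, 2, 8589934592, 8589934593]
```

With one partition the IDs are small and consecutive, which is why the lookup tables above start at 0 and count upward.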



#### **No duplicate ratings**

In [None]:
ratings.groupby("reviewerID", "productID").count().select(F.max("count")).show()

+----------+
|max(count)|
+----------+
|         1|
+----------+



Each user has at most one rating per product, so filtering on `timestamp` is not needed.  We will keep the timestamp for EDA purposes and to allow for future filtering if the dataset ever contains multiple ratings by the same user for the same product.
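
If duplicates ever did appear, one reasonable policy would be to keep only the most recent rating for each reviewer/product pair.  The sketch below illustrates that idea in plain Python on a few made-up rows; it is not part of this notebook's pipeline, and the row values are hypothetical.

```python
# Hypothetical rows: (reviewerID, productID, rating, timestamp)
rows = [
    ("A1", "P1", 2.0, 1334707200),
    ("A1", "P1", 5.0, 1374451200),  # later re-rating of the same product
    ("A2", "P1", 4.0, 1341100800),
]

latest = {}
for reviewer, product, rating, ts in rows:
    key = (reviewer, product)
    # Keep the rating with the largest timestamp for each (reviewer, product) pair
    if key not in latest or ts > latest[key][1]:
        latest[key] = (rating, ts)

for key, (rating, ts) in latest.items():
    print(key, rating, ts)
```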

#### **Join tables**

In [None]:
#Join ratings table with new integer IDs for products and reviewers
product_ratings = ratings.join(reviewers, on="reviewerID", how="left")
product_ratings = product_ratings.join(products, on="productID", how="left")

#select just integer IDs, rating and timestamp
product_ratings = product_ratings.select("userID", "product_ID", "rating", "timestamp")
product_ratings.show(5)

+------+----------+------+----------+
|userID|product_ID|rating| timestamp|
+------+----------+------+----------+
| 41406|    363476|   4.0|1403654400|
|277251|    461912|   4.0|1172102400|
| 95851|    184723|   5.0|1376524800|
|340235|    461912|   4.0|1140652800|
|508955|    461912|   3.0|1167782400|
+------+----------+------+----------+
only showing top 5 rows



In [None]:
#rename columns for readability
product_ratings = product_ratings.withColumnRenamed("userID", "reviewerID")
product_ratings = product_ratings.withColumnRenamed("product_ID", "productID")
product_ratings.show(5)

+----------+---------+------+----------+
|reviewerID|productID|rating| timestamp|
+----------+---------+------+----------+
|     41406|   363476|   4.0|1403654400|
|    277251|   461912|   4.0|1172102400|
|     95851|   184723|   5.0|1376524800|
|    340235|   461912|   4.0|1140652800|
|    508955|   461912|   3.0|1167782400|
+----------+---------+------+----------+
only showing top 5 rows



## EDA

Find reviewers with the most ratings:

In [None]:
product_ratings.select("reviewerID", "productID", "rating")\
        .groupby("reviewerID")\
        .count()\
        .sort("count", ascending = False)\
        .show(5)

+----------+-----+
|reviewerID|count|
+----------+-----+
|    948530|  520|
|   3950835|  501|
|   4098154|  498|
|   3173364|  431|
|   4182714|  406|
+----------+-----+
only showing top 5 rows



Find products with the most ratings:

In [None]:
product_ratings.select("reviewerID", "productID", "rating")\
        .groupby("productID")\
        .count()\
        .sort("count", ascending = False)\
        .show(5)

+---------+-----+
|productID|count|
+---------+-----+
|   374843|18244|
|   119358|16454|
|   232798|14172|
|   404930|12285|
|   192505|12226|
+---------+-----+
only showing top 5 rows



Count and average ratings for each product:

In [None]:
avg_ratings = (product_ratings
                .select("productID", "rating")              # Select Columns
                .groupby("productID")                       # Group by productID
                .agg(                           
                     F.count("rating").alias("Count"),      # Count number of ratings
                     F.avg("rating").alias("Average")       # Average ratings for each product
                     )
                .sort("Average", "Count", ascending = [False, False]) # Sort results by average and count
            )
avg_ratings.show(5)

+---------+-----+-------+
|productID|Count|Average|
+---------+-----+-------+
|   372621|   45|    5.0|
|   430427|   41|    5.0|
|    40019|   38|    5.0|
|   123258|   36|    5.0|
|   210718|   36|    5.0|
+---------+-----+-------+
only showing top 5 rows



In [None]:
low_avg_rating = avg_ratings.filter(avg_ratings.Average < 2)
low_avg_rating.show(5)

product_num = avg_ratings.select("productID").distinct().count()
lar_count = low_avg_rating.count()
print(f"Number of distinct products: {product_num :,.2f}")
print(f"Number of products with low average (less than 2): {lar_count :,.2f}")
print(f"% low ratings: {lar_count / product_num * 100 :,.2f}")


+---------+-----+------------------+
|productID|Count|           Average|
+---------+-----+------------------+
|   353808|   75|1.9866666666666666|
|   323153|   74|1.9864864864864864|
|   285525|   56|1.9821428571428572|
|   198144|   55| 1.981818181818182|
|   253006|   48|1.9791666666666667|
+---------+-----+------------------+
only showing top 5 rows

Number of distinct products: 476,002.00
Number of products with low average (less than 2): 33,341.00
% low ratings: 7.00


The percentage of products with a low average rating (less than 2) is small (7%).
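
As a quick sanity check, the percentage can be recomputed directly from the two counts printed above.  This small sketch hard-codes those counts and formats them as whole numbers, since they are integers.

```python
product_num = 476_002   # distinct products, from the output above
lar_count = 33_341      # products with an average rating below 2

share = lar_count / product_num * 100
print(f"Number of distinct products: {product_num:,}")
print(f"Products with average below 2: {lar_count:,}")
print(f"% low ratings: {share:.2f}")
```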

## Recommendation System

Now we'll build our collaborative filtering model using the ALS (alternating least squares) algorithm:

In [None]:
from pyspark.ml.recommendation import ALS

# Initialize ALS with parameters
als = ALS(userCol="reviewerID", itemCol="productID", ratingCol="rating",
          nonnegative=True, coldStartStrategy="drop", implicitPrefs=False)
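
For intuition, ALS learns one latent factor vector per reviewer and one per product, and a predicted rating is the dot product of the two.  The sketch below shows only that final prediction step in plain Python; the factor values and the `predict` helper are hypothetical, since the real factors are learned when the model is fit.

```python
# Hypothetical rank-3 latent factors; real values are learned by ALS during fitting
user_factors = {41406: [0.9, 0.1, 1.2]}
item_factors = {363476: [1.5, 0.4, 2.0]}

def predict(user_id, item_id):
    """Predicted rating = dot product of the user's and the item's factor vectors."""
    u = user_factors[user_id]
    v = item_factors[item_id]
    return sum(a * b for a, b in zip(u, v))

print(round(predict(41406, 363476), 2))
```

The `rank` parameter tuned in the grid below corresponds to the length of these vectors.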

Next, we'll build the `ParamGridBuilder`:

In [None]:
from pyspark.ml.tuning import ParamGridBuilder

param_grid = ParamGridBuilder() \
                  .addGrid(als.rank, [10]) \
                  .build()

Now we'll build our evaluator and use RMSE as the performance metric:

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Define evaluator
reg_eval = RegressionEvaluator(metricName = "rmse",
                               predictionCol = "prediction",
                               labelCol = "rating")

print(f"Num models to be tested: {len(param_grid)}")
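
For reference, RMSE is the square root of the mean squared difference between the actual and predicted ratings, so it is expressed in the same units as the ratings themselves.  The sketch below computes it by hand on a few made-up values; the `rmse` helper is our own illustration, not a PySpark function.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length lists of numbers."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [5.0, 3.0, 1.0, 4.0]      # hypothetical true ratings
predicted = [4.0, 3.0, 2.0, 2.0]   # hypothetical model predictions

# Squared errors are 1, 0, 1, 4, so the result is sqrt(6 / 4) = sqrt(1.5)
print(rmse(actual, predicted))
```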

Creating `CrossValidator`:

In [None]:
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator = als, 
                    estimatorParamMaps= param_grid,
                    evaluator = reg_eval,
                    numFolds = 5)
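
To get a feel for the cost of this step: with `numFolds = 5`, every parameter combination in the grid is trained five times, each time holding out a different fifth of the training data, and the best combination is then refit once on all of the training data.  The sketch below counts those fits and builds toy folds in plain Python.  The round-robin fold assignment is a simplification for illustration; Spark assigns rows to folds randomly.

```python
num_folds = 5
param_grid_size = 1   # the grid above holds a single combination (rank = 10)

# One fit per (combination, fold), plus one final refit of the best combination
fits_during_search = num_folds * param_grid_size
total_fits = fits_during_search + 1
print(f"Total model fits: {total_fits}")

# Toy fold assignment for 10 hypothetical row indices (round-robin, for illustration only)
rows = list(range(10))
folds = [rows[i::num_folds] for i in range(num_folds)]
for held_out in folds:
    train = [r for r in rows if r not in held_out]
    print(f"validate on {held_out}, train on {train}")
```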

Now we can split the data and fit the cross-validator on the training set:

In [None]:
# Split data into 80% train, 20% test
training_data, test_data = product_ratings.randomSplit([0.8, 0.2], seed = 0)

# Training model
model = cv.fit(training_data)

# Get best model
best_model = model.bestModel

In [None]:
print(type(best_model))

print("\n**Best Model**")
print("  Rank:", best_model.rank)
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
print("  RegParam:", best_model._java_obj.parent().getRegParam())

Now we can evaluate our model's performance on the test data:

In [None]:
# Predict ratings using trained model
predictions = best_model.transform(test_data)
predictions.show(5)
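
Note that `predictions` can contain fewer rows than `test_data`.  A reviewer or product that appears only in the test split has no learned factors, so its prediction would be NaN, and because the ALS model was created with `coldStartStrategy="drop"`, those rows are removed.  The sketch below mimics that filtering in plain Python on made-up ID pairs; it illustrates the idea and is not how Spark implements it.

```python
# Hypothetical (reviewerID, productID) pairs
train = [(1, 10), (2, 10), (2, 11)]
test = [(1, 11), (3, 10), (2, 12)]

known_users = {u for u, _ in train}
known_items = {i for _, i in train}

# A pair can only be scored if both its user and its item were seen during training
scorable = [(u, i) for u, i in test if u in known_users and i in known_items]
dropped = [pair for pair in test if pair not in scorable]

print("scored: ", scorable)
print("dropped:", dropped)
```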

In [None]:
# Evaluate the "predictions" dataframe
RMSE = reg_eval.evaluate(predictions)

# Print the RMSE
print(RMSE)

In [None]:
# # Drop NaNs from predictions
# predictions_clean = predictions.dropna()


In [None]:
# # print(f"Before drop: {predictions.count()}")

# print(f"After drop: {predictions_clean.count()}")


After drop: 1340165


In [None]:

# # Calc and print RMSE
# print(f"RMSE: {reg_eval.evaluate(predictions_clean) : ,.2f}")

RMSE:  3.21


In [None]:
# predictions.show(5)

## Conclusions