## DETAILED DESCRIPTIONS OF DATA FILES
Here are brief descriptions of the data.

ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:
gunzip ml-data.tar.gz
tar xvf ml-data.tar
mku.sh

**u.data** -- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user id | item id | rating | timestamp.
The timestamps are Unix seconds since 1/1/1970 UTC.
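
As a minimal sketch of the record layout described above, one u.data line could be parsed as follows. The `Rating` type and `parse_rating` helper are hypothetical names introduced for illustration; the sample values are the first row of the ratings table shown later in this notebook.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    item_id: int
    rating: int
    rated_at: datetime


def parse_rating(line: str) -> Rating:
    """Parse one tab-separated u.data record (hypothetical helper)."""
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    # The timestamp is Unix seconds since 1/1/1970 UTC.
    rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return Rating(int(user_id), int(item_id), int(rating), rated_at)


record = parse_rating("1\t1\t5\t874965758")
print(record.rated_at.isoformat())
```

Keeping the datetime timezone-aware avoids silently interpreting the timestamp in the local timezone of the machine that runs the notebook.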

**u.info** -- The number of users, items, and ratings in the u data set.

**u.item** -- Information about the items (movies); this is a pipe (`|`) separated
list of
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie
is of that genre, a 0 indicates it is not; movies can be in
several genres at once.
The movie ids are the ones used in the u.data data set.
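
The 19 genre flags can be decoded into genre names with a few lines of Python. This is a sketch under assumptions: `parse_item` is a hypothetical helper, the sample record is made up for illustration (it is not a row from u.item), and the delimiter defaults to `|`, which is what the field list suggests; pass `sep="\t"` if your copy of the file is tab separated.

```python
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_item(line: str, sep: str = "|") -> dict:
    """Parse one u.item record into a dict with a list of genre names."""
    fields = line.rstrip("\n").split(sep)
    movie_id, title, release_date, video_release_date, imdb_url = fields[:5]
    flags = fields[5:5 + len(GENRES)]
    # A flag of "1" means the movie belongs to the genre at that position.
    genres = [name for name, flag in zip(GENRES, flags) if flag == "1"]
    return {
        "movie_id": int(movie_id),
        "title": title,
        "release_date": release_date,
        "video_release_date": video_release_date,
        "imdb_url": imdb_url,
        "genres": genres,
    }


# Made-up record for illustration only: an Action/Comedy movie.
sample_flags = ["1" if g in ("Action", "Comedy") else "0" for g in GENRES]
sample_line = "|".join(
    ["42", "Example Movie (1990)", "01-Jan-1990", "", "http://example.com/title"]
    + sample_flags
)
item = parse_item(sample_line)
print(item["genres"])
```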

**u.genre** -- A list of the genres.

**u.user** -- Demographic information about the users; this is a pipe (`|`)
separated list of
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.

**u.occupation** -- A list of the occupations.

**u1.base**, **u1.test**, **u2.base**, **u2.test**, **u3.base**, **u3.test**, **u4.base**, **u4.test**, **u5.base**, **u5.test** -- The data sets u1.base and u1.test through u5.base and u5.test
are 80%/20% splits of the u data into training and test data.
The test sets of u1, …, u5 are disjoint; this is for
5 fold cross validation (where you repeat your experiment
with each training and test set and average the results).
These data sets can be generated from u.data by mku.sh.
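
The idea of five 80%/20% splits with disjoint test sets can be sketched in plain Python. This is an illustration of the concept only, not a port of mku.sh; the function name `k_fold_splits` and the seed are assumptions.

```python
import random


def k_fold_splits(n_rows: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs whose test sets are disjoint."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one goes into the training set.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_splits(n_rows=100, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each row lands in exactly one test set, so averaging a metric over the five runs evaluates every rating once.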

**ua.base**, **ua.test**, **ub.base**, **ub.test** -- The data sets ua.base, ua.test, ub.base, and ub.test
split the u data into a training set and a test set with
exactly 10 ratings per user in the test set. The sets
ua.test and ub.test are disjoint. These data sets can
be generated from u.data by mku.sh.

**allbut.pl** -- The script that generates training and test sets where
all but n of a user's ratings are in the training data.
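
An "all but n" split can be sketched as follows. This is a Python illustration of the idea, not a translation of allbut.pl; the helper name `all_but_n_split`, the seed, and the toy ratings are assumptions made for the example.

```python
import random
from collections import defaultdict


def all_but_n_split(ratings, n=10, seed=0):
    """Hold out exactly n ratings per user; the rest go to training."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)  # row[0] is the user id

    rng = random.Random(seed)
    train, test = [], []
    for user_id in sorted(by_user):
        rows = by_user[user_id][:]
        rng.shuffle(rows)
        test.extend(rows[:n])    # n held-out ratings for this user
        train.extend(rows[n:])   # all but n stay in the training data
    return train, test


# Toy data: 3 users with 20 ratings each (user id, item id, rating).
toy = [(u, m, 3) for u in (1, 2, 3) for m in range(1, 21)]
train_rows, test_rows = all_but_n_split(toy, n=10)
print(len(train_rows), len(test_rows))
```

Because every user in the data set has at least 20 ratings, holding out 10 always leaves at least 10 ratings per user for training.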

**mku.sh** -- A shell script to generate all the u data sets from u.data.

In [38]:
import sys

import pandas as pd
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

# Import the required ML functions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

### Parameters of a SparkContext

**master** − The URL of the cluster to connect to.

**appName** − The name of your job.

**sparkHome** − The Spark installation directory.

**pyFiles** − The .zip or .py files to send to the cluster and add to the PYTHONPATH.

**environment** − Environment variables for the worker nodes.

**batchSize** − The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.

**serializer** − The RDD serializer.

**conf** − A SparkConf object used to set all the Spark properties.

**gateway** − Use an existing gateway and JVM; otherwise a new JVM is initialized.

**jsc** − The JavaSparkContext instance.

**profiler_cls** − A class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).


In [2]:
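# Stop the session left over from an earlier run; skip this cell on a fresh kernel, where `spark` is not defined yet
spark.stop()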

In [3]:
spark = SparkSession \
    .builder \
    .appName("recommendation_engine") \
    .master("local") \
    .getOrCreate()

In [4]:
# Read the tab-separated ratings file; without a schema every column is loaded as a string
ratings = spark.read.csv("./data/ml-100k/u.data", sep="\t")
# ratings = spark.read.load('data/ml-100k/ratings.parquet', schema=schema)

In [5]:
ratings = ratings.toDF('userId','movieId','rating','timestamp')

In [6]:
ratings.columns

['userId', 'movieId', 'rating', 'timestamp']

In [8]:
ratings = ratings.sort(['userId','movieId'], ascending=True)

In [9]:
# Show the first 20 rows
ratings.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|     5|874965758|
|     1|     10|     3|875693118|
|     1|    100|     5|878543541|
|     1|    101|     2|878542845|
|     1|    102|     2|889751736|
|     1|    103|     1|878542845|
|     1|    104|     1|875241619|
|     1|    105|     2|875240739|
|     1|    106|     4|875241390|
|     1|    107|     4|875241619|
|     1|    108|     5|875240920|
|     1|    109|     5|874965739|
|     1|     11|     2|875072262|
|     1|    110|     1|878542845|
|     1|    111|     5|889751711|
|     1|    112|     1|878542441|
|     1|    113|     5|878542738|
|     1|    114|     5|875072173|
|     1|    115|     5|878541637|
|     1|    116|     3|878542960|
+------+-------+------+---------+
only showing top 20 rows
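
The movieId column above runs 1, 10, 100, 101, ... rather than 1, 2, 3, ... because the columns are still strings at this point, so the sort is lexicographic. A small pure-Python illustration of the difference (the id list is just a sample chosen for the example):

```python
movie_ids = ["1", "10", "100", "101", "11", "2"]

# Strings compare character by character, so "101" sorts before "11".
as_strings = sorted(movie_ids)

# Converting to int first gives the numeric order.
as_integers = sorted(movie_ids, key=int)

print(as_strings)
print(as_integers)
```

Sorting after the columns have been cast to integers would give the numeric order instead.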



In [10]:
# count rows
ratings.count()

100000

In [11]:
# Count the total number of ratings in the dataset
numerator = ratings.select("rating").count()

In [12]:
# Count the number of distinct userIds and distinct movieIds
num_users = ratings.select("userId").distinct().count()
print('number of unique users: ', num_users)

number of unique users:  943


In [13]:
num_movies = ratings.select("movieId").distinct().count()
print('number of unique movies: ', num_movies)

number of unique movies:  1682


In [14]:
# Set the denominator equal to the number of users multiplied by the number of movies
denominator = num_users * num_movies
print('number of possible ratings: ', denominator)

number of possible ratings:  1586126


In [15]:
# Divide the numerator by the denominator
sparsity = (1.0 - (numerator *1.0)/denominator)*100
print("The ratings dataframe is ", "%.2f" % sparsity + "% empty.")

The ratings dataframe is  93.70% empty.
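
The sparsity figure can be checked without Spark from the three counts reported above. The function name `sparsity_percent` is a hypothetical helper introduced for this check.

```python
def sparsity_percent(num_ratings: int, num_users: int, num_items: int) -> float:
    """Percentage of user/item pairs that have no rating."""
    possible = num_users * num_items
    return (1.0 - num_ratings / possible) * 100


possible_ratings = 943 * 1682
sparsity_pct = sparsity_percent(100_000, 943, 1682)
print(f"{possible_ratings} possible ratings, {sparsity_pct:.2f}% empty")
```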


In [29]:
# Filter to show only userIds less than 100
ratings.filter(sf.col("userId") < 100).show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|     5|874965758|
|     1|     10|     3|875693118|
|     1|    100|     5|878543541|
|     1|    101|     2|878542845|
|     1|    102|     2|889751736|
|     1|    103|     1|878542845|
|     1|    104|     1|875241619|
|     1|    105|     2|875240739|
|     1|    106|     4|875241390|
|     1|    107|     4|875241619|
|     1|    108|     5|875240920|
|     1|    109|     5|874965739|
|     1|     11|     2|875072262|
|     1|    110|     1|878542845|
|     1|    111|     5|889751711|
|     1|    112|     1|878542441|
|     1|    113|     5|878542738|
|     1|    114|     5|875072173|
|     1|    115|     5|878541637|
|     1|    116|     3|878542960|
+------+-------+------+---------+
only showing top 20 rows



In [18]:
# Group data by userId, count ratings
ratings.groupBy("userId").count().show()

+------+-----+
|userId|count|
+------+-----+
|   296|  147|
|   467|   44|
|   675|   34|
|   691|   32|
|   829|   64|
|   125|  182|
|   451|   98|
|   800|   28|
|   853|   41|
|   666|  245|
|   870|  269|
|   919|  217|
|   926|   20|
|   124|   24|
|   447|  139|
|    51|   23|
|   591|   84|
|     7|  403|
|   307|  112|
|   475|   20|
+------+-----+
only showing top 20 rows



In [32]:
# Min num ratings for movies
print("Movie with the fewest ratings: ")
ratings.groupBy("movieId").count().select(sf.min("count")).show()

Movie with the fewest ratings: 
+----------+
|min(count)|
+----------+
|         1|
+----------+



In [30]:
# Avg num ratings per movie
print("Avg num ratings per movie: ")
ratings.groupBy("movieId").count().select(sf.avg('count')).show()

Avg num ratings per movie: 
+-----------------+
|       avg(count)|
+-----------------+
|59.45303210463734|
+-----------------+



In [33]:
# Min num ratings for user
print("User with the fewest ratings: ")
ratings.groupBy("userId").count().select(sf.min("count")).show()

User with the fewest ratings: 
+----------+
|min(count)|
+----------+
|        20|
+----------+



In [34]:
# Avg num ratings per users
print("Avg num ratings per user: ")
ratings.groupBy("userId").count().select(sf.avg("count")).show()

Avg num ratings per user: 
+------------------+
|        avg(count)|
+------------------+
|106.04453870625663|
+------------------+
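
Both `avg(count)` values follow directly from the totals reported earlier, since the average number of ratings per group is the total number of ratings divided by the number of groups. A quick check in plain Python:

```python
num_ratings = 100_000
num_users = 943
num_movies = 1_682

# Average group size = total rows / number of groups.
avg_per_user = num_ratings / num_users
avg_per_movie = num_ratings / num_movies

print(avg_per_user)
print(avg_per_movie)
```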



In [35]:
# Use .printSchema() to see the datatypes of the ratings dataset
ratings.printSchema()

root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)



In [36]:
# Cast the columns to the proper data types (timestamp is not selected, so it is dropped)
ratings = ratings.select(ratings.userId.cast("integer"), ratings.movieId.cast("integer"), ratings.rating.cast("double"))

In [37]:
# Call .printSchema() again to confirm the columns are now in the correct format
ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)



# ALS Model

In [39]:
# Create test and train set
(train, test) = ratings.randomSplit([0.8, 0.2], seed = 1234)
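
To my understanding, `randomSplit` assigns rows by random sampling rather than by exact counts, so the two parts are only approximately 80% and 20% of the data. The sketch below illustrates that behaviour in plain Python; it is not Spark's implementation, and `bernoulli_split` is a hypothetical name.

```python
import random


def bernoulli_split(rows, train_weight=0.8, seed=1234):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One independent random draw per row decides its side of the split.
        (train if rng.random() < train_weight else test).append(row)
    return train, test


train_rows, test_rows = bernoulli_split(range(100_000))
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which matters when comparing models trained in separate runs.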

In [40]:
# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", nonnegative = True, implicitPrefs = False)

In [41]:
# Confirm that a model called "als" was created
type(als)

pyspark.ml.recommendation.ALS

In [42]:
# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
            .addGrid(als.rank,  [10, 50, 100, 150]) \
            .addGrid(als.maxIter, [5, 50, 100, 200]) \
            .addGrid(als.regParam, [.01, .05, .1, .15]) \
            .build()
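
The grid is the Cartesian product of the three value lists, so its size is 4 × 4 × 4. The same enumeration can be sketched with `itertools.product`; the dictionary keys simply mirror the ALS parameter names used above.

```python
from itertools import product

ranks = [10, 50, 100, 150]
max_iters = [5, 50, 100, 200]
reg_params = [0.01, 0.05, 0.1, 0.15]

# One dict per hyperparameter combination.
grid = [
    {"rank": rank, "maxIter": max_iter, "regParam": reg_param}
    for rank, max_iter, reg_param in product(ranks, max_iters, reg_params)
]

print(len(grid))
print(grid[0])
```

With k-fold cross validation every combination is fitted k times, so the number of model fits is k times the grid size.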

In [43]:
# Define the evaluator as RMSE and print the number of models in the grid
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))

Num models to be tested:  64
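
RMSE, the metric chosen for the evaluator, is the square root of the mean squared difference between the actual ratings and the predictions. A minimal pure-Python sketch of that definition follows; the `rmse` helper and the three sample ratings are made up for illustration.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0 and -2, so the mean squared error is 5/3.
score = rmse([5.0, 3.0, 1.0], [4.0, 3.0, 3.0])
print(score)
```

Lower values are better, and the result is in the same units as the ratings themselves.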
