# Recommender System

The classic recommender tutorial uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/). It plays the same role for recommender systems that the iris or MNIST datasets play for other algorithms. Let's do a code-along to get an idea of how this all works!


Looking for more datasets? Check out: https://gist.github.com/entaroadun/1653794

In [2]:
# Initiate Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rec').getOrCreate()

With collaborative filtering, we make predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating). The underlying assumption is that if user A has the same opinion as user B on one issue, A is more likely to share B's opinion on a different issue x than to share the opinion of a randomly chosen user.

The image below (from Wikipedia) shows an example of collaborative filtering. At first, people rate different items (like videos, images, games). Then, the system predicts a user's rating for an item they have not rated yet. The new predictions are built on the existing ratings of other users whose ratings are similar to the active user's. In the image, the system predicts that the user will not like the video.

<img src=https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif />
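
To make the idea concrete, here is a minimal pure-Python sketch of user-based collaborative filtering. The users, items, and ratings are made up for illustration, and the similarity-weighted average is a deliberately crude choice. This is not what Spark does internally (Spark uses ALS, covered next); it only shows the "similar users predict my rating" intuition.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}
ratings = {
    'alice': {'video': 1, 'image': 5, 'game': 4},
    'bob':   {'video': 2, 'image': 5, 'game': 5},
    'carol': {'video': 5, 'image': 1, 'game': 2},
    'dave':  {'image': 5, 'game': 4},  # dave has not rated 'video' yet
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted, total = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        weighted += sim * their_ratings[item]
        total += sim
    return weighted / total if total else None

prediction = predict('dave', 'video')
print(f"Predicted rating of 'video' for dave: {prediction:.2f}")
```

Dave's tastes line up most closely with Alice's and Bob's, who both rated the video low, so the prediction lands below the midpoint of the scale.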

Spark's MLlib machine learning library provides a collaborative filtering implementation based on Alternating Least Squares (ALS). The implementation has the parameters below. Where the DataFrame-based `pyspark.ml.recommendation.ALS` class used in this notebook has a different name for a parameter, that name is given in parentheses:

* numBlocks (`numUserBlocks` and `numItemBlocks`) is the number of blocks used to parallelize computation (in the older RDD-based API, set to -1 to auto-configure).
* rank is the number of latent factors in the model.
* iterations (`maxIter`) is the number of iterations to run.
* lambda (`regParam`) specifies the regularization parameter in ALS.
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
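
To see what rank, iterations, and lambda actually control, here is a small NumPy sketch of explicit-feedback ALS on a made-up dense ratings matrix. It is an illustration under simplifying assumptions (a zero means "not rated", a plain ridge penalty, no parallel blocks), not Spark's implementation, which differs in details such as how the regularization term is scaled.

```python
import numpy as np

def als_explicit(ratings, rank=2, iterations=30, lam=0.1, seed=0):
    """Factor a users x items matrix as U @ V.T using alternating least squares.

    Assumption of this sketch: entries equal to 0 mean 'not rated'.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    observed = ratings > 0
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item latent factors
    reg = lam * np.eye(rank)

    for _ in range(iterations):
        # Hold V fixed: each user's factors are a small ridge regression.
        for u in range(n_users):
            Vu = V[observed[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ ratings[u, observed[u]])
        # Hold U fixed: solve the same kind of problem for each item.
        for i in range(n_items):
            Ui = U[observed[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ ratings[observed[:, i], i])
    return U, V

# Hypothetical ratings: rows are users, columns are items, 0 = not rated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

U, V = als_explicit(R, rank=2, iterations=30, lam=0.1)
predicted = U @ V.T  # includes predictions for the unrated (zero) cells
train_rmse = np.sqrt(np.mean((predicted[R > 0] - R[R > 0]) ** 2))
print(f'RMSE on the observed ratings: {train_rmse:.3f}')
```

A larger rank lets the factors capture more structure but overfits more easily, and a larger lambda pulls the factors toward zero. That trade-off is what `rank` and `regParam` tune in the Spark version.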

In [3]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [4]:
# Load the data

data = spark.read.csv('resources/movielens_ratings.csv', inferSchema=True, header=True)

data.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [5]:
# Let's check the schema

data.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)



In [6]:
# Check summary statistics

data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



We can split the data to evaluate how well our model performs, but keep in mind that it is very hard to know conclusively how well a recommender system is truly working, especially where subjectivity is involved. For example, not everyone who loves Star Wars is going to love Star Trek, even though a recommendation system may suggest otherwise.

In [7]:
# Use a simple 80/20 training/test split, since the dataset is small

training, test = data.randomSplit([0.8, 0.2])

In [8]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

# Fit the model to the training data
model = als.fit(training)

In [9]:
# Generate predictions for the test data

predictions = model.transform(test)

predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   1.0|    26|  1.0926963|
|     31|   4.0|    12| -2.1344895|
|     31|   1.0|    24|-0.52947056|
|     31|   1.0|    29|-0.19568816|
|     85|   5.0|    16|-0.74210286|
|     85|   1.0|    23|   4.996647|
|     85|   1.0|    25|   1.207609|
|     65|   1.0|    22|  1.0294774|
|     65|   2.0|     5|  1.8284413|
|     65|   5.0|    23| 0.27484703|
|     53|   3.0|    20| 0.18861663|
|     53|   1.0|     7|  2.7834985|
|     78|   1.0|    20|  1.1139445|
|     34|   1.0|    14|  1.6396035|
|     81|   5.0|    28|  0.9889173|
|     81|   1.0|    22|   2.614241|
|     81|   1.0|    19|   1.529076|
|     81|   4.0|    11| 0.55323243|
|     81|   3.0|    18|  0.7314743|
|     28|   1.0|    27| -1.4481549|
+-------+------+------+-----------+
only showing top 20 rows



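- The way to interpret this: for movieId 31, userId 26 gave a rating of 1.0 and the model predicted 1.09. Other predictions are way off, though. For the same movie, userId 12 gave a rating of 4.0 and the model predicted -2.13, which is not even on the rating scale.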

In [10]:
# Evaluate the model by computing the RMSE of the test predictions

evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(predictions)

print(f'RMSE is : {rmse}')

RMSE is : 2.1139732612594777


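Given that movie ratings range from 1 to 5, an RMSE of 2.11 is awfully bad. The ratings themselves have a standard deviation of only about 1.19 (see the summary statistics above), so always predicting the mean rating would likely score better.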
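
To make the metric less of a black box, RMSE can be computed by hand. The sketch below copies the first five (rating, prediction) pairs from the `predictions.show()` output above and compares the model against a constant baseline. The helper `rmse` is our own function, not part of Spark, and the baseline here uses the mean of these same five ratings purely for illustration (in practice you would take the mean from the training set).

```python
from math import sqrt

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# (rating, prediction) pairs copied from the first five rows of predictions.show()
rows = [(1.0, 1.0926963), (4.0, -2.1344895), (1.0, -0.52947056),
        (1.0, -0.19568816), (5.0, -0.74210286)]
labels = [rating for rating, _ in rows]
preds = [prediction for _, prediction in rows]

model_rmse = rmse(labels, preds)

# Baseline for illustration: always predict the mean of these ratings.
mean_rating = sum(labels) / len(labels)
baseline_rmse = rmse(labels, [mean_rating] * len(labels))

print(f'model RMSE on these rows:    {model_rmse:.2f}')
print(f'baseline RMSE on these rows: {baseline_rmse:.2f}')
```

On these five rows the model does worse than the constant baseline, which matches the impression from the full test set.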

So now that we have the model, how would you actually supply a recommendation to a user?

The same way we did with the test data! For example:

In [11]:
single_user = test.filter(test['userId'] == 11).select(['movieId', 'userId'])

In [12]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      0|    11|
|     10|    11|
|     13|    11|
|     21|    11|
|     22|    11|
|     45|    11|
|     48|    11|
|     50|    11|
|     67|    11|
|     69|    11|
|     71|    11|
|     75|    11|
|     81|    11|
|     86|    11|
|     89|    11|
|     90|    11|
+-------+------+

