# Collaborative Filtering using ALS 

A recommender system is an information-filtering tool that seeks to predict which products a user will like. Based on these predictions, products can be recommended to users.

In [7]:
import pandas as pd
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
from pyspark import SparkContext

import warnings
warnings.filterwarnings('ignore')

In [11]:
DATA_DIR = 'data'

### Initiate spark session

In [4]:
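# Get (or create) the active SparkContext; assigning the bare class would not give a usable context
sc = SparkContext.getOrCreate()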

In [20]:
spark = SparkSession.builder.\
        appName('Recommendations').getOrCreate()

## Explicit vs. implicit data

There are two ways to gather user preference data for recommending items.

1. Explicit data:
    - a concrete rating scale, e.g. rating a movie from 1 to 5 stars
    - makes it easier to extrapolate from the data to predict future ratings
    - drawback: the burden of data collection falls on the user, who might not take the time to enter ratings

2. Implicit data:
    - easier to collect in large quantities, with no extra effort on the part of the user
    - more difficult to work with, because preferences must be inferred from behaviour rather than stated directly

## Load data

In [12]:
movies = spark.read.csv(f'{DATA_DIR}/movies.csv', header=True)

In [14]:
ratings = spark.read.csv(f'{DATA_DIR}/ratings.csv', header=True)

In [18]:
ratings.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



In [19]:
ratings.printSchema()

root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)



In [22]:
# Cast the string columns to appropriate numeric types and drop the unused timestamp
ratings = ratings.\
    withColumn('userId', col('userId').cast('integer')).\
    withColumn('movieId', col('movieId').cast('integer')).\
    withColumn('rating', col('rating').cast('float')).\
    drop('timestamp')
ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



# Calculate sparsity

In the real world, the utility matrix is expected to be very sparse, as each user only encounters a small fraction of items among the vast pool of options available.

Cold start is a problem that arises when a new user or a new item is added, because neither has any rating history.
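To make the cold-start issue concrete: when a model has no history for a user or an item, it cannot produce a meaningful prediction for that pair. Spark's ALS emits `NaN` in that case by default, which is why the model built later in this notebook sets `coldStartStrategy='drop'`. The sketch below uses plain Python with hypothetical prediction rows (not the notebook's data, and the helper name `root_mean_square_error` is made up for illustration) to show how a single `NaN` poisons an error metric unless such rows are dropped.

```python
import math

# Hypothetical (actual rating, predicted rating) pairs;
# NaN stands in for a prediction on a cold-start user or item
rows = [(4.0, 3.5), (2.0, 2.5), (5.0, float("nan"))]


def root_mean_square_error(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Keeping the cold-start row makes the whole metric NaN
rmse_with_nan = root_mean_square_error(rows)

# Dropping rows with NaN predictions mirrors the effect of the 'drop' strategy
kept = [(a, p) for a, p in rows if not math.isnan(p)]
rmse_dropped = root_mean_square_error(kept)

print(rmse_with_nan, rmse_dropped)
```

Dropping rows keeps the metric usable, at the cost of silently excluding exactly the users and items for which the model has nothing to say.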

In [54]:
# Count the total number of ratings in the dataset
numerator = ratings.select('rating').count()
# print('numerator:', numerator)

# Count the number of distinct users and distinct movies
num_users = ratings.select('userId').distinct().count()
# print('num_users:', num_users)
num_movies = ratings.select('movieId').distinct().count()
# print('num_movies:', num_movies)

# Set the denominator equal to the number of users
# multiplied by the number of movies
denominator = num_users * num_movies
# print('denominator:', denominator)

# Divide the numerator by the denominator
sparsity = (1.0 - numerator/denominator) * 100
print(f'The ratings dataframe is {sparsity:.2f}% empty')

The ratings dataframe is 98.30% empty
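The computation above can be written as a single formula. The symbols are notation introduced here for illustration: $N_{\text{ratings}}$ is the number of observed ratings, and $N_{\text{users}}$ and $N_{\text{movies}}$ are the numbers of distinct users and movies.

$$
\text{sparsity} = \left(1 - \frac{N_{\text{ratings}}}{N_{\text{users}} \times N_{\text{movies}}}\right) \times 100\%
$$

The denominator is the number of cells in the full user-by-movie utility matrix, so the fraction is the share of cells that actually hold a rating.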


# Interpret ratings

In [33]:
# Group data by userId, count ratings
userId_ratings = ratings.groupBy('userId').count().orderBy(
    'count', ascending=False)

In [34]:
userId_ratings.show()

+------+-----+
|userId|count|
+------+-----+
|   414| 2698|
|   599| 2478|
|   474| 2108|
|   448| 1864|
|   274| 1346|
|   610| 1302|
|    68| 1260|
|   380| 1218|
|   606| 1115|
|   288| 1055|
|   249| 1046|
|   387| 1027|
|   182|  977|
|   307|  975|
|   603|  943|
|   298|  939|
|   177|  904|
|   318|  879|
|   232|  862|
|   480|  836|
+------+-----+
only showing top 20 rows



In [37]:
# Group data by movieId, count ratings
movieId_ratings = ratings.groupBy('movieId').count().orderBy(
                    'count', ascending=False)

In [38]:
movieId_ratings.show()

+-------+-----+
|movieId|count|
+-------+-----+
|    356|  329|
|    318|  317|
|    296|  307|
|    593|  279|
|   2571|  278|
|    260|  251|
|    480|  238|
|    110|  237|
|    589|  224|
|    527|  220|
|   2959|  218|
|      1|  215|
|   1196|  211|
|   2858|  204|
|     50|  204|
|     47|  203|
|    780|  202|
|    150|  201|
|   1198|  200|
|   4993|  198|
+-------+-----+
only showing top 20 rows



# Approaches to Recommendation

The two widely used approaches for building a recommender system are:

1. Content-based filtering (CBF), the more widely used of the two
2. Collaborative filtering (CF)

The primary difference between the two approaches is that CF looks for similar users to recommend items, while CBF looks for similar content to recommend items.

# Content-based Filtering (CBF)

The main idea behind CBF is to recommend items similar to those the user has previously liked. For example, if a user has rated some items in the past, then these items are used for _user modeling_, in which the user's interests are quantified.

Given a new item $x$, the user's likely rating of the item is predicted using the user model.

This can be achieved in two different ways:
- Predicting ratings with parametric models, such as regression for multi-valued ratings and logistic regression for binary ratings

- Similarity-based techniques that use distance measures over item features to find items similar to those the user liked
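As a minimal sketch of the similarity-based variant, the example below encodes each item as a binary genre vector and ranks candidates by cosine similarity to an item the user liked. The genre vocabulary, item names, and vectors are all hypothetical, and cosine similarity is only one of several possible measures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical genre order: [Action, Comedy, Drama, Romance]
liked_movie = [0, 1, 1, 1]           # Comedy|Drama|Romance
candidates = {
    "candidate_a": [0, 1, 0, 1],     # Comedy|Romance
    "candidate_b": [1, 0, 0, 0],     # Action
}

# Score every candidate against the liked item, most similar first
ranking = sorted(
    ((name, cosine_similarity(liked_movie, vec)) for name, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Here `candidate_a` shares two genres with the liked movie and ranks first, while `candidate_b` shares none and scores zero.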

CBF can be applied even before a strong user base exists, because it depends only on the item's metadata (features). It therefore does not suffer from the cold-start problem for new items.

However:
- this also makes it computationally intensive, as the similarity between each user and every item must be computed
- since recommendations are based on similarity to items the user already knows about, it leaves no room for serendipity and causes over-specialisation
- CBF also ignores an item's popularity and other users' feedback
   

# Collaborative filtering (CF)

Collaborative filtering aggregates the past behaviour of all users. It recommends items to a user based on the items liked by other users whose likes (and dislikes) are similar to those of the user under consideration. This approach is also called _user-user_ CF.
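The user-user idea can be sketched in a few lines of plain Python. The toy ratings, user names, and the helpers `user_similarity` and `predict` are hypothetical, and the similarity-weighted average shown here is one simple choice among many (it is not the ALS method used in the rest of this notebook).

```python
import math

# Hypothetical toy data: user -> {item: rating}
toy_ratings = {
    "alice": {"m1": 5.0, "m2": 4.0, "m3": 1.0},
    "bob":   {"m1": 5.0, "m2": 5.0, "m3": 1.0, "m4": 5.0},
    "carol": {"m1": 1.0, "m2": 2.0, "m3": 5.0, "m4": 1.0},
}


def user_similarity(u, v, data):
    """Cosine similarity computed over the items both users have rated."""
    common = set(data[u]) & set(data[v])
    if not common:
        return 0.0
    dot = sum(data[u][i] * data[v][i] for i in common)
    norm_u = math.sqrt(sum(data[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(data[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item, data):
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = denominator = 0.0
    for other, their_ratings in data.items():
        if other == user or item not in their_ratings:
            continue
        sim = user_similarity(user, other, data)
        numerator += sim * their_ratings[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


# alice has not rated m4; bob (similar tastes) loved it, carol (dissimilar) did not
prediction = predict("alice", "m4", toy_ratings)
print(prediction)
```

Because alice's ratings track bob's far more closely than carol's, the prediction is pulled towards bob's rating. Mean-centering each user's ratings before computing similarity is a common refinement that this sketch omits.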



# Build the ALS model

In [39]:
# Import the required functions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [41]:
# Create test and train set
train, test = ratings.randomSplit([0.8, 0.2], seed=1234)

# Create ALS model
als = ALS(userCol='userId',
          itemCol='movieId',
          ratingCol='rating',
          nonnegative=True,
          implicitPrefs=False,
          coldStartStrategy='drop')

# Confirm model type
type(als)

pyspark.ml.recommendation.ALS
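For intuition about what the estimator does when fitted: alternating least squares factorizes the ratings matrix into user factors and item factors, repeatedly holding one set fixed and solving a regularized least-squares problem for the other. The NumPy sketch below illustrates that idea on a tiny hypothetical matrix. The function `als_explicit`, the matrix `R`, and all parameter values are assumptions made for illustration; this is not Spark's distributed implementation, which differs in details such as how regularization is scaled.

```python
import numpy as np


def als_explicit(R, mask, rank=2, reg=0.1, n_iters=50, seed=0):
    """Toy ALS for explicit ratings; only entries where mask is True are fitted."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Hold item factors fixed; solve a ridge regression for each user
        for u in range(n_users):
            rated = mask[u]
            if rated.any():
                Vu = V[rated]
                U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, rated])
        # Hold user factors fixed; solve a ridge regression for each item
        for i in range(n_items):
            rated_by = mask[:, i]
            if rated_by.any():
                Ui = U[rated_by]
                V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[rated_by, i])
    return U, V


# Hypothetical 4x4 ratings matrix; 0 marks a missing rating
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0

U, V = als_explicit(R, mask)
pred = U @ V.T  # dense matrix of predicted ratings, including the missing cells
train_rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(train_rmse)
```

The `rank` and `reg` arguments play the same roles as the `rank` and `regParam` parameters tuned later in the notebook: the number of latent factors and the strength of regularization.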

# Hyperparameter tuning for the ALS model

In [42]:
# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
                .addGrid(als.rank, [10, 50, 100, 150]) \
                .addGrid(als.regParam, [.01, .05, .1, .15]) \
                .build()

# Define an RMSE evaluator and print the number of parameter combinations in the grid
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))

Num models to be tested:  16


# Building the cross validation pipeline

In [43]:
# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)

cv

CrossValidator_0b0f6f08cb52
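It helps to estimate the cost of this search before fitting. The plain-Python sketch below enumerates the same grid with `itertools.product` and counts model fits, on the understanding that `CrossValidator` trains every parameter combination once per fold and then refits the best combination once on the full training set. The variable names are illustrative only.

```python
from itertools import product

# Same candidate values as the param_grid above
ranks = [10, 50, 100, 150]
reg_params = [0.01, 0.05, 0.1, 0.15]
num_folds = 5

# Every (rank, regParam) pair is one candidate model
grid = [{"rank": r, "regParam": reg} for r, reg in product(ranks, reg_params)]

# Each candidate is trained once per fold, plus one final refit of the winner
cv_fits = len(grid) * num_folds
total_fits = cv_fits + 1

print(len(grid), cv_fits, total_fits)
```

Adding a third hyperparameter with four values would multiply the grid size by four, so the grid is best kept small on larger datasets.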

# Get best model and model parameters

In [44]:
# Fit cross validator to the 'train' dataset
model = cv.fit(train)

# Extract best model from the cv model above
best_model = model.bestModel

21/10/03 18:29:12 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
21/10/03 18:29:12 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
                                                                                

In [45]:
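# `_java_obj.parent()` reaches into the underlying JVM object to read the
# parameters of the estimator that produced the best model
print("Best model:\n")
print("  Rank:", best_model._java_obj.parent().getRank())
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
print("  RegParam:", best_model._java_obj.parent().getRegParam())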

Best model:

  Rank: 50
  MaxIter: 10
  RegParam: 0.15


In [46]:
# Generate predictions on the test set and compute their RMSE
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)



0.8686031234435
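For reference, the metric computed by the evaluator is the root-mean-square error. In the notation below (introduced here for illustration), $T$ is the set of (user, movie) pairs in the test set, $r_{ui}$ is the actual rating, and $\hat{r}_{ui}$ is the predicted rating.

$$
\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left(r_{ui} - \hat{r}_{ui}\right)^2}
$$

RMSE is expressed in the same units as the ratings, so a value of about 0.87 means predictions deviate from the actual ratings by roughly 0.87 stars in the root-mean-square sense.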


                                                                                

In [47]:
test_predictions.show()



+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   597|    471|   2.0| 4.1380954|
|   436|    471|   3.0|   3.59407|
|   218|    471|   4.0| 3.0229354|
|   387|    471|   3.0| 2.9660947|
|   217|    471|   2.0| 2.8500836|
|   287|    471|   4.5| 2.8535714|
|    32|    471|   3.0| 3.7028766|
|   260|    471|   4.5|  3.564133|
|   104|    471|   4.5|   3.51176|
|   111|   1088|   3.0|  3.344483|
|   177|   1088|   3.5|  3.540231|
|    41|   1088|   1.5| 2.5977957|
|   387|   1088|   1.5| 2.6075828|
|   594|   1088|   4.5|   4.45407|
|   307|   1088|   3.0| 2.7343934|
|   509|   1088|   3.0| 3.1654263|
|   104|   1088|   3.0| 3.6615238|
|   268|   1238|   5.0|  3.864694|
|   462|   1238|   3.5| 3.6027348|
|   307|   1342|   2.0| 2.1763792|
+------+-------+------+----------+
only showing top 20 rows



                                                                                

# Make recommendations

In [49]:
# Generate n recommendations for all users
n_recommendations = best_model.recommendForAllUsers(10)
n_recommendations.limit(10).show()



+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[{3379, 4.8430214...|
|   463|[{3379, 4.9695063...|
|   496|[{3379, 4.633355}...|
|   148|[{33649, 4.555777...|
|   540|[{3379, 5.351739}...|
|   392|[{3379, 4.758028}...|
|   243|[{3379, 5.659164}...|
|    31|[{33649, 5.207429...|
|   516|[{4429, 4.8151984...|
|   580|[{3379, 4.772037}...|
+------+--------------------+



                                                                                

In [50]:
n_recommendations = n_recommendations\
    .withColumn("rec_exp", explode("recommendations"))\
    .select("userId", col("rec_exp.movieId"),
            col("rec_exp.rating"))

In [51]:
n_recommendations.limit(10).show()



+------+-------+---------+
|userId|movieId|   rating|
+------+-------+---------+
|   471|   3379|4.8430214|
|   471|  33649|4.5660295|
|   471| 171495| 4.553143|
|   471|  86781|4.5139523|
|   471|   7096|4.4966307|
|   471| 100714|4.4645295|
|   471|  78836| 4.455993|
|   471|   7767|4.4456406|
|   471|  26073| 4.419719|
|   471| 117531| 4.419719|
+------+-------+---------+
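The `explode` step above turns one row per user, holding an array of (movieId, rating) structs, into one row per recommendation. A plain-Python analogue of that reshaping is sketched below. It uses the first two recommendations for user 471 from the output above, plus a made-up second user whose values are hypothetical.

```python
# One entry per user, each holding a list of (movieId, predicted rating) pairs
nested = [
    {"userId": 471, "recommendations": [(3379, 4.8430214), (33649, 4.5660295)]},
    {"userId": 999, "recommendations": [(1, 4.0)]},  # hypothetical user and values
]

# Flatten to one (userId, movieId, rating) tuple per recommendation,
# repeating the userId for each element of that user's list
flat = [
    (row["userId"], movie_id, rating)
    for row in nested
    for movie_id, rating in row["recommendations"]
]

for record in flat:
    print(record)
```

The flat shape is what makes the join with the movies table in the next section straightforward, since each row now carries a single `movieId`.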



                                                                                

# Making sense of the recommendation

### Merge movie titles and genres into the recommendations for interpretability

In [52]:
# Attach titles and genres to the recommendations, then show those for user 10
n_recommendations.join(movies, on="movieId")\
    .filter("userId = 10").show()

+-------+------+---------+--------------------+--------------------+
|movieId|userId|   rating|               title|              genres|
+-------+------+---------+--------------------+--------------------+
|  71579|    10|4.5206785|Education, An (2009)|       Drama|Romance|
| 113275|    10| 4.349285|The Hundred-Foot ...|        Comedy|Drama|
|  51705|    10|4.2712903|Priceless (Hors d...|      Comedy|Romance|
|  94070|    10| 4.228056|Best Exotic Marig...|        Comedy|Drama|
|   7169|    10| 4.189049|Chasing Liberty (...|      Comedy|Romance|
|   3086|    10| 4.132452|Babes in Toyland ...|Children|Comedy|F...|
|  42730|    10|4.0880737|   Glory Road (2006)|               Drama|
|  67618|    10| 4.067024|Strictly Sexual (...|Comedy|Drama|Romance|
|  25906|    10| 4.045791|Mr. Skeffington (...|       Drama|Romance|
|  77846|    10| 4.045791| 12 Angry Men (1997)|         Crime|Drama|
+-------+------+---------+--------------------+--------------------+



In [53]:
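# Show the 10 highest actual ratings given by user 100
# (note: the recommendations above were filtered for user 10, not user 100)
ratings.join(movies, on="movieId").filter("userId = 100").\
    sort("rating", ascending=False).limit(10).show()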

+-------+------+------+--------------------+--------------------+
|movieId|userId|rating|               title|              genres|
+-------+------+------+--------------------+--------------------+
|   1101|   100|   5.0|      Top Gun (1986)|      Action|Romance|
|   1958|   100|   5.0|Terms of Endearme...|        Comedy|Drama|
|   2423|   100|   5.0|Christmas Vacatio...|              Comedy|
|   4041|   100|   5.0|Officer and a Gen...|       Drama|Romance|
|   5620|   100|   5.0|Sweet Home Alabam...|      Comedy|Romance|
|    368|   100|   4.5|     Maverick (1994)|Adventure|Comedy|...|
|    934|   100|   4.5|Father of the Bri...|              Comedy|
|    539|   100|   4.5|Sleepless in Seat...|Comedy|Drama|Romance|
|     16|   100|   4.5|       Casino (1995)|         Crime|Drama|
|    553|   100|   4.5|    Tombstone (1993)|Action|Drama|Western|
+-------+------+------+--------------------+--------------------+

