# Introductory Recommender Systems

Summarized by QH

Last updated on 2023-01-19

## 1. What are recommender systems?

Recommender systems are used widely today, in both simple and sophisticated forms:
* Your best friend recommends a restaurant to you because one of his friends recommended it to him.
* YouTube recommends videos you may like based on your watch history.
* Amazon recommends products based on your purchase history, or on the choices of other people whose purchasing behavior or preferences are similar to yours.

This note introduces the recommender systems used in the digital world. They fall into three types: _Simple popularity-based recommenders_, _Content-based recommenders_, and _Collaborative filtering engines_.

## 1.1 Simple popularity-based Recommenders
A popularity-based recommender offers the same general recommendation to every user, based on item popularity. Recommendations can be further grouped by genre, district, etc. For movie recommendation, the IMDb Top 250 is an example.
* _Algorithm_: 
    * Calculate the number of votes for each item.
    * Rank the items by their number of votes.
    * Select the top $k$ items.
    
* _Limitations_: No customization for different users. All users get the same recommendation.

## 1.2 Content-based recommenders
A content-based recommender suggests similar items based on the properties or characteristics of an item. The system uses item metadata to make recommendations; for movies, this could be genre, director, description, and actors. The underlying assumption is that if a user likes an item, he/she will also like similar items. YouTube is an example of this type.

* _Algorithm_:
    * Prepare a _profile_ for each item, often using one-hot encoding for categorical features, e.g. genre, director, and actors for movies.
    * Compute the similarity between each pair of items. Different similarity metrics can be chosen.
    * Rank the similarity scores from highest to lowest and select the top $k$.

* _Limitations_: The recommendations are based on the past experience of users. 
    * It will not recommend items from new areas the user has not experienced before. 
    * Users with the same past experience receive the same recommendations, even if their preferences differ.
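
The three algorithm steps above can be sketched on a toy catalogue. This is a minimal illustration, not a production implementation: the item names, the genre tags, and the helper `most_similar` are hypothetical, and cosine similarity is just one possible choice of metric.

```python
import numpy as np

# Hypothetical toy catalogue: item -> set of genre tags (illustrative data only)
items = {
    "A": {"Action", "Sci-Fi"},
    "B": {"Action", "Thriller"},
    "C": {"Comedy", "Romance"},
    "D": {"Action", "Sci-Fi", "Thriller"},
}

# Step 1: build one-hot item profiles (one column per genre)
genres = sorted(set().union(*items.values()))
names = list(items)
profiles = np.array([[1.0 if g in items[n] else 0.0 for g in genres] for n in names])

# Step 2: pairwise cosine similarity between the item profiles
unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
similarity = unit @ unit.T

def most_similar(item, k=2):
    """Step 3: rank the other items by similarity to `item` and keep the top k."""
    i = names.index(item)
    order = np.argsort(-similarity[i], kind="stable")
    return [names[j] for j in order if j != i][:k]

print(most_similar("A"))
```

Item "D" shares two of its three genres with "A", so it ranks ahead of "B", which shares only one.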


## 1.3 Collaborative filtering engines
It predicts a user's rating of, or preference for, an item based on the past ratings and preferences of other users.
* _User-based filtering_: If similar users like an item, the system recommends that item to the user.
* _Item-based filtering_: The system finds similar items based on how people have rated them in the past; if a user likes one item, it recommends the other items that are similar.

### 1.3.1 User-Item Interaction Matrix (Utility Matrix)
An important input to _collaborative filtering engines_ is the User-Item Interaction Matrix, where each cell holds either an __explicit__ or an __implicit__ rating from the user for the item:
* __explicit__ rating: A score that the user explicitly gives to the item.
* __implicit__ rating: A score derived from user behavior, e.g., the number of times the user used or watched the item.
* The matrix is normally very sparse, since users cannot interact with every item.
    
|User/Item|Item 1| Item 2 | Item ... | Item M|
|:--      |:--   |:--     |:--       |:--    |
|User 1   | 2    | 4      |          | 3     |
|User 2   |      | 3      |          | 1     |
|User ... | 1    |        | 4        |       |
|User N   | 5    |        |          | 2     |

* _Limitations_: Prior interaction data is needed for a user or an item before its similarity to other users or items can be derived (the cold-start problem).  
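
As a small sketch of how such a matrix is typically assembled, the snippet below pivots long-format interaction records into a user x item table with pandas. The user and item labels and the ratings are made-up toy values; missing interactions are left as `NaN` rather than filled.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per user-item interaction
ratings_long = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "item": ["i1", "i2", "i2", "i1", "i3"],
    "rating": [2, 4, 3, 1, 4],
})

# Pivot to a user x item matrix; cells without an interaction stay NaN
utility = ratings_long.pivot(index="user", columns="item", values="rating")

# Number of observed cells versus total cells
n_observed = int(utility.notna().sum().sum())
n_cells = utility.shape[0] * utility.shape[1]
print(utility)
print(f"{n_observed} of {n_cells} cells observed")
```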

### 1.3.2 Similarity Score Based Recommenders
Based on the User-Item Interaction Matrix, we can calculate a similarity score between items or between users (the metric can be, for example, Pearson correlation or cosine similarity). The users' ratings serve as the measures from which the similarities are calculated.

* _User-based filtering_: 
    * For a particular user, find a group of similar users (a threshold on the similarity metric can determine the group size).
    * Calculate the average rating of each item within the group of similar users.
    * Rank the items by average rating from highest to lowest and select the top $k$ that the user has not interacted with before.

* _Item-based filtering_: 
    * For a particular item that a user liked, rank the other items by the similarity metric from highest to lowest.
    * Select the top $k$ items that the user has not interacted with before.
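
The user-based steps can be sketched as follows. This is a minimal illustration under stated assumptions: the matrix `R` is toy data, `0` marks "not rated", a fixed number of neighbours stands in for a similarity threshold, and `recommend_user_based` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical utility matrix: rows = users, columns = items, 0 = not rated
R = np.array([
    [5, 4, 0, 0],   # target user (index 0)
    [5, 5, 4, 1],
    [4, 5, 5, 2],
    [1, 0, 2, 5],
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def recommend_user_based(R, user, n_neighbors=2, k=1):
    # 1. Similarity between the target user and every other user
    sims = np.array([cosine(R[user], R[v]) if v != user else -np.inf
                     for v in range(R.shape[0])])
    # 2. Keep the most similar users
    neighbors = np.argsort(-sims)[:n_neighbors]
    # 3. Average the neighbours' ratings per item, ignoring missing ratings
    block = R[neighbors]
    counts = (block > 0).sum(axis=0)
    avg = np.divide(block.sum(axis=0), counts,
                    out=np.zeros(R.shape[1]), where=counts > 0)
    # 4. Rank only the items the target user has not interacted with
    avg[R[user] > 0] = -np.inf
    return [int(i) for i in np.argsort(-avg)[:k]]

print(recommend_user_based(R, user=0))
```

Item-based filtering follows the same pattern with the roles swapped: similarities are computed between the columns of `R` instead of the rows.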

### 1.3.3 Similarity Metrics
#### Cosine similarity
Cosine similarity is defined as 
$$K(X, Y) = \frac{X \cdot Y}{\|X\| \|Y\|}$$ 
It is the cosine of the angle between the two vectors.
#### Jaccard similarity
Jaccard similarity measures the similarity between binary vectors or sets. It is defined as the size of the intersection divided by the size of the union:
$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
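
A short sketch of both metrics in plain Python may help make the definitions concrete. The function names and the example vectors and sets are illustrative only.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_similarity(x, y):
    """Size of the intersection divided by the size of the union."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

# Two binary vectors sharing one of their two active positions: cosine is about 0.5
print(cosine_similarity([1, 0, 1], [1, 1, 0]))
# Two genre sets sharing 2 of 3 distinct genres: Jaccard is 2/3
print(jaccard_similarity({"Comedy", "Romance"}, {"Comedy", "Drama", "Romance"}))
```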

### 1.3.4 Model Based Recommenders
Often, there are latent factors that motivate a user to give a particular rating. Model-based recommenders find these latent factors by decomposing the user-item interaction matrix, and then use the latent features to infer the ratings users might give to products they have never interacted with.

There are several methods for matrix decomposition:
* Factor Analysis
* PCA
* Non-negative matrix factorization (NMF)
* Truncated SVD
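
As a rough sketch of the decomposition idea, the snippet below applies a rank-2 truncated SVD to a toy matrix with NumPy and rebuilds an approximation from the latent factors. The matrix values are made up, and treating `0` as an actual rating (rather than as missing) is a simplifying assumption of this illustration.

```python
import numpy as np

# Hypothetical 4 users x 4 items rating matrix (0 treated as a value, for simplicity)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

k = 2  # number of latent factors to keep

# Full SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
item_factors = Vt[:k, :]          # shape (k, n_items)

# Low-rank reconstruction: every cell gets a score, including unrated ones
R_hat = user_factors @ item_factors
print(np.round(R_hat, 2))
```

The entries of `R_hat` at positions that were empty in `R` are the scores a recommender would rank to pick new items for each user.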

## 2. Examples using Movie Lens data
The source data is the [MovieLens dataset from grouplens.org](https://grouplens.org/datasets/movielens/). We use only ml-latest-small as the sample dataset; it includes 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users, and was last updated in 9/2018.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [11]:
# Read in the movie information and ids for imdb and tmdb
movies = pd.read_csv('./ml-latest-small/movies.csv')
links = pd.read_csv('./ml-latest-small/links.csv')
display(movies.head())
display(links.head())

| |movieId|title|genres|
|:--|:--|:--|:--|
|0|1|Toy Story (1995)|Adventure\|Animation\|Children\|Comedy\|Fantasy|
|1|2|Jumanji (1995)|Adventure\|Children\|Fantasy|
|2|3|Grumpier Old Men (1995)|Comedy\|Romance|
|3|4|Waiting to Exhale (1995)|Comedy\|Drama\|Romance|
|4|5|Father of the Bride Part II (1995)|Comedy|

| |movieId|imdbId|tmdbId|
|:--|:--|:--|:--|
|0|1|114709|862.0|
|1|2|113497|8844.0|
|2|3|113228|15602.0|
|3|4|114885|31357.0|
|4|5|113041|11862.0|


First, we consolidate the movies and links datasets into a single movie information data frame, with the following modifications:
* year: extract the year in parentheses from the title column
* decades: derive the decade from the year
* genres: split the pipe-separated string into a list of genres

In [82]:
movies_info = movies.copy()
# Create year column
## Find the (yyyy) pattern in the title
movies_info['year'] = movies_info['title'].str.extract(r'(\(\d{4}\))', expand=False)
## Extract the digits without the parentheses
movies_info['year'] = movies_info['year'].str.extract(r'(\d{4})', expand=False)
## Convert year to a numeric type (titles without a year become NaN)
movies_info['year'] = pd.to_numeric(movies_info['year'], errors='coerce')
## Create decades
movies_info['decades'] = np.floor((movies_info['year']-1900)/10) * 10 + 1900

# Remove the year component from title
movies_info['title'] = movies_info['title'].str.replace(r'(\(\d{4}\))', '', regex=True)
# Remove the leading and trailing blanks of the title
movies_info['title'] = movies_info['title'].str.strip(' ')
# Split the genres column
movies_info['genres'] = movies_info['genres'].str.split('|')

# Merge with links dataset
movies_info = movies_info.merge(links, on='movieId', how='left')
movies_info.head(10)

| |movieId|title|genres|year|decades|imdbId|tmdbId|
|:--|:--|:--|:--|:--|:--|:--|:--|
|0|1|Toy Story|[Adventure, Animation, Children, Comedy, Fantasy]|1995.0|1990.0|114709|862.0|
|1|2|Jumanji|[Adventure, Children, Fantasy]|1995.0|1990.0|113497|8844.0|
|2|3|Grumpier Old Men|[Comedy, Romance]|1995.0|1990.0|113228|15602.0|
|3|4|Waiting to Exhale|[Comedy, Drama, Romance]|1995.0|1990.0|114885|31357.0|
|4|5|Father of the Bride Part II|[Comedy]|1995.0|1990.0|113041|11862.0|
|5|6|Heat|[Action, Crime, Thriller]|1995.0|1990.0|113277|949.0|
|6|7|Sabrina|[Comedy, Romance]|1995.0|1990.0|114319|11860.0|
|7|8|Tom and Huck|[Adventure, Children]|1995.0|1990.0|112302|45325.0|
|8|9|Sudden Death|[Action]|1995.0|1990.0|114576|9091.0|
|9|10|GoldenEye|[Action, Adventure, Thriller]|1995.0|1990.0|113189|710.0|


In [83]:
movies_info.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 0 to 9741
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   title    9742 non-null   object 
 2   genres   9742 non-null   object 
 3   year     9729 non-null   float64
 4   decades  9729 non-null   float64
 5   imdbId   9742 non-null   int64  
 6   tmdbId   9734 non-null   float64
dtypes: float64(3), int64(2), object(2)
memory usage: 608.9+ KB


In [15]:
# Read in the user ratings for the dataset
ratings = pd.read_csv('./ml-latest-small/ratings.csv')
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


### 2.1 Popularity-based recommender system

The first approach is based purely on popularity: count the number of votes (ratings) each movie received from users and select the top movies.

In [84]:
# Top 10 most-voted movies
top_10_voted = ratings.groupby(by=['movieId'], as_index=False)['rating'].count().nlargest(10, 'rating').rename(columns={'rating':'votes'})
# Merge with movie info dataset to get information
top_10_voted = top_10_voted.merge(movies_info, on='movieId', how='left')
top_10_voted

| |movieId|votes|title|genres|year|decades|imdbId|tmdbId|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|356|329|Forrest Gump|[Comedy, Drama, Romance, War]|1994.0|1990.0|109830|13.0|
|1|318|317|Shawshank Redemption, The|[Crime, Drama]|1994.0|1990.0|111161|278.0|
|2|296|307|Pulp Fiction|[Comedy, Crime, Drama, Thriller]|1994.0|1990.0|110912|680.0|
|3|593|279|Silence of the Lambs, The|[Crime, Horror, Thriller]|1991.0|1990.0|102926|274.0|
|4|2571|278|Matrix, The|[Action, Sci-Fi, Thriller]|1999.0|1990.0|133093|603.0|
|5|260|251|Star Wars: Episode IV - A New Hope|[Action, Adventure, Sci-Fi]|1977.0|1970.0|76759|11.0|
|6|480|238|Jurassic Park|[Action, Adventure, Sci-Fi, Thriller]|1993.0|1990.0|107290|329.0|
|7|110|237|Braveheart|[Action, Drama, War]|1995.0|1990.0|112573|197.0|
|8|589|224|Terminator 2: Judgment Day|[Action, Sci-Fi]|1991.0|1990.0|103064|280.0|
|9|527|220|Schindler's List|[Drama, War]|1993.0|1990.0|108052|424.0|


The first approach does not take the rating values into consideration. A slightly more sophisticated option is to combine a votes score and a rating score in a weighted average, and select the top movies by that combined score.

In [85]:
# First, scale the number of votes to be 0-5.
votes_avg_ratings = ratings.groupby(by='movieId', as_index=False).agg({'userId': 'count', 'rating': 'mean'}).rename(columns={'userId': 'votes', 'rating': 'avg_rating'})
votes_avg_ratings['scaled_votes'] = (votes_avg_ratings['votes'] - votes_avg_ratings['votes'].min())/(votes_avg_ratings['votes'].max() - votes_avg_ratings['votes'].min()) * 5
# Then, Calculate weighted rating
votes_avg_ratings['weighted_rating'] = 0.8 * votes_avg_ratings['scaled_votes'] + 0.2 * votes_avg_ratings['avg_rating']
top_10_movies = votes_avg_ratings.nlargest(10, 'weighted_rating')

# Merge with movie info dataset to get information
top_10_movies = top_10_movies.merge(movies_info, on='movieId', how='left')
top_10_movies

| |movieId|votes|avg_rating|scaled_votes|weighted_rating|title|genres|year|decades|imdbId|tmdbId|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|356|329|4.164134|5.0|4.832827|Forrest Gump|[Comedy, Drama, Romance, War]|1994.0|1990.0|109830|13.0|
|1|318|317|4.429022|4.817073|4.739463|Shawshank Redemption, The|[Crime, Drama]|1994.0|1990.0|111161|278.0|
|2|296|307|4.197068|4.664634|4.571121|Pulp Fiction|[Comedy, Crime, Drama, Thriller]|1994.0|1990.0|110912|680.0|
|3|593|279|4.16129|4.237805|4.222502|Silence of the Lambs, The|[Crime, Horror, Thriller]|1991.0|1990.0|102926|274.0|
|4|2571|278|4.192446|4.222561|4.216538|Matrix, The|[Action, Sci-Fi, Thriller]|1999.0|1990.0|133093|603.0|
|5|260|251|4.231076|3.810976|3.894996|Star Wars: Episode IV - A New Hope|[Action, Adventure, Sci-Fi]|1977.0|1970.0|76759|11.0|
|6|110|237|4.031646|3.597561|3.684378|Braveheart|[Action, Drama, War]|1995.0|1990.0|112573|197.0|
|7|480|238|3.75|3.612805|3.640244|Jurassic Park|[Action, Adventure, Sci-Fi, Thriller]|1993.0|1990.0|107290|329.0|
|8|527|220|4.225|3.338415|3.515732|Schindler's List|[Drama, War]|1993.0|1990.0|108052|424.0|
|9|589|224|3.970982|3.39939|3.513709|Terminator 2: Judgment Day|[Action, Sci-Fi]|1991.0|1990.0|103064|280.0|
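
To make the scoring arithmetic explicit, here is a small standalone sketch that recomputes the weighted score for two rows of the output. The helper `weighted_rating` is a hypothetical name, and the minimum vote count of 1 is an assumption inferred from the `scaled_votes` values shown (for example 316 / 328 * 5 for 317 votes), not something the notebook prints.

```python
def weighted_rating(votes, avg_rating, min_votes, max_votes,
                    w_votes=0.8, w_rating=0.2):
    """Min-max scale the vote count to 0-5, then blend it with the average rating."""
    scaled_votes = (votes - min_votes) / (max_votes - min_votes) * 5
    return w_votes * scaled_votes + w_rating * avg_rating

# Forrest Gump: 329 votes (the maximum), average rating 4.164134
forrest_gump = weighted_rating(329, 4.164134, min_votes=1, max_votes=329)
# Shawshank Redemption: 317 votes, average rating 4.429022
shawshank = weighted_rating(317, 4.429022, min_votes=1, max_votes=329)
print(round(forrest_gump, 6), round(shawshank, 6))
```

With a weight of 0.8 on the vote count, the ranking stays close to the pure popularity ranking; a larger `w_rating` would let highly rated but less voted movies move up.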


### 2.2 Content-based Recommender System
In this approach, we construct a movie dataset with the attributes movieId, decade, and genres (each split into columns of binary indicators). Then we calculate the similarity between movies.

In [111]:
# Jaccard similarity score between one movie and another
def jaccard_binary(x, y):
    """
    Jaccard similarity for two binary vectors
    """
    intersection = np.logical_and(x, y)
    union = np.logical_or(x, y)
    similarity = intersection.sum() / float(union.sum())
    return similarity

In [117]:
# First select potential movies using the number of votes and average ratings (number of votes >= 20, average rating >= 2)
movie_ids = votes_avg_ratings[(votes_avg_ratings['votes'] >= 20) & (votes_avg_ratings['avg_rating'] >= 2)]['movieId']

In [118]:
# Prepare the profile for movies
from itertools import chain
movie_attr = movies_info.copy()
# Keep only the movies that meet the minimum votes and average rating
movie_attr = movie_attr[movie_attr['movieId'].isin(movie_ids)]
# Genres
unique_genres = list(set(chain.from_iterable(movies_info['genres'].to_list())))
unique_genres.remove('(no genres listed)')
for g in unique_genres:
    movie_attr[g] = movie_attr['genres'].apply(lambda x: g in x).astype(np.int8)
# Years
decades_pivot = movie_attr.pivot_table(index='movieId', columns='decades', values='year', aggfunc='count', fill_value=0).reset_index()

In [119]:
movie_attr = movie_attr.merge(decades_pivot, on='movieId', how='left')
movie_attr = movie_attr.drop(columns=['imdbId', 'tmdbId', 'genres', 'year', 'decades', 'title'])
movie_attr = movie_attr.set_index('movieId')
movie_attr.head()

|movieId|Comedy|IMAX|Romance|Musical|Documentary|War|Adventure|Drama|Animation|Film-Noir|...|1920.0|1930.0|1940.0|1950.0|1960.0|1970.0|1980.0|1990.0|2000.0|2010.0|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|1|0|0|0|0|0|1|0|1|0|...|0|0|0|0|0|0|0|1|0|0|
|2|0|0|0|0|0|0|1|0|0|0|...|0|0|0|0|0|0|0|1|0|0|
|3|1|0|1|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|1|0|0|
|5|1|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|1|0|0|
|6|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|1|0|0|


In [120]:
# Select a movie (fixed here to movieId 1, Toy Story)
movie_id_sel = 1
selected_movie = movie_attr.loc[movie_id_sel]
similarity_list = movie_attr.apply(lambda row: jaccard_binary(row, selected_movie), axis=1)
top_10_movies = similarity_list.sort_values(ascending=False).head(10)
movies_info[movies_info['movieId'].isin(top_10_movies.index)]

| |movieId|title|genres|year|decades|imdbId|tmdbId|
|:--|:--|:--|:--|:--|:--|:--|:--|
|0|1|Toy Story|[Adventure, Animation, Children, Comedy, Fantasy]|1995.0|1990.0|114709|862.0|
|506|588|Aladdin|[Adventure, Animation, Children, Comedy, Musical]|1992.0|1990.0|103639|812.0|
|551|661|James and the Giant Peach|[Adventure, Animation, Children, Fantasy, Musi...|1996.0|1990.0|116683|10539.0|
|559|673|Space Jam|[Adventure, Animation, Children, Comedy, Fanta...|1996.0|1990.0|117705|2300.0|
|1706|2294|Antz|[Adventure, Animation, Children, Comedy, Fantasy]|1998.0|1990.0|120587|8916.0|
|1757|2355|Bug's Life, A|[Adventure, Animation, Children, Comedy]|1998.0|1990.0|120623|9487.0|
|2355|3114|Toy Story 2|[Adventure, Animation, Children, Comedy, Fantasy]|1999.0|1990.0|120363|863.0|
|3000|4016|Emperor's New Groove, The|[Adventure, Animation, Children, Comedy, Fantasy]|2000.0|2000.0|120917|11688.0|
|3568|4886|Monsters, Inc.|[Adventure, Animation, Children, Comedy, Fantasy]|2001.0|2000.0|198781|585.0|
|6486|53121|Shrek the Third|[Adventure, Animation, Children, Comedy, Fantasy]|2007.0|2000.0|413267|810.0|
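
Note that filtering with `isin` returns the rows in the order of `movies_info`, not in order of similarity. The sketch below, on made-up toy data, shows one possible way to keep the ranking and attach the score; `catalog` and `scores` are hypothetical stand-ins for `movies_info` and `similarity_list`.

```python
import pandas as pd

# Hypothetical stand-ins for the movie table and the similarity scores
catalog = pd.DataFrame({"movieId": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
scores = pd.Series({3: 0.9, 1: 0.7, 4: 0.4, 2: 0.1}, name="similarity")

top = scores.sort_values(ascending=False).head(3)

# isin() keeps the catalogue's own row order, so the ranking is lost
by_isin = catalog[catalog["movieId"].isin(top.index)]

# Looking rows up by the ranked ids keeps the ranking; the score is attached too
ranked = catalog.set_index("movieId").loc[top.index].assign(similarity=top)

print(by_isin["title"].tolist())
print(ranked)
```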


### 2.3 Collaborative-Filtering

#### 2.3.1 Matrix-Decomposition Collaborative Filtering

In [122]:
from sklearn.decomposition import TruncatedSVD, NMF

In [None]:
# Build the user x movie rating matrix; unrated movies are filled with 0
ratings_pivot = ratings.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)

In [127]:
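# Factorize the rating matrix into 10 non-negative latent factors
model = NMF(n_components=10, init='random', random_state=43, max_iter=1000)
# W: user factors, shape (n_users, 10)
W = model.fit_transform(ratings_pivot)
# H: movie factors, shape (10, n_movies)
H = model.components_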
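
One way the factors could then be used is to score every movie for a user with `W[user] @ H` and keep the best-scoring movies the user has not rated. The sketch below uses tiny made-up factor matrices so that it runs on its own; `top_k_unseen`, `W_toy`, `H_toy`, and `observed_toy` are hypothetical names. With the notebook's data, the mask would plausibly be `ratings_pivot.to_numpy() > 0`.

```python
import numpy as np

def top_k_unseen(W, H, observed, user, k=2):
    """Rank the items a user has not rated by the reconstructed score W[user] @ H."""
    scores = W[user] @ H
    # Exclude items the user has already rated
    scores = np.where(observed[user], -np.inf, scores)
    n_unseen = int((~observed[user]).sum())
    return [int(i) for i in np.argsort(-scores)[:min(k, n_unseen)]]

# Toy factors: 2 users x 2 latent factors, 2 latent factors x 4 items
W_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
H_toy = np.array([[5.0, 4.0, 3.0, 0.5],
                  [0.5, 1.0, 4.0, 5.0]])
# True where the user has already rated the item
observed_toy = np.array([[True, False, False, False],
                         [False, False, False, True]])

print(top_k_unseen(W_toy, H_toy, observed_toy, user=0))
print(top_k_unseen(W_toy, H_toy, observed_toy, user=1))
```

The returned values are column positions; in the notebook they would still need to be mapped back to movie ids through `ratings_pivot.columns`.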

#### 2.3.2 Spark ALS for Collaborative-Filtering

In [1]:
# Pyspark Implementation for collaborative filtering
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, FloatType, TimestampType

In [2]:
# Create my_spark
my_spark = SparkSession.builder.getOrCreate()


In [3]:
# For ALS to work, userId and movieId must be integers
ratings_schema = StructType([
    # Define the name, type and nullability of each field
    StructField('userId', IntegerType(), True),
    StructField('movieId', IntegerType(), True),
    StructField('rating', FloatType(), True),
    StructField('timestamp', IntegerType(), True)
])
ratings_sp = my_spark.read.csv('./ml-latest-small/ratings.csv', header=True, schema=ratings_schema)
ratings_sp.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [4]:
movies_schema = StructType([
    # Define the name, type and nullability of each field
    StructField('movieId', IntegerType(), True),
    StructField('title', StringType(), True),
    StructField('genres', StringType(), True)
])
movies_sp = my_spark.read.csv('./ml-latest-small/movies.csv', header=True, schema=movies_schema)
movies_sp.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [6]:
movie_ratings = ratings_sp.join(movies_sp, 'movieId', 'left')
movie_ratings.show(5)

+-------+------+------+---------+--------------------+--------------------+
|movieId|userId|rating|timestamp|               title|              genres|
+-------+------+------+---------+--------------------+--------------------+
|      1|     1|   4.0|964982703|    Toy Story (1995)|Adventure|Animati...|
|      3|     1|   4.0|964981247|Grumpier Old Men ...|      Comedy|Romance|
|      6|     1|   4.0|964982224|         Heat (1995)|Action|Crime|Thri...|
|     47|     1|   5.0|964983815|Seven (a.k.a. Se7...|    Mystery|Thriller|
|     50|     1|   5.0|964982931|Usual Suspects, T...|Crime|Mystery|Thr...|
+-------+------+------+---------+--------------------+--------------------+
only showing top 5 rows



Sparsity of the dataset $ = 1 - \frac{\text{Number of Ratings}}{\text{Number of Users} \times \text{Number of Movies}} $

In [7]:
# Number of users
users_num = movie_ratings.select('userId').distinct().count()
# Number of movies
movies_num = movie_ratings.select('movieId').distinct().count()
# Number of ratings
ratings_num = movie_ratings.count()

# Calculate Sparsity
sparsity = 1 - ratings_num / (users_num * movies_num)
sparsity

0.9830003169443864

In [8]:
## Fitting a basic model
# Split the data into training (80%) and test (20%) sets
(training_data, test_data) = movie_ratings.randomSplit([0.8, 0.2])

In [9]:
# Build ALS model
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
als = ALS(
          userCol="userId",        # Name of the column that contains user ids
          itemCol="movieId",       # Name of the column that contains item ids
          ratingCol="rating",      # Name of the column that contains the users' ratings of the items
          rank=25,                 # Number of latent features k
          maxIter=100,             # Maximum number of iterations
          regParam=.05,            # Regularization parameter lambda
          nonnegative=True,        # Constrain the latent features to be non-negative
          coldStartStrategy="drop",  # Drop predictions that are NaN (users/items unseen in training)
          implicitPrefs=False)     # Ratings are explicit, not implicit

# Fit the model to training data
als_model = als.fit(training_data)
# Generate predictions on testing data for evaluation
predictions = als_model.transform(test_data)
# Create an evaluator
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
# Calculate the rmse for the testing data
rmse = evaluator.evaluate(predictions)

In the recorded run, this cell did not complete: a Spark executor raised `java.lang.StackOverflowError`, after which py4j reported `Py4JNetworkError: Answer from Java side is empty` and `ConnectionRefusedError: [Errno 61] Connection refused`, so no RMSE was produced. A likely cause (not verified here) is the long lineage built up by `maxIter=100` without a checkpoint directory; lowering `maxIter`, or setting a checkpoint directory on the Spark context before fitting, is a common remedy.
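
For reference, the RMSE metric requested from `RegressionEvaluator` is the square root of the mean squared difference between actual and predicted ratings. A plain-Python sketch on made-up numbers (the function name and the values are illustrative only):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Three hypothetical test ratings and their predictions
actual_ratings = [4.0, 3.0, 5.0]
predicted_ratings = [3.5, 3.0, 4.0]
print(rmse(actual_ratings, predicted_ratings))
```

A lower RMSE means the predicted ratings are, on average, closer to the held-out ratings.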

Parameter tuning

In [None]:
# Import the grid builder and the cross-validator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# ALS and the evaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Build the base ALS model; rank, maxIter and regParam are overridden by the grid below
als = ALS(
          userCol="userId",        # Name of the column that contains user ids
          itemCol="movieId",       # Name of the column that contains item ids
          ratingCol="rating",      # Name of the column that contains the users' ratings of the items
          rank=25,                 # Number of latent features k
          maxIter=100,             # Maximum number of iterations
          regParam=.05,            # Regularization parameter lambda
          nonnegative=True,        # Constrain the latent features to be non-negative
          coldStartStrategy="drop",  # Drop predictions that are NaN (users/items unseen in training)
          implicitPrefs=False)     # Ratings are explicit, not implicit

# Build the hyperparameter grid
param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [5, 40, 80, 120])
              .addGrid(als.maxIter, [5, 100, 250, 500])
              .addGrid(als.regParam, [.05, .1, 1.5])
              .build())

# Create the cross-validator, telling Spark which estimator, grid and evaluator to use
# (evaluator is the RMSE RegressionEvaluator created in the previous cell)
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)

# Run the cv on the training data
model = cv.fit(training_data)
# Extract best combination of values from cross validation
best_model = model.bestModel
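
Before launching a search like this, it can help to count how many models it implies. The sketch below enumerates the same three value lists with `itertools.product`; the variable names are illustrative and no Spark session is needed to run it.

```python
from itertools import product

# Same candidate values as in the parameter grid above
ranks = [5, 40, 80, 120]
max_iters = [5, 100, 250, 500]
reg_params = [0.05, 0.1, 1.5]
num_folds = 5

# Every combination of rank, maxIter and regParam
grid = list(product(ranks, max_iters, reg_params))
# Each combination is fitted once per fold during cross-validation
total_fits = len(grid) * num_folds
print(len(grid), total_fits)  # 48 240
```

With 48 combinations and 5 folds, cross-validation alone requires 240 ALS fits, some of them with up to 500 iterations, so a smaller grid may be a more practical starting point.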

## References
1. Datacamp Recommender Systems in python beginner tutorial: https://www.datacamp.com/tutorial/recommender-systems-python
2. A Complete Guide To Recommender Systems — Tutorial with Sklearn, Surprise, Keras, Recommenders, [Medium](https://towardsdatascience.com/a-complete-guide-to-recommender-system-tutorial-with-sklearn-surprise-keras-recommender-5e52e8ceace1)
3. Stanford Course Notes - Chapter 9 Recommendation systems