# Movie Recommender: Part 2: Collaborative Filtering

This Jupyter Notebook is part 2 of 3 in a series that builds a recommender system using PySpark and the [MovieLens](https://grouplens.org/datasets/movielens/) dataset from GroupLens. It uses the small dataset intended for education and development, which contains ~100,000 ratings of ~9,000 movies by ~600 users. The dataset was last updated in September 2018 (as of 3/3/2022), and its ratings were created between March 29th, 1996 and September 24th, 2018. More information can be found in the [dataset README](https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html).

We are interested in creating a recommender system that can accurately predict the rating a given user would give a movie. We start with collaborative filtering.

**Note**: The culmination of this project is a separate journal-formatted paper, so this Jupyter Notebook will have less text than usual.

Notebook breakdown:
- **Part 1:** Importing and EDA
- **Part 2:** Collaborative Filtering
- **Part 3:** Content-based Filtering

## Configuration:

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
INPUT_DIRECTORY = "/content/drive/MyDrive/Grad School/DSCI 632/MovieRecommender/data/" #for google mount
# INPUT_DIRECTORY = "./data/" #for jupyter notebook

In [3]:
%%capture 
#prevent large printout with %%capture

#Download Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#Install Apache Spark 3.2.1 with Hadoop 3.2, get zipped folder
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

#Unzip folder
!tar xvf spark-3.2.1-bin-hadoop3.2.tgz

#Install findspark, pyspark 3.2.1
!pip install -q findspark
!pip install pyspark==3.2.1

#Set variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.2.1-bin-hadoop3.2"

## Load Packages and Functions

In [4]:
from pyspark import SparkContext
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.sql import SparkSession

In [5]:
def get_movie_title_from_id(movieId):
  title =  movie_titles.loc[movie_titles["movieId"]==movieId,"title"].item()
  return title

In [6]:
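def get_user_recommended_movies(recs_df, userId):
  #filter to the requested user; the result is empty if the userId is unknown
  recommendations = recs_df.loc[recs_df["userId"] == userId, "recommendations"]
  if recommendations.empty:
    print("That userId does not exist in the dataset.  Try another.")
    return
  #use .iloc[0] (position), not [0] (index label), so this works for any userId
  for movie in recommendations.iloc[0]:
    print(f"Movie: \n{get_movie_title_from_id(movie[0])}\nPredicted Rating: {movie[1]}\n")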

## Import Data

In [7]:
#create SparkSession and SparkContext objects
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
print('Master : ', sc.master)
print('Cores  : ', sc.defaultParallelism)

Master :  local[*]
Cores  :  2


In [8]:
import pandas as pd

file_path = INPUT_DIRECTORY + "movies.csv"
movie_titles = pd.read_csv(file_path)
movie_titles.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [9]:
#Import data
file_path = INPUT_DIRECTORY + "ratings.csv"
ratings = spark.read.csv(file_path, header=True, inferSchema=True)
ratings.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



## ALS Model Creation

We'll split our data 80/20 into training and test sets and set `seed` to 42 for reproducibility:

In [10]:
ratings = ratings.select("userId", "movieId", "rating")
(training_data, test_data) = ratings.randomSplit([.8, .2], seed=42)
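
As a rough illustration of why a seeded random split is repeatable but only approximately 80/20, here is a pure-Python sketch. It is not Spark's implementation: `random_split` is a hypothetical helper and the integer rows are stand-ins for rating rows.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier boundary
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # hypothetical stand-in for rating rows
train, test = random_split(rows, [0.8, 0.2], seed=42)
train_again, test_again = random_split(rows, [0.8, 0.2], seed=42)

print(len(train), len(test))   # sizes are close to, but not exactly, 800/200
print(train == train_again)    # the same seed reproduces the same split
```

Because each row is assigned independently, the split sizes vary slightly around the requested proportions.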

Next, we initialize the model. We'll fix the following parameters before tuning hyperparameters:
- `nonnegative`: `True`. We constrain the learned factors to non-negative values, as a negative rating has no meaning in this context.
- `coldStartStrategy`: `"drop"`. A random split can leave a user or movie in the test set (or in a validation fold) with no ratings in the training data. ALS has no learned factors for them and returns `NaN` predictions, which would make the RMSE `NaN` as well. Dropping those rows keeps the metric computable.
- `implicitPrefs`: `False`. We have explicit ratings, so we don't need to use implicit feedback.

In [11]:
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", 
          nonnegative = True, coldStartStrategy = "drop", implicitPrefs = False)
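
To see why `coldStartStrategy="drop"` matters for the metric, here is a minimal pure-Python sketch. The three rating/prediction pairs are made up for illustration and `rmse` is a hypothetical helper, not a Spark API.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# Hypothetical (actual rating, predicted rating) pairs;
# NaN marks a user or movie that was not seen during training
predictions = [(4.0, 3.5), (3.0, 3.4), (5.0, math.nan)]

with_nan = rmse(predictions)

# Equivalent in spirit to coldStartStrategy="drop": discard NaN predictions first
dropped = [(r, p) for r, p in predictions if not math.isnan(p)]
without_nan = rmse(dropped)

print(math.isnan(with_nan))  # a single NaN prediction poisons the whole metric
print(without_nan)
```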

Now we'll build our `ParamGridBuilder`:

In [12]:
from pyspark.ml.tuning import ParamGridBuilder

param_grid = ParamGridBuilder() \
                  .addGrid(als.rank, [5, 20]) \
                  .addGrid(als.maxIter, [5]) \
                  .addGrid(als.regParam, [0.01, 0.05, 1]) \
                  .build()

Next, we'll create our evaluator and use RMSE as our metric:

In [13]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction") 
print ("Num models to be tested: ", len(param_grid))

Num models to be tested:  6


Create CrossValidator:

In [14]:
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator = als, 
                    estimatorParamMaps= param_grid,
                    evaluator = evaluator,
                    numFolds = 5)
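
The cost of this search can be estimated before fitting. The sketch below uses only the standard library and the grid values from this notebook; it counts the parameter combinations and the number of ALS fits that 5-fold cross-validation implies, assuming one additional refit of the best combination on the full training set.

```python
from itertools import product

# Grid values copied from the ParamGridBuilder cell above
ranks = [5, 20]
max_iters = [5]
reg_params = [0.01, 0.05, 1]
num_folds = 5

# Every combination of (rank, maxIter, regParam)
combos = list(product(ranks, max_iters, reg_params))

cv_fits = len(combos) * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1           # plus the final refit of the best combination

print(len(combos), cv_fits, total_fits)
```

Adding values to any one hyperparameter multiplies the number of fits, which is why the grid here is kept small.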

Fit Data:

In [15]:
model = cv.fit(training_data)

best_model = model.bestModel

Get information on the best model:

In [16]:
print(type(best_model))

print("\n**Best Model**")
print("  Rank:", best_model.rank)
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
print("  RegParam:", best_model._java_obj.parent().getRegParam())

<class 'pyspark.ml.recommendation.ALSModel'>

**Best Model**
  Rank: 5
  MaxIter: 5
  RegParam: 0.05


## Performance Evaluation

Let's generate predictions on the test data:

In [17]:
test_predictions = model.transform(test_data)
test_predictions.show(5)

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|   4896|   4.0| 3.3036947|
|   148|   5618|   3.0| 3.9722085|
|   148|   7153|   3.0|  3.810783|
|   148|  40629|   5.0| 3.6137528|
|   148|  40815|   4.0| 3.3124118|
+------+-------+------+----------+
only showing top 5 rows



In [18]:
# Evaluate the "test_predictions" dataframe
RMSE = evaluator.evaluate(test_predictions)

# Print the RMSE
print(RMSE)

0.9023503056630517
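
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below recomputes it by hand for only the five test rows displayed above, so its result will differ from the full test-set value; `rmse` is a hypothetical helper written for illustration.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# The five rows shown by test_predictions.show(5)
actual = [4.0, 3.0, 3.0, 5.0, 4.0]
predicted = [3.3036947, 3.9722085, 3.810783, 3.6137528, 3.3124118]

sample_rmse = rmse(actual, predicted)
print(sample_rmse)
```

An RMSE of roughly 0.9 means predictions are typically off by a little under one star.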


## Generate Recommendations:

In [19]:
# Generate top 5 movie recommendations for each user
userRecs = best_model.recommendForAllUsers(5)
userRecs.show(5, truncate=False)



+------+---------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                          |
+------+---------------------------------------------------------------------------------------------------------+
|1     |[{104875, 6.9180565}, {147410, 6.8609915}, {118270, 6.8609915}, {131610, 6.8609915}, {139640, 6.8609915}]|
|2     |[{26171, 6.088721}, {299, 5.79033}, {59018, 5.639411}, {60943, 5.639411}, {3814, 5.4988527}]             |
|3     |[{4442, 5.7323136}, {1241, 5.4825706}, {6835, 5.376806}, {5746, 5.376806}, {5181, 5.2396426}]            |
|4     |[{89904, 6.288658}, {104875, 6.110387}, {86345, 6.1036525}, {2295, 5.898678}, {1354, 5.8697443}]         |
|5     |[{25771, 5.653799}, {2295, 5.548646}, {8477, 5.5303373}, {25825, 5.4557447}, {1241, 5.4454384}]          |
+------+---------------------------------------------------------------------------------------------------------+
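
Conceptually, an ALS prediction is the dot product of a user's factor vector and a movie's factor vector, and the top-N recommendations are the movies with the largest products. The sketch below illustrates this in pure Python with made-up two-dimensional factors; `top_k` and all values are hypothetical, and this is not how Spark computes the result internally.

```python
import heapq

def top_k(user_vector, item_factors, k):
    """Score every item by dot product with the user vector; return the k best."""
    scores = {
        item_id: sum(u * v for u, v in zip(user_vector, item_vector))
        for item_id, item_vector in item_factors.items()
    }
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Hypothetical learned factors (rank 2) for one user and three movies
user_vector = [1.2, 0.8]
item_factors = {
    101: [2.0, 1.5],
    102: [0.5, 0.2],
    103: [3.0, 2.5],
}

recs = top_k(user_vector, item_factors, k=2)
print(recs)
```

Nothing in the dot product caps a score at the top of the rating scale, which is why the predicted ratings above exceed 5. `nonnegative=True` only keeps them from going below zero.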

In [20]:
userRecs_pandas = userRecs.toPandas()
userRecs_pandas.head()

   userId                                    recommendations
0       1  [(104875, 6.918056488037109), (147410, 6.86099...
1       2  [(26171, 6.088720798492432), (299, 5.790329933...
2       3  [(4442, 5.732313632965088), (1241, 5.482570648...
3       4  [(89904, 6.288658142089844), (104875, 6.110386...
4       5  [(25771, 5.653799057006836), (2295, 5.54864597...


In [21]:
get_movie_title_from_id(10)

'GoldenEye (1995)'

In [22]:
get_user_recommended_movies(userRecs_pandas, 1)

Movie: 
History of Future Folk, The (2012)
Predicted Rating: 6.918056488037109

Movie: 
A Perfect Day (2015)
Predicted Rating: 6.860991477966309

Movie: 
Hellbenders (2012)
Predicted Rating: 6.860991477966309

Movie: 
Willy/Milly (1986)
Predicted Rating: 6.860991477966309

Movie: 
Ooops! Noah is Gone... (2015)
Predicted Rating: 6.860991477966309



In [23]:
#try a user that doesn't exist (userId 2 has recommendations above; MovieLens userIds start at 1, so use 0)
get_user_recommended_movies(userRecs_pandas, 0)

That userId does not exist in the dataset.  Try another.


**View the `Part3_ContentBased.ipynb` file to see:**
- ALS Model Creation with Content-Based Filtering