# Movie Recommender: Part 2: Collaborative Filtering

This Jupyter Notebook is part 2 of 3 in building a recommender system using PySpark and the [MovieLens](https://grouplens.org/datasets/movielens/) dataset from GroupLens. It uses the small dataset intended for education and development, which contains ~100,000 ratings of ~9,000 movies by ~600 users. As of 3/3/2022, the dataset was last updated in September 2018. The ratings were created between March 29th, 1996 and September 24th, 2018. More information can be found in the [dataset README](https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html).

We are interested in creating a recommender system that can accurately predict the rating a given user would give a movie. This notebook covers the first approach, collaborative filtering.

**Note**: The culmination of this project is a separate journal-formatted paper, so this Jupyter Notebook will have less text than usual.

**Notebook breakdown:**
- **Part 1:** Importing and EDA
- **Part 2:** Collaborative Filtering
- **Part 3:** Content-based Filtering

## Configuration:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
INPUT_DIRECTORY = "/content/drive/MyDrive/Grad School/DSCI 632/MovieRecommender/data/" #for google mount
# INPUT_DIRECTORY = "./data/" #for jupyter notebook

In [3]:
%%capture 
#prevent large printout with %%capture

#Download Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#Install Apache Spark 3.2.1 with Hadoop 3.2, get zipped folder
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

#Unzip folder
!tar xvf spark-3.2.1-bin-hadoop3.2.tgz

#Install findspark, pyspark 3.2.1
!pip install -q findspark
!pip install pyspark==3.2.1

#Set variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.2.1-bin-hadoop3.2"

## Load Packages and Functions

In [4]:
from pyspark import SparkContext
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.sql import SparkSession

In [19]:
def get_movie_title_from_id(movieId):
  title =  movie_titles.loc[movie_titles["movieId"]==movieId,"title"].item()
  return title

In [20]:
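def get_user_recommended_movies(recs_df, userId):
  # Select this user's recommendations; the result is empty if the userId is unknown
  recommendations = recs_df.loc[recs_df["userId"] == userId, "recommendations"]
  if recommendations.empty:
    print("That userId does not exist in the dataset.  Try another.")
    return
  # Use .iloc[0] (positional); [0] is label-based and only matches the first row of recs_df
  for movie in recommendations.iloc[0]:
    print(f"Movie: \n{get_movie_title_from_id(movie[0])}\nPredicted Rating: {movie[1]}\n")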

## Import Data

In [5]:
#create SparkSession and SparkContext objects
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
  .master("local[*]") \
  .config("spark.executor.memory", "70g") \
  .config("spark.driver.memory", "50g") \
  .config("spark.memory.offHeap.enabled",True) \
  .config("spark.memory.offHeap.size","16g") \
  .getOrCreate()

print('Master : ', sc.master)
print('Cores  : ', sc.defaultParallelism)

Master :  local[*]
Cores  :  2


In [21]:
import pandas as pd

file_path = INPUT_DIRECTORY + "movies.csv"
movie_titles = pd.read_csv(file_path)
movie_titles.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
#Import data
file_path = INPUT_DIRECTORY + "ratings.csv"
ratings = spark.read.csv(file_path, header=True, inferSchema=True)
ratings.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



## ALS Model Creation

We'll split our data 80/20 into training and test sets and set `seed` to 42 for reproducibility:

In [8]:
ratings = ratings.select("userId", "movieId", "rating")
(training_data, test_data) = ratings.randomSplit([.8, .2], seed=42)
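`randomSplit` assigns each row to a split by an independent random draw, so the 80/20 proportions are approximate rather than exact. The plain-Python sketch below illustrates the idea only; `random_split` is a hypothetical helper, not Spark's implementation, and the integer rows stand in for rating rows.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split using one seeded random draw per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [.8, .2]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for rating rows
train_a, test_a = random_split(rows, [.8, .2], seed=42)
train_b, test_b = random_split(rows, [.8, .2], seed=42)
print(len(train_a), len(test_a))  # roughly 800 / 200, not exactly
print(train_a == train_b)         # same seed, same split
```

Fixing the seed is what makes the later RMSE comparable across reruns of the notebook.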

Initialize our model. We'll set the following parameters before optimizing hyperparameters:
- `nonnegative`: `True`. Constrains the learned latent factors to be non-negative, as a negative rating has no meaning in this context.
- `coldStartStrategy`: `"drop"`. A random split can leave a user or movie in the test (or validation) set with no ratings in the training set. ALS cannot predict for these rows and returns `NaN`, which would make the RMSE `NaN` as well. `"drop"` removes those rows from the predictions before RMSE is calculated.
- `implicitPrefs`: `False`. We have explicit ratings, so we don't need to use implicit feedback.
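To see why the cold-start setting matters for evaluation, here is a minimal plain-Python sketch. The rows are hypothetical, and user 999 stands in for a user who never appeared in the training data.

```python
import math

# Hypothetical (userId, movieId, rating, prediction) rows;
# user 999 was never seen in training, so the prediction is NaN
predictions = [
    (1, 10, 4.0, 3.5),
    (1, 20, 3.0, 3.5),
    (999, 10, 5.0, float("nan")),
]

def mean_squared_error(rows):
    return sum((rating - pred) ** 2 for _, _, rating, pred in rows) / len(rows)

# Roughly what coldStartStrategy="drop" does to the prediction rows
kept = [row for row in predictions if not math.isnan(row[3])]

print(math.isnan(mean_squared_error(predictions)))  # True: one NaN poisons the metric
print(mean_squared_error(kept))                     # 0.25
```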

In [9]:
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", 
          nonnegative = True, coldStartStrategy = "drop", implicitPrefs = False)

Now we'll build our hyperparameter grid with `ParamGridBuilder`:

In [10]:
from pyspark.ml.tuning import ParamGridBuilder

param_grid = ParamGridBuilder() \
                  .addGrid(als.rank, [5, 20]) \
                  .addGrid(als.maxIter, [5]) \
                  .addGrid(als.regParam, [0.01, 0.05, 1]) \
                  .build()
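The grid is the Cartesian product of the value lists. As a rough illustration in plain Python (not how `ParamGridBuilder` is implemented), the same combinations can be enumerated with `itertools.product`:

```python
from itertools import product

grid = {
    "rank": [5, 20],
    "maxIter": [5],
    "regParam": [0.01, 0.05, 1],
}

# One dict per combination of values
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
for combo in combinations:
    print(combo)
print(len(combinations))  # 2 * 1 * 3 = 6
```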

Next, we'll create our evaluator and use RMSE as our metric:

In [11]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction") 
print ("Num models to be tested: ", len(param_grid))

Num models to be tested:  6
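For reference, the `rmse` metric is the standard root-mean-square error. The symbols below are notation introduced here, not taken from the notebook: $T$ is the set of (user, movie) pairs being scored (after any cold-start rows are dropped), $r_{ui}$ is the actual rating and $\hat{r}_{ui}$ is the predicted rating.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^2}
```

Lower is better, and the value is in the same units as the ratings (stars).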


Create CrossValidator:

In [12]:
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator = als, 
                    estimatorParamMaps= param_grid,
                    evaluator = evaluator,
                    numFolds = 5)
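With 5 folds and the 6 parameter combinations reported above, cross-validation fits 30 models, then refits the best combination once on the full training set. The sketch below counts those fits in plain Python. `k_fold_indices` is a hypothetical helper that uses a simple strided assignment, whereas Spark assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation

num_param_combos = 6  # size of param_grid
num_folds = 5

fits = 0
for train, validation in k_fold_indices(n_rows=20, num_folds=num_folds):
    for _ in range(num_param_combos):
        fits += 1     # one model fit per (fold, combination)
fits += 1             # final refit of the best combination on all training data
print(fits)           # 31
```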

Fit Data:

In [13]:
model = cv.fit(training_data)

best_model = model.bestModel

Get information on the best model:

In [14]:
print(type(best_model))

print("\n**Best Model**")
print("  Rank:", best_model.rank)
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
print("  RegParam:", best_model._java_obj.parent().getRegParam())

<class 'pyspark.ml.recommendation.ALSModel'>

**Best Model**
  Rank: 5
  MaxIter: 5
  RegParam: 0.05


## Performance Evaluation

Let's generate predictions on the test data. `model` is the fitted `CrossValidatorModel`, which predicts using its best model:

In [15]:
test_predictions = model.transform(test_data)
test_predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|   4896|   4.0|  3.613617|
|   148|   5618|   3.0|   3.69381|
|   148|   7153|   3.0| 3.5649483|
|   148|  40629|   5.0| 3.4003148|
|   148|  40815|   4.0| 3.6592054|
|   148|  60069|   4.5|  3.824732|
|   148|  68954|   4.0| 3.6093912|
|   148|  69844|   4.0| 3.6372879|
|   148|  79132|   1.5| 3.4755905|
|   148|  79702|   4.0| 3.2865078|
|   148|  81834|   4.0| 4.0275917|
|   148|  81847|   4.5| 3.2648077|
|   148|  98243|   4.5|  3.290474|
|   148|  98491|   5.0|  3.809442|
|   148| 108932|   4.0| 3.4892316|
|   463|   1088|   3.5| 3.8741417|
|   463|   1221|   4.5| 4.1042433|
|   463|   2028|   4.5|  4.180805|
|   463|   2167|   3.0|  3.478098|
|   463|   3448|   3.0| 4.2075872|
+------+-------+------+----------+
only showing top 20 rows



In [16]:
# Evaluate the "test_predictions" dataframe
RMSE = evaluator.evaluate(test_predictions)

# Print the RMSE
print(RMSE)

0.903445587157575


## Generate Recommendations:

In [17]:
# Generate top 10 movie recommendations for each user
userRecs = best_model.recommendForAllUsers(10)
userRecs.show(5, truncate=False)



+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                                                                                                                         |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1     |[{96004, 6.5599866}, {3379, 6.5599866}, {6201, 6.3621697}, {8235, 6.3621697}, {2295, 6.2520733}, {8477, 6.06789}, {1684, 6.0549073}, {33649, 6.0447793}, {183897, 6.018463}, {7008, 6.0062227}]         |
|3     |[{4821, 6.7343135}, {26171, 5.7091475}, {2303, 5.3618293}, {5075, 5.3373117}, {6835, 5.274478}, {5746, 5.274478}, {68073, 5.192239}, {5181, 5.182099}, {

In [18]:
userRecs_pandas = userRecs.toPandas()
userRecs_pandas.head()

,userId,recommendations
0,1,"[(96004, 6.559986591339111), (3379, 6.55998659..."
1,3,"[(4821, 6.734313488006592), (26171, 5.70914745..."
2,5,"[(8477, 5.729140758514404), (187717, 5.5115857..."
3,6,"[(183897, 6.354036808013916), (112804, 6.34918..."
4,9,"[(160565, 6.128843784332275), (8477, 6.0538110..."
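Each `recommendations` entry holds (movieId, predicted rating) pairs ordered from highest to lowest prediction. Conceptually this is a top-N selection over a user's predicted scores, which can be sketched in plain Python as follows. The scores are hypothetical, and this is not how Spark computes the result internally.

```python
import heapq

# Hypothetical predicted ratings for one user, keyed by movieId
predicted = {101: 4.2, 102: 3.1, 103: 4.8, 104: 2.5, 105: 4.5}

def top_n(scores, n):
    """Return the n (movieId, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda pair: pair[1])

print(top_n(predicted, 3))  # [(103, 4.8), (105, 4.5), (101, 4.2)]
```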


In [22]:
get_movie_title_from_id(10)

'GoldenEye (1995)'

In [23]:
get_user_recommended_movies(userRecs_pandas, 1)

Movie: 
Dragon Ball Z: The History of Trunks (Doragon bôru Z: Zetsubô e no hankô!! Nokosareta chô senshi - Gohan to Torankusu) (1993)
Predicted Rating: 6.559986591339111

Movie: 
On the Beach (1959)
Predicted Rating: 6.559986591339111

Movie: 
Lady Jane (1986)
Predicted Rating: 6.3621697425842285

Movie: 
Safety Last! (1923)
Predicted Rating: 6.3621697425842285

Movie: 
Impostors, The (1998)
Predicted Rating: 6.252073287963867

Movie: 
Jetée, La (1962)
Predicted Rating: 6.067890167236328

Movie: 
Mrs. Dalloway (1997)
Predicted Rating: 6.054907321929932

Movie: 
Saving Face (2004)
Predicted Rating: 6.044779300689697

Movie: 
Isle of Dogs (2018)
Predicted Rating: 6.018463134765625

Movie: 
Last Tango in Paris (Ultimo tango a Parigi) (1972)
Predicted Rating: 6.006222724914551
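The predicted ratings above exceed 5.0, the top of the MovieLens star scale (0.5 to 5.0), because the model's output is not bounded to that range. One option for display, shown here as an assumption rather than something this notebook does, is to clip predictions to the scale. `clip_rating` is a hypothetical helper.

```python
MIN_RATING, MAX_RATING = 0.5, 5.0  # MovieLens star scale

def clip_rating(prediction):
    """Clamp a predicted rating to the valid star range."""
    return max(MIN_RATING, min(MAX_RATING, prediction))

print(clip_rating(6.559986591339111))  # 5.0
print(clip_rating(3.613617))           # 3.613617
```

Clipping is best applied only when displaying results, since ranking on clipped values would turn every prediction above 5.0 into a tie.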



In [24]:
#try a user that doesn't exist
get_user_recommended_movies(userRecs_pandas, 2)

That userId does not exist in the dataset.  Try another.
