
## Recommendation Systems

## Part 1: PySpark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=49654121086100d533d9661f10289604850528cbc6266f56f563dc8f22eb77a2
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1
The following packages were automatically installed and are no longer required:
  libnvidia-common-460 nsight-compute-2020.2.0
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
 

Now we authenticate a Google Drive client to download the files we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# `file_id` is used instead of `id` to avoid shadowing the built-in id()
file_id = '1QtPy_HuIMSzhtYllT3-WeM3Sqg55wK_D'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('MovieLens.training')

file_id = '1ePqnsQTJRRvQcBoF2EhoPU8CU1i5byHK'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('MovieLens.test')

file_id = '1ncUBWdI5AIt3FDUJokbMqpHD2knd5ebp'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('MovieLens.item')

If you executed the cells above, you should be able to see the dataset we will use for this Colab under the "Files" tab on the left panel.

Next, we import some of the common libraries needed for our task.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

Let's initialize the Spark context.

In [None]:
# configure Spark: serve the web UI on port 4050
conf = SparkConf().set("spark.ui.port", "4050")

# create the context, then the session on top of it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

You can easily check the current version and get the link to the web interface. In the Spark UI, you can monitor the progress of your job and debug performance bottlenecks (if your Colab is running with a **local runtime**).

In [None]:
spark

If you are running this Colab on the Google-hosted runtime, the cell below will create an *ngrok* tunnel that still lets you check the Spark UI.

### Data Loading

In this Colab, we will be using the [MovieLens dataset](https://grouplens.org/datasets/movielens/), specifically the 100K dataset (which contains 100,000 ratings in total from about 1,000 users on about 1,700 movies).

We load the ratings data in an 80%-20% `training`/`test` split, while the `items` dataframe contains the movie titles associated with the item identifiers.

In [None]:
schema_ratings = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("item_id", IntegerType(), False),
    StructField("rating", IntegerType(), False),
    StructField("timestamp", IntegerType(), False)])

schema_items = StructType([
    StructField("item_id", IntegerType(), False),
    StructField("movie", StringType(), False)])

training = spark.read.option("sep", "\t").csv("MovieLens.training", header=False, schema=schema_ratings)
test = spark.read.option("sep", "\t").csv("MovieLens.test", header=False, schema=schema_ratings)
items = spark.read.option("sep", "|").csv("MovieLens.item", header=False, schema=schema_items)



In [None]:
training.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- timestamp: integer (nullable = true)



In [None]:
items.printSchema()

root
 |-- item_id: integer (nullable = true)
 |-- movie: string (nullable = true)



Let's compute some stats! What is the number of ratings in the training and test datasets? How many movies are in our dataset?

In [None]:
# number of ratings in the training set
tr_n = training.count()
print("Training Count", tr_n)
# number of ratings in the test set
ts_n = test.count()
print("test count", ts_n)

Training Count 80000
test count 20000


Using the training set, train a model with the Alternating Least Squares (ALS) method available in Spark MLlib: [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)

Use `maxIter=5` and `regParam=0.01`.
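
To make the "alternating" idea concrete, here is a minimal pure-Python sketch of regularized ALS with a single latent factor per user and per item. It is not Spark's implementation (which uses a configurable rank and runs distributed); the toy ratings, the function names `als_rank1` and `squared_error`, and the all-ones initialization are assumptions made for illustration. With one side held fixed, each factor on the other side has a closed-form ridge-regression solution, so the two sides are updated in turn.

```python
def als_rank1(ratings, n_iters=5, reg=0.01):
    """Fit r[u, i] ~ p[u] * q[i] by alternating closed-form updates.

    ratings: dict mapping (user, item) -> rating, observed entries only.
    """
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    p = {u: 1.0 for u in users}  # user factors (arbitrary starting point)
    q = {i: 1.0 for i in items}  # item factors (arbitrary starting point)

    for _ in range(n_iters):
        # Hold q fixed: each user factor is a 1-D ridge regression.
        for u in users:
            num = sum(r * q[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(q[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            p[u] = num / (den + reg)
        # Hold p fixed: each item factor is a 1-D ridge regression.
        for i in items:
            num = sum(r * p[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(p[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            q[i] = num / (den + reg)
    return p, q


def squared_error(ratings, p, q):
    """Sum of squared residuals over the observed ratings."""
    return sum((r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items())


# Hypothetical toy data: 3 users, 3 items, 6 observed ratings.
toy = {(1, 10): 4, (1, 20): 2, (2, 10): 5, (2, 30): 1, (3, 20): 3, (3, 30): 2}
p0 = {u: 1.0 for u in (1, 2, 3)}
q0 = {i: 1.0 for i in (10, 20, 30)}
p, q = als_rank1(toy)
print("error before:", squared_error(toy, p0, q0))
print("error after: ", squared_error(toy, p, q))
```

Each update can only lower the regularized objective, which is why a handful of iterations (here 5, matching `maxIter`) is often enough for a first model.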

In [None]:
from pyspark.ml.recommendation import ALS

In [None]:
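# coldStartStrategy='drop' removes test rows whose prediction is NaN
# (users or items that never appeared in the training data)
als = ALS(maxIter=5, regParam=0.01, userCol='user_id', itemCol='item_id', ratingCol='rating', coldStartStrategy='drop')
model = als.fit(training)
predict = model.transform(test)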

In [None]:
predict.show()

+-------+-------+------+---------+----------+
|user_id|item_id|rating|timestamp|prediction|
+-------+-------+------+---------+----------+
|    148|      1|     4|877019411| 4.1824536|
|    148|      7|     5|877017054| 4.1388817|
|    148|     70|     5|877021271|0.46862817|
|    148|     71|     5|877019251| 2.8146887|
|    148|     78|     1|877399018| 2.5061257|
|    148|     98|     3|877017714| 4.8495936|
|    148|    114|     5|877016735| 3.0446277|
|    148|    116|     5|877398648|-0.5782753|
|    148|    140|     1|877019882| 2.9579701|
|    148|    163|     4|877021402| 3.8615117|
|    148|    169|     5|877020297| 3.9293299|
|    148|    172|     5|877016513| 5.6955647|
|    148|    177|     2|877020715|  6.429349|
|    148|    185|     1|877398385|  5.927431|
|    148|    204|     3|877016912| 3.6045551|
|    148|    214|     5|877019882| 3.2601361|
|    148|    228|     4|877016514| 4.6114254|
|    148|    357|     5|877016735| 3.1947732|
|    148|    408|     5|877399018|

Now compute the RMSE on the test dataset.
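
For reference, the metric is $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,i)} (R_{u,i} - \hat{R}_{u,i})^2}$, where the sum runs over the $N$ rated (user, item) pairs being evaluated. The pure-Python sketch below computes it on three rows copied from the prediction table above; the helper names `rmse_score` and `clip` are assumptions for illustration. It also shows that clamping predictions to the valid 1-5 range cannot increase the error when the true ratings lie in that range.

```python
import math


def rmse_score(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def clip(value, low=1.0, high=5.0):
    """Clamp a prediction to the rating scale [low, high]."""
    return max(low, min(high, value))


# (rating, prediction) pairs taken from three rows of the table above.
rows = [(5, -0.5782753), (2, 6.429349), (1, 5.927431)]
actual = [r for r, _ in rows]
raw = [p for _, p in rows]
clipped = [clip(p) for p in raw]

print("raw RMSE:    ", rmse_score(actual, raw))
print("clipped RMSE:", rmse_score(actual, clipped))
```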


In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# named `evaluator` rather than `eval`, which would shadow the built-in eval()
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(predict)
print("Root Mean Square Error", rmse)


Root Mean Square Error 1.1295064739701401


At this point, you can use the trained model to produce the top-K recommendations for each user.  Recommend the top three movies for each user. 
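
Conceptually, a top-K list for one user comes from scoring every item with the dot product of the user's and the item's latent factors, then keeping the K highest scores. The sketch below shows that step in pure Python; the factor values and the name `top_k_items` are hypothetical and are not part of Spark's API.

```python
import heapq


def top_k_items(user_factor, item_factors, k=3):
    """Return the k (item_id, score) pairs with the highest dot-product score."""
    scores = (
        (item_id, sum(u * v for u, v in zip(user_factor, factor)))
        for item_id, factor in item_factors.items()
    )
    # nlargest keeps only k candidates in memory instead of sorting everything.
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Hypothetical 2-dimensional factors for one user and four items.
user_factor = [1.0, 0.5]
item_factors = {
    101: [4.0, 0.0],
    102: [1.0, 2.0],
    103: [3.0, 4.0],
    104: [0.0, 1.0],
}
print(top_k_items(user_factor, item_factors, k=3))
```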

In [None]:
recommendations= model.recommendForAllUsers(3)




In [None]:
recommendations.select('user_id', 'recommendations.item_id', 'recommendations.rating').show()


+-------+------------------+--------------------+
|user_id|           item_id|              rating|
+-------+------------------+--------------------+
|      1| [1643, 1368, 320]|[6.9078393, 6.706...|
|      3| [1268, 854, 1065]|[7.8648257, 7.800...|
|      5|  [968, 1077, 601]|[7.4901333, 7.456...|
|      6|  [1137, 919, 906]|[6.2349424, 5.650...|
|      9| [1059, 1483, 916]|[11.311695, 11.25...|
|     12| [962, 1065, 1394]|[9.937338, 8.5124...|
|     13|[1184, 1473, 1643]|[6.290877, 6.2048...|
|     15|[1137, 1159, 1540]|[7.2059, 7.177743...|
|     16| [965, 1643, 1589]|[8.001853, 7.9140...|
|     17|  [1160, 253, 865]|[10.695154, 10.25...|
|     19|[1268, 1205, 1172]|[10.449025, 10.22...|
|     20|  [916, 1615, 774]|[12.800352, 10.97...|
|     22|  [253, 583, 1643]|[8.234476, 8.0419...|
|     26| [1643, 793, 1195]|[5.546549, 5.1324...|
|     27|  [253, 1319, 583]|[13.237639, 11.83...|
|     28|[1059, 1643, 1131]|[6.588251, 6.3342...|
|     31| [955, 1643, 1240]|[7.801644, 7.3746...|


Print the names of the movies recommended for user 444.

In [None]:
# item ids of the three movies recommended for user 444
movie_ids = recommendations.where(recommendations.user_id == 444).select('recommendations.item_id').collect()[0][0]
for i, movie_id in enumerate(movie_ids):
  print('Movie', i+1)
  # look up the title of each recommended item
  items.filter(items.item_id == movie_id).show()


Movie 1
+-------+--------------------+
|item_id|               movie|
+-------+--------------------+
|   1160|Love! Valour! Com...|
+-------+--------------------+

Movie 2
+-------+--------------------+
|item_id|               movie|
+-------+--------------------+
|    989|Cats Don't Dance ...|
+-------+--------------------+

Movie 3
+-------+--------------------+
|item_id|               movie|
+-------+--------------------+
|    253|Pillow Book, The ...|
+-------+--------------------+



## Part 2: Collaborative Filtering

In [None]:
# install surprise to build recommender in python
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1630168 sha256=e9c30666b06241dd77de8afa17f32a2e3314ac0f0c28bfe4c78d94311765385e
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


### Task. Memory-based Filtering 

Your task is to train a predictor using the `training` set provided above, and make predictions on the `test` set.

A. User-based recommendation

To predict user $u$'s rating of item $i$ ($R_{u, i}$), user-based recommendation finds the top-N neighbors of $u$ who have already rated $i$ and takes the average of their ratings (unweighted, or weighted by their similarity to $u$) as the prediction $\hat{R}_{u,i}$.
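
A minimal pure-Python sketch of this rule, in its mean-centered and similarity-weighted form, may help. It uses the mean squared difference (MSD) similarity, $1 / (1 + \mathrm{msd})$, that the library logs further down refer to. The toy ratings and the function names are assumptions for illustration; real implementations add details such as a minimum number of neighbors and clipping to the rating scale.

```python
def msd_similarity(ratings, u, v):
    """1 / (1 + mean squared difference) over items rated by both users."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    msd = sum((ratings[u][i] - ratings[v][i]) ** 2 for i in common) / len(common)
    return 1.0 / (1.0 + msd)


def predict_user_based(ratings, u, item, k=2):
    """Mean-centered, similarity-weighted average over the k most similar raters."""
    mean = {w: sum(r.values()) / len(r) for w, r in ratings.items()}
    # Candidate neighbors: every other user who has rated the target item.
    neighbors = [
        (msd_similarity(ratings, u, v), v)
        for v in ratings
        if v != u and item in ratings[v]
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return mean[u]  # no usable neighbor: fall back to the user's own mean
    offset = sum(sim * (ratings[v][item] - mean[v]) for sim, v in neighbors)
    return mean[u] + offset / total_sim


# Hypothetical ratings: "ann" has not rated item "C" yet.
ratings = {
    "ann": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "cat": {"A": 1, "B": 5, "C": 2},
}
print(predict_user_based(ratings, "ann", "C", k=1))
print(predict_user_based(ratings, "ann", "C", k=2))
```

With `k=1` only the identical-taste neighbor "bob" counts; with `k=2` the dissimilar "cat" also contributes, but with a much smaller weight.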


(1). Using the default parameters, report the *RMSE* on the training and test sets.

In [None]:
from surprise import KNNWithMeans
from surprise import Reader
from surprise import Dataset


In [None]:
from sklearn.model_selection import train_test_split

# convert the PySpark dataframes to pandas; hold out 25% of the training data for evaluation
training_df, testing_df = train_test_split(training.toPandas(), test_size=0.25)
test_df= test.toPandas()

# wrap each dataframe in a Surprise Dataset
reader= Reader(rating_scale=(0,5))
training_data = Dataset.load_from_df(training_df[['user_id','item_id','rating']], reader)
testing_data= Dataset.load_from_df(testing_df[['user_id','item_id','rating']], reader)
test_data= Dataset.load_from_df(test_df[['user_id','item_id','rating']], reader)

# build the Surprise trainset and testsets from (user, item, rating, timestamp) tuples
tr= Dataset.construct_trainset(training_data, [(uid, iid, r, None) for (uid, iid, r) in zip(training_df['user_id'], training_df['item_id'], training_df['rating'])])
ts= Dataset.construct_testset(testing_data, [(uid, iid, r, None) for (uid, iid, r) in zip(testing_df['user_id'], testing_df['item_id'], testing_df['rating'])])
testset= Dataset.construct_testset(test_data, [(uid, iid, r, None) for (uid, iid, r) in zip(test_df['user_id'], test_df['item_id'], test_df['rating'])])

In [None]:
# fit on the 75% split, predict on the held-out 25% of the training data
predictions_tr_user= KNNWithMeans().fit(tr).test(ts)

Computing the msd similarity matrix...
Done computing similarity matrix.


Calculate the RMSE between the actual ratings $R$ and the predicted ratings $\hat{R}$ on the held-out part of the training set.

In [None]:
#rmse train set
from surprise.accuracy import rmse
print("RMSE On Train", rmse(predictions_tr_user))

RMSE: 0.9676
RMSE On Train 0.9675561391297695


Now let's make predictions on the test set.

In [None]:
training_df1= training.toPandas()
training_data1 = Dataset.load_from_df(training_df1[['user_id','item_id','rating']], reader)
trainset= Dataset.construct_trainset(training_data1, [(uid, iid, r, None) for (uid, iid, r) in zip(training_df1['user_id'], training_df1['item_id'], training_df1['rating'])])

# fit on the full training set, predict on the test set
predictions_ts_user= KNNWithMeans().fit(trainset).test(testset)

Computing the msd similarity matrix...
Done computing similarity matrix.


Calculate the RMSE between the actual ratings $R$ and the predicted ratings $\hat{R}$ from the trained user-based recommender.

In [None]:
#rmse test set
print("RMSE On Test Data", rmse(predictions_ts_user))

RMSE: 0.9663
RMSE On Test Data 0.9663023895782573


(2). Now display the top 10 movies for user 10, ranked by the predicted rating scores in the test set.
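
One way to read this task, sketched in pure Python: keep only the predictions for the target user, sort them by estimated rating in descending order, and take the first N. The `Prediction` tuple below is a stand-in that mirrors the field names (`uid`, `iid`, `est`) used in the cell that follows; the sample values are made up.

```python
from collections import namedtuple

# Stand-in for the library's prediction records; only the fields used here.
Prediction = namedtuple("Prediction", ["uid", "iid", "est"])


def top_n_for_user(predictions, uid, n=10):
    """Item ids with the n highest estimated ratings for one user, best first."""
    mine = [p for p in predictions if p.uid == uid]
    mine.sort(key=lambda p: p.est, reverse=True)  # descending: best first
    return [p.iid for p in mine[:n]]


# Hypothetical predictions for users 10 and 11.
predictions = [
    Prediction(10, 1, 3.2),
    Prediction(10, 2, 4.7),
    Prediction(11, 2, 4.9),
    Prediction(10, 3, 2.1),
    Prediction(10, 4, 4.0),
]
print(top_n_for_user(predictions, uid=10, n=3))
```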

In [None]:
type(items.toPandas())

pandas.core.frame.DataFrame

In [None]:
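temp = pd.DataFrame(predictions_ts_user)

# for each of users 1-10, list 10 test-set movies ordered by estimated rating
# note: sort_values('est') sorts ascending, so [:10] keeps the lowest estimates;
# pass ascending=False to list the highest-rated movies first
for i in range(1,11):
  df = temp[temp.uid == i].sort_values('est')[:10]
  print(" For user", i)
  for iid in df.iid:
    # look up the movie title for this item id
    movie= items[items.item_id == iid].collect()
    print(movie[0][1], end= ",")
  print("\n")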

 For user 1
Mad Love (1995),Striptease (1996),Jungle2Jungle (1997),All Dogs Go to Heaven 2 (1996),Free Willy (1993),Theodore Rex (1995),Flipper (1996),Kansas City (1996),Batman & Robin (1997),Kull the Conqueror (1997),

 For user 2
3 Ninjas: High Noon At Mega Mountain (1998),In & Out (1997),Up Close and Personal (1996),Hoodlum (1997),Fierce Creatures (1997),Midnight in the Garden of Good and Evil (1997),River Wild, The (1994),Once Upon a Time... When We Were Colored (1995),FairyTale: A True Story (1997),Mighty Aphrodite (1995),

 For user 3
Hoodlum (1997),Critical Care (1997),Mimic (1997),Dante's Peak (1997),Prophecy II, The (1998),Hard Rain (1998),How to Be a Player (1997),Alien: Resurrection (1997),Liar Liar (1997),Fallen (1998),

 For user 4
Event Horizon (1997),Liar Liar (1997),Mimic (1997),Client, The (1994),Scream (1996),Wedding Singer, The (1998),Ulee's Gold (1997),Incognito (1997),Star Wars (1977),One Flew Over the Cuckoo's Nest (1975),

 For user 5
Children of the Corn: The Ga

(3). From what we learned in class, the number of nearest neighbors ($k$) considered for the rating estimate $\hat{R}$ is an important hyperparameter that affects the prediction results. Repeat the training procedure above with different numbers of nearest neighbors (2-10), find the optimal $k$ in your experiment, and report the corresponding *RMSE* on the test set.

In [None]:
# for each k in 2..10, fit on the 75% split `tr` and record the test RMSE
rmse_k = []
for k in range(2, 11):
  predictions = KNNWithMeans(k=k).fit(tr).test(testset)
  rmse_k.append((k, rmse(predictions)))

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.1426
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0916
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0638
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0450
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0334
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0239
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0164
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0108
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0066


In [None]:
# pick the k with the smallest RMSE
best_kvalue = min(rmse_k, key=lambda x: x[1])[0]
print("Optimal K value", best_kvalue)

Optimal K value 10
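
The RMSE values printed by the loop above fall at every step as $k$ grows, so the minimum sits at the edge of the search range. That suggests the true optimum may lie beyond $k = 10$ and that a wider range would be worth trying. The sketch below re-uses the rounded RMSE values from the log above to make this check explicit.

```python
# (k, test RMSE) pairs copied from the rounded values in the log above.
rmse_k = [(2, 1.1426), (3, 1.0916), (4, 1.0638), (5, 1.0450), (6, 1.0334),
          (7, 1.0239), (8, 1.0164), (9, 1.0108), (10, 1.0066)]

# min() with a key reads the best pair without sorting the whole list.
best_k, best_rmse = min(rmse_k, key=lambda pair: pair[1])

# True when every step to a larger k lowered the RMSE.
strictly_decreasing = all(a[1] > b[1] for a, b in zip(rmse_k, rmse_k[1:]))

# True when the best k is the largest value that was tried.
best_is_at_upper_edge = best_k == max(k for k, _ in rmse_k)

print(best_k, best_rmse, strictly_decreasing, best_is_at_upper_edge)
```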


Note that we can write a for-loop to iterate through different choices, but scikit-surprise provides a simplified cross-validation interface (`surprise.model_selection.GridSearchCV`) to tune such hyperparameters.
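
To show what such a grid search does internally, here is a minimal pure-Python sketch: split the sample indices into folds, score every candidate value on each fold, and keep the value with the lowest mean score. The names `k_fold_indices`, `grid_search` and `toy_score` are hypothetical, the scoring function is a stand-in for "fit on the training folds, compute RMSE on the validation fold", and unlike typical library implementations this sketch does not shuffle the data.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if f < n_samples % n_folds else 0)
                  for f in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def grid_search(param_values, n_samples, score, n_folds=5):
    """Return (best_param, mean_score) with the lowest mean validation score."""
    results = []
    for value in param_values:
        fold_scores = [score(value, train, val)
                       for train, val in k_fold_indices(n_samples, n_folds)]
        results.append((value, sum(fold_scores) / len(fold_scores)))
    return min(results, key=lambda pair: pair[1])


def toy_score(k, train_idx, val_idx):
    """Hypothetical error curve with its minimum at k = 6."""
    return abs(k - 6) + len(val_idx) / 100


print(grid_search(range(2, 11), n_samples=10, score=toy_score))
```

Because every candidate is fitted once per fold, the number of model fits is the number of candidates times the number of folds, which explains the long run of similarity-matrix messages in the output.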


Report the optimal $k$ value.

Report the RMSE given the optimal $k$ value.

In [None]:
from surprise.model_selection import GridSearchCV

In [None]:
params= {"k":[2,3,4,5,6,7,8,9,10] }
grid= GridSearchCV(KNNWithMeans, param_grid= params, measures=['rmse'])
grid.fit(training_data1)

Computing the msd similarity matrix...
Done computing similarity matrix.

In [None]:
print("Best Value of K:{} and corresponding RMSE Value is {}".format(grid.best_params['rmse']['k'], grid.best_score['rmse']))


Best Value of K:10 and corresponding RMSE Value is 0.9808492031739593


B. Item-based recommendation

To predict user $u$'s rating of item $i$ ($R_{u, i}$), item-based recommendation finds the top-N neighbors of $i$ among the items that $u$ has rated and takes the average of $u$'s ratings on them (unweighted, or weighted by their similarity to $i$) as the prediction $\hat{R}_{u,i}$.
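
The item-based rule can be sketched by transposing the ratings so that items, not users, are compared. The pure-Python example below is a simplified illustration under stated assumptions: it weights all of the user's rated items by MSD similarity instead of keeping only the top-N, it skips mean-centering, and the toy ratings and function names are hypothetical.

```python
def transpose(user_ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    item_ratings = {}
    for user, rated in user_ratings.items():
        for item, rating in rated.items():
            item_ratings.setdefault(item, {})[user] = rating
    return item_ratings


def msd_similarity(vectors, a, b):
    """1 / (1 + mean squared difference) over the keys shared by a and b."""
    common = vectors[a].keys() & vectors[b].keys()
    if not common:
        return 0.0
    msd = sum((vectors[a][x] - vectors[b][x]) ** 2 for x in common) / len(common)
    return 1.0 / (1.0 + msd)


def predict_item_based(user_ratings, user, item):
    """Similarity-weighted average of the user's own ratings on other items."""
    by_item = transpose(user_ratings)
    weighted = total = 0.0
    for other, rating in user_ratings[user].items():
        if other == item:
            continue
        sim = msd_similarity(by_item, item, other)
        weighted += sim * rating
        total += sim
    return weighted / total if total else None


# Hypothetical ratings: "ann" has not rated item "C" yet.
ratings = {
    "ann": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "cat": {"A": 1, "B": 5, "C": 2},
}
print(predict_item_based(ratings, "ann", "C"))
```

Here item "C" is rated more like "A" than like "B" by the users who rated both, so the prediction leans toward the rating "ann" gave to "A".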



(1). Similar to the previous question, implement an item-based recommender system trained on the `training` set and report the *RMSE* on both the `training` and `test` sets. (Note: apply the optimal $k$ obtained in the last question.)

In [None]:
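# item-based similarities; fit on the 75% split, predict on the held-out 25% of the training data
# note: k is not passed here, so the library default is used rather than the k found above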
sim_options= {'user_based': False }
predictions_tr_item= KNNWithMeans( sim_options=sim_options).fit(tr).test(ts)
print("RMSE On Train", rmse(predictions_tr_item))

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9521
RMSE On Train 0.9520802458970272


In [None]:
#test set
predictions_ts_item= KNNWithMeans(sim_options=sim_options).fit(tr).test(testset)
print("RMSE On Test", rmse(predictions_ts_item))

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9648
RMSE On Test 0.9648360997055023


(2). Similar to the previous question, display the top 10 movies for user 10, ranked by the predicted rating scores in the test set.

In [None]:
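temp = pd.DataFrame(predictions_ts_item)

# for each of users 1-10, list 10 test-set movies ordered by estimated rating
# note: sort_values('est') sorts ascending, so [:10] keeps the lowest estimates;
# pass ascending=False to list the highest-rated movies first
for i in range(1,11):
  df = temp[temp.uid == i].sort_values('est')[:10]
  print(" For user", i)
  for iid in df.iid:
    # look up the movie title for this item id
    movie= items[items.item_id == iid].collect()
    print(movie[0][1], end= ",")
  print("\n")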

 For user 1
Kansas City (1996),Theodore Rex (1995),Striptease (1996),Jungle2Jungle (1997),All Dogs Go to Heaven 2 (1996),Event Horizon (1997),Lawnmower Man, The (1992),Mad Love (1995),Flipper (1996),Hot Shots! Part Deux (1993),

 For user 2
3 Ninjas: High Noon At Mega Mountain (1998),Fierce Creatures (1997),River Wild, The (1994),Hoodlum (1997),Up Close and Personal (1996),FairyTale: A True Story (1997),In & Out (1997),Midnight in the Garden of Good and Evil (1997),Mighty Aphrodite (1995),Once Upon a Time... When We Were Colored (1995),

 For user 3
Mimic (1997),House of Yes, The (1997),187 (1997),Dante's Peak (1997),Devil's Own, The (1997),Critical Care (1997),Hard Rain (1998),Prophecy II, The (1998),Hoodlum (1997),Alien: Resurrection (1997),

 For user 4
Mimic (1997),Event Horizon (1997),Incognito (1997),Liar Liar (1997),Scream (1996),Wedding Singer, The (1998),Client, The (1994),Ulee's Gold (1997),Star Wars (1977),One Flew Over the Cuckoo's Nest (1975),

 For user 5
Amityville: A Ne

(3). Given the same number of nearest neighbors ($k$), compare and discuss user-based and item-based recommendation. Which performs better on the test set?

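Based on the test set:
Item-based works slightly better than user-based: its test RMSE is 0.9648, against 0.9663 for the user-based model (lower is better). The gap is small, and the two models were not fit on exactly the same data (the user-based model used the full training set, the item-based one the 75% split), so the ranking is not conclusive.

User-based recommendation is user-centric. It tends to have low bias but high variance, because its predictions depend on the recorded interactions of the few users found to be similar.

Item-based recommendation is item-centric. It tends to have lower variance but higher bias.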
