# Recommender Systems using Surprise
---

Welcome to this skeleton notebook on using the Surprise library to solve recommender system problems. Its purpose is to give you an overview of the key concepts and steps involved in building a recommender system with Surprise. The notebook comes with utility code; you will need to fill in the missing key code snippets, marked as **#TODO**. There are a total of **8** sections marked as **#TODO**.

The ultimate objective of this notebook is to make accurate predictions for the specified users in the last 2 cells of the notebook.

To successfully complete this notebook, you will need to have basic knowledge of Python programming and familiarity with the Surprise library. We recommend following the notebook in sequential order and reading the instructions carefully before filling in the missing code snippets. Remember to save your notebook frequently as you work through it. If you encounter any issues, consult the Surprise documentation or seek help from the online community. Good luck!

In [1]:
# Credits-Prof Eirinaki, Rashmi Sharma and Aditya Patel
!pip install scikit-surprise



In [2]:
# import required libraries
# utility
import numpy as np
import pandas as pd

# surprise utility
from surprise import Reader
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection.split import train_test_split
from surprise.model_selection import cross_validate, GridSearchCV

# models
from surprise import KNNBasic

## Understanding the Recommendation Generation Problem:
---

### Basic recommender system design revolves around three fields:
- user_id
- item_id
- rating

As seen in class, the major techniques for predicting a user's rating of an item are:
- Collaborative Filtering
- Matrix Factorization

We will work through the **Collaborative Filtering** technique using Python's **Surprise** library, which provides many built-in functions tailored to building recommender systems. [Link to Official Documentation](https://surprise.readthedocs.io/en/stable/index.html)

In [3]:
# load the utility matrix into a dataframe
data = pd.read_csv('data/dataset.csv')

data.head()

Unnamed: 0,user,movie,rating
0,1,2,4.0
1,1,3,5.0
2,1,6,4.0
3,1,8,3.0
4,1,11,3.0


In [4]:
# set the scale of ratings
reader = Reader(rating_scale=(1,5))

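# load dataset into the Surprise data structure: Dataset
# load_from_df expects exactly three columns, in this order: user, item, rating
data = Dataset.load_from_df(data[['user', 'movie', 'rating']], reader)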

data

<surprise.dataset.DatasetAutoFolds at 0x291aed575c0>

## Training the Model:
---

There are several ways to train a recommender system using the Surprise library.

The first way is to choose a similarity measure and apply one of the collaborative filtering algorithms (i.e., the "original" algorithms and their variations). Another option is to use baseline estimates (i.e., minimizing the error with an optimization procedure).

We follow the first approach here.

### Neighborhood-based Collaborative Filtering:

Before training the model, we need to create a training set. This needs to be distinct from any set used for cross-validation or testing/evaluation.

There are several ways to perform hyperparameter tuning and/or evaluation.

The Surprise library provides several cross-validation iterators that split the user-item ratings for you, as in the options that follow. (Ref: https://surprise.readthedocs.io/en/stable/getting_started.html#use-cross-validation-iterators)

### Option 1: Using Holdout Set
---

In [None]:
## TODO

# initial setup for training
# create training set
# test_size=0.2
trainingSet, testSet = 

# let's configure some parameters for neighborhood-based Collaborative Filtering Algorithm
sim_options = {
    'name': 'pearson', #similarity measure default is MSD
    'user_based': True #user-based CF
}

# other options:
# for item-based CF -> False
# for name -> pearson, cosine, msd, pearson_baseline

### 1a: Training KNN-based CF Model

In [None]:
## TODO

## training
# KNN -- implements neighborhood-based CF algorithms

# define model
# k=neighbours=5, other parameters set as above
knn = 

# fit model to the training set
knn.fit(trainingSet)

### 1b: Performance Evaluation

In [None]:
## TODO

# look into test() to fetch predictions for testSet
# predict for test set values
predictions_knn = 

# evaluating rating predictions using RMSE
accuracy.rmse(predictions_knn, verbose=True) 
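For intuition about what the RMSE figure means, the sketch below computes RMSE (and MAE, which is used later in the notebook) by hand on a few made-up (true rating, estimated rating) pairs. The helper names and the toy pairs are assumptions for illustration only; in the notebook itself the values come from `accuracy.rmse`.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return sqrt(sum((true_r - est) ** 2 for true_r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true_r - est) for true_r, est in pairs) / len(pairs)

# toy (true, estimated) pairs, made up for this sketch
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (2.0, 2.0)]

print(round(rmse(pairs), 4))
print(mae(pairs))
```

RMSE penalizes large errors more heavily than MAE, so the one-star miss on the second pair pulls RMSE above MAE here.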

### 1c: Making Predictions

In [None]:
# we can also predict for a particular user-item combination
pred = knn.predict(4, 1, verbose=True)
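Conceptually, a basic neighborhood-based prediction is a similarity-weighted average of the ratings that the most similar neighbors gave to the item. The following is a rough sketch of that idea under simplifying assumptions: the function name `knn_predict` and the toy neighbor list are hypothetical, and the library's own handling of edge cases may differ.

```python
def knn_predict(neighbors, k=5):
    """Similarity-weighted average over the k most similar neighbors.

    neighbors: list of (similarity, rating) pairs for users who rated the item.
    Returns None when no neighbor has a positive similarity.
    """
    top_k = sorted(neighbors, key=lambda sr: sr[0], reverse=True)[:k]
    positive = [(sim, r) for sim, r in top_k if sim > 0]
    if not positive:
        return None
    weighted_sum = sum(sim * r for sim, r in positive)
    sim_total = sum(sim for sim, _ in positive)
    return weighted_sum / sim_total

# toy neighbors of one user for one item (made up for this sketch)
neighbors = [(0.9, 4.0), (0.1, 2.0), (-0.5, 1.0)]

print(knn_predict(neighbors, k=2))
```

With `k=2` the estimate lands near 3.8: the highly similar neighbor's 4-star rating dominates, and the negatively correlated neighbor is ignored.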

### Option 2: Cross Validation
---

Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.

The Surprise library offers several options: https://surprise.readthedocs.io/en/stable/model_selection.html

In [None]:
## TODO

# use cross-validation to train and evaluate the knn model
# instead of manually splitting the dataset
# the dataset is divided into folds
# each fold serves once as the test set while the remaining folds form the training set
# final model performance is the average over all passes

# look into cross_validate() function
# use both rmse and mae for measures
# set number of folds as 5

cross_validate()
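To illustrate how k-fold cross-validation partitions the data, here is a small pure-Python sketch that splits sample indices into folds. The generator name `k_fold_indices` is an assumption for illustration, and the sketch skips shuffling, which a real splitter would normally do before forming the folds.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) for each of n_folds passes."""
    # spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every rating index appears in exactly one test fold, so each rating is used for evaluation once and for training in the remaining passes.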

### Option 3: GridSearchCV
---

The GridSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. 

This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.

In [None]:
## TODO

# define the set of parameters to be searched on
param_grid = {
    'k': [3, 5, 10, 20],
    'sim_options': {
        'name': ['pearson', 'cosine'],
        'min_support': [1, 5],   #the minimum number of common items needed between users to consider them for similarity. For the item-based approach, this corresponds to the minimum number of common users for two items.
        'user_based': [False, True]
    }
}

# find optimal params for KNN
# create object of GridSearchCV for KNNBasic model
# look into the required parameters to the GridSearchCV() constructor
# pass model as KNNBasic
# use both rmse and mae as measures
# keep number of cross validation folds as 5

gs = GridSearchCV()

# run the grid search
# note: the entire dataset (data) is passed here, not trainingSet;
# GridSearchCV performs its own cross-validation splits
gs.fit(data)
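It helps to know how many parameter combinations a grid like the one above produces before running the search. The sketch below expands a copy of `param_grid` with `itertools.product`, treating the nested `sim_options` dictionary as its own sub-grid. The helper `expand` is hypothetical and only mimics the idea of a grid search; it is not the library's code.

```python
from itertools import product

# copy of the grid defined in the cell above
param_grid = {
    'k': [3, 5, 10, 20],
    'sim_options': {
        'name': ['pearson', 'cosine'],
        'min_support': [1, 5],
        'user_based': [False, True],
    },
}

def expand(grid):
    """Return every combination of a (possibly nested) parameter grid."""
    keys = list(grid)
    value_lists = [expand(grid[key]) if isinstance(grid[key], dict) else grid[key]
                   for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

combos = expand(param_grid)
print(len(combos))   # 4 * 2 * 2 * 2 combinations
print(combos[0])
```

With 32 combinations and 5 folds, the search trains and evaluates a model 160 times, so the grid is worth keeping small on larger datasets.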

In [None]:
## TODO

# print best RMSE score
print()

In [None]:
## TODO

# print combination of parameters that gave the best RMSE score
print()

In [None]:
# extract the best model
knn = gs.best_estimator['rmse']

# retrain the best model on the full trainset (all available ratings)
knn.fit(data.build_full_trainset())

# You may use this instead of some parts of the following section, to make predictions for the unseen data (i.e. all the missing ratings)

## Example: Making Predictions for Unknown Ratings
---

### UI Prep:

For our demo, we will create a user dictionary and a movie dictionary. In the user dictionary, the key is the username and the value is the user id used in our original dataset. In the movie dictionary, the key is the movie id and the value is the movie name.

In [5]:
user_df = pd.read_csv("data/user_name.csv")
user_df.sample(5)

Unnamed: 0,username,id
340,SPRING-18-660,341
164,FALL-19-352,165
237,FALL-18-288,238
362,SPRING-18-788,363
120,SPRING-20-767,121


In [6]:
# map username -> user id
user_dict = dict(zip(user_df['username'], user_df['id']))

In [7]:
movie_df = pd.read_csv("data/movie_name.csv")
movie_df.head(5)

Unnamed: 0,movieName,id
0,Rating [Rogue One/Star Wars],1
1,Rating [Fight Club],2
2,Rating [The Lord of the Rings],3
3,Rating [Trolls],4
4,Rating [Despicable Me],5


In [8]:
# map movie id -> movie name
movie_dict = dict(zip(movie_df['id'], movie_df['movieName']))

### Find user-item pairs with no ratings:

The **build_anti_testset()** method returns all the ratings that are not in the trainset, i.e. all the ratings **𝑟𝑢𝑖** where the user **𝑢** is known, the item **𝑖** is known, but the rating **𝑟𝑢𝑖** is not in the trainset.

As 𝑟𝑢𝑖 is unknown, it is replaced by the fill value if one is given; otherwise it is set to the mean of all ratings (global_mean).
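The idea can be sketched in plain Python on a toy list of (user, item, rating) triples. The function name `anti_testset_sketch` and the toy ratings are assumptions made for illustration; the real method operates on a Surprise trainset rather than on raw tuples.

```python
def anti_testset_sketch(ratings, fill=None):
    """Return (user, item, fill) for every known user/item pair with no rating.

    ratings: list of (user, item, rating) triples (hypothetical toy format).
    fill: placeholder rating; defaults to the mean of all known ratings.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    if fill is None:
        fill = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, fill) for u in users for i in items if (u, i) not in rated]

# toy ratings: user 2 has not rated item 'B'
ratings = [(1, 'A', 4.0), (1, 'B', 2.0), (2, 'A', 3.0)]

print(anti_testset_sketch(ratings))
```

The placeholder rating is never used as a training signal; it only fills the "true rating" slot of each pair so the pair can be passed to the model for prediction.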

In [10]:
# using the best model from gridSearchCV: knn

# Find missing values and predict
trainset = data.build_full_trainset()
anti_test_set = trainset.build_anti_testset()
predictions = knn.test(anti_test_set)

In [11]:
from collections import defaultdict

def getMovieRecommendations(topN=3):
    top_recs = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions: 
        top_recs[uid].append((iid, est))
     
    for uid, user_ratings in top_recs.items():
        user_ratings.sort(key = lambda x: x[1], reverse = True)
        top_recs[uid] = user_ratings[:topN]
     
    return top_recs


def getMovieName(movie_id):
    """Return the title inside the brackets, e.g. 'Rating [Fight Club]' -> 'Fight Club'."""
    if movie_id not in movie_dict:
        return ""
    m = movie_dict[movie_id].split('[')
    temp = m[1].split(']')
    return temp[0]


def getMovieRecommendationsForUser(username, recommendations):
    """
    Takes a username and the recommendations returned by
    getMovieRecommendations, and returns a list of
    (movie_name, estimated_rating) tuples for that user.
    """
    if username not in user_dict:
        print("Username is not present")
        return
    u_id = user_dict[username]
    recommended_movies = recommendations[u_id]
    movie_list = []
    for movie in recommended_movies:
        movie_list.append((getMovieName(movie[0]), movie[1]))
    return movie_list    
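The grouping-and-sorting logic used for the top-N recommendations can be tried out on a few made-up predictions. This is a simplified sketch: the function `top_n` and the toy `(user_id, movie_id, estimate)` tuples are assumptions for illustration, whereas the notebook's function reads Surprise prediction objects from the global `predictions` list.

```python
from collections import defaultdict

def top_n(predictions, n=3):
    """Group (uid, iid, est) tuples by user and keep the n highest estimates."""
    grouped = defaultdict(list)
    for uid, iid, est in predictions:
        grouped[uid].append((iid, est))
    for uid, recs in grouped.items():
        recs.sort(key=lambda x: x[1], reverse=True)
        grouped[uid] = recs[:n]
    return grouped

# toy predictions for unseen user-movie pairs (made up for this sketch)
toy = [(1, 10, 3.2), (1, 11, 4.8), (1, 12, 4.1), (1, 13, 2.0), (2, 10, 3.9)]

recs = top_n(toy, n=3)
print(recs[1])
print(recs[2])
print(recs[99])  # a user with no unseen pairs gets an empty list
```

Because the result is a `defaultdict`, looking up a user with no predicted pairs returns an empty list rather than raising a `KeyError`.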

In [None]:
## TODO

# generate the top recommendations for every user from the predictions above
# look into getMovieRecommendations() function defined above
# keep the top 3 recommendations per user (topN=3)
recommendations = 

In [None]:
# extract movie recommendations for a user from the above predictions
getMovieRecommendationsForUser('SPRING-23-477',recommendations)

In [None]:
# What happens when a user has rated everything? Here's a user from our dataset that has done that.
getMovieRecommendationsForUser('SPRING-23-230',recommendations)

## Tips
1. Surprise's `Dataset.load_from_df` takes exactly three columns, in this order: user, item, and rating.

2. Building the anti-testset gives you all the unknown user-item ratings; you may not need all of them.

3. Explore more and have fun!