http://surpriselib.com

<h1 style="color:blue; font-weight:bold">Assignment todo list</h1>

- Extract all the pipeline implemented function in a python script and import them in the notebook to use them.
- Implement the *get_user_recommendation* function.
- **Bonus** : Generalize the *get_trained_model* function to use any surprise model kwargs.

# Load data

<p style="color:blue; font-weight:bold">TODO</p>

- Extract the implemened functions in a python script and import them in the notebook.

## From surprise

In [1]:
from surprise import Dataset

ratings = Dataset.load_builtin('ml-100k')
ratings

<surprise.dataset.DatasetAutoFolds at 0x7fc94c16b640>

In [2]:
from surprise.dataset import DatasetAutoFolds

def load_ratings_from_surprise() -> DatasetAutoFolds:
    ratings = Dataset.load_builtin('ml-100k')
    return ratings

load_ratings_from_surprise()

<surprise.dataset.DatasetAutoFolds at 0x7fc94c092ac0>

## From file

In [3]:
from pathlib import Path
from surprise import Reader

ratings_filepath = Path('../data/movielens/ml-latest-small/ratings.csv')
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
ratings = Dataset.load_from_file(ratings_filepath, reader)
ratings

<surprise.dataset.DatasetAutoFolds at 0x7fc94cdca7f0>

In [4]:
def load_ratings_from_file(ratings_filepath : Path) -> DatasetAutoFolds:
    reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
    ratings = Dataset.load_from_file(ratings_filepath, reader)
    return ratings
    
ratings_filepath = Path('../data/movielens/ml-latest-small/ratings.csv')
load_ratings_from_file(ratings_filepath)

<surprise.dataset.DatasetAutoFolds at 0x7fc94cdcab80>

## Modular function

In [5]:
def get_data(from_surprise : bool = True) -> DatasetAutoFolds:
    data = load_ratings_from_surprise() if from_surprise else load_ratings_from_file()
    return data

data = get_data(from_surprise=True)
data

<surprise.dataset.DatasetAutoFolds at 0x7fc94cd14d00>

# Manual pipeline

<p style="color:blue; font-weight:bold">TODO</p>

- Extract the implemened functions in a python script and import them in the notebook.

## Split data in train and test

In [6]:
from surprise.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)
train

<surprise.trainset.Trainset at 0x7fc94cd14dc0>

In [7]:
train.n_users, train.n_items

(943, 1651)

## Train model

<p style="color:blue; font-weight:bold">TODO</p>

- Change the *model_kwargs* argument in the *get_trained_model* function to make it usable for any surprise model (SVD, KNN, NMF, etc).

In [8]:
from surprise import SVD

model = SVD()

In [9]:
model.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc94c16b5b0>

In [10]:
from surprise.trainset import Trainset
from  surprise.prediction_algorithms.algo_base import AlgoBase

from surprise.prediction_algorithms.knns import KNNBasic


def get_trained_model(model_class: AlgoBase, model_kwargs: dict, train_set: Trainset) -> AlgoBase:
    model = model_class(sim_options = model_kwargs)
    model.fit(train_set)
    return model

model_kwargs = {'sim_options': {'user_based': False, 'name': 'pearson'}}
get_trained_model(KNNBasic, {'user_based': False, 'name': 'pearson'}, train)
# {'sim_options': {'user_based': False, 'name': 'pearson'}} - **kwargs

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fc94c14fdc0>

## Make predictions

In [11]:
predictions = model.test(test)
predictions[:10]

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.762202632237946, details={'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.211045843957054, details={'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.4576465941225325, details={'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=4.07731408042207, details={'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.0701961367499555, details={'was_impossible': False}),
 Prediction(uid='363', iid='1512', r_ui=1.0, est=3.601462078997732, details={'was_impossible': False}),
 Prediction(uid='193', iid='487', r_ui=5.0, est=3.7561068104413047, details={'was_impossible': False}),
 Prediction(uid='808', iid='313', r_ui=5.0, est=4.54216185300126, details={'was_impossible': False}),
 Prediction(uid='557', iid='682', r_ui=2.0, est=3.6357756491341084, details={'was_impossible': False}),
 Prediction(uid='774', iid='196', r_ui=3.0, est=2.376520149285778, deta

## Evaluation

In [12]:
from surprise import accuracy

accuracy.rmse(predictions=predictions)

RMSE: 0.9378


0.9378456428063894

In [13]:
accuracy.mae(predictions=predictions)

MAE:  0.7395


0.7395408044495279

In [14]:
from surprise import accuracy

def evaluate_model(model: AlgoBase, test_set: [(int, int, float)]) -> dict:
    predictions = model.test(test_set)
    metrics_dict = {}
    metrics_dict['RMSE'] = accuracy.rmse(predictions, verbose=False)
    metrics_dict['MAE'] = accuracy.rmse(predictions, verbose=False)
    return metrics_dict

## Modular code

In [15]:
from surprise.model_selection import train_test_split


from surprise.prediction_algorithms.knns import KNNBasic

def train_and_evalute_model_pipeline(model_class: AlgoBase, model_kwargs: dict = {},
                                     from_surprise: bool = True,
                                     test_size: float = 0.2) -> (AlgoBase, dict):
    data = get_data(from_surprise)
    train_set, test_set = train_test_split(data, test_size, random_state=42)
    model = get_trained_model(model_class, model_kwargs, train_set)
    metrics_dict = evaluate_model(model, test_set)
    return model, metrics_dict

my_model, metrics_dict = train_and_evalute_model_pipeline(KNNBasic)
metrics_dict

Computing the msd similarity matrix...
Done computing similarity matrix.


{'RMSE': 0.980150596704479, 'MAE': 0.980150596704479}

In [16]:
my_model, metrics_dict = train_and_evalute_model_pipeline(SVD)
metrics_dict

TypeError: __init__() got an unexpected keyword argument 'sim_options'

In [20]:
get_trained_model(KNNBasic, {'user_based': False, 'name': 'pearson'}, train)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fc94c0925e0>

# Benchmarking

<p style="color:blue; font-weight:bold">TODO</p>

- Add the other models (baseline, item based with cosine & pearson sim metrics, NMF, SVD)
- Add the fit_time in the benchmarking

In [17]:
from surprise.prediction_algorithms.knns import KNNBasic

benchmark_dict = {}

model_kwargs = {'user_based': True, 'name': 'cosine'}
knn, metrics_dict = train_and_evalute_model_pipeline(KNNBasic, model_kwargs)
benchmark_dict['KNN user based cosine'] = metrics_dict

model_kwargs = {'user_based': True, 'name': 'pearson'}
knn, metrics_dict = train_and_evalute_model_pipeline(KNNBasic, model_kwargs)
benchmark_dict['KNN user based pearson'] = metrics_dict

model_kwargs = {'user_based': False, 'name': 'cosine'}
knn, metrics_dict = train_and_evalute_model_pipeline(KNNBasic, model_kwargs)
benchmark_dict['KNN item based cosine'] = metrics_dict

model_kwargs = {'user_based': False, 'name': 'pearson'}
knn, metrics_dict = train_and_evalute_model_pipeline(KNNBasic, model_kwargs)
benchmark_dict['KNN item based pearson'] = metrics_dict


benchmark_dict

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.


{'KNN user based cosine': {'RMSE': 1.0193536815834319,
  'MAE': 1.0193536815834319},
 'KNN user based pearson': {'RMSE': 1.0150350905205965,
  'MAE': 1.0150350905205965},
 'KNN item based cosine': {'RMSE': 1.0264295933767333,
  'MAE': 1.0264295933767333},
 'KNN item based pearson': {'RMSE': 1.041104054968961,
  'MAE': 1.041104054968961}}

In [18]:
benchmark_dict = {}

model_dict_list = [
    {
        'model_name' : 'KNN user based with cosine similarity',
        'model_class' : KNNBasic,
        'model_kwargs' : {'user_based': True, 'name': 'cosine'}
    },
    {
        'model_name' : 'KNN user based with pearson similarity',
        'model_class' : KNNBasic,
        'model_kwargs' : {'user_based': True, 'name': 'pearson'}
    },
]

for model_dict in model_dict_list:
    model, metrics_dict = train_and_evalute_model_pipeline(
        model_dict['model_class'], model_dict['model_kwargs'])
    benchmark_dict[model_dict['model_name']] = metrics_dict
    model_dict['fitted_model'] = model
    
benchmark_dict

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.


{'KNN user based with cosine similarity': {'RMSE': 1.0193536815834319,
  'MAE': 1.0193536815834319},
 'KNN user based with pearson similarity': {'RMSE': 1.0150350905205965,
  'MAE': 1.0150350905205965}}

# Cross validation

In [19]:
from surprise.model_selection import cross_validate

cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0187  1.0078  1.0135  1.0096  1.0084  1.0116  0.0041  
MAE (testset)     0.8084  0.7987  0.8071  0.8025  0.7994  0.8032  0.0039  
Fit time          1.27    1.05    1.12    1.15    1.26    1.17    0.09    
Test time         3.60    3.31    2.77    3.26    2.91    3.17    0.30    


{'test_rmse': array([1.01872866, 1.00775992, 1.01354856, 1.00955551, 1.00843408]),
 'test_mae': array([0.80843632, 0.79869761, 0.8071084 , 0.80245938, 0.79940892]),
 'fit_time': (1.272223949432373,
  1.0536251068115234,
  1.116072177886963,
  1.1464581489562988,
  1.2636258602142334),
 'test_time': (3.5992281436920166,
  3.3081600666046143,
  2.7677388191223145,
  3.261117935180664,
  2.909980058670044)}

# User recommendation

<p style="color:blue; font-weight:bold">TODO</p>

- Create a function that 

In [17]:
import pandas

def get_user_recommendation(model: AlgoBase, user_id: int, k: int, data, movies : pandas.DataFrame
                           ) -> pandas.DataFrame:
    """Makes movie recommendations a user.
    
    Parameters
    ----------
        model : AlgoBase
            A trained surprise model
        user_id : int
            The user for whom the recommendation will be done.
        k : int
            The number of items to recommend.
        data : FIXME
            The data needed to do the recommendation.
        movies : pandas.DataFrame
            The dataframe containing the movies metadata (title, genre, etc)
        
    Returns
    -------
    pandas.Dataframe
        A dataframe with the k movies that will be recommended the user. The dataframe should have the following
        columns (movie_name : str, movie_genre : str, predicted_rating : float, true_rating : float)
        
    Notes
    -----
    - You should create other functions that are used in this one and not put all the code in the same function.
        For example to create the final dataframe, instead of implemented all the code
        in this function (get_user_recommendation), you can create a new one (create_recommendation_dataframe)
        that will be called in this function.
    - You can add other arguments to the function if you need to.
    """
    # FIXME
    pass