### Building a Recommender System with Surprise

This try-it focuses on exploring additional algorithms with the `Surprise` library to generate recommendations. Your goal is to identify the best algorithm by minimizing the mean squared error (reported below as RMSE) using cross validation. You are also going to select a dataset to use from the [GroupLens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to GroupLens and examine the different datasets available. Choose one from which it is easy to build the user, item, and rating data that `Surprise` expects. Then, compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the documentation for the algorithm package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation and include the results of your cross validation and a basic description of your dataset with your peers.
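
For reference, the error metrics used in this notebook can be computed by hand. The following is a minimal sketch in plain Python, not the `Surprise` implementation; the function names `rmse_by_hand` and `mae_by_hand` and the toy ratings are illustrative assumptions.

```python
import math

def rmse_by_hand(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae_by_hand(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    abs_errors = [abs(true - est) for true, est in pairs]
    return sum(abs_errors) / len(abs_errors)

# Hypothetical (true, estimated) ratings on a 0-10 scale
toy_pairs = [(10.0, 8.0), (7.0, 7.0), (0.0, 2.0), (5.0, 1.0)]
toy_rmse = rmse_by_hand(toy_pairs)
toy_mae = mae_by_hand(toy_pairs)
print(f"RMSE: {toy_rmse:.4f}, MAE: {toy_mae:.4f}")
```

Because RMSE squares each error before averaging, a single large miss (the last pair here) raises it more than it raises MAE.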



In [123]:
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.pipeline import Pipeline
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate
from surprise.accuracy import rmse
from surprise import accuracy

import pandas as pd
import plotly.express as px
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

In [124]:
def loadBookRatingsData():
    print("Loading user data ...")
    user = pd.read_csv('data/BX-Users.csv', sep=';', encoding="latin-1")
    print("Loading ratings data ...")
    ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding="latin-1")
    print("Merge and return dataframe ...")
    df = pd.merge(user, ratings, on='User-ID', how='inner')
    return df

In [125]:
def recommenderMetrics(predictions):
    mae = accuracy.mae(predictions, verbose=False)
    rmse = accuracy.rmse(predictions, verbose=False)
    return (mae, rmse)

In [126]:
df = loadBookRatingsData()
df.drop(['Location','Age'], axis=1, inplace=True)
df.sample(5)

Loading user data ...
Loading ratings data ...
Merge and return dataframe ...


index,User-ID,ISBN,Book-Rating
327036,80099,6109786,7
527399,129358,373835817,0
148241,35859,312957866,7
862493,211137,62500228,5
241398,57105,345339711,9


### Plots to understand the ratings distribution

In [90]:
dist_ratings = df['Book-Rating'].value_counts().sort_index(ascending=False)

fig = px.bar(dist_ratings, x=dist_ratings.index, y=dist_ratings.values,
             text = ['{:.1f} %'.format(val) for val in (dist_ratings.values / df.shape[0] * 100)],
             hover_data=['Book-Rating'], color='Book-Rating',
             title="Ratings Distribution",
             labels={'index':'Rating Scale (0-10)','y':'Number of ratings'}, height=500)
fig.show()

### Outlier Analysis

In [91]:
def trim_ds(df, variable, value):
    # Keep only rows whose `variable` entry occurs more than `value` times
    filter_v = df[variable].value_counts() > value
    filter_v = filter_v[filter_v].index.tolist()

    return df[(df[variable].isin(filter_v))]

In [92]:
df_1 = trim_ds(df,'ISBN',25)

In [93]:
df_new = trim_ds(df_1,'User-ID',50)

In [94]:
df.shape, df_new.shape

((1149780, 3), (164763, 3))
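
To make the effect of the two filtering passes concrete, here is a small sketch on a toy frame. It assumes `pandas` is installed; the toy data and the threshold of 1 are made up for illustration, and `trim_ds` is restated so the block runs on its own.

```python
import pandas as pd

def trim_ds(df, variable, value):
    # Keep only rows whose `variable` entry occurs more than `value` times
    counts = df[variable].value_counts()
    keep = counts[counts > value].index
    return df[df[variable].isin(keep)]

# Hypothetical ratings: three users, three books
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "Book-Rating": [5, 0, 7, 8, 0, 10],
})

by_item = trim_ds(toy, "ISBN", 1)         # drops book "c" (rated once)
by_user = trim_ds(by_item, "User-ID", 1)  # drops user 3 (one rating left)
print(toy.shape, by_item.shape, by_user.shape)
```

The order matters: the user filter runs on the already item-filtered frame, so an item that passed the first threshold can fall back below it once sparse users are removed.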

### SURPRISE !!!!

In [95]:
# Ratings in this dataset run from 0 to 10, so the scale must include 10
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(df_new[['User-ID', 'ISBN', 'Book-Rating']], reader)

In [96]:
def evaluator():
    benchmark_results = []
    algorithms = [KNNBasic(), SVD(), NMF(), SlopeOne(), CoClustering()]
    for algorithm in algorithms:
        cv_results = cross_validate(algorithm, data, measures=['RMSE'], cv=5, verbose=False)
        tmp = pd.DataFrame.from_dict(cv_results).mean(axis=0)
        # Series.append was removed in pandas 2.0; pd.concat is the replacement
        tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['algorithm'])])
        benchmark_results.append(tmp)
    
    return benchmark_results

In [97]:
rs = evaluator()
pd.DataFrame(rs).set_index('algorithm').sort_values('test_rmse')  

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


algorithm,test_rmse,fit_time,test_time
CoClustering,3.33859,2.41038,0.131734
SlopeOne,3.352486,2.360354,4.654308
SVD,3.413536,1.211167,0.21452
KNNBasic,3.607983,0.237295,1.971068
NMF,3.70242,2.304718,0.191144
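
Each `test_rmse` above is an average over five folds. The sketch below shows the basic idea of a 5-fold split in plain Python. It is an illustration under simplifying assumptions, not the `Surprise` splitter: the helper `k_fold_indices` is hypothetical, and it uses contiguous folds without shuffling, whereas the library's own splitter may shuffle the ratings first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k contiguous folds (no shuffling)."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train_idx)} ratings, test on {test_idx}")
```

Every rating lands in exactly one test fold, so each algorithm is scored only on ratings it did not see while fitting that fold.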


In [98]:
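# build_testset() returns the same ratings the model is trained on, so the
# RMSE computed below is an in-sample (training) error, not a held-out estimate
train = data.build_full_trainset()
test = train.build_testset()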

In [99]:
cc = CoClustering(n_cltr_u=3, n_cltr_i=3, n_epochs=20, random_state=None, verbose=False)

In [100]:
cc.fit(train)

<surprise.prediction_algorithms.co_clustering.CoClustering at 0x7f9d0e351e50>

In [104]:
cc_preds = cc.test(test)

In [105]:
accuracy.rmse(cc_preds)

RMSE: 3.0609


3.060850290943349

In [108]:
cc_preds_df = pd.DataFrame(cc.test(test))
cc_preds_df

index,uid,iid,r_ui,est,details
0,243,0060915544,10.0,1.934903,{'was_impossible': False}
1,243,0060977493,7.0,3.973102,{'was_impossible': False}
2,243,0156006529,0.0,0.028881,{'was_impossible': False}
3,243,0312169787,0.0,0.711030,{'was_impossible': False}
4,243,0316096199,0.0,1.351620,{'was_impossible': False}
...,...,...,...,...,...
164758,278418,1551668122,0.0,0.000000,{'was_impossible': False}
164759,278418,1551668270,0.0,0.672890,{'was_impossible': False}
164760,278418,155166884X,0.0,0.000000,{'was_impossible': False}
164761,278418,1559029838,0.0,0.000000,{'was_impossible': False}


In [117]:
# Testing ratings accuracy with one user id = 243
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'err' column
rslt_df = cc_preds_df[cc_preds_df['uid'] == 243].copy()
rslt_df['err'] = (rslt_df['est'] - rslt_df['r_ui']).abs()
rslt_df


index,uid,iid,r_ui,est,details,err
0,243,0060915544,10.0,1.934903,{'was_impossible': False},8.065097
1,243,0060977493,7.0,3.973102,{'was_impossible': False},3.026898
2,243,0156006529,0.0,0.028881,{'was_impossible': False},0.028881
3,243,0312169787,0.0,0.711030,{'was_impossible': False},0.711030
4,243,0316096199,0.0,1.351620,{'was_impossible': False},1.351620
...,...,...,...,...,...,...
61,243,0684848783,0.0,0.619045,{'was_impossible': False},0.619045
62,243,0743486226,0.0,1.953293,{'was_impossible': False},1.953293
63,243,0786863986,5.0,3.246815,{'was_impossible': False},1.753185
64,243,140003180X,0.0,2.046627,{'was_impossible': False},2.046627


In [119]:
best_predictions_243 = rslt_df.sort_values(by='err')[:5]
worst_predictions_243 = rslt_df.sort_values(by='err')[-5:]

In [120]:
best_predictions_243

index,uid,iid,r_ui,est,details,err
37,243,0446353205,0.0,0.0,{'was_impossible': False},0.0
38,243,0446358592,0.0,0.0,{'was_impossible': False},0.0
41,243,0446600466,0.0,0.0,{'was_impossible': False},0.0
54,243,051513290X,0.0,0.0,{'was_impossible': False},0.0
11,243,0345311396,0.0,0.0,{'was_impossible': False},0.0


In [122]:
worst_predictions_243

index,uid,iid,r_ui,est,details,err
10,243,0316899984,7.0,0.798076,{'was_impossible': False},6.201924
36,243,044023722X,7.0,0.692167,{'was_impossible': False},6.307833
27,243,0385720106,7.0,0.530512,{'was_impossible': False},6.469488
5,243,0316601950,9.0,1.492289,{'was_impossible': False},7.507711
0,243,0060915544,10.0,1.934903,{'was_impossible': False},8.065097
