### Building a Recommender System with Surprise

This try-it explores additional algorithms in the `Surprise` library for generating recommendations. Your goal is to identify the best-performing algorithm by minimizing cross-validated prediction error (the code in this notebook reports RMSE and MAE). You will also select a dataset from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to grouplens and examine the datasets available. Choose one that maps easily onto the user, item, and rating columns that `Surprise` expects. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms for building your recommendations. For more information on the algorithms, see the documentation for the prediction algorithms package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.
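
To make the evaluation goal concrete, here is a minimal, library-free sketch of k-fold cross-validation that scores a trivial predictor (always guessing the mean of the training ratings) with MAE and RMSE. The helper names `k_fold_indices` and `cross_validate_mean_baseline` and the toy ratings are illustrative assumptions, not part of `Surprise`. The folds are contiguous and unshuffled to keep the arithmetic easy to check, whereas a library run would normally shuffle the ratings first.

```python
import math


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, near-equal test folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate_mean_baseline(ratings, k=3):
    """Return per-fold (MAE, RMSE) for a predictor that always guesses the training mean."""
    results = []
    for test_idx in k_fold_indices(len(ratings), k):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(ratings) if i not in held_out]
        test = [ratings[i] for i in test_idx]
        prediction = sum(train) / len(train)  # fit on the training folds only
        errors = [r - prediction for r in test]
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        results.append((mae, rmse))
    return results


toy_ratings = [8, 7, 9, 5, 6, 10]  # made-up ratings, not taken from any dataset
fold_scores = cross_validate_mean_baseline(toy_ratings, k=3)
mean_rmse = sum(rmse for _, rmse in fold_scores) / len(fold_scores)
print(fold_scores, mean_rmse)
```

A mean-rating predictor like this is a useful floor: an algorithm worth keeping should beat its cross-validated RMSE.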



In [None]:
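from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise import accuracy  # needed by recommenderMetrics below
from surprise.model_selection import cross_validate

import numpy as np  # needed by the outlier analysis below
import pandas as pd
import plotly.express as px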

In [None]:
def loadBookRatingsData():
    print("Loading user data ...")
    user = pd.read_csv('data/BX-Users.csv', sep=';', encoding="latin-1")
    print("Loading ratings data ...")
    ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding="latin-1")
    print("Merge and return dataframe ...")
    df = pd.merge(user, ratings, on='User-ID', how='inner')
    return df

In [None]:
def recommenderMetrics(predictions):
    mae = accuracy.mae(predictions, verbose=False)
    rmse = accuracy.rmse(predictions, verbose=False)
    return (mae, rmse)

In [None]:
df = loadBookRatingsData()
df.drop(['Location','Age'], axis=1, inplace=True)
df.sample(5)

### Plots to understand the ratings distribution

In [None]:
# Count how often each rating value occurs, highest rating first
dist_ratings = (df['Book-Rating'].value_counts().sort_index(ascending=False)
                .rename_axis('rating').reset_index(name='count'))
dist_ratings['percent'] = dist_ratings['count'] / len(df) * 100

# Bar height is the raw count; the text label on each bar is its share of all ratings
fig = px.bar(dist_ratings, x='rating', y='count',
             text=['{:.1f} %'.format(val) for val in dist_ratings['percent']],
             color='count',
             title="Ratings Distribution",
             labels={'rating': 'Rating Scale (0-10)', 'count': 'Number of ratings'}, height=500)
fig.show()

### Outlier Analysis

In [None]:
def find_boundaries(df, variable, distance):
    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)
    
    lower_boundary = df[variable].quantile(0.25) - (IQR*distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR*distance)

    return lower_boundary, upper_boundary
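
As a hand-checkable illustration of the IQR rule used by `find_boundaries`, the sketch below computes the same fences with only the standard library. The function name `iqr_boundaries` and the toy values are assumptions made for this example. `method="inclusive"` is chosen because it should match the linear interpolation that pandas' `quantile` uses by default.

```python
import statistics


def iqr_boundaries(values, distance=1.5):
    """Return (lower, upper) fences placed distance * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - distance * iqr, q3 + distance * iqr


toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up numbers with one extreme value
lower, upper = iqr_boundaries(toy_values)
flagged = [v for v in toy_values if v < lower or v > upper]
print(lower, upper, flagged)
```

Here the quartiles are 3.25 and 7.75, so the IQR is 4.5 and only the value 100 falls outside the fences.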

In [None]:
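# ISBN is a string identifier, so quantiles are not meaningful for it.
# Assumption: the fences were meant for the numeric 'Book-Rating' column.
lo, up = find_boundaries(df, 'Book-Rating', 1.5)
outliers = (df['Book-Rating'] > up) | (df['Book-Rating'] < lo)

df.loc[~outliers]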

In [None]:
# filter rarely-rated books
min_book_ratings = 25
filter_1 = df['ISBN'].value_counts() > min_book_ratings
filter_1 = filter_1[filter_1].index.tolist()

# filter rarely-rating users
min_user_ratings = 50
filter_2 = df['User-ID'].value_counts() > min_user_ratings
filter_2 = filter_2[filter_2].index.tolist()

df_new = df[(df['ISBN'].isin(filter_1)) & (df['User-ID'].isin(filter_2))]
print('Original data frame has--->:\t{}'.format(df.shape))
print('After applying filters---->:\t{}'.format(df_new.shape))
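
The same two-threshold filter can be sketched on a handful of made-up `(user, item, rating)` rows, which makes it easy to see which rows survive. The function name `filter_sparse` and the toy rows are assumptions for illustration. As in the cell above, counts are taken once on the unfiltered data, so a surviving user or book may end up below the threshold after the other filter is applied.

```python
from collections import Counter


def filter_sparse(rows, min_item_ratings, min_user_ratings):
    """Keep rows whose item and user each have more ratings than their threshold."""
    item_counts = Counter(item for _, item, _ in rows)
    user_counts = Counter(user for user, _, _ in rows)
    return [
        (user, item, rating)
        for user, item, rating in rows
        if item_counts[item] > min_item_ratings and user_counts[user] > min_user_ratings
    ]


toy_rows = [  # made-up data: b3 and u3 each appear only once
    ("u1", "b1", 8), ("u1", "b2", 6), ("u1", "b3", 9),
    ("u2", "b1", 7), ("u2", "b2", 5),
    ("u3", "b1", 10),
]
kept = filter_sparse(toy_rows, min_item_ratings=1, min_user_ratings=1)
print(len(toy_rows), len(kept))
```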

### SURPRISE !!!!

In [None]:
# Ratings in this dataset run from 0 to 10 (see the distribution plot above)
reader = Reader(rating_scale=(0, 10))
#data = Dataset.load_from_df(df_new[['User-ID', 'ISBN', 'Book-Rating']], reader)
data = Dataset.load_from_df(df[['User-ID', 'ISBN', 'Book-Rating']], reader)

In [None]:
def evaluator():
    benchmark_results = []
    algorithms = [KNNBasic(), SVD(), NMF(), SlopeOne(), CoClustering()]
    for algorithm in algorithms:
        cv_results = cross_validate(algorithm, data, measures=['MAE','RMSE'], cv=3, verbose=False)
        tmp = pd.DataFrame.from_dict(cv_results).mean(axis=0)
        # Series.append was removed in pandas 2.0; pd.concat works across versions
        tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['algorithm'])])
        benchmark_results.append(tmp)
    
    return benchmark_results

In [None]:
rs = evaluator()
pd.DataFrame(rs).set_index('algorithm').sort_values('test_rmse')  