# CF Recommendation System

**COMP9417**   

**Aim:** The aim of this project is to build a CF recommendation engine using the **Book-Crossing** dataset.

In [20]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from surprise import Reader, Dataset
from surprise import model_selection, accuracy
from surprise import NMF
from surprise import SVD
from surprise import SVDpp
from surprise import CoClustering
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
plt.style.use('seaborn-white') # Use seaborn-style plots
%matplotlib inline

#Read files.
users = pd.read_csv('book_crossing_dataset/BX-Users.csv', sep=';', encoding='cp1252')
book_ratings = pd.read_csv('book_crossing_dataset/BX-Book-Ratings.csv', sep=';',encoding='cp1252')
books = pd.read_csv('book_crossing_dataset/BX-Books.csv', sep=';',encoding ='iso-8859-1')


#users
users.loc[(users.Age>100)|(users.Age<10), 'Age'] = np.nan#Reduce the age of the readers within a reasonable range. For example, older than 3 years old and younger than 90 years old.

# split_location = users.Location.str.split(',', 2, expand=True)
# split_location.columns = ['city', 'state', 'country']
# users = users.join(split_location)

# users.drop(columns=['Location'], inplace=True)#Split location to three parts: city, state, country.
# users.country.replace('', np.nan, inplace=True)#Replace empty string with nan.
# users.city.replace('', np.nan, inplace=True)#Replace empty string with nan.
# users.state.replace('', np.nan, inplace=True)#Replace empty string with nan.

#books
books.columns = books.columns.str.replace('Year-Of-Publication', 'Publicationyear')
books.Publicationyear = pd.to_numeric(books.Publicationyear, errors='coerce')
books.Publicationyear.replace(0, np.nan, inplace=True)
# books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True) # drop image-url columns

books = books.loc[~(books.ISBN.isin(books[books.Publicationyear<1900].ISBN))] # remove historical books
books = books.loc[~(books.ISBN.isin(books[books.Publicationyear>2020].ISBN))] # remove future books

# books.Publisher = books.Publisher.str.replace('&amp', '&', regex=False)#这步真的有必要吗？

#books_ratings
book_ratings.columns = book_ratings.columns.str.strip().str.replace('-', '_')
book_ratings = book_ratings[book_ratings.Book_Rating != 0]#Remove 0 rating rows.也感觉并不是特别必要。

books_with_ratings = book_ratings.join(books.set_index('ISBN'), on='ISBN')#join the books to books_ratings.
books_with_ratings.dropna(subset=['Book-Title'], inplace=True) # remove rows with missing title/author data
#作者把ISBN设为唯一是考虑到不同出版社出的书应该会同一个，他认为这个推荐系统的目的在于推荐书而不在于哪个出版社。但是我觉得不同出版社
#的出版，也会影响同一个本书内容的质量，比如有些出版社出的《简爱》就是会比另一些出版社的更加良心，排版插图也会更好
#所以我认为，不能把这个因素去除。就把ISBN唯一化这些操作去掉了。（之后如果觉得实在有必要，可以再加上这个操作，也不麻烦。）

users.columns = users.columns.str.replace('User-ID', 'User_ID')
books_users_ratings = books_with_ratings.join(users.set_index('User_ID'), on='User_ID')#join all tables by user_id.
books_users_ratings.columns = books_users_ratings.columns.str.replace('Book-Title', 'Book_Title')
books_users_ratings.columns = books_users_ratings.columns.str.replace('Book-Author', 'Book_Author')

#Create user_item_rating table, used in CF model.
user_item_rating = books_users_ratings[['User_ID', 'ISBN', 'Book_Rating']]




reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(user_item_rating, reader)

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
user_item_rating.count()

User_ID        383821
ISBN           383821
Book_Rating    383821
dtype: int64

This is much more clear! Now we can see that 8 is the most frequent rating, while users tend to give ratings > 5, with very few low ratings given.

# Building a recommender system using collaborative filtering

Collaborative filtering use similarities of the 'user' and 'item' fields, with values of 'rating' predicted based on either user-item, or item-item similarity:
 - Item-Item CF: "Users who liked this item also liked..."
 - User-Item CF: "Users who are similar to you also liked..."
 
In both cases, we need to create a user-item matrix built from the entire dataset. We'll create a matrix for each of the training and testing sets, with the users as the rows, the books as the columns, and the rating as the matrix value. Note that this will be a very sparse matrix, as not every user will have watched every movie etc.

We'll first create a new dataframe that contains only the relevant columns (```user_id```, ```unique_isbn```, and ```book_rating```).

Looks perfect! Continue.

# Using the ```surprise``` library for building a recommender system
Several common model-based algorithms including SVD, KNN, and non-negative matrix factorization are built-in!  
See [here](http://surprise.readthedocs.io/en/stable/getting_started.html#basic-usage) for the docs.

Where: SVD = Singular Value Decomposition (orthogonal factorization), NMF = Non-negative Matrix Factorization.

**Note** that when using the ```surprise``` library we don't need to manually create the mapping of user_id and unique_isbn to integers in a custom dict. See [here](http://surprise.readthedocs.io/en/stable/FAQ.html#raw-inner-note) for details. 

### SVD model

**_Using cross-validation (5 folds)_**

In [162]:
# Load SVD algorithm
model = SVD()

# Train on books dataset
%time model_selection.cross_validate(model, data, measures=['RMSE'], cv=10, verbose=True)

Evaluating RMSE of algorithm SVD on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    1.6385  1.6474  1.6266  1.6325  1.6261  1.6346  1.6295  1.6313  1.6310  1.6255  1.6323  0.0063  
Fit time          22.63   22.51   22.76   22.12   22.84   22.99   22.87   22.50   22.83   22.45   22.65   0.25    
Test time         0.34    0.59    0.29    0.34    0.34    0.58    0.34    0.35    0.66    0.34    0.42    0.13    
CPU times: user 4min 6s, sys: 1.25 s, total: 4min 8s
Wall time: 4min 8s


{'test_rmse': array([1.63845871, 1.64744161, 1.62656201, 1.632533  , 1.62613652,
        1.63455574, 1.62946535, 1.63131143, 1.63099894, 1.62545001]),
 'fit_time': (22.62565779685974,
  22.509323120117188,
  22.755794763565063,
  22.116678953170776,
  22.84161925315857,
  22.985612154006958,
  22.869303226470947,
  22.503407955169678,
  22.832932233810425,
  22.445928812026978),
 'test_time': (0.3417341709136963,
  0.5862329006195068,
  0.2906801700592041,
  0.34453821182250977,
  0.3365468978881836,
  0.5803861618041992,
  0.3440110683441162,
  0.350543737411499,
  0.6571528911590576,
  0.34018874168395996)}

The SVD model gives an average RMSE of ca. 1.64 after 5-folds, with a fit time of ca. 28 s for each fold.

### NMF model

In [166]:
# Load NMF algorithm
model = NMF()
# Train on books dataset
%time model_selection.cross_validate(model, data, measures=['RMSE'], cv=10, verbose=True)

Evaluating RMSE of algorithm NMF on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    2.4990  2.4815  2.4597  2.4701  2.4634  2.4633  2.4811  2.4650  2.4650  2.4753  2.4723  0.0115  
Fit time          39.69   45.88   48.24   44.02   43.59   42.57   43.02   42.18   44.62   48.50   44.23   2.58    
Test time         0.28    0.31    0.30    0.30    0.29    0.27    0.29    0.29    0.37    0.31    0.30    0.03    
CPU times: user 7min 35s, sys: 3.97 s, total: 7min 39s
Wall time: 7min 44s


{'test_rmse': array([2.49903772, 2.48153306, 2.45972642, 2.47006799, 2.46335234,
        2.46333562, 2.48109868, 2.46501613, 2.46499439, 2.47530794]),
 'fit_time': (39.68878102302551,
  45.87646198272705,
  48.23786282539368,
  44.01504421234131,
  43.59172606468201,
  42.56552600860596,
  43.024904012680054,
  42.18158173561096,
  44.62297987937927,
  48.50446915626526),
 'test_time': (0.27743101119995117,
  0.3066141605377197,
  0.29669189453125,
  0.30005407333374023,
  0.28624987602233887,
  0.27041196823120117,
  0.287106990814209,
  0.2915160655975342,
  0.36918187141418457,
  0.310427188873291)}

In [3]:
SVDppmodel = SVDpp()

%time model_selection.cross_validate(SVDppmodel, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.6577  1.6476  1.6470  1.6459  1.6439  1.6484  0.0048  
Fit time          1150.80 1175.12 1150.90 1163.04 1179.14 1163.80 11.83   
Test time         15.74   15.26   14.79   15.89   15.23   15.38   0.39    
CPU times: user 1h 37min 51s, sys: 13 s, total: 1h 38min 4s
Wall time: 1h 38min 24s


{'test_rmse': array([1.6577234 , 1.64759927, 1.64699565, 1.64592592, 1.64393435]),
 'fit_time': (1150.8043429851532,
  1175.1193573474884,
  1150.904592037201,
  1163.0380742549896,
  1179.1386499404907),
 'test_time': (15.741928100585938,
  15.261472940444946,
  14.787420988082886,
  15.885241985321045,
  15.225876808166504)}

In [3]:
CoClusteringmodel = CoClustering()

%time model_selection.cross_validate(CoClusteringmodel, data, measures=['RMSE'], cv=10, verbose=True)

Evaluating RMSE of algorithm CoClustering on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    1.8688  1.8527  1.8528  1.8456  1.8349  1.8555  1.8288  1.8528  1.8407  1.8604  1.8493  0.0114  
Fit time          27.44   26.46   27.38   29.25   28.28   29.33   27.76   27.80   31.84   28.41   28.40   1.41    
Test time         0.30    0.51    0.32    0.29    0.31    0.29    0.30    0.28    0.30    0.30    0.32    0.06    
CPU times: user 4min 58s, sys: 2.27 s, total: 5min
Wall time: 5min 5s


{'test_rmse': array([1.86877187, 1.85269757, 1.8527994 , 1.84562557, 1.83491895,
        1.85548012, 1.82883519, 1.85275956, 1.84067192, 1.86039078]),
 'fit_time': (27.441571950912476,
  26.462570905685425,
  27.38216996192932,
  29.25297713279724,
  28.275870084762573,
  29.331010103225708,
  27.760194063186646,
  27.803400993347168,
  31.844605922698975,
  28.411500930786133),
 'test_time': (0.3049488067626953,
  0.5077171325683594,
  0.31685495376586914,
  0.29229187965393066,
  0.31412506103515625,
  0.2852327823638916,
  0.2976999282836914,
  0.28429079055786133,
  0.29709696769714355,
  0.29652976989746094)}

In [2]:
KNNBasicmodel = KNNBasic()

%time model_selection.cross_validate(KNNBasicmodel, data, measures=['RMSE'], cv=10, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    1.9550  1.9269  1.9446  1.9494  1.9267  1.9393  1.9362  1.9366  1.9341  1.9384  1.

{'test_rmse': array([1.95503598, 1.92689883, 1.94458085, 1.94943428, 1.92673164,
        1.93925069, 1.93623603, 1.93659582, 1.93405493, 1.93840509]),
 'fit_time': (301.0791528224945,
  295.33977484703064,
  295.8667778968811,
  293.0790169239044,
  293.1191260814667,
  304.39737820625305,
  294.659343957901,
  306.4251730442047,
  297.38123202323914,
  295.9373607635498),
 'test_time': (4.941538095474243,
  4.425936222076416,
  4.949044942855835,
  4.251443147659302,
  4.7189109325408936,
  4.321928262710571,
  3.9479031562805176,
  4.9409191608428955,
  4.21153712272644,
  4.756757020950317)}

In [3]:
KNNWithMeansmodel = KNNWithMeans()

%time model_selection.cross_validate(KNNWithMeansmodel, data, measures=['RMSE'], cv=10, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    1.8416  1.8290  1.8384  1.8378  1.8268  1.8362  1.8245  1.8327  1.8333  1.8341

{'test_rmse': array([1.84157317, 1.8289707 , 1.83839133, 1.83775875, 1.8268469 ,
        1.83616956, 1.82446225, 1.83268145, 1.83334177, 1.83406153]),
 'fit_time': (333.20009207725525,
  349.3229100704193,
  347.1997780799866,
  333.2338807582855,
  321.6053488254547,
  326.5748689174652,
  343.92419695854187,
  321.73685026168823,
  309.9673421382904,
  311.6091048717499),
 'test_time': (3.248699903488159,
  3.8688149452209473,
  3.1183648109436035,
  4.309756278991699,
  3.682321071624756,
  3.93294095993042,
  4.494416952133179,
  4.840296030044556,
  4.339203834533691,
  5.114516973495483)}

In [1]:
KNNWithZScoremodel = KNNWithZScore()

%time model_selection.cross_validate(KNNWithZScoremodel, data, measures=['RMSE'], cv=10, verbose=True)



NameError: name 'KNNWithZScore' is not defined

## Optimizing the SVD algorithm with parameter tuning
Since it seems like the SVD algorithm is our best choice, let's see if we can improve the predictions even further by optimizing some of the algorithm hyperparameters.

One way of doing this is to use the handy ```GridSearchCV``` method from the ```surprise``` library. When passed a range of hyperparameter values, ```GridSearchCV``` will automatically search through the parameter-space to find the best-performing set of hyperparameters.

In [2]:
# We'll remake the training set, keeping 20% for testing
trainset, testset = model_selection.train_test_split(data, test_size=0.2)

In [3]:
### Fine-tune Surprise SVD model useing GridSearchCV
from surprise.model_selection import GridSearchCV

param_grid = {'n_factors': [80, 100, 120], 'lr_all': [0.001, 0.005, 0.01], 'reg_all': [0.01, 0.02, 0.04]}

# Optimize SVD algorithm for both root mean squared error ('rmse') and mean average error ('mae')
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

In [None]:
# Fit the gridsearch result on the entire dataset
%time gs.fit(data)

In [None]:
# Return the best version of the SVD algorithm
model = gs.best_estimator['rmse']

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

In [None]:
model_selection.cross_validate(model, data, measures=['rmse', 'mae'], cv=5, verbose=True)

The mean RSME using the optimized parameters was 1.6351 over 5 folds, with an average fit time of ca. 24s.

In [21]:
### Use the new parameters with the training set
trainset, testset = model_selection.train_test_split(data, test_size=0.2)
model = SVD(n_factors=80, lr_all=0.005, reg_all=0.04)
model.fit(trainset) # re-fit on only the training data using the best hyperparameters
test_pred = model.test(testset)
print("SVD : Test Set")
accuracy.rmse(test_pred, verbose=True)

SVD : Test Set
RMSE: 1.6346


1.6345602418136698

Using the optimized hyperparameters we see a slight improvement in the resulting RMSE (1.629) compared with the unoptimized SVD algorithm (1.635)1

## Testing some of the outputs (ratings and recommendations)
Would like to do an intuitive check of some of the recommendations being made.

Let's just choose a random user/book pair (represented in the ```suprise``` library as ```uid``` and ```iid```, respectively).

**Note:** The ```model``` being used here is the optimized SVD algorithm that has been fit on the training set.

The following function was adapted from the ```surprise``` docs, and can be used to get the top book recommendations for each user.

In [22]:
from collections import defaultdict

def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
        
    return top_n

Let's get the Top 10 recommended books for each user_id in the test set.

In [23]:
pred = model.test(testset)
top_n = get_top_n(pred)

In [24]:
top_n[60337]

[('0553280589', 9.043365396192447),
 ('0323014305', 8.766980560764933),
 ('0380707799', 8.716578824391407),
 ('0684826801', 8.680433153606122),
 ('1563410443', 8.585409290009137),
 ('0151000719', 8.52619583599254),
 ('0380723239', 8.509354271706854),
 ('0375503196', 8.509354271706854),
 ('0452273811', 8.509354271706854),
 ('0553057340', 8.509354271706854)]

In [25]:
for isbn in top_n[60337]:
    
    print(isbn[0])

0553280589
0323014305
0380707799
0684826801
1563410443
0151000719
0380723239
0375503196
0452273811
0553057340


In [26]:
books.ISBN

0         0195153448
1         0002005018
2         0060973129
3         0374157065
4         0393045218
5         0399135782
6         0425176428
7         0671870432
8         0679425608
9         074322678X
10        0771074670
11        080652121X
12        0887841740
13        1552041778
14        1558746218
15        1567407781
16        1575663937
17        1881320189
18        0440234743
19        0452264464
20        0609804618
21        1841721522
22        1879384493
23        0061076031
24        0439095026
25        0689821166
26        0971880107
27        0345402871
28        0345417623
29        0684823802
             ...    
271349    3320016822
271350    3423200944
271351    3453065123
271352    3525335423
271353    3548740146
271354    381440176X
271355    3893312307
271356    0971854823
271357    0316640786
271358    3257217323
271359    3596156904
271360    1874166633
271361    0130897930
271362    020130998X
271363    2268032019
271364    0684860112
271365    039

In [27]:
books_users_ratings

Unnamed: 0,User_ID,ISBN,Book_Rating,Book_Title,Book_Author,Publicationyear,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,Location,Age
1,276726,0155061224,5,Rites of Passage,Judith Rae,2001.0,Heinle,http://images.amazon.com/images/P/0155061224.0...,http://images.amazon.com/images/P/0155061224.0...,http://images.amazon.com/images/P/0155061224.0...,"seattle, washington, usa",
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999.0,Cambridge University Press,http://images.amazon.com/images/P/052165615X.0...,http://images.amazon.com/images/P/052165615X.0...,http://images.amazon.com/images/P/052165615X.0...,"rijeka, n/a, croatia",16.0
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001.0,Cambridge University Press,http://images.amazon.com/images/P/0521795028.0...,http://images.amazon.com/images/P/0521795028.0...,http://images.amazon.com/images/P/0521795028.0...,"rijeka, n/a, croatia",16.0
8,276744,038550120X,7,A Painted House,JOHN GRISHAM,2001.0,Doubleday,http://images.amazon.com/images/P/038550120X.0...,http://images.amazon.com/images/P/038550120X.0...,http://images.amazon.com/images/P/038550120X.0...,"torrance, california, usa",
16,276747,0060517794,9,Little Altars Everywhere,Rebecca Wells,2003.0,HarperTorch,http://images.amazon.com/images/P/0060517794.0...,http://images.amazon.com/images/P/0060517794.0...,http://images.amazon.com/images/P/0060517794.0...,"iowa city, iowa, usa",25.0
19,276747,0671537458,9,Waiting to Exhale,Terry McMillan,1995.0,Pocket,http://images.amazon.com/images/P/0671537458.0...,http://images.amazon.com/images/P/0671537458.0...,http://images.amazon.com/images/P/0671537458.0...,"iowa city, iowa, usa",25.0
20,276747,0679776818,8,Birdsong: A Novel of Love and War,Sebastian Faulks,1997.0,Vintage Books USA,http://images.amazon.com/images/P/0679776818.0...,http://images.amazon.com/images/P/0679776818.0...,http://images.amazon.com/images/P/0679776818.0...,"iowa city, iowa, usa",25.0
21,276747,0943066433,7,How to Deal With Difficult People,Rick Brinkman,1995.0,Careertrack Inc.,http://images.amazon.com/images/P/0943066433.0...,http://images.amazon.com/images/P/0943066433.0...,http://images.amazon.com/images/P/0943066433.0...,"iowa city, iowa, usa",25.0
23,276747,1885408226,7,The Golden Rule of Schmoozing,Aye Jaye,1998.0,Listen &amp Live Audio,http://images.amazon.com/images/P/1885408226.0...,http://images.amazon.com/images/P/1885408226.0...,http://images.amazon.com/images/P/1885408226.0...,"iowa city, iowa, usa",25.0
24,276748,0747558167,6,Apricots on the Nile: A Memoir with Recipes,Colette Rossant,2002.0,Bloomsbury Publishing Plc,http://images.amazon.com/images/P/0747558167.0...,http://images.amazon.com/images/P/0747558167.0...,http://images.amazon.com/images/P/0747558167.0...,"jubail ind.-city, eastern province, saudi arabia",39.0


In [28]:
def get_reading_list(userid):
    """
    Retrieve full book titles from full 'books_users_ratings' dataframe
    """
    reading_list = defaultdict(list)
    top_n = get_top_n(predictions, n=10)
    for n in top_n[userid]:
        book, rating = n
        title = books_users_ratings.loc[books_users_ratings.ISBN==book].book_title.unique()[0]
        reading_list[title] = rating
    return reading_list

In [29]:
# Just take a random look at user_id=60337
example_reading_list = get_reading_list(userid=60337)
for book, rating in example_reading_list.items():
    print(f'{book}: {rating}')

NameError: name 'predictions' is not defined

Have tried out a few different ```userid``` entries (from the ```testset```) to see what the top 10 books that user would like are and they seem pretty well related, indicating that the recommendation engine is performing reasonably well!

# Summary

In this notebook a dataset from the 'Book-Crossing' website was used to create a recommendation system. A few different approaches were investigated, including memory-based correlations, and model-based matrix factorization algorithms[2]. Of these, the latter - and particularly the Singular Value Decomposition (SVD) algorithm - gave the best performance as assessed by comparing the predicted book ratings for a given user with the actual rating in a test set that the model was not trained on.

The only fields that were used for the model were the "user ID", "book ID", and "rating". There were others available in the dataset, such as "age", "location", "publisher", "year published", etc, however for these types of recommendation systems it has often been found that additional data fields do not increase the accuracy of the models significantly[1]. A "Grid Search Cross Validation" method was used to optimize some of the hyperparameters for the model, resulting in a slight improvement in model performance from the default values.

Finally, we were able to build a recommender that could predict the 10 most likely book titles to be rated highly by a given user.

It should be noted that this approach still suffers from the "cold start problem"[3] - that is, for users with no ratings or history the model will not make accurate predictions. One way we could tackle this problem may be to initially start with popularity-based recommendations, before building up enough user history to implement the model. Another piece of data that was not utilised in the current investigation was the "implicit" ratings - denoted as those with a rating of "0" in the dataset. Although more information about these implicit ratings (for example, does it represent a positive or negative interaction), these might be useful for supplementing the "explicit" ratings recommender.

# References

1. http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/
2. https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html
3. https://towardsdatascience.com/building-a-recommendation-system-for-fragrance-5b00de3829da