# 158.755 Data Science - Making Sense of Data Project 4
# Study In Recommender Engine

## 1.0 Abstract

A recommender system, or a recommender engine intends to seek prediction of 'rating' or 'preference' for a user that would rank an item or event, mainly used in commercial web applications. 

They have been utilized in various areas over the last ten years, commonly used as playlist generators for video and music services such as Youtube, Netflix, and BiliBili, product recommenders like Amazon, Trademe, and Ebay, and content or news based recommenders for social media platforms as Facebook, Twitter and Instergram. Also, there are other popular recommender systems being utilized for specific topics as hotels booking, dating matching, and online competition game team up.

Flow and monetization are the core concept associated with commercial web applications for internet. Simply speaking, flow is the measurement that the number of visitings for a web application, while monetization evaluates its overall income for keeping the business up and running continuously and healthily. (i.e. advertisement income of Youtube, membership subscriptions for Netflix, and fees/charges for each successfully sale for Amazon ). 

Therefore, recommendation engine is one of the key part for monetization that allows finding consumers' real demand through directing them to their most interest items or services. In addition to this , owners of commercial web applications, would be able accurately deploy advertisements and services to their customers based on a successful recommendation engine set up, for example ,playlist generators (Youtube) could generate just right advertisements to their viewers , and product recommenders like Amazon could pushed customers related products when they are viewing a certain item. Thus, it would optimise the resource usage and amplify income for a commercial organisation who has developed such kind of commercial web applications. 

In this project , we would implement a ***'breath first'*** strategy so as to explore as much lifecycle and implementations for a recommendation engine , such as the input and output data structure, machine learning algorithms, evaluation standard for comparison of different algorithms, and its position and role in architecture of a commercial web applications , instead of undertaking ***in-depth*** research on specific algorithms as per their theory and implementations. Thus, we will simply use an existing library to compute our findings and recommendations for this project.



## 2.0 Indroduction


## 3.0 Data Collection

The datasets for a commercial web application have almost never been exposed for public usage or research, as they are private and strategy assets for business. Normally they are generated through 

Fortunately, there are existing web source regarding recommendation engine study for us to play with , which is Amazon product data prepared by Associate Professor Julian McAuley, UCSD at http://jmcauley.ucsd.edu/data/amazon/.

Normally, user activates for a commercial web app could be concluded as viewing a certain web page , purchasing products, comments and rank items , content or news etc. It is quite difficult to summarise a standard data structure for a recommendation engine due to the variety demands and requirements for various business activities.

However, a classic data structure format has been widely used after long term experimentation and operation with major internet giant companies such as Amazon and Facebook:

- ***user id*** : unique user id
- ***item id*** : unique item id
- ***behaviour type*** : type of behaviour , i.e. purchase or view an item
- ***context*** : behaviour context, including location and time etc.
- ***behaviour weight*** : weight could be the viewing length for a video or rank for an item 
- ***behaviour content*** : if a user comments something, the content could be saved as a text file. If user click an item, the content could be a binary input.


### 3.1 Web API

Amazon Instant Video review data , http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz , has been selected as the dataset for this project. 




### 3.2 Web Crawling

## 4.0 EDA

## 5.0 Algorithm Research

As there are many approaches to implement recommender engine, such as Collaborative filtering, Content-based filtering, Multi-criteria recommender systems etc. Their brief concept and implementations could be refer to 
https://en.wikipedia.org/wiki/Recommender_system.

In this project , we would use an existing library named as ***SurPRISE***, stands for ***Simple Python RecommendatIon System Engine.***,http://surpriselib.com/.

As it has provided various ready-to-use prediction algorithms with good documentation and use cases, such as baseline algorithms, neighborhood methods, matrix factorization-based ( SVD, PMF, SVD++, NMF), and many others. Also, various similarity measures (cosine, MSD, pearson…) are built-in.

In this section ,we will mainly focus on ***matrix factorization-based modules and similarity modules***.


### 5.1 Input Data Structure

The input data structure is very simple , just a big user - item with rating matrix , that is represented as a specific user ranks an item with certain rating as per the code below:

In [62]:
#full dataset for research
import numpy as np
import pandas as pd
# df = pd.read_csv('full_dataset.csv')
df = pd.read_csv('testForInput.csv')
# df = pd.read_csv('test.csv')
print(df.shape)
# df = df[:1000]
df.shape

(4976, 4)


(4976, 4)

In [63]:
df.head()

Unnamed: 0,itemID,rating,userID,timestamp
0,700026657,5,A1HP7NVNPFMA4N,1445040000
1,6050036071,5,A3MKO61QMJ8V6V,1497312000
2,9629971372,1,A3BR8K6BJMIBEY,1248048000
3,9629971372,5,A1YC9RVAWAAUAN,1361750400
4,9882106463,5,ALJO1MOF8TMNO,1359072000


In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4976 entries, 0 to 4975
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   itemID     4976 non-null   object
 1   rating     4976 non-null   int64 
 2   userID     4976 non-null   object
 3   timestamp  4976 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 155.6+ KB


### 5.2 Performance measures

The commonly used metrics are the ***Mean Absolute Error(MAE)*** and ***Root Mean Squared Error(RMSE)***,  ***precision*** and ***recall*** will also be used to evaluate the quality of a model for comparison. 


##  6.0 Matrix Factorization-Based Modules

In [65]:
from collections import defaultdict
from surprise import Dataset
import pandas as pd
from surprise import SVD
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
import numpy as np
from surprise import dump
import os
from surprise.model_selection import KFold
import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import get_dataset_dir

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls




In [66]:
# df = pd.read_csv('full_dataset.csv')
# A reader is still needed but only the rating_scale param is requiered.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

### 6.1 SVD

The famous SVD algorithm, which was popularized by Simon Funk during the Netflix Prize in 2006. It's documentation and reference could be referred to links as below:

https://sifter.org/~simon/journal/20061211.html

https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

SVD is stand for singular value decomposition mathmatically, is a factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any m × n matrix via an extension of the polar decomposition. 

Actually , it is dry and headache to focus on the mathmatical details for this algorithm, ***SurPRISE*** lib has encapsulated ready to use functions to implement this approach.

The code below is a standard machine learning process , which has iterated through all combinations of parameters in ***param_grid*** variable with K folds method (3 folds), so as to find the best prediction model based on RMSE and MAE score.

***{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.2}*** parameters have been found as the best score model for this approach


In [67]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV


#choose various parameters for model
param_grid = {'n_epochs': [5, 10,15], 'lr_all': [0.002, 0.005,0.01],
              'reg_all': [0.2, 0.4, 0.6]}

#iterate thorugh parameter grid to find best model
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print('best RMSE score')
print(gs.best_score['rmse'])
print(gs.best_score['mae'])
# combination of parameters that gave the best RMSE score
print('combination of parameters that gave the best RMSE score')
print(gs.best_params['rmse'])
print(gs.best_params['mae'])

best RMSE score
1.2022332128844253
0.9547625719462639
combination of parameters that gave the best RMSE score
{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.6}
{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.2}


Once the best SVD model has been selected , the ***precision@k and recall@k*** could be computed with 5 folds method. As their results are very close, we could justify that data are distributed with not much outliers.

In [68]:
kf = KFold(n_splits=5)
#best algo based on above
algo = gs.best_estimator['rmse']
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

precision@k
0.7986761710794297
recall@k
0.9643584521384929
precision@k
0.807008547008547
recall@k
0.9641025641025641
precision@k
0.7849794238683128
recall@k
0.9675925925925926
precision@k
0.7990397805212621
recall@k
0.9655349794238683
precision@k
0.8038212214261344
recall@k
0.9493346980552713


We can use the code below to save this model for future reuse.

In [69]:
#save the model for later use
dump.dump(os.path.expanduser("SVD_GS_BEST_RMSE"), algo=gs.best_estimator['rmse'])

# matrix_factorization.SVDpp

In [70]:
from surprise import SVDpp

In [71]:
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print('best RMSE score')
print(gs.best_score['rmse'])
print(gs.best_score['mae'])
# combination of parameters that gave the best RMSE score
print('combination of parameters that gave the best RMSE score')
print(gs.best_params['rmse'])
print(gs.best_params['mae'])

# algo = NMF()
# cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
kf = KFold(n_splits=5)
#best algo based on above
algo_SVDpp = gs.best_estimator['rmse']
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo_SVDpp.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

best RMSE score
1.199754887301433
0.9524461415755071
combination of parameters that gave the best RMSE score
{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.4}
{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.4}
precision@k
0.9634831460674157
recall@k
0.9867211440245148
precision@k
0.95496417604913
recall@k
0.981064483111566
precision@k
0.9645426515930113
recall@k
0.9722507708119219
precision@k
0.9643584521384929
recall@k
0.9821792260692465
precision@k
0.9527235354573484
recall@k
0.9758478931140802


In [73]:
dump.dump(os.path.expanduser("SVDpp_GS_BEST_RMSE"), algo=gs.best_estimator['rmse'])

## 7.0 Similarity Modules

### 7.1 Cosine 

Since there are not many parameters to be tested for a similarity module, we can simple set up model and cross validate them with K folds, the code below computes the RMSE and MAE result plus their fir and test running time.
As RMSE and MAE results are similar for 5 folds, we can specify that there are not much skewness for the dataset.

The documentation for Cosine Similarity could be refered to https://surprise.readthedocs.io/en/stable/similarities.html.

In [76]:
sim_options = {'name': 'cosine', 'user_based': False} # or item based
algo = KNNBaseline(sim_options=sim_options)
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2363  1.2023  1.1384  1.2083  1.2230  1.2017  0.0338  
MAE (testset)     0.9805  0.9614  0.9175  0.9530  0.9596  0.9544  0.0206  
Fit time          0.40    0.49    0.38    0.36    0.30    0.39    0.06    
Test time         0.01    0.01    0.01    0.01    0.01    0.01    0.00    


{'test_rmse': array([1.23629323, 1.2023471 , 1.13836824, 1.20825842, 1.22301658]),
 'test_mae': array([0.9804699 , 0.96139385, 0.9174697 , 0.95300809, 0.95958727]),
 'fit_time': (0.4037201404571533,
  0.49151015281677246,
  0.3783879280090332,
  0.36066722869873047,
  0.3033440113067627),
 'test_time': (0.009987115859985352,
  0.010001897811889648,
  0.010104179382324219,
  0.011300086975097656,
  0.009847164154052734)}

We can also view their ***precision@k and recall@k*** could be computed with 5 folds. Also their results are very close.

In [77]:
kf = KFold(n_splits=5)
#best algo based on above
sim_options = {'name': 'cosine', 'user_based': False} # or item based
algo = KNNBaseline(sim_options=sim_options)
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8023295649194929
recall@k
0.986639260020555
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7789780521262003
recall@k
0.9794238683127572
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7732379979570991
recall@k
0.9816138917262512
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7823549605217988
recall@k
0.9866117404737385
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8070624360286591
recall@k
0.9667349027635619


Lastly , save the model on local disk for future reuse.

In [78]:
#save the model for later use
dump.dump(os.path.expanduser("KNN_COSINE"), algo=algo)

# KNNWithMeans

In [24]:
from surprise import KNNWithMeans #

In [25]:
sim_options = {'name': 'cosine', 'user_based': False} # or item based
algo = KNNWithMeans(k=40, min_k=1,sim_options=sim_options) 
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2142  1.1570  1.0523  1.1165  1.0834  1.1247  0.0567  
MAE (testset)     0.9317  0.9314  0.8732  0.8499  0.8756  0.8924  0.0332  
Fit time          0.03    0.03    0.03    0.03    0.03    0.03    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([1.21421914, 1.15697936, 1.05228381, 1.11646822, 1.08340149]),
 'test_mae': array([0.93175   , 0.93136875, 0.87323125, 0.8498875 , 0.8756375 ]),
 'fit_time': (0.026008129119873047,
  0.03218984603881836,
  0.03406524658203125,
  0.027578115463256836,
  0.026092052459716797),
 'test_time': (0.002290964126586914,
  0.003396749496459961,
  0.0022368431091308594,
  0.002151012420654297,
  0.0028243064880371094)}

In [26]:
kf = KFold(n_splits=5)
#best algo based on above
sim_options = {'name': 'cosine', 'user_based': False} # or item based
algo = KNNWithMeans(k=40, min_k=1,sim_options=sim_options) 
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7927461139896373
recall@k
1.0
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8229706390328151
recall@k
1.0
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7853535353535354
recall@k
1.0
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7727272727272727
recall@k
1.0
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.7806122448979592
recall@k
1.0


In [27]:
#save the model for later use
dump.dump(os.path.expanduser("KNNWithMeans_cosine"), algo=algo)

# SlopeOne

In [29]:
from surprise import SlopeOne

In [33]:
algo = SlopeOne()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1906  1.1598  1.0261  1.0905  1.1131  1.1160  0.0570  
MAE (testset)     0.9488  0.9334  0.8137  0.8610  0.8884  0.8890  0.0490  
Fit time          0.02    0.02    0.02    0.02    0.02    0.02    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([1.19055617, 1.15977474, 1.02605174, 1.09050662, 1.1131164 ]),
 'test_mae': array([0.9488    , 0.93335625, 0.81368125, 0.861     , 0.888375  ]),
 'fit_time': (0.022203922271728516,
  0.021946191787719727,
  0.02132701873779297,
  0.021461009979248047,
  0.021553754806518555),
 'test_time': (0.002370119094848633,
  0.0021209716796875,
  0.002176046371459961,
  0.002090930938720703,
  0.0021169185638427734)}

In [35]:
kf = KFold(n_splits=5)
#best algo based on above
algo = SlopeOne()
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

precision@k
0.7803030303030303
recall@k
0.9949494949494949
precision@k
0.7755102040816326
recall@k
0.9948979591836735
precision@k
0.8549222797927462
recall@k
0.9948186528497409
precision@k
0.7769230769230769
recall@k
1.0
precision@k
0.7487046632124352
recall@k
1.0


In [37]:
dump.dump(os.path.expanduser("SlopeOne_MSD"), algo=algo)

# CoClustering

In [38]:
from surprise import CoClustering

In [40]:
algo = CoClustering()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1386  1.2428  1.0685  1.0785  1.1034  1.1264  0.0631  
MAE (testset)     0.8978  0.9545  0.8656  0.8803  0.8811  0.8958  0.0310  
Fit time          0.33    0.30    0.29    0.28    0.28    0.29    0.02    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([1.1386259 , 1.24284751, 1.06854079, 1.07850058, 1.10335512]),
 'test_mae': array([0.89779375, 0.95446271, 0.86556599, 0.88027391, 0.88105448]),
 'fit_time': (0.3257899284362793,
  0.2955291271209717,
  0.28959226608276367,
  0.2804758548736572,
  0.27942514419555664),
 'test_time': (0.004004955291748047,
  0.0020580291748046875,
  0.0015959739685058594,
  0.001973867416381836,
  0.0015749931335449219)}

In [41]:
kf = KFold(n_splits=5)
#best algo based on above
algo = CoClustering()
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

precision@k
0.8223350253807107
recall@k
0.9949238578680203
precision@k
0.74
recall@k
1.0
precision@k
0.7755102040816326
recall@k
1.0
precision@k
0.7849740932642487
recall@k
0.9974093264248705
precision@k
0.8350604490500865
recall@k
0.9948186528497409


In [42]:
dump.dump(os.path.expanduser("CoClustering_cosine"), algo=algo)

## 8.0 Top N Recommender 

### 8.1 Get Top N Items 

This approach is based on the the top-10 items with highest rating prediction for each user in the dataset. The input could be user rating to different itmes (i.e. 20 items), then , it will return top 10 best prediction items.

Below is a simple example showing the input and output for the top 10 items recommended for a user based on its prediction model

In [74]:
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

### 8.2 Get the k nearest neighbors of a user (or item)

We can use the get_neighbors() methods of the algorithm object. This is only relevant for algorithms that use a ***similarity measure***, such as the k-NN algorithms.

Below is an example where we retrieve the 10 nearest neighbors of one of the video games from the video game review dataset. The output is represneted as 10 rows from the dataset.


In [79]:
algo.get_neighbors(int(10000), k=10)

IndexError: index 10000 is out of bounds for axis 0 with size 3211

## 9.0 Presentation of Web App

## 10.0 Recommendation Engine in System Architecture 