# Recommendation Engines - MovieLens Data

## Tuesday June 20 2017

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file.

### Tasks

1. Load the data into the recommendation format
2. Build and assess model accuracy
3. Make individual recommendations
4. Try multiple models and compare accuracy
5. Consider how a company could use this

In [None]:
# Install Surprise - a useful library for recommendation engines
!pip install scikit-surprise

In [1]:
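# Load Surprise.
# Note: `evaluate` and `print_perf` belong to the older Surprise API used
# throughout this notebook; later releases replaced them with
# `surprise.model_selection.cross_validate`.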
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
from surprise import Reader

In [3]:
# 1. Load the data into the recommendation format

# As we're loading a custom dataset, we need to define a reader. In the
# movielens dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path = '../../data/u.data', reader=reader)
data.split(n_folds=5)
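
The line format and the 5-fold split used above can be illustrated without Surprise. This is only a minimal sketch: `parse_line` and `k_fold_indices` are hypothetical helper names, the sample lines are made up in the `u.data` layout, and the library's own splitting logic may differ in detail.

```python
import random

def parse_line(line):
    """Split one 'user item rating timestamp' line on tab characters."""
    user, item, rating, timestamp = line.rstrip("\n").split("\t")
    # Raw user and item ids stay strings; only rating and timestamp become numbers.
    return user, item, float(rating), int(timestamp)

def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds near-equal test folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]

# Illustrative lines in the u.data layout (values are made up).
sample = ["196\t242\t3\t881250949\n", "186\t302\t3\t891717742\n"]
rows = [parse_line(line) for line in sample]
folds = k_fold_indices(n_rows=10, n_folds=5)

print(rows[0])
print(folds)
```

Each fold serves once as the test set while the remaining rows are used for training, which is why the evaluation output reports five sets of scores.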

In [4]:
# 2. Build and assess model accuracy

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9369
MAE:  0.7389
------------
Fold 2
RMSE: 0.9325
MAE:  0.7353
------------
Fold 3
RMSE: 0.9365
MAE:  0.7383
------------
Fold 4
RMSE: 0.9401
MAE:  0.7385
------------
Fold 5
RMSE: 0.9368
MAE:  0.7416
------------
------------
Mean RMSE: 0.9366
Mean MAE : 0.7385
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9369  0.9325  0.9365  0.9401  0.9368  0.9366  
MAE     0.7389  0.7353  0.7383  0.7385  0.7416  0.7385  
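
The two measures reported above can be written out directly. The following is a minimal sketch: the `toy_pairs` values are made up for illustration, while the per-fold lists are copied from the SVD output above to show how the "Mean" column is obtained.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Made-up (true rating, estimate) pairs, for illustration only.
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(rmse(toy_pairs), mae(toy_pairs))

# Per-fold scores copied from the SVD output above.
fold_rmse = [0.9369, 0.9325, 0.9365, 0.9401, 0.9368]
fold_mae = [0.7389, 0.7353, 0.7383, 0.7385, 0.7416]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
mean_mae = sum(fold_mae) / len(fold_mae)
print(round(mean_rmse, 4), round(mean_mae, 4))
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE does; that is why RMSE is the larger of the two numbers in every fold.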


In [5]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.37   {'was_impossible': False}
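
To see where an estimate such as `est = 4.37` comes from, it helps to write out the biased matrix factorisation form that SVD-style models use: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors, clipped to the rating scale. The sketch below assumes that form; `svd_estimate` is a hypothetical helper and all parameter values are made up rather than taken from the trained model.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 low=1.0, high=5.0):
    """Combine biases and latent factors, then clip to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    return min(high, max(low, raw))

# Hypothetical learned parameters for one (user, item) pair.
est = svd_estimate(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=0.4,
    user_factors=[0.5, -0.2],
    item_factors=[0.6, 0.1],
)
print(round(est, 2))

# Values outside the 1-5 scale are clipped.
clipped = svd_estimate(4.5, 0.5, 0.5, [1.0], [1.0])
print(clipped)
```

In the cell output above, the estimate of 4.37 sits 2.37 points above the `r_ui` of 2 that was passed in, so this single prediction has a much larger error than the average MAE.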


In [9]:
# 4. Try multiple models and compare accuracy

# Try at least 3 of the models mentioned below:
#random_pred.NormalPredictor    Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
#baseline_only.BaselineOnly    Algorithm predicting the baseline estimate for given user and item.
#knns.KNNBasic    A basic collaborative filtering algorithm.
#knns.KNNWithMeans    A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
#knns.KNNBaseline    A basic collaborative filtering algorithm taking into account a baseline rating.
#matrix_factorization.SVD    The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
#matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
#matrix_factorization.NMF    A collaborative filtering algorithm based on Non-negative Matrix Factorization.
#slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.
#co_clustering.CoClustering    A collaborative filtering algorithm based on co-clustering.


# Here's how to run Non-Negative Matrix Factorisation (NMF)
from surprise import NMF

# Now we will try Non-Negative Matrix Factorisation (a form of collaborative filtering)
algo_nmf = NMF()

# Evaluate performances of our algorithm on the dataset.
perf_nmf = evaluate(algo_nmf, data, measures=['RMSE', 'MAE'])

print_perf(perf_nmf)

Evaluating RMSE, MAE of algorithm NMF.

------------
Fold 1
RMSE: 0.9637
MAE:  0.7560
------------
Fold 2
RMSE: 0.9647
MAE:  0.7581
------------
Fold 3
RMSE: 0.9675
MAE:  0.7611
------------
Fold 4
RMSE: 0.9613
MAE:  0.7530
------------
Fold 5
RMSE: 0.9658
MAE:  0.7620
------------
------------
Mean RMSE: 0.9646
Mean MAE : 0.7580
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9637  0.9647  0.9675  0.9613  0.9658  0.9646  
MAE     0.7560  0.7581  0.7611  0.7530  0.7620  0.7580  


In [17]:
from surprise import NormalPredictor
algo_normal = NormalPredictor()
perf_normal = evaluate(algo_normal, data, measures=['RMSE', 'MAE'])
print_perf(perf_normal)

Evaluating RMSE, MAE of algorithm NormalPredictor.

------------
Fold 1
RMSE: 1.5228
MAE:  1.2232
------------
Fold 2
RMSE: 1.5113
MAE:  1.2117
------------
Fold 3
RMSE: 1.5166
MAE:  1.2160
------------
Fold 4
RMSE: 1.5099
MAE:  1.2108
------------
Fold 5
RMSE: 1.5107
MAE:  1.2166
------------
------------
Mean RMSE: 1.5143
Mean MAE : 1.2157
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    1.5228  1.5113  1.5166  1.5099  1.5107  1.5143  
MAE     1.2232  1.2117  1.2160  1.2108  1.2166  1.2157  
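
The random predictor is the floor against which the other models are judged. As described in the model list, it draws each rating from a normal distribution fitted to the training ratings and ignores who the user is and which film is being rated. A minimal sketch of that idea follows; `fit_normal` and `random_rating` are hypothetical names and the training ratings are made up.

```python
import math
import random

def fit_normal(ratings):
    """Estimate the mean and standard deviation of the training ratings."""
    mean = sum(ratings) / len(ratings)
    variance = sum((r - mean) ** 2 for r in ratings) / len(ratings)
    return mean, math.sqrt(variance)

def random_rating(mean, std, rng, low=1.0, high=5.0):
    """Draw one rating from the fitted normal and clip it to the rating scale."""
    return min(high, max(low, rng.gauss(mean, std)))

# Made-up training ratings, for illustration only.
train_ratings = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5]
mean, std = fit_normal(train_ratings)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [random_rating(mean, std, rng) for _ in range(5)]
print(mean, std, draws)
```

Because the draw carries no information about the user or the item, an RMSE around 1.51 is roughly what "no model at all" looks like on this data.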


In [16]:
from surprise import BaselineOnly
algo_baseline = BaselineOnly()
perf_baseline = evaluate(algo_baseline, data, measures=['RMSE', 'MAE'])
print_perf(perf_baseline)

Evaluating RMSE, MAE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.9447
MAE:  0.7489
------------
Fold 2
Estimating biases using als...
RMSE: 0.9423
MAE:  0.7461
------------
Fold 3
Estimating biases using als...
RMSE: 0.9461
MAE:  0.7505
------------
Fold 4
Estimating biases using als...
RMSE: 0.9439
MAE:  0.7451
------------
Fold 5
Estimating biases using als...
RMSE: 0.9445
MAE:  0.7521
------------
------------
Mean RMSE: 0.9443
Mean MAE : 0.7486
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9447  0.9423  0.9461  0.9439  0.9445  0.9443  
MAE     0.7489  0.7461  0.7505  0.7451  0.7521  0.7486  
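
The "Estimating biases using als..." lines refer to alternating least squares: item biases are re-estimated with user biases held fixed, then the reverse, for a few rounds. The baseline estimate is then the global mean plus the user bias plus the item bias. The sketch below is a simplified illustration of that scheme, not the library's code; the function names, regularisation values and toy ratings are assumptions.

```python
from collections import defaultdict

def fit_biases(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Alternately re-estimate item and user biases around the global mean."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Item step: shrunken average residual per item, user biases held fixed.
        resid, count = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            resid[i] += r - mu - b_u[u]
            count[i] += 1
        for i in resid:
            b_i[i] = resid[i] / (reg_i + count[i])
        # User step: shrunken average residual per user, item biases held fixed.
        resid, count = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            resid[u] += r - mu - b_i[i]
            count[u] += 1
        for u in resid:
            b_u[u] = resid[u] / (reg_u + count[u])
    return mu, b_u, b_i

def baseline_estimate(mu, b_u, b_i, user, item):
    """Global mean plus user and item offsets; unknown ids contribute zero."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Made-up (user, item, rating) triples, for illustration only.
toy = [("a", "x", 5), ("a", "y", 4), ("b", "x", 3), ("b", "y", 1), ("c", "x", 4)]
mu, b_u, b_i = fit_biases(toy)
print(baseline_estimate(mu, b_u, b_i, "a", "x"))
```

The regularisation terms shrink biases towards zero when a user or item has few ratings, which keeps one enthusiastic rating from dominating the estimate.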


In [15]:
from surprise import KNNBasic
algo_knn_basic = KNNBasic()
perf_knn_basic = evaluate(algo_knn_basic, data, measures=['RMSE', 'MAE'])
print_perf(perf_knn_basic)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9797
MAE:  0.7723
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9794
MAE:  0.7741
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9819
MAE:  0.7747
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9804
MAE:  0.7708
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9757
MAE:  0.7745
------------
------------
Mean RMSE: 0.9794
Mean MAE : 0.7733
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9797  0.9794  0.9819  0.9804  0.9757  0.9794  
MAE     0.7723  0.7741  0.7747  0.7708  0.7745  0.7733  
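
The "msd similarity matrix" in the output is built from the mean squared difference between two users' ratings on the films they have both rated, turned into a similarity as 1 / (msd + 1). A basic k-nearest-neighbours estimate is then a similarity-weighted average of the neighbours' ratings for the target film. The sketch below assumes user-based neighbours; the helper names and toy ratings are hypothetical.

```python
def msd_similarity(ratings_a, ratings_b):
    """1 / (mean squared difference + 1) over the items both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def knn_estimate(user, item, all_ratings, k=2):
    """Similarity-weighted average of the k most similar users who rated item."""
    neighbours = []
    for other, other_ratings in all_ratings.items():
        if other != user and item in other_ratings:
            sim = msd_similarity(all_ratings[user], other_ratings)
            neighbours.append((sim, other_ratings[item]))
    top = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return None  # no usable neighbours for this pair
    return sum(sim * r for sim, r in top) / total_sim

# Made-up ratings keyed by user, then by film id.
toy = {
    "ann": {"m1": 5, "m2": 4},
    "bob": {"m1": 5, "m2": 4, "m3": 2},
    "cat": {"m1": 1, "m2": 2, "m3": 5},
}
est = knn_estimate("ann", "m3", toy)
print(est)
```

Here "bob" agrees with "ann" on every shared film, so his low rating for `m3` dominates the estimate, while "cat" contributes only a small weight.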


In [19]:
from surprise import KNNWithMeans
algo_knn_means = KNNWithMeans()
perf_knn_means = evaluate(algo_knn_means, data, measures=['RMSE', 'MAE','FCP'])
print_perf(perf_knn_means)

Evaluating RMSE, MAE, FCP of algorithm KNNWithMeans.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9548
MAE:  0.7520
FCP:  0.7005
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9480
MAE:  0.7467
FCP:  0.6986
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9536
MAE:  0.7506
FCP:  0.7065
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9494
MAE:  0.7454
FCP:  0.7014
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9517
MAE:  0.7528
FCP:  0.7071
------------
------------
Mean RMSE: 0.9515
Mean MAE : 0.7495
Mean FCP : 0.7028
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9548  0.9480  0.9536  0.9494  0.9517  0.9515  
MAE     0.7520  0.7467  0.7506  0.7454  0.7528  0.7495  
FCP     0.7005  0.6986  0.7065  0.7014  0.7071  0.7028  
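
FCP (Fraction of Concordant Pairs) is a ranking measure rather than an error measure: for each user it asks how often the model orders two films the same way the user's true ratings do. The sketch below is a simplified version that pools pairs across users and skips pairs with tied true ratings; the library's exact counting and averaging may differ, and the data is made up.

```python
from itertools import combinations

def fcp(predictions):
    """predictions maps each user to a list of (true_rating, estimate) pairs."""
    concordant = discordant = 0
    for pairs in predictions.values():
        for (r_a, est_a), (r_b, est_b) in combinations(pairs, 2):
            if r_a == r_b:
                continue  # a tie in the true ratings carries no ordering
            if (r_a - r_b) * (est_a - est_b) > 0:
                concordant += 1  # model orders the pair the same way
            else:
                discordant += 1  # model reverses (or ties) the pair
    return concordant / (concordant + discordant)

# Made-up predictions, for illustration only.
toy = {
    "u1": [(5, 4.5), (3, 3.9), (1, 4.0)],
    "u2": [(4, 3.6), (2, 3.5)],
}
score = fcp(toy)
print(score)
```

A score of 0.5 would mean the ordering is no better than a coin flip, so the values around 0.70 reported in this notebook indicate a useful but far from perfect ranking.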

In [20]:
from surprise import KNNBaseline
algo_knn_baseline = KNNBaseline()
perf_knn_baseline = evaluate(algo_knn_baseline, data, measures=['RMSE', 'MAE','FCP'])
print_perf(perf_knn_baseline)

Evaluating RMSE, MAE, FCP of algorithm KNNBaseline.

------------
Fold 1
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9313
MAE:  0.7330
FCP:  0.7073
------------
Fold 2
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9290
MAE:  0.7318
FCP:  0.7029
------------
Fold 3
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9329
MAE:  0.7339
FCP:  0.7096
------------
Fold 4
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9317
MAE:  0.7311
FCP:  0.7067
------------
Fold 5
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9312
MAE:  0.7368
FCP:  0.7121
------------
------------
Mean RMSE: 0.9312
Mean MAE : 0.7333
Mean FCP : 0.7077
------------
------------
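        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9313  0.9290  0.9329  0.9317  0.9312  0.9312  
MAE     0.7330  0.7318  0.7339  0.7311  0.7368  0.7333  
FCP     0.7073  0.7029  0.7096  0.7067  0.7121  0.7077  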

In [18]:
from surprise import SVDpp
algo_svdpp = SVDpp()
perf_svdpp = evaluate(algo_svdpp, data, measures=['RMSE', 'MAE'])
print_perf(perf_svdpp)

Evaluating RMSE, MAE of algorithm SVDpp.

------------
Fold 1
RMSE: 0.9213
MAE:  0.7229
------------
Fold 2
RMSE: 0.9209
MAE:  0.7208
------------
Fold 3
RMSE: 0.9212
MAE:  0.7218
------------
Fold 4
RMSE: 0.9219
MAE:  0.7213
------------
Fold 5
RMSE: 0.9189
MAE:  0.7223
------------
------------
Mean RMSE: 0.9208
Mean MAE : 0.7218
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9213  0.9209  0.9212  0.9219  0.9189  0.9208  
MAE     0.7229  0.7208  0.7218  0.7213  0.7223  0.7218  


In [21]:
from surprise import SlopeOne
algo_slope_one = SlopeOne()
perf_slope_one = evaluate(algo_slope_one, data, measures=['RMSE', 'MAE','FCP'])
print_perf(perf_slope_one)

Evaluating RMSE, MAE, FCP of algorithm SlopeOne.

------------
Fold 1
RMSE: 0.9422
MAE:  0.7409
FCP:  0.6960
------------
Fold 2
RMSE: 0.9464
MAE:  0.7426
FCP:  0.6883
------------
Fold 3
RMSE: 0.9479
MAE:  0.7452
FCP:  0.6977
------------
Fold 4
RMSE: 0.9455
MAE:  0.7409
FCP:  0.6985
------------
Fold 5
RMSE: 0.9455
MAE:  0.7460
FCP:  0.7059
------------
------------
Mean RMSE: 0.9455
Mean MAE : 0.7431
Mean FCP : 0.6973
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9422  0.9464  0.9479  0.9455  0.9455  0.9455  
MAE     0.7409  0.7426  0.7452  0.7409  0.7460  0.7431  
FCP     0.6960  0.6883  0.6977  0.6985  0.7059  0.6973  
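
Slope One is simple enough to write by hand. It predicts a user's rating for a film as the user's own mean rating plus the average amount by which other people rate that film above or below the films this user has already rated. The following is a minimal sketch of that idea with hypothetical helper names and made-up ratings.

```python
def average_deviation(item_i, item_j, all_ratings):
    """Mean of (rating of item_i - rating of item_j) over users who rated both."""
    diffs = [r[item_i] - r[item_j] for r in all_ratings.values()
             if item_i in r and item_j in r]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_estimate(user, item, all_ratings):
    """User mean plus the average deviation of item against the user's films."""
    user_ratings = all_ratings[user]
    user_mean = sum(user_ratings.values()) / len(user_ratings)
    devs = [average_deviation(item, j, all_ratings)
            for j in user_ratings if j != item]
    devs = [d for d in devs if d is not None]
    if not devs:
        return user_mean  # nothing to compare against, fall back to the mean
    return user_mean + sum(devs) / len(devs)

# Made-up ratings keyed by user, then by film id.
toy = {
    "ann": {"m1": 4, "m2": 3},
    "bob": {"m1": 5, "m2": 3, "m3": 4},
    "cat": {"m1": 2, "m2": 1, "m3": 3},
}
est = slope_one_estimate("ann", "m3", toy)
print(est)
```

There are no parameters to tune and no iterative training, which is part of why the method is described as simple yet accurate.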


In [23]:
from surprise import CoClustering
algo_co_clustering = CoClustering()
perf_co_clustering = evaluate(algo_co_clustering, data, measures=['RMSE', 'MAE','FCP'])
print_perf(perf_co_clustering)

Evaluating RMSE, MAE, FCP of algorithm CoClustering.

------------
Fold 1
RMSE: 0.9756
MAE:  0.7652
FCP:  0.6846
------------
Fold 2
RMSE: 0.9643
MAE:  0.7549
FCP:  0.6933
------------
Fold 3
RMSE: 0.9616
MAE:  0.7507
FCP:  0.6980
------------
Fold 4
RMSE: 0.9675
MAE:  0.7555
FCP:  0.6900
------------
Fold 5
RMSE: 0.9650
MAE:  0.7590
FCP:  0.7022
------------
------------
Mean RMSE: 0.9668
Mean MAE : 0.7571
Mean FCP : 0.6936
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9756  0.9643  0.9616  0.9675  0.9650  0.9668  
MAE     0.7652  0.7549  0.7507  0.7555  0.7590  0.7571  
FCP     0.6846  0.6933  0.6980  0.6900  0.7022  0.6936  
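
With all ten models evaluated, the mean RMSE values can be gathered in one place and ranked. The numbers below are copied from the "Mean" column of each output in this notebook; the variable names are only illustrative.

```python
# Mean RMSE over 5 folds, copied from the outputs above.
mean_rmse = {
    "SVD": 0.9366,
    "NMF": 0.9646,
    "NormalPredictor": 1.5143,
    "BaselineOnly": 0.9443,
    "KNNBasic": 0.9794,
    "KNNWithMeans": 0.9515,
    "KNNBaseline": 0.9312,
    "SVDpp": 0.9208,
    "SlopeOne": 0.9455,
    "CoClustering": 0.9668,
}

# Lower RMSE is better, so sort ascending.
ranking = sorted(mean_rmse.items(), key=lambda kv: kv[1])
for name, score in ranking:
    print(f"{name:<16}{score:.4f}")

# Relative improvement of the best model over the random predictor.
best_name, best_score = ranking[0]
random_score = mean_rmse["NormalPredictor"]
improvement = (random_score - best_score) / random_score
print(f"{best_name} reduces RMSE by {improvement:.0%} relative to NormalPredictor")
```

On these runs SVD++ has the lowest error, followed by KNNBaseline and SVD, but the gap between the non-random models is small compared with the gap between any of them and the random predictor.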


##### 5. Consider how a company could use this

How might a company use a recommendation engine like this in practice? Write a few paragraphs covering:
- How does the algorithm work?
- What data would be used?
- How would we know if it's working?
- What is the benefit of using an algorithm like this over just recommending the most popular films overall?
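
As a starting point for the last question, the two approaches can be contrasted in a few lines. A popularity list gives every user the same most-rated films (minus the ones they have seen), while a personalised list ranks unseen films by that user's estimated rating. This is only a sketch: the function names and data are made up, and `estimate` stands in for something like `algo.predict(uid, iid).est` from a trained model.

```python
from collections import Counter

def most_popular(ratings, user, n=2):
    """Most-rated films overall, skipping those the user has already rated."""
    seen = {item for u, item, _ in ratings if u == user}
    counts = Counter(item for _, item, _ in ratings)
    return [item for item, _ in counts.most_common() if item not in seen][:n]

def top_n_personalised(estimate, user, candidates, n=2):
    """Unseen films ranked by the model's estimated rating for this user."""
    return sorted(candidates, key=lambda item: estimate(user, item), reverse=True)[:n]

# Made-up (user, film, rating) triples, for illustration only.
toy = [
    ("ann", "m1", 5), ("bob", "m1", 4), ("cat", "m1", 3),
    ("bob", "m2", 5), ("cat", "m2", 4),
    ("cat", "m3", 5),
]

# Hypothetical model estimates for films "ann" has not rated.
fake_estimates = {("ann", "m2"): 2.9, ("ann", "m3"): 4.6}

def estimate(user, item):
    return fake_estimates[(user, item)]

popular = most_popular(toy, "ann")
personal = top_n_personalised(estimate, "ann", ["m2", "m3"], n=1)
print(popular, personal)
```

In this toy case the popularity list leads with the more widely rated film, while the personalised list leads with the niche film the model expects this user to enjoy more.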