# Recommendation Engines - MovieLens Data


MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file.

### Tasks

1. Load the data into the recommendation format
2. Build and assess model accuracy
3. Make individual recommendations
4. Try multiple models and compare accuracy
5. Consider how a company could use this

In [1]:
# Install Surprise - a useful library for recommendation engines
!pip install scikit-surprise



In [2]:
# Load Surprise
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
from surprise import Reader

In [20]:
! head -n 10 '../../data/u.data'

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
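To make the file layout concrete, here is a minimal sketch that parses the same tab-separated format using only the standard library. The `Rating` tuple and `parse_ratings` helper are illustrative names, not part of Surprise, and the sample text is just the first three lines shown above.

```python
import csv
import io
from collections import namedtuple

# Illustrative record type; the field names follow the column order in u.data.
Rating = namedtuple("Rating", ["user", "item", "rating", "timestamp"])

# First three lines of u.data as printed above, tab-separated.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)


def parse_ratings(handle):
    """Yield one Rating per tab-separated line; user and item ids stay strings."""
    for user, item, rating, timestamp in csv.reader(handle, delimiter="\t"):
        yield Rating(user, item, float(rating), int(timestamp))


ratings = list(parse_ratings(io.StringIO(SAMPLE)))
print(ratings[0])
```

Keeping the ids as strings mirrors how Surprise treats raw ids read from a file.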


In [3]:
# 1. Load the data into the recommendation format

# As we're loading a custom dataset, we need to define a reader. In the
# movielens dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path = '../../data/u.data', reader=reader)
data.split(n_folds=5)
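Note that `data.split(...)`, `evaluate` and `print_perf` belong to older Surprise releases; newer releases replace them with `surprise.model_selection.cross_validate`. The sketch below shows roughly how the same 5-fold evaluation could look with the newer API. The function name `cross_validate_svd` is ours, and the file path is whatever you pass in.

```python
def cross_validate_svd(ratings_path, n_folds=5):
    """Sketch: n-fold cross-validation of SVD using the newer Surprise API."""
    # Imports live inside the function so this definition loads even on a
    # machine where scikit-surprise is not installed.
    from surprise import SVD, Dataset, Reader
    from surprise.model_selection import cross_validate

    # Same reader as above: 'user item rating timestamp', tab-separated.
    reader = Reader(line_format='user item rating timestamp', sep='\t')
    data = Dataset.load_from_file(ratings_path, reader=reader)

    # cross_validate builds the folds itself, so no data.split() call is needed.
    return cross_validate(SVD(), data, measures=['RMSE', 'MAE'],
                          cv=n_folds, verbose=True)
```

Calling `cross_validate_svd('../../data/u.data')` should return a dictionary of per-fold scores (for example under the `test_rmse` and `test_mae` keys) and, with `verbose=True`, print a summary table.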

In [4]:
# 2. Build and assess model accuracy

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9295
MAE:  0.7362
------------
Fold 2
RMSE: 0.9382
MAE:  0.7392
------------
Fold 3
RMSE: 0.9412
MAE:  0.7420
------------
Fold 4
RMSE: 0.9437
MAE:  0.7442
------------
Fold 5
RMSE: 0.9333
MAE:  0.7349
------------
------------
Mean RMSE: 0.9372
Mean MAE : 0.7393
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
MAE     0.7362  0.7392  0.7420  0.7442  0.7349  0.7393  
RMSE    0.9295  0.9382  0.9412  0.9437  0.9333  0.9372  
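For reference, the two measures reported above can be sketched in a few lines of plain Python. The true ratings and estimates below are made-up illustrative values; only the per-fold RMSE numbers are copied from the output above.

```python
import math


def rmse(true_ratings, estimates):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(r - e) ** 2 for r, e in zip(true_ratings, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def mae(true_ratings, estimates):
    """Mean absolute error: the average size of a miss, in rating points."""
    absolute = [abs(r - e) for r, e in zip(true_ratings, estimates)]
    return sum(absolute) / len(absolute)


# Hypothetical true ratings and model estimates, for illustration only.
true_ratings = [3, 1, 5, 4]
estimates = [3.5, 2.0, 4.0, 4.5]
print(rmse(true_ratings, estimates), mae(true_ratings, estimates))

# Per-fold RMSE values copied from the output above; the reported mean is
# simply their average.
fold_rmse = [0.9295, 0.9382, 0.9412, 0.9437, 0.9333]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 4))
```

An MAE of about 0.74 means the SVD estimates are off by roughly three quarters of a star on average.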


In [37]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(242)  # raw item id (as in the ratings file). They are **strings**!

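# Get a prediction for a specific user and item.
# r_ui is the known true rating (optional). It is only echoed in the output
# and does not change the estimate. Note that the ratings file records 3 for
# this user/item pair, so r_ui=1 here is just a placeholder.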
pred = algo.predict(uid, iid, r_ui=1, verbose=True)


user: 196        item: 242        r_ui = 1.00   est = 3.74   {'was_impossible': False}
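To see where an estimate like `est = 3.74` comes from, here is a minimal sketch of the SVD prediction rule as we understand it from the Surprise documentation: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors, clipped to the rating scale. All numbers below are hypothetical, not the factors learned in this notebook.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(1, 5)):
    """Sketch of an SVD-style estimate: mu + b_u + b_i + q_i . p_u, clipped."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    # Keep the estimate inside the valid rating range.
    return min(high, max(low, raw))


# Hypothetical parameters, for illustration only.
estimate = svd_estimate(
    global_mean=3.5,
    user_bias=0.2,             # this user rates a little above average
    item_bias=-0.1,            # this film is rated a little below average
    user_factors=[0.3, -0.2],
    item_factors=[0.5, 0.1],
)
print(estimate)
```

The bias terms capture "generous user" and "well-liked film" effects, while the factor vectors capture how well this particular user's tastes match this particular film.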


In [None]:
# 4. Try multiple models and compare accuracy

# Try at least 3 of the models mentioned below:
#random_pred.NormalPredictor    Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
#baseline_only.BaselineOnly    Algorithm predicting the baseline estimate for given user and item.
#knns.KNNBasic    A basic collaborative filtering algorithm.
#knns.KNNWithMeans    A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
#knns.KNNBaseline    A basic collaborative filtering algorithm taking into account a baseline rating.
#matrix_factorization.SVD    The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
#matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
#matrix_factorization.NMF    A collaborative filtering algorithm based on Non-negative Matrix Factorization.
#slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.
#co_clustering.CoClustering    A collaborative filtering algorithm based on co-clustering.


# Here's how to run Non-Negative Matrix Factorisation
from surprise import NMF
# Now we will try Non-Negative Matrix Factorisation (a form of collaborative filtering).
# Use separate variables so the SVD model and its results are not modified.
algo_nmf = NMF()
# Evaluate performances of our algorithm on the dataset.
perf_nmf = evaluate(algo_nmf, data, measures=['RMSE', 'MAE'])
# print_perf expects the results object itself, not a string.
print_perf(perf_nmf)

In [38]:
? evaluate

In [13]:


from surprise import NormalPredictor

npp = NormalPredictor()
perf.npp = evaluate(npp, data, measures=['RMSE', 'MAE'])
# print_perf expects the results object itself, not a string.
print_perf(perf.npp)

Evaluating RMSE, MAE of algorithm NormalPredictor.

------------
Fold 1
RMSE: 1.5192
MAE:  1.2187
------------
Fold 2
RMSE: 1.5225
MAE:  1.2204
------------
Fold 3
RMSE: 1.5268
MAE:  1.2266
------------
Fold 4
RMSE: 1.5248
MAE:  1.2261
------------
Fold 5
RMSE: 1.5118
MAE:  1.2134
------------
------------
Mean RMSE: 1.5210
Mean MAE : 1.2210
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
MAE     1.2187  1.2204  1.2266  1.2261  1.2134  1.2210  
RMSE    1.5192  1.5225  1.5268  1.5248  1.5118  1.5210  

In [14]:
perf.npp

CaseInsensitiveDefaultDict(list,
                           {'mae': [1.2186791241730255,
                             1.2203502477160961,
                             1.2265953873673363,
                             1.2260725042885312,
                             1.2133844370776059],
                            'rmse': [1.5191928058835191,
                             1.5225401042212694,
                             1.5267727621564633,
                             1.5248178082960904,
                             1.5117973303997585]})
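The NormalPredictor scores are much worse than SVD because it ignores who the user is and which film is being rated. A minimal sketch of the idea, assuming it simply fits a normal distribution to the training ratings and samples from it (the helper names are ours, and the ratings are the third column of the ten lines of `u.data` shown earlier):

```python
import random
import statistics


def fit_normal(train_ratings):
    """Estimate the mean and standard deviation of the training ratings."""
    return statistics.mean(train_ratings), statistics.pstdev(train_ratings)


def predict_random(mean, std, rng, rating_scale=(1, 5)):
    """Draw a rating from N(mean, std) and clip it to the rating scale."""
    low, high = rating_scale
    return min(high, max(low, rng.gauss(mean, std)))


# Ratings column from the ten lines of u.data printed earlier.
train_ratings = [3, 3, 1, 2, 1, 4, 2, 5, 3, 3]

rng = random.Random(0)  # fixed seed so reruns give the same draws
mean, std = fit_normal(train_ratings)
predictions = [predict_random(mean, std, rng) for _ in range(5)]
print(mean, std, predictions)
```

Because every prediction is an independent random draw, this model is best treated as a floor: any algorithm that uses the user and item at all should beat it.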

In [None]:
# Here's how to run the NormalPredictor baseline
from surprise import NormalPredictor

# NormalPredictor predicts random ratings drawn from a normal distribution
# fitted to the training set, so it is a useful floor for comparison.
algo_normal = NormalPredictor()

# Evaluate performances of our algorithm on the dataset.
perf_normal = evaluate(algo_normal, data, measures=['RMSE', 'MAE'])

print_perf(perf_normal)

##### 5. Consider how a company could use this

How might a company use a recommendation engine like this in practice? Write a few paragraphs on how they could apply the work above, covering:
- How does the algorithm work?
- What data would be used?
- How would we know if it's working?
- What is the benefit of using an algorithm like this over just recommending the most popular films overall?
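As a starting point for the last question, here is a minimal sketch of the "most popular films overall" baseline, ranking films by mean rating. The `most_popular` helper and the toy ratings are hypothetical and only meant to show what a personalised model is being compared against.

```python
from collections import defaultdict


def most_popular(ratings, top_n=3, min_ratings=1):
    """Rank items by mean rating, ignoring items with too few ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
    means = {
        item: totals[item] / counts[item]
        for item in counts
        if counts[item] >= min_ratings
    }
    # Highest mean first; ties broken by rating count, then by item id.
    ranked = sorted(means, key=lambda item: (-means[item], -counts[item], item))
    return ranked[:top_n]


# Hypothetical (user, item, rating) triples, for illustration only.
ratings = [
    ("u1", "A", 5), ("u2", "A", 4), ("u3", "A", 5),
    ("u1", "B", 3), ("u3", "B", 5),
    ("u2", "C", 2),
]
top = most_popular(ratings, top_n=2)
print(top)
```

Every user receives the same list from this baseline, which is exactly the limitation a collaborative filtering model such as SVD is meant to address.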