<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Recommenders" data-toc-modified-id="Recommenders-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Recommenders</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#User-sample" data-toc-modified-id="User-sample-1.0.0.1"><span class="toc-item-num">1.0.0.1&nbsp;&nbsp;</span>User sample</a></span></li><li><span><a href="#Movie-sample" data-toc-modified-id="Movie-sample-1.0.0.2"><span class="toc-item-num">1.0.0.2&nbsp;&nbsp;</span>Movie sample</a></span></li><li><span><a href="#Ratings" data-toc-modified-id="Ratings-1.0.0.3"><span class="toc-item-num">1.0.0.3&nbsp;&nbsp;</span>Ratings</a></span></li></ul></li><li><span><a href="#Test-train-split" data-toc-modified-id="Test-train-split-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Test-train split</a></span></li></ul></li><li><span><a href="#Singular-Value-Decomposition" data-toc-modified-id="Singular-Value-Decomposition-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Singular Value Decomposition</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Cross-Validation" data-toc-modified-id="Cross-Validation-1.1.0.1"><span class="toc-item-num">1.1.0.1&nbsp;&nbsp;</span>Cross-Validation</a></span></li><li><span><a href="#Grid-search-for-the-best-hyperparameters" data-toc-modified-id="Grid-search-for-the-best-hyperparameters-1.1.0.2"><span class="toc-item-num">1.1.0.2&nbsp;&nbsp;</span>Grid-search for the best hyperparameters</a></span></li></ul></li><li><span><a href="#KNN-Basic-approach" data-toc-modified-id="KNN-Basic-approach-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>KNN Basic approach</a></span><ul class="toc-item"><li><span><a href="#Getting-predictions-for-a-given-user" data-toc-modified-id="Getting-predictions-for-a-given-user-1.1.1.1"><span class="toc-item-num">1.1.1.1&nbsp;&nbsp;</span>Getting predictions for a given user</a></span></li></ul></li></ul></li></ul></li></ul></div>

In [4]:
%run supportvectors-common.ipynb



<div style="color:#aaa;font-size:8pt">
<hr/>

 </blockquote>
 <hr/>
</div>



In [2]:
# Uncomment this to install surprise.
# !pip install surprise

# Recommenders

In Python, a popular, and easy to use library to use when getting started with the recommenders is `scikit-surprise`. https://surprise.readthedocs.io/en/stable/index.html .If the `surprise` library is not yet installed, one can install it with a simple `pip install surprise`.

There is an often used dataset in the context of the recommenders: the `movielens` dataset. It comprises of users' ratings for various movies. We will use this dataset to explore the recommenders. The `surprise` library has built-in support for this dataset, and it can be loadede with a `Dataset.load_builtin("ml-100k")`.

Note that the resulting object `data` is **not** a pandas dataframe. Therefore, we cannot apply the sklearn's method `sklearn.model_selection.train_test_split()` on this dataset.

Some explanation of this dataset. When we load it, it needs to download it for the first time -- and you will have to say `Y` at the prompt.

Once downloaded, it should show up in the `~/.surprise_data` directory. There are three significant files among the many we find there:

1. `u.data`: this it the main interaction data between users and items. In other words, it contains the ratings given by users to items. Each row is in the format: `userId, itemId, rating, timestamp`. 

2. `u.user`: this contains demographic information about the users.

3. `u.item`: this contains descriptive metadata on each of the movies.

We can explore a few rows of each of these with the following:



In [5]:
from surprise import Dataset
from surprise.model_selection import cross_validate


# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin("ml-100k")


The above should case not only the data to be loaded into the object `data`, but also download to our local directory. Let's explore it with `pandas.DataFrame`. There is a `README` file which provides information of the column names and files downloaded. We use that in what follows.

Let us now preview a sample of users:

#### User sample

In [6]:
DIR = '~/.surprise_data/ml-100k/ml-100k'

user_file = f'{DIR}/u.user'
users = pd.read_csv(names=['user_id', 'age', 'gender', 'occupation', 'zip_code'], 
                    filepath_or_buffer = user_file, 
                    sep='|')
users.sample(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
96,97,43,M,artist,98006
265,266,62,F,administrator,78756
810,811,40,F,educator,73013
23,24,21,F,artist,94533
30,31,24,M,artist,10003
280,281,15,F,student,6059
568,569,34,M,educator,91903
259,260,40,F,artist,89801
331,332,20,M,student,40504
323,324,21,F,student,2176


In [7]:
users.describe(include='all')

Unnamed: 0,user_id,age,gender,occupation,zip_code
count,943.0,943.0,943,943,943.0
unique,,,2,21,795.0
top,,,M,student,55414.0
freq,,,670,196,9.0
mean,472.0,34.051962,,,
std,272.364951,12.19274,,,
min,1.0,7.0,,,
25%,236.5,25.0,,,
50%,472.0,31.0,,,
75%,707.5,43.0,,,


#### Movie sample

Likewise, let us sample a few movies. Unfortunately, the file `u.item` contains non-utf8 characters, so we have to clean it first. On ubuntu, we can do it with the command:

```
iconv -f utf-8 -t utf-8 -c u.item -o u.item_cleaned
```

To detect the rows that had non utf-8 characters, we used: `grep -axv '.*' u.item`. 

In [8]:
COLUMNS = ['id', 'title', 'release_date', 'video_release_date', 
              'IMDb_URL', 'unknown', 'Action','Adventure','Animation',         
              "Children",  'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
              'Film_Noir', 'Horror', 'Musical', 'Mystery','Romance','Sci-Fi',
              'Thriller', 'War', 'Western']
movie_file = f'{DIR}/u.item_cleaned'
movies = pd.read_csv(names=COLUMNS, 
                     filepath_or_buffer = movie_file, 
                    sep='|',
                   )
movies.sample(5).transpose()

FileNotFoundError: [Errno 2] No such file or directory: '/home/kate/.surprise_data/ml-100k/ml-100k/u.item_cleaned'

In [7]:
movies.describe(include='all').transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,1682.0,,,,841.5,485.695893,1.0,421.25,841.5,1261.75,1682.0
title,1682.0,1664.0,"Designated Mourner, The (1997)",2.0,,,,,,,
release_date,1681.0,240.0,01-Jan-1995,215.0,,,,,,,
video_release_date,0.0,,,,,,,,,,
IMDb_URL,1679.0,1660.0,http://us.imdb.com/M/title-exact?Designated%20...,2.0,,,,,,,
unknown,1682.0,,,,0.001189,0.034473,0.0,0.0,0.0,0.0,1.0
Action,1682.0,,,,0.149227,0.356418,0.0,0.0,0.0,0.0,1.0
Adventure,1682.0,,,,0.080262,0.271779,0.0,0.0,0.0,0.0,1.0
Animation,1682.0,,,,0.02497,0.156081,0.0,0.0,0.0,0.0,1.0
Children,1682.0,,,,0.072533,0.259445,0.0,0.0,0.0,0.0,1.0


#### Ratings

The ratings that users have given to items is present in the `u.data` file. Let's look at a few rows:

In [8]:
COLUMNS = ['user_id', 'item_id', 'rating', 'timestamp']
rating_file = f'{DIR}/u.data'
ratings = pd.read_csv(names=COLUMNS, 
                     filepath_or_buffer = rating_file, 
                    sep='\t',
                   )
ratings.sample(5)

Unnamed: 0,user_id,item_id,rating,timestamp
35496,303,1511,3,879544843
2436,299,186,3,889503233
18294,42,230,5,881109148
60903,846,464,2,883947778
78483,747,26,3,888733314


### Test-train split

As we mentioned before, since `data` object is not a standard `pandas.DataFrame`, we cannot directly apply `sklearn.model_selection.train_test_split()`. Instead we have to use the specific method for splitting data built-in the `Dataset` object:

In [9]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=0.25)

## Singular Value Decomposition

As we learned in yesterday's session, singular value decomposition (SVD) is a linear-algebra approach that factorizes the ratings matrix (a matrix, where rows are users, columns are items, and cells are values -- if present -- of the rating a user gave to an item). Recall that the ratings matrix is a very sparse matrix: here, the total number of users is 943. The total number of items is 1682. The total number of possible ratings therefore is: $943 x 1682 = \mathbf{1586126}$. However, only 100K ratings are present.

Recall that SVD will help us project the data into a lower-dimensional latent (lower-rank) subspace, the `taste`-space. This projection is a linear projection, unlike later methods which will use more nonlinear approaches, such as neural collaborative filtering.

In [10]:
from surprise import accuracy, SVD
from surprise.model_selection import cross_validate

svd = SVD()
svd.fit(trainset)
predictions = svd.test(testset)



We discussed yesterday that the training of the recommender algorithm is a form of (dual) regression. Therefore,one appropriate way to quantify the model performance is with `rmse` (root-mean-squared-errors).

In [11]:
accuracy.rmse(predictions)

RMSE: 0.9380


0.9379879620279583

In other words, a predicted rating seems to exhibit a deviation of about one rating unit. This is reasonably good!

#### Cross-Validation

Let us run a 5-fold cross-validation, and find the best model:


In [12]:
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9320  0.9315  0.9392  0.9412  0.9394  0.9366  0.0041  
MAE (testset)     0.7343  0.7354  0.7396  0.7438  0.7407  0.7388  0.0035  
Fit time          0.30    0.30    0.29    0.30    0.29    0.30    0.00    
Test time         0.04    0.04    0.04    0.10    0.04    0.05    0.02    


{'test_rmse': array([0.93196606, 0.93150656, 0.93915529, 0.94122669, 0.93936456]),
 'test_mae': array([0.73430988, 0.73542033, 0.73960985, 0.74380132, 0.74073409]),
 'fit_time': (0.29563069343566895,
  0.3015155792236328,
  0.29415225982666016,
  0.2985680103302002,
  0.2930922508239746),
 'test_time': (0.040586233139038086,
  0.038741350173950195,
  0.03872847557067871,
  0.09914994239807129,
  0.03904533386230469)}

#### Grid-search for the best hyperparameters

We can perform a grid search for the best hyperpameters -- i.e. those values-combination of the hyperparameters which yield the best model.

In [13]:
# Adapted, with minor modification, from the surprise documentation example

from surprise.model_selection import GridSearchCV

param_grid = {"n_epochs": [5, 10, 20], 
              "lr_all": [0.01, 0.001, 0.002, 0.005], 
              "reg_all": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

0.9291356498968272
{'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.1}


Now, let's find the best model, and use it.

In [14]:
# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator["rmse"]
algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f1162f538e0>

### KNN Basic approach

Let us take a k-nn approach to this problem. For this, it is advantageous to use the entire dataset, so we can get closer neighbors.

In [15]:
from surprise import KNNBasic
knn = KNNBasic()

trainset = data.build_full_trainset()
knn.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f1162f53250>

In [16]:
predictions = knn.test(testset)
accuracy.rmse(predictions)

RMSE: 0.7778


0.7777856659820315

Clearly, this gives us lower `rmse`. Can you explain why this is so?

#### Getting predictions for a given user

Let us randomly pick a user, and a movie, and see what predictions the two algorithms make:

In [17]:
user = str(26)
item = str(760)

svd_prediction = svd.predict(user, item)
knn_prediction = knn.predict(user, item)

print(f'SVD prediction: {svd_prediction}')
print(f'KNN prediction: {knn_prediction}')

svd_prediction.est

SVD prediction: user: 26         item: 760        r_ui = None   est = 1.84   {'was_impossible': False}
KNN prediction: user: 26         item: 760        r_ui = None   est = 2.33   {'actual_k': 40, 'was_impossible': False}


1.838631854417943

In [18]:
exemplars = ratings.sample(10)

predictions = pd.DataFrame(columns=['User', 'Movie', 'Rating', 'SVD-Prediction', 'kNN-Prediction'])

for index, row in exemplars.iterrows():
    user = row[0]
    movie = row[1]
    rating = row[2]
    svd_prediction = svd.predict(user, movie)
    knn_prediction = knn.predict(user, movie)
    predictions.loc[len(predictions)] = [user, movie, rating, svd_prediction.est, knn_prediction.est]

predictions

Unnamed: 0,User,Movie,Rating,SVD-Prediction,kNN-Prediction
0,495.0,1046.0,5.0,3.527213,3.52986
1,548.0,1014.0,4.0,3.527213,3.52986
2,316.0,716.0,5.0,3.527213,3.52986
3,752.0,750.0,2.0,3.527213,3.52986
4,279.0,71.0,3.0,3.527213,3.52986
5,828.0,1068.0,4.0,3.527213,3.52986
6,627.0,276.0,2.0,3.527213,3.52986
7,455.0,778.0,4.0,3.527213,3.52986
8,373.0,95.0,5.0,3.527213,3.52986
9,758.0,229.0,3.0,3.527213,3.52986


As we can see, it is a reasonably good predictor of ratings. We will learn in a later course (`ML-400: Neural Architectures`) that there are other, more powerful algorithms that have evolved over the years, using various neural network architectures.