# 11. Recommender systems

## Introduction to recommender systems

Recommendation algorithms provide the customers of companies such as Netflix, Amazon and YouTube with intelligent suggestions about items they might be interested in. There are many approaches to constructing such algorithms. One of them, **content-based filtering**, uses the items' metadata (or descriptive characteristics) as explanatory variables, *e.g.* to build models that identify items similar to those in which the customer has previously expressed interest. In this module, however, we shall focus on another very common technique called **collaborative filtering**.

Collaborative filtering generates recommendations about what the user might enjoy on the basis of collected reactions from other users. This technique can be approached from two alternative directions: **item-based** (looking for items that elicit similar reactions) or **user-based** (looking for users with similar tastes) collaborative filtering. Below, we shall investigate both of these alternative approaches.  

## Item-based collaborative filtering

Collaborative filtering requires a dataset of user responses (purchase history, reviews, ratings etc.) concerning a set of items. As an example, let us consider the following simple table of user ratings given by six users U1, ..., U6 to a set of four different movies M1, ..., M4.


|        | U1  | U2  | U3  | U4  | U5  | U6  | 
|:------:|:---:|:---:|:---:|:---:|:---:|:---:|
| **M1** |  1  |  5  |  5  |     |  2  |  1  |
| **M2** |  1  |  5  |     |  1  |     |  1  |
| **M3** |  5  |  2  |     |  5  |     |  5  |
| **M4** |  5  |  2  |     |     |  4  |     |

For example, user U5 has given movie M1 a rating of 2 stars. Empty slots in the table indicate that the user has not yet seen the movie, or has not submitted a rating. Let us also store the same information as a pandas DataFrame: 

```python
import pandas as pd

# One entry per known rating (long format): item, user and star rating.
ratings = {'item': ['M1', 'M2', 'M3', 'M4', 'M1', 'M2', 'M3', 'M4', 'M1', 'M3', 'M2', 'M1', 'M4', 'M1', 'M2', 'M3'],
           'user': ['U1', 'U1', 'U1', 'U1', 'U2', 'U2', 'U2', 'U2', 'U3', 'U4', 'U4', 'U5', 'U5', 'U6', 'U6', 'U6'],
           'rating': [1, 1, 5, 5, 5, 5, 2, 2, 5, 5, 1, 2, 4, 1, 1, 5]}

df = pd.DataFrame(ratings)
df
```

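|     | item | user | rating |
|:---:|:----:|:----:|:------:|
|  0  |  M1  |  U1  |   1    |
|  1  |  M2  |  U1  |   1    |
|  2  |  M3  |  U1  |   5    |
|  3  |  M4  |  U1  |   5    |
|  4  |  M1  |  U2  |   5    |
|  5  |  M2  |  U2  |   5    |
|  6  |  M3  |  U2  |   2    |
|  7  |  M4  |  U2  |   2    |
|  8  |  M1  |  U3  |   5    |
|  9  |  M3  |  U4  |   5    |

(Only the first 10 of the 16 rows are shown.)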
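The long-format DataFrame can be reshaped back into the item-by-user table shown at the start of this section. The following is a minimal sketch using the pandas `pivot` method; the variable name `matrix` is chosen here for illustration only.

```python
import pandas as pd

ratings = {'item': ['M1', 'M2', 'M3', 'M4', 'M1', 'M2', 'M3', 'M4', 'M1', 'M3', 'M2', 'M1', 'M4', 'M1', 'M2', 'M3'],
           'user': ['U1', 'U1', 'U1', 'U1', 'U2', 'U2', 'U2', 'U2', 'U3', 'U4', 'U4', 'U5', 'U5', 'U6', 'U6', 'U6'],
           'rating': [1, 1, 5, 5, 5, 5, 2, 2, 5, 5, 1, 2, 4, 1, 1, 5]}
df = pd.DataFrame(ratings)

# Rows are items, columns are users; pairs without a rating become NaN.
matrix = df.pivot(index='item', columns='user', values='rating')
print(matrix)
```

Each row of `matrix` is the rating history of one movie, and each column holds the ratings given by one user. Missing ratings appear as `NaN` rather than as empty cells.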


Suppose now that we would like to recommend a movie to the fourth user, U4. There are two movies that this user has not yet rated: M1 and M4. Looking at the table, which do you think U4 would prefer?

First, note that U4 has given M3 a good rating. It is then reasonable to expect that U4 might enjoy another movie whose rating history resembles that of M3. Comparing the last two rows of the table above, we find that the rating history of M4 has some similarities with that of M3 (for example, U1 and U2 have each given these two movies the same rating). In contrast, the ratings received by M1 (the first row) are clearly different from those received by M3, and fairly similar to those of M2, which U4 did not like. Therefore, we conclude that U4 would probably prefer M4 to M1. This is the essential logic behind item-based collaborative filtering.

More precisely, we wish to predict some unknown rating in the table (to fill in the blanks), say that for M4 as given by U4. Item-based collaborative filtering can be used for that purpose as follows:

- Look for items whose user responses are nearest to those of the item under consideration, and which have been rated by the user in question.
- Estimate the unknown rating as the weighted average of the user's known ratings for these nearest-neighbor items, using the similarities as weights.

To find the nearest-neighbor items, we need a measure of similarity. Each item (or row in the data) can be viewed as a vector of numerical values; the number of vector components is equal to the number of users. One possible way of comparing two such vectors is the familiar Euclidean distance. However, another widely used similarity measure in the context of recommender systems is the **cosine similarity**.

## Cosine similarity: a new measure of similarity

Consider two $N$-dimensional vectors $a = (a_{1}, a_{2}, ..., a_{N})$ and $b = (b_{1}, b_{2}, ..., b_{N})$. The **dot product** $a \cdot b$ of these vectors is defined as

$$
a \cdot b = a_{1}b_{1} + a_{2}b_{2} + ... + a_{N}b_{N},
$$
and is related to the absolute value (length) of the vectors and the angle $\theta$ between them as

$$
a \cdot b = \vert a \vert \vert b \vert \cos\theta,
$$
where $\vert a \vert = \sqrt{a_{1}^2 + ... + a_{N}^2}$, and similarly for $\vert b \vert$. Accordingly, the cosine similarity $sim(a,b)$ between two item vectors $a$ and $b$ is

$$
sim(a,b) = \frac{a \cdot b}{\vert a \vert \vert b \vert}.
$$ 
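The cosine similarity ranges between 1 (parallel vectors pointing in the same direction) and -1 (antiparallel vectors pointing in opposite directions); larger values indicate more similarity between the vector directions. When all vector components are non-negative, as is the case with star ratings, the cosine similarity lies between 0 and 1.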
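The definition above translates almost directly into NumPy. The following is a minimal sketch; the function name `cosine_similarity` is our own choice, and the function assumes that neither vector is the zero vector (otherwise the denominator vanishes).

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the vector lengths.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 2], [2, 4]))    # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))    # perpendicular vectors
print(cosine_similarity([1, 2], [-1, -2]))  # antiparallel vectors
```

Up to floating-point rounding, the three calls give 1, 0 and -1. Note that the lengths of the vectors do not matter, only their directions.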

Using the example data above, let us calculate the cosine similarity between items M3 and M4, substituting zeros for the unknown ratings in the row vectors. Since M3 = (5, 2, 0, 5, 0, 5) and M4 = (5, 2, 0, 0, 4, 0), we obtain

$$
sim(M3, M4) = \frac{25 + 4 + 0 + 0 + 0 + 0}{\sqrt{79} \cdot \sqrt{45}} \approx 0.49.
$$ 
A similar analysis between M2 = (1, 5, 0, 1, 0, 1) and M4 gives $sim(M2, M4) \approx 0.42$, which is only slightly lower. Since we know that U4 has given a rating of 1 to M2 and a rating of 5 to M3, we can estimate the rating U4 would give M4 as the similarity-weighted average

$$
r(U4, M4) = \frac{sim(M2, M4)r(U4, M2) + sim(M3, M4)r(U4, M3)}{sim(M2, M4)+sim(M3, M4)}
\approx \frac{0.42 \cdot 1 + 0.49 \cdot 5}{0.42 + 0.49} \approx 3.1.
$$
Here the final value is computed with unrounded similarities. Usually, the weighted average is taken over the $k$ nearest-neighbor items rated by the user. For the movie M1, an identical analysis gives an estimated rating of $r(U4, M1) \approx 2.2$ stars, so of these two alternatives, the better recommendation for U4 would seem to be movie M4.
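The hand calculation can be checked with a few lines of NumPy. This is a minimal sketch, not a production implementation: the helper names `cosine_similarity` and `predict_item_based` are our own, missing ratings are encoded as zeros as in the text, and all items rated by the user are used as neighbors instead of only the $k$ nearest.

```python
import numpy as np

# Rows: M1..M4, columns: U1..U6; 0 marks a missing rating (as in the text).
R = np.array([
    [1, 5, 5, 0, 2, 1],
    [1, 5, 0, 1, 0, 1],
    [5, 2, 0, 5, 0, 5],
    [5, 2, 0, 0, 4, 0],
], dtype=float)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_item_based(R, item, user):
    """Similarity-weighted average of the user's known ratings for other items."""
    # Neighbor items: every other item that the user has rated.
    rated = [i for i in range(R.shape[0]) if i != item and R[i, user] > 0]
    sims = np.array([cosine_similarity(R[item], R[i]) for i in rated])
    return float(sims @ R[rated, user] / sims.sum())

U4 = 3  # zero-based column index of user U4
print(round(predict_item_based(R, 3, U4), 1))  # M4 -> 3.1
print(round(predict_item_based(R, 0, U4), 1))  # M1 -> 2.2
```

The function divides by the sum of the similarities, so it assumes that at least one neighbor item has a non-zero similarity to the target item.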

Item-based collaborative filtering often produces good results, but in some cases the recommendations may turn out to be too obvious. Two Harry Potter movies are likely to elicit similar user reactions, but recommending another film in the same series is not very useful to the customer, who can probably think of it without an algorithm's help. Our next topic, **user-based collaborative filtering**, can sometimes produce more interesting results.

## User-based collaborative filtering

In user-based collaborative filtering, the aim is to find similar users instead of similar items: if other users' tastes are similar to yours, and they like a certain item, you might also turn out to like it. 

Whereas in item-based collaborative filtering we compared row vectors for different items, in user-based collaborative filtering we compare *column vectors* containing the ratings given by individual users. Otherwise, the procedure is the same: the similarity between users U1 and U2, for example, can be quantified by calculating the cosine similarity of U1 = (1, 1, 5, 5) and U2 = (5, 5, 2, 2). The result is

$$
sim(U1, U2) = \frac{30}{\sqrt{52}\cdot\sqrt{58}} \approx 0.55,
$$
while that between U1 and U6 is larger: $sim(U1, U6) \approx 0.72$. User-based collaborative filtering for predicting an unknown rating in the ratings table can then be implemented as follows:

- Look for users whose ratings are nearest to those of the user under consideration, and who have rated the item in question.
- Estimate the unknown rating as the weighted average of the ratings these nearest-neighbor users have given the item, using the similarities as weights.

When the predicted ratings for the user have been calculated for the items with missing values, the highest ratings among them can be presented as recommendations.
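The two steps above can be sketched in NumPy in the same way as the item-based case, now comparing columns instead of rows. The helper names `cosine_similarity` and `predict_user_based` are our own, missing ratings are again encoded as zeros, and every user who has rated the item is used as a neighbor.

```python
import numpy as np

# Rows: M1..M4, columns: U1..U6; 0 marks a missing rating.
R = np.array([
    [1, 5, 5, 0, 2, 1],
    [1, 5, 0, 1, 0, 1],
    [5, 2, 0, 5, 0, 5],
    [5, 2, 0, 0, 4, 0],
], dtype=float)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_user_based(R, item, user):
    """Similarity-weighted average of other users' ratings for the item."""
    # Neighbor users: every other user who has rated the item.
    raters = [u for u in range(R.shape[1]) if u != user and R[item, u] > 0]
    sims = np.array([cosine_similarity(R[:, user], R[:, u]) for u in raters])
    return float(sims @ R[item, raters] / sims.sum())

# Similarities quoted in the text: U1 vs U2 and U1 vs U6.
print(round(cosine_similarity(R[:, 0], R[:, 1]), 2))
print(round(cosine_similarity(R[:, 0], R[:, 5]), 2))

U4 = 3  # zero-based column index of user U4
print(round(predict_user_based(R, 3, U4), 1))  # predicted rating for M4
print(round(predict_user_based(R, 0, U4), 1))  # predicted rating for M1
```

With these assumptions the predicted ratings for U4 are about 3.9 for M4 and about 1.7 for M1, so the user-based approach also suggests M4 as the better recommendation in this small example.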

Because the mathematical principles behind collaborative filtering are quite simple, the necessary Python code could be written from scratch relatively easily. However, recommender system algorithms are also available in **Surprise**, a scikit add-on package distributed as `scikit-surprise`. In the following, we look at how to implement user-based collaborative filtering with Surprise.

## Python implementation of user-based collaborative filtering



```python
# Notebook cell; in a terminal, omit the leading "!".
!pip install scikit-surprise
```
