# [1.1: KNN] // Collaborative Filtering with KNN

## Context

Services like Bol.com, Amazon, Facebook, YouTube, Instagram, Google, TikTok, and Netflix all try to predict which products (books, movies, videos, newspaper articles, or other items) you might be interested in. In this assignment, you will create a first version of such a prediction system, using a well-known machine learning algorithm called **k-nearest neighbors** (KNN).

These predictions are usually based on users’ behavior on a platform. Users view items, purchase them, give ratings, or mark things as favorites. All this information helps reveal the kind of products someone likes. Once we know this, we can look for similar products they have not seen yet and estimate whether they might enjoy them too.

A common way to decide whether two items (e.g., two movies) are similar is to compare their content:

* Are they about the same topic?
* Do they feature the same actors?
* Do they belong to the same genre?

Recommending items based on these kinds of features is known as **content-based filtering**, which we will explore later.

In this module, we focus on **collaborative filtering**. The key idea is that user behavior (ratings and interactions) contains information about both users *and* items.

Collaborative filtering comes in two main forms:

* **User-based filtering:** Find users who behave similarly and recommend items that those similar users liked.
* **Item-based filtering:** Find items that receive similar ratings and recommend items that are similar to the ones a user already likes.

Which approach works best: content-based filtering, user-based collaborative filtering, or item-based collaborative filtering?
There is no universal answer. The best choice depends strongly on the specific application and dataset.

Each approach also involves design decisions that affect performance. For instance, we can measure similarity in many ways. Common choices are **Euclidean distance** and **cosine similarity**, but in practice many more options exist.

## Study Material

* **Video:** [Collaborative Filtering – Harvard CS50](https://www.youtube.com/watch?v=Eeg1DEeWUjA)
* **Book:** *Recommender Systems: The Textbook* — Chapters 1 and 2
  (Available as a free download via the UvA network.)

## Goal

For this assignment, we will build a collaborative filtering system using a dataset from [MovieLens](http://movielens.org), a well-known movie recommendation platform. The dataset (downloaded automatically in a later cell) consists of three CSV files:

* `movies.csv`
* `ratings.csv`
* `users.csv`

CSV stands for *comma-separated values*, a common format for storing tabular data.

Our goal is to use this dataset to recommend movies that a user has **not** watched yet. To do this, we look at the ratings a user has given to **other** movies and use that information to estimate how much they would enjoy a new movie.

If we frame this as a prediction task:

> **Given a user who has not seen a particular movie, predict the rating they would likely give after watching it.**

This predicted rating will form the basis of our recommendations.

## Getting Started

### Libraries

Let's start by loading the libraries we'll need by running the cell below:

In [None]:
#provide
import pandas as pd
import numpy as np
import pooch

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

### Download data

Now download the required data and helper files.

In [None]:
# download data
DATA_REPO = "https://raw.githubusercontent.com/uvapl/recommender-systems/main/data/m1/"

print("downloading data files")
for fname in ["movies.csv", "users.csv", "ratings.csv"]:
    pooch.retrieve(url = DATA_REPO + fname, known_hash=None, fname=fname, path="data", progressbar=True)
for fname in ["helpers_m1.py", "cosine.png", "tests_m1.py"]:
    pooch.retrieve(url = DATA_REPO + fname, known_hash=None, fname=fname, path=".", progressbar=True)
print("done!")

import helpers_m1
import tests_m1

## User-based filtering using KNN

Our goal is to recommend movies to a specific user. There are many approaches to this, but in this assignment we begin with **collaborative filtering**. The core idea behind collaborative filtering is to use user interactions (in this case, ratings from the MovieLens dataset) to guide our recommendations.

In **user-based filtering**, we focus on finding users who behave similarly. The intuition is straightforward: if two users tend to give similar ratings to many of the same movies, they likely have similar taste. If one of these users watches and enjoys a movie that the other has not seen yet, that movie becomes a good candidate for a recommendation.

To build a user-based filtering system, we will go through the following steps:

1. **Reading the data**
2. **Transforming the data into a usable format** (a utility matrix with users as rows and movies as columns)
3. **Define similarities between users**
4. **Using these similarities to make a recommendation (using KNN)**

### Reading Data

We'll start with a micro-dataset. This small set contains 3 movies and 30 users. It is a synthetic dataset similar in structure to the MovieLens dataset. This allows you to easily test your code, since you have an overview of all the data. For comparison, the original dataset contains over 30 million users. The algorithms we are going to apply should, in principle, also work on the full dataset, although it will be challenging to create solutions that are efficient enough to deal with that amount of data.

The dataset consists of 3 tables (which we will read as Pandas `DataFrame`s):
- **movies**, a table with the movie's id, name, and genre.
- **users**, a table with the user's id and name.
- **ratings**, a table with the user's id, the movie's id, the corresponding rating, and the timestamp.

In the following tasks, we will only use the ratings table. The other two tables are provided for our intuition and reasoning. It's easier to talk about Shrek than movie 4306, so please do so when asked about certain movies in open questions.

Load the data:

In [None]:
df_movies = pd.read_csv("data/movies.csv", index_col="movieID")
df_users = pd.read_csv("data/users.csv", index_col="userID")
df_ratings = pd.read_csv("data/ratings.csv")

print(df_movies.head())
print(df_users.head())
print(df_ratings.head())

### Pivot

To analyze the data, we first create a table that shows how each user rated each movie. A small example (the real table is much larger) might look like this:

<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th>movieID</th><th>M1024</th><th>M2048</th><th>M4096</th></tr><tr><th>userID</th><th></th><th></th><th></th></tr></thead><tbody><tr><th>U025</th><td>7.9</td><td>7.0</td><td>10.0</td></tr><tr><th>U027</th><td>5.4</td><td>2.6</td><td>10.0</td></tr><tr><th>U030</th><td>7.0</td><td>6.2</td><td>10.0</td></tr><tr><th>U032</th><td>4.9</td><td>10.0</td><td>NaN</td></tr><tr><th>...</th><td>...</td><td>...</td><td>...</td></tr></tbody></table>


This table is called the **utility matrix**. It records which user rated which movie and with what score. For example, from the matrix we can immediately see that user `U027` (Sofia Garcia) gave a rating of `5.4` to movie `M1024` (Titanic).

Because we are working on **user-based filtering**, we want to compare **users** with each other. The convention for this method is to put **users on the vertical axis**. Each row represents a user, and the columns correspond to the **features** of that user (in this case, their ratings for the available movies).
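To make the layout concrete, here is a minimal sketch that types in the first rows of the example table by hand and looks up individual entries. The name `toy_utility` is made up for this illustration and is not part of the assignment code; it also does not show how to build the matrix from the ratings table.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the first rows of the example utility matrix above
# (toy_utility is a hypothetical name used only in this sketch).
toy_utility = pd.DataFrame(
    {
        "M1024": [7.9, 5.4, 7.0, 4.9],
        "M2048": [7.0, 2.6, 6.2, 10.0],
        "M4096": [10.0, 10.0, 10.0, np.nan],
    },
    index=pd.Index(["U025", "U027", "U030", "U032"], name="userID"),
    dtype=float,
)
toy_utility.columns.name = "movieID"

# Each row is one user; each column is one feature (the rating for one movie).
rating = toy_utility.loc["U027", "M1024"]   # rating of U027 for Titanic
missing = toy_utility.loc["U032", "M4096"]  # U032 has not rated Inception

print(rating)
print(np.isnan(missing))
```

Looking up a whole row, for example `toy_utility.loc["U025"]`, gives the feature vector of that user.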

### Question 1

_2 pt._

The function `pivot_ratings()` below is still incomplete. It should generate the utility matrix described above. You will implement this function yourself; this means you are **not** allowed to use pandas’ built-in `pivot()` function (yet).

If there is no rating for a particular (user, movie) combination, the corresponding entry in the table should be `NaN`. You can use `np.nan` for this.

* **Tip 1:** The tests expect the output of `pivot_ratings()` to be a pandas `DataFrame` with elements of type `float`. You can enforce this by using the argument `dtype=float` when creating the `DataFrame`.

Implement `pivot_ratings()` below:


In [None]:
def pivot_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """
    Takes a rating table as input and computes the utility matrix
    (users as rows, movies as columns) without using pandas' built-in pivot function.
    """
    # your code here


df_utility = pivot_ratings(df_ratings)
display(df_utility.head())

In [None]:
# test your solution
tests_m1.knn_01(pivot_ratings)

### Plot

Next, we want to visualize the ratings for two specific movies: **Frozen** and **Titanic**. The plot should help us see how users rated these movies relative to each other.

In [None]:
helpers_m1.plot_movie_data(df_utility, "M1024", "M2048", "M4096", {"M1024": "Titanic", "M2048": "Frozen", "M4096": "Inception"})

### Mean-Centering

In the scatterplot of **Frozen** versus **Titanic**, most points are clustered in the **upper-right corner**. This shows that many users rated *both* movies quite highly.

At first glance, this suggests that these users have similar taste. However, the plot is misleading because **users have different rating habits**: some rate generously and give almost everything a 4 or 5, while others are stricter and rarely go above a 3. When the movies being compared are generally well-liked, these differences get hidden, and users appear more similar than they actually are.

To correct this, we should not use the **absolute** ratings, but rather how each user rated the movie **relative to their own typical rating level**.

**Mean-centering** does exactly that. By subtracting each user’s average rating from all their ratings, we obtain *relative* preferences that allow for more meaningful similarity comparisons.
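As a small worked example, the sketch below mean-centers the raw ratings of two users from the example utility matrix, using plain NumPy arrays rather than the full DataFrame. The variable names are chosen for this illustration only.

```python
import numpy as np

# Raw ratings of U025 and U027 for M1024, M2048 and M4096,
# copied from the example utility matrix.
u025 = np.array([7.9, 7.0, 10.0])
u027 = np.array([5.4, 2.6, 10.0])

# Subtract each user's own average rating from all of that user's ratings.
u025_centered = u025 - u025.mean()
u027_centered = u027 - u027.mean()

print(np.round(u025_centered, 1))
print(np.round(u027_centered, 1))
```

Both users gave Inception a 10, but after centering that rating counts for more for `U027` (whose average is 6.0) than for `U025` (whose average is 8.3).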


### Question 2

_2 pt._

Implement the function `mean_center()` below. It should subtract each user’s average rating from all of their individual ratings. (If you are comfortable with pandas, this can be done in just one or two lines of code.)

In [None]:
def mean_center(df: pd.DataFrame) -> pd.DataFrame:

    # your code here

df_utility_mean_centered = mean_center(df_utility)
display(df_utility_mean_centered.head())

In [None]:
# test your solution
tests_m1.knn_02(mean_center)

Now, let's plot the new DataFrame to see what the mean-centered data looks like:

In [None]:
helpers_m1.plot_movie_data(df_utility_mean_centered, "M1024", "M2048", "M4096", {"M1024": "Titanic", "M2048": "Frozen", "M4096": "Inception"})

As you can see, the data is now much more spread out around the origin. This gives a much clearer view of the differences between data points. It also makes the **cosine similarity** (which we will use later when implementing kNN) behave much better and produce more meaningful similarity scores.

### Split data

Now we want to actually start recommending movies. In particular, we would like to know whether we can recommend the movie **`M4096` (Inception)** to users who have **not** rated it yet (specifically: **U758 - Tomáš Novák** and **U032 - Marta Nowak**).
The first step is to **predict** what rating they *would* give Inception if they had seen it.

Before we can make any predictions, we need to separate the data into two parts:

* **X**: the input data (here: the ratings for movies `M1024` and `M2048`)
* **y**: the target we want to predict (the ratings for `M4096`)

Some users have a rating for `M4096`—these are the users we can learn from. Others do not—these are the users we want to predict for.
The function `transform_data_for_knn()` will split the DataFrame accordingly:

* **`X_known` and `y_known`** — users with a known rating for `M4096`
* **`X_unknown` and `y_unknown`** — users with a missing rating for `M4096`

Because kNN requires complete feature vectors, you must fill missing values in **X** using the column means.
(Do **not** fill missing values in **y**; `y_unknown` should remain `NaN`.)

After this step, only `y_unknown` contains `NaN` values—none of the other returned frames should.


### Question 3

_3 pt._

Complete the function `transform_data_for_knn(df, X_cols, y_col)` so that it:

1. Selects the feature columns `X_cols` and fills any `NaN` values with the column mean.
2. Extracts the target column `y_col`.
3. Splits the data further into four parts:

   * `X_known`: the DataFrame containing the rows in X for which the corresponding y value is known
   * `y_known`: the Series containing the y values that are known
   * `X_unknown`: the DataFrame containing the rows in X for which the corresponding y value is missing
   * `y_unknown`: the Series containing the y values that are missing (these should be `NaN`)


In [None]:
def transform_data_for_knn(df: pd.DataFrame, X_cols: list[str], y_col: str) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    # your code here
    

X_known, y_known, X_unknown, y_unknown = transform_data_for_knn(df_utility_mean_centered, ["M1024", "M2048"], "M4096")

In [None]:
# test your solution

tests_m1.knn_03(transform_data_for_knn)

## Similarity

### Manhattan Distance

Previously, we looked at plots to get an intuitive feel for which users are similar. Now we will formalize this idea.

Ultimately, we want a function that can quantify **how similar two users are** based on their ratings.

First, we need a definition of **distance**. A simple starting point is to look at the differences between the ratings that two users gave. Consider the (mean-centered) ratings of users `U025` and `U027`:

<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th>movieID</th><th>M1024</th><th>M2048</th></tr>
<tr><th>userID</th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>U025</th><td>-0.4</td><td>-1.3</td></tr>
<tr><th>U027</th><td>-0.6</td><td>-3.4</td></tr>
</tbody>
</table>

We ignore movie `M4096`, because that is the movie we want to **predict** ratings for. When predicting the rating for a particular movie, it does not make sense to also use that same movie to compute similarities.

The differences in their ratings are:

* For `M1024`: $d_{M1024} = -0.4 - (-0.6) = 0.2$
* For `M2048`: $d_{M2048} = -1.3 - (-3.4) = 2.1$

We can now define the distance between these two users as the sum of the absolute values of these differences:

$$
d = |0.2| + |2.1| = 2.3
$$

This distance measure is called the **Manhattan distance** (see [Wikipedia:Taxicab Geometry](https://en.wikipedia.org/wiki/Taxicab_geometry) if you are curious about the name).
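The calculation above can be checked with a few lines of NumPy. This is only a sketch for a single pair of users, with the mean-centered ratings typed in by hand.

```python
import numpy as np

# Mean-centered ratings for M1024 and M2048 (from the table above).
u025 = np.array([-0.4, -1.3])
u027 = np.array([-0.6, -3.4])

# Manhattan distance: sum of the absolute differences per movie.
manhattan = np.abs(u025 - u027).sum()

print(round(manhattan, 1))
```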


### NaN-Values

When computing similarities, it is essential that the data contains **no NaN values**. Consider the following example:

<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th>movieID</th><th>M1024</th><th>M2048</th></tr>
<tr><th>userID</th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>U025</th><td>-0.4</td><td>-1.3</td></tr>
<tr><th>U027</th><td>-0.6</td><td><b>NaN</b></td></tr>
</tbody>
</table>

If we compute the rating differences, the NaN immediately “contaminates” the calculation:

* For `M2048`: $d_{M2048} = -1.3 - \mathbf{NaN} = \mathbf{NaN}$
* $d = |0.2| + |\mathbf{NaN}| = \mathbf{NaN}$

So the entire distance between the users becomes `NaN`, even though we *do* have at least one movie (`M1024`) that both users rated.
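The same effect is easy to reproduce in NumPy, where any arithmetic involving `np.nan` yields `np.nan`. A minimal sketch with the values from the table above:

```python
import numpy as np

# Same users as before, but now U027 has no rating for M2048.
u025 = np.array([-0.4, -1.3])
u027 = np.array([-0.6, np.nan])

diff = u025 - u027              # the second difference is NaN
manhattan = np.abs(diff).sum()  # a single NaN makes the whole sum NaN

print(diff)
print(manhattan)
```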

In real datasets, missing values are unavoidable. If NaNs prevented us from computing distances, we would rarely be able to compute similarities at all.

This is why we had to **fill the NaN values in the `X` DataFrame** (the feature data) in Question 3. Filling with the column mean ensures that kNN receives complete feature vectors and can compute meaningful similarity scores.

### Cosine Similarity Intuition

In practice, you would rarely use Manhattan distance. A more common alternative is **Euclidean distance**, which takes the square root of the sum of squared differences:

$$
d = \sqrt{0.2^2 + 2.1^2} \approx 2.11
$$

This is simply the Pythagorean distance between two points.
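As a quick check of this number, the sketch below computes the Euclidean distance from the two rating differences found earlier, and compares it with NumPy's built-in vector norm.

```python
import numpy as np

# Rating differences between U025 and U027 for M1024 and M2048.
diff = np.array([0.2, 2.1])

# Euclidean distance: square root of the sum of squared differences.
euclidean = np.sqrt((diff ** 2).sum())

# np.linalg.norm computes the same quantity for a 1-D array.
euclidean_norm = np.linalg.norm(diff)

print(round(euclidean, 2))
```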

However, in recommender systems the most widely used similarity measure is **cosine similarity**. To compute it, imagine drawing two vectors from the origin to the points we want to compare and then measuring the **angle** between those vectors:

<img src='cosine.png' width="500pt">

In the example above, we compare users `U032` and `U432`, using only their ratings for movies `M1024` and `M2048`. They have the following scores:
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th>movieID</th><th>M1024</th><th>M2048</th></tr><tr><th>userID</th><th></th><th></th></tr></thead><tbody><tr><th>U032</th><td>-2.55</td><td>2.55</td></tr><tr><th>U432</th><td>-0.77</td><td>3.03</td></tr></tbody></table>

The angle between the two vectors is approximately $30.95^\circ$.
Cosine similarity is defined as the **cosine** of that angle:

$$
\cos(\alpha) \approx 0.859
$$

A key property of cosine similarity is that the **distance to the origin does not matter**, only the **direction** of the vector does. (Think about why this is useful when users have different rating habits.)

Note that cosine similarity always produces a score between **1** and **–1**:

* **1** means the vectors point in exactly the same direction (high similarity),
* **–1** means they point in opposite directions (high dissimilarity).
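These properties can be illustrated directly from the geometric picture. The sketch below works only for two-dimensional vectors: it measures the direction of each vector with `np.arctan2` and takes the cosine of the angle between them. The helper `direction` is a name invented for this illustration.

```python
import numpy as np

def direction(v):
    """Angle (in radians) of a 2-D vector, measured from the positive x-axis."""
    return np.arctan2(v[1], v[0])

u032 = np.array([-2.55, 2.55])
u432 = np.array([-0.77, 3.03])

# Cosine of the angle between the two users' vectors.
cos_original = np.cos(direction(u032) - direction(u432))

# Stretching or shrinking a vector changes its length but not its direction,
# so the cosine similarity stays the same.
cos_scaled = np.cos(direction(10 * u032) - direction(0.5 * u432))

# Same direction gives 1, opposite direction gives -1.
cos_same = np.cos(direction(u032) - direction(3 * u032))
cos_opposite = np.cos(direction(u032) - direction(-u032))

print(round(cos_original, 3), round(cos_scaled, 3))
print(round(cos_same, 3), round(cos_opposite, 3))
```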

### Calculating Cosine Similarity

Despite the geometric description, computing cosine similarity is straightforward. For two points
$A = (a_1, a_2)$ and $B = (b_1, b_2)$, we compute:

$$
\cos(A,B) =
\frac{a_1 b_1 + a_2 b_2}
{\sqrt{a_1^2 + a_2^2}\,\sqrt{b_1^2 + b_2^2}}
$$

In our specific example:

$$
\cos(\text{U032}, \text{U432}) =
\frac{(-2.55) \cdot (-0.77) + 2.55 \cdot 3.03}
{\sqrt{(-2.55)^2 + 2.55^2}\,\sqrt{(-0.77)^2 + 3.03^2}}
\approx 0.859
$$
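The arithmetic of this example can be verified term by term. The sketch below handles just this one pair of users with plain Python numbers; it is not the matrix version asked for in the assignment.

```python
import math

a1, a2 = -2.55, 2.55   # U032: ratings for M1024 and M2048
b1, b2 = -0.77, 3.03   # U432: ratings for M1024 and M2048

# Numerator: sum of the products of corresponding ratings.
numerator = a1 * b1 + a2 * b2

# Denominator: product of the two vector lengths.
length_a = math.sqrt(a1 ** 2 + a2 ** 2)
length_b = math.sqrt(b1 ** 2 + b2 ** 2)

cos_ab = numerator / (length_a * length_b)

print(round(cos_ab, 3))
```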

Here we used only two features for illustration. In reality, each user has many features (ratings for many movies). The general formula for $n$ features is:

$$
\cos(A, B) =
\frac{a_1 b_1 + a_2 b_2 + \cdots + a_n b_n}
{\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}\,
\sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}}
$$

Or, more formally, using sum notation:

$$
\cos(A, B) =
\frac{\displaystyle \sum_{i=1}^n a_i b_i}
{\sqrt{\displaystyle \sum_{i=1}^n a_i^2}\;
\sqrt{\displaystyle \sum_{i=1}^n b_i^2}}
$$

### Question 4

_3 pts._

Use the formula above to implement the function `cosine_similarity_matrix(X1, X2)` below. It should compute the cosine similarity between **all** users in `X1` and **all** users in `X2`.

When applied to `X_known` and `X_unknown`, the result should look something like this:

<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th>userID</th><th>U032</th><th>U758</th></tr>
<tr><th>userID</th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>U025</th><td>-4.678877e-01</td><td>4.678877e-01</td></tr>
<tr><th>U027</th><td>-5.734623e-01</td><td>5.734623e-01</td></tr>
<tr><th>U030</th><td>-3.328201e-01</td><td>3.328201e-01</td></tr>
<tr><th>U089</th><td>2.631174e-01</td><td>-2.631174e-01</td></tr>
<tr><th>U095</th><td>9.871575e-01</td><td>-9.871575e-01</td></tr>
<tr><th>U104</th><td>9.251969e-01</td><td>-9.251969e-01</td></tr>
<tr><th>U114</th><td>-2.237114e-17</td><td>2.904027e-16</td></tr>
<tr><th>...</th><td>...</td><td>...</td></tr>
</tbody>
</table>

* The **rows** represent the users from `X1` (in this example, `X_known`).
* The **columns** represent the users from `X2` (here, `X_unknown`).
* Each value is the **cosine similarity** between the corresponding pair of users.

This DataFrame is called a **similarity matrix**. It allows us to easily look up similarity scores later on.


In [None]:
def cosine_similarity_matrix(X1: pd.DataFrame, X2: pd.DataFrame) -> pd.DataFrame:
    """
    Compute cosine similarities between each row of X1 and each row of X2.
    Both are DataFrames with one sample (user) per row.
    Returns a DataFrame of shape (len(X1), len(X2)), with the index of X1
    as rows and the index of X2 as columns.
    """
    
    # your code here

similarity = cosine_similarity_matrix(X_known, X_unknown)
display(similarity.head())

In [None]:
# test your solution

tests_m1.knn_04(cosine_similarity_matrix)

<div style="border: 2px solid #444; padding: 10px; border-radius: 10px;">

### Intermezzo: Using Vector Operations

In any machine learning algorithm, it is useful to think in terms of **vector** and **matrix** operations, for two main reasons:

* They lead to cleaner (simpler) code.
* Modern hardware (especially GPUs) is highly optimized for these operations.

In later assignments, we will expect you to implement solutions using vector and matrix operations. For this assignment, that is not strictly required yet, but for full credit (and to get used to this way of thinking), you can try to re-implement the cosine similarity using vector operations.

Cosine similarity can be rewritten in terms of vector operations.

Let $a$ and $b$ be vectors containing all scores for two particular users.
(In our example above, $a$ could be the vector for user `U032`: $a = (-2.55, 2.55)$, and $b$ could be the vector for user `U432`: $b = (-0.77, 3.03)$.)

The cosine similarity between these two vectors is defined as:

$$
\cos(a,b) = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}
$$

Recall:

* The **dot product** of two vectors is defined as the sum of the products of their elements:
  $$
  a \cdot b = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n
  $$

* The **norm** (length) of a vector $a$ is:
  $$
  \lVert a \rVert = \sqrt{a \cdot a}
  $$

Using these definitions, you should be able to derive that this vector definition of cosine similarity is equivalent to the earlier formula with sums and square roots over individual components.
Try to work this out to check your understanding of the math.


### Question 4b

*3 pts.*

**Only do this question if you are ahead of schedule. Otherwise, finish all other homework first and then return to this section if you still have time.**

Re-implement the `cosine_similarity_matrix` function above, but now using **vector operations**, and verify that you obtain exactly the same results.

Hint: if you have two Pandas `Series`, `s1` and `s2`, you can treat them as vectors and compute their dot product using:

    result = s1 @ s2

</div>

In [None]:
# your code here

## KNN Regression

Now all pieces are in place, and it is time to actually implement **k-nearest neighbors (kNN) regression** yourself (without using scikit-learn). The goal is to predict the missing ratings (`y_unknown`) using the known ratings (`y_known`):

<table>
<tr>
<td>Unknown ratings Series<br>(<code>y_unknown</code>)<br><br>
    <table border="0">
        <tbody>
            <tr><th>U032</th><td>NaN</td></tr>
            <tr><th>U758</th><td>NaN</td></tr>
        </tbody>
    </table>
</td>
<td> ----> KNN ----> </td>
<td>Predicted ratings Series<br>(<code>y_predicted</code>)<br><br>
    <table border="0">
        <tbody>
            <tr><th>U032</th><td>-0.565795</td></tr>
            <tr><th>U758</th><td> 2.797070</td></tr>
        </tbody>
    </table>
</td>
</tr>
</table>

The algorithm works as follows:

1. **Compute cosine similarities**
   For every user for whom we would like to make a prediction (rows of `X_unknown`), compute the cosine similarity with every training user (rows of `X_known`).
   You already implemented `cosine_similarity_matrix` for this.

2. **Loop over each unknown user (each row of `X_unknown`):**

   * Look up the similarity scores for that user in the similarity matrix.
   * Select the **k most similar** users (the k highest similarity values).
   * For those neighbors, obtain:

     * their **similarities** (from the similarity matrix), and
     * their **known ratings** (from `y_known`).
   * Compute a **weighted average** of their known ratings:

     * the **values** are the known ratings,
     * the **weights** are the similarities.

3. **Return the predictions** as a `Series` indexed like `X_unknown` and named like `y_known`.

### Weighted Average

To compute the weighted average, you use the similarities as weights and the known ratings as values. For example:

* The neighbors of `U032` are: `U654`, `U604`, and `U616`.
* Their similarities with `U032` (from the similarity matrix):
  ```
  U654    0.999592
  U604    0.994309
  U616    0.993990
  
  ```
* Their known ratings (from `y_known`):
  ```
  U654   -0.1
  U604   -0.6
  U616   -1.0
  
  ```

* So the predicted rating for `U032` is the weighted average:

$$
\text{predicted rating} =
\frac{(-0.1) \cdot 0.999592 + (-0.6) \cdot 0.994309 + (-1.0) \cdot 0.993990}
{0.999592 + 0.994309 + 0.993990}
\approx -0.565795
$$

The general formula (with $r_i$ the rating and $s_i$ the similarity of neighbor $i$) is:

$$
\text{predicted rating} =
\frac{\sum_{i \in \text{neighborhood}} r_i \cdot s_i}
{\sum_{i \in \text{neighborhood}} s_i}
$$
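To check the worked example for `U032`, the sketch below applies this formula to the three neighbors listed above. The similarities and ratings are typed in by hand; selecting the neighbors is not part of this illustration.

```python
import pandas as pd

# Similarities of the three nearest neighbors of U032 (from the similarity matrix).
similarities = pd.Series({"U654": 0.999592, "U604": 0.994309, "U616": 0.993990})

# Known (mean-centered) ratings of those same neighbors (from y_known).
ratings = pd.Series({"U654": -0.1, "U604": -0.6, "U616": -1.0})

# Weighted average: ratings are the values, similarities are the weights.
# pandas aligns both Series on the user IDs before multiplying.
predicted = (ratings * similarities).sum() / similarities.sum()

print(round(predicted, 6))
```

Because the three similarities are almost equal, the result is close to the plain average of the three ratings.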

### Intuition

* Cosine similarity tells us **how close two users’ rating patterns are**.
* We use only the **k most similar** users to avoid noisy or irrelevant information.
* More similar users receive **higher weight** in the prediction.
* The resulting rating estimate reflects the behavior of the most relevant neighbors.

### Question 5

*8 pts.*

Implement the `knn_regression` function below. As input it gets the `X_known`, `y_known`, and `X_unknown` objects. It should return a pandas Series with predicted ratings for all users in `X_unknown`. The name of the Series should be the movie we are currently predicting (i.e., the name of `y_known`).

In [None]:
def knn_regression(X_known: pd.DataFrame, y_known: pd.Series, X_unknown: pd.DataFrame, k: int = 3) -> pd.Series:
    """
    Pure pandas / numpy implementation of KNN regression with:
        - cosine similarity
        - k nearest neighbors
        - similarity-weighted average of the neighbors' ratings (weights = similarities)
    Returns a Pandas Series, indexed like X_unknown and named like y_known.
    """
    
    # Step 1: compute cosine similarity matrix (call cosine_similarity_matrix())

    # Step 2: for each user in X_unknown:

        # get all similarities from the similarity matrix for this user

        # take k highest similarities

        # get the known ratings for these neighbor users

        # compute weighted average of the ratings of all neighbors (use similarities themselves as weights)

    # Step 3: return predictions as Series with same index as X_unknown

    # your code here


predictions = knn_regression(X_known, y_known, X_unknown)
print(predictions)

In [None]:
# test your solution

tests_m1.knn_05(predictions)

### Recommend

We now have predicted ratings for the users who have not seen the movie yet. But that is not the final step. A predicted rating is just a number; we still need to decide whether it is high enough to justify recommending the movie.

To do that, we pick a threshold value (for example `2.0`): if the predicted rating is above the threshold, we recommend the movie; if not, we do not recommend it.

### Question 6

*2 pts.*

Implement the function `recommend()` below. It should take the predicted ratings and a threshold, and return a `Series` of boolean values indicating whether the movie should be recommended to each user. A value of `True` means the predicted rating is above the threshold; `False` means it is not.

In [None]:
def recommend(predictions: pd.Series, threshold: float) -> pd.Series:
    # your code here

# find recommendations
recommendations = recommend(predictions, 2)

# display results
movie_name = df_movies.loc[recommendations.name].iloc[0]
for user_id, recommended in recommendations.items():
    user_name = df_users.loc[user_id].iloc[0]
    print(f"For {user_name} we do {'' if recommended else 'not '}recommend the movie {movie_name}.")

In [None]:
# test your solution

tests_m1.knn_06(recommend)

### Conclusion

Congratulations, you have created your first recommender system! You used kNN, a classic machine learning algorithm, to generate predictions for users who have not yet rated a movie.

But how do we know how well it performs? The predictions above look reasonable, but we still need a systematic way to evaluate the quality of the algorithm.
That is exactly what we will explore in the next notebook!