# [1.2: Evaluation] — How do we evaluate recommender systems?

In the previous notebook, you implemented a kNN-based collaborative filtering system. But this raises an important question: **how well does it actually work?**
To answer that, we need systematic ways to evaluate the quality of its predictions.

A kNN recommender does two things:

1. **Predict ratings** for items a user has not seen yet (a regression task).
2. **Convert those predicted ratings into recommendations** (a classification task).

This means we can evaluate the algorithm in two complementary ways:

* **As a regression model**:
  Here we assess how close the predicted ratings are to the actual ratings.
  A common metric for this is **Mean Squared Error (MSE)**.

* **As a classification model**:
  Once we convert predictions into “recommended” vs. “not recommended,” we can evaluate how often the recommendations are correct.
  Two widely used metrics for this are **precision** and **recall**.

Each perspective captures different aspects of performance. A model with low MSE may still make poor recommendations, and a model with high precision may fail to recommend enough relevant items.

In this notebook, we will explore both evaluation approaches and see how they complement each other.

Keep in mind that we are working with an **extremely small dataset** (the same one used in the previous notebook). The evaluations we perform here do **not** reflect how the algorithm would behave in real-world scenarios. The goal of this notebook is primarily to help you become familiar with different evaluation methods.

In the next notebook, we will apply these techniques to a much larger and more realistic dataset.

## Getting Started

### Libraries

Let's start by loading the libraries we'll need by running the cell below:

In [None]:
import pandas as pd
import numpy as np
import pooch

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

### Download data

Now download the required data and helper files (the same ones as in the previous notebook).

In [None]:
# download data
DATA_REPO = "https://raw.githubusercontent.com/uvapl/recommender-systems/main/data/m1/"

print("downloading data files")
for fname in ["ratings_X_known.csv", "ratings_y_known.csv", "ratings_extended.csv"]:
    pooch.retrieve(url = DATA_REPO + fname, known_hash=None, fname=fname, path="data", progressbar=True)
for fname in ["tests_m1.py"]:
    pooch.retrieve(url = DATA_REPO + fname, known_hash=None, fname=fname, path=".", progressbar=True)
print("done!")

import tests_m1

# read data
X_known = pd.read_csv("data/ratings_X_known.csv", index_col = "userID")
y_known = pd.read_csv("data/ratings_y_known.csv", index_col = "userID").iloc[:,0]

# Train/Test Split

To evaluate how well a recommender system performs, we need a way to measure how accurately it predicts user ratings. A common approach is to **split the data into a training set and a test set**. The idea is simple: for the test set, we pretend that we do not know the true ratings (even though we actually do). We then let the algorithm predict those ratings and later check how close the predictions were.

We first learn user similarities *only* from the training set. Then we use these learned similarities to predict the ratings in the test set. This separation is crucial: the test set represents data that the model has not seen before, allowing us to evaluate how well the algorithm generalizes to new users or new rating situations.

So we divide both:

* the `X` data (the feature ratings for the movies we are **not** predicting), and
* the `y` data (the **target** ratings we *do* want to predict)

into two parts:

* **Training set (`X_train`, `y_train`)** — the data the model is allowed to learn from.
* **Test set (`X_test`, `y_test`)** — the data the model must predict without having seen it.

After making predictions for the test set, we compare them to the actual ratings in `y_test`. This gives us an objective measure of model performance. We typically call the predicted ratings **`y_hat`** (written mathematically as $\hat{y}$). So in the end, we want to evaluate **how close `y_hat` is to `y_test`**.
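
To make the four pieces concrete, here is a minimal sketch that uses scikit-learn's own `train_test_split` on toy data. The toy users, movie IDs, and the alias `sk_train_test_split` are assumptions made for illustration only; they are not part of the course data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split as sk_train_test_split

# Hypothetical toy data: 10 users, 2 feature movies, 1 target movie.
users = [f"U{i}" for i in range(10)]
X_toy = pd.DataFrame({"M1": range(10), "M2": range(10, 20)}, index=users)
y_toy = pd.Series(range(20, 30), index=users, name="M3")

# Randomly set aside 30% of the users as the test set.
X_tr, X_te, y_tr, y_te = sk_train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Features and targets stay aligned because they share the same user index.
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
# No user appears in both sets.
print(set(X_tr.index) & set(X_te.index))
```

Fixing `random_state` makes this sketch reproducible; leaving it out gives a different split on every call.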

In the next step, you will implement a simple function that creates such a train/test split for Pandas DataFrames and Series.

### Question 1
*2 pts.*

Complete the function `train_test_split()` below. It takes the feature data `X`, the target data `y`, and a `test_size` argument that specifies which proportion of the data should be used for testing (for example, `test_size = 0.3` means that 30% of the data will be set aside as the test set).

The function should return four objects:

* **`X_train`** — the portion of `X` used for training
* **`X_test`** — the portion of `X` used for testing
* **`y_train`** — the corresponding training targets
* **`y_test`** — the corresponding test targets

Make sure that `X_train` aligns with `y_train`, that `X_test` aligns with `y_test`, and that the split preserves the original indices. The split must also be random (for example, not simply the first 70% of the rows for training and the remaining 30% for testing), so calling the function multiple times should give a different split each time.

In [None]:
def train_test_split(X: pd.DataFrame, y: pd.Series, test_size: float = 0.3) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    # your code here

# apply to the provided data
X_train, X_test, y_train, y_test = train_test_split(X_known, y_known)

display(X_train)
display(X_test)

In [None]:
# test your solution

tests_m1.evaluation_01(train_test_split)

# Using scikit-learn for kNN Regression

In the previous notebook, you implemented kNN regression **manually**, step by step. This helped you understand how cosine similarity, neighbor selection, and weighted averaging together produce a predicted rating. However, for real applications we typically rely on well-tested, optimized libraries rather than writing the algorithm from scratch.

In this notebook, we switch to **scikit-learn**, a widely used machine learning library in Python. Scikit-learn provides a reliable and efficient implementation of kNN through the `KNeighborsRegressor` class. Its internal code is optimized, thoroughly tested, and handles many edge cases automatically.

The workflow in scikit-learn always follows the same pattern:

1. **Create a model object** (e.g., `KNeighborsRegressor(...)`).
2. **Fit** the model on the data using `.fit(X_train, y_train)`.
3. **Predict** on new data using `.predict(X_test)`.

We will still use cosine similarity, but scikit-learn internally works with **cosine distance** (which is `1 - cosine_similarity`). To make the weighting consistent with our earlier implementation, we include a small conversion step that transforms distances back into similarities.
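
The relation between the two quantities can be checked with a small sketch. The two rating vectors are made up for illustration; `cosine_similarity` and `cosine_distances` come from `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

# Two hypothetical rating vectors (one row per user).
ratings = np.array([[1.0, 0.5, -0.5],
                    [0.5, 1.0, 0.0]])

similarity = cosine_similarity(ratings)
distance = cosine_distances(ratings)

# The two matrices are complementary: distance = 1 - similarity.
print(np.allclose(distance, 1 - similarity))

# Converting distances back into similarities, clipped at 0 so that
# a neighbor never receives a negative weight.
recovered = np.clip(1 - distance, a_min=0, a_max=None)
print(np.allclose(recovered, similarity))
```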

The code below shows how to build a kNN regressor using scikit-learn and apply it to obtain predicted ratings for our unknown users. It produces the same results as the version you implemented in the previous notebook, just much, much faster.

Note that, instead of using `X_known`, `y_known`, `X_unknown`, and `y_unknown` as in the previous notebook, we now use the terms `X_train`, `y_train`, `X_test`, and `y_test`. This terminology is more appropriate in the context of **evaluation**, as you have seen in the previous exercise.

Run the cell below:

In [None]:
from sklearn.neighbors import KNeighborsRegressor

def knn_regression(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, k: int = 3) -> pd.Series:
    # sklearn uses cosine distance, not similarity.
    # We need to convert distance back to similarity so that the weighting
    # is in line with the Aggarwal book.
    def weights_conversion(dist):
        # conversion by sklearn: dist = 1 - similarity = 1 - cos(a)
        # undoing conversion for weights: similarity = 1 - dist = cos(a)
        return np.clip(1 - dist, a_min=0, a_max=None)
        
    # setup sklearn knn regression object
    knn = KNeighborsRegressor(n_neighbors=k, metric="cosine", weights=weights_conversion)

    # add training data
    knn.fit(X_train, y_train)

    # predict for test data and convert result back to Pandas Series object
    predicted_array = knn.predict(X_test)
    predicted_series = pd.Series(predicted_array, index = X_test.index, name = y_train.name)
    
    return predicted_series

# verify that it produces the same results as in the previous notebook
print(knn_regression(X_train, y_train, X_test))

## Apply kNN to the actual data
Now, let's run the kNN algorithm on the data you split in the previous exercise:

In [None]:
y_hat = knn_regression(X_train, y_train, X_test)

# Evaluating Regression: Mean Squared Error

Now that we can make predictions, we need a way to measure **how good** those predictions are. We do this by comparing the predicted ratings (`y_hat`) with the test ratings (`y_test`). You might be tempted to use cosine similarity for this, and that would not be a bad approach, but we will use a different metric called the **mean squared error**.

We first compute an **error** (deviation) for each prediction, then combine those errors into a single score.

For each sample in the test set, we compute the deviation:

<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>user</th><th></th>
<th>actual<br>rating</th><th>-</th>
<th>predicted<br>rating</th><th>=</th>
<th>difference</th>
</tr>
</thead>
<tbody>
<tr><td>U168</td><td></td><td>-0.5</td><td>-</td><td>1.48</td><td>=</td><td>-1.98</td></tr>
<tr><td>U169</td><td></td><td> 1.0</td><td>-</td><td>0.46</td><td>=</td><td> 0.54</td></tr>
<tr><td>U170</td><td></td><td>-0.5</td><td>-</td><td>1.63</td><td>=</td><td>-2.13</td></tr>
<tr><td>U171</td><td></td><td>-0.5</td><td>-</td><td>1.46</td><td>=</td><td>-1.96</td></tr>
<tr><td>U172</td><td></td><td> 0.5</td><td>-</td><td>0.49</td><td>=</td><td> 0.01</td></tr>
<tr><th colspan="8" style="text-align:center;">...</th></tr>
</tbody>
</table>

Once we have all individual errors, we need a single number that summarizes how well the model performed.
A widely used metric for this is the **mean squared error (MSE)**.

We take the square of each error and then compute the average:

$$
\text{mean squared error} =
\frac{(-1.98)^2 + 0.54^2 + (-2.13)^2 + (-1.96)^2 + 0.01^2 + \ldots}{N}
$$

where $N$ is the number of samples in the test set.

The MSE is one of the most common evaluation metrics in data science and machine learning. You will encounter it frequently in later courses and real-world projects.

Formally:

$$
\text{mse} = \frac{1}{N} \sum_{i=1}^N (a_i - p_i)^2
$$

where

* $a_i$ is the **actual** rating,
* $p_i$ is the **predicted** rating.
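
As a cross-check of the arithmetic, the sketch below feeds only the five rows shown in the table above to scikit-learn's `mean_squared_error`. The result therefore covers those five samples, not the full test set, and the variable names are chosen for this illustration.

```python
from sklearn.metrics import mean_squared_error

# The five (actual, predicted) pairs from the table above.
actual = [-0.5, 1.0, -0.5, -0.5, 0.5]
predicted = [1.48, 0.46, 1.63, 1.46, 0.49]

mse_first_five = mean_squared_error(actual, predicted)
print(round(mse_first_five, 3))  # about 2.518
```

The large value is driven by the three errors of roughly 2: squaring makes big deviations weigh much more heavily than small ones.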

### Question 2
*2 pts.*

Implement the `mse()` function below:

In [None]:
def mse(y_true: pd.Series, y_pred: pd.Series) -> float:
    # your code here

mse(y_test, y_hat)

In [None]:
# test your solution
tests_m1.evaluation_02(mse)

# Baseline

If everything is implemented correctly, the mean squared error for this dataset should be around **1.3**. But is that good?
On its own, the value of the **mse** is hard to interpret. A number like 1.3 does not directly tell us whether the recommender system is “good” or “bad.”

The real strength of mse lies in **comparison**. It allows us to evaluate which method performs better under the same conditions. For example, we can compare the mse of user-based filtering versus item-based filtering, or compare different similarity metrics, different values of $k$, or different preprocessing steps.

In other words, mse is most useful as a **relative** measure rather than an absolute one.

A particularly important comparison is with simple **baseline models**—naive prediction strategies that do not use similarities or machine learning at all. One such baseline is making random predictions. A slightly better baseline is to predict the **mean rating of each movie**. Below, we will compare our recommender system to this latter baseline to see whether the kNN model actually provides an improvement.

The function below does exactly this for our target movie. It returns the mean of `y_train` as the predicted rating, so the prediction is exactly the same value for all users.

In [None]:
def baseline_mean_prediction(y_train, X_test):
    return pd.Series(y_train.mean(), index=X_test.index)

y_hat_mean = baseline_mean_prediction(y_train, X_test)
display(y_hat_mean)

Now let's compute the mean squared error for this baseline:

In [None]:
mse(y_test, y_hat_mean)

# Evaluating Classification: Precision and Recall

We have seen that kNN performs much better than the baseline model. But does that automatically make it a *good* recommender system?

Not necessarily. An important question is how meaningful the **mse** metric is for the final task. To answer that, we need to think about how the recommender system is actually used. What does the system *do* with the predicted ratings?

* Does it show the user a top-10 list of items with the highest predicted scores?
* Does it recommend all movies whose predicted rating is above a certain threshold?
* Does the user see randomly selected recommendations drawn from a pool of “good” items?

Depending on how the recommendations are used, mse may or may not be the right evaluation measure.

## Recommended vs. Hidden Items

To make evaluation concrete, we will assume the following workflow for the recommender system:

1. **Recommended items:**
   For each user, we mark items with a predicted rating **greater than or equal to a threshold** (e.g., 0.25) as items that *should be recommended*.

2. **Hidden items:**
   Items with predicted ratings **below the threshold** are marked as items that should *not* be recommended.

3. **Recommendation display:**
   The system then selects $N$ random items from the recommended pool to show to the user.
   (We won’t elaborate further on this step, but for now it’s important to know that *every recommended item has an equal chance of being shown*.)
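
The three steps can be sketched for a single user as follows. The movie IDs, the predicted ratings, and the name `n_shown` are hypothetical and only serve to illustrate the workflow.

```python
import pandas as pd

# Hypothetical predicted ratings for one user, indexed by made-up movie IDs.
predicted = pd.Series({"M1": 1.2, "M2": 0.1, "M3": 0.4, "M4": -0.7, "M5": 0.9})
threshold = 0.25
n_shown = 2

pool = predicted[predicted >= threshold]    # recommended items
hidden = predicted[predicted < threshold]   # hidden items

# Uniform sampling: every item in the pool has the same chance of being shown.
shown = pool.sample(n=n_shown, random_state=0)
print(list(pool.index), list(hidden.index), list(shown.index))
```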

> With this setup, our recommender system becomes a **classification problem**:
> For each (user, movie) pair, we must decide whether it belongs to the *recommended* class or the *hidden* class.
> This type of binary classification task is very common in machine learning.

This shift from regression (predicting a rating) to classification (recommend/do not recommend) means that mse is no longer the most meaningful performance metric. Instead, we will evaluate the system using **precision** and **recall**, which better capture how well a classifier behaves.

## Classify

We now move from predicting ratings to **classifying** items as *recommended* or *not recommended*.
Use the `recommend()` function below (which is similar to the one you implemented in the previous notebook) to produce these classifications for both the kNN model and the baseline model.

The function returns a Pandas Series containing `True` and `False` values:

* `True` means the predicted rating is at or above the threshold (0.25) → **recommend**
* `False` means it is below the threshold → **do not recommend**

This function is applied to:

* the kNN predictions (`recommendations_knn`)
* the baseline predictions (`recommendations_mean`)

In [None]:
def recommend(predictions, threshold):
    return predictions >= threshold

threshold = 0.25
recommendations_knn = recommend(y_hat, threshold)
print(recommendations_knn)
recommendations_mean = recommend(y_hat_mean, threshold)
print(recommendations_mean)

## Test Data (liked)
To evaluate the performance of the recommender system, we need to know whether the users in our test set (`y_test`) **actually liked** the movies. We *could* infer this from their ratings, but that is not necessary. This information is already available in our dataset: the file **`ratings_extended.csv`** contains a column indicating whether a user liked a movie.

All we need to do is extract the relevant entries.
The cell below reads `ratings_extended.csv`, selects the rows corresponding to the user–movie pairs in `y_test`, and creates a new Series called `liked_test` that tells us, for each test sample, whether the user liked the movie in reality.
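
If the reshaping step is unfamiliar, this minimal sketch shows what `pivot` does on a made-up long-format table. The user and movie IDs here are hypothetical and unrelated to the course data.

```python
import pandas as pd

# Hypothetical long-format table: one row per (user, movie) pair.
toy = pd.DataFrame({
    "userID":  ["U1", "U1", "U2", "U2"],
    "movieID": ["M1", "M2", "M1", "M2"],
    "liked":   [True, False, False, True],
})

# Wide format: one row per user, one column per movie.
wide = toy.pivot(index="userID", columns="movieID", values="liked")

# Select one movie's column, then only the users of interest.
liked_m2 = wide["M2"].loc[["U2"]]
print(wide)
print(liked_m2)
```

In this toy table every user has an entry for every movie. When pairs are missing, `pivot` fills them with `NaN`, which is worth checking before converting to `bool`.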

In [None]:
ratings = pd.read_csv("data/ratings_extended.csv", dtype = {"liked": bool})
liked = ratings.pivot(index = 'userID', columns = 'movieID', values = 'liked')["M4096"]
liked_test = liked.loc[y_test.index].astype(bool)

## Confusion

We now have all the information needed to evaluate the recommender system’s performance:

* the kNN-based recommendations (`recommendations_knn`)
* the baseline recommendations (`recommendations_mean`)
* the actual like/dislike information (`liked_test`)

Some terminology:
In evaluation, the **actual** data is often described as *used* (liked) and *unused* (not liked).
The **predicted** data is described as *recommended* (the system would show it) and *hidden* (the system would not show it).

We begin by counting how many items were **correctly recommended**—items the system recommended and the user actually liked. These are the **true positives**.

Given our four categories (recommended, hidden, used, unused), we can define the standard classification outcomes:

* **True positives (TP):** recommended *and* liked
* **False positives (FP):** recommended *but not* liked
* **True negatives (TN):** hidden *and not* liked
* **False negatives (FN):** hidden *but actually* liked

These four values are typically arranged in a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix), a standard evaluation tool:

|             | Liked (Used) | Not Liked (Unused) |
| ----------- | ------------ | ------------------ |
| Recommended | TP           | FP                 |
| Hidden      | FN           | TN                 |
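
For comparison, scikit-learn offers `confusion_matrix`, which can serve as a cross-check on toy data. Note that its layout differs from the table above: rows are the actual classes and columns the predicted classes, so with `labels=[False, True]` the cells are ordered TN, FP, FN, TP. The six toy samples below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy data: what users liked vs. what the system recommended.
liked =       [True, True,  True,  False, False, False]
recommended = [True, False, False, True,  False, False]

# sklearn layout: [[TN, FP], [FN, TP]] (rows = actual, columns = predicted).
tn, fp, fn, tp = confusion_matrix(liked, recommended, labels=[False, True]).ravel()
print(tp, fp, fn, tn)  # 1 1 2 2
```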

### Question 3

*3 pts.*

Implement the `confusion()` function below. It should take the actual like/dislike data (`y_true`) and the predicted recommendations (`y_pred`) as input and return a **2×2 DataFrame** containing the confusion matrix in the format shown above.

In [None]:
def confusion(y_true: pd.Series, y_pred: pd.Series) -> pd.DataFrame:
    # your code here

confusion_knn = confusion(liked_test, recommendations_knn)
print(confusion_knn)
confusion_mean = confusion(liked_test, recommendations_mean)
print(confusion_mean)

In [None]:
# test your solution

tests_m1.evaluation_03(confusion)

## Precision

A commonly used evaluation metric in classification tasks is **precision**. Precision answers the question:

**When the system recommends a movie, how often is that recommendation actually relevant?**

Formally:

$$
\textrm{precision} = \frac{\textrm{true positives}}{\textrm{\# recommended items}} = \frac{\textrm{true positives}}{\textrm{true positives + false positives}}
$$


Interpretation:

* If **all** recommended movies are relevant
  (i.e., $\text{false positives} = 0$) ->
  $\text{precision} = 1$

* If **none** of the recommended movies are relevant
  (i.e., $\text{true positives} = 0$) -> $\text{precision} = 0$

Precision is particularly useful when we care about the *quality* of recommendations rather than the quantity. For more information, see: [https://en.wikipedia.org/wiki/Precision_and_recall](https://en.wikipedia.org/wiki/Precision_and_recall)
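
A quick way to sanity-check the formula is scikit-learn's `precision_score`. The toy data below is invented for illustration: the system recommends two items, of which one was actually liked.

```python
from sklearn.metrics import precision_score

# Hypothetical toy data: TP = 1, FP = 1, FN = 2, TN = 2.
liked =       [True, True,  True,  False, False, False]
recommended = [True, False, False, True,  False, False]

# precision = TP / (TP + FP) = 1 / (1 + 1)
toy_precision = precision_score(liked, recommended, pos_label=True)
print(toy_precision)  # 0.5
```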

### Question 4
*2 pts.*

Implement the function `precision()` below. It should take the actual like/dislike data (`y_true`) and the predicted recommendations (`y_pred`) as input, and return the **precision** of the recommender system.

In [None]:
def precision(y_true: pd.Series, y_pred: pd.Series) -> float:
    # your code here

precision_knn = precision(liked_test, recommendations_knn)
print(precision_knn)
precision_mean = precision(liked_test, recommendations_mean)
print(precision_mean)    

In [None]:
# test your solution

tests_m1.evaluation_04(precision)

## Recall

Precision gives us valuable insight into how well the algorithm performs, but it does not tell the whole story. Many other metrics are important when evaluating recommender systems. One of the most common metrics used alongside precision is **recall**.

Recall answers a different question:

**Of all the items a user would actually like, how many does the algorithm successfully recommend?**

In other words: if a user would enjoy a movie, does the system manage to recommend it?

Formally:

$$
\textrm{recall} = \frac{\textrm{true positives}}{\textrm{\# used items}} = \frac{\textrm{true positives}}{\textrm{true positives + false negatives}}
$$


If there are **no false negatives** (that is, if the algorithm recommends *every* movie the user would have liked) then:

$$
\text{recall} = 1.
$$

A low recall indicates that the recommender is missing many potentially good recommendations, even if the ones it does make are accurate.
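
The same kind of sanity check works for recall, here with scikit-learn's `recall_score` on invented toy data: three items were liked, but only one of them was recommended.

```python
from sklearn.metrics import recall_score

# Hypothetical toy data: TP = 1, FP = 1, FN = 2, TN = 2.
liked =       [True, True,  True,  False, False, False]
recommended = [True, False, False, True,  False, False]

# recall = TP / (TP + FN) = 1 / (1 + 2)
toy_recall = recall_score(liked, recommended, pos_label=True)
print(round(toy_recall, 3))  # 0.333
```

On this toy data precision is 0.5 while recall is only about 0.33: the recommendations that are made are reasonably accurate, but most liked items are missed.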

### Question 5
*2 pts.*

Implement the function `recall()` below. It should take the actual like/dislike data (`y_true`) and the predicted recommendations (`y_pred`) as input, and return the **recall** of the recommender system.

In [None]:
def recall(y_true: pd.Series, y_pred: pd.Series) -> float:
    # your code here

recall_knn = recall(liked_test, recommendations_knn)
print(recall_knn)
recall_mean = recall(liked_test, recommendations_mean)
print(recall_mean)   

In [None]:
# test your solution
tests_m1.evaluation_05(recall)

## Conclusion

Remember that the dataset we used in this notebook is very small. This means we cannot reliably judge how well the algorithm would perform in real-world scenarios. The purpose here is to practice the evaluation methods, not to draw conclusions about actual performance.

But what we can say is this: precision and recall often behave in opposing ways. Improving one can easily worsen the other. For example, increasing the recommendation threshold or choosing a smaller $k$ might raise **precision** (because only very confident recommendations are shown), but this usually lowers **recall**, since fewer potentially relevant items are recommended. Lowering the threshold or increasing $k$ tends to have the opposite effect: **recall** improves, but precision may drop.

You can try this effect for yourself by playing around with the threshold and the k-value.
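
The threshold side of this trade-off can be sketched on a handful of made-up predictions. The ratings, the like/dislike flags, and the three thresholds below are all invented for illustration, so the exact numbers say nothing about the course dataset.

```python
import pandas as pd

# Hypothetical predicted ratings and whether each item was actually liked.
predicted = pd.Series([1.5, 1.0, 0.6, 0.3, 0.1, -0.2, -0.8])
liked = pd.Series([True, True, False, True, False, True, False])

rows = []
for threshold in [-1.0, 0.25, 0.8]:
    recommended = predicted >= threshold
    tp = int((recommended & liked).sum())  # recommended and liked
    rows.append({
        "threshold": threshold,
        "precision": tp / int(recommended.sum()),
        "recall": tp / int(liked.sum()),
    })

tradeoff = pd.DataFrame(rows)
print(tradeoff)
```

On this toy data, raising the threshold moves precision from about 0.57 to 0.75 to 1.0, while recall drops from 1.0 to 0.75 to 0.5.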

The challenge in designing a recommender system is to find the right **balance** between these two metrics. In the next notebook, we will use a much larger dataset to explore how these measures relate and how different choices in the algorithm (such as the threshold) influence this trade-off.