# Evaluation of Top-N recommenders

A recommender can be evaluated offline with the usual train/test split and cross-validation.

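- MAE (mean absolute error): the average absolute difference between the predicted and real values.
- RMSE (root mean squared error): penalises a prediction more heavily when the difference between the predicted and real values is way off (because of the square).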
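As a minimal sketch of the two metrics, the snippet below computes both on a handful of made-up ratings. The function names and the sample data are illustrative assumptions, not taken from any particular library.

```python
import math


def mae(predicted, actual):
    """Mean absolute error over paired predicted/actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; squaring lets large misses dominate."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical ratings: three small misses (0.5) and one large miss (3.0).
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [4.5, 3.5, 4.5, 4.0]

print(mae(predicted, actual))   # 1.125
print(rmse(predicted, actual))  # roughly 1.56, pulled up by the single large miss
```

The single large miss raises RMSE well above MAE, which is the effect of the square.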

MAE and RMSE tell you how good a model is at predicting ratings for things that have already been consumed (listened to, read, watched), but they cannot tell you how well it recommends things that the user has never consumed.

Caveat: RMSE doesn't really matter in the real world. What matters is how users react to the recommendations presented to them.

## Hit Rate

- Generate top-N recs for all of the users in the test set. If one of the items in a user's list is something that user actually rated/watched/liked, that is considered a hit.
- Sum the hits across all users and divide by the number of users.
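The two steps above can be sketched as follows. The dictionary formats (`top_n` mapping each user to a list of recommended items, `test_items` mapping each user to the set of items they actually consumed in the test set) are assumptions made for this example.

```python
def hit_rate(top_n, test_items):
    """Fraction of users with at least one test-set item in their top-N list."""
    hits = 0
    for user, recs in top_n.items():
        consumed = test_items.get(user, set())
        if any(item in consumed for item in recs):
            hits += 1
    return hits / len(top_n)


# Hypothetical data: u1 and u3 each have a hit, u2 has none.
top_n = {
    "u1": ["a", "b", "c"],
    "u2": ["d", "e", "f"],
    "u3": ["a", "e", "g"],
}
test_items = {"u1": {"b"}, "u2": {"x"}, "u3": {"g", "z"}}

print(hit_rate(top_n, test_items))  # 2 of 3 users have a hit
```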

### How to measure it

We can't just use the usual train/test cross validation technique.

### Leave-one-out cross validation

- Intentionally remove one item from each user's training data, then compute the top-N recs for each user
- We then test the ability of the rec system to recommend the item that was left out for that user

It basically focuses on measuring the ability to place the item that was left out of training in each user's top-N list.

Hit rate with this technique tends to be really small, and it is difficult to measure with small datasets.
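A minimal sketch of the split, assuming each user's history is a plain list of item ids. The function name, the `seed` parameter, and the choice to keep single-item users entirely in training are assumptions for illustration.

```python
import random


def leave_one_out_split(user_items, seed=0):
    """Withhold one randomly chosen item per user; the rest stays in training."""
    rng = random.Random(seed)
    train, held_out = {}, {}
    for user, items in user_items.items():
        items = list(items)
        if len(items) < 2:
            # Not enough history to withhold anything: keep it all for training.
            train[user] = items
            continue
        left_out = rng.choice(items)
        held_out[user] = left_out
        train[user] = [item for item in items if item != left_out]
    return train, held_out


# Hypothetical histories.
user_items = {"u1": ["a", "b", "c"], "u2": ["b", "d"], "u3": ["e"]}
train, held_out = leave_one_out_split(user_items)
print(train)
print(held_out)
```

The recommender is then trained on `train`, and a hit is counted when a user's entry in `held_out` shows up in that user's top-N list.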

### Average Reciprocal Hit Rate (ARHR)

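$$
\text{ARHR} = \frac{1}{users} \sum_{i=1}^{N} \frac{1}{rank_i}
$$

where $rank_i$ is the position of the $i$-th hit in its top-N list and $users$ is the number of users.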

| Rank | Reciprocal |
| :-: | :-: |
| 5 | $\frac{1}{5}$ |
| 1 | 1 |
| 2 | $\frac{1}{2}$ |
| 4 | $\frac{1}{4}$ |

This metric is like hit rate, but it accounts for where in the top-N list your hits appear. It gives more credit for successfully recommending an item in the top slot than in the bottom slot.

This is a more user-focused metric, since users tend to focus on the beginning of lists.

The only difference is that instead of summing up the number of hits, we sum up the reciprocal rank of each hit. So if we successfully predict a recommendation in slot three, that only counts as one-third. But a hit in slot one of our top-N recommendations receives the full weight of 1.0.

It penalises good recommendations that are too far down in the list.
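A minimal sketch of ARHR under leave-one-out evaluation. It assumes `top_n` maps each user to an ordered list of recommended items and `held_out` maps each user to the single item withheld from training; both names and formats are illustrative.

```python
def average_reciprocal_hit_rate(top_n, held_out):
    """Sum 1/rank over all hits, then divide by the number of users."""
    total = 0.0
    for user, left_out in held_out.items():
        recs = top_n.get(user, [])
        if left_out in recs:
            rank = recs.index(left_out) + 1  # ranks are 1-based
            total += 1.0 / rank
    return total / len(held_out)


# Hypothetical data: u1 hits in slot 1, u2 hits in slot 3, u3 misses.
top_n = {
    "u1": ["a", "b", "c"],
    "u2": ["d", "e", "f"],
    "u3": ["g", "h", "i"],
}
held_out = {"u1": "a", "u2": "f", "u3": "z"}

print(average_reciprocal_hit_rate(top_n, held_out))  # (1 + 1/3 + 0) / 3
```

Plain hit rate on the same data would be 2/3; ARHR is lower because the second hit sits in slot three.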

### Cumulative Hit Rate (cHR)

It sets a rating threshold and throws away recommendations whose predicted rating is below that threshold, so they cannot count as hits.
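One possible reading of this, sketched below: recommendations whose predicted rating falls under the threshold are dropped before hits are counted. The data format (`top_n` holding `(item, predicted_rating)` pairs), the function name, and dividing by the number of users are all assumptions; implementations differ on these details.

```python
def cumulative_hit_rate(top_n, held_out, threshold):
    """Hit rate counted only over recommendations predicted at or above threshold."""
    hits = 0
    for user, left_out in held_out.items():
        kept = {
            item
            for item, predicted in top_n.get(user, [])
            if predicted >= threshold
        }
        if left_out in kept:
            hits += 1
    return hits / len(held_out)


# Hypothetical data: u2's hit has a low predicted rating (2.5).
top_n = {
    "u1": [("a", 4.5), ("b", 2.0)],
    "u2": [("c", 2.5), ("d", 4.0)],
}
held_out = {"u1": "a", "u2": "c"}

print(cumulative_hit_rate(top_n, held_out, threshold=3.0))  # 0.5
print(cumulative_hit_rate(top_n, held_out, threshold=0.0))  # 1.0
```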

### Rating Hit Rate (rHR)

## Coverage

- It's the percentage of possible recommendations that the system is able to provide.
- It's the percentage of (user, item) pairs that can be predicted.

This measure tends to be in tension with accuracy. It can also tell how quickly new items can appear in the recommendations.
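The two definitions above can be sketched as two small functions. The names `catalogue_coverage` and `prediction_coverage`, and the input formats, are assumptions made for illustration.

```python
def catalogue_coverage(top_n, catalogue):
    """Share of catalogue items that appear in at least one top-N list."""
    recommended = {item for recs in top_n.values() for item in recs}
    return len(recommended & set(catalogue)) / len(catalogue)


def prediction_coverage(predictable_pairs, users, items):
    """Share of all (user, item) pairs for which a prediction can be made."""
    return len(set(predictable_pairs)) / (len(users) * len(items))


# Hypothetical data.
catalogue = ["a", "b", "c", "d", "e"]
top_n = {"u1": ["a", "b"], "u2": ["b", "c"]}
print(catalogue_coverage(top_n, catalogue))  # 0.6

predictable = [("u1", "a"), ("u1", "b"), ("u2", "b")]
print(prediction_coverage(predictable, ["u1", "u2"], ["a", "b"]))  # 0.75
```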

## Diversity

It's a measure of the variety of the recommended items.

A low-variety recommender system may just recommend episodes from the same programme or content of the same specific sub-genre.

It can be computed as $(1 - S)$, where $S$ is the average similarity between recommendation pairs.

Diversity alone is not a good measure for a recommender system, because we could get a high degree of diversity by recommending random items and still achieve poor results.

We could consider diversity as a parameter to take into account, and not as an absolute value to measure in isolation.
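A minimal sketch of $(1 - S)$ for a single recommendation list. The similarity function is an assumption: Jaccard similarity over hypothetical genre sets is used here, but any item-to-item similarity in the range 0 to 1 would fit.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversity(recs, features):
    """1 - S, where S is the mean pairwise similarity of the recommended items."""
    pairs = list(combinations(recs, 2))
    if not pairs:
        return 0.0  # a list with fewer than two items has no pairs to compare
    s = sum(jaccard(features[x], features[y]) for x, y in pairs) / len(pairs)
    return 1.0 - s


# Hypothetical item features: genres per item.
features = {"a": {"drama"}, "b": {"drama"}, "c": {"comedy"}}

print(diversity(["a", "b"], features))       # 0.0, both items are identical in genre
print(diversity(["a", "c"], features))       # 1.0, nothing in common
print(diversity(["a", "b", "c"], features))  # S = 1/3, so diversity is about 0.67
```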

## Novelty

It's a measure of how popular or unknown the recommended items are.

Serendipity: an unplanned fortunate discovery.

It can be measured as the mean popularity ranking of the recommended items, which indicates how many new or little-known items are being recommended. Recommending only new items doesn't by itself make the recommendations better. Again, this alone is not a good measure, but it can be used as a parameter to tune the recommender system.

You still want the recommender system to show a certain percentage of popular items.

We must strike the right balance between familiar items and new ones.
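A minimal sketch of novelty as mean popularity rank, where rank 1 is the most consumed item and a higher mean indicates more obscure recommendations. The function names, the tie-breaking rule, and the choice to skip items with no recorded consumption are assumptions.

```python
from collections import Counter


def popularity_ranks(interactions):
    """Map each item to its popularity rank (1 = most consumed)."""
    counts = Counter(interactions)
    # Ties are broken by item id so the ranking is deterministic.
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ordered, start=1)}


def novelty(top_n, ranks):
    """Mean popularity rank over every recommended item with a known rank."""
    all_ranks = [ranks[item] for recs in top_n.values() for item in recs if item in ranks]
    return sum(all_ranks) / len(all_ranks) if all_ranks else 0.0


# Hypothetical consumption log: "a" is the most popular, "c" the least.
interactions = ["a", "a", "a", "b", "b", "c"]
ranks = popularity_ranks(interactions)
top_n = {"u1": ["a", "b"], "u2": ["c"]}

print(ranks)                  # {'a': 1, 'b': 2, 'c': 3}
print(novelty(top_n, ranks))  # 2.0
```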

### The long tail

If we plot a graph with all the items on the X axis, sorted by popularity, and the popularity/frequency on the Y axis, we see a steeply decaying, exponential-like curve: very few items account for the most popular consumption, and everything else represents niche items with low popularity. But if you combine all the items in the long tail, they would amount to the majority.

Items in the long tail represent niche interests, and the recommender system must be able to let users discover those items.
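To make the head-versus-tail split concrete, the sketch below computes what share of all consumption comes from the most popular items. The function name, the `head_fraction` parameter, and the toy data are assumptions chosen so the tail outweighs the head.

```python
from collections import Counter


def head_share(interactions, head_fraction=0.2):
    """Share of all consumption that goes to the most popular items."""
    counts = sorted(Counter(interactions).values(), reverse=True)
    head_size = max(1, int(len(counts) * head_fraction))
    return sum(counts[:head_size]) / sum(counts)


# Hypothetical log: one blockbuster consumed 10 times, nine niche items twice each.
interactions = ["hit"] * 10
for i in range(9):
    interactions += [f"niche{i}"] * 2

share = head_share(interactions, head_fraction=0.1)
print(share)      # 10 of 28 interactions go to the single head item
print(1 - share)  # the combined tail accounts for the remaining 18 of 28
```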

## Churn

It's the measure of how often recommendations change for a given user.

It can also measure the sensitivity of the recommender system to change in user behaviour.

It measures the impact of change on the recommendation list.

At the low end of the churn scale we have staleness. Showing the same recs to a user may not be a great idea if the items are never clicked. Randomisation can help, but even in this instance we need to strike the right balance.
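One simple way to quantify churn, sketched here as an assumption rather than a standard definition, is the fraction of a user's current list that was not in their previous list.

```python
def churn(previous, current):
    """Fraction of the current recommendations that are new since last time."""
    if not current:
        return 0.0
    seen_before = set(previous)
    new_items = sum(1 for item in current if item not in seen_before)
    return new_items / len(current)


# Hypothetical lists for one user on two consecutive days.
yesterday = ["a", "b", "c", "d"]
today = ["a", "b", "e", "f"]

print(churn(yesterday, today))      # 0.5, half of the list changed
print(churn(yesterday, yesterday))  # 0.0, a completely stale list
```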

## Responsiveness

How quickly new user behaviour impacts the recommendations.

A highly responsive recommender system reflects changes instantly.

Highly responsive recommenders are complex, hard to maintain, and expensive.

As always, we need to strike the right balance based on the business needs.

## A/B Testing

Tune the recommender by putting it in front of real users.

### The surrogate problem

An accurately predicted rating does not necessarily translate into good recommendations.

## Perceived quality

It's a qualitative metric, measured by asking users whether they think the recommendations are good.

The challenge is that you may not get enough data out of it, because it relies on users being willing to respond to, for example, an online poll.