# Evaluation of recommender systems

In this tutorial, we explain how to evaluate recommender systems trained on implicit feedback
using the hold-out method.

In [1]:
import numpy as np
import scipy.sparse as sps

from irspack.dataset.movielens import MovieLens1MDataManager
from irspack.recommenders import P3alphaRecommender, TopPopRecommender
from irspack.split import rowwise_train_test_split
from irspack.evaluator import Evaluator

## Read the ML1M dataset again.

As in the previous tutorial, we load the rating dataset and construct a sparse matrix.

In [2]:
loader = MovieLens1MDataManager()

df = loader.read_interaction()

unique_user_ids, user_index = np.unique(df.userId, return_inverse=True)
unique_movie_ids, movie_index = np.unique(df.movieId, return_inverse=True)

X = sps.csr_matrix(
    (
        np.ones(df.shape[0]),
        (user_index, movie_index)
    )
)
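
To make the index mapping concrete, here is a minimal sketch on a made-up interaction log (the IDs below are invented for illustration and are not taken from ML1M). It shows how `np.unique(..., return_inverse=True)` turns arbitrary raw IDs into contiguous row and column indices:

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical raw IDs: four interactions by three distinct users.
user_ids = np.array([10, 42, 10, 7])
item_ids = np.array([500, 500, 300, 900])

# `return_inverse` gives, for each interaction, the position of its ID
# in the sorted array of unique IDs, i.e. a contiguous 0-based index.
unique_users, toy_user_index = np.unique(user_ids, return_inverse=True)
unique_items, toy_item_index = np.unique(item_ids, return_inverse=True)

toy_X = sps.csr_matrix(
    (np.ones(len(user_ids)), (toy_user_index, toy_item_index)),
    shape=(len(unique_users), len(unique_items)),
)
print(unique_users)     # sorted raw user IDs; row i of toy_X belongs to unique_users[i]
print(toy_X.toarray())
```

Row `i` of the matrix corresponds to `unique_users[i]`, so the unique-ID arrays are what map recommendations back to the original IDs.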

## Split scheme 1. Hold-out for all users.

To evaluate a recommender system trained on implicit feedback, the standard method is to hide a subset of the known user-item interactions as a validation set and see how the recommender ranks these hidden ground-truth items:

![Perform hold out for all users.](./split1.png "split1")

We have prepared a fast implementation of such a split (with random selection of the held-out subset) in the ``rowwise_train_test_split`` function:

In [3]:
X_train, X_valid = rowwise_train_test_split(X, test_ratio=0.2, random_seed=0)

assert X_train.shape == X_valid.shape
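
For intuition, the idea behind a row-wise hold-out can be sketched in plain NumPy/SciPy as follows. This is not the library's implementation: `naive_rowwise_split` is a hypothetical helper, and its rounding rule (round down per row) is an assumption that may differ from what ``rowwise_train_test_split`` does.

```python
import numpy as np
import scipy.sparse as sps


def naive_rowwise_split(X, test_ratio=0.2, seed=0):
    """Hypothetical helper: hide about `test_ratio` of each row's non-zeros."""
    X = sps.csr_matrix(X)
    rng = np.random.default_rng(seed)
    is_test = np.zeros(X.nnz, dtype=bool)
    for row in range(X.shape[0]):
        start, end = X.indptr[row], X.indptr[row + 1]
        # Rounding down is an assumption; the library may round differently.
        n_test = int((end - start) * test_ratio)
        if n_test == 0:
            continue
        hidden = rng.choice(np.arange(start, end), size=n_test, replace=False)
        is_test[hidden] = True

    # Row index of every stored element, in the same order as X.data.
    rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))

    def build(mask):
        return sps.csr_matrix(
            (X.data[mask], (rows[mask], X.indices[mask])), shape=X.shape
        )

    return build(~is_test), build(is_test)


# Toy matrix: users with 10, 5, and 0 interactions.
toy = sps.csr_matrix(np.array([[1.0] * 10, [1.0] * 5 + [0.0] * 5, [0.0] * 10]))
toy_train, toy_valid = naive_rowwise_split(toy, test_ratio=0.2, seed=0)
print(toy_valid.getnnz(axis=1))  # number of held-out items per user
```

Because every stored element is assigned to exactly one side, the two outputs have disjoint non-zeros and sum to the input.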

They sum back to the original matrix:

In [4]:
X - (X_train + X_valid) # 0 stored elements

<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

There is no overlap of non-zero elements:

In [5]:
X_train.multiply(X_valid) # Element-wise multiplication yields 0 stored elements

<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

This scheme, however, has a performance problem: we have to compute recommendation scores for all users. So in the next tutorial, we will use a sub-sampled version of this split.

## Obtain the evaluation metric

Now we define the `Evaluator` object, which will measure the performance of various recommender systems based on ``X_valid`` (the meaning of ``offset=0`` will be clarified
in the next tutorial).

In [6]:
evaluator = Evaluator(X_valid, offset=0)

We fit ``P3alphaRecommender`` using ``X_train``, and compute its accuracy metrics
against ``X_valid`` using `evaluator`.

Internally, `evaluator` calls the recommender's ``get_score_remove_seen`` method, sorts the scores to obtain each user's ranking, and compares that ranking with the stored validation interactions.

In [7]:
recommender = P3alphaRecommender(X_train, top_k=100)
recommender.learn()

evaluator.get_scores(recommender, cutoffs=[5, 10])

OrderedDict([('hit@5', 0.7935430463576159),
             ('recall@5', 0.09357329074754896),
             ('ndcg@5', 0.39936130766343303),
             ('map@5', 0.06671765815079535),
             ('precision@5', 0.37695364238410584),
             ('gini_index@5', 0.9737846091714528),
             ('entropy@5', 4.791553116972738),
             ('appeared_item@5', 545.0),
             ('hit@10', 0.8966887417218543),
             ('recall@10', 0.1526165083493217),
             ('ndcg@10', 0.3697677997781819),
             ('map@10', 0.09057228932779285),
             ('precision@10', 0.3239072847682119),
             ('gini_index@10', 0.9613104845194654),
             ('entropy@10', 5.199571410742589),
             ('appeared_item@10', 771.0)])
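
The score-sort-compare procedure described above can be sketched on dense toy arrays. This is a minimal illustration rather than the `Evaluator` code: `hit_and_recall_at_k` is a hypothetical helper, and it uses the textbook definitions of hit@k and recall@k, which are assumed here and may differ in detail from the library's.

```python
import numpy as np


def hit_and_recall_at_k(scores, seen, ground_truth, k):
    """Hypothetical helper: mean hit@k and recall@k over users with ground truth."""
    # Already-seen items get -inf so that they can never be recommended.
    masked = np.where(seen, -np.inf, scores.astype(float))
    # Indices of the k highest-scoring items for each user.
    top_k = np.argsort(-masked, axis=1)[:, :k]
    # Number of held-out items that appear in each user's top-k list.
    hits = np.take_along_axis(ground_truth, top_k, axis=1).sum(axis=1)
    n_relevant = ground_truth.sum(axis=1)
    has_gt = n_relevant > 0
    hit = (hits[has_gt] > 0).mean()
    recall = (hits[has_gt] / n_relevant[has_gt]).mean()
    return hit, recall


toy_scores = np.array([[0.9, 0.8, 0.1, 0.7, 0.2],
                       [0.1, 0.2, 0.3, 0.4, 0.5]])
toy_seen = np.array([[True, False, False, False, False],
                     [False, False, False, False, True]])
toy_truth = np.array([[False, True, True, False, False],
                      [True, False, False, False, False]])

toy_hit, toy_recall = hit_and_recall_at_k(toy_scores, toy_seen, toy_truth, k=2)
print(toy_hit, toy_recall)  # 0.5 0.25
```

The first user's top-2 list contains one of two held-out items, and the second user's contains none, which gives the averages printed above.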

## Comparison with a simple baseline

Now that we have a quantitative way to measure the recommenders' performance,
we can compare the performance of different algorithms.

As a simple baseline, we fit ``TopPopRecommender``, which recommends items
in descending order of their popularity in the train set, regardless of the users'
watch history. (Note, however, that items a user has already seen will not be recommended again.)

In [8]:
toppop_recommender = TopPopRecommender(X_train)
toppop_recommender.learn()

evaluator.get_scores(toppop_recommender, cutoffs=[5, 10])

OrderedDict([('hit@5', 0.5319536423841059),
             ('recall@5', 0.03990291401116996),
             ('ndcg@5', 0.21637600397711287),
             ('map@5', 0.025107597083713212),
             ('precision@5', 0.20649006622516558),
             ('gini_index@5', 0.997223564436407),
             ('entropy@5', 2.6085822508131637),
             ('appeared_item@5', 43.0),
             ('hit@10', 0.6614238410596026),
             ('recall@10', 0.06688199444797102),
             ('ndcg@10', 0.20042437187888076),
             ('map@10', 0.03338586929172107),
             ('precision@10', 0.18096026490066222),
             ('gini_index@10', 0.9950121246019521),
             ('entropy@10', 3.179463712641221),
             ('appeared_item@10', 65.0)])
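
The popularity baseline itself fits in a few lines. The sketch below is an illustration under assumptions, not the `TopPopRecommender` source: `toppop_scores` is a hypothetical helper that builds a dense score matrix, so it is only suitable for small toy data.

```python
import numpy as np
import scipy.sparse as sps


def toppop_scores(X_train):
    """Hypothetical helper: the same popularity score for every user, seen items masked."""
    X_train = sps.csr_matrix(X_train)
    # Popularity = number of users who interacted with each item in the train set.
    popularity = X_train.getnnz(axis=0).astype(float)
    scores = np.tile(popularity, (X_train.shape[0], 1))
    # Items a user has already seen must not be recommended again.
    scores[X_train.toarray() != 0] = -np.inf
    return scores


toy_interactions = sps.csr_matrix(np.array([[1, 1, 0, 0],
                                            [1, 0, 1, 0],
                                            [1, 1, 0, 0]]))
toy_top1 = np.argmax(toppop_scores(toy_interactions), axis=1)
print(toy_top1)  # [2 1 2]
```

Every user starts from the same ranking, so the lists differ only through the removal of seen items. That is consistent with the small `appeared_item` counts in the output above.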

So we see that `P3alphaRecommender` achieves better accuracy scores than the rather trivial `TopPopRecommender` (for example, an ndcg@10 of about 0.370 versus 0.200).

In the next tutorial, we will optimize the recommender's performance using the hold-out method.