<i>Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.</i>

# Estimating Baseline Performance
<br>
Estimating baseline performance is as important as choosing the right metrics for model evaluation. In this notebook, we briefly discuss why we care about baseline performance and how to measure it.

The notebook covers two example scenarios in the context of movie recommendation: 1) rating prediction and 2) top-k recommendation.

### Why does baseline performance matter? 
<br>
Before we dive into baseline performance estimation, it is worth thinking about why we need it.

As the word 'baseline' suggests, <b>baseline performance</b> is the minimum performance we expect a model to achieve, or a starting point for model comparisons.

Once we train a model and get results from the evaluation metrics we chose, we will wonder how to interpret those numbers, or even whether the trained model is better than a simple rule-based model. Baseline results help us answer these questions.

Let's say we are building a food recommender. We evaluated the model on the test set and got nDCG (at 10) = 0.3. At that moment, we would not know if the model is good or bad. But once we find out that a simple rule of <i>'recommending top-10 most popular foods to all users'</i> can achieve nDCG = 0.4, we see that our model is not good enough. Maybe the model is not trained well, or maybe we should think about whether nDCG is the right metric for predicting user behavior in the given problem.

### How can we estimate the baseline performance?
<br>
To estimate the baseline performance, we first pick a baseline model and evaluate it with the same evaluation metrics we will use for our main model. In general, a very simple rule or even the <b>zero rule</b> (<i>predict the mean for regression or the mode for classification</i>) is enough as a baseline model. Random prediction might be acceptable for certain problems, but it usually performs worse than the zero rule. If we already have a running model and are trying to improve it, we can use its results as the baseline performance.
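
To make the zero rule concrete, here is a minimal sketch using only the Python standard library. The helper names (`zero_rule_regression`, `zero_rule_classification`) and the toy data are made up for illustration and are not part of `reco_utils`.

```python
from collections import Counter
from statistics import mean


def zero_rule_regression(train_targets):
    """Return a predictor that ignores its input and outputs the training mean."""
    constant = mean(train_targets)
    return lambda _features=None: constant


def zero_rule_classification(train_labels):
    """Return a predictor that ignores its input and outputs the most frequent label."""
    mode_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _features=None: mode_label


# Toy data, invented for this example only
predict_rating = zero_rule_regression([4.0, 3.0, 5.0, 4.0])
predict_label = zero_rule_classification(["like", "dislike", "like"])

print(predict_rating())  # mean of the four toy ratings
print(predict_label())   # most frequent toy label
```

Both predictors are constant, so any model worth deploying should beat them on the chosen metric.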

Most importantly, <b>different baseline approaches should be taken for different problems and business goals</b>. For example, recommending previously purchased items could be used as a baseline model for food or restaurant recommendation, since people tend to eat the same foods repeatedly. For TV show and/or movie recommendation, on the other hand, recommending previously watched items does not make sense. Recommending the most popular (most-watched or highly rated) items is more likely to be useful as a baseline there.

In this notebook, we demonstrate how to estimate the baseline performance for movie recommendation with the MovieLens dataset. We use the mean for rating prediction, i.e., our baseline model predicts a user's rating of a movie by averaging the ratings the user previously submitted for other movies. For the top-k recommendation problem, we use the top-k most-rated movies as the baseline model. We choose the number of ratings here because we regard the binary signal of 'rated vs. not-rated' as the user's implicit preference when evaluating ranking metrics.

Now, let's jump into the implementation!

In [1]:
import sys
sys.path.append("../../")

import itertools
import pandas as pd

from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import (
    rmse, mae, rsquared, exp_var,
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k
)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.0 | packaged by conda-forge | (default, Feb  9 2017, 14:54:13) [MSC v.1900 64 bit (AMD64)]
Pandas version: 0.23.4


First, let's prepare the training and test data sets.

In [2]:
MOVIELENS_DATA_SIZE = '100k'

In [3]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['UserId', 'MovieId', 'Rating', 'Timestamp']
)

data.head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [4]:
train, test = python_random_split(data, ratio=0.75, seed=123)

### 1. Rating prediction baseline

As we discussed earlier, we use each user's mean rating as the baseline prediction.

In [5]:
# Calculate avg ratings from the training set
users_ratings = train.groupby(['UserId'])['Rating'].mean()
users_ratings = users_ratings.to_frame().reset_index()
users_ratings.rename(columns = {'Rating': 'AvgRating'}, inplace = True)

users_ratings.head()

Unnamed: 0,UserId,AvgRating
0,1,3.655172
1,2,3.711111
2,3,2.756757
3,4,4.111111
4,5,2.93985


In [6]:
# Generate prediction for the test set
baseline_predictions = pd.merge(test, users_ratings, on=['UserId'], how='inner')

baseline_predictions.loc[baseline_predictions['UserId'] == 1].head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp,AvgRating
13289,1,168,5.0,874965478,3.655172
13290,1,101,2.0,878542845,3.655172
13291,1,127,5.0,874965706,3.655172
13292,1,117,3.0,874965739,3.655172
13293,1,61,4.0,878542420,3.655172
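
Note that the `how='inner'` merge keeps only test rows whose user also appears in the training set. Every MovieLens 100k user has at least 20 ratings, so a random 75/25 split is unlikely to leave a user without training data, but on sparser data some test rows could be dropped silently. One common workaround is to fall back to the global mean for unseen users. The sketch below illustrates that idea in plain Python; `fit_user_means` and `predict_rating` are hypothetical helpers and the toy rows are invented.

```python
from collections import defaultdict


def fit_user_means(train_rows):
    """Compute per-user mean ratings and the global mean.

    train_rows: iterable of (user_id, item_id, rating) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user_id, _item_id, rating in train_rows:
        sums[user_id] += rating
        counts[user_id] += 1
    user_means = {user: sums[user] / counts[user] for user in sums}
    global_mean = sum(sums.values()) / sum(counts.values())
    return user_means, global_mean


def predict_rating(user_id, user_means, global_mean):
    """Use the user's own mean if known, otherwise the global mean."""
    return user_means.get(user_id, global_mean)


# Toy training data, invented for this example only
toy_train = [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0)]
user_means, global_mean = fit_user_means(toy_train)

print(predict_rating(1, user_means, global_mean))  # known user: own mean
print(predict_rating(3, user_means, global_mean))  # unseen user: global mean
```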


Now, let's evaluate how our baseline model performs on regression metrics.

In [7]:
# Keep only the columns needed as predictions; the true ratings come from `test`
baseline_predictions = baseline_predictions[['UserId', 'MovieId', 'AvgRating']]

cols = {
    'col_user': 'UserId',
    'col_item': 'MovieId',
    'col_rating': 'Rating',
    'col_prediction': 'AvgRating',
}

eval_rmse = rmse(test, baseline_predictions, **cols)
eval_mae = mae(test, baseline_predictions, **cols)
eval_rsquared = rsquared(test, baseline_predictions, **cols)
eval_exp_var = exp_var(test, baseline_predictions, **cols)

print("RMSE:\t\t%f" % eval_rmse,
      "MAE:\t\t%f" % eval_mae,
      "rsquared:\t%f" % eval_rsquared,
      "exp var:\t%f" % eval_exp_var, sep='\n')

RMSE:		1.054252
MAE:		0.846033
rsquared:	0.136435
exp var:	0.136446


As you can see, our baseline model performs reasonably well on these metrics. For example, the MAE (Mean Absolute Error) was 0.846033 on MovieLens 100k data, meaning that users' actual ratings differed from their own mean rating by about 0.85 on average. This also suggests that users' ratings could be biased: some users tend to give high ratings to all movies while others tend to give low ratings.

The next time we build a machine-learning model, we will want it to perform better than this baseline.
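
To see what these error metrics measure, the following sketch computes MAE and RMSE by hand for the five test ratings of user 1 shown earlier (5, 2, 5, 3, 4) against that user's training mean of 3.655172. The function names are hypothetical and deliberately differ from the `reco_utils` evaluators, whose internals may not match this textbook definition exactly.

```python
from math import sqrt


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all pairs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalizes large misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# User 1's test ratings and constant baseline prediction from the tables above
actual = [5.0, 2.0, 5.0, 3.0, 4.0]
predicted = [3.655172] * len(actual)

user_mae = mean_absolute_error(actual, predicted)
user_rmse = root_mean_squared_error(actual, predicted)
print("MAE:  %f" % user_mae)
print("RMSE: %f" % user_rmse)
```

MAE is an average, not a bound: individual errors in this sample range from roughly 0.34 to 1.66. RMSE is never smaller than MAE on the same data.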

### 2. Top-k recommendation baseline

Recommending the most popular items is an intuitive and simple approach that works for many recommendation scenarios. Here, we use the top-k most-rated movies as the baseline model, as we discussed earlier.
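
Before the pandas implementation, the idea can be sketched in a few lines of plain Python: count how often each item was rated, drop the items a user has already seen, and keep the first k. `recommend_most_rated` is a hypothetical helper and the toy interactions are invented for illustration.

```python
from collections import Counter


def recommend_most_rated(train_pairs, user_id, k):
    """Return up to k of the most-rated items that user_id has not rated yet.

    train_pairs: list of (user_id, item_id) interactions from the training set.
    """
    popularity = Counter(item for _user, item in train_pairs)
    seen = {item for user, item in train_pairs if user == user_id}
    ranked = [item for item, _count in popularity.most_common() if item not in seen]
    return ranked[:k]


# Toy interactions: item "A" has 3 ratings, "B" has 2, "C" has 1
toy_pairs = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B"), (3, "C")]

print(recommend_most_rated(toy_pairs, user_id=1, k=2))  # user 1 has seen A and B
print(recommend_most_rated(toy_pairs, user_id=3, k=2))  # user 3 has seen A and C
```

The ranking is the same for every user; only the filter for seen items makes the final lists differ.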

In [8]:
item_counts = train['MovieId'].value_counts().to_frame().reset_index()
item_counts.columns = ['MovieId', 'Count']
item_counts.head()

Unnamed: 0,MovieId,Count
0,50,426
1,100,396
2,258,390
3,181,384
4,286,363


In [9]:
user_item_col = ['UserId', 'MovieId']

# Cross join users and items
test_users = test['UserId'].unique()
user_item_list = list(itertools.product(test_users, item_counts['MovieId']))
users_items = pd.DataFrame(user_item_list, columns=user_item_col)

print("Number of user-item pairs:", len(users_items))

# Remove seen items (items in the train set) as we will not recommend those again to the users
users_items_remove_seen = users_items.loc[
    ~users_items.set_index(user_item_col).index.isin(train.set_index(user_item_col).index)
]

print("After remove seen items:", len(users_items_remove_seen))


Number of user-item pairs: 1547463
After remove seen items: 1472463


In [11]:
# Generate recommendations
baseline_recommendations = pd.merge(item_counts, users_items_remove_seen, on=['MovieId'], how='inner')
baseline_recommendations.head()

Unnamed: 0,MovieId,Count,UserId
0,50,426,600
1,50,426,607
2,50,426,697
3,50,426,774
4,50,426,666


In [12]:
k = 10

# Rank items by popularity: use the rating count as the prediction score
cols['col_prediction'] = 'Count'

eval_map = map_at_k(test, baseline_recommendations, k=k, **cols)
eval_ndcg = ndcg_at_k(test, baseline_recommendations, k=k, **cols)
eval_precision = precision_at_k(test, baseline_recommendations, k=k, **cols)
eval_recall = recall_at_k(test, baseline_recommendations, k=k, **cols)

print("MAP:\t%f" % eval_map,
      "NDCG@K:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

MAP:	0.052850
NDCG@K:	0.248061
Precision@K:	0.223754
Recall@K:	0.108826


Again, the baseline is fairly strong: nDCG@10 = 0.248061 and Precision@10 = 0.223754. Now we will want to put more effort into building a machine-learning model that performs better than this ;-)
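
For intuition about the ranking score, here is a minimal sketch of nDCG@k for a single user with binary relevance (an item counts as relevant if it appears in the user's test set). This is the common textbook formulation; `ndcg_at_k_binary` is a hypothetical helper, and the `reco_utils` implementation may differ in details such as tie handling and averaging across users.

```python
from math import log2


def ndcg_at_k_binary(recommended, relevant, k):
    """nDCG@k for one user, treating every relevant item as gain 1."""
    hits = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    # Discount each hit by log2(position + 1), with positions starting at 1
    dcg = sum(hit / log2(rank + 2) for rank, hit in enumerate(hits))
    # Ideal ranking places all relevant items (up to k) at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: two relevant items, found at positions 1 and 3
score = ndcg_at_k_binary(["A", "B", "C"], relevant={"A", "C"}, k=3)
perfect = ndcg_at_k_binary(["A", "C", "B"], relevant={"A", "C"}, k=3)
print("nDCG@3: %f" % score)
print("nDCG@3 (perfect order): %f" % perfect)
```

A dataset-level score is then typically the mean of the per-user values.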

In [13]:
if is_jupyter():
    # Record results with papermill for unit-tests
    import papermill as pm
    pm.record("map", eval_map)
    pm.record("ndcg", eval_ndcg)
    pm.record("precision", eval_precision)
    pm.record("recall", eval_recall)
    pm.record("rmse", eval_rmse)
    pm.record("mae", eval_mae)
    pm.record("exp_var", eval_exp_var)
    pm.record("rsquared", eval_rsquared)

### References

[[1](https://dl.acm.org/citation.cfm?id=1401944)] Yehuda Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, KDD '08 pp. 426-434 2008.  
[[2](https://surprise.readthedocs.io/en/stable/basic_algorithms.html)] Surprise lib, Basic algorithms