# Estimating Baseline Performance
<br>
Estimating baseline performance is as important as choosing the right metrics for model evaluation. In this notebook, we briefly discuss why we should estimate baseline performance and how to do so.

The notebook covers two example scenarios under the movie recommendation context: 1) rating prediction and 2) top-k recommendation.

### Why does baseline matter? 
<br>
Before we dive into baseline performance estimation, it is worth thinking about why the baseline matters. Once we train a model and get results from the evaluation metrics we chose, we will wonder how to interpret the numbers, or whether the trained model is at least doing a better job than a simple rule-based system. Estimating baseline performance helps us answer those questions.

Let's say we are building a machine-learning model for food recommendation. We evaluated the model on the test set and got nDCG (at 10) = 0.30. At this point, we don't know whether the number is good or bad. But once we find that a simple rule of <i>'recommending the top-10 most popular foods to all users'</i> achieves nDCG = 0.35, we can say that our model is not good enough, or perhaps reconsider whether nDCG is the right metric for predicting user behavior in the given problem.
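To make this comparison concrete, here is a minimal sketch of binary-relevance nDCG@k for a single user. The helper name `ndcg_at_k_single` and the food lists are made up for illustration; this is not the `reco_utils` implementation, which works on DataFrames and averages over users.

```python
import math

def ndcg_at_k_single(recommended, relevant, k=10):
    """Binary-relevance nDCG@k for one user (illustrative helper, not reco_utils)."""
    relevant = set(relevant)
    # DCG: a hit at 0-based rank i contributes 1 / log2(i + 2)
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (up to k) placed at the top of the list
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (made up): what one user actually ate in the test period
eaten = ["pizza", "ramen"]
model_recs = ["salad", "sushi", "pizza"]     # one hit, at rank 3
popular_recs = ["pizza", "burger", "ramen"]  # hits at ranks 1 and 3

model_score = ndcg_at_k_single(model_recs, eaten, k=3)
popular_score = ndcg_at_k_single(popular_recs, eaten, k=3)
print(model_score, popular_score)
```

In this toy case the popularity list scores higher than the model's list, which is exactly the situation in which a baseline tells us the model needs more work.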

### How can we estimate the baseline performance?
<br>
Different problems call for different approaches, and the business goals should be taken into account as well. For example, random guessing would be one approach for certain problems, while recommending a user's previous purchases could make sense for foods or restaurants, since many people tend to eat the same foods again. For TV or movie recommendation, on the other hand, recommending previously watched items does not make sense. In that case, providing a list of the most popular items is more likely to be useful.

In the movie recommendation scenario, we can use a user's average rating as the baseline for rating prediction. In other words, our baseline model predicts a user's rating for a movie as the average of her previous ratings. For the top-k recommendation problem, we will use the top-k most popular movies as the recommendation baseline.

In [1]:
import sys
sys.path.append("../../")

import itertools
import pandas as pd

from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import (
    rmse, mae, rsquared, exp_var,
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k
)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

First, let's prepare training and test data sets. 

In [2]:
MOVIELENS_DATA_SIZE = '100k'

In [3]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['UserId', 'MovieId', 'Rating', 'Timestamp']
)

data.head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [5]:
train, test = python_random_split(data, ratio=0.75, seed=123)

### 1. Rating prediction baseline

As we discussed earlier, we use each user's average rating as the baseline prediction of the rating for new movies the user will watch.

In [6]:
# Calculate avg ratings from the training set
users_ratings = train.groupby(['UserId'])['Rating'].mean()
users_ratings = users_ratings.to_frame().reset_index()
users_ratings.rename(columns = {'Rating': 'AvgRating'}, inplace = True)

users_ratings.head()

Unnamed: 0,UserId,AvgRating
0,1,3.655172
1,2,3.711111
2,3,2.756757
3,4,4.111111
4,5,2.93985


In [7]:
# Generate prediction for the test set
baseline_predictions = pd.merge(test, users_ratings, on=['UserId'], how='inner')

# Let's see what it looks like
baseline_predictions.loc[baseline_predictions['UserId'] == 1].head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp,AvgRating
13289,1,168,5.0,874965478,3.655172
13290,1,101,2.0,878542845,3.655172
13291,1,127,5.0,874965706,3.655172
13292,1,117,3.0,874965739,3.655172
13293,1,61,4.0,878542420,3.655172
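
Note that the inner merge above keeps only the test rows whose user also appears in the training set. The following is a minimal sketch, on made-up toy data, of one possible way to keep every test row: use a left merge and fall back to the global mean rating for unseen users. The names `train_toy`, `test_toy`, and `preds_toy` are hypothetical, and the global-mean fallback is an assumption rather than something this notebook does.

```python
import pandas as pd

# Toy frames (made up) that mimic the notebook's column names
train_toy = pd.DataFrame({
    "UserId": [1, 1, 2],
    "MovieId": [10, 11, 10],
    "Rating": [4.0, 2.0, 5.0],
})
test_toy = pd.DataFrame({
    "UserId": [1, 3],  # user 3 never appears in train_toy
    "MovieId": [12, 10],
    "Rating": [3.0, 4.0],
})

# Per-user average rating computed from the training data only
user_means = (
    train_toy.groupby("UserId")["Rating"].mean()
    .rename("AvgRating")
    .reset_index()
)
global_mean = train_toy["Rating"].mean()

# A left merge keeps every test row; unseen users get NaN, then the global mean
preds_toy = test_toy.merge(user_means, on="UserId", how="left")
preds_toy["AvgRating"] = preds_toy["AvgRating"].fillna(global_mean)
print(preds_toy)
```

Whether such a fallback matters depends on the split: it only changes the result when some test users are missing from the training set.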


Now, let's evaluate how our baseline model performs on regression metrics.

In [9]:
baseline_predictions = baseline_predictions[['UserId', 'MovieId', 'AvgRating']]

cols = {
    'col_user': 'UserId',
    'col_item': 'MovieId',
    'col_rating': 'Rating',
    'col_prediction': 'AvgRating',
}

eval_rmse = rmse(test, baseline_predictions, **cols)
eval_mae = mae(test, baseline_predictions, **cols)
eval_rsquared = rsquared(test, baseline_predictions, **cols)
eval_exp_var = exp_var(test, baseline_predictions, **cols)

print("RMSE:\t\t%f" % eval_rmse,
      "MAE:\t\t%f" % eval_mae,
      "rsquared:\t%f" % eval_rsquared,
      "exp var:\t%f" % eval_exp_var, sep='\n')

RMSE:		1.054252
MAE:		0.846033
rsquared:	0.136435
exp var:	0.136446


As you can see, our baseline model performs reasonably well on the error metrics. For example, the MAE (Mean Absolute Error) was 0.846033 on the MovieLens 100k data, meaning that a user's ratings deviate from the average of her previous ratings by about 0.85 on average. The R-squared of 0.136435, on the other hand, shows that the baseline explains only a small share of the variance in ratings. Note that a low error does not make this a good model for a recommendation system. It is practically useless for ranking, because it gives the same rating to every movie for a given user.

Nevertheless, the baseline still tells us that any machine-learning model should perform better than it; otherwise the model is no better than a naive guess.
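To make the MAE and RMSE figures more tangible, the sketch below computes both by hand for a constant per-user prediction. The ratings and the prediction value are made up, and the variable names are hypothetical; the point is only that MAE is an average error, not a bound on individual errors.

```python
import math

# Toy ratings for one user (made up) and a constant prediction, as in the baseline
actual = [5.0, 2.0, 5.0, 3.0, 4.0]
predicted = [3.8] * len(actual)

errors = [a - p for a, p in zip(actual, predicted)]
mae_value = sum(abs(e) for e in errors) / len(errors)
rmse_value = math.sqrt(sum(e * e for e in errors) / len(errors))

# Share of ratings that fall within one star of the prediction
within_one = sum(abs(e) <= 1.0 for e in errors) / len(errors)
print(mae_value, rmse_value, within_one)
```

Here the MAE is close to 1, yet fewer than half of the individual ratings lie within one star of the prediction, and RMSE is larger than MAE because it penalizes the bigger misses more heavily.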

### 2. Top-k recommendation baseline

Recommending the most popular items is a simple and intuitive method that works for many recommendation scenarios. We do the same thing here. You can either average the ratings over all users for each movie and select the top-k highest-rated movies, or simply count the number of ratings each movie has and select the top-k most-rated movies. Here, we will use the latter approach because we are not predicting ratings, but rather whether a user will watch a movie or not.
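The two notions of popularity can produce quite different rankings. The sketch below uses a made-up toy DataFrame (`ratings_toy` is a hypothetical name) to rank movies by average rating and by number of ratings.

```python
import pandas as pd

# Toy ratings (made up) using the notebook's column names
ratings_toy = pd.DataFrame({
    "UserId":  [1, 2, 3, 1, 2, 1],
    "MovieId": [10, 10, 10, 20, 20, 30],
    "Rating":  [3.0, 3.0, 4.0, 4.0, 5.0, 5.0],
})

# Average rating and number of ratings for each movie
per_movie = ratings_toy.groupby("MovieId")["Rating"].agg(["mean", "count"])

top_by_mean = per_movie.sort_values("mean", ascending=False).index.tolist()
top_by_count = per_movie.sort_values("count", ascending=False).index.tolist()

print(top_by_mean)   # ranks by average rating
print(top_by_count)  # ranks by number of ratings
```

In this toy example, movie 30 tops the average-rating list with a single rating, while movie 10 tops the count-based list because the most users rated it.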

In [19]:
item_counts = train['MovieId'].value_counts().to_frame().reset_index()
item_counts.columns = ['MovieId', 'Count']
item_counts.head()

Unnamed: 0,MovieId,Count
0,50,426
1,100,396
2,258,390
3,181,384
4,286,363


In [27]:
test_users = test['UserId'].unique()

user_item_col = ['UserId', 'MovieId']


In [25]:
# Cross join users and items
user_item_list = list(itertools.product(test_users, item_counts['MovieId']))
users_items = pd.DataFrame(user_item_list, columns=user_item_col)

users_items.head()

Unnamed: 0,UserId,MovieId
0,600,50
1,600,100
2,600,258
3,600,181
4,600,286


In [28]:
# Remove seen items (items in the train set) -- TODO check if seen items are actually removed.
users_items_remove_seen = users_items.loc[
    ~users_items.set_index(user_item_col).index.isin(train.set_index(user_item_col).index)
]

# Generate recommendations
baseline_recommendations = pd.merge(item_counts, users_items_remove_seen, on=['MovieId'], how='inner')
baseline_recommendations.head()

Unnamed: 0,MovieId,Count,UserId
0,50,426,600
1,50,426,607
2,50,426,697
3,50,426,774
4,50,426,666
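
The TODO in the cell above asks whether seen items are actually removed. One possible check, sketched here on made-up toy data, is to inner-merge the filtered candidates with the training pairs and confirm that nothing overlaps. The names `train_toy`, `candidates`, and `unseen` are hypothetical.

```python
import pandas as pd

# Toy data (made up) using the notebook's column names
train_toy = pd.DataFrame({"UserId": [1, 1, 2], "MovieId": [10, 20, 10]})
candidates = pd.DataFrame({
    "UserId":  [1, 1, 1, 2, 2, 2],
    "MovieId": [10, 20, 30, 10, 20, 30],
})
key_cols = ["UserId", "MovieId"]

# Same filtering idea as in the cell above: drop pairs present in train
unseen = candidates.loc[
    ~candidates.set_index(key_cols).index.isin(train_toy.set_index(key_cols).index)
]

# Check: an inner merge with train should find no overlapping pairs
overlap = unseen.merge(train_toy, on=key_cols, how="inner")
print(len(overlap))
```

The same merge could be applied to `users_items_remove_seen` and `train`; an empty result would indicate that no seen pair survived the filter.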


In [29]:
k = 10

cols['col_prediction'] = 'Count'

eval_map = map_at_k(test, baseline_recommendations, k=k, **cols)
eval_ndcg = ndcg_at_k(test, baseline_recommendations, k=k, **cols)
eval_precision = precision_at_k(test, baseline_recommendations, k=k, **cols)
eval_recall = recall_at_k(test, baseline_recommendations, k=k, **cols)

print("MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

MAP:	0.052850
NDCG:	0.248061
Precision@K:	0.223754
Recall@K:	0.108826


Again, the baseline is quite high: nDCG@10 = 0.248061 and Precision@10 = 0.223754.
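To illustrate what Precision@K measures, the sketch below selects the k highest-count movies per user from made-up candidates and counts how many appear in a toy test set. It is a simplified illustration with hypothetical names (`recs_toy`, `test_toy`, `top_k`), not the `reco_utils` implementation, whose handling of edge cases may differ.

```python
import pandas as pd

k = 2
# Toy candidates (made up): popularity count per (user, movie) pair
recs_toy = pd.DataFrame({
    "UserId":  [1, 1, 1, 2, 2, 2],
    "MovieId": [10, 20, 30, 10, 20, 30],
    "Count":   [5, 3, 1, 5, 3, 1],
})
test_toy = pd.DataFrame({"UserId": [1, 2, 2], "MovieId": [20, 10, 20]})

# Keep the k highest-count movies for each user
top_k = (
    recs_toy.sort_values(["UserId", "Count"], ascending=[True, False])
    .groupby("UserId")
    .head(k)
)

# A recommendation is a hit if the (user, movie) pair appears in the test set
hits = top_k.merge(test_toy, on=["UserId", "MovieId"], how="inner")
hits_per_user = (
    hits.groupby("UserId").size()
    .reindex(test_toy["UserId"].unique(), fill_value=0)
)

# Per-user precision is hits / k; the reported figure is the mean over users
precision = (hits_per_user / k).mean()
print(precision)
```

User 1 has one hit out of two recommendations and user 2 has two out of two, so the mean precision at k = 2 is 0.75 for this toy data.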

Now we have the baseline performance. The next time we train a model on the same dataset, we can tell whether it is doing at least better than the naive approach.

In [None]:
if is_jupyter():
    # Record results with papermill for unit-tests
    import papermill as pm
    pm.record("map", eval_map)
    pm.record("ndcg", eval_ndcg)
    pm.record("precision", eval_precision)
    pm.record("recall", eval_recall)
    pm.record("rmse", eval_rmse)
    pm.record("mae", eval_mae)
    pm.record("exp_var", eval_exp_var)
    pm.record("rsquared", eval_rsquared)