# Hyperparameter Optimization

In this tutorial, we first demonstrate how the performance of `P3alphaRecommender` can be optimized
with the [optuna](https://github.com/optuna/optuna)-backed `P3alphaOptimizer`.

Then, by further splitting the ground-truth interactions into train, validation and test sets,
we compare the performance of several recommenders, each optimized on the validation set and measured on the test set.

In [4]:
import numpy as np
import scipy.sparse as sps
from sklearn.model_selection import train_test_split

from irspack.dataset import MovieLens1MDataManager
from irspack import (
    P3alphaRecommender, P3alphaOptimizer, rowwise_train_test_split, Evaluator,
    df_to_sparse
)

## Read the ML1M dataset again.

We again prepare the sparse matrix `X`.

In [5]:
loader = MovieLens1MDataManager()

df = loader.read_interaction()

movies = loader.read_item_info()
movies.head()


| movieId | title | genres | release_year |
|---|---|---|---|
| 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 1995 |
| 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy | 1995 |
| 3 | Grumpier Old Men (1995) | Comedy\|Romance | 1995 |
| 4 | Waiting to Exhale (1995) | Comedy\|Drama | 1995 |
| 5 | Father of the Bride Part II (1995) | Comedy | 1995 |


In [6]:
X, unique_user_ids, unique_movie_ids = df_to_sparse(
    df, 'userId', 'movieId'
)

## Split scheme 2. Hold-out for partial users.

To perform the hyperparameter optimization, we have to repeatedly measure the accuracy metrics on the validation set. As mentioned in the previous tutorial, doing this for all users is time-consuming (often heavier than the recommender's learning process), so we restrict the evaluation to a subset of users as follows:

1. First split **users** into "train", "validation" (and "test") users.
1. For train users, feed all their interactions into the recommender. For validation (test) users, hold out part of their interactions for validation (the "prediction" part), and feed the rest (the "learning" part) into the recommender.
1. After the fit, ask the recommender to output scores only for the validation (test) users, and see how it ranks their held-out interactions.

![Perform hold out for part of users.](./split2.png "split1")

Although we have prepared another function to do this procedure, let us first do this manually.

In [7]:
# Split users into train and validation users.

X_train_user, X_valid_user = train_test_split(X, test_size=.4, random_state=0)

# Split the validation users' interactions into 50% learning and 50% prediction parts.

X_valid_learn, X_valid_predict = rowwise_train_test_split(
    X_valid_user, test_ratio=.5, random_state=0
)
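
To make the row-wise hold-out concrete, here is a minimal, dependency-free sketch of the idea. It uses plain Python lists instead of sparse matrices, and `rowwise_holdout` is a hypothetical helper written for illustration; it is not how `rowwise_train_test_split` is implemented in irspack.

```python
import random


def rowwise_holdout(user_items, test_ratio, seed=0):
    """Split each user's item list into a (learn, predict) pair.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    learn, predict = {}, {}
    for user, items in user_items.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_predict = int(len(shuffled) * test_ratio)  # rounded down
        predict[user] = sorted(shuffled[:n_predict])
        learn[user] = sorted(shuffled[n_predict:])
    return learn, predict


# Toy validation users: user -> IDs of the items they interacted with.
valid_users = {"u1": [1, 2, 3, 4], "u2": [10, 20, 30, 40, 50, 60]}
learn, predict = rowwise_holdout(valid_users, test_ratio=0.5)

for user, items in valid_users.items():
    # The two parts are disjoint and together restore the original row.
    assert set(learn[user]).isdisjoint(predict[user])
    assert sorted(learn[user] + predict[user]) == sorted(items)
```

Each user keeps a "learning" part that the recommender may see and a disjoint "prediction" part that is used only as ground truth.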

## Define the evaluator and optimize the validation metric

As illustrated above, we will use

 * all interactions of the train users (``X_train_user``)
 * 50% of the validation users' interactions (``X_valid_learn``)

as the recommender's training resource, and the rest of the validation users' interactions (``X_valid_predict``) as the held-out ground truth:

In [8]:
X_train_val_learn = sps.vstack([X_train_user, X_valid_learn])
evaluator = Evaluator(X_valid_predict, offset=X_train_user.shape[0], target_metric='ndcg', cutoff=20)

The ``offset`` parameter specifies where the validation user block begins (where the train user block ends).
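
The role of ``offset`` can be illustrated with a toy example. This is only a sketch of the row bookkeeping as we understand it from the description above; the sizes and row labels are made up.

```python
# Toy sizes; in the tutorial these come from X_train_user and X_valid_learn.
n_train_users, n_valid_users = 3, 2

# Rows of the stacked training matrix, labelled by where they came from.
stacked_rows = [f"train_{i}" for i in range(n_train_users)] + [
    f"valid_{i}" for i in range(n_valid_users)
]

offset = n_train_users  # plays the role of offset=X_train_user.shape[0]

# Row i of the ground truth is paired with row (offset + i) of the stacked matrix.
paired = [(i, stacked_rows[offset + i]) for i in range(n_valid_users)]
print(paired)  # [(0, 'valid_0'), (1, 'valid_1')]
```

Because the validation block is stacked below the train block, the evaluator only has to request scores for rows starting at ``offset``.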

Now we can start the optimization.

In [9]:
if True:
    # Suppress the stderr output, as it is too lengthy to show in the documentation.
    # When you run this notebook yourself, you don't have to suppress it.
    from irspack.utils.default_logger import disable_default_handler
    import optuna.logging
    disable_default_handler()
    optuna.logging.disable_default_handler() 
    
optimizer = P3alphaOptimizer(X_train_val_learn, evaluator)
best_params, validation_results = optimizer.optimize(random_seed=0, n_trials=20)

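
Conceptually, the optimizer repeats "suggest parameters, fit, evaluate on the validation set, keep the best" for `n_trials` rounds. The sketch below shows that loop with plain random search, which is a simpler stand-in and not what optuna does by default (its default sampler is TPE). The objective `validation_ndcg` and the search range for `top_k` are hypothetical.

```python
import random


def validation_ndcg(top_k, normalize_weight):
    """Hypothetical stand-in for "fit the recommender and score it on validation"."""
    # Toy surface peaking near top_k=200, with a small bonus for normalization.
    return 0.5 - abs(top_k - 200) / 2000 + (0.01 if normalize_weight else 0.0)


rng = random.Random(0)
best_score, best_params = float("-inf"), None
for _ in range(20):  # plays the role of n_trials=20
    params = {
        "top_k": rng.randint(4, 1000),  # assumed search range
        "normalize_weight": rng.choice([True, False]),
    }
    score = validation_ndcg(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 4))
```

A model-based sampler replaces the random draws with suggestions informed by earlier trials, but the surrounding loop has the same shape.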


The best `ndcg@20` value is


In [10]:
validation_results['ndcg@20'].max()

0.5159628863136182

which has been obtained by using these hyperparameters:

In [11]:
best_params

{'top_k': 217, 'normalize_weight': True}

Meanwhile, the default arguments of ``P3alphaRecommender`` (which we have used so far)
attain `ndcg@20` = 0.408, so this is indeed a significant improvement:

In [12]:
rec_default = P3alphaRecommender(X_train_val_learn).learn()
evaluator.get_score(rec_default)['ndcg']

0.4084060191998281
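
For readers unfamiliar with the metric being optimized, the following sketch computes NDCG@k for a single user. It assumes binary relevance and a log2 position discount, which is the common textbook definition; the evaluator's exact conventions may differ in details.

```python
import math


def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k with binary relevance and a log2 position discount."""
    relevant = set(relevant_items)
    # Discounted gain of the relevant items found in the top-k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Best achievable DCG: all relevant items placed at the top.
    n_ideal = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(n_ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Held-out items are {"a", "c"}; the recommender ranks "a" first and "c" third.
score = ndcg_at_k(["a", "b", "c", "d"], ["a", "c"], k=3)
print(round(score, 4))
```

The reported `ndcg@20` is this kind of per-user value averaged over the validation users, so a score of 1.0 would mean every held-out item is ranked at the very top.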

## Check the recommender's output again

Let us check how our recommender has evolved from the first tutorial. We consider the same setting (a new user has watched "Toy Story"), but fit the 
recommender using the obtained parameters.

In [15]:
rec_tuned = P3alphaRecommender(X, **best_params).learn()

from irspack import ItemIDMapper
id_mapper = ItemIDMapper(unique_movie_ids)

In [17]:
toystory_id = 1
recommended_id_and_score = id_mapper.recommend_for_new_user(
    rec_tuned, user_profile=[toystory_id], cutoff=10
)

# Top-10 recommendations
movies.reindex([movie_id for movie_id, score in recommended_id_and_score])

| movieId | title | genres | release_year |
|---|---|---|---|
| 1265 | Groundhog Day (1993) | Comedy\|Romance | 1993 |
| 2396 | Shakespeare in Love (1998) | Comedy\|Romance | 1998 |
| 3114 | Toy Story 2 (1999) | Animation\|Children's\|Comedy | 1999 |
| 1270 | Back to the Future (1985) | Comedy\|Sci-Fi | 1985 |
| 2028 | Saving Private Ryan (1998) | Action\|Drama\|War | 1998 |
| 34 | Babe (1995) | Children's\|Comedy\|Drama | 1995 |
| 2571 | Matrix, The (1999) | Action\|Sci-Fi\|Thriller | 1999 |
| 356 | Forrest Gump (1994) | Comedy\|Romance\|War | 1994 |
| 2355 | Bug's Life, A (1998) | Animation\|Children's\|Comedy | 1998 |
| 1197 | Princess Bride, The (1987) | Action\|Adventure\|Comedy\|Romance | 1987 |


Note how drastically the recommended contents have changed (for example, the increased prominence of the "Children's" genre and the disappearance of the "Star Wars" series).
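
The ID mapper used above translates between raw movie IDs and matrix column indices. The sketch below illustrates that translation with a toy catalogue and a toy scorer. The function `recommend_by_raw_ids`, the scorer and the exclusion of already-seen items are assumptions made for illustration, not a description of `ItemIDMapper` internals.

```python
def recommend_by_raw_ids(item_ids, score_fn, user_profile, cutoff):
    """Hypothetical sketch: raw IDs -> column indices -> scores -> raw IDs."""
    index_of = {item_id: idx for idx, item_id in enumerate(item_ids)}
    profile_idx = [index_of[i] for i in user_profile if i in index_of]
    scores = score_fn(profile_idx)  # one score per column
    seen = set(profile_idx)
    # Rank unseen columns by score, best first.
    candidates = [idx for idx in range(len(item_ids)) if idx not in seen]
    candidates.sort(key=lambda idx: scores[idx], reverse=True)
    return [(item_ids[idx], scores[idx]) for idx in candidates[:cutoff]]


# Toy catalogue of raw movie IDs (column order) and a fixed toy scorer.
item_ids = [1, 34, 3114, 2571]


def toy_scores(profile_idx):
    return [0.0, 0.7, 0.9, 0.2]


result = recommend_by_raw_ids(item_ids, toy_scores, user_profile=[1], cutoff=2)
print(result)  # [(3114, 0.9), (34, 0.7)]
```

The output is a list of `(raw_id, score)` pairs, which is why the cell above can pass the IDs straight to `movies.reindex`.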

## A train/validation/test split example

To rigorously compare the performance of various recommender algorithms,
we should measure the final score against the **test** dataset, not the validation set.
With the tools introduced so far, this is straightforward.

To begin with, we have prepared a function called ``split_dataframe_partial_user_holdout``, which
splits the users in the original dataframe into train/validation/test users
and holds out part of the interactions of the validation/test users:

In [18]:
from irspack.split import split_dataframe_partial_user_holdout

dataset, item_ids = split_dataframe_partial_user_holdout(
    df, 'userId', 'movieId', val_user_ratio=.3, test_user_ratio=.3,
    heldout_ratio_val=.5, heldout_ratio_test=.5
)

dataset

{'train': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7f559d7c10f0>,
 'val': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7f559d7c36d0>,
 'test': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7f5624217df0>}

As you can see, the returned ``dataset`` is a dictionary which stores the interactions of train/validation/test users as instances of ``UserTrainTestInteractionPair``.

In [19]:
train_users = dataset['train']
val_users = dataset['val']
test_users = dataset['test']

# Concatenate train/validation users into one.
train_and_val_users = train_users.concat(val_users)

In [20]:
val_users.X_train

<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
	with 148357 stored elements in Compressed Sparse Row format>

In [21]:
val_users.X_test

<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
	with 147501 stored elements in Compressed Sparse Row format>

In [22]:
val_users.X_all # which equals val_users.X_train + val_users.X_test

<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
	with 295858 stored elements in Compressed Sparse Row format>

In [23]:
# For train users, there is no "test" interaction held out.
train_users.X_test

<2416x3706 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

For each recommender algorithm (here ``P3alpha``, ``RP3beta``, ``IALS`` and ``DenseSLIM``), we perform:

1. Hyperparameter optimization. During this phase, we use all interactions of the train users and the
   train interactions of the validation users as the source of learning, and the validation users' test interactions as the held-out ground truth.
2. Evaluation. During this phase, we use all interactions of the train/validation users
   as well as the train interactions of the test users as the source of learning,
   and fit the model using the parameters obtained in the optimization phase.
   Then we measure the recommender's performance against the **test** users' test interactions.
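
The row bookkeeping behind these two phases can be sketched with plain integers. The train and validation user counts are taken from the matrix shapes printed above; the test user count is an assumption (the same 0.3 ratio as validation).

```python
# User counts: n_train and n_val come from the shapes shown above,
# n_test is assumed to equal n_val because both ratios are 0.3.
n_train, n_val, n_test = 2416, 1812, 1812

# Phase 1 (optimization): the training matrix stacks train and validation users.
phase1_rows = n_train + n_val
phase1_offset = n_train  # the validation block starts here

# Phase 2 (evaluation): the training matrix stacks train, validation and test users.
phase2_rows = n_train + n_val + n_test
phase2_offset = n_train + n_val  # the test block starts here

print(phase1_rows, phase1_offset)  # 4228 2416
print(phase2_rows, phase2_offset)  # 6040 4228
```

In both phases the users being evaluated form the last block of the stacked matrix, so the evaluator's offset is simply the number of rows stacked above that block.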

In [24]:
from irspack import DenseSLIMOptimizer, RP3betaOptimizer, IALSOptimizer

In [25]:
val_evaluator = Evaluator(
    val_users.X_test,
    offset=train_users.n_users,
    cutoff=20, target_metric="ndcg"
)
test_evaluator = Evaluator(
    test_users.X_test,
    offset=train_and_val_users.n_users
)
test_results = []
for optimizer_class in [IALSOptimizer, RP3betaOptimizer, P3alphaOptimizer, DenseSLIMOptimizer]:
    print(f'Start running {optimizer_class.__name__}.')
    optimizer_ = optimizer_class(
        sps.vstack([train_users.X_all, val_users.X_train]),
        val_evaluator
    )
    best_params, validation_results_df = optimizer_.optimize(n_trials=40, random_seed=0)
    recommender = optimizer_class.recommender_class(
        sps.vstack([train_and_val_users.X_all, test_users.X_train]),
        **best_params
    ).learn()

    test_score = dict(
        algorithm=optimizer_class.__name__, 
        **test_evaluator.get_scores(recommender, cutoffs=[20])
    )
    test_results.append(test_score)

Start running IALSOptimizer.
Start running RP3betaOptimizer.
Start running P3alphaOptimizer.
Start running DenseSLIMOptimizer.

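As the results below show, iALS and DenseSLIM are nearly tied on the accuracy measures (iALS is best on recall@20, while DenseSLIM is marginally ahead on ndcg@20, map@20 and precision@20),
and iALS is clearly the best on the diversity measures (lowest gini_index, highest entropy and appeared_item).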

In [26]:
import pandas as pd
pd.DataFrame(test_results)

| | algorithm | hit@20 | recall@20 | ndcg@20 | map@20 | precision@20 | gini_index@20 | entropy@20 | appeared_item@20 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IALSOptimizer | 0.995585 | 0.212304 | 0.565452 | 0.136123 | 0.514763 | 0.894537 | 6.210776 | 1319.0 |
| 1 | RP3betaOptimizer | 0.995033 | 0.194701 | 0.535615 | 0.123241 | 0.482864 | 0.952493 | 5.394201 | 926.0 |
| 2 | P3alphaOptimizer | 0.996689 | 0.184545 | 0.515565 | 0.11412 | 0.464045 | 0.969435 | 4.985462 | 562.0 |
| 3 | DenseSLIMOptimizer | 0.996689 | 0.208938 | 0.568042 | 0.136153 | 0.515287 | 0.932696 | 5.772552 | 964.0 |
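
To help interpret the diversity columns, here is a sketch of how such measures can be computed from top-k recommendation lists. The definitions are the common textbook ones (natural-log entropy, and a Gini index over all items including those never recommended); the evaluator's exact conventions may differ, and `diversity_metrics` is a hypothetical helper.

```python
import math
from collections import Counter


def diversity_metrics(recommendations, n_items):
    """Entropy, Gini index and number of distinct items over all top-k lists."""
    counts = Counter(item for rec in recommendations for item in rec)
    total = sum(counts.values())
    # Entropy of the distribution of recommended items (natural log).
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Gini index over all items; items never recommended count as zero.
    sorted_counts = [0] * (n_items - len(counts)) + sorted(counts.values())
    gini = sum(
        (2 * (i + 1) - n_items - 1) * c for i, c in enumerate(sorted_counts)
    ) / (n_items * total)
    return {"entropy": entropy, "gini_index": gini, "appeared_item": len(counts)}


# Three users, top-2 lists, catalogue of 4 items; item 0 dominates.
skewed = diversity_metrics([[0, 1], [0, 2], [0, 1]], n_items=4)
# Two users whose lists cover the catalogue evenly.
even = diversity_metrics([[0, 1], [2, 3]], n_items=4)
print(skewed)
print(even)
```

Under these definitions, a lower Gini index and a higher entropy both indicate that recommendations are spread more evenly across the catalogue, and `appeared_item` counts how many distinct items were recommended at all.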
