# Quickstart Example with Off-Policy Learners
---
This notebook provides an example of evaluating several off-policy learners with synthetic bandit feedback data.

Our example with synthetic bandit data consists of the following three major steps:
- (1) Synthetic Data Generation
- (2) Off-Policy Learning
- (3) Evaluation of Off-Policy Learners

Please see [../examples/opl](../opl) for a more sophisticated example of the evaluation of off-policy learners with synthetic bandit data.

In [1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.linear_model import LogisticRegression
# import open bandit pipeline (obp)
import obp
from obp.dataset import (
    SyntheticBanditDataset,
    logistic_reward_function,
    linear_reward_function,
    linear_behavior_policy
)
from obp.policy import IPWLearner, NNPolicyLearner, Random
from obp.ope import (
    RegressionModel,
    InverseProbabilityWeighting,
    DirectMethod,
    DoublyRobust
)

In [2]:
# obp version
print(obp.__version__)

0.4.0


## (1) Synthetic Data Generation
We prepare an easy-to-use synthetic data generator: the `SyntheticBanditDataset` class in the dataset module.

It takes the number of actions (`n_actions`), the dimension of context vectors (`dim_context`), a reward function (`reward_function`), and a behavior policy (`behavior_policy_function`) as inputs and generates a synthetic bandit dataset. This dataset can be used to evaluate the performance of decision-making policies (obtained by off-policy learning) and OPE estimators.
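
To make the data-generating process more concrete, the following is a minimal NumPy sketch of how logged bandit feedback of this kind can be simulated. It is not the implementation of `SyntheticBanditDataset`: the function `generate_toy_bandit_feedback`, its coefficient matrices, and the helper `softmax` are hypothetical names introduced only for illustration. The sketch assumes a softmax behavior policy over linear scores and a logistic expected reward.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def generate_toy_bandit_feedback(n_rounds, n_actions, dim_context, seed=12345):
    """Simulate logged bandit feedback (illustrative stand-in, not obp code)."""
    rng = np.random.default_rng(seed)
    # contexts observed in each round
    context = rng.normal(size=(n_rounds, dim_context))
    # hypothetical coefficients for the behavior policy and the reward function
    behavior_coef = rng.normal(size=(dim_context, n_actions))
    reward_coef = rng.normal(size=(dim_context, n_actions))
    # behavior policy: softmax over linear scores
    pi_b = softmax(context @ behavior_coef)
    # expected reward q(x, a): logistic function of linear scores
    expected_reward = 1.0 / (1.0 + np.exp(-(context @ reward_coef)))
    # sample one action per round from pi_b via inverse-CDF sampling
    uniform_draws = rng.uniform(size=(n_rounds, 1))
    action = (uniform_draws > pi_b.cumsum(axis=1)).sum(axis=1)
    action = np.minimum(action, n_actions - 1)  # guard against rounding in cumsum
    rounds = np.arange(n_rounds)
    # binary reward drawn from the expected reward of the chosen action
    reward = rng.binomial(n=1, p=expected_reward[rounds, action])
    return {
        "n_rounds": n_rounds,
        "n_actions": n_actions,
        "context": context,
        "action": action,
        "reward": reward,
        "expected_reward": expected_reward,
        "pi_b": pi_b,
        # propensity score: probability of the logged action under pi_b
        "pscore": pi_b[rounds, action],
    }


toy_feedback = generate_toy_bandit_feedback(n_rounds=1000, n_actions=10, dim_context=5)
```

Only `action`, `reward`, and `pscore` would be observable in real logged data; `expected_reward` is available here because the data are synthetic.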

In [3]:
# generate a synthetic bandit dataset with 10 actions
# we use `logistic_reward_function` as the reward function and `linear_behavior_policy` as the behavior policy.
# one can define their own reward function and behavior policy, such as nonlinear ones.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_type="binary", # "binary" or "continuous"
    reward_function=logistic_reward_function,
    behavior_policy_function=linear_behavior_policy,
    random_state=12345,
)
# obtain training and test sets of synthetic logged bandit feedback
n_rounds_train, n_rounds_test = 10000, 10000
bandit_feedback_train = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds_train)
bandit_feedback_test = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds_test)

# `bandit_feedback` is a dictionary storing synthetic logged bandit feedback
bandit_feedback_train

{'n_rounds': 10000,
 'n_actions': 10,
 'context': array([[-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057],
        [ 1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474],
        [ 1.00718936, -1.29622111,  0.27499163,  0.22891288,  1.35291684],
        ...,
        [-1.27028221,  0.80914602, -0.45084222,  0.47179511,  1.89401115],
        [-0.68890924,  0.08857502, -0.56359347, -0.41135069,  0.65157486],
        [ 0.51204121,  0.65384817, -1.98849253, -2.14429131, -0.34186901]]),
 'action_context': array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]),
 'action': array([6, 3, 2, ..., 9, 3, 6]),
 'position': None,
 're

## (2) Off-Policy Learning
After generating synthetic data, we now train some candidate evaluation policies using the training bandit dataset.

We use *NN Policy Learner* and *IPW Learner* implemented in the policy module to train evaluation policies.
For NN Policy Learner, we use *DirectMethod* (DM), *InverseProbabilityWeighting* (IPW), and *DoublyRobust* (DR) as objective functions.
For IPW Learner, we use *RandomForestClassifier* and *LogisticRegression* implemented in scikit-learn as base machine learning methods.
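
To make the three objective functions concrete, here is a minimal NumPy sketch of the standard DM, IPW, and DR estimates of a policy's value from logged data. It is an illustration under simplifying assumptions, not the library's code: the function names `dm_value`, `ipw_value`, and `dr_value` are hypothetical, arrays are two-dimensional `(n_rounds, n_actions)`, and nothing here is differentiable, whereas a neural policy learner needs tensor versions of these quantities to optimize them by gradient ascent.

```python
import numpy as np


def dm_value(action_dist, q_hat):
    """Direct Method: average estimated reward under the evaluation policy."""
    return float((action_dist * q_hat).sum(axis=1).mean())


def ipw_value(action_dist, action, reward, pscore):
    """Inverse Probability Weighting: reweight logged rewards by pi_e / pi_b."""
    rounds = np.arange(action.shape[0])
    weight = action_dist[rounds, action] / pscore
    return float((weight * reward).mean())


def dr_value(action_dist, action, reward, pscore, q_hat):
    """Doubly Robust: DM baseline plus an IPW correction of its residual."""
    rounds = np.arange(action.shape[0])
    weight = action_dist[rounds, action] / pscore
    baseline = (action_dist * q_hat).sum(axis=1)
    correction = weight * (reward - q_hat[rounds, action])
    return float((baseline + correction).mean())


# tiny hand-checkable example with two actions and four logged rounds
logged_action = np.array([0, 1, 1, 0])
logged_reward = np.array([1.0, 0.0, 1.0, 1.0])
logged_pscore = np.full(4, 0.5)            # behavior policy was uniform
eval_action_dist = np.full((4, 2), 0.5)    # evaluation policy is also uniform
reward_model = np.full((4, 2), 0.5)        # a deliberately crude reward model

print(dm_value(eval_action_dist, reward_model))
print(ipw_value(eval_action_dist, logged_action, logged_reward, logged_pscore))
print(dr_value(eval_action_dist, logged_action, logged_reward, logged_pscore, reward_model))
```

In this example the evaluation policy equals the behavior policy, so IPW reduces to the average logged reward, and DR corrects the crude reward model back to the same value.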

In [4]:
# estimate the mean reward function by using an ML model (Logistic Regression here)
# the estimated rewards are used by model-dependent estimators such as DM and DR
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    action_context=dataset.action_context,
    base_model=LogisticRegression(random_state=12345),
)
# please refer to https://arxiv.org/abs/2002.08536 for the details of the cross-fitting procedure.
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    n_folds=3, # use 3-fold cross-fitting
    random_state=12345,
)
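
The cross-fitting idea used above can be sketched in a few lines of NumPy. The rounds are split into folds, and the reward estimates for each fold come from a model fitted only on the other folds, so no round is predicted by a model that has seen its own reward. This sketch is an assumption-laden illustration: `cross_fit_mean_rewards` is a hypothetical helper, and it uses a context-free per-action mean reward as a stand-in for the base model.

```python
import numpy as np


def cross_fit_mean_rewards(action, reward, n_actions, n_folds=3, seed=12345):
    """Return out-of-fold reward estimates of shape (n_rounds, n_actions).

    The "model" is the per-action mean reward, a hypothetical stand-in for
    a real regression model such as logistic regression.
    """
    n_rounds = action.shape[0]
    rng = np.random.default_rng(seed)
    # assign each round to one of n_folds folds at random
    fold_id = rng.permutation(n_rounds) % n_folds
    q_hat = np.zeros((n_rounds, n_actions))
    for k in range(n_folds):
        train = fold_id != k
        held_out = fold_id == k
        fallback = reward[train].mean()  # used if an action is absent in training folds
        for a in range(n_actions):
            mask = train & (action == a)
            q_hat[held_out, a] = reward[mask].mean() if mask.any() else fallback
    return q_hat, fold_id


rng = np.random.default_rng(0)
cf_action = rng.integers(0, 3, size=300)
cf_reward = rng.binomial(1, 0.3 + 0.2 * cf_action)
cf_q_hat, cf_fold_id = cross_fit_mean_rewards(cf_action, cf_reward, n_actions=3)
```

Because each fold's estimates never use that fold's rewards, changing the rewards inside one fold leaves the estimates for that fold unchanged.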

In [5]:
# define DM
dm = DirectMethod()
# define NNPolicyLearner with DM as its objective function
nn_dm = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=5,
    off_policy_objective=dm.estimate_policy_value_tensor,
    random_state=12345,
)
# train NNPolicyLearner on the training set of the synthetic logged bandit feedback
nn_dm.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
)
# obtain action choice probabilities for the test set of the synthetic logged bandit feedback
action_dist_nn_dm = nn_dm.predict_proba(context=bandit_feedback_test["context"])

In [6]:
# define IPW
ipw = InverseProbabilityWeighting()
# define NNPolicyLearner with IPW as its objective function
nn_ipw = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=5,
    off_policy_objective=ipw.estimate_policy_value_tensor,
    random_state=12345,
)
# train NNPolicyLearner on the training set of the synthetic logged bandit feedback
nn_ipw.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
)
# obtain action choice probabilities for the test set of the synthetic logged bandit feedback
action_dist_nn_ipw = nn_ipw.predict_proba(context=bandit_feedback_test["context"])

In [7]:
# define DR
dr = DoublyRobust()
# define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=5,
    off_policy_objective=dr.estimate_policy_value_tensor,
    random_state=12345,
)
# train NNPolicyLearner on the training set of the synthetic logged bandit feedback
nn_dr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
)
# obtain action choice probabilities for the test set of the synthetic logged bandit feedback
action_dist_nn_dr = nn_dr.predict_proba(context=bandit_feedback_test["context"])

In [8]:
# define IPWLearner with Logistic Regression as its base ML model
ipw_lr = IPWLearner(
    n_actions=dataset.n_actions,
    base_classifier=LogisticRegression(C=100, random_state=12345)
)
# train IPWLearner on the training set of the synthetic logged bandit feedback
ipw_lr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"]
)
# obtain the action choices for the test set of the synthetic logged bandit feedback
# (`predict` returns deterministic choices rather than probabilities)
action_dist_ipw_lr = ipw_lr.predict(context=bandit_feedback_test["context"])
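
One common way to train an IPW-based learner is to reduce policy learning to weighted classification: the logged action is the label, and `reward / pscore` is the sample weight. The sketch below assumes this reduction; whether it matches the internals of `IPWLearner` exactly is an assumption, and the helpers `ipw_classification_data` and `best_constant_action` are hypothetical. To stay dependency-free, the "classifier" is the simplest possible one, a context-free policy that always picks a single action.

```python
import numpy as np


def ipw_classification_data(context, action, reward, pscore):
    """Build (features, labels, sample weights) for weighted classification."""
    return context, action, reward / pscore


def best_constant_action(labels, sample_weight, n_actions):
    """Pick the action with the largest total IPW-weighted reward.

    For a policy that ignores the context, maximizing the IPW estimate
    amounts to choosing this action.
    """
    totals = np.bincount(labels, weights=sample_weight, minlength=n_actions)
    return int(totals.argmax()), totals


wc_context = np.zeros((5, 2))  # unused by the context-free policy
wc_action = np.array([0, 0, 1, 1, 2])
wc_reward = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
wc_pscore = np.array([0.5, 0.5, 0.25, 0.25, 0.5])

_, wc_labels, wc_weights = ipw_classification_data(wc_context, wc_action, wc_reward, wc_pscore)
wc_best, wc_totals = best_constant_action(wc_labels, wc_weights, n_actions=3)
print(wc_best, wc_totals)
```

A context-aware base classifier would receive the same labels and weights, for example through the `sample_weight` argument that scikit-learn classifiers accept in `fit`.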

In [9]:
# define IPWLearner with Random Forest as its base ML model
ipw_rf = IPWLearner(
    n_actions=dataset.n_actions,
    base_classifier=RandomForest(n_estimators=30, min_samples_leaf=10, random_state=12345)
)
# train IPWLearner on the training set of the synthetic logged bandit feedback
ipw_rf.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"]
)
# obtain the action choices for the test set of the synthetic logged bandit feedback
# (`predict` returns deterministic choices rather than probabilities)
action_dist_ipw_rf = ipw_rf.predict(context=bandit_feedback_test["context"])

In [10]:
# define Uniform Random Policy as a baseline evaluation policy
random = Random(n_actions=dataset.n_actions,)

# compute the action choice probabilities for the test set of the synthetic logged bandit feedback
action_dist_random = random.compute_batch_action_dist(
    n_rounds=bandit_feedback_test["n_rounds"]
)
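
Every policy above is handed to the evaluation step as an action distribution array. The following sketch shows how a deterministic policy and a uniform random policy can be encoded in that form, assuming arrays of shape `(n_rounds, n_actions, 1)` where the last axis is a single slate position. The shape convention and the helper names `one_hot_action_dist` and `uniform_action_dist` are assumptions made for illustration.

```python
import numpy as np


def one_hot_action_dist(chosen_action, n_actions):
    """Encode deterministic choices as a degenerate action distribution."""
    n_rounds = chosen_action.shape[0]
    action_dist = np.zeros((n_rounds, n_actions, 1))
    action_dist[np.arange(n_rounds), chosen_action, 0] = 1.0
    return action_dist


def uniform_action_dist(n_rounds, n_actions):
    """Every action is chosen with probability 1 / n_actions in every round."""
    return np.full((n_rounds, n_actions, 1), 1.0 / n_actions)


demo_choices = np.array([2, 0, 1])
demo_deterministic = one_hot_action_dist(demo_choices, n_actions=3)
demo_uniform = uniform_action_dist(n_rounds=3, n_actions=3)
```

Both arrays sum to one over the action axis, so the same evaluation code can treat deterministic and stochastic policies uniformly.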

## (3) Evaluation of Off-Policy Learners
Our final step is the evaluation of the off-policy learners.
With synthetic data, the expected rewards are known, so we can calculate the ground-truth policy value of each learned policy.
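
The policy value of a policy is the expected reward it would obtain: for each round, the expected rewards of all actions are averaged with the policy's action choice probabilities as weights, and the result is averaged over rounds. Below is a minimal NumPy sketch of that computation. The function `ground_truth_policy_value` is a hypothetical helper, and the `(n_rounds, n_actions, 1)` shape of `action_dist` is an assumption.

```python
import numpy as np


def ground_truth_policy_value(expected_reward, action_dist):
    """Average over rounds of sum_a pi(a | x) * q(x, a)."""
    per_round_value = (expected_reward * action_dist[:, :, 0]).sum(axis=1)
    return float(per_round_value.mean())


# two rounds, two actions: q(x, a) is known because the data are synthetic
demo_expected_reward = np.array([[0.2, 0.8],
                                 [0.6, 0.4]])
# a greedy policy that always picks the best action in each round
greedy_dist = np.array([[0.0, 1.0],
                        [1.0, 0.0]])[:, :, np.newaxis]
# a uniform random policy
uniform_dist = np.full((2, 2, 1), 0.5)

print(ground_truth_policy_value(demo_expected_reward, greedy_dist))
print(ground_truth_policy_value(demo_expected_reward, uniform_dist))
```

The greedy policy collects the larger expected reward in each round, while the uniform policy collects the average over the two actions.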

In [11]:
# calculate the ground-truth policy values of the six evaluation policies
# (five learned policies and the uniform random baseline) using the expected rewards of the test data
policy_names = [
    "NN Policy Learner with DM",
    "NN Policy Learner with IPW",
    "NN Policy Learner with DR",
    "IPW Learner with Logistic Regression",
    "IPW Learner with Random Forest",
    "Uniform Random"
]
action_dist_list = [
    action_dist_nn_dm,
    action_dist_nn_ipw,
    action_dist_nn_dr,
    action_dist_ipw_lr,
    action_dist_ipw_rf,
    action_dist_random
]
for name, action_dist in zip(policy_names, action_dist_list):
    true_policy_value = dataset.calc_ground_truth_policy_value(
        expected_reward=bandit_feedback_test["expected_reward"],
        action_dist=action_dist,
    )
    print(f'policy value of {name}: {true_policy_value}')

policy value of NN Policy Learner with DM: 0.6785771195516228
policy value of NN Policy Learner with IPW: 0.7429362678096227
policy value of NN Policy Learner with DR: 0.7651217293062053
policy value of IPW Learner with Logistic Regression: 0.767614655337475
policy value of IPW Learner with Random Forest: 0.703809241480009
policy value of Uniform Random: 0.6043385526445931


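IPW Learner with Logistic Regression achieves the highest policy value (about 0.768), closely followed by NN Policy Learner with DR (about 0.765). All five learned policies outperform the uniform random baseline (about 0.604).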
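
The comparison can also be made explicit by ranking the policies and expressing each value relative to the uniform random baseline. The snippet below is a small convenience sketch that reuses the values printed above; the dictionary `policy_values` is introduced here only for illustration.

```python
# ground-truth policy values copied from the output above
policy_values = {
    "NN Policy Learner with DM": 0.6785771195516228,
    "NN Policy Learner with IPW": 0.7429362678096227,
    "NN Policy Learner with DR": 0.7651217293062053,
    "IPW Learner with Logistic Regression": 0.767614655337475,
    "IPW Learner with Random Forest": 0.703809241480009,
    "Uniform Random": 0.6043385526445931,
}

baseline = policy_values["Uniform Random"]
# sort policies from the highest to the lowest policy value
ranking = sorted(policy_values.items(), key=lambda item: item[1], reverse=True)

for name, value in ranking:
    relative_lift = (value - baseline) / baseline
    print(f"{name}: {value:.4f} ({relative_lift:+.1%} vs. uniform random)")
```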

Please see [../examples/opl](../opl) for a more sophisticated example of the evaluation of off-policy learners with synthetic bandit data.