# Quickstart Example with Off-Policy Learners
---
This notebook provides an example of implementing several off-policy learning methods with synthetic logged bandit data.

The example consists of the following three major steps:
- (1) Generating Synthetic Data
- (2) Off-Policy Learning
- (3) Evaluation of Off-Policy Learners

Please see [../examples/opl](../opl) for a more sophisticated example of the evaluation of off-policy learners with synthetic bandit data.

In [1]:
# needed when using Google Colab
# !pip install obp

In [2]:
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.linear_model import LogisticRegression

# import open bandit pipeline (obp)
import obp
from obp.dataset import (
    SyntheticBanditDataset,
    logistic_reward_function,
    linear_reward_function
)
from obp.policy import (
    IPWLearner, 
    QLearner,
    NNPolicyLearner, 
    Random
)

In [3]:
# obp version
print(obp.__version__)

0.5.4


## (1) Generating Synthetic Data
`obp.dataset.SyntheticBanditDataset` is an easy-to-use synthetic data generator.

It takes 
- number of actions (`n_actions`, $|\mathcal{A}|$)
- dimension of context vectors (`dim_context`, $d$)
- reward function (`reward_function`, $q(x,a)=\mathbb{E}[r|x,a]$)

as inputs and generates synthetic logged bandit data, which can be used to evaluate the performance of decision-making policies obtained by off-policy learning.

In [4]:
# generate synthetic logged bandit data with 10 actions
# we use the logistic function as the reward function and control the behavior policy with `beta`
# users can also define their own reward function or behavior policy function, e.g., a nonlinear one
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    beta=-2, # inverse temperature parameter to control the optimality and entropy of the behavior policy
    reward_type="binary", # "binary" or "continuous"
    reward_function=logistic_reward_function,
    random_state=12345,
)
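
The comment in the cell above mentions user-defined reward functions. The following is a minimal sketch of one, assuming that `reward_function` receives `context` of shape `(n_rounds, dim_context)`, `action_context` of shape `(n_actions, dim_action_context)`, and a `random_state`, and returns expected rewards of shape `(n_rounds, n_actions)`, as the built-in `logistic_reward_function` appears to do. The name `quadratic_logistic_reward_function` and its coefficients are hypothetical and not part of obp.

```python
import numpy as np


def quadratic_logistic_reward_function(context, action_context, random_state=None):
    """Hypothetical nonlinear reward function: q(x, a) = sigmoid((x ** 2) @ w_a + b_a)."""
    rng = np.random.default_rng(random_state)
    n_actions = action_context.shape[0]
    dim_context = context.shape[1]
    # one coefficient vector and one bias per action (illustrative values)
    coef = rng.normal(size=(dim_context, n_actions))
    bias = rng.normal(size=n_actions)
    # squared features make the reward a nonlinear function of the context
    logits = (context ** 2) @ coef + bias
    return 1.0 / (1.0 + np.exp(-logits))


# small usage example with 4 rounds, 5-dimensional contexts, and 10 actions
toy_context = np.random.default_rng(0).normal(size=(4, 5))
toy_action_context = np.eye(10, dtype=int)
toy_expected_reward = quadratic_logistic_reward_function(
    toy_context, toy_action_context, random_state=12345
)
print(toy_expected_reward.shape)  # (4, 10)
```

If the signature assumption holds, such a function could be passed as `reward_function=quadratic_logistic_reward_function` when constructing the dataset.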

In [5]:
# obtain training and test sets of synthetic logged bandit data
n_rounds_train, n_rounds_test = 10000, 10000
bandit_feedback_train = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds_train)
bandit_feedback_test = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds_test)

The logged bandit dataset is collected by the behavior policy $\pi_b$ as follows.

$ \mathcal{D}_b := \{(x_i,a_i,r_i)\}_{i=1}^n$  where $(x,a,r) \sim p(x)\pi_b(a | x)p(r | x,a) $
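
The sampling process above can be sketched in plain NumPy. This is an illustration only: it assumes standard normal contexts, a logistic expected reward, and a behavior policy that is a softmax over `beta * q(x, a)`. Whether this matches the internals of `SyntheticBanditDataset` exactly is an assumption, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_rounds, n_actions, dim_context, beta = 1000, 10, 5, -2.0

# x ~ p(x): contexts drawn from a standard normal distribution
context = rng.normal(size=(n_rounds, dim_context))

# q(x, a) = E[r | x, a]: an illustrative logistic expected reward
coef = rng.normal(size=(dim_context, n_actions))
expected_reward = 1.0 / (1.0 + np.exp(-(context @ coef)))

# pi_b(a | x): softmax over beta * q(x, a); a negative beta favors low-reward actions
logits = beta * expected_reward
logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
pi_b = np.exp(logits)
pi_b = pi_b / pi_b.sum(axis=1, keepdims=True)

# a ~ pi_b(a | x): sample one action per round
action = np.array([rng.choice(n_actions, p=p) for p in pi_b])

# propensity score of the logged action, needed later by IPW and DR
rounds = np.arange(n_rounds)
pscore = pi_b[rounds, action]

# r ~ p(r | x, a): binary reward drawn from a Bernoulli distribution
reward = rng.binomial(n=1, p=expected_reward[rounds, action])
```

The arrays `context`, `action`, `reward`, and `pscore` play the same roles as the entries of the same names in `bandit_feedback_train`.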

In [6]:
# `bandit_feedback` is a dictionary storing synthetic logged bandit data
bandit_feedback_train

{'n_rounds': 10000,
 'n_actions': 10,
 'context': array([[-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057],
        [ 1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474],
        [ 1.00718936, -1.29622111,  0.27499163,  0.22891288,  1.35291684],
        ...,
        [-1.27028221,  0.80914602, -0.45084222,  0.47179511,  1.89401115],
        [-0.68890924,  0.08857502, -0.56359347, -0.41135069,  0.65157486],
        [ 0.51204121,  0.65384817, -1.98849253, -2.14429131, -0.34186901]]),
 'action_context': array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]),
 'action': array([9, 2, 1, ..., 0, 3, 7]),
 'position': None,
 're

## (2) Off-Policy Learning
After generating synthetic data, we now train some decision making policies.

To train policies on logged bandit data, we use

- `obp.policy.NNPolicyLearner` (Neural Network Policy Learner)
- `obp.policy.IPWLearner`

For `NNPolicyLearner`, we use 
- Direct Method ("dm")
- Inverse Probability Weighting ("ipw")
- Doubly Robust ("dr")

as its objective function (`off_policy_objective`).

For `IPWLearner`, we use `RandomForestClassifier` and `LogisticRegression` implemented in scikit-learn as base ML methods.

A policy is trained by maximizing an OPE estimator, used as the objective function, minus a regularization term:

$$ \hat{\pi} \in \arg \max_{\pi \in \Pi} \hat{V} (\pi; \mathcal{D}_{tr}) - \lambda \cdot \Omega (\pi)  $$

where $\hat{V}(\cdot; \mathcal{D})$ is an off-policy objective and $\mathcal{D}_{tr}$ is a training bandit dataset. $\Omega (\cdot)$ is a regularization term and $\lambda$ is its weight.
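
To make the three choices of $\hat{V}$ concrete, here is a minimal NumPy sketch of the standard DM, IPW, and DR value estimates for a policy given as an `(n_rounds, n_actions)` array of action probabilities. It is not the `NNPolicyLearner` implementation, which may differ in details such as clipping and regularization. The function names and the reward model `q_hat` (an estimate of $q(x,a)$) are hypothetical.

```python
import numpy as np


def dm_value(action_dist, q_hat):
    """Direct Method: average the estimated rewards under the policy."""
    return float(np.mean(np.sum(action_dist * q_hat, axis=1)))


def ipw_value(action_dist, action, reward, pscore):
    """Inverse Probability Weighting: reweight logged rewards by pi(a|x) / pi_b(a|x)."""
    rounds = np.arange(action.shape[0])
    weight = action_dist[rounds, action] / pscore
    return float(np.mean(weight * reward))


def dr_value(action_dist, action, reward, pscore, q_hat):
    """Doubly Robust: DM baseline plus an IPW correction of the model residual."""
    rounds = np.arange(action.shape[0])
    weight = action_dist[rounds, action] / pscore
    baseline = np.sum(action_dist * q_hat, axis=1)
    return float(np.mean(baseline + weight * (reward - q_hat[rounds, action])))


# toy logged data: 2 actions chosen uniformly at random, action 0 always pays 1
toy_action = np.array([0, 1, 0, 1])
toy_reward = np.array([1.0, 0.0, 1.0, 0.0])
toy_pscore = np.full(4, 0.5)
toy_q_hat = np.tile([1.0, 0.0], (4, 1))        # a perfect reward model
toy_policy = np.tile([1.0, 0.0], (4, 1))       # always choose action 0

print(dm_value(toy_policy, toy_q_hat))
print(ipw_value(toy_policy, toy_action, toy_reward, toy_pscore))
print(dr_value(toy_policy, toy_action, toy_reward, toy_pscore, toy_q_hat))
```

On this toy data all three estimates equal 1.0, the true value of always choosing action 0. They diverge when `q_hat` is misspecified (DM becomes biased) or when `pscore` is small (IPW becomes high-variance).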

In [7]:
# define NNPolicyLearner with DM as its objective function
nn_dm = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective="dm",
    batch_size=64,
    random_state=12345,
)

# train NNPolicyLearner on the training set of logged bandit data
nn_dm.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
)

# obtain action choice probabilities for the test set
action_dist_nn_dm = nn_dm.predict_proba(
    context=bandit_feedback_test["context"]
)

q-func learning: 100%|██████████| 200/200 [00:19<00:00, 10.51it/s]
policy learning: 100%|██████████| 200/200 [00:47<00:00,  4.23it/s]


In [8]:
# define NNPolicyLearner with IPW as its objective function
nn_ipw = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective="ipw",
    batch_size=64,
    random_state=12345,
)

# train NNPolicyLearner on the training set of logged bandit data
nn_ipw.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
)

# obtain action choice probabilities for the test set
action_dist_nn_ipw = nn_ipw.predict_proba(
    context=bandit_feedback_test["context"]
)

policy learning: 100%|██████████| 200/200 [00:52<00:00,  3.79it/s]


In [9]:
# define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective="dr",
    batch_size=64,
    random_state=12345,
)

# train NNPolicyLearner on the training set of logged bandit data
nn_dr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
)

# obtain action choice probabilities for the test set
action_dist_nn_dr = nn_dr.predict_proba(
    context=bandit_feedback_test["context"]
)

q-func learning: 100%|██████████| 200/200 [00:20<00:00,  9.80it/s]
policy learning: 100%|██████████| 200/200 [00:54<00:00,  3.64it/s]


In [10]:
# define IPWLearner with Logistic Regression as its base ML model
ipw_lr = IPWLearner(
    n_actions=dataset.n_actions,
    base_classifier=LogisticRegression(C=100, random_state=12345)
)

# train IPWLearner on the training set of logged bandit data
ipw_lr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"]
)

# obtain (deterministic) action choices for the test set, encoded as a one-hot action distribution
action_dist_ipw_lr = ipw_lr.predict(
    context=bandit_feedback_test["context"]
)

In [11]:
# define IPWLearner with Random Forest as its base ML model
ipw_rf = IPWLearner(
    n_actions=dataset.n_actions,
    base_classifier=RandomForest(
        n_estimators=30, min_samples_leaf=10, random_state=12345
    )
)

# train IPWLearner on the training set of logged bandit data
ipw_rf.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"]
)

# obtain (deterministic) action choices for the test set, encoded as a one-hot action distribution
action_dist_ipw_rf = ipw_rf.predict(
    context=bandit_feedback_test["context"]
)

In [12]:
# define Uniform Random Policy as a baseline evaluation policy
random = Random(n_actions=dataset.n_actions)

# compute the action choice probabilities for the test set
action_dist_random = random.compute_batch_action_dist(
    n_rounds=bandit_feedback_test["n_rounds"]
)

In [13]:
# action_dist is a probability distribution over actions (can be deterministic)
action_dist_ipw_lr[:, :, 0]

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.]])
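
The output above is one-hot because the policy is deterministic. As a rough illustration of this layout, the sketch below builds an array of shape `(n_rounds, n_actions, 1)` from chosen action indices, assuming the trailing axis indexes the position in a recommendation list of length one. The helper name `to_action_dist` is hypothetical.

```python
import numpy as np


def to_action_dist(chosen_actions, n_actions):
    """Encode deterministic action choices as a one-hot action distribution."""
    n_rounds = chosen_actions.shape[0]
    action_dist = np.zeros((n_rounds, n_actions, 1))
    # put probability 1 on the chosen action of each round
    action_dist[np.arange(n_rounds), chosen_actions, 0] = 1.0
    return action_dist


toy_action_dist = to_action_dist(np.array([0, 2, 2, 1]), n_actions=3)
print(toy_action_dist[:, :, 0])
```

Each row sums to one, so a deterministic policy is handled in the same way as a stochastic one by downstream code.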

## (3) Evaluation of Off-Policy Learners
Our final step is the evaluation and comparison of the off-policy learners.

With synthetic data, we can calculate the policy value of the off-policy learners as follows. 

$$V(\pi_e) \approx \frac{1}{|\mathcal{D}_{te}|} \sum_{i=1}^{|\mathcal{D}_{te}|} \mathbb{E}_{a \sim \pi_e(a|x_i)} [q(x_i, a)], \quad \text{where} \quad q(x,a) := \mathbb{E}_{r \sim p(r|x,a)} [r]$$

Here, $\mathcal{D}_{te}$ is the test set of logged bandit data.
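
The approximation above is a weighted average of the expected rewards, followed by a mean over the test rounds. A minimal NumPy sketch, assuming `expected_reward` has shape `(n_rounds, n_actions)` and `action_dist` has shape `(n_rounds, n_actions, 1)`; the function name is hypothetical and is not obp's implementation:

```python
import numpy as np


def ground_truth_policy_value(expected_reward, action_dist):
    """Average over rounds of sum_a pi_e(a | x_i) * q(x_i, a)."""
    per_round_value = np.sum(expected_reward * action_dist[:, :, 0], axis=1)
    return float(np.mean(per_round_value))


# toy example with 2 rounds and 2 actions
toy_q = np.array([[0.2, 0.8], [0.4, 0.6]])

# uniform random policy: probability 0.5 for each action
uniform_dist = np.full((2, 2, 1), 0.5)

# greedy policy: probability 1 on the action with the highest expected reward
greedy_dist = np.zeros((2, 2, 1))
greedy_dist[np.arange(2), toy_q.argmax(axis=1), 0] = 1.0

print(ground_truth_policy_value(toy_q, uniform_dist))
print(ground_truth_policy_value(toy_q, greedy_dist))
```

Up to floating-point rounding, the uniform policy scores 0.5 and the greedy policy 0.7 on this toy data. The greedy value is the best any policy can achieve here.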

In [14]:
# we calculate the policy values of the trained policies based on the expected rewards of the test data
policy_names = [
    "NN Policy Learner with DM",
    "NN Policy Learner with IPW",
    "NN Policy Learner with DR",
    "IPW Learner with Logistic Regression",
    "IPW Learner with Random Forest",
    "Uniform Random"
]
action_dist_list = [
    action_dist_nn_dm,
    action_dist_nn_ipw,
    action_dist_nn_dr,
    action_dist_ipw_lr,
    action_dist_ipw_rf,
    action_dist_random
]

for name, action_dist in zip(policy_names, action_dist_list):
    true_policy_value = dataset.calc_ground_truth_policy_value(
        expected_reward=bandit_feedback_test["expected_reward"],
        action_dist=action_dist,
    )
    print(f'policy value of {name}: {true_policy_value}')

policy value of NN Policy Learner with DM: 0.7862505830999654
policy value of NN Policy Learner with IPW: 0.7606162025424541
policy value of NN Policy Learner with DR: 0.7732793867972861
policy value of IPW Learner with Logistic Regression: 0.7933299733929567
policy value of IPW Learner with Random Forest: 0.7050722711915117
policy value of Uniform Random: 0.49992528545607745


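In this simple setting, `IPWLearner` with `LogisticRegression` achieves the highest policy value (about 0.793), narrowly ahead of `NNPolicyLearner` with DM (about 0.786), and every learned policy outperforms the uniform random baseline (about 0.500).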

We can iterate the above process several times to get more reliable results.
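
One way to organize such repetitions is sketched below: run the whole experiment once per random seed and report the mean and standard deviation of each policy value. The helper `summarize_runs` is hypothetical, and `toy_experiment` is only a stand-in for steps (1) to (3) of this notebook, not a real off-policy learner.

```python
import numpy as np


def summarize_runs(run_experiment, seeds):
    """Run `run_experiment(seed)` for each seed and aggregate values per policy name."""
    values = {}
    for seed in seeds:
        for name, value in run_experiment(seed).items():
            values.setdefault(name, []).append(value)
    # sample mean and sample standard deviation across seeds
    return {
        name: (float(np.mean(v)), float(np.std(v, ddof=1)))
        for name, v in values.items()
    }


def toy_experiment(seed):
    """Stand-in experiment: compare a greedy and a uniform policy on random rewards."""
    rng = np.random.default_rng(seed)
    expected_reward = rng.uniform(size=(500, 10))
    return {
        "greedy": float(expected_reward.max(axis=1).mean()),
        "uniform": float(expected_reward.mean()),
    }


summary = summarize_runs(toy_experiment, seeds=range(5))
for name, (mean, std) in summary.items():
    print(f"policy value of {name}: {mean:.3f} +/- {std:.3f}")
```

In the actual notebook, the function passed to `summarize_runs` would regenerate the data, retrain each learner with the given seed, and return the ground-truth policy values computed above.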

Please see [../examples/opl](../opl) for a more sophisticated example of the evaluation of off-policy learners with synthetic bandit data.