# Quick Start: Use Cases and Examples
---
This notebook walks through an offline evaluation of BernoulliTS, using *Inverse Probability Weighting* (IPW) as the off-policy evaluation (OPE) estimator and the Random policy as the behavior policy.

Our example consists of the following three major steps:
- (1) Data loading and preprocessing
- (2) Offline Bandit Simulation
- (3) Off-Policy Evaluation

In [1]:
from pathlib import Path

# import open bandit pipeline (obp)
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS, Random
from obp.simulator import OfflineBanditSimulator

## (1) Data loading and preprocessing

We provide an easy-to-use data loader for the Open Bandit Dataset: the **OpenBanditDataset** class in the `dataset` module. <br>
It takes the behavior policy (`'bts'` or `'random'`) and the campaign (`'all'`, `'men'`, or `'women'`) as inputs and handles dataset preprocessing as well as a standardized train/test split.

In [2]:
# (1) Data loading and preprocessing
# Specify the path to the sample dataset
data_path = Path('.').resolve().parents[1] / 'obd'
# Load and preprocess raw data in the "Men" campaign collected by the Random policy
dataset = OpenBanditDataset(behavior_policy='random', campaign='men', data_path=data_path)
# Split the data into 70% training and 30% test sets
train, test = dataset.split_data(test_size=0.3, random_state=42)

# `train` is a dictionary storing action, position, reward, pscore, and some feature vectors
train

{'n_data': 7000,
 'n_actions': 34,
 'action': array([22, 10,  7, ..., 12, 14, 12]),
 'position': array([0, 1, 2, ..., 0, 1, 0]),
 'reward': array([0, 0, 0, ..., 0, 0, 0]),
 'pscore': array([0.02941176, 0.02941176, 0.02941176, ..., 0.02941176, 0.02941176,
        0.02941176]),
 'X_policy': array([[0, 1, 0, ..., 0, 1, 0],
        [1, 0, 1, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        ...,
        [0, 1, 0, ..., 1, 0, 0],
        [0, 1, 0, ..., 1, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]], dtype=uint8),
 'X_reg': array([[ 0.        ,  2.        ,  0.        , ..., 11.        ,
          3.        , -0.69874138],
        [ 1.        ,  1.        ,  1.        , ...,  3.        ,
          0.        ,  1.5864346 ],
        [ 2.        ,  2.        ,  0.        , ...,  3.        ,
          0.        ,  2.85837217],
        ...,
        [ 0.        ,  2.        ,  4.        , ...,  1.        ,
          0.        ,  1.19838585],
        [ 1.        ,  2.        ,  4.        , ..., 12
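
The constant `pscore` values in the output above are consistent with a behavior policy that picks each of the `n_actions` actions uniformly at random. The following is a minimal check of that arithmetic; it assumes a uniform policy and does not call the library.

```python
# Assumption: the Random behavior policy selects each action with equal probability.
n_actions = 34  # value of 'n_actions' in the `train` dictionary shown above

# Under a uniform policy, the propensity score of every logged action is 1 / n_actions.
uniform_pscore = 1.0 / n_actions

print(round(uniform_pscore, 8))  # 0.02941176, matching the displayed 'pscore' entries
```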

## (2) Offline Bandit Simulation

After preparing the dataset, we run an **offline bandit simulation** on the logged bandit feedback. <br>
Below, we use the **OfflineBanditSimulator** class in the `simulator` module. <br>
As the counterfactual policy to be evaluated, we use **Bernoulli Thompson Sampling (Bernoulli TS)**, which is implemented in the `policy` module. <br>
The `simulate` method of the simulator takes the counterfactual policy as input and runs it on the training data passed to the simulator.
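
To give some intuition for what running a policy on logged data can look like, here is a minimal replay-style sketch in plain Python. It is an illustration under stated assumptions, not the internals of `OfflineBanditSimulator`: the names `ToyBernoulliTS` and `replay` are hypothetical, positions and `len_list` are ignored, and the policy is updated only on rounds where its choice matches the logged action.

```python
import random


class ToyBernoulliTS:
    """Hypothetical Beta-Bernoulli Thompson sampling policy (illustration only)."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        # Beta(1, 1) prior on the reward probability of each action
        self.alpha = [1.0] * n_actions
        self.beta = [1.0] * n_actions

    def select_action(self):
        # Draw one sample per action from its posterior and pick the largest
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(self.n_actions), key=samples.__getitem__)

    def update(self, action, reward):
        # Binary reward: a success increments alpha, a failure increments beta
        self.alpha[action] += reward
        self.beta[action] += 1 - reward


def replay(policy, logged_actions, logged_rewards):
    """Run `policy` over logged rounds, learning only from matching rounds."""
    matched_rewards = []
    for logged_action, reward in zip(logged_actions, logged_rewards):
        if policy.select_action() == logged_action:
            policy.update(logged_action, reward)
            matched_rewards.append(reward)
    return matched_rewards


# Toy logged feedback: 3 actions, action 2 is the only one that ever pays off
data_rng = random.Random(42)
toy_actions = [data_rng.randrange(3) for _ in range(300)]
toy_rewards = [1 if a == 2 and data_rng.random() < 0.5 else 0 for a in toy_actions]

toy_policy = ToyBernoulliTS(n_actions=3, seed=0)
toy_matched = replay(toy_policy, toy_actions, toy_rewards)
print(len(toy_matched), "of", len(toy_actions), "rounds matched")
```

Rounds where the policy disagrees with the log are skipped because the reward of the unchosen action was never observed.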

In [3]:
# (2) Offline Bandit Simulation
# Define a simulator object
simulator = OfflineBanditSimulator(train=train)
# Define a counterfactual policy, which is the Bernoulli TS policy here
counterfactual_policy = BernoulliTS(
  n_actions=dataset.n_actions,
  len_list=dataset.len_list,
  random_state=42)
# Run an offline bandit simulation on the training set
simulator.simulate(policy=counterfactual_policy)

                such as DM and DR cannot be used.
100%|██████████| 7000/7000 [00:00<00:00, 26096.62it/s]


## (3) Off-Policy Evaluation (OPE)

Our final step is **off-policy evaluation** (OPE), which estimates the performance of a bandit algorithm from the log data generated by the offline bandit simulation. <br>
Here we use the *InverseProbabilityWeighting* estimator to estimate the performance of Bernoulli TS from the simulation log data. <br>
Finally, we compare the estimated performance of Bernoulli TS with the ground-truth performance of the behavior policy, which is the Random policy here. <br>
The ground-truth performance of the behavior policy can be estimated by the empirical mean of the factual rewards in the `test` set.
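
As a rough illustration of the idea behind IPW, the sketch below implements the textbook form of the estimator: each logged reward is reweighted by the ratio of the evaluated policy's probability of the logged action to the behavior policy's propensity score, and the weighted rewards are averaged. The function name `ipw_estimate` and the toy inputs are hypothetical, and the library's own implementation may differ in its details.

```python
def ipw_estimate(rewards, behavior_pscores, eval_action_probs):
    """Textbook IPW estimate of a policy value from logged bandit feedback.

    rewards:           observed reward for each logged round
    behavior_pscores:  probability that the behavior policy chose the logged action
    eval_action_probs: probability that the evaluated policy would choose it
    """
    n_rounds = len(rewards)
    if n_rounds == 0:
        raise ValueError("need at least one logged round")
    total = 0.0
    for reward, p_behavior, p_eval in zip(rewards, behavior_pscores, eval_action_probs):
        # Importance weight corrects for the mismatch between the two policies
        total += reward * p_eval / p_behavior
    return total / n_rounds


# Toy data (made up for illustration): four logged rounds
toy_rewards = [1, 0, 1, 0]
toy_pscores = [0.5, 0.5, 0.5, 0.5]
toy_eval_probs = [1.0, 0.0, 0.5, 0.5]

toy_value = ipw_estimate(toy_rewards, toy_pscores, toy_eval_probs)
print(toy_value)  # 0.75
```

Rounds with small propensity scores receive large weights, which is why IPW estimates can have high variance when the evaluated policy differs strongly from the behavior policy.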

In [4]:
# Estimate the policy value of BernoulliTS based on actions selected by that policy
estimated_policy_value = simulator.inverse_probability_weighting()

# Compare the estimated performance of BernoulliTS (counterfactual policy)
# with the ground-truth performance of Random (behavior policy)
relative_policy_value_of_bernoulli_ts = estimated_policy_value / test['reward'].mean()
# Our OPE procedure estimates that BernoulliTS improves on Random by about 21.4%
print(relative_policy_value_of_bernoulli_ts)

1.2142857142857142
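
The printed ratio can be converted into the percentage improvement quoted in the code comment above. This is plain arithmetic on the displayed value; the variable names are illustrative only.

```python
# Ratio printed above: estimated value of BernoulliTS / empirical value of Random
relative_value = 1.2142857142857142

# A ratio of 1.0 means no change, so the relative improvement is (ratio - 1) * 100
improvement_pct = (relative_value - 1.0) * 100

print(f"{improvement_pct:.1f}%")  # 21.4%
```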
