# OPE Experiment with Multi-class Classification Data
---
This notebook shows how to conduct off-policy evaluation (OPE) of an evaluation policy, using a multi-class classification dataset as logged bandit data.

This example notebook contains the following four major steps:
- (1) Bandit Reduction
- (2) Off-Policy Learning
- (3) Off-Policy Evaluation
- (4) Evaluation of OPE

In [1]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.linear_model import LogisticRegression

# import open bandit pipeline (obp)
import obp
from obp.dataset import MultiClassToBanditReduction
from obp.ope import (
    OffPolicyEvaluation, 
    RegressionModel,
    InverseProbabilityWeighting,
    DirectMethod,
    DoublyRobust
)

In [2]:
# obp version
print(obp.__version__)

0.4.1


## (1) Bandit Reduction
`obp.dataset.MultiClassToBanditReduction` is an easy-to-use class for transforming classification data into bandit data.
It takes
- feature vectors (`X`)
- class labels (`y`)
- a classifier to construct the behavior policy (`base_classifier_b`)
- a parameter of the behavior policy (`alpha_b`)

as its inputs and generates bandit data that can be used to evaluate the performance of decision-making policies (obtained by `off-policy learning`) and OPE estimators.
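
The reduction can be illustrated with plain NumPy. This is a minimal sketch, not obp's implementation: it assumes the behavior policy puts weight `alpha_b` on the classifier's prediction and spreads the remaining `1 - alpha_b` uniformly over all actions, which is consistent with the `pscore` values of 0.82 (= 0.8 + 0.2 / 10) in the output further down. The function name `reduce_to_bandit` and the toy labels are hypothetical.

```python
import numpy as np

def reduce_to_bandit(y, y_pred, n_actions, alpha_b, seed=0):
    """Turn labels `y` and classifier predictions `y_pred` into logged bandit feedback."""
    rng = np.random.default_rng(seed)
    n_rounds = len(y)
    # assumed behavior policy: alpha_b on the predicted class, the rest spread uniformly
    pi_b = np.full((n_rounds, n_actions), (1.0 - alpha_b) / n_actions)
    pi_b[np.arange(n_rounds), y_pred] += alpha_b
    # sample one action per round from pi_b
    action = np.array([rng.choice(n_actions, p=p) for p in pi_b])
    # the reward is 1 when the sampled action equals the true label, 0 otherwise
    reward = (action == y).astype(float)
    # propensity score of the action that was actually logged
    pscore = pi_b[np.arange(n_rounds), action]
    return {"action": action, "reward": reward, "pscore": pscore, "pi_b": pi_b}

y = np.array([3, 1, 4, 1, 5])        # hypothetical true labels
y_pred = np.array([3, 1, 4, 0, 5])   # hypothetical classifier output
feedback = reduce_to_bandit(y, y_pred, n_actions=10, alpha_b=0.8)
```

Under this assumption every logged propensity score is either 0.82 (the sampled action matches the classifier's prediction) or 0.02 (it does not).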

In [3]:
# load raw digits data
# `return_X_y` splits feature vectors and labels, instead of returning a Bunch object
X, y = load_digits(return_X_y=True)

In [4]:
# convert the raw classification data into a logged bandit dataset
# we construct a behavior policy using Logistic Regression and parameter alpha_b
# given a pair of a feature vector and a label (x, c), create a pair of a context vector and reward (x, r)
# where r = 1 if the output of the behavior policy is equal to c and r = 0 otherwise
# please refer to https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.dataset.multiclass.html for the details
dataset = MultiClassToBanditReduction(
    X=X,
    y=y,
    base_classifier_b=LogisticRegression(max_iter=10000, random_state=12345),
    alpha_b=0.8,
    dataset_name="digits",
)

In [5]:
# split the original data into training and evaluation sets
dataset.split_train_eval(eval_size=0.7, random_state=12345)

In [6]:
# obtain logged bandit data generated by behavior policy
bandit_data = dataset.obtain_batch_bandit_feedback(random_state=12345)

# `bandit_data` is a dictionary storing logged bandit feedback
bandit_data

{'n_actions': 10,
 'n_rounds': 1258,
 'context': array([[ 0.,  0.,  0., ..., 16.,  1.,  0.],
        [ 0.,  0.,  7., ..., 16.,  3.,  0.],
        [ 0.,  0., 12., ...,  8.,  0.,  0.],
        ...,
        [ 0.,  1., 13., ...,  8., 11.,  1.],
        [ 0.,  0., 15., ...,  0.,  0.,  0.],
        [ 0.,  0.,  4., ..., 15.,  3.,  0.]]),
 'action': array([6, 8, 5, ..., 2, 5, 9]),
 'reward': array([1., 1., 1., ..., 1., 1., 1.]),
 'position': None,
 'pscore': array([0.82, 0.82, 0.82, ..., 0.82, 0.82, 0.82])}

## (2) Off-Policy Learning
After generating logged bandit data, we now obtain an evaluation policy using the training set.

In [7]:
# obtain action choice probabilities by an evaluation policy
# we construct an evaluation policy using Random Forest and parameter alpha_e
action_dist = dataset.obtain_action_dist_by_eval_policy(
    base_classifier_e=RandomForest(random_state=12345),
    alpha_e=0.9,
)

In [8]:
# which action to take for each context (a probability distribution over actions)
action_dist[:, :, 0]

array([[0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.91, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.01],
       ...,
       [0.01, 0.01, 0.91, ..., 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.91]])

## (3) Off-Policy Evaluation (OPE)
OPE attempts to estimate the performance of evaluation policies from the logged bandit data, using their action choice probabilities.

Here, we evaluate and compare the OPE performance of **Inverse Probability Weighting (IPW)**, **Direct Method (DM)**, and **Doubly Robust (DR)**.

### (3-1) Obtain a Reward Estimator
$q(x,a) \approx \hat{q}(x,a)$ with cross-fitting

In [9]:
# obp.ope.RegressionModel
regression_model = RegressionModel(
    n_actions=dataset.n_actions, # number of actions; |A|
    base_model=LogisticRegression(C=100, max_iter=10000, random_state=12345), # any sklearn classifier
)

In [10]:
estimated_rewards = regression_model.fit_predict(
    context=bandit_data["context"],
    action=bandit_data["action"],
    reward=bandit_data["reward"],
    position=bandit_data["position"],
    n_folds=2, # use 2-fold cross-fitting
    random_state=12345,
)

In [11]:
estimated_rewards[:, :, 0] # \hat{q}(x,a)

array([[0.90532942, 0.89067336, 0.93692144, ..., 0.86292398, 0.89504067,
        0.89454274],
       [0.65695587, 0.47921038, 0.58244466, ..., 0.3219239 , 0.66947205,
        0.84474829],
       [0.77594118, 0.62461982, 0.71610604, ..., 0.46194026, 0.78553332,
        0.90774452],
       ...,
       [0.76558697, 0.61078044, 0.70404139, ..., 0.44740925, 0.77549413,
        0.90271757],
       [0.99910077, 0.9981303 , 0.99876584, ..., 0.99638258, 0.99914974,
        0.99968332],
       [0.59878879, 0.41762193, 0.52085796, ..., 0.2700673 , 0.61217438,
        0.80917434]])
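
Cross-fitting means that the reward estimate for each round comes from a model that was trained without that round. The sketch below illustrates the idea under a strong simplification: instead of a classifier such as `LogisticRegression`, it uses the per-action mean reward as the "model", and it returns a two-dimensional array without the position axis. The function name `cross_fit_reward` and the toy data are hypothetical; this is not how `RegressionModel` is implemented.

```python
import numpy as np

def cross_fit_reward(action, reward, n_actions, n_folds=2, seed=0):
    """Cross-fitted per-action mean reward with shape (n_rounds, n_actions)."""
    rng = np.random.default_rng(seed)
    n_rounds = len(action)
    # assign each round to one fold at random, keeping the folds (almost) equal in size
    fold_id = rng.permutation(n_rounds) % n_folds
    q_hat = np.zeros((n_rounds, n_actions))
    for k in range(n_folds):
        train, test = fold_id != k, fold_id == k
        fallback = reward[train].mean()
        for a in range(n_actions):
            mask = train & (action == a)
            # fit on the other folds only; use the overall training mean for unseen actions
            q_hat[test, a] = reward[mask].mean() if mask.any() else fallback
    return q_hat

action = np.array([0, 1, 0, 1, 0, 1, 0, 1])
reward = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
q_hat = cross_fit_reward(action, reward, n_actions=2)
```

Because the predictions for a fold never use that fold's rewards, the estimate $\hat{q}(x_i, a)$ is not fitted to the very reward $r_i$ it is later compared against, for example in the DR correction term.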

### (3-2) Conduct OPE
$V(\pi_e) \approx \hat{V} (\pi_e; \mathcal{D}_b, \theta)$ with DM, IPW, and DR

In [None]:
# obp.ope.OffPolicyEvaluation
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_data, # bandit data
    ope_estimators=[
        InverseProbabilityWeighting(), DirectMethod(), DoublyRobust(), # used estimators
    ]
)

In [None]:
estimated_policy_value = ope.estimate_policy_values(
    action_dist=action_dist, # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards, # \hat{q}
)

In [None]:
# OPE results given by the three estimators
estimated_policy_value
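
For reference, the three estimators can be written in a few lines of NumPy. The following is a minimal sketch of their standard forms, not obp's implementation: the function name `ope_estimates` is hypothetical, and the arrays are two-dimensional with shape `(n_rounds, n_actions)`, so the position axis is dropped.

```python
import numpy as np

def ope_estimates(action, reward, pscore, pi_e, q_hat):
    """Return IPW, DM, and DR estimates of the policy value of pi_e."""
    idx = np.arange(len(action))
    # importance weight of the logged action: pi_e(a_i|x_i) / pi_b(a_i|x_i)
    w = pi_e[idx, action] / pscore
    # DM term: expected estimated reward under pi_e for each round
    dm_terms = (pi_e * q_hat).sum(axis=1)
    ipw = np.mean(w * reward)
    dm = np.mean(dm_terms)
    # DR: DM baseline plus an importance-weighted correction of its residual
    dr = np.mean(dm_terms + w * (reward - q_hat[idx, action]))
    return {"ipw": ipw, "dm": dm, "dr": dr}

# toy data: two actions, uniform behavior and evaluation policies
action = np.array([0, 1, 1, 0])
reward = np.array([1.0, 0.0, 1.0, 1.0])
pscore = np.full(4, 0.5)
pi_e = np.full((4, 2), 0.5)
q_hat = np.zeros((4, 2))  # an uninformative reward model
estimates = ope_estimates(action, reward, pscore, pi_e, q_hat)
```

With an all-zero reward model, DM returns zero and DR reduces to IPW, which shows how DR falls back on importance weighting when the regression model carries no information.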

## (4) Evaluation of OPE estimators
Our final step is **the evaluation of OPE**, which evaluates and compares the estimation accuracy of OPE estimators.

With the multi-class classification data, we can calculate the ground-truth policy value of the evaluation policy. 
Therefore, we can compare the policy values estimated by OPE estimators with the ground truth to evaluate OPE estimators.

### (4-1) Approximate the Ground-truth Policy Value
$V(\pi) \approx \frac{1}{|\mathcal{D}_{te}|} \sum_{i=1}^{|\mathcal{D}_{te}|} \mathbb{E}_{a \sim \pi(a|x_i)} [q(x_i, a)], \quad \text{where} \quad q(x,a) := \mathbb{E}_{r \sim p(r|x,a)} [r]$

In [15]:
# calculate the ground-truth performance of the evaluation policy
true_policy_value = dataset.calc_ground_truth_policy_value(action_dist=action_dist)

true_policy_value

0.8770906200317964
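
With classification data the reward is deterministic, $q(x_i, a) = 1$ if $a$ equals the label $y_i$ and $0$ otherwise, so the inner expectation reduces to $\pi(y_i|x_i)$. The sketch below computes the resulting average on toy data; the function name `ground_truth_policy_value` is hypothetical, and this is an illustration of the formula rather than obp's code.

```python
import numpy as np

def ground_truth_policy_value(pi, y):
    """Average probability that policy `pi` assigns to the correct label."""
    # with q(x, a) = 1{a == y}, E_{a ~ pi}[q(x_i, a)] equals pi(y_i | x_i)
    return pi[np.arange(len(y)), y].mean()

y = np.array([0, 2, 1])  # hypothetical true labels
pi = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
])
value = ground_truth_policy_value(pi, y)  # (0.8 + 0.1 + 0.8) / 3
```

If the evaluation policy mixes its classifier with a uniform distribution using `alpha_e=0.9` (so each row holds 0.91 for the predicted class and 0.01 elsewhere), the value above is consistent with 1212 of the 1258 evaluation labels being predicted correctly: 0.01 + 0.9 × 1212 / 1258 ≈ 0.8771.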

### (4-2) Evaluation of OPE
$SE (\hat{V}; \mathcal{D}_b) := \left( V(\pi_e) - \hat{V} (\pi_e; \mathcal{D}_b, \theta) \right)^2$,     (squared error of $\hat{V}$)

In [16]:
squared_errors = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=true_policy_value,
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards,
    metric="se", # squared error
)

In [17]:
squared_errors # DR has the smallest squared error in this run

{'ipw': 0.00015129505568611752,
 'dm': 0.010817510080477629,
 'dr': 2.467173451127295e-05}

We can iterate the above process several times and calculate the following MSE:

$MSE (\hat{V}) := T^{-1} \sum_{t=1}^T SE (\hat{V}; \mathcal{D}_b^{(t)})$

where $\mathcal{D}_b^{(t)}$ is the synthetic data in the $t$-th iteration.
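
The repeated procedure can be sketched as follows. To stay short and self-contained, the example is context-free and evaluates only IPW on simulated Bernoulli rewards; the policies, reward probabilities, and the function name `ipw_mse` are hypothetical and unrelated to the digits data used above.

```python
import numpy as np

def ipw_mse(pi_b, pi_e, q, n_rounds=1000, n_iters=20, seed=0):
    """Mean squared error of IPW over repeated synthetic draws."""
    rng = np.random.default_rng(seed)
    true_value = float(pi_e @ q)  # V(pi_e) = sum_a pi_e(a) q(a)
    squared_errors = []
    for _ in range(n_iters):
        # draw a fresh logged dataset D_b^(t) from the behavior policy
        action = rng.choice(len(q), size=n_rounds, p=pi_b)
        reward = rng.binomial(1, q[action]).astype(float)
        # IPW estimate on this dataset
        v_hat = np.mean(pi_e[action] / pi_b[action] * reward)
        squared_errors.append((true_value - v_hat) ** 2)
    # average the squared errors over the T = n_iters iterations
    return float(np.mean(squared_errors))

pi_b = np.array([0.5, 0.3, 0.2])  # hypothetical behavior policy
pi_e = np.array([0.1, 0.1, 0.8])  # hypothetical evaluation policy
q = np.array([0.2, 0.5, 0.9])     # hypothetical expected rewards
mse = ipw_mse(pi_b, pi_e, q)
```

Averaging over many iterations makes the comparison between estimators less sensitive to the randomness of a single logged dataset than the one-shot squared errors shown above.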