# Combining exploration and exploitation

Have you ever wanted to conduct A/B/N tests in a case where you had a prior opinion on which variant could be better suited to which customer?

Were you torn between exploration (learning from A/B/N test results) and exploitation (showing each customer the variant you already think is most suited for them)?

Did you know that you can do both at once? 

By making variant assignments random but biased towards what you think will work best for each customer, you can get the best of both worlds. You observe the actual impact (for example, the conversion rate) of your biased assignment and, *from the same experiment*, you can calculate an unbiased, model-free estimate of what the conversion rate under purely random assignment would have been, using a technique called ERUPT, also known as policy value. So beyond the usual learnings from an A/B/N test, you also get an unbiased estimate of the benefit that biasing the assignment has brought.

Suppose you're not really sure about your prior beliefs. In that case, you can also turn this around: run a fully randomized experiment, then use ERUPT to calculate from that experiment an unbiased estimate of what the impact of _any other assignment policy_ would have been! 

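To sum up: as long as your actual variant assignment policy is stochastic, whether fully random or with probabilities that depend on each customer's characteristics, you can use ERUPT after the experiment to get an unbiased estimate of what the outcome of _any other assignment policy_ would have been for the same experiment. The one requirement is that every variant the other policy would choose for a customer had a nonzero probability of being assigned to that customer in the actual experiment.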
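
To make the idea concrete, here is one common way to write such an estimator, the self-normalised inverse-propensity form. The notation is an assumption introduced for illustration: $Y_i$ is the observed outcome for customer $i$, $T_i$ the variant actually assigned, $p_i$ the ex ante probability with which $T_i$ was assigned, and $\pi(X_i)$ the variant the hypothetical policy would have chosen. causaltune's implementation may differ in details such as how extreme weights are handled.

$$
\hat{V}(\pi) = \frac{\sum_{i=1}^{N} w_i \, Y_i}{\sum_{i=1}^{N} w_i},
\qquad
w_i = \frac{\mathbb{1}\left[T_i = \pi(X_i)\right]}{p_i}
$$

Only customers whose actual variant coincides with the hypothetical policy's choice contribute, and each is up-weighted by how unlikely that variant was for them. The non-normalised version, $\frac{1}{N}\sum_i w_i Y_i$, is exactly unbiased; dividing by $\sum_i w_i$ instead of $N$ usually reduces variance at the cost of a small finite-sample bias.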

This notebook shows you how.

In [1]:
%load_ext autoreload
%autoreload 2
import os
import sys
import warnings

warnings.filterwarnings('ignore')  # suppress sklearn deprecation warnings for now

import numpy as np
import pandas as pd

# If causaltune is not installed, make it importable from a source checkout
root_path = os.path.realpath('../..')
try:
    import causaltune
except ModuleNotFoundError:
    sys.path.append(os.path.join(root_path, "causaltune"))

## Creating synthetic data for the experiment with fully random assignment

In [2]:
# Let's create some synthetic data
n=10000

# Let's create a dataset with a single feature
df = pd.DataFrame({"X": np.random.uniform(size=n)})

# Now let's create a response-to-treatment function that correlates with the feature
def outcome(x: np.ndarray, treatment: np.ndarray) -> np.ndarray:
    return 2*np.random.uniform(size=len(x)) + x*(treatment == 1)

# Let's consider a fully random treatment
df["T1"] = np.random.randint(0, 2, size=n)
# and simulate the corresponding experiment outcomes
df["Y1"] = outcome(df["X"], df["T1"])
df.head()

|   | X | T1 | Y1 |
|---|---|---|---|
| 0 | 0.898227 | 1 | 1.288637 |
| 1 | 0.462092 | 0 | 0.771976 |
| 2 | 0.858974 | 0 | 1.881019 |
| 3 | 0.228084 | 1 | 0.357797 |
| 4 | 0.962512 | 1 | 1.066413 |


## Creating synthetic data for the experiment with biased assignment

In [3]:
# Let's consider another experiment on the same population, but with 
# treatment assignment that's biased by the feature, because we believe that 
# customers with higher values of the feature will be more responsive to the treatment

df["p"] = 0.5 + 0.5*df["X"]  # probability of the binary treatment being applied
df["T2"] = (np.random.rand(len(df)) < df["p"]).astype(int)  # sample with that propensity

# We really only need the ex ante probability of the treatment that actually was applied
# This will work exactly the same way in a multi-treatment case
df["p_of_actual"] = df["p"]*df["T2"] + (1-df["p"])*(1-df["T2"])

# Now let's evaluate the outcome for this experiment

df["Y2"] = outcome(df["X"], df["T2"])

df.head()

|   | X | T1 | Y1 | p | T2 | p_of_actual | Y2 |
|---|---|---|---|---|---|---|---|
| 0 | 0.898227 | 1 | 1.288637 | 0.949114 | 1 | 0.949114 | 2.229118 |
| 1 | 0.462092 | 0 | 0.771976 | 0.731046 | 0 | 0.268954 | 0.572308 |
| 2 | 0.858974 | 0 | 1.881019 | 0.929487 | 1 | 0.929487 | 2.601592 |
| 3 | 0.228084 | 1 | 0.357797 | 0.614042 | 1 | 0.614042 | 0.542638 |
| 4 | 0.962512 | 1 | 1.066413 | 0.981256 | 1 | 0.981256 | 2.401383 |
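
The comment in the cell above says the same approach works with more than two treatments. Here is a minimal sketch of how the ex ante probability of the actually applied treatment could be computed in that case. The three-variant design, the probability formulas, and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 3  # hypothetical: 1000 customers, variants encoded as 0, 1, 2
x = rng.uniform(size=n)

# Hypothetical assignment probabilities, one column per variant; each row sums to 1.
# Variant 2 becomes more likely, and variant 0 less likely, as the feature grows.
probs = np.column_stack([
    0.4 * (1 - x) + 0.2,
    np.full(n, 0.3),
    0.4 * x + 0.1,
])

# Sample one variant per customer by inverting the cumulative probabilities.
# Only the first k-1 cumulative sums are compared, so the result stays in 0..k-1.
u = rng.uniform(size=n)
treatment = (u[:, None] > np.cumsum(probs, axis=1)[:, :-1]).sum(axis=1)

# Ex ante probability of the variant that was actually applied to each customer
p_of_actual = probs[np.arange(n), treatment]

print(np.bincount(treatment, minlength=k) / n)  # empirical share of each variant
print(p_of_actual[:5])
```

For two variants, this row-wise lookup gives the same values as the `p*T + (1-p)*(1-T)` expression used above.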


## Estimate random assignment outcome from biased assignment experiment

In [4]:
from causaltune.score.erupt_core import erupt_with_std

# Use data from the biased assignment experiment to estimate the average outcome
# that fully random assignment would have produced
est, std = erupt_with_std(actual_propensity=df["p_of_actual"],
                          actual_treatment=df["T2"],
                          actual_outcome=df["Y2"],
                          hypothetical_policy=df["T1"])


print("Average outcome of the actual biased assignment experiment:", df["Y2"].mean())
print("Estimated outcome of random assignment:", est)
print("95% confidence interval for estimated outcome:", est-2*std, est + 2*std)
print("Average outcome of the actual random assignment experiment:",  df["Y1"].mean())

Average outcome of the actual biased assignment experiment: 1.4064676444383317
Estimated outcome of random assignment: 1.2594221770638483
95% confidence interval for estimated outcome: 1.230204391668238 1.2886399624594587
Average outcome of the actual random assignment experiment: 1.2461659092712785
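
To show what kind of calculation sits behind an estimate like this, here is a self-contained sketch of a self-normalised inverse-propensity estimator applied to a fresh simulation of the same biased design. The function `ipw_policy_value` is a hypothetical helper written for this illustration, not causaltune's implementation, and it applies no clipping of small propensities.

```python
import numpy as np

def ipw_policy_value(propensity, treatment, outcome, policy):
    """Self-normalised inverse-propensity estimate of a policy's mean outcome.

    Hypothetical helper for illustration; not the causaltune implementation.
    """
    propensity, treatment, outcome, policy = map(
        np.asarray, (propensity, treatment, outcome, policy)
    )
    # Keep only rows where the actual treatment matches the policy's choice,
    # weighted by the inverse of the probability of that actual treatment.
    weights = (treatment == policy) / propensity
    return float(np.sum(weights * outcome) / np.sum(weights))

rng = np.random.default_rng(42)
n = 100_000
x = rng.uniform(size=n)

# Biased experiment with the same design as above: P(T=1 | X) = 0.5 + 0.5*X
p = 0.5 + 0.5 * x
t_biased = (rng.uniform(size=n) < p).astype(int)
p_of_actual = np.where(t_biased == 1, p, 1 - p)
y_biased = 2 * rng.uniform(size=n) + x * (t_biased == 1)

# Hypothetical policy to evaluate: one fair coin flip per customer
t_random = rng.integers(0, 2, size=n)

estimate = ipw_policy_value(p_of_actual, t_biased, y_biased, t_random)
print("Mean outcome of the biased experiment:", y_biased.mean())
print("Estimated outcome under random assignment:", estimate)
```

In this design the probability of withholding treatment approaches zero as `X` approaches 1, so a few control customers receive very large weights. Weights like these inflate the variance of the estimate, which is why practical implementations often clip or otherwise limit very small propensities.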


## Estimate biased assignment outcome from random assignment experiment

In [5]:
# Conversely, we can take the outcome of the fully random test and use it 
# to estimate what the outcome of the biased assignment would have been

hypothetical_policy = df["T2"]
# Under fully random binary assignment, each treatment had propensity 0.5
est, std = erupt_with_std(actual_propensity=pd.Series(0.5, index=df.index),
                          actual_treatment=df["T1"],
                          actual_outcome=df["Y1"],
                          hypothetical_policy=hypothetical_policy)

print("Average outcome of the actual random assignment experiment:", df["Y1"].mean())
print("Estimated outcome of biased assignment:", est)
print("95% confidence interval for estimated outcome:", est-2*std, est + 2*std)
print("Average outcome of the actual biased assignment experiment:",  df["Y2"].mean())

Average outcome of the actual random assignment experiment: 1.2461659092712785
Estimated outcome of biased assignment: 1.405112521603215
95% confidence interval for estimated outcome: 1.3814865905561569 1.428738452650273
Average outcome of the actual biased assignment experiment: 1.4064676444383317
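
The hypothetical policy does not have to be random itself. As a sketch of evaluating a different policy from fully random data, the example below scores a deterministic rule, "treat only customers with X > 0.5", and attaches a bootstrap standard error. The helper `ipw_policy_value`, the threshold rule, and the bootstrap are assumptions made for this illustration; `erupt_with_std` may compute its standard deviation differently.

```python
import numpy as np

def ipw_policy_value(propensity, treatment, outcome, policy):
    """Self-normalised inverse-propensity estimate (hypothetical helper)."""
    weights = (treatment == policy) / propensity
    return float(np.sum(weights * outcome) / np.sum(weights))

rng = np.random.default_rng(7)
n = 50_000
x = rng.uniform(size=n)

# Fully random experiment with the same outcome function as above
t = rng.integers(0, 2, size=n)
y = 2 * rng.uniform(size=n) + x * (t == 1)
propensity = np.full(n, 0.5)

# Hypothetical deterministic policy: treat only customers with X > 0.5
policy = (x > 0.5).astype(int)
estimate = ipw_policy_value(propensity, t, y, policy)

# Bootstrap standard error: resample customers with replacement and re-estimate
boot = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    boot.append(ipw_policy_value(propensity[idx], t[idx], y[idx], policy[idx]))
std_err = float(np.std(boot, ddof=1))

print("Estimated outcome of the threshold policy:", estimate)
print("Approximate 95% interval:", estimate - 2 * std_err, estimate + 2 * std_err)
```

For this simulated outcome function the expected outcome of the threshold policy can be worked out by hand as 1 + 0.375 = 1.375, which gives a reference point for the estimate.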


As you can see, in both cases the actual outcome lies within the confidence interval estimated by ERUPT.
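
Because the data are synthetic, the expected outcomes can also be derived by hand as a sanity check. With $U$ and $X$ independent and uniform on $[0, 1]$, the outcome function defined above gives

$$
\mathbb{E}[Y] = \mathbb{E}[2U] + \mathbb{E}\left[X \cdot \mathbb{1}[T = 1]\right] = 1 + \mathbb{E}\left[X \cdot P(T = 1 \mid X)\right].
$$

For fully random assignment, $P(T = 1 \mid X) = \tfrac{1}{2}$, so

$$
\mathbb{E}[Y_1] = 1 + \tfrac{1}{2} \cdot \tfrac{1}{2} = 1.25 .
$$

For the biased assignment, $P(T = 1 \mid X) = \tfrac{1}{2} + \tfrac{1}{2} X$, so

$$
\mathbb{E}[Y_2] = 1 + \tfrac{1}{2}\,\mathbb{E}[X] + \tfrac{1}{2}\,\mathbb{E}[X^2] = 1 + \tfrac{1}{4} + \tfrac{1}{6} = \tfrac{17}{12} \approx 1.417 .
$$

Both the observed averages and the ERUPT estimates above are close to these values.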

For more details on the math behind ERUPT, consult [Hitsch and Misra (2018)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3111957), who call it policy value. Note also that we assume treatments are encoded as integers from 0 to n (this n is unrelated to the sample size `n` in the code above).