
A/B testing fundamentally looks to evaluate the causal relationship between a treatment and an outcome, but before we dive into the specifics, we must understand why it's so hard to establish a causal relationship.

In any given scenario, we can only observe what happens, not what would have happened had we done something else. While this seems obvious, it also means that we can never truly be certain about what caused what.

So if we can never know what leads to what, what can we do? Instead of sitting in a dark room all day because nothing is for certain, we can collect information and begin to make judgments based on what we see. This is done constantly in our brains--when we notice that the sun tends to rise in the morning, we slowly begin to trust that the sun will rise again tomorrow morning. While the kind of evaluation done in our brains is imperfect and can lead to overgeneralizations and oversimplifications, it's the basis of inductive inference.

###How can we make valid inferences?

In the business world, we're always attempting to perfect our value offering, finding the perfect price for our product in various channels, finding the most effective product to advertise to our customers, or finding the perfect color scheme for our website to make our customers buy more. 

In all of these cases, we're trying to understand the impact of a given intervention (D), on an outcome (Y)--to do this, we must attempt to establish a relationship between the two variables.

###Historical Approaches vs. A/B Testing

Imagine you're a company that sells work boots and you're trying to figure out which kind of boot to advertise to users when they visit your website. In this case, your intervention is the kind of boot you advertise, and your outcome variable is whether or not the customer makes a purchase. You may be tempted to look at historical data and count the number of customers that bought each kind of boot in the past, but that would only uncover a correlation between the variables. Your past data may be plagued by uncontrolled issues: periods when certain boots were sold out and unable to be purchased, periods when certain kinds of boots were advertised more heavily, and many other confounding variables such as certain boots going in and out of fashion.

If you simply counted sales, you'd not only be finding a merely correlational relationship, you'd also be solving a different problem from the one you are presented with. Instead of learning which kind of boot would be best to advertise to users visiting your website, you'd be learning which boots sold best in the past, which is related to your question but isn't the same.

In order to learn what will be the most effective strategy, we must actively test each treatment. Instead of using this past data, which is marred by hundreds of confounding variables that cloud our insight, we'll actually test which one sells better. Statisticians like to call these kinds of tests RCTs (Randomized Controlled Trials), but in the business world, they're known as A/B tests.

When a user visits our website, we randomly select one of three kinds of boots as their front page and track the percent of users that make a purchase.

While many companies may simply look at which boot ad drove the most sales and make a decision, they're forgetting about the fundamental problem of causal inference: we don't know what each individual person would have done given another treatment. We could have simply happened to advertise one boot to people who really wanted to buy boots.

###The Solution

In order to truly make a causal inference, we can compare our observed results to a potential world where the treatments are known to have no impact, and see how likely our results would be if we *knew* that the treatments don't make a difference. If we can conclude that it's very unlikely that the differences observed in our treatment groups were due to variables outside of our control, we can conclude that the variables in our control (the treatment variables) truly made a difference to our outcome.

Before we continue, let's nail down some terms and assumptions.

##TERMS:

####Treatment:

Treatment is our independent variable or the thing we are changing in a given trial to test its impact on what we care about.

Our treatments, T_i, can be understood as a variable denoting which treatment is being used on which person. If the value of T_i is 1, we know that we used treatment 1 on customer i.

Individuals who received the same treatment are part of a treatment group.

####Observed Outcome:

Our observed outcome is the outcome we saw for a given person. 

Our observed outcome, Y_i, can be understood as the outcome we saw for a given individual. If the value of Y_i is 0, we know that customer i ended up doing action 0. Oftentimes, there are only two actions, 0 and 1, with the customer doing something such as purchasing or not purchasing, but the observed outcome can also be a continuous variable such as the total amount spent during the website visit.
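To make the notation concrete, here is a minimal sketch that stores treatment assignments and observed outcomes as arrays and computes the purchase rate within each treatment group. The eight customers and their values are invented purely for illustration, and the array names `T` and `Y` are just stand-ins for T_i and Y_i.

```python
import numpy as np

# Hypothetical data for 8 customers (values invented for illustration).
# T[i] is the treatment shown to customer i; Y[i] is 1 if they purchased, else 0.
T = np.array([1, 2, 1, 2, 1, 2, 1, 2])
Y = np.array([1, 0, 0, 0, 1, 1, 0, 0])

# Purchase rate within each treatment group: the mean of Y over that group.
rates = {int(t): float(Y[T == t].mean()) for t in np.unique(T)}
print(rates)
```

Because Y only takes the values 0 and 1 here, the group mean is the same thing as the share of customers who purchased.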

##ASSUMPTIONS:

Our main assumption is called SUTVA, or Stable Unit Treatment Value Assumption. 

First, we need to know that the outcome of each trial is independent of the other trials. Practically, this means we need to know that one person's purchase isn't dependent on the treatment we give to someone else.

Second, we must know that there's no hidden variation in our treatments. Practically, this means that we must know that each treatment would be executed exactly the same way for a given individual.

Finally, separate from SUTVA, we must ensure that individuals are equally likely to receive any treatment (random assignment), so that our treatment is the only systematic difference between our treatment groups.
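One way to picture equal-probability assignment is the sketch below, which gives each visitor one of three treatments uniformly at random. The visitor count, the number of treatments, and the fixed seed are assumptions chosen for the example, not details from any real system.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only so the sketch is reproducible

num_visitors = 9000   # hypothetical traffic volume
num_treatments = 3    # e.g. three different boot ads

# Each visitor independently receives treatment 0, 1, or 2 with equal probability.
assignments = rng.integers(low=0, high=num_treatments, size=num_visitors)

# Group sizes should be close to, but not exactly, num_visitors / num_treatments.
group_sizes = np.bincount(assignments, minlength=num_treatments)
print(group_sizes)
```

Because the assignment ignores everything about the visitor, traits like "really wanted to buy boots" should be spread roughly evenly across the groups.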

###EXECUTION:

While many companies simply rely on collecting a massive amount of data each time an A/B test is run so that they naturally converge to an accurate picture of reality due to the law of large numbers, inference can be done much more methodically and pragmatically.

Consider A/B tests run at scale. Instead of these broad considerations, a company such as Macy's may be running constant A/B tests with hundreds of products and hundreds of potential demographics. This kind of segmentation keeps Macy's from advertising one product to every customer regardless of their known demographics, but it creates the issue of extremely small sample sizes for each test. Instead of being able to compare 500,000 outcomes of one treatment with 500,000 outcomes of another treatment (one million site visits), we now must evaluate the impact of 100 treatments on 100 demographics, meaning the average number of samples we can collect for each treatment-demographic pair is only 100. If one in twenty people makes a purchase, we have an average of 5 positive data points, and can't really know if these 100 people happened to like shopping more, or if the treatment was truly effective.
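The arithmetic behind those numbers can be sketched as follows. The standard deviation line is an addition of mine, assuming purchases in a cell follow a binomial distribution; it shows how noisy a count of about 5 purchases is.

```python
total_visits = 1_000_000
num_treatments = 100
num_demographics = 100
purchase_rate = 1 / 20  # one in twenty visitors makes a purchase

# Visits available for each treatment-demographic pair, on average.
visits_per_cell = total_visits // (num_treatments * num_demographics)

# Expected number of purchases in one cell.
expected_purchases = visits_per_cell * purchase_rate

# Spread of that count under a binomial model (an assumption of this sketch).
std_purchases = (visits_per_cell * purchase_rate * (1 - purchase_rate)) ** 0.5

print(visits_per_cell, expected_purchases, round(std_purchases, 2))
```

A standard deviation of a bit over 2 purchases on an expected count of 5 means that swings of several purchases between cells are routine even when nothing differs.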

Let's evaluate this scenario.

Consider a scenario where we found that when 100 people were given each treatment, 4 more people bought our product with treatment 1 than with treatment 2.

Specifically, suppose there were 7 purchases in treatment group 1 and 3 purchases in treatment group 2.

This equates to a purchase rate of 7% in treatment group 1 and 3% in treatment group 2. At this point, you may be tempted to simply go with treatment 1 since, in our sample, the purchase rate in treatment group 1 was more than double that of treatment group 2.
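For reference, the observed figures in this scenario work out as in the short sketch below; the variable names are mine.

```python
group_size = 100
purchases_1, purchases_2 = 7, 3

rate_1 = purchases_1 / group_size   # purchase rate in treatment group 1
rate_2 = purchases_2 / group_size   # purchase rate in treatment group 2

absolute_difference = purchases_1 - purchases_2  # difference in purchase counts
relative_ratio = rate_1 / rate_2                 # how many times larger rate_1 is

print(f"{rate_1:.0%} vs {rate_2:.0%}, difference {absolute_difference}, ratio {relative_ratio:.2f}")
```

The ratio looks dramatic, but it rests on a gap of only 4 purchases.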

But how can we know whether or not there's a real difference between our treatments or if the difference is due to randomness outside of our control?

We know that on average, our customers made purchases 5% of the time (10/200 people made a purchase), so let's simulate a scenario where each customer, regardless of their treatment, has a 5% chance of making a purchase, and evaluate how often we see results like ours.

In [0]:
import scipy.stats
import matplotlib.pyplot as plt

num_observations = 100 #we observe 100 people on our website
num_simulations = 100000 #we'll simulate our exact situation 100,000 times

p1 = .05 #probability of purchasing if given treatment 1 (5%)
p2 = .05 #probability of purchasing if given treatment 2 (5%)
#the probabilities are equal

"""
Below, we simulate how many visitors to our website make a purchase under each treatment,
assuming that in both cases each visitor has a 5% chance of making a purchase. We use a
binomial distribution, which models a fixed number of independent events with a fixed
probability of success: in this case, 100 visitors, each with a 5% likelihood of purchasing.
We run this simulation 100,000 times.
"""

treatment_1 = scipy.stats.binom.rvs(n=num_observations, p=p1, size=num_simulations)
treatment_2 = scipy.stats.binom.rvs(n=num_observations, p=p2, size=num_simulations)

"""
treatment_1 is an array of the simulated number of purchases (one entry per simulation),
given a purchase probability of .05 and a sample size of 100 customers
"""

In reality, we found that treatment 1 led to four more purchases than treatment 2. Let's see how often a difference that large happens randomly.

In [0]:
treatment_difference = treatment_1 - treatment_2

"""
treatment_difference is the difference between the number of purchases in each treatment
group, one entry per simulation. Our real treatment difference was 4. To see how likely it
is that we would see our observed results due to random chance, let's compare our result
with the simulated results.
"""

count = 0

for difference in treatment_difference:
  if difference >= 4:
    count += 1 

"""
We run through each simulated scenario and count the number of times that the difference
between treatment group 1 and 2 is greater than or equal to our actual result.
"""

#divide by the number of simulations to get a proportion (multiplied by 100 below to print a percent):
percent_randomly_observed = count/num_simulations

print("Percent of time results as extreme or more extreme than ours were randomly generated:", percent_randomly_observed*100, "%")

Percent of time results as extreme or more extreme than ours were randomly generated: 12.393 %


There's our result: 12.39%. Even when we *know* that there's no difference between the treatment groups, about 12% of the time, due to randomness outside of our control, treatment 1 still produces at least 4 more purchases than treatment 2.
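As a cross-check on the simulation, the same probability can be computed exactly by summing over every possible purchase count in treatment group 2. This is a sketch of my own rather than part of the original analysis; it assumes the same setup (two independent groups of 100 customers, each customer purchasing with probability .05) and should land close to the simulated figure of about 12%.

```python
import numpy as np
from scipy.stats import binom

n = 100   # customers per treatment group
p = 0.05  # shared purchase probability when the treatments make no difference
observed_difference = 4

# P(X1 - X2 >= 4) = sum over k of P(X2 = k) * P(X1 >= k + 4),
# where X1 and X2 are independent Binomial(n, p) purchase counts.
k = np.arange(n + 1)

# binom.sf(x, n, p) is P(X > x), so P(X1 >= k + 4) equals binom.sf(k + 3, n, p).
tail = binom.sf(k + observed_difference - 1, n, p)
exact_probability = float(np.sum(binom.pmf(k, n, p) * tail))

print(f"Exact probability of a difference of 4 or more: {exact_probability:.4f}")
```

The simulation and the exact sum answer the same question; the simulation is easier to reason about, while the exact sum has no sampling noise.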

Thinking back to earlier, the only way we can claim that our treatments actually make an impact is to show that our results were very unlikely to be generated randomly (rather than by our treatments).

In our case, we can't infer that treatment 1 is superior to treatment 2 because we can't rule out that the difference between the groups was due to random chance. A roughly 12% chance of being fooled is unacceptable, especially when we have 100 treatments and 100 demographics, meaning we're running 10,000 smaller experiments of this type.
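To see why that rate is a problem at scale, the sketch below counts how many of the 10,000 small experiments would be expected to show a gap like ours purely by chance. It assumes, pessimistically and only for illustration, that no treatment has any real effect in any experiment.

```python
num_experiments = 100 * 100   # 100 treatments x 100 demographics
chance_rate = 0.12393         # simulated share of no-effect runs with a gap of 4 or more

# Expected number of experiments that look like "wins" with no real effect behind them.
expected_false_wins = num_experiments * chance_rate

# The same count if we only acted on results that arise by chance 5% of the time.
expected_false_wins_at_5_percent = num_experiments * 0.05

print(round(expected_false_wins), round(expected_false_wins_at_5_percent))
```

Even a stricter cutoff leaves hundreds of chance "wins" when this many experiments run at once.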

Different contexts call for different thresholds before claiming there is a causal connection. In many research contexts, a researcher needs to demonstrate that there's a less than 5% chance, and in stricter settings a less than 1% chance, that their results would have been seen completely randomly (unrelated to their treatments). And even with those guidelines, many studies still end up falsely claiming that a causal relationship is present.

This probability, the chance of seeing results at least as extreme as ours when the treatments make no difference, is called a P value. In business contexts, a P value below 5% (.05) is commonly required, meaning a team must demonstrate that their results had a less than 5% chance of being due to randomness.

While there are many mathematical approaches out there, such as the ANOVA test, or the T-Test, the essence of both is the same as this simulated test. They all evaluate whether or not we can make the assertion that our observed outcome was NOT randomly distributed by computing the probability of getting our observed data in a world where we know our treatments have no causal impact. These tests make something that's relatively simple in concept into a confusing mathematical calculation, so for the sake of understanding, I thought I'd demonstrate the simplest way to understand P values and A/B testing.
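For comparison with the simulation, here is how an off-the-shelf test handles the same 7-versus-3 result. I've used Fisher's exact test from SciPy as a representative classical test for a 2x2 table of counts; it is not one of the tests named above, and because it conditions on the total number of purchases, its P value need not match the simulated figure exactly.

```python
from scipy.stats import fisher_exact

# Rows are treatment groups; columns are [purchased, did not purchase].
table = [[7, 93],
         [3, 97]]

# alternative="greater" asks whether treatment 1 has higher purchase odds than treatment 2,
# matching the one-sided question asked by the simulation.
odds_ratio, p_value = fisher_exact(table, alternative="greater")

print(f"odds ratio: {odds_ratio:.2f}, one-sided P value: {p_value:.3f}")
```

The conclusion is the same as the simulation's: the P value is well above .05, so the data can't distinguish treatment 1 from treatment 2.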

###FINAL DISCUSSION: P Values

While we have already pointed out the problems with simply trusting the outcome of your A/B test without understanding the possibility of it being due entirely to randomness, it's equally important to cover the ongoing debate about whether the P values in common use are low enough.

A P value represents the likelihood that our results (or results more extreme) would be seen *in a world where we know that our treatments make no difference.* A P value of .05 does **not** tell us that there's a 95% chance that our treatment has a causal impact on our outcome. This distinction is crucial: it has been estimated that when the P value is .05, there's actually only a 70-80% chance that there's a causal impact, meaning that with a P value of .05, your test will lead you to take a misguided action 20-30% of the time (Johnson, 2013).
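A back-of-the-envelope way to see why a P value of .05 is not a 95% guarantee is sketched below. It is not Johnson's method (his analysis uses Bayes factors). It is a simpler illustration that assumes we know what share of tested ideas truly work (`prior_true`) and how often a real effect is detected (`power`); both numbers are hypothetical.

```python
def share_of_true_effects(prior_true, power, alpha):
    """Among tests that reach significance at level alpha, the share with a real effect."""
    true_positives = prior_true * power          # real effects that are detected
    false_positives = (1 - prior_true) * alpha   # no-effect tests that pass by chance
    return true_positives / (true_positives + false_positives)

# Hypothetical shares of tested ideas that truly work.
for prior_true in (0.5, 0.2, 0.1):
    share = share_of_true_effects(prior_true, power=0.8, alpha=0.05)
    print(f"prior {prior_true:.0%}: {share:.0%} of significant results are real")
```

The rarer real effects are among the ideas being tested, the larger the share of "significant" results that are just chance, even though the .05 cutoff never changes.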

This same study, along with a more recent paper signed by a large group of leading researchers, suggests that a P value of .005 should be required before any meaningful claim is made, and that a P value of .001 should be required to call a connection highly significant (Benjamin et al., 2018).

Take that into consideration the next time you interpret your A/B test results: even the companies being diligent and using a P value, instead of simply choosing the option with the highest observed value, are acting on misleading data up to 20-30% of the time.

This is the reason data literacy, and an understanding of how to interpret the information provided by these tests, is more important than knowledge about how to execute them.

It's the reason that in this practical data-science final I used 11 lines of code, but over 20 paragraphs of writing describing that code. I could have designed a model that took 30,000 variables into consideration and used the top of the line neural network, but without a clear understanding of what the results really mean, the model would be worthless.


References:

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., ... & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313-19317. https://www.pnas.org/content/110/48/19313.abstract