# A/B Testing

## Learning Objectives
- Learn what A/B testing is
- Formulate a hypothesis using the PICOT acronym
- Design a good A/B experiment
- Measure the results of an A/B test

You have almost certainly heard of an A/B test before. In most cases, we run one when we want to test a variation of a product or service and see how it influences a metric we have in mind. Some examples: what effect does adding a delay to my website have on the profit I am making (Amazon)? Which banner should I display to a user to increase their click-through rate (Netflix)?

Many of you have probably already realised that this is just applied hypothesis testing! We are testing whether a change to some product has a significant effect on a desired metric. Let's work through this using the following scenario: does changing the button on an optional newsletter signup box from blue to orange increase the number of emails collected? When doing A/B testing, we should formulate our hypothesis using the PICOT acronym (not sure why they didn't choose PIVOT as the acronym...):
- **P**opulation: The specific group of people you will be studying in the experiment
- **I**ntervention: What is the variant you are introducing
- **C**omparison: What reference are you comparing this variant against
- **O**utcome: What metric/result are you measuring
- **T**ime: How long are you running the experiment for

Why is time important? Besides the constraint that some metrics are only reported at regular intervals (e.g. every quarter), we also have to consider the **novelty factor**. That is, a change to our product may encourage users to experiment and play around with it, increasing the outcome you're measuring. However, such an increase may only come from the novelty of the new feature. In the long run, it could turn out that the variant/feature is detrimental to the outcome you're trying to measure. A lengthier experiment helps mitigate this issue.
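To make the novelty factor concrete, here is a minimal sketch of how a short experiment can mislead. Every number in it is made up for illustration: we assume the variant has a small lasting negative effect (`true_lift`) plus a temporary novelty boost that halves every few days. The function name and parameters are hypothetical, not part of any library.

```python
def simulate_daily_lift(days, true_lift, novelty_boost, half_life_days):
    """Expected lift per day: a lasting effect plus a novelty boost that decays over time."""
    return [
        true_lift + novelty_boost * 0.5 ** (day / half_life_days)
        for day in range(days)
    ]

# Assumed values: the variant hurts signups slightly in the long run,
# but curiosity gives it a boost that halves every 3 days.
lift = simulate_daily_lift(days=30, true_lift=-0.01, novelty_boost=0.05, half_life_days=3)

first_week = sum(lift[:7]) / 7       # what a one-week experiment would see
full_month = sum(lift) / len(lift)   # what a one-month experiment would see
last_week = sum(lift[-7:]) / 7       # closest to the long-run effect

print(f"first week average lift: {first_week:+.4f}")
print(f"full month average lift: {full_month:+.4f}")
print(f"last week average lift:  {last_week:+.4f}")
```

Under these assumptions the first week looks like a win, while the later weeks reveal the underlying negative effect.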

Under this acronym, what would an appropriate null and alternative hypothesis be for our experiment?
<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    H0: Non-registered visitors to our website who see the orange button will <i>not</i> produce more newsletter email signups over a period of one month than those who see the blue button. <br />
    Ha: Non-registered visitors to our website who see the orange button will produce more newsletter email signups over a period of one month than those who see the blue button.

</details>


Can you break the hypothesis down so that each part maps to a letter of PICOT?
<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    <ul>
        <li><b>P</b>opulation: Non-registered visitors of our website</li>
        <li><b>I</b>ntervention: Showing an orange button</li>
        <li><b>C</b>omparison: Showing a blue button</li>
        <li><b>O</b>utcome: Number of people who sign up to the newsletter</li>
        <li><b>T</b>ime: One month</li>
    </ul>
</details>

With that knowledge, what's wrong with the following hypotheses:
- H0: Milk is not a good combination with cookies
- H1: Milk is a good combination with cookies

<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    <ul>
        <li>No clear definition of what a 'good combination' is. Sure, they may taste great together, but they could also cause health issues down the line. Would it be a good combination then?</li>
        <li>No measurable outcome. How would we measure 'good'? It's not made clear.</li>
        <li>No defined population. Who are we targeting? In many Asian countries, where lactose intolerance is more common, people might enjoy the cookies, but (cow's) milk might make them ill. Then it wouldn't be a good combination for them.</li>
    </ul>
</details>

Now that a hypothesis has been formulated, let's look at how we can pick fair samples before diving into the statistics. We need a **treatment group** and a **control group**. The control group is measured as a baseline (e.g. they're not shown the orange button), and the treatment group is measured with the change of interest, to try to reject the null hypothesis. We can usually assign groups randomly or by sampling our treatment group from users who have opted in to a beta test, but both of these can introduce biases:

**Randomisation bias** can occur when we collect samples of our data with poor randomisation. This leads to over- or under-representation in your samples, so the distribution of some variables in one of the groups may differ from their distribution in the population. For example, different countries might have different affinities for the blue vs orange button. Capturing information from one country only, when you want to model the global population of your users, would introduce randomisation bias and may skew your results.
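One way to spot this kind of problem is to compare the make-up of a group against the population before looking at any results. The sketch below uses an invented user base (the country codes, weights and group sizes are all assumptions) and compares a treatment group drawn from a single country with one drawn by simple random sampling.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical user base: country mix is made up for illustration.
countries = ["UK", "US", "IN", "JP"]
weights = [0.4, 0.3, 0.2, 0.1]
users = [
    {"id": i, "country": country}
    for i, country in enumerate(random.choices(countries, weights, k=10_000))
]

def country_shares(group):
    """Fraction of the group that comes from each country."""
    counts = Counter(user["country"] for user in group)
    return {country: counts[country] / len(group) for country in countries}

def max_share_gap(group, population):
    """Largest absolute difference in country share between group and population."""
    group_shares = country_shares(group)
    population_shares = country_shares(population)
    return max(abs(group_shares[c] - population_shares[c]) for c in countries)

# Poor randomisation: the treatment group is taken from one country only.
biased_treatment = [user for user in users if user["country"] == "UK"][:1000]

# Simple random sample from the whole user base.
random_treatment = random.sample(users, 1000)

print("biased gap:", round(max_share_gap(biased_treatment, users), 3))
print("random gap:", round(max_share_gap(random_treatment, users), 3))
```

A large gap for any variable you believe matters (country here) is a warning sign that the group does not represent the population you want to draw conclusions about.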

**Selection bias** occurs when users are given the choice to opt in to A/B testing a feature. This bias is more dangerous than randomisation bias, as those who opt in to these kinds of tests might differ systematically from everyone else, for example by having a higher risk appetite. Selection bias means that hard-to-measure latent variables end up encoded in the sample group's results.
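The following sketch illustrates the danger with a toy simulation. It assumes a latent "engagement" score that drives both the chance of opting in and the chance of signing up, while the button colour has no effect at all. The signup model and all numbers are invented for illustration.

```python
import random

random.seed(1)

def signup_probability(engagement):
    # Assumption: signups depend only on engagement; the button has zero effect.
    return 0.05 + 0.20 * engagement

def signup_rate(group):
    """Simulate one signup decision per user and return the observed rate."""
    signups = sum(random.random() < signup_probability(e) for e in group)
    return signups / len(group)

# Latent engagement for each user, uniform on [0, 1).
users = [random.random() for _ in range(50_000)]

# Opt-in assignment: more engaged users are more likely to volunteer.
opted_in, others = [], []
for engagement in users:
    (opted_in if random.random() < engagement else others).append(engagement)

# Random assignment: shuffle and split in half.
shuffled = users[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
random_treatment, random_control = shuffled[:half], shuffled[half:]

opt_in_gap = signup_rate(opted_in) - signup_rate(others)
random_gap = signup_rate(random_treatment) - signup_rate(random_control)

print(f"opt-in treatment vs rest:    {opt_in_gap:+.4f}")
print(f"random treatment vs control: {random_gap:+.4f}")
```

Even though the variant does nothing in this simulation, the opt-in comparison shows a clear apparent lift, because engagement is mixed into the treatment group. The randomly assigned groups show a gap close to zero.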

Where possible, it is recommended to sample as randomly as you can from your data.
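As a rough sketch of what random assignment can look like in practice, the snippet below buckets visitors by hashing their ID together with an experiment name. This is one common approach rather than the only one, and the function name, the experiment label and the visitor ID format are all hypothetical. Hashing has the practical benefit that a returning visitor always lands in the same group, and the assignment does not depend on anything the visitor chooses.

```python
import hashlib

def assign_group(user_id, experiment="orange-button", treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical visitor IDs, just to check the split is roughly even.
groups = [assign_group(f"visitor-{i}") for i in range(10_000)]
treatment_fraction = groups.count("treatment") / len(groups)

print(f"share of visitors in treatment: {treatment_fraction:.3f}")
```

Including the experiment name in the hash means the same visitor can fall into different groups across different experiments, which avoids the same people always being the ones who see new variants.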

I've simulated this experiment. Let's analyse the results and determine whether or not an orange button is superior to a blue button.