# What Is A/B Testing?
A/B testing, also known as split testing, is a method of comparing 2 versions of a webpage or app against each other to determine which one performs better. It is a controlled experiment in which one group of users is shown variant A and another group is shown variant B, and their responses are compared to determine which of the 2 variants is more effective.

### How does the A/B testing process typically work?
1. Objective definition: Clearly define the objective of the test. It could be improving click-through rates, conversion rates, engagement or other key performance indicators (KPIs).
2. Variant creation: Create 2 versions (A and B) of the element to be tested. This could be a webpage, an email, an advertisement or any other user interface element.
3. Random assignment: Randomly assign users or visitors to either variant A or B. Randomization helps ensure that the groups are comparable, so any differences in performance are likely due to the changes made.
4. Data collection: Collect data on the performance of each variant. This could involve metrics such as conversion rates, click-through rates, engagement or other relevant KPIs.
5. Statistical analysis: Perform statistical analysis to determine if there is a statistically significant difference between the 2 variants. This analysis helps identify whether any observed differences are likely to be real and not just due to chance.
6. Decision making: Based on the analysis, decide which variant performs better. The better-performing variant is typically implemented or used in future iterations.
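The random-assignment step can be sketched in Python. This is a minimal sketch, assuming users are identified by a string ID; hashing the experiment name together with the user ID is one common way to get an assignment that behaves like a random split but stays stable across visits. The function name `assign_variant` and the experiment name `status_color_test` are hypothetical.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a user to a variant deterministically (hypothetical helper).

    Hashing "experiment:user_id" spreads users roughly uniformly over the
    variants, and the same user always lands in the same variant.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


# Usage with made-up user IDs: the split should be close to 50/50.
split = Counter(assign_variant(f"user_{i}", "status_color_test") for i in range(10_000))
print(split)
```

Including the experiment name in the hash means a user's bucket in one experiment does not determine their bucket in another.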

### Use cases of A/B testing
- Multivariate testing.
- Split testing.
- Conversion rate optimization.
- Landing page optimization.
- Online experimentation.

### Why use A/B testing?
- A/B testing helps in making data-driven decisions about the product.
- It can be used to evaluate dynamic pricing algorithms.
- A/B testing is a causal inference technique used by data scientists to make product launch decisions.
- 2 variants of a product are shown to 2 randomly assigned, otherwise comparable groups of users. Tests are conducted and observations are made to find the preferred variant.
- A/B testing is used when there are 2 variants. Similarly, A/B/C testing is used when there are 3 variants, and so on.
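For an A/B/C test with a yes/no outcome (for example, converted or not), one standard way to check whether the 3 variants differ is a chi-square test of independence. The sketch below is a minimal pure-Python version, assuming exactly 3 variants: that gives (3 - 1) * (2 - 1) = 2 degrees of freedom, for which the chi-square p-value has the closed form exp(-x / 2). The function name and the counts in the example are made up for illustration.

```python
import math


def chi_square_abc(conversions, visitors):
    """Chi-square test of independence for conversion counts in variants A, B and C.

    conversions[i] and visitors[i] are the converted and total users of variant i.
    Returns (chi-square statistic, p-value).
    """
    if len(conversions) != 3 or len(visitors) != 3:
        raise ValueError("this sketch handles exactly 3 variants")
    overall_rate = sum(conversions) / sum(visitors)
    stat = 0.0
    for conv, n in zip(conversions, visitors):
        observed = (conv, n - conv)  # converted, not converted
        expected = (n * overall_rate, n * (1 - overall_rate))
        stat += sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Made-up counts: variant C converts 15% vs 10% for A and B.
stat, p = chi_square_abc([100, 100, 150], [1000, 1000, 1000])
print(round(stat, 2), p)
```

A small p-value only says that at least one variant differs; finding out which one needs pairwise comparisons, which brings in the multiple testing problem.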

# Framework For Business Acumen Questions
When addressing business acumen questions, it's essential to demonstrate a structured and analytical approach. 

The following framework can be used for guidance:
1. Understand the problem:
    - Clarify the goal: Ensure a clear understanding of the business objective. Is it to increase revenue, reduce costs, improve customer satisfaction, or something else?
    - Identify the key metrics: Determine the key performance indicators (KPIs) that will measure success. These could include revenue, customer acquisition cost, customer lifetime value, or other relevant metrics.
2. Formulate a hypothesis:
    - State the hypothesis: Develop a clear and testable hypothesis that addresses the problem. For example, "If we implement a new pricing strategy, we will see a 10% increase in revenue."
    - Identify the null hypothesis: The null hypothesis states that there is no effect; the alternative hypothesis is the claim being tested. In this case, the null hypothesis would be "There will be no difference in revenue after implementing the new pricing strategy."
3. Design the experiment:
    - A/B testing:
        - Control group: A group that continues with the current strategy.
        - Treatment group: A group that receives the new strategy or intervention.
    - Sample size calculation: Use statistical methods to determine the appropriate sample size to detect a meaningful difference with a desired level of confidence.
    - Randomization: Ensure that participants are randomly assigned to the control and treatment groups to minimize bias.
    - Duration: Determine the optimal duration of the experiment to collect sufficient data.
    - Metrics: Select relevant metrics to measure the impact of the intervention.
4. Data Collection and analysis:
    - Data collection: Gather data on the key metrics for both the control and treatment groups.
    - Data cleaning: Clean the data to remove errors and inconsistencies.
    - Statistical analysis: Use statistical tests (e.g., t-tests, chi-squared tests) to analyze the data and determine the significance of the results.
5. Decision making:
    - Evaluate results: Assess the results of the experiment against the null hypothesis.
    - Draw conclusions: If the results are statistically significant, reject the null hypothesis in favor of the alternative hypothesis and, if the effect is also large enough to matter, implement the new strategy.
    - Iterate and learn: Continuously monitor the impact of the new strategy and make adjustments as needed.
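The t-test mentioned in the analysis step can be sketched for a continuous metric such as revenue per user. This is a minimal sketch of Welch's t-test (which does not assume equal variances) using only the Python standard library; as a simplifying assumption, the p-value comes from the standard normal distribution, which closely approximates the t distribution at the large sample sizes typical of A/B tests. For small samples, `scipy.stats.ttest_ind(control, treatment, equal_var=False)` gives the exact Welch test. The function name and data are made up.

```python
import math
from statistics import NormalDist, fmean, variance


def welch_test(control, treatment):
    """Large-sample Welch test for a difference in means.

    Returns (treatment mean - control mean, t statistic, two-sided p-value).
    The p-value uses a normal approximation to the t distribution.
    """
    diff = fmean(treatment) - fmean(control)
    # Standard error from each group's own sample variance (n - 1 denominator).
    se = math.sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    t_stat = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return diff, t_stat, p_value


# Made-up revenue-per-user values for 500 users in each group.
control = [1.0, 2.0, 3.0, 4.0, 5.0] * 100
treatment = [2.0, 3.0, 4.0, 5.0, 6.0] * 100
diff, t_stat, p_value = welch_test(control, treatment)
print(diff, round(t_stat, 2), p_value)
```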

### Key Considerations:
- Sample segmentation: Consider segmenting the population based on relevant factors (e.g., demographics, behavior) to identify specific groups that may respond differently to the intervention.
- Ethical considerations: Ensure that the experiment is conducted ethically and does not harm participants.
- Bias mitigation: Take steps to minimize bias in the experiment design and data analysis.
- Practical constraints: Consider practical limitations, such as budget, time, and resource constraints, when designing the experiment.
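Sample segmentation can be sketched as computing the treatment-minus-control difference separately for each segment. This is a minimal sketch, assuming each record is a hypothetical `(segment, group, converted)` tuple and that every segment has users in both groups. Note that each segment comparison is an additional hypothesis test, so segmenting adds to the multiple testing problem.

```python
from collections import defaultdict


def lift_by_segment(rows):
    """Treatment-minus-control conversion rate per segment.

    rows: iterable of (segment, group, converted), where group is "control" or
    "treatment" and converted is 0 or 1. Every segment needs users in both groups.
    """
    # counts[segment][group] = [conversions, users]
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for segment, group, converted in rows:
        counts[segment][group][0] += converted
        counts[segment][group][1] += 1
    lifts = {}
    for segment, groups in counts.items():
        rate_control = groups["control"][0] / groups["control"][1]
        rate_treatment = groups["treatment"][0] / groups["treatment"][1]
        lifts[segment] = rate_treatment - rate_control
    return lifts


def make_rows(segment, group, conversions, users):
    """Build made-up records: `conversions` converted users out of `users`."""
    return [(segment, group, 1)] * conversions + [(segment, group, 0)] * (users - conversions)


rows = (make_rows("new", "control", 10, 100) + make_rows("new", "treatment", 20, 100)
        + make_rows("returning", "control", 30, 100) + make_rows("returning", "treatment", 30, 100))
print(lift_by_segment(rows))
```

In this made-up data the whole lift comes from new users, which is the kind of pattern segmentation is meant to surface.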

# Steps Involved In A/B Testing
A/B testing is a powerful method for testing different versions of a web page or app to determine which performs better. Here's a breakdown of the key steps involved:

1. Define the hypothesis:
    - Identify the problem: Clearly define the problem you're trying to solve.
    - Formulate the hypothesis: Create a clear and testable hypothesis. For example, "If we change the button color from blue to red, we will increase click-through rates by 10%."
2. Set up the experiment:
    - Control group: A group that continues with the current version.
    - Treatment group: A group that receives the new version with the proposed change.
    - Randomization: Ensure that users are randomly assigned to either group to minimize bias.
3. Determine key metrics:
    - Primary metric: The metric you want to improve (e.g., click-through rate, conversion rate, revenue).
    - Secondary metrics: Other metrics to monitor (e.g., bounce rate, time on site).
4. Calculate sample size:
    - Statistical power analysis: Determine the required sample size to detect a statistically significant difference between the control and treatment groups.
    - Consider factors:
        - Desired statistical power (e.g., 80%).
        - Significance level (e.g., 5%).
        - Expected effect size (the minimum difference you want to detect).
        - Variability in the data.
5. Run the experiment:
    - Duration: Determine the optimal duration based on the sample size and the rate of user traffic.
    - Monitor the experiment: Continuously monitor the experiment to identify any issues or unexpected behavior.
6. Analyze the results:
    - Statistical significance: Use statistical tests (e.g., t-test, chi-square test) to determine if the difference between the control and treatment groups is statistically significant.
    - Practical significance: Consider the practical implications of the results. Is the difference large enough to be meaningful?
    - Multiple testing problem: If testing multiple hypotheses, adjust the significance level to account for the increased risk of false positives.
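For a conversion-style metric, the significance test and the multiple-testing adjustment can be sketched together. This is a minimal sketch, assuming a pooled two-proportion z-test (a common choice for click-through or conversion rates with large samples) and the Bonferroni correction, which is one simple, conservative adjustment among several. The function names and counts are made up.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided pooled z-test for a difference between two rates.

    Returns (treatment rate - control rate, z statistic, p-value).
    """
    rate_c = conv_control / n_control
    rate_t = conv_treatment / n_treatment
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (rate_t - rate_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_t - rate_c, z, p_value


def bonferroni(p_values, alpha=0.05):
    """Return True for each hypothesis that stays significant at alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up primary metric: 10% vs 11% click-through with 10,000 users per group.
diff, z, p_primary = two_proportion_z_test(1_000, 10_000, 1_100, 10_000)
print(round(diff, 3), round(z, 2), round(p_primary, 4))
# If 3 metrics were tested, p_primary must also clear 0.05 / 3.
print(bonferroni([p_primary, 0.20, 0.60]))
```

With these made-up numbers the primary result is below 0.05 on its own but not below the Bonferroni threshold of 0.05 / 3 (about 0.0167), which is the situation the adjustment guards against. Practical significance remains a separate check: whether a 1 percentage point lift is worth shipping is a business judgment, not a property of the p-value.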

### Pitfalls to avoid
- Premature conclusion: Avoid drawing conclusions too early. Ensure the experiment runs for a sufficient duration and collects enough data.
- Ignoring statistical significance: Don't rely solely on intuition; use statistical tests to validate results.
- Ignoring practical significance: A statistically significant difference may not always be practically significant.
- Neglecting counter metrics: Monitor secondary metrics to ensure that the change doesn't negatively impact other aspects of the user experience.
- Overcomplicating the experiment: Keep the experiment simple and focused on one key change.
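The premature-conclusion pitfall can be demonstrated with a small simulation of A/A tests, where both groups share the same true rate, so every "significant" result is a false positive. This is a minimal sketch under made-up settings (10% rate, 1,000 users per group, 10 equally spaced looks, fixed random seed); it compares stopping at the first significant interim look against testing once at the planned sample size. The function names are illustrative.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at the 5% level


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic when both groups have n users."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation yet, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b - conv_a) / n / se


def simulate_aa_tests(n_sims=500, n_per_group=1_000, looks=10, rate=0.10, seed=7):
    """Return (false-positive rate with peeking, false-positive rate testing once)."""
    rng = random.Random(seed)
    checkpoints = {n_per_group * k // looks for k in range(1, looks + 1)}
    peek_fp = fixed_fp = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        any_significant = final_significant = False
        for i in range(1, n_per_group + 1):
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if i in checkpoints:
                significant = abs(z_stat(conv_a, conv_b, i)) > Z_CRIT
                any_significant = any_significant or significant
                final_significant = significant  # the last checkpoint is i == n_per_group
        peek_fp += any_significant
        fixed_fp += final_significant
    return peek_fp / n_sims, fixed_fp / n_sims


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first significant look: {peeking_rate:.1%}, test once: {fixed_rate:.1%}")
```

Because the final look is one of the checkpoints, the peeking rate can never be lower than the single-test rate; in runs like this it typically comes out well above 5%, while the single test stays near 5%.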

# Case Study: Colored Backgrounds For Statuses On Facebook
Say Facebook is adding colored backgrounds to statuses in order to improve user engagement. How should this be tested?

### Clarify the goal of the feature or idea:
1. What is this feature or update expected to achieve?
2. Why this specific feature update to achieve the goal, and not another feature?
3. Has this been tested before? Are other product lines also following suit?
4. Is the idea supported by previous experiment data, industry insights, reports or other evidence?
5. Is this feature for a specific user group or for all user groups?

### Discuss the metrics
Discuss the metrics that the feature is expected to impact and the data available.

1. List the candidate metrics and finalize the metric whose significance will be tested.
    - Metrics: In this case study the focus is on increasing the percentage of user engagement, DAU, etc.
    - Expectation: An increase in both the metrics is expected as a result of adding this feature.
2. Come up with a success metric, supporting metrics (if applicable) and guardrail metrics.
    - Success metric: Percentage of user engagement.
    - Supporting metric: DAU (Daily Active Users).
    - Guardrail metric (a metric that should not degrade in pursuit of the new feature): Percentage of media content (assuming media content provides more value, its share should not decrease because of the new feature).

### Experimentation: How to design an experiment?
1. Set up hypothesis (state the H0 and H1):
    - H0: There is no difference in user engagement between the treatment and control groups.
    - H1: There is a difference in user engagement between the treatment and control groups.
2. Choice of test: Since the population standard deviation is not known, a t-test can be used.
3. Choosing the experiment's treatment and control subjects:
    - Who is the experiment being run on?
    - Are all the users on the platform targeted, or should a segment of users for whom this test will be particularly well suited be picked?
4. Sample size calculation:
    - Baseline metrics:
        - Assume that before this feature launch, the user engagement is around 45%.
    - Minimum detectable effect: What change is large enough to justify taking action?
        - Assume that the business stakeholders are hoping for a 1 percentage point increase in user engagement in the treatment group (from 45% to 46%).
    - Significance level: Usually 5% (α = 0.05), i.e. 95% confidence.
    - Power: Usually 80%. In the context of A/B testing, power is the probability of correctly rejecting the null hypothesis when it is actually false. In simpler terms, it is the likelihood that the experiment will detect a true difference between the control and treatment groups. Refer to: https://www.geeksforgeeks.org/introduction-to-power-analysis-in-python/
    - With the above numbers, roughly 40,000 users will be needed in each group to run the experiment with adequate statistical power.
5. Duration of the experiment: Based on the estimated sample size and the approximate traffic,
    - Divide the total sample size by the daily traffic entering the experiment.
        - Since a sample size of 80,000 (40,000 in each group) is needed, and assuming the experiment receives 5,000 users per day, the duration of the experiment = 80,000 / 5,000 = 16 days.
6. Once the required sample size is reached, test the north star metric for statistical significance.
7. Continue to monitor the supporting and the guardrail metrics.
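The sample-size and duration arithmetic can be checked with the standard normal-approximation formula for comparing two proportions, n = (z for 1 - α/2 + z for the power)² × (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)². This is a minimal sketch, assuming the 1% increase means an absolute change from 45% to 46% (a relative 1% lift would need a much larger sample), α = 0.05 two-sided, 80% power, and 5,000 users per day entering the experiment. The function names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)


def duration_days(n_per_group, daily_users, groups=2):
    """Days of traffic needed to fill every group."""
    return math.ceil(n_per_group * groups / daily_users)


n = sample_size_per_group(0.45, 0.46)
print(n)                             # about 39,000 per group, i.e. roughly 40,000
print(duration_days(n, 5_000))       # 16
print(duration_days(40_000, 5_000))  # 80,000 / 5,000 = 16
```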

### Testing pitfalls
How to avoid common challenges or experiment biases?

- Experiment design bias:
    - Novelty or primacy effect:
        - Primacy effect: When changes happen, users who are accustomed to how things work may be reluctant to change.
            - Some users in the treatment group are reluctant to try out the new feature because they are used to the older status UI, so they stop using Facebook to post statuses.
            - So user engagement for the first 2 weeks is low: week 1 = 45% and week 2 = 48%.
            - But as these reluctant users see more users engaging with colored statuses, they start to use the feature more.
            - Therefore, from week 3 onwards, user engagement stabilizes at 62%.
            - Because of the primacy effect, it is important not to include the first 2 weeks of low user engagement when comparing with the control group.
                - Otherwise, the comparison may show no significant difference between the 2 groups in the first 2 weeks, even though user engagement with this feature increases in subsequent weeks.
        - Novelty effect: Some users are excited by the new change and use it more frequently at first.
            - Some users in the treatment group get excited about the new feature and use it heavily in the first 2 weeks (say). After that, from week 3 onwards, user engagement decreases as their excitement wears off.
            - So user engagement for the first 2 weeks is high: week 1 = 65% and week 2 = 68%.
            - But from week 3 onwards, user engagement stabilizes at 52%.
            - Because of the novelty effect, it is important not to include the first 2 weeks of high user engagement when comparing with the control group.
        - Neither of these effects is long-term, so it is important that the results are not biased by them. Treatment group results may be exaggerated or understated initially because of these effects.
        - Solutions:
            - If possible, run the experiment for longer than required to observe any novelty or primacy effect.
            - Conduct the test only on first-time users.
            - Compare first-time users with experienced users in the treatment group (this gives an estimate of the primacy or novelty effect).
    - Group interference:
        - Interference between variants is common.
        - It is important to select the sample in such a way that this interaction does not bias the result. For example, if the treatment group sees a positive effect from the new Facebook status feature, this effect can spill over to the control group (for example, control users interact with colored statuses posted by treatment users). This is called a network effect.
        - In this case, the measured difference underestimates the treatment effect.
        - In reality, the true lift may be more than 1%, but because of the network effect, actual effect > measured treatment effect.
        - Hence the test may incorrectly suggest that the new feature did not significantly impact the north star metric.
- Outcome bias:
    - Look out for other design or system issues that cause the treatment effect to be underestimated or overestimated.
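Excluding the first weeks from the comparison can be sketched as computing the lift twice, with and without a burn-in period. This is a minimal sketch using the novelty-effect numbers above (65%, 68%, then 52%), assuming a hypothetical 4-week run and a flat control group at the 45% baseline; the function name and the 2-week burn-in are illustrative.

```python
from statistics import fmean


def lift_with_and_without_burn_in(treatment_weekly, control_weekly, burn_in_weeks=2):
    """Return (lift over all weeks, lift after dropping the first burn_in_weeks).

    Inputs are weekly engagement rates (0-1) for each group, in week order.
    """
    lift_all = fmean(treatment_weekly) - fmean(control_weekly)
    lift_steady = (fmean(treatment_weekly[burn_in_weeks:])
                   - fmean(control_weekly[burn_in_weeks:]))
    return lift_all, lift_steady


# Hypothetical 4-week run: novelty-effect numbers for treatment,
# and an assumed flat control group at the 45% baseline.
treatment = [0.65, 0.68, 0.52, 0.52]
control = [0.45, 0.45, 0.45, 0.45]
lift_all, lift_steady = lift_with_and_without_burn_in(treatment, control)
print(f"all weeks: {lift_all:+.1%}, after burn-in: {lift_steady:+.1%}")
```

Here the all-weeks lift (about 14 percentage points) is roughly double the steady-state lift (about 7 points), so counting the novelty weeks would overstate the effect. Applied to the primacy-effect numbers, the same function shows the opposite bias.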

### Recommendations based on the experiment results
- Linking results to business impact:
    - Quantify the impact: Translate the metric improvements into concrete business outcomes. For example, a 1% increase in user engagement could lead to a specific increase in revenue, reduced customer churn, or increased brand loyalty.
    - Consider the cost-benefit analysis: Weigh the potential benefits of the new feature against the costs of development, testing, and deployment.
- Decision-making in conflicting scenarios:
    - Trade-offs: If a change leads to an increase in one metric but a decrease in another, carefully evaluate the trade-offs.
        - Example: Increased user engagement might lead to a decrease in daily active users. In this case, consider the revenue impact of the increased engagement. If the revenue gain outweighs the loss in daily active users, the feature might still be worth launching.
    - Long-term implications: Consider the long-term impact of the change. A short-term boost in engagement might not be sustainable if it negatively impacts user experience or brand reputation.
        - Example: A feature that increases short-term engagement but decreases user satisfaction in the long run could harm the brand's reputation.
- Recommendations:
    - Clear and concise language: Use clear and concise language to convey the key findings and recommendations.
    - Data-driven insights: Back up your recommendations with data and statistical analysis.
    - Actionable recommendations: Provide specific recommendations for implementation, such as launching the feature, making further optimizations, or abandoning the experiment.
    - Risk assessment: Identify potential risks and mitigation strategies.
    - Continuous monitoring: Recommend ongoing monitoring of key metrics to track the long-term impact of the change.
- Example recommendation: "Based on the A/B test results, the introduction of colored posts resulted in a statistically significant increase in user engagement. While there was a slight decrease in daily active users, the revenue generated from increased engagement outweighed this loss. Therefore, we recommend launching the colored post feature to all users. However, it is essential to continue monitoring user behavior and making adjustments as needed to optimize the user experience."