# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The data you'll be investigating was collected from the homepage of a music app page for Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data? Did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [2]:
#Your code here
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
df = pd.read_csv('homepage_actions.csv')
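The questions in the hints can be approached with plain set operations on the id column. The following is a minimal sketch on a small made-up DataFrame, not the lab data: it assumes the columns `id`, `group`, and `action` used later in this lab, and the `'view'` action label is an assumption.

```python
import pandas as pd

# Hypothetical toy data with the same columns the lab code uses
# ('id', 'group', 'action'); the 'view' label is an assumption.
toy = pd.DataFrame({
    'id':     [1, 1, 2, 3, 3, 4, 5],
    'group':  ['control', 'control', 'control', 'experiment',
               'experiment', 'experiment', 'experiment'],
    'action': ['view', 'click', 'view', 'view', 'click', 'view', 'click'],
})

viewers = set(toy.loc[toy['action'] == 'view', 'id'])
clickers = set(toy.loc[toy['action'] == 'click', 'id'])

# How many viewers also clicked?
viewed_and_clicked = viewers & clickers

# Anomaly check: ids that clicked without a recorded view
clicked_without_view = clickers - viewers

# Overlap check: ids that appear in both groups
control_ids = set(toy.loc[toy['group'] == 'control', 'id'])
experiment_ids = set(toy.loc[toy['group'] == 'experiment', 'id'])
in_both_groups = control_ids & experiment_ids

print(len(viewed_and_clicked), len(clicked_without_view), len(in_both_groups))
```

Running the same checks on `df` instead of `toy` answers the hint questions for the real data. If any ids turn up in both groups, one option is to drop them before testing, since they were exposed to both pages.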

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than the control homepage.

In [16]:
#Your code here
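# Tag every row with 1 so the pivot yields an indicator per (id, action)
df['count'] = 1

# Experiment group: one row per id, one column per action; missing actions become 0
experiment = df[df['group'] == 'experiment']
experiment = experiment.pivot(index = 'id', columns = 'action', values = 'count')
experiment.fillna(0, inplace = True)

# Control group: same reshaping
control = df[df['group'] == 'control']
control = control.pivot(index = 'id', columns = 'action', values = 'count')
control.fillna(0, inplace = True)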

print(f'The experiment click rate is {experiment.click.mean()}, and the control group click rate is {control.click.mean()}')

The experiment click rate is 0.3097463284379172, and the control group click rate is 0.2797118847539016


In [None]:
'''
H_0: The mean click rate of the experiment group equals that of the control group.
H_1: The mean click rate of the experiment group differs from that of the control group.
'''

In [5]:
def pooled_variance(a,b):
    n1 = len(a)
    n2 = len(b)
    return ((n1 - 1) * np.var(a, ddof = 1) + (n2 - 1) * np.var(b, ddof = 1))/(n1 + n2 - 2)

def twosample_tstatistic(a, b):
    numerator = (np.mean(a) - np.mean(b))
    denominator = np.sqrt(pooled_variance(a, b) * (1/len(a) + 1/(len(b))))
    return numerator/denominator

t_stat = twosample_tstatistic(experiment.click, control.click)
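# NOTE: the degrees of freedom are hard-coded as 50 + 50 - 2 = 98 here;
# for a pooled t-test they would normally be len(a) + len(b) - 2.
# Lower tail: area under the curve to the left of -t_stat
lower_tail = stats.t.cdf(-t_stat, (50+50-2), 0, 1)
# Upper tail: area under the curve to the right of t_stat
upper_tail = 1. - stats.t.cdf(t_stat, (50+50-2), 0, 1)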

p_value = lower_tail+upper_tail
print(p_value)

0.010203660092399609
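The cell above fixes the degrees of freedom at 98 rather than deriving them from the sample sizes. As a hedged sketch of the alternative, the helper below (`pooled_t_test`, a name introduced here for illustration) computes the degrees of freedom as `n1 + n2 - 2` and cross-checks the result against `scipy.stats.ttest_ind`. The data is simulated with made-up rates and sample sizes, so the numbers it prints say nothing about the lab dataset.

```python
import numpy as np
from scipy import stats

def pooled_t_test(a, b):
    """Two-sided pooled-variance t-test; returns (t statistic, p-value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2  # degrees of freedom derived from the sample sizes
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    t = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # abs(t) makes the two-sided p-value correct for either sign of t
    p = 2 * stats.t.sf(abs(t), dof)
    return t, p

# Simulated 0/1 click indicators (hypothetical rates and sizes)
rng = np.random.default_rng(0)
sim_experiment = rng.binomial(1, 0.31, size=3000)
sim_control = rng.binomial(1, 0.28, size=3300)

t_manual, p_manual = pooled_t_test(sim_experiment, sim_control)
t_scipy, p_scipy = stats.ttest_ind(sim_experiment, sim_control, equal_var=True)

# The manual and SciPy results should agree
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

Calling `pooled_t_test(experiment.click, control.click)` would apply the same calculation to the lab data.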


In [13]:
""" With p-value = 0.01 < 0.05, we reject the null hypothesis. Therefore, the difference in 
mean click rate between the experiment group and the control group is statistically significant.
"""


## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
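In symbols, the three steps can be sketched as follows. The notation is introduced here for illustration: $n_e$ is the number of users in the experiment group, $p_c$ is the control group's click-through rate, and $k_e$ is the observed number of clicks in the experiment group. The last line assumes a one-sided, upper-tail p-value, which is one reasonable reading of the question of whether the experimental page was more effective.

```latex
\begin{aligned}
\mu &= n_e \, p_c \\
\sigma &= \sqrt{n_e \, p_c \, (1 - p_c)} \\
z &= \frac{k_e - \mu}{\sigma} \\
p\text{-value} &= P(Z \ge z) = 1 - \Phi(z)
\end{aligned}
```

Here $\Phi$ is the standard normal cumulative distribution function, and the normal approximation to the binomial is justified by the large sample size.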

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [7]:
#Your code here
control_rate = control.click.mean()
expected_experiment = control_rate * len(experiment)
print(f'The expected number of clicks for the experiment group given it had the same click-through rate: {expected_experiment}')

The expected number of clicks for the experiment group given it had the same click-through rate: 838.0168067226891


### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [10]:
#Your code here
# Standard deviation of the click count under the control rate: sqrt(n * p * (1 - p))
std = np.sqrt(len(experiment) * control_rate * (1-control_rate))
# Number of standard deviations between the observed and expected click counts
z = (sum(experiment.click) - expected_experiment) / std 
print(f'The standard deviation of the expected number of clicks is {std}')
print(f'The z-score is {z}')

The standard deviation of the expected number of clicks is 24.568547907005815
The z-score is 3.6625360854823588


### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [12]:
#Your code here
p_val = stats.norm.sf(z)
print(f'p-value is {p_val}')

p-value is 0.00012486528006951198


### Analysis:

Does this result roughly match that of the previous statistical test?

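> Comment: 
Though this p-value (0.0001) is much lower than the p-value from the t-test above (0.01), both lead to the same conclusion: reject the null hypothesis. Part of the gap comes from how the tests were set up: the t-test above is two-tailed and uses 98 degrees of freedom, whereas the z-test is one-tailed and treats the control click-through rate as a fixed value.
On this evidence, the experimental homepage produced a higher click-through rate than the control homepage.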

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the central limit theorem, standard deviation, z-scores, and their accompanying p-values.