# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The scenario you'll investigate uses data collected from the homepage of a music app, Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?
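
One way to answer these questions without explicit loops is to count actions and groups per visitor. The sketch below uses a tiny made-up DataFrame with the same `id`, `group`, and `action` columns; the `toy` data and its values are illustrative assumptions, not rows from 'homepage_actions.csv'.

```python
import pandas as pd

# Hypothetical toy data with the same columns as homepage_actions.csv
toy = pd.DataFrame({
    'id':     [1, 1, 2, 3, 3, 4],
    'group':  ['control', 'control', 'control',
               'experiment', 'experiment', 'experiment'],
    'action': ['view', 'click', 'view', 'view', 'click', 'view'],
})

# One row per visitor, one column per action, holding the number of events
actions_per_id = pd.crosstab(toy['id'], toy['action'])

has_view = actions_per_id['view'] > 0
has_click = actions_per_id['click'] > 0

viewed_and_clicked = (has_view & has_click).sum()
clicked_without_view = (has_click & ~has_view).sum()

# A visitor assigned to more than one group would have nunique() > 1
groups_per_id = toy.groupby('id')['group'].nunique()
in_both_groups = (groups_per_id > 1).sum()

print('Viewed and clicked: {}'.format(viewed_and_clicked))
print('Clicked without viewing: {}'.format(clicked_without_view))
print('In both groups: {}'.format(in_both_groups))
```

The same three expressions should work on the real `df`, provided both 'view' and 'click' appear in its `action` column.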

In [1]:
import pandas as pd

df = pd.read_csv('homepage_actions.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
timestamp    8188 non-null object
id           8188 non-null int64
group        8188 non-null object
action       8188 non-null object
dtypes: int64(1), object(3)
memory usage: 256.0+ KB


In [31]:
print('Number of unique visitor IDs: {}'.format(len(set(df.id))))
print('Number of visitors who clicked: {}'.format(len(df.loc[df.action == 'click'])))

# Left-join clicks to views on id; a click with no matching view is an anomaly
clicked = df.loc[df.action == 'click']
viewed = df.loc[df.action == 'view']
anomalies = clicked.set_index('id').join(viewed.set_index('id'), how='left', lsuffix='_clicked', rsuffix='_viewed')
print("Number of anomalies (clicked, but didn't view): {}".format(
    len(anomalies.loc[anomalies.action_viewed.isna()])))

# Visitors assigned to both groups would appear in the intersection of the id sets
control = set(df.loc[df.group=='control'].id)
experiment = set(df.loc[df.group=='experiment'].id)
count = len(control & experiment)
print('Number of visitors in both control and experiment: {}'.format(count))

Number of unique visitor IDs: 6328
Number of visitors who clicked: 1860
Number of anomalies (clicked, but didn't view): 0
Number of visitors in both control and experiment: 0


## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [42]:
import numpy as np
ones = np.ones(5)

In [45]:
np.concatenate((ones, ones))

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [50]:
import numpy as np

control = df.loc[df.group=='control']
c_n = len(set(control.id))
c_clicked_n = len(control.loc[control.action=='click'].id)
control_binary = np.concatenate((np.zeros(c_n-c_clicked_n), np.ones(c_clicked_n)))

experiment = df.loc[df.group=='experiment']
e_n = len(set(experiment.id))
e_clicked_n = len(experiment.loc[experiment.action=='click'].id)
experiment_binary = np.concatenate((np.zeros(e_n-e_clicked_n), np.ones(e_clicked_n)))

print('Size of control: {}'.format(c_n))
print('Size of experiment: {}'.format(e_n))
print('Fraction clicked in control: {}'.format(c_clicked_n/c_n))
print('Fraction clicked in experiment: {}'.format(e_clicked_n/e_n))

Size of control: 3332
Size of experiment: 2996
Fraction clicked in control: 0.2797118847539016
Fraction clicked in experiment: 0.3097463284379172


In [51]:
import scipy.stats as stats

stats.ttest_ind(control_binary, experiment_binary, equal_var=False)

Ttest_indResult(statistic=-2.615440020788211, pvalue=0.008932805628674203)
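
The question asks whether the experimental page was *more* effective, which is a one-sided hypothesis, while `ttest_ind` reports a two-sided p-value by default. A minimal sketch of the one-sided version follows. It assumes SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument) and rebuilds the binary arrays from the click counts implied by the output above (932 of 3332 in control, 928 of 2996 in experiment) rather than re-reading the CSV.

```python
import numpy as np
import scipy.stats as stats

# Click counts implied by the fractions printed above (assumed, not re-read from the CSV)
c_n, c_clicked_n = 3332, 932
e_n, e_clicked_n = 2996, 928

control_binary = np.concatenate((np.zeros(c_n - c_clicked_n), np.ones(c_clicked_n)))
experiment_binary = np.concatenate((np.zeros(e_n - e_clicked_n), np.ones(e_clicked_n)))

# Two-sided Welch t-test (the default)
two_sided = stats.ttest_ind(control_binary, experiment_binary, equal_var=False)

# One-sided: the alternative is that the control mean is less than the experiment mean
one_sided = stats.ttest_ind(control_binary, experiment_binary,
                            equal_var=False, alternative='less')

print('Two-sided p-value: {:.4f}'.format(two_sided.pvalue))
print('One-sided p-value: {:.4f}'.format(one_sided.pvalue))
```

Because the test statistic is negative (the control mean is lower), the one-sided p-value here is half of the two-sided one.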

## Verifying Results

One sensible way to formulate the data for the hypothesis test above is to create a binary variable for each individual in the experiment and control groups. This variable records whether the individual clicked on the homepage: 1 if they did and 0 if they did not.

The variance of the number of successes in a sample of a binomial variable with $n$ observations and success probability $p$ is given by:

$$n \cdot p \cdot (1-p)$$
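
As a quick sanity check on this formula, the sketch below simulates many binomial samples and compares their empirical variance with $n \cdot p (1-p)$. The values of `n`, `p`, the seed, and the number of simulations are arbitrary illustrative choices, not values from the lab data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable

n, p = 1000, 0.3          # arbitrary illustrative values
n_simulations = 100_000

# Each draw is the number of successes in n independent trials with probability p
successes = rng.binomial(n, p, size=n_simulations)

theoretical_var = n * p * (1 - p)
empirical_var = successes.var()

print('Theoretical variance: {:.1f}'.format(theoretical_var))
print('Empirical variance:   {:.1f}'.format(empirical_var))
```

The two values should agree to within simulation noise, and the agreement should tighten as `n_simulations` grows.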

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [52]:
expected = (c_clicked_n/c_n) * e_n
expected

838.0168067226891

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [58]:
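p = c_clicked_n / c_n
# Variance of the click count in the experiment group under the control
# click-through rate, so n is the experiment group size (e_n), not c_n
var = e_n * p * (1 - p)
std = np.sqrt(var)
print('Standard deviation: {:.2f}'.format(std))

z = (e_clicked_n - expected) / std
print('z-score: {:.2f}'.format(z))

Standard deviation: 24.57
z-score: 3.66

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [57]:
# One-sided p-value: probability of a z-score at least this large under the null
p_value = 1 - stats.norm.cdf(z)
print('p-value: {:.5f}'.format(p_value))

p-value: 0.00012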

### Analysis:

Does this result roughly match that of the previous statistical test?

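> Comment: **Roughly! Both tests indicate a statistically significant difference at the 0.05 level.** The p-values are not identical: the t-test above is two-sided and accounts for sampling variability in both groups, whereas the z-score check is one-sided and treats the control click-through rate as a fixed, known value.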
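
One way to see where the gap between the two p-values comes from is a pooled two-proportion z-test. This is a different technique from the three-step check above: it estimates a single click-through rate from both groups and accounts for sampling noise in each. The sketch below is an illustration under that assumption, using the click counts implied by the earlier output (932 of 3332 in control, 928 of 2996 in experiment).

```python
import numpy as np
import scipy.stats as stats

# Click counts implied by the fractions printed earlier (assumed, not re-read from the CSV)
c_n, c_clicked_n = 3332, 932
e_n, e_clicked_n = 2996, 928

p_control = c_clicked_n / c_n
p_experiment = e_clicked_n / e_n

# Under the null hypothesis both groups share one click-through rate
p_pooled = (c_clicked_n + e_clicked_n) / (c_n + e_n)

# Standard error of the difference between the two sample proportions
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / c_n + 1 / e_n))
z_pooled = (p_experiment - p_control) / se

p_one_sided = stats.norm.sf(z_pooled)            # sf(z) is 1 - cdf(z)
p_two_sided = 2 * stats.norm.sf(abs(z_pooled))

print('Pooled z-score: {:.2f}'.format(z_pooled))
print('One-sided p-value: {:.4f}'.format(p_one_sided))
print('Two-sided p-value: {:.4f}'.format(p_two_sided))
```

With samples this large the t and normal distributions are nearly identical, so the two-sided p-value from this sketch should land close to the t-test result.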

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the problem in a suitable manner. You also saw how to verify results, which strengthened your knowledge of binomial variables and reviewed the central limit theorem, standard deviation, z-scores, and their accompanying p-values.