# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis. It is also a chance to practice your data exploration and processing skills! The scenario you'll investigate uses data collected from the homepage of a music app, Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies with the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [1]:
#Your code here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats
import seaborn as sns

In [2]:
df = pd.read_csv('homepage_actions.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  8188 non-null   object
 1   id         8188 non-null   int64 
 2   group      8188 non-null   object
 3   action     8188 non-null   object
dtypes: int64(1), object(3)
memory usage: 256.0+ KB


In [4]:
df.head(10)

|   | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view |
| 5 | 2016-09-24 20:32:25.712659 | 681598 | experiment | view |
| 6 | 2016-09-24 20:39:03.248853 | 522116 | experiment | view |
| 7 | 2016-09-24 20:57:20.336757 | 349125 | experiment | view |
| 8 | 2016-09-24 20:58:01.948663 | 349125 | experiment | click |
| 9 | 2016-09-24 21:00:12.278374 | 560027 | control | view |


In [5]:
df.tail(10)

|   | timestamp | id | group | action |
|---|---|---|---|---|
| 8178 | 2017-01-18 08:17:12.675797 | 616692 | control | view |
| 8179 | 2017-01-18 08:53:50.910310 | 615849 | experiment | view |
| 8180 | 2017-01-18 08:54:56.879682 | 615849 | experiment | click |
| 8181 | 2017-01-18 09:07:37.661143 | 795585 | control | view |
| 8182 | 2017-01-18 09:09:17.363917 | 795585 | control | click |
| 8183 | 2017-01-18 09:11:41.984113 | 192060 | experiment | view |
| 8184 | 2017-01-18 09:42:12.844575 | 755912 | experiment | view |
| 8185 | 2017-01-18 10:01:09.026482 | 458115 | experiment | view |
| 8186 | 2017-01-18 10:08:51.588469 | 505451 | control | view |
| 8187 | 2017-01-18 10:24:08.629327 | 461199 | control | view |


In [6]:
df['timestamp'].min()

'2016-09-24 17:42:27.839496'

In [7]:
df['timestamp'].max()

'2017-01-18 10:24:08.629327'
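
The minimum and maximum above are found by comparing strings, which works here because the timestamps are written year-first. As a minimal sketch, the same two values can be parsed with `pd.to_datetime` so that the length of the observation window can be computed. The short `timestamps` series is a stand-in for `df['timestamp']` and holds only the two values shown above.

```python
import pandas as pd

# Two timestamps copied from the min/max output above; a stand-in for df['timestamp'].
timestamps = pd.Series([
    "2016-09-24 17:42:27.839496",
    "2017-01-18 10:24:08.629327",
])

# Parse the strings into datetime64 values so date arithmetic is possible.
parsed = pd.to_datetime(timestamps)

# Length of the observation window as a Timedelta.
duration = parsed.max() - parsed.min()
print(duration.days)
```

On the full dataset, the same two calls would be applied to `df['timestamp']` directly.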

In [8]:
df.groupby('group')['action'].value_counts()

group       action
control     view      3332
            click      932
experiment  view      2996
            click      928
Name: action, dtype: int64
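
The hints above ask about the id column: how many viewers also clicked, whether anyone clicked without viewing, and whether any id appears in both groups. The following is a minimal sketch of those three checks. It runs on a small made-up `sample` frame (hypothetical rows, not taken from the dataset) that has the same `id`, `group` and `action` columns; on the real data, `sample` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical sample rows with the same columns as homepage_actions.csv.
sample = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 4, 5],
    "group": ["control", "control", "control", "experiment",
              "experiment", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "view", "click", "click"],
})

# Sets of ids that appear with each action.
viewers = set(sample.loc[sample["action"] == "view", "id"])
clickers = set(sample.loc[sample["action"] == "click", "id"])

# How many viewers also clicked?
viewed_and_clicked = viewers & clickers

# Anomalies: ids that clicked without a recorded view.
clicked_without_view = clickers - viewers

# Overlap: ids that appear in more than one group.
groups_per_id = sample.groupby("id")["group"].nunique()
ids_in_both_groups = set(groups_per_id[groups_per_id > 1].index)

print(len(viewed_and_clicked), clicked_without_view, ids_in_both_groups)
```

In the made-up rows, id 5 is deliberately given a click with no view so that the anomaly check has something to find.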

#### It looks like this table covers 9/24/2016 to 1/18/2017. Each row records an individual, whether they were in the "control" or the "experiment" group, and whether the action was a "view" or a "click". An individual who viewed AND clicked has 2 entries in this dataframe, so I will need to figure out a way to account for that.

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [9]:
#Your code here
df_ctl = df[df['group'] == 'control'].reset_index(drop=True)
df_expt = df[df['group'] == 'experiment'].reset_index(drop=True)

In [10]:
ctl_tot = len(df_ctl)
print(ctl_tot)
df_ctl.head()

4264


|   | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 19:59:02.646620 | 671993 | control | view |
| 1 | 2016-09-24 21:00:12.278374 | 560027 | control | view |
| 2 | 2016-09-25 00:25:14.141290 | 281985 | control | view |
| 3 | 2016-09-25 01:14:48.603202 | 407864 | control | view |
| 4 | 2016-09-25 02:16:11.046654 | 342984 | control | view |


In [11]:
expt_tot = len(df_expt)
print(expt_tot)
df_expt.head()

3924


|   | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view |
| 3 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view |
| 4 | 2016-09-24 20:32:25.712659 | 681598 | experiment | view |


In [12]:
ctl_clicks = len(df_ctl[df_ctl['action'] == "click"])
print("Controls who 'clicked':  " + str(ctl_clicks))
expt_clicks = len(df_expt[df_expt['action'] == "click"])
print("Expts who 'clicked':  " + str(expt_clicks))

Controls who 'clicked':  932
Expts who 'clicked':  928


#### Since the totals of df_ctl and df_expt include repeat entries for anyone who viewed AND clicked, that must be accounted for here:

In [13]:
ctl_viewed_only = ctl_tot - ctl_clicks
expt_viewed_only = expt_tot - expt_clicks
print("Controls who only 'viewed':  " + str(ctl_viewed_only))
print("Expts who only 'viewed':  " + str(expt_viewed_only))

Controls who only 'viewed':  3332
Expts who only 'viewed':  2996


In [14]:
print("Total Controls:  " + str(ctl_tot) + "  Controls who only viewed:  " + str(ctl_viewed_only) + "  Controls who clicked:  " + str(ctl_clicks))
print()
print("Total Expts:  " + str(expt_tot) + "  Expts who only viewed:  " + str(expt_viewed_only) + "  Expts who clicked:  " + str(expt_clicks))

Total Controls:  4264  Controls who only viewed:  3332  Controls who clicked:  932

Total Expts:  3924  Expts who only viewed:  2996  Expts who clicked:  928


#### Since we are dealing with categorical information, a $\chi^2$ test is appropriate for this analysis

In [18]:
contingency_table = np.array([
    (ctl_clicks, ctl_viewed_only),
    (expt_clicks, expt_viewed_only)
])

contingency_table

array([[ 932, 3332],
       [ 928, 2996]])

In [19]:
stats.chi2_contingency(contingency_table)

(3.636160051233291,
 0.056537191086915774,
 1,
 array([[ 968.61748901, 3295.38251099],
        [ 891.38251099, 3032.61748901]]))
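
The second column of the table above holds 3332 and 2996, which are the counts of "view" rows in each group. If every individual who clicked also has a view row (as noted earlier for repeat entries), those counts include the clickers. The following is a hedged sketch of an alternative table, under that assumption, in which the second column counts only viewers who never clicked. It is an illustration, not the lab's official solution.

```python
import numpy as np
from scipy import stats

# Counts copied from the value_counts() output above.
ctl_views, ctl_clicks = 3332, 932
expt_views, expt_clicks = 2996, 928

# Assumption: every click row belongs to an id that also has a view row,
# so viewers who never clicked = view rows - click rows.
table = np.array([
    [ctl_clicks, ctl_views - ctl_clicks],
    [expt_clicks, expt_views - expt_clicks],
])

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2, p_value, dof)
```

With this layout each row of the table sums to the number of viewers in that group, so every individual is counted exactly once.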

#### Since alpha was not provided, we will assign the default, 5%.
p = 5.65%, which is just above the 5% threshold, so we are unable to reject the null hypothesis.

### So with the controls, 932/3332 clicked (or ~28%), and with the experimental group, 928/2996 clicked (or ~31%). The $\chi^2$ statistic is 3.636, p = 0.0565, dof = 1, and the expected frequencies are 969/3295 and 891/3033 for the control and experimental groups, respectively.

###### As far as I can tell, the solutions manual says I did this wrong, but I don't see why the $\chi^2$ test was the wrong way to go. I didn't even know there was a flatiron_stats library to import, but I see that you get p = .0045 with that method, which looks like a two-tailed test (`fs.p_value_welch_ttest(control.click, experiment.click)` was used), so I guess the answer was supposed to be that we did in fact reject the null hypothesis. However, I see no reason that the way I did this was incorrect, so I will continue the analysis using my results.
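
For comparison with the Welch's t-test mentioned above, here is a minimal sketch that uses SciPy's `stats.ttest_ind` with `equal_var=False` in place of `flatiron_stats`, which is not used here. The 0/1 arrays are rebuilt from the counts printed earlier, assuming every clicker also has a view row. Whether the `flatiron_stats` function reports a one-sided or a two-sided value is not confirmed here, so the sketch prints both.

```python
import numpy as np
from scipy import stats

# Counts copied from the value_counts() output earlier in this lab.
ctl_views, ctl_clicks = 3332, 932
expt_views, expt_clicks = 2996, 928

# One 0/1 entry per viewer: 1 if they clicked, 0 otherwise
# (assumes every clicker also has a view row).
control = np.array([1] * ctl_clicks + [0] * (ctl_views - ctl_clicks))
experiment = np.array([1] * expt_clicks + [0] * (expt_views - expt_clicks))

# equal_var=False selects Welch's t-test; SciPy reports a two-sided p-value.
t_stat, p_two_sided = stats.ttest_ind(experiment, control, equal_var=False)

# Halving gives the one-sided value for "experiment > control"
# only when the statistic is positive.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_two_sided, p_one_sided)
```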

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control groups. This binary variable would represent whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.
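
A minimal sketch of that binary variable, built on a small made-up `sample` frame (hypothetical rows with the same columns as the dataset). The column name `clicked` and the frame name `per_user` are illustrative choices, not names from the lab.

```python
import pandas as pd

# Hypothetical sample rows with the same columns as homepage_actions.csv.
sample = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 4],
    "group": ["control", "control", "control",
              "experiment", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "view", "click"],
})

# One row per (group, id); clicked is 1 if any of that id's rows is a click.
per_user = (
    sample.assign(clicked=(sample["action"] == "click").astype(int))
          .groupby(["group", "id"], as_index=False)["clicked"]
          .max()
)

# The click-through rate per group is the mean of the binary variable.
ctr = per_user.groupby("group")["clicked"].mean()
print(per_user)
print(ctr)
```

Collapsing to one row per individual removes the double counting that comes from a viewer having both a view row and a click row.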

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
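
The three steps above can be sketched as follows. This is one possible reading, not the lab's official solution. It assumes the number of "view" rows in each group equals the number of individuals who saw the page, takes the control click-through rate as clicks divided by views, and uses the number of experiment viewers as n in the binomial variance.

```python
import numpy as np
from scipy import stats

# Counts copied from the value_counts() output earlier in this lab.
ctl_views, ctl_clicks = 3332, 932
expt_views, expt_clicks = 2996, 928

# Step 1: expected experiment clicks at the control click-through rate.
ctl_rate = ctl_clicks / ctl_views
expected_clicks = ctl_rate * expt_views

# Step 2: binomial standard deviation with n = experiment viewers,
# then the z-score of the observed click count.
std = np.sqrt(expt_views * ctl_rate * (1 - ctl_rate))
z_score = (expt_clicks - expected_clicks) / std

# Step 3: one-sided p-value from the upper tail of the standard normal.
p_value = stats.norm.sf(z_score)
print(expected_clicks, std, z_score, p_value)
```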

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [24]:
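# Ratios built from the expected frequencies in the chi2_contingency output:
# expected clicks / expected "viewed only" for each group
ctl_rate = 968.61748901 / 3295.38251099
expt_rate = 891.38251099 / 3032.61748901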
expt_expected_clicks = ctl_rate * expt_viewed_only
print(str(expt_expected_clicks) + " clicks, assuming failure to reject the null hypothesis.")

880.6194690285429 clicks, assuming failure to reject the null hypothesis.


### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [26]:
# Binomial variance: n * p * (1 - p)
n = len(df)  # total number of rows in df (both groups, views and clicks)
p = ctl_rate
variance = n * p * (1 - p)
std = np.sqrt(variance)
print(std)

41.22261144925813


In [28]:
#z-score = (measurement(actual clicks) - expected-experiment-clicks)/std
z_score = (expt_clicks - expt_expected_clicks) / std
print(z_score)

1.1493820819619571


### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [30]:
#Your code here
pval = stats.norm.sf(z_score)
print(pval)

0.12519923258779564
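
When comparing this value with the $\chi^2$ result, it may help to keep in mind that `stats.norm.sf` returns only the upper tail, while a $\chi^2$ test does not distinguish direction. The sketch below reuses the z-score printed in Step 2 and shows both the one-sided value from Step 3 and its two-sided counterpart. Which of the two is the fairer comparison depends on how the hypothesis is stated.

```python
from scipy import stats

# z-score copied from the Step 2 output above.
z_score = 1.1493820819619571

# Upper-tail (one-sided) p-value, as computed in Step 3.
p_one_sided = stats.norm.sf(z_score)

# Two-sided p-value: probability of a deviation this large in either direction.
p_two_sided = 2 * stats.norm.sf(abs(z_score))
print(p_one_sided, p_two_sided)
```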


### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Yes, since the p-value calculated in this fashion is still greater than the standard cutoff of 5%, we have still failed to reject the null hypothesis.**

### I know this doesn't match the answer key, but it's wrong, and I'm right.  So there.

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the initial problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the central limit theorem, standard deviation, z-scores, and their accompanying p-values.