In [1]:
import math as mt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.stats.api as sms

## Preface:
A randomized design is an experiment in which units are assigned to the treatment and control groups completely at random. It is used primarily when there is little opportunity to control variables, and when the volume and velocity of data are high enough that random assignment alone keeps bias small. Examples are website and phone-call experiments, where the company has little ability to control who visits a site or who calls in.

--- 

## Project Overview:
In this project, I conduct a randomized-design test for a large grocery chain. One of the company's main goals is to get more customers to download its mobile app and register for its loyalty program. One way of driving customers to download the app is a link on the company website. Page view statistics show that only 10% of people who visit the main page click through to the page that describes the app and loyalty program. <br>
I want to set up an experiment to test whether changing the page link from a button to a picture improves the click-through rate of visitors to the page that describes the app and loyalty program.

---
## Part A. Experiment Design

### STEP 1. Define Variables
**Target variable:** Has the individual clicked on the link? (Binary: Y/N) <br>
Compare the difference in the "click through rate" for the control and treatment groups.<br>
A "click through rate" measures the percentage of people that click on a link or visit a page. 

**Treatment variable**: Which web page did the individual see? (Original page/ Modified page)<br>

**Control variables**: 
Used to ensure that both the treatment and control groups are representative of the population. This gives confidence that, if the change were applied to the entire population, you could expect to see the same results that you observe in the experiment. Possible control variables are:<br>
  * `Type of people who visit the webpage`: Different types of people may have different levels of intention to click the link.<br>
e.g. socioeconomic standing, demographic profile <br>
  * `Site visits`: Ensure that we count each person only once, regardless of how many times they visit the site. <br>
e.g. IP address & cookies -> Users are split at the IP level<br>
  * `Membership`: Whether or not a visitor already has a membership card would likely influence clicking on the link.<br>
-> Remove all data from the analysis for people who already have an account
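
Splitting users at the IP level means that a returning visitor must always see the same page. The notebook does not say how the routing is implemented; the sketch below shows one common way to get a stable split, by hashing the IP address. The function name `assign_server`, the salt string, and the pool of three servers numbered 1 to 3 are assumptions made for illustration only.

```python
import hashlib

N_SERVERS = 3  # hypothetical pool size; servers are numbered 1..3


def assign_server(ip_address: str, salt: str = "app-link-test") -> int:
    """Map an IP address to a server ID in 1..N_SERVERS, always the same one."""
    digest = hashlib.sha256(f"{salt}:{ip_address}".encode("utf-8")).hexdigest()
    return int(digest, 16) % N_SERVERS + 1


# The same visitor lands on the same server, however often they return.
print(assign_server("39.13.114.2"), assign_server("39.13.114.2"))
```

Because the hash spreads addresses roughly evenly, each server should receive about one third of the distinct IP addresses.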

### STEP 2. Decide Experiment Duration
We want to make sure that all types of users have a chance to visit the site during the experiment, not just those who shop in the middle of the day, for instance. So the experiment will run for at least 24 hours, and most likely for **a whole week** to capture visitors who browse on weekends.


### STEP 3. Decide Sample Size
It is not required to have an equal number of treatment and control units. We just need to **make sure that the sample size is large enough and the groups are representative of the population**. In this project, each IP address is randomly routed to one of three servers. The experimental page is served from server 1, whereas the control page is served from servers 2 and 3. As a result, we'd expect to see about 33% of the units in the treatment group and 67% of the units in the control group.
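
How large is "large enough" depends on the lift we hope to detect. The sketch below uses the standard normal-approximation formula for comparing two proportions with unequal group sizes. The 10% baseline comes from the overview; the 12% target, the 5% significance level, and the 80% power are hypothetical choices, and `required_sample_sizes` is an illustrative helper, not part of the original analysis.

```python
import math
from statistics import NormalDist


def required_sample_sizes(p_control, p_treatment, ratio=2.0, alpha=0.05, power=0.80):
    """Return (n_treatment, n_control) for a two-sided two-proportion z-test.

    ratio is n_control / n_treatment (2.0 matches a one-third / two-thirds split).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Variance of the difference in proportions, expressed per treatment unit.
    variance = p_treatment * (1 - p_treatment) + p_control * (1 - p_control) / ratio
    n_treatment = math.ceil((z_alpha + z_power) ** 2 * variance / (p_treatment - p_control) ** 2)
    return n_treatment, math.ceil(ratio * n_treatment)


# Baseline CTR of 10% (from the overview); the 12% target is a hypothetical choice.
n_treatment, n_control = required_sample_sizes(p_control=0.10, p_treatment=0.12)
print(n_treatment, n_control)
```

Smaller lifts need many more visitors, since the required size grows with the inverse square of the difference in CTR.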

### Experiment Summary 

| Questions             | Solutions                       |
| :-------------------- |:--------------------------------| 
| Unit of Diversion     |  IP Address                     | 
| Population            |  Visitors without a membership  |  
| Duration              |  1 week                         |
| Size                  |  33% treatment and 67% control  |

---

## Part B. Results Analysis
### STEP 1. Get Data

In [2]:
data = pd.read_csv('/Users/yinchiahuang/Library/Mobile Documents/com~apple~CloudDocs/AB_testing_project/grocerywebsiteabtestdata.csv')
data

Unnamed: 0,RecordID,IP Address,LoggedInFlag,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0
5,6,23.5.199.2,1,3,0
6,7,195.12.126.2,1,1,0
7,8,97.6.126.6,0,3,1
8,9,93.10.165.4,1,1,0
9,10,180.3.76.4,1,1,0


### STEP 2. Prepare Data

**Data Description**
 * `IP Address:` <br>
     An IP Address is always routed to the same server, so an IP address cannot be in both treatment and control groups at the same time.
 
 
 * `LoggedInFlag:` <br>
     1 (the user has an account and is logged in)<br>
     0 (the user doesn't have an account and is not logged in) <br>
 * `ServerID:` <br>
     1 (treatment group)<br>
     2 (control group)<br>
     3 (control group)<br>
 * `VisitPageFlag:`<br>
     1 (the user clicked on the page with information about the loyalty program)<br>
     0 (the user didn't click on the page with information about the loyalty program)<br>
     
**Remove unwanted data points for control purposes.**
1. Make sure each IP address appears only once.
2. Remove data for users who already have an account (membership): they have no need to sign up, so they are unlikely to click the link.
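
As a minimal sketch of these two rules, the toy table below uses the same column names as the dataset, but its values are made up for illustration. Taking the maximum of `VisitPageFlag` per IP address counts a visitor as a click if any of their visits was a click.

```python
import pandas as pd

# Toy visit log with the dataset's column names (values are made up).
visits = pd.DataFrame({
    "IP Address":    ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    "LoggedInFlag":  [0, 0, 1, 0, 0],
    "ServerID":      [1, 1, 2, 3, 2],
    "VisitPageFlag": [0, 1, 0, 1, 0],
})

# Rule 1: one row per IP; a visitor counts as a click if any visit was a click.
per_ip = visits.groupby("IP Address", as_index=False).agg(
    LoggedInFlag=("LoggedInFlag", "max"),
    ServerID=("ServerID", "first"),
    VisitPageFlag=("VisitPageFlag", "max"),
)

# Rule 2: keep only visitors who were never logged in.
clean = per_ip[per_ip["LoggedInFlag"] == 0].reset_index(drop=True)
print(clean)
```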
     
 
     
 
     
 


In [3]:
# Make sure each IP address appears only once.
# An IP address that visited the website several times may have both
# VisitPageFlag == 0 and VisitPageFlag == 1 records. Taking the maximum keeps
# the click (1) if any visit led to one. LoggedInFlag is aggregated the same
# way, so a visitor who was ever logged in is treated as an account holder.
# An IP address is always routed to the same server, so ServerID is constant
# within each group.
df = (
    data.groupby('IP Address', as_index=False)
        .agg({'LoggedInFlag': 'max', 'ServerID': 'first', 'VisitPageFlag': 'max'})
)

In [4]:
# Remove data for users who already have an account (membership)
df = df[df.LoggedInFlag == 0].reset_index(drop=True)
df

Unnamed: 0,IP Address,LoggedInFlag,ServerID,VisitPageFlag
0,0.0.108.2,0,1,0
1,0.0.111.8,0,3,0
2,0.0.163.1,0,2,0
3,0.0.181.9,0,1,1
4,0.0.20.3,0,1,0
5,0.0.209.9,0,3,1
6,0.0.213.8,0,1,0
7,0.0.220.4,0,1,1
8,0.0.228.7,0,2,0
9,0.0.230.5,0,1,0


### STEP 3. Analyze Data
If treatment average > control average ... <br>
How can we be sure this result wasn't random? In other words,... <br>
How can we be sure that **the difference we see from the experiment actually exists in the entire population**?<br>

A randomized design experiment is analyzed using a test of means analysis. I use **T-test** to determine if the **target variable's mean values** of the **treatment** and **control** groups are the same. Below I will explain why and how I use T-test to test data.<br>

**Statistic Explanations:**
* `P-value`: 

>A P-value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the null hypothesis is true. It is not the probability that the actual difference between the means is 0. Generally, if the P-value is below the significance level of 0.05 (which corresponds to a 95% confidence level), the difference in means is considered statistically significant.<br>
For example, suppose that this experiment produces a P-value of 0.04. This P-value indicates that, if the treatment had no effect, you'd obtain the observed difference or a larger one in 4% of studies due to random sampling error.

* `Binomial distribution`:

> Each visitor is a Bernoulli trial (0 = no click, 1 = click), so the number of clicks in a group follows a binomial distribution.

* `Central Limit Theorem`

>The Central Limit Theorem (CLT) implies that, for **a sample of independent random variables**, the sum tends towards a normal distribution as the sample size grows, even if the original variables themselves aren't normally distributed. Because the sample mean is the sum divided by n, **the sample mean tends towards a normal distribution as well**. For the sample in this problem, each variable is Bernoulli distributed, so the sum is binomially distributed.

    **Distribution of the number of successes (sample sum):**
> By the CLT, as n increases, the sum of the random variables in a sample is approximately normally distributed.

\begin{array}{l}\mu_X=np\\\sigma_X^2=np(1-p)\\X\sim N(np,\;np(1-p))\end{array}

    **Distribution of the proportion of successes (sample mean):**

> By the CLT, as n increases, the mean of the random variables in a sample is approximately normally distributed.

\begin{array}{l}\overline X\sim N\left(\frac{np}{n},\;\frac{np(1-p)}{n^2}\right)\\\;\;\;\;=N\left(p,\;\frac{p(1-p)}{n}\right)\end{array}

* `Use T-test rather than Z-test`

> We use a t-test instead of a Z-test when the standard deviation of the population is unknown: the sample standard deviation stands in for the unknown population value, and the test statistic is compared against a t distribution. With samples as large as the ones in this experiment, the two tests give practically identical results.

> _Note: The Bernoulli case differs from the normal case. For a Bernoulli variable the SD follows from the mean, since σ² = p(1-p). For a normal variable the SD is still unknown even if the mean is known, so a normal sample still needs a t-test. E.g. given a sample of male weights, to test whether the mean equals 70 kg, use a t-test since σ is unknown._
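
The normal approximation for the sample proportion can be checked with a small simulation. The sketch below draws many samples of Bernoulli visitors and compares the mean and variance of the simulated proportions with p and p(1-p)/n. The click probability, sample size, number of samples, and seed are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is reproducible
p, n = 0.10, 400          # assumed click probability and visitors per sample
n_samples = 2000          # number of simulated samples


def sample_proportion(p, n, rng):
    """Share of clicks among n simulated Bernoulli(p) visitors."""
    return sum(rng.random() < p for _ in range(n)) / n


proportions = [sample_proportion(p, n, rng) for _ in range(n_samples)]

print("simulated mean:    ", statistics.fmean(proportions), " theory:", p)
print("simulated variance:", statistics.pvariance(proportions), " theory:", p * (1 - p) / n)
```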


#### T-Test for two populations:
Test whether the Click Through Rate (CTR) differs between the control group and the experiment group.
By the CLT, each sample mean is approximately normally distributed for large samples, and because the two groups are independent, their variances add. Then we have

\begin{array}{l}\overline{X_1}\sim N\left(p_1,\;\frac{p_1(1-p_1)}{n_1}\right)\\\overline{X_2}\sim N\left(p_2,\;\frac{p_2(1-p_2)}{n_2}\right)\\\overline{X_1}-\overline{X_2}\sim N\left(p_1-p_2,\;\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}\right)\end{array}

* `Hypothesis:` 
>\begin{array}{l}H_0:\;p_1-p_2=0\\H_1:\;p_1-p_2\neq0\end{array}


* `T-test statistic:`
>\begin{array}{l}t=\frac{\overline{X_1}-\overline{X_2}-0}{SE}\\\;\;=\frac{\overline{X_1}-\overline{X_2}-0}{\sqrt{{\displaystyle\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}}+{\displaystyle\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}}\end{array}

> Here the unknown $p_1$ and $p_2$ in the standard error are replaced by the sample proportions $\hat{p}_1=\overline{X_1}$ and $\hat{p}_2=\overline{X_2}$.
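
To see what a 5% significance level means in practice, the sketch below simulates many experiments in which H0 is true (both groups share the same CTR) and counts how often the statistic above exceeds the critical value. The common CTR, the group sizes, and the seed are assumptions; with groups this large the normal critical value 1.96 is used as a stand-in for the t critical value.

```python
import math
import random

rng = random.Random(7)                      # fixed seed for reproducibility
p_true = 0.10                               # assumed CTR shared by both groups
n_treat, n_ctrl = 500, 1000                 # assumed group sizes
n_experiments = 2000


def clicks(p, n):
    """Number of clicks among n simulated visitors with click probability p."""
    return sum(rng.random() < p for _ in range(n))


def t_statistic(c1, n1, c2, n2):
    """Difference in sample proportions divided by its unpooled standard error."""
    p1, p2 = c1 / n1, c2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se


# H0 is true in every simulated experiment, so each rejection is a false positive.
rejections = sum(
    abs(t_statistic(clicks(p_true, n_treat), n_treat, clicks(p_true, n_ctrl), n_ctrl)) > 1.96
    for _ in range(n_experiments)
)
false_positive_rate = rejections / n_experiments
print(false_positive_rate)
```

The printed rate should land close to 0.05, which is the share of experiments that reject H0 by chance alone.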

In [5]:
# number of impressions  (1 = experiment ; 2 = control)
n1 = len(df.query("ServerID==1"))
n2 = len(df.query("ServerID==2 or ServerID==3"))
# number of clicks 
c1 = len(df.query("ServerID==1 & VisitPageFlag ==1"))
c2 = len(df.query("(ServerID==2 & VisitPageFlag ==1) or (ServerID==3 & VisitPageFlag ==1)"))
# probabilities
p1 = c1/n1
p2 = c2/n2
# variance of each sample proportion
var1 = p1*(1-p1)/n1
var2 = p2*(1-p2)/n2
# Standard Error of the difference
SE = mt.sqrt(var1 + var2)
# T statistic
tt = (p1 - p2)/SE
# Degrees of freedom (conservative choice); named dof so the DataFrame df is not overwritten
dof = min(n1 -1, n2 -1)
# two-sided p_value, matching the two-sided alternative H1
p_value = 2*(1 - stats.t.cdf(abs(tt),df=dof))
# Difference 
difference = p1 - p2


In [6]:
print('T-Statistic:',round(tt,3))
print('p_value:',str(p_value))
print("Difference:", difference)

# compare the two-sided p_value with the significance level
alpha = 0.05
if p_value > alpha:
    print('These two groups are the same distributions (fail to reject H0)')
else:
    print('These two groups are different distributions (reject H0)')

# 95% confidence interval for the difference in CTR: difference ± t_crit * SE
t_crit = stats.t.ppf(1 - alpha/2, df=min(n1, n2) - 1)
ci = (difference - t_crit*SE, difference + t_crit*SE)
print('Confidence Interval:', tuple(round(float(x), 4) for x in ci))


T-Statistic: 6.849
p_value: 7.718714556403938e-12
Difference: 0.023803868905495407
These two groups are different distributions (reject H0)
Confidence Interval: (0.017, 0.0306)


* `Interpret the result:` 
> - T-Statistic = 6.849: The difference in sample means is 6.849 standard errors away from 0.
  - p_value is less than 0.05: If the two groups had the same CTR, a difference at least this large would be extremely unlikely, so we reject H0.
  - The confidence interval runs from 1.70% to 3.06%: With 95% confidence, the difference in mean CTR between the two groups is between 1.70 and 3.06 percentage points. In other words, if we repeated the experiment 100 times, about 95 of the confidence intervals calculated this way would contain the true difference in means between these two groups. <br>
  _P.S. The true difference in means has one value. If we repeated the experiment, that value wouldn't change (and we still wouldn't know what it is)._ 
  - The difference between the two means is 0.0238 (2.38 percentage points): In this experiment, the CTR of the treatment group was 2.38 percentage points higher than that of the control group.
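
The repeated-experiment reading of a confidence interval can be illustrated by simulation. The sketch below fixes a true difference in CTR, reruns the experiment many times, and counts how often the 95% interval around the observed difference contains the true value. The true CTRs, group sizes, and seed are assumptions, and the normal critical value 1.96 stands in for the t critical value.

```python
import math
import random

rng = random.Random(2024)                   # fixed seed for reproducibility
p_treat, p_ctrl = 0.12, 0.10                # assumed true CTRs
n_treat, n_ctrl = 500, 1000                 # assumed group sizes
true_difference = p_treat - p_ctrl
n_experiments = 2000


def simulate_ctr(p, n):
    """Observed CTR among n simulated visitors with click probability p."""
    return sum(rng.random() < p for _ in range(n)) / n


covered = 0
for _ in range(n_experiments):
    p1 = simulate_ctr(p_treat, n_treat)
    p2 = simulate_ctr(p_ctrl, n_ctrl)
    se = math.sqrt(p1 * (1 - p1) / n_treat + p2 * (1 - p2) / n_ctrl)
    lower, upper = (p1 - p2) - 1.96 * se, (p1 - p2) + 1.96 * se
    # The true difference never changes; only the interval moves between runs.
    covered += lower <= true_difference <= upper

coverage = covered / n_experiments
print(coverage)
```

The share of intervals that cover the true difference should come out close to 0.95.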

### STEP 4. Conclusion
From the experiment results above, we can conclude that changing the page link from a button to a picture produces a statistically significant increase in the click-through rate of visitors to the page that describes the app and loyalty program.

--- 

### References:
- https://byrony.github.io/understanding-ab-testing-and-statistics-behind.html
- https://classroom.udacity.com/courses/ud979/lessons/f06c3bc7-a908-4ab9-9b4b-02d4e39e8f41/concepts/6f09dcd6-c2df-4f20-a39c-ef63992d4b3e
