In [2]:
import numpy as np
import pandas as pd
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

#### Experiment Sizing
Now that we have our main metrics selected: number of cookies as an invariant metric, and the download rate and license purchase rate (relative to number of cookies) as evaluation metrics, we should take a look at the feasibility of the experiment in terms of the amount of time it will take to run. We can use historical data as a baseline to see what it might take to detect our desired levels of change.

Recent history shows that there are about 3250 unique visitors per day, with slightly more visitors on Friday through Monday than during the rest of the week. There are about 520 software downloads per day (a .16 rate) and about 65 licenses purchased each day (a .02 rate). In an ideal case, both the download rate and license purchase rate should increase with the new homepage; a statistically significant negative change should be a sign to not deploy the homepage change. However, if only one of our metrics shows a statistically significant positive change, we should be happy enough to deploy the new homepage.

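Suppose we want to keep the Type I error rate for falsely deploying the homepage, when it has no actual effect, at no more than 5% overall. Because a significant result on either of the two evaluation metrics is enough to deploy, we should apply the Bonferroni correction and test each metric at a 0.05 / 2 = 0.025 significance level.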
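As a quick sanity check on the correction, the sketch below splits the overall error budget across the two evaluation metrics and computes the resulting family-wise error rate. The variable names are illustrative, and the family-wise figure assumes the two tests are independent, which is a simplification.

```python
# Bonferroni correction: split the overall Type I error budget across the tests.
overall_alpha = 0.05   # maximum acceptable chance of a false deployment
n_metrics = 2          # download rate and license purchase rate

per_test_alpha = overall_alpha / n_metrics

# Chance of at least one false positive if both null hypotheses are true.
# Assumes independent tests (a simplification); Bonferroni bounds it regardless.
familywise_error = 1 - (1 - per_test_alpha) ** n_metrics

print(per_test_alpha)
print(round(familywise_error, 4))
```

The family-wise rate stays just under the 5% budget, which is the purpose of the correction.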

Let's say that we want to detect an increase of 50 downloads per day (up to 570 per day, or a .175 rate). How many days of data would we need to collect in order to get enough visitors to detect this new rate at an overall 5% Type I error rate and at 80% power?

In [3]:
# alpha = 0.025 per test after the Bonferroni correction (0.05 / 2 metrics)
# number of cookies that must be collected in each group
downloadday=NormalIndPower().solve_power(effect_size=proportion_effectsize(.175,.16),alpha=.025,power=.8,alternative='larger')

In [4]:
# total cookies across both groups, divided by ~3250 visitors per day = days needed
downloadday*2/3250

5.984162758396183

What if we wanted to detect an increase of 10 license purchases per day (up to 75 per day, or a .023 rate)? How many days of data would we need to collect in order to get enough visitors to detect this new rate at an overall 5% Type I error rate and at 80% power?

In [5]:
# number of cookies that must be collected in each group
licenseday=NormalIndPower().solve_power(effect_size=proportion_effectsize(.023,.02),alpha=.025,power=.8,alternative='larger')

In [6]:
licenseday*2/3250

22.55376921156206
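The two sizing results can be cross-checked without statsmodels. The sketch below is a re-derivation under stated assumptions rather than the library's own code: it uses the arcsine effect size (Cohen's h) and the closed-form one-sided normal-approximation sample size, n = 2(z_alpha + z_power)^2 / h^2 per group. The helper `days_needed` is hypothetical, and the traffic figure is the roughly 3250 visitors per day quoted earlier.

```python
from math import asin, sqrt
from statistics import NormalDist


def days_needed(p_base, p_new, alpha=0.025, power=0.8, visitors_per_day=3250):
    """Days of traffic for a one-sided two-proportion z-test with equal groups."""
    # Cohen's h: difference of arcsine-transformed proportions.
    h = 2 * asin(sqrt(p_new)) - 2 * asin(sqrt(p_base))
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    # Cookies required in each group.
    n_per_group = 2 * (z_alpha + z_power) ** 2 / h ** 2
    # Both groups draw from the same daily traffic.
    return 2 * n_per_group / visitors_per_day


download_days = days_needed(0.16, 0.175)
license_days = days_needed(0.02, 0.023)
print(round(download_days, 2), round(license_days, 2))
```

Under these assumptions the results should land close to the two values computed above. The license metric needs the longer run, a little over three weeks, so it determines the minimum length of the experiment.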

#### Analyze Data

In [9]:
df=pd.read_csv('data/homepage-experiment-data.csv')
df

Unnamed: 0,Day,Control Cookies,Control Downloads,Control Licenses,Experiment Cookies,Experiment Downloads,Experiment Licenses
0,1,1764,246,1,1850,339,3
1,2,1541,234,2,1590,281,2
2,3,1457,240,1,1515,274,1
3,4,1587,224,1,1541,284,2
4,5,1606,253,2,1643,292,3
5,6,1681,287,3,1780,299,3
6,7,1534,262,5,1555,276,8
7,8,1798,331,12,1787,326,20
8,9,1478,223,30,1553,298,38
9,10,1461,236,32,1458,289,23


#### Invariant Metric
First, we should check our invariant metric, the number of cookies assigned to each group. If there is a statistically significant difference detected, then we shouldn't move on to the evaluation metrics right away. We'd need to first dig deeper to see if there was an issue with the group-assignment procedure, or if there is something about the manipulation that affected the number of cookies observed, before we feel secure about analyzing and interpreting the evaluation metrics.

What is the p-value for the test on the number of cookies assigned to each group?

In [12]:
from statsmodels.stats.proportion import proportions_ztest
# experiment-group cookies, treated as the "success" count for this test
count=df['Experiment Cookies'].sum()
# total cookies across both groups
nobs=(df['Control Cookies']+df['Experiment Cookies']).sum()

In [13]:
# the experiment group could be larger or smaller than the control group,
# so we use a two-sided test (the default, alternative='two-sided')
# value=0.5 because under H0 half of all cookies land in the experiment group
proportions_ztest(count,nobs,value=.5)

(1.6128451019747376, 0.10677816462098283)

Even though there are a few hundred more cookies in the experimental group than in the control group, the difference between groups isn't statistically significant. We should feel fine about moving on to test the evaluation metrics.
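To make the invariant check concrete, the following sketch recomputes the test by hand from the group totals reported in the analysis summary (46,851 control and 47,346 experiment cookies). It assumes the standard error is based on the sample proportion, which is how we understand the `proportions_ztest` default to behave; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

control_cookies = 46851      # totals quoted in the analysis summary
experiment_cookies = 47346

n_total = control_cookies + experiment_cookies
p_hat = experiment_cookies / n_total   # observed share in the experiment group
p_null = 0.5                           # expected share under even assignment

# Standard error from the sample proportion (assumed to match the statsmodels default).
std_err = sqrt(p_hat * (1 - p_hat) / n_total)
z = (p_hat - p_null) / std_err

# Two-sided p-value: a deviation at least this large in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 3), round(p_value, 3))
```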

#### Evaluation Metrics
Assuming that the invariant metric passed inspection, we can move on to the evaluation metrics: download rate and license purchasing rate. As a refresher, the download rate is the total number of downloads divided by the number of cookies, and the license purchasing rate is the number of licenses divided by the number of cookies.

One tricky point to consider is that there is a seven or eight day delay between when most people download the software and when they make a purchase. There's no direct way of attributing cookies all the way through license purchases due to the daily aggregation of results, so the best we can do is to make a justified argument for handling the data. To answer the question below about the license purchasing rate, you should only take the cookies observed through day 21 as the denominator of the ratio as being responsible for all of the license purchases observed. (A more informed model of license purchasing could come up with a different handling of the data, such as including part of the day 22 cookies in the denominator.) (Note that we don't need to perform this kind of correction for the download rate, since the link between homepage visits and downloads is much closer.)

What is the p-value for the test on the download rate between groups?

In [14]:
contotal=df['Control Cookies'].sum()
exptotal=df['Experiment Cookies'].sum()

In [15]:
consucceed=df['Control Downloads'].sum()
expsucceed=df['Experiment Downloads'].sum()

In [16]:
# control is listed first, so alternative='smaller' tests H1: control rate < experiment rate
proportions_ztest([consucceed,expsucceed],[contotal,exptotal],alternative='smaller')

(-7.870833726066236, 1.7614279636728079e-15)
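For reference, the pooled two-proportion z-test behind this call can be sketched by hand. The helper name `pooled_ztest_smaller` is hypothetical, and the inputs are only the ten days displayed in the table above, so the output illustrates the mechanics and will not match the full-data result.

```python
from math import sqrt
from statistics import NormalDist


def pooled_ztest_smaller(count_a, n_a, count_b, n_b):
    """One-sided pooled z-test; the alternative is rate_a < rate_b."""
    rate_a, rate_b = count_a / n_a, count_b / n_b
    # Under H0 both groups share one rate, estimated from the combined data.
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Lower-tail p-value, because the alternative is "smaller".
    return z, NormalDist().cdf(z)


# Daily values for the ten displayed days only (illustrative subset).
control_cookies = [1764, 1541, 1457, 1587, 1606, 1681, 1534, 1798, 1478, 1461]
control_downloads = [246, 234, 240, 224, 253, 287, 262, 331, 223, 236]
experiment_cookies = [1850, 1590, 1515, 1541, 1643, 1780, 1555, 1787, 1553, 1458]
experiment_downloads = [339, 281, 274, 284, 292, 299, 276, 326, 298, 289]

z, p_value = pooled_ztest_smaller(
    sum(control_downloads), sum(control_cookies),
    sum(experiment_downloads), sum(experiment_cookies),
)
print(z, p_value)
```

A negative z-score here means the control rate is below the experiment rate, which is the direction the one-sided alternative is looking for.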

What is the p-value for the test on the license purchasing rate between groups?

In [17]:
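# only take the cookies observed through day 21 as the denominator;
# selecting by the Day column avoids relying on column positions
contotal_li=df.loc[df['Day']<=21,'Control Cookies'].sum()
exptotal_li=df.loc[df['Day']<=21,'Experiment Cookies'].sum()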

In [18]:
consucceed_li=df['Control Licenses'].sum()
expsucceed_li=df['Experiment Licenses'].sum()

In [19]:
proportions_ztest([consucceed_li,expsucceed_li],[contotal_li,exptotal_li],alternative='smaller')

(-0.2586750111658684, 0.3979430008399871)

#### Analyze Data
For the test of the invariant metric, number of cookies, a larger number of cookies was recorded in the experiment group: 47,346 vs. 46,851. This generates a p-value of 0.107 (z = 1.61 when the experiment group is the reference count; the sign flips if the control group is used instead), which is within a reasonable range under the null hypothesis. Since we lack sufficient reason to reject the null, we can continue on to the evaluation metrics. (Note that this doesn't mean the cookie counts were truly balanced between groups, only that we could not detect a difference if one existed.)

For the first evaluation metric, download rate, there was an extremely convincing effect. An absolute increase from 0.1612 to 0.1805 results in a z-score of 7.87 in magnitude (p ≈ 1.8e-15), far below the Bonferroni-corrected threshold of 0.025. However, the second evaluation metric, license purchasing rate, only shows a small increase from 0.0210 to 0.0213 (following the assumption that only the first 21 days of cookies account for all purchases). This results in a p-value of 0.398 (z = 0.26 in magnitude), which is not significant. The z-scores in the test output are negative only because the control group is listed first.
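It can help to put the observed changes next to the changes the experiment was sized to detect. The sketch below uses only the rates quoted in this summary and in the sizing section; the dictionary layout and names are illustrative.

```python
# (control rate, experiment rate) as reported in the analysis summary
observed = {
    "download rate": (0.1612, 0.1805),
    "license rate": (0.0210, 0.0213),
}
# Absolute increases targeted during sizing (0.16 -> 0.175 and 0.02 -> 0.023)
target_lift = {"download rate": 0.015, "license rate": 0.003}

lifts = {}
for metric, (control, experiment) in observed.items():
    absolute = experiment - control
    relative = absolute / control
    lifts[metric] = absolute
    print(f"{metric}: +{absolute:.4f} absolute, {relative:.1%} relative, "
          f"target was +{target_lift[metric]:.3f}")
```

The observed download lift exceeds its sizing target, while the observed license lift is roughly a tenth of the 0.003 increase the experiment was sized to detect, which is consistent with the non-significant result.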


#### Draw Conclusions
Despite the fact that statistical significance wasn't obtained for the number of licenses purchased, the new homepage appeared to have a strong effect on the number of downloads made. Based on our goals, this seems enough to suggest replacing the old homepage with the new homepage. Establishing whether there was a significant increase in the number of license purchases, either through the rate or the increase in the number of homepage visits, will need to wait for further experiments or data collection.

One inference we might like to make is that the new homepage attracted new users who would not normally try out the program, but that these new users didn't convert to purchases at the same rate as the existing user base. This is a nice story to tell, but we can't actually say that with the data as given. In order to make this inference, we would need more detailed information about individual visitors that isn't available. However, if the software did have the capability of reporting usage statistics, that might be a way of seeing if certain profiles are more likely to purchase a license. This might then open additional ideas for improving revenue.