In [89]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower

We import two groups of customers, one subscribed to a desktop mailing list, the other subscribed to a laptop mailing list.

In [29]:
desktop = pd.read_csv("https://bradfordtuckfield.com/desktop.csv")
laptop = pd.read_csv("https://bradfordtuckfield.com/laptop.csv")

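In a t-test, the test statistic measures the difference between the means of two groups relative to the variability within the groups. The unpooled (Welch) form is shown below; by default, `ttest_ind` instead pools the two sample variances (`equal_var=True`):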

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$
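
As a quick check of the formula, the sketch below computes the statistic by hand and compares it with SciPy called with `equal_var=False`, which matches the unpooled denominator. The arrays `x1` and `x2` are synthetic and made up for illustration; they are not taken from the mailing-list files.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples (assumed values, for illustration only)
rng = np.random.default_rng(0)
x1 = rng.normal(loc=100, scale=15, size=30)
x2 = rng.normal(loc=110, scale=25, size=40)

# Sample variances use ddof=1 (divide by n - 1)
standard_error = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_manual = (x1.mean() - x2.mean()) / standard_error

# SciPy's unpooled (Welch) t-test should give the same statistic
t_scipy = ttest_ind(x1, x2, equal_var=False).statistic
print(t_manual, t_scipy)
```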


In [35]:
print(ttest_ind(desktop["spending"],laptop["spending"]))
print(ttest_ind(desktop["age"],laptop["age"]))
print(ttest_ind(desktop["visits"],laptop["visits"]))

Ttest_indResult(statistic=-2.109853741030508, pvalue=0.03919630411621095)
Ttest_indResult(statistic=-0.7101437106800108, pvalue=0.4804606394128761)
Ttest_indResult(statistic=0.20626752311535543, pvalue=0.8373043059847984)


There's a statistically significant (at the 5% significance level) difference in spending between the two groups (p ≈ 0.039), but not in age (p ≈ 0.48) or number of visits (p ≈ 0.84).

#### Running Experiments to Test New Hypotheses

Suppose we’re interested in studying whether changing the color of text in our marketing emails from black to blue will increase the revenue we earn as a result of the emails:

- **Hypothesis 0** - Changing the color of text in our emails from black to blue will have no effect on revenues.

- **Hypothesis 1** - Changing the color of text in our emails from black to blue will lead to a change in revenues (either an increase or a decrease).

Although it leads to problems in our analysis, for demonstration purposes, we split the desktop subscriber list into two groups based on whether age is below or above the median:

In [41]:
median_age = np.median(desktop["age"])
group_a = desktop.loc[desktop["age"] <= median_age,:]
group_b = desktop.loc[desktop["age"] > median_age,:]

In [44]:
# Read the fabricated data showing hypothetical outcomes for members of two groups:
email_results_1 = pd.read_csv("https://bradfordtuckfield.com/emailresults1.csv")

In [45]:
# Join the groups with the results:
group_a_with_revenue = group_a.merge(email_results_1, on="userid")
group_b_with_revenue = group_b.merge(email_results_1, on="userid")

In [50]:
# Perform a t-test to check whether revenue difference 
# between the two groups is statistically significant:
print(ttest_ind(group_a_with_revenue["revenue"],group_b_with_revenue["revenue"]))


Ttest_indResult(statistic=-2.186454851070545, pvalue=0.03730073920038287)


In [83]:
# Calculate the size of the difference:
results_1_effect_size = np.mean(group_b_with_revenue["revenue"])-np.mean(group_a_with_revenue["revenue"])
print(results_1_effect_size)

125.0


In [85]:
# Calculate Cohen's d:
results_1_cohen_d = results_1_effect_size / np.std(email_results_1["revenue"])
print(results_1_cohen_d)

0.763769235188029


There's a statistically significant (at the 5% level) difference of $125 between the average revenue of the two groups.

However, because we split the population by age into young and old, the A/B test is _confounded_: we can't say that the difference is due to text color.
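
To see why an age-based split is a problem, here is a minimal simulation under an assumed model in which revenue depends only on age and the email change has no effect at all. All numbers (the age range, the slope of 10, the noise level) are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(20, 70, size=n)

# Assumed model: revenue rises with age; the treatment has NO effect
revenue = 10 * age + rng.normal(0, 50, size=n)

# Split by median age, as in the confounded experiment
older = age > np.median(age)
age_split = ttest_ind(revenue[~older], revenue[older])

# Split at random instead
in_first_half = rng.permutation(n) < n // 2
random_split = ttest_ind(revenue[~in_first_half], revenue[in_first_half])

print("age split p-value:   ", age_split.pvalue)
print("random split p-value:", random_split.pvalue)
```

The age-based split produces a very small p-value even though nothing was changed between the groups, because age itself drives revenue in this model. A random split has no such built-in difference, so it rejects the null only at the chosen false-positive rate.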

This time, we split the laptop subscriber list into two random groups. We want to test whether adding a picture to a marketing email will improve revenue:

In [66]:
# Generate random indices for splitting
np.random.seed(18811015)
random_indices = np.random.permutation(laptop.index)
split_point = int(0.5 * len(random_indices))

# Split the DataFrame into two groups
group_c = laptop.loc[random_indices[:split_point]]
group_d = laptop.loc[random_indices[split_point:]]

In [76]:
# Use textbook method of splitting the DataFrame
np.random.seed(18811015)
laptop.loc[:,'group_assignment_1'] = 1*(np.random.random(len(laptop.index))>0.5)
group_c = laptop.loc[laptop['group_assignment_1'] == 0,:].copy()
group_d = laptop.loc[laptop['group_assignment_1'] == 1,:].copy()
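
Note that the two cells above both assign `group_c` and `group_d`, so the second split replaces the first. They also behave differently: the permutation approach yields two halves of equal size, while the coin-flip approach can produce groups of unequal size. As a third option, the sketch below uses `DataFrame.sample` to draw one half and `drop` to take the rest. The `subscribers` table is a hypothetical stand-in, not the real laptop data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the laptop subscriber table
subscribers = pd.DataFrame({
    "userid": np.arange(1, 31),
    "age": np.arange(20, 50),
})

# Draw half of the rows at random, reproducibly
group_1 = subscribers.sample(frac=0.5, random_state=18811015)

# Everyone not drawn goes to the other group
group_2 = subscribers.drop(group_1.index)

print(len(group_1), len(group_2))
```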


In [77]:
# Read the fabricated data showing hypothetical outcomes for members of two groups:
email_results_2 = pd.read_csv("https://bradfordtuckfield.com/emailresults2.csv")

In [78]:
# Join the groups with the results:
group_c_with_revenue = group_c.merge(email_results_2, on="userid")
group_d_with_revenue = group_d.merge(email_results_2, on="userid")

In [79]:
# Perform a t-test to check whether revenue difference 
# between the two groups is statistically significant:
print(ttest_ind(group_c_with_revenue["revenue"], group_d_with_revenue["revenue"]))

Ttest_indResult(statistic=-2.381320497676198, pvalue=0.024288828555138562)


Because the p-value (about 0.024) is below 0.05, we reject the null hypothesis and conclude that including the picture in the email has a nonzero effect on revenue.

In [86]:
results_2_effect_size = np.mean(group_d_with_revenue['revenue'])-np.mean(group_c_with_revenue['revenue'])
print(results_2_effect_size)

260.3333333333333


In [88]:
results_2_cohen_d = results_2_effect_size / np.std(email_results_2["revenue"])
print(results_2_cohen_d)

0.8207707199745888


The difference between mean revenue from Group C and mean revenue from Group D, about $260, is the size of the effect of our experiment.
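
The Cohen's d values above divide by `np.std` of all revenue, which uses the population formula (`ddof=0`) and includes the between-group difference in the spread. A more conventional definition divides by the pooled within-group standard deviation. The helper below, `cohen_d_pooled`, is a hypothetical name for a minimal sketch of that version, and the `control` and `treated` values are made up.

```python
import numpy as np

def cohen_d_pooled(a, b):
    """Cohen's d for b minus a, using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Weighted average of the two sample variances (ddof=1)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

# Made-up revenue figures for illustration
control = [100, 120, 140, 160, 180]
treated = [150, 170, 190, 210, 230]

d = cohen_d_pooled(control, treated)
print(d)  # about 1.58: a difference of 50 over a pooled SD of about 31.6
```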

#### Statistical Power

The probability that a correctly run A/B test will reject a false null hypothesis, that is, detect an effect that really exists, is called its _statistical power_.

To calculate the power of a test with `TTestIndPower`, we need to define:

- `alpha` - the chosen statistical significance threshold

- `nobs` - the number of observations in each group (passed to `solve_power` as `nobs1`)

- `effect_size` - the estimated effect size, defined in terms of Cohen's d

Suppose we run an A/B test on a group of email subscribers consisting of 90 people. We have 45 people in group A and 45 people in group B. We want to calculate the power of a test that can detect a difference as big as the one we saw in our first A/B test.

In [90]:
alpha = 0.05
nobs = 45
effect_size = 0.5 # Cohen's d taken for the $125 difference in our first A/B test; note that the cell above computed about 0.76

In [91]:
analysis = TTestIndPower()
power = analysis.solve_power(effect_size=effect_size, nobs1=nobs, alpha=alpha)

In [92]:
print(power)

0.6501855020289932


We expect a 65% chance that our test can detect an effect of this size.
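
One way to build intuition for this number is to simulate many experiments and count how often the t-test rejects the null. The sketch below assumes normally distributed outcomes with unit standard deviation, so a mean shift of 0.5 equals a Cohen's d of 0.5. The seed and the number of simulations are arbitrary choices.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_sims, n_per_group, d, alpha = 5000, 45, 0.5, 0.05

# Each row is one simulated experiment with 45 people per group
sample_a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
sample_b = rng.normal(d, 1.0, size=(n_sims, n_per_group))

# Run one t-test per row and record how often p falls below alpha
p_values = ttest_ind(sample_a, sample_b, axis=1).pvalue
simulated_power = np.mean(p_values < alpha)
print(simulated_power)
```

The fraction of rejections should land close to the analytic power reported above, with some simulation noise.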

Suppose we want to work out the number of observations we would need to detect an effect 80% of the time:

In [93]:
power = 0.8
observations = analysis.solve_power(effect_size = effect_size, power=power, alpha=alpha)

In [94]:
print(observations)

63.76561177540986


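We would need at least 64 participants in each group (128 in total): `solve_power` returns the size of the first group and assumes equal group sizes by default.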
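
As a rough cross-check, the common normal-approximation formula for a two-sided, two-sample test is $n \approx 2\,(z_{1-\alpha/2} + z_{\text{power}})^2 / d^2$ per group. The sketch below evaluates it with SciPy. It is an approximation and gives a slightly smaller number than the t-based solver, which accounts for estimating the variance.

```python
import math
from scipy.stats import norm

alpha, power, d = 0.05, 0.8, 0.5

# Standard normal quantiles for the significance level and target power
z_alpha = norm.ppf(1 - alpha / 2)
z_power = norm.ppf(power)

# Approximate per-group sample size, rounded up to a whole participant
n_approx = 2 * (z_alpha + z_power) ** 2 / d ** 2
n_per_group = math.ceil(n_approx)
print(n_approx, n_per_group)
```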