# Analysis of A/B Tests with a Statistical Foundation

## Objective

Data analysts frequently produce reports comparing results before and after an experiment. Experiments can be setup changes, such as a new call-to-action creative, a different button position on a website, or a different form size or type. A traditional A/B test requires the control and test populations to be active during the same time window, so that seasonality effects do not bias the interpretation. Even so, ordinary comparisons that are not specifically structured as A/B tests can still benefit from statistical significance analysis.

In the following exercises we explore two of the most common comparison scenarios: the comparison of continuous metrics and the comparison of proportions. It is important to establish from the outset whether the compared metrics are continuous or proportions, because that determines whether you run a t-test (for continuous variables) or a z-test (for proportions).

**Continuous** variables are numerical values that can take any value within a range and are not restricted to integers. They are **not** proportions and can exceed 1. Examples:

* *Avg Conversions per User*,
* *Avg Ticket*,
* *Avg Number of Products per Transaction*,
* *Avg Session Time*

None of these metrics is a percentage (they are not constrained to the range between 0 and 1). They are compared with t-tests: **Student's t-test** (when the variances are assumed to be equal) or **Welch's t-test** (when the variances of the two compared groups are unequal).
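As a minimal sketch of what Welch's t-test computes from summary statistics, the statistic and its Welch-Satterthwaite degrees of freedom can be derived with the standard library alone. The helper name `welch_t_from_stats` is hypothetical, and the inputs are the conversions-per-user aggregates used in this notebook; SciPy's `ttest_ind_from_stats(..., equal_var=False)` should report the same statistic.

```python
import math

def welch_t_from_stats(mean1, var1, n1, mean2, var2, n2):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics.

    Hypothetical helper for illustration; it takes variances, not standard deviations.
    """
    # Squared standard error contributed by each group
    se1_sq = var1 / n1
    se2_sq = var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t, df

# Conversions per user: before (mean 1.60, var 0.50, n 53430) vs after (mean 1.41, var 0.60, n 9701)
t_stat, dof = welch_t_from_stats(1.60, 0.50, 53430, 1.41, 0.60, 9701)
print(f"t = {t_stat:.4f}, df = {dof:.1f}")
```

Student's t-test would instead pool the two variances into a single estimate, which is only appropriate when the groups can be assumed to share the same variance.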

**Proportions**, on the other hand, represent the fraction of events of interest out of the total possible events. Their values fall between 0 and 1, so they can be expressed as percentages. Examples:

* *Shoppers %* (shoppers / visitors, where shoppers are visitors who had transactions),
* *Late Deliveries %* (number of late deliveries / total deliveries),
* *CTR* (Clickthrough Rate - the number of clicks / displayed ads),
* *CR* (Conversion Rate - the number of conversions / number of clicks; the definition of a conversion can vary with the attribution model and time window).

To compare proportions between two groups, we use **z-tests for proportions**, suitable for binary outcomes representing success/failure events.
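To make the mechanics concrete, here is a rough sketch of a pooled two-proportion z-test written with the standard library only. The function name `two_proportion_ztest` is hypothetical, and the example figures are the click and display counts from this notebook's aggregated data; `proportions_ztest` from statsmodels is the call the notebook actually relies on.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(successes1, trials1, successes2, trials2):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    p1 = successes1 / trials1
    p2 = successes2 / trials2
    # Under the null hypothesis both groups share one proportion, so pool them
    pooled = (successes1 + successes2) / (trials1 + trials2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials1 + 1 / trials2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# CTR example: clicks / displays, before vs after the setup change
z_stat, p_val = two_proportion_ztest(49740, 1058307, 9620, 198495)
print(f"z = {z_stat:.4f}, p-value = {p_val:.4f}")
```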

In [199]:
##-----------------------------
# Libraries
##-----------------------------
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind_from_stats
#https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html

from statsmodels.stats.proportion import proportions_ztest
# https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html


# Aggregated data
data = {
    'Period': ['Before Setup Change', 'After Setup Change'],
    'nb_users': [53430, 9701],
    'avg_conv_per_user': [1.60, 1.41],
    'var_conv_per_user': [0.50, 0.60],  
    'avg_ticket': [33.43, 34.53],
    'var_avg_ticket': [150.0, 160.0],
    'nb_displays': [1058307, 198495],  
    'nb_clicks': [49740, 9620],         
    'nb_conversions': [4904, 1692],     
}

df = pd.DataFrame(data)

#Calculate CTR (Click-Through Rate)
df['CTR'] = df['nb_clicks'] / df['nb_displays']

#Calculate CR (Conversion Rate)
df['CR'] = df['nb_conversions'] / df['nb_clicks']

display(df)

| | Period | nb_users | avg_conv_per_user | var_conv_per_user | avg_ticket | var_avg_ticket | nb_displays | nb_clicks | nb_conversions | CTR | CR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Before Setup Change | 53430 | 1.6 | 0.5 | 33.43 | 150.0 | 1058307 | 49740 | 4904 | 0.047 | 0.098593 |
| 1 | After Setup Change | 9701 | 1.41 | 0.6 | 34.53 | 160.0 | 198495 | 9620 | 1692 | 0.048465 | 0.175884 |


In [237]:
##-----------------------------
#Split df in before and after
##-----------------------------
df_before = df[df.Period == 'Before Setup Change']
df_after = df[df.Period == 'After Setup Change']

##-----------------------------
# Comparison of Average Transactions per User using T-Test
##-----------------------------
mean1 = df_before.avg_conv_per_user.iloc[0]
std1 = np.sqrt(df_before.var_conv_per_user.iloc[0])
nobs1 = df_before.nb_users.iloc[0]

mean2 = df_after.avg_conv_per_user.iloc[0]
std2 = np.sqrt(df_after.var_conv_per_user.iloc[0])
nobs2 = df_after.nb_users.iloc[0]

t_stat, p_value = ttest_ind_from_stats(mean1, std1, nobs1,
                                              mean2, std2, nobs2,
                                              alternative='two-sided', equal_var=False)

print('Average Transactions per User:')
print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4e}')

print('------')
metric='avg_conv_per_user'
label_input = ['','<='] if p_value < 0.05 else ['not ','>']
print(f'The difference in {metric} from {mean1} to {mean2} is {label_input[0]}significant (p-value{label_input[1]}0.05)')
##-----------------------------

Average Transactions per User:
T-statistic: 22.5160
P-value: 3.9207e-110
------
The difference in avg_conv_per_user from 1.6 to 1.41 is significant (p-value<=0.05)


In [203]:
##-----------------------------
# Comparison of Average Ticket Size using T-Test
##-----------------------------

mean1 = df_before.avg_ticket.iloc[0]
std1 = np.sqrt(df_before.var_avg_ticket.iloc[0])
nobs1 = df_before.nb_users.iloc[0]

mean2 = df_after.avg_ticket.iloc[0]
std2 = np.sqrt(df_after.var_avg_ticket.iloc[0])
nobs2 = df_after.nb_users.iloc[0]

t_stat, p_value = ttest_ind_from_stats(mean1, std1, nobs1,
                                              mean2, std2, nobs2,
                                              alternative='two-sided', equal_var=False)



print('\nAverage Ticket Size:')
print(f'T-statistic: {t_stat:.4f}')
print(f'P-value: {p_value:.4e}')

print('------')
metric='avg_ticket'
label_input = ['','<='] if p_value < 0.05 else ['not ','>']
print(f'The difference in {metric} from {mean1} to {mean2} is {label_input[0]}significant (p-value{label_input[1]}0.05)')


Average Ticket Size:
T-statistic: -7.9179
P-value: 2.6084e-15
------
The difference in avg_ticket from 33.43 to 34.53 is significant (p-value<=0.05)
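The verdict line is built the same way in every test cell, so it could be factored into a small helper. The sketch below is one possible shape, not the notebook's own code: the name `significance_summary` and the `alpha` parameter are assumptions introduced for illustration.

```python
def significance_summary(metric, before, after, p_value, alpha=0.05):
    """Return a one-line verdict comparing a p-value against alpha.

    Hypothetical helper: it mirrors the print statements used in the cells above.
    """
    if p_value < alpha:
        verdict, relation = "significant", "<"
    else:
        verdict, relation = "not significant", ">="
    return (f"The difference in {metric} from {before:.4g} to {after:.4g} "
            f"is {verdict} (p-value {relation} {alpha})")

# Average ticket comparison, using the p-value reported above
print(significance_summary("avg_ticket", 33.43, 34.53, 2.6084e-15))
```

Keeping the threshold in a single `alpha` argument also makes the comparison operator and the printed label agree by construction.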


In [229]:
##-----------------------------
# Comparison of CTR using Z-Test
##-----------------------------

# Counts of successes (number of clicks)
count = np.array([df_before.nb_clicks.iloc[0], df_after.nb_clicks.iloc[0]])

# Number of trials (number of displays)
nobs = np.array([df_before.nb_displays.iloc[0], df_after.nb_displays.iloc[0]])

mean1 = df_before.CTR.iloc[0]
mean2 = df_after.CTR.iloc[0]

# Perform the two-proportion z-test
stat, p_value = proportions_ztest(count, nobs)
print('\nComparison of CTR:')
print(f'Z-statistic: {stat:.4f}')
print(f'P-value: {p_value:.4f}')

print('------')
metric='CTR'
label_input = ['','<='] if p_value < 0.05 else ['not ','>']
print(f'The difference in {metric} from {mean1:.4f} to {mean2:.4f} is {label_input[0]}significant (p-value{label_input[1]}0.05)')


Comparison of CTR:
Z-statistic: -2.8236
P-value: 0.0047
------
The difference in CTR from 0.0470 to 0.0485 is significant (p-value<=0.05)
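A p-value says whether a difference is distinguishable from zero, not how large it is. One way to look at the size is a confidence interval for the difference in proportions. The sketch below assumes a simple unpooled (Wald) interval and uses only the standard library; `diff_proportions_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def diff_proportions_ci(successes1, trials1, successes2, trials2, confidence=0.95):
    """Return (diff, low, high) for p2 - p1 using an unpooled Wald interval."""
    p1 = successes1 / trials1
    p2 = successes2 / trials2
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(p1 * (1 - p1) / trials1 + p2 * (1 - p2) / trials2)
    # Two-sided critical value of the standard normal distribution
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p2 - p1
    return diff, diff - z_crit * se, diff + z_crit * se

# CTR: clicks / displays, before vs after the setup change
diff, low, high = diff_proportions_ci(49740, 1058307, 9620, 198495)
print(f"CTR change: {diff * 100:.3f} pp "
      f"(95% CI {low * 100:.3f} to {high * 100:.3f} pp)")
```

An interval that excludes zero agrees with the significant z-test, while its width in percentage points shows how small the plausible change is.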


In [163]:
df

| | Period | nb_users | avg_conv_per_user | var_conv_per_user | avg_ticket | var_avg_ticket | nb_displays | nb_clicks | nb_conversions | CTR | CR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Before Setup Change | 53430 | 1.6 | 0.5 | 33.43 | 150.0 | 1058307 | 49740 | 4904 | 0.047 | 0.099 |
| 1 | After Setup Change | 9701 | 1.41 | 0.6 | 33.53 | 160.0 | 190495 | 11620 | 1692 | 0.05 | 0.146 |


In [161]:
##-----------------------------
# Comparison of CR using Z-Test
##-----------------------------

# Counts of successes (number of conversions)
count = np.array([df_before.nb_conversions.iloc[0], df_after.nb_conversions.iloc[0]])

# Number of trials (number of clicks)
nobs = np.array([df_before.nb_clicks.iloc[0], df_after.nb_clicks.iloc[0]])

mean1 = df_before.CR.iloc[0]
mean2 = df_after.CR.iloc[0]

# Perform the two-proportion z-test
stat, p_value = proportions_ztest(count, nobs)
print('\nComparison of CR:')
print(f'Z-statistic: {stat:.4f}')
print(f'P-value: {p_value:.4e}')

print('------')
metric='CR'
label_input = ['','<='] if p_value < 0.05 else ['not ','>']
print(f'The difference in {metric} from {mean1:.3f} to {mean2:.3f} is {label_input[0]}significant (p-value{label_input[1]}0.05)')


Comparison of CR:
Z-statistic: -14.7326
P-value: 3.9838e-49
------
The difference in CR from 0.099 to 0.146 is significant (p-value<=0.05)


## Conclusion

Because the sample sizes are large, some differences are statistically significant yet practically negligible. CTR is the notable case: it rose by only about 0.15 percentage points (0.0470 to 0.0485). The change is unlikely to be due to chance, but it is too small to celebrate as a CTR increase, so analysts need to apply judgement and business context when drawing conclusions and recommending actions based on data.
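The role of sample size can be illustrated by re-running the CTR test on a hypothetical scenario with the same click rates but one tenth of the traffic. This is a sketch under that assumption, using a pooled two-proportion z-test written with the standard library; `ztest_p_value` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def ztest_p_value(successes1, trials1, successes2, trials2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (successes1 + successes2) / (trials1 + trials2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials1 + 1 / trials2))
    z = (successes1 / trials1 - successes2 / trials2) / se
    return 2 * NormalDist().cdf(-abs(z))

# CTR counts from this notebook: (clicks, displays) before and after
before, after = (49740, 1058307), (9620, 198495)
p_full = ztest_p_value(*before, *after)

# Hypothetical scenario: roughly the same rates, one tenth of the traffic
scale = 10
p_small = ztest_p_value(before[0] // scale, before[1] // scale,
                        after[0] // scale, after[1] // scale)

print(f"full sample p-value: {p_full:.4f}")
print(f"1/10 sample p-value: {p_small:.4f}")
```

With the same rates and far fewer observations the difference no longer clears the 0.05 threshold, which is why significance alone says little about practical importance.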

Other changes, like the CR (0.099 -> 0.146), are both statistically significant and meaningful, so that is a good KPI to highlight.

The two groups also differ widely in size (number of users, displays, clicks, conversions, etc.). Charts based on absolute metrics (like displays and clicks) are therefore not comparable across groups, whereas normalized KPIs (like CTR, CR and averages) are, and statistical tests indicate how much confidence the differences in those KPIs deserve.

Knowing your metrics and tests well, with the practical sense shown in this notebook, helps the analyst tell these situations apart and choose the appropriate test.