In data science, the one-sample t-test and the two-sample t-test are both statistical tests used to determine whether a mean differs significantly from a reference: either a known value or the mean of another sample.

The main difference between the two tests is in the number of samples being compared.

The **one-sample t-test** is used when you want to compare a single sample to a known population mean. For example, if you want to determine if the average height of a group of students is significantly different from the average height of the general population, you would use a one-sample t-test.

The **two-sample t-test** is used when you want to compare the means of two independent samples. For example, if you want to determine if there is a significant difference in the average height between male and female students, you would use a two-sample t-test.

In both tests, **the t-statistic is calculated from the difference between the means being compared and the variability within the sample(s)**. The p-value is then calculated based on the t-statistic and the degrees of freedom. If the p-value is less than the chosen significance level (usually 0.05), the difference is considered statistically significant: between the sample mean and the reference value in the one-sample case, or between the two sample means in the two-sample case.
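
A minimal sketch of both tests using `scipy.stats`. The height samples are synthetic, and the reference mean of 170 cm is an assumption made up for illustration, not a real population value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic heights in cm (illustrative values, not real data)
students = rng.normal(loc=172.0, scale=7.0, size=50)
group_a = rng.normal(loc=178.0, scale=7.0, size=40)
group_b = rng.normal(loc=165.0, scale=6.0, size=40)

# One-sample t-test: does the students' mean differ from a reference mean?
assumed_population_mean = 170.0  # assumption for this example
t_one, p_one = stats.ttest_1samp(students, popmean=assumed_population_mean)

# Two-sample t-test: do two independent groups have different means?
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_two, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"one-sample: t={t_one:.3f}, p={p_one:.4f}")
print(f"two-sample: t={t_two:.3f}, p={p_two:.4f}")
```

Welch's variant is used here as a cautious default; passing `equal_var=True` gives the classic pooled-variance test.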

# A/B testing for a landing page conversion using Python

Performing A/B testing for a landing page conversion using Python involves several key steps. Here is a general outline of the steps involved:

1. **Define the hypothesis**: The first step is to clearly define the null and alternative hypotheses for the A/B test. For example, the null hypothesis might be that there is no difference in conversion rates between the original landing page and the new landing page, while the alternative hypothesis might be that the new landing page has a higher conversion rate than the original landing page.

2. **Choose the sample size**: Next, you need to choose the sample size for the A/B test. This involves determining:
- the minimum detectable effect size,
- the significance level,
- and the power of the test.

There are several online calculators and Python libraries that can help you with this step.

sample size = 16

3. Collect the data: The next step is to collect the data that you will use to test the hypothesis. This involves randomly assigning visitors to either the control group (original landing page) or the treatment group (new landing page) and recording whether each visitor converted.

4. Calculate the metrics: Once you have collected the data, you need to calculate the metrics that you will use to test the hypothesis. The most common metrics for A/B testing are the conversion rate, the confidence interval, and the p-value.

5. Analyze the results: The next step is to analyze the results of the A/B test to determine whether there is a significant difference in conversion rates between the control group and the treatment group. This involves comparing the p-value calculated in step 4 to the significance level and checking whether the confidence interval for the difference in conversion rates excludes zero.

6. Draw conclusions and report results: The final step is to draw conclusions based on the results of the A/B test and report the findings in a clear and concise manner. This involves interpreting the results, discussing the implications of the findings, and making recommendations for future action.
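
Step 3 relies on random assignment of visitors. One possible way to do this, sketched below, is to hash the user ID so that the same visitor always sees the same page. The function name `assign_variant` and the experiment label are hypothetical, and this is not necessarily how the dataset used later in this notebook was collected.

```python
import hashlib


def assign_variant(user_id: int, experiment: str = "landing_page_test") -> str:
    """Deterministically assign a user to 'control' or 'treatment' (50/50 split)."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    # Map the hash to a bucket in 0..99; buckets 0..49 go to treatment
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"


# Sanity check on hypothetical user IDs: the split should be close to 50/50
assignments = [assign_variant(uid) for uid in range(10_000)]
treatment_share = assignments.count("treatment") / len(assignments)
print(f"treatment share: {treatment_share:.3f}")
```

Including the experiment label in the hash keeps assignments independent across different experiments.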


Here are some specific Python libraries and functions that you can use to perform A/B testing:

- Pandas: Pandas is a powerful data analysis library that you can use to manage and analyze your A/B test data.

- SciPy: SciPy is a Python library that provides a wide range of statistical functions and methods, including t-tests and p-value calculations.

- Statsmodels: Statsmodels is a Python library that provides a range of statistical models and methods, including hypothesis testing and regression analysis.

- Matplotlib: Matplotlib is a Python library that you can use to create graphs and visualizations of your A/B test results.

In terms of metrics, the most important metrics to calculate for A/B testing are the conversion rate, the confidence interval, and the p-value. The conversion rate is the percentage of visitors who convert on the landing page, while the confidence interval is a range of plausible values for the true conversion rate at a given confidence level (e.g. 95%). The p-value is the probability of obtaining a result as extreme or more extreme than the one you observed, assuming the null hypothesis is true. Typically, a p-value less than 0.05 is considered significant and is taken as evidence against the null hypothesis.
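
As a rough illustration of the first two metrics, the sketch below computes a conversion rate and a 95% confidence interval with `proportion_confint` from statsmodels. The counts are hypothetical, and the Wilson interval is only one of several available methods.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical counts, for illustration only
conversions = 120
visitors = 1000

conversion_rate = conversions / visitors

# 95% Wilson confidence interval for the true conversion rate
ci_low, ci_high = proportion_confint(conversions, visitors, alpha=0.05, method="wilson")

print(f"conversion rate: {conversion_rate:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```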

COMMON PROBLEMS:

1. Novelty and Primacy Effect
Some users may react to a new version of a website simply because it is new (novelty effect), while others may prefer the original version they are used to (primacy effect). These reactions stabilize with time. HOW TO DEAL WITH IT?: run tests only on first-time users. If the test is already running, compare first-time users with returning users in the treatment group.
2. Interference between control and treatment group
Users should be split randomly, and users should be independent of each other. This does not hold for, e.g., social networks, where some users' behaviour is influenced by other users (network effect).
3. Dealing with interference and bias
Isolate users in the control and treatment groups. How to do this depends on the scenario: geo-based split, time-based split, etc., or creating clusters of users.
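
The geo-based split from item 3 can be sketched as cluster-level randomization: whole clusters, rather than individual users, are assigned to a group. The city names and user records below are made up for illustration.

```python
import random

# Hypothetical clusters (e.g. cities); names are illustrative only
clusters = ["berlin", "paris", "madrid", "rome", "vienna", "prague", "lisbon", "dublin"]

# Shuffle with a fixed seed and send half of the clusters to each group
rng = random.Random(0)
shuffled = clusters[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
cluster_to_group = {city: "treatment" for city in shuffled[:half]}
cluster_to_group.update({city: "control" for city in shuffled[half:]})

# Every user inherits the group of their cluster (hypothetical users)
users = [
    {"user_id": 1, "city": "paris"},
    {"user_id": 2, "city": "paris"},
    {"user_id": 3, "city": "rome"},
]
for user in users:
    user["group"] = cluster_to_group[user["city"]]

print(users)
```

With this design the cluster, not the user, is the unit of randomization, so users who interact within a city always see the same version.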

### Define the hypothesis
- Null hypothesis - no difference in conversion rate between the two pages
- Alternative hypothesis - new landing page has a higher conversion rate than the original landing page

### Choose the sample size
1. Determine the minimum detectable effect size: This is the smallest difference in conversion rates that you want to be able to detect with your A/B test. This can be based on prior knowledge, industry benchmarks, or other factors.

2. Choose a significance level: This is the probability of rejecting the null hypothesis when it is true, also known as the type I error rate. The most common significance level is 0.05, which means that there is a 5% chance of incorrectly rejecting the null hypothesis.

3. Determine the statistical power: This is the probability of correctly rejecting the null hypothesis when it is false, which equals 1 minus the type II error rate. The most common statistical power is 0.8, which means that there is an 80% chance of correctly rejecting the null hypothesis when it is false.

4. Use a sample size calculator: You can use a sample size calculator, such as the one provided by Python's statsmodels library, to determine the sample size needed to achieve the desired power and significance level.

### Collect the data

df_data = pd.read_csv('data/ab_data.csv')


In [2]:
import statsmodels.stats.api as sms

# baseline conversion rate (control group)
baseline_cr = 0.10

# minimum detectable effect size (in proportion units)
min_detectable_effect = 0.02

# significance level
alpha = 0.05

# statistical power
power = 0.80

# calculate the sample size per group
# NOTE: effect_size is expected to be a standardized effect size (Cohen's d);
# the relative lift (0.02 / 0.10 = 0.2) is passed here, so treat the result as a rough figure
sample_size = sms.TTestIndPower().solve_power(effect_size=min_detectable_effect/baseline_cr, alpha=alpha, power=power, alternative='larger')

print("Sample size: ", round(sample_size, 2))


Sample size:  309.81
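
The cell above passes the relative lift (0.02 / 0.10 = 0.2) as `effect_size`, while `solve_power` interprets that argument as a standardized effect size. For conversion rates, one alternative is to convert the two proportions to Cohen's h with `proportion_effectsize` and use `NormalIndPower`. The sketch below reuses the same inputs; it is an alternative calculation under these assumptions, not a correction of the recorded output, and it gives a considerably larger sample size per group.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.10            # baseline conversion rate (control group)
min_detectable_effect = 0.02  # absolute lift we want to detect
target_cr = baseline_cr + min_detectable_effect

# Cohen's h: standardized effect size for the difference between two proportions
effect_size_h = proportion_effectsize(target_cr, baseline_cr)

# Required observations per group for a one-sided test
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size_h,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="larger",
)

print("Effect size (Cohen's h):", round(effect_size_h, 4))
print("Sample size per group:", round(n_per_group))
```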


In [3]:
#Import libraries
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize, FreqDist
from nltk.util import ngrams

import string
#from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime, timedelta

import gensim
from gensim.models import Word2Vec

np.random.seed(0)

%matplotlib inline
plt.style.use('seaborn')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. EDA

In [4]:
df_data = pd.read_csv('data/ab_data.csv')
df_data.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 |


In [5]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


### 1.1 Unique users and returning users

In [6]:
#Unique users
df_data['user_id'].nunique()

290585

In [7]:
#Returning users
len(df_data) - df_data['user_id'].nunique()

3895

In [8]:
df_returning_users = df_data[df_data['user_id'].duplicated()]
df_returning_users.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 2656 | 698120 | 13:42.6 | control | old_page | 0 |
| 2893 | 773192 | 55:59.6 | treatment | new_page | 0 |
| 7500 | 899953 | 06:54.1 | control | new_page | 0 |
| 8036 | 790934 | 32:20.3 | treatment | new_page | 0 |
| 10218 | 633793 | 16:00.7 | treatment | old_page | 0 |
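
Some of the duplicated rows above pair `control` with `new_page` or `treatment` with `old_page`. One way to count such mismatches is a cross-tabulation of the two columns. The sketch below runs on a toy frame; `df_toy` and its values are made up, and only the column names follow `ab_data.csv`.

```python
import pandas as pd

# Toy rows mimicking the columns of ab_data.csv (values are illustrative)
df_toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "group": ["control", "treatment", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "new_page", "old_page"],
})

# Off-diagonal cells show users whose page does not match their group
print(pd.crosstab(df_toy["group"], df_toy["landing_page"]))

# Select the mismatched rows explicitly
expected_page = df_toy["group"].map({"control": "old_page", "treatment": "new_page"})
mismatched = df_toy[df_toy["landing_page"] != expected_page]
print("mismatched rows:", len(mismatched))
```

Whether to drop or keep such rows is a judgment call that depends on how the experiment was run.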


In [9]:
#Calculate conversion rates
old_page_df = df_data[df_data['landing_page'] == 'old_page']
new_page_df = df_data[df_data['landing_page'] == 'new_page']

old_page_conversion = old_page_df['converted'].sum()
old_page_visitors = len(old_page_df)
new_page_conversion = new_page_df['converted'].sum()
new_page_visitors = len(new_page_df)

print('old_page_conversion: ', old_page_conversion)
print('old_page_visitors: ', old_page_visitors)
print('new_page_conversion: ', new_page_conversion)
print('new_page_visitors: ', new_page_visitors)

old_page_conversion:  17739
old_page_visitors:  147239
new_page_conversion:  17498
new_page_visitors:  147241


In [10]:
import scipy.stats as stats

# conversion counts for control group and treatment group
control_conversions = old_page_conversion
control_visitors = old_page_visitors
treatment_conversions = new_page_conversion
treatment_visitors = new_page_visitors

# conversion rates for control group and treatment group
control_rate = control_conversions / control_visitors
treatment_rate = treatment_conversions / treatment_visitors

# perform t-test for independent samples from summary statistics;
# for a 0/1 outcome the standard deviation is sqrt(p * (1 - p))
t_stat, p_value = stats.ttest_ind_from_stats(control_rate, np.sqrt(control_rate*(1-control_rate)), control_visitors,
                                             treatment_rate, np.sqrt(treatment_rate*(1-treatment_rate)), treatment_visitors)

print("t-statistic:", t_stat)
print("p-value:", p_value)

t-statistic: 1.369696647980389
p-value: 0.1707826599051773
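
A common alternative for conversion data is the two-proportion z-test. The sketch below applies `proportions_ztest` from statsmodels to the counts printed in cell 9; with samples this large, it should closely match the t-test result above.

```python
from statsmodels.stats.proportion import proportions_ztest

# Counts taken from the output of cell 9: [old_page, new_page]
conversions = [17739, 17498]
visitors = [147239, 147241]

# Two-sided z-test for equality of the two conversion rates
z_stat, p_value_z = proportions_ztest(count=conversions, nobs=visitors, alternative="two-sided")

print("z-statistic:", z_stat)
print("p-value:", p_value_z)
```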


## Discrete Metrics
Let's first consider discrete metrics, e.g. click-through rate. We randomly show visitors one of two possible designs of an advertisement, and based on how many of them click on it we need to determine whether our data significantly contradict the hypothesis that the two designs are equally effective.

In [11]:
# keep the first record per user; .copy() avoids pandas' SettingWithCopyWarning below
df_unique = df_data.drop_duplicates(subset=['user_id'], keep='first').copy()
df_unique['not_converted'] = 1 - df_unique['converted']
df_unique.head(10)

| | user_id | timestamp | group | landing_page | converted | not_converted |
|---|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 | 1 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 | 1 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 | 1 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 | 1 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 | 0 |
| 5 | 936923 | 20:49.1 | control | old_page | 0 | 1 |
| 6 | 679687 | 26:46.9 | treatment | new_page | 1 | 0 |
| 7 | 719014 | 48:29.5 | control | old_page | 0 | 1 |
| 8 | 817355 | 58:09.0 | treatment | new_page | 1 | 0 |
| 9 | 839785 | 11:06.6 | treatment | new_page | 1 | 0 |


In [12]:
df_unique_ct = df_unique[['landing_page', 'converted', 'not_converted']].groupby(by=["landing_page"]).sum()
df_unique_ct

| landing_page | converted | not_converted |
|---|---|---|
| new_page | 17256 | 128065 |
| old_page | 17489 | 127775 |


### Fisher's exact test
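Since we have a 2x2 contingency table, we can use Fisher's exact test to compute an exact p-value and test our hypothesis.

In practice, Fisher’s exact test is mainly applied when sample sizes are small, where the chi-squared approximation is unreliable; it remains valid for large samples such as this one.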

In [28]:
odds_ratio, p_value = stats.fisher_exact(df_unique_ct, alternative='two-sided')

In [36]:
#print("statistic:", t_stat)
print("Fisher's exact test p-value is: ", p_value)

Fisher's exact test p-value is:  0.1717870662804108
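
For contrast, here is a small-sample case of the kind Fisher's exact test is typically used for. The 2x2 table is hypothetical and chosen so that the exact p-value can be checked by hand.

```python
from scipy import stats

# Hypothetical small 2x2 table: rows = page version, columns = converted / not converted
small_table = [[3, 1],
               [1, 3]]

odds_ratio_small, p_small = stats.fisher_exact(small_table, alternative="two-sided")

print("odds ratio:", odds_ratio_small)  # (3 * 3) / (1 * 1) = 9
print("p-value:", p_small)              # 34/70, about 0.486
```

With only eight observations the test cannot reject the null hypothesis, even though the sample odds ratio is large.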


### Pearson’s chi-squared test

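Pearson's chi-squared test is non-directional: it detects a difference in either direction, so its p-value is comparable to that of a two-sided test.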

The null hypothesis in the test is that there is no association between the two categorical variables being compared. The alternative hypothesis is that there is some association between the variables.

In [30]:
df_unique_ct.values

array([[ 17256, 128065],
       [ 17489, 127775]], dtype=int64)

In [33]:
results = stats.chi2_contingency(df_unique_ct, correction=False)
#https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

In [35]:
print('Pearson’s chi-squared test p-value is: ', results[1])

Pearson’s chi-squared test p-value is:  0.17032977708656222
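
Besides the p-value, `chi2_contingency` also returns the test statistic, the degrees of freedom, and the expected counts under the null hypothesis. A short sketch that unpacks all four values, using the contingency table shown above:

```python
import numpy as np
from scipy import stats

# Contingency table from the output above: rows = new_page, old_page;
# columns = converted, not_converted
observed = np.array([[17256, 128065],
                     [17489, 127775]])

chi2_stat, p_chi2, dof, expected = stats.chi2_contingency(observed, correction=False)

print("chi-squared statistic:", chi2_stat)
print("p-value:", p_chi2)
print("degrees of freedom:", dof)  # (2 - 1) * (2 - 1) = 1 for a 2x2 table
print("expected counts under the null hypothesis:")
print(expected)
```

Comparing `expected` with `observed` shows how far each cell deviates from what independence between page version and conversion would predict.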
