
# PARLA

## Problem
- Estimate the minimal required group size for an experiment:
    - Experiment includes customers who made a purchase during the experiment period.
    - Experiment period is the week from February 21 to February 28, 2022 (end date exclusive).
    - Data sample should include events from the half-open interval \[datetime(2022, 2, 21), datetime(2022, 2, 28))
    - Use the data from the file `2022-04-01T12_df_sales.csv` to solve the task.
    - Experiment parameters:
        - Metric — average revenue per user (ARPU) during the experiment
        - Duration — one week
        - Significance level — 0.05
        - Acceptable probability of Type II error — 0.1
        - Expected effect — 20 rubles
    - As your answer, enter the required group size, rounded to the nearest ten (round(x, -1)).
    - Use formula:

$$
n > \frac{ [ {\Phi}^{-1}(1 - \alpha/2) + {\Phi}^{-1}(1 - \beta) ]^2 (\sigma^{2}_{X} + \sigma^{2}_{Y}) }{ \epsilon^2 }
$$

- For the same experiment:
    - Generate control and experimental groups
    - Estimate Minimal Detectable Effect (MDE), by using the generated groups and the following formula:

$$
\epsilon > \sqrt{ \frac{ [ {\Phi}^{-1}(1 - \alpha/2) + {\Phi}^{-1}(1 - \beta) ]^2 (\sigma^{2}_{X} + \sigma^{2}_{Y}) }{ n } }
$$

## Action
- To estimate the required group size for the experiment, I:
    - Specified experiment parameters
    - Loaded, converted, filtered, grouped, and aggregated sales data
    - Calculated variance of the sales data
    - Estimated sample size using the formula for t-test above
- To estimate MDE for the experiment, I:
    - Generated the control group (by randomly sampling the sales dataframe)
    - Generated the experimental group (by subtracting control group from the sales dataframe)
    - Estimated MDE using the formula for t-test above

## Result
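- The minimal required group size is estimated at 34,570 users per group, which matches the expected answer
- MDE for the generated groups (12,420 users each) is estimated at about 33.4 rubles; it exceeds the expected effect of 20 rubles because the available groups are smaller than the required size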

## Learning
- I revised relevant Python and Pandas functionality
- I learned and applied the formula for estimating the sample size for a t-test
- I learned and applied the formula for estimating MDE for a t-test
- I learned that the minimal sample size and MDE come from the same formula: solving it for n gives the sample size, and solving it for the effect gives the MDE
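
The link between the two formulas in the Problem section can be written out as a short rearrangement. This is a sketch that treats the boundary case (equality instead of the strict inequality) and uses only the symbols already defined above:

$$
n = \frac{ [ {\Phi}^{-1}(1 - \alpha/2) + {\Phi}^{-1}(1 - \beta) ]^2 (\sigma^{2}_{X} + \sigma^{2}_{Y}) }{ \epsilon^2 }
\;\Longleftrightarrow\;
\epsilon^2 = \frac{ [ {\Phi}^{-1}(1 - \alpha/2) + {\Phi}^{-1}(1 - \beta) ]^2 (\sigma^{2}_{X} + \sigma^{2}_{Y}) }{ n }
\;\Longleftrightarrow\;
\epsilon = \sqrt{ \frac{ [ {\Phi}^{-1}(1 - \alpha/2) + {\Phi}^{-1}(1 - \beta) ]^2 (\sigma^{2}_{X} + \sigma^{2}_{Y}) }{ n } }
$$

Since $\epsilon > 0$ and $n > 0$, the steps are reversible, so fixing any one of $n$ or $\epsilon$ (with $\alpha$, $\beta$, and the variances given) determines the other.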

## Application
- I can apply relevant Python and Pandas functionality for similar data-related problems
- I can estimate the necessary sample size for a t-test
- I can estimate MDE for a t-test


In [38]:

from datetime import datetime

import numpy as np
import pandas as pd
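import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded; older SciPy versions do not expose sp.stats after a bare `import scipy`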


In [39]:

# specify experiment parameters
alpha = 0.05
beta = 0.1
effect = 20
date_begin = datetime(2022, 2, 21)
date_end = datetime(2022, 2, 28)


In [40]:

# load, convert, filter, group, and aggregate sales data
df_sales = pd.read_csv('2022-04-01T12_df_sales.csv')
df_sales['date'] = pd.to_datetime(df_sales['date'])
# keep events from the half-open interval [date_begin, date_end)
df_sales = df_sales[(date_begin <= df_sales['date']) & (df_sales['date'] < date_end)]
df_sales = df_sales.groupby(['user_id'])['price'].sum().reset_index()
df_sales.head()


Unnamed: 0,user_id,price
0,00045f,720
1,0006bb,1260
2,000b52,3480
3,000cbb,780
4,000cf0,840
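
To illustrate how the half-open interval behaves at its boundaries, here is a minimal sketch on toy data. The `toy_sales` frame and its values are hypothetical; only the column names (`user_id`, `date`, `price`) and the date bounds follow this notebook.

```python
from datetime import datetime

import pandas as pd

# Hypothetical toy data; column names mirror the ones used in this notebook.
toy_sales = pd.DataFrame({
    "user_id": ["a", "a", "b", "c", "c"],
    "date": pd.to_datetime([
        "2022-02-20 23:59:59",  # before the window: excluded
        "2022-02-21 00:00:00",  # exactly on the left bound: included
        "2022-02-27 23:59:59",  # inside the window: included
        "2022-02-28 00:00:00",  # exactly on the right bound: excluded
        "2022-02-25 12:00:00",  # inside the window: included
    ]),
    "price": [100, 200, 300, 400, 500],
})

date_begin = datetime(2022, 2, 21)
date_end = datetime(2022, 2, 28)

# Half-open interval [date_begin, date_end): >= on the left, < on the right.
in_window = (toy_sales["date"] >= date_begin) & (toy_sales["date"] < date_end)

# One row per user: total revenue inside the window.
revenue_per_user = (
    toy_sales[in_window]
    .groupby("user_id", as_index=False)["price"]
    .sum()
)
print(revenue_per_user)
```

Users with no purchase inside the window do not appear in the result at all, which matches the problem statement: the experiment includes only customers who made a purchase during the experiment period.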


In [41]:

# calculate variance of the sample
price_var = df_sales.price.var()
print(f'price_var: {price_var}')

# calculate sample size using the formula for t-test
n = ( sp.stats.norm.ppf(1 - alpha / 2) + sp.stats.norm.ppf(1 - beta) )**2 * (price_var + price_var) / effect**2

# correct answer is 34570
print(f'sample size: {round(n, -1)}')


price_var: 658013.5419915788
sample size: 34570.0
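
The same calculation can be wrapped in a small reusable function. This is a sketch, not part of the original solution: the function name `required_sample_size` and its signature are my own, and the variance is the value printed above, hard-coded so the snippet runs on its own.

```python
from scipy import stats


def required_sample_size(alpha, beta, var_x, var_y, effect):
    """Group size from the formula in the Problem section (boundary value)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # quantile for the two-sided significance level
    z_beta = stats.norm.ppf(1 - beta)        # quantile for the Type II error probability
    return (z_alpha + z_beta) ** 2 * (var_x + var_y) / effect ** 2


# Variance printed by the cell above; both groups are assumed to share it.
price_var = 658013.5419915788

n = required_sample_size(alpha=0.05, beta=0.1, var_x=price_var, var_y=price_var, effect=20)
print(f"sample size: {round(n, -1)}")

# Halving the expected effect quadruples the required group size.
n_half_effect = required_sample_size(0.05, 0.1, price_var, price_var, effect=10)
print(f"ratio: {n_half_effect / n}")
```

Because the effect enters the formula squared, small expected effects become expensive quickly, which is worth checking before committing to a one-week experiment.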


In [42]:

# generate control group
group_a = df_sales.sample(frac=0.5, random_state=0)
group_a.describe()


Unnamed: 0,price
count,12420.0
mean,1239.707729
std,811.514161
min,540.0
25%,720.0
50%,840.0
75%,1530.0
max,7560.0


In [43]:

# generate experimental group
# every user not sampled into group_a goes to group_b
group_b = df_sales.drop(group_a.index)
group_b.describe()


Unnamed: 0,price
count,12420.0
mean,1229.666667
std,810.847963
min,540.0
25%,720.0
50%,840.0
75%,1500.0
max,8820.0
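
A quick way to sanity-check a split like this is to verify that the two groups are disjoint and together cover all users. The sketch below does that on hypothetical toy data (`toy_users` is not from the sales file), taking the complement with `DataFrame.drop` on the sampled index, which is one possible way to build the second group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ten users with one aggregated revenue value each.
toy_users = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(10)],
    "price": np.arange(10) * 100,
})

# Control group: a random half of the users (fixed seed for reproducibility).
group_a = toy_users.sample(frac=0.5, random_state=0)

# Experimental group: everyone who was not sampled into the control group.
group_b = toy_users.drop(group_a.index)

users_a = set(group_a["user_id"])
users_b = set(group_b["user_id"])

print(len(group_a), len(group_b))
print("disjoint:", users_a.isdisjoint(users_b))
print("complete:", users_a | users_b == set(toy_users["user_id"]))
```

This relies on the index of the frame being unique, which holds after `reset_index()` as in the loading cell above.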


In [44]:

# calculate Minimal Detectable Effect (MDE)
ppf1 = sp.stats.norm.ppf(1 - alpha / 2)
ppf2 = sp.stats.norm.ppf(1 - beta)
mde = np.sqrt( (ppf1 + ppf2)**2 * (group_a.price.var() + group_b.price.var()) / len(group_a) )
mde


np.float64(33.36722953402226)
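
As a cross-check, the MDE calculation can be written as a function and compared with the sample size estimate. This is a sketch under stated assumptions: the function name `minimal_detectable_effect` is my own, and the standard deviations and group size are copied from the `describe()` outputs above rather than recomputed from the data.

```python
import numpy as np
from scipy import stats


def minimal_detectable_effect(alpha, beta, var_x, var_y, n):
    """MDE from the formula in the Problem section (boundary value)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(1 - beta)
    return np.sqrt((z_alpha + z_beta) ** 2 * (var_x + var_y) / n)


# Values copied from the describe() outputs of group_a and group_b above.
std_a, std_b, group_size = 811.514161, 810.847963, 12420

mde = minimal_detectable_effect(0.05, 0.1, std_a ** 2, std_b ** 2, group_size)
print(f"MDE: {mde:.2f}")

# Round trip: with the group size the sample size formula gives for a
# 20-ruble effect, the MDE comes back as 20 rubles.
price_var = 658013.5419915788  # variance of the full sample, printed earlier
z_sum = stats.norm.ppf(1 - 0.05 / 2) + stats.norm.ppf(1 - 0.1)
n_required = z_sum ** 2 * (price_var + price_var) / 20 ** 2
mde_at_required_n = minimal_detectable_effect(0.05, 0.1, price_var, price_var, n_required)
print(f"MDE at required n: {mde_at_required_n:.2f}")
```

With 12,420 users per group the detectable effect is well above the expected 20 rubles, which is consistent with the required group size of about 34,570 being larger than the groups available in one week of data.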