
# PARLA

## Problem
In the previous problem, we found that in the 'Refactoring backend' experiment:
- the average load time increased in the experimental group
- but the 99th percentile decreased in the experimental group

Now check the significance of the differences for other percentiles:
- Data for the 'Refactoring backend' experiment: `2022-04-13/2022-04-13T12_df_web_logs.csv` and `2022-04-13/experiment_users.csv`.
- The experiment was conducted from `2022-04-05` to `2022-04-12`.
- Assume that the request processing time measurements are independent.
- When testing, use a normal confidence interval.

## Action
To check whether the differences in load time between the control and experimental groups are statistically significant, I:
- resampled the control and experimental groups with replacement (the bootstrap requires sampling with replacement)
- calculated the difference between the groups' quantiles for each bootstrap resample
- used the normal confidence interval to check whether the difference in load time for that quantile is significant
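
As a compact summary of the steps above, the interval can be written as follows. The notation is my own and is introduced only for illustration: $A^{*(k)}$ and $B^{*(k)}$ are the $k$-th bootstrap resamples of the control and experimental load times, $Q_q$ is the empirical $q$-quantile, and $K$ is the number of resamples (the code in this notebook uses $K = 1000$ and the 95% normal multiplier $1.96$).

$$
\Delta_q^{*(k)} = Q_q\big(A^{*(k)}\big) - Q_q\big(B^{*(k)}\big), \qquad k = 1, \dots, K
$$

$$
\bar{\Delta}_q = \frac{1}{K} \sum_{k=1}^{K} \Delta_q^{*(k)}, \qquad
s_q = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \big(\Delta_q^{*(k)} - \bar{\Delta}_q\big)^2}
$$

$$
\mathrm{CI}_{95\%} = \big[\, \bar{\Delta}_q - 1.96\, s_q,\;\; \bar{\Delta}_q + 1.96\, s_q \,\big]
$$

The difference for quantile $q$ is treated as statistically significant when this interval does not contain $0$.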

## Result
For some quantiles the difference was statistically significant, while for others it was not:
- Quantiles WITH a statistically significant difference in load time: 0.7, 0.74, 0.82, 0.86, 0.9, 0.95, 0.99
- Quantiles WITHOUT a statistically significant difference in load time: 0.78, 0.999, 0.9999

## Learning
- I revised relevant Python and Pandas functionality
- I learned how to use bootstrapping to assess statistical significance at a specified confidence level
- I learned how to calculate a confidence interval for a normally distributed metric
- I learned that statistical significance can vary from quantile to quantile

## Application
- I can apply relevant Python and Pandas functionality for similar data-related problems
- I can apply bootstrapping to test statistical hypotheses and calculate confidence intervals using real-world data
- I can calculate confidence intervals for normally distributed metrics using real-world data
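
To make the procedure easier to reuse on similar problems, it could be wrapped in a small helper. The block below is a minimal sketch, not the code used in this notebook: the function name `bootstrap_quantile_delta_ci`, its parameters, and the synthetic normal data are all assumptions made for illustration, and it resamples with NumPy instead of `Series.sample`.

```python
import numpy as np


def bootstrap_quantile_delta_ci(a, b, q, n_boot=1000, z=1.96, seed=0):
    """Normal bootstrap CI for quantile(a, q) - quantile(b, q).

    Hypothetical helper: name, signature, and defaults are illustrative.
    Returns (lower, upper, significant).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    deltas = np.empty(n_boot)
    for k in range(n_boot):
        # resample each group with replacement, keeping its original size
        a_star = rng.choice(a, size=a.size, replace=True)
        b_star = rng.choice(b, size=b.size, replace=True)
        deltas[k] = np.quantile(a_star, q) - np.quantile(b_star, q)

    # normal interval centered on the mean of the bootstrap deltas
    center = deltas.mean()
    spread = deltas.std()
    lower = center - z * spread
    upper = center + z * spread

    # the difference is significant when the interval excludes zero
    significant = not (lower <= 0 <= upper)
    return lower, upper, significant


# synthetic data (an assumption, not the experiment's web logs)
rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=10.0, size=2000)
experiment = rng.normal(loc=120.0, scale=10.0, size=2000)

lower, upper, significant = bootstrap_quantile_delta_ci(control, experiment, q=0.9)
print(f'0.9: CI = [{lower:.2f}, {upper:.2f}], significant = {significant}')
```

Fixing the seed makes the result reproducible between runs, which the notebook's `Series.sample` calls do not guarantee unless `random_state` is set.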


In [14]:

from datetime import datetime

import numpy as np
import pandas as pd


In [15]:

# load dataset, describing distribution of users into control and experimental groups
df_users = pd.read_csv('data/2022-04-13_experiment_users.csv')

# load dataset, containing web-logs
df_logs = pd.read_csv('data/2022-04-13T12_df_web_logs.csv')
df_logs.date = pd.to_datetime(df_logs.date)

# filter web-logs to keep experimental data
start_date = datetime(2022, 4, 5)
end_date = datetime(2022, 4, 12)
df_logs = df_logs[(df_logs.date >= start_date) & (df_logs.date < end_date)]

# merge df_users and df_logs, to distribute logs into control and experimental groups
mdf = pd.merge(df_users, df_logs, how='inner', on=['user_id'])
df_a = mdf[mdf.pilot == 0].load_time
df_b = mdf[mdf.pilot == 1].load_time
print(df_a.head())
print(df_b.head())


0    106.6
1     49.6
2     49.9
3     75.7
4     61.6
Name: load_time, dtype: float64
19270    60.0
19271    71.7
19272    66.1
19273    73.6
19274    68.7
Name: load_time, dtype: float64


In [16]:

quantiles = [0.7, 0.74, 0.78, 0.82, 0.86, 0.90, 0.95, 0.99, 0.999, 0.9999]

for q in quantiles:
    boot_deltas = []
    for _ in range(10**3):
        # resample control and experimental groups with replacement,
        # since the bootstrap requires sampling with replacement
        a = df_a.sample(frac=1, replace=True)
        b = df_b.sample(frac=1, replace=True)

        # calculate bootstrap quantile deltas
        boot_deltas.append(a.quantile(q) - b.quantile(q))

    # using the normal confidence interval,
    # check if there is a significant difference in load-time for that quantile
    boot_deltas = np.array(boot_deltas)
    deltas_mean = boot_deltas.mean()
    deltas_std = boot_deltas.std()
    lower = deltas_mean - 1.96 * deltas_std
    upper = deltas_mean + 1.96 * deltas_std

    # check for statistical significance
    if lower <= 0 <= upper:
        print(f'{q}: there IS NO statistically significant difference in load time between control and experimental groups')
    else:
        print(f'{q}: there IS a statistically significant difference in load time between control and experimental groups')


0.7: there IS a statistically significant difference in load time between control and experimental groups
0.74: there IS a statistically significant difference in load time between control and experimental groups
0.78: there IS NO statistically significant difference in load time between control and experimental groups
0.82: there IS a statistically significant difference in load time between control and experimental groups
0.86: there IS a statistically significant difference in load time between control and experimental groups
0.9: there IS a statistically significant difference in load time between control and experimental groups
0.95: there IS a statistically significant difference in load time between control and experimental groups
0.99: there IS a statistically significant difference in load time between control and experimental groups
0.999: there IS NO statistically significant difference in load time between control and experimental groups
0.9999: there IS NO statistically si