
# PARLA

## Problem
- Write a function that:
    - performs an independent t-test on the provided control and experimental groups
    - rounds the resulting p-value to three decimal places
- Analyze the significance of the difference in average revenue per user (ARPU):
    - Select the week before the experiment (from 2022-03-16 up to, but not including, 2022-03-23).
    - Use the data from the files 2022-04-01T12_df_sales.csv and experiment_users.csv to solve the task.
    - Perform an independent t-test using the function above.

## Action
- Implemented the 'get_ttest_pvalue()' function, using relevant Python, NumPy, and SciPy functionality.
- Analyzed the significance of the ARPU difference:
    - Loaded, filtered, grouped, aggregated, sorted, and merged the original dataframes.
    - Replaced missing values (users with no sales in the selected week) with zeros.
    - Created the control and experimental groups by filtering the merged dataframe.
    - Performed an independent t-test on the control and experimental groups.

## Result
- Successfully tested 'get_ttest_pvalue()' function.
- Found that the ARPU difference between the control and experimental groups is not statistically significant: the p-value (0.199) is well above the standard significance level (0.05).

## Learning
- I revised relevant Python, NumPy, Pandas, and SciPy functionality.
- I realized that the handling of missing data is important: omitting missing values instead of replacing them with zeros changes the p-value.
- I learned how to perform independent t-tests.
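
The effect of missing-data handling noted above can be sketched on toy data. The users, prices, and the names `users`, `sales`, `p_zeros`, and `p_dropped` are hypothetical and not taken from the original files. Filling with zeros keeps non-paying users in the sample, which matches the per-user definition of ARPU, while dropping them compares paying users only.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: ten users, three of whom have no sales in the window.
users = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(1, 11)],
    "pilot": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
sales = pd.DataFrame({
    "user_id": ["u1", "u2", "u5", "u6", "u8", "u9", "u10"],
    "price": [700.0, 900.0, 800.0, 1500.0, 1200.0, 1000.0, 1300.0],
})

# A left merge keeps every user; users without sales get NaN in 'price'.
merged = pd.merge(users, sales, how="left", on="user_id")

# Option 1: treat missing revenue as zero (all users stay in the sample).
filled = merged.fillna({"price": 0})
p_zeros = stats.ttest_ind(
    filled.loc[filled.pilot == 0, "price"],
    filled.loc[filled.pilot == 1, "price"],
).pvalue

# Option 2: drop users without sales (only paying users stay in the sample).
dropped = merged.dropna(subset=["price"])
p_dropped = stats.ttest_ind(
    dropped.loc[dropped.pilot == 0, "price"],
    dropped.loc[dropped.pilot == 1, "price"],
).pvalue

print(len(filled), round(p_zeros, 3))
print(len(dropped), round(p_dropped, 3))
```

The two p-values differ because both the group means and the sample sizes change between the two options.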

## Application
- I can apply relevant Python and Pandas functionality for data-related problems.
- I can apply independent t-tests in practice.


In [102]:

from datetime import datetime

import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so that sp.stats is available


In [103]:

def get_ttest_pvalue(
        metrics_a_group: np.ndarray,
        metrics_b_group: np.ndarray
) -> np.float64:
    """
    Returns the t-test p-value for the two groups of observations.

    :param metrics_a_group: values of the control group
    :param metrics_b_group: values of the experimental group
    :return: p-value
    """
    return np.round(sp.stats.ttest_ind(metrics_a_group, metrics_b_group).pvalue, decimals=3)


def test_get_ttest_pvalue():
    """
    Test get_ttest_pvalue function.
    """
    metrics_a_group = np.array([964, 1123, 962, 1213, 914, 906, 951, 1033, 987, 1082])
    metrics_b_group = np.array([952, 1064, 1091, 1079, 1158, 921, 1161, 1064, 819, 1065])
    pvalue = get_ttest_pvalue(metrics_a_group, metrics_b_group)
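    print(pvalue)
    # fail loudly if the result drifts from the expected value
    assert np.isclose(pvalue, 0.612), pvalue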


# correct answer is 0.612
test_get_ttest_pvalue()


0.612
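
As a rough cross-check of what `get_ttest_pvalue()` relies on, the sketch below recomputes the p-value by hand for the same test arrays. It assumes SciPy's defaults for `ttest_ind` (`equal_var=True`, two-sided alternative), that is, a pooled-variance Student's t-test. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

a = np.array([964, 1123, 962, 1213, 914, 906, 951, 1033, 987, 1082])
b = np.array([952, 1064, 1091, 1079, 1158, 921, 1161, 1064, 819, 1065])

n_a, n_b = len(a), len(b)
dof = n_a + n_b - 2  # degrees of freedom of the pooled test

# Pooled sample variance (unbiased variances weighted by their degrees of freedom).
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

# t statistic: difference of means divided by its standard error.
t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the survival function of Student's t distribution.
p_manual = 2 * stats.t.sf(abs(t_stat), dof)
p_scipy = stats.ttest_ind(a, b).pvalue

print(round(p_manual, 3), round(p_scipy, 3))
```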


In [104]:

# load and filter sales data
df_sales = pd.read_csv('2022-04-01T12_df_sales.csv')
df_sales.date = pd.to_datetime(df_sales.date)
df_sales = df_sales[(datetime(2022, 3, 16) <= df_sales.date) & (df_sales.date < datetime(2022, 3, 23))]

# sum revenue per user so that each user_id has exactly one row
df_sales = df_sales.groupby(by='user_id')['price'].sum().reset_index()
df_sales = df_sales.sort_values('user_id')
df_sales.head()


|   | user_id | price |
|---|---------|-------|
| 0 | 000096 | 720 |
| 1 | 00092c | 780 |
| 2 | 000bb2 | 720 |
| 3 | 000ea9 | 1560 |
| 4 | 000ec6 | 690 |


In [105]:

# load and sort user experiment-info data (distribution of users into control/pilot groups)
df_users = pd.read_csv('experiment_users.csv')
df_users = df_users.sort_values('user_id')
df_users.head()


|   | user_id | pilot |
|---|---------|-------|
| 7487 | 0000d4 | 0 |
| 4714 | 0000de | 0 |
| 19681 | 0000e4 | 1 |
| 3512 | 0001e2 | 0 |
| 1313 | 00062e | 0 |


In [106]:

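# left-merge so that every experiment user is kept, including users with no sales in the window
df_users_sales = pd.merge(df_users, df_sales, how='left', on='user_id')
# users without sales have NaN revenue after the merge; count them as zero revenue
df_users_sales.fillna(0, inplace=True)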
df_users_sales.head()


|   | user_id | pilot | price |
|---|---------|-------|-------|
| 0 | 0000d4 | 0 | 0.0 |
| 1 | 0000de | 0 | 0.0 |
| 2 | 0000e4 | 1 | 0.0 |
| 3 | 0001e2 | 0 | 0.0 |
| 4 | 00062e | 0 | 0.0 |
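
Before filling missing values, it can help to check how many users have no sales in the window. One possible way, sketched here on hypothetical toy frames (`users` and `sales` are not the original data), is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking rows found only in the left frame.

```python
import pandas as pd

# Hypothetical toy frames with the same columns as df_users and df_sales.
users = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u4"], "pilot": [0, 1, 0, 1]})
sales = pd.DataFrame({"user_id": ["u1", "u4"], "price": [720.0, 1560.0]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = pd.merge(users, sales, how="left", on="user_id", indicator=True)

# Users present in the experiment but absent from the sales data.
n_without_sales = int((merged["_merge"] == "left_only").sum())
print(n_without_sales)

# Drop the helper column and count missing revenue as zero.
merged = merged.drop(columns="_merge")
merged["price"] = merged["price"].fillna(0)
print(merged)
```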


In [107]:

# create control group, by filtering the merged df
group_a = df_users_sales[df_users_sales.pilot == 0]['price']
group_a.head()


0    0.0
1    0.0
3    0.0
4    0.0
6    0.0
Name: price, dtype: float64

In [108]:

# create experimental group, by filtering the merged df
group_b = df_users_sales[df_users_sales.pilot == 1]['price']
group_b.head()


2     0.0
5     0.0
9     0.0
10    0.0
12    0.0
Name: price, dtype: float64

In [109]:

# perform independent t-test
# correct answer is 0.199
get_ttest_pvalue(group_a, group_b)


np.float64(0.199)
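
A p-value on its own does not show how large the ARPU gap is. A minimal sketch of a per-group summary follows, assuming a frame with the same columns as `df_users_sales`. The frame `df_example` and its values are hypothetical and are not results from the experiment.

```python
import pandas as pd

# Hypothetical frame shaped like df_users_sales (one row per user).
df_example = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "pilot": [0, 0, 0, 1, 1, 1],
    "price": [0.0, 720.0, 780.0, 0.0, 1560.0, 690.0],
})

# Per-group user count, ARPU (mean revenue per user), and standard deviation.
summary = df_example.groupby("pilot")["price"].agg(["count", "mean", "std"])

# Absolute ARPU difference: experimental group minus control group.
arpu_diff = summary.loc[1, "mean"] - summary.loc[0, "mean"]

print(summary)
print(arpu_diff)
```

Reporting the group means next to the p-value makes it easier to judge whether a non-significant result reflects a small difference or a noisy sample.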