# Task 2.
Summarize the results of the experiment in Excel using the following data: ab_stats.csv. Does ARPPU differ statistically significantly between the two groups? What recommendations would you give the manager?

In [47]:
import pandas as pd
import scipy.stats as stats

## Working with the dataset

ARPPU (Average Revenue Per Paying User) is the average payment per paying user. It is calculated as a weighted average (LT payments / LT first-time payers) over the aggregation period. It shows how much, on average, a user who registered during the aggregation period and became a paying user pays over their entire lifetime.
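As a minimal sketch of the definition above, ARPPU per group can be approximated as total revenue from paying users divided by the number of paying users. The `toy` frame below is hypothetical data that only borrows the column names of `ab_stats.csv`, and treating `purchase == 1` as "paying user" is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror ab_stats.csv.
toy = pd.DataFrame({
    "ab_group": ["A", "A", "A", "B", "B", "B"],
    "revenue":  [0.0, 10.0, 30.0, 0.0, 0.0, 12.0],
    "purchase": [0, 1, 1, 0, 0, 1],
})

# Keep paying users only (assumption: purchase == 1 marks a payer).
payers = toy[toy["purchase"] == 1]

# ARPPU per group = mean revenue among paying users of that group.
arppu = payers.groupby("ab_group")["revenue"].mean()
print(arppu)
```

Non-paying users are excluded before averaging; averaging over all users would give ARPU rather than ARPPU.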

### Data preparation

In [48]:
df = pd.read_csv('ab_stats.csv')
df.head(10)

Unnamed: 0,revenue,num_purchases,purchase,ab_group,av_site visit
0,0.0,0,0,A,9.040174
1,0.0,0,0,A,4.811628
2,0.0,0,0,A,7.342623
3,0.0,0,0,A,7.744581
4,0.0,0,0,A,10.511814
5,0.0,0,0,A,9.578727
6,0.0,0,0,A,6.162601
7,0.0,0,0,A,11.909452
8,0.0,0,0,A,6.54091
9,0.0,0,0,A,7.990794


revenue - revenue
num_purchases - number of purchases
purchase - whether a purchase was made (0/1)
ab_group - A/B group
av_site visit - average number of site visits per user

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23652 entries, 0 to 23651
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   revenue        23652 non-null  float64
 1   num_purchases  23652 non-null  int64  
 2   purchase       23652 non-null  int64  
 3   ab_group       23652 non-null  object 
 4   av_site visit  23652 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 924.0+ KB


In [50]:
df.describe()

Unnamed: 0,revenue,num_purchases,purchase,av_site visit
count,23652.0,23652.0,23652.0,23652.0
mean,0.324689,0.04359,0.020717,7.013112
std,9.55773,1.079403,0.142438,3.154584
min,0.0,0.0,0.0,-12.073486
25%,0.0,0.0,0.0,5.173787
50%,0.0,0.0,0.0,7.007936
75%,0.0,0.0,0.0,8.864119
max,1303.609284,152.0,1.0,22.446822


In [51]:
df.isna().sum()

revenue          0
num_purchases    0
purchase         0
ab_group         0
av_site visit    0
dtype: int64

In [52]:
df['ab_group'].value_counts()

ab_group
A    11835
B    11817
Name: count, dtype: int64

In [53]:
df[df['revenue'] > 0]

Unnamed: 0,revenue,num_purchases,purchase,ab_group,av_site visit
45,1.885595,1,1,A,7.654627
54,1.002159,1,1,A,6.392489
82,2.990000,1,1,A,8.596604
104,49.990000,1,1,A,8.885633
110,22.093757,4,1,A,8.708759
...,...,...,...,...,...
23426,2.489611,1,1,B,9.015714
23493,74.950000,5,1,B,5.881950
23495,3.667866,1,1,B,7.450014
23584,19.990000,1,1,B,9.813696


### Looking at the groups separately

In [54]:
df[df['ab_group']=='A'].describe()

Unnamed: 0,revenue,num_purchases,purchase,av_site visit
count,11835.0,11835.0,11835.0,11835.0
mean,0.404462,0.050697,0.021631,6.974724
std,13.133218,1.467511,0.145481,2.023533
min,0.0,0.0,0.0,-12.073486
25%,0.0,0.0,0.0,5.656155
50%,0.0,0.0,0.0,6.982329
75%,0.0,0.0,0.0,8.345572
max,1303.609284,152.0,1.0,17.728836


In [55]:
df[df['ab_group']=='B'].describe()

Unnamed: 0,revenue,num_purchases,purchase,av_site visit
count,11817.0,11817.0,11817.0,11817.0
mean,0.244794,0.036473,0.019802,7.051559
std,3.176534,0.41848,0.139325,3.976799
min,0.0,0.0,0.0,-8.286822
25%,0.0,0.0,0.0,4.380984
50%,0.0,0.0,0.0,7.060873
75%,0.0,0.0,0.0,9.768648
max,113.83,25.0,1.0,22.446822


We have not computed anything yet, but it is already noticeable that the maximum per-user revenue in group A (1303.61) is much larger than in group B (113.83); this needs to be taken into account.
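One way to take such an extreme payer into account is to see how much a single maximum moves the group mean compared with the median. The sketch below uses hypothetical payer revenues (not the values from `ab_stats.csv`), and the helper `arppu_summary` is an illustrative name introduced here.

```python
import pandas as pd

# Hypothetical payer revenues: each group contains one very large payment.
toy_payers = pd.DataFrame({
    "ab_group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "revenue":  [2.0, 3.0, 5.0, 1300.0, 2.0, 4.0, 6.0, 100.0],
})

def arppu_summary(revenue: pd.Series) -> pd.Series:
    """Mean, median, and mean after dropping the single largest payment."""
    trimmed = revenue.drop(revenue.idxmax())
    return pd.Series({
        "mean": revenue.mean(),
        "median": revenue.median(),
        "mean_without_max": trimmed.mean(),
    })

# One summary row per group.
summary = pd.DataFrame({
    group: arppu_summary(rev)
    for group, rev in toy_payers.groupby("ab_group")["revenue"]
}).T
print(summary)
```

If the mean changes drastically once the largest payment is removed while the median barely moves, a rank-based comparison is likely a safer choice than a comparison of raw means.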

#### ARPPU metric

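We take as the null hypothesis that ARPPU is the same in both groups, $H_0: \mathrm{ARPPU}_A = \mathrm{ARPPU}_B$, against the alternative $H_1: \mathrm{ARPPU}_A \neq \mathrm{ARPPU}_B$.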

In [56]:
df.groupby('ab_group')['revenue'].describe()

ab_group,count,mean,std,min,25%,50%,75%,max
A,11835.0,0.404462,13.133218,0.0,0.0,0.0,0.0,1303.609284
B,11817.0,0.244794,3.176534,0.0,0.0,0.0,0.0,113.83


The data are not normally distributed (revenue is heavily right-skewed, with extreme outliers), so a **nonparametric Mann-Whitney test** should be used.
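The non-normality claim can be checked explicitly rather than read off the summary table. The sketch below runs a Shapiro-Wilk test and computes skewness on a hypothetical right-skewed sample (`toy_revenue`, drawn from a lognormal distribution as a stand-in for payers' revenue); it is not computed on `ab_stats.csv`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for payers' revenue: strongly right-skewed sample.
toy_revenue = rng.lognormal(mean=1.0, sigma=1.5, size=300)

# Shapiro-Wilk: a small p-value is evidence against normality.
shapiro_stat, shapiro_p = stats.shapiro(toy_revenue)

# Positive skewness indicates a long right tail.
skewness = stats.skew(toy_revenue)

print(f"Shapiro-Wilk W={shapiro_stat:.3f}, p={shapiro_p:.3g}, skewness={skewness:.2f}")
```

On the real data the same two calls could be applied separately to each group's payers (rows with `purchase == 1`).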

In [57]:
# ARPPU is defined over paying users, so compare revenue only where purchase == 1
payers = df[df['purchase'] == 1]
stats.mannwhitneyu(x=payers.loc[payers['ab_group'] == 'A', 'revenue'],
                   y=payers.loc[payers['ab_group'] == 'B', 'revenue'], alternative='two-sided')

MannwhitneyuResult(statistic=29729.5, pvalue=0.8871956616344514)

There are no statistically significant differences (p ≈ 0.89, well above the 0.05 threshold). Rolling out the changes from the test group is not advisable: ARPPU in the test and control groups does not differ significantly, and there are no grounds to reject the null hypothesis.
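Because the Mann-Whitney test compares ranks rather than means, a bootstrap confidence interval for the difference in mean revenue per payer can serve as a complementary check on ARPPU itself. This is a sketch under stated assumptions: `payers_a` and `payers_b` are hypothetical samples, and `bootstrap_mean_diff` is an illustrative helper, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical payer revenues for two groups (same underlying distribution).
payers_a = rng.lognormal(mean=1.0, sigma=1.0, size=250)
payers_b = rng.lognormal(mean=1.0, sigma=1.0, size=230)

def bootstrap_mean_diff(a, b, n_boot=2000, rng=rng):
    """Percentile 95% CI for mean(a) - mean(b) via resampling with replacement."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        resampled_a = rng.choice(a, size=a.size)  # replace=True by default
        resampled_b = rng.choice(b, size=b.size)
        diffs[i] = resampled_a.mean() - resampled_b.mean()
    return np.percentile(diffs, [2.5, 97.5])

ci_low, ci_high = bootstrap_mean_diff(payers_a, payers_b)
print(f"95% CI for ARPPU difference (A - B): [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that contains zero would be consistent with the conclusion of no significant difference in ARPPU; an interval that excludes zero would suggest looking more closely at the outliers before making a recommendation.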