In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import datetime
import scipy.stats as ss

import warnings
warnings.filterwarnings('ignore')

### Import data

In [2]:
referral = pd.read_csv('C:/Sophia/School!!!/2023 Spring 1/DS take home/11.user referral data/referral.csv', parse_dates=['date'])
referral.head()

,user_id,date,country,money_spent,is_referral,device_id
0,2,2015-10-03,FR,65,0,EVDCJTZMVMJDG
1,3,2015-10-03,CA,54,0,WUBZFTVKXGQQX
2,6,2015-10-03,FR,35,0,CBAPCJRTFNUJG
3,7,2015-10-03,UK,73,0,PRGXJZAJKMXRH
4,7,2015-10-03,MX,35,0,PRGXJZAJKMXRH


In [3]:
referral.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97341 entries, 0 to 97340
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      97341 non-null  int64         
 1   date         97341 non-null  datetime64[ns]
 2   country      97341 non-null  object        
 3   money_spent  97341 non-null  int64         
 4   is_referral  97341 non-null  int64         
 5   device_id    97341 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 4.5+ MB


In [4]:
dt_referral_starts = datetime.datetime(2015,10,31)

In [5]:
(pd.Series(referral.date.unique()) >= dt_referral_starts).value_counts()

False    28
True     28
dtype: int64

The data covers 28 days before the program starts and 28 days after, so the User Referral program launches exactly in the middle of the observation window.

### Q1: Can you estimate the impact the program had on the site?

### Hypothesis test on all data

In [6]:
def count_spent(df):
    d = {}
    d['n_purchase'] = len(df) # number of purchases on that day
    d['total_spent'] = df.money_spent.sum() # total money spent on that day
    d['n_customer'] = df.user_id.nunique() # number of distinct customers who purchased that day
    return pd.Series(d)

In [7]:
def daily_statistics(df):
    """
    given a dataframe
    1.  group by day, and return '#purchase','total spent money','#customers' on each day
    2.  split daily data into two groups, before the program and after the program
    3.  for each 'sale index' ('#purchase','total spent money','#customers'), 
        calculate the mean before/after the program, their difference, and pvalue 
    """
    grpby_day = df.groupby('date').apply(count_spent)

    grpby_day_before = grpby_day.loc[grpby_day.index < dt_referral_starts, :]
    grpby_day_after = grpby_day.loc[grpby_day.index >= dt_referral_starts, :]

    d = []
    colnames = ['total_spent','n_purchase','n_customer']
    for col in colnames:
        pre_data = grpby_day_before.loc[:,col]
        pre_mean = pre_data.mean()

        post_data = grpby_day_after.loc[:,col]
        post_mean = post_data.mean()

        result = ss.ttest_ind(pre_data, post_data, equal_var=False)
        # ttest_ind returns a two-sided p-value; halve it to get the one-tailed
        # p-value in the direction of the observed difference
        pvalue = result.pvalue / 2

        d.append({'mean_pre':pre_mean,'mean_post':post_mean,'mean_diff':post_mean - pre_mean,
                  'pvalue':pvalue})

    # re-order the columns
    return pd.DataFrame(d,index = colnames).loc[:,['mean_pre','mean_post','mean_diff','pvalue']]

In [8]:
daily_statistics(referral)

,mean_pre,mean_post,mean_diff,pvalue
total_spent,71657.0,83714.392857,12057.392857,0.135194
n_purchase,1690.75,1785.714286,94.964286,0.348257
n_customer,1384.464286,1686.964286,302.5,0.059545


After the 'user referral' program launched, all three sales indices (daily purchases, daily money spent, and daily customers) increased. However, none of these increases is significant at the **0.05** significance level.
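The one-tailed p-value above is obtained by halving the two-sided result. As a minimal sketch (not the notebook's own method), the same quantity can be requested directly through the `alternative` argument of `scipy.stats.ttest_ind`, assuming SciPy 1.6 or newer. The `pre` and `post` arrays below are made-up values for illustration only, not the referral data.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical daily totals (28 days each), made up purely for illustration.
pre = 100.0 + np.arange(28) % 7
post = 103.0 + (np.arange(28) * 3) % 7

# Welch's t-test, two-sided (the default).
two_sided = ss.ttest_ind(pre, post, equal_var=False)

# One-tailed version. H1: mean(pre) < mean(post), i.e. the metric rose after launch.
one_sided = ss.ttest_ind(pre, post, equal_var=False, alternative='less')

print(two_sided.pvalue / 2, one_sided.pvalue)
```

The two numbers agree only when the observed difference points in the hypothesized direction. Halving is appropriate when that direction is fixed before looking at the data; choosing it afterwards makes the test more lenient than the stated significance level suggests.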

### Hypothesis test grouped by country

In [9]:
referral.country.value_counts()

UK    15493
FR    15396
US    15280
IT    11446
DE    11093
ES     9831
CA     9440
MX     8133
CH     1229
Name: country, dtype: int64

In [10]:
daily_stat_bycountry = referral.groupby('country').apply(daily_statistics)

In [11]:
daily_stat_bycountry

country,,mean_pre,mean_post,mean_diff,pvalue
CA,total_spent,7468.428571,7880.428571,412.0,0.351704
CA,n_purchase,177.142857,160.0,-17.142857,0.233985
CA,n_customer,173.285714,159.178571,-14.107143,0.268256
CH,total_spent,1536.321429,1023.892857,-512.428571,0.006941
CH,n_purchase,26.821429,17.071429,-9.75,0.003072
CH,n_customer,26.714286,17.071429,-9.642857,0.003142
DE,total_spent,9856.75,8013.964286,-1842.785714,0.081459
DE,n_purchase,232.142857,164.035714,-68.107143,0.011798
DE,n_customer,224.964286,163.25,-61.714286,0.015665
ES,total_spent,6648.642857,8660.571429,2011.928571,0.037522


The results above show that the 'User Referral' program has different effects in different countries. It boosts sales in some countries but decreases them in others.

##### Country-based conclusion
At a **0.1** significance level, we can conclude that
- the program fails in CH and DE, where it significantly decreases sales.
- the program succeeds in MX, IT, FR, and ES, where it significantly increases sales.
- the program does not appear to have a significant effect in UK, CA, or US.
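The classification rule behind these conclusions can be sketched as follows. The `total_spent` rows are copied from the per-country output shown above (only CA, CH, DE, and ES appear there); `label_effect` is a hypothetical helper introduced for illustration, not part of the original analysis.

```python
import pandas as pd

# total_spent rows copied from the per-country output above (CA, CH, DE, ES only).
rows = pd.DataFrame(
    {
        'mean_diff': [412.0, -512.428571, -1842.785714, 2011.928571],
        'pvalue': [0.351704, 0.006941, 0.081459, 0.037522],
    },
    index=['CA', 'CH', 'DE', 'ES'],
)

def label_effect(row, alpha=0.1):
    """Hypothetical helper: classify one row by significance and sign of the change."""
    if row['pvalue'] >= alpha:
        return 'no significant effect'
    return 'increase' if row['mean_diff'] > 0 else 'decrease'

labels = rows.apply(label_effect, axis=1)
print(labels)
```

This looks at a single index (`total_spent`). A fuller version would also have to decide how to combine the three indices when they disagree for a country.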

### Q2: Based on the data, what would you suggest to do as a next step?

First, the company can run a more rigorous A/B test and collect more data to study the impact of the program.
Since the program's impact differs by country, the company can also investigate the reasons for the difference. For example, does the program conflict with cultural norms in CH and DE?

### Q3: The referral program wasn't really tested in a rigorous way. It simply started on a given day for all users and you are drawing conclusions by looking at the data before and after the test started. What kinds of risks does this approach present? Can you think of a better way to test the referral program and measure its impact?

This approach isn't a proper A/B test, because the "User Referral" program isn't the only difference between the control period (before Oct 31) and the test period (after Oct 31). For example, some countries may have a holiday after Oct 31, or colder weather after Oct 31 may increase demand for certain goods. Any such change is confounded with the effect of the program.
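The confounding risk can be illustrated with a small simulation. The sketch below generates 56 days of entirely made-up daily sales that follow a steady upward trend with no program effect at all, then applies the same before/after Welch t-test. The trend slope and noise level are arbitrary assumptions.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
days = np.arange(56)

# Simulated daily sales: a steady upward trend plus noise, and NO program effect.
sales = 1000 + 5 * days + rng.normal(0, 30, size=days.size)

# Same split as the analysis above: first 28 days versus last 28 days.
pre, post = sales[:28], sales[28:]
result = ss.ttest_ind(pre, post, equal_var=False)

print(post.mean() - pre.mean(), result.pvalue)
```

The test reports a highly significant increase even though nothing was launched, which is why a before/after comparison cannot separate the program's effect from any other change over time.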

To measure the program's impact more accurately, we need to run a more careful A/B test. For example:

- During the same period of time, randomly split the customers into two groups, and expose only one group to the User Referral program.
- Run the experiment for some time, then perform a t-test to see whether the 'sale performance indices' (e.g., daily spend, daily customers, daily transactions) differ significantly between the two groups.
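The two steps above can be sketched as follows. Everything here is simulated: the user population, the spend distribution, and the assumed lift of 2 units are made-up values for illustration. The sketch also compares per-user spend rather than daily totals, which is one possible choice of metric and not something the data above prescribes.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(42)
user_ids = np.arange(10_000)  # hypothetical user population

# Step 1: random assignment. Each user has a 50% chance of seeing the program.
in_test = rng.random(user_ids.size) < 0.5

# Simulated per-user spend over the experiment window (made-up numbers);
# the test group gets an assumed average lift of 2 units.
spend = rng.normal(45, 20, size=user_ids.size) + 2 * in_test

# Step 2: compare the two groups with Welch's t-test.
control, test = spend[~in_test], spend[in_test]
result = ss.ttest_ind(control, test, equal_var=False)

print(test.mean() - control.mean(), result.pvalue)
```

Because both groups are observed over the same calendar period, holidays, weather, and other time effects hit them equally, so a difference between groups can be attributed to the program. One caveat for a referral program in particular is that referred friends may land in the control group, so assignment by region or social cluster may be worth considering.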