<a href="https://colab.research.google.com/github/remerge/uplift-report/blob/master/uplift_report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# remerge uplift report

This notebook allows you to validate the uplift reporting numbers provided by remerge. To do so, it downloads and analyses exported campaign and event data from S3. The campaign data contains all users that remerge marked as part of an uplift test, their A/B group assignment, the timestamp of marking, conversion events (click, app open or similar) and their cost. The event data reflects the app event stream and includes events, their timestamps and revenue (if any). We calculate the incremental revenue and the iROAS in line with the [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view).

**Hint**: This notebook can be run in any Jupyter instance with enough space/memory, as a [Google Colab notebook](#Google-Colab-support) or as a standalone Python script.

### Notebook configuration

For this notebook to work properly, several variables in the [Configuration](#Configuration) section need to be set: `customer`, `audiences`, `revenue_event`, `dates` and the AWS credentials. All of these will be provided by your remerge account manager.


### Verification

To verify that the group split is random and unbiased, compare user events and attributes from before the campaign start and check that they are equally distributed in the test and control groups. For example, the user age distribution, the user activity distribution or the average spend per user should be the same in both groups before the campaign.
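
As a minimal sketch of such a check, assume a hypothetical dataframe with one row per user that holds the group assignment and a pre-campaign attribute. The helper `check_balance`, the column `pre_campaign_spend` and the synthetic data below are illustrative assumptions and not part of the remerge export. A two-sample Kolmogorov-Smirnov test then compares the two distributions:

```python
import numpy as np
import pandas as pd
import scipy.stats

def check_balance(users_df, column, group_column='ab_test_group'):
    """Compare the distribution of `column` between test and control users."""
    test_values = users_df.loc[users_df[group_column] == 'test', column]
    control_values = users_df.loc[users_df[group_column] == 'control', column]
    # two-sample Kolmogorov-Smirnov test: a small p-value hints at a biased split
    statistic, p_value = scipy.stats.ks_2samp(test_values, control_values)
    return statistic, p_value

# synthetic example data (assumption): random 80/20 split, exponential spend
rng = np.random.RandomState(42)
users_df = pd.DataFrame({
    'ab_test_group': rng.choice(['test', 'control'], size=5000, p=[0.8, 0.2]),
    'pre_campaign_spend': rng.exponential(scale=10.0, size=5000),
})

statistic, p_value = check_balance(users_df, 'pre_campaign_spend')
print('KS statistic:', statistic, 'p-value:', p_value)
```

A large p-value does not prove that the split is unbiased, it only means that this particular attribute shows no detectable difference between the groups.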



## Google Colab support

This notebook can be run inside Google Colab. Due to size limitations it contains several optimizations, such as removing unused fields from the input files and caching files. Furthermore, it installs missing dependencies and restarts the kernel. **Because pandas is upgraded, the kernel needs to be restarted once per fresh instance. Just run the cell again after the restart.**

In [1]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
  !pip install pyarrow
  import pandas as pdt
  if pdt.__version__ != '0.23.4':
    # upgrading pandas requires a restart of the kernel
    # (we need an up to date pandas because we write to S3 for caching)
    # we kill it and let it auto restart (only needed once per fresh instance)
    !pip install pandas==0.23.4
    import os
    os.kill(os.getpid(), 9)



## Import needed packages

This notebook/script needs pandas and scipy for analysis and s3fs to access the data stored on S3.


In [0]:
from datetime import datetime
import pandas as pd
import re
import os
import gzip
import scipy
import scipy.stats 
import s3fs
from IPython.display import display # so we can run this as script as well
import gc

## Configuration

Set the customer name, audiences and access credentials for the S3 bucket and path. Furthermore, set `revenue_event` to the event for which we want to evaluate the uplift.

In [0]:
# configure path and revenue event 
customer = ''
audiences = ['']
revenue_event = ''

# date range for the report
dates = pd.date_range(start='2019-02-14',end='2019-02-27')

# AWS credentials
os.environ["AWS_ACCESS_KEY_ID"] = ''
os.environ["AWS_SECRET_ACCESS_KEY"] = ''

## Helper
Define a few helper functions to load and cache data.

In [0]:
def path(audience):
  return "s3://remerge-customers/{0}/uplift_data/{1}".format(customer,audience)

# helper to remove a few things we load if we run in Google Colab
def limit_df(df,source):
    if not IN_COLAB:
      return df
    if source != 'attributions':
      return df
    # we drop a few things so we fit into the Colab memory limit
    # drop returns a new dataframe, so the result has to be assigned
    df = df.drop(['partner','revenue','revenue_currency'], axis=1)
    df = df[df.partner_event == revenue_event]
    gc.collect()
    return df 

# helper to download CSV files, convert to DF and print time needed
# caches files locally and on S3 to be reused
def read_csv(audience, source, date):
    now = datetime.now()
    date_str = date.strftime('%Y%m%d')
    filename = '{0}/{1}/{2}.csv.gz'.format(path(audience), source, date_str)
    # local cache
    cache_dir = 'cache/{0}/{1}'.format(audience, source)
    cache_filename = '{0}/{1}.parquet'.format(cache_dir, date_str)
    if os.path.exists(cache_filename):
        print(now, 'loading from', cache_filename)
        df = pd.read_parquet(cache_filename, engine='pyarrow')
        df = limit_df(df,source)
        return df
    # s3 cache (useful if we don't have enough space on the Colab instance)
    s3_cache_filename = '{0}/{1}/cache/{2}.parquet'.format(path(audience), source, date_str)
    fs = s3fs.S3FileSystem(anon=False)
    if fs.exists(path=s3_cache_filename):
      print(now, 'loading from S3 cache', s3_cache_filename)
      # apply the same Colab limits as for the local cache
      return limit_df(pd.read_parquet(s3_cache_filename, engine='pyarrow'), source)
    print(now, 'start loading CSV for', date)
    df = pd.read_csv(filename, escapechar='\\')
    df = limit_df(df,source)
    print(datetime.now(), 'finished loading CSV for', date.strftime('%d.%m.%Y'), 'took', datetime.now()-now)
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    df.to_parquet(cache_filename, engine='pyarrow')
    # write it to the S3 cache folder as well
    print(datetime.now(), 'caching as parquet', s3_cache_filename)
    df.to_parquet(s3_cache_filename, engine='pyarrow')
    return df

## Load CSV data from S3

Load mark, spend and event data from S3. 

### IMPORTANT

**The event data is usually quite large (several GB) so this operation might take several minutes or hours to complete, depending on the size and connection.**

In [17]:
bid_df = pd.concat([read_csv(audience,'marks_and_spend',date) for audience in audiences for date in dates], ignore_index = True, verify_integrity=True)

2019-03-11 11:38:37.455038 start loading CSV for 2019-02-14 00:00:00
2019-03-11 11:38:37.729656 finished loading CSV for 14.02.2019 took 0:00:00.274650
2019-03-11 11:38:37.755021 start loading CSV for 2019-02-15 00:00:00
2019-03-11 11:38:37.938140 finished loading CSV for 15.02.2019 took 0:00:00.183151
2019-03-11 11:38:37.972273 start loading CSV for 2019-02-16 00:00:00
2019-03-11 11:38:38.140322 finished loading CSV for 16.02.2019 took 0:00:00.168088
2019-03-11 11:38:38.177744 start loading CSV for 2019-02-17 00:00:00
2019-03-11 11:38:38.354706 finished loading CSV for 17.02.2019 took 0:00:00.176996
2019-03-11 11:38:38.388169 start loading CSV for 2019-02-18 00:00:00
2019-03-11 11:38:38.613885 finished loading CSV for 18.02.2019 took 0:00:00.225749
2019-03-11 11:38:38.646317 start loading CSV for 2019-02-19 00:00:00
2019-03-11 11:38:39.191846 finished loading CSV for 19.02.2019 took 0:00:00.545579
2019-03-11 11:38:39.361665 start loading CSV for 2019-02-20 00:00:00
2019-03-11 11:38:39

In [0]:
attributions_df = pd.concat([read_csv(audience,'attributions',date) for audience in audiences for date in dates], ignore_index = True, verify_integrity=True)

2019-03-11 09:50:50.152831 loading from cache cache/attributions/20190214.parquet
2019-03-11 09:50:50.401528 loading from cache cache/attributions/20190215.parquet
2019-03-11 09:50:50.639826 start loading CSV for 2019-02-16 00:00:00
2019-03-11 09:52:05.588125 finished loading CSV for 16.02.2019 took 0:01:14.948358
2019-03-11 09:52:05.972775 start loading CSV for 2019-02-17 00:00:00
2019-03-11 09:53:25.215945 finished loading CSV for 17.02.2019 took 0:01:19.243234
2019-03-11 09:53:25.557478 start loading CSV for 2019-02-18 00:00:00
2019-03-11 09:54:47.680872 finished loading CSV for 18.02.2019 took 0:01:22.123456
2019-03-11 09:54:48.025925 start loading CSV for 2019-02-19 00:00:00
2019-03-11 09:56:09.794334 finished loading CSV for 19.02.2019 took 0:01:21.768470
2019-03-11 09:56:10.137239 start loading CSV for 2019-02-20 00:00:00
2019-03-11 09:57:38.803291 finished loading CSV for 20.02.2019 took 0:01:28.666105
2019-03-11 09:57:39.141374 start loading CSV for 2019-02-21 00:00:00
2019-03

Print some statistics of the loaded data sets.

In [0]:
bid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 535434 entries, 0 to 535433
Data columns (total 9 columns):
ts               535434 non-null object
event_type       535434 non-null object
ab_test_group    535434 non-null object
user_id          535422 non-null object
campaign_id      535434 non-null int64
cost_currency    25657 non-null object
cost             25657 non-null float64
cost_eur         25657 non-null float64
campaign_name    535434 non-null object
dtypes: float64(2), int64(1), object(6)
memory usage: 36.8+ MB


In [0]:
attributions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 771714 entries, 0 to 771713
Data columns (total 10 columns):
ts                  771714 non-null object
user_id             771714 non-null object
event_id            0 non-null float64
partner             771714 non-null object
partner_event       771714 non-null object
revenue             768543 non-null float64
revenue_currency    771714 non-null object
revenue_eur         768543 non-null float64
ab_test_group       38044 non-null object
event_data          771714 non-null object
dtypes: float64(3), object(7)
memory usage: 58.9+ MB


In [0]:
# set formatting options
pd.set_option('display.float_format', '{:.2f}'.format)

## Remove invalid users

Due to a race condition during marking we need to filter out users that are marked as both *control* and *test*. In rare cases the same user was seen on different servers in the same second, and the servers, unaware of each other, assigned the user to different groups. This was fixed in the latest version of the remerge platform, but we still need to filter old data.

In [0]:
# users that are in both groups due to racy bids are invalid
# we need to filter them out
groups_per_user = bid_df.groupby('user_id')['ab_test_group'].nunique()
invalid_users = groups_per_user[groups_per_user > 1]
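
To illustrate the filter above, here is a small sketch on made-up data. The `toy_` names and the user ids are assumptions for illustration only. User `b` was marked as both *test* and *control* and is therefore the only invalid user:

```python
import pandas as pd

# toy marks: user 'b' ended up in both groups
toy_bid_df = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'ab_test_group': ['test', 'test', 'test', 'control', 'control'],
})

# count the distinct groups every user was assigned to
toy_groups_per_user = toy_bid_df.groupby('user_id')['ab_test_group'].nunique()
# users with more than one distinct group are invalid
toy_invalid_users = toy_groups_per_user[toy_groups_per_user > 1]

print(list(toy_invalid_users.index))  # ['b']
```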

The `mark_df` dataframe will contain all mark events (without the invalid marks). It is then grouped by the assigned A/B test group.

In [0]:
mark_df = bid_df[bid_df.event_type == 'mark']
mark_df = mark_df[~mark_df['user_id'].isin(invalid_users.index)]
grouped = mark_df.groupby(by='ab_test_group')
control_df = grouped.get_group('control')
test_df = grouped.get_group('test')

Calculate the cost of advertising. Remerge tracks monetary values in micro currency units, so the sum is divided by 10^6 to get the amount in EUR.

In [0]:
ad_spend_micros = bid_df[bid_df.event_type == 'buying_conversion']['cost_eur'].sum()
ad_spend = ad_spend_micros / 10**6
ad_spend

6837.308504

Create a dataframe that contains all relevant revenue events.

In [0]:
revenue_df = attributions_df[pd.notnull(attributions_df['revenue_eur'])]
revenue_df = revenue_df[revenue_df.partner_event == revenue_event]

Remerge marks users per campaign. This analysis looks at the uplift per audience, so we drop duplicate marks for users that were marked by multiple campaigns and keep only the earliest mark. A user who was marked once for an audience keeps the same group allocation for subsequent marks unless it is manually reset on audience level.

In [0]:
sorted_mark_df = mark_df.sort_values('ts')
depuplicated_mark_df = sorted_mark_df.drop_duplicates(['user_id'])
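
The effect of sorting by timestamp before dropping duplicates can be sketched on made-up data (the `toy_` names and values are illustrative assumptions). `drop_duplicates` keeps the first occurrence by default, so after sorting only the earliest mark per user survives:

```python
import pandas as pd

# toy marks: user 'a' was marked by two campaigns
toy_mark_df = pd.DataFrame({
    'ts': ['2019-02-15 10:00:00', '2019-02-14 09:00:00', '2019-02-14 12:00:00'],
    'user_id': ['a', 'a', 'b'],
    'campaign_id': [2, 1, 1],
    'ab_test_group': ['test', 'test', 'control'],
})

# sort ascending by timestamp, then keep the first (earliest) row per user
toy_first_marks = toy_mark_df.sort_values('ts').drop_duplicates(['user_id'])
print(toy_first_marks[['user_id', 'ts', 'campaign_id']])
```

Note that the timestamps here are strings, as in the loaded CSV data. Sorting them lexicographically only matches chronological order as long as all values share the same fixed-width format.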

Join the marked users with the revenue events and exclude any revenue event that happened before the user was marked.

In [0]:
merged_df = pd.merge(revenue_df, depuplicated_mark_df, on='user_id')
merged_df = merged_df[merged_df.ts_x > merged_df.ts_y]
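
A small sketch of this join on made-up data follows (the `toy_` names and values are illustrative assumptions). Because both frames have a `ts` column, pandas suffixes them as `ts_x` (revenue event) and `ts_y` (mark). The sketch parses the timestamps with `pd.to_datetime` so that the comparison does not depend on the string format:

```python
import pandas as pd

# toy revenue events in micro currency units
toy_revenue_df = pd.DataFrame({
    'ts': pd.to_datetime(['2019-02-14 08:00:00', '2019-02-14 11:00:00',
                          '2019-02-15 09:00:00']),
    'user_id': ['a', 'a', 'b'],
    'revenue_eur': [1000000, 2000000, 3000000],
})
# one (deduplicated) mark per user
toy_marks = pd.DataFrame({
    'ts': pd.to_datetime(['2019-02-14 09:00:00', '2019-02-14 12:00:00']),
    'user_id': ['a', 'b'],
    'ab_test_group': ['test', 'control'],
})

toy_merged_df = pd.merge(toy_revenue_df, toy_marks, on='user_id')
# keep only revenue events that happened after the user was marked
toy_merged_df = toy_merged_df[toy_merged_df.ts_x > toy_merged_df.ts_y]
print(toy_merged_df[['user_id', 'ts_x', 'ts_y', 'revenue_eur', 'ab_test_group']])
```

The 08:00 event of user `a` precedes the mark at 09:00 and is dropped, the other two events are kept.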

## Calculate uplift KPIs

We calculate the incremental revenue and the iROAS in line with the [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view). Afterwards we run a [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) to check the results for significance, comparing the number of conversions with the number of unique users per group.

In [0]:
grouped_revenue = merged_df.groupby(by='ab_test_group_y')
test_group_size = test_df['user_id'].nunique()
test_revenue_micros = grouped_revenue.get_group('test')['revenue_eur'].sum()
test_revenue = test_revenue_micros / 10**6
control_group_size = control_df['user_id'].nunique()
control_revenue_micros = grouped_revenue.get_group('control')['revenue_eur'].sum()
control_revenue = control_revenue_micros / 10**6
test_conversions = grouped_revenue.get_group('test')['revenue_eur'].count()
control_conversion = grouped_revenue.get_group('control')['revenue_eur'].count()
ratio = float(test_group_size) / float(control_group_size)
scaled_control_conversions = float(control_conversion) * ratio
scaled_control_revenue_micros = float(control_revenue_micros) * ratio
incremental_conversions = test_conversions - scaled_control_conversions
incremental_revenue_micros = test_revenue_micros - scaled_control_revenue_micros
incremental_revenue = incremental_revenue_micros / 10**6
iroas = incremental_revenue / ad_spend
chi_df = pd.DataFrame({
    "conversions": [control_conversion, test_conversions],
    "total": [control_group_size, test_group_size]
    }, index=['control', 'test'])

chi,p,*_ = scipy.stats.chi2_contingency(pd.concat([chi_df.total - chi_df.conversions, chi_df.conversions], axis=1), correction=False)
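
The calculation above can be condensed into a standalone sketch. The helper name `uplift_kpis` and its signature are assumptions for illustration. The input figures are the group sizes, conversions, revenues and ad spend that appear in the result table of this notebook, already converted from micros:

```python
import scipy.stats

def uplift_kpis(test_size, test_conversions, test_revenue,
                control_size, control_conversions, control_revenue, ad_spend):
    """Scale the control group to the test group size and derive the KPIs."""
    ratio = float(test_size) / float(control_size)
    scaled_control_conversions = control_conversions * ratio
    scaled_control_revenue = control_revenue * ratio
    incremental_revenue = test_revenue - scaled_control_revenue
    # 2x2 contingency table: [non-converters, converters] per group
    contingency = [
        [control_size - control_conversions, control_conversions],
        [test_size - test_conversions, test_conversions],
    ]
    chi, p, dof, expected = scipy.stats.chi2_contingency(contingency, correction=False)
    return {
        'ratio': ratio,
        'scaled_control_conversions': scaled_control_conversions,
        'incremental_conversions': test_conversions - scaled_control_conversions,
        'incremental_revenue': incremental_revenue,
        'iroas': incremental_revenue / ad_spend,
        'chi2': chi,
        'p_value': p,
    }

kpis = uplift_kpis(test_size=332442, test_conversions=18468, test_revenue=2549640.92,
                   control_size=81292, control_conversions=4402,
                   control_revenue=618852.91, ad_spend=6837.31)
for name, value in kpis.items():
    print(name, value)
```

Scaling the control group by the size ratio makes the absolute conversion and revenue numbers of the two groups comparable even though the groups differ in size.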

In [0]:
# show results as a dataframe
# if you use Python 3.6+ and pandas 0.23+, columns is not needed as
# the dict will keep its order
# (older versions will sort this by name)

result_df = pd.DataFrame({
    "ad spend": ad_spend,
    "total revenue": test_revenue + control_revenue,
    "test group size": test_group_size,
    "test conversions": test_conversions,
    "test revenue": test_revenue,
    "size control group": control_group_size,
    "control conversion": control_conversion,
    "control revenue": control_revenue,
    "ratio test/control": ratio,
    "control conversions (scaled)": scaled_control_conversions,
    "control revenue (scaled)": scaled_control_revenue_micros / 10**6,
    "incremental conversions": incremental_conversions,
    "incremental revenue": incremental_revenue,
    "rev/conversions test":test_revenue / test_conversions,
    "rev/conversions control": control_revenue / control_conversion,
    "iROAS": iroas,
    "chi^2":chi,
    "p-value":p,
    "significant":p<0.05},index=["value"], 
    columns=["ad spend","total revenue", "test group size","test conversions",
             "test revenue","size control group","control conversion",
             "control revenue","ratio test/control","control conversions (scaled)",
             "control revenue (scaled)","incremental conversions",
             "incremental revenue","rev/conversions test",
             "rev/conversions control","iROAS","chi^2","p-value","significant"]
  ).transpose()


display(result_df)


| | value |
|---|---|
| ad spend | 6837.31 |
| total revenue | 3168493.83 |
| test group size | 332442 |
| test conversions | 18468 |
| test revenue | 2549640.92 |
| size control group | 81292 |
| control conversion | 4402 |
| control revenue | 618852.91 |
| ratio test/control | 4.09 |
| control conversions (scaled) | 18001.89 |
