<a href="https://colab.research.google.com/github/remerge/uplift-report/blob/master/uplift_report_per_campaign.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# remerge uplift report

This notebook allows you to validate the uplift reporting numbers provided by remerge. To do so, it downloads and analyses exported campaign and event data from S3. The campaign data contains all users that remerge marked to be part of an uplift test, the A/B group assignment, the timestamp of marking, conversion events (click, app open or similar) and their cost. The event data reflects the app event stream and includes events, their timestamp and revenue (if any). We calculate the incremental revenue and the iROAS in line with the [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view).

**Hint**: This notebook can be run in any Jupyter instance with enough space/memory, as a [Google Colab notebook](#Google-Colab-support) or as a standalone Python script. If you are using a copy of this notebook, on Colab or locally, you can find the original template on [GitHub: remerge/uplift-report](https://github.com/remerge/uplift-report/blob/master/uplift_report_per_campaign.ipynb).

### Notebook configuration

For this notebook to work properly, several variables in the [Configuration](#Configuration) section need to be set: `customer`, `audiences`, `revenue_event`, `dates` and the AWS credentials. All of these will be provided by your remerge account manager.


### Verification

To verify that the group split is random and unbiased, user events and attributes from before the campaign start can be compared and checked for an equal distribution in the test and control groups. For example, the user age distribution, the user activity distribution or the average spend per user should be the same in both groups before the campaign.
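
One way to run such a check is sketched below. It assumes that pre-campaign revenue events are available in a dataframe with `user_id`, `ab_test_group` and `revenue_eur` columns; the toy `pre_df` is made up for illustration. The Mann-Whitney U test used to compare the per-user spend of both groups is an assumption of this sketch, not part of the remerge methodology.

```python
import pandas as pd
import scipy.stats

# hypothetical pre-campaign events; in practice load them from your own event export
pre_df = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'c', 'd', 'd', 'e', 'f'],
    'ab_test_group': ['test', 'test', 'test', 'test',
                      'control', 'control', 'control', 'control'],
    'revenue_eur': [1.0, 2.0, 0.5, 4.0, 3.0, 0.5, 1.5, 2.5],
})

# total pre-campaign spend per user, split by group
spend_per_user = pre_df.groupby(['ab_test_group', 'user_id'])['revenue_eur'].sum()
test_spend = spend_per_user.loc['test']
control_spend = spend_per_user.loc['control']

# a large p-value means there is no evidence that the groups differed before the campaign
stat, p_value = scipy.stats.mannwhitneyu(test_spend, control_spend, alternative='two-sided')
print('mean spend test:', test_spend.mean(), 'control:', control_spend.mean())
print('p-value:', p_value)
```

The same pattern can be applied to any other pre-campaign attribute, such as the number of events per user.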



## Google Colab support

This notebook can be run inside Google Colab. Due to size limitations it contains several optimizations, such as removing unused fields from the input files and caching files. Furthermore, it installs missing dependencies and restarts the kernel. **Because pandas is upgraded, the kernel needs to be restarted once per fresh instance. Just run the cell again after the restart.**

In [0]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
  !pip install pyarrow
  !pip install gspread-pandas
  import pandas as pdt
  if pdt.__version__ != '0.23.4':
    # upgrading pandas requires a restart of the kernel
    # (we need an up to date pandas because we write to S3 for caching)
    # we kill it and let it auto restart (only needed once per fresh instance)
    !pip install pandas==0.23.4
    import os
    os.kill(os.getpid(), 9)

## Import needed packages

This notebook/script needs pandas and scipy for the analysis and s3fs to access the data stored on S3.


In [0]:
from datetime import datetime
import pandas as pd
import re
import os
import gzip
import scipy
import scipy.stats 
import s3fs
from IPython.display import display # so we can run this as script as well
import gc

## Configuration

Set the customer name, the audiences, and the access credentials for the S3 bucket and path. In addition, set `revenue_event` to the event for which the uplift should be evaluated.

In [0]:
# configure path and revenue event 
customer = ''
audiences = ['']
revenue_event = ''

# date range for the report
dates = pd.date_range(start='2019-02-14',end='2019-02-27')

# grouped report
groups = {}
  

# AWS credentials
os.environ["AWS_ACCESS_KEY_ID"] = ''
os.environ["AWS_SECRET_ACCESS_KEY"] = ''

## Helper
Define a few helper functions to load and cache data.

In [0]:
def path(audience):
  return "s3://remerge-customers/{0}/uplift_data/{1}".format(customer,audience)

# helper to remove a few things we load if we run in Google Colab
def limit_df(df,source):
    if not IN_COLAB:
      return df
    if source != 'attributions':
      return df
    # we drop a few things so we fit into the Colab memory limit
    # drop returns a new dataframe, so the result has to be reassigned
    df = df.drop(['partner','revenue','revenue_currency'], axis=1)
    df = df[df.partner_event == revenue_event]
    gc.collect()
    return df 

# helper to download CSV files, convert to DF and print time needed
# caches files locally and on S3 to be reused
def read_csv(audience, source, date):
    now = datetime.now()
    date_str = date.strftime('%Y%m%d')
    filename = '{0}/{1}/{2}.csv.gz'.format(path(audience), source, date_str)
    # local cache
    cache_dir = 'cache/{0}/{1}'.format(audience, source)
    cache_filename = '{0}/{1}.parquet'.format(cache_dir, date_str)
    if os.path.exists(cache_filename):
        print(now, 'loading from', cache_filename)
        df = pd.read_parquet(cache_filename, engine='pyarrow')
        df = limit_df(df,source)
        return df
    # s3 cache (useful if we don't have enough space on the Colab instance)
    s3_cache_filename = '{0}/{1}/cache/{2}.parquet'.format(path(audience), source, date_str)
    fs = s3fs.S3FileSystem(anon=False)
    if fs.exists(path=s3_cache_filename):
      print(now, 'loading from S3 cache', s3_cache_filename)
      return pd.read_parquet(s3_cache_filename, engine='pyarrow')
    print(now, 'start loading CSV for', audience, source, date)
    df = pd.read_csv(filename, escapechar='\\')
    df = limit_df(df,source)
    print(datetime.now(), 'finished loading CSV for', date.strftime('%d.%m.%Y'), 'took', datetime.now()-now)
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    df.to_parquet(cache_filename, engine='pyarrow')
    # write it to the S3 cache folder as well
    print(datetime.now(), 'caching as parquet', s3_cache_filename)
    df.to_parquet(s3_cache_filename, engine='pyarrow')
    return df

## Load CSV data from S3

Load mark, spend and event data from S3. 

### IMPORTANT

**The event data is usually quite large (several GB) so this operation might take several minutes or hours to complete, depending on the size and connection.**

In [0]:
bids_df = pd.concat([read_csv(audience,'marks_and_spend',date) for audience in audiences for date in dates], ignore_index = True, verify_integrity=True)

In [0]:
attributions_df = pd.concat([read_csv(audience,'attributions',date) for audience in audiences for date in dates], ignore_index = True, verify_integrity=True)

Print some statistics of the loaded data sets.

In [28]:
bids_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 535434 entries, 0 to 535433
Data columns (total 9 columns):
ts               535434 non-null object
event_type       535434 non-null object
ab_test_group    535434 non-null object
user_id          535422 non-null object
campaign_id      535434 non-null int64
cost_currency    25657 non-null object
cost             25657 non-null float64
cost_eur         25657 non-null float64
campaign_name    535434 non-null object
dtypes: float64(2), int64(1), object(6)
memory usage: 36.8+ MB


In [0]:
attributions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1266684 entries, 0 to 1266683
Data columns (total 10 columns):
ts                  1266684 non-null object
user_id             1266684 non-null object
event_id            0 non-null float64
partner             1266684 non-null object
partner_event       1266684 non-null object
revenue             1263513 non-null float64
revenue_currency    1266684 non-null object
revenue_eur         1263513 non-null float64
ab_test_group       38044 non-null object
event_data          1266684 non-null object
dtypes: float64(3), object(7)
memory usage: 96.6+ MB


In [0]:
# set formatting options
pd.set_option('display.float_format', '{:.2f}'.format)

## Remove invalid users

Due to a race condition during marking we need to filter out users that are marked as both *control* and *test*. In rare cases the same user was seen on different servers in the same second, and the servers, unaware of each other, marked the user differently. This was fixed in the latest version of the remerge platform, but we need to filter old data.

In [0]:
# users that are in both groups due to racy bids are invalid
# we need to filter them out
groups_per_user = bids_df.groupby('user_id')['ab_test_group'].nunique()
invalid_users = groups_per_user[groups_per_user > 1]
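
The filter above can be illustrated on a small made-up dataframe. The `toy_` names and the sample rows are hypothetical and only mirror the `user_id` and `ab_test_group` columns of `bids_df`.

```python
import pandas as pd

# toy mark data: user 'u2' was marked in both groups by two racing servers
toy_marks_df = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u2', 'u3'],
    'ab_test_group': ['test', 'test', 'control', 'control'],
})

# count the distinct groups each user was assigned to
toy_groups_per_user = toy_marks_df.groupby('user_id')['ab_test_group'].nunique()
toy_invalid_users = toy_groups_per_user[toy_groups_per_user > 1]

# drop every row of a user that appears in more than one group
toy_clean_df = toy_marks_df[~toy_marks_df['user_id'].isin(toy_invalid_users.index)]
print(list(toy_invalid_users.index))
print(toy_clean_df)
```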

## Define functions to prepare data frames


Calculate the cost of advertising given a dataframe. Remerge tracks monetary values in micro currency units.

In [0]:
def ad_spend(df):
  ad_spend_micros = df[df.event_type == 'buying_conversion']['cost_eur'].sum()
  return ad_spend_micros / 10**6
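
As a small illustration of the micro unit conversion, the sketch below sums the cost of the `buying_conversion` rows of a made-up dataframe (the values are hypothetical) and converts the result to EUR.

```python
import pandas as pd

# toy data: only 'buying_conversion' rows carry a cost (in micro EUR)
toy_cost_df = pd.DataFrame({
    'event_type': ['mark', 'buying_conversion', 'buying_conversion'],
    'cost_eur': [None, 1500000.0, 250000.0],
})

toy_spend_micros = toy_cost_df[toy_cost_df.event_type == 'buying_conversion']['cost_eur'].sum()
toy_spend_eur = toy_spend_micros / 10**6  # 1,000,000 micros are one currency unit
print(toy_spend_eur)
```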

The dataframe created by `marked` will contain all mark events (without the invalid marks). Remerge marks users per campaign. If a user was marked once for an audience, they keep the same group allocation for subsequent marks (different campaigns) unless it is manually reset on audience level.

In [0]:
def marked(df):
  mark_df = df[df.event_type == 'mark']
  mark_df = mark_df[~mark_df['user_id'].isin(invalid_users.index)]
  sorted_mark_df = mark_df.sort_values('ts')
  # keep only the earliest mark per user
  deduplicated_mark_df = sorted_mark_df.drop_duplicates(['user_id'])
  return deduplicated_mark_df
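
The effect of sorting by `ts` before dropping duplicates can be seen on a toy example. The rows below are made up, and the sketch assumes that the timestamps are strings in a format that sorts chronologically (such as `YYYY-MM-DD HH:MM:SS`).

```python
import pandas as pd

# toy marks: user 'u1' was marked twice, for two different campaigns
toy_mark_events = pd.DataFrame({
    'ts': ['2019-02-15 10:00:00', '2019-02-14 09:00:00', '2019-02-14 12:00:00'],
    'user_id': ['u1', 'u1', 'u2'],
    'campaign_id': [2, 1, 1],
})

# drop_duplicates keeps the first occurrence, which is the earliest mark after sorting
toy_first_marks = toy_mark_events.sort_values('ts').drop_duplicates(['user_id'])
print(toy_first_marks)
```

Each user ends up with exactly one row, so the later join with the revenue events compares every event against the user's first mark.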

`revenue` creates a dataframe that contains all relevant revenue events.

In [0]:
def revenue(df):
  revenue_df = df[pd.notnull(df['revenue_eur'])]
  return revenue_df[revenue_df.partner_event == revenue_event]

`merge` joins the marked users with the revenue events and excludes any revenue event that happened before the user was marked.

In [0]:
def merge(mark_df,revenue_df):
  merged_df = pd.merge(revenue_df, mark_df, on='user_id')
  return merged_df[merged_df.ts_x > merged_df.ts_y]
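
The sketch below shows the join on made-up data. Because both inputs have a `ts` column, pandas suffixes them: `ts_x` is the timestamp of the revenue event (left frame) and `ts_y` the timestamp of the mark (right frame). In the real data both frames also carry `ab_test_group`, which is why the uplift code reads `ab_test_group_y`; the toy revenue frame here has no such column, so it stays unsuffixed.

```python
import pandas as pd

toy_mark_df = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'ts': ['2019-02-14 10:00:00', '2019-02-15 10:00:00'],
    'ab_test_group': ['test', 'control'],
})
toy_revenue_df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'ts': ['2019-02-14 09:00:00', '2019-02-14 11:00:00', '2019-02-16 08:00:00'],
    'revenue_eur': [5000000, 7000000, 3000000],
})

toy_merged = pd.merge(toy_revenue_df, toy_mark_df, on='user_id')
# keep only revenue events that happened after the user was marked
toy_after_mark = toy_merged[toy_merged.ts_x > toy_merged.ts_y]
print(toy_after_mark[['user_id', 'ts_x', 'ts_y', 'revenue_eur']])
```

The first event of `u1` is dropped because it predates the mark.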


## Calculate uplift KPIs

We calculate the incremental revenue and the iROAS in line with the [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view). Afterwards we run a [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) on the results to test them for significance, comparing the conversions to the number of unique users per group.
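
To make the arithmetic concrete, here is a minimal standalone sketch that uses the figures from the "total" report shown later in this notebook. The revenue and spend values are taken as displayed (rounded to two decimals), so the derived numbers are approximate.

```python
import scipy.stats

# figures copied from the "total" report (already converted from micros)
ad_spend_eur = 6837.31
test_group_size = 332442
control_group_size = 81292
test_conversions = 18468
control_conversions = 4402
test_revenue = 2549640.92
control_revenue = 618852.91

# scale the control group up to the size of the test group
ratio = test_group_size / control_group_size
scaled_control_conversions = control_conversions * ratio
scaled_control_revenue = control_revenue * ratio

# incremental values are what the test group achieved beyond the scaled control group
incremental_conversions = test_conversions - scaled_control_conversions
incremental_revenue = test_revenue - scaled_control_revenue
iroas = incremental_revenue / ad_spend_eur

# 2x2 contingency table: non-converters vs. conversions per group
table = [
    [control_group_size - control_conversions, control_conversions],
    [test_group_size - test_conversions, test_conversions],
]
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(table, correction=False)

print('ratio test/control: %.2f' % ratio)
print('control conversions (scaled): %.2f' % scaled_control_conversions)
print('incremental conversions: %.2f' % incremental_conversions)
print('incremental revenue: %.2f, iROAS: %.2f' % (incremental_revenue, iroas))
print('chi^2: %.4f, p-value: %.4f' % (chi2, p_value))
```

The ratio and the scaled control conversions reproduce the corresponding rows of the report (4.09 and 18001.89).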

In [0]:
def uplift(ad_spend,mark_df,revenue_df,index_name):
  # group marked users by their ab_test_group
  grouped = mark_df.groupby(by='ab_test_group')
  control_df = grouped.get_group('control')
  test_df = grouped.get_group('test')
  
  # join marks and revenue events
  merged_df = merge(mark_df,revenue_df)
  grouped_revenue = merged_df.groupby(by='ab_test_group_y')
  
  # calculate KPIs
  test_group_size = test_df['user_id'].nunique()
  test_revenue_micros = grouped_revenue.get_group('test')['revenue_eur'].sum()
  test_revenue = test_revenue_micros / 10**6
  control_group_size = control_df['user_id'].nunique()
  control_revenue_micros = grouped_revenue.get_group('control')['revenue_eur'].sum()
  control_revenue = control_revenue_micros / 10**6
  test_conversions = grouped_revenue.get_group('test')['revenue_eur'].count()
  control_conversion = grouped_revenue.get_group('control')['revenue_eur'].count()
  ratio = float(test_group_size) / float(control_group_size)
  scaled_control_conversions = float(control_conversion) * ratio
  scaled_control_revenue_micros = float(control_revenue_micros) * ratio
  incremental_conversions = test_conversions - scaled_control_conversions
  incremental_revenue_micros = test_revenue_micros - scaled_control_revenue_micros
  incremental_revenue = incremental_revenue_micros / 10**6
  iroas = incremental_revenue / ad_spend
  chi_df = pd.DataFrame({
    "conversions": [control_conversion, test_conversions],
    "total": [control_group_size, test_group_size]
    }, index=['control', 'test'])

  chi,p,*_ = scipy.stats.chi2_contingency(pd.concat([chi_df.total - chi_df.conversions, chi_df.conversions], axis=1), correction=False)
  
  # show results as a dataframe
  # if you use Python 3.6+ and pandas 0.23+ columns is not needed as
  # the dict will keep its order
  # (older versions will sort this by name)

  return pd.DataFrame({
    "ad spend": ad_spend,
    "total revenue": test_revenue + control_revenue,
    "test group size": test_group_size,
    "test conversions": test_conversions,
    "test revenue": test_revenue,
    "size control group": control_group_size,
    "control conversion": control_conversion,
    "control revenue": control_revenue,
    "ratio test/control": ratio,
    "control conversions (scaled)": scaled_control_conversions,
    "control revenue (scaled)": scaled_control_revenue_micros / 10**6,
    "incremental conversions": incremental_conversions,
    "incremental revenue": incremental_revenue,
    "rev/conversions test":test_revenue / test_conversions,
    "rev/conversions control": control_revenue / control_conversion,
    "iROAS": iroas,
    "chi^2":chi,
    "p-value":p,
    "significant":p<0.05},index=[index_name], 
    columns=["ad spend","total revenue", "test group size","test conversions",
             "test revenue","size control group","control conversion",
             "control revenue","ratio test/control","control conversions (scaled)",
             "control revenue (scaled)","incremental conversions",
             "incremental revenue","rev/conversions test",
             "rev/conversions control","iROAS","chi^2","p-value","significant"]
  ).transpose()


In [0]:
# helper that adds an uplift report to a dataframe given a name
def add_uplift_report(name, target_df, df):
  tmp = uplift(ad_spend(df),marked(df),revenue_df,name)
  if target_df is not None:
    target_df[name] = tmp[name]
  else:
    target_df = tmp
  return target_df 

### Calculate and display uplift report for the data set as a whole

This takes the whole data set and calculates uplift KPIs.

In [23]:
revenue_df = revenue(attributions_df)
mark_df = marked(bids_df)
uplift(ad_spend(bids_df),mark_df,revenue_df,"total")

| | total |
|---|---|
| ad spend | 6837.31 |
| total revenue | 3168493.83 |
| test group size | 332442 |
| test conversions | 18468 |
| test revenue | 2549640.92 |
| size control group | 81292 |
| control conversion | 4402 |
| control revenue | 618852.91 |
| ratio test/control | 4.09 |
| control conversions (scaled) | 18001.89 |


### Calculate uplift report per group (if configured)

Sometimes it makes sense to look at groups of similar campaigns. If the `groups` dictionary contains group names as keys and a list of campaign ids as the value for each key, this cell will compile a report per group.
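
A possible shape for such a dictionary is sketched below. The group names are made up, and the way the campaign ids (taken from the per campaign report in this notebook) are bundled is purely hypothetical. The loop shows how `isin` selects the rows of each group.

```python
import pandas as pd

# hypothetical grouping: names are made up, ids come from the per campaign report
example_groups = {
    'group_a': [16171, 16172],
    'group_b': [16173, 16174, 16175],
}

# toy stand-in for bids_df with one row per campaign
toy_campaign_df = pd.DataFrame({
    'campaign_id': [16171, 16172, 16173, 16175, 17100],
    'event_type': ['mark'] * 5,
})

rows_per_group = {}
for name, campaigns in example_groups.items():
    # keep only the events that belong to one of the group's campaigns
    group_df = toy_campaign_df[toy_campaign_df.campaign_id.isin(campaigns)]
    rows_per_group[name] = len(group_df)
print(rows_per_group)
```

Campaigns that are not listed in any group (here `17100`) are simply left out of the grouped report.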

In [0]:
# if there are groups filter the events against the per campaign groups and generate report
if len(groups) > 0:
  per_group_df = None
  for name, campaigns in groups.items():
    per_group_df = add_uplift_report(name,per_group_df,bids_df[bids_df.campaign_id.isin(campaigns)])
  display(per_group_df)

### Calculate uplift report per campaign

Sometimes it makes sense to look at the uplift report per campaign. Each campaign usually reflects one segment of users. To do that, we iterate over all campaigns in the current dataset.

In [26]:
per_campaign_df = None
for campaign in bids_df['campaign_id'].unique():
  name = "c_{0}".format(campaign)
  df = bids_df[bids_df.campaign_id == campaign]
  tmp = uplift(ad_spend(df),marked(df),revenue_df,name)
  if per_campaign_df is not None:
    per_campaign_df[name] = tmp[name]
  else:
    per_campaign_df = tmp
per_campaign_df

| | c_16171 | c_16172 | c_16173 | c_16174 | c_16175 | c_16177 | c_17100 | c_17099 |
|---|---|---|---|---|---|---|---|---|
| ad spend | 2109.71 | 876.69 | 1404.18 | 982.93 | 468.14 | 420.08 | 292.95 | 282.61 |
| total revenue | 1010707.00 | 188201.05 | 207442.42 | 93475.03 | 1014197.23 | 420039.53 | 231259.53 | 128966.90 |
| test group size | 56594 | 27503 | 39984 | 34755 | 66176 | 62081 | 38666 | 55993 |
| test conversions | 8594 | 1500 | 1473 | 648 | 4016 | 1364 | 1006 | 512 |
| test revenue | 814947.23 | 151798.24 | 165654.53 | 75797.69 | 831555.16 | 330499.45 | 181696.89 | 100263.51 |
| size control group | 13744 | 6657 | 9486 | 8555 | 16242 | 15162 | 9538 | 13747 |
| control conversion | 2090 | 317 | 390 | 145 | 873 | 363 | 242 | 134 |
| control revenue | 195759.77 | 36402.81 | 41787.89 | 17677.34 | 182642.07 | 89540.08 | 49562.64 | 28703.39 |
| ratio test/control | 4.12 | 4.13 | 4.22 | 4.06 | 4.07 | 4.09 | 4.05 | 4.07 |
| control conversions (scaled) | 8606.04 | 1309.67 | 1643.87 | 589.07 | 3556.93 | 1486.31 | 981.04 | 545.80 |
