<a href="https://colab.research.google.com/github/remerge/uplift-report/blob/remove-invalid-users-low-memory/uplift_report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# remerge uplift report

This notebook allows you to validate the uplift reporting numbers provided by remerge. To do so, it downloads and analyses exported campaign and event data from S3. The campaign data contains all users that remerge marked as part of an uplift test, their A/B group assignment, the timestamp of the marking, conversion events (click, app open or similar) and their cost. The event data reflects the app event stream and includes events, their timestamps and revenue (if any). We calculate the incremental revenue and the iROAS in line with the [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view).

**Hint**: This notebook can be run in any Jupyter instance with enough space/memory, as a [Google Colab notebook](#Google-Colab-support) or as a standalone Python script. If you are using a copy of this notebook running on Colab or locally, you can find the original template on [GitHub: remerge/uplift-report](https://github.com/remerge/uplift-report/blob/master/uplift_report_per_campaign.ipynb).

### Notebook configuration

For this notebook to work properly, several variables in the [Configuration](#Configuration) section need to be set: `customer`, `audiences`, `revenue_event`, `dates` and the AWS credentials. All of these will be provided by your remerge account manager.

In [None]:
!pip install xxhash
!pip install pandas==0.24.0
!pip install scipy
!pip install s3fs
!pip install google.colab
!pip install pyarrow

## Google Colab support

This notebook can be run inside Google Colab. Due to size limitations it contains several optimizations, such as removing unused fields from the input files and caching files. Furthermore, it installs missing dependencies and restarts the kernel. **If pandas was upgraded, the kernel needs to be restarted once per fresh instance. Just run the cell again after the restart.**

In [None]:
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !pip install pyarrow
    !pip install xxhash
    !pip install partd
    
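    import re
    import pandas as pdt
    # compare numeric version parts; a plain string comparison would rank '0.9.0' above '0.23.4'
    installed_version = tuple(int(part) for part in re.findall(r'\d+', pdt.__version__)[:3])
    if installed_version < (0, 23, 4):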
        # upgrading pandas requires a restart of the kernel
        # (we need an up to date pandas because we write to S3 for caching)
        # we kill it and let it auto restart (only needed once per fresh instance)
        !pip install pandas==0.23.4
        
        import os
        os.kill(os.getpid(), 9)

## Import needed packages

This notebook/script needs pandas and scipy for the analysis and s3fs to access the data stored on S3.


In [None]:
from datetime import datetime
import pandas as pd
import numpy as np
import xxhash
import re
import os
import gzip
import scipy
import scipy.stats
import s3fs
from google.colab import files
from IPython.display import display  # so we can run this as script as well

In [None]:
import helpers

In [None]:
import importlib
importlib.reload(helpers)

## Version
The version of the analysis script corresponds to the methodology version in the whitepaper: the major and minor version represent the whitepaper version, while the revision represents changes and fixes to the uplift report script.

In [None]:
display(helpers.VERSION)

## Configuration

Set the customer name, the audiences and the access credentials for the S3 bucket and path. Furthermore, set `revenue_event` to the event for which the uplift should be evaluated.

In [None]:
# configure path and revenue event 
customer = ''
audiences = ['']
revenue_event = 'purchase'

# date range for the report
dates = pd.date_range(start='2019-01-01',end='2019-01-01')

# AWS credentials
os.environ["AWS_ACCESS_KEY_ID"] = ''
os.environ["AWS_SECRET_ACCESS_KEY"] = ''

# Configure the reporting output: 

# named groups that aggregate several campaigns
groups = {}

# show uplift results per campaign:
per_campaign_results = False

# base statistical calculations on unique converters instead of conversions
use_converters_for_significance = False

# enable deduplication heuristic for appsflyer
use_deduplication = False
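
The `use_converters_for_significance` switch chooses between two ways of counting: conversions count every revenue event, whereas converters count each user at most once. A minimal sketch of the difference, using made-up data (the tuples, user IDs and function names below are hypothetical and not the report's actual data model):

```python
# Hypothetical revenue events as (user_id, ab_test_group) tuples.
events = [
    ("u1", "test"), ("u1", "test"), ("u2", "test"),
    ("u3", "control"), ("u3", "control"), ("u3", "control"),
]


def count_conversions(events, group):
    """Count every revenue event in the given group."""
    return sum(1 for _, g in events if g == group)


def count_converters(events, group):
    """Count each user in the given group at most once."""
    return len({user for user, g in events if g == group})


for group in ("test", "control"):
    print(group, count_conversions(events, group), count_converters(events, group))
```

In this toy data a single heavy buyer in the control group produces three conversions but only one converter, which is why the two settings can lead to different significance results.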

## Data loading helpers
The helper functions that load and cache the data are provided by the `helpers` module imported above.
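
As a rough illustration of the load-and-cache idea, the sketch below keeps only the needed columns on first access and serves later calls from a cached copy. It is an assumption-laden stand-in: it works on local CSV files with the standard library, and `load_rows` is a hypothetical name. The real `helpers.read_csv` reads from S3 and may work quite differently.

```python
import csv
import os
import tempfile


def load_rows(source_path, cache_path, columns):
    """Return rows restricted to `columns`, reading from the cache when present."""
    if not os.path.exists(cache_path):
        # First access: keep only the needed columns and write them to the cache.
        with open(source_path, newline="") as src, \
                open(cache_path, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            for row in csv.DictReader(src):
                writer.writerow(row)
    with open(cache_path, newline="") as cached:
        return list(csv.DictReader(cached))


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "export.csv")
    cache = os.path.join(tmp, "export.cache.csv")
    with open(source, "w", newline="") as f:
        f.write("ts,user_id,unused\n1546300800,u1,x\n")
    first = load_rows(source, cache, ["ts", "user_id"])
    os.remove(source)  # the second call is served from the cache alone
    second = load_rows(source, cache, ["ts", "user_id"])
```

Dropping unused columns before caching keeps both the cache files and the in-memory data smaller, which matters on memory-constrained instances such as Colab.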

## Load CSV data from S3

Load mark, spend and event data from S3. 

### IMPORTANT

**The event data is usually quite large (several GB) so this operation might take several minutes or hours to complete, depending on the size and connection.**

In [None]:
bid_columns = ['ts', 'user_id', 'ab_test_group', 'campaign_id','cost_eur','event_type']
bids_df = pd.concat([helpers.read_csv(customer, audience, 'marks_and_spend', date, columns=bid_columns) for audience in audiences for date in dates],
                    ignore_index=True, verify_integrity=True)

In [None]:
attribution_columns = ['ts', 'user_id', 'partner_event', 'revenue_eur', 'ab_test_group']
attributions_df = pd.concat(
    [helpers.read_csv(customer, audience, 'attributions', date, attribution_columns, revenue_event, helpers.extract_revenue_events ) for audience in audiences for date in dates],
    ignore_index=True, verify_integrity=True)

Print some statistics of the loaded data sets.

In [None]:
bids_df.info(memory_usage='deep')


In [None]:
attributions_df.info(memory_usage='deep')

### Deduplication for AppsFlyer
AppsFlyer sends some events twice, so we remove the duplicates before the analysis.

In [None]:
if use_deduplication:
    attributions_df = helpers.drop_duplicates_in_attributions(attributions_df, pd.Timedelta('1 minute'))
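
To give an idea of what a time-window deduplication heuristic can look like, here is a minimal pure-Python sketch. It assumes that an event is a duplicate when the same user sent the same event less than the given gap after the previous one. The function name and the tuple layout are hypothetical, and the actual helper may apply different rules.

```python
from datetime import datetime, timedelta


def drop_near_duplicates(events, max_gap):
    """Drop events that follow the previous event of the same user and
    type by less than `max_gap`.

    `events` is a list of (ts, user_id, partner_event) tuples.
    """
    kept = []
    last_seen = {}  # (user_id, partner_event) -> timestamp of the previous event
    for ts, user_id, partner_event in sorted(events):
        key = (user_id, partner_event)
        previous = last_seen.get(key)
        if previous is None or ts - previous >= max_gap:
            kept.append((ts, user_id, partner_event))
        last_seen[key] = ts  # remember the event even if it was dropped
    return kept


events = [
    (datetime(2019, 1, 1, 12, 0, 0), "u1", "purchase"),
    (datetime(2019, 1, 1, 12, 0, 20), "u1", "purchase"),  # 20 s later: dropped
    (datetime(2019, 1, 1, 12, 5, 0), "u1", "purchase"),
    (datetime(2019, 1, 1, 12, 0, 5), "u2", "purchase"),
]
deduplicated = drop_near_duplicates(events, timedelta(minutes=1))
print(len(deduplicated))
```

The one-minute window mirrors the `pd.Timedelta('1 minute')` argument used in the cell above.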

### Calculate and display uplift report for the data set as a whole

This takes the whole data set and calculates uplift KPIs.

In [None]:
report = helpers.uplift_report(bids_df, attributions_df, groups, per_campaign_results, use_converters_for_significance)
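
For orientation, the two headline KPIs can be sketched as follows. This assumes a common formulation: control revenue is scaled to the size of the test group, incremental revenue is the difference between test revenue and the scaled control revenue, and iROAS is incremental revenue divided by ad spend. The function name and all numbers are made up for illustration. The whitepaper remains the authoritative definition, and `helpers.uplift_report` may differ in detail.

```python
def uplift_kpis(test_users, control_users, test_revenue, control_revenue, ad_spend):
    """Return (incremental_revenue, iroas) for one test/control split."""
    # Scale control revenue to the size of the test group so both are comparable.
    scaled_control_revenue = control_revenue * test_users / control_users
    incremental_revenue = test_revenue - scaled_control_revenue
    iroas = incremental_revenue / ad_spend
    return incremental_revenue, iroas


incremental_revenue, iroas = uplift_kpis(
    test_users=80000, control_users=20000,
    test_revenue=50000.0, control_revenue=10000.0,
    ad_spend=5000.0,
)
print(incremental_revenue, iroas)  # 10000.0 2.0
```

Here the control group is a quarter of the size of the test group, so its revenue of 10,000 is scaled to 40,000 before it is subtracted from the test revenue.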

## Uplift Results

You can configure the output using the variables in the [Configuration](#Configuration) section.

In [None]:
# set formatting options
pd.set_option('display.float_format', '{:.5f}'.format)

In [None]:
display(report)

### CSV Export - combined reports

In [None]:
def export_csv(df, file_name):
    df.to_csv(file_name) 
    files.download(file_name)

In [None]:
start = dates[0]
end = dates[-1]
export_csv(report, '{}_{}-{}.csv'.format(customer, start, end))
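
Note that formatting timestamps without a format specification includes the time of day, so the resulting file name contains spaces and colons. If a more compact name is preferred, a date format specification can be used. The sketch below uses plain `datetime` objects as a stand-in for the entries of `dates`, and the customer name is hypothetical:

```python
from datetime import datetime

customer = "acme"  # hypothetical customer name
start = end = datetime(2019, 1, 1)

# Default formatting keeps the time of day in the file name.
default_name = "{}_{}-{}.csv".format(customer, start, end)

# A date format specification yields a shorter, shell-friendly name.
compact_name = "{}_{:%Y-%m-%d}-{:%Y-%m-%d}.csv".format(customer, start, end)

print(default_name)
print(compact_name)
```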