# Remerge uplift report

This notebook allows you to validate the uplift reporting numbers provided by remerge. To do so, it downloads and analyses exported campaign and event data from S3. The campaign data contains all users that remerge marked as part of an uplift test, their A/B group assignment, the timestamp of marking, conversion events (click, app open or similar) and their cost. The event data reflects the app event stream and includes events, their timestamps and revenue (if any). We calculate the incremental revenue and the iROAS in line with the [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view).
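As a rough orientation, the arithmetic behind incremental revenue and iROAS can be sketched with toy numbers. The sketch mirrors the scaling used in the uplift function further down in this notebook (control figures are scaled by the test/control group size ratio); all values are made up for illustration and are not real campaign data.

```python
# Toy numbers, purely illustrative (not real campaign data).
test_group_size = 80_000
control_group_size = 20_000
test_revenue = 50_000.0
control_revenue = 10_000.0
ad_spend = 4_000.0

# Scale the control group to the size of the test group.
ratio = test_group_size / control_group_size
scaled_control_revenue = control_revenue * ratio

# Revenue in the test group beyond what the scaled control group produced.
incremental_revenue = test_revenue - scaled_control_revenue

# Incremental return on ad spend.
iroas = incremental_revenue / ad_spend

print(ratio, scaled_control_revenue, incremental_revenue, iroas)
```

With these numbers the control revenue is scaled by a factor of 4, so the test group earns 10,000 more than the scaled control group, and each unit of ad spend yields 2.5 units of incremental revenue.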

**Hint**: This notebook can be run in any Jupyter instance with enough space/memory, as a [Google Colab notebook](#Google-Colab-version) or as a standalone Python script. If you are using a copy of this notebook running on Colab or locally, you can find the original template on [GitHub: remerge/uplift-report](https://github.com/remerge/uplift-report/blob/master/uplift_report_per_campaign.ipynb).

### Notebook configuration

For this notebook to work properly, several variables in the [Configuration](#Configuration) section need to be set: `customer`, `audiences`, `revenue_event`, `dates` and the AWS credentials. All of these will be provided by your remerge account manager.

In [0]:
# Import remerge uplift-report library
import os

# if we are in jupyter environment - we have cloned the repo already and `lib` is available
# on Colab we need to clone the repo and enable the same loading path through a symlink
if not os.path.exists('lib'):
    !git clone --branch internal https://github.com/remerge/uplift-report.git
    !ln -s uplift-report/lib
    
    !pip install lib/
    
    # Since we could have upgraded some dependencies, that require restart of the kernel (specifically `pandas`),
    # it is safer to perform this restart now
    os.kill(os.getpid(), 9)    

## Import packages

This notebook/script needs our Uplift Report helper library, as well as the other dependencies it brings with it.


## Load helpers

In [0]:
import os
import pandas as pd

from lib.helpers import Helpers

from IPython.display import display  # so we can run this as script as well

## Version
Version of the analysis script corresponding to the methodology version in the whitepaper (Major + Minor version represent the whitepaper version, revision represents changes and fixes of the uplift report script).

In [0]:
display(Helpers.version())

## Configuration

Set the customer name, the audiences and the access credentials for the S3 bucket and path. In addition, the event for which we want to evaluate the uplift needs to be set via `revenue_event`.
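To illustrate the expected shape of these settings, here is a small sketch with placeholder values. The customer name, audience name and campaign IDs are hypothetical; the structure of `groups` (a label mapped to a list of campaign IDs) is inferred from how the notebook later filters on `campaign_id`.

```python
import pandas as pd

# Hypothetical placeholder values; use the ones from your account manager.
example_customer = 'example_customer'
example_audiences = ['example_audience']

# One entry per day, both ends inclusive.
example_dates = pd.date_range(start='2020-01-20', end='2020-01-26')

# Named groups map a label to a list of campaign IDs (IDs here are made up).
example_groups = {
    'group_a': [1001, 1002],
    'group_b': [1003],
}

print(len(example_dates), example_dates[0].date(), example_dates[-1].date())
```

Because `pd.date_range` includes both the start and the end date, a one-week report needs `end` set to the seventh day, not the eighth.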

In [0]:
customer = ''
audiences = ['']
# date range for the report
dates = pd.date_range(start='2020-01-20',end='2020-01-20')
# AWS credentials
os.environ['AWS_ACCESS_KEY_ID'] = ''
os.environ['AWS_SECRET_ACCESS_KEY'] = ''

# Configure the reporting output: 

# named groups that aggregate several campaigns
groups = {}

# show uplift results per campaign:
per_campaign_results = False

# base statistical calculations on unique converters instead of conversions
use_converters_for_significance = False

In [0]:
# Instantiate & configure the helpers
#
# Hint: Press Alt + / or Tab to see the docstring with parameter descriptions in Google Colab
helpers = Helpers(
    customer=customer,
    audiences=audiences,
    revenue_event='purchase',
    dates=dates,
    attribution_dates=dates,
    groups=groups,
    use_converters_for_significance=use_converters_for_significance,
)

In [0]:
from datetime import datetime

import s3fs
import pandas as pd

from lib.helpers import _S3CachedFile

def read_king_s3_csv(customer, audience, date):
    now = datetime.now()

    file_path_template = 's3://remerge-customers/{0}/uplift_data/{1}/attributions/{2}.csv'

    date_str = date.strftime('%Y%m%d')

    filename = file_path_template.format(
        customer,
        audience,
        date_str,
    )

    print('start loading CSV for', audience, date)
    print('filename', filename)

    fs = s3fs.S3FileSystem(anon=False)
    fs.connect_timeout = 10  # defaults to 5
    fs.read_timeout = 30  # defaults to 15

    df = pd.DataFrame()

    if not fs.exists(path=filename):
        print('WARNING: no CSV file found for', audience, date, ', skipping the file:', filename)
        return df

    read_csv_kwargs = {'chunksize': 10 ** 3}

    with _S3CachedFile(fs, filename) as s3_file:
        print('starting processing CSV for', date.strftime('%d.%m.%Y'))
        # collect all chunks first and concatenate once, instead of copying the growing frame per chunk
        chunks = list(pd.read_csv(s3_file.local_path, escapechar='\\', low_memory=False, **read_csv_kwargs))
        if chunks:
            df = pd.concat(chunks, ignore_index=True)
    
    print('finished processing CSV for', date.strftime('%d.%m.%Y'), 'took', datetime.now() - now)

    return df

In [0]:
TEST = "test"
CONTROL = "control"

def king_uplift(
    marks_and_spend_df, 
    attributions_df, 
    index_name="total", 
    m_hypothesis=1, 
    use_converters_for_significance=False,
    ):
    """
    # Uplift Calculation
    We calculate the incremental revenue and the iROAS in line with the
    [remerge whitepaper](https://drive.google.com/file/d/1PTJ93Cpjw1BeiVns8dTcs2zDDWmmjpdc/view). Afterwards we run
    a [chi squared test](https://en.wikipedia.org/wiki/Chi-squared_test) on the results to test for significance of
    the results, comparing conversion to per group uniques.
    """
    
    # calculate group sizes
    test_group_size = attributions_df[attributions_df['group_name'] == TEST]['user_id'].nunique()
    if test_group_size == 0:
        print("WARNING: No users marked as test for", index_name, "- skipping.")
        return None

    control_group_size = attributions_df[attributions_df['group_name'] == CONTROL]['user_id'].nunique()
    if control_group_size == 0:
        print("WARNING: No users marked as control for", index_name, "- skipping.")
        return None

    # group the attributed revenue events by A/B group
    grouped_revenue = attributions_df.groupby(by='group_name')

    # init all KPIs with 0s first:
    test_revenue_dollars = 0
    test_conversions = 0
    test_converters = 0

    control_revenue_dollars = 0
    control_conversions = 0
    control_converters = 0

    # we might not have any events for a certain group in the time period;
    # in that case the zero defaults above are kept
    if TEST in grouped_revenue.groups:
        test_revenue_df = grouped_revenue.get_group(TEST)
        test_revenue_dollars = test_revenue_df['revenue'].sum()
        test_conversions = test_revenue_df['user_id'].count()
        test_converters = test_revenue_df['user_id'].nunique()

    if CONTROL in grouped_revenue.groups:
        control_revenue_df = grouped_revenue.get_group(CONTROL)
        control_revenue_dollars = control_revenue_df['revenue'].sum()
        # control_conversions = control_revenue_df['partner_event'].count()
        # as we filtered by revenue event and dropped the column we can just use
        control_conversions = control_revenue_df['user_id'].count()
        control_converters = control_revenue_df['user_id'].nunique()

    # calculate KPIs
    test_revenue = test_revenue_dollars
    control_revenue = control_revenue_dollars

    ratio = float(test_group_size) / float(control_group_size)
    scaled_control_conversions = float(control_conversions) * ratio
    scaled_control_revenue_dollars = float(control_revenue_dollars) * ratio
    incremental_conversions = test_conversions - scaled_control_conversions
    incremental_revenue_dollars = test_revenue_dollars - scaled_control_revenue_dollars
    incremental_revenue = incremental_revenue_dollars
    incremental_converters = test_converters - control_converters * ratio

    # calculate the ad spend
    ad_spend = marks_and_spend_df[(marks_and_spend_df.event_type == 'buying_conversion') & (marks_and_spend_df.ab_test_group == True)]['cost'].sum() / 10 ** 6
    
    iroas = incremental_revenue / ad_spend
    icpa = ad_spend / incremental_conversions
    cost_per_incremental_converter = ad_spend / incremental_converters

    rev_per_conversion_test = 0
    rev_per_conversion_control = 0
    if test_conversions > 0:
        rev_per_conversion_test = test_revenue / test_conversions
    if control_conversions > 0:
        rev_per_conversion_control = control_revenue / control_conversions

    test_cvr = test_conversions / test_group_size
    control_cvr = control_conversions / control_group_size

    uplift = 0
    if control_cvr > 0:
        uplift = test_cvr / control_cvr - 1

    # calculate statistical significance
    control_successes, test_successes = control_conversions, test_conversions
    if use_converters_for_significance or max(test_cvr, control_cvr) > 1.0:
        control_successes, test_successes = control_converters, test_converters
    chi_df = pd.DataFrame({
        "conversions": [control_successes, test_successes],
        "total": [control_group_size, test_group_size]
    }, index=[CONTROL, TEST])
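    # chi2_contingency raises a ValueError with insufficient data
    # (e.g. an expected frequency of zero); fall back to no significance
    import scipy.stats  # imported here because no other cell of this notebook imports it
    try:
        chi, p, _, _ = scipy.stats.chi2_contingency(
            pd.concat([chi_df.total - chi_df.conversions, chi_df.conversions], axis=1), correction=False)
    except ValueError:
        chi, p = 0, 1.0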

    # bonferroni correction with equal weights - if we have multiple hypothesis:
    # https://en.wikipedia.org/wiki/Bonferroni_correction
    significant = p < 0.05 / m_hypothesis

    dataframe_dict = {
        "ad spend": ad_spend,
        "total revenue": test_revenue + control_revenue,
        "test group size": test_group_size,
        "test conversions": test_conversions,
        "test converters": test_converters,
        "test revenue": test_revenue,
        "control group size": control_group_size,
        "control conversions": control_conversions,
        "control converters": control_converters,
        "control revenue": control_revenue,
        "ratio test/control": ratio,
        "control conversions (scaled)": scaled_control_conversions,
        "control revenue (scaled)": scaled_control_revenue_dollars,
        "incremental conversions": incremental_conversions,
        "incremental converters": incremental_converters,
        "incremental revenue": incremental_revenue,
        "rev/conversions test": rev_per_conversion_test,
        "rev/conversions control": rev_per_conversion_control,
        "test CVR": test_cvr,
        "control CVR": control_cvr,
        "CVR Uplift": uplift,
        "iROAS": iroas,
        "cost per incr. converter": cost_per_incremental_converter,
        "iCPA": icpa,
        "chi^2": chi,
        "p-value": p,
        "significant": significant
    }

    # show results as a dataframe
    return pd.DataFrame(
        dataframe_dict,
        index=[index_name],
    ).transpose()

## Load CSV data from S3

Load mark, spend and event data from S3. 

### IMPORTANT

**The event data is usually quite large (several GB) so this operation might take several minutes or hours to complete, depending on the size and connection.**
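Because the files can be large, the CSV data is read in chunks. The following is a minimal sketch of that pattern using an in-memory CSV with made-up rows instead of an S3 file; the column names and the chunk size are illustrative assumptions.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large export (made-up data).
csv_text = "user_id,revenue\n" + "\n".join(f"u{i},{i}" for i in range(10))

# Read the file in pieces of 4 rows and keep each piece in a list.
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunks.append(chunk)

# Concatenate once at the end to get a single frame with a fresh index.
df = pd.concat(chunks, ignore_index=True)

print(len(chunks), len(df))
```

Collecting the chunks in a list and concatenating once avoids re-copying the accumulated data for every chunk, which matters when a file has millions of rows.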

In [0]:
marks_and_spend_df = helpers.load_marks_and_spend_data()

In [0]:
attributions_df = pd.concat(
    [read_king_s3_csv(
        customer=customer,
        audience=audience,
        date=date,
    ) for audience in audiences for date in dates],
    ignore_index=True,
    verify_integrity=True,
)

Print some statistics of the loaded data sets.

In [0]:
marks_and_spend_df.info(memory_usage='deep')

In [0]:
attributions_df.info(memory_usage='deep')

### Remove users in both control and test group

Very rarely, due to timing issues, users can be marked for both the control and the test group. Those users are filtered out in this step.
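The idea behind this filter can be sketched in plain pandas. This is not the library's implementation; the column names `user_id` and `group_name` and the sample rows are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up marks: user 'a' was marked for both groups.
marks = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'c', 'c'],
    'group_name': ['test', 'control', 'test', 'control', 'control'],
})

# Count the distinct groups each user was assigned to.
groups_per_user = marks.groupby('user_id')['group_name'].nunique()

# Users with more than one distinct group are ambiguous and get dropped.
ambiguous_users = groups_per_user[groups_per_user > 1].index
cleaned = marks[~marks['user_id'].isin(ambiguous_users)]

print(list(ambiguous_users), len(cleaned))
```

User `c` appears twice but always in the control group, so only user `a` is removed.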

In [0]:
marks_and_spend_df = helpers.remove_users_marked_as_control_and_test(marks_and_spend_df)

### Calculate and display uplift report for the data set as a whole

This takes the whole data set and calculates uplift KPIs.
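The significance check used in the report is a chi-squared test on a 2x2 table of non-converting versus converting users per group, followed by a Bonferroni-adjusted threshold. A minimal sketch with made-up group sizes and conversion counts, assuming `scipy` is installed:

```python
import scipy.stats

# Made-up counts, purely illustrative.
control_group_size, control_conversions = 10_000, 200
test_group_size, test_conversions = 10_000, 260

# Rows: control, test. Columns: non-converted, converted.
table = [
    [control_group_size - control_conversions, control_conversions],
    [test_group_size - test_conversions, test_conversions],
]

# correction=False disables Yates' continuity correction, as in the report code.
chi2, p, dof, expected = scipy.stats.chi2_contingency(table, correction=False)

# Bonferroni correction: divide the 5% threshold by the number of hypotheses.
m_hypothesis = 1
significant = p < 0.05 / m_hypothesis

print(chi2, p, significant)
```

When several groups or campaigns are evaluated at once, `m_hypothesis` grows and the threshold for each individual result becomes stricter.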

In [0]:
report_df = king_uplift(
    marks_and_spend_df=marks_and_spend_df, 
    attributions_df=attributions_df,
    use_converters_for_significance=use_converters_for_significance,
    )

# if there are groups filter the events against the per campaign groups and generate report
if report_df is not None and groups:
    for name, campaigns in groups.items():
        group_df = marks_and_spend_df[marks_and_spend_df.campaign_id.isin(campaigns)]
        report_df[name] = king_uplift(
            marks_and_spend_df=group_df,
            attributions_df=attributions_df,
            index_name=name,
            m_hypothesis=len(groups))

if report_df is not None and per_campaign_results:
    campaigns = marks_and_spend_df['campaign_id'].unique()
    for campaign in campaigns:
        name = "c_{0}".format(campaign)
        campaign_df = marks_and_spend_df[marks_and_spend_df.campaign_id == campaign]
        report_df[name] = king_uplift(
            marks_and_spend_df=campaign_df,
            attributions_df=attributions_df,
            index_name=name,
            m_hypothesis=len(campaigns),
        )

## Uplift Results

You can configure the output by using the variables in the 'Configuration' section.

In [0]:
# set formatting options
pd.set_option('display.float_format', '{:.5f}'.format)

In [0]:
display(report_df)

### CSV Export

In [0]:
start = dates[0]
end = dates[-1]

helpers.export_csv(df=report_df, file_name='{}_{}-{}.csv'.format(customer, start, end))

# Evolution through time
Show the evolution of our KPIs through time by iterating through `dates` and compiling a cumulative report for each day, covering the period from the first date up to and including that day.

The report can be customized to use a different `report column` (a group column name or campaign) instead of `'total'`.
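The cumulative windows can be sketched as follows. The sketch assumes, as the notebook's filter does, that events carry a Unix timestamp in a `ts` column; the event rows are made up.

```python
import pandas as pd

dates = pd.date_range(start='2020-01-20', end='2020-01-22')

# Made-up events: one on the first day, two on the second, one on the third.
event_days = ['2020-01-20', '2020-01-21', '2020-01-21', '2020-01-22']
events = pd.DataFrame({
    'ts': [pd.Timestamp(day).timestamp() + 3600 for day in event_days],
})

start = dates[0]
counts = {}
for end in dates:
    # The window always begins at the first date and ends after the current day.
    upper = (end + pd.Timedelta(days=1)).timestamp()
    window = events[(events['ts'] >= start.timestamp()) & (events['ts'] <= upper)]
    counts[end.strftime('%Y-%m-%d')] = len(window)

print(counts)
```

Each window contains all events of the previous windows plus the events of the current day, so the counts can only grow from one day to the next.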

In [0]:
# time development analysis
from datetime import timedelta
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams["figure.figsize"] = [20, 20]

# evolution configuration
# which sub column to use for reporting, default='total'
report_column = 'total'

# which columns of the report to plot graphically
plot_columns = ['test group size', 'test CVR', 'control CVR', 'CVR Uplift', 'p-value']

def filter_df(df, from_date, to_date):
    return df[(df['ts'] >= from_date.timestamp()) & (df['ts'] <= (to_date + timedelta(days=1)).timestamp())]

def analyze_evolution(report_column, plot_columns):
    start = dates[0]
    reports_df = pd.DataFrame()

    for end in dates:
        marks = filter_df(marks_and_spend_df, start, end)
        attr = filter_df(attributions_df, start, end)
        report = helpers.uplift_report(marks_and_spend_df=marks, attributions_df=attr)
        # if we miss a data file the report will be empty
        if report is None:
            continue
        report = report[[report_column]]
        report = report.transpose()
        report['date'] = end
        report = report.set_index('date')
        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        reports_df = pd.concat([reports_df, report])

    #display full reports per day as table
    display(reports_df)

    # plot the selected columns
    plot_df = reports_df[plot_columns]
    plot_df.plot.line(subplots=True, grid=True)
    return reports_df

evolution_report_df = analyze_evolution(report_column, plot_columns)

### Export daily evolution to CSV

In [0]:
# helpers.export_csv(df=evolution_report_df, file_name='daily_evolution_{}_{}-{}.csv'.format(customer, start, end)) 

## Export to reports overview

In [0]:
helpers.export_to_overview(report_df)