<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Check-Data" data-toc-modified-id="Check-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Check Data</a></span></li><li><span><a href="#Prepare-Data" data-toc-modified-id="Prepare-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Prepare Data</a></span></li><li><span><a href="#Create-A-Test-Case" data-toc-modified-id="Create-A-Test-Case-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create A Test Case</a></span></li><li><span><a href="#Develop-Trx-Data-Transformation-Functions" data-toc-modified-id="Develop-Trx-Data-Transformation-Functions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Develop Trx Data Transformation Functions</a></span><ul class="toc-item"><li><span><a href="#Remove-invalid-Redemptions-(no-Purchase-on-same-Date)" data-toc-modified-id="Remove-invalid-Redemptions-(no-Purchase-on-same-Date)-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Remove invalid Redemptions (no Purchase on same Date)</a></span></li><li><span><a href="#Create-columns-for-the-voucher-related-trx" data-toc-modified-id="Create-columns-for-the-voucher-related-trx-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Create columns for the voucher related trx</a></span></li><li><span><a href="#Calculate-intervals" data-toc-modified-id="Calculate-intervals-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Calculate intervals</a></span></li></ul></li><li><span><a href="#Parallelization" data-toc-modified-id="Parallelization-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Parallelization</a></span></li><li><span><a href="#Appendix" data-toc-modified-id="Appendix-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Appendix</a></span></li></ul></div>

<div class='alert alert-block alert-info'>
<b>Note:</b> This notebook documents the development and testing of the trx data manipulation code. Once everything ran smoothly on the defined test case, the functions were factored out into the final script `transform_data.py`.
</div>

In [1]:
import datetime as dt
import itertools
import sys
from pathlib import Path

import codebook.EDA as EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline
plt.style.use('raph-base')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', 30)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

np.random.seed(666)

In [3]:
print(sys.executable)
print(sys.version)
print(f'Pandas {pd.__version__}')

C:\Users\r2d4\miniconda3\envs\py3\python.exe
3.8.3 (default, May 19 2020, 06:50:17) [MSC v.1916 64 bit (AMD64)]
Pandas 1.1.3


## Check Data


In [4]:
# Load clean data from parquet file
data_clean = pd.read_parquet("data/1_trx_data_clean.parquet")

In [5]:
data_clean.head()
data_clean.info()

Unnamed: 0,member,date,trx_type,device,value,discount
0,1,2018-01-10,Purchase,Payment,83.1,60.8
1,1,2018-01-19,Purchase,Payment,146.3,0.0
2,1,2018-02-05,Activation,Loyalty Voucher,5.0,0.0
3,1,2018-02-16,Purchase,Payment,57.1,0.0
4,1,2018-02-16,Redemption,Loyalty Voucher,-5.0,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1612298 entries, 0 to 1612297
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype         
---  ------    --------------    -----         
 0   member    1612298 non-null  object        
 1   date      1612298 non-null  datetime64[ns]
 2   trx_type  1612298 non-null  object        
 3   device    1612298 non-null  object        
 4   value     1612298 non-null  float32       
 5   discount  1612298 non-null  float32       
dtypes: datetime64[ns](1), float32(2), object(3)
memory usage: 61.5+ MB


**Findings:**
- No missing values
- Object dtype columns could be transformed to category dtype
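
To illustrate the second finding, here is a minimal sketch on made-up data (the toy series is an assumption, not part of the trx dataset). It compares the memory footprint of a low-cardinality string column stored as `object` with the same column stored as `category`.

```python
import pandas as pd

# Hypothetical toy column mimicking "trx_type": few distinct values, many rows
s_obj = pd.Series(["Purchase", "Activation", "Redemption"] * 10_000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True also counts the memory used by the Python string objects
mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)
print(f"object: {mem_obj:,} bytes, category: {mem_cat:,} bytes")
```

The saving depends on the number of distinct values. For columns like `device` or `trx_type` with two or three levels it should be substantial.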

In [6]:
EDA.display_value_counts(data_clean[["device", "trx_type"]])

Unnamed: 0,counts,prop,cum_prop
Payment,1188433,73.7%,73.7%
Loyalty Voucher,423865,26.3%,100.0%


Unnamed: 0,counts,prop,cum_prop
Purchase,1188433,73.7%,73.7%
Activation,262115,16.3%,90.0%
Redemption,161750,10.0%,100.0%


**Findings:**
- Device == "Payment" and trx_type == "Purchase" coincide (identical counts). About 3/4 of trx are purchases
- Device == "Loyalty Voucher" trx are either Activations or Redemptions. There are fewer Redemptions than Activations

## Prepare Data

**Important:** For the functions below to work properly, we have to make sure the trx_type chronology is in the right order when grouped by member and date. This means: Activation --> Purchase --> Redemption
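
A small sketch of why sorting on `trx_type` yields the required order. The toy frame and the second variant with an explicitly ordered categorical are assumptions for illustration. The notebook itself relies on the lexical order only.

```python
import pandas as pd

toy = pd.DataFrame({
    "member": ["1", "1", "1"],
    "date": pd.to_datetime(["2018-10-22"] * 3),
    "trx_type": ["Redemption", "Purchase", "Activation"],
})

# Variant 1: plain category dtype, categories are sorted lexically
toy_lex = toy.assign(trx_type=toy["trx_type"].astype("category"))
order_lex = toy_lex.sort_values(["member", "date", "trx_type"])["trx_type"].tolist()

# Variant 2 (assumption, not used in this notebook): explicit ordered categories
trx_order = pd.CategoricalDtype(["Activation", "Purchase", "Redemption"], ordered=True)
toy_ord = toy.assign(trx_type=toy["trx_type"].astype(trx_order))
order_ord = toy_ord.sort_values(["member", "date", "trx_type"])["trx_type"].tolist()

print(order_lex)
print(order_ord)
```

Both variants give the same result here. The explicit order would only matter if a new trx type were added whose name does not happen to sort into the right chronological position.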

In [7]:
def prepare_trx_df(df):
    """Transform dtype object to dtype category, and - most
    important - make sure trx_types are in the correct 
    chronological order when grouped by member and date. 
    (Conveniently this is the alphabetical order.) Return
    a copy of the original dataframe.
    """
    df = df.copy()
    for col in df.select_dtypes(include=["object", "string"]):
        df[col] = df[col].astype("category")
    df.sort_values(["member", "date", "trx_type"], inplace=True)
    return df

In [8]:
data_prep = prepare_trx_df(data_clean)

# Pass tests
# Note: `"object" in data_prep.dtypes` would check the index labels, not the dtypes
assert data_prep.select_dtypes(include="object").empty

## Create A Test Case

**Important:** Think about edge cases when creating test data, e.g. 
- There are a few same day activations / redemptions
- There are redemptions without purchase transaction (I won't consider them a purchase)

In [9]:
# Choose a suitable "Test Member"
test_member = data_prep[data_prep["member"] == "102318"].copy()

# Add an edge case: Same-date Activation and redemption
test_member.iloc[5, 1] = dt.datetime.strptime("2018-10-22", "%Y-%m-%d")

# Add an edge case: Redemption without purchase
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
test_member = pd.concat([test_member, test_member.iloc[[14]]], ignore_index=True)
test_member.iloc[-1, 1] = dt.datetime.strptime("2018-4-01", "%Y-%m-%d")

# Sort again and re-index
test_member.sort_values(["member", "date", "trx_type"], inplace=True)
test_member.reset_index(drop=True, inplace=True)

#(Re-)Set dtypes
for col in test_member.select_dtypes(include=["object", "string"]):
    test_member[col] = test_member[col].astype("category")

test_member

Unnamed: 0,member,date,trx_type,device,value,discount
0,102318,2018-04-01,Redemption,Loyalty Voucher,-5.0,0.0
1,102318,2018-05-12,Purchase,Payment,107.7,0.0
2,102318,2018-05-12,Redemption,Loyalty Voucher,-10.0,0.0
3,102318,2018-07-09,Purchase,Payment,127.7,0.0
4,102318,2018-08-11,Purchase,Payment,20.0,49.9
5,102318,2018-08-31,Purchase,Payment,31.8,8.0
6,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0
7,102318,2018-10-22,Purchase,Payment,49.8,0.0
8,102318,2018-10-22,Redemption,Loyalty Voucher,-5.0,0.0
9,102318,2019-01-03,Purchase,Payment,8.95,8.95


## Develop Trx Data Transformation Functions


1. Remove "invalid" Redemptions (no purchase on same date), few edge cases
TODO

Flags:
-    Flag purchase with redemption
-    Flag purchase without redemption when v_sum = 0
-    Flag purchase without redemption when v_sum > 0
-    Flag purchase by device
-    Flag purchase with discount, considering a threshold
-   _Flag purchase with value < 0 as return (not implemented)_

Intervals:
-   purchase intervals
-   interval from activation to next purchase (not considering activation)

### Remove invalid Redemptions (no Purchase on same Date)

See the appendix for a first attempt that did not scale well.
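
The function in the next cell is built on a merge with `indicator=True`, often called an anti-join. Here is a minimal sketch of that pattern on made-up data (frame names and values are assumptions). It uses `how="left"` for brevity, while the function below uses `how="outer"` followed by the same `left_only` filter.

```python
import pandas as pd

# Hypothetical toy data: member "A" redeems once with and once without a purchase
redemptions = pd.DataFrame({
    "member": ["A", "A"],
    "date": pd.to_datetime(["2018-04-01", "2018-05-12"]),
    "value": [-5.0, -10.0],
})
purchases = pd.DataFrame({
    "member": ["A"],
    "date": pd.to_datetime(["2018-05-12"]),
})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both"
merged = redemptions.merge(purchases, on=["member", "date"], how="left", indicator=True)

# Rows found only on the left side are redemptions without a same-day purchase
invalid = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(invalid)
```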

In [10]:
def remove_redemptions_without_purchase(df):  
    """Identify and remove all `Redemption` trx that have 
    no purchase trx on the same date for the same member. 
    Return a copy of the original dataframe.
    """
    df = df.copy()
    
    # Get a df each with all redemptions and all purchases
    df_red = df[df["trx_type"] == "Redemption"].copy()
    df_pur = df[df["trx_type"] == "Purchase"][["member", "date"]].copy()
    
    # Use a merge to find all redemptions without a purchase on same date by same member    
    df_red_invalid = pd.merge(
        df_red, df_pur, on=["member", "date"], how="outer", indicator=True
    ).query("_merge == 'left_only'").drop("_merge", axis=1)
 
    # Use a second merge to eliminate all those invalid redemptions
    df_valid = pd.merge(
        df, df_red_invalid, how="outer", on=df.columns.tolist(), indicator=True
    ).query("_merge == 'left_only'").drop("_merge", axis=1)
    
    return df_valid.reset_index(drop=True)


In [11]:
# %timeit remove_redemptions_without_purchase(test_member)

In [12]:
test_member = remove_redemptions_without_purchase(test_member)

# Pass tests
assert not (test_member["date"] == pd.Timestamp("2018-04-01")).any()
assert len(test_member) == 15

# Check results
test_member

Unnamed: 0,member,date,trx_type,device,value,discount
0,102318,2018-05-12,Purchase,Payment,107.7,0.0
1,102318,2018-05-12,Redemption,Loyalty Voucher,-10.0,0.0
2,102318,2018-07-09,Purchase,Payment,127.7,0.0
3,102318,2018-08-11,Purchase,Payment,20.0,49.9
4,102318,2018-08-31,Purchase,Payment,31.8,8.0
5,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0
6,102318,2018-10-22,Purchase,Payment,49.8,0.0
7,102318,2018-10-22,Redemption,Loyalty Voucher,-5.0,0.0
8,102318,2019-01-03,Purchase,Payment,8.95,8.95
9,102318,2019-03-25,Purchase,Payment,44.9,0.0


### Create columns for the voucher related trx

This function is a simple and very fast vectorized operation that can be applied to the whole df in one go. No need for a groupby.
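
As a quick illustration of the vectorized pattern, the following sketch applies `np.where` with boolean masks to a three-row toy frame (the data is made up). Each mask is evaluated for all rows at once, so no Python-level loop is needed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "device": ["Payment", "Loyalty Voucher", "Loyalty Voucher"],
    "value": [49.8, 5.0, -5.0],
})

# Boolean masks are computed column-wise for the whole frame
is_voucher = toy["device"] == "Loyalty Voucher"
toy["voucher_act"] = np.where(is_voucher & (toy["value"] > 0), toy["value"], 0)
toy["voucher_red"] = np.where(is_voucher & (toy["value"] < 0), toy["value"], 0)
print(toy)
```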

In [13]:
def create_basic_voucher_cols(df):
    """Create 3 separate columns containing the values for 
    voucher activations ("voucher_act"), voucher redemptions ("voucher_red")
    and both of them combined (voucher_all).
    """
    df = df.copy()
    df["voucher_act"] = np.where(
        (df["device"] == "Loyalty Voucher") & (df["value"] > 0),
        df["value"], 
        0)
    df["voucher_red"] = np.where(
        (df["device"] == "Loyalty Voucher") & (df["value"] < 0),
        df["value"],
        0)
    df["voucher_all"] = np.where(
        df["device"] == "Loyalty Voucher",
        df["value"],
        0)
    return df

In [14]:
%timeit create_basic_voucher_cols(test_member)

1.85 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [15]:
test_member = create_basic_voucher_cols(test_member)

# Pass tests
assert test_member.shape[1] == 9
assert test_member["voucher_all"].sum() == -10

# Check results
test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_all
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,0.0,0.0
1,102318,2018-05-12,Redemption,Loyalty Voucher,-10.0,0.0,0.0,-10.0,-10.0
2,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0
3,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0
4,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0
5,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0
6,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,0.0,0.0
7,102318,2018-10-22,Redemption,Loyalty Voucher,-5.0,0.0,0.0,-5.0,-5.0
8,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0
9,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0


In [16]:
def _calculate_voucher_cumsum(voucher_all):
    """Helper function to calculate the accumulated sum of
    voucher "credit" a customer has at any given time. We do not 
    know about the remaining credit from earlier periods, but we 
    make sure that the cumsum never becomes negative.
    """
    # Note: Itertools seems to run faster than: np.cumsum(voucher_all)
    voucher_cumsum = np.array(list(itertools.accumulate(voucher_all)))
    
    # Make sure that the cumsum never has a negative value
    v_min = np.min(voucher_cumsum)
    if v_min < 0:
        top_up_value = v_min
        voucher_cumsum = voucher_cumsum - top_up_value
    return voucher_cumsum

def create_voucher_cumsum_col(df):
    """Use a groupby "window function" to insert the cumulated 
    voucher sums into a new column "voucher_cumsum".
    """
    df = df.copy()
    df["voucher_cumsum"] = df.groupby(
        ["member"])["voucher_all"].transform(_calculate_voucher_cumsum)
    df.drop("voucher_all", axis=1, inplace=True)
    return df

(Check this stackoverflow post for [apply vs. transform on groupby objects](https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object))
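
To make the behaviour concrete, here is a small sketch with two made-up members. The helper `floor_cumsum` is a hypothetical stand-in for `_calculate_voucher_cumsum` that uses `np.cumsum` instead of `itertools.accumulate`. It shows that `transform` returns one value per input row, and that the running sum is shifted up only for groups where it would otherwise drop below zero.

```python
import numpy as np
import pandas as pd

def floor_cumsum(values):
    """Running sum, shifted up so that its minimum is never below zero."""
    running = np.cumsum(values.to_numpy())
    return running - min(running.min(), 0)

toy = pd.DataFrame({
    "member": ["A"] * 4 + ["B"] * 2,
    "voucher_all": [0.0, -10.0, 5.0, -5.0, 5.0, -5.0],
})

# transform returns one value per input row, aligned with the original index
toy["voucher_cumsum"] = toy.groupby("member")["voucher_all"].transform(floor_cumsum)
print(toy)
```

Member "A" starts with an unexplained redemption of 10, so the whole series is lifted by 10. Member "B" never goes negative and is left unchanged.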

In [17]:
test_member = create_voucher_cumsum_col(test_member)

# Pass tests
assert test_member.shape[1] == 9
assert test_member["voucher_cumsum"].min() >= 0

# Check results
test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_cumsum
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,0.0,10.0
1,102318,2018-05-12,Redemption,Loyalty Voucher,-10.0,0.0,0.0,-10.0,0.0
2,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0
3,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0
4,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0
5,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0
6,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,0.0,5.0
7,102318,2018-10-22,Redemption,Loyalty Voucher,-5.0,0.0,0.0,-5.0,0.0
8,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0
9,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0


In [18]:
def shift_and_drop_redemptions(df):
    """Shift redemption values in the "voucher_red" column one row up, so
    they end up in the row of the corresponding transaction. This makes
    it possible to flag the respective transactions as ones with 
    a redemption in a later step. Then delete all redemption rows, as 
    they are no longer needed.
    """
    df = df.copy()
    df["voucher_red"] = df.groupby(["member"])["voucher_red"].shift(-1)
    # .copy() avoids a SettingWithCopyWarning on the assignments below
    df = df[~df["trx_type"].isin(["Redemption"])].copy()
    df["voucher_red"] = df["voucher_red"].fillna(0)
    
    # Remove "Redemption" category from trx_type categories
    # (reassign instead of inplace=True, which later pandas versions dropped)
    df["trx_type"] = df["trx_type"].cat.remove_unused_categories()
    return df.reset_index(drop=True)

In [19]:
test_member = shift_and_drop_redemptions(test_member)

# Pass tests
# Note: `in` on a Series checks the index labels, so test against the values
assert "Redemption" not in test_member["trx_type"].tolist()
assert test_member[test_member["trx_type"] == "Purchase"]["voucher_red"].sum() == -20
assert len(test_member) == 12

# Check results
test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_cumsum
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,-10.0,10.0
1,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0
2,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0
3,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0
4,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0
5,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,-5.0,5.0
6,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0
7,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0
8,102318,2019-04-25,Purchase,Payment,246.9,61.7,0.0,0.0,0.0
9,102318,2019-06-05,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0


### Calculate intervals

**Note:** In the following we apply the `diff` method. It calculates the difference between the column values of rows with a specified lag / lead (here -1, meaning the following row).
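
A minimal sketch of this on three dates taken from the test member above. `diff(-1)` computes `row[i] - row[i + 1]`, which is negative for ascending dates, so the result is multiplied by -1.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-05-12", "2018-07-09", "2018-08-11"]))

# diff(-1) subtracts the following row from the current one; the sign flip
# turns the result into "days until the following row"
days_to_next = (dates.diff(-1) * -1).dt.days
print(days_to_next.tolist())
```

The last row has no following row, so its interval is missing, which is also why the resulting column is float rather than int.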

In [20]:
def calculate_interval_activation_to_next_purchase(df):
    """Create a new col "delta_a" containing the interval from each
    activation to the next purchase in days (as float, because
    all non-activation rows are set to NaN).
    """
    df = df.copy()
    df["delta_a"] = df.groupby(["member"])["date"].diff(-1) * -1
        
    df["delta_a"] = np.where(
        df["trx_type"] == "Activation", 
        df["delta_a"].dt.days,  # Convert timedelta to number of days
        np.nan
    )
    return df

In [21]:
def calculate_purchase_interval(df):
    """Create a new col "delta_p" containing the interval from each
    purchase to the next in days (as float, because all
    non-purchase rows are set to NaN).
    """
    df = df.copy()
    df["delta_p"] = df.groupby(["member", "trx_type"])["date"].diff(-1) * -1
    
    df["delta_p"] = np.where(
        df["trx_type"] == "Purchase", 
        df["delta_p"].dt.days,  # Convert timedelta to number of days
        np.nan
    )
    return df

In [22]:
test_member = calculate_interval_activation_to_next_purchase(test_member)

# Pass tests
assert test_member["delta_a"].sum() == 94
assert test_member["delta_a"].notna().sum() == (test_member["trx_type"] == "Activation").sum()

test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_cumsum,delta_a
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,-10.0,10.0,
1,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0,
2,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0,
3,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0,
4,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,0.0
5,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,-5.0,5.0,
6,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0,
7,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0,
8,102318,2019-04-25,Purchase,Payment,246.9,61.7,0.0,0.0,0.0,
9,102318,2019-06-05,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,94.0


In [23]:
test_member = calculate_purchase_interval(test_member)

# Pass tests
assert test_member.loc[0, "delta_p"] == 58.0
assert (
    test_member["delta_p"].notna().sum() 
    == ((test_member["trx_type"] == "Purchase").sum() -1)  # -1 because the last purchase has no successor
) 

test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_cumsum,delta_a,delta_p
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,-10.0,10.0,,58.0
1,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0,,33.0
2,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0,,20.0
3,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0,,52.0
4,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,0.0,
5,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,-5.0,5.0,,73.0
6,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0,,81.0
7,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0,,31.0
8,102318,2019-04-25,Purchase,Payment,246.9,61.7,0.0,0.0,0.0,,135.0
9,102318,2019-06-05,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,94.0,


In [24]:
def flag_purchases_depending_on_vouchers(df):
    """Create three new boolean columns to classify purchases into
    each of the following three categories: "p_v_red" = purchase with
    redemption, "p_v_miss" = purchase without redemption (but voucher
    credit would have been available), "p_v_empty" = no voucher credit
    available.
    """
    df = df.copy()
    df["p_v_red"] = np.where(
        (df["trx_type"] == "Purchase") & (df["voucher_red"] < 0), 1, 0
    ).astype("bool")
    
    df["p_v_miss"] = np.where(
        (
            (df["trx_type"] == "Purchase") 
            & (df["voucher_red"] == 0) 
            & (df["voucher_cumsum"] > 0)
        )
        , 1
        , 0
    ).astype("bool")
    
    df["p_v_empty"] = np.where(
        (df["trx_type"] == "Purchase") & (df["voucher_cumsum"] == 0), 1, 0
    ).astype("bool")
    
    return df

In [25]:
test_member = flag_purchases_depending_on_vouchers(test_member)

# Pass tests
# tbd

test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_cumsum,delta_a,delta_p,p_v_red,p_v_miss,p_v_empty
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,-10.0,10.0,,58.0,True,False,False
1,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0,,33.0,False,False,True
2,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0,,20.0,False,False,True
3,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0,,52.0,False,False,True
4,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,0.0,,False,False,False
5,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,-5.0,5.0,,73.0,True,False,False
6,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0,,81.0,False,False,True
7,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0,,31.0,False,False,True
8,102318,2019-04-25,Purchase,Payment,246.9,61.7,0.0,0.0,0.0,,135.0,False,False,True
9,102318,2019-06-05,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,94.0,,False,False,False


In [26]:
def calculate_discount_pct(df):
    """Calculate a column "discount_pct" denoting the relative
    value of discounts. This value will be used to control for a
    threshold when setting a discount flag in the next step. (Note
    the calculation is such that discounts on returns won't reach
    the threshold.)
    """
    df = df.copy()
    df["gross_value"] = df["value"] + df["discount"]
    df["discount_pct"] = df["discount"] / df["gross_value"]
    return df


def flag_purchases_depending_on_discounts(df, threshold_pct=0.1):
    """Create a boolean column to classify purchases having a
    discount whose value relative to the gross transaction price 
    reaches a certain threshold.
    """
    df = df.copy()
    df["p_discount"] = np.where(
        df["discount_pct"] >= threshold_pct, 1, 0
    ).astype("bool")
    
    df.drop(["gross_value", "discount_pct"], axis=1, inplace=True)
    return df
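
A short worked example of the discount logic on made-up rows (the return in the last row is a hypothetical case, not taken from the data). It shows how the relative discount is derived from the gross value and why a return with a negative value does not reach the threshold.

```python
import pandas as pd

toy = pd.DataFrame({
    "value": [20.0, 44.9, -50.0],   # last row: hypothetical return
    "discount": [49.9, 0.0, 5.0],
})

# Gross value = price paid plus discount granted
gross_value = toy["value"] + toy["discount"]
toy["discount_pct"] = toy["discount"] / gross_value

# Same 10% threshold as the default of the function above
toy["p_discount"] = toy["discount_pct"] >= 0.1
print(toy)
```

For the return, the gross value is negative, so the ratio becomes negative and stays below the threshold.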

In [27]:
test_member = calculate_discount_pct(test_member)
test_member = flag_purchases_depending_on_discounts(test_member, threshold_pct=0.1)
test_member

Unnamed: 0,member,date,trx_type,device,value,discount,voucher_act,voucher_red,voucher_cumsum,delta_a,delta_p,p_v_red,p_v_miss,p_v_empty,p_discount
0,102318,2018-05-12,Purchase,Payment,107.7,0.0,0.0,-10.0,10.0,,58.0,True,False,False,False
1,102318,2018-07-09,Purchase,Payment,127.7,0.0,0.0,0.0,0.0,,33.0,False,False,True,False
2,102318,2018-08-11,Purchase,Payment,20.0,49.9,0.0,0.0,0.0,,20.0,False,False,True,True
3,102318,2018-08-31,Purchase,Payment,31.8,8.0,0.0,0.0,0.0,,52.0,False,False,True,True
4,102318,2018-10-22,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,0.0,,False,False,False,False
5,102318,2018-10-22,Purchase,Payment,49.8,0.0,0.0,-5.0,5.0,,73.0,True,False,False,False
6,102318,2019-01-03,Purchase,Payment,8.95,8.95,0.0,0.0,0.0,,81.0,False,False,True,True
7,102318,2019-03-25,Purchase,Payment,44.9,0.0,0.0,0.0,0.0,,31.0,False,False,True,False
8,102318,2019-04-25,Purchase,Payment,246.9,61.7,0.0,0.0,0.0,,135.0,False,False,True,True
9,102318,2019-06-05,Activation,Loyalty Voucher,5.0,0.0,5.0,0.0,5.0,94.0,,False,False,False,False


---

## Appendix

In [28]:
# The following got stuck on the full dataframe ... (and was messy anyway)

# def remove_redemptions_without_purchase(df):
#     """Identify and remove all `Redemption` trx that have 
#     no purchase trx on the same date for the same member. 
#     Return a copy of the original dataframe.
#     """
#     # Create a dataframe with invalid redemptions only
#     # Important: the following operation on "trx_type" does not work with dtype category
#     df = df.copy()
#     df["trx_type"] = df["trx_type"].astype(str)
#     df_grouped = df.groupby(["member", "date"]).agg({"trx_type": str})
#     df_grouped = df_grouped[df_grouped["trx_type"].str.contains("Redemption")]
#     df_grouped = df_grouped[~df_grouped["trx_type"].str.contains("Purchase")]
#     df_grouped["trx_type"] = "Redemption"
    
#     # Use a merge with indicator to mark all invalid redemptions in the original dataframe
#     df = pd.merge(df, df_grouped, on=["member", "date", "trx_type"], how="left", indicator=True)
#     df = df[df["_merge"] == "left_only"]
#     df.drop(columns="_merge", inplace=True)
#     df["trx_type"] = df["trx_type"].astype("category")
    
#     return df