# Simulation: true lift marketing model

Written by Daniel Steinberg and Lachlan McCalman, Gradient Institute Ltd. (info@gradientinstitute.org).

Copyright © 2020 Monetary Authority of Singapore.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


## Introduction

This notebook provides the code used to generate the synthetic data that forms the subject of the marketing case study assessment in the FEAT Fairness Principles Assessment Case Studies Document (Dec 2020). Please see Section 2.3 of that document for the assessment itself. This code is not intended for use in production or for assessing high-risk AIDA systems under the methodology.

The code to perform the marketing case study assessment is also in this repository, in `assessment_input.ipynb`.

This work was undertaken as part of the Veritas initiative commissioned by the Monetary Authority of Singapore, whose goal is to accelerate the adoption of responsible Artificial Intelligence and Data Analytics (AIDA) in the financial services industry.

## Scenario

_The following is a running example of a hypothetical, simulated, AIDA direct marketing system used for marketing unsecured loans. Please note that this running example:_ 
- _is evaluated at a high level of detail to illustrate the Methodology applied to a higher risk system_
- _is an example assessment of a system determined by a fictional Financial Services Institution (FSI) to be higher risk, not guidance for FSIs on the actual risk associated with this example (for more details on the risk-based approach of the Methodology see Document 1 Section 2)_
- _is intended to be a simple illustration of how to use the Methodology_
- _does not represent any AIDA systems in place at any of the Veritas Consortium members_
- _should not be taken as guidance for any context- or value-sensitive decision such as choices of fairness objectives, measures, or personal attributes_
- _is not intended to constrain the scope of the Methodology: other uses may have different interventions, products, objectives, and use of AIDA systems_
- _uses simulated data that is not intended to depict realistic statistical relationships or performance measures_
- _has omitted some analyses for the sake of brevity (for example, those relating to harms from default)_

_The terms “we”, “us” or “our” in the running example refer to the functional author of the assessment and not to members of the Consortium as elsewhere in this document._


A (fictional) FSI with a new unsecured “fast & simple” loan product would like to embark on a marketing campaign to its existing (non-exempt) customer base. The FSI will make a profit from the interest rate payments of this product, but this profit will be offset by the cost of the marketing campaign that is carried out through a call centre operated by a different subsidiary company. The purpose of the marketing system under analysis is to select existing customers for a marketing call to increase sales of the product.




## Generative model


<img src="uplift_gen_mod.png" alt="Simulation generative model diagram" style="width: 500px;"/>


## References

- Kane, K., Lo, V.S.Y., Zheng, J., 2014. Mining for the truly responsive
    customers and prospects using true-lift modeling: Comparison of new and
    existing methods. J Market Anal 2, 218–238.
    https://doi.org/10.1057/jma.2014.18


In [27]:
import numpy as np
import pandas as pd
import pathlib
from os import path
from scipy.special import softmax

In [28]:
## Model settings

# Output settings
SENSITIVE_ATTRIBUTES = ["age", "isfemale", "isforeign"]
COVARIATES_FILE = "model_inputs.csv"
SENSITIVE_FILE = "sensitive_attributes.csv"
OUTCOMES_FILE = "outcomes.csv"
TRUTH_FILE = "truth.csv"
INDEX_LABEL = "ID"


# Randomness settings
RSEED = 42


# Data size
N_PEOPLE = 100000


# Control group proportion
P_CONTROL = 0.6


# Gender
P_FEMALE = 0.4  # proportion of data that are female


# Nationality
P_ISFOREIGN = 0.3  # proportion of the data that are foreign nationals


# Age
MIN_AGE = 18
MEAN_AGE = 45
STD_AGE = 10
FOREIGN_AGE_EFFECT = -5


# Income
BASE_MALE_INCOME = 45000
BASE_FEMALE_INCOME = 35000
FOREIGN_INCOME_EFFECT = -5000
FOREIGN_INCOME_STD = 5000
STD_MALE_INCOME = 10000
STD_FEMALE_INCOME = 15000
MEAN_AGE_INCOME = 1500  # income per year of age above MIN_AGE (not referenced below)
STD_AGE_INCOME = 0  # income std. dev. per year of age above MIN_AGE (not referenced below)


# Existing products
MANY_PRODUCTS = 5  # number that defines "many" products (max rate)


# Responded to previous offers
PROB_RESPONDED = 0.05


# Feature transformations
FEATURE_TRANSFORMS = {
    "age": lambda x: x,
    "isfemale": lambda x: x,
    "isforeign": lambda x: x,
    "income": lambda x: x,
    "noproducts": lambda x: x,
    "didrespond": lambda x: x
}


# Environmental effects on lift model
NORM_EFFECTS = True  # standardise environmental values going into the model.
LABEL_BIASES = {
    "P": 0.0,
    "ST": 0.5,
    "LC": 1.0,
    "DND": 0.0
}

LABEL_WEIGHTS = {
    "P": {
        "age": 2.0,
        "isfemale": 1.0,
        "isforeign": 0.0,
        "income": 0.0,
        "noproducts": 1.,
        "didrespond": 10.
    },
    "ST": {
        "age": 1.0,
        "isfemale": 0.,
        "isforeign": -1.0,
        "income": 3.,
        "noproducts": 3.,
        "didrespond": 7.
    },
    "LC": {
        "age": -1.,
        "isfemale": 0.0,
        "isforeign": 1.0,
        "income": -2.,
        "noproducts": -2.,
        "didrespond": -3.
    },
    "DND": {
        "age": -2.,
        "isfemale": -1.0,
        "isforeign": 0.0,
        "income": 0.0,
        "noproducts": 1.,
        "didrespond": -10.
    }
}


# Acquired outcomes
PROB_ACQUIRED = 0.90  # probability that an applicant acquires the product
FOREIGN_ACQUIRED_MOD = -0.2  # change in that probability for foreign nationals
MIN_INCOME_ACQUIRED = 5000  # minimum annual income required to acquire the product

# Long-term outcome resolutions
PROB_SUCCESS = 0.95  # probability that an acquirer has a successful resolution
FEMALE_SUCCESS_MOD = -0.05  # not referenced below


In [29]:
## Simulate Environment

rstate = np.random.RandomState(RSEED)


# Generate gender
isfemale = rstate.binomial(n=1, p=P_FEMALE, size=N_PEOPLE)

# Generate foreign nationals
isforeign = rstate.binomial(n=1, p=P_ISFOREIGN, size=N_PEOPLE)

# Generate ages
mean_age = MEAN_AGE + FOREIGN_AGE_EFFECT * isforeign
age = np.maximum(
    MIN_AGE,
    rstate.normal(loc=mean_age, scale=STD_AGE, size=N_PEOPLE)
)

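# Generate income
# NOTE: the age terms below use MEAN_AGE and STD_AGE, not the MEAN_AGE_INCOME
# and STD_AGE_INCOME settings. They are left unchanged because the statistics
# printed later in this notebook match this code.
mean_income = BASE_FEMALE_INCOME * isfemale \
    + BASE_MALE_INCOME * (1 - isfemale) \
    + FOREIGN_INCOME_EFFECT * isforeign \
    + MEAN_AGE * (age - MIN_AGE)
std_income = STD_FEMALE_INCOME * isfemale \
    + STD_MALE_INCOME * (1 - isfemale) \
    + FOREIGN_INCOME_STD * isforeign \
    + STD_AGE * (age - MIN_AGE)
# Clip at zero so that no one has a negative income
income = np.maximum(0., rstate.normal(loc=mean_income, scale=std_income))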

# Generate number of products
product_rate = income * MANY_PRODUCTS / np.max(income)
noproducts = rstate.poisson(lam=product_rate)

# Generate previous response to campaigns
didrespond = rstate.binomial(n=1, p=PROB_RESPONDED, size=N_PEOPLE)

# Concatenate all data
environment = pd.DataFrame({
    "age": age,
    "isfemale": isfemale,
    "isforeign": isforeign,
    "income": income,
    "noproducts": noproducts,
    "didrespond": didrespond
})
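
The age draw in the cell above is a normal distribution clipped from below. The following is a minimal sketch of that one step, assuming a smaller sample and an illustrative seed (neither is taken from the notebook's run); the settings are copied from the configuration cell.

```python
import numpy as np

rng = np.random.RandomState(0)  # illustrative seed, not the notebook's RSEED

# Settings mirrored from the notebook's configuration cell.
MIN_AGE, MEAN_AGE, STD_AGE, FOREIGN_AGE_EFFECT = 18, 45, 10, -5

is_foreign = rng.binomial(n=1, p=0.3, size=1000)
mean_age = MEAN_AGE + FOREIGN_AGE_EFFECT * is_foreign  # 45 or 40 per person
raw_age = rng.normal(loc=mean_age, scale=STD_AGE, size=1000)

# np.maximum clips from below, so every draw under MIN_AGE becomes exactly
# MIN_AGE and the result has a point mass at the floor.
age = np.maximum(MIN_AGE, raw_age)
print(age.min(), (age == MIN_AGE).sum())
```

The same clipping pattern is applied to income, with a floor of zero.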



In [30]:
## Transform features
for col in environment.columns:
    environment[col] = FEATURE_TRANSFORMS[col](environment[col])
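
All entries in `FEATURE_TRANSFORMS` are identity functions in this notebook, so the loop above leaves the data unchanged. As a hedged illustration of how the hook could be used, the sketch below swaps in a hypothetical log transform for income on a toy frame; the `transforms` dictionary and the data are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Hypothetical transforms: log1p for income, identity for age.
transforms = {
    "income": np.log1p,
    "age": lambda x: x,
}

toy = pd.DataFrame({"income": [0.0, 30000.0, 90000.0], "age": [18.0, 40.0, 65.0]})

# Same loop shape as the notebook: apply each column's transform in place.
for col in toy.columns:
    toy[col] = transforms[col](toy[col])

print(toy)
```

Any non-identity transform changes the scale of the inputs to the label model, so the label weights would need to be reconsidered alongside it.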



In [31]:
## Simulate truth

weights = pd.DataFrame(LABEL_WEIGHTS)
biases = pd.Series(LABEL_BIASES)
effects = environment.dot(weights)
if NORM_EFFECTS:
    effects -= effects.mean(axis=0)
    effects /= effects.std(axis=0)

factors = effects + biases
# Class probabilities per person, then one class drawn per person
probabilities = softmax(factors.values, axis=1)
draws = np.vstack([rstate.multinomial(n=1, pvals=p) for p in probabilities])
truth = pd.DataFrame(data=draws.astype(bool), columns=effects.columns)

# Add in the real lift score: P(persuadable) - P(do not disturb)
cols = list(effects.columns)
lift = probabilities[:, cols.index("P")] \
    - probabilities[:, cols.index("DND")]
truth["lift"] = pd.Series(lift)
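
To make the lift definition in the cell above concrete, here is a minimal sketch for a single customer. The factor values are hypothetical, and the softmax is written out with NumPy rather than calling `scipy.special.softmax` so that the arithmetic is visible.

```python
import numpy as np

# Hypothetical factors (effects + biases) for one customer.
labels = ["P", "ST", "LC", "DND"]
factors = np.array([1.2, 0.5, 1.0, -0.8])

# Numerically stable softmax: subtract the max before exponentiating.
shifted = factors - factors.max()
probabilities = np.exp(shifted) / np.exp(shifted).sum()

# True lift as defined in this notebook: P(persuadable) - P(do not disturb).
lift = probabilities[labels.index("P")] - probabilities[labels.index("DND")]

for name, p in zip(labels, probabilities):
    print(f"{name}: {p:.3f}")
print(f"lift: {lift:.3f}")
```

Because the sure-thing and lost-cause probabilities cancel out of the difference between treated and untreated response rates, only the "P" and "DND" columns enter the lift.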


In [32]:
## Experimental outcomes

names = ["TN", "TR", "CN", "CR", "incontrol"]
N, D = len(truth), len(names)
outcomes = pd.DataFrame(data=np.zeros((N, D), dtype=bool), columns=names)

# Persuadables are CN and TR
outcomes.loc[truth["P"], "TR"] = True
outcomes.loc[truth["P"], "CN"] = True

# Do not disturbs are CR and TN
outcomes.loc[truth["DND"], "TN"] = True
outcomes.loc[truth["DND"], "CR"] = True

# Lost causes are TN and CN
outcomes.loc[truth["LC"], "TN"] = True
outcomes.loc[truth["LC"], "CN"] = True

# Sure things are TR and CR
outcomes.loc[truth["ST"], "TR"] = True
outcomes.loc[truth["ST"], "CR"] = True

# Draw whether each person is in the control or the treatment group
incontrol = rstate.binomial(n=1, p=P_CONTROL, size=N_PEOPLE)
incontrol = incontrol.astype(bool)
outcomes["incontrol"] = incontrol

# Now set treatment responses to false for those in control
outcomes.loc[incontrol, "TR"] = False
outcomes.loc[incontrol, "TN"] = False

# Now set control responses to false for those in treatment
outcomes.loc[~incontrol, "CR"] = False
outcomes.loc[~incontrol, "CN"] = False
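
The assignments in the cell above amount to a lookup from latent class and group membership to one observed outcome code. A minimal sketch of that mapping follows; the function name `observed_outcome` is hypothetical and does not appear in the notebook.

```python
def observed_outcome(label: str, in_control: bool) -> str:
    """Map a latent class and group assignment to an observed outcome code."""
    # Persuadables and sure things respond when they receive the call.
    responds_if_treated = label in ("P", "ST")
    # Do-not-disturbs and sure things respond when they do not.
    responds_if_control = label in ("DND", "ST")
    if in_control:
        return "CR" if responds_if_control else "CN"
    return "TR" if responds_if_treated else "TN"


for label in ("P", "ST", "LC", "DND"):
    treated = observed_outcome(label, in_control=False)
    control = observed_outcome(label, in_control=True)
    print(f"{label}: treated -> {treated}, control -> {control}")
```

Each person contributes exactly one of the four codes, which is why the notebook clears the treatment columns for the control group and the control columns for the treatment group.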



In [33]:
## Product outcomes

def product_outcomes(
    truth: pd.DataFrame,
    environment: pd.DataFrame,
    all_selected: bool):
    """Create applied, acquired and long term resolution outcomes."""
    # Who applied
    if all_selected:
        applied = np.logical_or(truth["P"], truth["ST"]).astype(int)
    else:
        applied = np.logical_or(truth["DND"], truth["ST"]).astype(int)

    # Who acquired out of those who applied
    p_acquired = applied * (PROB_ACQUIRED + FOREIGN_ACQUIRED_MOD
                            * environment.isforeign)
    acquired = rstate.binomial(n=1, p=p_acquired, size=N_PEOPLE)
    acquired[environment.income < MIN_INCOME_ACQUIRED] = 0

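    # Who successfully resolved out of those who acquired
    # NOTE: the gender modifier applied here is FOREIGN_ACQUIRED_MOD, not
    # FEMALE_SUCCESS_MOD; kept as-is to match the outputs recorded below.
    p_success = acquired * (PROB_SUCCESS + FOREIGN_ACQUIRED_MOD
                            * environment.isfemale)
    success = rstate.binomial(n=1, p=p_success, size=N_PEOPLE)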

    prefix = "s" if all_selected else "ns"
    all_outcomes = pd.DataFrame(data={
        f"{prefix}_applied": applied,
        f"{prefix}_acquired": acquired,
        f"{prefix}_success": success
    })
    return all_outcomes

pr_outcomes_s = product_outcomes(truth, environment, True)
pr_outcomes_ns = product_outcomes(truth, environment, False)
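
`product_outcomes` chains three Bernoulli stages, multiplying each stage's probability by the previous stage's indicator. The sketch below isolates that pattern with an illustrative seed, sample size and application rate, none of which are taken from the notebook's run, and without the nationality, gender or income adjustments.

```python
import numpy as np

rng = np.random.RandomState(0)  # illustrative seed, independent of RSEED
n = 10_000
applied = rng.binomial(n=1, p=0.45, size=n)  # illustrative application rate

# Multiplying by the previous stage's indicator sets the probability to zero
# for anyone who did not reach that stage, so the stages nest.
p_acquired = applied * 0.90
acquired = rng.binomial(n=1, p=p_acquired, size=n)

p_success = acquired * 0.95
success = rng.binomial(n=1, p=p_success, size=n)

print(applied.sum(), acquired.sum(), success.sum())
```

A person with probability zero at a stage can never be drawn as a one, so `success` implies `acquired`, which implies `applied`.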


In [34]:
# Split sensitive attributes from covariates
sensitives = environment[SENSITIVE_ATTRIBUTES]
covariates = environment.drop(columns=SENSITIVE_ATTRIBUTES)

# Concatenate all outcomes
all_outcomes = pd.concat((outcomes, pr_outcomes_s, pr_outcomes_ns), axis=1)


In [35]:
print("\nCovariate statistics:")
print(covariates.describe())


Covariate statistics:
              income     noproducts     didrespond
count  100000.000000  100000.000000  100000.000000
mean    40671.937732       1.904140       0.049650
std     14954.919229       1.545501       0.217222
min         0.000000       0.000000       0.000000
25%     31595.710649       1.000000       0.000000
50%     41810.919836       2.000000       0.000000
75%     50865.377051       3.000000       0.000000
max    106953.747312      12.000000       1.000000


In [36]:
print("\nSensitive attribute statistics:")
print(sensitives.describe())




Sensitive attribute statistics:
                 age       isfemale      isforeign
count  100000.000000  100000.000000  100000.000000
mean       43.487451       0.400050       0.302070
std        10.194817       0.489911       0.459158
min        18.000000       0.000000       0.000000
25%        36.572856       0.000000       0.000000
50%        43.491468       0.000000       0.000000
75%        50.372750       1.000000       1.000000
max        87.193663       1.000000       1.000000


In [37]:
print("\nOutcomes statistics:")
print(all_outcomes.sum(axis=0))



Outcomes statistics:
TN             22109
TR             17776
CN             33333
CR             26782
incontrol      60115
s_applied      44816
s_acquired     38052
s_success      33741
ns_applied     44499
ns_acquired    37385
ns_success     33073
dtype: int64
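
As a quick reading of the counts above, the average lift across the whole population can be estimated as the response rate in the treatment group minus the response rate in the control group. This is a sketch using the printed counts only; it is not part of the original notebook.

```python
# Counts copied from the "Outcomes statistics" output above.
TN, TR, CN, CR = 22109, 17776, 33333, 26782

treated = TR + TN  # everyone who received the marketing call
control = CR + CN  # everyone who did not

treated_rate = TR / treated
control_rate = CR / control
average_lift = treated_rate - control_rate

print(f"treated response rate: {treated_rate:.4f}")
print(f"control response rate: {control_rate:.4f}")
print(f"average lift: {average_lift:.4f}")
```

The two rates are nearly equal, so calling everyone yields an average lift close to zero. That is the situation in which selecting customers by individual lift, rather than calling the whole base, has something to offer.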


In [38]:
print("\nTruth statistics:")
print(truth.drop(columns=["lift"]).sum(axis=0))


Truth statistics:
P      16933
ST     27883
LC     38568
DND    16616
dtype: int64
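
The class counts above can be reconciled against the application counts in the outcomes statistics. The following sketch hard-codes the printed numbers and checks the three identities implied by the simulation.

```python
# Counts copied from the "Truth statistics" and "Outcomes statistics" outputs.
truth_counts = {"P": 16933, "ST": 27883, "LC": 38568, "DND": 16616}
s_applied, ns_applied = 44816, 44499

# Every customer belongs to exactly one class.
assert sum(truth_counts.values()) == 100000

# If everyone is selected for a call, persuadables and sure things apply.
assert truth_counts["P"] + truth_counts["ST"] == s_applied

# If no one is selected, do-not-disturbs and sure things apply.
assert truth_counts["DND"] + truth_counts["ST"] == ns_applied

print("counts reconcile")
```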


In [39]:
## Look at true lift scores by label
print("\nLift statistics:")
print("P av. lift: {:.4f}".format(truth["lift"][truth["P"]].mean()))
print("ST av. lift: {:.4f}".format(truth["lift"][truth["ST"]].mean()))
print("LC av. lift: {:.4f}".format(truth["lift"][truth["LC"]].mean()))
print("DND av. lift: {:.4f}".format(truth["lift"][truth["DND"]].mean()))


Lift statistics:
P av. lift: 0.2169
ST av. lift: 0.0051
LC av. lift: -0.0027
DND av. lift: -0.2118


In [40]:
## Save data
covariates.to_csv(COVARIATES_FILE, index_label=INDEX_LABEL)
sensitives.to_csv(SENSITIVE_FILE, index_label=INDEX_LABEL)
all_outcomes.to_csv(OUTCOMES_FILE, index_label=INDEX_LABEL)
truth.to_csv(TRUTH_FILE, index_label=INDEX_LABEL)
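
The four files share the `ID` index column, so they can be read back and joined on it. Below is a minimal round-trip sketch that assumes small stand-in tables and a temporary directory; the file names and index label follow the settings cell, and everything else is illustrative.

```python
import pathlib
import tempfile

import pandas as pd

# Hypothetical stand-ins for two of the saved tables, three rows each.
covariates = pd.DataFrame({"income": [30000.0, 45000.0, 52000.0]})
truth = pd.DataFrame({"lift": [0.2, 0.0, -0.1]})

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    covariates.to_csv(tmp_dir / "model_inputs.csv", index_label="ID")
    truth.to_csv(tmp_dir / "truth.csv", index_label="ID")

    # Reading with index_col="ID" restores the shared row identifier.
    loaded_covariates = pd.read_csv(tmp_dir / "model_inputs.csv", index_col="ID")
    loaded_truth = pd.read_csv(tmp_dir / "truth.csv", index_col="ID")

# Join on the index to line up model inputs with the ground truth.
joined = loaded_covariates.join(loaded_truth)
print(joined)
```

Keeping the sensitive attributes in a separate file means they are only combined with the model inputs when an analysis deliberately joins them on `ID`.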