# Meta Learners

Aim:
* Gain familiarity with meta-learners (S-learner, T-learner, X-learner, double-debiased) and their strengths and limitations
* Gain familiarity with interpreting gain charts

Methodology
1) Create a simulated dataset with a known causal relationship and confounders, so that the "real" relationships are controlled and model estimates can be checked against them
2) Train the meta-learners and evaluate them using gain charts
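
As a rough illustration of step 2, the sketch below fits a T-learner and computes a cumulative gain curve on toy data. It is a minimal sketch under stated assumptions, not the notebook's final method: the treatment is binary (the treatment studied here, credit_limit, takes several values), the outcome models are straight lines fitted with `np.polyfit`, and all names (`x`, `t`, `y`, `cumulative_gain`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical randomised data: binary treatment t, one covariate x,
# and a true treatment effect that grows with x (tau = x).
x = rng.uniform(0, 1, n)
t = rng.integers(0, 2, n)
y = 1.0 + x * t + rng.normal(0, 0.1, n)

# T-learner: one outcome model per arm (here a straight line),
# uplift = treated prediction - control prediction.
coef_treated = np.polyfit(x[t == 1], y[t == 1], deg=1)
coef_control = np.polyfit(x[t == 0], y[t == 0], deg=1)
uplift = np.polyval(coef_treated, x) - np.polyval(coef_control, x)


def cumulative_gain(y, t, score, n_points=10):
    """Estimated effect among the top-k rows by score, scaled by k / n."""
    order = np.argsort(-score)  # highest predicted uplift first
    y_sorted, t_sorted = y[order], t[order]
    fractions = np.linspace(1 / n_points, 1, n_points)
    gains = []
    for frac in fractions:
        k = int(round(frac * len(y)))
        top_y, top_t = y_sorted[:k], t_sorted[:k]
        effect = top_y[top_t == 1].mean() - top_y[top_t == 0].mean()
        gains.append(effect * frac)
    return fractions, np.array(gains)


fractions, gains = cumulative_gain(y, t, uplift)
```

The last point of the curve estimates the average treatment effect. A useful ranking lifts the earlier points above the straight line that a random ordering would produce.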


## 1. Set up

In [48]:
# import required packages

import numpy as np
import pandas as pd 
import array
# from matplotlib import pyplot as plt


### 1.2 Create simulated dataset

Requirements
* Non-linear treatment effect
* Heterogeneous treatment effect given other covariates
* Confounders
* Ignore colliders for now (explore methods to identify and exclude in future)
* Test dataset is a randomised controlled experiment, so it has no confounders (which allows a cleaner evaluation of the model)
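
The randomised test set in the last requirement could be sketched as follows. This is only an assumption about how such a set might be generated: the limit is drawn uniformly from the same six limit values used in the simulation below, independently of every other variable. The names `credit_limit_rct` and `corr` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
limits = np.array([1000, 2500, 5000, 10000, 20000, 40000])

# Confounder with the same 20% / 60% / 20% split as the training simulation
credit_hunger = rng.choice([1, 2, 3], size=n, p=[0.2, 0.6, 0.2])

# Randomised treatment: the limit ignores credit score and credit hunger
credit_limit_rct = rng.choice(limits, size=n)

# With random assignment the treatment should be uncorrelated with the confounder
corr = np.corrcoef(credit_hunger, credit_limit_rct)[0, 1]
```

Because the limit no longer depends on credit_score, the back-door path through credit_hunger is closed by design, and differences in balance across limits can be read as causal.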


<b>How does credit limit affect balance?</b>

<u>Simulated causal relationships</u>

credit_hunger (unobserved) -> credit_score -> credit_limit

credit_limit + credit_hunger -> balance
- if credit_hunger is high, balance will be close to credit_limit
- if credit_hunger is low, balance will scale less strongly with credit_limit
- balance cannot exceed credit_limit

<i> Relationship of interest </i>

E(balance) = f(credit_limit | credit_score)

<i> Back-door path </i>

credit_limit <--- credit_score <--- credit_hunger ---> balance
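
To see why this path matters, the sketch below compares a naive regression of balance on credit_limit with one that also conditions on credit_score. It is a deliberately simplified linear toy model, not the simulation used in this notebook: the coefficients, the 0/1 hunger variable, and the helper `ols_slope` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

hunger = rng.integers(0, 2, n)                    # unobserved confounder (0 or 1)
score = 500 - 40 * hunger + rng.normal(0, 50, n)  # hunger lowers the credit score
limit = 20 * score + rng.normal(0, 500, n)        # limit is set from the score
# True effect of limit on balance is 0.5; hunger also raises balance directly
balance = 0.5 * limit + 2000 * hunger + rng.normal(0, 200, n)


def ols_slope(y, *regressors):
    """Least-squares coefficient on the first regressor (intercept included)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


naive_slope = ols_slope(balance, limit)            # back-door path left open
adjusted_slope = ols_slope(balance, limit, score)  # path blocked via credit_score
```

In this toy setup the naive slope is biased downwards, because low-score customers are both hungrier and given lower limits, while the adjusted slope lands close to the true value of 0.5. Conditioning on credit_score is enough here because credit_hunger reaches credit_limit only through the score.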

In [111]:
# create simulated dataset

# def create_simulated_dataset(
#         sample_size:int = 10000
#         , credit_hunger_effect_credit_score: int = -20 # higher credit hunger = worse credit score
#         , credit_score_other_variance:int = 100
#         , **kwargs
# ):

sample_size = 10000
credit_hunger_effect_credit_score = -40
credit_score_other_variance = 200

# credit hunger - values 1 (not hungry), 2 (medium), 3 (very hungry)
_assign_credit_hunger = lambda x: 1 if x < 0.2 else (2 if x < 0.8 else 3)

_credit_hunger = map(_assign_credit_hunger,  np.random.uniform(size= sample_size))
_credit_hunger = np.array(list(_credit_hunger))

# credit score - clipped to the range 300 to 700, centred on 500.

_credit_score = (
    500 
    + (credit_hunger_effect_credit_score * (_credit_hunger - 2))
    + (credit_score_other_variance * np.random.standard_normal(sample_size))
)

_credit_score = np.clip(np.round(_credit_score).astype(int),300,700)

# credit limit - (almost) deterministic based on credit_score
# to give some variability for modelling (and simulate limit testing), x % will get the adjacent higher/ lower limit

def _assign_limit(indiv_score, indiv_random_num):
    _limits = [1000,2500,5000,10000,20000,40000]
    _score_thresholds = [350,500,550,600,650,999]
    _test_perc = 0.2

    _limit_index = 0
    while indiv_score > _score_thresholds[_limit_index]:
        _limit_index += 1

    _limit_index_lower = max(0, _limit_index - 1)
    _limit_index_higher = min(5, _limit_index+1)

    if indiv_random_num <= (_test_perc/ 2):
        return _limits[_limit_index_lower]
    elif  indiv_random_num <= (1-(_test_perc/2)):
        return _limits[_limit_index]
    else:
        return _limits[_limit_index_higher]
    # return np.where(credit_score_array <= credit_score_thresholds[0], credit_limits[0], credit_limits[1])

_limit_testing_score = np.random.uniform(low = 0, high = 1, size = sample_size)

_credit_limit = map(_assign_limit, _credit_score, _limit_testing_score)
_credit_limit = np.array(list(_credit_limit))

# balance
def _calculate_balance(indiv_limit, indiv_credit_hunger, indiv_noise):
    if indiv_credit_hunger == 1:
        _indiv_balance = 300 + (20 * np.sqrt((indiv_limit - 1000))) 

    elif indiv_credit_hunger == 2:
        _indiv_balance = 900 + (0.5 * (indiv_limit - 1000)) 

    else:
        _indiv_balance =  (0.9 * indiv_limit) 

    _indiv_balance = _indiv_balance + (_indiv_balance * (0.1 * indiv_noise))
    _indiv_balance = min(indiv_limit, max(0, _indiv_balance))
    _indiv_balance = int(_indiv_balance)
    
    return _indiv_balance

_balance_noise_variable = np.random.standard_normal(sample_size)

_balance = map(_calculate_balance, _credit_limit, _credit_hunger, _balance_noise_variable)
_balance = np.array(list(_balance))

# dataframe
df = pd.DataFrame({
    'credit_hunger': _credit_hunger
    , 'credit_score': _credit_score
    , 'credit_limit': _credit_limit
    , 'balance': _balance
})
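
The per-row `map` calls above are easy to follow but slow for large samples. As a possible alternative, the limit assignment can be vectorised with `np.searchsorted`. This is a sketch rather than a drop-in replacement: it reuses the thresholds and limits from `_assign_limit`, and the function name `assign_limit_vectorised` is hypothetical.

```python
import numpy as np

limits = np.array([1000, 2500, 5000, 10000, 20000, 40000])
score_thresholds = np.array([350, 500, 550, 600, 650, 999])
test_perc = 0.2


def assign_limit_vectorised(scores, random_nums):
    # First index whose threshold is >= score (same rule as the while loop)
    idx = np.searchsorted(score_thresholds, scores, side="left")
    # Lowest 10% of draws move one limit down, highest 10% move one limit up
    shift = np.where(
        random_nums <= test_perc / 2,
        -1,
        np.where(random_nums <= 1 - test_perc / 2, 0, 1),
    )
    return limits[np.clip(idx + shift, 0, len(limits) - 1)]


rng = np.random.default_rng(0)
scores = np.clip(np.round(rng.normal(500, 200, 1000)).astype(int), 300, 700)
draws = rng.uniform(size=1000)
vectorised = assign_limit_vectorised(scores, draws)
```

Clipping the shifted index reproduces the `max(0, ...)` and `min(5, ...)` guards, so customers already at the lowest or highest limit stay there.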


In [117]:
df.head(20)

| | credit_hunger | credit_score | credit_limit | balance | credit_score_band | credit_hunger_high |
|---|---|---|---|---|---|---|
| 0 | 2 | 581 | 10000 | 5542 | (556.0, 675.2] | 0 |
| 1 | 2 | 654 | 40000 | 22689 | (556.0, 675.2] | 0 |
| 2 | 3 | 300 | 2500 | 2369 | (299.999, 327.0] | 1 |
| 3 | 2 | 700 | 40000 | 20057 | (675.2, 700.0] | 0 |
| 4 | 2 | 362 | 2500 | 1617 | (327.0, 452.0] | 0 |
| 5 | 2 | 454 | 2500 | 1618 | (452.0, 556.0] | 0 |
| 6 | 1 | 405 | 5000 | 2313 | (327.0, 452.0] | 0 |
| 7 | 3 | 325 | 1000 | 803 | (299.999, 327.0] | 1 |
| 8 | 3 | 347 | 1000 | 942 | (327.0, 452.0] | 1 |
| 9 | 2 | 700 | 40000 | 20925 | (675.2, 700.0] | 0 |


In [116]:
df['credit_score_band'] = pd.qcut(df['credit_score'], q=5, duplicates='drop')
df['credit_hunger_high'] = (df['credit_hunger']==3).astype(int)

# df.head(20)

# share of high-credit-hunger customers in each credit score band
df.groupby('credit_score_band', observed=True)['credit_hunger_high'].mean()


credit_score_band
(299.999, 327.0]    0.266169
(327.0, 452.0]      0.237811
(452.0, 556.0]      0.191051
(556.0, 675.2]      0.172777
(675.2, 700.0]      0.151500
Name: credit_hunger_high, dtype: float64