### Performance Scoring

* Concept:
  - Combine a credit bureau (CB) score, the known good/bad (KGB) performance of booked accounts, and score calibration.
  - Use a Vantage or FICO score as a customer-level performance proxy for the through-the-door (TTD) population.
  - Calibrate the CB score to the KGB outcomes of the booked population using a regression function.
    A simple model might be:
      `logOdds = B0 + B1*CB_SCORE`
  - For a given rejected or unbooked application, compute its probability of being Good as
      `p(Good) = 1 / (1 + exp(-(B0 + B1*CB_SCORE)))`
  - Use these estimates in an iterative process to infer the product-specific performance of the TTD population.

* Assumption:
  The CB score carries information about how rejected or unbooked applicants would likely have performed had they been granted credit.
  That is, at a given CB score, booked and rejected/unbooked applications are assumed to perform the same.
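
The calibration step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function names `fit_calibration` and `p_good`, the Newton-Raphson fitting routine, and the tiny booked-population sample are all assumptions made for the example. In practice the fit would more likely come from a library such as statsmodels or scikit-learn.

```python
import math


def fit_calibration(cb_scores, is_good, n_iter=25):
    """Fit logOdds(Good) = b0 + b1 * cb_score on booked accounts.

    Hypothetical helper: a plain Newton-Raphson fit of a one-variable
    logistic regression, written out only to make the calibration explicit.
    """
    n = len(cb_scores)
    mean = sum(cb_scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in cb_scores) / n)
    # Standardize the score so the Newton steps are numerically stable.
    z = [(s - mean) / std for s in cb_scores]

    a0, a1 = 0.0, 0.0  # coefficients on the standardized score
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, is_good):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * zi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * zi     # gradient w.r.t. slope
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        a0 += d0
        a1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break

    # Convert back to coefficients on the raw CB score.
    b1 = a1 / std
    b0 = a0 - a1 * mean / std
    return b0, b1


def p_good(cb_score, b0, b1):
    """p(Good) = 1 / (1 + exp(-(B0 + B1*CB_SCORE)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * cb_score)))


# Made-up booked population: CB scores with known good (1) / bad (0) outcomes.
booked_scores = [580, 600, 620, 640, 660, 680, 700, 720, 740, 760]
booked_good = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

b0, b1 = fit_calibration(booked_scores, booked_good)

# Score a rejected/unbooked application using only its CB score.
reject_p_good = p_good(650, b0, b1)
```

Scoring a reject this way relies directly on the assumption stated above: the relationship between CB score and performance fitted on booked accounts is taken to hold for rejected and unbooked applications as well.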

In [1]:
import sys, os, json
sys.path.insert(1, "../")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%load_ext autoreload
%autoreload 2

plt.style.use('seaborn')

import warnings
warnings.filterwarnings("ignore")

In [2]:
test_df = pd.read_parquet(os.path.join("s3://sofi-data-science/Risk_DS/rdsutils_data/", 
                                       "customer_baseline_n_scores.parquet"))
test_df.head()

| Unnamed: 0 | pred_incumbent | pred_wo_ind | score_incumbent | score_wo_ind | rg_incumbent | rg_wo_ind | target | fico_score | fraud_score_2 |
|---|---|---|---|---|---|---|---|---|---|
| 5056065 | 0.014803 | 0.048822 | 502.594054 | 540.446816 | RG2 | RG3 | False |  | 0.447 |
| 5056066 | 0.133862 | 0.264597 | 574.411334 | 600.448718 | RG4 | RG4 | False |  |  |
| 5056067 | 0.008159 | 0.012328 | 484.031408 | 496.878531 | RG2 | RG2 | False |  | 0.133 |
| 5056068 | 0.000472 | 0.000902 | 395.957985 | 415.952349 | RG1 | RG1 | False |  | 0.117 |
| 5056069 | 0.341065 | 0.23981 | 611.653962 | 596.396399 | RG5 | RG4 | False |  |  |


In [None]:
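def get_array(x):
    """
    Coerce a list, pd.Series, or other array-like input to a np.ndarray.
    """
    # The original stub returned x unchanged; np.asarray is an assumed implementation.
    return np.asarray(x)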


def get_incremental_bad_rate(x, target, bins=None, quantiles=None):
    """
    Produce the incremental bad rates of an array.
    
    @params x: np.array or pd.Series
        values to be binned, e.g. model predictions, scores, bureau scores
    @params target: np.array
        binary target with True = bad
    @params bins: int, sequence of scalars, or IntervalIndex
        cutoff thresholds for the values
    @params quantiles: int or list-like of float
        quantile cutoff thresholds for the values
    """
    
    if (bins is None) and (quantiles is None):
        raise ValueError("one of bins or quantiles must be present")
    
    assert len(x) == len(target), "x and target must have the same length"
    
    
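    # make use of pd.cut and pd.qcut
    # NOTE: the original cell stops after building df; everything below is an
    # assumed completion that reads "incremental" as the bad rate within each bin.
    # Positional arrays avoid index misalignment when x or target is a pd.Series.
    df = pd.DataFrame({"x": np.asarray(x), "target": np.asarray(target)})
    
    # explicit cutoffs take precedence over quantiles if both are given
    if bins is not None:
        df["bin"] = pd.cut(df["x"], bins=bins)
    else:
        df["bin"] = pd.qcut(df["x"], q=quantiles, duplicates="drop")
    
    # per-bin record count, bad count, and bad rate (target True = bad)
    result = (
        df.groupby("bin", observed=False)["target"]
        .agg(n="count", n_bad="sum", bad_rate="mean")
        .reset_index()
    )
    return result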