### Download, explore, and prepare UCI credit card default data

https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/xgboost_pdp_ice.ipynb

#### Python Libraries/Packages Import

In [None]:
import numpy as np                   # array, vector, matrix calculations
import pandas as pd                  # DataFrame handling
import shap                          # for consistent, signed variable importance measurements
import xgboost as xgb                # gradient boosting machines (GBMs)

import matplotlib.pyplot as plt      # plotting
pd.options.display.max_columns = 999 # enable display of all columns in notebook

# enables display of plots in notebook
%matplotlib inline

np.random.seed(12345)                # set random seed for reproducibility

#### Import Data and Clean

In [None]:
# Read in Data
data = pd.read_csv("../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv")

# Rename target column so that it's not using dots in names 
# (replace them with underscore instead; make naming convention consistent)
data = data.rename(columns={'default.payment.next.month': 'DEFAULT_NEXT_MONTH'}) 

- **LIMIT_BAL**: Amount of given credit (NT dollar)
- **SEX**: 1 = male; 2 = female
- **EDUCATION**: 1 = graduate school; 2 = university; 3 = high school; 4 = others
- **MARRIAGE**: 1 = married; 2 = single; 3 = others
- **AGE**: Age in years
- **PAY_** : History of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- **BILL_AMT** : Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
- **PAY_AMT** : Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.
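
When exploring the data, the integer codes in the data dictionary can be mapped to readable labels. The following is a minimal sketch on a small made-up frame. The `toy` DataFrame and the `*_LABELS` dictionaries are illustrative names that are not part of the notebook, and the label text simply repeats the data dictionary above.

```python
import pandas as pd

# Label dictionaries copied from the data dictionary (the names are illustrative)
SEX_LABELS = {1: 'male', 2: 'female'}
EDUCATION_LABELS = {1: 'graduate school', 2: 'university', 3: 'high school', 4: 'others'}
MARRIAGE_LABELS = {1: 'married', 2: 'single', 3: 'others'}

# Small made-up sample that reuses the dataset's column names
toy = pd.DataFrame({'SEX': [1, 2, 2], 'EDUCATION': [2, 1, 4], 'MARRIAGE': [1, 2, 3]})

# Map each coded column to its label on a copy, leaving the coded data untouched
decoded = toy.copy()
decoded['SEX'] = toy['SEX'].map(SEX_LABELS)
decoded['EDUCATION'] = toy['EDUCATION'].map(EDUCATION_LABELS)
decoded['MARRIAGE'] = toy['MARRIAGE'].map(MARRIAGE_LABELS)
print(decoded)
```

Codes that are missing from a dictionary map to NaN, which makes undocumented values easy to spot. The model itself is still trained on the original integer codes.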

#### Assign Model Roles

In [None]:
# assign target and inputs for GBM
y = 'DEFAULT_NEXT_MONTH'
X = [name for name in data.columns if name not in [y, 'ID']]

print('y =', y)
print('X =', X)

#### Display descriptive statistics

In [None]:
data[X + [y]].describe()

### Investigate pair-wise Pearson correlations for DEFAULT_NEXT_MONTH

Monotonic means that X and Y move together in a consistent direction. The size of the movement does not need to be the same at every step, so linear and monotonic are different concepts.

- Monotonic relationships are much easier to explain to colleagues, bosses, customers, and regulators than more complex, non-monotonic relationships.
- Monotonic relationships may also prevent overfitting and excess error due to variance on new data.
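
The difference between monotonic and linear can be checked numerically. The sketch below uses a made-up exponential curve (not the credit card data): the curve always increases, yet its Pearson correlation with x falls short of 1 because the relationship is curved.

```python
import numpy as np

# Made-up example: y always increases with x, but not at a constant rate
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

# Monotonic increasing: every step from one point to the next goes up
is_monotonic_increasing = bool(np.all(np.diff(y) > 0))

# Pearson correlation measures linear association only
pearson_r = float(np.corrcoef(x, y)[0, 1])

print(is_monotonic_increasing, round(pearson_r, 3))
```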

Constraints are supplied to XGBoost in the form of a Python tuple with length equal to the number of inputs. Each item in the tuple is associated with an input variable based on its index in the tuple. The first constraint in the tuple is associated with the first variable in the training data, the second constraint in the tuple is associated with the second variable in the training data, and so on. The constraints themselves take the form of a 1 for a positive relationship and a -1 for a negative relationship.
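
The positional pairing between inputs and constraints can be made explicit on a tiny example. This is a sketch on made-up data. The `toy` frame and its column names are hypothetical, and the sign-of-correlation rule mirrors the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Made-up data: later payments go with default, higher limits go with non-default
toy = pd.DataFrame({
    'late_months':  [0, 1, 2, 3, 4],
    'credit_limit': [50, 40, 30, 20, 10],
    'default':      [0, 0, 1, 1, 1],
})
inputs = ['late_months', 'credit_limit']
target = 'default'

# Correlation of each input with the target, kept in the same order as `inputs`
corr_with_target = toy[inputs + [target]].corr()[target].loc[inputs]

# 1 for a positive correlation, -1 for a negative one
constraints = tuple(int(s) for s in np.sign(corr_with_target.values))

# Pair each input with its constraint to check the positional alignment
constraint_by_input = dict(zip(inputs, constraints))
print(constraint_by_input)
```

Because the tuple is matched to inputs purely by position, it should be rebuilt whenever the list of input columns or their order changes. A value of 0 leaves the corresponding input unconstrained.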

#### Calculate Pearson correlation

In [None]:
pd.DataFrame(data[X + [y]].corr()[y]).iloc[:-1]

#### Create tuple of monotonicity constraints from Pearson correlation values

In [None]:
# creates a tuple in which positive correlation values are assigned a 1
# and negative correlation values are assigned a -1

mono_constraints = tuple([int(i) for i in np.sign(data[X + [y]].corr()[y].values[:-1])])

### Train XGBoost with monotonicity constraints

- The monotone_constraints tuning parameter is used to enforce monotonicity between inputs and the prediction for DEFAULT_NEXT_MONTH.
- XGBoost's early stopping functionality is also used to limit overfitting to the training data.

#### Split data into training and test sets for early stopping

In [None]:
np.random.seed(12345) # set random seed for reproducibility
split_ratio = 0.7 #70/30% train & test split

# execute the split
split = np.random.rand(len(data)) < split_ratio
train = data[split]
test = data[~split]

# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Test data rows = %d, columns = %d' % (test.shape[0], test.shape[1]))

#### Train XGBoost GBM classifier

- Training and test data must be converted from Pandas DataFrames into XGBoost's DMatrix data structure (using the xgb.DMatrix function).
- Good values for the tuning parameters would normally be found with a grid search. For brevity, assume those values have already been found.
- Because gradient boosting methods typically resample training data, an additional random seed is also specified for XGBoost using the seed parameter
- To avoid overfitting, the early_stopping_rounds parameter is used to stop the training process after the test area under the curve (AUC) statistic fails to increase for 50 iterations.
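
The early stopping rule can be illustrated without training a model. The sketch below is a simplified stand-in for what XGBoost does internally, not its actual implementation. The function name `early_stopping_point` and the AUC values are made up, and a short patience of 3 is used in place of 50 to keep the example small.

```python
def early_stopping_point(scores, patience):
    """Return (best_iteration, last_iteration_run) for a sequence of
    evaluation scores where higher is better."""
    best_score = float('-inf')
    best_iter = -1
    for i, score in enumerate(scores):
        if score > best_score:
            # new best score: remember it and keep training
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # no improvement for `patience` iterations: stop here
            return best_iter, i
    # ran out of iterations before the patience was exhausted
    return best_iter, len(scores) - 1

# Made-up test AUC values, one per boosting iteration
aucs = [0.70, 0.74, 0.76, 0.75, 0.76, 0.755, 0.74]
best, stopped = early_stopping_point(aucs, patience=3)
print(best, stopped)
```

Training continues past the best iteration until the patience runs out, which is why predictions later in the notebook are restricted to the trees up to the best iteration.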

In [None]:
# pip install --user --upgrade xgboost

In [None]:
# XGBoost trains on its own DMatrix data structure, not NumPy arrays or Pandas DataFrames
dtrain = xgb.DMatrix(train[X], train[y])
dtest = xgb.DMatrix(test[X], test[y])

# used to calibrate predictions to mean of y
base_y = train[y].mean()

# tuning parameters
params = {
    'objective': 'binary:logistic',             # produces 0-1 probabilities for binary classification
    'booster': 'gbtree',                        # base learner will be decision tree
    'eval_metric': 'auc',                       # stop training based on maximum AUC, AUC always between 0-1
    'eta': 0.08,                                # learning rate
    'subsample': 0.9,                           # use 90% of rows in each decision tree
    'colsample_bytree': 0.9,                    # use 90% of columns in each decision tree
    'max_depth': 15,                            # allow decision trees to grow to depth of 15
    'monotone_constraints':mono_constraints,    # 1 = increasing relationship, -1 = decreasing relationship
    'base_score': base_y,                       # calibrate predictions to mean of y 
    'seed': 12345                               # set random seed for reproducibility
}

# watchlist is used for early stopping
watchlist = [(dtrain, 'train'), (dtest, 'eval')]

# train model
xgb_model = xgb.train(params,                   # set tuning parameters from above                   
                      dtrain,                   # training data
                      1000,                     # maximum of 1000 iterations (trees)
                      evals=watchlist,          # use watchlist for early stopping 
                      early_stopping_rounds=50, # stop after 50 iterations (trees) without increase in AUC
                      verbose_eval=True)        # display iteration progress

#### Global Shapley variable importance

In [None]:
# dtest is an xgb.DMatrix
# shap_values is a np.ndarray with one column per input plus a final bias column

# ntree_limit and best_ntree_limit are deprecated in newer XGBoost releases;
# iteration_range restricts prediction to the trees up to the best iteration instead

# pred_contribs=True returns Shapley values instead of predictions
shap_values = xgb_model.predict(dtest, pred_contribs=True,
                                iteration_range=(0, xgb_model.best_iteration + 1))

In [None]:
# plot Shapley variable importance summary 

shap.summary_plot(shap_values[:, :-1], test[xgb_model.feature_names])
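
To see what a global ranking of Shapley values summarizes, the mean absolute contribution per input can be computed directly. The sketch below uses a small made-up contribution matrix laid out like the `pred_contribs` output (one column per input plus a final bias column, in log-odds units). The numbers and the three-feature subset are illustrative only.

```python
import numpy as np

feature_names = ['PAY_0', 'LIMIT_BAL', 'AGE']  # illustrative subset of inputs

# Made-up contributions: 4 rows, 3 inputs + 1 bias column (log-odds units)
contribs = np.array([
    [ 0.8, -0.2,  0.05, -1.2],
    [-0.4, -0.3,  0.00, -1.2],
    [ 1.1,  0.1, -0.05, -1.2],
    [-0.5, -0.4,  0.10, -1.2],
])

# Global importance: mean absolute contribution per input (bias column dropped)
mean_abs = np.abs(contribs[:, :-1]).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(mean_abs)[::-1]]

# Each row, including the bias, sums to a log-odds score;
# the logistic function converts that score to a probability
margin = contribs.sum(axis=1)
prob = 1.0 / (1.0 + np.exp(-margin))

print(ranking)
```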

### Calculating partial dependence and ICE to validate and explain monotonic behavior

- Partial Dependence Plot (PDP): Global Average Explanations
    - https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf (see Section 10.13)
- Individual Conditional Expectation (ICE) Plot: more localized explanations for a single observation of data using the same basic ideas as partial dependence plots
    - See the paper: https://arxiv.org/abs/1309.6392
- Partial dependence can be misleading in the presence of strong interactions or correlation. ICE curves diverging from the partial dependence curve can be indicative of such problems.
- Histograms are also presented with the partial dependence and ICE curves, to enable a rough measure of epistemic uncertainty for model predictions: predictions based on small amounts of training data are likely less dependable.
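
The relationship between the two techniques can be sketched with a stand-in model. In the example below, `toy_model` is a hypothetical function used in place of the fitted GBM and the data are random. Each ICE curve is one row's prediction as the input of interest is swept over a grid, and partial dependence is the average of those curves.

```python
import numpy as np

def toy_model(X):
    """Hypothetical stand-in for a fitted model: one probability per row."""
    z = 0.9 * X[:, 0] - 0.5 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # made-up data with two inputs
grid = np.linspace(-2.0, 2.0, 9)   # values at which to set the first input

# ICE: one curve per row, obtained by overwriting the first input with each grid value
ice = np.empty((X.shape[0], grid.size))
for k, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value
    ice[:, k] = toy_model(X_mod)

# Partial dependence: the average of all ICE curves at each grid value
pdp = ice.mean(axis=0)
print(np.round(pdp, 3))
```

In this toy model there are no interactions, so every ICE curve has the same direction as the partial dependence curve. Curves that cross or diverge strongly would hint at the interaction problems mentioned above.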

#### Function for calculating partial dependence

In [None]:
def par_dep(xs, frame, model, resolution=20, bins=None):
    
    """ Creates Pandas DataFrame containing partial dependence for a 
        single variable.
    
    Args:
        - xs: Variable for which to calculate partial dependence.
        - frame: Pandas DataFrame for which to calculate partial dependence.
        - model: XGBoost model for which to calculate partial dependence.
        - resolution: The number of points across the domain of xs for which 
                    to calculate partial dependence, default 20.
        - bins: List of values at which to set xs, default 20 equally-spaced 
              points between column minimum and maximum.
    
    Returns:
        Pandas DataFrame containing partial dependence values.
        
    """
    
    # work on a copy so the caller's DataFrame is never modified
    frame = frame.copy(deep=True)

    # determine values at which to calculate partial dependence
    if bins is None:
        min_ = frame[xs].min()
        max_ = frame[xs].max()
        by = (max_ - min_)/resolution
        bins = np.arange(min_, max_, by)

    # calculate partial dependence
    # by setting column of interest to constant
    # and scoring the altered data and taking the mean of the predictions
    rows = []
    for j in bins:
        frame[xs] = j
        dframe = xgb.DMatrix(frame)
        # score with the trees up to the best early-stopping iteration
        preds = model.predict(dframe, iteration_range=(0, model.best_iteration + 1))
        rows.append({xs: j, 'partial_dependence': preds.mean()})

    # collect results in a DataFrame (DataFrame.append was removed in pandas 2.0)
    par_dep_frame = pd.DataFrame(rows, columns=[xs, 'partial_dependence'])

    return par_dep_frame

#### Calculate partial dependence for the most important input variables in the GBM

In [None]:
# Looking at just three variables
par_dep_PAY_0 = par_dep('PAY_0', test[X], xgb_model)         # calculate partial dependence for PAY_0
par_dep_LIMIT_BAL = par_dep('LIMIT_BAL', test[X], xgb_model) # calculate partial dependence for LIMIT_BAL
par_dep_BILL_AMT1 = par_dep('BILL_AMT1', test[X], xgb_model) # calculate partial dependence for BILL_AMT1

In [None]:
# display partial dependence for PAY_0
par_dep_PAY_0

In [None]:
# display partial dependence for LIMIT_BAL
par_dep_LIMIT_BAL

In [None]:
# display partial dependence for BILL_AMT1
par_dep_BILL_AMT1

#### Helper function for finding percentiles of predictions

- Calculating and analyzing ICE curves for every row of the training and test data sets can be overwhelming.
- A practical starting point is to calculate ICE curves at every decile of predicted probabilities in a dataset, which gives an indication of local prediction behavior across the dataset.
- The function below finds and returns the row indices for the maximum, minimum, and deciles of one column in terms of another -- in this case, the model predictions (p_DEFAULT_NEXT_MONTH) and the row identifier (ID), respectively.

In [None]:
def get_percentile_dict(yhat, id_, frame):

    """ Returns the percentiles of a column, yhat, as the indices based on 
        another column id_.
    
    Args:
        yhat: Column in which to find percentiles.
        id_: Id column that stores indices for percentiles of yhat.
        frame: Pandas DataFrame containing yhat and id_. 
    
    Returns:
        Dictionary of percentile values and index column values.
    
    """
    
    # create a copy of frame and sort it by yhat
    sort_df = frame.copy(deep=True)
    sort_df.sort_values(yhat, inplace=True)
    sort_df.reset_index(inplace=True)
    
    # find top and bottom percentiles
    percentiles_dict = {}
    percentiles_dict[0] = sort_df.loc[0, id_]
    percentiles_dict[99] = sort_df.loc[sort_df.shape[0]-1, id_]

    # find 10th-90th percentiles
    inc = sort_df.shape[0]//10
    for i in range(1, 10):
        percentiles_dict[i * 10] = sort_df.loc[i * inc,  id_]

    return percentiles_dict

#### Find some percentiles of yhat in the test data

The values for ID that correspond to the maximum, minimum, and deciles of p_DEFAULT_NEXT_MONTH are displayed below. ICE will be calculated for the rows of the test dataset associated with these ID values.
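
The selection logic can be followed on a small made-up frame. The sketch below is a compact variant of the same idea rather than the helper function itself, and the `scored` frame, its `p_hat` column, and the ID values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up scored data: 20 rows, with the highest prediction on the first ID
scored = pd.DataFrame({
    'ID': np.arange(101, 121),
    'p_hat': np.linspace(1.0, 0.05, 20),
})

# IDs ordered from the lowest prediction to the highest prediction
sorted_ids = scored.sort_values('p_hat')['ID'].to_numpy()

# Step between deciles, then pick the minimum, the maximum, and each decile
inc = len(sorted_ids) // 10
picks = {0: int(sorted_ids[0]), 99: int(sorted_ids[-1])}
for i in range(1, 10):
    picks[i * 10] = int(sorted_ids[i * inc])

print(picks)
```

Key 0 holds the ID with the lowest prediction and key 99 holds the ID with the highest, matching the keys used by the helper function.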

In [None]:
# merge GBM predictions onto test data
yhat_test = pd.concat([test.reset_index(drop=True),
                       pd.DataFrame(xgb_model.predict(dtest,
                                                      iteration_range=(0, xgb_model.best_iteration + 1)))],
                      axis=1)
yhat_test = yhat_test.rename(columns={0:'p_DEFAULT_NEXT_MONTH'})

# find percentiles of predictions
percentile_dict = get_percentile_dict('p_DEFAULT_NEXT_MONTH', 'ID', yhat_test)

In [None]:
test.columns

In [None]:
# display percentiles dictionary
# ID values for rows from lowest prediction to highest prediction
percentile_dict

#### Calculate ICE curve values

- ICE values represent a model's prediction for a row of data while an input variable of interest is varied across its domain. 
- The values of the input variable are chosen to match the values at which partial dependence was calculated earlier, and ICE is calculated for the top three most important variables and for rows at each percentile of the test dataset.

In [None]:
# retrieve bins from original partial dependence calculation

bins_PAY_0 = list(par_dep_PAY_0['PAY_0'])
bins_LIMIT_BAL = list(par_dep_LIMIT_BAL['LIMIT_BAL'])
bins_BILL_AMT1 = list(par_dep_BILL_AMT1['BILL_AMT1'])

# for each percentile in percentile_dict
# create a new column in the par_dep frame 
# representing the ICE curve for that percentile
# and the variables of interest
for i in sorted(percentile_dict.keys()):
    
    col_name = 'Percentile_' + str(i)
    
    # ICE curves for PAY_0 across percentiles at bins_PAY_0 intervals
    par_dep_PAY_0[col_name] = par_dep('PAY_0', 
                                    test[test['ID'] == int(percentile_dict[i])][X],  
                                    xgb_model, 
                                    bins=bins_PAY_0)['partial_dependence']
    
    # ICE curves for LIMIT_BAL across percentiles at bins_LIMIT_BAL intervals
    par_dep_LIMIT_BAL[col_name] = par_dep('LIMIT_BAL', 
                                          test[test['ID'] == int(percentile_dict[i])][X], 
                                          xgb_model, 
                                          bins=bins_LIMIT_BAL)['partial_dependence']
    


    # ICE curves for BILL_AMT1 across percentiles at bins_BILL_AMT1 intervals
    par_dep_BILL_AMT1[col_name] = par_dep('BILL_AMT1', 
                                          test[test['ID'] == int(percentile_dict[i])][X],  
                                          xgb_model, 
                                          bins=bins_BILL_AMT1)['partial_dependence']

#### Display partial dependence and ICE for PAY_0, LIMIT_BAL, and BILL_AMT1

In [None]:
par_dep_PAY_0

In [None]:
par_dep_LIMIT_BAL

In [None]:
par_dep_BILL_AMT1

### Plotting partial dependence and ICE to validate and explain monotonic behavior

Overlaying partial dependence onto ICE in a plot is a convenient way to validate and understand both global and local monotonic behavior. Plots of partial dependence curves overlaid onto ICE curves for several percentiles of predictions for DEFAULT_NEXT_MONTH are used to validate monotonic behavior, describe the GBM model mechanisms, and compare the most extreme GBM behavior with the average GBM behavior in the test data. Partial dependence and ICE plots are displayed for the three most important variables in the GBM: PAY_0, LIMIT_BAL, and BILL_AMT1.
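
Besides inspecting plots, monotonic behavior can also be checked numerically. The sketch below assumes a frame laid out like the partial dependence frames in this notebook (one column for the input values, then one column per curve). The helper name `is_monotone` and the toy numbers are made up, and one curve contains a deliberate violation.

```python
import numpy as np
import pandas as pd

def is_monotone(curves, x_col, increasing=True):
    """Check each curve column for monotonic behavior along x_col."""
    ordered = curves.sort_values(x_col).drop(columns=[x_col])
    # differences between consecutive points of every curve
    steps = np.diff(ordered.to_numpy(dtype=float), axis=0)
    ok = (steps >= 0) if increasing else (steps <= 0)
    return pd.Series(ok.all(axis=0), index=ordered.columns)

# Made-up curves in the same layout as the partial dependence frames
toy_curves = pd.DataFrame({
    'PAY_0': [-2, 0, 2, 4],
    'partial_dependence': [0.10, 0.15, 0.40, 0.45],
    'Percentile_10': [0.02, 0.03, 0.20, 0.22],
    'Percentile_90': [0.30, 0.28, 0.60, 0.70],  # deliberate violation at the second point
})

result = is_monotone(toy_curves, 'PAY_0', increasing=True)
print(result)
```

Passing `increasing=False` applies the same check to inputs that carry a -1 constraint.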

#### Function to plot partial dependence and ICE

In [None]:
def plot_par_dep_ICE(xs, par_dep_frame):

    """ Plots ICE overlaid onto partial dependence for a single variable.
    
    Args: 
        xs: Name of variable for which to plot ICE and partial dependence.
        par_dep_frame: Name of Pandas DataFrame containing ICE and partial
                       dependence values.
    
    """
    
    # initialize figure and axis
    fig, ax = plt.subplots()
    
    # plot ICE curves
    par_dep_frame.drop('partial_dependence', axis=1).plot(x=xs, 
                                                          colormap='gnuplot',
                                                          ax=ax)

    # overlay partial dependence, annotate plot
    par_dep_frame.plot(title='Partial Dependence and ICE for ' + str(xs),
                       x=xs, 
                       y='partial_dependence',
                       style='r-', 
                       linewidth=3, 
                       ax=ax)

    # add legend
    _ = plt.legend(bbox_to_anchor=(1.05, 0),
                   loc=3, 
                   borderaxespad=0.)

#### Partial dependence and ICE plot for LIMIT_BAL

- Monotonic decreasing behavior is evident at every percentile of predictions for DEFAULT_NEXT_MONTH. 
- Most percentiles of predictions show that sharper decreases in probability of default occur when LIMIT_BAL increases just slightly from its lowest values in the test set. 
- However, for the customers that are most likely to default according to the GBM model, no increase in LIMIT_BAL has a strong impact on probability of default.

In [None]:
plot_par_dep_ICE('LIMIT_BAL', par_dep_LIMIT_BAL) # plot partial dependence and ICE for LIMIT_BAL

In [None]:
_ = train['LIMIT_BAL'].plot(kind='hist', bins=20, title='Histogram: LIMIT_BAL')

As can be seen from the displayed histogram, prediction behavior above roughly NT$500,000 may have been learned from extremely small samples of data.
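
The same sparsity check can be done numerically by counting rows per histogram bin. The sketch below uses made-up values on a 0 to 100 scale rather than the real LIMIT_BAL column, and the `min_rows` threshold of 5 is an arbitrary choice for illustration.

```python
import numpy as np

# Made-up values: most rows are small, only two rows are large
values = np.array([10] * 50 + [20] * 40 + [90] * 2)

# Count rows per bin over a fixed range
counts, edges = np.histogram(values, bins=4, range=(0, 100))

# Flag bins with fewer rows than an arbitrary threshold
min_rows = 5
sparse_bins = [(float(edges[i]), float(edges[i + 1]), int(c))
               for i, c in enumerate(counts) if c < min_rows]

print(counts.tolist())
print(sparse_bins)
```

Partial dependence and ICE values that fall inside the flagged ranges deserve extra caution, because the model saw few or no training rows there.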

#### Partial dependence and ICE plot for PAY_0

- Monotonic increasing prediction behavior for PAY_0 is displayed for all percentiles of model predictions. 
- Prediction behavior is different at different deciles, but not abnormal or vastly different from the average prediction behavior represented by the red partial dependence curve. 
- The largest jump in predicted probability appears to occur at PAY_0 = 2, or when a customer becomes two months late on their most recent payment. 

In [None]:
plot_par_dep_ICE('PAY_0', par_dep_PAY_0) # plot partial dependence and ICE for PAY_0

In [None]:
_ = train['PAY_0'].plot(kind='hist', bins=20, title='Histogram: PAY_0')

#### Partial dependence and ICE plot for BILL_AMT1

- Monotonic decreasing prediction behavior for BILL_AMT1 is also displayed for all percentiles. 
- The mild decrease in probability of default as the most recent bill amount increases could be related to wealthier, big-spending customers taking on more debt but also being able to pay it off reliably. 
- Customers with negative bills are more likely to default, potentially indicating that charge-offs (a creditor's declaration that it does not expect to collect the amount owed) are being recorded as negative bills.

In [None]:
plot_par_dep_ICE('BILL_AMT1', par_dep_BILL_AMT1) # plot partial dependence and ICE for BILL_AMT1

In [None]:
_ = train['BILL_AMT1'].plot(kind='hist', bins=20, title='Histogram: BILL_AMT1')

- Predictions for BILL_AMT1 values below 0 and above 400,000 are based on very little training data.

### Generate reason codes using the Shapley method

#### Select most risky customer in test data

One person who might be of immediate interest is the customer in the test data who is most likely to default. This customer's row will be selected and local variable importance for the corresponding prediction will be analyzed.

In [None]:
test.reset_index(drop=True, inplace=True)

In [None]:
decile = 99
row = test[test['ID'] == percentile_dict[decile]]

#### Create a Pandas DataFrame of Shapley values for riskiest customer

The most interesting Shapley values are probably those that push this customer's probability of default higher, i.e. the highest positive Shapley values. Those values are plotted below.

In [None]:
# the test data index was reset above so that row.index[0] matches the row position in shap_values
# drop the final bias column, then sort to find largest positive contributions
s_df = pd.DataFrame(shap_values[row.index[0], :][:-1].reshape(len(X), 1), columns=['Reason Codes'], index=X)
s_df.sort_values(by='Reason Codes', inplace=True, ascending=False)

In [None]:
s_df

#### Plot top local contributions as reason codes

In [None]:
_ = s_df[:5].plot(kind='bar', 
                  title='Top Five Reason Codes for a Risky Customer\n', 
                  legend=False)

For the customer in the test dataset that the GBM predicts as most likely to default, the most important input variables in the prediction are, in descending order, PAY_0, PAY_5, PAY_6, PAY_2, and LIMIT_BAL.

#### Display customer in question

In [None]:
row # helps understand reason codes

The local contributions for this customer appear reasonable, especially when considering her payment information. Her most recent payment was 3 months late, and her payments from 5 and 6 months earlier were each 7 months late. Her credit limit was also extremely low, so it is logical that these factors would weigh heavily in the model's prediction of default for this customer.



To generate reason codes for the model's decision, the locally important variable and its value are used together. If this customer was denied future credit based on this model and data, the top five Shapley-based reason codes for the automated decision would be:

- Most recent payment is 3 months delayed.
- 5th most recent payment is 7 months delayed.
- 6th most recent payment is 7 months delayed.
- 2nd most recent payment is 2 months delayed.
- Credit limit is too low: NT$10,000.

(Of course, credit limits are set by the lender and are used to price-in risk to credit decisions, so using credit limits as reason codes or even in a probability of default model is likely questionable. However, in this small, example data set all input columns were used to generate a better model fit. For a slightly more careful treatment of gradient boosting in the context of credit scoring, please see: https://github.com/jphall663/interpretable_machine_learning_with_python/blob/master/dia.ipynb)
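
Pairing each locally important variable with its value, as described above, can be sketched in a few lines. The contributions and customer values below are made up for illustration, and `reason_codes` is a hypothetical helper that is not part of the notebook or of any library.

```python
import pandas as pd

# Made-up local contributions (log-odds units) and input values for one customer
contributions = pd.Series({'PAY_0': 0.9, 'PAY_5': 0.4, 'AGE': -0.1, 'LIMIT_BAL': 0.2})
customer = pd.Series({'PAY_0': 3, 'PAY_5': 7, 'AGE': 40, 'LIMIT_BAL': 10000})

def reason_codes(contribs, row, top_n=3):
    """Pair the largest positive contributions with the customer's values."""
    # keep only inputs that push the prediction toward default
    positive = contribs[contribs > 0].sort_values(ascending=False).head(top_n)
    return [f"{name} = {row[name]} (contribution {value:+.2f})"
            for name, value in positive.items()]

codes = reason_codes(contributions, customer)
for code in codes:
    print(code)
```

Inputs with negative contributions are left out because they lower the predicted probability of default and so would not explain an adverse decision.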