# Gradient boosting workflow

This workflow contains a set of methods (functions) needed to develop and fine-tune a gradient boosting model.

Note: the lightgbm and shap packages must be installed on your computer.  
--    pip install lightgbm  
--    pip install shap

In [None]:
import time
import datetime
import operator
import math
import random
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os.path
import gc
from tqdm import tqdm_notebook as tqdm

import sys
sys.path.insert(0, '../')
import scoring

sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
from IPython.display import display, Markdown
pd.options.display.max_columns = None
pd.options.display.max_rows = 15

scoring.check_version('0.7.0')

Please adjust the path to your data.

In [None]:
data_path='demo_data/kaggle_train_data.zip'

In [None]:
from scoring import db
data = db.read_csv(data_path, compression='zip',sep = ',', decimal = '.',
                   optimize_types=True, encoding = 'utf-8', low_memory = False)

In [None]:
data.head()

For the default dataset, the predictors are all columns except the target.

In [None]:
cols_pred=list(data)
cols_pred.remove('TARGET')
col_target='TARGET'

Split the data into train, valid and test parts.

In [None]:
from scoring.data_manipulation import data_sample_time_split

data['data_type'] = data_sample_time_split(data, 
                           time_column = '',
                           splitting_points = [],
                           sample_sizes = [[ 0.4 , 0.3, 0.3]],
                           sample_names = [['train','valid','test'],[],],
                           stratify_by_columns = [col_target],
                           random_seed = 1234)

train_mask = (data['data_type'] == 'train')
valid_mask = (data['data_type'] == 'valid')
test_mask = (data['data_type'] == 'test')
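
The internals of `data_sample_time_split` are not shown here. As a rough illustration of what a 40/30/30 split stratified by the target can look like, here is a minimal sketch that uses scikit-learn's `train_test_split` instead; the toy DataFrame and its values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical values).
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "TARGET": rng.binomial(1, 0.1, size=1000),
})

# Step 1: take 40% of the rows for training, stratified by the target.
train_idx, rest_idx = train_test_split(
    toy.index.to_numpy(), train_size=0.4,
    stratify=toy["TARGET"], random_state=1234,
)
# Step 2: halve the remaining 60% into valid and test (30% each overall).
valid_idx, test_idx = train_test_split(
    rest_idx, train_size=0.5,
    stratify=toy.loc[rest_idx, "TARGET"], random_state=1234,
)

toy["data_type"] = "train"
toy.loc[valid_idx, "data_type"] = "valid"
toy.loc[test_idx, "data_type"] = "test"

# Share of rows and target rate per part; the target rates should be close.
print(toy["data_type"].value_counts(normalize=True))
print(toy.groupby("data_type")["TARGET"].mean())
```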

Split the predictors into numerical and categorical parts.

NOTE: Categorical predictors must have dtype 'category', not 'object'!

In [None]:
from scoring.data_manipulation import split_predictors_bytype

cols_pred, cols_pred_num, cols_pred_cat = split_predictors_bytype(data,
                                                                  pred_list=cols_pred,
                                                                  non_pred_list= [],
                                                                  optimize_types=True,
                                                                  convert_bool2int=True)
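
If some columns are still stored as 'object', they can be cast before modelling. The sketch below shows one way to do this in plain pandas; it is not the implementation of `split_predictors_bytype`, and the toy column names are hypothetical.

```python
import pandas as pd

# Toy frame with one numeric, one string (object) and one boolean column.
toy = pd.DataFrame({
    "income": [10.5, 20.0, 15.2],
    "gender": pd.Series(["F", "M", "F"], dtype="object"),
    "owns_car": [True, False, True],
})

# Cast object columns to 'category' so they are treated as categorical.
obj_cols = toy.select_dtypes(include="object").columns
toy = toy.astype({c: "category" for c in obj_cols})

# Cast booleans to integers, similar in spirit to convert_bool2int=True.
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({c: "int64" for c in bool_cols})

# Split the column names by dtype.
cols_num = toy.select_dtypes(include="number").columns.tolist()
cols_cat = toy.select_dtypes(include="category").columns.tolist()
print(cols_num, cols_cat)
```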

#### Setting default parameters of lgbm

In [None]:
params={'learning_rate':0.05,
        'num_leaves':100,
        'colsample_bytree':0.75,
        'subsample':0.75,
        'subsample_freq':1,
        'max_depth':3,
        'nthreads':3,
        'verbose':1,
        'metric':'auc',
        'objective':'binary',
        'early_stopping_rounds':100,
        'num_boost_round':100000,
        'seed':1234}

#### Initialization of the lgbm class

In [None]:
%%capture --no-display

#from importlib import reload
from scoring import lgbm 
#lgbm=reload(lgbm)

model_lgb = lgbm.LGBM_model(cols_pred, params, use_CV=False, CV_folds=3, CV_seed=9876)

#### Fit standard or cross-validated model
Output: list of lgbm boosters (models)

In [None]:
model1 = model_lgb.fit_model(data[train_mask], data[valid_mask],
                             data[train_mask][col_target], data[valid_mask][col_target])

#### Predict on an unseen dataset

If CV is chosen, the predictions are the average of the predictions from the individual CV models.

In [None]:
from sklearn.metrics import roc_auc_score

predictions = model_lgb.predict(model1, data[test_mask])
# Gini coefficient computed from AUC: Gini = 2 * AUC - 1
print(2 * roc_auc_score(data[test_mask][col_target], predictions) - 1)
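
To make the two steps concrete (averaging fold predictions, then turning AUC into Gini), here is a small sketch with made-up numbers. The three rows of `fold_preds` stand in for the predictions of three hypothetical CV models; they are not produced by `model_lgb.predict`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])

# Hypothetical predictions of three CV models for the same six rows.
fold_preds = np.array([
    [0.10, 0.50, 0.35, 0.80, 0.20, 0.70],
    [0.20, 0.45, 0.45, 0.90, 0.10, 0.60],
    [0.15, 0.55, 0.40, 0.70, 0.30, 0.80],
])

# Final prediction = simple average over the fold models.
avg_pred = fold_preds.mean(axis=0)

# Gini is a linear rescaling of AUC: 0 for random ranking, 1 for perfect.
auc = roc_auc_score(y_true, avg_pred)
gini = 2 * auc - 1
print(f"AUC = {auc:.4f}, Gini = {gini:.4f}")
```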

#### Gain or weight variable importances

Output: DataFrame with features and the chosen importance

If CV is chosen, the variable importance is computed as the average of the variable importances from the individual CV models.


In [None]:
var_imp=model_lgb.plot_imp(model1, 'importance_gain', ret=True, show= True, n_predictors=25)
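
The averaging over CV models can be pictured as follows. This is only a sketch with hypothetical feature names and importance values, not the code behind `plot_imp`.

```python
import pandas as pd

# Hypothetical gain importances from three CV models.
fold_importances = [
    pd.Series({"feat_a": 10.0, "feat_b": 5.0, "feat_c": 1.0}),
    pd.Series({"feat_a": 14.0, "feat_b": 3.0, "feat_c": 1.0}),
    pd.Series({"feat_a": 12.0, "feat_b": 4.0, "feat_c": 4.0}),
]

# Align the folds by feature name, average across folds, sort descending.
avg_importance = (
    pd.concat(fold_importances, axis=1)
    .mean(axis=1)
    .sort_values(ascending=False)
)
print(avg_importance)
```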

#### Computing shap values for given dataset

Theoretical background for shap values can be found at https://christophm.github.io/interpretable-ml-book/shapley.html

Output: DataFrame with features and their mean absolute shap values, which correspond to the second chart


In [None]:
var_imp_shap = model_lgb.print_shap_values(cols_pred_num, cols_pred_cat,
                                           data[train_mask], data[valid_mask],
                                           data[train_mask][col_target], data[valid_mask][col_target],
                                           data[test_mask])
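
The mean absolute shap value of a feature is the average of the absolute per-observation contributions. A minimal sketch of that aggregation, using a small made-up matrix in place of real shap output (rows are observations, columns are features, names are hypothetical):

```python
import numpy as np
import pandas as pd

features = ["feat_a", "feat_b", "feat_c"]

# Made-up shap values: 3 observations x 3 features.
shap_matrix = np.array([
    [0.2, -0.1, 0.0],
    [-0.4, 0.3, 0.1],
    [0.6, -0.2, -0.2],
])

# Average the absolute contributions per feature and rank the features.
mean_abs_shap = (
    pd.Series(np.abs(shap_matrix).mean(axis=0), index=features)
    .sort_values(ascending=False)
)
print(mean_abs_shap)
```

Taking absolute values first keeps positive and negative contributions from cancelling out.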

#### Shap interaction matrix

Prints the shap interaction matrix, based on https://christophm.github.io/interpretable-ml-book/shap.html#shap-interaction-value.
It prints the sum of absolute interaction values across all observations.
Diagonal values are manually set to zero.


In [None]:
model_lgb.print_shap_interaction_matrix()
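
The aggregation described above can be sketched with NumPy alone. The array below is random and only mimics the shape of shap interaction values (observations x features x features); it is an assumption for illustration, not output of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_feat = 50, 4

# Random stand-in for interaction values, made symmetric per observation.
raw = rng.normal(size=(n_obs, n_feat, n_feat))
interactions = (raw + raw.transpose(0, 2, 1)) / 2

# Sum of absolute interaction values across all observations.
matrix = np.abs(interactions).sum(axis=0)

# The diagonal holds main effects, so it is zeroed to show interactions only.
np.fill_diagonal(matrix, 0.0)
print(np.round(matrix, 2))
```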

#### Shap dependence plot
Note: If y (the second feature) is not specified, it is chosen automatically.

In [None]:
model_lgb.shap_dependence_plot(x='AMT_GOODS_PRICE',y=None)

In [None]:
model_lgb.shap_dependence_plot(x='AMT_GOODS_PRICE',y='AMT_ANNUITY')

#### Shap force plot for one observation
Use this if you are curious why a particular observation received its prediction.  
Note: values in the upper chart are in log-odds, values in the lower chart are in probabilities.

In [None]:
model_lgb.shap_one_row(0)
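
The two charts are related by the logistic function. The sketch below assumes the upper chart is on the raw output scale of a binary LightGBM model (log-odds); the base value and contributions are made-up numbers.

```python
import numpy as np

def sigmoid(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical base value (average raw model output) and shap contributions.
base_value = -2.0
contributions = np.array([0.8, -0.3, 1.5])

# Shap values are additive on the raw scale: base value plus contributions.
raw_score = base_value + contributions.sum()
probability = sigmoid(raw_score)
print(f"raw score = {raw_score:.3f}, probability = {probability:.3f}")
```

Contributions add up only on the raw scale, which is why the conversion to probability is applied once, to the final sum.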

#### Hyperparameter tuning
Based on maximization of 3-fold cross-validation AUC.  
The output is a dictionary of optimized hyperparameters that can be pasted into params before the class is initialized.

In [None]:
a = model_lgb.param_hyperopt(data[train_mask], data[valid_mask],
                             data[train_mask][col_target], data[valid_mask][col_target],
                             n_iter=500)
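
One way to reuse the tuned values is to merge them over the defaults before creating a new model object. In this sketch the contents of `tuned` are hypothetical; the actual keys returned by `param_hyperopt` may differ.

```python
# Subset of the default parameters defined earlier in the workflow.
params = {
    "learning_rate": 0.05,
    "num_leaves": 100,
    "max_depth": 3,
    "objective": "binary",
    "metric": "auc",
}

# Hypothetical tuning result (keys and values are assumptions).
tuned = {"learning_rate": 0.02, "num_leaves": 31, "max_depth": 5}

# Tuned values override the defaults; untouched keys are kept as they are.
new_params = {**params, **tuned}
print(new_params)
```

Building a new dictionary leaves the original `params` unchanged, so both settings can be compared later.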

#### Marginal contribution
Features are removed from model training one at a time, and performance on the test data is computed each time.  
The output is a DataFrame with 4 columns: feature, gini with the feature, gini without the feature, and the difference between the two.

In [None]:
mc = model_lgb.marginal_contribution(data[train_mask], data[valid_mask],
                                     data[train_mask][col_target], data[valid_mask][col_target],
                                     data[test_mask], data[test_mask][col_target])
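
The leave-one-feature-out idea can be sketched on synthetic data. To keep the example fast and self-contained it swaps in scikit-learn's `LogisticRegression` for the gradient boosting model; the feature names, data and output column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: one strong predictor, one weak predictor, one pure noise.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
logit = (2.0 * X["strong"] + 0.3 * X["weak"]).to_numpy()
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_train, X_test = X.iloc[:1000], X.iloc[1000:]
y_train, y_test = y[:1000], y[1000:]

def gini_for(features):
    """Fit on the train part using `features`, return Gini on the test part."""
    model = LogisticRegression().fit(X_train[features], y_train)
    pred = model.predict_proba(X_test[features])[:, 1]
    return 2 * roc_auc_score(y_test, pred) - 1

all_features = list(X.columns)
gini_full = gini_for(all_features)

# Refit once per feature with that feature left out.
rows = []
for feature in all_features:
    reduced = [c for c in all_features if c != feature]
    gini_reduced = gini_for(reduced)
    rows.append({
        "feature": feature,
        "gini_with": gini_full,
        "gini_without": gini_reduced,
        "diff": gini_full - gini_reduced,
    })

mc_toy = pd.DataFrame(rows).sort_values("diff", ascending=False)
print(mc_toy)
```

A large positive `diff` means the model loses performance without the feature; values near zero or below suggest the feature adds little on this test sample.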