This workflow expects data that has already been transformed by the initial PSW (Python Scoring Workflow) chapters, i.e.
1. Import original dataset and create all important columns there (base, weight, month etc.)
2. Data split
3. Creation of date difference variables
4. OPTIONALLY: creation of interactions and other derived features (gradient boosting captures interactions between variables naturally, so it is not always necessary to create them manually)
5. Export the transformed dataset and metadata about the important columns

# Preparation

## Import libraries

Note: the `lightgbm` and `shap` packages have to be installed on your computer.  
```
pip install lightgbm
pip install shap
```

In [None]:
import time
import datetime
import operator
import math
import random
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os.path
import gc
import pickle
from tqdm import tqdm_notebook as tqdm

import sys
sys.path.insert(0, '../')
import scoring

sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
from IPython.display import display, Markdown
pd.options.display.max_columns = None
pd.options.display.max_rows = 15

scoring.check_version('0.8.2', list_versions=False)

## Import data and metadata

### Data and important variables

In [None]:
import json  # needed here; it was not imported in the import cell above

with open("metadata.json", "r", encoding="utf8") as file:
    metadata = json.load(file)

col_time = metadata["col_time"]
col_month = metadata["col_month"]
col_day = metadata["col_day"]
col_target = metadata["col_target"]
col_base = metadata["col_base"]
col_weight = metadata["col_weight"]
col_reject = metadata["col_reject"]
col_datatype = metadata["col_datatype"]
col_id = metadata["col_id"]
cols_pred_num = metadata["cols_pred_num"]
cols_pred_cat = metadata["cols_pred_cat"]
cols_pred = cols_pred_num + cols_pred_cat

In [None]:
from scoring import db
data = db.read_csv('data_prepared.csv', index_col=col_id)
data[col_id] = data.index

In [None]:
train_mask = (data[col_datatype] == 'train') & (data[col_base] == 1)
valid_mask = (data[col_datatype] == 'valid') & (data[col_base] == 1)
test_mask = (data[col_datatype] == 'test') & (data[col_base] == 1)
oot_mask = (data[col_datatype] == 'oot') & (data[col_base] == 1)
hoot_mask = (data[col_datatype] == 'hoot') & (data[col_base] == 1)
observable_mask = (data[col_base] == 1)

### Structures for documentation output

In [None]:
sns.set()
%matplotlib inline
%config InlineBackend.close_figures=True
from IPython.display import display, Markdown, HTML
pd.options.display.max_columns = None
pd.options.display.max_rows = 15
output_folder = 'documentation_lgbm'

# Create the output folder and its subfolders (no error if they already exist)
for subfolder in ['shap', 'pdp', 'ice', 'psi', 'stability']:
    os.makedirs(os.path.join(output_folder, subfolder), exist_ok=True)

In [None]:
from scoring import doctools

documentation = doctools.ProjectParameters()

In [None]:
documentation.targets = [(col_target, col_base)]
documentation.time_variable = col_month
documentation.rowid_variable = col_id

documentation.sample_dict = {
    "HOOT": hoot_mask,
    "Train": train_mask,
    "Valid": valid_mask,
    "Test": test_mask,
    "OOT": oot_mask,
    "Observable": observable_mask,
}

## Predictor definition

Categorical predictors must have the *category* dtype, not *object*!

In [None]:
from scoring.data_manipulation import split_predictors_bytype

cols_pred, cols_pred_num, cols_pred_cat = split_predictors_bytype(data,
                                                                  pred_list=cols_pred,
                                                                  non_pred_list= [],
                                                                  optimize_types=True,
                                                                  convert_bool2int=True)

# Categorical variable encoding

## Dummy encoding

**By default**, the categorical variables are processed as dummies (i.e. each value of categorical variable is treated as a separate binary variable) in LightGBM. Unlike xgboost, LightGBM is able to do this by itself (if the type of the variable is properly set as `category`), so **no further steps from the user are needed**.

For some categorical variables (e.g. a variable with many distinct values, a variable with an ordinal business meaning, a variable whose default rate can be mapped to a continuous characteristic etc.) it is better to encode the variable as a numeric one.

## Mean target encoding

The basic type of encoding is mean target encoding. Each value of the variable is encoded as weighted average of the average default rate of observations having this particular value and the overall average default rate. More precisely, $MTE_C$ (mean target encoded value of category *C*) can be calculated as

$$
MTE_C = \frac{\sum_{i \in C}{w_i y_i} + \rho \sum_{j \in \{1,\ldots,n\}}{w_j y_j}}
             {\sum_{i \in C}{w_i} + \rho \sum_{j \in \{1,\ldots,n\}}{w_j}}
$$

where $y$ denotes the target, $w$ denotes the observation weight and $\rho$ is a **regularization parameter**. By changing this regularization parameter, you can control how close the *MTE* values are to the overall average. It is important to notice that this parameter is the same for small and large categories, which results in small categories being relatively closer to the overall average than large categories. This should help you to deal with **outliers**.
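
As a rough illustration of the formula above (not the `MeanTargetEncoder` implementation itself), the following sketch computes $MTE_C$ with NumPy. The function name `mean_target_encode` and the toy data are hypothetical.

```python
import numpy as np

def mean_target_encode(categories, target, weight, rho=0.05):
    """Return {category: MTE value} following the formula above."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    weight = np.asarray(weight, dtype=float)

    total_wy = np.sum(weight * target)  # sum of w_j * y_j over all observations
    total_w = np.sum(weight)            # sum of w_j over all observations

    encoding = {}
    for cat in np.unique(categories):
        in_cat = categories == cat
        cat_wy = np.sum(weight[in_cat] * target[in_cat])
        cat_w = np.sum(weight[in_cat])
        encoding[cat] = (cat_wy + rho * total_wy) / (cat_w + rho * total_w)
    return encoding

# Toy data: category A has four observations, category B only one
cats = ["A", "A", "A", "A", "B"]
y = [1, 0, 0, 0, 1]
w = [1, 1, 1, 1, 1]
enc = mean_target_encode(cats, y, w, rho=0.5)
```

With $\rho = 0.5$ in this toy data, the single-observation category `B` moves from its raw rate of 1.0 to about 0.57, while category `A` with four observations moves only from 0.25 to about 0.31.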

In [None]:
from scoring.variable_encoding import MeanTargetEncoder

mte = MeanTargetEncoder(
    regularization_parameter = 0.05,
    unknown_fill_value = 'mean',
)
cols_pred_mte = []

for predictor in cols_pred_cat:
    mte.fit(predictor=data[train_mask][predictor],
            target=data[train_mask][col_target],
            weight=data[train_mask][col_weight])
    data[predictor+'_MTE'] = mte.transform(data[predictor])
    cols_pred_mte.append(predictor+'_MTE')

In [None]:
data[cols_pred_mte].head()

## Other options

There are other options to encode the categorical variables:
- **Grouping and WOE transformation** - the same way as we use for logistic regression. This might be useful if you want to group multiple categories into one in some logical way - use interactive grouping from standard PSW for this. Like mean target encoding, WOE values also have the target encoded inside them, which is a useful property.
- **Ordinal transformation** - for variables that have a specific business meaning that can be translated into an order. For example, you can encode *EDUCATION* as `1 - elementary`, `2 - secondary`, `3 - bachelor`, `4 - postgrad` etc. This must be done manually based on your business knowledge.
- **Use metric instead of dimension** - you can also use something similar to mean target encoding but with the mean value of some metric (property) of that category. E.g. for variable *REGION*, you can encode each of its categories as the mean income in the region. This must be done manually based on your business knowledge.

*Example of manual encoding:*

In [None]:
encoding_dict = {'AAA': 6000,
                 'BBB': 2500,
                 'CCC': 2650,
                 'DDD': 1200,
                 'EEE': 9000,
                 'FFF': 1000,
                 'GGG': 5000,
                 np.nan: 0,
                }

data['Categorical_4_ENC'] = data['Categorical_4'].replace(encoding_dict)

# Feature preselection

## Treatment of numerical variables for the analyses

Several times during the workflow (namely for the PSI calculation and the stability charts) we will need to treat numerical predictors as categorical ones. For this reason, we create a fake binning: we bin each numerical predictor equifrequently into `bin_count` bins, keep the categorical predictors as they are and then create a fake "WOE" for them using the `Grouping` class known from the traditional Python Scoring Workflow.

This is just a trick to enable the parts of the workflow that expect certain categorizations of the predictors. It should not be considered a proper "WOE".

In [None]:
bin_count = 4

from scoring.features import fake_binning
from scoring.grouping import Grouping, NumpyJSONEncoder
from pandas.api.types import is_numeric_dtype

bin_dict = fake_binning(data[train_mask][cols_pred], bin_count = bin_count)

with open('fake_binning.json', 'w', encoding='utf-8') as file:
    json.dump(bin_dict, file, ensure_ascii=False, cls=NumpyJSONEncoder, indent=2)

stability_grouping = Grouping(
    columns = [column for column in data[cols_pred] if is_numeric_dtype(data[column])],
    cat_columns = [column for column in data[cols_pred] if not is_numeric_dtype(data[column])],
)
stability_grouping.load('fake_binning.json')
data_bins = stability_grouping.transform(data[cols_pred])
bin_columns_to_replace = list()
for column in data_bins.columns:
    if column in data:
        bin_columns_to_replace.append(column)
        print("Column", column, "dropped as it already existed in the data set.")
data = data.drop(bin_columns_to_replace, axis="columns")
data = data.join(data_bins)

In [None]:
fillna_value = 0

cols_pred_num_wo_nan = []
for column in cols_pred_num:
    if data[column].isnull().sum() > 0:
        new_name = column + '_WONAN'
        data[new_name] = data[column].fillna(fillna_value)
        cols_pred_num_wo_nan.append(new_name)
        print(f'Column {new_name} created where NaN values were filled by {fillna_value}.')
    else:
        cols_pred_num_wo_nan.append(column)

## Population stability index

**Population Stability Index (PSI)** is calculated for each predictor. This index quantifies how stable the distribution of the predictor's values is over time. More about PSI here: http://ucanalytics.com/blogs/population-stability-index-psi-banking-case-study/

For numerical predictors we use the fake grouping from the previous step, i.e. each numerical predictor is categorized into `bin_count` quantiles (and separate category for missings if applicable). Categorical predictors are kept as they are.

The function `psi_calc_df()` takes a data frame and a list of predictors. For each predictor it calculates a weighted PSI for every pair of consecutive months and returns the average of these values (e.g. for months 1, 2, 3, weighted PSIs are calculated for the pairs (1,2) and (2,3), and their average is returned).

It is important to notice that if for certain month a certain category is missing, this category is not taken into account by the PSI calculation.

It is reasonable to use only such predictors which have a "reasonable" PSI (i.e. PSI under certain threshold).
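
The calculation described above can be sketched as follows. This is a simplified illustration, not the code of `psi_calc_df()`: the helper names are hypothetical, the counts are unweighted, the average over month pairs is a plain mean, and the way missing categories are skipped (dropping them and renormalizing the shares) is an assumption.

```python
import numpy as np

def psi(expected_counts, actual_counts):
    """PSI between two category-count vectors given in the same category order."""
    expected = np.asarray(expected_counts, dtype=float)
    actual = np.asarray(actual_counts, dtype=float)
    # Assumption: categories missing in either period are skipped and shares renormalized
    present = (expected > 0) & (actual > 0)
    e = expected[present] / expected[present].sum()
    a = actual[present] / actual[present].sum()
    return float(np.sum((a - e) * np.log(a / e)))

def average_consecutive_psi(monthly_counts):
    """Plain (unweighted) mean of PSIs over all pairs of consecutive months."""
    months = sorted(monthly_counts)
    pairs = zip(months, months[1:])
    return float(np.mean([psi(monthly_counts[m1], monthly_counts[m2]) for m1, m2 in pairs]))

# Toy example: counts of two categories in months 1, 2 and 3
monthly_counts = {1: [50, 50], 2: [50, 50], 3: [40, 60]}
avg_psi = average_consecutive_psi(monthly_counts)
```

In the toy example the pair (1,2) has PSI 0 because the distributions are identical, so the average is half of the PSI of the pair (2,3).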

In [None]:
from scoring.stability_index import psi_calc_df

cols_pred_num_dis = [col_name + '_WOE' for col_name in cols_pred_num]

monthly_psi, masked_psi = psi_calc_df(data, cols_pred_psi=cols_pred_num_dis+cols_pred_cat, col_month=col_month)
display(monthly_psi)

In [None]:
psi_threshold = 0.25

print(f'Variables with PSI < {psi_threshold}:')
cols_selected_psi = [
    column[:-4] if column[-4:]=='_WOE'
    else column for column in list(monthly_psi[monthly_psi['PSI avg per month'] < psi_threshold]['Variable'])
]
print(cols_selected_psi)

## Hierarchical variable clustering

- Starts with each variable as a separate cluster
- Creates clusters based on highest average correlations
- The stopping criterion is either parameter `max_cluster_correlation` - once no correlation between clusters is larger than this parameter, the clustering is finished; or `max_clusters` - when this many clusters are created, the clustering is finished. If both are specified, the one that produces fewer clusters is used.

At the end we take only one representative (the most powerful one) from each cluster.
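
The clustering steps above can be sketched as follows. This is a simplified illustration, not the `CorrVarClus` implementation: the function name is hypothetical, absolute Pearson correlation is assumed as the similarity measure, and the choice of the most powerful representative is left out.

```python
import numpy as np

def cluster_by_correlation(X, max_correlation=0.75):
    """Greedily merge the two clusters with the highest average absolute correlation."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    clusters = [[i] for i in range(corr.shape[0])]  # start: one cluster per variable
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = corr[np.ix_(clusters[a], clusters[b])].mean()
                if avg > best:
                    best, best_pair = avg, (a, b)
        if best <= max_correlation:
            break  # no pair of clusters is correlated strongly enough
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy data: x1 is a noisy copy of x0, x2 is independent
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.1 * rng.normal(size=500)
x2 = rng.normal(size=500)
clusters = cluster_by_correlation(np.column_stack([x0, x1, x2]), max_correlation=0.75)
```

In the toy data the first two columns end up in one cluster and the independent third column stays on its own.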

In [None]:
from scoring import variable_clustering

clustering_correlation = variable_clustering.CorrVarClus(
    max_correlation=0.75,
    # max_clusters=9,
    standardize=True,
    sample_size=50000,
)

clustering_correlation.fit(data[train_mask][cols_pred_num_wo_nan+cols_pred_mte], data[train_mask][col_target])

In [None]:
clustering_correlation.draw()
clustering_correlation.display()

In [None]:
print("Best variables based on correlation clustering:")
cols_selected_corr = [
    column[:-6] if column[-6:]=='_WONAN' else
    column[:-4] if column[-4:]=='_MTE' else
    column for column in clustering_correlation.bestVariables()]
print(cols_selected_corr)

# Gradient boosting

This workflow contains a set of methods (functions) that are necessary to develop and fine-tune a gradient boosting model.

## Predictor list

In [None]:
cols_pred_num = [column for column in cols_pred_num if (column in cols_selected_psi) and (column in cols_selected_corr)]
cols_pred_cat = [column for column in cols_pred_cat if (column in cols_selected_psi) and (column in cols_selected_corr)]
cols_pred_mte = [column + '_MTE' for column in cols_pred_cat]

cols_pred_boosting = cols_pred_num + cols_pred_cat
# cols_pred_boosting = cols_pred_num + cols_pred_mte

## Monotone constraints

The business meaning of certain variables (typically mean target encoded or WOE encoded, but also other variables where it makes sense) might imply that the dependence of the target on those variables is monotonic. Gradient boosting algorithms allow us to enforce the monotonicity condition for these variables, so the splits in the trees inside the gradient boosting are made in such a way that the monotonicity is not broken.

In [None]:
monotone_constraints_dict = {predictor: 0 for predictor in cols_pred_boosting}
monotone_constraints_dict['AGE'] = -1
monotone_constraints_dict['Numerical_1'] = -1

In [None]:
# monotone_constraints_str = '('
# for predictor in cols_pred_boosting:
#     monotone_constraints_str += monotone_constraints_dict[predictor].astype(int).astype(str)+','
# monotone_constraints_str = monotone_constraints_str[:-1]+')'
# print(monotone_constraints_str)

monotone_constraints_tup = tuple(monotone_constraints_dict[predictor] for predictor in cols_pred_boosting)
print(monotone_constraints_tup)

## Hyperparameters

### Default values

In [None]:
default_params = {
        'learning_rate':0.05,
        'num_leaves':100,
        'colsample_bytree':0.75,
        'subsample':0.75,
        'subsample_freq':1,
        'max_depth':3,
        'min_split_gain':0.0,
        'max_delta_step':0.0,
        'max_bin':20,
        'metric':'auc',
        'objective':'binary',
        'early_stopping_rounds':100,
        'num_boost_round':100000,
        'seed':1234,
        'monotone_constraints':monotone_constraints_tup,
        'verbose':1,
        'n_jobs':6,
}

In [None]:
params = default_params.copy()  # copy, so that later updates do not modify default_params

### Hyperparameter tuning
Method `param_hyperopt()` is based on maximization of 3-fold cross-validation AUC.  
The output is a dictionary of optimized hyperparameters that can be pasted into `params` before the model is initialized.

There is an optional parameter `space` where you can insert your own space of possible hyperparameters to be searched. If this parameter is kept as `None`, a default space is used (see the source code if you want to review the default space).

In [None]:
%%capture --no-display

from importlib import reload
from scoring import lgbm 
lgbm=reload(lgbm)

model_lgb = lgbm.LGBM_model(cols_pred_boosting,
                            params,
                            use_CV=False,
                            CV_folds=3,
                            CV_seed=9876)

In [None]:
best_params = model_lgb.param_hyperopt(
        data[train_mask],
        data[valid_mask],
        data[train_mask][col_target],
        data[valid_mask][col_target],
        n_iter = 2,
        space = None
)

In [None]:
params = default_params.copy()  # copy, so that default_params keeps the original values

for par in best_params:
    params[par] = best_params[par]
    
print(params)

In [None]:
model_lgb.params = params

## Feature selection

First we compute the SHAP values of each variable (more about them later in this notebook), then we add the variables to the model one by one (from the highest absolute SHAP value to the lowest) and observe how the predictive power of the model changes.
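
The selection logic can be sketched as follows. This is a minimal illustration, not the code of `boost_feature_selection()`: the function `forward_selection`, the `evaluate` callable and the toy scores are hypothetical. In practice `evaluate` would train a model on the given features and return its cross-validated AUC.

```python
def forward_selection(ranked_features, evaluate):
    """Score the first k features for k = 1..n and keep the best-scoring prefix."""
    scores = [evaluate(ranked_features[:k]) for k in range(1, len(ranked_features) + 1)]
    best_k = scores.index(max(scores)) + 1  # first maximum wins
    return scores, ranked_features[:best_k]

# Toy stand-in for model training: each feature adds a fixed amount to the AUC
toy_gain = {"AGE": 0.25, "INCOME": 0.125, "NOISE": -0.0625}

def toy_evaluate(features):
    return 0.5 + sum(toy_gain[f] for f in features)

scores, selected = forward_selection(["AGE", "INCOME", "NOISE"], toy_evaluate)
```

Here the third feature lowers the score, so only the first two features are kept.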

In [None]:
var_imp_shap = model_lgb.print_shap_values(cols_pred_num, cols_pred_cat, data[train_mask], data[valid_mask], data[train_mask][col_target], data[valid_mask][col_target],data[test_mask])

In [None]:
from scoring.feature_selection import get_shap_feature_importance, boost_feature_selection, plot_feature_selection

fe_params = params.copy()
if 'monotone_constraints' in fe_params: del fe_params['monotone_constraints']
if 'num_boost_round' in fe_params: del fe_params['num_boost_round']
if 'early_stopping_rounds' in fe_params: del fe_params['early_stopping_rounds']

fi_columns = get_shap_feature_importance(
    names_columns = cols_pred_boosting,
    shap_values = model_lgb.shap_values,
)

boost_aucs = boost_feature_selection(
    params = fe_params,
    df = data,
    col_target = col_target,
    base_columns = [],
    fi_columns = fi_columns,
    train_mask = train_mask, 
    test_mask = valid_mask,
    n_seed = 3, 
    n_fold = 5,
    step = 1,
    boost = 'lgb',
)

In [None]:
boost_aucs

In [None]:
plot_feature_selection(fi_columns, boost_aucs)

In [None]:
auc_df = pd.DataFrame([[c[0], a] for c, a in zip(fi_columns, boost_aucs['test-auc-mean'])])
auc_df.columns=['Feature','Test-AUC']
maximizing = auc_df['Test-AUC'].argmax()+1

print('Columns selected by selection algorithm:')
cols_selected_auc = list(auc_df.iloc[:maximizing]['Feature'])
print(cols_selected_auc)

## Fitting the model

In [None]:
cols_pred_num = [column for column in cols_pred_num if (column in cols_selected_auc)]
cols_pred_cat = [column for column in cols_pred_cat if (column in cols_selected_auc)]
# cols_pred_mte = [column for column in cols_pred_mte if (column in cols_selected_auc)]

cols_pred_boosting_final = cols_pred_num + cols_pred_cat
# cols_pred_boosting_final = cols_pred_num + cols_pred_mte

monotone_constraints_tup_final = tuple(monotone_constraints_dict[predictor] for predictor in cols_pred_boosting_final)
params['monotone_constraints'] = monotone_constraints_tup_final

model_lgb.params = params
model_lgb.cols_pred = cols_pred_boosting_final

Output: list of LightGBM boosters (models)

In [None]:
model1 = model_lgb.fit_model(
    data[train_mask],
    data[valid_mask],
    data[train_mask][col_target],
    data[valid_mask][col_target]
)

In [None]:
model_lgb.show_progress()

## Prediction

If CV is used, the predictions are the average of the predictions from the individual CV models.
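
A minimal sketch of this averaging, together with the Gini coefficient computed as $2 \cdot AUC - 1$, is shown here. The helper names are hypothetical and the pairwise AUC is written for clarity, not for speed on large data.

```python
import numpy as np

def average_predictions(per_model_predictions):
    """Average the predictions of several models, observation by observation."""
    return np.mean(np.asarray(per_model_predictions, dtype=float), axis=0)

def gini(y_true, y_score):
    """Gini = 2 * AUC - 1, with AUC computed from all (positive, negative) pairs."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    return 2.0 * auc - 1.0

# Two hypothetical CV models scoring the same four observations
preds = average_predictions([[0.0, 0.5, 0.25, 0.75],
                             [0.25, 0.5, 0.5, 1.0]])
g = gini([0, 0, 1, 1], preds)
```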

In [None]:
from sklearn.metrics import roc_auc_score

predictions = model_lgb.predict(model1, data[test_mask])
print(2 * roc_auc_score(data[test_mask][col_target], predictions) - 1)

# Interpretation

## Variable gain

The loss of a single decision tree $T$ is defined as $L(T) = \sum_{i=1}^{n_T}{L_i}$ where $L_i$ are the losses in the leaves of tree $T$. When a new split is made (and leaf $i$ is split into two new leaves $i1$ and $i2$), the loss of tree $T$ decreases by $Gain_i = L_i - L_{i1} - L_{i2}$ - this is the gain of the new split.

The gain of a variable $V$ in a gradient boosting model is the sum of the loss function gains caused by the splits in the individual decision trees that use that particular variable, i.e. $Gain(V) = \sum_{\forall T} \sum_{\forall i:i\,\mathrm{uses}\,V} Gain_i$.
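
The double sum can be illustrated with a small sketch. The representation of a tree as a list of `(variable, gain)` pairs, one per split, is hypothetical and chosen only for the example. It is not how LightGBM stores its trees.

```python
from collections import defaultdict

def variable_gain(trees):
    """Sum the split gains per variable over all trees."""
    totals = defaultdict(float)
    for tree in trees:
        for feature, gain in tree:
            totals[feature] += gain
    return dict(totals)

# Two toy trees: the first has two splits, the second has one
trees = [
    [("AGE", 2.0), ("INCOME", 1.5)],
    [("AGE", 0.5)],
]
gains = variable_gain(trees)
```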

Output: DataFrame with features and chosen importance

If CV is used, the variable importance is computed as the average variable importance over the individual CV models.

In [None]:
var_imp=model_lgb.plot_imp(model1, 'importance_gain', ret=True, show= True, n_predictors=25)

## SHAP values

### Shap values for each variable

Shapley values show the impact of each variable on the prediction. The computationally efficient library SHAP is used to calculate them for us.

In the first SHAP chart below, each row shows the impact of a single variable. There are many dots in each row and each of the dots represents a single observation. The x-axis shows the impact on the model output. As the baseline for each observation we take the prediction where the variable is replaced by its expected value. When the variable is added to the model, the **prediction for each observation changes. This change is shown as the position of the dot on the x-axis**. The value of the variable itself is color-coded (the scale is calibrated separately for each variable). The "thickness" of the dot clusters in the charts shows how many observations have that specific value.

In the second chart, each variable is represented by a bar. This bar is the average of the absolute values of the SHAP values from the first chart. It shows how important each variable is by telling us how much impact it has on the final prediction.

More theoretical background for Shapley values can be found here https://christophm.github.io/interpretable-ml-book/shapley.html

Output: DataFrame with the features and their mean absolute SHAP values, which corresponds to the second chart
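
The aggregation behind the bar chart can be sketched as follows, assuming a matrix of SHAP values with one row per observation and one column per variable. The function name and the toy values are hypothetical.

```python
import numpy as np

def mean_abs_shap(shap_values, feature_names):
    """Rank features by the mean absolute SHAP value over all observations."""
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(-importance)  # most important feature first
    return [(feature_names[i], float(importance[i])) for i in order]

# Toy SHAP matrix: 2 observations, 2 features
shap_matrix = [[0.5, -0.1],
               [-1.5, 0.3]]
ranking = mean_abs_shap(shap_matrix, ["AGE", "INCOME"])
```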


In [None]:
var_imp_shap = model_lgb.print_shap_values(cols_pred_num, cols_pred_cat, data[train_mask], data[valid_mask], data[train_mask][col_target], data[valid_mask][col_target],data[test_mask])

### Shap interaction matrix

Prints the SHAP interaction matrix, based on https://christophm.github.io/interpretable-ml-book/shap.html#shap-interaction-value.
It prints the sum of the absolute interaction values over all observations.
Diagonal values are manually set to zero.
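
A minimal sketch of this aggregation, assuming the interaction values come as an array of shape `(observations, features, features)`. The function name and the toy values are hypothetical.

```python
import numpy as np

def interaction_matrix(interaction_values):
    """Sum absolute interaction values over observations and zero the diagonal."""
    matrix = np.abs(np.asarray(interaction_values, dtype=float)).sum(axis=0)
    np.fill_diagonal(matrix, 0.0)  # main effects are not interactions
    return matrix

# Toy interaction values: 2 observations, 2 features
values = [
    [[1.0, -2.0], [-2.0, 3.0]],
    [[0.5, 1.0], [1.0, -1.0]],
]
matrix = interaction_matrix(values)
```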


In [None]:
model_lgb.print_shap_interaction_matrix()

### Shap dependence plot
Note: If y (second feature) is not specified, it is found automatically.

In [None]:
model_lgb.shap_dependence_plot(x='Numerical_1',y=None)

In [None]:
model_lgb.shap_dependence_plot(x='Numerical_1',y='Categorical_1')

### Shap force plot for one observation
Use this if you are curious why a particular observation received its prediction.  
Note: values in the upper chart are in log-odds, values in the lower chart are in probabilities.

In [None]:
model_lgb.shap_one_row(0)

### Marginal contribution
The features are removed from model training one by one and the performance on the test data is computed each time.  
The output is a dataframe with 4 columns - feature, Gini with the feature, Gini without the feature, and the difference between the two.
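
The leave-one-feature-out loop can be sketched as follows. This is not the code of `marginal_contribution()`: the `fit_and_score` callable is hypothetical and stands for training a model on the given features and returning its Gini on the test data.

```python
def marginal_contribution(features, fit_and_score):
    """Return (feature, gini_with, gini_without, difference) for every feature."""
    gini_with = fit_and_score(list(features))
    rows = []
    for feature in features:
        reduced = [f for f in features if f != feature]
        gini_without = fit_and_score(reduced)
        rows.append((feature, gini_with, gini_without, gini_with - gini_without))
    return rows

# Toy stand-in for model training: each feature contributes a fixed amount of Gini
toy_power = {"AGE": 0.25, "INCOME": 0.5}

def toy_fit_and_score(features):
    return sum(toy_power[f] for f in features)

mc_rows = marginal_contribution(["AGE", "INCOME"], toy_fit_and_score)
```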

In [None]:
mc = model_lgb.marginal_contribution(
        data[train_mask],
        data[valid_mask],
        data[train_mask][col_target],
        data[valid_mask][col_target],
        data[test_mask],
        data[test_mask][col_target]
)

## Partial Dependency Plots and Accumulated Local Effects plot

**PDP (Partial Dependency Plots)** show the overall trend of the model output (prediction) in relation to one particular predictor. We calculate PDPs across the values of each predictor.

First, we group the predictor values into several bins (corresponding to the splits inside the decision trees). Then for each observation (more precisely, for a reasonably sized random subsample) we calculate the model output in the **hypothetical situation when the predictor would change its value to be in the particular bin** and all the other variables' values would remain the same. The average of these values over all observations for each particular bin is the *mean Partial Dependency value*. When these values are plotted with the bins on the x-axis, the PDP plot is formed. This plot shows how the mean of the prediction changes when the variable changes (and all other variables remain the same).

We don't calculate just *mean Partial Dependency* but also its quantiles and median.

PDP makes sense also for categorical variables as we can easily calculate these values also for each particular category of a categorical variable.

More about PDP: https://christophm.github.io/interpretable-ml-book/pdp.html
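
The PDP calculation can be sketched as follows. This is a simplified illustration, not the `PartialDependencePlotCalculator` code: the function name, the `grid` of values (standing in for the bins) and the toy model are hypothetical, and no subsampling or quantiles other than the median are included.

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """For each grid value, set the feature to it for all rows and average the predictions."""
    X = np.asarray(X, dtype=float)
    table = []
    for value in grid:
        modified = X.copy()
        modified[:, feature_idx] = value  # all other variables stay unchanged
        preds = predict(modified)
        table.append((value, float(np.mean(preds)), float(np.median(preds))))
    return table

def toy_model(X):
    # Hypothetical model: prediction = 2 * x0 + x1
    return 2.0 * X[:, 0] + X[:, 1]

X = [[0.0, 1.0], [1.0, 3.0]]
pdp_table = partial_dependence(toy_model, X, feature_idx=0, grid=[0.0, 1.0])
```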

**ALE (Accumulated Local Effects)** are very similar to PDP; however, for each bin we don't calculate the average prediction but the average difference of predictions when moving from one bin to the next one. Then we accumulate these differences, which forms the ALE plot.

This plot does not make sense for unordered categorical features, so it is missing in charts for categorical features.

More about ALE: https://christophm.github.io/interpretable-ml-book/ale.html
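
A minimal ALE sketch under simplifying assumptions: the bin `edges` are supplied by the user, each observation contributes only to the bin it falls into, and the final centering step of the usual ALE definition is omitted (the curve starts at 0). The function name and the toy model are hypothetical.

```python
import numpy as np

def accumulated_local_effects(predict, X, feature_idx, edges):
    """Uncentered ALE values at the bin edges for one numerical feature."""
    X = np.asarray(X, dtype=float)
    edges = np.asarray(edges, dtype=float)
    x = X[:, feature_idx]
    # Bin k holds edges[k] <= x < edges[k+1]; values outside go to the nearest bin
    bin_idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    local_effects = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        rows = X[bin_idx == k]
        if len(rows) == 0:
            continue  # empty bin: no local effect
        lower, upper = rows.copy(), rows.copy()
        lower[:, feature_idx] = edges[k]
        upper[:, feature_idx] = edges[k + 1]
        local_effects[k] = np.mean(predict(upper) - predict(lower))
    return np.concatenate([[0.0], np.cumsum(local_effects)])

def toy_model(X):
    # Hypothetical model: prediction = 2 * x0 + x1
    return 2.0 * X[:, 0] + X[:, 1]

X = [[0.5, 1.0], [1.5, 3.0]]
ale = accumulated_local_effects(toy_model, X, feature_idx=0, edges=[0.0, 1.0, 2.0])
```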

In [None]:
documentation.model = (model1[0],'LGBM',model1[0].feature_name())

In [None]:
from scoring.doctools import PartialDependencePlotCalculator

pdp = PartialDependencePlotCalculator(documentation)

for pred in model1[0].feature_name():
    print(pred)
    pdp_pred = pdp.s([(data[test_mask],'test')]).p([pred]).calculate()
    pdp_pred.get_visualization(output_folder=output_folder+'/pdp')
    display(pdp_pred.get_table())

## Individual Conditional Expectation plots

**Individual Conditional Expectations** are actually "dismantled" PDPs. For each observation, we show how the prediction would change if one particular variable changed its value (and all the other variables remained the same) and we draw all these lines into one chart (in our case there are lines for 250 randomly chosen observations). The mean PDP is also shown in the chart as a thick line.
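
The ICE curves can be sketched as follows. This is an illustration, not the `IceplotRuCalculator` code: the function name, the `grid` of values and the toy model are hypothetical.

```python
import numpy as np

def ice_curves(predict, X, feature_idx, grid, n_lines=250, seed=0):
    """Return one prediction curve per sampled observation plus their mean (the PDP)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = min(n_lines, len(X))
    sample = X[rng.choice(len(X), size=n, replace=False)]
    curves = np.empty((n, len(grid)))
    for g, value in enumerate(grid):
        modified = sample.copy()
        modified[:, feature_idx] = value  # all other variables stay unchanged
        curves[:, g] = predict(modified)
    return curves, curves.mean(axis=0)

def toy_model(X):
    # Hypothetical model: prediction = 2 * x0 + x1
    return 2.0 * X[:, 0] + X[:, 1]

X = [[0.0, 1.0], [1.0, 3.0]]
curves, mean_curve = ice_curves(toy_model, X, feature_idx=0, grid=[0.0, 1.0])
```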

In [None]:
from scoring.doctools import IceplotRuCalculator

ice = IceplotRuCalculator(documentation)

for pred in model1[0].feature_name():
    print(pred)
    ice_pred = ice.s([(data[test_mask],'test')]).p([pred]).calculate()
    ice_pred.get_visualization(output_folder=output_folder+'/ice')

# Predictor stability

## Stability charts

The stability charts show the following for each predictor:
- share of each category (using the fake Grouping from the beginning of the workflow) in time
- bad rate of each category in time

In [None]:
cols_bins = [col+'_WOE' for col in model1[0].feature_name()]

In [None]:
for col in cols_bins:
    documentation.GroupedEvaluation(
        data,
        predictor=col,
        sample="Observable",
        target=col_target,
        weight=col_weight,
        grouping=stability_grouping,
        show_gini=False, # must be False if fake grouping without real WOEs is used
        output_folder=output_folder + "/stability",
    )

## PSI charts

These charts compare distribution of values of the predictor in data (parameter `data`) in each month (identified by `col_month`) with distribution on a reference set (by default we set `data[train_mask]`).

Categorical predictors are left as they are, numerical are automatically binned to deciles (or user defined `q` quantiles).

The stability is quantified separately for each month and drawn into a chart. The quantifiers of stability are:

- **PSI (Population Stability Index)**: http://ucanalytics.com/blogs/population-stability-index-psi-banking-case-study/
- **Bhattacharyya distance**: https://en.wikipedia.org/wiki/Bhattacharyya_distance
- **Jensen-Shannon distance**: https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
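
For orientation, the last two quantifiers can be sketched on category counts as follows. This is a hedged illustration of the textbook definitions (natural logarithm), not the code of `psi_in_time()`. The function names and toy counts are hypothetical.

```python
import numpy as np

def _normalize(counts):
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def bhattacharyya_distance(expected_counts, actual_counts):
    """-ln of the Bhattacharyya coefficient sum(sqrt(p_i * q_i))."""
    p, q = _normalize(expected_counts), _normalize(actual_counts)
    return float(-np.log(np.sum(np.sqrt(p * q))))

def jensen_shannon_distance(expected_counts, actual_counts):
    """Square root of the Jensen-Shannon divergence."""
    p, q = _normalize(expected_counts), _normalize(actual_counts)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a_i = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

# Toy counts of three categories in the reference set and in one month
reference = [50, 30, 20]
month = [40, 35, 25]
bd = bhattacharyya_distance(reference, month)
js = jensen_shannon_distance(reference, month)
```

Both quantities are 0 for identical distributions and grow as the monthly distribution drifts away from the reference.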

In [None]:
from scoring.PSI import psi_in_time

for pred in model1[0].feature_name():
    psi_table = psi_in_time(
        data = data,
        data_expected = data[train_mask],
        pred = pred,
        col_month = col_month,
        q = 10,
        output_folder = output_folder +'/psi'
    )
    display(psi_table)

# Creation of prediction

In [None]:
col_score = 'SCORE_LGB'

data[col_score] = model_lgb.predict(model1, data)

if col_score not in documentation.scores:
    documentation.scores.append(col_score)

print("Column", col_score, "with the prediction added/modified.")

# Create equivalent XGBoost model

Training XGBoost model using the same parameters we used for LightGBM.

Only applicable to numerical data. If there are any categorical variables in the dataset, they must be converted to numerical or dummy variables by the user first.

If you want to have the XGBoost as the final output of this workflow we strongly recommend to start with the converted variables from the beginning and also use all the metrics from *Interpretation* chapter on the XGBoost model to interpret the model that you'll be actually deploying.

## Hyperparameter translation

In [None]:
print(params)

In [None]:
params_lgbm = params

cols_pred_xgb = cols_pred_num
# Look the constraints up by predictor name so they stay aligned with cols_pred_xgb
xgb_monotone_constraints = str(tuple(
    monotone_constraints_dict[col] for col in cols_pred_xgb
))

xgb_early_stopping_rounds = params_lgbm['early_stopping_rounds']
xgb_num_boost_round = params_lgbm['num_boost_round']

params_xgb = {
    'n_estimators': params_lgbm['num_boost_round'],
    'learning_rate': params_lgbm['learning_rate'], 
    'max_depth' : params_lgbm['max_depth'],
    'min_child_weight' : params_lgbm['min_child_weight'], 
    'max_delta_step' : params_lgbm['max_delta_step'],
    'gamma' : params_lgbm['min_split_gain'],
    'reg_alpha' : params_lgbm['reg_alpha'], 
    'reg_lambda' : params_lgbm['reg_lambda'], 
    'subsample' : params_lgbm['subsample'],   
    'colsample_bytree' : params_lgbm['colsample_bytree'],      
    'seed' : params_lgbm['seed'],
    'scale_pos_weight' : 1,
    'tree_method' : 'hist',
    'grow_policy' : 'lossguide',
    'verbosity' : 0,  # replaces the deprecated 'silent' parameter
    'booster' : 'gbtree',
    'n_jobs' : params_lgbm['n_jobs'],
    'monotone_constraints' : xgb_monotone_constraints,
}

if params_lgbm['objective'] == 'binary':
    params_xgb['objective'] = 'binary:logistic'
    if params_lgbm['metric'] == 'auc':
        params_xgb['eval_metric'] = 'auc'
    else: 
        params_xgb['eval_metric'] = 'logloss'


if params_lgbm['objective'] == 'regression':
    params_xgb['objective'] = 'reg:squarederror'
    params_xgb['eval_metric'] = 'rmse'

if params_lgbm['objective'] == 'multiclass':
    params_xgb['objective'] = 'multi:softprob'  # assignment, not comparison
    params_xgb['eval_metric'] = 'mlogloss'
    params_xgb['num_class'] = params_lgbm['num_class']

In [None]:
import xgboost as xgb

xgbooster = xgb.train(params = params_xgb,
                      dtrain = xgb.DMatrix(data[train_mask][cols_pred_xgb],data[train_mask][col_target]),
                      evals = ((xgb.DMatrix(data[train_mask][cols_pred_xgb], data[train_mask][col_target]), 'train'),
                               (xgb.DMatrix(data[valid_mask][cols_pred_xgb], data[valid_mask][col_target]), 'test'),
                              ),
                      num_boost_round = xgb_num_boost_round,
                      early_stopping_rounds = xgb_early_stopping_rounds,)

# Model export

## Export from LightGBM

### Native JSON

In [None]:
import json

jsonrep = model1[0].dump_model()

with open('model.json', 'w') as outfile:
    json.dump(jsonrep, outfile)

### Pickle

In [None]:
with open("model.pkl", "wb") as file:
    pickle.dump(model1[0], file)

### Native TXT

In [None]:
model1[0].save_model('model.txt')

## Export from XGBoost

To be able to implement the model in Blaze advisor, we need to develop model in XGBoost as we currently can use only native XGBoost format in the Blaze tools. Refer to chapter *Create equivalent XGBoost model* above to create such model.

### Native JSON

In [None]:
xgbooster.dump_model('modelX.json', dump_format='json')

### Pickle

In [None]:
with open("modelX.pkl", "wb") as file:
    pickle.dump(xgbooster, file)

### Native TXT (for Blaze tools)

In [None]:
xgbooster.dump_model('modelX.txt', dump_format='text')

### Export to SQL

In [None]:
from scoring.boosting import xgb2sql    
xgb2sql('modelX.txt', 'modelX.sql')

### Export to Blaze code

In [None]:
from scoring.boosting import xgb2blz    
xgb2blz('modelX.txt', 'modelXblz.txt')

# Export data to PSW

To evaluate the power of the model, import the data to the *Performance characteristics* of the PSW. In this part we prepare the data and metadata to be loaded inside the PSW.

The time, day, month, target, base, weight and score columns will be exported in the data. If there are other columns that need to be analyzed inside PSW, please add them to the list `other_columns_to_be_kept`. Typically these might be an old score, a short target and its base etc.

In [None]:
other_columns_to_be_kept = []

In [None]:
metadata["col_score"] = col_score

with open("metadata_gb_wfl.json", "w", encoding="utf8") as file:
    json.dump(metadata, file, indent=4)

default_columns_to_be_kept = [col_time, col_month, col_day, col_target, col_base, col_weight, col_reject, col_datatype, col_id, col_score]

data[default_columns_to_be_kept + other_columns_to_be_kept].to_csv('data_from_gb_wfl.csv')

with open("documentation_gb_wfl.pkl", "wb") as file:
    pickle.dump(documentation, file)