# Tutorial 2: AutoWoE (WhiteBox model for binary classification on tabular data)

## Scorecard

![WB0](../../imgs/tutorial_whitebox_report_1.png)

## Linear model

![WB1](../../imgs/tutorial_whitebox_report_2.png)

## Discretization

![WB2](../../imgs/tutorial_whitebox_report_3.png)

## Selection and One-dimensional analysis

![WB3](../../imgs/tutorial_whitebox_report_4.png)

## Whitebox pipeline:

### General parameters

0. Technical

    - n_jobs
    - debug


1. Simple features typing and initial cleaning

    1.1. Remove trash features

        Medium:
            - th_nan 
            - th_const 

    1.2. Typing (auto or user defined)
        
        Critical:
            - features_type (dict) {'age': 'real', 'education': 'cat', 'birth_date': (None, ("d", "wd")), ...}

    1.3. Categories and datetimes encoding

        Critical:
            - features_type (for datetimes)

        Optional:
            - cat_alpha (int) - greater means more conservative encoding


2. Pre selection (based on BlackBox model importances)

    - Critical:
        - select_type (None or int)
        - imp_type ('perm_imp' or 'feature_imp'; used if select_type is an int)

    - Optional:
        - imp_th (float) - importance threshold, used when select_type is None


3. Binning (discretization)

    - Critical:
        - monotonic / features_monotone_constraints
        - max_bin_count / max_bin_count
        - min_bin_size
        - cat_merge_to
        - nan_merge_to

    - Medium:
        - force_single_split

    - Optional:
        - min_bin_mults
        - min_gains_to_split


4. WoE estimation: WoE = ln( ((% 0 in bin) / (% 0 in sample)) / ((% 1 in bin) / (% 1 in sample)) )

    - Critical:
        - oof_woe

    - Optional:
        - woe_diff_th
        - n_folds (if oof_woe)


5. 2nd selection stage:

    5.1. One-dimensional importance

        Critical:
            - auc_th

    5.2. VIF

        Critical:
            - vif_th

    5.3. Partial correlations

        Critical:
            - pearson_th


6. 3rd selection stage (model based)

    - Optional:
        - n_folds
        - l1_base_step
        - l1_exp_step

    - Do not touch:
        - population_size
        - feature_groups_count


7. Fitting the final model

    - Critical:
        - regularized_refit
        - p_val (if not regularized_refit)
        - validation (if not regularized_refit)

    - Optional:
        - interpreted_model
        - l1_base_step (if regularized_refit)
        - l1_exp_step (if regularized_refit)

8. Report generation

    - report_params
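
The WoE formula from step 4 can be illustrated with a few lines of plain Python. This is a minimal sketch, not AutoWoE's implementation: `woe_per_bin` is a hypothetical helper, the bins are assumed to be already assigned, and no smoothing or out-of-fold estimation (`oof_woe`) is applied.

```python
import math
from collections import Counter


def woe_per_bin(bins, targets):
    """Return {bin_label: WoE} using the formula from step 4, taken literally."""
    n = len(targets)
    n1 = sum(targets)          # number of 1s in the sample
    n0 = n - n1                # number of 0s in the sample
    bin_size = Counter(bins)
    bin_ones = Counter(b for b, t in zip(bins, targets) if t == 1)

    woe = {}
    for label, size in bin_size.items():
        pct0_bin = (size - bin_ones[label]) / size   # % 0 in bin
        pct1_bin = bin_ones[label] / size            # % 1 in bin
        # No smoothing here: a bin with only 0s or only 1s raises an error.
        woe[label] = math.log((pct0_bin / (n0 / n)) / (pct1_bin / (n1 / n)))
    return woe


# Toy data (made up for illustration): bin "a" holds 3 zeros and 1 one,
# bin "b" holds 2 zeros and 4 ones.
bins = ["a"] * 4 + ["b"] * 6
targets = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
woe = woe_per_bin(bins, targets)
print(woe)
```

With this sign convention a positive WoE marks a bin where 0s are over-represented, which is consistent with the negative coefficients the fitted models print during feature selection.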

### Imports

In [1]:
import pandas as pd
from pandas import Series, DataFrame

import numpy as np

import os
import requests
import joblib

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from autowoe import AutoWoE, ReportDeco

### Reading the data and train/test split

In [2]:
DATASET_DIR = '../data/'
DATASET_NAME = 'jobs_train.csv'
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = 'https://raw.githubusercontent.com/sberbank-ai-lab/LightAutoML/master/examples/data/jobs_train.csv'

In [3]:
%%time

if not os.path.exists(DATASET_FULLNAME):
    os.makedirs(DATASET_DIR, exist_ok=True)

    dataset = requests.get(DATASET_URL).text
    with open(DATASET_FULLNAME, 'w') as output:
        output.write(dataset)

CPU times: user 14 µs, sys: 12 µs, total: 26 µs
Wall time: 62 µs


In [2]:
data = pd.read_csv(DATASET_FULLNAME)

In [3]:
data

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,21.0,,,1.0,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15.0,99.0,Pvt Ltd,5.0,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5.0,,,0.0,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,0.0,,Pvt Ltd,0.0,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,21.0,99.0,Funded Startup,4.0,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14.0,,,1.0,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14.0,,,4.0,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,21.0,99.0,Pvt Ltd,4.0,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,0.0,999.0,Pvt Ltd,2.0,97,0.0


In [4]:
train, test = train_test_split(data.drop('enrollee_id', axis=1), test_size=0.2, stratify=data['target'])

### AutoWoE: default settings

In [5]:
auto_woe_0 = AutoWoE(interpreted_model=True,
                     monotonic=False,
                     max_bin_count=5,
                     select_type=None,
                     pearson_th=0.9,
                     auc_th=.505,
                     vif_th=10.,
                     imp_th=0,
                     th_const=32,
                     force_single_split=True,
                     th_nan=0.01,
                     th_cat=0.005,
                     auc_tol=1e-4,
                     cat_alpha=100,
                     cat_merge_to="to_woe_0",
                     nan_merge_to="to_woe_0",
                     imp_type="feature_imp",
                     regularized_refit=False,
                     p_val=0.05,
                     verbose=2
        )

auto_woe_0 = ReportDeco(auto_woe_0, )

In [6]:
auto_woe_0.fit(train,
               target_name="target",
              )

city processing...
city_development_index processing...
gender processing...
relevent_experience processing...
enrolled_university processing...
education_level processing...
experience processing...
company_size processing...
company_type processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'company_type', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
city_development_index   -0.974107
company_size             -0.795953
company_type             -0.400146
experience               -0.184238
enrolled_university      -0.251287
education_level          -1.188926
dtype: float64


In [7]:
test_prediction = auto_woe_0.predict_proba(test)
test_prediction

array([0.06265852, 0.56483877, 0.04151965, ..., 0.15191705, 0.08528486,
       0.0409943 ])

In [8]:
roc_auc_score(test['target'].values, test_prediction)

0.8034365349304012
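
The same metric can also be computed for a single feature, which is the idea behind a one-dimensional importance filter such as stage 5.1 (`auc_th`). The sketch below is plain Python and is not AutoWoE's code: `roc_auc` and `select_by_auc` are hypothetical helpers, and treating AUC symmetrically around 0.5 is an assumption made for illustration.

```python
def roc_auc(scores, targets):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the sample size, so it
    is only suitable for small illustrations.
    """
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def select_by_auc(features, targets, auc_th=0.505):
    """Keep features whose univariate AUC (in either direction) reaches auc_th."""
    kept = {}
    for name, values in features.items():
        auc = roc_auc(values, targets)
        auc = max(auc, 1 - auc)  # assumption: the direction of the link is irrelevant
        if auc >= auc_th:
            kept[name] = auc
    return kept


# Toy data, made up for illustration.
targets = [0, 0, 1, 1, 0, 1]
features = {
    "strong": [0.1, 0.2, 0.8, 0.9, 0.3, 0.7],   # separates the classes perfectly
    "inverse": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],  # perfect, but in the opposite direction
    "constant": [1, 1, 1, 1, 1, 1],             # carries no information, AUC = 0.5
}
kept = select_by_auc(features, targets)
print(kept)
```

A threshold only slightly above 0.5, like the `auc_th=.505` used here, removes features that are indistinguishable from noise while keeping weak but usable ones.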

In [9]:
report_params = {"output_path": "HR_REPORT_1", # folder where the report and its supporting files are written
                 "report_name": "WHITEBOX REPORT",
                 "report_version_id": 1,
                 "city": "Moscow",
                 "model_aim": "Predict if candidate will work for the company",
                 "model_name": "HR model",
                 "zakazchik": "Kaggle",
                 "high_level_department": "Ai Lab",
                 "ds_name": "Btbpanda",
                 "target_descr": "Candidate will work for the company",
                 "non_target_descr": "Candidate will not work for the company"}

auto_woe_0.generate_report(report_params, )

No handles with labels found to put in legend.


### AutoWoE - simpler model

In [10]:
auto_woe_1 = AutoWoE(interpreted_model=True,
                     monotonic=True,
                     max_bin_count=4,
                     select_type=None,
                     pearson_th=0.9,
                     auc_th=.505,
                     vif_th=10.,
                     imp_th=0,
                     th_const=32,
                     force_single_split=True,
                     th_nan=0.01,
                     th_cat=0.005,
                     auc_tol=1e-4,
                     cat_alpha=100,
                     cat_merge_to="to_woe_0",
                     nan_merge_to="to_woe_0",
                     imp_type="feature_imp",
                     regularized_refit=False,
                     p_val=0.05,
                     verbose=2
        )

auto_woe_1 = ReportDeco(auto_woe_1, )

In [11]:
auto_woe_1.fit(train,
               target_name="target",
              )

city processing...
city_development_index processing...
gender processing...
relevent_experience processing...
enrolled_university processing...
education_level processing...
experience processing...
company_type processing...
company_size processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'company_type', 'last_new_job', 'training_hours']) to selector !!!!!
Feature selection...
city                     -0.516274
city_development_index   -0.512608
company_size             -0.814922
company_type             -0.397978
experience               -0.175231
enrolled_university      -0.219507
education_level          -1.239627
dtype: float64


In [12]:
test_prediction = auto_woe_1.predict_proba(test)
test_prediction

array([0.06460692, 0.57321671, 0.0497262 , ..., 0.13746553, 0.07190761,
       0.04153373])

In [13]:
roc_auc_score(test['target'].values, test_prediction)

0.8019815944109903

In [14]:
report_params = {"output_path": "HR_REPORT_2", # folder where the report and its supporting files are written
                 "report_name": "WHITEBOX REPORT",
                 "report_version_id": 2,
                 "city": "Moscow",
                 "model_aim": "Predict if candidate will work for the company",
                 "model_name": "HR model",
                 "zakazchik": "Kaggle",
                 "high_level_department": "Ai Lab",
                 "ds_name": "Btbpanda",
                 "target_descr": "Candidate will work for the company",
                 "non_target_descr": "Candidate will not work for the company"}

auto_woe_1.generate_report(report_params, )

No handles with labels found to put in legend.


### WhiteBox preset - like TabularAutoML

In [15]:
from lightautoml.automl.presets.whitebox_presets import WhiteBoxPreset
from lightautoml import Task

In [16]:
task = Task('binary')
automl = WhiteBoxPreset(task)

In [17]:

train_pred = automl.fit_predict(train.reset_index(drop=True), roles={'target': 'target'})

Validation data is not set. Train will be used as valid in report and valid prediction


Start automl preset with listed constraints:
- time: 3600 seconds
- cpus: 4 cores
- memory: 16 gb

Train data shape: (15326, 13)
Feats was rejected during automatic roles guess: []


Layer 1 ...
Train process start. Time left 3595.0072581768036 secs
Start fitting Lvl_0_Pipe_0_Mod_0_WhiteBox ...

===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_WhiteBox =====

 features [] contain too many nans or identical values
 features [] have low importance
city processing...
city_development_index processing...
company_type processing...
education_level processing...
enrolled_university processing...
gender processing...
major_discipline processing...
relevent_experience processing...
company_size processing...
experience processing...
last_new_job processing...
training_hours processing...
dict_keys(['city', 'city_development_index', 'company_type', 'education_level', 'enrolled_university', 'gender', 'major_discipline', 'relevent_experience', 'company_size', 'experience', 'last_new_job', '

In [18]:
test_prediction = automl.predict(test).data[:, 0]

In [19]:
roc_auc_score(test['target'].values, test_prediction)

0.7966826628232216

### Serialization


Important: auto_woe_1 is actually a ReportDeco object (a report wrapper), not an AutoWoE instance. To get the underlying AutoWoE model, access its .model attribute.

ReportDeco is not recommended at the inference stage. The report object requires the target column in the dataset passed for prediction, because it computes quality metrics. Inference through the report object is also much slower, since it builds the report itself.

In [20]:
joblib.dump(auto_woe_1.model, 'model.pkl')
model = joblib.load('model.pkl')

### SQL inference query

In [21]:
sql_query = model.get_sql_inference_query('global_temp.TABLE_1')
print(sql_query)

SELECT
  1 / (1 + EXP(-(
    -1.111
    -0.516*WOE_TAB.city
    -0.513*WOE_TAB.city_development_index
    -0.815*WOE_TAB.company_size
    -0.398*WOE_TAB.company_type
    -0.175*WOE_TAB.experience
    -0.22*WOE_TAB.enrolled_university
    -1.24*WOE_TAB.education_level
  ))) as PROB,
  WOE_TAB.*
FROM 
    (SELECT
    CASE
      WHEN (city IS NULL OR LOWER(CAST(city AS VARCHAR(50))) = 'nan') THEN 0
      WHEN city IN ('city_100', 'city_102', 'city_103', 'city_116', 'city_149', 'city_159', 'city_160', 'city_45', 'city_46', 'city_64', 'city_71', 'city_73', 'city_83', 'city_99') THEN 0.213
      WHEN city IN ('city_104', 'city_114', 'city_136', 'city_138', 'city_16', 'city_173', 'city_23', 'city_28', 'city_36', 'city_50', 'city_57', 'city_61', 'city_65', 'city_67', 'city_75', 'city_97') THEN 1.017
      WHEN city IN ('city_11', 'city_21', 'city_74') THEN -1.455
      ELSE -0.209
    END AS city,
    CASE
      WHEN (city_development_index IS NULL OR city_development_index = 'NaN') THEN 0
   

### Check the SQL query by PySpark

In [23]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
                    .master("local[2]") \
                    .appName("spark-course") \
                    .config("spark.driver.memory", "512m") \
                    .getOrCreate()
sc = spark.sparkContext

In [24]:
spark_df = spark.read.csv(DATASET_FULLNAME, header=True)
spark_df.createGlobalTempView("TABLE_1")

In [25]:
res = spark.sql(sql_query).toPandas()

In [26]:
res

Unnamed: 0,PROB,city,city_development_index,company_size,company_type,experience,enrolled_university,education_level
0,0.365512,0.213,0.461,-0.717,-0.640,0.533,0.208,-0.166
1,0.195716,-0.209,-0.121,0.467,0.398,0.533,0.208,-0.166
2,0.835002,-1.455,-1.454,-0.717,-0.640,-0.319,-0.614,-0.166
3,0.476161,-0.209,-0.121,-0.717,0.398,-0.811,-0.327,-0.166
4,0.117694,-0.209,-0.121,0.467,0.737,0.533,0.208,0.210
...,...,...,...,...,...,...,...,...
19153,0.275602,1.017,0.461,-0.717,-0.640,0.533,0.208,-0.166
19154,0.365512,0.213,0.461,-0.717,-0.640,0.533,0.208,-0.166
19155,0.126794,0.213,0.461,0.467,0.398,0.533,0.208,-0.166
19156,0.060842,1.017,0.461,0.467,0.398,-0.811,0.208,0.340
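
The `PROB` column can be checked by hand: the outer `SELECT` of the generated query is a logistic function applied to a linear combination of the WoE columns. The sketch below recomputes row 0 in plain Python, using the rounded intercept and coefficients printed in the query and the WoE values shown for row 0 in the table. The variable names are chosen for illustration only.

```python
import math

# Intercept and coefficients as printed in the generated SQL query.
intercept = -1.111
coefs = {
    "city": -0.516,
    "city_development_index": -0.513,
    "company_size": -0.815,
    "company_type": -0.398,
    "experience": -0.175,
    "enrolled_university": -0.22,
    "education_level": -1.24,
}

# WoE values of row 0 from the result table.
row0_woe = {
    "city": 0.213,
    "city_development_index": 0.461,
    "company_size": -0.717,
    "company_type": -0.640,
    "experience": 0.533,
    "enrolled_university": 0.208,
    "education_level": -0.166,
}

# Same expression as in the query: 1 / (1 + EXP(-(intercept + sum(coef * woe)))).
logit = intercept + sum(coefs[name] * row0_woe[name] for name in coefs)
prob = 1 / (1 + math.exp(-logit))
print(round(prob, 4))  # 0.3655, in line with PROB = 0.365512 for row 0
```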


In [27]:
sc.stop()

In [28]:
full_prediction = model.predict_proba(data)
full_prediction

array([0.36557352, 0.19577798, 0.83497665, ..., 0.12678668, 0.06083813,
       0.13061427])

In [29]:
(res['PROB'] - full_prediction).abs().max()

0.0002878641803194526