# Goal
I’ll show you how to use MLBox to train an automated machine learning pipeline for a classification problem.
__The pipeline loads and cleans the data, removes drifting features, runs an accelerated hyper-parameter optimization, and generates predictions.__

# MLBox: a fully automated pipeline
![title](assets/MlBox.PNG)

## Initialisation:
> __From a raw dataset to a cleaned dataset with numerical features.__

- Files reading:
> Reading of several files (csv, xls, json and hdf5)\
> Task detection (binary/multiclass classification or regression)\
> Creation of the training data set and the test data set
- Preprocessing/cleaning:
> Dropping duplicates and constant features\
> Dropping drifting features
- Encoding:
> Converting features to a single format (float if possible or str)\
> Converting lists\
> Converting dates into timestamps\
> Target encoding for classification tasks only\
> Categorical features encoding (several strategies available!)\
> Missing values encoding (several strategies available!)
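
To make the last two encoding steps concrete, here is a minimal sketch in plain Python of one possible strategy for each: mean imputation for a numerical feature and label encoding for a categorical one. The helper names `impute_mean` and `label_encode` are hypothetical and this is not MLBox's implementation; the sample values are taken from the Titanic rows shown later in this notebook.

```python
def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def label_encode(values, missing="<NULL>"):
    """Map each category to an integer code; None becomes its own category."""
    filled = [missing if v is None else v for v in values]
    codes = {cat: i for i, cat in enumerate(sorted(set(filled)))}
    return [codes[v] for v in filled], codes


age = [22.0, 38.0, None, 35.0]        # numerical feature with one gap
cabin = [None, "C85", None, "C123"]   # categorical feature with gaps

age_filled = impute_mean(age)
cabin_codes, mapping = label_encode(cabin)
print(age_filled)
print(cabin_codes, mapping)
```

Treating a missing category as its own label keeps the information that the value was absent, which can itself be predictive.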

## Validation:
> __Several models are tested and cross-validated and the best one is fitted.__

- On features:
> Feature engineering: neural network feature engineering\
> Feature selection: filter methods, wrapper methods and L1 regularization
- On estimator:
> Testing of a wide range of accurate estimators: Linear model, Random Forest, XGBoost, LightGBM…\
> Model blending: stacking, boosting, bagging\
> Hyper-parameters tuning (using TPE algorithm)
- On validation:
> Choice of several metrics: accuracy, log-loss, AUC, f1-score, MSE, MAE, etc.\
> Validation parameters: number of folds, random state, …
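
Since log-loss is one of the metrics listed above, here is a small sketch of how it is computed for a binary problem. It is a plain-Python illustration, not MLBox code, and the toy labels and probabilities are made up. Scikit-learn-style scorers report it negated (`neg_log_loss`) so that a higher score is always better.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


y_true = [0, 0, 1, 1]
coin_flip = [0.5, 0.5, 0.5, 0.5]   # uninformative predictions
confident = [0.1, 0.2, 0.8, 0.9]   # mostly right and fairly sure

neg_log_loss_coin = -log_loss(y_true, coin_flip)   # equals -ln(2)
neg_log_loss_good = -log_loss(y_true, confident)
print(neg_log_loss_coin, neg_log_loss_good)
```

A model that always predicts 0.5 scores -ln(2), roughly -0.693, which is a handy reference point when reading binary-classification scores.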

## Application:
> __We fit the whole pipeline and predict the target on the test set.__

- Prediction:
> Target prediction (class probabilities for classification)\
> Dumping fitted models and final predictions (.csv file)\
> CPU time display
- Models interpretation:
> Feature importance\
> Leak detection

# Testing MLBox: from data ingestion to model building

Now we’re going to test and run MLBox to quickly build a model to solve the Kaggle Titanic Challenge.

## Importing MLBox 

In [2]:
%%time

import warnings
warnings.filterwarnings("ignore")

from mlbox.preprocessing.reader import Reader
from mlbox.preprocessing.drift_thresholder import Drift_thresholder
from mlbox.optimisation.optimiser import Optimiser 
from mlbox.prediction.predictor import Predictor

Using TensorFlow backend.


CPU times: user 1.46 s, sys: 166 ms, total: 1.63 s
Wall time: 2.06 s


In [3]:
paths = ["/data1/kaggle/titanic/train.csv", "/data1/kaggle/titanic/test.csv"] 
target_name = "Survived"

## Reading and preprocessing

The Reader class of MLBox is in charge of preparing the data.
It provides methods and utilities to:
1. Read the data files (CSV, XLS, JSON, and h5) with the correct separator and load them
2. Clean the data by:
 - deleting Unnamed columns
 - inferring column types (float, int, list)
 - processing dates and extracting relevant information from them: year, month, day, day_of_week, hour, etc.
 - removing duplicates
 - preparing train and test splits
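
The cleaning steps above can be sketched with the standard library alone. This is only an illustration of the idea (drop the unnamed index column, convert to float where possible, drop duplicates), not how the Reader is implemented; `to_number`, `clean` and the tiny inline CSV are hypothetical.

```python
import csv
import io

# Tiny CSV with an unnamed index column, one duplicated row and one missing age.
RAW = "\n".join([
    ",PassengerId,Age,Sex",
    "0,1,22,male",
    "1,2,38,female",
    "1,2,38,female",
    "2,3,,female",
])


def to_number(value):
    """Return a float when the text parses as one, None when empty, else the text."""
    if value == "":
        return None
    try:
        return float(value)
    except ValueError:
        return value


def clean(text, sep=","):
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    # Keep only columns with a real name (drops the empty / "Unnamed" index column).
    keep = [i for i, name in enumerate(header)
            if name and not name.startswith("Unnamed")]
    columns = [header[i] for i in keep]
    rows, seen = [], set()
    for record in reader:
        row = tuple(to_number(record[i]) for i in keep)
        if row not in seen:  # drop exact duplicates, keep the first occurrence
            seen.add(row)
            rows.append(dict(zip(columns, row)))
    return columns, rows


columns, rows = clean(RAW)
print(columns)
print(rows)
```

Converting every numeric column to float is why integer-looking features such as Pclass show up as 3.0 in the cleaned frame below.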

In [4]:
rd = Reader(sep=",")
df = rd.train_test_split(paths, target_name)


reading csv : train.csv ...
cleaning data ...
CPU time: 4.343689680099487 seconds

reading csv : test.csv ...
cleaning data ...
CPU time: 0.05657672882080078 seconds

> Number of common features : 11

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 6
> Number of training samples : 891
> Number of test samples : 418

> Top sparse features (% missing values on train set):
Cabin       77.1
Age         19.9
Embarked     0.2
dtype: float64

> Task : classification
0.0    549
1.0    342
Name: Survived, dtype: int64

encoding target ...


In [9]:
df['train'].head()

| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | | S | 7.25 | Braund, Mr. Owen Harris | 0.0 | 1.0 | 3.0 | male | 1.0 | A/5 21171 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0.0 | 2.0 | 1.0 | female | 1.0 | PC 17599 |
| 2 | 26.0 | | S | 7.925 | Heikkinen, Miss. Laina | 0.0 | 3.0 | 3.0 | female | 0.0 | STON/O2. 3101282 |
| 3 | 35.0 | C123 | S | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0.0 | 4.0 | 1.0 | female | 1.0 | 113803 |
| 4 | 35.0 | | S | 8.05 | Allen, Mr. William Henry | 0.0 | 5.0 | 3.0 | male | 0.0 | 373450 |


## Removing drift
![title](assets/Drift.PNG)
This is an innovative feature I haven’t encountered in other packages. __The main idea is to automatically detect and remove variables whose distribution differs substantially between the train and the test set.__
This happens quite often and is generally referred to as biased data. __For example, the train set may contain only young people while the test set contains only elderly people.__ In that case the age feature is not robust and may lead to poor model performance at test time, so it has to be discarded.


### How does MLBox compute drifts for individual variables
MLBox builds a classifier that separates train from test data. __It then uses the ROC AUC score of this classifier as a measure of the drift.__
This makes sense:
- If the drift score is high (i.e. the ROC AUC is high), it is easy to discern train data from test data, which means that the two distributions are very different.
- Otherwise, if the drift score is low (i.e. the ROC AUC is close to 0.5, the chance level), the classifier is not able to separate the two distributions.

MLBox provides a class called Drift_thresholder that takes as input the train and test sets as well as the target and computes a drift score for each variable.
__Drift_thresholder then deletes the variables whose drift score is higher than a threshold (0.6 by default).__
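
Here is a rough sketch of the idea in plain Python. It simplifies things on purpose: instead of fitting a classifier per feature as MLBox does, it uses the raw feature value as the "classifier score", and it rescales the AUC as 2 * |AUC - 0.5| so that 0 means indistinguishable and 1 means perfectly separable. That rescaling is my assumption, chosen so the values fall in [0, 1]. The function names and the toy data are hypothetical.

```python
def auc(scores_neg, scores_pos):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def drift_score(train_values, test_values):
    """0 when train and test are indistinguishable, 1 when perfectly separable."""
    return 2.0 * abs(auc(train_values, test_values) - 0.5)


def drop_drifting(train, test, threshold=0.6):
    """Return the kept feature names and the drift score of every feature."""
    scores = {name: drift_score(train[name], test[name]) for name in train}
    kept = [name for name, score in scores.items() if score <= threshold]
    return kept, scores


# An ID that keeps increasing drifts completely; fares overlap between the two sets.
train = {"passenger_id": [1, 2, 3, 4], "fare": [7.0, 71.0, 8.0, 53.0]}
test = {"passenger_id": [5, 6, 7, 8], "fare": [7.5, 60.0, 9.0, 50.0]}

kept, scores = drop_drifting(train, test)
print(kept, scores)
```

Identifier-like columns are the typical victims: every test ID is larger than every train ID, so telling the two sets apart is trivial.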

In [10]:
dft = Drift_thresholder()
df = dft.fit_transform(df)


computing drifts ...
CPU time: 2.5861728191375732 seconds

> Top 10 drifts

('PassengerId', 0.9976076555023923)
('Name', 0.992967794537301)
('Ticket', 0.6848477696483359)
('Cabin', 0.18840282467093372)
('Embarked', 0.07221893417659442)
('Fare', 0.05512063216621499)
('Sex', 0.046565327627161146)
('Pclass', 0.02446271392419952)
('Age', 0.014791769476688144)
('SibSp', 0.00920393884026205)

> Deleted variables : ['Name', 'PassengerId', 'Ticket']
> Drift coefficients dumped into directory : save


In [12]:
df['train'].head()

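| | Age | Cabin | Embarked | Fare | Parch | Pclass | Sex | SibSp |
|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | | S | 7.25 | 0.0 | 3.0 | male | 1.0 |
| 1 | 38.0 | C85 | C | 71.2833 | 0.0 | 1.0 | female | 1.0 |
| 2 | 26.0 | | S | 7.925 | 0.0 | 3.0 | female | 0.0 |
| 3 | 35.0 | C123 | S | 53.1 | 0.0 | 1.0 | female | 1.0 |
| 4 | 35.0 | | S | 8.05 | 0.0 | 3.0 | male | 0.0 |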


## The heavy lifting: optimizing
This step optimizes the pipeline by trying different configurations of its components:
- NA encoder (missing values encoder)
- CA encoder (categorical features encoder)
- Feature selector (OPTIONAL)
- Stacking estimator — feature engineer (OPTIONAL)
- Estimator (classifier or regressor)

First, run it with the default configuration (LightGBM as the estimator), without any AutoML or complex grid search.

In [13]:
warnings.filterwarnings('ignore', category=DeprecationWarning)

opt = Optimiser()
score = opt.evaluate(None, df)

No parameters set. Default configuration is tested

##################################################### testing hyper-parameters... #####################################################

>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'label_encoding'}

>>> ESTIMATOR :{'strategy': 'LightGBM', 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 0.9, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': -1, 'seed': 0}


MEAN SCORE : neg_log_loss = -0.6324717748298669
VARIANCE : 0.003224557409561235 (fold 1 = -0.6356963322394281, fold 2 = -0.6292472174203056)
CPU time: 0.5325939655303955 seconds



This gives a neg log loss of -0.6325 as a first baseline.
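
A quick check of how the two fold scores in the output above are aggregated. One thing I noticed from the numbers (not from the MLBox documentation): the value printed as VARIANCE matches the population standard deviation of the fold scores rather than their variance. The snippet also compares the baseline with -ln(2), the score of a model that always predicts a probability of 0.5.

```python
import math
import statistics

# Fold scores copied from the Optimiser output above.
fold_scores = [-0.6356963322394281, -0.6292472174203056]

mean_score = statistics.mean(fold_scores)
spread = statistics.pstdev(fold_scores)   # population standard deviation
coin_flip = -math.log(2)                  # neg log loss of always predicting 0.5

print(mean_score, spread, coin_flip)
```

So the default configuration is only modestly better than an uninformative predictor, which leaves plenty of room for tuning.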

Let’s now define a space of multiple configurations:
- ne__numerical_strategy: how to handle missing data in numerical features
- ce__strategy: how to handle categorical variables encoding
- fs: feature selection
- stck: meta-features stacker
- est: final estimator

In [14]:
space = {
        'ne__numerical_strategy':{"search":"choice",
                                 "space":[0, "mean"]},
        'ce__strategy':{"search":"choice",
                        "space":["label_encoding", "random_projection", "entity_embedding"]}, 
        'fs__threshold':{"search":"uniform",
                        "space":[0.001, 0.2]}, 
        'est__strategy':{"search":"choice", 
                         "space":["RandomForest", "ExtraTrees", "LightGBM"]},
        'est__max_depth':{"search":"choice", 
                          "space":[8, 9, 10, 11, 12, 13]}
        }

params = opt.optimise(space, df, 15)
opt.evaluate(params, df)

##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}   
>>> FEATURE SELECTOR :{'strategy': 'l1', 'threshold': 0.08234297303219519}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 11, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.4744557540842814     
VARIANCE : 0.026845474941050024 (fold 1 = -0.5013012290253315, fold 2 = -0.4476102791432314)
CPU time: 2.2562050819396973 seconds                
###########################################

MEAN SCORE : neg_log_loss = -0.6356866326954257                               
VARIANCE : 0.02212777319535897 (fold 1 = -0.6578144058907847, fold 2 = -0.6135588595000667)
CPU time: 0.45307374000549316 seconds                                         
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}   
>>> CA ENCODER :{'strategy': 'random_projection'}                             
>>> FEATURE SELECTOR :{'strategy': 'l1', 'threshold': 0.18458019695555747}    
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 10, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state':

MEAN SCORE : neg_log_loss = -0.4498366358215057                                
VARIANCE : 0.022433481491541535 (fold 1 = -0.4722701173130472, fold 2 = -0.4274031543299641)
CPU time: 1.0189557075500488 seconds                                           
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}                                 
>>> FEATURE SELECTOR :{'strategy': 'l1', 'threshold': 0.07128763220975241}     
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 9, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_s

-0.43670643159910416

The tuned pipeline reaches a neg log loss of -0.4367, up from the -0.6325 baseline. Since the score is the negated log loss, higher is better.
There is good potential for further improvement with a better search space, stacking operations, or other feature selection techniques.
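
To make the search-space format more concrete, here is a sketch of how a dictionary with "choice" and "uniform" entries can be turned into candidate configurations. MLBox tunes with the TPE algorithm, which adapts its proposals to past results; the sketch below swaps in plain random search because it fits in a few lines. `sample`, `random_search` and `toy_objective` are hypothetical names, and the toy objective stands in for a cross-validated score.

```python
import random

space = {
    "ne__numerical_strategy": {"search": "choice", "space": [0, "mean"]},
    "fs__threshold": {"search": "uniform", "space": [0.001, 0.2]},
    "est__max_depth": {"search": "choice", "space": [8, 9, 10, 11, 12, 13]},
}


def sample(space, rng):
    """Draw one configuration: 'choice' picks an item, 'uniform' draws from [low, high]."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported search type: {spec['search']}")
    return config


def random_search(space, objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample(space, rng)
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_objective(config):
    """Made-up score (higher is better), only here to exercise the search loop."""
    return -abs(config["est__max_depth"] - 11) - config["fs__threshold"]


best_config, best_score = random_search(space, toy_objective, n_trials=15)
print(best_config, best_score)
```

The third argument of `opt.optimise(space, df, 15)` plays the same role as `n_trials` here: it caps how many configurations get evaluated.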

## Running predictions
Fit the optimal pipeline and predict on the test dataset.

In [15]:
prd = Predictor()
prd.fit_predict(params, df)


fitting the pipeline ...
CPU time: 1.4750120639801025 seconds

> Feature importances dumped into directory : save

predicting ...
CPU time: 0.22813796997070312 seconds

> Overview on predictions : 

        0.0       1.0  Survived_predicted
0  0.897952  0.102048                   0
1  0.656720  0.343280                   0
2  0.854128  0.145872                   0
3  0.881991  0.118009                   0
4  0.531964  0.468036                   0
5  0.840353  0.159647                   0
6  0.346359  0.653641                   1
7  0.775022  0.224978                   0
8  0.295327  0.704673                   1
9  0.890721  0.109279                   0

dumping predictions into directory : save ...


<mlbox.prediction.predictor.Predictor at 0x7f6410ac1588>
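
The overview above shows one probability column per class plus a predicted label. As a sanity check, the sketch below recomputes the label from the ten probability rows, assuming it is simply the most probable class. That assumption is mine, but it agrees with every row shown; `predict_label` is a hypothetical helper.

```python
# Class probabilities copied from the prediction overview above: (P(0.0), P(1.0)).
probabilities = [
    (0.897952, 0.102048),
    (0.656720, 0.343280),
    (0.854128, 0.145872),
    (0.881991, 0.118009),
    (0.531964, 0.468036),
    (0.840353, 0.159647),
    (0.346359, 0.653641),
    (0.775022, 0.224978),
    (0.295327, 0.704673),
    (0.890721, 0.109279),
]
classes = (0, 1)


def predict_label(row, classes):
    """Return the class whose probability is the largest in the row."""
    best = max(range(len(row)), key=lambda i: row[i])
    return classes[best]


predicted = [predict_label(row, classes) for row in probabilities]
print(predicted)
```

Keeping the probabilities around is useful: row 4 is predicted 0 with only about 53% confidence, so a different decision threshold would flip it.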

# Conclusion
With MLBox, you can build and tune a baseline pipeline like this one quickly and efficiently, so that you can focus on what matters when solving a business problem:
 - Understanding the problem
 - Acquiring and consolidating the right data
 - Formalizing the performance metrics to reach and compute
 
Because you can prototype the problem you need to solve quickly and efficiently, you get an intuitive grasp of its overall flow, which helps you focus more on analyzing its root causes.