# Introduction to AutoML with MLBox

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
pip install mlbox

Collecting mlbox
  Downloading mlbox-0.8.5.tar.gz (31 kB)
Collecting numpy==1.18.2
  Downloading numpy-1.18.2-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
Collecting matplotlib==3.0.3
  Downloading matplotlib-3.0.3-cp37-cp37m-manylinux1_x86_64.whl (13.0 MB)
Collecting hyperopt==0.2.3
  Downloading hyperopt-0.2.3-py3-none-any.whl (1.9 MB)
Collecting pandas==0.25.3
  Downloading pandas-0.25.3-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
Collecting joblib==0.14.1
  Downloading joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting scikit-learn==0.22.1
  Downloading scikit_learn-0.22.1-cp37-cp37m-manylinux1_x86_64.whl (7.0 MB)

# Importing MLBox

In [3]:
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

# Inputs to MLBox

If you have both a train set and a test set, you can pass the two paths directly to MLBox, along with the target name.

Otherwise, if you provide a train set only, MLBox creates a test set.

In [4]:
# Paths to the train set and the test set.
paths = ["/content/drive/MyDrive/MLBOX/classification/train_classification.csv", "/content/drive/MyDrive/MLBOX/classification/test_classification.csv"]
# Name of the feature to predict.
# This column should only be present in the train set.
target_name = "Survived"
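
Before handing the paths to MLBox, it can be worth checking that the target column really is present in the train file only. The snippet below is a minimal sketch of such a check using pandas; the two in-memory CSVs are made-up stand-ins for the real files, and the check itself is not part of MLBox.

```python
import io
import pandas as pd

# Toy stand-ins for the train and test files (assumed data, not the real CSVs).
train_csv = io.StringIO("PassengerId,Pclass,Survived\n1,3,0\n2,1,1\n")
test_csv = io.StringIO("PassengerId,Pclass\n892,3\n893,2\n")
target_name = "Survived"

# nrows=0 reads the header only, which is enough to inspect column names.
train_cols = pd.read_csv(train_csv, nrows=0).columns
test_cols = pd.read_csv(test_csv, nrows=0).columns

# The target must be in the train set and absent from the test set.
target_ok = target_name in train_cols and target_name not in test_cols
common = sorted(set(train_cols) & set(test_cols))

print("target only in train:", target_ok)
print("common features:", common)
```

With real files, replace the `io.StringIO` objects with the two entries of `paths`.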

# Reading and preprocessing

The Reader class of MLBox is in charge of preparing the data.

It provides methods and utilities to:

* Read and load the data in several file formats (csv, xls, json, and h5), using the correct separator
* Clean the data by:
  * deleting Unnamed columns
  * inferring column types (float, int, list)
  * processing dates and extracting relevant information from them: year, month, day, dayofweek, hour, etc.
  * removing duplicates
* Prepare train and test splits
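
To make the cleaning steps above more concrete, here is a rough pandas approximation. It is only a sketch, not the Reader implementation: the helper `clean_like_reader`, its `date_cols` argument, and the toy columns `signup` and `fare` are all hypothetical names chosen for illustration.

```python
import pandas as pd

def clean_like_reader(df, date_cols=()):
    """Hypothetical helper approximating the cleaning steps listed above."""
    # Drop leftover index columns such as "Unnamed: 0".
    keep = [c for c in df.columns if not str(c).startswith("Unnamed")]
    df = df.loc[:, keep].copy()
    # Expand each date column into numeric components, then drop the original.
    for col in date_cols:
        dates = pd.to_datetime(df[col])
        df[col + "_year"] = dates.dt.year
        df[col + "_month"] = dates.dt.month
        df[col + "_day"] = dates.dt.day
        df[col + "_dayofweek"] = dates.dt.dayofweek
        df[col + "_hour"] = dates.dt.hour
        df = df.drop(columns=col)
    # Remove exact duplicate rows.
    return df.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "signup": ["2021-03-01 14:00", "2021-03-01 14:00", "2022-07-15 09:30"],
    "fare": [7.25, 7.25, 53.1],
})
cleaned = clean_like_reader(raw, date_cols=["signup"])
print(cleaned)
```

The first two toy rows differ only in the Unnamed column, so they collapse into one row once that column is gone.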

In [5]:
rd = Reader(sep=",")
df = rd.train_test_split(paths, target_name)


reading csv : train_classification.csv ...
cleaning data ...
CPU time: 7.304362535476685 seconds

reading csv : test_classification.csv ...
cleaning data ...
CPU time: 0.5820267200469971 seconds

> Number of common features : 11

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 6
> Number of training samples : 891
> Number of test samples : 418

> Top sparse features (% missing values on train set):
Cabin       77.1
Age         19.9
Embarked     0.2
dtype: float64

> Task : classification
0.0    549
1.0    342
Name: Survived, dtype: int64

encoding target ...


When this function is done running, it creates a folder named save where it dumps the target encoder for later use.

In [6]:
df["train"].head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0.0,1.0,3.0,male,1.0,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.0,2.0,1.0,female,1.0,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0.0,3.0,3.0,female,0.0,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.0,4.0,1.0,female,1.0,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0.0,5.0,3.0,male,0.0,373450


# Removing drift

This step automatically detects and removes variables whose distribution differs substantially between the train and the test set.

This happens quite often, and is usually described as biased data. For example, the train set might contain a population of young people whereas the test set contains elderly people only. In that case the age feature is not robust and may lead to poor model performance at test time, so it has to be discarded.

# How MLBox computes drift for individual variables

MLBox builds a classifier that separates train from test data. It then uses the ROC score related to this classifier as a measure of the drift.

This makes sense:

* If the drift score is high (i.e. the ROC score is high), it is easy to tell train data from test data, which means that the two distributions are very different.
* Otherwise, if the drift score is low (i.e. the ROC score is low), the classifier is not able to separate the two distributions correctly.

MLBox provides a class called Drift_thresholder that takes as input the train and test sets as well as the target, and computes a drift score for each of the variables.

Drift_thresholder then deletes the variables whose drift score is higher than a threshold (0.6 by default).
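
The idea of scoring drift with a train-versus-test classifier can be sketched with scikit-learn. This is an illustration under assumptions, not MLBox's code: the helper `drift_auc`, the random forest, and the synthetic normal samples are all choices made for the example. It reports the raw ROC AUC, where 0.5 means no detectable drift, whereas the drift scores MLBox prints sit close to 0 for stable features, which suggests MLBox rescales the value.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def drift_auc(train_col, test_col, seed=0):
    """Hypothetical helper: ROC AUC of a classifier telling train rows from test rows."""
    X = np.concatenate([train_col, test_col]).reshape(-1, 1)
    # Label 0 = row comes from the train set, 1 = row comes from the test set.
    y = np.concatenate([np.zeros(len(train_col)), np.ones(len(test_col))])
    clf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=seed)
    # Out-of-fold probabilities, so the score is not inflated by overfitting.
    proba = cross_val_predict(clf, X, y, cv=3, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

rng = np.random.default_rng(0)
same = drift_auc(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
shifted = drift_auc(rng.normal(0, 1, 500), rng.normal(3, 1, 500))

print(f"same distribution:    AUC = {same:.2f}")
print(f"shifted distribution: AUC = {shifted:.2f}")
```

A feature drawn from the same distribution in both sets should score near 0.5, while the shifted feature should score close to 1.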

In [7]:
dft = Drift_thresholder()
df = dft.fit_transform(df)


computing drifts ...
CPU time: 0.33306169509887695 seconds

> Top 10 drifts

('PassengerId', 0.9976076555023923)
('Name', 0.9884888536056815)
('Ticket', 0.7164320569100027)
('Cabin', 0.16333523222990798)
('Embarked', 0.07316270907851763)
('Pclass', 0.06255825968178108)
('Age', 0.03686470639145445)
('SibSp', 0.03057403490771371)
('Parch', 0.030333848197080737)
('Fare', 0.02626152265790216)

> Deleted variables : ['Name', 'PassengerId', 'Ticket']
> Drift coefficients dumped into directory : save


Name, PassengerId and Ticket are removed because their drift scores exceed the threshold.
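
The deletion rule can be reproduced by hand from the scores printed above. The snippet below is a plain-Python sketch of the thresholding step only; it copies the drift scores from the output and assumes the 0.6 threshold mentioned earlier.

```python
# Drift scores copied from the Drift_thresholder output above.
drift_scores = {
    "PassengerId": 0.9976076555023923,
    "Name": 0.9884888536056815,
    "Ticket": 0.7164320569100027,
    "Cabin": 0.16333523222990798,
    "Embarked": 0.07316270907851763,
    "Pclass": 0.06255825968178108,
    "Age": 0.03686470639145445,
    "SibSp": 0.03057403490771371,
    "Parch": 0.030333848197080737,
    "Fare": 0.02626152265790216,
}
threshold = 0.6  # threshold stated in the text

# Variables above the threshold are dropped, the rest are kept.
deleted = sorted(name for name, score in drift_scores.items() if score > threshold)
kept = sorted(name for name, score in drift_scores.items() if score <= threshold)

print("deleted:", deleted)
print("kept:", kept)
```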

# Optimisation

This step optimises the pipeline by trying different configurations of the parameters of each of its components:

* NA encoder (missing values encoder)
* CA encoder (categorical features encoder)
* Feature selector (OPTIONAL)
* Stacking estimator - feature engineer (OPTIONAL)
* Estimator (classifier or regressor)



In [8]:
opt = Optimiser() #instantiate the Optimiser class

  +str(self.to_path)+"/joblib'. Please clear it regularly.")


Then we can run it with the default model configuration (LightGBM), without any AutoML or complex grid search.

In [9]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
score = opt.evaluate(None, df)

No parameters set. Default configuration is tested

##################################################### testing hyper-parameters... #####################################################

>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'label_encoding'}

>>> ESTIMATOR :{'strategy': 'LightGBM', 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 0.9, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': -1, 'seed': 0}


MEAN SCORE : neg_log_loss = -0.6566934245807459
VARIANCE : 0.03558843801215594 (fold 1 = -0.6922818625929018, fold 2 = -0.6211049865685899)
CPU time: 0.6626987457275391 seconds



The neg_log_loss of about -0.6567 serves as a first baseline.
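
To put this number in context, it helps to compute the negative log loss by hand on a toy example. The function below is a sketch of the standard binary formula, written from scratch rather than taken from MLBox; the labels and probabilities are made up. A classifier that always predicts 0.5 scores -ln 2, roughly -0.693, so the mean score of about -0.657 printed above is only slightly better than a coin flip.

```python
import math

def neg_log_loss(y_true, p_pos):
    """Negative mean binary cross-entropy; values closer to 0 are better."""
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

y_true = [0, 0, 1, 1]  # toy labels
coin_flip = neg_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
confident = neg_log_loss(y_true, [0.1, 0.2, 0.8, 0.9])

print(f"always 0.5:          {coin_flip:.4f}")
print(f"confident and right: {confident:.4f}")
```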

Let's now define a space of multiple configurations:

* ne__numerical_strategy: how to handle missing data in numerical features
* ce__strategy: how to encode categorical variables
* fs: feature selection
* stck: meta-features stacker
* est: final estimator

In [10]:
space = {
        'ne__numerical_strategy':{"search":"choice",
                                 "space":[0, "mean"]},
        'ce__strategy':{"search":"choice",
                        "space":["label_encoding", "random_projection", "entity_embedding"]}, 
        'fs__threshold':{"search":"uniform",
                        "space":[0.001, 0.2]}, 
        'est__strategy':{"search":"choice", 
                         "space":["RandomForest", "ExtraTrees", "LightGBM"]},
        'est__max_depth':{"search":"choice", 
                          "space":[8, 9, 10, 11, 12, 13]}
        }

params = opt.optimise(space, df, 15)

##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'l1', 'threshold': 0.1574588941362064}
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 10, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 0.9, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': -1, 'seed': 0}
MEAN SCORE : neg_log_loss = -0.6483908894975033
VARIANCE : 0.033352098648820716 (fold 1 = -0.6817429881463241, fold 2 = -0.6150387908486826)
CPU time: 0.8080451488494
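
To see what the "search" and "space" entries mean, the sketch below draws random configurations from a space written in the same format. It is only an illustration: the helper `sample_config` is hypothetical and uses plain random sampling, which is an assumption made for simplicity and not a description of how MLBox runs its search.

```python
import random

# A reduced space in the same format as the one passed to opt.optimise.
space = {
    "ne__numerical_strategy": {"search": "choice", "space": [0, "mean"]},
    "fs__threshold": {"search": "uniform", "space": [0.001, 0.2]},
    "est__strategy": {"search": "choice",
                      "space": ["RandomForest", "ExtraTrees", "LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [8, 9, 10, 11, 12, 13]},
}

def sample_config(space, rng):
    """Hypothetical sampler: draw one configuration from the space."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            # Pick one of the listed values.
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            # Draw a float between the two bounds.
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported search type: {spec['search']}")
    return config

rng = random.Random(0)
configs = [sample_config(space, rng) for _ in range(3)]
for config in configs:
    print(config)
```

Each printed dictionary has the same shape as the `params` value that is passed to `opt.evaluate`.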

Let's evaluate the best configuration found:

In [11]:
opt.evaluate(params, df)


##################################################### testing hyper-parameters... #####################################################

>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'entity_embedding'}

>>> FEATURE SELECTOR :{'strategy': 'l1', 'threshold': 0.09818915447496747}

>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 10, 'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}


MEAN SCORE : neg_log_loss = -0.43832332241666805
VARIANCE : 0.017590322871002567 (fold 1 = -0.4559136452876706, fold 2 = -0.42073299954566545)
CPU time: 2.6417958736419678 seconds



-0.43832332241666805

Running this pipeline gives a neg_log_loss of about -0.4383, which is higher (closer to zero) than the baseline and therefore better.

There is clearly room for further improvement, for example by defining a better search space, adding stacking operations, or trying other feature selection techniques.

# Running predictions


Fit the optimal pipeline and predict on our test dataset.

In [12]:
prd = Predictor()
prd.fit_predict(params, df)


fitting the pipeline ...




CPU time: 2.1722071170806885 seconds

predicting ...




CPU time: 0.22200894355773926 seconds

> Overview on predictions : 

        0.0       1.0  Survived_predicted
0  0.886565  0.113435                   0
1  0.663588  0.336412                   0
2  0.879879  0.120121                   0
3  0.868672  0.131328                   0
4  0.499554  0.500446                   1
5  0.882894  0.117106                   0
6  0.518386  0.481614                   0
7  0.801966  0.198034                   0
8  0.301549  0.698451                   1
9  0.910383  0.089617                   0

dumping predictions into directory : save ...


<mlbox.prediction.predictor.Predictor at 0x7fbb2aed8ed0>
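
In the overview above, the Survived_predicted column appears to be the class with the higher predicted probability. The sketch below checks that reading on four rows copied from the output; treating the prediction as an argmax is an assumption based on those rows, not something confirmed by the MLBox source.

```python
# (P(class 0), P(class 1)) pairs copied from the prediction overview above.
rows = {
    0: (0.886565, 0.113435),
    4: (0.499554, 0.500446),
    6: (0.518386, 0.481614),
    8: (0.301549, 0.698451),
}

# Pick the index of the largest probability in each row.
predicted = {
    idx: max(range(len(probs)), key=probs.__getitem__)
    for idx, probs in rows.items()
}

for idx, label in predicted.items():
    print(f"row {idx}: predicted class {label}")
```

Rows 4 and 6 show how close some decisions are: a probability just above or below 0.5 is enough to flip the predicted class.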