# Preparation 

## [MLBox framework](https://mlbox.readthedocs.io/en/latest/installation.html) setup
Run this notebook in a dedicated Python virtual environment, and make sure the interactive Python kernel you use belongs to that environment. This greatly reduces the risk of dependency clashes.
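A quick way to confirm that the kernel really runs inside a virtual environment is sketched below. It relies only on the standard library; `in_virtualenv` is a hypothetical helper name, and the check covers `venv`/`virtualenv` environments only (conda environments are not detected this way).

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix

print("interpreter:", sys.executable)
print("inside a virtual env:", in_virtualenv())
```

If the second line prints `False`, the kernel is probably using the system interpreter rather than the environment created for this notebook.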

MLBox depends on [LightGBM](https://lightgbm.readthedocs.io/en/latest/), which in turn requires the [OpenMP](https://www.openmp.org/) library. The cell below installs the build prerequisites with Homebrew.

In [None]:
%%bash
brew install cmake
brew install libomp

In [None]:
%%bash 
pip install setuptools
pip install wheel
pip install pandas
pip install numpy
pip install mlbox

## Environment variables setup

In [1]:
paths = ["tmp_mlbox/train_mlbox.csv", "tmp_mlbox/eval_mlbox.csv"]
target_name = "y"  # name of the target column (the label to predict)
input_file = "tmp_mlbox/input_file.csv"
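All three paths live under `tmp_mlbox/`, and the later cells open them for writing, which fails if that directory does not exist. A minimal sketch for creating the parent directories up front follows; `ensure_parent_dirs` is a hypothetical helper, and the demonstration runs in a throwaway directory so it has no side effects.

```python
import tempfile
from pathlib import Path

def ensure_parent_dirs(file_paths, base="."):
    """Create the parent directory of every path (hypothetical helper)."""
    created = []
    for file_path in file_paths:
        parent = (Path(base) / file_path).parent
        parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(parent)
    return created

# Demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    dirs = ensure_parent_dirs(
        [
            "tmp_mlbox/train_mlbox.csv",
            "tmp_mlbox/eval_mlbox.csv",
            "tmp_mlbox/input_file.csv",
        ],
        base=scratch,
    )
    all_exist = all(d.is_dir() for d in dirs)
    print(all_exist)
```

In the notebook itself, the equivalent call would be `ensure_parent_dirs(paths + [input_file])`.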

## New "random" data generation

In [2]:
from random import uniform
from random import randint
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

row_num=400
min_num=0
max_num=100

with open(input_file, "w+") as f: 
    f.write("x1,x2,x3,x4,x5,x6,x7,x8,x9,y\n") 
    for i in range(row_num):
        x1 = randint(min_num, max_num)
        x2 = randint(min_num, max_num)
        x3 = randint(min_num, max_num)
        x4 = randint(min_num, max_num)
        x5 = randint(min_num, max_num)
        x6 = randint(min_num, max_num)
        x7 = randint(min_num, max_num)
        x8 = randint(min_num, max_num)        
        x9 = randint(min_num, max_num)
        y = 1 if x1 + x2 > x3 else 0  # the label depends only on x1, x2 and x3
        
        f.write("{},{},{},{},{},{},{},{},{},{}\n".format(x1,x2,x3,x4,x5,x6,x7,x8,x9,y))
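The generation cell above draws from the global random state, so every run produces a different file. A more compact and repeatable variant might look like the sketch below. The `seed` parameter and the helper names `generate_rows` and `write_dataset` are assumptions for illustration; the labeling rule `y = 1 if x1 + x2 > x3` is the one used in this notebook.

```python
import csv
import io
import random

FEATURES = ["x{}".format(i) for i in range(1, 10)]  # x1 .. x9

def generate_rows(row_num=400, min_num=0, max_num=100, seed=42):
    """Yield rows of nine random integers followed by the label y."""
    rng = random.Random(seed)  # private generator, so runs are repeatable
    for _ in range(row_num):
        xs = [rng.randint(min_num, max_num) for _ in FEATURES]
        y = 1 if xs[0] + xs[1] > xs[2] else 0  # same rule as in the notebook
        yield xs + [y]

def write_dataset(handle, rows):
    """Write a header line and the given rows as CSV to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(FEATURES + ["y"])
    writer.writerows(rows)

# Write five rows to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_dataset(buffer, generate_rows(row_num=5))
print(buffer.getvalue())
```

To produce the real input file, pass an open file handle instead of the buffer, for example `with open(input_file, "w", newline="") as f: write_dataset(f, generate_rows())`.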

In [3]:
df = pd.read_csv(input_file,index_col=None, header=0, delimiter=",")

In [4]:
df.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,y
0,100,73,77,80,56,92,99,24,81,1
1,2,27,69,40,47,17,5,44,42,0
2,84,26,46,60,45,96,69,43,23,1
3,46,98,57,83,89,42,49,29,35,1
4,64,28,43,100,55,71,42,21,0,1


In [5]:
X = df
y = df[target_name]
# MLBox does not appear to split the data itself, so the split is done manually here
X_train, X_test, _, _ = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)
X_train.to_csv(paths[0], encoding='utf8',index=False)

# Per the MLBox documentation, the test dataset must NOT contain the target column
X_test = X_test.drop(target_name, axis=1)
X_test.to_csv(paths[1], encoding='utf8',index=False)
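Because the reader tells the train and test files apart by the presence of the target column, it can be worth verifying the two CSV headers before handing the files over. The following is a minimal sketch using only the standard library; `read_header` and `check_split_files` are hypothetical helpers, and the demonstration uses tiny temporary files rather than the notebook's data.

```python
import csv
import os
import tempfile

def read_header(csv_path):
    """Return the first row of a CSV file as a list of column names."""
    with open(csv_path, newline="") as f:
        return next(csv.reader(f))

def check_split_files(train_path, eval_path, target):
    """Return True when train has the target column and eval has the same features without it."""
    train_cols = read_header(train_path)
    eval_cols = read_header(eval_path)
    features_match = [c for c in train_cols if c != target] == eval_cols
    return target in train_cols and target not in eval_cols and features_match

# Demonstration with two tiny temporary files.
with tempfile.TemporaryDirectory() as scratch:
    train_path = os.path.join(scratch, "train.csv")
    eval_path = os.path.join(scratch, "eval.csv")
    with open(train_path, "w", newline="") as f:
        f.write("x1,x2,y\n1,2,1\n")
    with open(eval_path, "w", newline="") as f:
        f.write("x1,x2\n3,4\n")
    split_ok = check_split_files(train_path, eval_path, "y")
    print(split_ok)
```

In the notebook, the corresponding call would be `check_split_files(paths[0], paths[1], target_name)`.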

## Training with MLBox

In [6]:
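# Import only the classes used below instead of relying on star imports
from mlbox.preprocessing import Reader, Drift_thresholder
from mlbox.optimisation import Optimiser
from mlbox.prediction import Predictor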

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``b

In [7]:
data = Reader(sep=",").train_test_split(paths, target_name)  #reading


reading csv : train_mlbox.csv ...
cleaning data ...
CPU time: 7.9070000648498535 seconds

reading csv : eval_mlbox.csv ...
cleaning data ...
CPU time: 0.03619790077209473 seconds

> Number of common features : 9

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 0
> Number of numerical features: 9
> Number of training samples : 360
> Number of test samples : 40

> You have no missing values on train set...

> Task : classification
1.0    304
0.0     56
Name: y, dtype: int64

encoding target ...


In [8]:
data = Drift_thresholder().fit_transform(data)  #deleting non-stable variables


computing drifts ...
CPU time: 0.11256694793701172 seconds

> Top 10 drifts

('x2', 0.33638888888888907)
('x5', 0.20972222222222214)
('x4', 0.128611111111111)
('x6', 0.06791666666666663)
('x9', 0.05861111111111117)
('x1', 0.04541666666666666)
('x8', 0.043194444444444535)
('x3', 0.03236111111111106)
('x7', 0.022638888888888875)

> Deleted variables : []
> Drift coefficients dumped into directory : save


[Optimizer documentation](https://mlbox.readthedocs.io/en/latest/features.html#optimisation)

[Scoring options](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [9]:
opt = Optimiser(scoring = 'accuracy', n_folds = 5)
opt.evaluate(None, data)

  +str(self.to_path)+"/joblib'. Please clear it regularly.")


No parameters set. Default configuration is tested

##################################################### testing hyper-parameters... #####################################################

>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'label_encoding'}

>>> ESTIMATOR :{'strategy': 'LightGBM', 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 0.9, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': -1, 'seed': 0}






MEAN SCORE : accuracy = 0.9389607048684802
VARIANCE : 0.03339214236886179 (fold 1 = 0.9178082191780822, fold 2 = 0.9583333333333334, fold 3 = 0.9861111111111112, fold 4 = 0.8888888888888888, fold 5 = 0.9436619718309859)
CPU time: 1.5439910888671875 seconds





0.9389607048684802
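The summary above can be reproduced by hand from the five fold scores. The sketch below recomputes the mean with the standard library; it also suggests that the figure labeled `VARIANCE` in this output matches the population standard deviation of the fold scores rather than their variance. That reading is an observation from these numbers, not a statement about MLBox internals.

```python
from statistics import mean, pstdev

# Fold scores copied from the cell output above.
fold_scores = [
    0.9178082191780822,
    0.9583333333333334,
    0.9861111111111112,
    0.8888888888888888,
    0.9436619718309859,
]

mean_score = mean(fold_scores)   # arithmetic mean over the 5 folds
spread = pstdev(fold_scores)     # population standard deviation

print("mean accuracy:", round(mean_score, 4))
print("population std:", round(spread, 4))
```

Rounded to four decimals, the two values agree with the `MEAN SCORE` (0.9390) and `VARIANCE` (0.0334) lines printed by the optimiser.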

In [None]:
space = {
        'fs__strategy' : {"space" : ["variance", "rf_feature_importance"]},
        'fs__threshold': {"search" : "choice", "space" : [0.1, 0.2, 0.3]},

        'est__strategy' : {"space" : ["LightGBM"]},
        'est__max_depth' : {"search" : "choice", "space" : [5,6]},
        'est__subsample' : {"search" : "uniform", "space" : [0.6,0.9]}
        }

best = opt.optimise(space, data, max_evals = 10)
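Each key in `space` names a pipeline step and one of its parameters (`fs__` for the feature selector, `est__` for the estimator), and each value describes how candidates are drawn. As a rough illustration of what `"choice"` and `"uniform"` mean, the sketch below draws one random configuration from the same dictionary. It is not MLBox's implementation: `sample_configuration` is a hypothetical helper, and treating entries without a `"search"` key as `"choice"` is an assumption of this sketch.

```python
import random

def sample_configuration(space, rng):
    """Draw one configuration from a search-space dictionary (illustrative only)."""
    config = {}
    for name, spec in space.items():
        search = spec.get("search", "choice")  # assumption: default to "choice"
        values = spec["space"]
        if search == "choice":
            config[name] = rng.choice(values)        # pick one listed value
        elif search == "uniform":
            low, high = values
            config[name] = rng.uniform(low, high)    # draw a float in [low, high]
        else:
            raise ValueError("unsupported search type: {}".format(search))
    return config

demo_space = {
    "fs__strategy": {"space": ["variance", "rf_feature_importance"]},
    "fs__threshold": {"search": "choice", "space": [0.1, 0.2, 0.3]},
    "est__strategy": {"space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [5, 6]},
    "est__subsample": {"search": "uniform", "space": [0.6, 0.9]},
}

sampled = sample_configuration(demo_space, random.Random(0))
print(sampled)
```

With `max_evals = 10`, the optimiser evaluates ten such candidate configurations and returns the best one it found.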

In [None]:
Predictor().fit_predict(best, data)