### Auto ML
This notebook explains the auto-ML capabilities of aikit.

It walks through each of the components involved. If you just want to run the auto-ML, use the <b>automl launcher</b> instead.



Let's start by loading a small dataset

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from aikit.datasets.datasets import load_dataset,DatasetEnum
dfX, y, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
dfX.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
y[0:5]

array([0, 1, 1, 1, 0], dtype=int64)

Now let's import what is needed

In [3]:
from aikit.ml_machine import AutoMlConfig, JobConfig,  MlJobManager, MlJobRunner, AutoMlResultReader
from aikit.ml_machine import FolderDataPersister,SavingType, AutoMlModelGuider

### AutoML configuration object
This object contains all the relevant information about the problem at hand:
 * its type : REGRESSION or CLASSIFICATION
 * the information about the columns in the data
 * the steps that are needed in the processing pipeline (explained below)
 * the models that are to be tested
 * ...

By default the config object guesses everything, but every setting can be changed if needed

In [4]:
auto_ml_config = AutoMlConfig(dfX = dfX, y = y, name = "titanic")
auto_ml_config.guess_everything()
auto_ml_config


<aikit.ml_machine.ml_machine.AutoMlConfig object at 0x000001BD444C2CF8>
type of problem : CLASSIFICATION

#### type of problem

In [5]:
auto_ml_config.type_of_problem

'CLASSIFICATION'

The config guessed that this is a classification problem

#### information about columns

In [6]:
auto_ml_config.columns_informations

OrderedDict([('Pclass',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'NUM'}),
             ('Name',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'TEXT'}),
             ('Sex',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'CAT'}),
             ('Age',
              {'HasMissing': True, 'ToKeep': True, 'TypeOfVariable': 'NUM'}),
             ('SibSp',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'NUM'}),
             ('Parch',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'NUM'}),
             ('Ticket',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'TEXT'}),
             ('Fare',
              {'HasMissing': False, 'ToKeep': True, 'TypeOfVariable': 'NUM'}),
             ('Cabin',
              {'HasMissing': True, 'ToKeep': True, 'TypeOfVariable': 'CAT'}),
             ('Embarked',
              {'HasMissing': True, 'ToKeep': True, 'TypeOfVa

In [7]:
pd.DataFrame(auto_ml_config.columns_informations).T

Unnamed: 0,HasMissing,ToKeep,TypeOfVariable
Pclass,False,True,NUM
Name,False,True,TEXT
Sex,False,True,CAT
Age,True,True,NUM
SibSp,False,True,NUM
Parch,False,True,NUM
Ticket,False,True,TEXT
Fare,False,True,NUM
Cabin,True,True,CAT
Embarked,True,True,CAT


For each column in the DataFrame, its type was guessed among three possible values :
 * NUM  : for numerical columns
 * TEXT : for columns that contain text
 * CAT  : for categorical columns

Remarks:
 * TEXT and CAT columns are told apart by their number of distinct modalities
 * Be careful with categorical values that are encoded as integers (the algorithm won't know that they are really categorical features)
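
To make the two remarks concrete, here is a minimal sketch of a type-guessing rule. The function `guess_type_of_variable` and its `max_modalities` threshold are hypothetical and are not aikit's actual logic; the toy DataFrame is made up for illustration.

```python
import pandas as pd


def guess_type_of_variable(series, max_modalities=5):
    """Toy rule (NOT aikit's actual logic):
    numeric dtype -> NUM, few distinct values -> CAT, otherwise TEXT."""
    if pd.api.types.is_numeric_dtype(series):
        return "NUM"
    if series.nunique(dropna=True) <= max_modalities:
        return "CAT"
    return "TEXT"


# Made-up data, loosely shaped like the titanic columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "male"],
    "Name": ["passenger %d" % i for i in range(8)],
})

guessed = {col: guess_type_of_variable(toy[col]) for col in toy.columns}

# An integer-coded category is seen as NUM by this rule; casting it to
# strings makes its categorical nature visible to the same rule.
toy["Pclass"] = toy["Pclass"].astype(str)
guessed_after_cast = {col: guess_type_of_variable(toy[col]) for col in toy.columns}

print(guessed)
print(guessed_after_cast)
```

Whether aikit reacts to such a cast in the same way is not verified here; the sketch only shows why an integer-coded column can end up typed as NUM.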


#### columns block

In [8]:
auto_ml_config.columns_block

OrderedDict([('NUM', ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']),
             ('TEXT', ['Name', 'Ticket']),
             ('CAT', ['Sex', 'Cabin', 'Embarked'])])

The ml machine has the notion of <i>block of columns</i>.
For some use cases, features naturally fall into blocks. By default the tool uses the type of feature as blocks, but other groupings can be used.

The ml machine will sometimes try to create a model without one of the blocks
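
As a rough illustration of the idea, the sketch below rebuilds the default blocks from the column types shown earlier and lists the columns that remain when one block is left out. The helpers `blocks_from_types` and `leave_one_block_out` are hypothetical names written for this notebook, not part of aikit.

```python
from collections import OrderedDict

# Copied from the `columns_informations` output above (only the keys used here)
columns_informations = OrderedDict([
    ("Pclass",   {"ToKeep": True, "TypeOfVariable": "NUM"}),
    ("Name",     {"ToKeep": True, "TypeOfVariable": "TEXT"}),
    ("Sex",      {"ToKeep": True, "TypeOfVariable": "CAT"}),
    ("Age",      {"ToKeep": True, "TypeOfVariable": "NUM"}),
    ("SibSp",    {"ToKeep": True, "TypeOfVariable": "NUM"}),
    ("Parch",    {"ToKeep": True, "TypeOfVariable": "NUM"}),
    ("Ticket",   {"ToKeep": True, "TypeOfVariable": "TEXT"}),
    ("Fare",     {"ToKeep": True, "TypeOfVariable": "NUM"}),
    ("Cabin",    {"ToKeep": True, "TypeOfVariable": "CAT"}),
    ("Embarked", {"ToKeep": True, "TypeOfVariable": "CAT"}),
])


def blocks_from_types(infos):
    """Group the kept columns by their guessed type (hypothetical helper)."""
    blocks = OrderedDict()
    for column, info in infos.items():
        if info["ToKeep"]:
            blocks.setdefault(info["TypeOfVariable"], []).append(column)
    return blocks


def leave_one_block_out(blocks):
    """Yield (dropped_block, remaining_columns) for each block."""
    for dropped in blocks:
        remaining = [col for name, cols in blocks.items()
                     if name != dropped for col in cols]
        yield dropped, remaining


blocks = blocks_from_types(columns_informations)
without_block = dict(leave_one_block_out(blocks))

print(blocks)
print(without_block["TEXT"])  # columns used by a model built without the TEXT block
```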

#### needed steps

In [9]:
auto_ml_config.needed_steps

[{'optional': True, 'step': 'TextPreprocessing'},
 {'optional': False, 'step': 'TextEncoder'},
 {'optional': True, 'step': 'TextDimensionReduction'},
 {'optional': False, 'step': 'CategoryEncoder'},
 {'optional': False, 'step': 'MissingValueImputer'},
 {'optional': True, 'step': 'Scaling'},
 {'optional': True, 'step': 'DimensionReduction'},
 {'optional': True, 'step': 'FeatureExtraction'},
 {'optional': True, 'step': 'FeatureSelection'},
 {'optional': False, 'step': 'Model'}]

The ml machine creates processing pipelines by assembling different steps.
Here are the steps it will use for this use case :
 
 * TextPreprocessing
 * TextEncoder : encoding of text into numerical values
 * TextDimensionReduction : specific dimension reduction for text-based features
 
 * CategoryEncoder : encoder of categorical data
 * MissingValueImputer : since there are missing values, they need to be filled
 * Scaling : step to re-scale features
 * DimensionReduction  : generic dimension reduction
 
 * FeatureExtraction : create new features
 * FeatureSelection  : select features
 
 * Model : the final classification/regression model
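
The role of the `optional` flag can be sketched as follows: mandatory steps are always part of a pipeline, optional ones may or may not be. The function `draw_steps` and the probability `proba_optional` are assumptions made for this illustration; how aikit actually chooses among optional steps is not shown in this notebook.

```python
import random

# Copied from the `needed_steps` output above
needed_steps = [
    {"optional": True,  "step": "TextPreprocessing"},
    {"optional": False, "step": "TextEncoder"},
    {"optional": True,  "step": "TextDimensionReduction"},
    {"optional": False, "step": "CategoryEncoder"},
    {"optional": False, "step": "MissingValueImputer"},
    {"optional": True,  "step": "Scaling"},
    {"optional": True,  "step": "DimensionReduction"},
    {"optional": True,  "step": "FeatureExtraction"},
    {"optional": True,  "step": "FeatureSelection"},
    {"optional": False, "step": "Model"},
]


def draw_steps(needed_steps, rng, proba_optional=0.5):
    """Keep every mandatory step; keep each optional step with
    probability `proba_optional`. The original order is preserved."""
    return [s["step"] for s in needed_steps
            if not s["optional"] or rng.random() < proba_optional]


rng = random.Random(123)
for _ in range(3):
    print(draw_steps(needed_steps, rng))
```

Each printed list is one possible pipeline skeleton; a concrete model or transformer still has to be picked for every step it contains.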


#### models to keep

In [10]:
auto_ml_config.models_to_keep

[('Model', 'LogisticRegression'),
 ('Model', 'RandomForestClassifier'),
 ('Model', 'ExtraTreesClassifier'),
 ('Model', 'LGBMClassifier'),
 ('FeatureSelection', 'FeaturesSelectorClassifier'),
 ('TextEncoder', 'CountVectorizerWrapper'),
 ('TextEncoder', 'Word2VecVectorizer'),
 ('TextEncoder', 'Char2VecVectorizer'),
 ('TextPreprocessing', 'TextNltkProcessing'),
 ('TextPreprocessing', 'TextDefaultProcessing'),
 ('TextPreprocessing', 'TextDigitAnonymizer'),
 ('CategoryEncoder', 'NumericalEncoder'),
 ('CategoryEncoder', 'TargetEncoderClassifier'),
 ('MissingValueImputer', 'NumImputer'),
 ('DimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'PCAWrapper'),
 ('TextDimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'KMeansTransformer'),
 ('Scaling', 'CdfScaler')]

This gives us the list of models/transformers to test at each step.

Remarks:
* some steps are removed because they have no transformer yet
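
A quick way to see which steps have candidates is to group the `(step, name)` pairs by step and compare them with the needed steps. The sketch below only uses the two outputs printed above; the variable names are chosen for this illustration.

```python
from collections import defaultdict

# Step names copied from the `needed_steps` output above
needed_step_names = [
    "TextPreprocessing", "TextEncoder", "TextDimensionReduction",
    "CategoryEncoder", "MissingValueImputer", "Scaling",
    "DimensionReduction", "FeatureExtraction", "FeatureSelection", "Model",
]

# Copied from the `models_to_keep` output above
models_to_keep = [
    ("Model", "LogisticRegression"),
    ("Model", "RandomForestClassifier"),
    ("Model", "ExtraTreesClassifier"),
    ("Model", "LGBMClassifier"),
    ("FeatureSelection", "FeaturesSelectorClassifier"),
    ("TextEncoder", "CountVectorizerWrapper"),
    ("TextEncoder", "Word2VecVectorizer"),
    ("TextEncoder", "Char2VecVectorizer"),
    ("TextPreprocessing", "TextNltkProcessing"),
    ("TextPreprocessing", "TextDefaultProcessing"),
    ("TextPreprocessing", "TextDigitAnonymizer"),
    ("CategoryEncoder", "NumericalEncoder"),
    ("CategoryEncoder", "TargetEncoderClassifier"),
    ("MissingValueImputer", "NumImputer"),
    ("DimensionReduction", "TruncatedSVDWrapper"),
    ("DimensionReduction", "PCAWrapper"),
    ("TextDimensionReduction", "TruncatedSVDWrapper"),
    ("DimensionReduction", "KMeansTransformer"),
    ("Scaling", "CdfScaler"),
]

# Group the candidates by step
candidates = defaultdict(list)
for step, name in models_to_keep:
    candidates[step].append(name)

# Steps listed as needed but for which no model/transformer is available
steps_without_candidate = [s for s in needed_step_names if s not in candidates]

print(dict(candidates))
print(steps_without_candidate)
```

With the outputs shown in this notebook, the only needed step without any candidate is FeatureExtraction, which seems to be the kind of step the remark refers to.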

## job configuration

In [11]:
job_config = JobConfig()
job_config.guess_cv(auto_ml_config = auto_ml_config, n_splits = 10)
job_config.guess_scoring(auto_ml_config = auto_ml_config)

job_config.score_base_line = None

In [12]:
job_config.scoring

['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']

In [13]:
job_config.cv

StratifiedKFold(n_splits=10, random_state=123, shuffle=True)

In [14]:
job_config.main_scorer

'accuracy'

In [15]:
job_config.score_base_line

The baseline can be set if we know what a good performance is.
It is used as the threshold below which cross-validation is stopped after the first fold
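
The early-stopping idea can be sketched with a small stand-alone loop. This is not aikit's implementation: `cross_validate_with_baseline` and `score_fold` are hypothetical names, the scores are made up, and the sketch assumes that a higher score is better.

```python
def cross_validate_with_baseline(score_fold, n_splits, score_base_line=None):
    """Score the folds one by one and stop after the first fold
    if its score is below the baseline.

    `score_fold(fold_index)` is a hypothetical callable that fits and
    scores the model on one fold (higher is assumed to be better).
    """
    scores = []
    for fold in range(n_splits):
        score = score_fold(fold)
        scores.append(score)
        if fold == 0 and score_base_line is not None and score < score_base_line:
            break  # not promising: skip the remaining folds
    return scores


# Made-up fold scores standing in for real model fits
weak_model = lambda fold: [0.55, 0.80, 0.80][fold]
good_model = lambda fold: [0.75, 0.78, 0.81][fold]

print(cross_validate_with_baseline(weak_model, 3, score_base_line=0.7))
print(cross_validate_with_baseline(good_model, 3, score_base_line=0.7))
print(cross_validate_with_baseline(weak_model, 3, score_base_line=None))
```

With `score_base_line = None`, as set in the cell above, no model is ever stopped early.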

This object has the specific configuration for the job to do :
* how to cross validate
* what scoring/benchmark to use

## Data Persister
To synchronize processes and to save values, we need an object to take care of that.

This object is a DataPersister, which saves everything on disk
(other persisters, using a database for instance, might be created)
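
To give an intuition of what such an object does, here is a minimal folder-based persister storing JSON files. `TinyFolderPersister` is a hypothetical stand-in written for this illustration; it is not aikit's `FolderDataPersister` and its method signatures are assumptions.

```python
import json
import os
import tempfile


class TinyFolderPersister:
    """Hypothetical stand-in, NOT aikit's FolderDataPersister."""

    def __init__(self, base_folder):
        self.base_folder = base_folder

    def _filename(self, key, path):
        # One sub-folder per kind of data, one JSON file per key
        folder = os.path.join(self.base_folder, path)
        os.makedirs(folder, exist_ok=True)
        return os.path.join(folder, "%s.json" % key)

    def write(self, data, key, path):
        with open(self._filename(key, path), "w") as f:
            json.dump(data, f)

    def read(self, key, path):
        with open(self._filename(key, path)) as f:
            return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    writer = TinyFolderPersister(tmp)  # e.g. used by one process
    reader = TinyFolderPersister(tmp)  # e.g. used by another process
    writer.write({"model": "LogisticRegression", "C": 1.0}, key="job_0", path="job_param")
    loaded = reader.read(key="job_0", path="job_param")

print(loaded)
```

Because both objects point to the same folder, whatever one process writes can be read by another, which is what lets separate processes share their state.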

In [16]:
base_folder = # INSERT PATH HERE
data_persister = FolderDataPersister(base_folder = base_folder)

## controller

In [17]:
result_reader = AutoMlResultReader(data_persister)
auto_ml_guider = AutoMlModelGuider(result_reader = result_reader,
                                   job_config = job_config,
                                   metric_transformation = "default",
                                   avg_metric = True)

job_controller = MlJobManager(auto_ml_config = auto_ml_config,
                              job_config = job_config,
                              auto_ml_guider = auto_ml_guider,
                              data_persister = data_persister)

The search is driven by a <i>controller</i> process. This process won't actually train models, but it decides which models should be tried.

Here three objects are actually created :
 * result_reader : its job is to read the results of the auto-ml process and aggregate them
 
 * auto_ml_guider : its job is to help the controller <i>guide</i> the search (using a Bayesian technique)
 
 * job_controller : the controller
 
All those objects need the 'data_persister' object to write/read data

Now the controller can be started using:

<span style="color:red"><b>job_controller.run()</b></span>
You need to launch it in a subprocess


## Worker(s)

The last thing needed is to create one or more workers that will do the actual cross-validation.
Those workers will :
 * listen to the controller
 * do the cross-validation of the models they are told to test
 * save the results

In [18]:
job_runner = MlJobRunner(dfX = dfX , 
                       y = y, 
                       groups = None,
                       auto_ml_config = auto_ml_config, 
                       job_config = job_config,
                       data_persister = data_persister)


As with the controller, the worker can be started using :
<span style="color:red"><b>job_runner.run()</b></span>

You need to launch it in a subprocess or a thread
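
The controller/worker split can be illustrated with a small stand-alone sketch based on threads. It does not use aikit: the `controller` and `worker` functions are hypothetical, the scores are fake, and jobs are exchanged through an in-memory queue, whereas in aikit the processes communicate through the data persister.

```python
import queue
import threading


def controller(job_queue, n_jobs, n_workers):
    """Decide which jobs to try; never trains anything itself."""
    for job_id in range(n_jobs):
        job_queue.put(job_id)
    for _ in range(n_workers):
        job_queue.put(None)  # one stop signal per worker


def worker(job_queue, results, lock):
    """Take the jobs proposed by the controller, 'evaluate' them, save the result."""
    while True:
        job_id = job_queue.get()
        if job_id is None:
            break
        score = (job_id * 37 % 100) / 100  # fake score standing in for a cross-validation
        with lock:
            results[job_id] = score


job_queue = queue.Queue()
results = {}
lock = threading.Lock()
n_workers = 2

threads = [threading.Thread(target=controller, args=(job_queue, 5, n_workers))]
threads += [threading.Thread(target=worker, args=(job_queue, results, lock))
            for _ in range(n_workers)]

for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.items()))
```

Adding workers only changes `n_workers`; the controller logic stays the same, which is the point of separating the two roles.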

## Result Reader
Once a few models have been tested you can look at the results. For that you need the 'result_reader' (which I re-create here for simplicity)

In [19]:
base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

result_reader = AutoMlResultReader(data_persister)


In [20]:
df_results = result_reader.load_all_results()
df_params  = result_reader.load_all_params()
df_errors  = result_reader.load_all_errors()

* df_results : DataFrame with the scoring results
* df_params  : DataFrame with the parameters of the complete processing pipeline
* df_errors  : DataFrame with the errors

All those DataFrames can be joined using the common 'job_id' column

In [21]:
df_merged_result = pd.merge( df_params, df_results, how = "inner",on = "job_id")
df_merged_error  = pd.merge( df_params, df_errors , how = "inner",on = "job_id")
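
The effect of these inner joins can be checked on toy DataFrames. Everything below is made up for illustration: apart from `job_id`, the column names (`model`, `test_accuracy`, `error`) and the values are assumptions and may not match what the result reader really returns.

```python
import pandas as pd

# Toy stand-ins for the three DataFrames (made-up values)
df_params = pd.DataFrame({
    "job_id": ["a", "b", "c"],
    "model": ["LogisticRegression", "RandomForestClassifier", "LGBMClassifier"],
})
df_results = pd.DataFrame({"job_id": ["a", "c"], "test_accuracy": [0.79, 0.82]})
df_errors = pd.DataFrame({"job_id": ["b"], "error": ["ValueError"]})

# Inner joins keep only the jobs present on both sides
merged_result = pd.merge(df_params, df_results, how="inner", on="job_id")
merged_error = pd.merge(df_params, df_errors, how="inner", on="job_id")

# Best job according to the (hypothetical) score column
best = merged_result.sort_values("test_accuracy", ascending=False).iloc[0]

print(merged_result)
print(merged_error)
print(best["model"])
```

A job appears in the first merge only if it produced a score and in the second only if it failed, so the two merged DataFrames split the jobs between successes and errors.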


The results can then be written to an Excel file (for example)

In [22]:
try:
    df_merged_result.to_excel(base_folder + "/result.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

try:
    df_merged_error.to_excel(base_folder + "/result_error.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")


### Load a given model ####

In [23]:
from aikit.ml_machine import FolderDataPersister, SavingType
from aikit.model_definition import sklearn_model_from_param

base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)


In [None]:
job_id    = # INSERT job_id here 
job_param = data_persister.read(job_id, path = "job_param", write_type = SavingType.json)
job_param

In [None]:
model = sklearn_model_from_param(job_param)
model