<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/finetuning/Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning

Finding a suitable model for your ML problem is very important, because not every model performs equally well on every task. Some models are too simple for a problem (underfitting) and some are too complex (overfitting). Each model also has hyperparameters that affect its performance. Libraries therefore exist that search for an appropriate model and its hyperparameters. Popular ones are [auto-sklearn](https://papers.neurips.cc/paper/2015/hash/11d0e6287202fced83f79975ec59a3a6-Abstract.html) and [hyperopt](https://github.com/hyperopt/hyperopt).
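
As a small illustration of how a single hyperparameter moves a model between underfitting and overfitting, the sketch below cross-validates a decision tree on the wine data at two depths. The choice of model, the two `max_depth` values and the `*_demo` names are assumptions made for this example only; they are not part of DataFactory.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

results = {}
for max_depth in (1, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    cv = cross_validate(clf, X_demo, y_demo, cv=5, return_train_score=True)
    # store (mean training accuracy, mean validation accuracy)
    results[max_depth] = (cv['train_score'].mean(), cv['test_score'].mean())
    print(f"max_depth={max_depth}: train={results[max_depth][0]:.3f}, "
          f"validation={results[max_depth][1]:.3f}")
```

A depth-1 tree has only two leaves and cannot separate three classes, so it scores poorly even on the training folds (underfitting). The unrestricted tree fits the training folds perfectly; any gap to its validation score is the part that does not generalize (overfitting).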

## Import packages

In [1]:
import sys

In [2]:
if 'google.colab' in sys.modules:
    !git clone https://github.com/sdsc-bw/DataFactory.git
    !ls
    
    !sudo apt-get install build-essential swig
    !curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
    !pip install auto-sklearn
    
    !pip install scipy
    
    !pip install tsai

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.utils import shuffle
from hyperopt import hp

if 'google.colab' in sys.modules:
    root = 'DataFactory/'
else:
    root = '../'
    
sys.path.insert(0, root + "codes")

from DataFactory import DataFactory

os             : Windows-10-10.0.19041-SP0
python         : 3.8.12
tsai           : 0.2.23
fastai         : 2.5.3
fastcore       : 1.3.27
torch          : 1.9.1+cpu
n_cpus         : 8
device         : cpu


## Load dataset: Wine dataset

In [5]:
datafactory = DataFactory()

In [6]:
dataset = load_wine()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['class'] = pd.Series(dataset.target)
df = shuffle(df)
X, y = datafactory.preprocess(df, y_col='class')

2021-12-21 15:51:21,112 - DataFactory - INFO - Remove columns with NAN-values of target feature: class
2021-12-21 15:51:21,115 - DataFactory - INFO - Start to transform the categorical columns...
2021-12-21 15:51:21,118 - DataFactory - INFO - ...End with categorical feature transformation
2021-12-21 15:51:21,119 - DataFactory - INFO - Start to clean the given dataframe...
2021-12-21 15:51:21,120 - DataFactory - INFO - Number of INF- and NAN-values are: (0, 0)
2021-12-21 15:51:21,121 - DataFactory - INFO - Set type to float32 at first && deal with INF
2021-12-21 15:51:21,122 - DataFactory - INFO - Remove columns with half of NAN-values
2021-12-21 15:51:21,125 - DataFactory - INFO - Remove constant columns
2021-12-21 15:51:21,129 - DataFactory - INFO - ...End with Data cleaning, number of INF- and NAN-values are now: (0, 0)


## How To Use DataFactory

### Hyperopt

We provide a function for using hyperopt. Some of the models require hyperopt for finetuning.

First we define the models that we want to test. A *params* variable defines the strategy for exploring the search space and can also define the search space of each model. If it gives no parameters for a model, our standard search space is used, as we do here:

In [7]:
# list with models to try out
models = ['decision_tree', 'random_forest', 'adaboost', 'inception_time']

In [8]:
# the reported loss is the negative F1 score (hyperopt minimizes, so lower is better)
model = datafactory.finetune(X, y, method='hyperopt', models=models, cv=3, max_evals=32, mtype='C')

100%|███████████████████████████████████████████████| 32/32 [06:54<00:00, 12.96s/trial, best loss: -0.9774952919020716]


| | Model | Score | Hyperparams | Time |
|---|---|---|---|---|
| 0 | random_forest | 0.977495 | `{'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}` | 0.185534 |
| 1 | random_forest | 0.966196 | `{'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}` | 0.623314 |
| 2 | random_forest | 0.96064 | `{'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}` | 0.268947 |
| 3 | random_forest | 0.96064 | `{'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}` | 0.414587 |
| 4 | random_forest | 0.960546 | `{'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}` | 0.298953 |
| 5 | random_forest | 0.960546 | `{'max_depth': 1, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}` | 0.594014 |
| 6 | random_forest | 0.960546 | `{'max_depth': 50, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}` | 0.322212 |
| 7 | random_forest | 0.954991 | `{'max_depth': 1, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}` | 0.152612 |
| 8 | random_forest | 0.949341 | `{'max_depth': 1, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200}` | 0.590016 |
| 9 | random_forest | 0.949247 | `{'max_depth': 1, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}` | 0.154468 |


If we want to define custom parameters, they must be defined with hyperopt's functions. See the [sklearn](https://scikit-learn.org/stable/) and [tsai](https://github.com/timeseriesAI/tsai) websites for the hyperparameters of the models. Attention: the label of each hyperparameter must be unique across all parameters (also between models).

In [9]:
models = ['decision_tree', 'random_forest', 'adaboost', 'inception_time']
# attention: the label has to be unique for every parameter (also between models)
dt_params = {'max_depth': hp.quniform('max_depth_dt', 1, 10, 1), 'criterion': hp.choice('criterion_dt', ['gini', 'entropy']), 'min_samples_leaf': hp.choice('min_samples_leaf_dt', [1, 2, 4])}
rf_params = {'max_depth': hp.choice('max_depth_rf', [1, 2, 3, 5, 10, 20, 50]), 'n_estimators': hp.choice('n_estimators_rf', [50, 100, 200])}
ab_params = {'n_estimators': hp.choice('n_estimators_ab', [50, 100, 200]), 'learning_rate': hp.choice('learning_rate_ab', [0.001,0.01,.1,1.0])}
# tsai uses a learner, you also can finetune its parameters
it_learner_params = {'epochs': hp.choice('epochs_it', [25, 50]), 'lr_max': hp.choice('lr_max_it', [1e-3, 1e-5]), 'opt_func':  hp.choice('optimizer_it', ['adam']), 'loss_func': hp.choice('loss_it', ['mse']), 'batch_tfms': ['standardize'], 'batch_size': [64, 128], 'splits': None, 'metrics': ['accuracy']}
it_params = {'learner': it_learner_params, 'nf': hp.choice('nf_it', [32, 64]), 'nb_filters': hp.choice('nb_filters_it', [32, 64, 96, 128])}
# put each model's hyperparameter definitions in their own dictionary
# the search strategy for the hyperparameters must be one of ['parzen', 'random']
params = {'strategy': 'random', 'decision_tree': dt_params, 'random_forest': rf_params, 'adaboost': ab_params, 'inception_time': it_params}

In [10]:
model = datafactory.finetune(X, y, method='hyperopt', models=models, cv=3, mtype='C', params=params.copy())

100%|███████████████████████████████████████████████| 32/32 [10:45<00:00, 20.19s/trial, best loss: -0.9775894538606403]


| | Model | Score | Hyperparams | Time |
|---|---|---|---|---|
| 0 | random_forest | 0.977589 | `{'max_depth': 10, 'n_estimators': 50}` | 0.161161 |
| 1 | random_forest | 0.97194 | `{'max_depth': 2, 'n_estimators': 100}` | 0.317152 |
| 2 | random_forest | 0.97194 | `{'max_depth': 20, 'n_estimators': 100}` | 0.434929 |
| 3 | random_forest | 0.971846 | `{'max_depth': 5, 'n_estimators': 100}` | 0.307173 |
| 4 | random_forest | 0.938041 | `{'max_depth': 1, 'n_estimators': 100}` | 0.391893 |
| 5 | decision_tree | 0.926836 | `{'criterion': 'entropy', 'max_depth': 5.0, 'min_samples_leaf': 1}` | 0.010923 |
| 6 | random_forest | 0.926648 | `{'max_depth': 1, 'n_estimators': 50}` | 0.165067 |
| 7 | decision_tree | 0.921375 | `{'criterion': 'entropy', 'max_depth': 4.0, 'min_samples_leaf': 1}` | 0.020945 |
| 8 | adaboost | 0.910264 | `{'learning_rate': 0.1, 'n_estimators': 50}` | 0.174531 |
| 9 | adaboost | 0.910264 | `{'learning_rate': 0.1, 'n_estimators': 50}` | 0.257342 |
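
The unique-label rule for custom hyperopt parameters can be checked on plain strings before a search is started, since hyperopt rejects a search space in which two different parameters share a label. The sketch below mirrors the suffix convention used in the notebook (`max_depth_dt`, `max_depth_rf`, ...); the `SUFFIXES` mapping and the helpers `make_label` and `duplicate_labels` are hypothetical names introduced only for this example.

```python
from collections import Counter

# Hypothetical short suffixes, one per model, following the notebook's naming.
SUFFIXES = {'decision_tree': 'dt', 'random_forest': 'rf', 'adaboost': 'ab'}

def make_label(model, param):
    """Build a label such as 'max_depth_rf' from a model name and a parameter."""
    return f"{param}_{SUFFIXES[model]}"

def duplicate_labels(labels):
    """Return the labels that occur more than once, sorted alphabetically."""
    return sorted(label for label, count in Counter(labels).items() if count > 1)

# Parameters we would like to tune for each model.
wanted = {
    'decision_tree': ['max_depth', 'criterion'],
    'random_forest': ['max_depth', 'n_estimators'],
    'adaboost': ['n_estimators', 'learning_rate'],
}

# Using the bare parameter names as labels collides between models ...
bare = [p for params in wanted.values() for p in params]
print(duplicate_labels(bare))      # ['max_depth', 'n_estimators']

# ... while suffixed labels are unique.
labelled = [make_label(m, p) for m, params in wanted.items() for p in params]
print(duplicate_labels(labelled))  # []
```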


### Native Search

sklearn and tsai also provide functions, or propose methods, for tuning the hyperparameters of a specific model. We implemented a function that uses them to find the best model.

In [11]:
# list with models to try out
models = ['decision_tree', 'random_forest', 'adaboost']

In [12]:
model = datafactory.finetune(X, y, method='native', models=models, cv=5, mtype='C')

| | Model | Best Score | Best Hyperparams | Time |
|---|---|---|---|---|
| 0 | adaboost | 1.0 | `{'algorithm': 'SAMME.R', 'base_estimator': None, 'learning_rate': 0.1, 'n_estimators': 100, 'random_state': None}` | 1.53336 |
| 1 | decision_tree | 0.977904 | `{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 8, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}` | 2.070426 |
| 2 | random_forest | 0.955043 | `{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 2, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 10, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}` | 4.799752 |
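
One of sklearn's own tuning utilities is `GridSearchCV`, which cross-validates every combination in a parameter grid. The sketch below runs it directly on a decision tree. Whether DataFactory's native search calls exactly this class is an assumption, and the small grid, the macro-averaged F1 metric and the `*_demo` names are chosen only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

# Every combination of these values is evaluated with 5-fold cross-validation.
param_grid = {'max_depth': [2, 5, 9], 'min_samples_leaf': [1, 2, 4]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='f1_macro')
search.fit(X_demo, y_demo)

print(search.best_params_)  # the winning combination from param_grid
print(search.best_score_)   # its mean cross-validated score
```

After fitting, `search.best_estimator_` holds the model refitted on all of the data with the winning parameters, and `search.best_estimator_.get_params()` lists every parameter of that model, including the ones left at their defaults.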


Here we define a custom search space:

In [13]:
# dict with params for every model to try out (the search strategy for the hyperparameters must be one of ['grid', 'random'])
# note: min_samples_split starts at 2 because sklearn rejects an integer value of 1
params = {'strategy': 'random', 'decision_tree': {"criterion": ['gini', 'entropy'], "max_depth": range(1, 50), "min_samples_split": range(2, 20), "min_samples_leaf": range(1, 5)}, 'random_forest': {'max_depth': [1, 2, 3, 5, 10, 20, 50], 'min_samples_leaf': [1, 5, 10], 'min_samples_split': [2, 5, 10], 'n_estimators': [50, 100, 200]}, 'adaboost': {'n_estimators': [50, 100, 200], 'learning_rate':[0.001,0.01,.1]}}

In [14]:
model = datafactory.finetune(X, y, method='native', models=models, cv=5, mtype='C', params=params)

| | Model | Best Score | Best Hyperparams | Time |
|---|---|---|---|---|
| 0 | random_forest | 1.0 | `{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 10, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 10, 'min_samples_split': 10, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}` | 4.266476 |
| 1 | decision_tree | 0.912026 | `{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 14, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 13, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}` | 0.25003 |
| 2 | adaboost | 0.911111 | `{'algorithm': 'SAMME.R', 'base_estimator': None, 'learning_rate': 0.01, 'n_estimators': 200, 'random_state': None}` | 1.625872 |
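
The difference between the `'grid'` and `'random'` strategies can be seen by counting candidates for the random forest search space used above. This is a plain-Python sketch of the two ideas, not DataFactory's code: the budget `n_iter` and the seeded generator are assumptions made for the example.

```python
import itertools
import random

rf_space = {
    'max_depth': [1, 2, 3, 5, 10, 20, 50],
    'min_samples_leaf': [1, 5, 10],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [50, 100, 200],
}

# Grid strategy: every combination of the listed values (7 * 3 * 3 * 3).
names = list(rf_space)
grid = [dict(zip(names, values))
        for values in itertools.product(*rf_space.values())]

# Random strategy: a fixed budget of candidates, each value drawn
# independently, so repeated candidates are possible.
rng = random.Random(0)
n_iter = 10  # hypothetical budget
random_candidates = [
    {name: rng.choice(options) for name, options in rf_space.items()}
    for _ in range(n_iter)
]

print(len(grid), len(random_candidates))  # 189 10
```

With cross-validation, each candidate costs one model fit per fold, so the grid grows multiplicatively with every added parameter while the random budget stays fixed.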


### Auto-sklearn

Auto-sklearn requires a Linux OS; on other systems it can be run on Colab. It is an automated machine learning toolkit built on sklearn models: it automatically trains different ML models with different hyperparameters and selects the best one at the end. In the DataFactory you can use it like this:

In [None]:
model, score = datafactory.finetune(X, y, method='auto_sklearn', mtype='C')

In [None]:
score
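
For comparison, calling auto-sklearn without DataFactory looks roughly like the sketch below. The wrapper `fit_auto_sklearn` and its default time budgets are hypothetical and chosen for illustration; how DataFactory configures auto-sklearn internally is not shown here.

```python
def fit_auto_sklearn(X_train, y_train, time_budget_s=120, per_run_s=30):
    """Run an auto-sklearn search within a time budget and return the fitted estimator."""
    # Imported lazily because auto-sklearn can only be installed on Linux (or Colab).
    from autosklearn.classification import AutoSklearnClassifier

    automl = AutoSklearnClassifier(
        time_left_for_this_task=time_budget_s,  # total search budget in seconds
        per_run_time_limit=per_run_s,           # cap for a single model fit in seconds
    )
    automl.fit(X_train, y_train)
    return automl
```

The returned object follows the usual sklearn interface, so `fit_auto_sklearn(X_train, y_train).predict(X_test)` would produce predictions from the models selected during the search.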