<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/demos/04_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning

Finding a suitable model for your ML problem is very important. Not every model performs equally well on every task: some models are too simple for a problem (underfitting), and some are too complex (overfitting). A model's hyperparameters also affect its performance. Therefore, libraries exist that search for an appropriate model and its hyperparameters. In this repository we use [hyperopt](https://github.com/hyperopt/hyperopt).

This notebook compares the F1 scores of several models on different datasets.

The F1 score is the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
The higher the F1 score, the better the prediction. Precision and recall are defined as:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$
where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
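
As a small worked example of the formulas above, the following sketch computes precision, recall, and F1 from raw counts. The function name `precision_recall_f1`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts.

    Assumption: a zero denominator yields 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f}, recall={r:.3f}, F1={f1:.3f}")
```

With these counts, precision is 8/10 and recall is 8/12, so F1 equals 16/22, which lies between the two values and closer to the smaller one.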

# How to Use in DataFactory

## Import packages

In [1]:
# if running in colab
import sys
if 'google.colab' in sys.modules:
    !git clone https://github.com/sdsc-bw/DataFactory.git # clone repository for colab
    !ls
    
    !pip3 install scipy==1.5 # install scipy to use hyperopt, RESTART RUNTIME AFTER THAT
    
    !pip3 install mlflow # install mlflow to use hyperopt
    
    # install auto-sklearn
    !sudo apt-get install build-essential swig
    !curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
    !pip install auto-sklearn
    
    !pip install tsai # install tsai

In [2]:
import warnings # ignore irrelevant warnings
warnings.filterwarnings('ignore')

In [5]:
import matplotlib.pyplot as plt # library used for visualization
import pandas as pd # library for creating tables
from hyperopt import hp # library for finetuning and defining search spaces
from sklearn.datasets import load_wine # wine dataset

# add the repository root to the path so that datafactory can be imported
import sys
if 'google.colab' in sys.modules:
    root = 'DataFactory/'
else:
    root = '../'
sys.path.append(root)

from datafactory.preprocessing.cleaning import clean_data # method to clean data
from datafactory.preprocessing.loading import split_data # method to split into training and test data
from datafactory.finetuning.finetuning_hyperopt import finetune_hyperopt # method to finetune with hyperopt
from datafactory.finetuning.finetuning_auto_sklearn import finetune_auto_sklearn # method to finetune with auto-sklearn

## Load dataset: Wine dataset

In [6]:
dataset = load_wine()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['class'] = pd.Series(dataset.target)
df = clean_data(df)
X, y, _ = split_data(df)

2022-01-16 16:10:27,807 - Start to clean the given dataframe...
2022-01-16 16:10:27,811 - Number of INF- and NAN-values are: (0, 0)
2022-01-16 16:10:27,812 - Set type to float32 at first && deal with INF
2022-01-16 16:10:27,814 - Remove columns with half of NAN-values
2022-01-16 16:10:27,816 - Remove constant columns
2022-01-16 16:10:27,821 - ...End with Data cleaning, number of INF- and NAN-values are now: (0, 0)
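
The log lines above name the cleaning steps. As a rough illustration (not DataFactory's actual `clean_data` implementation), similar steps could look like this in plain pandas. The function name `clean_sketch`, the toy DataFrame, and the reading of "half of NAN-values" as "at least 50% missing" are assumptions made for this sketch. It also leaves any remaining NaN values in place, whereas how `clean_data` treats them is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the logged cleaning steps."""
    out = df.astype("float32")                      # set type to float32 first
    out = out.replace([np.inf, -np.inf], np.nan)    # treat INF like a missing value
    out = out.loc[:, out.isna().mean() < 0.5]       # drop columns with >= 50% NaN (assumed threshold)
    out = out.loc[:, out.nunique(dropna=True) > 1]  # drop constant columns
    return out


toy = pd.DataFrame({
    "a": [1.0, 2.0, np.inf, 4.0],        # contains one INF value
    "b": [np.nan, np.nan, np.nan, 1.0],  # mostly missing
    "c": [7.0, 7.0, 7.0, 7.0],           # constant
    "d": [0.1, 0.2, 0.3, 0.4],
})
cleaned = clean_sketch(toy)
print(list(cleaned.columns))
```

In this toy table, column `b` is dropped for being mostly missing and column `c` for being constant, while the INF value in `a` becomes NaN.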


## How To Use DataFactory

### Hyperopt

We provide a function that uses hyperopt for finetuning. Simply create a list of the models you want to try out. A standard search space is provided for every model.

In [7]:
# list with models to try out
models = ['decision_tree', 'random_forest', 'ada_boost', 'inception_time', 'res_net']

In [8]:
# the reported loss is the negative F1 score (-F1), because hyperopt minimizes the loss
# search strategy must be one of ['parzen', 'random']
model = finetune_hyperopt(X, y, strategy='random', models=models, cv=5, max_evals=32, mtype='C')

| | Model | Score | Hyperparams | Time |
|---|---|---|---|---|
| 0 | Random Forest | 0.988889 | `{'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | 0.565484 |
| 1 | Random Forest | 0.983333 | `{'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}` | 1.141511 |
| 2 | Random Forest | 0.983175 | `{'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}` | 1.069688 |
| 3 | Random Forest | 0.983175 | `{'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}` | 1.121022 |
| 4 | Random Forest | 0.983175 | `{'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}` | 1.068254 |
| 5 | Random Forest | 0.977619 | `{'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}` | 0.567314 |
| 6 | Random Forest | 0.977619 | `{'max_depth': 2, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 100}` | 0.634305 |
| 7 | Random Forest | 0.966508 | `{'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 50}` | 0.289225 |
| 8 | Random Forest | 0.966349 | `{'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}` | 1.083142 |
| 9 | Random Forest | 0.955397 | `{'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}` | 0.374096 |


100%|███████████████████████████████████████████████| 32/32 [06:08<00:00, 11.53s/trial, best loss: -0.9888888888888889]
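
The progress bar reports the best loss as a negative number because the search minimizes a loss, and the loss here is the negated score. The sketch below shows the idea behind the `'random'` strategy in plain Python; it is not the implementation of `finetune_hyperopt` or of hyperopt. The names `SPACE`, `toy_score`, and `random_search` are hypothetical, and `toy_score` is a made-up stand-in for a cross-validated F1 score.

```python
import random

# Hypothetical discrete search space, loosely modelled on the random forest results above
SPACE = {
    "max_depth": [1, 2, 3, 5, 10],
    "n_estimators": [50, 100, 200],
}


def toy_score(params):
    """Made-up stand-in for a cross-validated F1 score; returns a value in [0.4, 1.0]."""
    return 1.0 - abs(params["max_depth"] - 5) / 10 - abs(params["n_estimators"] - 100) / 1000


def random_search(space, score_fn, max_evals, seed=0):
    """Sample configurations uniformly at random and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        loss = -score_fn(params)  # minimizing -score is the same as maximizing the score
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(SPACE, toy_score, max_evals=32)
print(best_params, best_loss)
```

Random search draws every trial independently, so it is simple and easy to run in parallel. A model-based strategy such as `'parzen'` instead uses the results of earlier trials to decide where to sample next.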


If you want to use a custom search space, define a *params* variable containing the parameters to try out. For models without an entry in *params*, our standard search space is used.

Custom parameters should be defined with the functions of hyperopt (for example `hp.choice`). See the [sklearn](https://scikit-learn.org/stable/) and [tsai](https://github.com/timeseriesAI/tsai) documentation for the hyperparameters of the models. Attention: the identifier (label) of each hyperparameter must be unique, also across models.
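
One simple way to keep labels unique is to append a short model suffix to every parameter name and to check the resulting list before building the search space. The sketch below illustrates this in plain Python. The helpers `make_label` and `find_duplicate_labels` are hypothetical and not part of DataFactory or hyperopt; the suffixes (`dt`, `rf`, `ab`) follow the naming used in this notebook.

```python
from collections import Counter


def make_label(param: str, model_suffix: str) -> str:
    """Build a label such as 'max_depth_dt' from a parameter name and a model suffix."""
    return f"{param}_{model_suffix}"


def find_duplicate_labels(labels):
    """Return the sorted list of labels that occur more than once."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Parameters planned per model (names only, no search ranges yet)
planned = {
    "decision_tree": ["max_depth", "criterion", "min_samples_leaf"],
    "random_forest": ["max_depth", "n_estimators"],
    "ada_boost": ["n_estimators", "learning_rate"],
}
suffixes = {"decision_tree": "dt", "random_forest": "rf", "ada_boost": "ab"}

# Using the bare parameter names as labels would clash between models
bare_labels = [p for params in planned.values() for p in params]
# Appending the model suffix removes the clashes
suffixed_labels = [make_label(p, suffixes[m]) for m, params in planned.items() for p in params]

print(find_duplicate_labels(bare_labels))
print(find_duplicate_labels(suffixed_labels))
```

The first check lists the parameter names shared by several models, and the second check finds no duplicates once the suffixes are applied.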

In [8]:
models = ['decision_tree', 'random_forest', 'ada_boost', 'inception_time', 'res_net']
# attention: the label (first argument of each hp function) has to be unique for every parameter, also across models
decision_tree_params = {'max_depth': hp.quniform('max_depth_dt', 1, 10, 1), 
                        'criterion': hp.choice('criterion_dt', ['gini', 'entropy']), 
                        'min_samples_leaf': hp.choice('min_samples_leaf_dt', [1, 2, 4])}
random_forest_params = {'max_depth': hp.choice('max_depth_rf', [1, 2, 3, 5, 10, 20, 50]), 
                        'n_estimators': hp.choice('n_estimators_rf', [50, 100, 200])}
ada_boost_params = {'n_estimators': hp.choice('n_estimators_ab', [50, 100, 200]), 
                    'learning_rate': hp.choice('learning_rate_ab', [0.001,0.01,.1,1.0])}
inception_time_params = {'epochs': hp.choice('epochs_it', [50, 100, 150]), 
                         'lr_max': 1e-3, 
                         'opt_func':  hp.choice('optimizer_it', ['adam', 'sgd']), 
                         'loss_func': hp.choice('loss_it', ['cross_entropy', 'smooth_cross_entropy']), 
                         'batch_tfms': hp.choice('batch_tfms_it', [['standardize', 'clip', 'mag_scale'], []]), 
                         'batch_size': [64, 128], 
                         'splits': None, 
                         'metrics': ['accuracy'],
                         'nf': hp.choice('nf_it', [32, 64]), 
                         'nb_filters': hp.choice('nb_filters_it', [32, 64, 96, 128])}
res_net_params = {'epochs': hp.choice('epochs__res_net', [50, 100, 150]), 
                  'lr_max': 1e-3, 
                  'opt_func':  hp.choice('optimizer__res_net', ['adam', 'sgd']), 
                  'loss_func': hp.choice('loss__res_net', ['cross_entropy', 'smooth_cross_entropy']), 
                  'batch_tfms': hp.choice('batch_tfms_res_net', [['standardize', 'clip', 'mag_scale'], []]), 
                  'batch_size': [64, 128], 
                  'splits': None, 
                  'metrics': ['accuracy']}
# collect the search spaces of all models in one dictionary, keyed by model name
params = {'decision_tree': decision_tree_params, 
          'random_forest': random_forest_params, 
          'ada_boost': ada_boost_params, 
          'inception_time': inception_time_params, 
          'res_net': res_net_params}

In [9]:
# search strategy must be one of ['parzen', 'random']
model = finetune_hyperopt(X, y, strategy='random', models=models, cv=5, mtype='C', params=params.copy())

| | Model | Score | Hyperparams | Time |
|---|---|---|---|---|
| 0 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 150, 'loss_func': 'smooth_cross_entropy', 'lr_max': 0.001, 'opt_func': 'sgd', 'splits': None}` | 68.652344 |
| 1 | InceptionTime | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': ('standardize', 'clip', 'mag_scale'), 'epochs': 150, 'loss_func': 'cross_entropy', 'lr_max': 0.001, 'nb_filters': 64, 'nf': 32, 'opt_func': 'adam', 'splits': None}` | 104.737286 |
| 2 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 150, 'loss_func': 'cross_entropy', 'lr_max': 0.001, 'opt_func': 'adam', 'splits': None}` | 79.153059 |
| 3 | InceptionTime | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 100, 'loss_func': 'cross_entropy', 'lr_max': 0.001, 'nb_filters': 64, 'nf': 32, 'opt_func': 'sgd', 'splits': None}` | 70.050056 |
| 4 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 150, 'loss_func': 'cross_entropy', 'lr_max': 0.001, 'opt_func': 'sgd', 'splits': None}` | 77.701328 |
| 5 | InceptionTime | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 150, 'loss_func': 'cross_entropy', 'lr_max': 0.001, 'nb_filters': 64, 'nf': 64, 'opt_func': 'adam', 'splits': None}` | 183.858962 |
| 6 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 50, 'loss_func': 'smooth_cross_entropy', 'lr_max': 0.001, 'opt_func': 'sgd', 'splits': None}` | 22.298556 |
| 7 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 50, 'loss_func': 'smooth_cross_entropy', 'lr_max': 0.001, 'opt_func': 'sgd', 'splits': None}` | 29.456243 |
| 8 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': ('standardize', 'clip', 'mag_scale'), 'epochs': 50, 'loss_func': 'smooth_cross_entropy', 'lr_max': 0.001, 'opt_func': 'adam', 'splits': None}` | 23.208064 |
| 9 | ResNet | 1.0 | `{'batch_size': (64, 128), 'batch_tfms': (), 'epochs': 100, 'loss_func': 'cross_entropy', 'lr_max': 0.001, 'opt_func': 'adam', 'splits': None}` | 48.454832 |


100%|██████████████████████████████████████████████████████████████| 32/32 [17:07<00:00, 32.11s/trial, best loss: -1.0]
