Overview

This notebook shows how to tune plankton machine-learning models using the 'tune' class.

This is the first notebook in a set of three:

    - tune.ipynb: tune hyper-parameters to find the best model configuration

    - predict.ipynb: make predictions using the best-fitting model

    - post.ipynb: analyse predictions and calculate metrics such as diversity

Several dependencies need to be installed before running this notebook:

    pandas
    numpy
    scikit-learn
    xgboost
    joblib
    pyyaml

Tuned models and scoring are saved using the following directory structure:

    
    /your_base_path/scoring/xgb/sppA_reg.sav
    /your_base_path/scoring/rf/sppA_reg.sav

    
    /your_base_path/tuning/xgb/sppA_reg.sav
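
The layout above can be reproduced with a small helper. This is only a sketch: `output_path` is a hypothetical function, not part of the 'tune' class, and the stage, model and species names are simply those used in the example paths.

```python
from pathlib import PurePosixPath


def output_path(base_path, stage, model, species, suffix="reg"):
    """Build a path such as <base_path>/scoring/xgb/sppA_reg.sav (hypothetical helper)."""
    return PurePosixPath(base_path) / stage / model / f"{species}_{suffix}.sav"


print(output_path("/your_base_path", "scoring", "xgb", "sppA"))
print(output_path("/your_base_path", "tuning", "xgb", "sppA"))
```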


In [6]:
# import required packages
import pandas as pd
import numpy as np
from tune import tune 
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_regression

from yaml import safe_load, load, dump
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper    
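
The tuning cells read their scoring definitions and parameter grids from `model_config.yml`, which is not shown in this notebook. The sketch below illustrates one possible layout. The key names (`reg_scoring`, `clf_scoring`, `rf_param_grid`, `reg_param_grid`, `clf_param_grid`) are the ones accessed later in the notebook; every value is an assumption made for illustration, and whether 'tune' accepts scoring in this form is not confirmed by the source.

```python
from yaml import safe_load

# Illustrative config only: the keys mirror those read in this notebook,
# but all values are assumed, not the project's real settings.
example_yaml = """
reg_scoring:
  R2: r2
  MAE: neg_mean_absolute_error
clf_scoring:
  accuracy: balanced_accuracy
rf_param_grid:
  reg_param_grid:
    n_estimators: [100]
    max_depth: [3, 5, 10]
  clf_param_grid:
    n_estimators: [100]
    max_depth: [3, 5]
"""

example = safe_load(example_yaml)
reg_param_grid = example["rf_param_grid"]["reg_param_grid"]
print(reg_param_grid)
```

`safe_load` only constructs plain Python objects (dicts, lists, strings, numbers), which is sufficient for a config of this shape.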

In [7]:
# Setting up the model

with open('/home/phyto/planktonSDM/model_config.yml', 'r') as f:
    model_config = load(f, Loader=Loader)


seed = 1  # random seed
n_threads = 8  # number of CPU threads to use
n_spp = 0  # which species to model
path_out = "/home/phyto/ModelOutput/test/"  # where to save model output


X, y = make_regression(n_samples=500, n_features=5, noise=20, random_state=59)
# rescale y to the range [0, 1] so that all values are non-negative:
scaler = MinMaxScaler()
scaler.fit(y.reshape(-1,1))
y = scaler.transform(y.reshape(-1,1))
# apply an exponential transformation to skew the distribution:
y = np.exp(y)-1
# set values at or below 0.5 to zero:
y[y <= 0.5] = 0
y = np.squeeze(y)

#name y with species name

cv = 3
verbose = 1
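
Because the set-up cell sets all values at or below 0.5 to zero, the synthetic response contains a block of exact zeros. A quick check of how many there are can be sketched as follows. The data-generation steps repeat the cell above; `zero_fraction` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler

X, y = make_regression(n_samples=500, n_features=5, noise=20, random_state=59)
y = MinMaxScaler().fit_transform(y.reshape(-1, 1))  # rescale to [0, 1]
y = np.exp(y) - 1   # skew the distribution
y[y <= 0.5] = 0     # low values become exact zeros
y = np.squeeze(y)

# share of samples that are exactly zero
zero_fraction = np.mean(y == 0)
print(f"samples: {y.size}, zeros: {zero_fraction:.0%}, max: {y.max():.2f}")
```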

In [8]:
'''
1-phase Random forest 
'''
reg_scoring = model_config['reg_scoring']
reg_param_grid = model_config['rf_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, cv=cv, model="rf", zir=False, log="yes")

Fitting 3 folds for each of 54 candidates, totalling 162 fits


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.


finished tuning model
reg rRMSE: 48%
reg rMAE: 40.0%
reg R2: 0.42
execution time: 10.665565490722656 seconds


[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.5s finished
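
The "54 candidates, totalling 162 fits" line in the output above follows from the grid size: the number of candidates is the product of the number of values listed for each hyper-parameter, and each candidate is fitted once per cross-validation fold. The grid below is hypothetical (the real one lives in `model_config.yml`); it is chosen only so that the arithmetic gives the same counts.

```python
from math import prod

# Hypothetical grid: parameter names and values are assumptions.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 10],
    "max_features": [0.3, 0.6, 1.0],
    "min_samples_leaf": [1, 5],
}
cv = 3

n_candidates = prod(len(values) for values in param_grid.values())  # 3 * 3 * 3 * 2
n_fits = n_candidates * cv
print(f"Fitting {cv} folds for each of {n_candidates} candidates, totalling {n_fits} fits")
```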


In [9]:
'''
2-phase Random forest 
note: for the 2-phase model we need to define the model configuration for both the classifier and the regressor
'''

reg_scoring = model_config['reg_scoring']
clf_scoring = model_config['clf_scoring']

clf_param_grid = model_config['rf_param_grid']['clf_param_grid']
reg_param_grid = model_config['rf_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, clf_scoring=clf_scoring, clf_param_grid=clf_param_grid,
      cv=cv, model="rf", zir=True, log="yes")

Fitting 3 folds for each of 54 candidates, totalling 162 fits
Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.5s finished
[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.5s finished
[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.


finished tuning model
reg rRMSE: 48%
reg rMAE: 40.0%
reg R2: 0.42
zir rRMSE: 48.0
zir rMAE: 39.0
zir R2: 0.42
execution time: 16.967652797698975 seconds


[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    1.0s finished
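
The source does not show how 'tune' combines the classifier and the regressor when zir=True. One common two-phase formulation is sketched below as an illustration only: a classifier predicts presence or absence, a regressor is fitted on the presences, and the two predictions are multiplied. All estimator settings here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic zero-inflated target: values above the median are kept, the rest are zero.
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=59)
y = np.where(y > np.median(y), y - np.median(y), 0.0)

# Phase 1: classify presence (y > 0) versus absence (y == 0).
present = y > 0
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, present)

# Phase 2: fit the regressor on the presences only.
reg = RandomForestRegressor(n_estimators=50, random_state=1).fit(X[present], y[present])

# Combine: regressor prediction where presence is predicted, zero elsewhere.
y_pred = clf.predict(X) * reg.predict(X)
print(f"predicted zeros: {np.mean(y_pred == 0):.0%}")
```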


In [10]:
'''
Testing the impact of log transformation on the 1-phase Random forest 

note: setting log="both" tunes the model both with and without the log transformation
'''

reg_scoring = model_config['reg_scoring']
reg_param_grid = model_config['rf_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, cv=cv, model="rf", zir=False, log="both")

Fitting 3 folds for each of 54 candidates, totalling 162 fits
Fitting 3 folds for each of 54 candidates, totalling 162 fits


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.


finished tuning model
reg rRMSE: 47%
reg rMAE: 40.0%
reg R2: 0.42
execution time: 19.654861211776733 seconds


[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.4s finished
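
For readers who want to try a with-and-without-log comparison outside the 'tune' class, scikit-learn's `TransformedTargetRegressor` offers one way to do it. The sketch below is not the implementation behind log="both"; the choice of `log1p`/`expm1`, the estimator settings and the R2 comparison are all assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=59)
y = np.exp((y - y.min()) / (y.max() - y.min())) - 1  # non-negative, skewed target

plain = RandomForestRegressor(n_estimators=50, random_state=1)
logged = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=1),
    func=np.log1p,          # log(1 + y), defined for y == 0
    inverse_func=np.expm1,  # back-transform predictions to the original scale
)

# Mean cross-validated R2 for each variant, scored on the original scale.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    for name, model in [("no log", plain), ("log", logged)]
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```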


In [11]:
'''
1-phase Gradient boosting with XGBoost:
'''

reg_scoring = model_config['reg_scoring']
reg_param_grid = model_config['xgb_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, cv=cv, model="xgb", zir=False, log="yes")

Fitting 3 folds for each of 1 candidates, totalling 3 fits
finished tuning model
reg rRMSE: 61%
reg rMAE: 50.0%
reg R2: 0.05
execution time: 0.6898267269134521 seconds


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.1s finished
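
The scores printed above are relative error measures plus R2. The notebook does not define rRMSE and rMAE, so the sketch below assumes they are RMSE and MAE divided by the mean of the observed values. `relative_scores` is a hypothetical helper, not a function of the 'tune' class.

```python
import numpy as np


def relative_scores(y_true, y_pred):
    """Return (rRMSE, rMAE, R2); relative scores are divided by mean(y_true) (assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse / y_true.mean(), mae / y_true.mean(), r2


rrmse, rmae, r2 = relative_scores([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(f"rRMSE: {rrmse:.0%}, rMAE: {rmae:.0%}, R2: {r2:.2f}")
```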


In [12]:
'''
2-phase Gradient boosting with XGBoost:
'''

reg_scoring = model_config['reg_scoring']
clf_scoring = model_config['clf_scoring']

clf_param_grid = model_config['xgb_param_grid']['clf_param_grid']
reg_param_grid = model_config['xgb_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, clf_scoring=clf_scoring, clf_param_grid=clf_param_grid,
      cv=cv, model="xgb", zir=True, log="yes")

Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.1s finished


finished tuning model
reg rRMSE: 61%
reg rMAE: 50.0%
reg R2: 0.05
zir rRMSE: 53.0
zir rMAE: 37.0
zir R2: 0.28
execution time: 1.9003465175628662 seconds


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.1s finished


In [13]:
'''
1-phase nearest neighbors with a bagged KNN
note: for KNN we also need to set the number of bags, here with bagging_estimators=30
'''

reg_scoring = model_config['reg_scoring']
reg_param_grid = model_config['knn_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, cv=cv, model="knn", zir=False, log="yes", bagging_estimators=30)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
finished tuning model
reg rRMSE: 48%
reg rMAE: 40.0%
reg R2: 0.4
execution time: 1.3744845390319824 seconds


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.2s finished
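
A bagged KNN fits several nearest-neighbour models to bootstrap samples of the data and averages their predictions. How 'tune' builds this internally is not shown here. The sketch below is one way to do it with scikit-learn's `BaggingRegressor`, using 30 estimators to match bagging_estimators=30. The value of `n_neighbors` is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=59)

# 30 KNN regressors, each fitted to a bootstrap sample of the rows;
# predictions are averaged across the bags.
bagged_knn = BaggingRegressor(
    KNeighborsRegressor(n_neighbors=5),
    n_estimators=30,
    random_state=1,
)
bagged_knn.fit(X, y)
print(len(bagged_knn.estimators_))
```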


In [14]:
'''
2-phase nearest neighbors with a bagged KNN
note: for KNN we also need to set the number of bags, here with bagging_estimators=30
'''

reg_scoring = model_config['reg_scoring']
clf_scoring = model_config['clf_scoring']

clf_param_grid = model_config['knn_param_grid']['clf_param_grid']
reg_param_grid = model_config['knn_param_grid']['reg_param_grid']

m = tune(X, y, seed, n_threads, verbose, cv, path_out)
m.XGB(reg_scoring, reg_param_grid, clf_scoring=clf_scoring, clf_param_grid=clf_param_grid,
      cv=cv, model="knn", zir=True, log="both", bagging_estimators=30)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.2s finished
[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.2s finished
[Parallel(n_jobs=8)]: Using backend MultiprocessingBackend with 8 concurrent workers.


finished tuning model
reg rRMSE: 40%
reg rMAE: 30.0%
reg R2: 0.58
zir rRMSE: 43.0
zir rMAE: 28.999999999999996
zir R2: 0.53
execution time: 4.422145128250122 seconds


[Parallel(n_jobs=8)]: Done   3 out of   3 | elapsed:    0.4s finished


TO DO:
    
Add print statement for log="both"

Add tau scoring

Add one_hot_encoding
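
For the tau scoring item, and assuming "tau" refers to Kendall's tau rank correlation, a custom scorer could be built with scikit-learn's `make_scorer` and SciPy's `kendalltau`. This is a sketch of one possible approach, not a planned implementation; `tau` and `tau_scorer` are names introduced here.

```python
from scipy.stats import kendalltau
from sklearn.metrics import make_scorer


def tau(y_true, y_pred):
    """Kendall's tau rank correlation between observations and predictions."""
    return kendalltau(y_true, y_pred)[0]


# Higher tau means better rank agreement, so greater_is_better=True.
tau_scorer = make_scorer(tau, greater_is_better=True)
print(tau([1, 2, 3, 4], [10, 20, 30, 40]))
```

A scorer object of this kind can be passed wherever scikit-learn accepts a `scoring` argument, for example in `GridSearchCV`.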