Overview

This notebook shows how to tune plankton machine-learning models using the 'tune' class.

This is the first notebook in a set of three:

    - tune.ipynb: tune hyper-parameters to find the best model configuration

    - predict.ipynb: make predictions using the best fitting model

    - post.ipynb: analyse predictions and calculate metrics such as diversity

Several dependencies need to be installed before running this notebook:

    pandas
    numpy
    scikit-learn
    xgboost
    joblib
    xarray
    scikit-bio
    pyyaml
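
Before running the cells below, it can help to confirm that the dependencies are importable. The following is a minimal sketch, not part of the notebook's own code: the helper name `missing_packages` is hypothetical, and PyYAML is included only because the import cell below uses `yaml`. Note that scikit-learn and scikit-bio are imported under different names (`sklearn`, `skbio`) than they are installed under.

```python
from importlib.util import find_spec

# Map each pip package name to the module name it is imported under.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "joblib": "joblib",
    "xarray": "xarray",
    "scikit-bio": "skbio",
    "pyyaml": "yaml",  # assumption: needed because the notebook imports yaml
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies found.")
```

Anything reported as missing can then be installed with pip or conda before continuing.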

In [1]:
# import required packages
import pandas as pd
import numpy as np
from tune import tune 
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_regression

from yaml import safe_load, load, dump
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper   

from functions import example_data

In [2]:
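# Setting up the model

# Load the tuning configuration (adjust the path to your own checkout)
with open('/home/phyto/planktonSDM/model_config.yml', 'r') as f:
    model_config = load(f, Loader=Loader)

# Generate the example dataset, seeded from the config for reproducibility
X, y = example_data(y_name="Coccolithus pelagicus", n_samples=500, n_features=5,
                    noise=20, random_state=model_config['seed'])

# Initialise the tuner with the predictors, the target and the configuration
m = tune(X, y, model_config)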

In [3]:
'''
1-phase Random forest 
'''
m.train(model="rf")

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


finished tuning model
reg rRMSE: 49%
reg rMAE: 36%
reg R2: 0.14
execution time: 15.653547286987305 seconds


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.6s finished
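
Each run reports rRMSE, rMAE and R2. The notebook does not show how the 'tune' class computes them, so the following is only a sketch under a stated assumption: the relative errors are taken to be RMSE and MAE divided by the mean of the observations. The 'tune' class may normalise differently (for example by the range or standard deviation), and the function name `relative_metrics` is hypothetical.

```python
import math


def relative_metrics(y_true, y_pred):
    """Compute rRMSE, rMAE and R2 for paired observations and predictions.

    Assumption: "relative" means divided by the mean of the observations.
    Requires a non-zero mean and non-constant observations.
    """
    n = len(y_true)
    mean_obs = sum(y_true) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]

    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    # R2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_obs) ** 2 for t in y_true)

    return {
        "rRMSE": rmse / mean_obs,
        "rMAE": mae / mean_obs,
        "R2": 1 - ss_res / ss_tot,
    }


scores = relative_metrics([2.0, 4.0, 6.0], [2.5, 3.5, 6.5])
print(f"rRMSE: {scores['rRMSE']:.0%}, rMAE: {scores['rMAE']:.0%}, "
      f"R2: {scores['R2']:.2f}")
```

Under this definition, lower rRMSE and rMAE and higher R2 indicate a better fit, which is how the results in this notebook can be compared across models.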


In [4]:
'''
2-phase Random forest 
Note: the 2-phase model requires a model configuration for both the classifier and the regressor.
'''
m.train(model="rf", zir=True)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


finished tuning model
reg rRMSE: 49%
reg rMAE: 36%
reg R2: 0.14


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.8s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.9s finished
[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


zir rRMSE: 49%
zir rMAE: 36%
zir R2: 0.14
execution time: 26.070433378219604 seconds


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    1.7s finished
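
The idea behind a 2-phase model can be sketched in plain Python. This is an illustration only, not the 'tune' implementation: it assumes the classifier predicts presence or absence and the regressor predicts abundance, with the final prediction set to zero wherever absence is predicted. The toy threshold classifier, the straight-line regressor, and the choice to fit the regressor on non-zero samples only are all assumptions made for the example.

```python
def fit_threshold_classifier(x, present):
    """Toy classifier: split at the midpoint between the class means of x."""
    n_present = sum(present)
    mean_present = sum(v for v, p in zip(x, present) if p) / n_present
    mean_absent = (sum(v for v, p in zip(x, present) if not p)
                   / (len(present) - n_present))
    cut = (mean_present + mean_absent) / 2
    sign = 1 if mean_present >= mean_absent else -1
    return lambda v: sign * (v - cut) > 0


def fit_line(x, y):
    """Toy regressor: ordinary least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return lambda v: my + slope * (v - mx)


def fit_two_phase(x, y):
    """Phase 1 decides presence/absence; phase 2 predicts abundance."""
    present = [v > 0 for v in y]
    classify = fit_threshold_classifier(x, present)
    # Assumption: the regressor is fitted on non-zero samples only.
    x_pos = [a for a, p in zip(x, present) if p]
    y_pos = [b for b, p in zip(y, present) if p]
    regress = fit_line(x_pos, y_pos)
    return lambda v: regress(v) if classify(v) else 0.0


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 0.0, 6.0, 8.0, 10.0]  # zero-inflated target
predict = fit_two_phase(x, y)
print([predict(v) for v in x])
```

Separating the two phases lets the regressor concentrate on the non-zero part of the distribution, which is often useful when many samples have zero abundance.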


In [5]:
'''
Adding log transformation
'''
# add log transformation:
m.train(model="rf", log="yes")

# try both:
print("both models:")
m.train(model="rf", log="both")

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


finished tuning model
reg rRMSE: 49%
reg rMAE: 38%
reg R2: 0.14
execution time: 16.40062427520752 seconds
both models:


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.8s finished


Fitting 3 folds for each of 24 candidates, totalling 72 fits
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


finished tuning model
reg rRMSE: 49%
reg rMAE: 36%
reg R2: 0.14
execution time: 28.613155841827393 seconds


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.8s finished
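
A log transformation of the target is a common way to handle skewed abundance data. The sketch below assumes a log(y + 1) transform, which stays defined for zero abundances, followed by a back-transform of the predictions. The exact transform used by the 'tune' class is not shown in this notebook, and the function names here are hypothetical.

```python
import math


def log_transform(y):
    """Apply log(1 + y) so that zero values remain valid inputs."""
    return [math.log1p(v) for v in y]


def inverse_log_transform(z):
    """Undo log_transform: exp(z) - 1 returns values to the original scale."""
    return [math.expm1(v) for v in z]


abundance = [0.0, 3.0, 50.0, 1200.0]
transformed = log_transform(abundance)       # fit the model on this scale
restored = inverse_log_transform(transformed)  # back-transform predictions
print(transformed)
print(restored)
```

Metrics computed on the back-transformed predictions are comparable with those of an untransformed model, whereas metrics computed on the log scale are not.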


In [6]:
'''
1-phase Gradient boosting with XGBoost:
'''
m.train(model="xgb")

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.2s finished


finished tuning model
reg rRMSE: 44%
reg rMAE: 35%
reg R2: 0.31
execution time: 0.871751070022583 seconds
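
The "Fitting 3 folds for each of N candidates" lines follow from simple arithmetic: in a grid search, the number of candidates is the product of the number of values per hyper-parameter, and the number of fits is the number of candidates times the number of cross-validation folds. The sketch below reproduces the 24 candidates and 72 fits reported for the random forest using a hypothetical grid. The real grid lives in model_config.yml and is not shown here, so the parameter values are illustrative only.

```python
from itertools import product


def count_fits(param_grid, n_folds):
    """Return (number of candidates, total number of fits) for a grid search."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds


# Hypothetical grid: 3 x 4 x 2 = 24 combinations.
hypothetical_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, 20, 50],
    "min_samples_leaf": [1, 5],
}

n_candidates, n_fits = count_fits(hypothetical_grid, n_folds=3)
print(f"Fitting 3 folds for each of {n_candidates} candidates, "
      f"totalling {n_fits} fits")
```

A grid with a single value per hyper-parameter gives 1 candidate and 3 fits, which matches the XGBoost output above and explains its much shorter execution time.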


In [7]:
'''
2-phase Gradient boosting with XGBoost:
'''
m.train(model="xgb", zir=True)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.2s finished


finished tuning model
reg rRMSE: 44%
reg rMAE: 35%
reg R2: 0.31
Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.2s finished
[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


zir rRMSE: 42%
zir rMAE: 30%
zir R2: 0.38
execution time: 2.1978676319122314 seconds


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.3s finished


In [8]:
'''
1-phase bagged nearest neighbors
'''
m.train(model="knn")

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


finished tuning model
reg rRMSE: 40%
reg rMAE: 28%
reg R2: 0.42
execution time: 1.3003971576690674 seconds


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.3s finished
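
Bagged nearest neighbors averages the predictions of several k-nearest-neighbor models, each fitted on a bootstrap resample of the training data. The pure-Python sketch below illustrates the technique only. How the 'tune' class builds its "knn" model is not shown in this notebook (in scikit-learn this would typically be a `BaggingRegressor` wrapping a `KNeighborsRegressor`), and the function names and default values here are assumptions.

```python
import random


def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training points closest to query."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def bagged_knn_predict(train_X, train_y, query, k=3, n_estimators=10, seed=0):
    """Average k-NN predictions over bootstrap resamples of the training set."""
    rng = random.Random(seed)
    n = len(train_X)
    predictions = []
    for _ in range(n_estimators):
        # Bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_X = [train_X[i] for i in idx]
        boot_y = [train_y[i] for i in idx]
        predictions.append(knn_predict(boot_X, boot_y, query, k))
    return sum(predictions) / len(predictions)


train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(bagged_knn_predict(train_X, train_y, query=[0.5, 0.5]))
```

Averaging over resamples reduces the variance of a single k-NN model, at the cost of fitting `n_estimators` models instead of one.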


In [9]:
'''
2-phase bagged nearest neighbors
'''
m.train(model="knn", zir=True)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


finished tuning model
reg rRMSE: 37%
reg rMAE: 27%
reg R2: 0.49


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.3s finished


Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.3s finished
[Parallel(n_jobs=2)]: Using backend MultiprocessingBackend with 2 concurrent workers.


zir rRMSE: 38%
zir rMAE: 26%
zir R2: 0.49
execution time: 3.3975861072540283 seconds


[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:    0.6s finished


TO DO:
    
Add print statement for log="both"

Add tau scoring