# Creation of the Initial Model 

## Introduction

This Notebook is develped to identify and specify the models, which will be used to apply the Active Learning strategies on. At least two models will be created, as described in the initial Research Proposal: 
1. PLS-Regression-Model 
2. Random-Forest-Regression-Model

## Preperation

To work in python, various libraries are needed. So the neccessary libraries are imported in the next cell. 

The code is developed inspired by the machine learining course by [Peter Sykacek](peter.sykacek[at]boku.ac.at) in the winter of 2023.


### Define Paths

In [1]:
import sys
# sys.path.clear()

# Basepath
basepath="./" # Project directory
sys.path.append(basepath)
sys.path.append(basepath+"server_files/ml_group/course.lib")

# Data
DATA_PATH = basepath + "data"

#Figure
FIGURE_PATH = basepath + "figures/03_modeling_figures"

# Modelpath
MODEL_PATH = basepath + "models"

# Path to environment

ENV_PATH = "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib"

# Resultspath
RESULTS_PATH = basepath + "results/03_modeling_results"

# Add the paths
sys.path.extend({DATA_PATH, FIGURE_PATH, MODEL_PATH, ENV_PATH, RESULTS_PATH})
sys.path # Check if the path is correct

['/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python312.zip',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/lib-dynload',
 '',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages',
 './',
 './server_files/ml_group/course.lib',
 './data',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib',
 './models',
 './figures/03_modeling_figures',
 './results/03_modeling_results']

In [2]:
## timing the full notebook
import time
nb_start_time = time.time()

### Define the Path to store the ML Models 

In [3]:
import joblib

# Define the path to save the ml models

MODEL_PATH = basepath + "models"
sys.path.append(MODEL_PATH)

### Imports

In [4]:
# import ml_lib as mlib
import numpy as np
import matplotlib.pyplot as plt

### turn off convergence warnings

In [5]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
import os
os.environ["PYTHONWARNINGS"] = "ignore" # Also affect subprocesses

## Import Model functions

To generate various models an import of the respective functions from preexisting packages is neccessary. 

### Gridsearch Crossvalidation

[sklearn GSCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.

* GridSearchCV implements a "fit" and a "score" method.
* It also implements "score_samples", "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used.

In [6]:
from sklearn.model_selection import GridSearchCV as GSCV

### Randomized Parameter Optimization

[sklearn RandomizedSearchCV](https://scikit-learn.org/stable/modules/grid_search.html#grid-search)  

 RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

A budget can be chosen independent of the number of parameters and possible values.


In [7]:
from sklearn.model_selection import RandomizedSearchCV

### K-Fold cross-validator.
[sklearn KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation while the k - 1 remaining folds form the training set.

In [8]:
from sklearn.model_selection import KFold

### Kernel Ridge

[sklearn KRR](https://scikit-learn.org/stable/modules/kernel_ridge.html#kernel-ridge-regression)

Kernel ridge regression (KRR) [M2012] combines Ridge regression and classification (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.

In [9]:
from sklearn.kernel_ridge import KernelRidge as KRR

### Model Inspection

[sklearn cv_results_](https://scikit-learn.org/stable/modules/grid_search.html#analyzing-results-with-the-cv-results-attribute)

"The cv_results_ attribute contains useful information for analyzing the results of a search. It can be converted to a pandas dataframe with df = pd.DataFrame(est.cv_results_)."

## Data Import

In this section the sample data will be imported. 

Currently 2 Datasets are of interest for us: 
1. PS20191107_gegl.csv
2. dps1200.csv

The differences are that the first is a dataframe containing the data unmodified and full. It was used to generate the later, which contains only selected sections of the spectra. The Wavelengths of this dataset were selected by discarding Wavelengths, based on critieria ???

**TODO**: Research the criteria

### PS20191107 (Full Data)

In [10]:
import pandas as pd
data_full = pd.read_csv(basepath+"data/PS20191107_gegl.csv", 
                            sep=";", decimal=",", encoding="utf-8")
data_full.head()

Unnamed: 0.1,Unnamed: 0,year,Origin,type,3996,3994,3992,3990,3988,3987,...,417,415,413,411,409,407,405,403,401,399
0,2GOS-18_1955,1955,POL,living,0.016119,0.015972,0.01583,0.015728,0.015734,0.015787,...,-0.027973,-0.02818,-0.028389,-0.028595,-0.029011,-0.029123,-0.029323,-0.02961,-0.029759,-0.029746
1,2GOS-18_1969,1969,POL,living,0.016368,0.016543,0.016663,0.016569,0.016333,0.016217,...,-0.02952,-0.029747,-0.029978,-0.030204,-0.030087,-0.030284,-0.030746,-0.031163,-0.031519,-0.031815
2,2GOS-18_1974,1974,POL,living,0.021364,0.021662,0.021862,0.021573,0.020925,0.020585,...,-0.031046,-0.03127,-0.031483,-0.031701,-0.032089,-0.03239,-0.032609,-0.032653,-0.032627,-0.032784
3,2GOS-18_1976,1976,POL,living,0.019351,0.019246,0.019181,0.018998,0.018926,0.019205,...,-0.029852,-0.030092,-0.030361,-0.030647,-0.031115,-0.031281,-0.031376,-0.031721,-0.032172,-0.032433
4,2GOS-18_1996,1996,POL,living,0.018548,0.018604,0.01867,0.018616,0.018375,0.018266,...,-0.029963,-0.030206,-0.030436,-0.030643,-0.030917,-0.031127,-0.031338,-0.031409,-0.031364,-0.031465


#### Statistics

In [11]:
# Retrive basic characteristics for each variable
data_full.describe()

Unnamed: 0,year,3996,3994,3992,3990,3988,3987,3985,3983,3981,...,417,415,413,411,409,407,405,403,401,399
count,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,...,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0,2244.0
mean,-152.291889,0.011321,0.011238,0.011167,0.011087,0.011004,0.010989,0.010963,0.0109,0.010838,...,-0.024586,-0.024782,-0.024978,-0.025175,-0.025414,-0.025638,-0.025847,-0.026018,-0.026165,-0.026328
std,3659.189806,0.005232,0.005231,0.005229,0.005212,0.00518,0.005176,0.005207,0.005203,0.005198,...,0.003439,0.003428,0.003417,0.003405,0.00341,0.003386,0.003367,0.003356,0.003346,0.003334
min,-13555.0,-0.002773,-0.002953,-0.002774,-0.002312,-0.002147,-0.002444,-0.003096,-0.003154,-0.003191,...,-0.035057,-0.035212,-0.03535,-0.035504,-0.036023,-0.03637,-0.036337,-0.036325,-0.036375,-0.036506
25%,-370.25,0.007695,0.007627,0.007587,0.007504,0.007385,0.007355,0.007297,0.007233,0.007174,...,-0.027123,-0.027286,-0.027461,-0.027654,-0.027906,-0.028134,-0.0283,-0.028394,-0.028544,-0.028711
50%,1472.5,0.012248,0.01216,0.012073,0.011959,0.011875,0.011864,0.011888,0.011831,0.011757,...,-0.024376,-0.024564,-0.024764,-0.024939,-0.025142,-0.025372,-0.025599,-0.025777,-0.025945,-0.026034
75%,1806.0,0.015064,0.014983,0.014875,0.014752,0.014673,0.014688,0.014675,0.014619,0.014553,...,-0.021932,-0.022128,-0.022342,-0.022558,-0.02282,-0.02304,-0.023253,-0.023459,-0.023626,-0.023843
max,2009.0,0.028401,0.027898,0.027302,0.027014,0.026885,0.026733,0.027129,0.026971,0.026841,...,-0.013279,-0.013542,-0.013811,-0.014076,-0.014361,-0.014733,-0.015133,-0.015328,-0.015409,-0.015598


In [12]:
data_full.groupby('type')[['year']].agg(['max', 'mean', 'min'])

Unnamed: 0_level_0,year,year,year
Unnamed: 0_level_1,max,mean,min
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
constr,1936,1628.966667,1239
dry,1765,948.591716,327
living,2009,1886.8437,1524
water,1912,-2682.173774,-13555


### Import dataset dps1200.csv

In [13]:
data_small = pd.read_csv(basepath+"data/dps1200.csv", 
                            sep=",", decimal=".", encoding="utf-8")
data_small.head()

Unnamed: 0.1,Unnamed: 0,year,tree,Origin,type,X2970,X2968,X2966,X2964,X2962,...,X818,X816,X814,X812,X810,X808,X806,X804,X802,X800
0,2GOS-18_1955,1955,2GOS-18,POL,living,0.019849,0.020121,0.020414,0.020724,0.02103,...,-0.023469,-0.023367,-0.023283,-0.02322,-0.023183,-0.023174,-0.02319,-0.023228,-0.023293,-0.023388
1,2GOS-18_1969,1969,2GOS-18,POL,living,0.023933,0.024378,0.024827,0.025273,0.025712,...,-0.024117,-0.024076,-0.024043,-0.024021,-0.024015,-0.024033,-0.024077,-0.024147,-0.024238,-0.024346
2,2GOS-18_1974,1974,2GOS-18,POL,living,0.021605,0.021971,0.022342,0.022719,0.023099,...,-0.026266,-0.026214,-0.026172,-0.026149,-0.026146,-0.026165,-0.026208,-0.026273,-0.026363,-0.026479
3,2GOS-18_1976,1976,2GOS-18,POL,living,0.021999,0.022315,0.022651,0.022999,0.023345,...,-0.025113,-0.02503,-0.024959,-0.024909,-0.024885,-0.024888,-0.024918,-0.024971,-0.025049,-0.025153
4,2GOS-18_1996,1996,2GOS-18,POL,living,0.021031,0.021338,0.021626,0.021923,0.022248,...,-0.025256,-0.025158,-0.025083,-0.025035,-0.025013,-0.025015,-0.02504,-0.025094,-0.025177,-0.025282


In [14]:
# Correct the column headers

# data_1200.rename(lambda x: x[1:], axis='columns')
data_small = data_small.rename(columns=lambda x: x.replace('X', ''))
data_small.head()

Unnamed: 0.1,Unnamed: 0,year,tree,Origin,type,2970,2968,2966,2964,2962,...,818,816,814,812,810,808,806,804,802,800
0,2GOS-18_1955,1955,2GOS-18,POL,living,0.019849,0.020121,0.020414,0.020724,0.02103,...,-0.023469,-0.023367,-0.023283,-0.02322,-0.023183,-0.023174,-0.02319,-0.023228,-0.023293,-0.023388
1,2GOS-18_1969,1969,2GOS-18,POL,living,0.023933,0.024378,0.024827,0.025273,0.025712,...,-0.024117,-0.024076,-0.024043,-0.024021,-0.024015,-0.024033,-0.024077,-0.024147,-0.024238,-0.024346
2,2GOS-18_1974,1974,2GOS-18,POL,living,0.021605,0.021971,0.022342,0.022719,0.023099,...,-0.026266,-0.026214,-0.026172,-0.026149,-0.026146,-0.026165,-0.026208,-0.026273,-0.026363,-0.026479
3,2GOS-18_1976,1976,2GOS-18,POL,living,0.021999,0.022315,0.022651,0.022999,0.023345,...,-0.025113,-0.02503,-0.024959,-0.024909,-0.024885,-0.024888,-0.024918,-0.024971,-0.025049,-0.025153
4,2GOS-18_1996,1996,2GOS-18,POL,living,0.021031,0.021338,0.021626,0.021923,0.022248,...,-0.025256,-0.025158,-0.025083,-0.025035,-0.025013,-0.025015,-0.02504,-0.025094,-0.025177,-0.025282


In [15]:
data_small.describe()
# describe() gives some basic statistics for numeric columns,

Unnamed: 0,year,2970,2968,2966,2964,2962,2960,2959,2957,2955,...,818,816,814,812,810,808,806,804,802,800
count,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,...,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0,1290.0
mean,1740.42093,0.018827,0.019122,0.019427,0.01974,0.020061,0.020389,0.020728,0.021078,0.021439,...,-0.020705,-0.020572,-0.020456,-0.020361,-0.020292,-0.020254,-0.020248,-0.020276,-0.020335,-0.02042
std,196.420289,0.001922,0.001978,0.002038,0.0021,0.002163,0.002223,0.002279,0.002333,0.002385,...,0.002526,0.00256,0.002593,0.002622,0.002649,0.002673,0.002693,0.00271,0.002723,0.002735
min,1194.0,0.011339,0.011597,0.011871,0.012159,0.012466,0.012791,0.013134,0.013493,0.013861,...,-0.026419,-0.026362,-0.026319,-0.026296,-0.026293,-0.026308,-0.026335,-0.026373,-0.026451,-0.026601
25%,1616.0,0.017552,0.017789,0.018044,0.018325,0.018587,0.018869,0.019175,0.019468,0.019784,...,-0.022815,-0.022723,-0.022603,-0.022533,-0.022505,-0.022469,-0.0225,-0.022531,-0.022616,-0.022704
50%,1769.0,0.018673,0.018942,0.01923,0.019521,0.019805,0.020112,0.02043,0.020781,0.021118,...,-0.020678,-0.020497,-0.020383,-0.020287,-0.020214,-0.020141,-0.020116,-0.020118,-0.020152,-0.020231
75%,1913.0,0.019991,0.020269,0.020567,0.020911,0.021258,0.021605,0.021966,0.022307,0.022655,...,-0.018458,-0.018294,-0.018154,-0.018033,-0.01795,-0.017888,-0.017871,-0.017885,-0.017919,-0.017988
max,2009.0,0.027378,0.028247,0.029124,0.02999,0.030832,0.031645,0.032436,0.03322,0.034017,...,-0.013677,-0.013408,-0.013178,-0.012991,-0.012843,-0.012738,-0.012686,-0.012691,-0.012745,-0.012844


In [16]:
data_small.describe(include="object")
# describe() gives some basic statistics for numeric columns, 
# categorial columns are included with the option include="object"

Unnamed: 0.1,Unnamed: 0,tree,Origin,type
count,1290,1290,1290,1290
unique,1290,139,4,4
top,SZLPS15a_1982,Dev2b,AUT,living
freq,1,29,631,627


## Modelling Parameters

In [55]:
# Define the parameters for the CV

# Switch for the dataset
    # Select from (data_1200, data_full) or other if implemented
data = data_small

# Switch for testing mode (use only 10% of the data, among others)
testing = False

# Define a random state for randomized processes
random_state = np.random.RandomState(202375)

# Define a metric for model evaluation
cv_scorer = 'neg_mean_squared_error'

######################################################
if testing == True:
    nfolds = 2
    NoTrials = 2
    n_jobs = 20
    save_model = False
    print("Testing mode for Cross Validation")
    print("Splitting the data for faster modelling")
    data = data.sample(frac=0.1)
else:
    nfolds = 10
    NoTrials = 15
    n_jobs = 40
    save_model = True
    print("Extensive mode for Cross Validation")
######################################################

Extensive mode for Cross Validation


In [18]:
X = data.select_dtypes('float')
X

Unnamed: 0,2970,2968,2966,2964,2962,2960,2959,2957,2955,2953,...,818,816,814,812,810,808,806,804,802,800
0,0.019849,0.020121,0.020414,0.020724,0.021030,0.021321,0.021615,0.021931,0.022270,0.022634,...,-0.023469,-0.023367,-0.023283,-0.023220,-0.023183,-0.023174,-0.023190,-0.023228,-0.023293,-0.023388
1,0.023933,0.024378,0.024827,0.025273,0.025712,0.026149,0.026586,0.027030,0.027490,0.027963,...,-0.024117,-0.024076,-0.024043,-0.024021,-0.024015,-0.024033,-0.024077,-0.024147,-0.024238,-0.024346
2,0.021605,0.021971,0.022342,0.022719,0.023099,0.023470,0.023832,0.024207,0.024610,0.025038,...,-0.026266,-0.026214,-0.026172,-0.026149,-0.026146,-0.026165,-0.026208,-0.026273,-0.026363,-0.026479
3,0.021999,0.022315,0.022651,0.022999,0.023345,0.023682,0.024023,0.024386,0.024768,0.025150,...,-0.025113,-0.025030,-0.024959,-0.024909,-0.024885,-0.024888,-0.024918,-0.024971,-0.025049,-0.025153
4,0.021031,0.021338,0.021626,0.021923,0.022248,0.022589,0.022925,0.023264,0.023624,0.024004,...,-0.025256,-0.025158,-0.025083,-0.025035,-0.025013,-0.025015,-0.025040,-0.025094,-0.025177,-0.025282
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285,0.018254,0.018577,0.018906,0.019248,0.019604,0.019962,0.020317,0.020670,0.021035,0.021428,...,-0.018632,-0.018450,-0.018285,-0.018149,-0.018048,-0.017984,-0.017955,-0.017961,-0.017997,-0.018058
1286,0.018508,0.018778,0.019051,0.019341,0.019650,0.019962,0.020274,0.020597,0.020944,0.021318,...,-0.019053,-0.018876,-0.018717,-0.018583,-0.018481,-0.018413,-0.018380,-0.018379,-0.018410,-0.018469
1287,0.017196,0.017486,0.017786,0.018100,0.018423,0.018749,0.019080,0.019421,0.019777,0.020150,...,-0.018587,-0.018406,-0.018242,-0.018103,-0.017993,-0.017912,-0.017865,-0.017854,-0.017877,-0.017928
1288,0.017298,0.017541,0.017791,0.018060,0.018352,0.018656,0.018964,0.019273,0.019592,0.019929,...,-0.018916,-0.018720,-0.018541,-0.018390,-0.018273,-0.018189,-0.018138,-0.018122,-0.018142,-0.018194


In [19]:
y = data['year']
y

0       1955
1       1969
2       1974
3       1976
4       1996
        ... 
1285    1942
1286    1952
1287    1962
1288    1972
1289    1982
Name: year, Length: 1290, dtype: int64

In [20]:
random_state

RandomState(MT19937) at 0x7F51BD7F4740

## Train/Test split

During this Project, we will generate statistical model with a random fraction of the dataset. The remainder will be retained to be used as test values to estimate the accuracy of the model and potentially detect overfitting. 

In [21]:
from sklearn.model_selection import train_test_split
# Split the dataset into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)

# Random Forest (RSCV)
Implemented: 
- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

## Parameter Distribution

In [22]:
# RF Define the parameters for the CV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict as cvp
from scipy.stats import randint
from sklearn.metrics import mean_squared_error

start_time = time.time()

rf = RandomForestRegressor() # default criterion to evaluate the quality of the split is the ”squared_error”

param_distribs = {'n_estimators': randint(low=3, high=150), # for hyperparameter with discrete values 
                  'min_samples_split': randint(low=2, high=20), 
                  'max_depth': randint(low=1, high=20), 
                  'min_samples_leaf': randint(low=1, high=10),
                  }

# loop the fitting with splits of the data
rf_rscv_rmse1 = np.zeros((NoTrials, 1))
rf_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    rf_rscv = RandomizedSearchCV(
        rf, # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the Trainig Data
    rf_rscv.fit(X_train, y_train)

    # calculate the CV scores
    rf_rscv_rmse1[i] = np.sqrt(-rf_rscv.best_score_)
    y_pred_rf = cvp(rf_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    rf_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_rf))

end_time = time.time()
execution_time = (end_time - start_time)/60
print(f"Execution time: {execution_time} minutes")


Trial 0 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 1 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each o



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 9 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 12 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 13 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 14 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidat

In [23]:
print('optimal Parameters according to RSCV:', rf_rscv.best_params_)
print('best score', rf_rscv.best_score_) # this returns the negative of the MSE

optimal Parameters according to RSCV: {'max_depth': 19, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 23}
best score -14876.810753198622


## RF with optimal parameters

extract the best parameters and run the RF Regression with the full data

In [24]:
# optimal parameters
rf_opt = RandomForestRegressor(**rf_rscv.best_params_)

#fit the model
rf_opt.fit(X_train, y_train)

# predict the values for X_test
y_pred_rf = rf_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

In [25]:
rmse_rf

122.98601790347223

# PLS (RSCV)
Implemented:


*TODO*

- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

In [26]:
# The parameter search should be limited in regards to the numebr of components to keep:
# "Should be in [1, min(n_samples, n_features, n_targets)]" sklearn

X_test.shape

(387, 410)

In [27]:
# Define the parameters for the CV
from sklearn import cross_decomposition

start_time = time.time()

pls = cross_decomposition.PLSRegression()

param_distribs = {'n_components': randint(low=1, high=90), #  should be in [1, min(n_samples, n_features, n_targets = 90)].
                  'max_iter': randint(low=2, high=700), 
                  }

# loop the fitting with splits of the data
pls_rscv_rmse1 = np.zeros((NoTrials, 1))
pls_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    pls_rscv = RandomizedSearchCV(
        pls, # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the Trainig Data
    pls_rscv.fit(X_train, y_train)

    # calculate the CV scores
    pls_rscv_rmse1[i] = np.sqrt(-pls_rscv.best_score_)
    y_pred_pls = cvp(pls_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    pls_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_pls))

end_time = time.time()
execution_time = (end_time - start_time)/60
print(f"Execution time: {execution_time} minutes")


Trial 0 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 1 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 2 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 3 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 4 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 5 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 6 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 7 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 9 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 10 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 11 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 12 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 13 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 14 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Execution time: 3.6151899456977845 minutes


In [28]:
print('optimal Parameters according to RSCV:', pls_rscv.best_params_)
print('best score' ,pls_rscv.best_score_) # this returns the negative of the MSE

optimal Parameters according to RSCV: {'max_iter': 38, 'n_components': 30}
best score -9864.388872329746


### PLS with optimal parameters

In [29]:
# optimal parameters
pls_opt = cross_decomposition.PLSRegression(**pls_rscv.best_params_)

#fit the model
pls_opt.fit(X_train, y_train)

# predict the values for X_test

y_pred_pls = pls_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_pls = np.sqrt(mean_squared_error(y_test, y_pred_pls))

In [30]:
rmse_pls

103.6372588328837

# KRR with RBF (RSCV)
Implemented:

- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

TODO

In [31]:
# KRR with RBF 

# alpha: Regularization strength

param_distribs = {"alpha": [1e0, 1e-1, 1e-2, 1e-3], 
                  "gamma": np.logspace(-2, 2, 7)}

# param_distribs = {"alpha": np.logspace(0.0001, 0.1), 
#                  "gamma": np.logspace(0.0001, 0.1)}

In [32]:
np.logspace(-2, 2, 7)

array([1.00000000e-02, 4.64158883e-02, 2.15443469e-01, 1.00000000e+00,
       4.64158883e+00, 2.15443469e+01, 1.00000000e+02])

In [33]:
from sklearn.kernel_ridge import KernelRidge as KRR

start_time = time.time()

# loop the fitting with splits of the data
krr_rscv_rmse1 = np.zeros((NoTrials, 1))
krr_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    krr_rscv = RandomizedSearchCV(
        KRR(kernel='rbf'), # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the Trainig Data
    krr_rscv.fit(X_train, y_train)

    # calculate the CV scores
    krr_rscv_rmse1[i] = np.sqrt(-krr_rscv.best_score_)
    y_pred_krr = cvp(krr_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    krr_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_krr))

end_time = time.time()
execution_time = (end_time - start_time)/ 60
print(f"Execution time: {execution_time} minutes")

Trial 0 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 1 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each o



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 3 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 4 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 5 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 6 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 7 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fitsFitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 11 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidat



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Execution time: 1.6823788285255432 minutes


In [34]:
print('optimal Parameters according to RSCV:', krr_rscv.best_params_)
print('best score' ,krr_rscv.best_score_) # this returns the negative of the MSE

optimal Parameters according to RSCV: {'gamma': 4.641588833612777, 'alpha': 0.001}
best score -10934.320041126219


## KRR (rbf) with optimal parameters

In [35]:
# optimal parameters
krr_opt = KRR(**krr_rscv.best_params_)

#fit the model
krr_opt.fit(X_train, y_train)

# predict the values for X_test

y_pred_krr = krr_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_krr = np.sqrt(mean_squared_error(y_test, y_pred_krr))

In [36]:
rmse_krr

129.7497505300415

# MLP (RSCV)

This Method, the multi-layer perceptron creates a neural network, where neurons are organized in three or more layers (1 input-, n hidden-, and 1 output-layer). The MLP is based on a threshold logic unit (TLU, sometimes linear threshold unit LTU). A TLU recieves input from its connections and calculates 'weights' from the sum of all inputs and calculates a step function. Common step functions are the *Heaviside step function* or *sign function*. 

To compute the outputs of a single fully connnected layer the following eq. can be used 

(citation: A. Géron, Hands-on machine learning with Scikit-Learn and TensorFlow concepts, tools, and techniques to build intelligent systems, 2nd ed. O’Reilly Media, Inc., 2019. p.283)
‌

$$h_{W,b}(X) = \phi(WX + b)$$

Implemented: 

- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters 

*TODO* 

In [37]:
# parameter Distribution for mlp

from scipy.stats import randint, uniform

param_distribs = {"hidden_layer_sizes": randint(low=50, high=200), # number of neurons in each layer
                  "activation": ['identity', 'logistic', 'tanh', 'relu'],
                  "solver": ['lbfgs','sgd', 'adam'],
                  'alpha': uniform(loc=0.0001, scale=0.1),
                  'early_stopping': [True, False],  
                  'validation_fraction': uniform(loc=0.1, scale=0.1)
}

In [38]:
from sklearn.neural_network import MLPRegressor as MLP

start_time = time.time()

# loop the fitting with splits of the data
mlp_rscv_rmse1 = np.zeros((NoTrials, 1))
mlp_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    mlp_rscv = RandomizedSearchCV(
        MLP(), # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the Trainig Data
    mlp_rscv.fit(X_train, y_train)

    # calculate the CV scores
    mlp_rscv_rmse1[i] = np.sqrt(-mlp_rscv.best_score_)
    y_pred_mlp = cvp(mlp_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    mlp_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_mlp))

end_time = time.time()
execution_time = (end_time - start_time)/ 60
print(f"Execution time: {execution_time} minutes")

Trial 0 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 1 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits


10 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 2 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate

20 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 6 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate

20 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 9 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits


10 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 10 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits


10 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 11 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits


10 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 12 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits


20 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 752, in fit
    return self._f

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 13 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits




Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 14 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidat

In [39]:
print('optimal Parameters according to RSCV:', mlp_rscv.best_params_)
print('best score' ,mlp_rscv.best_score_) # this returns the negative of the MSE

optimal Parameters according to RSCV: {'activation': 'tanh', 'alpha': 0.018092607452333143, 'early_stopping': False, 'hidden_layer_sizes': 93, 'solver': 'lbfgs', 'validation_fraction': 0.14112525964015027}
best score -13800.961982239276


In [40]:
# optimal parameters
mlp_opt = MLP(**mlp_rscv.best_params_)

#fit the model
mlp_opt.fit(X_train, y_train)

# predict the values for X_test
y_pred_mlp = mlp_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_mlp = np.sqrt(mean_squared_error(y_test, y_pred_mlp))

In [41]:
rmse_mlp

115.61234301860294

# XGBoost (RSCV)

Implemented:

- import

*TODO*


- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

In [42]:
import xgboost
xgboost.__version__

'2.0.3'

#### transform the data into the XGBoost data class

Details see [datacamp](https://www.datacamp.com/tutorial/xgboost-in-python)

In [43]:
# Create regression matrices
dtrain_reg = xgboost.DMatrix(X_train, y_train)
dtest_reg = xgboost.DMatrix(X_test, y_test)

### Define the objective

XGBoost will be used here for a regression problem, with the objective to minimize the squared error of the model.

In [44]:
params = {"objective": "reg:squarederror", 
          "tree_method": "hist"} # "gpu_hist" for gpu only, set to 'hist' if on cpu

In [45]:
# Define hyperparameters

n = 500 # number of rounds
evals = [(dtrain_reg, "train"), (dtest_reg, "validation")] # specify the data for evaluation

model = xgboost.train(
   params=params,
   dtrain=dtrain_reg,
   num_boost_round=n,
   evals=evals,
   verbose_eval = 20, 
   early_stopping_rounds=20,
)

# [60]	train-rmse:3.18431	validation-rmse:123.78980
# [71]	train-rmse:1.87073	validation-rmse:123.80743

[0]	train-rmse:158.91235	validation-rmse:170.96301
[20]	train-rmse:23.83772	validation-rmse:124.03000
[40]	train-rmse:8.38947	validation-rmse:124.03466
[60]	train-rmse:3.18431	validation-rmse:123.78980
[71]	train-rmse:1.87073	validation-rmse:123.80743


### XGBoost Crossvalidation

In [46]:
n = 1000

results = xgboost.cv(
   params,
   dtrain_reg,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)
results.head()

Unnamed: 0,train-rmse-mean,train-rmse-std,test-rmse-mean,test-rmse-std
0,157.790979,1.473083,170.071095,9.402711
1,129.446537,2.077011,154.529123,11.302401
2,109.085485,2.19486,143.455057,10.752623
3,93.441676,2.404388,136.020008,9.874314
4,81.071116,2.945638,133.587989,9.806164


In [47]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

121.57590195197679

### Acessing the xgboost eval metrics via sklearn

from tutorial of [xgboost](https://xgboost.readthedocs.io/en/latest/python/examples/sklearn_evals_result.html#demo-for-accessing-the-xgboost-eval-metrics-by-using-sklearn-interface)

In [48]:
# Create regression matrices
dtrain_reg = xgboost.DMatrix(X_train, y_train)
dtest_reg = xgboost.DMatrix(X_test, y_test)

params = {"objective": ["reg:squarederror"], 
          "tree_method": ["hist"]}

XGB = xgboost.XGBModel(**params)

In [49]:
from xgboost import XGBRegressor

start_time = time.time()
# Instantiate the regressor
XGB = XGBRegressor()

param_distribs = {
    "n_estimators": randint(100,500),
    "max_depth": randint(3,100)
}


# loop the fitting with splits of the data
xgb_rscv_rmse1 = np.zeros((NoTrials, 1))
xgb_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    xgb_rscv = RandomizedSearchCV(
        XGB, # regressor
        param_distributions = param_distribs, # hyperparameter space
        n_iter = 10, # "Number of parameter settings that are sampled." [sklearn]
        cv = inner_cv, # "Determines the cross-validation splitting strategy" [sklearn]
        scoring = cv_scorer, 
        random_state = random_state, 
        verbose = 1, 
        n_jobs = n_jobs)
    
    # fit the model on the Trainig Data
    xgb_rscv.fit(X_train, y_train)

    # calculate the CV scores
    xgb_rscv_rmse1[i] = np.sqrt(-xgb_rscv.best_score_)
    y_pred_xgb = cvp(xgb_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    xgb_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_xgb))

end_time = time.time()
execution_time = (end_time - start_time)/ 60
print(f"Execution time: {execution_time} minutes")

Trial 0 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 1 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each o



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 4 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Trial 8 of 15
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Fitting 10 folds for each of 10 candidate

KeyboardInterrupt: 

In [50]:
print('optimal Parameters according to RSCV:', xgb_rscv.best_params_)
print('best score' ,xgb_rscv.best_score_) # this returns the negative of the MSE

optimal Parameters according to RSCV: {'max_depth': 5, 'n_estimators': 393}
best score -13534.912770700568


In [51]:
# optimal parameters
best_params = xgb_rscv.best_params_

xgb_opt = XGBRegressor(**xgb_rscv.best_params_)

#fit the model
xgb_opt.fit(X_train, y_train)

# predict the values for X_test
y_pred_xgb = xgb_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))

In [52]:
rmse_xgb

124.21650265526222

# Histogram-based Gradient Boosting Regression Tree

*TODO*
- parameter distribution
- Train  
- Test  
- CV Results  
- Optimal Model Parameters 

In [56]:
from sklearn.ensemble import HistGradientBoostingRegressor as HGB

# Define parameters for HGB
#param_distribs = {
#    'learning_rate':randint(low=0.001,high=1),
#    'max_iter':randint(low=5,high=250), 
#     'max_leaf_nodes': randint(low=2,high=50, scale = 1)
#                  }
param_distribs = {'max_iter': [5,10], 
                  'max_leaf_nodes': [15,31,40],
                  }

# loop the fitting with splits of the data
hgb_rscv_rmse1 = np.zeros((NoTrials, 1))
hgb_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    hgb_rscv =RandomizedSearchCV(
        HGB(), # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the Trainig Data
    hgb_rscv.fit(X_train, y_train)

    # calculate the CV scores
    hgb_rscv_rmse1[i] = np.sqrt(-hgb_rscv.best_score_)
    y_pred_hgb = cvp(hgb_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    hgb_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_hgb))

end_time = time.time()
execution_time = (end_time - start_time)/60
print(f"Execution time: {execution_time} minutes")

Trial 0 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 1 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 2 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 3 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 4 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 5 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 6 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 7 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 8 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 9 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 10 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 11 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 12 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 13 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Trial 14 of 15
Fitting 10 folds for each of 6 candidates, totalling 60 fits




Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Fitting 10 folds for each of 6 candidates, totalling 60 fits
Execution time: 438.2999840815862 minutes


In [57]:
print('optimal Parameters according to RSCV:', hgb_rscv.best_params_)
print('best score' ,hgb_rscv.best_score_) # this returns the negative of the MSE

optimal Parameters according to RSCV: {'max_leaf_nodes': 31, 'max_iter': 10}
best score -19639.775208577506


### HGB with optimal parameters

In [58]:
# optimal parameters
hgb_opt = HGB(**hgb_rscv.best_params_)

#fit the model
hgb_opt.fit(X_train, y_train)

# predict the values for X_test

y_pred_hgb = hgb_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_hgb = np.sqrt(mean_squared_error(y_test, y_pred_hgb))
rmse_hgb

141.15222233482842

# Export Models

In [66]:
### Computational Considerations 

# Define the current models: 

model_list = ["rf_opt", "pls_opt", "krr_opt", "xgb_opt", "hgb_opt"]

# write the models to memory: 
if save_model == True:
    for i in model_list: 
        # Extract model name
        #model_name = 
        model_name = i + i.__class__.__name__
        # Construct a filepath
        model_filepath = MODEL_PATH + f"/{model_name}.pkl"
        # Save the model
        joblib.dump(i, model_filepath)
else:
    print("Testrun, no model is written")

## Load the models from memory

for model in model_list:  
    model_name = model.__class__.__name__  
    model_filepath = MODEL_PATH + f"/{model_name}.pkl"  
    model = joblib.load(model_filepath)  

# Quality Control

In this section the goal is to document the packages which where used during the execution of this notebook

In [67]:
## Package informations
from sklearn import show_versions
show_versions()


System:
    python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
executable: /home/fhwn.ac.at/202375/.conda/envs/thesis/bin/python
   machine: Linux-5.15.0-101-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.2
          pip: 24.0
   setuptools: 69.5.1
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 80
         prefix: libopenblas
       filepath: /home/fhwn.ac.at/202375/.conda/envs/thesis/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: SkylakeX

       user_api: openmp
   internal_api: openmp
    num_threads: 80
         prefix: libgomp
       filepath: /home/fhwn.ac.at/202375/.conda/envs/thesis/lib/libgomp.so.1.0.0
        version: None


#### Time considerations

In [68]:
nb_end_time = time.time()
nb_execution_time = (nb_end_time - nb_start_time) / 60
print(f"Execution time: {nb_execution_time} minutes")

Execution time: 597.9394542932511 minutes


## Export Notebook

In [118]:
import subprocess
import datetime
import os

# Get the current date
now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")

# Define the notebook name, output name, and output directory
try:
    nb_filepath = __vsc_ipynb_file__ # works for Visual Studio Code
    notebook_name = nb_filepath.split('/')[-1]
except:
    print('Please enter the notebook name manually')
    pass
# notebook_name = '03_1_modeling_rscv.ipynb'

output_name = f"{notebook_name.split('.')[0]}_{date}.html"

output_directory = './results/03_modeling_results/'

# Ensure the output directory exists
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Specify the full output path
full_output_path = os.path.join(output_directory, output_name)

# Convert notebook to html with specified output name and path
subprocess.call(['jupyter', 'nbconvert', '--to', 'html', notebook_name, '--output', full_output_path])

[NbConvertApp] Converting notebook 03_1_modeling_rscv.ipynb to html
[NbConvertApp] Writing 604666 bytes to results/03_modeling_results/03_1_modeling_rscv_2024-05-10.html


0

In [98]:
globals()['_dh'][0] # notebook path

[PosixPath('/home/fhwn.ac.at/202375/Thesis')]

# Function to convert the notebook to HTML
def convert_notebook_to_html(notebook_name, output_name, RESULTS_PATH=RESULTS_PATH):
    full_output_path = os.path.join(RESULTS_PATH, output_name)
        # Use subprocess to call the jupyter nbconvert command
    subprocess.call(['jupyter', 'nbconvert', '--to', 'html', 'notebook_name','--output', 'output_name', '--output-dir', 'RESULTS_PATH'])
    
    # Optionally, rename the output file if needed
    # os.rename(notebook_name.split('.')[0] + '.html', full_output_path)

# Wait for a short period to ensure all cells have finished executing
time.sleep(3) # Adjust the sleep duration as needed

# Convert the notebook to HTML
convert_notebook_to_html(notebook_name, output_name)