# Creation of the Initial Model 

## Introduction

This notebook was developed to identify and specify the models to which the Active Learning strategies will be applied. At least two models will be created, as described in the initial research proposal: 
1. PLS-Regression-Model 
2. Random-Forest-Regression-Model

## Preparation

To work in Python, several libraries are needed. They are imported in the following cells. 

The code is inspired by the machine learning course taught by [Peter Sykacek](peter.sykacek[at]boku.ac.at) in the winter of 2023.


### Define Paths

In [11]:
import sys
# sys.path.clear()

# Basepath
basepath="./" # Project directory
sys.path.append(basepath)
sys.path.append(basepath+"server_files/ml_group/course.lib")

# Data
DATA_PATH = basepath + "data"

#Figure
FIGURE_PATH = basepath + "figures/03_modeling_figures"

# Modelpath
MODEL_PATH = basepath + "models"

# Path to environment

ENV_PATH = "/home/fhwn.ac.at/202375/.conda/envs/thesis/lib"

# Resultspath
RESULTS_PATH = basepath + "results/03_modeling_results"

# Add the paths
sys.path.extend({DATA_PATH, FIGURE_PATH, MODEL_PATH, ENV_PATH, RESULTS_PATH})
sys.path # Check if the path is correct

['/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python312.zip',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/lib-dynload',
 '',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib/python3.12/site-packages',
 './',
 './server_files/ml_group/course.lib',
 './results/03_modeling_results',
 './figures/03_modeling_figures',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib',
 './data',
 './models',
 './models',
 './',
 './server_files/ml_group/course.lib',
 './results/03_modeling_results',
 './figures/03_modeling_figures',
 '/home/fhwn.ac.at/202375/.conda/envs/thesis/lib',
 './data',
 './models']
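
The printed list contains every project path more than once, which suggests that the cell was executed repeatedly in the same kernel. A minimal sketch of a guard against this is shown below. The helper name `add_paths_once` is hypothetical and not part of the notebook; it only appends entries that are not yet present.

```python
import sys

def add_paths_once(paths, target=None):
    """Append each path to `target` (default: sys.path) only if it is missing.

    `add_paths_once` is a hypothetical helper for illustration.
    Returns the list of paths that were actually added.
    """
    if target is None:
        target = sys.path
    added = []
    for path in paths:
        if path not in target:
            target.append(path)
            added.append(path)
    return added

# Demonstration on a plain list instead of the real sys.path
demo_path = ["./"]
first = add_paths_once(["./", "./data", "./models"], target=demo_path)
second = add_paths_once(["./data", "./models"], target=demo_path)
print(demo_path)
```

Calling the helper a second time with the same paths leaves the list unchanged, so re-running the cell would not grow `sys.path`.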

In [2]:
## timing the full notebook
import time
nb_start_time = time.time()

### Define the Path to store the ML Models 

In [3]:
import joblib

# Define the path to save the ml models

MODEL_PATH = basepath + "models"
sys.path.append(MODEL_PATH)

### Imports

In [4]:
# import ml_lib as mlib
import numpy as np
import matplotlib.pyplot as plt

### Turn off convergence warnings

In [5]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
import os
os.environ["PYTHONWARNINGS"] = "ignore" # Also affect subprocesses

## Import Model functions

To generate the various models, the respective functions must be imported from existing packages. 

### Gridsearch Crossvalidation

[sklearn GSCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.

* GridSearchCV implements a "fit" and a "score" method.
* It also implements "score_samples", "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used.
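
To make "exhaustive" concrete, the following sketch enumerates every combination of a small parameter grid. It covers only the enumeration step, not the fitting or scoring that GridSearchCV performs for each candidate. The function name `enumerate_grid` and the grid values are assumptions for illustration.

```python
from itertools import product

def enumerate_grid(param_grid):
    """Yield every combination of a {name: [values]} grid as a dict."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical grid for illustration; the values are not taken from the notebook
example_grid = {"alpha": [1.0, 0.1, 0.01], "kernel": ["linear", "rbf"]}
candidates = list(enumerate_grid(example_grid))
print(len(candidates))  # 3 * 2 = 6 candidate settings
```

The number of candidates is the product of the list lengths, which is why a grid search becomes expensive quickly as parameters are added.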

In [6]:
from sklearn.model_selection import GridSearchCV as GSCV

### Randomized Parameter Optimization

[sklearn RandomizedSearchCV](https://scikit-learn.org/stable/modules/grid_search.html#grid-search)  

RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

* A budget can be chosen independent of the number of parameters and possible values.
* Adding parameters that do not influence the performance does not decrease efficiency.
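
The budget argument can be illustrated with a small sketch that draws a fixed number of settings regardless of how large the search space is. The helper `sample_settings` and the search space are hypothetical. As a simplification, the sketch samples with replacement, whereas scikit-learn samples without replacement when all parameters are given as lists.

```python
import random

def sample_settings(param_space, n_iter, seed=0):
    """Draw `n_iter` settings; each value list is sampled independently."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in param_space.items()}
        for _ in range(n_iter)
    ]

# Hypothetical search space with 4 * 7 * 5 = 140 possible combinations
space = {
    "alpha": [1.0, 0.1, 0.01, 0.001],
    "gamma": [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0],
    "degree": [1, 2, 3, 4, 5],
}
settings = sample_settings(space, n_iter=10)
print(len(settings))  # the budget stays at 10, independent of the 140 combinations
```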


In [7]:
from sklearn.model_selection import RandomizedSearchCV

### K-Fold cross-validator

[sklearn KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
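
The splitting logic without shuffling can be sketched in plain Python. The function `kfold_indices` is a hypothetical re-implementation for illustration only; in the notebook the splitting is done by `KFold` itself.

```python
def kfold_indices(n_samples, n_splits):
    """Return (train, validation) index lists for consecutive folds.

    The first n_samples % n_splits folds get one extra sample, so that
    every sample is used exactly once for validation.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, validation))
        start = stop
    return folds

for train, validation in kfold_indices(n_samples=7, n_splits=3):
    print("train:", train, "validation:", validation)
```

With 7 samples and 3 folds, the validation folds have sizes 3, 2 and 2, and each training set consists of the remaining samples.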

In [8]:
from sklearn.model_selection import KFold

### Kernel Ridge

[sklearn KRR](https://scikit-learn.org/stable/modules/kernel_ridge.html#kernel-ridge-regression)

Kernel ridge regression (KRR) [M2012] combines Ridge regression and classification (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.
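
As a rough illustration of the kernel trick in this setting, the sketch below solves the regularized linear system in the kernel space with NumPy and an RBF kernel. The function names, the toy data and the chosen values of `alpha` and `gamma` are assumptions for illustration; the notebook itself relies on `KernelRidge`.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def krr_fit_predict(X_train, y_train, X_new, alpha=1e-3, gamma=1.0):
    """Solve (K + alpha * I) c = y, then predict with K(X_new, X_train) @ c."""
    K = rbf_kernel(X_train, X_train, gamma)
    dual_coef = np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_new, X_train, gamma) @ dual_coef

# Toy data (made up for illustration): a noiseless sine curve
X_toy = np.linspace(0.0, 3.0, 20).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel()
y_hat = krr_fit_predict(X_toy, y_toy, X_toy)
print(float(np.max(np.abs(y_hat - y_toy))))
```

The model is linear in the dual coefficients, yet the prediction is a non-linear function of the original input because the RBF kernel is non-linear.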

In [9]:
from sklearn.kernel_ridge import KernelRidge as KRR

### Model Inspection

[sklearn cv_results_](https://scikit-learn.org/stable/modules/grid_search.html#analyzing-results-with-the-cv-results-attribute)

"The cv_results_ attribute contains useful information for analyzing the results of a search. It can be converted to a pandas dataframe with df = pd.DataFrame(est.cv_results_)."

## Data Import

In this section the sample data will be imported. 

Currently, two datasets are of interest: 
1. PS20191107_gegl.csv
2. dps1200.csv

The first is the full, unmodified dataset. It was used to generate the latter, which contains only selected sections of the spectra. The wavelengths in the second dataset were selected by discarding wavelengths from the full spectra; the selection criteria are not yet documented.

**TODO**: Research the criteria

### PS20191107 (Full Data)

In [10]:
import pandas as pd
data_full = pd.read_csv(basepath+"data/PS20191107_gegl.csv", 
                            sep=";", decimal=",", encoding="utf-8")
data_full.head()

FileNotFoundError: [Errno 2] No such file or directory: './data/PS20191107_gegl.csv'

#### Statistics

In [None]:
# Retrieve basic characteristics for each variable
data_full.describe()

In [None]:
data_full.groupby('type')[['year']].agg(['max', 'mean', 'min'])

### Import dataset dps1200.csv

In [None]:
data_small = pd.read_csv(basepath+"data/dps1200.csv", 
                            sep=",", decimal=".", encoding="utf-8")
data_small.head()

In [None]:
# Correct the column headers

# data_1200.rename(lambda x: x[1:], axis='columns')
data_small = data_small.rename(columns=lambda x: x.replace('X', ''))
data_small.head()
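
Note that `x.replace('X', '')` removes every `X` in a header, not only a leading one. A more conservative sketch is shown below. It assumes that the spectral columns are named like `X1200`, which is an assumption because the actual headers are not shown here, and `strip_x_prefix` is a hypothetical helper.

```python
def strip_x_prefix(column_name):
    """Remove one leading 'X' if the remainder parses as a number."""
    if column_name.startswith("X"):
        candidate = column_name[1:]
        try:
            float(candidate)
        except ValueError:
            return column_name
        return candidate
    return column_name

# Hypothetical column names for illustration
columns = ["X1200", "X1198.5", "year", "Xylem"]
print([strip_x_prefix(name) for name in columns])
```

Because `DataFrame.rename` accepts a function for `columns`, the helper could be passed as `data_small.rename(columns=strip_x_prefix)`.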

In [None]:
data_small.describe()
# describe() gives some basic statistics for numeric columns

In [None]:
data_small.describe(include="object")
# describe() gives some basic statistics for numeric columns, 
# categorical columns are included with the option include="object"

## Modelling Parameters

In [None]:
# Define the parameters for the CV

# Switch for the dataset
    # Select from (data_small, data_full) or other if implemented
data = data_small

# Switch for testing mode (use only 10% of the data, among others)
testing = True

# Define a random state for randomized processes
random_state = np.random.RandomState(202375)

# Define a metric for model evaluation
cv_scorer = 'neg_mean_squared_error'

######################################################
if testing:
    nfolds = 3
    NoTrials = 2
    n_jobs = 20
    save_model = False
    print("Testing mode for Cross Validation")
    print("Splitting the data for faster modelling")
    data = data.sample(frac=0.1, random_state=random_state) # seeded so the subsample is reproducible
else:
    nfolds = 10
    NoTrials = 15
    n_jobs = 40
    save_model = True
    print("Extensive mode for Cross Validation")
######################################################

In [None]:
X = data.select_dtypes('float')
X

In [None]:
y = data['year']
y

In [None]:
random_state

## Train/Test split

In this project, the statistical models are fitted on a random fraction of the dataset. The remainder is held out as a test set to estimate the accuracy of each model and to detect potential overfitting. 

In [None]:
from sklearn.model_selection import train_test_split
# Split the dataset into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)

# Random Forest (RSCV)
Implemented: 
- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

## Parameter Distribution

In [None]:
# RF Define the parameters for the CV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict as cvp
from scipy.stats import randint
from sklearn.metrics import mean_squared_error

start_time = time.time()

rf = RandomForestRegressor() # the default criterion to evaluate the quality of a split is "squared_error"

param_distribs = {'n_estimators': randint(low=3, high=150), # for hyperparameter with discrete values 
                  'min_samples_split': randint(low=2, high=20), 
                  'max_depth': randint(low=1, high=20), 
                  'min_samples_leaf': randint(low=1, high=10),
                  }

# loop the fitting with splits of the data
rf_rscv_rmse1 = np.zeros((NoTrials, 1))
rf_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i + 1} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    rf_rscv = RandomizedSearchCV(
        rf, # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the training data
    rf_rscv.fit(X_train, y_train)

    # calculate the CV scores
    rf_rscv_rmse1[i] = np.sqrt(-rf_rscv.best_score_)
    y_pred_rf = cvp(rf_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    rf_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_rf))

end_time = time.time()
execution_time = (end_time - start_time)/60
print(f"Execution time: {execution_time} minutes")


In [None]:
print('optimal Parameters according to RSCV:', rf_rscv.best_params_)
print('best score', rf_rscv.best_score_) # this returns the negative of the MSE
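
Because the scorer is `'neg_mean_squared_error'`, the reported score has to be negated before taking the square root to obtain an RMSE. The sketch below shows the conversion and a summary over several trials. The score values are made-up numbers for illustration, not results from this notebook, and `rmse_from_neg_mse` is a hypothetical helper.

```python
import math
import statistics

def rmse_from_neg_mse(neg_mse):
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    return math.sqrt(-neg_mse)

# Hypothetical best_score_ values from three trials (made-up numbers)
neg_mse_scores = [-4.0, -9.0, -6.25]
rmse_values = [rmse_from_neg_mse(score) for score in neg_mse_scores]
print(rmse_values)                    # [2.0, 3.0, 2.5]
print(statistics.mean(rmse_values))   # 2.5
```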

## RF with optimal parameters

Extract the best parameters and refit the RF regression on the full training data.

In [None]:
# optimal parameters
rf_opt = RandomForestRegressor(**rf_rscv.best_params_)

#fit the model
rf_opt.fit(X_train, y_train)

# predict the values for X_test
y_pred_rf = rf_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

In [None]:
rmse_rf

# PLS (RSCV)
Implemented:


*TODO*

- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

In [None]:
# The parameter search should be limited with regard to the number of components to keep:
# "Should be in [1, min(n_samples, n_features, n_targets)]" sklearn

X_test.shape

In [None]:
# Define the parameters for the CV
from sklearn import cross_decomposition

start_time = time.time()

pls = cross_decomposition.PLSRegression()

param_distribs = {'n_components': randint(low=1, high=90), # randint excludes 'high', so 1..89 are sampled; see the bound quoted above
                  'max_iter': randint(low=2, high=700), 
                  }

# loop the fitting with splits of the data
pls_rscv_rmse1 = np.zeros((NoTrials, 1))
pls_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i + 1} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    pls_rscv = RandomizedSearchCV(
        pls, # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=0, 
        n_jobs=n_jobs)
    
    # fit the model on the training data
    pls_rscv.fit(X_train, y_train)

    # calculate the CV scores
    pls_rscv_rmse1[i] = np.sqrt(-pls_rscv.best_score_)
    y_pred_pls = cvp(pls_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    pls_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_pls))

end_time = time.time()
execution_time = (end_time - start_time)/60
print(f"Execution time: {execution_time} minutes")


In [None]:
print('optimal Parameters according to RSCV:', pls_rscv.best_params_)
print('best score' ,pls_rscv.best_score_) # this returns the negative of the MSE

### PLS with optimal parameters

In [None]:
# optimal parameters
pls_opt = cross_decomposition.PLSRegression(**pls_rscv.best_params_)

#fit the model
pls_opt.fit(X_train, y_train)

# predict the values for X_test

y_pred_pls = pls_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_pls = np.sqrt(mean_squared_error(y_test, y_pred_pls))

In [None]:
rmse_pls

# KRR with RBF (RSCV)
Implemented:

- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

TODO

In [None]:
# KRR with RBF 

# alpha: Regularization strength

param_distribs = {"alpha": [1e0, 1e-1, 1e-2, 1e-3], 
                  "gamma": np.logspace(-2, 2, 7)}

# param_distribs = {"alpha": np.logspace(0.0001, 0.1), 
#                  "gamma": np.logspace(0.0001, 0.1)}

In [None]:
np.logspace(-2, 2, 7)

In [None]:
from sklearn.kernel_ridge import KernelRidge as KRR

start_time = time.time()

# loop the fitting with splits of the data
krr_rscv_rmse1 = np.zeros((NoTrials, 1))
krr_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i + 1} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    krr_rscv = RandomizedSearchCV(
        KRR(kernel='rbf'), # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the training data
    krr_rscv.fit(X_train, y_train)

    # calculate the CV scores
    krr_rscv_rmse1[i] = np.sqrt(-krr_rscv.best_score_)
    y_pred_krr = cvp(krr_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    krr_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_krr))

end_time = time.time()
execution_time = (end_time - start_time)/ 60
print(f"Execution time: {execution_time} minutes")

In [None]:
print('optimal Parameters according to RSCV:', krr_rscv.best_params_)
print('best score' ,krr_rscv.best_score_) # this returns the negative of the MSE

## KRR (rbf) with optimal parameters

In [None]:
# optimal parameters
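# kernel='rbf' is not part of best_params_, so it must be passed explicitly;
# otherwise KRR falls back to its default linear kernel
krr_opt = KRR(kernel='rbf', **krr_rscv.best_params_)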

#fit the model
krr_opt.fit(X_train, y_train)

# predict the values for X_test

y_pred_krr = krr_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_krr = np.sqrt(mean_squared_error(y_test, y_pred_krr))

In [None]:
rmse_krr

# MLP (RSCV)

The multi-layer perceptron (MLP) is a neural network whose neurons are organized in three or more layers (one input layer, n hidden layers, and one output layer). The MLP builds on the threshold logic unit (TLU), sometimes called linear threshold unit (LTU). A TLU receives inputs through its connections, computes the weighted sum of these inputs, and applies a step function to that sum. Common step functions are the *Heaviside step function* and the *sign function*. 
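
A single TLU can be sketched in a few lines: it forms the weighted sum of its inputs, adds a bias, and applies a step function. The weights and bias below are hypothetical values chosen so that the unit behaves like a logical AND.

```python
def heaviside(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def tlu(inputs, weights, bias, step=heaviside):
    """Weighted sum of the inputs plus bias, passed through a step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hypothetical weights that make the unit behave like a logical AND
weights, bias = [1.0, 1.0], -1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, tlu([a, b], weights, bias))
```

Only the input pair (1, 1) pushes the weighted sum above zero, so it is the only case in which the unit outputs 1.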

To compute the outputs of a single fully connected layer, the following equation can be used:

(citation: A. Géron, Hands-on machine learning with Scikit-Learn and TensorFlow concepts, tools, and techniques to build intelligent systems, 2nd ed. O’Reilly Media, Inc., 2019. p.283)


$$h_{W,b}(X) = \phi(XW + b)$$
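
The layer output can be sketched with NumPy as follows. The sketch assumes that samples are stored as rows of `X`, so the product is written `X @ W`; with samples stored as columns the same layer would read `W @ X`. The shapes, the random values and the choice of `tanh` as activation are assumptions for illustration.

```python
import numpy as np

def dense_layer(X, W, b, activation=np.tanh):
    """Compute activation(X @ W + b) for a batch X of shape (n_samples, n_inputs)."""
    return activation(X @ W + b)

# Hypothetical shapes: 4 samples, 3 input features, 2 neurons
rng = np.random.default_rng(0)
X_batch = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))   # one column of weights per neuron
b = np.zeros(2)               # one bias per neuron
H = dense_layer(X_batch, W, b)
print(H.shape)  # (4, 2)
```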

Implemented: 

- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters 

*TODO* 

In [None]:
# parameter Distribution for mlp

from scipy.stats import randint, uniform

param_distribs = {"hidden_layer_sizes": randint(low=50, high=200), # an integer yields a single hidden layer with that many neurons
                  "activation": ['identity', 'logistic', 'tanh', 'relu'],
                  "solver": ['lbfgs','sgd', 'adam'],
                  'alpha': uniform(loc=0.0001, scale=0.1),
                  'early_stopping': [True, False],  
                  'validation_fraction': uniform(loc=0.1, scale=0.1)
}

In [None]:
from sklearn.neural_network import MLPRegressor as MLP

start_time = time.time()

# loop the fitting with splits of the data
mlp_rscv_rmse1 = np.zeros((NoTrials, 1))
mlp_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i + 1} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    mlp_rscv = RandomizedSearchCV(
        MLP(), # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=0, 
        n_jobs=n_jobs)
    
    # fit the model on the training data
    mlp_rscv.fit(X_train, y_train)

    # calculate the CV scores
    mlp_rscv_rmse1[i] = np.sqrt(-mlp_rscv.best_score_)
    y_pred_mlp = cvp(mlp_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    mlp_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_mlp))

end_time = time.time()
execution_time = (end_time - start_time)/ 60
print(f"Execution time: {execution_time} minutes")

In [None]:
print('optimal Parameters according to RSCV:', mlp_rscv.best_params_)
print('best score' ,mlp_rscv.best_score_) # this returns the negative of the MSE

In [None]:
# optimal parameters
mlp_opt = MLP(**mlp_rscv.best_params_)

#fit the model
mlp_opt.fit(X_train, y_train)

# predict the values for X_test
y_pred_mlp = mlp_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_mlp = np.sqrt(mean_squared_error(y_test, y_pred_mlp))

In [None]:
rmse_mlp

# XGBoost (RSCV)

Implemented:

- import

*TODO*


- parameter distribution  
- Train  
- Test  
- CV Results  
- Optimal Model Parameters  

In [None]:
import xgboost
xgboost.__version__

#### Transform the data into the XGBoost data class

For details, see [datacamp](https://www.datacamp.com/tutorial/xgboost-in-python)

In [None]:
# Create regression matrices
dtrain_reg = xgboost.DMatrix(X_train, y_train)
dtest_reg = xgboost.DMatrix(X_test, y_test)

### Define the objective

XGBoost will be used here for a regression problem, with the objective to minimize the squared error of the model.

In [None]:
params = {"objective": "reg:squarederror", 
          "tree_method": "hist"} # "gpu_hist" for gpu only, set to 'hist' if on cpu

In [None]:
# Define hyperparameters

n = 500 # number of rounds
evals = [(dtrain_reg, "train"), (dtest_reg, "validation")] # specify the data for evaluation

model = xgboost.train(
   params=params,
   dtrain=dtrain_reg,
   num_boost_round=n,
   evals=evals,
   verbose_eval = 20, 
   early_stopping_rounds=20,
)

# [60]	train-rmse:3.18431	validation-rmse:123.78980
# [71]	train-rmse:1.87073	validation-rmse:123.80743

### XGBoost Crossvalidation

In [None]:
n = 1000

results = xgboost.cv(
   params,
   dtrain_reg,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)
results.head()

In [None]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

### Accessing the xgboost eval metrics via sklearn

From the [xgboost](https://xgboost.readthedocs.io/en/latest/python/examples/sklearn_evals_result.html#demo-for-accessing-the-xgboost-eval-metrics-by-using-sklearn-interface) tutorial

In [None]:
# Create regression matrices
dtrain_reg = xgboost.DMatrix(X_train, y_train)
dtest_reg = xgboost.DMatrix(X_test, y_test)

# Pass plain strings; lists are only meaningful inside a search space
params = {"objective": "reg:squarederror", 
          "tree_method": "hist"}

XGB = xgboost.XGBModel(**params)

In [None]:
from xgboost import XGBRegressor

start_time = time.time()
# Instantiate the regressor
XGB = XGBRegressor()

param_distribs = {
    "n_estimators": randint(100,500),
    "max_depth": randint(3,100)
}


# loop the fitting with splits of the data
xgb_rscv_rmse1 = np.zeros((NoTrials, 1))
xgb_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i + 1} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    xgb_rscv = RandomizedSearchCV(
        XGB, # regressor
        param_distributions = param_distribs, # hyperparameter space
        n_iter = 10, # "Number of parameter settings that are sampled." [sklearn]
        cv = inner_cv, # "Determines the cross-validation splitting strategy" [sklearn]
        scoring = cv_scorer, 
        random_state = random_state, 
        verbose = 1, 
        n_jobs = n_jobs)
    
    # fit the model on the training data
    xgb_rscv.fit(X_train, y_train)

    # calculate the CV scores
    xgb_rscv_rmse1[i] = np.sqrt(-xgb_rscv.best_score_)
    y_pred_xgb = cvp(xgb_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    xgb_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_xgb))

end_time = time.time()
execution_time = (end_time - start_time)/ 60
print(f"Execution time: {execution_time} minutes")

In [None]:
print('optimal Parameters according to RSCV:', xgb_rscv.best_params_)
print('best score' ,xgb_rscv.best_score_) # this returns the negative of the MSE

In [None]:
# optimal parameters
best_params = xgb_rscv.best_params_

xgb_opt = XGBRegressor(**xgb_rscv.best_params_)

#fit the model
xgb_opt.fit(X_train, y_train)

# predict the values for X_test
y_pred_xgb = xgb_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))

In [None]:
rmse_xgb

# Histogram-based Gradient Boosting Regression Tree

*TODO*
- parameter distribution
- Train  
- Test  
- CV Results  
- Optimal Model Parameters 

In [None]:
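from sklearn.ensemble import HistGradientBoostingRegressor as HGB

# Restart the timer; otherwise the execution time below includes the XGBoost cell
start_time = time.time()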

# Define parameters for HGB
#param_distribs = {
#    'learning_rate':randint(low=0.001,high=1),
#    'max_iter':randint(low=5,high=250), 
#     'max_leaf_nodes': randint(low=2,high=50, scale = 1)
#                  }
param_distribs = {'max_iter': [5,10], 
                  'max_leaf_nodes': [15,31,40],
                  }

# loop the fitting with splits of the data
hgb_rscv_rmse1 = np.zeros((NoTrials, 1))
hgb_rscv_rmse2 = np.zeros((NoTrials, 1))

for i in range(0, NoTrials):
    print(f"Trial {i + 1} of {NoTrials}")

    # Split the data into 'nfolds' number of splits 
    inner_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    outer_cv = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    # define the RSCV object
    hgb_rscv = RandomizedSearchCV(
        HGB(), # regressor
        param_distributions=param_distribs, # hyperparameter space
        n_iter=10, # "Number of parameter settings that are sampled." [sklearn]
        cv=inner_cv, # "Determines the cross-validation splitting strategy"[sklearn]
        scoring=cv_scorer, 
        random_state=random_state, 
        verbose=1, 
        n_jobs=n_jobs)
    
    # fit the model on the training data
    hgb_rscv.fit(X_train, y_train)

    # calculate the CV scores
    hgb_rscv_rmse1[i] = np.sqrt(-hgb_rscv.best_score_)
    y_pred_hgb = cvp(hgb_rscv, X_train, y_train, cv=outer_cv, n_jobs=n_jobs)
    hgb_rscv_rmse2[i] = np.sqrt(mean_squared_error(y_train, y_pred_hgb))

end_time = time.time()
execution_time = (end_time - start_time)/60
print(f"Execution time: {execution_time} minutes")

In [None]:
print('optimal Parameters according to RSCV:', hgb_rscv.best_params_)
print('best score' ,hgb_rscv.best_score_) # this returns the negative of the MSE

### HGB with optimal parameters

In [None]:
# optimal parameters
hgb_opt = HGB(**hgb_rscv.best_params_)

#fit the model
hgb_opt.fit(X_train, y_train)

# predict the values for X_test

y_pred_hgb = hgb_opt.predict(X_test)

# calculate the error between y_test (true) and y predicted
rmse_hgb = np.sqrt(mean_squared_error(y_test, y_pred_hgb))
rmse_hgb

# Export Models

In [None]:
### Computational Considerations 

# Define the current models (variable names of the fitted estimators)

model_list = ["rf_opt", "pls_opt", "krr_opt", "xgb_opt", "hgb_opt"]

# Write the models to disk
if save_model:
    os.makedirs(MODEL_PATH, exist_ok=True)
    for model_name in model_list:
        # Look up the fitted estimator by its variable name;
        # dumping the name itself would only store a string
        model = globals()[model_name]
        # Construct a filepath
        model_filepath = MODEL_PATH + f"/{model_name}.pkl"
        # Save the model
        joblib.dump(model, model_filepath)
else:
    print("Test run, no model is written")

## Load the models from disk

loaded_models = {}  
for model_name in model_list:  
    model_filepath = MODEL_PATH + f"/{model_name}.pkl"  
    loaded_models[model_name] = joblib.load(model_filepath)  
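
The save and load round trip can be sketched with a dictionary that maps file names to objects. The sketch uses `pickle` from the standard library as a stand-in; `joblib.dump(obj, path)` and `joblib.load(path)` follow the same pattern. The helper names and the stand-in objects are assumptions for illustration.

```python
import os
import pickle
import tempfile

def save_models(models, directory):
    """Write each object in the {name: object} dict to '<directory>/<name>.pkl'."""
    os.makedirs(directory, exist_ok=True)
    paths = {}
    for name, model in models.items():
        paths[name] = os.path.join(directory, f"{name}.pkl")
        with open(paths[name], "wb") as handle:
            pickle.dump(model, handle)
    return paths

def load_models(names, directory):
    """Read the objects back into a {name: object} dict."""
    loaded = {}
    for name in names:
        with open(os.path.join(directory, f"{name}.pkl"), "rb") as handle:
            loaded[name] = pickle.load(handle)
    return loaded

# Stand-in objects; in the notebook these would be the fitted estimators
stand_ins = {"rf_opt": {"max_depth": 5}, "pls_opt": {"n_components": 3}}
with tempfile.TemporaryDirectory() as tmp_dir:
    save_models(stand_ins, tmp_dir)
    restored = load_models(stand_ins.keys(), tmp_dir)
print(restored == stand_ins)
```

Keeping the name and the object together in one dictionary avoids deriving file names from the objects and keeps the loaded models addressable by name.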

# Quality Control

The goal of this section is to document the packages that were used during the execution of this notebook.

In [None]:
## Package information
from sklearn import show_versions
show_versions()

#### Time considerations

In [None]:
nb_end_time = time.time()
nb_execution_time = (nb_end_time - nb_start_time) / 60
print(f"Execution time: {nb_execution_time} minutes")
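
The start and end time pattern recurs in several cells of this notebook. A possible way to avoid repeating it is a small context manager, sketched below. The name `timed` and the `results` dictionary are hypothetical and not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print the elapsed minutes of the enclosed block; optionally store them."""
    start = time.perf_counter()
    try:
        yield
    finally:
        minutes = (time.perf_counter() - start) / 60
        if results is not None:
            results[label] = minutes
        print(f"{label}: {minutes:.4f} minutes")

timings = {}
with timed("demo", timings):
    time.sleep(0.01)  # placeholder for a model search
```

Since each block starts its own timer, a forgotten `start_time` assignment cannot leak the duration of an earlier cell into a later measurement.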

## Export Notebook

In [None]:
import subprocess
import datetime
import os

# Get the current date
now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")

# Define the notebook name, output name, and output directory
try:
    nb_filepath = __vsc_ipynb_file__ # works for Visual Studio Code
    notebook_name = os.path.basename(nb_filepath)
except NameError:
    # Outside Visual Studio Code the variable is undefined; fall back to the manual name
    print('Please enter the notebook name manually')
    notebook_name = '03_1_modeling_rscv.ipynb'

output_name = f"{notebook_name.split('.')[0]}_{date}.html"

output_directory = './results/03_modeling_results/'

# Ensure the output directory exists
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Specify the full output path
full_output_path = os.path.join(output_directory, output_name)

# Convert notebook to html with specified output name and path
subprocess.call(['jupyter', 'nbconvert', '--to', 'html', notebook_name, '--output', full_output_path])

In [None]:
globals()['_dh'][0] # notebook path

# Function to convert the notebook to HTML
def convert_notebook_to_html(notebook_name, output_name, RESULTS_PATH=RESULTS_PATH):
    full_output_path = os.path.join(RESULTS_PATH, output_name)
    # Use subprocess to call the jupyter nbconvert command;
    # pass the variables themselves, not their names as string literals
    subprocess.call(['jupyter', 'nbconvert', '--to', 'html', notebook_name,
                     '--output', output_name, '--output-dir', RESULTS_PATH])
    
    # Optionally, rename the output file if needed
    # os.rename(notebook_name.split('.')[0] + '.html', full_output_path)

# Wait for a short period to ensure all cells have finished executing
time.sleep(3) # Adjust the sleep duration as needed

# Convert the notebook to HTML
convert_notebook_to_html(notebook_name, output_name)
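
To check the export command before running it, the argument list can be built separately from the `subprocess` call. The sketch below only constructs the list and does not execute anything. The function name `build_nbconvert_command` is hypothetical, and the fixed date is chosen for illustration.

```python
import datetime
import os

def build_nbconvert_command(notebook_path, output_dir, date=None):
    """Return the argument list for an HTML export with a dated file name."""
    if date is None:
        date = datetime.date.today()
    stem = os.path.splitext(os.path.basename(notebook_path))[0]
    output_name = f"{stem}_{date.isoformat()}"
    return [
        "jupyter", "nbconvert", "--to", "html", notebook_path,
        "--output", output_name, "--output-dir", output_dir,
    ]

command = build_nbconvert_command(
    "03_1_modeling_rscv.ipynb",
    "./results/03_modeling_results",
    date=datetime.date(2024, 1, 31),  # fixed date, for illustration only
)
print(command)
```

The resulting list could then be passed to `subprocess.call`. Using `os.path.splitext` instead of `split('.')` keeps notebook names that contain additional dots intact.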