The data files contain pre-processed training and test data from the Ames housing dataset. Train a decision tree (DT) model that predicts SalePrice. First, create a hyperparameter (HP) grid with *max_depth*, *min_samples_split* and *min_samples_leaf*, choosing a reasonable range for each HP. Determine the best HP settings using Exhaustive Search and Bayesian Optimisation. For Bayesian Optimisation, evaluate at most 10 models. Compare the results you obtain with the two search methods. Comment on the degree of overfitting of the best models.
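
To get a feel for the cost difference between the two search strategies, the sketch below counts how many model fits each one needs. The grid values are placeholders chosen purely for illustration, not recommended ranges; the 10 folds and the budget of 10 candidates come from the task setup.

```python
from itertools import product

# Hypothetical grid, for illustration only (not the recommended ranges)
example_grid = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
n_folds = 10       # 10-fold cross-validation, as used in this notebook
bayes_budget = 10  # at most 10 candidate settings for Bayesian Optimisation

# Exhaustive search evaluates every combination of the listed values
combinations = list(product(*example_grid.values()))
exhaustive_fits = len(combinations) * n_folds
bayes_fits = bayes_budget * n_folds

print(f"Exhaustive search: {len(combinations)} candidates, {exhaustive_fits} fits")
print(f"Bayesian Optimisation: {bayes_budget} candidates, {bayes_fits} fits")
```

Both counts exclude the single final refit on the full training set. Adding one more value to any HP multiplies the exhaustive count, while the Bayesian budget stays fixed.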

Some parts of the solution are already provided. Write code in the empty cells and in places indicated with "???".

Hint: use "HP tuning.ipynb" as an example.

# Ensure skopt is installed

In [None]:
!pip install scikit-optimize

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_theme(palette="Set2")

# execution time
from timeit import default_timer as timer
from datetime import timedelta

# increase column width
pd.set_option('display.max_colwidth', 250)

# silence warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Load data

In [None]:
trainset = ???
testset = ???

In [None]:
trainset.info()

In [None]:
testset.info()

# Separate predictors and target

In [None]:
ytrain = trainset["SalePrice"].copy()
Xtrain = trainset.drop("SalePrice", axis=1)
ytest = testset["SalePrice"].copy()
Xtest = testset.drop("SalePrice", axis=1)

# Create a tree with default HP settings

Train an unconstrained DT on the training data. Evaluate it using RMSE, and examine its depth.
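
As a rough illustration of this step, the sketch below trains a default tree on synthetic regression data and reports its RMSE and depth. The data from `make_regression` is a stand-in and not the Ames dataset, and the names `demo_tree`, `X_tr` and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Ames dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

# Default settings put no limit on depth, so the tree keeps splitting
# until every leaf is pure
demo_tree = DecisionTreeRegressor(random_state=7)
demo_tree.fit(X_tr, y_tr)

# RMSE computed as the square root of the MSE
rmse_train = np.sqrt(mean_squared_error(y_tr, demo_tree.predict(X_tr)))
rmse_test = np.sqrt(mean_squared_error(y_te, demo_tree.predict(X_te)))

print(f"Train RMSE: {rmse_train:.2f}")
print(f"Test RMSE:  {rmse_test:.2f}")
print("Depth:", demo_tree.get_depth())
```

An unconstrained tree typically memorises the training set, so a near-zero training RMSE next to a much larger test RMSE is the overfitting signal to look for. The observed depth can also help when choosing a range for *max_depth*.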

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [None]:
# the depth of the tree
tree_reg.get_depth()

# Exhaustive search

In [None]:
from sklearn.model_selection import ???

In [None]:
start = timer()

# specify the hyperparameters and their values
hp_grid = {
    'max_depth': [???],
    'min_samples_leaf': [???],
    'min_samples_split': [???],
}

tree_reg = DecisionTreeRegressor(random_state=7)

# we'll use 10-fold cross-validation
grid_search = GridSearchCV(tree_reg, hp_grid, cv=10,
                           scoring='neg_root_mean_squared_error', 
                           return_train_score=True, verbose=1)

grid_search.fit(???)

print("Execution time HH:MM:SS:", timedelta(seconds=timer() - start))

In [None]:
grid_search.best_estimator_

In [None]:
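# sklearn reports negated RMSE; "test" here refers to the validation folds
cv_results = pd.DataFrame(grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
# relative gap between training and validation RMSE, in % of the training RMSE
# (negative when the validation RMSE is larger)
cv_results["diff, %"] = 100 * (cv_results["mean_train_score"] - cv_results["mean_test_score"]
                               ) / cv_results["mean_train_score"]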

cv_results.sort_values('mean_test_score')
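
One possible way to judge the overfitting of the selected model is to compare its cross-validated training and validation RMSE with the RMSE on a held-out test set. The sketch below does this on synthetic data; the data, the small grid, the 5 folds (chosen for speed) and the names `demo_grid` and `demo_search` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Ames dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

# Small placeholder grid, for illustration only
demo_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 10]}

demo_search = GridSearchCV(
    DecisionTreeRegressor(random_state=7), demo_grid, cv=5,
    scoring="neg_root_mean_squared_error", return_train_score=True,
)
demo_search.fit(X_tr, y_tr)

# Scores are negated RMSE values, so flip the sign back
best = demo_search.best_index_
rmse_cv_train = -demo_search.cv_results_["mean_train_score"][best]
rmse_cv_valid = -demo_search.cv_results_["mean_test_score"][best]

# best_estimator_ is refitted on the full training set (refit=True by default)
rmse_test = np.sqrt(mean_squared_error(y_te, demo_search.best_estimator_.predict(X_te)))

print("Best settings:", demo_search.best_params_)
print(f"CV train RMSE:      {rmse_cv_train:.2f}")
print(f"CV validation RMSE: {rmse_cv_valid:.2f}")
print(f"Held-out test RMSE: {rmse_test:.2f}")
```

A large gap between the training and validation RMSE suggests the tree is still overfitting. A test RMSE close to the validation RMSE suggests the cross-validation estimate is trustworthy.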

# Bayesian Optimisation

In [None]:
from skopt import ???

In [None]:
start = timer()

hp_grid = {
    'max_depth': [???],
    'min_samples_leaf': [???],
    'min_samples_split': [???],
}

opt_grid_search = BayesSearchCV(
    DecisionTreeRegressor(random_state=7),
    hp_grid,
    n_iter=???,
    random_state=7,
    scoring='neg_root_mean_squared_error',
    return_train_score=True,
    cv=10
)

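# workaround: older scikit-optimize releases still use the np.int alias,
# which was removed in NumPy 1.24
np.int = int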
opt_grid_search.fit(Xtrain, ytrain)

print("Execution time HH:MM:SS:", timedelta(seconds=timer() - start))

In [None]:
opt_grid_search.best_estimator_

In [None]:
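# sklearn reports negated RMSE; "test" here refers to the validation folds
cv_results = pd.DataFrame(opt_grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
# relative gap between training and validation RMSE, in % of the training RMSE
# (negative when the validation RMSE is larger)
cv_results["diff, %"] = 100 * (cv_results["mean_train_score"] - cv_results["mean_test_score"]
                               ) / cv_results["mean_train_score"]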

cv_results.sort_values('mean_test_score')