The data files contain pre-processed training and test data from the Ames housing dataset. Train a decision tree (DT) model that predicts "SalePrice".

Build a pipeline for the Ames data that consists of a feature selection step based on Pearson's correlation followed by a DT step. Create a hyperparameter (HP) grid that tunes the *k* HP of the feature selection step and the *max_depth* and *min_samples_leaf* HPs of the DT model, choosing a suitable range for each HP. Determine the best HP settings using Bayesian Optimisation.

Replace the step that selects features based on Pearson's *r* with feature selection based on recursive feature elimination (RFE). Comment on the results obtained.
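
To make the RFE idea concrete, here is a minimal sketch on synthetic data (not the Ames data, and not the exercise solution). It assumes a shallow `DecisionTreeRegressor` as the wrapped estimator and arbitrary illustrative settings. RFE repeatedly fits the estimator, drops the feature with the lowest importance, and stops once `n_features_to_select` features remain; that parameter plays the role that *k* plays in `SelectKBest`.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only the first two of five columns influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Settings below are illustrative assumptions, not tuned values
rfe = RFE(
    estimator=DecisionTreeRegressor(max_depth=4, random_state=7),
    n_features_to_select=2,  # how many features to keep
    step=1,                  # remove one feature per iteration
)
rfe.fit(X, y)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features,
# with larger values for features that were eliminated earlier
print("kept features:", rfe.support_)
print("ranking:", rfe.ranking_)
```

Because RFE refits the estimator once per eliminated feature, it is typically slower than a filter such as Pearson's *r*, which is worth considering when comparing execution times.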

Some parts of the solution are already provided. Write code in the empty cells and in places indicated with "???".

Hint: use "Sklearn pipeline.ipynb" and "RFE.ipynb" as examples.

In [None]:
!pip install scikit-optimize

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_theme(palette="Set2")

# execution time
from timeit import default_timer as timer
from datetime import timedelta

# increase column width
pd.set_option('display.max_colwidth', 250)

# silence warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Load the data

In [None]:
trainset = pd.read_csv("trainset-ames-housing.csv")
testset = pd.read_csv("testset-ames-housing.csv")

# separate predictors and target
ytrain = trainset["SalePrice"].copy()
Xtrain = trainset.drop("SalePrice", axis=1)
ytest = testset["SalePrice"].copy()
Xtest = testset.drop("SalePrice", axis=1)

# Model development

## Feature Selection with Pearson's r

In [None]:
from skopt import BayesSearchCV

# import relevant modules
from imblearn.pipeline import Pipeline
???

In [None]:
start = timer()

pipe = Pipeline([
    ('fsel', SelectKBest(r_regression)),
    ('dt', DecisionTreeRegressor(random_state=7))
])

hp_grid = {
    'fsel__k': [???],
    'dt__max_depth': [???],
    'dt__min_samples_leaf': [???],
}

opt_grid_search = BayesSearchCV(
     pipe,
     hp_grid,
     n_iter=???,
     random_state=7,
     scoring='neg_root_mean_squared_error',
     return_train_score=True,
     cv=10
)

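# workaround: older scikit-optimize releases still reference np.int,
# which was removed in NumPy 1.24
np.int = int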
opt_grid_search.fit(???, ???)

print("Execution time HH:MM:SS:", timedelta(seconds=timer() - start))

In [None]:
cv_results = pd.DataFrame(opt_grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]

# the scorer returns negative RMSE, so flip the sign to obtain RMSE
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]

# relative gap between train and validation RMSE, as a percentage of the train RMSE
cv_results["diff, %"] = 100 * (
    cv_results["mean_train_score"] - cv_results["mean_test_score"]
) / cv_results["mean_train_score"]

# lowest validation RMSE first
cv_results.sort_values('mean_test_score')

## Feature Selection with RFE

In [None]:
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import ???

start = timer()

pipe = Pipeline([
    ('fsel', ???),
    ('dt', ???)
])

hp_grid = {
    ???
}

opt_grid_search = BayesSearchCV(
     pipe,
     hp_grid,
     n_iter=???,
     random_state=7,
     scoring='neg_root_mean_squared_error',
     return_train_score=True,
     cv=10
)

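# workaround: older scikit-optimize releases still reference np.int,
# which was removed in NumPy 1.24
np.int = int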
opt_grid_search.fit(???, ???)

print("Execution time HH:MM:SS:", timedelta(seconds=timer() - start))

In [None]:
cv_results = pd.DataFrame(opt_grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]

# the scorer returns negative RMSE, so flip the sign to obtain RMSE
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]

# relative gap between train and validation RMSE, as a percentage of the train RMSE
cv_results["diff, %"] = 100 * (
    cv_results["mean_train_score"] - cv_results["mean_test_score"]
) / cv_results["mean_train_score"]

# lowest validation RMSE first
cv_results.sort_values('mean_test_score')