
# Exercise - DTs for regression

1. Use the $\texttt{load_boston}$ data (remember to split it into training and test sets). Using your training and validation data, optimize the hyperparameters of your DT. How well does your optimized model perform on the test data? Is it better than your optimized SVM on the same data (the third exercise from last week)?
1. (Optional/bonus): Try standardizing your data. Does it improve your model? Then try selecting only the 5 most important features. Does that improve your model's performance?

**See slides for more details!**

In [40]:
import warnings
warnings.filterwarnings('ignore')

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import tree
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

X, y = load_boston(return_X_y = True)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.2,
                                                    random_state = 42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Use `train_test_split` to split your training data into a train and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train,
                                                  y_train,
                                                  test_size = 0.2,
                                                  random_state = 42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

(404, 13) (102, 13) (404,) (102,)
(323, 13) (81, 13) (102, 13) (323,) (81,) (102,)
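
The printed shapes follow from applying a 20% hold-out twice. As a minimal sketch of that arithmetic, assuming the held-out share is rounded up (which is consistent with the shapes above), the hypothetical helper `split_sizes` below reproduces the three set sizes without loading any data:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out share up (assumption)."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 506  # 404 + 102 rows, as printed above

# First split: hold out 20% as the test set.
n_train_full, n_test = split_sizes(n_total, 0.2)

# Second split: hold out 20% of the remaining rows as the validation set.
n_train, n_val = split_sizes(n_train_full, 0.2)

print(n_train, n_val, n_test)
```

The final model is therefore tuned on roughly 64% of the rows, validated on 16%, and tested on 20%.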


# Exercise 1

Use the $\texttt{load_boston}$ data (remember to split it into training and test sets). Using your training and validation data, optimize the hyperparameters of your DT. How well does your optimized model perform on the test data? Is it better than your optimized SVM on the same data (the third exercise from last week)?

In [41]:
# Candidate values for each hyperparameter: the integers 2 through 10.
min_samples_split_list = list(range(2, 11))
min_samples_leaf_list = list(range(2, 11))
max_features_list = list(range(2, 11))

results = []

for min_samples_split in min_samples_split_list:
    for min_samples_leaf in min_samples_leaf_list:
        for max_features in max_features_list:
            dt_current = tree.DecisionTreeRegressor(
                min_samples_split = min_samples_split,
                min_samples_leaf = min_samples_leaf,
                max_features = max_features,
                random_state = 1)
            dt_current.fit(X_train, y_train)
            y_val_hat = dt_current.predict(X_val)
            # scikit-learn's convention is (y_true, y_pred); MSE is symmetric, so the value is unchanged.
            mse = mean_squared_error(y_val, y_val_hat)

            results.append([mse, min_samples_split, min_samples_leaf, max_features])

results = pd.DataFrame(results)
results.columns = ['MSE', 'min_samples_split', 'min_samples_leaf', 'max_features']
print(results)

           MSE  min_samples_split  min_samples_leaf  max_features
0    37.197821                  2                 2             2
1    22.529417                  2                 2             3
2    28.711638                  2                 2             4
3    27.963416                  2                 2             5
4    20.210195                  2                 2             6
..         ...                ...               ...           ...
724  27.920179                 10                10             6
725  32.711158                 10                10             7
726  33.023648                 10                10             8
727  22.787196                 10                10             9
728  24.069593                 10                10            10

[729 rows x 4 columns]
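
The nested loops above can also be written with scikit-learn's `GridSearchCV`. Note that this swaps in a different technique: it scores each combination with k-fold cross-validation instead of a single validation split. The sketch below uses synthetic data from `make_regression` and a reduced grid, both of which are assumptions made only to keep the example self-contained; the names `X_demo`, `y_demo` and `search` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 13 features, like the exercise data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=42)

# A reduced grid over the same three hyperparameters as the loops above.
param_grid = {
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 10],
    'max_features': [2, 6, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, hence the sign
    cv=5,
)
search.fit(X_demo, y_demo)

best_params_cv = search.best_params_
best_mse_cv = -search.best_score_  # flip the sign back to a plain MSE
print(best_params_cv, round(best_mse_cv, 2))
```

With the default `refit=True`, `search.best_estimator_` is already refitted on all the data passed to `fit`, which plays the same role as refitting on train plus validation data.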


In [42]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].min()]

           MSE  min_samples_split  min_samples_leaf  max_features
584  12.412906                  9                 3            10
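
To avoid copying the best values by hand, the winning row can be turned into keyword arguments. This is a minimal sketch: `results_demo` holds only a few rows copied from the table above (not the full 729-row grid), and `best_params` is an illustrative name.

```python
import pandas as pd

# A handful of rows copied from the printed results (not the full grid).
results_demo = pd.DataFrame(
    [[37.197821, 2, 2, 2],
     [22.529417, 2, 2, 3],
     [20.210195, 2, 2, 6],
     [12.412906, 9, 3, 10]],
    columns=['MSE', 'min_samples_split', 'min_samples_leaf', 'max_features'],
)

# idxmin returns the index label of the row with the smallest MSE.
best_row = results_demo.loc[results_demo['MSE'].idxmin()]

# Cast back to int: the mixed-type row is stored as floats.
param_names = ['min_samples_split', 'min_samples_leaf', 'max_features']
best_params = {name: int(best_row[name]) for name in param_names}
print(best_params)
```

The dictionary can then be unpacked into the model, for example `tree.DecisionTreeRegressor(**best_params, random_state = 1)`.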


In [43]:
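# Initialize your final model.
# Note: the values below differ from the best row of the grid search above
# (min_samples_split = 9, min_samples_leaf = 3, max_features = 10).
dt_optimized = tree.DecisionTreeRegressor(
    min_samples_split = 5,
    min_samples_leaf = 4,
    max_features = 4,
    random_state = 1
    )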

# Fit on training and validation data together (np.concatenate stacks the arrays row-wise, like rbind in R)
dt_optimized.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = dt_optimized.predict(X_test)

# Obtain and check mse on test data
mse_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized DT achieved MSE = {round(mse_optimized, 2)}.')

Optimized DT achieved MSE = 17.19.


# Exercise 2

(Optional/bonus): Try standardizing your data. Does it improve your model? Then try selecting only the 5 most important features. Does that improve your model's performance?
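
One possible way to approach this exercise is sketched below. It is not a reference solution: the data are synthetic (`make_regression`), the fixed `min_samples_leaf = 4` is an arbitrary choice, and all variable names are illustrative. The scaler is fitted on the training data only, and the 5 features are chosen from the tree's `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 13 features, 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=13, n_informative=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Standardize using statistics from the training data only.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

# Fit on all standardized features.
dt_full = DecisionTreeRegressor(min_samples_leaf=4, random_state=1)
dt_full.fit(X_tr_std, y_tr)
mse_full = mean_squared_error(y_te, dt_full.predict(X_te_std))

# Keep the 5 features with the largest impurity-based importance.
top5 = np.argsort(dt_full.feature_importances_)[::-1][:5]
dt_top5 = DecisionTreeRegressor(min_samples_leaf=4, random_state=1)
dt_top5.fit(X_tr_std[:, top5], y_tr)
mse_top5 = mean_squared_error(y_te, dt_top5.predict(X_te_std[:, top5]))

print(f'All features: MSE = {mse_full:.2f}; top 5 features: MSE = {mse_top5:.2f}')
```

Because tree splits are thresholds on single features, rescaling a feature generally leaves the fitted partitions unchanged, so standardization is not expected to matter much for a DT, unlike for an SVM.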