# Exercise - DTs for regression

1. Use the `./data/HousingData.csv` data (remember to split your data into training and test sets). Using your training and validation data, optimize the parameters of your DT. How well does your optimized model perform on the test data?
1. (Optional/bonus): Try standardizing your data. Does it improve your model? Further, try selecting only the 5 most important features. Does that improve the performance of your model?

**See slides for more details!**

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import tree
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

data = './data/HousingData.csv'
raw_df = pd.read_csv(data).dropna()

print(raw_df.head())

# Create a copy of the DataFrame with column names
df_copy = raw_df.copy()

# Separate the target variable (y) and features (X)
y = df_copy['MEDV']  # Target column: MEDV
X = df_copy.drop(columns=['MEDV'])  # All remaining columns are features

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use `train_test_split` again to split the training data into a training and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print("The shape of train, validation and test sets are:")
print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7   
5  0.02985   0.0   2.18   0.0  0.458  6.430  58.7  6.0622    3  222     18.7   

        B  LSTAT  MEDV  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
5  394.12   5.21  28.7  
The shape of train, validation and test sets are:
(252, 13) (63, 13) (79, 13) (252,) (63,) (79,)


# Exercise 1

Use the `./data/HousingData.csv` data (remember to split your data into training and test sets). Using your training and validation data, optimize the parameters of your DT. How well does your optimized model perform on the test data? Is it better than your optimized SVM for the same data (the third exercise from last week)?

In [9]:
min_samples_split_list = [6, 12]  # candidate values, separated by ","
min_samples_leaf_list = [5, 10, 15]  # candidate values, separated by ","
max_features_list = [2, 5, 10]  # candidate values, separated by ","

results = []

for min_samples_split in min_samples_split_list:
    for min_samples_leaf in min_samples_leaf_list:
        for max_features in max_features_list:
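            # With max_features below the number of columns, the features tried at each
            # split are drawn at random; a fixed seed makes the scores repeatable.
            dt_current = tree.DecisionTreeRegressor(min_samples_split=min_samples_split,
                                                    min_samples_leaf=min_samples_leaf,
                                                    max_features=max_features,
                                                    random_state=42)
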
            dt_current.fit(X_train, y_train)
            y_val_hat = dt_current.predict(X_val)
            mse = mean_squared_error(y_val, y_val_hat)  # signature is (y_true, y_pred)

            results.append([mse, min_samples_split, min_samples_leaf, max_features])

results = pd.DataFrame(results)
results.columns = ['MSE', 'min_samples_split', 'min_samples_leaf', 'max_features']
print(results)

          MSE  min_samples_split  min_samples_leaf  max_features
0   17.291970                  6                 5             2
1   31.505136                  6                 5             5
2   22.170397                  6                 5            10
3   26.103759                  6                10             2
4   18.618021                  6                10             5
5   21.104061                  6                10            10
6   21.709061                  6                15             2
7   12.858097                  6                15             5
8   21.492076                  6                15            10
9   30.151790                 12                 5             2
10  12.269284                 12                 5             5
11  23.528177                 12                 5            10
12  26.347607                 12                10             2
13  16.305567                 12                10             5
14  18.649164            

In [11]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].min()]

          MSE  min_samples_split  min_samples_leaf  max_features
10  12.269284                 12                 5             5
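
The best row can also be turned into keyword arguments instead of being copied by hand, which avoids typing the wrong values into the final model. The following is a minimal sketch: the small `results` frame is a hypothetical stand-in holding three rows copied from the printed table above, and `best_row` and `best_params` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the `results` table built by the grid loop;
# the three rows are copied from the printed output above.
results = pd.DataFrame(
    [[17.291970, 6, 5, 2],
     [12.858097, 6, 15, 5],
     [12.269284, 12, 5, 5]],
    columns=['MSE', 'min_samples_split', 'min_samples_leaf', 'max_features'],
)

# idxmin returns the index label of the smallest MSE; loc fetches that row.
best_row = results.loc[results['MSE'].idxmin()]

# Drop the score and cast back to int (a mixed-dtype row is upcast to float).
best_params = {name: int(value) for name, value in best_row.drop('MSE').items()}
print(best_params)
# {'min_samples_split': 12, 'min_samples_leaf': 5, 'max_features': 5}
```

With the real table, `tree.DecisionTreeRegressor(**best_params)` would then build the model from the selected row.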


In [8]:
# Initialize your final model
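# NOTE: this cell was run (In [8]) before the grid search above (In [9]), so these
# values are not the best row found there (min_samples_split=12,
# min_samples_leaf=5, max_features=5). Swap in the best row and re-run to
# evaluate the tuned model; the MSE printed below belongs to the values shown here.
dt_optimized = tree.DecisionTreeRegressor(
    min_samples_split=4,
    min_samples_leaf=15,
    max_features=10
    )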

# Use both training and validation data to fit it (pd.concat stacks the rows, like rbind in R,
# and keeps the column names, so predicting on the X_test DataFrame raises no feature-name warning)
dt_optimized.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = dt_optimized.predict(X_test)

# Obtain and check the MSE on the test data (signature is (y_true, y_pred))
mse_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized DT achieved MSE = {round(mse_optimized, 2)}.')

Optimized DT achieved MSE = 38.26.
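
MSE is expressed in squared units of the target, so taking the square root gives an error on the same scale as `MEDV`. A small worked example, assuming `MEDV` is measured in thousands of dollars as in the standard Boston housing data (the file itself does not state the unit); `mse_test` and `rmse_test` are illustrative names:

```python
import math

mse_test = 38.26  # test MSE printed above
rmse_test = math.sqrt(mse_test)  # back on the scale of MEDV
print(f'RMSE = {rmse_test:.2f}')  # RMSE = 6.19
```

Under that assumption, the predictions are off by roughly 6.2 (thousand dollars) in a root-mean-square sense.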




# Exercise 2

(Optional/bonus): Try standardizing your data. Does it improve your model? Further, try selecting only the 5 most important features. Does that improve the performance of your model?
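
One possible way to approach this exercise is sketched below. It is not a reference solution: to stay self-contained it uses synthetic data from `make_regression` in place of `./data/HousingData.csv`, the hyperparameters are fixed illustrative values rather than tuned ones, and `fit_and_score` is a hypothetical helper. Decision trees split on thresholds, so rescaling a feature is generally expected to change the result very little; the feature ranking here uses the tree's impurity-based `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption): 13 features, 5 informative.
X, y = make_regression(n_samples=400, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


def fit_and_score(X_tr, X_te):
    """Fit a tree with fixed, illustrative settings; return (model, test MSE)."""
    model = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    model.fit(X_tr, y_train)
    return model, mean_squared_error(y_test, model.predict(X_te))


# 1) Raw features.
dt_raw, mse_raw = fit_and_score(X_train, X_test)

# 2) Standardized features: the scaler is fitted on the training data only.
scaler = StandardScaler().fit(X_train)
_, mse_scaled = fit_and_score(scaler.transform(X_train), scaler.transform(X_test))

# 3) Keep the 5 features with the largest impurity-based importance.
top5 = np.argsort(dt_raw.feature_importances_)[::-1][:5]
_, mse_top5 = fit_and_score(X_train[:, top5], X_test[:, top5])

print(f'raw: {mse_raw:.2f}  standardized: {mse_scaled:.2f}  top-5: {mse_top5:.2f}')
```

With the housing DataFrame, the column selection would use `X_train.iloc[:, top5]` instead of NumPy indexing, and the feature ranking should be taken from a model fitted on the training data only, so that the test set plays no part in choosing features.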