# Decision Trees: Regression on House Pricing Dataset
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms, etc.) plus its price, which is what we would like to predict.

## Insert your ID number ("numero di matricola") below

In [8]:
#put your "numero di matricola" (ID number) here
numero_di_matricola = 1

In [9]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the data, remove data samples/points with missing values (NaN) and take a look at them.

In [10]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

statistic,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0
mean,4645240000.0,535435.8,3.381163,2.071903,2070.027813,15250.54,1.434893,0.009798,0.244311,3.459229,7.615676,1761.252212,308.775601,1967.489254,94.668774,98077.125158,47.557868,-122.212337,1982.544564,13176.302465
std,2854203000.0,380900.4,0.895472,0.768212,920.251879,42544.57,0.507792,0.098513,0.776298,0.682592,1.166324,815.934864,458.977904,28.095275,424.439427,54.172937,0.140789,0.139577,686.25667,25413.180755
min,1000102.0,75000.0,0.0,0.0,380.0,649.0,1.0,0.0,0.0,1.0,3.0,380.0,0.0,1900.0,0.0,98001.0,47.1775,-122.514,620.0,660.0
25%,2199775000.0,315000.0,3.0,1.5,1430.0,5453.75,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1950.0,0.0,98032.0,47.459575,-122.32425,1480.0,5429.5
50%,4027701000.0,445000.0,3.0,2.0,1910.0,8000.0,1.0,0.0,0.0,3.0,7.0,1545.0,0.0,1969.0,0.0,98059.0,47.5725,-122.226,1830.0,7873.0
75%,7358175000.0,640250.0,4.0,2.5,2500.0,11222.5,2.0,0.0,0.0,4.0,8.0,2150.0,600.0,1990.0,0.0,98117.0,47.68025,-122.124,2360.0,10408.25
max,9839301000.0,5350000.0,8.0,6.0,8010.0,1651359.0,3.5,1.0,4.0,5.0,12.0,6720.0,2620.0,2015.0,2015.0,98199.0,47.7776,-121.315,5790.0,425581.0


Extract input and output data. We want to predict the price by using features other than id as input.

In [11]:
Data = df.values
# m = number of input samples
m = Data.shape[0]
print("Amount of data:",m)
Y = Data[:m,2]
X = Data[:m,3:]

feature_names = df.columns[3:]

Amount of data: 3164


## Data Pre-Processing

We split the $m$ samples of the data into 3 parts: one will be used for training and choosing the parameters, one for choosing among different models, and one for testing. The part for training and choosing the parameters will consist of $m_{train}=2/3 m$ samples (rounded down), the one for choosing among different models will consist of $m_{val}= (m - m_{train})/2$ samples (rounded down), while the remaining part consists of $m_{test}=m - m_{train} - m_{val}$ samples.

In [12]:
# Split data into train (2/3 of samples), validation (1/6 of samples), and test data (the rest)
m_train = int(2./3.*m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val
print("Amount of data for training and deciding parameters:",m_train)
print("Amount of data for validation (choosing among different models):",m_val)
print("Amount of data for test:",m_test)
from sklearn.model_selection import train_test_split

Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=m_val/(m_train+m_val), random_state=numero_di_matricola)

Amount of data for training and deciding parameters: 2109
Amount of data for validation (choosing among different models): 527
Amount of data for test: 528


Let's standardize the data.

In [13]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtrain_and_val_scaled = scaler.transform(Xtrain_and_val)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
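To make the standardization step concrete, here is a minimal sketch of the same computation done by hand with NumPy. The toy arrays `X_train_toy` and `X_new_toy` are hypothetical and only for illustration; the point is that the mean and standard deviation are estimated on the training rows and then reused for every other split.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 new rows, 2 features each.
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0],
                        [4.0, 400.0]])
X_new_toy = np.array([[2.5, 250.0],
                      [5.0, 500.0]])

# Statistics are estimated on the training rows only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same training statistics are then applied to every split.
X_train_std = (X_train_toy - mu) / sigma
X_new_std = (X_new_toy - mu) / sigma

print(X_train_std.mean(axis=0))  # approximately 0 for each feature
print(X_train_std.std(axis=0))   # approximately 1 for each feature
print(X_new_std)                 # uses training statistics, so not zero-mean in general
```

Validation and test data transformed this way are generally not exactly zero-mean and unit-variance, which is expected: they must be mapped with the statistics the model saw during training.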

## Neural Networks
Let's learn the best neural network with 1 hidden layer and between 1 and 9 hidden nodes, choosing the best number of hidden nodes with cross-validation.

In [14]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

mlp_cv = MLPRegressor()
param_grid = {'hidden_layer_sizes': [i for i in range(1,10)],
              'activation': ['relu'],
              'solver': ['lbfgs'], 
              'random_state': [numero_di_matricola]
             }
mlp_GS = GridSearchCV(mlp_cv, param_grid=param_grid, 
                   cv=5, verbose=True)
mlp_GS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   14.0s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=MLPRegressor(activation='relu', alpha=0.0001,
                                    batch_size='auto', beta_1=0.9, beta_2=0.999,
                                    early_stopping=False, epsilon=1e-08,
                                    hidden_layer_sizes=(100,),
                                    learning_rate='constant',
                                    learning_rate_init=0.001, max_iter=200,
                                    momentum=0.9, n_iter_no_change=10,
                                    nesterovs_momentum=True, power_t=0.5,
                                    random_state=None, shuffle=True,
                                    solver='adam', tol=0.0001,
                                    validation_fraction=0.1, verbose=False,
                                    warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'activation': ['relu'],
                         'hid

Now let's check which parameter value is best, and then compare the best NN with the linear model (learned on train and validation) on test data.

In [15]:
#let's print the best model according to grid search
print("Best model: ",mlp_GS.best_estimator_)
#let's print the error 1-R^2 for the best model
print("Error (1-R^2) of best model: ",1. - mlp_GS.best_score_)

Best model:  MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=5, learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)
Error (1-R^2) of best model:  0.21972439475822436
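Since the error $1-R^2$ is used throughout this notebook, a small sketch of how it is computed may help. The helper `one_minus_r2` and the toy targets are hypothetical names introduced here for illustration; for regressors, `score` returns $R^2$, and `best_score_` is the mean of that score over the cross-validation folds.

```python
import numpy as np

def one_minus_r2(y_true, y_pred):
    """Return 1 - R^2 = SS_res / SS_tot for 1-D arrays of targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return ss_res / ss_tot

y_toy = np.array([100.0, 200.0, 300.0, 400.0])
print(one_minus_r2(y_toy, y_toy))                     # perfect predictions -> 0.0
print(one_minus_r2(y_toy, np.full(4, y_toy.mean())))  # always predicting the mean -> 1.0
print(one_minus_r2(y_toy, y_toy[::-1]))               # worse than the mean -> 4.0
```

An error of 0 means perfect predictions, an error of 1 means the model is no better than always predicting the mean price, and an error above 1 means it is worse than that baseline.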


Let's refit the best NN (5 hidden nodes) and compare its error on training data and on test data. Note that the cell below fits the model on the training split only, not on training and validation together.

In [16]:
best_mlp= MLPRegressor(hidden_layer_sizes=(5,),activation='relu',solver='lbfgs',random_state=numero_di_matricola)
best_mlp.fit(Xtrain_scaled,Ytrain)



training_error = 1. - best_mlp.score(Xtrain_scaled,Ytrain)
test_error = 1. - best_mlp.score(Xtest_scaled,Ytest)

print(training_error)
print(test_error)


0.1639317987011324
0.20236189245123737


## Linear Regression

Now let's learn the linear model on train and validation, and compute its error (1-R^2) on train and validation and on test data.

In [17]:
from sklearn import linear_model
#LR the linear regression model
LR = linear_model.LinearRegression()

#fit the model on training and validation data
LR.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("1 - coefficient of determination on training data:"+str(1 - LR.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("1 - coefficient of determination on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))

1 - coefficient of determination on training data:0.2715568577139226
1 - coefficient of determination on test data:0.3373644878767468


Note: MLPRegressor has several other parameters!

## Decision Trees

Let's learn a decision tree without any limitation.

In [18]:
#import the proper module
from sklearn.tree import DecisionTreeRegressor


#define the model

DT= DecisionTreeRegressor( random_state=numero_di_matricola)

#learn the model 
DT.fit(Xtrain_scaled,Ytrain)

#print error on training and on validation
print("1 - coefficient of determination on training data:"+str(1 - DT.score(Xtrain_scaled,Ytrain)))
print("1 - coefficient of determination on validation data:"+str(1 - DT.score(Xval_scaled, Yval)))


1 - coefficient of determination on training data:0.0006400252574232379
1 - coefficient of determination on validation data:0.3468527250513973


Let's check some characteristics of the tree, such as its depth and its number of nodes.


In [20]:
print("depth of the tree",DT.tree_.max_depth)
print("number of nodes",DT.tree_.node_count)

depth of the tree 25
number of nodes 4071
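A quick back-of-the-envelope check shows how large this tree is relative to the training set. The sketch below only uses the two numbers printed in this notebook (4071 nodes, 2109 training samples) and the fact that in a binary tree where every internal node has exactly two children, the number of leaves is one more than the number of internal nodes.

```python
# Values reported earlier in this notebook.
node_count = 4071   # DT.tree_.node_count
m_train = 2109      # number of training samples

# leaves = internal nodes + 1, hence leaves = (node_count + 1) / 2.
n_leaves = (node_count + 1) // 2
print("leaves:", n_leaves)
print("average training samples per leaf:", m_train / n_leaves)
```

With about one training sample per leaf, the unrestricted tree essentially memorizes the training set, which is consistent with its near-zero training error.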



Let's try to plot the tree.

In [None]:

from sklearn import tree
#note that the performance is 70% because we are building almost one node per sample...
#the model is VERY complicated
#WE MUST SIMPLIFY IT!

plt.figure(figsize=(20,20))

tree.plot_tree(DT, feature_names=feature_names)
plt.show()





Let's use a max depth of 2.

In [4]:
DT_depth2= DecisionTreeRegressor( random_state=numero_di_matricola,max_depth=2)

#learn the model 
DT_depth2.fit(Xtrain_scaled,Ytrain)

#print error on training and on validation
print("1 - coefficient of determination on training data:"+str(1 - DT_depth2.score(Xtrain_scaled,Ytrain)))
print("1 - coefficient of determination on validation data:"+str(1 - DT_depth2.score(Xval_scaled, Yval)))


NameError: name 'DecisionTreeRegressor' is not defined

In [1]:

import sklearn


print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.21.2.


Let's plot the tree.

In [3]:
from sklearn import tree
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))

tree.plot_tree(DT_depth2, feature_names=feature_names)
plt.show()


NameError: name 'DT_depth2' is not defined

<Figure size 1440x1440 with 0 Axes>

What happens if we do not normalize the data?

In [18]:
DT_depth2_nonorm= DecisionTreeRegressor( random_state=numero_di_matricola,max_depth=2)

#learn the model 
DT_depth2_nonorm.fit(Xtrain,Ytrain)

#print error on training and on validation
#the tree was fit on unscaled features, so it must be scored on unscaled features too;
#scoring it on Xtrain_scaled and Xval_scaled compares thresholds learned in raw units
#against standardized values and yields meaningless errors above 1 (about 1.34 and 1.33 here)
print("1 - coefficient of determination on training data:"+str(1 - DT_depth2_nonorm.score(Xtrain,Ytrain)))
print("1 - coefficient of determination on validation data:"+str(1 - DT_depth2_nonorm.score(Xval, Yval)))


Let's plot the tree.

In [None]:
#complete

Let's build a tree of depth 3.

In [None]:
#complete

Let's plot the tree.

In [None]:
#complete
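As a text alternative to plotting, here is a minimal sketch of fitting a depth-3 tree and printing its splits with `export_text` (available since scikit-learn 0.21). It runs on synthetic stand-in data, not the house dataset; `X_demo`, `y_demo`, `demo_names`, and `DT_depth3_demo` are hypothetical names, and this is not meant as the solution to the cells above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (an assumption, not the house dataset).
rng = np.random.RandomState(0)
demo_names = ['f0', 'f1', 'f2']   # hypothetical feature names
X_demo = rng.uniform(0.0, 10.0, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + np.where(X_demo[:, 1] > 5.0, 10.0, 0.0)

# A depth-3 tree has at most 2**3 = 8 leaves.
DT_depth3_demo = DecisionTreeRegressor(max_depth=3, random_state=0)
DT_depth3_demo.fit(X_demo, y_demo)

# Text rendering of the learned splits, one line per node.
report = export_text(DT_depth3_demo, feature_names=demo_names)
print(report)
print("leaves:", DT_depth3_demo.get_n_leaves())
```

The printed rules are easier to read than a large figure when you only need to see which features and thresholds the first few splits use.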

Let's use cross validation to find the best value of the maximum depth between 1 and 9.

In [None]:
#complete
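One possible way to set up this search mirrors the `GridSearchCV` call used for the neural network. The sketch below uses synthetic stand-in data, and `X_demo`, `y_demo`, `depth_grid`, and `dt_GS_demo` are hypothetical names; in the notebook you would presumably fit on `Xtrain_and_val_scaled` and `Ytrain_and_val` instead.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the house dataset):
# the target depends nonlinearly on the first two of five features.
rng = np.random.RandomState(0)
X_demo = rng.uniform(-1.0, 1.0, size=(300, 5))
y_demo = np.sin(3.0 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=300)

# 5-fold cross-validation over max_depth = 1, ..., 9, scored by R^2 (the default).
depth_grid = {'max_depth': list(range(1, 10))}
dt_GS_demo = GridSearchCV(DecisionTreeRegressor(random_state=0),
                          param_grid=depth_grid, cv=5)
dt_GS_demo.fit(X_demo, y_demo)

print("Best max_depth:", dt_GS_demo.best_params_['max_depth'])
print("Cross-validated error (1-R^2):", 1.0 - dt_GS_demo.best_score_)
```

Very shallow trees tend to underfit and very deep ones tend to overfit, so the selected depth usually lies somewhere in between.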

Let's see what the best model is.

In [None]:
#complete

Let's learn on all of training and validation.

In [None]:
#complete

We can inspect the importance of each feature.

In [None]:
#complete

Let's print the names of the most important features.

In [None]:
#complete
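A fitted tree exposes `feature_importances_`, an array with one non-negative value per input column that sums to 1. The sketch below shows one way to rank features by importance and map the ranking back to names; the synthetic data and the names `demo_names`, `dt_demo`, and `top_names` are hypothetical, and in the notebook the names would come from `feature_names`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the house dataset).
rng = np.random.RandomState(0)
demo_names = np.array(['f0', 'f1', 'f2', 'f3'])   # hypothetical feature names
X_demo = rng.uniform(-1.0, 1.0, size=(400, 4))
# Only f0 and f1 influence the target, and f0 matters most.
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + 0.1 * rng.normal(size=400)

dt_demo = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

importances = dt_demo.feature_importances_   # one value per feature, sums to 1
order = np.argsort(importances)[::-1]        # indices from most to least important

for idx in order:
    print(demo_names[idx], importances[idx])

top_names = demo_names[order[:2]]            # names of the two most important features
print("Most important features:", list(top_names))
```

Indexing a NumPy array of names with the sorted indices keeps names and importances aligned without writing an explicit lookup.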

## Random Forest

Let's use random forest without changing parameters.

In [None]:
#complete
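For orientation, here is a hedged sketch comparing a single unrestricted tree with a random forest on synthetic stand-in data. All names (`X_demo`, `single_tree`, `forest`, and so on) are hypothetical. The number of trees is set explicitly because the default of `n_estimators` depends on the scikit-learn version: it was 10 in the 0.21 release reported above and became 100 from version 0.22.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the house dataset).
rng = np.random.RandomState(0)
X_demo = rng.uniform(-1.0, 1.0, size=(400, 5))
y_demo = np.sin(3.0 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.3 * rng.normal(size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# One unrestricted tree versus an ensemble of 100 trees.
single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_val_error = 1.0 - single_tree.score(X_va, y_va)
forest_val_error = 1.0 - forest.score(X_va, y_va)
print("single tree, validation error (1-R^2):", tree_val_error)
print("random forest, validation error (1-R^2):", forest_val_error)
```

Averaging many trees trained on bootstrap samples typically reduces the variance of a single deep tree, so the forest often reaches a lower validation error, although this is not guaranteed on every dataset.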

Let's run cross-validation on the maximum depth.

In [None]:
#complete

Let's see what is the best model and how it performs.

In [None]:
#complete

Let's learn on all of training and validation, and test on test.

In [None]:
#complete

Let's see the importance of each feature.

In [None]:
#complete

Let's see the name of the most important features.

In [None]:
#complete