# Regression with CART Trees - Lab

## Introduction

In this lab, we'll use what we learned in the previous lesson to build a model for the [Petrol Consumption Dataset](https://www.kaggle.com/harinir/petrol-consumption) from Kaggle. The model will predict petrol (gasoline) consumption from features such as the petrol tax, average income, paved highways, and the share of the population holding a driver's licence.

## Objectives

In this lab you will: 

- Fit a decision tree regression model with scikit-learn

## Import necessary libraries 

In [1]:
# Import libraries 
import pandas as pd  
import numpy as np  
from sklearn.model_selection import train_test_split 

## The dataset 

- Import the `'petrol_consumption.csv'` dataset 
- Print the first five rows of the data 
- Print the dimensions of the data 

In [2]:
# Import the dataset
dataset = pd.read_csv('petrol_consumption.csv')

In [3]:
# Print the first five rows
dataset.head()

| | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |


In [6]:
# Print the dimensions of the data
print(dataset.shape)
dataset.info()

(48, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
Petrol_tax                      48 non-null float64
Average_income                  48 non-null int64
Paved_Highways                  48 non-null int64
Population_Driver_licence(%)    48 non-null float64
Petrol_Consumption              48 non-null int64
dtypes: float64(2), int64(3)
memory usage: 2.0 KB


- Print the summary statistics of all columns in the data: 

In [5]:
# Describe the dataset
dataset.describe()

| | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| count | 48.0 | 48.0 | 48.0 | 48.0 | 48.0 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.95077 | 573.623768 | 3491.507166 | 0.05547 | 111.885816 |
| min | 5.0 | 3063.0 | 431.0 | 0.451 | 344.0 |
| 25% | 7.0 | 3739.0 | 3110.25 | 0.52975 | 509.5 |
| 50% | 7.5 | 4298.0 | 4735.5 | 0.5645 | 568.5 |
| 75% | 8.125 | 4578.75 | 7156.0 | 0.59525 | 632.75 |
| max | 10.0 | 5342.0 | 17782.0 | 0.724 | 968.0 |


## Create training and test sets

- Assign the target column `'Petrol_Consumption'` to `y` 
- Assign the remaining independent variables to `X` 
- Split the data into training and test sets using an 80/20 split 
- Set the random state to 42 

In [7]:
# Split the data into training and test sets
X = dataset.drop('Petrol_Consumption', axis=1)
y = dataset['Petrol_Consumption']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
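
As a quick sanity check on what an 80/20 split means for a 48-row dataset, the sketch below runs `train_test_split` on placeholder arrays of the same shape as the feature matrix. The names `X_demo` and `y_demo` are made up for illustration and do not contain the petrol data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 48 rows and 4 feature columns (not the real petrol data)
X_demo = np.arange(48 * 4).reshape(48, 4)
y_demo = np.arange(48)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 20% of 48 is 9.6, and the test share is rounded up to 10 rows
print(X_tr.shape, X_te.shape)
```

With so few test rows, any single error metric computed on the test set will be quite sensitive to which rows happen to land there.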

## Create an instance of CART regressor and fit the data to the model 

As mentioned earlier, for a regression task we'll use a different `sklearn` class than we did for the classification task. The class we'll be using here is the `DecisionTreeRegressor` class, as opposed to the `DecisionTreeClassifier` from before.

In [11]:
# Import the DecisionTreeRegressor class 
from sklearn.tree import DecisionTreeRegressor

# Instantiate and fit a regression tree model to training data 
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

DecisionTreeRegressor()
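
To make concrete what a fitted regression tree returns, here is a minimal sketch on toy data (the arrays `X_toy` and `y_toy` are invented for illustration). It fits a depth-1 tree and checks that each prediction equals the mean target of the training rows that fall in the same leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one feature, two clearly separated groups of target values
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# A depth-1 tree (a "stump") makes a single split on the feature
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the training rows in the same leaf
preds = stump.predict(np.array([[2.5], [11.5]]))
left_mean, right_mean = y_toy[:3].mean(), y_toy[3:].mean()
print(preds, left_mean, right_mean)
```

A deeper tree works the same way, only with more leaves and therefore more distinct predicted values.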

## Make predictions and calculate the MAE, MSE, and RMSE

Use the above model to generate predictions on the test set. 

Just as with decision trees for classification, there are several commonly used metrics for evaluating the performance of our model. The most common metrics are:

* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)

If these look familiar, it's likely because you have already seen them before -- they are common evaluation metrics for any sort of regression model, and as we can see, regressions performed with decision tree models are no exception!

Since these are common evaluation metrics, `sklearn` has functions for each of them that we can use to make our job easier. You'll find these functions inside the `metrics` module. In the cell below, calculate each of the three evaluation metrics. 

In [13]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Evaluate these predictions
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 97.8
Mean Squared Error: 17981.4
Root Mean Squared Error: 134.09474262624914
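
For intuition about what these three numbers measure, the sketch below computes them by hand with NumPy on a small set of made-up values (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the lab data) and compares the results with the `sklearn` functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted values, for illustration only
y_true_demo = np.array([500.0, 600.0, 550.0, 700.0])
y_pred_demo = np.array([520.0, 580.0, 500.0, 760.0])

errors = y_true_demo - y_pred_demo

mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # average squared error, penalizes large misses
rmse = np.sqrt(mse)             # back in the units of the target

print(mae, mse, rmse)
print(mean_absolute_error(y_true_demo, y_pred_demo),
      mean_squared_error(y_true_demo, y_pred_demo))
```

Because RMSE is in the same units as the target, it is often the easiest of the three to compare against the spread of the target column.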


## Level Up (Optional)

- Look at the hyperparameters of the regression tree, check their valid ranges in the official scikit-learn documentation, and try some optimization by fitting a tree for each candidate value in a loop 

- Use a dataset that you are familiar with and run tree regression to see if you can interpret the results 

- Check for outliers, try normalization and see the impact on the output 
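
One common convention for the outlier check mentioned above is the 1.5 × IQR rule. The sketch below applies it to a toy `pandas` Series; the values and the variable names are assumptions for illustration, and other outlier definitions may suit a given dataset better.

```python
import pandas as pd

# Toy values with one unusually large entry (illustrative, not from the lab data)
values = pd.Series([10, 12, 11, 13, 12, 11, 50])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

The same steps can be applied column by column to a DataFrame to see which features contain extreme values.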

In [34]:
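# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
# Build a DataFrame of the features and append the target as its own column
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['Target'] = boston.target
df.head()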

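| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |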


In [38]:
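from sklearn.preprocessing import MinMaxScaler

# Scale every column, including the target, to the [0, 1] range
# Note: the modeling cells below still train on the unscaled df
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.head()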

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.18 | 0.067815 | 0.0 | 0.314815 | 0.577505 | 0.641607 | 0.269203 | 0.0 | 0.208015 | 0.287234 | 1.0 | 0.08968 | 0.422222 |
| 1 | 0.000236 | 0.0 | 0.242302 | 0.0 | 0.17284 | 0.547998 | 0.782698 | 0.348962 | 0.043478 | 0.104962 | 0.553191 | 1.0 | 0.20447 | 0.368889 |
| 2 | 0.000236 | 0.0 | 0.242302 | 0.0 | 0.17284 | 0.694386 | 0.599382 | 0.348962 | 0.043478 | 0.104962 | 0.553191 | 0.989737 | 0.063466 | 0.66 |
| 3 | 0.000293 | 0.0 | 0.06305 | 0.0 | 0.150206 | 0.658555 | 0.441813 | 0.448545 | 0.086957 | 0.066794 | 0.648936 | 0.994276 | 0.033389 | 0.631111 |
| 4 | 0.000705 | 0.0 | 0.06305 | 0.0 | 0.150206 | 0.687105 | 0.528321 | 0.448545 | 0.086957 | 0.066794 | 0.648936 | 1.0 | 0.099338 | 0.693333 |
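
When trying normalization with trees, it is worth checking whether feature scaling changes the predictions at all. A tree splits on thresholds, so an order-preserving rescaling such as min-max scaling should leave the partition of the data unchanged. The sketch below tests this on synthetic integer-valued features (`X_syn`, `y_syn` and the other names are assumptions for illustration, not the Boston data). Scaling the target is a different matter, since it changes the units of the error metrics.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic integer-valued features and a noisy target (illustrative only)
X_syn = rng.integers(0, 100, size=(200, 3)).astype(float)
y_syn = 2.0 * X_syn[:, 0] - X_syn[:, 1] + rng.normal(0, 5, size=200)
X_fit, X_new = X_syn[:150], X_syn[150:]
y_fit = y_syn[:150]

# Tree trained on the raw features
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_fit, y_fit)

# Tree trained on min-max scaled features (scaler fitted on the training rows only)
scaler_demo = MinMaxScaler().fit(X_fit)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(
    scaler_demo.transform(X_fit), y_fit
)

# Compare predictions on held-out rows
same = np.allclose(
    tree_raw.predict(X_new),
    tree_scaled.predict(scaler_demo.transform(X_new)),
)
print('Identical predictions:', same)
```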


In [94]:
from sklearn.metrics import r2_score

SEED = 1
y = df['Target']
X = df.drop('Target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

# Baseline ("vanilla") tree with default hyperparameters
regressor = DecisionTreeRegressor(random_state=SEED)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R-Squared Score:', r2_score(y_test, y_pred))

Mean Absolute Error: 3.237795275590551
Mean Squared Error: 20.603307086614176
Root Mean Squared Error: 4.539086591662928
R-Squared Score: 0.7920086354372333


In [61]:
from sklearn.metrics import r2_score

# Try max_depth values 1 through 5 (cast to int, since max_depth must be an integer)
max_depths = np.linspace(1, 5, 5, endpoint=True)
for max_depth in max_depths:
    regressor = DecisionTreeRegressor(max_depth=int(max_depth), random_state=SEED)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    print('Current Max Depth:', max_depth)
    print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))  
    print('Mean Squared Error:', mean_squared_error(y_test, y_pred))  
    print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
    print('R-Squared Score:', r2_score(y_test, y_pred))
    print('\n-------------------------------------------------------------------')

Current Max Depth: 1.0
Mean Absolute Error: 6.059929129322544
Mean Squared Error: 58.217754710106576
Root Mean Squared Error: 7.630056009631029
R-Squared Score: 0.4122889984102356

-------------------------------------------------------------------
Current Max Depth: 2.0
Mean Absolute Error: 3.9582870089956694
Mean Squared Error: 25.969727563220967
Root Mean Squared Error: 5.09605019237654
R-Squared Score: 0.7378343655952725

-------------------------------------------------------------------
Current Max Depth: 3.0
Mean Absolute Error: 3.7011665979729633
Mean Squared Error: 29.711645121501338
Root Mean Squared Error: 5.450838937402327
R-Squared Score: 0.7000595299460113

-------------------------------------------------------------------
Current Max Depth: 4.0
Mean Absolute Error: 2.856756522078789
Mean Squared Error: 13.826265704678221
Root Mean Squared Error: 3.7183686886426717
R-Squared Score: 0.8604231903822975

-------------------------------------------------------------------
...
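
The loop above prints one block of metrics per depth, which gets hard to compare as the number of candidates grows. A possible alternative is to collect the scores in a DataFrame and to record the training score alongside the test score, so that overfitting shows up as a widening gap. The sketch below does this on synthetic data from `make_regression`; the data and the names `depth_results`, `X_syn`, and `y_syn` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real dataset (an assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)

rows = []
for depth in range(1, 6):
    model = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    rows.append({
        'max_depth': depth,
        'train_r2': r2_score(y_tr, model.predict(X_tr)),
        'test_r2': r2_score(y_te, model.predict(X_te)),
    })

# One row per depth, indexed by the hyperparameter value
depth_results = pd.DataFrame(rows).set_index('max_depth')
print(depth_results.round(3))
```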

In [62]:
# Try min_samples_split values from 2 through 10
min_samples_splits = np.linspace(2, 10, 9, endpoint=True)
for min_sample in min_samples_splits:
    regressor = DecisionTreeRegressor(min_samples_split=int(min_sample), random_state=SEED)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    print('Current Min Sample Split:', min_sample)
    print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))  
    print('Mean Squared Error:', mean_squared_error(y_test, y_pred))  
    print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
    print('R-Squared Score:', r2_score(y_test, y_pred))
    print('\n-------------------------------------------------------------------')

Current Min Sample Split: 2.0
Mean Absolute Error: 3.237795275590551
Mean Squared Error: 20.603307086614176
Root Mean Squared Error: 4.539086591662928
R-Squared Score: 0.7920086354372333

-------------------------------------------------------------------
Current Min Sample Split: 3.0
Mean Absolute Error: 3.1594488188976375
Mean Squared Error: 18.907263779527554
Root Mean Squared Error: 4.348248357617991
R-Squared Score: 0.8091302732556436

-------------------------------------------------------------------
Current Min Sample Split: 4.0
Mean Absolute Error: 3.10485564304462
Mean Squared Error: 19.359136045494314
Root Mean Squared Error: 4.3999018222563
R-Squared Score: 0.8045686012477765

-------------------------------------------------------------------
Current Min Sample Split: 5.0
Mean Absolute Error: 3.0104986876640427
Mean Squared Error: 20.110097331583553
Root Mean Squared Error: 4.484428317141836
R-Squared Score: 0.7969876113624688

...

In [63]:
# Try min_samples_leaf values from 2 through 10
min_samples_leafs = np.linspace(2, 10, 9, endpoint=True)
for min_sample in min_samples_leafs:
    regressor = DecisionTreeRegressor(min_samples_leaf=int(min_sample), random_state=SEED)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    print('Current Min Sample Leaf:', min_sample)
    print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))  
    print('Mean Squared Error:', mean_squared_error(y_test, y_pred))  
    print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
    print('R-Squared Score:', r2_score(y_test, y_pred))
    print('\n-------------------------------------------------------------------')

Current Min Sample Leaf: 2.0
Mean Absolute Error: 2.773490813648294
Mean Squared Error: 13.917909011373574
Root Mean Squared Error: 3.7306713888217993
R-Squared Score: 0.8594980468442975

-------------------------------------------------------------------
Current Min Sample Leaf: 3.0
Mean Absolute Error: 2.928241469816273
Mean Squared Error: 15.478879921259837
Root Mean Squared Error: 3.934320770000819
R-Squared Score: 0.8437399712972431

-------------------------------------------------------------------
Current Min Sample Leaf: 4.0
Mean Absolute Error: 2.7075628046494185
Mean Squared Error: 12.798200387451566
Root Mean Squared Error: 3.577457251659559
R-Squared Score: 0.8708015586360304

-------------------------------------------------------------------
Current Min Sample Leaf: 5.0
Mean Absolute Error: 3.3290594925634305
Mean Squared Error: 22.51823344562088
Root Mean Squared Error: 4.745338074955343
R-Squared Score: 0.7726773628035414

...

In [93]:
# Refit with a combination of the hyperparameters explored above
regressor = DecisionTreeRegressor(max_depth=4,
                                  min_samples_split=10,
                                  min_samples_leaf=4,
                                  random_state=SEED)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R-Squared Score:', r2_score(y_test, y_pred))

Mean Absolute Error: 2.8315266113015065
Mean Squared Error: 13.364079437900017
Root Mean Squared Error: 3.655691376183172
R-Squared Score: 0.8650889827187048


In [95]:
# The tuned tree reaches an R-squared of about 0.87, up from about 0.79 for the vanilla model
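
The sweeps above vary one hyperparameter at a time and score each candidate on a single test split. One alternative, which this lab does not use, is scikit-learn's `GridSearchCV`, which tries every combination in a grid and scores each with cross-validation. The sketch below runs it on synthetic data from `make_regression`; the data, the grid values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real dataset (an assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)

# Candidate values for the three hyperparameters explored in this lab
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# 5-fold cross-validated search, scored by R-squared
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    scoring='r2',
    cv=5,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on cross-validation folds rather than on the test set keeps the test set available for a final, unbiased evaluation of the chosen model.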

## Summary 

In this lab, you trained a regression tree with scikit-learn and used it to predict values for unseen data. You saw that a vanilla tree with default settings left room for improvement, and that tuning hyperparameters such as `max_depth`, `min_samples_split`, and `min_samples_leaf` (what we described as hyperparameter optimization and pruning, in the case of trees) improved the results on the test set. 