# Decision Tree Regression: Auto MPG Data

To show how decision trees can be applied to regression, we use the automobile fuel performance (Auto MPG) data from the UCI Machine Learning Repository. The data contains nine columns: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name. We want to predict the fuel efficiency of the cars, so the first column, mpg, is treated as the dependent variable and the remaining columns are candidate features.

In [2]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [3]:
# Set up Notebook
%matplotlib inline
warnings.filterwarnings('ignore')
sns.set_style('white')

# Work around broken tab completion caused by jedi
%config Completer.use_jedi = False

In [32]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [26]:
col_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
            'Weight', 'Acceleration', 'Year', 'Origin', 'Name']
auto_data = pd.read_csv('../datasets/auto-mpg.data', index_col=False, names=col_names,
                       sep=r'\s+')  # delim_whitespace is deprecated in recent pandas
auto_data.sample(5)

| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Year | Origin | Name |
|---|---|---|---|---|---|---|---|---|---|
| 162 | 15.0 | 6 | 258.0 | 110.0 | 3730.0 | 19.0 | 75 | 1 | amc matador |
| 154 | 15.0 | 6 | 250.0 | 72.0 | 3432.0 | 21.0 | 75 | 1 | mercury monarch |
| 207 | 20.0 | 4 | 130.0 | 102.0 | 3150.0 | 15.7 | 76 | 2 | volvo 245 |
| 295 | 35.7 | 4 | 98.0 | 80.0 | 1915.0 | 14.4 | 79 | 1 | dodge colt hatchback custom |
| 180 | 25.0 | 4 | 121.0 | 115.0 | 2671.0 | 13.5 | 75 | 2 | saab 99le |


In [27]:
auto_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MPG           398 non-null    float64
 1   Cylinders     398 non-null    int64  
 2   Displacement  398 non-null    float64
 3   Horsepower    398 non-null    object 
 4   Weight        398 non-null    float64
 5   Acceleration  398 non-null    float64
 6   Year          398 non-null    int64  
 7   Origin        398 non-null    int64  
 8   Name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


The `info()` output reports no missing values or `NaN`s in our data, but is that really the case?  
The sample rows suggest that every variable except `Name` is numerical. However, `info()` reports that `Horsepower` also has the `object` dtype, so this variable needs a closer look.

In [28]:
# Show the first three unique Horsepower values in descending string order
sorted(auto_data.Horsepower.unique(), reverse=True)[:3]

['?', '98.00', '97.00']

As we can see, missing values in this dataset are marked with a `?` symbol, which is a common convention in raw data files. Because the column is read as strings, `info()` counts every `?` as a non-null value. For simplicity, we ignore this feature in the analysis.  
An alternative would be to convert each `?` into a missing value and then either drop the rows with missing values or impute them.
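The alternative mentioned above could look roughly like the following. This is a minimal sketch on a made-up four-row sample, not the Auto MPG file itself; the names `raw`, `demo`, `coerced`, `dropped`, and `imputed` are hypothetical, and median imputation is only one of several reasonable choices.

```python
import io

import pandas as pd

# Made-up miniature of the data; the values are for illustration only.
raw = io.StringIO("MPG Horsepower\n18.0 130.0\n25.0 ?\n21.0 90.0\n30.0 70.0\n")

# Option A: declare '?' as a missing-value marker while reading the file.
demo = pd.read_csv(raw, sep=r"\s+", na_values="?")

# Option B: coerce an already-loaded string column to numeric;
# anything that cannot be parsed (here '?') becomes NaN.
as_text = pd.Series(["130.0", "?", "90.0", "70.0"])
coerced = pd.to_numeric(as_text, errors="coerce")

# Either drop the incomplete rows ...
dropped = demo.dropna(subset=["Horsepower"])

# ... or impute the gaps, here with the column median.
imputed = demo["Horsepower"].fillna(demo["Horsepower"].median())

print(demo.dtypes)
print(imputed)
```

With either option the column becomes `float64`, so `info()` would then report the true non-null count.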

In [29]:
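import patsy as pts

# Build the target (y) and feature (x) matrices from a formula.
# C(...) marks a column as categorical; Horsepower is left out
# because of its '?' entries.
y, x = pts.dmatrices('MPG ~ C(Cylinders) + Displacement +'
                    'Weight + Acceleration + C(Year) + C(Origin)',
                    data=auto_data, return_type='dataframe')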
x.sample(5)

| | Intercept | C(Cylinders)[T.4] | C(Cylinders)[T.5] | C(Cylinders)[T.6] | C(Cylinders)[T.8] | C(Year)[T.71] | C(Year)[T.72] | C(Year)[T.73] | C(Year)[T.74] | C(Year)[T.75] | ... | C(Year)[T.78] | C(Year)[T.79] | C(Year)[T.80] | C(Year)[T.81] | C(Year)[T.82] | C(Origin)[T.2] | C(Origin)[T.3] | Displacement | Weight | Acceleration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 297 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 183.0 | 3530.0 | 20.1 |
| 203 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 97.0 | 1825.0 | 12.2 |
| 378 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 105.0 | 2125.0 | 14.7 |
| 202 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 258.0 | 3193.0 | 17.8 |
| 345 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 81.0 | 1760.0 | 16.1 |
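The `C(...)[T.k]` columns in this output are indicator variables: patsy's default treatment coding expands each categorical term into one 0/1 column per level and drops the first level as the reference. A rough pandas analogue is sketched below on made-up cylinder counts; the frame `toy` and the `Cylinders_` prefix are hypothetical and only illustrate the encoding, not the notebook's actual matrices.

```python
import pandas as pd

# Made-up cylinder counts, for illustration only.
toy = pd.DataFrame({"Cylinders": [4, 8, 3, 4, 6]})

# One indicator column per level; drop_first=True removes the lowest
# level (3) so that it acts as the reference category.
dummies = pd.get_dummies(toy["Cylinders"], prefix="Cylinders",
                         drop_first=True, dtype=float)

print(dummies)
```

A row belonging to the reference level (here the 3-cylinder row) has zeros in every indicator column.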


In [30]:
y.sample(5)

| | MPG |
|---|---|
| 185 | 26.0 |
| 372 | 27.0 |
| 162 | 15.0 |
| 362 | 24.2 |
| 323 | 27.9 |


With these two DataFrames, we can now build a regression model using the `DecisionTreeRegressor` estimator:

In [34]:
# Split data into training and testing sets (60%:40%)
frac = .4
ind_train, ind_test, dep_train, dep_test = \
    train_test_split(x, y, test_size=frac, random_state=23)

# Create Regressor with default properties
auto_model = DecisionTreeRegressor(random_state=23)

# Fit estimator and display score
auto_model = auto_model.fit(ind_train, dep_train)
print('Score = {:.1%}'.format(auto_model.score(ind_test, dep_test)))

Score = 55.5%


In [35]:
from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score

# Regress on test data
pred = auto_model.predict(ind_test)

# Compute performance metrics
mae = mean_absolute_error(dep_test, pred)
mse = mean_squared_error(dep_test, pred)
mbe = median_absolute_error(dep_test, pred)
mr2 = r2_score(dep_test, pred)

ev_score = explained_variance_score(dep_test, pred)

# Display metrics
print(f'Mean Absolute Error   = {mae:4.2f}')
print(f'Mean Squared Error    = {mse:4.2f}')
print(f'Median Absolute Error = {mbe:4.2f}')
print(f'R^2 Score             = {mr2:5.3f}')
print(f'Explained Variance    = {ev_score:5.3f}')

Mean Absolute Error   = 3.25
Mean Squared Error    = 23.96
Median Absolute Error = 2.00
R^2 Score             = 0.555
Explained Variance    = 0.564
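For regressors, `score` returns the R^2 coefficient, which is why the 55.5% printed earlier matches the R^2 score here. The explained variance is slightly higher because it subtracts the mean residual before measuring spread, whereas R^2 penalizes any systematic bias in the predictions. The sketch below recomputes both quantities by hand on made-up numbers (not the Auto MPG predictions) and compares them with scikit-learn; `y_true`, `y_pred`, and the `_manual` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([15.0, 20.0, 25.0, 30.0])
y_pred = np.array([16.0, 22.0, 26.0, 33.0])

resid = y_true - y_pred

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Explained variance = 1 - Var(residuals) / Var(targets)
ev_manual = 1.0 - resid.var() / y_true.var()

print(f"R^2 (manual)  = {r2_manual:.3f}, sklearn = {r2_score(y_true, y_pred):.3f}")
print(f"EV  (manual)  = {ev_manual:.3f}, "
      f"sklearn = {explained_variance_score(y_true, y_pred):.3f}")
```

Because every prediction in this toy example overshoots, the residuals have a non-zero mean and the explained variance exceeds R^2.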


In the previous code cells, we constructed a decision tree for regression and applied it to the automobile fuel performance prediction task. The initial model reached an R^2 score of about 0.555 on the test data; try making the following changes to see if you can do better.

1. Change the features used in the regression, for example by dropping one column, such as `Origin`. Do the results change?
2. Try different hyperparameter values, such as a different measure of split quality (via the `criterion` hyperparameter) or a random `splitter`.

-----