We want to understand what affects the mileage of cars. About 400 cars were measured, and their data is available in the file.

1) Train a decision tree and identify the features that impact the mileage of cars. Note that cylinders, though numerical, can take only specific values, and origin is categorical.<br>
2) How good is the prediction if we train on 300 cars and test on the rest of the data?<br>
3) Are there outliers that influence the result? How can we minimize their impact?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data
mileage = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/06_Car_mileage.csv')
mileage.head()

   cylinders  displacement   hp  weight  acceleration  origin   mpg
0          8         307.0  130    3504          12.0       1  18.0
1          8         350.0  165    3693          11.5       1  15.0
2          8         318.0  150    3436          11.0       1  18.0
3          8         304.0  150    3433          12.0       1  16.0
4          8         302.0  140    3449          10.5       1  17.0


In [3]:
# Check info
mileage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cylinders     398 non-null    int64  
 1   displacement  398 non-null    float64
 2   hp            398 non-null    object 
 3   weight        398 non-null    int64  
 4   acceleration  398 non-null    float64
 5   origin        398 non-null    int64  
 6   mpg           398 non-null    float64
dtypes: float64(3), int64(3), object(1)
memory usage: 21.9+ KB


In [4]:
# Check missing values
mileage.isnull().sum()

cylinders       0
displacement    0
hp              0
weight          0
acceleration    0
origin          0
mpg             0
dtype: int64

In [5]:
# Let's check hp: the values look numeric, but the data type is object
mileage['hp'].value_counts().sort_values(ascending=False)[:25]

150    22
90     20
88     19
110    18
100    17
75     14
95     14
105    12
70     12
67     12
65     10
85      9
97      9
80      7
145     7
140     7
92      6
68      6
78      6
84      6
?       6
72      6
60      5
175     5
170     5
Name: hp, dtype: int64

In [6]:
# Exclude the rows where hp contains '?'; .copy() avoids a SettingWithCopyWarning in the next cell
mileage = mileage[mileage['hp'] != '?'].copy()
mileage.shape

(392, 7)

In [7]:
# Convert the data type for hp to float
mileage['hp'] = mileage['hp'].astype('float')
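
The filter-then-cast steps above can also be written as a single coercion. The following is a minimal sketch on a made-up four-row frame (the values and the name `clean` are illustrative assumptions, not the car data): `pd.to_numeric` with `errors="coerce"` turns any non-numeric entry such as `?` into `NaN`, which can then be dropped.

```python
import pandas as pd

# Hypothetical miniature of the hp column, including the "?" placeholder
df = pd.DataFrame({"hp": ["130", "165", "?", "150"],
                   "mpg": [18.0, 15.0, 25.0, 16.0]})

# errors="coerce" converts anything that is not a number into NaN
df["hp"] = pd.to_numeric(df["hp"], errors="coerce")

# Drop the rows that could not be parsed; .copy() yields an independent frame
clean = df.dropna(subset=["hp"]).copy()
print(clean.dtypes["hp"], clean.shape)
```

One possible advantage of this form is that it would also catch placeholders other than `?`, should the file contain any.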

In [8]:
# Let's check origin
mileage['origin'].value_counts()

1    245
3     79
2     68
Name: origin, dtype: int64

In [9]:
# origin is categorical, so convert it into dummy variables.
# cylinders is numerical but takes only a few specific values (it is discrete), so create dummies for it as well.
mileage_onehot = pd.get_dummies(mileage, columns=['origin', 'cylinders'])
mileage_onehot.head()

   displacement     hp  weight  acceleration   mpg  origin_1  origin_2  origin_3  cylinders_3  cylinders_4  cylinders_5  cylinders_6  cylinders_8
0         307.0  130.0    3504          12.0  18.0         1         0         0            0            0            0            0            1
1         350.0  165.0    3693          11.5  15.0         1         0         0            0            0            0            0            1
2         318.0  150.0    3436          11.0  18.0         1         0         0            0            0            0            0            1
3         304.0  150.0    3433          12.0  16.0         1         0         0            0            0            0            0            1
4         302.0  140.0    3449          10.5  17.0         1         0         0            0            0            0            0            1


In [10]:
# Let's check the data types again
mileage_onehot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   displacement  392 non-null    float64
 1   hp            392 non-null    float64
 2   weight        392 non-null    int64  
 3   acceleration  392 non-null    float64
 4   mpg           392 non-null    float64
 5   origin_1      392 non-null    uint8  
 6   origin_2      392 non-null    uint8  
 7   origin_3      392 non-null    uint8  
 8   cylinders_3   392 non-null    uint8  
 9   cylinders_4   392 non-null    uint8  
 10  cylinders_5   392 non-null    uint8  
 11  cylinders_6   392 non-null    uint8  
 12  cylinders_8   392 non-null    uint8  
dtypes: float64(4), int64(1), uint8(8)
memory usage: 21.4 KB


**1) Train a decision tree and identify the features that impact the mileage of cars. Note that cylinders, though numerical, can take only specific values, and origin is categorical.**

In [11]:
# Prepare X and y
X = mileage_onehot.drop('mpg', axis=1)
y = mileage_onehot['mpg']

# Initialize the model
clf = DecisionTreeRegressor(random_state=42)

In [12]:
X.head()

   displacement     hp  weight  acceleration  origin_1  origin_2  origin_3  cylinders_3  cylinders_4  cylinders_5  cylinders_6  cylinders_8
0         307.0  130.0    3504          12.0         1         0         0            0            0            0            0            1
1         350.0  165.0    3693          11.5         1         0         0            0            0            0            0            1
2         318.0  150.0    3436          11.0         1         0         0            0            0            0            0            1
3         304.0  150.0    3433          12.0         1         0         0            0            0            0            0            1
4         302.0  140.0    3449          10.5         1         0         0            0            0            0            0            1


In [13]:
# Use cross validation
params = {'max_depth':range(1,11), 'min_samples_split':range(10,60,10)}

# Create GridSearchCV object
clf_gs = GridSearchCV(estimator=clf, cv=5, param_grid=params, scoring='neg_mean_squared_error')

# Fit
clf_gs.fit(X,y)

# Print best params and best score (mean cross-validated MSE, sign flipped back to positive)
print(clf_gs.best_params_)
print(-clf_gs.best_score_)

{'max_depth': 5, 'min_samples_split': 20}
22.8167549863003


In [14]:
# Check RMSE (in-sample: predictions are made on the same data used for fitting)
from sklearn import metrics

y_pred = clf_gs.predict(X)
print('Model RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y, y_pred)), 2))

Model RMSE: 3.18


In [15]:
# Check r2 score (in-sample, reported as a percentage)
print("Model r2 score:", np.round(metrics.r2_score(y, y_pred)*100, 2))

Model r2 score: 83.31


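Without splitting the data into train and test, the scores are in-sample and therefore optimistic:<br>
Model RMSE: 3.18<br>
Model r2 score: 83.31%<br>
For comparison, the cross-validated MSE of 22.82 corresponds to an RMSE of about 4.78.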
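
Question 1 also asks which features matter, and a fitted tree exposes that through `feature_importances_`. The sketch below is a minimal illustration on synthetic stand-in data, not the car dataset: the column names, coefficients, and the names `X_demo`, `y_demo`, and `tree` are assumptions chosen so that `weight` dominates by construction.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (an assumption, not the car dataset):
# the target depends strongly on weight, weakly on hp, and not at all on noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "weight": rng.uniform(1500, 5000, 300),
    "hp": rng.uniform(50, 200, 300),
    "noise": rng.normal(size=300),
})
y_demo = (50 - 0.008 * X_demo["weight"] - 0.02 * X_demo["hp"]
          + rng.normal(scale=0.5, size=300))

tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances: one value per column, summing to 1
importances = pd.Series(tree.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)

# Text dump of the learned split rules
print(export_text(tree, feature_names=list(X_demo.columns)))
```

In this notebook, the same idea would presumably apply to the tuned tree via `clf_gs.best_estimator_.feature_importances_` with `X.columns` as the index. Note that with one-hot encoded columns the importance of `origin` or `cylinders` is spread across several dummy columns.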

**2) How good is the prediction if we train on 300 cars and test on the rest of the data?**

In [16]:
# Split the data. Note: 300/392 rounds to 0.77, which yields 301 training rows rather than exactly 300
train_size = round(300/X.shape[0],2)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=42)

In [17]:
# Check shape
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(301, 12)
(301,)
(91, 12)
(91,)
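
The split above lands on 301 training rows because the fraction 300/392 is rounded to 0.77 before being passed in. A minimal sketch of getting exactly 300, using placeholder arrays (`X_demo` and `y_demo` are illustrative stand-ins with 392 rows): `train_test_split` interprets an integer `train_size` as an absolute number of samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the cleaned data (392)
X_demo = np.arange(392 * 2).reshape(392, 2)
y_demo = np.arange(392)

# An integer train_size is read as an absolute row count, not a fraction
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=300, random_state=42
)
print(X_tr.shape, X_te.shape)
```

Since this changes which rows end up in each set, the scores reported in the following cells would not be reproduced exactly with this variant.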


In [18]:
# Initialize the model
clf_after_split = DecisionTreeRegressor(random_state=42)

# Use cross validation
params = {'max_depth':range(1,11), 'min_samples_split':range(10,60,10)}

# Create GridSearchCV object
clf_gs_after_split = GridSearchCV(estimator=clf_after_split, cv=5, param_grid=params, scoring='neg_mean_squared_error')

# Fit
clf_gs_after_split.fit(X_train, y_train)

# Print best params and best score (mean cross-validated MSE on the training split)
print(clf_gs_after_split.best_params_)
print(-clf_gs_after_split.best_score_)

{'max_depth': 8, 'min_samples_split': 40}
17.02776167336989


In [19]:
# Check Train and Test RMSE
y_train_pred = clf_gs_after_split.predict(X_train)
y_pred = clf_gs_after_split.predict(X_test)

print('Train RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)), 2))
print('Test RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 2))

Train RMSE: 3.13
Test RMSE: 4.45


In [20]:
# Check r2 score
print("Train r2 score:", np.round(metrics.r2_score(y_train, y_train_pred)*100, 2))
print("Test r2 score:", np.round(metrics.r2_score(y_test, y_pred)*100, 2))

Train r2 score: 84.6
Test r2 score: 60.7


After splitting the data, the test r2 score is only about 60.7% against 84.6% on the training set, and the test RMSE is 4.45 against 3.13. The model does not generalize well to the test data.
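
The RMSE and r2 computations are repeated for every model in this notebook, so the train/test comparison could be wrapped in a small helper. This is only a sketch; the function name `rmse_r2` is hypothetical and the toy inputs are made up to keep the arithmetic easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def rmse_r2(y_true, y_pred):
    """Return (RMSE, r2 in percent), both rounded to 2 decimals."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred)) * 100
    return round(rmse, 2), round(r2, 2)

# Toy check: one prediction is off by 1, so MSE = 1/3 and r2 = 1 - 1/2
score = rmse_r2([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(score)
```

Calling it once with the training pair and once with the test pair would give the same numbers as the two cells above in a form that is easier to compare side by side.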

**3) Are there outliers that influence the result? How can we minimize the impact of outliers?**

In [21]:
X_train.head()

     displacement     hp  weight  acceleration  origin_1  origin_2  origin_3  cylinders_3  cylinders_4  cylinders_5  cylinders_6  cylinders_8
360         145.0   76.0    3160          19.6         0         1         0            0            0            0            1            0
239          97.0   67.0    1985          16.4         0         0         1            0            1            0            0            0
370         112.0   85.0    2575          16.2         1         0         0            0            1            0            0            0
252         231.0  105.0    3535          19.2         1         0         0            0            0            0            1            0
211         168.0  120.0    3820          16.7         0         1         0            0            0            0            1            0


In [22]:
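# Outlier detection using LOF (Local Outlier Factor)
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)

# fit_predict returns -1 for outliers and 1 for inliers
X_train['lof'] = lof.fit_predict(X_train)
# More negative scores are more anomalous; inliers score close to -1
X_train['negative_outlier_factor'] = lof.negative_outlier_factor_

# Keep only rows scoring above -1.5; .copy() gives an independent frame
X_train_without_outliers = X_train[X_train['negative_outlier_factor'] > -1.5].copy()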

In [23]:
X_train_without_outliers.shape

(286, 14)

In [24]:
# Keep the matching rows of y_train, aligned by index label
y_train_without_outliers = y_train.loc[X_train_without_outliers.index]

# Check shape
y_train_without_outliers.shape

(286,)

In [25]:
# Let's drop lof and negative_outlier_factor, as these are not actual variables in the given dataset
X_train_without_outliers = X_train_without_outliers.drop(columns=['lof', 'negative_outlier_factor'])
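
The LOF filtering above adds two helper columns to `X_train` and removes them again afterwards. A boolean mask gives the same kind of filtering without touching the feature frame. The sketch below uses synthetic data with one planted outlier (all names and values are illustrative assumptions). It also standardizes the features first, which the notebook does not do: LOF is distance-based, so a large-scale column such as `weight` can otherwise dominate the neighbor search.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (an assumption, not the car data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "weight": rng.normal(3000, 500, 200),
    "acceleration": rng.normal(15, 2, 200),
})
X_demo.loc[0] = [9000.0, 40.0]  # plant one obvious outlier in row 0
y_demo = pd.Series(rng.normal(23, 5, 200), index=X_demo.index)

# Standardize so that no single column dominates the distances
X_scaled = StandardScaler().fit_transform(X_demo)

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X_scaled)

# Boolean mask: True for rows to keep (same -1.5 threshold as above)
keep = lof.negative_outlier_factor_ > -1.5
X_kept, y_kept = X_demo[keep], y_demo[keep]
print(X_demo.shape, X_kept.shape, y_kept.shape)
```

Because the mask is applied to both `X_demo` and `y_demo`, features and target stay aligned, and no helper columns need to be dropped afterwards.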

In [26]:
# Initialize the model
clf_without_outliers = DecisionTreeRegressor(random_state=42)

# Use cross validation
params = {'max_depth':range(1,11), 'min_samples_split':range(10,60,10)}

# Create GridSearchCV object
clf_gs_without_outliers = GridSearchCV(estimator=clf_without_outliers, cv=5, param_grid=params, scoring='neg_mean_squared_error')

# Fit
clf_gs_without_outliers.fit(X_train_without_outliers, y_train_without_outliers)

# Print best params and best score (mean cross-validated MSE after outlier removal)
print(clf_gs_without_outliers.best_params_)
print(-clf_gs_without_outliers.best_score_)

{'max_depth': 6, 'min_samples_split': 30}
17.973939913578565


In [27]:
# Check Train and Test RMSE
y_train_without_outliers_pred = clf_gs_without_outliers.predict(X_train_without_outliers)
y_pred = clf_gs_without_outliers.predict(X_test)

print('Train RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_train_without_outliers, y_train_without_outliers_pred)), 2))
print('Test RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 2))

Train RMSE: 3.01
Test RMSE: 4.9


In [28]:
# Check r2 score
print("Train r2 score after outlier removal:", np.round(metrics.r2_score(y_train_without_outliers, y_train_without_outliers_pred)*100, 2))
print("Test r2 score after outlier removal:", np.round(metrics.r2_score(y_test, y_pred)*100, 2))

Train r2 score after outlier removal: 85.02
Test r2 score after outlier removal: 52.33


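The model performed worse than the model before outlier removal: the test RMSE rose from 4.45 to 4.9 and the test r2 score fell from 60.7% to 52.33%, so removing the rows flagged by LOF did not help here.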