# **Used Car**

* Load the data from “used_car_price.csv”.

* Split the data into 75% for training and 25% for testing.

* Train an XGBoost model through its scikit-learn API (`XGBRegressor`).

* Assess the performance of the trained XGBoost model using RMSE and R2.

* Perform hyperparameter optimization using grid search (`GridSearchCV`), choosing reasonable values for max_depth, learning_rate, n_estimators, and colsample_bytree. Use 5-fold cross-validation.

* Perform hyperparameter optimization using random search (`RandomizedSearchCV`), choosing reasonable values for max_depth, learning_rate, n_estimators, and colsample_bytree. Use 5-fold cross-validation and 100 iterations.

* Perform hyperparameter optimization using Bayesian optimization (`BayesSearchCV`), choosing reasonable values for max_depth, learning_rate, and n_estimators. Use 5-fold cross-validation and 100 iterations.

* Compare the 3 optimization strategies using RMSE and R2.

In [1]:
#Import data “used_car_price.csv”
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
car_df = pd.read_csv(r'E:\_Portofolio\PortofolioProject\UsedCar\used_car_price.csv')
car_df.head()

Unnamed: 0,Make,Model,Type,Origin,DriveTrain,MSRP,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
0,Acura,MDX,SUV,Asia,All,36945,3.5,6,265,17,23,4451,106,189
1,Acura,RSX Type S 2dr,Sedan,Asia,Front,23820,2.0,4,200,24,31,2778,101,172
2,Acura,TSX 4dr,Sedan,Asia,Front,26990,2.4,4,200,22,29,3230,105,183
3,Acura,TL 4dr,Sedan,Asia,Front,33195,3.2,6,270,20,28,3575,108,186
4,Acura,3.5 RL 4dr,Sedan,Asia,Front,43755,3.5,6,225,18,24,3880,115,197


In [4]:
car_df.tail()

Unnamed: 0,Make,Model,Type,Origin,DriveTrain,MSRP,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
423,Volvo,C70 LPT convertible 2dr,Sedan,Europe,Front,40565,2.4,5,197,21,28,3450,105,186
424,Volvo,C70 HPT convertible 2dr,Sedan,Europe,Front,42565,2.3,5,242,20,26,3450,105,186
425,Volvo,S80 T6 4dr,Sedan,Europe,Front,45210,2.9,6,268,19,26,3653,110,190
426,Volvo,V40,Wagon,Europe,Front,26135,1.9,4,170,22,29,2822,101,180
427,Volvo,XC70,Wagon,Europe,All,35145,2.5,5,208,20,27,3823,109,186


In [5]:
car_df.describe()

Unnamed: 0,MSRP,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
count,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0
mean,32774.85514,3.196729,5.799065,215.885514,20.060748,26.843458,3577.953271,108.154206,186.36215
std,19431.716674,1.108595,1.559679,71.836032,5.238218,5.741201,758.983215,8.311813,14.357991
min,10280.0,1.3,3.0,73.0,10.0,12.0,1850.0,89.0,143.0
25%,20334.25,2.375,4.0,165.0,17.0,24.0,3104.0,103.0,178.0
50%,27635.0,3.0,6.0,210.0,19.0,26.0,3474.5,107.0,187.0
75%,39205.0,3.9,6.0,255.0,21.25,29.0,3977.75,112.0,194.0
max,192465.0,8.3,12.0,500.0,60.0,66.0,7190.0,144.0,238.0


In [6]:
car_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         428 non-null    object 
 1   Model        428 non-null    object 
 2   Type         428 non-null    object 
 3   Origin       428 non-null    object 
 4   DriveTrain   428 non-null    object 
 5   MSRP         428 non-null    int64  
 6   EngineSize   428 non-null    float64
 7   Cylinders    428 non-null    int64  
 8   Horsepower   428 non-null    int64  
 9   MPG_City     428 non-null    int64  
 10  MPG_Highway  428 non-null    int64  
 11  Weight       428 non-null    int64  
 12  Wheelbase    428 non-null    int64  
 13  Length       428 non-null    int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 46.9+ KB


In [7]:
car_df.isnull().sum()

Make           0
Model          0
Type           0
Origin         0
DriveTrain     0
MSRP           0
EngineSize     0
Cylinders      0
Horsepower     0
MPG_City       0
MPG_Highway    0
Weight         0
Wheelbase      0
Length         0
dtype: int64

In [8]:
# One-hot encode the categorical variables ['Make', 'Model', 'Type', 'Origin', 'DriveTrain']
car_df = pd.get_dummies(car_df, columns = ['Make', 'Model', 'Type', 'Origin', 'DriveTrain'])
car_df

Unnamed: 0,MSRP,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length,Make_Acura,...,Type_Sedan,Type_Sports,Type_Truck,Type_Wagon,Origin_Asia,Origin_Europe,Origin_USA,DriveTrain_All,DriveTrain_Front,DriveTrain_Rear
0,36945,3.5,6,265,17,23,4451,106,189,1,...,0,0,0,0,1,0,0,1,0,0
1,23820,2.0,4,200,24,31,2778,101,172,1,...,1,0,0,0,1,0,0,0,1,0
2,26990,2.4,4,200,22,29,3230,105,183,1,...,1,0,0,0,1,0,0,0,1,0
3,33195,3.2,6,270,20,28,3575,108,186,1,...,1,0,0,0,1,0,0,0,1,0
4,43755,3.5,6,225,18,24,3880,115,197,1,...,1,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423,40565,2.4,5,197,21,28,3450,105,186,0,...,1,0,0,0,0,1,0,0,1,0
424,42565,2.3,5,242,20,26,3450,105,186,0,...,1,0,0,0,0,1,0,0,1,0
425,45210,2.9,6,268,19,26,3653,110,190,0,...,1,0,0,0,0,1,0,0,1,0
426,26135,1.9,4,170,22,29,2822,101,180,0,...,0,0,0,1,0,1,0,0,1,0
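
To see what `pd.get_dummies` does to the columns, here is a minimal sketch on a three-row toy frame. The rows are copied from the `head()` output; the names `toy` and `encoded` are illustrative and not part of the notebook. It passes `dtype=int` so the indicator columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Three rows copied from the head() output; only a few columns are kept
toy = pd.DataFrame({
    "Make": ["Acura", "Acura", "Acura"],
    "Type": ["SUV", "Sedan", "Sedan"],
    "DriveTrain": ["All", "Front", "Front"],
    "MSRP": [36945, 23820, 26990],
})

# Each distinct category value becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["Make", "Type", "DriveTrain"], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

On the full data the same call yields 483 feature columns for 428 rows (see the `X_train.shape` output further down). Most of those columns presumably come from `Model`, whose values are close to unique per car, so dropping `Model` before encoding may be worth trying.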


In [9]:
# feature input to X and MSRP output to y
X = car_df.drop('MSRP', axis = 1)
y = car_df['MSRP']

In [10]:
X

Unnamed: 0,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length,Make_Acura,Make_Audi,...,Type_Sedan,Type_Sports,Type_Truck,Type_Wagon,Origin_Asia,Origin_Europe,Origin_USA,DriveTrain_All,DriveTrain_Front,DriveTrain_Rear
0,3.5,6,265,17,23,4451,106,189,1,0,...,0,0,0,0,1,0,0,1,0,0
1,2.0,4,200,24,31,2778,101,172,1,0,...,1,0,0,0,1,0,0,0,1,0
2,2.4,4,200,22,29,3230,105,183,1,0,...,1,0,0,0,1,0,0,0,1,0
3,3.2,6,270,20,28,3575,108,186,1,0,...,1,0,0,0,1,0,0,0,1,0
4,3.5,6,225,18,24,3880,115,197,1,0,...,1,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423,2.4,5,197,21,28,3450,105,186,0,0,...,1,0,0,0,0,1,0,0,1,0
424,2.3,5,242,20,26,3450,105,186,0,0,...,1,0,0,0,0,1,0,0,1,0
425,2.9,6,268,19,26,3653,110,190,0,0,...,1,0,0,0,0,1,0,0,1,0
426,1.9,4,170,22,29,2822,101,180,0,0,...,0,0,0,1,0,1,0,0,1,0


In [11]:
y

0      36945
1      23820
2      26990
3      33195
4      43755
       ...  
423    40565
424    42565
425    45210
426    26135
427    35145
Name: MSRP, Length: 428, dtype: int64

In [12]:
X = np.array(X)

In [13]:
X

array([[  3.5,   6. , 265. , ...,   1. ,   0. ,   0. ],
       [  2. ,   4. , 200. , ...,   0. ,   1. ,   0. ],
       [  2.4,   4. , 200. , ...,   0. ,   1. ,   0. ],
       ...,
       [  2.9,   6. , 268. , ...,   0. ,   1. ,   0. ],
       [  1.9,   4. , 170. , ...,   0. ,   1. ,   0. ],
       [  2.5,   5. , 208. , ...,   1. ,   0. ,   0. ]])

In [14]:
y = np.array(y)

In [15]:
y

array([ 36945,  23820,  26990,  33195,  43755,  46100,  89765,  25940,
        35940,  31840,  33430,  34480,  36640,  39640,  42490,  44240,
        42840,  49690,  69190,  48040,  84600,  35940,  37390,  40590,
        40840,  49090,  37000,  52195,  28495,  30795,  37995,  30245,
        35495,  36995,  37245,  39995,  44295,  44995,  54995,  69195,
        73195,  48195,  56595,  33895,  41045,  32845,  37895,  26545,
        22180,  26470,  24895,  28345,  32245,  35545,  40720,  52795,
        46995,  30835,  45445,  50595,  47955,  76200,  52975,  42735,
        41465,  30295,  20255,  11690,  12585,  14610,  14810,  16385,
        21900,  18995,  20370,  21825,  25000,  27995,  23495,  24225,
        26395,  27020,  44535,  51535,  36100,  18760,  20310,  40340,
        41995,  22225,  17985,  22000,  19090,  21840,  29865,  24130,
        26860,  25955,  25215,  33295,  30950,  27490,  38380,  34495,
        31230,  32235,  13670,  15040,  22035,  18820,  20220,  24885,
      

In [16]:
#Split the data into 75% for training and 25% for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25)

In [17]:
X_train.shape #75% equal 321

(321, 483)

In [18]:
X_test.shape #25% equal 107

(107, 483)
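
The split above is not seeded, so every run selects different rows and the RMSE and R2 values below will change from run to run. A minimal sketch of a reproducible split follows; it uses stand-in arrays with the same row count as the car data, and `random_state=42` is an arbitrary seed chosen for illustration, not a value from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the car data set (428)
X_demo = np.arange(428 * 2).reshape(428, 2)
y_demo = np.arange(428)

# random_state=42 is an arbitrary seed chosen for this sketch
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same seed gives the same test rows on every call
print(np.array_equal(split_a[3], split_b[3]))
```

With a fixed seed, the four models in this notebook would be compared on the same test rows across reruns.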

In [19]:
#Train XG-Boost model in Scikit-Learn
#xgb without optimization
!pip install xgboost



In [20]:
import xgboost as xgb
model = xgb.XGBRegressor(objective = 'reg:squarederror', learning_rate = 1, max_depth = 3, n_estimators = 500)
model.fit(X_train, y_train)

result = model.score(X_test, y_test)
y_predict = model.predict(X_test)

In [21]:
#Assess the performance of the trained XG-Boost model using RMSE and R2
from sklearn.metrics import r2_score, mean_squared_error

# RMSE is the square root of the mean squared error, rounded to 3 decimals
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict))), 3)
r2 = r2_score(y_test, y_predict)

print('RMSE= ', RMSE, '\nR2= ', r2)

RMSE=  8356.165 
R2=  0.7755828480677971
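
As a reminder of what the two metrics measure, the sketch below computes them from their definitions on four made-up values (the numbers are illustrative, not taken from the car data). RMSE is the square root of the mean squared residual; R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

# Root mean squared error: same units as the target (dollars for MSRP)
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: share of the target's variance explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse, r2)
```

Because RMSE is in the target's units, the 8356.165 above reads as a typical error of roughly $8,400 against a mean MSRP of about $32,775.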


In [22]:
#Perform hyperparameter optimization using GridSearch, choose a reasonable value for max_depth,
#learning_rate, n_estimators, and colsample_bytree. Use 5-folds cross validation
from sklearn.model_selection import GridSearchCV
#xgb with grid search
parameters_grid = { 'learning_rate' : [0.1, 0.5],
                    'max_depth' : [3, 10, 20],
                     'n_estimators' : [100, 500],
                     'colsample_bytree' : [0.3, 0.7]}

model = xgb.XGBRegressor(objective= 'reg:squarederror')

# GridSearchCV always maximizes the score, so the error is passed as
# negative MSE: the candidate with the lowest MSE gets the highest score
xgb_gridsearch = GridSearchCV(estimator = model,
                              param_grid = parameters_grid,
                              scoring = 'neg_mean_squared_error',
                              cv = 5,
                              verbose = 5)

xgb_gridsearch.fit(X_train, y_train)
y_predict = xgb_gridsearch.predict(X_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=-24174922.311 total time=   0.1s
[CV 2/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=-35437722.136 total time=   0.0s
[CV 3/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=-62418349.156 total time=   0.1s
[CV 4/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=-265798951.684 total time=   0.0s
[CV 5/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=-33036364.817 total time=   0.0s
[CV 1/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=500;, score=-21903001.750 total time=   0.5s
[CV 2/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=500;, score=-31596842.716 total time=   0.5s
[CV 3/5] END colsample_bytree=0.3, learning_rate=0.1, max_dept
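
The log above can be read with two small calculations, sketched here. `ParameterGrid` from scikit-learn enumerates the same combinations that `GridSearchCV` tries, and the fold scores are copied from the first five log lines; taking the square root of the negated score turns negative MSE into an RMSE in dollars. The variable names are illustrative.

```python
from math import sqrt
from sklearn.model_selection import ParameterGrid

parameters_grid = {
    "learning_rate": [0.1, 0.5],
    "max_depth": [3, 10, 20],
    "n_estimators": [100, 500],
    "colsample_bytree": [0.3, 0.7],
}

# 2 * 3 * 2 * 2 combinations, each fitted once per fold
n_candidates = len(ParameterGrid(parameters_grid))
n_fits = n_candidates * 5
print(n_candidates, n_fits)

# Negative-MSE scores of the first candidate, copied from the log
fold_scores = [-24174922.311, -35437722.136, -62418349.156,
               -265798951.684, -33036364.817]

# Flip the sign and take the root to get a per-fold RMSE
fold_rmse = [sqrt(-score) for score in fold_scores]
print([round(value, 1) for value in fold_rmse])
```

The fourth fold is far worse than the others, which may be caused by a few very expensive cars (the maximum MSRP is 192465) landing in that validation fold.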

In [23]:
from sklearn.metrics import r2_score, mean_squared_error

# RMSE is the square root of the mean squared error, rounded to 3 decimals
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict))), 3)
r2 = r2_score(y_test, y_predict)

print('RMSE =', RMSE, '\nR2 =', r2)

RMSE = 5932.422 
R2 = 0.886888681278377


In [24]:
#Perform hyperparameter optimization using RandomSearch, choose a reasonable value for max_depth,
#learning_rate, n_estimators, and colsample_bytree. Use 5-folds cross validation and 100 iterations
from sklearn.model_selection import RandomizedSearchCV

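# Define the hyperparameter values to sample from.
# Only these four parameters are searched; the booster stays at its default,
# gbtree (tree-based), rather than gblinear (linear functions).
# The lists give 2 * 3 * 2 * 2 = 24 combinations, so n_iter = 100 is capped
# at 24 and every combination gets evaluated.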

grid = {
    'n_estimators': [100, 500],
    'max_depth': [3, 10, 20],
    'learning_rate': [0.1, 0.5],
    'colsample_bytree': [0.3, 0.7]}


import xgboost as xgb
model = xgb.XGBRegressor(objective ='reg:squarederror')

random_cv = RandomizedSearchCV(estimator = model,
                               param_distributions = grid,
                               cv = 5,
                               n_iter = 100,
                               scoring = 'neg_mean_absolute_error',
                               verbose = 5,
                               return_train_score = True)
random_cv.fit(X_train, y_train)

random_cv.best_estimator_
y_predict = random_cv.predict(X_test)


from sklearn.metrics import r2_score, mean_squared_error

# RMSE is the square root of the mean squared error, rounded to 3 decimals
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict))), 3)
r2 = r2_score(y_test, y_predict)

print('RMSE =',RMSE,'\nR2 =', r2)



Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=(train=-2659.908, test=-3521.827) total time=   0.0s
[CV 2/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=(train=-2682.694, test=-3995.828) total time=   0.0s
[CV 3/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=(train=-2606.781, test=-4160.904) total time=   0.0s
[CV 4/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=(train=-2288.606, test=-6723.692) total time=   0.0s
[CV 5/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=100;, score=(train=-2592.055, test=-3871.885) total time=   0.0s
[CV 1/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=500;, score=(train=-1196.569, test=-3430.324) total time=   0.4s
[CV 2/5] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n
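
The log reports 24 candidates even though `n_iter = 100` was requested, because the search space consists only of short lists. The sketch below reproduces that with scikit-learn's `ParameterSampler` and contrasts it with distributions from `scipy.stats`, which do allow 100 distinct draws. The ranges in `dist_space` are hypothetical choices for illustration, not values from the notebook.

```python
import warnings
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# The discrete lists used in the cell above: 2 * 3 * 2 * 2 = 24 combinations
list_grid = {
    "n_estimators": [100, 500],
    "max_depth": [3, 10, 20],
    "learning_rate": [0.1, 0.5],
    "colsample_bytree": [0.3, 0.7],
}

with warnings.catch_warnings():
    # scikit-learn warns that the grid is smaller than n_iter
    warnings.simplefilter("ignore")
    list_candidates = list(ParameterSampler(list_grid, n_iter=100, random_state=0))

# Hypothetical ranges, chosen for illustration only
dist_space = {
    "n_estimators": randint(100, 501),       # integers 100..500
    "max_depth": randint(3, 21),             # integers 3..20
    "learning_rate": loguniform(0.01, 0.5),  # continuous, log scale
    "colsample_bytree": uniform(0.3, 0.4),   # continuous on [0.3, 0.7]
}
dist_candidates = list(ParameterSampler(dist_space, n_iter=100, random_state=0))

print(len(list_candidates), len(dist_candidates))
```

A dictionary like `dist_space` can be passed as `param_distributions` to `RandomizedSearchCV`, in which case the random search explores settings that the grid search never sees.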

In [25]:
#Perform hyperparameter optimization using Bayesian optimization, choose a reasonable value for max_depth,
#learning_rate, n_estimators. Use 5-folds cross validation and 100 iterations.
#xgb with bayesian
! pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.9.0-py2.py3-none-any.whl (100 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Obtaining dependency information for pyaml>=16.9 from https://files.pythonhosted.org/packages/7e/ed/b5f644b7a1de2e966345e60dacc040d98371213df6ae4070ba19280ae6d4/pyaml-23.9.7-py3-none-any.whl.metadata
  Downloading pyaml-23.9.7-py3-no

In [26]:
from skopt import BayesSearchCV

search_space = {
    'learning_rate' : (0.01, 1.0, 'log-uniform'),
    'max_depth' : (4, 20, 'log-uniform'),
    'n_estimators' : (2, 100, 'log-uniform')}
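
The `'log-uniform'` prior means that values are spread evenly on a logarithmic scale, so small learning rates are explored as densely as large ones. The sketch below illustrates the idea with plain NumPy; it is not scikit-optimize's internal implementation, and the helper `sample_log_uniform` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch


def sample_log_uniform(low, high, size, rng):
    """Draw values whose logarithm is uniform on [log(low), log(high)]."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))


# Same bounds as the learning_rate entry in search_space
learning_rates = sample_log_uniform(0.01, 1.0, 10_000, rng)

# About half of the draws fall below the geometric mean sqrt(0.01 * 1.0) = 0.1
share_below = np.mean(learning_rates < 0.1)
print(learning_rates.min(), learning_rates.max(), share_below)
```

Under a plain uniform prior on the same interval, only about 9% of the draws would fall below 0.1.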

In [27]:
import xgboost as xgb
model = xgb.XGBRegressor(objective= 'reg:squarederror')

xgb_bayes_search = BayesSearchCV(model,
                                 search_space,
                                 n_iter = 100,
                                 scoring = 'neg_mean_absolute_error',
                                 cv = 5)

In [28]:
xgb_bayes_search.fit(X_train, y_train)

y_predict = xgb_bayes_search.predict(X_test)

In [29]:
from sklearn.metrics import r2_score, mean_squared_error

# RMSE is the square root of the mean squared error, rounded to 3 decimals
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict))), 3)
r2 = r2_score(y_test, y_predict)

print('RMSE =', RMSE, '\nR2 =', r2)

RMSE = 6978.635 
R2 = 0.843475285360471


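**NO OPTIMIZATION**

RMSE = 8356.165

R2 = 0.7755828480677971 ≈ **77.6%**

**GRID SEARCH CV**

RMSE = 5932.422

R2 = 0.886888681278377 ≈ **88.7%**

**RANDOM SEARCH CV**

RMSE = 5932.422

R2 = 0.886888681278377 ≈ **88.7%**

**BAYES SEARCH CV**

RMSE = 6978.635

R2 = 0.843475285360471 ≈ **84.3%**

**For this data and this train/test split, GRID SEARCH CV and RANDOM SEARCH CV give the best result (lowest RMSE, highest R2), followed by BAYES SEARCH CV; all three improve on the untuned model. The identical grid and random scores are consistent with the random search having evaluated the same 24 combinations as the grid search. Note also that the grid search was scored with negative MSE while the random and Bayesian searches were scored with negative MAE, and that the Bayesian search covered a different range (n_estimators from 2 to 100, no colsample_bytree), so the comparison is not fully like for like.**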
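
The four results can also be gathered in one table, which makes the ranking and the gain over the untuned model easier to read. This is a minimal sketch that only restates the RMSE and R2 values reported in this notebook; the column name `RMSE_reduction_pct` is introduced here for illustration.

```python
import pandas as pd

# RMSE and R2 exactly as reported in the summary of this notebook
results = pd.DataFrame(
    {
        "RMSE": [8356.165, 5932.422, 5932.422, 6978.635],
        "R2": [0.7755828480677971, 0.886888681278377,
               0.886888681278377, 0.843475285360471],
    },
    index=["No optimization", "Grid search", "Random search", "Bayes search"],
)

# Relative RMSE reduction versus the untuned model, in percent
baseline_rmse = results.loc["No optimization", "RMSE"]
results["RMSE_reduction_pct"] = (
    100 * (baseline_rmse - results["RMSE"]) / baseline_rmse
).round(1)

# Best (lowest RMSE) first
print(results.sort_values("RMSE"))
```

Since all numbers come from a single unseeded 107-row test set, the differences between the tuned models should be read as indicative rather than conclusive.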