<a class="anchor" id="0"></a>

# Estimating the selling price of diesel cars with 3 models + EDA

## **Acknowledgements**
#### This kernel builds on the following kernels:
   - https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
   - https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d-abnormals-filter
   - https://www.kaggle.com/vbmokin/feature-importance-xgb-lgbm-logreg-linreg
   - https://www.kaggle.com/darkcore/house-sales-visualization

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Import libraries](#1)
2. [Download datasets](#2)
3. [EDA & Visualization](#3)
4. [FE: building the feature importance diagrams](#4)
  - [LGBM](#4.1)
  - [XGB](#4.2)
  - [Logistic Regression](#4.3)
  - [Linear Regression](#4.4)
5. [Comparison of all feature importance diagrams](#5)
6. [Preparing for modeling](#6)
7. [Tuning models and testing on all features](#7)
  - [LGBM](#7.1)
  - [XGB](#7.2)
  - [Decision Tree Regressor](#7.3)
8. [Models comparison](#8)
9. [Prediction](#9)

    


## 1. Import libraries <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.style as style
from mpl_toolkits import mplot3d
from scipy import stats
import seaborn as sns
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode,iplot
%matplotlib inline

# preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
import pandas_profiling as pp
from sklearn.linear_model import LogisticRegression

# models
from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor 
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, VotingRegressor 
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
import sklearn.model_selection
from sklearn.model_selection import cross_val_predict as cvp
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn import preprocessing
import xgboost as xgb
import lightgbm as lgb

# model tuning
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe, space_eval

import warnings
warnings.filterwarnings("ignore")

## 2. Download datasets <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
valid_part = 0.3
pd.set_option('display.max_columns', 100)

In [None]:
train0 = pd.read_csv('/kaggle/input/craigslist-carstrucks-data/craigslistVehicles.csv')
train0.head(5)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
drop_columns = ['url', 'city', 'city_url', 'make', 'title_status', 'VIN', 'size', 'image_url', 'desc', 'lat','long']
train0 = train0.drop(columns = drop_columns)

In [None]:
train0.info()

In [None]:
train0 = train0.dropna()
train0.head(5)

In [None]:
# Clone data for FE 
_train = copy.deepcopy(train0)

In [None]:
train0 = train0.loc[train0['fuel'] == 'diesel']
train0.head(5)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Thanks to: https://www.kaggle.com/vbmokin/automatic-selection-from-20-classifier-models
# Determine the categorical (non-numeric) features
numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_columns = []
features = train0.columns.values.tolist()
for col in features:
    if train0[col].dtype in numerics: continue
    categorical_columns.append(col)
# Encoding categorical features
for col in categorical_columns:
    if col in train0.columns:
        le = LabelEncoder()
        le.fit(list(train0[col].astype(str).values))
        train0[col] = le.transform(list(train0[col].astype(str).values))
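
The loop above replaces every category with an integer code. As a dependency-free illustration (a sketch, not scikit-learn's implementation), the snippet below mimics that behaviour for string values: the distinct values are sorted and numbered from zero. The helper name `label_encode` and the sample values are hypothetical.

```python
def label_encode(values):
    """Map each distinct string to an integer code, assigned in sorted order."""
    classes = sorted(set(str(v) for v in values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[str(v)] for v in values], classes


# Hypothetical sample of a categorical column
codes, classes = label_encode(["gas", "diesel", "gas", "electric"])
print(classes)
print(codes)
```

The codes carry no real ordering. Tree-based models usually cope with that, while linear models may read the codes as an ordinal scale.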

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
train0['year'] = (train0['year']-1900).astype(int)
train0['odometer'] = train0['odometer'].astype(int)

In [None]:
train0.head(10)

In [None]:
train0.info()

## 3. EDA & Visualization <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d-abnormals-filter
#Thanks to https://www.kaggle.com/masumrumi/a-detailed-regression-guide-with-house-pricing
def plotting_3_chart(df, feature):
    ## Apply the fivethirtyeight plotting style.
    style.use('fivethirtyeight')

    ## Create the figure with a constrained layout.
    fig = plt.figure(constrained_layout=True, figsize=(15,10))
    ## Create a grid of 3 columns and 3 rows.
    grid = gridspec.GridSpec(ncols=3, nrows=3, figure=fig)
    
    ## Customizing the histogram grid. 
    ax1 = fig.add_subplot(grid[0, :2])
    ## Set the title. 
    ax1.set_title('Histogram')
    ## plot the histogram. 
    sns.distplot(df.loc[:,feature], norm_hist=True, ax = ax1)

    # customizing the QQ_plot. 
    ax2 = fig.add_subplot(grid[1, :2])
    ## Set the title. 
    ax2.set_title('QQ_plot')
    ## Plotting the QQ_Plot. 
    stats.probplot(df.loc[:,feature], plot = ax2)

    ## Customizing the Box Plot. 
    ax3 = fig.add_subplot(grid[:, 2])
    ## Set title. 
    ax3.set_title('Box Plot')
    ## Plotting the box plot. 
    sns.boxplot(df.loc[:,feature], orient='v', ax = ax3 );

In [None]:
train0['price'].value_counts()

In [None]:
train0.describe(percentiles=[.01, .05, .1, .5, .9, .92, .93, .94, .96, .97, .99])

In [None]:
train0 = train0[train0['price'] > 1000]
train0 = train0[train0['price'] < 58566]

train0['odometer'] = train0['odometer'] // 5000
train0 = train0[train0['year'] > 100]
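
The cell above keeps prices strictly between 1000 and 58566, groups the odometer into 5000-unit bins, and keeps only rows with `year > 100`. Since `year` was stored earlier as `year - 1900`, that last filter keeps model years after 2000. Below is a minimal plain-Python sketch of the same rules on made-up rows; the function name `clean` and the sample data are hypothetical.

```python
def clean(rows, low=1000, high=58566, bin_width=5000, min_year=100):
    """Apply the price bounds, the year filter and the odometer binning."""
    kept = []
    for row in rows:
        if not (low < row["price"] < high):
            continue  # price outside the accepted range
        if row["year"] <= min_year:
            continue  # model year 2000 or earlier (year is stored as year - 1900)
        kept.append({**row, "odometer": row["odometer"] // bin_width})
    return kept


# Made-up rows for illustration only
rows = [
    {"price": 500, "year": 110, "odometer": 120000},
    {"price": 15000, "year": 112, "odometer": 87450},
    {"price": 90000, "year": 118, "odometer": 10000},
    {"price": 8000, "year": 99, "odometer": 200000},
]
cleaned = clean(rows)
print(cleaned)
```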

In [None]:
train0.info()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d-abnormals-filter
plotting_3_chart(train0, 'price')

In [None]:
# Thanks to: https://www.kaggle.com/dzvlfi/craiglist-eda-dzulfiqar-ridha
#create correlation with heatmap
corr = train0.corr(method = 'pearson')

#convert correlation to numpy array
mask = np.array(corr)

#to mask the repetitive value for each pair
mask[np.tril_indices_from(mask)] = False
fig, ax = plt.subplots(figsize = (15,12))
fig.set_size_inches(20,20)
sns.heatmap(corr, mask = mask, vmax = 0.9, square = True, annot = True)

In [None]:
#Thanks to https://towardsdatascience.com/an-easy-introduction-to-3d-plotting-with-matplotlib-801561999725

fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection="3d")

z_points = train0['price']
x_points = train0['odometer']
y_points = train0['year']
ax.scatter3D(x_points, y_points, z_points, c=z_points, cmap='hsv');

ax.set_xlabel('odometer')
ax.set_ylabel('year')
ax.set_zlabel('price')

plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create Cylinders Frame
cylindersframe = pd.DataFrame({"Cylinders":_train.cylinders.value_counts().index,"Car_cylinders":_train.cylinders.value_counts().values})
cylindersframe["Cylinders"] = cylindersframe["Cylinders"].apply(lambda x : "" + str(x))
cylindersframe.set_index("Cylinders",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = cylindersframe.index,values = cylindersframe.Car_cylinders,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Cylinders Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create Condition Frame
conditionframe = pd.DataFrame({"Condition":_train.condition.value_counts().index,"Car_conditions":_train.condition.value_counts().values})
conditionframe["Condition"] = conditionframe["Condition"].apply(lambda x : "" + str(x))
conditionframe.set_index("Condition",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = conditionframe.index,values = conditionframe.Car_conditions,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Condition Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create Manufacturer Frame
manufacturerframe = pd.DataFrame({"Manufacturer":_train.manufacturer.value_counts().index,"Car_manufacturer":_train.manufacturer.value_counts().values})
manufacturerframe["Manufacturer"] = manufacturerframe["Manufacturer"].apply(lambda x : "" + str(x))
manufacturerframe.set_index("Manufacturer",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = manufacturerframe.index,values = manufacturerframe.Car_manufacturer,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Manufacturer Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create Fuel Frame
fuelframe = pd.DataFrame({"Fuel":_train.fuel.value_counts().index,"Car_fuel":_train.fuel.value_counts().values})
fuelframe["Fuel"] = fuelframe["Fuel"].apply(lambda x : "" + str(x))
fuelframe.set_index("Fuel",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = fuelframe.index,values = fuelframe.Car_fuel,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Fuel Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create transmission Frame
transmissionframe = pd.DataFrame({"Transmission":_train.transmission.value_counts().index,"Car_transmission":_train.transmission.value_counts().values})
transmissionframe["Transmission"] = transmissionframe["Transmission"].apply(lambda x : "" + str(x))
transmissionframe.set_index("Transmission",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = transmissionframe.index,values = transmissionframe.Car_transmission,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Transmission Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create type Frame
typeframe = pd.DataFrame({"Type":_train.type.value_counts().index,"Car_type":_train.type.value_counts().values})
typeframe["Type"] = typeframe["Type"].apply(lambda x : "" + str(x))
typeframe.set_index("Type",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = typeframe.index,values = typeframe.Car_type,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Type Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create paint_color Frame
paint_colorframe = pd.DataFrame({"Paint_color":_train.paint_color.value_counts().index,"Car_paint_color":_train.paint_color.value_counts().values})
paint_colorframe["Paint_color"] = paint_colorframe["Paint_color"].apply(lambda x : "" + str(x))
paint_colorframe.set_index("Paint_color",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = paint_colorframe.index,values = paint_colorframe.Car_paint_color,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Paint color Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
#Create drive Frame
driveframe = pd.DataFrame({"Drive":_train.drive.value_counts().index,"Car_drive":_train.drive.value_counts().values})
driveframe["Drive"] = driveframe["Drive"].apply(lambda x : "" + str(x))
driveframe.set_index("Drive",inplace=True)

In [None]:
# Thanks to: https://www.kaggle.com/darkcore/house-sales-visualization
p1 = [go.Pie(labels = driveframe.index,values = driveframe.Car_drive,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Drive Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)

In [None]:
train0.info()

In [None]:
train0.corr()

In [None]:
train0.describe()

In [None]:
pp.ProfileReport(train0)

<a class="anchor" id="4"></a>
## 4. FE: building the feature importance diagrams
##### [Back to Table of Contents](#0.1)

In [None]:
# Clone data for FE 
train_fe = copy.deepcopy(train0)
target_fe = train_fe['price']
del train_fe['price']

<a class="anchor" id="4.1"></a>
### 4.1 LGBM
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
X = train_fe
z = target_fe

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
#%% split the data into training and validation sets
Xtrain, Xval, Ztrain, Zval = train_test_split(X, z, test_size=0.2, random_state=0)
train_set = lgb.Dataset(Xtrain, Ztrain, silent=False)
valid_set = lgb.Dataset(Xval, Zval, silent=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
params = {
        'boosting_type':'gbdt',
        'objective': 'regression',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'max_depth': -1,
        'subsample': 0.8,
        'bagging_fraction' : 1,
        'max_bin' : 5000 ,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'rmse',
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 10,
        'scale_pos_weight':1,
        'zero_as_missing': True,
        'seed':0,        
    }

modelL = lgb.train(params, train_set = train_set, num_boost_round=1000,
                   early_stopping_rounds=50,verbose_eval=10, valid_sets=valid_set)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
fig =  plt.figure(figsize = (15,15))
axes = fig.add_subplot(111)
lgb.plot_importance(modelL,ax = axes,height = 0.5)
plt.show();plt.close()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
feature_score = pd.DataFrame(train_fe.columns, columns = ['feature']) 
feature_score['score_lgb'] = modelL.feature_importance()

<a class="anchor" id="4.2"></a>
### 4.2 XGB
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
#%% wrap the training and validation sets in DMatrix objects
data_tr  = xgb.DMatrix(Xtrain, label=Ztrain)
data_cv  = xgb.DMatrix(Xval   , label=Zval)
evallist = [(data_tr, 'train'), (data_cv, 'valid')]

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
parms = {'max_depth':8, #maximum depth of a tree
         'objective':'reg:squarederror',
         'eta'      :0.3,
         'subsample':0.8, #fraction of rows sampled for each tree
         'lambda'   :4, #L2 regularization term,>1 more conservative 
         'colsample_bytree':0.9,
         'colsample_bylevel':1,
         'min_child_weight': 10}
modelx = xgb.train(parms, data_tr, num_boost_round=200, evals = evallist,
                  early_stopping_rounds=30, maximize=False, 
                  verbose_eval=10)

print('score = %1.5f, n_boost_round =%d.'%(modelx.best_score,modelx.best_iteration))

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
fig =  plt.figure(figsize = (15,15))
axes = fig.add_subplot(111)
xgb.plot_importance(modelx,ax = axes,height = 0.5)
plt.show();plt.close()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
feature_score['score_xgb'] = feature_score['feature'].map(modelx.get_score(importance_type='weight'))
feature_score

<a class="anchor" id="4.3"></a>
### 4.3 Logistic Regression
##### [Back to Table of Contents](#0.1)

In [None]:
# Min-max scaling for the regression models
train_fe = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(train_fe),
    columns=train_fe.columns,
    index=train_fe.index
)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(train_fe, target_fe)
# 'price' was already removed from train_fe, so keep every column name
coeff_logreg = pd.DataFrame(train_fe.columns)
coeff_logreg.columns = ['feature']
coeff_logreg["score_logreg"] = pd.Series(logreg.coef_[0])
coeff_logreg.sort_values(by='score_logreg', ascending=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
# the level of importance of features is not associated with the sign
coeff_logreg["score_logreg"] = coeff_logreg["score_logreg"].abs()
feature_score = pd.merge(feature_score, coeff_logreg, on='feature')

<a class="anchor" id="4.4"></a>
### 4.4 Linear Regression
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
# Linear Regression

linreg = LinearRegression()
linreg.fit(train_fe, target_fe)
# 'price' was already removed from train_fe, so keep every column name
coeff_linreg = pd.DataFrame(train_fe.columns)
coeff_linreg.columns = ['feature']
coeff_linreg["score_linreg"] = pd.Series(linreg.coef_)
coeff_linreg.sort_values(by='score_linreg', ascending=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-fe-eda-with-3d
coeff_linreg["score_linreg"] = coeff_linreg["score_linreg"].abs()
feature_score = pd.merge(feature_score, coeff_linreg, on='feature')
feature_score = feature_score.fillna(0)
feature_score = feature_score.set_index('feature')
feature_score

## 5. Comparison of all feature importance diagrams <a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/feature-importance-xgb-lgbm-logreg-linreg
# Thanks to: https://www.kaggle.com/nanomathias/feature-engineering-importance-testing
# MinMax scale all importances
feature_score = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(feature_score),
    columns=feature_score.columns,
    index=feature_score.index
)

# Create mean column
feature_score['mean'] = feature_score.mean(axis=1)

# Plot the feature importances
feature_score.sort_values('mean', ascending=False).plot(kind='bar', figsize=(20, 10))

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/feature-importance-xgb-lgbm-logreg-linreg
feature_score.sort_values('mean', ascending=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/feature-importance-xgb-lgbm-logreg-linreg
# Create total column with different weights
feature_score['total'] = 0.5*feature_score['score_lgb'] + 0.3*feature_score['score_xgb'] \
                       + 0.1*feature_score['score_logreg'] + 0.1*feature_score['score_linreg']

# Plot the feature importances
feature_score.sort_values('total', ascending=False).plot(kind='bar', figsize=(20, 10))
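
The two cells above first min-max scale every importance column to [0, 1] and then combine the columns with the weights 0.5, 0.3, 0.1 and 0.1. The plain-Python sketch below reproduces that arithmetic on made-up scores for three features. The helper `min_max` and the numbers are hypothetical, and a constant column is mapped to 0 here.

```python
def min_max(column):
    """Scale a list of numbers to [0, 1]; a constant column becomes all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Made-up raw importances for three features
raw = {
    "score_lgb": [300, 100, 200],
    "score_xgb": [50, 10, 30],
    "score_logreg": [0.2, 0.8, 0.5],
    "score_linreg": [4000, 1000, 2500],
}
# Same weights as in the cell above
weights = {"score_lgb": 0.5, "score_xgb": 0.3, "score_logreg": 0.1, "score_linreg": 0.1}

scaled = {name: min_max(column) for name, column in raw.items()}
total = [
    sum(weights[name] * scaled[name][i] for name in weights)
    for i in range(3)
]
print(total)
```

Scaling first keeps a column with large raw values, such as linear regression coefficients, from dominating the weighted sum.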

In [None]:
feature_score.sort_values('total', ascending=False)

## 6. Preparing for modeling <a class="anchor" id="6"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
target_name = 'price'
train_target0 = train0[target_name]
train0 = train0.drop([target_name], axis=1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Split a hold-out test set (test0) off from train0
train0, test0, train_target0, test_target0 = train_test_split(train0, train_target0, test_size=0.2, random_state=0)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# For boosting model
train0b = train0
train_target0b = train_target0
# Split off a validation set that serves as the test set for model selection
trainb, testb, targetb, target_testb = train_test_split(train0b, train_target0b, test_size=valid_part, random_state=0)

In [None]:
#For models from Sklearn
scaler = StandardScaler()
train0 = pd.DataFrame(scaler.fit_transform(train0), columns = train0.columns)

In [None]:
train0.head(3)

In [None]:
len(train0)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Split off a validation set that serves as the test set for model selection
train, test, target, target_test = train_test_split(train0, train_target0, test_size=valid_part, random_state=0)
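
The data is split twice: first 20% is held out as `test0`, then `valid_part = 0.3` of the remainder becomes the validation set. The short sketch below works out the resulting shares of the original data. The function name is hypothetical, and the exact row counts depend on how `train_test_split` rounds.

```python
def nested_split_fractions(test_size=0.2, valid_part=0.3):
    """Return the (train, validation, test) shares of the original data."""
    remaining = 1.0 - test_size       # share left after the first split
    valid = remaining * valid_part    # second split takes valid_part of the remainder
    train = remaining - valid
    return train, valid, test_size


train_frac, valid_frac, test_frac = nested_split_fractions()
print(round(train_frac, 2), round(valid_frac, 2), round(test_frac, 2))
```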

In [None]:
train.head(3)

In [None]:
test.head(3)

In [None]:
train.info()

In [None]:
test.info()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
acc_train_r2 = []
acc_test_r2 = []
acc_train_d = []
acc_test_d = []
acc_train_rmse = []
acc_test_rmse = []

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
def acc_d(y_meas, y_pred):
    # Relative error between predicted y_pred and measured y_meas values
    return mean_absolute_error(y_meas, y_pred)*len(y_meas)/sum(abs(y_meas))

def acc_rmse(y_meas, y_pred):
    # RMSE between predicted y_pred and measured y_meas values
    return (mean_squared_error(y_meas, y_pred))**0.5
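
Because the mean absolute error times the number of samples is the sum of absolute errors, `acc_d` equals the total absolute error divided by the total absolute target. The dependency-free sketch below shows both metrics on two made-up prices; the function names `relative_error` and `rmse` are hypothetical stand-ins for `acc_d` and `acc_rmse`.

```python
def relative_error(y_meas, y_pred):
    """Sum of absolute errors divided by the sum of absolute measured values."""
    abs_err = sum(abs(m - p) for m, p in zip(y_meas, y_pred))
    return abs_err / sum(abs(m) for m in y_meas)


def rmse(y_meas, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_meas)
    return (sum((m - p) ** 2 for m, p in zip(y_meas, y_pred)) / n) ** 0.5


# Made-up measured and predicted prices
y_meas = [100, 200]
y_pred = [110, 190]
print(relative_error(y_meas, y_pred))  # 20 / 300
print(rmse(y_meas, y_pred))            # sqrt((100 + 100) / 2)
```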

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
def acc_boosting_model(num,model,train,test,num_iteration=0):
    # Calculation of accuracy of boosting model by different metrics
    
    global acc_train_r2, acc_test_r2, acc_train_d, acc_test_d, acc_train_rmse, acc_test_rmse
    
    if num_iteration > 0:
        ytrain = model.predict(train, num_iteration = num_iteration)  
        ytest = model.predict(test, num_iteration = num_iteration)
    else:
        ytrain = model.predict(train)  
        ytest = model.predict(test)

    print('target = ', targetb[:5].values)
    print('ytrain = ', ytrain[:5])

    acc_train_r2_num = round(r2_score(targetb, ytrain) * 100, 2)
    print('acc(r2_score) for train =', acc_train_r2_num)   
    acc_train_r2.insert(num, acc_train_r2_num)

    acc_train_d_num = round(acc_d(targetb, ytrain) * 100, 2)
    print('acc(relative error) for train =', acc_train_d_num)   
    acc_train_d.insert(num, acc_train_d_num)

    acc_train_rmse_num = round(acc_rmse(targetb, ytrain) * 100, 2)
    print('acc(rmse) for train =', acc_train_rmse_num)   
    acc_train_rmse.insert(num, acc_train_rmse_num)

    print('target_test =', target_testb[:5].values)
    print('ytest =', ytest[:5])
    
    acc_test_r2_num = round(r2_score(target_testb, ytest) * 100, 2)
    print('acc(r2_score) for test =', acc_test_r2_num)
    acc_test_r2.insert(num, acc_test_r2_num)
    
    acc_test_d_num = round(acc_d(target_testb, ytest) * 100, 2)
    print('acc(relative error) for test =', acc_test_d_num)
    acc_test_d.insert(num, acc_test_d_num)
    
    acc_test_rmse_num = round(acc_rmse(target_testb, ytest) * 100, 2)
    print('acc(rmse) for test =', acc_test_rmse_num)
    acc_test_rmse.insert(num, acc_test_rmse_num)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
def acc_model(num,model,train,test):
    # Calculation of accuracy of model from Sklearn by different metrics   
  
    global acc_train_r2, acc_test_r2, acc_train_d, acc_test_d, acc_train_rmse, acc_test_rmse
    
    ytrain = model.predict(train)  
    ytest = model.predict(test)

    print('target = ', target[:5].values)
    print('ytrain = ', ytrain[:5])

    acc_train_r2_num = round(r2_score(target, ytrain) * 100, 2)
    print('acc(r2_score) for train =', acc_train_r2_num)   
    acc_train_r2.insert(num, acc_train_r2_num)

    acc_train_d_num = round(acc_d(target, ytrain) * 100, 2)
    print('acc(relative error) for train =', acc_train_d_num)   
    acc_train_d.insert(num, acc_train_d_num)

    acc_train_rmse_num = round(acc_rmse(target, ytrain) * 100, 2)
    print('acc(rmse) for train =', acc_train_rmse_num)   
    acc_train_rmse.insert(num, acc_train_rmse_num)

    print('target_test =', target_test[:5].values)
    print('ytest =', ytest[:5])
    
    acc_test_r2_num = round(r2_score(target_test, ytest) * 100, 2)
    print('acc(r2_score) for test =', acc_test_r2_num)
    acc_test_r2.insert(num, acc_test_r2_num)
    
    acc_test_d_num = round(acc_d(target_test, ytest) * 100, 2)
    print('acc(relative error) for test =', acc_test_d_num)
    acc_test_d.insert(num, acc_test_d_num)
    
    acc_test_rmse_num = round(acc_rmse(target_test, ytest) * 100, 2)
    print('acc(rmse) for test =', acc_test_rmse_num)
    acc_test_rmse.insert(num, acc_test_rmse_num)

## 7. Tuning models and testing on all features <a class="anchor" id="7"></a>

[Back to Table of Contents](#0.1)

### 7.1 LGBM <a class="anchor" id="7.1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
#%% split the data into training and validation sets
Xtrain, Xval, Ztrain, Zval = train_test_split(trainb, targetb, test_size=0.2, random_state=0)
train_set = lgb.Dataset(Xtrain, Ztrain, silent=False)
valid_set = lgb.Dataset(Xval, Zval, silent=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
params = {
        'boosting_type':'gbdt',
        'objective': 'regression',
        'num_leaves': 31,
        'learning_rate': 0.01,
        'max_depth': -1,
        'subsample': 0.8,
        'bagging_fraction' : 1,
        'max_bin' : 5000 ,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'rmse',
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 10,
        'scale_pos_weight':1,
        'zero_as_missing': False,
        'seed':0,        
    }
modelL = lgb.train(params, train_set = train_set, num_boost_round=10000,
                   early_stopping_rounds=8000,verbose_eval=500, valid_sets=valid_set)
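
The call above allows up to 10000 boosting rounds and stops early only after 8000 rounds without improvement on the validation set, so early stopping will rarely trigger. The sketch below illustrates the general patience rule in plain Python. It is not LightGBM's implementation, and the function name, the zero-based round index and the sample scores are assumptions.

```python
def best_round_with_patience(valid_scores, patience):
    """Return (best_round, best_score, last_round) for a lower-is-better metric."""
    best_score = float("inf")
    best_round = -1
    last_round = -1
    for last_round, score in enumerate(valid_scores):
        if score < best_score:
            best_score, best_round = score, last_round
        elif last_round - best_round >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_round, best_score, last_round


# Made-up validation RMSE per round
scores = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
print(best_round_with_patience(scores, patience=3))
print(best_round_with_patience(scores, patience=10))
```

With a short patience the loop stops before it reaches the late improvement in the last round. A very long patience avoids that at the cost of training time.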

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
acc_boosting_model(1,modelL,trainb,testb,modelL.best_iteration)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
fig =  plt.figure(figsize = (5,5))
axes = fig.add_subplot(111)
lgb.plot_importance(modelL,ax = axes,height = 0.5)
plt.show();
plt.close()

### 7.2 XGB<a class="anchor" id="7.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
xgb_clf = xgb.XGBRegressor(objective='reg:squarederror')
parameters = {'n_estimators': [60, 100, 120, 140], 
              'learning_rate': [0.01, 0.1],
              'max_depth': [5, 7],
              'reg_lambda': [0.5]}
xgb_reg = GridSearchCV(estimator=xgb_clf, param_grid=parameters, cv=5, n_jobs=-1).fit(trainb, targetb)
print("Best score: %0.3f" % xgb_reg.best_score_)
print("Best parameters set:", xgb_reg.best_params_)
acc_boosting_model(2,xgb_reg,trainb,testb)

### 7.3 Decision Tree Regressor <a class="anchor" id="7.3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Decision Tree Regression

decision_tree = DecisionTreeRegressor()
decision_tree.fit(train, target)
acc_model(3,decision_tree,train,test)

## 8. Models comparison <a class="anchor" id="8"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
models = pd.DataFrame({
    'Model': ['LGBM', 'XGB', 'Decision Tree Regressor'],
    
    'r2_train': acc_train_r2,
    'r2_test': acc_test_r2,
    'd_train': acc_train_d,
    'd_test': acc_test_d,
    'rmse_train': acc_train_rmse,
    'rmse_test': acc_test_rmse
                     })

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
print('Prediction accuracy for models by R2 criterion - r2_test')
models.sort_values(by=['r2_test', 'r2_train'], ascending=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
print('Prediction accuracy for models by relative error - d_test')
models.sort_values(by=['d_test', 'd_train'], ascending=True)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
print('Prediction accuracy for models by RMSE - rmse_test')
models.sort_values(by=['rmse_test', 'rmse_train'], ascending=True)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Plot
plt.figure(figsize=[25,6])
xx = models['Model']
plt.tick_params(labelsize=14)
plt.plot(xx, models['r2_train'], label = 'r2_train')
plt.plot(xx, models['r2_test'], label = 'r2_test')
plt.legend()
plt.title('R2-criterion for 3 models on train and test datasets')
plt.xlabel('Models')
plt.ylabel('R2-criterion, %')
plt.xticks(xx, rotation='vertical')
plt.savefig('graph.png')
plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Plot
plt.figure(figsize=[25,6])
xx = models['Model']
plt.tick_params(labelsize=14)
plt.plot(xx, models['d_train'], label = 'd_train')
plt.plot(xx, models['d_test'], label = 'd_test')
plt.legend()
plt.title('Relative errors for 3 models on train and test datasets')
plt.xlabel('Models')
plt.ylabel('Relative error, %')
plt.xticks(xx, rotation='vertical')
plt.savefig('graph.png')
plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/used-cars-price-prediction-by-15-models
# Plot
plt.figure(figsize=[25,6])
xx = models['Model']
plt.tick_params(labelsize=14)
plt.plot(xx, models['rmse_train'], label = 'rmse_train')
plt.plot(xx, models['rmse_test'], label = 'rmse_test')
plt.legend()
plt.title('RMSE for 3 models on train and test datasets')
plt.xlabel('Models')
plt.ylabel('RMSE, %')
plt.xticks(xx, rotation='vertical')
plt.savefig('graph.png')
plt.show()

## 9. Prediction <a class="anchor" id="9"></a>

[Back to Table of Contents](#0.1)

In [None]:
test0.info()

In [None]:
test0.head(3)

In [None]:
# LGB Regression model for basic train
lgb_predict = modelL.predict(test0)
lgb_predict[:3]

In [None]:
# XGB Regression model for basic train
# Fit on the unscaled copy (train0b), because test0 was never scaled
xgb_reg.fit(train0b, train_target0b)
xgb_predict = xgb_reg.predict(test0)
xgb_predict[:3]

In [None]:
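# Decision Tree Regression for basic train
# train0 was standardized in section 6, so test0 must pass through the same scaler
decision_tree.fit(train0, train_target0)
test0_scaled = pd.DataFrame(scaler.transform(test0), columns=test0.columns)
decision_trees_predict = decision_tree.predict(test0_scaled)
decision_trees_predict[:3]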

In [None]:
# Thanks to: https://www.kaggle.com/dnzcihan/house-sales-prediction-and-eda
final_df = test_target0.values
final_df = pd.DataFrame(final_df,columns=['Real_price'])
final_df['predicted_prices'] = lgb_predict.astype(int)
final_df['difference'] = abs(final_df['Real_price'] - final_df['predicted_prices']).astype(int)
final_df.head(20)