Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

In [1]:
!pip install --upgrade scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-0.24.2 threadpoolctl-2.1.0


The dataset presented here is used to build an app that estimates the price of a car from a set of given variables. Those variables become the features of the model.

## Data preparation

In [1]:
#Importing libraries and loading datasets.
import scipy
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

cars = pd.read_csv('/datasets/car_data.csv')

In [2]:
print(cars.columns)

Index(['DateCrawled', 'Price', 'VehicleType', 'RegistrationYear', 'Gearbox',
       'Power', 'Model', 'Mileage', 'RegistrationMonth', 'FuelType', 'Brand',
       'NotRepaired', 'DateCreated', 'NumberOfPictures', 'PostalCode',
       'LastSeen'],
      dtype='object')


In [3]:
#Renaming columns
cars = cars.rename(columns={'DateCrawled': 'date_crawled', 'Price': 'price', 'VehicleType':'vehicle_type', \
                     'RegistrationYear': 'registration_year', 'Gearbox':'gearbox', 'Power':'power', 'Model': 'model', \
                     'Mileage':'mileage', 'RegistrationMonth':'registration_month','FuelType':'fuel_type', 'Brand':'brand', \
                     'NotRepaired':'not_repaired','DateCreated': 'date_created','NumberOfPictures':'number_of_pictures',\
                     'PostalCode':'postal_code','LastSeen':'last_seen'})

In [4]:
display(cars)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,21/03/2016 09:50,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,21/03/2016 00:00,0,2694,21/03/2016 10:42
354365,14/03/2016 17:48,2200,,2005,,0,,20000,1,,sonstige_autos,,14/03/2016 00:00,0,39576,06/04/2016 00:46
354366,05/03/2016 19:56,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,05/03/2016 00:00,0,26135,11/03/2016 18:17
354367,19/03/2016 18:57,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,19/03/2016 00:00,0,87439,07/04/2016 07:15


In [5]:
#Describe method
print(cars.describe())

               price  registration_year          power        mileage  \
count  354369.000000      354369.000000  354369.000000  354369.000000   
mean     4416.656776        2004.234448     110.094337  128211.172535   
std      4514.158514          90.227958     189.850405   37905.341530   
min         0.000000        1000.000000       0.000000    5000.000000   
25%      1050.000000        1999.000000      69.000000  125000.000000   
50%      2700.000000        2003.000000     105.000000  150000.000000   
75%      6400.000000        2008.000000     143.000000  150000.000000   
max     20000.000000        9999.000000   20000.000000  150000.000000   

       registration_month  number_of_pictures    postal_code  
count       354369.000000            354369.0  354369.000000  
mean             5.714645                 0.0   50508.689087  
std              3.726421                 0.0   25783.096248  
min              0.000000                 0.0    1067.000000  
25%              3.000000  

In [6]:
#Counting NaNs
print(cars.isna().sum())

date_crawled              0
price                     0
vehicle_type          37490
registration_year         0
gearbox               19833
power                     0
model                 19705
mileage                   0
registration_month        0
fuel_type             32895
brand                     0
not_repaired          71154
date_created              0
number_of_pictures        0
postal_code               0
last_seen                 0
dtype: int64


In [7]:
#Visualizing data type
print(cars.dtypes)

date_crawled          object
price                  int64
vehicle_type          object
registration_year      int64
gearbox               object
power                  int64
model                 object
mileage                int64
registration_month     int64
fuel_type             object
brand                 object
not_repaired          object
date_created          object
number_of_pictures     int64
postal_code            int64
last_seen             object
dtype: object


In [8]:
display(cars.corr())

Unnamed: 0,price,registration_year,power,mileage,registration_month,number_of_pictures,postal_code
price,1.0,0.026916,0.158872,-0.333199,0.110581,,0.076055
registration_year,0.026916,1.0,-0.000828,-0.053447,-0.011619,,-0.003459
power,0.158872,-0.000828,1.0,0.024002,0.04338,,0.021665
mileage,-0.333199,-0.053447,0.024002,1.0,0.009571,,-0.007698
registration_month,0.110581,-0.011619,0.04338,0.009571,1.0,,0.013995
number_of_pictures,,,,,,,
postal_code,0.076055,-0.003459,0.021665,-0.007698,0.013995,,1.0


In [9]:
#date_crawled column.
print(cars.date_crawled.value_counts(dropna=False))

05/03/2016 14:25    66
05/03/2016 14:26    59
16/03/2016 18:49    55
05/03/2016 15:48    54
05/03/2016 14:49    54
                    ..
28/03/2016 00:44     1
21/03/2016 18:23     1
24/03/2016 01:59     1
01/04/2016 11:40     1
30/03/2016 13:59     1
Name: date_crawled, Length: 15470, dtype: int64


In [10]:
#Changing date_crawled column data type
cars['date_crawled'] = pd.to_datetime(cars['date_crawled'], format='%d/%m/%Y %H:%M')
cars['date_created'] = pd.to_datetime(cars['date_created'], format='%d/%m/%Y %H:%M')
cars['last_seen'] = pd.to_datetime(cars['last_seen'], format='%d/%m/%Y %H:%M')

In [11]:
#price column
print(cars.price.value_counts(dropna=False))

0        10772
500       5670
1500      5394
1000      4649
1200      4594
         ...  
13440        1
1414         1
8069         1
10370        1
384          1
Name: price, Length: 3731, dtype: int64


In [12]:
#vehicle_type column
print(cars.vehicle_type.value_counts(dropna=False))

sedan          91457
small          79831
wagon          65166
NaN            37490
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: vehicle_type, dtype: int64


In [13]:
#Converting NaNs in vehicle_type & model columns to the 'missing' label.
cars['vehicle_type'] = cars['vehicle_type'].replace(np.nan, 'missing')
cars['model'] = cars['model'].replace(np.nan, 'missing')

In [14]:
#check
print(cars.vehicle_type.value_counts(dropna=False))
print()
print(cars.model.value_counts(dropna=False))

sedan          91457
small          79831
wagon          65166
missing        37490
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: vehicle_type, dtype: int64

golf                  29232
other                 24421
3er                   19761
missing               19705
polo                  13066
                      ...  
serie_2                   8
rangerover                4
serie_3                   4
serie_1                   2
range_rover_evoque        2
Name: model, Length: 251, dtype: int64


In [15]:
#registration_year column.
print(cars.registration_year.value_counts(dropna=False))
print(cars.query('registration_year < 1882 or registration_year > 2021'))

2000    24490
1999    22728
2005    22109
2001    20124
2006    19900
        ...  
3200        1
1920        1
1919        1
1915        1
8455        1
Name: registration_year, Length: 151, dtype: int64
              date_crawled  price vehicle_type  registration_year gearbox  \
622    2016-03-16 16:55:00      0      missing               1111     NaN   
12946  2016-03-29 18:39:00     49      missing               5000     NaN   
15147  2016-03-14 00:52:00      0      missing               9999     NaN   
15870  2016-04-02 11:55:00   1700      missing               3200     NaN   
16062  2016-03-29 23:42:00    190      missing               1000     NaN   
...                    ...    ...          ...                ...     ...   
340548 2016-04-02 17:44:00      0      missing               3500  manual   
340759 2016-04-04 23:55:00    700      missing               1600  manual   
341791 2016-03-28 17:37:00      1      missing               3000     NaN   
348830 2016-03-22 00:38:0

In [16]:
#replacing abnormal values in registration_year column with NaNs.
cars['registration_year'] = np.where((cars['registration_year'] > 2021), np.nan, cars['registration_year'])
cars['registration_year'] = np.where((cars['registration_year'] < 1882), np.nan, cars['registration_year'])
print(cars['registration_year'].isna().sum())

171


In [17]:
#Dropping NaNs values and changing back column data type.
cars.dropna(subset=['registration_year'], how='any', inplace=True)
cars['registration_year'] = cars['registration_year'].astype('int32')

In [18]:
#gearbox column
print(cars.gearbox.value_counts(dropna=False))

manual    268225
auto       66278
NaN        19695
Name: gearbox, dtype: int64


In [19]:
#Assigning a fuel_type to NaN cars based on 'power' column.
cars.fuel_type = cars.fuel_type.fillna('X')
cars['fuel_type'] = np.where((cars['fuel_type'] == 'X'), cars.groupby('power')['fuel_type'].transform('max'), cars['fuel_type'])
cars.fuel_type = cars.fuel_type.replace('X', np.nan) 
cars.dropna(subset=['fuel_type'], how='any', inplace=True)

In [20]:
#Replacing abnormal values in the power column (0, <50, >500) & changing data type.
#Zeros are replaced with the mean power of cars with the same fuel_type.
cars['power'] = np.where((cars['power'] == 0), cars.groupby('fuel_type')['power'].transform('mean').round(0), cars['power'])
#Values above 500 or below 50 are replaced with the mean power of the same brand.
cars['power'] = np.where((cars['power'] > 500), cars.groupby('brand')['power'].transform('mean').round(0), cars['power'])
cars['power'] = np.where((cars['power'] < 50), cars.groupby('brand')['power'].transform('mean').round(0), cars['power'])
cars['power'] = cars['power'].astype('int16')

In [21]:
#fuel_type after fillna.
print(cars.fuel_type.value_counts(dropna=False))

petrol      249021
gasoline     98720
lpg           5313
cng            564
hybrid         233
other          203
electric        90
Name: fuel_type, dtype: int64


In [22]:
#replacing NaNs with automatic.
cars.gearbox = cars.gearbox.replace(np.nan, 'auto')
print(cars.gearbox.value_counts(normalize=True, dropna=False))

manual    0.757296
auto      0.242704
Name: gearbox, dtype: float64


In [23]:
#Dropping abnormal values in price & changing data type
cars['price'] = np.where((cars['price'] < 100), np.nan, cars['price'])
cars.dropna(subset=['price'], how='any', inplace=True)
cars['price'] = cars['price'].astype('int32')
print(cars.price.value_counts(dropna=False))

500      5660
1500     5389
1000     4646
1200     4592
2500     4434
         ... 
16320       1
1735        1
2657        1
9667        1
8188        1
Name: price, Length: 3674, dtype: int64


In [24]:
#Replacing not_repaired NaNs with the 'not_mentioned' label.
cars['not_repaired'] = cars['not_repaired'].replace(np.nan, 'not_mentioned')

In [25]:
print(cars.not_repaired.value_counts(normalize=True, dropna=False))

no               0.713334
not_mentioned    0.188462
yes              0.098204
Name: not_repaired, dtype: float64


In [26]:
#Check for more NaNs in the dataset.
print(cars.isna().sum())

date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
number_of_pictures    0
postal_code           0
last_seen             0
dtype: int64


In [33]:
#Resetting indexes in the original dataset.
cars= cars.reset_index(drop=True)
display(cars)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:00,480,3,1993,1,101,116,150000,0,6,38,1,2016-03-24,0,70435,2016-04-07 03:16:00
1,2016-03-24 10:58:00,18300,2,2011,1,190,153,125000,5,2,1,2,2016-03-24,0,66954,2016-04-07 01:46:00
2,2016-03-14 12:52:00,9800,7,2004,0,163,117,125000,8,2,14,1,2016-03-14,0,90480,2016-04-05 12:47:00
3,2016-03-17 16:54:00,1500,6,2001,1,75,116,150000,6,6,38,0,2016-03-17,0,91074,2016-03-17 17:40:00
4,2016-03-31 17:25:00,3600,6,2008,1,69,101,90000,7,2,31,0,2016-03-31,0,60437,2016-04-06 10:17:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340886,2016-03-27 20:36:00,1150,0,2000,1,101,250,150000,3,6,24,0,2016-03-27,0,26624,2016-03-29 10:17:00
340887,2016-03-14 17:48:00,2200,3,2005,0,101,153,20000,1,6,33,1,2016-03-14,0,39576,2016-04-06 00:46:00
340888,2016-03-05 19:56:00,1199,1,2000,0,101,106,125000,3,6,32,0,2016-03-05,0,26135,2016-03-11 18:17:00
340889,2016-03-19 18:57:00,9200,0,1996,1,102,225,150000,3,2,38,0,2016-03-19,0,87439,2016-04-07 07:15:00


In [28]:
#Creating encoded dataset.
encoder = LabelEncoder()
cars_ordinal = pd.DataFrame(cars, columns=cars.columns)
cars_ordinal['vehicle_type'] = encoder.fit_transform(cars_ordinal['vehicle_type'])
cars_ordinal['gearbox'] = encoder.fit_transform(cars_ordinal['gearbox'])
cars_ordinal['model'] = encoder.fit_transform(cars_ordinal['model'])
cars_ordinal['fuel_type'] = encoder.fit_transform(cars_ordinal['fuel_type'])
cars_ordinal['brand'] = encoder.fit_transform(cars_ordinal['brand'])
cars_ordinal['not_repaired'] = encoder.fit_transform(cars_ordinal['not_repaired'])
display(cars_ordinal)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:00,480,3,1993,1,101,116,150000,0,6,38,1,2016-03-24,0,70435,2016-04-07 03:16:00
1,2016-03-24 10:58:00,18300,2,2011,1,190,153,125000,5,2,1,2,2016-03-24,0,66954,2016-04-07 01:46:00
2,2016-03-14 12:52:00,9800,7,2004,0,163,117,125000,8,2,14,1,2016-03-14,0,90480,2016-04-05 12:47:00
3,2016-03-17 16:54:00,1500,6,2001,1,75,116,150000,6,6,38,0,2016-03-17,0,91074,2016-03-17 17:40:00
4,2016-03-31 17:25:00,3600,6,2008,1,69,101,90000,7,2,31,0,2016-03-31,0,60437,2016-04-06 10:17:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340886,2016-03-27 20:36:00,1150,0,2000,1,101,250,150000,3,6,24,0,2016-03-27,0,26624,2016-03-29 10:17:00
340887,2016-03-14 17:48:00,2200,3,2005,0,101,153,20000,1,6,33,1,2016-03-14,0,39576,2016-04-06 00:46:00
340888,2016-03-05 19:56:00,1199,1,2000,0,101,106,125000,3,6,32,0,2016-03-05,0,26135,2016-03-11 18:17:00
340889,2016-03-19 18:57:00,9200,0,1996,1,102,225,150000,3,2,38,0,2016-03-19,0,87439,2016-04-07 07:15:00


# Brief summary. Point 1.

DATA PREPROCESSING.

The first step in this project was renaming the columns to lowercase names.
Using the describe method and the count of NaN values, I tried to understand where abnormal values were present.
After that I checked the column types and looked at the correlations among columns.
From the data types I saw that the date columns were stored as objects, so I converted all of them to datetime.

To handle the vehicle_type and model columns without any further information, I created a new value 'missing' for all the NaNs.

The registration_year column contained 171 abnormal values (years before 1882 or after 2021). I dropped those records since there were very few of them.
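The same range filter can be written as a single boolean mask. The sketch below runs on a small made-up frame (the `toy` data is an assumption for illustration, not the real dataset) and keeps the bounds used in the notebook, 1882 to 2021 inclusive.

```python
import pandas as pd

# Toy frame standing in for `cars`; the values are invented for illustration.
toy = pd.DataFrame({
    "registration_year": [1993, 1000, 2011, 9999, 2004, 1881, 2021],
    "price": [480, 190, 18300, 0, 9800, 700, 5000],
})

# Keep years in [1882, 2021]; Series.between is inclusive on both ends by default.
valid_year = toy["registration_year"].between(1882, 2021)
n_dropped = int((~valid_year).sum())

filtered = toy.loc[valid_year].reset_index(drop=True)
print(n_dropped)
print(filtered["registration_year"].tolist())
```

Filtering with a mask avoids the round trip through NaN and the cast back to an integer type.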

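A couple of columns still had NaN values after this. One of them was fuel_type. To fill its NaNs I used the 'max' value of fuel_type within each group of cars sharing the same power, since I understood that the two columns were related. Because fuel_type holds strings, this 'max' is the alphabetically last label in the group, not the most frequent one. Rows whose power group contained no known fuel_type were dropped.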
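To make the effect of the string 'max' concrete, here is a small sketch on made-up data (the `toy` frame and the `most_common` helper are assumptions for illustration). It reproduces the placeholder approach used in the notebook and, next to it, one possible alternative that fills with the most frequent label per power group.

```python
import numpy as np
import pandas as pd

# Toy data: two power groups, each with one missing fuel_type.
toy = pd.DataFrame({
    "power": [75, 75, 75, 75, 190, 190, 190, 190],
    "fuel_type": ["petrol", "gasoline", "gasoline", np.nan,
                  "gasoline", "gasoline", "lpg", np.nan],
})

# Notebook approach: placeholder "X", then the per-group string maximum.
# Uppercase "X" sorts before any lowercase label, so a known label wins.
filled = toy["fuel_type"].fillna("X")
group_max = filled.groupby(toy["power"]).transform("max")
via_max = filled.where(filled != "X", group_max)

# Alternative (an assumption, not the notebook's method): most frequent label.
def most_common(labels: pd.Series) -> str:
    return labels.dropna().mode().iloc[0]

group_mode = toy.groupby("power")["fuel_type"].transform(most_common)
via_mode = toy["fuel_type"].fillna(group_mode)

print(via_max.tolist())
print(via_mode.tolist())
```

In this toy case the two strategies disagree for both missing rows, which shows why the choice matters when one label dominates a group.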

I then had to eliminate abnormal values in the power column. I replaced all the zeros with the mean power of cars with the same fuel type, and values > 500 or < 50 with the mean power of the same brand.
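One thing to keep in mind with this approach is that the group mean is computed while the implausible values are still in the column. The sketch below is a possible variation, not the code used in this notebook: it masks out-of-range values first and fills them with the per-brand median of the remaining values. The `toy` frame and the `power_fixed` column name are made up for illustration.

```python
import pandas as pd

# Toy frame; 20000 and 0 stand in for implausible power readings.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "skoda", "skoda", "skoda"],
    "power": [190, 150, 20000, 69, 0, 75],
})

# Treat anything outside [50, 500] as missing before computing group statistics,
# so the outliers do not leak into the replacement value.
in_range = toy["power"].between(50, 500)
power_clean = toy["power"].where(in_range)  # out-of-range values become NaN

# The median per brand ignores the NaNs created above.
brand_median = power_clean.groupby(toy["brand"]).transform("median")
toy["power_fixed"] = power_clean.fillna(brand_median).round().astype("int16")

print(toy["power_fixed"].tolist())
```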

As for the gearbox column, I chose to replace its NaNs (about 19,700 of roughly 354,000 records) with 'auto', obtaining about 76% manual cars vs 24% automatic. This looked to me like a reasonable distribution for the column, so I decided to move on.

In the price column the maximum value was 20000, which looked fine; the problem was with the low values. The column contained many zeros and many values lower than a hundred. I chose to drop all rows with a price < 100.

There were still NaNs in the not_repaired column. I replaced them with the string 'not_mentioned', obtaining 71% 'no', 19% 'not_mentioned' and 10% 'yes'.

After these changes the dataset no longer contained abnormal values or NaN values. The resulting dataset was composed of 340890 rows and 16 columns.

ENCODING CATEGORICAL FEATURES.

Using LabelEncoder I encoded all the categorical features before building the model. The resulting dataset was essentially the same as before, with the categorical labels replaced by integer codes.
The columns encoded were: 'vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand' and 'not_repaired'.
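Reusing one LabelEncoder for every column means only the last column's mapping stays available. As a sketch of an alternative (not what the notebook does), scikit-learn's OrdinalEncoder encodes several columns in one call and keeps the label order of each column in `categories_`, so codes can be decoded later. The `toy` frame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two of the categorical columns; values are illustrative.
toy = pd.DataFrame({
    "gearbox": ["manual", "auto", "manual"],
    "fuel_type": ["petrol", "gasoline", "petrol"],
})
cat_cols = ["gearbox", "fuel_type"]

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[cat_cols]).astype("int64")
encoded = pd.DataFrame(codes, columns=cat_cols, index=toy.index)

# categories_ holds one sorted array of labels per encoded column.
mapping = {col: list(cats) for col, cats in zip(cat_cols, encoder.categories_)}

# The fitted encoder can translate the codes back to the original labels.
decoded = encoder.inverse_transform(encoded[cat_cols])

print(encoded)
print(mapping)
```

Labels are sorted alphabetically per column, so 'auto' maps to 0 and 'manual' to 1, the same codes that appear in the encoded table above.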

## Model training

In [35]:
#Assigning features and target for the model construction.
features = cars_ordinal.drop(['price', 'date_crawled', 'registration_month', 'date_created', 'number_of_pictures', 'postal_code', 'last_seen'], axis=1)
target = cars_ordinal['price']

In [36]:
#Train test split.
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=0)

In [37]:
#Creating pipelines.
pipe_dtr = Pipeline([('scaler0', StandardScaler()),
                    ('DecisionTreeRegressor', DecisionTreeRegressor())])

pipe_rfr = Pipeline([('scaler1', StandardScaler()),
                    ('RandomForestRegressor', RandomForestRegressor(n_estimators=100))])

pipe_linear = Pipeline([('scaler2', StandardScaler()),
                       ('LinearRegression(Dummy)', LinearRegression())])

pipe_cat_boost_r = Pipeline([('scaler3', StandardScaler()),
                       ('CatBoostRegressor', CatBoostRegressor(verbose=500))])

pipe_lgbm_r =  Pipeline([('scaler4', StandardScaler()),
                       ('LGBMRegressor', LGBMRegressor())])

pipe_xgb_r = Pipeline([('scaler5', StandardScaler()),
                       ('XGBRegressor', XGBRegressor())])

In [38]:
#Creating list of pipelines.
pipelines = [pipe_dtr, pipe_rfr, pipe_linear, pipe_cat_boost_r, pipe_lgbm_r, pipe_xgb_r]
#Creating a dictionary of pipelines.
pipe_dict = {pipe_dtr: 'DecisionTreeRegressor', pipe_rfr:'RandomForestRegressor', pipe_linear:'LinearRegression',\
             pipe_cat_boost_r: 'CatBoostRegressor', pipe_lgbm_r: 'LGBMRegressor', pipe_xgb_r:'XGBRegressor'}

In [39]:
#Defining a function to calculate RMSE.
def rmse(target,predictions): 
    score = mean_squared_error(target, predictions)
    score = score **0.5
    return score

In [40]:
#Looping through pipelines to obtain cross validation scores.
#The scorer gets its own name so the rmse function is not overwritten when the cell is re-run.
rmse_scorer = make_scorer(rmse, greater_is_better=False)
for pipe in pipelines:
    print(pipe_dict[pipe])
    print(cross_val_score(pipe, features_train, target_train, scoring = rmse_scorer, cv=5))

DecisionTreeRegressor
[-2047.16645238 -2030.45243966 -2093.8009451  -2077.40993621
 -2075.63970058]
RandomForestRegressor
[-1659.33318103 -1656.59993514 -1690.89301427 -1654.32980263
 -1690.83458884]
LinearRegression
[-3037.88505867 -3018.26670968 -3034.23573896 -2993.96421837
 -3030.51648119]
CatBoostRegressor
0:	learn: 4429.3817836	total: 154ms	remaining: 2m 33s
500:	learn: 1813.9782806	total: 1m 18s	remaining: 1m 17s
999:	learn: 1729.4937973	total: 2m 36s	remaining: 0us
0:	learn: 4443.5276462	total: 104ms	remaining: 1m 43s
500:	learn: 1812.9736401	total: 1m 18s	remaining: 1m 17s
999:	learn: 1726.9779505	total: 2m 35s	remaining: 0us
0:	learn: 4421.9808304	total: 126ms	remaining: 2m 5s
500:	learn: 1806.5030695	total: 1m 17s	remaining: 1m 17s
999:	learn: 1722.7759295	total: 2m 35s	remaining: 0us
0:	learn: 4441.3232149	total: 62.1ms	remaining: 1m 2s
500:	learn: 1814.5277209	total: 1m 18s	remaining: 1m 18s
999:	learn: 1730.2637280	total: 2m 36s	remaining: 0us
0:	learn: 4428.2227153	total

[-1975.09693557 -1980.54999366 -2002.0861426  -1962.44412197
 -2006.32312692]


In [None]:
#Creating a parameters dictionary for RandomForestRegressor possible hyperparameters values.
parameters = {'n_estimators': (110,150,200,230,250,40,50),
              'max_depth': (10,16,19),
              } 

#Creating a grid model.
RF_grid = GridSearchCV(RandomForestRegressor(random_state=0, criterion='mse'), param_grid=parameters, cv=5)
RF_grid_model = RF_grid.fit(features_train, target_train)
print(RF_grid_model.best_estimator_)

In [None]:
RF_grid_model.best_score_

In [None]:
#Creating a parameters dictionary for possible CatBoostRegressor hyperparameters values.
parameters = {'learning_rate': (0.15,0.5,0.35,0.8),
              'depth': (10,15,16),
              } 

#Names of the categorical columns passed to CatBoost (assumed to be the six label-encoded columns).
cat_features = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']

#Creating a grid model.
CB_grid = GridSearchCV(CatBoostRegressor(random_seed=0), param_grid=parameters, cv=5)
CB_grid_model = CB_grid.fit(features_train, target_train, cat_features = cat_features)
print(CB_grid_model.best_estimator_)

In [43]:
#Creating a parameters dictionary for LGBMRegressor possible hyperparameters values.
parameters = {'learning_rate': (0.15,0.5,0.4,0.8),
              'n_estimators': (300,350,450,150),
              } 

#Creating a grid model.
LGBM_grid = GridSearchCV(LGBMRegressor(), param_grid=parameters, cv=5) 
LGBM_grid_model = LGBM_grid.fit(features_train, target_train)
print(LGBM_grid_model.best_estimator_)

LGBMRegressor(learning_rate=0.4, n_estimators=450)


In [46]:
#Creating a parameters dictionary for possible XGBRegressor hyperparameters values.
warnings.simplefilter(action='ignore', category=FutureWarning)

parameters = {'learning_rate': (0.15,0.5,0.4,0.65,0.85),
              'n_estimators': (300,350,450,150),
              } 

#Creating a grid model.
XGB_grid = GridSearchCV(XGBRegressor(), param_grid=parameters, cv=5) 
XGB_grid_model = XGB_grid.fit(features_train, target_train)
print(XGB_grid_model.best_estimator_)

XGBRegressor(learning_rate=0.65, n_estimators=450)


# Brief summary. Point 2.

CROSS VALIDATION ON TRAINING SET.
This section covers the steps taken to train the models. Here I used only the data in the training set. I chose the features and the target and took them from the label-encoded dataset (cars_ordinal). It is important to note that I removed several columns from the features: 'date_crawled', 'registration_month', 'date_created', 'number_of_pictures', 'postal_code', 'last_seen', and 'price', which is the target. I removed those columns because I judged them not helpful in defining a price for the vehicles; they were just additional information.

I proceeded by splitting the data into training and testing sets, and I created pipelines with default hyperparameters to cross-validate the models on the training set. The results of the cross validation were: 

- RandomForestRegressor with RMSE values between 1654.55 and 1690.71
- CatBoostRegressor with RMSE values between 1733.24 and 1770.13
- LGBMRegressor with RMSE values between 1746.15 and 1772.25
- XGBRegressor with RMSE values between 1962.44 and 2006.32
- DecisionTreeRegressor with RMSE values between 2037.78 and 2092.28
- LinearRegression (Dummy) with RMSE values between 2993.96 and 3037.88
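The per-fold values listed above are RMSE scores. A minimal sketch of how such scores can be obtained without a hand-written scorer is shown below, using scikit-learn's built-in "neg_root_mean_squared_error" scoring string on synthetic data (the data and the choice of LinearRegression are assumptions for illustration only). The scores come back negated because scikit-learn always maximizes, so the sign is flipped before reading them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to make the example self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Built-in scorer: returns the negative RMSE for each of the 5 folds.
neg_rmse = cross_val_score(LinearRegression(), X, y,
                           scoring="neg_root_mean_squared_error", cv=5)

# Flip the sign to get ordinary (positive) RMSE values.
rmse_per_fold = -neg_rmse
print(rmse_per_fold.round(3))
```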

TUNING HYPERPARAMETERS.

Since LinearRegression does not lend itself to hyperparameter tuning and DecisionTree had a higher (worse) cross-validation RMSE than RFR and the gradient boosting models, I decided not to tune those models and went on with the tuning of my RandomForestRegressor.
  
The hyperparameters for the RFR_quality model were: 
- RandomForestRegressor(n_estimators=230, max_depth=19, random_state=0)

I then trained CatBoostRegressor, where I chose to stay with these hyperparameters: 
- CatBoostRegressor(depth=16, random_seed=0, learning_rate=0.35, iterations=1000, verbose=100)

I then went on to tune the hyperparameters for LGBMRegressor. The grid search returned:
- LGBMRegressor(learning_rate=0.4, n_estimators=450)
  
XGBRegressor was tuned the same way. The grid search returned:
- XGBRegressor(learning_rate=0.65, n_estimators=450)
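Note that GridSearchCV ranks candidates by the estimator's default score (R² for regressors) unless a `scoring` argument is given. The sketch below shows one way to make the search optimize RMSE directly. It uses synthetic data and a deliberately tiny grid so it runs in seconds; the data and the grid values are assumptions, not the ones used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic non-linear problem, only to keep the example runnable.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(150, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=150)

# Tiny illustrative grid; a real search would use wider ranges.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 6]}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid=param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=3)
search.fit(X, y)

# best_score_ is the negative RMSE of the best candidate, so flip the sign.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 3))
```

With this setting `best_score_` is directly comparable to the cross-validation RMSE values reported earlier.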


## Model analysis

In [34]:
from time import time
run_time = {}
training_time = {}
predictions_time = {}

In [35]:
#Creating pipelines with the tuned hyperparameters set.

pipe_linear = Pipeline([('scaler0', StandardScaler()),
                       ('LinearRegression', LinearRegression())])

pipe_dtr = Pipeline([('scaler1', StandardScaler()),
                    ('DecisionTreeRegressor', DecisionTreeRegressor())])

pipe_rfr_fast = Pipeline([('scaler2', StandardScaler()),
                    ('RandomForestRegressor_fast', RandomForestRegressor(random_state=0))])

pipe_rfr_quality = Pipeline([('scaler3', StandardScaler()),
                    ('RandomForestRegressor_quality', RandomForestRegressor(max_depth=19, n_estimators=250, random_state=0))])

pipe_cat_boost_r = Pipeline([('scaler4', StandardScaler()),
                       ('CatBoostRegressor', CatBoostRegressor(depth=16, random_seed=0, learning_rate=0.35, verbose=100))])

pipe_lgbm_r =  Pipeline([('scaler5', StandardScaler()),
                       ('LGBMRegressor', LGBMRegressor(learning_rate=0.4, n_estimators=450))])

pipe_xgb_r = Pipeline([('scaler6', StandardScaler()),
                       ('XGBRegressorFast', XGBRegressor())])

pipe_xgb_quality = Pipeline([('scaler7', StandardScaler()),
                       ('XGBRegressorQuality', XGBRegressor(learning_rate = 0.65, n_estimators=450))])

In [36]:
#Saving LinearRegression results on testing dataset.
start = time()
pipe_linear.fit(features_train, target_train)
training_time['LinearRegression'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_linear.predict(features_test)
predictions_time['LinearRegression'] = np.round(time()-start_predictions, 3)

rmse_linear = mean_squared_error(target_test, predictions)**0.5

run_time['LinearRegression'] = np.round(time()-start, 3)

print(f"Rmse: {rmse_linear}\nTraining Time: {training_time['LinearRegression']} s"
      f"\nPredictions Time: {predictions_time['LinearRegression']} s\nRun Time: {run_time['LinearRegression']} s")

Rmse: 3029.9790047276947
Training Time: 0.128 s
Predictions Time: 0.03 s
Run Time: 0.16 s
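The fit, predict and timing steps repeat for every model in this section. A small helper along the following lines could hold that pattern in one place. This is only a sketch: `fit_and_time` is a hypothetical name, and the synthetic data and LinearRegression model are there just to make the example runnable. It stops the prediction clock before computing the RMSE, so scoring time is not counted as prediction time.

```python
from time import perf_counter

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def fit_and_time(model, X_train, y_train, X_test, y_test):
    """Fit, predict, and return the test RMSE plus wall-clock timings in seconds."""
    start = perf_counter()
    model.fit(X_train, y_train)
    train_s = perf_counter() - start

    start = perf_counter()
    predictions = model.predict(X_test)
    predict_s = perf_counter() - start  # clock stops before scoring

    rmse = mean_squared_error(y_test, predictions) ** 0.5
    return {"rmse": rmse, "train_s": train_s, "predict_s": predict_s}


# Synthetic data standing in for the train/test split.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=300)

result = fit_and_time(LinearRegression(), X[:200], y[:200], X[200:], y[200:])
print(result)
```

The returned dictionaries can be collected per model name and turned into a single comparison table at the end.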


In [37]:
#Saving DecisionTreeRegressor results on testing dataset.
start = time()
pipe_dtr.fit(features_train, target_train)
training_time['DecisionTreeRegressor'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_dtr.predict(features_test)
rmse_dtr = mean_squared_error(target_test, predictions)**0.5
predictions_time['DecisionTreeRegressor'] = np.round(time()-start_predictions, 3)

run_time['DecisionTreeRegressor'] = np.round(time()-start, 3)

print(f"Rmse: {rmse_dtr}\nTraining Time: {training_time['DecisionTreeRegressor']} s"
      f"\nPredictions Time: {predictions_time['DecisionTreeRegressor']} s\nRun Time: {run_time['DecisionTreeRegressor']} s")

Rmse: 2025.7881383285292
Training Time: 1.228 s
Predictions Time: 0.058 s
Run Time: 1.286 s


In [38]:
#Saving RandomForestRegressor_fast results on testing dataset.
start = time()
pipe_rfr_fast.fit(features_train, target_train)
training_time['RandomForestRegressor_fast'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_rfr_fast.predict(features_test)
rmse_rfr = mean_squared_error(target_test, predictions)**0.5
predictions_time['RandomForestRegressor_fast'] = np.round(time()-start_predictions, 3)

run_time['RandomForestRegressor_fast'] = np.round(time()-start, 3)

print(f"Rmse: {rmse_rfr}\nTraining Time: {training_time['RandomForestRegressor_fast']} s"
      f"\nPredictions Time: {predictions_time['RandomForestRegressor_fast']} s\nRun Time: {run_time['RandomForestRegressor_fast']} s")

Rmse: 1667.9125712348043
Training Time: 78.948 s
Predictions Time: 5.19 s
Run Time: 84.138 s


In [39]:
#Saving RandomForestRegressor_quality results on testing dataset.
start = time()
pipe_rfr_quality.fit(features_train, target_train)
training_time['RandomForestRegressor_quality'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_rfr_quality.predict(features_test)
rmse_rfr = mean_squared_error(target_test, predictions)**0.5
predictions_time['RandomForestRegressor_quality'] = np.round(time()-start_predictions, 3)

run_time['RandomForestRegressor_quality'] = np.round(time()-start, 3)

print(f"Rmse: {rmse_rfr}\nTraining Time: {training_time['RandomForestRegressor_quality']} s"
      f"\nPredictions Time: {predictions_time['RandomForestRegressor_quality']} s\nRun Time: {run_time['RandomForestRegressor_quality']} s")

Rmse: 1653.884919717754
Training Time: 175.973 s
Predictions Time: 7.706 s
Run Time: 183.679 s


In [40]:
#Saving LGBMRegressor results on testing dataset.
start = time()
pipe_lgbm_r.fit(features_train, target_train)
training_time['LGBMRegressor'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_lgbm_r.predict(features_test)
rmse_lgbm = mean_squared_error(target_test, predictions)**0.5
predictions_time['LGBMRegressor'] = np.round(time()-start_predictions, 3)

run_time['LGBMRegressor'] = np.round(time()-start, 3)

print(f"Rmse: {rmse_lgbm}\nTraining Time: {training_time['LGBMRegressor']} s"
      f"\nPredictions Time: {predictions_time['LGBMRegressor']} s\nRun Time: {run_time['LGBMRegressor']} s")

Rmse: 1620.3474149026972
Training Time: 77.162 s
Predictions Time: 3.209 s
Run Time: 80.371 s


In [41]:
#Saving XGBRegressorFast results on testing dataset.
warnings.simplefilter(action='ignore', category=FutureWarning)

start = time()
pipe_xgb_r.fit(features_train, target_train)
training_time['XGBRegressor'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_xgb_r.predict(features_test)
rmse_xgb = mean_squared_error(target_test, predictions)**0.5
predictions_time['XGBRegressor'] = np.round(time()-start_predictions, 3)

run_time['XGBRegressor'] = np.round(time()-start, 3)

print('Rmse:', rmse_xgb, '\nTraining Time:', training_time['XGBRegressor'], 's',
      '\nPredictions Time:', predictions_time['XGBRegressor'], 's', '\nRun Time:', run_time['XGBRegressor'], 's')

Rmse: 1997.3352843679986 
Training Time: 15.929 s 
Predictions Time: 0.335 s 
Run Time: 16.264 s


In [42]:
#Saving XGBRegressorQuality results on testing dataset.
warnings.simplefilter(action='ignore', category=FutureWarning)

start = time()
pipe_xgb_quality.fit(features_train, target_train)
training_time['XGBRegressor_quality'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_xgb_quality.predict(features_test)
rmse_xgb = mean_squared_error(target_test, predictions)**0.5
predictions_time['XGBRegressor_quality'] = np.round(time()-start_predictions, 3)

run_time['XGBRegressor_quality'] = np.round(time()-start, 3)

print('Rmse:', rmse_xgb, '\nTraining Time:', training_time['XGBRegressor_quality'], 's',
      '\nPredictions Time:', predictions_time['XGBRegressor_quality'], 's', '\nRun Time:', run_time['XGBRegressor_quality'], 's')

Rmse: 1714.837137747176 
Training Time: 71.098 s 
Predictions Time: 1.836 s 
Run Time: 72.934 s
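The timing cells above repeat the same fit / predict / score pattern for every pipeline. A minimal sketch of how that pattern could be wrapped in one helper is shown below. The names `evaluate_model`, `results` and `MeanModel` are hypothetical and not part of the notebook; `MeanModel` is only a stand-in with the same `fit`/`predict` interface as the pipelines. Unlike the cells above, this sketch stops the prediction timer before computing RMSE, so scoring is not counted as prediction time.

```python
import math
from time import perf_counter


def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    """Fit, predict and score `model`, storing timings and RMSE under `name`."""
    start = perf_counter()
    model.fit(X_train, y_train)
    train_s = perf_counter() - start

    start_pred = perf_counter()
    predictions = model.predict(X_test)
    predict_s = perf_counter() - start_pred  # timer stops before scoring

    # RMSE is the square root of the mean squared error.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_test, predictions)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

    results[name] = {
        'rmse': rmse,
        'train_s': round(train_s, 3),
        'predict_s': round(predict_s, 3),
        'run_s': round(train_s + predict_s, 3),
    }
    return results[name]


class MeanModel:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


results = {}
evaluate_model('MeanModel', MeanModel(),
               [[1], [2], [3], [4]], [10.0, 20.0, 30.0, 40.0],
               [[5], [6]], [20.0, 30.0], results)
print(results['MeanModel'])
```

With a helper of this shape, each cell would reduce to a single call such as `evaluate_model('LGBMRegressor', pipe_lgbm_r, features_train, target_train, features_test, target_test, results)`, and all metrics would end up in one dictionary keyed by model name.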


In [None]:
%%time
#Saving CatBoostRegressor results on testing dataset.
start = time()
pipe_cat_boost_r.fit(features_train, target_train)
training_time['CatBoostRegressor'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_cat_boost_r.predict(features_test)
rmse_cat_boost = mean_squared_error(target_test, predictions)**0.5
predictions_time['CatBoostRegressor'] = np.round(time()-start_predictions, 3)

run_time['CatBoostRegressor'] = np.round(time()-start, 3)

print('Rmse:', rmse_cat_boost, '\nTraining Time:', training_time['CatBoostRegressor'], 's',
      '\nPredictions Time:', predictions_time['CatBoostRegressor'], 's', '\nRun Time:', run_time['CatBoostRegressor'], 's')

0:	learn: 3392.6868424	total: 3.09s	remaining: 51m 25s
100:	learn: 1281.2579742	total: 3m 19s	remaining: 29m 33s
200:	learn: 1165.7591789	total: 6m 39s	remaining: 26m 29s
300:	learn: 1097.1590422	total: 9m 57s	remaining: 23m 7s
400:	learn: 1052.6240329	total: 13m 9s	remaining: 19m 39s
500:	learn: 1020.4344638	total: 16m 25s	remaining: 16m 21s
600:	learn: 994.2617019	total: 19m 40s	remaining: 13m 3s
700:	learn: 972.7583702	total: 22m 53s	remaining: 9m 45s
800:	learn: 956.2916035	total: 26m 8s	remaining: 6m 29s
900:	learn: 942.0644573	total: 29m 22s	remaining: 3m 13s
999:	learn: 929.7737219	total: 32m 37s	remaining: 0us


In [None]:
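#rmse_rfr and rmse_xgb were overwritten by the *_quality runs, so they hold the quality-model scores here.
rmse_dict = {'LinearRegression': rmse_linear, 'DecisionTreeRegressor': rmse_dtr, 'RandomForestRegressor_quality': rmse_rfr, 'LGBMRegressor': rmse_lgbm, 'XGBRegressor_quality': rmse_xgb, 'CatBoostRegressor': rmse_cat_boost}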

In [None]:
print(training_time)
print()
print(predictions_time)
print()
print(run_time)
print()
print(rmse_dict)

# Brief summary. Point 3

MODEL ANALYSIS ON TESTING SET. 

I rewrote the pipelines for the models and gradient boosting techniques, this time with the selected hyperparameters, and created dictionaries to store the training time, prediction time, run time and RMSE measured on the test set.

| Model | RMSE | Training time (s) | Prediction time (s) | Run time (s) |
|---|---|---|---|---|
| LinearRegression | 3029.98 | 0.128 | 0.03 | 0.16 |
| DecisionTreeRegressor | 2025.79 | 1.228 | 0.058 | 1.286 |
| RandomForestRegressor_fast | 1667.91 | 78.948 | 5.19 | 84.138 |
| RandomForestRegressor_quality | 1653.88 | 175.973 | 7.706 | 183.679 |
| LGBMRegressor | 1620.35 | 77.162 | 3.209 | 80.371 |
| XGBRegressor_fast | 1997.34 | 15.929 | 0.335 | 16.264 |
| XGBRegressor_quality | 1714.84 | 71.098 | 1.836 | 72.934 |
| CatBoostRegressor | not available | not available | not available | not available |

CatBoostRegressor: I can't retrieve the results here. Could I have some help figuring this out?

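The best quality (lowest RMSE) was obtained by LGBMRegressor (1620.3), followed by RandomForestRegressor_quality (1653.9), RandomForestRegressor_fast (1667.9) and XGBRegressor_quality (1714.8). CatBoostRegressor may also belong in this group, but that still has to be confirmed because its test results are missing.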

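The fastest models to train were LinearRegression (0.128 s) and DecisionTreeRegressor (1.228 s). Among the stronger models, LGBMRegressor offered the best combination of RMSE and run time: it had the lowest RMSE and a run time of 80.4 s, more than twice as fast as RandomForestRegressor_quality (183.7 s).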

For every model, the total run time was dominated by the training time; prediction took only a small fraction of it.

After all, every model achieved a clearly lower RMSE than our baseline model (LinearRegression, RMSE of about 3030).
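The ranking and the share of run time spent on training can be checked directly from the recorded numbers. The following is a minimal sketch: the values are copied from the test-set outputs in this notebook (RMSE rounded to three decimals), CatBoostRegressor is left out because its test results are not available, and the `results` layout, `ranking` and `train_share` are illustrative names rather than variables from the notebook.

```python
# Test-set results copied from the cell outputs:
# (RMSE, training time in s, prediction time in s, run time in s)
results = {
    'LinearRegression': (3029.979, 0.128, 0.03, 0.16),
    'DecisionTreeRegressor': (2025.788, 1.228, 0.058, 1.286),
    'RandomForestRegressor_fast': (1667.913, 78.948, 5.19, 84.138),
    'RandomForestRegressor_quality': (1653.885, 175.973, 7.706, 183.679),
    'LGBMRegressor': (1620.347, 77.162, 3.209, 80.371),
    'XGBRegressor_fast': (1997.335, 15.929, 0.335, 16.264),
    'XGBRegressor_quality': (1714.837, 71.098, 1.836, 72.934),
}

# Model names ordered from lowest (best) to highest RMSE.
ranking = sorted(results, key=lambda name: results[name][0])

# Fraction of the total run time that was spent on training.
train_share = {name: r[1] / r[3] for name, r in results.items()}

print(f"{'model':<32}{'RMSE':>10}{'run (s)':>10}{'train share':>13}")
for name in ranking:
    rmse, train_s, predict_s, run_s = results[name]
    print(f"{name:<32}{rmse:>10.1f}{run_s:>10.3f}{train_share[name]:>13.0%}")
```

Sorting by RMSE puts LGBMRegressor first and LinearRegression last, and the training share stays at roughly 80% or more of the run time for every model listed.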