Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

---
Features

- DateCrawled — date profile was downloaded from the database
- VehicleType — vehicle body type
- RegistrationYear — vehicle registration year
- Gearbox — gearbox type
- Power — power (hp)
- Model — vehicle model
- Mileage — mileage (measured in km due to dataset's regional specifics)
- RegistrationMonth — vehicle registration month
- FuelType — fuel type
- Brand — vehicle brand
- NotRepaired — vehicle repaired or not
- DateCreated — date of profile creation
- NumberOfPictures — number of vehicle pictures
- PostalCode — postal code of profile owner (user)
- LastSeen — date of the last activity of the user

Target
- Price — price (Euro)

Analysis done January 2022

## Data preparation

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
import time
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

import random
random_state=42
random.seed(random_state)
np.random.seed(random_state)

# import sys and insert code to ignore warnings 
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

__Helper functions__

In [2]:
# function for printing elapsed time as hh:mm:ss [reference](https://stackoverflow.com/questions/52738709/how-to-store-time-values-in-a-variable-in-jupyter)
# couldn't get it to work though...
def exec_time(start, end):
    # round once, up front, so the seconds field can never display as 60
    total_seconds = int(round(end - start))
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)
    print("time: " + "{0:02d}:{1:02d}:{2:02d}".format(h, m, s))

# function for describing the values more than 3 standard deviations from the column mean
def outlier_stats(data):
    data_mean, data_std = np.mean(data), np.std(data)
    cut_off = data_std * 3
    lower, upper = data_mean - cut_off, data_mean + cut_off
    outliers = [x for x in data if x < lower or x > upper]
    return pd.Series(outliers).describe()

# function for displaying the dtype, count, and percent of missing values for every column
def missing_stats(df):
    print("\nColumns Missing Values")
    df_missing = pd.concat([df.dtypes, df.isnull().sum(), df.isnull().sum() / (len(df)*.01)], axis=1)
    df_missing.columns = ['type', 'cnt', 'pct']
    print(df_missing)

In [3]:
# load data
df = pd.read_csv('/datasets/car_data.csv')

In [4]:
# inspect data
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [5]:
original_no_rows = df.shape[0]
df.shape

(354369, 16)

In [6]:
df.duplicated().sum()

262

In [7]:
missing_stats(df)


Columns Missing Values
                     type    cnt        pct
DateCrawled        object      0   0.000000
Price               int64      0   0.000000
VehicleType        object  37490  10.579368
RegistrationYear    int64      0   0.000000
Gearbox            object  19833   5.596709
Power               int64      0   0.000000
Model              object  19705   5.560588
Mileage             int64      0   0.000000
RegistrationMonth   int64      0   0.000000
FuelType           object  32895   9.282697
Brand              object      0   0.000000
NotRepaired        object  71154  20.079070
DateCreated        object      0   0.000000
NumberOfPictures    int64      0   0.000000
PostalCode          int64      0   0.000000
LastSeen           object      0   0.000000


__Rename columns to lowercase for consistency__

In [8]:
df = df.rename(columns={'DateCrawled': 'date_crawled', 'Price': 'price', 'VehicleType': 'vehicle_type', 
                        'RegistrationYear': 'registration_year', 'Gearbox': 'gearbox', 'Power': 'power', 
                        'Model': 'model','Mileage': 'milage', 'RegistrationMonth': 'registration_month', 
                        'FuelType': 'fuel_type', 'Brand': 'brand','NotRepaired': 'not_repaired', 
                        'DateCreated': 'date_created', 'NumberOfPictures': 'num_pictures', 
                        'PostalCode': 'postal_code', 'LastSeen': 'last_seen', })

df.columns

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'milage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

__Initial findings__

- 354369 rows by 16 columns/features, all listed as int64 or object
- 262 duplicate rows --> delete
- 5 features have missing rows --> investigate
- Not all features will be useful in analysis --> investigate
- Some dates are listed as objects --> change datatype or delete feature

In [9]:
# drop duplicate rows
df = df.drop_duplicates() 
df.shape

(354107, 16)

In [10]:
# investigate columns
df.describe(include=['object', 'int64'])

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired,date_created,num_pictures,postal_code,last_seen
count,354107,354107.0,316623,354107.0,334277,354107.0,334406,354107.0,354107.0,321218,354107,282962,354107,354107.0,354107.0,354107
unique,15470,,8,,2,,250,,,7,40,2,109,,,18592
top,05/03/2016 14:25,,sedan,,manual,,golf,,,petrol,volkswagen,no,03/04/2016 00:00,,,07/04/2016 07:16
freq,66,,91399,,268034,,29215,,,216161,76960,246927,13705,,,653
mean,,4416.433287,,2004.235355,,110.089651,,128211.811684,5.714182,,,,,0.0,50507.14503,
std,,4514.338584,,90.261168,,189.914972,,37906.590101,3.726682,,,,,0.0,25784.212094,
min,,0.0,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,1050.0,,1999.0,,69.0,,125000.0,3.0,,,,,0.0,30165.0,
50%,,2700.0,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49406.0,
75%,,6400.0,,2008.0,,143.0,,150000.0,9.0,,,,,0.0,71083.0,


__Initial observations__ 
- date_crawled, date_created and last_seen indicate these listings are from 2016
- The latest year possible in registration_year is 2016 --> delete rows > 2016
- date_crawled, date_created and last_seen are not useful for analysis --> delete columns
- num_pictures only has 0.0, not useful for analysis --> delete column
- postal_code has values fewer than 5 digits long --> investigate and delete rows or postal_code column

In [11]:
df.query('registration_year > 2016')

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired,date_created,num_pictures,postal_code,last_seen
22,23/03/2016 14:52,2900,,2018,manual,90,meriva,150000,5,petrol,opel,no,23/03/2016 00:00,0,49716,31/03/2016 01:16
26,10/03/2016 19:38,5555,,2017,manual,125,c4,125000,4,,citroen,no,10/03/2016 00:00,0,31139,16/03/2016 09:16
48,25/03/2016 14:40,7750,,2017,manual,80,golf,100000,1,petrol,volkswagen,,25/03/2016 00:00,0,48499,31/03/2016 21:47
51,07/03/2016 18:57,2000,,2017,manual,90,punto,150000,11,gasoline,fiat,yes,07/03/2016 00:00,0,66115,07/03/2016 18:57
57,10/03/2016 20:53,2399,,2018,manual,64,other,125000,3,,seat,no,10/03/2016 00:00,0,33397,25/03/2016 10:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354112,11/03/2016 15:49,3600,,2017,manual,86,transit,150000,5,gasoline,ford,,11/03/2016 00:00,0,32339,12/03/2016 05:45
354140,29/03/2016 16:47,1000,,2017,manual,101,a4,150000,9,,audi,,29/03/2016 00:00,0,38315,06/04/2016 02:44
354203,17/03/2016 00:56,2140,,2018,manual,80,fiesta,150000,6,,ford,no,17/03/2016 00:00,0,44866,29/03/2016 15:45
354253,25/03/2016 09:37,1250,,2018,,0,corsa,150000,0,petrol,opel,,25/03/2016 00:00,0,45527,06/04/2016 07:46


In [12]:
14529/354369

0.04099963597267255

In [13]:
df.query('registration_year < 1959')

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired,date_created,num_pictures,postal_code,last_seen
15,11/03/2016 21:39,450,small,1910,,0,ka,5000,0,petrol,ford,,11/03/2016 00:00,0,24148,19/03/2016 08:46
622,16/03/2016 16:55,0,,1111,,0,,5000,0,,opel,,16/03/2016 00:00,0,44628,20/03/2016 16:44
1928,25/03/2016 15:58,7000,suv,1945,manual,48,other,150000,2,petrol,volkswagen,no,25/03/2016 00:00,0,58135,25/03/2016 15:58
2273,15/03/2016 21:44,1800,convertible,1925,,0,,5000,1,,sonstige_autos,no,15/03/2016 00:00,0,79288,07/04/2016 05:15
3333,15/03/2016 21:36,10500,sedan,1955,manual,30,other,60000,0,petrol,ford,,15/03/2016 00:00,0,53498,07/04/2016 08:16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
351299,09/03/2016 21:56,5500,bus,1956,manual,37,,60000,4,petrol,sonstige_autos,no,09/03/2016 00:00,0,1900,06/04/2016 02:17
351682,12/03/2016 00:57,11500,,1800,,16,other,5000,6,petrol,fiat,,11/03/2016 00:00,0,16515,05/04/2016 19:47
353531,16/03/2016 21:56,6000,sedan,1937,manual,38,other,5000,0,petrol,mercedes_benz,,16/03/2016 00:00,0,23936,30/03/2016 18:47
353961,17/03/2016 13:54,200,,1910,,0,,5000,0,petrol,sonstige_autos,,17/03/2016 00:00,0,42289,31/03/2016 22:46


In [14]:
390/354369

0.0011005477341415305

__Since it is unlikely that we have many legitimate cars registered before 1960, we will eliminate those rows along with the rows with a registration year > 2016, which is impossible since the data was collected in 2016.__

In [15]:
df = df.loc[df['registration_year'] <= 2016]
df = df.loc[df['registration_year'] >1959]
# both conditions must sit in one query string; `'a' and 'b'` in Python evaluates to just 'b'
df.query('registration_year > 2016 or registration_year < 1960')

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired,date_created,num_pictures,postal_code,last_seen
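
The same year window can be expressed as a single condition. A minimal sketch on made-up rows (the `toy` frame is hypothetical, not the project data), using `Series.between`, which is inclusive on both ends by default:

```python
import pandas as pd

# hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({'registration_year': [1910, 1960, 2003, 2016, 2018]})

# keep 1960..2016 inclusive in a single step
kept = toy[toy['registration_year'].between(1960, 2016)]

# a combined check has to live inside one query string
leftover = kept.query('registration_year > 2016 or registration_year < 1960')

print(kept['registration_year'].tolist())
print(len(leftover))
```

An empty `leftover` frame confirms that no out-of-range years remain.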


In [16]:
df_bad_code = df.query('postal_code <10000') 
print('Pct of rows with postal_codes < 10000', df_bad_code.shape[0]/df.shape[0])

Pct of rows with postal_codes < 10000 0.05119178195284766


In [17]:
df.corr()

Unnamed: 0,price,registration_year,power,milage,registration_month,num_pictures,postal_code
price,1.0,0.453728,0.161802,-0.33763,0.107351,,0.075679
registration_year,0.453728,1.0,0.050668,-0.225518,0.073391,,0.034969
power,0.161802,0.050668,1.0,0.02462,0.042747,,0.02135
milage,-0.33763,-0.225518,0.02462,1.0,0.008051,,-0.00822
registration_month,0.107351,0.073391,0.042747,0.008051,1.0,,0.013036
num_pictures,,,,,,,
postal_code,0.075679,0.034969,0.02135,-0.00822,0.013036,,1.0


__We note a little over 5% of the rows have a postal code with fewer than five digits, which looks inaccurate, so deleting those rows is one choice. However, the correlation between price and postal_code is low (about 0.08), so we will drop the postal_code column instead.__
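
One possible reason for the short codes, which the data description does not confirm, is that five-digit codes starting with 0 lose the leading zero when the column is parsed as an integer. A minimal sketch on an inline CSV with made-up rows, assuming that explanation; `dtype` is a standard `pd.read_csv` parameter:

```python
import io
import pandas as pd

# made-up rows for illustration only
csv_text = "PostalCode,Price\n01067,500\n70435,480\n"

# default parsing infers an integer column and drops the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))

# reading the column as text keeps every digit
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'PostalCode': str})

print(as_int['PostalCode'].tolist())
print(as_str['PostalCode'].tolist())
```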

In [18]:
print(df.shape)
df.drop(columns=['date_crawled', 'date_created', 'last_seen', 'num_pictures', 'postal_code'], inplace=True)
df.shape

(339156, 16)


(339156, 11)

In [19]:
missing_stats(df)


Columns Missing Values
                      type    cnt        pct
price                int64      0   0.000000
vehicle_type        object  22835   6.732890
registration_year    int64      0   0.000000
gearbox             object  17774   5.240656
power                int64      0   0.000000
model               object  17459   5.147779
milage               int64      0   0.000000
registration_month   int64      0   0.000000
fuel_type           object  27131   7.999564
brand               object      0   0.000000
not_repaired        object  64613  19.051115


__Observations with updated list__
- price has outliers --> investigate outliers
- vehicle_type has missing values --> fill in based on model
- registration_year --> handled earlier
- gearbox has missing values --> investigate and fill in
- power values vary widely --> [research online](https://www.autolist.com/guides/average-car-horsepower), investigate
- __model has missing values --> investigate (first, since it will be used for the fill-ins)__
- milage no missing values, given values look reasonable --> keep as is
- registration_month has some values of zero --> investigate
- fuel_type, not sure how there are 7 unique values --> investigate
- brand (make) has no missing values --> keep as is
- not_repaired indicates if vehicle repaired or not --> investigate

In [20]:
df['model'].value_counts(dropna=False)

golf                  27593
other                 23711
3er                   19206
NaN                   17459
polo                  12441
                      ...  
kalina                    6
serie_3                   4
rangerover                3
serie_1                   2
range_rover_evoque        2
Name: model, Length: 251, dtype: int64

In [21]:
df['model'].isnull().sum()/len(df)

0.051477786033565676

In [22]:
df.loc[df['model'].isnull()].head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
59,1,suv,1994,manual,286,,150000,11,,sonstige_autos,
81,350,small,1997,manual,54,,150000,3,,fiat,yes
115,0,small,1999,,0,,5000,0,petrol,volkswagen,
135,1450,sedan,1992,manual,136,,150000,0,,audi,no


__model has 250 unique values (251 counting NaN) and a little over 5% of its values are missing. It is unlikely we can make sensible replacements for the missing values, so we will remove the rows where model is missing.__

In [23]:
print(df.shape)
df.dropna(subset=['model'], inplace=True) 
df.shape               

(339156, 11)


(321697, 11)

In [24]:
df['model'].isnull().sum()

0

In [25]:
outlier_stats(df.price)

count     4467.000000
mean     19247.173942
std        572.356100
min      18247.000000
25%      18800.000000
50%      19200.000000
75%      19900.000000
max      20000.000000
dtype: float64
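
The three-standard-deviation rule behind `outlier_stats` can also be written with a boolean mask instead of list comprehensions. A sketch on synthetic numbers (the `three_sigma_mask` helper is hypothetical and not part of the notebook):

```python
import pandas as pd

def three_sigma_mask(series: pd.Series) -> pd.Series:
    """True where a value lies more than 3 population std devs from the mean."""
    mean, std = series.mean(), series.std(ddof=0)  # ddof=0 matches np.std's default
    return (series - mean).abs() > 3 * std

# synthetic data: fifty ordinary values and one extreme value
values = pd.Series([10] * 50 + [1000])
mask = three_sigma_mask(values)

print(values[mask].tolist())
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule only flags points that are far out even after that distortion.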

In [26]:
df.query('price == 0')

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired
7,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no
40,0,,1990,,0,corsa,150000,1,petrol,opel,
152,0,bus,2004,manual,101,meriva,150000,10,lpg,opel,yes
154,0,,2006,,0,other,5000,0,,fiat,
231,0,wagon,2001,manual,115,mondeo,150000,0,,ford,
...,...,...,...,...,...,...,...,...,...,...,...
354205,0,,2000,manual,65,corsa,150000,0,,opel,yes
354238,0,small,2002,manual,60,fiesta,150000,3,petrol,ford,
354248,0,small,1999,manual,53,swift,150000,3,petrol,suzuki,
354277,0,small,1999,manual,37,arosa,150000,7,petrol,seat,yes


__There are almost 8000 rows where the price is 0. This may be clerical error or it may reflect vehicles junked or given for free. Either way, these will not be useful for our analysis. We will keep the rows with higher outlier values as some cars may have gone for an impressive amount.__

In [27]:
df = df.loc[df['price'] != 0]
df.query('price == 0')

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired


In [28]:
df['vehicle_type'].value_counts(dropna=False)

sedan          86217
small          74809
wagon          61646
bus            27400
convertible    19001
NaN            16217
coupe          14743
suv            11041
other           2630
Name: vehicle_type, dtype: int64

In [29]:
# fill missing vehicle_type with the most common type for the same model
df['vehicle_type'] = df['vehicle_type'].fillna(
    df.groupby('model')['vehicle_type'].transform(lambda x: x.value_counts().index[0]))
df['vehicle_type'].value_counts(dropna=False)

sedan          93980
small          79416
wagon          63377
bus            28920
convertible    19154
coupe          14955
suv            11272
other           2630
Name: vehicle_type, dtype: int64
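
Filling from the most frequent value per model raises an `IndexError` if any model has no known value at all, because `value_counts()` is then empty. A defensive sketch on toy data (the `fill_with_group_mode` helper is hypothetical) leaves such groups as NaN instead:

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(frame, group_col, target_col):
    """Fill NaN in target_col with the most frequent value within each group."""
    def mode_or_nan(s):
        counts = s.value_counts()  # NaN is excluded by default
        return counts.index[0] if len(counts) else np.nan
    group_mode = frame.groupby(group_col)[target_col].transform(mode_or_nan)
    return frame[target_col].fillna(group_mode)

# toy rows: 'golf' has a known type, 'ka' has none
toy = pd.DataFrame({
    'model': ['golf', 'golf', 'golf', 'ka', 'ka'],
    'vehicle_type': ['sedan', 'sedan', np.nan, np.nan, np.nan],
})
toy['vehicle_type'] = fill_with_group_mode(toy, 'model', 'vehicle_type')

print(toy['vehicle_type'].tolist())
```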

In [30]:
df['gearbox'].value_counts(dropna=False)

manual    240670
auto       60188
NaN        12846
Name: gearbox, dtype: int64

__gearbox is missing about 4% of its values. We will fill those in based on model, since the gearbox type is usually consistent within a model.__

In [31]:
# fill missing gearbox with the most common gearbox for the same model
df['gearbox'] = df['gearbox'].fillna(
    df.groupby('model')['gearbox'].transform(lambda x: x.value_counts().index[0]))
df['gearbox'].value_counts(dropna=False)

manual    252002
auto       61702
Name: gearbox, dtype: int64

In [32]:
outlier_stats(df.power)

count      270.000000
mean      3716.562963
std       4484.265366
min        671.000000
25%       1149.500000
50%       1612.500000
75%       4237.500000
max      20000.000000
dtype: float64

__We discover a small number of outliers in the power column (270 rows, from 671 hp up to 20,000 hp) and will delete those.__

In [33]:
print(df.shape)
df = df.loc[df['power'] <= 670]
df.shape 

(313704, 11)


(313434, 11)

In [34]:
df['power'].isnull().sum()

0

In [35]:
df['registration_month'].value_counts()

3     31568
6     28522
4     26538
5     26437
0     24734
7     24652
10    23937
11    22226
12    22024
9     21924
1     20953
8     20552
2     19367
Name: registration_month, dtype: int64

In [36]:
df.corr()

Unnamed: 0,price,registration_year,power,milage,registration_month
price,1.0,0.485168,0.49291,-0.367805,0.085502
registration_year,0.485168,1.0,0.152706,-0.273744,0.058507
power,0.49291,0.152706,1.0,0.073695,0.098254
milage,-0.367805,-0.273744,0.073695,1.0,-0.003873
registration_month,0.085502,0.058507,0.098254,-0.003873,1.0


__Once again we find a weak relationship between registration_month and price, but it is better to keep the column and its rows for overall data integrity. A month of 0 marks an unknown value (24,734 rows). The known months are fairly evenly distributed, so we can assign the zero values a random month from 1 to 12.__

In [37]:
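# note: apply() calls the lambda on every row, so this draws a new month for all rows, not only the zeros
df['registration_month'] = df['registration_month'].apply(lambda v: random.choice([1,2,3,4,5,6,7,8,9,10,11,12]))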
df['registration_month'].value_counts(dropna=False)

11    26388
3     26325
2     26291
8     26165
5     26144
10    26111
7     26094
9     26053
1     26037
4     25968
12    25960
6     25898
Name: registration_month, dtype: int64

In [38]:
df.query('registration_month == 0')

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired
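
Because `apply` calls the lambda on every row, the cell above draws a new month for all rows, which is why the counts come out nearly uniform. A sketch that touches only the zero entries, on toy data and using NumPy's `default_rng` (an assumption; the notebook uses the `random` module):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# toy rows: three unknown months (0) and three known ones
toy = pd.DataFrame({'registration_month': [0, 3, 0, 12, 7, 0]})

# draw replacements only for rows where the month is recorded as 0
is_zero = toy['registration_month'] == 0
toy.loc[is_zero, 'registration_month'] = rng.integers(1, 13, size=int(is_zero.sum()))

print(toy['registration_month'].tolist())
```

The known months (3, 12, 7) stay as recorded, so the original distribution of registered months is preserved.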


In [39]:
df['fuel_type'].value_counts(dropna=False)

petrol      196403
gasoline     92275
NaN          19160
lpg           4711
cng            511
hybrid         198
other          113
electric        63
Name: fuel_type, dtype: int64

__We will fill in the missing values using the influence of model since usually the fuel type is consistent with model.__

In [40]:
# fill missing fuel_type with the most common fuel type for the same model
df['fuel_type'] = df['fuel_type'].fillna(
    df.groupby('model')['fuel_type'].transform(lambda x: x.value_counts().index[0]))
df['fuel_type'].value_counts(dropna=False)

petrol      212129
gasoline     95709
lpg           4711
cng            511
hybrid         198
other          113
electric        63
Name: fuel_type, dtype: int64

In [41]:
df['not_repaired'].value_counts(dropna=False)

no     229472
NaN     53015
yes     30947
Name: not_repaired, dtype: int64

__Since no is the most common value, we will replace the missing values with no.__

In [42]:
df['not_repaired'] = df['not_repaired'].fillna('no')
df['not_repaired'].value_counts(dropna=False)

no     282487
yes     30947
Name: not_repaired, dtype: int64

__Now we will check for any remaining missing values, reset the index, and look at the percentage of the original data we kept.__

In [43]:
missing_stats(df)


Columns Missing Values
                      type  cnt  pct
price                int64    0  0.0
vehicle_type        object    0  0.0
registration_year    int64    0  0.0
gearbox             object    0  0.0
power                int64    0  0.0
model               object    0  0.0
milage               int64    0  0.0
registration_month   int64    0  0.0
fuel_type           object    0  0.0
brand               object    0  0.0
not_repaired        object    0  0.0


In [44]:
df.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired
0,480,sedan,1993,manual,0,golf,150000,11,petrol,volkswagen,no
2,9800,suv,2004,auto,163,grand,125000,2,gasoline,jeep,no
3,1500,small,2001,manual,75,golf,150000,1,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,12,gasoline,skoda,no
5,650,sedan,1995,manual,102,3er,150000,5,petrol,bmw,yes


In [45]:
# resetting the DataFrame index (without drop=True the old index is kept as an 'index' column)
df = df.reset_index()
df.head(5)

Unnamed: 0,index,price,vehicle_type,registration_year,gearbox,power,model,milage,registration_month,fuel_type,brand,not_repaired
0,0,480,sedan,1993,manual,0,golf,150000,11,petrol,volkswagen,no
1,2,9800,suv,2004,auto,163,grand,125000,2,gasoline,jeep,no
2,3,1500,small,2001,manual,75,golf,150000,1,petrol,volkswagen,no
3,4,3600,small,2008,manual,69,fabia,90000,12,gasoline,skoda,no
4,5,650,sedan,1995,manual,102,3er,150000,5,petrol,bmw,yes


In [46]:
len(df)/original_no_rows

0.8844848166741448

__Summary of data preparation__
- We dropped several columns, 'date_crawled', 'date_created', 'last_seen', 'num_pictures', 'postal_code', that will not be useful for our analysis.
- We eliminated rows where the registration_year > 2016 (the year of the data) or < 1960 (unlikely).
- We eliminated rows where the price = 0.
- We eliminated the rows where a value for model was missing.
- We filled in missing values in vehicle_type, gearbox, and fuel_type based on model.
- We eliminated rows where power > 670.
- We replaced registration_month with randomly assigned months (1-12) to remove the zero values; as written, the cell reassigns every row, not only the zeros.
- We replaced the NaN values in not_repaired with 'no'.
- We verified that no missing values remain.
- We note our preparation eliminated almost 12% of the data, but feel confident in the deletion choices.

## Model training

In [47]:
# change into categories for LightGBM and CatBoost
categories = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']
for col in categories:
    df[col] = df[col].astype('category')

In [48]:
# create feature and target variables
target = df['price']
features = df.drop(['price'], axis=1)

In [49]:
# use one hot encoding to turn the categories into numeric
features_ohe = pd.get_dummies(features, drop_first=True)
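
What `drop_first=True` does is easiest to see on a tiny frame: each categorical column with k levels becomes k-1 indicator columns, with the first level acting as the implicit baseline. A sketch on made-up values:

```python
import pandas as pd

# made-up rows; categorical levels are ordered alphabetically
toy = pd.DataFrame({
    'power': [75, 163, 69],
    'gearbox': pd.Categorical(['manual', 'auto', 'manual']),
    'fuel_type': pd.Categorical(['petrol', 'gasoline', 'lpg']),
})

# numeric columns pass through; each categorical loses its first level
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear columns, which matters for linear regression more than for the tree-based models.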

In [50]:
# divide data with OHE into 3 groups using 3:1:1 (60%, 20%, 20%) ratio
features_train, features_valid, target_train, target_valid = train_test_split(
    features_ohe, target, test_size=0.4, random_state = 12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, shuffle = False)
print('Train target and features and percentage\n', target_train.shape, features_train.shape,
      'pct', (len(target_train)/len(df)))
print('Valid target and features and percentage\n', target_valid.shape, features_valid.shape,
      'pct', (len(target_valid)/len(df)))
print('Test target and features and percentage\n', target_test.shape, features_test.shape,
      'pct', (len(target_test)/len(df)))


Train target and features and percentage
 (188060,) (188060, 307) pct 0.5999987238142639
Valid target and features and percentage
 (62687,) (62687, 307) pct 0.20000063809286803
Test target and features and percentage
 (62687,) (62687, 307) pct 0.20000063809286803


In [51]:
# divide data without OHE into 3 groups using 3:1:1 (60%, 20%, 20%) ratio
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(
    features, target, test_size=0.4, random_state = 12345)
features_valid_2, features_test_2, target_valid_2, target_test_2 = train_test_split(
    features_valid_2, target_valid_2, test_size=0.5, shuffle = False)

print('Train target and features and percentage\n', target_train_2.shape, features_train_2.shape,
      'pct', (len(target_train_2)/len(df)))
print('Valid target and features and percentage\n', target_valid_2.shape, features_valid_2.shape,
      'pct', (len(target_valid_2)/len(df)))
print('Test target and features and percentage\n', target_test_2.shape, features_test_2.shape,
      'pct', (len(target_test_2)/len(df)))

Train target and features and percentage
 (188060,) (188060, 11) pct 0.5999987238142639
Valid target and features and percentage
 (62687,) (62687, 11) pct 0.20000063809286803
Test target and features and percentage
 (62687,) (62687, 11) pct 0.20000063809286803


In [52]:
# set up rmse calculation
def find_rmse(target_test, predictions):
    return round(mean_squared_error(target_test, predictions) ** 0.5, 2)
rmse = make_scorer(find_rmse, greater_is_better=False)
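
RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in euros. A worked sketch in plain Python with made-up prices (the `rmse_plain` helper is hypothetical); note that a scorer built with `greater_is_better=False` reports the negated value, so scores closer to zero are better:

```python
import math

def rmse_plain(actual, predicted):
    """Root mean squared error computed step by step."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [1000, 2000, 3000]      # made-up prices
predicted = [1100, 1900, 3200]   # errors of 100, 100 and 200

print(rmse_plain(actual, predicted))
```

The single 200-euro miss contributes more than the two 100-euro misses combined, which is how RMSE penalizes large errors.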

__Note on timing cells. We used %%time to find the total time elapsed in each cell. However, we wanted to save the time values in variables and discovered %%time does not allow that. Therefore, we manually timed training time and prediction time and took note that their total matched closely with the %%time.__
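
One way to keep elapsed times in variables is a small context manager around `time.perf_counter`. This is a sketch, not what the notebook uses; the `Timer` class is a hypothetical helper:

```python
import time

class Timer:
    """Context manager that stores wall-clock seconds in .elapsed."""
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions

with Timer() as fit_timer:
    total = sum(i * i for i in range(100_000))  # stand-in for model.fit(...)

print(f"training time: {fit_timer.elapsed:.4f} s")
```

Wrapping the fit and the predict calls in separate `with` blocks would give the two durations without repeating the start/end bookkeeping.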

__Looking at base models__

In [53]:
%%time
# linear regression with default parameters
lr_model = LinearRegression()

start = time.time()
lr_model.fit(features_train, target_train)
end = time.time()
lrtt = end - start

start = time.time()
predicted_valid = lr_model.predict(features_valid)
end = time.time()
lrpt = end - start

lr_rmse_calc = mean_squared_error(target_valid, predicted_valid)**0.5

print('Linear Regression - Sanity Check')
print('RMSE:', lr_rmse_calc, 'Training time:', lrtt, 'Prediction time:', lrpt)

Linear Regression - Sanity Check
RMSE: 2673.154220556299 Training time: 19.875536680221558 Prediction time: 0.2535219192504883
CPU times: user 15.7 s, sys: 4.5 s, total: 20.2 s
Wall time: 20.2 s


In [54]:
%%time
# random forest regressor with default parameters
rf_model = RandomForestRegressor(random_state=42)

start = time.time()
rf_model.fit(features_train, target_train)
end = time.time()
rftt = end - start

start = time.time()
predicted_valid = rf_model.predict(features_valid)
end = time.time()
rfpt = end - start

rf_rmse_calc = mean_squared_error(target_valid, predicted_valid)**0.5

print('Random Forest Regressor')
print('RMSE:', rf_rmse_calc, 'Training time:', rftt, 'Prediction time:', rfpt)

Random Forest Regressor
RMSE: 1727.9087596013774 Training time: 60.69512057304382 Prediction time: 0.6225581169128418
CPU times: user 1min, sys: 139 ms, total: 1min
Wall time: 1min 1s


In [55]:
%%time
# lightGBM with OHE 
lg_model = lgb.LGBMRegressor(random_state=42)

start = time.time()
lg_model.fit(features_train, target_train)
end = time.time()
lgohett = end - start

start = time.time()
predicted_valid = lg_model.predict(features_valid)
end = time.time()
lgohept = end - start

lgohe_rmse_calc = mean_squared_error(target_valid, predicted_valid)**0.5

print('LightGBM with OHE')
print('RMSE:', lgohe_rmse_calc, 'Training time:', lgohett, 'Prediction time:', lgohept)

LightGBM with OHE
RMSE: 1707.641861082534 Training time: 10.578785419464111 Prediction time: 1.1029925346374512
CPU times: user 11.2 s, sys: 351 ms, total: 11.6 s
Wall time: 11.7 s


In [56]:
%%time
# lightGBM without OHE 
lg_model_2 = lgb.LGBMRegressor(random_state=42)

start = time.time()
lg_model_2.fit(features_train_2, target_train_2, categorical_feature=categories)
end = time.time()
lgtt = end - start

start = time.time()
predicted_valid = lg_model_2.predict(features_valid_2)
end = time.time()
lgpt = end - start

lg_rmse_calc = mean_squared_error(target_valid_2, predicted_valid)**0.5

print('LightGBM without OHE')
print('RMSE:', lg_rmse_calc, 'Training time:', lgtt, 'Prediction time:', lgpt)

LightGBM without OHE
RMSE: 1643.351067294688 Training time: 8.479040384292603 Prediction time: 0.9047174453735352
CPU times: user 9.2 s, sys: 43.4 ms, total: 9.24 s
Wall time: 9.39 s


In [57]:
%%time
# CatBoost with OHE 
cb_model = CatBoostRegressor(random_state=42)

start = time.time()
cb_model.fit(features_train, target_train)
end = time.time()
cbohett = end - start

start = time.time()
predicted_valid = cb_model.predict(features_valid)
end = time.time()
cbohept = end - start

cbohe_rmse_calc = mean_squared_error(target_valid, predicted_valid)**0.5

print('CatBoost with OHE')
print('RMSE:', cbohe_rmse_calc, 'Training time:', cbohett, 'Prediction time:', cbohept)

0:	learn: 4477.9699145	total: 142ms	remaining: 2m 21s
1:	learn: 4391.9709932	total: 332ms	remaining: 2m 45s
2:	learn: 4308.0746711	total: 529ms	remaining: 2m 55s
3:	learn: 4227.9910306	total: 637ms	remaining: 2m 38s
4:	learn: 4151.0618149	total: 829ms	remaining: 2m 44s
5:	learn: 4079.2342586	total: 936ms	remaining: 2m 35s
6:	learn: 4006.5334079	total: 1.13s	remaining: 2m 40s
7:	learn: 3936.3632976	total: 1.24s	remaining: 2m 33s
8:	learn: 3870.6813826	total: 1.43s	remaining: 2m 37s
9:	learn: 3804.0220973	total: 1.62s	remaining: 2m 40s
10:	learn: 3740.0336401	total: 1.73s	remaining: 2m 35s
11:	learn: 3681.7381368	total: 1.92s	remaining: 2m 38s
12:	learn: 3623.1264956	total: 2.03s	remaining: 2m 34s
13:	learn: 3566.3356465	total: 2.22s	remaining: 2m 36s
14:	learn: 3513.3201315	total: 2.33s	remaining: 2m 32s
15:	learn: 3461.0498985	total: 2.52s	remaining: 2m 35s
16:	learn: 3410.9187350	total: 2.72s	remaining: 2m 37s
17:	learn: 3363.4008978	total: 2.82s	remaining: 2m 34s
18:	learn: 3315.8216

In [58]:
%%time
# CatBoost without OHE 
cb_model = CatBoostRegressor(random_state=42)

start = time.time()
cb_model.fit(features_train_2, target_train_2, cat_features=categories)
end = time.time()
cbtt = end - start

start = time.time()
predicted_valid = cb_model.predict(features_valid_2)
end = time.time()
cbpt = end - start

cb_rmse_calc = mean_squared_error(target_valid_2, predicted_valid)**0.5

print('CatBoost without OHE')
print('RMSE:', cb_rmse_calc, 'Training time:', cbtt, 'Prediction time:', cbpt)

0:	learn: 4478.4138069	total: 529ms	remaining: 8m 48s
1:	learn: 4392.1961601	total: 1.13s	remaining: 9m 23s
2:	learn: 4309.6511182	total: 1.63s	remaining: 9m 1s
3:	learn: 4230.0585080	total: 2.22s	remaining: 9m 12s
4:	learn: 4153.5073781	total: 2.73s	remaining: 9m 2s
5:	learn: 4077.4151115	total: 3.32s	remaining: 9m 10s
6:	learn: 4004.1621682	total: 3.92s	remaining: 9m 16s
7:	learn: 3934.3220221	total: 4.42s	remaining: 9m 8s
8:	learn: 3866.8768840	total: 5.01s	remaining: 9m 12s
9:	learn: 3800.0963102	total: 5.53s	remaining: 9m 7s
10:	learn: 3736.5353106	total: 6.12s	remaining: 9m 10s
11:	learn: 3675.8516709	total: 6.72s	remaining: 9m 13s
12:	learn: 3616.5362126	total: 7.32s	remaining: 9m 15s
13:	learn: 3559.9368459	total: 7.91s	remaining: 9m 17s
14:	learn: 3505.0986190	total: 8.32s	remaining: 9m 6s
15:	learn: 3453.0637704	total: 8.91s	remaining: 9m 8s
16:	learn: 3400.6724419	total: 9.51s	remaining: 9m 9s
17:	learn: 3352.4051909	total: 10s	remaining: 9m 6s
18:	learn: 3304.3949169	total:

In [59]:
print('Linear Regression - Sanity Check')
print('RMSE:', lr_rmse_calc, 'Training time:', lrtt, 'Prediction time:', lrpt)
print('\nRandom Forest Regressor')
print('RMSE:', rf_rmse_calc, 'Training time:', rftt, 'Prediction time:', rfpt)
print('\nLightGBM with OHE')
print('RMSE:', lgohe_rmse_calc, 'Training time:', lgohett, 'Prediction time:', lgohept)
print('\nLightGBM without OHE')
print('RMSE:', lg_rmse_calc, 'Training time:', lgtt, 'Prediction time:', lgpt)
print('\nCatBoost with OHE')
print('RMSE:', cbohe_rmse_calc, 'Training time:', cbohett, 'Prediction time:', cbohept)
print('\nCatBoost without OHE')
print('RMSE:', cb_rmse_calc, 'Training time:', cbtt, 'Prediction time:', cbpt)

Linear Regression - Sanity Check
RMSE: 2673.154220556299 Training time: 19.875536680221558 Prediction time: 0.2535219192504883

Random Forest Regressor
RMSE: 1727.9087596013774 Training time: 60.69512057304382 Prediction time: 0.6225581169128418

LightGBM with OHE
RMSE: 1707.641861082534 Training time: 10.578785419464111 Prediction time: 1.1029925346374512

LightGBM without OHE
RMSE: 1643.351067294688 Training time: 8.479040384292603 Prediction time: 0.9047174453735352

CatBoost with OHE
RMSE: 1703.817876905744 Training time: 153.08609819412231 Prediction time: 0.21335506439208984

CatBoost without OHE
RMSE: 1683.0274262756113 Training time: 501.59921288490295 Prediction time: 0.4114964008331299


__We note the RMSE and the times required for training and prediction__

- All the models achieve a lower RMSE (smaller prediction error) than linear regression, which we expected.
- LightGBM without One Hot Encoding provides the best RMSE (about 1643), but all the values besides linear regression are close, ranging from roughly 1643 to 1728.
- The training times of the base models range from around 8.5 seconds with LightGBM without OHE to just over 500 seconds with CatBoost without OHE.
- The prediction times of the base models range from around 0.2 seconds (CatBoost with OHE) to about 1.1 seconds with LightGBM with OHE.
- Since the model only needs to be trained once but will serve many predictions, the prediction time is more relevant than the training time.
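
The comparison above can be made explicit by ranking the base models on each criterion. The following is only an illustrative sketch: the `results` dictionary restates the figures printed above (rounded), and the helper name `best_by` is hypothetical, not part of the original notebook.

```python
# Base-model results copied (rounded) from the printed output above:
# name -> (RMSE, training time in seconds, prediction time in seconds)
results = {
    "Linear Regression": (2673.15, 19.88, 0.25),
    "Random Forest": (1727.91, 60.70, 0.62),
    "LightGBM with OHE": (1707.64, 10.58, 1.10),
    "LightGBM without OHE": (1643.35, 8.48, 0.90),
    "CatBoost with OHE": (1703.82, 153.09, 0.21),
    "CatBoost without OHE": (1683.03, 501.60, 0.41),
}

# Position of each metric inside the tuples above
METRICS = {"rmse": 0, "train_time": 1, "predict_time": 2}


def best_by(metric):
    """Return the model name with the lowest value for the given metric."""
    idx = METRICS[metric]
    return min(results, key=lambda name: results[name][idx])


for metric in METRICS:
    print(f"Lowest {metric}: {best_by(metric)}")
```

Lower is better for all three metrics, so a single `min` per column is enough to see which model leads on each of Rusty Bargain's criteria.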

__Tuning hyperparameters__

- We will tune n_estimators for RandomForestRegressor.
- We will tune n_estimators and learning_rate for CatBoost and LightGBM, both with and without OHE, since performance varies between the two encodings [LightGBMdoc](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html), [CatBoost](https://catboost.ai/)
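
The search cells in this section pass `scoring=rmse`, and the `score` values in their logs are negative. That is consistent with a scorer built with `make_scorer(..., greater_is_better=False)`, which negates the metric so that scikit-learn can always maximize. The sketch below assumes that definition; the names `rmse_value` and `rmse_scorer` and the toy data are hypothetical and only illustrate the sign convention.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error


def rmse_value(y_true, y_pred):
    """Plain RMSE: square root of the mean squared error."""
    return mean_squared_error(y_true, y_pred) ** 0.5


# greater_is_better=False makes the scorer return the NEGATED RMSE,
# because search utilities always pick the highest score.
rmse_scorer = make_scorer(rmse_value, greater_is_better=False)

# Toy data (assumption, for illustration only)
X = np.arange(6).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# A baseline that always predicts the mean of y
model = DummyRegressor(strategy="mean").fit(X, y)
score = rmse_scorer(model, X, y)
print("Scorer output:", score, "-> RMSE:", -score)
```

When reading the logs, flipping the sign of `score` recovers the RMSE in euros.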

In [60]:
%%time
rf_model = RandomForestRegressor(random_state=42)
params = { 'n_estimators': range(10, 30, 5) }

best_model = RandomizedSearchCV(rf_model, params, scoring=rmse, cv=5, verbose=10)
best_model.fit(features_train, target_train)  
print('Best parameters:', best_model.best_params_)

predictions = best_model.best_estimator_.predict(features_valid)
print('RMSE:', round(mean_squared_error(target_valid, predictions) ** 0.5, 2))

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................. n_estimators=10, score=-1772.600, total=  48.5s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   48.5s remaining:    0.0s


[CV] ................. n_estimators=10, score=-1711.530, total=  47.8s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.6min remaining:    0.0s


[CV] ................. n_estimators=10, score=-1736.270, total=  48.7s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.4min remaining:    0.0s


[CV] ................. n_estimators=10, score=-1717.170, total=  48.1s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.2min remaining:    0.0s


[CV] ................. n_estimators=10, score=-1741.730, total=  48.8s
[CV] n_estimators=15 .................................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  4.0min remaining:    0.0s


[CV] ................. n_estimators=15, score=-1744.790, total= 1.2min
[CV] n_estimators=15 .................................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  5.2min remaining:    0.0s


[CV] ................. n_estimators=15, score=-1686.140, total= 1.2min
[CV] n_estimators=15 .................................................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  6.4min remaining:    0.0s


[CV] ................. n_estimators=15, score=-1710.780, total= 1.2min
[CV] n_estimators=15 .................................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  7.6min remaining:    0.0s


[CV] ................. n_estimators=15, score=-1692.710, total= 1.2min
[CV] n_estimators=15 .................................................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  8.8min remaining:    0.0s


[CV] ................. n_estimators=15, score=-1719.410, total= 1.2min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1729.780, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1676.970, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1704.250, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1678.980, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1708.880, total= 1.6min
[CV] n_estimators=25 .................................................
[CV] ................. n_estimators=25, score=-1722.470, total= 2.2min
[CV] n_estimators=25 .................................................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 28.7min finished


Best parameters: {'n_estimators': 25}
RMSE: 1681.12
CPU times: user 30min 46s, sys: 1.88 s, total: 30min 48s
Wall time: 31min 22s


__Tuning for Random Forest Regressor__

__Print and copy results so we don't use another 30 minute block by rerunning the code__

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] n_estimators=10 .................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ................. n_estimators=10, score=-1772.600, total=  49.3s
[CV] n_estimators=10 .................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   49.3s remaining:    0.0s
[CV] ................. n_estimators=10, score=-1711.530, total=  49.8s
[CV] n_estimators=10 .................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.7min remaining:    0.0s
[CV] ................. n_estimators=10, score=-1736.270, total=  49.9s
[CV] n_estimators=10 .................................................
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.5min remaining:    0.0s
[CV] ................. n_estimators=10, score=-1717.170, total=  49.5s
[CV] n_estimators=10 .................................................
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.3min remaining:    0.0s
[CV] ................. n_estimators=10, score=-1741.730, total=  48.3s
[CV] n_estimators=15 .................................................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  4.1min remaining:    0.0s
[CV] ................. n_estimators=15, score=-1744.790, total= 1.2min
[CV] n_estimators=15 .................................................
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  5.3min remaining:    0.0s
[CV] ................. n_estimators=15, score=-1686.140, total= 1.2min
[CV] n_estimators=15 .................................................
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  6.6min remaining:    0.0s
[CV] ................. n_estimators=15, score=-1710.780, total= 1.2min
[CV] n_estimators=15 .................................................
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  7.7min remaining:    0.0s
[CV] ................. n_estimators=15, score=-1692.710, total= 1.2min
[CV] n_estimators=15 .................................................
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  8.9min remaining:    0.0s
[CV] ................. n_estimators=15, score=-1719.410, total= 1.2min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1729.780, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1676.970, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1704.250, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1678.980, total= 1.6min
[CV] n_estimators=20 .................................................
[CV] ................. n_estimators=20, score=-1708.880, total= 1.6min
[CV] n_estimators=25 .................................................
[CV] ................. n_estimators=25, score=-1722.470, total= 2.0min
[CV] n_estimators=25 .................................................
[CV] ................. n_estimators=25, score=-1668.740, total= 2.1min
[CV] n_estimators=25 .................................................
[CV] ................. n_estimators=25, score=-1694.250, total= 2.0min
[CV] n_estimators=25 .................................................
[CV] ................. n_estimators=25, score=-1669.860, total= 2.0min
[CV] n_estimators=25 .................................................
[CV] ................. n_estimators=25, score=-1700.940, total= 2.0min
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 28.1min finished

Best parameters: {'n_estimators': 25}

RMSE: 1681.06

CPU times: user 30min 16s, sys: 2.41 s, total: 30min 19s

Wall time: 30min 45s
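
To see why the search settled on `n_estimators=25`, the fold scores from the log can be averaged per candidate, which mirrors how the search ranks candidates by mean cross-validation score. This is a minimal sketch using the scores copied from the log above; the variable names are hypothetical.

```python
from statistics import mean

# Fold scores (negated RMSE) copied from the log above, keyed by n_estimators
fold_scores = {
    10: [-1772.60, -1711.53, -1736.27, -1717.17, -1741.73],
    15: [-1744.79, -1686.14, -1710.78, -1692.71, -1719.41],
    20: [-1729.78, -1676.97, -1704.25, -1678.98, -1708.88],
    25: [-1722.47, -1668.74, -1694.25, -1669.86, -1700.94],
}

# Average the five folds for each candidate
mean_scores = {n: mean(scores) for n, scores in fold_scores.items()}

# The highest (least negative) mean score wins
best_n = max(mean_scores, key=mean_scores.get)

for n, s in mean_scores.items():
    print(f"n_estimators={n}: mean CV RMSE = {-s:.2f}")
print("Best n_estimators:", best_n)
```

The mean CV RMSE falls steadily as `n_estimators` grows, and the best value sits at the upper edge of the searched range, so larger values may be worth trying if the extra training time is acceptable.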

In [73]:
%%time
# CatBoost without OHE
cb_model_2 = CatBoostRegressor(random_state=42)
params = { 'n_estimators': range(10, 30, 5), 'learning_rate': [.25, .5, .75] }

best_model = RandomizedSearchCV(cb_model_2, params, scoring=rmse, cv=5, verbose=10)
best_model.fit(features_train_2, target_train_2, cat_features=categories) 
predictions = best_model.best_estimator_.predict(features_valid_2)

print('CatBoost without OHE')
print('Best parameters:', best_model.best_params_)
print('RMSE:', round(mean_squared_error(target_valid_2, predictions) ** 0.5, 2))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0:	learn: 3863.6577136	total: 167ms	remaining: 1.51s
1:	learn: 3351.7184724	total: 371ms	remaining: 1.48s
2:	learn: 2974.3715764	total: 664ms	remaining: 1.55s
3:	learn: 2711.2629352	total: 872ms	remaining: 1.31s
4:	learn: 2500.5296025	total: 1.16s	remaining: 1.16s
5:	learn: 2354.0300120	total: 1.36s	remaining: 909ms
6:	learn: 2255.7703826	total: 1.57s	remaining: 673ms
7:	learn: 2171.3605273	total: 1.77s	remaining: 443ms
8:	learn: 2108.6749476	total: 2.06s	remaining: 229ms
9:	learn: 2059.0081094	total: 2.26s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.25, score=-2082.430, total=   4.2s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.2s remaining:    0.0s


0:	learn: 3859.5267232	total: 185ms	remaining: 1.67s
1:	learn: 3345.1425641	total: 386ms	remaining: 1.54s
2:	learn: 2974.6865057	total: 676ms	remaining: 1.58s
3:	learn: 2710.0498119	total: 880ms	remaining: 1.32s
4:	learn: 2508.3198786	total: 1.08s	remaining: 1.08s
5:	learn: 2370.0833593	total: 1.28s	remaining: 857ms
6:	learn: 2265.2938333	total: 1.48s	remaining: 636ms
7:	learn: 2183.1756569	total: 1.69s	remaining: 421ms
8:	learn: 2112.8276502	total: 1.98s	remaining: 220ms
9:	learn: 2068.7618436	total: 2.18s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.25, score=-2065.370, total=   3.8s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    8.0s remaining:    0.0s


0:	learn: 3866.2746728	total: 209ms	remaining: 1.88s
1:	learn: 3338.0993299	total: 409ms	remaining: 1.64s
2:	learn: 2977.1975753	total: 704ms	remaining: 1.64s
3:	learn: 2720.4852566	total: 910ms	remaining: 1.36s
4:	learn: 2521.9318702	total: 1.2s	remaining: 1.2s
5:	learn: 2382.6355112	total: 1.41s	remaining: 938ms
6:	learn: 2279.1847565	total: 1.61s	remaining: 689ms
7:	learn: 2197.4389047	total: 1.81s	remaining: 452ms
8:	learn: 2125.8281701	total: 2.1s	remaining: 233ms
9:	learn: 2075.0804717	total: 2.3s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.25, score=-2064.070, total=   4.0s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   12.0s remaining:    0.0s


0:	learn: 3865.9207030	total: 143ms	remaining: 1.29s
1:	learn: 3350.3102025	total: 341ms	remaining: 1.36s
2:	learn: 2980.9988387	total: 552ms	remaining: 1.29s
3:	learn: 2714.6851635	total: 840ms	remaining: 1.26s
4:	learn: 2514.7335139	total: 1.05s	remaining: 1.05s
5:	learn: 2376.0568475	total: 1.24s	remaining: 830ms
6:	learn: 2273.8742158	total: 1.44s	remaining: 618ms
7:	learn: 2194.5378843	total: 1.73s	remaining: 433ms
8:	learn: 2129.2384309	total: 1.93s	remaining: 215ms
9:	learn: 2079.9704948	total: 2.14s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.25, score=-2063.230, total=   3.9s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   15.9s remaining:    0.0s


0:	learn: 3872.4817575	total: 150ms	remaining: 1.35s
1:	learn: 3351.4133580	total: 354ms	remaining: 1.42s
2:	learn: 2968.5659644	total: 646ms	remaining: 1.51s
3:	learn: 2711.9409127	total: 852ms	remaining: 1.28s
4:	learn: 2502.4730387	total: 1.15s	remaining: 1.15s
5:	learn: 2368.5974576	total: 1.35s	remaining: 898ms
6:	learn: 2266.5018235	total: 1.55s	remaining: 664ms
7:	learn: 2185.3293948	total: 1.75s	remaining: 437ms
8:	learn: 2119.1773461	total: 2.04s	remaining: 227ms
9:	learn: 2073.1974148	total: 2.25s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.25, score=-2084.950, total=   4.0s
[CV] n_estimators=15, learning_rate=0.5 ..............................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   19.9s remaining:    0.0s


0:	learn: 3275.5863406	total: 153ms	remaining: 2.15s
1:	learn: 2668.1453191	total: 356ms	remaining: 2.32s
2:	learn: 2380.7034117	total: 652ms	remaining: 2.61s
3:	learn: 2191.7127217	total: 863ms	remaining: 2.37s
4:	learn: 2102.6113904	total: 1.15s	remaining: 2.3s
5:	learn: 2038.2334057	total: 1.35s	remaining: 2.02s
6:	learn: 2008.1382743	total: 1.55s	remaining: 1.77s
7:	learn: 1966.7449541	total: 1.84s	remaining: 1.61s
8:	learn: 1948.0588490	total: 2.04s	remaining: 1.36s
9:	learn: 1924.9562813	total: 2.25s	remaining: 1.12s
10:	learn: 1907.4797977	total: 2.45s	remaining: 891ms
11:	learn: 1891.8441675	total: 2.74s	remaining: 685ms
12:	learn: 1879.6456252	total: 2.94s	remaining: 452ms
13:	learn: 1868.9066636	total: 3.14s	remaining: 225ms
14:	learn: 1859.9341860	total: 3.34s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.5, score=-1891.750, total=   5.1s
[CV] n_estimators=15, learning_rate=0.5 ..............................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   25.0s remaining:    0.0s


0:	learn: 3265.3757639	total: 181ms	remaining: 2.53s
1:	learn: 2653.4760557	total: 388ms	remaining: 2.52s
2:	learn: 2335.8832856	total: 681ms	remaining: 2.72s
3:	learn: 2175.5405604	total: 973ms	remaining: 2.67s
4:	learn: 2086.1034465	total: 1.18s	remaining: 2.36s
5:	learn: 2039.1548377	total: 1.38s	remaining: 2.07s
6:	learn: 2001.2702045	total: 1.68s	remaining: 1.92s
7:	learn: 1966.8870855	total: 1.88s	remaining: 1.64s
8:	learn: 1947.0304649	total: 2.08s	remaining: 1.38s
9:	learn: 1928.7037558	total: 2.28s	remaining: 1.14s
10:	learn: 1913.9700098	total: 2.48s	remaining: 901ms
11:	learn: 1895.3206816	total: 2.68s	remaining: 670ms
12:	learn: 1883.4944868	total: 2.88s	remaining: 443ms
13:	learn: 1876.0839342	total: 3.08s	remaining: 220ms
14:	learn: 1866.1684763	total: 3.36s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.5, score=-1869.760, total=   5.0s
[CV] n_estimators=15, learning_rate=0.5 ..............................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   30.0s remaining:    0.0s


0:	learn: 3268.8523823	total: 126ms	remaining: 1.76s
1:	learn: 2652.1001421	total: 325ms	remaining: 2.11s
2:	learn: 2336.1381128	total: 621ms	remaining: 2.48s
3:	learn: 2174.7753181	total: 823ms	remaining: 2.26s
4:	learn: 2084.1486786	total: 1.03s	remaining: 2.05s
5:	learn: 2015.6322970	total: 1.31s	remaining: 1.97s
6:	learn: 1982.8641137	total: 1.52s	remaining: 1.74s
7:	learn: 1957.2669314	total: 1.72s	remaining: 1.5s
8:	learn: 1935.1338977	total: 1.92s	remaining: 1.28s
9:	learn: 1920.9173879	total: 2.12s	remaining: 1.06s
10:	learn: 1894.3883895	total: 2.32s	remaining: 844ms
11:	learn: 1884.3617490	total: 2.52s	remaining: 631ms
12:	learn: 1874.7853270	total: 2.81s	remaining: 432ms
13:	learn: 1863.3884380	total: 3.01s	remaining: 215ms
14:	learn: 1856.9080343	total: 3.21s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.5, score=-1853.670, total=   5.0s
[CV] n_estimators=15, learning_rate=0.5 ..............................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   35.0s remaining:    0.0s


0:	learn: 3270.2655769	total: 177ms	remaining: 2.48s
1:	learn: 2660.2849434	total: 396ms	remaining: 2.57s
2:	learn: 2345.2883667	total: 675ms	remaining: 2.7s
3:	learn: 2183.9613061	total: 883ms	remaining: 2.43s
4:	learn: 2105.7824448	total: 1.17s	remaining: 2.34s
5:	learn: 2043.8376917	total: 1.37s	remaining: 2.06s
6:	learn: 2000.2512155	total: 1.58s	remaining: 1.8s
7:	learn: 1973.3793957	total: 1.78s	remaining: 1.55s
8:	learn: 1955.9318434	total: 1.98s	remaining: 1.32s
9:	learn: 1934.8939707	total: 2.18s	remaining: 1.09s
10:	learn: 1921.7118827	total: 2.47s	remaining: 897ms
11:	learn: 1903.0576810	total: 2.67s	remaining: 668ms
12:	learn: 1891.8931652	total: 2.87s	remaining: 442ms
13:	learn: 1882.0853141	total: 3.07s	remaining: 219ms
14:	learn: 1861.5149794	total: 3.28s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.5, score=-1843.470, total=   5.1s
[CV] n_estimators=15, learning_rate=0.5 ..............................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   40.1s remaining:    0.0s


0:	learn: 3284.7908916	total: 185ms	remaining: 2.6s
1:	learn: 2653.7775240	total: 388ms	remaining: 2.52s
2:	learn: 2316.0786065	total: 688ms	remaining: 2.75s
3:	learn: 2172.3378767	total: 892ms	remaining: 2.45s
4:	learn: 2089.2207968	total: 1.18s	remaining: 2.36s
5:	learn: 2043.7925446	total: 1.39s	remaining: 2.08s
6:	learn: 1999.5899729	total: 1.58s	remaining: 1.81s
7:	learn: 1976.7493889	total: 1.79s	remaining: 1.56s
8:	learn: 1949.8353604	total: 1.98s	remaining: 1.32s
9:	learn: 1931.1053821	total: 2.27s	remaining: 1.14s
10:	learn: 1915.5189272	total: 2.47s	remaining: 900ms
11:	learn: 1894.3976623	total: 2.68s	remaining: 669ms
12:	learn: 1880.8327206	total: 2.88s	remaining: 443ms
13:	learn: 1870.2670485	total: 3.17s	remaining: 227ms
14:	learn: 1863.7169088	total: 3.38s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.5, score=-1885.590, total=   5.2s
[CV] n_estimators=20, learning_rate=0.25 .............................
0:	learn: 3863.6577136	total: 190ms	remaining: 3.6s
1:	lear

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  4.7min finished


0:	learn: 2821.8785389	total: 228ms	remaining: 5.47s
1:	learn: 2394.1858168	total: 525ms	remaining: 6.04s
2:	learn: 2188.5647889	total: 823ms	remaining: 6.04s
3:	learn: 2099.7693981	total: 1.12s	remaining: 5.89s
4:	learn: 2038.5658087	total: 1.33s	remaining: 5.33s
5:	learn: 2005.9005117	total: 1.63s	remaining: 5.15s
6:	learn: 1962.7477751	total: 1.93s	remaining: 4.95s
7:	learn: 1934.8087575	total: 2.22s	remaining: 4.72s
8:	learn: 1910.3177812	total: 2.43s	remaining: 4.32s
9:	learn: 1898.2869369	total: 2.72s	remaining: 4.08s
10:	learn: 1881.6918651	total: 3.01s	remaining: 3.84s
11:	learn: 1867.1362571	total: 3.22s	remaining: 3.49s
12:	learn: 1856.5701865	total: 3.52s	remaining: 3.25s
13:	learn: 1837.7874695	total: 3.72s	remaining: 2.92s
14:	learn: 1827.7487052	total: 4.02s	remaining: 2.68s
15:	learn: 1823.5539045	total: 4.31s	remaining: 2.42s
16:	learn: 1816.2561508	total: 4.52s	remaining: 2.13s
17:	learn: 1805.4012057	total: 4.81s	remaining: 1.87s
18:	learn: 1796.9243594	total: 5.11s	r

__Tuning for CatBoost without OHE__

CatBoost without OHE

Best parameters: {'n_estimators': 25, 'learning_rate': 0.75}

RMSE: 1787.49

CPU times: user 3min 42s, sys: 27.7 s, total: 4min 10s

Wall time: 5min 8s
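
The log line "Fitting 5 folds for each of 10 candidates, totalling 50 fits" follows from the search settings: the grid holds 4 x 3 = 12 combinations, and `RandomizedSearchCV` samples `n_iter=10` of them by default. The sketch below reproduces that arithmetic with scikit-learn's `ParameterGrid` and `ParameterSampler`; the `random_state=42` passed to the sampler is an assumption for illustration and does not reproduce the exact candidates drawn in the run above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the tuning cell above
params = {
    "n_estimators": list(range(10, 30, 5)),  # 10, 15, 20, 25
    "learning_rate": [0.25, 0.5, 0.75],
}

# Full grid: every combination of the two lists
grid_size = len(ParameterGrid(params))

# Random search draws n_iter distinct combinations from that grid
sampled = list(ParameterSampler(params, n_iter=10, random_state=42))

# Each sampled candidate is fitted once per fold (cv=5)
n_fits = len(sampled) * 5

print("Grid size:", grid_size)
print("Sampled candidates:", len(sampled))
print("Total fits:", n_fits)
```

Because two of the twelve combinations are skipped, the reported best parameters are the best among the sampled candidates, not necessarily the best on the full grid.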

In [74]:
%%time
# CatBoost with OHE
cb_model = CatBoostRegressor(random_state=42)
params = { 'n_estimators': range(10, 30, 5), 'learning_rate': [.25, .5, .75] }

best_model = RandomizedSearchCV(cb_model, params, scoring=rmse, cv=5, verbose=10)
best_model.fit(features_train, target_train) 
predictions = best_model.best_estimator_.predict(features_valid)

print('CatBoost with OHE')
print('Best parameters:', best_model.best_params_)
print('RMSE:', round(mean_squared_error(target_valid, predictions) ** 0.5, 2))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=10, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0:	learn: 2823.0316345	total: 19.9ms	remaining: 179ms
1:	learn: 2420.7173621	total: 121ms	remaining: 486ms
2:	learn: 2254.0653137	total: 226ms	remaining: 527ms
3:	learn: 2142.2217385	total: 415ms	remaining: 623ms
4:	learn: 2077.9998403	total: 518ms	remaining: 518ms
5:	learn: 2035.1970997	total: 623ms	remaining: 415ms
6:	learn: 2005.5962187	total: 810ms	remaining: 347ms
7:	learn: 1979.7717804	total: 912ms	remaining: 228ms
8:	learn: 1954.7702393	total: 1.02s	remaining: 113ms
9:	learn: 1936.7215070	total: 1.12s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.75, score=-1976.270, total=   6.9s
[CV] n_estimators=10, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.9s remaining:    0.0s


0:	learn: 2842.4599166	total: 61.3ms	remaining: 552ms
1:	learn: 2436.8917460	total: 163ms	remaining: 654ms
2:	learn: 2279.0701229	total: 266ms	remaining: 621ms
3:	learn: 2180.6631275	total: 455ms	remaining: 683ms
4:	learn: 2110.5806664	total: 556ms	remaining: 556ms
5:	learn: 2069.0726284	total: 657ms	remaining: 438ms
6:	learn: 2031.5062521	total: 764ms	remaining: 327ms
7:	learn: 2007.1822973	total: 949ms	remaining: 237ms
8:	learn: 1965.8982937	total: 1.05s	remaining: 117ms
9:	learn: 1946.2553248	total: 1.16s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.75, score=-1953.290, total=   6.9s
[CV] n_estimators=10, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   13.7s remaining:    0.0s


0:	learn: 2861.5758681	total: 70.1ms	remaining: 631ms
1:	learn: 2425.5712475	total: 174ms	remaining: 697ms
2:	learn: 2265.0190050	total: 280ms	remaining: 653ms
3:	learn: 2148.3249091	total: 466ms	remaining: 698ms
4:	learn: 2098.0268218	total: 570ms	remaining: 570ms
5:	learn: 2041.9457461	total: 671ms	remaining: 448ms
6:	learn: 2009.2994891	total: 862ms	remaining: 369ms
7:	learn: 1990.9031156	total: 964ms	remaining: 241ms
8:	learn: 1972.5907464	total: 1.07s	remaining: 119ms
9:	learn: 1954.4557828	total: 1.17s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.75, score=-1960.540, total=   6.9s
[CV] n_estimators=10, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   20.6s remaining:    0.0s


0:	learn: 2834.2177178	total: 71.2ms	remaining: 641ms
1:	learn: 2415.0898992	total: 175ms	remaining: 699ms
2:	learn: 2254.5850867	total: 366ms	remaining: 854ms
3:	learn: 2137.6663429	total: 470ms	remaining: 705ms
4:	learn: 2080.5201488	total: 576ms	remaining: 576ms
5:	learn: 2033.0901518	total: 764ms	remaining: 509ms
6:	learn: 1995.6585042	total: 867ms	remaining: 372ms
7:	learn: 1971.1796364	total: 969ms	remaining: 242ms
8:	learn: 1953.6448470	total: 1.08s	remaining: 120ms
9:	learn: 1922.0041314	total: 1.26s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.75, score=-1911.240, total=   6.9s
[CV] n_estimators=10, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   27.6s remaining:    0.0s


0:	learn: 2830.5108597	total: 49ms	remaining: 441ms
1:	learn: 2409.2310440	total: 152ms	remaining: 609ms
2:	learn: 2249.9705061	total: 263ms	remaining: 614ms
3:	learn: 2165.4818394	total: 445ms	remaining: 668ms
4:	learn: 2092.8797564	total: 549ms	remaining: 549ms
5:	learn: 2052.2372578	total: 652ms	remaining: 435ms
6:	learn: 2017.1459991	total: 842ms	remaining: 361ms
7:	learn: 1991.3705240	total: 947ms	remaining: 237ms
8:	learn: 1963.5085012	total: 1.05s	remaining: 117ms
9:	learn: 1945.2032732	total: 1.24s	remaining: 0us
[CV]  n_estimators=10, learning_rate=0.75, score=-1956.750, total=   6.9s
[CV] n_estimators=15, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   34.5s remaining:    0.0s


0:	learn: 2823.0316345	total: 33.8ms	remaining: 473ms
1:	learn: 2420.7173621	total: 136ms	remaining: 883ms
2:	learn: 2254.0653137	total: 241ms	remaining: 962ms
3:	learn: 2142.2217385	total: 430ms	remaining: 1.18s
4:	learn: 2077.9998403	total: 534ms	remaining: 1.07s
5:	learn: 2035.1970997	total: 636ms	remaining: 954ms
6:	learn: 2005.5962187	total: 739ms	remaining: 844ms
7:	learn: 1979.7717804	total: 927ms	remaining: 811ms
8:	learn: 1954.7702393	total: 1.03s	remaining: 688ms
9:	learn: 1936.7215070	total: 1.13s	remaining: 567ms
10:	learn: 1911.3216198	total: 1.32s	remaining: 482ms
11:	learn: 1891.2563316	total: 1.43s	remaining: 357ms
12:	learn: 1879.0848539	total: 1.53s	remaining: 236ms
13:	learn: 1867.7770290	total: 1.64s	remaining: 117ms
14:	learn: 1857.9099962	total: 1.82s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.75, score=-1892.140, total=   7.4s
[CV] n_estimators=15, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   41.9s remaining:    0.0s


0:	learn: 2842.4599166	total: 69.6ms	remaining: 974ms
1:	learn: 2436.8917460	total: 173ms	remaining: 1.12s
2:	learn: 2279.0701229	total: 276ms	remaining: 1.1s
3:	learn: 2180.6631275	total: 465ms	remaining: 1.28s
4:	learn: 2110.5806664	total: 567ms	remaining: 1.13s
5:	learn: 2069.0726284	total: 667ms	remaining: 1000ms
6:	learn: 2031.5062521	total: 772ms	remaining: 883ms
7:	learn: 2007.1822973	total: 883ms	remaining: 773ms
8:	learn: 1965.8982937	total: 1.06s	remaining: 709ms
9:	learn: 1946.2553248	total: 1.16s	remaining: 582ms
10:	learn: 1922.6592633	total: 1.27s	remaining: 461ms
11:	learn: 1913.4148468	total: 1.46s	remaining: 365ms
12:	learn: 1899.5476169	total: 1.56s	remaining: 240ms
13:	learn: 1882.5671427	total: 1.66s	remaining: 119ms
14:	learn: 1869.6646834	total: 1.76s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.75, score=-1881.220, total=   7.4s
[CV] n_estimators=15, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   49.4s remaining:    0.0s


0:	learn: 2861.5758681	total: 50.4ms	remaining: 706ms
1:	learn: 2425.5712475	total: 156ms	remaining: 1.01s
2:	learn: 2265.0190050	total: 348ms	remaining: 1.39s
3:	learn: 2148.3249091	total: 451ms	remaining: 1.24s
4:	learn: 2098.0268218	total: 559ms	remaining: 1.12s
5:	learn: 2041.9457461	total: 742ms	remaining: 1.11s
6:	learn: 2009.2994891	total: 846ms	remaining: 967ms
7:	learn: 1990.9031156	total: 948ms	remaining: 830ms
8:	learn: 1972.5907464	total: 1.14s	remaining: 760ms
9:	learn: 1954.4557828	total: 1.24s	remaining: 622ms
10:	learn: 1921.1566234	total: 1.35s	remaining: 490ms
11:	learn: 1907.2247437	total: 1.45s	remaining: 363ms
12:	learn: 1889.2549722	total: 1.64s	remaining: 252ms
13:	learn: 1870.3979210	total: 1.74s	remaining: 125ms
14:	learn: 1856.4358504	total: 1.85s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.75, score=-1868.320, total=   7.5s
[CV] n_estimators=15, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   56.9s remaining:    0.0s


0:	learn: 2834.2177178	total: 73.3ms	remaining: 1.03s
1:	learn: 2415.0898992	total: 176ms	remaining: 1.14s
2:	learn: 2254.5850867	total: 279ms	remaining: 1.12s
3:	learn: 2137.6663429	total: 470ms	remaining: 1.29s
4:	learn: 2080.5201488	total: 573ms	remaining: 1.15s
5:	learn: 2033.0901518	total: 679ms	remaining: 1.02s
6:	learn: 1995.6585042	total: 866ms	remaining: 989ms
7:	learn: 1971.1796364	total: 971ms	remaining: 849ms
8:	learn: 1953.6448470	total: 1.07s	remaining: 716ms
9:	learn: 1922.0041314	total: 1.26s	remaining: 631ms
10:	learn: 1905.5869507	total: 1.37s	remaining: 497ms
11:	learn: 1887.8269973	total: 1.47s	remaining: 367ms
12:	learn: 1874.4899301	total: 1.57s	remaining: 242ms
13:	learn: 1860.4039279	total: 1.76s	remaining: 126ms
14:	learn: 1845.0999811	total: 1.86s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.75, score=-1836.100, total=   7.6s
[CV] n_estimators=15, learning_rate=0.75 .............................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.1min remaining:    0.0s


0:	learn: 2830.5108597	total: 70.6ms	remaining: 988ms
1:	learn: 2409.2310440	total: 174ms	remaining: 1.13s
2:	learn: 2249.9705061	total: 365ms	remaining: 1.46s
3:	learn: 2165.4818394	total: 467ms	remaining: 1.28s
4:	learn: 2092.8797564	total: 576ms	remaining: 1.15s
5:	learn: 2052.2372578	total: 681ms	remaining: 1.02s
6:	learn: 2017.1459991	total: 864ms	remaining: 988ms
7:	learn: 1991.3705240	total: 967ms	remaining: 846ms
8:	learn: 1963.5085012	total: 1.07s	remaining: 716ms
9:	learn: 1945.2032732	total: 1.26s	remaining: 631ms
10:	learn: 1919.4991366	total: 1.37s	remaining: 497ms
11:	learn: 1904.7310181	total: 1.47s	remaining: 368ms
12:	learn: 1891.6072562	total: 1.66s	remaining: 255ms
13:	learn: 1884.8161988	total: 1.76s	remaining: 126ms
14:	learn: 1869.7620927	total: 1.87s	remaining: 0us
[CV]  n_estimators=15, learning_rate=0.75, score=-1884.310, total=   7.5s
[CV] n_estimators=15, learning_rate=0.5 ..............................
0:	learn: 3246.7078833	total: 84ms	remaining: 1.18s
1:	l

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  6.5min finished


0:	learn: 2828.7660023	total: 59.3ms	remaining: 1.42s
1:	learn: 2455.7443839	total: 248ms	remaining: 2.85s
2:	learn: 2276.0639037	total: 357ms	remaining: 2.62s
3:	learn: 2174.8998719	total: 549ms	remaining: 2.88s
4:	learn: 2121.0216861	total: 659ms	remaining: 2.64s
5:	learn: 2071.8607803	total: 852ms	remaining: 2.7s
6:	learn: 2027.2965487	total: 958ms	remaining: 2.46s
7:	learn: 1988.5164500	total: 1.15s	remaining: 2.44s
8:	learn: 1956.8783102	total: 1.25s	remaining: 2.23s
9:	learn: 1938.2167870	total: 1.45s	remaining: 2.17s
10:	learn: 1921.0319750	total: 1.56s	remaining: 1.98s
11:	learn: 1907.4844304	total: 1.75s	remaining: 1.89s
12:	learn: 1898.6240839	total: 1.86s	remaining: 1.71s
13:	learn: 1884.4521546	total: 2.06s	remaining: 1.62s
14:	learn: 1871.9938733	total: 2.17s	remaining: 1.44s
15:	learn: 1862.2852177	total: 2.35s	remaining: 1.32s
16:	learn: 1846.6990558	total: 2.45s	remaining: 1.15s
17:	learn: 1839.7884012	total: 2.64s	remaining: 1.03s
18:	learn: 1833.2904967	total: 2.76s	r

__Tuning for CatBoost with OHE__

CatBoost with OHE

Best parameters: {'n_estimators': 25, 'learning_rate': 0.75}

RMSE: 1814.42

CPU times: user 5min 24s, sys: 27.8 s, total: 5min 52s

Wall time: 6min 46s
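
The tuning summaries in this section were copied by hand so the long-running cells do not need to be re-executed. One alternative is to write each summary to a small JSON file and read it back later. The sketch below is only an illustration: the helper names `save_summary` and `load_summary` and the file name are hypothetical, and the values restate the summary above.

```python
import json
import tempfile
from pathlib import Path

# Summary values restated from the tuning results above
summary = {
    "model": "CatBoost with OHE",
    "best_params": {"n_estimators": 25, "learning_rate": 0.75},
    "rmse": 1814.42,
}


def save_summary(summary, path):
    """Write a tuning summary to disk as JSON."""
    Path(path).write_text(json.dumps(summary, indent=2))


def load_summary(path):
    """Read a previously saved tuning summary."""
    return json.loads(Path(path).read_text())


# Hypothetical cache location; any writable path would do
cache_path = Path(tempfile.gettempdir()) / "catboost_ohe_tuning.json"
save_summary(summary, cache_path)

restored = load_summary(cache_path)
print(restored["best_params"], restored["rmse"])
```

In the notebook, `summary` would be filled from `best_model.best_params_` and the computed validation RMSE right after the search finishes.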

In [75]:
%%time
# LightGBM with OHE
lg_model = lgb.LGBMRegressor(random_state=42)
params = { 'n_estimators': range(10, 30, 5), 'learning_rate': [.25, .5, .75] }

best_model = RandomizedSearchCV(lg_model, params, scoring=rmse, cv=5, verbose=10)
best_model.fit(features_train, target_train) 
predictions = best_model.best_estimator_.predict(features_valid)

print('LightGBM with OHE')
print('Best parameters:', best_model.best_params_)
print('RMSE:', round(mean_squared_error(target_valid, predictions) ** 0.5, 2))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_estimators=20, learning_rate=0.25, score=-1819.550, total=  47.8s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   47.8s remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1783.630, total=  56.4s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.7min remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1795.880, total= 1.5min
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.2min remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1772.030, total=  51.5s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  4.0min remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1788.610, total=  43.7s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  4.8min remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1995.890, total=  28.3s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  5.2min remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1949.310, total=  42.8s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  6.0min remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1966.020, total= 1.0min
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  7.0min remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1949.580, total=  27.3s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  7.4min remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1959.310, total=  26.9s
[CV] n_estimators=20, learning_rate=0.5 ..............................
[CV]  n_estimators=20, learning_rate=0.5, score=-1776.350, total= 1.1min
[CV] n_estimators=20, learning_rate=0.5 ..............................
[CV]  n_estimators=20, learning_rate=0.5, score=-1746.610, total= 1.8min
[CV] n_estimators=20, learning_rate=0.5 ..............................
[CV]  n_estimators=20, learning_rate=0.5, score=-1754.180, total=  50.6s
[CV] n_estimators=20, learning_rate=0.5 ..............................
[CV]  n_estimators=20, learning_rate=0.5, score=-1719.620, total= 1.0min
[CV] n_estimators=20, learning_rate=0.5 ..............................
[CV]  n_estimators=20, learning_rate=0.5, score=-1753.660, total=  56.0s
[CV] n_estimators=10, learning_rate=0.5 ..............................
[CV]  n_estimators=10, learning_rate=0.5, score=-1863.720, total=  30.8s
[CV] n_estimators=10, learning_rate=0.5 ......................

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed: 22.9min finished


LightGBM with OHE
Best parameters: {'n_estimators': 25, 'learning_rate': 0.5}
RMSE: 1731.09
CPU times: user 22min 25s, sys: 20.5 s, total: 22min 45s
Wall time: 22min 56s


__Tuning for LightGBM with OHE__

LightGBM with OHE

Best parameters: {'n_estimators': 25, 'learning_rate': 0.5}

RMSE: 1731.08

CPU times: user 3min 14s, sys: 14.1 s, total: 3min 28s

Wall time: 3min 30s
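
The search logs report "Fitting 5 folds for each of 10 candidates, totalling 50 fits". A minimal sketch of where those numbers come from, assuming `RandomizedSearchCV` is left at its default `n_iter=10` (the calls in this notebook do not set it). The names `param_grid`, `all_candidates`, and `total_fits` are illustrative and not part of the notebook.

```python
from itertools import product

# Same search space as the `params` dictionary used in the tuning cells
param_grid = {
    "n_estimators": list(range(10, 30, 5)),  # [10, 15, 20, 25]
    "learning_rate": [0.25, 0.5, 0.75],
}

# Every distinct combination the search could draw from
all_candidates = list(product(*param_grid.values()))

n_iter = 10  # assumption: RandomizedSearchCV default, not set explicitly in the notebook
cv = 5       # number of folds passed to the search

# The search can never sample more candidates than exist in the space
sampled = min(n_iter, len(all_candidates))
total_fits = sampled * cv

print("Combinations available:", len(all_candidates))
print("Candidates sampled:", sampled)
print("Total fits:", total_fits)
```

With only 12 combinations available, the randomized search evaluates 10 of them, so it is close to an exhaustive grid search. The best `n_estimators` reported (25) sits at the upper edge of the range, so a wider range may be worth trying.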


In [76]:
%%time
# LightGBM without OHE
lg_model_2 = lgb.LGBMRegressor(random_state=42)
params = { 'n_estimators': range(10, 30, 5), 'learning_rate': [.25, .5, .75] }

best_model = RandomizedSearchCV(lg_model_2, params, scoring=rmse, cv=5, verbose=10)
best_model.fit(features_train_2, target_train_2, categorical_feature=categories)  
predictions = best_model.best_estimator_.predict(features_valid_2)

print('LightGBM without OHE')
print('Best parameters:', best_model.best_params_)
print('RMSE:', round(mean_squared_error(target_valid_2, predictions) ** 0.5, 2))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_estimators=20, learning_rate=0.25, score=-1722.390, total=   3.1s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.1s remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1692.810, total=   2.1s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.2s remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1700.010, total=   2.0s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    7.2s remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1682.490, total=   2.1s
[CV] n_estimators=20, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    9.4s remaining:    0.0s


[CV]  n_estimators=20, learning_rate=0.25, score=-1704.770, total=   2.3s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   11.7s remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1860.680, total=   1.5s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   13.2s remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1837.940, total=   1.4s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   14.6s remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1833.680, total=   1.4s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   16.0s remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1827.030, total=   1.5s
[CV] n_estimators=10, learning_rate=0.25 .............................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   17.5s remaining:    0.0s


[CV]  n_estimators=10, learning_rate=0.25, score=-1842.050, total=   1.5s
[CV] n_estimators=15, learning_rate=0.5 ..............................
[CV]  n_estimators=15, learning_rate=0.5, score=-1742.990, total=   1.9s
[CV] n_estimators=15, learning_rate=0.5 ..............................
[CV]  n_estimators=15, learning_rate=0.5, score=-1716.690, total=   1.7s
[CV] n_estimators=15, learning_rate=0.5 ..............................
[CV]  n_estimators=15, learning_rate=0.5, score=-1716.080, total=   2.9s
[CV] n_estimators=15, learning_rate=0.5 ..............................
[CV]  n_estimators=15, learning_rate=0.5, score=-1694.330, total=   1.7s
[CV] n_estimators=15, learning_rate=0.5 ..............................
[CV]  n_estimators=15, learning_rate=0.5, score=-1719.580, total=   1.8s
[CV] n_estimators=25, learning_rate=0.25 .............................
[CV]  n_estimators=25, learning_rate=0.25, score=-1701.020, total=   2.5s
[CV] n_estimators=25, learning_rate=0.25 ....................

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.6min finished


LightGBM without OHE
Best parameters: {'n_estimators': 25, 'learning_rate': 0.25}
RMSE: 1693.93
CPU times: user 1min 42s, sys: 372 ms, total: 1min 42s
Wall time: 1min 43s


__Tuning for LightGBM without OHE__

LightGBM without OHE

Best parameters: {'n_estimators': 25, 'learning_rate': 0.25}

RMSE: 1693.92

CPU times: user 4min 59s, sys: 960 ms, total: 5min

Wall time: 5min 3s

__Fit models with best parameters__

- We will use the combined data from the train and valid datasets because that gives us a slightly better RMSE score (we compared previously).
- We will compare tuned and untuned versions.
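
The cells that follow repeat the same pattern for each model: time the fit, time the prediction, compute the RMSE. A minimal sketch of a helper that wraps this pattern is shown here. The names `fit_predict_timed` and `MeanRegressor` are hypothetical and do not appear in the notebook, `MeanRegressor` is only a stand-in so the example runs without external libraries, and `time.perf_counter()` is swapped in for the notebook's `time.time()`.

```python
import math
import time


def fit_predict_timed(model, X_train, y_train, X_test, y_test, **fit_kwargs):
    """Fit `model`, predict on X_test, and return RMSE plus wall-clock timings."""
    start = time.perf_counter()
    model.fit(X_train, y_train, **fit_kwargs)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_time = time.perf_counter() - start

    # Root mean squared error computed by hand to keep the sketch dependency-free
    mse = sum((t - p) ** 2 for t, p in zip(y_test, predictions)) / len(y_test)
    return {"rmse": math.sqrt(mse), "train_time": train_time, "predict_time": predict_time}


class MeanRegressor:
    """Hypothetical stand-in model: always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


result = fit_predict_timed(
    MeanRegressor(),
    [[1], [2], [3]], [10.0, 20.0, 30.0],  # toy training data
    [[4], [5]], [20.0, 22.0],             # toy test data
)
print(result)
```

Any estimator exposing `fit` and `predict` should work in place of the stand-in, and extra keyword arguments such as `cat_features=categories` would be forwarded to `fit`.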

In [77]:
%%time
# random forest regressor with test data
# Best parameters: {'n_estimators': 25}
rf_test = RandomForestRegressor(random_state=42, n_estimators = 25)

start = time.time()
# fit with train and valid data
rf_test.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
end = time.time()
test_rftt = end - start

start = time.time()
test_pred = rf_test.predict(features_test)
end = time.time()
test_rfpt = end - start

test_rf_rmse_calc = mean_squared_error(target_test, test_pred)**0.5

print('Random Forest Regressor with test data')
print('RMSE:', test_rf_rmse_calc, 'Training time:', test_rftt, 'Prediction time:', test_rfpt)

Random Forest Regressor with test data
RMSE: 1660.958400832672 Training time: 220.39202785491943 Prediction time: 1.481811285018921
CPU times: user 3min 37s, sys: 331 ms, total: 3min 37s
Wall time: 3min 41s


In [78]:
%%time
# CatBoost with OHE with test data
# Best parameters: {'n_estimators': 25, 'learning_rate': 0.75}
cb_test = CatBoostRegressor(random_state=42, n_estimators=25, learning_rate=0.75)

start = time.time()
# use combined train and valid datasets to fit on
cb_test.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
end = time.time()
test_cbohett = end - start

start = time.time()
test_pred = cb_test.predict(features_test)
end = time.time()
test_cbohept = end - start

test_cbohe_rmse_calc = mean_squared_error(target_test, test_pred)**0.5

print('CatBoost with OHE with test data')
print('RMSE:', test_cbohe_rmse_calc, 'Training time:', test_cbohett, 'Prediction time:', test_cbohept)

0:	learn: 2799.7902845	total: 41.6ms	remaining: 999ms
1:	learn: 2392.2788678	total: 242ms	remaining: 2.78s
2:	learn: 2242.0372998	total: 438ms	remaining: 3.21s
3:	learn: 2135.8989092	total: 635ms	remaining: 3.33s
4:	learn: 2074.0915844	total: 830ms	remaining: 3.32s
5:	learn: 2025.1557285	total: 1.03s	remaining: 3.26s
6:	learn: 1995.3671545	total: 1.23s	remaining: 3.15s
7:	learn: 1962.5786932	total: 1.42s	remaining: 3.03s
8:	learn: 1936.3794134	total: 1.54s	remaining: 2.73s
9:	learn: 1925.6442519	total: 1.73s	remaining: 2.6s
10:	learn: 1914.9244248	total: 1.92s	remaining: 2.45s
11:	learn: 1897.7737444	total: 2.12s	remaining: 2.3s
12:	learn: 1884.3269147	total: 2.32s	remaining: 2.14s
13:	learn: 1870.6201454	total: 2.43s	remaining: 1.91s
14:	learn: 1860.3036252	total: 2.63s	remaining: 1.75s
15:	learn: 1845.2119689	total: 2.83s	remaining: 1.59s
16:	learn: 1838.5330180	total: 3.02s	remaining: 1.42s
17:	learn: 1830.0865712	total: 3.22s	remaining: 1.25s
18:	learn: 1822.9739171	total: 3.42s	re

In [79]:
%%time
# CatBoost without initial one hot encoding
# Best parameters: {'n_estimators': 25, 'learning_rate': 0.75}

start = time.time()
# use combined train and valid datasets to fit on
cb_test_2 = CatBoostRegressor(random_state=42, n_estimators=25, learning_rate=0.75)
cb_test_2.fit(pd.concat([features_train_2, features_valid_2]), pd.concat([target_train_2, target_valid_2]), cat_features=categories)
end = time.time()
test_cbtt = end - start

start = time.time()
test_pred = cb_test_2.predict(features_test_2)
end = time.time()
test_cbpt = end - start

test_cb_rmse=mean_squared_error(target_test, test_pred)**0.5

print('CatBoost without OHE with test data')
print('RMSE:',test_cb_rmse, 'Training time:', test_cbtt, 'Prediction time:', test_cbpt) 

0:	learn: 2821.8166658	total: 294ms	remaining: 7.04s
1:	learn: 2344.6451715	total: 697ms	remaining: 8.01s
2:	learn: 2208.1949750	total: 1.09s	remaining: 7.96s
3:	learn: 2123.3551953	total: 1.48s	remaining: 7.77s
4:	learn: 2081.9071319	total: 1.78s	remaining: 7.14s
5:	learn: 2033.5083635	total: 2.18s	remaining: 6.9s
6:	learn: 1985.3326815	total: 2.48s	remaining: 6.39s
7:	learn: 1960.6963974	total: 2.88s	remaining: 6.12s
8:	learn: 1931.4359357	total: 3.19s	remaining: 5.68s
9:	learn: 1912.6610977	total: 3.58s	remaining: 5.37s
10:	learn: 1902.6895904	total: 3.88s	remaining: 4.94s
11:	learn: 1890.3243957	total: 4.19s	remaining: 4.54s
12:	learn: 1876.5431975	total: 4.57s	remaining: 4.22s
13:	learn: 1867.0785279	total: 4.88s	remaining: 3.83s
14:	learn: 1852.2333265	total: 5.27s	remaining: 3.52s
15:	learn: 1836.7261889	total: 5.58s	remaining: 3.14s
16:	learn: 1831.1213107	total: 5.88s	remaining: 2.77s
17:	learn: 1819.1362963	total: 6.27s	remaining: 2.44s
18:	learn: 1810.9800457	total: 6.57s	re

In [80]:
%%time
# lightGBM with OHE with test data
# Best parameters: {'n_estimators': 25, 'learning_rate': 0.5}
lgohe_test = lgb.LGBMRegressor(random_state=42, n_estimators=25, learning_rate=0.5)

start = time.time()
# fit model on combined data from train and valid
lgohe_test.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
end = time.time()
test_lgohett = end - start

start = time.time()
test_pred = lgohe_test.predict(features_test)
end = time.time()
test_lgohept = end - start

test_lgohe_rmse_calc = mean_squared_error(target_test, test_pred)**0.5

print('LightGBM with OHE with test data')
print('RMSE:', test_lgohe_rmse_calc, 'Training time:', test_lgohett, 'Prediction time:', test_lgohept)

LightGBM with OHE with test data
RMSE: 1726.3591713044495 Training time: 5.873456001281738 Prediction time: 0.5961816310882568
CPU times: user 6.02 s, sys: 391 ms, total: 6.41 s
Wall time: 6.47 s


In [81]:
%%time
# lightGBM without OHE with test data
# Best parameters: {'n_estimators': 25, 'learning_rate': 0.25}
lg_test = lgb.LGBMRegressor(random_state=42, n_estimators=25, learning_rate=0.25)

start = time.time()
# fit on combined data from train and valid
lg_test.fit(pd.concat([features_train_2, features_valid_2]), pd.concat([target_train_2, target_valid_2]), categorical_feature=categories)
end = time.time()
test_lgtt = end - start

start = time.time()
test_pred = lg_test.predict(features_test_2)
end = time.time()
test_lgpt = end - start

test_lg_rmse_calc = mean_squared_error(target_test_2, test_pred)**0.5

print('LightGBM without OHE')
print('RMSE:', test_lg_rmse_calc, 'Training time:', test_lgtt, 'Prediction time:', test_lgpt)

LightGBM without OHE
RMSE: 1680.7187262024022 Training time: 3.1926093101501465 Prediction time: 0.29376220703125
CPU times: user 3.46 s, sys: 5.31 ms, total: 3.46 s
Wall time: 3.49 s


In [82]:
# linear regression base model with test data
start = time.time()
predicted_test = lr_model.predict(features_test)
end = time.time()
test_lrpt = end - start

test_lr_rmse_calc = mean_squared_error(target_test, predicted_test)**0.5

print('Linear Regression - Sanity Check on test data')
print('RMSE:', test_lr_rmse_calc, 'Prediction time:', test_lrpt)

Linear Regression - Sanity Check on test data
RMSE: 2669.1918845974697 Prediction time: 0.1891636848449707


## Model analysis

In [83]:
print('Results with test data of base models\n')

print('Random Forest Regressor')
print('RMSE:', rf_rmse_calc, 'Training time:', rftt, 'Prediction time:', rfpt)
print('\nLightGBM with OHE')
print('RMSE:', lgohe_rmse_calc, 'Training time:', lgohett, 'Prediction time:', lgohept)
print('\nLightGBM without OHE')
print('RMSE:', lg_rmse_calc, 'Training time:', lgtt, 'Prediction time:', lgpt)
print('\nCatBoost with OHE')
print('RMSE:', cbohe_rmse_calc, 'Training time:', cbohett, 'Prediction time:', cbohept)
print('\nCatBoost without OHE')
print('RMSE:', cb_rmse_calc, 'Training time:', cbtt, 'Prediction time:', cbpt)
print('\nLinear Regression - Sanity Check')
print('RMSE:', lr_rmse_calc, 'Training time:', lrtt, 'Prediction time:', lrpt)

Results with test data of base models

Random Forest Regressor
RMSE: 1727.9087596013774 Training time: 60.69512057304382 Prediction time: 0.6225581169128418

LightGBM with OHE
RMSE: 1707.641861082534 Training time: 10.578785419464111 Prediction time: 1.1029925346374512

LightGBM without OHE
RMSE: 1643.351067294688 Training time: 8.479040384292603 Prediction time: 0.9047174453735352

CatBoost with OHE
RMSE: 1703.817876905744 Training time: 153.08609819412231 Prediction time: 0.21335506439208984

CatBoost without OHE
RMSE: 1683.0274262756113 Training time: 501.59921288490295 Prediction time: 0.4114964008331299

Linear Regression - Sanity Check
RMSE: 2673.154220556299 Training time: 19.875536680221558 Prediction time: 0.2535219192504883


In [84]:
print('Results with test data of tuned models\n')

print('Random Forest Regressor with test data')
print('RMSE:', test_rf_rmse_calc, 'Training time:', test_rftt, 'Prediction time:', test_rfpt)

print('\nCatBoost with OHE with test data')
print('RMSE:', test_cbohe_rmse_calc, 'Training time:', test_cbohett, 'Prediction time:', test_cbohept)

print('\nCatBoost without OHE with test data')
print('RMSE:',test_cb_rmse, 'Training time:', test_cbtt, 'Prediction time:', test_cbpt)

print('\nLightGBM with OHE with test data')
print('RMSE:', test_lgohe_rmse_calc, 'Training time:', test_lgohett, 'Prediction time:', test_lgohept)

print('\nLightGBM without OHE')
print('RMSE:', test_lg_rmse_calc, 'Training time:', test_lgtt, 'Prediction time:', test_lgpt)

print('\nLinear Regression - Sanity Check on test data')
print('RMSE:', test_lr_rmse_calc, 'Prediction time:', test_lrpt)

Results with test data of tuned models

Random Forest Regressor with test data
RMSE: 1660.958400832672 Training time: 220.39202785491943 Prediction time: 1.481811285018921

CatBoost with OHE with test data
RMSE: 1792.644534898325 Training time: 11.937709331512451 Prediction time: 0.015808820724487305

CatBoost without OHE with test data
RMSE: 1790.8160057257508 Training time: 10.462419748306274 Prediction time: 0.058998823165893555

LightGBM with OHE with test data
RMSE: 1726.3591713044495 Training time: 5.873456001281738 Prediction time: 0.5961816310882568

LightGBM without OHE
RMSE: 1680.7187262024022 Training time: 3.1926093101501465 Prediction time: 0.29376220703125

Linear Regression - Sanity Check on test data
RMSE: 2669.1918845974697 Prediction time: 0.1891636848449707


__We were tasked with finding the RMSE, the training time and the prediction time__


- Tuning the Random Forest Regressor model did improve the goodness of fit as evidenced by a lower RMSE, but tuning also more than doubled the time for predictions and training.
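- The RMSE for CatBoost and LightGBM did not benefit from our tuning: every tuned variant scored worse than its untuned counterpart. The search only covered `n_estimators` from 10 to 25, which may be too few trees for these models.
- However, the tuned CatBoost and LightGBM models were much faster. Training and prediction times dropped for all four variants (for example, CatBoost with OHE went from about 153 s to 12 s for training and from 0.21 s to 0.02 s for prediction).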
- Among the untuned models, LightGBM without OHE has the best RMSE (1643.35). Among the tuned models it is second (1680.72), behind Random Forest (1660.96). In both comparisons it beats LightGBM with OHE, which matches the [documentation of LightGBM](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html). It states LightGBM "offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories as described here. This often performs better than one-hot encoding."
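- LightGBM, both with and without OHE, had the fastest training times (between about 3 s and 11 s across the tuned and untuned runs). Untuned CatBoost without OHE had the slowest (about 502 s).
- Among the tree-based models, CatBoost had the lowest prediction times in both comparisons. Tuning also lowered LightGBM's prediction times, from 1.10 s to 0.60 s with OHE and from 0.90 s to 0.29 s without.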
- While the tuned Random Forest Regressor has the lowest RMSE among the tuned models, it also has the highest prediction time, more than double the second longest, and by far the longest training time (about 220 s). Its slight RMSE gain over LightGBM without OHE doesn't compensate for the large increase in prediction and training time.
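
The rankings discussed in this list can be checked directly from the tuned-model output printed earlier. The sketch below copies those figures (rounded) into a plain list and sorts them by each criterion. The variable names are illustrative and not part of the notebook.

```python
# (model, RMSE, training time in s, prediction time in s)
# Values rounded from the "Results with test data of tuned models" output
tuned_results = [
    ("Random Forest", 1660.96, 220.39, 1.482),
    ("CatBoost with OHE", 1792.64, 11.94, 0.016),
    ("CatBoost without OHE", 1790.82, 10.46, 0.059),
    ("LightGBM with OHE", 1726.36, 5.87, 0.596),
    ("LightGBM without OHE", 1680.72, 3.19, 0.294),
]

# Lower is better for all three criteria, so an ascending sort ranks best first
by_rmse = sorted(tuned_results, key=lambda row: row[1])
by_train_time = sorted(tuned_results, key=lambda row: row[2])
by_predict_time = sorted(tuned_results, key=lambda row: row[3])

for title, ranking in [
    ("RMSE", by_rmse),
    ("Training time", by_train_time),
    ("Prediction time", by_predict_time),
]:
    print(title + ":", " < ".join(row[0] for row in ranking))
```

Sorting this way makes the trade-off visible: no single model is best on all three criteria, so the choice depends on how Rusty Bargain weighs quality against speed.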

__Recommendation__

- We recommend the tuned LightGBM without One Hot Encoding. It has the 3rd lowest RMSE overall (1680.72), but it also has the fastest training time, just over 3 seconds, and a prediction time of about 0.3 seconds.

[Timing cells](https://stackoverflow.com/questions/52738709/how-to-store-time-values-in-a-variable-in-jupyter)


[LightGBM often performs better without OHE](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html)
