<div class="alert alert-success">
<b>Reviewer's comment V4</b>

You did some really nice work and the project is now accepted. Good luck on the next sprint! :)

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are some problems that need to be fixed before the project is accepted. Let me know if you have any questions!

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

#### Importing Modules

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from catboost import CatBoostRegressor
import xgboost as xgb
from xgboost import XGBRegressor
import lightgbm as lgb

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

import time

## Data preparation

#### Loading in the dataframe

In [2]:
df = pd.read_csv('/datasets/car_data.csv')

#### Viewing Dataframe

In [3]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


#### Looking at data types

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

#### Viewing Duplications

In [5]:
df.duplicated().sum()

262

#### Dropping Duplications

In [6]:
df = df.drop_duplicates()

#### Sanity Check for Duplications

In [7]:
df.duplicated().sum()

0

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good!

</div>

#### Viewing missing values in dataframe

In [8]:
df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37484
RegistrationYear         0
Gearbox              19830
Power                    0
Model                19701
Mileage                  0
RegistrationMonth        0
FuelType             32889
Brand                    0
NotRepaired          71145
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

#### Viewing the unique values in the columns with missing values to decide how to handle them

In [9]:
df.FuelType.unique()

array(['petrol', 'gasoline', nan, 'lpg', 'other', 'hybrid', 'cng',
       'electric'], dtype=object)

In [10]:
df.Model.unique()

array(['golf', nan, 'grand', 'fabia', '3er', '2_reihe', 'other', 'c_max',
       '3_reihe', 'passat', 'navara', 'ka', 'polo', 'twingo', 'a_klasse',
       'scirocco', '5er', 'meriva', 'arosa', 'c4', 'civic', 'transporter',
       'punto', 'e_klasse', 'clio', 'kadett', 'kangoo', 'corsa', 'one',
       'fortwo', '1er', 'b_klasse', 'signum', 'astra', 'a8', 'jetta',
       'fiesta', 'c_klasse', 'micra', 'vito', 'sprinter', '156', 'escort',
       'forester', 'xc_reihe', 'scenic', 'a4', 'a1', 'insignia', 'combo',
       'focus', 'tt', 'a6', 'jazz', 'omega', 'slk', '7er', '80', '147',
       '100', 'z_reihe', 'sportage', 'sorento', 'v40', 'ibiza', 'mustang',
       'eos', 'touran', 'getz', 'a3', 'almera', 'megane', 'lupo', 'r19',
       'zafira', 'caddy', 'mondeo', 'cordoba', 'colt', 'impreza',
       'vectra', 'berlingo', 'tiguan', 'i_reihe', 'espace', 'sharan',
       '6_reihe', 'panda', 'up', 'seicento', 'ceed', '5_reihe', 'yeti',
       'octavia', 'mii', 'rx_reihe', '6er', 'modus', 'fox'

In [11]:
df.NotRepaired.unique()

array([nan, 'yes', 'no'], dtype=object)

In [12]:
df.Gearbox.unique()

array(['manual', 'auto', nan], dtype=object)

In [13]:
df.VehicleType.unique()

array([nan, 'coupe', 'suv', 'small', 'sedan', 'convertible', 'bus',
       'wagon', 'other'], dtype=object)

#### Filling the null values in the dataframe with 'Unknown'

Dropping every row with a missing value would discard more than 20% of the data (NotRepaired alone is missing in 71,145 of 354,107 rows), so I filled the nulls instead. Filling with 'Unknown' gives each affected column an extra category for the models to consider when predicting prices.

In [14]:
df = df.fillna('Unknown')

In [15]:
df.isna().sum()

DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Ok, so you encountered missing values. But how are you going to deal with them?
    
</div>

<div class="alert alert-info">
I should have taken care of the missing values the first time. Thanks for the heads up. I decided to drop most of them because of the amount of missing values that are in the columns. If I fill in with averages, modes, or other then I run the risk of skewing the data.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Ok, that's one way to move forward :)

</div>

#### Looking at the statistics for the integer columns of the dataframe

Every listing has a value of 0 in NumberOfPictures, which is abnormal. I would want to ask the company why no pictures are recorded for any vehicle, since it would be difficult to sell a vehicle without pictures.

In [16]:
df.NumberOfPictures.unique()

array([0])

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>

That's a great point! As this feature has the same value in all rows, it is completely useless for the model, so it can be dropped.

</div>
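
A minimal sketch of how constant columns like this one could be detected and dropped. The toy frame and the names `toy`, `constant_cols`, and `toy_reduced` are assumptions for illustration; only the `NumberOfPictures` column comes from the notebook.

```python
import pandas as pd

# Toy frame standing in for the car data (values are illustrative only)
toy = pd.DataFrame({
    "Price": [480, 18300, 9800],
    "Power": [0, 190, 163],
    "NumberOfPictures": [0, 0, 0],
})

# A column with a single distinct value carries no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.columns.tolist())
```

In the notebook itself, the same check could be run on `df` before the features are selected.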

I also have questions about the maximum price. Does the website cap the price a vehicle can be listed for? The maximum of 20,000 euros is still rather cheap for a vehicle.

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>

Indeed! Also, there are some strange power values (the maximum value goes way above any car actually in production)

</div>

In [17]:
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354107.0,354107.0,354107.0,354107.0,354107.0,354107.0,354107.0
mean,4416.433287,2004.235355,110.089651,128211.811684,5.714182,0.0,50507.14503
std,4514.338584,90.261168,189.914972,37906.590101,3.726682,0.0,25784.212094
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49406.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


<div class="alert alert-warning">
<b>Reviewer's comment</b>

Does everything look fine here?

</div>
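
The `describe()` output shows registration years from 1000 to 9999, power up to 20000, and prices of 0. One possible way to filter such rows is sketched below. The bounds (`MIN_YEAR`, `MAX_YEAR`, `MAX_POWER`, `MIN_PRICE`) are illustrative assumptions, not values taken from the notebook, and the toy frame only mimics the real columns.

```python
import pandas as pd

# Toy rows that mimic the extremes seen in df.describe()
toy = pd.DataFrame({
    "Price": [0, 1500, 3600, 9800],
    "RegistrationYear": [1000, 2001, 2008, 9999],
    "Power": [75, 75, 20000, 163],
})

# Illustrative bounds (assumptions; they should be chosen after inspecting the data)
MIN_YEAR, MAX_YEAR = 1950, 2016
MAX_POWER = 1000
MIN_PRICE = 1

# Keep only rows where every checked column falls inside its plausible range
mask = (
    toy["RegistrationYear"].between(MIN_YEAR, MAX_YEAR)
    & toy["Power"].between(0, MAX_POWER)
    & (toy["Price"] >= MIN_PRICE)
)
toy_clean = toy[mask].reset_index(drop=True)
print(toy_clean)
```

Whether to drop such rows or to treat the values as missing is a judgment call that depends on how many rows are affected.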

#### Looking at the shape of the dataframe

In [18]:
df.shape

(354107, 16)

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Is that necessary if we're going to drop these columns anyway? :)

</div>

<div class="alert alert-info">
You are correct. There isn't really a reason to convert the date columns.
</div>

#### Applying OrdinalEncoder to the following columns so the machine learning models can run. Ordinal encoding is used because these columns hold categorical values.

In [19]:
ordinal_columns = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

# Encode all categorical columns in a single call; each column gets its own set of codes
ordinal_encoder = OrdinalEncoder()
df[ordinal_columns] = ordinal_encoder.fit_transform(df[ordinal_columns])


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   DateCrawled        354107 non-null  object 
 1   Price              354107 non-null  int64  
 2   VehicleType        354107 non-null  float64
 3   RegistrationYear   354107 non-null  int64  
 4   Gearbox            354107 non-null  float64
 5   Power              354107 non-null  int64  
 6   Model              354107 non-null  float64
 7   Mileage            354107 non-null  int64  
 8   RegistrationMonth  354107 non-null  int64  
 9   FuelType           354107 non-null  float64
 10  Brand              354107 non-null  float64
 11  NotRepaired        354107 non-null  float64
 12  DateCreated        354107 non-null  object 
 13  NumberOfPictures   354107 non-null  int64  
 14  PostalCode         354107 non-null  int64  
 15  LastSeen           354107 non-null  object 
dtypes: float64(6), int64(7), object(3)

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b>

Unfortunately there is now a problem that ordinal encoding is applied to all columns including numerical (and even the target, price, which leads to overly optimistic RMSE values)

</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Fixed! 

</div>

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>
    
By the way OrdinalEncoder can work with multiple columns at once. Suppose `columns` is a list of categorical column names, then you can do something like:
    
```python
encoder = OrdinalEncoder()
df.loc[:, columns] =  encoder.fit_transform(df.loc[:, columns])
```

</div>

In [21]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,0.0,1993,2.0,0,117.0,150000,0,7.0,38.0,0.0,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,3.0,2011,2.0,190,26.0,125000,5,3.0,1.0,2.0,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,7.0,2004,1.0,163,118.0,125000,8,3.0,14.0,0.0,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,6.0,2001,2.0,75,117.0,150000,6,7.0,38.0,1.0,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,6.0,2008,2.0,69,102.0,90000,7,3.0,31.0,1.0,31/03/2016 00:00,0,60437,06/04/2016 10:17


<div class="alert alert-warning">
<b>Reviewer's comment</b>

`LabelEncoder` is intended to be used to encode targets. To encode categorical features as integers, you can use [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)

</div>

#### Viewing the dataframe to confirm the changes made by the Ordinal Encoding

Viewing the dataframe info to make sure the categorical columns were all converted to numeric types so that our ML algorithms can run

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   DateCrawled        354107 non-null  object 
 1   Price              354107 non-null  int64  
 2   VehicleType        354107 non-null  float64
 3   RegistrationYear   354107 non-null  int64  
 4   Gearbox            354107 non-null  float64
 5   Power              354107 non-null  int64  
 6   Model              354107 non-null  float64
 7   Mileage            354107 non-null  int64  
 8   RegistrationMonth  354107 non-null  int64  
 9   FuelType           354107 non-null  float64
 10  Brand              354107 non-null  float64
 11  NotRepaired        354107 non-null  float64
 12  DateCreated        354107 non-null  object 
 13  NumberOfPictures   354107 non-null  int64  
 14  PostalCode         354107 non-null  int64  
 15  LastSeen           354107 non-null  object 
dtypes: float64(6), int64(7), object(3)

<div class="alert alert-warning">
<b>Reviewer's comment</b>

The EDA/preprocessing would look better if you added more comments about what you found out, how you're going to deal with problems and so on. Also it's a good idea to plot histograms of numerical features.

</div>

## Model training

#### Creating the Features and the Target

In [23]:
X = df.drop(['Price', 'DateCreated', 'LastSeen', 'DateCrawled'], axis=1)
y = df['Price']

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, I agree with dropping the dates: they don't really have anything to do with the price

</div>

#### Using Train Test Split on the Features and Target

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12345)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train and test sets

</div>

#### Sanity Checking the shape of the X_Train feature

In [25]:
X_train.shape

(283285, 12)

#### Sanity Checking the shape of the X_Test feature

In [26]:
X_test.shape

(70822, 12)

#### Creating a function for Root Mean Squared Error 

In [27]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
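
As a quick check of what this function computes, here is a small worked example with made-up prices. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and are not part of the car data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Square root of the mean squared error, in the same units as the target
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical prices in euros
y_true_demo = np.array([1000.0, 2000.0, 3000.0])
y_pred_demo = np.array([1100.0, 1900.0, 3200.0])

# Squared errors are 10000, 10000 and 40000; their mean is 20000
demo_rmse = rmse(y_true_demo, y_pred_demo)
print(demo_rmse)  # square root of 20000, roughly 141.42
```

Because RMSE is in the units of the target, the values reported below can be read as a typical error in euros.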

#### Creating a Linear Regression Model for a sanity check against the other models

In [28]:
# Start time
st = time.time()

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_predict = lr_model.predict(X_test)
lr_rmse = rmse(y_test, lr_predict)
print(f'RMSE Linear Regression Model: {lr_rmse}.')

# End time
et = time.time()
lr_elapsed_time = et - st

print('Execution time:', lr_elapsed_time, 'seconds')


RMSE Linear Regression Model: 4077.3124109385144.
Execution time: 0.1916966438293457 seconds


#### Creating a Random Forest Regression Model with default hyperparameters

In [29]:
# Start time
st = time.time()

rfr_model = RandomForestRegressor(random_state=12345, n_jobs = -1)
rfr_model.fit(X_train, y_train)
rfr_predict = rfr_model.predict(X_test)
rfr_rmse = rmse(y_test, rfr_predict)
print(f'RMSE Random Forest Regression Model: {rfr_rmse}.')

# End time
et = time.time()
rfr_elapsed_time = et - st

print('Execution time:', rfr_elapsed_time, 'seconds')


RMSE Random Forest Regression Model: 1728.655775513481.
Execution time: 189.5502233505249 seconds


#### Used Light GBM to try to lower the RMSE. With these hyperparameters the RMSE came out higher than the linear regression baseline, so I tried other models.

In [30]:
# Start time
st = time.time()

lgbm_reg = LGBMRegressor(learning_rate=0.001, sub_feature=0.5, num_leaves=500, max_depth=10)
lgbm_reg.fit(X_train, y_train)
lgbm_pred = lgbm_reg.predict(X_test)
lgbm_rmse = rmse(y_test, lgbm_pred)
print('')
print(f'RMSE LightGBM: {lgbm_rmse}.')

# End time
et = time.time()
light_elapsed_time = et - st

print('Execution time:', light_elapsed_time, 'seconds')


RMSE LightGBM: 4249.445050229711.
Execution time: 18.931505918502808 seconds


<div class="alert alert-warning">
<b>Reviewer's comment</b>

Hyperparameter choices seem pretty odd (5 leaves?), no wonder the model performs poorly :)

</div>

<div class="alert alert-info">
Now that I have tuned the model to have 200 leaves the LightGBM is performing at one of the higher levels of all of my models.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Right! So we should be careful when tuning hyperparameters on one hand not to make them too restrictive (so that the model would underfit), on the other hand not to make the model too complex (which would lead to overfitting). 

</div>
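
The trade-off described above can be illustrated with a small sketch on synthetic data. It uses a `DecisionTreeRegressor` rather than LightGBM to stay self-contained, and the data, the depths, and the name `depth_results` are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem with noise
rng = np.random.default_rng(12345)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345
)

# Compare training and validation RMSE for a shallow, a medium and an unrestricted tree
depth_results = {}
for depth in [1, 4, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=12345).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
    depth_results[depth] = (train_rmse, val_rmse)
    print(depth, round(train_rmse, 3), round(val_rmse, 3))
```

A very shallow tree tends to have high error on both sets (underfitting), while an unrestricted tree drives the training error toward zero without a matching gain on the validation set (overfitting).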

#### Cat Boost Regressor Model with an RMSE of 1922.71

In [31]:
# Start time
st = time.time()

model = CatBoostRegressor(iterations=40, random_seed=12345, silent=True)
model.fit(X_train, y_train)
cat_pred = model.predict(X_test)
cat_rmse = rmse(y_test, cat_pred)
print(f'RMSE Cat Boost Regression Model: {cat_rmse}.')

# End time
et = time.time()
cat_elapsed_time = et - st

print('Execution time:', cat_elapsed_time, 'seconds')

RMSE Cat Boost Regression Model: 1922.7097789115803.
Execution time: 2.955890655517578 seconds


#### Created an XG Boost Model with an RMSE of 1689.97

In [32]:
# Start time
st = time.time()

xgb_model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)
xgb_rmse = rmse(y_test, xgb_pred)
print(f'RMSE XG Boost: {xgb_rmse}.')

# End time
et = time.time()
xgb_elapsed_time = et - st

print('Execution time:', xgb_elapsed_time, 'seconds')

RMSE XG Boost: 1689.9684758584876.
Execution time: 621.1642436981201 seconds


#### Creating an Ada Boosting Regressor to try and see if we can get a lower RMSE

In [33]:
# Start time
st = time.time()

adaboost = AdaBoostRegressor(random_state=12345, n_estimators=500, learning_rate=0.001)
ada_model = adaboost.fit(X_train, y_train)
ada_pred = ada_model.predict(X_test)
ada_rmse = rmse(y_test, ada_pred)
print(f'RMSE Ada Boost: {ada_rmse}.')

# End time
et = time.time()
ada_elapsed_time = et - st

print('Execution time:', ada_elapsed_time, 'seconds')

RMSE Ada Boost: 2990.4572918977315.
Execution time: 324.3500123023987 seconds


#### Hyperparameter tuning with GridSearchCV on the CatBoostRegression model. 

In [34]:
# Start time
st = time.time()

param_grid =  {'depth'         : [6,8],
               'learning_rate' : [0.01, 0.05, 0.1],
               'iterations'    : [50]
            }

model = CatBoostRegressor(verbose= False)

grid_search = GridSearchCV(model, param_grid, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
model = grid_search.best_estimator_
score = grid_search.best_score_
print('%s\tHP\t%s\t%f' % ("R" , str(best_params), abs(score)))

# End time

et = time.time()
grid_elapsed_time = et - st

print('Execution time:', grid_elapsed_time, 'seconds')

R	HP	{'depth': 8, 'iterations': 50, 'learning_rate': 0.1}	2012.262855
Execution time: 102.27638459205627 seconds


#### Cat Boost Regression without hyperparameters

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Great, you tried a couple of different models and compared their results using the test set.
    
Some problems here:
    
1. `timeit.timeit()` measures the time it takes to run the function given as argument. As you gave it no arguments, it doesn't run anything. One way to fix this is by using line/cell magics (here's a nice [guide](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html))
    
2. Please add some hyperparameter tuning (i.e. try at least two different sets of hyperparameters for at least one model). Maybe you manually adjusted hyperparameters based on the test set performance? That is problematic as it can lead to overfitting to the test set. To tune hyperparameters you need a separate validation set or cross-validation (the test set is then only used to evaluate the final model).
    


</div>

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Also, it would be nice to separately measure fitting and prediction time

</div>
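
One way to measure fitting and prediction time separately is sketched below. The helper name `timed_fit_predict` and the synthetic data are assumptions for illustration; in the notebook the helper would be called with `X_train`, `y_train`, and `X_test`.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

def timed_fit_predict(model, X_train, y_train, X_test):
    """Return predictions plus separate fit and predict durations in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return predictions, fit_seconds, predict_seconds

# Synthetic data standing in for the car features and prices
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

preds, fit_s, pred_s = timed_fit_predict(
    LinearRegression(), X_demo[:150], y_demo[:150], X_demo[150:]
)
print(f"fit: {fit_s:.4f} s, predict: {pred_s:.4f} s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.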

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Very good, time is now measured correctly!
    
</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b>

The problem with hyperparameter tuning still stands though: I still can't see in the code where you're trying different hyperparameters using cross-validation or a validation set.
    
I see you're evaluating two models using cross-validation, but there is no hyperparameter tuning in the code itself (it looks like you adjusted lightgbm's hyperparameters from the previous iteration, but again, it was done based on the test set score, which is a problematic practice as it can lead to overfitting the model to the test set).
    
What I'm asking is to either just add a for loop which trains and evaluates two models of the same type with different hyperparameter values using cross-validation or use something like [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

</div>
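
The for-loop variant suggested above could look roughly like this. It is a sketch on synthetic data: the two hyperparameter sets in `candidate_params` are arbitrary assumptions, and in the notebook the loop would run on `X_train` and `y_train` so that the test set stays untouched until the final evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the training set
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] ** 2 + 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Two hypothetical hyperparameter sets for the same model type
candidate_params = [
    {"n_estimators": 20, "max_depth": 3},
    {"n_estimators": 20, "max_depth": 8},
]

cv_results = []
for params in candidate_params:
    model = RandomForestRegressor(random_state=12345, **params)
    # The scorer returns negated RMSE, so flip the sign to get RMSE back
    scores = cross_val_score(
        model, X_demo, y_demo, cv=3, scoring="neg_root_mean_squared_error"
    )
    cv_results.append((params, -scores.mean()))

# Pick the set with the lowest mean cross-validated RMSE
best_params, best_rmse = min(cv_results, key=lambda item: item[1])
print(best_params, round(best_rmse, 3))
```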

<div class="alert alert-info">
  I think the hyperparameter tuning has now been taken care of, but it still leads to a worse score than XGBoost. In the field, is it smarter to find the best model first and only then tune its hyperparameters, or should every model be tuned? The latter seems time consuming.
</div>

<div class="alert alert-warning">
<b>Reviewer's comment V3</b>

The defaults can work pretty well a lot of the time, but tuning hyperparameters can often help improve the results at least somewhat. As you noted, a grid search can be very time consuming, so in practice something like [optuna](https://optuna.org/) is used, which utilizes randomization (more specifically [bayesian optimization](https://distill.pub/2020/bayesian-optimization/)).

</div>
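
As a simpler stand-in for the idea of not trying every combination, the sketch below uses scikit-learn's `RandomizedSearchCV`. This is plain random sampling, not Optuna and not Bayesian optimization; the model, the parameter values, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training set
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# 16 possible combinations, of which only 4 are sampled and evaluated
param_distributions = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=12345),
    param_distributions,
    n_iter=4,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=12345,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(-search.best_score_, 3))
```

The `n_iter` argument caps the number of fits, which is what keeps the search time under control compared with an exhaustive grid.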

<div class="alert alert-danger">
<s><b>Reviewer's comment V3</b>

Very good! One small problem: right now in your code GridSearchCV selects the model with maximal error.
    
The reason is that your scoring object was initialized using the `greater_is_better=True` parameter. This option makes GridSearchCV select the model with the greatest score value. This is the correct behavior for metrics like accuracy, precision, recall, F1 score. But error (be it MSE, MAE, SMAPE or other error) is something we'd like to minimize. So it should be initialized using `greater_is_better=False`. Note that setting it to false just negates the function (i.e. instead of returning `RMSE` it returns `-RMSE`): this is done because in scikit-learn there is a convention that the score is always maximized, and maximizing the negated error is the same thing as minimizing the error.
    
You don't really need to implement the scoring for RMSE though: it is one of the default metrics which can be used like this: `scoring='neg_root_mean_squared_error'`

</div>
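
The sign convention described above can be checked with a short sketch. The synthetic data and the names `custom_scorer` and `builtin_scorer` are assumptions for illustration; `rmse` mirrors the helper defined earlier in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Synthetic data and a fitted model to score
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(100, 2))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=100)
model = LinearRegression().fit(X_demo, y_demo)

# greater_is_better=False makes the scorer return the negated error
custom_scorer = make_scorer(rmse, greater_is_better=False)
# The built-in string scorer follows the same "higher is better" convention
builtin_scorer = get_scorer("neg_root_mean_squared_error")

custom_score = custom_scorer(model, X_demo, y_demo)
builtin_score = builtin_scorer(model, X_demo, y_demo)
print(custom_score, builtin_score)
```

Both values are negative and agree with each other, so taking `abs()` of `best_score_`, as the grid search cell does, recovers the RMSE.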

<div class="alert alert-info">
Thank you for upgrading my project. I didn't find that scoring option on my own. If anything else needs to be upgraded, please let me know!
</div>

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

No problem! 

</div>

## Model analysis

#### Creating a dataframe of the results sorted by RMSE. XG Boost has the lowest RMSE (about 1690) but the slowest runtime (about 621 seconds). For a low RMSE with a quicker runtime, I would look at the Random Forest Regression Model (about 1729 in 190 seconds), or at the Cat Boost Regressor Model for even faster times (about 1923 in 3 seconds).

In [36]:
# Parallel lists: one entry per model, in the same order
models = ["Linear Regression Model", "Random Forest Regression Model", "Light GBM", "Cat Boost Regressor Model", 
          "XG Boost Model", "Ada Boosting Regressor", 'CatBoost HyperParameter']
# The last RMSE is the cross-validated score from the grid search, not a test-set score
model_rmse = [lr_rmse, rfr_rmse, lgbm_rmse, cat_rmse, xgb_rmse, ada_rmse, abs(score)]
elapsed_times = [lr_elapsed_time, rfr_elapsed_time, light_elapsed_time, cat_elapsed_time, xgb_elapsed_time,
                 ada_elapsed_time, grid_elapsed_time]

# Named `results` rather than `dict` to avoid shadowing the built-in type
results = {'models': models,
           'rmse': model_rmse,
           'time': elapsed_times}

# Creating a dataframe from the dictionary
model_df = pd.DataFrame(results)

# Sort so the model with the lowest RMSE comes first
sorted_model_df = model_df.sort_values(by='rmse').reset_index(drop=True)
sorted_model_df


Unnamed: 0,models,rmse,time
0,XG Boost Model,1689.968476,621.164244
1,Random Forest Regression Model,1728.655776,189.550223
2,Cat Boost Regressor Model,1922.709779,2.955891
3,CatBoost HyperParameter,2012.262855,102.27638
4,Ada Boosting Regressor,2990.457292,324.350012
5,Linear Regression Model,4077.312411,0.191697
6,Light GBM,4249.44505,18.931506


### Conclusion: The XG Boost Model is the most accurate model, with the lowest RMSE of all of the models, but it is also the slowest to run. The Cat Boost Regressor is a great second option if you want speed and a solid RMSE score.

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Please check the results after fixing the problems above

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Alright! Note that the results heavily depend on hyperparameters used, and you can optimize the models for lower rmse or time depending on hyperparameters used

</div>

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed