# Rusty Bargain: Car Value Model

# Contents <a id='back'></a>

* [Introduction](#introduction)
* [Data Overview](#data_overview)
    * [Initialization](#initialization)
    * [Load Data](#load_data)
* [Prepare the Data](#prepare_data)
    * [Fix Data](#fix_data)   
    * [Check for Duplicates](#duplicates)
    * [Check for Missing Values](#missing_values)
* [Model Training](#model_training)
    * [Light GBM for gradient boosting models](#light_gbm)   
        * [Hyperparameter Tuning for Light GBM](#light_gbm_tuning)
    * [Linear Regression](#linear_regression)
    * [Decision Tree](#decision_tree)
        * [Hyperparameter Tuning for Decision Tree](#decision_tree_tuning)
    * [Random Forest](#random_forest)
        * [Hyperparameter Tuning for Random Forest](#random_forest_tuning)
* [Model Analysis](#model_analysis)
* [Conclusion](#conclusion)

# Introduction <a id='introduction'></a>

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, you can quickly find out the market value of your car. You have access to historical data (`/datasets/car_data.csv`): technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

[Back to Contents](#back)

# Data Overview <a id='data_overview'></a>

## Initialization <a id='initialization'></a> <a class="tocSkip">

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

## Load data <a id='load_data'></a> <a class="tocSkip">

In [2]:
# Reading the dataframe file and storing it to df
df = pd.read_csv('/datasets/car_data.csv')

# Data preparation <a id='prepare_data'></a> <a class="tocSkip">

In [3]:
# Print the general/summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
# Print a sample of the data
display(df.head())

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


## Fix Data <a id='fix_data'></a> <a class="tocSkip">

In [5]:
# Convert column names to lowercase (underscores are added by the rename in the next cell)
df.columns = [col.lower() for col in df.columns]

# Check corrected column names
print(df.columns)

Index(['datecrawled', 'price', 'vehicletype', 'registrationyear', 'gearbox',
       'power', 'model', 'mileage', 'registrationmonth', 'fueltype', 'brand',
       'notrepaired', 'datecreated', 'numberofpictures', 'postalcode',
       'lastseen'],
      dtype='object')


In [6]:
# Rename columns
df= df.rename(columns = {
    'datecrawled':'date_crawled',
    'vehicletype':'vehicle_type',
    'registrationyear':'registration_year',
    'registrationmonth':'registration_month',
    'fueltype':'fuel_type',
    'notrepaired':'not_repaired',
    'datecreated':'date_created',
    'numberofpictures':'number_of_pictures',
    'postalcode':'postal_code',
    'lastseen':'last_seen'
})

# Check corrected column names
print(df.columns)

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
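The two-step rename above (lowercase, then a manual mapping) could also be done in one pass. The following is a minimal sketch, assuming it is applied to the original CamelCase names before they are lowercased; `to_snake_case` is a hypothetical helper and not part of the notebook.

```python
import re

def to_snake_case(name: str) -> str:
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# A few of the original column names from the raw file
original_columns = ['DateCrawled', 'Price', 'VehicleType', 'RegistrationYear', 'NumberOfPictures']

snake_columns = [to_snake_case(col) for col in original_columns]
print(snake_columns)
```

With a DataFrame, the same helper could be applied as `df.columns = [to_snake_case(col) for col in df.columns]`.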


In [7]:
# Convert columns to datetime
df['date_crawled'] = pd.to_datetime(df['date_crawled'])
df['date_created'] = pd.to_datetime(df['date_created'])
df['last_seen'] = pd.to_datetime(df['last_seen'])

# Check the updated datatypes
print(df.dtypes)

date_crawled          datetime64[ns]
price                          int64
vehicle_type                  object
registration_year              int64
gearbox                       object
power                          int64
model                         object
mileage                        int64
registration_month             int64
fuel_type                     object
brand                         object
not_repaired                  object
date_created          datetime64[ns]
number_of_pictures             int64
postal_code                    int64
last_seen             datetime64[ns]
dtype: object
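The raw strings in the sample look day-first (for example `24/03/2016 11:52`). Without an explicit format, ambiguous values can be read month-first: the `LastSeen` value `05/04/2016 12:47` in the raw sample shows up later as `2016-05-04 12:47:00`. The sketch below assumes every timestamp follows the `DD/MM/YYYY HH:MM` pattern seen in the sample and passes that format explicitly. The date columns are not used as model features, so this does not affect the reported RMSE values.

```python
import pandas as pd

# Sample strings copied from the raw data preview (assumed to be day-first)
raw = pd.Series(['24/03/2016 11:52', '07/04/2016 03:16', '05/04/2016 12:47'])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```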


## Check for Duplicates <a id='duplicates'></a> <a class="tocSkip">

In [8]:
# checking for obvious duplicated rows in df
print(df.duplicated().sum())

262


In [9]:
# Show duplicated rows
duplicates = df[df.duplicated(keep=False)]
display(duplicates)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
183,2016-03-21 19:06:00,5999,small,2009,manual,80,polo,125000,5,petrol,volkswagen,no,2016-03-21,0,65529,2016-05-04 20:47:00
1771,2016-06-04 21:25:00,3900,sedan,1999,manual,116,beetle,150000,6,petrol,volkswagen,no,2016-06-04,0,55469,2016-06-04 21:25:00
1910,2016-05-04 00:59:00,2400,sedan,2002,auto,133,other,150000,12,gasoline,peugeot,no,2016-04-04,0,15517,2016-05-04 08:41:00
2267,2016-03-13 20:48:00,4200,sedan,2003,manual,105,golf,150000,10,gasoline,volkswagen,no,2016-03-13,0,14482,2016-03-13 20:48:00
3176,2016-03-23 21:25:00,900,bus,1995,manual,110,transporter,150000,9,petrol,volkswagen,yes,2016-03-23,0,65239,2016-03-30 10:17:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349709,2016-03-04 20:52:00,700,small,1999,manual,60,ibiza,150000,12,petrol,seat,yes,2016-03-04,0,6268,2016-05-04 21:47:00
351555,2016-03-26 16:54:00,3150,bus,2003,manual,86,transit,150000,11,gasoline,ford,no,2016-03-26,0,96148,2016-02-04 07:47:00
352384,2016-03-15 21:54:00,5900,wagon,2006,manual,129,3er,150000,12,petrol,bmw,no,2016-03-15,0,92526,2016-03-20 21:17:00
353057,2016-05-03 14:16:00,9500,small,2013,manual,105,ibiza,40000,5,petrol,seat,no,2016-04-03,0,61381,2016-05-04 19:18:00


In [10]:
# removing obvious duplicates
df = df.drop_duplicates().reset_index()

In [11]:
# checking for duplicates
print(df.duplicated().sum())

0


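Fully duplicated rows were dropped. Because `reset_index()` was called without `drop=True`, the old row labels are kept as a new `index` column, which appears in the outputs below.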

## Check for Missing Values <a id='missing_values'></a> <a class="tocSkip">

In [12]:
# calculating missing values in df
print(df.isna().sum())

index                     0
date_crawled              0
price                     0
vehicle_type          37484
registration_year         0
gearbox               19830
power                     0
model                 19701
mileage                   0
registration_month        0
fuel_type             32889
brand                     0
not_repaired          71145
date_created              0
number_of_pictures        0
postal_code               0
last_seen                 0
dtype: int64


In [13]:
# Number of missing values per column
missing_values = df.isnull().sum()

# Total number of values per column
total_values = df.shape[0]

# Percentage of missing values per column
percentage_missing = (missing_values / total_values) * 100

# Create a DataFrame to display the results
missing_data_summary = pd.DataFrame({
    'Missing Values': missing_values,
    'Total Values': total_values,
    'Percentage Missing': percentage_missing
})

# Sort by percentage of missing values in descending order
missing_data_summary = missing_data_summary.sort_values(by='Percentage Missing', ascending=False)

display(missing_data_summary)

| Column | Missing Values | Total Values | Percentage Missing |
|---|---|---|---|
| not_repaired | 71145 | 354107 | 20.091385 |
| vehicle_type | 37484 | 354107 | 10.585501 |
| fuel_type | 32889 | 354107 | 9.287871 |
| gearbox | 19830 | 354107 | 5.600002 |
| model | 19701 | 354107 | 5.563573 |
| index | 0 | 354107 | 0.0 |
| postal_code | 0 | 354107 | 0.0 |
| number_of_pictures | 0 | 354107 | 0.0 |
| date_created | 0 | 354107 | 0.0 |
| brand | 0 | 354107 | 0.0 |


Most of the missing data appears to be unrecoverable. The values in `vehicle_type`, `fuel_type`, and `gearbox` might have been recoverable if the precise model variant were known, since a specific variant has a single gearbox type and fuel type, but the data does not identify cars that precisely. Rows with missing values in `vehicle_type`, `fuel_type`, `gearbox`, or `model` are therefore dropped, and missing values in `not_repaired` are replaced with `other`.
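For comparison, here is a sketch of an alternative that this notebook does not use: filling a missing categorical value with the most frequent value among cars of the same brand and model. The toy data and the helper `fill_with_group_mode` are hypothetical. The approach only helps when `model` itself is present, and it would introduce some wrong values wherever a model is sold with more than one gearbox or fuel type.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (hypothetical values)
toy = pd.DataFrame({
    'brand':   ['volkswagen', 'volkswagen', 'volkswagen', 'skoda', 'skoda'],
    'model':   ['golf', 'golf', 'golf', 'fabia', 'fabia'],
    'gearbox': ['manual', 'manual', None, 'auto', None],
})

def fill_with_group_mode(frame, column, keys):
    """Fill missing values in `column` with the most frequent value of each `keys` group."""
    def mode_or_none(values):
        modes = values.mode()  # mode() ignores missing values by default
        return modes.iloc[0] if not modes.empty else None

    group_mode = frame.groupby(keys)[column].transform(mode_or_none)
    return frame[column].fillna(group_mode)

toy['gearbox'] = fill_with_group_mode(toy, 'gearbox', ['brand', 'model'])
print(toy)
```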

In [14]:
# Drop rows with missing values in specified columns
columns_to_check = ['vehicle_type', 'fuel_type', 'gearbox', 'model']
df.dropna(subset=columns_to_check, inplace=True)

# Replace missing values in 'not_repaired' column with 'other'
df['not_repaired'] = df['not_repaired'].fillna('other')

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 283875 entries, 2 to 354106
Data columns (total 17 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   index               283875 non-null  int64         
 1   date_crawled        283875 non-null  datetime64[ns]
 2   price               283875 non-null  int64         
 3   vehicle_type        283875 non-null  object        
 4   registration_year   283875 non-null  int64         
 5   gearbox             283875 non-null  object        
 6   power               283875 non-null  int64         
 7   model               283875 non-null  object        
 8   mileage             283875 non-null  int64         
 9   registration_month  283875 non-null  int64         
 10  fuel_type           283875 non-null  object        
 11  brand               283875 non-null  object        
 12  not_repaired        283875 non-null  object        
 13  date_created        283875 no

In [16]:
display(df.head())

Unnamed: 0,index,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
2,2,2016-03-14 12:52:00,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,other,2016-03-14,0,90480,2016-05-04 12:47:00
3,3,2016-03-17 16:54:00,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17,0,91074,2016-03-17 17:40:00
4,4,2016-03-31 17:25:00,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31,0,60437,2016-06-04 10:17:00
5,5,2016-04-04 17:36:00,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,2016-04-04,0,33775,2016-06-04 19:17:00
6,6,2016-01-04 20:48:00,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no,2016-01-04,0,67112,2016-05-04 18:18:00


# Model Training <a id='model_training'></a>

Use the RMSE metric to evaluate the models.
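As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same units as the price. The sketch below computes it directly with NumPy on made-up numbers; `rmse` is a hypothetical helper. It is also a version-independent alternative to `mean_squared_error(..., squared=False)`, since newer scikit-learn releases deprecate the `squared` argument in favour of a separate `root_mean_squared_error` function.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices: two predictions are off by 1000, one is exact
actual = [480, 18300, 9800]
predicted = [1480, 17300, 9800]

print(rmse(actual, predicted))
```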

Linear regression is not very good for hyperparameter tuning, but it is perfect for doing a sanity check of other methods. If gradient boosting performs worse than linear regression, something definitely went wrong.

On your own, work with the LightGBM library and use its tools to build gradient boosting models.

Ideally, your project should include linear regression for a sanity check, a tree-based algorithm with hyperparameter tuning (preferably, random forest), LightGBM with hyperparameter tuning (try a couple of sets), and CatBoost and XGBoost with hyperparameter tuning (optional).

Take note of the encoding of categorical features for simple algorithms. LightGBM and CatBoost have their own implementations, but XGBoost requires OHE.
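To make the encoding options concrete, the sketch below contrasts one-hot encoding (OHE) with the pandas `category` dtype on a small made-up frame. Integer label codes imply an ordering between categories, which a linear model will treat as meaningful, whereas OHE gives each category its own indicator column. The toy values are hypothetical, and the `category` conversion is shown on the assumption that the gradient boosting library in use accepts pandas categorical columns.

```python
import pandas as pd

# Made-up rows using column names from the dataset
toy = pd.DataFrame({
    'gearbox':   ['manual', 'auto', 'manual'],
    'fuel_type': ['petrol', 'gasoline', 'petrol'],
    'price':     [1500, 9800, 3600],
})
categorical_cols = ['gearbox', 'fuel_type']

# One-hot encoding; drop_first avoids a redundant column per feature
ohe = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)
print(ohe.columns.tolist())

# Alternative: keep one column per feature but mark it as categorical
as_category = toy.copy()
as_category[categorical_cols] = as_category[categorical_cols].astype('category')
print(as_category.dtypes)
```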

You can use a special command to find the cell code runtime in Jupyter Notebook. Find that command.
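The `%time` and `%%time` magics only work inside IPython or Jupyter. A plain-Python equivalent is sketched below with `time.perf_counter`, timing training and prediction separately since both matter to Rusty Bargain. `timed` and `MeanModel` are hypothetical names; `MeanModel` is only a stand-in with a scikit-learn-style `fit`/`predict` interface.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

class MeanModel:
    """Stand-in model (hypothetical): always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

X_demo = [[i] for i in range(1000)]
y_demo = [float(i) for i in range(1000)]

model, fit_seconds = timed(MeanModel().fit, X_demo, y_demo)
preds, predict_seconds = timed(model.predict, X_demo)
print(f'fit: {fit_seconds:.6f} s, predict: {predict_seconds:.6f} s')
```

The same wrapper could be used around any estimator's `fit` and `predict` calls to report prediction speed alongside training time.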

Since the training of a gradient boosting model can take a long time, change only a few model parameters.


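Models covered in this section:

- LightGBM (gradient boosting), with hyperparameter tuning
- Linear regression, as a sanity check
- Decision tree, with hyperparameter tuning
- Random forest, with hyperparameter tuning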


## Light GBM for gradient boosting models <a id='light_gbm'></a><a class="tocSkip">

In [17]:
# Selecting specific columns for modeling
selected_cols = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired', 'price']

# Filtering the DataFrame to include only selected columns
df_filtered = df[selected_cols].copy()  # Ensure to create a copy to avoid SettingWithCopyWarning

# Handling categorical columns (example using Label Encoding)
label_encoders = {}
categorical_cols = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df_filtered.loc[:, col] = label_encoders[col].fit_transform(df_filtered[col])

# Splitting the data into features and target variable
X = df_filtered.drop('price', axis=1)
y = df_filtered['price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define LightGBM model
lgbm = LGBMRegressor()

# Measuring the execution time of model training
%time lgbm.fit(X_train, y_train)

# Predict on test set
y_pred = lgbm.predict(X_test)

# Evaluate the model using RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'LightGBM RMSE: {rmse}')

CPU times: user 2.67 s, sys: 16.3 ms, total: 2.68 s
Wall time: 2.66 s
LightGBM RMSE: 3534.00807025049


### Hyperparameter Tuning for Light GBM <a id='light_gbm_tuning'></a><a class="tocSkip">

In [18]:
# Hyperparameter tuning using GridSearchCV
param_grid = {
    'num_leaves': [30, 50, 100],
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.05, 0.1],
    # Add more parameters to tune
}

lgbm = LGBMRegressor()
grid_search = GridSearchCV(lgbm, param_grid, cv=3, scoring='neg_root_mean_squared_error')

# Measuring the execution time of grid search
%time grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Predict on test set using the best model
best_lgbm = grid_search.best_estimator_
y_pred = best_lgbm.predict(X_test)

# Evaluate the model using RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'LightGBM RMSE after tuning: {rmse}')

CPU times: user 4min 16s, sys: 1.39 s, total: 4min 18s
Wall time: 4min 19s
Best Parameters: {'learning_rate': 0.1, 'max_depth': 15, 'num_leaves': 100}
LightGBM RMSE after tuning: 3463.7275971362


Hyperparameter tuning of the LightGBM model using `GridSearchCV` took approximately 4 minutes and gave an RMSE of about 3464.

In [19]:
# Define the parameter distributions for LightGBM
param_dist = {
    'num_leaves': [30, 50, 100],
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.05, 0.1],
    # Add more parameters to tune
}

# Create the LightGBM regressor
lgbm = LGBMRegressor()

# Using RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(lgbm, param_distributions=param_dist, n_iter=10, cv=3,
                                   scoring='neg_root_mean_squared_error', random_state=42)

# Measuring the execution time of random search
%time random_search.fit(X_train, y_train)

# Get the best parameters and best estimator
best_params = random_search.best_params_
print("Best Parameters:", best_params)

# Predicting on test set using the best model
best_lgbm = random_search.best_estimator_
y_pred = best_lgbm.predict(X_test)

# Evaluate the model using RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'LightGBM RMSE after tuning: {rmse}')

CPU times: user 1min 35s, sys: 570 ms, total: 1min 36s
Wall time: 1min 36s
Best Parameters: {'num_leaves': 100, 'max_depth': 15, 'learning_rate': 0.05}
LightGBM RMSE after tuning: 3504.3422921877477


Hyperparameter tuning of the LightGBM model using `RandomizedSearchCV` took approximately 1.5 minutes and gave an RMSE of about 3504.
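The runtime gap between the two searches follows from the number of model fits each one performs. The sketch below counts them for the grid and settings used above (`cv=3`, `n_iter=10`). It ignores the one extra refit on the full training set that both searches do by default.

```python
from itertools import product

# Same parameter grid as in the two LightGBM search cells
param_grid = {
    'num_leaves': [30, 50, 100],
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.05, 0.1],
}
cv_folds = 3
n_iter = 10

# Grid search tries every combination; randomized search samples n_iter of them
grid_candidates = len(list(product(*param_grid.values())))
grid_fits = grid_candidates * cv_folds
random_fits = min(n_iter, grid_candidates) * cv_folds

print(grid_candidates, grid_fits, random_fits)
print(grid_fits / random_fits)
```

The fit-count ratio of 2.7 is close to the ratio of the measured wall times (4 min 19 s versus 1 min 36 s).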

## Linear Regression <a id='linear_regression'></a><a class="tocSkip">

In [20]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a Linear Regression model
linear_model = LinearRegression()

# Measuring the execution time of model training
%time linear_model.fit(X_train, y_train)

# Predicting on the test set
y_pred = linear_model.predict(X_test)

# Calculating RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'Linear Regression RMSE: {rmse}')

CPU times: user 27.4 ms, sys: 8.02 ms, total: 35.4 ms
Wall time: 28.1 ms
Linear Regression RMSE: 4129.996467828162


Using linear regression as a sanity check, we see that its RMSE (about 4130) is indeed higher than LightGBM's, indicating that LightGBM performed better, as expected. If gradient boosting had performed worse than linear regression, something would have gone wrong.

## Decision Tree <a id='decision_tree'></a><a class="tocSkip">

In [21]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the parameter grid for Decision Tree
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # Add more parameters to tune
}

# Creating the Decision Tree regressor
dt = DecisionTreeRegressor()

# Using GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(dt, param_grid, cv=3, scoring='neg_root_mean_squared_error')

# Measuring the execution time of grid search
%time grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Predicting on test set using the best model
best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)

# Evaluate the model using RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'Decision Tree RMSE after tuning: {rmse}')

CPU times: user 13.9 s, sys: 7.78 ms, total: 13.9 s
Wall time: 14 s
Best Parameters: {'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 5}
Decision Tree RMSE after tuning: 3495.7158466752003


### Hyperparameter Tuning for Decision Tree <a id='decision_tree_tuning'></a><a class="tocSkip">

In [22]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the parameter grid for Decision Tree
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # Add more parameters to tune
}

# Creating the Decision Tree regressor
dt = DecisionTreeRegressor()

# Using GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(dt, param_grid, cv=3, scoring='neg_root_mean_squared_error')

# Measuring the execution time of grid search
%time grid_search.fit(X_train, y_train)

# Get the best parameters and best estimator
best_params = grid_search.best_params_
best_dt = grid_search.best_estimator_

# Predicting on the test set using the best model
y_pred = best_dt.predict(X_test)

# Calculate RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'Decision Tree RMSE after tuning: {rmse}')

CPU times: user 14.1 s, sys: 11.8 ms, total: 14.1 s
Wall time: 14.2 s
Decision Tree RMSE after tuning: 3495.6530514895844


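Both Decision Tree cells run the same grid search, so the near-identical RMSE values (3495.72 and 3495.65) most likely reflect run-to-run variation rather than a comparison between tuned and untuned models.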

## Random Forest <a id='random_forest'></a><a class="tocSkip">

In [23]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    # Add more parameters to tune
}

# Creating the Random Forest regressor
rf = RandomForestRegressor()

# Using GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='neg_root_mean_squared_error')

# Measuring the execution time of grid search
%time grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Predicting on test set using the best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Evaluate the model using RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'Random Forest RMSE after tuning: {rmse}')

CPU times: user 11min 11s, sys: 460 ms, total: 11min 11s
Wall time: 11min 12s
Best Parameters: {'max_depth': 15, 'n_estimators': 300}
Random Forest RMSE after tuning: 3462.8338321245383


### Hyperparameter Tuning for Random Forest <a id='random_forest_tuning'></a><a class="tocSkip">

In [24]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the parameter distributions for Random Forest
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # Add more parameters to tune
}

# Creating the Random Forest regressor
rf = RandomForestRegressor()

# Using RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=3,
                                   scoring='neg_root_mean_squared_error', random_state=42)

# Fitting the randomized search to find the best model
%time random_search.fit(X_train, y_train)

# Get the best parameters and best estimator
best_params = random_search.best_params_
best_rf = random_search.best_estimator_

# Predicting on the test set using the best model
y_pred = best_rf.predict(X_test)

# Calculate RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'Random Forest RMSE after tuning: {rmse}')

CPU times: user 9min 1s, sys: 403 ms, total: 9min 2s
Wall time: 9min 2s
Random Forest RMSE after tuning: 3463.5662875368857


Tuning over this larger four-parameter grid was originally attempted with `GridSearchCV`, but the cell never finished because of the extremely long runtime. To reduce the runtime, `RandomizedSearchCV` was used instead. It took about 9 minutes and gave an RMSE of about 3464, essentially the same as the 3463 from the 11-minute grid search over the smaller two-parameter grid, so the extra parameters brought no measurable improvement.

# Model analysis <a id='model_analysis'></a><a class="tocSkip">

When comparing different models using RMSE (Root Mean Squared Error), lower values indicate better performance. Lower RMSE means the model's predictions are closer to the actual values, implying better accuracy and less deviation from the true values. Therefore, among different models, you'd prefer the one with the lowest RMSE as it signifies better predictive performance.

With default settings, LightGBM trained in about 3 seconds and reached an RMSE of 3534. Tuning with `GridSearchCV` took about 4 minutes and lowered the RMSE to 3464, while tuning with `RandomizedSearchCV` took about 1.5 minutes and gave an RMSE of 3504.

As a sanity check, Linear Regression was also trained. It fit in about 30 milliseconds and gave an RMSE of 4130, which is higher than every LightGBM result, indicating that LightGBM performed better, as expected. If gradient boosting had performed worse than linear regression, something would have gone wrong.

Decision Tree and Random Forest models were also trained for comparison with LightGBM. For the Decision Tree, both grid-search runs took about 14 seconds and gave an RMSE of about 3496.

The Random Forest searches had the longest runtimes, as expected: `GridSearchCV` took about 11 minutes with an RMSE of 3463, and `RandomizedSearchCV` took about 9 minutes with an RMSE of 3464.

The Decision Tree and Random Forest models achieved similar RMSE values, but given the much longer runtimes of the Random Forest, the Decision Tree would be preferred of the two. LightGBM tuned with `GridSearchCV` matched the best RMSE overall (about 3464, on par with Random Forest), but its search took a few minutes, whereas the Decision Tree search reached a slightly higher RMSE (about 3496) in only a few seconds.
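The figures reported in the cell outputs above can be gathered into one table for easier comparison. The sketch below simply copies those printed values (wall times converted to seconds, RMSE rounded to two decimals) and ranks the runs by RMSE. For the search rows, the wall time is the duration of the whole search, not of a single fit.

```python
import pandas as pd

# Wall times and RMSE values copied from the outputs above
results = pd.DataFrame(
    [
        ('LightGBM (default)', 2.66, 3534.01),
        ('LightGBM (GridSearchCV)', 259.0, 3463.73),
        ('LightGBM (RandomizedSearchCV)', 96.0, 3504.34),
        ('Linear Regression', 0.0281, 4130.00),
        ('Decision Tree (GridSearchCV, run 1)', 14.0, 3495.72),
        ('Decision Tree (GridSearchCV, run 2)', 14.2, 3495.65),
        ('Random Forest (GridSearchCV)', 672.0, 3462.83),
        ('Random Forest (RandomizedSearchCV)', 542.0, 3463.57),
    ],
    columns=['model', 'wall_time_s', 'rmse'],
)

# Lowest RMSE first
ranked = results.sort_values('rmse').reset_index(drop=True)
print(ranked)
```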

# Conclusion <a id='conclusion'></a><a class="tocSkip">

Based on the model analysis, and taking into account Rusty Bargain's interest in the quality of the prediction, the speed of the prediction, and the time required for training, both the LightGBM and Decision Tree models are viable options; the choice depends on whether Rusty Bargain places more emphasis on time or on accuracy. With hyperparameter tuning, both models achieved similar RMSE values, with LightGBM slightly better, and both had relatively short runtimes: a few minutes for LightGBM versus a few seconds for the Decision Tree. Since the model is intended for Rusty Bargain's app for predicting car market values, the Decision Tree, with its runtime of only a few seconds, appears to be the optimal model for the app.