Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import pandas as pd
import numpy as np
import math
import seaborn as sns
import lightgbm as lgb
import time

from sklearn.model_selection import train_test_split, GridSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer, r2_score

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [2]:
df = pd.read_csv('car_data.csv')

In [3]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [4]:
df = df.rename(columns={'DateCrawled': 'date_crawled', "Price": 'price', "VehicleType": 'vehicle_type', 'RegistrationYear': 'registration_year', 'Gearbox': 'gearbox', 'Power': 'power', 'Model': 'model', 'Mileage': 'mileage', 'RegistrationMonth': 'registration_month', 'FuelType': 'fuel_type', 'Brand': 'brand', 'NotRepaired': 'not_repaired', 'DateCreated': 'date_created', 'NumberOfPictures': 'number_of_pictures', 'PostalCode': 'postal_code', 'LastSeen': 'last_seen'})

In [5]:
df.sample(10)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
134585,11/03/2016 20:44,4799,wagon,2004,manual,155,vectra,150000,6,petrol,opel,no,11/03/2016 00:00,0,41199,06/04/2016 03:15
124785,29/03/2016 08:54,4990,wagon,2005,auto,177,e_klasse,150000,9,gasoline,mercedes_benz,no,29/03/2016 00:00,0,28876,01/04/2016 21:44
178258,31/03/2016 15:52,5500,,2017,auto,109,b_klasse,150000,4,gasoline,mercedes_benz,no,31/03/2016 00:00,0,21033,06/04/2016 08:46
338157,27/03/2016 14:38,11490,small,2012,manual,143,ibiza,90000,6,gasoline,seat,no,27/03/2016 00:00,0,51688,07/04/2016 10:45
82291,24/03/2016 16:25,14500,bus,2007,auto,190,other,125000,1,gasoline,mercedes_benz,no,24/03/2016 00:00,0,61381,04/04/2016 02:18
10505,01/04/2016 00:56,950,sedan,1996,manual,75,astra,150000,9,petrol,opel,no,31/03/2016 00:00,0,85057,07/04/2016 05:15
128934,13/03/2016 22:47,7990,suv,2009,auto,170,sorento,150000,5,gasoline,kia,no,13/03/2016 00:00,0,46145,17/03/2016 14:17
15648,29/03/2016 12:58,850,small,2002,manual,86,justy,150000,10,petrol,subaru,,29/03/2016 00:00,0,87671,05/04/2016 22:17
198284,03/04/2016 12:06,13400,sedan,2009,manual,122,1er,40000,11,petrol,bmw,no,03/04/2016 00:00,0,63303,06/04/2016 18:16
251861,07/03/2016 16:57,500,,2016,manual,0,clio,150000,10,,renault,,07/03/2016 00:00,0,12437,21/03/2016 08:45


In [6]:
df.isna().sum()

date_crawled              0
price                     0
vehicle_type          37490
registration_year         0
gearbox               19833
power                     0
model                 19705
mileage                   0
registration_month        0
fuel_type             32895
brand                     0
not_repaired          71154
date_created              0
number_of_pictures        0
postal_code               0
last_seen                 0
dtype: int64

In [7]:

# Fill missing categorical values with an explicit 'Unknown' category.
# Assigning the result back avoids chained in-place fillna on a column.
categorical_with_gaps = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'not_repaired']
df[categorical_with_gaps] = df[categorical_with_gaps].fillna('Unknown')

In [8]:
df.drop_duplicates(inplace=True)

In [9]:
print(f"Duplicated values: {df.duplicated().sum()}")
print(f"Missing values: {df.isna().sum()}")

Duplicated values: 0
Missing values: date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
number_of_pictures    0
postal_code           0
last_seen             0
dtype: int64


In [10]:
df = df.drop(['date_crawled', 'date_created', 'number_of_pictures'], axis=1)

In [11]:
# Convert the 'last_seen' column to datetime format
df['last_seen'] = pd.to_datetime(df['last_seen'])
# Extract year, month, day, and hour as new features
df['year'] = df['last_seen'].dt.year
df['month'] = df['last_seen'].dt.month
df['day'] = df['last_seen'].dt.day
df['hour'] = df['last_seen'].dt.hour

# Convert 'last_seen' to a Unix timestamp in seconds, stored as float32
df['last_seen'] = df['last_seen'].astype(np.int64) / 10**9  # Convert to seconds
df['last_seen'] = df['last_seen'].astype('float32')
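
The raw `last_seen` strings are day-first (`dd/mm/yyyy hh:mm`), yet in the `head()` output that follows, `07/04/2016 03:16` lands in month 7, day 4, which suggests that ambiguous dates were read month-first. One way to remove the ambiguity is to pin the format explicitly. This is a minimal sketch, assuming every row follows the `%d/%m/%Y %H:%M` pattern seen in the sample rows; the toy series `raw` is hypothetical.

```python
import pandas as pd

# Hypothetical toy series using the same string layout as the sample rows
raw = pd.Series(['07/04/2016 03:16', '17/03/2016 17:40'])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(parsed.dt.month.tolist())
print(parsed.dt.day.tolist())
```

Applying this to the notebook would change the derived `month`, `day` and `last_seen` features, so the recorded outputs would need to be regenerated.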

In [12]:
df.head(20)

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,postal_code,last_seen,year,month,day,hour
0,480,Unknown,1993,manual,0,golf,150000,0,petrol,volkswagen,Unknown,70435,1467602000.0,2016,7,4,3
1,18300,coupe,2011,manual,190,Unknown,125000,5,gasoline,audi,yes,66954,1467597000.0,2016,7,4,1
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,Unknown,90480,1462366000.0,2016,5,4,12
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,91074,1458236000.0,2016,3,17,17
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,60437,1465035000.0,2016,6,4,10
5,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,33775,1465068000.0,2016,6,4,19
6,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no,67112,1462386000.0,2016,5,4,18
7,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no,19348,1458924000.0,2016,3,25,16
8,14500,bus,2014,manual,125,c_max,30000,8,petrol,ford,Unknown,94505,1459813000.0,2016,4,4,23
9,999,small,1998,manual,101,golf,150000,0,Unknown,volkswagen,Unknown,27472,1459445000.0,2016,3,31,17


## Model training

In [17]:

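# One-hot encode the categorical columns first, so that X contains only numeric features
df = pd.get_dummies(df, drop_first=True)

X = df.drop(['price'], axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Columns to standardize for linear regression (the one-hot columns and float32 last_seen are not selected)
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns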
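
`pd.get_dummies` returns a new DataFrame, so a feature matrix taken from `df` before the call keeps its string columns. The toy example below sketches the encode-then-separate order; the frame `toy` and its values are hypothetical and only mimic the column names used here.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric feature
toy = pd.DataFrame({
    'price': [480, 18300, 9800],
    'gearbox': ['manual', 'manual', 'auto'],
    'power': [0, 190, 163],
})

# Encode first, then separate features and target
encoded = pd.get_dummies(toy, drop_first=True)
X_toy = encoded.drop(['price'], axis=1)
y_toy = encoded['price']

print(list(X_toy.columns))
```

With `drop_first=True`, the first category of each column (here `auto`) becomes the implicit baseline and gets no column of its own.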



In [14]:
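# Helper that returns RMSE, MAE and R² for a set of predictions
def evaluate_model(preds, y_test):
    # Taking the square root of MSE works across scikit-learn versions
    # (the squared=False argument was deprecated in scikit-learn 1.4)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    return rmse, mae, r2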
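
The three metrics returned by `evaluate_model` can be checked by hand on a tiny example. The sketch below computes them directly with NumPy; the prices and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical true prices and predictions, for illustration only
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
mae = np.mean(np.abs(errors))                    # mean absolute error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(rmse, mae, r2)
```

RMSE penalizes the two 200-unit misses more heavily than MAE does, which is why it comes out larger than the mean absolute error here.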


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Columns: 319 entries, price to not_repaired_yes
dtypes: float32(1), int64(10), uint8(308)
memory usage: 135.1 MB


In [18]:
# Scale the numerical features (the scaler is fitted on the training set only)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[numerical_features])
X_test_scaled = scaler.transform(X_test[numerical_features])
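
Fitting the scaler on the training split and reusing its statistics on the test split keeps test information out of the preprocessing. A small sketch with made-up numbers shows the effect; the arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits, for illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

A test value equal to the training mean maps to zero, while values outside the training range simply fall further from zero.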

In [19]:
%time


# Train a Linear Regression model
lr_model = LinearRegression()
lr_start_time = time.time()
lr_model.fit(X_train_scaled, y_train)
lr_end_time = time.time()
lr_training_time = lr_end_time - lr_start_time

lr_start_time = time.time()
lr_preds = lr_model.predict(X_test_scaled)
lr_end_time = time.time()
lr_prediction_time = lr_end_time - lr_start_time

lr_rmse, lr_mae, lr_r2 = evaluate_model(lr_preds, y_test)
print(f"Linear Regression - RMSE: {lr_rmse}, MAE: {lr_mae}, R2: {lr_r2}")

print(f"Linear Regression - Training Time: {lr_training_time} seconds, Prediction Time: {lr_prediction_time} seconds")

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs
Linear Regression - RMSE: 4109.2684009096165, MAE: 3072.435152634596, R2: 0.16522972957169735
Linear Regression - Training Time: 0.034822940826416016 seconds, Prediction Time: 0.004396677017211914 seconds
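
A `%time` line with no statement after it appears to time an empty statement, which would explain why the "CPU times" and "Wall time" lines in these outputs are only a few microseconds. The useful measurements are the manual `time.time()` differences. That repeated start/stop pattern could be wrapped in a small helper; the sketch below uses only the standard library, and `timed` is a hypothetical name that does not appear in the notebook.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload for illustration; in the notebook this could be
# timed(model.fit, X_train, y_train) or timed(model.predict, X_test)
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.4f} seconds")
```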


In [20]:
%time
# Train a Decision Tree model
dt_model = DecisionTreeRegressor(random_state=12345)
# Define the hyperparameters to tune
param_grid = {
    'max_depth': [3,5,7,10],
    'min_samples_split': [2,5,10]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

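# Get the best model
best_dt_model = grid_search.best_estimator_

# NOTE: the timing and metrics below are for dt_model, which keeps its default
# hyperparameters; best_dt_model is not evaluated in this cell
dt_start_time = time.time()
dt_model.fit(X_train, y_train)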
dt_end_time = time.time()
dt_train_time = dt_end_time - dt_start_time

dt_start_time = time.time()
dt_preds = dt_model.predict(X_test)
dt_end_time = time.time()
dt_prediction_time = dt_end_time - dt_start_time


dt_rmse, dt_mae, dt_r2 = evaluate_model(dt_preds, y_test)
print(f"Decision Tree - RMSE: {dt_rmse}, MAE: {dt_mae}, R2: {dt_r2}")

print(f"Best Decision Tree Regressor - Training Time: {dt_train_time} seconds, Prediction Time: {dt_prediction_time} seconds")
print(f"Best Hyperparameters: {grid_search.best_params_}")

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.44 µs
Decision Tree - RMSE: 2356.9450273686753, MAE: 1340.8334910056196, R2: 0.7253770277928431
Best Decision Tree Regressor - Training Time: 8.32465648651123 seconds, Prediction Time: 0.08254742622375488 seconds
Best Hyperparameters: {'max_depth': 10, 'min_samples_split': 10}
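
The cell above tunes `max_depth` and `min_samples_split` with `GridSearchCV`, but the metrics it prints come from `dt_model`, which still has default hyperparameters. A sketch of how the tuned tree could be scored instead is shown below. It runs on synthetic data from `make_regression` as a stand-in for the notebook's `X_train` / `X_test`, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the notebook would use its own splits
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=12345),
    param_grid={'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
tuned_tree = search.best_estimator_
tuned_preds = tuned_tree.predict(X_te)
tuned_rmse = np.sqrt(mean_squared_error(y_te, tuned_preds))

print(search.best_params_, tuned_rmse)
```

Because the search refits the winning configuration, timing `search.fit` measures the whole tuning procedure; timing a fresh fit of a tree built from `best_params_` would be closer to the training-time figures reported for the other models.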


In [21]:
%time
# Train a Random Forest model 
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_start_time = time.time()
rf_model.fit(X_train, y_train)
rf_end_time = time.time()
rf_train_time = rf_end_time - rf_start_time

rf_start_time = time.time()
rf_preds = rf_model.predict(X_test)
rf_end_time = time.time()
rf_pred_time = rf_end_time - rf_start_time

rf_rmse, rf_mae, rf_r2 = evaluate_model(rf_preds, y_test)
print(f"Random Forest - RMSE: {rf_rmse}, MAE: {rf_mae}, R2: {rf_r2}")
print(f"Random Forest - Training Time: {rf_train_time} seconds, Prediction Time: {rf_pred_time} seconds")

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs
Random Forest - RMSE: 2008.8161273335397, MAE: 1299.0173194508595, R2: 0.8005112866452901
Random Forest - Training Time: 277.8753926753998 seconds, Prediction Time: 0.582298755645752 seconds


In [20]:

# Train a LightGBM model with fixed hyperparameters (no tuning is performed here)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
lg_start_time = time.time()
lgb_model = lgb.train(lgb_params, lgb_train, num_boost_round=100)
lg_end_time = time.time()
lg_train_time = lg_end_time - lg_start_time

lg_start_time = time.time()
lgb_preds = lgb_model.predict(X_test)
lg_end_time = time.time()
lg_pred_time = lg_end_time - lg_start_time
%time

lgb_rmse, lgb_mae, lgb_r2 = evaluate_model(lgb_preds, y_test)
print(f"LightGBM - RMSE: {lgb_rmse}, MAE: {lgb_mae}, R2: {lgb_r2}")

print(f"LightGBM - Training Time: {lg_train_time} seconds, Prediction Time: {lg_pred_time} seconds")

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1547
[LightGBM] [Info] Number of data points in the train set: 283285, number of used features: 307
[LightGBM] [Info] Start training from score 4418.702272
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs
LightGBM - RMSE: 1883.1422804757233, MAE: 1190.5678497958877, R2: 0.8246909948714369
LightGBM - Training Time: 5.700544595718384 seconds, Prediction Time: 0.5975062847137451 seconds


In [22]:

# Train a CatBoost model with fixed hyperparameters (no tuning is performed here)
cat_model = CatBoostRegressor(iterations=100, learning_rate=0.1, depth=6, random_state=12345, verbose=0)
cat_start_time = time.time()
cat_model.fit(X_train, y_train)
cat_end_time = time.time()
cat_train_time = cat_end_time - cat_start_time

cat_start_time = time.time()
cat_preds = cat_model.predict(X_test)
cat_end_time = time.time()
cat_pred_time = cat_end_time - cat_start_time

%time

cat_rmse, cat_mae, cat_r2 = evaluate_model(cat_preds, y_test)
print(f"CatBoost - RMSE: {cat_rmse}, MAE: {cat_mae}, R2: {cat_r2}")
print(f"Cat Boost - Training Time: {cat_train_time} seconds, Prediction Time: {cat_pred_time} seconds")

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 7.87 µs
CatBoost - RMSE: 1921.0681474270573, MAE: 1219.7630650580936, R2: 0.8175585568673962
Cat Boost - Training Time: 4.636782646179199 seconds, Prediction Time: 0.026078224182128906 seconds


In [23]:

# Train an XGBoost model with fixed hyperparameters (no tuning is performed here)

xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=12345)
xgb_start_time = time.time()
xgb_model.fit(X_train, y_train)
xgb_end_time = time.time()
xgb_train_time = xgb_end_time - xgb_start_time

xgb_start_time = time.time()
xgb_preds = xgb_model.predict(X_test)
xgb_end_time = time.time()
xgb_pred_time = xgb_end_time - xgb_start_time

%time

print(f"XGBoost - Training Time: {xgb_train_time} seconds, Prediction Time: {xgb_pred_time} seconds")

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 5.96 µs
XGBoost - Training Time: 340.5770080089569 seconds, Prediction Time: 0.6043856143951416 seconds


---

## Model analysis

### Model Quality

**Decision Tree Regression**:  
RMSE: 2356.95  
MAE: 1340.83  
R²: 0.725  
Analysis: This model performs worse than Random Forest but far better than Linear Regression. The reported metrics come from the tree with default hyperparameters, not from the grid-searched one.  

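**Linear Regression**:  
RMSE: 4109.27  
MAE: 3072.44  
R²: 0.165  
Analysis: This model has the highest errors (RMSE and MAE) and the lowest R², indicating that it explains little of the variance in the data. It was trained only on the scaled numeric columns, so it did not see the one-hot encoded categorical features.  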

**Random Forest**:  
RMSE: 2008.82  
MAE: 1299.02  
R²: 0.801  
Analysis: This model performs significantly better than Linear Regression, with much lower errors and a high R² value, indicating that it explains a large portion of the variance.  

**LightGBM**:  
RMSE: 1883.14  
MAE: 1190.57  
R²: 0.825  
Analysis: This model has the lowest errors and the highest R², indicating it explains the most variance in the data.  

**CatBoost**:  
RMSE: 1921.07  
MAE: 1219.76  
R²: 0.818  
Analysis: This model also performs very well, with errors and R² close to those of LightGBM.  

### Training and Prediction Time

**Decision Tree Regressor**:  
Training Time: 8.32 seconds  
Prediction Time: 0.08 seconds  
Analysis: Moderate training time and very fast prediction.  

**Linear Regression**:  
Training Time: 0.035 seconds  
Prediction Time: 0.0044 seconds  
Analysis: Extremely fast in both training and prediction.  

**Random Forest**:  
Training Time: 277.9 seconds  
Prediction Time: 0.582 seconds  
Analysis: Takes significantly longer to train but is relatively fast in prediction.  

**LightGBM**:  
Training Time: 5.7 seconds  
Prediction Time: 0.598 seconds  
Analysis: Moderate training time and fast in prediction.  

**CatBoost**:  
Training Time: 4.64 seconds  
Prediction Time: 0.026 seconds  
Analysis: Moderate training time and very fast in prediction.  

**XGBoost**:  
Training Time: 340.6 seconds  
Prediction Time: 0.604 seconds  
Analysis: Takes the longest to train but is relatively fast in prediction.  


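### Summary

**Best Quality**: LightGBM, followed by CatBoost, among the models that were scored (no quality metrics were computed for XGBoost).  
**Best Speed**: Linear Regression is the fastest in both training and prediction, but its quality is poor. For a balance of speed and quality, LightGBM and CatBoost are good choices; CatBoost predicts considerably faster (about 0.03 seconds versus 0.6 seconds on the test set).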
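
The trade-off described in the summary can be made explicit by putting the recorded figures side by side. The sketch below uses the values printed in the cell outputs of this notebook, rounded. The RMSE ceiling of 2000 is an arbitrary threshold chosen for illustration, not a requirement stated by Rusty Bargain.

```python
# (RMSE, training seconds, prediction seconds), rounded from the recorded cell outputs
results = {
    'Linear Regression': (4109.27, 0.03, 0.004),
    'Decision Tree': (2356.95, 8.32, 0.083),
    'Random Forest': (2008.82, 277.88, 0.582),
    'LightGBM': (1883.14, 5.70, 0.598),
    'CatBoost': (1921.07, 4.64, 0.026),
}

# Rank by prediction quality (lower RMSE is better)
by_rmse = sorted(results, key=lambda name: results[name][0])

# Hypothetical quality ceiling, chosen only for illustration
max_rmse = 2000
good_enough = [name for name in by_rmse if results[name][0] <= max_rmse]

# Among the acceptable models, pick the one with the fastest prediction
fastest = min(good_enough, key=lambda name: results[name][2])

print(by_rmse)
print(fastest)
```

Changing the ceiling or the tie-breaking criterion (training time instead of prediction time, for example) is a one-line edit, which makes it easy to see how sensitive the recommendation is to the priorities listed at the top.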

---