<a id='pred_top'></a>

# Predict auction price

Try several models to improve prediction accuracy.

## Model fitting

- Linear fits  
  1. [Simple linear fit](#pred_model_1)  
     No cross validation. Observations with missing values are dropped.
  2. [Dependent values scaled](#pred_model_2)  
     The dependent variable here is _price_.
  3. [Partial data](#pred_model_3)  
     Only young cars
- Multiple linear regression models  
  1. [MLR fit without imputation](#pred_model_4)  
  2. [With imputation](#pred_model_5)  
  3. [Include categorical features](#pred_model_6)  
  4. [Lasso regularization](#pred_model_7)  
  5. [Include engineered features](#pred_model_8)

## Results

- [Model performance](#pred_accuracies)
- [Save best model](#pred_save_model) **TODO**  
  This is not implemented yet: some preprocessing functions do not work well with `pickle`.
- [Predictions](#pred_predict)
     
  

In [1]:
import sys
import os
import re
import json

In [2]:
with open('../assets/drz-settings-current.json', 'r') as fid:
    cfg = json.load(fid)
print(cfg['AUCTION'])

if cfg['AUCTION']['kind'] == 'opbod':
    raise NotImplementedError
    
OPBOD = cfg['AUCTION']['kind'] == 'opbod'
AUCTION_ID = cfg['AUCTION']['id']
DATA_DIR = cfg['FILE_LOCATION']['data_dir']
RESULTS_DIR = cfg['FILE_LOCATION']['report_dir']
VERBOSE = int(cfg['GENERAL']['verbose'])
SAVE_METHOD = cfg['GENERAL']['save_method']


{'kind': 'inschrijving', 'id': '2025-0010', 'date': '20250524'}


In [3]:
if SAVE_METHOD == 'skip_when_exist':
    do_save = lambda fn: not(os.path.isfile(fn))
elif SAVE_METHOD == 'always_overwrite':
    do_save = lambda _: True
elif SAVE_METHOD == 'skip_save':
    do_save = lambda _: False
else:
    raise NotImplementedError(f'SAVE_METHOD: {SAVE_METHOD} not implemented')
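
The three save policies in the cell above can also be written as a lookup table, which keeps the policy names and their behaviour in one place. This is a minimal sketch, not the notebook's own code: `SAVE_POLICIES`, `make_do_save` and `do_save_demo` are hypothetical names introduced for illustration.

```python
import os

# Hypothetical lookup table: one predicate per save policy named in the settings file.
SAVE_POLICIES = {
    'skip_when_exist': lambda fn: not os.path.isfile(fn),
    'always_overwrite': lambda fn: True,
    'skip_save': lambda fn: False,
}

def make_do_save(save_method):
    """Return a predicate that tells whether a file name should be written."""
    try:
        return SAVE_POLICIES[save_method]
    except KeyError:
        raise NotImplementedError(f'SAVE_METHOD: {save_method} not implemented') from None

# The predicate is True for a file that does not exist yet.
do_save_demo = make_do_save('skip_when_exist')
print(do_save_demo(os.path.join('definitely', 'missing', 'file.png')))
```

An unknown policy name still raises `NotImplementedError`, as in the `if`/`elif` chain.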

In [4]:
SAVE_METHOD

'skip_when_exist'

In [5]:
TAG_SINGLE = "nbconvert_instruction:remove_single_output"


In [6]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

import seaborn as sns

In [7]:
pd.__version__, np.__version__


('2.2.3', '2.2.3')

In [8]:
# set figure defaults (needs to be in a cell separate from the seaborn import)
plt.style.use([
    'default',
    f"{cfg['FILE_LOCATION']['app_dir']}/assets/movshon.mplstyle",
    f"{cfg['FILE_LOCATION']['app_dir']}/assets/context-notebook.mplstyle"
])

# Load data

In [9]:
fn = f'{DATA_DIR}/cars-for-ml.pkl'
print(fn)
df = pd.read_pickle(fn)
print(df.shape)

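# time deltas: convert to a number of days
sel = (df.dtypes == 'timedelta64[ns]') | (df.columns == 'age_at_import')
for col in df.loc[:, sel].columns:
    # assign whole columns so the dtype can change from timedelta to Float64
    df[col] = df[col].map(lambda x: x.days).astype('Float64')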
# nullable boolean
sel = df.dtypes == 'boolean'
df.loc[:,sel] = df.loc[:,sel].astype('O').fillna(np.nan)
# int to float
df.price = df.price.astype('Float64')
# categories
cat_columns = ['brand', 'model', 'fuel', 'body_type','color', 'energy_label', 'fourwd', 'automatic_gearbox', 'under_survey']
# numerical
num_columns = list(np.setdiff1d(df.columns, cat_columns + ['price']))
df.loc[:, num_columns] = df.loc[:, num_columns].astype('Float64')

# Factorized categorical values
fld = 'energy_label'
# replace empty with NaN creates factor '-1'
v, idx = pd.factorize(df[fld].replace({'': np.nan}), sort=True)
# convert '-1' back to NaN
v = v.astype(float)
v[v==-1] = np.nan
# Store in dataframe
new_col = 'converted_' + fld
df[new_col] = v
# update list
num_columns += [new_col]
cat_columns.remove(fld)
print('\nCategorical field [{}] is converted to sequential numbers with: '.format(fld), end='\n\t')
print(*['{} <'.format(c) for c in idx], end='\n\n')

# convert boolean to string
for fld in ['fourwd', 'automatic_gearbox', 'under_survey']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    # # update list
    # cat_columns += [new_col]
    # cat_columns.remove(fld)
    replace_dict = {
        '': '', 
        True: 'y', 
        False: 'n'
    }
    df[new_col] = df[fld].astype('O').replace(replace_dict)
    print('\nBoolean field [{}] is converted to numbers according to: '.format(fld), end='\n')
    print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# convert integer to float and replace -1
for fld in ['number_of_cylinders', 'number_of_doors', 'number_of_gears', 'number_of_seats']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        -1: np.nan, 
    }
    df[new_col] = df[fld].replace(replace_dict).astype(float)

# convert empty string to NaN
for fld in ['brand', 'model', 'fuel', 'body_type', 'color', 'fourwd']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        '': np.nan, 
    }
    df[new_col] = df[fld].replace(replace_dict)

# translate Dutch to English
fld = 'color'
new_col = fld
# # update list
# cat_columns += [new_col]
# cat_columns.remove(fld)
replace_dict = {
    '': 'missing', 
    'BLAUW': 'Blue',
    'ROOD': 'Red',
    'GROEN': 'Green',
    'GRIJS': 'Gray',
    'WIT': 'White',
    'ZWART': 'Black',
    'BEIGE': 'Beige',
    'BRUIN': 'Brown',
    'ROSE': 'Pink',
    'GEEL': 'Yellow',
    'CREME': 'Creme',
    'ORANJE': 'Orange',
    'PAARS': 'Purple'
}
df[new_col] = df[fld].replace(replace_dict)
print('\nField [{}] is converted according to: '.format(fld), end='\n')
print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# reporting
try:
    print('Categorical:', len(cat_columns))
    for i, c in enumerate(df[cat_columns].columns):
        print('\t[{:2.0f}] {:s}'.format(i+1, c))
    print('Numerical:', len(num_columns))
    for i, c in enumerate(df[num_columns].columns):
        print('\t[{:2.0f}] {:s}'.format(i+1, c))
    print('Last lot in data set:\n\t{}'.format(df.index[-1]))
except KeyError:
    # some listed columns are missing from the data: keep only the ones present
    cat_columns = [c for c in cat_columns if c in df.columns]
    num_columns = [c for c in num_columns if c in df.columns]
    print('! not all fields are in data !. Skip for now')

/home/tom/bin/satdatsci/Saturday-Datascience/data/cars-for-ml.pkl
(13174, 29)

Categorical field [energy_label] is converted to sequential numbers with: 
	A < B < C < D < E < F < G < nan <


Boolean field [fourwd] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [automatic_gearbox] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [under_survey] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Field [color] is converted according to: 
	"" -> missing (<class 'str'>)
 	"BLAUW" -> Blue (<class 'str'>)
 	"ROOD" -> Red (<class 'str'>)
 	"GROEN" -> Green (<class 'str'>)
 	"GRIJS" -> Gray (<class 'str'>)
 	"WIT" -> White (<class 'str'>)
 	"ZWART" -> Black (<class 'str'>)
 	"BEIGE" -> Beige (<class 'str'>)
 	"BRUIN" -> Brown (<class 'str


In [10]:
# Store model results in a dictionary: instantiate an empty dict
models = dict()
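
The loading cell turns `energy_label` into sequential numbers with `pd.factorize`, which encodes missing values as `-1`. The sketch below shows that conversion on a toy column. The sample values are made up and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy column; the labels and the empty string are made-up sample values.
labels = pd.Series(['B', 'A', '', 'G', 'A'])

# Empty strings become missing values, which factorize encodes as -1.
codes, uniques = pd.factorize(labels.replace({'': np.nan}), sort=True)

# Switch to float so that the -1 sentinel can be replaced by NaN.
codes = codes.astype(float)
codes[codes == -1] = np.nan

print(list(uniques))  # sorted categories; a label's position is its code
print(codes)
```

Because `sort=True` orders the categories alphabetically, the codes keep the A-to-G ranking of the labels.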

In [11]:
def split_shelve_vars():
    import types
    to_shelve = {}
    not_to_shelve = {}

    # loop over global variables (within this function is ignored)
    for var,val in globals().items():
        
        # skip variables based on names
        if re.match(r'^_(\d+|(i+\d*))$', var) is not None:
            not_to_shelve[var] = '-n'
            continue
        if re.match('^_+$', var) is not None:
            not_to_shelve[var] = '---'
            continue
        if var in ('_dh', '_ih', '_oh'):
            not_to_shelve[var] = '-dio'
            continue
        if var in ('In', 'Out'):
            not_to_shelve[var] = '-io'
            continue
        if var in ('__doc__', '__loader__', '__name__', '__package__', '__session__', '__spec__'):
            not_to_shelve[var] = val
            continue
        
        # skip modules and functions (built-in or user-defined)
        if isinstance(val, (types.ModuleType, types.BuiltinFunctionType, types.FunctionType)):
            not_to_shelve[var] = type(val)
            continue

        # store
        to_shelve[var] = val
        
    return not_to_shelve, to_shelve
    
drop,keep = split_shelve_vars()
list(keep.keys())

['get_ipython',
 'exit',
 'quit',
 'fid',
 'cfg',
 'OPBOD',
 'AUCTION_ID',
 'DATA_DIR',
 'RESULTS_DIR',
 'VERBOSE',
 'SAVE_METHOD',
 'TAG_SINGLE',
 'fn',
 'df',
 'sel',
 'cat_columns',
 'num_columns',
 'fld',
 'v',
 'idx',
 'new_col',
 'replace_dict',
 'models']
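
The filtering above can be tried in isolation by passing an ordinary dictionary instead of `globals()`. This is a simplified sketch: `split_namespace` and the demo dictionary are hypothetical, and it merges the name checks into a single pattern.

```python
import re
import types

# IPython bookkeeping names such as _1, _i, _i12, _, __ (pattern mirrors the cell above).
_IPYTHON_NAME = re.compile(r'^_(\d+|i+\d*)$|^_+$')
_SKIP_TYPES = (types.ModuleType, types.BuiltinFunctionType, types.FunctionType)

def split_namespace(namespace):
    """Split a name -> value mapping into (skipped, kept) dictionaries."""
    skipped, kept = {}, {}
    for name, value in namespace.items():
        if _IPYTHON_NAME.match(name) or isinstance(value, _SKIP_TYPES):
            skipped[name] = type(value)
        else:
            kept[name] = value
    return skipped, kept

# Made-up namespace with history entries, a module, a built-in and two data variables.
demo = {'_i3': 'x = 1', '_': None, 're': re, 'len': len, 'VERBOSE': 1, 'models': {}}
skipped, kept = split_namespace(demo)
print(sorted(kept))
```

Objects such as `get_ipython`, `exit` and `quit` pass this type check, presumably because they are not plain functions, which is why the notebook removes them by hand in the next cell.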

In [12]:
keep.pop('get_ipython')
keep.pop('exit')
keep.pop('quit')
keep.pop('fid')

<_io.TextIOWrapper name='../assets/drz-settings-current.json' mode='r' encoding='UTF-8'>

In [13]:
import shelve
from inspect import getsource
import types


In [14]:
with shelve.open('./predict-price.shelve', flag='n') as slf:
    for k in [
        'cfg',
        #'OPBOD',
        #'AUCTION_ID',
        #'DATA_DIR',
        'RESULTS_DIR',
        #'VERBOSE',
        #'SAVE_METHOD',
        #'TAG_SINGLE',
        #'fn',
        'df',
        'cat_columns',
        'num_columns',
        'models',
        'do_save']:
        print(k)
        if k in keep:
            v = keep[k]
        else:
            v = globals()[k]
            if isinstance(v, types.FunctionType):
                src = getsource(v).strip()
                cnt = len([k for k in slf.keys() if k.startswith('def')])
                k = f'def{cnt}:'
                v = src
            
        slf[k] = v


cfg
RESULTS_DIR
df
cat_columns
num_columns
models
do_save
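
The cell above stores plain values directly and stores functions as source text under a `defN:` key, which sidesteps pickling the function object. Below is a minimal round trip under that scheme. The shelf lives in a temporary directory, the stored path is a made-up value, and the source string is typed in by hand rather than obtained with `getsource`.

```python
import os
import shelve
import tempfile

# Source text of the helper, kept as a string (pickle cannot serialize a lambda).
source = "do_save = lambda fn: not os.path.isfile(fn)"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.shelve')

    # flag='n' always creates a new, empty shelf.
    with shelve.open(path, flag='n') as slf:
        slf['RESULTS_DIR'] = '/tmp/results'  # made-up value
        slf['def0:'] = source

    # flag='r' opens the shelf read-only.
    restored = {'os': os}
    with shelve.open(path, flag='r') as slf:
        for key, value in slf.items():
            if key.startswith('def'):
                exec(value, restored)  # re-create the function from its source
            else:
                restored[key] = value

print(restored['RESULTS_DIR'], restored['do_save']('no-such-file'))
```

Executing stored source is only reasonable here because the shelf is written and read by the same set of notebooks.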


In [15]:
for m in range(1,9):
    nb = f"./predict-price-model{m}.ipynb"
    display({'text/html':f'<HR><h3>Running {nb}</h3><hr>'}, raw=True)
    %run {nb}

RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(9722, 1)
(9722, 1)
/home/tom/bin/satdatsci/Saturday-Datascience/results/linear_regression_no_cv.png
Shelve file [./predict-price.shelve] contains models:
	linear regression no cv


RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(9512, 1)
(9512, 1)
(6658, 1)
(2854, 1)
According to "linear regression log price young"-model
Car depreciates to half its value every
	1348 days (3.7 years).
	y(t=   +0) = 27469 euro
	y(t=   +2) = 18870 euro
	y(t=   +4) = 12962 euro
	y(t=   +6) = 8905 euro
	y(t=   +8) = 6117 euro

	y(t= +3.7) = 13734 euro
	y(t=0) / 2 = 13734 euro
/home/tom/bin/satdatsci/Saturday-Datascience/results/linear_regression_log_price_young.png
Shelve file [./predict-price.shelve] contains models:
	linear regression no cv
	linear regression log price young
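
The half-life printed for the log-price model follows from the slope of the fit: if log10(price) = intercept + slope × age, the price halves every −log10(2) / slope days. The sketch below back-calculates intercept and slope from the printed output (27469 euro at t=0, half-life 1348 days). These are not the fitted coefficients themselves, and a year is taken as 365 days.

```python
import math

# Values back-calculated from the printed output above, not read from the fitted model.
price_at_zero = 27469.0  # euro, y(t=0)
half_life_days = 1348.0

# log10(price) = intercept + slope * age_in_days
intercept = math.log10(price_at_zero)
slope = -math.log10(2) / half_life_days

def predicted_price(age_days):
    """Price in euro predicted by the log-linear depreciation curve."""
    return 10 ** (intercept + slope * age_days)

for years in (0, 2, 4, 6, 8):
    print(f'y(t={years:+d}) = {predicted_price(years * 365):.0f} euro')
```

The values agree with the printed table to within the rounding of the half-life.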


RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(2973, 20)
(2973, 1)
(2081, 20)
(892, 20)
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_reduced_observations.png
Shelve file [./predict-price.shelve] contains models:
	linear regression no cv
	linear regression log price young
	MLR reduced observations


RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(11546, 20)
(11546, 1)
(8082, 20)
(3464, 20)
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_impute_median.png
Shelve file [./predict-price.shelve] contains models:
	linear regression no cv
	linear regression log price young
	MLR reduced observations
	MLR impute median


RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(11546, 29)
(11546,)
(8082, 29)
(3464, 29)
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.5s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.2s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.2s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.2s
[Col

RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(11546, 29)
(11546,)
(8082, 29)
(3464, 29)
Fitting 8 folds for each of 9 candidates, totalling 72 fits
[CV 1/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 1/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.613 total time=   0.9s
[CV 2/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 2/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.480 total time=   0.9s
[CV 3/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 3/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.508 total time=   1.0s
[CV 4/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 4/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.502 total time=   0.9s
[CV 5/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 5/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.343 
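
The `[CV ...]` lines above come from a grid search over the Lasso penalty `alpha` (8 folds, 9 candidates, 72 fits). A minimal sketch of such a search follows. The data are synthetic, the upper bound of the grid and the log-price target transform are assumptions, and wrapping a pipeline in `TransformedTargetRegressor` is only one setup that yields the parameter name `regressor__lasso__alpha`. The model notebook itself may be built differently.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pow10(y):
    """Inverse of log10, used to map predictions back to euro."""
    return 10.0 ** y

# Synthetic stand-in for the car data: price depends mainly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 10.0 ** (4.0 - 0.3 * X[:, 0] + 0.05 * rng.normal(size=200))

# Assumed structure: scaling + Lasso, fitted on log10(price).
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
    func=np.log10,
    inverse_func=pow10,
)

# Nine candidates starting at 1e-4; the upper bound is an assumption.
param_grid = {"regressor__lasso__alpha": np.logspace(-4, 0, 9)}
search = GridSearchCV(model, param_grid, cv=8, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["regressor__lasso__alpha"]
n_fits = len(search.cv_results_["params"]) * search.n_splits_
print(best_alpha, n_fits)
```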

RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
odometer              -0.439047
age                   -0.403877
> usage_intensity <   -0.030426
price                  1.000000
Name: price, dtype: float64

"usage_intensity" does not seem to correlate better than "age" and "odometer" seperately
(11546, 31)
(11546,)
(8082, 31)
(3464, 31)
Fitting 8 folds for each of 9 candidates, totalling 72 fits
[CV 1/8; 1/9] START regressor__lasso__alpha=0.0001..............................


[CV 1/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.491 total time=   1.6s
[CV 2/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 2/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.275 total time=   1.5s
[CV 3/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 3/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.588 total time=   1.6s
[CV 4/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 4/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.421 total time=   1.7s
[CV 5/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 5/8; 1/9] END regressor__lasso__alpha=0.0001;, score=0.242 total time=   1.8s
[CV 6/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 6/8; 1/9] END regressor__lasso__alpha=0.0001;, score=-3.874 total time=   1.6s
[CV 7/8; 1/9] START regressor__lasso__alpha=0.0001..............................
[CV 7/8; 1/9] END reg
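
One fold above scores −3.874. Assuming the score is the coefficient of determination, a negative value means the model predicts that fold worse than a constant equal to the fold mean. The sketch below uses made-up prices to show how a single wild prediction pushes R² below zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1000.0, 2000.0, 3000.0]  # made-up prices in euro

r2_good = r_squared(y_true, [1100.0, 1900.0, 3100.0])  # small errors everywhere
r2_bad = r_squared(y_true, [1000.0, 2000.0, 9000.0])   # one large overestimate
print(r2_good, r2_bad)
```

With prices on a linear euro scale, a few expensive lots can dominate a fold in this way.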

RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df
(11546, 31)
(11546,)
[0, np.str_('automatic_gearbox'), ['n', 'y', 'missing']] << transfer to numerical
[1, np.str_('body_type'), ['Cabriolet', 'Hatchback', 'Sedan', 'Stationwagen', 'MPV', 'Coupe', 'Vrachtwagen', 'Bestelwagen', 'Opleggertrekker', 'Pick-uptruck', 'Multipurpose vehicle (MPV)', 'missing']]
[2, np.str_('brand'), ['ASTON-MARTIN', 'MERCEDES-BENZ', 'BMW', 'RENAULT', 'CITROËN', 'VOLKSWAGEN', 'PORSCHE', 'BENTLEY', 'LEXUS', 'SEAT', 'AUDI', 'HYUNDAI', 'FIAT', 'MINI', 'SUBARU', 'SAAB', 'OPEL', 'SKODA', 'FORD', 'TOYOTA', 'JAGUAR', 'DAIHATSU', 'ALFA ROMEO', 'HONDA', 'PEUGEOT', 'MITSUBISHI', 'VOLVO', 'CHEVROLET', 'LADA-VAZ', 'SUZUKI', 'MAZDA', 'CHRYSLER', 'DODGE', 'MASERATI', 'FERRARI', 'SMART', 'SSANGYONG', 'KIA', 'JEEP', 'LAMBORGHINI', 'AIXAM', 'ROLLS ROYCE', 'HUMMER', 'TRIUMPH', 'DAEWOO', 'NISSAN', 'PONTIAC', 'CADILLAC', 'LAND ROVER', 'LANCIA', 'ROVER', 'DATSUN', 'HYMER', 'DACIA', '

  X.loc[:,'classic'] = X.loc[:,'classic'].astype('O').replace({True:+1, False:-1, np.nan:0})
  X.loc[:,'classic'] = X.loc[:,'classic'].astype('O').replace({True:+1, False:-1, np.nan:0})
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X.loc[:,cn].replace({'y':+1, 'n':-1, 'missing':0, np.nan: 0}, inplace=True)
  X.loc[:,cn].replace({'y':+1, 'n':-1, 'missing':0, np.nan: 0}, inplace=True)


grid search results
best decisiontreeregressor__max_depth=8.00000
/home/tom/bin/satdatsci/Saturday-Datascience/results/Decision_Tree_Regression.png
Shelve file [./predict-price.shelve] contains models:
	linear regression no cv
	linear regression log price young
	MLR reduced observations
	MLR impute median
	MLR with categorical
	MLR Lasso
	MLR added features
	Decision Tree Regression


AssertionError: stop running, below is sandboxing and testing


In [16]:
with shelve.open('./predict-price.shelve', flag='r') as slf:
    for k,v in slf.items():
        print(k)
        globals()[k] = v
        if re.match(r'def\d+:', k) is not None:
            print(v)
            exec(v) 

RESULTS_DIR
def0:
do_save = lambda fn: not(os.path.isfile(fn))
cfg
num_columns
cat_columns
models
df


In [17]:
# Best model
model_name = 'MLR added features'

# Display prediction errors

x_sample = df.dropna(subset=['price']).iloc[:,1:]
y_sample = df.dropna(subset=['price']).iloc[:,0]
# Add features
x_sample.loc[:,'usage_intensity'] = x_sample.odometer / x_sample.age
x_sample.loc[:,'classic'] = x_sample.age > 25*365
x_sample.loc[:,'classic'] = x_sample.loc[:,'classic'].astype('O').replace({True:'y', False:'n'})
#x_sample.loc[:,'youngtimer'] = (x_sample.age > 15*365) & (x_sample.age <= 25*365)
#x_sample.loc[:,'youngtimer'].replace({True:'y', False:'n'}, inplace=True)
x_sample[pd.isna(x_sample)] = np.nan
# predict again
y_sample_pred = models[model_name]['model'].predict(x_sample) 

x_sample['price'] = y_sample
x_sample['prediction_error'] = y_sample_pred - y_sample
x_sample['prediction_error_fraction'] = y_sample_pred/y_sample
x_sample['prediction_error_log'] = np.log10(x_sample.prediction_error_fraction)
x_sample['prediction_error_abslog'] = np.abs(np.log10(x_sample.prediction_error_fraction))
x_sample['prediction'] = y_sample_pred
x_sample['age_y'] = x_sample.age/365


# Note: some predictions are close to perfect because those lots are in the training set and are unique in brand etc.
print(f'best predictions of [{model_name}] model')
display(x_sample.sort_values(by='prediction_error_abslog').head(16).T)
print('worst predictions')
display(x_sample.sort_values(by='prediction_error_abslog').tail(16).T)
print('largest underestimate')
display(x_sample.sort_values(by='prediction_error').head(16).T)
print('largest overestimate')
display(x_sample.sort_values(by='prediction_error').tail(16).T)
print('worst prediction recent auction')
is_last_auction = x_sample.index.str.startswith('-'.join(x_sample.index[-1].split('-')[:2]))
display(x_sample[is_last_auction].sort_values(by='prediction_error_abslog').tail(8).T)

plt.figure(figsize=[8,8])
plt.plot(x_sample.age_y, x_sample.prediction_error_log, color='k', marker='s', markeredgecolor = (0, 0, 0, 0), markerfacecolor = (0, 0, 0, 1), linestyle='None', ms=1)
plt.axhline(0, lw=2, linestyle='--', color ='k')
plt.xlabel('age [years]')
plt.ylabel('prediction error [log of fraction]\n(positive: prediction overestimates)')
plt.show()



best predictions of [MLR added features] model


Unnamed: 0,2020-6-7175,2025-05-704210,2021-11-703811,2021-08-810118,2019-2-7153,2024-09-706417,2014-10-7133,2021-10-704710,2023-01-702601,2017-11-8158,2023-11-707122,2019-5-2208,2021-12-805822,2022-03-704903,2021-09-803919,2021-11-802421
brand,RENAULT,MERCEDES-BENZ,OPEL,AUDI,VOLKSWAGEN,VOLVO,VOLKSWAGEN,KIA,VOLKSWAGEN,FIAT,TOYOTA,MERCEDES-BENZ,SHARE NGO,OPEL,MERCEDES-BENZ,MERCEDES-BENZ
model,kangoo,cls 350 cdi,astra station wagon,a3 sportback,golf,v70,golf,picanto,up!,panda,aygo,c 200 cdi,zd,corsa,c 63 amg,a 200
age,5927.0,5203.0,5352.0,2656.0,3031.0,3026.0,4579.0,386.0,1452.0,5030.0,5577.0,4170.0,,4974.0,3318.0,1111.0
fuel,Diesel,Diesel,Benzine,Diesel,Benzine,Diesel,Benzine,Benzine,Benzine,Benzine,Benzine,Diesel,,Benzine,Benzine,Benzine
odometer,310731.0,322570.0,202255.0,197384.0,160339.0,308837.0,106857.0,7286.0,92229.0,140185.0,154194.0,426360.287232,9288.0,181467.0,180407.0,40796.0
days_since_inspection_invalid,-83.0,-75.0,-149.0,-40.0,-140.0,-192.0,-3046.0,-1075.0,-9.0,-482.0,-1.0,112.0,,-140.0,-3.0,-350.0
age_at_import,0.0,0.0,0.0,188.0,2435.0,0.0,0.0,0.0,0.0,0.0,0.0,1496.0,,0.0,1212.0,602.0
body_type,Stationwagen,Sedan,Stationwagen,Hatchback,Stationwagen,Stationwagen,Hatchback,MPV,Hatchback,Hatchback,Hatchback,Stationwagen,,MPV,Sedan,Stationwagen
displacement,1461.0,2987.0,1598.0,1968.0,1197.0,1969.0,1598.0,998.0,999.0,1108.0,998.0,2148.0,,1364.0,6208.0,1332.0
number_of_cylinders,4.0,6.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,4.0,3.0,4.0,0.0,4.0,8.0,4.0


worst predictions


Unnamed: 0,2024-03-220205,2015-03-2420,2023-12-240123,2021-09-260619,2017-8-7139,2024-11-704021,2022-12-260032,2019-4-2021,2022-09-265029,2018-8-2400,2015-02-2200,2015-03-2402,2014-12-2207,2018-7-2415,2017-5-2216,2017-3-2000
brand,MERCEDES-BENZ,BENTLEY,MERCEDES-BENZ,LOTUS,BMW,VOLVO,CHEVROLET,VOLKSWAGEN,LAMBORGHINI,ROLLS ROYCE,ASTON-MARTIN,FORD,FORD,AUSTIN-HEALEY,ALFA ROMEO,ALFA ROMEO
model,g63 amg,contintal gt 60w12 gtc,g63 amg,elise (lhd & rhd),5er reihe,p 13134,impala sport coupe,111011,urus,phantom drophead coupe,vanguish volante,thunderbird,thunderbird,3000 mkiii phase ii,2000 gtv,2000 gtv
age,,,,8377.0,,20959.0,20000.0,19787.0,,,,21063.0,20973.0,19116.0,16257.0,16196.0
fuel,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine/LPG/nan,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine
odometer,29724.0,18210.0,29724.0,87992.0,362604.0,16234.0,73172.043648,91157.0,17340.0,11305.0,4778.0,86207.0,86207.0,106800.895872,23982.0,23982.0
days_since_inspection_invalid,,,,-288.0,,,,1099.0,,,,,,-402.0,-739.0,-800.0
age_at_import,,,,0.0,,0.0,19118.0,16180.0,,,,12826.0,12826.0,18063.0,0.0,0.0
body_type,,,,Cabriolet,,,Coupe,Sedan,,,,,,Cabriolet,Coupe,Coupe
displacement,,,,1796.0,,,5358.0,1192.0,,,,,,2912.0,,
number_of_cylinders,,,,4.0,,4.0,8.0,4.0,,,,8.0,8.0,6.0,4.0,4.0


largest underestimate


Unnamed: 0,2022-09-265029,2023-12-240123,2023-02-200304,2024-03-220205,2015-02-2200,2021-12-260012,2018-8-2400,2022-05-260625,2019-11-2418,2019-4-2411,2015-03-2420,2015-01-2414,2018-6-2410,2023-02-260104,2022-05-260925,2018-8-2410
brand,LAMBORGHINI,MERCEDES-BENZ,PORSCHE,MERCEDES-BENZ,ASTON-MARTIN,LAMBORGHINI,ROLLS ROYCE,FERRARI,PORSCHE,MERCEDES-BENZ,BENTLEY,SKODA,MERCEDES-BENZ,LAND ROVER,LAND ROVER,ASTON-MARTIN
model,urus,g63 amg,911 turbo s,g63 amg,vanguish volante,132 se,phantom drophead coupe,430 scuderia,panamera turbo s e-hybrid,amg s63 cabriolet,contintal gt 60w12 gtc,octavia,S65 AMG,autobiography d350,range rover 3.0 lwb autobiogra,dbs
age,,,,,,8546.0,,,,,,315.0,,,,
fuel,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Benzine,Hybrid,Benzine,Benzine,Benzine,Benzine,Diesel,Benzine,Benzine
odometer,17340.0,29724.0,1285.0,29724.0,4778.0,40090.0,11305.0,11077.0,6925.0,13.0,18210.0,6796.0,6379.0,734.0,183.0,58429.0
days_since_inspection_invalid,,,,,,1949.0,,,,,,-2972.0,,,,
age_at_import,,,,,,6164.0,,,,,,0.0,,,,
body_type,,,,,,Coupe,,,,,,Sedan,,,,
displacement,,,,,,5707.0,,,,,,1798.0,,,,
number_of_cylinders,,,,,,12.0,,,,,,4.0,,,,


largest overestimate


Unnamed: 0,2023-04-707708,2021-10-805220,2022-09-704409,2022-11-705411,2015-02-2202,2024-03-220505,2022-11-260311,2023-02-700604,2023-12-705424,2022-06-712506,2023-05-262310,2025-01-708801,2021-03-2610,2021-06-702006,2023-08-701216,2024-03-240106
brand,MERCEDES-BENZ,MERCEDES-BENZ,BMW,BMW,PORSCHE,AUDI,BMW,BMW,LAND ROVER,CHEVROLET,AUDI,PORSCHE,LAND ROVER,PORSCHE,PORSCHE,LAND ROVER
model,amg gle 53 4matic+ coupe,amg a 35,x6 m,x5 m,panamera 4s,rs 6 avant,z4 m40i,x5 m,range rover velar,astro,rs q8,cayenne e-hybrid,range rover sport,macan s,cayenne e-hybrid,range rover sport
age,1053.0,399.0,1695.0,2569.0,1602.0,1086.0,733.0,2661.0,870.0,9428.0,600.0,462.0,1851.0,783.0,1354.0,1470.0
fuel,Benzine/Elektriciteit,Benzine,Benzine,Benzine,Benzine,Benzine/Elektriciteit,Benzine,Benzine,Benzine/Elektriciteit,Benzine,Benzine/Elektriciteit,Elektriciteit/Benzine,Benzine,Benzine,Benzine/Elektriciteit,Benzine
odometer,18353.0,1160.0,119198.0,73626.0,43374.0,55835.0,8197.0,73626.0,45854.0,126008.416512,19510.0,14579.0,110979.0,25070.0,81991.0,48791.0
days_since_inspection_invalid,-408.0,-1062.0,234.0,-432.0,-2569.0,-375.0,-728.0,-340.0,-591.0,370.0,-861.0,-999.0,-121.0,-678.0,-107.0,9.0
age_at_import,575.0,75.0,309.0,2270.0,645.0,728.0,0.0,2270.0,125.0,7414.0,255.0,0.0,1246.0,0.0,408.0,553.0
body_type,Stationwagen,Sedan,Sedan,Stationwagen,Hatchback,Stationwagen,Cabriolet,Stationwagen,Stationwagen,,Hatchback,Stationwagen,Stationwagen,Stationwagen,Stationwagen,Stationwagen
displacement,2999.0,1991.0,4395.0,4395.0,4806.0,3996.0,2998.0,4395.0,1997.0,4300.0,3996.0,2995.0,4999.0,2995.0,2995.0,4999.0
number_of_cylinders,6.0,4.0,8.0,8.0,8.0,8.0,6.0,8.0,4.0,6.0,8.0,6.0,8.0,6.0,6.0,8.0


worst prediction recent auction


Unnamed: 0,2025-05-960509,2025-05-708509,2025-05-707010,2025-05-708409,2025-05-708009,2025-05-803309,2025-05-260009,2025-05-705109
brand,OPEL,HONDA,MERCEDES-BENZ,VOLKSWAGEN,FORD,OPEL,AUDI,VOLVO
model,astra sports tourer,insight,c 320 cdi,golf,focus,astra,rs6 quattro performance,xc90
age,4242.0,5508.0,,4601.0,4376.0,,,
fuel,Diesel,Benzine/Elektriciteit,Diesel,Diesel,Diesel,Diesel,Benzine,Diesel
odometer,223125.0,,283247.0,365731.0,400699.0,,38826.0,258496.051968
days_since_inspection_invalid,534.0,394.0,,385.0,246.0,,,
age_at_import,3337.0,0.0,,2473.0,2547.0,,,
body_type,Multipurpose vehicle (MPV),Hatchback,,Stationwagen,Stationwagen,,,
displacement,1956.0,1339.0,,1598.0,1997.0,,,
number_of_cylinders,4.0,4.0,,4.0,4.0,,,
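
The tables above are ranked by two different error measures: the signed difference in euro and the absolute log10 of the ratio between prediction and price. The sketch below computes both for made-up numbers. `error_measures` is a hypothetical helper whose keys mirror the column names used in the cell.

```python
import math

def error_measures(actual, predicted):
    """Error measures for one lot, named after the columns in the cell above."""
    ratio = predicted / actual
    return {
        'prediction_error': predicted - actual,
        'prediction_error_fraction': ratio,
        'prediction_error_log': math.log10(ratio),
        'prediction_error_abslog': abs(math.log10(ratio)),
    }

# Made-up lot of 10000 euro, predicted a factor two too high and a factor two too low.
over = error_measures(actual=10_000.0, predicted=20_000.0)
under = error_measures(actual=10_000.0, predicted=5_000.0)
print(over['prediction_error'], under['prediction_error'])
print(over['prediction_error_abslog'], under['prediction_error_abslog'])
```

The absolute log ratio treats a factor-two overestimate and underestimate alike, whereas the difference in euro weighs expensive cars more heavily. That is consistent with the "largest underestimate" table being filled with luxury brands.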


## Model accuracies

In [18]:
# plot R^2

# counter for x-offset
c=0

# figure
fig = plt.figure(figsize=[4,2])
ax = fig.gca()
xs = ys = fs = np.empty(0)

# loop over all models
for name,res in models.items():

    c+=1 # x-offset

    if name == 'linear regression no cv':
        # No cv, so only one value. Make it a list of one for type consistency
        k = 'R^2'
        rsq = [res[k]]
    
    else: 
        k = 'cv R^2'
        rsq = res[k]
        
    if 'n betas effective' in res:
        ndf = res['n betas effective']
    elif 'betas' in res:
        ndf = len(res['betas'])
    elif 'n effective features' in res:
        ndf = res['n effective features']
    else:
        # no degrees-of-freedom information stored for this model
        ndf = np.nan
        
    # add r-squares and offset to vectors
    ys = np.concatenate([ys, rsq])
    xs = np.concatenate([xs, np.ones_like(rsq) * c])
    fs = np.concatenate([fs, [ndf]])

# actual plotting
sns.swarmplot(x=xs, y=ys, ax=ax, hue=None)
ax.bar(range(0,len(models)), [res['R^2'] for res in models.values()], width=0.8, fc='none')
for x,ndf in enumerate(fs):
    if pd.isna(ndf):
        continue
    if x == 0:
        s = f'd.f.: {ndf:.0f}'
    else:
        s = f'{ndf:.0f}'
    ax.text(x, 1, s, ha='center')
# prettify
ax.set_xticks(range(0,len(models)))
ax.set_xticklabels(labels=list(models.keys()), rotation=45, va='top', ha='right', style='italic')
ax.set_ylim(bottom=0, top=+1)
ax.set_title('Model performance\n', style='italic')
ax.set_ylabel('Coefficient of determination\n($R^2$)', style='italic')
ax.xaxis.set_tick_params(which='minor', bottom=False)

# save
file_name = f"{RESULTS_DIR}/model-performance.png"
if True or do_save(file_name): # always save (do_save is deliberately bypassed)
    print(file_name)
    with plt.style.context(f"{cfg['FILE_LOCATION']['app_dir']}/assets/context-paper.mplstyle"):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

/home/tom/bin/satdatsci/Saturday-Datascience/results/model-performance.png
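
The accuracy plots in the next cell report the RMSE in euro, including for models fitted on log10(price). For those models the values are transformed back with `10**y` before the error is computed. The sketch below shows the difference between the two scales on made-up numbers. `rmse` is a hypothetical helper.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error of two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

log_actual = [4.0, 3.0]  # made-up log10 prices: 10000 and 1000 euro
log_pred = [4.1, 2.9]

# Error in log10 units: both lots are off by the same factor.
rmse_log = rmse(log_actual, log_pred)

# Error in euro after back-transforming: the expensive lot dominates.
rmse_eur = rmse([10 ** v for v in log_actual], [10 ** v for v in log_pred])
print(rmse_log, rmse_eur)
```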


In [19]:
# plot data

# loop over all models
for model_name in models.keys():
    print(model_name)
    res = models[model_name]
    features = num_columns.copy()
    
    # model specific adjustments
    if (model_name == 'linear regression log price') \
    or (model_name == 'linear regression log price young'):
        yX = df.loc[:,['price', 'age']].dropna()
        X = yX.iloc[:,1]
        y = yX.iloc[:,0]
        X[pd.isna(X)] = np.nan
        # log price is used
        y = np.log10(y)
        # unit
        unit = '(log[EUR])'
    elif (model_name == 'MLR reduced observations') \
    or (model_name == 'MLR impute median'):
        yX = df.dropna(subset=['price'] + features).loc[:,['price'] + features]
        X = yX.iloc[:,1:]
        y = np.log10(yX.iloc[:,0])
        X[pd.isna(X)] = np.nan
        unit = '(log[EUR])'
    elif (model_name == 'MLR with categorical') \
    or (model_name == 'MLR Lasso'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]
        X[pd.isna(X)] = np.nan
        unit = '(EUR)'
    elif (model_name == 'MLR added features'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]
        X[pd.isna(X)] = np.nan
        unit = '(EUR)'
        X.loc[:,'usage_intensity'] = X.odometer / X.age
        X.loc[:,'classic'] = X.age > 25*365
        X.loc[:,'classic'] = X.loc[:,'classic'].astype('O').replace({True:'y', False:'n'})
        X[pd.isna(X)] = np.nan
    elif (model_name == 'Decision Tree Regression'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]
        X[pd.isna(X)] = np.nan
        unit = '(EUR)'
        X.loc[:,'usage_intensity'] = X.odometer / X.age
        X.loc[:,'classic'] = X.age > 25*365
        # no inplace=True here: an inplace replace returns None, which would blank the column
        X.loc[:,'classic'] = X.loc[:,'classic'].astype('O').replace({True:+1, False:-1, np.nan:0})
        for col in ['fourwd', 'under_survey', 'automatic_gearbox']:
            X.loc[:, col].replace({'n':0, 'y':1}, inplace=True)
            
    else:
        # all original data
        yX = df.loc[:,['price', 'age']].dropna()
        X = yX.iloc[:,1]
        y = yX.iloc[:,0]
        X[pd.isna(X)] = np.nan
        unit = '(EUR)'
    
    if X.ndim != 1:
        n_feat = X.shape[1]
    else:
        n_feat = 1
        
    if model_name not in ('MLR with categorical', 'MLR Lasso', 'MLR added features', 'Decision Tree Regression'):
        # needed for .predict
        X = np.array(X).reshape(-1,n_feat)
        y = np.array(y).reshape(-1,1)
    
    # predict all data
    y_pred = res['model'].predict(X)
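    # log10 prices stay well below 10, so max(y) < 10 flags a log-scale model;
    # convert back to EUR before computing the rmse
    if max(y) < 10: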
        rmse = np.sqrt(np.mean(((10**y)-(10**y_pred))**2))
    else:
        rmse = np.sqrt(np.mean((y-y_pred)**2))
    print(rmse)

    # actual plotting
    fig,axs = plt.subplots(nrows=2, ncols=1, figsize=[8,8])
    
    # data
    axs[0].plot(y, y_pred, marker=',', linestyle='None')
    # error
    axs[1].plot(y, y_pred-y, marker=',', linestyle='None')
    
    # log axes for models fitted on prices in EUR; axis equal for top
    if model_name in ('MLR with categorical', 'MLR Lasso', 'MLR added features', 'Decision Tree Regression'):
        axs[0].set_xscale('log')
        axs[0].set_yscale('log')
        axs[1].set_xscale('log')
    axs[0].set_aspect(1)
    # store limits
    yl = axs[0].get_ylim()
    xl_top = axs[0].get_xlim()
    xl_bot = axs[1].get_xlim()
    xl = [np.max([xl_top[0], xl_bot[0]]), np.min([xl_top[1], xl_bot[1]])]
    # plot unity line and 0 error
    unity_line = [np.max([xl[0], yl[0]]), np.min([xl[1], yl[1]])]
    axs[0].plot(unity_line, unity_line, '-k', linewidth=2)
    axs[1].plot(xl, [0, 0], '-k', linewidth=2)
    # reset limits
    axs[0].set_xlim(xl)
    axs[1].set_xlim(xl)

    # make equal size panels
    # Note: sharex did not work
    bb=axs[0].get_position(False)
    rect_top = bb.bounds
    bb=axs[1].get_position(False)
    rect_bot = bb.bounds
    rect = list(rect_bot)
    rect[0] = rect_top[0]
    rect[2] = rect_top[2]
    axs[1].set_position(rect)
    
    # labeling
    fig.suptitle('{}\nrmse: EUR {:.0f}'.format(model_name,rmse), style='italic')
    axs[1].set_xlabel('Real price ' + unit, style='italic')
    axs[0].set_ylabel('Predicted price\n' + unit, style='italic')
    axs[1].set_ylabel('Prediction error\n' + unit, style='italic')
    
    # save
    file_name = f"{RESULTS_DIR}/{model_name.replace(' ','_')}-accuracy.png"
    if True or do_save(file_name): # always save
        print(file_name)
        with plt.style.context(f"{cfg['FILE_LOCATION']['app_dir']}/assets/context-paper.mplstyle"):
            plt.savefig(file_name, bbox_inches='tight', transparent=False)
    else:
        plt.show()
        print(f'Skip. {file_name} exists or saving is disabled in settings.')

linear regression no cv
9887.695312684224
/home/tom/bin/satdatsci/Saturday-Datascience/results/linear_regression_no_cv-accuracy.png
linear regression log price young
9092.92151046899
/home/tom/bin/satdatsci/Saturday-Datascience/results/linear_regression_log_price_young-accuracy.png
MLR reduced observations
7613.09372441863
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_reduced_observations-accuracy.png
MLR impute median
10568.957991016641
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_impute_median-accuracy.png
MLR with categorical
8999.771480591377
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_with_categorical-accuracy.png
MLR Lasso
7605.28607547973
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_Lasso-accuracy.png
MLR added features


7923.510036916174
/home/tom/bin/satdatsci/Saturday-Datascience/results/MLR_added_features-accuracy.png
Decision Tree Regression


6219.499984647982
/home/tom/bin/satdatsci/Saturday-Datascience/results/Decision_Tree_Regression-accuracy.png
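
The rmse values printed above are all expressed in EUR, including those of models fitted on log10 prices. A minimal sketch of that back-transformation is given below. The helper name `rmse_eur` and the explicit `log_scale` flag are assumptions for illustration (the cell above infers the scale from `max(y) < 10`), and the prices are made up. Both inputs are flattened first, so a column vector and a 1-D array cannot broadcast into a matrix by accident.

```python
import numpy as np

def rmse_eur(y_true, y_pred, log_scale=False):
    """Root mean squared error in EUR.

    With log_scale=True both inputs are taken to be log10(price)
    and are converted back to EUR before the error is computed.
    """
    # flatten so that (n, 1) and (n,) inputs are compared element-wise
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if log_scale:
        y_true = 10 ** y_true
        y_pred = 10 ** y_pred
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up prices, for illustration only
prices = np.array([1000.0, 10000.0])
preds = np.array([1100.0, 9000.0])

print(rmse_eur(prices, preds))
print(rmse_eur(np.log10(prices), np.log10(preds), log_scale=True))
```

Both calls should report the same error, since the second one only passes the same prices through a log10 round trip.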
