<a id='pred_top'>

# Predict auction price

Try several models and improve predicition accuracy

## Model fitting

- Linear fits  
  1. [Simple linear fit](#pred_model_1)  
     No cross validation. Observations with missing values are dropped.
  2. [Dependent values scaled](#pred_model_2)  
     Dependent value here is _prices_.
  3. [Partial data](#pred_model_3)  
     Only young cars
- Multiple linear regression models  
  1. [MLR fit without imputation](#pred_model_4)  
  2. [With imputation](#pred_model_5)  
  3. [Include categorical features](#pred_model_6)  
  4. [Lasso regularization](#pred_model_7)  
  5. [include engineered features](#pred_model_8) **TODO**  

## Results

- [Model performance](#pred_accuracies)
- [Save best model](#pred_save_model) **TODO**  
  This is not implemented yet. Some preprocessing functions are not handled well with `pickle`.
- [Predictions](#pred_predict)
     
  

In [1]:
import drz_config
cfg = drz_config.read_config()
VERBOSE = cfg['VERBOSE']
SKIPSAVE = cfg['SKIPSAVE']

if VERBOSE > 0:
    display(cfg)

{'settings_fn': '../code/assets/drz-auction-settings.ini',
 'DATE': '2022-01',
 'VERBOSE': 1,
 'OPBOD': False,
 'URL': 'http://verkoop.domeinenrz.nl/verkoop_bij_inschrijving_2022-0021',
 'EXTEND_URL': False,
 'CLOSEDDATA': True,
 'closed_data_fields': '*',
 'SKIPSAVE': False}

In [2]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

import seaborn as sns

In [3]:
# set figure defaults (needs to be in cell seperate from import sns)
plt.style.use(['default', '../assets/movshon.mplstyle', '../assets/context-notebook.mplstyle'])

# Load data

In [4]:
fn = '../data/cars-for-ml.pkl'
print(fn)
df = pd.read_pickle(fn)

# categories
cat_columns = ['brand', 'model', 'fuel', 'body_type','color', 'energy_label', 'fwd', 'automatic_gearbox', 'under_survey']
# numerical
num_columns = list(np.setdiff1d(df.columns, cat_columns + ['price']))

# Factorized categorical values
fld = 'energy_label'
# replace empty with NaN creates factor '-1'
v, idx = pd.factorize(df[fld].replace({'': np.NaN}), sort=True)
# convert '-1' back to NaN
v = v.astype(float)
v[v==-1] = np.NaN
# Store in dataframe
new_col = 'converted_' + fld
df[new_col] = v
# update list
num_columns += [new_col]
cat_columns.remove(fld)
print('\nCategorical field [{}] is converted to sequential numbers with: '.format(fld), end='\n\t')
print(*['{} <'.format(c) for c in idx], end='\n\n')

# convert boolean to string
for fld in ['fwd', 'automatic_gearbox', 'under_survey']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    # # update list
    # cat_columns += [new_col]
    # cat_columns.remove(fld)
    replace_dict = {
        '': '', 
        True: 'y', 
        False: 'n'
    }
    df[new_col] = df[fld].replace(replace_dict)
    print('\nBoolean field [{}] is converted to numbers according to: '.format(fld), end='\n')
    print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# convert integer to float and replace -1
for fld in ['number_of_cylinders', 'number_of_doors', 'number_of_gears', 'number_of_seats']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        -1: np.NaN, 
    }
    df[new_col] = df[fld].replace(replace_dict).astype(float)

# convert empty string to NaN
for fld in ['brand', 'model', 'fuel', 'body_type', 'color', 'fwd']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        '': np.NaN, 
    }
    df[new_col] = df[fld].replace(replace_dict)

# translate Dutch to English
fld = 'color'
new_col = fld
# # update list
# cat_columns += [new_col]
# cat_columns.remove(fld)
replace_dict = {
    '': 'missing', 
    'BLAUW': 'Blue',
    'ROOD': 'Red',
    'GROEN': 'Green',
    'GRIJS': 'Gray',
    'WIT': 'White',
    'ZWART': 'Black',
    'BEIGE': 'Beige',
    'BRUIN': 'Brown',
    'ROSE': 'Pink',
    'GEEL': 'Yellow',
    'CREME': 'Creme',
    'ORANJE': 'Orange',
    'PAARS': 'Purple,'
}
df[new_col] = df[fld].replace(replace_dict)
print('\nField [{}] is converted according to: '.format(fld), end='\n')
print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# reporting
try:
    print('Categorical:', len(cat_columns))
    [print('\t[{:2.0f}] {:s}'.format(i+1, c)) for i,c in enumerate(df[cat_columns].columns)]
    print('Numercial:', len(num_columns))
    [print('\t[{:2.0f}] {:s}'.format(i+1, c)) for i,c in enumerate(df[num_columns].columns)]
    print('Last lot in data set:\n\t{}'.format(df.index[-1]))
except:
    cat_columns = [c for c in cat_columns if c in df.columns]
    num_columns = [c for c in num_columns if c in df.columns]    
    print('! not all fields are in data !. Skip for now')

../data/cars-for-ml.pkl

Categorical field [energy_label] is converted to sequential numbers with: 
	A < B < C < D < E < F < G <


Boolean field [fwd] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [automatic_gearbox] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [under_survey] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Field [color] is converted according to: 
	"" -> missing (<class 'str'>)
 	"BLAUW" -> Blue (<class 'str'>)
 	"ROOD" -> Red (<class 'str'>)
 	"GROEN" -> Green (<class 'str'>)
 	"GRIJS" -> Gray (<class 'str'>)
 	"WIT" -> White (<class 'str'>)
 	"ZWART" -> Black (<class 'str'>)
 	"BEIGE" -> Beige (<class 'str'>)
 	"BRUIN" -> Brown (<class 'str'>)
 	"ROSE" -> Pink (<class 'str'>)
 	"GEEL" -> Yellow (<class

In [5]:
# Store model results in dictonary: Instantiate empty dict
models = dict()

<H1><a href="#pred_top">^</a></H1><a id='pred_model_1'>

- - - - - 
# Model: Simple linear fit
Regress age (in days) with price (euro).  

## >> BIG FAT WARNING <<
All data is used without train / test split. I.e. accuracy is based on data that was used for fit. This is considered bad practice!

## Prepare input

In [6]:
from sklearn import linear_model

model_name = 'linear regression no cv'

X = df.dropna(subset=['price','age']).age.values.reshape(-1,1)
y = df.dropna(subset=['price','age']).price.values.reshape(-1,1)
print(X.shape)
print(y.shape)

(6699, 1)
(6699, 1)


## Fit

In [7]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X,y) # fit with all data
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})

In [8]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
plt.plot(X/365.25, y/1000, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4)
hdl_fit = plt.plot(prediction_X/365, prediction_y/1000, color='blue', marker=None, linestyle='-', linewidth=4)
plt.legend(hdl_fit, ['n = {}, $R^2$ = {:.2f}\ny = {:+.0f}{:+.2f}*(x*365.25)'.format(
    models[model_name]['n'],
    models[model_name]['R^2'],
    *models[model_name]['betas']
)], loc='upper right')
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR X1000)', style='italic')
plt.title('Simple linear fit', style='italic')
plt.ylim(bottom = -10)
plt.xlim(left = 0)

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_no_cv.png


<H1><a href="#pred_top">^</a></H1><a id='pred_model_2'> 

## Model: linear but with scaled dependent values (prices)

Instead of using all data **train/test split** is performed. Also prices are log transformed.  

## Prepare input

In [9]:
from sklearn.model_selection import train_test_split, cross_val_score

model_name = 'linear regression log price'

X = df.dropna(subset=['price','age']).age.values.reshape(-1,1)
y = np.log10(df.dropna(subset=['price','age']).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(6699, 1)
(6699, 1)


## Fit

In [10]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train) # fit with training set
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(4689, 1)
(2010, 1)


In [11]:
depr_half_n_days = -(np.log10(2)/models[model_name]['betas'][1])
print('According to "{}"-model'.format(model_name))
print('Car depreciates to half its value every\n\t{:.0f} days ({:.1f} years).'.format(depr_half_n_days, depr_half_n_days/365.25))
for y in [0,2,4,6,8]:
    print('\ty(t={:+5.0f}) = {:.0f} euro'.format(y, 10**reg.predict([[y*365.25]])[0][0]))
print('\n\ty(t={:+5.1f}) = {:.0f} euro'.format(depr_half_n_days/365.25, 10**reg.predict([[depr_half_n_days]])[0][0]))
print('\ty(t=0) / 2 = {:.0f} euro'.format(10**models[model_name]['betas'][0]/2))

According to "linear regression log price"-model
Car depreciates to half its value every
	2379 days (6.5 years).
	y(t=   +0) = 10321 euro
	y(t=   +2) = 8343 euro
	y(t=   +4) = 6743 euro
	y(t=   +6) = 5451 euro
	y(t=   +8) = 4406 euro

	y(t= +6.5) = 5161 euro
	y(t=0) / 2 = 5161 euro


In [12]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
hdl_trn = plt.plot(X_train/365.25, np.power(10,y_train), marker='s', markeredgecolor = (0, 0, 1, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='train (n = {})'.format(y_train.shape[0]))
hdl_tst = plt.plot(X_test/365.25, np.power(10,y_test), marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='test (n = {}, $R^2$ = {:.2f})'.format(
                       y_test.shape[0],
                       models[model_name]['test R^2'],
                   ))
hdl_fit = plt.plot(prediction_X/365, np.power(10,prediction_y), color='blue', marker=None, linestyle='-', linewidth=4, 
                   label = '$log10(y)$ = {:+.2f}{:+.1e}*(x*365.25)\n($R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f}))'.format(
                       *models[model_name]['betas'],
                       models[model_name]['R^2'],
                       models[model_name]['cv R^2'].shape[0],
                       np.mean(models[model_name]['cv R^2']),
                       np.std(models[model_name]['cv R^2']),
                   ))
plt.legend()
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR)', style='italic')
plt.title('Linear fit with log(price)', style='italic')
plt.ylim(bottom = 10, top = 1000000)
plt.xlim(left = 0)
plt.yscale('log')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_log_price.png


<H1><a href="#pred_top">^</a></H1><a id='pred_model_3'> 

## Model: scaled price, but only young cars

Same as [model 2](#pred_model_2), but ignore cars older than 25y

## Prepare input

In [13]:
from sklearn.model_selection import train_test_split, cross_val_score

model_name = 'linear regression log price young'

is_yng = df.age/365.25 < 25

X = df[is_yng].dropna(subset=['price','age']).age.values.reshape(-1,1)
y = np.log10(df[is_yng].dropna(subset=['price','age']).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(6524, 1)
(6524, 1)


## Fit

In [14]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train)
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(4566, 1)
(1958, 1)


In [15]:
depr_half_n_days = -(np.log10(2)/models[model_name]['betas'][1])
print('According to "{}"-model'.format(model_name))
print('Car depreciates to half its value every\n\t{:.0f} days ({:.1f} years).'.format(depr_half_n_days, depr_half_n_days/365.25))
for y in [0,2,4,6,8]:
    print('\ty(t={:+5.0f}) = {:.0f} euro'.format(y, 10**reg.predict([[y*365.25]])[0][0]))
print('\n\ty(t={:+5.1f}) = {:.0f} euro'.format(depr_half_n_days/365.25, 10**reg.predict([[depr_half_n_days]])[0][0]))
print('\ty(t=0) / 2 = {:.0f} euro'.format(10**models[model_name]['betas'][0]/2))

According to "linear regression log price young"-model
Car depreciates to half its value every
	1308 days (3.6 years).
	y(t=   +0) = 24360 euro
	y(t=   +2) = 16541 euro
	y(t=   +4) = 11232 euro
	y(t=   +6) = 7627 euro
	y(t=   +8) = 5179 euro

	y(t= +3.6) = 12180 euro
	y(t=0) / 2 = 12180 euro


In [16]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
hdl_trn = plt.plot(X_train/365.25, np.power(10,y_train), marker='s', markeredgecolor = (0, 0, 1, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='train (n = {})'.format(y_train.shape[0]))
hdl_tst = plt.plot(X_test/365.25, np.power(10,y_test), marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='test (n = {}, $R^2$ = {:.2f})'.format(
                       y_test.shape[0],
                       models[model_name]['test R^2'],
                   ))
hdl_fit = plt.plot(prediction_X/365, np.power(10,prediction_y), color='blue', marker=None, linestyle='-', linewidth=4, 
                   label = '$log10(y)$ = {:+.2f}{:+.1e}*(x*365.25)\n($R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f}))'.format(
                       *models[model_name]['betas'],
                       models[model_name]['R^2'],
                       models[model_name]['cv R^2'].shape[0],
                       np.mean(models[model_name]['cv R^2']),
                       np.std(models[model_name]['cv R^2']),
                   ))
plt.legend()
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR)', style='italic')
plt.title('Linear fit with log(price) of young cars', style='italic')
plt.ylim(bottom = 10, top = 1000000)
plt.xlim(left = 0)
plt.yscale('log')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_log_price_young.png


<H1><a href="#pred_top">^</a></H1><a id='pred_model_4'> 

- - - - - 
# Model: Multiple linear fit

Above [simple linear models](#pred_model_1) only use _Age_ as predictor of price. Here MLR will regress many (numerical) features with price (euro).  


## Prepare input

In [17]:
model_name = 'MLR reduced observations'

features = num_columns 
# Can be reduced here

X = df.dropna(subset=['price'] + features).loc[:,features].values.reshape(-1,len(features))
y = np.log10(df.dropna(subset=['price'] + features).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(990, 20)
(990, 1)


## Fit

In [18]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train)
models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(693, 20)
(297, 20)


In [19]:
# plot coefficients
plt.figure(figsize=[8,2])

# sorted bar height
betas = models[model_name]['betas']
x = ['offset (log[EUR])'] + [features[i] for i in np.argsort(betas[1:])[::-1]]
y = [betas[0]] + sorted(betas[1:], reverse=True)

# plot bar
plt.bar(x=x, height=y, edgecolor='k', facecolor='None')

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<1:
        plt.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
plt.yticks(range(0,5,2))

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
plt.axhline(0, linewidth=2, linestyle='-', color='k')
        
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
yl = plt.gca().get_ylim()
plt.vlines(x_sign_switch-0.5, yl[0], yl[1], linewidth=2, linestyle='--')
plt.gca().set_ylim(yl)
# plt.gca().set_ylim(top=0.01, bottom=-0.01)

# labels        
plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')
plt.gca().xaxis.set_tick_params(which='minor', bottom=False)
plt.xlabel('Feature', style='italic')
plt.ylabel('Coefficient (a.u.)', style='italic')
plt.title('Multiple linear regression', style='italic') 

# stats
xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')


# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_reduced_observations.png


  plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')


<H1><a href="#pred_top">^</a></H1><a id='pred_model_5'> 

- - - - - 
# Model: MLR + imputer

MLR as above, but instead of `dropna` us an imputer. This allows to use more observation.  

At this point a pipeline is used.

## Prepare input

In [20]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

model_name = 'MLR impute median'

features = num_columns 
# Can be reduced here

yX = df.loc[:,['price'] + features].dropna(subset=['price'])
X = yX.iloc[:,1:].values.reshape(-1,len(features))
y = np.log10(yX.iloc[:,0].values.reshape(-1,1))
print(X.shape)
print(y.shape)

(6724, 20)
(6724, 1)


## Fit

In [21]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
pl = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    linear_model.LinearRegression()
)
models[model_name].update({'model':pl})

# fit
pl.fit(X,y)
models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [*pl.steps[-1][1].intercept_, *pl.steps[-1][1].coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':pl.score(X,y)})
models[model_name].update({'test R^2':pl.score(X_test,y_test)})
cv_results = cross_val_score(pl, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(4706, 20)
(2018, 20)


In [22]:
# plot coefficients
plt.figure(figsize=[8,4])

# sorted bar height
betas = models[model_name]['betas']
x = ['offset (log[EUR])'] + [features[i] for i in np.argsort(betas[1:])[::-1]]
y = [betas[0]] + sorted(betas[1:], reverse=True)

# plot bar
plt.bar(x=x, height=y, edgecolor='k', facecolor='None')

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<0.05:
        plt.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
plt.yticks(np.arange(-0.3,0.4,0.1))
plt.ylim(top=+0.3, bottom=-0.3)
# offset
x_val = x[0]
coef = y[0]
plt.text(x_val, 0.3, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
plt.axhline(0, linewidth=2, linestyle='-', color='k')

# labels        
plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')
plt.gca().xaxis.set_tick_params(which='minor', bottom=False)
plt.xlabel('Feature', style='italic')
plt.ylabel('Coefficient (a.u.)', style='italic')
plt.title('Multiple linear regression', style='italic') 

# stats
xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')


# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_impute_median.png


  plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')


<H1><a href="#pred_top">^</a></H1><a id='pred_model_6'> 

- - - - - 
# Model: MLR with categorical

As MLR, but do one-hot-encoding

Use different scalers for different columns:  
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html  
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer  
p. 68 book: ML with sklearn & tf

## Prepare input

In [23]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler
# from sklearn.pipeline import FeatureUnion

model_name = 'MLR with categorical'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
features = num_columns + cat_columns_reduced
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
# # only use young
# is_yng = yX.age/365.25 < 25
# yX = yX[is_yng]
X = yX.iloc[:,1:]
y = yX.iloc[:,0]
print(X.shape)
print(y.shape)


(6724, 29)
(6724,)


In [24]:
import re

# Split fuel helper functions

def split_lpg_type(s):
    '''Split lpg type from list of fuels separated by / '''
    # No type
    if s.endswith('lpg'):
        return s, ''
    if 'lpg' not in s:
        return s, ''
    # Type is after the last '/'
    M = re.search('^(.*)/(.*)$',s)
    if M:
        return M[1], M[2]
    else:
        return s, ''

def merge_lpg_and_lpgtype(fuel_type):

    '''Add LPG type to LPG (remove /). 
    Note that order of fuels is preserved. I.e. it is able to return both "benzine/lpg-g3" and "lpg-g3/benzine". '''
    
    lpg_type = fuel_type.apply(lambda s: 'lpg-' + split_lpg_type(s)[1] if (type(s) == str) and ('lpg' in s) else '')
    fuel_type_short = fuel_type.apply(lambda s: split_lpg_type(s)[0] if (type(s) == str) else '')
    fuel_type_new = pd.Series([f.replace('lpg', l) if type(f) == str else f for f,l in zip(fuel_type_short,lpg_type)])
    return fuel_type_new


def get_unique_fuels(fuel_type):
    
    '''Splitting fuels at "/" and return unique values'''
    
    # make list (as string)
    fuel_type_list = fuel_type.apply(lambda s:s.split('/') if type(s) == str else np.NaN).astype(str)
    
    # Get unique fuels
    possible_fuels = list() # empty list
    for l in fuel_type_list.unique():
        for ll in eval(l): # use eval to convert str to list
            possible_fuels += [ll]     
    # uniquify
    return np.unique(possible_fuels)

    
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer to make one-hot fuel encoder based on string
# This is different from get_dummies, because it can take a list of values in a field
class DummyfyFuel(BaseEstimator, TransformerMixin):
    def __init__(self, fuel_names=None):
        
        assert (fuel_names == None) or (isinstance(fuel_names, (list,))), '[fuel_names] should be list (or None)'
        
        self.fuel_names = fuel_names
        
    def fit(self, X, y=None):
        
        if not self.fuel_names:
            # get fuel names based on input.
            # Note that if train/test are split, test might lack a fuel type.
            self.fuel_names = get_unique_fuels(merge_lpg_and_lpgtype(X))

        return self
    
    def transform(self, X):
        
        # get stringyfied list
        fuel_type_list = merge_lpg_and_lpgtype(X).apply(lambda s:s.split('/') if type(s) == str else np.NaN).astype(str)
        # set index as input
        fuel_type_list.index = X.index

        # transform: dummies
        fuel_dummies = pd.DataFrame(index=fuel_type_list.index)
        for f in self.fuel_names:
            fuel_dummies['fuel_' + f] = fuel_type_list.apply(lambda l:int(f in eval(l)))

        return fuel_dummies


In [25]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(4706, 29)
(2018, 29)


In [26]:
# Create model

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
#fuel_list = ['benzine', 'diesel']
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns)),
    ('categorical', cat_transformer, pd.Index(cat_columns_reduced)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=True)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.LinearRegression()
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y,
#     func=lambda x:x,
#     inverse_func=lambda y: y,
#     inverse_func=np.exp,
)

models[model_name].update({'model':pl})

In [27]:
# fit
pl.fit(X_train, y_train)
y_pred = pl.predict(X_test)

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.2s


In [28]:
# sanity check that target transformation has occured as expected
# y_pred_manual_transform = mlr.predict(X_test)
# assert all(np.log10(y_pred)-y_pred_manual_transform == 0)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [pl.regressor_.steps[-1][1].intercept_, *pl.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':pl.score(X,y)})
models[model_name].update({'test R^2':pl.score(X_test,y_test)})
cv_results = cross_val_score(pl, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[Colum

In [29]:
# update features, by adding fuels
cat_columns_reduced += ['fuel']
cats += [fuel_list]


In [30]:
# Split betas per category feature.
idx_start = len(num_columns)+1
cat_betas = list()
for cat in cats:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns)+1] # first cat beta follows numerical betas 
assert cat_betas[-1][-1] == betas[-1] # last

In [31]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_reduced) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,4*n_plots]
)
plt.subplots_adjust(hspace=0.5)


# Plot numerical
plt.sca(axs[0])
# sorted bar height
betas = models[model_name]['betas']
num_betas = betas[1:len(num_columns)+1]
x = ['offset'] + [features[i] for i in np.argsort(num_betas)[::-1]]
y = [betas[0]] + sorted(num_betas, reverse=True)

# plot bar
plt.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<0.5:
        plt.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
plt.yticks(np.arange(-2,2.2,0.5))
plt.ylim(top=+2, bottom=-2)
# offset
x_val = x[0]
coef = y[0]
plt.text(x_val, 2, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
plt.axhline(0, linewidth=2, linestyle='-', color='k')

# labels        
rot = 45
fsz = 10
ha = 'right'
plt.gca().set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
plt.gca().xaxis.set_tick_params(which='minor', bottom=False)
plt.ylabel('Coefficient (a.u.)', style='italic')
plt.title('Multiple linear regression\nNumerical features', style='italic') 

# stats
xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')

# Plot categorical
for cat, cat_beta, cat_name, ax in zip(cats, cat_betas, cat_columns_reduced, axs[1:]):
    # activate subplot axes
    plt.sca(ax)
    # sort by height
    x = [cat[i] for i in np.argsort(cat_beta)[::-1]]
    y = sorted(cat_beta, reverse=True)
    #x = cat
    #y = cat_beta
    # plot bar
    plt.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=False)

    # prettify
    plt.yticks(np.arange(-1,+1.1,0.2))
    plt.ylim(top=+1, bottom=-1)

    # plot origin
    x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
    plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
    plt.axhline(0, linewidth=2, linestyle='-', color='k')

    # labels
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    plt.title('Categorical feature: ' + cat_name, style='italic')
    plt.ylabel('Coefficient (a.u.)', style='italic')
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 20:
        add_space = len(x) - 20
        xl = list(plt.xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        plt.xlim(xl)

# Label on bottom panel
plt.sca(axs[-1])
plt.xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

  plt.gca().set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
  ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)


../results/MLR_with_categorical.png


<H1><a href="#pred_top">^</a></H1><a id='pred_model_7'> 

- - - - - 
# Model: MLR regularized

As [previous model](#pred_model_6), but use regularization by using built-in Lasso

## Prepare input

In [32]:
from sklearn.model_selection import GridSearchCV

model_name = 'MLR Lasso'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
features = num_columns + cat_columns_reduced
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
X = yX.iloc[:,1:]
y = yX.iloc[:,0]
print(X.shape)
print(y.shape)


(6724, 29)
(6724,)


## Determine regularization rate (alpha)

Alpha is the hyperparameter that needs to be determined. For this the data needs to be splitted, but the dataset is too small to do a 3 way split (i.e. CV, Train, Test). Therefor spilt 2 way k-fold cv 
- **Test**: Hold-out set for calculating performance
- **Train**: Use to fit model and do CV


In [33]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(4706, 29)
(2018, 29)


In [34]:
# Create model (same as MLR with cats, but regressor is Lasso)

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns)),
    ('categorical', cat_transformer, pd.Index(cat_columns_reduced)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=True)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.Lasso(random_state=42)
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y
)



In [35]:
# grid search estimator
grid_search_alpha = GridSearchCV(
    estimator=pl,
    param_grid=[
        {
            'regressor__lasso__alpha': 10**(np.linspace(-5,-2,13)) # Choose alphas such that a clear peaked graph is shown in next plot
        } 
    ],
    cv=8,
    scoring='r2',
    n_jobs=4,
    verbose=10
)

# Perform grid search
grid_search_alpha.fit(X_train,y_train)

Fitting 8 folds for each of 13 candidates, totalling 104 fits
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.2s


GridSearchCV(cv=8,
             estimator=TransformedTargetRegressor(func=<ufunc 'log10'>,
                                                  inverse_func=<function <lambda> at 0x7fc9f0b9a4c0>,
                                                  regressor=Pipeline(steps=[('columntransformer',
                                                                             ColumnTransformer(transformers=[('numerical',
                                                                                                              Pipeline(steps=[('simpleimputer',
                                                                                                                               SimpleImputer(strategy='median')),
                                                                                                                              ('minmaxscaler',
                                                                                                                               MinMaxScal

In [36]:
# plot search results
plt.figure(figsize=[2,2])

# abscissa
alphas = list(grid_search_alpha.cv_results_['param_regressor__lasso__alpha'])

# plot mean
r2_mean = grid_search_alpha.cv_results_['mean_test_score']
# normalize
r2_mean = (r2_mean-r2_mean.mean())/r2_mean.std()
plt.plot(alphas, r2_mean, label='mean', lw=4, color='blue')

# plot folds
for fold in range(grid_search_alpha.cv):
    r2_fold = grid_search_alpha.cv_results_['split{:.0f}_test_score'.format(fold)]
    # normalize
    r2_fold = (r2_fold-r2_fold.mean())/r2_fold.std()
    plt.plot(alphas, r2_fold, label='fold ' + str(fold), lw=1, color='black')

plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('standardized r2 score [a.u.]')
plt.axvline(grid_search_alpha.best_params_['regressor__lasso__alpha'], linewidth=2, linestyle='--', color='k')
result = 'grid search results\nbest alpha={:.5f}'.format(grid_search_alpha.best_params_['regressor__lasso__alpha'])
plt.title(result)
print(result)
plt.legend(ncol=1, loc='center left', bbox_to_anchor=(1,0.5))



grid search results
best alpha=0.00006


<matplotlib.legend.Legend at 0x7fc9e1ddab20>

### Fit with regressor found with grid search

In [37]:
# Store estimator with best alpha
reg = grid_search_alpha.best_estimator_
models[model_name].update({'model':reg})

# fit
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [reg.regressor_.steps[-1][1].intercept_, *reg.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.2s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[Colum

In [38]:
# update features, by adding fuels
cat_columns_reduced += ['fuel']
cats += [fuel_list]

# Split betas per category feature.
idx_start = len(num_columns)+1
cat_betas = list()
for cat in cats:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns)+1] # first
assert cat_betas[-1][-1] == betas[-1] # last

In [39]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_reduced) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,4*n_plots]
)
plt.subplots_adjust(hspace=0.5)

# Plot coefficients
for feats, coefs, name, ax in zip(
    [['offset'] + features] + cats,
    [[betas[0]] + betas[1:len(num_columns)+1]] + cat_betas,
    ['numerical'] + cat_columns_reduced,
    axs
):
    # activate subplot axes
    plt.sca(ax)
    # sort by bar height
    x = [feats[i] for i in np.argsort(coefs)[::-1]]
    y = sorted(coefs, reverse=True)
    # plot bar
    plt.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

    # prettify
    if not name.startswith('num'):
        plt.yticks(np.arange(-0.5,+0.6,0.1))
        bot_tick, top_tick = plt.ylim(top=+0.5, bottom=-0.5)
    else:
        plt.yticks(np.arange(-2,2.2,0.5))
        bot_tick, top_tick = plt.ylim(top=+2, bottom=-2)
        # stats
        xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
        plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
            models[model_name]['R^2'],
            models[model_name]['cv R^2'].shape[0],
            np.mean(models[model_name]['cv R^2']),
            np.std(models[model_name]['cv R^2']),
        ) + '\n' +
                 'train (n = {})'.format(y_train.shape[0]) + '\n' +
                 'test (n = {}, $R^2$ = {:.2f})'.format(
                     y_test.shape[0],
                     models[model_name]['test R^2'],
                 ), style='italic', va='top', ha='left')


    # plot sign switch
    x_sign_switch1 = np.nonzero(np.array(y+[-np.inf]) < 0)[0][0]
    x_sign_switch2 = np.nonzero(np.array([+np.inf]+y) > 0)[0][-1]
    plt.axvline(x_sign_switch1-0.5, linewidth=2, linestyle='--', color='k')
    plt.axvline(x_sign_switch2-0.5, linewidth=2, linestyle='--', color='k')
    plt.axhline(0, linewidth=2, linestyle='-', color='k')

    # add values when bar is small or too large (clipping)
    yt,ytl=plt.yticks()
    first_tick = sorted(np.abs(yt))[1]
    for x_val, coef in zip(x,y):
        if (coef < first_tick) & (coef > 0):
            plt.text(x_val, coef, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif (coef > -first_tick) & (coef < 0):
            plt.text(x_val, 0, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef > top_tick:
            # generally this is offset (bias)
            plt.text(x_val, top_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef < bot_tick:
            plt.text(x_val, bot_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')

    
    # labels and titles
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    if not name.startswith('num'):
        plt.title('Categorical feature: ' + name, style='italic')
    else:
        plt.title('Multiple linear regression (Lasso, alpha={:g})\nNumerical features'.format(
            reg.regressor_.named_steps['lasso'].alpha
        ), style='italic') 
    plt.ylabel('Coefficient (a.u.)', style='italic')
    
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 20:
        add_space = len(x) - 20
        xl = list(plt.xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        plt.xlim(xl)

# Label on bottom panel
plt.sca(axs[-1])
plt.xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

  ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)


../results/MLR_Lasso.png


- - - - - 

In [40]:
# Display prediction errors

x_sample = df.dropna(subset=['price']).iloc[:,1:]
y_sample = df.dropna(subset=['price']).iloc[:,0]
y_sample_pred = models[model_name]['model'].predict(x_sample) 

x_sample['price'] = y_sample
x_sample['prediction_error'] = y_sample_pred - y_sample
x_sample['prediction_error_fraction'] = y_sample_pred/y_sample
x_sample['prediction_error_log'] = np.log10(x_sample.prediction_error_fraction)
x_sample['prediction_error_abslog'] = np.abs(np.log10(x_sample.prediction_error_fraction))
x_sample['prediction'] = y_sample_pred
x_sample['age_y'] = x_sample.age/365

# Note some are close to perfect, because they are in training set and are unique in brand etc
print('best predictons')
display(x_sample.sort_values(by='prediction_error_abslog').head(16).T)
print('worst predictions')
display(x_sample.sort_values(by='prediction_error_abslog').tail(16).T)
print('largest underestimate')
display(x_sample.sort_values(by='prediction_error').head(16).T)
print('largest overestimate')
display(x_sample.sort_values(by='prediction_error').tail(16).T)
print('worst prediction recent auction')
is_last_auction = x_sample.index.str.startswith('-'.join(x_sample.index[-1].split('-')[:2]))
display(x_sample[is_last_auction].sort_values(by='prediction_error_abslog').tail(8).T)

plt.figure(figsize=[8,8])
plt.plot(x_sample.age_y, x_sample.prediction_error_log, color='k', marker='s', markeredgecolor = (0, 0, 0, 0), markerfacecolor = (0, 0, 0, 1), linestyle='None', ms=4)
plt.axhline(0, lw=2, linestyle='--', color ='k')
plt.xlabel('age [years]')
plt.ylabel('prediction error [log of fraction]\n(positive: prediction overestimates)')
plt.show()

best predictons


Unnamed: 0,2020-10-7164,2018-7-2413,2019-7-7123,2020-12-7303,2018-3-8232,2018-3-2614,2017-9-8158,2020-10-7130,2017-11-8168,2019-5-2611,2018-11-3081,2017-5-8016,2018-9-8145,2020-8-2613,2018-4-7183,2019-2-7144
brand,FIAT,AUDI,VOLKSWAGEN,OPEL,AUDI,AUDI,RENAULT,MITSUBISHI,MERCEDES-BENZ,BMW,VOLKSWAGEN,MERCEDES-BENZ,PEUGEOT,VOLKSWAGEN,SEAT,FORD
model,punto,q2,golf-cabriolet,corsa,a4 cabriolet,a1,megane,colt 1500 glx automaat,e 240,1er reihe,jetta,280 se,308,golf,leon,fiesta
age,3513.0,489.0,6668.0,3464.0,5886.0,2066.0,4354.0,11847.0,5381.0,2589.0,,13893.0,3851.0,1101.0,3425.0,5748.0
fuel,diesel,benzine,benzine,benzine,benzine,benzine,benzine/lpg/g3 gasinstallatie,benzine,benzine,diesel,benzine,lpg,benzine,diesel,diesel,benzine
odometer,198641.0,8543.0,173909.0,85855.0,188786.0,77238.0,319625.0,184955.0,208894.0,148232.0,235259.0,340919.0,151638.0,98490.0,227444.0,289906.0
days_since_inspection_invalid,-155.0,-972.0,71.0,136.0,-386.0,-125.0,-197.0,-78.0,-246.0,-156.0,,,-167.0,5.0,-426.0,-82.0
age_at_import,0.0,252.0,0.0,2921.0,388.0,491.0,0.0,0.0,0.0,0.0,,,0.0,313.0,0.0,0.0
body_type,mpv,stationwagen,cabriolet,mpv,cabriolet,stationwagen,hatchback,hatchback,sedan,stationwagen,,,hatchback,stationwagen,mpv,hatchback
displacement,1248.0,1395.0,1984.0,1229.0,2393.0,1390.0,1598.0,1468.0,2597.0,1598.0,,,1598.0,1968.0,1896.0,1388.0
number_of_cylinders,4.0,4.0,4.0,4.0,6.0,4.0,4.0,4.0,6.0,4.0,,,4.0,4.0,4.0,4.0


worst predictions


Unnamed: 0,2019-2-7260,2019-8-2221,2021-08-260218,2017-3-2003,2021-10-260520,2021-06-260316,2017-3-2007,2019-4-2021,2017-3-2409,2021-06-260216,2017-5-2216,2021-03-2606,2017-3-2000,2018-1-2412,2018-7-2411,2021-05-2201
brand,OPEL,ALFA ROMEO,LANCIA,VOLKSWAGEN,LINCOLN,FORD,VOLKSWAGEN,VOLKSWAGEN,VOLKSWAGEN,MERCEDES-BENZ,ALFA ROMEO,JAGUAR,ALFA ROMEO,VOLKSWAGEN,MERCEDES-BENZ,MERCEDES-BENZ
model,ascona 1.6s,giulia 1300 super,137as0,152131,continental iii conv,mustang,karmann ghia,111011,T2,w110/190d,2000 gtv,e-type,2000 gtv,t1,sl230,w100 600 pullman
age,11956.0,17379.0,16011.0,15121.0,23284.0,17503.0,18507.0,19787.0,15873.0,21156.0,16257.0,18507.0,16196.0,19813.0,19848.0,20451.0
fuel,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine
odometer,3314.0,9050.0,32474.0,84145.0,122427.626112,105840.117504,59227.0,91157.0,46642.0,94097.0,23982.0,20623.0,23982.0,96563.858688,68722.0,
days_since_inspection_invalid,66.0,,,,,232.0,,1099.0,,,-739.0,,-800.0,,,
age_at_import,0.0,,,,,13880.0,,16180.0,,,0.0,,0.0,,,
body_type,,,,,,coupe,,sedan,,,coupe,,coupe,,,
displacement,,,,,,4948.0,,1192.0,,,,,,,,
number_of_cylinders,4.0,,,,,8.0,,4.0,,,4.0,,4.0,,,


largest underestimate


Unnamed: 0,2021-12-260012,2019-4-2411,2019-11-2418,2021-05-8126,2018-6-2410,2018-8-2400,2017-5-2406,2018-8-2410,2017-3-2000,2021-08-702908,2022-01-240321,2021-08-260318,2018-7-2411,2021-12-260812,2017-8-2409,2021-05-2401
brand,LAMBORGHINI,MERCEDES-BENZ,PORSCHE,MERCEDES-BENZ,MERCEDES-BENZ,ROLLS ROYCE,MERCEDES-BENZ,ASTON-MARTIN,ALFA ROMEO,MERCEDES-BENZ,MERCEDES-BENZ,PORSCHE,MERCEDES-BENZ,MERCEDES-BENZ,PORSCHE,PORSCHE
model,diablo sv 132 se,amg s63 cabriolet,panamera turbo s e-hybrid,v-klasse,S65 AMG,phantom drophead coupe,S600 Maybach,dbs,2000 gtv,amg e 43 4matic,"S63 AMG 4M, voorzien van Brabus componenten*",911 carrera 2,sl230,brabus s63,911 carrera s,911 gt3
age,8546.0,636.0,431.0,1506.0,844.0,3136.0,810.0,2665.0,16196.0,1599.0,2389.0,11415.0,19848.0,2358.0,2039.0,7744.0
fuel,benzine,benzine,,diesel,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,,benzine
odometer,40090.0,13.0,6925.0,121324.0,6379.0,11305.0,19173.0,58429.0,23982.0,71428.0,43751.0,120844.0,68722.0,43751.0,59807.0,38157.0
days_since_inspection_invalid,1949.0,,,-320.0,,,-651.0,,-800.0,-593.0,,817.0,,,,2314.0
age_at_import,6164.0,,,1067.0,,,0.0,,0.0,523.0,,9873.0,,,,0.0
body_type,coupe,,,mpv,,,sedan,,coupe,sedan,,coupe,,,,coupe
displacement,5707.0,,,2143.0,,,5980.0,,,2996.0,,3600.0,,,,3600.0
number_of_cylinders,12.0,,,4.0,,,12.0,,4.0,6.0,,6.0,,,,6.0


largest overestimate


Unnamed: 0,2020-12-2212,2021-10-804220,2018-2-2408,2018-5-2401,2018-1-2411,2020-2-2401,2020-6-2221,2021-04-2205,2020-9-2405,2021-06-702006,2021-12-806022,2021-03-2206,2021-03-2610,2017-3-2405,2017-3-2400,2018-11-2401
brand,MERCEDES-BENZ,LAND ROVER,LAND ROVER,LAND ROVER,BMW,ASTON-MARTIN,LAND ROVER,MERCEDES-BENZ,LAND ROVER,PORSCHE,SHARE NGO,VOLVO,LAND ROVER,MERCEDES-BENZ,ASTON-MARTIN,BENTLEY
model,amg glc 43,range rover evoque,range rover evoque,range rover evoque,x5 m50d,v8 vantage roadster,discovery sport,gle 350 d 4matic,range rover evoque,macan s,zd,xc90 t8 twin engine,range rover sport,amg gle 63 s,rapide s,continental gtc
age,1177.0,1361.0,701.0,677.0,1761.0,4385.0,1581.0,1718.0,889.0,783.0,898.0,579.0,1851.0,484.0,673.0,4200.0
fuel,benzine,diesel,diesel,diesel,diesel,benzine,diesel,diesel,benzine,benzine,elektriciteit,elektriciteit/benzine,benzine,benzine,benzine,benzine
odometer,39331.0,58524.0,28223.0,26110.0,36340.0,51545.0,62423.0,55224.0,25563.0,25070.0,626.0,7071.0,110979.0,20757.0,14415.0,27184.0
days_since_inspection_invalid,-284.0,265.0,-394.0,-418.0,-102.0,197.0,485.0,244.0,-572.0,-678.0,,-882.0,-121.0,-977.0,-788.0,-21.0
age_at_import,688.0,199.0,492.0,0.0,1349.0,1954.0,910.0,1067.0,297.0,0.0,0.0,0.0,1246.0,135.0,297.0,2028.0
body_type,stationwagen,stationwagen,stationwagen,stationwagen,stationwagen,cabriolet,stationwagen,stationwagen,stationwagen,stationwagen,,mpv,stationwagen,stationwagen,hatchback,cabriolet
displacement,2996.0,1999.0,1999.0,1999.0,2993.0,4282.0,1999.0,2987.0,1997.0,2995.0,0.0,1969.0,4999.0,5461.0,5935.0,5998.0
number_of_cylinders,6.0,4.0,4.0,4.0,6.0,8.0,4.0,6.0,4.0,6.0,0.0,4.0,8.0,8.0,12.0,12.0


worst prediction recent auction


Unnamed: 0,2022-01-706901,2022-01-706301,2022-01-703201,2022-01-708201,2022-01-240101,2022-01-708501,2022-01-240321,2022-01-805421
brand,TOYOTA,MERCEDES-BENZ,LANCIA,AUDI,MERCEDES-BENZ,MERCEDES-BENZ,MERCEDES-BENZ,AUDI
model,avensis,cls 320 cdi,fulvia coupe rallye 1.3s 3rd s,sq5,AMG E 63 S,s 65 amg,"S63 AMG 4M, voorzien van Brabus componenten*",rs6
age,5953.0,5241.0,17352.0,1185.0,1328.0,5559.0,2389.0,4860.0
fuel,diesel,diesel,benzine,benzine,benzine,benzine,benzine,benzine
odometer,310751.0,175726.0,12944.0,98889.0,79020.0,151194.0,43751.0,174186.0
days_since_inspection_invalid,24.0,,282.0,,,,,
age_at_import,0.0,,8898.0,,,,,
body_type,stationwagen,,,,,,,
displacement,2231.0,,,,,,,
number_of_cylinders,4.0,,4.0,,,,,


In [41]:
# check to see if combining features would improve model
yX = df.dropna(subset=['price'])
yX.loc[:,'usage_intensity'] = (yX.odometer / yX.age)
yX.loc[:,'classic'] = yX.age > 25
print(yX.corr().price)
print('\n"usage_intensity" does not seem to correlate better than "age" and "odometer" seperately')

price                            1.000000
age                             -0.349085
odometer                        -0.452006
days_since_inspection_invalid   -0.127949
age_at_import                    0.052260
displacement                     0.385007
number_of_cylinders              0.369068
power                            0.614348
weight                           0.345062
registration_tax                 0.374249
sale_price                       0.688026
number_of_seats                 -0.024993
number_of_doors                  0.174178
top_speed                        0.509195
length                           0.298851
height                           0.059324
width                            0.433894
number_of_gears                  0.590978
private_owners                  -0.286312
company_owners                   0.101711
converted_energy_label           0.255198
usage_intensity                 -0.065589
classic                          0.022824
Name: price, dtype: float64

"usag

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = val

<H1><a href="#pred_top">^</a></H1><a id='pred_model_8'> 

<H1><a href="#pred_top">^</a></H1><a id='pred_accuracies'> 

## Model accuracies

In [42]:
# plot R^2

# counter for x-offset
c=0

# figure
fig = plt.figure(figsize=[2,2])
ax = fig.gca()
xs = ys = [None]

# loop over all models
for name,res in models.items():

    c+=1 # x-offset

    if name == 'linear regression no cv':
        # No cv, so only one value. Make it a list of one for type consistency
        k = 'R^2'
        rsq = [res[k]]
    
    else: 
        k = 'cv R^2'
        rsq = res[k]
        
    # add r-squares and offset to vectors
    ys = np.concatenate([ys,rsq])
    xs = np.concatenate([xs,np.ones_like(rsq) * c])

# actual plotting
sns.swarmplot(x=xs, y=ys, ax=ax)
# prettify
ax.set_xticklabels(models.keys(), rotation=45, va='top', ha='right', style='italic')
ax.set_ylim(bottom=0, top=1)
ax.set_title('Model performance\n', style='italic')
ax.set_ylabel('Coefficient of determination\n($R^2$)', style='italic')


# save
file_name = '../results/model-performance.png'
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/model-performance.png


In [43]:
# plot data

# loop over all models
for model_name in models.keys():
    print(model_name)
    res = models[model_name]
    
    # all original data
    yX = df.loc[:,['price', 'age']].dropna()
    X = yX.iloc[:,1]
    y = yX.iloc[:,0]
    
    features = num_columns.copy()
    
    # model specific adjustments
    if (model_name == 'linear regression log price') or (model_name == 'linear regression log price young'):
        # log price is used
        y = np.log10(y)
        # unit
        unit = '(log[EUR])'
    elif (model_name == 'MLR reduced observations') or (model_name == 'MLR impute median'):
        yX = df.dropna(subset=['price'] + features).loc[:,['price'] + features]
        X = yX.iloc[:,1:]
        y = np.log10(yX.iloc[:,0])
        unit = '(log[EUR])'
    elif (model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]      
        unit = '(EUR)'
    else:
        unit = '(EUR)'
    
    if X.ndim != 1:
        n_feat = X.shape[1]
    else:
        n_feat = 1
        
    if not ((model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features')):
        # needed for .predict
        X = np.array(X).reshape(-1,n_feat)
        y = np.array(y).reshape(-1,1)
    
    # predict all data
    y_pred = res['model'].predict(X)
    if max(y) < 10:
        rmse = np.sqrt(np.mean(((10**y)-(10**y_pred))**2))
    else:
        rmse = np.sqrt(np.mean((y-y_pred)**2))
    print(rmse)

    # actual plotting
    fig,axs = plt.subplots(nrows=2, ncols=1, figsize=[4,4])
    
    # data
    axs[0].plot(y, y_pred, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4,)
    # error
    axs[1].plot(y, y_pred-y, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4,)
    
    # axis equal for top
    if (model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features'):
        axs[0].set_xscale('log')
        axs[0].set_yscale('log')
        axs[1].set_xscale('log')
    axs[0].set_aspect(1)
    # store limits
    yl = axs[0].get_ylim()
    xl_top = axs[0].get_xlim()
    xl_bot = axs[1].get_xlim()
    xl = [np.max([xl_top[0], xl_bot[0]]), np.min([xl_top[1], xl_bot[1]])]
    # plot unity line and 0 error
    unity_line = [np.max([xl[0], yl[0]]), np.min([xl[1], yl[1]])]
    axs[0].plot(unity_line, unity_line, '-k', linewidth=2)
    axs[1].plot(xl, [0, 0], '-k', linewidth=2)
    # reset limits
    axs[0].set_xlim(xl)
    axs[1].set_xlim(xl)

    # make equal size panels
    # Note: sharex did not work
    bb=axs[0].get_position(False)
    rect_top = bb.bounds
    bb=axs[1].get_position(False)
    rect_bot = bb.bounds
    rect = list(rect_bot)
    rect[0] = rect_top[0]
    rect[2] = rect_top[2]
    axs[1].set_position(rect)
    
    # labeling
    fig.suptitle('{}\nrmse: EUR {:.0f}'.format(model_name,rmse), style='italic')
    axs[1].set_xlabel('Real price ' + unit, style='italic')
    axs[0].set_ylabel('Predicted price\n' + unit, style='italic')
    axs[1].set_ylabel('Prediction error\n' + unit, style='italic')
    
    # save
    file_name = '../results/{}-accuracy.png'.format(model_name.replace(' ','_'))
    if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
        print(file_name)
        with plt.style.context('../assets/context-paper.mplstyle'):
            plt.savefig(file_name, bbox_inches='tight', transparent=True)
    else:
        plt.show()
        print(f'Skip. {file_name} exists or saving is disabled in settings.')

linear regression no cv
9578.866860211785
../results/linear_regression_no_cv-accuracy.png
linear regression log price
9727.404043400491
../results/linear_regression_log_price-accuracy.png
linear regression log price young
8801.11131086857
../results/linear_regression_log_price_young-accuracy.png
MLR reduced observations
6365.453264529394
../results/MLR_reduced_observations-accuracy.png
MLR impute median
7030.18919546757
../results/MLR_impute_median-accuracy.png
MLR with categorical
5714.143328817421
../results/MLR_with_categorical-accuracy.png
MLR Lasso
5594.872042472295
../results/MLR_Lasso-accuracy.png


In [44]:
assert False, 'stop running, below is sandboxing and testing'

AssertionError: stop running, below is sandboxing and testing

<H1><a href="#pred_top">^</a></H1><a id='pred_save_model'> 

# Save model as pickle
Save the best model as a .pkl file.

See also: https://scikit-learn.org/stable/modules/model_persistence.html

In [None]:
# import dill # dill acts as pickle but handles lambda functions
model_name = 'MLR Lasso' 
model = models[model_name]

In [None]:
model['name'] = model_name
fn = '../results/trained_model_{}.pkl'.format(model_name.replace(' ', '_').lower())
print(fn)
# with open(fn, 'wb') as file:
#     dill.dump(model, file)

<H1><a href="#pred_top">^</a></H1><a id='pred_predict'> 

# Example predictions

In [108]:
# Predict some known cars
B = pd.DataFrame(columns=X.columns, index=['Mine'])
B.loc['Mine', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['CITROËN', 'berlingo', 'benzine', 'mpv', 'Gray']
B.loc['Mine', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1600, 4, 5, 5, 'n', 5]
B.loc['Mine', ['top_speed']] = [170]
B.loc['Mine', 'age'] = (pd.to_datetime('now') - pd.to_datetime('2005-12-1')).days
B.loc['Mine', 'days_since_inspection_invalid'] = (pd.to_datetime('now') - pd.to_datetime('2022-6-11')).days
B.loc['Mine', 'age_at_import'] = 0
B.loc['Mine', 'odometer'] = 160000
B.loc['Mine', ['weight']] = [1326]

B.loc['Peer', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['CITROËN', 'ax', 'benzine', 'hatchback', 'Gray']
B.loc['Peer', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1100, 4, 5, 5, 'n', 5]
B.loc['Peer', ['top_speed']] = [170]
B.loc['Peer', 'age'] = (pd.to_datetime('now') - pd.to_datetime('1996-12-1')).days
# B.loc['Mine', 'days_since_inspection_invalid'] = (pd.to_datetime('now') - pd.to_datetime('2020-6-11')).days
B.loc['Peer', 'age_at_import'] = 0
B.loc['Peer', 'odometer'] = 160000
B.loc['Peer', ['weight']] = [800]

B.loc['a car', ['brand']] = [np.NaN]


B.loc['J-892-TZ', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['SUZUKI', 'sx4', 'benzine', 'hatchback', 'Gray']
B.loc['J-892-TZ', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1586, 4, 5, 5, 'n', np.NaN]
B.loc['J-892-TZ', 'age'] = (pd.to_datetime('now') - pd.to_datetime('2010-11-11')).days
B.loc['J-892-TZ', 'days_since_inspection_invalid'] = (pd.to_datetime('now') - pd.to_datetime('2021-11-18')).days
B.loc['J-892-TZ', 'age_at_import'] = (pd.to_datetime('now') - pd.to_datetime('2020-11-19')).days
B.loc['J-892-TZ', 'odometer'] = 58153
B.loc['J-892-TZ', 'weight'] = 1230
B.loc['J-892-TZ', 'power'] = 118
B.loc['J-892-TZ', 'automatic_gearbox'] = 'y'
B.loc['J-892-TZ', 'private_owners'] = 1
B.loc['J-892-TZ', 'company_owners'] = 0
B.loc['J-892-TZ', 'sale_price'] = 19979
B.loc['J-892-TZ', 'registration_tax'] = 3936

B.loc['J-892-TZ-real'] = df.loc['2022-01-805121',:].drop(columns='Price')

B.T

Unnamed: 0,Mine,Peer,a car,J-892-TZ,J-892-TZ-real
brand,CITROËN,CITROËN,,SUZUKI,SUZUKI
model,berlingo,ax,,sx4,sx4
age,5895,9182,,4089,4069.0
fuel,benzine,benzine,,benzine,benzine
odometer,160000,160000,,58153,58153.0
days_since_inspection_invalid,-141,,,64,44.0
age_at_import,0,0,,428,3661.0
body_type,mpv,hatchback,,hatchback,hatchback
displacement,1600,1100,,1586,1586.0
number_of_cylinders,4,4,,4,4.0


In [109]:
df_ = pd.DataFrame(index=models.keys(), columns=B.index)
for model in df_.index[::-1]:
    try:
        print(f'{model}')
        B.loc[:,'predict'] = models[model]['model'].predict(B)
        pred = B.predict
    except: 
        pred = pd.Series(index=B.index, data=np.NaN)
    df_.loc[model,:] = pred
    
df_

MLR Lasso
MLR with categorical
MLR impute median
MLR reduced observations
linear regression log price young
linear regression log price
linear regression no cv


Unnamed: 0,Mine,Peer,a car,J-892-TZ,J-892-TZ-real
linear regression no cv,,,,,
linear regression log price,,,,,
linear regression log price young,,,,,
MLR reduced observations,,,,,
MLR impute median,,,,,
MLR with categorical,,,,,
MLR Lasso,1064.75069,466.821739,1431.067988,3213.441625,4888.626884


In [106]:
B2 = df.loc['2022-01-805121',:].to_frame().T.drop(columns='price')
models['MLR Lasso']['model'].predict(B2)


array([4888.62688358])

In [None]:
B = pd.read_pickle('/home/tom/bin/satdatsci/Saturday-Datascience/data/rdw-data-2021-02.pkl')
B.columns
#['brand', 'model', 'fuel', 'body_type', 'color']
B.loc[:, [
    'rdw_merk',
    'rdw_type',
    'rdw_brandstof_brandstof_omschrijving_1',
    'rdw_ovi_inrichting_code_omschrijving',
    'rdw_eerste_kleur',
     ]]

In [None]:
B.loc[:,(B == 'GRIJS').any()]