<a id='pred_top'>

# Predict auction price

Try several models and improve predicition accuracy

## Model fitting

- Linear fits  
  1. [Simple linear fit](#pred_model_1)  
     No cross validation. Observations with missing values are dropped.
  2. [Dependent values scaled](#pred_model_2)  
     Dependent value here is _prices_.
  3. [Partial data](#pred_model_3)  
     Only young cars
- Multiple linear regression models  
  1. [MLR fit without imputation](#pred_model_4)  
  2. [With imputation](#pred_model_5)  
  3. [Include categorical features](#pred_model_6)  
  4. [Lasso regularization](#pred_model_7)  
  5. [include engineered features](#pred_model_8) **TODO**  

## Results

- [Model performance](#pred_accuracies)
- [Save best model](#pred_save_model) **TODO**  
  This is not implemented yet. Some preprocessing functions are not handled well with `pickle`.
- [Predictions](#pred_predict)
     
  

In [1]:
import os
# setting path
os.chdir(r'..')

import drz_config
cfg = drz_config.read_config()
VERBOSE = cfg['VERBOSE']
SKIPSAVE = cfg['SKIPSAVE']

if VERBOSE > 0:
    display(cfg)

{'settings_fn': '../code/assets/drz-auction-settings.ini',
 'DATE': '2022-06',
 'VERBOSE': 1,
 'OPBOD': False,
 'URL': 'http://verkoop.domeinenrz.nl/verkoop_bij_inschrijving_2022-0006',
 'EXTEND_URL': False,
 'CLOSEDDATA': True,
 'closed_data_fields': '*',
 'SKIPSAVE': False}

In [2]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

import seaborn as sns

In [3]:
# set figure defaults (needs to be in cell seperate from import sns)
plt.style.use(['default', '../assets/movshon.mplstyle', '../assets/context-notebook.mplstyle'])

# Load data

In [4]:
fn = '../data/cars-for-ml.pkl'
print(fn)
df = pd.read_pickle(fn)
print(df.shape)

# categories
cat_columns = ['brand', 'model', 'fuel', 'body_type','color', 'energy_label', 'fwd', 'automatic_gearbox', 'under_survey']
# numerical
num_columns = list(np.setdiff1d(df.columns, cat_columns + ['price']))

# Factorized categorical values
fld = 'energy_label'
# replace empty with NaN creates factor '-1'
v, idx = pd.factorize(df[fld].replace({'': np.NaN}), sort=True)
# convert '-1' back to NaN
v = v.astype(float)
v[v==-1] = np.NaN
# Store in dataframe
new_col = 'converted_' + fld
df[new_col] = v
# update list
num_columns += [new_col]
cat_columns.remove(fld)
print('\nCategorical field [{}] is converted to sequential numbers with: '.format(fld), end='\n\t')
print(*['{} <'.format(c) for c in idx], end='\n\n')

# convert boolean to string
for fld in ['fwd', 'automatic_gearbox', 'under_survey']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    # # update list
    # cat_columns += [new_col]
    # cat_columns.remove(fld)
    replace_dict = {
        '': '', 
        True: 'y', 
        False: 'n'
    }
    df[new_col] = df[fld].replace(replace_dict)
    print('\nBoolean field [{}] is converted to numbers according to: '.format(fld), end='\n')
    print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# convert integer to float and replace -1
for fld in ['number_of_cylinders', 'number_of_doors', 'number_of_gears', 'number_of_seats']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        -1: np.NaN, 
    }
    df[new_col] = df[fld].replace(replace_dict).astype(float)

# convert empty string to NaN
for fld in ['brand', 'model', 'fuel', 'body_type', 'color', 'fwd']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        '': np.NaN, 
    }
    df[new_col] = df[fld].replace(replace_dict)

# translate Dutch to English
fld = 'color'
new_col = fld
# # update list
# cat_columns += [new_col]
# cat_columns.remove(fld)
replace_dict = {
    '': 'missing', 
    'BLAUW': 'Blue',
    'ROOD': 'Red',
    'GROEN': 'Green',
    'GRIJS': 'Gray',
    'WIT': 'White',
    'ZWART': 'Black',
    'BEIGE': 'Beige',
    'BRUIN': 'Brown',
    'ROSE': 'Pink',
    'GEEL': 'Yellow',
    'CREME': 'Creme',
    'ORANJE': 'Orange',
    'PAARS': 'Purple,'
}
df[new_col] = df[fld].replace(replace_dict)
print('\nField [{}] is converted according to: '.format(fld), end='\n')
print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# reporting
try:
    print('Categorical:', len(cat_columns))
    [print('\t[{:2.0f}] {:s}'.format(i+1, c)) for i,c in enumerate(df[cat_columns].columns)]
    print('Numercial:', len(num_columns))
    [print('\t[{:2.0f}] {:s}'.format(i+1, c)) for i,c in enumerate(df[num_columns].columns)]
    print('Last lot in data set:\n\t{}'.format(df.index[-1]))
except:
    cat_columns = [c for c in cat_columns if c in df.columns]
    num_columns = [c for c in num_columns if c in df.columns]    
    print('! not all fields are in data !. Skip for now')

../data/cars-for-ml.pkl
(9155, 29)

Categorical field [energy_label] is converted to sequential numbers with: 
	A < B < C < D < E < F < G <


Boolean field [fwd] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [automatic_gearbox] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [under_survey] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Field [color] is converted according to: 
	"" -> missing (<class 'str'>)
 	"BLAUW" -> Blue (<class 'str'>)
 	"ROOD" -> Red (<class 'str'>)
 	"GROEN" -> Green (<class 'str'>)
 	"GRIJS" -> Gray (<class 'str'>)
 	"WIT" -> White (<class 'str'>)
 	"ZWART" -> Black (<class 'str'>)
 	"BEIGE" -> Beige (<class 'str'>)
 	"BRUIN" -> Brown (<class 'str'>)
 	"ROSE" -> Pink (<class 'str'>)
 	"GEEL" -> Yel

In [5]:
# Store model results in dictonary: Instantiate empty dict
models = dict()

<a href="#pred_top" id='pred_model_1'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: Simple linear fit
Regress age (in days) with price (euro).  

## >> BIG FAT WARNING <<
All data is used without train / test split. I.e. accuracy is based on data that was used for fit. This is considered bad practice!

## Prepare input

In [6]:
from sklearn import linear_model

model_name = 'linear regression no cv'

X = df.dropna(subset=['price','age']).age.values.reshape(-1,1)
y = df.dropna(subset=['price','age']).price.values.reshape(-1,1)
print(X.shape)
print(y.shape)

(7948, 1)
(7948, 1)


## Fit

In [7]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X,y) # fit with all data
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})

In [8]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
plt.plot(X/365.25, y/1000, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4)
hdl_fit = plt.plot(prediction_X/365, prediction_y/1000, color='blue', marker=None, linestyle='-', linewidth=4)
plt.legend(hdl_fit, ['n = {}, $R^2$ = {:.2f}\ny = {:+.0f}{:+.2f}*(x*365.25)'.format(
    models[model_name]['n'],
    models[model_name]['R^2'],
    *models[model_name]['betas']
)], loc='upper right')
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR X1000)', style='italic')
plt.title('Simple linear fit', style='italic')
plt.ylim(bottom = -10)
plt.xlim(left = 0)

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_no_cv.png


<a href="#pred_top" id='pred_model_2'><font size=+1><center>^^ TOP ^^</center></font></a>

---

## Model: linear but with scaled dependent values (prices)

Instead of using all data **train/test split** is performed. Also prices are log transformed.  

## Prepare input

In [9]:
from sklearn.model_selection import train_test_split, cross_val_score

model_name = 'linear regression log price'

X = df.dropna(subset=['price','age']).age.values.reshape(-1,1)
y = np.log10(df.dropna(subset=['price','age']).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(7948, 1)
(7948, 1)


## Fit

In [10]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train) # fit with training set
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(5563, 1)
(2385, 1)


In [11]:
depr_half_n_days = -(np.log10(2)/models[model_name]['betas'][1])
print('According to "{}"-model'.format(model_name))
print('Car depreciates to half its value every\n\t{:.0f} days ({:.1f} years).'.format(depr_half_n_days, depr_half_n_days/365.25))
for y in [0,2,4,6,8]:
    print('\ty(t={:+5.0f}) = {:.0f} euro'.format(y, 10**reg.predict([[y*365.25]])[0][0]))
print('\n\ty(t={:+5.1f}) = {:.0f} euro'.format(depr_half_n_days/365.25, 10**reg.predict([[depr_half_n_days]])[0][0]))
print('\ty(t=0) / 2 = {:.0f} euro'.format(10**models[model_name]['betas'][0]/2))

According to "linear regression log price"-model
Car depreciates to half its value every
	2449 days (6.7 years).
	y(t=   +0) = 9985 euro
	y(t=   +2) = 8120 euro
	y(t=   +4) = 6604 euro
	y(t=   +6) = 5370 euro
	y(t=   +8) = 4367 euro

	y(t= +6.7) = 4993 euro
	y(t=0) / 2 = 4993 euro


In [12]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
hdl_trn = plt.plot(X_train/365.25, np.power(10,y_train), marker='s', markeredgecolor = (0, 0, 1, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='train (n = {})'.format(y_train.shape[0]))
hdl_tst = plt.plot(X_test/365.25, np.power(10,y_test), marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='test (n = {}, $R^2$ = {:.2f})'.format(
                       y_test.shape[0],
                       models[model_name]['test R^2'],
                   ))
hdl_fit = plt.plot(prediction_X/365, np.power(10,prediction_y), color='blue', marker=None, linestyle='-', linewidth=4, 
                   label = '$log10(y)$ = {:+.2f}{:+.1e}*(x*365.25)\n($R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f}))'.format(
                       *models[model_name]['betas'],
                       models[model_name]['R^2'],
                       models[model_name]['cv R^2'].shape[0],
                       np.mean(models[model_name]['cv R^2']),
                       np.std(models[model_name]['cv R^2']),
                   ))
plt.legend()
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR)', style='italic')
plt.title('Linear fit with log(price)', style='italic')
plt.ylim(bottom = 10, top = 1000000)
plt.xlim(left = 0)
plt.yscale('log')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_log_price.png


<a href="#pred_top" id='pred_model_3'><font size=+1><center>^^ TOP ^^</center></font></a>

---

## Model: scaled price, but only young cars

Same as [model 2](#pred_model_2), but ignore cars older than 25y

## Prepare input

In [13]:
from sklearn.model_selection import train_test_split, cross_val_score

model_name = 'linear regression log price young'

is_yng = df.age/365.25 < 25

X = df[is_yng].dropna(subset=['price','age']).age.values.reshape(-1,1)
y = np.log10(df[is_yng].dropna(subset=['price','age']).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(7748, 1)
(7748, 1)


## Fit

In [14]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train)
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(5423, 1)
(2325, 1)


In [15]:
depr_half_n_days = -(np.log10(2)/models[model_name]['betas'][1])
print('According to "{}"-model'.format(model_name))
print('Car depreciates to half its value every\n\t{:.0f} days ({:.1f} years).'.format(depr_half_n_days, depr_half_n_days/365.25))
for y in [0,2,4,6,8]:
    print('\ty(t={:+5.0f}) = {:.0f} euro'.format(y, 10**reg.predict([[y*365.25]])[0][0]))
print('\n\ty(t={:+5.1f}) = {:.0f} euro'.format(depr_half_n_days/365.25, 10**reg.predict([[depr_half_n_days]])[0][0]))
print('\ty(t=0) / 2 = {:.0f} euro'.format(10**models[model_name]['betas'][0]/2))

According to "linear regression log price young"-model
Car depreciates to half its value every
	1339 days (3.7 years).
	y(t=   +0) = 23414 euro
	y(t=   +2) = 16041 euro
	y(t=   +4) = 10990 euro
	y(t=   +6) = 7530 euro
	y(t=   +8) = 5159 euro

	y(t= +3.7) = 11707 euro
	y(t=0) / 2 = 11707 euro


In [16]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
hdl_trn = plt.plot(X_train/365.25, np.power(10,y_train), marker='s', markeredgecolor = (0, 0, 1, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='train (n = {})'.format(y_train.shape[0]))
hdl_tst = plt.plot(X_test/365.25, np.power(10,y_test), marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='test (n = {}, $R^2$ = {:.2f})'.format(
                       y_test.shape[0],
                       models[model_name]['test R^2'],
                   ))
hdl_fit = plt.plot(prediction_X/365, np.power(10,prediction_y), color='blue', marker=None, linestyle='-', linewidth=4, 
                   label = '$log10(y)$ = {:+.2f}{:+.1e}*(x*365.25)\n($R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f}))'.format(
                       *models[model_name]['betas'],
                       models[model_name]['R^2'],
                       models[model_name]['cv R^2'].shape[0],
                       np.mean(models[model_name]['cv R^2']),
                       np.std(models[model_name]['cv R^2']),
                   ))
plt.legend()
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR)', style='italic')
plt.title('Linear fit with log(price) of young cars', style='italic')
plt.ylim(bottom = 10, top = 1000000)
plt.xlim(left = 0)
plt.yscale('log')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_log_price_young.png


<a href="#pred_top" id='pred_model_4'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: Multiple linear fit

Above [simple linear models](#pred_model_1) only use _Age_ as predictor of price. Here MLR will regress many (numerical) features with price (euro).  


## Prepare input

In [17]:
model_name = 'MLR reduced observations'

features = num_columns 
# Can be reduced here

X = df.dropna(subset=['price'] + features).loc[:,features].values.reshape(-1,len(features))
y = np.log10(df.dropna(subset=['price'] + features).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(1440, 20)
(1440, 1)


## Fit

In [18]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train)
models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(1008, 20)
(432, 20)


In [19]:
# plot coefficients
plt.figure(figsize=[8,2])

# sorted bar height
betas = models[model_name]['betas']
x = ['offset (log[EUR])'] + [features[i] for i in np.argsort(betas[1:])[::-1]]
y = [betas[0]] + sorted(betas[1:], reverse=True)

# plot bar
plt.bar(x=x, height=y, edgecolor='k', facecolor='None')

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<1:
        plt.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
plt.yticks(range(0,5,2))

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
plt.axhline(0, linewidth=2, linestyle='-', color='k')
        
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
yl = plt.gca().get_ylim()
plt.vlines(x_sign_switch-0.5, yl[0], yl[1], linewidth=2, linestyle='--')
plt.gca().set_ylim(yl)
# plt.gca().set_ylim(top=0.01, bottom=-0.01)

# labels        
plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')
plt.gca().xaxis.set_tick_params(which='minor', bottom=False)
plt.xlabel('Feature', style='italic')
plt.ylabel('Coefficient (a.u.)', style='italic')
plt.title('Multiple linear regression', style='italic') 

# stats
xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')


# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_reduced_observations.png


  plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')


<a href="#pred_top" id='pred_model_5'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR + imputer

MLR as above, but instead of `dropna` us an imputer. This allows to use more observation.  

At this point a pipeline is used.

## Prepare input

In [20]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

model_name = 'MLR impute median'

features = num_columns 
# Can be reduced here

yX = df.loc[:,['price'] + features].dropna(subset=['price'])
X = yX.iloc[:,1:].values.reshape(-1,len(features))
y = np.log10(yX.iloc[:,0].values.reshape(-1,1))
print(X.shape)
print(y.shape)

(7979, 20)
(7979, 1)


## Fit

In [21]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
pl = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    linear_model.LinearRegression()
)
models[model_name].update({'model':pl})

# fit
pl.fit(X,y)
models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [*pl.steps[-1][1].intercept_, *pl.steps[-1][1].coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':pl.score(X,y)})
models[model_name].update({'test R^2':pl.score(X_test,y_test)})
cv_results = cross_val_score(pl, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(5585, 20)
(2394, 20)


In [22]:
# plot coefficients
plt.figure(figsize=[8,4])

# sorted bar height
betas = models[model_name]['betas']
x = ['offset (log[EUR])'] + [features[i] for i in np.argsort(betas[1:])[::-1]]
y = [betas[0]] + sorted(betas[1:], reverse=True)

# plot bar
plt.bar(x=x, height=y, edgecolor='k', facecolor='None')

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<0.05:
        plt.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
plt.yticks(np.arange(-0.3,0.4,0.1))
plt.ylim(top=+0.3, bottom=-0.3)
# offset
x_val = x[0]
coef = y[0]
plt.text(x_val, 0.3, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
plt.axhline(0, linewidth=2, linestyle='-', color='k')

# labels        
plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')
plt.gca().xaxis.set_tick_params(which='minor', bottom=False)
plt.xlabel('Feature', style='italic')
plt.ylabel('Coefficient (a.u.)', style='italic')
plt.title('Multiple linear regression', style='italic') 

# stats
xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')


# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_impute_median.png


  plt.gca().set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')


<a href="#pred_top" id='pred_model_6'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR with categorical

As MLR, but do one-hot-encoding

Use different scalers for different columns:  
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html  
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer  
p. 68 book: ML with sklearn & tf

## Prepare input

In [23]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler
# from sklearn.pipeline import FeatureUnion

model_name = 'MLR with categorical'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
features = num_columns + cat_columns_reduced
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
# # only use young
# is_yng = yX.age/365.25 < 25
# yX = yX[is_yng]
X = yX.iloc[:,1:]
y = yX.iloc[:,0]
print(X.shape)
print(y.shape)


(7979, 29)
(7979,)


In [24]:
import re

# Split fuel helper functions

def split_lpg_type(s):
    '''Split lpg type from list of fuels separated by / '''
    # No type
    if s.endswith('lpg'):
        return s, ''
    if 'lpg' not in s:
        return s, ''
    # Type is after the last '/'
    M = re.search('^(.*)/(.*)$',s)
    if M:
        return M[1], M[2]
    else:
        return s, ''

def merge_lpg_and_lpgtype(fuel_type):

    '''Add LPG type to LPG (remove /). 
    Note that order of fuels is preserved. I.e. it is able to return both "benzine/lpg-g3" and "lpg-g3/benzine". '''
    
    lpg_type = fuel_type.apply(lambda s: 'lpg-' + split_lpg_type(s)[1] if (type(s) == str) and ('lpg' in s) else '')
    fuel_type_short = fuel_type.apply(lambda s: split_lpg_type(s)[0] if (type(s) == str) else '')
    fuel_type_new = pd.Series([f.replace('lpg', l) if type(f) == str else f for f,l in zip(fuel_type_short,lpg_type)])
    return fuel_type_new


def get_unique_fuels(fuel_type):
    
    '''Splitting fuels at "/" and return unique values'''
    
    # make list (as string)
    fuel_type_list = fuel_type.apply(lambda s:s.split('/') if type(s) == str else np.NaN).astype(str)
    
    # Get unique fuels
    possible_fuels = list() # empty list
    for l in fuel_type_list.unique():
        for ll in eval(l): # use eval to convert str to list
            possible_fuels += [ll]     
    # uniquify
    return np.unique(possible_fuels)

    
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer to make one-hot fuel encoder based on string
# This is different from get_dummies, because it can take a list of values in a field
class DummyfyFuel(BaseEstimator, TransformerMixin):
    def __init__(self, fuel_names=None):
        
        assert (fuel_names == None) or (isinstance(fuel_names, (list,))), '[fuel_names] should be list (or None)'
        
        self.fuel_names = fuel_names
        
    def fit(self, X, y=None):
        
        if not self.fuel_names:
            # get fuel names based on input.
            # Note that if train/test are split, test might lack a fuel type.
            self.fuel_names = get_unique_fuels(merge_lpg_and_lpgtype(X))

        return self
    
    def transform(self, X):
        
        # get stringyfied list
        fuel_type_list = merge_lpg_and_lpgtype(X).apply(lambda s:s.split('/') if type(s) == str else np.NaN).astype(str)
        # set index as input
        fuel_type_list.index = X.index

        # transform: dummies
        fuel_dummies = pd.DataFrame(index=fuel_type_list.index)
        for f in self.fuel_names:
            fuel_dummies['fuel_' + f] = fuel_type_list.apply(lambda l:int(f in eval(l)))

        return fuel_dummies


In [25]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(5585, 29)
(2394, 29)


In [26]:
# Create model

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
#fuel_list = ['benzine', 'diesel']
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns)),
    ('categorical', cat_transformer, pd.Index(cat_columns_reduced)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=True)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.LinearRegression()
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y,
#     func=lambda x:x,
#     inverse_func=lambda y: y,
#     inverse_func=np.exp,
)

models[model_name].update({'model':pl})

In [27]:
# fit
pl.fit(X_train, y_train)
y_pred = pl.predict(X_test)

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.3s


In [28]:
# sanity check that target transformation has occured as expected
# y_pred_manual_transform = mlr.predict(X_test)
# assert all(np.log10(y_pred)-y_pred_manual_transform == 0)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [pl.regressor_.steps[-1][1].intercept_, *pl.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':pl.score(X,y)})
models[model_name].update({'test R^2':pl.score(X_test,y_test)})
cv_results = cross_val_score(pl, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[Colum

In [29]:
# update features, by adding fuels
cat_columns_reduced += ['fuel']
cats += [fuel_list]


In [30]:
# Split betas per category feature.
idx_start = len(num_columns)+1
cat_betas = list()
for cat in cats:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns)+1] # first cat beta follows numerical betas 
assert cat_betas[-1][-1] == betas[-1] # last

In [31]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_reduced) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,4*n_plots]
)
plt.subplots_adjust(hspace=0.5)


# Plot numerical
plt.sca(axs[0])
# sorted bar height
betas = models[model_name]['betas']
num_betas = betas[1:len(num_columns)+1]
x = ['offset'] + [features[i] for i in np.argsort(num_betas)[::-1]]
y = [betas[0]] + sorted(num_betas, reverse=True)

# plot bar
plt.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<0.5:
        plt.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
plt.yticks(np.arange(-2,2.2,0.5))
plt.ylim(top=+2, bottom=-2)
# offset
x_val = x[0]
coef = y[0]
plt.text(x_val, 2, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
plt.axhline(0, linewidth=2, linestyle='-', color='k')

# labels        
rot = 45
fsz = 10
ha = 'right'
plt.gca().set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
plt.gca().xaxis.set_tick_params(which='minor', bottom=False)
plt.ylabel('Coefficient (a.u.)', style='italic')
plt.title('Multiple linear regression\nNumerical features', style='italic') 

# stats
xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')

# Plot categorical
for cat, cat_beta, cat_name, ax in zip(cats, cat_betas, cat_columns_reduced, axs[1:]):
    # activate subplot axes
    plt.sca(ax)
    # sort by height
    x = [cat[i] for i in np.argsort(cat_beta)[::-1]]
    y = sorted(cat_beta, reverse=True)
    #x = cat
    #y = cat_beta
    # plot bar
    plt.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=False)

    # prettify
    plt.yticks(np.arange(-1,+1.1,0.2))
    plt.ylim(top=+1, bottom=-1)

    # plot origin
    x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
    plt.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
    plt.axhline(0, linewidth=2, linestyle='-', color='k')

    # labels
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    plt.title('Categorical feature: ' + cat_name, style='italic')
    plt.ylabel('Coefficient (a.u.)', style='italic')
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 20:
        add_space = len(x) - 20
        xl = list(plt.xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        plt.xlim(xl)

# Label on bottom panel
plt.sca(axs[-1])
plt.xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

  plt.gca().set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
  ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)


../results/MLR_with_categorical.png


<a href="#pred_top" id='pred_model_7'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR regularized

As [previous model](#pred_model_6), but use regularization by using built-in Lasso

## Prepare input

In [32]:
from sklearn.model_selection import GridSearchCV

model_name = 'MLR Lasso'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
features = num_columns + cat_columns_reduced
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
X = yX.iloc[:,1:]
y = yX.iloc[:,0]
print(X.shape)
print(y.shape)


(7979, 29)
(7979,)


## Determine regularization rate (alpha)

Alpha is the hyperparameter that needs to be determined. For this the data needs to be splitted, but the dataset is too small to do a 3 way split (i.e. CV, Train, Test). Therefor spilt 2 way k-fold cv 
- **Test**: Hold-out set for calculating performance
- **Train**: Use to fit model and do CV


In [33]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(5585, 29)
(2394, 29)


In [34]:
# Create model (same as MLR with cats, but regressor is Lasso)

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns)),
    ('categorical', cat_transformer, pd.Index(cat_columns_reduced)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=True)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.Lasso(random_state=42)
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y
)



In [35]:
# grid search estimator
grid_search_alpha = GridSearchCV(
    estimator=pl,
    param_grid=[
        {
            'regressor__lasso__alpha': 10**(np.linspace(-5,-2,13)) # Choose alphas such that a clear peaked graph is shown in next plot
        } 
    ],
    cv=8,
    scoring='r2',
    n_jobs=4,
    verbose=10
)

# Perform grid search
grid_search_alpha.fit(X_train,y_train)

Fitting 8 folds for each of 13 candidates, totalling 104 fits
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.3s


GridSearchCV(cv=8,
             estimator=TransformedTargetRegressor(func=<ufunc 'log10'>,
                                                  inverse_func=<function <lambda> at 0x7f2e5257a430>,
                                                  regressor=Pipeline(steps=[('columntransformer',
                                                                             ColumnTransformer(transformers=[('numerical',
                                                                                                              Pipeline(steps=[('simpleimputer',
                                                                                                                               SimpleImputer(strategy='median')),
                                                                                                                              ('minmaxscaler',
                                                                                                                               MinMaxScal

In [36]:
# plot search results
plt.figure(figsize=[2,2])

# abscissa
alphas = list(grid_search_alpha.cv_results_['param_regressor__lasso__alpha'])

# plot mean
r2_mean = grid_search_alpha.cv_results_['mean_test_score']
# normalize
r2_mean = (r2_mean-r2_mean.mean())/r2_mean.std()
plt.plot(alphas, r2_mean, label='mean', lw=4, color='blue')

# plot folds
for fold in range(grid_search_alpha.cv):
    r2_fold = grid_search_alpha.cv_results_['split{:.0f}_test_score'.format(fold)]
    # normalize
    r2_fold = (r2_fold-r2_fold.mean())/r2_fold.std()
    plt.plot(alphas, r2_fold, label='fold ' + str(fold), lw=1, color='black')

plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('standardized r2 score [a.u.]')
plt.axvline(grid_search_alpha.best_params_['regressor__lasso__alpha'], linewidth=2, linestyle='--', color='k')
result = 'grid search results\nbest alpha={:.5f}'.format(grid_search_alpha.best_params_['regressor__lasso__alpha'])
plt.title(result)
print(result)
plt.legend(ncol=1, loc='center left', bbox_to_anchor=(1,0.5))



grid search results
best alpha=0.00032


<matplotlib.legend.Legend at 0x7f2e524dc790>

### Fit with regressor found with grid search

In [37]:
# Store estimator with best alpha
reg = grid_search_alpha.best_estimator_
models[model_name].update({'model':reg})

# fit
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [reg.regressor_.steps[-1][1].intercept_, *reg.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.3s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[Colum

In [38]:
# update features, by adding fuels
cat_columns_reduced += ['fuel']
cats += [fuel_list]

# Split betas per category feature.
idx_start = len(num_columns)+1
cat_betas = list()
for cat in cats:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns)+1] # first
assert cat_betas[-1][-1] == betas[-1] # last

In [39]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_reduced) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,4*n_plots]
)
plt.subplots_adjust(hspace=0.5)

# Plot coefficients
for feats, coefs, name, ax in zip(
    [['offset'] + features] + cats,
    [[betas[0]] + betas[1:len(num_columns)+1]] + cat_betas,
    ['numerical'] + cat_columns_reduced,
    axs
):
    # activate subplot axes
    plt.sca(ax)
    # sort by bar height
    x = [feats[i] for i in np.argsort(coefs)[::-1]]
    y = sorted(coefs, reverse=True)
    # plot bar
    plt.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

    # prettify
    if not name.startswith('num'):
        plt.yticks(np.arange(-0.5,+0.6,0.1))
        bot_tick, top_tick = plt.ylim(top=+0.5, bottom=-0.5)
    else:
        plt.yticks(np.arange(-2,2.2,0.5))
        bot_tick, top_tick = plt.ylim(top=+2, bottom=-2)
        # stats
        xy=[plt.gca().get_xlim()[1], plt.gca().get_ylim()[1]]
        plt.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
            models[model_name]['R^2'],
            models[model_name]['cv R^2'].shape[0],
            np.mean(models[model_name]['cv R^2']),
            np.std(models[model_name]['cv R^2']),
        ) + '\n' +
                 'parameters total n={}, not zero n={}\n'.format(len(betas), sum(np.array(betas) != 0)) +
                 'train (n = {})'.format(y_train.shape[0]) + '\n' +
                 'test (n = {}, $R^2$ = {:.2f})'.format(
                     y_test.shape[0],
                     models[model_name]['test R^2'],
                 ), style='italic', va='top', ha='left')


    # plot sign switch
    x_sign_switch1 = np.nonzero(np.array(y+[-np.inf]) < 0)[0][0]
    x_sign_switch2 = np.nonzero(np.array([+np.inf]+y) > 0)[0][-1]
    plt.axvline(x_sign_switch1-0.5, linewidth=2, linestyle='--', color='k')
    plt.axvline(x_sign_switch2-0.5, linewidth=2, linestyle='--', color='k')
    plt.axhline(0, linewidth=2, linestyle='-', color='k')

    # add values when bar is small or too large (clipping)
    yt,ytl=plt.yticks()
    first_tick = sorted(np.abs(yt))[1]
    for x_val, coef in zip(x,y):
        if (coef < first_tick) & (coef > 0):
            plt.text(x_val, coef, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif (coef > -first_tick) & (coef < 0):
            plt.text(x_val, 0, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef > top_tick:
            # generally this is offset (bias)
            plt.text(x_val, top_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef < bot_tick:
            plt.text(x_val, bot_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')

    
    # labels and titles
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    if not name.startswith('num'):
        plt.title('Categorical feature: ' + name, style='italic')
    else:
        plt.title('Multiple linear regression (Lasso, alpha={:g})\nNumerical features'.format(
            reg.regressor_.named_steps['lasso'].alpha
        ), style='italic') 
    plt.ylabel('Coefficient (a.u.)', style='italic')
    
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 20:
        add_space = len(x) - 20
        xl = list(plt.xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        plt.xlim(xl)

# Label on bottom panel
plt.sca(axs[-1])
plt.xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

  ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)


../results/MLR_Lasso.png


- - - - - 

In [40]:
# Display prediction errors

x_sample = df.dropna(subset=['price']).iloc[:,1:]
y_sample = df.dropna(subset=['price']).iloc[:,0]
y_sample_pred = models[model_name]['model'].predict(x_sample) 

x_sample['price'] = y_sample
x_sample['prediction_error'] = y_sample_pred - y_sample
x_sample['prediction_error_fraction'] = y_sample_pred/y_sample
x_sample['prediction_error_log'] = np.log10(x_sample.prediction_error_fraction)
x_sample['prediction_error_abslog'] = np.abs(np.log10(x_sample.prediction_error_fraction))
x_sample['prediction'] = y_sample_pred
x_sample['age_y'] = x_sample.age/365

# Note some are close to perfect, because they are in training set and are unique in brand etc
print('best predictons')
display(x_sample.sort_values(by='prediction_error_abslog').head(16).T)
print('worst predictions')
display(x_sample.sort_values(by='prediction_error_abslog').tail(16).T)
print('largest underestimate')
display(x_sample.sort_values(by='prediction_error').head(16).T)
print('largest overestimate')
display(x_sample.sort_values(by='prediction_error').tail(16).T)
print('worst prediction recent auction')
is_last_auction = x_sample.index.str.startswith('-'.join(x_sample.index[-1].split('-')[:2]))
display(x_sample[is_last_auction].sort_values(by='prediction_error_abslog').tail(8).T)

plt.figure(figsize=[8,8])
plt.plot(x_sample.age_y, x_sample.prediction_error_log, color='k', marker='s', markeredgecolor = (0, 0, 0, 0), markerfacecolor = (0, 0, 0, 1), linestyle='None', ms=4)
plt.axhline(0, lw=2, linestyle='--', color ='k')
plt.xlabel('age [years]')
plt.ylabel('prediction error [log of fraction]\n(positive: prediction overestimates)')
plt.show()

best predictons


Unnamed: 0,2018-11-8145,2021-12-802322,2019-10-7168,2021-04-7032,2014-12-7181,2020-8-7113,2022-01-701801,2017-12-2222,2021-04-7137,2020-10-7137,2020-8-7175,2017-6-7130,2017-6-7133,2019-5-8141,2017-9-8106,2020-11-8150
brand,VOLKSWAGEN,MERCEDES-BENZ,AUDI,TOYOTA,VOLKSWAGEN,DODGE,OPEL,AUDI,MERCEDES-BENZ,OPEL,VOLKSWAGEN,PEUGEOT,OPEL,VOLKSWAGEN,CHRYSLER,OPEL
model,golf,a 200,a3,auris,polo,caliber,meriva-a,a3,c 180,corsa,polo,206,corsa,golf,pt cruiser,meriva-a
age,5214.0,1212.0,6082.0,2715.0,2414.0,5008.0,4970.0,1513.0,7133.0,3557.0,2851.0,6156.0,3773.0,5515.0,5887.0,4672.0
fuel,diesel,benzine,diesel,benzine/elektriciteit,benzine,benzine/lpg/g3 gasinstallatie,benzine,benzine,benzine,diesel,diesel,benzine,diesel,benzine,benzine,benzine
odometer,213984.0,58507.0,387923.0,131981.0,71994.0,239875.0,149086.0,27035.0,264548.0,260522.0,247777.0,157475.0,174730.0,273563.0,162562.0,128396.0
days_since_inspection_invalid,205.0,-249.0,83.0,-207.0,-508.0,65.0,-137.0,-742.0,98.0,-37.0,203.0,-569.0,-245.0,98.0,-165.0,52.0
age_at_import,0.0,0.0,0.0,55.0,0.0,0.0,0.0,849.0,0.0,0.0,838.0,0.0,3617.0,0.0,0.0,0.0
body_type,hatchback,stationwagen,hatchback,mpv,hatchback,hatchback,mpv,stationwagen,sedan,mpv,hatchback,hatchback,hatchback,hatchback,stationwagen,mpv
displacement,1896.0,1333.0,1896.0,1798.0,1390.0,1998.0,1364.0,1395.0,1998.0,1248.0,1199.0,1360.0,1248.0,1390.0,1996.0,1598.0
number_of_cylinders,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,4.0


worst predictions


Unnamed: 0,2017-3-2003,2021-07-808017,2018-8-2400,2015-03-2402,2017-3-2007,2017-3-2409,2014-12-2207,2021-06-260216,2018-7-2415,2017-5-2216,2021-10-260520,2021-03-2606,2017-3-2000,2018-1-2412,2021-05-2201,2018-7-2411
brand,VOLKSWAGEN,HYUNDAI,ROLLS ROYCE,FORD,VOLKSWAGEN,VOLKSWAGEN,FORD,MERCEDES-BENZ,AUSTIN-HEALEY,ALFA ROMEO,LINCOLN,JAGUAR,ALFA ROMEO,VOLKSWAGEN,MERCEDES-BENZ,MERCEDES-BENZ
model,152131,trajet,phantom drophead coupe,thunderbird,karmann ghia,T2,thunderbird,w110/190d,3000 mkiii phase ii,2000 gtv,continental iii conv,e-type,2000 gtv,t1,w100 600 pullman,sl230
age,15121.0,5459.0,3136.0,21063.0,18507.0,15873.0,20973.0,21156.0,19116.0,16257.0,23284.0,18507.0,16196.0,19813.0,20451.0,19848.0
fuel,benzine,diesel,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine
odometer,84145.0,152702.0,11305.0,86207.0,59227.0,46642.0,86207.0,94097.0,106800.895872,23982.0,122427.626112,20623.0,23982.0,96563.858688,,68722.0
days_since_inspection_invalid,,447.0,,,,,,,-402.0,-739.0,,,-800.0,,,
age_at_import,,0.0,,12826.0,,,12826.0,,18063.0,0.0,,,0.0,,,
body_type,,mpv,,,,,,,cabriolet,coupe,,,coupe,,,
displacement,,1991.0,,0.0,,,0.0,,2912.0,,,,,,,
number_of_cylinders,,4.0,,8.0,,,8.0,,6.0,4.0,,,4.0,,,


largest underestimate


Unnamed: 0,2015-02-2200,2018-8-2400,2022-05-260625,2021-12-260012,2015-01-2414,2019-4-2411,2021-05-8126,2019-11-2418,2018-6-2410,2015-03-2420,2018-8-2410,2022-05-260925,2021-05-2202,2021-08-702908,2014-12-2221,2017-5-2406
brand,ASTON-MARTIN,ROLLS ROYCE,FERRARI,LAMBORGHINI,SKODA,MERCEDES-BENZ,MERCEDES-BENZ,PORSCHE,MERCEDES-BENZ,BENTLEY,ASTON-MARTIN,LAND ROVER,ROLLS ROYCE,MERCEDES-BENZ,BENTLEY,MERCEDES-BENZ
model,vanguish volante,phantom drophead coupe,430 scuderia,diablo sv 132 se,octavia,amg s63 cabriolet,v-klasse,panamera turbo s e-hybrid,S65 AMG,contintal gt 60w12 gtc,dbs,range rover 3.0 lwb autobiogra,rr01,amg e 43 4matic,bentley continental gtc,S600 Maybach
age,278.0,3136.0,4868.0,8546.0,315.0,636.0,1506.0,431.0,844.0,1139.0,2665.0,845.0,6283.0,1599.0,2344.0,810.0
fuel,benzine,benzine,benzine,benzine,benzine,benzine,diesel,,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine
odometer,4778.0,11305.0,11077.0,40090.0,6796.0,13.0,121324.0,6925.0,6379.0,18210.0,58429.0,183.0,89607.0,71428.0,67890.0,19173.0
days_since_inspection_invalid,,,,1949.0,-1146.0,,-320.0,,,,,,871.0,-593.0,,-651.0
age_at_import,,,,6164.0,0.0,,1067.0,,,,,,3486.0,523.0,,0.0
body_type,,,,coupe,sedan,,mpv,,,,,,sedan,sedan,,sedan
displacement,,,,5707.0,1798.0,,2143.0,,,,,,6749.0,2996.0,,5980.0
number_of_cylinders,,,,12.0,4.0,,4.0,,,,,,12.0,6.0,,12.0


largest overestimate


Unnamed: 0,2019-6-2409,2019-9-2400,2015-01-8117,2015-02-2204,2019-6-2403,2017-11-2214,2019-12-2407,2018-1-2411,2021-04-2205,2021-03-2206,2021-03-2610,2020-1-2414,2018-11-2401,2017-3-2405,2021-06-702006,2020-3-2406
brand,BMW,MERCEDES-BENZ,VOLKSWAGEN,AUDI,MERCEDES-BENZ,BMW,MERCEDES-BENZ,BMW,MERCEDES-BENZ,VOLVO,LAND ROVER,MERCEDES-BENZ,BENTLEY,MERCEDES-BENZ,PORSCHE,FERRARI
model,x6 m,amg glc 63 s 4matic,polo,a6 allroad quattro,amg c 63 s,7er reihe,amg c63 s,x5 m50d,gle 350 d 4matic,xc90 t8 twin engine,range rover sport,amg gle 63 s,continental gtc,amg gle 63 s,macan s,599
age,3431.0,216.0,1459.0,878.0,904.0,2939.0,1059.0,1761.0,1718.0,579.0,1851.0,1063.0,4200.0,484.0,783.0,3611.0
fuel,benzine,benzine,diesel,diesel,benzine,benzine,benzine,diesel,diesel,elektriciteit/benzine,benzine,benzine,benzine,benzine,benzine,benzine
odometer,138603.0,7702.0,81299.0,34999.0,30129.0,70432.0,22356.0,36340.0,55224.0,7071.0,110979.0,31404.0,27184.0,20757.0,25070.0,11974.0
days_since_inspection_invalid,121.0,-1245.0,-152.0,-217.0,-557.0,-240.0,-402.0,-102.0,244.0,-882.0,-121.0,-398.0,-21.0,-977.0,-678.0,2150.0
age_at_import,2678.0,0.0,,364.0,544.0,2447.0,329.0,1349.0,1067.0,0.0,1246.0,526.0,2028.0,135.0,0.0,3619.0
body_type,stationwagen,stationwagen,,stationwagen,coupe,sedan,cabriolet,stationwagen,stationwagen,mpv,stationwagen,stationwagen,cabriolet,stationwagen,stationwagen,coupe
displacement,4395.0,3982.0,1199.0,2967.0,3982.0,4395.0,3982.0,2993.0,2987.0,1969.0,4999.0,5461.0,5998.0,5461.0,2995.0,5999.0
number_of_cylinders,8.0,8.0,3.0,6.0,8.0,8.0,8.0,6.0,6.0,4.0,8.0,8.0,12.0,8.0,6.0,12.0


worst prediction recent auction


Unnamed: 0,2022-06-706306,2022-06-705906,2022-06-703206,2022-06-709006,2022-06-707406,2022-06-705306,2022-06-709806,2022-06-706006
brand,VOLVO,MITSUBISHI,AUDI,SUZUKI,VOLKSWAGEN,RENAULT,MERCEDES-BENZ,VOLKSWAGEN
model,v70,lancer,a6,jimny,golf,clio,sl350,polo
age,7915.0,4160.0,2659.0,8104.0,3537.0,5060.0,7076.0,7303.0
fuel,benzine,benzine,diesel,benzine,benzine,benzine/lpg/g3 gasinstallatie,benzine,benzine
odometer,361068.0,179393.0,174750.0,,38102.0,350494.0,212504.0,166718.0
days_since_inspection_invalid,-105.0,80.0,,342.0,110.0,-53.0,,83.0
age_at_import,0.0,0.0,,0.0,2135.0,0.0,,2830.0
body_type,stationwagen,hatchback,,mpv,stationwagen,hatchback,,hatchback
displacement,2435.0,1499.0,,1298.0,1395.0,1598.0,,1390.0
number_of_cylinders,5.0,4.0,,4.0,4.0,4.0,,4.0


In [41]:
# check to see if combining features would improve model
yX = df.dropna(subset=['price'])
yX.loc[:,'usage_intensity'] = (yX.odometer / yX.age)
yX.loc[:,'classic'] = yX.age > 25
print(yX.corr().price)
print('\n"usage_intensity" does not seem to correlate better than "age" and "odometer" seperately')

price                            1.000000
age                             -0.345795
odometer                        -0.443297
days_since_inspection_invalid   -0.112642
age_at_import                    0.048508
displacement                     0.374775
number_of_cylinders              0.357952
power                            0.606776
weight                           0.347442
registration_tax                 0.372875
sale_price                       0.689801
number_of_seats                 -0.027936
number_of_doors                  0.168958
top_speed                        0.497844
length                           0.301031
height                           0.061355
width                            0.429777
number_of_gears                  0.582543
private_owners                  -0.259573
company_owners                   0.108663
converted_energy_label           0.236275
usage_intensity                 -0.072843
classic                          0.019472
Name: price, dtype: float64

"usag

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = val

<a href="#pred_top" id='pred_model_8'><font size=+1><center>^^ TOP ^^</center></font></a>

---

<a href="#pred_top" id='pred_accuracies'><font size=+1><center>^^ TOP ^^</center></font></a>

---

## Model accuracies

In [42]:
# plot R^2

# counter for x-offset
c=0

# figure
fig = plt.figure(figsize=[2,2])
ax = fig.gca()
xs = ys = [None]

# loop over all models
for name,res in models.items():

    c+=1 # x-offset

    if name == 'linear regression no cv':
        # No cv, so only one value. Make it a list of one for type consistency
        k = 'R^2'
        rsq = [res[k]]
    
    else: 
        k = 'cv R^2'
        rsq = res[k]
        
    # add r-squares and offset to vectors
    ys = np.concatenate([ys,rsq])
    xs = np.concatenate([xs,np.ones_like(rsq) * c])

# actual plotting
sns.swarmplot(x=xs, y=ys, ax=ax)
# prettify
ax.set_xticklabels(models.keys(), rotation=45, va='top', ha='right', style='italic')
ax.set_ylim(bottom=0, top=1)
ax.set_title('Model performance\n', style='italic')
ax.set_ylabel('Coefficient of determination\n($R^2$)', style='italic')


# save
file_name = '../results/model-performance.png'
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=True)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')



../results/model-performance.png


In [43]:
# plot data

# loop over all models
for model_name in models.keys():
    print(model_name)
    res = models[model_name]
    
    # all original data
    yX = df.loc[:,['price', 'age']].dropna()
    X = yX.iloc[:,1]
    y = yX.iloc[:,0]
    
    features = num_columns.copy()
    
    # model specific adjustments
    if (model_name == 'linear regression log price') or (model_name == 'linear regression log price young'):
        # log price is used
        y = np.log10(y)
        # unit
        unit = '(log[EUR])'
    elif (model_name == 'MLR reduced observations') or (model_name == 'MLR impute median'):
        yX = df.dropna(subset=['price'] + features).loc[:,['price'] + features]
        X = yX.iloc[:,1:]
        y = np.log10(yX.iloc[:,0])
        unit = '(log[EUR])'
    elif (model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]      
        unit = '(EUR)'
    else:
        unit = '(EUR)'
    
    if X.ndim != 1:
        n_feat = X.shape[1]
    else:
        n_feat = 1
        
    if not ((model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features')):
        # needed for .predict
        X = np.array(X).reshape(-1,n_feat)
        y = np.array(y).reshape(-1,1)
    
    # predict all data
    y_pred = res['model'].predict(X)
    if max(y) < 10:
        rmse = np.sqrt(np.mean(((10**y)-(10**y_pred))**2))
    else:
        rmse = np.sqrt(np.mean((y-y_pred)**2))
    print(rmse)

    # actual plotting
    fig,axs = plt.subplots(nrows=2, ncols=1, figsize=[4,4])
    
    # data
    axs[0].plot(y, y_pred, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4,)
    # error
    axs[1].plot(y, y_pred-y, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4,)
    
    # axis equal for top
    if (model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features'):
        axs[0].set_xscale('log')
        axs[0].set_yscale('log')
        axs[1].set_xscale('log')
    axs[0].set_aspect(1)
    # store limits
    yl = axs[0].get_ylim()
    xl_top = axs[0].get_xlim()
    xl_bot = axs[1].get_xlim()
    xl = [np.max([xl_top[0], xl_bot[0]]), np.min([xl_top[1], xl_bot[1]])]
    # plot unity line and 0 error
    unity_line = [np.max([xl[0], yl[0]]), np.min([xl[1], yl[1]])]
    axs[0].plot(unity_line, unity_line, '-k', linewidth=2)
    axs[1].plot(xl, [0, 0], '-k', linewidth=2)
    # reset limits
    axs[0].set_xlim(xl)
    axs[1].set_xlim(xl)

    # make equal size panels
    # Note: sharex did not work
    bb=axs[0].get_position(False)
    rect_top = bb.bounds
    bb=axs[1].get_position(False)
    rect_bot = bb.bounds
    rect = list(rect_bot)
    rect[0] = rect_top[0]
    rect[2] = rect_top[2]
    axs[1].set_position(rect)
    
    # labeling
    fig.suptitle('{}\nrmse: EUR {:.0f}'.format(model_name,rmse), style='italic')
    axs[1].set_xlabel('Real price ' + unit, style='italic')
    axs[0].set_ylabel('Predicted price\n' + unit, style='italic')
    axs[1].set_ylabel('Prediction error\n' + unit, style='italic')
    
    # save
    file_name = '../results/{}-accuracy.png'.format(model_name.replace(' ','_'))
    if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
        print(file_name)
        with plt.style.context('../assets/context-paper.mplstyle'):
            plt.savefig(file_name, bbox_inches='tight', transparent=True)
    else:
        plt.show()
        print(f'Skip. {file_name} exists or saving is disabled in settings.')

linear regression no cv
9995.143506616845
../results/linear_regression_no_cv-accuracy.png
linear regression log price
10207.535575023114
../results/linear_regression_log_price-accuracy.png
linear regression log price young
9285.8960450924
../results/linear_regression_log_price_young-accuracy.png
MLR reduced observations
6682.666038067728
../results/MLR_reduced_observations-accuracy.png
MLR impute median
7171.272662148217
../results/MLR_impute_median-accuracy.png
MLR with categorical
6964.297697108233
../results/MLR_with_categorical-accuracy.png
MLR Lasso
6115.655914644568
../results/MLR_Lasso-accuracy.png


In [44]:
assert False, 'stop running, below is sandboxing and testing'

AssertionError: stop running, below is sandboxing and testing

<a href="#pred_top" id='pred_save_model'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Save model as pickle
Save the best model as a .pkl file.

See also: https://scikit-learn.org/stable/modules/model_persistence.html

In [None]:
# import dill # dill acts as pickle but handles lambda functions
model_name = 'MLR Lasso' 
model = models[model_name]

In [None]:
model['name'] = model_name
fn = '../results/trained_model_{}.pkl'.format(model_name.replace(' ', '_').lower())
print(fn)
# with open(fn, 'wb') as file:
#     dill.dump(model, file)

<a href="#pred_top" id='pred_predict'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Example predictions

In [None]:
# Predict some known cars
B = pd.DataFrame(columns=X.columns, index=['Mine'])
B.loc['Mine', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['CITROËN', 'berlingo', 'benzine', 'mpv', 'Gray']
B.loc['Mine', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1600, 4, 5, 5, 'n', 5]
B.loc['Mine', ['top_speed']] = [170]
B.loc['Mine', 'age'] = (pd.to_datetime('now') - pd.to_datetime('2005-12-1')).days
B.loc['Mine', 'days_since_inspection_invalid'] = (pd.to_datetime('now') - pd.to_datetime('2022-6-11')).days
B.loc['Mine', 'age_at_import'] = 0
B.loc['Mine', 'odometer'] = 160000
B.loc['Mine', ['weight']] = [1326]

B.loc['Peer', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['CITROËN', 'ax', 'benzine', 'hatchback', 'Gray']
B.loc['Peer', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1100, 4, 5, 5, 'n', 5]
B.loc['Peer', ['top_speed']] = [170]
B.loc['Peer', 'age'] = (pd.to_datetime('now') - pd.to_datetime('1996-12-1')).days
# B.loc['Mine', 'days_since_inspection_invalid'] = (pd.to_datetime('now') - pd.to_datetime('2020-6-11')).days
B.loc['Peer', 'age_at_import'] = 0
B.loc['Peer', 'odometer'] = 160000
B.loc['Peer', ['weight']] = [800]

B.loc['a car', ['brand']] = [np.NaN]


B.loc['J-892-TZ', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['SUZUKI', 'sx4', 'benzine', 'hatchback', 'Gray']
B.loc['J-892-TZ', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1586, 4, 5, 5, 'n', np.NaN]
B.loc['J-892-TZ', 'age'] = (pd.to_datetime('now') - pd.to_datetime('2010-11-11')).days
B.loc['J-892-TZ', 'days_since_inspection_invalid'] = (pd.to_datetime('now') - pd.to_datetime('2021-11-18')).days
B.loc['J-892-TZ', 'age_at_import'] = (pd.to_datetime('now') - pd.to_datetime('2020-11-19')).days
B.loc['J-892-TZ', 'odometer'] = 58153
B.loc['J-892-TZ', 'weight'] = 1230
B.loc['J-892-TZ', 'power'] = 118
B.loc['J-892-TZ', 'automatic_gearbox'] = 'y'
B.loc['J-892-TZ', 'private_owners'] = 1
B.loc['J-892-TZ', 'company_owners'] = 0
B.loc['J-892-TZ', 'sale_price'] = 19979
B.loc['J-892-TZ', 'registration_tax'] = 3936

B.loc['J-892-TZ-real'] = df.loc['2022-01-805121',:].drop(columns='Price')

B.T

In [None]:
df_ = pd.DataFrame(index=models.keys(), columns=B.index)
for model in df_.index[::-1]:
    try:
        print(f'{model}')
        B.loc[:,'predict'] = models[model]['model'].predict(B)
        pred = B.predict
    except: 
        pred = pd.Series(index=B.index, data=np.NaN)
    df_.loc[model,:] = pred
    
df_

In [None]:
B2 = df.loc['2022-01-805121',:].to_frame().T.drop(columns='price')
models['MLR Lasso']['model'].predict(B2)


In [None]:
B = pd.read_pickle('/home/tom/bin/satdatsci/Saturday-Datascience/data/rdw-data-2021-02.pkl')
B.columns
#['brand', 'model', 'fuel', 'body_type', 'color']
B.loc[:, [
    'rdw_merk',
    'rdw_type',
    'rdw_brandstof_brandstof_omschrijving_1',
    'rdw_ovi_inrichting_code_omschrijving',
    'rdw_eerste_kleur',
     ]]

In [None]:
B.loc[:,(B == 'GRIJS').any()]