<a id='pred_top'>

# Predict auction price

Try several models and improve predicition accuracy

## Model fitting

- Linear fits  
  1. [Simple linear fit](#pred_model_1)  
     No cross validation. Observations with missing values are dropped.
  2. [Dependent values scaled](#pred_model_2)  
     Dependent value here is _prices_.
  3. [Partial data](#pred_model_3)  
     Only young cars
- Multiple linear regression models  
  1. [MLR fit without imputation](#pred_model_4)  
  2. [With imputation](#pred_model_5)  
  3. [Include categorical features](#pred_model_6)  
  4. [Lasso regularization](#pred_model_7)  
  5. [include engineered features](#pred_model_8)

## Results

- [Model performance](#pred_accuracies)
- [Save best model](#pred_save_model) **TODO**  
  This is not implemented yet. Some preprocessing functions are not handled well with `pickle`.
- [Predictions](#pred_predict)
     
  

In [1]:
import os
# setting path
os.chdir(r'..')

import drz_config
cfg = drz_config.read_config()
VERBOSE = cfg['VERBOSE']
SKIPSAVE = cfg['SKIPSAVE']

if VERBOSE > 0:
    display(cfg)

if cfg['OPBOD']:
    raise NotImplementedError

{'settings_fn': '../code/assets/drz-auction-settings.ini',
 'DATE': '2023-01',
 'VERBOSE': 1,
 'OPBOD': False,
 'URL': 'http://verkoop.domeinenrz.nl/verkoop_bij_inschrijving_2023-0001',
 'EXTEND_URL': False,
 'CLOSEDDATA': True,
 'closed_data_fields': '*',
 'SKIPSAVE': False}

In [2]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

import seaborn as sns

In [3]:
# set figure defaults (needs to be in cell seperate from import sns)
plt.style.use(['default', '../assets/movshon.mplstyle', '../assets/context-notebook.mplstyle'])

# Load data

In [4]:
fn = '../data/cars-for-ml.pkl'
print(fn)
df = pd.read_pickle(fn)
print(df.shape)

# categories
cat_columns = ['brand', 'model', 'fuel', 'body_type','color', 'energy_label', 'fwd', 'automatic_gearbox', 'under_survey']
# numerical
num_columns = list(np.setdiff1d(df.columns, cat_columns + ['price']))

# Factorized categorical values
fld = 'energy_label'
# replace empty with NaN creates factor '-1'
v, idx = pd.factorize(df[fld].replace({'': np.NaN}), sort=True)
# convert '-1' back to NaN
v = v.astype(float)
v[v==-1] = np.NaN
# Store in dataframe
new_col = 'converted_' + fld
df[new_col] = v
# update list
num_columns += [new_col]
cat_columns.remove(fld)
print('\nCategorical field [{}] is converted to sequential numbers with: '.format(fld), end='\n\t')
print(*['{} <'.format(c) for c in idx], end='\n\n')

# convert boolean to string
for fld in ['fwd', 'automatic_gearbox', 'under_survey']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    # # update list
    # cat_columns += [new_col]
    # cat_columns.remove(fld)
    replace_dict = {
        '': '', 
        True: 'y', 
        False: 'n'
    }
    df[new_col] = df[fld].replace(replace_dict)
    print('\nBoolean field [{}] is converted to numbers according to: '.format(fld), end='\n')
    print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# convert integer to float and replace -1
for fld in ['number_of_cylinders', 'number_of_doors', 'number_of_gears', 'number_of_seats']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        -1: np.NaN, 
    }
    df[new_col] = df[fld].replace(replace_dict).astype(float)

# convert empty string to NaN
for fld in ['brand', 'model', 'fuel', 'body_type', 'color', 'fwd']:
    if fld not in df.columns:
        print(f'!{fld} not in data!. Skip for now')
        continue
    new_col = fld
    replace_dict = {
        '': np.NaN, 
    }
    df[new_col] = df[fld].replace(replace_dict)

# translate Dutch to English
fld = 'color'
new_col = fld
# # update list
# cat_columns += [new_col]
# cat_columns.remove(fld)
replace_dict = {
    '': 'missing', 
    'BLAUW': 'Blue',
    'ROOD': 'Red',
    'GROEN': 'Green',
    'GRIJS': 'Gray',
    'WIT': 'White',
    'ZWART': 'Black',
    'BEIGE': 'Beige',
    'BRUIN': 'Brown',
    'ROSE': 'Pink',
    'GEEL': 'Yellow',
    'CREME': 'Creme',
    'ORANJE': 'Orange',
    'PAARS': 'Purple,'
}
df[new_col] = df[fld].replace(replace_dict)
print('\nField [{}] is converted according to: '.format(fld), end='\n')
print(*['\t"{}" -> {} ({})\n'.format(k,v, type(v)) for k,v in replace_dict.items()], end='\n\n')

# reporting
try:
    print('Categorical:', len(cat_columns))
    [print('\t[{:2.0f}] {:s}'.format(i+1, c)) for i,c in enumerate(df[cat_columns].columns)]
    print('Numercial:', len(num_columns))
    [print('\t[{:2.0f}] {:s}'.format(i+1, c)) for i,c in enumerate(df[num_columns].columns)]
    print('Last lot in data set:\n\t{}'.format(df.index[-1]))
except:
    cat_columns = [c for c in cat_columns if c in df.columns]
    num_columns = [c for c in num_columns if c in df.columns]    
    print('! not all fields are in data !. Skip for now')

../data/cars-for-ml.pkl
(9769, 29)

Categorical field [energy_label] is converted to sequential numbers with: 
	A < B < C < D < E < F < G <


Boolean field [fwd] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [automatic_gearbox] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Boolean field [under_survey] is converted to numbers according to: 
	"" ->  (<class 'str'>)
 	"True" -> y (<class 'str'>)
 	"False" -> n (<class 'str'>)



Field [color] is converted according to: 
	"" -> missing (<class 'str'>)
 	"BLAUW" -> Blue (<class 'str'>)
 	"ROOD" -> Red (<class 'str'>)
 	"GROEN" -> Green (<class 'str'>)
 	"GRIJS" -> Gray (<class 'str'>)
 	"WIT" -> White (<class 'str'>)
 	"ZWART" -> Black (<class 'str'>)
 	"BEIGE" -> Beige (<class 'str'>)
 	"BRUIN" -> Brown (<class 'str'>)
 	"ROSE" -> Pink (<class 'str'>)
 	"GEEL" -> Yel

In [5]:
# Store model results in dictonary: Instantiate empty dict
models = dict()

<a href="#pred_top" id='pred_model_1'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: Simple linear fit
Regress age (in days) with price (euro).  

## >> BIG FAT WARNING <<
All data is used without train / test split. I.e. accuracy is based on data that was used for fit. This is considered bad practice!

## Prepare input

In [6]:
from sklearn import linear_model

model_name = 'linear regression no cv'

X = df.dropna(subset=['price','age']).age.values.reshape(-1,1)
y = df.dropna(subset=['price','age']).price.values.reshape(-1,1)
print(X.shape)
print(y.shape)

(8491, 1)
(8491, 1)


## Fit

In [7]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X,y) # fit with all data
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})

In [8]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
plt.plot(X/365.25, y/1000, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4)
hdl_fit = plt.plot(prediction_X/365, prediction_y/1000, color='blue', marker=None, linestyle='-', linewidth=4)
plt.legend(hdl_fit, ['n = {}, $R^2$ = {:.2f}\ny = {:+.0f}{:+.2f}*(x*365.25)'.format(
    models[model_name]['n'],
    models[model_name]['R^2'],
    *models[model_name]['betas']
)], loc='upper right')
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR X1000)', style='italic')
plt.title('Simple linear fit', style='italic')
plt.ylim(bottom = -10)
plt.xlim(left = 0)

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_no_cv.png


<a href="#pred_top" id='pred_model_2'><font size=+1><center>^^ TOP ^^</center></font></a>

---

## Model: linear but with scaled dependent values (prices)

Instead of using all data **train/test split** is performed. Also prices are log transformed.  

## Prepare input

In [9]:
from sklearn.model_selection import train_test_split, cross_val_score

model_name = 'linear regression log price'

X = df.dropna(subset=['price','age']).age.values.reshape(-1,1)
y = np.log10(df.dropna(subset=['price','age']).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(8491, 1)
(8491, 1)


## Fit

In [10]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train) # fit with training set
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(5943, 1)
(2548, 1)


In [11]:
depr_half_n_days = -(np.log10(2)/models[model_name]['betas'][1])
print('According to "{}"-model'.format(model_name))
print('Car depreciates to half its value every\n\t{:.0f} days ({:.1f} years).'.format(depr_half_n_days, depr_half_n_days/365.25))
for y in [0,2,4,6,8]:
    print('\ty(t={:+5.0f}) = {:.0f} euro'.format(y, 10**reg.predict([[y*365.25]])[0][0]))
print('\n\ty(t={:+5.1f}) = {:.0f} euro'.format(depr_half_n_days/365.25, 10**reg.predict([[depr_half_n_days]])[0][0]))
print('\ty(t=0) / 2 = {:.0f} euro'.format(10**models[model_name]['betas'][0]/2))

According to "linear regression log price"-model
Car depreciates to half its value every
	2514 days (6.9 years).
	y(t=   +0) = 9830 euro
	y(t=   +2) = 8037 euro
	y(t=   +4) = 6571 euro
	y(t=   +6) = 5372 euro
	y(t=   +8) = 4392 euro

	y(t= +6.9) = 4915 euro
	y(t=0) / 2 = 4915 euro


In [12]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
hdl_trn = plt.plot(X_train/365.25, np.power(10,y_train), marker='s', markeredgecolor = (0, 0, 1, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='train (n = {})'.format(y_train.shape[0]))
hdl_tst = plt.plot(X_test/365.25, np.power(10,y_test), marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='test (n = {}, $R^2$ = {:.2f})'.format(
                       y_test.shape[0],
                       models[model_name]['test R^2'],
                   ))
hdl_fit = plt.plot(prediction_X/365, np.power(10,prediction_y), color='blue', marker=None, linestyle='-', linewidth=4, 
                   label = '$log10(y)$ = {:+.2f}{:+.1e}*(x*365.25)\n($R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f}))'.format(
                       *models[model_name]['betas'],
                       models[model_name]['R^2'],
                       models[model_name]['cv R^2'].shape[0],
                       np.mean(models[model_name]['cv R^2']),
                       np.std(models[model_name]['cv R^2']),
                   ))
plt.legend()
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR)', style='italic')
plt.title('Linear fit with log(price)', style='italic')
plt.ylim(bottom = 10, top = 1000000)
plt.xlim(left = 0)
plt.yscale('log')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_log_price.png


<a href="#pred_top" id='pred_model_3'><font size=+1><center>^^ TOP ^^</center></font></a>

---

## Model: scaled price, but only young cars

Same as [model 2](#pred_model_2), but ignore cars older than 25y

## Prepare input

In [13]:
from sklearn.model_selection import train_test_split, cross_val_score

model_name = 'linear regression log price young'

is_yng = df.age/365.25 < 25

X = df[is_yng].dropna(subset=['price','age']).age.values.reshape(-1,1)
y = np.log10(df[is_yng].dropna(subset=['price','age']).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(8261, 1)
(8261, 1)


## Fit

In [14]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train)
models[model_name].update({'n':y.shape[0]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(5782, 1)
(2479, 1)


In [15]:
depr_half_n_days = -(np.log10(2)/models[model_name]['betas'][1])
print('According to "{}"-model'.format(model_name))
print('Car depreciates to half its value every\n\t{:.0f} days ({:.1f} years).'.format(depr_half_n_days, depr_half_n_days/365.25))
for y in [0,2,4,6,8]:
    print('\ty(t={:+5.0f}) = {:.0f} euro'.format(y, 10**reg.predict([[y*365.25]])[0][0]))
print('\n\ty(t={:+5.1f}) = {:.0f} euro'.format(depr_half_n_days/365.25, 10**reg.predict([[depr_half_n_days]])[0][0]))
print('\ty(t=0) / 2 = {:.0f} euro'.format(10**models[model_name]['betas'][0]/2))

According to "linear regression log price young"-model
Car depreciates to half its value every
	1341 days (3.7 years).
	y(t=   +0) = 23772 euro
	y(t=   +2) = 16297 euro
	y(t=   +4) = 11172 euro
	y(t=   +6) = 7659 euro
	y(t=   +8) = 5251 euro

	y(t= +3.7) = 11886 euro
	y(t=0) / 2 = 11886 euro


In [16]:
# Fit a line by using predict
prediction_X = np.array([0,int(np.ceil(X.max()/365.25))*365.25]).reshape(-1,1)
prediction_y = reg.predict(prediction_X)

# plot
plt.figure(figsize=[8,8])
hdl_trn = plt.plot(X_train/365.25, np.power(10,y_train), marker='s', markeredgecolor = (0, 0, 1, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='train (n = {})'.format(y_train.shape[0]))
hdl_tst = plt.plot(X_test/365.25, np.power(10,y_test), marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4, 
                   label='test (n = {}, $R^2$ = {:.2f})'.format(
                       y_test.shape[0],
                       models[model_name]['test R^2'],
                   ))
hdl_fit = plt.plot(prediction_X/365, np.power(10,prediction_y), color='blue', marker=None, linestyle='-', linewidth=4, 
                   label = '$log10(y)$ = {:+.2f}{:+.1e}*(x*365.25)\n($R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f}))'.format(
                       *models[model_name]['betas'],
                       models[model_name]['R^2'],
                       models[model_name]['cv R^2'].shape[0],
                       np.mean(models[model_name]['cv R^2']),
                       np.std(models[model_name]['cv R^2']),
                   ))
plt.legend()
plt.xlabel('Age (years)', style='italic')
plt.ylabel('Winning bid (EUR)', style='italic')
plt.title('Linear fit with log(price) of young cars', style='italic')
plt.ylim(bottom = 10, top = 1000000)
plt.xlim(left = 0)
plt.yscale('log')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/linear_regression_log_price_young.png


<a href="#pred_top" id='pred_model_4'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: Multiple linear fit

Above [simple linear models](#pred_model_1) only use _Age_ as predictor of price. Here MLR will regress many (numerical) features with price (euro).  


## Prepare input

In [17]:
model_name = 'MLR reduced observations'

features = num_columns 
# Can be reduced here

X = df.dropna(subset=['price'] + features).loc[:,features].values.reshape(-1,len(features))
y = np.log10(df.dropna(subset=['price'] + features).price.values.reshape(-1,1))
print(X.shape)
print(y.shape)

(1778, 20)
(1778, 1)


## Fit

In [18]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
reg = linear_model.LinearRegression()
models[model_name].update({'model':reg})

# fit
reg.fit(X_train,y_train)
models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [*reg.intercept_, *reg.coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(1244, 20)
(534, 20)


In [19]:
# plot coefficients
plt.figure(figsize=[8,2])
ax=plt.gca()

# sorted bar height
betas = models[model_name]['betas']
x = ['offset (log[EUR])'] + [features[i] for i in np.argsort(betas[1:])[::-1]]
y = [betas[0]] + sorted(betas[1:], reverse=True)

# plot bar
ax.bar(x=x, height=y, edgecolor='k', facecolor='None')

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<1:
        ax.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
ax.set_yticks(range(0,5,2))

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
ax.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
ax.axhline(0, linewidth=2, linestyle='-', color='k')
        
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
yl = ax.get_ylim()
ax.vlines(x_sign_switch-0.5, yl[0], yl[1], linewidth=2, linestyle='--')
ax.set_ylim(yl)
# ax.set_ylim(top=0.01, bottom=-0.01)

# labels
ax.set_xticks(x)
ax.set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')
ax.xaxis.set_tick_params(which='minor', bottom=False)
ax.set_xlabel('Feature', style='italic')
ax.set_ylabel('Coefficient (a.u.)', style='italic')
ax.set_title('Multiple linear regression', style='italic') 

# stats
xy=[ax.get_xlim()[1], ax.get_ylim()[1]]
ax.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')


# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_reduced_observations.png


<a href="#pred_top" id='pred_model_5'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR + imputer

MLR as above, but instead of `dropna` us an imputer. This allows to use more observation.  

At this point a pipeline is used.

## Prepare input

In [20]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

model_name = 'MLR impute median'

features = num_columns 
# Can be reduced here

yX = df.loc[:,['price'] + features].dropna(subset=['price'])
X = yX.iloc[:,1:].values.reshape(-1,len(features))
y = np.log10(yX.iloc[:,0].values.reshape(-1,1))
print(X.shape)
print(y.shape)

(8524, 20)
(8524, 1)


## Fit

In [21]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

# create regression model object and store
pl = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    linear_model.LinearRegression()
)
models[model_name].update({'model':pl})

# fit
pl.fit(X,y)
models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [*pl.steps[-1][1].intercept_, *pl.steps[-1][1].coef_[0]]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':pl.score(X,y)})
models[model_name].update({'test R^2':pl.score(X_test,y_test)})
cv_results = cross_val_score(pl, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})


(5966, 20)
(2558, 20)


In [22]:
# plot coefficients
plt.figure(figsize=[8,4])
ax = plt.gca()

# sorted bar height
betas = models[model_name]['betas']
x = ['offset (log[EUR])'] + [features[i] for i in np.argsort(betas[1:])[::-1]]
y = [betas[0]] + sorted(betas[1:], reverse=True)

# plot bar
ax.bar(x=x, height=y, edgecolor='k', facecolor='None')

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<0.05:
        ax.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
ax.set_yticks(np.arange(-0.3,0.4,0.1))
ax.set_ylim(top=+0.3, bottom=-0.3)
# offset
x_val = x[0]
coef = y[0]
ax.text(x_val, 0.3, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
ax.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
ax.axhline(0, linewidth=2, linestyle='-', color='k')

# labels
ax.set_xticks(x)
ax.set_xticklabels(labels=x, rotation=45, va='top', ha='right', style='italic')
ax.xaxis.set_tick_params(which='minor', bottom=False)
ax.set_xlabel('Feature', style='italic')
ax.set_ylabel('Coefficient (a.u.)', style='italic')
ax.set_title('Multiple linear regression', style='italic') 

# stats
xy=[ax.get_xlim()[1], ax.get_ylim()[1]]
ax.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')


# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_impute_median.png


<a href="#pred_top" id='pred_model_6'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR with categorical

As MLR, but do one-hot-encoding

Use different scalers for different columns:  
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html  
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer  
p. 68 book: ML with sklearn & tf

## Prepare input

In [23]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler
# from sklearn.pipeline import FeatureUnion

model_name = 'MLR with categorical'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
features = num_columns + cat_columns_reduced
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
# # only use young
# is_yng = yX.age/365.25 < 25
# yX = yX[is_yng]
X = yX.iloc[:,1:]
y = yX.iloc[:,0]
print(X.shape)
print(y.shape)


(8524, 29)
(8524,)


In [24]:
import re

# Split fuel helper functions

def split_lpg_type(s):
    '''Split lpg type from list of fuels separated by / '''
    # No type
    if s.endswith('lpg'):
        return s, ''
    if 'lpg' not in s:
        return s, ''
    # Type is after the last '/'
    M = re.search('^(.*)/(.*)$',s)
    if M:
        return M[1], M[2]
    else:
        return s, ''

def merge_lpg_and_lpgtype(fuel_type):

    '''Add LPG type to LPG (remove /). 
    Note that order of fuels is preserved. I.e. it is able to return both "benzine/lpg-g3" and "lpg-g3/benzine". '''
    
    lpg_type = fuel_type.apply(lambda s: 'lpg-' + split_lpg_type(s)[1] if (type(s) == str) and ('lpg' in s) else '')
    fuel_type_short = fuel_type.apply(lambda s: split_lpg_type(s)[0] if (type(s) == str) else '')
    fuel_type_new = pd.Series([f.replace('lpg', l) if type(f) == str else f for f,l in zip(fuel_type_short,lpg_type)])
    return fuel_type_new


def get_unique_fuels(fuel_type):
    
    '''Splitting fuels at "/" and return unique values'''
    
    # make list (as string)
    fuel_type_list = fuel_type.apply(lambda s:s.split('/') if type(s) == str else np.NaN).astype(str)
    
    # Get unique fuels
    possible_fuels = list() # empty list
    for l in fuel_type_list.unique():
        for ll in eval(l): # use eval to convert str to list
            possible_fuels += [ll]     
    # uniquify
    return np.unique(possible_fuels)

    
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer to make one-hot fuel encoder based on string
# This is different from get_dummies, because it can take a list of values in a field
class DummyfyFuel(BaseEstimator, TransformerMixin):
    def __init__(self, fuel_names=None):
        
        assert (fuel_names == None) or (isinstance(fuel_names, (list,))), '[fuel_names] should be list (or None)'
        
        self.fuel_names = fuel_names
        
    def fit(self, X, y=None):
        
        if not self.fuel_names:
            # get fuel names based on input.
            # Note that if train/test are split, test might lack a fuel type.
            self.fuel_names = get_unique_fuels(merge_lpg_and_lpgtype(X))

        return self
    
    def transform(self, X):
        
        # get stringyfied list
        fuel_type_list = merge_lpg_and_lpgtype(X).apply(lambda s:s.split('/') if type(s) == str else np.NaN).astype(str)
        # set index as input
        fuel_type_list.index = X.index

        # transform: dummies
        fuel_dummies = pd.DataFrame(index=fuel_type_list.index)
        for f in self.fuel_names:
            fuel_dummies['fuel_' + f] = fuel_type_list.apply(lambda l:int(f in eval(l)))

        return fuel_dummies


In [25]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(5966, 29)
(2558, 29)


In [26]:
# Create model

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
#fuel_list = ['benzine', 'diesel']
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns)),
    ('categorical', cat_transformer, pd.Index(cat_columns_reduced)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=True)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.LinearRegression()
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y,
#     func=lambda x:x,
#     inverse_func=lambda y: y,
#     inverse_func=np.exp,
)

models[model_name].update({'model':pl})

In [27]:
# fit
pl.fit(X_train, y_train)
y_pred = pl.predict(X_test)

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.3s


In [28]:
# sanity check that target transformation has occured as expected
# y_pred_manual_transform = mlr.predict(X_test)
# assert all(np.log10(y_pred)-y_pred_manual_transform == 0)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [pl.regressor_.steps[-1][1].intercept_, *pl.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})

# score
models[model_name].update({'R^2':pl.score(X,y)})
models[model_name].update({'test R^2':pl.score(X_test,y_test)})
cv_results = cross_val_score(pl, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ... (3 of 3) Processing onehot_fuel, total=   0.1s
[ColumnTransformer] ..... (1 of 3) Processing numerical, total=   0.0s
[ColumnTransformer] ... (2 of 3) Processing categorical, total=   0.0s
[Colum

In [29]:
# update features, by adding fuels
cat_columns_reduced += ['fuel']
cats += [fuel_list]


In [30]:
# Split betas per category feature.
idx_start = len(num_columns)+1
cat_betas = list()
for cat in cats:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns)+1] # first cat beta follows numerical betas 
assert cat_betas[-1][-1] == betas[-1] # last

In [31]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_reduced) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,4*n_plots]
)
plt.subplots_adjust(hspace=0.5)


# Plot numerical
ax = axs[0]
# sorted bar height
betas = models[model_name]['betas']
num_betas = betas[1:len(num_columns)+1]
x = ['offset'] + [features[i] for i in np.argsort(num_betas)[::-1]]
y = [betas[0]] + sorted(num_betas, reverse=True)

# plot bar
ax.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

# add values when bar is small
for x_val, coef in zip(x,y):
    if np.abs(coef)<0.5:
        ax.text(x_val, coef, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')
ax.set_yticks(np.arange(-2,2.2,0.5))
ax.set_ylim(top=+2, bottom=-2)
# offset
x_val = x[0]
coef = y[0]
ax.text(x_val, 2, '{:.3g}'.format(coef), rotation=45, va='bottom', ha='left')

# plot origin
x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
ax.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
ax.axhline(0, linewidth=2, linestyle='-', color='k')

# labels        
rot = 45
fsz = 10
ha = 'right'
ax.set_xticks(x)
ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
ax.xaxis.set_tick_params(which='minor', bottom=False)
ax.set_ylabel('Coefficient (a.u.)', style='italic')
ax.set_title('Multiple linear regression\nNumerical features', style='italic') 

# stats
xy=[ax.get_xlim()[1], ax.get_ylim()[1]]
ax.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
    models[model_name]['R^2'],
    models[model_name]['cv R^2'].shape[0],
    np.mean(models[model_name]['cv R^2']),
    np.std(models[model_name]['cv R^2']),
) + '\n' +
         'train (n = {})'.format(y_train.shape[0]) + '\n' +
         'test (n = {}, $R^2$ = {:.2f})'.format(
             y_test.shape[0],
             models[model_name]['test R^2'],
         ), style='italic', va='top', ha='left')

# Plot categorical
for cat, cat_beta, cat_name, ax in zip(cats, cat_betas, cat_columns_reduced, axs[1:]):
    # sort by height
    x = [cat[i] for i in np.argsort(cat_beta)[::-1]]
    y = sorted(cat_beta, reverse=True)
    #x = cat
    #y = cat_beta
    # plot bar
    ax.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=False)

    # prettify
    ax.set_yticks(np.arange(-1,+1.1,0.2))
    ax.set_ylim(top=+1, bottom=-1)

    # plot origin
    x_sign_switch = np.nonzero(np.array(y) < 0)[0][0]
    ax.axvline(x_sign_switch-0.5, linewidth=2, linestyle='--', color='k')
    ax.axhline(0, linewidth=2, linestyle='-', color='k')

    # labels
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticks(x)
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    ax.set_title('Categorical feature: ' + cat_name, style='italic')
    ax.set_ylabel('Coefficient (a.u.)', style='italic')
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 20:
        add_space = len(x) - 20
        xl = list(ax.set_xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        ax.set_xlim(xl)

# Label on bottom panel
ax = axs[-1]
ax.set_xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_with_categorical.png


<a href="#pred_top" id='pred_model_7'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR regularized

As [previous model](#pred_model_6), but use regularization by using built-in Lasso

## Prepare input

In [32]:
from sklearn.model_selection import GridSearchCV

model_name = 'MLR Lasso'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
features = num_columns + cat_columns_reduced
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
X = yX.iloc[:,1:]
y = yX.iloc[:,0]
print(X.shape)
print(y.shape)


(8524, 29)
(8524,)


## Determine regularization rate (alpha)

Alpha is the hyperparameter that needs to be determined. For this the data needs to be splitted, but the dataset is too small to do a 3 way split (i.e. CV, Train, Test). Therefor spilt 2 way k-fold cv 
- **Test**: Hold-out set for calculating performance
- **Train**: Use to fit model and do CV


In [33]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(5966, 29)
(2558, 29)


In [34]:
# Create model (same as MLR with cats, but regressor is Lasso)

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns)),
    ('categorical', cat_transformer, pd.Index(cat_columns_reduced)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=False)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.Lasso(random_state=42, max_iter=2**11)
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y
)



In [35]:
def gs_lasso_alpha(pipeline):
    # grid search estimator
    grid_search_alpha = GridSearchCV(
        estimator=pipeline,
        param_grid=[
            {
                'regressor__lasso__alpha': 10**(np.linspace(-5,-2,13)) # Choose alphas such that a clear peaked graph is shown in next plot
            } 
        ],
        cv=8,
        scoring='r2',
        #n_jobs=-1,
        verbose=10
    )

    # Perform grid search
    grid_search_alpha.fit(X_train,y_train);
    
    return grid_search_alpha

In [36]:
def plot_gscv_result(gscv):
    '''
    plot search results    
    '''
    plt.figure(figsize=[2,2])

    # abscissa
    alphas = list(gscv.cv_results_['param_regressor__lasso__alpha'])

    # plot mean
    r2_mean = gscv.cv_results_['mean_test_score']
    # normalize
    r2_mean = (r2_mean-r2_mean.mean())/r2_mean.std()
    plt.plot(alphas, r2_mean, label='mean', lw=4, color='blue')

    # plot folds
    for fold in range(gscv.cv):
        r2_fold = gscv.cv_results_['split{:.0f}_test_score'.format(fold)]
        # normalize
        r2_fold = (r2_fold-r2_fold.mean())/r2_fold.std()
        plt.plot(alphas, r2_fold, label='fold ' + str(fold), lw=1, color='black')

    plt.xscale('log')
    plt.xlabel('alpha')
    plt.ylabel('standardized r2 score [a.u.]')
    plt.axvline(gscv.best_params_['regressor__lasso__alpha'], linewidth=2, linestyle='--', color='k')
    result = 'grid search results\nbest alpha={:.5f}'.format(gscv.best_params_['regressor__lasso__alpha'])
    plt.title(result)
    print(result)
    plt.legend(ncol=1, loc='center left', bbox_to_anchor=(1,0.5))
    
    return gscv.best_estimator_



### Fit with regressor found with grid search

In [37]:
best_estimator = plot_gscv_result(gs_lasso_alpha(pl))

Fitting 8 folds for each of 13 candidates, totalling 104 fits
[CV 1/8; 1/13] START regressor__lasso__alpha=1e-05..............................
[CV 1/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.745 total time=   1.5s
[CV 2/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 2/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.715 total time=   3.1s
[CV 3/8; 1/13] START regressor__lasso__alpha=1e-05..............................
[CV 3/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.112 total time=   2.6s
[CV 4/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 4/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.663 total time=   3.2s
[CV 5/8; 1/13] START regressor__lasso__alpha=1e-05..............................
[CV 5/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.648 total time=   2.2s
[CV 6/8; 1/13] START regressor__lasso__alpha=1e-05..............................
[CV 6/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.749 total time=   2.6s
[CV 7/8; 1/13] START regressor__lasso__alpha=1e-05..............................
[CV 7/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.707 total time=   2.1s
[CV 8/8; 1/13] START regressor__lasso__alpha=1e-05..............................
[CV 8/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.316 total time=   2.1s
[CV 1/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............
[CV 1/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.758 total time=   1.0s
[CV 2/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............
[CV 2/

In [38]:
# Store estimator with best alpha
reg = best_estimator
models[model_name].update({'model':reg})

# fit
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [reg.regressor_.steps[-1][1].intercept_, *reg.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})
models[model_name].update({'n betas effective':(np.abs(betas) > 0).sum()})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

In [39]:
# update features, by adding fuels
cat_columns_reduced += ['fuel']
cats += [fuel_list]

# Split betas per category feature.
idx_start = len(num_columns)+1
cat_betas = list()
for cat in cats:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns)+1] # first
assert cat_betas[-1][-1] == betas[-1] # last

In [40]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_reduced) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,4*n_plots]
)
plt.subplots_adjust(hspace=0.5)

# Plot coefficients
for feats, coefs, name, ax in zip(
    [['offset'] + features] + cats,
    [[betas[0]] + betas[1:len(num_columns)+1]] + cat_betas,
    ['numerical'] + cat_columns_reduced,
    axs
):
    # sort by bar height
    x = [feats[i] for i in np.argsort(coefs)[::-1]]
    y = sorted(coefs, reverse=True)
    # plot bar
    ax.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

    # prettify
    if not name.startswith('num'):
        ax.set_yticks(np.arange(-0.5,+0.6,0.1))
        bot_tick, top_tick = ax.set_ylim(top=+0.5, bottom=-0.5)
    else:
        ax.set_yticks(np.arange(-2,2.2,0.5))
        bot_tick, top_tick = ax.set_ylim(top=+2, bottom=-2)
        # stats
        xy=[ax.get_xlim()[1], ax.get_ylim()[1]]
        ax.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
            models[model_name]['R^2'],
            models[model_name]['cv R^2'].shape[0],
            np.mean(models[model_name]['cv R^2']),
            np.std(models[model_name]['cv R^2']),
        ) + '\n' +
                 'parameters total n={}, not zero n={}\n'.format(len(betas), sum(np.array(betas) != 0)) +
                 'train (n = {})'.format(y_train.shape[0]) + '\n' +
                 'test (n = {}, $R^2$ = {:.2f})'.format(
                     y_test.shape[0],
                     models[model_name]['test R^2'],
                 ), style='italic', va='top', ha='left')


    # plot sign switch
    x_sign_switch1 = np.nonzero(np.array(y+[-np.inf]) < 0)[0][0]
    x_sign_switch2 = np.nonzero(np.array([+np.inf]+y) > 0)[0][-1]
    ax.axvline(x_sign_switch1-0.5, linewidth=2, linestyle='--', color='k')
    ax.axvline(x_sign_switch2-0.5, linewidth=2, linestyle='--', color='k')
    ax.axhline(0, linewidth=2, linestyle='-', color='k')

    # add values when bar is small or too large (clipping)
    yt=ax.get_yticks()
    first_tick = sorted(np.abs(yt))[1]
    for x_val, coef in zip(x,y):
        if (coef < first_tick) & (coef > 0):
            ax.text(x_val, coef, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif (coef > -first_tick) & (coef < 0):
            ax.text(x_val, 0, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef > top_tick:
            # generally this is offset (bias)
            ax.text(x_val, top_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef < bot_tick:
            ax.text(x_val, bot_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')

    
    # labels and titles
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticks(x)
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    if not name.startswith('num'):
        ax.set_title('Categorical feature: ' + name, style='italic')
    else:
        ax.set_title('Multiple linear regression (Lasso, alpha={:g})\nNumerical features'.format(
            reg.regressor_.named_steps['lasso'].alpha
        ), style='italic') 
    ax.set_ylabel('Coefficient (a.u.)', style='italic')
    
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 20:
        add_space = len(x) - 20
        xl = list(ax.get_xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        ax.set_xlim(xl)

# Label on bottom panel
ax = axs[-1]
ax.set_xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_Lasso.png


- - - - - 

In [41]:
# Display prediction errors

x_sample = df.dropna(subset=['price']).iloc[:,1:]
y_sample = df.dropna(subset=['price']).iloc[:,0]
y_sample_pred = models[model_name]['model'].predict(x_sample) 

x_sample['price'] = y_sample
x_sample['prediction_error'] = y_sample_pred - y_sample
x_sample['prediction_error_fraction'] = y_sample_pred/y_sample
x_sample['prediction_error_log'] = np.log10(x_sample.prediction_error_fraction)
x_sample['prediction_error_abslog'] = np.abs(np.log10(x_sample.prediction_error_fraction))
x_sample['prediction'] = y_sample_pred
x_sample['age_y'] = x_sample.age/365

# Note some are close to perfect, because they are in training set and are unique in brand etc
print('best predictons')
display(x_sample.sort_values(by='prediction_error_abslog').head(16).T)
print('worst predictions')
display(x_sample.sort_values(by='prediction_error_abslog').tail(16).T)
print('largest underestimate')
display(x_sample.sort_values(by='prediction_error').head(16).T)
print('largest overestimate')
display(x_sample.sort_values(by='prediction_error').tail(16).T)
print('worst prediction recent auction')
is_last_auction = x_sample.index.str.startswith('-'.join(x_sample.index[-1].split('-')[:2]))
display(x_sample[is_last_auction].sort_values(by='prediction_error_abslog').tail(8).T)

plt.figure(figsize=[8,8])
plt.plot(x_sample.age_y, x_sample.prediction_error_log, color='k', marker='s', markeredgecolor = (0, 0, 0, 0), markerfacecolor = (0, 0, 0, 1), linestyle='None', ms=4)
plt.axhline(0, lw=2, linestyle='--', color ='k')
plt.xlabel('age [years]')
plt.ylabel('prediction error [log of fraction]\n(positive: prediction overestimates)')
plt.show()

best predictons


Unnamed: 0,2022-02-714402,2022-08-802828,2018-12-7135,2019-9-2613,2021-03-8165,2018-10-7107,2018-2-2014,2017-10-7157,2020-8-7115,2018-3-2614,2021-06-704306,2017-6-7130,2022-08-802428,2020-10-7217,2017-10-7128,2017-11-8119
brand,OPEL,PEUGEOT,RENAULT,VOLKSWAGEN,MINI,OPEL,MINI,VOLKSWAGEN,RENAULT,AUDI,MERCEDES-BENZ,PEUGEOT,AUDI,PEUGEOT,VOLKSWAGEN,TOYOTA
model,astra,5008,clio,golf,cooper s,zafira-a,one,golf,megane,a1,s 600,206,a4 avant,207,golf,yaris
age,1952.0,4415.0,4755.0,2704.0,5359.0,6067.0,3592.0,2784.0,5434.0,2066.0,7552.0,6156.0,3471.0,3548.0,4448.0,6626.0
fuel,diesel,benzine,benzine,benzine,benzine,benzine,benzine,diesel,benzine/lpg/g3 gasinstallatie,benzine,benzine,benzine,diesel,diesel,diesel,benzine
odometer,178830.0,239150.0,,69198.0,152160.0,241350.0,193297.0,209232.0,274061.0,77238.0,272267.0,157475.0,260734.0,180126.0,224781.0,169784.0
days_since_inspection_invalid,-311.0,19.0,-179.0,106.0,,795.0,-77.0,958.0,125.0,-125.0,201.0,-569.0,-394.0,,-68.0,-127.0
age_at_import,442.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,491.0,0.0,0.0,262.0,,0.0,0.0
body_type,hatchback,mpv,hatchback,cabriolet,,stationwagen,hatchback,stationwagen,cabriolet,stationwagen,sedan,hatchback,stationwagen,,hatchback,hatchback
displacement,1598.0,1598.0,1598.0,1197.0,,1796.0,1397.0,1598.0,1998.0,1390.0,5786.0,1360.0,1968.0,,1896.0,998.0
number_of_cylinders,4.0,4.0,4.0,4.0,,4.0,4.0,4.0,4.0,4.0,12.0,4.0,4.0,,4.0,4.0


worst predictions


Unnamed: 0,2021-07-808017,2022-12-260032,2017-3-2007,2015-03-2402,2019-4-2021,2021-08-260218,2021-06-260316,2017-3-2409,2021-06-260216,2014-12-2207,2017-5-2216,2021-03-2606,2018-1-2412,2017-3-2000,2021-05-2201,2018-7-2411
brand,HYUNDAI,CHEVROLET,VOLKSWAGEN,FORD,VOLKSWAGEN,LANCIA,FORD,VOLKSWAGEN,MERCEDES-BENZ,FORD,ALFA ROMEO,JAGUAR,VOLKSWAGEN,ALFA ROMEO,MERCEDES-BENZ,MERCEDES-BENZ
model,trajet,impala sport coupe,karmann ghia,thunderbird,111011,137as0,mustang,T2,w110/190d,thunderbird,2000 gtv,e-type,t1,2000 gtv,w100 600 pullman,sl230
age,5459.0,20000.0,18507.0,21063.0,19787.0,16011.0,17503.0,15873.0,21156.0,20973.0,16257.0,18507.0,19813.0,16196.0,20451.0,19848.0
fuel,diesel,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine
odometer,152702.0,73172.043648,59227.0,86207.0,91157.0,32474.0,105840.117504,46642.0,94097.0,86207.0,23982.0,20623.0,96563.858688,23982.0,,68722.0
days_since_inspection_invalid,447.0,,,,1099.0,,232.0,,,,-739.0,,,-800.0,,
age_at_import,0.0,19118.0,,12826.0,16180.0,,13880.0,,,12826.0,0.0,,,0.0,,
body_type,mpv,coupe,,,sedan,,coupe,,,,coupe,,,coupe,,
displacement,1991.0,5358.0,,0.0,1192.0,,4948.0,,,0.0,,,,,,
number_of_cylinders,4.0,8.0,,8.0,4.0,,8.0,,,8.0,4.0,,,4.0,,


largest underestimate


Unnamed: 0,2022-09-265029,2022-05-260625,2018-8-2400,2015-02-2200,2015-01-2414,2019-4-2411,2021-05-8126,2019-11-2418,2021-12-260012,2018-6-2410,2017-5-2406,2022-05-260925,2018-8-2410,2021-08-702908,2017-3-2000,2018-7-2411
brand,LAMBORGHINI,FERRARI,ROLLS ROYCE,ASTON-MARTIN,SKODA,MERCEDES-BENZ,MERCEDES-BENZ,PORSCHE,LAMBORGHINI,MERCEDES-BENZ,MERCEDES-BENZ,LAND ROVER,ASTON-MARTIN,MERCEDES-BENZ,ALFA ROMEO,MERCEDES-BENZ
model,urus,430 scuderia,phantom drophead coupe,vanguish volante,octavia,amg s63 cabriolet,v-klasse,panamera turbo s e-hybrid,diablo sv 132 se,S65 AMG,S600 Maybach,range rover 3.0 lwb autobiogra,dbs,amg e 43 4matic,2000 gtv,sl230
age,667.0,4868.0,3136.0,278.0,315.0,636.0,1506.0,431.0,8546.0,844.0,810.0,845.0,2665.0,1599.0,16196.0,19848.0
fuel,benzine,benzine,benzine,benzine,benzine,benzine,diesel,,benzine,benzine,benzine,benzine,benzine,benzine,benzine,benzine
odometer,17340.0,11077.0,11305.0,4778.0,6796.0,13.0,121324.0,6925.0,40090.0,6379.0,19173.0,183.0,58429.0,71428.0,23982.0,68722.0
days_since_inspection_invalid,,,,,-1146.0,,-320.0,,1949.0,,-651.0,,,-593.0,-800.0,
age_at_import,,,,,0.0,,1067.0,,6164.0,,0.0,,,523.0,0.0,
body_type,,,,,sedan,,mpv,,coupe,,sedan,,,sedan,coupe,
displacement,,,,,1798.0,,2143.0,,5707.0,,5980.0,,,2996.0,,
number_of_cylinders,,,,,4.0,,4.0,,12.0,,12.0,,,6.0,4.0,


largest overestimate


Unnamed: 0,2019-6-2403,2022-11-710111,2021-04-2205,2018-1-2411,2021-06-702006,2022-03-220003,2017-4-2404,2022-11-705411,2021-04-2208,2021-03-2610,2017-3-2405,2021-03-2206,2017-3-2400,2020-3-2406,2019-5-2405,2018-11-2401
brand,MERCEDES-BENZ,LAND ROVER,MERCEDES-BENZ,BMW,PORSCHE,FERRARI,LAMBORGHINI,BMW,FERRARI,LAND ROVER,MERCEDES-BENZ,VOLVO,ASTON-MARTIN,FERRARI,LAMBORGHINI,BENTLEY
model,amg c 63 s,range rover sport,gle 350 d 4matic,x5 m50d,macan s,458,gallardo,x5 m,California F149,range rover sport,amg gle 63 s,xc90 t8 twin engine,rapide s,599,gallardo,continental gtc
age,904.0,3009.0,1718.0,1761.0,783.0,4347.0,4797.0,2569.0,3714.0,1851.0,484.0,579.0,673.0,3611.0,3898.0,4200.0
fuel,benzine,diesel,diesel,diesel,benzine,benzine,benzine,benzine,benzine,benzine,benzine,elektriciteit/benzine,benzine,benzine,benzine,benzine
odometer,30129.0,87443.0,55224.0,36340.0,25070.0,31345.0,35410.0,73626.0,11457.0,110979.0,20757.0,7071.0,14415.0,11974.0,25128.0,27184.0
days_since_inspection_invalid,-557.0,476.0,244.0,-102.0,-678.0,985.0,-483.0,-432.0,-28.0,-121.0,-977.0,-882.0,-788.0,2150.0,-287.0,-21.0
age_at_import,544.0,2158.0,1067.0,1349.0,0.0,2997.0,0.0,2270.0,3301.0,1246.0,135.0,0.0,297.0,3619.0,3102.0,2028.0
body_type,coupe,stationwagen,stationwagen,stationwagen,stationwagen,coupe,coupe,stationwagen,cabriolet,stationwagen,stationwagen,mpv,hatchback,coupe,cabriolet,cabriolet
displacement,3982.0,2993.0,2987.0,2993.0,2995.0,4497.0,4961.0,4395.0,4297.0,4999.0,5461.0,1969.0,5935.0,5999.0,4961.0,5998.0
number_of_cylinders,8.0,6.0,6.0,6.0,6.0,8.0,10.0,8.0,8.0,8.0,8.0,4.0,12.0,12.0,10.0,12.0


worst prediction recent auction


Unnamed: 0,2023-01-708801,2023-01-706001,2023-01-703201,2023-01-701701,2023-01-701501,2023-01-708601,2023-01-702201,2023-01-701101
brand,RENAULT,MERCEDES-BENZ,BMW,VOLKSWAGEN,FIAT,CHEVROLET,MERCEDES-BENZ,FIAT
model,clio,a 170 cdi,x reihe,polo,500,matiz,c 30 cdi amg,500
age,6080.0,7292.0,6505.0,4290.0,5147.0,5876.0,7230.0,5100.0
fuel,benzine,diesel,benzine,benzine,benzine,benzine,diesel,benzine
odometer,208654.0,373133.0,,282983.0,148515.0,,471431.0,197731.0
days_since_inspection_invalid,-115.0,317.0,1842.0,-93.0,-32.0,136.0,-101.0,-13.0
age_at_import,0.0,0.0,0.0,0.0,3548.0,0.0,1198.0,0.0
body_type,hatchback,stationwagen,stationwagen,hatchback,hatchback,hatchback,stationwagen,hatchback
displacement,1390.0,1689.0,4799.0,1197.0,1242.0,796.0,2950.0,1242.0
number_of_cylinders,4.0,4.0,8.0,4.0,4.0,3.0,5.0,4.0


<a href="#pred_top" id='pred_model_8'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Model: MLR regularized with added feature

As previous, but with odometer and age combine into one feature

In [42]:
# check to see if combining features would improve model
yX = df.dropna(subset=['price']).copy()
yX.loc[:,'> usage_intensity <'] = (yX.odometer / yX.age)
yX.loc[:,'> is_classic <'] = yX.age > 25
print(yX.corr().price.sort_values())
print('\n"usage_intensity" does not seem to correlate better than "age" and "odometer" seperately')

odometer                        -0.439744
age                             -0.345985
private_owners                  -0.274213
days_since_inspection_invalid   -0.114322
> usage_intensity <             -0.070299
number_of_seats                 -0.027247
> is_classic <                   0.018917
age_at_import                    0.047455
height                           0.065005
company_owners                   0.113851
number_of_doors                  0.171647
converted_energy_label           0.235920
length                           0.306357
weight                           0.357246
number_of_cylinders              0.358855
registration_tax                 0.374142
displacement                     0.377018
width                            0.435490
top_speed                        0.502988
number_of_gears                  0.576110
power                            0.609081
sale_price                       0.688471
price                            1.000000
Name: price, dtype: float64

"usag

## Prepare input

In [43]:
from sklearn.model_selection import GridSearchCV

model_name = 'MLR added features'

cat_columns_reduced = list(np.setdiff1d(cat_columns, ['model', 'fuel']))
# Can be reduced here

# list of lists with categories. Needed for column transformer
cats = list(df[cat_columns_reduced].apply(lambda x:pd.Series(x.unique()).dropna().tolist() + ['missing'], axis='index'))

# Use data frame not array
yX = df.dropna(subset=['price'])
X = yX.iloc[:,1:]
y = yX.iloc[:,0]

# Add features
X.loc[:,'usage_intensity'] = X.odometer / X.age
X.loc[:,'classic'] = X.age > 25*365
X.loc[:,'classic'].replace({True:'y', False:'n'}, inplace=True)
#X.loc[X.age.isna(),'classic'] = 'missing' # is done in imputer

print(X.shape)
print(y.shape)

(8524, 31)
(8524,)


In [44]:
cat_columns_expanded = cat_columns_reduced + ['classic']
cats_added = cats + [['y', 'n', 'missing']]
num_columns_expanded = num_columns + ['usage_intensity']
# num_columns_expanded.remove('age')
num_columns_expanded.remove('odometer')
features = num_columns_expanded + cat_columns_expanded


In [45]:
# instantiate a dict in models at key with name of this model
models[model_name] = dict()

# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)


(5966, 31)
(2558, 31)


In [46]:
# Create model (same as MLR Lasso)

# Preprocessor: numerical features
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    MinMaxScaler(),
)
# Preprocessor: categorical features
cat_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing', missing_values=np.NaN),
    OneHotEncoder(categories=cats_added),
)

# Preprocess: fuels
# list of all fuels is passed by using full data set! (X)
fuel_list = list(get_unique_fuels(merge_lpg_and_lpgtype(X.fuel)))
get_fuel_dummies = DummyfyFuel(fuel_list)


# Combine num and cat
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, pd.Index(num_columns_expanded)),
    ('categorical', cat_transformer, pd.Index(cat_columns_expanded)),
    ('onehot_fuel', get_fuel_dummies, 'fuel')
], verbose=False)

# full pipeline with preproc and mlr
mlr = make_pipeline(
    preprocessor,
    linear_model.Lasso(random_state=42)
)

# Target transformation: log transform price
pl = TransformedTargetRegressor(
    regressor=mlr,
    func=np.log10,
    inverse_func=lambda y: 10**y
)



In [47]:
# # use alpha from previous model
# alpha = models['MLR Lasso']['model'].regressor['lasso'].get_params()['alpha']
# pl.regressor['lasso'].set_params(alpha = alpha)

best_estimator = plot_gscv_result(gs_lasso_alpha(pl))



Fitting 8 folds for each of 13 candidates, totalling 104 fits
[CV 1/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 1/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.583 total time=   1.5s
[CV 2/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 2/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.714 total time=   1.5s
[CV 3/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 3/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.460 total time=   1.5s
[CV 4/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 4/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.677 total time=   1.5s
[CV 5/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 5/8; 1/13] END regressor__lasso__alpha=1e-05;, score=-0.247 total time=   1.5s
[CV 6/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 6/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.734 total time=   1.5s
[CV 7/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 7/8; 1/13] END regressor__lasso__alpha=1e-05;, score=0.716 total time=   1.5s
[CV 8/8; 1/13] START regressor__lasso__alpha=1e-05..............................


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 8/8; 1/13] END regressor__lasso__alpha=1e-05;, score=-0.352 total time=   1.5s
[CV 1/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 1/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.595 total time=   1.5s
[CV 2/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............
[CV 2/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.715 total time=   1.3s
[CV 3/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 3/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.495 total time=   1.5s
[CV 4/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 4/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.682 total time=   1.5s
[CV 5/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 5/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=-0.120 total time=   1.5s
[CV 6/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 6/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.736 total time=   1.5s
[CV 7/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 7/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=0.715 total time=   1.5s
[CV 8/8; 2/13] START regressor__lasso__alpha=1.778279410038923e-05..............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 8/8; 2/13] END regressor__lasso__alpha=1.778279410038923e-05;, score=-0.309 total time=   1.5s
[CV 1/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............
[CV 1/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.610 total time=   1.4s
[CV 2/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............
[CV 2/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.716 total time=   0.9s
[CV 3/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 3/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.533 total time=   1.5s
[CV 4/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............
[CV 4/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.690 total time=   1.1s
[CV 5/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 5/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.062 total time=   1.4s
[CV 6/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 6/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.739 total time=   1.5s
[CV 7/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 7/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=0.714 total time=   1.5s
[CV 8/8; 3/13] START regressor__lasso__alpha=3.1622776601683795e-05.............


  model = cd_fast.sparse_enet_coordinate_descent(


[CV 8/8; 3/13] END regressor__lasso__alpha=3.1622776601683795e-05;, score=-0.237 total time=   1.5s
[CV 1/8; 4/13] START regressor__lasso__alpha=5.623413251903491e-05..............
[CV 1/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=0.627 total time=   1.4s
[CV 2/8; 4/13] START regressor__lasso__alpha=5.623413251903491e-05..............
[CV 2/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=0.718 total time=   0.7s
[CV 3/8; 4/13] START regressor__lasso__alpha=5.623413251903491e-05..............
[CV 3/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=0.556 total time=   1.1s
[CV 4/8; 4/13] START regressor__lasso__alpha=5.623413251903491e-05..............
[CV 4/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=0.703 total time=   0.8s
[CV 5/8; 4/13] START regressor__lasso__alpha=5.623413251903491e-05..............
[CV 5/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=0.296 total time=   1.1s
[CV 6

  model = cd_fast.sparse_enet_coordinate_descent(


[CV 7/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=0.712 total time=   1.4s
[CV 8/8; 4/13] START regressor__lasso__alpha=5.623413251903491e-05..............
[CV 8/8; 4/13] END regressor__lasso__alpha=5.623413251903491e-05;, score=-0.128 total time=   1.1s
[CV 1/8; 5/13] START regressor__lasso__alpha=0.0001.............................
[CV 1/8; 5/13] END regressor__lasso__alpha=0.0001;, score=0.641 total time=   0.6s
[CV 2/8; 5/13] START regressor__lasso__alpha=0.0001.............................
[CV 2/8; 5/13] END regressor__lasso__alpha=0.0001;, score=0.722 total time=   1.2s
[CV 3/8; 5/13] START regressor__lasso__alpha=0.0001.............................
[CV 3/8; 5/13] END regressor__lasso__alpha=0.0001;, score=0.537 total time=   0.8s
[CV 4/8; 5/13] START regressor__lasso__alpha=0.0001.............................
[CV 4/8; 5/13] END regressor__lasso__alpha=0.0001;, score=0.721 total time=   0.6s
[CV 5/8; 5/13] START regressor__lasso__alpha=0.0001...............

### Fit

In [48]:
# Store estimator
reg = best_estimator
models[model_name].update({'model':reg})

# fit
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

models[model_name].update({'n':y.shape[0]})
models[model_name].update({'n features':X.shape[1]})

# parameters
betas = [reg.regressor_.steps[-1][1].intercept_, *reg.regressor_.steps[-1][1].coef_]
models[model_name].update({'betas':betas})
models[model_name].update({'n betas effective':(np.abs(betas) > 0).sum()})

# score
models[model_name].update({'R^2':reg.score(X,y)})
models[model_name].update({'test R^2':reg.score(X_test,y_test)})
cv_results = cross_val_score(reg, X_test, y_test, cv=5)
models[model_name].update({'cv R^2':cv_results})

In [49]:
# update features, by adding fuels
cat_columns_expanded += ['fuel']
cats_added += [fuel_list]


In [50]:
# Split betas per category feature.
idx_start = len(num_columns_expanded)+1
cat_betas = list()
for cat in cats_added:
    cat_betas += [betas[idx_start:idx_start+len(cat)]]
    idx_start += len(cat)
# Check if all betas are stored
assert cat_betas[0][0] == betas[len(num_columns_expanded)+1] # first
assert cat_betas[-1][-1] == betas[-1] # last

In [51]:
# plot coefficients

# plot numerical and catagorical in different subplots
n_plots = len(cat_columns_expanded) + 1
fig,axs=plt.subplots(
    nrows=n_plots,
    figsize=[16,8*n_plots]
)
plt.subplots_adjust(hspace=0.5)

# Plot coefficients
for feats, coefs, name, ax in zip(
    [['offset'] + features] + cats_added,
    [[betas[0]] + betas[1:len(num_columns_expanded)+1]] + cat_betas,
    ['numerical'] + cat_columns_expanded,
    axs
):
    # sort by bar height
    x = [feats[i] for i in np.argsort(coefs)[::-1]]
    y = sorted(coefs, reverse=True)
    # plot bar
    ax.bar(x=x, height=y, edgecolor='k', facecolor='None', clip_on=True)

    # prettify
    if not name.startswith('num'):
        ax.set_yticks(np.arange(-0.5,+0.6,0.1))
        bot_tick, top_tick = ax.set_ylim(top=+0.5, bottom=-0.5)
    else:
        ax.set_yticks(np.arange(-2,2.2,0.5))
        bot_tick, top_tick = ax.set_ylim(top=+2, bottom=-2)
        # stats
        xy=[ax.get_xlim()[1], ax.get_ylim()[1]]
        ax.text(xy[0]*1.05,xy[1], '$R^2$ = {:.2f}, $R^2_{{cv{:g}}}$ = {:.2f} (+/-{:.2f})'.format(
            models[model_name]['R^2'],
            models[model_name]['cv R^2'].shape[0],
            np.mean(models[model_name]['cv R^2']),
            np.std(models[model_name]['cv R^2']),
        ) + '\n' +
                 'parameters total n={}, not zero n={}\n'.format(len(betas), sum(np.array(betas) != 0)) +
                 'train (n = {})'.format(y_train.shape[0]) + '\n' +
                 'test (n = {}, $R^2$ = {:.2f})'.format(
                     y_test.shape[0],
                     models[model_name]['test R^2'],
                 ), style='italic', va='top', ha='left')


    # plot sign switch
    x_sign_switch1 = np.nonzero(np.array(y+[-np.inf]) < 0)[0][0]
    x_sign_switch2 = np.nonzero(np.array([+np.inf]+y) > 0)[0][-1]
    ax.axvline(x_sign_switch1-0.5, linewidth=2, linestyle='--', color='k')
    ax.axvline(x_sign_switch2-0.5, linewidth=2, linestyle='--', color='k')
    ax.axhline(0, linewidth=2, linestyle='-', color='k')

    # add values when bar is small or too large (clipping)
    yt=ax.get_yticks()
    first_tick = sorted(np.abs(yt))[1]
    for x_val, coef in zip(x,y):
        if (coef < first_tick) & (coef > 0):
            ax.text(x_val, coef, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif (coef > -first_tick) & (coef < 0):
            ax.text(x_val, 0, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef > top_tick:
            # generally this is offset (bias)
            ax.text(x_val, top_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')
        elif coef < bot_tick:
            ax.text(x_val, bot_tick, '{:+.3g}'.format(coef), rotation=45, va='bottom', ha='left')

    
    # labels and titles
    rot = 45
    fsz = 10
    ha = 'right'
    ax.set_xticks(x)
    ax.set_xticklabels(labels=x, rotation=rot, va='top', ha=ha, style='italic', fontsize=fsz)
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    if not name.startswith('num'):
        ax.set_title('Categorical feature: ' + name, style='italic')
    else:
        ax.set_title('Multiple linear regression (Lasso, alpha={:g})\nNumerical features'.format(
            reg.regressor_.named_steps['lasso'].alpha
        ), style='italic') 
    ax.set_ylabel('Coefficient (a.u.)', style='italic')
    
    # add extra margin if bars are too wide (too little bars)
    if len(x) < 10:
        add_space = len(x) - 10
        xl = list(ax.get_xlim())
        xl[1] -= add_space/2
        xl[0] += add_space/2
        ax.set_xlim(xl)

# Label on bottom panel
ax = axs[-1]
ax.set_xlabel('Sorted features', style='italic')

# Save
file_name = '../results/{}.png'.format(model_name.replace(' ','_'))
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/MLR_added_features.png


<a href="#pred_top" id='pred_accuracies'><font size=+1><center>^^ TOP ^^</center></font></a>

---

## Model accuracies

In [52]:
# plot R^2

# counter for x-offset
c=0

# figure
fig = plt.figure(figsize=[4,2])
ax = fig.gca()
xs = ys = fs = [None]

# loop over all models
for name,res in models.items():

    c+=1 # x-offset

    if name == 'linear regression no cv':
        # No cv, so only one value. Make it a list of one for type consistency
        k = 'R^2'
        rsq = [res[k]]
    
    else: 
        k = 'cv R^2'
        rsq = res[k]
        
    if 'n betas effective' in res:
        ndf = res['n betas effective']
    else:
        ndf = len(res['betas'])
        
    # add r-squares and offset to vectors
    ys = np.concatenate([ys, rsq])
    xs = np.concatenate([xs, np.ones_like(rsq) * c])
    fs = np.concatenate([fs, [ndf]])

# actual plotting
sns.swarmplot(x=xs, y=ys, ax=ax, hue=None)
ax.bar(range(1,len(models)+1), [res['R^2'] for res in models.values()], width=0.8, fc='none')
for x,ndf in enumerate(fs):
    if ndf is None:
        continue
    if x == 1:
        s = f'd.f.: {ndf:.0f}'
    else:
        s = f'{ndf:.0f}'
    ax.text(x, 1, s, ha='center')
# prettify
ax.set_xticks(range(1,len(models)+1))
ax.set_xticklabels(labels=list(models.keys()), rotation=45, va='top', ha='right', style='italic')
ax.set_ylim(bottom=0, top=+1)
ax.set_title('Model performance\n', style='italic')
ax.set_ylabel('Coefficient of determination\n($R^2$)', style='italic')
ax.xaxis.set_tick_params(which='minor', bottom=False)

# save
file_name = '../results/model-performance.png'
if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
    print(file_name)
    with plt.style.context('../assets/context-paper.mplstyle'):
        plt.savefig(file_name, bbox_inches='tight', transparent=False)
else:
    plt.show()
    print(f'Skip. {file_name} exists or saving is disabled in settings.')

../results/model-performance.png


In [53]:
# plot data

# loop over all models
for model_name in models.keys():
    print(model_name)
    res = models[model_name]
    
    # all original data
    yX = df.loc[:,['price', 'age']].dropna()
    X = yX.iloc[:,1]
    y = yX.iloc[:,0]
    
    features = num_columns.copy()
    
    # model specific adjustments
    if (model_name == 'linear regression log price') or (model_name == 'linear regression log price young'):
        # log price is used
        y = np.log10(y)
        # unit
        unit = '(log[EUR])'
    elif (model_name == 'MLR reduced observations') or (model_name == 'MLR impute median'):
        yX = df.dropna(subset=['price'] + features).loc[:,['price'] + features]
        X = yX.iloc[:,1:]
        y = np.log10(yX.iloc[:,0])
        unit = '(log[EUR])'
    elif (model_name == 'MLR with categorical') or (model_name == 'MLR Lasso'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]      
        unit = '(EUR)'
    elif (model_name == 'MLR added features'):
        yX = df.dropna(subset=['price']).copy()
        X = yX.iloc[:,1:]
        y = yX.iloc[:,0]      
        unit = '(EUR)'
        X.loc[:,'usage_intensity'] = X.odometer / X.age
        X.loc[:,'classic'] = X.age > 25*365
        X.loc[:,'classic'].replace({True:'y', False:'n'}, inplace=True)

    else:
        unit = '(EUR)'
    
    if X.ndim != 1:
        n_feat = X.shape[1]
    else:
        n_feat = 1
        
    if not ((model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features')):
        # needed for .predict
        X = np.array(X).reshape(-1,n_feat)
        y = np.array(y).reshape(-1,1)
    
    # predict all data
    y_pred = res['model'].predict(X)
    if max(y) < 10:
        rmse = np.sqrt(np.mean(((10**y)-(10**y_pred))**2))
    else:
        rmse = np.sqrt(np.mean((y-y_pred)**2))
    print(rmse)

    # actual plotting
    fig,axs = plt.subplots(nrows=2, ncols=1, figsize=[4,4])
    
    # data
    axs[0].plot(y, y_pred, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4,)
    # error
    axs[1].plot(y, y_pred-y, marker='s', markeredgecolor = (0, 0, 0, 1), markerfacecolor = (1, 1, 1, .5), linestyle='None', ms=4,)
    
    # axis equal for top
    if (model_name == 'MLR with categorical') or (model_name == 'MLR Lasso') or (model_name == 'MLR added features'):
        axs[0].set_xscale('log')
        axs[0].set_yscale('log')
        axs[1].set_xscale('log')
    axs[0].set_aspect(1)
    # store limits
    yl = axs[0].get_ylim()
    xl_top = axs[0].get_xlim()
    xl_bot = axs[1].get_xlim()
    xl = [np.max([xl_top[0], xl_bot[0]]), np.min([xl_top[1], xl_bot[1]])]
    # plot unity line and 0 error
    unity_line = [np.max([xl[0], yl[0]]), np.min([xl[1], yl[1]])]
    axs[0].plot(unity_line, unity_line, '-k', linewidth=2)
    axs[1].plot(xl, [0, 0], '-k', linewidth=2)
    # reset limits
    axs[0].set_xlim(xl)
    axs[1].set_xlim(xl)

    # make equal size panels
    # Note: sharex did not work
    bb=axs[0].get_position(False)
    rect_top = bb.bounds
    bb=axs[1].get_position(False)
    rect_bot = bb.bounds
    rect = list(rect_bot)
    rect[0] = rect_top[0]
    rect[2] = rect_top[2]
    axs[1].set_position(rect)
    
    # labeling
    fig.suptitle('{}\nrmse: EUR {:.0f}'.format(model_name,rmse), style='italic')
    axs[1].set_xlabel('Real price ' + unit, style='italic')
    axs[0].set_ylabel('Predicted price\n' + unit, style='italic')
    axs[1].set_ylabel('Prediction error\n' + unit, style='italic')
    
    # save
    file_name = '../results/{}-accuracy.png'.format(model_name.replace(' ','_'))
    if (SKIPSAVE==False): #and (not(os.path.isfile(file_name))):
        print(file_name)
        with plt.style.context('../assets/context-paper.mplstyle'):
            plt.savefig(file_name, bbox_inches='tight', transparent=False)
    else:
        plt.show()
        print(f'Skip. {file_name} exists or saving is disabled in settings.')

linear regression no cv
10168.977008833397
../results/linear_regression_no_cv-accuracy.png
linear regression log price
10414.001147971177
../results/linear_regression_log_price-accuracy.png
linear regression log price young
9448.518596736243
../results/linear_regression_log_price_young-accuracy.png
MLR reduced observations
6789.418087332828
../results/MLR_reduced_observations-accuracy.png
MLR impute median
7397.903139461647
../results/MLR_impute_median-accuracy.png
MLR with categorical
6565.710356522046
../results/MLR_with_categorical-accuracy.png
MLR Lasso
6201.662825922467
../results/MLR_Lasso-accuracy.png
MLR added features
6466.485315737244
../results/MLR_added_features-accuracy.png


In [54]:
assert False, 'stop running, below is sandboxing and testing'

AssertionError: stop running, below is sandboxing and testing

<a href="#pred_top" id='pred_save_model'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Save model as pickle
Save the best model as a .pkl file.

See also: https://scikit-learn.org/stable/modules/model_persistence.html

In [None]:
# import dill # dill acts as pickle but handles lambda functions
model_name = 'MLR Lasso' 
model = models[model_name]

In [None]:
model['name'] = model_name
fn = '../results/trained_model_{}.pkl'.format(model_name.replace(' ', '_').lower())
print(fn)
# with open(fn, 'wb') as file:
#     dill.dump(model, file)

<a href="#pred_top" id='pred_predict'><font size=+1><center>^^ TOP ^^</center></font></a>

---

# Example predictions

In [None]:
# Predict some known cars
B = pd.DataFrame(columns=X.columns, index=['Mine'])
B.loc['Mine', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['CITROËN', 'berlingo', 'benzine', 'mpv', 'Gray']
B.loc['Mine', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1600, 4, 5, 5, 'n', 5]
B.loc['Mine', ['top_speed']] = [170]
B.loc['Mine', 'age'] = (pd.Timestamp.now() - pd.to_datetime('2005-12-1')).days
B.loc['Mine', 'days_since_inspection_invalid'] = (pd.Timestamp.now() - pd.to_datetime('2022-6-11')).days
B.loc['Mine', 'age_at_import'] = 0
B.loc['Mine', 'odometer'] = 160000
B.loc['Mine', ['weight']] = [1326]

B.loc['Peer', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['CITROËN', 'ax', 'benzine', 'hatchback', 'Gray']
B.loc['Peer', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1100, 4, 5, 5, 'n', 5]
B.loc['Peer', ['top_speed']] = [170]
B.loc['Peer', 'age'] = (pd.Timestamp.now() - pd.to_datetime('1996-12-1')).days
# B.loc['Mine', 'days_since_inspection_invalid'] = (pd.Timestamp.now() - pd.to_datetime('2020-6-11')).days
B.loc['Peer', 'age_at_import'] = 0
B.loc['Peer', 'odometer'] = 160000
B.loc['Peer', ['weight']] = [800]

B.loc['a car', ['brand']] = [np.NaN]


B.loc['J-892-TZ', ['brand', 'model', 'fuel', 'body_type', 'color']] = ['SUZUKI', 'sx4', 'benzine', 'hatchback', 'Gray']
B.loc['J-892-TZ', ['displacement', 'number_of_cylinders', 'number_of_seats', 'number_of_doors', 'fwd', 'number_of_gears']] = [1586, 4, 5, 5, 'n', np.NaN]
B.loc['J-892-TZ', 'age'] = (pd.Timestamp.now() - pd.to_datetime('2010-11-11')).days
B.loc['J-892-TZ', 'days_since_inspection_invalid'] = (pd.Timestamp.now() - pd.to_datetime('2021-11-18')).days
B.loc['J-892-TZ', 'age_at_import'] = (pd.Timestamp.now() - pd.to_datetime('2020-11-19')).days
B.loc['J-892-TZ', 'odometer'] = 58153
B.loc['J-892-TZ', 'weight'] = 1230
B.loc['J-892-TZ', 'power'] = 118
B.loc['J-892-TZ', 'automatic_gearbox'] = 'y'
B.loc['J-892-TZ', 'private_owners'] = 1
B.loc['J-892-TZ', 'company_owners'] = 0
B.loc['J-892-TZ', 'sale_price'] = 19979
B.loc['J-892-TZ', 'registration_tax'] = 3936

B.loc['J-892-TZ-real'] = df.loc['2022-01-805121',:].drop(columns='Price')

B.T

In [None]:
df_ = pd.DataFrame(index=models.keys(), columns=B.index)
for model in df_.index[::-1]:
    try:
        print(f'{model}')
        B.loc[:,'predict'] = models[model]['model'].predict(B)
        pred = B.predict
    except: 
        pred = pd.Series(index=B.index, data=np.NaN)
    df_.loc[model,:] = pred
    
df_

In [None]:
ix = df[df.model == 'sx4'].index
ix = df.loc[
    (df.brand == 'VOLKSWAGEN') & 
    (df.fuel == 'benzine') & 
    (df.number_of_cylinders == 5)
    ,:].index
B2 = df.loc[ix,:].drop(columns='price')
display(B2)
models['MLR Lasso']['model'].predict(B2)


In [None]:
df.loc['2022-04-704304',:]

In [None]:
B = pd.read_pickle('/home/tom/bin/satdatsci/Saturday-Datascience/data/rdw-data-2021-02.pkl')
B.columns
#['brand', 'model', 'fuel', 'body_type', 'color']
B.loc[:, [
    'rdw_merk',
    'rdw_type',
    'rdw_brandstof_brandstof_omschrijving_1',
    'rdw_ovi_inrichting_code_omschrijving',
    'rdw_eerste_kleur',
     ]]

In [None]:
B.loc[:,(B == 'GRIJS').any()]

In [None]:
m = models['MLR Lasso']['model']
m.n_features_in_

In [None]:
p = m.get_params(deep=True)
t=p['regressor__steps'][0][1]
e = t.get_params(deep=True)

In [None]:
input_columns = []
for estimator in e['transformers']:
    idx = estimator[2]
    if isinstance(idx, str):
        input_columns += [idx]
    else:
        input_columns += list(idx)

In [None]:
len(input_columns)

In [None]:
[c for c in X.columns.to_list() if c not in input_columns]