# Model training

In this notebook I try different models, compare their performance, and optimize hyperparameters.  

List of models:
- LinearRegression
- Ridge (Linear model with L2 regularization)
- Linear SVR
- Random Forest
- CatBoost

This is a regression task: we try to predict the amount of money that a customer is willing to pay for a car.  

I use the RMSE metric for evaluation.
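
For reference, RMSE is the square root of the mean squared difference between the true values and the predictions, so it is expressed in the same units as the target (money). A minimal NumPy sketch; the helper name `rmse` and the numbers are illustrative and not part of the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only: errors of 3 and 4 give sqrt((9 + 16) / 2)
example = rmse([10.0, 20.0], [13.0, 16.0])
print(example)
```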

The training part of this notebook is converted to the script [2_train_model.py](../scripts/2_train_model.py).

In [1]:
import numpy as np
import pandas as pd
from pprint import pprint
import pickle

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

import bentoml

from vectorizers import DataPreparation, MyVectorizer

In [2]:
SEED=42

# Prepare data for training

In [3]:
df = pd.read_csv("../data/processed/main_data.csv")

In [4]:
df.head()

Unnamed: 0,gender,age,annual_salary,credit_card_debt,net_worth,car_purchase_amount,customer_email_suffix,region,population,area_sq_mi,pop_density_per_sq_mi,coastline_coastarea_ratio,gdp__per_capita,birthrate,deathrate
0,0,41.85172,62812.09301,11609.38091,238961.2505,35321.45877,.edu,EASTERN EUROPE,7385367,110910,66.6,0.32,7600.0,9.65,14.27
1,0,40.870623,66646.89292,9572.957136,530973.9078,45115.52566,.co.uk,LATIN AMER. & CARIB,287730,22966,12.5,1.68,4900.0,28.84,5.72
2,1,43.152897,53798.55112,11160.35506,638467.1773,42925.70921,.co.uk,NORTHERN AFRICA,32930091,2381740,13.8,0.04,6000.0,17.14,4.61
3,1,58.271369,79370.03798,14426.16485,548599.0524,67422.36313,.com,OCEANIA,21388,240,89.1,50.0,5000.0,21.0,0.0
4,1,57.313749,59729.1513,5358.712177,560304.0671,55915.46248,.net,LATIN AMER. & CARIB,188078227,8511965,22.1,0.09,7600.0,16.56,6.17


In [5]:
y = df.car_purchase_amount
df.drop(["car_purchase_amount"], axis=1, inplace=True)

In [6]:
df_train, df_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=SEED, shuffle=True)

In [7]:
df_train.head()

Unnamed: 0,gender,age,annual_salary,credit_card_debt,net_worth,customer_email_suffix,region,population,area_sq_mi,pop_density_per_sq_mi,coastline_coastarea_ratio,gdp__per_capita,birthrate,deathrate
249,0,36.117247,70334.42787,9823.218967,632600.4718,.ca,UNKNOWN,0,0,0.0,0.0,0.0,0.0,0.0
433,1,39.814817,56086.45033,6588.606462,228388.5491,.ca,UNKNOWN,0,0,0.0,0.0,0.0,0.0,0.0
19,0,58.741842,55368.23716,13272.94647,566022.1306,.ca,NEAR EAST,26783383,437072,61.3,0.01,1500.0,31.98,5.37
322,0,60.180119,58837.97088,12788.81573,622324.7499,.co.uk,EASTERN EUROPE,9396411,88361,106.3,0.0,2200.0,0.0,0.0
332,0,31.244838,41361.95045,10755.50842,615720.0425,.edu,UNKNOWN,0,0,0.0,0.0,0.0,0.0,0.0


> Separated y from df  
> Split the dataset into train & test (20% kept as holdout)


# Vectorize dataset

In [8]:
vectorizer = MyVectorizer()
X_train = vectorizer.fit_transform(df_train)
X_test = vectorizer.transform(df_test)
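
`MyVectorizer` is imported from the project's `vectorizers` module and its code is not shown in this notebook. The key point of the cell above is the `fit_transform` / `transform` split: statistics are learned from the training set only and then reused on the test set. The sketch below is a hypothetical stand-in (the class `TinyVectorizer` and its behaviour are assumptions, not the project's implementation) that standardizes numeric fields and one-hot encodes one categorical field:

```python
import numpy as np

class TinyVectorizer:
    """Hypothetical stand-in: scales numeric fields, one-hot encodes a categorical one."""

    def __init__(self, numeric, categorical):
        self.numeric = numeric          # list of numeric field names
        self.categorical = categorical  # name of a single categorical field

    def fit(self, rows):
        values = np.array([[r[n] for n in self.numeric] for r in rows], dtype=float)
        self.mean_ = values.mean(axis=0)
        std = values.std(axis=0)
        self.std_ = np.where(std == 0, 1.0, std)  # avoid division by zero
        self.categories_ = sorted({r[self.categorical] for r in rows})
        return self

    def transform(self, rows):
        values = np.array([[r[n] for n in self.numeric] for r in rows], dtype=float)
        scaled = (values - self.mean_) / self.std_
        # Categories unseen during fit become an all-zero one-hot row
        onehot = np.array([[float(r[self.categorical] == c) for c in self.categories_]
                           for r in rows])
        return np.hstack([scaled, onehot])

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)

train = [{"age": 30, "region": "OCEANIA"}, {"age": 50, "region": "NEAR EAST"}]
test = [{"age": 40, "region": "UNKNOWN"}]

vec = TinyVectorizer(numeric=["age"], categorical="region")
X_train_demo = vec.fit_transform(train)
X_test_demo = vec.transform(test)  # reuses the mean/std learned from train
```

Fitting on the training rows only keeps information from the holdout set out of the scaling statistics.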

# Let's try a dummy model
> It always predicts the mean amount

In [9]:
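class MeanPredictor(BaseEstimator):
    """Baseline that always predicts the mean of the training target."""

    def __init__(self):
        self._mean = 0

    def fit(self, X, y):
        self._mean = np.mean(y)
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, X):
        return np.repeat(self._mean, X.shape[0])

    def score(self, X, y):
        # Returns MSE (lower is better); score_model passes its own scoring, so this is not used there
        return mean_squared_error(y, self.predict(X))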

In [10]:
def score_model(model, X, y):
    # cross_val_score returns negative MSE per fold; negate it and take the square root to get RMSE
    result = list(np.round((-cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error'))**0.5, 5))
    print(f"Cross validation RMSE for {model}")
    print()
    pprint(result)
    print()
    print(f"Mean RMSE: {np.round(np.mean(result), 5)}")

In [11]:
score_model(MeanPredictor(), X_train, y_train)

Cross validation RMSE for MeanPredictor()

[11534.6994, 11003.47481, 11618.62646, 10089.47294, 9885.06097]

Mean RMSE: 10826.26692
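
A constant prediction equal to the mean is a useful baseline because its RMSE, on the data it was fitted on, equals the standard deviation of the target; any useful model has to beat that. A small NumPy check on made-up numbers (the array `y_demo` is illustrative, not taken from the dataset):

```python
import numpy as np

y_demo = np.array([30000.0, 40000.0, 50000.0, 60000.0])  # illustrative targets

mean_prediction = np.full_like(y_demo, y_demo.mean())
baseline_rmse = np.sqrt(np.mean((y_demo - mean_prediction) ** 2))

# The RMSE of the mean baseline equals the (population) standard deviation
print(baseline_rmse, y_demo.std())
```

In cross-validation the mean comes from the training folds rather than the validation fold, so the fold scores are close to, but not exactly, the fold's standard deviation.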


# Linear models

## LinearRegression

In [12]:
score_model(LinearRegression(), X_train, y_train)

Cross validation RMSE for LinearRegression()

[1.51511, 1.37404, 1.44615, 1.62358, 1.78798]

Mean RMSE: 1.54937


## RidgeCV (Linear + L2 regularization + CV for alpha)

In [13]:
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

In [14]:
model_cv = regr_cv.fit(X_train, y_train)

In [15]:
print(f"Param alpha = {model_cv.alpha_}")

Param alpha = 0.1


In [16]:
score_model(model_cv, X_train, y_train)

Cross validation RMSE for RidgeCV(alphas=[0.1, 1.0, 10.0])

[4.35292, 4.19551, 4.40486, 3.72861, 3.79454]

Mean RMSE: 4.09529


## Linear models have excellent performance.
> Yes, and that is a problem.  
> It suggests that our data is synthetic and that the target is a linear combination of the numeric features.
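
To illustrate why a linear model can fit such data almost perfectly: when the target is an exact linear combination of the features plus a little noise, ordinary least squares recovers the coefficients and the residual error is on the order of the noise. The sketch below uses synthetic data generated on the spot (the coefficients and noise level are made up, not the dataset's):

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([6000.0, 5000.0, 0.0])  # third feature is irrelevant
y_demo = 44000.0 + X_demo @ true_coef + rng.normal(scale=1.0, size=200)

# Least squares with an explicit intercept column
A = np.hstack([np.ones((200, 1)), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

residual_rmse = np.sqrt(np.mean((A @ coef - y_demo) ** 2))
print(np.round(coef, 1), round(float(residual_rmse), 3))
```

An RMSE of a few units on a target measured in tens of thousands is the same pattern seen in the cross-validation results above.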

In [17]:
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
pprint(list(np.round(model_linear.predict(X_train)[0:5],5)))
pprint(list(y_train.iloc[:5]))

[46133.83724, 29517.87529, 54828.59527, 59621.99384, 25252.00363]
[46135.27233, 29519.56184, 54827.52403, 59625.02618, 25252.93221]


# Second dummy model, linear combination of features

In [18]:
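class LinearPredictor(BaseEstimator):
    """Fixed linear formula over three vectorized features; nothing is learned."""

    def fit(self, X, y):
        return self  # no parameters to fit; return self per scikit-learn convention

    def predict(self, X):
        # Hard-coded intercept and weights for columns 0, 1 and 15 of the vectorized data
        y = 43946.09177638776 + X[:, 0] * 6764.2431783130405 + X[:, 1] * 6685.214307781013 + X[:, 15] * 5082.409111980831
        return y

    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))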

In [19]:
score_model(LinearPredictor(), X_train, y_train)

Cross validation RMSE for LinearPredictor()

[1.46616, 1.29778, 1.34564, 1.57134, 1.71436]

Mean RMSE: 1.47906


In [20]:
model_linear = LinearPredictor()
model_linear.fit(X_train, y_train)
# mean_squared_error returns MSE; no square root is taken here, so the RMSE is about 1.43
print(f"Test MSE: {mean_squared_error(y_test, model_linear.predict(X_test))}")

Test MSE: 2.0497900543250025


> So the best performance is given by a linear model, because of the linear dependency in the source data.  
> However, I will not use this model: for educational purposes, and to give my service better generalization ability, I will choose among the other models.


# Linear SVR

In [21]:
score_model(LinearSVR(), X_train, y_train)

Cross validation RMSE for LinearSVR()

[45242.06469, 44232.61896, 45465.55035, 45472.61648, 43546.78562]

Mean RMSE: 44791.92722


# Random Forest

In [22]:
for depth in [1,2,3,5,7,10]:
    print(f"DEPTH {depth}")
    score_model(RandomForestRegressor(n_estimators=200, max_depth=depth, random_state=SEED), X_train, y_train)

DEPTH 1
Cross validation RMSE for RandomForestRegressor(max_depth=1, n_estimators=200, random_state=42)

[9035.18447, 8114.38469, 9139.67589, 7514.8783, 7607.73515]

Mean RMSE: 8282.3717
DEPTH 2
Cross validation RMSE for RandomForestRegressor(max_depth=2, n_estimators=200, random_state=42)

[7210.48341, 6336.19179, 7431.46746, 5876.22276, 6736.31624]

Mean RMSE: 6718.13633
DEPTH 3
Cross validation RMSE for RandomForestRegressor(max_depth=3, n_estimators=200, random_state=42)

[5540.1352, 4812.57835, 6448.71115, 4472.06241, 5185.9053]

Mean RMSE: 5291.87848
DEPTH 5
Cross validation RMSE for RandomForestRegressor(max_depth=5, n_estimators=200, random_state=42)

[4034.6144, 3515.24238, 4914.26935, 2910.84405, 3710.08313]

Mean RMSE: 3817.01066
DEPTH 7
Cross validation RMSE for RandomForestRegressor(max_depth=7, n_estimators=200, random_state=42)

[3537.16047, 3127.25475, 4310.69216, 2498.00026, 3245.02482]

Mean RMSE: 3343.62649
DEPTH 10
Cross validation RMSE for RandomForestRegressor(max

# CatBoost

In [23]:
for depth in [1,2,3,5,7,10]:
    print(f"DEPTH {depth}")
    score_model(CatBoostRegressor(verbose=False, iterations=200, depth=depth, random_state=SEED), X_train, y_train)

DEPTH 1
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD75AD8E0>

[1808.50302, 1692.45359, 2431.94612, 1364.16829, 1655.12434]

Mean RMSE: 1790.43907
DEPTH 2
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD757E190>

[1548.81551, 1500.26176, 1761.86331, 1149.7257, 1459.35538]

Mean RMSE: 1484.00433
DEPTH 3
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD757E190>

[1409.05795, 1585.60115, 1655.40142, 877.8478, 1455.58039]

Mean RMSE: 1396.69774
DEPTH 5
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD75AD790>

[1799.26243, 1850.77835, 2016.86718, 1028.04793, 1711.00724]

Mean RMSE: 1681.19263
DEPTH 7
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD75AD940>

[2900.77422, 2670.43779, 3152.16527, 1948.3585, 2468.60227]

Mean RMSE: 2628.06761
DEPTH 10
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD75673

# Results

| Model                 | Params      | CV mean RMSE |
|-----------------------|-------------|--------------|
| MeanPredictor         |             | 10826        |
| LinearRegression      |             | 1.55         |
| RidgeCV               | alpha=0.1   | 4.10         |
| LinearPredictor       |             | 1.48         |
| LinearSVR             |             | 44791        |
| RandomForestRegressor | depth=10    | 3271         |
| **CatBoostRegressor** | **depth=3** | **1396**     |

> Except for linear models, the CatBoostRegressor model with a depth of 3 gave the best performance.  
> Let's take it as a basis and try to optimize other parameters.

## Optimizing CatBoostRegressor

In [24]:
%%time
model = CatBoostRegressor(verbose=False, random_seed=SEED)
parameters = {
    'depth': [2, 3],
    'learning_rate' : [0.01, 0.05, 0.1],
    'iterations'    : [30, 50, 100, 200]
}

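# No scoring is passed, so candidates are ranked by the estimator's own score method (R² for CatBoostRegressor)
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=5, n_jobs=-1)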
grid.fit(X_train, y_train)

Wall time: 16 s
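
For a sense of how much work the search does: `GridSearchCV` evaluates every combination in the grid on every fold. A quick count with the standard library, using the same grid as above:

```python
from itertools import product

parameters = {
    "depth": [2, 3],
    "learning_rate": [0.01, 0.05, 0.1],
    "iterations": [30, 50, 100, 200],
}
cv_folds = 5

# Every combination of the listed values is one candidate
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
n_fits = len(candidates) * cv_folds  # the final refit on all data is extra

print(len(candidates), n_fits)  # 24 candidates, 120 cross-validation fits
```

The grid grows multiplicatively, so adding values to one parameter multiplies the total number of fits.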


In [25]:
score_model(grid.best_estimator_, X_train, y_train)

Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x0000025FD7567EE0>

[1366.13134, 1447.61249, 1694.44458, 937.76391, 1440.63379]

Mean RMSE: 1377.31722


In [26]:
grid.best_params_

{'depth': 3, 'iterations': 200, 'learning_rate': 0.1}

# Fit the best model on the whole training data

In [27]:
model = CatBoostRegressor(**grid.best_params_)

In [28]:
model.fit(X_train, y_train)

0:	learn: 10206.3968092	total: 1.89ms	remaining: 376ms
1:	learn: 9827.6873623	total: 4.5ms	remaining: 446ms
2:	learn: 9292.5877784	total: 6.65ms	remaining: 437ms
3:	learn: 8821.6414960	total: 8.65ms	remaining: 424ms
4:	learn: 8461.2281584	total: 10.8ms	remaining: 421ms
5:	learn: 8076.9167548	total: 12.8ms	remaining: 415ms
6:	learn: 7774.6498370	total: 14.7ms	remaining: 406ms
7:	learn: 7415.2745377	total: 16.8ms	remaining: 403ms
8:	learn: 7052.9067533	total: 18.1ms	remaining: 383ms
9:	learn: 6699.1300358	total: 19.9ms	remaining: 379ms
10:	learn: 6383.2637121	total: 22.8ms	remaining: 391ms
11:	learn: 6178.5744403	total: 24.7ms	remaining: 387ms
12:	learn: 5913.0529845	total: 26.2ms	remaining: 377ms
13:	learn: 5679.6443364	total: 27.6ms	remaining: 367ms
14:	learn: 5514.2514179	total: 28.9ms	remaining: 357ms
15:	learn: 5298.1409175	total: 30.7ms	remaining: 353ms
16:	learn: 5087.2302020	total: 32.5ms	remaining: 350ms
17:	learn: 4964.0389447	total: 34.3ms	remaining: 347ms
18:	learn: 4775.6544

<catboost.core.CatBoostRegressor at 0x25fd75b9fa0>

# Score on holdout dataset

In [29]:
mean_squared_error(y_test, model.predict(X_test)) ** 0.5  # RMSE on the holdout set

918.6958474680919

> Performance on the holdout dataset is even better than the CV scores
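
To put that gap in context, the holdout score can be compared with the spread of the fold scores printed above. With only five folds this is a rough indication rather than a statistical test:

```python
from statistics import mean, stdev

# Fold RMSEs of the tuned model and the holdout RMSE, copied from the outputs above
cv_rmse = [1366.13134, 1447.61249, 1694.44458, 937.76391, 1440.63379]
holdout_rmse = 918.6958474680919

cv_mean = mean(cv_rmse)
cv_std = stdev(cv_rmse)  # sample standard deviation across folds

print(f"CV mean {cv_mean:.1f}, CV std {cv_std:.1f}, holdout {holdout_rmse:.1f}")
print(f"Holdout is {(cv_mean - holdout_rmse) / cv_std:.2f} std below the CV mean")
```

The fold scores themselves range from roughly 938 to 1694, so the holdout value sits just below the lowest fold rather than far outside the observed range.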

# Transformer object that prepares raw data for the model

In [30]:
dataPreparation = DataPreparation(
    vectorizer=vectorizer, 
    countries_df=pd.read_csv("../data/processed/countries_data.csv")
)

In [31]:
sample_data = {'customer_name': 'Martina Avila',
               'customer_email': 'cubilia.Curae.Phasellus@quisaccumsanconvallis.edu',
               'country': 'Bulgaria',
               'gender': 0,
               'age': 42,
               'annual_salary': 62812,
               'credit_card_debt': 11609.5,
               'net_worth': 238961.2,
               'car_purchase_amount': 35320}

In [32]:
sample_data_df = pd.DataFrame.from_dict({k: [v] for k, v in sample_data.items()})

In [33]:
sample_data_df

Unnamed: 0,customer_name,customer_email,country,gender,age,annual_salary,credit_card_debt,net_worth,car_purchase_amount
0,Martina Avila,cubilia.Curae.Phasellus@quisaccumsanconvallis.edu,Bulgaria,0,42,62812,11609.5,238961.2,35320


In [34]:
dataPreparation.transform(sample_data_df)

array([[-0.49806936,  0.06988975, -0.2864808 , -0.6954499 , -0.22996797,
         0.58002918,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  1.22060375, -0.0391651 ,  0.        ,
        -1.10159198, -0.12981586, -0.17935684,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [35]:
dataPreparation.transform(sample_data)

array([[-0.49806936,  0.06988975, -0.2864808 , -0.6954499 , -0.22996797,
         0.58002918,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  1.22060375, -0.0391651 ,  0.        ,
        -1.10159198, -0.12981586, -0.17935684,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [36]:
model.predict(dataPreparation.transform(sample_data_df))

array([35092.49289698])

> I created a handy object that transforms raw data into a vector the model can use.  
> It accepts both a dataframe and a dictionary.
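
Accepting both input types usually comes down to normalizing the input at the start of `transform`. A minimal sketch of that idea with pandas; the helper `to_frame` is hypothetical and not taken from the `vectorizers` module:

```python
import pandas as pd

def to_frame(raw):
    """Return a DataFrame for either a single record (dict) or a DataFrame."""
    if isinstance(raw, pd.DataFrame):
        return raw
    if isinstance(raw, dict):
        # A dict of scalars is treated as one row
        return pd.DataFrame([raw])
    raise TypeError(f"Unsupported input type: {type(raw).__name__}")

record = {"country": "Bulgaria", "gender": 0, "age": 42}
frame_from_dict = to_frame(record)
frame_from_frame = to_frame(frame_from_dict)
```

After this step the rest of the preparation code only has to handle one input type.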

# Saving artefacts

In [39]:
model.save_model("../artefacts/catboostregressor.bin")
with open("../artefacts/vectorizer.pkl", "wb") as pkl_file:
    pickle.dump(dataPreparation, pkl_file)

In [40]:
bentoml.catboost.save_model("what_price_catboost", model);

# Conclusion

> In this notebook I performed the following steps:
> - Converted the input dataset to vector form (class MyVectorizer)
> - Tried different regression models and found that linear models have an unfair advantage on this task, because the target is a linear combination of the features
> - Selected the best model apart from the linear ones, which was CatBoostRegressor
> - Optimized hyperparameters for this model
> - Saved the model and vectorizer as artefacts
> - Also saved the model for BentoML