# Model training

In this notebook I try different models, compare their performance, and tune their hyperparameters.  

List of models:
- LinearRegression
- Ridge (Linear model with L2 regularization)
- Linear SVR
- Random Forest
- CatBoost

This is a regression task: we predict the amount of money a customer is willing to pay for a car.  

RMSE is the evaluation metric.
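
For reference, RMSE is the square root of the mean squared difference between predictions and targets, so it is expressed in the same units as the target (money here). Below is a minimal standard-library sketch; the function name `rmse` and the toy numbers are illustrative and not part of the notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values (made up): errors of 3 and 4 give sqrt((9 + 16) / 2)
print(rmse([10.0, 20.0], [13.0, 16.0]))
```

Because large errors are squared before averaging, a few badly mispriced cars weigh more in this metric than many small misses.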

In [159]:
import numpy as np
import pandas as pd
from pprint import pprint
import pickle

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

import bentoml


In [2]:
SEED=42

# Prepare data for training

In [3]:
df = pd.read_csv("../data/processed/main_data.csv")

In [4]:
df.head()

index,gender,age,annual_salary,credit_card_debt,net_worth,car_purchase_amount,customer_email_suffix,region,population,area_sq_mi,pop_density_per_sq_mi,coastline_coastarea_ratio,gdp__per_capita,birthrate,deathrate
0,0,41.85172,62812.09301,11609.38091,238961.2505,35321.45877,.edu,EASTERN EUROPE,7385367,110910,66.6,0.32,7600.0,9.65,14.27
1,0,40.870623,66646.89292,9572.957136,530973.9078,45115.52566,.co.uk,LATIN AMER. & CARIB,287730,22966,12.5,1.68,4900.0,28.84,5.72
2,1,43.152897,53798.55112,11160.35506,638467.1773,42925.70921,.co.uk,NORTHERN AFRICA,32930091,2381740,13.8,0.04,6000.0,17.14,4.61
3,1,58.271369,79370.03798,14426.16485,548599.0524,67422.36313,.com,OCEANIA,21388,240,89.1,50.0,5000.0,21.0,0.0
4,1,57.313749,59729.1513,5358.712177,560304.0671,55915.46248,.net,LATIN AMER. & CARIB,188078227,8511965,22.1,0.09,7600.0,16.56,6.17


In [5]:
y = df.car_purchase_amount
df.drop(["car_purchase_amount"], axis=1, inplace=True)

In [6]:
df_train, df_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=SEED, shuffle=True)

In [7]:
df_train.head()

index,gender,age,annual_salary,credit_card_debt,net_worth,customer_email_suffix,region,population,area_sq_mi,pop_density_per_sq_mi,coastline_coastarea_ratio,gdp__per_capita,birthrate,deathrate
249,0,36.117247,70334.42787,9823.218967,632600.4718,.ca,UNKNOWN,0,0,0.0,0.0,0.0,0.0,0.0
433,1,39.814817,56086.45033,6588.606462,228388.5491,.ca,UNKNOWN,0,0,0.0,0.0,0.0,0.0,0.0
19,0,58.741842,55368.23716,13272.94647,566022.1306,.ca,NEAR EAST,26783383,437072,61.3,0.01,1500.0,31.98,5.37
322,0,60.180119,58837.97088,12788.81573,622324.7499,.co.uk,EASTERN EUROPE,9396411,88361,106.3,0.0,2200.0,0.0,0.0
332,0,31.244838,41361.95045,10755.50842,615720.0425,.edu,UNKNOWN,0,0,0.0,0.0,0.0,0.0,0.0


> Separated `y` from `df`.  
> Split the dataset into train and test sets (20% held out).
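
As a rough illustration of what a seeded 80/20 holdout split does, here is a standard-library sketch. The helper `holdout_split` is hypothetical and will not select the same rows as `train_test_split`; it only shows the idea of shuffling once with a fixed seed and cutting off 20% of the rows.

```python
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx): a shuffled, disjoint split of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(n_rows * test_size)    # 20% of the rows go to the holdout set
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(500)
print(len(train_idx), len(test_idx))
```

With the 500 rows of this dataset, that leaves 400 rows for training and cross-validation and 100 rows that are never seen during model selection.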


# Vectorize dataset

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   gender                     500 non-null    int64  
 1   age                        500 non-null    float64
 2   annual_salary              500 non-null    float64
 3   credit_card_debt           500 non-null    float64
 4   net_worth                  500 non-null    float64
 5   customer_email_suffix      500 non-null    object 
 6   region                     500 non-null    object 
 7   population                 500 non-null    int64  
 8   area_sq_mi                 500 non-null    int64  
 9   pop_density_per_sq_mi      500 non-null    float64
 10  coastline_coastarea_ratio  500 non-null    float64
 11  gdp__per_capita            500 non-null    float64
 12  birthrate                  500 non-null    float64
 13  deathrate                  500 non-null    float64

In [36]:
class MyVectorizer(BaseEstimator):
    """Scale the numerical columns, then vectorize all columns with DictVectorizer."""

    def __init__(self):
        self._scaler = StandardScaler()
        self._vectorizer = DictVectorizer(sparse=False)
        self._numerical_features = [
            "age",
            "annual_salary",
            "credit_card_debt",
            "net_worth",
            "population",
            "area_sq_mi",
            "pop_density_per_sq_mi",
            "coastline_coastarea_ratio",
            "gdp__per_capita",
            "birthrate",
            "deathrate",
        ]

    def fit(self, df):
        self._scaler.fit(df[self._numerical_features])
        row_dicts = df.to_dict(orient='records')
        self._vectorizer.fit(row_dicts)
        return self

    def transform(self, df):
        # Work on a copy so the caller's DataFrame is not scaled in place
        # (otherwise re-running the cell would scale the same data twice)
        df = df.copy()
        df[self._numerical_features] = self._scaler.transform(df[self._numerical_features])
        row_dicts = df.to_dict(orient='records')
        return self._vectorizer.transform(row_dicts)

    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)

    def get_feature_names(self):
        return self._vectorizer.get_feature_names_out()

In [105]:
vectorizer = MyVectorizer()
X_train = vectorizer.fit_transform(df_train)
X_test = vectorizer.transform(df_test)
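
To make the two steps inside `MyVectorizer` concrete, here is a standard-library sketch on three made-up rows. It mimics, rather than calls, `StandardScaler` (which divides by the population standard deviation) and `DictVectorizer` (which names one-hot columns `<feature>=<value>` and sorts feature names alphabetically by default). The rows and the `vectorize` helper are illustrative assumptions.

```python
import statistics

# Made-up rows with one numerical and one categorical column
rows = [
    {"age": 30.0, "region": "OCEANIA"},
    {"age": 40.0, "region": "NEAR EAST"},
    {"age": 50.0, "region": "OCEANIA"},
]

# Step 1: learn the scaling statistics from the training rows
ages = [r["age"] for r in rows]
mean, std = statistics.fmean(ages), statistics.pstdev(ages)

# Step 2: collect the output columns, one per numeric feature and per category value
feature_names = sorted({"age"} | {f"region={r['region']}" for r in rows})


def vectorize(row):
    """Return the scaled and one-hot encoded vector for a single row."""
    encoded = {"age": (row["age"] - mean) / std, f"region={row['region']}": 1.0}
    return [encoded.get(name, 0.0) for name in feature_names]


matrix = [vectorize(r) for r in rows]
print(feature_names)
for vector in matrix:
    print(vector)
```

The column order depends on which category values appear in the data, which matters for any code that addresses columns of `X_train` by position.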

# Let's try a dummy model
> One that always predicts the mean amount

In [13]:
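class MeanPredictor(BaseEstimator):
    """Baseline that ignores the features and always predicts the training mean."""
    def __init__(self):
        self._mean = 0
    
    def fit(self, X, y):
        self._mean = np.mean(y)
        return self
    
    def predict(self, X):
        return np.repeat(self._mean, X.shape[0])
    
    def score(self, X, y):
        # Returns MSE (lower is better); cross-validation below uses an explicit `scoring`
        return mean_squared_error(y, self.predict(X))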

In [131]:
def score_model(model, X, y):
    result = list(np.round((-cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error'))**0.5, 5))
    print(f"Cross validation RMSE for {model}")
    print()
    pprint(result)
    print()
    print(f"Mean RMSE: {np.round(np.mean(result), 5)}")
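
The sign flip in `score_model` is needed because `scoring='neg_mean_squared_error'` returns the negated MSE per fold, so that higher is always better. A small sketch with hypothetical fold scores (not values from this notebook) shows the conversion, and also that the mean of per-fold RMSEs is not the same number as the RMSE pooled over all folds:

```python
import math

# Hypothetical per-fold scores, as returned with scoring="neg_mean_squared_error"
neg_mse_scores = [-4.0, -9.0, -16.0]

# Negate to get MSE, then take the square root to get RMSE per fold
fold_rmse = [math.sqrt(-score) for score in neg_mse_scores]
mean_of_fold_rmse = sum(fold_rmse) / len(fold_rmse)

# RMSE over all folds pooled together (assuming equally sized folds)
pooled_rmse = math.sqrt(sum(-score for score in neg_mse_scores) / len(neg_mse_scores))

print(fold_rmse, mean_of_fold_rmse, pooled_rmse)
```

`score_model` reports the first variant, the mean of the per-fold RMSEs.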

In [132]:
score_model(MeanPredictor(), X_train, y_train)

Cross validation RMSE for MeanPredictor()

[11534.6994, 11003.47481, 11618.62646, 10089.47294, 9885.06097]

Mean RMSE: 10826.26692
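
A useful way to read this baseline: the RMSE of a mean predictor, measured on the data it was fitted on, equals the population standard deviation of the target. The cross-validated figure of about 10826 is therefore roughly the spread of `car_purchase_amount` (only roughly, since each fold is scored on held-out rows). A sketch on made-up numbers:

```python
import math
import statistics

y = [10.0, 20.0, 30.0, 40.0]  # made-up target values

mean = statistics.fmean(y)
rmse_of_mean = math.sqrt(sum((value - mean) ** 2 for value in y) / len(y))

# Both numbers are the same quantity computed two ways
print(rmse_of_mean, statistics.pstdev(y))
```

Any model worth keeping should score clearly below this value.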


# Linear models

## LinearRegression

In [62]:
score_model(LinearRegression(), X_train, y_train)

Cross validation RMSE for LinearRegression()

[1.0, 1.0, 1.0, 1.0, 1.0]

Mean RMSE: 1.0


## RidgeCV (Linear + L2 regularization + CV for alpha)

In [63]:
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

In [64]:
model_cv = regr_cv.fit(X_train, y_train)

In [143]:
print(f"Param alpha = {model_cv.alpha_}")

Param alpha = 0.1


In [65]:
score_model(model_cv, X_train, y_train)

Cross validation RMSE for RidgeCV(alphas=[0.1, 1.0, 10.0])

[1.0, 1.0, 1.0, 1.0, 1.0]

Mean RMSE: 1.0


## Linear models have excellent performance
> Yes, and that is a problem.  
> It suggests that our data is synthetic and that the target is a linear combination of the numeric features.

In [81]:
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
pprint(list(np.round(model_linear.predict(X_train)[0:5],5)))
pprint(list(y_train.iloc[:5]))

[46133.83724, 29517.87529, 54828.59527, 59621.99384, 25252.00363]
[46135.27233, 29519.56184, 54827.52403, 59625.02618, 25252.93221]
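
The near-perfect fit is what one would expect when the target is an exact linear function of the features: least squares then recovers the generating coefficients and the residuals vanish. A one-feature sketch using the closed-form solution, with made-up data (`y = 3 + 2x`), illustrates this:

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0 + 2.0 * value for value in x]  # noise-free linear target

x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)

# Closed-form simple linear regression: slope = cov(x, y) / var(x)
slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
         / sum((a - x_mean) ** 2 for a in x))
intercept = y_mean - slope * x_mean

residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
print(slope, intercept, max(abs(r) for r in residuals))
```

Real customer data would carry noise, so residuals this small on every fold are a strong hint that the target was generated by a formula.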


# Second dummy model, linear combination of features

In [100]:
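class LinearPredictor(BaseEstimator):
    """Hand-written linear formula; coefficients and column positions are hard-coded."""
    def __init__(self):
        self._mean = 0
    
    def fit(self, X, y):
        # Nothing to learn: the formula in predict() is fixed
        return self
    
    def predict(self, X):
        # Column indexes refer to the column order produced by MyVectorizer
        y = (43946.09177638776
             + X[:, 0] * 6764.2431783130405
             + X[:, 1] * 6685.214307781013
             + X[:, 15] * 5082.409111980831)
        return y
    
    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))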

In [102]:
score_model(LinearPredictor(), X_train, y_train)

Cross validation RMSE for LinearPredictor()

[1.46616, 1.29778, 1.34564, 1.57134, 1.71436]

Mean RMSE: 1.47906


In [106]:
model_linear = LinearPredictor()
model_linear.fit(X_train, y_train)
print(f"Test MSE: {mean_squared_error(y_test, model_linear.predict(X_test))}")

Test MSE: 7.470502849657696e+18
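
`LinearPredictor` addresses columns by position (0, 1 and 15), which silently picks the wrong feature if the column layout of the vectorized matrix changes. One way to make such a formula more robust is to resolve columns by name. The sketch below uses a shortened, hypothetical feature list and illustrative coefficients, not the notebook's values; in the notebook the names would come from `vectorizer.get_feature_names()`.

```python
# Hypothetical, shortened feature-name list in alphabetical order
feature_names = ["age", "annual_salary", "customer_email_suffix=.com", "net_worth"]
column_of = {name: index for index, name in enumerate(feature_names)}

# Illustrative coefficients only; these are not the notebook's fitted values
intercept = 100.0
weights = {"age": 2.0, "annual_salary": 0.5, "net_worth": 0.1}


def predict_row(row):
    """Linear prediction that addresses columns by feature name, not position."""
    return intercept + sum(weight * row[column_of[name]]
                           for name, weight in weights.items())


print(predict_row([1.0, 2.0, 0.0, 10.0]))
```

A missing feature name then raises a `KeyError` instead of producing a plausible-looking but wrong prediction.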


> So the best performance comes from the linear model, because of the linear dependency in the source data.  
> However, I will not use this model, both for educational purposes and to give my service more ability to generalize.


# Linear SVR

In [133]:
score_model(LinearSVR(), X_train, y_train)

Cross validation RMSE for LinearSVR()

[45242.06469, 44232.61896, 45465.55035, 45472.61648, 43546.78562]

Mean RMSE: 44791.92722
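
One hedged reading of this score: an RMSE near 45,000 is about what a model predicting values close to zero would get. For a constant prediction `c`, the squared RMSE is `(mean - c)^2 + variance`. The sketch below assumes the target mean is close to the intercept hard-coded in `LinearPredictor` and takes the spread from the `MeanPredictor` baseline; both are approximations, not measured values.

```python
import math

target_mean = 43946.09  # assumption: approximated by LinearPredictor's intercept
target_std = 10826.27   # assumption: approximated by MeanPredictor's mean CV RMSE


def rmse_of_constant(c):
    """RMSE of always predicting the constant c, given the target's mean and std."""
    return math.sqrt((target_mean - c) ** 2 + target_std ** 2)


print(round(rmse_of_constant(0.0)))          # predicting zero for every customer
print(round(rmse_of_constant(target_mean)))  # predicting the mean, i.e. the baseline
```

The zero-prediction figure lands close to the observed 44791, which suggests (but does not prove) that the default `LinearSVR` barely moved its predictions toward the unscaled target. Scaling the target or raising `C` or `max_iter` would be the usual things to try.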


# Random Forest

In [156]:
for depth in [1,2,3,5,7,10]:
    print(f"DEPTH {depth}")
    score_model(RandomForestRegressor(n_estimators=200, max_depth=depth, random_state=SEED), X_train, y_train)

DEPTH 1
Cross validation RMSE for RandomForestRegressor(max_depth=1, n_estimators=200, random_state=42)

[9035.18447, 8114.38469, 9139.67589, 7514.8783, 7607.73515]

Mean RMSE: 8282.3717
DEPTH 2
Cross validation RMSE for RandomForestRegressor(max_depth=2, n_estimators=200, random_state=42)

[7210.48341, 6336.19179, 7431.46746, 5876.22276, 6736.31624]

Mean RMSE: 6718.13633
DEPTH 3
Cross validation RMSE for RandomForestRegressor(max_depth=3, n_estimators=200, random_state=42)

[5540.1352, 4812.57835, 6448.71115, 4472.06241, 5185.9053]

Mean RMSE: 5291.87848
DEPTH 5
Cross validation RMSE for RandomForestRegressor(max_depth=5, n_estimators=200, random_state=42)

[4034.6144, 3515.24238, 4914.26935, 2910.84405, 3710.08313]

Mean RMSE: 3817.01066
DEPTH 7
Cross validation RMSE for RandomForestRegressor(max_depth=7, n_estimators=200, random_state=42)

[3537.16047, 3127.25475, 4310.69216, 2498.00026, 3245.02482]

Mean RMSE: 3343.62649
DEPTH 10
Cross validation RMSE for RandomForestRegressor(max

# CatBoost

In [157]:
for depth in [1,2,3,5,7,10]:
    print(f"DEPTH {depth}")
    score_model(CatBoostRegressor(verbose=False, iterations=200, depth=depth, random_state=SEED), X_train, y_train)

DEPTH 1
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2F200F4C0>

[1808.50302, 1692.45359, 2431.94612, 1364.16829, 1655.12434]

Mean RMSE: 1790.43907
DEPTH 2
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2F31BC8E0>

[1548.81551, 1500.26176, 1761.86331, 1149.7257, 1459.35538]

Mean RMSE: 1484.00433
DEPTH 3
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2ECACC610>

[1409.05795, 1585.60115, 1655.40142, 877.8478, 1455.58039]

Mean RMSE: 1396.69774
DEPTH 5
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2F200F4C0>

[1799.26243, 1850.77835, 2016.86718, 1028.04793, 1711.00724]

Mean RMSE: 1681.19263
DEPTH 7
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2F2ECDFD0>

[2900.77422, 2670.43779, 3152.16527, 1948.3585, 2468.60227]

Mean RMSE: 2628.06761
DEPTH 10
Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2F2E0E7

# Results

| Model                 | Params      | Mean CV RMSE |
|-----------------------|-------------|--------------|
| MeanPredictor         |             | 10826        |
| LinearRegression      |             | 1            |
| RidgeCV               | alpha=0.1   | 1            |
| LinearPredictor       |             | 1.48         |
| LinearSVR             |             | 44791        |
| RandomForestRegressor | depth=10    | 3249         |
| **CatBoostRegressor** | **depth=3** | **1396**     |

> Except for linear models, the CatBoostRegressor model with a depth of 3 gave the best performance.  
> Let's take it as a basis and try to optimize other parameters.

## Optimizing CatBoostRegressor

In [148]:
%%time
model = CatBoostRegressor(verbose=False, random_seed=SEED)
parameters = {
    'depth': [2, 3],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [30, 50, 100, 200],
}

# No `scoring` is passed, so candidates are ranked by the estimator's default score (R^2), not RMSE
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

Wall time: 16.9 s


In [151]:
score_model(grid.best_estimator_, X_train, y_train)

Cross validation RMSE for <catboost.core.CatBoostRegressor object at 0x000001D2F33AE7F0>

[1585.89346, 1455.53301, 1695.60371, 796.05805, 1258.44854]

Mean RMSE: 1358.30735


In [153]:
grid.best_params_

{'depth': 3, 'iterations': 200, 'learning_rate': 0.1}
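
To get a feel for the size of this search, the grid can be enumerated by hand. The sketch below uses only the standard library and copies the parameter grid from the cell above; `candidates` and `n_fits` are illustrative names.

```python
from itertools import product

parameters = {
    "depth": [2, 3],
    "learning_rate": [0.01, 0.05, 0.1],
    "iterations": [30, 50, 100, 200],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]

# Each candidate is fitted once per fold (the final refit on all data comes on top)
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)
```

The selected combination sits at the upper edge of every range in the grid (`depth=3`, `iterations=200`, `learning_rate=0.1`), so a wider grid might still find a better candidate.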

# Saving artefacts

In [158]:
grid.best_estimator_.save_model("../artefacts/catboostregressor.bin")
with open("../artefacts/vectorizer.pkl", "wb") as pkl_file:
    pickle.dump(vectorizer, pkl_file)

In [161]:
bentoml.catboost.save_model("what_price_catboost", grid.best_estimator_);

# Conclusion

> In this notebook I performed the following steps:
> - Converted the input dataset to vector form (class `MyVectorizer`)
> - Tried different regression models and found that linear models have an unfair advantage on this task, because the target is a linear combination of the features
> - Selected the best model other than the linear ones, which was CatBoostRegressor
> - Optimized the hyperparameters of this model
> - Saved the model and the vectorizer as artefacts
> - Also saved the model for BentoML