# Diamond Prediction

After the exploratory data analysis, a prediction model is built.
The prediction follows these steps:
1. **Preprocessing** - handle outliers and engineer features.
2. **Predictions** - predict on the validation data and then on the test data, and record the metrics.
3. **Evaluation** - build baselines for comparison.

## Libraries & settings

In [103]:
#numpy and pandas (pd is used for the settings and the result tables below)
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

#boosters
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

#forests
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor

#ensemble
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV

#model explainability
import eli5
from eli5.sklearn import PermutationImportance

#pipeline tools
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from pipelinehelper import PipelineHelper
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
#time related
from timeit import default_timer as timer
from datetime import timedelta

#timer for entire code
start = timer()

#warning handling: silence all warnings
import warnings
warnings.filterwarnings("ignore")

#plotly
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

#settings
pd.options.display.float_format = '{:.3f}'.format

## Baseline 1: Basic approach

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **continuous** features - flag values outside the IQR fences and fill them with the median.
    2. **Feature engineering:** categorical labels only, encoded with OrdinalEncoder.
* **Model training** - fit a simple linear regression model, score it on the validation set, and cross-validate it.
* **Model testing** - train on the whole train + validation set and use the test data for results.
* **Model evaluation** - record the following metrics for validation and test:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
    4. NRMSE - [Negative root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
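
As a small illustration of the four metrics listed above, the sketch below computes them by hand with NumPy. The arrays `y_true` and `y_pred` are made-up numbers, not diamond data. NRMSE follows this notebook's usage, the negative of the root mean squared error, which is what scikit-learn's `neg_root_mean_squared_error` scorer returns.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only to keep the arithmetic easy
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
nrmse = -np.sqrt(mse)                            # negative root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(f"MAE={mae:.3f} MSE={mse:.3f} R2={r2:.3f} NRMSE={nrmse:.3f}")
```

Because the sign is flipped, a higher NRMSE (closer to zero) is better, so a search procedure can maximize it directly.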
    

**Preprocessing**

In [104]:
# Preprocessing for continuous data

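def Outlier_Detector(X,factor):
    """Replace values outside [Q1 - factor*IQR, Q3 + factor*IQR] with NaN, column by column."""
    X = pd.DataFrame(X).copy()
    for i in range(X.shape[1]):
        x = pd.Series(X.iloc[:,i]).copy()
        q1 = x.quantile(0.25)
        q3 = x.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - (factor * iqr)
        upper_bound = q3 + (factor * iqr)
        # positional boolean mask (NumPy array), as expected by iloc
        mask = ((x < lower_bound) | (x > upper_bound)).to_numpy()
        # mark outliers as missing so the imputer can fill them later
        X.iloc[mask,i] = np.nan
    return X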

#creating outlier_remover object using FunctionTransformer with factor=1.5
Outlier = FunctionTransformer(Outlier_Detector,kw_args={'factor':1.5})

#contiuous_transformer = SimpleImputer(strategy='median')

contiuous_transformer = Pipeline(steps=[
('outlier', Outlier),
('imputer', SimpleImputer(strategy='median'))
])

# building categorical transformers (worst to best)
cut_enc = OrdinalEncoder(categories=[["Fair", "Good", "Very Good", "Premium","Ideal"]])
color_enc = OrdinalEncoder(categories=[['J', 'I', 'H', 'G', 'F', 'E','D']])
clarity_enc = OrdinalEncoder(categories=[["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1",'IF']])


# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', contiuous_transformer, Continuous),
        ('cuts', cut_enc, ["cut"]),
        ('colors', color_enc, ["color"]),
        ('clarities', clarity_enc, ["clarity"])
    ])
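
To make the outlier step concrete, here is a minimal sketch of the same idea on a single toy column: values outside the IQR fences are masked as missing and then filled with the median. The helper `mask_iqr_outliers` is a hypothetical vectorized rewrite written for this illustration, not the notebook's `Outlier_Detector`, and the toy values are invented.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def mask_iqr_outliers(X, factor=1.5):
    """Hypothetical helper: set values outside the IQR fences to NaN."""
    X = pd.DataFrame(X).astype(float)
    q1 = X.quantile(0.25)
    q3 = X.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # keep values inside the fences, replace everything else with NaN
    return X.where((X >= lower) & (X <= upper)).to_numpy()


# Toy column: Q1 = 2, Q3 = 4, IQR = 2, so the fences are [-1, 7]
toy = pd.DataFrame({"carat": [1.0, 2.0, 3.0, 4.0, 100.0]})

pipe = Pipeline(steps=[
    ("outlier", FunctionTransformer(mask_iqr_outliers, kw_args={"factor": 1.5})),
    ("imputer", SimpleImputer(strategy="median")),
])

cleaned = pipe.fit_transform(toy)
print(cleaned.ravel())  # 100.0 is replaced by the median of the remaining values
```

One design point worth keeping in mind: a `FunctionTransformer` is stateless, so the fences are recomputed from whatever data is passed in, including validation or test data at prediction time. Only the imputer's median is learned during `fit`.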

**Model**

In [105]:
model = LinearRegression()

**Final Pipeline**

In [106]:
# Bundle preprocessing and modeling code in a pipeline
Baseline1 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
Baseline1.fit(X_train2, y_train2)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                       kw_args={'factor': 1.5})),
                                                                  ('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                                  OrdinalEncoder(categories=[['Fair',
                                                                              'Good',
                 

**Validation Prediction**

In [107]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


# Preprocessing of validation data, get predictions
b1_val_preds = Baseline1.predict(X_val)

# Evaluate the model
b1_val_mae = mean_absolute_error(y_val, b1_val_preds)
b1_val_mse = mean_squared_error(y_val, b1_val_preds)
b1_val_r2 = r2_score(y_val, b1_val_preds)

print('MAE:', b1_val_mae)
print("MSE: ",b1_val_mse)
print("R2: ",b1_val_r2)

MAE: 1227.2738971261278
MSE:  2652792.3125459263
R2:  0.8367188169244258


**Cross Validation Prediction (NRMSE)**

In [108]:
from sklearn.model_selection import cross_val_score
CV = cross_val_score(Baseline1, X_train, y_train, cv=5, scoring = "neg_root_mean_squared_error")
print(f"negative root mean squared error on 5 fold cross validation: {CV}")
print(f"mean negative root mean squared error across folds: {CV.mean()}")

negative root mean squared error on 5 fold cross validation: [-1674.34968697 -1594.12551928 -1562.22328898 -1589.48519538
 -1627.25022853]
mean negative root mean squared error across folds: -1609.4867838266748
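
For readers unfamiliar with the scorer, the following sketch shows what `cross_val_score` returns with `scoring="neg_root_mean_squared_error"`: one negative RMSE per fold. The data is synthetic and the coefficients are arbitrary assumptions, used only so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumption: stands in for the diamond data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One score per fold; each score is -RMSE, so values are <= 0
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)
print(scores)
print(scores.mean())
```

The mean of the fold scores is an average error, not an accuracy, which is why values closer to zero indicate a better model.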


**Test Prediction**

In [109]:
Baseline1.fit(X_train, y_train)
b1_test_preds = Baseline1.predict(X_test)
b1_test_mae = mean_absolute_error(y_test, b1_test_preds)
b1_test_mse = mean_squared_error(y_test, b1_test_preds)
b1_test_r2 = r2_score(y_test, b1_test_preds)
print('MAE:', b1_test_mae)
print("MSE: ",b1_test_mse)
print("R2: ",b1_test_r2)

MAE: 1249.8900248224795
MSE:  2697964.8791565355
R2:  0.8331690897943889
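
The validate-then-test protocol used above can be sketched as follows. How `X_train2` and `X_val` were created is not shown in this section, so the two `train_test_split` calls, the split sizes and the synthetic data are all assumptions; only the variable names mirror the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the diamond features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# First split: hold out the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: carve a validation set out of the training data
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

model = LinearRegression()

# Validation: fit on the reduced training split, score on the validation split
model.fit(X_train2, y_train2)
val_mae = mean_absolute_error(y_val, model.predict(X_val))

# Test: refit on train + validation, score once on the untouched test split
model.fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print(val_mae, test_mae)
```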


**Test NRMSE**

In [110]:
print("NRMSE: ",-np.sqrt(b1_test_mse))

NRMSE:  -1642.5482882267222


**Model Explainability**

The chosen method is [Permutation](https://brilliant.org/wiki/permutations/#permutations-problem-solving) Importance.
Each feature is shuffled in turn while the other columns keep their original order, and the model is scored again. A large positive weight means the score dropped sharply when the feature was shuffled, so the model relies heavily on that feature. A negative weight means the model scored better with the shuffled feature, which suggests the feature is useless as a predictor.
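
The shuffling idea can be tried on synthetic data. The sketch below uses scikit-learn's `permutation_importance` from `sklearn.inspection`, which is a different implementation from the `eli5` class used in this notebook but follows the same principle. The two features and their names are invented for the illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
informative = rng.normal(size=n)   # drives the target
noise = rng.normal(size=n)         # unrelated to the target
X = np.column_stack([informative, noise])
y = 5.0 * informative + rng.normal(scale=0.1, size=n)

model = LinearRegression().fit(X, y)

# Shuffle each column 10 times and record the drop in the model's R2 score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, mean, std in zip(
    ["informative", "noise"], result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

With an R2-based score, a weight can exceed 1: predictions made from a shuffled key feature can be worse than always predicting the mean, which pushes R2 below zero.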

In [111]:
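x_tr = Baseline1.named_steps["preprocessor"].fit_transform(X_train)#preprocess inputs
perm = PermutationImportance(Baseline1.named_steps["model"]).fit(x_tr, y_train)#compute permutation importance for the fitted model
# caution: the weights come in ColumnTransformer output order (Continuous columns, then cut, color, clarity);
# X_train.columns labels them correctly only if it uses that same order
eli5.show_weights(perm, feature_names = X_train.columns.tolist())#show results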

| Weight | Feature |
|---|---|
| 0.7128 ± 0.0043 | depth |
| 0.2197 ± 0.0028 | table |
| 0.1015 ± 0.0021 | z |
| 0.1001 ± 0.0013 | clarity |
| 0.0545 ± 0.0016 | carat |
| 0.0235 ± 0.0006 | y |
| 0.0031 ± 0.0002 | x |
| 0.0008 ± 0.0001 | cut |
| 0.0002 ± 0.0001 | color |


**Insights:**
* depth is very important for this model's predictions.
* color has almost no effect on the model's predictions.
* Besides depth, only table, z and clarity have a sizable effect on the model; the other features matter less.
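
When reading such a table, it helps to check that the feature labels follow the column order produced by the `ColumnTransformer`, which is the order of its `transformers` list rather than the order of the input frame. The sketch below demonstrates this on an invented three-column frame. Whether `X_train` in this notebook has a different column order is not shown in this section, so treat this as a check to run, not as a diagnosis.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame whose column order differs from the transformer order
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1],
    "cut": ["Fair", "Ideal", "Good"],
    "depth": [61.0, 62.5, 60.2],
})

continuous = ["carat", "depth"]
ct = ColumnTransformer(transformers=[
    ("num", "passthrough", continuous),
    ("cuts", OrdinalEncoder(
        categories=[["Fair", "Good", "Very Good", "Premium", "Ideal"]]), ["cut"]),
])

out = ct.fit_transform(toy)

# Names in the order of the transformed array: numeric block first, then cut
transformed_names = continuous + ["cut"]

print(toy.columns.tolist())   # input order
print(transformed_names)      # output order
print(out)
```

Passing `transformed_names` instead of the input frame's columns keeps each weight attached to the feature it was computed for.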

### Baseline 1 Summary
The model was validated with MSE, MAE and R2, cross-validated with 5 folds for NRMSE, and tested once for all the metrics:
* **MAE**: about 1,227 on validation and 1,250 on test; validation performed slightly better.
* **MSE**: about 2,650,000 on validation and 2,700,000 on test; validation performed slightly better.
* **R2**: about 0.84 on validation and 0.83 on test; validation performed slightly better.
* **NRMSE**: about -1,609 (5-fold cross-validation mean) and -1,643 on test; validation performed slightly better.

The baseline is saved for comparison as a pandas dataframe: 

In [112]:
baseline1 = pd.DataFrame({"val_mae": b1_val_mae,"val_mse": b1_val_mse,"val_r2": b1_val_r2,"val_nrmse": CV.mean(),"test_mae": b1_test_mae,"test_mse": b1_test_mse,"test_r2": b1_test_r2, "test_nrmse": -np.sqrt(b1_test_mse)}, index=["Baseline1"])
baseline1

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline1 | 1227.274 | 2652792.313 | 0.837 | -1609.487 | 1249.89 | 2697964.879 | 0.833 | -1642.548 |


## Baseline 2: Preprocess Parameter tuning

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **continuous** features - tune what counts as an outlier, fill with the median or the mean, then apply standard scaling.
    2. **Feature engineering:** categorical labels only, encoded with OrdinalEncoder.
* **Model training** - cross-validate a simple linear regression model while grid searching the preprocessing parameters.
* **Model testing** - validate with grid search and the validation set, then train on the whole train + validation set and use the test data for results.
* **Model evaluation** - record the following metrics for validation and test:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
    4. NRMSE - [Negative root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
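
The ordinal encoding of the categorical labels can be illustrated in isolation. The sketch below reuses the color ordering defined earlier in this notebook (worst to best) and encodes three sample grades; the sample values are chosen only for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Color grades ordered from worst (J) to best (D), as in the notebook
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
encoder = OrdinalEncoder(categories=[color_order])

# Three sample grades, one per row
sample = np.array([['D'], ['J'], ['G']])
codes = encoder.fit_transform(sample)

# Each grade maps to its position in color_order
print(codes.ravel())
```

Supplying the category list explicitly matters here: without it, the encoder would sort the labels alphabetically and the codes would no longer reflect quality.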
    

**Preprocessing**

In [113]:
# Preprocessing for continuous data
# factor=np.nan is only a placeholder: the grid search sets the actual factor
Outlier2 = FunctionTransformer(Outlier_Detector,kw_args={'factor':np.nan})

#contiuous_transformer = SimpleImputer(strategy='median')

contiuous_transformer = Pipeline(steps=[
('outlier', Outlier2),
('imputer', SimpleImputer()),
('scaler', StandardScaler())    
])

# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', contiuous_transformer, Continuous),
        ('cuts', cut_enc, ["cut"]),
        ('colors', color_enc, ["color"]),
        ('clarities', clarity_enc, ["clarity"])
    ])

**Final Pipeline**

In [114]:
# Bundle preprocessing and modeling code in a pipeline
Baseline2 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
Baseline2.fit(X_train2, y_train2)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                       kw_args={'factor': nan})),
                                                                  ('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                           

In [115]:
Baseline2.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__num', 'preprocessor__cuts', 'preprocessor__colors', 'preprocessor__clarities', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__verbose', 'preprocessor__num__outlier', 'preprocessor__num__imputer', 'preprocessor__num__scaler', 'preprocessor__num__outlier__accept_sparse', 'preprocessor__num__outlier__check_inverse', 'preprocessor__num__outlier__func', 'preprocessor__num__outlier__inv_kw_args', 'preprocessor__num__outlier__inverse_func', 'preprocessor__num__outlier__kw_args', 'preprocessor__num__outlier__validate', 'preprocessor__num__imputer__add_indicator', 'preprocessor__num__imputer__copy', 'preprocessor__num__imputer__fill_value', 'preprocessor__num__imputer__missing_values', 'preprocessor__num__impute

**Parameter tuning and grid search**

The outlier factor and the imputation strategy are both grid searched:

In [116]:
hyperparameters = {'preprocessor__num__outlier__kw_args':[{'factor':0},{'factor':0.5},{'factor':1},{'factor':1.5},{'factor':2},{'factor':2.5},{'factor':3}],
              'preprocessor__num__imputer__strategy':['mean','median']}
#grid search
b2_test_clf = GridSearchCV(Baseline2, hyperparameters,cv = 5, scoring = "neg_root_mean_squared_error", n_jobs = -1, verbose = 2) 
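
The mechanism of searching over a `FunctionTransformer`'s `kw_args` can be reproduced on a small synthetic problem. In this sketch, `mask_iqr_outliers` is a hypothetical stand-in for the notebook's outlier function, the data is random with a few injected extreme values, and the grid is reduced to two factors and two strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def mask_iqr_outliers(X, factor=1.5):
    """Hypothetical helper: set values outside the IQR fences to NaN."""
    X = pd.DataFrame(X).astype(float)
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    inside = (X >= q1 - factor * iqr) & (X <= q3 + factor * iqr)
    return X.where(inside).to_numpy()


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 4.0 * X[:, 0] + rng.normal(scale=0.2, size=200)
X[:5, 0] = 50.0  # corrupt a few inputs with extreme values

pipe = Pipeline(steps=[
    ("outlier", FunctionTransformer(mask_iqr_outliers)),
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Each kw_args candidate is a whole dict passed to the function
param_grid = {
    "outlier__kw_args": [{"factor": 1.5}, {"factor": 3}],
    "imputer__strategy": ["mean", "median"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 2 factors x 2 strategies
print(search.best_params_)
```

The number of candidates is the product of the list lengths, which is how the notebook's grid of 7 factors and 2 strategies yields 14 candidates.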

In [117]:
%%time
# Fit and tune model
b2_test_clf.fit(X_train, y_train)

Fitting 5 folds for each of 14 candidates, totalling 70 fits
Wall time: 4.79 s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('outlier',
                                                                                          FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                                              kw_args={'factor': nan})),
                                                                                         ('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                     

In [118]:
b2_test_clf.best_params_

{'preprocessor__num__imputer__strategy': 'mean',
 'preprocessor__num__outlier__kw_args': {'factor': 3}}

**Validation Prediction**

In [119]:
b2_val_clf = b2_test_clf.best_estimator_
b2_val_clf.fit(X_train2, y_train2)
b2_val_preds = b2_val_clf.predict(X_val)

# Evaluate the model
b2_val_mae = mean_absolute_error(y_val, b2_val_preds)
b2_val_mse = mean_squared_error(y_val, b2_val_preds)
b2_val_r2 = r2_score(y_val, b2_val_preds)

print('MAE:', b2_val_mae)
print("MSE: ",b2_val_mse)
print("R2: ",b2_val_r2)
print("NRMSE: ",-np.sqrt(b2_val_mse))

MAE: 829.0934236938818
MSE:  1413280.7606081744
R2:  0.9130116015797015
NRMSE:  -1188.8148554792601


**Test Prediction**

In [120]:
b2_Te_clf = b2_test_clf.best_estimator_
b2_Te_clf.fit(X_train, y_train)
b2_test_preds = b2_Te_clf.predict(X_test)

# Evaluate the model
b2_test_mae = mean_absolute_error(y_test, b2_test_preds)
b2_test_mse = mean_squared_error(y_test, b2_test_preds)
b2_test_r2 = r2_score(y_test, b2_test_preds)

print('MAE:', b2_test_mae)
print("MSE: ",b2_test_mse)
print("R2: ",b2_test_r2)
print("NRMSE: ",-np.sqrt(b2_test_mse))

MAE: 827.987068391731
MSE:  1451969.502775263
R2:  0.9102162538844779
NRMSE:  -1204.9769718858793


In [121]:
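x_tr = b2_test_clf.best_estimator_.named_steps["preprocessor"].fit_transform(X_train)#preprocess inputs
perm = PermutationImportance(b2_test_clf.best_estimator_.named_steps["model"]).fit(x_tr, y_train)#compute permutation importance for the fitted model
# caution: the weights come in ColumnTransformer output order (Continuous columns, then cut, color, clarity);
# X_train.columns labels them correctly only if it uses that same order
eli5.show_weights(perm, feature_names = X_train.columns.tolist())#show results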

| Weight | Feature |
|---|---|
| 2.6284 ± 0.0291 | carat |
| 0.2455 ± 0.0035 | clarity |
| 0.1797 ± 0.0019 | depth |
| 0.0852 ± 0.0014 | z |
| 0.0355 ± 0.0007 | y |
| 0.0066 ± 0.0003 | table |
| 0.0026 ± 0.0002 | x |
| 0.0002 ± 0.0001 | color |
| 0.0001 ± 0.0000 | cut |


**Insights:**
* carat is very important for this model's predictions.
* color and cut have almost no effect on the model's predictions.
* Besides carat, only clarity, depth and z have a sizable effect on the model; the other features matter less.

### Baseline 2 Summary
The model was cross-validated with 5 folds for NRMSE during the grid search, then validated and tested once for all the metrics:
* **MAE**: about 829 on validation and 828 on test; test performed slightly better.
* **MSE**: about 1,413,000 on validation and 1,452,000 on test; validation performed slightly better.
* **R2**: about 0.913 on validation and 0.910 on test; validation performed slightly better.
* **NRMSE**: about -1,189 on validation and -1,205 on test; validation performed slightly better.

The baseline is saved for comparison as a pandas dataframe: 

In [122]:
baseline2 = pd.DataFrame({"val_mae": b2_val_mae,"val_mse": b2_val_mse,"val_r2": b2_val_r2,"val_nrmse": -np.sqrt(b2_val_mse),
                          "test_mae": b2_test_mae,"test_mse": b2_test_mse,"test_r2": b2_test_r2, "test_nrmse": -np.sqrt(b2_test_mse)}, index=["Baseline2"])
baseline2

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline2 | 829.093 | 1413280.761 | 0.913 | -1188.815 | 827.987 | 1451969.503 | 0.91 | -1204.977 |


The baselines are saved together for comparison as a pandas dataframe:

In [123]:
Baselines = pd.concat([baseline1,baseline2])
Baselines

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline1 | 1227.274 | 2652792.313 | 0.837 | -1609.487 | 1249.89 | 2697964.879 | 0.833 | -1642.548 |
| Baseline2 | 829.093 | 1413280.761 | 0.913 | -1188.815 | 827.987 | 1451969.503 | 0.91 | -1204.977 |


## Baseline 3: Preprocess Parameter tuning, and model tuning using PipelineHelper

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **continuous** features - tune what counts as an outlier, fill with the median or the mean, then apply standard scaling.
    2. **Feature engineering:** categorical labels only, encoded with OrdinalEncoder.
* **Model training** - cross-validate 4 models (SVR, decision tree, Bayesian ridge and k-nearest neighbors) while grid searching their parameters.
* **Model testing** - validate with grid search and the validation set, then train on the whole train + validation set and use the test data for results.
* **Model evaluation** - record the following metrics for validation and test:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
    4. NRMSE - [Negative root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
    

**Preprocessing**

In [124]:
# Bundle preprocessing and modeling code in a pipeline
Baseline3 = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model', PipelineHelper([('svr', SVR()),
                                   ('dt', DecisionTreeRegressor(random_state = 42)),
                                   ('br', BayesianRidge()),
                                   ('knn', KNeighborsRegressor()),
                                                       ]))
                             ])
Baseline3

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                       kw_args={'factor': nan})),
                                                                  ('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                           

In [125]:
Baseline3.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__num', 'preprocessor__cuts', 'preprocessor__colors', 'preprocessor__clarities', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__verbose', 'preprocessor__num__outlier', 'preprocessor__num__imputer', 'preprocessor__num__scaler', 'preprocessor__num__outlier__accept_sparse', 'preprocessor__num__outlier__check_inverse', 'preprocessor__num__outlier__func', 'preprocessor__num__outlier__inv_kw_args', 'preprocessor__num__outlier__inverse_func', 'preprocessor__num__outlier__kw_args', 'preprocessor__num__outlier__validate', 'preprocessor__num__imputer__add_indicator', 'preprocessor__num__imputer__copy', 'preprocessor__num__imputer__fill_value', 'preprocessor__num__imputer__missing_values', 'preprocessor__num__impute

In [126]:
hyperparameters = { 
                   'preprocessor__num__outlier__kw_args':[{'factor':1.5},{'factor':3}],
                  'preprocessor__num__imputer__strategy':['mean','median'],
                    'model__selected_model': Baseline3.named_steps['model'].generate({   
                    'svr__C': [0.1,1],
                    'dt__max_depth': [None,5],
                    'knn__n_neighbors': [4,6],
                    'br__tol': [0.0001,0.001]
    })}

**Parameter tuning and grid search**

The outlier factor and the imputation strategy are grid searched, together with a choice between 4 different regressors:

In [127]:
b3_test_clf = GridSearchCV(Baseline3, hyperparameters,cv = 5, scoring = "neg_root_mean_squared_error", n_jobs = -1, verbose = 2) 
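
For comparison, the same kind of model selection can be expressed without `PipelineHelper`, using only scikit-learn: `GridSearchCV` accepts a list of grids, and each grid can replace the `model` step with a different estimator. This is an alternative technique, not the notebook's method. The data is synthetic and SVR is left out to keep the run short.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic, mildly non-linear regression problem
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=150)

# The 'model' step is a placeholder that each grid overrides
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", BayesianRidge())])

param_grid = [
    {"model": [DecisionTreeRegressor(random_state=42)], "model__max_depth": [None, 5]},
    {"model": [KNeighborsRegressor()], "model__n_neighbors": [4, 6]},
    {"model": [BayesianRidge()], "model__tol": [0.0001, 0.001]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 2 + 2 + 2 candidates
print(type(search.best_estimator_.named_steps["model"]).__name__)
```

The candidate count adds up across models and multiplies with the preprocessing options. In the notebook's grid, 4 models with 2 settings each give 8 model settings, and 2 factors times 2 strategies give 4 preprocessing settings, for 32 candidates in total.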

In [128]:
%%time
# Fit and tune model
b3_test_clf.fit(X_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
Wall time: 20min 40s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('outlier',
                                                                                          FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                                              kw_args={'factor': nan})),
                                                                                         ('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                     

In [129]:
b3_test_clf.best_params_

{'model__selected_model': ('knn', {'n_neighbors': 6}),
 'preprocessor__num__imputer__strategy': 'median',
 'preprocessor__num__outlier__kw_args': {'factor': 3}}

**Validation Prediction**

In [130]:
b3_val_clf = b3_test_clf.best_estimator_
b3_val_clf.fit(X_train2, y_train2)
b3_val_preds = b3_val_clf.predict(X_val)

# Evaluate the model
b3_val_mae = mean_absolute_error(y_val, b3_val_preds)
b3_val_mse = mean_squared_error(y_val, b3_val_preds)
b3_val_r2 = r2_score(y_val, b3_val_preds)

print('MAE:', b3_val_mae)
print("MSE: ",b3_val_mse)
print("R2: ",b3_val_r2)
print("NRMSE: ",-np.sqrt(b3_val_mse))

MAE: 361.27771598071934
MSE:  436370.49618918146
R2:  0.9731410971978187
NRMSE:  -660.5834513437204


**Test Prediction**

In [131]:
b3_Te_clf = b3_test_clf.best_estimator_
b3_Te_clf.fit(X_train, y_train)
b3_test_preds = b3_Te_clf.predict(X_test)

# Evaluate the model
b3_test_mae = mean_absolute_error(y_test, b3_test_preds)
b3_test_mse = mean_squared_error(y_test, b3_test_preds)
b3_test_r2 = r2_score(y_test, b3_test_preds)

print('MAE:', b3_test_mae)
print("MSE: ",b3_test_mse)
print("R2: ",b3_test_r2)
print("NRMSE: ",-np.sqrt(b3_test_mse))

MAE: 348.4425287356322
MSE:  390584.52777777775
R2:  0.9758478797167416
NRMSE:  -624.9676213835223


### Baseline 3 Summary
The model was cross-validated with 5 folds for NRMSE during the grid search, then validated and tested once for all the metrics:
* **MAE**: about 361 on validation and 348 on test; test performed slightly better.
* **MSE**: about 436,000 on validation and 391,000 on test; test performed slightly better.
* **R2**: about 0.973 on validation and 0.976 on test; test performed slightly better.
* **NRMSE**: about -661 on validation and -625 on test; test performed slightly better.

This is a major improvement over Baseline 2: test MAE drops from about 828 to about 348, and test R2 rises from 0.910 to 0.976.

The baseline is saved for comparison as a pandas dataframe: 

In [132]:
baseline3 = pd.DataFrame({"val_mae": b3_val_mae,"val_mse": b3_val_mse,"val_r2": b3_val_r2,"val_nrmse": -np.sqrt(b3_val_mse),
                          "test_mae": b3_test_mae,"test_mse": b3_test_mse,"test_r2": b3_test_r2, "test_nrmse": -np.sqrt(b3_test_mse)}, index=["Baseline3"])
baseline3

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline3 | 361.278 | 436370.496 | 0.973 | -660.583 | 348.443 | 390584.528 | 0.976 | -624.968 |


In [133]:
Baselines = pd.concat([baseline1,baseline2,baseline3])
Baselines

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline1 | 1227.274 | 2652792.313 | 0.837 | -1609.487 | 1249.89 | 2697964.879 | 0.833 | -1642.548 |
| Baseline2 | 829.093 | 1413280.761 | 0.913 | -1188.815 | 827.987 | 1451969.503 | 0.91 | -1204.977 |
| Baseline3 | 361.278 | 436370.496 | 0.973 | -660.583 | 348.443 | 390584.528 | 0.976 | -624.968 |
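
To quantify the gap between baselines, one option is to express each test error as a percentage reduction relative to Baseline 1. The sketch below copies the test MAE and test NRMSE values from the table above into a small frame; the variable names are hypothetical.

```python
import pandas as pd

# Test metrics copied from the comparison table above
baselines = pd.DataFrame(
    {
        "test_mae": [1249.890, 827.987, 348.443],
        "test_nrmse": [-1642.548, -1204.977, -624.968],
    },
    index=["Baseline1", "Baseline2", "Baseline3"],
)

# Percentage reduction of each error relative to Baseline1
mae_reduction_pct = (
    1 - baselines["test_mae"] / baselines.loc["Baseline1", "test_mae"]
) * 100
rmse_reduction_pct = (
    1 - baselines["test_nrmse"] / baselines.loc["Baseline1", "test_nrmse"]
) * 100

print(mae_reduction_pct.round(1))
print(rmse_reduction_pct.round(1))
```

Dividing two negative NRMSE values gives a positive ratio, so the same formula works for both columns.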


## Baseline 4: Boosters

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **continuous** features - tune what counts as an outlier, fill with the median or the mean, then apply standard scaling.
    2. **Feature engineering:** categorical labels only, encoded with OrdinalEncoder.
* **Model training** - cross-validate XGBoost and AdaBoost regressors while grid searching their parameters.
* **Model testing** - validate with grid search and the validation set, then train on the whole train + validation set and use the test data for results.
* **Model evaluation** - record the following metrics for validation and test:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
    4. NRMSE - [Negative root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
    

**Preprocessing**

In [134]:
contiuous_transformer = Pipeline(steps=[
('outlier', Outlier2),
('imputer', SimpleImputer()),
('scaler', StandardScaler())    
])


# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', contiuous_transformer, Continuous),
        ('cuts', cut_enc, ["cut"]),
        ('colors', color_enc, ["color"]),
        ('clarities', clarity_enc, ["clarity"])
    ])

# Bundle preprocessing and modeling code in a pipeline
Baseline4 = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model', PipelineHelper([
                                   ('adb', AdaBoostRegressor(random_state = 42)),
                                   ('xgb', XGBRegressor())
                                                       ]))
                             ])
Baseline4

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                       kw_args={'factor': nan})),
                                                                  ('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                           

In [135]:
Baseline4.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__num', 'preprocessor__cuts', 'preprocessor__colors', 'preprocessor__clarities', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__verbose', 'preprocessor__num__outlier', 'preprocessor__num__imputer', 'preprocessor__num__scaler', 'preprocessor__num__outlier__accept_sparse', 'preprocessor__num__outlier__check_inverse', 'preprocessor__num__outlier__func', 'preprocessor__num__outlier__inv_kw_args', 'preprocessor__num__outlier__inverse_func', 'preprocessor__num__outlier__kw_args', 'preprocessor__num__outlier__validate', 'preprocessor__num__imputer__add_indicator', 'preprocessor__num__imputer__copy', 'preprocessor__num__imputer__fill_value', 'preprocessor__num__imputer__missing_values', 'preprocessor__num__impute

In [136]:
hyperparameters = { 
                   'preprocessor__num__outlier__kw_args':[{'factor':1.5},{'factor':3}],
                  'preprocessor__num__imputer__strategy':['mean','median'],
                    'model__selected_model': Baseline4.named_steps['model'].generate({   
                    'adb__n_estimators': [50,100,200],
                    'adb__learning_rate': np.logspace(0, -2, num=2),
                    'xgb__max_depth' : [2,5,8],
                    'xgb__learning_rate': np.logspace(0, -2, num=2)
    })}

In [137]:
b4_test_clf = GridSearchCV(Baseline4, hyperparameters,cv = 5, scoring = "neg_root_mean_squared_error", n_jobs = -1, verbose = 2) 
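
It is easy to misread how many learning rates `np.logspace(0, -2, num=2)` produces. The sketch below prints the values for `num=2` and, for contrast, `num=3`, and then recomputes the size of the grid defined above from its list lengths. The variable names are hypothetical.

```python
import numpy as np

# num=2 yields only the two endpoints of the range 10**0 .. 10**-2
two_rates = np.logspace(0, -2, num=2)
# num=3 would add the intermediate power of ten
three_rates = np.logspace(0, -2, num=3)
print(two_rates)
print(three_rates)

# Candidate count of the grid above, from the lengths of its lists
adb_settings = 3 * len(two_rates)   # n_estimators x learning_rate
xgb_settings = 3 * len(two_rates)   # max_depth x learning_rate
preprocessing_settings = 2 * 2      # outlier factor x imputer strategy
n_candidates = (adb_settings + xgb_settings) * preprocessing_settings
print(n_candidates)
```

So the search compares only learning rates of 1.0 and 0.01 for both boosters; a finer grid would need a larger `num`.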

**Parameter tuning and grid search**

The outlier factor and the imputation strategy are grid searched, together with a choice between 2 different boosters:

In [138]:
%%time
# Fit and tune model
b4_test_clf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Wall time: 3min 20s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('outlier',
                                                                                          FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                                              kw_args={'factor': nan})),
                                                                                         ('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                     

In [139]:
b4_test_clf.best_params_

{'model__selected_model': ('xgb', {'learning_rate': 1.0, 'max_depth': 5}),
 'preprocessor__num__imputer__strategy': 'median',
 'preprocessor__num__outlier__kw_args': {'factor': 3}}

**Validation Prediction**

In [140]:
b4_val_clf = b4_test_clf.best_estimator_
b4_val_clf.fit(X_train2, y_train2)
b4_val_preds = b4_val_clf.predict(X_val)

# Evaluate the model
b4_val_mae = mean_absolute_error(y_val, b4_val_preds)
b4_val_mse = mean_squared_error(y_val, b4_val_preds)
b4_val_r2 = r2_score(y_val, b4_val_preds)

print('MAE:', b4_val_mae)
print("MSE: ",b4_val_mse)
print("R2: ",b4_val_r2)
print("NRMSE: ",-np.sqrt(b4_val_mse))

MAE: 322.89563765647813
MSE:  404415.59891928424
R2:  0.9751079430027508
NRMSE:  -635.9367884619385


**Test Prediction**

In [141]:
b4_Te_clf = b4_test_clf.best_estimator_
b4_Te_clf.fit(X_train, y_train)
b4_test_preds = b4_Te_clf.predict(X_test)

# Evaluate the model
b4_test_mae = mean_absolute_error(y_test, b4_test_preds)
b4_test_mse = mean_squared_error(y_test, b4_test_preds)
b4_test_r2 = r2_score(y_test, b4_test_preds)

print('MAE:', b4_test_mae)
print("MSE: ",b4_test_mse)
print("R2: ",b4_test_r2)
print("NRMSE: ",-np.sqrt(b4_test_mse))

MAE: 312.3389783611022
MSE:  356963.3903221024
R2:  0.977926870813776
NRMSE:  -597.46413308424


### Baseline 4 Summary
The model was cross-validated with 5 folds on NRMSE, then validated and tested once on all the metrics:
* **MAE**: about 323 on the validation set and 312 on the test set; test performed slightly better.
* **MSE**: about 404,000 on the validation set and 357,000 on the test set; test performed slightly better.
* **R2**: about 0.975 on the validation set and 0.978 on the test set; test performed slightly better.
* **NRMSE**: about -636 on the validation set and -597 on the test set; test performed slightly better (closer to zero).

This is a slight improvement over the previous baseline: test MAE drops from about 348 to 312.


The baseline is saved for comparison as a pandas dataframe: 

In [142]:
baseline4 = pd.DataFrame({"val_mae": b4_val_mae,"val_mse": b4_val_mse,"val_r2": b4_val_r2,"val_nrmse": -np.sqrt(b4_val_mse),
                          "test_mae": b4_test_mae,"test_mse": b4_test_mse,"test_r2": b4_test_r2, "test_nrmse": -np.sqrt(b4_test_mse)}, index=["Baseline4"])
baseline4

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline4 | 322.896 | 404415.599 | 0.975 | -635.937 | 312.339 | 356963.39 | 0.978 | -597.464 |


In [143]:
Baselines = pd.concat([baseline1,baseline2,baseline3,baseline4])
Baselines

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline1 | 1227.274 | 2652792.313 | 0.837 | -1609.487 | 1249.89 | 2697964.879 | 0.833 | -1642.548 |
| Baseline2 | 829.093 | 1413280.761 | 0.913 | -1188.815 | 827.987 | 1451969.503 | 0.91 | -1204.977 |
| Baseline3 | 361.278 | 436370.496 | 0.973 | -660.583 | 348.443 | 390584.528 | 0.976 | -624.968 |
| Baseline4 | 322.896 | 404415.599 | 0.975 | -635.937 | 312.339 | 356963.39 | 0.978 | -597.464 |


## Baseline 5: Forests

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **continuous** - choose what is considered an outlier, fill with median/mean, then apply standard scaling.
    2. **Feature Engineering**: only categorical labels, OrdinalEncoder.
* **Model training** - using cross-validation on Random Forest and Extra Trees.
* **Model testing** - validate with grid search and validation set then train on whole train + validation set and use test data for results.
* **Model evaluating** - record for validation and test the following metrics:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R Square](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
    4. NRMSE - [Negative root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
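
The four metrics listed above can be written out in a few lines. This is a minimal plain-Python sketch for illustration only; the notebook itself relies on library metric functions, and the name `regression_metrics` and the toy values are hypothetical. Note that NRMSE here is the negated RMSE, so values closer to zero are better.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, R2 and negated RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "r2": 1 - ss_res / ss_tot,
        "nrmse": -math.sqrt(mse),  # negated so that "higher is better"
    }


# Toy values, not taken from the diamonds data
metrics = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics)
```

Because the sign is flipped, a score of -473 beats a score of -519 when comparing NRMSE between data sets or models.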
    

**Preprocessing**

In [144]:
# Bundle preprocessing and modeling code in a pipeline
Baseline5 = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model', PipelineHelper([
                                   ('et', ExtraTreesRegressor(random_state = 42)),
                                   ('rf', RandomForestRegressor(random_state = 42))
                                                       ]))
                             ])
Baseline5

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                       kw_args={'factor': nan})),
                                                                  ('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                           

In [145]:
Baseline5.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__num', 'preprocessor__cuts', 'preprocessor__colors', 'preprocessor__clarities', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__verbose', 'preprocessor__num__outlier', 'preprocessor__num__imputer', 'preprocessor__num__scaler', 'preprocessor__num__outlier__accept_sparse', 'preprocessor__num__outlier__check_inverse', 'preprocessor__num__outlier__func', 'preprocessor__num__outlier__inv_kw_args', 'preprocessor__num__outlier__inverse_func', 'preprocessor__num__outlier__kw_args', 'preprocessor__num__outlier__validate', 'preprocessor__num__imputer__add_indicator', 'preprocessor__num__imputer__copy', 'preprocessor__num__imputer__fill_value', 'preprocessor__num__imputer__missing_values', 'preprocessor__num__impute

In [146]:
hyperparameters = {
    # outlier factor and imputation strategy for the numeric columns
    'preprocessor__num__outlier__kw_args': [{'factor': 1.5}, {'factor': 3}],
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    # candidate models, each with its own parameter grid
    'model__selected_model': Baseline5.named_steps['model'].generate({
        'et__n_estimators': [50, 100, 200],
        'et__max_features': [5, 8],
        'rf__n_estimators': [2, 5, 8],
        'rf__max_depth': [2, 5],
    }),
}

In [147]:
b5_test_clf = GridSearchCV(Baseline5, hyperparameters,cv = 5, scoring = "neg_root_mean_squared_error", n_jobs = -1, verbose = 2) 

**Parameter tuning and grid search**

The outlier factor and the imputation strategy are grid searched, together with a choice between two tree-based models (`et` and `rf`):

In [148]:
%%time
# Fit and tune model
b5_test_clf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Wall time: 3min 1s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('outlier',
                                                                                          FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                                              kw_args={'factor': nan})),
                                                                                         ('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                     

In [149]:
b5_test_clf.best_params_

{'model__selected_model': ('et', {'max_features': 5, 'n_estimators': 200}),
 'preprocessor__num__imputer__strategy': 'mean',
 'preprocessor__num__outlier__kw_args': {'factor': 3}}
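
The `Outlier_Detector` function behind the `factor` parameter is defined earlier in the notebook and is not shown in this section. The sketch below is therefore only an assumption about how such a step could behave: it marks values outside Tukey-style fences (`Q1 - factor * IQR`, `Q3 + factor * IQR`) as missing and then fills them with the mean, mirroring the `outlier` and `imputer` steps. The names `mask_outliers` and `impute_mean` are hypothetical.

```python
import math
import statistics


def mask_outliers(values, factor):
    """Replace values outside [Q1 - factor*IQR, Q3 + factor*IQR] with NaN."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v if low <= v <= high else math.nan for v in values]


def impute_mean(values):
    """Fill NaN entries with the mean of the remaining values."""
    kept = [v for v in values if not math.isnan(v)]
    fill = sum(kept) / len(kept)
    return [fill if math.isnan(v) else v for v in values]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 15]  # toy data, not from the diamonds set
strict = impute_mean(mask_outliers(sample, factor=1.5))  # 15 is replaced
loose = impute_mean(mask_outliers(sample, factor=3))     # 15 is kept
print(strict)
print(loose)
```

Under this assumption, a larger factor such as 3 widens the fences, so fewer values are treated as outliers and replaced.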

**Validation Prediction**

In [150]:
b5_val_clf = b5_test_clf.best_estimator_
b5_val_clf.fit(X_train2, y_train2)
b5_val_preds = b5_val_clf.predict(X_val)

# Evaluate the model
b5_val_mae = mean_absolute_error(y_val, b5_val_preds)
b5_val_mse = mean_squared_error(y_val, b5_val_preds)
b5_val_r2 = r2_score(y_val, b5_val_preds)

print('MAE:', b5_val_mae)
print("MSE: ",b5_val_mse)
print("R2: ",b5_val_r2)
print("NRMSE: ",-np.sqrt(b5_val_mse))

MAE: 251.80838338895066
MSE:  269571.66163312626
R2:  0.9834076796638247
NRMSE:  -519.202909885072


**Test Prediction**

In [151]:
b5_Te_clf = b5_test_clf.best_estimator_
b5_Te_clf.fit(X_train, y_train)
b5_test_preds = b5_Te_clf.predict(X_test)

# Evaluate the model
b5_test_mae = mean_absolute_error(y_test, b5_test_preds)
b5_test_mse = mean_squared_error(y_test, b5_test_preds)
b5_test_r2 = r2_score(y_test, b5_test_preds)

print('MAE:', b5_test_mae)
print("MSE: ",b5_test_mse)
print("R2: ",b5_test_r2)
print("NRMSE: ",-np.sqrt(b5_test_mse))

MAE: 241.0921493016932
MSE:  223531.28226708792
R2:  0.9861777565867706
NRMSE:  -472.7909498574269


**Adjusted R2**

In [166]:
n = X_test.shape[0]
p = X_test.shape[1]
b5_ad_r2 = 1-(1-b5_test_r2)*(n-1)/(n-p-1)
print("adjusted R2: ",b5_ad_r2)

adjusted R2:  0.9861314595303066
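
The adjustment penalises R2 for the number of features `p` relative to the number of samples `n`. A small reusable version of the same formula, shown as a sketch with hypothetical names and toy inputs:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Toy inputs: R2 = 0.9 with 101 samples and 10 features
example = adjusted_r2(0.9, n_samples=101, n_features=10)
print(round(example, 4))
```

When `n` is much larger than `p`, the ratio `(n - 1) / (n - p - 1)` is close to 1, which is why the adjusted value reported above differs from the plain test R2 only in the fifth decimal place.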


### Baseline 5 Summary
The model was cross-validated with 5 folds on NRMSE, then validated and tested once on all the metrics:
* **MAE**: about 252 on the validation set and 241 on the test set; test performed slightly better.
* **MSE**: about 270,000 on the validation set and 224,000 on the test set; test performed slightly better.
* **R2**: about 0.983 on the validation set and 0.986 on the test set; test performed slightly better.
* **NRMSE**: about -519 on the validation set and -473 on the test set; test performed slightly better (closer to zero).

This is a major improvement over the previous baseline: test MAE drops from about 312 to 241.

The baseline is saved for comparison as a pandas dataframe: 

In [152]:
baseline5 = pd.DataFrame({"val_mae": b5_val_mae,"val_mse": b5_val_mse,"val_r2": b5_val_r2,"val_nrmse": -np.sqrt(b5_val_mse),
                          "test_mae": b5_test_mae,"test_mse": b5_test_mse,"test_r2": b5_test_r2, "test_nrmse": -np.sqrt(b5_test_mse)}, index=["Baseline5"])
baseline5

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline5 | 251.808 | 269571.662 | 0.983 | -519.203 | 241.092 | 223531.282 | 0.986 | -472.791 |


In [153]:
Baselines = pd.concat([baseline1,baseline2,baseline3,baseline4,baseline5])
Baselines

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline1 | 1227.274 | 2652792.313 | 0.837 | -1609.487 | 1249.89 | 2697964.879 | 0.833 | -1642.548 |
| Baseline2 | 829.093 | 1413280.761 | 0.913 | -1188.815 | 827.987 | 1451969.503 | 0.91 | -1204.977 |
| Baseline3 | 361.278 | 436370.496 | 0.973 | -660.583 | 348.443 | 390584.528 | 0.976 | -624.968 |
| Baseline4 | 322.896 | 404415.599 | 0.975 | -635.937 | 312.339 | 356963.39 | 0.978 | -597.464 |
| Baseline5 | 251.808 | 269571.662 | 0.983 | -519.203 | 241.092 | 223531.282 | 0.986 | -472.791 |


## Baseline 6: Ensemble

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **continuous** - choose what is considered an outlier, fill with median/mean, then apply standard scaling.
    2. **Feature Engineering**: only categorical labels, OrdinalEncoder.
* **Model training** - using cross-validation on a voting regressor that combines gradient boosting and extra trees.
* **Model testing** - validate with grid search and validation set then train on whole train + validation set and use test data for results.
* **Model evaluating** - record for validation and test the following metrics:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R Square](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
    4. NRMSE - [Negative root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
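
A voting regressor combines its members by averaging their predictions, optionally with weights. The sketch below shows that averaging step in plain Python; the function name `voting_average` and the prediction values are hypothetical, and this illustrates the idea rather than the library's implementation.

```python
def voting_average(predictions, weights=None):
    """Average per-model prediction lists element-wise, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(predictions)  # uniform weights by default
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)  # one row per sample, one entry per model
    ]


# Toy predictions from two models for two diamonds (illustrative values)
et_preds = [1000.0, 2500.0]
gb_preds = [1200.0, 2300.0]
combined = voting_average([et_preds, gb_preds])
print(combined)
```

Averaging tends to help most when the members make different kinds of errors, since individual mistakes partly cancel out.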
    

**Preprocessing**

In [154]:
reg1 = ExtraTreesRegressor(random_state = 42)
reg2 = GradientBoostingRegressor(random_state=42)

ereg = VotingRegressor(estimators=[('et', reg1), ('gb', reg2)])

# Bundle preprocessing and modeling code in a pipeline
Baseline6 = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model', ereg)
                             ])
Baseline6

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                       kw_args={'factor': nan})),
                                                                  ('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                           

In [155]:
Baseline6.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__num', 'preprocessor__cuts', 'preprocessor__colors', 'preprocessor__clarities', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__verbose', 'preprocessor__num__outlier', 'preprocessor__num__imputer', 'preprocessor__num__scaler', 'preprocessor__num__outlier__accept_sparse', 'preprocessor__num__outlier__check_inverse', 'preprocessor__num__outlier__func', 'preprocessor__num__outlier__inv_kw_args', 'preprocessor__num__outlier__inverse_func', 'preprocessor__num__outlier__kw_args', 'preprocessor__num__outlier__validate', 'preprocessor__num__imputer__add_indicator', 'preprocessor__num__imputer__copy', 'preprocessor__num__imputer__fill_value', 'preprocessor__num__imputer__missing_values', 'preprocessor__num__impute

In [156]:
hyperparameters = {
                   'preprocessor__num__outlier__kw_args':[{'factor':1.5},{'factor':3}],
                   'preprocessor__num__imputer__strategy':['mean','median'],
                   'model__et__n_estimators': [200],
                   'model__et__max_features': [7,8],
                   "model__et__criterion":["mse"],  # called "squared_error" from scikit-learn 1.0; "mse" was removed in 1.2
                   'model__gb__n_estimators': [250,300]    
}

In [157]:
b6_test_clf = GridSearchCV(Baseline6, hyperparameters,cv = 5, scoring = "neg_root_mean_squared_error", n_jobs = -1, verbose = 2) 

**Parameter tuning and grid search**

the factor of the outlier is being grid searched as well as impute strategy and a decision between the parameters of this ensemble regressor:

In [158]:
%%time
# Fit and tune model
b6_test_clf.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
Wall time: 5min 32s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('outlier',
                                                                                          FunctionTransformer(func=<function Outlier_Detector at 0x000001B614A9A550>,
                                                                                                              kw_args={'factor': nan})),
                                                                                         ('imputer',
                                                                                          SimpleImputer()),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                     

In [159]:
b6_test_clf.best_params_

{'model__et__criterion': 'mse',
 'model__et__max_features': 7,
 'model__et__n_estimators': 200,
 'model__gb__n_estimators': 300,
 'preprocessor__num__imputer__strategy': 'mean',
 'preprocessor__num__outlier__kw_args': {'factor': 3}}

**Validation Prediction**

In [160]:
b6_val_clf = b6_test_clf.best_estimator_
b6_val_clf.fit(X_train2, y_train2)
b6_val_preds = b6_val_clf.predict(X_val)

# Evaluate the model
b6_val_mae = mean_absolute_error(y_val, b6_val_preds)
b6_val_mse = mean_squared_error(y_val, b6_val_preds)
b6_val_r2 = r2_score(y_val, b6_val_preds)

print('MAE:', b6_val_mae)
print("MSE: ",b6_val_mse)
print("R2: ",b6_val_r2)
print("NRMSE: ",-np.sqrt(b6_val_mse))

MAE: 261.1658032916365
MSE:  260160.67152128238
R2:  0.9839869325484577
NRMSE:  -510.0594784152946


**Test Prediction**

In [161]:
b6_Te_clf = b6_test_clf.best_estimator_
b6_Te_clf.fit(X_train, y_train)
b6_test_preds = b6_Te_clf.predict(X_test)

# Evaluate the model
b6_test_mae = mean_absolute_error(y_test, b6_test_preds)
b6_test_mse = mean_squared_error(y_test, b6_test_preds)
b6_test_r2 = r2_score(y_test, b6_test_preds)

print('MAE:', b6_test_mae)
print("MSE: ",b6_test_mse)
print("R2: ",b6_test_r2)
print("NRMSE: ",-np.sqrt(b6_test_mse))

MAE: 255.94975715272105
MSE:  230639.96527167456
R2:  0.985738185239797
NRMSE:  -480.2498987732059


**Adjusted R2**

In [162]:
n = X_test.shape[0]
p = X_test.shape[1]
b6_ad_r2 = 1-(1-b6_test_r2)*(n-1)/(n-p-1)
print("adjusted R2: ",b6_ad_r2)

adjusted R2:  0.9856904158565287


### Baseline 6 Summary
The model was cross-validated with 5 folds on NRMSE, then validated and tested once on all the metrics:
* **MAE**: about 261 on the validation set and 256 on the test set; test performed slightly better.
* **MSE**: about 260,000 on the validation set and 231,000 on the test set; test performed slightly better.
* **R2**: about 0.984 on the validation set and 0.986 on the test set; test performed slightly better.
* **NRMSE**: about -510 on the validation set and -480 on the test set; test performed slightly better (closer to zero).

This baseline performs very well but slightly worse than Baseline 5 on the test set (MAE about 256 vs. 241, NRMSE about -480 vs. -473), although its validation MSE is marginally lower.

Baseline 5 will be used for model deployment.

The baseline is saved for comparison as a pandas dataframe: 

In [163]:
baseline6 = pd.DataFrame({"val_mae": b6_val_mae,"val_mse": b6_val_mse,"val_r2": b6_val_r2,"val_nrmse": -np.sqrt(b6_val_mse),
                          "test_mae": b6_test_mae,"test_mse": b6_test_mse,"test_r2": b6_test_r2, "test_nrmse": -np.sqrt(b6_test_mse)}, index=["Baseline6"])
baseline6

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline6 | 261.166 | 260160.672 | 0.984 | -510.059 | 255.95 | 230639.965 | 0.986 | -480.25 |


The final baselines DataFrame is saved to CSV:

In [164]:
Baselines = pd.concat([baseline1,baseline2,baseline3,baseline4,baseline5,baseline6])
Baselines.to_csv("Data/Baslines.csv")

In [167]:
# Serialize the fitted Baseline 5 pipeline to a pickle file
import pickle

# The context manager closes the file even if pickle.dump raises
with open("Data/b5_Te_clf.pkl", "wb") as pickle_out:
    pickle.dump(b5_Te_clf, pickle_out)
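
Loading the file back is the mirror image of saving it. The sketch below shows a round trip on a stand-in object in a temporary directory; `model_stand_in` is a hypothetical placeholder for the fitted pipeline, and the file name is illustrative.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for the fitted pipeline
model_stand_in = {"name": "b5_Te_clf", "factor": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save: write the object in binary mode
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Load: read it back in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```

Pickle stores functions by reference rather than by value, so a pipeline that wraps a custom function such as `Outlier_Detector` can only be loaded where that function is importable under the same module path. Pickle files should also only be loaded from trusted sources.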

In [165]:
end = timer()
print(f"full code execution time: {timedelta(seconds=end-start)}")

full code execution time: 0:33:59.968145
