# Regression Modelling for Salary Predictor
---

### Table of Contents <a class="anchor" id="toc"></a>

* [Overview](#overview)
* [Importing Libraries](#importinglibraries)
* [Creating Custom Functions](#customfunctions)
* [Regression Modelling (Target Variable 'Average Salary')](#salary)
    * [Null Model](#null)
    * [Linear Regression](#linreg)
    * [LassoCV](#lasso)
    * [RidgeCV](#ridge)
    * [ElasticNetCV](#elasticnet)
    * [AdaBoost Regressor](#ada)
    * [Bagging Regressor](#bagr)
    * [Gradient Boosting Regressor](#gbr)
    * [Random Forest Regressor](#rfr)
    * [Extra Trees Regressor](#etr)
    * [Support Vector Machine for Regression](#svr)
    * [Summary of metrics for Regression Modelling](#summary)    
* [Regression Modelling (Target Variable 'Average Salary' with Logarithmic Function applied)](#salarylog)
    * [Linear Regression](#linreg_log)
    * [LassoCV](#lasso_log)
    * [RidgeCV](#ridge_log)
    * [ElasticNetCV](#elasticnet_log)
    * [AdaBoost Regressor](#ada_log)
    * [Bagging Regressor](#bagr_log)
    * [Gradient Boosting Regressor](#gbr_log)
    * [Random Forest Regressor](#rfr_log)
    * [Extra Trees Regressor](#etr_log)
    * [Support Vector Machine for Regression](#svr_log)
    * [Summary of metrics for Regression Modelling](#summary_log) 


## Overview <a class="anchor" id="overview"></a>
---
[Back to top!](#toc)

### Metrics used for Regressor Model Evaluation

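* Coefficient of Determination, $R^2$ (i.e. Test Score)
    * Goal: Get $R^2$ as close to 1 as possible.
    * Easily interpretable, e.g. "An $R^2$ value of 0.8 means that 80% of the variability in _y_ is explained by the _x_-variables in our model."
* Mean Absolute Error
    * Goal: Get Mean Absolute Error as close to 0 as possible.
    * Represents the mean absolute distance between the predicted and actual values, in the units of the target.
* Mean Percentage Absolute Error
    * Goal: Get Mean Percentage Absolute Error as close to 0 as possible.
    * Derived from the Mean Absolute Error: the Mean Absolute Error divided by the mean of the actual test values, multiplied by 100.
* Median Absolute Error and Median Percentage Absolute Error
    * Goal: Get both as close to 0 as possible.
    * Computed in the same way, using the median in place of the mean, so they are less sensitive to outlying salaries.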
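
As a rough illustration of how these metrics behave, the sketch below computes them on four invented salary values (the arrays `y_true` and `y_pred` are made up for this example and do not come from the dataset). It also shows that the "percentage" metric defined here (error divided by the mean target) is not the same quantity as the conventional mean absolute percentage error, which averages the per-row relative errors.

```python
import numpy as np
from sklearn.metrics import (r2_score, mean_absolute_error,
                             median_absolute_error,
                             mean_absolute_percentage_error)

# Invented values for illustration only (not taken from the salary data)
y_true = np.array([4000.0, 6000.0, 8000.0, 10000.0])
y_pred = np.array([4500.0, 5500.0, 9000.0, 9000.0])

r2 = r2_score(y_true, y_pred)                      # 1 - SS_res / SS_tot
mean_ae = mean_absolute_error(y_true, y_pred)      # mean of |y_true - y_pred|
median_ae = median_absolute_error(y_true, y_pred)  # median of |y_true - y_pred|

# Percentage variants as defined in this notebook:
# absolute error relative to the mean (or median) of the actual values
mean_pae = mean_ae / y_true.mean() * 100
median_pae = median_ae / np.median(y_true) * 100

# Conventional MAPE for comparison: mean of per-row |error| / |actual|
mape = mean_absolute_percentage_error(y_true, y_pred) * 100

print(f'R2: {r2:.3f}, MAE: {mean_ae:.1f}, Median AE: {median_ae:.1f}')
print(f'Mean PAE: {mean_pae:.2f}%, Median PAE: {median_pae:.2f}%, MAPE: {mape:.2f}%')
```

The two percentage figures differ because the notebook's version weights every row by the same denominator, whereas conventional MAPE penalises errors on low salaries more heavily.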

## Importing Libraries <a class="anchor" id="importinglibraries"></a>
---
[Back to top!](#toc)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_absolute_error, median_absolute_error

# limit displayed floats to 3 decimal places
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

## Custom Functions <a class="anchor" id="customfunctions"></a>
---
[Back to top!](#toc)

In [2]:
def get_scores(model_name, pipeline):
    print(model_name)
    print('------------')
    train_score = pipeline.score(X_train, y_train)
    print(f'Training score: {round(train_score, 4)}')
    test_score = pipeline.score(X_test, y_test)
    print(f'Testing score: {round(test_score, 4)}')

In [3]:
from sklearn.metrics import mean_absolute_error, r2_score, median_absolute_error

def get_evaluation_metrics(model_name, preds):
    print(model_name)
    print('------------')
    
    #r2
    r2 = r2_score(y_test, preds)
    print(f'R2 Score: {round(r2, 4)}')
    
    # Mean Absolute Error
    mean_ae = mean_absolute_error(y_test, preds)
    print(f'Mean Absolute Error: {round(mean_ae, 4)}')
    
    # Mean Percentage Absolute Error (MAE relative to the mean of y_test)
    mean_pae = (mean_ae / y_test.mean())*100
    print(f'Mean Percentage Absolute Error: {round(mean_pae, 4)}')
    
    # Median Absolute Error
    median_ae = median_absolute_error(y_test, preds)
    print(f'Median Absolute Error: {round(median_ae, 4)}')

    # Median Percentage Absolute Error
    median_pae = (median_ae / y_test.median())*100
    print(f'Median Percentage Absolute Error: {round(median_pae, 4)}')

In [4]:
from sklearn.metrics import mean_absolute_error, r2_score, median_absolute_error

def get_evaluation_metrics_log(model_name, preds, preds_exp):
    print(model_name)
    print('------------')
    
    # R2 on the log scale (y_test and preds are both log-transformed)
    r2 = r2_score(y_test, preds)
    print(f'R2 Score: {round(r2, 4)}')
    
    # Mean Absolute Error
    mean_ae = mean_absolute_error(y_test_exp, preds_exp)
    print(f'Mean Absolute Error: {round(mean_ae, 4)}')
    
    # Mean Percentage Absolute Error (MAE relative to the mean of y_test_exp)
    mean_pae = (mean_ae / y_test_exp.mean())*100
    print(f'Mean Percentage Absolute Error: {round(mean_pae, 4)}')
    
    # Median Absolute Error
    median_ae = median_absolute_error(y_test_exp, preds_exp)
    print(f'Median Absolute Error: {round(median_ae, 4)}')

    # Median Percentage Absolute Error
    median_pae = (median_ae / y_test_exp.median())*100
    print(f'Median Percentage Absolute Error: {round(median_pae, 4)}')

## Regression Modelling for `salary_average` <a class="anchor" id="salary"></a>
---
[Back to top!](#toc)

In [5]:
df_v1 = pd.read_csv('../data/modelling_dataset_v1.csv')
df_v1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3969 entries, 0 to 3968
Data columns (total 92 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   minimum_years_experience           3969 non-null   int64  
 1   days_new_posting_closing           3969 non-null   float64
 2   days_original_posting_closing      3969 non-null   float64
 3   salary_average                     3969 non-null   float64
 4   job_title_wordcount                3969 non-null   int64  
 5   job_title_charcount                3969 non-null   int64  
 6   job_description_wordcount          3969 non-null   int64  
 7   job_description_charcount          3969 non-null   int64  
 8   skills_num                         3969 non-null   int64  
 9   skill_tableau                      3969 non-null   int64  
 10  skill_datawarehouse                3969 non-null   int64  
 11  skill_agile                        3969 non-null   int64

### Null Model <a class="anchor" id="null"></a>
[Back to top!](#toc)

In [6]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [7]:
# baseline error: predict the mean of y_test for every observation
y_bar = y_test.mean()
null_resids = y_test - y_bar
null_mae = np.abs(null_resids).mean()
print(f'Null Mean Absolute Error: {null_mae}')

mean_pae = (null_mae / y_test.mean())*100
print(f'Null Mean Absolute Percentage Error: {mean_pae}')

Null Mean Absolute Error: 2323.868493350736
Null Mean Absolute Percentage Error: 31.87748496380985
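
The cell above takes the mean of `y_test` itself as the constant prediction. A slightly stricter baseline learns that constant from the training split only, which is what scikit-learn's `DummyRegressor` does. The sketch below uses synthetic data (`X_demo`, `y_demo` and the other names are invented for illustration), so its numbers are not comparable with the output above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data (assumption, illustration only)
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 3))
y_demo = 7000 + 1500 * X_demo[:, 0] + rng.normal(scale=500, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# The null model ignores the features and always predicts the training mean
null_model = DummyRegressor(strategy='mean')
null_model.fit(X_tr, y_tr)
null_preds = null_model.predict(X_te)

null_mae_demo = mean_absolute_error(y_te, null_preds)
null_r2_demo = null_model.score(X_te, y_te)  # never above 0 for a constant prediction

print(f'Null MAE: {null_mae_demo:.2f}')
print(f'Null R2: {null_r2_demo:.4f}')
```

Any model worth keeping should beat this baseline on both metrics.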


### Linear Regression <a class="anchor" id="linreg"></a>
[Back to top!](#toc)

In [8]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()

In [9]:
get_scores('Linear Regression', linreg)

# keeping variable for summary input later
train_score_linreg = linreg.score(X_train, y_train)

Linear Regression
------------
Training score: 0.5427
Testing score: 0.5267


In [10]:
# Get predictions
pred_linreg = linreg.predict(X_test)

# Get model metrics
get_evaluation_metrics('Linear Regression', pred_linreg)

Linear Regression
------------
R2 Score: 0.5267
Mean Absolute Error: 1572.5177
Mean Percentage Absolute Error: 21.5709
Median Absolute Error: 1176.732
Median Percentage Absolute Error: 16.8105
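
Every model section in this notebook repeats the same split, scale and fit steps. One possible way to bundle the scaling and the estimator is a scikit-learn `Pipeline`, sketched below on synthetic data (all `*_demo` names and coefficients are invented for illustration). This is an alternative arrangement, not the approach used in the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the modelling dataset (assumption, illustration only)
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 4))
coefs = np.array([1500.0, -800.0, 0.0, 300.0])
y_demo = 7000 + X_demo @ coefs + rng.normal(scale=400, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# The scaler is fitted on the training split only, inside pipe.fit(),
# and is re-applied automatically by score() and predict()
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

train_r2 = pipe.score(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(f'Training score: {train_r2:.4f}')
print(f'Testing score: {test_r2:.4f}')
```

With this arrangement the unscaled `X_tr` and `X_te` can be passed around directly, and swapping `LinearRegression()` for another estimator changes a single line.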


### LassoCV <a class="anchor" id="lasso"></a>
[Back to top!](#toc)

In [11]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

lasso = LassoCV(n_alphas=200)
lasso.fit(X_train, y_train)

LassoCV(n_alphas=200)

In [12]:
get_scores('LassoCV', lasso)

# keeping variable for summary input later
train_score_lasso = lasso.score(X_train, y_train)

LassoCV
------------
Training score: 0.5399
Testing score: 0.5305


In [13]:
# Get predictions
pred_lasso = lasso.predict(X_test)

# Get model metrics
get_evaluation_metrics('LassoCV', pred_lasso)

LassoCV
------------
R2 Score: 0.5305
Mean Absolute Error: 1557.3892
Mean Percentage Absolute Error: 21.3634
Median Absolute Error: 1193.9285
Median Percentage Absolute Error: 17.0561


### RidgeCV <a class="anchor" id="ridge"></a>
[Back to top!](#toc)

In [14]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

ridge = RidgeCV(alphas=np.linspace(.1, 10, 100))
ridge.fit(X_train, y_train)

RidgeCV(alphas=array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,
        1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,
        2.3,  2.4,  2.5,  2.6,  2.7,  2.8,  2.9,  3. ,  3.1,  3.2,  3.3,
        3.4,  3.5,  3.6,  3.7,  3.8,  3.9,  4. ,  4.1,  4.2,  4.3,  4.4,
        4.5,  4.6,  4.7,  4.8,  4.9,  5. ,  5.1,  5.2,  5.3,  5.4,  5.5,
        5.6,  5.7,  5.8,  5.9,  6. ,  6.1,  6.2,  6.3,  6.4,  6.5,  6.6,
        6.7,  6.8,  6.9,  7. ,  7.1,  7.2,  7.3,  7.4,  7.5,  7.6,  7.7,
        7.8,  7.9,  8. ,  8.1,  8.2,  8.3,  8.4,  8.5,  8.6,  8.7,  8.8,
        8.9,  9. ,  9.1,  9.2,  9.3,  9.4,  9.5,  9.6,  9.7,  9.8,  9.9,
       10. ]))

In [15]:
get_scores('RidgeCV', ridge)

# keeping variable for summary input later
train_score_ridge = ridge.score(X_train, y_train)

RidgeCV
------------
Training score: 0.5427
Testing score: 0.5269


In [16]:
# Get predictions
pred_ridge = ridge.predict(X_test)

# Get model metrics
get_evaluation_metrics('RidgeCV', pred_ridge)

RidgeCV
------------
R2 Score: 0.5269
Mean Absolute Error: 1572.0656
Mean Percentage Absolute Error: 21.5647
Median Absolute Error: 1175.4893
Median Percentage Absolute Error: 16.7927


### ElasticNetCV <a class="anchor" id="elasticnet"></a>
[Back to top!](#toc)

In [17]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

enet = ElasticNetCV()
enet.fit(X_train, y_train)

ElasticNetCV()

In [18]:
get_scores('ElasticNetCV', enet)

# keeping variable for summary input later
train_score_enet = enet.score(X_train, y_train)

ElasticNetCV
------------
Training score: 0.4001
Testing score: 0.4084


In [19]:
# Get predictions
pred_enet = enet.predict(X_test)

# Get model metrics
get_evaluation_metrics('ElasticNetCV', pred_enet)

ElasticNetCV
------------
R2 Score: 0.4084
Mean Absolute Error: 1757.0581
Mean Percentage Absolute Error: 24.1023
Median Absolute Error: 1348.8741
Median Percentage Absolute Error: 19.2696


### AdaBoost Regressor <a class="anchor" id="ada"></a>
[Back to top!](#toc)

In [20]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

ada = AdaBoostRegressor(n_estimators=100, random_state=123)
ada.fit(X_train, y_train)

AdaBoostRegressor(n_estimators=100, random_state=123)

In [21]:
get_scores('AdaBoost Regressor', ada)

# keeping variable for summary input later
train_score_ada = ada.score(X_train, y_train)

AdaBoost Regressor
------------
Training score: 0.3725
Testing score: 0.327


In [22]:
# Get predictions
pred_ada = ada.predict(X_test)

# Get model metrics
get_evaluation_metrics('AdaBoost Regressor', pred_ada)

AdaBoost Regressor
------------
R2 Score: 0.327
Mean Absolute Error: 2068.4509
Mean Percentage Absolute Error: 28.3738
Median Absolute Error: 1832.9211
Median Percentage Absolute Error: 26.1846


### Bagging Regressor <a class="anchor" id="bagr"></a>
[Back to top!](#toc)

In [23]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

# note: scikit-learn >= 1.2 renames `base_estimator` to `estimator`
bagr = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=123)
bagr.fit(X_train, y_train)

BaggingRegressor(base_estimator=SVR(), random_state=123)

In [24]:
get_scores('Bagging Regressor', bagr)

# keeping variable for summary input later
train_score_bagr = bagr.score(X_train, y_train)

Bagging Regressor
------------
Training score: -0.0288
Testing score: -0.0261


In [25]:
# Get predictions
pred_bagr = bagr.predict(X_test)

# Get model metrics
get_evaluation_metrics('Bagging Regressor', pred_bagr)

Bagging Regressor
------------
R2 Score: -0.0261
Mean Absolute Error: 2275.5426
Mean Percentage Absolute Error: 31.2146
Median Absolute Error: 1675.107
Median Percentage Absolute Error: 23.9301


### Gradient Boosting Regressor <a class="anchor" id="gbr"></a>
[Back to top!](#toc)

In [26]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

gbr = GradientBoostingRegressor(random_state=123)
gbr.fit(X_train, y_train)

GradientBoostingRegressor(random_state=123)

In [27]:
get_scores('Gradient Boosting Regressor', gbr)

# keeping variable for summary input later
train_score_gbr = gbr.score(X_train, y_train)

Gradient Boosting Regressor
------------
Training score: 0.6212
Testing score: 0.5301


In [28]:
# Get predictions
pred_gbr = gbr.predict(X_test)

# Get model metrics
get_evaluation_metrics('Gradient Boosting Regressor', pred_gbr)

Gradient Boosting Regressor
------------
R2 Score: 0.5301
Mean Absolute Error: 1539.8522
Mean Percentage Absolute Error: 21.1228
Median Absolute Error: 1164.8923
Median Percentage Absolute Error: 16.6413


### Random Forest Regressor <a class="anchor" id="rfr"></a>
[Back to top!](#toc)

In [29]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

rfr = RandomForestRegressor(max_depth=100)
rfr.fit(X_train, y_train)

RandomForestRegressor(max_depth=100)

In [30]:
get_scores('Random Forest Regressor', rfr)

# keeping variable for summary input later
train_score_rfr = rfr.score(X_train, y_train)

Random Forest Regressor
------------
Training score: 0.9332
Testing score: 0.5545


In [31]:
# Get predictions
pred_rfr = rfr.predict(X_test)

# Get model metrics
get_evaluation_metrics('Random Forest Regressor', pred_rfr)

Random Forest Regressor
------------
R2 Score: 0.5545
Mean Absolute Error: 1471.8029
Mean Percentage Absolute Error: 20.1893
Median Absolute Error: 1050.0
Median Percentage Absolute Error: 15.0


### Extra Trees Regressor <a class="anchor" id="etr"></a>
[Back to top!](#toc)

In [32]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

etr = ExtraTreesRegressor(n_estimators=150)
etr.fit(X_train, y_train)

ExtraTreesRegressor(n_estimators=150)

In [33]:
get_scores('Extra Trees Regressor', etr)

# keeping variable for summary input later
train_score_etr = etr.score(X_train, y_train)

Extra Trees Regressor
------------
Training score: 0.998
Testing score: 0.5399


In [34]:
# Get predictions
pred_etr = etr.predict(X_test)

# Get model metrics
get_evaluation_metrics('Extra Trees Regressor', pred_etr)

Extra Trees Regressor
------------
R2 Score: 0.5399
Mean Absolute Error: 1465.1384
Mean Percentage Absolute Error: 20.0979
Median Absolute Error: 1006.36
Median Percentage Absolute Error: 14.3766


### Support Vector Machine for Regression <a class="anchor" id="svr"></a>
[Back to top!](#toc)

In [35]:
features = [col for col in df_v1.columns if col != 'salary_average']
X = df_v1[features]
y = df_v1['salary_average']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

svr = SVR(C=1.0, epsilon=0.2)
svr.fit(X_train, y_train)

SVR(epsilon=0.2)

In [36]:
get_scores('SVM for Regression', svr)

# keeping variable for summary input later
train_score_svr = svr.score(X_train, y_train)

SVM for Regression
------------
Training score: -0.0264
Testing score: -0.0238


In [37]:
# Get predictions
pred_svr = svr.predict(X_test)

# Get model metrics
get_evaluation_metrics('SVM for Regression', pred_svr)

SVM for Regression
------------
R2 Score: -0.0238
Mean Absolute Error: 2274.5381
Mean Percentage Absolute Error: 31.2008
Median Absolute Error: 1693.8366
Median Percentage Absolute Error: 24.1977
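
A negative $R^2$ means the SVR does no better than a constant prediction. One plausible explanation, offered here as an assumption and not verified on this dataset, is that the target is in the thousands while `C=1.0` caps how far each support vector can move the prediction. The sketch below reproduces that effect on synthetic data (all `*_demo` names are invented) and shows that standardising the target via `TransformedTargetRegressor` removes it.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with a salary-like target scale (assumption, illustration only)
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 1))
y_demo = 7000 + 2000 * X_demo[:, 0] + rng.normal(scale=200, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# Same hyperparameters as the cell above, fitted on the raw target
svr_raw = SVR(C=1.0, epsilon=0.2).fit(X_tr, y_tr)
r2_raw = svr_raw.score(X_te, y_te)

# Same SVR, but the target is standardised before fitting and
# predictions are mapped back to the original scale automatically
svr_scaled = TransformedTargetRegressor(
    regressor=SVR(C=1.0, epsilon=0.2),
    transformer=StandardScaler(),
)
svr_scaled.fit(X_tr, y_tr)
r2_scaled = svr_scaled.score(X_te, y_te)

print(f'R2 with raw target: {r2_raw:.4f}')
print(f'R2 with standardised target: {r2_scaled:.4f}')
```

Tuning `C` upwards by several orders of magnitude would be another way to address the same issue.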


### Summary of metrics for regression modelling of target variable `salary_average` <a class="anchor" id="summary"></a>
[Back to top!](#toc)

In [38]:
# list of dict
summary_v1 = [{'Model': 'Linear Regression v1', 
            'Train Score': train_score_linreg, 
            'Test Score': r2_score(y_test, pred_linreg),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_linreg), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_linreg) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_linreg),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_linreg) / y_test.median())*100},
           {'Model': 'LassoCV v1', 
            'Train Score': train_score_lasso, 
            'Test Score': r2_score(y_test, pred_lasso),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_lasso), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_lasso) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_lasso),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_lasso) / y_test.median())*100},
           {'Model': 'RidgeCV v1', 
            'Train Score': train_score_ridge, 
            'Test Score': r2_score(y_test, pred_ridge),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_ridge), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_ridge) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_ridge),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_ridge) / y_test.median())*100},
           {'Model': 'ElasticnetCV v1', 
            'Train Score': train_score_enet, 
            'Test Score': r2_score(y_test, pred_enet),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_enet), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_enet) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_enet),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_enet) / y_test.median())*100},
           {'Model': 'AdaBoost Regressor v1', 
            'Train Score': train_score_ada, 
            'Test Score': r2_score(y_test, pred_ada),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_ada), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_ada) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_ada),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_ada) / y_test.median())*100},
           {'Model': 'Bagging Regressor v1', 
            'Train Score': train_score_bagr, 
            'Test Score': r2_score(y_test, pred_bagr),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_bagr), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_bagr) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_bagr),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_bagr) / y_test.median())*100},
           {'Model': 'Gradient Boosting Regressor v1', 
            'Train Score': train_score_gbr, 
            'Test Score': r2_score(y_test, pred_gbr),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_gbr), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_gbr) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_gbr),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_gbr) / y_test.median())*100},
           {'Model': 'Random Forest Regressor v1', 
            'Train Score': train_score_rfr, 
            'Test Score': r2_score(y_test, pred_rfr),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_rfr), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_rfr) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_rfr),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_rfr) / y_test.median())*100},
           {'Model': 'Extra Trees Regressor v1', 
            'Train Score': train_score_etr, 
            'Test Score': r2_score(y_test, pred_etr),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_etr), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_etr) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_etr),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_etr) / y_test.median())*100},
           {'Model': 'Support Vector Machine for Regression v1', 
            'Train Score': train_score_svr, 
            'Test Score': r2_score(y_test, pred_svr),
            'Mean Absolute Error': mean_absolute_error(y_test, pred_svr), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test, pred_svr) / y_test.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test, pred_svr),
            'Median Percentage Absolute Error': (median_absolute_error(y_test, pred_svr) / y_test.median())*100}]

summary_v1 = pd.DataFrame(summary_v1)
summary_v1.round(4) # round to 4 decimal places (display is limited to 3 by the float_format option set earlier)

| | Model | Train Score | Test Score | Mean Absolute Error | Mean Percentage Absolute Error | Median Absolute Error | Median Percentage Absolute Error |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression v1 | 0.543 | 0.527 | 1572.518 | 21.571 | 1176.732 | 16.811 |
| 1 | LassoCV v1 | 0.54 | 0.53 | 1557.389 | 21.363 | 1193.928 | 17.056 |
| 2 | RidgeCV v1 | 0.543 | 0.527 | 1572.066 | 21.565 | 1175.489 | 16.793 |
| 3 | ElasticnetCV v1 | 0.4 | 0.408 | 1757.058 | 24.102 | 1348.874 | 19.27 |
| 4 | AdaBoost Regressor v1 | 0.372 | 0.327 | 2068.451 | 28.374 | 1832.921 | 26.185 |
| 5 | Bagging Regressor v1 | -0.029 | -0.026 | 2275.543 | 31.215 | 1675.107 | 23.93 |
| 6 | Gradient Boosting Regressor v1 | 0.621 | 0.53 | 1539.852 | 21.123 | 1164.892 | 16.641 |
| 7 | Random Forest Regressor v1 | 0.933 | 0.554 | 1471.803 | 20.189 | 1050.0 | 15.0 |
| 8 | Extra Trees Regressor v1 | 0.998 | 0.54 | 1465.138 | 20.098 | 1006.36 | 14.377 |
| 9 | Support Vector Machine for Regression v1 | -0.026 | -0.024 | 2274.538 | 31.201 | 1693.837 | 24.198 |


In [39]:
# exporting .csv for executive summary at the top
summary_v1.to_csv('../data/model_metrics_v1.csv', index=False)
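
The summary cell above writes out the same seven keys for each of the ten models. A small helper can build each row instead, which keeps the metric definitions in one place. The sketch below is one possible refactor, shown on invented toy values; `summarise_model`, `demo_models` and the model names are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error, median_absolute_error

def summarise_model(name, train_score, y_true, preds):
    """Return one summary row using the same metric definitions as the cells above."""
    mean_ae = mean_absolute_error(y_true, preds)
    median_ae = median_absolute_error(y_true, preds)
    return {
        'Model': name,
        'Train Score': train_score,
        'Test Score': r2_score(y_true, preds),
        'Mean Absolute Error': mean_ae,
        'Mean Percentage Absolute Error': mean_ae / np.mean(y_true) * 100,
        'Median Absolute Error': median_ae,
        'Median Percentage Absolute Error': median_ae / np.median(y_true) * 100,
    }

# Toy targets and predictions, invented for illustration only
y_demo = np.array([4000.0, 6000.0, 8000.0, 10000.0])
demo_models = {
    'Model A (hypothetical)': (0.90, np.array([4500.0, 5500.0, 9000.0, 9000.0])),
    'Model B (hypothetical)': (0.95, y_demo.copy()),
}

# One row per model, built by the helper
summary_demo = pd.DataFrame(
    [summarise_model(name, train_score, y_demo, preds)
     for name, (train_score, preds) in demo_models.items()]
)
print(summary_demo.round(4))
```

In the notebook itself, the dictionary would map each model label to its stored training score and its test-set predictions.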

## Regression Modelling for `salary_average_log` <a class="anchor" id="salarylog"></a>
---
[Back to top!](#toc)

In [40]:
df_v2 = pd.read_csv('../data/modelling_dataset_v2.csv')
df_v2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3969 entries, 0 to 3968
Data columns (total 92 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   minimum_years_experience           3969 non-null   int64  
 1   days_new_posting_closing           3969 non-null   float64
 2   days_original_posting_closing      3969 non-null   float64
 3   job_title_wordcount                3969 non-null   int64  
 4   job_title_charcount                3969 non-null   int64  
 5   job_description_wordcount          3969 non-null   int64  
 6   job_description_charcount          3969 non-null   int64  
 7   skills_num                         3969 non-null   int64  
 8   skill_tableau                      3969 non-null   int64  
 9   skill_datawarehouse                3969 non-null   int64  
 10  skill_agile                        3969 non-null   int64  
 11  skill_aws                          3969 non-null   int64

### Linear Regression <a class="anchor" id="linreg_log"></a>
[Back to top!](#toc)

In [41]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

linreg_log = LinearRegression()
linreg_log.fit(X_train, y_train)

LinearRegression()

In [42]:
get_scores('Linear Regression', linreg_log)

# keeping variable for summary input later
train_score_linreg_log = linreg_log.score(X_train, y_train)

Linear Regression
------------
Training score: 0.5645
Testing score: 0.561


In [43]:
# Get predictions
pred_linreg_log = linreg_log.predict(X_test)
pred_linreg_exp = np.exp(pred_linreg_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('Linear Regression', pred_linreg_log, pred_linreg_exp)

Linear Regression
------------
R2 Score: 0.561
Mean Absolute Error: 1567.7916
Mean Percentage Absolute Error: 21.5061
Median Absolute Error: 1089.2433
Median Percentage Absolute Error: 15.5606
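
In this section the model is fitted on the log of the salary, $R^2$ is reported on the log scale, and the error metrics are reported after mapping predictions back with `np.exp`. scikit-learn's `TransformedTargetRegressor` can perform the same log and exp steps internally. The sketch below, on synthetic data with invented `*_demo` names, checks that it gives the same predictions as the manual route used above. Note that its `score()` would return $R^2$ on the original scale, which is not the same number as the log-scale $R^2$ printed above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic positive target that is linear on the log scale (assumption, illustration only)
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 2))
log_y = 8.8 + 0.3 * X_demo[:, 0] - 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
y_demo = np.exp(log_y)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# Wrapper route: log is applied before fitting, exp after predicting
log_linreg = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
log_linreg.fit(X_tr, y_tr)
preds_demo = log_linreg.predict(X_te)  # already on the original scale

# Manual route, mirroring the cells above
manual = LinearRegression().fit(X_tr, np.log(y_tr))
preds_manual = np.exp(manual.predict(X_te))

mae_demo = mean_absolute_error(y_te, preds_demo)
print(f'MAE on original scale: {mae_demo:.2f}')
print(f'Routes agree: {np.allclose(preds_demo, preds_manual)}')
```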


### LassoCV <a class="anchor" id="lasso_log"></a>
[Back to top!](#toc)

In [44]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

lasso_log = LassoCV(n_alphas=200)
lasso_log.fit(X_train, y_train)

LassoCV(n_alphas=200)

In [45]:
get_scores('LassoCV', lasso_log)

# keeping variable for summary input later
train_score_lasso_log = lasso_log.score(X_train, y_train)

LassoCV
------------
Training score: 0.5607
Testing score: 0.5635


In [46]:
# Get predictions
pred_lasso_log = lasso_log.predict(X_test)
pred_lasso_exp = np.exp(pred_lasso_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('LassoCV', pred_lasso_log, pred_lasso_exp)

LassoCV
------------
R2 Score: 0.5635
Mean Absolute Error: 1558.8956
Mean Percentage Absolute Error: 21.384
Median Absolute Error: 1093.647
Median Percentage Absolute Error: 15.6235


### RidgeCV <a class="anchor" id="ridge_log"></a>
[Back to top!](#toc)

In [47]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

ridge_log = RidgeCV(alphas=np.linspace(.1, 10, 100))
ridge_log.fit(X_train, y_train)

RidgeCV(alphas=array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,
        1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,
        2.3,  2.4,  2.5,  2.6,  2.7,  2.8,  2.9,  3. ,  3.1,  3.2,  3.3,
        3.4,  3.5,  3.6,  3.7,  3.8,  3.9,  4. ,  4.1,  4.2,  4.3,  4.4,
        4.5,  4.6,  4.7,  4.8,  4.9,  5. ,  5.1,  5.2,  5.3,  5.4,  5.5,
        5.6,  5.7,  5.8,  5.9,  6. ,  6.1,  6.2,  6.3,  6.4,  6.5,  6.6,
        6.7,  6.8,  6.9,  7. ,  7.1,  7.2,  7.3,  7.4,  7.5,  7.6,  7.7,
        7.8,  7.9,  8. ,  8.1,  8.2,  8.3,  8.4,  8.5,  8.6,  8.7,  8.8,
        8.9,  9. ,  9.1,  9.2,  9.3,  9.4,  9.5,  9.6,  9.7,  9.8,  9.9,
       10. ]))

In [48]:
get_scores('RidgeCV', ridge_log)

# keeping variable for summary input later
train_score_ridge_log = ridge_log.score(X_train, y_train)

RidgeCV
------------
Training score: 0.5645
Testing score: 0.5611


In [49]:
# Get predictions
pred_ridge_log = ridge_log.predict(X_test)
pred_ridge_exp = np.exp(pred_ridge_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('RidgeCV', pred_ridge_log, pred_ridge_exp)

RidgeCV
------------
R2 Score: 0.5611
Mean Absolute Error: 1567.3258
Mean Percentage Absolute Error: 21.4997
Median Absolute Error: 1091.4924
Median Percentage Absolute Error: 15.5927


### ElasticNetCV <a class="anchor" id="elasticnet_log"></a>
[Back to top!](#toc)

In [50]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

enet_log = ElasticNetCV()
enet_log.fit(X_train, y_train)

ElasticNetCV()

In [51]:
get_scores('ElasticNetCV', enet_log)

# keeping variable for summary input later
train_score_enet_log = enet_log.score(X_train, y_train)

ElasticNetCV
------------
Training score: 0.5605
Testing score: 0.5635


In [52]:
# Get predictions
pred_enet_log = enet_log.predict(X_test)
pred_enet_exp = np.exp(pred_enet_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('ElasticNetCV', pred_enet_log, pred_enet_exp)

ElasticNetCV
------------
R2 Score: 0.5635
Mean Absolute Error: 1558.5253
Mean Percentage Absolute Error: 21.3789
Median Absolute Error: 1098.4315
Median Percentage Absolute Error: 15.6919


### AdaBoost Regressor <a class="anchor" id="ada_log"></a>
[Back to top!](#toc)

In [53]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

ada_log = AdaBoostRegressor(n_estimators=100, random_state=123)
ada_log.fit(X_train, y_train)

AdaBoostRegressor(n_estimators=100, random_state=123)

In [54]:
get_scores('AdaBoost Regressor', ada_log)

# keeping variable for summary input later
train_score_ada_log = ada_log.score(X_train, y_train)

AdaBoost Regressor
------------
Training score: 0.4806
Testing score: 0.4652


In [55]:
# Get predictions
pred_ada_log = ada_log.predict(X_test)
pred_ada_exp = np.exp(pred_ada_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('AdaBoost Regressor', pred_ada_log, pred_ada_exp)

AdaBoost Regressor
------------
R2 Score: 0.4652
Mean Absolute Error: 1675.7467
Mean Percentage Absolute Error: 22.9869
Median Absolute Error: 1256.0955
Median Percentage Absolute Error: 17.9442


### Bagging Regressor <a class="anchor" id="bagr_log"></a>
[Back to top!](#toc)

In [56]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

# note: scikit-learn 1.2+ renames `base_estimator` to `estimator`
bagr_log = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=123)
bagr_log.fit(X_train, y_train)

BaggingRegressor(base_estimator=SVR(), random_state=123)

In [57]:
get_scores('Bagging Regressor', bagr_log)

# keeping variable for summary input later
train_score_bagr_log = bagr_log.score(X_train, y_train)

Bagging Regressor
------------
Training score: 0.7475
Testing score: 0.5342


In [58]:
# Get predictions
pred_bagr_log = bagr_log.predict(X_test)
pred_bagr_exp = np.exp(pred_bagr_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('Bagging Regressor', pred_bagr_log, pred_bagr_exp)

Bagging Regressor
------------
R2 Score: 0.5342
Mean Absolute Error: 1543.7526
Mean Percentage Absolute Error: 21.1763
Median Absolute Error: 1058.33
Median Percentage Absolute Error: 15.119


### Gradient Boosting Regressor <a class="anchor" id="gbr_log"></a>
[Back to top!](#toc)

In [59]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

gbr_log = GradientBoostingRegressor(random_state=123)
gbr_log.fit(X_train, y_train)

GradientBoostingRegressor(random_state=123)

In [60]:
get_scores('Gradient Boosting Regressor', gbr_log)

# keeping variable for summary input later
train_score_gbr_log = gbr_log.score(X_train, y_train)

Gradient Boosting Regressor
------------
Training score: 0.6377
Testing score: 0.5765


In [61]:
# Get predictions
pred_gbr_log = gbr_log.predict(X_test)
pred_gbr_exp = np.exp(pred_gbr_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('Gradient Boosting Regressor', pred_gbr_log, pred_gbr_exp)

Gradient Boosting Regressor
------------
R2 Score: 0.5765
Mean Absolute Error: 1513.0686
Mean Percentage Absolute Error: 20.7554
Median Absolute Error: 1094.6755
Median Percentage Absolute Error: 15.6382


### Random Forest Regressor <a class="anchor" id="rfr_log"></a>
[Back to top!](#toc)

In [62]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

# no random_state is set, so the scores below will vary slightly between runs
rfr_log = RandomForestRegressor(max_depth=100)
rfr_log.fit(X_train, y_train)

RandomForestRegressor(max_depth=100)

In [63]:
get_scores('Random Forest Regressor', rfr_log)

# keeping variable for summary input later
train_score_rfr_log = rfr_log.score(X_train, y_train)

Random Forest Regressor
------------
Training score: 0.9387
Testing score: 0.6049


In [64]:
# Get predictions
pred_rfr_log = rfr_log.predict(X_test)
pred_rfr_exp = np.exp(pred_rfr_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('Random Forest Regressor', pred_rfr_log, pred_rfr_exp)

Random Forest Regressor
------------
R2 Score: 0.6049
Mean Absolute Error: 1436.7994
Mean Percentage Absolute Error: 19.7092
Median Absolute Error: 1025.854
Median Percentage Absolute Error: 14.6551


### Extra Trees Regressor <a class="anchor" id="etr_log"></a>
[Back to top!](#toc)

In [65]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

# no random_state is set, so the scores below will vary slightly between runs
etr_log = ExtraTreesRegressor(n_estimators=150)
etr_log.fit(X_train, y_train)

ExtraTreesRegressor(n_estimators=150)

In [66]:
get_scores('Extra Trees Regressor', etr_log)

# keeping variable for summary input later
train_score_etr_log = etr_log.score(X_train, y_train)

Extra Trees Regressor
------------
Training score: 0.9984
Testing score: 0.5714


In [67]:
# Get predictions
pred_etr_log = etr_log.predict(X_test)
pred_etr_exp = np.exp(pred_etr_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('Extra Trees Regressor', pred_etr_log, pred_etr_exp)

Extra Trees Regressor
------------
R2 Score: 0.5714
Mean Absolute Error: 1456.033
Mean Percentage Absolute Error: 19.973
Median Absolute Error: 995.8418
Median Percentage Absolute Error: 14.2263
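
The tree ensembles fit the training data far more closely than the test data. Subtracting the reported testing score from the training score makes that gap easy to compare across the models in this part of the notebook. The figures below are copied from the outputs above; the variable names are illustrative:

```python
# (training score, testing score) as reported in the outputs above
reported_scores = {
    "ElasticnetCV": (0.5605, 0.5635),
    "AdaBoost Regressor": (0.4806, 0.4652),
    "Bagging Regressor": (0.7475, 0.5342),
    "Gradient Boosting Regressor": (0.6377, 0.5765),
    "Random Forest Regressor": (0.9387, 0.6049),
    "Extra Trees Regressor": (0.9984, 0.5714),
}

# a large positive gap suggests the model is overfitting the training set
gaps = {name: round(train - test, 4) for name, (train, test) in reported_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} {gap:+.4f}")
```

On these numbers, Extra Trees and Random Forest show the widest gaps, so constraining tree growth (for example via `max_depth` or `min_samples_leaf`) may be worth exploring.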


### Support Vector Machine for Regression <a class="anchor" id="svr_log"></a>
[Back to top!](#toc)

In [68]:
features = [col for col in df_v2.columns if col != 'salary_average_log']
X = df_v2[features]
y = df_v2['salary_average_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

svr_log = SVR(C=1.0, epsilon=0.2)
svr_log.fit(X_train, y_train)

SVR(epsilon=0.2)

In [69]:
get_scores('SVM for Regression', svr_log)

# keeping variable for summary input later
train_score_svr_log = svr_log.score(X_train, y_train)

SVM for Regression
------------
Training score: 0.7422
Testing score: 0.5286


In [70]:
# Get predictions
pred_svr_log = svr_log.predict(X_test)
pred_svr_exp = np.exp(pred_svr_log)
y_test_exp = np.exp(y_test)

# Get model metrics
get_evaluation_metrics_log('SVM for Regression', pred_svr_log, pred_svr_exp)

SVM for Regression
------------
R2 Score: 0.5286
Mean Absolute Error: 1565.4703
Mean Percentage Absolute Error: 21.4742
Median Absolute Error: 1138.745
Median Percentage Absolute Error: 16.2678


### Summary of metrics for regression modelling of target variable `salary_average_log` <a class="anchor" id="summary_log"></a>
[Back to top!](#toc)

In [71]:
# list of dicts, one row of metrics per model
summary_v2 = [{'Model': 'Linear Regression v2', 
            'Train Score': train_score_linreg_log, 
            'Test Score': r2_score(y_test, pred_linreg_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_linreg_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_linreg_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_linreg_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_linreg_exp) / y_test_exp.median())*100},
           {'Model': 'LassoCV v2', 
            'Train Score': train_score_lasso_log, 
            'Test Score': r2_score(y_test, pred_lasso_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_lasso_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_lasso_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_lasso_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_lasso_exp) / y_test_exp.median())*100},
           {'Model': 'RidgeCV v2', 
            'Train Score': train_score_ridge_log, 
            'Test Score': r2_score(y_test, pred_ridge_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_ridge_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_ridge_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_ridge_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_ridge_exp) / y_test_exp.median())*100},
           {'Model': 'ElasticnetCV v2', 
            'Train Score': train_score_enet_log, 
            'Test Score': r2_score(y_test, pred_enet_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_enet_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_enet_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_enet_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_enet_exp) / y_test_exp.median())*100},
           {'Model': 'AdaBoost Regressor v2', 
            'Train Score': train_score_ada_log, 
            'Test Score': r2_score(y_test, pred_ada_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_ada_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_ada_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_ada_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_ada_exp) / y_test_exp.median())*100},
           {'Model': 'Bagging Regressor v2', 
            'Train Score': train_score_bagr_log, 
            'Test Score': r2_score(y_test, pred_bagr_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_bagr_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_bagr_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_bagr_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_bagr_exp) / y_test_exp.median())*100},
           {'Model': 'Gradient Boosting Regressor v2', 
            'Train Score': train_score_gbr_log, 
            'Test Score': r2_score(y_test, pred_gbr_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_gbr_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_gbr_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_gbr_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_gbr_exp) / y_test_exp.median())*100},
           {'Model': 'Random Forest Regressor v2', 
            'Train Score': train_score_rfr_log, 
            'Test Score': r2_score(y_test, pred_rfr_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_rfr_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_rfr_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_rfr_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_rfr_exp) / y_test_exp.median())*100},
           {'Model': 'Extra Trees Regressor v2', 
            'Train Score': train_score_etr_log, 
            'Test Score': r2_score(y_test, pred_etr_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_etr_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_etr_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_etr_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_etr_exp) / y_test_exp.median())*100},
           {'Model': 'Support Vector Machine for Regression v2', 
            'Train Score': train_score_svr_log, 
            'Test Score': r2_score(y_test, pred_svr_log),
            'Mean Absolute Error': mean_absolute_error(y_test_exp, pred_svr_exp), 
            'Mean Percentage Absolute Error': (mean_absolute_error(y_test_exp, pred_svr_exp) / y_test_exp.mean())*100,
            'Median Absolute Error': median_absolute_error(y_test_exp, pred_svr_exp),
            'Median Percentage Absolute Error': (median_absolute_error(y_test_exp, pred_svr_exp) / y_test_exp.median())*100}]

summary_v2 = pd.DataFrame(summary_v2)
summary_v2.round(4) # rounding off values to 4 decimal places

| | Model | Train Score | Test Score | Mean Absolute Error | Mean Percentage Absolute Error | Median Absolute Error | Median Percentage Absolute Error |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression v2 | 0.565 | 0.561 | 1567.792 | 21.506 | 1089.243 | 15.561 |
| 1 | LassoCV v2 | 0.561 | 0.564 | 1558.896 | 21.384 | 1093.647 | 15.623 |
| 2 | RidgeCV v2 | 0.565 | 0.561 | 1567.326 | 21.5 | 1091.492 | 15.593 |
| 3 | ElasticnetCV v2 | 0.56 | 0.564 | 1558.525 | 21.379 | 1098.431 | 15.692 |
| 4 | AdaBoost Regressor v2 | 0.481 | 0.465 | 1675.747 | 22.987 | 1256.095 | 17.944 |
| 5 | Bagging Regressor v2 | 0.748 | 0.534 | 1543.753 | 21.176 | 1058.33 | 15.119 |
| 6 | Gradient Boosting Regressor v2 | 0.638 | 0.577 | 1513.069 | 20.755 | 1094.676 | 15.638 |
| 7 | Random Forest Regressor v2 | 0.939 | 0.605 | 1436.799 | 19.709 | 1025.854 | 14.655 |
| 8 | Extra Trees Regressor v2 | 0.998 | 0.571 | 1456.033 | 19.973 | 995.842 | 14.226 |
| 9 | Support Vector Machine for Regression v2 | 0.742 | 0.529 | 1565.47 | 21.474 | 1138.745 | 16.268 |
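
The ten dictionaries in the summary cell differ only in the model name, training score, and prediction array, so they could be generated in a loop. A sketch of that idea, using NumPy equivalents of the scikit-learn metric functions and made-up inputs (`build_summary_row`, `fitted`, and the toy arrays are hypothetical names and data, not part of the notebook):

```python
import numpy as np

def build_summary_row(name, train_score, y_test_log, pred_log):
    """Return one summary row with the same keys as the summary cell."""
    y_test_log = np.asarray(y_test_log, dtype=float)
    pred_log = np.asarray(pred_log, dtype=float)
    # back-transform to the original salary scale for the error metrics
    y_exp, pred_exp = np.exp(y_test_log), np.exp(pred_log)
    abs_err = np.abs(y_exp - pred_exp)
    # R^2 is computed on the log scale, as in the summary cell
    ss_res = np.sum((y_test_log - pred_log) ** 2)
    ss_tot = np.sum((y_test_log - y_test_log.mean()) ** 2)
    return {
        "Model": name,
        "Train Score": train_score,
        "Test Score": 1 - ss_res / ss_tot,
        "Mean Absolute Error": abs_err.mean(),
        "Mean Percentage Absolute Error": abs_err.mean() / y_exp.mean() * 100,
        "Median Absolute Error": np.median(abs_err),
        "Median Percentage Absolute Error": np.median(abs_err) / np.median(y_exp) * 100,
    }

# toy stand-ins for y_test and the per-model log-scale predictions
y_test_log = np.log([1000.0, 2000.0, 4000.0])
fitted = {
    "Model A": (0.60, np.log([1100.0, 1800.0, 4000.0])),
    "Model B": (0.90, np.log([1000.0, 2000.0, 4000.0])),
}

rows = [
    build_summary_row(name, train_score, y_test_log, pred_log)
    for name, (train_score, pred_log) in fitted.items()
]
for row in rows:
    print(row["Model"], round(float(row["Test Score"]), 4), round(float(row["Mean Absolute Error"]), 2))
```

Passing `rows` to `pd.DataFrame` would give a table with the same columns as the one above, and adding a model becomes a one-line change to `fitted`.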


In [72]:
# exporting .csv for executive summary at the top
summary_v2.to_csv('../data/model_metrics_v2.csv', index=False)