##### Model Training

Import Data and Required Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

Import ML models and evaluation metrics

In [None]:
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import warnings

Load in the scaled data

In [19]:
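# Assumes each target CSV holds a single column: squeeze("columns") turns it into
# a Series, which avoids scikit-learn's column-vector warning when fitting
scaled_X_train = pd.read_csv('data/scaled_X_train.csv')
scaled_y_train = pd.read_csv('data/scaled_y_train.csv').squeeze("columns")
scaled_X_test = pd.read_csv('data/scaled_X_test.csv')
scaled_y_test = pd.read_csv('data/scaled_y_test.csv').squeeze("columns")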

Define the evaluate_model function, which computes the following metrics from the true and predicted values:

* Mean Absolute Error (mae) - the average of the absolute differences between predicted and actual values
* Mean Squared Error (mse) - the average of the squared differences between predicted and actual values
* Root Mean Squared Error (rmse) - the square root of the mean squared error, expressed in the same units as the target
* Mean Absolute Percentage Error (mape) - the average absolute difference between predicted and actual values, relative to the actual values (scikit-learn returns a fraction, which is multiplied by 100 when printed)
* R2 Score (r2_square) - the proportion of the variance in the dependent variable that the model explains; 1 is a perfect fit and 0 is no better than always predicting the mean

In [None]:
def evaluate_model(true, predicted):
    """Return MAE, MSE, RMSE, MAPE and R2 for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    mape = mean_absolute_percentage_error(true, predicted)  # a fraction, not a percentage
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, mape, r2_square
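
To make the metric definitions concrete, here is a small hand-computed sketch in plain Python. The four target and prediction values are invented for illustration and do not come from the dataset, and the name `manual_metrics` is hypothetical; it only mirrors the order of values returned by `evaluate_model`.

```python
import math


def manual_metrics(true, predicted):
    """Compute MAE, MSE, RMSE, MAPE and R2 directly from their definitions."""
    n = len(true)
    errors = [t - p for t, p in zip(true, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE as a fraction (multiply by 100 for a percentage); assumes no true value is 0
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, true)) / n
    # R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
    mean_true = sum(true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, mape, r2


# Toy values, invented for illustration only
toy_true = [10.0, 20.0, 30.0, 40.0]
toy_predicted = [12.0, 18.0, 33.0, 37.0]

mae, mse, rmse, mape, r2 = manual_metrics(toy_true, toy_predicted)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} MAPE={mape:.5f} R2={r2:.4f}")
```

With these toy numbers the errors are 2, 2, 3 and 3 in absolute value, so the MAE is 2.5, the MSE is 6.5, and the R2 is 1 - 26/500 = 0.948.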

Define the models, then train and evaluate each one on the scaled data

In [None]:
models = {
    "Linear Regression": LinearRegression(),

    "Lasso": Lasso(),

    "Ridge": Ridge(),

    "K-Neighbors Regressor": KNeighborsRegressor(),

    "Decision Tree": DecisionTreeRegressor(),

    "Random Forest Regressor": RandomForestRegressor(),

    "XGBRegressor": XGBRegressor(), 
    
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),

    "AdaBoost Regressor": AdaBoostRegressor(),

    "Gradient Boosting Regressor": GradientBoostingRegressor()
}
# Collect model names and their test R2 scores
model_list = []
r2_list = []

# Fit and evaluate every model in the dictionary
for name, model in models.items():
    # Train model
    model.fit(scaled_X_train, scaled_y_train)

    # Make predictions
    y_train_pred = model.predict(scaled_X_train)
    y_test_pred = model.predict(scaled_X_test)

    # Compute the metrics on the training and test sets
    model_train_mae, model_train_mse, model_train_rmse, model_train_mape, model_train_r2 = evaluate_model(scaled_y_train, y_train_pred)

    model_test_mae, model_test_mse, model_test_rmse, model_test_mape, model_test_r2 = evaluate_model(scaled_y_test, y_test_pred)

    print(name)

    # Record the model name for the results table
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Squared Error: {:.4f}".format(model_train_mse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print(f"- Mean Absolute Percentage Error: {model_train_mape * 100:.4f}%")
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Squared Error: {:.4f}".format(model_test_mse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print(f"- Mean Absolute Percentage Error: {model_test_mape * 100:.4f}%")
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.2597
- Mean Squared Error: 0.0675
- Mean Absolute Error: 0.1988
- Mean Absolute Percentage Error: 2.2127%
- R2 Score: 0.7413
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.2538
- Mean Squared Error: 0.0644
- Mean Absolute Error: 0.1945
- Mean Absolute Percentage Error: 2.1630%
- R2 Score: 0.7581


Lasso
Model performance for Training set
- Root Mean Squared Error: 0.5106
- Mean Squared Error: 0.2607
- Mean Absolute Error: 0.4298
- Mean Absolute Percentage Error: 4.8427%
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.5160
- Mean Squared Error: 0.2663
- Mean Absolute Error: 0.4336
- Mean Absolute Percentage Error: 4.8812%
- R2 Score: -0.0003


Ridge
Model performance for Training set
- Root Mean Squared Error: 0.2597
- Mean Squared Error: 0.0675
- Mean Absolute Error: 0.1988
- Mean Absolute Per



Random Forest Regressor
Model performance for Training set
- Root Mean Squared Error: 0.0985
- Mean Squared Error: 0.0097
- Mean Absolute Error: 0.0588
- Mean Absolute Percentage Error: 0.6514%
- R2 Score: 0.9628
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.1781
- Mean Squared Error: 0.0317
- Mean Absolute Error: 0.1178
- Mean Absolute Percentage Error: 1.3034%
- R2 Score: 0.8809


XGBRegressor
Model performance for Training set
- Root Mean Squared Error: 0.1171
- Mean Squared Error: 0.0137
- Mean Absolute Error: 0.0806
- Mean Absolute Percentage Error: 0.8959%
- R2 Score: 0.9474
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.1575
- Mean Squared Error: 0.0248
- Mean Absolute Error: 0.1119
- Mean Absolute Percentage Error: 1.2390%
- R2 Score: 0.9068


CatBoosting Regressor
Model performance for Training set
- Root Mean Squared Error: 0.1340
- Mean Squared Error: 0.0180
- Mean Absolute Error



AdaBoost Regressor
Model performance for Training set
- Root Mean Squared Error: 0.2890
- Mean Squared Error: 0.0835
- Mean Absolute Error: 0.2382
- Mean Absolute Percentage Error: 2.6780%
- R2 Score: 0.6797
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.2841
- Mean Squared Error: 0.0807
- Mean Absolute Error: 0.2327
- Mean Absolute Percentage Error: 2.6139%
- R2 Score: 0.6967






Gradient Boosting Regressor
Model performance for Training set
- Root Mean Squared Error: 0.1977
- Mean Squared Error: 0.0391
- Mean Absolute Error: 0.1552
- Mean Absolute Percentage Error: 1.7277%
- R2 Score: 0.8501
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.1992
- Mean Squared Error: 0.0397
- Mean Absolute Error: 0.1554
- Mean Absolute Percentage Error: 1.7284%
- R2 Score: 0.8509




Sort the models by test R2 score, highest first

In [None]:
pd.DataFrame(list(zip(model_list, r2_list)), columns=['Model Name', 'R2_Score']).sort_values(by=["R2_Score"],ascending=False)

| | Model Name | R2_Score |
|---|---|---|
| 7 | CatBoosting Regressor | 0.912298 |
| 6 | XGBRegressor | 0.906762 |
| 5 | Random Forest Regressor | 0.880879 |
| 9 | Gradient Boosting Regressor | 0.850872 |
| 3 | K-Neighbors Regressor | 0.845077 |
| 4 | Decision Tree | 0.813427 |
| 2 | Ridge | 0.758088 |
| 0 | Linear Regression | 0.758084 |
| 8 | AdaBoost Regressor | 0.69672 |
| 1 | Lasso | -0.000264 |


Observations

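* Lasso performs worst, with an R2 of about 0 on both sets. A training R2 of 0.0000 means it predicts a constant, which suggests that the default penalty (alpha=1.0) shrank every coefficient to zero
* AdaBoost (0.70) is the next weakest, scoring below Linear Regression and Ridge (both 0.76), likely because its default base learners are shallow trees (max_depth=3)
* The best-performing models are CatBoost, XGBoost, Random Forest and Gradient Boosting
* CatBoost has the best test R2 score (0.912), followed closely by XGBoost (0.907)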
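
The near-zero Lasso score is what a constant prediction produces. The sketch below, in plain Python with invented toy targets (not the notebook's data), shows that predicting the training mean gives an R2 of exactly 0 on the training set and a value at or slightly below 0 on a test set, which matches the pattern in the Lasso output. The helper name `r2_manual` is hypothetical.

```python
def r2_manual(true, predicted):
    """R2 = 1 - SS_res / SS_tot, computed from the definition."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot


# Toy targets, invented for illustration only
toy_y_train = [8.1, 9.4, 8.8, 9.9, 9.3]
toy_y_test = [8.5, 9.6, 9.0]

# A model whose coefficients are all zero predicts one constant: the training mean
baseline = sum(toy_y_train) / len(toy_y_train)

train_r2 = r2_manual(toy_y_train, [baseline] * len(toy_y_train))
test_r2 = r2_manual(toy_y_test, [baseline] * len(toy_y_test))

print(f"train R2 = {train_r2:.4f}")
print(f"test R2  = {test_r2:.4f}")
```

If this is what happened to Lasso here, trying smaller values of `alpha` would be a natural next step, although that is not explored in this notebook.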

Next, train the models that do not require feature scaling (the tree-based models) on the unscaled datasets to see whether performance improves
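
As a rough illustration of why tree-based models do not require feature scaling, the sketch below fits a one-split regression tree (a stump) on a raw feature and on a standardized copy of it. Everything here is an assumption made for illustration: the data is invented, `fit_stump`, `predict_stump` and `standardize` are hypothetical helpers, and standardization stands in for whatever scaler produced the scaled CSV files. It covers feature scaling only, not any transformation of the target.

```python
def fit_stump(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared error on one feature."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xs):
    threshold, left_mean, right_mean = stump
    return [left_mean if x <= threshold else right_mean for x in xs]


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]


# Toy feature and target, invented for illustration only
raw_x = [1000.0, 1500.0, 2000.0, 6000.0, 7000.0, 8000.0]
toy_y = [1.0, 1.2, 0.9, 5.0, 5.5, 4.8]
scaled_x = standardize(raw_x)

raw_stump = fit_stump(raw_x, toy_y)
scaled_stump = fit_stump(scaled_x, toy_y)

pred_raw = predict_stump(raw_stump, raw_x)
pred_scaled = predict_stump(scaled_stump, scaled_x)

print("threshold on raw feature:   ", raw_stump[0])
print("threshold on scaled feature:", scaled_stump[0])
print("identical predictions:", pred_raw == pred_scaled)
```

The threshold moves with the scale of the feature, but the samples on each side of the split, and therefore the predictions, stay the same.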

Load in the data that was not scaled

In [23]:
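# Assumes each target CSV holds a single column: squeeze("columns") turns it into
# a Series, which avoids scikit-learn's column-vector warning when fitting
X_train = pd.read_csv('data/X_train.csv')
y_train = pd.read_csv('data/y_train.csv').squeeze("columns")
X_test = pd.read_csv('data/X_test.csv')
y_test = pd.read_csv('data/y_test.csv').squeeze("columns")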
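
The train-and-evaluate loop used for the scaled data is repeated for the unscaled data. One option, sketched here under assumptions, is to wrap it in a helper that accepts any objects with `fit` and `predict` methods. The name `train_and_evaluate`, the stand-in classes `MeanModel` and `SlopeModel`, and the toy data are all hypothetical; they exist only so that the sketch runs without scikit-learn.

```python
def r2_manual(true, predicted):
    """R2 = 1 - SS_res / SS_tot, computed from the definition."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot


def train_and_evaluate(models, X_train, y_train, X_test, y_test, metric=r2_manual):
    """Fit every model and return (name, train_score, test_score) rows, best test score first."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_score = metric(y_train, model.predict(X_train))
        test_score = metric(y_test, model.predict(X_test))
        rows.append((name, train_score, test_score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


class MeanModel:
    """Stand-in model that always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class SlopeModel:
    """Stand-in model: least-squares line through the origin on the first feature."""

    def fit(self, X, y):
        self.slope_ = sum(row[0] * t for row, t in zip(X, y)) / sum(row[0] ** 2 for row in X)
        return self

    def predict(self, X):
        return [self.slope_ * row[0] for row in X]


# Toy data, invented for illustration only
toy_X_train, toy_y_train = [[1.0], [2.0], [3.0], [4.0]], [2.0, 4.0, 6.0, 8.0]
toy_X_test, toy_y_test = [[5.0], [6.0]], [10.0, 12.0]

results = train_and_evaluate(
    {"Mean": MeanModel(), "Slope": SlopeModel()},
    toy_X_train, toy_y_train, toy_X_test, toy_y_test,
)
for name, train_score, test_score in results:
    print(f"{name}: train R2={train_score:.4f}, test R2={test_score:.4f}")
```

In the notebook, the same helper could presumably be called with the `models` dictionary and `metric=r2_score`, since scikit-learn estimators expose `fit` and `predict` and `r2_score` takes the true values first.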

Define the tree-based models as before and run the same loop on the unscaled data

In [None]:
models = {
    "Decision Tree": DecisionTreeRegressor(),

    "Random Forest Regressor": RandomForestRegressor(),

    "XGBRegressor": XGBRegressor(), 
    
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),

    "AdaBoost Regressor": AdaBoostRegressor(),

    "Gradient Boosting Regressor": GradientBoostingRegressor()
}
# Collect model names and their test R2 scores
model_list = []
r2_list = []

# Fit and evaluate every model in the dictionary
for name, model in models.items():
    # Train model
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Compute the metrics on the training and test sets
    model_train_mae, model_train_mse, model_train_rmse, model_train_mape, model_train_r2 = evaluate_model(y_train, y_train_pred)

    model_test_mae, model_test_mse, model_test_rmse, model_test_mape, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)

    # Record the model name for the results table
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Squared Error: {:.4f}".format(model_train_mse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print(f"- Mean Absolute Percentage Error: {model_train_mape * 100:.4f}%")
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Squared Error: {:.4f}".format(model_test_mse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print(f"- Mean Absolute Percentage Error: {model_test_mape * 100:.4f}%")
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

Decision Tree
Model performance for Training set
- Root Mean Squared Error: 806.0401
- Mean Squared Error: 649700.6054
- Mean Absolute Error: 317.9561
- Mean Absolute Percentage Error: 3.4689%
- R2 Score: 0.9661
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 2194.7356
- Mean Squared Error: 4816864.4696
- Mean Absolute Error: 1302.6783
- Mean Absolute Percentage Error: 14.1069%
- R2 Score: 0.7590






Random Forest Regressor
Model performance for Training set
- Root Mean Squared Error: 964.0927
- Mean Squared Error: 929474.7718
- Mean Absolute Error: 562.2639
- Mean Absolute Percentage Error: 6.2152%
- R2 Score: 0.9515
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1771.9276
- Mean Squared Error: 3139727.5430
- Mean Absolute Error: 1122.9254
- Mean Absolute Percentage Error: 12.2411%
- R2 Score: 0.8429


XGBRegressor
Model performance for Training set
- Root Mean Squared Error: 1126.5802
- Mean Squared Error: 1269182.8750
- Mean Absolute Error: 755.1634
- Mean Absolute Percentage Error: 8.8749%
- R2 Score: 0.9338
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1640.7978
- Mean Squared Error: 2692217.5000
- Mean Absolute Error: 1091.5347
- Mean Absolute Percentage Error: 12.1070%
- R2 Score: 0.8653


CatBoosting Regressor
Model performance for Training set
- Root Mean Squared Error: 1294.6588
-



AdaBoost Regressor
Model performance for Training set
- Root Mean Squared Error: 2726.3682
- Mean Squared Error: 7433083.5130
- Mean Absolute Error: 2213.9232
- Mean Absolute Percentage Error: 32.9166%
- R2 Score: 0.6123
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 2712.3454
- Mean Squared Error: 7356817.7894
- Mean Absolute Error: 2184.5625
- Mean Absolute Percentage Error: 32.3894%
- R2 Score: 0.6319






Gradient Boosting Regressor
Model performance for Training set
- Root Mean Squared Error: 1944.6405
- Mean Squared Error: 3781626.5472
- Mean Absolute Error: 1405.0743
- Mean Absolute Percentage Error: 16.5510%
- R2 Score: 0.8028
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1968.9215
- Mean Squared Error: 3876651.7276
- Mean Absolute Error: 1417.6871
- Mean Absolute Percentage Error: 16.5112%
- R2 Score: 0.8060




In [None]:
pd.DataFrame(list(zip(model_list, r2_list)), columns=['Model Name', 'R2_Score']).sort_values(by=["R2_Score"],ascending=False)

| | Model Name | R2_Score |
|---|---|---|
| 3 | CatBoosting Regressor | 0.879983 |
| 2 | XGBRegressor | 0.865295 |
| 1 | Random Forest Regressor | 0.842903 |
| 5 | Gradient Boosting Regressor | 0.806031 |
| 0 | Decision Tree | 0.758988 |
| 4 | AdaBoost Regressor | 0.631901 |


Observations

* AdaBoost and Decision Tree perform worst, while the best-performing models are again CatBoost, XGBoost, Random Forest and Gradient Boosting, as with the scaled data
* However, every model has a lower test R2 score on the unscaled data (for example, CatBoost drops from 0.912 to 0.880), which suggests that the scaled data suits these models better. Note that the scaled scores are computed against the scaled target, so the two tables are not measured on the same target scale
* CatBoost has the best R2 score (0.88), followed by XGBoost (0.87)

Based on these results, I will use the CatBoosting Regressor trained on the scaled data as the best model