# CatBoost training

We train CatBoost models here, building on everything we learned in the previous steps.

Why CatBoost? Formally:

- CatBoost has several advantages over XGBoost:
    - It requires less hyperparameter tuning (but is also less flexible)
    - It usually works better 'out of the box'
    - It can deal with categorical features without the need for one-hot encoding
    - It usually works best with categorical variables (XGBoost works best with numerical ones)
    - We can try it out with different configurations for how the data is loaded

Informally: it usually works fine and I like cats

## Import standard libraries

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
import time
from sklearn.preprocessing import StandardScaler

from sklearn.impute import SimpleImputer

from catboost import CatBoostRegressor

## Import custom scripts

In [10]:
import sys, os
sys.path.append(os.getcwd()+ "/../")
from src.data_preprocessing import DataPreprocessing

## Load all the features
The data preprocessing pipeline does quite a lot, and not very efficiently (I didn't have much time to optimize it :( ).
Still, it should take less than 2 minutes.

In [11]:
dp = DataPreprocessing(df_path = "../data/real_estate_ads_2022_10.csv",
                        train_indices_path="../data/train_indices.npy", 
                        test_indices_path="../data/test_indices.npy",
                        get_params_from_params=True,
                        get_tfidf_embeddings_flag=True,
                        get_bert_embeddings_flag=True,
                        get_textual_features_flag=True,
                        transform_time_features_flag=True,
                        transform_cyclic_features_flag=True)

## Load the metrics class
It is a convenient way to compute multiple metrics:
- **explained_variance_score**: Measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A score of 1 indicates perfect prediction, while a score of 0 indicates that the model does not explain any of the variance.

- **r2_score**: Also known as the coefficient of determination, it indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A value of 1 indicates a perfect fit, while a value of 0 indicates that the model does not explain any of the variance.

- **mean_absolute_percentage_error (MAPE)**: Measures the average of the absolute percentage errors of predictions. It provides a percentage error which is easy to interpret but can be sensitive to very small actual values.

- **median_absolute_error**: Computes the median of all absolute differences between the target and predicted values. This metric is robust to outliers and gives a better sense of the typical error when outliers are present.

- **mean_squared_error (MSE)**: Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. It penalizes larger errors more than smaller ones due to squaring.

- **mean_squared_log_error (MSLE)**: Similar to MSE but takes the logarithm of the predictions and actual values. It is useful when you want to penalize underestimation more than overestimation and is less sensitive to large errors than MSE.

- **custom metrics**: Compute the percentage of predictions whose error falls below a given threshold. This may correlate with customer satisfaction: if customers are happy whenever the error is below 5%, for example, the metric counts the percentage of happy customers. Of course, this needs further study (for example, segmenting the score).
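
To make two of these definitions concrete, here is a small sketch in plain Python. It assumes the custom metric uses the absolute percentage error with a threshold given in percent, which is a guess based on the names `custom_1`, `custom_5`, `custom_10` and `custom_20`; the actual `Metrics` class may compute it differently. The function names and toy values are made up for illustration.

```python
import math

def within_threshold_rate(y_true, y_pred, threshold_pct):
    """Percentage of predictions whose absolute percentage error is below threshold_pct."""
    hits = sum(
        abs(pred - true) / abs(true) * 100 < threshold_pct
        for true, pred in zip(y_true, y_pred)
    )
    return 100 * hits / len(y_true)

def squared_log_error(true, pred):
    """Single-sample squared log error, the term that MSLE averages."""
    return (math.log1p(pred) - math.log1p(true)) ** 2

# Toy values, made up for illustration
y_true = [1000.0, 2000.0, 1500.0, 800.0]
y_pred = [1030.0, 1700.0, 1510.0, 1000.0]

rate_5 = within_threshold_rate(y_true, y_pred, 5)    # errors of 3% and ~0.7% qualify
rate_20 = within_threshold_rate(y_true, y_pred, 20)  # the 15% error now qualifies too

# Same absolute miss of 50, once below and once above the true value of 100
under = squared_log_error(100.0, 50.0)
over = squared_log_error(100.0, 150.0)

print(rate_5, rate_20, under > over)  # 50.0 75.0 True
```

The last comparison illustrates why MSLE penalizes underestimation more than overestimation for the same absolute error.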

In [12]:
import importlib
import src.compute_metrics
importlib.reload(src.compute_metrics) # We do this for debugging purposes

from src.compute_metrics import Metrics

## Split train / test data
We can use the `DataPreprocessing` method for that.

It relies on stored indices for better reproducibility, but sklearn's `train_test_split` with a fixed seed should work just as well.
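
For reference, a seeded sklearn split could look like the sketch below. The toy arrays, the 20% test size and the seed value are assumptions for illustration and do not reproduce the stored indices used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state yields the same split on every call
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```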

In [13]:
X_train, X_test = dp.get_train_test_split(dp.X)
y_train, y_test = dp.get_train_test_split(dp.Y)

## Define a function for training
We will use 5-fold cross-validation and report the mean and standard deviation of each metric across the folds.
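
As a quick illustration of what the fold splitting does, the sketch below runs sklearn's `KFold` on a toy array (made up for illustration) with the same settings as the training function. Every row ends up in the validation set of exactly one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, made up for illustration: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

val_indices = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    print(f"fold {fold}: train={len(train_idx)} rows, val={len(val_idx)} rows")
    val_indices.extend(val_idx.tolist())

# Each row index appears exactly once across the validation folds
covered_once = sorted(val_indices) == list(range(len(X_toy)))
print(covered_once)
```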

In [14]:
def train_catboost_and_get_metrics(X, y, 
                                    catboost_params=None,
                                    backward_transform_label=True, 
                                    backward_standardize_flag=False, verbose=False,
                                    standard_scale_flag=False,
                                    impute_flag=False):
    # Avoid a mutable default argument: build a fresh dict on each call
    catboost_params = catboost_params or {}

    if impute_flag:
        # Note: the imputer is fit on all of X, before the folds are split
        imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
        X = pd.DataFrame(imp_mean.fit_transform(X), columns=X.columns)

    bst = CatBoostRegressor(**catboost_params)

    metrics = Metrics(dp=dp, backward_transform_flag=backward_transform_label, backward_standardize_flag=backward_standardize_flag)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for train_index, test_index in kf.split(X, y):
        X_train, X_val = X.iloc[train_index], X.iloc[test_index]
        y_train, y_val = y.iloc[train_index], y.iloc[test_index]

        if standard_scale_flag:
            scaler = StandardScaler()
            X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
            X_val = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns)

        bst.fit(X_train, y_train)
        y_pred = bst.predict(X_val)

        if verbose: 
            print(f"y_pred: {y_pred[:5]}")
            print(f"y_val: {y_val[:5]}")
            
        computed_metrics = metrics.get_single_train_val_metrics(bst, X_train, y_train, X_val, y_val)
        metrics.append(computed_metrics)

    average_metrics = metrics.get_average()
    std_metrics = metrics.get_std()
    # Add _std to the keys to differentiate them from the average metrics:
    std_metrics = {f"{key}_std" : value for key, value in std_metrics.items()} 

    return {**average_metrics, **std_metrics}

## Define convenience functions for prettier display

In [15]:
def filter_metrics(metrics_dict, only_validation=True, format_mean_std_together=True):

    if only_validation:
        metrics_dict = {key: value for key, value in metrics_dict.items() if "test_" in key}

    if format_mean_std_together:
        metrics_dict = {key: f"{value:.2f} ± {metrics_dict[key+'_std']:.2f}" for key, value in metrics_dict.items() if "std" not in key}

    return metrics_dict

def highlight_max(s):
    # Compare the numeric means, not the "mean ± std" strings ("nan ± nan" becomes NaN)
    means = s.str.split("±").str[0].astype(float)
    is_max = means == means.max()
    return ['font-weight: bold' if v else '' for v in is_max]

def highlight_min(s):
    # Compare the numeric means, not the "mean ± std" strings ("nan ± nan" becomes NaN)
    means = s.str.split("±").str[0].astype(float)
    is_min = means == means.min()
    return ['font-weight: bold' if v else '' for v in is_min]

def format_results_df(results, column_names=None):
    results_df = pd.DataFrame(results).T

    if column_names is not None:
        results_df.columns = column_names
    
    def apply_highlight(row):
        # With axis=1, Styler.apply passes each row; row.name is the metric name
        if row.name in ["test_explained_variance", "test_r2", "test_custom_1", "test_custom_5", "test_custom_10", "test_custom_20"]:
            return highlight_max(row)
        else:
            return highlight_min(row)

    
    return results_df.style.apply(apply_highlight, axis=1)

## Try out different preprocessing techniques

In [16]:
results_catboost_without_impute_without_standard = train_catboost_and_get_metrics(X_train, y_train, 
                                                                catboost_params = {"verbose" : False},
                                                                impute_flag=False,
                                                                standard_scale_flag=False)
                                                                
results_catboost_with_impute_without_standard = train_catboost_and_get_metrics(X_train, y_train, 
                                                                catboost_params = {"verbose" : False},
                                                                impute_flag=True,
                                                                standard_scale_flag=False)

results_catboost_without_impute_with_standard = train_catboost_and_get_metrics(X_train, y_train, 
                                                                catboost_params = {"verbose" : False},
                                                                impute_flag=False,
                                                                standard_scale_flag=True)
                                                                
results_catboost_with_impute_with_standard = train_catboost_and_get_metrics(X_train, y_train,
                                                                catboost_params = {"verbose" : False},
                                                                impute_flag=True,
                                                                standard_scale_flag=True)

In [20]:
df_results = pd.concat([pd.DataFrame(filter_metrics(results_catboost_without_impute_without_standard, only_validation=True, format_mean_std_together=True), index=["Without Impute Without Standard"]),
                        pd.DataFrame(filter_metrics(results_catboost_with_impute_without_standard, only_validation=True, format_mean_std_together=True), index=["With Impute Without Standard"]),
                        pd.DataFrame(filter_metrics(results_catboost_without_impute_with_standard, only_validation=True, format_mean_std_together=True), index=["Without Impute With Standard"]),
                        pd.DataFrame(filter_metrics(results_catboost_with_impute_with_standard, only_validation=True, format_mean_std_together=True), index=["With Impute With Standard"])])

format_results_df(df_results, column_names=["No impute, no standard", "Impute, no standard", "No impute, standard", "Impute, standard"])

| Metric | No impute, no standard | Impute, no standard | No impute, standard | Impute, standard |
| --- | --- | --- | --- | --- |
| test_explained_variance | 0.64 ± 0.09 | 0.64 ± 0.09 | 0.64 ± 0.09 | 0.64 ± 0.09 |
| test_r2 | 0.64 ± 0.09 | 0.64 ± 0.09 | 0.64 ± 0.09 | 0.64 ± 0.09 |
| test_mape | 221.98 ± 83.79 | 220.33 ± 90.18 | 209.04 ± 88.19 | 217.00 ± 95.95 |
| test_median_absolute_error | 447.64 ± 3.49 | 447.73 ± 3.01 | 447.13 ± 3.25 | 448.51 ± 3.06 |
| test_mean_absolute_error | 682.07 ± 7.28 | 683.21 ± 6.90 | 681.84 ± 8.31 | 683.19 ± 7.52 |
| test_mean_squared_log_error | 0.12 ± 0.02 | 0.12 ± 0.03 | 0.12 ± 0.02 | 0.12 ± 0.03 |
| test_custom_1 | 10.22 ± 0.29 | 10.16 ± 0.23 | 10.32 ± 0.13 | 10.24 ± 0.20 |
| test_custom_5 | 45.51 ± 0.35 | 45.54 ± 0.22 | 45.66 ± 0.43 | 45.42 ± 0.24 |
| test_custom_10 | 72.07 ± 0.15 | 71.98 ± 0.13 | 72.06 ± 0.15 | 71.93 ± 0.17 |
| test_custom_20 | 92.02 ± 0.30 | 92.06 ± 0.20 | 92.14 ± 0.20 | 92.05 ± 0.19 |


The results are very similar across all four configurations. Skipping imputation while standardizing scores best on most metrics, but only marginally.

So, for simplicity, we go for no imputing and no standardizing.
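
One rough way to judge how close the configurations are is to compare the gap between their mean scores with the fold-to-fold standard deviation. The sketch below does this for the validation MAE, with the numbers copied from the results table; it is an informal check, not a statistical test.

```python
# (mean, std) of the validation MAE, copied from the results table
mae = {
    "No impute, no standard": (682.07, 7.28),
    "Impute, no standard": (683.21, 6.90),
    "No impute, standard": (681.84, 8.31),
    "Impute, standard": (683.19, 7.52),
}

means = [mean for mean, _ in mae.values()]
spread = max(means) - min(means)  # gap between the best and worst mean
smallest_std = min(std for _, std in mae.values())

print(f"spread of means: {spread:.2f}, smallest fold std: {smallest_std:.2f}")
# spread of means: 1.37, smallest fold std: 6.90

# The gap between configurations is smaller than the noise across folds
within_noise = spread < smallest_std
```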