# Baseline model training (without outliers)

This notebook is essentially the same as **07_Baseline_models_training_with_outliers.ipynb**, except that the outliers are removed with the **OutlierImputer** class (**src/outlier_imputer.py**).

I could have (and should have) done this comparison, with and without outlier imputation, in a single notebook.<br>
However, since the **DataPreprocessing** class is not yet flexible enough for that, I decided to split it into separate notebooks.
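
The internals of **OutlierImputer** are not shown in this notebook. As a rough illustration of the general idea only, the sketch below flags label outliers with the common 1.5 × IQR rule. The function name `iqr_outlier_mask`, the threshold `k`, and the toy prices are assumptions made for this example, not the project's actual implementation.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy prices per square metre with one obviously extreme value (hypothetical data)
prices = pd.Series([1900.0, 2000.0, 2100.0, 2200.0, 2300.0, 50000.0])

mask = iqr_outlier_mask(prices)
cleaned = prices[~mask]  # drop the flagged rows

print(mask.tolist())
print(cleaned.tolist())
```

Whether flagged values are dropped, clipped, or replaced is a design choice; this sketch simply drops them.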

## Import standard libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
import time

## Import custom scripts

In [3]:
import os
import sys
# Make the project root importable so that the src package can be found
sys.path.append(os.path.join(os.getcwd(), ".."))
from src.data_preprocessing import DataPreprocessing

## Load all the features
The data preprocessing pipeline does quite a lot of work, and not very efficiently (I have not had time to optimize it :( ).
Still, it should take less than 2 minutes.

In [4]:
dp = DataPreprocessing(df_path = "../data/real_estate_ads_2022_10.csv",
                        train_indices_path="../data/train_indices.npy", 
                        test_indices_path="../data/test_indices.npy",
                        remove_label_outliers_flag=True, # IMPORTANT FLAG
                        get_params_from_params=True,
                        get_tfidf_embeddings_flag=True,
                        get_bert_embeddings_flag=True,
                        get_textual_features_flag=True,
                        transform_time_features_flag=True,
                        transform_cyclic_features_flag=True)

X_train, X_test = dp.get_train_test_split(dp.X)
y_train, y_test = dp.get_train_test_split(dp.Y)

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train = pd.DataFrame(imp_mean.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imp_mean.transform(X_test), columns=X_test.columns)

## Load the metrics class
The **Metrics** class makes it convenient to compute several metrics at once:
- **explained_variance_score**: Measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A score of 1 indicates perfect prediction, while a score of 0 indicates that the model does not explain any of the variance. Unlike R², it does not penalize a constant offset (bias) in the predictions.

- **r2_score**: Also known as the coefficient of determination, it indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A value of 1 indicates a perfect fit, while a value of 0 indicates that the model does no better than always predicting the mean; it can be negative for models that do worse than that.

- **mean_absolute_percentage_error (MAPE)**: Measures the average of the absolute percentage errors of predictions. It provides a percentage error which is easy to interpret but can be sensitive to very small actual values.

- **median_absolute_error**: Computes the median of all absolute differences between the target and predicted values. This metric is robust to outliers and gives a better sense of the typical error when outliers are present.

- **mean_absolute_error (MAE)**: Computes the mean of the absolute error (absolute difference between the target and predicted value). 

- **mean_squared_log_error (MSLE)**: Similar to the mean squared error (MSE), but computed on the logarithms of the predictions and actual values. It penalizes underestimation more than overestimation and is less sensitive to large absolute errors than MSE.

- **custom metrics**: Compute the percentage of predictions whose error falls below a given threshold. This may correlate with customer satisfaction: if customers are happy whenever the error is below 5%, for example, the metric counts the percentage of happy customers. Of course, this needs further study (for example, segmenting the score).
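
As a minimal sketch of the metrics listed above, the snippet below computes them with `sklearn.metrics` on toy values. The helper `pct_within` is a hypothetical stand-in for the custom metric: it assumes the error is measured as a relative (percentage) error, which may differ from what **src/compute_metrics.py** does. The last lines illustrate why MSLE penalizes underestimation more than overestimation.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_log_error,
    median_absolute_error,
    r2_score,
)


def pct_within(y_true, y_pred, threshold_pct):
    """Percentage of predictions whose relative error is below threshold_pct (assumed definition)."""
    rel_err_pct = np.abs(y_pred - y_true) / np.abs(y_true) * 100
    return float(np.mean(rel_err_pct < threshold_pct) * 100)


# Toy prices per square metre (hypothetical data)
y_true = np.array([2000.0, 3000.0, 4000.0, 5000.0])
y_pred = np.array([2060.0, 2900.0, 4800.0, 5020.0])

scores = {
    "explained_variance": explained_variance_score(y_true, y_pred),
    "r2": r2_score(y_true, y_pred),
    "mape": mean_absolute_percentage_error(y_true, y_pred),  # a fraction, not a percentage
    "median_absolute_error": median_absolute_error(y_true, y_pred),
    "mean_absolute_error": mean_absolute_error(y_true, y_pred),
    "mean_squared_log_error": mean_squared_log_error(y_true, y_pred),
    "within_5_pct": pct_within(y_true, y_pred, 5),
}
print(scores)

# MSLE asymmetry: the same absolute error costs more when the prediction is too low
msle_under = mean_squared_log_error([1000.0], [500.0])
msle_over = mean_squared_log_error([1000.0], [1500.0])
print(msle_under > msle_over)
```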

In [None]:
import importlib
import src.compute_metrics
importlib.reload(src.compute_metrics) # We do this for debugging purposes

from src.compute_metrics import Metrics

ModuleNotFoundError: No module named 'src'

## Define a function for training
We will use k-fold cross-validation

In [None]:
def train_and_get_metrics(model, X, y, backward_transform_label=True, backward_standardize_flag=False, verbose=False):

    metrics = Metrics(dp=dp, backward_transform_flag=backward_transform_label, backward_standardize_flag=backward_standardize_flag)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for train_index, test_index in kf.split(X, y):
        X_train, X_val = X.iloc[train_index], X.iloc[test_index]
        y_train, y_val = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)

        if verbose:
            # Show the first few predictions next to the true values
            print(f"y_pred: {y_pred[:5]}")
            print(f"y_val: {y_val[:5]}")

        computed_metrics = metrics.get_single_train_val_metrics(model, X_train, y_train, X_val, y_val)
        metrics.append(computed_metrics)

    average_metrics = metrics.get_average()
    std_metrics = metrics.get_std()
    # Add _std to the keys to differentiate them from the average metrics:
    std_metrics = {f"{key}_std" : value for key, value in std_metrics.items()} 

    return {**average_metrics, **std_metrics}
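
The same k-fold pattern can be tried without the project's **Metrics** class. The sketch below is a self-contained illustration on synthetic data from `make_regression`, using R² as the only metric; the key names `test_r2` and `test_r2_std` mirror the naming used in this notebook, and the use of `np.std` for the spread is an assumption about how the standard deviation is computed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data (stands in for the real features and labels)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in kf.split(X):
    # Fit on four folds, evaluate on the held-out fold
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))

# Aggregate the per-fold scores into a mean and a standard deviation
summary = {
    "test_r2": float(np.mean(fold_scores)),
    "test_r2_std": float(np.std(fold_scores)),
}
print(summary)
```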

## Define convenience functions for prettier display

In [None]:
def filter_metrics(metrics_dict, only_validation=True, format_mean_std_together=True):

    if only_validation:
        metrics_dict = {key: value for key, value in metrics_dict.items() if "test_" in key}

    if format_mean_std_together:
        metrics_dict = {key: f"{value:.2f} ± {metrics_dict[key+'_std']:.2f}" for key, value in metrics_dict.items() if "std" not in key}

    return metrics_dict

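def highlight_max(s):
    # Compare the numeric means rather than the formatted "mean ± std" strings,
    # which would otherwise be compared lexicographically
    means = pd.to_numeric(s.astype(str).str.split(" ± ").str[0], errors="coerce")
    is_max = means == means.max()
    return ['font-weight: bold' if v else '' for v in is_max]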

def format_results_df(results, column_names=None):
    results_df = pd.DataFrame(results).T

    if column_names is not None:
        results_df.columns = column_names
    
    return results_df.style.apply(highlight_max, axis=1)

# Import and define baseline models

In [None]:
# Use simple linear models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso

# Support vector machines
from sklearn.svm import SVR

# The target (price_per_m) is continuous, so we need regressors, not classifiers

# K-nearest neighbors
from sklearn.neighbors import KNeighborsRegressor

# Decision trees
from sklearn.tree import DecisionTreeRegressor

# Random forests
from sklearn.ensemble import RandomForestRegressor

## Linear models

In [None]:
linear_models = {
    "Linear regression" : {
        "model": LinearRegression(),
        },
    "Lasso" : {
        "model": Lasso(),
        },
    }

results_list = []

for model_name in linear_models:
    print(model_name)
    model = linear_models[model_name]["model"]

    results = train_and_get_metrics(model, X_train, y_train, verbose=True)
    results_list.append(results)

Linear regression
y_pred: [[ 0.55614892]
 [-0.55747703]
 [ 1.17770557]
 [-0.55847119]
 [ 0.5103213 ]]
y_val:        price_per_m
18563     0.248746
49795    -0.828817
11597     1.017808
16528    -0.577675
62540    -0.107894
y_pred: [[ 0.16206179]
 [-0.41284075]
 [-0.13639937]
 [-0.68935287]
 [ 0.43581587]]
y_val:        price_per_m
34086     0.173084
41636    -0.444718
20321     0.596778
66567    -0.722584
35278     1.328041
y_pred: [[ 0.02338851]
 [ 0.8259965 ]
 [-0.8684811 ]
 [ 0.25633631]
 [ 0.51677796]]
y_val:        price_per_m
23851    -0.133053
71706     0.220067
7683     -1.013503
66762     0.112159
71258    -0.330539
y_pred: [[ 0.09859211]
 [-0.02956057]
 [ 0.28437683]
 [-0.09705792]
 [ 0.18122024]]
y_val:        price_per_m
2898      0.132444
26381     0.452741
71395     0.287999
51329    -0.112959
40072     0.073805
y_pred: [[-0.01115963]
 [ 0.48467624]
 [-0.56420116]
 [-0.41709127]
 [ 0.24578029]]
y_val:        price_per_m
6943     -1.042554
63401     0.104377
67204    -0.83

In [None]:
filtered_results = [filter_metrics(results) for results in results_list]
format_results_df(filtered_results, column_names=linear_models.keys())

| Metric | Linear regression | Lasso |
|---|---|---|
| test_explained_variance | 0.46 ± 0.01 | 0.09 ± 0.00 |
| test_r2 | 0.45 ± 0.01 | 0.08 ± 0.00 |
| test_mape | 0.14 ± 0.00 | 0.19 ± 0.00 |
| test_median_absolute_error | 756.75 ± 7.82 | 1096.48 ± 7.03 |
| test_mean_squared_error | 3266531.51 ± 103833.13 | 5500837.63 ± 142315.57 |
| test_mean_squared_log_error | 0.04 ± 0.00 | 0.06 ± 0.00 |
| test_custom_1 | 5.56 ± 0.27 | 4.05 ± 0.17 |
| test_custom_5 | 27.27 ± 0.45 | 18.83 ± 0.20 |
| test_custom_10 | 50.83 ± 0.55 | 36.71 ± 0.29 |
| test_custom_20 | 80.12 ± 0.32 | 65.22 ± 0.34 |


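Compared with the results obtained with the outliers kept (**07_Baseline_models_training_with_outliers.ipynb**), the metrics are now a bit worse.

Since removing outliers also discards some information, we decide not to remove them.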