# MVP: Machine Learning & Analytics

**Author**: Rodrigo Eduardo Modesto de Abreu

**Date**: 27/09/2025

**Student ID**: 4052025000009

**Dataset**: [Diamond Prices](https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices)

### Contents

1. [Introduction](#introduction)
2. [The Problem](#the-problem)
3. [Data Preprocessing](#data-preprocessing)
4. [Model Processing](#model-processing)
5. [Hyperparameter Optimization](#hyperparameter-optimization)
6. [Best Model training and execution](#best-model-training-and-execution)
7. [Conclusion](#conclusion)

## Introduction

The MVP is based on the Diamond Prices dataset which was also used as part of the development of the [MVP2](https://github.com/remabreu/DiamondsPrices/tree/main).

In this repository, you can find:
* [README file](https://github.com/remabreu/DiamondsPrices/blob/main/README.md) - Describes the details of the dataset.
* [Notebook](https://github.com/remabreu/DiamondsPrices/blob/main/diamonds.ipynb) - Includes the whole dataset analysis and preprocessing, which is also replicated in this notebook.

## The Problem

The Diamonds Prices dataset provides several features to support predicting the target variable (`price`) as a supervised regression problem. It is a common, well-known regression dataset on [Kaggle](https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices). The latest updated version is used here; it contains 53,943 records and 11 columns (one of them is the index and plays no role in the data analysis).

The data used for some models is only a sample of the whole dataset, due to performance limitations when executing the models. Larger samples, or the entire dataset, may yield better model results and scores.

### Exploratory Data Analysis

In summary, the dataset presents the following:
* The dataset didn't present any missing data (only )
* The `price` column has a very unbalanced, skewed distribution.
* Uncommon prices may or may not be regarded as outliers, depending on the intended use of the diamonds, for example as a value maximizer (i.e. industrial use), a collector's item, or a bridal budget. No mis-measurement or error was observed that would justify treating them as outliers.
  However, the `table` and `depth` measurements have a few significant outliers that fall outside the IQR fences (below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR).
* `carat` and `price` are strongly correlated, with `cut`, `color` and `clarity` acting as modifiers of that correlation by contributing to higher prices for the same carat. This behavior was more distinct for smaller/lighter carats, though.
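
The IQR fence mentioned above can be worked through on a handful of numbers. The sketch below uses hypothetical `table` values chosen only to make the arithmetic easy to follow; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical `table` measurements (illustrative values, not real data)
table = np.array([43.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 73.0])

# First and third quartiles, then the interquartile range
q1, q3 = np.percentile(table, [25, 75])
iqr = q3 - q1

# Fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = table[(table < lower_fence) | (table > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")             # Q1=56.0, Q3=59.0, IQR=3.0
print(f"fences=({lower_fence}, {upper_fence})")   # fences=(51.5, 63.5)
print("outliers:", outliers.tolist())             # outliers: [43.0, 73.0]
```

Only the two extreme values fall outside the fences; the rest of the measurements are kept.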

## Data Preprocessing

Data preprocessing is responsible for:
1. Data cleaning (a small number of rows with empty or zero measurements).
2. Applying a log transformation to the `price` column. Log-transformed prices behave better when standardized: the transformation compresses the right-skewed, wide range of values and reduces the influence of extreme prices.
   - Diamond prices are highly skewed: a few very large stones cost much more than average.
   - The log transformation:
     - Reduces skew
     - Stabilizes variance
     - Can improve model accuracy
   - Final metrics must still be interpreted in real prices (USD).
3. Removing outliers from the `table` and `depth` columns, the only columns where outliers look more like measurement deviations than genuine extreme points of the dataset.
4. Creating and running the preprocessing pipeline with `ColumnTransformer`. Two preprocessors are actually created:
   - One for the input features (`preprocessor_X`), which scales numeric columns and one-hot encodes categorical columns.
   - One for the target (`preprocessor_y`), which scales the diamond prices.
   - The feature preprocessor is fitted on the training set only and then applied to the test set, to prevent data leakage.
   - The target variable (`price`) is log-transformed beforehand to stabilize variance and improve model performance.
5. `ColumnTransformer` applies different transformations to different groups of columns.
   - Uses `StandardScaler()` to standardize numeric columns so each has mean = 0 and standard deviation = 1.
   - This ensures numeric features are on the same scale (important for many ML models).
   - Uses `OneHotEncoder(...)` to convert categorical variables (`cut`, `color`, `clarity`) into dummy/indicator variables.
   - `remainder='passthrough'` means that any column not listed in a transformer is left unchanged.
6. The target variable (`price`) can also be scaled using `StandardScaler`.
   - This is useful when the model benefits from normalized target values (e.g., linear models).
   - In the cells shown here, `preprocessor_y` is defined but never fitted or applied, so the models are trained on the log-transformed, unscaled target.
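
The effect of the log transformation described above can be illustrated with a minimal sketch. It assumes synthetic, log-normally distributed prices (not the Diamonds data) and a hypothetical `skewness` helper written only for this example. `np.log1p` and `np.expm1` are the same pair of functions the notebook uses to transform and invert the target.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (assumption: log-normal, for illustration only)
price = np.exp(rng.normal(loc=7.8, scale=1.0, size=5_000))

def skewness(values):
    """Sample skewness: third central moment over variance**1.5 (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

price_log = np.log1p(price)       # forward transform used for training
price_back = np.expm1(price_log)  # inverse transform used for reporting in USD

print(f"skew of raw prices: {skewness(price):.2f}")
print(f"skew of log prices: {skewness(price_log):.2f}")
print("round trip recovers prices:", np.allclose(price_back, price))
```

The raw values are strongly right-skewed while the log-transformed values are close to symmetric, and the round trip shows that no information is lost when metrics are converted back to real prices.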

In [211]:
# Do not show warnings
import warnings
warnings.filterwarnings("ignore")

# Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time # to measure execution time of each model 

# Kaggle API
import kagglehub

# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Model Selection
from sklearn.model_selection import train_test_split # partition the dataset into train and test (holdout)
from sklearn.model_selection import KFold # prepare the folds for cross-validation
from sklearn.model_selection import cross_val_score # execute cross-validation
from sklearn.model_selection import RandomizedSearchCV # hyperparameter tuning with random search

# Metrics
from sklearn.metrics import mean_squared_error # MSE Evaluation Metric
from sklearn.metrics import mean_absolute_error # MAE evaluation metric
from sklearn.metrics import root_mean_squared_error # RMSE evaluation metric
from sklearn.metrics import r2_score # R² evaluation metric
from sklearn.metrics import make_scorer # to create custom metrics

# Algorithms
from sklearn.linear_model import LinearRegression # Linear Regression algorithm 
from sklearn.linear_model import Ridge # Ridge Regularization algorithm
from sklearn.linear_model import Lasso # Lasso Regularization algorithm
from sklearn.neighbors import KNeighborsRegressor # KNN algorithm
from sklearn.tree import DecisionTreeRegressor # Decision Tree algorithm
from sklearn.dummy import DummyRegressor # Baseline algorithm
from sklearn.ensemble import RandomForestRegressor # Random Forest algorithm
from sklearn.svm import SVR # SVM algorithm
from xgboost import XGBRegressor # XGBoost algorithm

# For displaying side by side tables
from IPython.display import display_html

# Set a random seed for reproducibility
SEED = 42

In [212]:
path = kagglehub.dataset_download("nancyalaswad90/diamonds-prices")

print("Path to dataset file:", path)

#Store the dataset into a Dataframe object
diamonds_df = pd.read_csv(path+"/Diamonds Prices2022.csv")
# Optional: take a sample of the data for faster processing (disabled here; frac=0.04 would keep 4% of the rows)
#df_sample = diamonds_df.sample(frac=0.04, random_state=42)

diamonds_df.head()
#df_sample.head()

Path to dataset file: C:\Users\rodri\.cache\kagglehub\datasets\nancyalaswad90\diamonds-prices\versions\4


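| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |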


In [213]:
# drop the first column; ignore the error in case the column doesn't exist (already removed)
diamonds_df = diamonds_df.drop('Unnamed: 0', axis=1, errors='ignore')
diamonds_df.describe(include='all')


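| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 53943.0 | 53943 | 53943 | 53943 | 53943.0 | 53943.0 | 53943.0 | 53943.0 | 53943.0 | 53943.0 |
| unique | | 5 | 7 | 8 | | | | | | |
| top | | Ideal | G | SI1 | | | | | | |
| freq | | 21551 | 11292 | 13067 | | | | | | |
| mean | 0.8 | | | | 61.75 | 57.46 | 3932.73 | 5.73 | 5.73 | 3.54 |
| std | 0.47 | | | | 1.43 | 2.23 | 3989.34 | 1.12 | 1.14 | 0.71 |
| min | 0.2 | | | | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | | | | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | | | | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | | | | 62.5 | 59.0 | 5324.0 | 6.54 | 6.54 | 4.04 |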


In [214]:
# Check for 0 or empty values in 'x', 'y', 'z' columns
#
print("Rows with 0 or empty values: ", ((diamonds_df['x'] == 0) | (diamonds_df['y'] == 0) | (diamonds_df['z'] == 0)).sum())
print("Removing rows with 0 or empty values")
diamonds_df = diamonds_df[(diamonds_df['x'] != 0) & (diamonds_df['y'] != 0) & (diamonds_df['z'] != 0)]
print("Rows with 0 or empty values: ", ((diamonds_df['x'] == 0) | (diamonds_df['y'] == 0) | (diamonds_df['z'] == 0)).sum())


Rows with 0 or empty values:  20
Removing rows with 0 or empty values
Rows with 0 or empty values:  0


In [215]:
# Step 1: Separate features and target
X = diamonds_df.drop(columns='price')
y = diamonds_df['price']

# Step 2: Apply transformation to y
y_log = np.log1p(y)
y_log = y_log.to_frame()

# Step 3: Train/test split
# test_size: represents the proportion of the dataset to be allocated to the test set
# random_state: get the same split of data every time the code is executed
X_train, X_test, y_train_log, y_test_log = train_test_split(X, y_log,
                                                            test_size=0.2,
                                                            random_state=SEED)



In [216]:
def iqr_filter(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply IQR filtering on training set only
# merge sets (X) and (y) to apply filter
train = X_train.copy()
train['price'] = y_train_log
train = iqr_filter(train, 'table')
train = iqr_filter(train, 'depth')

# Separate back
y_train_log = train['price']
X_train = train.drop(columns='price')

In [217]:
# pipeline preprocessing

X_num_cols = ['carat', 'table', 'depth', 'x', 'y', 'z']
X_cat_cols = ['cut', 'color', 'clarity']
y_num_col = ['price']

# The ColumnTransformer creates a data preprocessing pipeline that applies
# different transformations to different columns
preprocessor_X = ColumnTransformer(
    # List of transformations to be applied to specific column groups
    transformers=[
        # 1st Transformer: Numerical columns
        ('t_num',
         StandardScaler(), # Applies standardization (mean=0, std=1)
         X_num_cols),

        # 2nd Transformer: Categorical columns
        ('t_cat',
         # Converts categories to one-hot encoded columns and
         # drops the first category to avoid multicollinearity
         OneHotEncoder(drop='first', sparse_output=False),
         X_cat_cols)
    ],
    # Handling of columns not explicitly transformed
    remainder='passthrough' # Keep other columns (if any) - though not applicable here
)

# Target scaler: useful if the model benefits from normalized target values (e.g., linear models).
# Often combined with a log transformation (done earlier in this notebook: y_train_log).
# Note: preprocessor_y is only defined here; it is not fitted or applied in the cells below.
preprocessor_y = ColumnTransformer(
    transformers=[
        ('t_y', StandardScaler(), y_num_col)
    ]
)

# Apply transformations using fit_transform on training data and transform on
# testing one
X_train_processed = preprocessor_X.fit_transform(X_train)
X_test_processed = preprocessor_X.transform(X_test)

# Converts the log-transformed target (y_train_log) into a DataFrame.
# This is done to keep consistency when passing it into preprocessing or model training steps.
y_train_log_df = y_train_log.to_frame()

print(X_test.shape)        # number of rows in test features
print(y_test_log.shape)        # should match X_test
print(X_train_processed.shape)
print(X_test_processed.shape)
print(y_train_log.shape)


(10785, 9)
(10785, 1)
(40497, 23)
(10785, 23)
(40497,)


## Model Processing

The processing encompasses the training and execution of several regression models to assess which one performs best.

**Models**

- **Linear Models**
  - Linear Regression
  - Ridge
  - Lasso
- **Tree-Based Models**
  - Decision Tree
  - Random Forest
  - XGBoost
- **Instance-Based Model**
  - KNN
- **Kernel-based**
  - SVM


These models will be executed iteratively to search for the best parameters of each model and to identify the best performing one among them, using the `RandomizedSearchCV` function.

Note that `price` needs to be converted back from the log transformation so that price estimates are evaluated in real values rather than on a transformed scale.

### Model Evaluation

The function `evaluate_model` runs cross-validation to assess how well a regression model trained on log-transformed prices performs.
It evaluates the model (passed as an argument) in two spaces:
- Log space: where the model was trained (log(price)). Uses scikit-learn’s built-in `neg_mean_squared_error` scoring, so the reported value is a mean squared error in log space (its square root gives the RMSE).
- Real space: where predictions are converted back to actual prices, making metrics interpretable.
  - (**Real RMSE**) Uses a custom scorer (`rmse_real_scorer`) that first transforms predictions back with `np.expm1` (reverting the `np.log1p` transformation).
  - (**Real MAE**) Uses another custom scorer (`mae_real_scorer`) with the same inverse log transformation. It is easier to interpret than RMSE because it directly shows the average dollar deviation per diamond.
- Instead of testing the model on a single train/test split, `evaluate_model` uses cross-validation (CV) via `cross_val_score`, which:
  - Splits the dataset into `cv` folds (e.g., 5 folds with `KFold(5)`).
  - Trains the model on (`cv` - 1) folds.
  - Tests the model on the remaining fold.
  - Repeats this process `cv` times, ensuring every sample is tested once.
  - Collects one score per fold, resulting in an array of scores.
  - Taking the mean of these scores gives a more reliable estimate of performance than a single train/test split, since it reduces sensitivity to how the data is split.
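
The cross-validation mechanics above can be sketched on a toy problem. This is a minimal illustration under stated assumptions: the single feature and the prices are synthetic, and `mae_real` here is a local copy written for the example. It shows that each sample lands in exactly one test fold, and that a scorer built with `greater_is_better=False` returns negated errors, which is why the sign is flipped before reporting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (assumption): one "carat"-like feature and strictly positive prices
X = rng.uniform(0.2, 3.0, size=(200, 1))
price = 300 + 3000 * X[:, 0] ** 2 + rng.normal(0, 50, size=200).clip(-100, 100)
y_log = np.log1p(price)

def mae_real(y_true_log, y_pred_log):
    """MAE in real price space: invert log1p before measuring the error."""
    return mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

mae_real_scorer = make_scorer(mae_real, greater_is_better=False)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Every sample index appears in exactly one test fold
tested = np.concatenate([test_idx for _, test_idx in kfold.split(X)])

# One (negated) score per fold
scores = cross_val_score(LinearRegression(), X, y_log, cv=kfold, scoring=mae_real_scorer)
mae_usd = -scores.mean()

print("scores per fold:", len(scores))
print("each sample tested once:", sorted(tested.tolist()) == list(range(len(X))))
print(f"mean real-space MAE: {mae_usd:.2f}")
```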

In [218]:
# RMSE in real price space (convert the log prices back to real prices)
def rmse_real(y_true_log, y_pred_log):
    y_true = np.expm1(y_true_log)   # invert log1p
    y_pred = np.expm1(y_pred_log)
    return np.sqrt(mean_squared_error(y_true, y_pred))


# MAE in real price space (convert the log prices back to real prices)
def mae_real(y_true_log, y_pred_log):
    y_true = np.expm1(y_true_log)
    y_pred = np.expm1(y_pred_log)
    return mean_absolute_error(y_true, y_pred)

# Custom scorers (rmse_real, mae_real) allow evaluation in real 
# price space (after inverting the log transformation).
rmse_real_scorer = make_scorer(rmse_real, greater_is_better=False)
mae_real_scorer = make_scorer(mae_real, greater_is_better=False)

def evaluate_model(model, X, y_log, cv):
    """
    Evaluate a regression model trained on log(price).
    
    Returns a dictionary with:
    - log-RMSE
    - real-RMSE
    - real-MAE
    """
    
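    # Log-space error. Note: "neg_mean_squared_error" returns the negated MSE,
    # so the value stored under "Log RMSE" is a squared error, not a true RMSE.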
    scores_log = cross_val_score(
        model, X, y_log, cv=cv, scoring="neg_mean_squared_error"
    )
    log_rmse = -scores_log.mean()
    
    # Real RMSE
    scores_real_rmse = cross_val_score(
        model, X, y_log, cv=cv, scoring=rmse_real_scorer
    )
    real_rmse = -scores_real_rmse.mean()
    
    # Real MAE
    scores_real_mae = cross_val_score(
        model, X, y_log, cv=cv, scoring=mae_real_scorer
    )
    real_mae = -scores_real_mae.mean()
    
    return {
        "Log RMSE": log_rmse,
        "Real RMSE": real_rmse,
        "Real MAE": real_mae
    }

### Hyperparameter Optimization

The code below is intended to set up a benchmarking and hyperparameter optimization framework for multiple regression models on the diamond prices dataset.

It does three main things:
- Defines base models to compare.
- Sets up hyperparameter search spaces.
- Runs cross-validation with RandomizedSearchCV to find the best configuration per model.
  - For each model, it uses a different sample size of the dataset, taking into account the model complexity and the time to execute.

SVM works well on small datasets but can be impractical on large ones, since training time grows quickly with the number of samples and prediction time depends on the number of support vectors. XGBoost, on the other hand, is better suited to larger datasets.

The `RandomizedSearchCV` code block randomly samples `n_iter` combinations from the parameter space. The `cv` argument can be an integer or a `KFold` object; more folds give a more precise assessment at the cost of more processing power (and time).
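
As a minimal sketch of the random search just described, the example below tunes `Ridge` on synthetic data (an assumption made for the example; the feature matrix and coefficients are invented). The `alpha` search space, the scoring name and the use of a `KFold` object mirror the choices made in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

rng = np.random.default_rng(42)

# Synthetic regression problem (assumption, for illustration only)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": np.logspace(-3, 3, 50)},  # 50 candidate values
    n_iter=10,                                              # only 10 are tried
    scoring="neg_root_mean_squared_error",                  # higher (less negative) is better
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

print("candidates evaluated:", len(search.cv_results_["params"]))
print("best alpha:", search.best_params_["alpha"])
print("best CV RMSE:", -search.best_score_)
```

Because `best_score_` is a negated RMSE, the sign is flipped before it is reported as an error.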

Thus, the algorithm iterates through all models and calls `evaluate_model` for cross-validated evaluation, computing log-RMSE, real-RMSE, and real-MAE. The inner CV optimization is performed by `RandomizedSearchCV`, which is responsible for finding the best hyperparameters.

Results are stored in two dictionaries:
- **results** - summary metrics for each model.
- **best_results** - detailed info about best hyperparameters and best fitted estimator.

This setup makes it easy to:
 - Compare baseline and advanced models.
 - Identify the best-performing model for diamond price prediction.
 - Document both statistical accuracy and real-world error (USD).

The next step is to **retrain** the best model on the full training set; the best model is the one with the lowest error during tuning. Note that costly models are tuned (prototyped) on a sample: SVM in the code below, and the same approach applies to XGBoost or any other model whose hyperparameter search is expensive. The samples are drawn from the preprocessed training data already held in memory rather than re-read from the dataset, which speeds up the process and keeps it consistent.


In [220]:

base_models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Decision Tree": DecisionTreeRegressor(),
    "KNN": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(random_state=SEED),
    "SVM": SVR(),
    "XGBoost": XGBRegressor(random_state=SEED, verbosity=0) # type: ignore
}


# Define parameter spaces
# prepare reasonable search spaces for each model
param_spaces = {
    "Linear Regression": {},  # no hyperparameters to tune
    "Ridge": {
        "alpha": np.logspace(-3, 3, 50)
    },
    "Lasso": {
        "alpha": np.logspace(-3, 3, 50)
    },
    "Decision Tree": {
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 5, 10, 20],
        "min_samples_leaf": [1, 2, 5, 10]
    },
    "KNN": {
        "n_neighbors": range(2, 50),
        "weights": ["uniform", "distance"],
        "p": [1, 2]  # Manhattan / Euclidean
    },
    "SVM": {
        "C": np.logspace(-2, 3, 20),
        "gamma": np.logspace(-3, 2, 20),
        "kernel": ["rbf", "poly", "sigmoid"]
    },
    "Random Forest": {
        "n_estimators": [100, 200, 300, 500],
        "max_depth": [None, 5, 10, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        # 1.0 = use all features; replaces "auto", which was removed in scikit-learn 1.3
        "max_features": [1.0, "sqrt", "log2"]
    },
    "XGBoost": {
        "n_estimators": [100, 200, 500],
        "learning_rate": [0.01, 0.05, 0.1, 0.2],
        "max_depth": [3, 5, 7, 10],
        "subsample": [0.6, 0.8, 1.0],
        "colsample_bytree": [0.6, 0.8, 1.0],
        "gamma": [0, 0.1, 0.2, 0.3],
        "reg_alpha": [0, 0.01, 0.1, 1],
        "reg_lambda": [1, 1.5, 2, 5]
    }
}

partitions = 10 
kfold = KFold(n_splits=partitions, shuffle=True, random_state=SEED) # makes the partitioning in 10 folds

# Prepare models with hyperparameter tuning (Randomized Search)
# If no hyperparameters to tune, use the base model directly
# KFold CV with 10 splits ensures robust evaluation.
searches = {}
for name, model in base_models.items():
    if param_spaces[name]:  # if we have params to tune
        searches[name] = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_spaces[name],
            n_iter=10,   # number of random trials
            scoring="neg_root_mean_squared_error",
            cv=kfold,        # inner CV for hyperparameter tuning
            random_state=SEED,
            n_jobs=-1
        )
    else:
        searches[name] = model  # LinearRegression (no hyperparams)

results = {}
best_results = {}
times = {}
for name, search in searches.items():
    print(f"Optimizing {name}...")
    if name == "SVM":  # SVM is too slow for this dataset
        print("Using 3% of training data for SVM")
        X_train_sample, _, y_train_sample, _ = train_test_split(X_train_processed, 
                                                                 y_train_log_df, 
                                                                 train_size=0.03, 
                                                                 random_state=SEED)
    else:
        print("Using full training data for", name)
        X_train_sample = X_train_processed
        y_train_sample = y_train_log_df
    
    start_time = time.time()
    
    results[name] = evaluate_model(search, X_train_sample, y_train_sample, cv=3) #kfold)  # outer CV
    results[name]['Model'] = search
    search.fit(X_train_sample, y_train_sample)  # fit on full training data
    print("Best hyperparameters:", search.best_params_ if hasattr(search, 'best_params_' ) else "N/A")
    end_time = time.time()
    results[name]['Duration (s)'] = end_time - start_time
    print(f"Time spent: {(end_time - start_time):.2f} seconds")
    if param_spaces[name]:
        best_results[name] = {
            "best_score": search.best_score_,
            "best_params": search.best_params_,
            "best_estimator": search.best_estimator_
        }
        results[name]['Best Score'] = -search.best_score_
    print()

pd.set_option("display.float_format", "{:,.2f}".format)
df_results = pd.DataFrame(results).T
df_results.drop(columns=['Model'])


Optimizing Linear Regression...
Using full training data for Linear Regression
Best hyperparameters: N/A
Time spent: 0.20 seconds

Optimizing Ridge...
Using full training data for Ridge
Best hyperparameters: {'alpha': 0.03906939937054617}
Time spent: 6.82 seconds

Optimizing Lasso...
Using full training data for Lasso
Best hyperparameters: {'alpha': 0.03906939937054617}
Time spent: 38.20 seconds

Optimizing Decision Tree...
Using full training data for Decision Tree
Best hyperparameters: {'min_samples_split': 5, 'min_samples_leaf': 10, 'max_depth': None}
Time spent: 10.16 seconds

Optimizing KNN...
Using full training data for KNN
Best hyperparameters: {'weights': 'distance', 'p': 1, 'n_neighbors': 13}
Time spent: 108.71 seconds

Optimizing Random Forest...
Using full training data for Random Forest
Best hyperparameters: {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None}
Time spent: 1289.26 seconds

Optimizing SVM...
Using 3%

| Model | Log RMSE | Real RMSE | Real MAE | Duration (s) | Best Score |
|---|---|---|---|---|---|
| Linear Regression | 0.02 | 5686.86 | 450.34 | 0.2 | |
| Ridge | 0.02 | 5742.62 | 450.82 | 6.82 | 0.14 |
| Lasso | 0.08 | 2185.92 | 943.16 | 38.2 | 0.28 |
| Decision Tree | 0.02 | 768.17 | 358.52 | 10.16 | 0.12 |
| KNN | 0.02 | 807.58 | 388.78 | 108.71 | 0.13 |
| Random Forest | 0.01 | 705.3 | 334.06 | 1289.26 | 0.11 |
| SVM | 0.02 | 976.94 | 460.5 | 804.17 | 0.13 |
| XGBoost | 0.01 | 559.03 | 274.43 | 266.63 | 0.09 |


### Best Model training and execution

The code snippet below selects the best model and its optimized parameters, then trains and runs it on the `full` train/test split created in the initial stages of this notebook.

The best model is then compared against the original price values (USD) from the dataset and against the baseline, which corresponds to the simplest model available.

The baseline is the model all others are compared against. Its role is to establish a minimum performance threshold: if the more robust models cannot beat it, the features, preprocessing and target transformation need to be reviewed; if the complex models do beat it, the added complexity (and resource spend) adds value.

**Baseline**: `Linear Regression` (with the same preprocessing and transformation applied to all models).
`Linear Regression` runs in under 1 second on the full training set. In cross-validation, its Real MAE (450.34) is of the same order as XGBoost's (274.43), although its Real RMSE (5,686.86 versus 559.03) is far higher.
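
The idea of a minimum performance threshold can be sketched with an even simpler reference than the one used here. The example below is an illustration, not the notebook's method: it compares `DummyRegressor` (which always predicts the training mean) with `LinearRegression` on synthetic log-prices, where the data, the coefficients and the `cv_rmse` helper are all assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic data (assumption): log-price depends linearly on log-carat plus noise
carat = rng.uniform(0.2, 3.0, size=(500, 1))
X = np.log(carat)
y_log = 6.0 + 1.7 * X[:, 0] + rng.normal(scale=0.2, size=500)

def cv_rmse(model):
    """Mean 5-fold RMSE in log space (hypothetical helper for this example)."""
    scores = cross_val_score(model, X, y_log, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

rmse_dummy = cv_rmse(DummyRegressor(strategy="mean"))
rmse_linear = cv_rmse(LinearRegression())

print(f"Dummy (mean) RMSE:      {rmse_dummy:.3f}")
print(f"Linear Regression RMSE: {rmse_linear:.3f}")
```

A model that cannot clearly beat the simplest reference adds cost without adding value; the same reasoning is applied in this notebook with `Linear Regression` as the reference.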

In [221]:

# Train and predict with the given model
def predict_model(model, X_train, y_train_log, X_test):
    model.fit(X_train, y_train_log)  # fit the model passed in, not the global baseline
    y_pred_log = model.predict(X_test)
    y_pred_real = np.expm1(y_pred_log)
    return y_pred_real

# Print the results
def print_results(model_name, model, y_test_real, y_pred_real):
    rmse = root_mean_squared_error(y_test_real, y_pred_real)
    r2 = r2_score(y_test_real, y_pred_real)
    mae = mean_absolute_error(y_test_real, y_pred_real)

    print(f"{model_name} Test RMSE: {rmse:,.2f}")
    print(f"{model_name} Test MAE: {mae:,.2f}")
    print(f"{model_name} Test R²: {r2:.3f}")
    print(f"Regressor: {model}")

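# Select the best model by its score. best_score_ is the negated RMSE,
# so the highest score corresponds to the lowest error.
# The selected model is then refit on the *entire training set* in predict_model.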
best_model_name = max(best_results, key=lambda k: best_results[k]["best_score"])
best_model = best_results[best_model_name]["best_estimator"]

baseline_model = results["Linear Regression"]["Model"]

y_baseline_pred_real = predict_model(baseline_model, X_train_processed, y_train_log_df, X_test_processed)
y_pred_real = predict_model(best_model, X_train_processed, y_train_log_df, X_test_processed)

y_test_real = np.expm1(y_test_log)

print_results("Baseline Model", baseline_model, y_test_real, y_baseline_pred_real)
print(f"Score: {baseline_model.score(X_test_processed, y_test_log):,.4f}")
print()

print_results("Best Model", best_model, y_test_real, y_pred_real)
print(f"Score: {best_model.score(X_test_processed, y_test_log):,.4f}")


Baseline Model Test RMSE: 3,569,600.19
Baseline Model Test MAE: 34,798.83
Baseline Model Test R²: -800556.726
Regressor: LinearRegression()
Score: 0.9679

Best Model Test RMSE: 568.14
Best Model Test MAE: 280.38
Best Model Test R²: 0.980
Regressor: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.8, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=0, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.2, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=200, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)
Score: 0.9916
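
The baseline output above combines a high log-space score (0.9679) with a very large real-space RMSE. One plausible explanation, not verified against this dataset, is that exponentiating a few badly over-predicted log values produces enormous dollar errors. The sketch below reproduces that effect on synthetic data; the distribution, the noise level and the local `r2` helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-prices and predictions with small log-space noise (assumption)
y_true_log = rng.normal(loc=7.8, scale=1.0, size=1_000)
y_pred_log = y_true_log + rng.normal(scale=0.1, size=1_000)

# A single large over-prediction in log space, placed on the most expensive item
worst = np.argmax(y_true_log)
y_pred_log[worst] = y_true_log[worst] + 5.0

def r2(y_true, y_pred):
    """Coefficient of determination (hypothetical local helper)."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Back to real prices, as done with np.expm1 in this notebook
y_true = np.expm1(y_true_log)
y_pred = np.expm1(y_pred_log)

keep = np.arange(len(y_true)) != worst
rmse_real = np.sqrt(((y_true - y_pred) ** 2).mean())
rmse_real_without = np.sqrt(((y_true[keep] - y_pred[keep]) ** 2).mean())

print(f"R2 in log space:  {r2(y_true_log, y_pred_log):.3f}")
print(f"R2 in real space: {r2(y_true, y_pred):.3f}")
print(f"Real RMSE with / without the bad prediction: {rmse_real:,.0f} / {rmse_real_without:,.0f}")
```

In log space the single bad prediction barely moves the score, while in real space it dominates both RMSE and R². This is why the log-space score and the USD metrics should be read together.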


### Conclusion

The whole process follows this flow:
1. Fetch the dataset
2. Split into train/test
3. Preprocessing/pipeline building
4. Model selection/tuning
5. Best model / baseline
6. Final model training

Baseline against Advanced Models: The linear models behave similarly, and `Linear Regression` is a solid benchmark, but tree-based ensembles (Random Forest, XGBoost) significantly improve prediction accuracy, at the cost of longer run times (about 267 s for XGBoost and 1,289 s for Random Forest, versus under 1 s for Linear Regression).

XGBoost offers the best balance of accuracy and execution time. It turns out to be the most suitable model (of those tested in this exercise) for the Diamonds Prices dataset, which contains over 50K records.

SVM vs. XGBoost: SVM is the least efficient model. It performed surprisingly well on a small sample but is computationally expensive at scale: it took so long to execute that it required a very restricted sample, representing less than 5% of the entire dataset. XGBoost is therefore more suitable for full-dataset deployment due to its balance of performance and execution time.

**Execution Strategy**: Using representative samples for hyperparameter tuning (fast) followed by retraining on the full dataset (final model) ensures both time efficiency and model robustness.
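
The execution strategy above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are synthetic, a `DecisionTreeRegressor` stands in for the costly model, and the 10% sample size is arbitrary. Hyperparameters are searched on the sample, then an unfitted copy of the best estimator is trained on the full data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic "full" training set (assumption, for illustration only)
X_full = rng.uniform(0, 1, size=(2_000, 3))
y_full = np.sin(6 * X_full[:, 0]) + X_full[:, 1] ** 2 + rng.normal(scale=0.05, size=2_000)

# Step 1: draw a reproducible 10% sample for the (cheap) hyperparameter search
X_sample, _, y_sample, _ = train_test_split(
    X_full, y_full, train_size=0.1, random_state=42
)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions={
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 2, 5, 10],
    },
    n_iter=8,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_sample, y_sample)

# Step 2: retrain an unfitted copy with the best parameters on the full data
final_model = clone(search.best_estimator_).fit(X_full, y_full)

print("sample size:", len(X_sample))
print("best params:", search.best_params_)
print("final model trained on features:", final_model.n_features_in_)
```

`clone` copies the hyperparameters without the fitted state, so the final model is trained from scratch on all rows rather than inheriting anything learned from the sample.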

In [222]:


def to_df(y):
    df = pd.DataFrame(y, columns=["price"])
    return df.sort_values(by='price').head(20).reset_index(drop=True)

df_best_model = to_df(y_pred_real)
df_baseline_model = to_df(y_baseline_pred_real)
y_test_real = y_test_real.sort_values(by='price').head(20).reset_index(drop=True)

df_predictor_styler = df_best_model.style.set_table_attributes("style='display:inline; margin-right: 20px;'").set_caption('Best Predictor')
df_original_styler = y_test_real.style.set_table_attributes("style='display:inline;'").set_caption('Original')
df_baseline_styler = df_baseline_model.style.set_table_attributes("style='display:inline; margin-left: 20px;'").set_caption('Baseline')

# Display side by side the first 20 predicted prices from best model, 
# baseline model and original prices
display_html(df_original_styler._repr_html_() + 
             df_baseline_styler._repr_html_() + 
             df_predictor_styler._repr_html_(), 
             raw=True)

Each column is sorted independently, so a row does not refer to the same diamond across columns.

| | Original (price) | Baseline (price) | Best Predictor (price) |
|---|---|---|---|
| 0 | 335.0 | 279.030288 | 338.165436 |
| 1 | 336.0 | 313.264299 | 346.47998 |
| 2 | 337.0 | 315.48959 | 351.64566 |
| 3 | 358.0 | 316.607324 | 363.463531 |
| 4 | 360.0 | 318.690171 | 365.821136 |
| 5 | 363.0 | 319.486916 | 365.913666 |
| 6 | 364.0 | 321.610862 | 368.204346 |
| 7 | 366.0 | 323.49355 | 373.891693 |
| 8 | 367.0 | 326.981917 | 379.278595 |
| 9 | 367.0 | 327.47805 | 380.431793 |
