# MVP: Machine Learning & Analytics

**Autor**: Rodrigo Eduardo Modesto de Abreu

**Data**: 27/08/2025

**Matrícula**: 4052025000009

**Dataset**: [Diamond Prices](https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices)

Introduction

This MVP is based on the Diamond Prices dataset, which was also used in the development of [MVP2](https://github.com/remabreu/DiamondsPrices/tree/main).

In that repository, you can find:
* [README file](https://github.com/remabreu/DiamondsPrices/blob/main/README.md) - Describes the dataset in detail.
* [Notebook](https://github.com/remabreu/DiamondsPrices/blob/main/diamonds.ipynb) - Contains the full dataset analysis and preprocessing, which is also replicated in this notebook.

The Problem

The Diamonds Prices dataset provides several features for predicting the target variable, price, as a supervised regression problem. It is a common and well-known regression dataset on [Kaggle](https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices). The latest version of the dataset is used here; it contains 53943 records and 11 features (one of the attributes is an index and plays no role in the data analysis).

Exploratory Data Analysis

In summary, the dataset shows the following:
* The dataset didn't present any missing data (only )
* The price column has a strongly skewed, unbalanced distribution.
* The uncommon prices may or may not be regarded as outliers, depending on the intended use of the diamonds, for example a value maximizer (i.e. industrial use), a collector or a bridal budget. No mismeasurement or data error was observed that would justify treating them as outliers. However, these values do fall outside the "fences" beyond Quartiles 1 and 3 under the IQR method.
* carat and price are strongly correlated, and cut, color and clarity act as modifiers of that correlation by raising the price for the same carat. This behavior was more distinct for smaller/lighter carats, though.
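
The skew noted above is the usual motivation for modelling a log-transformed price. The following is a minimal sketch on synthetic log-normal values, not on the actual Diamonds data (the `prices` series and its parameters are assumptions made only for illustration), showing how `np.log1p` reduces skewness:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (illustrative only, not the Diamonds data)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5_000))

skew_raw = prices.skew()            # strongly positive for right-skewed data
skew_log = np.log1p(prices).skew()  # much closer to 0 after the transform

print(f"skew (raw):   {skew_raw:.2f}")
print(f"skew (log1p): {skew_log:.2f}")
```

The same two calls to `.skew()` can be run on the real `price` column to check whether the transform helps there as well.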

Data Preprocessing



In [None]:
# Do not show warnings
import warnings
warnings.filterwarnings("ignore")

# Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random # to define random seed

import kagglehub

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Model Selection
from sklearn.model_selection import train_test_split # partition the dataset into train and test (holdout)
from sklearn.model_selection import KFold # prepare the folds for cross-validation
from sklearn.model_selection import cross_val_score # execute cross-validation

# Metrics
from sklearn.metrics import mean_squared_error # MSE evaluation metric
from sklearn.metrics import mean_absolute_error # MAE evaluation metric
from sklearn.metrics import r2_score # R² evaluation metric

# Algorithms
from sklearn.linear_model import LinearRegression # Linear Regression algorithm
from sklearn.linear_model import Ridge # Ridge Regularization algorithm
from sklearn.linear_model import Lasso # Lasso Regularization algorithm
from sklearn.neighbors import KNeighborsRegressor # KNN algorithm
from sklearn.tree import DecisionTreeRegressor # Decision Tree algorithm
from sklearn.dummy import DummyRegressor # Baseline algorithm
from sklearn.ensemble import RandomForestRegressor # Random Forest algorithm
from sklearn.svm import SVR # SVM algorithm



In [143]:
path = kagglehub.dataset_download("nancyalaswad90/diamonds-prices")

print("Path to dataset file:", path)

#Store the dataset into a Dataframe object
diamonds_df = pd.read_csv(path+"/Diamonds Prices2022.csv")
df_sample = diamonds_df.sample(frac=0.2, random_state=42)

df_sample.head()

Path to dataset file: C:\Users\rodri\.cache\kagglehub\datasets\nancyalaswad90\diamonds-prices\versions\4


Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1388,1389,0.24,Ideal,G,VVS1,62.1,56.0,559,3.97,4.0,2.47
19841,19842,1.21,Very Good,F,VS2,62.9,54.0,8403,6.78,6.82,4.28
41647,41648,0.5,Fair,E,SI1,61.7,68.0,1238,5.09,5.03,3.12
41741,41742,0.5,Ideal,D,SI2,62.8,56.0,1243,5.06,5.03,3.17
17244,17245,1.55,Ideal,E,SI2,62.3,55.0,6901,7.44,7.37,4.61


In [144]:
# drop the first column; ignore the error in case the column doesn't exist (already removed)
df_sample = df_sample.drop('Unnamed: 0', axis=1, errors='ignore')
df_sample.describe(include='all')


Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
count,10789.0,10789,10789,10789,10789.0,10789.0,10789.0,10789.0,10789.0,10789.0
unique,,5,7,8,,,,,,
top,,Ideal,G,SI1,,,,,,
freq,,4316,2299,2521,,,,,,
mean,0.79,,,,61.77,57.45,3876.57,5.72,5.72,3.54
std,0.47,,,,1.45,2.26,3951.53,1.12,1.11,0.74
min,0.2,,,,43.0,51.0,335.0,0.0,0.0,0.0
25%,0.4,,,,61.1,56.0,944.0,4.71,4.72,2.91
50%,0.7,,,,61.9,57.0,2388.0,5.69,5.71,3.52
75%,1.04,,,,62.6,59.0,5195.0,6.52,6.52,4.03


In [145]:
# Check for 0 or empty values in 'x', 'y', 'z' columns
#
print("Rows with 0 or empty values: ", ((df_sample['x'] == 0) | (df_sample['y'] == 0) | (df_sample['z'] == 0)).sum())
print("Removing rows with 0 or empty values")
df_sample = df_sample[(df_sample['x'] != 0) & (df_sample['y'] != 0) & (df_sample['z'] != 0)]
print("Rows with 0 or empty values: ", ((df_sample['x'] == 0) | (df_sample['y'] == 0) | (df_sample['z'] == 0)).sum())


Rows with 0 or empty values:  4
Removing rows with 0 or empty values
Rows with 0 or empty values:  0


In [None]:
#clean_df = diamonds_df.drop(['x', 'y', 'z'], axis=1)

In [146]:
# Step 1: Separate features and target
X = df_sample.drop(columns='price')
y = df_sample['price']

# Step 2: Apply transformation to y
y_log = np.log1p(y)
y_log = y_log.to_frame()

# Step 3: Train/test split
# train_size: proportion of the dataset allocated to the training set
#             (0.2 trains on 20% of the rows and holds out the remaining 80%)
# random_state: get the same split of data every time the code is executed
X_train, X_test, y_train_log, y_test_log = train_test_split(X, y_log,
                                                            train_size=0.2,
                                                            #test_size=0.2,
                                                            random_state=42)
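
With `train_size=0.2`, only 20% of the rows are used for training and 80% are held out, which is the reverse of the more common `test_size=0.2` holdout. A small sketch on hypothetical toy arrays (`X_toy` and `y_toy` are illustrative names, not part of the notebook) makes the difference visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, one feature
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

# train_size=0.2 -> 20 training rows, 80 test rows
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, train_size=0.2, random_state=42)

# test_size=0.2 -> 80 training rows, 20 test rows
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(a_train), len(a_test))
print(len(b_train), len(b_test))
```

A small training share can be a reasonable way to keep slow models such as `SVR` tractable, but it is worth choosing deliberately.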



In [147]:
def iqr_filter(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply IQR filtering on training set only
# merge sets (X) and (y) to apply filter
train = X_train.copy()
train['price'] = y_train_log
train = iqr_filter(train, 'table')
train = iqr_filter(train, 'depth')

# Separate back
y_train_log = train['price']
X_train = train.drop(columns='price')
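
To make the IQR rule concrete, here is a minimal sketch on a made-up `table` series (the values and the `iqr_bounds` helper are assumptions for illustration; the helper mirrors the bounds computed in `iqr_filter`):

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the lower and upper IQR fences of a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme (95.0)
toy = pd.Series([55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 95.0], name="table")

lower, upper = iqr_bounds(toy)
kept = toy[(toy >= lower) & (toy <= upper)]

print(lower, upper)
print(kept.tolist())
```

Here Q1 is 56.5 and Q3 is 58.5, so the fences are 53.5 and 61.5 and only the value 95.0 is dropped.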

In [148]:
X_num_cols = ['carat', 'table', 'depth', 'x', 'y', 'z']
X_cat_cols = ['cut', 'color', 'clarity']
y_num_col = ['price']

# The ColumnTransformer creates a data preprocessing pipeline that applies
# different transformations to different columns
preprocessor_X = ColumnTransformer(
    # List of transformations to be applied to specific column groups
    transformers=[
        # 1st Transformer: Numerical columns
        ('t_num',
         StandardScaler(), # Applies standardization (mean=0, std=1)
         X_num_cols),

        # 2nd Transformer: Categorical columns
        ('t_cat',
         # Converts categories to one-hot encoded columns and
         #drops first category to avoid multicollinearity
         OneHotEncoder(drop='first', sparse_output=False),
         X_cat_cols)
    ],
    # Handling of columns not explicitly transformed
    remainder='passthrough' # Keep other columns (if any) - though not applicable here
)

preprocessor_y = ColumnTransformer(
    transformers=[
        ('t_y', StandardScaler(), y_num_col)
    ]
)

# Apply transformations using fit_transform on training data and transform on
# testing one
X_train_processed = preprocessor_X.fit_transform(X_train)
X_test_processed = preprocessor_X.transform(X_test)

y_train_log_df = y_train_log.to_frame()

#y_train_processed = preprocessor_y.fit_transform(y_train_log_df)
#y_test_processed = preprocessor_y.transform(y_test_log)

In [75]:
# Create the folds for cross-validation
num_particoes = 10 # number of cross-validation folds
kfold = KFold(n_splits=num_particoes, shuffle=True, random_state=7) # split the data into 10 folds

In [None]:
# Modeling

SEED = 7
# Set a global seed for this code cell
np.random.seed(SEED)
random.seed(SEED)

# Lists to store the models, the results and the model names
models = []
results = []
names = []

# Prepare the models and add them to a list
models.append(('Dummy',  DummyRegressor(strategy='median')))
models.append(('LR', LinearRegression()))
models.append(('Ridge', Ridge()))
models.append(('Lasso', Lasso()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('RFR', RandomForestRegressor(random_state=SEED)))
models.append(('SVM', SVR()))

# Evaluate one model at a time
for name, model in models:
  # The target is y_train_log (log1p of price); y_train_processed is never
  # created because the lines that would build it are commented out above
  cv_results = cross_val_score(model, X_train_processed, y_train_log, cv=kfold, scoring='neg_mean_squared_error')
  results.append(cv_results)
  names.append(name)
  # print the MSE, the standard deviation of the MSE and the RMSE over the 10 cross-validation results
  msg = "%s: MSE %0.2f (%0.2f) - RMSE %0.2f" % (name, abs(cv_results.mean()), cv_results.std(), np.sqrt(abs(cv_results.mean())))
  print(msg)

# Boxplot comparing the models
#fig = plt.figure()
#fig.suptitle('Model MSE Comparison')
#ax = fig.add_subplot(111)
#plt.boxplot(results)
#ax.set_xticklabels(names)
# plt.show()

In [None]:
from sklearn.metrics import make_scorer

SEED = 7
# Set a global seed for this code cell
np.random.seed(SEED)
random.seed(SEED)

# RMSE in real price space
def rmse_real(y_true_log, y_pred_log):
    y_true = np.expm1(y_true_log)   # invert log1p
    y_pred = np.expm1(y_pred_log)
    return np.sqrt(mean_squared_error(y_true, y_pred))


# MAE in real price space
def mae_real(y_true_log, y_pred_log):
    y_true = np.expm1(y_true_log)
    y_pred = np.expm1(y_pred_log)
    return mean_absolute_error(y_true, y_pred)


def evaluate_model(model, X, y_log, cv):
    """
    Evaluate a regression model trained on log(price).
    
    Returns a dictionary with:
    - log-MSE (mean squared error in log space)
    - real-RMSE
    - real-MAE
    """
    
    # Log MSE: the scorer returns the negative MSE, so the sign is flipped.
    # No square root is taken, so this value is an MSE, not an RMSE.
    scores_log = cross_val_score(
        model, X, y_log, cv=cv, scoring="neg_mean_squared_error"
    )
    log_mse = -scores_log.mean()
    
    # Real RMSE
    scores_real_rmse = cross_val_score(
        model, X, y_log, cv=cv, scoring=rmse_real_scorer
    )
    real_rmse = -scores_real_rmse.mean()
    
    # Real MAE
    scores_real_mae = cross_val_score(
        model, X, y_log, cv=cv, scoring=mae_real_scorer
    )
    real_mae = -scores_real_mae.mean()
    
    return {
        "Log MSE": log_mse,
        "Real RMSE": real_rmse,
        "Real MAE": real_mae
    }

rmse_real_scorer = make_scorer(rmse_real, greater_is_better=False)
mae_real_scorer = make_scorer(mae_real, greater_is_better=False)

models = {
    "Dummy Regressor": DummyRegressor(strategy='median'),
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(random_state=SEED),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR()
}

#model = SVR(kernel="rbf")
#model = DummyRegressor(strategy='median')
#model = LinearRegression()
# Create the folds for cross-validation
num_particoes = 10 # number of cross-validation folds
kfold = KFold(n_splits=num_particoes, shuffle=True, random_state=7) # split the data into 10 folds

results = {}
for name, model in models.items():
    results[name] = evaluate_model(model, X_train_processed, y_train_log_df, cv=kfold)

df_results = pd.DataFrame(results).T  # transpose for readability
print(df_results)


#results = evaluate_model(model, X_train_processed, y_train_log_df, cv=kfold)
#print(results)
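
The negation in `evaluate_model` follows from `greater_is_better=False`: scikit-learn then reports the metric with its sign flipped so that larger is always better. A minimal sketch on random toy data (the `X_toy` and `y_toy_log` arrays are assumptions for illustration) shows the sign convention:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

def mae_real(y_true_log, y_pred_log):
    # Invert log1p before measuring the error in price units
    return mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

mae_real_scorer = make_scorer(mae_real, greater_is_better=False)

# Random toy data: features carry no signal, target is log1p of a price
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy_log = np.log1p(rng.uniform(500, 5000, size=60))

scores = cross_val_score(
    DummyRegressor(strategy="median"), X_toy, y_toy_log,
    cv=KFold(n_splits=3, shuffle=True, random_state=7),
    scoring=mae_real_scorer,
)

print(scores)          # negated MAE per fold
print(-scores.mean())  # positive MAE in price units
```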



In [149]:
def evaluate_model(model, X, y_log, cv):
    """
    Evaluate a regression model trained on log(price).
    
    Returns a dictionary with:
    - log-MSE (mean squared error in log space)
    - real-RMSE
    - real-MAE
    """
    
    # Log MSE: the scorer returns the negative MSE, so the sign is flipped.
    # No square root is taken, so this value is an MSE, not an RMSE.
    scores_log = cross_val_score(
        model, X, y_log, cv=cv, scoring="neg_mean_squared_error"
    )
    log_mse = -scores_log.mean()
    
    # Real RMSE
    scores_real_rmse = cross_val_score(
        model, X, y_log, cv=cv, scoring=rmse_real_scorer
    )
    real_rmse = -scores_real_rmse.mean()
    
    # Real MAE
    scores_real_mae = cross_val_score(
        model, X, y_log, cv=cv, scoring=mae_real_scorer
    )
    real_mae = -scores_real_mae.mean()
    
    return {
        "Log MSE": log_mse,
        "Real RMSE": real_rmse,
        "Real MAE": real_mae
    }
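
As an alternative to custom scorers, scikit-learn's `TransformedTargetRegressor` can apply `log1p` before fitting and `expm1` after predicting, so that the built-in scorers report errors directly in price units. This is not what the notebook does; it is only a sketch on synthetic data, and `X_toy`, `y_toy` and the coefficients used to generate them are assumptions:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: price grows exponentially with a single "carat"-like feature
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.2, 2.0, size=(200, 1))
y_toy = np.exp(6.0 + 1.5 * X_toy[:, 0] + rng.normal(scale=0.1, size=200))

# The wrapper fits the regressor on log1p(y) and returns expm1(prediction)
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)

# A standard scorer now measures the error in the original price units
scores = cross_val_score(ttr, X_toy, y_toy, cv=3, scoring="neg_mean_absolute_error")
real_mae = -scores.mean()
print(real_mae)
```

With this design the target is passed untransformed, which removes the need to remember where `expm1` must be applied.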

In [150]:


from sklearn.model_selection import RandomizedSearchCV

base_models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Decision Tree": DecisionTreeRegressor(),
    "KNN": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVM": SVR(),
}


# Define parameter spaces
# prepare reasonable search spaces for each model

param_spaces = {
    "Linear Regression": {},  # no hyperparameters to tune
    "Ridge": {
        "alpha": np.logspace(-3, 3, 50)
    },
    "Lasso": {
        "alpha": np.logspace(-3, 3, 50)
    },
    "Decision Tree": {
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 5, 10, 20],
        "min_samples_leaf": [1, 2, 5, 10]
    },
    "KNN": {
        "n_neighbors": range(2, 50),
        "weights": ["uniform", "distance"],
        "p": [1, 2]  # Manhattan / Euclidean
    },
    "SVM": {
        "C": np.logspace(-2, 3, 20),
        "gamma": np.logspace(-3, 2, 20),
        "kernel": ["rbf", "poly", "sigmoid"]
    },
    "Random Forest": {
        "n_estimators": [100, 200, 300, 500],
        "max_depth": [None, 5, 10, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        # 1.0 uses all features; the former "auto" option is no longer
        # accepted by recent scikit-learn releases
        "max_features": [1.0, "sqrt", "log2"]
    }
}

num_particoes = 10 # number of cross-validation folds
kfold = KFold(n_splits=num_particoes, shuffle=True, random_state=7) # split the data into 10 folds

searches = {}
for name, model in base_models.items():
    if param_spaces[name]:  # if we have params to tune
        searches[name] = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_spaces[name],
            n_iter=10,   # number of random trials
            scoring="neg_root_mean_squared_error",
            cv=3, #kfold,        # inner CV for hyperparameter tuning
            random_state=42,
            n_jobs=-1
        )
    else:
        searches[name] = model  # LinearRegression (no hyperparams)

results = {}
for name, search in searches.items():
    print(f"🔍 Optimizing {name}...")
    results[name] = evaluate_model(search, X_train_processed, y_train_log_df, cv=3) #kfold)  # outer CV

pd.set_option("display.float_format", "{:,.2f}".format)
df_results = pd.DataFrame(results).T
#df_results.style.format("{:,.2f}")
df_results
#print(df_results)

🔍 Optimizing Linear Regression...
🔍 Optimizing Ridge...
🔍 Optimizing Lasso...
🔍 Optimizing Decision Tree...
🔍 Optimizing KNN...
🔍 Optimizing SVM...
🔍 Optimizing Random Forest...


Model,Log MSE,Real RMSE,Real MAE
Linear Regression,0.02,848.83,405.86
Ridge,0.02,904.45,415.54
Lasso,0.07,1875.54,920.94
Decision Tree,0.04,1278.34,634.45
KNN,0.04,1221.93,622.99
SVM,0.02,875.24,411.37
Random Forest,0.03,1110.56,540.4
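
Passing a `RandomizedSearchCV` object to `cross_val_score`, as done above, amounts to nested cross-validation: the search is refit inside every outer training fold, and the outer score is measured on data the search never saw. Since `evaluate_model` calls `cross_val_score` three times, each search is also repeated once per metric. A minimal sketch of the nesting on synthetic data (`X_toy`, `y_toy` and the reduced search space are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X_toy, y_toy = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Inner loop: hyperparameter search with its own 3-fold CV
inner_search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": np.logspace(-3, 3, 20)},
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42,
)

# Outer loop: each of the 3 folds refits the whole search on its training part
outer_scores = cross_val_score(
    inner_search, X_toy, y_toy, cv=3, scoring="neg_root_mean_squared_error"
)

print(len(outer_scores))
print(-outer_scores.mean())
```

Computing all metrics in a single pass, for example with `cross_validate` and a dictionary of scorers, would avoid repeating the search per metric.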


In [86]:
model = LinearRegression()
model.fit(X_train_processed, y_train_log_df)
pred_log = model.predict(X_test_processed)
# R² on the log-transformed target
score = model.score(X_test_processed, y_test_log)
print("score %0.2f" % score)

# Errors in log space
mse = mean_squared_error(y_test_log, pred_log)
print("MSE %0.2f" % mse)
print("RMSE %0.2f" % np.sqrt(mse))
print("")
# Errors in real price space
y_true = np.expm1(y_test_log)   # invert log1p
y_pred = np.expm1(pred_log)
r_mae = mean_absolute_error(y_true, y_pred)
r_mse = mean_squared_error(y_true, y_pred)
r_rmse = np.sqrt(r_mse)
print("Real MAE %0.2f" % r_mae)
print("Real MSE %0.2f" % r_mse)
print("Real RMSE %0.2f" % r_rmse)
print("R² %0.2f" % r2_score(y_true, y_pred))

score 0.97
MSE 0.03
RMSE 0.18

Real MAE 34798.83
Real MSE 12742045481149.05
Real RMSE 3569600.19
R² -800556.73
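
The real-space metrics are far worse than the log-space ones. One plausible explanation, not verified here, is that a few extreme log predictions dominate the error once `np.expm1` is applied, since the test set was not IQR-filtered. The sketch below uses made-up numbers (all arrays are assumptions, not notebook results) to show how a single bad prediction can produce this pattern and how to locate the worst rows:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical log-prices: 999 reasonable predictions and one extreme one
rng = np.random.default_rng(0)
y_true_log = rng.uniform(6.0, 9.5, size=1000)
y_pred_log = y_true_log + rng.normal(scale=0.15, size=1000)
y_pred_log[0] = y_true_log[0] + 10.0  # single wildly overestimated row

r2_log = r2_score(y_true_log, y_pred_log)
r2_real = r2_score(np.expm1(y_true_log), np.expm1(y_pred_log))

# Rank rows by absolute error in price units to find the culprits
abs_err = np.abs(np.expm1(y_true_log) - np.expm1(y_pred_log))
worst = np.argsort(abs_err)[-5:][::-1]

print(r2_log, r2_real)
print(worst, abs_err[worst])
```

Applying the same ranking to `y_true` and `y_pred` from the cell above would show whether a small number of test rows explains the negative R².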
