<div style="text-align: center;">
  <img src="https://github.com/Hack-io-Data/Imagenes/blob/main/01-LogosHackio/logo_naranja@4x.png?raw=true" alt="esquema" />
</div>

# Building and Comparing Classification Models

The goal of this lab is to build several classification models that predict the probability that a transaction is fraudulent. Beyond implementing the models, you will analyze and compare the resulting metrics to select the model that best fits the problem. In this lab you will:

- Train at least three different classification models, such as:

   - **Logistic regression**

   - **Decision trees**

   - **Random forests**

   - Other classifiers (for example, gradient boosting)

- Obtain the following metrics for each model:

   - Precision

   - Recall (sensitivity)

   - F1-score

   - Area under the ROC curve (AUC-ROC)

   - Confusion matrix

   - Accuracy

- Visualize and compare these metrics in clear, explanatory charts.

- Analyze the performance metrics of each model.

- Justify the choice of the most suitable model based on the balance between precision and recall, as well as the interpretation of the area under the ROC curve.
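The workflow listed above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: `make_classification` stands in for the fraud dataset, the hyperparameters are arbitrary, and the names `X_demo`, `candidates`, `comparison`, and `matrices` are hypothetical and chosen so they do not clash with the notebook's own variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fraud data (assumption: binary target, numeric features).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Three candidate models with arbitrary, untuned hyperparameters.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = {}
matrices = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)               # hard labels for threshold metrics
    prob = clf.predict_proba(X_te)[:, 1]   # positive-class scores for AUC
    rows[name] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1_score": f1_score(y_te, pred, zero_division=0),
        "accuracy": accuracy_score(y_te, pred),
        "auc": roc_auc_score(y_te, prob),
    }
    matrices[name] = confusion_matrix(y_te, pred)

# One row per model, one column per metric, evaluated on the held-out split.
comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(3))
```

Collecting the metrics in a single DataFrame makes it straightforward to plot them side by side (for example with `comparison.plot.bar()`) and to weigh precision against recall when choosing a model.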


In [1]:
import time
import warnings

import pandas as pd
import psutil
import seaborn as sns
import shap
import xgboost as xgb

# Project helper modules; the notebook relies on them for
# `metricas_logisticas` and `ClassificationModel`
from src.support_logistic import *
from src.support_models import *

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import (train_test_split, learning_curve, GridSearchCV,
                                     cross_val_score, StratifiedKFold, KFold)
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore")

In [2]:
def calcular_metricas(model, X_train, X_test, y_train, y_test):
    """
    Compute performance metrics for a fitted classifier, including AUC, Cohen's kappa,
    prediction time, and the number of cores the model is configured to use.

    Note: the next cell redefines `calcular_metricas` with a different signature,
    so this version is shadowed once that cell runs.

    Parameters:
        model: Fitted classifier exposing `predict` and, optionally, `predict_proba`.
        X_train, X_test (array-like): Feature matrices for the train and test sets.
        y_train, y_test (array-like): True labels for the train and test sets.

    Returns:
        DataFrame: Metrics for the train and test sets, one row each.
    """
    # Time the prediction step (labels and, when available, probabilities)
    start_time = time.time()
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    if hasattr(model, "predict_proba"):
        prob_train = model.predict_proba(X_train)[:, 1]
        prob_test = model.predict_proba(X_test)[:, 1]
    else:
        # Without probabilities the AUC columns are left empty
        prob_train = prob_test = None
    elapsed_time = time.time() - start_time

    # Metric computation (weighted averages across classes)
    metrics = {
        'precision': [precision_score(y_train, y_train_pred, average='weighted', zero_division=0), precision_score(y_test, y_test_pred, average='weighted', zero_division=0)],
        'accuracy': [accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)],
        'recall': [recall_score(y_train, y_train_pred, average='weighted', zero_division=0), recall_score(y_test, y_test_pred, average='weighted', zero_division=0)],
        'f1_score': [f1_score(y_train, y_train_pred, average='weighted', zero_division=0), f1_score(y_test, y_test_pred, average='weighted', zero_division=0)],
        'kappa': [cohen_kappa_score(y_train, y_train_pred), cohen_kappa_score(y_test, y_test_pred)],
        'auc': [roc_auc_score(y_train, prob_train) if prob_train is not None else None, roc_auc_score(y_test, prob_test) if prob_test is not None else None],
        'time_seconds': [elapsed_time, elapsed_time],
        'n_jobs': [getattr(model, "n_jobs", psutil.cpu_count(logical=True))] * 2
    }
    df_metrics = pd.DataFrame(metrics, columns=metrics.keys(), index=['train', 'test'])
    return df_metrics



In [3]:
def calcular_metricas(y_train, y_train_pred, y_test, y_test_pred):
    """
    Compute performance metrics from hard class predictions for a binary target.

    A meaningful AUC needs predicted probabilities or scores, which this function
    does not receive, so the `auc` column is always None.

    Parameters:
        y_train, y_test (array-like): True labels for the train and test sets.
        y_train_pred, y_test_pred (array-like): Predicted labels for the train and test sets.

    Returns:
        DataFrame: Metrics for the train and test sets, one row each.
    """
    # Metric computation (default binary averaging: scores refer to the positive class)
    metrics = {
        'precision': [precision_score(y_train, y_train_pred), precision_score(y_test, y_test_pred)],
        'accuracy': [accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)],
        'recall': [recall_score(y_train, y_train_pred), recall_score(y_test, y_test_pred)],
        'f1_score': [f1_score(y_train, y_train_pred), f1_score(y_test, y_test_pred)],
        'kappa': [cohen_kappa_score(y_train, y_train_pred), cohen_kappa_score(y_test, y_test_pred)],
        'auc': [None, None],
    }
    df_metrics = pd.DataFrame(metrics, columns=metrics.keys(), index=['train', 'test'])
    return df_metrics



In [4]:
df = pd.read_pickle("datos/prepped_data.pkl")

In [5]:
X = df.drop(columns = "is_fraudulent")
y = df["is_fraudulent"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [6]:
# Each dict pairs a penalty with a solver that supports it
parametros_logistic = [
    {'penalty': ['l1'], 'solver': ['saga'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [10000]},
    {'penalty': ['l2'], 'solver': ['liblinear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [10000]},
    {'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [10000]},
    # The string 'none' works on scikit-learn < 1.4; newer releases expect None instead
    {'penalty': ['none'], 'solver': ['lbfgs'], 'max_iter': [10000]},
]

regressor = LogisticRegression(random_state=42)

gridsearch = GridSearchCV(regressor, parametros_logistic, cv=5, scoring="f1", n_jobs=-1)

# Fit on the training split only, so the test set stays unseen during model selection
gridsearch.fit(X_train, y_train)
best_regressor = gridsearch.best_estimator_
y_train_predict = best_regressor.predict(X = X_train)
y_test_predict = best_regressor.predict(X = X_test)

y_train_predict_prob = best_regressor.predict_proba(X = X_train)[:, 1]
y_test_predict_prob = best_regressor.predict_proba(X = X_test)[:, 1]

metricas_regresion = metricas_logisticas(y_train, y_train_predict, y_test, y_test_predict, y_train_predict_prob, y_test_predict_prob)
metricas_regresion


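|       | precision | accuracy | recall   | f1_score | kappa    | auc      |
|-------|-----------|----------|----------|----------|----------|----------|
| train | 0.539412  | 0.550429 | 0.719932 | 0.616734 | 0.099365 | 0.578958 |
| test  | 0.562313  | 0.566333 | 0.727273 | 0.634242 | 0.122604 | 0.584283 |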


In [7]:
model = ClassificationModel(X, y, random_state=42)

In [8]:
logistic = model.train("logistic", params=parametros_logistic, scoring="f1")

In [9]:
logistic

In [10]:
y_test_pred_class = model.resultados["logistic"]["pred_test"]
y_train_pred_class = model.resultados["logistic"]["pred_train"]

In [11]:
model.display_metrics()

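|       | precision | accuracy | recall   | f1_score | kappa    | auc      | time_seconds | n_jobs |
|-------|-----------|----------|----------|----------|----------|----------|--------------|--------|
| train | 0.556052  | 0.550429 | 0.550429 | 0.536948 | 0.099365 | 0.578958 | 0.003        |        |
| test  | 0.568174  | 0.566333 | 0.566333 | 0.553686 | 0.122604 | 0.584283 | 0.003        |        |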
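In the `display_metrics` output above, `recall` equals `accuracy` on both rows (0.550429 and 0.566333), while the manual logistic-regression metrics report a recall of about 0.72. Recall equal to accuracy is exactly what `average='weighted'` produces: the support-weighted mean of the per-class recalls is algebraically the overall accuracy. The first `calcular_metricas` in this notebook uses `average='weighted'`; that `display_metrics` does the same internally is an assumption, since its source is not shown here. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels only; they are not taken from the fraud dataset.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 1, 1, 1, 1, 0, 1])

acc = accuracy_score(y_true, y_pred)

# Default binary averaging: recall of the positive class (label 1) only.
recall_binary = recall_score(y_true, y_pred)

# Weighted averaging: per-class recalls weighted by class support.
recall_weighted = recall_score(y_true, y_pred, average="weighted")

print(f"accuracy={acc:.4f}  recall_binary={recall_binary:.4f}  "
      f"recall_weighted={recall_weighted:.4f}")
```

For fraud detection the positive-class (binary) figures are usually the ones of interest, so the two tables should not be compared column by column without checking which averaging each one uses.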


In [12]:
metricas_logisticas(y_train, y_train_pred_class, y_test, y_test_pred_class)

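|       | precision | accuracy | recall   | f1_score | kappa    | auc |
|-------|-----------|----------|----------|----------|----------|-----|
| train | 0.539412  | 0.550429 | 0.719932 | 0.616734 | 0.099365 |     |
| test  | 0.562313  | 0.566333 | 0.727273 | 0.634242 | 0.122604 |     |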


In [13]:
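# Caveat: all() over a DataFrame iterates column labels, not values; model.X_train.equals(X_train) compares values
all(model.X_train == X_train)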

True
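The `True` above is weaker evidence than it looks. When both operands are DataFrames, `==` returns a boolean DataFrame, and the built-in `all()` iterates over that frame's column labels rather than its values, so it returns `True` whenever the labels are non-empty strings. A small sketch with hypothetical frames `left` and `right` shows the difference and two value-level alternatives:

```python
import pandas as pd

# Two hypothetical frames with identical labels but one differing value.
left = pd.DataFrame({"amount": [10.0, 20.0], "hour": [1, 2]})
right = pd.DataFrame({"amount": [10.0, 20.0], "hour": [1, 3]})

# Built-in all() iterates the column labels ("amount", "hour"), which are truthy.
labels_only = all(left == right)

# Value-level comparisons.
same_values = left.equals(right)                    # also checks dtypes and labels
same_elementwise = bool((left == right).all().all())

print(labels_only, same_values, same_elementwise)
```

`DataFrame.equals` additionally treats NaNs in the same position as equal, which the element-wise `==` does not.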

In [14]:
metricas_logisticas(y_train, y_train_predict, y_test, y_test_predict)

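|       | precision | accuracy | recall   | f1_score | kappa    | auc |
|-------|-----------|----------|----------|----------|----------|-----|
| train | 0.539412  | 0.550429 | 0.719932 | 0.616734 | 0.099365 |     |
| test  | 0.562313  | 0.566333 | 0.727273 | 0.634242 | 0.122604 |     |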


In [15]:
calcular_metricas(y_train, y_train_predict, y_test, y_test_predict)

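|       | precision | accuracy | recall   | f1_score | kappa    | auc |
|-------|-----------|----------|----------|----------|----------|-----|
| train | 0.539412  | 0.550429 | 0.719932 | 0.616734 | 0.099365 |     |
| test  | 0.562313  | 0.566333 | 0.727273 | 0.634242 | 0.122604 |     |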


In [16]:
print("Best parameters (manual):", gridsearch.best_params_)
print("Best parameters (ClassificationModel):", model.get_best_params())


Best parameters (manual): {'C': 0.01, 'l1_ratio': 0.5, 'max_iter': 10000, 'penalty': 'elasticnet', 'solver': 'saga'}
Best parameters (ClassificationModel): {'C': 0.01, 'l1_ratio': 0.5, 'max_iter': 10000, 'penalty': 'elasticnet', 'solver': 'saga'}


In [17]:
all(model.resultados["logistic"]["pred_train"] == y_train_predict)

True