# ❄️ End-to-End ML Demo ❄️

In this workflow we will work through the following elements of a typical tabular machine learning pipeline.

### 1. Use Feature Store to track engineered features
* Store feature definitions in the feature store for reproducible computation of ML features

### 2. Train two models using the Snowflake ML APIs
* Baseline XGBoost
* XGBoost with optimal hyperparameters identified via Snowflake ML distributed HPO methods

### 3. Register both models in the Snowflake model registry
* Explore model registry capabilities such as **metadata tracking, inference, and explainability**
* Compare model metrics on train/test sets to identify model performance issues or overfitting
* Mark the best-performing model version as the 'default' version

### 4. Set up Model Monitor to track 1 year of predicted and actual loan repayments
* **Compute performance metrics** such as F1, Precision, Recall
* **Inspect model drift** (i.e., how much the average predicted repayment rate has changed day to day)
* **Compare models** side by side to understand which model should be used in production
* Identify and understand **data issues**

### 5. Track data and model lineage throughout the process
* View and understand
  * The **origin of the data** used for computed features
  * The **data used** for model training
  * The **available model versions** being monitored

In [None]:
!pip install shap snowflake-ml-python==1.11.0

In [None]:
#Update this VERSION_NUM to version your features, models, etc.!
VERSION_NUM = '0'
DB = "E2E_SNOW_MLOPS_DB" 
SCHEMA = "MLOPS_SCHEMA" 
COMPUTE_WAREHOUSE = "E2E_SNOW_MLOPS_WH" 

In [None]:
import pandas as pd
import numpy as np
import sklearn
import math
import pickle
import shap
from datetime import datetime
import streamlit as st
from xgboost import XGBClassifier

# Snowpark ML
from snowflake.ml.registry import Registry
from snowflake.ml.modeling.tune import get_tuner_context
from snowflake.ml.modeling import tune
from entities import search_algorithm

#Snowflake feature store
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode

# Snowpark session
from snowflake.snowpark import DataFrame
from snowflake.snowpark.functions import col, to_timestamp, min, max, month, dayofweek, dayofyear, avg, date_add, sql_expr
from snowflake.snowpark.types import IntegerType
from snowflake.snowpark import Window

#set up the Snowpark session
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session

In [None]:
try:
    print("Reading table data...")
    df = session.table("MORTGAGE_LENDING_DEMO_DATA")
    df.show(5)
except Exception:
    print("Table not found! Uploading data to Snowflake table")
    df_pandas = pd.read_csv("MORTGAGE_LENDING_DEMO_DATA.csv.zip")
    session.write_pandas(df_pandas, "MORTGAGE_LENDING_DEMO_DATA", auto_create_table=True)
    df = session.table("MORTGAGE_LENDING_DEMO_DATA")
    df.show(5)

## Observe Snowflake Snowpark table properties

In [None]:
df.select(min('TS'), max('TS'))

In [None]:
#Get the current date and time
current_time = datetime.now()
df_max_time = datetime.strptime(str(df.select(max("TS")).collect()[0][0]), "%Y-%m-%d %H:%M:%S.%f")

#Find the delta between the latest existing timestamp and today's date
timedelta = current_time - df_max_time

#Shift timestamps so they represent the last ~1 year up to today's date
df.select(min(date_add(to_timestamp("TS"), timedelta.days-1)), max(date_add(to_timestamp("TS"), timedelta.days-1)))
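
The shift above can be reproduced outside Snowflake to check the arithmetic. The sketch below is a plain-Python illustration, not part of the notebook: `shift_days` and the sample dates are hypothetical, and it assumes the same rule as the cell above (whole days between now and the newest record, minus one).

```python
import datetime as dt

def shift_days(timestamps, now):
    """Shift all timestamps by a whole number of days so the newest lands 1-2 days before `now`."""
    newest = max(timestamps)
    # Same arithmetic as the notebook cell: whole days elapsed since the newest record, minus one
    offset_days = (now - newest).days - 1
    return [ts + dt.timedelta(days=offset_days) for ts in timestamps]

# Hypothetical sample data
now = dt.datetime(2025, 6, 15, 12, 0, 0)
history = [dt.datetime(2023, 1, 1, 8, 30), dt.datetime(2023, 12, 31, 17, 45)]
shifted = shift_days(history, now)
print(min(shifted), max(shifted))
```

Because every row moves by the same whole number of days, the spacing between records and the time of day are preserved; only the calendar position changes.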

## Feature Engineering with Snowpark APIs

In [None]:
#Create a dictionary whose keys are feature names and whose values hold the transformation code

feature_eng_dict = dict()

#Timestamp features
feature_eng_dict["TIMESTAMP"] = date_add(to_timestamp("TS"), timedelta.days-1)
feature_eng_dict["MONTH"] = month("TIMESTAMP")
feature_eng_dict["DAY_OF_YEAR"] = dayofyear("TIMESTAMP") 
feature_eng_dict["DOTW"] = dayofweek("TIMESTAMP")

# df= df.with_columns(feature_eng_dict.keys(), feature_eng_dict.values())

#Income and loan features
feature_eng_dict["LOAN_AMOUNT"] = col("LOAN_AMOUNT_000s")*1000
feature_eng_dict["INCOME"] = col("APPLICANT_INCOME_000s")*1000
feature_eng_dict["INCOME_LOAN_RATIO"] = col("INCOME")/col("LOAN_AMOUNT")

county_window_spec = Window.partition_by("COUNTY_NAME")
feature_eng_dict["MEAN_COUNTY_INCOME"] = avg("INCOME").over(county_window_spec)
feature_eng_dict["HIGH_INCOME_FLAG"] = (col("INCOME")>col("MEAN_COUNTY_INCOME")).astype(IntegerType())

feature_eng_dict["AVG_THIRTY_DAY_LOAN_AMOUNT"] =  sql_expr("""AVG(LOAN_AMOUNT) OVER (PARTITION BY COUNTY_NAME ORDER BY TIMESTAMP  
                                                            RANGE BETWEEN INTERVAL '30 DAYS' PRECEDING AND CURRENT ROW)""")

df = df.with_columns(feature_eng_dict.keys(), feature_eng_dict.values())
df.show(3)
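
To make the `AVG_THIRTY_DAY_LOAN_AMOUNT` window concrete, here is a small plain-Python sketch of the same idea: for each row, average the values from the same partition whose timestamp falls within the preceding 30 days, including the current row. The function name, county names, and amounts are hypothetical, and the quadratic loop is only meant for illustration on tiny inputs.

```python
import datetime as dt

def rolling_mean_by_key(rows, window_days=30):
    """For each (key, ts, value) row, average the values sharing that key
    whose timestamp lies in [ts - window_days, ts]."""
    window = dt.timedelta(days=window_days)
    out = []
    for key, ts, _ in rows:
        in_window = [v for k, t, v in rows if k == key and ts - window <= t <= ts]
        out.append(sum(in_window) / len(in_window))
    return out

# Hypothetical loans: (county, timestamp, loan amount)
loans = [
    ("ADA", dt.datetime(2024, 1, 1), 100_000),
    ("ADA", dt.datetime(2024, 1, 20), 200_000),
    ("ADA", dt.datetime(2024, 3, 1), 300_000),
    ("KENT", dt.datetime(2024, 1, 10), 400_000),
]
averages = rolling_mean_by_key(loans)
print(averages)
```

The third ADA loan averages only itself because both earlier ADA loans fall outside its 30-day window, and the KENT loan never mixes with ADA rows because the partition key differs.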

In [None]:
df.explain()

## Create a Snowflake Feature Store

In [None]:
fs = FeatureStore(
    session=session, 
    database=DB, 
    name=SCHEMA, 
    default_warehouse=COMPUTE_WAREHOUSE,
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST
)

In [None]:
fs.list_entities()

## Feature Store Configuration
- create/register entities of interest

In [None]:
#First try to retrieve an existing entity definition; if there is none, define a new one and register it
try:
    #retrieve existing entity
    loan_id_entity = fs.get_entity('LOAN_ENTITY') 
    print('Retrieved existing entity')
except Exception:
    #define new entity
    loan_id_entity = Entity(
        name = "LOAN_ENTITY",
        join_keys = ["LOAN_ID"],
        desc = "Features defined on a per loan level")
    #register
    fs.register_entity(loan_id_entity)
    print("Registered new entity")

In [None]:
#Create a dataframe with only the ID, timestamp, and engineered features. We will use this to define our feature view
feature_df = df.select(["LOAN_ID"]+list(feature_eng_dict.keys()))
feature_df.show(5)

Here, the feature store references an existing table. 

We could also define the dataframe using the Snowpark APIs, and use that dataframe (or a function that returns a dataframe) as the feature view definition, below.

In [None]:
#define and register the feature view
loan_fv = FeatureView(
    name="Mortgage_Feature_View",
    entities=[loan_id_entity],
    feature_df=feature_df,
    timestamp_col="TIMESTAMP",
    refresh_freq="1 day")

#add feature-level descriptions

loan_fv = loan_fv.attach_feature_desc(
    {
        "MONTH": "Month of loan",
        "DAY_OF_YEAR": "Day of calendar year of loan",
        "DOTW": "Day of the week of loan",
        "LOAN_AMOUNT": "Loan amount in $USD",
        "INCOME": "Household income in $USD",
        "INCOME_LOAN_RATIO": "Ratio of INCOME/LOAN_AMOUNT",
        "MEAN_COUNTY_INCOME": "Average household income aggregated at the county level",
        "HIGH_INCOME_FLAG": "Binary flag indicating whether household income is higher than MEAN_COUNTY_INCOME",
        "AVG_THIRTY_DAY_LOAN_AMOUNT": "Rolling 30-day average of LOAN_AMOUNT"
    }
)

loan_fv = fs.register_feature_view(loan_fv, version=VERSION_NUM, overwrite=True)

In [None]:
fs.list_feature_views()

In [None]:
#Create a link to the feature store UI to inspect the newly created feature view!
org_name = session.sql('SELECT CURRENT_ORGANIZATION_NAME()').collect()[0][0]
account_name = session.sql('SELECT CURRENT_ACCOUNT_NAME()').collect()[0][0]

st.write(f'https://app.snowflake.com/{org_name}/{account_name}/#/features/database/{DB}/store/{SCHEMA}')

## Retrieve a Dataset from the feature view

Snowflake Datasets are immutable, file-based objects that exist within your Snowpark session. 

They can be written to persistent Snowflake objects as needed. 

In [None]:
ds = fs.generate_dataset(
    name=f"MORTGAGE_DATASET_EXTENDED_FEATURES_{VERSION_NUM}",
    spine_df=df.select("LOAN_ID", "TIMESTAMP", "LOAN_PURPOSE_NAME","MORTGAGERESPONSE"), #only need the features used to fetch rest of feature view
    features=[loan_fv],
    spine_timestamp_col="TIMESTAMP",
    spine_label_cols=["MORTGAGERESPONSE"]
)

In [None]:
ds_sp = ds.read.to_snowpark_dataframe()
ds_sp.show(5)
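
Passing `spine_timestamp_col` asks the feature store to match each spine row with feature values as of that row's timestamp. As a rough mental model only (this is not Snowflake's implementation, and `as_of_lookup` is a hypothetical helper), a point-in-time lookup picks the latest feature value recorded at or before the spine timestamp:

```python
from bisect import bisect_right

def as_of_lookup(feature_history, spine_ts):
    """Return the latest value recorded at or before spine_ts, or None if there is none.

    feature_history is a list of (timestamp, value) pairs sorted by timestamp ascending.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, spine_ts)  # count of entries with timestamp <= spine_ts
    return feature_history[idx - 1][1] if idx else None

# Hypothetical feature history using integer timestamps for brevity
history = [(1, "v1"), (5, "v2"), (9, "v3")]
print(as_of_lookup(history, 7))
```

Looking up only values at or before the spine timestamp is what keeps future information from leaking into training rows.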

In [None]:
import snowflake.ml.modeling.preprocessing as snowml
from snowflake.snowpark.types import StringType

OHE_COLS = ds_sp.select([col.name for col in ds_sp.schema if col.datatype ==StringType()]).columns
OHE_POST_COLS = [i+"_OHE" for i in OHE_COLS]


# Encode categorical columns as numeric columns
snowml_ohe = snowml.OneHotEncoder(input_cols=OHE_COLS, output_cols = OHE_COLS, drop_input_cols=True)
ds_sp_ohe = snowml_ohe.fit(ds_sp).transform(ds_sp)

#Rename columns to avoid nested double quotes and whitespace characters
rename_dict = {}
for i in ds_sp_ohe.columns:
    if '"' in i:
        rename_dict[i] = i.replace('"','').replace(' ', '_')

ds_sp_ohe = ds_sp_ohe.rename(rename_dict)
ds_sp_ohe.columns
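
The renaming loop is easy to check in isolation. The sketch below applies the same string rule to a hand-written list of column names; `clean_column_names` is a hypothetical helper, and the sample names are modeled on the one-hot columns this notebook refers to later.

```python
def clean_column_names(columns):
    """Build a rename map for quoted identifiers: strip double quotes, replace spaces with underscores."""
    return {c: c.replace('"', "").replace(" ", "_") for c in columns if '"' in c}

cols = ["LOAN_ID", '"LOAN_PURPOSE_NAME_Home improvement"', '"LOAN_PURPOSE_NAME_Refinancing"']
mapping = clean_column_names(cols)
for old, new in mapping.items():
    print(old, "->", new)
```

Once the quotes are gone, Snowflake treats the names as unquoted identifiers and resolves them case-insensitively in uppercase, which is why later cells can refer to columns such as `LOAN_PURPOSE_NAME_HOME_IMPROVEMENT`.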

In [None]:
train, test = ds_sp_ohe.random_split(weights=[0.70, 0.30], seed=0)

In [None]:
train = train.fillna(0)
test = test.fillna(0)

In [None]:
train_pd = train.to_pandas()
test_pd = test.to_pandas()

## Model Training
### Below we will define and fit an xgboost classifier as our baseline model and evaluate its performance
##### Note: this is done entirely with OSS frameworks

In [None]:
#Define model configuration
xgb_base = XGBClassifier(
    max_depth=50,
    n_estimators=3,
    learning_rate = 0.75,
    booster = 'gbtree')

In [None]:
#Split training data into X, y
X_train_pd = train_pd.drop(["TIMESTAMP", "LOAN_ID", "MORTGAGERESPONSE"],axis=1) #drop the non-feature columns and the label
y_train_pd = train_pd.MORTGAGERESPONSE

#train model
xgb_base.fit(X_train_pd,y_train_pd)

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score
train_preds_base = xgb_base.predict(X_train_pd) #predictions on the training set

f1_base_train = round(f1_score(y_train_pd, train_preds_base),4)
precision_base_train = round(precision_score(y_train_pd, train_preds_base),4)
recall_base_train = round(recall_score(y_train_pd, train_preds_base),4)

print(f'F1: {f1_base_train} \nPrecision {precision_base_train} \nRecall: {recall_base_train}')
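
For readers who want to see what the three scores measure, here is a dependency-free sketch that derives precision, recall, and F1 from confusion counts for a binary label. `binary_metrics` and the tiny label vectors are hypothetical; the notebook itself relies on the scikit-learn functions imported above.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```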

# Model Registry

- Log models with important metadata
- Manage model lifecycles
- Serve models from Snowflake runtimes

In [None]:
#Create a Snowflake model registry object 
from snowflake.ml.registry import Registry

# Define model name
model_name = f"MORTGAGE_LENDING_MLOPS_{VERSION_NUM}"

# Create a registry to log the model to
model_registry = Registry(session=session, 
                          database_name=DB, 
                          schema_name=SCHEMA,
                          options={"enable_monitoring": True})

In [None]:
#Log the base model to the model registry (if it is not already there)
base_version_name = 'XGB_BASE'

try:
    #Check for an existing model version
    mv_base = model_registry.get_model(model_name).version(base_version_name)
    print("Found existing model version!")
except Exception:
    print("Logging new model version...")
    #Log the model to the registry
    mv_base = model_registry.log_model(
        model_name=model_name,
        model=xgb_base, 
        version_name=base_version_name,
        sample_input_data = train.drop(["TIMESTAMP", "LOAN_ID", "MORTGAGERESPONSE"]).limit(100), #use a snowpark df to maintain lineage
        comment = f"""ML model for predicting loan approval likelihood.
                    This model was trained using an XGBoost classifier.
                    Hyperparameters used were:
                    max_depth={xgb_base.max_depth}, 
                    n_estimators={xgb_base.n_estimators}, 
                    learning_rate = {xgb_base.learning_rate}, 
                    algorithm = {xgb_base.booster}
                    """,
        target_platforms= ["WAREHOUSE", "SNOWPARK_CONTAINER_SERVICES"],
        options= {"enable_explainability": True}
    )
    
    #set metrics
    mv_base.set_metric(metric_name="Train_F1_Score", value=f1_base_train)
    mv_base.set_metric(metric_name="Train_Precision_Score", value=precision_base_train)
    mv_base.set_metric(metric_name="Train_Recall_score", value=recall_base_train)

In [None]:
#Create a tag for the PROD model
session.sql("CREATE OR REPLACE TAG PROD").collect()

In [None]:
#Apply prod tag 
m = model_registry.get_model(model_name)
m.comment = "Loan approval prediction models" #set model-level comment
m.set_tag("PROD", base_version_name)
m.show_tags()

In [None]:
model_registry.show_models()

In [None]:
model_registry.get_model(model_name).show_versions()

In [None]:
print(mv_base)
print(mv_base.show_metrics())

In [None]:
mv_base.show_functions()

In [None]:
reg_preds = mv_base.run(test, function_name = "predict").rename(col('"output_feature_0"'), "MORTGAGE_PREDICTION")
reg_preds.show(10)

In [None]:
#ds_sp_ohe = ds_sp_ohe.rename(col('"LOAN_PURPOSE_NAME_Home improvement"'), "LOAN_PURPOSE_NAME_Home_improvement")

preds_pd = reg_preds.select(["MORTGAGERESPONSE", "MORTGAGE_PREDICTION"]).to_pandas()
f1_base_test = round(f1_score(preds_pd.MORTGAGERESPONSE, preds_pd.MORTGAGE_PREDICTION),4)
precision_base_test = round(precision_score(preds_pd.MORTGAGERESPONSE, preds_pd.MORTGAGE_PREDICTION),4)
recall_base_test = round(recall_score(preds_pd.MORTGAGERESPONSE, preds_pd.MORTGAGE_PREDICTION),4)

#log metrics to the model in the model registry
mv_base.set_metric(metric_name="Test_F1_Score", value=f1_base_test)
mv_base.set_metric(metric_name="Test_Precision_Score", value=precision_base_test)
mv_base.set_metric(metric_name="Test_Recall_score", value=recall_base_test)

print(f'F1: {f1_base_test} \nPrecision {precision_base_test} \nRecall: {recall_base_test}')

# Oh no! Our model's performance seems to have dropped significantly from training to our test set. 
## This is evidence that our model is overfit - can we fix this with Distributed Hyperparameter Optimization?
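
One way to make the train/test comparison explicit is to compute the gap for every metric that has both a `Train_` and a `Test_` entry, mirroring the metric names logged above. This is only a sketch: `train_test_gaps` is a hypothetical helper and the numbers are made up for illustration, not results from this notebook.

```python
def train_test_gaps(metrics):
    """Return {metric_suffix: train - test} for every Train_*/Test_* pair in a metrics dict."""
    gaps = {}
    for name, train_value in metrics.items():
        if name.startswith("Train_"):
            suffix = name[len("Train_"):]
            test_name = "Test_" + suffix
            if test_name in metrics:
                gaps[suffix] = round(train_value - metrics[test_name], 4)
    return gaps

# Made-up values, for illustration only
example = {
    "Train_F1_Score": 0.98, "Test_F1_Score": 0.81,
    "Train_Precision_Score": 0.97, "Test_Precision_Score": 0.80,
}
gaps = train_test_gaps(example)
print(gaps)
```

A large positive gap on every metric is the usual signature of overfitting; how large is too large depends on the use case.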

In [None]:
X_train = train.drop("MORTGAGERESPONSE", "TIMESTAMP", "LOAN_ID")
y_train = train.select("MORTGAGERESPONSE")
X_test = test.drop("MORTGAGERESPONSE","TIMESTAMP", "LOAN_ID")
y_test = test.select("MORTGAGERESPONSE")

In [None]:
from snowflake.ml.data import DataConnector
from snowflake.ml.modeling.tune import get_tuner_context
from snowflake.ml.modeling import tune
from entities import search_algorithm
import psutil

#Define dataset map
dataset_map = {
    "x_train": DataConnector.from_dataframe(X_train),
    "y_train": DataConnector.from_dataframe(y_train),
    "x_test": DataConnector.from_dataframe(X_test),
    "y_test": DataConnector.from_dataframe(y_test)
    }


# Define a training function, with any model you choose inside it.
def train_func():
    # A context object provided by the HPO API to expose data for the current HPO trial
    tuner_context = get_tuner_context()
    config = tuner_context.get_hyper_params()
    dm = tuner_context.get_dataset_map()

    model = XGBClassifier(**config, random_state=42)
    model.fit(dm["x_train"].to_pandas().sort_index(), dm["y_train"].to_pandas().sort_index())
    f1_metric = f1_score(
        dm["y_train"].to_pandas().sort_index(), model.predict(dm["x_train"].to_pandas().sort_index())
    )
    tuner_context.report(metrics={"f1_score": f1_metric}, model=model)

tuner = tune.Tuner(
    train_func=train_func,
    search_space={
        "max_depth": tune.randint(1, 10),
        "learning_rate": tune.uniform(0.01, 0.1),
        "n_estimators": tune.randint(50, 100),
    },
    tuner_config=tune.TunerConfig(
        metric="f1_score",
        mode="max",
        search_alg=search_algorithm.RandomSearch(random_state=101),
        num_trials=8, #run 8 trials
        max_concurrent_trials=psutil.cpu_count(logical=False) # Use all available physical CPU cores to run distributed HPO. GPUs can also be used here! 
    ),
)
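
Conceptually, random search samples a configuration from the search space for each trial, scores it, and keeps the best one. The sketch below shows that loop in plain Python with a toy objective; it is not how Snowflake's `Tuner` is implemented, and `random_search`, `toy_objective`, and the sampler lambdas are all hypothetical names.

```python
import random

def random_search(objective, space, num_trials, seed=0):
    """Sample num_trials configurations and return (best_params, best_score), maximizing the objective."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(num_trials):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Samplers loosely mirroring the ranges in the search space above
space = {
    "max_depth": lambda rng: rng.randint(1, 10),
    "learning_rate": lambda rng: rng.uniform(0.01, 0.1),
    "n_estimators": lambda rng: rng.randint(50, 100),
}

def toy_objective(params):
    # Stand-in for "train a model and return its F1 score"; it peaks at max_depth 5, learning_rate 0.05
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.05)

best_params, best_score = random_search(toy_objective, space, num_trials=8, seed=101)
print(best_params, best_score)
```

In the real tuner the trials are independent of one another, which is what allows them to run concurrently up to `max_concurrent_trials`.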

In [None]:
#Train several candidate models (note: this may take 1-2 minutes)
tuner_results = tuner.run(dataset_map=dataset_map)

In [None]:
#Select the best model from the results and inspect its configuration
tuned_model = tuner_results.best_model
tuned_model

In [None]:
#Generate predictions
xgb_opt_preds = tuned_model.predict(train_pd.drop(["TIMESTAMP", "LOAN_ID", "MORTGAGERESPONSE"],axis=1))

#Generate performance metrics
f1_opt_train = round(f1_score(train_pd.MORTGAGERESPONSE, xgb_opt_preds),4)
precision_opt_train = round(precision_score(train_pd.MORTGAGERESPONSE, xgb_opt_preds),4)
recall_opt_train = round(recall_score(train_pd.MORTGAGERESPONSE, xgb_opt_preds),4)

print(f'Train Results: \nF1: {f1_opt_train} \nPrecision {precision_opt_train} \nRecall: {recall_opt_train}')

In [None]:
#Generate test predictions
xgb_opt_preds_test = tuned_model.predict(test_pd.drop(["TIMESTAMP", "LOAN_ID", "MORTGAGERESPONSE"],axis=1))

#Generate performance metrics on test data
f1_opt_test = round(f1_score(test_pd.MORTGAGERESPONSE, xgb_opt_preds_test),4)
precision_opt_test = round(precision_score(test_pd.MORTGAGERESPONSE, xgb_opt_preds_test),4)
recall_opt_test = round(recall_score(test_pd.MORTGAGERESPONSE, xgb_opt_preds_test),4)

print(f'Test Results: \nF1: {f1_opt_test} \nPrecision {precision_opt_test} \nRecall: {recall_opt_test}')

# Here we see that the HPO model has more modest training performance than our base model - but its performance does not drop off at test time

In [None]:
#Log the optimized model to the model registry (if it is not already there)
optimized_version_name = 'XGB_Optimized'

try:
    #Check for an existing model version
    mv_opt = model_registry.get_model(model_name).version(optimized_version_name)
    print("Found existing model version!")
except Exception:
    #Log the model to the registry
    print("Logging new model version...")
    mv_opt = model_registry.log_model(
        model_name=model_name,
        model=tuned_model, 
        version_name=optimized_version_name,
        sample_input_data = train.drop(["TIMESTAMP", "LOAN_ID", "MORTGAGERESPONSE"]).limit(100),
        comment = f"""HPO ML model for predicting loan approval likelihood.
            This model was trained using an XGBoost classifier.
            Optimized hyperparameters used were:
            max_depth={tuned_model.max_depth}, 
            n_estimators={tuned_model.n_estimators}, 
            learning_rate = {tuned_model.learning_rate}, 
            """,
        target_platforms= ["WAREHOUSE", "SNOWPARK_CONTAINER_SERVICES"],
        options= {"enable_explainability": True}
    )
    #Set metrics
    mv_opt.set_metric(metric_name="Train_F1_Score", value=f1_opt_train)
    mv_opt.set_metric(metric_name="Train_Precision_Score", value=precision_opt_train)
    mv_opt.set_metric(metric_name="Train_Recall_score", value=recall_opt_train)

    mv_opt.set_metric(metric_name="Test_F1_Score", value=f1_opt_test)
    mv_opt.set_metric(metric_name="Test_Precision_Score", value=precision_opt_test)
    mv_opt.set_metric(metric_name="Test_Recall_score", value=recall_opt_test)

In [None]:
#Here we see that the BASE version is our default version
model_registry.get_model(model_name).default

In [None]:
#Now we will set the optimized model as the default model version going forward
model_registry.get_model(model_name).default = optimized_version_name

In [None]:
#Now we see the optimized version that we just promoted to be our DEFAULT model version
model_registry.get_model(model_name).default

In [None]:
#now we will update the PROD-tagged model to be the optimized model version instead of our overfit base version
m.unset_tag("PROD")
m.set_tag("PROD", optimized_version_name)
m.show_tags()

## Now that we have deployed some model versions and tested inference... 
# Let's explain our models!
- ### Snowflake offers built-in explainability capabilities on top of models logged to the model registry
- ### In the section below we will generate Shapley values using these built-in functions to understand how input features impact our model's behavior
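
As background on what a Shapley value is, the sketch below computes exact Shapley values for a two-feature toy model by averaging each feature's marginal contribution over every feature ordering. This is an illustration of the definition only: the built-in `explain` function does not necessarily compute them this way, exact enumeration is only practical for a handful of features, and `shapley_values`, `toy_model`, and the inputs are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution of each feature over all orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)          # start from the baseline input
        previous = predict(current)
        for f in order:
            current[f] = instance[f]      # switch one feature to its actual value
            new = predict(current)
            totals[f] += new - previous   # marginal contribution in this ordering
            previous = new
    return {f: total / len(orderings) for f, total in totals.items()}

def toy_model(x):
    # Hypothetical scoring rule with an interaction term
    return 2.0 * x["income"] - 1.0 * x["loan"] + 0.5 * x["income"] * x["loan"]

instance = {"income": 3.0, "loan": 2.0}
baseline = {"income": 0.0, "loan": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The values always sum to the difference between the prediction for the instance and the prediction for the baseline, which is what makes them readable as per-feature influence.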

In [None]:
#create a sample of 2500 records
test_pd_sample=test_pd.rename(columns=rename_dict).sample(n=2500, random_state = 100).reset_index(drop=True)

#Compute Shapley values for each model
base_shap_pd = mv_base.run(test_pd_sample, function_name="explain")
opt_shap_pd = mv_opt.run(test_pd_sample, function_name="explain")

In [None]:
from snowflake.ml.monitoring import explain_visualize

feat_df=test_pd_sample.drop(["MORTGAGERESPONSE","TIMESTAMP", "LOAN_ID"],axis=1)

explain_visualize.plot_influence_sensitivity(base_shap_pd, feat_df, figsize=(1500, 500))

#Optionally test out other built-in functionality 
# explain_visualize.plot_force(base_shap_pd.iloc[0], feat_df.iloc[0], figsize=(1500, 500))
# explain_visualize.plot_violin(base_shap_pd, feat_df, figsize=(1400, 100))

### In addition to the built-in visualization capabilities, you can always use open-source packages such as shap for additional visualizations

In [None]:
import shap 

shap.summary_plot(np.array(base_shap_pd.astype(float)), 
                  test_pd_sample.drop(["LOAN_ID","MORTGAGERESPONSE", "TIMESTAMP"], axis=1), 
                  feature_names = test_pd_sample.drop(["LOAN_ID","MORTGAGERESPONSE", "TIMESTAMP"], axis=1).columns)

In [None]:
shap.summary_plot(np.array(opt_shap_pd.astype(float)), 
                  test_pd_sample.drop(["LOAN_ID","MORTGAGERESPONSE", "TIMESTAMP"], axis=1), 
                  feature_names = test_pd_sample.drop(["LOAN_ID","MORTGAGERESPONSE", "TIMESTAMP"], axis=1).columns)

In [None]:
#Merge SHAP values with the actual feature values to make the plots below easier
all_shap_base = test_pd_sample.merge(base_shap_pd, right_index=True, left_index=True, how='outer')
all_shap_opt = test_pd_sample.merge(opt_shap_pd, right_index=True, left_index=True, how='outer')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#filter data down to strip outliers
asb_filtered = all_shap_base[(all_shap_base.INCOME>0) & (all_shap_base.INCOME<250000)]
aso_filtered = all_shap_opt[(all_shap_opt.INCOME>0) & (all_shap_opt.INCOME<250000)]

# Set up the figure
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
fig.suptitle("INCOME EXPLANATION")
# Plot side-by-side scatterplots with a fitted trend line
sns.scatterplot(data = asb_filtered, x ='INCOME', y = 'INCOME_explanation', ax=axes[0])
sns.regplot(data = asb_filtered, x ="INCOME", y = 'INCOME_explanation', scatter=False, color='red', line_kws={"lw":2},ci =100, lowess=False, ax =axes[0])

axes[0].set_title('Base Model')
sns.scatterplot(data = aso_filtered, x ='INCOME', y = 'INCOME_explanation',color = "orange", ax = axes[1])
sns.regplot(data = aso_filtered, x ="INCOME", y = 'INCOME_explanation', scatter=False, color='blue', line_kws={"lw":2},ci =100, lowess=False, ax =axes[1])
axes[1].set_title('Opt Model')

# Customize and show the plot
for ax in axes:
    ax.set_xlabel("Income")
    ax.set_ylabel("Influence")
plt.tight_layout()
plt.show()


In [None]:
#filter data down to strip outliers
asb_filtered = all_shap_base[all_shap_base.LOAN_AMOUNT<2000000]
aso_filtered = all_shap_opt[all_shap_opt.LOAN_AMOUNT<2000000]


# Set up the figure
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
fig.suptitle("LOAN_AMOUNT EXPLANATION")
# Plot side-by-side scatterplots with a fitted trend line
sns.scatterplot(data = asb_filtered, x ='LOAN_AMOUNT', y = 'LOAN_AMOUNT_explanation', ax=axes[0])
sns.regplot(data = asb_filtered, x ="LOAN_AMOUNT", y = 'LOAN_AMOUNT_explanation', scatter=False, color='red', line_kws={"lw":2},ci =100, lowess=True, ax =axes[0])
axes[0].set_title('Base Model')

sns.scatterplot(data = aso_filtered, x ='LOAN_AMOUNT', y = 'LOAN_AMOUNT_explanation',color = "orange", ax = axes[1])
sns.regplot(data = aso_filtered, x ="LOAN_AMOUNT", y = 'LOAN_AMOUNT_explanation', scatter=False, color='blue', line_kws={"lw":2},ci =100, lowess=True, ax =axes[1])
axes[1].set_title('Opt Model')

# Customize and show the plot
for ax in axes:
    ax.set_xlabel("LOAN_AMOUNT")
    ax.set_ylabel("Influence")
    # ax.set_xlim((0,10000))
plt.tight_layout()
plt.show()


In [None]:
# Set up the figure
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
fig.suptitle("HOME PURCHASE LOAN EXPLANATION")
# Plot side-by-side boxplots
sns.boxplot(data = all_shap_base, x ='LOAN_PURPOSE_NAME_HOME_PURCHASE', y = 'LOAN_PURPOSE_NAME_HOME_PURCHASE_explanation',
            hue='LOAN_PURPOSE_NAME_HOME_PURCHASE', width=0.8, ax=axes[0])
axes[0].set_title('Base Model')
sns.boxplot(data = all_shap_opt, x ='LOAN_PURPOSE_NAME_HOME_PURCHASE', y = 'LOAN_PURPOSE_NAME_HOME_PURCHASE_explanation',
            hue='LOAN_PURPOSE_NAME_HOME_PURCHASE', width=0.4, ax = axes[1])
axes[1].set_title('Opt Model')

# Customize and show the plot
for ax in axes:
    ax.set_xlabel("Home PURCHASE Loan (1 = True)")
    ax.set_ylabel("Influence")
    ax.legend(loc='upper right')

plt.show()


In [None]:
# Set up the figure
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
fig.suptitle("HOME IMPROVEMENT LOAN EXPLANATION")
# Plot side-by-side boxplots
sns.boxplot(data = all_shap_base, x ='LOAN_PURPOSE_NAME_HOME_IMPROVEMENT', y = 'LOAN_PURPOSE_NAME_HOME_IMPROVEMENT_explanation',
            hue='LOAN_PURPOSE_NAME_HOME_IMPROVEMENT', width=0.8, ax=axes[0])
axes[0].set_title('Base Model')
sns.boxplot(data = all_shap_opt, x ='LOAN_PURPOSE_NAME_HOME_IMPROVEMENT', y = 'LOAN_PURPOSE_NAME_HOME_IMPROVEMENT_explanation',
            hue='LOAN_PURPOSE_NAME_HOME_IMPROVEMENT', width=0.4, ax = axes[1])
axes[1].set_title('Opt Model')

# Customize and show the plot
for ax in axes:
    ax.set_xlabel("Home Improvement Loan (1 = True)")
    ax.set_ylabel("Influence")
    ax.legend(loc='upper right')

plt.show()


# Model Monitoring Setup

In [None]:
train.write.save_as_table(f"DEMO_MORTGAGE_LENDING_TRAIN_{VERSION_NUM}", mode="overwrite")
test.write.save_as_table(f"DEMO_MORTGAGE_LENDING_TEST_{VERSION_NUM}", mode="overwrite")

In [None]:
session.sql("CREATE stage IF NOT EXISTS ML_STAGE").collect()

In [None]:
from snowflake import snowpark

def demo_inference_sproc(session: snowpark.Session, table_name: str, modelname: str, modelversion: str) -> str:

    reg = Registry(session=session)
    m = reg.get_model(modelname)  # Get the model from the registry by the name passed to the procedure
    mv = m.version(modelversion)
    
    input_table_name=table_name
    pred_col = f'{modelversion}_PREDICTION'

    # Read the input table into a dataframe
    df = session.table(input_table_name)
    results = mv.run(df, function_name="predict").select("LOAN_ID",'"output_feature_0"').withColumnRenamed('"output_feature_0"', pred_col)
    # 'results' is the output DataFrame with predictions

    final = df.join(results, on="LOAN_ID", how="full")
    # Write results back to the Snowflake table
    final.write.save_as_table(table_name, mode='overwrite',enable_schema_evolution=True)

    return "Success"

# Register the stored procedure
session.sproc.register(
    func=demo_inference_sproc,
    name="model_inference_sproc",
    replace=True,
    is_permanent=True,
    stage_location="@ML_STAGE",
    packages=['joblib', 'snowflake-snowpark-python', 'snowflake-ml-python'],
    return_type=StringType()
)


In [None]:
CALL model_inference_sproc('DEMO_MORTGAGE_LENDING_TRAIN_{{VERSION_NUM}}','{{model_name}}', '{{base_version_name}}');

In [None]:
CALL model_inference_sproc('DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}}','{{model_name}}', '{{base_version_name}}');

In [None]:
CALL model_inference_sproc('DEMO_MORTGAGE_LENDING_TRAIN_{{VERSION_NUM}}','{{model_name}}', '{{optimized_version_name}}');

In [None]:
CALL model_inference_sproc('DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}}','{{model_name}}', '{{optimized_version_name}}');

In [None]:
select TIMESTAMP, LOAN_ID, INCOME, LOAN_AMOUNT, XGB_BASE_PREDICTION, XGB_OPTIMIZED_PREDICTION, MORTGAGERESPONSE 
FROM DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}} 
limit 20

## Now that our models have been deployed and we have run inference - let's set up ML Observability!

- First we will add a column to our inference data to explore later with our segmentation capabilities 
- We will define a model monitor for each model, with the training data as our baseline and the test data representing inference results. 
- Once the monitors are defined we can access them through the Model Registry 
    - We can also query drift metrics etc. programmatically

In [None]:
ALTER TABLE DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}}
ADD COLUMN IF NOT EXISTS LOAN_PURPOSE VARCHAR(50);


UPDATE DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}}
SET LOAN_PURPOSE = CASE
    WHEN LOAN_PURPOSE_NAME_HOME_IMPROVEMENT = 1 THEN 'HOME_IMPROVEMENT'
    WHEN LOAN_PURPOSE_NAME_HOME_PURCHASE = 1 THEN 'HOME_PURCHASE'
    WHEN LOAN_PURPOSE_NAME_REFINANCING = 1 THEN 'REFINANCING'
    ELSE 'OTHER'
END;

In [None]:
ALTER TABLE DEMO_MORTGAGE_LENDING_TRAIN_{{VERSION_NUM}}
ADD COLUMN IF NOT EXISTS LOAN_PURPOSE VARCHAR(50);


UPDATE DEMO_MORTGAGE_LENDING_TRAIN_{{VERSION_NUM}}
SET LOAN_PURPOSE = CASE
    WHEN LOAN_PURPOSE_NAME_HOME_IMPROVEMENT = 1 THEN 'HOME_IMPROVEMENT'
    WHEN LOAN_PURPOSE_NAME_HOME_PURCHASE = 1 THEN 'HOME_PURCHASE'
    WHEN LOAN_PURPOSE_NAME_REFINANCING = 1 THEN 'REFINANCING'
    ELSE 'OTHER'
END;
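
The `CASE` expression above collapses the one-hot loan-purpose columns back into a single label for segmentation. For reference, the same rule written in plain Python looks like this; `loan_purpose` is a hypothetical helper and the rows are hand-written examples.

```python
def loan_purpose(row):
    """Mirror of the SQL CASE: the first one-hot flag equal to 1 wins, otherwise 'OTHER'."""
    for column, label in (
        ("LOAN_PURPOSE_NAME_HOME_IMPROVEMENT", "HOME_IMPROVEMENT"),
        ("LOAN_PURPOSE_NAME_HOME_PURCHASE", "HOME_PURCHASE"),
        ("LOAN_PURPOSE_NAME_REFINANCING", "REFINANCING"),
    ):
        if row.get(column) == 1:
            return label
    return "OTHER"

# Hand-written example rows
rows = [
    {"LOAN_PURPOSE_NAME_HOME_IMPROVEMENT": 0, "LOAN_PURPOSE_NAME_HOME_PURCHASE": 1, "LOAN_PURPOSE_NAME_REFINANCING": 0},
    {"LOAN_PURPOSE_NAME_HOME_IMPROVEMENT": 0, "LOAN_PURPOSE_NAME_HOME_PURCHASE": 0, "LOAN_PURPOSE_NAME_REFINANCING": 0},
]
labels = [loan_purpose(r) for r in rows]
print(labels)
```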

In [None]:
SELECT LOAN_PURPOSE_NAME_HOME_PURCHASE, LOAN_PURPOSE_NAME_HOME_IMPROVEMENT, LOAN_PURPOSE_NAME_REFINANCING, LOAN_PURPOSE FROM DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}} limit 10;

In [None]:
CREATE OR REPLACE MODEL MONITOR MORTGAGE_LENDING_BASE_MODEL_MONITOR
WITH
    MODEL={{model_name}}
    VERSION={{base_version_name}}
    FUNCTION=predict
    SOURCE=DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}}
    BASELINE=DEMO_MORTGAGE_LENDING_TRAIN_{{VERSION_NUM}}
    TIMESTAMP_COLUMN=TIMESTAMP
    PREDICTION_CLASS_COLUMNS=(XGB_BASE_PREDICTION)  
    ACTUAL_CLASS_COLUMNS=(MORTGAGERESPONSE)
    ID_COLUMNS=(LOAN_ID)
    SEGMENT_COLUMNS = ('LOAN_PURPOSE')
    WAREHOUSE={{COMPUTE_WAREHOUSE}}
    REFRESH_INTERVAL='12 hours'
    AGGREGATION_WINDOW='1 day';

In [None]:
CREATE OR REPLACE MODEL MONITOR MORTGAGE_LENDING_OPTIMIZED_MODEL_MONITOR
WITH
    MODEL={{model_name}}
    VERSION={{optimized_version_name}}
    FUNCTION=predict
    SOURCE=DEMO_MORTGAGE_LENDING_TEST_{{VERSION_NUM}}
    BASELINE=DEMO_MORTGAGE_LENDING_TRAIN_{{VERSION_NUM}}
    TIMESTAMP_COLUMN=TIMESTAMP
    PREDICTION_CLASS_COLUMNS=(XGB_OPTIMIZED_PREDICTION)  
    ACTUAL_CLASS_COLUMNS=(MORTGAGERESPONSE)
    ID_COLUMNS=(LOAN_ID)
    SEGMENT_COLUMNS = ('LOAN_PURPOSE')
    WAREHOUSE={{COMPUTE_WAREHOUSE}}
    REFRESH_INTERVAL='12 hours'
    AGGREGATION_WINDOW='1 day';

In [None]:
#Click the generated link to view your model in the model registry and check out the model monitors!
st.write(f'https://app.snowflake.com/{org_name}/{account_name}/#/data/databases/{DB}/schemas/{SCHEMA}/model/{model_name.upper()}')

In [None]:
SELECT * FROM TABLE(MODEL_MONITOR_DRIFT_METRIC(
'MORTGAGE_LENDING_BASE_MODEL_MONITOR', -- model monitor to use
'DIFFERENCE_OF_MEANS', -- metric for computing drift
'XGB_BASE_PREDICTION', -- column to compute drift on
'1 DAY',  -- day granularity for drift computation
DATEADD(DAY, -90, CURRENT_DATE()), -- start date
DATEADD(DAY, -60, CURRENT_DATE()) -- end date
)
)
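
To give a feel for what a difference-of-means drift series represents, the sketch below compares the mean prediction for each day against the mean prediction over a baseline set. It is a simplified illustration under that assumption; Snowflake's exact definition (sign convention, any normalization) may differ, and `daily_difference_of_means` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def daily_difference_of_means(baseline, scored):
    """Return {day: mean(predictions that day) - mean(baseline predictions)}.

    baseline is a list of predictions; scored is a list of (day, prediction) pairs.
    """
    base_mean = mean(baseline)
    by_day = defaultdict(list)
    for day, value in scored:
        by_day[day].append(value)
    return {day: mean(values) - base_mean for day, values in sorted(by_day.items())}

# Hypothetical predictions: baseline mean is 0.75
baseline_preds = [1, 0, 1, 1]
scored_preds = [("2024-01-01", 1), ("2024-01-01", 1), ("2024-01-02", 0), ("2024-01-02", 1)]
drift = daily_difference_of_means(baseline_preds, scored_preds)
print(drift)
```

A series that hovers near zero suggests the daily predicted approval rate matches the baseline; a sustained move away from zero is the kind of signal the monitor is meant to surface.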

# SPCS Deployment Setup (OPTIONAL)
## This section is optional; running the code cells below will allow a user to 

- ### Create a new compute pool with 2 CPU_X64_L nodes
- ### Deploy a service on top of our existing HPO model version
- ### Test out inference on the newly created container service


In [None]:
cp_name = "MORTGAGE_LENDING_INFERENCE_CP"
num_spcs_nodes = '2'
spcs_instance_family = 'CPU_X64_L'
service_name = 'MORTGAGE_LENDING_PREDICTION_SERVICE'

current_database = session.get_current_database().replace('"', '')
current_schema = session.get_current_schema().replace('"', '')
extended_service_name = f'{current_database}.{current_schema}.{service_name}'

In [None]:
session.sql(f"alter compute pool if exists {cp_name} stop all").collect()
session.sql(f"drop compute pool if exists {cp_name}").collect()
session.sql(f"create compute pool {cp_name} min_nodes={num_spcs_nodes} max_nodes={num_spcs_nodes} instance_family={spcs_instance_family} auto_resume=True auto_suspend_secs=300").collect()
session.sql(f"describe compute pool {cp_name}").show()

In [None]:
#note: this may take up to 5 minutes to run

mv_opt.create_service(
    service_name=extended_service_name,
    service_compute_pool=cp_name,
    ingress_enabled=True,
    max_instances=int(num_spcs_nodes)
)

In [None]:
model_registry.get_model(f"MORTGAGE_LENDING_MLOPS_{VERSION_NUM}").show_versions()

In [None]:
mv_container = model_registry.get_model(f"MORTGAGE_LENDING_MLOPS_{VERSION_NUM}").default
mv_container.run(test, function_name = "predict", service_name = "MORTGAGE_LENDING_PREDICTION_SERVICE").rename('"output_feature_0"', 'XGB_PREDICTION')

In [None]:
SHOW ENDPOINTS IN SERVICE E2E_SNOW_MLOPS_DB.MLOPS_SCHEMA.MORTGAGE_LENDING_PREDICTION_SERVICE

In [None]:
#Stop the service to save costs
# session.sql(f"alter compute pool if exists {cp_name} stop all").collect()

## Conclusion 

#### 🛠️ Snowflake Feature Store tracks feature definitions and maintains lineage of sources and destinations 🛠️
#### 🚀 Snowflake Model Registry gives users a secure and flexible framework to log models, tag candidates for production, and run inference and explainability jobs 🚀
#### 📈 ML observability in Snowflake allows users to monitor model performance over time and detect model, feature, and concept drift 📈
#### 🔮 All models logged in the Model Registry can be accessed for inference, explainability, lineage tracking, visibility, and more 🔮
