# ❄️ End-to-end ML Demo ❄️

In this workflow we will work through the following elements of a typical tabular machine learning pipeline.

### 1. Use Feature Store to track engineered features
* Store feature definitions in the feature store for reproducible computation of ML features
      
### 2. Train two Models using the Snowflake ML APIs
* Baseline XGBoost
* XGBoost with optimal hyperparameters identified via Snowflake ML distributed HPO methods

### 3. Register both models in Snowflake model registry
* Explore model registry capabilities such as **metadata tracking, inference, and explainability**
* Compare model metrics on train/test set to identify any issues of model performance or overfitting
* Tag the best performing model version as the 'default' version

### 4. Set up Model Monitor to track 1 year of predicted and actual loan repayments
* **Compute performance metrics** such as F1, precision, and recall
* **Inspect model drift** (i.e. how much has the average predicted repayment rate changed day-to-day)
* **Compare models** side-by-side to understand which model should be used in production
* Identify and understand **data issues**

### 5. Track data and model lineage throughout
* View and understand
  * The **origin of the data** used for computed features
  * The **data used** for model training
  * The **available model versions** being monitored

In [None]:
!pip install snowflake-ml-python==1.18.0

In [None]:
#Update this VERSION_NUM to version your features, models etc!
VERSION_NUM = '0'
DB = "E2E_SNOW_MLOPS_DB" 
SCHEMA = "MLOPS_SCHEMA" 
COMPUTE_WAREHOUSE = "E2E_SNOW_MLOPS_WH" 
ROLE = "E2E_SNOW_MLOPS_ROLE"

In [1]:
import pandas as pd
import numpy as np
import sklearn
import math
import pickle
import shap
from datetime import datetime
import streamlit as st
from xgboost import XGBClassifier

from versioning import version_featureview, version_data

# Snowpark ML
from snowflake.ml.registry import Registry
from snowflake.ml.modeling.tune import get_tuner_context
from snowflake.ml.modeling import tune
from entities import search_algorithm

#Snowflake feature store
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode

# Snowpark session
from snowflake.snowpark import DataFrame
from snowflake.snowpark.functions import col, to_timestamp, min, max, month, dayofweek, dayofyear, avg, date_add, sql_expr
from snowflake.snowpark.types import IntegerType, StringType
from snowflake.snowpark import Window

#setup snowpark session
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# session.use_role('')
session.use_role(ROLE)
session.use_warehouse(COMPUTE_WAREHOUSE)
session.use_database(DB)
session.use_schema(SCHEMA)


In [None]:
df = session.table("MORTGAGE_LENDING_DEMO_DATA")
df.show(5)

## Observe Snowflake Snowpark table properties

In [None]:
df.select(min('TS'), max('TS')).show()

In [None]:
#Get current date and time
current_time = datetime.now()
df_max_time = datetime.strptime(str(df.select(max("TS")).collect()[0][0]), "%Y-%m-%d %H:%M:%S.%f")

#Find delta between latest existing timestamp and today's date
timedelta = current_time - df_max_time
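
The shift computed in the cell above can be sketched in plain Python. This is a minimal sketch and not part of the original notebook: `parse_ts` and `shift_days` are hypothetical helper names, and it assumes the maximum `TS` value may arrive either as a `datetime` or as a string with or without fractional seconds. A single `%f` format raises `ValueError` when the string has no fractional part, so the sketch tries both layouts.

```python
from datetime import datetime

# Formats tried in order; both are assumptions about how TS may be rendered.
_TS_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")


def parse_ts(value):
    """Return a datetime from either a datetime or a timestamp string."""
    if isinstance(value, datetime):
        return value
    text = str(value)
    for fmt in _TS_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {text!r}")


def shift_days(latest, now):
    """Whole days to add so the latest record ends up at least one full day before `now`."""
    return (now - latest).days - 1


latest = parse_ts("2024-01-01 00:00:00")
now = datetime(2024, 1, 11, 12, 0, 0)
print(shift_days(latest, now))  # 9
```

Adding this number of days to every timestamp keeps the relative spacing of the records while moving the whole series up to the recent past.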

## Feature Engineering with Snowpark APIs

In [None]:
#Create a dict with keys for feature names and values containing transform code

feature_eng_dict = dict()

#Timestamp features
feature_eng_dict["TIMESTAMP"] = date_add(to_timestamp("TS"), timedelta.days-1)
feature_eng_dict["MONTH"] = month("TIMESTAMP")
feature_eng_dict["DAY_OF_YEAR"] = dayofyear("TIMESTAMP") 
feature_eng_dict["DOTW"] = dayofweek("TIMESTAMP")

# df= df.with_columns(feature_eng_dict.keys(), feature_eng_dict.values())

#Income and loan features
feature_eng_dict["LOAN_AMOUNT"] = col("LOAN_AMOUNT_000s")*1000
feature_eng_dict["INCOME"] = (col("APPLICANT_INCOME_000s")*1000).astype(IntegerType())
feature_eng_dict["INCOME_LOAN_RATIO"] = col("INCOME")/col("LOAN_AMOUNT")

df_eng = df.with_columns(feature_eng_dict.keys(), feature_eng_dict.values())
df_eng.show(3)
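
To make the transformations above concrete, here is a plain-Python sketch of the same derived features for a single record. It is illustrative only: `engineer_row` is a hypothetical helper, the input values are made up, and the day-of-week convention (Sunday = 0 through Saturday = 6) is an assumption that depends on the session's week-start setting in Snowflake.

```python
from datetime import datetime


def engineer_row(ts, loan_amount_000s, applicant_income_000s):
    """Compute the derived features for one record in plain Python."""
    loan_amount = loan_amount_000s * 1000
    # Integer cast; rounding of fractional values may differ from Snowflake.
    income = int(applicant_income_000s * 1000)
    return {
        "MONTH": ts.month,
        "DAY_OF_YEAR": ts.timetuple().tm_yday,
        # Assumption: Sunday = 0 ... Saturday = 6.
        "DOTW": ts.isoweekday() % 7,
        "LOAN_AMOUNT": loan_amount,
        "INCOME": income,
        "INCOME_LOAN_RATIO": income / loan_amount,
    }


# Made-up example record: a loan of 200 (thousands) and an income of 80 (thousands).
row = engineer_row(datetime(2024, 5, 15), loan_amount_000s=200, applicant_income_000s=80)
print(row)
```

Note that `INCOME_LOAN_RATIO` is income divided by loan amount, matching the expression used in `feature_eng_dict`.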

In [None]:
df_eng.explain()

## Create a Snowflake Feature Store

In [None]:
fs = FeatureStore(
    session=session, 
    database=DB, 
    name=SCHEMA, 
    default_warehouse=COMPUTE_WAREHOUSE,
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST
)

In [None]:
fs.list_entities()

## Feature Store configuration
- create/register entities of interest

In [None]:
#First try to retrieve an existing entity definition, if not define a new one and register
try:
    #retrieve existing entity
    loan_id_entity = fs.get_entity('LOAN_ENTITY') 
    print('Retrieved existing entity')
except Exception:
    #define new entity
    loan_id_entity = Entity(
        name = "LOAN_ENTITY",
        join_keys = ["LOAN_ID"],
        desc = "Features defined on a per loan level")
    #register
    fs.register_entity(loan_id_entity)
    print("Registered new entity")
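
The retrieve-or-register pattern above can be sketched without a Snowflake connection. Everything in this block is a hypothetical stand-in: `InMemoryEntityStore` and `EntityNotFound` are not part of any Snowflake API, they only illustrate why catching a specific exception is preferable to a bare `except`, which would also hide unrelated failures such as permission or connectivity errors.

```python
class EntityNotFound(KeyError):
    """Hypothetical error raised when an entity name is unknown."""


class InMemoryEntityStore:
    """Hypothetical stand-in for a feature store's entity catalog."""

    def __init__(self):
        self._entities = {}

    def get_entity(self, name):
        try:
            return self._entities[name]
        except KeyError:
            raise EntityNotFound(name) from None

    def register_entity(self, name, join_keys, desc=""):
        entity = {"name": name, "join_keys": list(join_keys), "desc": desc}
        self._entities[name] = entity
        return entity


def get_or_register(store, name, join_keys, desc=""):
    """Return (entity, created) and let unrelated errors propagate."""
    try:
        return store.get_entity(name), False
    except EntityNotFound:
        return store.register_entity(name, join_keys, desc), True


store = InMemoryEntityStore()
_, created_first = get_or_register(
    store, "LOAN_ENTITY", ["LOAN_ID"], "Features defined on a per loan level"
)
_, created_second = get_or_register(store, "LOAN_ENTITY", ["LOAN_ID"])
print(created_first, created_second)  # True False
```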

In [None]:
#Create a dataframe with just the ID, timestamp, and engineered features. We will use this to define our feature view
feature_df = df_eng.select(["LOAN_ID"]+list(feature_eng_dict.keys()))
feature_df.show(5)

Here, the feature store references an existing table. 

We could also define the dataframe via the use of Snowpark APIs, and use that dataframe (or a function that returns a dataframe) as the feature view definition, below.

In [None]:
#define and register feature view
loan_fv = FeatureView(
    name="Mortgage_Feature_View",
    entities=[loan_id_entity],
    feature_df=feature_df,
    timestamp_col="TIMESTAMP",
    refresh_freq="1 day")

#add feature level descriptions

loan_fv = loan_fv.attach_feature_desc(
    {
        "MONTH": "Month of loan",
        "DAY_OF_YEAR": "Day of calendar year of loan",
        "DOTW": "Day of the week of loan",
        "LOAN_AMOUNT": "Loan amount in $USD",
        "INCOME": "Household income in $USD",
        "INCOME_LOAN_RATIO": "Ratio of INCOME/LOAN_AMOUNT",
    }
)

loan_fv = fs.register_feature_view(loan_fv, version=VERSION_NUM, overwrite=True)

# alternatively, use version hashing
#version = version_featureview(loan_fv)
#loan_fv = fs.register_feature_view(loan_fv, version=version)
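
The idea behind version hashing can be sketched as follows. This is not the implementation of the `version_featureview` helper imported from the local `versioning` module, whose contents are not shown here; `definition_hash` is a hypothetical function that derives a short, deterministic label from a feature view's name, column names, and query text, so that an unchanged definition always maps to the same version.

```python
import hashlib


def definition_hash(name, columns, query_text, length=8):
    """Short, deterministic version label derived from a feature definition."""
    # Normalize so that column order, case, and whitespace do not change the label.
    normalized_cols = ",".join(sorted(c.upper() for c in columns))
    normalized_query = " ".join(query_text.split())
    payload = "\n".join([name.upper(), normalized_cols, normalized_query])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return "V_" + digest[:length].upper()


# Made-up definition used only to show that formatting changes keep the label stable.
v1 = definition_hash("Mortgage_Feature_View", ["LOAN_ID", "MONTH"], "SELECT LOAN_ID, MONTH FROM T")
v2 = definition_hash("Mortgage_Feature_View", ["MONTH", "LOAN_ID"], "SELECT  LOAN_ID,\n MONTH FROM T")
print(v1 == v2)  # True
```

Whether a label of this shape satisfies the feature store's version naming rules is an assumption that should be checked before use.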

In [None]:
fs.list_feature_views()

In [None]:
df_eng.show(3)

In [None]:

cat_cols = ["LOAN_PURPOSE_NAME"]

ohe_dict = {}
for c in cat_cols:
    vals = df_eng.select(c).distinct().collect()

    for v in vals:
        key = f"{c}_{v[c].replace(' ','_').upper()}"
        ohe_dict[key] = (col(c)==v[c]).astype(IntegerType())
        
ohe_df = df_eng.with_columns(ohe_dict.keys(), ohe_dict.values())

ohe_df = ohe_df.select(["LOAN_ID","TIMESTAMP"]+list(ohe_dict.keys()))
ohe_df.show()
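
The one-hot encoding loop above builds one 0/1 column per distinct category value. A plain-Python sketch of the same naming and encoding logic follows; `ohe_column_name` and `one_hot` are hypothetical helpers and the two sample rows are made up for illustration.

```python
def ohe_column_name(column, value):
    """Build the indicator column name, e.g. LOAN_PURPOSE_NAME_HOME_PURCHASE."""
    return f"{column}_{value.replace(' ', '_').upper()}"


def one_hot(rows, column):
    """Add one 0/1 indicator per distinct value of `column` to each row."""
    distinct = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        out = dict(row)
        for value in distinct:
            out[ohe_column_name(column, value)] = int(row[column] == value)
        encoded.append(out)
    return encoded


# Made-up sample rows.
rows = [
    {"LOAN_ID": 1, "LOAN_PURPOSE_NAME": "Home purchase"},
    {"LOAN_ID": 2, "LOAN_PURPOSE_NAME": "Refinancing"},
]
encoded = one_hot(rows, "LOAN_PURPOSE_NAME")
print(encoded[0])
```

As in the notebook cell, this naming scheme only replaces spaces, so it assumes category values are non-null strings without other characters that are awkward in column names.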

In [None]:
#define and register feature view
cat_fv = FeatureView(
    name="Mortgage_Feature_View_CATEGORIES",
    entities=[loan_id_entity],
    feature_df=ohe_df,
    timestamp_col="TIMESTAMP",
)

cat_fv = fs.register_feature_view(cat_fv, version=VERSION_NUM, overwrite=True)

# alternatively, use version hashing
#version = version_featureview(cat_fv)
#cat_fv = fs.register_feature_view(cat_fv, version=version)

In [None]:
fs.list_feature_views()

## Retrieve a Dataset from the feature views

Snowflake Datasets are immutable, file-based objects that exist within your Snowpark session. 

They can be written to persistent Snowflake objects as needed. 

In [None]:
# TODO: explanation of timestamp usage here?

# Spine: only the join key, timestamp, and label are needed; the remaining features are fetched from the feature views
spine_df = df_eng.select("LOAN_ID", "TIMESTAMP", "MORTGAGERESPONSE").filter(month("TIMESTAMP")==5)

ds = fs.generate_dataset(
    name="MORTGAGE_DATASET_EXTENDED_FEATURES",
    spine_df=spine_df, 
    features=[loan_fv,cat_fv],
    spine_timestamp_col="TIMESTAMP",
    spine_label_cols=["MORTGAGERESPONSE"]
)
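
The role of the spine timestamp can be illustrated with a small point-in-time lookup. This sketch assumes that, when `spine_timestamp_col` is set, each spine row is matched to the most recent feature values recorded at or before its own timestamp, so that no future information leaks into training data. `as_of_lookup` is a hypothetical helper and the feature rows are made up; this is not how the feature store is implemented internally.

```python
from bisect import bisect_right
from datetime import datetime


def as_of_lookup(feature_rows, loan_id, spine_ts):
    """Return the latest feature row for loan_id with TIMESTAMP <= spine_ts, else None."""
    history = sorted(
        (r for r in feature_rows if r["LOAN_ID"] == loan_id),
        key=lambda r: r["TIMESTAMP"],
    )
    timestamps = [r["TIMESTAMP"] for r in history]
    # Index of the first timestamp strictly greater than spine_ts.
    idx = bisect_right(timestamps, spine_ts)
    return history[idx - 1] if idx > 0 else None


# Made-up feature history for a single loan.
features = [
    {"LOAN_ID": 1, "TIMESTAMP": datetime(2024, 1, 1), "INCOME_LOAN_RATIO": 0.3},
    {"LOAN_ID": 1, "TIMESTAMP": datetime(2024, 2, 1), "INCOME_LOAN_RATIO": 0.5},
]
match = as_of_lookup(features, 1, datetime(2024, 1, 15))
print(match["INCOME_LOAN_RATIO"])  # 0.3
```

A spine row dated before the first feature record gets no match, which is why the spine timestamps should fall within the range covered by the feature views.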

In [None]:
ds_sp = ds.read.to_snowpark_dataframe()
ds_sp.show(5)

## Conclusion 

#### 🛠️ Snowflake Feature Store tracks feature definitions and maintains lineage of sources and destinations 🛠️
#### 🚀 Snowflake Model Registry gives users a secure and flexible framework to log models, tag candidates for production, and run inference and explainability jobs 🚀
#### 📈 ML observability in Snowflake allows users to monitor model performance over time and detect model, feature, and concept drift 📈
#### 🔮 All models logged in the Model Registry can be accessed for inference, explainability, lineage tracking, visibility and more 🔮