# Snowflake ML Intro Notebook - ML Forecasting
This notebook introduces several key features of Snowflake ML in the process of training a machine learning model for forecasting Chicago bus ridership.


* Establish secure connection to Snowflake
* Load features and target from Snowflake table into Snowpark DataFrame
* Add features into a feature store
* Prepare features for model training
* Train ML model using Snowpark ML distributed processing
* Save the model to the Snowflake Model Registry
* Run model predictions inside Snowflake
* Deploy model to Snowflake Container Services

This notebook is intended to highlight Snowflake functionality and should not be taken as a best practice for time series forecasting. 

[Get Notebook](https://github.com/rajshah4/snowflake-notebooks/blob/main/Forecasting_ChicagoBus/Snowpark_Forecasting_Bus_FeatureStore.ipynb)  

[Go to folder with dataset](https://github.com/rajshah4/snowflake-notebooks/blob/main/Forecasting_ChicagoBus/)  

[See more Snowflake notebooks from Raj](https://github.com/rajshah4/snowflake-notebooks/)

## 1. Setup Environment

In [1]:
# Snowflake connector
from snowflake import connector
#from snowflake.ml.utils import connection_params

# Snowpark for Python
from snowflake.snowpark.session import Session
from snowflake.snowpark.types import Variant
from snowflake.snowpark.version import VERSION
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import *

# Snowpark ML
from snowflake.ml.modeling.compose import ColumnTransformer
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import StandardScaler, OrdinalEncoder
from snowflake.ml.modeling.impute import SimpleImputer
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml import version
mlversion = version.VERSION

#Feature Store
from snowflake.ml.feature_store import FeatureStore, CreationMode, Entity, FeatureView

# Misc
import pandas as pd
import json
import logging 
logger = logging.getLogger("snowflake.snowpark.session")
logger.setLevel(logging.ERROR)

import sys
print(sys.version) ##Last run used Python 3.11


3.11.7 (main, Dec 15 2023, 12:09:56) [Clang 14.0.6 ]


## Establish Secure Connection to Snowflake

Using the Snowpark Python API, it’s quick and easy to establish a secure connection between Snowflake and a notebook. I prefer using a `toml` configuration file [as documented here](https://docs.snowflake.com/en/developer-guide/snowflake-python-api/snowflake-python-connecting-snowflake).
*Note: Other connection options include Username/Password, MFA, OAuth, Okta, SSO*

I recently moved to using a [private/public key pair](https://docs.snowflake.com/en/user-guide/key-pair-auth) for authentication. This is more secure than using a password, and I don't have to complete MFA every time I run the notebook.

The creds.json should look like this:
```
{
    "account": "awb99999",
    "user": "your_user_name",
    "passphrase": "your_key_passphrase",
    "warehouse": "your_warehouse"
}
```

In [2]:
from snowflake.snowpark import Session
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.backends import default_backend

with open('../../creds.json') as f:
    data = json.load(f)
    USERNAME = data['user']
    SF_ACCOUNT = data['account']
    SF_WH = data['warehouse']
    # Optional: only needed when the private key file is encrypted
    passphrase = data.get('passphrase')

# Read the private key from the .p8 file
with open('../../rsa_key.p8', 'rb') as key_file:
    private_key = key_file.read()

# Load the PEM key; the passphrase is passed only when one is configured
private_key_obj = serialization.load_pem_private_key(
    private_key,
    password=passphrase.encode() if passphrase else None,
    backend=default_backend()
)

# Define connection parameters including the private key
CONNECTION_PARAMETERS = {
    'user': USERNAME,
    'account': SF_ACCOUNT,
    'private_key': private_key_obj,
    'warehouse': SF_WH,
}

# Create a session with the specified connection parameters
session = Session.builder.configs(CONNECTION_PARAMETERS).create()

from snowflake.core.warehouse import Warehouse
from snowflake.core import Root
root = Root(session)
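
Before opening a session, it can help to fail fast on an incomplete credentials file. The following is a minimal sketch in plain Python, not part of the Snowpark API; the helper name `load_creds` and the choice of required keys are assumptions based on the keys the cell above reads.

```python
import json

# Keys the connection cell reads. "passphrase" is treated as optional here
# (assumption: an unencrypted private key needs no passphrase).
REQUIRED_KEYS = ("user", "account", "warehouse")

def load_creds(text: str) -> dict:
    """Parse a creds.json payload and check that the required keys exist."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"creds.json is missing keys: {missing}")
    # Normalize the optional passphrase to None when it is absent or empty.
    data["passphrase"] = data.get("passphrase") or None
    return data

example = '{"account": "awb99999", "user": "your_user_name", "warehouse": "your_warehouse"}'
creds = load_creds(example)
print(creds["user"], creds["passphrase"])
```

Checking the file up front gives a clearer error than a failed connection attempt later in the cell.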

Verify everything is connected. I like to do this to remind people to make sure they are using the latest versions.

In [3]:
snowflake_environment = session.sql('select current_user(), current_version()').collect()
snowpark_version = VERSION

# Current Environment Details
print('User                        : {}'.format(snowflake_environment[0][0]))
print('Role                        : {}'.format(session.get_current_role()))
print('Database                    : {}'.format(session.get_current_database()))
print('Schema                      : {}'.format(session.get_current_schema()))
print('Warehouse                   : {}'.format(session.get_current_warehouse()))
print('Snowflake version           : {}'.format(snowflake_environment[0][1]))
print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0],snowpark_version[1],snowpark_version[2]))
# version.VERSION is a string such as "1.6.3", so print it directly
# (indexing characters would break for versions like "1.10.0")
print('Snowflake ML version        : {}'.format(mlversion))


User                        : RSHAH
Role                        : "RAJIV"
Database                    : "RAJIV"
Schema                      : "PUBLIC"
Warehouse                   : "RAJIV"
Snowflake version           : 8.38.3
Snowpark for Python version : 1.20.0
Snowflake ML version        : 1.6.3
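
If you want to compare versions rather than just print them, a dotted version string is easier to handle once it is split into integers. This is a small sketch in plain Python; `parse_version` is a hypothetical helper, and it assumes purely numeric components (no suffixes such as `rc1`).

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version string such as '1.6.3' into integer parts."""
    return tuple(int(part) for part in version.split("."))

print(parse_version("1.6.3"))
# Tuples compare component by component, so 1.10.0 is newer than 1.9.9.
print(parse_version("1.10.0") > parse_version("1.9.9"))
# Plain string comparison gets this wrong because "1" sorts before "9".
print("1.10.0" > "1.9.9")
```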


Throughout this notebook, I will change warehouse sizes. Warehouse size really doesn't matter much for this notebook, but I want people to understand how easily and quickly you can change it. This is one of my favorite features of Snowflake: it's always ready for me.

In [4]:
session.sql("CREATE OR REPLACE SCHEMA ML_DEMO").collect()
session.sql("USE SCHEMA ML_DEMO").collect()

[Row(status='Statement executed successfully.')]

Create a feature store

In [5]:
fs = FeatureStore(
    session=session,
    database="RAJIV",
    name="FEATURE_STORE_MLDEMO",
    default_warehouse="RAJIV",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

## 2. Load Data in Snowflake 

Let's get the data (about 900k rows) and make the column names all uppercase. It's easier to work with column names that aren't case sensitive.

In [6]:
df_clean = pd.read_csv('CTA_Daily_Totals_by_Route.csv')
df_clean.columns = df_clean.columns.str.upper()
print (df_clean.shape)
print (df_clean.dtypes)
df_clean.head()

(893603, 4)
ROUTE      object
DATE       object
DAYTYPE    object
RIDES       int64
dtype: object


  ROUTE        DATE DAYTYPE  RIDES
0     3  01/01/2001       U   7354
1     4  01/01/2001       U   9288
2     6  01/01/2001       U   6048
3     8  01/01/2001       U   6309
4     9  01/01/2001       U  11207


Let's create a Snowpark DataFrame from the pandas DataFrame; the train/test split comes later, after feature engineering. Operations on a Snowpark DataFrame run inside Snowflake and not in your local environment. We will also save this as a table so we don't ever have to manually upload this dataset again.

PRO TIP -- Snowpark will carry the schema of a pandas DataFrame into Snowflake. Either change your schema before importing or after it has landed in Snowflake. People who put models into production are very careful about data types.

In [7]:
input_df = session.create_dataframe(df_clean)
schema = input_df.schema
print(schema)

StructType([StructField('ROUTE', StringType(16777216), nullable=True), StructField('DATE', StringType(16777216), nullable=True), StructField('DAYTYPE', StringType(16777216), nullable=True), StructField('RIDES', LongType(), nullable=True)])


In [8]:
input_df.write.mode('overwrite').save_as_table('RAJIV.FEATURE_STORE_MLDEMO.CHICAGO_BUS_RIDES')

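Let's read from the table, since that is generally what you will be doing in production. We have about 893,000 rows of ridership data.

In [9]:
# Use the fully qualified name so the read does not depend on the current schema
df = session.read.table('RAJIV.FEATURE_STORE_MLDEMO.CHICAGO_BUS_RIDES')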
print (df.count())
df.show()

893603
----------------------------------------------
|"ROUTE"  |"DATE"      |"DAYTYPE"  |"RIDES"  |
----------------------------------------------
|3        |01/01/2001  |U          |7354     |
|4        |01/01/2001  |U          |9288     |
|6        |01/01/2001  |U          |6048     |
|8        |01/01/2001  |U          |6309     |
|9        |01/01/2001  |U          |11207    |
|10       |01/01/2001  |U          |385      |
|11       |01/01/2001  |U          |610      |
|12       |01/01/2001  |U          |3678     |
|18       |01/01/2001  |U          |375      |
|20       |01/01/2001  |U          |7096     |
----------------------------------------------



An entity is an abstraction over a set of primary keys used for looking up feature data. An entity represents a real-world "thing" that has data associated with it. The cell below registers an entity called "route" in the Feature Store.

In [10]:
entity = Entity(
    name="route",
    join_keys=["DATE"],
)
fs.register_entity(entity)

#Show the entities
fs.list_entities().show()


-----------------------------------------------------------------------------------
|"NAME"  |"JOIN_KEYS"  |"DESC"                                          |"OWNER"  |
-----------------------------------------------------------------------------------
|ROUTE   |["DATE"]     |Starting and ending stations for the bike ride  |RAJIV    |
-----------------------------------------------------------------------------------



## 3. Distributed Feature Engineering

Let's do some feature engineering and then move that logic to the feature store. The feature engineering includes adding a day of the week, aggregating the data by day, and later joining in weather data.

These operations run inside a Snowflake warehouse, which provides improved performance and scalability through distributed execution of these scikit-learn-style preprocessing functions. A SMALL warehouse is enough for this dataset, but you can always move up to larger ones, including Snowpark-optimized warehouses (16x the memory per node of a standard warehouse), e.g., `session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'").collect()`

In [11]:
session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'").collect()

[Row(status='Warehouse SNOWPARK_OPT_WH successfully created.')]

Simple feature engineering

In [12]:
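# sum and listagg are used through the F namespace below, so they are not
# imported here (importing sum would shadow Python's built-in sum)
from snowflake.snowpark.functions import col, to_timestamp, dayofweek, month, lag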
from snowflake.snowpark import Window

df = df.with_column('DATE', to_timestamp(col('DATE'), 'MM/DD/YYYY'))

# Add a new column for the day of the week
# The day of week is represented as an integer, with 0 = Sunday, 1 = Monday, ..., 6 = Saturday
df = df.with_column('DAY_OF_WEEK', dayofweek(col('DATE')))

df.show()

-----------------------------------------------------------------------
|"ROUTE"  |"DAYTYPE"  |"RIDES"  |"DATE"               |"DAY_OF_WEEK"  |
-----------------------------------------------------------------------
|3        |U          |7354     |2001-01-01 00:00:00  |1              |
|4        |U          |9288     |2001-01-01 00:00:00  |1              |
|6        |U          |6048     |2001-01-01 00:00:00  |1              |
|8        |U          |6309     |2001-01-01 00:00:00  |1              |
|9        |U          |11207    |2001-01-01 00:00:00  |1              |
|10       |U          |385      |2001-01-01 00:00:00  |1              |
|11       |U          |610      |2001-01-01 00:00:00  |1              |
|12       |U          |3678     |2001-01-01 00:00:00  |1              |
|18       |U          |375      |2001-01-01 00:00:00  |1              |
|20       |U          |7096     |2001-01-01 00:00:00  |1              |
----------------------------------------------------------------
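
The Sunday = 0 convention in the comment above differs from Python's own `date.weekday()`, which starts at Monday = 0. The sketch below shows the mapping in plain Python as a sanity check on the output; `sunday_based_dayofweek` is a hypothetical helper, and it assumes the warehouse uses the default week-start setting that produced the values shown above.

```python
from datetime import date

def sunday_based_dayofweek(d: date) -> int:
    """Map Python's Monday=0..Sunday=6 to Sunday=0..Saturday=6."""
    return (d.weekday() + 1) % 7

print(sunday_based_dayofweek(date(2001, 1, 1)))  # a Monday
print(sunday_based_dayofweek(date(2001, 1, 6)))  # a Saturday
print(sunday_based_dayofweek(date(2001, 1, 7)))  # a Sunday
```

Keeping the convention explicit matters if you ever recompute this feature outside Snowflake, for example in a local test.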

A bit more feature engineering, but again, this is very familiar syntax.

In [13]:
# Add a new column for the month
df = df.with_column('MONTH', month(col('DATE')))

# Group by DATE, DAY_OF_WEEK, and MONTH, then aggregate
total_riders = df.group_by('DATE','DAY_OF_WEEK','MONTH').agg(
    F.listagg('DAYTYPE', is_distinct=True).alias('DAYTYPE'),
    F.sum('RIDES').alias('TOTAL_RIDERS')
).order_by('DATE')

#Define a window specification
window_spec = Window.order_by('DATE')

# Add a lagged column for total ridership of the previous day
total_riders = total_riders.with_column('PREV_DAY_RIDERS', lag(col('TOTAL_RIDERS'), 1).over(window_spec))

# Show the resulting dataframe
print (total_riders.count())
# show() prints the rows itself and returns None, so it is not wrapped in print()
total_riders.show()

7364
--------------------------------------------------------------------------------------------------
|"DATE"               |"DAY_OF_WEEK"  |"MONTH"  |"DAYTYPE"  |"TOTAL_RIDERS"  |"PREV_DAY_RIDERS"  |
--------------------------------------------------------------------------------------------------
|2001-01-01 00:00:00  |1              |1        |U          |295439          |NULL               |
|2001-01-02 00:00:00  |2              |1        |W          |776862          |295439             |
|2001-01-03 00:00:00  |3              |1        |W          |820048          |776862             |
|2001-01-04 00:00:00  |4              |1        |W          |867675          |820048             |
|2001-01-05 00:00:00  |5              |1        |W          |887519          |867675             |
|2001-01-06 00:00:00  |6              |1        |A          |575407          |887519             |
|2001-01-07 00:00:00  |0              |1        |U          |374435          |575407             |
|2001
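
To make the aggregation and the lag concrete, here is the same logic on a handful of rows in plain Python. It is only a sketch of what the Snowpark calls compute, not how Snowflake executes them; the first day's rides come from the sample above, and the later rows are hypothetical values chosen for illustration.

```python
from collections import defaultdict

rows = [
    ("2001-01-01", "3", 7354),
    ("2001-01-01", "4", 9288),
    ("2001-01-02", "3", 100),  # hypothetical value
    ("2001-01-02", "4", 200),  # hypothetical value
    ("2001-01-03", "3", 300),  # hypothetical value
]

# Group by date and sum rides, like group_by('DATE').agg(sum('RIDES')).
totals = defaultdict(int)
for day, _route, rides in rows:
    totals[day] += rides

# Walk the days in order and carry the previous row's total forward,
# like lag(TOTAL_RIDERS, 1) over a window ordered by DATE.
daily = []
previous = None
for day in sorted(totals):
    daily.append({"DATE": day, "TOTAL_RIDERS": totals[day], "PREV_DAY_RIDERS": previous})
    previous = totals[day]

for record in daily:
    print(record)
```

Note that a lag of 1 means "the previous row", not "the previous calendar day": if a date were missing from the data, the lagged value would come from the last available date.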

### Also, you can use ChatGPT to generate the code for you.

 <img src="fe_forecasting.png" alt="Forecasting Visualization" width="600"/>


**Feature Views**

In [14]:
agg_fv = FeatureView(
    name="AggBusData",
    entities=[entity],
    feature_df=total_riders,
    timestamp_col="DATE",
)

agg_fv = fs.register_feature_view(agg_fv, version="1", overwrite=True)

# Show our newly created Feature View and display as Pandas DataFrame
fs.list_feature_views().to_pandas()

Unnamed: 0,NAME,VERSION,DATABASE_NAME,SCHEMA_NAME,CREATED_ON,OWNER,DESC,ENTITIES,REFRESH_FREQ,REFRESH_MODE,SCHEDULING_STATE,WAREHOUSE
0,WEATHER,1,RAJIV,FEATURE_STORE_MLDEMO,2024-10-13 05:18:07.130,RAJIV,,"[\n ""ROUTE""\n]",1 day,INCREMENTAL,ACTIVE,RAJIV
1,AGGBUSDATA,1,RAJIV,FEATURE_STORE_MLDEMO,2024-10-13 13:22:57.771,RAJIV,,"[\n ""ROUTE""\n]",,,,


### High level view of the Feature Store

 <img src="featurestore.png" alt="Forecasting Visualization" width="600"/>

## Join in the Weather Data from the Snowflake Marketplace

Instead of downloading data and building pipelines, you can draw on the many useful datasets, including weather data, in the Snowflake Marketplace. This means the data is only a SQL query away.

 [Cybersyn Weather](https://app.snowflake.com/marketplace/listing/GZTSZAS2KIM/cybersyn-inc-weather-environmental-essentials?search=weather)

SQL QUERY: 
```
SELECT
  ts.noaa_weather_station_id,
  ts.DATE,
  COALESCE(MAX(CASE WHEN ts.variable = 'minimum_temperature' THEN ts.Value ELSE NULL END), 0) AS minimum_temperature,
  COALESCE(MAX(CASE WHEN ts.variable = 'precipitation' THEN ts.Value ELSE NULL END), 0) AS precipitation,
  COALESCE(MAX(CASE WHEN ts.variable = 'maximum_temperature' THEN ts.Value ELSE NULL END), 0) AS maximum_temperature
FROM
  cybersyn.noaa_weather_metrics_timeseries AS ts
JOIN
  cybersyn.noaa_weather_station_index AS idx
ON
  (ts.noaa_weather_station_id = idx.noaa_weather_station_id)
WHERE
  idx.NOAA_WEATHER_STATION_ID = 'USW00014819'
  AND (ts.VARIABLE = 'minimum_temperature' OR ts.VARIABLE = 'precipitation' OR ts.VARIABLE = 'maximum_temperature')
GROUP BY
  ts.noaa_weather_station_id,
  ts.DATE
LIMIT 1000;
```


In [15]:
weather = session.read.table('RAJIV.FEATURE_STORE_MLDEMO.CHICAGO_WEATHER')

from snowflake.snowpark.types import DoubleType
weather = weather.withColumn('MINIMUM_TEMPERATURE', weather['MINIMUM_TEMPERATURE'].cast(DoubleType()))
weather = weather.withColumn('MAXIMUM_TEMPERATURE', weather['MAXIMUM_TEMPERATURE'].cast(DoubleType()))
weather = weather.withColumn('PRECIPITATION', weather['PRECIPITATION'].cast(DoubleType()))

weather.show()

------------------------------------------------------------------------------------------------------------
|"NOAA_WEATHER_STATION_ID"  |"DATE"      |"MINIMUM_TEMPERATURE"  |"MAXIMUM_TEMPERATURE"  |"PRECIPITATION"  |
------------------------------------------------------------------------------------------------------------
|USW00014819                |2020-05-21  |13.3                   |20.0                   |0.0              |
|USW00014819                |2021-11-09  |6.1                    |13.9                   |0.0              |
|USW00014819                |2016-05-09  |10.6                   |15.6                   |9.1              |
|USW00014819                |2013-09-14  |10.6                   |22.2                   |0.0              |
|USW00014819                |2015-04-12  |8.3                    |21.7                   |0.0              |
|USW00014819                |2019-04-19  |5.0                    |9.4                    |0.0              |
|USW00014819       

Creating a feature view here for the weather data. This feature view will refresh every 24 hours. This is essential data that is constantly changing and Snowflake uses a dynamic table to manage the process. 

In [16]:
weather_fv = FeatureView(
    name="weather",
    entities=[entity],
    feature_df=weather,
    timestamp_col="DATE",
    refresh_freq="1 day", 
)

weather_fv = fs.register_feature_view(weather_fv, version="1", overwrite=True)

## Generate Training Dataset

Our feature store is filled with data, but we don't need to use it all. Here we select a subset of the feature store for training. 

In [17]:
# Create a date range between 2013 and 2019
date_range = pd.date_range(start='01/01/2013', end='12/31/2019')
date_column = date_range.strftime('%m/%d/%Y')
df = pd.DataFrame(date_column, columns=['DATE'])
spine_df = session.create_dataframe(df)
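
The spine is simply one row per calendar day in the chosen window. As a sketch, the same list can be built with the standard library alone, which also makes it easy to check how many rows to expect; `date_spine` is a hypothetical helper name.

```python
from datetime import date, timedelta

def date_spine(start: date, end: date) -> list:
    """Return every calendar day from start to end (inclusive) as MM/DD/YYYY."""
    days = (end - start).days + 1
    return [(start + timedelta(days=i)).strftime("%m/%d/%Y") for i in range(days)]

spine = date_spine(date(2013, 1, 1), date(2019, 12, 31))
print(len(spine), spine[0], spine[-1])
```

Seven years with one leap day (2016) gives 2,556 spine rows before any rows with missing features are dropped.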

Generate a training dataset 

In [18]:
training_set = fs.generate_training_set(
    spine_df=spine_df,
    features=[agg_fv,weather_fv])

In [19]:
training_set.show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"DATE"      |"DAY_OF_WEEK"  |"MONTH"  |"DAYTYPE"  |"TOTAL_RIDERS"  |"PREV_DAY_RIDERS"  |"NOAA_WEATHER_STATION_ID"  |"MINIMUM_TEMPERATURE"  |"MAXIMUM_TEMPERATURE"  |"PRECIPITATION"  |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|12/25/2015  |5              |12       |U          |229983          |528494             |USW00014819                |2.2                    |7.8                    |0.0              |
|06/18/2019  |2              |6        |W          |779717          |734899             |USW00014819                |14.4                   |26.1                   |0.0              |
|02/01/2018  |4              |2        |W          |821729          |812563     

In [20]:
## Dropping any null values
from snowflake.snowpark.functions import col, is_null, to_date

# Build a single filter condition that is true when any column is NULL
non_finite_filter = None

# OR together an IS NULL test for every column
for column in training_set.columns:
    current_filter = is_null(col(column))
    non_finite_filter = current_filter if non_finite_filter is None else (non_finite_filter | current_filter)

# Keep only the rows in which no column is NULL
df_filtered = training_set.filter(~non_finite_filter)
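
The loop above folds one "is null" test per column into a single OR condition and then negates it. The following plain-Python sketch shows the same fold on two rows taken from the aggregated output earlier; it illustrates the logic only and is not Snowpark code.

```python
from functools import reduce
from operator import or_

rows = [
    {"TOTAL_RIDERS": 295439, "PREV_DAY_RIDERS": None},    # first day has no lag value
    {"TOTAL_RIDERS": 776862, "PREV_DAY_RIDERS": 295439},
]

def has_null(row: dict) -> bool:
    """OR together one 'is None' test per column."""
    return reduce(or_, (value is None for value in row.values()), False)

# Keep only rows where no column is null, like filter(~condition).
kept = [row for row in rows if not has_null(row)]
print(kept)
```

This is why the first day of the series disappears from the training set: its lagged ridership is null.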

In [21]:
#Split the data into training and test sets
df_filtered = df_filtered.withColumn("DATE", to_date(col("DATE"), 'MM/dd/yyyy'))
train = df_filtered.filter(col('DATE') < '01/01/2019')
test = df_filtered.filter(col('DATE') >= '01/01/2019')

In [22]:
print (train.count())
print (test.count())

2190
365
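
Converting `DATE` to a real date type before splitting matters because `MM/DD/YYYY` strings do not sort chronologically. A short plain-Python illustration (the helper `parse_us_date` is hypothetical):

```python
from datetime import datetime

def parse_us_date(text: str):
    """Parse an MM/DD/YYYY string into a date object."""
    return datetime.strptime(text, "%m/%d/%Y").date()

cutoff = "01/01/2019"
day = "12/31/2018"

# String comparison looks at characters left to right, so "12..." > "01...".
print(day < cutoff)  # False
# Date comparison respects the calendar.
print(parse_us_date(day) < parse_us_date(cutoff))  # True
```

With a chronological split like this one, every test row comes strictly after every training row, which avoids leaking future ridership into training.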


## 4. Distributed Feature Engineering in a Pipeline

Feature engineering + XGBoost

In [23]:
session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'").collect()
session.sql("alter warehouse snowpark_opt_wh set max_concurrency_level = 1").collect()

[Row(status='Statement executed successfully.')]

In [24]:
 ## Distributed Preprocessing - 25X to 50X faster
numeric_features = ['DAY_OF_WEEK','MONTH','PREV_DAY_RIDERS','MINIMUM_TEMPERATURE','MAXIMUM_TEMPERATURE','PRECIPITATION']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_cols = ['DAYTYPE']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Step renamed to match the encoder actually used (ordinal, not one-hot)
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=-99999))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_cols)
        ])

pipeline = Pipeline(steps=[('preprocessor', preprocessor),('model', XGBRegressor())])
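
For readers new to these transformers, here is a rough plain-Python sketch of what the two preprocessing steps compute. It assumes the Snowflake ML classes follow scikit-learn's definitions (population standard deviation for scaling, sorted categories for ordinal codes); the function names are hypothetical.

```python
from statistics import fmean, pstdev

def standard_scale(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

def fit_ordinal(categories):
    """Assign each distinct category an integer code in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(categories)))}

def ordinal_encode(values, mapping, unknown_value=-99999):
    """Look up each value's code, falling back to unknown_value."""
    return [mapping.get(v, unknown_value) for v in values]

scaled = standard_scale([2.0, 4.0, 6.0])
mapping = fit_ordinal(["U", "W", "W", "A"])
encoded = ordinal_encode(["W", "A", "X"], mapping)
print(scaled)
print(mapping, encoded)
```

The `unknown_value` fallback is what lets the fitted pipeline score a day type it never saw in training instead of raising an error. Tree models such as XGBoost do not strictly need scaled inputs, so the scaler here is mostly a demonstration of the distributed preprocessing.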

## 5. Distributed Training

These operations are done inside the Snowpark warehouse which provides improved performance and scalability with distributed execution for these scikit-learn preprocessing functions, hyperparameter tuning (grid and random) and XGBoost training (and many other types of models).

In [25]:
 ## Distributed HyperParameter Optimization
hyper_param = dict(
        model__max_depth=[2,4],
        model__learning_rate=[0.1,0.3],
    )

xg_model = GridSearchCV(
    estimator=pipeline,
    param_grid=hyper_param,
    #cv=5,
    input_cols=numeric_features + categorical_cols,
    label_cols=['TOTAL_RIDERS'],
    output_cols=["TOTAL_RIDERS_FORECAST"],
)

# Fit and Score
xg_model.fit(train)
##Takes 25 seconds

<snowflake.ml.modeling.model_selection.grid_search_cv.GridSearchCV at 0x30c654190>
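
A grid search evaluates the Cartesian product of the listed values. The sketch below enumerates the candidates for the grid defined above using only the standard library; sorting the keys is an assumption made to mirror the alphabetical ordering scikit-learn uses for its parameter grids.

```python
from itertools import product

hyper_param = {
    "model__max_depth": [2, 4],
    "model__learning_rate": [0.1, 0.3],
}

# One candidate per combination of values, with keys in sorted order.
keys = sorted(hyper_param)
grid = [dict(zip(keys, combo)) for combo in product(*(hyper_param[k] for k in keys))]

for candidate in grid:
    print(candidate)
```

Two values for each of two parameters gives four candidates, and each candidate is fit once per cross-validation fold, so the cost grows quickly as the grid widens.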

## 6. Model Evaluation
Look at the results of the model. `cv_results_` is a dictionary in which each key is a string describing one of the metrics or parameters, and the corresponding value is an array with one entry per combination of parameters.

In [26]:
session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'SMALL'").collect()

[Row(status='Warehouse SNOWPARK_OPT_WH successfully created.')]

In [27]:
cv_results = xg_model.to_sklearn().cv_results_

for i in range(len(cv_results['params'])):
    print(f"Parameters: {cv_results['params'][i]}")
    print(f"Mean Test Score: {cv_results['mean_test_score'][i]}")
    print()

Parameters: {'model__learning_rate': 0.1, 'model__max_depth': 2}
Mean Test Score: 0.9336184119766777

Parameters: {'model__learning_rate': 0.1, 'model__max_depth': 4}
Mean Test Score: 0.9471567106665015

Parameters: {'model__learning_rate': 0.3, 'model__max_depth': 2}
Mean Test Score: 0.9400118721311458

Parameters: {'model__learning_rate': 0.3, 'model__max_depth': 4}
Mean Test Score: 0.9450308182128312
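
Picking the winner from these results is a one-line maximum. The sketch below copies the four printed scores into plain Python; it assumes a higher score is better, which holds if the scorer is R² (presumably the default here, as in scikit-learn for regressors).

```python
# (parameters, mean test score) pairs copied from the output above
results = [
    ({"model__learning_rate": 0.1, "model__max_depth": 2}, 0.9336184119766777),
    ({"model__learning_rate": 0.1, "model__max_depth": 4}, 0.9471567106665015),
    ({"model__learning_rate": 0.3, "model__max_depth": 2}, 0.9400118721311458),
    ({"model__learning_rate": 0.3, "model__max_depth": 4}, 0.9450308182128312),
]

best_params, best_score = max(results, key=lambda item: item[1])
print(best_params, round(best_score, 4))
```

The spread between the best and worst candidate is small, so on this dataset the tree depth matters more than the learning rate, but neither changes the picture much.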



Look at the accuracy of the model

In [28]:
from snowflake.ml.modeling.metrics import mean_absolute_error
testpreds = xg_model.predict(test)
# The metric computed here is the mean absolute error, so label it MAE
print('MAE:', mean_absolute_error(df=testpreds, y_true_col_names='TOTAL_RIDERS', y_pred_col_names='"TOTAL_RIDERS_FORECAST"'))
testpreds.select("DATE", "TOTAL_RIDERS", "TOTAL_RIDERS_FORECAST").show(10)

MAE: 34076.139726027395
---------------------------------------------------------
|"DATE"      |"TOTAL_RIDERS"  |"TOTAL_RIDERS_FORECAST"  |
---------------------------------------------------------
|2019-05-15  |806302          |820550.5625              |
|2019-06-20  |751511          |757523.0                 |
|2019-04-17  |708517          |761306.75                |
|2019-02-14  |819970          |811306.0625              |
|2019-01-07  |717818          |800692.25                |
|2019-12-05  |827466          |804885.375               |
|2019-06-18  |779717          |812158.3125              |
|2019-08-04  |374596          |393182.75                |
|2019-03-30  |427224          |434139.125               |
|2019-07-26  |721724          |758486.25                |
---------------------------------------------------------
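
The metric computed by `mean_absolute_error` is the average absolute gap between actual and forecast ridership. As a worked example, the sketch below recomputes it in plain Python for just the ten rows displayed above (`mae_of` is a hypothetical helper); the result will differ from the full test-set figure because it covers only this sample.

```python
# (actual, forecast) pairs copied from the ten rows displayed above
pairs = [
    (806302, 820550.5625),
    (751511, 757523.0),
    (708517, 761306.75),
    (819970, 811306.0625),
    (717818, 800692.25),
    (827466, 804885.375),
    (779717, 812158.3125),
    (374596, 393182.75),
    (427224, 434139.125),
    (721724, 758486.25),
]

def mae_of(pairs):
    """Average of |actual - forecast| over all rows."""
    return sum(abs(actual - forecast) for actual, forecast in pairs) / len(pairs)

mae_sample = mae_of(pairs)
print(f"MAE on the displayed rows: {mae_sample:.2f}")
```

Because the error is in riders per day, it is easy to interpret: compare it against typical weekday totals of roughly 700,000 to 800,000 in the sample.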



Materialize the results to a table

In [29]:
testpreds.write.save_as_table(table_name='RAJIV.FEATURE_STORE_MLDEMO.CHICAGO_BUS_RIDES_FORECAST', mode='overwrite')

Using metrics from snowpark so calculation is done inside snowflake

## 7. Save to the Model Registry and use for Predictions (Python & SQL)

Connect to the registry

In [30]:
from snowflake.ml.registry import Registry
reg = Registry(session=session, database_name="RAJIV", schema_name="FEATURE_STORE_MLDEMO")

In [None]:
model_ref = reg.log_model(
    model_name="Forecasting_Bus_Ridership",
    version_name="v12",    
    model=xg_model,
    conda_dependencies=["scikit-learn","xgboost"],
    sample_input_data=train,
    comment="XGBoost model, Oct 9"
)

Let's retrieve the model from the registry

In [32]:
reg_model = reg.get_model("Forecasting_Bus_Ridership").version("v12")

Here is an example of exporting from the model registry

In [33]:
#reg_model.export("/Users/rajishah/Code/snowflake-notebooks/Forecasting_ChicagoBus/model")

Let's do predictions inside the warehouse for some evaluation

In [34]:
test.show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"DAY_OF_WEEK"  |"MONTH"  |"DAYTYPE"  |"TOTAL_RIDERS"  |"PREV_DAY_RIDERS"  |"NOAA_WEATHER_STATION_ID"  |"MINIMUM_TEMPERATURE"  |"MAXIMUM_TEMPERATURE"  |"PRECIPITATION"  |"DATE"      |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2              |6        |W          |779717          |734899             |USW00014819                |14.4                   |26.1                   |0.0              |2019-06-18  |
|0              |8        |U          |374596          |496373             |USW00014819                |22.8                   |30.0                   |0.0              |2019-08-04  |
|3              |4        |W          |708517          |721679             |USW0

In [35]:
remote_prediction = reg_model.run(test, function_name='predict')
remote_prediction.sort("DATE").select("DATE","TOTAL_RIDERS","TOTAL_RIDERS_FORECAST").show(10)

---------------------------------------------------------
|"DATE"      |"TOTAL_RIDERS"  |"TOTAL_RIDERS_FORECAST"  |
---------------------------------------------------------
|2019-01-01  |247279          |311394.375               |
|2019-01-02  |585996          |602599.0625              |
|2019-01-03  |660631          |732633.6875              |
|2019-01-04  |662011          |718364.0625              |
|2019-01-05  |440848          |464067.8125              |
|2019-01-06  |316844          |336178.4375              |
|2019-01-07  |717818          |800692.25                |
|2019-01-08  |779946          |808019.3125              |
|2019-01-09  |743021          |791728.1875              |
|2019-01-10  |743075          |757489.125               |
---------------------------------------------------------



Save evaluation metrics in the model registry

In [36]:
from snowflake.ml.modeling.metrics import mean_absolute_error
mae = mean_absolute_error(df=remote_prediction, y_true_col_names='TOTAL_RIDERS', y_pred_col_names='"TOTAL_RIDERS_FORECAST"')
reg_model.set_metric("MAE", value=mae)

In [37]:
reg_model.show_metrics()

{'MAE': 34076.139726027395}

If you look in the activity view, you can find the generated SQL, which will run a bit faster. This SQL command returns its result as a Snowpark DataFrame. You could use `collect` to pull the results into your local session.

Modify the SQL by adding your specific model with this line: `MODEL_VERSION_ALIAS AS MODEL RAJIV.FEATURE_STORE_MLDEMO.FORECASTING_BUS_RIDERSHIP VERSION V1` and by updating the location of the data you want predictions for, which is defined here: `SNOWPARK_ML_MODEL_INFERENCE_INPUT`

In [38]:
sqlquery = """SELECT "DATE", "TOTAL_RIDERS",  CAST ("TMP_RESULT"['TOTAL_RIDERS_FORECAST'] AS DOUBLE) AS "TOTAL_RIDERS_FORECAST" FROM (WITH SNOWPARK_ML_MODEL_INFERENCE_INPUT AS (SELECT  *  FROM ( SELECT "DAY_OF_WEEK", "MONTH", "DAYTYPE", "TOTAL_RIDERS", "PREV_DAY_RIDERS", "NOAA_WEATHER_STATION_ID", "MINIMUM_TEMPERATURE", "MAXIMUM_TEMPERATURE", "PRECIPITATION", to_date("DATE", 'MM/dd/yyyy') AS "DATE" FROM (
                    SELECT
                        l_1.*,
                        r_1.* EXCLUDE (DATE)
                    FROM (
                    SELECT
                        l_0.*,
                        r_0.* EXCLUDE (DATE)
                    FROM (SELECT  *  FROM ("RAJIV"."PUBLIC"."SNOWPARK_TEMP_TABLE_MDXB71DXXU")) l_0
                    LEFT JOIN (
                        SELECT DATE, DAY_OF_WEEK, MONTH, DAYTYPE, TOTAL_RIDERS, PREV_DAY_RIDERS
                        FROM RAJIV.FEATURE_STORE_MLDEMO.AGGBUSDATA$1
                    ) r_0
                    ON l_0.DATE = r_0.DATE
                ) l_1
                    LEFT JOIN (
                        SELECT DATE, NOAA_WEATHER_STATION_ID, MINIMUM_TEMPERATURE, MAXIMUM_TEMPERATURE, PRECIPITATION
                        FROM RAJIV.FEATURE_STORE_MLDEMO.WEATHER$1
                    ) r_1
                    ON l_1.DATE = r_1.DATE
                ) WHERE NOT ((((((((("DATE" IS NULL OR "DAY_OF_WEEK" IS NULL) OR "MONTH" IS NULL) OR "DAYTYPE" IS NULL) OR "TOTAL_RIDERS" IS NULL) OR "PREV_DAY_RIDERS" IS NULL) OR "NOAA_WEATHER_STATION_ID" IS NULL) OR "MINIMUM_TEMPERATURE" IS NULL) OR "MAXIMUM_TEMPERATURE" IS NULL) OR "PRECIPITATION" IS NULL)) WHERE ("DATE" >= '01/01/2019')),MODEL_VERSION_ALIAS AS MODEL RAJIV.FEATURE_STORE_MLDEMO.FORECASTING_BUS_RIDERSHIP VERSION V1
                SELECT *,
                    MODEL_VERSION_ALIAS!PREDICT(DAY_OF_WEEK, MONTH, PREV_DAY_RIDERS, MINIMUM_TEMPERATURE, MAXIMUM_TEMPERATURE, PRECIPITATION, DAYTYPE) AS TMP_RESULT
                FROM SNOWPARK_ML_MODEL_INFERENCE_INPUT) ORDER BY "DATE" ASC NULLS FIRST LIMIT 10"""

In [39]:
#results = session.sql(sqlquery).show()

## 8. Get SHAP Explanations

In [40]:
#explanations = reg_model.run(test, function_name="explain")

## 9. Show lineage

In [84]:
# Trace everything downstream of the raw ridership table (up to 4 hops)
SOURCE_TABLE = "RAJIV.FEATURE_STORE_MLDEMO.CHICAGO_BUS_RIDES"
df = session.lineage.trace(SOURCE_TABLE, "TABLE", direction="downstream", distance=4)

In [42]:
df.show()

--------------------------------------------------------------------------------------------------------------------------------------
|"SOURCE_OBJECT"                                     |"TARGET_OBJECT"                                     |"DIRECTION"  |"DISTANCE"  |
--------------------------------------------------------------------------------------------------------------------------------------
|{                                                   |{                                                   |Downstream   |1           |
|  "createdOn": "2024-10-13T20:22:44Z",              |  "createdOn": "2024-09-30T16:48:58Z",              |             |            |
|  "domain": "TABLE",                                |  "domain": "FEATURE_VIEW",                         |             |            |
|  "name": "RAJIV.FEATURE_STORE_MLDEMO.CHICAGO_B...  |  "name": "RAJIV.FEATURE_STORE_MLDEMO.AGGBUSDATA",  |             |            |
|  "status": "ACTIVE"                                | 

In [43]:
session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'SMALL'").collect()

[Row(status='Warehouse SNOWPARK_OPT_WH successfully created.')]

## 10. Deploy the model to SPCS for Inference

It's now possible to deploy and run a model in Snowpark Container Services (SPCS). This makes the Snowflake Model Registry more broadly useful: it can serve large models that need distributed clusters or GPUs for execution, as well as models that have pip dependencies on open-source or your own libraries and frameworks. All of these benefits can be realized without mastering Docker containers, Kubernetes, etc.

In [None]:
reg_model.create_service(service_name="ChicagoBusForecastv12",
                  service_compute_pool="NOTEBOOK_CPU_S",
                  image_repo="rajiv.public.images",
                  build_external_access_integration="RAJ_OPEN_ACCESS_INTEGRATION",
                  ingress_enabled=True)

In [80]:
spcs_prediction = reg_model.run(test, function_name='predict', service_name="CHICAGOBUSFORECASTV12")
spcs_prediction.sort("DATE").select("DATE","TOTAL_RIDERS","TOTAL_RIDERS_FORECAST").show(10)

---------------------------------------------------------
|"DATE"      |"TOTAL_RIDERS"  |"TOTAL_RIDERS_FORECAST"  |
---------------------------------------------------------
|2019-01-01  |247279          |311394.375               |
|2019-01-02  |585996          |602599.0625              |
|2019-01-03  |660631          |732633.6875              |
|2019-01-04  |662011          |718364.0625              |
|2019-01-05  |440848          |464067.8125              |
|2019-01-06  |316844          |336178.4375              |
|2019-01-07  |717818          |800692.25                |
|2019-01-08  |779946          |808019.3125              |
|2019-01-09  |743021          |791728.1875              |
|2019-01-10  |743075          |757489.125               |
---------------------------------------------------------



In [None]:
session.sql("SHOW ENDPOINTS IN SERVICE RAJIV.FEATURE_STORE_MLDEMO.CHICAGOBUSFORECASTV12").collect()

In [None]:
import json
from pprint import pprint
import requests

# Generate headers using the active connection
def get_headers(existing_snowflake_conn):
    token = existing_snowflake_conn._rest._token_request('ISSUE')
    headers = {'Authorization': f'Snowflake Token=\"{token["data"]["sessionToken"]}\"'}
    return headers

# Put the endpoint URL and your data here
URL = 'https://jub4q47i-sfsenorthamerica-demo412.snowflakecomputing.app/predict'

def prepare_data(test):
    # Ensure 'test' is defined before this point
    # Assuming 'test' is a Snowpark DataFrame object
    df = test.toPandas()
    df['DATE'] = 0
    df = df.to_dict(orient='records')
    data = {
        'data': []
        }
    for idx, x in enumerate(df):
        data['data'].append([idx, list(x.values())])
            
    return data

# Send the request to the endpoint
def send_request(data: dict, headers: dict):
    output = requests.post(URL, json=data, headers=headers)
    # Raise explicitly: assert statements are skipped when Python runs with -O
    if output.status_code != 200:
        raise RuntimeError(f"Failed to get response from the service. Status code: {output.status_code}")
    return output.content

# Testing the flow using your active session
# session._conn._conn is the session's underlying connection object (a private attribute)
headers = get_headers(session._conn._conn)
# The headers hold a live session token, so they are not printed here
data = prepare_data(test)  # Convert the DataFrame to the right format
print (data)

# Send the request and get the results
results = send_request(data=data, headers=headers)
pprint(json.loads(results))
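
The request body built by `prepare_data` is a JSON object with a single `data` key holding one entry per row, each tagged with its row index. The sketch below reproduces that shape in plain Python for one row of feature values taken from the test output earlier; `build_payload` is a hypothetical helper that mirrors the nesting used above, and whether the service requires exactly this nesting is not confirmed here.

```python
import json

def build_payload(records):
    """Wrap each record as [row_index, [values...]] under a top-level 'data' key."""
    return {"data": [[idx, list(rec.values())] for idx, rec in enumerate(records)]}

records = [
    {"DAY_OF_WEEK": 2, "MONTH": 6, "DAYTYPE": "W", "PREV_DAY_RIDERS": 734899},
]
payload = build_payload(records)
print(json.dumps(payload))
```

The row index lets the caller match each prediction in the response back to the input row that produced it.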

In [46]:
#reg_model.delete_service("ChicagoBusForecastv11")

In [None]:
session.sql("DROP SERVICE RAJIV.FEATURE_STORE_MLDEMO.CHICAGOBUSFORECASTV11").collect()

In [48]:
#session.close()