Steps to run notebook:
1. Create a conda env with python3.8 (Empty conda env)
```
conda create --name snowml python=3.8
```
2. Activate conda env
```
conda activate snowml
```
3. Install conda pkg
```
conda install snowflake-ml-python 
# or local build if there are changes in SnowML lib you need: bazel build //snowflake/ml:wheel
# then do pip install {built pkg}
```
4. Install jupyter notebook
```
conda install jupyter
```
5. Start notebook
```
jupyter notebook
```

## Feature Store Example with Time Series Features
This notebook demonstrates advanced feature store usage with time series features. 
It will compute features from NY taxi trip data and demonstrate connections between training and prediction.
The reference example by Databricks is here: https://docs.databricks.com/en/_extras/notebooks/source/machine-learning/feature-store-with-uc-taxi-example.html#feature-store/feature-store

## Setup UI and Auto Import

In [32]:
# Scale cell width with the browser window to accommodate .show() commands for wider tables.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### [Optional 1] Import from local code repository

In [31]:
import sys
import os

# Simplify reading from the local repository
cwd=os.getcwd()
REPO_PREFIX="snowflake/ml"
LOCAL_REPO_PATH=cwd[:cwd.find(REPO_PREFIX)].rstrip('/')

if LOCAL_REPO_PATH not in sys.path:
    print(f"Adding {LOCAL_REPO_PATH} to system path")
    sys.path.append(LOCAL_REPO_PATH)

#### [Optional 2] Import from installed snowflake-ml-python wheel

In [2]:
import os
conda_env = os.environ['CONDA_DEFAULT_ENV']
import sys
sys.path.append(f'/opt/homebrew/anaconda3/envs/{conda_env}/lib/python3.8/site-packages')

## Prepare demo data

In [33]:
import importlib
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F, types as T
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark.types import DateType, TimeType, _NumericType, TimestampType
import datetime


In [34]:
session = Session.builder.configs(SnowflakeLoginOptions()).create()

In [35]:
source_df = session.table("SNOWML_FEATURE_STORE_TEST_DB.TEST_DATASET.yellow_tripdata_2016_01")

source_df = source_df.select(
    [
        "TRIP_DISTANCE", 
        "FARE_AMOUNT",
        "PASSENGER_COUNT",
        "PULOCATIONID",
        "DOLOCATIONID",
        F.cast(source_df.TPEP_PICKUP_DATETIME / 1000000, TimestampType()).alias("PICKUP_TS"),
        F.cast(source_df.TPEP_DROPOFF_DATETIME / 1000000, TimestampType()).alias("DROPOFF_TS"),
    ]).filter("DROPOFF_TS >= '2016-01-01 00:00:00' AND DROPOFF_TS < '2016-01-03 00:00:00'")
source_df.show()

-------------------------------------------------------------------------------------------------------------------------------------
|"TRIP_DISTANCE"  |"FARE_AMOUNT"  |"PASSENGER_COUNT"  |"PULOCATIONID"  |"DOLOCATIONID"  |"PICKUP_TS"          |"DROPOFF_TS"         |
-------------------------------------------------------------------------------------------------------------------------------------
|3.2              |14.0           |1                  |48              |262             |2016-01-01 00:12:22  |2016-01-01 00:29:14  |
|1.0              |9.5            |2                  |162             |48              |2016-01-01 00:41:31  |2016-01-01 00:55:10  |
|0.9              |6.0            |1                  |246             |90              |2016-01-01 00:53:37  |2016-01-01 00:59:57  |
|0.8              |5.0            |1                  |170             |162             |2016-01-01 00:13:28  |2016-01-01 00:18:07  |
|1.8              |11.0           |1                  |161    

## Create FeatureStore Client

Let's first create a feature store client.

We can pass in an existing database name, or a new database will be created upon the feature store initialization.

In [36]:
DEMO_DB = "FS_TIME_SERIES_EXAMPLE"
session.sql(f"DROP DATABASE IF EXISTS {DEMO_DB}").collect()  # start from scratch
session.sql(f"CREATE DATABASE IF NOT EXISTS {DEMO_DB}").collect()
session.sql(f"CREATE OR REPLACE WAREHOUSE PUBLIC WITH WAREHOUSE_SIZE='XSMALL'").collect()

fs = FeatureStore(
    session=session, 
    database=DEMO_DB, 
    name="AWESOME_FS", 
    default_warehouse="PUBLIC",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

## Create and register new Entities

In [37]:
trip_pickup = Entity(name="trip_pickup", join_keys=["PULOCATIONID"])
trip_dropoff = Entity(name="trip_dropoff", join_keys=["DOLOCATIONID"])
fs.register_entity(trip_pickup)
fs.register_entity(trip_dropoff)
fs.list_entities().to_pandas()

Unnamed: 0,NAME,JOIN_KEYS,DESC
0,TRIP_DROPOFF,DOLOCATIONID,
1,TRIP_PICKUP,PULOCATIONID,


## Define feature pipeline
We will compute a few time series features in the pipeline here.
Before we have *__value based range between__* in SQL, we will use a work around to mimic the calculation (NOTE: the work around won't be very accurate on computing the time series value due to missing gap filling functionality, but it should be enough for a demo purpose)

We will define two feature groups:
1. pickup features
    - Mean fare amount over the past 2 and 5 hours
2. dropoff features
    - Count of trips over the past 2 and 5 hours

### This is a UDF computing time window end
We will later turn these into built in functions for feature store

In [38]:
@F.pandas_udf(
    name="vec_window_end",
    is_permanent=True,
    stage_location=session.get_session_stage(),
    packages=["numpy", "pandas", "pytimeparse"],
    replace=True,
    session=session,
    immutable=True,
)
def vec_window_end_compute(
    x: T.PandasSeries[datetime.datetime],
    interval: T.PandasSeries[str],
) -> T.PandasSeries[datetime.datetime]:
    import numpy as np
    import pandas as pd
    from pytimeparse.timeparse import timeparse

    time_slice = timeparse(interval[0])
    if time_slice is None:
        raise ValueError(f"Cannot parse interval {interval[0]}")
    time_slot = (x - np.datetime64('1970-01-01T00:00:00')) // np.timedelta64(1, 's') // time_slice * time_slice + time_slice
    return pd.to_datetime(time_slot, unit='s')

### Define feature pipeline logics

In [39]:
from snowflake.snowpark import Window
from snowflake.snowpark.functions import col

# NOTE: these time window calculations are approximates and are not handling time gaps

def pre_aggregate_fn(df, ts_col, group_by_cols):
    df = df.with_column("WINDOW_END", F.call_udf("vec_window_end", F.col(ts_col), "15m"))
    df = df.group_by(group_by_cols + ["WINDOW_END"]).agg(
            F.sum("FARE_AMOUNT").alias("FARE_SUM_1_hr"),
            F.count("*").alias("TRIP_COUNT_1_hr")
         )
    return df

def pickup_features_fn(df):
    df = pre_aggregate_fn(df, "PICKUP_TS", ["PULOCATIONID"])
    
    window1 = Window.partition_by("PULOCATIONID").order_by(col("WINDOW_END").desc()).rows_between(Window.CURRENT_ROW, 7)
    window2 = Window.partition_by("PULOCATIONID").order_by(col("WINDOW_END").desc()).rows_between(Window.CURRENT_ROW, 19)

    df = df.with_columns(
        [
            "SUM_FARE_2_hr",
            "COUNT_TRIP_2hr",
            "SUM_FARE_5_hr",
            "COUNT_TRIP_5hr",
        ],
        [
            F.sum("FARE_SUM_1_hr").over(window1),
            F.sum("TRIP_COUNT_1_hr").over(window1),
            F.sum("FARE_SUM_1_hr").over(window2),
            F.sum("TRIP_COUNT_1_hr").over(window2),
        ]
    ).select(
        [
            col("PULOCATIONID"),
            col("WINDOW_END").alias("TS"),
            (col("SUM_FARE_2_hr") / col("COUNT_TRIP_2hr")).alias("MEAN_FARE_2_hr"),
            (col("SUM_FARE_5_hr") / col("COUNT_TRIP_5hr")).alias("MEAN_FARE_5_hr"),
        ]
    )
    return df

def dropoff_features_fn(df):
    df = pre_aggregate_fn(df, "DROPOFF_TS", ["DOLOCATIONID"])
    window1 = Window.partition_by("DOLOCATIONID").order_by(col("WINDOW_END").desc()).rows_between(Window.CURRENT_ROW, 7)
    window2 = Window.partition_by("DOLOCATIONID").order_by(col("WINDOW_END").desc()).rows_between(Window.CURRENT_ROW, 19)

    df = df.select(
        [
            col("DOLOCATIONID"),
            col("WINDOW_END").alias("TS"),
            F.sum("TRIP_COUNT_1_hr").over(window1).alias("COUNT_TRIP_2_hr"),
            F.sum("TRIP_COUNT_1_hr").over(window2).alias("COUNT_TRIP_5_hr"),
        ]
    )
    return df

pickup_df = pickup_features_fn(source_df)
pickup_df.show()

dropoff_df = dropoff_features_fn(source_df)
dropoff_df.show()

----------------------------------------------------------------------------------
|"PULOCATIONID"  |"TS"                 |"MEAN_FARE_2_HR"    |"MEAN_FARE_5_HR"    |
----------------------------------------------------------------------------------
|98              |2016-01-01 04:45:00  |26.0                |26.0                |
|98              |2016-01-01 14:00:00  |19.75               |19.75               |
|98              |2016-01-02 22:30:00  |156.5               |156.5               |
|225             |2016-01-01 00:30:00  |9.6                 |9.6                 |
|225             |2016-01-01 00:45:00  |11.833333333333334  |11.833333333333334  |
|225             |2016-01-01 01:00:00  |15.045454545454545  |15.045454545454545  |
|225             |2016-01-01 01:15:00  |13.928571428571429  |13.928571428571429  |
|225             |2016-01-01 01:30:00  |12.717948717948717  |12.717948717948717  |
|225             |2016-01-01 01:45:00  |13.169811320754716  |13.169811320754716  |
|225

## Create FeatureViews and materialize

Once the FeatureView construction is done, we can materialize the FeatureView to the Snowflake backend and incremental maintenance will start.

In [40]:
pickup_fv = FeatureView(
    name="trip_pickup_features", 
    entities=[trip_pickup], 
    feature_df=pickup_df, 
    timestamp_col="ts",
    refresh_freq="1 minute",
).attach_feature_desc({"MEAN_FARE_2_HR": "avg fare over past 2hr"})
pickup_fv = fs.register_feature_view(feature_view=pickup_fv, version="v1", block=True)

  self._create_dynamic_table(


In [42]:
dropoff_fv = FeatureView(
    name="trip_dropoff_features", 
    entities=[trip_dropoff], 
    feature_df=dropoff_df, 
    timestamp_col="ts",
    refresh_freq="1 minute",
).attach_feature_desc({"COUNT_TRIP_2_HR": "trip count over past 2hr"})
dropoff_fv = fs.register_feature_view(feature_view=dropoff_fv, version="v1", block=True)

## Explore FeatureViews
We can easily discover what are the materialized FeatureViews and the corresponding features with *__fs.list_feature_views()__*. 

We can also apply filters based on Entity name or FeatureView names.

In [43]:
fs.list_feature_views().select(["NAME", "VERSION", "ENTITIES", "FEATURE_DESC"]).show()

---------------------------------------------------------------------------------------------------------------------
|"NAME"                 |"VERSION"  |"ENTITIES"                  |"FEATURE_DESC"                                    |
---------------------------------------------------------------------------------------------------------------------
|TRIP_DROPOFF_FEATURES  |V1         |[                           |{                                                 |
|                       |           |  {                         |  "COUNT_TRIP_2_HR": "trip count over past 2hr",  |
|                       |           |    "desc": "",             |  "COUNT_TRIP_5_HR": ""                           |
|                       |           |    "join_keys": [          |}                                                 |
|                       |           |      "DOLOCATIONID"        |                                                  |
|                       |           |    ],             

## Generate training data and train a model
The training data generation will lookup __point-in-time correct__ feature values and join with the spine dataframe.

In [44]:
spine_df = source_df.select(["PULOCATIONID", "DOLOCATIONID", "PICKUP_TS", "FARE_AMOUNT"])
training_data = fs.generate_dataset(
    spine_df=spine_df,
    features=[pickup_fv, dropoff_fv],
    materialized_table="yellow_tripdata_2016_01_training_data",
    spine_timestamp_col="PICKUP_TS",
    spine_label_cols = ["FARE_AMOUNT"]
)

-----------------------------------------------------------------------------------------------------------------------------------------------------------
|"DOLOCATIONID"  |"PICKUP_TS"          |"PULOCATIONID"  |"FARE_AMOUNT"  |"MEAN_FARE_2_HR"    |"MEAN_FARE_5_HR"    |"COUNT_TRIP_2_HR"  |"COUNT_TRIP_5_HR"  |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
|262             |2016-01-01 00:12:22  |48              |14.0           |NULL                |NULL                |NULL               |NULL               |
|48              |2016-01-01 00:41:31  |162             |9.5            |11.451428571428572  |11.451428571428572  |137                |137                |
|90              |2016-01-01 00:53:37  |246             |6.0            |13.765232974910393  |13.765232974910393  |214                |214                |
|162             |2016-01-01 00:13:28  |170             |5.0    

{'queries': ['SELECT * FROM FS_TIME_SERIES_EXAMPLE.AWESOME_FS.yellow_tripdata_2016_01_training_data_2023_12_12_14_10_32'],
 'post_actions': []}

In [45]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

training_pd = training_data.df.to_pandas()
X = training_pd.drop(["FARE_AMOUNT", "PICKUP_TS"], axis=1)
y = training_pd["FARE_AMOUNT"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

Unnamed: 0,DOLOCATIONID,PULOCATIONID,MEAN_FARE_2_HR,MEAN_FARE_5_HR,COUNT_TRIP_2_HR,COUNT_TRIP_5_HR
359595,90,249,8.806985,9.258179,404.0,995.0
360562,170,79,10.242111,10.517555,821.0,2251.0
681540,50,107,9.416096,9.226157,394.0,956.0
951079,48,151,10.308511,10.27894,1289.0,3380.0
477164,79,249,9.562016,9.554124,1124.0,2827.0


In [46]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
estimator = make_pipeline(imp, LinearRegression())

reg = estimator.fit(X, y)
r2_score = reg.score(X_test, y_test)
print(r2_score * 100,'%')

y_pred = reg.predict(X_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

31.1498254012058 %
Mean squared error: 91.42


## Log model with Model Registry
We can log the model along with its training dataset metadata with model registry.

In [47]:
from snowflake.ml.registry import model_registry, artifact
import time

registry = model_registry.ModelRegistry(session=session, database_name="my_cool_registry", create_if_not_exists=True)



In [48]:
DATASET_NAME = "MY_DATASET"
DATASET_VERSION = f"V1_{time.time()}"

my_dataset = registry.log_artifact(
    artifact=training_data,
    name=DATASET_NAME,
    version=DATASET_VERSION,
)



In [50]:
model_name = "MY_MODEL"
model_version = f"V1_{time.time()}"

model_ref = registry.log_model(
    model_name=model_name,
    model_version=model_version,
    model=estimator,
    artifacts=[my_dataset],
)



## Restore model and predict with latest features
We retrieve the training dataset from registry then construct dataframe of latest feature values. Then we restore the model from registry. At last, we can predict with latest feature values.

In [51]:
pred_df = training_data.df.sample(0.01).select(
    ['PULOCATIONID', 'DOLOCATIONID', 'PICKUP_TS'])

enriched_df = fs.retrieve_feature_values(
    spine_df=pred_df, 
    features=training_data.load_features(), 
    spine_timestamp_col='PICKUP_TS'
).drop(['PICKUP_TS']).to_pandas()

pred = estimator.predict(enriched_df)
print(pred)



[ 9.71003863 13.95909809 13.95909809 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268 11.70994268
 11.70994268 11.70994268 11.70994268 11.70994268 11

In [None]:
model_ref = model_registry.ModelReference(
    registry=registry, 
    model_name=model_name, 
    model_version=model_version,
).load_model()

pred = model_ref.predict(enriched_df)

print(pred)

## DO NOT READ
Below is a simple test for the window_end function

In [None]:
from snowflake.snowpark import Session
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import functions as F, types as T
import datetime

session = Session.builder.configs(SnowflakeLoginOptions()).create()

udf_name = "window_end"
    
@F.pandas_udf(
    name=udf_name,
    replace=True,
    packages=["numpy", "pandas", "pytimeparse"],
    session=session,
)
def vec_window_end_compute(
    x: T.PandasSeries[datetime.datetime],
    interval: T.PandasSeries[str],
) -> T.PandasSeries[datetime.datetime]:
    import numpy as np
    import pandas as pd
    from pytimeparse.timeparse import timeparse

    time_slice = timeparse(interval[0])
    if time_slice is None:
        raise ValueError(f"Cannot parse interval {interval[0]}")
    time_slot = (x - np.datetime64('1970-01-01T00:00:00')) // np.timedelta64(1, 's') // time_slice * time_slice + time_slice
    return pd.to_datetime(time_slot, unit='s')

df = session.create_dataframe(
    [
        '2023-01-31 01:02:03.004',
        '2023-01-31 01:14:59.999',
        '2023-01-31 01:15:00.000',
        '2023-01-31 01:15:00.004',
        '2023-01-31 01:17:10.007',
    ], 
    schema=['a']
)
df = df.select([F.to_timestamp("a").alias("ts")])

df = df.select(["TS", F.call_udf(udf_name, F.col("TS"), "15m").alias("window_end")])
df.show()

In [None]:
session.sql("select window_end(ts, '15m') from foobar").collect()