Steps to run notebook:
2. Create an empty conda env with Python 3.8
```
conda create --name snowml python=3.8
```
2. Activate conda env
```
conda activate snowml
```
3. Install the snowflake-ml-python conda package
```
conda install snowflake-ml-python
# Or, if you need local changes to the SnowML library, build the wheel: bazel build //snowflake/ml:wheel
# then install it: pip install {built pkg}
```
4. Install jupyter notebook
```
conda install jupyter
```
5. Start notebook
```
jupyter notebook
```
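
Before starting the notebook, it can help to confirm that the packages are importable from the active environment. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper, not part of snowflake-ml-python, and the module names `snowflake.ml` and `notebook` are assumptions about what the two conda packages install.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. "snowflake") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Module names below are assumptions about what the conda packages provide.
print(missing_packages(["snowflake.ml", "notebook"]))
```

An empty list means every listed module was found.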

## Basic Feature Store Usage Example
This notebook demonstrates feature store usage for static features.
The reference example by Databricks is here: https://docs.databricks.com/en/_extras/notebooks/source/machine-learning/feature-store-with-uc-basic-example.html

## Setup UI and Auto Import

In [None]:
# Scale cell width with the browser window to accommodate .show() commands for wider tables.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

%load_ext autoreload
%autoreload 2

#### [Optional 1] Import from local code repository

In [41]:
import sys
import os

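# Simplify reading from the local repository
cwd = os.getcwd()
REPO_PREFIX = "snowflake/ml"
prefix_index = cwd.find(REPO_PREFIX)

if prefix_index == -1:
    # str.find returns -1 on a miss, which would otherwise slice off the last character of cwd.
    print(f"{REPO_PREFIX} not found in {cwd}; leaving sys.path unchanged")
else:
    LOCAL_REPO_PATH = cwd[:prefix_index].rstrip('/')
    if LOCAL_REPO_PATH not in sys.path:
        print(f"Adding {LOCAL_REPO_PATH} to system path")
        sys.path.append(LOCAL_REPO_PATH)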

#### [Optional 2] Import from installed snowflake-ml-python wheel

In [37]:
import sys

sys.path.insert(0, '/tmp/snowml')

In [38]:
import importlib
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.ml.feature_store.feature_view import FeatureView
from snowflake.ml.feature_store.entity import Entity
from snowflake.ml.feature_store.feature_store import FeatureStore, CreationMode
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions

In [39]:
session = Session.builder.configs(SnowflakeLoginOptions()).create()

SnowflakeLoginOptions() is in private preview since 0.2.0. Do not use it in production. 


## Prepare demo data

We will use the wine quality dataset to demonstrate feature store usage.

In [None]:
session.file.put("file://winequality-red.csv", session.get_session_stage())

SOURCE_DB = session.get_current_database()
SOURCE_SCHEMA = session.get_current_schema()

from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType, FloatType
input_schema = StructType(
    [
        StructField("fixed_acidity", FloatType()), 
        StructField("volatile_acidity", FloatType()), 
        StructField("citric_acid", FloatType()), 
        StructField("residual_sugar", FloatType()), 
        StructField("chlorides", FloatType()), 
        StructField("free_sulfur_dioxide", IntegerType()),
        StructField("total_sulfur_dioxide", IntegerType()), 
        StructField("density", FloatType()), 
        StructField("pH", FloatType()), 
        StructField("sulphates", FloatType()),
        StructField("alcohol", FloatType()), 
        StructField("quality", IntegerType())
    ]
)
df = session.read.options({"field_delimiter": ";", "skip_header": 1}).schema(input_schema).csv(f"{session.get_session_stage()}/winequality-red.csv")
df.write.mode("overwrite").save_as_table("wine_data")
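
The two read options in the cell above say that fields are separated by semicolons and that the first line is a header to skip. As a local illustration only (standard library `csv`, not Snowpark), the same parsing rules look roughly like this. The sample rows are illustrative and cut down to three columns, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Illustrative sample in the same layout as the file: ';' delimiter, one header line.
SAMPLE = '"fixed acidity";"volatile acidity";"quality"\n7.4;0.7;5\n7.8;0.88;5\n'


def read_rows(text, delimiter=";", skip_header=1):
    """Parse delimited text, dropping the first skip_header lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)[skip_header:]
    # Cast each field explicitly, similar in spirit to applying a typed schema.
    return [(float(r[0]), float(r[1]), int(r[2])) for r in rows]


print(read_rows(SAMPLE))
```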

## Generate new synthetic data [Optional]
Run the cell below to generate new synthetic data for the wine dataset if needed.
NOTE: the synthetic data will be randomized based on the original data's statistics, so it may affect training quality.

In [None]:
from snowflake.ml.feature_store._internal.synthetic_data_generator import (
    SyntheticDataGenerator,
)
session2 = Session.builder.configs(SnowflakeLoginOptions()).create()
generator = SyntheticDataGenerator(session2, SOURCE_DB, SOURCE_SCHEMA, "wine_data")
generator.trigger(batch_size=10, num_batches=30, freq=10)
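
To make the note above concrete, here is a minimal sketch of one thing "randomized based on the original data's statistics" can mean: draw each new value from a normal distribution with the column's mean and standard deviation. This is an assumption for illustration, not the actual logic of `SyntheticDataGenerator`, and `synthesize_column` is a hypothetical helper.

```python
import random
import statistics


def synthesize_column(values, n, seed=0):
    """Draw n values from a normal distribution fitted to the given column."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [rng.gauss(mean, std) for _ in range(n)]


alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 10.0, 9.5]  # illustrative sample values
print(synthesize_column(alcohol, 3))
```

Sampling each column independently like this drops any relationship between the features and the label, which is one way synthetic rows could hurt training quality.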

## Create FeatureStore Client

Let's first create a feature store client.

We can pass in an existing database name; otherwise, a new database will be created when the feature store is initialized.

In [40]:
DEMO_DB = "FS_DEMO_DB"
session.sql(f"DROP DATABASE IF EXISTS {DEMO_DB}").collect()  # start from scratch
session.sql(f"CREATE DATABASE IF NOT EXISTS {DEMO_DB}").collect()
session.sql(f"CREATE OR REPLACE WAREHOUSE PUBLIC WITH WAREHOUSE_SIZE='XSMALL'").collect()

fs = FeatureStore(
    session=session, 
    database=DEMO_DB, 
    name="AWESOME_FS", 
    default_warehouse="PUBLIC",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

Dumped 10 rows to table wine_data.


## Create and register a new Entity

We will create an Entity called *wine* and register it with the feature store.

You can retrieve the active Entities in the feature store with list_entities() API.

In [None]:
entity = Entity(name="wine", join_keys=["wine_id"])
fs.register_entity(entity)
fs.list_entities().to_pandas()

## Load source data and do some simple feature engineering

Next, we load the source table and do some simple feature engineering.

Here we apply just two simple data manipulations (more complex ones are carried out the same way):
1. Assign a wine_id column to the source
2. Derive a new column by multiplying two existing feature columns

In [None]:
source_df = session.table(f"{SOURCE_DB}.{SOURCE_SCHEMA}.wine_data")
source_df.to_pandas()

In [None]:
def addIdColumn(df, id_column_name):
    # Add id column to dataframe
    columns = df.columns
    new_df = df.withColumn(id_column_name, F.monotonically_increasing_id())
    return new_df[[id_column_name] + columns]

def generate_new_feature(df):
    # Derive a new feature column
    return df.withColumn("my_new_feature", df["FIXED_ACIDITY"] * df["CITRIC_ACID"])

df = addIdColumn(source_df, "wine_id")
feature_df = generate_new_feature(df)
feature_df = feature_df.select(
    [
        'WINE_ID',
        'FIXED_ACIDITY',
        'VOLATILE_ACIDITY',
        'CITRIC_ACID',
        'RESIDUAL_SUGAR',
        'CHLORIDES',
        'FREE_SULFUR_DIOXIDE',
        'TOTAL_SULFUR_DIOXIDE',
        'DENSITY',
        'PH',
        'my_new_feature',
    ]
)
feature_df.to_pandas()
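
For readers without a Snowflake session at hand, the two manipulations can be sketched on plain Python dictionaries. This is a local stand-in only: the helper names mirror the cell above, the sample values are illustrative, and the sequential id here is an assumption that does not reproduce how `F.monotonically_increasing_id()` assigns values.

```python
def add_id_column(rows, id_column_name):
    """Prepend a sequential id to each row (a local stand-in for a generated id)."""
    return [{id_column_name: i, **row} for i, row in enumerate(rows)]


def generate_new_feature(rows):
    """Derive MY_NEW_FEATURE as the product of two existing columns."""
    return [
        {**row, "MY_NEW_FEATURE": row["FIXED_ACIDITY"] * row["CITRIC_ACID"]}
        for row in rows
    ]


# Illustrative sample values.
source = [
    {"FIXED_ACIDITY": 2.0, "CITRIC_ACID": 0.5},
    {"FIXED_ACIDITY": 4.0, "CITRIC_ACID": 0.25},
]
features = generate_new_feature(add_id_column(source, "WINE_ID"))
print(features)
```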

## Create a new FeatureView and materialize the feature pipeline

Once the FeatureView construction is done, we can materialize the FeatureView to the Snowflake backend and incremental maintenance will start.

In [None]:
fv = FeatureView(name="wine_features", entities=[entity], feature_df=feature_df, desc="wine features")
fs.register_feature_view(feature_view=fv, version="v1", refresh_freq="1 minute", block=True)

In [None]:
# Examine the FeatureView content
fs.read_feature_view(fv).to_pandas()

## Explore additional features

Now I have my FeatureView created with a collection of features, but what if I want to explore additional features on top?

Since a materialized FeatureView is immutable (due to the single DDL for the backend dynamic table), we will need to create a new FeatureView for the additional features and then merge them.

In [None]:
extra_feature_df = df.select(
    [
        'WINE_ID',
        'SULPHATES',
        'ALCOHOL',
    ]
)

new_fv = FeatureView(name="extra_wine_features", entities=[entity], feature_df=extra_feature_df, desc="extra wine features")
fs.register_feature_view(feature_view=new_fv, version="v1", refresh_freq="1 minute", block=True)

In [None]:
# We can easily retrieve all FeatureViews for a given Entity.
fs.list_feature_views(entity_name="wine").to_pandas()

## Create new feature view with combined feature results [Optional]

Now that we have two FeatureViews ready, we can create a new one by merging them (it works like a join, and we provide a handy function for it). The new FeatureView won't incur the cost of feature pipelines, only the cost of the table join.

We can also keep working with the two separate FeatureViews, since most of our APIs support multiple FeatureViews. Merging just makes the features better organized and easier to share.

In [None]:
full_fv = fs.merge_features(features=[fv, new_fv], name="full_wine_features")
fs.register_feature_view(feature_view=full_fv, version="v1")
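
Conceptually, the merge combines rows from both FeatureViews that share the same join key. The sketch below shows that idea on plain Python dictionaries. It assumes an inner join for illustration; the join type that `merge_features` actually uses is not stated here. `merge_feature_rows` is a hypothetical helper and the sample values are illustrative.

```python
def merge_feature_rows(left, right, key):
    """Inner-join two lists of feature rows on a shared join key."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Columns from both sides end up in one combined row.
            merged.append({**row, **match})
    return merged


wine_features = [{"WINE_ID": 0, "PH": 3.5}, {"WINE_ID": 1, "PH": 3.2}]
extra_features = [{"WINE_ID": 0, "ALCOHOL": 9.4}, {"WINE_ID": 1, "ALCOHOL": 9.8}]
print(merge_feature_rows(wine_features, extra_features, "WINE_ID"))
```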

## Generate Training Data

After our feature pipelines are fully set up, we can start using them to generate training data and, later, to run model prediction.

Generating training data is easy because materialized FeatureViews already carry most of the metadata, such as join keys and the timestamp for point-in-time lookup. We just need to provide the spine data (it's called a spine because we are essentially enriching the data by joining features with it).

In [None]:
spine_df = session.table(f"{SOURCE_DB}.{SOURCE_SCHEMA}.wine_data")
spine_df = addIdColumn(spine_df, "wine_id")
spine_df = spine_df.select("wine_id", "quality")
spine_df.to_pandas()

In [None]:
session.sql(f"DROP TABLE IF EXISTS {DEMO_DB}.AWESOME_FS.wine_training_data_table").collect()
training_data = fs.generate_dataset(
    spine_df=spine_df, 
    features=[full_fv], 
    materialized_table="wine_training_data_table", 
    spine_timestamp_col=None, 
    spine_label_cols=["quality"],
    save_mode="merge",
)

training_pd = training_data.df.to_pandas()
training_pd
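
The cell above passes `spine_timestamp_col=None`, so no point-in-time lookup takes place for these static features. For timestamped features, the idea is to pick, for each spine row, the latest feature value recorded at or before the spine timestamp. The sketch below illustrates that concept with the standard library. It is not how the feature store implements it, `point_in_time_lookup` is a hypothetical helper, and the sample values are illustrative.

```python
import bisect


def point_in_time_lookup(feature_history, spine_ts):
    """Return the latest feature value recorded at or before spine_ts.

    feature_history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    # bisect_right gives the count of entries with timestamp <= spine_ts.
    index = bisect.bisect_right(timestamps, spine_ts)
    if index == 0:
        return None  # no feature value existed yet at spine_ts
    return feature_history[index - 1][1]


history = [(10, 3.1), (20, 3.3), (30, 3.6)]  # illustrative (timestamp, pH) pairs
print(point_in_time_lookup(history, 25))
print(point_in_time_lookup(history, 5))
```

Looking only at values at or before the spine timestamp keeps future information out of the training rows.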

## Train a model

Now let's train a simple random forest model and evaluate the prediction accuracy.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

X = training_pd.drop("QUALITY", axis=1)
y = training_pd["QUALITY"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

In [None]:
def train_model(X_train, X_test, y_train, y_test):
    # Fit the model, then print MSE and an accuracy figure on the test split
    rf = RandomForestRegressor(max_depth=3, n_estimators=20, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"MSE: {mse}, Accuracy: {round(100*(1-np.mean(np.abs((y_test - y_pred) / np.abs(y_test)))))}")
    return rf
        
rf = train_model(X_train, X_test, y_train, y_test)
print(rf)
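
The "Accuracy" printed by `train_model` is 100 × (1 − mean absolute percentage error), a regression error summary rather than a classification accuracy. The sketch below recomputes both printed metrics in plain Python on a tiny illustrative sample so the formula is easy to check by hand. `mse` and `mape_accuracy` are hypothetical helper names.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape_accuracy(y_true, y_pred):
    """100 * (1 - mean absolute percentage error), rounded to an integer."""
    mape = sum(abs((t - p) / abs(t)) for t, p in zip(y_true, y_pred)) / len(y_true)
    return round(100 * (1 - mape))


y_true = [5, 6, 4]        # illustrative quality labels
y_pred = [5.5, 6.0, 5.0]  # illustrative predictions

# Squared errors: 0.25, 0, 1. Percentage errors: 0.1, 0, 0.25.
print(mse(y_true, y_pred))
print(mape_accuracy(y_true, y_pred))
```

Because each error is divided by the true value, this figure is undefined when a target is zero. The wine quality labels are positive, so that does not arise here.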

## Log model with Model Registry

We can log the model, along with its training dataset metadata, with the model registry.

In [None]:
from snowflake.ml.registry import model_registry
from tests.integ.snowflake.ml.test_utils import (
    test_env_utils,
)

registry = model_registry.ModelRegistry(session=session, database_name="my_cool_registry", create_if_not_exists=True)

In [None]:
model_ref = registry.log_model(
    model_name="my_random_forest_regressor",
    model_version="v1",
    model=rf,
    tags={"author": "my_rf_with_training_data"},
    conda_dependencies=[
        test_env_utils.get_latest_package_versions_in_server(session, "snowflake-snowpark-python")
    ],
    dataset=training_data,
    options={"embed_local_ml_library": True},
)

## Restore model and predict with latest features

We first retrieve the training dataset from the registry and use it to construct a dataframe of the latest feature values. We then restore the model from the registry and predict with those latest feature values.

In [None]:
registered_training_data = registry.get_dataset(
    model_name="my_random_forest_regressor", 
    model_version="v1",
)

test_pdf = training_pd.sample(3, random_state=996)[['WINE_ID']]
test_df = session.create_dataframe(test_pdf)

latest_features = fs.retrieve_feature_values(test_df, registered_training_data.load_features())
latest_features_pdf = latest_features.to_pandas()
print(latest_features_pdf)

In [None]:
model_ref = model_registry.ModelReference(registry=registry, model_name="my_random_forest_regressor", model_version="v1")
restored_model = model_ref.load_model()  # type: ignore[attr-defined]
restored_prediction = restored_model.predict(latest_features_pdf)

print(restored_prediction)