### Snowflake Feature Store

The Snowflake Feature Store lets data scientists and ML engineers create, maintain, and use ML features in data science and ML workloads, all within Snowflake.

Generically, features are data elements used as inputs to a machine learning model. Many columns in a dataset, such as temperature or attendance, can be used as features as-is. In other cases, a column can be made more useful for training via preprocessing and transformation. For example, you might derive a day-of-week feature from a timestamp to allow the model to detect weekly patterns. Other common feature transformations involve aggregating, differentiating, or time-shifting data. Feature engineering is the process of deciding what features are needed by your models and defining how they will be derived from the raw data.
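As a minimal sketch of the day-of-week transformation mentioned above, the following uses only the Python standard library. The row layout and the `day_of_week_feature` helper are hypothetical and are not part of the Snowflake Feature Store API.

```python
from datetime import datetime


def day_of_week_feature(timestamp: str) -> int:
    """Return the ISO weekday (1 = Monday ... 7 = Sunday) of an ISO-8601 timestamp."""
    return datetime.fromisoformat(timestamp).isoweekday()


# Hypothetical raw rows; the column names are illustrative only.
rows = [
    {"timestamp": "2024-01-01T09:30:00", "attendance": 120},  # a Monday
    {"timestamp": "2024-01-06T14:00:00", "attendance": 310},  # a Saturday
]

# Derive the new feature while keeping the raw columns as they are.
features = [
    {**row, "day_of_week": day_of_week_feature(row["timestamp"])}
    for row in rows
]
```

The raw `attendance` column is used as-is, while `day_of_week` is a derived feature that a model could use to pick up weekly patterns.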

A feature store lets you standardize commonly used feature transformations in a central repository, enabling reuse, helping to reduce duplication of data and effort, and improving productivity. It also helps maintain features by updating them on new source data, always providing correct, consistent, and fresh features in a single source of truth. By cultivating consistency in how features are extracted from raw data, a feature store can also help to make your production ML pipelines more robust.

The Snowflake Feature Store is designed to make creating, storing, and managing features for data science and machine learning workloads easier and more efficient. Hosted natively inside Snowflake, the Snowflake Feature Store provides the following advantages:

* Your data remains secure, completely under your control and governance, and never leaves Snowflake.
* The Snowsight Feature Store UI makes it easy to search for and discover features.
* Access is managed with fine-grained role-based access control.

Key benefits of the Snowflake Feature Store include support for:

* Both batch and streaming data, with efficient automatic updates as new data arrives
* Backfill and point-in-time correct features with ASOF JOIN
* Feature transformations authored in Python or SQL
* Automatic update and refresh of feature values from source data with Snowflake managed Feature Views
* Ability to use user-managed feature pipelines with external tools such as dbt

The Snowflake Feature Store is fully integrated with the Snowflake Model Registry and other Snowflake ML features for end-to-end production ML.

In [None]:
# Let's set up a few things first
# Make sure the following packages are added to the notebook:
# altair
# matplotlib
# numpy
# seaborn
# snowflake-ml-python

# Standard library imports
import os
import time
import math

# Third-party library imports
import pandas as pd
import numpy as np



# Visualization library imports
import streamlit as st
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns

# Snowflake library imports
from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    CreationMode,
)

from snowflake.ml import dataset
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session.query_tag = {"origin":"sf_sit-is", 
                     "name":"credit_card_fraud", 
                     "version":{"major":1, "minor":0},
                     "attributes":{"is_quickstart":0, "source":"notebook"}}     

# Set the style for the plots
sns.set(style="whitegrid")

# Custom color palettes
colors = {
    'Non-Fraud Bars': '#4C72B0',
    'Fraud Bars': '#55A868',
    'Non-Fraud Line': '#1f77b4',
    'Fraud Line': '#ff7f0e'
}

In [None]:
use warehouse CC_FINS_WH;
use database CC_FINS_DB;
use schema analytics;

### Generating Datasets for Training
 
We are now ready to generate our training set. We'll define a spine DataFrame to form the backbone of our generated dataset and pass it into FeatureStore.generate_dataset() along with our Feature Views.

NOTE: The spine serves as a request template and specifies the entities, labels and timestamps (when applicable). The feature store then attaches feature values along the spine using an AS-OF join to efficiently combine and serve the relevant, point-in-time correct feature data.
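To make the idea of a point-in-time correct lookup more concrete, here is a small pure-Python sketch of an as-of lookup along a spine. It is only an illustration, not how the feature store implements the join. The data, the `asof_lookup` helper, and the "latest value at or before the spine timestamp" matching rule are all assumptions made for the example.

```python
from bisect import bisect_right


def asof_lookup(history, ts):
    """Return the latest feature value whose timestamp is <= ts, or None."""
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


# Hypothetical feature history per entity: (timestamp, value), sorted by timestamp.
feature_history = {
    "u1": [(1, 10.0), (5, 12.5), (9, 40.0)],
    "u2": [(3, 7.0)],
}

# Hypothetical spine: (entity key, event timestamp, label).
spine = [("u1", 6, 0), ("u1", 9, 1), ("u2", 2, 0)]

# Attach to each spine row the feature value that was known at that time.
training_rows = [
    (user, ts, label, asof_lookup(feature_history.get(user, []), ts))
    for user, ts, label in spine
]
```

The last spine row gets no feature value because no value existed for `u2` at timestamp 2. Using only values known at the event time is what keeps future information from leaking into the training set.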

In [None]:
create or replace TABLE TRANSACTIONS_DATA (USER_ID VARCHAR,TRANSACTION_ID VARCHAR(16777216),IS_FRAUD VARCHAR);

insert into TRANSACTIONS_DATA(User_ID, Transaction_ID, IS_FRAUD) SELECT distinct User_ID, Transaction_ID, IS_FRAUD FROM CREDITCARD_TRANSACTIONS;

select * from CREDITCARD_TRANSACTIONS limit 10;

In [None]:
# Load the full transactions table into pandas for plotting
full_df = session.sql("SELECT * FROM CREDITCARD_TRANSACTIONS")
dataset = full_df.to_pandas()

TRANSACTIONS_DATA_df = session.table("TRANSACTIONS_DATA")

# Count the distinct transactions in each fraud category
df = (
    TRANSACTIONS_DATA_df
    .select(F.col("TRANSACTION_ID"), F.col("IS_FRAUD"))
    .groupBy(F.col("IS_FRAUD"))
    .agg(F.count_distinct(F.col("TRANSACTION_ID")).alias("TOTAL_FRAUD"))
)

# Visualization of the fraud and normal data using a bar chart displayed in Streamlit. Shows the total number of distinct transactions for each fraud category.
st.bar_chart(df,x="IS_FRAUD",y="TOTAL_FRAUD")

In [None]:
# Create a histogram that shows the distribution of transaction amounts, distinguishing between fraudulent and non-fraudulent transactions. 

 
dataset['IS_FRAUD'] = dataset['IS_FRAUD'].astype(int)
# Set the style for the plots
sns.set(style="whitegrid")
# Background color
background_color = "#f0f0f0"  # Light gray
# 1. Distribution of Transaction Amounts
plt.figure(figsize=(4,4))
sns.histplot(data=dataset, x='TRANSACTION_AMOUNT', hue='IS_FRAUD', kde=True, bins=50)
plt.title('Distribution of Transaction Amounts')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.legend(title='Transaction', loc='upper right', labels=['Normal', 'Fraud'])
plt.show()

In [None]:
# Create a histogram that shows the distribution of clicks, distinguishing between fraudulent and non-fraudulent transactions. 
#CLICKS, LOGIN_PER_HOUR, and PAGES_VISITED Distributions

sns.set(style="whitegrid")

# Custom color palettes
colors = {
    'Normal Bars': '#4C72B0',
    'Fraud Bars': '#55A868',
    'Normal Line': '#1f77b4',
    'Fraud Line': '#ff7f0e'
}
# 4. CLICKS Distribution
plt.figure(figsize=(4, 4))
sns.histplot(data=dataset, x='CLICKS', hue='IS_FRAUD', multiple='dodge', kde=True, bins=30)
plt.title('Clicks Distribution')
plt.xlabel('Clicks')
plt.ylabel('Frequency')
plt.legend(title='Transaction', loc='upper right', labels=['Normal', 'Fraud'])
plt.show()

In [None]:
# Create a histogram that shows the distribution of logins, distinguishing between fraudulent and non-fraudulent transactions.  

plt.figure(figsize=(4, 4))
sns.histplot(data=dataset, x='LOGIN_PER_HOUR', hue='IS_FRAUD', multiple='dodge', kde=True, bins=30)
plt.title('Login Per Hour Distribution')
plt.xlabel('Login Per Hour')
plt.ylabel('Frequency')
plt.legend(title='Is Fraud', loc='upper right', labels=['Non-Fraud', 'Fraud'])
plt.show()

In [None]:
# Create a histogram that shows the distribution of time elapsed online, distinguishing between fraudulent and non-fraudulent transactions.  

plt.figure(figsize=(4,4))

sns.histplot(data=dataset, x='TIME_ELAPSED', hue='IS_FRAUD', kde=True, bins=50)
plt.title('Time Elapsed Distribution')
plt.xlabel('Time Elapsed (seconds)')
plt.ylabel('Frequency')
plt.legend(title='Is Fraud', loc='upper right', labels=['Non-Fraud', 'Fraud'])
plt.show()

In [None]:
# Create a scatter plot that shows the geographical distribution of transactions, distinguishing between fraudulent and non-fraudulent transactions.
# Define location coordinates
location_coords = {
    'New York': (40.7128, -74.0060),
    'Los Angeles': (34.0522, -118.2437),
    'Chicago': (41.8781, -87.6298),
    'Houston': (29.7604, -95.3698),
    'Phoenix': (33.4484, -112.0740),
    'Philadelphia': (39.9526, -75.1652),
    'San Antonio': (29.4241, -98.4936),
    'San Diego': (32.7157, -117.1611),
    'Dallas': (32.7767, -96.7970),
    'San Jose': (37.3382, -121.8863),
    'Moscow': (55.7558, 37.6176)  # Add Moscow coordinates
}

# Add latitude and longitude based on location
dataset['LATITUDE'] = dataset['LOCATION'].map(lambda loc: location_coords.get(loc, (None, None))[0])
dataset['LONGITUDE'] = dataset['LOCATION'].map(lambda loc: location_coords.get(loc, (None, None))[1])

# Set up the figure
plt.figure(figsize=(6, 6))

# Plot all locations
scatter = plt.scatter(dataset['LONGITUDE'], dataset['LATITUDE'], 
                      c=dataset['IS_FRAUD'].map({0: 'purple', 1: 'red'}),
                      alpha=0.5)

# Create custom legend
purple_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='purple', markersize=10, label='Normal')
red_patch = plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Fraud')

# Plot details
plt.title('Geographical Distribution of Transactions')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

# Set legend with custom handles
plt.legend(handles=[purple_patch, red_patch], title='Transaction Type', loc='upper left', bbox_to_anchor=(1, 1), frameon=True, fontsize='small')

plt.grid(True)

# Set background color for the plot
plt.gcf().set_facecolor("#f0f0f0")  # Light gray
plt.show()

### Feature Store

The feature store contains feature views for customers and transactions. Model features will be accessed from the feature store.

In [None]:
# Access feature views

fs = FeatureStore(
    session=session,
    database="CC_FINS_DB",
    name="ANALYTICS",
    default_warehouse="CC_FINS_WH",
    creation_mode=CreationMode.FAIL_IF_NOT_EXIST
)

customer_fv : FeatureView = fs.get_feature_view(
    name='Customer_Features',
    version='V1'
)
print(customer_fv)

trans_fv : FeatureView = fs.get_feature_view(
    name='Trans_Features',
    version='V1'
)
print(trans_fv)

In Snowsight, we should now be able to see the views that were created (for example, CC_FINS_DB.ANALYTICS.CUSTOMER_FEATURES$V1). The feature views are also listed under the AI & ML option in the left menu, in the Features section.


In [None]:
# Generate a dataset with the feature store's generate_dataset method, which enriches
# a Snowpark DataFrame (the spine) that contains the source data with the derived feature values.
# Get the transactions dataset and attach features from the feature store
def create_dataset(spine_df, name):
    train_dataset = fs.generate_dataset(
        name=name,
        spine_df=spine_df,
        features=[customer_fv, trans_fv],
    )
    return train_dataset.read.to_snowpark_dataframe()

# Split into train/validation sets (80/20)
datasets = TRANSACTIONS_DATA_df.random_split([0.8, 0.2])

# Build training tables
train_df = create_dataset(datasets[0], "train")
val_df = create_dataset(datasets[1], "validation")

# View the training dataset. This contains the columns except for the IDs. The label (IS_FRAUD) is included here because it is specified as the target column during model training.
train_df.show()

In [None]:
# Create separate views for training and validation to be used with a Binary Classifier. 
# Columns in the inference data that were not present in the training dataset are ignored. 

train_df.write.mode("overwrite").save_as_table("training_fd_table")

session.sql("CREATE OR REPLACE VIEW fraud_classification_training_view AS SELECT IS_FRAUD,LATITUDE,LONGITUDE,LOCATION,TOTAL_TRANSACTIONS,STDDEV_TRANSACTION_AMOUNT,NUM_UNIQUE_MERCHANTS, MEAN_WEEKLY_SPENT,MEAN_MONTHLY_SPENT,MEAN_YEARLY_SPENT,TIME_ELAPSED,CLICKS,CUMULATIVE_CLICKS,CUMULATIVE_LOGINS_PER_HOUR FROM training_fd_table").collect()

# Save the validation set; the IS_FRAUD label is excluded in the view created below.
val_df.write.mode("overwrite").save_as_table("val_fd_table")

session.sql("CREATE OR REPLACE VIEW fraud_classification_val_view AS SELECT * EXCLUDE IS_FRAUD FROM val_fd_table").collect()

In [None]:
SELECT * FROM fraud_classification_val_view LIMIT 2;

### Build the model
We can create the classification model by running the following statement:

In [None]:
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION fraud_classification_model(
    INPUT_DATA => SYSTEM$REFERENCE('view', 'fraud_classification_training_view'),
    TARGET_COLNAME => 'IS_FRAUD'
);

-- To view all classification models, use the SHOW command.
SHOW SNOWFLAKE.ML.CLASSIFICATION;

In [None]:
-- Add a table to use for the Streamlit App that will be used for ongoing Predictions 

CREATE or replace table CC_APP_TBL AS SELECT * FROM CREDITCARD_TRANSACTIONS WHERE TRANSACTION_ID NOT IN (SELECT DISTINCT TRANSACTION_ID FROM training_fd_table);
alter table CC_APP_TBL drop column IS_FRAUD;

In [None]:
-- To run inference (prediction) on a dataset, use the model's PREDICT method.

CREATE OR REPLACE TABLE fraud_predictions AS
SELECT *,fraud_classification_model!PREDICT(INPUT_DATA => object_construct(*)) as predictions
from fraud_classification_val_view;

### View the predictions.

The prediction object includes predicted probabilities for each class and the predicted class based on the maximum predicted probability. 

The predictions are returned in the same order as the original features were provided.
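As a rough illustration of the structure described above, the sketch below parses one prediction object in Python. The payload is hypothetical. It assumes a `class` key and a `probability` map keyed by class, which mirrors the fields accessed by the SQL later in this section.

```python
import json

# Hypothetical prediction payload; the values are made up for illustration.
raw = '{"class": "1", "probability": {"0": 0.08, "1": 0.92}}'

prediction = json.loads(raw)
probabilities = prediction["probability"]

# The predicted class is the one with the maximum predicted probability.
predicted_class = max(probabilities, key=probabilities.get)

# Probability of the class that was actually predicted.
class_probability = probabilities[prediction["class"]]
```

Here `predicted_class` agrees with the `class` field, which is what we expect if the class is chosen by maximum probability.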

In [None]:
 SELECT * FROM fraud_predictions;

In [None]:
-- In the result set, the model produces both a predicted class (denoted by class) and the probability of membership in each class.
-- We often want to parse out the prediction or the probabilities directly and store each in its own column.

select * EXCLUDE PREDICTIONS,
        predictions:class::STRING AS class,
        predictions['probability'][class] as probability
from fraud_predictions;

-- We can see from our data that fraudulent transactions are very easy to classify.

Now that we have built our classifier, we can begin to evaluate it to better understand both its performance and the primary factors within the dataset that drive its predictions. Follow along below to see the commands you can run to evaluate your own classifier:

### Confusion Matrix & Model Accuracy
One of the most common ways of evaluating a classifier is to create a confusion matrix, which lets us visualize the types of errors the model is making. Confusion matrices are typically used to calculate a classifier's precision and recall: precision describes how often the model is correct when it predicts a class of interest, and recall describes how many of the actual instances of that class the model correctly identified.

SHOW_CONFUSION_MATRIX returns a table containing the number of instances of each combination of actual class and predicted class, for models where evaluation was enabled at instantiation. You can use this dataset to plot a confusion matrix. This method takes no arguments. The returned table has the following columns:

* dataset_type: The name of the dataset used for metrics calculation, currently EVAL.
* actual_class: The actual class.
* predicted_class: The predicted class.
* count: The number of instances of the given combination of actual and predicted class.
* logs: Contains error or warning messages.
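To show how precision and recall follow from these counts, here is a short sketch that works on rows in the (actual_class, predicted_class, count) shape. The counts are invented for illustration and are not results from this model. `precision_recall` is a hypothetical helper.

```python
# Hypothetical confusion-matrix rows: (actual_class, predicted_class, count).
confusion_rows = [
    ("0", "0", 90),
    ("0", "1", 10),
    ("1", "0", 5),
    ("1", "1", 45),
]


def precision_recall(rows, positive):
    """Compute precision and recall for one class of interest."""
    tp = sum(c for a, p, c in rows if a == positive and p == positive)
    fp = sum(c for a, p, c in rows if a != positive and p == positive)
    fn = sum(c for a, p, c in rows if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


precision, recall = precision_recall(confusion_rows, positive="1")
```

With these made-up counts, precision for class "1" is 45 / 55 and recall is 45 / 50.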

In [None]:
 CALL fraud_classification_model!SHOW_CONFUSION_MATRIX(); 

### Evaluation Metrics

SHOW_EVALUATION_METRICS returns per-class evaluation metrics such as precision, recall, and F1 score, which are derived from the counts of true positives, false positives, true negatives, and false negatives.

To get the evaluation metrics for your model, call the <model_name>!SHOW_EVALUATION_METRICS method. By default, the classification function evaluates the model it trains on withheld data: a portion of the training data you provide is held out, the model predicts the target on that portion, and the function compares those predictions to the actual values.

If you don't need these evaluation metrics, you can set evaluate to FALSE when you create the model. The parameters that control how the evaluation is run are described in the Snowflake documentation for the classification function.

In [None]:
 CALL fraud_classification_model!SHOW_EVALUATION_METRICS();

### Threshold Metrics

SHOW_THRESHOLD_METRICS provides raw counts and metrics for a specific threshold for each class. This can be used to plot ROC and PR curves or to tune the threshold if desired. For each class, every sample is assigned a predicted probability, and the threshold varies from 0 to 1.

The sample is classified as belonging to a class if the predicted probability of being in that class exceeds the specified threshold. The true and false positives and negatives are computed considering the negative class as every instance that does not belong to the class being considered. The following metrics are then computed.

True positive rate (TPR): The proportion of actual positive instances that the model correctly identifies (equivalent to Recall).

False positive rate (FPR): The proportion of actual negative instances that were incorrectly predicted as positive.

Accuracy: The ratio of correct predictions (both true positives and true negatives) to the total number of predictions, an overall measure of how well the model is performing. This metric can be misleading in unbalanced cases.

Support: The number of actual occurrences of a class in the specified dataset. Higher support values indicate a larger representation of a class in the dataset. Support is not itself a metric of the model but a characteristic of the dataset.
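The definitions above can be sketched in a few lines of Python. This is only an illustration of how the metrics relate to a threshold, not the function's actual implementation. The sample data and the `threshold_metrics` helper are hypothetical.

```python
def threshold_metrics(samples, positive, threshold):
    """Compute TPR, FPR, accuracy and support for one class at one threshold.

    samples is a list of (actual_class, predicted_probability_of_positive).
    """
    tp = fp = tn = fn = 0
    for actual, prob in samples:
        # A sample is assigned to the class if its probability exceeds the threshold.
        predicted_positive = prob > threshold
        if actual == positive:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            # Every instance that is not in the class counts as negative.
            fp += predicted_positive
            tn += not predicted_positive
    return {
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(samples),
        "support": tp + fn,
    }


# Hypothetical (actual class, probability of class "1") pairs.
samples = [("1", 0.9), ("1", 0.4), ("0", 0.6), ("0", 0.2), ("0", 0.1)]

# Sweeping the threshold gives the (FPR, TPR) points of a ROC curve.
roc_points = [(t, threshold_metrics(samples, "1", t)) for t in (0.0, 0.5, 1.0)]
```

Lowering the threshold raises both TPR and FPR, which is the trade-off that threshold tuning explores.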

In [None]:
CALL fraud_classification_model!SHOW_THRESHOLD_METRICS();

### Feature Importances
The last thing we want to do when evaluating the classifier is get a sense of the importance of each of the input columns, or features, that we used. Knowing this helps us to:

* Better understand what's driving a model's predictions, which gives us more insight into the business process we are trying to model.
* Engineer new features or remove ones that have little impact, to improve the model's performance.

The ML Classification function provides a method to do just this. It returns a ranked list of the relative importance of all the input features, such that each value is between 0 and 1 and the importances across all features sum to 1.
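To illustrate what "relative importance that sums to 1" means, the sketch below normalizes a set of raw scores and ranks them. The raw scores are invented. How the classification function computes its underlying scores is not described here, so this shows only the normalization and ranking step.

```python
# Hypothetical raw importance scores per feature (not real model output).
raw_scores = {"TRANSACTION_AMOUNT": 6.0, "CLICKS": 3.0, "TIME_ELAPSED": 1.0}

# Normalize so each value lies between 0 and 1 and all values sum to 1.
total = sum(raw_scores.values())
importances = {name: score / total for name, score in raw_scores.items()}

# Rank the features from most to least important.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
```

Because the values are relative, an importance of 0.6 means the feature accounts for 60% of the total importance across the inputs, not that it is 60% predictive on its own.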

In [None]:
CALL fraud_classification_model!SHOW_FEATURE_IMPORTANCE();

We're done with this part of the tutorial. Depending on the time available, we will now take a look at a Streamlit application built to support fraud detection.