# Feature Store: Manage Features and Generate Training Data

This notebook demonstrates the `snowflakeR` package interface to the Snowflake Feature Store.
You'll learn how to define entities, create feature views, generate training data with
point-in-time correct joins, and tie it all together with the Model Registry.

**Sections:**
1. [Setup & Feature Store Context](#section-1-setup)
2. [Entities](#section-2-entities)
3. [Feature Views](#section-3-feature-views)
4. [Training Data Generation](#section-4-training-data)
5. [Retrieve Features for Inference](#section-5-inference)
6. [End-to-End: Feature Store + Model Registry](#section-6-end-to-end)

---

# Section 1: Setup & Feature Store Context

## Workspace Notebook setup

In [None]:
# Workspace Notebook: configure rpy2 (skip if already done or running locally)
import sys
sys.path.insert(0, '..')

from r_helpers import setup_r_environment
result = setup_r_environment()
print(f"R ready: {result['success']}")

In [None]:
%%R
library(snowflakeR)

conn <- sfr_connect()

# Create a Feature Store context targeting a specific schema
# `create = TRUE` creates the schema and required tags if they don't exist
fs <- sfr_feature_store(
  conn,
  database  = conn$database,   # or specify explicitly: "ML_DB"
  schema    = conn$schema,     # or specify explicitly: "FEATURES"
  warehouse = conn$warehouse,
  create    = TRUE
)

fs

### What is an `sfr_feature_store` object?

It holds the connection and the target database/schema/warehouse for all Feature Store
operations. Pass it as the first argument to the Feature Store functions used below, such as
`sfr_create_entity()`, `sfr_create_feature_view()`, and `sfr_generate_training_data()`.

---

# Section 2: Entities

**Entities** define join keys that link features to business objects (customers, products, etc.).
They are the foundation for Feature Views.
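
To make the role of a join key concrete, here is a minimal base R sketch that runs without a Snowflake connection. The two data frames are hypothetical local stand-ins for a feature table and a label table, and `merge()` is used only to illustrate the idea; it is not how `snowflakeR` performs the join.

```r
# Hypothetical local stand-ins for a feature table and a label table.
features <- data.frame(
  CUSTOMER_ID = c(1, 2, 3),
  ORDER_COUNT = c(3, 2, 3)
)
labels <- data.frame(
  CUSTOMER_ID = c(1, 2, 3),
  CHURNED     = c(0, 1, 0)
)

# The entity's join key is the column that both tables share.
join_keys <- "CUSTOMER_ID"

# Joining on the key attaches each customer's features to its label.
joined <- merge(labels, features, by = join_keys)
print(joined)
```

An entity records this shared key once, so every Feature View that references the entity agrees on how its rows map to a customer.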

## Create sample data

First, let's create some sample tables to work with.

In [None]:
%%R
# Create sample order data
sfr_execute(conn, "
  CREATE OR REPLACE TABLE SFR_DEMO_ORDERS (
    customer_id INT,
    order_date  DATE,
    order_total DOUBLE
  )
")

sfr_execute(conn, "
  INSERT INTO SFR_DEMO_ORDERS VALUES
    (1, '2025-01-15', 45.50),
    (1, '2025-02-20', 82.30),
    (1, '2025-03-10', 15.00),
    (2, '2025-01-22', 120.00),
    (2, '2025-03-05', 55.75),
    (3, '2025-02-01', 200.00),
    (3, '2025-02-15', 30.25),
    (3, '2025-03-20', 95.50)
")

# Create sample label data
sfr_execute(conn, "
  CREATE OR REPLACE TABLE SFR_DEMO_LABELS (
    customer_id INT,
    churned     INT
  )
")

sfr_execute(conn, "
  INSERT INTO SFR_DEMO_LABELS VALUES (1, 0), (2, 1), (3, 0)
")

cat("Sample tables created.\n")

## Create and manage entities

In [None]:
%%R
# Create a customer entity
customer <- sfr_create_entity(
  fs,
  name      = "SFR_DEMO_CUSTOMER",
  join_keys = "CUSTOMER_ID",
  desc      = "Demo customer entity"
)

customer

In [None]:
%%R
# List all entities
sfr_list_entities(fs)

In [None]:
%%R
# Get a specific entity
customer <- sfr_get_entity(fs, "SFR_DEMO_CUSTOMER")
customer

In [None]:
%%R
# Update the description
sfr_update_entity(fs, "SFR_DEMO_CUSTOMER", desc = "Primary customer entity for demo")

---

# Section 3: Feature Views

**Feature Views** define the SQL transformation that produces features.
They can be:
- **Managed:** Automatically refreshed as a dynamic table (`refresh_freq` specified)
- **External:** Manually maintained (no `refresh_freq`)

## Create a Feature View from SQL

In [None]:
%%R
# One-step creation: SQL-based Feature View
fv <- sfr_create_feature_view(
  fs,
  name     = "SFR_DEMO_CUST_FEATURES",
  version  = "v1",
  entities = customer,
  features = "
    SELECT
      customer_id,
      AVG(order_total)   AS avg_order_total,
      COUNT(*)           AS order_count,
      SUM(order_total)   AS total_spend,
      MAX(order_date)    AS last_order_date
    FROM SFR_DEMO_ORDERS
    GROUP BY customer_id
  ",
  desc = "Customer aggregate features from orders"
)

fv

### Alternative: Two-step (draft then register)

This mirrors the Python API and is useful when you want to inspect the draft:

```r
%%R
# Step 1: Create a local draft
fv_draft <- sfr_feature_view(
  name     = "MY_FEATURES",
  entities = customer,
  features = "SELECT ... FROM ...",
  refresh_freq = "1 hour"
)

# Step 2: Register (materialise)
fv <- sfr_register_feature_view(fs, fv_draft, version = "v1")
```

### Alternative: dbplyr-based features

```r
%%R
library(dplyr); library(dbplyr)

orders_tbl <- tbl(conn, "SFR_DEMO_ORDERS")
features_query <- orders_tbl |>
  group_by(customer_id) |>
  summarise(avg_total = mean(order_total), order_count = n())

fv <- sfr_create_feature_view(
  fs, "CUST_FV_DBPLYR", "v1",
  entities = customer,
  features = features_query   # dbplyr lazy table -> SQL
)
```
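
To sanity-check what the `SFR_DEMO_CUST_FEATURES` query should return, the same aggregation can be reproduced locally in base R. This is a sketch over a local copy of the `SFR_DEMO_ORDERS` rows inserted earlier; `orders` and `local_features` are hypothetical names and nothing here calls the `snowflakeR` API.

```r
# Local copy of the rows inserted into SFR_DEMO_ORDERS.
orders <- data.frame(
  customer_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  order_date  = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10",
                          "2025-01-22", "2025-03-05",
                          "2025-02-01", "2025-02-15", "2025-03-20")),
  order_total = c(45.50, 82.30, 15.00, 120.00, 55.75, 200.00, 30.25, 95.50)
)

# One output row per customer, mirroring GROUP BY customer_id.
by_customer <- split(orders, orders$customer_id)
local_features <- do.call(rbind, lapply(by_customer, function(d) {
  data.frame(
    customer_id     = d$customer_id[1],
    avg_order_total = mean(d$order_total),
    order_count     = nrow(d),
    total_spend     = sum(d$order_total),
    last_order_date = max(d$order_date)
  )
}))
rownames(local_features) <- NULL
print(local_features)
```

Comparing this frame with the output of `sfr_read_feature_view()` is a quick way to confirm that the Feature View holds one row per customer with the expected aggregates.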

## Manage Feature Views

In [None]:
%%R
# List all Feature Views
sfr_list_feature_views(fs)

In [None]:
%%R
# Get a specific version
fv <- sfr_get_feature_view(fs, "SFR_DEMO_CUST_FEATURES", "v1")
fv

In [None]:
%%R
# Read feature data directly
feature_data <- sfr_read_feature_view(fs, "SFR_DEMO_CUST_FEATURES", "v1")
feature_data

### Refresh management (for managed Feature Views)

```r
%%R
# Manually trigger a refresh
sfr_refresh_feature_view(fs, "MY_FV", "v1")

# Check refresh history
sfr_get_refresh_history(fs, "MY_FV", "v1")

# Pause/resume automatic refresh
sfr_suspend_feature_view(fs, "MY_FV", "v1")
sfr_resume_feature_view(fs, "MY_FV", "v1")
```

---

# Section 4: Training Data Generation

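Join **spine** (label) data with Feature Views using point-in-time correct joins.
When the spine carries a timestamp column, each label row is matched with the feature values
as of that timestamp, which prevents leakage of information from after the label was observed.
The demo spine below has no timestamp column, so features are joined on `customer_id` alone.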
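
The idea behind a point-in-time join can be sketched in base R. The example below assumes a hypothetical spine with a `label_ts` column and a hypothetical feature history with a `feature_ts` column; neither exists in the demo tables, and the loop illustrates the concept rather than the Feature Store's actual implementation.

```r
# Hypothetical spine: one label per customer, observed at label_ts.
spine <- data.frame(
  customer_id = c(1, 2, 3),
  label_ts    = as.Date(c("2025-02-28", "2025-02-28", "2025-02-10")),
  churned     = c(0, 1, 0)
)

# Hypothetical feature history: running total spend as of each feature_ts.
feature_history <- data.frame(
  customer_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  feature_ts  = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10",
                          "2025-01-22", "2025-03-05",
                          "2025-02-01", "2025-02-15", "2025-03-20")),
  total_spend = c(45.50, 127.80, 142.80, 120.00, 175.75, 200.00, 230.25, 325.75)
)

# For each spine row, keep the most recent feature value at or before label_ts.
spine$total_spend <- NA_real_
for (i in seq_len(nrow(spine))) {
  eligible <- feature_history[
    feature_history$customer_id == spine$customer_id[i] &
      feature_history$feature_ts <= spine$label_ts[i], ]
  if (nrow(eligible) > 0) {
    newest <- which.max(as.numeric(eligible$feature_ts))
    spine$total_spend[i] <- eligible$total_spend[newest]
  }
}
print(spine)
```

Customer 1 receives 127.80 rather than the later value 142.80, because the 2025-03-10 order happened after the label timestamp. Joining on the latest value instead would leak future information into the training set.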

In [None]:
%%R
# Generate training data by joining labels with features
training_data <- sfr_generate_training_data(
  fs,
  spine = "SELECT customer_id, churned FROM SFR_DEMO_LABELS",
  features = list(
    list(name = "SFR_DEMO_CUST_FEATURES", version = "v1")
  ),
  spine_label_cols = "churned"
)

training_data

The result is a regular R data.frame -- ready for `lm()`, `glm()`, `randomForest()`, etc.

---

# Section 5: Retrieve Features for Inference

At inference time, fetch the **latest** feature values (no labels, no point-in-time logic).
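
For contrast with training, "latest" simply means the newest row per join key. The base R sketch below shows that selection on a hypothetical feature history with a `feature_ts` column; the demo Feature View already holds one row per customer, so this is illustrative only.

```r
# Hypothetical feature history with several rows per customer.
feature_history <- data.frame(
  customer_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  feature_ts  = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10",
                          "2025-01-22", "2025-03-05",
                          "2025-02-01", "2025-02-15", "2025-03-20")),
  total_spend = c(45.50, 127.80, 142.80, 120.00, 175.75, 200.00, 230.25, 325.75)
)

# Sort by key and timestamp, then keep the last row for each key.
ordered <- feature_history[order(feature_history$customer_id,
                                 feature_history$feature_ts), ]
latest <- ordered[!duplicated(ordered$customer_id, fromLast = TRUE), ]
rownames(latest) <- NULL
print(latest)
```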

In [None]:
%%R
# Get current features for all customers
inference_features <- sfr_retrieve_features(
  fs,
  spine = "SELECT DISTINCT customer_id FROM SFR_DEMO_ORDERS",
  features = list(
    list(name = "SFR_DEMO_CUST_FEATURES", version = "v1")
  )
)

inference_features

---

# Section 6: End-to-End -- Feature Store + Model Registry

Tie everything together: generate training data from the Feature Store,
train a model in R, register it, and score new customers.
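
Before running the cells against Snowflake, it can help to see the train-and-score step in isolation. The sketch below uses base R only, on a hypothetical training frame that is larger than the demo tables so the logistic fit is well-posed. Whether `sfr_predict_local()` wraps `predict()` in this way is an assumption, not something this notebook confirms.

```r
# Hypothetical training frame (not the demo data) with overlapping classes.
training <- data.frame(
  avg_order_total = c(20, 35, 50, 65, 80, 95, 110, 125, 140, 155),
  order_count     = c(9, 2, 7, 3, 8, 2, 6, 1, 5, 1),
  churned         = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1)
)

# Logistic regression on two features.
model <- glm(churned ~ avg_order_total + order_count,
             data = training, family = binomial)

# Hypothetical new customers to score.
new_customers <- data.frame(
  avg_order_total = c(30, 150),
  order_count     = c(8, 1)
)

# type = "response" returns probabilities; the default is the log-odds scale.
churn_score <- predict(model, newdata = new_customers, type = "response")
print(cbind(new_customers, churn_score))
```

The columns passed at scoring time must match the names used in the model formula, which is why the cells below select the same feature columns for both training and inference.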

In [None]:
%%R
# 1. Generate training data from Feature Store
training <- sfr_generate_training_data(
  fs,
  spine = "SELECT customer_id, churned FROM SFR_DEMO_LABELS",
  features = list(
    list(name = "SFR_DEMO_CUST_FEATURES", version = "v1")
  ),
  spine_label_cols = "churned"
)

cat("Training data:\n")
str(training)

In [None]:
%%R
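# 2. Train a model in R
# Note: the demo data has three rows but the formula has four coefficients,
# so expect an NA coefficient and fitting warnings. This cell illustrates the
# workflow, not a usable model.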
model <- glm(
  churned ~ avg_order_total + order_count + total_spend,
  data   = training,
  family = binomial
)

summary(model)

In [None]:
%%R
# 3. Test locally
test_input <- training[, c("avg_order_total", "order_count", "total_spend")]
preds <- sfr_predict_local(model, test_input)
cbind(training[, c("customer_id", "churned")], preds)

In [None]:
%%R
# 4. Register to Model Registry
reg <- sfr_model_registry(conn)

mv <- sfr_log_model(
  reg,
  model      = model,
  model_name = "SFR_DEMO_CHURN",
  input_cols = list(
    avg_order_total = "double",
    order_count     = "double",
    total_spend     = "double"
  ),
  output_cols = list(prediction = "double"),
  comment = "Logistic regression for customer churn"
)

mv

In [None]:
%%R
# 5. Score new customers using Feature Store features
new_features <- sfr_retrieve_features(
  fs,
  spine = "SELECT DISTINCT customer_id FROM SFR_DEMO_ORDERS",
  features = list(
    list(name = "SFR_DEMO_CUST_FEATURES", version = "v1")
  )
)

# Local prediction (or use sfr_predict for remote)
scores <- sfr_predict_local(
  model,
  new_features[, c("avg_order_total", "order_count", "total_spend")]
)

cbind(new_features[, "customer_id", drop = FALSE], churn_score = scores$prediction)

---

## Cleanup

In [None]:
%%R
# Delete demo objects
sfr_delete_model(reg, "SFR_DEMO_CHURN")
sfr_delete_feature_view(fs, "SFR_DEMO_CUST_FEATURES", "v1")
sfr_delete_entity(fs, "SFR_DEMO_CUSTOMER")

sfr_execute(conn, "DROP TABLE IF EXISTS SFR_DEMO_ORDERS")
sfr_execute(conn, "DROP TABLE IF EXISTS SFR_DEMO_LABELS")

sfr_disconnect(conn)
cat("All demo objects cleaned up.\n")

---

## Next steps

- **Full Feature Store API:** `vignette("feature-store", package = "snowflakeR")`
- **Model Registry details:** `vignette("model-registry", package = "snowflakeR")`
- **Workspace Notebook tips:** `vignette("workspace-notebooks", package = "snowflakeR")`