# Varo ML Models - Model Registry

This notebook trains ML models for the Varo Intelligence Agent:
- **Transaction Fraud Detection** - Classify transactions as fraud or legitimate
- **Cash Advance Eligibility** - Predict advance repayment success
- **Customer Lifetime Value** - Predict customer LTV

All models are registered to Snowflake Model Registry and can be added as tools to the Intelligence Agent.

## Prerequisites

**Required Packages** (configured automatically):
- `snowflake-ml-python`
- `scikit-learn`

**Database Context:**
- **Database:** VARO_INTELLIGENCE  
- **Schema:** ANALYTICS  
- **Warehouse:** VARO_FEATURE_WH

**Note:** This notebook uses Snowflake Model Registry. Ensure you have appropriate permissions to create and register models.


## Import Required Packages


In [None]:
# Import Python packages
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Import Snowpark
from snowflake.snowpark.context import get_active_session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# Import Snowpark ML
from snowflake.ml.modeling.preprocessing import StandardScaler, OneHotEncoder
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.linear_model import LogisticRegression
from snowflake.ml.modeling.ensemble import RandomForestClassifier, GradientBoostingRegressor
from snowflake.ml.modeling.metrics import accuracy_score, mean_absolute_error, mean_squared_error
from snowflake.ml.registry import Registry

print("✅ Packages imported successfully")


## Connect to Snowflake

Get the active session and set the context to the Varo database, schema, and warehouse.


---
# Feature Store Registration

Register our Dynamic Tables with Snowflake's native Feature Store API so they appear in the AI/ML UI.


In [None]:
# Import Feature Store API
from snowflake.ml.feature_store import (
    FeatureStore, 
    FeatureView,
    Entity,
    CreationMode
)

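# Create/Connect to Feature Store
# `session` must exist before the Feature Store is created, so fetch it here
session = get_active_session()
fs = FeatureStore(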
    session=session,
    database="VARO_INTELLIGENCE",
    name="FEATURE_STORE",  # This is our schema name
    default_warehouse="VARO_FEATURE_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST
)

print("✅ Feature Store connected")
print(f"   Location: VARO_INTELLIGENCE.FEATURE_STORE")


In [None]:
# Register Customer Entity
customer_entity = Entity(
    name="CUSTOMER",
    join_keys=["customer_id"],
    desc="Varo bank customer entity"
)

fs.register_entity(customer_entity)
print("✅ CUSTOMER entity registered")


In [None]:
# Register Dynamic Tables as Feature Views

# 1. Customer Profile Features
customer_profile_fv = FeatureView(
    name="CUSTOMER_PROFILE_FEATURES",
    entities=[customer_entity],
    feature_df=session.table("VARO_INTELLIGENCE.FEATURE_STORE.CUSTOMER_PROFILE_FEATURES"),
    timestamp_col="feature_timestamp",
    desc="Customer demographic, account, and direct deposit features"
)
fs.register_feature_view(feature_view=customer_profile_fv, version="v1")
print("✅ CUSTOMER_PROFILE_FEATURES registered")

# 2. Transaction Pattern Features
transaction_pattern_fv = FeatureView(
    name="TRANSACTION_PATTERN_FEATURES",
    entities=[customer_entity],
    feature_df=session.table("VARO_INTELLIGENCE.FEATURE_STORE.TRANSACTION_PATTERN_FEATURES"),
    timestamp_col="feature_timestamp",
    desc="Transaction patterns, velocity, and spending behavior features"
)
fs.register_feature_view(feature_view=transaction_pattern_fv, version="v1")
print("✅ TRANSACTION_PATTERN_FEATURES registered")

# 3. Advance Risk Features
advance_risk_fv = FeatureView(
    name="ADVANCE_RISK_FEATURES",
    entities=[customer_entity],
    feature_df=session.table("VARO_INTELLIGENCE.FEATURE_STORE.ADVANCE_RISK_FEATURES"),
    timestamp_col="feature_timestamp",
    desc="Cash advance history, repayment behavior, and risk scoring features"
)
fs.register_feature_view(feature_view=advance_risk_fv, version="v1")
print("✅ ADVANCE_RISK_FEATURES registered")

# 4. Fraud Detection Features
fraud_detection_fv = FeatureView(
    name="FRAUD_DETECTION_FEATURES",
    entities=[customer_entity],
    feature_df=session.table("VARO_INTELLIGENCE.FEATURE_STORE.FRAUD_DETECTION_FEATURES"),
    timestamp_col="feature_timestamp",
    desc="Real-time fraud indicators and anomaly detection features"
)
fs.register_feature_view(feature_view=fraud_detection_fv, version="v1")
print("✅ FRAUD_DETECTION_FEATURES registered")

print("\n🎉 All Feature Views registered! Check AI/ML > Features in Snowflake UI")
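
Each feature view above declares `timestamp_col="feature_timestamp"`. A timestamp column is what makes point-in-time lookups possible: for a given event time, take the latest feature row recorded at or before it, so a training row never sees values from the future. The sketch below shows that idea in plain Python on made-up rows. It is not the Feature Store's implementation, and `as_of` is a hypothetical helper name.

```python
from bisect import bisect_right

def as_of(feature_rows, lookup_ts):
    """Return the latest value whose timestamp is <= lookup_ts, else None.

    feature_rows: list of (feature_timestamp, value), sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, lookup_ts)  # number of rows at or before lookup_ts
    return feature_rows[idx - 1][1] if idx else None

# Toy feature history for one customer (timestamps are plain integers here).
rows = [(1, "v1"), (5, "v2"), (9, "v3")]

print(as_of(rows, 6))  # latest row at or before t=6
print(as_of(rows, 0))  # nothing recorded yet at t=0
```

A lookup at `t=6` returns the row written at `t=5`, not the later one at `t=9`.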


In [None]:
# Get active Snowflake session
session = get_active_session()

# Set context
session.use_database('VARO_INTELLIGENCE')
session.use_schema('ANALYTICS')
session.use_warehouse('VARO_FEATURE_WH')

print(f"✅ Connected - Role: {session.get_current_role()}")
print(f"   Warehouse: {session.get_current_warehouse()}")
print(f"   Database.Schema: {session.get_fully_qualified_current_schema()}")


---
# MODEL 1: Transaction Fraud Detection

Classify transactions as fraudulent or legitimate using customer and transaction features.


### Prepare Fraud Training Data


In [None]:
# Get transaction data with customer features for fraud detection
fraud_df = session.sql("""
SELECT
    t.transaction_id,
    t.customer_id,
    t.amount::FLOAT AS amount,
    t.merchant_category,
    t.transaction_type,
    t.is_international::BOOLEAN AS is_international,
    c.credit_score::FLOAT AS credit_score,
    c.risk_tier,
    COALESCE(a.current_balance, 0)::BIGINT AS account_balance,
    -- Target: Is fraud (based on fraud_score threshold)
    (t.fraud_score > 0.7)::BOOLEAN AS is_fraud
FROM RAW.TRANSACTIONS t
LEFT JOIN RAW.CUSTOMERS c ON t.customer_id = c.customer_id
LEFT JOIN RAW.ACCOUNTS a ON t.account_id = a.account_id
WHERE t.transaction_date >= DATEADD('month', -6, CURRENT_DATE())
  AND t.amount > 10
LIMIT 10000
""")

print(f"Fraud detection data: {fraud_df.count()} transactions")
fraud_df.show(5)
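
The target in the query above is derived by thresholding `fraud_score` at 0.7, so the share of positive rows depends entirely on how the scores are distributed. As a minimal sketch in plain Python, using made-up scores rather than Varo data, the labeling rule and a class-balance check look like this. `label_fraud` and `positive_rate` are hypothetical helper names.

```python
FRAUD_THRESHOLD = 0.7  # same cutoff as the SQL above

def label_fraud(scores, threshold=FRAUD_THRESHOLD):
    """Mirror (fraud_score > 0.7): strictly greater than the threshold."""
    return [score > threshold for score in scores]

def positive_rate(labels):
    """Fraction of rows labeled as fraud (0.0 for an empty list)."""
    return sum(labels) / len(labels) if labels else 0.0

# Made-up scores for illustration only.
toy_scores = [0.05, 0.12, 0.70, 0.71, 0.95, 0.30, 0.02, 0.44]
labels = label_fraud(toy_scores)

print(labels)
print(f"positive rate: {positive_rate(labels):.2f}")  # positive rate: 0.25
```

Note that a score of exactly 0.70 is labeled legitimate, because the comparison is strict.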


### Train Fraud Classification Model


In [None]:
# Train/test split (80/20)
train_fraud, test_fraud = fraud_df.random_split([0.8, 0.2], seed=42)

# Drop ID columns
train_fraud = train_fraud.drop("TRANSACTION_ID", "CUSTOMER_ID")
test_fraud = test_fraud.drop("TRANSACTION_ID", "CUSTOMER_ID")

# Fill any remaining NaN values
train_fraud = train_fraud.fillna({"ACCOUNT_BALANCE": 0, "CREDIT_SCORE": 650.0})
test_fraud = test_fraud.fillna({"ACCOUNT_BALANCE": 0, "CREDIT_SCORE": 650.0})

# Create pipeline with preprocessing and classification
fraud_pipeline = Pipeline([
    ("Encoder", OneHotEncoder(
        input_cols=["MERCHANT_CATEGORY", "TRANSACTION_TYPE", "RISK_TIER"],
        output_cols=["MERCHANT_CATEGORY_ENC", "TRANSACTION_TYPE_ENC", "RISK_TIER_ENC"],
        drop_input_cols=True,
        handle_unknown="ignore"
    )),
    ("Scaler", StandardScaler(
        input_cols=["AMOUNT", "CREDIT_SCORE", "ACCOUNT_BALANCE"],
        output_cols=["AMOUNT_SCALED", "CREDIT_SCORE_SCALED", "ACCOUNT_BALANCE_SCALED"]
    )),
    ("Classifier", RandomForestClassifier(
        label_cols=["IS_FRAUD"],
        output_cols=["FRAUD_PREDICTION"],
        n_estimators=100,
        max_depth=10
    ))
])

# Train model
fraud_pipeline.fit(train_fraud)
print("✅ Fraud detection model trained")
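
To make the two preprocessing steps in the pipeline concrete, here is a small plain-Python sketch of one-hot encoding with unknown categories ignored and of z-score scaling. It assumes population standard deviation, as in scikit-learn's `StandardScaler`. The helper names are hypothetical, and the Snowflake ML classes may differ in details such as output column layout.

```python
from statistics import mean, pstdev

def fit_one_hot(values):
    """Learn the sorted list of categories seen during training."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; an unseen category becomes all zeros ("ignore")."""
    return [1 if value == cat else 0 for cat in categories]

def fit_scaler(values):
    """Return (mean, population standard deviation) for z-score scaling."""
    return mean(values), pstdev(values)

def scale(value, mu, sigma):
    """z = (x - mean) / std; this sketch maps a constant column to 0.0."""
    return (value - mu) / sigma if sigma else 0.0

categories = fit_one_hot(["LOW", "HIGH", "MEDIUM", "LOW"])
print(categories)                      # ['HIGH', 'LOW', 'MEDIUM']
print(one_hot("HIGH", categories))     # [1, 0, 0]
print(one_hot("UNKNOWN", categories))  # [0, 0, 0]

mu, sigma = fit_scaler([10.0, 20.0, 30.0])
print([round(scale(x, mu, sigma), 3) for x in [10.0, 20.0, 30.0]])
```

The all-zeros vector for `"UNKNOWN"` is what lets a fitted encoder score rows with categories it never saw during training instead of raising an error.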


### Evaluate and Register Fraud Model


In [None]:
# Make predictions on test set
fraud_predictions = fraud_pipeline.predict(test_fraud)

# Calculate metrics
fraud_accuracy = accuracy_score(df=fraud_predictions, y_true_col_names="IS_FRAUD", y_pred_col_names="FRAUD_PREDICTION")
fraud_metrics = {"accuracy": round(fraud_accuracy, 4)}
print(f"Fraud model metrics: {fraud_metrics}")

# Register model
reg = Registry(session)
fraud_version = reg.log_model(
    model=fraud_pipeline,
    model_name="FRAUD_DETECTION_MODEL",
    comment="Predicts transaction fraud using Random Forest based on transaction and customer features",
    metrics=fraud_metrics
)

print(f"✅ Fraud model registered as FRAUD_DETECTION_MODEL version {fraud_version.version_name}")
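
Accuracy is the only metric recorded above. When fraud is rare, accuracy can look good even for a model that never flags anything, so precision and recall are worth checking as well. The sketch below shows this on made-up labels in plain Python. The helper names are hypothetical. `snowflake.ml.modeling.metrics` also provides `precision_score` and `recall_score`; check their signatures in your installed version before relying on them.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud) class; 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data: 2 fraud cases out of 10 transactions.
y_true = [True, True] + [False] * 8
# A degenerate model that never predicts fraud.
y_pred_never = [False] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred_never)) / len(y_true)
print(accuracy)                                # 0.8
print(precision_recall(y_true, y_pred_never))  # (0.0, 0.0)
```

The degenerate model scores 80% accuracy while catching no fraud at all, which is why recall on the fraud class is the more informative number here.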


---
# MODEL 2: Cash Advance Repayment Success

Predict whether cash advances will be repaid successfully.


### Prepare Advance Training Data


In [None]:
# Get cash advance data with customer features
advance_df = session.sql("""
SELECT
    ca.advance_id,
    ca.customer_id,
    ca.advance_amount::FLOAT AS advance_amount,
    ca.fee_amount::FLOAT AS fee_amount,
    ca.eligibility_score::FLOAT AS eligibility_score,
    c.credit_score::FLOAT AS credit_score,
    c.risk_tier,
    c.employment_status,
    -- Count direct deposits
    COUNT(DISTINCT dd.deposit_id)::FLOAT AS deposit_count,
    -- Average deposit amount
    AVG(dd.amount)::FLOAT AS avg_deposit_amount,
    -- Target: Was repaid successfully
    (ca.advance_status = 'REPAID')::BOOLEAN AS was_repaid
FROM RAW.CASH_ADVANCES ca
INNER JOIN RAW.CUSTOMERS c ON ca.customer_id = c.customer_id
INNER JOIN RAW.DIRECT_DEPOSITS dd ON ca.customer_id = dd.customer_id
WHERE ca.advance_date >= DATEADD('month', -12, CURRENT_DATE())
  AND ca.eligibility_score IS NOT NULL
  AND c.credit_score IS NOT NULL
  AND c.risk_tier IS NOT NULL
  AND c.employment_status IS NOT NULL
  AND dd.amount IS NOT NULL
GROUP BY ca.advance_id, ca.customer_id, ca.advance_amount, ca.fee_amount, ca.eligibility_score,
         c.credit_score, c.risk_tier, c.employment_status, ca.advance_status
HAVING AVG(dd.amount) IS NOT NULL
  AND COUNT(DISTINCT dd.deposit_id) > 0
LIMIT 5000
""")

print(f"Advance data: {advance_df.count()} advances")
advance_df.show(5)
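
The two deposit features in the query above, `deposit_count` and `avg_deposit_amount`, are per-customer aggregates over `RAW.DIRECT_DEPOSITS`. A minimal plain-Python sketch of the same aggregation on made-up rows follows. `deposit_features` is a hypothetical helper, and the tuple layout is an assumption made for illustration.

```python
from collections import defaultdict

def deposit_features(deposits):
    """Aggregate deposits per customer.

    deposits: iterable of (customer_id, deposit_id, amount) tuples.
    Returns {customer_id: (deposit_count, avg_deposit_amount)}.
    """
    by_customer = defaultdict(dict)
    for customer_id, deposit_id, amount in deposits:
        if amount is None:  # mirrors `dd.amount IS NOT NULL`
            continue
        # Keyed by deposit_id, so each deposit is counted once.
        by_customer[customer_id][deposit_id] = amount
    return {
        cid: (len(rows), sum(rows.values()) / len(rows))
        for cid, rows in by_customer.items()
    }

# Made-up deposits for illustration only.
toy_deposits = [
    ("C1", "D1", 1000.0),
    ("C1", "D2", 1500.0),
    ("C2", "D3", 800.0),
    ("C2", "D4", None),
]
features = deposit_features(toy_deposits)
print(features)
```

Customers with no usable deposits do not appear in the result at all, which matches the effect of the `INNER JOIN` and `HAVING` clauses in the query.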


### Train Advance Repayment Model


In [None]:
# Split data
train_advance, test_advance = advance_df.random_split([0.8, 0.2], seed=42)

# Drop ID columns
train_advance = train_advance.drop("ADVANCE_ID", "CUSTOMER_ID")
test_advance = test_advance.drop("ADVANCE_ID", "CUSTOMER_ID")

# Fill any remaining NaN/NULL values
train_advance = train_advance.fillna({
    "ADVANCE_AMOUNT": 100.0,
    "FEE_AMOUNT": 5.0,
    "ELIGIBILITY_SCORE": 0.5,
    "CREDIT_SCORE": 650.0,
    "DEPOSIT_COUNT": 0.0,
    "AVG_DEPOSIT_AMOUNT": 1000.0
})
test_advance = test_advance.fillna({
    "ADVANCE_AMOUNT": 100.0,
    "FEE_AMOUNT": 5.0,
    "ELIGIBILITY_SCORE": 0.5,
    "CREDIT_SCORE": 650.0,
    "DEPOSIT_COUNT": 0.0,
    "AVG_DEPOSIT_AMOUNT": 1000.0
})

# Create pipeline
advance_pipeline = Pipeline([
    ("Encoder", OneHotEncoder(
        input_cols=["RISK_TIER", "EMPLOYMENT_STATUS"],
        output_cols=["RISK_TIER_ENC", "EMPLOYMENT_STATUS_ENC"],
        drop_input_cols=True,
        handle_unknown="ignore"
    )),
    ("Scaler", StandardScaler(
        input_cols=["ADVANCE_AMOUNT", "FEE_AMOUNT", "ELIGIBILITY_SCORE", "CREDIT_SCORE", "DEPOSIT_COUNT"],
        output_cols=["ADVANCE_AMOUNT_SCALED", "FEE_AMOUNT_SCALED", "ELIGIBILITY_SCORE_SCALED", "CREDIT_SCORE_SCALED", "DEPOSIT_COUNT_SCALED"]
    )),
    ("Classifier", LogisticRegression(
        label_cols=["WAS_REPAID"],
        output_cols=["REPAYMENT_PREDICTION"]
    ))
])

# Train
advance_pipeline.fit(train_advance)
print("✅ Advance repayment model trained")
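
For intuition about the classifier at the end of this pipeline: logistic regression turns a weighted sum of the (encoded and scaled) features into a probability with the sigmoid function, then predicts the positive class when that probability exceeds 0.5. The sketch below uses made-up weights and features. The real coefficients are learned by `fit`, and `predict_repaid` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_repaid(features, weights, bias, threshold=0.5):
    """Return (probability, predicted_label) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    prob = sigmoid(z)
    return prob, prob > threshold

# Hypothetical weights for two already-scaled features.
weights = [1.5, -2.0]
prob, repaid = predict_repaid([1.0, 0.5], weights, bias=0.0)
print(round(prob, 4), repaid)  # 0.6225 True
```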


### Evaluate and Register Advance Model


In [None]:
# Make predictions
advance_predictions = advance_pipeline.predict(test_advance)

# Calculate metrics
advance_accuracy = accuracy_score(df=advance_predictions, y_true_col_names="WAS_REPAID", y_pred_col_names="REPAYMENT_PREDICTION")
advance_metrics = {"accuracy": round(advance_accuracy, 4)}
print(f"Advance model metrics: {advance_metrics}")

# Register model
advance_version = reg.log_model(
    model=advance_pipeline,
    model_name="ADVANCE_ELIGIBILITY_MODEL",
    comment="Predicts cash advance repayment success using Logistic Regression based on customer creditworthiness and deposit patterns",
    metrics=advance_metrics
)

print(f"✅ Advance model registered as ADVANCE_ELIGIBILITY_MODEL version {advance_version.version_name}")


---
# MODEL 3: Customer Lifetime Value Prediction

Predict customer lifetime value using engagement and behavior metrics.


### Prepare LTV Training Data


In [None]:
# Get customer LTV data with features
ltv_df = session.sql("""
SELECT
    c.customer_id,
    c.lifetime_value::FLOAT AS lifetime_value,
    DATEDIFF('month', c.acquisition_date, CURRENT_DATE())::FLOAT AS tenure_months,
    c.credit_score::FLOAT AS credit_score,
    c.risk_tier,
    c.acquisition_channel,
    -- Product count
    COUNT(DISTINCT a.account_id)::FLOAT AS product_count,
    -- Average account balance (handle NULL)
    COALESCE(AVG(a.current_balance), 0)::FLOAT AS avg_account_balance,
    -- Transaction count (last 90 days)
    COUNT(DISTINCT CASE WHEN t.transaction_date >= DATEADD('day', -90, CURRENT_DATE())
                   THEN t.transaction_id END)::FLOAT AS recent_transaction_count,
    -- Has direct deposit
    (COUNT(DISTINCT dd.deposit_id) > 0)::BOOLEAN AS has_direct_deposit
FROM RAW.CUSTOMERS c
LEFT JOIN RAW.ACCOUNTS a ON c.customer_id = a.customer_id
LEFT JOIN RAW.TRANSACTIONS t ON c.customer_id = t.customer_id
LEFT JOIN RAW.DIRECT_DEPOSITS dd ON c.customer_id = dd.customer_id
WHERE c.customer_status = 'ACTIVE'
  AND c.lifetime_value > 0
GROUP BY c.customer_id, c.lifetime_value, c.acquisition_date, c.credit_score, c.risk_tier, c.acquisition_channel
LIMIT 5000
""")

print(f"LTV data: {ltv_df.count()} customers")
ltv_df.show(5)
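
The query above joins three one-to-many tables (accounts, transactions, direct deposits) on `customer_id` alone, so each customer's rows multiply before aggregation. That is why the counts use `COUNT(DISTINCT ...)`. The plain-Python sketch below reproduces the fan-out for a single made-up customer; all IDs and balances are invented for illustration.

```python
from itertools import product

# One customer's related rows (made-up IDs).
accounts = ["A1", "A2"]
transactions = ["T1", "T2", "T3"]
deposits = ["D1"]
balances = {"A1": 100.0, "A2": 300.0}

# Joining on customer_id alone yields every combination of the three lists.
joined_rows = list(product(accounts, transactions, deposits))

naive_account_count = len(joined_rows)                          # what COUNT(*) would see
distinct_account_count = len({a for a, _, _ in joined_rows})    # COUNT(DISTINCT account_id)
distinct_txn_count = len({t for _, t, _ in joined_rows})        # COUNT(DISTINCT transaction_id)

# Each account is repeated the same number of times, so the average is unchanged.
avg_balance_joined = sum(balances[a] for a, _, _ in joined_rows) / len(joined_rows)

print(naive_account_count, distinct_account_count, distinct_txn_count)  # 6 2 3
print(avg_balance_joined)  # 200.0
```

The counts stay correct thanks to `DISTINCT`, but the intermediate result still grows multiplicatively, which can make the query expensive for customers with many transactions.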


### Train LTV Regression Model


In [None]:
# Split data
train_ltv, test_ltv = ltv_df.random_split([0.8, 0.2], seed=42)

# Drop CUSTOMER_ID
train_ltv = train_ltv.drop("CUSTOMER_ID")
test_ltv = test_ltv.drop("CUSTOMER_ID")

# Fill any remaining NaN values
train_ltv = train_ltv.fillna({
    "TENURE_MONTHS": 1.0,
    "CREDIT_SCORE": 650.0,
    "PRODUCT_COUNT": 1.0,
    "AVG_ACCOUNT_BALANCE": 0.0,
    "RECENT_TRANSACTION_COUNT": 0.0
})
test_ltv = test_ltv.fillna({
    "TENURE_MONTHS": 1.0,
    "CREDIT_SCORE": 650.0,
    "PRODUCT_COUNT": 1.0,
    "AVG_ACCOUNT_BALANCE": 0.0,
    "RECENT_TRANSACTION_COUNT": 0.0
})

# Create pipeline
ltv_pipeline = Pipeline([
    ("Encoder", OneHotEncoder(
        input_cols=["RISK_TIER", "ACQUISITION_CHANNEL"],
        output_cols=["RISK_TIER_ENC", "ACQUISITION_CHANNEL_ENC"],
        drop_input_cols=True,
        handle_unknown="ignore"
    )),
    ("Scaler", StandardScaler(
        input_cols=["TENURE_MONTHS", "CREDIT_SCORE", "PRODUCT_COUNT", "AVG_ACCOUNT_BALANCE", "RECENT_TRANSACTION_COUNT"],
        output_cols=["TENURE_MONTHS_SCALED", "CREDIT_SCORE_SCALED", "PRODUCT_COUNT_SCALED", "AVG_ACCOUNT_BALANCE_SCALED", "RECENT_TRANSACTION_COUNT_SCALED"]
    )),
    ("Regressor", GradientBoostingRegressor(
        label_cols=["LIFETIME_VALUE"],
        output_cols=["PREDICTED_LTV"],
        n_estimators=100,
        max_depth=6
    ))
])

# Train
ltv_pipeline.fit(train_ltv)
print("✅ LTV prediction model trained")


### Evaluate and Register LTV Model


In [None]:
# Predict on test set
ltv_predictions = ltv_pipeline.predict(test_ltv)

# Calculate metrics
ltv_mae = mean_absolute_error(df=ltv_predictions, y_true_col_names="LIFETIME_VALUE", y_pred_col_names="PREDICTED_LTV")
ltv_rmse = mean_squared_error(df=ltv_predictions, y_true_col_names="LIFETIME_VALUE", y_pred_col_names="PREDICTED_LTV") ** 0.5
ltv_metrics = {"mae": round(ltv_mae, 2), "rmse": round(ltv_rmse, 2)}
print(f"LTV model metrics: {ltv_metrics}")

# Register model
ltv_version = reg.log_model(
    model=ltv_pipeline,
    model_name="CUSTOMER_LTV_MODEL",
    comment="Predicts customer lifetime value using Gradient Boosting based on engagement and behavior metrics",
    metrics=ltv_metrics
)

print(f"✅ LTV model registered as CUSTOMER_LTV_MODEL version {ltv_version.version_name}")
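
The two regression metrics recorded above are MAE and RMSE, where RMSE is the square root of the mean squared error. A small worked example in plain Python, on made-up values, shows how they differ: RMSE penalizes the single large miss more heavily than MAE does.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared errors."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Made-up lifetime values; absolute errors are 10, 10, 0, and 40.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 300.0, 440.0]

print(mae(actual, predicted))             # 15.0
print(round(rmse(actual, predicted), 2))  # 21.21
```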


---
# Verify Models in Registry


In [None]:
# Show all models in the registry
print("Models in registry:")
reg.show_models()

# Show versions for fraud model
print("\nFraud Detection Model versions:")
reg.get_model("FRAUD_DETECTION_MODEL").show_versions()

# Show versions for advance model  
print("\nAdvance Eligibility Model versions:")
reg.get_model("ADVANCE_ELIGIBILITY_MODEL").show_versions()

# Show versions for LTV model
print("\nCustomer LTV Model versions:")
reg.get_model("CUSTOMER_LTV_MODEL").show_versions()

print("\n✅ All models registered and ready to add to Intelligence Agent")


---
# Test Model Inference

Test calling each model to make predictions.


In [None]:
# Test fraud detection on sample transactions
fraud_model = reg.get_model("FRAUD_DETECTION_MODEL").default
sample_fraud = fraud_df.limit(5).drop("TRANSACTION_ID", "CUSTOMER_ID")
sample_fraud = sample_fraud.fillna({"ACCOUNT_BALANCE": 0, "CREDIT_SCORE": 650.0})
fraud_preds = fraud_model.run(sample_fraud, function_name="predict")
print("Fraud Detection predictions:")
fraud_preds.select("IS_FRAUD", "FRAUD_PREDICTION").show()

# Test advance repayment on sample advances
advance_model = reg.get_model("ADVANCE_ELIGIBILITY_MODEL").default
sample_advance = advance_df.limit(5).drop("ADVANCE_ID", "CUSTOMER_ID")
sample_advance = sample_advance.fillna({
    "ADVANCE_AMOUNT": 100.0,
    "FEE_AMOUNT": 5.0,
    "ELIGIBILITY_SCORE": 0.5,
    "CREDIT_SCORE": 650.0,
    "DEPOSIT_COUNT": 0.0,
    "AVG_DEPOSIT_AMOUNT": 1000.0
})
advance_preds = advance_model.run(sample_advance, function_name="predict")
print("\nAdvance Repayment predictions:")
advance_preds.select("WAS_REPAID", "REPAYMENT_PREDICTION").show()

# Test LTV prediction on sample customers
ltv_model = reg.get_model("CUSTOMER_LTV_MODEL").default
sample_ltv = ltv_df.limit(5).drop("CUSTOMER_ID")
sample_ltv = sample_ltv.fillna({
    "TENURE_MONTHS": 1.0,
    "CREDIT_SCORE": 650.0,
    "PRODUCT_COUNT": 1.0,
    "AVG_ACCOUNT_BALANCE": 0.0,
    "RECENT_TRANSACTION_COUNT": 0.0
})
ltv_preds = ltv_model.run(sample_ltv, function_name="predict")
print("\nCustomer LTV predictions:")
ltv_preds.select("LIFETIME_VALUE", "PREDICTED_LTV").show()

print("\n✅ All models tested successfully!")


---
# Next Steps

## Add Models to Intelligence Agent

**Using the SQL script (recommended):**
Run `sql/agent/10_create_intelligence_agent.sql`, which automatically configures all ML model procedures.

The Python procedures in `sql/ml/09_create_model_functions.sql` use the models registered in this notebook.

## Example Questions for Agent

- "Is this $500 international transaction likely fraud?"
- "Check if customer CUST00001234 is eligible for a cash advance"
- "Predict the lifetime value for our newest customer cohort"
- "Which customers show high fraud risk patterns?"

The models will now be available as tools your agent can use!
