# Train Continuous CLV Prediction Model

This notebook trains an XGBoost model to predict 12-month customer lifetime value for established customers.

**Data Source**: Features from Snowflake Feature Store (feature_engineering_continuous.ipynb)

**Model Purpose**: Provide updated CLV predictions for customers with 3+ months of history, enabling dynamic segmentation and retention strategies.

**Steps**:
1. Load training data from Feature Store
2. Create additional behavioral features
3. Train XGBoost with hyperparameter tuning
4. Evaluate model performance
5. Deploy to Snowflake Model Registry with Feature Store lineage



In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
from snowflake.snowpark.context import get_active_session
from snowflake.ml.registry import Registry
from snowflake.ml.model import task
from snowflake.ml.model.target_platform import TargetPlatform
from snowflake.ml.modeling import tune
from snowflake.ml.modeling.tune.search import BayesOpt
from snowflake.ml.data.data_connector import DataConnector
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Get active Snowflake session
session = get_active_session()



## Configuration

Set your database and schema here:

In [None]:
# Database and schema configuration
DATABASE = 'ML_DEMO'
SCHEMA = 'PUBLIC'

# Set context
session.use_database(DATABASE)
session.use_schema(SCHEMA)

print(f"Using database: {DATABASE}")
print(f"Using schema: {SCHEMA}")
print(f"Current warehouse: {session.get_current_warehouse()}")
print(f"Current role: {session.get_current_role()}")

## Generate Snowflake Dataset from Feature Store

**Best Practice: Use `fs.generate_dataset()` for training data**

Per Snowflake documentation, Datasets are the recommended way to create training data:
- **Immutable snapshots** ensure reproducibility
- **Automatic versioning** (v1, v2, v3, etc.)
- **ML lineage tracking** links Feature Store → Dataset → Model
- **Efficient storage** for distributed training with TensorFlow, PyTorch
- **No manual table saves needed** - Dataset is automatically materialized

This approach is superior to manually saving training data to tables.

In [None]:
# Import Feature Store dependencies
from snowflake.ml.feature_store import (
    FeatureStore,
    CreationMode
)

# Connect to Feature Store
FEATURE_STORE_NAME = 'CLV_FEATURE_STORE'

fs = FeatureStore(
    session=session,
    database=DATABASE,
    name=FEATURE_STORE_NAME,
    default_warehouse=session.get_current_warehouse(),
    creation_mode=CreationMode.FAIL_IF_NOT_EXIST
)

print(f"✓ Connected to Feature Store: {DATABASE}.{FEATURE_STORE_NAME}")

# Get registered feature views
rfm_fv = fs.get_feature_view("RFM_FEATURES", "1.0")
purchase_patterns_fv = fs.get_feature_view("PURCHASE_PATTERNS", "1.0")
engagement_fv = fs.get_feature_view("ENGAGEMENT_FEATURES", "1.0")
derived_fv = fs.get_feature_view("DERIVED_FEATURES", "1.0")

print("✓ Retrieved 4 feature views:")
print("  - RFM_FEATURES")
print("  - PURCHASE_PATTERNS")
print("  - ENGAGEMENT_FEATURES")
print("  - DERIVED_FEATURES")

# Load spine (customer profile with target)
spine_df = session.table(f"{DATABASE}.{SCHEMA}.CONTINUOUS_CUSTOMERS_PROFILE_WITH_TARGET")

print(f"\n✓ Loaded spine: {spine_df.count()} customers with target variable")

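# Generate Snowflake Dataset (immutable, versioned snapshot)
# This automatically materializes the data - no manual save needed.
# Dataset versions are immutable, so re-running this cell with an existing
# version name is expected to fail; use a new version (v2, v3, ...) when regenerating.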
dataset = fs.generate_dataset(
    name="CONTINUOUS_TRAINING_DATASET",
    version="v1",
    spine_df=spine_df,
    features=[
        rfm_fv,
        purchase_patterns_fv,
        engagement_fv,
        derived_fv
    ],
    spine_timestamp_col=None,  # No point-in-time joins needed
    spine_label_cols=["FUTURE_12M_LTV"],
    desc="CLV training dataset with all features from Feature Store"
)

print(f"\n✓ Generated Snowflake Dataset: CONTINUOUS_TRAINING_DATASET v1")

## Load Generated Training Dataset

In [None]:
# Read training data from the Dataset we just generated
# Convert Dataset to Snowpark DataFrame
df_snowpark = dataset.read.to_snowpark_dataframe()

# Convert to Pandas for scikit-learn/XGBoost
df = df_snowpark.to_pandas()

# IMPORTANT: Snowflake columns are UPPERCASE by default
# Convert all column names to lowercase for easier Python manipulation
df.columns = df.columns.str.lower()

print(f"✓ Loaded {len(df)} customer records from Dataset CONTINUOUS_TRAINING_DATASET (v1)")
print(f"✓ Features: {len(df.columns)} columns")
print(f"✓ Converted column names to lowercase for Python compatibility")

# Convert date columns if present
date_columns = ['signup_date', 'first_purchase_date', 'last_purchase_date']
for col in date_columns:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col])

# Convert any categorical columns to object to avoid fillna issues
for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].astype('object')

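# Sanitize categorical values to make them SQL-safe:
# replace hyphens with underscores and plus signs with '_PLUS'.
# Missing values are left untouched so the later fillna('unknown') still applies
# (astype(str) on the whole column would turn them into the strings 'nan' or 'None').
for col in df.select_dtypes(include=['object']).columns:
    present = df[col].notna()
    df.loc[present, col] = (
        df.loc[present, col].astype(str)
        .str.replace('-', '_', regex=False)
        .str.replace('+', '_PLUS', regex=False)
    )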

print(f"\nData comes from immutable Snowflake Dataset with:")
print("  ✓ Feature Store lineage tracking")
print("  ✓ Reproducible training snapshots")
print("  ✓ Automatic versioning")
print("✓ Sanitized categorical values for SQL compatibility")
df.head()
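
To make the sanitization rule concrete, here is a small standalone sketch of the same two replacements applied to plain strings. The sample labels (`18-24`, `65+`, `north-east`) and the helper name `sanitize_category` are assumptions for illustration; the real category values depend on the source tables.

```python
from typing import Optional

def sanitize_category(value: Optional[str]) -> Optional[str]:
    """Replace '-' with '_' and '+' with '_PLUS'; pass missing values through."""
    if value is None:
        return None
    return value.replace("-", "_").replace("+", "_PLUS")

# Hypothetical category labels, not taken from the real data
examples = ["18-24", "65+", "north-east", None]
sanitized = [sanitize_category(v) for v in examples]
print(sanitized)
```

Passing missing values through unchanged keeps them visible to any later missing-value handling.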

## Additional Feature Engineering in Python

**Note**: Many derived features (RFM scores, lifecycle stages, velocities) already exist in the DERIVED_FEATURES feature view from the Feature Store. 

Here we add additional Python-based transformations for model training:
- Pandas-based RFM quintile scoring (for comparison)
- Lifecycle stage assignment (redundant with Feature Store but kept for consistency)
- Purchase consistency metrics
- Velocity calculations (using Feature Store base features)
- Engagement ratios
- Tenure cohort binning
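
As an illustration of rank-based quintile scoring, the sketch below assigns scores 1-5 in plain Python. It is a minimal sketch, not the scoring used by the DERIVED_FEATURES view: the function name `quintile_scores`, the tie-breaking rule, and the sample recency values are all assumptions.

```python
from typing import List, Sequence

def quintile_scores(values: Sequence[float], higher_is_better: bool = True) -> List[int]:
    """Score each value from 1 to 5 by its rank-based quintile."""
    n = len(values)
    # Stable sort: tied values keep their input order
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, idx in enumerate(order):
        quintile = rank * 5 // n + 1  # 1 = lowest fifth, 5 = highest fifth
        scores[idx] = quintile if higher_is_better else 6 - quintile
    return scores

# Hypothetical recency values in days; lower recency is better, so the scale is inverted
recency_days = [5, 40, 90, 200, 365, 10, 60, 120, 300, 15]
recency_score = quintile_scores(recency_days, higher_is_better=False)
print(recency_score)
```

Frequency and monetary value would use `higher_is_better=True`, so that a score of 5 always marks the most valuable fifth of customers on each dimension.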

In [None]:
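# Compute the high-value threshold once instead of recomputing the quantile for every row
high_value_threshold = df['monetary_total'].quantile(0.75)

def assign_lifecycle_stage(row):
    # Rules are checked in order; the first match wins
    if row['customer_tenure_days'] < 180:
        return 'new'
    elif row['recency_days'] > 90:
        return 'at_risk'
    elif row['frequency'] >= 20:
        return 'champion'
    elif row['monetary_total'] >= high_value_threshold:
        return 'high_value'
    else:
        return 'regular'

df['lifecycle_stage'] = df.apply(assign_lifecycle_stage, axis=1)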

print("\nLifecycle stage distribution:")
print(df['lifecycle_stage'].value_counts())

In [None]:
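# The Feature Store does not yet provide the standard deviation of inter-purchase days,
# so both columns are constant placeholders for now
df['std_inter_purchase_days'] = 0
df['purchase_consistency'] = 1.0  # Default: treat every customer as fully consistent

# Once the standard deviation is available, the intended calculation is:
# df['purchase_consistency'] = 1 / (1 + df['std_inter_purchase_days'].fillna(0))

# Assign the result instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to df
df['purchase_consistency'] = df['purchase_consistency'].fillna(1.0)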

In [None]:
# Use features from Feature Store (already computed)
df['purchase_velocity_30d'] = df['recent_30d_count'] / 30
df['purchase_velocity_90d'] = df['recent_90d_count'] / 90

df['spending_velocity_30d'] = df['recent_30d_amount'] / 30
df['spending_velocity_90d'] = df['recent_90d_amount'] / 90

df['velocity_acceleration'] = df['purchase_velocity_30d'] - df['purchase_velocity_90d']

print("✓ Calculated velocity indicators from Feature Store features")
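
A short worked example may help when reading `velocity_acceleration`. The counts below are hypothetical; the arithmetic mirrors the cell above (events per day over 30 and 90 days, then their difference).

```python
def daily_rate(count: float, window_days: int) -> float:
    """Average number of events per day over a window."""
    return count / window_days

# Hypothetical customer: 3 purchases in the last 30 days, 6 in the last 90 days
velocity_30d = daily_rate(3, 30)   # 0.1 purchases per day
velocity_90d = daily_rate(6, 90)   # about 0.067 purchases per day
acceleration = velocity_30d - velocity_90d

print(f"30d: {velocity_30d:.3f}, 90d: {velocity_90d:.3f}, acceleration: {acceleration:+.3f}")
```

A positive value means the customer's recent 30-day pace is faster than their 90-day average; a negative value means purchasing has slowed.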



In [None]:
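# Divide by frequency, treating zero purchases as missing to avoid division by zero,
# then fill the resulting NaN values with 0. The result is assigned back rather than
# using fillna(inplace=True) on a column selection, which recent pandas versions warn about.
df['engagement_per_purchase'] = (
    df['total_interactions'] / df['frequency'].replace(0, np.nan)
).fillna(0)

df['support_intensity'] = (
    df['support_tickets'] / df['frequency'].replace(0, np.nan)
).fillna(0)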

In [None]:
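tenure_bins = [0, 180, 365, 540, 999999]
tenure_labels = ['0-6m', '6-12m', '12-18m', '18m+']
# right=False makes the bins left-closed ([0, 180), [180, 365), ...), so a tenure of 0 days
# is binned and the 180-day boundary matches the `< 180` rule used for lifecycle_stage
df['tenure_cohort'] = pd.cut(df['customer_tenure_days'], bins=tenure_bins, labels=tenure_labels, right=False)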

# Convert to object immediately to avoid Categorical issues later
df['tenure_cohort'] = df['tenure_cohort'].astype('object')

cohort_avg_monetary = df.groupby('tenure_cohort')['monetary_total'].transform('mean')
df['monetary_vs_cohort'] = df['monetary_total'] / cohort_avg_monetary

cohort_avg_frequency = df.groupby('tenure_cohort')['frequency'].transform('mean')
df['frequency_vs_cohort'] = df['frequency'] / cohort_avg_frequency

In [None]:
categorical_features = [
    'age_group',
    'region',
    'segment',
    'lifecycle_stage',
    'tenure_cohort'
]

numerical_features = [
    'recency_days',
    'frequency',
    'monetary_total',
    'monetary_avg',
    'customer_tenure_days',
    'unique_categories_purchased',
    'total_items_purchased',
    'recent_30d_amount',
    'recent_30d_count',
    'recent_90d_amount',
    'recent_90d_count',
    'total_interactions',
    'website_visits',
    'email_opens',
    'email_clicks',
    'support_tickets',
    'product_views',
    'cart_adds',
    'email_engagement_rate',
    'rfm_score',
    'purchase_consistency',
    'purchase_velocity_30d',
    'purchase_velocity_90d',
    'spending_velocity_30d',
    'spending_velocity_90d',
    'velocity_acceleration',
    'engagement_per_purchase',
    'support_intensity',
    'monetary_vs_cohort',
    'frequency_vs_cohort'
]

# Convert categorical columns to object type before fillna to avoid Categorical dtype issues
for col in categorical_features:
    if df[col].dtype.name == 'category':
        df[col] = df[col].astype('object')

# Fill missing values
df[categorical_features] = df[categorical_features].fillna('unknown')
df[numerical_features] = df[numerical_features].fillna(0)
df[numerical_features] = df[numerical_features].replace([np.inf, -np.inf], 0)

X = df[categorical_features + numerical_features]
y = df['future_12m_ltv']

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
print(f"\nModel features: {len(categorical_features)} categorical, {len(numerical_features)} numerical")
print("✓ Feature Store features carry lineage; Python-derived features are computed in this notebook")

In [None]:
df_sorted = df.sort_values('last_purchase_date').reset_index(drop=True)

train_size = int(0.7 * len(df_sorted))
val_size = int(0.15 * len(df_sorted))

train_df = df_sorted.iloc[:train_size]
val_df = df_sorted.iloc[train_size:train_size + val_size]
test_df = df_sorted.iloc[train_size + val_size:]

X_train = train_df[categorical_features + numerical_features]
y_train = train_df['future_12m_ltv']

X_val = val_df[categorical_features + numerical_features]
y_val = val_df['future_12m_ltv']

X_test = test_df[categorical_features + numerical_features]
y_test = test_df['future_12m_ltv']

print(f"Train set: {len(X_train)} samples (mean LTV: ${y_train.mean():.2f})")
print(f"Validation set: {len(X_val)} samples (mean LTV: ${y_val.mean():.2f})")
print(f"Test set: {len(X_test)} samples (mean LTV: ${y_test.mean():.2f})")
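
The split sizes come from integer truncation, so the test set absorbs any remainder. The sketch below reproduces that arithmetic on its own; the function name and the row count of 101 are assumptions chosen to make the rounding visible.

```python
from typing import Tuple

def chronological_split_sizes(
    n_rows: int, train_frac: float = 0.70, val_frac: float = 0.15
) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train_size = int(train_frac * n_rows)
    val_size = int(val_frac * n_rows)
    test_size = n_rows - train_size - val_size
    return train_size, val_size, test_size

sizes = chronological_split_sizes(101)
print(sizes)
```

Because the rows are sorted by `last_purchase_date` before slicing, the training rows are the customers with the earliest last purchase and the test rows are those with the most recent one.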

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print(f"Processed feature dimensionality: {X_train_processed.shape[1]}")

In [None]:
# Simple in-memory XGBoost training option: local randomized search
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

# 1. Define the model
xgb_local = XGBRegressor(
    objective='reg:squarederror',
    n_jobs=-1, 
    random_state=42
)

# 2. Define the candidate values for each hyperparameter (standard dictionary)
param_dist = {
    "n_estimators": range(100, 301, 50),
    "max_depth": range(4, 9),
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "subsample": [0.7, 0.8, 0.9],
    "reg_lambda": [0.5, 1.0, 1.5, 2.0]
}

# 3. Run a standard RandomizedSearchCV inside the notebook's own process
# Note: this uses the already processed numpy arrays (X_train_processed)
search = RandomizedSearchCV(
    estimator=xgb_local,
    param_distributions=param_dist,
    n_iter=10,  # Number of sampled configurations (matches num_trials in the distributed option)
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=1,
    n_jobs=-1,  # Use all available cores
    random_state=42  # Make the sampled configurations reproducible
)

print("Starting local HPO (Fast)...")
search.fit(X_train_processed, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best RMSE: {-search.best_score_:.4f}")

# 4. Use the best estimator directly
best_model = search.best_estimator_
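
To put the search budget in perspective, the sketch below counts how many configurations the grid above contains and how many model fits the search performs. The candidate values are copied from the cell above; the `+ 1` assumes scikit-learn's default `refit=True`, which retrains the best configuration on the full training data.

```python
from math import prod

# Candidate values copied from the randomized-search cell
param_dist = {
    "n_estimators": range(100, 301, 50),
    "max_depth": range(4, 9),
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "subsample": [0.7, 0.8, 0.9],
    "reg_lambda": [0.5, 1.0, 1.5, 2.0],
}

n_combinations = prod(len(values) for values in param_dist.values())
n_iter, cv = 10, 3
n_fits = n_iter * cv + 1  # cross-validation fits plus the final refit

print(f"Grid size: {n_combinations} configurations")
print(f"Sampled: {n_iter} ({n_iter / n_combinations:.2%} of the grid)")
print(f"Model fits: {n_fits}")
```

With under 1% of the grid sampled, the result depends noticeably on which configurations happen to be drawn, which is one reason to raise `n_iter` if more tuning is needed.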

In [None]:
# # Distributed HPO option
# train_connector = DataConnector.from_dataframe(
#     session.create_dataframe(
#         pd.concat([X_train, y_train.rename('future_12m_ltv')], axis=1)
#     )
# )

# val_connector = DataConnector.from_dataframe(
#     session.create_dataframe(
#         pd.concat([X_val, y_val.rename('future_12m_ltv')], axis=1)
#     )
# )

# # Define search space with Snowflake ML tune functions
# # OPTIMIZED: Focus on 5 most impactful hyperparameters per best practices
# # Research shows 2-3x trials per hyperparameter is sufficient for Bayesian optimization
# search_space = {
#     "n_estimators": tune.uniform(100, 300),      # Tree count - HIGH impact
#     "max_depth": tune.uniform(4, 8),              # Tree complexity - HIGH impact  
#     "learning_rate": tune.loguniform(0.01, 0.3),  # Shrinkage - HIGH impact
#     "subsample": tune.uniform(0.7, 0.9),          # Row sampling - prevents overfitting
#     "reg_lambda": tune.uniform(0.5, 2.0)          # L2 regularization - prevents overfitting
# }

# # Store preprocessor and feature names globally for training function
# global_preprocessor = preprocessor
# global_categorical_features = categorical_features
# global_numerical_features = numerical_features

# # Define training function for HPO
# def train_func():
#     from snowflake.ml.modeling.tune import get_tuner_context
#     import xgboost as xgb
#     from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    
#     context = get_tuner_context()
#     params = context.get_hyper_params()
#     dataset_map = context.get_dataset_map()
    
#     train_df = dataset_map['train'].to_pandas()
#     val_df = dataset_map['val'].to_pandas()
    
#     X_train_hpo = train_df[global_categorical_features + global_numerical_features]
#     y_train_hpo = train_df['future_12m_ltv']
    
#     X_val_hpo = val_df[global_categorical_features + global_numerical_features]
#     y_val_hpo = val_df['future_12m_ltv']
    
#     X_train_processed = global_preprocessor.transform(X_train_hpo)
#     X_val_processed = global_preprocessor.transform(X_val_hpo)
    
#     # Use reasonable defaults for parameters not being tuned
#     model = xgb.XGBRegressor(
#         objective='reg:squarederror',
#         random_state=42,
#         n_jobs=-1,
#         # Tuned parameters
#         n_estimators=int(params["n_estimators"]),
#         max_depth=int(params["max_depth"]),
#         learning_rate=params["learning_rate"],
#         subsample=params["subsample"],
#         reg_lambda=params["reg_lambda"],
#         # Fixed parameters (using XGBoost defaults or good general values)
#         min_child_weight=5,        # Default: 1, using 5 for regression stability
#         colsample_bytree=0.8,      # Default: 1, using 0.8 for generalization
#         reg_alpha=0.0              # Default: 0 (no L1 regularization)
#     )
    
#     model.fit(X_train_processed, y_train_hpo)
    
#     val_pred = model.predict(X_val_processed)
#     rmse = np.sqrt(mean_squared_error(y_val_hpo, val_pred))
#     mae = mean_absolute_error(y_val_hpo, val_pred)
#     r2 = r2_score(y_val_hpo, val_pred)
#     mape = np.mean(np.abs((y_val_hpo - val_pred) / y_val_hpo.replace(0, np.nan))) * 100
    
#     context.report(
#         metrics={"rmse": rmse, "mae": mae, "r2": r2, "mape": mape},
#         model=model
#     )

# # Configure HPO - OPTIMIZED for speed while maintaining quality
# # Best practices: 2x trials per hyperparameter minimum (5 params × 2 = 10 trials)
# tuner_config = tune.TunerConfig(
#     metric="rmse",
#     mode="min",
#     search_alg=BayesOpt(
#         utility_kwargs={"kind": "ucb", "kappa": 2.5, "xi": 0.0}
#     ),
#     num_trials=10,              # Reduced from 20 (sufficient for 5 params)
#     max_concurrent_trials=4
# )

# dataset_map = {
#     "train": train_connector,
#     "val": val_connector
# }

# print("Starting optimized hyperparameter tuning...")
# print(f"  Strategy: Bayesian Optimization with focused search space")
# print(f"  Search space: {len(search_space)} key hyperparameters (reduced from 8)")
# print(f"  Total trials: {tuner_config.num_trials} (reduced from 20)")
# print(f"  Concurrent trials: {tuner_config.max_concurrent_trials}")
# print(f"  Expected speedup: ~2x faster while maintaining model quality")
# print(f"\nTuning parameters:")
# for param in search_space.keys():
#     print(f"  - {param}")

# # Run HPO
# tuner = tune.Tuner(train_func, search_space, tuner_config)
# tuner_results = tuner.run(dataset_map=dataset_map)

# print("\n✓ Hyperparameter optimization completed!")

# # Extract best result - it's a DataFrame with 1 row
# best_result_df = tuner_results.best_result
# best_result_row = best_result_df.iloc[0]  # Get the first (and only) row as a Series

# # Extract and display hyperparameters
# print(f"\nBest hyperparameters:")
# best_params = {}
# for col in best_result_df.columns:
#     if col.startswith('config/'):
#         param_name = col.replace('config/', '')
#         value = float(best_result_row[col])
#         best_params[param_name] = value
#         print(f"  {param_name}: {value:.4f}")

# # Extract metrics
# best_metrics = {
#     'rmse': float(best_result_row['rmse']),
#     'mae': float(best_result_row['mae']),
#     'r2': float(best_result_row['r2']),
#     'mape': float(best_result_row['mape'])
# }

# print(f"\nBest validation metrics:")
# print(f"  RMSE: ${best_metrics['rmse']:.2f}")
# print(f"  MAE: ${best_metrics['mae']:.2f}")
# print(f"  R²: {best_metrics['r2']:.4f}")
# print(f"  MAPE: {best_metrics['mape']:.2f}%")

# # Train final model with best hyperparameters
# best_model = xgb.XGBRegressor(
#     objective='reg:squarederror',
#     random_state=42,
#     n_jobs=-1,
#     # Best tuned parameters
#     n_estimators=int(best_params["n_estimators"]),
#     max_depth=int(best_params["max_depth"]),
#     learning_rate=best_params["learning_rate"],
#     subsample=best_params["subsample"],
#     reg_lambda=best_params["reg_lambda"],
#     # Fixed parameters (same as training function)
#     min_child_weight=5,
#     colsample_bytree=0.8,
#     reg_alpha=0.0
# )

# best_model.fit(X_train_processed, y_train)
# print("\n✓ Final model trained with best hyperparameters")

In [None]:
y_train_pred = best_model.predict(X_train_processed)
y_val_pred = best_model.predict(X_val_processed)
y_test_pred = best_model.predict(X_test_processed)

def evaluate_model(y_true, y_pred, dataset_name):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true.replace(0, np.nan))) * 100
    
    print(f"\n{dataset_name} Metrics:")
    print(f"  RMSE: ${rmse:.2f}")
    print(f"  MAE: ${mae:.2f}")
    print(f"  R²: {r2:.4f}")
    print(f"  MAPE: {mape:.2f}%")
    
    return {'rmse': rmse, 'mae': mae, 'r2': r2, 'mape': mape}

train_metrics = evaluate_model(y_train, y_train_pred, "Train")
val_metrics = evaluate_model(y_val, y_val_pred, "Validation")
test_metrics = evaluate_model(y_test, y_test_pred, "Test")

generalization_gap = train_metrics['r2'] - test_metrics['r2']
print(f"\nGeneralization gap (Train R² - Test R²): {generalization_gap:.4f}")

if generalization_gap > 0.1:
    print("⚠️ Warning: Significant overfitting detected")
elif generalization_gap > 0.05:
    print("⚠️ Caution: Moderate overfitting detected")
else:
    print("✓ Model shows excellent generalization")
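
The two less obvious pieces of the evaluation, MAPE with zero targets excluded and the generalization-gap thresholds, can be checked in isolation. This is a plain-Python sketch with hypothetical numbers; the function names are illustrative and not part of the notebook.

```python
import math
from typing import Sequence

def mape_ignoring_zeros(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error over rows whose true value is non-zero."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        return math.nan
    return 100 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

def generalization_label(train_r2: float, test_r2: float) -> str:
    """Apply the same 0.10 and 0.05 thresholds to the train-minus-test R² gap."""
    gap = train_r2 - test_r2
    if gap > 0.1:
        return "significant overfitting"
    if gap > 0.05:
        return "moderate overfitting"
    return "good generalization"

# Hypothetical values: the customer with a true LTV of 0 is skipped by the MAPE
mape = mape_ignoring_zeros([100.0, 0.0, 200.0], [110.0, 5.0, 180.0])
print(f"MAPE: {mape:.2f}%")
print(generalization_label(0.92, 0.80))
```

Note that customers with a true LTV of zero do not contribute to MAPE at all, so RMSE and MAE are the better guides to how the model behaves on them.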

## Create Full Pipeline for Deployment


In [None]:
# Create complete pipeline with preprocessor and best model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', best_model)
])

# Fit on full training set
full_pipeline.fit(X_train, y_train)

# Verify pipeline works
pipeline_test_pred = full_pipeline.predict(X_test)
pipeline_test_rmse = np.sqrt(mean_squared_error(y_test, pipeline_test_pred))
print(f"Pipeline test RMSE: ${pipeline_test_rmse:.2f}")
print("✓ Full pipeline ready for deployment")


## Prediction Distribution Analysis

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(y_test, y_test_pred, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Future LTV ($)')
axes[0].set_ylabel('Predicted Future LTV ($)')
axes[0].set_title('Actual vs Predicted CLV')
axes[0].grid(True, alpha=0.3)

residuals = y_test - y_test_pred
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Prediction Error ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Prediction Errors')
axes[1].axvline(0, color='r', linestyle='--', lw=2)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('continuous_model_predictions.png')
plt.show()

## Feature Importance Analysis

In [None]:
feature_names = (
    numerical_features + 
    list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))
)

feature_importance_df = pd.DataFrame({
    'FEATURE': feature_names,
    'IMPORTANCE': best_model.feature_importances_
}).sort_values('IMPORTANCE', ascending=False)

print("\nTop 20 Most Important Features:")
print(feature_importance_df.head(20))

plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance_df.head(20), x='IMPORTANCE', y='FEATURE')
plt.title('Top 20 Feature Importances - Continuous CLV Model')
plt.xlabel('Importance')
plt.tight_layout()
plt.savefig('continuous_feature_importance.png')
plt.show()

## Deploy to Snowflake Model Registry

**Deployment Strategy**:
- Use Snowflake Model Registry for versioning and management
- Deploy to both WAREHOUSE (SQL inference) and SPCS (Python inference)
- Register with sample input for schema inference
- Include comprehensive metrics for tracking
- Feature Store lineage automatically tracked

**Target Platforms**:
- **WAREHOUSE**: Enables SQL-based inference (e.g., `SELECT CONTINUOUS_CLV_MODEL!PREDICT(...)`)
- **SNOWPARK_CONTAINER_SERVICES**: Enables Python inference in containers


In [None]:
# Initialize Model Registry with active session
registry = Registry(session=session)

# Prepare sample input for schema inference
sample_input = X_train.head(100)

# Log model to registry with both target platforms and HPO results
model_version = registry.log_model(
    model=full_pipeline,
    model_name="CONTINUOUS_CLV_MODEL",
    version_name="V1",
    comment="Continuous CLV model with HPO and Feature Store - supports Warehouse and SPCS inference",
    metrics={
        "test_rmse": float(test_metrics['rmse']),
        "test_mae": float(test_metrics['mae']),
        "test_r2": float(test_metrics['r2']),
        "test_mape": float(test_metrics['mape']),
        "train_r2": float(train_metrics['r2']),
        "generalization_gap": float(generalization_gap),
        # Cross-validated RMSE of the best local RandomizedSearchCV candidate;
        # with the distributed HPO option, use best_metrics['rmse'] instead
        "hpo_best_rmse": float(-search.best_score_)
    },
    sample_input_data=sample_input,
    task=task.Task.TABULAR_REGRESSION,
    target_platforms=[
        TargetPlatform.WAREHOUSE,                    # SQL inference
        TargetPlatform.SNOWPARK_CONTAINER_SERVICES   # Python inference in containers
    ]
)

print(f"\n✓ Model registered successfully!")
print(f"  Database: {DATABASE}")
print(f"  Schema: {SCHEMA}")
print(f"  Model: CONTINUOUS_CLV_MODEL")
print(f"  Version: V1")
print(f"  Target Platforms:")
print(f"    - WAREHOUSE (SQL inference)")
print(f"    - SNOWPARK_CONTAINER_SERVICES (Python inference)")
print(f"  Features: From Feature Store with automatic lineage")
print(f"  HPO: Randomized search with {search.n_iter} sampled configurations")


## Use Feature Store for Continuous Inference

**Best Practice: Use Same Feature Views for Training AND Inference**

The Snowflake Feature Store ensures training-serving consistency by using the same feature views for both:
- **Training**: `fs.generate_dataset()` creates the training Dataset from feature views, as done earlier in this notebook
- **Inference**: `fs.retrieve_feature_values()` retrieves features from the same views. You can also select directly from the Dynamic Table that backs each Feature View, as the Dynamic Table definition below does.

This eliminates training-serving skew and ensures predictions use identical feature logic.

**Architecture**:
1. Raw data arrives in staging tables (transactions, interactions)
2. Feature Store feature views automatically refresh (daily)
3. Inference uses `fs.retrieve_feature_values()` or selects from the Feature View using SQL to get updated features
4. Model scores customers using consistent feature definitions

**Benefits**:
- ✓ Single source of truth for feature definitions
- ✓ Automatic refresh (1 day schedule)
- ✓ No feature computation duplication
- ✓ Feature lineage tracked end-to-end
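- ✓ Consistent feature logic between training and inference, provided the features recomputed in Python in this notebook match their definitions in DERIVED_FEATURES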

## Automatic SQL-based Inference with Dynamic Tables

**Best of Both Worlds: Feature Store + Dynamic Tables + SQL Inference**

Since Feature Views are backed by Dynamic Tables or Views, you can create a Dynamic Table that:
1. Queries Feature Views directly
2. Uses MODEL!PREDICT() for SQL-based inference  
3. Automatically refreshes on schedule

This gives automatic, scheduled inference in SQL, with no Python code to run at prediction time. With `REFRESH_MODE = AUTO`, Snowflake decides whether each refresh is incremental or full.
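
Because the model takes several dozen inputs, the argument list for the prediction call is easy to get wrong by hand. The sketch below generates it from a feature-to-alias mapping. The helper `build_predict_args` and the three-entry mapping are assumptions for illustration; the `name => alias.name` form mirrors the SQL used in this notebook, and whether the registered model accepts named arguments depends on its signature.

```python
from typing import Dict

def build_predict_args(feature_to_alias: Dict[str, str]) -> str:
    """Render one `feature => alias.feature` line per model input."""
    return ",\n".join(
        f"{name} => {alias}.{name}" for name, alias in feature_to_alias.items()
    )

# Hypothetical subset: feature name -> table alias in the FROM clause
feature_to_alias = {
    "age_group": "c",
    "recency_days": "rfm",
    "recent_30d_count": "pp",
}

args_sql = build_predict_args(feature_to_alias)
print(args_sql)
```

Generating the list from the same `categorical_features` and `numerical_features` lists used for training keeps the SQL call aligned with the model's inputs when features are added or removed.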

In [None]:
# Create a Dynamic Table that queries the Feature Views (including DERIVED_FEATURES) and calls MODEL!PREDICT()

create_auto_inference_dt = f"""
CREATE OR REPLACE DYNAMIC TABLE {DATABASE}.{SCHEMA}.CONTINUOUS_CLV_PREDICTIONS_AUTO
    TARGET_LAG = '1 hour'
    WAREHOUSE = {session.get_current_warehouse()}
    REFRESH_MODE = AUTO
AS
SELECT 
    rfm.customer_id,
    -- Include key features for context
    rfm.recency_days,
    rfm.frequency,
    rfm.monetary_total,
    df.rfm_score,
    df.lifecycle_stage,
    -- Call the model directly in SQL with ALL features!
    {DATABASE}.{SCHEMA}.CONTINUOUS_CLV_MODEL!PREDICT(
        -- Customer profile features
        age_group => c.age_group,
        region => c.region,
        segment => c.segment,
        -- RFM base features
        recency_days => rfm.recency_days,
        frequency => rfm.frequency,
        monetary_total => rfm.monetary_total,
        monetary_avg => rfm.monetary_avg,
        customer_tenure_days => rfm.customer_tenure_days,
        -- Purchase pattern features
        unique_categories_purchased => pp.unique_categories_purchased,
        total_items_purchased => pp.total_items_purchased,
        recent_30d_amount => pp.recent_30d_amount,
        recent_30d_count => pp.recent_30d_count,
        recent_90d_amount => pp.recent_90d_amount,
        recent_90d_count => pp.recent_90d_count,
        -- Engagement features
        total_interactions => ef.total_interactions,
        website_visits => ef.website_visits,
        email_opens => ef.email_opens,
        email_clicks => ef.email_clicks,
        support_tickets => ef.support_tickets,
        product_views => ef.product_views,
        cart_adds => ef.cart_adds,
        email_engagement_rate => ef.email_engagement_rate,
        -- Derived features (from DERIVED_FEATURES view)
        rfm_score => df.rfm_score,
        lifecycle_stage => df.lifecycle_stage,
        purchase_consistency => df.purchase_consistency,
        purchase_velocity_30d => df.purchase_velocity_30d,
        purchase_velocity_90d => df.purchase_velocity_90d,
        spending_velocity_30d => df.spending_velocity_30d,
        spending_velocity_90d => df.spending_velocity_90d,
        velocity_acceleration => df.velocity_acceleration,
        engagement_per_purchase => df.engagement_per_purchase,
        support_intensity => df.support_intensity,
        tenure_cohort => df.tenure_cohort,
        monetary_vs_cohort => df.monetary_vs_cohort,
        frequency_vs_cohort => df.frequency_vs_cohort
    ) AS predicted_12m_ltv,
    CURRENT_TIMESTAMP() AS prediction_timestamp
FROM {DATABASE}.CLV_FEATURE_STORE.RFM_FEATURES rfm
LEFT JOIN {DATABASE}.CLV_FEATURE_STORE.PURCHASE_PATTERNS pp 
    ON rfm.customer_id = pp.customer_id
LEFT JOIN {DATABASE}.CLV_FEATURE_STORE.ENGAGEMENT_FEATURES ef 
    ON rfm.customer_id = ef.customer_id
LEFT JOIN {DATABASE}.CLV_FEATURE_STORE.DERIVED_FEATURES df
    ON rfm.customer_id = df.customer_id
LEFT JOIN {DATABASE}.{SCHEMA}.CONTINUOUS_CUSTOMERS_PROFILE c
    ON rfm.customer_id = c.customer_id
"""

try:
    session.sql(create_auto_inference_dt).collect()
    print(f"✓ Dynamic Table created: {DATABASE}.{SCHEMA}.CONTINUOUS_CLV_PREDICTIONS_AUTO")
    print(f"✓ Queries ALL 4 Feature Views:")
    print(f"  - RFM_FEATURES")
    print(f"  - PURCHASE_PATTERNS")
    print(f"  - ENGAGEMENT_FEATURES")
    print(f"  - DERIVED_FEATURES ⭐")
    
    # Show sample predictions
    print("\nSample predictions from automatic Dynamic Table:")
    sample_df = session.table(f"{DATABASE}.{SCHEMA}.CONTINUOUS_CLV_PREDICTIONS_AUTO").limit(10)
    sample_df.show()
    
except Exception as e:
    print(f"⚠️ Note: Dynamic Table creation encountered an issue")
    print(f"Error: {str(e)}")
    print("\nThis is expected if:")
    print("  - Model signature doesn't match named parameters")
    print("  - Need to re-train model with new DERIVED_FEATURES")
    print("\nTo fix:")
    print("  1. Re-run feature_engineering_continuous.ipynb to create DERIVED_FEATURES view")
    print("  2. Re-run this notebook to train model with ALL features")
    print("  3. Then this Dynamic Table will work!")
    print(f"\nRun: SHOW FUNCTIONS IN MODEL {DATABASE}.{SCHEMA}.CONTINUOUS_CLV_MODEL")
    print("     to see the exact function signature")

## Summary

This notebook accomplished:

1. ✓ **Dataset Generation**: Used `fs.generate_dataset()` to create immutable, versioned snapshot
2. ✓ **Additional Feature Engineering**: Created Python-based transformations for model training
3. ✓ **Optimized Model Training**: XGBoost with focused hyperparameter tuning (10 trials, 5 key parameters)
4. ✓ **Overfitting Prevention**: Row subsampling, L2 regularization, and a train-vs-test R² gap check
5. ✓ **Model Evaluation**: Comprehensive metrics across train/val/test sets
6. ✓ **Deployment**: Registered to Snowflake Model Registry with WAREHOUSE and SPCS platforms
7. ✓ **Automatic Inference**: Created Dynamic Table for SQL-based continuous predictions

## Hyperparameter Optimization Strategy

**Optimized Approach**:
- **5 key hyperparameters** tuned (reduced from 8 in an earlier configuration)
- **10 trials** (reduced from 20): `RandomizedSearchCV` in the default local cell, Bayesian Optimization in the commented-out distributed option
- **~2x faster** tuning expected from halving the trial count

**Parameters Tuned**:
1. `n_estimators` (100-300): Number of trees - HIGH impact on performance
2. `max_depth` (4-8): Tree complexity - HIGH impact on overfitting
3. `learning_rate` (0.01-0.3): Shrinkage factor - HIGH impact on convergence
4. `subsample` (0.7-0.9): Row sampling - prevents overfitting
5. `reg_lambda` (0.5-2.0): L2 regularization - prevents overfitting

**Parameters Fixed** (distributed option only; the local search leaves these at XGBoost defaults):
- `min_child_weight=5`: Regression stability
- `colsample_bytree=0.8`: Feature sampling for generalization  
- `reg_alpha=0.0`: No L1 regularization

**Rationale**:
- Trial budget of 2x the number of hyperparameters (5 params × 2 = 10 trials minimum)
- Focuses on the parameters that usually matter most when tuning XGBoost
- Balances training speed with model quality
- Keeps compute usage low for a first tuning pass

## Architecture: Clear Separation of Concerns

**feature_engineering_continuous.ipynb**:
- Creates 4 Feature Views in Feature Store (automatic refresh)
- Creates spine with target variable (`CONTINUOUS_CUSTOMERS_PROFILE_WITH_TARGET`)
- Does NOT create training dataset

**train_continuous_model.ipynb** (this notebook):
- Generates **Snowflake Dataset** from Feature Views using `fs.generate_dataset()`
- Trains model with optimized hyperparameter tuning
- Registers model to Model Registry with automatic lineage
- Creates automatic inference Dynamic Table

## Why Datasets Instead of Tables?

Per Snowflake best practices, **Datasets are the recommended approach** for ML training data:

**Snowflake Datasets (`fs.generate_dataset()`)** ✅ RECOMMENDED:
- **Immutable snapshots** guarantee reproducibility
- **Automatic versioning** (v1, v2, v3) tracks data evolution
- **ML lineage tracking** links Feature Store → Dataset → Model Registry
- **Schema-level objects** designed specifically for ML workflows
- **Efficient storage** (Parquet files) for distributed training
- **Framework integration** with TensorFlow, PyTorch, Snowpark ML
- **No manual saves needed** - automatically materialized

**Manual table saves** ❌ NOT RECOMMENDED:
- No immutability guarantees
- No version management
- Limited metadata support
- No automatic ML lineage
- Requires manual table writes
- Risk of data being overwritten

**Reference**: See [Model Training and Inference](https://docs.snowflake.com/en/developer-guide/snowflake-ml/feature-store/modeling) docs

## Dataset Benefits

Each time you run training:
- Create new Dataset version (v2, v3, etc.)
- Immutable snapshot ensures reproducibility
- ML lineage automatically tracks: Raw Data → Feature Views → Dataset → Model
- Can always reproduce any model by going back to its Dataset version
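
When retraining on a schedule, the next version name can be derived rather than typed by hand. The helper below is a hypothetical sketch: it assumes version names follow the `v<number>` pattern used in this notebook and that the list of existing versions is obtained elsewhere.

```python
import re
from typing import Iterable

def next_dataset_version(existing: Iterable[str]) -> str:
    """Return the next 'v<N>' name, ignoring names that do not match the pattern."""
    numbers = [
        int(match.group(1))
        for name in existing
        if (match := re.fullmatch(r"v(\d+)", name))
    ]
    return f"v{max(numbers, default=0) + 1}"

print(next_dataset_version(["v1", "v2"]))         # v3
print(next_dataset_version([]))                   # v1
print(next_dataset_version(["v1", "v10", "v9"]))  # v11
```

Comparing the numeric part instead of the raw string avoids ordering `v10` before `v9`.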

## Production Workflow (Recommended)

```
Raw Data → Base Tables
    ↓
Feature Store Views (refresh: 1 day) [feature_engineering_continuous.ipynb]
    ↓
Snowflake Dataset (immutable v1, v2, ...) [train_continuous_model.ipynb]
    ↓
Optimized Model Training + Registry (10 trials, 5 params)
    ↓
Inference Dynamic Table (refresh: 1 hour)
    ↓
CONTINUOUS_CLV_PREDICTIONS_AUTO
```

**Result**: Fully reproducible, auditable ML pipeline with complete lineage!

## Next Steps

- Monitor Dynamic Table refresh history and lag
- When retraining: Create new Dataset version (v2) for lineage
- Set up alerts for refresh failures
- Integrate predictions into downstream applications
- Consider A/B testing CLV-based strategies
- If more tuning needed: Gradually increase trials or add parameters