# Lereta Intelligence Agent - ML Models

**Training 3 Machine Learning Models for Tax & Flood Intelligence**

This notebook trains 3 ML models for the Lereta Intelligence Agent:
1. **TAX_DELINQUENCY_PREDICTOR** - Predicts property tax delinquency risk
2. **CLIENT_CHURN_PREDICTOR** - Predicts client churn risk
3. **LOAN_RISK_CLASSIFIER** - Classifies loans by risk level (LOW/MEDIUM/HIGH)

---

## Prerequisites

**⚠️ IMPORTANT: Execute files 01-04 BEFORE running this notebook!**

This notebook requires feature views created in file 04:
- **V_TAX_DELINQUENCY_FEATURES** (created in sql/views/04_create_views.sql)
- **V_CLIENT_CHURN_FEATURES** (created in sql/views/04_create_views.sql)
- **V_LOAN_RISK_FEATURES** (created in sql/views/04_create_views.sql)

**Correct Execution Order**:
1. Execute sql/setup/01_database_and_schema.sql
2. Execute sql/setup/02_create_tables.sql
3. Execute sql/data/03_generate_synthetic_data.sql (10-20 min)
4. Execute sql/views/04_create_views.sql ← **Creates ML feature views**
5. Run this notebook (trains models)
6. Execute sql/ml/07_ml_model_wrappers.sql (wraps trained models)
7. Execute sql/agent/08_create_ai_agent.sql
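
The prerequisite above can be checked before any training starts. The following is a minimal sketch, not part of the original notebook: `missing_feature_views` and `REQUIRED_VIEWS` are hypothetical names, and the list of existing views is passed in as plain strings so the logic can be read without a Snowflake connection.

```python
# Hypothetical helper (not in the original notebook): report which of the
# three required feature views are absent from a list of existing view names.
REQUIRED_VIEWS = [
    "V_TAX_DELINQUENCY_FEATURES",
    "V_CLIENT_CHURN_FEATURES",
    "V_LOAN_RISK_FEATURES",
]

def missing_feature_views(existing_views, required=REQUIRED_VIEWS):
    """Return the required view names absent from existing_views (case-insensitive)."""
    existing = {name.upper() for name in existing_views}
    return [name for name in required if name.upper() not in existing]

# Example input: only two of the three views exist, so file 04 did not finish.
found = ["v_tax_delinquency_features", "V_LOAN_RISK_FEATURES"]
missing = missing_feature_views(found)
if missing:
    print("Missing feature views, run sql/views/04_create_views.sql:", ", ".join(missing))
```

In the notebook, the existing names could plausibly be read from the `name` column of a `SHOW VIEWS IN SCHEMA LERETA_INTELLIGENCE.ANALYTICS` query; that wiring is an assumption and is left out here.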

**Packages**: `snowflake-ml-python=1.19.0`, `scikit-learn=1.6.1`, `pandas=2.2.3`


## Setup and Imports


In [None]:
# Import required libraries
from snowflake.snowpark import Session
from snowflake.ml.modeling.ensemble import RandomForestClassifier
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.registry import Registry
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")


In [None]:
# Get current session
session = Session.builder.getOrCreate()

# Set context
session.use_database("LERETA_INTELLIGENCE")
session.use_schema("ML_MODELS")
session.use_warehouse("LERETA_WH")

print("✅ Session configured")
print(f"Database: {session.get_current_database()}")
print(f"Schema: {session.get_current_schema()}")
print(f"Warehouse: {session.get_current_warehouse()}")


In [None]:
# Initialize Model Registry
registry = Registry(
    session=session,
    database_name="LERETA_INTELLIGENCE",
    schema_name="ML_MODELS"
)

print("✅ Model Registry initialized")


## Model 1: Tax Delinquency Prediction

### Business Problem
Predict which properties are likely to become delinquent on property taxes in the next 90 days. This enables proactive outreach to clients and borrowers to prevent delinquencies.

### Features
- Property characteristics (type, assessed value)
- Tax amount and jurisdiction
- Historical payment patterns
- Days since last payment
- Client service quality score
- Loan characteristics


In [None]:
# Verify feature view exists, then load data
try:
    tax_delinquency_df = session.table("LERETA_INTELLIGENCE.ANALYTICS.V_TAX_DELINQUENCY_FEATURES")
    record_count = tax_delinquency_df.count()
    print(f"✅ Loaded {record_count} records for tax delinquency prediction")
    tax_delinquency_df.show(5)
except Exception as e:
    print("❌ ERROR: V_TAX_DELINQUENCY_FEATURES view not found!")
    print("Please execute sql/views/04_create_views.sql before running this notebook.")
    print(f"Error details: {str(e)}")
    raise


In [None]:
# Split data for training and testing
train_df, test_df = tax_delinquency_df.random_split([0.8, 0.2], seed=42)

# Drop ID column not needed for training
train_df = train_df.drop("TAX_RECORD_ID")
test_df = test_df.drop("TAX_RECORD_ID")

print(f"Training set: {train_df.count()} records")
print(f"Test set: {test_df.count()} records")

# Create tax delinquency prediction pipeline - optimized for speed
tax_delinquency_pipeline = Pipeline([
    ("Encoder", OneHotEncoder(
        input_cols=["PROPERTY_TYPE", "FLOOD_ZONE", "JURISDICTION_TYPE", "LOAN_TYPE", "CLIENT_TYPE", "LOAN_STATUS", "CLIENT_STATUS"],
        output_cols=["PROPERTY_TYPE_ENC", "FLOOD_ZONE_ENC", "JURISDICTION_TYPE_ENC", "LOAN_TYPE_ENC", "CLIENT_TYPE_ENC", "LOAN_STATUS_ENC", "CLIENT_STATUS_ENC"],
        drop_input_cols=True,
        handle_unknown="ignore"
    )),
    ("Classifier", RandomForestClassifier(
        label_cols=["ACTUAL_DELINQUENT"],
        output_cols=["PREDICTED_DELINQUENT"],
        n_estimators=10,
        max_depth=5,
        random_state=42
    ))
])

print("✅ Tax delinquency pipeline created (optimized for speed)")

# Train the model
print("Training tax delinquency prediction model...")
tax_delinquency_pipeline.fit(train_df)
print("✅ Tax delinquency model trained")
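
The encoder in the pipeline above is configured with `handle_unknown="ignore"`. As a rough illustration of what that setting means, the pure-Python sketch below encodes a category that was not present at fit time as an all-zero vector instead of raising an error. The helper names and the flood zone values are illustrative assumptions, and this is not the Snowpark ML implementation.

```python
def fit_categories(values):
    """Collect the distinct categories seen during fitting, in sorted order."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; an unseen category yields all zeros ('ignore' behavior)."""
    return [1 if value == category else 0 for category in categories]

# Illustrative values only; the real FLOOD_ZONE categories come from the feature view.
train_flood_zones = ["A", "X", "AE", "X"]
categories = fit_categories(train_flood_zones)

seen = one_hot("AE", categories)     # category present at fit time
unseen = one_hot("VE", categories)   # category absent at fit time
print(categories, seen, unseen)
```

The practical consequence is that a new category appearing in the test split or in later scoring data does not break prediction, but it also carries no signal for the classifier.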


In [None]:
# Evaluate model on test set
test_predictions = tax_delinquency_pipeline.predict(test_df)
test_results = test_predictions.select("ACTUAL_DELINQUENT", "PREDICTED_DELINQUENT").to_pandas()

from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(test_results['ACTUAL_DELINQUENT'], test_results['PREDICTED_DELINQUENT'])

print(f"Test Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(
    test_results['ACTUAL_DELINQUENT'], 
    test_results['PREDICTED_DELINQUENT']
))


In [None]:
# Delete existing model if it exists to force fresh registration
try:
    registry.delete_model("TAX_DELINQUENCY_PREDICTOR")
    print("✅ Deleted existing TAX_DELINQUENCY_PREDICTOR")
except Exception:
    print("No existing model to delete")

# Register model in Model Registry
# Drop label column from sample data - model signature should only include features
sample_data = train_df.drop("ACTUAL_DELINQUENT").limit(100)

registry.log_model(
    model=tax_delinquency_pipeline,
    model_name="TAX_DELINQUENCY_PREDICTOR",
    target_platforms=['WAREHOUSE'],
    sample_input_data=sample_data,
    comment="Predicts property tax delinquency risk"
)

print("✅ TAX_DELINQUENCY_PREDICTOR registered in Model Registry")


## Model 2: Client Churn Prediction

### Business Problem
Identify clients (financial institutions) at risk of canceling their Lereta subscriptions. This enables proactive retention efforts and improved customer success.

### Features
- Subscription characteristics (tier, billing cycle, property count)
- Service utilization patterns
- Support ticket volume and satisfaction
- Revenue and transaction trends
- Client profile and status


In [None]:
# Load client churn feature data from pre-built view
client_churn_df = session.table("LERETA_INTELLIGENCE.ANALYTICS.V_CLIENT_CHURN_FEATURES")

print(f"✅ Loaded {client_churn_df.count()} records for client churn prediction")
client_churn_df.show(5)


In [None]:
# Feature engineering for churn prediction
categorical_features_churn = ['CLIENT_TYPE', 'SERVICE_TYPE', 'SUBSCRIPTION_TIER', 'BILLING_CYCLE']
numerical_features_churn = [
    'SERVICE_QUALITY_SCORE', 'TOTAL_PROPERTIES', 'LIFETIME_VALUE', 'MONTHS_AS_CLIENT',
    'MONTHLY_PRICE', 'PROPERTY_COUNT_LIMIT', 'USER_LICENSES', 'SUBSCRIPTION_DURATION_DAYS',
    'TOTAL_SUPPORT_TICKETS', 'AVG_SATISFACTION_RATING', 'AVG_RESOLUTION_TIME', 'OPEN_TICKETS',
    'TOTAL_TRANSACTIONS', 'TOTAL_REVENUE', 'AVG_TRANSACTION_AMOUNT'
]

# Split data
train_churn_df, test_churn_df = client_churn_df.random_split([0.8, 0.2], seed=42)

print(f"Training set: {train_churn_df.count()} rows")
print(f"Test set: {test_churn_df.count()} rows")

# Build churn prediction pipeline with XGBoost
client_churn_pipeline = Pipeline(
    steps=[
        ('encoder', OrdinalEncoder(input_cols=categorical_features_churn, output_cols=categorical_features_churn)),
        ('scaler', StandardScaler(input_cols=numerical_features_churn, output_cols=numerical_features_churn)),
        ('classifier', XGBClassifier(
            input_cols=categorical_features_churn + numerical_features_churn,
            label_cols=['IS_CHURNED'],
            n_estimators=150,
            max_depth=8,
            learning_rate=0.1,
            random_state=42
        ))
    ]
)

# Train the model
print("Training Client Churn Prediction Model...")
client_churn_model = client_churn_pipeline.fit(train_churn_df)
print("✅ Model trained successfully!")


In [None]:
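# Evaluate churn model
# Snowpark ML metrics operate directly on Snowpark DataFrames. They are aliased
# so they do not clash with the scikit-learn accuracy_score imported earlier.
from snowflake.ml.modeling.metrics import (
    accuracy_score as sp_accuracy_score,
    precision_score as sp_precision_score,
    recall_score as sp_recall_score,
    f1_score as sp_f1_score,
)

churn_predictions_df = client_churn_model.predict(test_churn_df)

print("\n=== Client Churn Model Performance ===")
accuracy_churn = sp_accuracy_score(df=churn_predictions_df, y_true_col_names='IS_CHURNED', y_pred_col_names='OUTPUT_IS_CHURNED')
precision_churn = sp_precision_score(df=churn_predictions_df, y_true_col_names='IS_CHURNED', y_pred_col_names='OUTPUT_IS_CHURNED')
recall_churn = sp_recall_score(df=churn_predictions_df, y_true_col_names='IS_CHURNED', y_pred_col_names='OUTPUT_IS_CHURNED')
f1_churn = sp_f1_score(df=churn_predictions_df, y_true_col_names='IS_CHURNED', y_pred_col_names='OUTPUT_IS_CHURNED')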

print(f"Accuracy:  {accuracy_churn:.4f}")
print(f"Precision: {precision_churn:.4f}")
print(f"Recall:    {recall_churn:.4f}")
print(f"F1 Score:  {f1_churn:.4f}")

churn_predictions_df.select('CLIENT_ID', 'IS_CHURNED', 'OUTPUT_IS_CHURNED').show(10)
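
The four churn metrics reported above all derive from the same counts of true positives, false positives, and false negatives. The sketch below, which uses made-up labels and a hypothetical `binary_metrics` helper, shows the arithmetic and why accuracy alone can look acceptable when churned clients are a minority.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 3 of 10 clients churned (1 = churned).
actual = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
model_pred = [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
never_churn = [0] * len(actual)  # baseline that never flags anyone

model_scores = binary_metrics(actual, model_pred)
baseline_scores = binary_metrics(actual, never_churn)
print("model:   ", model_scores)
print("baseline:", baseline_scores)
```

The baseline that never predicts churn still reaches 70% accuracy on this sample while its recall is zero, which is why recall and F1 are worth reading alongside accuracy for a retention use case.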


In [None]:
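# Register churn model
# Note: the registry was initialized against ML_MODELS, so the model is logged
# there; this call only changes the session's current schema.
session.use_schema("ANALYTICS")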

model_name_churn = "CLIENT_CHURN_PREDICTOR"
model_version_churn = "v1"

print(f"Registering model: {model_name_churn}_{model_version_churn}")
registry.log_model(
    model=client_churn_model,
    model_name=model_name_churn,
    version_name=model_version_churn,
    comment="Predicts client churn risk using XGBoost. Features: subscription metrics, support satisfaction, revenue trends, service utilization."
)

print(f"✅ Model {model_name_churn} version {model_version_churn} registered successfully!")


## Model 3: Loan Risk Classification

### Business Problem
Classify loans into risk categories (Low, Medium, High) based on tax compliance, flood zone risks, and property characteristics. This helps prioritize monitoring efforts and identify high-risk portfolios.

### Features
- Flood zone and insurance requirements
- Tax payment history and delinquency status
- Loan-to-value ratio
- Property type and assessed value
- Escrow account status
- Jurisdiction tax rates


In [None]:
# Load loan risk feature data from pre-built view
loan_risk_df = session.table("LERETA_INTELLIGENCE.ANALYTICS.V_LOAN_RISK_FEATURES")

print(f"✅ Loaded {loan_risk_df.count()} records for loan risk classification")
loan_risk_df.show(5)


In [None]:
# Feature engineering for loan risk classification
categorical_features_risk = [
    'LOAN_TYPE', 'LOAN_STATUS', 'PROPERTY_TYPE', 'FLOOD_ZONE', 
    'PROPERTY_STATE', 'JURISDICTION_TYPE', 'CLIENT_TYPE'
]
numerical_features_risk = [
    'LOAN_AMOUNT', 'LOAN_AGE_MONTHS', 'LOAN_TO_VALUE_RATIO', 'ASSESSED_VALUE',
    'HIGH_FLOOD_RISK', 'TAX_AMOUNT', 'PENALTY_AMOUNT', 'TAX_RATE',
    'TAX_PAID_ON_TIME', 'DAYS_PAYMENT_DELAY', 'SERVICE_QUALITY_SCORE'
]

# Split data
train_risk_df, test_risk_df = loan_risk_df.random_split([0.8, 0.2], seed=42)

print(f"Training set: {train_risk_df.count()} rows")
print(f"Test set: {test_risk_df.count()} rows")

# Build loan risk classification pipeline
loan_risk_pipeline = Pipeline(
    steps=[
        ('encoder', OrdinalEncoder(input_cols=categorical_features_risk, output_cols=categorical_features_risk)),
        ('scaler', StandardScaler(input_cols=numerical_features_risk, output_cols=numerical_features_risk)),
        ('classifier', RandomForestClassifier(
            input_cols=categorical_features_risk + numerical_features_risk,
            label_cols=['RISK_LEVEL'],
            n_estimators=120,
            max_depth=12,
            random_state=42
        ))
    ]
)

# Train the model
print("Training Loan Risk Classification Model...")
loan_risk_model = loan_risk_pipeline.fit(train_risk_df)
print("✅ Model trained successfully!")


In [None]:
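# Evaluate loan risk model
# Snowpark ML accuracy_score operates on Snowpark DataFrames; aliased to avoid
# clashing with the scikit-learn accuracy_score imported earlier.
from snowflake.ml.modeling.metrics import accuracy_score as sp_accuracy_score

risk_predictions_df = loan_risk_model.predict(test_risk_df)

print("\n=== Loan Risk Classification Model Performance ===")
accuracy_risk = sp_accuracy_score(df=risk_predictions_df, y_true_col_names='RISK_LEVEL', y_pred_col_names='OUTPUT_RISK_LEVEL')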

print(f"Accuracy: {accuracy_risk:.4f}")

# Show sample predictions
risk_predictions_df.select('LOAN_ID', 'RISK_LEVEL', 'OUTPUT_RISK_LEVEL', 'FLOOD_ZONE', 'DELINQUENT').show(10)
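
The evaluation above reports a single accuracy figure for a three-class problem. A per-level breakdown can show whether HIGH risk loans in particular are being missed. The following is a minimal sketch with made-up labels; `confusion_counts` and `per_class_recall` are hypothetical helpers, not Snowpark ML functions.

```python
from collections import Counter

LEVELS = ["LOW", "MEDIUM", "HIGH"]

def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred, levels=LEVELS):
    """Share of each actual risk level that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    recall = {}
    for level in levels:
        actual_total = sum(counts[(level, predicted)] for predicted in levels)
        recall[level] = counts[(level, level)] / actual_total if actual_total else 0.0
    return recall

# Made-up labels for illustration; real values would come from risk_predictions_df.
actual = ["LOW"] * 4 + ["MEDIUM"] * 2 + ["HIGH"] * 2
predicted = ["LOW", "LOW", "LOW", "LOW", "MEDIUM", "LOW", "LOW", "HIGH"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
recall = per_class_recall(actual, predicted)
print(f"Accuracy: {accuracy:.2f}")
for level in LEVELS:
    print(f"{level:<6} recall: {recall[level]:.2f}")
```

In this sample, overall accuracy is 0.75 even though only half of the HIGH risk loans are identified, which is the kind of gap a single accuracy number hides.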


In [None]:
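# Register loan risk model
# Note: the registry was initialized against ML_MODELS, so the model is logged
# there; this call only changes the session's current schema.
session.use_schema("ANALYTICS")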

model_name_risk = "LOAN_RISK_CLASSIFIER"
model_version_risk = "v1"

print(f"Registering model: {model_name_risk}_{model_version_risk}")
registry.log_model(
    model=loan_risk_model,
    model_name=model_name_risk,
    version_name=model_version_risk,
    comment="Classifies loans into LOW/MEDIUM/HIGH risk categories using Random Forest. Features: flood zones, tax compliance, LTV ratio, property characteristics."
)

print(f"✅ Model {model_name_risk} version {model_version_risk} registered successfully!")


## Summary and Next Steps

### Models Created

1. **TAX_DELINQUENCY_PREDICTOR** (Random Forest)
   - Predicts property tax delinquency risk
   - Use case: Proactive client alerts, portfolio risk management

2. **CLIENT_CHURN_PREDICTOR** (XGBoost)
   - Predicts client subscription cancellation risk
   - Use case: Customer retention, account management prioritization

3. **LOAN_RISK_CLASSIFIER** (Random Forest)
   - Classifies loans into risk levels (LOW/MEDIUM/HIGH)
   - Use case: Portfolio risk assessment, monitoring prioritization

### Integration with AI Agent

All models are registered in the Snowflake Model Registry and can be:
- Called via SQL UDFs (created in file 07)
- Integrated with the Lereta Intelligence Agent
- Used for batch scoring and real-time predictions
- Monitored for performance drift
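
The notebook does not prescribe a drift metric. One common choice is the Population Stability Index (PSI), which compares the distribution of a feature or of predicted classes at training time with a more recent window. The sketch below assumes pre-binned counts and uses made-up numbers; `population_stability_index` is a hypothetical helper.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (actual_share - expected_share) * ln(actual_share / expected_share)."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each share at eps so empty bins do not cause log(0) or division by zero.
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        total += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return total

# Made-up counts of predicted LOW / MEDIUM / HIGH loans.
training_counts = [700, 200, 100]
recent_counts = [500, 300, 200]

drift = population_stability_index(training_counts, recent_counts)
print(f"PSI: {drift:.3f}")
```

A commonly cited rule of thumb treats values below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift; any threshold used for retraining decisions should be validated against the actual data.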

### Next Steps

1. Execute **file 07** (`sql/ml/07_ml_model_wrappers.sql`) to create SQL UDFs that wrap each model
2. Execute **file 08** (`sql/agent/08_create_ai_agent.sql`) to integrate the models with the agent
3. Test model predictions through the AI Agent interface
4. Monitor model performance and retrain as needed


In [None]:
# Verify all models in registry
print("\n=== Registered Models ===")
registry.show_models()

# Close session
session.close()
print("\n✅ All models trained and registered successfully!")
