# 🏦 Financial Services Synthetic Data Generator v1.0
## Leveraging Snowflake Native Synthetic Data Generation

### Overview
This notebook demonstrates how to generate large-scale synthetic financial services data using Snowflake's native `GENERATE_SYNTHETIC_DATA` stored procedure. It focuses on typical retail banking entities and relationships:

- Customers and KYC attributes
- Branches and products
- Accounts and transactions

Key goals:
- Statistical consistency with realistic banking distributions (balances, transaction amounts, product uptake)
- Join-key consistency across tables
- Privacy-preserving output suitable for dev/test and analytics
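
The join-key goal can be spot-checked after generation by looking for child rows whose key has no parent. The sketch below is a minimal illustration on toy rows, not part of the notebook's pipeline; `find_orphans` is a hypothetical helper, and in practice the rows would come from query results on the synthetic tables.

```python
from typing import Dict, Iterable, List, Set


def find_orphans(child_rows: Iterable[Dict[str, str]],
                 parent_rows: Iterable[Dict[str, str]],
                 key: str) -> List[str]:
    """Return the distinct child key values that have no matching parent row."""
    parent_keys: Set[str] = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


# Toy rows standing in for CUSTOMERS and ACCOUNTS query results
customers = [{"CUSTOMER_ID": "CUST000001"}, {"CUSTOMER_ID": "CUST000002"}]
accounts = [
    {"ACCOUNT_ID": "ACCT000001", "CUSTOMER_ID": "CUST000001"},
    {"ACCOUNT_ID": "ACCT000002", "CUSTOMER_ID": "CUST000003"},  # no such customer
]

orphans = find_orphans(accounts, customers, "CUSTOMER_ID")
print(orphans)  # ['CUST000003']
```

An empty result means every account points at an existing customer.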

### Data Architecture
```
📊 SEED DATA (Manual) → 🤖 SYNTHETIC DATA (Snowflake AI)
├── FINANCIAL_INSTITUTIONS (8) → (8)
├── BRANCHES (40) → (400+)
├── PRODUCTS (25) → (250+)
├── CUSTOMERS (200) → (50,000+)
├── ACCOUNTS (300) → (120,000+)
└── TRANSACTIONS (500) → (5,000,000+)
```

### Requirements
- Snowflake Enterprise Edition or higher
- Medium Snowpark-optimized warehouse recommended
- Anaconda terms accepted in the Snowflake account
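
If no suitable warehouse exists yet, one could be created up front. The sketch below only builds the statement text; the warehouse name `SYNTH_DATA_WH` and the helper `snowpark_warehouse_ddl` are assumptions for illustration, and the role running it would need the privilege to create warehouses.

```python
def snowpark_warehouse_ddl(name: str, size: str = "MEDIUM") -> str:
    """Build a CREATE WAREHOUSE statement for a Snowpark-optimized warehouse."""
    # Reject anything that is not a plain identifier before interpolating it
    if not name.replace("_", "").isalnum():
        raise ValueError(f"Unexpected warehouse name: {name!r}")
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WAREHOUSE_SIZE = '{size}' "
        f"WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'"
    )


ddl = snowpark_warehouse_ddl("SYNTH_DATA_WH")  # hypothetical warehouse name
print(ddl)
# In the notebook this could be executed with: session.sql(ddl).collect()
```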


In [None]:
# 📦 SETUP AND CONFIGURATION

import pandas as pd
import numpy as np
import random
import string
import json
import datetime as dt
from datetime import timedelta
from typing import List, Dict, Any, Optional

# Snowflake imports
from snowflake.snowpark import Session, functions as F
from snowflake.snowpark.context import get_active_session  # needed for get_active_session() below
from snowflake.snowpark.types import *

# Get active Snowflake session
session = get_active_session()

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

# Configuration for synthetic data generation (Financial Services)
CONFIG = {
    'database': 'FIN_SERV_SYNTH_DB',
    'schema': 'SEED_DATA',
    'synth_schema': 'SYNTHETIC_DATA',
    'warehouse': None,  # Auto-detect

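    # Seed sizes (small realistic datasets). Only 'seed_customers' is read by the
    # seed functions below; the other values document the intended sizes.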
    'seed_institutions': 8,
    'seed_branches': 40,
    'seed_products': 25,
    'seed_customers': 200,
    'seed_accounts': 300,
    'seed_transactions': 500,

    # Synthetic target sizes (generated by Snowflake)
    'target_customers': 50000,
    'target_accounts': 120000,
    'target_transactions': 5000000,

    # Privacy settings
    'enable_privacy_filter': True,
    'replace_output_tables': True
}

print("🚀 Financial Services Synthetic Data Generator v1.0")
print(f"📊 Database: {CONFIG['database']}")
print(f"🌱 Seed data will be created in: {CONFIG['schema']}")
print(f"🤖 Synthetic data will be generated in: {CONFIG['synth_schema']}")
print(f"🎯 Target synthetic volumes: {CONFIG['target_customers']:,} customers, {CONFIG['target_accounts']:,} accounts")
print("✅ Using Snowflake native GENERATE_SYNTHETIC_DATA")


In [None]:
# 🏗️ ENVIRONMENT SETUP

def setup_database_environment():
    """Setup database, schemas, and warehouse for synthetic data generation (Financial Services)."""
    print("🏗️ Setting up Snowflake environment...")
    
    try:
        # Auto-detect and configure warehouse
        current_wh = session.sql("SELECT CURRENT_WAREHOUSE()").collect()[0][0]
        if current_wh:
            print(f"   ✅ Using warehouse: {current_wh}")
            CONFIG['warehouse'] = current_wh
        else:
            warehouses = session.sql("SHOW WAREHOUSES").collect()
            if warehouses:
                wh_name = warehouses[0]['name']
                session.sql(f"USE WAREHOUSE {wh_name}").collect()
                CONFIG['warehouse'] = wh_name
                print(f"   🔄 Switched to warehouse: {wh_name}")
            else:
                raise Exception("No warehouses available")
        
        # Create database and schemas
        print(f"   🏗️ Creating database: {CONFIG['database']}")
        session.sql(f"CREATE DATABASE IF NOT EXISTS {CONFIG['database']}").collect()
        session.sql(f"USE DATABASE {CONFIG['database']}").collect()
        
        print(f"   📁 Creating schemas...")
        session.sql(f"CREATE SCHEMA IF NOT EXISTS {CONFIG['schema']}").collect()
        session.sql(f"CREATE SCHEMA IF NOT EXISTS {CONFIG['synth_schema']}").collect()
        
        # Set working schema to seed data
        session.sql(f"USE SCHEMA {CONFIG['schema']}").collect()
        
        # Verify setup
        current_db = session.sql("SELECT CURRENT_DATABASE()").collect()[0][0]
        current_schema = session.sql("SELECT CURRENT_SCHEMA()").collect()[0][0]
        current_wh = session.sql("SELECT CURRENT_WAREHOUSE()").collect()[0][0]
        
        print(f"✅ Environment ready:")
        print(f"   📋 Database: {current_db}")
        print(f"   📋 Active Schema: {current_schema}")
        print(f"   📋 Warehouse: {current_wh}")
        
        return True
        
    except Exception as e:
        print(f"❌ Environment setup failed: {e}")
        return False

# Setup environment
if setup_database_environment():
    print("🎯 Ready to create seed data!")
else:
    print("💥 Cannot proceed without proper environment setup")


In [None]:
# 🗃️ CREATE SEED DATA TABLES (Financial Services)

def create_seed_tables():
    """Create optimized table schemas for financial services synthetic data generation."""
    
    print("🗃️ Creating seed data table schemas...")
    
    # Drop existing tables to start fresh
    tables = ['FINANCIAL_INSTITUTIONS', 'BRANCHES', 'PRODUCTS', 'CUSTOMERS', 'ACCOUNTS', 'TRANSACTIONS']
    for table in tables:
        session.sql(f"DROP TABLE IF EXISTS {table}").collect()
    
    # FINANCIAL_INSTITUTIONS - Static reference data
    session.sql("""
        CREATE TABLE FINANCIAL_INSTITUTIONS (
            INSTITUTION_ID STRING PRIMARY KEY,
            INSTITUTION_NAME STRING NOT NULL,
            COUNTRY STRING NOT NULL,
            FOUNDED_YEAR INTEGER,
            HEADQUARTERS STRING,
            ANNUAL_REVENUE DECIMAL(15,2),
            EMPLOYEE_COUNT INTEGER,
            MARKET_SEGMENT STRING,
            CREATED_DATE DATE DEFAULT CURRENT_DATE()
        )
    """).collect()
    
    # BRANCHES - Distribution network
    session.sql("""
        CREATE TABLE BRANCHES (
            BRANCH_ID STRING PRIMARY KEY,
            INSTITUTION_ID STRING NOT NULL,
            BRANCH_NAME STRING NOT NULL,
            ADDRESS STRING,
            CITY STRING,
            STATE STRING,
            ZIP_CODE STRING,
            PHONE STRING,
            EMAIL STRING,
            OPEN_DATE DATE,
            SERVICE_RATING DECIMAL(3,2),
            CREATED_DATE DATE DEFAULT CURRENT_DATE(),
            FOREIGN KEY (INSTITUTION_ID) REFERENCES FINANCIAL_INSTITUTIONS(INSTITUTION_ID)
        )
    """).collect()
    
    # PRODUCTS - Banking products
    session.sql("""
        CREATE TABLE PRODUCTS (
            PRODUCT_ID STRING PRIMARY KEY,
            INSTITUTION_ID STRING NOT NULL,
            PRODUCT_NAME STRING NOT NULL,
            PRODUCT_TYPE STRING NOT NULL, -- Checking, Savings, Credit Card, Mortgage, Auto Loan, CD
            INTEREST_RATE DECIMAL(5,3),
            ANNUAL_FEE DECIMAL(10,2),
            MIN_BALANCE DECIMAL(12,2),
            MAX_CREDIT_LIMIT DECIMAL(12,2),
            TERM_MONTHS INTEGER,
            CREATED_DATE DATE DEFAULT CURRENT_DATE(),
            FOREIGN KEY (INSTITUTION_ID) REFERENCES FINANCIAL_INSTITUTIONS(INSTITUTION_ID)
        )
    """).collect()
    
    # CUSTOMERS - Primary synthetic target
    session.sql("""
        CREATE TABLE CUSTOMERS (
            CUSTOMER_ID STRING PRIMARY KEY,
            FIRST_NAME STRING NOT NULL,
            LAST_NAME STRING NOT NULL,
            EMAIL STRING UNIQUE,
            PHONE STRING,
            DATE_OF_BIRTH DATE,
            GENDER STRING,
            ADDRESS STRING,
            CITY STRING,
            STATE STRING,
            ZIP_CODE STRING,
            CREDIT_SCORE INTEGER,
            ANNUAL_INCOME INTEGER,
            CUSTOMER_SINCE DATE,
            KYC_STATUS STRING,
            EMPLOYMENT_STATUS STRING,
            CREATED_DATE DATE DEFAULT CURRENT_DATE()
        )
    """).collect()
    
    # ACCOUNTS - Customer accounts tied to products
    session.sql("""
        CREATE TABLE ACCOUNTS (
            ACCOUNT_ID STRING PRIMARY KEY,
            CUSTOMER_ID STRING NOT NULL,
            PRODUCT_ID STRING NOT NULL,
            BRANCH_ID STRING NOT NULL,
            OPEN_DATE DATE NOT NULL,
            CLOSE_DATE DATE,
            ACCOUNT_STATUS STRING DEFAULT 'Active',
            CURRENT_BALANCE DECIMAL(14,2) DEFAULT 0.00,
            CREDIT_LIMIT DECIMAL(12,2) DEFAULT 0.00,
            INTEREST_RATE DECIMAL(5,3),
            CREATED_DATE DATE DEFAULT CURRENT_DATE(),
            FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMERS(CUSTOMER_ID),
            FOREIGN KEY (PRODUCT_ID) REFERENCES PRODUCTS(PRODUCT_ID),
            FOREIGN KEY (BRANCH_ID) REFERENCES BRANCHES(BRANCH_ID)
        )
    """).collect()
    
    # TRANSACTIONS - High-volume transactional data
    session.sql("""
        CREATE TABLE TRANSACTIONS (
            TRANSACTION_ID STRING PRIMARY KEY,
            ACCOUNT_ID STRING NOT NULL,
            TRANSACTION_DATE DATE NOT NULL,
            TRANSACTION_TYPE STRING NOT NULL, -- Debit, Credit, Payment, Transfer, Fee, Interest
            AMOUNT DECIMAL(12,2) NOT NULL,
            MERCHANT STRING,
            CATEGORY STRING,
            DESCRIPTION TEXT,
            CREATED_DATE DATE DEFAULT CURRENT_DATE(),
            FOREIGN KEY (ACCOUNT_ID) REFERENCES ACCOUNTS(ACCOUNT_ID)
        )
    """).collect()
    
    print("✅ All seed table schemas created successfully!")
    print(f"   📊 Created {len(tables)} tables optimized for synthetic data generation")

# Create the schemas
create_seed_tables()


In [None]:
# 🌱 POPULATE SEED DATA (Financial Services)

class SeedDataGenerator:
    """Generate realistic seed data for retail banking / financial services."""
    
    def __init__(self):
        # Financial institutions and segments
        self.institutions = {
            'SnowBank': {'country': 'USA', 'founded': 1985, 'hq': 'Bozeman, MT', 'segment': 'Regional Bank'},
            'Aurora Credit Union': {'country': 'USA', 'founded': 1965, 'hq': 'Madison, WI', 'segment': 'Credit Union'},
            'Pioneer National': {'country': 'USA', 'founded': 1908, 'hq': 'Dallas, TX', 'segment': 'National Bank'},
            'Pacific Trust': {'country': 'USA', 'founded': 1978, 'hq': 'San Diego, CA', 'segment': 'Regional Bank'},
            'Metropolis Financial': {'country': 'USA', 'founded': 2001, 'hq': 'New York, NY', 'segment': 'Digital Bank'},
            'Frontier Savings': {'country': 'USA', 'founded': 1952, 'hq': 'Denver, CO', 'segment': 'Savings Bank'},
            'Liberty Mutual Bank': {'country': 'USA', 'founded': 1992, 'hq': 'Boston, MA', 'segment': 'Retail Bank'},
            'Harbor Capital': {'country': 'USA', 'founded': 1974, 'hq': 'Seattle, WA', 'segment': 'Retail Bank'}
        }
        
        # Product templates by type
        self.product_templates = {
            'Checking': ['Everyday Checking', 'Premium Checking', 'Student Checking'],
            'Savings': ['High Yield Savings', 'Basic Savings', 'Kids Savings'],
            'Credit Card': ['CashBack Visa', 'Travel Rewards', 'Low APR Platinum'],
            'Mortgage': ['30yr Fixed', '15yr Fixed', '5/1 ARM'],
            'Auto Loan': ['New Auto Loan', 'Used Auto Loan'],
            'CD': ['6-Month CD', '12-Month CD', '24-Month CD']
        }
        
        self.us_cities = [
            ('New York', 'NY', '10001'), ('Los Angeles', 'CA', '90001'), ('Chicago', 'IL', '60601'),
            ('Houston', 'TX', '77001'), ('Phoenix', 'AZ', '85001'), ('Philadelphia', 'PA', '19101'),
            ('San Antonio', 'TX', '78201'), ('San Diego', 'CA', '92101'), ('Dallas', 'TX', '75201'),
            ('San Jose', 'CA', '95101'), ('Austin', 'TX', '78701'), ('Jacksonville', 'FL', '32099'),
            ('San Francisco', 'CA', '94101'), ('Seattle', 'WA', '98101'), ('Denver', 'CO', '80201'),
            ('Boston', 'MA', '02101'), ('Miami', 'FL', '33101'), ('Charlotte', 'NC', '28201'),
            ('Columbus', 'OH', '43085'), ('Nashville', 'TN', '37201')
        ]
        
        self.first_names = ['James', 'Mary', 'John', 'Patricia', 'Robert', 'Jennifer', 'Michael', 'Linda', 'William', 'Elizabeth', 'David', 'Barbara', 'Richard', 'Susan', 'Joseph', 'Jessica', 'Thomas', 'Sarah', 'Christopher', 'Karen']
        self.last_names = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis', 'Rodriguez', 'Martinez', 'Hernandez', 'Lopez', 'Gonzalez', 'Wilson', 'Anderson', 'Thomas', 'Taylor', 'Moore', 'Jackson', 'Martin']
    
    def generate_id(self, prefix: str, counter: int) -> str:
        return f"{prefix}{counter:06d}"
    
    def generate_phone(self) -> str:
        return f"({random.randint(200,999)}) {random.randint(200,999)}-{random.randint(1000,9999)}"
    
    def generate_email(self, first_name: str, last_name: str) -> str:
        domains = ['gmail.com', 'yahoo.com', 'outlook.com', 'icloud.com']
        return f"{first_name.lower()}.{last_name.lower()}@{random.choice(domains)}"
    
    def random_date_between(self, start_date: dt.date, end_date: dt.date) -> dt.date:
        start_dt = dt.datetime.combine(start_date, dt.time.min)
        end_dt = dt.datetime.combine(end_date, dt.time.min)
        time_between = end_dt - start_dt
        days_between = time_between.days
        # randrange(0) raises ValueError, so guard against start_date == end_date
        random_days = random.randrange(max(1, days_between))
        return (start_dt + timedelta(days=random_days)).date()

# Initialize the generator
seed_gen = SeedDataGenerator()
print("✅ Seed data generator initialized with financial services data")


In [None]:
# 🏦 CREATE INSTITUTIONS, BRANCHES, PRODUCTS, CUSTOMERS, ACCOUNTS SEED DATA

def create_institutions_seed():
    print("🏦 Creating financial institutions seed data...")
    institutions_data = []
    for i, (name, info) in enumerate(seed_gen.institutions.items(), 1):
        revenue = random.randint(1_000_000_000, 25_000_000_000)
        employees = random.randint(2_000, 50_000)
        institutions_data.append({
            'INSTITUTION_ID': seed_gen.generate_id('INST', i),
            'INSTITUTION_NAME': name,
            'COUNTRY': info['country'],
            'FOUNDED_YEAR': info['founded'],
            'HEADQUARTERS': info['hq'],
            'ANNUAL_REVENUE': revenue,
            'EMPLOYEE_COUNT': employees,
            'MARKET_SEGMENT': info['segment']
        })
    df = pd.DataFrame(institutions_data)
    session.write_pandas(df, 'FINANCIAL_INSTITUTIONS', auto_create_table=False, overwrite=True)
    print(f"   ✅ Created {len(df)} financial institutions")
    return df


def create_branches_seed():
    print("🏢 Creating branches seed data...")
    institutions_df = session.table('FINANCIAL_INSTITUTIONS').to_pandas()
    branches_data = []
    branch_counter = 1
    for _, inst in institutions_df.iterrows():
        for i in range(5):  # 5 branches per institution
            city, state, zip_code = random.choice(seed_gen.us_cities)
            branches_data.append({
                'BRANCH_ID': seed_gen.generate_id('BR', branch_counter),
                'INSTITUTION_ID': inst['INSTITUTION_ID'],
                'BRANCH_NAME': f"{inst['INSTITUTION_NAME']} - {city}",
                'ADDRESS': f"{random.randint(100, 9999)} {random.choice(['Main St', 'Oak Ave', 'Elm St', 'Park Dr', 'First Ave'])}",
                'CITY': city,
                'STATE': state,
                'ZIP_CODE': zip_code,
                'PHONE': seed_gen.generate_phone(),
                'EMAIL': f"contact@{inst['INSTITUTION_NAME'].lower().replace(' ', '')}.com",
                'OPEN_DATE': seed_gen.random_date_between(dt.date(1980, 1, 1), dt.date(2022, 12, 31)),
                'SERVICE_RATING': round(random.uniform(3.0, 5.0), 2)
            })
            branch_counter += 1
    df = pd.DataFrame(branches_data)
    session.write_pandas(df, 'BRANCHES', auto_create_table=False, overwrite=True)
    print(f"   ✅ Created {len(df)} branches")
    return df


def create_products_seed():
    print("🧾 Creating products seed data...")
    institutions_df = session.table('FINANCIAL_INSTITUTIONS').to_pandas()
    products_data = []
    product_counter = 1
    for _, inst in institutions_df.iterrows():
        for product_type, names in seed_gen.product_templates.items():
            for product_name in names:
                # Default terms by type
                if product_type in ['Checking', 'Savings']:
                    interest = round(random.uniform(0.1, 4.0), 3)
                    fee = random.choice([0.00, 5.00, 10.00, 15.00])
                    min_balance = random.choice([0.00, 100.00, 500.00, 1000.00])
                    max_credit = 0.00
                    term = None
                elif product_type == 'Credit Card':
                    interest = round(random.uniform(9.99, 24.99), 3)
                    fee = random.choice([0.00, 49.00, 95.00])
                    min_balance = 0.00
                    max_credit = random.choice([5000.00, 10000.00, 20000.00])
                    term = None
                elif product_type in ['Mortgage', 'Auto Loan']:
                    interest = round(random.uniform(2.5, 9.5), 3)
                    fee = 0.00
                    min_balance = 0.00
                    max_credit = 0.00
                    term = random.choice([36, 60, 84, 120, 180, 360])
                elif product_type == 'CD':
                    interest = round(random.uniform(3.0, 6.0), 3)
                    fee = 0.00
                    min_balance = random.choice([500.00, 1000.00, 5000.00])
                    max_credit = 0.00
                    term = random.choice([6, 12, 24])
                else:
                    interest = 0.0
                    fee = 0.0
                    min_balance = 0.0
                    max_credit = 0.0
                    term = None
                products_data.append({
                    'PRODUCT_ID': seed_gen.generate_id('PRD', product_counter),
                    'INSTITUTION_ID': inst['INSTITUTION_ID'],
                    'PRODUCT_NAME': product_name,
                    'PRODUCT_TYPE': product_type,
                    'INTEREST_RATE': interest,
                    'ANNUAL_FEE': fee,
                    'MIN_BALANCE': min_balance,
                    'MAX_CREDIT_LIMIT': max_credit,
                    'TERM_MONTHS': term
                })
                product_counter += 1
    df = pd.DataFrame(products_data)
    session.write_pandas(df, 'PRODUCTS', auto_create_table=False, overwrite=True)
    print(f"   ✅ Created {len(df)} products")
    return df


def create_customers_seed():
    print("👤 Creating customers seed data...")
    customers_seed = []
    for i in range(1, CONFIG['seed_customers'] + 1):
        city, state, zip_code = random.choice(seed_gen.us_cities)
        first_name = random.choice(seed_gen.first_names)
        last_name = random.choice(seed_gen.last_names)
        customers_seed.append({
            'CUSTOMER_ID': seed_gen.generate_id('CUST', i),
            'FIRST_NAME': first_name,
            'LAST_NAME': last_name,
            'EMAIL': seed_gen.generate_email(first_name, last_name),
            'PHONE': seed_gen.generate_phone(),
            'DATE_OF_BIRTH': seed_gen.random_date_between(dt.date(1950, 1, 1), dt.date(2005, 12, 31)),
            'GENDER': random.choice(['Male', 'Female', 'Other']),
            'ADDRESS': f"{random.randint(100, 9999)} {random.choice(['Main St', 'Oak Ave', 'Elm St', 'Park Dr', 'First Ave'])}",
            'CITY': city,
            'STATE': state,
            'ZIP_CODE': zip_code,
            'CREDIT_SCORE': random.randint(300, 850),
            'ANNUAL_INCOME': random.randint(25000, 300000),
            'CUSTOMER_SINCE': seed_gen.random_date_between(dt.date(2010, 1, 1), dt.date(2024, 1, 1)),
            'KYC_STATUS': random.choice(['Verified', 'Pending', 'Review']),
            'EMPLOYMENT_STATUS': random.choice(['Employed', 'Self-Employed', 'Unemployed', 'Student', 'Retired'])
        })
    df = pd.DataFrame(customers_seed)
    session.write_pandas(df, 'CUSTOMERS', auto_create_table=False, overwrite=True)
    print(f"   ✅ Created {len(df)} customer seed records")
    return df


def create_accounts_seed():
    print("💳 Creating accounts seed data...")
    customers_df = session.table('CUSTOMERS').to_pandas()
    products_df = session.table('PRODUCTS').to_pandas()
    branches_df = session.table('BRANCHES').to_pandas()
    
    accounts_data = []
    account_counter = 1
    for _, cust in customers_df.iterrows():
        # Each customer holds 1-3 accounts
        for _ in range(random.choice([1, 1, 2, 2, 3])):
            product = products_df.sample(1).iloc[0]
            branch = branches_df.sample(1).iloc[0]
            open_date = seed_gen.random_date_between(dt.date(2010, 1, 1), dt.date(2024, 1, 1))
            # Keep roughly 85% of accounts open; the rest get a close date after opening
            close_date = None if random.random() < 0.85 else seed_gen.random_date_between(open_date, dt.date(2024, 12, 31))
            status = 'Active' if close_date is None else 'Closed'
            # Seed balances and credit limits by product type
            if product['PRODUCT_TYPE'] in ['Checking', 'Savings']:
                balance = round(random.uniform(100.0, 100000.0), 2)
                credit_limit = 0.00
                interest = product['INTEREST_RATE']
            elif product['PRODUCT_TYPE'] == 'Credit Card':
                balance = round(random.uniform(0.0, float(product['MAX_CREDIT_LIMIT'] or 10000.0)), 2)
                credit_limit = float(product['MAX_CREDIT_LIMIT'] or 10000.0)
                interest = product['INTEREST_RATE']
            elif product['PRODUCT_TYPE'] in ['Mortgage', 'Auto Loan']:
                principal = random.choice([25000, 50000, 150000, 300000, 600000])
                balance = float(principal)
                credit_limit = 0.00
                interest = product['INTEREST_RATE']
            elif product['PRODUCT_TYPE'] == 'CD':
                balance = float(random.choice([1000, 5000, 10000, 25000]))
                credit_limit = 0.00
                interest = product['INTEREST_RATE']
            else:
                balance = 0.00
                credit_limit = 0.00
                interest = None
            accounts_data.append({
                'ACCOUNT_ID': seed_gen.generate_id('ACCT', account_counter),
                'CUSTOMER_ID': cust['CUSTOMER_ID'],
                'PRODUCT_ID': product['PRODUCT_ID'],
                'BRANCH_ID': branch['BRANCH_ID'],
                'OPEN_DATE': open_date,
                'CLOSE_DATE': close_date,
                'ACCOUNT_STATUS': status,
                'CURRENT_BALANCE': balance,
                'CREDIT_LIMIT': credit_limit,
                'INTEREST_RATE': interest
            })
            account_counter += 1
    df = pd.DataFrame(accounts_data)
    session.write_pandas(df, 'ACCOUNTS', auto_create_table=False, overwrite=True)
    print(f"   ✅ Created {len(df)} accounts")
    return df

# Create seed data
institutions_df = create_institutions_seed()
branches_df = create_branches_seed()
products_df = create_products_seed()
customers_df = create_customers_seed()
accounts_df = create_accounts_seed()

print(f"🎯 Foundation seed data complete: {len(institutions_df)} institutions, {len(branches_df)} branches, {len(products_df)} products, {len(customers_df)} customers, {len(accounts_df)} accounts")


In [None]:
# 🤖 SNOWFLAKE SYNTHETIC DATA GENERATION (Financial Services)

def generate_synthetic_data():
    """Use Snowflake's GENERATE_SYNTHETIC_DATA to create large-scale datasets for FS."""
    
    print("🤖 Starting Snowflake synthetic data generation...")
    print("📋 Leveraging Snowflake's AI algorithms for statistical consistency")
    
    try:
        # Create a consistency secret for join key consistency
        print("🔐 Creating consistency secret for join key relationships...")
        session.sql(
            """
            CREATE OR REPLACE SECRET FS_CONSISTENCY_SECRET
            TYPE = SYMMETRIC_KEY
            ALGORITHM = GENERIC
            """
        ).collect()
        
        print("👥 Generating synthetic customers data (targeting scale via subsequent step)...")
        session.sql(f"""
            CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({{
                'datasets': [
                    {{
                        'input_table': '{CONFIG['database']}.{CONFIG['schema']}.CUSTOMERS',
                        'output_table': '{CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC',
                        'columns': {{ 'CUSTOMER_ID': {{'join_key': true}} }}
                    }}
                ],
                'consistency_secret': SYSTEM$REFERENCE('SECRET', 'FS_CONSISTENCY_SECRET', 'SESSION', 'READ')::STRING,
                'replace_output_tables': {str(CONFIG['replace_output_tables']).lower()},
                'similarity_filter': {str(CONFIG['enable_privacy_filter']).lower()}
            }});
        """).collect()
        
        customers_count = session.sql(f"SELECT COUNT(*) FROM {CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC").collect()[0][0]
        print(f"   ✅ Generated {customers_count:,} synthetic customer records")
        
        print("🏢 Generating synthetic branches data...")
        session.sql(f"""
            CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({{
                'datasets': [
                    {{
                        'input_table': '{CONFIG['database']}.{CONFIG['schema']}.BRANCHES',
                        'output_table': '{CONFIG['database']}.{CONFIG['synth_schema']}.BRANCHES_SYNTHETIC',
                        'columns': {{ 
                            'BRANCH_ID': {{'join_key': true}},
                            'INSTITUTION_ID': {{'join_key': true}}
                        }}
                    }}
                ],
                'consistency_secret': SYSTEM$REFERENCE('SECRET', 'FS_CONSISTENCY_SECRET', 'SESSION', 'READ')::STRING,
                'replace_output_tables': {str(CONFIG['replace_output_tables']).lower()},
                'similarity_filter': {str(CONFIG['enable_privacy_filter']).lower()}
            }});
        """).collect()
        branches_count = session.sql(f"SELECT COUNT(*) FROM {CONFIG['database']}.{CONFIG['synth_schema']}.BRANCHES_SYNTHETIC").collect()[0][0]
        print(f"   ✅ Generated {branches_count:,} synthetic branch records")
        
        print("🧾 Generating synthetic products data...")
        session.sql(f"""
            CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({{
                'datasets': [
                    {{
                        'input_table': '{CONFIG['database']}.{CONFIG['schema']}.PRODUCTS',
                        'output_table': '{CONFIG['database']}.{CONFIG['synth_schema']}.PRODUCTS_SYNTHETIC',
                        'columns': {{ 
                            'PRODUCT_ID': {{'join_key': true}},
                            'INSTITUTION_ID': {{'join_key': true}}
                        }}
                    }}
                ],
                'consistency_secret': SYSTEM$REFERENCE('SECRET', 'FS_CONSISTENCY_SECRET', 'SESSION', 'READ')::STRING,
                'replace_output_tables': {str(CONFIG['replace_output_tables']).lower()},
                'similarity_filter': {str(CONFIG['enable_privacy_filter']).lower()}
            }});
        """).collect()
        products_count = session.sql(f"SELECT COUNT(*) FROM {CONFIG['database']}.{CONFIG['synth_schema']}.PRODUCTS_SYNTHETIC").collect()[0][0]
        print(f"   ✅ Generated {products_count:,} synthetic product records")
        
        print("\n🎉 Synthetic data generation completed successfully!")
        print("📊 Summary of generated synthetic data:")
        print(f"   👥 Customers: {customers_count:,} records")
        print(f"   🏢 Branches: {branches_count:,} records")
        print(f"   🧾 Products: {products_count:,} records")
        
        return True
        
    except Exception as e:
        print(f"❌ Synthetic data generation failed: {e}")
        print("💡 Check prerequisites: edition, Anaconda terms, warehouse sizing, seed data volumes")
        return False

# TRANSACTIONS seed data is not created in this notebook (see Next Steps)
print("🌱 Prepared to run synthetic data generation for core entities...")

# Generate synthetic data for initial entities
generate_synthetic_data()


In [None]:
# 🚀 MULTI-RUN SYNTHETIC DATA SCALING (Financial Services)

def scale_synthetic_data():
    """Scale output volumes via multi-run batching and consolidation for FS entities."""
    print("🚀 Scaling synthetic data via multiple generation runs...")
    
    try:
        # Check existing synthetic data counts
        def count(table_name: str) -> int:
            try:
                return session.sql(f"SELECT COUNT(*) FROM {CONFIG['database']}.{CONFIG['synth_schema']}.{table_name}").collect()[0][0]
            except Exception:  # e.g. the table does not exist yet
                return 0
        
        current = {
            'customers': count('CUSTOMERS_SYNTHETIC'),
            'branches': count('BRANCHES_SYNTHETIC'),
            'products': count('PRODUCTS_SYNTHETIC')
        }
        print(f"   📊 Existing synthetic counts: {current}")
        
        # Scale customers in batches of ~seed size
        customers_needed = max(0, CONFIG['target_customers'] - current['customers'])
        if customers_needed > 0:
            seed_size = session.sql(f"SELECT COUNT(*) FROM {CONFIG['database']}.{CONFIG['schema']}.CUSTOMERS").collect()[0][0]
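            # Capped at 10 runs per call: if each run returns about seed_size rows,
            # one call adds roughly 10 x seed_size customers, which may fall short of the target
            iterations = min(10, (customers_needed // max(1, seed_size)) + 1)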
            print(f"   👥 Scaling customers with {iterations} iterations...")
            for i in range(1, iterations + 1):
                session.sql(f"""
                    CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({{
                        'datasets': [{{
                            'input_table': '{CONFIG['database']}.{CONFIG['schema']}.CUSTOMERS',
                            'output_table': '{CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC_BATCH_{i}',
                            'columns': {{'CUSTOMER_ID': {{'join_key': true}}}}
                        }}],
                        'consistency_secret': SYSTEM$REFERENCE('SECRET', 'FS_CONSISTENCY_SECRET', 'SESSION', 'READ')::STRING,
                        'replace_output_tables': true,
                        'similarity_filter': false
                    }});
                """).collect()
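                # Window functions are not allowed in an UPDATE ... SET clause,
                # so renumber the IDs by rebuilding the batch table instead.
                batch_table = f"{CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC_BATCH_{i}"
                session.sql(f"""
                    CREATE OR REPLACE TABLE {batch_table} AS
                    SELECT * REPLACE (
                        CONCAT('CUST', LPAD((ROW_NUMBER() OVER (ORDER BY CUSTOMER_ID) + {(i-1) * seed_size})::STRING, 6, '0')) AS CUSTOMER_ID
                    )
                    FROM {batch_table}
                """).collect()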
            
            # Consolidate customers
            session.sql(f"DROP TABLE IF EXISTS {CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC").collect()
            session.sql(f"""
                CREATE TABLE {CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC AS
                SELECT * FROM {CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC_BATCH_1
            """).collect()
            batch_tables = session.sql(f"SHOW TABLES LIKE 'CUSTOMERS_SYNTHETIC_BATCH_%' IN SCHEMA {CONFIG['database']}.{CONFIG['synth_schema']}").collect()
            for b in batch_tables[1:]:
                name = b['name']
                session.sql(f"INSERT INTO {CONFIG['database']}.{CONFIG['synth_schema']}.CUSTOMERS_SYNTHETIC SELECT * FROM {CONFIG['database']}.{CONFIG['synth_schema']}.{name}").collect()
            for b in batch_tables:
                name = b['name']
                session.sql(f"DROP TABLE {CONFIG['database']}.{CONFIG['synth_schema']}.{name}").collect()
        
        # Similar scaling can be added for branches and products if needed later
        final_counts = {
            'customers': count('CUSTOMERS_SYNTHETIC'),
            'branches': count('BRANCHES_SYNTHETIC'),
            'products': count('PRODUCTS_SYNTHETIC')
        }
        print(f"📈 Final synthetic counts: {final_counts}")
        return True
    except Exception as e:
        print(f"❌ Scaled synthetic data generation failed: {e}")
        return False

scale_synthetic_data()


## 🎯 Financial Services Synthetic Data Generator v1.0 - Complete!

### 🏆 What We've Built

This notebook demonstrates enterprise-grade synthetic data generation for retail banking using Snowflake's native `GENERATE_SYNTHETIC_DATA`:

- Realistic seed data: institutions, branches, products, customers, accounts
- AI-powered scaling: statistical consistency from Snowflake algorithms
- Relationship preservation: join-key integrity maintained across entities
- Privacy protection: optional similarity filtering
- Scaling pattern: multi-run batching with consolidation (each scaling call is capped at 10 runs)

### 📊 Summary of Generated Synthetic Data

- Customers: see counts printed during generation
- Branches: see counts printed during generation
- Products: see counts printed during generation

### 🔮 Next Steps

1. Add synthetic generation for `ACCOUNTS` and `TRANSACTIONS` using the same approach
2. Enrich products with underwriting rules and pricing tiers
3. Add fraud-signal features on transactions (merchant/category patterns)
4. Parameterize scaling to reach 50K+ customers and multi-million transactions
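
For step 1, the call for `ACCOUNTS` could mirror the ones already used for customers, branches, and products, with every key column marked as a join key. The sketch below only assembles the SQL text using the same option names that appear in the generation cell; `build_synth_call` is a hypothetical helper, and the choice of join-key columns is an assumption based on the `ACCOUNTS` schema.

```python
from typing import Dict, List


def build_synth_call(database: str, seed_schema: str, synth_schema: str,
                     tables: Dict[str, List[str]], secret_name: str,
                     similarity_filter: bool = True) -> str:
    """Build a GENERATE_SYNTHETIC_DATA call; `tables` maps table name to join-key columns."""
    datasets = []
    for table, join_keys in tables.items():
        columns = ", ".join("'" + col + "': {'join_key': true}" for col in join_keys)
        datasets.append(
            "{"
            + f"'input_table': '{database}.{seed_schema}.{table}', "
            + f"'output_table': '{database}.{synth_schema}.{table}_SYNTHETIC', "
            + "'columns': {" + columns + "}"
            + "}"
        )
    return (
        "CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({"
        + "'datasets': [" + ", ".join(datasets) + "], "
        + f"'consistency_secret': SYSTEM$REFERENCE('SECRET', '{secret_name}', 'SESSION', 'READ')::STRING, "
        + "'replace_output_tables': true, "
        + f"'similarity_filter': {str(similarity_filter).lower()}"
        + "})"
    )


accounts_call = build_synth_call(
    "FIN_SERV_SYNTH_DB", "SEED_DATA", "SYNTHETIC_DATA",
    {"ACCOUNTS": ["ACCOUNT_ID", "CUSTOMER_ID", "PRODUCT_ID", "BRANCH_ID"]},
    "FS_CONSISTENCY_SECRET",
)
print(accounts_call)
# In the notebook this could be executed with: session.sql(accounts_call).collect()
```

Reusing the same consistency secret is what should keep `CUSTOMER_ID`, `PRODUCT_ID`, and `BRANCH_ID` aligned with the tables generated earlier.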

### 📚 Key Learnings

- Seed quality drives fidelity of synthetic outputs; each input table needs at least 20 distinct rows
- Join-key configuration is essential to preserve relationships at scale
- For large volumes, use iterative batching with consolidation
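
The batching arithmetic can be worked out before any run is started. The sketch below is an illustration that assumes each generation run returns about as many rows as the seed table; `plan_batches` is a hypothetical helper, and the ID offset follows the `(i - 1) * seed_size` convention used in the scaling cell.

```python
import math
from typing import List, Tuple


def plan_batches(target_rows: int, seed_rows: int, max_runs: int = 10) -> List[Tuple[int, int]]:
    """Return (batch_number, id_offset) pairs for the runs needed to approach target_rows."""
    if seed_rows <= 0:
        raise ValueError("seed_rows must be positive")
    # Ceiling division gives the runs needed; max_runs caps the total
    runs = min(max_runs, math.ceil(target_rows / seed_rows))
    return [(i, (i - 1) * seed_rows) for i in range(1, runs + 1)]


plan = plan_batches(target_rows=50_000, seed_rows=200)
print(len(plan), plan[-1])   # 10 (10, 1800)

uncapped = plan_batches(target_rows=50_000, seed_rows=200, max_runs=1_000)
print(len(uncapped))         # 250
```

With a 200-row seed and a 10-run cap, one pass yields about 2,000 customers, so reaching 50,000 takes either a higher cap or a larger seed.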

### 🚀 Usage Instructions

1. Run setup cells to create DB/schemas
2. Execute seed creation cells for institutions → accounts
3. Run synthetic generation cells (customers, branches, products)
4. Optionally run scaling cell to increase volumes
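
Between steps 2 and 3 it can help to confirm that each seed table has enough distinct rows. The sketch below is a minimal pre-flight check on toy rows, using the 20-row threshold mentioned under Key Learnings; `distinct_row_count` and `seed_is_large_enough` are hypothetical helpers, and real rows would come from the seed tables.

```python
from typing import Dict, Hashable, List

MIN_DISTINCT_ROWS = 20  # threshold mentioned under Key Learnings


def distinct_row_count(rows: List[Dict[str, Hashable]]) -> int:
    """Count distinct rows, treating each row as the tuple of its sorted items."""
    return len({tuple(sorted(row.items())) for row in rows})


def seed_is_large_enough(rows: List[Dict[str, Hashable]],
                         minimum: int = MIN_DISTINCT_ROWS) -> bool:
    """True when the table has at least `minimum` distinct rows."""
    return distinct_row_count(rows) >= minimum


# Toy stand-ins sized like the notebook's seed tables
institutions = [{"INSTITUTION_ID": f"INST{i:06d}"} for i in range(1, 9)]   # 8 rows
branches = [{"BRANCH_ID": f"BR{i:06d}"} for i in range(1, 41)]             # 40 rows

print(seed_is_large_enough(institutions))  # False
print(seed_is_large_enough(branches))      # True
```

This is consistent with the notebook generating synthetic data for branches but not for the 8-row institutions table.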
