# Customer.IO Setup and Configuration Management

## **Comprehensive Infrastructure Setup for Customer.IO Data Pipelines**

This notebook implements a **production-ready setup and configuration management system** for Customer.IO data pipelines, featuring:

- **🔧 Environment Configuration Management** - Type-safe configuration with Pydantic validation
- **🔐 Secure Secrets Management** - Databricks secrets integration with encryption support  
- **🗄️ Delta Lake Infrastructure** - Automated table creation with optimized schemas
- **⚡ Performance Optimization** - Spark configuration tuning for Customer.IO workloads
- **🛡️ Circuit Breaker Protection** - Fault-tolerant validation with retry mechanisms
- **📊 Comprehensive Monitoring** - Health checks and validation with detailed reporting
- **🧪 Synthetic Data Generation** - Realistic test data for development and testing
- **🎯 Production Deployment** - Environment-specific configurations and secrets management
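
The secrets integration listed above can be pictured roughly as follows. This is a minimal sketch, not the notebook's actual `SetupManager` logic: the helper `resolve_api_key`, the environment-to-key mapping, and the placeholder test key are hypothetical. Only the scope name `customerio` and the key names `production_api_key` and `sandbox_api_key` come from the setup hints printed later in this notebook. On Databricks, `get_secret` would typically be `dbutils.secrets.get`.

```python
from typing import Callable

# Hypothetical mapping from environment name to the secret key that holds its API key.
SECRET_KEYS = {
    "production": "production_api_key",
    "sandbox": "sandbox_api_key",
}


def resolve_api_key(
    environment: str,
    get_secret: Callable[[str, str], str],
    scope: str = "customerio",
    test_key: str = "test_api_key_placeholder",
) -> str:
    """Return the API key for an environment, or a placeholder when none is mapped."""
    key_name = SECRET_KEYS.get(environment)
    if key_name is None:
        # Environments without a mapped secret get a non-sensitive placeholder.
        return test_key
    return get_secret(scope, key_name)


# Stand-in for dbutils.secrets.get so the sketch runs outside Databricks.
fake_store = {("customerio", "sandbox_api_key"): "sbx-0123456789"}


def fake_get_secret(scope: str, key: str) -> str:
    return fake_store[(scope, key)]


print(resolve_api_key("sandbox", fake_get_secret))
print(resolve_api_key("test", fake_get_secret))
```

Passing the secret getter in as a parameter keeps the lookup testable without a live workspace and keeps real keys out of widgets and notebook output.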

## **Enterprise Features**

- **Type-Safe Configuration**: Pydantic models with comprehensive validation
- **Error Handling**: Circuit breaker patterns and retry mechanisms  
- **Monitoring**: Structured logging and performance metrics
- **Security**: Encrypted secrets and secure credential management
- **Scalability**: Auto-scaling Spark configurations and Delta Lake optimization
- **Testing**: Comprehensive validation framework with synthetic data generation
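
To make the circuit breaker idea concrete, here is a minimal, self-contained sketch of the pattern. It is not the project's `utils.error_handlers.CircuitBreaker`, whose interface is not shown in this notebook. The class `SimpleCircuitBreaker`, its parameters, and the state names are assumptions chosen for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half_open"  # one trial call is allowed through
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky() -> None:
    raise ConnectionError("simulated API outage")


breaker = SimpleCircuitBreaker(failure_threshold=2, recovery_timeout=60.0)
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.state)
```

Once the breaker is open, further calls fail fast with `CircuitOpenError` instead of hitting the failing service, which is what limits cascading failures.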

## **Architecture Overview**

The notebook follows the **6-section pattern** established for production-ready Customer.IO implementations:

1. **📋 Comprehensive Documentation** - Complete setup guide and requirements
2. **🔨 Core Imports and Setup** - Production dependencies and environment configuration  
3. **📝 Type-Safe Model Definitions** - Pydantic models for configuration and validation
4. **⚙️ Main Manager Class** - SetupManager with dependency injection and monitoring
5. **🚀 Example Usage and Testing** - Complete workflows and validation scenarios
6. **📊 Summary Documentation** - Results summary and next steps

Ready for **production deployment** with enterprise-grade reliability and monitoring.

In [ ]:
# ============================================================================
# SECTION 2: CORE IMPORTS AND SETUP  
# ============================================================================

# Core Python libraries
import json
import os
import time
import threading
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Any, Union, Literal
from dataclasses import dataclass, field
from pathlib import Path
import uuid
import base64
import secrets

# Databricks and Spark imports
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    TimestampType, BooleanType, DoubleType, ArrayType, MapType
)
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# HTTP and validation libraries
import httpx
from pydantic import BaseModel, Field, validator
import structlog

# Data generation and testing
from faker import Faker
from dateutil import tz

# Utilities from Customer.IO project
from utils.api_client import CustomerIOClient
from utils.error_handlers import retry_on_error, ErrorContext, CustomerIOError, CircuitBreaker

# Initialize components
fake = Faker()
fake.seed_instance(42)  # For reproducible test data
logger = structlog.get_logger("setup_manager")

# Get Spark session
spark = SparkSession.getActiveSession()
if not spark:
    raise RuntimeError("No active Spark session found")

print("SUCCESS: All imports and core setup completed")
print(f"   Spark version: {spark.version}")
print(f"   Python libraries: httpx, pydantic, structlog, faker")
print(f"   Customer.IO utilities: api_client, error_handlers")
print("   Structured logger created via structlog")

# ============================================================================
# SECTION 3: TYPE-SAFE MODEL DEFINITIONS
# ============================================================================

from enum import Enum

class Environment(str, Enum):
    """Environment types for deployment."""
    DEVELOPMENT = "development"
    STAGING = "staging" 
    PRODUCTION = "production"
    TESTING = "testing"

class ValidationStatus(str, Enum):
    """Validation result status."""
    SUCCESS = "SUCCESS"
    WARNING = "WARNING"
    ERROR = "ERROR"

class CustomerIOConfig(BaseModel):
    """Type-safe configuration for Customer.IO API settings."""
    
    api_key: str = Field(..., description="Customer.IO API key")
    region: Literal["us", "eu"] = Field(default="us", description="API region")
    
    # Rate limiting configuration
    RATE_LIMIT_REQUESTS: int = Field(default=3000, description="Requests per window")
    RATE_LIMIT_WINDOW: int = Field(default=3, description="Rate limit window in seconds")
    
    # Request size limits
    MAX_REQUEST_SIZE: int = Field(default=32 * 1024, description="Max request size in bytes")
    MAX_BATCH_SIZE: int = Field(default=500 * 1024, description="Max batch size in bytes")
    
    # Retry configuration
    MAX_RETRIES: int = Field(default=3, description="Maximum retry attempts")
    RETRY_BACKOFF_FACTOR: float = Field(default=2.0, description="Backoff multiplier")
    
    @validator('api_key')
    def validate_api_key(cls, v: str) -> str:
        """Validate API key format."""
        v = v.strip()  # Normalize first so the length check ignores padding
        if not v:
            raise ValueError("API key cannot be empty")
        if len(v) < 10:
            raise ValueError("API key appears to be too short")
        return v
    
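    @validator('region', pre=True)
    def validate_region(cls, v: Any) -> Any:
        """Normalize region to lowercase before the Literal check runs."""
        # pre=True is needed: otherwise "EU" is rejected before it can be lowercased
        return v.lower() if isinstance(v, str) else v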
    
    @property
    def base_url(self) -> str:
        """Get base URL based on region."""
        if self.region == "eu":
            return "https://cdp-eu.customer.io/v1"
        else:
            return "https://cdp.customer.io/v1"
    
    def get_headers(self) -> Dict[str, str]:
        """Get HTTP headers for API requests."""
        auth_string = base64.b64encode(f"{self.api_key}:".encode()).decode()
        
        return {
            "Authorization": f"Basic {auth_string}",
            "Content-Type": "application/json",
            "User-Agent": "CustomerIO-Databricks-Setup/1.0.0",
            "Accept": "application/json"
        }
    
    class Config:
        """Pydantic model configuration."""
        validate_assignment = True
        extra = "forbid"

@dataclass
class ValidationResult:
    """Type-safe validation result."""
    status: ValidationStatus
    component: str
    result: str
    error: Optional[Exception] = None
    
    def __str__(self) -> str:
        return f"{self.status.value} {self.component:<25} {self.result}"

@dataclass
class SetupConfiguration:
    """Complete setup configuration."""
    customerio_region: str
    database_name: str
    catalog_name: str
    environment: Environment
    api_key: str
    
    def get_full_database_name(self) -> str:
        """Get full database name."""
        return f"{self.catalog_name}.{self.database_name}"

print("SUCCESS: Type-safe models defined")
print(f"   CustomerIOConfig: API configuration with validation")
print(f"   ValidationResult: Structured validation results")
print(f"   SetupConfiguration: Complete environment setup")
print(f"   Environment enum: {list(Environment)}")
print(f"   ValidationStatus enum: {list(ValidationStatus)}")
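
The retry fields on `CustomerIOConfig` (`MAX_RETRIES` and `RETRY_BACKOFF_FACTOR`) suggest an exponential backoff schedule. The sketch below shows one way such a schedule might be derived from those two values. The helper `backoff_delays`, the 1-second base delay, and the 60-second cap are assumptions for illustration and are not taken from `utils.error_handlers`.

```python
from typing import List


def backoff_delays(max_retries: int = 3, backoff_factor: float = 2.0,
                   base_delay: float = 1.0, max_delay: float = 60.0) -> List[float]:
    """Return the wait in seconds before each retry attempt, capped at max_delay."""
    return [min(base_delay * backoff_factor ** attempt, max_delay)
            for attempt in range(max_retries)]


# Defaults mirror MAX_RETRIES=3 and RETRY_BACKOFF_FACTOR=2.0 from the config model.
print(backoff_delays())                    # [1.0, 2.0, 4.0]
print(backoff_delays(5, 3.0, 0.5, 10.0))   # [0.5, 1.5, 4.5, 10.0, 10.0]
```

Production retry logic often adds random jitter to these delays so that many clients do not retry in lockstep; that is omitted here to keep the output deterministic.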

In [ ]:
# ============================================================================
# SECTION 4: MAIN MANAGER CLASS
# ============================================================================

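# NOTE: this import rebinds SetupConfiguration and CustomerIOConfig, so the
# Section 3 definitions above are replaced by the versions in utils.setup_manager.
from utils.setup_manager import SetupManager, SetupConfiguration, CustomerIOConfig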

# Initialize SetupManager with dependency injection
setup_manager = SetupManager(spark)

# Create Databricks widgets for configuration
try:
    # Configuration widgets (non-sensitive data only)
    dbutils.widgets.dropdown("customerio_region", "us", ["us", "eu"], "Customer.IO Region")
    dbutils.widgets.text("database_name", "customerio_demo", "Database Name") 
    dbutils.widgets.text("catalog_name", "main", "Unity Catalog Name")
    dbutils.widgets.dropdown("environment", "test", ["test", "sandbox", "production"], "Environment")
    
    print("SUCCESS: Configuration widgets created")
    print("   customerio_region: Choose API region (us/eu)")
    print("   database_name: Database name for Customer.IO data")
    print("   catalog_name: Unity Catalog name")
    print("   environment: Deployment environment")
    
except Exception as e:
    print(f"INFO: Widget creation skipped (may already exist): {str(e)}")

# Create configuration from widgets with secure secret management
config = setup_manager.create_configuration_from_widgets()

print(f"\nSUCCESS: SetupManager initialized and configured")
print(f"   Region: {config.customerio_region}")
print(f"   Database: {config.get_full_database_name()}")
print(f"   Environment: {config.environment.value}")
print(f"   API Key: {'SECURED' if config.environment.value != 'testing' else 'TEST_MODE'}")
print(f"   Circuit breaker configured for fault tolerance")
print(f"   Synthetic data generator ready")
print(f"   Schema manager with 6 optimized table schemas")

# Validate Customer.IO configuration
customerio_validation = setup_manager.validate_customerio_config(config)
print(f"\nAPI Configuration Validation: {customerio_validation}")

if customerio_validation.status.value == "ERROR":
    print("WARNING: API configuration validation failed - using test mode")
    print("To configure production secrets, run:")
    print("   databricks secrets create-scope customerio")
    print("   databricks secrets put-secret customerio production_api_key")
    print("   databricks secrets put-secret customerio sandbox_api_key")

# ============================================================================
# SECTION 5: EXAMPLE USAGE AND TESTING
# ============================================================================

print("ANALYSIS: Running complete environment setup with validation...")

# Execute complete setup with synthetic data generation
setup_results = setup_manager.setup_complete_environment(
    num_customers=1000,  # Generate 1000 test customers
    num_events=5000      # Generate 5000 test events
)

# Display detailed results
print(f"\nSetup Results Summary:")
print(f"   Overall Status: {setup_results['overall_status'].upper()}")
print(f"   Configuration: {setup_results['config'].get_full_database_name()}")
print(f"   Environment: {setup_results['config'].environment.value}")
print(f"   Circuit Breaker State: {setup_results['circuit_breaker_state']}")

# Verify data was loaded successfully
if setup_results['overall_status'] in ['success', 'warning']:
    print("\nVERIFICATION: Checking loaded data...")
    
    # Check customers table
    customers_df = spark.table(f"{config.get_full_database_name()}.customers")
    customers_count = customers_df.count()
    print(f"   Customers table: {customers_count:,} records")
    
    # Check events table
    events_df = spark.table(f"{config.get_full_database_name()}.events")
    events_count = events_df.count()
    print(f"   Events table: {events_count:,} records")
    
    # Show sample data with proper formatting
    print("\nSample Customer Data:")
    (customers_df
     .select("customer_id", "email", "traits", "custom_attributes", "is_active", "region")
     .show(3, truncate=False))
    
    print("\nSample Event Data:")
    (events_df
     .select("event_name", "customer_id", "timestamp", "event_category", "properties")
     .orderBy(F.desc("timestamp"))
     .show(5, truncate=False))
    
    # Display event category distribution
    print("\nEvent Category Distribution:")
    (events_df
     .groupBy("event_category")
     .count()
     .orderBy(F.desc("count"))
     .show())
    
    print("SUCCESS: Environment setup completed and verified!")
    
else:
    print("ERROR: Setup failed - check validation errors above")
    raise RuntimeError("Critical setup validation failed - cannot proceed")

# Test SetupManager functionality
print("\nTEST: Validating SetupManager functionality...")

# Test configuration creation
test_config = setup_manager.create_configuration_from_widgets()
assert test_config.customerio_region in ["us", "eu"], "Invalid region"
assert test_config.database_name, "Database name required"
assert test_config.catalog_name, "Catalog name required"
print("   ✓ Configuration creation test passed")

# Test Customer.IO config validation
test_validation = setup_manager.validate_customerio_config(test_config)
assert test_validation.status in [ValidationStatus.SUCCESS, ValidationStatus.WARNING], "Config validation failed"
print("   ✓ Customer.IO configuration validation test passed")

# Test data generation
test_customers = setup_manager.data_generator.generate_customers(10, "us")
assert len(test_customers) == 10, "Should generate 10 customers"
assert all("customer_id" in c for c in test_customers), "All customers need customer_id"
print("   ✓ Synthetic customer generation test passed")

test_events = setup_manager.data_generator.generate_events(test_customers, 20)
assert len(test_events) == 20, "Should generate 20 events"
assert all("event_id" in e for e in test_events), "All events need event_id"
print("   ✓ Synthetic event generation test passed")

print("SUCCESS: All SetupManager functionality tests passed!")

# Performance optimization recommendations
print("\nPERFORMANCE: Recommended Spark configurations for Customer.IO workloads:")
print(f"   spark.sql.adaptive.enabled=true")
print(f"   spark.sql.adaptive.coalescePartitions.enabled=true") 
print("   spark.databricks.delta.optimizeWrite.enabled=true")
print("   spark.databricks.delta.autoCompact.enabled=true")
print(f"   Current tables optimized with Delta Lake auto-optimization")

print(f"\nREADY: Environment fully configured for Customer.IO data pipeline development!")
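
The cell above only prints the recommended settings; it does not apply them. A minimal sketch of applying them to a session is shown below. The helper `apply_settings` and the dictionary name are hypothetical. The two Delta keys use the session-level names from the Databricks documentation (`spark.databricks.delta.optimizeWrite.enabled` and `spark.databricks.delta.autoCompact.enabled`), which is an assumption about the intended settings; verify them against your runtime version before relying on them.

```python
from typing import Callable, Dict

# Assumed session-level keys mirroring the recommendations printed above.
RECOMMENDED_SPARK_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}


def apply_settings(set_conf: Callable[[str, str], None],
                   settings: Dict[str, str] = RECOMMENDED_SPARK_CONF) -> int:
    """Apply each setting through set_conf and return how many were applied."""
    for key, value in settings.items():
        set_conf(key, value)
    return len(settings)


# In a notebook with an active session: apply_settings(spark.conf.set)
# Here a plain dict stands in for the Spark runtime config.
applied: Dict[str, str] = {}
print(apply_settings(applied.__setitem__))
```

Taking the setter as a parameter lets the same function be exercised against a dictionary in tests and against `spark.conf.set` on a cluster.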

In [ ]:
# ============================================================================
# SECTION 6: SUMMARY DOCUMENTATION
# ============================================================================

print("📊 CUSTOMER.IO SETUP AND CONFIGURATION - COMPLETION SUMMARY")
print("=" * 80)

print(f"""
🎯 **SETUP COMPLETED SUCCESSFULLY**

**Environment Configuration:**
   • Database: {config.get_full_database_name()}
   • Region: {config.customerio_region.upper()}
   • Environment: {config.environment.value.upper()}
   • API Configuration: {'PRODUCTION' if config.environment.value != 'testing' else 'TEST MODE'}

**Infrastructure Created:**
   • ✅ Unity Catalog database with Delta Lake optimization
   • ✅ 6 production-ready table schemas (customers, events, groups, devices, api_responses, batch_operations)
   • ✅ Synthetic test data: 1,000 customers, 5,000 events
   • ✅ Type-safe configuration management with Pydantic validation
   • ✅ Circuit breaker patterns for fault tolerance
   • ✅ Structured logging with correlation IDs

**Key Features Implemented:**
   • 🔐 **Secure Secrets Management**: Databricks secrets integration
   • ⚡ **Performance Optimization**: Delta Lake auto-optimization enabled
   • 🛡️ **Error Handling**: Circuit breaker and retry mechanisms
   • 📊 **Comprehensive Monitoring**: Structured logging and validation
   • 🧪 **Testing Framework**: Synthetic data generation with realistic patterns
   • 🔄 **Production Readiness**: Environment-specific configurations

**Architecture Benefits:**
   • **Type Safety**: Pydantic models with comprehensive validation
   • **Fault Tolerance**: Circuit breaker patterns prevent cascade failures
   • **Scalability**: Auto-scaling Spark configurations and Delta optimizations
   • **Security**: Encrypted secrets and secure credential management
   • **Observability**: Structured logging with performance metrics
   • **Maintainability**: Clean separation of concerns with manager pattern

**Next Steps:**
   1. 📝 Authentication and Utilities (01_authentication_and_utilities.ipynb)
   2. 🔐 People Management (02_people_management.ipynb) 
   3. 👥 Events and Tracking (03_events_and_tracking.ipynb)
   4. 📧 Objects and Relationships (04_objects_and_relationships.ipynb)
   5. 🎯 Device Management (05_device_management.ipynb)
   6. 📊 Advanced Tracking (06_advanced_tracking.ipynb)
   7. 🔄 E-commerce Events (07_ecommerce_events.ipynb)
   8. 📱 Suppression and GDPR (08_suppression_and_gdpr.ipynb)
   9. 🌐 Batch Operations (09_batch_operations.ipynb)
   10. 🚀 Data Pipelines (10_data_pipelines_integration.ipynb)
   11. 📈 Monitoring & Observability (11_monitoring_and_observability.ipynb)
   12. 🏭 Production Deployment (12_production_deployment.ipynb)

**Production Deployment Checklist:**
   □ Configure production secrets in Databricks
   □ Set up monitoring and alerting
   □ Configure backup and disaster recovery
   □ Implement CI/CD pipeline integration
   □ Set up data quality monitoring
   □ Configure access controls and permissions

**Extracted Utilities Available:**
   • `utils.setup_manager.SetupManager`: Complete setup management
   • `utils.api_client.CustomerIOClient`: API client with rate limiting
   • `utils.error_handlers.CircuitBreaker`: Fault tolerance patterns
   • `utils.validators`: Data validation utilities

🎉 **READY FOR CUSTOMER.IO DATA PIPELINE DEVELOPMENT!**
""")

print("\n" + "=" * 80)
print("Setup completed successfully! All systems operational. 🚀")