# Customer.IO Monitoring and Observability

## Purpose

This notebook demonstrates monitoring and observability for Customer.IO data pipelines using the existing `ObservabilityManager` from `utils/observability_manager.py`.

## Prerequisites

- Complete setup from `00_setup_and_configuration.ipynb`
- Complete authentication from `01_authentication_and_utilities.ipynb`
- Customer.IO API key configured in Databricks secrets (the code cells in this notebook use placeholder test credentials)
- Understanding of monitoring and observability concepts
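
As a minimal sketch of the secrets prerequisite, the helper below reads credentials through a lookup callable. The scope name `customerio` and the key names `site_id` and `api_key` are assumptions for illustration, not values defined by this project; in a Databricks notebook the callable would typically be `dbutils.secrets.get`.

```python
from typing import Callable, Dict


def load_customerio_credentials(
    get_secret: Callable[[str, str], str],
    scope: str = "customerio",  # hypothetical scope name
) -> Dict[str, str]:
    """Read Customer.IO credentials through a secret-lookup callable."""
    creds = {
        "site_id": get_secret(scope, "site_id"),  # hypothetical key name
        "api_key": get_secret(scope, "api_key"),  # hypothetical key name
    }
    # Fail early with a clear message instead of sending empty credentials.
    missing = [name for name, value in creds.items() if not value]
    if missing:
        raise ValueError(f"Missing secrets in scope '{scope}': {missing}")
    return creds


# In a Databricks notebook, where `dbutils` is predefined:
#   creds = load_customerio_credentials(dbutils.secrets.get)

# Outside Databricks, a dictionary-backed stub stands in for the secret store.
fake_store = {
    ("customerio", "site_id"): "test_site_123",
    ("customerio", "api_key"): "test_key_456",
}
creds = load_customerio_credentials(lambda scope, key: fake_store[(scope, key)])
print(sorted(creds))
```

Passing the lookup function as a parameter keeps the helper testable outside a Databricks runtime.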

## Key Topics Covered

1. **Metrics Collection** - Real-time performance and business metrics
2. **Alerting** - Intelligent threshold-based alerts and notifications
3. **Health Monitoring** - System and service health checks
4. **Distributed Tracing** - Request correlation and performance analysis
5. **Dashboard Integration** - Comprehensive observability dashboards
6. **Production Monitoring** - Enterprise-grade monitoring patterns

In [ ]:
# Essential imports for monitoring and observability
import json
import time
import threading
from datetime import datetime, timezone, timedelta
from typing import Dict, List, Optional, Any, Union, Callable, Set
from dataclasses import dataclass, field
from collections import defaultdict, deque
from enum import Enum
import structlog
from pydantic import BaseModel, Field, validator
import statistics
import uuid
import psutil
from concurrent.futures import ThreadPoolExecutor
import asyncio
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("SUCCESS: Core monitoring imports loaded")

In [ ]:
# Import Customer.IO utilities with existing ObservabilityManager
import sys
import os

# Add utils to path for imports
sys.path.append(os.path.join(os.getcwd(), 'utils'))

from api_client import CustomerIOClient
from observability_manager import (
    ObservabilityManager,
    MetricType,
    AlertSeverity,
    HealthStatus,
    TraceStatus,
    Metric,
    Alert,
    HealthCheck,
    TraceSpan,
    MetricsCollector,
    AlertRule,
    AlertManager,
    HealthMonitor,
    TracingContext,
    DistributedTracer,
    trace_function
)
from error_handlers import retry_on_error, ErrorContext, CustomerIOError
from validators import validate_request_size

print("SUCCESS: Customer.IO utilities and ObservabilityManager imported")

## Using Existing Models and Enums

The observability components are already implemented in `utils/observability_manager.py`. This section walks through its enums and data models before moving on to the higher-level components.

In [ ]:
# Core enums are imported from utils/observability_manager.py
print("Available metric types:", [metric_type.value for metric_type in MetricType])
print("Available alert severities:", [severity.value for severity in AlertSeverity])
print("Available health statuses:", [status.value for status in HealthStatus])
print("Available trace statuses:", [status.value for status in TraceStatus])

print("SUCCESS: Using existing monitoring enums from ObservabilityManager")

In [ ]:
# Example: Creating metrics using existing Metric model
try:
    sample_metric = Metric(
        name="api_response_time",
        type=MetricType.TIMER,
        value=150.5,
        unit="ms",
        tags={"endpoint": "/api/track", "method": "POST"},
        source="api_server"
    )
    
    print("Created sample metric:")
    print(f"  Name: {sample_metric.name}")
    print(f"  Type: {sample_metric.type}")
    print(f"  Value: {sample_metric.value} {sample_metric.unit}")
    print(f"  Tags: {sample_metric.tags}")
    print(f"  Timestamp: {sample_metric.timestamp}")
    
    # Example alert
    sample_alert = Alert(
        name="High Response Time",
        severity=AlertSeverity.HIGH,
        message="API response time exceeded threshold",
        source="alert_manager",
        metric_name="api_response_time",
        threshold_value=100.0,
        current_value=150.5
    )
    
    print(f"\nCreated sample alert: {sample_alert.name} ({sample_alert.severity})")
    print(f"Alert ID: {sample_alert.alert_id}")
    print(f"Message: {sample_alert.message}")
    print(f"Is resolved: {sample_alert.is_resolved()}")
    
    print("SUCCESS: Using existing Metric and Alert models")
    
except Exception as e:
    print(f"ERROR: Failed to create models: {e}")

In [ ]:
# Example: Creating health checks and trace spans
try:
    # Health check example
    health_check = HealthCheck(
        check_id="api_connectivity",
        name="Customer.IO API Connectivity",
        status=HealthStatus.HEALTHY,
        response_time_ms=45.2,
        details={"endpoint": "api.customer.io", "region": "us"},
        consecutive_failures=0
    )
    
    print("Created health check:")
    print(f"  Name: {health_check.name}")
    print(f"  Status: {health_check.status}")
    print(f"  Response time: {health_check.response_time_ms}ms")
    print(f"  Is healthy: {health_check.is_healthy()}")
    
    # Trace span example
    trace_span = TraceSpan(
        trace_id="trace-12345",
        operation_name="send_customer_event",
        service_name="customer_io_service",
        tags={"user_id": "user_123", "event_type": "track"}
    )
    
    print(f"\nCreated trace span:")
    print(f"  Trace ID: {trace_span.trace_id}")
    print(f"  Span ID: {trace_span.span_id}")
    print(f"  Operation: {trace_span.operation_name}")
    print(f"  Service: {trace_span.service_name}")
    print(f"  Status: {trace_span.status}")
    
    # Simulate span completion
    trace_span.add_log("info", "Starting event processing")
    trace_span.finish(TraceStatus.SUCCESS)
    print(f"  Duration: {trace_span.duration_ms}ms")
    print(f"  Logs: {len(trace_span.logs)} entries")
    
    print("SUCCESS: Using existing HealthCheck and TraceSpan models")
    
except Exception as e:
    print(f"ERROR: Failed to create models: {e}")

## Using Existing MetricsCollector

The MetricsCollector is already implemented in `utils/observability_manager.py` with high-performance buffering and aggregation.
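
To make "buffering and aggregation" concrete, here is a simplified, self-contained sketch of the idea. It is not the `MetricsCollector` implementation; the class name `BufferedCollector` and its methods are hypothetical, and the percentile uses the nearest-rank method as an assumption.

```python
import math
import statistics
import threading
from collections import defaultdict, deque


class BufferedCollector:
    """Toy collector: a bounded buffer plus per-metric aggregation."""

    def __init__(self, buffer_size: int = 1000):
        self._buffer = deque(maxlen=buffer_size)  # oldest entries drop when full
        self._values = defaultdict(list)          # metric name -> recorded values
        self._lock = threading.Lock()             # guards both structures

    def record(self, name: str, value: float) -> None:
        with self._lock:
            self._buffer.append((name, value))
            self._values[name].append(value)

    def flush(self) -> list:
        """Return buffered entries and clear the buffer; aggregates are kept."""
        with self._lock:
            drained = list(self._buffer)
            self._buffer.clear()
            return drained

    def stats(self, name: str) -> dict:
        with self._lock:
            values = sorted(self._values.get(name, []))
        if not values:
            return {}
        rank = max(1, math.ceil(0.95 * len(values)))  # nearest-rank 95th percentile
        return {
            "count": len(values),
            "mean": statistics.fmean(values),
            "min": values[0],
            "max": values[-1],
            "p95": values[rank - 1],
        }


collector = BufferedCollector(buffer_size=100)
for duration_ms in (100.0, 200.0, 300.0):
    collector.record("response_time", duration_ms)
print(collector.stats("response_time"))
```

Keeping the raw buffer separate from the aggregates means a flush can ship recent samples elsewhere without losing running statistics.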

In [ ]:
# Example: Using existing MetricsCollector
try:
    # Initialize MetricsCollector
    metrics_collector = MetricsCollector(buffer_size=1000, flush_interval_seconds=30)
    
    # Record different types of metrics
    metrics_collector.record_counter("api_requests_total", value=1, tags={"endpoint": "/track"})
    metrics_collector.record_gauge("memory_usage_mb", value=512.3, unit="MB")
    metrics_collector.record_timer("response_time", duration_ms=125.7, tags={"operation": "track_event"})
    
    # Record custom metrics
    custom_metric = Metric(
        name="custom_business_metric",
        type=MetricType.GAUGE,
        value=89.5,
        unit="percent",
        tags={"category": "conversion", "funnel": "signup"},
        source="business_logic"
    )
    metrics_collector.record_metric(custom_metric)
    
    # Get statistics
    timer_stats = metrics_collector.get_metric_statistics("response_time", MetricType.TIMER)
    buffer_status = metrics_collector.get_buffer_status()
    
    print("MetricsCollector Usage:")
    print(f"  Buffer utilization: {buffer_status['buffer_utilization']:.1f}%")
    print(f"  Aggregated metric types: {buffer_status['aggregated_metric_types']}")
    print(f"  Last flush: {buffer_status['last_flush']}")
    
    if timer_stats:
        print(f"  Response time stats: mean={timer_stats.get('mean', 0):.1f}ms, count={timer_stats.get('count', 0)}")
    
    print("SUCCESS: Using existing MetricsCollector with buffering and aggregation")
    
except Exception as e:
    print(f"ERROR: Failed to use MetricsCollector: {e}")

## Using Existing AlertManager

The AlertManager is already implemented with intelligent threshold management and notification routing.
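
The threshold and consecutive-violation behaviour can be sketched as follows. This is an illustrative stand-in, not the `AlertRule` implementation; `ThresholdRule` and `observe` are hypothetical names, and resetting the streak on the first passing sample is an assumption.

```python
import operator
from dataclasses import dataclass, field

# Map condition strings to comparison functions.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class ThresholdRule:
    name: str
    condition: str
    threshold: float
    consecutive_violations: int = 1
    _streak: int = field(default=0, init=False, repr=False)

    def violates(self, value: float) -> bool:
        """Check a single sample against the threshold."""
        return _OPS[self.condition](value, self.threshold)

    def observe(self, value: float) -> bool:
        """Feed one sample; return True once enough violations occur in a row."""
        if self.violates(value):
            self._streak += 1
        else:
            self._streak = 0  # any passing sample resets the streak
        return self._streak >= self.consecutive_violations


rule = ThresholdRule("High Error Rate", ">", 5.0, consecutive_violations=2)
for sample in (8.5, 6.0, 1.0):
    print(sample, rule.observe(sample))
```

Requiring several violations in a row is a common way to avoid alerting on a single noisy sample.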

In [ ]:
# Example: Using existing AlertManager
try:
    # Initialize AlertManager with MetricsCollector
    alert_manager = AlertManager(metrics_collector)
    
    # Create alert rules
    high_error_rate_rule = AlertRule(
        name="High Error Rate Alert",
        metric_name="error_rate_percent",
        condition=">",
        threshold=5.0,
        severity=AlertSeverity.HIGH,
        window_minutes=5,
        consecutive_violations=2,
        tags={"team": "platform", "priority": "urgent"}
    )
    
    response_time_rule = AlertRule(
        name="Response Time Alert",
        metric_name="avg_response_time_ms",
        condition=">",
        threshold=1000.0,
        severity=AlertSeverity.MEDIUM,
        window_minutes=10,
        consecutive_violations=3
    )
    
    # Add rules to manager
    alert_manager.add_rule(high_error_rate_rule)
    alert_manager.add_rule(response_time_rule)
    
    # Test rule evaluation
    test_value = 8.5  # This exceeds the 5.0 threshold
    rule_violated = high_error_rate_rule.evaluate(test_value)
    
    print("AlertManager Usage:")
    print(f"  Total rules: {len(alert_manager.rules)}")
    print(f"  Test evaluation (error rate {test_value}%): {'VIOLATED' if rule_violated else 'OK'}")
    
    # Get alert statistics
    alert_stats = alert_manager.get_alert_statistics()
    print(f"  Alert statistics: {alert_stats['total_rules']} total, {alert_stats['enabled_rules']} enabled")
    print(f"  Active alerts: {alert_stats['active_alerts']}")
    
    # Get active alerts
    active_alerts = alert_manager.get_active_alerts()
    if active_alerts:
        print(f"  Current active alerts: {len(active_alerts)}")
        for alert in active_alerts[:3]:  # Show first 3
            print(f"    - {alert.name} ({alert.severity}): {alert.message}")
    else:
        print("  No active alerts")
    
    print("SUCCESS: Using existing AlertManager with rule evaluation")
    
except Exception as e:
    print(f"ERROR: Failed to use AlertManager: {e}")

## Using Existing HealthMonitor

The HealthMonitor is already implemented with comprehensive health checks for services and dependencies.
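
The register-and-run pattern behind a health monitor can be sketched in a few lines. This is a simplified illustration rather than the `HealthMonitor` code; `CheckRegistry`, `Status`, and `summarize` are hypothetical names, and treating an exception as unhealthy is an assumption.

```python
import time
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class CheckRegistry:
    """Toy registry: run named check functions and time each one."""

    def __init__(self):
        self._checks: Dict[str, Callable[[], Status]] = {}

    def register(self, check_id: str, fn: Callable[[], Status]) -> None:
        self._checks[check_id] = fn

    def run_all(self) -> Dict[str, dict]:
        results = {}
        for check_id, fn in self._checks.items():
            started = time.perf_counter()
            try:
                status, error = fn(), None
            except Exception as exc:  # a crashing check counts as unhealthy
                status, error = Status.UNHEALTHY, str(exc)
            elapsed_ms = (time.perf_counter() - started) * 1000
            results[check_id] = {
                "status": status,
                "response_time_ms": elapsed_ms,
                "error": error,
            }
        return results


def summarize(results: Dict[str, dict]) -> dict:
    """Overall status is the worst individual status."""
    statuses = [r["status"] for r in results.values()]
    if Status.UNHEALTHY in statuses:
        overall = Status.UNHEALTHY
    elif Status.DEGRADED in statuses:
        overall = Status.DEGRADED
    else:
        overall = Status.HEALTHY
    healthy = sum(s is Status.HEALTHY for s in statuses)
    pct = 100.0 * healthy / len(statuses) if statuses else 100.0
    return {"status": overall, "health_percentage": pct}


def failing_check() -> Status:
    raise ConnectionError("connection refused")


registry = CheckRegistry()
registry.register("api", lambda: Status.HEALTHY)
registry.register("database", lambda: Status.DEGRADED)
registry.register("cache", failing_check)
results = registry.run_all()
print(summarize(results))
```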

In [ ]:
# Example: Using existing HealthMonitor
try:
    # Initialize the Customer.IO client for the health monitor (placeholder credentials)
    client = CustomerIOClient(
        site_id="test_site_123",
        api_key="test_key_456",
        region="us"
    )
    
    # Initialize HealthMonitor
    health_monitor = HealthMonitor(client)
    
    # Register custom health check
    def check_database_connection():
        """Custom database health check."""
        try:
            # Simulate database check
            import random
            response_time = random.uniform(0.01, 0.1)
            connection_count = random.randint(5, 50)
            
            if connection_count > 45:
                status = HealthStatus.DEGRADED
            else:
                status = HealthStatus.HEALTHY
            
            return {
                "name": "Database Connection",
                "status": status,
                "details": {
                    "response_time_ms": response_time * 1000,
                    "active_connections": connection_count,
                    "max_connections": 50
                }
            }
        except Exception as e:
            return {
                "name": "Database Connection",
                "status": HealthStatus.UNHEALTHY,
                "details": {"error": str(e)}
            }
    
    health_monitor.register_check("database", "Database Connection", check_database_connection)
    
    # Run specific health check
    db_check = health_monitor.run_check("database")
    print("Database Health Check:")
    print(f"  Name: {db_check.name}")
    print(f"  Status: {db_check.status}")
    print(f"  Response time: {db_check.response_time_ms:.1f}ms")
    print(f"  Details: {db_check.details}")
    print(f"  Is healthy: {db_check.is_healthy()}")
    
    # Run all health checks
    all_checks = health_monitor.run_all_checks()
    print(f"\nAll Health Checks ({len(all_checks)} total):")
    for check_id, check in all_checks.items():
        status_icon = "✓" if check.is_healthy() else "✗"
        print(f"  {status_icon} {check.name}: {check.status}")
    
    # Get system health overview
    system_health = health_monitor.get_system_health()
    print(f"\nSystem Health Overview:")
    print(f"  Overall status: {system_health['status']}")
    print(f"  Health percentage: {system_health['summary']['health_percentage']:.1f}%")
    print(f"  Healthy checks: {system_health['summary']['healthy_checks']}/{system_health['summary']['total_checks']}")
    
    print("SUCCESS: Using existing HealthMonitor with custom checks")
    
except Exception as e:
    print(f"ERROR: Failed to use HealthMonitor: {e}")

## Using Existing DistributedTracer

The DistributedTracer is already implemented for complex request flows and performance analysis.
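
For intuition about how child spans attach to a parent, here is a minimal sketch that tracks the current span on a stack. It is not the `DistributedTracer` implementation; `MiniTracer` and `Span` are hypothetical, and a single-threaded stack is a simplifying assumption.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    operation: str
    trace_id: str
    parent_id: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.perf_counter)
    duration_ms: Optional[float] = None


class MiniTracer:
    """Toy tracer: the innermost open span is the parent of any new span."""

    def __init__(self):
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, operation: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            operation=operation,
            # A root span starts a new trace; children inherit the trace ID.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            parent_id=parent.span_id if parent else None,
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - current.start) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = MiniTracer()
with tracer.span("user_registration_flow") as root:
    with tracer.span("validate_user_data"):
        time.sleep(0.01)
    with tracer.span("create_user_record"):
        time.sleep(0.01)

for s in tracer.finished:
    print(f"{s.operation}: parent={s.parent_id} duration={s.duration_ms:.1f}ms")
```

Spans finish innermost first, so the root span appears last in `finished` and its duration covers both children.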

In [ ]:
# Example: Using existing DistributedTracer
try:
    # Initialize DistributedTracer
    tracer = DistributedTracer()
    
    # Start a trace for complex operation
    main_trace = tracer.start_trace(
        "user_registration_flow",
        service_name="user_service",
        tags={"user_id": "user_12345", "flow": "registration"}
    )
    
    print(f"Started main trace: {main_trace.trace_id}")
    print(f"Root span: {main_trace.span_id}")
    
    # Add child spans
    validation_span = tracer.start_span(
        "validate_user_data",
        service_name="validation_service",
        tags={"step": "validation"}
    )
    
    tracer.add_span_log("info", "Starting user data validation")
    time.sleep(0.05)  # Simulate work
    tracer.add_span_log("info", "Validation completed successfully")
    tracer.finish_span(validation_span, TraceStatus.SUCCESS)
    
    # Another child span
    db_span = tracer.start_span(
        "create_user_record",
        service_name="database_service",
        tags={"table": "users", "operation": "insert"}
    )
    
    tracer.add_span_log("info", "Creating user record")
    time.sleep(0.02)  # Simulate database work
    tracer.add_span_log("info", "User record created", user_id="user_12345")
    tracer.finish_span(db_span, TraceStatus.SUCCESS)
    
    # Customer.IO span
    cio_span = tracer.start_span(
        "send_welcome_event",
        service_name="customer_io",
        tags={"event_type": "welcome", "channel": "email"}
    )
    
    tracer.add_span_log("info", "Sending welcome event to Customer.IO")
    time.sleep(0.1)  # Simulate API call
    tracer.add_span_log("info", "Welcome event sent successfully")
    tracer.finish_span(cio_span, TraceStatus.SUCCESS)
    
    # Finish main trace
    tracer.finish_span(main_trace, TraceStatus.SUCCESS)
    
    # Get trace tree
    trace_tree = tracer.get_trace_tree(main_trace.trace_id)
    print(f"\nTrace Analysis:")
    print(f"  Total spans: {trace_tree['total_spans']}")
    print(f"  Total duration: {trace_tree['total_duration_ms']:.2f}ms")
    print(f"  Root spans: {len(trace_tree['root_spans'])}")
    
    # Show span hierarchy
    def print_span_tree(span_data, level=0):
        indent = "  " * level
        span = span_data["span"]
        print(f"{indent}- {span['operation_name']} ({span['service_name']}): {span['duration_ms']:.1f}ms")
        for child in span_data["children"]:
            print_span_tree(child, level + 1)
    
    print(f"\nSpan Hierarchy:")
    for root_span in trace_tree['root_spans']:
        print_span_tree(root_span)
    
    # Get tracing statistics
    trace_stats = tracer.get_tracing_statistics()
    print(f"\nTracing Statistics:")
    print(f"  Total traces: {trace_stats['total_traces']}")
    print(f"  Total spans: {trace_stats['total_spans']}")
    print(f"  Active spans: {trace_stats['active_spans']}")
    print(f"  Average trace duration: {trace_stats['average_trace_duration_ms']:.2f}ms")
    print(f"  Average spans per trace: {trace_stats['average_spans_per_trace']}")
    
    # Demonstrate trace function decorator
    @trace_function("customer_lookup", "user_service")
    def lookup_customer(customer_id: str):
        """Example function with automatic tracing."""
        time.sleep(0.03)  # Simulate work
        return {"id": customer_id, "name": "John Doe", "plan": "premium"}
    
    # The decorator creates its own tracer, so the function is called as usual
    result = lookup_customer("customer_789")
    print(f"\nFunction tracing result: {result}")
    
    print("SUCCESS: Using existing DistributedTracer with span hierarchy")
    
except Exception as e:
    print(f"ERROR: Failed to use DistributedTracer: {e}")

## Using Existing ObservabilityManager

The comprehensive ObservabilityManager combines all monitoring components into a unified interface.
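
As a rough illustration of the request counters a unified manager might expose, the sketch below derives an error rate and a request rate from raw counts. The formulas are assumptions about how figures such as `error_rate_percent` and `requests_per_minute` could be computed; `RequestStats` is a hypothetical class, not part of `ObservabilityManager`.

```python
import time


class RequestStats:
    """Toy request counter with derived rates."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so tests can control time
        self._started = clock()
        self.total_requests = 0
        self.total_errors = 0

    def record_request(self, success: bool) -> None:
        self.total_requests += 1
        if not success:
            self.total_errors += 1

    def snapshot(self) -> dict:
        uptime = max(self._clock() - self._started, 1e-9)  # avoid division by zero
        if self.total_requests:
            error_rate = 100.0 * self.total_errors / self.total_requests
        else:
            error_rate = 0.0
        return {
            "uptime_seconds": uptime,
            "total_requests": self.total_requests,
            "total_errors": self.total_errors,
            "error_rate_percent": error_rate,
            "requests_per_minute": self.total_requests / uptime * 60.0,
        }


stats = RequestStats()
for ok in (True, True, False):
    stats.record_request(success=ok)
print(f"{stats.snapshot()['error_rate_percent']:.1f}% errors")
```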

In [ ]:
# Example: Using comprehensive ObservabilityManager
try:
    # Initialize ObservabilityManager with Customer.IO client
    observability = ObservabilityManager(client)
    observability.setup_default_alerts()
    
    print("ObservabilityManager initialized with components:")
    print(f"  - MetricsCollector: {type(observability.metrics_collector).__name__}")
    print(f"  - AlertManager: {type(observability.alert_manager).__name__}")
    print(f"  - HealthMonitor: {type(observability.health_monitor).__name__}")
    print(f"  - DistributedTracer: {type(observability.tracer).__name__}")
    
    # Record various operations
    observability.record_request("api_call", duration_ms=125.5, success=True, tags={"endpoint": "/track"})
    observability.record_request("api_call", duration_ms=89.2, success=True, tags={"endpoint": "/identify"})
    observability.record_request("api_call", duration_ms=2500.0, success=False, tags={"endpoint": "/batch"})
    
    # Record Customer.IO specific events
    observability.record_customer_io_event("track", "user_123", success=True, response_time_ms=95.3)
    observability.record_customer_io_event("identify", "user_456", success=True, response_time_ms=67.8)
    observability.record_customer_io_event("batch", "batch_789", success=False, response_time_ms=3000.0)
    
    # Record batch operations
    observability.record_batch_operation(
        batch_size=100,
        processing_time_ms=1250.0,
        success_count=95,
        error_count=5
    )
    
    # Use tracing
    trace = observability.start_trace("complex_customer_operation", service="customer_service")
    
    span1 = observability.start_span("validate_input", service="validation")
    time.sleep(0.02)
    observability.finish_span(span1, success=True)
    
    span2 = observability.start_span("process_data", service="processor")
    time.sleep(0.05)
    observability.finish_span(span2, success=True)
    
    span3 = observability.start_span("send_to_customerio", service="customer_io")
    time.sleep(0.08)
    observability.finish_span(span3, success=True)
    
    observability.finish_span(trace, success=True)
    
    print(f"\nRecorded operations and completed trace: {trace.trace_id}")
    
    # Get comprehensive dashboard data
    dashboard = observability.get_dashboard_data()
    print(f"\nDashboard Data:")
    print(f"  System uptime: {dashboard['system']['uptime_seconds']:.1f}s")
    print(f"  Total requests: {dashboard['requests']['total_requests']}")
    print(f"  Error rate: {dashboard['requests']['error_rate_percent']:.1f}%")
    print(f"  Requests/minute: {dashboard['requests']['requests_per_minute']:.1f}")
    print(f"  Health status: {dashboard['system']['health_status']}")
    print(f"  Active alerts: {dashboard['alerts']['active_alerts']}")
    print(f"  Metrics buffer utilization: {dashboard['metrics']['buffer_utilization']:.1f}%")
    
    # Generate health report
    health_report = observability.generate_health_report()
    print(f"\nHealth Report:")
    print(f"  Overall health: {health_report['overall_health']}")
    print(f"  Health checks: {health_report['summary']['passing_health_checks']}/{health_report['summary']['total_health_checks']} passing")
    print(f"  Active alerts: {health_report['summary']['active_alert_count']}")
    
    if health_report['recommendations']:
        print(f"  Recommendations:")
        for rec in health_report['recommendations'][:3]:  # Show first 3
            print(f"    - {rec}")
    
    # Get manager metrics
    manager_metrics = observability.get_metrics()
    print(f"\nManager Metrics:")
    print(f"  Manager uptime: {manager_metrics['manager']['uptime_seconds']:.1f}s")
    print(f"  Total requests processed: {manager_metrics['manager']['total_requests']}")
    print(f"  Total errors: {manager_metrics['manager']['total_errors']}")
    
    components = manager_metrics['components']
    print(f"  Components status:")
    print(f"    - MetricsCollector buffer: {components['metrics_collector']['buffer_utilization']:.1f}%")
    print(f"    - AlertManager rules: {components['alert_manager']['total_rules']}")
    print(f"    - HealthMonitor checks: {components['health_monitor']['registered_checks']}")
    print(f"    - Tracer: {components['tracer']['total_traces']} traces, {components['tracer']['total_spans']} spans")
    
    print("SUCCESS: Comprehensive observability system operational")
    
except Exception as e:
    print(f"ERROR: Failed to use ObservabilityManager: {e}")

## Practical Observability Examples

Comprehensive examples demonstrating observability features with the existing ObservabilityManager.

In [ ]:
# Initialize ObservabilityManager with Customer.IO client
try:
    client = CustomerIOClient(
        site_id="test_site_123",
        api_key="test_key_456",
        region="us"
    )
    
    observability = ObservabilityManager(client)
    observability.setup_default_alerts()
    
    print("SUCCESS: ObservabilityManager initialized with Customer.IO client")
    print(f"Manager start time: {observability.start_time}")
    print(f"Default alert rules configured: {len(observability.alert_manager.rules)}")
    
except Exception as e:
    print(f"ERROR: Failed to initialize ObservabilityManager: {e}")

In [ ]:
# Example: Record various Customer.IO operations
try:
    import random
    
    # Simulate various API operations with realistic performance patterns
    operations = [
        {"name": "track_event", "endpoint": "/api/track", "base_time": 100},
        {"name": "identify_user", "endpoint": "/api/identify", "base_time": 80},
        {"name": "batch_operation", "endpoint": "/api/batch", "base_time": 500},
        {"name": "suppress_user", "endpoint": "/api/suppress", "base_time": 150},
        {"name": "delete_user", "endpoint": "/api/delete", "base_time": 200}
    ]
    
    print("Simulating Customer.IO operations...")
    
    for i in range(15):
        op = random.choice(operations)
        
        # Add realistic variance to response times
        response_time = op["base_time"] + random.uniform(-30, 100)
        success = random.random() > 0.05  # 95% success rate
        
        # Record request metrics
        observability.record_request(
            operation=op["name"],
            duration_ms=response_time,
            success=success,
            tags={"endpoint": op["endpoint"], "version": "v1"}
        )
        
        # Record Customer.IO specific metrics
        observability.record_customer_io_event(
            event_type=op["name"],
            user_id=f"user_{i % 5}",  # 5 different users
            success=success,
            response_time_ms=response_time
        )
    
    # Simulate batch operations
    for batch_num in range(3):
        batch_size = random.randint(50, 200)
        processing_time = batch_size * random.uniform(8, 15)  # 8-15 ms per item
        success_count = int(batch_size * random.uniform(0.9, 1.0))  # 90-100% success
        error_count = batch_size - success_count
        
        observability.record_batch_operation(
            batch_size=batch_size,
            processing_time_ms=processing_time,
            success_count=success_count,
            error_count=error_count
        )
    
    print("SUCCESS: Recorded 15 API operations and 3 batch operations")
    print(f"Total requests tracked: {observability.request_count}")
    print(f"Total errors tracked: {observability.error_count}")
    
except Exception as e:
    print(f"ERROR: Failed to record operations: {e}")

In [ ]:
# Example: Complex distributed tracing scenario
try:
    # Simulate a complex customer journey with multiple touchpoints
    trace = observability.start_trace(
        "customer_journey_flow",
        service="customer_experience",
        user_id="user_premium_123",
        journey_type="onboarding"
    )
    
    print(f"Started customer journey trace: {trace.trace_id}")
    
    # Step 1: User registration validation
    validation_span = observability.start_span(
        "validate_registration_data",
        service="validation_service"
    )
    time.sleep(0.03)  # Simulate validation work
    observability.tracer.add_span_log("info", "Email validation completed")
    observability.tracer.add_span_log("info", "Password strength verified")
    observability.finish_span(validation_span, success=True)
    
    # Step 2: Account creation
    account_span = observability.start_span(
        "create_customer_account",
        service="account_service"
    )
    time.sleep(0.05)  # Simulate database operations
    observability.tracer.add_span_log("info", "Account created successfully")
    observability.tracer.add_span_log("info", "User preferences initialized")
    observability.finish_span(account_span, success=True)
    
    # Step 3: Customer.IO identification
    cio_identify_span = observability.start_span(
        "customerio_identify_user",
        service="customer_io"
    )
    time.sleep(0.08)  # Simulate API call
    observability.tracer.add_span_log("info", "User identified in Customer.IO")
    observability.tracer.add_span_log("info", "Profile attributes synchronized")
    observability.finish_span(cio_identify_span, success=True)
    
    # Step 4: Send welcome sequence
    welcome_span = observability.start_span(
        "trigger_welcome_sequence",
        service="customer_io"
    )
    time.sleep(0.12)  # Simulate campaign trigger
    observability.tracer.add_span_log("info", "Welcome email campaign triggered")
    observability.tracer.add_span_log("info", "Onboarding sequence initiated")
    observability.finish_span(welcome_span, success=True)
    
    # Step 5: Analytics tracking
    analytics_span = observability.start_span(
        "track_registration_event",
        service="analytics_service"
    )
    time.sleep(0.04)  # Simulate analytics processing
    observability.tracer.add_span_log("info", "Registration event tracked")
    observability.tracer.add_span_log("info", "Conversion funnel updated")
    observability.finish_span(analytics_span, success=True)
    
    # Finish the main journey
    observability.finish_span(trace, success=True)
    
    # Analyze the trace
    trace_tree = observability.tracer.get_trace_tree(trace.trace_id)
    print(f"\nCustomer Journey Analysis:")
    print(f"  Total operation time: {trace_tree['total_duration_ms']:.2f}ms")
    print(f"  Number of service calls: {trace_tree['total_spans']}")
    print(f"  Journey success: All spans completed successfully")
    
    # Show detailed breakdown
    spans = observability.tracer.get_trace(trace.trace_id)
    print(f"\nDetailed Timeline:")
    for span in sorted(spans, key=lambda s: s.start_time):
        if span.duration_ms is not None:  # keep spans that report a 0.0 ms duration
            print(f"  {span.operation_name} ({span.service_name}): {span.duration_ms:.1f}ms")
            if span.logs:
                for log in span.logs[-1:]:  # Show last log entry
                    print(f"    └─ {log['message']}")
    
    print("SUCCESS: Complex distributed tracing completed")
    
except Exception as e:
    print(f"ERROR: Distributed tracing failed: {e}")

In [ ]:
# Example: Health monitoring and alerting
try:
    # Run comprehensive health checks
    health_results = observability.health_monitor.run_all_checks()
    
    print("Health Check Results:")
    for check_id, check in health_results.items():
        status_icon = "✅" if check.is_healthy() else "❌"
        print(f"  {status_icon} {check.name}")
        print(f"      Status: {check.status}")
        print(f"      Response Time: {check.response_time_ms:.1f}ms")
        if check.details:
            key_details = list(check.details.items())[:2]  # Show first 2 details
            for key, value in key_details:
                if isinstance(value, (int, float)):
                    print(f"      {key}: {value}")
                else:
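                    text = str(value)
                    # Add an ellipsis only when the value was actually truncated
                    print(f"      {key}: {text[:50]}{'...' if len(text) > 50 else ''}")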
        print()
    
    # Get overall system health
    system_health = observability.health_monitor.get_system_health()
    print(f"Overall System Health: {system_health['status']}")
    print(f"Health Score: {system_health['summary']['health_percentage']:.1f}%")
    print(f"Healthy Checks: {system_health['summary']['healthy_checks']}/{system_health['summary']['total_checks']}")
    
    # Check for any active alerts
    active_alerts = observability.alert_manager.get_active_alerts()
    print(f"\nAlert Status:")
    print(f"  Active Alerts: {len(active_alerts)}")
    
    if active_alerts:
        for alert in active_alerts[:3]:  # Show first 3 alerts
            print(f"    - {alert.name} ({alert.severity})")
            print(f"      {alert.message}")
            print(f"      Triggered: {alert.triggered_at}")
    else:
        print("  No active alerts - system operating normally")
    
    # Show alert rule status
    alert_stats = observability.alert_manager.get_alert_statistics()
    print(f"\nAlert Configuration:")
    print(f"  Total Rules: {alert_stats['total_rules']}")
    print(f"  Enabled Rules: {alert_stats['enabled_rules']}")
    print(f"  Alerts (24h): {alert_stats['alerts_last_24h']}")
    
    print("SUCCESS: Health monitoring and alerting operational")
    
except Exception as e:
    print(f"ERROR: Health monitoring failed: {e}")

In [ ]:
# Example: Dashboard data and comprehensive monitoring
try:
    # Allow time for metrics to accumulate
    time.sleep(1)
    
    # Get comprehensive dashboard data
    dashboard = observability.get_dashboard_data()
    
    print("📊 OBSERVABILITY DASHBOARD")
    print("=" * 50)
    
    # System overview
    print(f"🖥️  SYSTEM STATUS")
    print(f"   Uptime: {dashboard['system']['uptime_seconds']:.1f} seconds")
    print(f"   Health: {dashboard['system']['health_status']}")
    print(f"   Started: {dashboard['system']['start_time'][:19]}Z")
    
    # Request metrics
    print(f"\n📈 REQUEST METRICS")
    requests = dashboard['requests']
    print(f"   Total Requests: {requests['total_requests']}")
    print(f"   Total Errors: {requests['total_errors']}")
    print(f"   Error Rate: {requests['error_rate_percent']:.1f}%")
    print(f"   Req/Min: {requests['requests_per_minute']:.1f}")
    
    # Performance metrics
    if dashboard.get('performance'):
        perf = dashboard['performance']
        print(f"\n⚡ PERFORMANCE METRICS")
        print(f"   Avg Response: {perf.get('mean', 0):.1f}ms")
        print(f"   P95 Response: {perf.get('p95', 0):.1f}ms")
        print(f"   Min/Max: {perf.get('min', 0):.1f}ms / {perf.get('max', 0):.1f}ms")
        print(f"   Total Samples: {perf.get('count', 0)}")
    
    # Alert status
    alerts = dashboard['alerts']
    print(f"\n🚨 ALERT STATUS")
    print(f"   Active Alerts: {alerts['active_alerts']}")
    print(f"   Critical Alerts: {alerts['critical_alerts']}")
    alert_stats = alerts['alert_statistics']
    print(f"   Alert Rules: {alert_stats['enabled_rules']}/{alert_stats['total_rules']} enabled")
    
    # Tracing metrics
    tracing = dashboard['tracing']
    print(f"\n🔍 DISTRIBUTED TRACING")
    print(f"   Total Traces: {tracing['total_traces']}")
    print(f"   Total Spans: {tracing['total_spans']}")
    print(f"   Active Spans: {tracing['active_spans']}")
    print(f"   Avg Trace Duration: {tracing['average_trace_duration_ms']:.1f}ms")
    
    # Metrics buffer status
    metrics_info = dashboard['metrics']
    print(f"\n📊 METRICS BUFFER")
    print(f"   Buffer Utilization: {metrics_info['buffer_utilization']:.1f}%")
    print(f"   Buffer Size: {metrics_info['buffer_size']}/{metrics_info['buffer_capacity']}")
    print(f"   Metric Types: {metrics_info['aggregated_metric_types']}")
    print(f"   Last Flush: {metrics_info['last_flush'][:19]}Z")
    
    print(f"\n📅 Report Generated: {dashboard['timestamp'][:19]}Z")
    
    # Generate detailed health report
    print("\n" + "=" * 50)
    health_report = observability.generate_health_report()
    
    print(f"🏥 HEALTH REPORT")
    print(f"   Overall Health: {health_report['overall_health']}")
    
    summary = health_report['summary']
    print(f"   Health Checks: {summary['passing_health_checks']}/{summary['total_health_checks']} passing")
    print(f"   Active Alerts: {summary['active_alert_count']}")
    print(f"   Critical Alerts: {summary['critical_alert_count']}")
    
    if health_report['recommendations']:
        print(f"\n💡 RECOMMENDATIONS:")
        for i, rec in enumerate(health_report['recommendations'][:3], 1):
            print(f"   {i}. {rec}")
    else:
        print(f"\n✅ No recommendations - system running optimally")
    
    # Show manager metrics
    manager_metrics = observability.get_metrics()
    print(f"\n🔧 MANAGER STATUS")
    manager = manager_metrics['manager']
    print(f"   Manager Uptime: {manager['uptime_seconds']:.1f}s")
    print(f"   Requests Processed: {manager['total_requests']}")
    print(f"   Errors Tracked: {manager['total_errors']}")
    
    features = manager_metrics['features']
    enabled_features = [name for name, enabled in features.items() if enabled]
    print(f"   Active Features: {', '.join(enabled_features)}")
    
    print("\n✅ SUCCESS: Comprehensive observability system operational")
    print("   All monitoring components functioning correctly")
    print("   Ready for production workloads")
    
except Exception as e:
    print(f"❌ ERROR: Dashboard generation failed: {e}")

## Summary

This notebook demonstrates how to use the existing **ObservabilityManager** from `utils/observability_manager.py` to monitor Customer.IO data pipelines.

**Key Components Used:**
- **ObservabilityManager**: Unified interface for all monitoring capabilities
- **MetricsCollector**: High-performance metrics collection with buffering and aggregation
- **AlertManager**: Intelligent alerting with threshold management and rule evaluation
- **HealthMonitor**: Comprehensive health checks for services and dependencies
- **DistributedTracer**: Request correlation and performance analysis across services

**Capabilities Demonstrated:**
- Real-time metrics collection (counters, gauges, timers, histograms)
- Configurable alerting rules with multiple severity levels and violation tracking
- System resource monitoring and custom health check registration
- Distributed tracing with span hierarchy and correlation across services
- Performance analytics with statistical aggregation (P95, P99, mean, etc.)
- Comprehensive dashboard data generation for monitoring systems
- Automated health reporting with actionable recommendations

**Integration Benefits:**
- Customer.IO API monitoring and performance tracking
- Batch operation monitoring with success/error rate tracking
- Error tracking and intelligent alerting
- System resource utilization monitoring
- Custom health check registration for business-specific monitoring

**Production-Ready Features:**
- Thread-safe operations with background processing
- Configurable buffer sizes and flush intervals
- Comprehensive error handling and recovery
- Statistical aggregation and trend analysis
- Multi-service distributed tracing support
- Automated alert resolution and notification

The manager plugs into existing Customer.IO operations and provides the metrics, alerting, health checks, and tracing needed to monitor production Databricks environments.