# Operationalizing Agent Platform: Telemetry
Welcome to the telemetry lab. In this lab, we'll instrument our code to output traces, logs, and metrics using OpenTelemetry (OTEL), with outputs initially directed to the console.

We'll cover:
* The importance of using instrumentation facades to create "two-way door" decisions that allow you to change your telemetry backends without refactoring code
* How auto instrumenters and our own custom telemetry data play together
* Common patterns for logging, metric emission, and distributed tracing in agent systems
* Practical examples of instrumenting different components of an agent platform
* How to configure and use various OTEL exporters
* Examining how our telemetry pipeline works in the EKS cluster we've already deployed

By the end of this lab, you'll understand how to effectively monitor and debug agent systems in production using industry-standard observability practices.

# Creating Trace Context
The core idea is to use libraries that are compatible with OTLP (the OpenTelemetry Protocol) and to rely on OTEL's context propagation to tie different instrumentation libraries together. Instrumentation libraries are useful because they patch common packages to auto-instrument things like LLM calls and agent frameworks. Without them you'll be writing a LOT of custom instrumentation code, which will ruin your day. Relying on OTLP also makes the choice of instrumentation library a two-way door: as long as the replacement is OTLP compatible, you're in business.

## Where to send my telemetry data? 
Avoid sending your telemetry directly to a provider. Instead, send it to an OTEL collector running somewhere, and configure the collector to forward the telemetry wherever you want it to go. Want to send it to LangFuse? Great! Want to send it to LogFire or Phoenix? Also great! Anything that is OTEL compatible is fair game. Your code shouldn't care where the telemetry data is going.
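
One way to keep the destination out of your code is to read the collector address from the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable. The sketch below is illustrative only: the helper names and the `otel-collector` hostname are assumptions, and `build_span_exporter` needs the `opentelemetry-exporter-otlp-proto-grpc` package. The OTLP exporter also reads this variable on its own, so the explicit lookup is here just to make the behavior visible.

```python
import os
from typing import Mapping, Optional

# 4317 is the conventional OTLP/gRPC port for a collector.
DEFAULT_OTLP_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the collector endpoint from the environment, or a local default."""
    env = os.environ if env is None else env
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_OTLP_ENDPOINT)


def build_span_exporter():
    """Build an OTLP span exporter that points at the resolved collector."""
    # Imported lazily so the rest of this sketch runs without the exporter package.
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    return OTLPSpanExporter(endpoint=resolve_otlp_endpoint(), insecure=True)


# The application code is identical in both cases; only the environment differs.
local_endpoint = resolve_otlp_endpoint({})
cluster_endpoint = resolve_otlp_endpoint(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317"}  # hypothetical host
)
print(local_endpoint)
print(cluster_endpoint)
```

Which backend the collector forwards to is then purely a collector configuration concern.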

## What about custom spans, metrics, and logs? 
This is where you use a facade. It's common practice to move your metrics, logs, and custom trace logic behind an abstraction layer. If you change your mind about the underlying logging, metric, or trace configuration, you only need to update the facade.

To get started, let's build our facade and OTEL provider.


# Creating a facade
First, we need a facade that the rest of our code calls to reach the collectors. What you don't want is an SDK or tool used directly throughout your code. If you later decide to switch to something different, it will be a painful refactor. Keep vendor-specific logic in the collector, not in your code. Lock-in comes from framework-specific enrichments that can be difficult to reverse.

There are two components to the facade:
1. The facade itself
2. A provider that the facade writes to

First, let's define an interface for our provider.

In [None]:
from abc import ABC, abstractmethod
from opentelemetry.trace import Tracer
from opentelemetry.metrics import Meter
from logging import Logger

class BaseObservabilityProvider(ABC):
    """
    Abstract base class for observability providers.
    Implementations can switch out the underlying tooling.
    """
    @abstractmethod
    def get_tracer(self, name: str) -> Tracer:
        """Get a tracer for creating spans."""
        pass
    
    @abstractmethod
    def get_meter(self, name: str) -> Meter:
        """Get a meter for recording metrics."""
        pass
    
    @abstractmethod
    def get_logger(self, name: str) -> Logger:
        """Get a logger for recording logs."""
        pass

Now that we have our interface, let's define an OTEL provider that sends telemetry to the OTEL endpoint in a cluster. This is pretty undifferentiated work, so we'll just use the version in our common/ directory.

In [None]:
from agentic_platform.core.observability.provider.otel_provider import OpenTelemetryProvider

# Local Testing
This OTEL provider needs an OTEL endpoint to send telemetry to. For the first part of this notebook, we want to emit telemetry to our Jupyter notebook output instead, and the setup for console output is a bit different. Let's subclass OpenTelemetryProvider and override its setup methods to use console exporters.

In [None]:
import logging
from logging import Logger
from typing import Optional, Dict

# OpenTelemetry imports for traces
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# OpenTelemetry imports for metrics
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Resource attributes shared by all signals
from opentelemetry.sdk.resources import Resource

class ConsoleOpenTelemetryProvider(OpenTelemetryProvider):
    """
    OpenTelemetry provider that outputs to console for Jupyter notebooks.
    """
    
    def __init__(
        self,
        service_name: str,
        service_version: str = "0.1.0",
        additional_attributes: Optional[Dict[str, str]] = None
    ):
        # Don't call parent's __init__ to avoid running its setup methods
        # Instead, copy the initialization code but skip the setup calls
        self.service_name = service_name
        self.service_version = service_version
        
        # Create base resource with service info
        resource_attributes = {
            "service.name": service_name,
            "service.version": service_version,
        }
        
        # Add any additional attributes
        if additional_attributes:
            resource_attributes.update(additional_attributes)
        
        self.resource = Resource.create(resource_attributes)
        
        # Initialize providers
        self.tracer_provider = None
        self.meter_provider = None
        self.logger_provider = None
        
        # Track which loggers were configured. The console logger doesn't really need this,
        # but methods inherited from the parent class may. Set it before running setup.
        self._configured_loggers = set()

        # Call our own setup methods
        self._setup_tracing()
        self._setup_metrics()
        self._setup_logging()
    
    def _setup_tracing(self) -> None:
        """Override to use console exporter instead of OTLP."""
        # Create the tracer provider and register it as the global provider.
        # OTEL honors only the first global registration; later calls log a warning.
        self.tracer_provider = TracerProvider(resource=self.resource)
        console_exporter = ConsoleSpanExporter()
        self.tracer_provider.add_span_processor(SimpleSpanProcessor(console_exporter))
        trace.set_tracer_provider(self.tracer_provider)
    
    def _setup_metrics(self) -> None:
        """Override to use console exporter instead of OTLP."""
        console_exporter = ConsoleMetricExporter()
        reader = PeriodicExportingMetricReader(console_exporter, export_interval_millis=1000)
        self.meter_provider = MeterProvider(resource=self.resource, metric_readers=[reader])
        metrics.set_meter_provider(self.meter_provider)
    
    def _setup_logging(self) -> None:
        """Override to use basic console logging instead of OTLP."""
        # Configure basic Python logging to console
        logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(name)s: %(message)s')

    # Override the get logger method with basic logging to stdout. 
    # The OTEL logger adds OTEL context to the log record that requires
    # the actual Meter and Tracer to be available vs. the console versions.
    def get_logger(self, name: str) -> Logger:
        """Get a standard library logger that writes to the console."""
        logger = logging.getLogger(name)
        return logger

Lastly, we need to create our facade. It consumes an observability provider and acts as the interface between the rest of our code and the underlying provider.

In [None]:
from typing import Any, Dict, Optional
from opentelemetry.trace import Tracer
from opentelemetry.metrics import Meter
from logging import Logger

class ObservabilityFacade:
    """
    A unified service to handle all observability concerns.
    This facade class encapsulates all three telemetry signals.
    """
    
    def __init__(self, service_name: str, provider: BaseObservabilityProvider):
        """
        Initialize the observability service.
        
        Args:
            service_name: Name of the service
            provider: pre-configured observability provider
        """
        self.service_name = service_name
        
        # Use provided provider.
        self.provider = provider
        
        # Get the components for use
        self.tracer = self.provider.get_tracer(service_name)
        self.meter = self.provider.get_meter(service_name)
        self.logger = self.provider.get_logger(service_name)
        
        # Registries for the metric instruments created through this facade
        self.counter_metrics = {}
        self.gauge_metrics = {}
        self.histogram_metrics = {}
    
    def create_counter(self, name: str, description: str, unit: str = "1") -> None:
        """Create a counter metric."""
        self.counter_metrics[name] = self.meter.create_counter(
            name=name,
            description=description,
            unit=unit
        )
    
    def create_gauge(self, name: str, description: str, unit: str = "1", callbacks=None) -> None:
        """Create an observable gauge. Its value is reported by the given callbacks."""
        self.gauge_metrics[name] = self.meter.create_observable_gauge(
            name=name,
            callbacks=callbacks,
            description=description,
            unit=unit
        )
    
    def create_histogram(self, name: str, description: str, unit: str = "1") -> None:
        """Create a histogram metric."""
        self.histogram_metrics[name] = self.meter.create_histogram(
            name=name,
            description=description,
            unit=unit
        )
    
    def increment_counter(self, name: str, value: int = 1, attributes: Optional[Dict[str, Any]] = None) -> None:
        """Increment a counter metric."""
        if name not in self.counter_metrics:
            self.create_counter(name, f"Counter for {name}")
        self.counter_metrics[name].add(value, attributes)
    
    def record_histogram(self, name: str, value: float, attributes: Optional[Dict[str, Any]] = None) -> None:
        """Record a value to a histogram metric."""
        if name not in self.histogram_metrics:
            self.create_histogram(name, f"Histogram for {name}")
        self.histogram_metrics[name].record(value, attributes)
    
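    def start_span(self, name: str, attributes: Optional[Dict[str, Any]] = None, kind=None):
        """Start a new trace span as the current span. Use it as a context manager."""
        if kind is None:
            # Let the tracer apply its default span kind instead of passing None through.
            return self.tracer.start_as_current_span(name, attributes=attributes)
        return self.tracer.start_as_current_span(name, attributes=attributes, kind=kind)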
    
    def log(self, level: int, message: str, **kwargs) -> None:
        """Log a message at the specified level."""
        self.logger.log(level, message, **kwargs)
    
    def debug(self, message: str, **kwargs) -> None:
        """Log a debug message."""
        self.logger.debug(message, **kwargs)
    
    def info(self, message: str, **kwargs) -> None:
        """Log an info message."""
        self.logger.info(message, **kwargs)
    
    def warning(self, message: str, **kwargs) -> None:
        """Log a warning message."""
        self.logger.warning(message, **kwargs)
    
    def error(self, message: str, **kwargs) -> None:
        """Log an error message."""
        self.logger.error(message, **kwargs)
    
    def critical(self, message: str, **kwargs) -> None:
        """Log a critical message."""
        self.logger.critical(message, **kwargs)

    # Direct accessor methods:
    def get_tracer(self) -> Tracer:
        return self.tracer
        
    def get_meter(self) -> Meter:
        return self.meter
        
    def get_logger(self) -> Logger:
        return self.logger


# Global instance
_instance: Optional[ObservabilityFacade] = None

def configure_facade(service_name: str, provider: BaseObservabilityProvider) -> 'ObservabilityFacade':
    """
    Configure and set the global ObservabilityFacade instance.
    """
    global _instance
    _instance = ObservabilityFacade(service_name, provider)
    return _instance

def get_facade() -> Optional['ObservabilityFacade']:
    """
    Get the global ObservabilityFacade instance if configured.
    """
    return _instance

# Configure The Facade
Great! We now have everything we need to start instrumenting our system. For this part of the notebook, we'll output the OTEL telemetry data to the Jupyter notebook output. Later in the notebook, we'll swap the underlying provider for the OTEL provider running in our EKS cluster and see it in action.

In [None]:
# Create the console provider
console_provider = ConsoleOpenTelemetryProvider(
    service_name="jupyter-notebook-test",
    service_version="0.1.0",
    additional_attributes={"environment": "notebook"}
)

telemetry: ObservabilityFacade = configure_facade("jupyter-notebook-test", console_provider)

# Create a counter instrument
telemetry.create_counter("sample_counter", "Counts sample events", "1")

# Test out our spans
We'll start a span called batch-operations and create child spans inside it. The active span context ties them all together. In the console output, the child spans share the same parent_id, which is the span ID of batch-operations. If you run the cell multiple times, that parent_id is different on every run.

In [None]:
import time

# Create an outer parent span
with telemetry.start_span("batch-operations"):
    telemetry.info("Starting batch of operations")  # This log is tied to the parent span
    
    # Test metric collection
    for i in range(3):
        telemetry.increment_counter("sample_counter", value=1, attributes={"action": "test"})
        print(f"Added metric {i+1}")
        
        # Create child spans - these will automatically be children of the outer span
        with telemetry.start_span(f"operation-{i}"):
            print(f"Performing operation {i}")
            telemetry.info(f"Processing operation {i}")  # This log is tied to the child span
            time.sleep(0.5)
    
    telemetry.info("Completed all operations")  # This log is tied to the parent span again


# Auto Instrumenters
To date (4/6/2024), there's no single OTLP-compatible auto instrumenter that is comprehensive, so we need to patch several together, using the global context of the OTEL SDK as the glue.

## OpenLLMetry
OpenLLMetry is fairly comprehensive and has instrumentation for most of the major LLM providers, including Bedrock and SageMaker. It also has instrumentation for frameworks such as LangChain, LlamaIndex, CrewAI, and LiteLLM. What it doesn't have is auto instrumentation for PydanticAI or other newer and emerging frameworks.

## LogFire
To supplement this, we can use LogFire's free SDK, which is also OTLP compatible.


# Example Instrumentation
Let's auto instrument a sample FastAPI application composed of a couple of classes that call a LangChain agent, a PydanticAI agent, and Bedrock directly to see it in action. This is representative of what an agent application server might look like in the agent platform.

**Note:** Because this example is already long, we skip our usual abstraction layers around the agents to keep the implementation easy to follow in a Jupyter notebook.


In [None]:
from pydantic_ai import Agent as PydanticAIAgent
from langgraph.prebuilt import create_react_agent
from langchain_core.tools import tool as lc_tool
from langchain_aws import ChatBedrockConverse

from agentic_platform.core.models.llm_models import LLMRequest, LLMResponse
from agentic_platform.core.models.memory_models import Message
from agentic_platform.core.client.llm_gateway.bedrock_gateway_client import BedrockGatewayClient

########################################################
# Define PyAIAgent
########################################################

example_facade: ObservabilityFacade = get_facade()

logger: Logger = example_facade.get_logger()

class PyAIAgent:
    def __init__(self):
        self.agent: PydanticAIAgent = PydanticAIAgent(
            'bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0',
            system_prompt='You are a helpful PydanticAI agent / assistant.'
        )

    def invoke(self, prompt: str) -> str:
        logger.info("Invoking PydanticAI agent")
        # Demonstrate a metric getting emitted. The facade creates the counter on first use.
        example_facade.increment_counter("pyai_invocations")
        response = self.agent.run_sync(prompt)
        return response.data
    
########################################################
# Define LangChainAgent
########################################################

@lc_tool
def echo(x: str) -> str:
    """Echo the input"""
    return x

class LangChainAgent:
    def __init__(self):
        llm = ChatBedrockConverse(
            model="us.anthropic.claude-3-5-haiku-20241022-v1:0",
            temperature=.5
        )

        # create_react_agent takes its tools up front, so we pass a trivial echo tool.
        tools = [echo]
        self.agent = create_react_agent(model=llm, tools=tools)

    def invoke(self, prompt: str) -> str:
        logger.info("Invoking LangChain agent")
        inputs = {"messages": [("system", "You are a helpful LangChain agent / assistant."), ("user", prompt)]}
        response = self.agent.invoke(inputs)
        return response['messages'][-1].content
    
########################################################
# Call Bedrock directly with our abstraction types
########################################################

class CallBedrockDirectly:
    @staticmethod
    def call(prompt: str) -> str:
        logger.info("Calling Bedrock directly")
        request: LLMRequest = LLMRequest(
            system_prompt="You are a helpful direct Bedrock calling assistant",
            model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0",
            messages=[Message(role="user", text=prompt)],
            hyperparams={"temperature": .5}
        )
        response: LLMResponse = BedrockGatewayClient.chat_invoke(request)
        return response.message
    
########################################################
# Write some middle piece of code that calls everything.
########################################################

class AgentCoordinator:
    @staticmethod
    def call_everything(prompt: str) -> str:
        pyai: PyAIAgent = PyAIAgent()
        langchain: LangChainAgent = LangChainAgent()

        pyai_response: str = pyai.invoke(prompt)
        langchain_response: str = langchain.invoke(prompt)

        bedrock_response: str = CallBedrockDirectly.call(prompt)

        response: str = f"PyAI: {pyai_response}\n\nLangChain: {langchain_response}\n\nBedrock: {bedrock_response}"

        return response

# Use Middleware to set the global context
Middleware is a core component of most web frameworks. We can use middleware to set up the global context and spans for each HTTP request, then plug it into our FastAPI server.

In [None]:
from typing import Optional, List
from starlette.middleware.base import BaseHTTPMiddleware
from fastapi import Request, Response
from starlette.types import ASGIApp
import logging

logger = logging.getLogger(__name__)

class TelemetryMiddleware(BaseHTTPMiddleware):
    """
    Middleware that sets up OpenTelemetry instrumentation.
    Configures the telemetry facade and enables auto-instrumentation.
    """
    
    def __init__(
        self, 
        app: ASGIApp, 
        service_name: str = "fastapi-app",
        service_version: str = "0.1.0",
        excluded_paths: Optional[List[str]] = None,
    ):
        super().__init__(app)
        self.excluded_paths = excluded_paths or ["/health", "/metrics"]

        
        # Set up telemetry provider
        try:
            provider: BaseObservabilityProvider = ConsoleOpenTelemetryProvider(
                service_name=service_name,
                service_version=service_version
            )
            
            # Configure the global telemetry facade
            self.telemetry = configure_facade(service_name, provider)
            self.telemetry.info(f"Telemetry configured for {service_name} v{service_version}")

            # BedrockInstrumentor().instrument()
            
        except Exception as e:
            logger.error(f"Error initializing telemetry: {e}")
            self.telemetry = None
    
    async def dispatch(self, request: Request, call_next):
        # Skip telemetry for excluded paths
        path = request.scope["path"]
        for excluded_path in self.excluded_paths:
            if path == excluded_path or path.startswith(excluded_path):
                return await call_next(request)
        
        # Handle the request - auto-instrumentation creates spans automatically
        response = await call_next(request)
        return response

In [None]:
########################################################
# Setup Nest Asyncio
########################################################

import nest_asyncio
nest_asyncio.apply()

#################################################################
# Import auto instrumenters
#################################################################
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.bedrock import BedrockInstrumentor
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

########################################################
# Define FastAPI Server
########################################################
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Add the middleware here, which sets our global tracer context.
app.add_middleware(
    TelemetryMiddleware,
    service_name="agent-api",
    service_version="1.0.0",
    excluded_paths=["/health", "/metrics"]  # Skip instrumentation for these paths
)

# Auto instruments the FastAPI app by adding a middleware to it.
# Because the global tracer provider is already set, its spans use that context.
FastAPIInstrumentor.instrument_app(app)

# This patches the boto3 client to record Bedrock invocation calls.
BedrockInstrumentor().instrument()

# Instruments LangChain. Note: This will be a bit noisy with the Bedrock instrumenter.
LangchainInstrumentor().instrument()



class AgentRequest(BaseModel):
    prompt: str

class AgentResponse(BaseModel):
    response: str

@app.post("/agent")
async def agent_endpoint(request: AgentRequest):
    response: str = AgentCoordinator.call_everything(request.prompt)
    return AgentResponse(response=response)

In [None]:
from fastapi.testclient import TestClient

client = TestClient(app)
# Now make requests to your API
print("Sending request to the API...")
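response = client.post(
    "/agent",
    json={"prompt": "Tell me about OpenTelemetry"}
)
# Show the result; the telemetry itself is printed by the console exporters.
print(response.status_code)
print(response.json()["response"])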


# Conclusion
In this lab, we dove into telemetry for our agent applications using OTEL and the OpenLLMetry auto instrumenters. We encourage you to make some calls to the existing agents and check out the traces, logs, and metrics that are pushed to LangFuse and OpenSearch.