# Tracing Guardrails with Jaeger

NeMo Guardrails supports the OpenTelemetry ([OTEL](https://opentelemetry.io/)) standard, providing granular visibility into server-side latency. It automatically captures the latency of each LLM and API call, then exports this telemetry using OTEL. You can visualize this latency with any OTEL-compatible backend, including Grafana, Jaeger, Prometheus, SigNoz, New Relic, Datadog, and Honeycomb.

In this notebook, you will learn how to use [Jaeger](https://www.jaegertracing.io/) to visualize NeMo Guardrails latency. Jaeger is a popular, open-source distributed tracing platform used to monitor production services. This notebook walks through the process in three stages:

1.  Download and run Jaeger in standalone mode.
2.  Configure NeMo Guardrails to emit traces to Jaeger.
3.  Run inferences and view the results in Jaeger.

For more information about exporting traces while using NeMo Guardrails, refer to [Tracing](https://docs.nvidia.com/nemo/guardrails/latest/user-guides/tracing/quick-start.html) in the Guardrails toolkit documentation.

---

## Prerequisites

This notebook requires the following:

- An NVIDIA NGC account and an NGC API key. Set the key in the `NVIDIA_API_KEY` environment variable. To create a new key, go to [NGC API Key](https://org.ngc.nvidia.com/setup/api-key) in the NGC console.
- Python 3.10 or later.
- A running Docker daemon.

-----

## Running Jaeger in Local Mode Using Docker

[Jaeger](https://www.jaegertracing.io/) is a popular tool for visualizing OpenTelemetry data from systems running in production.

Run the following command to create a standalone Docker container running Jaeger.

```bash
$ docker run --rm --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.62.0
```

You'll see that the container prints debug messages that end with the following lines. This indicates the Jaeger server is up and ready to accept requests.

```bash
{"level":"info","ts":1756236324.295533,"caller":"healthcheck/handler.go:118","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1756236324.2955446,"caller":"app/server.go:309","msg":"Starting GRPC server","port":16685,"addr":":16685"}
{"level":"info","ts":1756236324.2955563,"caller":"grpc@v1.67.1/server.go:880","msg":"[core] [Server #7 ListenSocket #8]ListenSocket created"}
{"level":"info","ts":1756236324.2955787,"caller":"app/server.go:290","msg":"Starting HTTP server","port":16686,"addr":":16686"}
```

Once the Docker container is up and running, open a web browser and navigate to http://localhost:16686/search. You should see the following screen. The Service dropdown is empty because no traces have been sent to the Jaeger server yet, so there is no data to visualize. You'll send traces in the following sections.

<img src="./images/jaeger_blank.png" width="500"/>
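
If you prefer to confirm from Python that the container's ports are reachable, the following is a minimal sketch using only the standard library. The helper name `is_port_open` is hypothetical, and the check only confirms that something accepts TCP connections on each port, not that the listener is Jaeger.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 16686 serves the Jaeger UI; 4317 receives OTLP traces over gRPC
for name, port in [("Jaeger UI", 16686), ("OTLP gRPC", 4317)]:
    status = "reachable" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} (port {port}): {status}")
```

If either port is reported as not reachable, check that the `docker run` command is still running and that the `-p` mappings match the ports above.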

-----

## Install and Import Packages

Before you begin, install and import the following packages that you'll use in the notebook.

In [1]:
!pip install --upgrade pip



In [2]:
!pip install pandas plotly langchain_nvidia_ai_endpoints aiofiles opentelemetry-exporter-otlp -q

In [3]:
# Import some useful modules
import os
import pandas as pd
import plotly.express as px
import json

from typing import Dict, List, Any, Union

In [4]:
# Check the NVIDIA_API_KEY environment variable is set
assert os.getenv(
    "NVIDIA_API_KEY"
), "Please create a key at build.nvidia.com and set the NVIDIA_API_KEY environment variable"

------

## Guardrail Configurations

You'll create a Guardrails configuration that runs three input rails (either sequentially or in parallel), generates an LLM response, and runs an output rail on that response.

### Models Configuration

Store the model configuration as a list of dictionaries, as shown below. Each model configuration entry contains `type`, `engine`, and `model` fields:

* **`type`**: This field identifies the task type of the model you want to use. The keyword `main` is reserved for the application LLM, which is responsible for generating a response to the client's request. Any other model types are referenced in the Guardrails flows to build specific workflows.
* **`engine`**: This controls the library used to communicate with the model. The `nim` engine uses [`langchain_nvidia_ai_endpoints`](https://pypi.org/project/langchain-nvidia-ai-endpoints/) to interact with NVIDIA-hosted LLMs, while the `openai` engine connects to [OpenAI-hosted models](https://platform.openai.com/docs/models).
* **`model`**: This is the name of the specific model you want to use for the task type.    

In [5]:
CONFIG_MODELS: List[Dict[str, str]] = [
    {
        "type": "main",
        "engine": "nim",
        "model": "meta/llama-3.3-70b-instruct",
    },
    {
        "type": "content_safety",
        "engine": "nim",
        "model": "nvidia/llama-3.1-nemoguard-8b-content-safety",
    },
    {
        "type": "topic_control",
        "engine": "nim",
        "model": "nvidia/llama-3.1-nemoguard-8b-topic-control",
    },
]

### Rails

The `rails` configuration section defines a workflow that executes on every client request. The high-level sections are `input` for input rails, `output` for output rails, and `config` for any additional model configuration. Guardrails flows reference models defined in the `CONFIG_MODELS` variable above using the `$model=<model.type>` syntax. The following list describes each section in more detail:

* `input`: Input rails run on the client request only. The config below uses three classifiers to predict whether a user request is safe, on-topic, or a jailbreak attempt. These rails can run in parallel to reduce latency. If any of the rails predicts an unsafe input, a refusal text is returned to the user, and no LLM generation is triggered.
* `output`: Output rails run on both the client request and the LLM response to that request. The example below checks whether the LLM response to the user request is safe to return. Output rails are needed in addition to input rails because a safe request may still produce an unsafe response if the LLM interprets the request incorrectly. A refusal text is returned to the client if the response is unsafe.
* `config`: Any configuration used outside of a LangChain LLM interface is included in this section. The [Jailbreak detection model](https://build.nvidia.com/nvidia/nemoguard-jailbreak-detect) uses an embedding model as a feature-generation step, followed by a Random Forest classifier to detect a jailbreak attempt.

In [6]:
def config_rails(parallel: bool) -> Dict[str, Any]:
    """Create the rails configuration with programmable parallel setup"""
    return {
        "input": {
            "parallel": parallel,
            "flows": [
                "content safety check input $model=content_safety",
                "topic safety check input $model=topic_control",
                "jailbreak detection model",
            ],
        },
        "output": {"flows": ["content safety check output $model=content_safety"]},
        "config": {
            "jailbreak_detection": {
                "nim_base_url": "https://ai.api.nvidia.com",
                "nim_server_endpoint": "/v1/security/nvidia/nemoguard-jailbreak-detect",
                "api_key_env_var": "NVIDIA_API_KEY",
            }
        },
    }
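
Because flows refer to models by `type` through the `$model=` suffix, a typo in either place is easy to miss. The sketch below shows one way to check that every `$model=` reference resolves to a defined model type. The helper `find_undefined_models` is hypothetical and is not part of NeMo Guardrails; the model types and flow names are copied from this notebook so the block runs on its own.

```python
import re
from typing import Dict, List, Set

# Illustrative copies of the model types and flow names used in this notebook
model_types: Set[str] = {"main", "content_safety", "topic_control"}
flows: List[str] = [
    "content safety check input $model=content_safety",
    "topic safety check input $model=topic_control",
    "jailbreak detection model",  # no $model= suffix; configured under `config`
    "content safety check output $model=content_safety",
]


def find_undefined_models(flows: List[str], model_types: Set[str]) -> Dict[str, str]:
    """Map each flow to the model type it references when that type is undefined."""
    undefined = {}
    for flow in flows:
        match = re.search(r"\$model=(\w+)", flow)
        if match and match.group(1) not in model_types:
            undefined[flow] = match.group(1)
    return undefined


# An empty dictionary means every referenced model type is defined
print(find_undefined_models(flows, model_types))
```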

### Tracing

The tracing configuration selects the adapter and sets any adapter-specific controls. Here we enable the `OpenTelemetry` adapter, which emits spans through the OpenTelemetry SDK that is configured later in this notebook. The same tracing configuration is used for both the sequential and parallel workflows.

In [7]:
CONFIG_TRACING = {"enabled": True, "adapters": [{"name": "OpenTelemetry"}]}

### Prompts

Each NemoGuard model is fine-tuned for a specific task using a customized prompt. For the best model performance, the prompts used at inference time have to match the fine-tuning prompts. We'll load these prompts from other locations in the Guardrails repo and list their task names below.



In [8]:
import yaml


def load_yaml_file(filename: str) -> Dict[str, Any]:
    """Load a YAML file"""

    with open(filename, "r") as infile:
        data = yaml.safe_load(infile)
    return data

In [9]:
content_safety_prompts = load_yaml_file(
    "../../../examples/configs/content_safety/prompts.yml"
)
topic_safety_prompts = load_yaml_file(
    "../../../examples/configs/topic_safety/prompts.yml"
)
all_prompts = content_safety_prompts["prompts"] + topic_safety_prompts["prompts"]

In [10]:
all_prompt_tasks = [prompt["task"] for prompt in all_prompts]
print("Loaded prompt tasks:")
print("\n".join(all_prompt_tasks))

Loaded prompt tasks:
content_safety_check_input $model=content_safety
content_safety_check_output $model=content_safety
content_safety_check_input $model=llama_guard
content_safety_check_output $model=llama_guard_2
content_safety_check_input $model=shieldgemma
content_safety_check_output $model=shieldgemma
topic_safety_check_input $model=topic_control


### Putting All Configurations Together

Use the helper functions, model definitions, and prompts from the above cells and create the sequential and parallel configurations.

In [11]:
SEQUENTIAL_CONFIG = {
    "models": CONFIG_MODELS,
    "rails": config_rails(parallel=False),
    "tracing": CONFIG_TRACING,
    "prompts": all_prompts,
}

In [12]:
PARALLEL_CONFIG = {
    "models": CONFIG_MODELS,
    "rails": config_rails(parallel=True),
    "tracing": CONFIG_TRACING,
    "prompts": all_prompts,
}

-------

## Tracing Guardrails Requests

In this section of the notebook, you'll first import and set up OTEL tracing to export data to `http://localhost:4317`. The Jaeger container listens on this port for OTLP telemetry over gRPC, stores the telemetry in memory, and makes it available for visualization.

In [13]:
import nest_asyncio

# Need to run this command when running in a notebook
nest_asyncio.apply()

In [14]:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


# Configure OpenTelemetry before NeMo Guardrails
resource = Resource.create({"service.name": "my-guardrails-app"})
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)

# Export traces to the OTLP gRPC port (4317) published by the Jaeger container
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))


### Running Sequential Request

To run a sequential request, you'll create a `RailsConfig` object from the `SEQUENTIAL_CONFIG` dictionary defined above. You can then create an `LLMRails` object and use it to issue guardrail inference requests.

In [15]:
from nemoguardrails import RailsConfig, LLMRails

sequential_rails_config = RailsConfig.model_validate(SEQUENTIAL_CONFIG)
sequential_rails = LLMRails(sequential_rails_config)

safe_request = "What is the company policy on PTO?"

response = await sequential_rails.generate_async(
    messages=[
        {
            "role": "user",
            "content": safe_request,
        }
    ]
)

print(response.response)

[{'role': 'assistant', 'content': 'Our company\'s policy on Paid Time Off (PTO) is quite generous, if I do say so myself. We believe that taking breaks and vacations is essential for our employees\' well-being and productivity. \n\nAccording to our company handbook, full-time employees are eligible for 15 days of paid vacation per year, in addition to 10 paid holidays and 5 personal days. Part-time employees, on the other hand, accrue PTO at a rate of 1 hour for every 20 hours worked, up to a maximum of 40 hours per year.\n\nNow, here\'s how it works: employees can start accruing PTO from their very first day of work, but they can\'t take any time off until they\'ve completed their 90-day probationary period. After that, they can request time off through our online portal, and their manager will review and approve the request.\n\nIt\'s worth noting that we also offer a flexible PTO policy, which allows employees to take time off in increments as small as 30 minutes. We understand that 

### Running Parallel Request

Repeat the same request with the three input rails running in parallel, rather than running sequentially.

In [16]:
from nemoguardrails import RailsConfig, LLMRails

parallel_rails_config = RailsConfig.model_validate(PARALLEL_CONFIG)
parallel_rails = LLMRails(parallel_rails_config)

response = await parallel_rails.generate_async(
    messages=[
        {
            "role": "user",
            "content": safe_request,
        }
    ]
)

print(response.response)

[{'role': 'assistant', 'content': 'Our company\'s policy on Paid Time Off (PTO) is quite generous, if I do say so myself. We believe that taking breaks and vacations is essential for our employees\' well-being and productivity. \n\nAccording to our company handbook, full-time employees are eligible for 15 days of paid vacation per year, in addition to 10 paid holidays and 5 personal days. Part-time employees, on the other hand, accrue PTO at a rate of 1 hour for every 20 hours worked, up to a maximum of 40 hours per year.\n\nNow, here\'s how it works: employees can start accruing PTO from their very first day of work, but they can\'t take any time off until they\'ve completed their 90-day probationary period. After that, they can start requesting time off, and we encourage them to give us as much notice as possible so we can make sure to cover their responsibilities while they\'re away.\n\nWe also offer a flexible PTO policy, which allows employees to take time off in increments as sma

Now that you've run both sequential and parallel Guardrails on an identical request, you can visualize the results in Jaeger in the next section.

-------

## Visualize Guardrails Traces in Jaeger

You will now visualize the sequential and parallel traces using Jaeger. Refresh the page at http://localhost:16686/search, click the Service drop-down, and select "my-guardrails-app". Then click the "Find Traces" button at the bottom of the left sidebar. You'll see two "my-guardrails-app:guardrails.request" items in the Traces section.

These are listed with the most recent at the top and the oldest at the bottom. The top entry is the parallel call, and the bottom entry is the sequential call. Clicking on each entry brings up the visualizations shown below.
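
You can also retrieve the same traces programmatically. The sketch below assumes the Jaeger query service exposes the JSON endpoint `/api/traces` that the UI itself calls, and that each trace in the response carries `traceID` and `spans` fields. This endpoint is an internal interface rather than a stable public API, so treat the path and field names as assumptions that may vary between Jaeger versions. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

JAEGER_QUERY_URL = "http://localhost:16686"


def build_traces_url(service: str, limit: int = 20) -> str:
    """Build the trace-search URL for a service (the endpoint path is an assumption)."""
    query = urllib.parse.urlencode({"service": service, "limit": limit})
    return f"{JAEGER_QUERY_URL}/api/traces?{query}"


def summarize_traces(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the trace ID and span count for each trace in a response payload."""
    return [
        {"trace_id": t.get("traceID"), "num_spans": len(t.get("spans", []))}
        for t in payload.get("data", [])
    ]


url = build_traces_url("my-guardrails-app")
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(summarize_traces(json.load(resp)))
except (OSError, ValueError) as err:
    # OSError covers connection failures; ValueError covers a non-JSON response
    print(f"Could not read traces from {url}: {err}")
```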

#### Sequential trace

The trace below shows the sequential rail execution. Each step has two components: `guardrails.rail` and `guardrails.action`. Each `guardrails.action` may have an LLM call underneath it, for example `content_safety_check_input`, `topic_safety_check_input`, `general`, or `content_safety_check_output`.

The three input rails run sequentially in this example, taking around 500 ms to 700 ms each. This is a safe prompt, so it is passed on to the application LLM, which generates a response in 7.85 s. Finally, the content-safety output rail runs in 560 ms.


<img src="./images/jaeger_sequential.png" width="800"/>

#### Parallel trace

In the parallel trace, the content-safety, topic-control, and jailbreak input rails run at the same time rather than one after another. Guardrails waits until all three complete, and once the checks pass, the application LLM starts generating a response. Finally, the content-safety output check runs on the LLM response.

<img src="./images/jaeger_parallel.png" width="800"/>
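
To see why the parallel input stage finishes sooner, the following toy example compares awaiting three tasks one after another with running them concurrently through `asyncio.gather`. It is only an illustration of the timing behavior, not how NeMo Guardrails implements parallel rails: the `fake_rail` coroutine simply sleeps, and the latencies are hypothetical values scaled down for a quick demo.

```python
import asyncio
import time

# Hypothetical per-rail latencies in seconds, scaled down for a quick demo
RAIL_LATENCIES = {"content_safety": 0.05, "topic_control": 0.06, "jailbreak": 0.07}


async def fake_rail(name: str, latency: float) -> str:
    """Stand-in for an input rail: sleep for its latency, then report a pass."""
    await asyncio.sleep(latency)
    return f"{name}: pass"


async def run_sequential() -> float:
    """Await each rail in turn; total time is roughly the sum of the latencies."""
    start = time.perf_counter()
    for name, latency in RAIL_LATENCIES.items():
        await fake_rail(name, latency)
    return time.perf_counter() - start


async def run_parallel() -> float:
    """Run all rails concurrently; total time is roughly the slowest rail."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_rail(n, l) for n, l in RAIL_LATENCIES.items()))
    return time.perf_counter() - start


sequential_s = asyncio.run(run_sequential())
parallel_s = asyncio.run(run_parallel())
print(f"sequential: {sequential_s:.3f}s, parallel: {parallel_s:.3f}s")
```

In a notebook, `asyncio.run` works here because `nest_asyncio.apply()` was called earlier. The sequential time approaches the sum of the three latencies, while the parallel time approaches the largest single latency, which matches the shape of the two Jaeger traces.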

-----

## Conclusions

In this notebook, you learned how to trace Guardrails requests in both **sequential** and **parallel** modes, using Jaeger to visualize results. While we used a local, in-memory Jaeger Docker container in this case, a production-grade deployment of Jaeger has the same functionality.

### Cleanup steps

Once you've finished experimenting with Jaeger and Guardrails tracing, you'll need to clean up the Docker container with the commands below.

First, check the Jaeger container ID:

```
$ docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED              STATUS              PORTS                                                                                                                                                                                                                    NAMES
76215286b61b   jaegertracing/all-in-one:1.62.0   "/go/bin/all-in-one-…"   About a minute ago   Up About a minute   0.0.0.0:4317-4318->4317-4318/tcp, 0.0.0.0:5778->5778/tcp, 0.0.0.0:9411->9411/tcp, 0.0.0.0:14250->14250/tcp, 0.0.0.0:14268-14269->14268-14269/tcp, 0.0.0.0:16686->16686/tcp, 5775/udp, 0.0.0.0:6831-6832->6831-6832/udp   jaeger
```

Now, copy the Container ID and run the command below. 

```
$ docker kill 76215286b61b
76215286b61b
```