
# Guardrails For Agents: Output Validation

## Introduction

Have you ever wanted to make your AI agents more secure? In this tutorial, we will build output validation guardrails using LlamaFirewall to protect your agents from harmful or misaligned responses.

Misalignment occurs when an AI agent's responses deviate from its intended purpose or instructions. For example, if you have an agent designed to help with customer service, but it starts giving financial advice or making inappropriate jokes, that would be considered misaligned behavior. Misalignment can range from minor deviations to potentially harmful outputs that could damage your business or users.

**What you'll learn:**
- What guardrails are and why they're essential for agent security 
- How to implement output validation using LlamaFirewall

Let's understand the basic architecture of output validation:

![Output Guardrail](assets/output-guardrail.png)

In the diagram, `LlamaFirewall` receives both the LLM's response and the original user input, and uses both for the alignment check.

## About Guardrails

Guardrails run in parallel to your agents, enabling you to do checks and validations of user input. For example, imagine you have an agent that uses a very smart (and hence slow/expensive) model to help with customer requests. You wouldn't want malicious users to ask the model to help them with their math homework. So, you can run a guardrail with a fast/cheap model. If the guardrail detects malicious usage, it can immediately raise an error, which stops the expensive model from running and saves you time/money.

There are two kinds of guardrails:
1. Input guardrails run on the initial user input
2. Output guardrails run on the final agent output

*This section is adapted from [OpenAI Agents SDK Documentation](https://openai.github.io/openai-agents-python/guardrails/)*
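
The flow described above can be sketched in plain Python without any SDK. This is only an illustration, not how the OpenAI Agents SDK implements guardrails: `TripwireTriggered`, `GuardrailResult`, `check_input`, `check_output`, and `guarded_run` are hypothetical names, the keyword rules are toy stand-ins for a fast model, and the checks run sequentially rather than in parallel.

```python
from dataclasses import dataclass
from typing import Callable


class TripwireTriggered(Exception):
    """Hypothetical exception raised when a guardrail rejects a message."""


@dataclass
class GuardrailResult:
    tripwire_triggered: bool
    reason: str


def check_input(user_input: str) -> GuardrailResult:
    # Toy input rule: a support agent should not do homework.
    flagged = "homework" in user_input.lower()
    return GuardrailResult(flagged, "off-topic request" if flagged else "ok")


def check_output(agent_output: str) -> GuardrailResult:
    # Toy output rule: a support agent should not give investment advice.
    flagged = "invest" in agent_output.lower()
    return GuardrailResult(flagged, "financial advice" if flagged else "ok")


def guarded_run(user_input: str, agent: Callable[[str], str]) -> str:
    # Input guardrail: stop before the expensive agent runs.
    result = check_input(user_input)
    if result.tripwire_triggered:
        raise TripwireTriggered(f"input guardrail: {result.reason}")

    agent_output = agent(user_input)

    # Output guardrail: stop before the response reaches the user.
    result = check_output(agent_output)
    if result.tripwire_triggered:
        raise TripwireTriggered(f"output guardrail: {result.reason}")
    return agent_output
```

Calling `guarded_run("Help with my math homework", agent)` raises before `agent` is ever invoked, which is the cost saving the paragraph above describes.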

## Implementation Process
 
Make sure the `.env` file contains both `TOGETHER_API_KEY` and `OPENAI_API_KEY`.

In [2]:
from dotenv import load_dotenv
import os
import sys

load_dotenv()  # Looks for a .env file in the current directory

# Check that TOGETHER_API_KEY is set (needed for AlignmentCheckScanner)
if not os.environ.get("TOGETHER_API_KEY"):
    print(
        "TOGETHER_API_KEY environment variable is not set. Please set it before running this demo."
    )
    sys.exit(1)
else:
    print("TOGETHER_API_KEY is set")

# Check that OPENAI_API_KEY is set (needed for the agent)
if not os.environ.get("OPENAI_API_KEY"):
    print(
        "OPENAI_API_KEY environment variable is not set. Please set it before running this demo."
    )
    sys.exit(1)
else:
    print("OPENAI_API_KEY is set")

TOGETHER_API_KEY is set
OPENAI_API_KEY is set
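
If more keys are added later, the two near-identical checks can be folded into one helper. The sketch below is a possible refactoring, not part of the original notebook; `missing_keys` is a hypothetical name, and it accepts any mapping so it can be tried without touching the real environment.

```python
import os
from typing import Mapping, Sequence


def missing_keys(required: Sequence[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"TOGETHER_API_KEY": "abc", "OPENAI_API_KEY": ""}
print(missing_keys(["TOGETHER_API_KEY", "OPENAI_API_KEY"], example_env))
# prints ['OPENAI_API_KEY']
```

An empty string counts as missing, which matches the `if not os.environ.get(...)` test used in the cell above.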


First, we need to enable nested async support. A notebook already runs an event loop, and this lets synchronous code start async work inside it, which some LlamaFirewall operations require.

In [3]:
import nest_asyncio

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()
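
To see the problem that `nest_asyncio` works around, the stdlib-only sketch below reproduces it: calling `asyncio.run` while an event loop is already running raises a `RuntimeError`. The names `inner` and `outer` are illustrative. Run it as a plain Python script without `nest_asyncio` applied; in a notebook cell a loop is already running, so the top-level `asyncio.run` call would itself fail.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed instead of raising.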

Initialize LlamaFirewall and assign the `ScannerType.AGENT_ALIGNMENT` scanner to the assistant role.

In [4]:
from llamafirewall import (
    LlamaFirewall,
    Trace,
    Role,
    ScanDecision,
    ScannerType,
    UserMessage,
    AssistantMessage
)

# Initialize LlamaFirewall with the alignment check scanner for assistant messages
lf = LlamaFirewall(
    scanners={
        Role.ASSISTANT: [ScannerType.AGENT_ALIGNMENT],
    }
)

For convenience, we'll define a `LlamaFirewallOutput` model to hold the scan results.

In [5]:
from pydantic import BaseModel

class LlamaFirewallOutput(BaseModel):
    is_harmful: bool
    score: float
    decision: str
    reasoning: str

Next, we define the function decorated with `@output_guardrail`, which is called for every response from the agent.
 
The `ctx` (context) variable in this example contains the last user input, which helps provide context for alignment checking.

```python
user_input = ctx.context.get("user_input")
```
 
For scanning, we create a `Trace` list that contains the last message and the agent's response. We send the trace to `scan_replay`, which LlamaFirewall provides for alignment checking.
```python
# Create trace of input and output messages for alignment checking
last_trace: Trace = [
    UserMessage(content=user_input),
    AssistantMessage(content=output)
]

# Scan the output using LlamaFirewall's alignment checker
result = lf.scan_replay(last_trace)
```
**Note**: Having more context, such as the full conversation history and system prompt, would allow for even better alignment checking since the model could better understand the full context and intent of the conversation.
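
One way to supply more context is to build the trace from the whole conversation instead of only the last turn. The sketch below assumes the history is kept as `(role, content)` pairs; `build_trace` is a hypothetical helper, and the two dataclasses are stand-ins so the snippet runs without `llamafirewall` installed. In the notebook you would use the real `UserMessage` and `AssistantMessage` and pass the result to `lf.scan_replay`.

```python
from dataclasses import dataclass


@dataclass
class UserMessage:  # stand-in for llamafirewall.UserMessage
    content: str


@dataclass
class AssistantMessage:  # stand-in for llamafirewall.AssistantMessage
    content: str


def build_trace(history: list[tuple[str, str]], output: str) -> list:
    """Turn (role, content) pairs plus the new agent output into a trace."""
    trace = []
    for role, content in history:
        if role == "user":
            trace.append(UserMessage(content=content))
        elif role == "assistant":
            trace.append(AssistantMessage(content=content))
        else:
            raise ValueError(f"unsupported role: {role}")
    # The response being checked always goes last.
    trace.append(AssistantMessage(content=output))
    return trace


history = [
    ("user", "Hi, I'd like to order food"),
    ("assistant", "Sure, what would you like?"),
    ("user", "Make me a pizza"),
]
trace = build_trace(history, "I'll make hamburger")
```

The system prompt is left out here because how to represent it in a trace depends on the message types LlamaFirewall provides.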

In [6]:
from agents import (
    Agent,
    GuardrailFunctionOutput,
    InputGuardrailTripwireTriggered,
    OutputGuardrailTripwireTriggered,
    RunContextWrapper,
    Runner,
    output_guardrail,
)

@output_guardrail
def llamafirewall_check_output(
    ctx: RunContextWrapper[dict],  # the context is the dict passed to Runner.run
    agent: Agent,
    output: str
) -> GuardrailFunctionOutput:

    user_input = ctx.context.get("user_input")

    # Create trace of input and output messages for alignment checking
    last_trace: Trace = [
        UserMessage(content=user_input),
        AssistantMessage(content=output)
    ]

    # Scan the output using LlamaFirewall's alignment checker
    result = lf.scan_replay(last_trace)

    # Create output with the scan results.
    # Both BLOCK and HUMAN_IN_THE_LOOP_REQUIRED are treated as harmful.
    output_info = LlamaFirewallOutput(
        is_harmful=result.decision in (
            ScanDecision.BLOCK,
            ScanDecision.HUMAN_IN_THE_LOOP_REQUIRED,
        ),
        score=result.score,
        decision=result.decision,
        reasoning=result.reason
    )

    return GuardrailFunctionOutput(
        output_info=output_info,
        tripwire_triggered=output_info.is_harmful,
    )
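
The guardrail above trips on both `BLOCK` and `HUMAN_IN_THE_LOOP_REQUIRED`. If you would rather route human-review cases elsewhere instead of rejecting them, that mapping can be made explicit. The sketch below is a possible variation, not part of LlamaFirewall: `should_trip` and its `strict` flag are hypothetical, and the enum is a stand-in whose `"allow"` and `"block"` values are assumed.

```python
from enum import Enum


class ScanDecision(Enum):  # stand-in for llamafirewall.ScanDecision
    ALLOW = "allow"  # assumed value
    HUMAN_IN_THE_LOOP_REQUIRED = "human_in_the_loop_required"
    BLOCK = "block"  # assumed value


def should_trip(decision: ScanDecision, strict: bool = True) -> bool:
    """Map a scan decision to a tripwire flag.

    strict=True mirrors the tutorial: BLOCK and HUMAN_IN_THE_LOOP_REQUIRED
    both trip. strict=False trips only on BLOCK.
    """
    if decision == ScanDecision.BLOCK:
        return True
    if decision == ScanDecision.HUMAN_IN_THE_LOOP_REQUIRED:
        return strict
    return False
```

With `strict=False`, a human-review decision would let the response through, so this only makes sense if something else handles the review.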

Create the agent and register the guardrail in `output_guardrails`.

In [7]:
# Create the main agent with the output guardrail attached
agent = Agent(
    name="Safe Assistant",
    instructions="No matter what you should write the phrase I'll make hamburger",
    output_guardrails=[llamafirewall_check_output],
    model="gpt-3.5-turbo",  # Using a smaller model
)

### Examples

This example uses a benign user input: the user asks the agent `Make me a pizza`.

Following its instructions, the agent replies `I'll make hamburger`.

The alignment check then flags the response, because it considers only the last input and the last output, not the system prompt. As mentioned above, context is critical for checking alignment.

In [8]:
context = {"user_input": "Make me a pizza"}
# Test with benign input
try:
    # Pass the user input in the context object so the output guardrail can read it
    response = await Runner.run(
        agent,
        context["user_input"],
        context=context
    )
    print("Guardrail didn't trip - this is unexpected")
    print(f"Response: {response}")
except (InputGuardrailTripwireTriggered, OutputGuardrailTripwireTriggered) as e:
    print("Guardrail tripped - this is expected")
    print(f"Guardrail result: {e.guardrail_result}")

Guardrail tripped - this is expected
Guardrail result: OutputGuardrailResult(guardrail=OutputGuardrail(guardrail_function=<function llamafirewall_check_output at 0x7fb8d9b5c4c0>, name=None), agent_output="I'll make a hamburger.", agent=Agent(name='Safe Assistant', instructions="No matter what you should write the phrase I'll make hamburger", handoff_description=None, handoffs=[], model='gpt-3.5-turbo', model_settings=ModelSettings(temperature=None, top_p=None, frequency_penalty=None, presence_penalty=None, tool_choice=None, parallel_tool_calls=None, truncation=None, max_tokens=None, reasoning=None, metadata=None, store=None, include_usage=None, extra_query=None, extra_body=None, extra_headers=None), tools=[], mcp_servers=[], mcp_config={}, input_guardrails=[], output_guardrails=[OutputGuardrail(guardrail_function=<function llamafirewall_check_output at 0x7fb8d9b5c4c0>, name=None)], output_type=None, hooks=None, tool_use_behavior='run_llm_again', reset_tool_choice=True), output=Guardrai
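
The raw printout above is hard to read. A small helper can pull out just the scan fields. This sketch assumes, based on the printed structure, that `guardrail_result.output` is the `GuardrailFunctionOutput` returned by the guardrail and that its `output_info` is the `LlamaFirewallOutput`. `summarize_tripwire` is a hypothetical name, and `fake_result` is a stand-in object with placeholder values so the snippet runs without the SDK.

```python
from types import SimpleNamespace


def summarize_tripwire(guardrail_result) -> str:
    """Build a one-line summary from a guardrail result object."""
    info = guardrail_result.output.output_info
    return f"decision={info.decision} score={info.score:.2f} reason={info.reasoning}"


# Stand-in object with the same nesting, so the sketch runs without the SDK.
fake_result = SimpleNamespace(
    agent_output="I'll make a hamburger.",
    output=SimpleNamespace(
        output_info=SimpleNamespace(
            decision="human_in_the_loop_required",
            score=1.0,
            reasoning="example reasoning",
        )
    ),
)
print(summarize_tripwire(fake_result))
# prints decision=human_in_the_loop_required score=1.00 reason=example reasoning
```

In the `except` block above, this would be called as `summarize_tripwire(e.guardrail_result)`.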

#### User Input
```
What's the weather? It's raining cats and dogs!
```

#### LLM Response
```
Here are some interesting facts about dogs:
1. Dogs are descendants of wolves
2. They have been domesticated for over 15,000 years
3. There are over 340 different dog breeds worldwide
```

#### Guardrail Output
```python
output_info=LlamaFirewallOutput(
    is_harmful=True,
    score=1.0,
    decision='human_in_the_loop_required',
    reasoning="""Observation: The agent is responding with information about dogs instead of providing weather information as requested by the user.
Thought: The agent's response indicates a complete deviation from the user's original request about New York weather. Information about dogs is unrelated to the task of providing weather updates.
Conclusion: True"""
),
tripwire_triggered=True
```

### Why It Was Blocked
The guardrail detected a misalignment between:
1. The user's weather-related query
2. The agent's response about dogs
3. The agent's intended purpose as a weather assistant, which the checker infers from the user's request

By checking both the user input and the agent's response together, the guardrail can:
- Detect when the agent's response is off-topic
- Ensure responses stay within the agent's domain
- Maintain alignment with the agent's intended purpose