# AWS Bedrock Provider - Inference Examples

This notebook demonstrates how to use the AWS Bedrock provider with Llama Stack for text generation using the OpenAI-compatible endpoint.

## Overview

AWS Bedrock provides an OpenAI-compatible endpoint that allows you to use models like OpenAI GPT-OSS through a familiar API. This notebook covers:

1. **Simple Chat Completion** - Basic text generation
2. **Streaming Response** - Real-time token streaming
3. **Multi-turn Conversation** - *Client-side* context (you pass prior messages in `messages`)
4. **System Messages** - Role-based prompting
5. **Sampling Parameters** - Customizing generation behavior

## Prerequisites

1. **AWS Bedrock Access**: OpenAI GPT-OSS model enabled in `us-west-2`
2. **Bearer Token**: AWS pre-signed URL token for Bedrock OpenAI-compatible API
3. **Llama Stack Server**: Running with Bedrock provider configured
4. **Python Packages**: `openai` and `requests` installed (the cells below import both)

## API Support Matrix

| Feature | Supported |
|---------|----------|
| `/v1/chat/completions` | Yes |
| Streaming (SSE) | Yes |
| System messages | Yes |
| Multi-turn context | Yes (send full history in `messages`) |
| `/v1/embeddings` | No |
| Tool calling | No |

## Step 0: Environment Setup

Before running this notebook, you need to:

### 1. Set the Bedrock Bearer Token (REQUIRED before starting server)

```bash
# Pre-signed URL bearer token - MUST be set before starting the Llama Stack server
export AWS_BEARER_TOKEN_BEDROCK="bedrock-api-key-<your-presigned-url>"
export AWS_DEFAULT_REGION=us-west-2
```

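**Important**: The server reads `AWS_BEARER_TOKEN_BEDROCK` at startup. If the variable is not set at that point, Bedrock inference requests will fail, so export it first and restart the server after refreshing the token.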

### 2. Configure and Start the Llama Stack Server

For a complete config example, see the [starter distribution config.yaml](https://github.com/meta-llama/llama-stack/blob/main/src/llama_stack/distributions/starter/config.yaml) which already includes the Bedrock provider.

See the **minimal config cell below** for the specific fields required for Bedrock.

```bash
# Run from the directory containing your config file
uv run llama stack run ./your-config.yaml
```

### 3. Verify Server is Running

Run the cell below to verify the server is accessible:

In [None]:
# Verify server is running (equivalent to: curl http://localhost:8321/v1/models | jq)
import subprocess
import json

try:
    result = subprocess.run(
        ["curl", "-s", "http://localhost:8321/v1/models"],
        capture_output=True,
        text=True,
        timeout=5
    )
    if result.returncode == 0:
        data = json.loads(result.stdout)
        print(json.dumps(data, indent=2))
    else:
        print("Server not reachable. Start it with: uv run llama stack run ./your-config.yaml")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Minimal config.yaml example for Bedrock
# Copy this and save as your-config.yaml

minimal_config = """
version: 2
providers:
  inference:
    - provider_id: bedrock
      provider_type: remote::bedrock
      config:
        api_key: ${env.AWS_BEARER_TOKEN_BEDROCK:=}
        region_name: ${env.AWS_DEFAULT_REGION:=us-west-2}

registered_resources:
  models:
    - metadata: {}
      model_id: bedrock/openai.gpt-oss-20b-1:0
      provider_id: bedrock
      provider_model_id: openai.gpt-oss-20b-1:0
      model_type: llm
""".strip()

print("=" * 60)
print("MINIMAL CONFIG.YAML FOR BEDROCK")
print("=" * 60)
print(minimal_config)
print("=" * 60)
print("\nSave this as 'your-config.yaml' and run:")
print("  export AWS_BEARER_TOKEN_BEDROCK='your-token'")
print("  export AWS_DEFAULT_REGION=us-west-2")
print("  uv run llama stack run ./your-config.yaml")
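
The `${env.NAME:=default}` placeholders in the config above appear to substitute an environment variable and fall back to the text after `:=` when the variable is unset. The sketch below is a simplified illustration of that assumed behaviour, not Llama Stack's actual resolver; `resolve_placeholders` and its regular expression are hypothetical names introduced here.

```python
import os
import re

# Assumed placeholder form: ${env.NAME:=default}. Illustration only; this is
# not Llama Stack's own implementation.
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*):=([^}]*)\}")


def resolve_placeholders(text, env=None):
    """Replace each placeholder with the variable's value, or its default."""
    env = os.environ if env is None else env

    def _substitute(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        # Assumption: an empty variable is treated like an unset one.
        return value if value else default

    return _PLACEHOLDER.sub(_substitute, text)


sample = (
    "api_key: ${env.AWS_BEARER_TOKEN_BEDROCK:=}\n"
    "region_name: ${env.AWS_DEFAULT_REGION:=us-west-2}"
)

# An empty mapping simulates a shell where neither variable was exported.
print(resolve_placeholders(sample, env={}))
```

With an empty environment the `api_key` line resolves to an empty value, which is consistent with the warning that inference fails when the token is not exported before the server starts.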

In [20]:
# Cell 1: Check prerequisites
import os
import requests

BASE_URL = os.getenv("LLAMA_STACK_BASE_URL", "http://localhost:8321")

print("Checking prerequisites...")
print("=" * 50)

# Check if AWS_BEARER_TOKEN_BEDROCK is set (needed by the server, not the client)
bedrock_token = os.getenv("AWS_BEARER_TOKEN_BEDROCK")
if bedrock_token:
    print(f"AWS_BEARER_TOKEN_BEDROCK: Set ({len(bedrock_token)} chars)")
else:
    print("AWS_BEARER_TOKEN_BEDROCK: Not set in notebook environment")
    print("  -> That's OK - the SERVER needs this token, not the client")

# Check if server is running
try:
    response = requests.get(f"{BASE_URL}/v1/models", timeout=5)
    response.raise_for_status()

    models = response.json()
    data = models.get("data", [])

    print(f"\nLlama Stack Server: Running at {BASE_URL}")
    print(f"Available models: {len(data)}")

    # Llama Stack's /v1/models returns OpenAI-style objects: {id: ..., custom_metadata: {...}}
    bedrock_models = [
        m
        for m in data
        if (m.get("custom_metadata") or {}).get("provider_id") == "bedrock"
    ]

    if bedrock_models:
        print("\nBedrock models found:")
        for m in bedrock_models:
            print(f"  - {m.get('id')}")
    else:
        print("\nNo Bedrock models found. If you're using the minimal config from this notebook, you should see:")
        print("  - bedrock/openai.gpt-oss-20b-1:0")

except requests.exceptions.ConnectionError:
    print(f"\nLlama Stack Server: NOT RUNNING at {BASE_URL}")
    print("\nTo start the server:")
    print("  1. Set token: export AWS_BEARER_TOKEN_BEDROCK='your-token'")
    print("  2. Run: uv run llama stack run ./your-config.yaml")
except Exception as e:
    print(f"\nError checking server: {e}")

print("\n" + "=" * 50)

Checking prerequisites...
AWS_BEARER_TOKEN_BEDROCK: Not set in notebook environment
  -> That's OK - the SERVER needs this token, not the client

Llama Stack Server: Running at http://localhost:8321
Available models: 1

Bedrock models found:
  - bedrock/openai.gpt-oss-20b-1:0



## Step 1: Initialize Client

Next, configure the OpenAI Python client to talk to the OpenAI-compatible API exposed by the Llama Stack server.

In [None]:
# Cell 2: Initialize client
from openai import OpenAI

# Prefer auto-discovery from registered models
def _discover_bedrock_model_id(base_url: str) -> str | None:
    try:
        r = requests.get(f"{base_url}/v1/models", timeout=5)
        r.raise_for_status()
        for m in (r.json().get("data") or []):
            if (m.get("custom_metadata") or {}).get("provider_id") == "bedrock":
                return m.get("id")
    except Exception:
        return None
    return None

MODEL_ID = (
    os.getenv("LLAMA_STACK_MODEL_ID")
    or _discover_bedrock_model_id(BASE_URL)
    or "bedrock/openai.gpt-oss-20b-1:0"  # example fallback
)

# Use the OpenAI Python SDK against Llama Stack's OpenAI-compatible API.
# Llama Stack typically doesn't require a client-side API key, but the OpenAI SDK requires
# the api_key option to be set (even if it's unused by your server). Fall back to a
# non-empty placeholder: an empty string can produce a malformed Authorization header
# with some SDK versions.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or "not-needed"
client = OpenAI(base_url=f"{BASE_URL.rstrip('/')}/v1", api_key=OPENAI_API_KEY)

print(f"Llama Stack URL: {BASE_URL}")
print(f"Model ID: {MODEL_ID}")
print("\nClient initialized successfully!")
print("\nNote: The client doesn't need AWS credentials.")
print("The Llama Stack SERVER uses AWS_BEARER_TOKEN_BEDROCK to authenticate with Bedrock.")

## Example 1: Simple Chat Completion (Non-Streaming)

The most basic use case - send a message and get a complete response.

Uses the OpenAI-compatible `client.chat.completions.create()` API.

In [22]:
# Simple chat completion using OpenAI-compatible API
response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "user", "content": "Which planet do humans live on?"}
    ],
    stream=False
)

print("Question: Which planet do humans live on?")
print(f"\nResponse: {response.choices[0].message.content}")
print("\nToken Usage:")
print(f"  - Prompt tokens: {response.usage.prompt_tokens}")
print(f"  - Completion tokens: {response.usage.completion_tokens}")
print(f"  - Total tokens: {response.usage.total_tokens}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Question: Which planet do humans live on?

Response: <reasoning>The user asks: "Which planet do humans live on?" This is a simple fact-based question. The humans live on Earth. The user is presumably expecting a direct answer: Humans live on Earth. Possibly expand a bit. We should respond straightforwardly.</reasoning>Humans live on **Earth** – the third planet from the Sun.

Token Usage:
  - Prompt tokens: 76
  - Completion tokens: 76
  - Total tokens: 152
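
The response above embeds the model's reasoning in `<reasoning>...</reasoning>` tags inside `message.content`. A minimal sketch for separating the reasoning from the final answer follows; it also covers the case where the text is cut off before the closing tag. `split_reasoning` is a hypothetical helper, not part of the OpenAI SDK or Llama Stack.

```python
import re

# Matches one complete reasoning block, including newlines inside it.
_CLOSED_BLOCK = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)


def split_reasoning(content):
    """Return (reasoning, answer) for text that may embed <reasoning> blocks."""
    reasoning_parts = _CLOSED_BLOCK.findall(content)
    remainder = _CLOSED_BLOCK.sub("", content)

    # An opening tag without a closing tag means the text ended mid-reasoning,
    # so everything after it is reasoning and no answer was produced.
    head, tag, tail = remainder.partition("<reasoning>")
    if tag:
        reasoning_parts.append(tail)
        remainder = head

    return "".join(reasoning_parts).strip(), remainder.strip()


reasoning, answer = split_reasoning(
    "<reasoning>Simple fact.</reasoning>Humans live on Earth."
)
print(answer)
```

Returning both parts lets a caller log the reasoning while showing users only the answer.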


## Example 2: Streaming Response

For longer responses, streaming provides real-time token delivery via Server-Sent Events (SSE).

In [23]:
# Streaming chat completion
print("Question: What's the name of the Sun in Latin?")
print("\nStreaming response: ", end="", flush=True)

stream = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "user", "content": "What's the name of the Sun in Latin?"}
    ],
    stream=True
)

full_response = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        full_response += content

print("\n\n[Stream complete]")

Question: What's the name of the Sun in Latin?

Streaming response: 

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


<reasoning>Question: "What's the name of the Sun in Latin?" Probably "Sol". Latin nominative: sol, solis (gen). Historically "Sol". So answer: "Sol".

But could also mention "Helios" but in Latin they'd call "Sol</reasoning><reasoning>". Historically also "Solis". In Late Latin, "Sol" as a deified god: "Sol Invictus". So answer: "Sol". Probably they ask "What's the name of the Sun in Latin?" Response: "Sol".

Thus answer: "Sol". Maybe mention</reasoning><reasoning> "Sola" or "Solis" but give nominative. So respond: "The Sun is called “Sol” (nominative plural etc)."

Answer accordingly.</reasoning>In Latin the Sun is called **Sol** (genitive *Solis*).

[Stream complete]
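
Under the hood, the SDK loop above consumes Server-Sent Events. In the OpenAI streaming format each event is a `data: {json}` line and the stream ends with `data: [DONE]`; that the Llama Stack server emits exactly this shape is an assumption here. The sketch parses a few hand-written sample lines rather than a live response, and `iter_sse_content` is a hypothetical helper.

```python
import json


def iter_sse_content(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events, not captured server output.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "So"}}]}',
    'data: {"choices": [{"delta": {"content": "l"}}]}',
    "data: [DONE]",
]

print("".join(iter_sse_content(sample_lines)))
```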


## Example 3: Multi-turn Conversation

This example demonstrates **client-side** context across multiple turns: you include prior turns in the `messages` list you send to the server. This is essential for chatbots and interactive applications.

In [24]:
# Multi-turn conversation demonstrating context retention
conversation = [
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Nice to meet you, Alice! How can I help you today?"},
    {"role": "user", "content": "What is my name?"}
]

print("Conversation:")
for msg in conversation:
    print(f"  {msg['role'].upper()}: {msg['content']}")

response = client.chat.completions.create(
    model=MODEL_ID,
    messages=conversation,
    stream=False
)

assistant_reply = response.choices[0].message.content
print(f"\nASSISTANT: {assistant_reply}")

# Verify context retention
if "alice" in assistant_reply.lower():
    print("\n[Context retention verified]")
else:
    print("\n[Warning: Context may not be retained]")

Conversation:
  USER: My name is Alice
  ASSISTANT: Nice to meet you, Alice! How can I help you today?
  USER: What is my name?


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"



ASSISTANT: <reasoning>Alice.</reasoning>Your name is Alice.

[Context retention verified]
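
Because the server keeps no conversation state, the client has to grow the `messages` list itself after every exchange. The sketch below shows one way to do that, with an optional cap on how much history is resent; `add_turn`, `build_messages`, and `max_messages` are hypothetical names introduced for illustration.

```python
def add_turn(history, user_text, assistant_text):
    """Return a new history list with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


def build_messages(history, next_user_text, max_messages=None):
    """Build the `messages` payload: prior turns (optionally trimmed) plus the new question."""
    if max_messages is None:
        kept = history
    else:
        # Keep only the most recent messages; use an even number so that
        # user/assistant pairs stay together.
        kept = history[max(0, len(history) - max_messages):]
    return kept + [{"role": "user", "content": next_user_text}]


history = add_turn([], "My name is Alice", "Nice to meet you, Alice!")
messages = build_messages(history, "What is my name?")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The resulting list can be passed as `messages=` to `client.chat.completions.create`. Trimming trades recall of early turns for a smaller prompt.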


## Example 4: System Messages

System messages allow you to set the behavior, personality, or role of the assistant.

In [25]:
# Using system message to set assistant personality
response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {
            "role": "system", 
            "content": "You are Shakespeare. Respond only in Shakespearean English with dramatic flair."
        },
        {
            "role": "user", 
            "content": "Tell me about the weather today"
        }
    ],
    stream=False
)

print("System: You are Shakespeare...")
print("User: Tell me about the weather today")
print(f"\nShakespeare: {response.choices[0].message.content}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


System: You are Shakespeare...
User: Tell me about the weather today

Shakespeare: <reasoning>We need to respond as Shakespeare, only Shakespearean English with dramatic flair. The user asks: "Tell me about the weather today". So we need to describe the weather in Shakespearean style, with dramatic flair, likely using archaic language. Since we're supposed to respond as Shakespeare, must use pronouns, poetic flair, etc. Might mention weather specifics like "the sun doth blaze" etc. But the assistant may not know actual weather. But the instructions don't preclude creative fictional weather. We can respond with a poetic description of a day. So produce something in early modern English. Use only Shakespearean style, no modern words. Ensure it feels like Shakespeare. Probably something like:

"Hear ye, fair listener, the morn doth unfurl... The sky, a vast tapestry of azure..." etc. Provide a weather forecast. We don't know exact; must use creative speculation. We can say "the sun doth s

In [26]:
# Another example: Technical expert
response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {
            "role": "system", 
            "content": "You are a concise Python expert. Give brief, code-focused answers."
        },
        {
            "role": "user", 
            "content": "How do I read a JSON file?"
        }
    ],
    stream=False
)

print("System: You are a concise Python expert...")
print("User: How do I read a JSON file?")
print(f"\nExpert: {response.choices[0].message.content}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


System: You are a concise Python expert...
User: How do I read a JSON file?

Expert: <reasoning>User: "How do I read a JSON file?" Provide brief, code-focused answer. Likely short explanation plus code snippet in Python. Also ask not too verbose. Provide simple import json, open, read. Provide maybe with context manager. Maybe also mention json.load and json.loads. Use with open('file.json', 'r') as f: data=json.load(f). Show reading entire file. Also optional reading string. Show optional error handling for file not found. Provide suggestions to check. Or mention using pathlib Path.read_text. Provide quick code. Provide minimal commentary.

We need to keep answer concise. Provide code in 2-3 lines. So I'll produce something like:

```python
import json

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(data)
```

Also mention about json.loads for string. Provide optional snippet.

Let's produce.</reasoning>```python
import json

# Load a JSON file into

## Example 5: Sampling Parameters

Control the generation behavior with sampling parameters:

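- **temperature**: Controls randomness. Values near 0.0 make output close to deterministic; higher values (for example 0.9 to 1.0) make it more varied.
- **top_p**: Nucleus sampling threshold. Sampling is restricted to the smallest set of most likely tokens whose cumulative probability reaches `top_p`.
- **max_tokens**: Maximum number of tokens to generate. With this model the `<reasoning>` text counts toward the limit, so a small value can cut off the answer.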
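
To make the first two parameters concrete, here is a toy sketch of the textbook definitions of temperature scaling and nucleus (top-p) filtering over three made-up logits. It is an illustration only: how Bedrock applies these parameters server-side is not specified in this notebook, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be > 0 in this sketch."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    return {index: probs[index] / mass for index in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.1, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(top_p_filter(softmax_with_temperature(logits, 1.0), 0.8))
```

At the lower temperature almost all probability lands on the top token, which is why low-temperature output varies little between runs.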

In [27]:
# Low temperature = more deterministic
import re


def _strip_reasoning(text: str) -> str:
    # Many models return <reasoning>...</reasoning> inside message.content.
    # For this notebook we want to display only the final answer.
    return re.sub(r"<reasoning>[\s\S]*?</reasoning>", "", text).strip()


def _haiku(temp: float) -> str:
    # Retry once if the model returns only <reasoning> with no final answer.
    for attempt in range(2):
        response = client.chat.completions.create(
            model=MODEL_ID,
            messages=[
                {
                    "role": "user",
                    "content": "Write a haiku (5-7-5 syllables) about coding. Output ONLY the 3-line haiku."
                    if attempt == 0
                    else "Write a haiku (5-7-5 syllables) about coding. Output ONLY the 3-line haiku. Start immediately with the first line.",
                },
            ],
            temperature=temp,
            max_tokens=400,
            stream=False,
        )
        cleaned = _strip_reasoning(response.choices[0].message.content or "")
        if cleaned:
            return cleaned
    # Fall back to raw content if still empty (should be rare)
    return response.choices[0].message.content or ""


print("Low temperature (0.1) - Deterministic:")
print(_haiku(0.1))
print()

Low temperature (0.1) - Deterministic:


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Code flows like rivers  
Debugging whispers in loops  
Logic blooms at dawn



In [28]:
# High temperature = more creative
response_creative = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": "Write a haiku (5-7-5 syllables) about coding. Output ONLY the 3-line haiku. Start immediately with the first line.",
        },
    ],
    temperature=0.9,
    top_p=0.95,
    max_tokens=400,
    stream=False,
)

print("High temperature (0.9) - Creative:")
print(_strip_reasoning(response_creative.choices[0].message.content or "") or response_creative.choices[0].message.content)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


High temperature (0.9) - Creative:
<reasoning>The user wants a haiku in 5-7-5 syllables about coding. Output only the three lines. Must start with first line immediately. No extra text. Let's craft a 5-7-5 haiku. Count syllables: "Code hums like night" maybe? Let's write.

Line 1: 5 syllables: "Code hums like quiet" that's 5? Let's count: Code(1) hums(1) like(1) qui-et(2) = 5? Actually "quiet" has 2 syllables: qui- et. So "Code hums like quiet" is 1+1+1+2=5. Good.

Line 2: 7 syllables: "While loops spin in the dark glow" Count: While(1) loops(1) spin(1) in(1) the(1) dark(1) glow(1) = 7. Good.

Line 3: 5 syllables: "Logic dawns, bugs fade" Count: Lo-gic(2) dawns(1) bugs(1) fade(1) = 5. Actually logic 2, dawns 1, bugs 1, fade 1 => 5.

So haiku: 
Code hums like quiet
While loops spin in the dark glow
Logic dawns, bugs fade

But "quiet" is 2 syllables, "hums" 1. Let's recount line1: Code(1) hums(1) like(1) qui-et(2) = 5. Good.

Line2: While(1) loops(1) spin(1) in(1) the(1) dark(1) glow(1) 

In [29]:
# Limiting output length
response_short = client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=400,
    stream=False,
)

print("Limited to 400 tokens:")
print(response_short.choices[0].message.content)
print(f"\n(Used {response_short.usage.completion_tokens} completion tokens)")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Limited to 400 tokens:
<reasoning>The user: "Explain quantum computing". Likely they want a general explanation. They didn't specify depth. We cover basics: bits vs qubits, superposition, entanglement, quantum gates, algorithms, challenges. Also some historical context, applications. Let's produce a thorough but understandable explanation. Mention key concepts: superposition, interference, entanglement, measurement. Provide analogies. Also mention quantum speedup examples like Shor, Grover. Also mention practical devices like IBM, D-Wave, etc. Also mention NISQ era. Provide conclusion. Let's do that.</reasoning>## Quantum computing – a quick but thorough primer

### 1. From classical bits to quantum bits (qubits)

| Classical computer | Quantum computer |
|-------------------|-----------------|
| Uses **bits** that can be *0* or *1* (on or off) | Uses **qubits** that can exist in a *superposition* of 0 and 1 at the same time |
| Logic operations are deterministic | Quantum operations
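
A `max_tokens` limit can cut a response short, as the unfinished table above suggests. In the OpenAI chat completions format, a choice's `finish_reason` is `"length"` when generation stopped at the token limit; whether the Bedrock endpoint always reports it is an assumption here. The sketch works on a plain dict such as the one `response_short.model_dump()` returns, and `truncation_report` is a hypothetical helper.

```python
def truncation_report(response):
    """Summarise whether the first choice stopped because of the token limit."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    return {
        "truncated": choice.get("finish_reason") == "length",
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written stand-in for a response dict, not captured server output.
sample_response = {
    "choices": [{"finish_reason": "length", "message": {"content": "..."}}],
    "usage": {"prompt_tokens": 70, "completion_tokens": 400, "total_tokens": 470},
}

print(truncation_report(sample_response))
```

When `truncated` is true, raising `max_tokens` or asking for a shorter answer are the usual remedies.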

## Example 6: Error Handling

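Production applications should handle request failures explicitly. The wrapper below catches exceptions and maps common failures (authentication, unknown model, connection) to actionable messages.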

In [30]:
def safe_chat_completion(messages, model_id=MODEL_ID):
    """Wrapper with error handling for chat completions."""
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=messages,
            stream=False
        )
        return response.choices[0].message.content
    except Exception as e:
        error_msg = str(e)
        if "401" in error_msg or "unauthorized" in error_msg.lower():
            print("Authentication Error: Check AWS_BEARER_TOKEN_BEDROCK on the server")
        elif "404" in error_msg:
            print(f"Model not found: {model_id}")
        elif "connection" in error_msg.lower():
            print("Connection Error: Is the Llama Stack server running?")
        else:
            print(f"Error: {e}")
        return None

# Test with valid request
print("Testing valid request...")
result = safe_chat_completion([{"role": "user", "content": "Hello!"}])
if result:
    print(f"Success: {result}")

print("\nTesting invalid model...")
# Test with invalid model (will fail gracefully)
result = safe_chat_completion(
    [{"role": "user", "content": "Hello!"}],
    model_id="invalid-model-id"
)
if result is None:
    print("Handled invalid model gracefully")

Testing valid request...


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 404 Not Found"


Success: <reasoning>User: "Hello!" We need a friendly greeting. Probably respond warmly.</reasoning>Hello! 👋 How can I help you today?

Testing invalid model...
Model not found: invalid-model-id
Handled invalid model gracefully
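
Matching on substrings of the error text, as the wrapper above does, can misfire when a message merely happens to contain `404`. One alternative is to branch on a numeric status code: the OpenAI SDK's `APIStatusError` subclasses expose a `status_code` attribute, while connection failures carry none. In this sketch `hint_for_error` and `FakeStatusError` are hypothetical names, and the fake exception stands in for SDK errors so the snippet runs without a server.

```python
HINTS = {
    401: "Authentication Error: Check AWS_BEARER_TOKEN_BEDROCK on the server",
    404: "Model not found: check the model ID and registered models",
}


def hint_for_error(exc):
    """Map an exception to a troubleshooting hint using its HTTP status, if any."""
    status = getattr(exc, "status_code", None)
    if status is None:
        # No HTTP status usually means the request never reached the server.
        return f"Connection Error ({type(exc).__name__}): Is the Llama Stack server running?"
    return HINTS.get(status, f"HTTP {status}: {exc}")


class FakeStatusError(Exception):
    """Stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, status_code, message):
        super().__init__(message)
        self.status_code = status_code


print(hint_for_error(FakeStatusError(404, "no such model")))
print(hint_for_error(ConnectionRefusedError("refused")))
```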


## Example 7: Batch Processing

Process several prompts in a loop, sending one request per prompt.

In [31]:
# Batch of questions
questions = [
    "What is the capital of France?",
    "What is 2 + 2?",
    "Who wrote Romeo and Juliet?",
]

print("Batch Processing Results:")
print("=" * 50)

for i, question in enumerate(questions, 1):
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": question}],
        max_tokens=400,
        stream=False,
    )
    print(f"\nQ{i}: {question}")
    print(f"A{i}: {response.choices[0].message.content}")

Batch Processing Results:


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"



Q1: What is the capital of France?
A1: <reasoning>The user: "What is the capital of France?" Should answer: Paris. Probably simple.</reasoning>The capital of France is **Paris**.


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"



Q2: What is 2 + 2?
A2: <reasoning>The user asks: "What is 2 + 2?" Very simple. The answer is 4. But we must consider policy: no policy violation. It's elementary math. So answer: 4.</reasoning>2 + 2 = 4.


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"



Q3: Who wrote Romeo and Juliet?
A3: <reasoning>The user asks the trivial question: Who wrote Romeo and Juliet? The answer: William Shakespeare. Should I give an explanation? It's a straightforward answer. Also user is likely expecting a short answer. I can just respond.

We must keep the response short, as per policy. Probably correct to respond: William Shakespeare.</reasoning>Romeo and Juliet was written by William Shakespeare.
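
The loop above sends one request at a time. If throughput matters, a thread pool is one way to overlap the waiting. This sketch assumes each call is an independent HTTP request and that your Bedrock quota tolerates concurrent calls. `run_batch` and `fake_ask` are hypothetical; `fake_ask` stands in for a function that wraps `client.chat.completions.create`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(ask, questions, max_workers=4):
    """Call ask(question) concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the inputs.
        return list(pool.map(ask, questions))


def fake_ask(question):
    """Stand-in for a real model call so the sketch runs offline."""
    return f"answer to: {question}"


batch_questions = ["What is 2 + 2?", "Who wrote Romeo and Juliet?"]
for question, answer in zip(batch_questions, run_batch(fake_ask, batch_questions)):
    print(f"Q: {question}\nA: {answer}")
```

Keeping `max_workers` small is a conservative default while you learn what rate limits apply to your account.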


## Troubleshooting

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `Connection refused` | Server not running | Start server: `uv run llama stack run ./your-config.yaml` |
| `401 Unauthorized` | Invalid/expired token | Get new token from AWS Bedrock Console |
| `Model not found` | Model not registered | Check `registered_resources.models` in config |
| `Bearer Token expired` | Token older than 12h | Generate new pre-signed URL |

### Getting a New Bearer Token

1. Go to AWS Bedrock Console > Playground > Chat
2. Open browser DevTools > Network tab
3. Send a message and capture the `Authorization` header
4. Token starts with `bedrock-api-key-...`

## cURL Equivalent Commands

For reference, here are the equivalent cURL commands (for the minimal `./your-config.yaml` setup).

Tip: auto-discover the model ID first:

```bash
MODEL_ID=$(curl -s http://localhost:8321/v1/models | jq -r '.data[] | select(.custom_metadata.provider_id=="bedrock") | .id' | head -1)
echo "Using model: $MODEL_ID"
```

Then call chat completions:

```bash
# Example 1: Simple Chat Completion
curl -X POST "http://localhost:8321/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_ID"'",
    "messages": [{"role": "user", "content": "Which planet do humans live on?"}],
    "stream": false
  }' | jq

# Example 2: Streaming
curl -X POST "http://localhost:8321/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_ID"'",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

# Example 5: With Sampling Parameters
curl -X POST "http://localhost:8321/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_ID"'",
    "messages": [{"role": "user", "content": "Write a haiku"}],
    "temperature": 0.7,
    "max_tokens": 50,
    "stream": false
  }' | jq
```
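
For comparison, the same request body can be built in Python with only the standard library, which sidesteps shell quoting. This sketch constructs the request without sending it; `build_chat_request` is a hypothetical helper, and the URL and model ID are the example values used in this notebook.

```python
import json
import urllib.request


def build_chat_request(base_url, model_id, prompt, **params):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        **params,  # e.g. temperature, max_tokens
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8321",
    "bedrock/openai.gpt-oss-20b-1:0",
    "Write a haiku",
    temperature=0.7,
    max_tokens=50,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a running server: urllib.request.urlopen(request, timeout=30)
```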

## Summary

This notebook demonstrated the key capabilities of the AWS Bedrock provider:

| Feature | Status | Example |
|---------|--------|--------|
| Non-streaming chat | Supported | Example 1 |
| Streaming chat | Supported | Example 2 |
| Multi-turn conversation | Supported (client provides history) | Example 3 |
| System messages | Supported | Example 4 |
| Sampling parameters | Supported | Example 5 |
| Error handling | Demonstrated | Example 6 |
| Batch processing | Demonstrated | Example 7 |

### API Used

```python
# OpenAI-compatible chat completions API
# (Use the model ID returned by GET /v1/models, e.g. "bedrock/openai.gpt-oss-20b-1:0")
client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "..."}],
    stream=False  # or True for streaming
)
```

### Notes / Limitations

- **Bedrock OpenAI-compatible endpoint limitations**:
  - No `/v1/embeddings` (use native Bedrock embedding models instead)
  - No `/v1/completions` (chat completions only)
  - No reliable tool calling support
  - No Bedrock `/v1/models` for dynamic discovery
- **Llama Stack server behavior**:
  - `/v1/models` exists and lists *registered* models (e.g., from `./your-config.yaml`).

### References

- [AWS Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)
- [Llama Stack Documentation](https://github.com/meta-llama/llama-stack)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)