<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/llm/openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseten Integration Cookbook

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [7]:
%pip install llama-index llama-index-llms-baseten

Note: you may need to restart the kernel to use updated packages.


## Basic Usage

In [1]:
from llama_index.llms.baseten import Baseten

#### Call `complete` with a prompt

In [6]:
# Create the LLM instance and print some debug info.
# Replace the placeholders with your own model ID and API key.
llm = Baseten(
    model_id="YOUR_MODEL_ID",
    api_key="YOUR_API_KEY",
)

# The integration stores model_id in the `model` field
print(f"Model ID: {llm.model}")
print(f"API Base URL: {llm.api_base}")

# Send a direct request first to verify the API key and endpoint,
# and to confirm the model is up.
# The model must be deployed to the production environment.
import requests
direct_response = requests.post(
    f"https://model-{llm.model}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {llm.api_key}"},
    json={
        "messages": [{"role": "user", "content": "Paul Graham is"}],
    }
)

print("\nDirect request response:")
print(f"Status: {direct_response.status_code}")
print(f"Response: {direct_response.json()}")

# Now try the LLM call
try:
    llm_response = llm.complete("Paul Graham is")
    print("\nLLM response:")
    print(llm_response.text)
except Exception as e:
    print("\nError in LLM call:")
    print(f"Error type: {type(e)}")
    print(f"Error message: {str(e)}")


Model ID: yqvr2lxw
API Base URL: https://model-yqvr2lxw.api.baseten.co/environments/production/sync/v1

Direct request response:
Status: 200
Response: {'id': '161', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': "Paul Graham is a significant figure in the tech industry, recognized as a venture capitalist, programmer, and technical writer. He founded Y Combinator, one of the world's most successful startup accelerators. Paul Graham is known for his insightful writings on the tech industry and startups, and he played a crucial role in fostering many successful companies in various technology domains.", 'refusal': None, 'role': 'assistant', 'audio': None, 'function_call': None, 'tool_calls': None}}], 'created': 1743032053, 'model': '', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': {'completion_tokens': 73, 'prompt_tokens': 32, 'total_tokens': 105, 'completion_tokens_details': None, 'prompt_tokens_details

#### Call `chat` with a list of messages

In [7]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)

In [8]:
print(resp)

assistant: Ahoy there, matey! My name is Captain Blunderbuss, but you can call me Cap'n if you're feeling friendly. Now, what's a dandy thing we can do with a name like that?


## Streaming

Using the `stream_complete` endpoint

In [9]:
resp = llm.stream_complete("Paul Graham is ")

In [10]:
for r in resp:
    print(r.delta, end="")

Paul Graham is an American entrepreneur, venture capitalist, and writer. He is best known as the founder of Y Combinator, a startup accelerator program, and as the author of several influential books on entrepreneurship, including "Hackers & Painters" and "Real Software." Graham has been a significant figure in the tech industry and has provided valuable insights and advice to many startups and entrepreneurs.

Using the `stream_chat` endpoint

In [11]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)

In [12]:
for r in resp:
    print(r.delta, end="")

Ahoy there, matey! My name is Captain Blunderbuss, but you can call me Cap'n if you're feeling friendly. Now, what's a dandy thing we can do with a name like that?
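Both streaming loops above print each `delta` as it arrives. If the full text is also needed afterwards, the deltas can be accumulated while they are printed. The following is a minimal sketch: `collect_stream` is a hypothetical helper, not part of the integration, and it assumes only that each streamed chunk exposes a `delta` attribute (as used above) that may occasionally be `None`. Stand-in chunks are used so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any], echo: bool = False) -> str:
    """Join the `delta` of every streamed chunk into one string.

    `collect_stream` is a hypothetical helper for illustration only.
    """
    parts = []
    for chunk in chunks:
        # Treat a missing or None delta as an empty string.
        delta = getattr(chunk, "delta", None) or ""
        if echo:
            print(delta, end="")
        parts.append(delta)
    return "".join(parts)


# Stand-in chunks that mimic objects with a `delta` attribute.
fake_chunks = [
    SimpleNamespace(delta="Ahoy "),
    SimpleNamespace(delta=None),
    SimpleNamespace(delta="there!"),
]
full_text = collect_stream(fake_chunks)
```

With a real model, the same helper could presumably be called as `collect_stream(llm.stream_chat(messages), echo=True)`.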

## Async
"Async operations are used for long-running inference tasks that may hit request timeouts, batch inference jobs, and prioritizing certain requests."

(1) In the integration, the `acomplete` async function is implemented with aiohttp, an asynchronous HTTP client library for Python. The function calls `async_predict` at the appropriate Baseten model endpoint and, if the call succeeds, returns a response containing the `request_id`. The user can then use that `request_id` to check the status of the `async_predict` request or to cancel it.

(2) Once the model finishes executing the request, the async result is posted to the user-provided webhook endpoint. That endpoint is responsible for validating the webhook signature for security, then processing and storing the output.
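Signature validation on the webhook side is typically an HMAC comparison over the raw request body. The sketch below shows that general technique with HMAC-SHA256 from the Python standard library. It is an illustration under assumptions: the secret value, the hex encoding, and the helper names `sign_body` and `is_valid_signature` are hypothetical, and the exact header name and signature format Baseten uses should be taken from its async inference guide.

```python
import hashlib
import hmac


def sign_body(body: bytes, secret: str) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def is_valid_signature(body: bytes, received_signature: str, secret: str) -> bool:
    """Compare the received signature with one computed locally.

    hmac.compare_digest runs in constant time, which avoids leaking
    information about the expected value through timing differences.
    """
    expected = sign_body(body, secret)
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature.encode("utf-8")
    )


# Offline demonstration with made-up values (assumptions, not real secrets).
demo_secret = "example-webhook-secret"
demo_body = b'{"request_id": "abc123", "data": "model output"}'
demo_signature = sign_body(demo_body, demo_secret)

accepted = is_valid_signature(demo_body, demo_signature, demo_secret)
rejected = is_valid_signature(demo_body + b"x", demo_signature, demo_secret)
```

The check should run against the raw bytes of the request body before any JSON parsing, since re-serializing the payload can change the bytes and invalidate the signature.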

* Testing was completed with the Baseten *async inference user guide*'s webhook endpoint running on Replit.

* `achat` is not implemented, because chat does not make sense for async operations.

* The OpenAI parent class's async methods cannot be used directly, because Baseten's `async_request` endpoint returns a `request_id` immediately instead of the completed result.

OpenAI: Wait for completion → return result

Baseten: Get request_id → result is posted to webhook

In [1]:
async_llm = Baseten(
    model_id="YOUR_MODEL_ID",
    api_key="YOUR_API_KEY",
    webhook_endpoint="YOUR_WEBHOOK_ENDPOINT",
)
response = await async_llm.acomplete("Paul Graham is")
print(response)  # This is the request id

35643965636d4c3da6f54b5c3b354aa0


In [2]:
"""
Return the status of an async_predict request, given its request_id and the
model_id the request was made with.
"""

import os

import requests

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
# Read the API key from an environment variable (the variable name
# BASETEN_API_KEY is an assumption), falling back to a placeholder
baseten_api_key = os.environ.get("BASETEN_API_KEY", "YOUR_API_KEY")

resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)

print(resp.json())

{'request_id': '35643965636d4c3da6f54b5c3b354aa0', 'model_id': 'yqvr2lxw', 'deployment_id': '31kmg1w', 'status': 'SUCCEEDED', 'webhook_status': 'SUCCEEDED', 'created_at': '2025-03-27T00:17:51.578558Z', 'status_at': '2025-03-27T00:18:38.768572Z', 'errors': []}
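When no webhook result has arrived yet, the status check above can be repeated until the request reaches a final state. The following is a rough sketch of such a polling loop. `wait_for_async_request` is a hypothetical helper, and of the status names used here only `SUCCEEDED` appears in the output above; `FAILED`, `CANCELED`, `QUEUED`, and `IN_PROGRESS` are assumptions. The status lookup is passed in as a callable so the example runs offline with canned responses.

```python
import time
from typing import Any, Callable, Dict

# Only SUCCEEDED is confirmed by the output above; the others are assumed.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_async_request(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until it reports a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        info = fetch_status()
        if info.get("status") in TERMINAL_STATUSES:
            return info
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(poll_interval)
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")


# Offline demonstration with canned responses instead of HTTP calls.
canned = iter(
    [
        {"status": "QUEUED"},
        {"status": "IN_PROGRESS"},
        {"status": "SUCCEEDED", "errors": []},
    ]
)
final = wait_for_async_request(lambda: next(canned), sleep=lambda _: None)
```

In practice, `fetch_status` could wrap the `requests.get(...)` call from the cell above and return `resp.json()`.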
