<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/llm/openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseten Cookbook

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
# %pip install llama-index llama-index-llms-baseten

## Model APIs vs. Dedicated Deployments

Baseten offers two main ways to access LLMs.
1. Model APIs are public endpoints for popular open-source models (DeepSeek, Llama, etc.). You can use a frontier model directly via its slug, e.g. `deepseek-ai/DeepSeek-V3-0324`, and you are charged on a per-token basis.
2. Dedicated deployments are useful for serving custom models when you want to autoscale production workloads and need fine-grained configuration. You deploy a model from your Baseten dashboard and then provide its 8-character model ID, e.g. `abcd1234`.
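
To make the difference concrete, here is a minimal sketch of how the two access modes map to different base URLs. The helper `baseten_api_base` is hypothetical and not part of the integration; the URL patterns are copied from the `api_base` values printed in the cell outputs further down, so treat them as observations from this notebook rather than a specification.

```python
def baseten_api_base(model_id: str, model_apis: bool = True) -> str:
    """Hypothetical helper: return the base URL a Baseten model would be called at."""
    if model_apis:
        # Model APIs: one shared endpoint; the model is chosen by its slug in the request body.
        return "https://inference.baseten.co/v1/"
    # Dedicated deployment: each model gets its own subdomain keyed by its model ID.
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


print(baseten_api_base("deepseek-ai/DeepSeek-V3-0324"))
print(baseten_api_base("abcd1234", model_apis=False))
```

With Model APIs the slug never appears in the URL, whereas a dedicated deployment encodes the model ID in the hostname.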


In [17]:
# Local development only: if you are running from a source checkout instead of
# the installed package, add the integration directory to sys.path first.
# import sys
# sys.path.insert(0, "/path/to/llama_index/llama-index-integrations/llms/llama-index-llms-baseten")

from llama_index.llms.baseten import Baseten

#### Model APIs

In [18]:
# Create an LLM instance for Model APIs
llm = Baseten(
    model_id="deepseek-ai/DeepSeek-V3-0324",
    api_key="YOUR_API_KEY",
    model_apis=True,  # Default
)

# The model_id is stored in the `model` field
print(f"Model ID: {llm.model}")
print(f"API Base URL: {llm.api_base}")

# Send a direct request first to verify the API key and endpoint
import requests

direct_response = requests.post(
    "https://inference.baseten.co/v1/chat/completions",
    headers={"Authorization": f"Api-Key {llm.api_key}"},
    json={
        "model": llm.model,
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ],
        "max_tokens": 32,
    }
)

print("\nDirect request response:")
print(f"Status: {direct_response.status_code}")
print(f"Response: {direct_response.json()}")

# Now try the LLM call
try:
    llm_response = llm.complete("What is the capital of France?")
    print("\nLLM response:")
    print(llm_response.text)
except Exception as e:
    print("\nError in LLM call:")
    print(f"Error type: {type(e)}")
    print(f"Error message: {str(e)}")

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
print(resp)

Model ID: deepseek-ai/DeepSeek-V3-0324
API Base URL: https://inference.baseten.co/v1/

Direct request response:
Status: 200
Response: {'id': 'chatcmpl-5e014980abc94be7a1eac5a195732927', 'choices': [{'index': 0, 'message': {'content': 'The capital of France is **Paris**.  \n\nParis is known for its iconic landmarks such as the **Eiffel Tower**, **Louvre Museum', 'refusal': None, 'tool_calls': None, 'role': 'assistant', 'function_call': None, 'audio': None}, 'finish_reason': 'length', 'logprobs': None}], 'created': 1752177248, 'model': 'baseten/DeepSeek-V3-FP4', 'service_tier': None, 'system_fingerprint': None, 'object': 'chat.completion', 'usage': {'prompt_tokens': 13, 'completion_tokens': 30, 'total_tokens': 43, 'prompt_tokens_details': None, 'completion_tokens_details': None}}

LLM response:
The capital of France is **Paris**. It is one of the most famous cities in the world, known for its rich history, culture, and landmarks like the Eiffel Tower, the Louvre Museum, and Notre-Dame Ca

#### Call `complete` with a prompt

In [6]:
# Create an LLM instance for a dedicated deployment
llm = Baseten(
    model_id="6wg17egw",
    api_key="YOUR_API_KEY",
    model_apis=False,  # Use a dedicated deployment instead of Model APIs
)

# The model_id is stored in the `model` field
print(f"Model ID: {llm.model}")
print(f"API Base URL: {llm.api_base}")

# Send a direct request first to verify the API key and endpoint and to confirm the model is up.
# The model must be deployed to the production environment.
import requests
direct_response = requests.post(
    f"https://model-{llm.model}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {llm.api_key}"},
    json={
        "messages": [{"role": "user", "content": "Paul Graham is"}],
    }
)

print("\nDirect request response:")
print(f"Status: {direct_response.status_code}")
print(f"Response: {direct_response.json()}")

# Now try the LLM call
try:
    llm_response = llm.complete("Paul Graham is")
    print("\nLLM response:")
    print(llm_response.text)
except Exception as e:
    print("\nError in LLM call:")
    print(f"Error type: {type(e)}")
    print(f"Error message: {str(e)}")


Model ID: yqvr2lxw
API Base URL: https://model-yqvr2lxw.api.baseten.co/environments/production/sync/v1

Direct request response:
Status: 200
Response: {'id': '161', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': "Paul Graham is a significant figure in the tech industry, recognized as a venture capitalist, programmer, and technical writer. He founded Y Combinator, one of the world's most successful startup accelerators. Paul Graham is known for his insightful writings on the tech industry and startups, and he played a crucial role in fostering many successful companies in various technology domains.", 'refusal': None, 'role': 'assistant', 'audio': None, 'function_call': None, 'tool_calls': None}}], 'created': 1743032053, 'model': '', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': {'completion_tokens': 73, 'prompt_tokens': 32, 'total_tokens': 105, 'completion_tokens_details': None, 'prompt_tokens_details

#### Call `chat` with a list of messages

In [19]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)

In [20]:
print(resp)

assistant: Arrr, matey! I be known as Captain Crimsonbeard—though me beard be more fiery red than crimson, truth be told! A pirate of legend, scourge of the seven memes, and connoisseur of questionable life choices. But ye can call me Cap’n if ye like, or "That Weird Pirate Who Won’t Stop Talking About Pineapples." Now, what mischief brings ye to me ship today? 🏴‍☠️🍍


## Streaming

Using the `stream_complete` method

In [13]:
resp = llm.stream_complete("Paul Graham is ")

In [14]:
for r in resp:
    print(r.delta, end="")

Paul Graham is a British-American entrepreneur, essayist, and venture capitalist, best known as a co-founder of **Y Combinator**, a highly influential startup accelerator that has helped launch companies like Airbnb, Dropbox, Stripe, and Reddit.  

### Key Facts About Paul Graham:  
1. **Early Career**: Originally a programmer, he developed **Viaweb**, one of the first web-based applications, which was acquired by Yahoo! in 1998 and became Yahoo! Store.  
2. **Y Combinator**: In 2005, he co-founded Y Combinator with Jessica Livingston, Robert Morris, and Trevor Blackwell. It pioneered the "seed accelerator" model, providing funding and mentorship to early-stage startups.  
3. **Essays**: Graham is known for his insightful essays on startups, technology, and life philosophy, available on his website ([paulgraham.com](http://www.paulgraham.com)). Popular ones include *"How to Get Startup Ideas"* and *"Do Things That Don't Scale."*  
4. **Investments**: Through YC, he has backed thousands

Using the `stream_chat` method

In [15]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)

In [16]:
for r in resp:
    print(r.delta, end="")

Arrr, me name be Captain Crimsonbeard! A fearsome and flamboyant pirate with a beard as red as the setting sun and a wardrobe brighter than a treasure chest full o’ jewels! I sail the seven seas in search of adventure, gold, and the finest rum—always with a dramatic flair and a twinkle in me eye. 

What be yer name, matey? Or shall I just call ye "Lucky Crewmember" for now? *winks and adjusts my feathered hat*
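
If you want the full text as well as the live printout, the streaming loops above can be wrapped in a small helper. This is only a sketch: `collect_stream` is a hypothetical name, and it assumes each chunk exposes a `delta` attribute, as the loops above do. The stand-in chunks let it run without a model.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Print each chunk's delta as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        delta = getattr(chunk, "delta", None) or ""  # tolerate chunks with no delta
        print(delta, end="")
        parts.append(delta)
    return "".join(parts)


# Stand-in chunks so the sketch runs offline; with the integration you would
# pass llm.stream_complete(...) or llm.stream_chat(...) instead.
fake_chunks = [
    SimpleNamespace(delta="Arrr, "),
    SimpleNamespace(delta=None),
    SimpleNamespace(delta="matey!"),
]
full_text = collect_stream(fake_chunks)
```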

## Async
"Async operations are used for long-running inference tasks that may hit request timeouts, batch inference jobs, and prioritizing certain requests."

(1) In the integration, the `acomplete` async function is implemented with aiohttp, an asynchronous HTTP client library for Python. The function invokes `async_predict` at the appropriate Baseten model endpoint, and on success the user receives a response containing the `request_id`. The user can then check the status of the `async_predict` request, or cancel it, using the returned `request_id`.

(2) Once the model finishes executing the request, the async result is posted to the user-provided webhook endpoint. The user's endpoint is responsible for validating the webhook signature for security, then processing and storing the output.

* Testing was completed with the webhook endpoint from the Baseten *async inference user guide*, running on Replit.

* `achat` is not implemented, because chat does not make sense for async operations.

* The OpenAI parent class's async methods cannot be used directly, because Baseten's `async_request` endpoint returns a `request_id` immediately rather than the result.

OpenAI: Wait for completion → return result

Baseten: Get request_id → result is posted to webhook
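
The signature check mentioned in step (2) could look roughly like the sketch below. It assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body, keyed by a webhook secret and sent in a header as comma-separated `v1=<hex>` entries. The function name, the secret value, and the header format are all assumptions for illustration, so consult Baseten's webhook documentation for the exact scheme before relying on it.

```python
import hashlib
import hmac


def is_valid_signature(body: bytes, signature_header: str, secret: str) -> bool:
    """Return True if any `v1=<hex>` entry in the header matches the body's HMAC-SHA256."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    for entry in signature_header.split(","):
        scheme, _, candidate = entry.strip().partition("=")
        # compare_digest performs a constant-time comparison
        if scheme == "v1" and hmac.compare_digest(candidate.encode(), expected.encode()):
            return True
    return False


# Illustrative values only (assumed, not real credentials)
secret = "example-webhook-secret"
body = b'{"request_id": "abc", "data": "..."}'
good_header = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(is_valid_signature(body, good_header, secret))
print(is_valid_signature(body, "v1=deadbeef", secret))
```

Verifying against the raw bytes, before any JSON parsing, matters because re-serializing the payload can change whitespace or key order and invalidate the digest.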

In [1]:
async_llm = Baseten(
    model_id="YOUR_MODEL_ID",
    api_key="YOUR_API_KEY", 
    webhook_endpoint="YOUR_WEBHOOK_ENDPOINT",
)
response = await async_llm.acomplete("Paul Graham is")
print(response) # This is the request id

35643965636d4c3da6f54b5c3b354aa0


In [2]:
"""
Return the status of an async_predict request, given its request_id and the
model_id the request was made with.
"""

import os

import requests

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
# Read the API key from an environment variable (the variable name
# BASETEN_API_KEY is an assumption), falling back to a placeholder
baseten_api_key = os.environ.get("BASETEN_API_KEY", "YOUR_API_KEY")

resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)

print(resp.json())

{'request_id': '35643965636d4c3da6f54b5c3b354aa0', 'model_id': 'yqvr2lxw', 'deployment_id': '31kmg1w', 'status': 'SUCCEEDED', 'webhook_status': 'SUCCEEDED', 'created_at': '2025-03-27T00:17:51.578558Z', 'status_at': '2025-03-27T00:18:38.768572Z', 'errors': []}
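
When no webhook is handy, the status check above can be repeated until the request settles. The following is a minimal polling sketch, not part of the integration: `wait_for_async_request` is a hypothetical helper, and only `SUCCEEDED` is confirmed by the output above. The other status names (`QUEUED`, `IN_PROGRESS`, `FAILED`, `EXPIRED`, `CANCELED`) are assumptions. A fake fetcher stands in for the HTTP call so the example runs offline.

```python
import time
from typing import Callable

# "SUCCEEDED" appears in the output above; the other terminal states are assumed.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "EXPIRED", "CANCELED"}


def wait_for_async_request(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Call fetch_status until it reports a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        info = fetch_status()
        if info.get("status") in TERMINAL_STATUSES:
            return info
        if attempt < max_attempts - 1:
            time.sleep(interval_s)  # wait before the next check
    raise TimeoutError(f"request still pending after {max_attempts} checks")


# Stand-in for the requests.get(...).json() call shown above
fake_responses = iter(
    [{"status": "QUEUED"}, {"status": "IN_PROGRESS"}, {"status": "SUCCEEDED"}]
)
result = wait_for_async_request(lambda: next(fake_responses), interval_s=0.0)
print(result["status"])
```

In real use, `fetch_status` would wrap the `requests.get` call from the previous cell and return `resp.json()`.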
