Initial Checks
- I confirm that I'm using the latest version of Pydantic AI
- I confirm that I searched for my issue in https://github.com/pydantic/pydantic-ai/issues before opening this issue
Description
After an LLM finishes a call (request), it also returns some statistics, for example the number of prompt tokens and completion tokens. These are tracked by the Usage class (pydantic-ai/pydantic_ai_slim/pydantic_ai/usage.py). According to the class documentation, as of 04.2025 the attribute details should contain "any extra details returned by the model." But this is not the case with the latest version of Pydantic AI (version 0.1.3).
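For illustration, this is the shape I am talking about (a minimal sketch; the field names request_tokens, response_tokens, total_tokens and details are the ones documented on the Usage dataclass):

from pydantic_ai.usage import Usage

# Token counts have dedicated fields; everything else is supposed to land in `details`
usage = Usage(request_tokens=52, response_tokens=32, total_tokens=84)
print(usage.details)  # documented as "any extra details returned by the model"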
I use llama-server (part of llama.cpp) as the backend to host an LLM (in the form of a GGUF file). Using tcpflow to capture the communication between server and client, I can see the last message sent from the server as follows:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "delta": {}
    }
  ],
  "created": 1745457407,
  "id": "chatcmpl-G7Hmg3VGIPYO6hFk6nw7b4VtC4vqyliz",
  "model": "Qwen2.5-7B-Instruct-1M-q4_k_m-Finetuned",
  "system_fingerprint": "b5127-e959d32b",
  "object": "chat.completion.chunk",
  "usage": {
    "completion_tokens": 32,
    "prompt_tokens": 52,
    "total_tokens": 84
  },
  "timings": {
    "prompt_n": 17,
    "prompt_ms": 2470.602,
    "prompt_per_token_ms": 145.3295294117647,
    "prompt_per_second": 6.880914044431277,
    "predicted_n": 32,
    "predicted_ms": 6924.775,
    "predicted_per_token_ms": 216.39921875,
    "predicted_per_second": 4.621088771837353
  }
}
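To be explicit about which part gets lost, here is a throwaway snippet (raw_chunk is just the final chunk above pasted in, shortened, assuming the usual SSE framing with the "data: " prefix stripped):

import json

# raw_chunk: the final streamed chunk shown above, without the SSE "data: " prefix
raw_chunk = '{"usage": {"completion_tokens": 32, "prompt_tokens": 52, "total_tokens": 84}, "timings": {"prompt_ms": 2470.602, "predicted_ms": 6924.775}}'
chunk = json.loads(raw_chunk)
print(chunk["usage"])    # these values do reach Usage (request/response/total tokens)
print(chunk["timings"])  # these values are currently dropped instead of landing in details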
The completion_tokens and prompt_tokens are captured correctly by the Usage class (as response_tokens and request_tokens, respectively). But everything about the time taken to process the request, e.g. prompt_ms or prompt_per_token_ms, is missing from the details field of the Usage class. Unless I am mistaken, that field should contain any extra details returned by the model.
Expectation
The details field of the Usage class should contain any extra details returned by the model, e.g. the timings block or prompt_per_token_ms.
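In other words, after a run like the one in the example code below, I would expect something along these lines to hold (a hypothetical check; whether the keys end up nested under "timings" or flattened is secondary):

usage = result.usage()
# Expected: llama-server's extra fields forwarded into `details`
assert usage.details, "details should not be empty"
assert "prompt_ms" in usage.details or "timings" in usage.details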
Example Code
import os

from httpx import AsyncClient
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

...

agent = Agent(
    OpenAIModel(
        "model_name",
        provider=OpenAIProvider(
            api_key=os.environ["LLM_API_KEY"],
            base_url=f"{LLM_URL}:8081/v1",
            http_client=AsyncClient(headers={"Connection": "close"}),
        ),
    ),
    retries=3,
    deps_type=str,
)

...

async with agent.run_stream(
    latest_user_message,
    message_history=message_history,
    deps=system_prompt,
) as result:
    async for chunk in result.stream_text(delta=True):
        writer(chunk)
    print(result.usage())
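What I actually observe is that the token counts come through but details stays empty, even though llama-server sent the timings block shown above:

print(result.usage().details)  # -> empty (None), no trace of the `timings` data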
Python, Pydantic AI & LLM client version
+ Windows 11, WSL2, Ubuntu 24.04
+ Pydantic AI v0.1.3
+ Python v3.12.7
+ llama-cli 5117
+ Langchain Core 0.3.49