# How to track token usage in ChatModels

:::info Prerequisites

This guide assumes familiarity with the following concepts:
- [Chat models](/docs/concepts/#chat-models)

:::

Tracking token usage to calculate cost is an important part of putting your app in production. This guide goes over how to obtain this information from your LangChain model calls.

This guide requires `langchain-openai >= 0.1.8`.

In [None]:
%pip install --upgrade --quiet langchain langchain-openai

## Using LangSmith

You can use [LangSmith](https://www.langchain.com/langsmith) to help track token usage in your LLM application. See the [LangSmith quick start guide](https://docs.smith.langchain.com/).

## Using AIMessage.usage_metadata

A number of model providers return token usage information as part of the chat generation response. When available, this information will be included on the `AIMessage` objects produced by the corresponding model.

LangChain `AIMessage` objects include a [usage_metadata](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.ai.AIMessage.html#langchain_core.messages.ai.AIMessage.usage_metadata) attribute. When populated, this attribute will be a [UsageMetadata](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.ai.UsageMetadata.html) dictionary with standard keys (e.g., `"input_tokens"` and `"output_tokens"`).

Examples:

**OpenAI**:

In [1]:
# !pip install -qU langchain-openai

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
openai_response = llm.invoke("hello")
openai_response.usage_metadata

{'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}

**Anthropic**:

In [2]:
# !pip install -qU langchain-anthropic

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-haiku-20240307")
anthropic_response = llm.invoke("hello")
anthropic_response.usage_metadata

{'input_tokens': 8, 'output_tokens': 12, 'total_tokens': 20}
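
Because `usage_metadata` is only populated when the provider reports usage, downstream code may want a fallback. The following is a minimal sketch, assuming the attribute is `None` when no usage is available. `usage_or_zero` is a hypothetical helper, and `SimpleNamespace` stands in for a real `AIMessage` so that the snippet runs without API keys:

```python
from types import SimpleNamespace

# Standard keys, as seen in the outputs above.
ZERO_USAGE = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}


def usage_or_zero(message):
    """Return a copy of the message's usage_metadata, or zero counts if missing."""
    usage = getattr(message, "usage_metadata", None)
    return dict(usage) if usage else dict(ZERO_USAGE)


# Hypothetical stand-ins for AIMessage objects returned by llm.invoke(...).
with_usage = SimpleNamespace(
    usage_metadata={"input_tokens": 8, "output_tokens": 9, "total_tokens": 17}
)
without_usage = SimpleNamespace(usage_metadata=None)

print(usage_or_zero(with_usage)["total_tokens"])     # 17
print(usage_or_zero(without_usage)["total_tokens"])  # 0
```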

### Using AIMessage.response_metadata

Metadata from the model response is also included in the `AIMessage` [response_metadata](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.ai.AIMessage.html#langchain_core.messages.ai.AIMessage.response_metadata) attribute. This data is typically not standardized, and different providers adopt different conventions for representing token counts:

In [3]:
print(f'OpenAI: {openai_response.response_metadata["token_usage"]}\n')
print(f'Anthropic: {anthropic_response.response_metadata["usage"]}')

OpenAI: {'completion_tokens': 9, 'prompt_tokens': 8, 'total_tokens': 17}

Anthropic: {'input_tokens': 8, 'output_tokens': 12}
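
If you have to work with these raw payloads, for example in logs that predate `usage_metadata`, one option is to map them onto the standard keys yourself. The sketch below is illustrative only: `normalize_usage` is a hypothetical helper, and it covers just the two key conventions printed above.

```python
def normalize_usage(raw):
    """Map provider-specific token-count keys onto the standard key names."""
    # OpenAI-style payloads use prompt/completion; Anthropic-style use input/output.
    input_tokens = raw.get("input_tokens", raw.get("prompt_tokens", 0))
    output_tokens = raw.get("output_tokens", raw.get("completion_tokens", 0))
    # The Anthropic payload above carries no total, so fall back to the sum.
    total_tokens = raw.get("total_tokens", input_tokens + output_tokens)
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total_tokens,
    }


openai_raw = {"completion_tokens": 9, "prompt_tokens": 8, "total_tokens": 17}
anthropic_raw = {"input_tokens": 8, "output_tokens": 12}

print(normalize_usage(openai_raw))
# {'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
print(normalize_usage(anthropic_raw))
# {'input_tokens': 8, 'output_tokens': 12, 'total_tokens': 20}
```

Both results match the `usage_metadata` values shown earlier for the same responses.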


### Streaming

Some providers support token count metadata in a streaming context.

#### OpenAI

For example, OpenAI will return a message [chunk](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.ai.AIMessageChunk.html) at the end of a stream with token usage information. This behavior is supported by `langchain-openai >= 0.1.8` and can be enabled by setting `stream_options={"include_usage": True}`.

```{=mdx}
:::note
By default, the last message chunk in a stream will include a `"finish_reason"` in the message's `response_metadata` attribute. If we include token usage in streaming mode, an additional chunk containing usage metadata will be added to the end of the stream, such that `"finish_reason"` appears on the second to last message chunk.
:::
```

In [4]:
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

aggregate = None
for chunk in llm.stream("hello", stream_options={"include_usage": True}):
    print(chunk)
    aggregate = chunk if aggregate is None else aggregate + chunk

content='' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content='Hello' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content='!' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content=' How' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content=' can' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content=' I' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content=' assist' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content=' you' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content=' today' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content='?' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content='' response_metadata={'finish_reason': 'stop'} id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf'
content='' id='run-b40e502e-d30e-4617-94ad-95b4dfee14bf' usage_metadata={'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}


Note that the usage metadata will be included in the sum of the individual message chunks:

In [5]:
print(aggregate.content)
print(aggregate.usage_metadata)

Hello! How can I assist you today?
{'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
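
If you would rather not sum the chunks, you can pick the relevant fields out of the stream as it arrives. The sketch below assumes the chunk layout shown above, where `"finish_reason"` and the usage metadata arrive on separate trailing chunks. `FakeChunk` and `summarize_stream` are hypothetical names, and `FakeChunk` only imitates the attributes of a real message chunk that are used here:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed message chunk."""

    content: str = ""
    response_metadata: dict = field(default_factory=dict)
    usage_metadata: Optional[dict] = None


def summarize_stream(chunks):
    """Collect the text, finish reason, and usage from an iterable of chunks."""
    text_parts, finish_reason, usage = [], None, None
    for chunk in chunks:
        text_parts.append(chunk.content)
        # Keep the most recent finish_reason seen so far.
        finish_reason = chunk.response_metadata.get("finish_reason", finish_reason)
        # Usage arrives on its own chunk, after the finish_reason chunk.
        if chunk.usage_metadata is not None:
            usage = chunk.usage_metadata
    return "".join(text_parts), finish_reason, usage


stream = [
    FakeChunk(content="Hello"),
    FakeChunk(content="!"),
    FakeChunk(response_metadata={"finish_reason": "stop"}),
    FakeChunk(
        usage_metadata={"input_tokens": 8, "output_tokens": 9, "total_tokens": 17}
    ),
]

text, finish_reason, usage = summarize_stream(stream)
print(text)           # Hello!
print(finish_reason)  # stop
print(usage)          # {'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
```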


To disable streaming token counts for OpenAI, set `"include_usage"` to `False` in `stream_options`, or omit `stream_options` altogether:

In [6]:
for chunk in llm.stream("hello"):
    print(chunk)

content='' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content='Hello' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content='!' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content=' How' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content=' can' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content=' I' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content=' assist' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content=' you' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content=' today' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content='?' id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'
content='' response_metadata={'finish_reason': 'stop'} id='run-0085d64c-13d2-431b-a0fa-399be8cd3c52'


You can also enable streaming token usage by setting `model_kwargs` when instantiating the chat model. This can be useful when incorporating chat models into LangChain [chains](/docs/concepts#langchain-expression-language-lcel): usage metadata can be monitored when [streaming intermediate steps](/docs/how_to/streaming#using-stream-events) or using tracing software such as [LangSmith](https://docs.smith.langchain.com/).

In the example below, we return output structured to a desired schema while still observing the token usage streamed from intermediate steps.

In [8]:
from langchain_core.pydantic_v1 import BaseModel, Field


class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


llm = ChatOpenAI(
    model="gpt-3.5-turbo-0125",
    model_kwargs={"stream_options": {"include_usage": True}},
)
# Under the hood, .with_structured_output binds tools to the
# chat model and appends a parser.
structured_llm = llm.with_structured_output(Joke)

async for event in structured_llm.astream_events("Tell me a joke", version="v2"):
    if event["event"] == "on_chat_model_end":
        print(f'Token usage: {event["data"]["output"].usage_metadata}\n')
    elif event["event"] == "on_chain_end":
        print(event["data"]["output"])
    else:
        pass

Token usage: {'input_tokens': 79, 'output_tokens': 23, 'total_tokens': 102}

setup='Why was the math book sad?' punchline='Because it had too many problems.'


Token usage is also visible in the corresponding [LangSmith trace](https://smith.langchain.com/public/fe6513d5-7212-4045-82e0-fefa28bc7656/r) in the payload from the chat model.

## Using callbacks

There are also some API-specific callback context managers that allow you to track token usage across multiple calls. They are currently implemented only for the OpenAI API and the Bedrock Anthropic API.

### OpenAI

Let's first look at a simple example of tracking token usage for a single chat model call.

In [9]:
# !pip install -qU langchain-community wikipedia

from langchain_community.callbacks.manager import get_openai_callback

llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
    print(cb)

Tokens Used: 27
	Prompt Tokens: 11
	Completion Tokens: 16
Successful Requests: 1
Total Cost (USD): $2.95e-05
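
The reported cost is the token counts multiplied by per-token prices. As a worked check, the sketch below reproduces the figure above under assumed prices of $0.50 per million input tokens and $1.50 per million output tokens. These prices are an assumption and do not come from this guide, so check your provider's current pricing. `estimate_cost` is a hypothetical helper:

```python
# Assumed prices in USD per 1M tokens (not stated in this guide).
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 1.50


def estimate_cost(prompt_tokens, completion_tokens):
    """Estimate the USD cost of one call from its token counts."""
    input_cost = prompt_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = completion_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# Token counts taken from the callback output above.
cost = estimate_cost(prompt_tokens=11, completion_tokens=16)
print(f"Total Cost (USD): ${cost:.2e}")  # Total Cost (USD): $2.95e-05
```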


Anything inside the context manager will get tracked. Here's an example of using it to track multiple calls in sequence.

In [10]:
with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
    result2 = llm.invoke("Tell me a joke")
    print(cb.total_tokens)

55
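
For providers that these context managers do not cover, you can keep a similar running total by hand from each response's `usage_metadata`. This is a minimal sketch rather than a replacement for the callback: `UsageTally` is a hypothetical helper, it sums only the three standard integer keys, and the dictionaries passed in stand for `response.usage_metadata` values.

```python
from collections import Counter

STANDARD_KEYS = ("input_tokens", "output_tokens", "total_tokens")


class UsageTally:
    """Hypothetical helper that sums usage_metadata dicts across calls."""

    def __init__(self):
        self.totals = Counter()
        self.calls = 0

    def add(self, usage_metadata):
        # Skip responses whose provider did not report usage.
        if not usage_metadata:
            return
        for key in STANDARD_KEYS:
            self.totals[key] += usage_metadata.get(key, 0)
        self.calls += 1


tally = UsageTally()
tally.add({"input_tokens": 8, "output_tokens": 9, "total_tokens": 17})
tally.add({"input_tokens": 8, "output_tokens": 12, "total_tokens": 20})
tally.add(None)  # a response without usage information is ignored

print(tally.calls)         # 2
print(dict(tally.totals))
# {'input_tokens': 16, 'output_tokens': 21, 'total_tokens': 37}
```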


```{=mdx}
:::note
Cost information is currently not available in streaming mode. This is because model names are not propagated through chunks in streaming mode, and the model name is used to look up the correct pricing. Token counts, however, are available:
:::
```

In [11]:
with get_openai_callback() as cb:
    for chunk in llm.stream("Tell me a joke", stream_options={"include_usage": True}):
        pass
    print(cb.total_tokens)

28


If you use a chain or agent with multiple steps, the callback tracks all of those steps.

In [12]:
from langchain.agents import AgentExecutor, create_tool_calling_agent, load_tools
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You're a helpful assistant"),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)
tools = load_tools(["wikipedia"])
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, stream_runnable=False
)

```{=mdx}
:::note
We have to set `stream_runnable=False` to get cost information, because cost is not available in streaming mode. By default, the `AgentExecutor` streams the underlying agent so that you get the most granular results when streaming events via `AgentExecutor.astream_events`.
:::
```

In [13]:
with get_openai_callback() as cb:
    response = agent_executor.invoke(
        {
            "input": "What's a hummingbird's scientific name and what's the fastest bird species?"
        }
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")



> Entering new AgentExecutor chain...

Invoking: `wikipedia` with `{'query': 'hummingbird scientific name'}`


Page: Hummingbird
Summary: Hummingbirds are birds native to the Americas and comprise the biological family Trochilidae. With approximately 366 species and 113 genera, they occur from Alaska to Tierra del Fuego, but most species are found in Central and South America. As of 2024, 21 hummingbird species are listed as endangered or critically endangered, with numerous species declining in population.
Hummingbirds have varied specialized characteristics to enable rapid, maneuverable flight: exceptional metabolic capacity, adaptations to high altitude, sensitive visual and communication abilities, and long-distance migration in some species. Among all birds, male hummingbirds have the widest diversity of plumage color, particularly in blues, greens, and purples. Hummingbirds are the smallest mature birds, measuring 7.5–13 cm (3–5 in) in leng

### Bedrock Anthropic

The `get_bedrock_anthropic_callback` context manager works very similarly:

In [12]:
# !pip install langchain-aws
from langchain_aws import ChatBedrock
from langchain_community.callbacks.manager import get_bedrock_anthropic_callback

llm = ChatBedrock(model_id="anthropic.claude-v2")

with get_bedrock_anthropic_callback() as cb:
    result = llm.invoke("Tell me a joke")
    result2 = llm.invoke("Tell me a joke")
    print(cb)

Tokens Used: 96
	Prompt Tokens: 26
	Completion Tokens: 70
Successful Requests: 2
Total Cost (USD): $0.001888


## Next steps

You've now seen a few examples of how to track token usage for supported providers.

Next, check out the other how-to guides on chat models in this section, like [how to get a model to return structured output](/docs/how_to/structured_output) or [how to add caching to your chat models](/docs/how_to/chat_model_caching).