```{contents}
```
## Token Streaming


**Token streaming** is the ability of an LLM to **emit tokens incrementally as they are generated**, instead of waiting for the full response to complete.
This enables **real-time output**, similar to how ChatGPT types responses word by word.

In LangChain, token streaming is exposed through the Runnable interface: `stream()` for synchronous code and `astream()` for asynchronous code.

```
Prompt
  ↓
LLM
  ↓
Token → Token → Token → Done
```

---

### Why Token Streaming Matters

Without streaming:

* User waits for the entire response
* High perceived latency
* Poor UX for long answers

With streaming:

* Immediate feedback
* Better responsiveness
* Ideal for chat UIs, copilots, terminals

---

### How Token Streaming Works Internally

1. LLM starts generating tokens
2. Each token is yielded as soon as it’s ready
3. Client consumes tokens in a loop
4. Final completion signal is sent

This is **push-based generation**: the model emits each token as it is produced, rather than returning one batch at the end.
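
The four steps above can be sketched with a plain Python generator. This is only an illustration under stated assumptions: `fake_llm_stream` is a hypothetical stand-in for a model, and its token list is hard-coded rather than generated.

```python
from typing import Iterator


def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token at a time."""
    # Step 1: "generation" starts (the tokens are hard-coded for this sketch).
    for token in ["Token", " streaming", " is", " incremental", "."]:
        # Step 2: each token is handed over as soon as it is ready.
        yield token
    # Step 4: returning from the generator acts as the completion signal.


received = []
# Step 3: the client consumes tokens in a loop.
for token in fake_llm_stream("Explain token streaming"):
    print(token, end="", flush=True)
    received.append(token)
print()
```

The consumer never sees the whole response at once; it only sees the next token, and the loop ends when the generator is exhausted.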

---

### Architecture View

![Image](https://miro.medium.com/0%2AkNsNAIa_9n4z0qGU)

![Image](https://miro.medium.com/1%2A-l2jXGB7FIv2-bRiEoaoOQ.png)

![Image](https://langtail-web.vercel.app/images/blog/token-flow.png)

---

### Basic Token Streaming (LLM Only)



```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(streaming=True)

for chunk in llm.stream("Explain token streaming in simple terms"):
    print(chunk.content, end="", flush=True)
```

Output:

Token streaming is a method used to transfer small amounts of data or information, piece by piece, in a continuous flow. Instead of transferring all the data at once, tokens or small packets of information are sent one after the other, allowing for a more efficient and continuous transfer of data. This method is often used in online media streaming services, where content like videos or music is broken down into smaller chunks (tokens) and streamed to the user in real-time, ensuring a smooth and uninterrupted playback experience.



**What happens**

* Tokens arrive one by one
* `chunk.content` contains partial text
* Output appears immediately
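
Because each chunk carries only a fragment, the full response has to be rebuilt by the consumer if it is needed later. The sketch below shows one way to do that. `FakeChunk` and `fake_stream` are hypothetical stand-ins for a streamed message chunk and a model, used so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed message chunk."""
    content: str


def fake_stream(prompt: str) -> Iterator[FakeChunk]:
    """Hypothetical stand-in for a model: yields hard-coded fragments."""
    for piece in ["Stream", "ing ", "works", "."]:
        yield FakeChunk(content=piece)


parts = []
for chunk in fake_stream("hi"):
    # Show the partial text immediately...
    print(chunk.content, end="", flush=True)
    # ...and keep it so the full response can be rebuilt afterwards.
    parts.append(chunk.content)
print()

full_text = "".join(parts)
```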

---

### Token Streaming in a RunnableSequence



```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Explain the following topic briefly:\n{topic}"
)

# Wrap the raw string into the dict the prompt template expects
chain = (
    RunnableLambda(lambda x: {"topic": x})
    | prompt
    | ChatOpenAI(streaming=True)
)

for chunk in chain.stream("Token streaming"):
    print(chunk.content, end="", flush=True)
```

Output:

Token streaming refers to the process of dividing a piece of content, such as a video or audio file, into smaller segments called tokens. These tokens can then be accessed and streamed individually, allowing for faster and more efficient playback, especially in cases where internet connections or device capabilities may limit the ability to stream larger files continuously. By breaking down the content into tokens, users can access smaller chunks of data at a time, reducing loading times and improving overall streaming performance.



Key point:

* **Only the final LLM step streams**
* Earlier steps (the lambda and the prompt template) run to completion before the first token is emitted
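
The same behaviour can be sketched without LangChain. In this illustration, all function names are hypothetical, and `fake_llm_stream` simply echoes the last line of the prompt word by word in place of a real model.

```python
from typing import Iterator


def to_inputs(topic: str) -> dict:
    """Non-streaming step: wraps the raw string in a dict."""
    return {"topic": topic}


def format_prompt(inputs: dict) -> str:
    """Non-streaming step: fills in the prompt template."""
    return f"Explain the following topic briefly:\n{inputs['topic']}"


def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Hypothetical model: echoes the last prompt line one word at a time."""
    for word in prompt.splitlines()[-1].split():
        yield word + " "


def stream_chain(topic: str) -> Iterator[str]:
    # The earlier steps run to completion and return ordinary values.
    prompt = format_prompt(to_inputs(topic))
    # Only the final step produces output incrementally.
    yield from fake_llm_stream(prompt)


chunks = list(stream_chain("Token streaming"))
print(chunks)
```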

---

### Async Token Streaming (`astream`)



```python
import asyncio

# Reuses `llm` from the first example
async def run():
    async for chunk in llm.astream("Explain async token streaming"):
        print(chunk.content, end="", flush=True)

# Top-level await works in a notebook; in a script, use asyncio.run(run())
await run()
```

Output:

Async token streaming enables the data to be sent and received in a non-blocking manner. This means that the data is transmitted in small chunks (tokens) asynchronously, allowing the sending and receiving processes to continue with other tasks while the data is being streamed. This can improve performance and efficiency by reducing the need to wait for the entire data set to be transmitted before processing begins. Additionally, async token streaming can help handle large amounts of data more effectively, as it allows for the continuous flow of data without overloading the system.



Async streaming is commonly used with:

* FastAPI
* WebSockets
* SSE endpoints
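
The reason async streaming suits servers is that one event loop can serve several streams at once. The sketch below illustrates this with two concurrent consumers. It assumes a hypothetical `fake_astream` that simulates model latency with `asyncio.sleep`; no real model is called.

```python
import asyncio
from typing import AsyncIterator


async def fake_astream(prompt: str, delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical async model: yields the prompt's words with a simulated delay."""
    for token in prompt.split():
        # Awaiting here hands control back to the event loop,
        # so other streams can make progress in the meantime.
        await asyncio.sleep(delay)
        yield token


async def consume(name: str, prompt: str, log: list) -> None:
    async for token in fake_astream(prompt):
        log.append((name, token))


async def main() -> list:
    log: list = []
    # Two streams are consumed concurrently on a single event loop.
    await asyncio.gather(
        consume("a", "one two three", log),
        consume("b", "uno dos tres", log),
    )
    return log


log = asyncio.run(main())
print(log)
```

Each stream keeps its own token order, while the entries from the two streams are typically interleaved in `log`.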

---

### Token Streaming with FastAPI (Real-World)

```python
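from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(streaming=True)

@app.get("/chat")
def chat(q: str):
    # Sync generator: Starlette iterates it in a worker thread,
    # so the event loop is not blocked while waiting for tokens.
    def token_generator():
        for chunk in llm.stream(q):
            yield chunk.content

    return StreamingResponse(token_generator(), media_type="text/plain")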
```

This enables:

* Browser-based streaming
* Chat-like experience
* Non-blocking responses
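
For browser clients that use `EventSource`, the chunks need Server-Sent Events framing rather than plain text. The helper below is a minimal sketch of that framing. The name `to_sse` and the final `done` event with a `[DONE]` payload are assumptions made for illustration, not part of LangChain or FastAPI.

```python
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Hypothetical helper: wraps text chunks in Server-Sent Events framing."""
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks, which carry no text
        # Each line of the payload gets its own "data:" field;
        # a blank line terminates the event.
        lines = chunk.split("\n")
        yield "".join(f"data: {line}\n" for line in lines) + "\n"
    # Assumed convention for this sketch: a named event marks the end of the stream.
    yield "event: done\ndata: [DONE]\n\n"


events = list(to_sse(["Hello", "", " wor\nld"]))
for event in events:
    print(repr(event))
```

In an endpoint, the wrapped generator would be passed to `StreamingResponse` with `media_type="text/event-stream"`.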

---

### Token Streaming vs Normal Invocation

| Aspect               | Token Streaming   | Normal Invoke               |
| -------------------- | ----------------- | --------------------------- |
| Output               | Incremental       | Full response               |
| Time to first output | Low (first token) | High (full generation time) |
| UX                   | Real-time         | Delayed                     |
| Complexity           | Slightly higher   | Simple                      |
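
The latency difference is about when the first output arrives, not about total generation time. The sketch below measures both for a simulated model. `fake_llm_stream` and its delay are hypothetical, so the numbers only illustrate the shape of the effect.

```python
import time
from typing import Iterator


def fake_llm_stream(n_tokens: int = 5, delay: float = 0.02) -> Iterator[str]:
    """Hypothetical model: each token takes `delay` seconds to produce."""
    for i in range(n_tokens):
        time.sleep(delay)  # simulated generation time per token
        yield f"tok{i} "


start = time.perf_counter()
first_token_at = None
pieces = []

for token in fake_llm_stream():
    if first_token_at is None:
        # With streaming, the user sees output at this point.
        first_token_at = time.perf_counter() - start
    pieces.append(token)

# Without streaming, the user would see nothing until this point.
total = time.perf_counter() - start

print(f"time to first token: {first_token_at:.3f}s, total: {total:.3f}s")
```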

---

### What Can Stream Tokens

| Component        | Token Streaming |
| ---------------- | --------------- |
| LLMs             | ✅               |
| RunnableSequence | ✅ (via LLM)     |
| RunnableParallel | ✅ (per branch)  |
| Retriever        | ❌               |
| Prompt templates | ❌               |
| RunnableLambda   | ❌               |

Token streaming only occurs at **token-producing components**.

---

### Common Mistakes

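* Setting `streaming=True` but calling `invoke()`, which still returns the full response; call `stream()` or `astream()` to receive chunks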
* Printing `chunk` instead of `chunk.content`
* Expecting non-LLM steps to stream
* Blocking the event loop in async servers
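
On the last point: when an async API such as `astream` is available, it is usually the simpler choice. If only a blocking generator is available, one option is to advance it in a worker thread. The sketch below assumes Python 3.9+ for `asyncio.to_thread`; `blocking_stream` and `iterate_in_thread` are hypothetical names.

```python
import asyncio
import time
from typing import AsyncIterator, Iterator


def blocking_stream(prompt: str) -> Iterator[str]:
    """Hypothetical sync model: blocks the calling thread between tokens."""
    for token in prompt.split():
        time.sleep(0.01)  # blocking wait; would stall the event loop if run on it
        yield token


_DONE = object()  # sentinel returned by next() when the iterator is exhausted


async def iterate_in_thread(it: Iterator[str]) -> AsyncIterator[str]:
    """Advance a blocking iterator in a worker thread, one item at a time."""
    while True:
        item = await asyncio.to_thread(next, it, _DONE)
        if item is _DONE:
            break
        yield item


async def main() -> list:
    return [tok async for tok in iterate_in_thread(blocking_stream("a b c"))]


tokens = asyncio.run(main())
print(tokens)
```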

---

### When to Use Token Streaming

Use it when:

* Building chat applications
* Responses are long
* UX responsiveness matters
* Users expect live feedback

Avoid it when:

* Running batch or offline jobs
* The full response must be post-processed before it is shown
* The pipeline only logs output

---

### Mental Model

Token streaming turns:

```
invoke() → full text
```

into:

```
stream() → token → token → token
```

---

### Key Takeaways

* Token streaming emits output **incrementally**
* Consumed via `stream()` / `astream()`; `streaming=True` alone does not change what `invoke()` returns
* Works through sequences and parallel graphs
* Essential for production chat UX