```{contents}
```
## Streaming Runnables

**Streaming Runnables** allow a LangChain runnable to **emit partial outputs incrementally** instead of waiting for the full result.
This is essential for **token-by-token LLM responses**, real-time UIs, and low-latency user experience.

LangChain supports this natively through the **Runnable interface**: every runnable exposes `.stream()` and its async counterpart `.astream()`.

```
Input
  ↓
Runnable (LLM)
  ↓
Token Stream → UI / Client
```

---

### Why Streaming Is Important

Without streaming:

* User waits for full response
* High perceived latency
* Poor UX for long answers

With streaming:

* Tokens arrive immediately
* Faster feedback
* ChatGPT-like experience

---

### How Streaming Works Internally

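1. The runnable is invoked in **streaming mode** (`.stream()` or `.astream()` instead of `.invoke()`)
2. The LLM emits tokens/chunks as it generates them
3. Each chunk is yielded to the caller immediately
4. The client consumes the stream, one chunk at a time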

```
LLM generates → token → token → token → done
```
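The four steps above can be sketched with a plain Python generator. This is only an illustration of the yield-per-chunk idea, not LangChain code: `fake_llm_stream` and `consume` are hypothetical names, and a real model would yield message chunks rather than whitespace-split strings.

```python
from typing import Iterator


def fake_llm_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one piece of text at a time."""
    for piece in text.split(" "):
        yield piece + " "


def consume(stream: Iterator[str]) -> str:
    """Client side: handle each chunk on arrival and keep the running text."""
    received = []
    for chunk in stream:
        print(chunk, end="", flush=True)  # show the chunk immediately
        received.append(chunk)
    return "".join(received)


full_text = consume(fake_llm_stream("tokens arrive one at a time"))
```

Because the generator is lazy, nothing is produced until the consumer asks for the next chunk, which is what lets the client start rendering before generation has finished.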

---

### Architecture View

![Image](https://dz2cdn1.dzone.com/storage/temp/18047285-screenshot-2024-11-18-at-121208pm.png)

![Image](https://miro.medium.com/0%2AkNsNAIa_9n4z0qGU)

![Image](https://langtail-web.vercel.app/images/blog/token-flow.png)


---

### Streaming a Runnable Directly (LLM Only)



```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(streaming=True)

# Print each chunk's text as soon as it arrives
for chunk in llm.stream("Explain streaming runnables"):
    print(chunk.content, end="", flush=True)
```

Sample output (truncated):

```
Streaming runnables are a concept in programming where a series of tasks or operations are executed concurrently in a continuous flow, or "stream". This can be achieved through the use of threads or other parallel processing techniques, allowing multiple runnables to be run simultaneously.

Streaming runnables are often used in applications that require processing large amounts of data or performing complex computations. By breaking down the tasks into smaller, independent runnables that can be executed concurrently, the overall performance and efficiency of the application can be improved.

One common use case for streaming runnables is in streaming applications, such as video or audio streaming services, where data is continuously processed and transmitted in real-time. By using streaming runnables, the application can handle incoming data streams more efficiently and deliver a smoother and more seamless streaming experience for the user.

Overall, streaming runnables are a powerful 
```



**Behavior**

* Tokens are printed as they are generated
* No waiting for full response

---

### Streaming with RunnableSequence

Streaming works **end-to-end** through a sequence.



```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Explain this concept briefly:\n{topic}"
)

# Wrap the raw string in the dict the prompt expects, then pipe into the model
chain = (
    RunnableLambda(lambda x: {"topic": x})
    | prompt
    | ChatOpenAI(streaming=True)
)

for chunk in chain.stream("Runnable streaming"):
    print(chunk.content, end="", flush=True)
```

Sample output:

```
Runnable streaming refers to the ability to start a process or task that involves continuously receiving and processing data in real time. This typically involves running a program or script that can handle a continuous flow of data without interruption. Examples of such tasks include streaming video, audio, or sensor data, where the program needs to constantly process incoming data without waiting for it to complete before moving on to the next piece of data.
```



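Key points:

* Only the **final runnable** (the LLM) produces incremental chunks; calling `.stream()` on the chain is what requests them. In recent `langchain-core` releases, `.stream()` streams from a chat model whether or not `streaming=True` is set.
* Earlier steps execute normally: each runs to completion and passes its whole output to the next step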
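The behaviour described above can be sketched without LangChain. In this minimal illustration the non-streaming steps run once, and only the last step yields chunks. All names (`to_prompt`, `fake_model_stream`, `stream_pipeline`) are hypothetical, and the fake model simply echoes the last line of its prompt word by word.

```python
from typing import Any, Callable, Iterator, List


def to_prompt(topic: str) -> str:
    """Non-streaming step: runs once and returns its whole output."""
    return f"Explain this concept briefly:\n{topic}"


def fake_model_stream(prompt: str) -> Iterator[str]:
    """Hypothetical model: echoes the prompt's last line one word at a time."""
    for word in prompt.splitlines()[-1].split():
        yield word


def stream_pipeline(
    value: Any,
    steps: List[Callable[[Any], Any]],
    final_stream: Callable[[Any], Iterator[str]],
) -> Iterator[str]:
    """Run every earlier step to completion, then stream from the final one."""
    for step in steps:
        value = step(value)
    yield from final_stream(value)


chunks = list(stream_pipeline("Runnable streaming", [to_prompt], fake_model_stream))
```

The first chunk cannot appear until every earlier step has finished, so slow steps ahead of the model still add to the time before the first token.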

---

### Streaming with RunnableParallel



```python
from langchain_core.runnables import RunnableParallel
from langchain_openai import ChatOpenAI

# Two models receive the same input and stream side by side
parallel = RunnableParallel(
    short=ChatOpenAI(streaming=True, model="gpt-3.5-turbo"),
    detailed=ChatOpenAI(streaming=True, model="gpt-4"),
)

# Each item is a dict keyed by the branch that produced the chunk
for output in parallel.stream("Explain RAG"):
    print(output)
```

Sample output (truncated):

```
{'short': AIMessageChunk(content='', additional_kwargs={}, response_metadata={'model_provider': 'openai'}, id='lc_run--019b4bf2-a952-7823-a5b2-c1c277024024')}
{'short': AIMessageChunk(content='R', additional_kwargs={}, response_metadata={'model_provider': 'openai'}, id='lc_run--019b4bf2-a952-7823-a5b2-c1c277024024')}
{'short': AIMessageChunk(content='AG', additional_kwargs={}, response_metadata={'model_provider': 'openai'}, id='lc_run--019b4bf2-a952-7823-a5b2-c1c277024024')}
{'short': AIMessageChunk(content=' stands', additional_kwargs={}, response_metadata={'model_provider': 'openai'}, id='lc_run--019b4bf2-a952-7823-a5b2-c1c277024024')}
{'short': AIMessageChunk(content=' for', additional_kwargs={}, response_metadata={'model_provider': 'openai'}, id='lc_run--019b4bf2-a952-7823-a5b2-c1c277024024')}
{'short': AIMessageChunk(content=' Red', additional_kwargs={}, response_metadata={'model_provider': 'openai'}, id='lc_run--019b4bf2-a952-7823-a5b2-c1c277024024')}
{'short': AIMessageChunk(con
```


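Each branch:

* Streams independently of the others
* Arrives as **partial structured updates**: every yielded item is a dict keyed by branch name (`short`, `detailed`) that holds that branch's latest chunk, so the caller has to accumulate chunks per key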
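Since the updates arrive interleaved, a consumer usually keeps one running buffer per branch. The sketch below shows that accumulation with plain strings. `accumulate` is a hypothetical helper and the sample updates are made up for illustration; with real output you would read `.content` from each `AIMessageChunk` before concatenating.

```python
from collections import defaultdict
from typing import Dict, Iterable


def accumulate(updates: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge interleaved per-branch text pieces into one string per branch."""
    totals: Dict[str, str] = defaultdict(str)
    for update in updates:
        for branch, piece in update.items():
            totals[branch] += piece
    return dict(totals)


# Made-up updates in the same shape as the stream: one branch key per item
updates = [
    {"short": "R"},
    {"detailed": "Retrieval"},
    {"short": "AG"},
    {"detailed": "-augmented"},
]
merged = accumulate(updates)
```

A UI can apply the same idea incrementally, appending each piece to the panel that matches its branch key as soon as it arrives.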

---

### Streaming Events with `.astream_events()` (Advanced)



```python
# Top-level `async for` works in a notebook; in a script, wrap it in an
# `async def` and run that with asyncio.run().
# Newer langchain-core releases also accept version="v2".
async for event in chain.astream_events(
    "Explain streaming",
    version="v1"
):
    print(event)
```

Sample output (truncated):

```
{'event': 'on_chain_start', 'run_id': '019b4bf3-798e-7221-a04c-bffd3c7e4968', 'name': 'RunnableSequence', 'tags': [], 'metadata': {}, 'data': {'input': 'Explain streaming'}, 'parent_ids': []}
{'event': 'on_chain_start', 'name': 'RunnableLambda', 'run_id': '019b4bf3-79ce-7172-8e18-9d27f71fe613', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {}, 'parent_ids': []}
{'event': 'on_chain_stream', 'name': 'RunnableLambda', 'run_id': '019b4bf3-79ce-7172-8e18-9d27f71fe613', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {'chunk': {'topic': 'Explain streaming'}}, 'parent_ids': []}
{'event': 'on_chain_end', 'name': 'RunnableLambda', 'run_id': '019b4bf3-79ce-7172-8e18-9d27f71fe613', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {'input': 'Explain streaming', 'output': {'topic': 'Explain streaming'}}, 'parent_ids': []}
{'event': 'on_prompt_start', 'name': 'ChatPromptTemplate', 'run_id': '019b4bf3-79d5-76d1-8f1d-722924b65a3f', 'tags': ['seq:step:2'], 'metadata': {}, 'data': {'input': {'topic': '
```



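Event types include:

* `on_chain_start`, `on_chain_stream`, `on_chain_end`
* `on_prompt_start`
* `on_chat_model_stream` for chat models such as `ChatOpenAI` (`on_llm_stream` for plain text-completion LLMs)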

Used for:

* Observability
* Tracing
* UI synchronization
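For observability or UI work you rarely need every event, so a common first step is to filter the stream by event type. The helper below is a minimal sketch. `select_events` is a hypothetical name, and the sample events are trimmed copies of the dict shape shown in the output above rather than a full event schema.

```python
from typing import Any, Dict, Iterable, Iterator, Set

Event = Dict[str, Any]


def select_events(events: Iterable[Event], wanted: Set[str]) -> Iterator[Event]:
    """Yield only the events whose 'event' field is one of the wanted types."""
    for event in events:
        if event.get("event") in wanted:
            yield event


# Trimmed sample events, shaped like the printed output above
sample_events = [
    {"event": "on_chain_start", "name": "RunnableSequence",
     "data": {"input": "Explain streaming"}},
    {"event": "on_chain_stream", "name": "RunnableLambda",
     "data": {"chunk": {"topic": "Explain streaming"}}},
    {"event": "on_chain_end", "name": "RunnableLambda",
     "data": {"output": {"topic": "Explain streaming"}}},
]

stream_only = list(select_events(sample_events, {"on_chain_stream"}))
```

With a live chain the same check would sit inside the `async for` loop, testing `event["event"]` before deciding whether to log the event or forward it to the UI.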

---

### Streaming + FastAPI (Real-World Pattern)

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
def chat(q: str):
    def token_stream():
        for chunk in chain.stream(q):
            yield chunk.content
    return StreamingResponse(token_stream(), media_type="text/plain")
```

This enables:

* Browser streaming
* SSE-like behavior
* Chat UIs
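The endpoint above sends raw text. If the client uses the browser's `EventSource` API, each chunk needs Server-Sent Events framing and the response should use the `text/event-stream` media type instead. The generator below is a sketch of that framing only. `to_sse` is a hypothetical helper, and the closing `done` event is an assumed convention of this example, not something the SSE format or LangChain requires.

```python
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Frame each text chunk as one Server-Sent Events message."""
    for chunk in chunks:
        # A payload containing newlines needs one `data:` field per line
        fields = "".join(f"data: {line}\n" for line in chunk.split("\n"))
        yield fields + "\n"  # a blank line terminates the message
    # Assumed convention: a named event telling the client the stream is over
    yield "event: done\ndata: \n\n"


frames = list(to_sse(["Hi", "a\nb"]))
```

In the FastAPI handler this would wrap the token generator, for example `StreamingResponse(to_sse(token_stream()), media_type="text/event-stream")`.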

---

### What Can and Cannot Stream

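| Component          | Streaming                  |
| ------------------ | -------------------------- |
| LLMs / chat models | ✅ (token chunks)           |
| Prompt templates   | ❌ (one final value)        |
| RunnableLambda     | ❌ (one final value)        |
| Retriever          | ❌ (one final value)        |
| RunnableSequence   | ✅ (if final step streams)  |
| RunnableParallel   | ✅ (per branch)             |

Streaming happens at **token-producing nodes**. Calling `.stream()` on a component that cannot stream still works, but it yields its whole output as a single chunk.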

---

### Streaming vs Non-Streaming Execution

| Aspect               | Streaming                 | Non-Streaming              |
| -------------------- | ------------------------- | -------------------------- |
| Time to first output | Low (first token)         | High (full response)       |
| UX                   | Real-time                 | Delayed                    |
| Complexity           | Slightly higher           | Simple                     |
| Use case             | Chat / UI                 | Batch / backend            |

---

### Common Use Cases

* Chat applications
* Copilot-style assistants
* Live dashboards
* Long-form generation
* Agent reasoning display

---

### Common Mistakes

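* Setting `streaming=True` but still calling `.invoke()`: chunks are only yielded to your loop by `.stream()` / `.astream()`
* Expecting non-LLM steps to stream
* Not handling async streams correctly (`.astream()` and `.astream_events()` must be consumed with `async for`)
* Printing `chunk` instead of `chunk.content`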
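The async point is the one that most often fails silently, so here is a minimal sketch of consuming an async stream correctly. `fake_astream` is a hypothetical stand-in for a runnable's `.astream()`. The pattern to note is `async for` inside a coroutine that is driven by `asyncio.run`.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_astream(text: str) -> AsyncIterator[str]:
    """Hypothetical async stream: yields one word per iteration."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def collect(stream: AsyncIterator[str]) -> List[str]:
    """An async generator must be consumed with `async for`, not a plain `for`."""
    return [chunk async for chunk in stream]


words = asyncio.run(collect(fake_astream("async streams need async for")))
```

A plain `for` loop over an async generator raises a `TypeError`. Inside an already running event loop, such as a notebook or a FastAPI handler, use `await collect(...)` instead of `asyncio.run`.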

---

### Mental Model

Streaming Runnables turn:

```
invoke() → result
```

into:

```
stream() → chunk → chunk → chunk
```

---

### Key Takeaways

* Streaming Runnables emit **incremental output**
* Chunks originate at the **LLM level** and are requested with `.stream()` / `.astream()`
* Works through sequences and parallel graphs
* Essential for responsive, chat-style UX in production