```{contents}
```
## Batch Execution

**Batch execution** runs the **same runnable pipeline on multiple inputs at once** instead of invoking it repeatedly in a loop.
It is natively supported by runnables in LangChain via `.batch()` / `.abatch()`.

```
[input1, input2, input3]
          ↓
      Runnable
          ↓
[output1, output2, output3]
```

Each input is processed **independently**; by default the runnable works through the inputs concurrently rather than one after another.

---

### Why Batch Execution Is Needed

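Batch execution:

* Reduces total wall-clock time by running inputs concurrently (the default `.batch()` still makes one call per input; it only makes fewer calls when a runnable overrides `.batch()` to use a native batch API)
* Improves throughput
* Enables parallelism without hand-written threading code
* Is well suited to embeddings, LLM calls, and retrieval, which are I/O-bound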

Instead of:

```python
for x in inputs:
    chain.invoke(x)
```

You do:

```python
chain.batch(inputs)
```

---

### How Batch Execution Works Internally

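1. Inputs are collected into a list
2. By default, the runnable calls `invoke` on each input concurrently in a thread pool (`.abatch()` does the same with `ainvoke` and asyncio tasks); a runnable may override this with a more efficient implementation
3. Results are returned **in the same order** as the inputs

Important:

* **No shared state between inputs**
* By default, an exception raised for any input propagates out of `.batch()`; pass `return_exceptions=True` to confine a failure to that input's slot in the result list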
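The steps above can be illustrated with the standard library alone. This is a minimal sketch of the idea, not LangChain's actual implementation; `batch_sketch` and its `max_workers` parameter are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batch_sketch(func: Callable[[T], R], inputs: List[T], max_workers: int = 4) -> List[R]:
    """Run func on every input concurrently and return results in input order."""
    if not inputs:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(func, inputs))


results = batch_sketch(str.upper, ["rag", "agents", "memory"])
print(results)  # ['RAG', 'AGENTS', 'MEMORY']
```

Because `Executor.map` preserves input order, the caller can pair each result with its input by position, which is the same guarantee `.batch()` gives.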

---

### Basic Batch Execution (RunnableLambda)



```python
from langchain_core.runnables import RunnableLambda

to_upper = RunnableLambda(lambda x: x.upper())

to_upper.batch(["rag", "agents", "memory"])
```



**Output**

```python
["RAG", "AGENTS", "MEMORY"]
```

---

### Batch Execution with RunnableSequence



```python
from langchain_core.runnables import RunnableLambda

chain = (
    RunnableLambda(lambda x: x.strip())
    | RunnableLambda(lambda x: x.lower())
)

chain.batch([
    "  LangChain  ",
    "  RUNNABLES ",
    "  BATCH MODE "
])
```



**Output**

```python
["langchain", "runnables", "batch mode"]
```

Each input flows through the **entire sequence independently**.

---

### Batch Execution with LLMs (Very Common)



```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()

llm.batch([
    "Explain RAG",
    "Explain agents",
    "Explain memory"
])
```


**Output** (truncated)

```
[AIMessage(content='RAG stands for Red, Amber, Green and is a method used to prioritize tasks or projects based on their urgency and importance. \n\n- Red: Indicates tasks or projects that are critical and require immediate attention. These are high priority items that need to be addressed as soon as possible. \n- Amber: Indicates tasks or projects that are important but not as urgent as red items. These may require attention in the near future, but can be delayed momentarily without major consequences. \n- Green: Indicates tasks or projects that are not urgent and can be addressed when time allows. These are low priority items that can be completed once more critical tasks have been taken care of. \n\nBy using the RAG system, individuals or teams can easily identify which tasks or projects need immediate action and which can be deferred or completed at a later time. This helps to streamline workflow, prioritize work effectively and ensure that critical tasks are not overlooked.', addi
```



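Benefits:

* One call fans the prompts out concurrently (by default these are separate requests issued in parallel, not a single provider-side batch request)
* Lower total latency than sequential `.invoke()` calls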

---

### Batch Execution in RAG Pipelines



```python
# Setup for RAG batch example (FAISS requires the faiss-cpu package)
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import OpenAIEmbeddings

# Sample documents
docs = [
    Document(page_content="RAG (Retrieval Augmented Generation) combines retrieval with LLMs to answer questions using external knowledge."),
    Document(page_content="Vector memory stores embeddings of documents for semantic search and retrieval."),
    Document(page_content="LCEL (LangChain Expression Language) is a declarative way to compose chains using the pipe operator.")
]

# Create retriever
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Create prompt template
prompt = ChatPromptTemplate.from_template(
    "Answer the question using the context:\n\nContext: {context}\n\nQuestion: {question}"
)
```

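```python
# The dict fans each question out to both branches before the prompt is built
chain = (
    {
        "question": RunnableLambda(lambda x: x),  # pass the question through unchanged
        "context": retriever,  # retrieve documents for the question
    }
    | prompt
    | llm
)

questions = [
    "What is RAG?",
    "What is vector memory?",
    "What is LCEL?"
]

# One response per question, in the same order as `questions`
responses = chain.batch(questions)
responses
```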

**Output** (truncated)

```
[AIMessage(content='RAG (Retrieval Augmented Generation) combines retrieval with LLMs to answer questions using external knowledge.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 23, 'prompt_tokens': 180, 'total_tokens': 203, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-CpzJ118uxNeAILL4rLUKK2QGemwPH', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019b4bfc-e4e8-71e0-bffb-33f53f60d698-0', usage_metadata={'input_tokens': 180, 'output_tokens': 23, 'total_tokens': 203, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}),
 AIMessage(content='Vector memory stores embeddings of documents fo
```



Each question:

* Runs retrieval
* Builds prompt
* Calls LLM
* Returns its own answer

---

### Batch Execution with RunnableParallel



```python
from langchain_core.runnables import RunnableLambda, RunnableParallel

parallel = RunnableParallel(
    length=RunnableLambda(lambda x: len(x)),
    words=RunnableLambda(lambda x: len(x.split()))
)

parallel.batch([
    "Runnable batch execution",
    "Parallel runnables"
])
```



**Output**

```python
[
  {"length": 24, "words": 3},
  {"length": 18, "words": 2}
]
```

Each input produces its own **structured result**: a dict with one key per branch.

---

### Async Batch Execution (`abatch`)



```python
# Top-level `await` works in a notebook cell; `chain` is the RAG chain defined above
results = await chain.abatch([
    "Explain batching",
    "Explain streaming"
])
results
```

**Output** (truncated)

```
[AIMessage(content='Batching is a technique used in various machine learning tasks to process multiple inputs simultaneously in order to improve efficiency and speed. In the context of the provided documents, batching could refer to processing multiple embeddings of documents in vector memory, composing multiple chains using LCEL, or answering multiple questions using external knowledge and LLMs in a Retrieval Augmented Generation system. This approach allows for parallel processing of data and enables faster computation compared to processing inputs one by one.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 92, 'prompt_tokens': 177, 'total_tokens': 269, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint
```



Use when:

* Running in async frameworks
* Handling high concurrency

---

### Batch vs Streaming vs Parallel

| Concept   | Purpose         |
| --------- | --------------- |
| Batch     | Many inputs     |
| Streaming | Partial outputs |
| Parallel  | Many branches   |

They solve **different problems** and can be combined.

---

### Error Handling in Batch Execution

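By default:

* An exception raised for any input propagates out of `.batch()`, and no results are returned

To keep one failing input from discarding the rest, pass `return_exceptions=True`. Each failed input's slot in the result list then holds the exception object instead of a value:

```python
chain.batch(inputs, return_exceptions=True)
```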

---

### When to Use Batch Execution

Use it when:

* Processing multiple user queries
* Generating embeddings
* Running offline jobs
* Evaluations / benchmarks
* Indexing pipelines

Avoid it when:

* Each input depends on previous output
* You need conversational state

---

### Mental Model

Batch execution is **horizontal scaling** of a runnable:

```
Same pipeline × many inputs
```

---

### Key Takeaways

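* Batch execution runs **one runnable on many inputs**
* Uses `.batch()` / `.abatch()`
* Usually faster than a sequential loop for I/O-bound work, because inputs run concurrently
* Preserves input–output ordering
* Raises on the first failure unless `return_exceptions=True` is passed
* Well suited to offline jobs and high-volume production workloads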
