```{contents}
```
## Retry and Fallback

In LLM pipelines, **retry** and **fallback** are **resilience mechanisms** used to handle failures such as:

* Transient network errors
* Rate limits
* Timeouts
* Model unavailability

Both are built into LangChain runnables: any runnable exposes `.with_retry()` and `.with_fallbacks()`.

```
Request
  ↓
Primary Runnable
  ├─ success → return result
  └─ failure
        ├─ Retry (same runnable)
        └─ Fallback (alternative runnable)
```

---

### Retry vs Fallback (Conceptual Difference)

| Mechanism    | Purpose                    | Strategy                  |
| ------------ | -------------------------- | ------------------------- |
| **Retry**    | Handle transient failures  | Try again (same runnable) |
| **Fallback** | Handle persistent failures | Switch to backup runnable |

They are often **used together**.

---

### Retry


**Retry** automatically re-executes a runnable when it fails, based on:

* Max attempts
* Backoff strategy
* Exception type

Retry is ideal for **temporary issues**.
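
The three knobs above can be sketched in plain Python. This is an illustration of the general technique, not LangChain's implementation; `retry_call`, its parameters, and the 1-second jitter range are hypothetical choices made for this example.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    func: Callable[[], T],
    max_attempts: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, the attempt budget is spent,
    or an exception outside retry_on is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the last error
            # Exponential backoff (base, 2x base, 4x base, ...) plus up to 1s of jitter.
            delay = min(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1), max_delay)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")


# Usage: a function that fails twice, then succeeds.
calls = []


def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_call(flaky, sleep=delays.append)
```

Exceptions that are not listed in `retry_on` propagate immediately, which keeps deterministic failures from burning through the attempt budget.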

---

### Retry Demonstration (Runnable)



```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI().with_retry(
    stop_after_attempt=3,          # maximum total attempts, including the first call
    wait_exponential_jitter=True   # exponential backoff with random jitter between attempts
)

response = llm.invoke("Explain retry in LLMs")
print(response.content)
```

Example output (model-generated; wording varies between runs). Note that the model answers about re-running generation to improve output quality, which is a different sense of "retry" from the error-handling retry configured by `with_retry`:

```
Retry in LLMs (Large Language Models) refers to the process of re-running a failed inference or generation task multiple times in order to improve the model's output quality and accuracy. 

When an LLM encounters an error or produces a low-quality output during inference, the retry mechanism allows the model to try again with slightly different input data or parameters. This can help the model overcome obstacles such as sampling bias, noise in the data, or misinterpretation of context.

Retry in LLMs can be implemented in various ways, such as by adjusting the randomness in sampling, changing the temperature hyperparameter, or fine-tuning the model based on feedback from previous attempts. By incorporating retry mechanisms, LLMs can improve their performance and generate more accurate and reliable outputs.
```




**What happens**

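* If a call raises an exception, it is retried
* At most 3 attempts in total (the first call plus up to 2 retries)
* The wait between attempts grows exponentially, with random jitter added
* If every attempt fails, the exception is raised to the caller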

---

### Retry in a RunnableSequence

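```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("{question}")

chain = (
    prompt
    | ChatOpenAI().with_retry(stop_after_attempt=2)  # retry wraps the model step only
)

chain.invoke({"question": "What is retry logic?"})
```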

Retry applies **only to the runnable it is attached to**.

---

### Common Retry Use Cases

* API rate limits
* Temporary network drops
* Model overload
* Intermittent infra issues

---

### Fallback


**Fallback** defines **alternative runnables** to execute if the primary runnable fails.

Fallback is ideal for:

* Model outages
* Cost-aware degradation
* SLA guarantees

---

### Fallback Demonstration (Two Models)

```python
from langchain_openai import ChatOpenAI

primary_llm = ChatOpenAI(model="gpt-4")
backup_llm = ChatOpenAI(model="gpt-3.5-turbo")

llm_with_fallback = primary_llm.with_fallbacks([backup_llm])

response = llm_with_fallback.invoke("Explain fallback strategy")
print(response.content)
```

**Execution logic**

1. Try `gpt-4`
2. If it raises an exception → switch to `gpt-3.5-turbo`
3. Return the first successful response; if the backup also fails, the error is raised to the caller

---

### Multiple Fallbacks (Priority Order)

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4").with_fallbacks([
    ChatOpenAI(model="gpt-3.5-turbo"),
    ChatOpenAI(model="gpt-3.5-turbo-16k")
])
```

Fallbacks are tried **in order**.

---

### Retry + Fallback

#### Combined Demonstration



```python
from langchain_openai import ChatOpenAI

primary = ChatOpenAI(model="gpt-4").with_retry(
    stop_after_attempt=2
)

secondary = ChatOpenAI(model="gpt-3.5-turbo").with_retry(
    stop_after_attempt=2
)

llm = primary.with_fallbacks([secondary])

llm.invoke("Explain retry and fallback")
```

Output (truncated in the source):

```
AIMessage(content='"Retry" and "fallback" are terms used in computing to describe methods for handling issues when an operation does not succeed on the first attempt.\n\nRetry is a strategy where a system attempts the same operation again if it fails the first time. This can be useful in cases where the failure was due to a temporary issue (like a network glitch). For instance, when you send a request to a server and the server does not respond, the client will often retry the same request instead of immediately indicating a failure.\n\nFallback, on the other hand, is a strategy where a system will try a different approach if the primary method fails. This can be useful in cases where the primary method is known to have reliability issues or in cases where the failure was due to a more permanent issue with the primary method. For example, when a main server fails, the system may have a fallback server to use instead.\n\nBoth are considered as part of error handling and resilience desig
```



**Execution order**

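1. `gpt-4`, attempt 1
2. `gpt-4`, attempt 2 (the single retry allowed by `stop_after_attempt=2`)
3. If still failing → `gpt-3.5-turbo`, attempt 1
4. `gpt-3.5-turbo`, attempt 2
5. If that also fails, the error is raised to the caller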

---

### Retry + Fallback in RAG Pipelines



```python
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Create a sample retriever and prompt for demonstration
vectorstore = FAISS.from_texts(
    ["LangChain supports retry and fallback mechanisms for resilience."],
    embedding=OpenAIEmbeddings()
)
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question based on context:\nContext: {context}\nQuestion: {question}"
)
```

```python
chain = (
    {
        "question": RunnablePassthrough(),
        "context": retriever
    }
    | prompt
    # Retry and fallback wrap the generation step only
    | ChatOpenAI().with_retry()
        .with_fallbacks([ChatOpenAI(model="gpt-3.5-turbo")])
)
chain.invoke("How does LangChain ensure reliability?")
```

Output:

```
AIMessage(content='LangChain ensures reliability by supporting retry and fallback mechanisms for resilience.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 68, 'total_tokens': 81, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-Cpzgcz7q30LfaHuuei4EW1Q983cGq', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019b4c13-3494-7893-8f06-12404a6ae923-0', usage_metadata={'input_tokens': 68, 'output_tokens': 13, 'total_tokens': 81, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
```


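This ensures:

* Retrieval runs once and is not repeated when the model call is retried
* Generation is resilient: the model call is retried, then handed to the fallback model

The retriever itself is not covered here, so a retrieval failure still fails the chain.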

---

### Failure Scoping (Important)

#### Where Retry/Fallback Applies

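| Level    | Behavior                                                                                          |
| -------- | ------------------------------------------------------------------------------------------------- |
| Runnable | Only the wrapped step is retried or replaced                                                      |
| Sequence | If a step still fails after its retries and fallbacks, downstream steps are not executed          |
| Parallel | A failure in any branch fails the whole runnable                                                  |
| Batch    | Inputs are handled independently; pass `return_exceptions=True` to get errors back per input      |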

---

### Best Practices

#### When to Use Retry

Use retry when:

* Failures are transient
* Same request is safe to repeat
* You want transparent recovery

Avoid retry when:

* Failures are deterministic
* Calls are expensive or stateful

---

#### When to Use Fallback

Use fallback when:

* You have alternative models/tools
* SLA is critical
* Graceful degradation is acceptable

---

#### Common Mistakes

* Retrying indefinitely
* Retrying non-idempotent operations
* Not logging fallback usage
* Falling back on the first transient error without retrying the primary first

---

### Mental Model

* Retry = **“try again”**
* Fallback = **“try something else”**

Together they provide **fault tolerance**.

---

### Key Takeaways

* Retry handles **temporary failures**
* Fallback handles **persistent failures**
* Both are composable at runnable level
* Essential for production-grade LLM systems