# Adding Durability

In this section, we will do the following:
- Describe the concepts of durable execution
- Transform the previous agent into a Temporal Workflow
- Use Temporal tooling to manage the lifecycle of your agent

## SETUP NOTEBOOK

Run the following code blocks to install the packages and tools needed to run this notebook.

**Be sure to add your .env file again. It doesn't persist across notebooks or sessions.**

```
LLM_API_KEY = YOUR_API_KEY
LLM_MODEL = openai/gpt-4o
```

In [1]:
# allows us to run the Temporal Asyncio event loop within the event loop of Jupyter Notebooks
import nest_asyncio
nest_asyncio.apply()

In [2]:
# Create .env file
with open(".env", "w") as fh:
  fh.write("LLM_API_KEY = YOUR_API_KEY\nLLM_MODEL = openai/gpt-4o")

In [3]:
%pip install temporalio litellm reportlab python-dotenv


In [4]:
!curl -sSf https://temporal.download/cli.sh | sh

temporal: Downloading Temporal CLI latest
temporal: Temporal CLI installed at /root/.temporalio/bin/temporal
temporal: For convenience, we recommend adding it to your PATH
temporal: If using bash, run echo export PATH="\$PATH:/root/.temporalio/bin" >> ~/.bashrc


In [5]:
# Mermaid renderer, run at the beginning to set up rendering of diagrams
import base64
from IPython.display import Image, display

def render_mermaid(graph_definition):
    """
    Renders a Mermaid diagram in Google Colab using mermaid.ink.

    Args:
        graph_definition (str): The Mermaid diagram code (e.g., "graph LR; A-->B;").
    """
    graph_bytes = graph_definition.encode("ascii")
    base64_bytes = base64.b64encode(graph_bytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))

# Adding Durability

## What Can Go Wrong with AI Agents?

Let's brainstorm the issues you might face when running your research agent from Notebook 1 in production.

**Think about these categories:**
* **Technical failures:** What external services could fail?
* **Timing issues:** What if something takes longer than expected?
* **Recovery challenges:** If something breaks halfway through, what happens?

*Take 2 minutes to discuss with your neighbor, then we'll share answers*

## Common Issues with Agents in Production

**Common answers we typically hear:**
* LLM API timeouts or rate limiting
* PDF generation fails due to disk space
* Network connectivity issues
* Process crashes mid-execution
* Restarting burns money

## These Aren't New Problems

The challenges you just identified? They're the same problems we've been solving in distributed systems for decades:

**Your Research Agent in Production Reality:**
* **LLM API call** - External service that can timeout, rate limit, or be down.
* **PDF generation** - File system operation that can fail due to disk space
* **User input/output** - Network operations that can be interrupted

**This is a distributed system!** Your "simple" agent is actually:
* Multiple network calls to external services
* File system operations
* State that needs to persist across failures
* Coordination between different steps

**The good news:** We have battle-tested solutions for these problems.

**The challenge:** Traditional distributed systems tools weren't designed for AI workflows.

## AI Needs Durable Execution

Remember your research agent from Notebook 1? Here's what happens in production:

**Scenario:** User asks for research on "sustainable energy trends"
1. LLM call succeeds - generates great research content
2. PDF generation fails - disk full or permission error
3. **User has to start over completely**

## What Developers Actually Want

* "Just fix the disk issue and generate the PDF from the research you already got"
* "Don't make me pay for the same LLM call twice"
* "Don't lose my work because of a simple file system error"

## What Normal Execution Gives Us

* Start from the beginning every time
* Lose all intermediate results
* No memory of what succeeded vs what failed

## What We Need

A way to make our AI agents resilient to these failures.

## This is Durable Execution

<!-- This is a big slide in the middle with only a title for effect -->

## What Is Durable Execution?

* Crash-proof execution
* Retries upon failure
* Maintains application state, resuming after a crash at the point of failure
* Can run across a multitude of processes, even on different machines
  * Virtualizes execution

## Durable Execution Requirements

Temporal relies on a Replay mechanism to recover from failure.
As your program progresses, Temporal saves the input and output from function calls to the history.
This allows a failed program to restart right where it left off.

**Because of this, Temporal requires your workflow to be deterministic**

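A Workflow is deterministic if every execution, given the same input, produces the same sequence of Commands (such as scheduling an Activity or starting a Timer) and therefore the same output.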
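
To make the Replay idea concrete, here is a minimal sketch in plain Python. It is not how Temporal is implemented; `History` and `run_step` are hypothetical names used only for illustration. Each step's result is appended to a history list, and when the program is re-run against that history, recorded results are returned instead of calling the function again.

```python
from typing import Any, Callable, List, Optional


class History:
    """Hypothetical stand-in for an event history: an ordered list of step results."""

    def __init__(self, events: Optional[List[Any]] = None) -> None:
        self.events: List[Any] = list(events or [])
        self.position = 0  # index of the next step to replay

    def run_step(self, fn: Callable[[], Any]) -> Any:
        # If this step already has a recorded result, reuse it instead of re-running.
        if self.position < len(self.events):
            result = self.events[self.position]
        else:
            result = fn()
            self.events.append(result)
        self.position += 1
        return result


calls = {"research": 0, "pdf": 0}


def research() -> str:
    calls["research"] += 1
    return "tardigrade facts"


def make_pdf() -> str:
    calls["pdf"] += 1
    return "research_pdf.pdf"


# First attempt: the research step completes, then the process "crashes".
first = History()
first.run_step(research)

# Second attempt: replay from the saved events. research() is not called again.
second = History(first.events)
content = second.run_step(research)
filename = second.run_step(make_pdf)
```

The replay only lines up if the steps are requested in the same order on every run, which is why the Workflow code itself has to be deterministic.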

## Temporal Provides Durable Execution

* Open-Source MIT Licensed
* Code based approach to Workflow design
* Use your own tools, processes, and libraries
* Support for 7 languages
  * Python, TypeScript, Ruby, Java, Go, PHP, .NET

## Demo

<!--
1. To demonstrate the power of durable execution, first run the app without durable execution.
2. From the normal directory, run `app.py`. Follow the README instructions on how to do so.
3. When prompted in the CLI, provide the research topic you want OpenAI to research.
4. Before the process generates a PDF, kill the process.
5. Rerun `app.py` and show that the process restarted and your agent has to start the research again. Emphasize that this can be very costly, because you may have to re-spend many tokens to get back to where you left off.
6. Now show the durable version by running the Worker and Workflow from the `durable` directory. Follow the README instructions on how to do so.
7. When prompted in the CLI, provide the research topic you want OpenAI to research.
8. Before the process generates a PDF, kill the Worker.
9. Rerun the Worker and show that you continue right where you left off.
10. Emphasize that you lost no progress or data. The Workflow will continue by generating the PDF (available in the `durable` directory) and completing the process successfully.
11. Show the Workflow Execution completion in the Web UI.
-->

## _Wait, how can AI code be deterministic?_

Your **workflow** needs to be deterministic, not the entire application.

The key to understanding this is to separate your application's repeatable (deterministic) and non-repeatable (non-deterministic) parts.

1. **Deterministic parts** - Execute the same way when re-run with the same input
  * Ex: Branching, looping, mathematical operations, etc.
2. **Non-deterministic parts** - Run arbitrary code that has the potential to fail due to external conditions
  * Ex: Calling LLMs, accessing the file system, writing to a database

## Consider the Following Example

* Depending on the time of day, a different decision is made
* If it's 5:00pm, it's dinner time
* If it's 9:30am, it's breakfast time

**What would happen if a user ran this application at 11:59am, it crashed and was replayed at 12:01pm? What would the user expect?**

In [6]:
diagram = """
graph TD
    A["Get Current Time"] --> B["Is am or pm?"]
    B --> C["Time for breakfast"]
    B --> D["Time for dinner"]
"""
render_mermaid(diagram)
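
The meal example can be sketched in plain Python. The names `choose_meal` and `recorded_hour` are hypothetical. Reading the clock is the non-deterministic part; the branch on the hour is deterministic. If the hour observed on the first run is recorded and reused on replay, the decision does not change, even though the wall clock has moved on. (Inside a Temporal Workflow, the Python SDK offers `workflow.now()` as a replay-safe way to read the time.)

```python
from datetime import datetime


def choose_meal(hour: int) -> str:
    """Deterministic: the same hour always yields the same answer."""
    return "Time for breakfast" if hour < 12 else "Time for dinner"


# First run at 11:59am: read the clock once and record what was observed.
first_run_time = datetime(2025, 1, 1, 11, 59)
recorded_hour = first_run_time.hour
first_decision = choose_meal(recorded_hour)

# Replay at 12:01pm: the wall clock has changed.
replay_time = datetime(2025, 1, 1, 12, 1)
naive_decision = choose_meal(replay_time.hour)  # re-reads the clock, flips the branch
replayed_decision = choose_meal(recorded_hour)  # reuses the recorded value
```

The naive version answers "dinner" after the crash even though the user was already told "breakfast"; the replayed version stays consistent with what happened the first time.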


## _What Does This Have to Do with AI?_

_Statement_: Since workflows need to be deterministic, your agents will follow the same path every time.

_Reality_: **This statement is 100% incorrect.**

### **Deterministic != predetermined**

## AI Agent Reality Check

**Common Fear:** "If my workflow is deterministic, my AI agent will always do the same thing."

**Reality:** Your agent can be completely dynamic while still being deterministic.

## AI Research Agent Examples

**Each run is completely different** (dynamic), but **each individual run is reproducible** (deterministic).

In [7]:
diagram = """
graph TD
    A["Ask AI for a Plan"] --> B{"Make Tea"}
    B --> C["Boil Water"]
    C --> D["Steep tea"]
    D --> E["Remove and enjoy"]
"""
render_mermaid(diagram)


In [8]:
diagram = """
graph TD
    A["Ask AI for a Plan"] --> B{"Slay a dragon"}
    B --> C["Find the weak spot"]
    C --> D["Acquire the correct weapon"]
    D --> E["Carry out your attack"]
"""
render_mermaid(diagram)

In [9]:
diagram = """
graph TD
    A["Ask AI for a Plan"] --> B{"Write Code"}
    B --> C["Locate files"]
    C --> D["Write code"]
    D --> E["Evaluate result"]
"""
render_mermaid(diagram)

## How Deterministic Workflows Are *Essential* for AI Workflows

* The steps of evaluating the goal, locating tools, executing tools, and evaluating whether the goal is complete **make up the Agentic loop**.
* Tools that the LLM decides to call become **dynamic**, not **non-deterministic**.
* **Deterministic, not predetermined**
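
A minimal sketch of such a loop, under stated assumptions: `fake_llm` is a hypothetical stand-in for a real model call, and `TOOLS` is an illustrative registry. The loop code is fixed, but the sequence of tools it runs is chosen at run time by the model's replies.

```python
from typing import Callable, Dict, List

# Illustrative tool registry; in a durable agent each tool would be an Activity.
TOOLS: Dict[str, Callable[[], str]] = {
    "boil_water": lambda: "water boiled",
    "steep_tea": lambda: "tea steeped",
}


def fake_llm(goal: str, completed: List[str]) -> str:
    """Hypothetical stand-in for an LLM: returns the next tool name or 'done'."""
    plan = ["boil_water", "steep_tea"] if goal == "make tea" else []
    remaining = [step for step in plan if step not in completed]
    return remaining[0] if remaining else "done"


def agent_loop(goal: str) -> List[str]:
    completed: List[str] = []
    results: List[str] = []
    while True:
        decision = fake_llm(goal, completed)  # dynamic: decided at run time
        if decision == "done":
            return results
        results.append(TOOLS[decision]())  # execute the chosen tool
        completed.append(decision)


tea_results = agent_loop("make tea")
other_results = agent_loop("unknown goal")
```

Different goals produce different paths through the same loop. When the model's replies are saved in history, replaying a given run walks the same path again.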

## Let's Make Your Agent Durable

We're about to transform your simple research agent into a durable one. Here's what changes:

* **Functions** → **Activities** (your tools become crash-proof)
* **Direct calls** → **Workflow coordination** (orchestrates activities safely)
* **Manual error handling** → **Automatic retries and recovery**

This results in a process such as:
LLM Decision → Tool A → Result X (Saved in history, then on replay, same result X will result in the same next decision) → Next Decision

## What stays the same

* Your core logic (LLM call → PDF generation)
* Your inputs and outputs
* Your business requirements

## What gets better

* Automatic retries when API calls fail, timeout, or rate-limit
* Resume exactly where you left off after crashes  
* Built-in observability and monitoring

## Package Our Inputs & Outputs for Ease of Management

For ease of use, evolution of parameters, and type checking, Temporal recommends passing and returning a single object from functions. `dataclass` is the recommended structure here, but anything serializable will work.

In [10]:
from dataclasses import dataclass

@dataclass
class LLMCallInput:
  prompt: str
  llm_api_key: str
  llm_model: str

@dataclass
class PDFGenerationInput:
  content: str
  filename: str = "research_pdf.pdf"
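
One reason a single input object helps: new fields can be added with default values without changing existing call sites. A small sketch, where `PDFGenerationInputV2` and its `title` field are hypothetical additions for illustration, and the JSON round trip only approximates what happens when inputs are serialized:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PDFGenerationInputV2:
    content: str
    filename: str = "research_pdf.pdf"
    title: str = "Research Report"  # hypothetical new field with a default


# An existing caller that only knows about `content` keeps working.
old_style = PDFGenerationInputV2(content="Tardigrades are tough.")

# A payload written before `title` existed can still be loaded.
old_payload = json.dumps({"content": "Tardigrades are tough.", "filename": "out.pdf"})
restored = PDFGenerationInputV2(**json.loads(old_payload))

# Serialize the current shape.
payload = json.dumps(asdict(old_style))
```

With positional arguments, the same change would force every caller to be updated at once.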

## Tasks/Tools become Activities

To turn a function/method into an Activity, add the `@activity.defn` decorator.

In [11]:
from temporalio import activity
from litellm import completion, ModelResponse

@activity.defn
def llm_call(input: LLMCallInput) -> ModelResponse:
    response = completion(
      model=input.llm_model,
      api_key=input.llm_api_key,
      messages=[{"content": input.prompt, "role": "user"}]
    )
    return response

In [12]:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch

@activity.defn
def create_pdf_activity(input: PDFGenerationInput) -> str:
    print("Creating PDF document...")

    doc = SimpleDocTemplate(input.filename, pagesize=letter)
    styles = getSampleStyleSheet()
    title_style = ParagraphStyle(
        'CustomTitle',
        parent=styles['Heading1'],
        fontSize=24,
        spaceAfter=30,
        alignment=1
    )

    story = []
    title = Paragraph("Research Report", title_style)
    story.append(title)
    story.append(Spacer(1, 20))
    paragraphs = input.content.split('\n\n')
    for para in paragraphs:
        if para.strip():
            p = Paragraph(para.strip(), styles['Normal'])
            story.append(p)
            story.append(Spacer(1, 12))

    doc.build(story)

    print(f"SUCCESS! PDF created: {input.filename}")
    return input.filename

## What is an Activity

* An Activity is a function/method that is prone to failure and/or non-deterministic.
* Temporal requires all non-deterministic code be run in an Activity
* Activities retry over and over until they succeed or until your customized retry or timeout configuration is hit.
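
To give a feel for what "retry over and over" looks like, here is a sketch that computes the wait before each retry under exponential backoff. The helper `retry_delays` is hypothetical. The default values used (1 second initial interval, backoff coefficient of 2.0, maximum interval of 100 times the initial interval) reflect my understanding of Temporal's default retry policy and should be checked against the documentation for your SDK version. In the Python SDK, these settings are configured with `temporalio.common.RetryPolicy`, passed as `retry_policy=` to `workflow.execute_activity`.

```python
from datetime import timedelta
from typing import List


def retry_delays(
    retries: int,
    initial_interval: timedelta = timedelta(seconds=1),
    backoff_coefficient: float = 2.0,
    maximum_interval: timedelta = timedelta(seconds=100),
) -> List[timedelta]:
    """Return the wait before each of the first `retries` retries."""
    delays = []
    for n in range(retries):
        # Each retry waits longer than the last, up to the cap.
        delay = initial_interval * (backoff_coefficient ** n)
        delays.append(min(delay, maximum_interval))
    return delays


delays_in_seconds = [d.total_seconds() for d in retry_delays(8)]
```

The growing delay gives a rate-limited or overloaded service time to recover instead of hammering it with immediate retries.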

## What Activities give you

* **Automatic retries** when the code or an external service fails
* **Timeout handling** for slow operations and detecting failures
* **Detailed visibility** of execution, including inputs/outputs for debugging
* **Automatic checkpoints** - if your Worker crashes, completed Activities aren't re-executed; the Workflow continues from the last known good state



## Your Code

**Your LLM call is now:**
* Protected against API timeouts
* Automatically retried with backoff
* Observable for debugging

**Your PDF generation is now:**
* Protected against file system errors
* Automatically retried on temporary failures
* Tracked for completion verification

## Activities Are Called from Workflows

You orchestrate the execution of your Activities from within a Workflow.

## More Input/Output Packaging

Just like with Activities, Temporal recommends passing a single object to the Workflow for input and returning a single object.

In [13]:
@dataclass
class GenerateReportInput:
    prompt: str

@dataclass
class GenerateReportOutput:
    result: str

## Load Environment Variables

Now is the time to load in your environment variables with your `LLM_API_KEY`

In [22]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)


# Get LLM_API_KEY environment variable
LLM_MODEL = os.getenv("LLM_MODEL", "openai/gpt-4o")
LLM_API_KEY = os.getenv("LLM_API_KEY", None)

## Creating the Workflow

* Activities are orchestrated within a Temporal Workflow.
* Workflows must **not** make API calls, file system calls, or anything non-deterministic. That is what Activities are for.
* Workflows are async, and you define them as a class decorated with the `@workflow.defn` decorator.
* Every Workflow has a **single** entry point, which is an `async` method decorated with `@workflow.run`.

In [15]:
import asyncio
from datetime import timedelta
import logging

from temporalio import workflow

# sandboxed=False is a Notebook only requirement. You normally don't do this
@workflow.defn(sandboxed=False)
class GenerateReportWorkflow:

    @workflow.run
    async def run(self, input: GenerateReportInput) -> GenerateReportOutput:

        llm_call_input = LLMCallInput(prompt=input.prompt, llm_api_key=LLM_API_KEY, llm_model=LLM_MODEL)

        research_facts = await workflow.execute_activity(
            llm_call,
            llm_call_input,
            start_to_close_timeout=timedelta(seconds=30),
        )

        workflow.logger.info("Research complete!")

        pdf_generation_input = PDFGenerationInput(content=research_facts["choices"][0]["message"]["content"])

        pdf_filename = await workflow.execute_activity(
            create_pdf_activity,
            pdf_generation_input,
            start_to_close_timeout=timedelta(seconds=10),
        )

        return GenerateReportOutput(result=f"Successfully created research report PDF: {pdf_filename}")

## Running a Worker

* Temporal Workflows are run on Workers
* Workers wait for tasks to do, such as executing an Activity or Workflow, and perform them
* Workers find tasks by polling a Task Queue
* Workers have Workflows and Activities registered to them so the Worker knows what it is allowed to execute
* This makes the execution of work indirect; _any_ Worker can pick up a registered Workflow or Activity
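
The indirection can be pictured with a toy in-memory queue. This is only an analogy built on `asyncio.Queue`; the names `toy_worker` and `registry` are hypothetical, and a real Worker polls the Temporal Service over the network. The requester puts a task name on the queue, and any worker that has that name registered may run it.

```python
import asyncio
from typing import Callable, Dict, List


async def toy_worker(
    name: str,
    queue: asyncio.Queue,
    registry: Dict[str, Callable[[str], str]],
    log: List[str],
) -> None:
    """Pull tasks off the shared queue and run the registered function."""
    while True:
        task_name, payload = await queue.get()
        result = registry[task_name](payload)
        log.append(f"{name} ran {task_name} -> {result}")
        queue.task_done()


async def main() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    registry = {"shout": str.upper}  # what these workers are allowed to execute
    log: List[str] = []

    workers = [
        asyncio.create_task(toy_worker(f"worker-{i}", queue, registry, log))
        for i in range(2)
    ]

    # The requester only names the task; it does not choose a worker.
    for word in ["research", "pdf"]:
        queue.put_nowait(("shout", word))

    await queue.join()  # wait until every task has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return log


log = asyncio.run(main())
```

Which worker handles which task is not specified by the requester, which is the point: any worker listening on the queue with the task registered can do the work.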

In [16]:
from temporalio.client import Client
from temporalio.worker import Worker
import concurrent.futures

async def run_worker() -> None:
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("LiteLLM").setLevel(logging.WARNING)

    client = await Client.connect("localhost:7233", namespace="default")

    # Run the Worker
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as activity_executor:
        worker = Worker(
            client,
            task_queue="research",
            workflows=[GenerateReportWorkflow],
            activities=[llm_call, create_pdf_activity],
            activity_executor=activity_executor
        )

        print("Starting the worker....")
        await worker.run()

## Running a Temporal Service

* The Temporal Service brings it all together
* The Temporal Service can be run locally, self-hosted, or you can use Temporal Cloud
* The service acts as the supervisor of your Workflows, Activities, and everything else

In [17]:
# Start the Temporal Dev Server
import os
import subprocess

command = "/root/.temporalio/bin/temporal server start-dev --ui-port 8000"
temporal_server = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, preexec_fn=os.setsid)

## Starting the Worker

* A Workflow can't execute if a Worker isn't running

In [19]:
# Due to the limitations of Jupyter Notebooks and Google Colab, this is how
# you must start the Worker in a Notebook environment
worker = asyncio.create_task(run_worker())


# If you are running this code in a typical Python environment, you can start
# the Worker by just calling `asyncio.run`
# if __name__ == "__main__":
#    asyncio.run(run_worker())

## Executing the Workflow

* Temporal Workflows are executed indirectly
* You **don't** just execute the file, you request execution from the Temporal Service
* You do this using a Temporal Client
* In the client you specify the Workflow to run, the data, a Workflow ID to identify the execution, and the Task Queue to request on
  * This Task Queue **must exactly match** the Task Queue specified in the Worker
* Workflows can be started asynchronously or synchronously


In [23]:
import asyncio

from temporalio.client import Client


client = await Client.connect("localhost:7233", namespace="default")

print("Welcome to the Research Report Generator!")
prompt = input("Enter your research topic or question: ").strip()

if not prompt:
    prompt = "Give me 5 fun and fascinating facts about tardigrades. Make them interesting and educational!"
    print(f"No prompt entered. Using default: {prompt}")

# Asynchronous start of a Workflow
handle = await client.start_workflow(
    GenerateReportWorkflow.run,
    GenerateReportInput(prompt=prompt),
    id="generate-research-report-workflow",
    task_queue="research",
)

print(f"Started workflow. Workflow ID: {handle.id}, RunID {handle.result_run_id}")

Welcome to the Research Report Generator!
Enter your research topic or question: 
No prompt entered. Using default: Give me 5 fun and fascinating facts about tardigrades. Make them interesting and educational!
Started workflow. Workflow ID: generate-research-report-workflow, RunID 0198ecc2-d2c4-7e9c-92b1-3776e02ca506


## Getting the Result

The example above starts the Workflow asynchronously and returns a handle right away. You can `await handle.result()` to wait for the Workflow to finish and get its result.

In [22]:
# Get the result
result = await handle.result()
print(f"Result: {result}")

Result: GenerateReportOutput(result='Successfully created research report PDF: research_pdf.pdf')
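
For the synchronous option, the Python SDK's `Client.execute_workflow` starts the Workflow and waits for its result in one call. The sketch below wraps that call in a hypothetical helper, `run_report_and_wait`; the Workflow ID shown is also just an example. It accepts any client object with a compatible `execute_workflow` method, which keeps the helper easy to try out without a running Temporal Service.

```python
from typing import Any


async def run_report_and_wait(client: Any, workflow_run: Any, workflow_input: Any) -> Any:
    """Start a Workflow and wait until it returns its result."""
    return await client.execute_workflow(
        workflow_run,
        workflow_input,
        id="generate-research-report-workflow-sync",  # example ID
        task_queue="research",  # must match the Worker's Task Queue
    )


# With the objects defined in this notebook, usage would look like:
# result = await run_report_and_wait(
#     client, GenerateReportWorkflow.run, GenerateReportInput(prompt=prompt)
# )
```

The asynchronous form is the better fit when the caller should not block, for example when a Workflow may run for minutes or longer.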


## Exploring the Web UI

- Temporal provides a robust Web UI for managing Workflow Executions
- Can gain insights like responses from Activities, execution time, and failures
- Great for debugging

In [24]:
# Get the Temporal Web UI URL
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(8000)"))

https://8000-m-s-1juj2863z1aj7-b.us-west1-0.prod.colab.dev


## Simulating Failure

What happens if the Worker process were to crash during execution?

## Adding a Durable Timer

- Timers introduce delays in your Workflow
- Durable Timers fire regardless of whether a Worker is running
- Let's add one to the Workflow to give us time to kill the Worker in the middle of execution.

In [25]:
import asyncio
from datetime import timedelta
import logging

from temporalio import workflow

# sandboxed=False is a Notebook only requirement. You normally don't do this
@workflow.defn(sandboxed=False)
class GenerateReportWorkflow:

    @workflow.run
    async def run(self, input: GenerateReportInput) -> GenerateReportOutput:

        llm_call_input = LLMCallInput(prompt=input.prompt, llm_api_key=LLM_API_KEY, llm_model=LLM_MODEL)

        research_facts = await workflow.execute_activity(
            llm_call,
            llm_call_input,
            start_to_close_timeout=timedelta(seconds=30),
        )

        workflow.logger.info("Research complete!")

        # Adding a Timer here to pause the Workflow Execution
        await workflow.sleep(timedelta(seconds=20))

        pdf_generation_input = PDFGenerationInput(content=research_facts["choices"][0]["message"]["content"])

        pdf_filename = await workflow.execute_activity(
            create_pdf_activity,
            pdf_generation_input,
            start_to_close_timeout=timedelta(seconds=10),
        )

        return GenerateReportOutput(result=f"Successfully created research report PDF: {pdf_filename}")

## Restart the Worker

- After a Workflow change, you must restart the Worker for the change to take effect.

In [26]:
# Run this to kill the current Worker
x = worker.cancel()

if x:
  print("Worker killed")
else:
  print("Worker was not running. Nothing to kill")

Worker killed


In [27]:
# Starting the Worker again
worker = asyncio.create_task(run_worker())

# Check if the task is in the set of all tasks
if worker in asyncio.all_tasks():
    print("Task is currently active.")
else:
    print("Task is not found in active tasks (might have finished or not yet scheduled).")

Task is currently active.


## Start the Workflow and Simulate an Error

Start the Workflow again, wait about 10 seconds to let the first Activity complete, then kill the Worker.

In [30]:
import time

client = await Client.connect("localhost:7233", namespace="default")

print("Welcome to the Research Report Generator!")
prompt = input("Enter your research topic or question: ").strip()

if not prompt:
    prompt = "Give me 5 fun and fascinating facts about tardigrades. Make them interesting and educational!"
    print(f"No prompt entered. Using default: {prompt}")

# Asynchronous start of a Workflow
handle = await client.start_workflow(
    GenerateReportWorkflow.run,
    GenerateReportInput(prompt=prompt),
    id="generate-research-report-workflow",
    task_queue="research",
)

print(f"Started workflow. Workflow ID: {handle.id}, RunID {handle.result_run_id}")

Welcome to the Research Report Generator!
Enter your research topic or question: 
No prompt entered. Using default: Give me 5 fun and fascinating facts about tardigrades. Make them interesting and educational!
Started workflow. Workflow ID: generate-research-report-workflow, RunID 0198ecc6-4b2f-77c2-badb-418cbc0689c0


In [31]:
# Run this to kill the current Worker
x = worker.cancel()

if x:
  print("Worker killed")
else:
  print("Worker was not running. Nothing to kill")

Worker killed


## Watch the Progress in the Web UI

- Go to the Web UI and watch the progress. Try to locate the following things:
  - Input to the Workflow
  - Result of the first Activity
  - The Timer firing
  - The Workflow timing out

In [29]:
# Get the Temporal Web UI URL
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(8000)"))

https://8000-m-s-1juj2863z1aj7-b.us-west1-0.prod.colab.dev


## Restart the Worker to Resume Execution

- Restart the Worker and return to the Web UI
- You will see the Workflow pick up where it left off as if nothing happened

In [32]:
# Starting the Worker again
worker = asyncio.create_task(run_worker())

---
# Exercise 2 - Adding Durability

* In these exercises you will:
  * **FILL IN**
* Go to the **Exercise** Directory in the Google Drive and open the **Practice** Directory
* Open _02-Adding-Durability.ipynb_ and follow the instructions
* If you get stuck, raise your hand and someone will come by and help. You can also check the `Solution` directory for the answers
* **You have 5 mins**