<a href="https://colab.research.google.com/github/tonyjosephsebastians/AI-Design-patterns/blob/main/GROUP_1_%E2%80%94_%E2%80%9CRequests_are_slow_or_unreliable%E2%80%9D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔵 GROUP 1: Requests Are Slow or Unreliable  
## Foundation Patterns (Start Here)

This group addresses the **most common and dangerous failure** in AI systems:

> **AI workloads are slow, flaky, and expensive, but beginners design APIs as if they were fast and reliable.**

We will intentionally build a **bad system first**, feel the pain, and then fix it step by step.

---

## 🎯 Patterns Covered in Group 1

We learn these **together** because they solve the *same class of problems*:

1. Sync vs Async Execution  
2. Long-Running Task Pattern  
3. Job / Workflow Pattern  
4. Retry + Backoff Pattern  
5. Timeout Pattern  
6. Circuit Breaker Pattern  
7. Partial Result Pattern  
8. Graceful Degradation Pattern  

---

# 🧪 STEP 1: CREATE THE FAILURE  
**(DO THIS FIRST. DO NOT SKIP.)**

---

## 🎯 Goal of Step 1

Understand **why synchronous APIs break** for AI workloads.

If you don't *feel* this failure, the patterns will feel abstract.

---

## 🧱 What We Will Build (INTENTIONALLY BAD)

A naive API that:
- accepts a document upload
- does "heavy AI processing" (OCR, embeddings, etc.)
- **blocks the HTTP request** until that processing finishes

⚠️ This is exactly how many beginner AI APIs are built.

---

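## 📌 Colab Cell 1: Install Dependencies

```bash
!pip install fastapi uvicorn nest_asyncio
```

## 📌 Colab Cell 2: Define the Blocking API

```python
import time

import nest_asyncio
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# Colab already runs an event loop; nest_asyncio lets uvicorn start inside it.
nest_asyncio.apply()

app = FastAPI()


@app.post("/documents")
def upload_document():
    # Intentionally bad: the request is held open for the full processing time.
    time.sleep(20)  # simulate heavy AI work (OCR, embeddings, etc.)
    return JSONResponse({"status": "done"})
```

## 📌 Colab Cell 3: Run the Server

```python
import asyncio

import uvicorn

config = uvicorn.Config(app, host="0.0.0.0", port=8000, loop="asyncio")
server = uvicorn.Server(config)

# This call blocks the cell until the server is stopped (interrupt the cell to stop it).
asyncio.run(server.serve())
```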
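
A single request to this server already takes about 20 seconds, but the real damage shows up under concurrency. The sketch below is a simulation, not FastAPI itself: `handle_request`, `simulate`, the pool size of 2, and the scaled-down 0.1 s of "AI work" are all assumptions chosen for illustration. It models the server as a fixed pool of workers, sends four requests at once, and reports how long each one waited for its answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(work_seconds: float) -> float:
    """Blocking handler: occupies a worker for the whole processing time."""
    time.sleep(work_seconds)  # stands in for OCR / embeddings
    return time.monotonic()   # completion timestamp


def simulate(num_requests: int, num_workers: int, work_seconds: float) -> list[float]:
    """Return each request's latency, measured from the moment all were submitted."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(handle_request, work_seconds) for _ in range(num_requests)]
        return [f.result() - start for f in futures]


latencies = simulate(num_requests=4, num_workers=2, work_seconds=0.1)
for i, latency in enumerate(latencies, start=1):
    print(f"request {i}: {latency:.2f}s")
```

With two workers, the third and fourth requests cannot start until the first two finish, so they take roughly twice as long even though their own work is identical. Scale 0.1 s back up to 20 s and the queueing delay becomes the dominant cost.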

# PATTERN #1: Sync vs Async Execution

Anything that may take more than a few seconds must be asynchronous.

## Why Sync APIs Fail for AI

Sync APIs assume:
- fast execution
- reliable downstream services
- no retries

AI workloads are:
- slow
- flaky
- retry-prone
- rate-limited (429s)

❌ Sync + AI = broken system

| Type of Work | API Style |
| --- | --- |
| Chat responses | Sync (often streaming) |
| OCR, embeddings, indexing | Async |
| Agent workflows | Async |
| Batch extraction | Async |

**Async by default for AI pipelines.**
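
The "retry-prone" and "rate-limited" points compound the "slow" one. As a rough worked example, the sketch below adds up how long a synchronous request could stay open if every attempt but the last fails. The function name `worst_case_latency`, the three attempts, and the 1-second base delay that doubles each time are illustrative assumptions, not values from any particular provider.

```python
def worst_case_latency(call_seconds: float, attempts: int, base_delay: float) -> float:
    """Total time a caller waits if every attempt runs to completion.

    Assumes exponential backoff: the wait before retry i (0-based)
    is base_delay * 2**i, with no jitter.
    """
    work = call_seconds * attempts
    waits = sum(base_delay * 2**i for i in range(attempts - 1))
    return work + waits


# One 20 s processing call, retried twice after 429s or transient failures.
total = worst_case_latency(call_seconds=20, attempts=3, base_delay=1)
print(total)  # 63 (3 * 20 s of work + 1 s + 2 s of backoff)
```

Holding an HTTP request open for over a minute is fragile, which is why retried work belongs behind an async boundary rather than inside the request.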

# STEP 2: APPLY THE FIRST FIX

**Introduce an Async Boundary**

## 🎯 Goal of Step 2

- Return immediately to the client
- Move heavy work out of the request lifecycle

This introduces two patterns at once:

1. Sync vs Async Execution
2. Long-Running Task Pattern

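```python
import time
import uuid

from fastapi import BackgroundTasks, FastAPI
from fastapi.responses import JSONResponse

# Fresh app: routes match in registration order, so the blocking
# POST /documents from Step 1 would otherwise still win.
# Restart the server cell after running this one.
app = FastAPI()

jobs = {}  # in-memory job store (temporary)


def process_document(job_id: str):
    # Runs after the response has been sent.
    jobs[job_id]["status"] = "running"
    time.sleep(20)  # simulate heavy AI work
    jobs[job_id]["status"] = "done"


@app.post("/documents")
def upload_document(background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    background_tasks.add_task(process_document, job_id)
    return JSONResponse({"job_id": job_id, "status": "queued"})
```

The client then polls a second endpoint for the job's status:

```python
@app.get("/jobs/{job_id}")
def get_job_status(job_id: str):
    # Unknown ids get an error payload instead of a job record.
    return jobs.get(job_id, {"error": "job not found"})
```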
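
On the client side, the job pattern turns one long request into a quick submit followed by polling. The helper below is a minimal sketch of that loop. `poll_until_finished`, its parameters, and the stand-in `fake_fetch` are hypothetical names; in a real client, `fetch_status` would issue `GET /jobs/{job_id}` and return the parsed JSON.

```python
import time
from typing import Callable


def poll_until_finished(
    fetch_status: Callable[[], dict],
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Call fetch_status() until the job reports "done" or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") == "done":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout}s")
        time.sleep(interval)


# Stand-in for the HTTP call: reports "queued", then "running", then "done".
responses = iter([{"status": "queued"}, {"status": "running"}, {"status": "done"}])


def fake_fetch() -> dict:
    return next(responses)


result = poll_until_finished(fake_fetch, interval=0.01, timeout=1.0)
print(result)  # {'status': 'done'}
```

Note the explicit deadline: without it, a job that never leaves `running` would keep the client polling forever.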


## PATTERNS YOU JUST LEARNED

**✅ Pattern 1: Sync vs Async Execution**
- API responds immediately
- Work continues in background

**✅ Pattern 2: Long-Running Task Pattern**
- Heavy AI work detached from HTTP request
- No blocking of workers

**✅ Pattern 3: Job / Workflow Pattern**
- Explicit job identifier
- Client polls job status
- Work is trackable

**Current limitations:**
- ❌ No failure handling
- ❌ No retries
- ❌ No timeout protection
- ❌ No circuit breaker
- ❌ No progress tracking

# 🔵 GROUP 1: Step 3  
## Job States + Failure Handling  
**(Workflow Pattern + State Pattern, GoF)**

---

## 🧠 Why Step 3 exists (Failure First)

Right now, our async version stores only:

```python
jobs[job_id] = {"status": "queued"}
```

What breaks?

- If background processing crashes, the job can get stuck forever
- There is no progress visibility
- Failures are not captured
- There are no clear lifecycle guarantees

So we need a real job lifecycle (a state machine).
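
The first failure is easy to reproduce without a server. In the sketch below, `call_ai_service` is a hypothetical stand-in for a downstream call that fails, and `process_document` mirrors the shape of the Step 2 background task. The job store is the same plain dict.

```python
jobs = {}


def call_ai_service() -> None:
    raise RuntimeError("OCR service unavailable")  # simulated downstream failure


def process_document(job_id: str) -> None:
    jobs[job_id]["status"] = "running"
    call_ai_service()
    jobs[job_id]["status"] = "done"  # never reached when the call above raises


jobs["job-1"] = {"status": "queued"}
try:
    process_document("job-1")
except RuntimeError:
    pass  # a task runner may log this, but nothing records it on the job

print(jobs["job-1"])  # {'status': 'running'}
```

A client polling this job sees `running` indefinitely and cannot tell a slow job from a dead one.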

We will:

- Define explicit job states
- Track progress
- Capture errors
- Enforce a simple, predictable lifecycle

Patterns learned here:

- ✅ Job / Workflow Pattern
- ✅ State Pattern (GoF)
- ✅ Partial Result (foundation via progress reporting)

## Job State Machine

Production-friendly lifecycle:

```text
queued → running → succeeded
              ↘ failed
```

(We can add `cancelled` later.)
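
One way to encode this lifecycle is an enum of states plus a table of allowed transitions, with a wrapper that records progress and errors. This is a minimal in-memory sketch under assumptions: `JobState`, `Job`, `ALLOWED`, `transition`, and `run_job` are illustrative names, and the step-based progress counter is just one possible way to report partial progress.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions; succeeded and failed are terminal.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}


@dataclass
class Job:
    state: JobState = JobState.QUEUED
    progress: float = 0.0        # fraction of steps completed, 0.0 to 1.0
    error: Optional[str] = None  # set only when the job fails


def transition(job: Job, new_state: JobState) -> None:
    """Move the job to new_state, rejecting moves the lifecycle does not allow."""
    if new_state not in ALLOWED[job.state]:
        raise ValueError(f"illegal transition: {job.state.value} -> {new_state.value}")
    job.state = new_state


def run_job(job: Job, steps: list[Callable[[], None]]) -> None:
    """Run steps in order, updating progress; any exception marks the job failed."""
    transition(job, JobState.RUNNING)
    try:
        for done, step in enumerate(steps, start=1):
            step()
            job.progress = done / len(steps)
        transition(job, JobState.SUCCEEDED)
    except Exception as exc:
        job.error = str(exc)
        transition(job, JobState.FAILED)


def ocr() -> None:
    pass  # pretend this step works


def embed() -> None:
    raise RuntimeError("embedding service returned 429")


ok, bad = Job(), Job()
run_job(ok, [ocr, ocr])
run_job(bad, [ocr, embed])
print(ok.state.value, ok.progress)               # succeeded 1.0
print(bad.state.value, bad.progress, bad.error)  # failed 0.5 embedding service returned 429
```

The failed job keeps `progress == 0.5`, which is the foundation for partial results: the client can see how far the work got before it failed, and a finished job can no longer be moved back to `running` by accident.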