```{contents}
```
## Microservice Architecture

### 1. Motivation

Modern **Generative AI systems** (LLMs, multimodal models, agents, RAG pipelines) are:

* **Large-scale**
* **Latency-sensitive**
* **Rapidly evolving**
* **Cost-intensive**

A **microservice architecture** enables:

| Requirement     | Benefit                                                           |
| --------------- | ----------------------------------------------------------------- |
| Scalability     | Independent scaling of heavy components (LLM, embeddings, search) |
| Modularity      | Swap models, tools, vector DBs without rewriting the system       |
| Reliability     | Failures isolated to individual services                          |
| Experimentation | A/B test models & prompts safely                                  |
| Cost control    | Scale expensive services only when needed                         |

---

### 2. Core Concept

**Microservice Architecture** decomposes a GenAI system into **small, independent services** that communicate over APIs.

Each service owns **one responsibility**.

```
User → API Gateway → Orchestrator → {LLM | Retrieval | Memory | Tools | Safety | Logging}
```

---

### 3. Typical GenAI Microservice Decomposition

| Service               | Responsibility                              |
| --------------------- | ------------------------------------------- |
| API Gateway           | Authentication, rate limits, routing        |
| Orchestrator / Agent  | Workflow control, tool selection            |
| Prompt Service        | Prompt templates, versioning                |
| LLM Service           | Model inference (OpenAI, local, fine-tuned) |
| Embedding Service     | Text → vector                               |
| Retrieval Service     | Vector search / RAG                         |
| Memory Service        | Long-term / conversation memory             |
| Tool Service          | External APIs, functions                    |
| Safety Service        | Moderation, PII filtering                   |
| Observability Service | Logs, traces, metrics                       |
| Caching Service       | Response & embedding cache                  |
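
As one illustration of a single-responsibility component, the rate limiting listed for the API Gateway could be implemented with a token bucket. This is a minimal sketch, not how any particular gateway product works; `TokenBucket`, its parameters, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5.0, capacity=2)
    print([bucket.allow() for _ in range(3)])
```

A gateway would typically keep one bucket per API key or client and answer with HTTP 429 when `allow()` returns `False`.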

---

### 4. Reference Architecture

```
                 ┌──────────────┐
User ──HTTPS──▶  │ API Gateway  │
                 └─────┬────────┘
                       ▼
                ┌─────────────┐
                │ Orchestrator│
                └─────┬───────┘
        ┌──────────────┼──────────────────┐
        ▼              ▼                  ▼
  Prompt Svc     Retrieval Svc       Tool Svc
        │              │                  │
        ▼              ▼                  ▼
   LLM Service   Vector DB / RAG     External APIs
        │
        ▼
  Safety + Logging + Cache
```

---

### 5. Communication Patterns

| Pattern                           | Usage                               |
| --------------------------------- | ----------------------------------- |
| REST / gRPC                       | Low-latency synchronous calls       |
| Async Messaging (Kafka, RabbitMQ) | Long workflows, event-driven agents |
| Streaming (WebSockets, SSE)       | Token streaming from LLM            |
| Service Mesh                      | Observability, retries, security    |
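
To make the streaming row concrete, here is a minimal sketch of how generated tokens could be framed as Server-Sent Events using only the standard library. `sse_event` and `stream_tokens` are hypothetical helper names, and the `[DONE]` end marker is an assumed convention rather than part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    # An SSE message is a "data:" line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE message per token, then an end-of-stream marker."""
    for index, token in enumerate(tokens):
        yield sse_event({"index": index, "token": token})
    # "[DONE]" is an assumed convention for telling the client the stream is over.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for message in stream_tokens(["Micro", "services", "!"]):
        print(message, end="")
```

In a FastAPI service, a generator like this could be passed to `StreamingResponse` with `media_type="text/event-stream"`, so the client starts rendering tokens before the full completion is available.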

---

### 6. Workflow Example: RAG Query

**User Question → Answer**

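1. API Gateway receives the request and forwards it to the Orchestrator
2. Orchestrator requests a query embedding from the Embedding Service
3. Retrieval Service fetches the most similar documents
4. Prompt Service assembles the question and retrieved context into a prompt
5. LLM Service generates the response
6. Safety Service filters the output
7. Cache and Logging store the result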

```
Question
   ↓
Embedding → Vector DB → Context
   ↓
Prompt Assembly
   ↓
LLM Inference
   ↓
Safety → Response
```
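
The flow above can be sketched as a small pipeline object in which each field stands in for one service call. This is an illustrative outline under assumptions: `RagPipeline` and its field names are hypothetical, the stubs replace real network calls, and the caching and logging step is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagPipeline:
    """Runs the RAG steps in order; each field stands in for one service call."""
    embed: Callable[[str], List[float]]
    retrieve: Callable[[List[float]], List[str]]
    build_prompt: Callable[[str, List[str]], str]
    generate: Callable[[str], str]
    is_safe: Callable[[str], bool]

    def answer(self, question: str) -> str:
        vector = self.embed(question)                    # Embedding Service
        documents = self.retrieve(vector)                # Retrieval Service
        prompt = self.build_prompt(question, documents)  # Prompt Service
        draft = self.generate(prompt)                    # LLM Service
        if not self.is_safe(draft):                      # Safety Service
            return "The response was withheld by the safety filter."
        return draft


if __name__ == "__main__":
    # Stub implementations so the sketch runs without any network calls.
    pipeline = RagPipeline(
        embed=lambda text: [float(len(text))],
        retrieve=lambda vector: ["Each service owns one responsibility."],
        build_prompt=lambda q, docs: "Context: " + " ".join(docs) + "\nQuestion: " + q,
        generate=lambda prompt: f"(stub answer based on {len(prompt)} characters)",
        is_safe=lambda text: "password" not in text.lower(),
    )
    print(pipeline.answer("What does each service own?"))
```

Passing the steps in as callables keeps the orchestration logic independent of how each service is reached, so an HTTP client, a gRPC stub, or a test double can be swapped in without touching `answer`.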

---

### 7. Example: Minimal Python Microservices (FastAPI)

**LLM Service**

```python
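from fastapi import Body, FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@app.post("/generate")
def generate(prompt: str = Body(..., embed=True)):
    # Expects a JSON body {"prompt": "..."}; returns only the generated text.
    completion = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"text": completion.choices[0].message.content}
```

**Embedding Service**

```python
from fastapi import Body, FastAPI
from openai import OpenAI

app = FastAPI()  # deployed as its own service, separate from the LLM service
client = OpenAI()

@app.post("/embed")
def embed(text: str = Body(..., embed=True)):
    # Expects a JSON body {"text": "..."}; returns only the embedding vector.
    result = client.embeddings.create(model="text-embedding-3-large", input=text)
    return {"embedding": result.data[0].embedding}
```

**Orchestrator**

```python
import requests

def answer(query: str) -> str:
    # Embed the query; the embedding service expects {"text": ...}.
    resp = requests.post("http://embed:8000/embed", json={"text": query}, timeout=10)
    resp.raise_for_status()
    # search_vector_db and build_prompt are assumed to be defined elsewhere.
    docs = search_vector_db(resp.json()["embedding"])
    prompt = build_prompt(query, docs)
    # Generate the answer; the LLM service expects {"prompt": ...}.
    resp = requests.post("http://llm:8000/generate", json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]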
```
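
The orchestrator calls `search_vector_db` and `build_prompt`, which this section does not define. The following is one possible sketch, assuming the embedding arrives as a plain list of floats and using an in-memory list with cosine similarity in place of a real vector database. `DOCUMENTS`, the toy three-dimensional vectors, the `top_k` parameter, and the prompt wording are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

# Toy in-memory index of (document text, embedding). Vectors are made up for illustration.
DOCUMENTS: List[Tuple[str, List[float]]] = [
    ("The API Gateway handles authentication and rate limits.", [1.0, 0.0, 0.0]),
    ("The Retrieval Service performs vector search.", [0.0, 1.0, 0.0]),
    ("The Safety Service filters PII.", [0.0, 0.0, 1.0]),
]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_vector_db(embedding: Sequence[float], top_k: int = 2) -> List[str]:
    """Return the top_k document texts most similar to the query embedding."""
    ranked = sorted(
        DOCUMENTS,
        key=lambda item: cosine_similarity(embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query: str, docs: Sequence[str]) -> str:
    """Assemble the retrieved documents and the user question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    hits = search_vector_db([0.1, 0.9, 0.0], top_k=1)
    print(build_prompt("Which service does vector search?", hits))
```

In a production system the linear scan would be replaced by a call to the Retrieval Service, which in turn queries an approximate nearest-neighbour index.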

---

### 8. Deployment & Scaling

| Layer             | Technology                         |
| ----------------- | ---------------------------------- |
| Containers        | Docker                             |
| Orchestration     | Kubernetes                         |
| Service Discovery | Consul, Kubernetes DNS             |
| API Gateway       | Kong, NGINX                        |
| Autoscaling       | HPA, KEDA                          |
| Observability     | Prometheus, Grafana, OpenTelemetry |

---

### 9. Patterns Specific to Generative AI

| Pattern                 | Purpose                       |
| ----------------------- | ----------------------------- |
| Model Abstraction Layer | Swap OpenAI ↔ local models    |
| Prompt Versioning       | Reproducible experiments      |
| Response Caching        | Reduce LLM cost               |
| Model Router            | Route queries by cost/quality |
| Tool-augmented Agents   | Dynamic workflows             |
| Human-in-the-loop       | Safety & review               |
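
Three of these patterns (model abstraction layer, model router, and response caching) compose naturally, as the sketch below tries to show. All class names are hypothetical, the two backends are stubs rather than real model clients, and routing on prompt length is only a stand-in for a real cost/quality policy.

```python
from typing import Dict, Protocol


class TextModel(Protocol):
    """Model abstraction layer: every backend exposes the same method."""
    def generate(self, prompt: str) -> str: ...


class SmallModel:
    """Stub for a cheap backend, e.g. a local model."""
    def generate(self, prompt: str) -> str:
        return f"[small] {prompt[:20]}"


class LargeModel:
    """Stub for an expensive backend, e.g. a hosted frontier model."""
    def generate(self, prompt: str) -> str:
        return f"[large] {prompt[:20]}"


class ModelRouter:
    """Model router: short prompts go to the cheap model, long ones to the strong model."""
    def __init__(self, cheap: TextModel, strong: TextModel, max_cheap_chars: int = 200):
        self.cheap = cheap
        self.strong = strong
        self.max_cheap_chars = max_cheap_chars  # assumed heuristic threshold

    def generate(self, prompt: str) -> str:
        model = self.cheap if len(prompt) <= self.max_cheap_chars else self.strong
        return model.generate(prompt)


class CachedModel:
    """Response caching: identical prompts are served from memory."""
    def __init__(self, inner: TextModel):
        self.inner = inner
        self.cache: Dict[str, str] = {}
        self.misses = 0  # number of calls that reached the wrapped model

    def generate(self, prompt: str) -> str:
        if prompt not in self.cache:
            self.misses += 1
            self.cache[prompt] = self.inner.generate(prompt)
        return self.cache[prompt]


if __name__ == "__main__":
    model = CachedModel(ModelRouter(SmallModel(), LargeModel()))
    print(model.generate("Summarize this sentence."))
    print(model.generate("Summarize this sentence."))  # second call is a cache hit
    print("backend calls:", model.misses)
```

Because the router and the cache implement the same `generate` method as the backends, callers such as the Orchestrator do not need to know which of them they are talking to.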

---

### 10. Comparison with Monolithic Design

| Aspect            | Monolith     | Microservices |
| ----------------- | ------------ | ------------- |
| Scaling           | Whole system | Per-service   |
| Model upgrades    | Risky        | Isolated      |
| Failure impact    | System-wide  | Contained     |
| Experimentation   | Hard         | Easy          |
| Cost optimization | Coarse       | Fine-grained  |

---

### 11. When Microservices Become Necessary

Use microservices when:

* Multiple models are in production
* RAG pipelines are complex
* Traffic is high and latency SLAs are strict
* Models and prompts change frequently through experimentation
* Multiple teams develop the system in parallel

---

### 12. Key Takeaways

* Microservices provide **control, scalability, and reliability** for GenAI systems.
* They decouple **models, data, prompts, tools, memory, and safety**.
* They let individual services be upgraded and redeployed independently, supporting **continuous improvement** with minimal downtime.
* They are foundational for **agentic systems and enterprise GenAI platforms**.

