```{contents}
```
## Architecture & Infrastructure

---

### 1. Big Picture: What Is “Architecture & Infrastructure” in Generative AI?

| Layer                     | Purpose                                                                      |
| ------------------------- | ---------------------------------------------------------------------------- |
| **Model Architecture**    | Mathematical structure of the model (e.g., Transformer, Diffusion)           |
| **Training Architecture** | How the model is trained (distributed systems, optimization, data pipelines) |
| **Serving Architecture**  | How models are deployed and used in production                               |
| **Infrastructure**        | Hardware, networking, storage, orchestration, security, scaling              |

Generative AI systems are **full-stack systems**, not just neural networks.

---

### 2. Core Model Architectures

| Model Type         | Used For                | Core Components                                   |
| ------------------ | ----------------------- | ------------------------------------------------- |
| Transformer (LLMs) | Text, code              | Self-attention, feed-forward, positional encoding |
| Diffusion Models   | Images, video           | U-Net, noise scheduler                            |
| VAEs               | Representation learning | Encoder, decoder, latent space                    |
| GANs               | Image generation        | Generator, discriminator                          |

**Transformer (LLM) Anatomy**

```
Input → Embedding → N × [Attention → MLP] → LayerNorm → Output Head
```
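
The attention step in the diagram above can be sketched in plain Python. This is a minimal single-head illustration, assuming the token vectors are used directly as queries, keys, and values; real Transformers apply learned projections first and run many heads in parallel. The names `softmax` and `causal_self_attention` are illustrative, not from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """q, k, v: lists of per-token vectors with the same length and dimension."""
    d = len(q[0])
    weights, outputs = [], []
    for i, q_i in enumerate(q):
        # Causal mask: token i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the visible value vectors.
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return outputs, weights

# Toy input: three tokens with 2-dimensional vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
```

Because of the causal mask, the first token can only attend to itself, so its output equals its own value vector.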

---

### 3. Training Architecture (Offline Pipeline)

```
Data → Cleaning → Tokenization → Sharding → Distributed Training → Checkpointing
```
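
The first stages of this pipeline can be sketched as three small functions. This is a toy illustration under stated assumptions: the whitespace tokenizer and the helper names `clean`, `tokenize`, and `shard` are hypothetical, and production pipelines would use a trained subword tokenizer plus deduplication and quality filtering.

```python
import re

def clean(text):
    # Collapse whitespace and lowercase; real pipelines also dedupe and filter.
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text, vocab):
    # Toy whitespace tokenizer that assigns a new id the first time a word is seen.
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

def shard(examples, num_shards):
    # Round-robin assignment keeps shards roughly equal in size.
    return [examples[i::num_shards] for i in range(num_shards)]

raw_docs = ["  Hello   world ", "", "Hello again,\nworld", "Distributed training"]
vocab = {}
cleaned = [c for c in (clean(d) for d in raw_docs) if c]  # drop empty documents
token_ids = [tokenize(d, vocab) for d in cleaned]
shards = shard(token_ids, num_shards=2)
```

Each shard can then be read by a different training worker, which is what makes the later distributed-training stage possible.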

#### Distributed Training

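| Parallelism          | Purpose                                                                                        |
| -------------------- | ---------------------------------------------------------------------------------------------- |
| Data Parallelism     | Replicate the model on every GPU and split each batch across the replicas                      |
| Model Parallelism    | Split the model's weights across GPUs when it does not fit on one (e.g., tensor parallelism)   |
| Pipeline Parallelism | Assign consecutive groups of layers to different devices and stream micro-batches through them |
| ZeRO / FSDP          | Shard parameters, gradients, and optimizer state across workers to cut per-GPU memory          |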
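
To make data parallelism concrete, the sketch below simulates two workers in a single process. Each worker computes a gradient on its own shard, and averaging those gradients stands in for the all-reduce that a runtime such as NCCL would perform. The one-parameter model and the helper names are assumptions chosen for illustration.

```python
def grad_mse(w, batch):
    # Gradient of mean squared error for the one-parameter model y_hat = w * x.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    # Stand-in for a collective all-reduce: every worker ends up with the mean.
    return sum(values) / len(values)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
num_workers = 2
shards = [data[i::num_workers] for i in range(num_workers)]  # equal-size shards

w = 0.0
local_grads = [grad_mse(w, s) for s in shards]  # computed independently per worker
synced_grad = all_reduce_mean(local_grads)      # identical on every worker after sync
full_grad = grad_mse(w, data)                   # single-device reference
```

With equal-size shards the averaged gradient matches the full-batch gradient, which is why data parallelism can scale throughput without changing the optimization step.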

**Training Stack**

| Layer               | Example              |
| ------------------- | -------------------- |
| Framework           | PyTorch, JAX         |
| Distributed Runtime | NCCL, Ray, DeepSpeed |
| Storage             | S3, GCS, HDFS        |
| Scheduler           | Kubernetes, Slurm    |

---

### 4. Inference & Serving Architecture

```
User → API Gateway → Load Balancer → Model Server → GPU → Response
```

**Model Serving Components**

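| Component        | Role                                                                                         |
| ---------------- | -------------------------------------------------------------------------------------------- |
| Tokenizer        | Converts text to token IDs and generated IDs back to text                                    |
| Inference Engine | Runs the model's forward pass on the accelerator                                             |
| KV Cache         | Stores attention keys/values of earlier tokens so each decoding step avoids recomputing them |
| Batching Engine  | Groups concurrent requests into one forward pass to raise GPU utilization                    |
| Autoscaler       | Adds or removes replicas as load changes                                                     |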
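
As one example of these components, a batching engine can be sketched as a greedy grouping of queued requests under a batch-size limit and a token budget. This is a simplified, hypothetical policy (`form_batches` and its parameters are not from any serving framework); production engines typically also add and remove requests between decoding steps.

```python
from collections import deque

def form_batches(queue, max_batch_size, max_batch_tokens):
    """Greedily group queued (request_id, token_count) pairs into batches."""
    batches = []
    while queue:
        batch, tokens = [], 0
        while queue and len(batch) < max_batch_size:
            req_id, n_tokens = queue[0]
            if batch and tokens + n_tokens > max_batch_tokens:
                break  # token budget reached; leave this request for the next batch
            queue.popleft()
            batch.append(req_id)
            tokens += n_tokens
        batches.append(batch)
    return batches

# Assumed workload: request ids paired with prompt lengths in tokens.
pending = deque([("a", 30), ("b", 50), ("c", 40), ("d", 10), ("e", 200)])
batches = form_batches(pending, max_batch_size=3, max_batch_tokens=100)
```

A request larger than the token budget is still admitted on its own, so the queue always drains; whether to reject such requests instead is a policy choice.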

---

### 5. Infrastructure Stack

| Layer         | Examples                |
| ------------- | ----------------------- |
| Compute       | GPUs (A100, H100), TPUs |
| Networking    | InfiniBand, NVLink      |
| Storage       | Object stores, SSDs     |
| Orchestration | Kubernetes              |
| Observability | Prometheus, Grafana     |
| Security      | IAM, secrets manager    |
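
Sizing the compute layer usually starts with back-of-the-envelope memory arithmetic. The sketch below estimates weight memory and KV-cache memory for a hypothetical 7B-parameter decoder with standard multi-head attention in 16-bit precision. The model shape, sequence length, and batch size are illustrative assumptions, not measurements of any specific model.

```python
def weight_bytes(num_params, bytes_per_param):
    return num_params * bytes_per_param

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GIB = 1024 ** 3

# Hypothetical configuration (assumed values for illustration only).
weights = weight_bytes(num_params=7_000_000_000, bytes_per_param=2)
cache = kv_cache_bytes(
    layers=32, kv_heads=32, head_dim=128,
    seq_len=4096, batch_size=8, bytes_per_value=2,
)
print(f"weights: {weights / GIB:.1f} GiB, KV cache: {cache / GIB:.1f} GiB")
```

Under these assumptions the KV cache (16 GiB) is larger than the weights themselves (about 13 GiB), which illustrates why batch size and context length drive GPU memory planning as much as model size does.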

---

### 6. End-to-End System Flow

```
[User]
   ↓
[API Gateway]
   ↓
[Request Router]
   ↓
[Inference Cluster]
   ↓
[GPU Workers]
   ↓
[Postprocessing]
   ↓
[Response]
```
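
The request router in this flow decides which worker receives each request. One simple policy is least-loaded routing, sketched below; the class name and worker ids are hypothetical, and real routers also account for health checks, queue depth, and cache locality.

```python
class LeastLoadedRouter:
    """Sends each request to the worker with the fewest in-flight requests."""

    def __init__(self, worker_ids):
        self.in_flight = {w: 0 for w in worker_ids}

    def route(self):
        # min() returns the first minimal key, so ties go to the earliest worker.
        worker = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[worker] += 1
        return worker

    def complete(self, worker):
        # Called when a worker finishes a request and frees capacity.
        self.in_flight[worker] -= 1

router = LeastLoadedRouter(["gpu-0", "gpu-1", "gpu-2"])
assignments = [router.route() for _ in range(4)]
router.complete("gpu-1")
next_worker = router.route()
```

After `gpu-1` finishes its request it becomes the least-loaded worker, so the next request is routed there.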

---

### 7. Minimal Deployment Example (Hugging Face + FastAPI)

```python
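from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load once at import time so every request reuses the same weights.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

class GenerateRequest(BaseModel):
    prompt: str  # sent in the JSON body rather than as a query parameter

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
    # Drop special tokens and return JSON instead of a bare string.
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}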
```

---

### 8. Common Architecture Patterns

| Pattern                  | Description                 |
| ------------------------ | --------------------------- |
| Single-Model API         | One large foundation model  |
| MoE (Mixture of Experts) | Sparse expert routing       |
| RAG                      | Retrieval + Generation      |
| Multi-Model Pipelines    | Chain of specialized models |
| Edge + Cloud             | On-device inference + cloud |

---

### 9. Scalability & Optimization Techniques

| Problem        | Solution                                |
| -------------- | --------------------------------------- |
| High latency   | Quantization, KV caching                |
| Memory limits  | ZeRO / FSDP sharding, CPU offloading    |
| High cost      | Spot instances, autoscaling             |
| Low throughput | Request batching, tensor parallelism    |
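
Quantization is the easiest of these techniques to show in a few lines. The sketch below applies symmetric per-tensor int8 quantization to a small list of weights; it is a simplified illustration with assumed values, and production schemes typically quantize per channel or per group and may calibrate on real activations.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127].
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int8 codes.
    return [v * scale for v in q]

weights = [0.50, -1.27, 0.003, 0.9]  # assumed example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of two or four, at the price of a rounding error of at most half the scale per weight; note how the small value `0.003` rounds to zero.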

---

### 10. Why This Architecture Matters

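* Makes it possible to **train models too large for a single device**, up to trillion-parameter scale
* Lets serving scale out to **millions of users** by adding replicas behind a load balancer
* Balances **latency, cost, accuracy, and reliability**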

Generative AI is fundamentally a **distributed systems problem** wrapped around deep learning.

