```{contents}
```
## Rate Limiting

### 1. Definition

**Rate limiting** is a control mechanism that restricts how many requests a client (user, application, API key, IP) can make to a **Generative AI service** within a fixed time window.

It protects:

* **Model availability**
* **Infrastructure stability**
* **Fair usage**
* **Cost predictability**
* **Abuse prevention**

Formally:

> A policy that enforces an upper bound on request volume per identity per time interval.

---

### 2. Why Rate Limiting Is Critical for Generative AI

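Generative models are **compute-intensive**, and each in-flight request holds accelerator memory for as long as it is generating, so uncontrolled traffic is far more damaging than it is for a typical web API.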

| Risk Without Rate Limiting | Consequence          |
| -------------------------- | -------------------- |
| DDoS or abuse              | Service outage       |
| Runaway loops / agents     | Unbounded cost       |
| Single tenant overload     | Starvation of others |
| Prompt spamming            | Degraded latency     |
| Cost explosion             | Budget failure       |

---

### 3. What Is Being Limited?

Generative AI systems typically limit **multiple dimensions** simultaneously:

| Dimension           | Example              |
| ------------------- | -------------------- |
| Requests            | 60 requests / minute |
| Tokens in           | 100K tokens / minute |
| Tokens out          | 50K tokens / minute  |
| Concurrent requests | 5 parallel calls     |
| Compute             | GPU-seconds / minute |
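Checking several dimensions at once can be sketched as below. This is a minimal illustration, not a reference design: the `Usage` and `Limits` classes and the `violated_dimensions` helper are hypothetical names, and the default limits simply mirror the example values in the table above (the compute dimension is left out for brevity).

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Usage observed for one client in the current window."""
    requests: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    concurrent: int = 0


@dataclass(frozen=True)
class Limits:
    """Per-window limits; defaults copy the example table (assumed values)."""
    requests: int = 60
    input_tokens: int = 100_000
    output_tokens: int = 50_000
    concurrent: int = 5


def violated_dimensions(usage, limits):
    """Return the name of every dimension whose usage exceeds its limit."""
    names = ("requests", "input_tokens", "output_tokens", "concurrent")
    return [n for n in names if getattr(usage, n) > getattr(limits, n)]
```

A request is admitted only when the returned list is empty; reporting which dimension was exceeded lets the client tell "too many calls" apart from "too many tokens".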

---

### 4. Core Rate Limiting Strategies

| Strategy           | Description                   | When Used              |
| ------------------ | ----------------------------- | ---------------------- |
| **Fixed Window**   | Count requests per interval   | Simple APIs            |
| **Sliding Window** | Continuous rolling window     | Smoother control       |
| **Token Bucket**   | Accumulate tokens over time   | Bursty traffic         |
| **Leaky Bucket**   | Enforces constant output rate | Streaming stability    |
| **Adaptive**       | Dynamically adjusts limits    | AI workload management |
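To make the difference between the first two rows concrete: a fixed window resets its counter at each interval boundary, so a client can send a full quota just before the boundary and another full quota just after it. A sliding window avoids this by counting only the requests inside the most recent interval. The sketch below is one simple way to do that (a timestamp log); the class name `SlidingWindowLimiter` is hypothetical, and the caller passes `now` explicitly so the behavior is easy to follow.

```python
from collections import deque


class SlidingWindowLimiter:
    """Admit at most `limit` requests in any rolling span of `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of admitted requests, oldest first

    def allow(self, now):
        # Discard timestamps that have aged out of the rolling window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

The log costs memory proportional to the limit per client, which is why high-volume systems often approximate it with weighted counters instead.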

---

### 5. Conceptual Architecture in Generative AI APIs

```text
Client → API Gateway → Rate Limiter → Prompt Router → Model Server → Response
```

The **rate limiter** executes **before** the model is invoked.
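The ordering matters because a rejected request should consume no model compute at all. A minimal sketch of that ordering follows; `handle_request`, `limiter_allows`, and `run_model` are hypothetical names standing in for the gateway handler, the limiter check, and the model server call.

```python
def handle_request(prompt, limiter_allows, run_model):
    """Consult the limiter first; call the model only for admitted requests."""
    if not limiter_allows(prompt):
        # Rejected here, so the model server never sees the request.
        return {"status": 429, "body": "Too Many Requests"}
    return {"status": 200, "body": run_model(prompt)}
```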

---

### 6. Example Policy

| Metric           | Limit            |
| ---------------- | ---------------- |
| Requests         | 60 / minute      |
| Input tokens     | 100,000 / minute |
| Output tokens    | 50,000 / minute  |
| Concurrent calls | 3                |

Violation triggers:

* HTTP **429 Too Many Requests**
* Optional `Retry-After` header
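One way a server might fill in `Retry-After` is to compute how long the limiter needs to refill enough capacity for the rejected request. The sketch below assumes a limiter that refills at a constant rate; the function names and the tuple-shaped response are illustrative, not any particular framework's API.

```python
import math


def retry_after_seconds(available, cost, refill_rate):
    """Whole seconds until `cost` units are available at `refill_rate` units/second."""
    deficit = cost - available
    if deficit <= 0:
        return 0
    return math.ceil(deficit / refill_rate)


def rate_limited_response(available, cost, refill_rate):
    """Build a (status, headers, body) triple for a rejected request."""
    wait = retry_after_seconds(available, cost, refill_rate)
    return 429, {"Retry-After": str(wait)}, "Too Many Requests"
```

Rounding up rather than down avoids telling the client to retry a moment before the capacity is actually there.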

---

### 7. Demonstration with Code

#### Token Bucket Implementation (Python)

```python
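import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full so an initial burst is allowed
        # Monotonic clock: unaffected by system clock adjustments.
        self.timestamp = time.monotonic()

    def allow(self, cost=1):
        """Spend `cost` tokens if available; otherwise return False and spend nothing."""
        # Refill for the time elapsed since the last call, capped at capacity.
        now = time.monotonic()
        delta = now - self.timestamp
        self.tokens = min(self.capacity, self.tokens + delta * self.rate)
        self.timestamp = now

        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False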
```

#### Applying to a Generative AI Request

```python
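# 100,000 prompt tokens per minute. Capacity must be at least the largest prompt
# to admit: a request whose cost exceeds capacity can never succeed.
bucket = TokenBucket(rate=100_000 / 60, capacity=100_000)

class RateLimitExceeded(Exception):
    """Raised when a request would exceed the token budget (maps to HTTP 429)."""

def call_model(prompt_tokens):
    if not bucket.allow(cost=prompt_tokens):
        raise RateLimitExceeded("Rate limit exceeded")
    return generate_text()  # placeholder for the actual model call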
```

---

### 8. Token-Based Rate Limiting (LLM-Specific)

Unlike traditional APIs, LLM services limit **tokens**, not just requests, because long prompts and long generations cost more compute than short ones.

Example policy:

```text
100,000 input tokens / minute
50,000 output tokens / minute
```

---

### 9. Multi-Tier Rate Limiting in AI Platforms

| Tier       | Typical Limits              |
| ---------- | --------------------------- |
| Free       | 10 req/min, 10K tokens/min  |
| Pro        | 60 req/min, 100K tokens/min |
| Enterprise | Custom quotas               |
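In code, tiers usually reduce to a lookup from plan name to limits. The sketch below copies the illustrative numbers from the table; `TIER_LIMITS` and `limits_for` are hypothetical names, and the enterprise quota is passed in by the caller because the table only says it is custom.

```python
# Illustrative values taken from the table above, not any provider's real limits.
TIER_LIMITS = {
    "free": {"requests_per_min": 10, "tokens_per_min": 10_000},
    "pro": {"requests_per_min": 60, "tokens_per_min": 100_000},
}


def limits_for(tier, custom_quota=None):
    """Return the limits for a tier; enterprise must supply its own quota."""
    if tier == "enterprise":
        if custom_quota is None:
            raise ValueError("enterprise tier requires a custom quota")
        return custom_quota
    return TIER_LIMITS[tier]
```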

---

### 10. Interaction with Agents & Tools

Autonomous agents can accidentally violate limits via loops:

```text
Planner → Tool → Model → Tool → Model → ...
```

Mitigation:

* **Per-agent quotas**
* **Global token budget**
* **Hard execution caps**
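Two of these mitigations, a token budget and a hard execution cap, can be combined in the agent's own loop. The sketch below is a simplified illustration: `run_agent` and `BudgetExceeded` are hypothetical names, and `step` is assumed to be a callable that performs one plan/tool/model iteration and returns `(done, tokens_used)`.

```python
class BudgetExceeded(Exception):
    """Raised when the agent hits its step cap or its token budget."""


def run_agent(step, max_steps, token_budget):
    """Run `step()` until it reports completion or a cap is reached.

    Returns (steps_taken, tokens_spent) on normal completion.
    """
    tokens_spent = 0
    for i in range(max_steps):
        done, tokens = step()
        tokens_spent += tokens
        if tokens_spent > token_budget:
            raise BudgetExceeded(f"token budget of {token_budget} exceeded")
        if done:
            return i + 1, tokens_spent
    # The loop ended without `done`: stop instead of iterating forever.
    raise BudgetExceeded(f"hit hard cap of {max_steps} steps")
```

Because the caps live inside the loop, a runaway agent stops even if the upstream API limits would have allowed it to continue.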

---

### 11. Relationship to Cost Control

Since LLM cost ∝ tokens processed:

> **Rate limiting is also budget limiting.**
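The link to budget can be made explicit with a little arithmetic: a per-minute token limit multiplied by a per-token price gives a ceiling on spend per hour. The function below is a sketch of that calculation; `worst_case_hourly_cost` is a hypothetical name, and the prices in the usage line are made-up placeholders, not any provider's actual pricing.

```python
def worst_case_hourly_cost(input_tokens_per_min, output_tokens_per_min,
                           price_in_per_million, price_out_per_million):
    """Upper bound on hourly spend if a client saturates both token limits."""
    input_millions = input_tokens_per_min * 60 / 1_000_000
    output_millions = output_tokens_per_min * 60 / 1_000_000
    return (input_millions * price_in_per_million
            + output_millions * price_out_per_million)


# Hypothetical prices: 1.00 per million input tokens, 2.00 per million output tokens.
ceiling = worst_case_hourly_cost(100_000, 50_000, 1.0, 2.0)  # 12.0 per hour
```

With limits of 100,000 input and 50,000 output tokens per minute, the client can process at most 6 million input and 3 million output tokens per hour, so spend is bounded regardless of how the client behaves.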

---

### 12. Summary Table

| Aspect          | Role                   |
| --------------- | ---------------------- |
| Protects system | Prevents overload      |
| Protects users  | Fair resource sharing  |
| Protects budget | Predictable spend      |
| Protects model  | Prevents misuse        |
| Controls agents | Stops runaway behavior |

---

### 13. Key Takeaway

> **Rate limiting is the primary safety valve of large-scale generative AI infrastructure.**
