# Understanding and Handling OpenAI Rate Limits (with Responses API)

This notebook walks through:

- What **rate limits** are and why they exist  
- How **rate limit errors** surface (HTTP 429 and `RateLimitError`), for example when sending many parallel requests with `AsyncOpenAI`  
- How to **monitor token usage** using the **Responses API**  
- Practical patterns for handling rate limits safely in production


## 1. What are rate limits?

When you call the OpenAI API, you're sharing a pool of compute capacity with many other users.  
To keep the system fair, fast, and stable, OpenAI enforces **rate limits** — caps on how much you can use the API over some window of time.

Common types of limits include:

- **Requests per minute (RPM)** – how many requests you can send per minute  
- **Tokens per minute (TPM)** – how many input + output tokens you can use per minute  
- Sometimes **requests per day / tokens per day**, depending on your plan

If you exceed these limits, the API responds with **HTTP 429 (Too Many Requests)** and the SDK raises a `RateLimitError`.
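
Because RPM and TPM apply at the same time, whichever limit you reach first determines your effective throughput. As a rough sketch (the function name and all numbers below are hypothetical, not real account limits), the sustainable request rate can be estimated like this:

```python
def max_sustainable_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> int:
    """Estimate how many requests per minute fit under both RPM and TPM."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # TPM allows at most this many requests per minute at the given request size.
    requests_allowed_by_tpm = tpm_limit // avg_tokens_per_request
    # The stricter of the two limits wins.
    return min(rpm_limit, requests_allowed_by_tpm)


# Hypothetical limits: 500 RPM and 200,000 TPM, with about 1,000 tokens per request.
print(max_sustainable_rpm(rpm_limit=500, tpm_limit=200_000, avg_tokens_per_request=1_000))  # 200
```

With these made-up numbers the token limit, not the request limit, is the bottleneck.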

---

### Why do rate limits exist?

Rate limits help:

- Prevent any single user from overloading the system  
- Keep **latency predictable** for everyone  
- Protect against accidental infinite loops or runaway scripts  
- Let OpenAI plan capacity and pricing fairly

---

### How are rate limits enforced?

At a high level, OpenAI tracks your usage across a rolling time window (for example: per minute).  
If a new request would push you beyond your allowed RPM/TPM, that request is rejected with a rate‑limit error.
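
To make the rolling-window idea concrete, here is a minimal client-side approximation. It is only a sketch: the class name `SlidingWindowLimiter` and its parameters are assumptions for illustration, and OpenAI's actual server-side accounting may work differently.

```python
from collections import deque


class SlidingWindowLimiter:
    """Client-side approximation of RPM/TPM accounting over a rolling window."""

    def __init__(self, max_requests: int, max_tokens: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._events = deque()  # one (timestamp, tokens) pair per accepted request

    def _evict_expired(self, now: float) -> None:
        # Drop requests that have fallen out of the rolling window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()

    def try_acquire(self, tokens: int, now: float) -> bool:
        """Return True and record the request if it fits under both limits."""
        self._evict_expired(now)
        used_tokens = sum(t for _, t in self._events)
        if len(self._events) + 1 > self.max_requests:
            return False  # would exceed the request limit
        if used_tokens + tokens > self.max_tokens:
            return False  # would exceed the token limit
        self._events.append((now, tokens))
        return True
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the sketch easy to test without waiting for real time to pass.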

**Important:**

- Your exact limits depend on your **account, plan, and model**.  
- Limits can differ per **project** or **organization**.  
- You can see your limits and usage in the [OpenAI dashboard](https://platform.openai.com/dashboard).

We'll now move from theory to code.


## 2. Setup

Run the cell below **once** in your environment to install dependencies if you haven't already.

- `openai` – official Python SDK
- `python-dotenv` – optional, for loading `OPENAI_API_KEY` from a `.env` file


In [None]:
# Optional: install dependencies (uncomment if needed)
# !pip install --upgrade openai python-dotenv

### Configure the OpenAI client

You should set your API key in an environment variable:

```bash
export OPENAI_API_KEY="sk-..."
```

Or put it in a `.env` file alongside this notebook:

```text
OPENAI_API_KEY=sk-...
```

The code below:

- Loads environment variables from `.env` if present
- Creates both synchronous `OpenAI` and asynchronous `AsyncOpenAI` clients


In [None]:
import os
from openai import OpenAI, AsyncOpenAI, RateLimitError
from dotenv import load_dotenv

# Load environment variables from .env if present
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set. Please set it in your environment or .env file.")

# Synchronous and asynchronous clients
client = OpenAI(api_key=api_key)
async_client = AsyncOpenAI(api_key=api_key)

# We'll use a small, inexpensive model so it's cheap to experiment.
DEFAULT_MODEL = "gpt-5-nano"  # adjust if this model name changes in your account
model_list = client.models.list()
print("Available models:", [model.id for model in model_list.data])


## 3. Basic Responses API Call + Token Usage

The **Responses API** is the recommended way to generate text.  
Every response includes a `usage` object that reports how many tokens were consumed.

Let's:
1. Make a simple call.
2. Inspect `response.usage`.


In [None]:
# Simple Responses API call and usage inspection

response = client.responses.create(
    model=DEFAULT_MODEL,
    input="""Explain what API rate limits are in 2 short bullet points."""
)

print("Model output:\n", response.output_text)

print("\nToken usage:")
print(response.usage)
print("Input tokens:", response.usage.input_tokens)
print("Output tokens:", response.usage.output_tokens)
print("Total tokens:", response.usage.total_tokens)


## 4. Tracking Token Usage Across Multiple Calls

It's often useful to track cumulative token usage in your application for:

- **Cost alerting / budgeting**
- Staying under **tokens-per-minute (TPM)** limits
- Logging / analytics

Below is a helper wrapper that:

- Calls the model
- Logs the usage of each call
- Keeps running totals for input, output, and total tokens

You can also use external observability tools such as [arize-phoenix](https://arize.com/docs/phoenix/get-started/get-started-tracing) to trace your calls and analyze token usage.


In [None]:
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenStats:
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_tokens: int = 0
    history: List[dict] = field(default_factory=list)

    def update(self, usage):
        self.total_input_tokens += usage.input_tokens
        self.total_output_tokens += usage.output_tokens
        self.total_tokens += usage.total_tokens
        self.history.append(
            {
                "input_tokens": usage.input_tokens,
                "output_tokens": usage.output_tokens,
                "total_tokens": usage.total_tokens,
            }
        )

    def summary(self):
        return {
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "total_tokens": self.total_tokens,
            "num_calls": len(self.history),
        }


token_stats = TokenStats()


def ask_with_tracking(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Call the Responses API and update token stats."""
    response = client.responses.create(
        model=model,
        input=prompt,
    )

    token_stats.update(response.usage)
    print("Prompt:", prompt)
    print("Response:\n", response.output_text)
    print("Usage for this call:", response.usage)
    print("\nRunning totals:", token_stats.summary())
    print("-" * 60)
    return response.output_text


# Example: make a few calls and watch the cumulative usage grow
ask_with_tracking("Give me 3 synonyms for 'fast'.")
ask_with_tracking("Give me 3 synonyms for 'slow'.")
ask_with_tracking("Explain what tokens are in one short paragraph.")
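
For the cost alerting / budgeting use case mentioned above, running totals can be compared against a fixed budget. The following is a minimal sketch; `TokenBudget`, `max_total_tokens`, and `warn_fraction` are hypothetical names introduced for illustration, not part of the SDK.

```python
class TokenBudget:
    """Tracks cumulative tokens against a fixed budget (hypothetical helper)."""

    def __init__(self, max_total_tokens: int, warn_fraction: float = 0.8):
        self.max_total_tokens = max_total_tokens
        self.warn_fraction = warn_fraction
        self.used_tokens = 0

    def record(self, total_tokens: int) -> str:
        """Add one call's tokens and return 'ok', 'warn', or 'exceeded'."""
        self.used_tokens += total_tokens
        if self.used_tokens > self.max_total_tokens:
            return "exceeded"
        if self.used_tokens >= self.warn_fraction * self.max_total_tokens:
            return "warn"
        return "ok"
```

One possible way to use it is to call `budget.record(response.usage.total_tokens)` after each request and stop or alert once it returns `"exceeded"`.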

### Inspecting Rate-Limit Headers

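Some endpoints return rate-limit information in response headers such as `x-ratelimit-limit-requests`, `x-ratelimit-remaining-requests`, `x-ratelimit-remaining-tokens`, and `x-ratelimit-reset-requests`.  
You can inspect these headers using the `with_raw_response` helper.

This can be useful to see:

- Your request and token limits for the current window
- How many requests/tokens you have left
- How long until the limits reset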


In [None]:
# Inspect raw HTTP response headers to see rate-limit information (if present)

raw = client.responses.with_raw_response.create(
    model=DEFAULT_MODEL,
    input="""Say 'hello' in 3 different languages."""
)

print("Status code:", raw.status_code)

print("\nRate-limit related headers (if present):")
for name, value in raw.headers.items():
    if "ratelimit" in name.lower():
        print(f"{name}: {value}")

parsed = raw.parse()
print("\nOutput:", parsed.output_text)
print("Token usage:", parsed.usage)
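
Reset headers such as `x-ratelimit-reset-requests` report a duration rather than a timestamp, with documented example values like `1s` or `6m0s`. The sketch below converts such strings to seconds. `parse_reset_duration` is a hypothetical helper, and the supported units (`h`, `m`, `s`, `ms`) are an assumption based on those examples.

```python
import re

# Matches one number-plus-unit part; "ms" is listed before "m" so it wins.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_reset_duration(value: str) -> float:
    """Convert a duration string like '6m0s' or '20ms' into seconds."""
    parts = _DURATION_PART.findall(value)
    # Reject strings that contain anything other than number-plus-unit parts.
    if not parts or "".join(n + u for n, u in parts) != value.strip():
        raise ValueError(f"Unrecognized duration format: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


print(parse_reset_duration("6m0s"))  # 360.0
```

A value parsed this way could serve as the sleep time before the next request, instead of a fixed guess.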


## 5. Handling Rate Limits Safely in Production

In real applications, you should **never** just hammer the API until it breaks.  
Instead, implement a **retry with exponential backoff** pattern when you see `RateLimitError` (429).

Key ideas:

- Catch `openai.RateLimitError` specifically  
- Wait (sleep) for a bit before retrying  
- Increase the wait time with each retry (exponential backoff)  
- Add **jitter** (a bit of randomness) so multiple workers don't synchronize their retries
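
The delay calculation behind these ideas can be isolated as a small pure function, which makes the schedule easy to inspect. This is a sketch under assumptions: `backoff_delay` is a hypothetical helper, and the `max_delay` cap is an addition that this notebook does not otherwise use.

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 30.0, rng=random) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    # Double the delay on each attempt: base, 2x base, 4x base, ...
    exponential = base_delay * (2 ** (attempt - 1))
    # Cap the pre-jitter delay so late retries do not wait unboundedly long.
    capped = min(exponential, max_delay)
    # Jitter scales the delay by a random factor in [0.5, 1.5).
    return capped * (0.5 + rng.random())


for attempt in range(1, 6):
    print(attempt, round(backoff_delay(attempt), 2))
```

Because the jitter is applied after the cap, an individual delay can reach up to 1.5 times `max_delay`.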


In [None]:
import random
import openai
import asyncio
from typing import Optional


async def safe_call_with_backoff(
    prompt: str,
    model: str = DEFAULT_MODEL,
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Call the Responses API with retries and exponential backoff on rate limits."""
    attempt = 0

    while attempt < max_attempts:
        try:
            response = await async_client.responses.create(
                model=model,
                input=prompt,
                max_output_tokens=64,  # small cap for the demo; raise it if the output comes back empty or truncated
            )
            print("Success on attempt", attempt + 1)
            print("Output:", response.output_text)
            print("Token usage:", response.usage)
            return response.output_text

        except openai.RateLimitError as e:
            attempt += 1
            if attempt >= max_attempts:
                print("Gave up after", max_attempts, "attempts due to rate limits.")
                raise

            # Exponential backoff with jitter
            delay = base_delay * (2 ** (attempt - 1))
            delay = delay * (0.5 + random.random())  # add jitter between 0.5x and 1.5x
            print(f"Rate limit hit (attempt {attempt}). Sleeping for {delay:.2f} seconds...")
            await asyncio.sleep(delay)

        except Exception as e:
            print("Unexpected error:", e)
            raise


# Example: use the safe caller
async def demo_safe_call():
    await safe_call_with_backoff("Give me a funny fact about turtles.")


# Top-level `await` works in Jupyter/IPython; in a plain script use asyncio.run(demo_safe_call()).
await demo_safe_call()
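
Backoff handles rate limits after they occur. Bounding concurrency helps avoid them in the first place. A minimal sketch using `asyncio.Semaphore` follows. `gather_with_limit` is a hypothetical helper, and it takes zero-argument callables so that no coroutine starts before a slot is free.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def gather_with_limit(
    factories: List[Callable[[], Awaitable[T]]],
    max_concurrency: int,
) -> List[T]:
    """Run the given coroutine factories with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(factory: Callable[[], Awaitable[T]]) -> T:
        # Wait for a free slot before creating and awaiting the coroutine.
        async with semaphore:
            return await factory()

    # Results come back in the same order as `factories`.
    return await asyncio.gather(*(run_one(f) for f in factories))
```

In this notebook it could be combined with the helper above, for example `await gather_with_limit([lambda p=p: safe_call_with_backoff(p) for p in prompts], max_concurrency=3)`, where `prompts` is a list of strings you supply.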

## 6. Summary

In this notebook, you:

1. Learned what **rate limits** are and why they exist.  
2. Saw how to **inspect token usage** via `response.usage` and track cumulative tokens.  
3. Used `with_raw_response` to inspect **rate-limit headers** (when present).  
4. Implemented a **robust retry with exponential backoff** pattern using `RateLimitError`.

You can now:

- Adapt the tracking logic to feed dashboards, alerting, or cost controls  
- Wrap all your API calls with a reusable `safe_call_with_backoff` helper  
- Tune concurrency (the number of parallel requests) and delays to stay within your real limits

Feel free to customize the code for your application's structure and logging style.
