# Rate Limits & Efficient Prompting — Using OpenAI Responses API


This notebook explains API rate limits, practical strategies to handle them, and demonstrates how to use the OpenAI **Responses API** with robust retry/backoff logic, batching, and efficient prompting techniques. Run the code cells after setting your `OPENAI_API_KEY` in the environment.

## Quick summary: What are Rate Limits?

- **Rate limits** control how many requests or tokens you can send to the API in a time window.
- You may encounter `429` HTTP errors when you exceed a limit.
- Limits exist per-account, per-model, and per-endpoint. They can change; always check your account dashboard.

This notebook covers:

1. Detecting and handling `429` errors with exponential backoff.
2. Reducing request volume with batching and caching, and improving responsiveness with streaming.
3. Reducing token usage: concise prompts, system messages, and few-shot design.


### Learn

- **Why they exist:** protect service stability, ensure fair usage, and control costs.
- **Common headers to read:** `x-ratelimit-limit`, `x-ratelimit-remaining`, `x-ratelimit-reset` (names vary by provider; OpenAI, for example, documents separate `-requests` and `-tokens` variants such as `x-ratelimit-remaining-requests`). These headers tell you how much quota remains and when the window resets.
- **Simple handling:** when you get a 429, pause (`sleep`) and retry; double the pause after each consecutive failure (exponential backoff).
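
The reset value is not always a Unix timestamp. OpenAI's documented `x-ratelimit-reset-requests` and `x-ratelimit-reset-tokens` headers, for instance, carry durations such as `1s` or `6m0s`. A minimal sketch for converting such strings to seconds follows. `parse_reset_duration` is a hypothetical helper, and the accepted units (`h`, `m`, `s`, `ms`) are an assumption about the format rather than a specification.

```python
import re

# Seconds per unit; "ms" must be tried before "m" and "s" in the pattern below.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")

def parse_reset_duration(value):
    """Convert strings such as '1s', '6m0s', or '250ms' to seconds.

    Returns None when the value does not look like a duration
    (for example, when it is a plain Unix timestamp).
    """
    value = value.strip()
    parts = _PART.findall(value)
    # Reject input that contains anything besides number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != value:
        return None
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)

print(parse_reset_duration("6m0s"))        # 360.0
print(parse_reset_duration("1s"))          # 1.0
print(parse_reset_duration("1700000000"))  # None
```

A caller can fall back to timestamp parsing or plain exponential backoff whenever the helper returns `None`.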

### Try

Below are runnable examples:
- a) a simulated quick-loop that shows hitting a rate limit and fixing with a sleep
- b) a robust wrapper that inspects exceptions for rate-limit headers and waits accordingly


In [None]:


import time
import random

class SimulatedAPI:
    def __init__(self, allowed_per_minute=10):
        self.allowed = allowed_per_minute
        self.calls = 0
        self.window_start = time.time()

    def call(self):
        # reset window after 60s
        now = time.time()
        if now - self.window_start >= 60:
            self.window_start = now
            self.calls = 0
        self.calls += 1
        if self.calls > self.allowed:
            # simulate 429 with headers
            raise Exception('429 Too Many Requests', {'x-ratelimit-remaining': '0', 'x-ratelimit-reset': str(int(self.window_start + 60))})
        return {'status': 200, 'data': f'result_{self.calls}'}

sim = SimulatedAPI(allowed_per_minute=5)

# Try calling quickly in a loop without pause (will trigger simulated 429)
results = []
for i in range(8):
    try:
        r = sim.call()
        results.append(r)
        print('OK', r)
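    except Exception as e:
        err_msg, headers = e.args if len(e.args) > 1 else (str(e), {})
        print('Error:', err_msg, 'headers=', headers)
        # A fixed short sleep (e.g. 5s) would fail again because the simulated window lasts 60s.
        # Wait until the reset time advertised in the header (+1s margin for rounding).
        reset_ts = int(headers.get('x-ratelimit-reset', 0))
        wait = max(0.0, reset_ts - time.time()) + 1
        print(f'Sleeping {wait:.0f}s until the rate-limit window resets...')
        time.sleep(wait)
        try:
            r = sim.call()
            results.append(r)
            print('Retry OK', r)
        except Exception as e2:
            print('Retry still failed:', e2)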
print('Simulation done. Results collected:', len(results))

In [None]:
# Use this pattern around your openai.responses.create calls. It checks exception messages and optional headers
# to determine how long to wait before retrying.
import os
try:
    import openai
except ImportError:
    raise ImportError("Please install the openai package: pip install openai (or: uv add openai)")
openai.api_key = os.environ.get('OPENAI_API_KEY') or '<PUT_YOUR_API_KEY_HERE>'

def call_with_rate_handling(fn, max_retries=6, base_delay=1.0, max_delay=60.0):
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as e:
            attempt += 1
            # Try to extract HTTP status or headers if present
            err_str = str(e).lower()
            headers = {}
            # openai>=1.0 status errors expose the HTTP response as e.response (with .headers);
            # other clients store headers on the exception or, like the simulation above, in e.args[1].
            response = getattr(e, 'response', None)
            if response is not None and getattr(response, 'headers', None) is not None:
                headers = dict(response.headers)
            elif hasattr(e, 'headers') and isinstance(e.headers, dict):
                headers = e.headers
            elif len(e.args) > 1 and isinstance(e.args[1], dict):
                headers = e.args[1]
            remaining = headers.get('x-ratelimit-remaining') or headers.get('x-ratelimit_remaining') or None
            reset = headers.get('x-ratelimit-reset') or headers.get('x-ratelimit_reset') or None
            is_429 = (getattr(e, 'status_code', None) == 429 or '429' in err_str
                      or 'rate limit' in err_str or 'too many' in err_str)
            if not is_429 or attempt > max_retries:
                print('Not retrying; re-raising. Error:', e)
                raise
            # If the server told us when it resets, use that
            if reset:
                try:
                    reset_ts = int(reset)
                    wait = max(0, reset_ts - int(time.time()))
                    wait = min(wait, max_delay)
                    print(f'Received reset header; sleeping {wait}s before retry (headers: {headers})')
                except Exception:
                    wait = min(max_delay, base_delay * (2 ** (attempt - 1)))
                    wait = wait * random.uniform(0.5, 1.0)
                    print(f'Could not parse reset header; using backoff {wait:.1f}s (attempt {attempt})')
            else:
                wait = min(max_delay, base_delay * (2 ** (attempt - 1)))
                wait = wait * random.uniform(0.5, 1.0)
                print(f'Rate limited; backing off {wait:.1f}s (attempt {attempt})')
            time.sleep(wait)

# Example usage with the OpenAI Responses API (makes a real API call; comment out the last two lines to skip it):
def make_api_call():
    return openai.responses.create(model='gpt-4.1-mini', input=[{'role':'user','content':'Explain rate limits in one sentence.'}])
resp = call_with_rate_handling(make_api_call)
print(resp)
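
Retrying reacts after a 429 has already happened. A complementary client-side option is to pace requests so the limit is rarely reached. The sketch below is an illustration and not part of the OpenAI SDK: `RequestPacer` and its parameters are hypothetical names, and `max_per_minute` is assumed to come from the limit shown in your account dashboard.

```python
import time

class RequestPacer:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock          # injectable for testing
        self._sleep = sleep          # injectable for testing
        self._next_allowed = None    # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._clock()
        self._next_allowed = now + self.min_interval
        return delay

# Demo: 600 requests per minute means one request every 0.1 seconds.
pacer = RequestPacer(max_per_minute=600)
for i in range(3):
    slept = pacer.wait()
    print(f"request {i}: waited {slept:.2f}s")
```

Calling `pacer.wait()` immediately before each API request keeps the request rate under the configured ceiling; it does not account for token-based limits.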

## Example: Basic Responses API call

A simple example using the `openai` Python client and `responses` API. Replace the model with one available to your account.


In [None]:
import os


openai.api_key = os.environ.get('OPENAI_API_KEY') or '<PUT_YOUR_API_KEY_HERE>'

def simple_response_example():
    print("Sending a simple prompt to Responses API (example).")
    resp = openai.responses.create(
        model="gpt-4.1-mini",  # replace with a model you have access to
        input=[
            {"role": "system", "content": "You are a concise assistant that answers in plain English."},
            {"role": "user", "content": "Explain rate limits in two sentences."}
        ]
    )
    # The SDK returns a typed Response object; model_dump() converts it to a plain dict
    print("Response object keys:", list(resp.model_dump().keys()))
    # Recent SDK versions expose the concatenated text as resp.output_text
    text = getattr(resp, "output_text", None)
    if not text:
        # Fallback: walk resp.output[i].content[j].text (items are SDK objects, not dicts)
        texts = []
        for item in resp.output or []:
            for c in getattr(item, "content", None) or []:
                piece = getattr(c, "text", None)
                if piece:
                    texts.append(piece)
        text = '\n'.join(texts)
    print(text if text else "(no text output found)")

# Run example (comment out if you don't want to call the API now)
simple_response_example()

## Robust retry/backoff helper (handles 429s and transient errors)

This helper retries on HTTP 429 (rate limit) and on some transient server errors. It uses exponential backoff with jitter.
Use this wrapper around any API call that may be rate-limited.
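
To make the schedule concrete, the sketch below computes the delay before each retry. The helper in this section scales the capped delay by `random.uniform(0.5, 1.0)`, so waits land between half and all of the capped value; that range is commonly called equal jitter, whereas full jitter draws from zero up to the cap. `backoff_delay` and its `jitter` argument are hypothetical names used only for illustration.

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=60.0, jitter="equal", rng=random):
    """Seconds to wait before retry number `attempt` (1-based)."""
    capped = min(max_delay, base_delay * 2 ** (attempt - 1))
    if jitter == "none":
        return capped
    if jitter == "full":
        # Anywhere between 0 and the capped delay.
        return rng.uniform(0.0, capped)
    if jitter == "equal":
        # Half fixed, half random: between capped/2 and capped.
        return capped / 2 + rng.uniform(0.0, capped / 2)
    raise ValueError(f"unknown jitter mode: {jitter!r}")

# Deterministic schedule without jitter, using the defaults above.
print([backoff_delay(n, jitter="none") for n in range(1, 8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

Jitter matters when many clients are throttled at the same moment: without it they would all retry in lockstep and collide again.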


In [None]:
import random
import time

try:
    # Optional: only relevant if fn() uses the requests library; the openai SDK (v1+) uses httpx.
    from requests.exceptions import RequestException
except ImportError:
    RequestException = ()  # empty tuple: isinstance(e, ()) is always False

def retry_with_backoff(fn, max_retries=6, base_delay=1.0, max_delay=60.0, allowed_exceptions=(Exception,)):
    """Call fn() and retry on rate-limit or transient errors with exponential backoff + jitter.
    fn should be a callable that performs an API call and either returns a value or raises an Exception.
    Only exceptions matching allowed_exceptions are considered for retry; others propagate immediately.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except allowed_exceptions as e:
            attempt += 1
            err_str = str(e).lower()
            # openai>=1.0 status errors carry the HTTP status in e.status_code
            status = getattr(e, 'status_code', None)
            is_rate = status == 429 or ('429' in err_str) or ('rate limit' in err_str) or ('rate_limit' in err_str)
            is_transient = (
                isinstance(e, RequestException) or isinstance(e, (TimeoutError, ConnectionError))
                or (isinstance(status, int) and status >= 500)
                or 'timed out' in err_str or 'tempor' in err_str
            )
            if attempt > max_retries or not (is_rate or is_transient):
                print('Not retrying; re-raising exception. Last error:', e)
                raise
            # exponential backoff with jitter: wait 50-100% of the capped delay ("equal jitter")
            sleep = min(max_delay, base_delay * (2 ** (attempt - 1)))
            sleep = sleep * random.uniform(0.5, 1.0)
            print(f"Retry {attempt}/{max_retries} after error: {e}. Sleeping {sleep:.2f}s...")
            time.sleep(sleep)

# Example usage wrapping the Responses API call
def responses_with_retry(prompt_text):
    def call():
        return openai.responses.create(
            model="gpt-4.1-mini",
            input=[{"role":"system","content":"You are helpful."},
                   {"role":"user","content": prompt_text}],
            max_output_tokens=300
        )
    return retry_with_backoff(call)

# Example (makes a real API call; comment out to skip):
r = responses_with_retry('Explain exponential backoff and why it helps with rate limits.')
print(r)

## Efficient prompting patterns

Tips to reduce tokens and improve output quality:

1. **System message**: Keep stable instructions in one short system message instead of restating them in each user prompt. It is still sent and billed with every request, so keep it brief.
2. **Be concise**: Use short, direct prompts. Avoid long unnecessary context.
3. **Few-shot only when necessary**: Provide 1–3 examples, not dozens.
4. **Use `max_output_tokens`** to cap responses and avoid runaway outputs.
5. **Use structured output** (JSON) or `function_call`/tools when you need machine-parseable data.
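
A minimal sketch of tips 1, 2, and 4 combined: one short, reusable system message, a trimmed user prompt, and an output cap. `build_request` and `SYSTEM_PROMPT` are hypothetical names; the returned dictionary is meant to be passed as `openai.responses.create(**kwargs)`, using the same `model`, `input`, and `max_output_tokens` parameters as the other cells in this notebook.

```python
SYSTEM_PROMPT = "You are a concise assistant. Answer in at most two sentences."

def build_request(user_text, model="gpt-4.1-mini", max_output_tokens=150):
    """Return keyword arguments for a Responses API call (no request is sent here)."""
    return {
        "model": model,  # replace with a model you have access to
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            # Strip stray whitespace so it is not billed as input tokens.
            {"role": "user", "content": user_text.strip()},
        ],
        "max_output_tokens": max_output_tokens,
    }

kwargs = build_request("  Why do APIs enforce rate limits?  ")
print(kwargs["input"][1]["content"])  # Why do APIs enforce rate limits?
```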


### Example: concise vs verbose prompt

Compare two prompts and their token counts. The cell below measures both prompts; the API calls at the end are commented out, so uncomment them to send each prompt to the Responses API.


In [None]:
short_prompt = "Give me a 2-line summary of why API rate limits exist."
long_prompt = """You are an expert technical writer. Please write a detailed multi-paragraph explanation of why API rate limits exist,
including historical context, server architecture reasons, and economic considerations. Include many examples and suggestions for developers.
"""

print('Short prompt length (chars):', len(short_prompt))
print('Long prompt length (chars):', len(long_prompt))

# You can measure approximate token usage via tiktoken (if installed) or estimate via simple heuristics.
try:
    import tiktoken
    enc = tiktoken.encoding_for_model('gpt-4o-mini') if hasattr(tiktoken, 'encoding_for_model') else tiktoken.get_encoding('cl100k_base')
    s_tokens = len(enc.encode(short_prompt))
    l_tokens = len(enc.encode(long_prompt))
    print('Estimated short prompt tokens:', s_tokens)
    print('Estimated long prompt tokens:', l_tokens)
except Exception:
    print('tiktoken not available; install tiktoken to get token-level estimates (pip install tiktoken).')

# Example calls (commented out to avoid accidental API calls)
# r_short = responses_with_retry(short_prompt)
# r_long = responses_with_retry(long_prompt)
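
When `tiktoken` is not installed, a rough character-based estimate can stand in. The sketch below assumes the common rule of thumb of about four characters per token for English text; `approx_tokens` is a hypothetical helper, and the ratio will be off for code, non-English text, or unusual formatting.

```python
import math

def approx_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)

sample = "Give me a 2-line summary of why API rate limits exist."
print(len(sample), "chars ->", approx_tokens(sample), "tokens (approx.)")
```

Use the estimate only for budgeting; the `usage` field of a real response is the authoritative count.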


## Batching & Streaming strategies

- **Batching**: Combine multiple small requests into a single request when possible (e.g., send a list of short inputs in one call). This lowers the request count and per-request overhead; the tokens still count toward any tokens-per-minute limit.
- **Streaming**: If supported, streaming reduces perceived latency and can be useful for long outputs; it doesn't necessarily reduce tokens but improves responsiveness.

Example: how to batch multiple independent prompts in one Responses API call by joining inputs.


In [None]:
# Simple batching pattern: join inputs and include separators; parse outputs afterward.
prompts = [
    'Summarize the concept of rate limiting in one line.',
    'Give one tip to reduce API usage for developers.'
]

batched_input = '\n---\n'.join(prompts)
print('Batched input length:', len(batched_input))

# Example call (makes a real API call; comment out to skip):
resp = responses_with_retry(batched_input)
print(resp)

# Note: alternatively you can send a list of messages in the input array and post-process the response
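
Joining prompts is only half of batching; the combined answer also has to be split back into one answer per prompt. One possible approach, sketched below, tags each item with a number and asks the model to repeat the tag. `build_batched_prompt`, `split_batched_answers`, and the `[n]` tag format are assumptions made for illustration, and a model may not always follow the requested format, so missing items are returned as `None`.

```python
import re

def build_batched_prompt(prompts):
    """Number each prompt and ask for tagged answers."""
    header = (
        "Answer each numbered item separately. "
        "Start every answer on its own line with the item's tag, e.g. [1]."
    )
    numbered = [f"[{i}] {p}" for i, p in enumerate(prompts, start=1)]
    return header + "\n\n" + "\n---\n".join(numbered)

def split_batched_answers(text, expected):
    """Map item number -> answer text; items the model skipped map to None."""
    answers = {i: None for i in range(1, expected + 1)}
    # With one capture group, re.split yields [prefix, num, body, num, body, ...].
    pieces = re.split(r"(?m)^\[(\d+)\]\s*", text)
    for num, body in zip(pieces[1::2], pieces[2::2]):
        idx = int(num)
        if idx in answers:
            answers[idx] = body.strip()
    return answers

# Parsing a hand-written reply in the requested format (no API call).
reply = "[1] Rate limiting caps request volume.\n[2] Cache repeated prompts.\n"
print(split_batched_answers(reply, expected=3))
# {1: 'Rate limiting caps request volume.', 2: 'Cache repeated prompts.', 3: None}
```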


## Saving Cost and Time

### Learn

- **Prompt caching:** store input → output mappings so repeated identical prompts reuse cached results instead of calling the API again.
- **Smaller prompts:** remove unnecessary context; summarize long histories before sending.
- **Pick smaller models:** use cheaper models for simple tasks and route hard tasks to larger models (task routing).
- **Stop early:** use `max_output_tokens` (or `max_tokens`) to cap output length and control cost.
- **Memory reuse:** keep recent useful outputs in memory rather than re-requesting.
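
For the memory-reuse idea, an in-process cache with an expiry time is often enough. The sketch below is a hypothetical `TTLCache` (not a library class) that returns a stored value until it is older than `ttl_seconds`; the stand-in `fake_llm` function takes the place of a real API call.

```python
import time

class TTLCache:
    """Keep recent results in memory and expire them after ttl_seconds."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock   # injectable for testing
        self._store = {}      # key -> (expires_at, value)

    def get_or_call(self, key, fn):
        """Return (value, was_cached); call fn() only on a miss or after expiry."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1], True
        value = fn()
        self._store[key] = (now + self.ttl, value)
        return value, False

def fake_llm():
    # Stand-in for an API call.
    return "cached answer"

cache = TTLCache(ttl_seconds=60)
print(cache.get_or_call("tips", fake_llm))  # ('cached answer', False)
print(cache.get_or_call("tips", fake_llm))  # ('cached answer', True)
```

Unlike a cache on disk, this one is lost when the process exits, which suits short-lived answers that should not be reused for long.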

### Try

Below is a comparison: ask the same question twice, first without a cached answer and then with one, and compare the elapsed time.


In [None]:
# Prompt caching example (real OpenAI Responses API call)
import time
import json
import os
import hashlib
import openai

openai.api_key = os.environ.get("OPENAI_API_KEY") or "<PUT_YOUR_KEY_HERE>"

CACHE_DIR = '.prompt_cache'
os.makedirs(CACHE_DIR, exist_ok=True)

def cache_key(prompt):
    return hashlib.sha256(prompt.encode('utf-8')).hexdigest()

def save_cache(prompt, response_text):
    key = cache_key(prompt)
    path = os.path.join(CACHE_DIR, key + '.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'prompt': prompt, 'response': response_text, 'ts': time.time()}, f)
    return path

def load_cache(prompt):
    key = cache_key(prompt)
    path = os.path.join(CACHE_DIR, key + '.json')
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    return None

# ---- REAL Responses API call ----
def call_llm(prompt):
    """Call OpenAI Responses API and return plain text output."""
    resp = openai.responses.create(
        model="gpt-5-mini",
        input=[
            {"role": "system", "content": "Provide concise helpful answers."},
            {"role": "user", "content": prompt}
        ],
        max_output_tokens=200
    )

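    # Extract text safely from resp.output.
    # Some output items (e.g. reasoning items) may have no text content, so guard each level.
    txt = getattr(resp, "output_text", "") or ""
    if not txt:
        for item in resp.output or []:
            for c in getattr(item, "content", None) or []:
                txt += getattr(c, "text", "") or ""
    return txt.strip()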

# ---- Compare cached vs uncached ----
prompt = "List three quick tips to reduce API costs."

print("First call (no cache)")
t0 = time.time()
cached = load_cache(prompt)
if cached:
    print("Loaded from cache:", cached["response"])
else:
    out = call_llm(prompt)
    save_cache(prompt, out)
    print("API response:", out)
t1 = time.time()
print("First call took: {:.3f}s".format(t1 - t0))

print("\nSecond call (with cache)")
t0 = time.time()
cached2 = load_cache(prompt)
if cached2:
    print("Loaded from cache:", cached2["response"])
else:
    out2 = call_llm(prompt)
    save_cache(prompt, out2)
    print("API response:", out2)
t1 = time.time()
print("Second call took: {:.5f}s".format(t1 - t0))

## Monitoring Cost and Usage

### Learn

- Inspect `response.usage`: the Responses API reports `input_tokens`, `output_tokens`, and `total_tokens`, while Chat Completions uses `prompt_tokens` and `completion_tokens`.  
- Log tokens, time, and estimated cost per call.  
- Keep running totals per day to estimate spend.  
- Cost formula: `(input_tokens × rate_in) + (output_tokens × rate_out)` — rates depend on model and should be retrieved from your billing/pricing page.  
- Review your actual usage and billing on the OpenAI dashboard: https://platform.openai.com/usage
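
A worked instance of the cost formula, assuming prices are quoted per million tokens. The rates below are made-up placeholders, not real prices, and `estimate_cost_usd` is a hypothetical helper. With 1,200 input tokens at $0.40 per million and 300 output tokens at $1.60 per million, each side costs $0.00048, for a total of $0.00096.

```python
def estimate_cost_usd(input_tokens, output_tokens, usd_per_1m_input, usd_per_1m_output):
    """(input_tokens x rate_in) + (output_tokens x rate_out), with rates per 1M tokens."""
    return (input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output) / 1_000_000

# Placeholder rates for illustration only; look up real ones on the pricing page.
cost = estimate_cost_usd(1200, 300, usd_per_1m_input=0.40, usd_per_1m_output=1.60)
print(f"${cost:.6f}")  # $0.000960
```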

### Try

The `track_cost()` helper below logs each call's token usage, duration, and estimated cost to a CSV file. The example makes a real API call and normalizes the usage fields of the response before logging them.



In [None]:
# Real Responses API call + robust extraction and logging (handles input_tokens/output_tokens)
import os
import time
import csv
import openai
openai.api_key = os.environ.get("OPENAI_API_KEY") or "<PUT_YOUR_API_KEY_HERE>"

MODEL = "gpt-4.1-mini"       # replace with a model you have access to
FILE_PATH = "/mnt/data/03_function_calling_and_tools.ipynb"  # the local path (runner will convert to URL)

LOG_FILE = 'api_usage_log.csv'
fieldnames = ['ts', 'prompt_tokens', 'completion_tokens', 'total_tokens', 'duration_s', 'estimated_cost_usd']


def estimate_cost(usage, rate_in_per_token=0.000003, rate_out_per_token=0.000006): # update according to model pricing
    """Estimate cost in USD. Update rates to match your model pricing (per-token costs)."""
    in_t = usage.get('prompt_tokens', 0)
    out_t = usage.get('completion_tokens', 0)
    return in_t * rate_in_per_token + out_t * rate_out_per_token

def track_cost(usage, duration_s, rate_in_per_token=0.000003, rate_out_per_token=0.000006):
    cost = estimate_cost(usage, rate_in_per_token, rate_out_per_token)
    row = {
        'ts': int(time.time()),
        'prompt_tokens': usage.get('prompt_tokens', 0),
        'completion_tokens': usage.get('completion_tokens', 0),
        'total_tokens': usage.get('total_tokens', usage.get('prompt_tokens',0)+usage.get('completion_tokens',0)),
        'duration_s': round(duration_s, 3),
        'estimated_cost_usd': round(cost, 10)
    }
    write_header = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    print('Logged:', row)
    return row

# --------- Robust usage extractor ---------
def _int_from_maybe_object(val):
    """Extract integer count from a field which might be int, str, or an object with numeric attributes."""
    if val is None:
        return 0
    if isinstance(val, int):
        return val
    if isinstance(val, str) and val.isdigit():
        return int(val)
    # It may be an SDK object like InputTokensDetails(cached_tokens=0) or similar; try attributes
    for attr in ('total', 'value', 'cached_tokens', 'count', 'n'):
        if hasattr(val, attr):
            try:
                return int(getattr(val, attr))
            except Exception:
                pass
    # try dict-like
    try:
        if isinstance(val, dict):
            for k in ('total', 'value', 'cached_tokens', 'count', 'n'):
                if k in val:
                    return int(val[k])
    except Exception:
        pass
    # As a last resort, try to coerce to int
    try:
        return int(val)
    except Exception:
        return 0

def extract_usage(resp):
    """
    Return a normalized usage dict:
      {'prompt_tokens': int, 'completion_tokens': int, 'total_tokens': int}
    Handles multiple SDK variants: input_tokens/output_tokens, prompt_tokens/completion_tokens, or nested places.
    """
    usage = {}
    # Common top-level places
    candidates = []
    try:
        if isinstance(resp, dict):
            candidates.append(resp.get('usage'))
            candidates.append(resp.get('response', {}).get('usage'))
            candidates.append(resp.get('metadata', {}).get('usage') if isinstance(resp.get('metadata'), dict) else None)
            # Also add resp keys directly
            candidates.append({k: resp.get(k) for k in ('input_tokens','output_tokens','total_tokens','prompt_tokens','completion_tokens')})
        # object-like resp
        if hasattr(resp, 'usage'):
            candidates.append(getattr(resp, 'usage'))
    except Exception:
        pass

    # Flatten & find numbers
    found = {}
    for cand in candidates:
        if not cand:
            continue
        # if it's dict-like
        if isinstance(cand, dict):
            for k,v in cand.items():
                found[k] = v
        else:
            # try to inspect attributes (SDK objects)
            for attr in ('input_tokens','output_tokens','total_tokens','prompt_tokens','completion_tokens'):
                if hasattr(cand, attr):
                    found[attr] = getattr(cand, attr)

    # Map different key names to normalized fields
    prompt_t = None
    completion_t = None
    total_t = None

    # input_tokens / output_tokens (newer SDK) -> prompt_tokens/completion_tokens
    if 'input_tokens' in found:
        prompt_t = _int_from_maybe_object(found.get('input_tokens'))
    if 'output_tokens' in found:
        completion_t = _int_from_maybe_object(found.get('output_tokens'))

    # fallback to explicit prompt_tokens/completion_tokens
    if prompt_t is None and 'prompt_tokens' in found:
        prompt_t = _int_from_maybe_object(found.get('prompt_tokens'))
    if completion_t is None and 'completion_tokens' in found:
        completion_t = _int_from_maybe_object(found.get('completion_tokens'))

    # total tokens
    if 'total_tokens' in found:
        total_t = _int_from_maybe_object(found.get('total_tokens'))

    # If only a total is available, attribute all of it to completion tokens (best-effort; no split is possible)
    if total_t is not None and (prompt_t is None and completion_t is None):
        prompt_t = 0
        completion_t = total_t

    # If prompt/completion exist but total doesn't, compute it
    if total_t is None:
        total_t = (prompt_t or 0) + (completion_t or 0)

    # final defaults
    prompt_t = int(prompt_t or 0)
    completion_t = int(completion_t or 0)
    total_t = int(total_t or 0)

    return {
        'prompt_tokens': prompt_t,
        'completion_tokens': completion_t,
        'total_tokens': total_t
    }

# --------- Extract text from resp.output ---------
def extract_text_from_output(resp):
    """
    Extract readable text from resp.output which may be a list of ResponseOutputMessage objects,
    each with content list of ResponseOutputText objects that have .text attribute.
    """
    texts = []
    try:
        # dict-like
        if isinstance(resp, dict) and 'output' in resp:
            out = resp['output']
        elif hasattr(resp, 'output'):
            out = getattr(resp, 'output')
        else:
            out = None

        if not out:
            return ""

        # iterate items
        for item in out:
            # item could be dict-like or object
            content = None
            if isinstance(item, dict):
                content = item.get('content') or item.get('text') or item.get('message') or []
            else:
                content = getattr(item, 'content', None) or getattr(item, 'text', None) or []

            # content is often a list of content pieces
            if isinstance(content, list):
                for piece in content:
                    if isinstance(piece, dict) and 'text' in piece:
                        texts.append(piece['text'])
                    else:
                        # object-like piece
                        txt = getattr(piece, 'text', None)
                        if txt:
                            texts.append(txt)
                        elif isinstance(piece, str):
                            texts.append(piece)
            elif isinstance(content, str):
                texts.append(content)
            else:
                # try to string-convert the item
                try:
                    texts.append(str(item))
                except Exception:
                    pass
    except Exception:
        pass

    return "\n".join(t for t in texts if t)


In [None]:

import random

# --------- Simple retry wrapper for transient errors / 429s ---------
def call_with_retry(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as e:
            attempt += 1
            err_s = str(e).lower()
            is_rate = ('429' in err_s) or ('rate limit' in err_s) or ('too many' in err_s)
            is_transient = 'timed out' in err_s or 'timeout' in err_s or isinstance(e, TimeoutError)
            if attempt > max_retries or not (is_rate or is_transient):
                print("Not retrying; raising:", e)
                raise
            sleep = min(max_delay, base_delay * (2 ** (attempt - 1)))
            sleep = sleep * random.uniform(0.6, 1.0)
            print(f"Attempt {attempt}/{max_retries} failed with '{e}'. Sleeping {sleep:.2f}s before retry.")
            time.sleep(sleep)


In [None]:

# --------- Build prompt and make call ---------
prompt = (
    f"Summarize the notebook at this local path (runner will convert to a URL): {FILE_PATH}\n\n"
    "Provide a concise 3-bullet summary. If file is inaccessible, say so."
)

messages = [
    {"role": "system", "content": "You are a concise assistant that summarizes notebooks in bullet points."},
    {"role": "user", "content": prompt}
]

def make_call_and_track():
    start = time.time()
    def call():
        return openai.responses.create(
            model=MODEL,
            input=messages,
            max_output_tokens=300
        )
    resp = call_with_retry(call)
    duration = time.time() - start

    # Extract text and print
    text = extract_text_from_output(resp)
    print("\n--- Model Output (clean) ---")
    print(text if text else "(no textual output extracted)")

    # Normalize usage and track cost
    usage = extract_usage(resp)
    print("\n--- Normalized usage ---")
    print(usage)
    print("-"*30)

    # Ensure total exists
    if 'total_tokens' not in usage:
        usage['total_tokens'] = usage.get('prompt_tokens', 0) + usage.get('completion_tokens', 0)

    # Log with track_cost
    try:
        row = track_cost(usage, duration)
        print("-"*30)
        print("\nTracked cost:", row)
        print("-"*30)
    except Exception as e:
        print("track_cost raised an exception:", e)

    return resp, usage, text

# ---- Run it (makes a real API call; comment out to skip) ----
resp, usage, text = make_call_and_track()
print("Done. Extracted usage:", usage)
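
To keep running totals per day, the logged rows can be grouped by date. The sketch below assumes rows shaped like the CSV log written by `track_cost()` (columns `ts`, `total_tokens`, `estimated_cost_usd`) and groups them by UTC date; `daily_totals` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime, timezone

def daily_totals(rows):
    """Aggregate log rows into {'YYYY-MM-DD': {'calls', 'total_tokens', 'estimated_cost_usd'}}."""
    totals = defaultdict(lambda: {"calls": 0, "total_tokens": 0, "estimated_cost_usd": 0.0})
    for row in rows:
        # CSV values arrive as strings, so convert explicitly.
        day = datetime.fromtimestamp(int(row["ts"]), tz=timezone.utc).date().isoformat()
        bucket = totals[day]
        bucket["calls"] += 1
        bucket["total_tokens"] += int(row["total_tokens"])
        bucket["estimated_cost_usd"] += float(row["estimated_cost_usd"])
    return dict(totals)

# Hand-made rows for illustration; real rows could come from csv.DictReader on the log file.
rows = [
    {"ts": "0", "total_tokens": "10", "estimated_cost_usd": "0.001"},
    {"ts": "60", "total_tokens": "20", "estimated_cost_usd": "0.002"},
    {"ts": "86400", "total_tokens": "5", "estimated_cost_usd": "0.0005"},
]
for day, summary in sorted(daily_totals(rows).items()):
    print(day, summary)
```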
