# OllaBridge — Local Deployment Demo (Enterprise-Style)

This notebook is a **hands-on, copy/paste-ready** tutorial for using a **local OllaBridge gateway** (OpenAI-compatible API) with clean, production-grade examples.

**What you’ll do**
- Start the gateway locally with **auto-reload**
- Verify the service (health + model readiness)
- Call **Chat Completions** with the official `openai` Python SDK
- Use **streaming** responses
- Apply **enterprise hygiene**: configuration, secrets, timeouts, retries, and logging

> **Assumptions**
> - You have already started OllaBridge (or will in Step 1) and see output similar to:
>
> `... start --host 0.0.0.0 --port 11435 --reload ...`
>
> with:
> - Local API: `http://localhost:11435/v1`
> - Health: `http://localhost:11435/health`
> - Key: `sk-ollabridge-...`

**Last updated:** 2026-01-05

## 0) Prerequisites

### Required
- Python **3.10+**
- A running **OllaBridge** gateway (local)
- A model available in your gateway (the examples below use `deepseek-r1`)

### Install client dependencies (this notebook)

In [4]:
# If you're running this notebook locally:
# python -m pip install -U openai requests python-dotenv

!python -m pip -q install -U openai requests python-dotenv

## 1) Start the gateway (auto-reload)

Run this command **in a terminal** (not inside the notebook) from your OllaBridge repo/venv:

```bash
.venv/bin/python3 -m ollabridge.cli.main start --host 0.0.0.0 --port 11435 --reload --log-level info
```

When it’s ready you should see a banner like:

- Local API: `http://localhost:11435/v1`
- Health: `http://localhost:11435/health`
- Key: `sk-ollabridge-...`

### Security note (enterprise)
Treat the key like a secret:
- do **not** commit it to git
- prefer environment variables (or a secrets manager)
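
One way to apply this in code is to read the key from the environment, fail fast when it is missing, and only ever print a masked form. This is a minimal sketch: `mask_secret` and `load_api_key` are hypothetical helper names, and `OLLAS_API_KEY` is the variable name used later in this notebook.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret: a short prefix and suffix only."""
    if len(value) <= visible * 2:
        # Too short to reveal anything safely; hide it entirely.
        return "*" * len(value)
    return f"{value[:visible]}...{value[-visible:]}"


def load_api_key(env_var: str = "OLLAS_API_KEY") -> str:
    """Read the key from the environment and fail fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or add it to an untracked .env file"
        )
    return key
```

Printing `mask_secret(load_api_key())` confirms that a key is loaded without writing the full value into notebook output.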

## 2) Configure the client (base URL + API key)

You can provide the key in either header style:
- `X-API-Key: sk-...`
- `Authorization: Bearer sk-...`

The `openai` Python SDK uses `Authorization: Bearer ...` automatically.
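
If you call the gateway without the SDK (for example with `requests`), you have to build the header yourself. The sketch below covers both styles listed above; `auth_headers` is a hypothetical helper name, and the `/v1/models` path in the comment assumes your gateway exposes the standard OpenAI-compatible endpoint.

```python
def auth_headers(api_key: str, style: str = "bearer") -> dict[str, str]:
    """Build request headers for either supported authentication style."""
    if style == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    if style == "x-api-key":
        return {"X-API-Key": api_key}
    raise ValueError(f"unknown auth style: {style!r}")


# Example with `requests` (not executed here; it needs a running gateway):
# requests.get("http://localhost:11435/v1/models",
#              headers=auth_headers(api_key, "x-api-key"), timeout=10)
```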

In [1]:
import os
from dotenv import load_dotenv

# ✅ Load variables from .env file into the environment
# This defaults to looking for a file named ".env" in the current directory
load_dotenv() 

# Now os.getenv will find the values defined in your .env file
OLLAS_BASE_URL = os.getenv("OLLAS_BASE_URL", "http://localhost:11435/v1")
OLLAS_API_KEY  = os.getenv("OLLAS_API_KEY", "sk-ollabridge-REPLACE_ME")

print("Base URL:", OLLAS_BASE_URL)
# Check if key is loaded (and not the default placeholder)
print("API key set:", OLLAS_API_KEY.startswith("sk-ollabridge-") and "REPLACE_ME" not in OLLAS_API_KEY)

Base URL: http://localhost:11435/v1
API key set: True


In [2]:
import requests

def gateway_health(base_url: str, timeout_s: int = 10):
    """Return (ok: bool, payload_or_error: object)."""
    try:
        # Strip only a trailing "/v1" so a "/v1" elsewhere in the URL stays intact.
        health_url = base_url.rstrip("/").removesuffix("/v1") + "/health"
        r = requests.get(health_url, timeout=timeout_s)
        r.raise_for_status()
        return True, r.json()
    except Exception as e:
        return False, str(e)

def require_gateway(base_url: str):
    ok, info = gateway_health(base_url, timeout_s=10)
    if ok:
        print("✅ Gateway reachable:", base_url)
        print("Health:", info)
        return True
    print("⚠️ Gateway NOT reachable:", base_url)
    print("Reason:", info)
    print("\nTip: start OllaBridge first, or set the correct base URL / tunnel URL.")
    return False

## 3) Health check (fast validation)

This should return HTTP 200 and a simple JSON payload.

In [3]:
# Health check (safe): won't crash the notebook if the gateway isn't up yet.
ok, info = gateway_health(OLLAS_BASE_URL, timeout_s=20)
if ok:
    # Print explicitly: a bare `info` inside an if-block is not displayed by Jupyter.
    print("Health:", info)
else:
    print("Health check failed: gateway is not reachable.")
    print("Reason:", info)
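
Right after starting the gateway, the first probe can fail while the server is still booting. A small polling loop handles that case. This is a sketch under stated assumptions: `wait_until_healthy` is a hypothetical helper, and `probe` is any zero-argument callable that returns the same `(ok, info)` tuple as `gateway_health`.

```python
import time
from typing import Callable, Tuple


def wait_until_healthy(
    probe: Callable[[], Tuple[bool, object]],
    attempts: int = 10,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, object]:
    """Call `probe` until it reports ok or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    result: Tuple[bool, object] = (False, "not attempted")
    for i in range(attempts):
        result = probe()
        if result[0]:
            return result
        if i < attempts - 1:  # no need to sleep after the final attempt
            sleep(delay_s)
    return result
```

In this notebook it could be called as `wait_until_healthy(lambda: gateway_health(OLLAS_BASE_URL, timeout_s=2))`. The `sleep` parameter is injectable so the loop can be tested without real delays.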

## 4) List models (optional)

If your gateway supports the standard OpenAI-compatible endpoint, this will show the models it exposes.

In [4]:
if not require_gateway(OLLAS_BASE_URL):
    print('⏭️ Skipping this cell (gateway not reachable).')
else:
    from openai import OpenAI

    client = OpenAI(
        base_url=OLLAS_BASE_URL,
        api_key=OLLAS_API_KEY,
    )

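    # Some gateways support /models; if yours doesn't, skip this cell.
    try:
        models = client.models.list()
        # Print explicitly: a bare expression inside a try/else block is not displayed by Jupyter.
        print("Models:", [m.id for m in models.data][:10])
    except Exception as e:
        print("Model listing not available (this is OK). Error:", e)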

✅ Gateway reachable: http://localhost:11435/v1
Health: {'status': 'ok', 'mode': 'gateway', 'default_model': 'llama3', 'detail': 'runtimes=1'}
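
An OpenAI-style model listing wraps its entries in a `data` array, each with an `id`. Assuming your gateway follows that shape, the sketch below extracts the IDs from a raw JSON payload and checks a preferred model against them. `model_ids` and `resolve_model` are hypothetical helpers, and the `sample` payload is made up for illustration.

```python
def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style list response."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def resolve_model(preferred: str, available: list[str]) -> str:
    """Return `preferred` if it is exposed; otherwise raise and list the alternatives."""
    if preferred in available:
        return preferred
    listed = ", ".join(available) or "none"
    raise LookupError(f"model {preferred!r} not found; available: {listed}")


# Illustrative payload only; a real gateway response may carry more fields.
sample = {
    "object": "list",
    "data": [
        {"id": "llama3", "object": "model"},
        {"id": "deepseek-r1", "object": "model"},
    ],
}
print(resolve_model("deepseek-r1", model_ids(sample)))
```

Checking the name up front turns a later "model not found" error into a message that lists the valid choices.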


## 5) Chat Completions (basic)

This is the core OpenAI-compatible usage.

In [5]:
if not require_gateway(OLLAS_BASE_URL):
    print('⏭️ Skipping this cell (gateway not reachable).')
else:
    model_name = os.getenv("OLLAS_MODEL", "deepseek-r1")

    resp = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Give me 3 bullet points on why gateways are useful in enterprise AI deployments."},
        ],
        temperature=0.2,
    )

    print(resp.choices[0].message.content)

✅ Gateway reachable: http://localhost:11435/v1
Health: {'status': 'ok', 'mode': 'gateway', 'default_model': 'llama3', 'detail': 'runtimes=1'}
Here are 3 bullet points on why gateways are useful in enterprise AI deployments:

*   **Security Enforcement:** Gateways can enforce security policies (like authentication, authorization, and data masking) at the edge or between systems, protecting sensitive AI models and data.
*   **Interoperability & Integration:** They act as intermediaries, translating data formats and handling communication between diverse systems (on-prem, cloud, SaaS) and AI platforms.
*   **Controlled Access:** Gateways provide a centralized point to manage and restrict access to internal AI infrastructure or proprietary models, preventing unauthorized exposure.


## 6) Streaming responses (production-friendly UX)

Streaming is ideal for:
- chat UIs
- long outputs
- lower perceived latency

In [6]:
if not require_gateway(OLLAS_BASE_URL):
    print('⏭️ Skipping this cell (gateway not reachable).')
else:
    stream = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "Answer in a short paragraph."},
            {"role": "user", "content": "Explain what auto-reload means for a local API gateway in development."},
        ],
        temperature=0.2,
        stream=True,
    )

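    out = []
    for event in stream:
        if not event.choices:  # some servers send chunks with no choices (e.g. usage-only)
            continue
        delta = event.choices[0].delta
        if delta and delta.content:
            out.append(delta.content)
            print(delta.content, end="", flush=True)

    print()
    full_text = "".join(out)  # complete response, available to later cells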

✅ Gateway reachable: http://localhost:11435/v1
Health: {'status': 'ok', 'mode': 'gateway', 'default_model': 'llama3', 'detail': 'runtimes=1'}
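
The accumulation logic in the loop above can be exercised without a running gateway by feeding it stand-in objects. This sketch assumes only the attribute path used above (`chunk.choices[0].delta.content`); `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks are not real SDK objects.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate `delta.content` pieces, skipping chunks without choices or text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]
print(collect_stream(demo))
```

Keeping the accumulation in a function separate from the printing makes the streaming path unit-testable.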



## 7) Robust client wrapper (timeouts, retries, structured errors)

Below is a small helper you can copy into production services.

**Why?**
- predictable timeouts
- transparent retry policy
- consistent logging and error handling

In [7]:
if not require_gateway(OLLAS_BASE_URL):
    print('⏭️ Skipping this cell (gateway not reachable).')
else:
    from dataclasses import dataclass
    from typing import List, Dict, Optional
    import time
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("ollabridge-demo")

    @dataclass
    class ChatConfig:
        model: str = "deepseek-r1"
        temperature: float = 0.2
        max_retries: int = 2
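        request_timeout_s: float = 60.0  # passed to the SDK as the per-request timeout

    def chat_once(messages: List[Dict[str, str]], cfg: ChatConfig) -> str:
        # Note: the SDK client also retries by default; pass max_retries=0 to
        # OpenAI(...) if this loop should be the only retry layer.
        last_err: Optional[Exception] = None
        for attempt in range(cfg.max_retries + 1):
            try:
                t0 = time.perf_counter()  # monotonic clock for measuring durations
                resp = client.chat.completions.create(
                    model=cfg.model,
                    messages=messages,
                    temperature=cfg.temperature,
                    timeout=cfg.request_timeout_s,
                )
                dt = time.perf_counter() - t0
                logger.info("chat completion ok (%.2fs)", dt)
                # content can be None (e.g. tool-call responses); normalise to str
                return resp.choices[0].message.content or ""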
            except Exception as e:
                last_err = e
                logger.warning("attempt %d failed: %s", attempt + 1, e)
                if attempt < cfg.max_retries:
                    time.sleep(1.5 * (attempt + 1))
        raise RuntimeError(f"All retries exhausted. Last error: {last_err}") from last_err

    print(
        chat_once(
            [{"role":"user","content":"Write a one-sentence definition of an API gateway."}],
            ChatConfig(model=model_name),
        )
    )

✅ Gateway reachable: http://localhost:11435/v1
Health: {'status': 'ok', 'mode': 'gateway', 'default_model': 'llama3', 'detail': 'runtimes=1'}


INFO:httpx:HTTP Request: POST http://localhost:11435/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ollabridge-demo:chat completion ok (3.90s)


An API gateway is an intermediary server acting as a single entry point for clients to access multiple backend services, handling routing, security, rate limiting, and other cross-cutting concerns.
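
The helper above retries on any exception, including ones that will never succeed on retry (a wrong key, an unknown model). A variation is to retry only on errors you consider transient. The sketch below keeps the same linear backoff (`1.5 * attempt`) but takes the retriable exception types as a parameter. `with_retries` is a hypothetical helper; with the `openai` SDK you would likely pass classes such as `openai.APIConnectionError`, `openai.APITimeoutError`, and `openai.RateLimitError`, which is an assumption about which errors are worth retrying in your setup.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_retries: int = 2,
    base_delay_s: float = 1.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying only on `retry_on` with linear backoff."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            sleep(base_delay_s * (attempt + 1))
    raise AssertionError("unreachable")
```

Any exception not listed in `retry_on` propagates immediately, so configuration mistakes fail fast instead of waiting through the backoff schedule.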


## 8) Troubleshooting checklist

### Gateway not reachable
- Confirm the process is running:
  - `Uvicorn running on http://0.0.0.0:11435`
- Verify the URL:
  - Base: `http://localhost:11435/v1`
  - Health: `http://localhost:11435/health`
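
To separate "nothing is listening on the port" from "the server is up but the HTTP request fails", a plain TCP connect is a quick first check. A minimal sketch using only the standard library; `port_open` is a hypothetical helper, and `11435` is the port used throughout this notebook.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


print("port 11435 open:", port_open("localhost", 11435))
```

If this prints `False`, the gateway process is not listening (or a firewall blocks the port); if it prints `True` but the health check still fails, look at the URL path or the server logs instead.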

### 401 / unauthorized
- Ensure you are using the correct key from the banner:
  - `Authorization: Bearer sk-ollabridge-...`

### Model not found
- Verify the model name in your gateway banner, and set:
  - `OLLAS_MODEL=...`

### Dev workflow (auto-reload)
- With `--reload`, edits to the code trigger an automatic server reload.
  - Great for rapid iteration
  - Avoid in production (use a fixed build + stable config)

## Next steps

- Add observability: request IDs, structured logs, metrics.
- Use a reverse proxy (TLS termination) for shared environments.
- Rotate keys and enforce allowlists/rate limits.
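
As a starting point for the observability item, the sketch below emits one JSON object per log line and attaches a request ID through the standard `extra` mechanism of `logging`. `JsonFormatter`, `new_request_id`, and the logger name are hypothetical choices for illustration, not part of OllaBridge.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Present only when the caller passes extra={"request_id": ...}.
            "request_id": getattr(record, "request_id", None),
        })


def new_request_id() -> str:
    """Generate an opaque ID to correlate log lines for one request."""
    return uuid.uuid4().hex


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("ollabridge-demo.structured")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False  # avoid duplicate lines via the root logger

log.info("chat completion ok", extra={"request_id": new_request_id()})
```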
