Run any Cloudflare Workers AI model in Claude Code — Nemotron 120B, Gemma 4, Llama 3.3, and more.
A lightweight Cloudflare Worker that translates between Anthropic's Messages API (what Claude Code speaks) and OpenAI's Chat Completions API (what Workers AI speaks). Supports streaming, tool calling, and thinking/reasoning tokens.
Claude Code CLI (Anthropic format)
→ This proxy (translates Anthropic ↔ OpenAI)
→ Cloudflare AI Gateway (routing, caching, analytics)
→ Workers AI (Nemotron, Gemma, Llama, etc.)
```shell
git clone https://github.com/quantum-encoding/claude-code-proxy.git
cd claude-code-proxy
npm install
```

Edit `src/models.ts` to add/remove models.
You need:
- A Cloudflare account
- An AI Gateway (free, takes 30 seconds)
- An AI Gateway API token with "Run" permission
```shell
# Set your secrets
echo "YOUR_GATEWAY_URL" | npx wrangler secret put CF_AI_GATEWAY_URL
echo "YOUR_AIG_TOKEN" | npx wrangler secret put CF_AIG_TOKEN
echo "YOUR_PROXY_PASSWORD" | npx wrangler secret put PROXY_AUTH_TOKEN

# Deploy
npx wrangler deploy
```

The gateway URL format is: `https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_NAME}`
```shell
cp run-model.sh.example ~/.local/bin/run-nemo
chmod +x ~/.local/bin/run-nemo
```

Edit the script and fill in your proxy URL and auth token. Then:

```shell
run-nemo
```

| Feature | Status |
|---|---|
| Text generation | Working |
| Streaming (SSE) | Working |
| Tool calling | Working |
| Thinking/reasoning | Working (as Anthropic thinking blocks) |
| System prompts | Working |
| Multi-turn conversation | Working |
| Vision/images | Not supported (Workers AI limitation) |
Edit `src/models.ts`:

```typescript
export const MODEL_MAP: Record<string, string> = {
  'nemotron': 'workers-ai/@cf/nvidia/nemotron-3-120b-a12b',
  'gemma4': 'workers-ai/@cf/google/gemma-4-26b-a4b-it',
  'llama': 'workers-ai/@cf/meta/llama-3.3-70b-instruct-fp8-fast',
  // Add your own:
  // 'my-model': 'workers-ai/@cf/provider/model-name',
};
```

Then redeploy: `npx wrangler deploy`
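As a sketch of how the alias lookup could be used (`resolveModel` is a hypothetical helper, not necessarily the repo's actual function): map the short name Claude Code sends to the full Workers AI model ID, and pass unknown names through so fully-qualified IDs still work.

```typescript
// Hypothetical alias-resolution helper (illustrative, not from the repo).
const MODEL_MAP: Record<string, string> = {
  'nemotron': 'workers-ai/@cf/nvidia/nemotron-3-120b-a12b',
  'gemma4': 'workers-ai/@cf/google/gemma-4-26b-a4b-it',
};

function resolveModel(requested: string): string {
  // Exact alias match first; otherwise pass the ID through unchanged.
  return MODEL_MAP[requested] ?? requested;
}
```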
The proxy itself runs on Cloudflare Workers free tier (100k requests/day). You only pay for Workers AI inference:
| Model | Input | Output |
|---|---|---|
| Nemotron 3 120B | $0.50/M tokens | $1.50/M tokens |
| Gemma 4 26B | $0.10/M tokens | $0.30/M tokens |
| Llama 3.3 70B | $0.20/M tokens | $0.60/M tokens |
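For a back-of-envelope estimate from the per-million-token rates above (a sketch; `costUSD` is an illustrative helper, not part of the proxy):

```typescript
// Estimate a single call's cost from token counts and $/M-token rates.
function costUSD(
  inputTokens: number,
  outputTokens: number,
  inPerM: number,
  outPerM: number,
): number {
  return (inputTokens / 1e6) * inPerM + (outputTokens / 1e6) * outPerM;
}

// e.g. a 10k-input / 2k-output Nemotron call:
// costUSD(10_000, 2_000, 0.50, 1.50) → $0.008
```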
The proxy handles two critical translations:
**Request (Anthropic → OpenAI):**

- `system` prompt → `messages[0]` with `role: "system"`
- `tool_choice.type: "any"` → `"auto"` (some models don't support `"required"`)
- `input_schema` → `parameters`
- Content blocks (text, tool_use, tool_result) → flat messages
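A minimal sketch of that request translation, with simplified types (the real proxy also flattens tool_use/tool_result blocks; `toOpenAI` is an illustrative name):

```typescript
// Illustrative Anthropic → OpenAI request translation (simplified).
type AnthropicRequest = {
  system?: string;
  messages: { role: 'user' | 'assistant'; content: string }[];
  tool_choice?: { type: 'auto' | 'any' | 'tool' };
};

type OpenAIMessage = { role: 'system' | 'user' | 'assistant'; content: string };

function toOpenAI(req: AnthropicRequest) {
  const messages: OpenAIMessage[] = [];
  // The system prompt becomes the first chat message.
  if (req.system) messages.push({ role: 'system', content: req.system });
  messages.push(...req.messages);
  // "any" maps to "auto" because some models reject "required".
  const tool_choice =
    req.tool_choice?.type === 'any' ? 'auto' : req.tool_choice?.type;
  return { messages, tool_choice };
}
```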
**Streaming (OpenAI → Anthropic):**

- `delta.content` → `content_block_delta` with `text_delta`
- `delta.reasoning` → `content_block_delta` with `thinking_delta`
- `delta.tool_calls` → buffered, then emitted as `content_block_start` + `input_json_delta`
- `finish_reason: "tool_calls"` → `stop_reason: "tool_use"`
The streaming translation uses a TransformStream to pipe events in real-time.
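The per-delta mapping can be sketched like this (assumed shapes, omitting the SSE framing, tool-call buffering, and TransformStream plumbing; `toAnthropicEvent` is a hypothetical name):

```typescript
// Illustrative mapping of one OpenAI streaming delta to an Anthropic event.
type OpenAIDelta = { content?: string; reasoning?: string };

function toAnthropicEvent(delta: OpenAIDelta, index = 0) {
  if (delta.reasoning !== undefined) {
    // Reasoning tokens surface as Anthropic thinking blocks.
    return {
      type: 'content_block_delta',
      index,
      delta: { type: 'thinking_delta', thinking: delta.reasoning },
    };
  }
  if (delta.content !== undefined) {
    return {
      type: 'content_block_delta',
      index,
      delta: { type: 'text_delta', text: delta.content },
    };
  }
  return null; // e.g. role-only or empty keep-alive deltas
}
```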
- Nemotron uses `reasoning` for all output even with `enable_thinking: false`. The proxy maps this to Anthropic thinking blocks.
- `tool_choice: "required"` crashes Nemotron. The proxy maps Anthropic's `"any"` to `"auto"` instead.
- No vision support. Workers AI models don't support image inputs through the gateway.
MIT