Exploring the agentic AI inference stack — from open-weight models to self-hosted serving to hybrid routing.
Companion repo to *Agents Don't Need Better Models. They Need Better Infrastructure.*
```bash
pip install -r requirements.txt
```

Four publication-quality figures generated from code:
```bash
python -m viz.stack    # three-layer architecture diagram
python -m viz.cost     # API vs self-hosted cost crossover
python -m viz.trace    # agent loop execution trace
python -m viz.routing  # hybrid routing decisions scatter
```

All accept `--out path/` for a custom output directory.
| Figure | What it shows |
|---|---|
| `stack` | Governance → Serving → Models with hybrid routing arrows |
| `cost` | Monthly cost crossover at ~5,500 agent calls/day |
| `trace` | Step-by-step agent loop: model → tool call → result → answer |
| `routing` | Tasks plotted by complexity, colored by local vs frontier backend |
```bash
export MISTRAL_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key

# 01 — Agent loop on Mistral Small 4
python projects/01_tool_calling/demo.py

# 02 — Hybrid router: Mistral (local) + Claude (frontier)
python projects/02_hybrid_router/demo.py --threshold 0.6
```

Both demos run the real agent/router, save trace data as JSON, and auto-generate visualizations.
Mini projects showcasing Mistral Small 4 (119B, agentic tool-calling, open weights) on the NVIDIA inference stack.
The core pattern. A `while(tool_use)` agent loop running entirely on Mistral Small 4.
The model decides which tools to call and when to stop — no hardcoded step sequence. Demonstrates that open-weight models can drive autonomous tool-calling loops, not just answer questions.
```
projects/01_tool_calling/
  agent.py  — Agent class: while(tool_use) loop with trace capture
  tools.py  — ToolRegistry + 5 built-in tools (calculator, file read, etc.)
  demo.py   — Run agent → save trace JSON → render viz
```
- Works with any OpenAI-compatible endpoint: NIM, vLLM, Mistral API
- Auto-generates tool schemas from Python type hints
- Full trace capture: tokens, latency, tool calls per turn
- Visualization: `viz/trace.py` renders the execution as a vertical flow diagram
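The `while(tool_use)` pattern at the heart of `agent.py` can be sketched as below. The client setup, message shapes, and tool-dispatch details here are assumptions based on the OpenAI-compatible chat-completions API, not the repo's exact `Agent` class:

```python
# Minimal while(tool_use) loop sketch against any OpenAI-compatible
# endpoint (NIM, vLLM, or the Mistral API). Names and message shapes
# are illustrative, not the repo's actual implementation.
import json

def run_agent(client, model, tools, tool_schemas, user_msg, max_turns=8):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tool_schemas
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:           # model chose to stop: final answer
            return msg.content
        for call in msg.tool_calls:      # model chose one or more tools
            args = json.loads(call.function.arguments)
            result = tools[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    return None  # loop budget exhausted without a final answer
```

The key property is that there is no hardcoded step sequence: the loop runs until the model stops emitting `tool_calls`, with `max_turns` as the only safety rail.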
Route by complexity. Easy calls → self-hosted Mistral. Hard calls → frontier API (Claude).
A lightweight routing layer that classifies incoming requests and dispatches them to the right backend. The same architecture described in the blog article — and the same pattern NVIDIA's OpenShell Privacy Router implements at the infrastructure level.
```
projects/02_hybrid_router/
  router.py — HybridRouter: classify → route → complete, with stats
  demo.py   — Run 15 sample tasks → save decisions JSON → render viz
```
- Classifier: Mistral Small 4 scores task complexity (0–1) in a single call
- Threshold routing: below 0.6 → local Mistral, above → Claude API
- Aggregate stats: local %, avg latency, tokens per backend
- Visualization: `viz/routing.py` plots decisions as a scatter with threshold line
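The classify → route step can be sketched as below. The scoring call is stubbed out (in the repo it is a single Mistral Small 4 call), and the dataclass and backend names are placeholders, not `router.py`'s actual API:

```python
# Threshold-routing sketch: a complexity score in [0, 1] picks the
# backend. score_fn stands in for the single classifier model call.
from dataclasses import dataclass

@dataclass
class Decision:
    task: str
    complexity: float
    backend: str  # "local-mistral" or "claude-api"

def route(task: str, score_fn, threshold: float = 0.6) -> Decision:
    score = score_fn(task)  # 0.0 = trivial, 1.0 = hard
    backend = "local-mistral" if score < threshold else "claude-api"
    return Decision(task, score, backend)

def local_fraction(decisions) -> float:
    # Aggregate stat: share of calls kept on the self-hosted backend.
    local = sum(d.backend == "local-mistral" for d in decisions)
    return local / len(decisions)
```

The economics of the pattern live in `local_fraction`: every call routed locally avoids a frontier API bill, so the threshold directly trades cost against quality.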
Port a real workflow from API to self-hosted. Same logic, different infrastructure.
The career monitor from the blog article currently runs on Claude's API — 345 calls/day across 23 companies. Port the two-stage screening pipeline (title filter → JD keyword classification) to Mistral Small 4 on NIM.
- Stage 1 (title classification): should match or exceed API accuracy
- Stage 2 (JD analysis): test whether 119B parameters handle nuanced keyword matching
- Benchmark: accuracy parity, latency difference, cost difference over 30 days
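The two-stage pipeline could be ported along these lines. The keyword list, function names, and job-record shape are hypothetical placeholders, not the career monitor's actual code:

```python
# Two-stage screening sketch (hypothetical keywords and names): a cheap
# title filter runs first so the heavier JD analysis only sees
# plausible matches. classify_jd stands in for the Mistral Small 4 call.
TITLE_KEYWORDS = {"machine learning", "ml engineer", "ai"}  # placeholder

def stage1_title_filter(title: str) -> bool:
    """Stage 1: keep only titles containing a watched keyword."""
    t = title.lower()
    return any(kw in t for kw in TITLE_KEYWORDS)

def screen(jobs, classify_jd):
    """Run stage 1 locally, then stage 2 (model call) on survivors."""
    passed = [j for j in jobs if stage1_title_filter(j["title"])]
    return [j for j in passed if classify_jd(j["description"])]
```

The point of the two-stage shape is volume: stage 1 is free string matching, so the 119B model only pays for the small fraction of postings that survive the title filter.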
Measure what the article claims. Fan-out, chaining, and iterative loops — benchmarked.
| Pattern | Workload | What to measure |
|---|---|---|
| Parallel fan-out | 23 concurrent classification calls | Throughput ceiling, rate limit impact |
| Sequential chain | 4-step transcript → summary → actions → push | End-to-end latency |
| Iterative loop | 3-pass draft → evaluate → revise | Token cost scaling, context growth |
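The parallel fan-out row could be measured with a harness like this. The endpoint call is stubbed (`call_fn` stands in for the real OpenAI-compatible request), and the concurrency cap mirrors the 23-company workload:

```python
# Fan-out benchmark sketch: fire N classification calls concurrently,
# cap in-flight requests with a semaphore, and measure wall-clock time.
import asyncio
import time

async def fan_out(call_fn, payloads, max_concurrency=23):
    sem = asyncio.Semaphore(max_concurrency)

    async def one(payload):
        async with sem:  # respect the endpoint's concurrency ceiling
            return await call_fn(payload)

    start = time.perf_counter()
    results = await asyncio.gather(*(one(p) for p in payloads))
    return results, time.perf_counter() - start
```

Sweeping `max_concurrency` against the measured wall-clock time is what exposes the throughput ceiling and any rate-limit impact from the table above.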
Agentic coding on open weights. Devstral (24B active / 123B total) as a local Copilot alternative.
Build a minimal code agent that reads a file, identifies issues, proposes fixes, and applies them — the same loop Claude Code and Codex run, but on a self-hosted model.
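One pass of that read → diagnose → fix → apply loop could be sketched as below. The model call is stubbed (`propose_fix` stands in for a Devstral completion), and applying edits is a naive whole-file rewrite rather than a diff:

```python
# Minimal code-agent pass sketch: read a file, ask the model for a
# revised version, and apply it. propose_fix stands in for the
# Devstral call; a real agent would loop until no changes remain.
from pathlib import Path

def code_agent_pass(path: str, propose_fix) -> bool:
    """Run one fix pass; return True if the file was changed."""
    source = Path(path).read_text()
    fixed = propose_fix(source)  # model returns the full revised file
    if fixed is None or fixed == source:
        return False             # model found nothing to change
    Path(path).write_text(fixed) # naive apply: overwrite the file
    return True
```

Repeating the pass until it returns `False` gives the same converge-on-clean loop Claude Code and Codex run, just with a self-hosted model behind `propose_fix`.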
```
agentic-inference/
  viz/                        # Visualization modules (matplotlib, dark theme)
    theme.py                  # Shared NVIDIA-green palette
    stack.py                  # Three-layer architecture diagram
    cost.py                   # API vs self-hosted cost crossover
    trace.py                  # Agent loop execution trace
    routing.py                # Hybrid routing decisions scatter
  projects/
    01_tool_calling/          # ✅ Agent loop + tool registry
    02_hybrid_router/         # ✅ Complexity-based routing
    03_job_scanner/           # Port career monitor to open stack
    04_inference_benchmarks/  # Benchmark fan-out, chain, loop
    05_devstral_code_agent/   # Code agent on Devstral
```
| Layer | Tool | Role |
|---|---|---|
| Orchestration | Dynamo 1.0 | Multi-node inference coordination |
| Governance | OpenShell | Sandbox + policy engine + privacy router |
| Serving | NIM | One-command model containers |
| Optimization | TensorRT-LLM | GPU compiler optimization |
| Inference | vLLM | Community inference engine |
| Context | CMX (BlueField-4) | Hardware context memory offload |
MIT