Exploring the agentic AI inference stack — from open-weight models to self-hosted serving to hybrid routing.
Companion repo to *Agents Don't Need Better Models. They Need Better Infrastructure.*
```bash
pip install -r requirements.txt
```

Four publication-quality figures generated from code:
```bash
python -m viz.stack    # three-layer architecture diagram
python -m viz.cost     # API vs self-hosted cost crossover
python -m viz.trace    # agent loop execution trace
python -m viz.routing  # hybrid routing decisions scatter
```

All accept `--out path/` for a custom output directory.
| Figure | What it shows |
|---|---|
| `stack` | Governance → Serving → Models with hybrid routing arrows |
| `cost` | Monthly cost crossover at ~5,500 agent calls/day |
| `trace` | Step-by-step agent loop: model → tool call → result → answer |
| `routing` | Tasks plotted by complexity, colored by local vs frontier backend |
```bash
export MISTRAL_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key

# 01 — Agent loop on Mistral Small 4
python projects/01_tool_calling/demo.py

# 02 — Hybrid router: Mistral (local) + Claude (frontier)
python projects/02_hybrid_router/demo.py --threshold 0.6
```

Both demos run the real agent/router, save trace data as JSON, and auto-generate visualizations.
Mini projects showcasing Mistral Small 4 (119B, agentic tool-calling, open weights) on the NVIDIA inference stack.
The core pattern. A `while(tool_use)` agent loop running entirely on Mistral Small 4.
The model decides which tools to call and when to stop — no hardcoded step sequence. Demonstrates that open-weight models can drive autonomous tool-calling loops, not just answer questions.
```
projects/01_tool_calling/
  agent.py  — Agent class: while(tool_use) loop with trace capture
  tools.py  — ToolRegistry + 5 built-in tools (calculator, file read, etc.)
  demo.py   — Run agent → save trace JSON → render viz
```
- Works with any OpenAI-compatible endpoint: NIM, vLLM, Mistral API
- Auto-generates tool schemas from Python type hints
- Full trace capture: tokens, latency, tool calls per turn
- Visualization: `viz/trace.py` renders the execution as a vertical flow diagram
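The `while(tool_use)` pattern at the heart of `agent.py` can be sketched as below. The client setup, message shapes, and tool-dispatch details here are assumptions based on the OpenAI-compatible chat-completions API, not the repo's exact `Agent` class:

```python
# Minimal while(tool_use) loop sketch against any OpenAI-compatible
# endpoint (NIM, vLLM, or the Mistral API). Names and message shapes
# are illustrative, not the repo's actual implementation.
import json

def run_agent(client, model, tools, tool_schemas, user_msg, max_turns=8):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tool_schemas
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:           # model chose to stop: final answer
            return msg.content
        for call in msg.tool_calls:      # model chose one or more tools
            args = json.loads(call.function.arguments)
            result = tools[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    return None  # loop budget exhausted without a final answer
```

The key property is that there is no hardcoded step sequence: the loop runs until the model stops emitting `tool_calls`, with `max_turns` as the only safety rail.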
Route by complexity. Easy calls → self-hosted Mistral. Hard calls → frontier API (Claude).
A lightweight routing layer that classifies incoming requests and dispatches them to the right backend. The same architecture described in the blog article — and the same pattern NVIDIA's OpenShell Privacy Router implements at the infrastructure level.
```
projects/02_hybrid_router/
  router.py — HybridRouter: classify → route → complete, with stats
  demo.py   — Run 15 sample tasks → save decisions JSON → render viz
```
- Classifier: Mistral Small 4 scores task complexity (0–1) in a single call
- Threshold routing: below 0.6 → local Mistral, above → Claude API
- Aggregate stats: local %, avg latency, tokens per backend
- Visualization: `viz/routing.py` plots decisions as a scatter with threshold line
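The classify → route step can be sketched as below. The scoring call is stubbed out (in the repo it is a single Mistral Small 4 call), and the dataclass and backend names are placeholders, not `router.py`'s actual API:

```python
# Threshold-routing sketch: a complexity score in [0, 1] picks the
# backend. score_fn stands in for the single classifier model call.
from dataclasses import dataclass

@dataclass
class Decision:
    task: str
    complexity: float
    backend: str  # "local-mistral" or "claude-api"

def route(task: str, score_fn, threshold: float = 0.6) -> Decision:
    score = score_fn(task)  # 0.0 = trivial, 1.0 = hard
    backend = "local-mistral" if score < threshold else "claude-api"
    return Decision(task, score, backend)

def local_fraction(decisions) -> float:
    # Aggregate stat: share of calls kept on the self-hosted backend.
    local = sum(d.backend == "local-mistral" for d in decisions)
    return local / len(decisions)
```

The economics of the pattern live in `local_fraction`: every call routed locally avoids a frontier API bill, so the threshold directly trades cost against quality.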
Port a real workflow from API to self-hosted. Same logic, different infrastructure.
The career monitor from the blog article currently runs on Claude's API — 345 calls/day across 23 companies. Port the two-stage screening pipeline (title filter → JD keyword classification) to Mistral Small 4 on NIM.
- Stage 1 (title classification): should match or exceed API accuracy
- Stage 2 (JD analysis): test whether 119B parameters handle nuanced keyword matching
- Benchmark: accuracy parity, latency difference, cost difference over 30 days
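The two-stage pipeline could be ported along these lines. The keyword list, function names, and job-record shape are hypothetical placeholders, not the career monitor's actual code:

```python
# Two-stage screening sketch (hypothetical keywords and names): a cheap
# title filter runs first so the heavier JD analysis only sees
# plausible matches. classify_jd stands in for the Mistral Small 4 call.
TITLE_KEYWORDS = {"machine learning", "ml engineer", "ai"}  # placeholder

def stage1_title_filter(title: str) -> bool:
    """Stage 1: keep only titles containing a watched keyword."""
    t = title.lower()
    return any(kw in t for kw in TITLE_KEYWORDS)

def screen(jobs, classify_jd):
    """Run stage 1 locally, then stage 2 (model call) on survivors."""
    passed = [j for j in jobs if stage1_title_filter(j["title"])]
    return [j for j in passed if classify_jd(j["description"])]
```

The point of the two-stage shape is volume: stage 1 is free string matching, so the 119B model only pays for the small fraction of postings that survive the title filter.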
Measure what the article claims. Fan-out, chaining, and iterative loops — benchmarked.
| Pattern | Workload | What to measure |
|---|---|---|
| Parallel fan-out | 23 concurrent classification calls | Throughput ceiling, rate limit impact |
| Sequential chain | 4-step transcript → summary → actions → push | End-to-end latency |
| Iterative loop | 3-pass draft → evaluate → revise | Token cost scaling, context growth |
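The parallel fan-out row could be measured with a harness like this. The endpoint call is stubbed (`call_fn` stands in for the real OpenAI-compatible request), and the concurrency cap mirrors the 23-company workload:

```python
# Fan-out benchmark sketch: fire N classification calls concurrently,
# cap in-flight requests with a semaphore, and measure wall-clock time.
import asyncio
import time

async def fan_out(call_fn, payloads, max_concurrency=23):
    sem = asyncio.Semaphore(max_concurrency)

    async def one(payload):
        async with sem:  # respect the endpoint's concurrency ceiling
            return await call_fn(payload)

    start = time.perf_counter()
    results = await asyncio.gather(*(one(p) for p in payloads))
    return results, time.perf_counter() - start
```

Sweeping `max_concurrency` against the measured wall-clock time is what exposes the throughput ceiling and any rate-limit impact from the table above.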
Agentic coding on open weights. Devstral (24B active / 123B total) as a local Copilot alternative.
Build a minimal code agent that reads a file, identifies issues, proposes fixes, and applies them — the same loop Claude Code and Codex run, but on a self-hosted model.
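One pass of that read → diagnose → fix → apply loop could be sketched as below. The model call is stubbed (`propose_fix` stands in for a Devstral completion), and applying edits is a naive whole-file rewrite rather than a diff:

```python
# Minimal code-agent pass sketch: read a file, ask the model for a
# revised version, and apply it. propose_fix stands in for the
# Devstral call; a real agent would loop until no changes remain.
from pathlib import Path

def code_agent_pass(path: str, propose_fix) -> bool:
    """Run one fix pass; return True if the file was changed."""
    source = Path(path).read_text()
    fixed = propose_fix(source)  # model returns the full revised file
    if fixed is None or fixed == source:
        return False             # model found nothing to change
    Path(path).write_text(fixed) # naive apply: overwrite the file
    return True
```

Repeating the pass until it returns `False` gives the same converge-on-clean loop Claude Code and Codex run, just with a self-hosted model behind `propose_fix`.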
```
agentic-inference/
  viz/                        # Visualization modules (matplotlib, dark theme)
    theme.py                  # Shared NVIDIA-green palette
    stack.py                  # Three-layer architecture diagram
    cost.py                   # API vs self-hosted cost crossover
    trace.py                  # Agent loop execution trace
    routing.py                # Hybrid routing decisions scatter
  projects/
    01_tool_calling/          # ✅ Agent loop + tool registry
    02_hybrid_router/         # ✅ Complexity-based routing
    03_job_scanner/           # Port career monitor to open stack
    04_inference_benchmarks/  # Benchmark fan-out, chain, loop
    05_devstral_code_agent/   # Code agent on Devstral
```
| Layer | Tool | Role |
|---|---|---|
| Orchestration | Dynamo 1.0 | Multi-node inference coordination |
| Governance | OpenShell | Sandbox + policy engine + privacy router |
| Serving | NIM | One-command model containers |
| Optimization | TensorRT-LLM | GPU compiler optimization |
| Inference | vLLM | Community inference engine |
| Context | CMX (BlueField-4) | Hardware context memory offload |
MIT