# agentic-inference

Exploring the agentic AI inference stack — from open-weight models to self-hosted serving to hybrid routing.

Companion repo to *Agents Don't Need Better Models. They Need Better Infrastructure.*

## Quickstart

```shell
pip install -r requirements.txt
```

## Visualizations

Four publication-quality figures, all generated from code:

```shell
python -m viz.stack                    # three-layer architecture diagram
python -m viz.cost                     # API vs self-hosted cost crossover
python -m viz.trace                    # agent loop execution trace
python -m viz.routing                  # hybrid routing decisions scatter
```

All accept `--out path/` for a custom output directory.

| Figure | What it shows |
| ------ | ------------- |
| `stack` | Governance → Serving → Models with hybrid routing arrows |
| `cost` | Monthly cost crossover at ~5,500 agent calls/day |
| `trace` | Step-by-step agent loop: model → tool call → result → answer |
| `routing` | Tasks plotted by complexity, colored by local vs frontier backend |

## Run the demos (requires API keys)

```shell
export MISTRAL_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key

# 01 — Agent loop on Mistral Small 4
python projects/01_tool_calling/demo.py

# 02 — Hybrid router: Mistral (local) + Claude (frontier)
python projects/02_hybrid_router/demo.py --threshold 0.6
```

Both demos run the real agent/router, save trace data as JSON, and auto-generate visualizations.

## Projects

Mini projects showcasing Mistral Small 4 (119B, agentic tool-calling, open weights) on the NVIDIA inference stack.

### 01 — Tool-Calling Agent Loop ✅

**The core pattern.** A `while(tool_use)` agent loop running entirely on Mistral Small 4.

The model decides which tools to call and when to stop — no hardcoded step sequence. Demonstrates that open-weight models can drive autonomous tool-calling loops, not just answer questions.

```
projects/01_tool_calling/
  agent.py    — Agent class: while(tool_use) loop with trace capture
  tools.py    — ToolRegistry + 5 built-in tools (calculator, file read, etc.)
  demo.py     — Run agent → save trace JSON → render viz
```

- Works with any OpenAI-compatible endpoint: NIM, vLLM, Mistral API
- Auto-generates tool schemas from Python type hints
- Full trace capture: tokens, latency, and tool calls per turn
- Visualization: `viz/trace.py` renders the execution as a vertical flow diagram
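The `Agent` class itself isn't reproduced here, but the `while(tool_use)` pattern it implements can be sketched with a stand-in `model_fn` (a stub in place of a real Mistral Small 4 call) and a plain dict as the tool registry — both hypothetical names, not the repo's actual API:

```python
import json

def run_agent(model_fn, tools, messages, max_turns=10):
    """Minimal while(tool_use) loop: model_fn inspects the transcript and
    returns either {"tool": name, "args": {...}} or {"answer": ...}."""
    for _ in range(max_turns):
        step = model_fn(messages)
        messages.append({"role": "assistant", "content": json.dumps(step)})
        if "answer" in step:                           # model decided to stop
            return step["answer"]
        result = tools[step["tool"]](**step["args"])   # dispatch the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not produce an answer")

# Stub model: request the calculator once, then answer with its result.
def stub_model(messages):
    if messages and messages[-1]["role"] == "tool":
        return {"answer": json.loads(messages[-1]["content"])}
    return {"tool": "calculator", "args": {"expression": "6 * 7"}}

tools = {"calculator": lambda expression: eval(expression)}  # toy tool
answer = run_agent(stub_model, tools, [{"role": "user", "content": "6*7?"}])
```

Swapping `stub_model` for a real chat-completion call against any OpenAI-compatible endpoint is the only change needed — the loop itself stays the same.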

### 02 — Hybrid Router ✅

**Route by complexity.** Easy calls go to self-hosted Mistral; hard calls go to a frontier API (Claude).

A lightweight routing layer that classifies incoming requests and dispatches them to the right backend. It is the same architecture described in the blog article, and the same pattern NVIDIA's OpenShell Privacy Router implements at the infrastructure level.

```
projects/02_hybrid_router/
  router.py   — HybridRouter: classify → route → complete, with stats
  demo.py     — Run 15 sample tasks → save decisions JSON → render viz
```

- Classifier: Mistral Small 4 scores task complexity (0–1) in a single call
- Threshold routing: below 0.6 → local Mistral, above → Claude API
- Aggregate stats: local %, average latency, tokens per backend
- Visualization: `viz/routing.py` plots decisions as a scatter with a threshold line
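The classify → route step reduces to a threshold comparison. A minimal sketch, with a word-count heuristic standing in for the model-based complexity scorer (the real router asks Mistral Small 4 itself for the score):

```python
def route(task, score_fn, threshold=0.6):
    """Threshold routing: a classifier scores complexity in [0, 1];
    scores below the threshold stay local, the rest go to the frontier API."""
    score = score_fn(task)
    backend = "local-mistral" if score < threshold else "claude-api"
    return {"task": task, "score": score, "backend": backend}

# Stand-in classifier: word count as a crude complexity proxy.
def word_count_score(task):
    return min(len(task.split()) / 20, 1.0)

decisions = [route(t, word_count_score) for t in [
    "Summarize this job title",
    "Draft a detailed migration plan for moving our multi-region "
    "inference fleet from API calls to self-hosted NIM containers",
]]
local_share = sum(d["backend"] == "local-mistral" for d in decisions) / len(decisions)
```

The aggregate stats the router reports (local %, latency, tokens per backend) fall out of accumulating these decision records, as `local_share` illustrates.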

### 03 — Job Scanner on Open Stack

**Port a real workflow from API to self-hosted.** Same logic, different infrastructure.

The career monitor from the blog article currently runs on Claude's API — 345 calls/day across 23 companies. Port the two-stage screening pipeline (title filter → JD keyword classification) to Mistral Small 4 on NIM.

- Stage 1 (title classification): should match or exceed API accuracy
- Stage 2 (JD analysis): test whether 119B parameters handle nuanced keyword matching
- Benchmark: accuracy parity, latency difference, and cost difference over 30 days
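The two-stage shape — a cheap filter in front of an expensive classifier — can be sketched as follows, with trivial lambdas standing in for the model calls (names and logic here are illustrative, not the monitor's actual code):

```python
def screen_job(title, jd_text, title_filter, jd_classifier):
    """Two-stage screening: a cheap title filter rejects most postings
    before the more expensive JD keyword classification ever runs."""
    if not title_filter(title):                # stage 1: title filter
        return {"match": False, "stage": 1}
    return {"match": jd_classifier(jd_text), "stage": 2}  # stage 2: JD analysis

# Stand-in classifiers; the ported pipeline would call Mistral Small 4 here.
title_filter = lambda t: "engineer" in t.lower()
jd_classifier = lambda jd: "inference" in jd.lower()

rejected = screen_job("Account Manager", "irrelevant", title_filter, jd_classifier)
matched = screen_job("ML Engineer", "Own our GPU inference serving layer",
                     title_filter, jd_classifier)
```

The design choice is economic: most of the 345 daily calls terminate at stage 1, so only a small fraction of postings pay for the full JD analysis.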

### 04 — Inference Pattern Benchmarks

**Measure what the article claims.** Fan-out, chaining, and iterative loops — benchmarked.

| Pattern | Workload | What to measure |
| ------- | -------- | --------------- |
| Parallel fan-out | 23 concurrent classification calls | Throughput ceiling, rate-limit impact |
| Sequential chain | 4-step transcript → summary → actions → push | End-to-end latency |
| Iterative loop | 3-pass draft → evaluate → revise | Token cost scaling, context growth |
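A minimal harness for the fan-out row, assuming a fake classification call with fixed latency (the numbers are illustrative, not benchmark results). A semaphore caps concurrency the way a rate limit would, which makes the throughput impact directly measurable:

```python
import asyncio
import time

async def classify(task, latency=0.05):
    """Stand-in for one model classification call with fixed latency."""
    await asyncio.sleep(latency)
    return {"task": task, "label": "match"}

async def fan_out(tasks, limit):
    """Fan the calls out under a concurrency cap and time the whole batch."""
    sem = asyncio.Semaphore(limit)
    async def bounded(task):
        async with sem:
            return await classify(task)
    start = time.perf_counter()
    results = await asyncio.gather(*(bounded(t) for t in tasks))
    return results, time.perf_counter() - start

tasks = [f"company-{i}" for i in range(23)]
_, t_capped = asyncio.run(fan_out(tasks, limit=1))   # serialized: ~23 x latency
_, t_open = asyncio.run(fan_out(tasks, limit=23))    # fully parallel: ~1 x latency
```

Sweeping `limit` from 1 up to the full batch size traces out the throughput ceiling the table asks for.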

### 05 — Devstral Code Agent

**Agentic coding on open weights.** Devstral (24B active / 123B total) as a local Copilot alternative.

Build a minimal code agent that reads a file, identifies issues, proposes fixes, and applies them — the same loop Claude Code and Codex run, but on a self-hosted model.

## Structure

```
agentic-inference/
  viz/                          # Visualization modules (matplotlib, dark theme)
    theme.py                    # Shared NVIDIA-green palette
    stack.py                    # Three-layer architecture diagram
    cost.py                     # API vs self-hosted cost crossover
    trace.py                    # Agent loop execution trace
    routing.py                  # Hybrid routing decisions scatter
  projects/
    01_tool_calling/            # ✅ Agent loop + tool registry
    02_hybrid_router/           # ✅ Complexity-based routing
    03_job_scanner/             # Port career monitor to open stack
    04_inference_benchmarks/    # Benchmark fan-out, chain, loop
    05_devstral_code_agent/     # Code agent on Devstral
```

## Stack Reference

| Layer | Tool | Role |
| ----- | ---- | ---- |
| Orchestration | Dynamo 1.0 | Multi-node inference coordination |
| Governance | OpenShell | Sandbox + policy engine + privacy router |
| Serving | NIM | One-command model containers |
| Optimization | TensorRT-LLM | GPU compiler optimization |
| Inference | vLLM | Community inference engine |
| Context | CMX (BlueField-4) | Hardware context memory offload |

## License

MIT
