Applied AI researcher. Most of my work lives in the gap between agent demos and agents that don't fall over on Friday night.
LLM agents in production: the part nobody tweets about. Picking the right tools out of fifty. Evaluating tool use without LLM-judge wash. Recording traces that survive a postmortem. Building flows you can resume after a worker dies on step 7.
The four repos pinned here are pieces of the same picture: how to take an agent from "works in a notebook" to "shipped, observable, and debuggable".
- Why "demo-good" and "prod-good" are different problems, and how the gap shows up empirically
- Cheap routing layers that protect the LLM from its own option-explosion problem
- Trace formats that one engineer can read and one machine can grep
- Checkpointing patterns for agents whose tool calls cost real money
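To make the trace and checkpointing items concrete, here is a minimal sketch that combines them: each completed step is appended to a JSONL file, and a restarted run skips any step already on disk. This is an illustration only, not code from any of the pinned repos; `load_done`, `run_flow`, and the `{"step": ..., "result": ...}` record shape are hypothetical names chosen for the example.

```python
import json
import os


def load_done(path):
    """Return {step_name: result} for every step already recorded in the trace."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    rec = json.loads(line)
                    done[rec["step"]] = rec["result"]
    return done


def run_flow(steps, path):
    """Run (name, fn) steps in order, skipping steps the trace already contains.

    Each fn receives the dict of earlier results and must return a
    JSON-serializable value. Returns (all_results, names_executed_this_run).
    """
    done = load_done(path)
    executed = []
    with open(path, "a", encoding="utf-8") as fh:
        for name, fn in steps:
            if name in done:
                continue  # already paid for this call; reuse the recorded result
            result = fn(done)
            # One JSON object per line, so the file stays greppable.
            fh.write(json.dumps({"step": name, "result": result}) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make the checkpoint durable before moving on
            done[name] = result
            executed.append(name)
    return done, executed
```

Calling `run_flow` again with the same path after a crash re-executes only the steps that never reached the file. A real recorder would also have to cope with a partially written last line, which this sketch does not.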
Python · PyTorch · vLLM · OpenAI / Anthropic / Qwen SDKs · Pydantic · pytest · Jinja2 · Rich · Docker · tmux · a stubborn refusal to add a database
- agent-eval-kit: YAML cases, replayable traces, deterministic checks for tool-using agents
- tool-router: pick K tools before they ever hit the LLM
- agent-tape: structured trace recorder + replay; jsonl on disk, no service
- flowmind: declarative agent flows in YAML, with checkpointing
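As a rough picture of what "pick K tools before they ever hit the LLM" can mean, the sketch below ranks tools by plain token overlap between the user query and each tool's name and description, then keeps the top K. This is an assumed, deliberately cheap scoring rule for illustration, not the method tool-router actually uses; `tokenize` and `pick_tools` are hypothetical names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pick_tools(query, tools, k=3):
    """Return up to k tool names ranked by token overlap with the query.

    `tools` maps tool name -> short description. Tools with no overlap
    are dropped, so the result may contain fewer than k names.
    """
    q = Counter(tokenize(query))
    scored = []
    for name, desc in tools.items():
        d = Counter(tokenize(name + " " + desc))
        score = sum(min(q[t], d[t]) for t in q)
        scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for score, name in scored[:k] if score > 0]


tools = {
    "search_web": "search the web for pages",
    "send_email": "send an email message to a recipient",
    "get_weather": "get the weather forecast for a city",
}
shortlist = pick_tools("what is the weather forecast in Paris", tools, k=1)
```

Only the shortlisted tool schemas would then be placed in the prompt. Lexical overlap is crude (stop words like "the" still score), so an embedding-based scorer would be a natural swap behind the same function signature.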
Not on Twitter. Not looking for a job. PRs welcome, issues even more welcome — I read all of them.