Ma Xiaoming sitianjia

Applied AI researcher. Most of my work lives in the gap between agent demos and agents that don't fall over on Friday night.

What I work on

LLM agents in production — the part nobody tweets about. Picking the right tools out of fifty. Evaluating tool-use without LLM-judge wash. Recording traces that survive a postmortem. Building flows you can resume after a worker dies on step 7.

The four repos pinned here are pieces of the same picture: how to take an agent from "works in a notebook" to "shipped, observable, and debuggable".

What I'm thinking about

Why "demo-good" and "prod-good" are different problems, and how the gap shows up empirically
Cheap routing layers that protect the LLM from its own option-explosion problem
Trace formats that one engineer can read and one machine can grep
Checkpointing patterns for agents whose tool calls cost real money
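As an illustration of the checkpointing item, here is a minimal sketch, not flowmind's actual API. It assumes a flow is an ordered list of named steps, that each step's result is JSON-serializable, and that a single JSON file on disk is enough state. The helper `run_with_checkpoints` and the step names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run named steps in order, skipping any whose result is already on disk."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run; do not pay for it again
        done[name] = fn(done)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(path)
    return done


calls = []  # records which (hypothetical) paid tool calls actually ran


def search(state):
    calls.append("search")
    return ["doc-1", "doc-2"]


def flaky_summarize(state):
    calls.append("summarize")
    raise RuntimeError("worker died")


def summarize(state):
    calls.append("summarize")
    return f"{len(state['search'])} docs summarized"


checkpoint = Path(tempfile.mkdtemp()) / "flow.json"

try:
    run_with_checkpoints([("search", search), ("summarize", flaky_summarize)], checkpoint)
except RuntimeError:
    pass  # simulated crash; the first step's result is already saved

# The second run resumes from the checkpoint and never repeats "search".
result = run_with_checkpoints([("search", search), ("summarize", summarize)], checkpoint)
```

The point of the pattern is that the expensive step runs once across both attempts; only the step that failed is retried.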

Tools I keep reaching for

Python · PyTorch · vLLM · OpenAI / Anthropic / Qwen SDKs · Pydantic · pytest · Jinja2 · Rich · Docker · tmux · a stubborn refusal to add a database

Pinned

agent-eval-kit — YAML cases, replayable traces, deterministic checks for tool-using agents
tool-router — pick K tools before they ever hit the LLM
agent-tape — structured trace recorder + replay; jsonl on disk, no service
flowmind — declarative agent flows in YAML, with checkpointing
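To make "pick K tools before they ever hit the LLM" concrete, here is a rough sketch under stated assumptions. It is not tool-router's interface: the `pick_tools` function, the word-overlap scoring, and the tool catalogue are all hypothetical, chosen only because they need no model call and no dependencies.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_tools(query, tools, k=2):
    """Return up to k tool names whose descriptions share the most words with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(desc)), name) for name, desc in tools.items()]
    # Drop tools with no overlap, then sort by overlap (descending) and name (for stable ties).
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:k]]


# Hypothetical tool catalogue: name -> one-line description.
TOOLS = {
    "get_weather": "look up the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_flights": "search flights between two airports on a date",
    "convert_currency": "convert an amount between two currencies",
}

shortlist = pick_tools("what is the weather forecast in Lisbon", TOOLS, k=2)
print(shortlist)
```

Only the shortlisted tool schemas would then be passed to the model. A real router would likely swap the word-overlap score for embeddings or a small classifier, and would filter stop words so that filler like "an" does not count as a match.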

Activity

Not on Twitter. Not looking for a job. PRs welcome, issues even more welcome. I read all of them.
