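An AI agent workflow system for mathematical modeling contests such as the China Undergraduate Mathematical Contest in Modeling (CUMCM) and MCM/ICM.
MathModelAgent is not a traditional software project. It is a production line for mathematical modeling contests built around Skill workflows, the LangGraph Runtime, a local RAG capability layer, evidence tracing, sandboxed execution and paper review gates. The goal is not to have AI directly "write text that looks like a paper", but to split problem understanding, model selection, code experiments, figure generation, paper writing, review and revision, and final acceptance into stages that can be audited, replayed and checked by a human.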
Current Project Version: V2.7-alpha
Skill Workflow Base: V2.6-compatible, 1 orchestrator + 7 phase skills
LangGraph Runtime: v1.0-alpha, contest_graph_v3 + Benchmark Arena
Archived Pipeline: V1, preserved under archive/v1/
Core Principle: workspace files are the shared memory, chat history is not the state source
Safety Principle: Human Gate + copied run workspace + allowlist writes + audit-only final verify
Benchmark Track: provider-free fixtures + real-provider Phase 1 planning smoke + multi-provider comparison
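V2.7-alpha focuses on adding a LangGraph safe closed-loop runtime and Benchmark Arena on top of the V2.6 capability layer:
- LangGraph Contest Runtime v1.0-alpha: adds contest_graph_v3, which chains Human Gate, the Phase 2 sandbox experiment, the Phase 3 paper draft sandbox, the Phase 4 contest review, the Phase 5 controlled revision and the Phase 6 audit-only check into one complete safe loop.
- Benchmark Arena: adds scripts/langgraph_benchmark.py, which batch-scans benchmark workspace fixtures, runs contest_graph_v3 and outputs Markdown + JSON benchmark reports.
- Controlled sandbox execution: Phase 2 only allows safe Python commands to run inside the copied run workspace. Phase 3 and Phase 5 may only write to designated paper/ and reports/ files; an illegal path rejects the whole batch, and a failed write is rolled back.
- Human Gate remains a hard boundary: LangGraph can propose modeling routes, but it never writes HUMAN_MODEL_REVIEW.md or MODELING_DECISION.md automatically. Without human confirmation, the flow does not enter the experiment phase.
- Final acceptance stays read-only: Phase 6 is audit-only. It does not write VERIFY_REPORT.md automatically and does not claim a final PASS.
The local RAG, source quality, figure evidence map, executable templates, evaluator-optimizer and evidence trace capabilities from V2.6 remain the underlying foundation.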
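The controlled-write rule for Phase 3 and Phase 5 (whole-batch rejection on an illegal path, rollback on a failed write) can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the names `ALLOWED_PREFIXES`, `is_allowed` and `apply_batch` are hypothetical, and the allowlist is simplified to two directory prefixes, whereas the runtime restricts writes to designated files.

```python
from pathlib import Path

# Assumed allowlist: the real runtime allows only designated files,
# not every file under these directories.
ALLOWED_PREFIXES = ("paper/", "reports/")


def is_allowed(workspace: Path, rel_path: str) -> bool:
    """Return True if rel_path stays inside the workspace and under an allowed prefix."""
    root = workspace.resolve()
    target = (workspace / rel_path).resolve()
    if root not in target.parents:  # blocks ../ escapes and absolute paths
        return False
    return target.relative_to(root).as_posix().startswith(ALLOWED_PREFIXES)


def apply_batch(workspace: Path, writes: dict[str, str]) -> bool:
    """Write all files or none: reject on any illegal path, roll back on failure."""
    if not all(is_allowed(workspace, rel) for rel in writes):
        return False  # one illegal path rejects the whole batch

    backups: dict[Path, str | None] = {}  # previous content, or None if the file was new
    try:
        for rel, content in writes.items():
            target = workspace / rel
            backups[target] = (
                target.read_text(encoding="utf-8") if target.exists() else None
            )
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")
    except OSError:
        # Restore every file touched so far (created directories are left in place).
        for target, old in backups.items():
            if old is None:
                target.unlink(missing_ok=True)
            else:
                target.write_text(old, encoding="utf-8")
        return False
    return True
```

Validating every path before the first write is what makes the rejection apply to the batch as a whole rather than to individual files.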
MathModelAgent V2.7-alpha
│
├── Skill Pipeline
│ ├── mm-start-contest-v2 # Orchestrator
│ ├── mm-problem-intake # Phase 0: problem and data intake
│ ├── mm-model-strategy # Phase 1: model strategy and human gate
│ ├── mm-data-experiment # Phase 2: coding, results, visualization
│ ├── mm-paper-build # Phase 3: paper construction and claim trace
│ ├── mm-contest-review # Phase 4: contest-style review
│ ├── mm-revision-integrator # Phase 5: revision loop
│ └── mm-final-verify # Phase 6: final acceptance
│
├── LangGraph Runtime
│ ├── dry_run # Safe orchestration smoke
│ ├── llm_plan # Structured plan generation only
│ ├── controlled_apply # Allowlist-based report writes
│ ├── phase_execute # Phase 1 / Phase 4 one-step execution
│ ├── contest_graph_v0 # Full graph skeleton + Human Gate pause
│ ├── contest_graph_v1 # Phase 2 sandbox experiment executor
│ ├── contest_graph_v2 # Phase 3 paper draft sandbox
│ ├── contest_graph_v3 # Phase 5 revision sandbox + audit-only final
│ └── Benchmark Arena # Batch fixture runner and stability report
│
├── Capability Layer
│ ├── local RAG knowledge base
│ ├── model method cards
│ ├── problem type router
│ ├── anti-template review
│ ├── judge skim review
│ ├── figure evidence map
│ └── source quality policy
│
├── Evidence Layer
│ ├── RESULTS_MANIFEST.json
│ ├── CLAIM_TRACE.md
│ ├── METHOD_IMPLEMENTATION_MATRIX.md
│ ├── FIGURE_AUDIT.md
│ ├── PAPER_SCORECARD.md
│ ├── REVISION_ACTIONS.md
│ └── REVISION_STATUS.md
│
└── Control Center
├── FastAPI backend
├── Vue 3 + Vite frontend
└── Manual / Codex / Claude Code / OpenCode prompt preparation
Contest state lives in the workspace, not in chat history.
Skills, subagents and LangGraph phases communicate through durable files such as PROBLEM_BRIEF.md, MODELING_DECISION.md, RESULTS_MANIFEST.json, CLAIM_TRACE.md, REVISION_STATUS.md and VERIFY_REPORT.md.
This makes the workflow easier to resume, audit, debug and compare across contest runs.
AI may propose and review candidate routes, but the final modeling route must pass the human confirmation gate before coding begins.
This is intentional. In mathematical modeling contests, a wrong early modeling route can make every later artifact beautifully wrong.
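A Human Gate check of this kind might look like the sketch below. It assumes, purely for illustration, that the gate counts as confirmed when both human-owned files exist under reports/ and are non-empty; the project's real criterion may be stricter, and `human_gate_open` is a hypothetical name.

```python
from pathlib import Path

# Files that only a human is supposed to write (paths follow the workspace layout).
GATE_FILES = (
    "reports/HUMAN_MODEL_REVIEW.md",
    "reports/MODELING_DECISION.md",
)


def human_gate_open(workspace: Path) -> bool:
    """Assumed rule: every gate file exists and has non-whitespace content."""
    for rel in GATE_FILES:
        path = workspace / rel
        if not path.is_file():
            return False
        if not path.read_text(encoding="utf-8").strip():
            return False
    return True
```

The function only reads the workspace. Keeping the check read-only matches the rule that the runtime may wait at the gate but never writes these files itself.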
The paper stage should not invent results. It reads from code outputs, figures, result manifests and claim trace files.
A claim without evidence is either weakened, rewritten or blocked by review.
The system contains independent review roles: model reviewer, devil's advocate, visualization reviewer, contest reviewer and final integrator.
They are used to catch weak assumptions, template abuse, unsupported claims, poor figures, missing validation and submission risks.
LangGraph Runtime is designed to pause, reject, roll back and audit. It should not bypass Human Gate, write final PASS, or modify forbidden directories just to look more autonomous.
Bootstrap: mm-start-contest-v2
│
├─ Phase 0: mm-problem-intake
│ Agents: problem-analyst, data-auditor
│ Outputs: PROBLEM_BRIEF.md, DATA_AUDIT.md, reports/INTAKE_GATE.md
│
├─ Phase 1: mm-model-strategy
│ Agents: model-strategist, model-reviewer, devils-advocate
│ Outputs: MODEL_CANDIDATES.md, MODEL_REVIEW_AI.md,
│ HUMAN_MODEL_REVIEW.md, MODELING_DECISION.md,
│ ANALYSIS_MODELING_REPORT.md, ANALYSIS_GATE.md,
│ FIGURE_PLAN.md
│
├─ Phase 2: mm-data-experiment
│ Agents: experiment-coder, visualization-reviewer
│ Outputs: code/, figures/, results/RESULTS_MANIFEST.json,
│ EXPERIMENT_LOG.md, RESULTS_REPORT.md, FIGURE_AUDIT.md
│
├─ Phase 3: mm-paper-build
│ Agents: paper-writer, claim traceability check
│ Outputs: paper/, CLAIM_TRACE.md,
│ METHOD_IMPLEMENTATION_MATRIX.md, PAPER_BUILD_REPORT.md
│
├─ Phase 4: mm-contest-review
│ Agents: contest-reviewer, devils-advocate,
│ visualization-reviewer, model-reviewer
│ Outputs: PAPER_SCORECARD.md, REVISION_ACTIONS.md
│
├─ Phase 5: mm-revision-integrator
│ Purpose: repair BLOCKER / HIGH / important MEDIUM issues
│ Outputs: revised artifacts, REVISION_STATUS.md
│
└─ Phase 6: mm-final-verify
Agent: final-integrator
Output: VERIFY_REPORT.md
A contest run is complete only when VERIFY_REPORT.md = PASS and all hard gates are satisfied.
LangGraph is an optional runtime layer under app/backend. It does not replace the V2 skills. It orchestrates safe phase execution around the existing file-based workspace contract.
Supported modes:
| Mode | Purpose | Write level |
|---|---|---|
| dry_run | Smoke-test graph wiring and reports | LangGraph reports only |
| llm_plan | Generate structured PhasePlan JSON | Plan files only |
| controlled_apply | Apply allowlisted low-risk report writes | Phase 1 / Phase 4 reports |
| phase_execute | One-step plan + apply for Phase 1 / Phase 4 | Allowlisted phase reports |
| contest_graph_v0 | Full graph skeleton with Human Gate pause | Safe mixed strategy |
| contest_graph_v1 | Adds Phase 2 sandbox experiment executor | code/, figures/, results/, selected reports in copied run workspace |
| contest_graph_v2 | Adds Phase 3 paper draft sandbox | paper/ and evidence reports in copied run workspace |
| contest_graph_v3 | Adds Phase 5 revision sandbox and audit-only final | Revised paper/ and selected evidence reports |
Key runtime outputs:
reports/LANGGRAPH_RUN_REPORT.md
reports/LANGGRAPH_PHASE_PLAN.json
reports/LANGGRAPH_PHASE_PLAN.md
reports/LANGGRAPH_APPLY_DIFF.md
reports/LANGGRAPH_CONTEST_GRAPH_REPORT.md
reports/LANGGRAPH_BENCHMARK_REPORT.md
reports/LANGGRAPH_BENCHMARK_REPORT.json
reports/AGENT_RUNS.md
Benchmark runner:
python scripts/langgraph_benchmark.py --root tests/langgraph_benchmark_fixtures --mode contest_graph_v3 --provider none
More details:
docs/langgraph-runner.md
docs/testing/langgraph-phase-runner.tdd.md
knowledge/ stores the local RAG configuration, samples and source notes. Large raw files and private contest data should stay local and must not be committed.
knowledge/
├── README.md
├── libraries.json
├── samples/
│ ├── cumcm_problems/
│ ├── mcm_icm_problems/
│ ├── high_score_papers/
│ ├── model_methods/
│ ├── code_templates/
│ ├── figure_templates/
│ ├── paper_expression/
│ └── review_rubrics/
└── source_notes/
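| Library | Purpose |
|---|---|
| cumcm_problems | Past CUMCM problems, problem-type tags, implicit scoring points |
| mcm_icm_problems | MCM/ICM problem statements, tracks, English phrasing, common modeling routes |
| excellent_papers | High-score paper structure, abstracts, figures, modeling routes, conclusion phrasing |
| model_methods | Method cards for evaluation, prediction, optimization, mechanistic, graph-theory, statistical, simulation and multi-objective decision models |
| code_templates | Python/R/MATLAB scripts for cleaning, modeling, validation and visualization |
| figure_templates | Recommended figures, figure audit standards, caption writing, evidence maps |
| paper_expression | Abstract, problem restatement, assumptions, formula explanation, result analysis, sensitivity analysis |
| review_rubrics | Scoring rubrics, judge skim review, deduction points, anti-template review, gaps between high and low scores |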
# Index built-in samples without external vector store
python scripts\rag_ingest.py --source knowledge\samples --vector-store none
# Query all libraries
python scripts\rag_query.py "综合评价类题目 TOPSIS 权重 稳定性"
# Query a specific library
python scripts\rag_query.py "预测 优化 混合题 约束 验证" --library model_methods
# JSON output for agent consumption
python scripts\rag_query.py "评委 快审 摘要 关键图 结论" --library review_rubrics --json
Optional local vector store:
pip install chromadb sentence-transformers
python scripts\rag_ingest.py --source knowledge\raw --vector-store chroma --embedding-mode sentence-transformer --embedding-model BAAI/bge-m3
RAG is advisory. It provides evidence, candidate routes and review hints. Final modeling decisions still go through mm-model-strategy, human review and later contest-style checks.
app/ provides a local full-stack control center for V2 workspaces: Vue 3 + Pinia + TypeScript frontend, FastAPI backend, and LangGraph Runtime v1.0-alpha optional orchestration layer.
Beginner guide: docs/frontend-beginner-guide.md
Deployment guide: docs/local-deployment-guide.md
Windows:
powershell -ExecutionPolicy Bypass -File scripts/setup_control_center.ps1
powershell -ExecutionPolicy Bypass -File scripts/start_control_center.ps1
Open:
http://127.0.0.1:5173
Notes:
- provider=none does not need an API key.
- To use real DeepSeek / OpenAI-compatible models, copy .env.example to .env and fill API keys in the backend environment.
- Do not paste API keys into the frontend browser page.
Backend: FastAPI, default http://127.0.0.1:8000
Frontend: Vue 3 + Vite, default http://127.0.0.1:5173
LangGraph Runtime: v1.0-alpha, contest_graph_v3 + Benchmark Arena
Safety: Human Gate preserved, provider=none safe launcher, run workspace isolation
| Page | Purpose |
|---|---|
| Overview | Dashboard with audit strip, phase timeline, recommendations, issues |
| Phase | Per-phase inputs/outputs, prompt generation, harness preparation |
| Artifacts | Workspace file index with quick filters (Core Gates, LangGraph, Evidence, Review) |
| Console | Prompt generation + run history |
| LangGraph | Runtime status, run config, run summary, phase results table, sandbox/paper/revision cards, files, audit |
| Runs | Run workspace browser — list, browse, and preview artifacts inside copied run workspaces |
| Benchmark Lab | Legacy 2022C audit, LangGraph benchmark reports, real provider reports, multi-model compare, safe provider=none launcher |
| Settings | New workspace creation, source upload, health check, harness adapters |
- Frontend does not bypass Human Gate (never writes HUMAN_MODEL_REVIEW.md or MODELING_DECISION.md)
- Frontend does not auto-write VERIFY_REPORT.md
- Safe Benchmark Launcher enforces provider=none, copy_workspace=true, contest_graph_v3
- Run artifact API is scoped to source/runs/{run_name}/ only
- Benchmark report API is scoped to docs/, docs/real_benchmarks/, docs/benchmarks/ only
- No real API key management in the UI
# Backend
cd app/backend
uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
# Frontend
cd app/frontend
pnpm install
pnpm run dev
Then open http://127.0.0.1:5173.
Beginner? Read docs/getting-started.md — step-by-step tutorial in Chinese.
cd app/frontend && pnpm run build # vue-tsc + vite
python -m pytest tests/test_langgraph_api.py -q # 12 tests
python -m pytest tests/test_benchmark_reports_api.py -q # 8 tests
python -m pytest tests/test_run_workspace_artifacts_api.py -q # 8 tests
python -m pytest tests/test_safe_langgraph_benchmark_api.py -q # 5 tests
- docs/getting-started.md: beginner tutorial (recommended for first-time users)
- docs/RELEASE_v2.7-alpha.md: V2.7-alpha release notes
- docs/frontend-control-center-v2.md: full feature map and safety docs
- docs/frontend-api-contract.md: API endpoint reference
- docs/langgraph-runner.md: LangGraph runtime architecture
- docs/testing/frontend-langgraph-e2e-smoke.md: E2E smoke test report
├── README.md
├── AGENTS.md # Codex-facing project guidance
├── CLAUDE.md # Claude Code-facing project guidance
├── FILE_RELATIONSHIP_MAP.md # Full dependency graph and execution logic
├── mathmodelagent.skills.sh.json # Skill manifest
│
├── knowledge/ # V2.6+ local RAG knowledge base
│ ├── README.md
│ ├── libraries.json
│ ├── samples/
│ └── source_notes/
│
├── skills/
│ ├── _references/ # Shared contracts, rubrics, method cards, protocols
│ │ ├── v2_pipeline_contract.md
│ │ ├── workflow_state_contract.md
│ │ ├── codex_subagent_protocol.md
│ │ ├── contest_score_rubric.md
│ │ ├── paper_benchmark_profile.md
│ │ ├── figure_quality_standard.md
│ │ ├── agent_review_protocol.md
│ │ ├── model_method_cards.md
│ │ ├── problem_type_router.md
│ │ ├── anti_template_review.md
│ │ ├── judge_skim_review_protocol.md
│ │ ├── rag_usage_contract.md
│ │ ├── source_quality_policy.md
│ │ ├── figure_evidence_map.md
│ │ ├── executable_model_templates.md
│ │ ├── evaluator_optimizer_protocol.md
│ │ ├── agent_profiles/
│ │ └── scripts/
│ │
│ ├── mm-start-contest-v2/
│ ├── mm-problem-intake/
│ ├── mm-model-strategy/
│ ├── mm-data-experiment/
│ ├── mm-paper-build/
│ ├── mm-contest-review/
│ ├── mm-revision-integrator/
│ ├── mm-final-verify/
│ ├── 5writing/templates/ # Typst and LaTeX contest templates
│ ├── doctor/
│ └── typst-author/
│
├── scripts/
│ ├── rag_ingest.py
│ ├── rag_query.py
│ ├── import_zhanwen_mathmodel.py
│ ├── audit_benchmark.py
│ ├── langgraph_benchmark.py # LangGraph Benchmark Arena runner
│ ├── new_v2_workspace.py
│ ├── memory_log.py
│ ├── memory_brief.py
│ └── memory_distill.py
│
├── app/ # Local Control Center + LangGraph Runtime backend
│ ├── backend/
│ ├── frontend/
│ └── start.bat
│
├── docs/
│ ├── control-center-beginner-guide.md
│ ├── control-center-ui-spec.md
│ ├── langgraph-runner.md
│ └── testing/
│ └── langgraph-phase-runner.tdd.md
│
├── tests/ # Runtime, API, benchmark and stabilization tests
├── examples/ # Sanitized example contest workspaces
├── workspaces/ # Local active contest workspaces, normally ignored
└── archive/v1/ # Archived V1 legacy pipeline
A V2 workspace should contain the following artifacts:
<workspace>/
├── plan.md
├── todo.md
├── WORKFLOW_STATE.md
├── PROBLEM_BRIEF.md
├── DATA_AUDIT.md
├── reports/
│ ├── INTAKE_GATE.md
│ ├── MODEL_CANDIDATES.md
│ ├── MODEL_REVIEW_AI.md
│ ├── HUMAN_MODEL_REVIEW.md
│ ├── MODELING_DECISION.md
│ ├── ANALYSIS_MODELING_REPORT.md
│ ├── ANALYSIS_GATE.md
│ ├── FIGURE_PLAN.md
│ ├── EXPERIMENT_LOG.md
│ ├── RESULTS_REPORT.md
│ ├── FIGURE_AUDIT.md
│ ├── CLAIM_TRACE.md
│ ├── METHOD_IMPLEMENTATION_MATRIX.md
│ ├── PAPER_BUILD_REPORT.md
│ ├── PAPER_SCORECARD.md
│ ├── REVISION_ACTIONS.md
│ ├── REVISION_STATUS.md
│ └── VERIFY_REPORT.md
├── results/
│ └── RESULTS_MANIFEST.json
├── code/
├── figures/
└── paper/
LangGraph runs may additionally create reports/LANGGRAPH_*.md, reports/LANGGRAPH_*.json, reports/AGENT_RUNS.md and local history files.
| Agent | Purpose | Permissions | Reasoning |
|---|---|---|---|
| problem-analyst | Parse problem, subquestions, objectives, constraints | read-only | medium |
| data-auditor | Inspect data files, fields, units, missingness and anomalies | read-only | medium |
| model-strategist | Generate candidate modeling routes | write reports/ | high |
| model-reviewer | Review model fit, rigor and feasibility | read-only | high |
| devils-advocate | Attack weak assumptions and find hidden risks | read-only | high |
| experiment-coder | Implement scripts, run experiments, save outputs | write code/, results/, figures/ | high |
| visualization-reviewer | Review figure quality, readability and evidence value | read-only | medium |
| paper-writer | Draft and revise paper sections | write paper/ and selected reports | high |
| contest-reviewer | Score against contest rubric | read-only | high |
| final-integrator | Verify consistency and final submission readiness | write paper/ and reports/ | high |
Profiles live in:
skills/_references/agent_profiles/
Custom Codex agent names use the mathmodel-* prefix, for example mathmodel-experiment-coder.
Each gate returns one of:
PASS
CONDITIONAL_PASS
FAIL
The project is complete only when all of the following are true:
- VERIFY_REPORT.md = PASS
- All contest score dimensions are >= 4, unless explicitly marked as justified N/A
- REVISION_ACTIONS.md has no unresolved BLOCKER or HIGH items
- FIGURE_AUDIT.md has no failed paper figures
- METHOD_IMPLEMENTATION_MATRIX.md has no unimplemented core methods
- CLAIM_TRACE.md has no missing core claims and no weak claims stated as strong
- The paper compiles cleanly and the final PDF opens correctly
- Internal workflow files are not leaked into the final paper text
LangGraph contest_graph_v3 can help reach these conditions, but it does not write final PASS by itself.
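Two of the completion conditions above (the final verify status and the absence of unresolved BLOCKER or HIGH revision items) lend themselves to a mechanical check. The sketch below assumes the revision items have already been parsed into records; `RevisionItem`, `unresolved_hard_items` and `hard_gates_satisfied` are hypothetical names, and the on-disk format of REVISION_ACTIONS.md is not modeled here.

```python
from dataclasses import dataclass

HARD_SEVERITIES = {"BLOCKER", "HIGH"}


@dataclass
class RevisionItem:
    """Hypothetical parsed row of REVISION_ACTIONS.md."""
    severity: str   # e.g. "BLOCKER", "HIGH", "MEDIUM"
    resolved: bool


def unresolved_hard_items(items: list[RevisionItem]) -> list[RevisionItem]:
    """Items that block completion: BLOCKER or HIGH and still open."""
    return [i for i in items if i.severity in HARD_SEVERITIES and not i.resolved]


def hard_gates_satisfied(verify_status: str, items: list[RevisionItem]) -> bool:
    """Covers only two conditions; the remaining ones need their own checks."""
    return verify_status == "PASS" and not unresolved_hard_items(items)
```

CONDITIONAL_PASS is deliberately not treated as PASS in this sketch, since the completion rule names PASS explicitly.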
The default contest review uses 10 dimensions, each scored from 0 to 5.
| Dimension | What it checks |
|---|---|
| Problem understanding | Questions, assumptions, constraints, evaluation criteria |
| Data understanding | Files, fields, units, missing values, anomalies |
| Modeling fit | Whether methods match the data and question type |
| Mathematical rigor | Variables, formulas, objectives, constraints, derivations |
| Implementation | Reproducible code and alignment with the approved model |
| Result validity | Error analysis, sensitivity, robustness and sanity checks |
| Visualization | Figures support reasoning and appear in the paper |
| Writing structure | Complete contest paper structure and coherent narrative |
| Claim traceability | Claims map to results, figures, code or decisions |
| Submission readiness | No placeholders, no broken compilation, no obvious leakage |
Rating guide:
5 = strong high-score quality
4 = acceptable contest-quality baseline
3 = visibly weak
2 = significant score loss
1 = mostly missing
0 = absent
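The scoring rule (10 dimensions, each 0 to 5, with 4 as the acceptance baseline unless a dimension is a justified N/A) can be expressed as a small check. This is an illustrative sketch: the dictionary input, the use of `None` for a justified N/A, and the function name `failing_dimensions` are assumptions, not the format of PAPER_SCORECARD.md.

```python
from typing import Optional

DIMENSIONS = (
    "Problem understanding", "Data understanding", "Modeling fit",
    "Mathematical rigor", "Implementation", "Result validity",
    "Visualization", "Writing structure", "Claim traceability",
    "Submission readiness",
)
BASELINE = 4  # "acceptable contest-quality baseline" in the rating guide


def failing_dimensions(scores: dict[str, Optional[int]]) -> list[str]:
    """Return dimensions below the baseline; None means justified N/A (assumed encoding)."""
    failing = []
    for name in DIMENSIONS:
        if name not in scores:
            failing.append(name)  # an unscored dimension cannot pass
            continue
        score = scores[name]
        if score is None:
            continue  # justified N/A is exempt from the threshold
        if not 0 <= score <= 5:
            raise ValueError(f"{name}: score {score} is outside 0..5")
        if score < BASELINE:
            failing.append(name)
    return failing
```

An empty result means the score condition is met; it says nothing about the other completion conditions.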
skills/5writing/templates/ contains Typst and LaTeX templates for 17 contest types.
Chinese templates:
CUMCM, ChangSanJiao, HuaShuBei, HuaweiBei, HuaZhongBei,
MathorCup, APMCM, ShuWeiBei, WuYiBei, DianGongBei,
DongSanSheng, Stats, MCM, Default
English templates:
MCM/ICM, APMCM, Default
Each contest type has both Typst and LaTeX variants where available.
Clone the repository:
git clone https://github.com/zklzzklzkl/MathModel.git MathModelAgent
cd MathModelAgent
Install optional RAG dependencies:
pip install chromadb sentence-transformers
Install Control Center backend dependencies:
cd app/backend
pip install -e .
Install optional LangGraph runtime dependencies when using graph modes:
pip install -r app/backend/requirements-langgraph.txt
For Claude Code skills, copy skills into your local skills directory if needed:
cp -r skills/* ~/.claude/skills/
On Windows PowerShell, adapt the destination path to your local Claude Code or Codex skill location.
python scripts/new_v2_workspace.py workspaces/my-contest --contest CUMCM --engine LaTeX --language 中文
/mm-start-contest-v2
python skills/_references/scripts/audit_v2_run.py --workspace workspaces/my-contest
Provider-free fixture benchmark:
python scripts/langgraph_benchmark.py --root tests/langgraph_benchmark_fixtures --mode contest_graph_v3 --provider none
Real provider Phase 1 planning smoke:
python scripts/real_provider_benchmark.py --workspace examples/2022C/DeepSeekV4Pro_V2.3 --mode llm_plan --phase 1 --provider deepseek --model deepseek-chat
Multi-provider Phase 1 planning comparison:
python scripts/real_provider_compare.py --workspace examples/2022C/DeepSeekV4Pro_V2.3 --mode llm_plan --phase 1 --provider-model deepseek:deepseek-chat --provider-model openai-compatible:<model>
Real provider commands read API keys only from local environment variables such as MATHMODEL_LLM_API_KEY. They write sanitized reports under docs/real_benchmarks/ and do not run controlled_apply, experiments, paper drafting or final verification.
python scripts/audit_benchmark.py --root examples/2022C
cd app
.\start.bat
Open:
http://127.0.0.1:5173
ARS can provide deeper methodology and editorial audits. Set ARS_ROOT to enable. It is advisory-only and should not be treated as a hard dependency.
Nature Figure integration can strengthen scientific plotting quality. Set NATURE_SKILLS_ROOT if installed.
Typical checks:
python skills/_references/scripts/resolve_nature_figure.py --workspace .
python skills/_references/scripts/audit_v2_run.py --workspace <contest-workspace>
PNG-only or Pillow-generated data figures should not be accepted as core evidence figures when vector-quality output is required.
Do not commit private contest data, large raw PDFs, local vector stores, local databases, runtime logs or generated private workspaces.
Normally ignored or local-only paths include:
workspaces/
knowledge/raw/
knowledge/.local/
examples/**/source/
examples/**/runs/
**/control-center-history.jsonl
.env
.venv
node_modules/
dist/
Commit only sanitized examples, scripts, templates, contracts and source notes.
Sanitized benchmark reports under docs/real_benchmarks/ may be committed when they contain no API keys, no private contest data and no active workspace payloads.
V2.7-alpha is the active project version.
The stable workflow foundation remains the V2 skill pipeline. The current experimental runtime milestone is LangGraph Runtime v1.0-alpha, centered on contest_graph_v3 and Benchmark Arena.
Benchmark Arena now includes provider-free fixture benchmarks, real-provider Phase 1 planning smoke reports, and a deterministic multi-provider comparison MVP.
V1 is archived under archive/v1/ and should not be used for new contests.
This repository is best understood as a contest-oriented AI workflow framework. The most important deliverable is not a single script, but a reproducible workspace containing model decisions, code results, figures, evidence traces, review reports, revision records and a final compiled paper.
CC-BY-NC 4.0