A truth-first, reproducible, open-source Lean 4 theorem prover CLI that combines:
- LLM-guided reasoning (creative step proposals)
- Lean verification (zero hallucinations: only accepted if Lean checks)
- Retrieval (premise selection from mathlib / local repos)
- Search + caching (best-first / beam + transposition table)
Ulam AI is designed to plug into Codex / Claude Code / Gemini CLI / Ollama and produce machine-checked Lean 4 proofs. Start here: UlamAI Prover Tutorial.
With Homebrew:

```
brew tap ulamai/ulamai
brew install ulamai
```

From a fresh clone:

```
git clone https://github.com/ulamai/ulamai.git
cd ulamai
./install.sh
```

In the folder where you want to prove or formalize things:

```
ulam -lean
```

Then launch the interactive menu:

```
ulam
```

Suggested Codex model: `gpt-5.2-codex` (or `gpt-5.3-codex` if available).
Example for Prove with natural-language guidance:

```
prove that every PID is a UFD
```
These modes apply to both prove and formalize workflows (you can set them via Settings or CLI flags).
Proof modes:
- `tactic`: LLM proposes the next tactic; LeanDojo checks each step. Pros: fast feedback, goal-state aware. Cons: brittle for long chains.
- `lemma`: LLM drafts a lemma plan; each lemma is proved sequentially. Pros: decomposes big proofs. Cons: depends on plan quality, more scaffolding.
- `llm`: LLM rewrites the Lean file; the Lean CLI typechecks it. Pros: handles multi-step edits, no Dojo required. Cons: slower, less goal guidance, relies on the LLM (requires the Lean CLI).
Prove output formats:
- `lean` (default): current machine-checked Lean proving pipeline.
- `tex`: separate informal proving pipeline that plans a claim graph, solves claims with worker drafts, runs judge + adversarial verifier + domain checks, supports pass-based replan/backtrack when a plan stalls, and writes resumable per-run artifacts (`state.json`, `events.jsonl`, `summary.json`) before composing a final `.tex` proof draft for later `formalize`.
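The per-run artifacts are plain JSON/JSONL, so a stalled run can be inspected with ordinary tooling. A minimal sketch of reading `events.jsonl` (the `type` field name is an assumption for illustration, not the tool's documented schema):

```python
import json

def summarize_events(lines):
    """Count events per kind in events.jsonl lines (event schema assumed)."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        kind = json.loads(line).get("type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Feeding it the lines of a run's `events.jsonl` gives a quick per-event-type tally before deciding whether to resume.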
Lean backends:
- `dojo`: Pantograph/LeanDojo server. Pros: goal-state access, tactic execution. Cons: extra install, toolchain-pinning sensitivity.
- `cli`: `lake env lean` typecheck. Pros: simple, works with any toolchain. Cons: no goal-state feedback.
- `lsp`: Lean language server. Pros: goal/diagnostic loop without Pantograph, usable in `llm` mode and optional tactic/lemma search. Cons: still experimental for search parity vs Dojo.
This repo now contains a working benchmark-ready proving/formalization pipeline with reproducible reporting and optional Lean LSP loops:
- v0.2.5 highlights: optional stateful Lean LSP runner for tactic/lemma search (`--lean lsp`), TUI-selectable search backend (`dojo|lsp`), inference profiles (`--inference-profile`, `--gen-k`, `--exec-k`, `--verify-level`), planner/replan caching metrics, stricter parity-gate comparability checks, baseline-gated campaign automation (`run_bench_campaign.sh --compare-to ...`), and an upgraded `prove --output-format tex` informal pipeline with claim-graph solving + judge/verifier/check gates.
- Automation tactics (`aesop`/`simp`/`linarith`/`ring`) as a fallback during proof search
- Axiom toggle (axioms/constants allowed by default; disable with `--no-allow-axioms`)
- Resume last formalization in the menu + reuse prior artifacts
- Per-lemma LaTeX proof snippets injected into Lean comments and LLM prompts
- Formalize proof search now runs sequential tactic scripts (better multi-step chaining)
- Scripted solver (sequential tactics) enabled by default
- Lemma-first planning with automatic expansion on failure
- Run summaries appended to `.lean` on failed attempts
- Settings for solver choice and lemma limits
- LLM-only mode (Lean CLI typecheck, no Dojo)
- `ulam prove` and `ulam replay` commands
- Best-first search with beam cap + repair loop
- LLM adapters: OpenAI-compatible + Anthropic + Gemini + Ollama + CLI wrappers
- Lean runner: mock implementation + LeanDojo-v2/PyPantograph runner (external install required)
- Retrieval: token-overlap or embedding-based from a `--premises` file
- Trace logging to JSONL (`run.jsonl` by default)
- Interactive menu (`ulam`) for configuration + guided workflows
What is not implemented yet: robust value models and full autoformalization workflows.
Pipeline reference: `docs/pipeline.md`
Homebrew (recommended):

```
brew tap ulamai/ulamai
brew install ulamai
```

From a clone:

```
git clone https://github.com/ulamai/ulamai.git
cd ulamai
./install.sh
```

If your Python is externally managed (PEP 668), you can force a user install:

```
ULAM_BREAK_SYSTEM_PACKAGES=1 ./install.sh
```

Then:

```
ulam --help
```

Interactive menu:

```
ulam
```

Login from the CLI:

```
ulam auth codex
ulam auth claude
ulam auth gemini
```

Formalize a LaTeX document:

```
ulam formalize path/to/paper.tex --out path/to/paper.lean
```

Toy example:

```
ulam formalize examples/Formalize.tex --out examples/Formalize.lean
```

Tutorial with runnable examples:
- Markdown guide: `examples/UlamAI_Prover_Tutorial.md`
- Colab notebook: `examples/UlamAI_Prover_Tutorial.ipynb`
- Colab open link (main branch): https://colab.research.google.com/github/ulamai/ulamai/blob/main/examples/UlamAI_Prover_Tutorial.ipynb
- Includes both olympiad inputs: statement-only (`FormalizePolishOlympiad.tex`) and full-proof (`pol25.tex`).
Olympiad-style formalization example:

```
ulam formalize examples/FormalizePolishOlympiad.tex --out examples/FormalizePolishOlympiad.lean
```

Same olympiad theorem with full informal proof (recommended):

```
ulam formalize examples/pol25.tex --out examples/pol25.lean
```

Formalization options:
- `--no-equivalence` to skip statement equivalence checks.
- `--llm-check` / `--no-llm-check` to enable or disable semantic integrity checks.
- `--llm-check-timing end|mid+end` to choose when semantic checks run.
- `--llm-check-repairs N` to set semantic repair attempts.
- `--artifacts-dir` to store per-round artifacts (defaults to `runs/formalize_YYYYMMDD_HHMMSS`).
- `--proof-backend` (tactic|lemma|llm|dojo) to choose the proof backend.
- `--lean-backend` (dojo|cli|lsp) to choose the typecheck backend.
- `--segment` and `--segment-words` to formalize long TeX piece-wise.
- Local declaration retrieval is enabled by default during formalize proof search; set `ULAM_FORMALIZE_LOCAL_RETRIEVER=0` to disable it.
One-line installer:

```
curl -fsSL https://raw.githubusercontent.com/ulamai/ulamai/main/install.sh | bash
```

Homebrew (tap):

```
brew tap ulamai/ulamai
brew install ulamai
```

Maintainers: the Homebrew tap is auto-updated on release. See `.github/workflows/update-homebrew-tap.yml` and `scripts/update_homebrew_tap.sh` (requires a `TAP_PUSH_TOKEN` with push access to `ulamai/homebrew-ulamai`).
Mock mode lets you smoke-test the CLI without Lean installed:

```
python3 -m ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke
```

Natural language guidance:

```
python3 -m ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke --instruction "Use a short automation tactic first."
```

Run TeX proving with replan/backtrack and artifacts:

```
python3 -m ulam prove --theorem infinitely_many_primes --output-format tex \
  --statement "There are infinitely many prime numbers." \
  --llm codex_cli --tex-rounds 3 --tex-judge-repairs 2 --tex-worker-drafts 2 \
  --tex-replan-passes 2 --tex-artifacts-dir runs/prove_tex
```

Statement source used in that command: `examples/ProveTexPrimes.txt`
Resume a prior TeX proving run:

```
python3 -m ulam prove --theorem infinitely_many_primes --output-format tex \
  --statement "There are infinitely many prime numbers." \
  --llm codex_cli --tex-resume runs/prove_tex/<run_dir>
```

Verbose logs (LLM suggestions + tactic outcomes):

```
python3 -m ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke --verbose
```

Attach context files:

```
python3 -m ulam prove examples/Smoke.lean --theorem irrational_sqrt_two_smoke --context examples/Smoke.lean
```

Replay the run:

```
python3 -m ulam replay run.jsonl
```

Read-only checkpoint (typecheck/placeholders/axiom drift):

```
python3 -m ulam checkpoint examples/Smoke.lean --theorem irrational_sqrt_two_smoke --strict
```

Run review with next-step suggestions:

```
python3 -m ulam review --trace run.jsonl --file examples/Smoke.lean --theorem irrational_sqrt_two_smoke
```

Execute deterministic replay (re-run every tactic from the trace):

```
python3 -m ulam replay run.jsonl --execute --strict
```

Run the regression suite (mock by default):

```
python3 -m ulam bench --suite bench/regression.jsonl
# alias form also works
python3 -m ulam bench --suite regression
```

Run the suite with machine-readable reports:
```
python3 -m ulam bench --suite bench/regression.jsonl --report-json runs/bench/latest.json --report-markdown runs/bench/latest.md
```

Run with an inference profile (budgeted generate/rank/verify loop):

```
python3 -m ulam bench --suite internal_regression --llm codex_cli --openai-model gpt-5.3-codex --lean dojo --inference-profile balanced
# optional explicit overrides
python3 -m ulam bench --suite internal_regression --inference-profile verify --gen-k 8 --exec-k 3 --verify-level strict
```

Validate a benchmark suite before running:
```
python3 -m ulam bench-validate --suite bench/suites/internal_regression.jsonl
# alias form
python3 -m ulam bench-validate --suite internal_regression
```

Build a miniF2F suite from a local checkout:

```
python3 -m ulam bench-make-minif2f --root /path/to/miniF2F --split valid --out bench/suites/minif2f_valid.jsonl
python3 -m ulam bench-validate --suite bench/suites/minif2f_valid.jsonl
```

List known suite aliases and status:

```
python3 -m ulam bench-list-suites
```

Build fixed regression100 from a larger source suite:
```
python3 -m ulam bench-make-regression100 \
  --source bench/suites/minif2f_valid.jsonl \
  --out bench/suites/regression100.jsonl \
  --size 100 \
  --seed 0
python3 -m ulam bench-validate --suite regression100
```

Compare two benchmark reports:

```
python3 -m ulam bench-compare --a runs/bench/a.json --b runs/bench/b.json --out-markdown runs/bench/compare.md
```

Compare with an automated parity gate (non-regression thresholds):
```
python3 -m ulam bench-compare \
  --a runs/bench/a.json \
  --b runs/bench/b.json \
  --gate \
  --max-solved-drop 0 \
  --max-success-rate-drop 0 \
  --max-semantic-pass-rate-drop 0 \
  --max-regression-rejection-rate-increase 0 \
  --max-median-time-increase-pct 25 \
  --max-planner-replan-triggers-increase 0 \
  --max-planner-cached-tactic-tries-drop 0
```

Notes:
- `--gate` now also checks comparability by default: same suite SHA256 and same inference profile/budgets.
- If you intentionally compare different suites or profile settings, pass `--allow-suite-mismatch` and/or `--allow-profile-mismatch`.
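Conceptually, the gate reduces to per-metric threshold comparisons between a baseline and a candidate report. A minimal sketch (the metric keys `solved` and `median_time` are illustrative, not the report's actual JSON schema):

```python
def parity_gate(baseline, candidate,
                max_solved_drop=0,
                max_median_time_increase_pct=25.0):
    """Return a list of human-readable gate failures (empty list = pass)."""
    failures = []
    # Absolute non-regression on solved count.
    if baseline["solved"] - candidate["solved"] > max_solved_drop:
        failures.append("solved count regressed beyond threshold")
    # Relative non-regression on median wall-clock time.
    base_t, cand_t = baseline["median_time"], candidate["median_time"]
    if base_t > 0 and (cand_t - base_t) / base_t * 100 > max_median_time_increase_pct:
        failures.append("median time increased beyond threshold")
    return failures
```

The real gate adds more metrics (success rate, semantic pass rate, planner counters) and comparability checks, but each follows this same threshold pattern.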
Run a reproducible benchmark campaign (timestamped artifacts + env snapshot):
```
scripts/run_bench_campaign.sh --suite bench/suites/internal_regression.jsonl -- --llm codex_cli --openai-model gpt-5.3-codex --lean dojo
```

Run a gated campaign against a baseline report (release/CI style):

```
scripts/run_bench_campaign.sh \
  --suite bench/suites/internal_regression.jsonl \
  --compare-to runs/bench_campaigns/baseline/report.json \
  --max-solved-drop 0 \
  --max-success-rate-drop 0 \
  --max-semantic-pass-rate-drop 0 \
  --max-regression-rejection-rate-increase 0 \
  --max-median-time-increase-pct 25 \
  --max-planner-replan-triggers-increase 0 \
  --max-planner-cached-tactic-tries-drop 0 \
  -- --llm codex_cli --openai-model gpt-5.3-codex --lean dojo
```

Notes:
- Campaign runs default to local repo code (`python3 -m ulam`) for version consistency.
- The wrapper now fails if report artifacts are missing or if the report's metadata version/commit does not match the repo checkout.
- When `--compare-to` is provided, the wrapper runs `bench-compare --gate` with strict non-regression defaults (including planner counters and profile/suite comparability checks).
- Use `--allow-profile-mismatch` and/or `--allow-suite-mismatch` only for intentional cross-profile or cross-suite comparisons.
- Bench report JSON/Markdown now include dataset/split breakdowns and top tags when suite rows provide that metadata.
Build a retrieval index from local project + mathlib declarations:
```
python3 -m ulam index build --project /path/to/lean-project --scope both --out .ulam/premises_both.jsonl
```

LeanDojo-v2 mode (real Lean, requires a Lean project and a `sorry` placeholder):

```
python3 -m ulam prove path/to/File.lean --theorem MyTheorem --lean dojo --lean-project /path/to/lean-project
```

Lemma-first mode:

```
python3 -m ulam prove path/to/File.lean --theorem MyTheorem --prove-mode lemma
```

LLM-only mode (no Dojo, uses Lean CLI typecheck):

```
python3 -m ulam prove path/to/File.lean --theorem MyTheorem --prove-mode llm --llm-rounds 4
```

Formalize a `.tex` document:

```
python3 -m ulam formalize paper.tex --out paper.lean
```

Formalize with LLM-only proof attempts:

```
python3 -m ulam formalize paper.tex --proof-backend llm --lean-backend lsp
```

LLM-only proving with Lean LSP diagnostics:

```
python3 -m ulam prove path/to/File.lean --theorem MyTheorem --prove-mode llm --lean lsp
```

Informal TeX proving route (separate from Lean proving):

```
python3 -m ulam prove path/to/File.lean --theorem MyTheorem --output-format tex --tex-rounds 3 --tex-judge-repairs 2
```

From statement-only input (no Lean file):

```
python3 -m ulam prove --theorem MyTheorem --output-format tex --statement "For all n, ... "
```

TUI path: Settings -> Prover settings -> Default prove output format (lean|tex).
TeX mode knobs:
- `--tex-out`: explicit output path for the generated `.tex` draft.
- `--tex-rounds`: max fixed orchestration rounds for claim solving.
- `--tex-worker-drafts`: fixed number of worker candidates generated per claim per round.
- `--tex-judge-repairs`: max consecutive rounds without an accepted claim before composing the best available draft.

TeX mode execution model:
- the planner emits a claim graph (claims, dependencies, assumptions, required facts),
- workers draft each claim,
- each claim is gated by judge + adversarial verifier + domain checker,
- accepted claims are composed into the final theorem proof,
- token allocation is fixed by the settings above (no adaptive widening).
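Because the claim graph is a dependency DAG, composition only needs the claims in an order where every dependency precedes its dependents. A sketch of that ordering step (the `deps` dict representation is an assumption for illustration, not the tool's internal format):

```python
def compose_order(deps):
    """Topologically sort claims given {claim: [dependencies]} (assumes a DAG)."""
    order, seen = [], set()

    def visit(claim):
        if claim in seen:
            return
        seen.add(claim)
        for dep in deps.get(claim, []):
            visit(dep)          # emit dependencies first
        order.append(claim)

    for claim in deps:
        visit(claim)
    return order
```

With `{"main": ["lemma_a", "lemma_b"], "lemma_b": ["lemma_a"], "lemma_a": []}`, the lemmas land before `main` in the composed proof.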
Install the CLI entrypoint if you want `ulam` directly:

```
pip install -e .
```

UlamAI runs real proofs through LeanDojo, which needs a Lean project with Mathlib.
One-command setup (interactive):
```
ulam -lean
# or
python3 -m ulam -lean
```

Use one of the exact command forms above.

Non-interactive:

```
ulam lean-setup --dir ./ulam-lean --yes
```

Flags:
- `--skip-elan` to skip installing Lean.
- `--no-build` to skip `lake build`.
- `--no-dojo` to skip LeanDojo/Pantograph.
- `--no-config` to avoid writing `.ulam/config.json`.
- `--pip-timeout` / `--pip-retries` to handle slow PyPI downloads.
- `--toolchain` to pin a specific Lean toolchain (otherwise uses Mathlib's default).
- `--use-mathlib-toolchain` to keep the toolchain from the Mathlib template even if Pantograph differs.
- `--lakefile-lean` to generate a `lakefile.lean` mirror of `lakefile.toml`.
Note: Pantograph is anchored to a specific Lean toolchain (from its `src/lean-toolchain`). UlamAI will align the Mathlib project to that toolchain when possible.
Install Lean + Lake (macOS/Linux):

```
curl https://elan.lean-lang.org/elan-init.sh -sSf | sh
source $HOME/.elan/env
```

Create a Mathlib project:

```
lake +leanprover-community/mathlib4:lean-toolchain new MyMathlibProject math
cd MyMathlibProject && lake build
```

Install LeanDojo + Pantograph:

```
pip install lean-dojo-v2
pip install git+https://github.com/stanford-centaur/PyPantograph
```

Then set `ULAM_LEAN_PROJECT` or configure the Lean project path in the menu.
OpenAI-compatible (default):
- `ULAM_OPENAI_API_KEY`
- `ULAM_OPENAI_BASE_URL` (default `https://api.openai.com`)
- `ULAM_OPENAI_MODEL` (default `gpt-4.1`)

OpenAI (ChatGPT subscription) login:
- Run `ulam auth codex` (or `codex login`) and UlamAI can import credentials from `~/.codex/auth.json`.
- This is the same flow used by the official Codex CLI (ChatGPT sign-in creates a key automatically).

Codex CLI provider (subscription):
- Use `codex login` and set `--llm codex_cli` (no API key required).
Ollama:
- `ULAM_OLLAMA_BASE_URL` (default `http://localhost:11434`)
- `ULAM_OLLAMA_MODEL` (default `llama3.1`)

How to set it up (example with Llama 3.1 8B):

```
# 1) Install Ollama
brew install ollama  # or follow https://ollama.com/download
# 2) Start the server
ollama serve
# 3) Pull a model
ollama pull llama3.1:8b
# 4) Point Ulam to it (optional; defaults shown)
export ULAM_OLLAMA_BASE_URL="http://localhost:11434"
export ULAM_OLLAMA_MODEL="llama3.1:8b"
```

Claude (Anthropic):
- `ULAM_ANTHROPIC_API_KEY` or `ULAM_ANTHROPIC_SETUP_TOKEN`
- `ULAM_ANTHROPIC_BASE_URL` (default `https://api.anthropic.com`)
- `ULAM_ANTHROPIC_MODEL` (default `claude-3-5-sonnet-20240620`)
Claude Code CLI provider (subscription):
- Run `ulam auth claude` (or `claude setup-token`) and set `--llm claude_cli` (no API key required).

Gemini API:
- `ULAM_GEMINI_API_KEY` (or `GEMINI_API_KEY`)
- `ULAM_GEMINI_BASE_URL` (default `https://generativelanguage.googleapis.com/v1beta/openai`)
- `ULAM_GEMINI_MODEL` (default `gemini-3.1-pro-preview`)

Gemini CLI provider (subscription/login):
- Run `ulam auth gemini` and complete OAuth in your browser (no API key required).
- Ulam first uses its built-in browser+callback flow and automatically falls back to Gemini CLI native login.
- If your environment is unusual and auto-discovery still fails, set:
  - `ULAM_GEMINI_OAUTH_CLIENT_ID`
  - `ULAM_GEMINI_OAUTH_CLIENT_SECRET`
  - or `ULAM_GEMINI_OAUTH2_JS` (path to Gemini CLI `oauth2.js`)

Embeddings (for retrieval):
- `ULAM_EMBED_API_KEY` (defaults to `ULAM_OPENAI_API_KEY`)
- `ULAM_EMBED_BASE_URL` (defaults to `ULAM_OPENAI_BASE_URL`)
- `ULAM_EMBED_MODEL` (default `text-embedding-3-small`)
Premises file format:
- `--premises path/to/premises.txt`
- One premise per line (e.g., `lemma_name : statement`)

Retrievers:
- `--retriever simple` (token overlap)
- `--retriever embedding` (OpenAI-compatible embeddings)
- `--retriever none`
- `--retriever-k N` to control how many premises are injected per state (default `8`)
- If `--premises` is omitted, Ulam can auto-index declarations from Lean sources:
  - `--retriever-source local|mathlib|both`
  - `--retriever-build auto|always|never`
  - `--retriever-index path/to/index.jsonl` (default `.ulam/premises_<source>.jsonl` in the project)
  - Build explicitly with `ulam index build ...`
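The simple retriever can be pictured as plain token-overlap scoring between the goal text and each premise line. A minimal sketch (the scoring details are illustrative, not Ulam's exact implementation):

```python
import re

def rank_premises(goal, premises, k=8):
    """Rank premise lines by shared-identifier count with the goal."""
    def tokens(s):
        # Lean-ish identifiers, including dotted names like Real.sqrt
        return set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", s))

    goal_toks = tokens(goal)
    scored = [(len(goal_toks & tokens(p)), p) for p in premises]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]
```

Given the goal `Irrational (Real.sqrt 2)`, a premise mentioning `Irrational` and `Real.sqrt` outranks an unrelated `add_comm` lemma.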
Replay metadata:
- Each run trace `*.jsonl` now writes a sidecar `*.meta.json` with pinned environment data (Lean backend, project path, toolchain, mathlib commit/rev, file hash, and run config).
- `ulam replay ... --execute --strict` checks and replays against that metadata.
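Strict replay works because the sidecar pins what the trace was recorded against; checking it amounts to comparing the pinned values with the current environment. A sketch of that comparison (the field names `file_sha256` and `toolchain` are assumptions, not the real `*.meta.json` schema):

```python
import hashlib

def check_meta(meta, current_file_bytes, current_toolchain):
    """Return mismatches between pinned metadata and the current environment."""
    mismatches = []
    digest = hashlib.sha256(current_file_bytes).hexdigest()
    if meta.get("file_sha256") != digest:
        mismatches.append("file hash changed since trace was recorded")
    if meta.get("toolchain") != current_toolchain:
        mismatches.append("Lean toolchain differs from pinned value")
    return mismatches
```

An empty list means the environment still matches; anything else is grounds for a strict replay to refuse to run.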
Solver strategy:
- `--solver search` for best-first tactic search
- `--solver script` for sequential script mode
- `--solver portfolio` to run script-first, then best-first fallback
Menu config file:
- Stored at `.ulam/config.json` by default (override with `ULAM_CONFIG` or `ULAM_CONFIG_DIR`).
- If no provider credentials are set, the menu will prompt you to configure them before proving.
- LLM-only mode fails to typecheck: make sure `lean`/`lake` is on PATH and the file is inside a Lean project (has `lakefile.lean` or `lean-toolchain`).
- LeanDojo mismatch errors: run `ulam -lean` in your project folder or re-run `ulam lean-setup` to align the toolchain.
- Pantograph missing after Python/Homebrew upgrades: run `ulam -lean` to reinstall LeanDojo/Pantograph. Ulam also attempts a one-time auto-install on demand (disable with `ULAM_AUTO_INSTALL_PANTOGRAPH=0`).
- Codex/Claude/Gemini CLI hangs: set an LLM request timeout in Settings, or keep it at `0` and use heartbeat logs to verify it is running.
Ulam AI is a CLI tool that:
- opens a Lean goal (from a theorem in a file or a snippet),
- repeatedly asks an LLM for a single next action (tactic line or small lemma),
- executes it in Lean,
- uses errors as feedback to repair and backtrack,
- returns a final verified Lean proof (or a replayable failure trace).
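That loop can be sketched in a few lines; `propose` and `run_tactic` below stand in for the LLM adapter and the Lean runner (both names are hypothetical):

```python
def prove_loop(goal, propose, run_tactic, max_steps=16):
    """Verified-first loop: a step counts only if Lean accepts it."""
    script, feedback = [], None
    for _ in range(max_steps):
        tactic = propose(goal, feedback)      # LLM suggests one next action
        new_goal, error = run_tactic(goal, tactic)
        if error:                             # rejected: retry with the error as feedback
            feedback = error
            continue
        script.append(tactic)                 # accepted: the step is machine-checked
        goal, feedback = new_goal, None
        if goal is None:                      # no goals left: verified proof
            return script
    return None                               # budget exhausted: failure trace only
```

Nothing enters `script` unless Lean accepted it, which is what makes the output hallucination-free by construction.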
- Not an IDE replacement (but it can generate patches you apply in your editor)
- Not a fully-fledged RL system on day 1
- Not “autoformalize any paper” out of the gate (that’s a later module)
- Verified-first: trust Lean, not the model.
- Search backbone: don’t bet everything on one-shot prompting; use best-first/beam.
- Reproducibility: pin toolchain + mathlib commit + timeouts + seeds; log everything.
- Small, local lemmas: encourage `have` steps that are easy to check.
- Exploit automation: `simp`, `aesop`, `linarith`, `ring`, `norm_num`, etc.
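For instance, a goal that survives several LLM tactic proposals often still falls to one of these cheap closers (an illustrative Lean snippet, not part of the repo):

```lean
example (a b : ℕ) : a + b = b + a := by
  -- cheap automation fallback: try closers like `ring` before deeper search
  ring
```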
- Lean runner (stateful): interacts with Lean 4 / mathlib and applies tactics
- LLM policy: proposes candidate next tactics (k=8–16 per state)
- Retriever: selects relevant premises (top-k) from mathlib/local repo
- Search engine: best-first/beam search over proof states
- Cache (current): transposition table `state_key -> best_seen_score` (avoids re-exploring)
- Repair loop: on tactic failure, ask the LLM to fix syntax/type mismatches, with bounded retries
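The search core is a priority queue over proof states plus the transposition table above. A compact sketch (the scoring function and the `expand` callback are illustrative; real state keys come from canonical state hashing):

```python
import heapq

def best_first(start_state, expand, score, is_done, beam=8, budget=100):
    """Best-first search with a transposition table keyed by state."""
    best_seen = {}                            # state_key -> best_seen_score
    frontier = [(-score(start_state), start_state, [])]
    for _ in range(budget):
        if not frontier:
            break
        _, state, path = heapq.heappop(frontier)
        if is_done(state):
            return path                       # tactic script to the solved state
        for tactic, nxt in expand(state)[:beam]:
            s = score(nxt)
            key = str(nxt)                    # stand-in for canonical state hashing
            if best_seen.get(key, float("-inf")) >= s:
                continue                      # already explored at least this well
            best_seen[key] = s
            heapq.heappush(frontier, (-s, nxt, path + [tactic]))
    return None
```

A toy "proof" over integers (reach 0 from 3 by decrementing) exercises both the ordering and the transposition pruning.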
- Requires LeanDojo-v2 + PyPantograph installed.
- Assumes the target theorem contains a `sorry` placeholder.
- The CLI loads the Lean file and selects the goal corresponding to that `sorry`.
- Use `--lean-project` (or `ULAM_LEAN_PROJECT`) to point at the Lean project root.
Lean execution time dominates. Best-first + caching + transpositions delivers a strong baseline fast and provides clear debugging signals. MCTS/MCGS can be layered later.
- LeanDojo-v2 runner for proof-state execution
- Embedding-based retrieval (mathlib + local repo)
- Step cache `(state_key, tactic) -> result`
- Deterministic replay baseline with pinned metadata + strict execute/alignment checks (hardening ongoing)
- Minimal regression suite (`ulam bench`)
- Optional Lean LSP backend track for interactive goal/diagnostic loops in LLM/formalize mode (Dojo remains the default for tactic search until the parity gate passes).
- Define a parity gate for any LSP promotion: runner `start`/`apply` semantics, replay/cache stability, and non-regression on internal + miniF2F slices.
start/applysemantics, replay/cache stability, and non-regression on internal + miniF2F slice. - Better state canonicalization (stable hashing)
- Better scoring heuristic / lightweight value model (non-RL)
- More robust retrieval formatting (names + one-line statements)
- Stronger action constraints and cost controls (e.g., `aesop` budgeted)
- Regression suite management (`ulam bench-list-suites`, a fixed-suite builder, and `ulam bench --suite regression100` after generation)
- Execution plan for the next patch release: `docs/v0.2.1_plan.md`
Potential formalization pipeline (later v0.2.x, not active yet):
- Add pass-based replan/backtrack for `formalize` at the declaration-graph level (beyond round-local repairs).
- Upgrade LLM proof filling from single-candidate loops to worker-candidate + judge/verifier selection.
- Move LLM `formalize` edits to declaration/body-scoped patches (reducing full-file collateral regressions).
- Add stateful formalization artifacts (`state.json`, `events.jsonl`, `summary.json`) for deterministic resume/debugging.
- Add dedicated formalization regression tests (stalls, equivalence-repair behavior, resume determinism, guardrail non-regression).
- Trace extraction into JSONL
- Train tactic policy (SFT) on mathlib traces
- Optional “proofstep dataset” for your chosen model family
- Evaluate on miniF2F slice + internal suite
- Proof-state graph reuse improvements
- MCGS/MCTS option behind a flag
- Learned value model to guide search (trained from traces)
- Parallel rollouts (Ray or multiprocessing)
- On-policy fine-tuning (PPO/GRPO-style) on successful+near-miss trajectories
- Reward shaping that correlates with solved proofs (avoid fragile heuristics)
- Curriculum scheduling (easy → harder)
- PDF/text → candidate formal statements
- Semantic checks (avoid “formalization drift”)
- Iterative type-check repair + theorem retrieval scaffolding
- Output: Lean files + proof attempts + TODO gaps
This validates end-to-end retrieval + tactic selection + Lean checking.

```lean
import Mathlib.NumberTheory.Real.Irrational

theorem irrational_sqrt_two_smoke : Irrational (Real.sqrt 2) := by
  simpa using irrational_sqrt_two
```