spec: generic syntax-highlight definition mechanism (#9955) #10129
lonexreb wants to merge 4 commits into warpdotdev:master
Adds product.md + tech.md for issue warpdotdev#9955: a contributor-friendly mechanism for adding new languages to Warp's syntax highlighting without modifying compiled Rust code and without releasing Warp.

Investigation: today, adding a language requires changes in 5+ places in crates/languages/src/lib.rs (SUPPORTED_LANGUAGES array, language_by_filename match, to_arborium_name match, get_arborium_highlight_query match, plus a grammars/<lang>/ folder). The closed registry blocks the most-requested kind of community contribution: "I use $LANG and would happily contribute the highlighting definition." That contribution today requires touching the internal arborium crate dependency and shipping a Warp release.

Spec proposes:
- Three-source discovery (compile-time hardcoded, bundled directory, user-local directory) with explicit priority order. Hardcoded > bundled > user-local. Staged migration: V1 adds the discovery layer beside the hardcoded path; existing 32 languages keep working unchanged. No flag day.
- Schema-driven language.toml (display_name, internal_name, comment_prefix, indent_unit, file_associations [extensions, filenames, shebangs, aliases], brackets, parser, ts_abi). One contract a contributor learns; everything else is standard tree-sitter files.
- WASM-only for runtime-loaded user grammars; native dynamic libraries (.so/.dylib/.dll) explicitly rejected with a clear error message and no dlopen call exists on the loader path.
- Validation with clear failure modes: a malformed grammar does NOT break Warp startup; surfaces via log + Settings page notification. Other grammars load normally.
- Settings > Editor > Languages page lists all loaded grammars, their source, and any failures.
- Tree-sitter substrate preserved (rejects switching to TextMate/Sublime regex grammars referenced in the issue as community-distribution exemplars only, not as recommended tech).
- Per-language migration template for the 32 existing languages, each as an independently revertable PR.

Test plan covers five schema-validation unit tests, four loader unit tests (including ABI mismatch and collision dedup), and three integration tests with a real tree-sitter grammar fixture.

Six risks called out (WASM perf cost, schema versioning, capture-name standard set, ABI mismatch detection, sub-language injection, theme integration) with concrete TECH decisions or recommended follow-ups for each.

Four open questions for maintainer review on tree-sitter version prerequisites, XDG path fallback, per-language disable in V1, and inventory-style auto-registration for bundled parsers.
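The priority order in the first bullet (hardcoded > bundled > user-local) together with the collision dedup mentioned in the test plan could be sketched roughly as follows. This is a minimal illustration, not the spec's loader: `Candidate` and `dedup_by_priority` are hypothetical names, and the real types carry more fields.

```rust
use std::collections::HashMap;

/// Where a grammar definition came from. Variant order encodes priority:
/// earlier variants win on a name collision (hardcoded > bundled > user-local).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum LanguageSource {
    Hardcoded,
    Bundled,
    UserLocal,
}

/// Hypothetical stand-in for a discovered grammar.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub internal_name: String,
    pub source: LanguageSource,
}

/// Keep one candidate per `internal_name`, preferring the highest-priority
/// source. Returns the winners sorted by name, plus the shadowed candidates.
pub fn dedup_by_priority(candidates: Vec<Candidate>) -> (Vec<Candidate>, Vec<Candidate>) {
    let mut winners: HashMap<String, Candidate> = HashMap::new();
    let mut shadowed = Vec::new();
    for cand in candidates {
        // An existing entry with equal or higher priority keeps its slot.
        let keep_existing = winners
            .get(&cand.internal_name)
            .map_or(false, |existing| existing.source <= cand.source);
        if keep_existing {
            shadowed.push(cand);
        } else if let Some(prev) = winners.insert(cand.internal_name.clone(), cand) {
            shadowed.push(prev);
        }
    }
    let mut kept: Vec<Candidate> = winners.into_values().collect();
    kept.sort_by(|a, b| a.internal_name.cmp(&b.internal_name));
    (kept, shadowed)
}
```

Returning the shadowed candidates alongside the winners is a design choice of this sketch; it would let a diagnostics page report which definitions lost a collision instead of dropping them silently.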
I'm starting a first review of this spec-only pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Powered by Oz
Overview
The spec defines a tree-sitter-based registry for bundled and user-local syntax highlighting definitions, with a product contract, loader architecture, migration plan, and tests. The direction addresses the linked issue, but several requirements contradict each other or leave security-critical implementation details unresolved.
Concerns
- The goal promises no compiled Rust changes/no release for both bundled and user-local paths, while the bundled path still ships in Warp and the tech spec requires Cargo and parser-map changes.
- The migration plan alternates between a single mechanical migration PR and independently revertable per-language migrations.
- The loader/API model does not cleanly represent failed grammar loads, yet the product requires failed grammars to appear in diagnostics.
- The product allows missing `highlights.scm`, but the tech loader rejects highlight-query load failures.
Security
- User-local WASM is treated as sufficient sandboxing without specifying resource limits or host capabilities for untrusted parsers loaded at startup.
- Failure diagnostics log and display full grammar directory paths even though the telemetry section identifies paths as PII.
Verdict
Found: 0 critical, 6 important, 1 suggestion
Request changes
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
| ## Goal | ||
|
|
||
| A contributor can add a new language to Warp's syntax highlighting | ||
| **without modifying compiled Rust code and without releasing Warp**, |
| match statements are replaced with a registry-driven lookup. The | ||
| existing 32 languages get their associations migrated from the | ||
| match statements to per-language `language.toml` files in a | ||
| single mechanical PR (this spec calls out that PR as a follow-up, |
| libraries (`.so`, `.dylib`, `.dll`) are explicitly rejected and | ||
| never loaded. The WASM is loaded via tree-sitter's existing WASM | ||
| runtime. WASM provides the sandboxing that makes user-local | ||
| grammars safe. |
| A grammar directory that fails to load (malformed `language.toml`, | ||
| WASM that fails to instantiate, `highlights.scm` that fails to | ||
| parse against the grammar) does NOT break Warp startup. Instead: | ||
| - A `log::error!` fires with the directory path and the failure |
|
|
||
| # User-local grammars: WASM file path relative to the grammar dir. | ||
| # Mutually exclusive with [parser.rust_crate]. | ||
| wasm = "grammar.wasm" |
💡 [SUGGESTION] Split this schema block into separate bundled and user-local examples; as written, the canonical example sets both rust_crate and wasm even though the comments say they are mutually exclusive.
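One way to make the "mutually exclusive" comment enforceable is to validate the raw `[parser]` table into a type that can hold only one source. The sketch below assumes the two keys named in the comment (`rust_crate`, `wasm`); the struct, enum, and error names are hypothetical and not taken from the spec.

```rust
use std::path::PathBuf;

/// Raw `[parser]` table as it might be deserialized from `language.toml`.
/// Both keys are optional at this stage so that validation can report errors.
#[derive(Debug, Default, Clone)]
pub struct RawParser {
    pub rust_crate: Option<String>,
    pub wasm: Option<String>,
}

/// Validated form: exactly one parser source, so "both set" is unrepresentable.
#[derive(Debug, PartialEq, Eq)]
pub enum ParserSource {
    RustCrate(String),
    Wasm(PathBuf),
}

/// Hypothetical validation errors for the `[parser]` table.
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    BothParserSourcesSet,
    NoParserSource,
}

/// Reject tables that set both keys or neither; otherwise return the one source.
pub fn validate_parser(raw: &RawParser) -> Result<ParserSource, SchemaError> {
    match (&raw.rust_crate, &raw.wasm) {
        (Some(_), Some(_)) => Err(SchemaError::BothParserSourcesSet),
        (Some(krate), None) => Ok(ParserSource::RustCrate(krate.clone())),
        (None, Some(path)) => Ok(ParserSource::Wasm(PathBuf::from(path))),
        (None, None) => Err(SchemaError::NoParserSource),
    }
}
```

With this shape, a canonical example that sets both keys would fail validation instead of silently picking one.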
| pub struct LoadedLanguage { | ||
| pub language: Arc<Language>, | ||
| pub source: LanguageSource, | ||
| pub failure: Option<LoadFailure>, |
LoadedLanguage cannot represent a grammar that fails before a Language is constructed because language is mandatory, yet failed grammars must be returned and listed in Settings. Define a separate LoadResult/FailedGrammar variant before implementation.
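A sum type along the lines the comment asks for might look like the sketch below. `Language` here is a placeholder for the real type in `crates/languages`, the failure-reason variants are illustrative only, and `settings_row` is a hypothetical helper showing that both outcomes can be listed.

```rust
use std::sync::Arc;

/// Placeholder for the real `Language` type.
#[derive(Debug)]
pub struct Language {
    pub display_name: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanguageSource {
    Hardcoded,
    Bundled,
    UserLocal,
}

/// Illustrative failure reasons; the spec's actual variant list may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LoadFailureReason {
    MalformedToml(String),
    WasmInstantiationFailed,
    InvalidHighlightQuery(String),
}

#[derive(Debug)]
pub struct LoadedLanguage {
    pub language: Arc<Language>,
    pub source: LanguageSource,
}

#[derive(Debug)]
pub struct FailedGrammar {
    /// Best effort: `None` when the failure happened before a name was read.
    pub internal_name: Option<String>,
    pub source: LanguageSource,
    pub reason: LoadFailureReason,
}

/// A failed grammar no longer needs a `Language` to exist.
#[derive(Debug)]
pub enum LoadResult {
    Loaded(LoadedLanguage),
    Failed(FailedGrammar),
}

/// One row for a diagnostics listing; works for both variants.
pub fn settings_row(result: &LoadResult) -> String {
    match result {
        LoadResult::Loaded(l) => format!("{} [{:?}] ok", l.language.display_name, l.source),
        LoadResult::Failed(f) => format!(
            "{} [{:?}] failed: {:?}",
            f.internal_name.as_deref().unwrap_or("<unknown>"),
            f.source,
            f.reason
        ),
    }
}
```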
| Reject if `WasmStore` reports an ABI mismatch with the host's | ||
| `tree_sitter::TREE_SITTER_LANGUAGE_VERSION`. | ||
| 3. Compile `highlights.scm` against the resolved grammar via | ||
| `Query::new`. On failure: record `LoadFailure`, return. |
highlights.scm still loads without coloring, but this loader step treats highlight-query failure as LoadFailure and returns. Specify missing-file handling separately from invalid-query handling.
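The distinction the comment asks for, a missing file versus an invalid query, can be made explicit in the return type. In the sketch below the `compile` closure stands in for the real query compiler (for example tree-sitter's `Query::new`), so the example stays free of external crates; `HighlightOutcome` and `resolve_highlights` are hypothetical names.

```rust
/// Outcome of resolving a grammar's optional `highlights.scm`.
#[derive(Debug, PartialEq, Eq)]
pub enum HighlightOutcome<Q> {
    /// File absent: the grammar still loads, just without coloring.
    MissingLoadsUncolored,
    /// File present and accepted by the query compiler.
    Compiled(Q),
    /// File present but rejected: the grammar fails to load.
    Invalid(String),
}

/// `source` is `None` when the file does not exist on disk.
/// `compile` is an injected stand-in for the real query compiler.
pub fn resolve_highlights<Q, F>(source: Option<&str>, compile: F) -> HighlightOutcome<Q>
where
    F: FnOnce(&str) -> Result<Q, String>,
{
    match source {
        None => HighlightOutcome::MissingLoadsUncolored,
        Some(text) => match compile(text) {
            Ok(query) => HighlightOutcome::Compiled(query),
            Err(message) => HighlightOutcome::Invalid(message),
        },
    }
}
```

Only the `Invalid` arm would map to a load failure; the missing-file arm would map to a warning at most.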
Bot review (warpdotdev#10129) identified 6 important + 1 suggestion concerns:
- Goal conflated bundled and user-local paths: bundled still ships in Warp and the tech spec requires Cargo / parser-map changes.
- Migration plan alternated between "single mechanical PR" and "independently revertable per-language migrations."
- LoadedLanguage couldn't represent grammars that fail before a Language is constructed, yet failed grammars must appear in the Settings page diagnostics.
- Product allowed missing highlights.scm to load without coloring, but the tech loader treated highlight-query failure as LoadFailure and returned. Missing-file vs invalid-query were conflated.
- WASM treated as sufficient sandboxing without specifying resource limits or host capabilities for untrusted parsers.
- Logs and notifications used the full grammar directory path while telemetry called paths PII.
- Schema example set both `rust_crate` and `wasm` even though comments said they were mutually exclusive.

Fixes:
- Goal split into G1 (user-local: no Warp release) and G2 (bundled: no hand-written match arms but does ship with Warp). The original issue's actual unblocking outcome, contributor velocity, is now served by both paths even though only G1 satisfies "no release."
- Single migration strategy: one PR per language, independently revertable, V1 migrates zero. The product spec is now consistent.
- New LoadResult sum type with Loaded(LoadedLanguage) and Failed(FailedGrammar) variants. FailedGrammar carries a best-effort internal_name (None if even the TOML parse failed) and a typed LoadFailureReason. Settings page lists both.
- Missing-vs-invalid split: missing optional .scm files produce LoadWarning::HighlightsScmMissing/etc. and the grammar still loads. Present-but-invalid queries produce LoadResult::Failed and the grammar is rejected (shipping a broken query is worse than no query). Tests T10/T11/T12 cover both.
- WASM safety contract: no host capabilities, parse timeout (100ms default), input-size cap (8MiB default), runtime memory cap (64MiB default), instantiation timeout (5s). All tunable via env vars. Worker-thread isolation called out as a known V1 limitation. Tests T13/T14/IT5 cover the limits.
- Path redaction: logs use directory basenames only; full paths appear ONLY in the local Settings UI (not log output, not telemetry). Telemetry payloads enumerated explicitly.
- Schema example split into bundled-only (`rust_crate`) and bundled-or-user-local (`wasm`) shapes. validate() now also rejects rust_crate in user-local grammars and rejects unknown top-level keys.

Each correction includes a "Correction (review warpdotdev#10129)" callout so reviewers can cross-reference the bot's concerns.
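The path-redaction fix (logs carry directory basenames only) can be illustrated with a small helper. `redact_for_log` and the `<grammar-dir>` fallback string are hypothetical; the spec only states the basename rule.

```rust
use std::path::Path;

/// Reduce a grammar directory path to its final component for log output,
/// so that home-directory and project paths never reach the log.
pub fn redact_for_log(dir: &Path) -> String {
    dir.file_name()
        .map(|name| name.to_string_lossy().into_owned())
        // Paths with no final component (such as "/") get a fixed placeholder.
        .unwrap_or_else(|| "<grammar-dir>".to_string())
}
```

A call site would then log `redact_for_log(&dir)` and reserve the full path for the local Settings UI.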
Pushed corrections addressing all 6 important + 1 suggestion concerns from the bot review:
Each correction includes a "Correction (review #10129)" callout in the spec for cross-reference. Ready for re-review. /oz-review
Overview
This spec defines a schema-driven mechanism for bundled and user-local tree-sitter grammars, including migration strategy, validation, settings visibility, and privacy/security constraints. The direction aligns with the linked issue, but several design details need to be resolved before implementation can proceed safely.
Concerns
- The WASM safety contract bounds parser execution but not execution of untrusted highlight/indent/symbol queries.
- The missing-`highlights.scm` behavior conflicts with the existing `Language` API constraints the spec says will remain unchanged.
- The core user-local WASM path is still gated on an unresolved dependency/version question.
Security
- User-supplied `.scm` queries need explicit CPU/memory/output bounds or a fallback path, not just parser timeouts.
Verdict
Found: 0 critical, 3 important, 0 suggestions
Request changes
| filesystem, network, or process capabilities to grammar code by | ||
| design. The loader rejects any WASM module that attempts to import | ||
| symbols outside tree-sitter's required exports. | ||
| - **CPU bound — parse timeout.** Each parse invocation is gated by |
highlights.scm, indents.scm, and identifiers.scm queries can also consume CPU/memory or emit excessive matches. Specify query execution limits (for example match limits, viewport/input caps, timeout/fallback behavior) before treating user-local grammars as bounded.
|
|
||
| The existing `editor.indent_unit` per-language settings, the | ||
| `renderer.theme` highlight color mappings, and any other downstream | ||
| consumer of `Language` continues to work. The `Language` struct |
highlights.scm path: the current Language has a mandatory highlight_query, and consumers construct highlight state from it unconditionally. Either make the query optional across the API or specify an empty-query fallback so a no-coloring language can actually be constructed.
|
|
||
| ## Open questions for maintainer review | ||
|
|
||
| 1. WASM grammars require a tree-sitter version that supports |
WasmStore version question in the spec and include the concrete dependency/upgrade plan if Warp's current tree-sitter stack does not already support it.
…t spec
Round-2 bot review identified 3 remaining concerns:
- The WASM safety contract bounded parser execution but not
query execution. Tree-sitter Query::matches / Query::captures
runs in a separate code path with its own potential
pathologies and was unbounded.
- "Missing highlights.scm loads without coloring" conflicts with
the actual Language API — pub highlight_query: Query (NOT
Option<Query>) at crates/languages/src/lib.rs. A Language
cannot be constructed without one.
- The user-local WASM path is gated on an unresolved tree-sitter
version question. Warp's bundled grammars come through the
internal arborium crate (version 2); confirming WASM support
needs maintainer input.
Fixes:
- Added WARP_GRAMMAR_QUERY_TIMEOUT_MS (default 50ms) bound on
query execution via a QueryCursor wrapper polling an
AtomicBool from a watchdog thread. Added
WARP_GRAMMAR_MAX_QUERY_CAPTURES (default 100k) cap on per-buffer
query output to bound memory for pathological highlight queries.
Both apply to highlight, indent, and identifiers queries
uniformly.
- Missing-highlights.scm now synthesizes an empty highlight query
via Query::new(grammar, "") instead of leaving the field unset.
Tree-sitter accepts empty source (zero patterns); matches at
runtime return zero captures so no coloring is applied. The
Language API stays unchanged, preserving B8. Indent and
identifiers queries are already Option<Query> in the codebase
so they keep their None semantics for missing files.
- G1 (user-local grammars, no Warp release) is explicitly deferred
out of V1 until the arborium / tree-sitter version question
resolves. V1 ships only G2 (bundled-grammar discovery). User-
local WASM is wired through the loader as
LoadResult::Failed { reason: UserLocalWasmNotYetSupported } so
the API shape stabilizes without enabling the path. A follow-up
PR flips the gate once the WASM-tree-sitter version is
confirmed.
Each correction includes a "Correction (re-review warpdotdev#10129)" callout.
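The two query bounds described in the first fix (a capture cap plus a cancellation flag flipped by a watchdog) can be sketched without tree-sitter by treating query output as a plain iterator. This is an assumption-laden illustration: `collect_bounded`, `QueryStop`, and `limit_from_env` are hypothetical names, and the real wrapper would sit around a `QueryCursor`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Default for `WARP_GRAMMAR_MAX_QUERY_CAPTURES` as stated in the commit message.
pub const DEFAULT_MAX_QUERY_CAPTURES: usize = 100_000;

/// Parse a numeric limit from an optional env-var value, falling back to the
/// default when the variable is unset or not a valid number.
pub fn limit_from_env(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Why a bounded query run stopped.
#[derive(Debug, PartialEq, Eq)]
pub enum QueryStop {
    Exhausted,
    CaptureCapReached,
    Cancelled,
}

/// Drain `captures` until it ends, the cap is hit, or `cancelled` is set
/// (in the spec's design, by a watchdog thread after the timeout elapses).
pub fn collect_bounded<T>(
    captures: impl Iterator<Item = T>,
    max_captures: usize,
    cancelled: &AtomicBool,
) -> (Vec<T>, QueryStop) {
    let mut out = Vec::new();
    for capture in captures {
        if cancelled.load(Ordering::Relaxed) {
            return (out, QueryStop::Cancelled);
        }
        if out.len() >= max_captures {
            return (out, QueryStop::CaptureCapReached);
        }
        out.push(capture);
    }
    (out, QueryStop::Exhausted)
}
```

Note that polling a flag between captures only bounds queries that keep yielding; a query that stalls inside a single step would need the cursor-level timeout the commit message describes.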
Round-2 corrections pushed (commit 05b4a80):
/oz-review
Overview
This spec proposes a product and technical design for schema-driven syntax-highlight grammar discovery, including bundled grammar definitions and an eventual user-local WASM grammar path.
Concerns
- The V1 scope is internally inconsistent: the goal section defers user-local WASM grammars, while the behavior contract, acceptance criteria, architecture, and tests still require user-local grammars to load and render in V1.
- The bundled grammar path still has wording that understates the required Rust/Cargo changes, which would mislead follow-up implementers and contributors.
- The diagnostic surface is left as either a CLI command or settings page in one section, while acceptance criteria require the settings page, so the required deliverable should be made singular.
Security
- Telemetry includes user-controlled `internal_name` values for grammar load events. For user-local grammars, this can disclose private project/customer language names to analytics unless the spec restricts, hashes, or omits that field for user-local sources.
Verdict
Found: 1 critical, 2 important, 1 suggestion
Request changes
| `.nim` files render with syntax highlighting in Warp. | ||
| No edits to `lib.rs` were required. | ||
|
|
||
| A2. A user drops `~/.warp/grammars/zig/` containing |
🚨 [CRITICAL] A2 requires user-local WASM grammars to render after restart, but G1 explicitly says V1 does not ship this path and should return UserLocalWasmNotYetSupported; make the user-local acceptance criteria/tests follow-up-only or bring G1 back into V1 and resolve the WasmStore dependency question.
|
|
||
| **Bundled:** A `crates/languages/grammars/<lang>/` directory is | ||
| discovered at compile time via the existing `RustEmbed` mechanism. | ||
| A new language directory is the only required Rust change; no |
bundled_parsers.rs parser-map entry; reword this to promise no lib.rs match-arm edits rather than no other Rust/Cargo changes.
| > across both surfaces. | ||
|
|
||
| **Telemetry events** (sent to Warp's analytics): | ||
| - `grammar_loaded`: one-time at startup. Payload: `internal_name`, |
internal_name is user-controlled for user-local grammars and may reveal private project/customer names in analytics; specify that user-local values are omitted, allowlisted, or hashed before telemetry is emitted.
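Of the three options the comment lists (omit, allowlist, hash), omission is the simplest to show. The sketch below builds a `grammar_loaded` payload that drops `internal_name` for user-local sources; the payload struct and constructor are hypothetical, and only the event name and field name come from the excerpt above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanguageSource {
    Hardcoded,
    Bundled,
    UserLocal,
}

/// Hypothetical payload shape for a `grammar_loaded` event.
#[derive(Debug, PartialEq, Eq)]
pub struct GrammarLoadedEvent {
    pub source: LanguageSource,
    /// Present only for names that already ship publicly in the binary.
    pub internal_name: Option<String>,
}

/// Build the payload, stripping the user-controlled name for user-local grammars.
pub fn grammar_loaded_event(source: LanguageSource, internal_name: &str) -> GrammarLoadedEvent {
    let internal_name = match source {
        LanguageSource::Hardcoded | LanguageSource::Bundled => Some(internal_name.to_string()),
        // User-controlled value: never included in the payload.
        LanguageSource::UserLocal => None,
    };
    GrammarLoadedEvent { source, internal_name }
}
```

Doing the stripping in the constructor, rather than at each call site, keeps the privacy rule in one place.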
|
|
||
| ### B7 — Discoverability of installed grammars | ||
|
|
||
| A new command `warp_grammars list` (or a settings-page surface, |
💡 [SUGGESTION] Pick one required diagnostic surface here; A7 requires Settings → Editor → Languages, so leaving warp_grammars list as an alternative makes the implementation target ambiguous.
…t (round 3)
Round-3 bot review identified 1 critical + 2 important + 1 security:
- V1 scope inconsistent: goal section deferred user-local WASM,
but acceptance criteria A2/A4/A5 still required user-local
grammars to load and render in V1.
- Bundled grammar G2 wording understated the actual Rust/Cargo
changes ("no edits to lib.rs were required" was too narrow).
- Diagnostic surface offered "CLI command OR settings page" while
A7 required the settings page.
- Telemetry sent user-controlled `internal_name` for user-local
grammars, which can disclose private/customer language names.
Fixes:
- A2/A4/A5 rewritten for V1 (bundled-only). Original user-local
criteria preserved as A2.future / A4.future / A5.future for the
follow-up PR. A7 unchanged (Settings page required).
- G2 expanded to honestly list the Cargo changes (Cargo.toml +
bundled_parsers.rs entry per language) while clarifying these
are mechanical one-line additions, not the five-place hand-coded
match-statement spread the issue was asking us to remove.
- B7 narrowed: Settings → Editor → Languages page is the required
deliverable. CLI command is a non-V1 follow-up.
- Telemetry split by source: Hardcoded/Bundled keep internal_name
(public, ship in binary); UserLocal events strip internal_name
entirely. This still yields aggregate counts of user-local adoption
without learning which grammars individuals installed. reason_kind
alone is enough to identify systemic failure modes.
Each correction includes a "Correction (re-review warpdotdev#10129)" callout.
|
Round-3 corrections pushed (commit 9a5a5c0):
/oz-review
I'm re-reviewing this spec-only pull request in response to a review request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Powered by Oz
Overview
This spec defines a path toward schema-driven syntax-highlight grammars while preserving the existing hardcoded languages. The bundled-language direction is generally scoped, but the V1 boundary is still internally inconsistent after deferring user-local WASM support.
Concerns
- The product spec still describes user-local grammars as loaded in V1 even though the goals and acceptance criteria say they must surface as `UserLocalWasmNotYetSupported`.
- The tech spec cannot represent the required deferred user-local state because `LoadFailureReason` omits the required variant and the loader algorithm still instantiates WASM unconditionally.
- The integration tests still assert future G1 behavior, which would force implementation of a path the product spec says is not in V1.
Security
- The user-local WASM safety contract documents that same-thread parser hangs can still starve syntax-tree refresh. Before user-local grammar loading is enabled, the spec needs a fail-closed or isolated execution requirement rather than accepting this as a V1 limitation.
Verdict
Found: 1 critical, 5 important, 0 suggestions
Request changes
|
|
||
| **Bundled:** A `crates/languages/grammars/<lang>/` directory is | ||
| discovered at compile time via the existing `RustEmbed` mechanism. | ||
| A new language directory is the only required Rust change; no |
Cargo.toml dependency and a bundled_parsers.rs entry for Rust-crate parsers. Update B2 so the product contract matches the documented contributor steps.
|
|
||
| **User-local:** A `~/.warp/grammars/<lang>/` (or | ||
| `$XDG_CONFIG_HOME/warp/grammars/<lang>/`) directory is discovered at | ||
| startup. User-local grammars are loaded after bundled grammars and |
UserLocalWasmNotYetSupported, but this behavior contract says user-local grammars are loaded after bundled and collision-deduped. Rewrite B2 as detect-and-report-only for V1, moving loading and override semantics to the future G1 criteria.
| - **Startup-load timeout.** WASM module instantiation is wrapped in | ||
| a 5-second hard timeout. A grammar that fails to instantiate in | ||
| time is treated as a load failure (B6). | ||
| - **No worker isolation in V1.** All parsers share the editor |
| pub schema_version: Option<u32>, | ||
| } | ||
|
|
||
| pub enum LoadFailureReason { |
🚨 [CRITICAL] A2 requires LoadResult::Failed { reason: UserLocalWasmNotYetSupported }, but this enum has no such variant. Add the variant and specify where the loader returns it, otherwise the V1 acceptance criteria cannot be implemented as written.
| - `rust_crate`: look up via the compile-time `bundled_parsers.rs` | ||
| map. On miss: return `LoadResult::Failed { reason: | ||
| ParserCrateNotFound }`. | ||
| - `wasm`: `tree_sitter::WasmStore::load_language(&wasm_bytes)`. |
wasm parser, but V1 must not load user-local WASM. Add a source-specific branch before parser resolution that returns UserLocalWasmNotYetSupported for UserLocal, and clarify whether bundled WASM is also gated or supported in V1.
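The source-specific branch requested here could be a small gate that runs before parser resolution. The sketch assumes bundled WASM remains allowed in V1, which is exactly the point the comment asks the spec to clarify; `ParserSpec`, `Resolution`, and `gate_parser` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanguageSource {
    Hardcoded,
    Bundled,
    UserLocal,
}

/// Which parser a `language.toml` asked for.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParserSpec {
    RustCrate(String),
    Wasm(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum LoadFailureReason {
    UserLocalWasmNotYetSupported,
    ParserCrateNotFound,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Continue to normal parser resolution with this spec.
    Proceed(ParserSpec),
    /// Stop here; the grammar is reported as failed.
    Failed(LoadFailureReason),
}

/// Gate that runs before any parser bytes are read or instantiated.
/// Assumption: every user-local grammar is deferred in V1, bundled ones proceed.
pub fn gate_parser(source: LanguageSource, spec: ParserSpec) -> Resolution {
    match source {
        LanguageSource::UserLocal => {
            Resolution::Failed(LoadFailureReason::UserLocalWasmNotYetSupported)
        }
        LanguageSource::Hardcoded | LanguageSource::Bundled => Resolution::Proceed(spec),
    }
}
```

Because the gate returns before resolution, flipping it in a follow-up PR would be a one-arm change rather than a loader rewrite.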
| tree-sitter-toml grammar shrunk to a minimal subset), point | ||
| `WARP_USER_GRAMMAR_DIR` env var at it, call | ||
| `discover_grammars()`. Assert the language returns as | ||
| `LoadResult::Loaded(...)` and |
Failed(UserLocalWasmNotYetSupported) and no language_by_filename association; move the loaded-parser assertion to the G1 follow-up.
Summary
Spec for issue #9955. Adding a new language to Warp's syntax highlighting today requires changes in 5+ places in `crates/languages/src/lib.rs` (`SUPPORTED_LANGUAGES` array, `language_by_filename`, `to_arborium_name`, `get_arborium_highlight_query`, plus a `grammars/<lang>/` folder), all of which require modifying compiled Rust code and shipping a Warp release. This blocks the most-requested kind of community contribution: "I use $LANG and would happily contribute the highlighting definition."
Investigation
- `SUPPORTED_LANGUAGES: [&str; 32]` (lib.rs:23).
- `language_by_filename` (lib.rs:115), `to_arborium_name` (lib.rs:226), `get_arborium_highlight_query` (lib.rs:239).
- `arborium::tree_sitter::{Language, Query}` (lib.rs:7): tree-sitter is the right substrate; this spec preserves it.
- `crates/languages/grammars/<lang>/` already exist with `config.yaml`, `identifiers.scm`, `indents.scm`, embedded via `RustEmbed`.
- … `arborium` support OR vendoring a tree-sitter grammar.
What's in the spec
- `product.md`: 8 testable behavior invariants (B1–B8), 7 acceptance criteria (A1–A7), explicit non-goals (no TextMate-style regex grammars, no native dynamic-library loading, no sub-language injection in V1, no hot-reload), and 6 risks with concrete TECH decisions.
- `tech.md`: picks the three-source discovery architecture, the `language.toml` schema, the loader, the file-association index, and the staged migration strategy. 5+4 unit tests plus 3 integration tests with a real tree-sitter fixture.
Architectural choices
- `language.toml`: one contract a contributor learns; everything else is standard tree-sitter files (`highlights.scm`, `indents.scm`, optional `identifiers.scm`).
- WASM-only for runtime-loaded user grammars; native dynamic libraries (`.so`/`.dylib`/`.dll`) explicitly rejected with a clear error and no `dlopen` call exists on the loader path. Bundled grammars can be either WASM or a Cargo dependency on a Rust grammar crate.
- … `log::error!` + `Settings → Editor → Languages` notification. Other grammars load normally.
- `bundled_parsers.rs` compile-time map is the only hand-edited file for adding bundled grammars (with an open question about whether to use `inventory`-style auto-registration to eliminate even that).
Test plan
- … `LoadFailure` not panic, collision dedup, ABI mismatch reporting)
- `crates/languages/src/lib_tests.rs` and `crates/syntax_tree/src/queries/*_tests.rs` pass unchanged
- … `~/.warp/grammars/zig/` directory with `language.toml`, `highlights.scm`, `grammar.wasm`; restart; `.zig` files render
- … `grammar.so`; verify clear error, no `dlopen`
Open questions for maintainer review
- … `WasmStore`. Verify against current `Cargo.lock`.
- `$XDG_CONFIG_HOME/warp/grammars/` (when set) vs `~/.warp/grammars/` (fallback). Confirm precedence.
- … `bundled_parsers.rs` use an `inventory`-style auto-registration pattern to eliminate the only remaining hand-edit? Adds a build-time crate dep but removes the last manual step.
Closes (spec-only) #9955. Implementation PR will follow once spec direction is confirmed. The 32 existing languages will migrate via independent follow-up PRs, one per language.
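For the XDG path question, one candidate precedence is "`$XDG_CONFIG_HOME` wins when set and non-empty, otherwise fall back to `~/.warp/grammars/`". The sketch below encodes only that assumption, since the precedence is still an open question; `user_grammar_dir` is a hypothetical function that takes already-read environment values so it can be tested without touching the process environment.

```rust
use std::path::PathBuf;

/// Resolve the user-local grammar directory from already-read environment values.
/// Assumes `$XDG_CONFIG_HOME` takes precedence when set and non-empty; this is a
/// placeholder for whatever precedence the maintainers confirm.
pub fn user_grammar_dir(xdg_config_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    match xdg_config_home.filter(|v| !v.is_empty()) {
        Some(xdg) => Some(PathBuf::from(xdg).join("warp").join("grammars")),
        None => home
            .filter(|h| !h.is_empty())
            .map(|h| PathBuf::from(h).join(".warp").join("grammars")),
    }
}
```

A caller would pass `std::env::var("XDG_CONFIG_HOME").ok().as_deref()` and the equivalent for `HOME`; returning `None` when neither is usable leaves the "no user-local directory" case to the caller.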