Skip to content

security: block URI scheme bypasses + tree-sitter grammar v1.0 parity (v1.0.1)#19

Merged
rjmitchell merged 17 commits intomainfrom
feature/editor-v1-parity
Apr 10, 2026
Merged

security: block URI scheme bypasses + tree-sitter grammar v1.0 parity (v1.0.1)#19
rjmitchell merged 17 commits intomainfrom
feature/editor-v1-parity

Conversation

@rjmitchell
Copy link
Copy Markdown
Owner

Summary

Ships v1.0.1 with two main things plus a pile of hardening.

Security (CSO audit findings #10, #11 + adversarial review):

  • Block javascript: and data: URI schemes in link href and figure src via allowlist (http, https, mailto, tel), defaulting to #. Both Python and JS renderers.
  • Block protocol-relative URLs (//evil.com) — browsers treat these as absolute, so [text -> //evil.com] was a valid CLN document that slipped past the sanitizer as an open redirect.
  • Block percent-encoded bypasses (javascript%3a..., JAVASCRIPT%3A...) — browsers decode before routing, so literal-colon detection was bypassable. Both renderers now percent-decode before scheme extraction and handle malformed sequences safely.

Tree-sitter grammar v1.0 parity (closes the gap where the Python parser supported nested lists + multi-paragraph items but the grammar was still on v0.1):

  • External scanner with indent stack (12-level cap, serialized/deserialized across parses with corrupt-buffer guards).
  • scan_indent_tokens distinguishes nested list markers (INDENT), multi-paragraph continuations (LIST_CONTINUATION requires after_blank_line + indent > 0), and dedents (zero-width DEDENT). Bails during tree-sitter error recovery to avoid corrupting the indent stack.
  • New grammar rules: list_item_body, list_item_continuation, nested_list. Both unordered and ordered nesting work, including mixed nesting.
  • 9 new tree-sitter corpus tests covering flat, nested (both types), multi-paragraph (both types), and list-followed-by-block regressions.

Editor internals (to consume the new CST shape):

  • clnComment block type for // comment lines.
  • Converter populates children[] from list_item_body — shared convertListItemBody helper between unordered and ordered.
  • Depth-aware serializer with content-column alignment for continuations (including double-digit ordered markers like 10. ).
  • Comment converter also strips trailing whitespace (round-trip fidelity fix).

Hardening / refactors:

  • Scanner uses TREE_SITTER_SERIALIZATION_BUFFER_SIZE instead of magic 1024, removes unused <string.h> include, collapses redundant deserialize guards.
  • Editor converter DRY'd (30 lines eliminated).
  • Stronger test assertions: positive href="#" checks, case-insensitive scheme coverage, converter unit tests for nested list paths.

Test Coverage

  • Python: 362 tests (was 349, +13 for URL security including 5 new bypass tests)
  • Editor (vitest): 336 tests (was 316, +20 for clnComment, nested list converter, depth-aware serializer)
  • clearnotation-js (vitest): 129 passed + 3 skipped (was 116, +13 for URL security including protocol-relative and percent-encoded bypass tests)
  • Tree-sitter corpus: 35 tests (was 26, +9 for list grammar)
  • Conformance fixtures: 70/70 still passing

AI-assessed coverage: ~95% (above 80% target). All critical paths have ★★★ tests (behavior + edge cases + error paths). The one structural gap (scanner corrupt-buffer branches) is defensive C code with no direct test framework — acceptable.

Pre-Landing Review

Two stages of review ran:

Structural pass: CLEAN. Two informational items (both non-blocking, both tracked):

  • Converter recursion has no explicit depth guard (scanner's MAX_STACK_DEPTH=12 already bounds input, and JS stack is ~10,000 frames — no realistic overflow).
  • Comment serializer with programmatically-set multi-line props.text would round-trip incorrectly, but this can't happen from a parsed CLN source (grammar enforces single-line comments).

Specialist review: 9 informational items found across testing + maintainability. All 7 mechanical items auto-fixed across 3 commits:

  • Added missing data: figure src test to JS, ordered list continuation converter test, multi-paragraph ordered corpus test
  • Extracted convertListItemBody helper (DRY, -30 lines)
  • Replaced magic 1024 with TREE_SITTER_SERIALIZATION_BUFFER_SIZE
  • Removed unused <string.h> include
  • Collapsed redundant deserialize branches

Skipped: 2 stylistic items (LIST_INDENT constant for a 2-char string, cross-language safeUrl reference comment).

Adversarial review: Found 2 CRITICAL bypasses in the new URL sanitizer (C-1 protocol-relative, C-2 percent-encoded). Both fixed in commit 2ad7306 with regression tests in both renderers. Plus 1 INFORMATIONAL scanner hardening fix (error-recovery bail) in commit 3969fbd.

Plan Completion

All 39 checkboxes in docs/superpowers/plans/2026-04-09-editor-v1-parity.md marked done. Every planned task delivered. Additional scope added during ship: adversarial-driven URL bypass fixes (C-1, C-2), scanner error-recovery guard, DRY refactors.

Scope Drift

CLEAN. Intent: editor v1.0 parity + security fix. Delivered: exactly that, plus ship-time hardening driven by adversarial review.

TODOS

Completed: "Editor v1.0 parity + URL security (v1.0.1)" entry updated in TODOS.md with the full shipped scope.

Test plan

  • All 362 Python tests pass
  • All 336 editor tests pass
  • All 129 clearnotation-js tests pass (+3 skipped)
  • All 35 tree-sitter corpus tests pass
  • All 70 conformance fixtures pass
  • Protocol-relative URL bypass blocked (verified: //evil.com#)
  • Percent-encoded scheme bypass blocked (verified: javascript%3aalert(1)#)
  • Flat + nested (both types) + multi-paragraph (both types) lists parse correctly
  • Editor round-trip: parse → convert → serialize produces identical CLN source

🤖 Generated with Claude Code

Specs:
- JS renderer parity (completed this session)
- Include-aware watch (completed this session)
- Landing page redesign (completed this session)
- Editor v1.0 parity: tree-sitter grammar upgrade, nested lists,
  comments, converter/serializer updates (ready for implementation)

Plans:
- JS renderer parity (completed)
- Include-aware watch (completed)
- Landing page redesign (completed)
- Editor v1.0 parity (ready for next session, includes CSO security
  fix for javascript: URI XSS in renderers)
…for editor v1.0 parity

- Strip trailing whitespace from comment node text in block-converter (round-trip fix)
- Add 3 unit tests for comment conversion (trailing newline, indented prefix, no newline)
- Update CLAUDE.md test counts: 357 Python / 331 editor / 123 JS
- Add Editor v1.0 parity entry to TODOS.md Completed section
Block //evil.com (protocol-relative) URLs which browsers treat as absolute,
and decode percent-encoding before scheme extraction to catch javascript%3a...
bypasses. Both Python and JS renderers updated; regression tests added.
Add guard at the top of scan_indent_tokens that detects the error-recovery
state (all 5 external symbols valid simultaneously) and returns false
immediately to prevent incorrect state mutation during recovery.
@rjmitchell rjmitchell merged commit 8b34258 into main Apr 10, 2026
2 checks passed
@rjmitchell rjmitchell deleted the feature/editor-v1-parity branch April 10, 2026 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant