Render speed-ups and performance investigations.#153

Merged
KubaO merged 18 commits into twinbasic:main from KubaO:staging
May 23, 2026
KubaO (Collaborator) commented May 23, 2026

No description provided.

KubaO added 18 commits May 22, 2026 22:15
Extend html-compress to mark book-combined as compress-eligible
so book.html collapses inter-element whitespace at Jekyll time
instead of paged.js's WhiteSpaceFilter doing ~37k DOM mutations
at render time.

Reorder :pages, :post_render and :documents, :post_render hooks
into a three-tier convention so adding compress to book.html
composes correctly with the other plugins:

  :high   mutators  (book-href-rewrite)
  :normal compress  (html-compress)
  :low    readers   (pdfify capture, offlinify per-page rewrite)

Without the layering, book-href-rewrite's landing-heading strip
ran after compress, leaving adjacent single-space runs that no
downstream pass collapsed. The 3-tier ordering makes "compress
is the last cleanup pass among mutators" and "readers see final
compressed bytes" hold by construction.
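
The ordering argument above can be modelled outside Jekyll. The sketch below is a language-neutral illustration in JavaScript, not the plugins' code (the real hooks are Ruby): the hook bodies, the sample HTML and the function name runPostRender are all hypothetical stand-ins. It only shows why stripping after compressing leaves a whitespace run, while the three-tier order does not.

```javascript
// Hypothetical model of the three-tier convention: high -> normal -> low.
const PRIORITY_ORDER = { high: 0, normal: 1, low: 2 };

function runPostRender(hooks, html) {
  // Array.prototype.sort is stable, so same-tier hooks keep registration order.
  const ordered = [...hooks].sort(
    (a, b) => PRIORITY_ORDER[a.priority] - PRIORITY_ORDER[b.priority]
  );
  return ordered.reduce((doc, hook) => hook.run(doc), html);
}

let seenByReader = null;

// Simplified stand-ins for the three roles named in the commit message.
const stripHeading = {
  priority: 'high', // mutator
  run: (doc) => doc.replace(/<h1>.*?<\/h1>/g, ''),
};
const compress = {
  priority: 'normal', // collapses whitespace runs
  run: (doc) => doc.replace(/\s{2,}/g, ' '),
};
const reader = {
  priority: 'low', // observes the final bytes, never mutates
  run: (doc) => { seenByReader = doc; return doc; },
};

const input = '<p>a</p> <h1>Landing</h1> <p>b</p>';

// Three-tier order: strip, then compress, then read.
const hooks = [reader, compress, stripHeading];
console.log(JSON.stringify(runPostRender(hooks, input)));
// "<p>a</p> <p>b</p>"

// Misordered: compress runs before the strip, so the two single spaces
// around the removed heading become an uncollapsed double space.
const stripLate = [
  { ...compress, priority: 'high' },
  { ...stripHeading, priority: 'normal' },
  reader,
];
console.log(JSON.stringify(runPostRender(stripLate, input)));
// "<p>a</p>  <p>b</p>"
```

Registration order does not matter in either run; only the tier does, which is the property the commit relies on.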

Verified: 0 outside-pre multi-whitespace runs in the regenerated
book.html (was 37,087 without compress). Branch-counting the
WhiteSpaceFilter post-fix shows DOM mutations drop from ~37k to 0.
Ruby-prof A/B confirms the priority shuffle is CPU-invariant; the
only attributable cost is one extra compress! call (~480 ms once
per Jekyll build, ~300-500 ms saved per paged.js render).
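
A check along the lines of "outside-pre multi-whitespace runs" could look roughly like the following. This is a regex-based sketch, not the script used for the numbers above; the function name is hypothetical, and it also counts whitespace runs inside tags and attributes, which a DOM-based check might not.

```javascript
// Rough count of runs of two or more whitespace characters outside <pre>.
function countWhitespaceRunsOutsidePre(html) {
  // Replace each <pre> block with a non-whitespace placeholder so that
  // whitespace on either side of it is not merged into a false run.
  const withoutPre = html.replace(/<pre\b[^>]*>[\s\S]*?<\/pre>/gi, '<pre/>');
  const runs = withoutPre.match(/\s{2,}/g);
  return runs ? runs.length : 0;
}

const uncompressed = '<p>a</p>\n\n<p>b</p><pre>x\n\n  y</pre> <p>c</p>';
const compressed = '<p>a</p> <p>b</p><pre>x\n\n  y</pre> <p>c</p>';

console.log(countWhitespaceRunsOutsidePre(uncompressed)); // 1
console.log(countWhitespaceRunsOutsidePre(compressed));   // 0
```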

Adds analyze-trace.mjs --children mode used to localise this
during the investigation. Full writeup in perf/README.md and
docs/_plugins/html-compress.md.

A 3+3 paired cpu-profile A/B (perf/ab-aggregate.mjs) showed the
filter's 181k TreeWalker callbacks cost ~600 ms of CPU on every
render even when html-compress has already collapsed inter-element
whitespace at Jekyll time. ~125 ms is direct (filterTree/filterEmpty
self time); the rest is indirect -- gBCR (getBoundingClientRect),
recalcStyle, performLayout and UpdateStyleAndLayout all run ~14%
cheaper per call when V8's inline caches (IC) and the Blink scheduler
aren't being churned by 181k C++->JS dispatches. The cost is small per
call but compounds because the
walk lives inside the same microtask continuation as the per-page
render loop.

An earlier wall-clock A/B (3+3, 8.78 s vs 8.53 s) had attributed the
delta to noise; that was wrong. Per-row aggregation across paired
cpu profiles shows the filterTree row at 88 ms (sd 14) vs 2 ms (sd
1) -- a 6 sigma shift -- and the downstream gBCR row at -338 ms
mean, consistent with the trace's -574 ms drop on
Document::UpdateStyleAndLayout total.
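
The per-row statistics can be reproduced with a few lines. The sketch below assumes the sample standard deviation (n - 1 denominator) and expresses the shift in units of the larger SD; the commit does not say how "6 sigma" was derived, so that reading is an assumption, and the function names are hypothetical rather than taken from perf/ab-aggregate.mjs.

```javascript
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); needs at least two values.
function sampleSd(xs) {
  const m = mean(xs);
  const variance = xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

// Distance between two means, in units of the given standard deviation.
function shiftInSd(meanA, meanB, sd) {
  return (meanA - meanB) / sd;
}

// Figures quoted for the filterTree row: 88 ms (sd 14) vs 2 ms (sd 1).
console.log(shiftInSd(88, 2, 14).toFixed(1)); // 6.1
```

With only three profiles per arm the SD estimate is itself noisy, so the figure is best read as "clearly outside the noise" rather than as a precise significance level.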

The fix: gate the TreeWalker invocation behind
window.PagedConfig.runWhitespaceFilter (default undefined = off).
Our pipeline never sets the flag because html-compress already
does the work; documents that need the cleanup can opt back in.
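
A minimal sketch of such a gate, assuming a truthiness check (the commit only states that undefined means off). The helper names are hypothetical and filterTree is passed in as a stand-in for the real walk, so this is not the patched paged.js code.

```javascript
// True only when the page has opted in via PagedConfig.runWhitespaceFilter.
function shouldRunWhitespaceFilter(globalObject) {
  const config = globalObject.PagedConfig;
  return Boolean(config && config.runWhitespaceFilter);
}

// Runs the (injected) TreeWalker pass only when the flag is set.
function maybeFilterWhitespace(globalObject, content, filterTree) {
  if (shouldRunWhitespaceFilter(globalObject)) {
    filterTree(content);
  }
}

// In a browser, a document that still needs the cleanup would opt back in
// before paged.js starts, for example:
//   window.PagedConfig = { ...window.PagedConfig, runWhitespaceFilter: true };
console.log(shouldRunWhitespaceFilter({}));                                        // false
console.log(shouldRunWhitespaceFilter({ PagedConfig: { runWhitespaceFilter: true } })); // true
```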

Also adds perf/ab-aggregate.mjs (per-row mean+SD aggregator across
6 paired cpu profiles) and a long writeup in perf/README.md with
the methodology, the corrected understanding of why the filter has
cost (not flush migration -- it does no layout-flushing work; it's
V8 IC pressure + Blink scheduler overhead), and lessons about when
to trust wall-clock vs aggregated cpu-profile rows.

- docs/render-book.mjs, perf/measure.mjs: add --disable-gpu and
  --disable-software-rasterizer. Renderer ~120 MB lighter, gpu-process
  ~84 MB lighter (shrinks to a 16 MB stub -- only --in-process-gpu
  kills it entirely, at +15 s wall clock; rejected), generate ~5 s
  faster, PDF byte-identical.
- perf/probe-parallel.mjs: two-shard pageRanges parallel-generate
  probe. N=2 saves ~17 s wall clock (render+generate ~36 s vs ~53 s
  single-process), confirms two browsers parallelise at the OS level.
  Not shipped -- N=2 ~5 GB peak, N=4 ~10 GB peak, over CI budget.
- perf/probe-memory.mjs + sample-mem.ps1: per-process tree memory
  sampler. PowerShell + WMI walks the chrome.exe parent->child tree
  at 500 ms intervals, reports per-process private bytes + working
  set. Used to A/B the --disable-gpu / --in-process-gpu /
  --single-process variants (the last crashes in modern headless).
- perf/probe-renderer-mem.mjs + analyze-mem-trace.mjs: per-allocator
  renderer breakdown via Chromium's memory-infra trace + on-demand
  PMD dumps. Shows the 1.9 GB renderer is ~80 % Blink (Oilpan heap),
  not V8 (V8 is 34 MB). Top object classes are paged.js's per-page
  CSS grid (132 MB), 1 M ComputedStyle (74 MB), LayoutNG fragments
  (~200 MB combined), 411 k AXNodeObject for tagged-PDF (41 MB).
- --gc-passes N flag on probe-renderer-mem.mjs: triggers V8 +
  Memory.simulatePressureNotification between render and generate.
  One pass + pressure (~1 s) frees ~180 MB of dangling Blink objects
  reachable from no user-visible state. Not shipped -- masking a
  retention defect (paged.js hooks? detach-pages closures?) rather
  than fixing it. Hypotheses + next-step heap-snapshot direction
  documented in perf/README.md.
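
For the pageRanges sharding probe in the list above, the splitting step can be sketched as follows. This is an illustration, not perf/probe-parallel.mjs: the function name is hypothetical, the 1000-page total is a placeholder, and the "first-last" string form assumes the pageRanges syntax accepted by Chromium's print-to-PDF options.

```javascript
// Split pages 1..totalPages into contiguous ranges, one per shard.
function shardPageRanges(totalPages, shardCount) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError('totalPages must be a positive integer');
  }
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError('shardCount must be a positive integer');
  }
  const count = Math.min(shardCount, totalPages); // never emit empty shards
  const base = Math.floor(totalPages / count);
  const extra = totalPages % count; // the first `extra` shards get one more page
  const ranges = [];
  let start = 1;
  for (let i = 0; i < count; i += 1) {
    const size = base + (i < extra ? 1 : 0);
    const end = start + size - 1;
    ranges.push(start === end ? `${start}` : `${start}-${end}`);
    start = end + 1;
  }
  return ranges;
}

console.log(shardPageRanges(1000, 2)); // [ '1-500', '501-1000' ]
```

Each range would then go to its own browser process, which is where the roughly N-fold peak memory noted above comes from.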
…ion.

CDP HeapProfiler.takeHeapSnapshot at post-render (and post-gc when
combined with --gc-passes) -- ~200 MB .heapsnapshot file per dump,
loadable in Chrome DevTools Memory tab. The Comparison view between
pre- and post-gc snapshots shows which V8-visible categories the GC
freed; Summary + filter "Detached" surfaces DOM nodes still held by
JS after their owning page was removed, and Retainers gives the
exact chain. Workflow documented in perf/README.md under
"--heap-snapshot: extract V8 retainer chains".
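
Capturing the snapshot over CDP can be sketched as below. The session object is assumed to be Puppeteer-style (on/off/send); the commit names CDP but not the client library, so that is an assumption, and takeHeapSnapshot is a hypothetical helper name. Chunks are handed to a callback rather than joined in memory, since a dump is around 200 MB.

```javascript
// Streams a V8 heap snapshot from a CDP session to the onChunk callback.
async function takeHeapSnapshot(session, onChunk) {
  const handler = ({ chunk }) => onChunk(chunk);
  // The snapshot arrives as a sequence of addHeapSnapshotChunk events.
  session.on('HeapProfiler.addHeapSnapshotChunk', handler);
  try {
    // Resolves once every chunk has been delivered.
    await session.send('HeapProfiler.takeHeapSnapshot', { reportProgress: false });
  } finally {
    session.off('HeapProfiler.addHeapSnapshotChunk', handler);
  }
}

// Hypothetical usage with Puppeteer and a write stream:
//   const session = await page.createCDPSession();
//   const out = fs.createWriteStream('post-render.heapsnapshot');
//   await takeHeapSnapshot(session, (chunk) => out.write(chunk));
//   out.end();
```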

Oilpan-only objects (PhysicalBoxFragment, LogicalLineItems,
ConstraintSpace::RareData -- no V8 wrapper) don't appear in the V8
snapshot but are typically owned by a DOM node that does, so the
investigation route is detached-DOM-from-snapshot + ownership graph
from the memory-infra dump.

V8 heap snapshot diff pre-gc vs post-gc is byte-identical -- same
2,938,992 nodes, same 108.9 MB self_size, same per-category counts.
Rules out the "dangling JS references" hypothesis the gc-pass probe
initially suggested.
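
The comparison amounts to aggregating each snapshot and checking that the aggregates match. The sketch below is not perf/analyze-heap-snapshot.mjs; the function names are hypothetical. It relies on the .heapsnapshot layout, where nodes is a flat integer array described by snapshot.meta.node_fields, and it takes an already parsed object (a 200 MB file may need a streaming parser rather than JSON.parse).

```javascript
// Aggregate node count and self_size, overall and per node type.
function summarizeSnapshot(snapshot) {
  const { node_fields: fields, node_types: nodeTypes } = snapshot.snapshot.meta;
  const typeNames = nodeTypes[0]; // enum names for the 'type' field
  const stride = fields.length;
  const typeAt = fields.indexOf('type');
  const sizeAt = fields.indexOf('self_size');
  const summary = { nodeCount: 0, selfSize: 0, byType: {} };
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    const type = typeNames[snapshot.nodes[i + typeAt]];
    const size = snapshot.nodes[i + sizeAt];
    summary.nodeCount += 1;
    summary.selfSize += size;
    const row = summary.byType[type] || (summary.byType[type] = { count: 0, selfSize: 0 });
    row.count += 1;
    row.selfSize += size;
  }
  return summary;
}

// True when totals and every per-type row agree between two summaries.
function sameSummary(a, b) {
  if (a.nodeCount !== b.nodeCount || a.selfSize !== b.selfSize) return false;
  const types = new Set([...Object.keys(a.byType), ...Object.keys(b.byType)]);
  for (const type of types) {
    const x = a.byType[type];
    const y = b.byType[type];
    if (!x || !y || x.count !== y.count || x.selfSize !== y.selfSize) return false;
  }
  return true;
}
```

If sameSummary returns true for the pre-gc and post-gc dumps, the GC freed nothing that V8 can see, which is the result reported above.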

Per-Blink-class diff of the memory-infra dumps (new
perf/diff-blink-classes.mjs) shows what actually gets freed: style-
system caches and layout intermediates that are unreachable from
the moment their page finalises but stay in Oilpan because nothing
forces a major GC during the synchronous render loop. Two ~100% freed
categories are the cleanest signal: CachedMatchedProperties (Blink's
style-sharing cache, dead after layout) and GridItemData (paged.js's
per-page-template CSS grid items, dead after layout). The remainder
is sub-ComputedStyle (StyleBoxData, StyleSurroundData, StyleMisc*),
ShapeResultView / HarfBuzzRunGlyphData / ShapeResultRun, layout-
fragment RareData.
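
The per-class comparison reduces to a keyed subtraction once both dumps are available as class-to-bytes maps. The sketch below is not perf/diff-blink-classes.mjs; the function name is hypothetical, the byte values are placeholders rather than measurements, and extracting the maps from a memory-infra trace is left out.

```javascript
// Per-class bytes freed between two dumps, largest first.
// Classes that appear only in `after` are ignored here.
function diffClasses(before, after) {
  const rows = Object.entries(before).map(([name, beforeBytes]) => {
    const afterBytes = after[name] ?? 0;
    const freed = beforeBytes - afterBytes;
    const freedPct = beforeBytes > 0 ? (100 * freed) / beforeBytes : 0;
    return { name, beforeBytes, afterBytes, freed, freedPct };
  });
  return rows.sort((a, b) => b.freed - a.freed);
}

// Placeholder byte counts, chosen only to show the shape of the output.
const preGc = { CachedMatchedProperties: 1000, GridItemData: 800, ComputedStyle: 400 };
const postGc = { CachedMatchedProperties: 0, GridItemData: 0, ComputedStyle: 300 };

for (const row of diffClasses(preGc, postGc)) {
  console.log(`${row.name}: freed ${row.freed} B (${row.freedPct.toFixed(0)}%)`);
}
```

Rows near 100% are the "dead after layout" categories; partially freed rows correspond to classes whose instances are only partly unreachable.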

Conclusion: not a leak, not actionable as a retention fix. The only
direct mitigation is forcing a GC (already rejected, costs ~1 s).
Indirect lever is upstream DOM size (DOM-shape audit).

Tooling produced:
- perf/analyze-heap-snapshot.mjs: top type x name aggregation +
  pairwise diff for V8 heap snapshots. Also surfaces the
  detachedness=2 subset (corrected from earlier mis-read of the V8
  DetachednessV8 enum, where {1=Attached, 2=Detached}).
- perf/diff-blink-classes.mjs: per-Blink-class diff between two
  memory-infra dumps in the same trace. Strips the per-dump GUID
  suffix from class names so the same class lines up across dumps.

README updated: GC-pass section title and intro corrected; "What
might be holding the references" replaced with "What the GC actually
freed"; --heap-snapshot workflow re-framed as a visibility check
rather than a retainer-chain hunt (because the diff is zero).

Research notes from the conversation that explored what it would take
to extract Blink's draw stream (SkPicture / cc::PaintRecord), spawn
standalone PrintCompositor utility processes, or build a Chromium-
linked helper binary -- all to enable parallel PDF generation without
N-way memory blowup.

Five approaches catalogued with honest cost estimates:
  A. Patch + upstream a Chromium flag (skip PrintCompositor for
     single-renderer, or streaming printToPDF).
  B. Port SkPDF to JS (doesn't help alone -- the input data
     extraction is the real bottleneck).
  C. Frida + reimplement Mojo client in Node (~15-22 weeks).
  D. Frida + CanvasKit-WASM workers (~6-10 weeks, tagged-PDF rebuild
     required).
  E. Helper binary linking Chromium components (~4-6 weeks total,
     corrected from earlier overestimates -- shallow gclient sync
     ~20-30 GB and ~30-90 min, targeted ninja build of ~1500-2500
     TUs ~30-90 min first time).

All rejected for the current 70 s build, but documented so the
analysis isn't lost if the book size or CI budget makes it relevant
again. Also captures the hard facts:

- chrome.dll is a single 283 MB monolithic binary with exactly six
  exported functions (ChromeMain + 5 others); PrintCompositor / Mojo
  / Skia / Blink / V8 are not externally callable.
- The idle Chromium tree is ~125-180 MB (corrected from earlier
  claim of "70-1100 MB"; the high end was PDF-in-transit, not
  steady-state).
- HarfBuzz shaping results and SkTextBlob glyph positions never
  leave the renderer via any public API; the natural extraction
  point is the Mojo serialization between renderer and
  PrintCompositor.

New probe perf/probe-idle-browser.mjs measures the idle baseline
(post-launch, post-newPage, post-goto(about:blank)) -- the data
behind the corrected memory math.

Pointer from perf/README.md "Memory" section to CHROMIUM.md so the
separate research file is discoverable.
@KubaO KubaO merged commit 53b8f25 into twinbasic:main May 23, 2026
1 check passed