Render speed-ups and performance investigations. by KubaO · Pull Request #153 · twinbasic/documentation

KubaO · 2026-05-23T21:22:42Z

No description provided.

…ved).

Extend html-compress to mark book-combined as compress-eligible so book.html collapses inter-element whitespace at Jekyll time instead of paged.js's WhiteSpaceFilter doing ~37k DOM mutations at render time.

Reorder :pages, :post_render and :documents, :post_render hooks into a three-tier convention so adding compress to book.html composes correctly with the other plugins:

  :high    mutators (book-href-rewrite)
  :normal  compress (html-compress)
  :low     readers (pdfify capture, offlinify per-page rewrite)

Without the layering, book-href-rewrite's landing-heading strip ran after compress, leaving adjacent single-space runs that no downstream pass collapsed. The 3-tier ordering makes "compress is the last cleanup pass among mutators" and "readers see final compressed bytes" hold by construction.

Verified: 0 outside-pre multi-whitespace runs in the regenerated book.html (was 37,087 without compress). Branch-counting the WhiteSpaceFilter post-fix shows DOM mutations drop from ~37k to 0. Ruby-prof A/B confirms the priority shuffle is CPU-invariant; the only attributable cost is one extra compress! call (~480 ms once per Jekyll build, ~300-500 ms saved per paged.js render).

Adds analyze-trace.mjs --children mode used to localise this during the investigation. Full writeup in perf/README.md and docs/_plugins/html-compress.md.
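The "0 outside-pre multi-whitespace runs" check can be approximated with a few lines of JavaScript. This is a minimal sketch, not the script used in the PR: the function name is hypothetical, a "run" is assumed to mean two or more consecutive whitespace characters, and the regex-based `<pre>` stripping assumes well-formed, non-nested `<pre>` blocks.

```javascript
// Hypothetical checker (not from the PR): counts runs of two or more
// whitespace characters that sit outside <pre>...</pre> blocks.
function countMultiWhitespaceRuns(html) {
  // Blank out <pre> blocks, where whitespace is significant and must stay.
  const withoutPre = html.replace(
    /<pre\b[^>]*>[\s\S]*?<\/pre\s*>/gi,
    "<pre></pre>"
  );
  const runs = withoutPre.match(/\s{2,}/g);
  return runs === null ? 0 : runs.length;
}

// Two runs outside <pre>: the blank line between paragraphs and "b  c".
const sampleHtml = "<p>a</p>\n\n<p>b  c</p><pre>x\n\n  y</pre> <p>d</p>";
console.log(countMultiWhitespaceRuns(sampleHtml)); // 2
```

A real audit would more likely walk a parsed DOM than rely on a regex, but the regex version is enough to compare a compressed build against an uncompressed one.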

A 3+3 paired cpu-profile A/B (perf/ab-aggregate.mjs) showed the filter's 181k TreeWalker callbacks cost ~600 ms of CPU on every render even when html-compress has already collapsed inter-element whitespace at Jekyll time. ~125 ms is direct (filterTree/filterEmpty self); the rest is indirect -- gBCR, recalcStyle, performLayout and UpdateStyleAndLayout all run ~14% cheaper per call when V8's IC and Blink scheduler aren't being churned by 181k C++->JS dispatches. The cost is small per call but compounds because the walk lives inside the same microtask continuation as the per-page render loop.

Earlier wall-clock A/B (3+3, 8.78s vs 8.53s) had attributed the delta to noise; that was wrong. Per-row aggregation across paired cpu profiles shows the filterTree row at 88 ms (sd 14) vs 2 ms (sd 1) -- a 6 sigma shift -- and the downstream gBCR row at -338 ms mean, consistent with the trace's -574 ms drop on Document::UpdateStyleAndLayout total.

The fix: gate the TreeWalker invocation behind window.PagedConfig.runWhitespaceFilter (default undefined = off). Our pipeline never sets the flag because html-compress already does the work; documents that need the cleanup can opt back in.

Also adds perf/ab-aggregate.mjs (per-row mean+SD aggregator across 6 paired cpu profiles) and a long writeup in perf/README.md with the methodology, the corrected understanding of why the filter has cost (not flush migration -- it does no layout-flushing work; it's V8 IC pressure + Blink scheduler overhead), and lessons about when to trust wall-clock vs aggregated cpu-profile rows.
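The opt-in gate described above might look roughly like the following. This is a sketch under assumptions: only the flag name `runWhitespaceFilter` and its "undefined = off" default come from the commit message; the function name `maybeRunWhitespaceFilter` and the `filterTree` callback parameter are hypothetical stand-ins for the patched paged.js code, and a plain object stands in for `window.PagedConfig` so the snippet runs outside a browser.

```javascript
// Sketch only: the actual paged.js patch may be structured differently.
// `filterTree` is a stand-in for the TreeWalker-based whitespace walk.
function maybeRunWhitespaceFilter(config, content, filterTree) {
  // A missing config, or a flag that is undefined or false, means "off".
  if (!config || !config.runWhitespaceFilter) return false;
  filterTree(content);
  return true;
}

// In a browser, `config` would be window.PagedConfig.
let walks = 0;
const countingFilter = () => { walks += 1; };
maybeRunWhitespaceFilter(undefined, {}, countingFilter); // skipped
maybeRunWhitespaceFilter({}, {}, countingFilter);        // skipped
maybeRunWhitespaceFilter({ runWhitespaceFilter: true }, {}, countingFilter);
console.log(walks); // 1
```

Defaulting to off keeps the fast path free of the walk entirely; a document that has not been pre-compressed would set the flag before paged.js starts.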

…rding.

…e signal.

…p on Windows.

- docs/render-book.mjs, perf/measure.mjs: add --disable-gpu and --disable-software-rasterizer. Renderer ~120 MB lighter, gpu-process ~84 MB lighter (shrinks to a 16 MB stub -- only --in-process-gpu kills it entirely, at +15 s wall clock; rejected), generate ~5 s faster, PDF byte-identical.

- perf/probe-parallel.mjs: two-shard pageRanges parallel-generate probe. N=2 saves ~17 s wall clock (render+generate ~36 s vs ~53 s single-process), confirms two browsers parallelise at the OS level. Not shipped -- N=2 ~5 GB peak, N=4 ~10 GB peak, over CI budget.

- perf/probe-memory.mjs + sample-mem.ps1: per-process tree memory sampler. PowerShell + WMI walks the chrome.exe parent->child tree at 500 ms intervals, reports per-process private bytes + working set. Used to A/B the --disable-gpu / --in-process-gpu / --single-process variants (the last crashes in modern headless).

- perf/probe-renderer-mem.mjs + analyze-mem-trace.mjs: per-allocator renderer breakdown via Chromium's memory-infra trace + on-demand PMD dumps. Shows the 1.9 GB renderer is ~80 % Blink (Oilpan heap), not V8 (V8 is 34 MB). Top object classes are paged.js's per-page CSS grid (132 MB), 1 M ComputedStyle (74 MB), LayoutNG fragments (~200 MB combined), 411 k AXNodeObject for tagged-PDF (41 MB).

- --gc-passes N flag on probe-renderer-mem.mjs: triggers V8 + Memory.simulatePressureNotification between render and generate. One pass + pressure (~1 s) frees ~180 MB of dangling Blink objects reachable from no user-visible state. Not shipped -- masking a retention defect (paged.js hooks? detach-pages closures?) rather than fixing it. Hypotheses + next-step heap-snapshot direction documented in perf/README.md.
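For the two-shard probe, each browser needs its own slice of the page range. The helper below is a hypothetical sketch of that split, not code from perf/probe-parallel.mjs: it assumes contiguous, near-equal shards and emits strings in the "first-last" form that a pageRanges print option accepts.

```javascript
// Hypothetical helper: split pages 1..totalPages into `shards` contiguous
// ranges, formatted as pageRanges-style strings such as "1-50".
function shardPageRanges(totalPages, shards) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError("totalPages must be a positive integer");
  }
  if (!Number.isInteger(shards) || shards < 1) {
    throw new RangeError("shards must be a positive integer");
  }
  // Never create more shards than there are pages.
  const count = Math.min(shards, totalPages);
  const base = Math.floor(totalPages / count);
  const extra = totalPages % count; // the first `extra` shards get one more page
  const ranges = [];
  let first = 1;
  for (let i = 0; i < count; i++) {
    const last = first + base + (i < extra ? 1 : 0) - 1;
    ranges.push(first === last ? String(first) : `${first}-${last}`);
    first = last + 1;
  }
  return ranges;
}

console.log(shardPageRanges(101, 2).join(", ")); // 1-51, 52-101
```

An even page split does not guarantee an even time split, since generate cost depends on page content, so a real probe might weight shards by measured per-page cost instead.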

…ion.

CDP HeapProfiler.takeHeapSnapshot at post-render (and post-gc when combined with --gc-passes) -- ~200 MB .heapsnapshot file per dump, loadable in Chrome DevTools Memory tab. The Comparison view between pre- and post-gc snapshots shows which V8-visible categories the GC freed; Summary + filter "Detached" surfaces DOM nodes still held by JS after their owning page was removed, and Retainers gives the exact chain. Workflow documented in perf/README.md under "--heap-snapshot: extract V8 retainer chains".

Oilpan-only objects (PhysicalBoxFragment, LogicalLineItems, ConstraintSpace::RareData -- no V8 wrapper) don't appear in the V8 snapshot but are typically owned by a DOM node that does, so the investigation route is detached-DOM-from-snapshot + ownership graph from the memory-infra dump.

V8 heap snapshot diff pre-gc vs post-gc is byte-identical -- same 2,938,992 nodes, same 108.9 MB self_size, same per-category counts. Rules out the "dangling JS references" hypothesis the gc-pass probe initially suggested.

Per-Blink-class diff of the memory-infra dumps (new perf/diff-blink-classes.mjs) shows what actually gets freed: style-system caches and layout intermediates that are unreachable from the moment their page finalises but stay in Oilpan because nothing forces a major GC during the synchronous render loop. Two ~100% freed categories are the cleanest signal: CachedMatchedProperties (Blink's style-sharing cache, dead after layout) and GridItemData (paged.js's per-page-template CSS grid items, dead after layout). The remainder is sub-ComputedStyle (StyleBoxData, StyleSurroundData, StyleMisc*), ShapeResultView / HarfBuzzRunGlyphData / ShapeResultRun, layout-fragment RareData.

Conclusion: not a leak, not actionable as a retention fix. The only direct mitigation is forcing a GC (already rejected, costs ~1 s). Indirect lever is upstream DOM size (DOM-shape audit).

Tooling produced:

- perf/analyze-heap-snapshot.mjs: top type x name aggregation + pairwise diff for V8 heap snapshots. Also surfaces the detachedness=2 subset (corrected from earlier mis-read of the V8 DetachednessV8 enum, where {1=Attached, 2=Detached}).

- perf/diff-blink-classes.mjs: per-Blink-class diff between two memory-infra dumps in the same trace. Strips the per-dump GUID suffix from class names so the same class lines up across dumps.

README updated: GC-pass section title and intro corrected; "What might be holding the references" replaced with "What the GC actually freed"; --heap-snapshot workflow re-framed as a visibility check rather than a retainer-chain hunt (because the diff is zero).
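A "type x name" aggregation with a pairwise diff can be sketched against the .heapsnapshot JSON layout, where `nodes` is a flat integer array whose stride and column meanings are given by `snapshot.meta.node_fields`. This is an illustration only, not the contents of perf/analyze-heap-snapshot.mjs; the function names are hypothetical, and the input is assumed to be an already-parsed snapshot object.

```javascript
// Sketch: aggregate count and self_size per "type name" key.
function aggregateByTypeAndName(snap) {
  const fields = snap.snapshot.meta.node_fields;
  const typeNames = snap.snapshot.meta.node_types[0]; // enum of node types
  const stride = fields.length;
  const typeAt = fields.indexOf("type");
  const nameAt = fields.indexOf("name");
  const sizeAt = fields.indexOf("self_size");
  const totals = new Map();
  for (let i = 0; i < snap.nodes.length; i += stride) {
    const type = typeNames[snap.nodes[i + typeAt]];
    const name = snap.strings[snap.nodes[i + nameAt]];
    const key = `${type} ${name}`;
    const row = totals.get(key) ?? { count: 0, selfSize: 0 };
    row.count += 1;
    row.selfSize += snap.nodes[i + sizeAt];
    totals.set(key, row);
  }
  return totals;
}

// Pairwise diff: rows whose count or self size changed between snapshots.
function diffAggregates(before, after) {
  const changed = [];
  for (const key of new Set([...before.keys(), ...after.keys()])) {
    const a = before.get(key) ?? { count: 0, selfSize: 0 };
    const b = after.get(key) ?? { count: 0, selfSize: 0 };
    if (a.count !== b.count || a.selfSize !== b.selfSize) {
      changed.push({ key, count: b.count - a.count, selfSize: b.selfSize - a.selfSize });
    }
  }
  return changed.sort((x, y) => x.selfSize - y.selfSize);
}

// Tiny synthetic snapshot: two "object Foo" nodes and one "string Bar" node.
const tinySnapshot = {
  snapshot: { meta: {
    node_fields: ["type", "name", "id", "self_size", "edge_count"],
    node_types: [["hidden", "object", "string"], "string", "number", "number", "number"],
  } },
  nodes: [1, 1, 1, 100, 0,  1, 1, 3, 50, 0,  2, 2, 5, 10, 0],
  strings: ["", "Foo", "Bar"],
};
const tinyTotals = aggregateByTypeAndName(tinySnapshot);
console.log(tinyTotals.get("object Foo")); // { count: 2, selfSize: 150 }
console.log(diffAggregates(tinyTotals, tinyTotals).length); // 0, i.e. identical
```

An empty diff between pre-gc and post-gc aggregates is the shape of result the commit message reports for the V8 side.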

Research notes from the conversation that explored what it would take to extract Blink's draw stream (SkPicture / cc::PaintRecord), spawn standalone PrintCompositor utility processes, or build a Chromium-linked helper binary -- all to enable parallel PDF generation without N-way memory blowup.

Five approaches catalogued with honest cost estimates:

A. Patch + upstream a Chromium flag (skip PrintCompositor for single-renderer, or streaming printToPDF).
B. Port SkPDF to JS (doesn't help alone -- the input data extraction is the real bottleneck).
C. Frida + reimplement Mojo client in Node (~15-22 weeks).
D. Frida + CanvasKit-WASM workers (~6-10 weeks, tagged-PDF rebuild required).
E. Helper binary linking Chromium components (~4-6 weeks total, corrected from earlier overestimates -- shallow gclient sync ~20-30 GB and ~30-90 min, targeted ninja build of ~1500-2500 TUs ~30-90 min first time).

All rejected for the current 70 s build, but documented so the analysis isn't lost if the book size or CI budget makes it relevant again.

Also captures the hard facts:

- chrome.dll is a single 283 MB monolithic binary with exactly six exported functions (ChromeMain + 5 others); PrintCompositor / Mojo / Skia / Blink / V8 are not externally callable.
- The idle Chromium tree is ~125-180 MB (corrected from earlier claim of "70-1100 MB"; the high end was PDF-in-transit, not steady-state).
- HarfBuzz shaping results and SkTextBlob glyph positions never leave the renderer via any public API; the natural extraction point is the Mojo serialization between renderer and PrintCompositor.

New probe perf/probe-idle-browser.mjs measures the idle baseline (post-launch, post-newPage, post-goto(about:blank)) -- the data behind the corrected memory math.

Pointer from perf/README.md "Memory" section to CHROMIUM.md so the separate research file is discoverable.

KubaO added 18 commits May 22, 2026 22:15

Disable per-page ResizeObserver: it caught no real reflows (~130ms sa…

ebdb817

…ved).

Add Chrome trace analysis.

78fccf0

Strip all async from paged.js render chain; RunMicrotasks 6333->0.56ms.

688ad1f

Hybrid trace: embed V8 cpu_profiler samples in --tracing output.

7886aed

Add analyze-hybrid.mjs: bottom-up + callees view across JS and Blink.

0a968d9

Per-section CSS cost attribution via ab-css.mjs; defer pageRanges sha…

4ce5289

…rding.

ab-css: pair-paired diffs + Windows /affinity auto-relaunch for stabl…

c08fc86

…e signal.

ab-css: sweep rouge.css and print.css extras; document findings.

4e4be3c

Extract pin-cpu.mjs; auto-pin measure, profile-load, profile-roundtri…

df395ef

…p on Windows.

Factor the performance README.

6a0235c

Make --detach-pages the default.

1142534

Make --timing opt-in rather than opt-out (--no-timing).

1b4fad6

KubaO merged commit 53b8f25 into twinbasic:main May 23, 2026
1 check passed

KubaO merged 18 commits into twinbasic:main from KubaO:staging
