feat(sql): add byte caps for CachedWindow and ORDER BY native memory by puzpuzpuz · Pull Request #7157 · questdb/questdb

puzpuzpuz · 2026-05-26T10:17:05Z

Summary

Adds six byte-denominated per-operator caps for native memory growth in the CachedWindow record store, the LongTreeChain key and value heaps (used by both window functions and ORDER BY), and the ORDER BY sort heaps. The caps are unset (uncapped) by default; setting any of them bounds the operator and raises LimitOverflowException with a (raise X) hint naming the key that needs to change:

cairo.sql.window.cache.max.bytes
cairo.sql.window.rowid.max.bytes
cairo.sql.window.tree.max.bytes
cairo.sql.sort.key.max.bytes
cairo.sql.sort.light.value.max.bytes
cairo.sql.sort.value.max.bytes

Marks cairo.sql.window.tree.max.pages, cairo.sql.window.rowid.max.pages, cairo.sql.sort.key.max.pages, cairo.sql.sort.light.value.max.pages, and cairo.sql.sort.value.max.pages deprecated. They continue to be parsed: if a user has one set, the derived byte default for the corresponding new key becomes pageSize * maxPages. An explicit new *.max.bytes value wins when both are set.
For cairo.sql.window.cache.max.bytes, precedence is resolved against the legacy cairo.sql.window.store.max.pages (and its cairo.sql.analytic.store.max.pages alias), which still drive the per-partition MemoryCARW window function buffers. The new bytes key wins when set explicitly; the legacy pages key wins when only it is explicit; otherwise the new bytes default wins. The resolved key path is plumbed into the runtime so the error message names the binding constraint instead of sending the user down the wrong knob. One nuance worth flagging: when only the legacy cairo.sql.analytic.store.max.pages alias is explicit, the runtime hint names cairo.sql.window.store.max.pages (the modern key that supersedes it) rather than the alias the user has in their config — raising the modern key still takes effect via the same resolution path.
When sort-key materialization is engaged, cairo.sql.sort.key.max.bytes is split across the materialized column buffers in proportion to each column's fixed-size width (via ColumnType.sizeOf). A wider column (e.g. DECIMAL256, 32 B per row) receives a proportionally larger share than a narrow one (e.g. BOOLEAN, 1 B per row), so each buffer's row capacity stays roughly balanced and the total reachable memory tracks the operator-wide budget rather than the wide column's footprint times the column count. Each buffer is still floored at one page so the cursor can initialise; the only overshoot is for sub-page budgets, where the floor exceeds the requested cap by at most bufferCount * PAGE_SIZE.
Page-size config keys (cairo.sql.sort.{key,light.value,value}.page.size and the three cairo.sql.window.*.page.size keys) clamp to a minimum of 1 byte at read time, so a misconfigured *.page.size=0 fails over to a tiny page rather than propagating a 0 into downstream divisions and surfacing as an ArithmeticException from the factory constructor.
Tradeoff: with the caps unset by default, this PR provides opt-in protection rather than preventing OOM out of the box. Operators that previously survived a lag(...) over a multi-billion-row table on unbounded growth keep that behavior; deployments that want a guard must set one of the new keys. The previous iteration of this PR defaulted each cap to 4 GiB, which would have been a breaking change for any operator already exceeding that ceiling in production; this version is not.
Without any cap set, the only ceilings remain the internal compressed-offset limits in AbstractRedBlackTree (~32 GiB key heap) and LongTreeChain (~16 GiB value heap), plus the JVM's native memory budget.

Test plan

CachedWindowMemoryCapTest exercises each of the cache, rowid, and tree caps firing, the happy path with uncapped defaults, the same workload succeeding after raising the cap, and per-cursor reset across repeated executions.
OrderBySortKeyMaterializationTest covers the operator-wide ceiling for single-column and equal-width multi-column sort-key materialization, plus the size-weighted split via a mixed BYTE + LONG schema where the even split would fire and the weighted split succeeds.
PropServerConfigurationTest covers deprecation precedence (including the case where both the modern deprecated key and its analytic alias are set with conflicting values), the resolved CachedWindow cap key, explicit-Integer.MAX_VALUE-on-the-deprecated-key behavior, and the page-size clamp under misconfigured *.page.size=0.
ServerMainTest "show parameters" snapshot updated to include the six new keys at Long.MAX_VALUE.
SampleByFillTest.testSortedRecordCursorFactoryHandlesKeyHeapOverflow switches from the max.pages=-1 cleanup probe to a 64-byte sort.key.max.bytes budget that triggers the same LimitOverflowException cleanup path, and asserts the (raise cairo.sql.sort.key.max.bytes) hint so a rename to the property key would fail the test instead of silently breaking the user-facing remediation guidance.

…mory to prevent OOM A single ad-hoc lag(...) OVER (PARTITION BY ...) query could grow NATIVE_TREE_CHAIN allocations to roughly 48 GiB before any cap fired; two concurrent queries on a 100 GiB instance hit the host RSS limit and made the database unresponsive. The defaults were effectively unbounded because the existing *.max.pages knobs multiplied a small page size by Integer.MAX_VALUE, with the only real ceiling being the internal compressed-offset limits in AbstractRedBlackTree (~32 GiB key heap) and LongTreeChain (~16 GiB value heap). This change introduces byte-denominated per-operator caps on the LongTreeChain key + value heaps, the CachedWindow RecordArray, and the ORDER BY tree chains, each defaulting to 4 GiB: cairo.sql.window.tree.max.bytes cairo.sql.window.rowid.max.bytes cairo.sql.window.cache.max.bytes cairo.sql.sort.key.max.bytes cairo.sql.sort.light.value.max.bytes cairo.sql.sort.value.max.bytes The legacy *.max.pages keys (tree, rowid, sort.key, sort.light.value, sort.value) are marked deprecated and parsed for backward compat: if the user has one set, pageSize * maxPages becomes the derived byte default for the corresponding new key. The new *.max.bytes key wins when both are set. Breaking change: a long-running ORDER BY or windowed lookback over a multi-billion-row table that previously consumed more than 4 GiB of native memory in one of these operators will now throw LimitOverflowException ("memory exceeded in RedBlackTree" / "in LongTreeChain") instead of growing unbounded. Operators can raise the byte cap explicitly via the new config keys. cairo.sql.window.store.max.pages is intentionally untouched: it still drives per-partition MemoryCARW buffers in roughly 30 window function factories with their own page-based cap mechanism, and is out of scope for this change.

coderabbitai · 2026-05-26T10:17:13Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c3eab13-dabf-4978-a7f4-776d9e9ca121

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch puzpuzpuz_proper_window_function_mem_cap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

The cap-exceeded message thrown by AbstractRedBlackTree, LongTreeChain, and LimitedSizeLongTreeChain previously said only "limit of N memory exceeded in RedBlackTree" / "in LongTreeChain". The same primitive backs different operators -- a RedBlackTree may be a window function's tree-key heap (cairo.sql.window.tree.max.bytes) or a sort's tree-key heap (cairo.sql.sort.key.max.bytes) -- so an operator hitting the cap could not tell which knob to raise. Each ctor now also takes the property path of the config key that owns its key heap (and value heap, for the chain variants). The overflow message appends "(raise <config_key> to increase)" so the user-facing error names the exact property to tune. Affected callers updated: CachedWindowRecordCursorFactory window.tree + window.rowid SortedLightRecordCursorFactory sort.key + sort.light.value LimitedSizeSortedLightRecordCursorFactory same AsyncTopKAtom same RecordTreeChain (fat ORDER BY in SortedRecordCursorFactory) is not covered here: its overflows surface from the inner MemoryPages / RecordChain primitives, which carry their own messages and would need a separate wrapping pass. Tracked as a follow-up.

Add registerDeprecated() calls for the five *.max.pages keys that PR 7157 marks as deprecated (sort.key, sort.light.value, sort.value, window.rowid, window.tree). Without this the validator was silently accepting the old keys, contradicting the deprecation comments in PropertyKey. Repoint the two pre-existing analytic.* aliases at the new *.max.bytes targets, since the previously-pointed-at *.max.pages keys are now themselves deprecated. Update ServerMainTest.testShowParameters to include the six new *.max.bytes keys with their 4 GiB default value. The test asserts a strict bidirectional set equality between actual and expected property dumps, so the missing entries would have caused "Extra properties" failures.

The first round of byte-cap work in this branch threaded the config key path through AbstractRedBlackTree / LongTreeChain / LimitedSizeLongTreeChain, so cap-overflow exceptions for ORDER BY trees and window tree/rowid chains end with "(raise <key> to increase)". The CachedWindow RecordArray path, SortedRecordCursorFactory's RecordTreeChain, EncodedSortRecordCursor and EncodedSortLightRecordCursor, and SortKeyMaterializingRecordCursor still threw generic messages that did not name the new config keys the user is supposed to raise. Add an optional maxPagesConfigKey to MemoryCARWImpl and MemoryPages via a new constructor overload; existing constructors delegate with null and keep their old message verbatim. Add an Vm.getCARWInstance overload that takes the key. RecordChain, RecordArray, and RecordTreeChain gain config-key parameters that they forward to the underlying MemoryCARW / MemoryPages. The relevant SQL cursor and factory callsites pass the matching new *.max.bytes key: - CachedWindowRecordCursorFactory -> cairo.sql.window.cache.max.bytes - SortedRecordCursorFactory -> cairo.sql.sort.key.max.bytes + cairo.sql.sort.value.max.bytes - EncodedSortRecordCursor's value RecordChain -> cairo.sql.sort.value.max.bytes - EncodedSortRecordCursor entryMem inline throw -> cairo.sql.sort.key.max.bytes - EncodedSortLightRecordCursor entryMem inline throw -> cairo.sql.sort.key.max.bytes + cairo.sql.sort.light.value.max.bytes - SortKeyMaterializingRecordCursor per-buffer MemoryCARW -> cairo.sql.sort.key.max.bytes CachedWindowMemoryCapTest.testCacheCapFiresAndCleansUp now asserts on the config-key-bearing message. Existing WindowFunctionTest and SecurityTest assertions are unaffected: their substrings still appear verbatim in the enriched messages, and the per-partition window buffers (governed by cairo.sql.window.store.max.pages, intentionally out of scope for this PR) keep the legacy generic message because their callers do not pass a config key.

deriveMaxBytesDefault previously used Integer.MAX_VALUE as the "unset" sentinel and returned the new 4 GiB default whenever the deprecated *.max.pages key read back as Integer.MAX_VALUE. The problem: the previously-documented default for those keys was 2^31 (= Integer.MAX_VALUE), and the published server.conf template lists exactly that value. A user who copied the documented value verbatim, or who deliberately pinned the deprecated key to 2^31 to keep the historical effectively- unbounded behavior, would silently end up with a 4 GiB cap. Switch to explicit-presence detection: read both deprecated keys (so they stay registered in the property tracker and remain visible in (show parameters)), then check whether each key was set in properties or the environment. If the main key is explicitly set, its value wins; otherwise if the alias is explicitly set, its value wins; otherwise fall through to the new 4 GiB default. The main-over-alias ordering matches the previous implementation. testDeprecatedMaxPagesHonorsExplicitIntegerMaxValue covers the 2^31 case directly. testDeprecatedMaxPagesDerivesMaxBytes, testNewMaxBytesWinsOverDeprecatedMaxPages, and ServerMainTest testShowParameters continue to pass.

Test-only changes; no production behavior touched. - testDeprecatedMaxPagesDerivesMaxBytes: swap the window.tree slot from the main cairo.sql.window.tree.max.pages key to the older cairo.sql.analytic.tree.max.pages alias, so the alias derivation path is actually exercised. Main-key derivation remains covered by the cairo.sql.sort.key.max.pages slot in the same test. - Move testNewMaxBytesWinsOverDeprecatedMaxPages out of the testDeprecated* cluster and into its alphabetical home between testMinimum2SharedWorkers and testNotValidAllowedVolumePaths0. - Rename testConcurrentCursorsHaveIndependentCaps to testRepeatedCursorsStayUnderCap and rewrite its header comment. The original name and comment promised concurrency and per-cursor cap isolation; the body only runs two sequential queries. - Drop the AndCleansUp suffix from testCacheCapFiresAndCleansUp, testRowIdCapFiresAndCleansUp, and testTreeKeyCapFiresAndCleansUp. Each test asserts only that the cap exception fires; cleanup is implicit via assertMemoryLeak. - LimitedSizeLongTreeChainTest.before: replace hardcoded "cairo.sql.sort.*.max.bytes" strings with PropertyKey.*.getPropertyPath() so a PropertyKey rename will not silently desync the test from production messages.

Three review follow-ups, no behavior change: - Drop the redundant " to increase" suffix from cap-overflow exception messages. "(raise <key>)" alone already conveys the intent. Affected throw sites: AbstractRedBlackTree, LongTreeChain, LimitedSizeLongTreeChain, MemoryCARWImpl, MemoryPages, plus the two EncodedSort* cursor classes (two sites each). The literal assertions in CachedWindowMemoryCapTest are updated in lockstep. Other tests (SecurityTest, WindowFunctionTest) assert on the operator-name prefix only and need no change. - Rename *MaxPagesByBytes locals to *MaxPagesFromBytes across the three derivation sites (CachedWindowRecordCursorFactory, EncodedSortRecordCursor, SortKeyMaterializingRecordCursor). "by bytes" reads as a possessive construction; "from bytes" is the actual relationship (max pages derived from a byte budget). - Disclose the per-buffer multiplier in SortKeyMaterializingRecordCursor. The cursor allocates one MemoryCARW per materialized column, each capped at maxBytes, so the total budget for the operator is bufferCount * maxBytes. This was true under the previous max.pages naming too, but the byte denomination invites the misreading that the cap is the aggregate. Adds a constructor comment plus a note on the cairo.sql.sort.key.max.bytes entry in server.conf and the AMI copy.

- PropServerConfiguration: downgrade deriveMaxBytesDefault (both overloads) and isPropertyExplicitlySet from protected to private. No subclass in OSS overrides them. Relocate each into the private instance-method cluster in alphabetical order: deriveMaxBytesDefault between configureSharedThreadPool and getCommitMode, isPropertyExplicitlySet between initIlpTransport and pathEquals. - CachedWindowMemoryCapTest: add testCacheCapRaisedUnblocksQuery. Mirrors the dataset and query in testCacheCapFires (50_000 rows, same lag-over-partition expression) but raises cairo.sql.window.cache.max.bytes to 16 MiB. The query runs to completion; the test asserts the first 3 result rows. Closes the loop on the negative tests by demonstrating that raising the new cap unblocks a previously-failing workload. - Move testRepeatedCursorsStayUnderCap to its correct alphabetical position between testHappyPathUnchanged and testRowIdCapFires (the earlier rename had left it out of order).

CachedWindowRecordCursorFactory previously took min(cache.max.bytes / page.size, window.store.max.pages) and always told the user to raise cache.max.bytes when growth failed. When store.max.pages was the smaller cap (e.g. on a deployment that explicitly pinned it low for the legacy per-partition window buffers), raising cache.max.bytes had no effect and the user was sent on a wild-goose chase. Move the precedence decision into PropServerConfiguration so the resolved page count and the property path to name in the error message are computed once at config load. The new bytes key wins when it is explicitly set; the legacy pages key (and its analytic.* alias) wins when only it is explicit; the new bytes default wins otherwise. The runtime then passes the resolved key string into RecordArray, so the "(raise X)" hint always names the constraint a user must actually change. Add unit tests in PropServerConfigurationTest covering the four precedence cases (default, both-explicit, legacy-only, legacy-alias) and an end-to-end test in CachedWindowMemoryCapTest that drives the factory to overflow and asserts the error names cairo.sql.window.store.max.pages.

SortKeyMaterializingRecordCursor was passing the full cairo.sql.sort.key.max.bytes budget to every per-column buffer, so the true operator ceiling was bufferCount * maxBytes. A user setting "max 4 GiB" on a query that materializes 10 sort columns silently got a 40 GiB ceiling, and the documented per-operator contract did not hold. The error message named cairo.sql.sort.key.max.bytes when one buffer overflowed, with no hint that the budget had been multiplied. Divide the budget evenly across buffers in the cursor ctor, floored at PAGE_SIZE so each buffer can still allocate its initial page even with tiny budgets or many columns. For realistic budgets the floor never binds (a 4 GiB budget over 100 columns is still 40 MiB per buffer, well above the 8 KiB floor); for pathological tiny configurations the floor is a safety net rather than a leak in the contract. Drop the "scales with the number of materialized columns" caveat from server.conf since the contract is now what the docs already promised. Add two tests in OrderBySortKeyMaterializationTest. With a 64 KiB budget over 5000 deterministic DOUBLE rows, the single-column query fits (the one buffer gets the full 64 KiB) while the two-column query overflows (each buffer gets 32 KiB and needs ~40 KiB). The pair proves the budget is now operator-wide instead of per-buffer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The six new cairo.sql.*.max.bytes keys (sort.key, sort.light.value, sort.value, window.cache, window.rowid, window.tree) previously defaulted to 4 GiB. That changed the user-visible behavior of unset configs from "unbounded" to "fails at 4 GiB", which is a silent regression for anyone who never set the deprecated *.max.pages keys. Make the keys uncapped by default and document them as opt-in; setting any of the six remains the simple way to bound the operator's native memory. deriveMaxBytesDefault now returns Long.MAX_VALUE when no deprecated key is set, and sqlWindowCacheMaxBytes defaults to Long.MAX_VALUE. DefaultCairoConfiguration follows suit so test-side defaults stay aligned with production defaults. EncodedSortLightRecordCursor sums two byte caps and then clamps to MAX_HEAP_SIZE_LIMIT. With both operands at Long.MAX_VALUE the addition overflows to a negative budget, and the subsequent min() returns that negative value, so maxEntries goes negative and the operator silently admits every row. Clamp each operand to MAX_HEAP_SIZE_LIMIT first so the saturating sum stays well-defined when either key is uncapped. Downstream operators already saturate Long.MAX_VALUE bytes correctly: RecordTreeChain.derivePageBudget clamps to Integer.MAX_VALUE pages, AbstractRedBlackTree and LongTreeChain clamp to MAX_*_HEAP_SIZE_LIMIT, EncodedSortRecordCursor clamps to MAX_HEAP_SIZE_LIMIT, and getSqlWindowCacheMaxPagesResolved clamps to Integer.MAX_VALUE. Update server.conf in both core and pkg/ami/marketplace/assets to show the keys as unset, and update PropServerConfigurationTest, ServerMainTest, and CachedWindowMemoryCapTest to assert the new defaults.

SortKeyMaterializingRecordCursor previously split sort.key.max.bytes evenly across per-column buffers, so the wide buffer (e.g. DECIMAL256, 32B per row) capped many rows earlier than the narrow one (e.g. BYTE, 1B per row). The effective operator-wide row capacity dropped well below what the byte budget should have allowed. The constructor now derives each buffer's share from ColumnType.sizeOf(colType) so wider columns receive proportionally more pages. With BYTE+LONG at a 64 KiB budget, the weighted split hands the LONG buffer 7 pages (56 KiB / 7168 rows) where the even split gave it 4 pages (32 KiB / 4096 rows), letting the cursor absorb 5000 rows where the previous algorithm fired the cap. Negative byte caps that propagate from a legacy *.max.pages=-1 through deriveMaxBytesDefault now clamp to 0 so the PAGE_SIZE floor takes over instead of multiplying through to a negative limit. The PAGE_SIZE floor stays: each buffer must allocate its initial page, so for sub-page budgets the operator-wide ceiling still overshoots by at most bufferCount * PAGE_SIZE. ColumnType.sizeOf is positive for every type the upstream codegen filter forwards to the materialiser, so the weighted share is always meaningful in practice. Adds testMaterializationCapWeightedByColumnSize covering the BYTE+LONG mixed-width case at the 64 KiB budget where even split fires and weighted split succeeds.

Six *.page.size keys (sort.key, sort.light.value, sort.value, window.store, window.rowid, window.tree) accepted user-supplied 0 verbatim. The zero then flowed into divisions added by this PR: - EncodedSortRecordCursor:82 (sort.value.page.size) - RecordTreeChain.derivePageBudget (sort.key/value page sizes) - PropServerConfiguration.sqlWindowCacheMaxPagesResolved (window.store) - DefaultCairoConfiguration.getSqlWindowCacheMaxPagesResolved (window.store) A misconfigured cairo.sql.sort.value.page.size=0 now throws ArithmeticException from inside the factory constructor instead of a useful ServerConfigurationException at startup. The three window-side keys already pass through Numbers.ceilPow2 but ceilPow2(0)=0, so the clamp is needed there too. PropServerConfiguration now applies Math.max(1, ...) at the assignment site so the resolved field is always at least one byte. Downstream code keeps its own Math.max(1L, ...) on the page count, so the operator-wide ceiling still saturates at Integer.MAX_VALUE pages when the cap is set to Long.MAX_VALUE. Adds testPageSizesClampedToOne pinning the clamp for all six keys plus the CachedWindow resolved page count under the worst-case pageSize=1.

Five small follow-ups from the review thread: - Null-skip the (raise X) suffix in AbstractRedBlackTree.checkKeyCapacity, LongTreeChain.checkValueCapacity, and LimitedSizeLongTreeChain.checkValueCapacity the same way MemoryCARWImpl.extend0 and MemoryPages.allocate0 already do, so a future caller that passes a null config key gets a plain "memory exceeded" message instead of "(raise null)". - Collapse the four duplicated LimitOverflowException construction chains in EncodedSortRecordCursor.buildAndSort and EncodedSortLightRecordCursor.buildAndSort into per-class throwLimitOverflow helpers. - Document the dual role of cairo.sql.window.store.max.pages in server.conf so a deployment tuning that key learns it also acts as the legacy CachedWindow record-store cap when cairo.sql.window.cache.max.bytes is unset. - Add testDeprecatedMaxPagesAndAliasBothExplicitMainWins to pin the documented "modern key wins over alias" precedence under conflicting explicit values on cairo.sql.window.tree.max.pages and its cairo.sql.analytic.tree.max.pages alias. - SampleByFillTest.testSortedRecordCursorFactoryHandlesKeyHeapOverflow now also asserts the "(raise cairo.sql.sort.key.max.bytes)" hint so renaming the key would surface as a test failure rather than silently breaking the user-facing remediation guidance.

PropServerConfiguration emits an advisory log line when an explicitly configured *.max.bytes is below the matching *.page.size. The AbstractRedBlackTree/LongTreeChain/CachedWindow path silently floors the effective cap at one page, so the runtime LimitOverflowException reports the floor rather than the operator's setting. Surfacing the override at init lets the operator catch the silent raise before the cap fires. Six call sites covered: sort.key, sort.light.value, sort.value, window.cache, window.rowid, window.tree. The marketplace server.conf template gains the three new window byte keys (window.cache.max.bytes, window.rowid.max.bytes, window.tree.max.bytes) alongside the existing analytic.* documentation; the main core/.../server.conf already had them. Both server.conf files said the sort.key.max.bytes budget is split "evenly" across materialised column buffers, but SortKeyMaterializingRecordCursor's constructor uses ColumnType.sizeOf weighting. The docs now describe the actual proportional split. testMaxBytesBelowPageSizeAccepted pins the contract: configuration with max.bytes < page.size is accepted, the field stores the verbatim value, and the resolved CachedWindow page count floors at one.

Replace the clamp-to-1 on the sort/window page-size config keys with an explicit startup rejection. Clamping a misconfigured *.page.size to 1 left tree- and heap-backed operators tripping an assertion or writing past a sub-block native buffer at query time, instead of failing loudly at config load as the clamp comment claimed. validatePageSizeAtLeast now rejects a page size below the operator's block minimum via ServerConfigurationException.forInvalidKey: - sort.key >= 64: it also backs RecordTreeChain's MemoryPages, whose 41-byte node cannot straddle a non-contiguous page (ceilPow2 >= 41). - window.tree >= 24: AbstractRedBlackTree key node. - sort.light.value / window.rowid >= 12: value-chain entry. - window.store >= 64: sizes per-window-function buffers via store / RECORD_SIZE (widest 40) and the RecordArray index page (store >> 4). sort.value stays clamped at 1 - a divisor-only RecordChain page with no fixed block, where a 1-byte page is slow but safe. Add defensive asserts mirroring AbstractRedBlackTree's existing key-heap guard: LongTreeChain and LimitedSizeLongTreeChain assert the value page holds one chain entry, and RecordTreeChain asserts its ceilPow2 page holds one node. These catch any non-config caller under -ea. Preserve each deprecated *.max.pages key's historical reader so no previously-valid value fails to start: the sort.* keys keep getIntSize (size suffixes such as 2k) and the window.* keys keep getInt (underscore separators such as 1_000_000). WindowFunctionTest reset window.store.page.size to 0 in a finally block, which the new rejection catches on the test config rebuild; reset it to the 1m default instead. Add config tests for the rejection band, the exact-minimum boundary, and both legacy parsing formats.

Address the medium-severity review findings on the byte-cap change. derivePageBudget() and EncodedSortRecordCursor divided the byte cap by the raw page size, but MemoryPages and MemoryCARWImpl round the page up to a power of two before allocating. A non-power-of-two sort page size therefore let the heap overshoot the configured *.max.bytes by up to ~2x. Both sites now divide by Numbers.ceilPow2(pageSize) so the cap holds. Power-of-two sizes (all shipped defaults) are unaffected. Add coverage for the new error paths and (raise X) hints: - LimitedSizeLongTreeChainTest pins the key-heap and value-heap overflow messages directly. - SampleByFillTest pins the sort.value.max.bytes hint via the value chain. - PropServerConfigurationTest restores the negative legacy *.max.pages derivation. - OrderByMemoryCapTest exercises the serial top-N path and the parallel AsyncTopK per-worker path end to end. Document the per-operator page-size minimums enforced at startup and note that cairo.sql.window.store.max.pages also governs the bare per-partition window-buffer overflow error, in both server.conf copies.

puzpuzpuz · 2026-05-29T11:25:54Z

Confirmed fresh-28 independently: RecordArray line 59 allocates auxMem with page pageSize >> 4 (= 4 bytes at the new 64-byte store minimum), and beginRecord() line 64 does
auxMem.putLong(...) (8 bytes) — so with a cache cap resolving to 1 page the aux side throws before storing any record, and the justification comment at line 153-154 ("store >> 4, so it
needs >= 64") is arithmetically wrong (64>>4 = 4 < 8).

I now have everything needed. Here is the consolidated, source-verified review.

Review: PR #7157 — feat(sql): add byte caps for CachedWindow and ORDER BY native memory (level 3)

29 raw findings → 26 verified, 3 dropped as false positives. Split: all confirmed code findings are in-diff; the only out-of-diff items are PR metadata/description. The cross-context sweep
of all ~30 unmigrated window-function callsites of getSqlWindowStoreMaxPages/getSqlWindowStorePageSize came back clean (no compile break, no overflow, no silently-dead overrides). No
resource-leak finding survived verification, and the adversarial-performance pass found nothing — the arithmetic is allocation-free and overflow-safe at Long.MAX_VALUE.

Critical

None. No confirmed correctness, safety, concurrency, resource-leak, or UB defect. Results stay correct on every path; the overflow clamps prevent negative/overflowed budgets; per-worker
tree chains are single-thread-owned under PerWorkerLocks then merged behind the dispatchAndAwait barrier.

Moderate

M1 — On the default ORDER BY path, sort.key.max.bytes alone does not bound memory, and the error message misdirects

(merges correctness-0, concurrency-1, resource-7, tests-10 — in-diff)

EncodedSortLightRecordCursor.java:75-77 (the default ORDER BY path, selected at SqlCodeGenerator.java:7449 whenever the sort key is fixed-width-encodable):

keyCap = min(getSqlSortKeyMaxBytes(), LIMIT); // LIMIT = (2^32-2)<<3 ≈ 32 GiB
valueCap = min(getSqlSortLightValueMaxBytes(), LIMIT);
maxEntryMemBytes = min(keyCap + valueCap, LIMIT);

Both caps default to Long.MAX_VALUE, so leaving one unset saturates its operand to LIMIT; the sum is then always ≥ LIMIT and the configured cap is fully masked. A user who sets only
cairo.sql.sort.key.max.bytes to bound ORDER BY gets no effective bound on this path (both keys must be lowered). throwLimitOverflow() (EncodedSortLightRecordCursor.java:229-237) compounds
this by saying "raise X or Y", implying either knob is the binding constraint. This is not a regression (master summed page budgets the same way) and not a correctness bug — but for a PR
whose stated purpose is opt-in memory bounding, the headline knob is ineffective in isolation on the most common path, and the remediation guidance is wrong. The sibling paths behave
correctly: EncodedSortRecordCursor (key cap drives entryMem, value cap a separate RecordChain) and LongTreeChain (two independent region caps) each bind on a single knob. No test exercises
a single-knob cap on the EncodedSortLight path.

Fix: either make each cap bind independently on this path (treat an unset cap as 0-contribution when the other is explicit), or document the combined-budget semantics in server.conf +
reword the error to "both X and Y bound this operator", and add an OrderByMemoryCapTest case that engages the encoded-light path with only the key cap set.

M2 — The new (raise ...) remediation hint on the EncodedSort paths is asserted by no test

(tests-9 — in-diff)

throwLimitOverflow() in both EncodedSortLightRecordCursor.java:229-237 and EncodedSortRecordCursor.java:243-249 appends the new (raise cairo.sql.sort.key.max.bytes[ or ...]) hint, but the
only tests hitting these paths (SecurityTest.java:560,809,1092,1156) assert only the pre-existing substring "memory exceeded in EncodedSort". The new user-facing hint is unverified and a
property rename would silently break it — exactly the failure mode the PR deliberately guards against elsewhere (SampleByFillTest, OrderByMemoryCapTest, CachedWindowMemoryCapTest all pin
their (raise ...) text). Fix: tighten one SecurityTest assertion (or add a dedicated test) to assert the full hint for both encoded paths.

M3 — PR description misstates page-size handling (says "clamp to 1" where the code rejects at startup)

(metadata-17 — PR body)

The body bullet claims all six page-size keys "clamp to a minimum of 1 byte at read time, so a misconfigured *.page.size=0 fails over to a tiny page." In reality, only
cairo.sql.sort.value.page.size clamps (Math.max(1, …)); the other five (sort.key≥64, sort.light.value≥12, window.store≥64, window.rowid≥16, window.tree≥32) are rejected at startup via
validatePageSizeAtLeast (PropServerConfiguration.java:2711) — the server won't boot. The PR's own server.conf text ("rejected at startup if smaller") and testPageSizeBelowBlockSizeRejected
contradict the body. Fix: rewrite the bullet to state heap-backed keys are rejected (naming the key) and only sort.value.page.size clamps; fix the matching test-plan line.

Minor

m1 — RecordArray aux index page can't hold one 8-byte slot at the new store minimum (fresh-28, in-diff). RecordArray.java:59 allocates auxMem with page pageSize >> 4; at the new
MIN_WINDOW_STORE_PAGE_SIZE=64 that's 4 bytes, but beginRecord() writes an 8-byte putLong. With window.store.page.size in [64,127] and a cache cap resolving to ~1 page, the aux side throws
"breached in VirtualMemory" before any record is stored (and the message names the cap key, not the page-size key). The justification comment at PropServerConfiguration.java:153-154
("store >> 4, so it needs >= 64") is arithmetically wrong — 64>>4=4<8. Fix: raise MIN_WINDOW_STORE_PAGE_SIZE to 128, or allocate auxMem with Math.max(pageSize >> 4, 8); correct the
comment.
m2 — Migrated sort.key.max.pages gives materialization buffers a ~16× looser ceiling (perf-4 + resource-8 + crosscontext-23, in-diff). The derived byte cap is sort.key.page.size(128k) ×
maxPages, but the materializer's internal unit is PAGE_SIZE=8192, so per-buffer caps scale by 128k/8k = 16× vs pre-PR for a single column (looser for <16 materialized columns, tighter for

16). Intentional/documented operator-wide semantics and only affects deployments that set the deprecated key; a relaxation that can't break a passing query. Fix: add an explicit migration
note ("raise sort.key.max.bytes if you relied on the old per-buffer ceiling").

m3 — DefaultCairoConfiguration sort caps changed from finite (1024 pages) to uncapped (crosscontext-22 + fresh-24, in-diff). Removes the implicit ~128 MB–16 GB test-time safety ceiling
for every AbstractCairoTest. Production (PropServerConfiguration, already Integer.MAX_VALUE) is unchanged and no current test depends on the old default. Risk is only a future runaway-sort
test silently OOMing instead of surfacing a clean LimitOverflowException. Fix (optional): return 1024L * getSqlSortKeyPageSize() to preserve the finite test default, or note the
intentional removal.
m4 — Parallel ORDER BY ... LIMIT cap is per-worker; aggregate is (workers+1) × cap, undocumented (concurrency-2, in-diff). AsyncTopKAtom.java:106-132 gives the owner chain and every
per-worker chain the full byte budget. Concurrency-safe; just a documentation gap in server.conf.
m5 — warnIfMaxBytesBelowPageSize compares raw page size; consumers floor by ceilPow2(pageSize) (fresh-26, in-diff). For non-power-of-two sort.value.page.size (and sort.key via
RecordTreeChain.derivePageBudget), the advisory can disagree with the real flooring. Advisory-log accuracy only; default page sizes are powers of two. Note the suggested fix's mention of
window.store is a no-op (those are already ceilPow2'd at config time) and must not be applied to sort.light.value (consumed raw).
m6 — New boolean locals miss the is/has prefix (quality-14, in-diff). PropServerConfiguration.java:1790-1792 cacheMaxBytesExplicit/storeMaxPagesExplicit violate the CLAUDE.md naming rule
(the sibling hasAlias in the same PR follows it). Rename to isCacheMaxBytesExplicit/isStoreMaxPagesExplicit.
m7 — Resolved cache-cap error names the canonical key even when only the analytic.* alias was set (quality-15, in-diff). PropServerConfiguration.java:1795 hard-codes
window.store.max.pages though storeMaxPagesExplicit is true for either key. Cosmetic (keys are deprecated-equivalent; raising the named key works).
m8 — Sub-minimum legacy alias page.size is rejected naming the new key, not the alias the user set (quality-16, in-diff). e.g. analytic.tree.page.size=16 is rejected as
cairo.sql.window.tree.page.size. Cosmetic; the value is invalid regardless.
m9 — EncodedSortRecordCursor (full_encoded) value-chain byte cap is untested (tests-11, in-diff). The only sort.value overflow test uses full_recordchain → RecordTreeChain. The byte cap,
page-rounding, message, and config-key hint are shared with the tested path, so the gap is the class wiring only.
m10 — testParallelTopKPerWorkerCapFires can't distinguish per-worker from owner-chain overflow (tests-12, in-diff). Both chains throw the identical message, so the assertion can't prove
the per-worker path fired as the comment claims. Soften the comment or add a per-worker sentinel.
m11 — Negative legacy-cap floor is asserted only at the config layer (tests-13, in-diff). The old end-to-end max.pages=-1 probe was rewritten to a positive value; no query-level test now
drives a negative cap through AbstractRedBlackTree/SortKeyMaterializingRecordCursor. The floor logic is verified correct by inspection; the gap is coverage only.
m12 — Performance label not justified (metadata-18, out-of-diff). feat(sql) memory-guard PR with no benchmark/perf claim; the only benchmark touched is cleanup from the API rename.
Remove the label.
m13 — LIMIT 40000 missing thousands underscore (metadata-19, in-diff). OrderByMemoryCapTest.java:59,93 — use 40_000 to match the rest of the file and CLAUDE.md.
m14 — AMI server.conf omits # Deprecated: notes on analytic.{rowid,tree}.max.pages (metadata-20, in-diff). The main server.conf carries them; the AMI copy doesn't (though it did add them
for the sort keys). analytic.store.max.pages is correctly left undeprecated in both.

Downgraded (false positives)

fresh-25 — divide-by-zero in DefaultCairoConfiguration.getSqlWindowCacheMaxPagesResolved(). Dropped: no subclass anywhere overrides getSqlWindowStorePageSize() to 0; the cited
WindowFunctionTest =0 value flows through PropServerConfiguration, which ceilPow2s and rejects it (MIN=64), never reaching the divisor.
fresh-27 — window.cache.max.bytes doesn't bound the per-partition window buffers. Dropped: factually true but the in-diff server.conf already scopes the key to "the CachedWindow
materialised record store" and explicitly ties per-partition buffers to window.store.max.pages. The finding's own suggested fix is already implemented; no contradiction with stated intent.
crosscontext-21 — "all callsites SAFE". Not a defect — a clean cross-context confirmation (verified: zero references to the 5 removed int getters, mvn -pl core test-compile green,
RankFunctionFactory pageSize*maxPages/2 casts to long first so no overflow, SecurityTest translated 1:1).

Summary

Verdict: approve with changes recommended. This is a clean, well-tested PR with no blocking issues — no critical findings, no surviving resource-leak or correctness defects, and a
verified-safe blast radius across the ~30 unmigrated window-function callsites. The arithmetic (overflow-safe divide-then-multiply, clamps before sums, negative-cap flooring) is sound.

Address before merge: M1 (single-knob sort.key.max.bytes is a no-op on the default ORDER BY path + misleading error message — the most important, since it undercuts the feature's headline
use case) and M3 (PR description contradicts the shipped reject-at-startup behavior). M2 wants a migration callout. M2/M3/M4/m3 are the tradeoffs worth stating plainly in the description:
the cap is per-worker for parallel top-K, migrated max.pages shifts the materialization ceiling, and the test-time default is now uncapped. Everything else is minor (one concrete
edge-config wart in RecordArray/MIN_WINDOW_STORE_PAGE_SIZE, plus naming/test/doc polish).

Counts: 29 draft findings → 26 verified, 3 false positives removed. In-diff/out-of-diff: all confirmed code findings in-diff; 1 finding (label) out-of-diff PR metadata, 1 (description) in
the PR body. The non-trivial-but-zero-code-bug out-of-diff result reflects a genuinely clean caller sweep, not an underrun — the cross-context agent enumerated and cleared every unmigrated
callsite with grep evidence and a successful compile.

mtopolnik · 2026-05-29T16:28:39Z

[PR Coverage check]

😍 pass : 224 / 235 (95.32%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/griffin/engine/orderby/LimitedSizeSortedLightRecordCursorFactory.java	4	8	50.00%
🔵	io/questdb/std/MemoryPages.java	7	9	77.78%
🔵	io/questdb/griffin/engine/orderby/EncodedSortLightRecordCursor.java	10	12	83.33%
🔵	io/questdb/griffin/engine/orderby/EncodedSortRecordCursor.java	15	17	88.24%
🔵	io/questdb/PropServerConfiguration.java	70	71	98.59%
🔵	io/questdb/cairo/vm/Vm.java	1	1	100.00%
🔵	io/questdb/griffin/engine/orderby/SortedLightRecordCursorFactory.java	4	4	100.00%
🔵	io/questdb/cairo/RecordChain.java	3	3	100.00%
🔵	io/questdb/griffin/engine/AbstractRedBlackTree.java	8	8	100.00%
🔵	io/questdb/cairo/DefaultCairoConfiguration.java	9	9	100.00%
🔵	io/questdb/PropertyKey.java	13	13	100.00%
🔵	io/questdb/griffin/engine/table/AsyncTopKAtom.java	8	8	100.00%
🔵	io/questdb/griffin/engine/orderby/SortKeyMaterializingRecordCursor.java	16	16	100.00%
🔵	io/questdb/cairo/vm/MemoryCARWImpl.java	9	9	100.00%
🔵	io/questdb/cairo/CairoConfigurationWrapper.java	8	8	100.00%
🔵	io/questdb/griffin/engine/orderby/LimitedSizeLongTreeChain.java	9	9	100.00%
🔵	io/questdb/griffin/engine/orderby/LongTreeChain.java	9	9	100.00%
🔵	io/questdb/griffin/engine/orderby/SortKeyMaterializingRecordCursorFactory.java	1	1	100.00%
🔵	io/questdb/cairo/RecordArray.java	4	4	100.00%
🔵	io/questdb/griffin/engine/orderby/SortedRecordCursorFactory.java	4	4	100.00%
🔵	io/questdb/griffin/engine/orderby/RecordTreeChain.java	6	6	100.00%
🔵	io/questdb/griffin/engine/window/CachedWindowRecordCursorFactory.java	6	6	100.00%

…7157)

Resolve the SampleByFillTest conflict by keeping master's byte-cap overflow semantics (#7157) expressed through the QueryAssertion builder rather than the old sqlSortKeyMaxPages mechanism. Master also added new tests that call the legacy assertQueryNoLeakCheck overloads this branch had already removed from AbstractCairoTest. Migrate those call sites to the builder so the merge compiles: the value-heap overflow test in SampleByFillTest and all of the new CachedWindowMemoryCapTest. Each migration preserves master's factory properties (expectedTimestamp, supportsRandomAccess, expectSize). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve conflicts from upstream PRs #7182 (LIMIT -N over a covering index), #7187 (fluent QueryAssertion builder) and #7157 (ORDER BY native-memory byte caps). CoveringIndexRecordCursorFactory: both this branch and #7182 independently reworked the multi-key merge to emit rows in row-id (ts-ascending) order. Take upstream's version as the base -- its min-scan merge plus the new single-key LIMIT -N backward scan -- then re-apply this branch's additive fixes on top: getScanDirection(), duplicate IN-key dedup, and the TablePageFrameCursor promotion (fixes the parallel-keyed-group-by-under-SelectedRecord ClassCastException). getScanDirection() reports OTHER only for multi-key latestBy; single-key latestBy stays FORWARD (it returns one row, trivially ts-ordered) so the designated-timestamp metadata upstream's tests expect is preserved. EncodedSortRecordCursor: keep upstream's extracted throwLimitOverflow(). Tests: take upstream's fluent-builder assertions for shared cases and keep this branch's new regression tests. #7187 removed the legacy positional assertQuery()/assertQueryNoLeakCheck() helpers, so migrate ~35 remaining call sites across 10 test files to the fluent builder. Verified: core test-compile clean; CoveringIndexTest (394), ExplainPlanTest (534), AsOfJoinTest (115), CompiledFilterTest (50), LimitTest (47) and the migrated group-by / distinct / IPv4 / array tests all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

puzpuzpuz self-assigned this May 26, 2026

puzpuzpuz added SQL Issues or changes relating to SQL execution DO NOT MERGE These changes should not be merged to main branch labels May 26, 2026

puzpuzpuz added 5 commits May 26, 2026 13:43

Fix benchmark compilation

782748a

puzpuzpuz added Compatibility Compatibility with third-party tools and services Bug Incorrect or unexpected behavior labels May 26, 2026

puzpuzpuz and others added 6 commits May 26, 2026 15:59

Fix formatting

45b820b

puzpuzpuz removed the DO NOT MERGE These changes should not be merged to main branch label May 26, 2026

Fix SampleByFillTest

324d137

puzpuzpuz added the DO NOT MERGE These changes should not be merged to main branch label May 26, 2026

puzpuzpuz changed the title ~~fix(sql): breaking change 💥 - cap CachedWindow and ORDER BY native memory to prevent OOM~~ feat(sql): add byte caps for CachedWindow and ORDER BY native memory May 26, 2026

puzpuzpuz added Enhancement Enhance existing functionality New feature Feature requests and removed Bug Incorrect or unexpected behavior Compatibility Compatibility with third-party tools and services DO NOT MERGE These changes should not be merged to main branch labels May 26, 2026

puzpuzpuz added 2 commits May 28, 2026 11:16

Merge branch 'master' into puzpuzpuz_proper_window_function_mem_cap

9c064ae

Merge branch 'master' into puzpuzpuz_proper_window_function_mem_cap

fc00fc3

puzpuzpuz and others added 5 commits May 28, 2026 19:44

Merge branch 'master' into puzpuzpuz_proper_window_function_mem_cap

1b3b586

puzpuzpuz changed the title ~~feat(sql): add byte caps for CachedWindow and ORDER BY native memory~~ fix(sql): fix wrong results and memory leaks in posting-index, join and group-by queries May 29, 2026

puzpuzpuz added Performance Performance improvements Core Related to storage, data type, etc. labels May 29, 2026

puzpuzpuz changed the title ~~fix(sql): fix wrong results and memory leaks in posting-index, join and group-by queries~~ feat(sql): add byte caps for CachedWindow and ORDER BY native memory May 29, 2026

puzpuzpuz added 2 commits May 29, 2026 12:26

bluestreak01 approved these changes May 29, 2026

View reviewed changes

Merge branch 'master' into puzpuzpuz_proper_window_function_mem_cap

e910ac8

bluestreak01 merged commit 7ddece9 into master Jun 1, 2026
51 checks passed

bluestreak01 deleted the puzpuzpuz_proper_window_function_mem_cap branch June 1, 2026 10:58

RaphDal pushed a commit that referenced this pull request Jun 1, 2026

feat(sql): add byte caps for CachedWindow and ORDER BY native memory (#…

8845898

…7157)

puzpuzpuz mentioned this pull request Jun 2, 2026

fix(sql): fix wrong results and memory leaks in posting-index, join and group-by queries #7150

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sql): add byte caps for CachedWindow and ORDER BY native memory#7157

feat(sql): add byte caps for CachedWindow and ORDER BY native memory#7157
bluestreak01 merged 24 commits into
masterfrom
puzpuzpuz_proper_window_function_mem_cap

puzpuzpuz commented May 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Review skipped

Uh oh!

puzpuzpuz commented May 29, 2026

Uh oh!

mtopolnik commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

puzpuzpuz commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

puzpuzpuz commented May 29, 2026

Uh oh!

mtopolnik commented May 29, 2026

[PR Coverage check]

file detail

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

puzpuzpuz commented May 26, 2026 •

edited

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading