feat(sql): add byte caps for CachedWindow and ORDER BY native memory #7157
…mory to prevent OOM
A single ad-hoc lag(...) OVER (PARTITION BY ...) query could grow
NATIVE_TREE_CHAIN allocations to roughly 48 GiB before any cap fired;
two concurrent queries on a 100 GiB instance hit the host RSS limit
and made the database unresponsive. The defaults were effectively
unbounded because the existing *.max.pages knobs multiplied a small
page size by Integer.MAX_VALUE, with the only real ceiling being the
internal compressed-offset limits in AbstractRedBlackTree (~32 GiB
key heap) and LongTreeChain (~16 GiB value heap).
This change introduces byte-denominated per-operator caps on the
LongTreeChain key + value heaps, the CachedWindow RecordArray, and
the ORDER BY tree chains, each defaulting to 4 GiB:
cairo.sql.window.tree.max.bytes
cairo.sql.window.rowid.max.bytes
cairo.sql.window.cache.max.bytes
cairo.sql.sort.key.max.bytes
cairo.sql.sort.light.value.max.bytes
cairo.sql.sort.value.max.bytes
The legacy *.max.pages keys (tree, rowid, sort.key, sort.light.value,
sort.value) are marked deprecated and parsed for backward compat: if
the user has one set, pageSize * maxPages becomes the derived byte
default for the corresponding new key. The new *.max.bytes key wins
when both are set.
Breaking change: a long-running ORDER BY or windowed lookback over a
multi-billion-row table that previously consumed more than 4 GiB of
native memory in one of these operators will now throw
LimitOverflowException ("memory exceeded in RedBlackTree" / "in
LongTreeChain") instead of growing unbounded. Operators can raise
the byte cap explicitly via the new config keys.
cairo.sql.window.store.max.pages is intentionally untouched: it still
drives per-partition MemoryCARW buffers in roughly 30 window function
factories with their own page-based cap mechanism, and is out of
scope for this change.
The cap-exceeded message thrown by AbstractRedBlackTree, LongTreeChain, and LimitedSizeLongTreeChain previously said only "limit of N memory exceeded in RedBlackTree" / "in LongTreeChain". The same primitive backs different operators -- a RedBlackTree may be a window function's tree-key heap (cairo.sql.window.tree.max.bytes) or a sort's tree-key heap (cairo.sql.sort.key.max.bytes) -- so an operator hitting the cap could not tell which knob to raise.

Each ctor now also takes the property path of the config key that owns its key heap (and value heap, for the chain variants). The overflow message appends "(raise <config_key> to increase)" so the user-facing error names the exact property to tune.

Affected callers updated:
- CachedWindowRecordCursorFactory: window.tree + window.rowid
- SortedLightRecordCursorFactory: sort.key + sort.light.value
- LimitedSizeSortedLightRecordCursorFactory: same
- AsyncTopKAtom: same

RecordTreeChain (fat ORDER BY in SortedRecordCursorFactory) is not covered here: its overflows surface from the inner MemoryPages / RecordChain primitives, which carry their own messages and would need a separate wrapping pass. Tracked as a follow-up.
Add registerDeprecated() calls for the five *.max.pages keys that PR 7157 marks as deprecated (sort.key, sort.light.value, sort.value, window.rowid, window.tree). Without this the validator was silently accepting the old keys, contradicting the deprecation comments in PropertyKey. Repoint the two pre-existing analytic.* aliases at the new *.max.bytes targets, since the previously-pointed-at *.max.pages keys are now themselves deprecated. Update ServerMainTest.testShowParameters to include the six new *.max.bytes keys with their 4 GiB default value. The test asserts a strict bidirectional set equality between actual and expected property dumps, so the missing entries would have caused "Extra properties" failures.
The first round of byte-cap work in this branch threaded the config
key path through AbstractRedBlackTree / LongTreeChain /
LimitedSizeLongTreeChain, so cap-overflow exceptions for ORDER BY
trees and window tree/rowid chains end with
"(raise <key> to increase)". The CachedWindow RecordArray path,
SortedRecordCursorFactory's RecordTreeChain, EncodedSortRecordCursor
and EncodedSortLightRecordCursor, and SortKeyMaterializingRecordCursor
still threw generic messages that did not name the new config keys
the user is supposed to raise.
Add an optional maxPagesConfigKey to MemoryCARWImpl and MemoryPages
via a new constructor overload; existing constructors delegate with
null and keep their old message verbatim. Add a Vm.getCARWInstance
overload that takes the key. RecordChain, RecordArray, and
RecordTreeChain gain config-key parameters that they forward to the
underlying MemoryCARW / MemoryPages. The relevant SQL cursor and
factory callsites pass the matching new *.max.bytes key:
- CachedWindowRecordCursorFactory -> cairo.sql.window.cache.max.bytes
- SortedRecordCursorFactory -> cairo.sql.sort.key.max.bytes
+ cairo.sql.sort.value.max.bytes
- EncodedSortRecordCursor's value RecordChain
-> cairo.sql.sort.value.max.bytes
- EncodedSortRecordCursor entryMem inline throw
-> cairo.sql.sort.key.max.bytes
- EncodedSortLightRecordCursor entryMem inline throw
-> cairo.sql.sort.key.max.bytes
+ cairo.sql.sort.light.value.max.bytes
- SortKeyMaterializingRecordCursor per-buffer MemoryCARW
-> cairo.sql.sort.key.max.bytes
CachedWindowMemoryCapTest.testCacheCapFiresAndCleansUp now asserts on
the config-key-bearing message. Existing WindowFunctionTest and
SecurityTest assertions are unaffected: their substrings still appear
verbatim in the enriched messages, and the per-partition window
buffers (governed by cairo.sql.window.store.max.pages, intentionally
out of scope for this PR) keep the legacy generic message because
their callers do not pass a config key.
deriveMaxBytesDefault previously used Integer.MAX_VALUE as the "unset" sentinel and returned the new 4 GiB default whenever the deprecated *.max.pages key read back as Integer.MAX_VALUE. The problem: the previously-documented default for those keys was 2^31 - 1 (= Integer.MAX_VALUE), and the published server.conf template lists exactly that value. A user who copied the documented value verbatim, or who deliberately pinned the deprecated key to 2^31 - 1 to keep the historical effectively-unbounded behavior, would silently end up with a 4 GiB cap.

Switch to explicit-presence detection: read both deprecated keys (so they stay registered in the property tracker and remain visible in (show parameters)), then check whether each key was set in properties or the environment. If the main key is explicitly set, its value wins; otherwise if the alias is explicitly set, its value wins; otherwise fall through to the new 4 GiB default. The main-over-alias ordering matches the previous implementation.

testDeprecatedMaxPagesHonorsExplicitIntegerMaxValue covers the 2^31 - 1 case directly. testDeprecatedMaxPagesDerivesMaxBytes, testNewMaxBytesWinsOverDeprecatedMaxPages, and ServerMainTest testShowParameters continue to pass.
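The explicit-presence rule above can be sketched as follows. This is a minimal illustration, not the PropServerConfiguration code: the class name, the method signature, and the use of a plain Map to stand in for "explicitly set in properties or the environment" are all assumptions.

```java
import java.util.Map;

final class MaxBytesDefaultSketch {
    // 4 GiB: the default for the new *.max.bytes keys at this point in the branch.
    static final long DEFAULT_MAX_BYTES = 4L << 30;

    // 'explicit' holds only keys the user actually set (hypothetical stand-in
    // for the properties/environment lookup). Absence, not a sentinel value,
    // decides whether the default applies.
    static long deriveMaxBytes(Map<String, String> explicit, String mainKey, String aliasKey, long pageSize) {
        String main = explicit.get(mainKey);
        if (main != null) {
            return pageSize * Long.parseLong(main); // main key wins
        }
        String alias = explicit.get(aliasKey);
        if (alias != null) {
            return pageSize * Long.parseLong(alias); // then the older alias
        }
        return DEFAULT_MAX_BYTES; // nothing explicit: fall through to the default
    }

    public static void main(String[] args) {
        Map<String, String> pinned = Map.of("cairo.sql.window.tree.max.pages", "2147483647");
        System.out.println(deriveMaxBytes(pinned,
                "cairo.sql.window.tree.max.pages", "cairo.sql.analytic.tree.max.pages", 4096));
    }
}
```

With this shape, a user who pins the deprecated key to Integer.MAX_VALUE gets pageSize * Integer.MAX_VALUE bytes rather than the 4 GiB default, which is the case the new test covers.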
Test-only changes; no production behavior touched.

- testDeprecatedMaxPagesDerivesMaxBytes: swap the window.tree slot from the main cairo.sql.window.tree.max.pages key to the older cairo.sql.analytic.tree.max.pages alias, so the alias derivation path is actually exercised. Main-key derivation remains covered by the cairo.sql.sort.key.max.pages slot in the same test.
- Move testNewMaxBytesWinsOverDeprecatedMaxPages out of the testDeprecated* cluster and into its alphabetical home between testMinimum2SharedWorkers and testNotValidAllowedVolumePaths0.
- Rename testConcurrentCursorsHaveIndependentCaps to testRepeatedCursorsStayUnderCap and rewrite its header comment. The original name and comment promised concurrency and per-cursor cap isolation; the body only runs two sequential queries.
- Drop the AndCleansUp suffix from testCacheCapFiresAndCleansUp, testRowIdCapFiresAndCleansUp, and testTreeKeyCapFiresAndCleansUp. Each test asserts only that the cap exception fires; cleanup is implicit via assertMemoryLeak.
- LimitedSizeLongTreeChainTest.before: replace hardcoded "cairo.sql.sort.*.max.bytes" strings with PropertyKey.*.getPropertyPath() so a PropertyKey rename will not silently desync the test from production messages.
Three review follow-ups, no behavior change:

- Drop the redundant " to increase" suffix from cap-overflow exception messages. "(raise <key>)" alone already conveys the intent. Affected throw sites: AbstractRedBlackTree, LongTreeChain, LimitedSizeLongTreeChain, MemoryCARWImpl, MemoryPages, plus the two EncodedSort* cursor classes (two sites each). The literal assertions in CachedWindowMemoryCapTest are updated in lockstep. Other tests (SecurityTest, WindowFunctionTest) assert on the operator-name prefix only and need no change.
- Rename *MaxPagesByBytes locals to *MaxPagesFromBytes across the three derivation sites (CachedWindowRecordCursorFactory, EncodedSortRecordCursor, SortKeyMaterializingRecordCursor). "by bytes" reads as a possessive construction; "from bytes" is the actual relationship (max pages derived from a byte budget).
- Disclose the per-buffer multiplier in SortKeyMaterializingRecordCursor. The cursor allocates one MemoryCARW per materialized column, each capped at maxBytes, so the total budget for the operator is bufferCount * maxBytes. This was true under the previous max.pages naming too, but the byte denomination invites the misreading that the cap is the aggregate. Adds a constructor comment plus a note on the cairo.sql.sort.key.max.bytes entry in server.conf and the AMI copy.
- PropServerConfiguration: downgrade deriveMaxBytesDefault (both overloads) and isPropertyExplicitlySet from protected to private. No subclass in OSS overrides them. Relocate each into the private instance-method cluster in alphabetical order: deriveMaxBytesDefault between configureSharedThreadPool and getCommitMode, isPropertyExplicitlySet between initIlpTransport and pathEquals.
- CachedWindowMemoryCapTest: add testCacheCapRaisedUnblocksQuery. Mirrors the dataset and query in testCacheCapFires (50_000 rows, same lag-over-partition expression) but raises cairo.sql.window.cache.max.bytes to 16 MiB. The query runs to completion; the test asserts the first 3 result rows. Closes the loop on the negative tests by demonstrating that raising the new cap unblocks a previously-failing workload.
- Move testRepeatedCursorsStayUnderCap to its correct alphabetical position between testHappyPathUnchanged and testRowIdCapFires (the earlier rename had left it out of order).
CachedWindowRecordCursorFactory previously took min(cache.max.bytes / page.size, window.store.max.pages) and always told the user to raise cache.max.bytes when growth failed. When store.max.pages was the smaller cap (e.g. on a deployment that explicitly pinned it low for the legacy per-partition window buffers), raising cache.max.bytes had no effect and the user was sent on a wild-goose chase. Move the precedence decision into PropServerConfiguration so the resolved page count and the property path to name in the error message are computed once at config load. The new bytes key wins when it is explicitly set; the legacy pages key (and its analytic.* alias) wins when only it is explicit; the new bytes default wins otherwise. The runtime then passes the resolved key string into RecordArray, so the "(raise X)" hint always names the constraint a user must actually change. Add unit tests in PropServerConfigurationTest covering the four precedence cases (default, both-explicit, legacy-only, legacy-alias) and an end-to-end test in CachedWindowMemoryCapTest that drives the factory to overflow and asserts the error names cairo.sql.window.store.max.pages.
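The three-way precedence described above can be illustrated with a small sketch. It is not the PropServerConfiguration implementation: the class, its fields, and the use of nullable boxed arguments to model "explicitly set" are assumptions, and the analytic.* alias is left out for brevity.

```java
final class WindowCacheCapSketch {
    static final String BYTES_KEY = "cairo.sql.window.cache.max.bytes";
    static final String PAGES_KEY = "cairo.sql.window.store.max.pages";

    final int maxPages;      // resolved page count handed to the runtime
    final String configKey;  // key named in the "(raise X)" hint

    private WindowCacheCapSketch(int maxPages, String configKey) {
        this.maxPages = maxPages;
        this.configKey = configKey;
    }

    // A null argument means "not explicitly configured" (hypothetical convention).
    static WindowCacheCapSketch resolve(Long explicitCacheMaxBytes, Integer explicitStoreMaxPages,
                                        long pageSize, long defaultMaxBytes) {
        if (explicitCacheMaxBytes != null) {
            // New bytes key wins whenever it is explicit.
            return new WindowCacheCapSketch(toPages(explicitCacheMaxBytes, pageSize), BYTES_KEY);
        }
        if (explicitStoreMaxPages != null) {
            // Only the legacy pages key is explicit: it is the binding constraint.
            return new WindowCacheCapSketch(explicitStoreMaxPages, PAGES_KEY);
        }
        // Nothing explicit: the bytes default applies.
        return new WindowCacheCapSketch(toPages(defaultMaxBytes, pageSize), BYTES_KEY);
    }

    private static int toPages(long bytes, long pageSize) {
        // Assumed clamping: at least one page, at most Integer.MAX_VALUE pages.
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, bytes / pageSize));
    }

    public static void main(String[] args) {
        WindowCacheCapSketch r = resolve(null, 4, 1L << 20, 4L << 30);
        System.out.println(r.maxPages + " pages, hint names " + r.configKey);
    }
}
```

Resolving the pair once at config load means the runtime never has to guess which of the two limits produced the page count it was given.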
SortKeyMaterializingRecordCursor was passing the full cairo.sql.sort.key.max.bytes budget to every per-column buffer, so the true operator ceiling was bufferCount * maxBytes. A user setting "max 4 GiB" on a query that materializes 10 sort columns silently got a 40 GiB ceiling, and the documented per-operator contract did not hold. The error message named cairo.sql.sort.key.max.bytes when one buffer overflowed, with no hint that the budget had been multiplied. Divide the budget evenly across buffers in the cursor ctor, floored at PAGE_SIZE so each buffer can still allocate its initial page even with tiny budgets or many columns. For realistic budgets the floor never binds (a 4 GiB budget over 100 columns is still 40 MiB per buffer, well above the 8 KiB floor); for pathological tiny configurations the floor is a safety net rather than a leak in the contract. Drop the "scales with the number of materialized columns" caveat from server.conf since the contract is now what the docs already promised. Add two tests in OrderBySortKeyMaterializationTest. With a 64 KiB budget over 5000 deterministic DOUBLE rows, the single-column query fits (the one buffer gets the full 64 KiB) while the two-column query overflows (each buffer gets 32 KiB and needs ~40 KiB). The pair proves the budget is now operator-wide instead of per-buffer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
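The arithmetic behind the two new tests can be checked with a short sketch of the even split. The class and method names are hypothetical, and the 8 KiB page size is taken from the floor quoted above; the real cursor works in pages rather than raw bytes.

```java
final class EvenSplitSketch {
    static final long PAGE_SIZE = 8 * 1024; // 8 KiB floor quoted in the commit message

    // Each buffer gets an equal share of the operator budget, but never
    // less than one page so it can allocate its initial page.
    static long perBufferBytes(long budgetBytes, int bufferCount) {
        return Math.max(PAGE_SIZE, budgetBytes / bufferCount);
    }

    // True when 'rows' values of 'bytesPerValue' fit into one buffer's share.
    static boolean fits(long budgetBytes, int bufferCount, long rows, int bytesPerValue) {
        return rows * bytesPerValue <= perBufferBytes(budgetBytes, bufferCount);
    }

    public static void main(String[] args) {
        long budget = 64 * 1024;
        // 5000 DOUBLE rows need 40,000 bytes per buffer.
        System.out.println("one column fits:  " + fits(budget, 1, 5000, 8));
        System.out.println("two columns fit:  " + fits(budget, 2, 5000, 8));
    }
}
```

The single-column case has 65,536 bytes available for 40,000 bytes of keys, while the two-column case leaves 32,768 bytes per buffer, which is why the pair of tests distinguishes an operator-wide budget from a per-buffer one.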
The six new cairo.sql.*.max.bytes keys (sort.key, sort.light.value, sort.value, window.cache, window.rowid, window.tree) previously defaulted to 4 GiB. That changed the user-visible behavior of unset configs from "unbounded" to "fails at 4 GiB", which is a silent regression for anyone who never set the deprecated *.max.pages keys. Make the keys uncapped by default and document them as opt-in; setting any of the six remains the simple way to bound the operator's native memory. deriveMaxBytesDefault now returns Long.MAX_VALUE when no deprecated key is set, and sqlWindowCacheMaxBytes defaults to Long.MAX_VALUE. DefaultCairoConfiguration follows suit so test-side defaults stay aligned with production defaults. EncodedSortLightRecordCursor sums two byte caps and then clamps to MAX_HEAP_SIZE_LIMIT. With both operands at Long.MAX_VALUE the addition overflows to a negative budget, and the subsequent min() returns that negative value, so maxEntries goes negative and the operator silently admits every row. Clamp each operand to MAX_HEAP_SIZE_LIMIT first so the saturating sum stays well-defined when either key is uncapped. Downstream operators already saturate Long.MAX_VALUE bytes correctly: RecordTreeChain.derivePageBudget clamps to Integer.MAX_VALUE pages, AbstractRedBlackTree and LongTreeChain clamp to MAX_*_HEAP_SIZE_LIMIT, EncodedSortRecordCursor clamps to MAX_HEAP_SIZE_LIMIT, and getSqlWindowCacheMaxPagesResolved clamps to Integer.MAX_VALUE. Update server.conf in both core and pkg/ami/marketplace/assets to show the keys as unset, and update PropServerConfigurationTest, ServerMainTest, and CachedWindowMemoryCapTest to assert the new defaults.
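The overflow described for EncodedSortLightRecordCursor is easy to reproduce in isolation. The sketch below is illustrative only: the class and method names are hypothetical, and the limit constant uses the (2^32 - 2) << 3 value that the review later in this thread quotes for this path.

```java
final class SaturatingBudgetSketch {
    // Roughly 32 GiB; value as quoted in the review, assumed here.
    static final long MAX_HEAP_SIZE_LIMIT = ((1L << 32) - 2) << 3;

    // Sum first, clamp second: with both caps at Long.MAX_VALUE the addition
    // wraps to a negative number and min() keeps the negative value.
    static long naiveBudget(long keyCap, long valueCap) {
        return Math.min(keyCap + valueCap, MAX_HEAP_SIZE_LIMIT);
    }

    // Clamp each operand first so the sum cannot wrap, then clamp the sum.
    static long clampedBudget(long keyCap, long valueCap) {
        long k = Math.min(keyCap, MAX_HEAP_SIZE_LIMIT);
        long v = Math.min(valueCap, MAX_HEAP_SIZE_LIMIT);
        return Math.min(k + v, MAX_HEAP_SIZE_LIMIT);
    }

    public static void main(String[] args) {
        System.out.println(naiveBudget(Long.MAX_VALUE, Long.MAX_VALUE));
        System.out.println(clampedBudget(Long.MAX_VALUE, Long.MAX_VALUE));
    }
}
```

Because each clamped operand is at most about 32 GiB, their sum stays far below Long.MAX_VALUE and the final min() always sees a non-negative value.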
SortKeyMaterializingRecordCursor previously split sort.key.max.bytes evenly across per-column buffers, so the wide buffer (e.g. DECIMAL256, 32B per row) capped many rows earlier than the narrow one (e.g. BYTE, 1B per row). The effective operator-wide row capacity dropped well below what the byte budget should have allowed. The constructor now derives each buffer's share from ColumnType.sizeOf(colType) so wider columns receive proportionally more pages. With BYTE+LONG at a 64 KiB budget, the weighted split hands the LONG buffer 7 pages (56 KiB / 7168 rows) where the even split gave it 4 pages (32 KiB / 4096 rows), letting the cursor absorb 5000 rows where the previous algorithm fired the cap. Negative byte caps that propagate from a legacy *.max.pages=-1 through deriveMaxBytesDefault now clamp to 0 so the PAGE_SIZE floor takes over instead of multiplying through to a negative limit. The PAGE_SIZE floor stays: each buffer must allocate its initial page, so for sub-page budgets the operator-wide ceiling still overshoots by at most bufferCount * PAGE_SIZE. ColumnType.sizeOf is positive for every type the upstream codegen filter forwards to the materialiser, so the weighted share is always meaningful in practice. Adds testMaterializationCapWeightedByColumnSize covering the BYTE+LONG mixed-width case at the 64 KiB budget where even split fires and weighted split succeeds.
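The BYTE+LONG numbers above can be reproduced with a small sketch of a width-weighted split. The names are hypothetical and the rounding order (share in bytes first, then whole pages) is an assumption chosen because it matches the 7-page figure quoted; the actual constructor may round differently.

```java
final class WeightedSplitSketch {
    static final long PAGE_SIZE = 8 * 1024; // assumed 8 KiB page

    // Returns the page allowance for each buffer, weighted by column width.
    // Widths are assumed positive, as the commit message states for
    // ColumnType.sizeOf on the types that reach the materialiser.
    static long[] pagesPerBuffer(long budgetBytes, int[] columnWidths) {
        long budget = Math.max(0L, budgetBytes); // negative caps clamp to 0
        long totalWidth = 0;
        for (int w : columnWidths) {
            totalWidth += w;
        }
        long[] pages = new long[columnWidths.length];
        for (int i = 0; i < columnWidths.length; i++) {
            // Divide before multiplying so a Long.MAX_VALUE budget cannot overflow.
            long shareBytes = budget / totalWidth * columnWidths[i];
            // Floor at one page so every buffer can allocate its initial page.
            pages[i] = Math.max(1L, shareBytes / PAGE_SIZE);
        }
        return pages;
    }

    public static void main(String[] args) {
        long[] pages = pagesPerBuffer(64 * 1024, new int[]{1, 8}); // BYTE, LONG
        System.out.println("BYTE pages: " + pages[0] + ", LONG pages: " + pages[1]);
    }
}
```

At 7 pages the LONG buffer holds 57,344 bytes, or 7168 eight-byte rows, while the BYTE buffer's single floored page holds 8192 rows, so both buffers can absorb the 5000-row test input.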
Six *.page.size keys (sort.key, sort.light.value, sort.value, window.store, window.rowid, window.tree) accepted user-supplied 0 verbatim. The zero then flowed into divisions added by this PR: - EncodedSortRecordCursor:82 (sort.value.page.size) - RecordTreeChain.derivePageBudget (sort.key/value page sizes) - PropServerConfiguration.sqlWindowCacheMaxPagesResolved (window.store) - DefaultCairoConfiguration.getSqlWindowCacheMaxPagesResolved (window.store) A misconfigured cairo.sql.sort.value.page.size=0 now throws ArithmeticException from inside the factory constructor instead of a useful ServerConfigurationException at startup. The three window-side keys already pass through Numbers.ceilPow2 but ceilPow2(0)=0, so the clamp is needed there too. PropServerConfiguration now applies Math.max(1, ...) at the assignment site so the resolved field is always at least one byte. Downstream code keeps its own Math.max(1L, ...) on the page count, so the operator-wide ceiling still saturates at Integer.MAX_VALUE pages when the cap is set to Long.MAX_VALUE. Adds testPageSizesClampedToOne pinning the clamp for all six keys plus the CachedWindow resolved page count under the worst-case pageSize=1.
Five small follow-ups from the review thread:

- Null-skip the (raise X) suffix in AbstractRedBlackTree.checkKeyCapacity, LongTreeChain.checkValueCapacity, and LimitedSizeLongTreeChain.checkValueCapacity the same way MemoryCARWImpl.extend0 and MemoryPages.allocate0 already do, so a future caller that passes a null config key gets a plain "memory exceeded" message instead of "(raise null)".
- Collapse the four duplicated LimitOverflowException construction chains in EncodedSortRecordCursor.buildAndSort and EncodedSortLightRecordCursor.buildAndSort into per-class throwLimitOverflow helpers.
- Document the dual role of cairo.sql.window.store.max.pages in server.conf so a deployment tuning that key learns it also acts as the legacy CachedWindow record-store cap when cairo.sql.window.cache.max.bytes is unset.
- Add testDeprecatedMaxPagesAndAliasBothExplicitMainWins to pin the documented "modern key wins over alias" precedence under conflicting explicit values on cairo.sql.window.tree.max.pages and its cairo.sql.analytic.tree.max.pages alias.
- SampleByFillTest.testSortedRecordCursorFactoryHandlesKeyHeapOverflow now also asserts the "(raise cairo.sql.sort.key.max.bytes)" hint so renaming the key would surface as a test failure rather than silently breaking the user-facing remediation guidance.
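The null-skip behaviour in the first follow-up can be sketched as a plain string builder. This is an illustration under assumptions: the helper name is hypothetical, and the exact spacing between the base message and the hint is a guess based on the quoted fragments "limit of N memory exceeded in RedBlackTree" and "(raise <key>)".

```java
final class OverflowMessageSketch {
    // Builds the cap-overflow message; the "(raise X)" hint is appended only
    // when the caller supplied a config key.
    static String message(long limitBytes, String operator, String configKey) {
        StringBuilder sb = new StringBuilder()
                .append("limit of ").append(limitBytes)
                .append(" memory exceeded in ").append(operator);
        if (configKey != null) {
            sb.append(" (raise ").append(configKey).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(message(65536, "RedBlackTree", "cairo.sql.sort.key.max.bytes"));
        System.out.println(message(65536, "LongTreeChain", null));
    }
}
```

Keeping the prefix unchanged is what lets tests that assert only on the operator-name portion of the message continue to pass.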
PropServerConfiguration emits an advisory log line when an explicitly configured *.max.bytes is below the matching *.page.size. The AbstractRedBlackTree/LongTreeChain/CachedWindow path silently floors the effective cap at one page, so the runtime LimitOverflowException reports the floor rather than the operator's setting. Surfacing the override at init lets the operator catch the silent raise before the cap fires. Six call sites covered: sort.key, sort.light.value, sort.value, window.cache, window.rowid, window.tree. The marketplace server.conf template gains the three new window byte keys (window.cache.max.bytes, window.rowid.max.bytes, window.tree.max.bytes) alongside the existing analytic.* documentation; the main core/.../server.conf already had them. Both server.conf files said the sort.key.max.bytes budget is split "evenly" across materialised column buffers, but SortKeyMaterializingRecordCursor's constructor uses ColumnType.sizeOf weighting. The docs now describe the actual proportional split. testMaxBytesBelowPageSizeAccepted pins the contract: configuration with max.bytes < page.size is accepted, the field stores the verbatim value, and the resolved CachedWindow page count floors at one.
Replace the clamp-to-1 on the sort/window page-size config keys with an explicit startup rejection. Clamping a misconfigured *.page.size to 1 left tree- and heap-backed operators tripping an assertion or writing past a sub-block native buffer at query time, instead of failing loudly at config load as the clamp comment claimed. validatePageSizeAtLeast now rejects a page size below the operator's block minimum via ServerConfigurationException.forInvalidKey: - sort.key >= 64: it also backs RecordTreeChain's MemoryPages, whose 41-byte node cannot straddle a non-contiguous page (ceilPow2 >= 41). - window.tree >= 24: AbstractRedBlackTree key node. - sort.light.value / window.rowid >= 12: value-chain entry. - window.store >= 64: sizes per-window-function buffers via store / RECORD_SIZE (widest 40) and the RecordArray index page (store >> 4). sort.value stays clamped at 1 - a divisor-only RecordChain page with no fixed block, where a 1-byte page is slow but safe. Add defensive asserts mirroring AbstractRedBlackTree's existing key-heap guard: LongTreeChain and LimitedSizeLongTreeChain assert the value page holds one chain entry, and RecordTreeChain asserts its ceilPow2 page holds one node. These catch any non-config caller under -ea. Preserve each deprecated *.max.pages key's historical reader so no previously-valid value fails to start: the sort.* keys keep getIntSize (size suffixes such as 2k) and the window.* keys keep getInt (underscore separators such as 1_000_000). WindowFunctionTest reset window.store.page.size to 0 in a finally block, which the new rejection catches on the test config rebuild; reset it to the 1m default instead. Add config tests for the rejection band, the exact-minimum boundary, and both legacy parsing formats.
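The startup rejection can be pictured as a lookup of per-key minimums. The sketch below is not the validatePageSizeAtLeast implementation: the class, the map, and the use of IllegalArgumentException in place of ServerConfigurationException are assumptions; the minimums themselves are the ones listed above.

```java
import java.util.Map;

final class PageSizeValidationSketch {
    // Block minimums from the commit message. sort.value is absent on purpose:
    // it stays clamped at 1 rather than rejected.
    static final Map<String, Integer> MIN_PAGE_SIZE = Map.of(
            "cairo.sql.sort.key.page.size", 64,
            "cairo.sql.window.tree.page.size", 24,
            "cairo.sql.sort.light.value.page.size", 12,
            "cairo.sql.window.rowid.page.size", 12,
            "cairo.sql.window.store.page.size", 64
    );

    static boolean isAccepted(String key, long pageSize) {
        Integer min = MIN_PAGE_SIZE.get(key);
        return min == null || pageSize >= min;
    }

    // Fails at config load instead of letting a sub-block page reach query time.
    static void validatePageSizeAtLeast(String key, long pageSize) {
        if (!isAccepted(key, pageSize)) {
            throw new IllegalArgumentException(
                    "invalid " + key + "=" + pageSize + ", minimum is " + MIN_PAGE_SIZE.get(key));
        }
    }

    public static void main(String[] args) {
        validatePageSizeAtLeast("cairo.sql.sort.key.page.size", 64); // exact minimum passes
        System.out.println(isAccepted("cairo.sql.window.tree.page.size", 0));
    }
}
```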
Address the medium-severity review findings on the byte-cap change. derivePageBudget() and EncodedSortRecordCursor divided the byte cap by the raw page size, but MemoryPages and MemoryCARWImpl round the page up to a power of two before allocating. A non-power-of-two sort page size therefore let the heap overshoot the configured *.max.bytes by up to ~2x. Both sites now divide by Numbers.ceilPow2(pageSize) so the cap holds. Power-of-two sizes (all shipped defaults) are unaffected. Add coverage for the new error paths and (raise X) hints: - LimitedSizeLongTreeChainTest pins the key-heap and value-heap overflow messages directly. - SampleByFillTest pins the sort.value.max.bytes hint via the value chain. - PropServerConfigurationTest restores the negative legacy *.max.pages derivation. - OrderByMemoryCapTest exercises the serial top-N path and the parallel AsyncTopK per-worker path end to end. Document the per-operator page-size minimums enforced at startup and note that cairo.sql.window.store.max.pages also governs the bare per-partition window-buffer overflow error, in both server.conf copies.
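The overshoot from dividing by the raw page size can be shown with a worked example. The helper below is a stand-alone sketch: ceilPow2 here is a local reimplementation (it returns 1 for inputs of 1 or less, unlike the ceilPow2(0)=0 behaviour noted earlier), and the other method names are hypothetical.

```java
final class PageBudgetSketch {
    // Smallest power of two >= v, for v > 1. Local helper, not Numbers.ceilPow2.
    static long ceilPow2(long v) {
        if (v <= 1) {
            return 1;
        }
        return 1L << (64 - Long.numberOfLeadingZeros(v - 1));
    }

    // Old behaviour: page budget derived from the configured (raw) page size.
    static long pagesFromRawSize(long capBytes, long pageSize) {
        return Math.max(1L, capBytes / pageSize);
    }

    // Fixed behaviour: page budget derived from the size actually allocated.
    static long pagesFromAllocatedSize(long capBytes, long pageSize) {
        return Math.max(1L, capBytes / ceilPow2(pageSize));
    }

    // Memory reachable when every page is rounded up to a power of two.
    static long reachableBytes(long pages, long pageSize) {
        return pages * ceilPow2(pageSize);
    }

    public static void main(String[] args) {
        long cap = 64 * 1024, pageSize = 5000; // non-power-of-two page size
        System.out.println(reachableBytes(pagesFromRawSize(cap, pageSize), pageSize));
        System.out.println(reachableBytes(pagesFromAllocatedSize(cap, pageSize), pageSize));
    }
}
```

With a 5000-byte page rounded up to 8192 bytes, the raw division allows 13 pages and 106,496 reachable bytes against a 65,536-byte cap, whereas dividing by the rounded size allows 8 pages and stays exactly at the cap.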
Confirmed fresh-28 independently: RecordArray line 59 allocates auxMem with page pageSize >> 4 (= 4 bytes at the new 64-byte store minimum), and beginRecord() line 64 does I now have everything needed. Here is the consolidated, source-verified review. Review: PR #7157 — feat(sql): add byte caps for CachedWindow and ORDER BY native memory (level 3) 29 raw findings → 26 verified, 3 dropped as false positives. Split: all confirmed code findings are in-diff; the only out-of-diff items are PR metadata/description. The cross-context sweep Critical None. No confirmed correctness, safety, concurrency, resource-leak, or UB defect. Results stay correct on every path; the overflow clamps prevent negative/overflowed budgets; per-worker Moderate M1 — On the default ORDER BY path, sort.key.max.bytes alone does not bound memory, and the error message misdirects (merges correctness-0, concurrency-1, resource-7, tests-10 — in-diff) EncodedSortLightRecordCursor.java:75-77 (the default ORDER BY path, selected at SqlCodeGenerator.java:7449 whenever the sort key is fixed-width-encodable): keyCap = min(getSqlSortKeyMaxBytes(), LIMIT); // LIMIT = (2^32-2)<<3 ≈ 32 GiB Both caps default to Long.MAX_VALUE, so leaving one unset saturates its operand to LIMIT; the sum is then always ≥ LIMIT and the configured cap is fully masked. A user who sets only Fix: either make each cap bind independently on this path (treat an unset cap as 0-contribution when the other is explicit), or document the combined-budget semantics in server.conf + M2 — The new (raise ...) 
remediation hint on the EncodedSort paths is asserted by no test (tests-9 — in-diff) throwLimitOverflow() in both EncodedSortLightRecordCursor.java:229-237 and EncodedSortRecordCursor.java:243-249 appends the new (raise cairo.sql.sort.key.max.bytes[ or ...]) hint, but the M3 — PR description misstates page-size handling (says "clamp to 1" where the code rejects at startup) (metadata-17 — PR body) The body bullet claims all six page-size keys "clamp to a minimum of 1 byte at read time, so a misconfigured *.page.size=0 fails over to a tiny page." In reality, only Minor
Downgraded (false positives)
Summary Verdict: approve with changes recommended. This is a clean, well-tested PR with no blocking issues — no critical findings, no surviving resource-leak or correctness defects, and a Address before merge: M1 (single-knob sort.key.max.bytes is a no-op on the default ORDER BY path + misleading error message — the most important, since it undercuts the feature's headline Counts: 29 draft findings → 26 verified, 3 false positives removed. In-diff/out-of-diff: all confirmed code findings in-diff; 1 finding (label) out-of-diff PR metadata, 1 (description) in |
[PR Coverage check]😍 pass : 224 / 235 (95.32%) file detail
Resolve the SampleByFillTest conflict by keeping master's byte-cap overflow semantics (#7157) expressed through the QueryAssertion builder rather than the old sqlSortKeyMaxPages mechanism. Master also added new tests that call the legacy assertQueryNoLeakCheck overloads this branch had already removed from AbstractCairoTest. Migrate those call sites to the builder so the merge compiles: the value-heap overflow test in SampleByFillTest and all of the new CachedWindowMemoryCapTest. Each migration preserves master's factory properties (expectedTimestamp, supportsRandomAccess, expectSize). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts from upstream PRs #7182 (LIMIT -N over a covering index), #7187 (fluent QueryAssertion builder) and #7157 (ORDER BY native-memory byte caps). CoveringIndexRecordCursorFactory: both this branch and #7182 independently reworked the multi-key merge to emit rows in row-id (ts-ascending) order. Take upstream's version as the base -- its min-scan merge plus the new single-key LIMIT -N backward scan -- then re-apply this branch's additive fixes on top: getScanDirection(), duplicate IN-key dedup, and the TablePageFrameCursor promotion (fixes the parallel-keyed-group-by-under-SelectedRecord ClassCastException). getScanDirection() reports OTHER only for multi-key latestBy; single-key latestBy stays FORWARD (it returns one row, trivially ts-ordered) so the designated-timestamp metadata upstream's tests expect is preserved. EncodedSortRecordCursor: keep upstream's extracted throwLimitOverflow(). Tests: take upstream's fluent-builder assertions for shared cases and keep this branch's new regression tests. #7187 removed the legacy positional assertQuery()/assertQueryNoLeakCheck() helpers, so migrate ~35 remaining call sites across 10 test files to the fluent builder. Verified: core test-compile clean; CoveringIndexTest (394), ExplainPlanTest (534), AsOfJoinTest (115), CompiledFilterTest (50), LimitTest (47) and the migrated group-by / distinct / IPv4 / array tests all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds byte caps for CachedWindow, the LongTreeChain key and value heaps (used by both window functions and ORDER BY), and the ORDER BY sort heaps. The caps are unset (uncapped) by default; setting any of them bounds the operator and raises LimitOverflowException with a (raise X) hint naming the key that needs to change.

Marks cairo.sql.window.tree.max.pages, cairo.sql.window.rowid.max.pages, cairo.sql.sort.key.max.pages, cairo.sql.sort.light.value.max.pages, and cairo.sql.sort.value.max.pages deprecated. They continue to be parsed: if a user has one set, the derived byte default for the corresponding new key becomes pageSize * maxPages. An explicit new *.max.bytes value wins when both are set.

For cairo.sql.window.cache.max.bytes, precedence is resolved against the legacy cairo.sql.window.store.max.pages (and its cairo.sql.analytic.store.max.pages alias), which still drive the per-partition MemoryCARW window function buffers. The new bytes key wins when set explicitly; the legacy pages key wins when only it is explicit; otherwise the new bytes default wins. The resolved key path is plumbed into the runtime so the error message names the binding constraint instead of pointing the user at the wrong knob. One nuance worth flagging: when only the legacy cairo.sql.analytic.store.max.pages alias is explicit, the runtime hint names cairo.sql.window.store.max.pages (the modern key that supersedes it) rather than the alias the user has in their config; raising the modern key still takes effect via the same resolution path.

When sort-key materialization is engaged, cairo.sql.sort.key.max.bytes is split across the materialized column buffers in proportion to each column's fixed-size width (via ColumnType.sizeOf). A wider column (e.g. DECIMAL256, 32 B per row) receives a proportionally larger share than a narrow one (e.g. BOOLEAN, 1 B per row), so each buffer's row capacity stays roughly balanced and the total reachable memory tracks the operator-wide budget rather than the wide column's footprint times the column count. Each buffer is still floored at one page so the cursor can initialise; the only overshoot is for sub-page budgets, where the floor exceeds the requested cap by at most bufferCount * PAGE_SIZE.

Page-size config keys (cairo.sql.sort.{key,light.value,value}.page.size and the three cairo.sql.window.*.page.size keys) clamp to a minimum of 1 byte at read time, so a misconfigured *.page.size=0 fails over to a tiny page rather than propagating a 0 into downstream divisions and surfacing as an ArithmeticException from the factory constructor.

Tradeoff: with the caps unset by default, this PR provides opt-in protection rather than preventing OOM out of the box. Operators that previously survived a lag(...) over a multi-billion-row table on unbounded growth keep that behavior; deployments that want a guard must set one of the new keys. The previous iteration of this PR defaulted each cap to 4 GiB, which would have been a breaking change for any operator already exceeding that ceiling in production; this version is not.

Without any cap set, the only ceilings remain the internal compressed-offset limits in AbstractRedBlackTree (~32 GiB key heap) and LongTreeChain (~16 GiB value heap), plus the JVM's native memory budget.

Test plan
- CachedWindowMemoryCapTest exercises each of the cache, rowid, and tree caps firing, the happy path with uncapped defaults, the same workload succeeding after raising the cap, and per-cursor reset across repeated executions.
- OrderBySortKeyMaterializationTest covers the operator-wide ceiling for single-column and equal-width multi-column sort-key materialization, plus the size-weighted split via a mixed BYTE + LONG schema where the even split would fire and the weighted split succeeds.
- PropServerConfigurationTest covers deprecation precedence (including the case where both the modern deprecated key and its analytic alias are set with conflicting values), the resolved CachedWindow cap key, explicit-Integer.MAX_VALUE-on-the-deprecated-key behavior, and the page-size clamp under misconfigured *.page.size=0.
- ServerMainTest "show parameters" snapshot updated to include the six new keys at Long.MAX_VALUE.
- SampleByFillTest.testSortedRecordCursorFactoryHandlesKeyHeapOverflow switches from the max.pages=-1 cleanup probe to a 64-byte sort.key.max.bytes budget that triggers the same LimitOverflowException cleanup path, and asserts the (raise cairo.sql.sort.key.max.bytes) hint so a rename of the property key would fail the test instead of silently breaking the user-facing remediation guidance.