Skip to content

feat(sql): add byte caps for CachedWindow and ORDER BY native memory#7157

Merged
bluestreak01 merged 24 commits into
masterfrom
puzpuzpuz_proper_window_function_mem_cap
Jun 1, 2026
Merged

feat(sql): add byte caps for CachedWindow and ORDER BY native memory#7157
bluestreak01 merged 24 commits into
masterfrom
puzpuzpuz_proper_window_function_mem_cap

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

@puzpuzpuz puzpuzpuz commented May 26, 2026

Summary

  • Adds six byte-denominated per-operator caps for native memory growth in the CachedWindow record store, the LongTreeChain key and value heaps (used by both window functions and ORDER BY), and the ORDER BY sort heaps. The caps are unset (uncapped) by default; setting any of them bounds the operator and raises LimitOverflowException with a (raise X) hint naming the key that needs to change:
cairo.sql.window.cache.max.bytes
cairo.sql.window.rowid.max.bytes
cairo.sql.window.tree.max.bytes
cairo.sql.sort.key.max.bytes
cairo.sql.sort.light.value.max.bytes
cairo.sql.sort.value.max.bytes
  • Marks cairo.sql.window.tree.max.pages, cairo.sql.window.rowid.max.pages, cairo.sql.sort.key.max.pages, cairo.sql.sort.light.value.max.pages, and cairo.sql.sort.value.max.pages deprecated. They continue to be parsed: if a user has one set, the derived byte default for the corresponding new key becomes pageSize * maxPages. An explicit new *.max.bytes value wins when both are set.

  • For cairo.sql.window.cache.max.bytes, precedence is resolved against the legacy cairo.sql.window.store.max.pages (and its cairo.sql.analytic.store.max.pages alias), which still drive the per-partition MemoryCARW window function buffers. The new bytes key wins when set explicitly; the legacy pages key wins when only it is explicit; otherwise the new bytes default wins. The resolved key path is plumbed into the runtime so the error message names the binding constraint instead of sending the user down the wrong knob. One nuance worth flagging: when only the legacy cairo.sql.analytic.store.max.pages alias is explicit, the runtime hint names cairo.sql.window.store.max.pages (the modern key that supersedes it) rather than the alias the user has in their config — raising the modern key still takes effect via the same resolution path.

  • When sort-key materialization is engaged, cairo.sql.sort.key.max.bytes is split across the materialized column buffers in proportion to each column's fixed-size width (via ColumnType.sizeOf). A wider column (e.g. DECIMAL256, 32 B per row) receives a proportionally larger share than a narrow one (e.g. BOOLEAN, 1 B per row), so each buffer's row capacity stays roughly balanced and the total reachable memory tracks the operator-wide budget rather than the wide column's footprint times the column count. Each buffer is still floored at one page so the cursor can initialise; the only overshoot is for sub-page budgets, where the floor exceeds the requested cap by at most bufferCount * PAGE_SIZE.

  • Page-size config keys (cairo.sql.sort.{key,light.value,value}.page.size and the three cairo.sql.window.*.page.size keys) clamp to a minimum of 1 byte at read time, so a misconfigured *.page.size=0 fails over to a tiny page rather than propagating a 0 into downstream divisions and surfacing as an ArithmeticException from the factory constructor.

  • Tradeoff: with the caps unset by default, this PR provides opt-in protection rather than preventing OOM out of the box. Operators that previously survived a lag(...) over a multi-billion-row table on unbounded growth keep that behavior; deployments that want a guard must set one of the new keys. The previous iteration of this PR defaulted each cap to 4 GiB, which would have been a breaking change for any operator already exceeding that ceiling in production; this version is not.

  • Without any cap set, the only ceilings remain the internal compressed-offset limits in AbstractRedBlackTree (~32 GiB key heap) and LongTreeChain (~16 GiB value heap), plus the JVM's native memory budget.

Test plan

  • CachedWindowMemoryCapTest exercises each of the cache, rowid, and tree caps firing, the happy path with uncapped defaults, the same workload succeeding after raising the cap, and per-cursor reset across repeated executions.
  • OrderBySortKeyMaterializationTest covers the operator-wide ceiling for single-column and equal-width multi-column sort-key materialization, plus the size-weighted split via a mixed BYTE + LONG schema where the even split would fire and the weighted split succeeds.
  • PropServerConfigurationTest covers deprecation precedence (including the case where both the modern deprecated key and its analytic alias are set with conflicting values), the resolved CachedWindow cap key, explicit-Integer.MAX_VALUE-on-the-deprecated-key behavior, and the page-size clamp under misconfigured *.page.size=0.
  • ServerMainTest "show parameters" snapshot updated to include the six new keys at Long.MAX_VALUE.
  • SampleByFillTest.testSortedRecordCursorFactoryHandlesKeyHeapOverflow switches from the max.pages=-1 cleanup probe to a 64-byte sort.key.max.bytes budget that triggers the same LimitOverflowException cleanup path, and asserts the (raise cairo.sql.sort.key.max.bytes) hint so a rename to the property key would fail the test instead of silently breaking the user-facing remediation guidance.

…mory to prevent OOM

A single ad-hoc lag(...) OVER (PARTITION BY ...) query could grow
NATIVE_TREE_CHAIN allocations to roughly 48 GiB before any cap fired;
two concurrent queries on a 100 GiB instance hit the host RSS limit
and made the database unresponsive. The defaults were effectively
unbounded because the existing *.max.pages knobs multiplied a small
page size by Integer.MAX_VALUE, with the only real ceiling being the
internal compressed-offset limits in AbstractRedBlackTree (~32 GiB
key heap) and LongTreeChain (~16 GiB value heap).

This change introduces byte-denominated per-operator caps on the
LongTreeChain key + value heaps, the CachedWindow RecordArray, and
the ORDER BY tree chains, each defaulting to 4 GiB:

  cairo.sql.window.tree.max.bytes
  cairo.sql.window.rowid.max.bytes
  cairo.sql.window.cache.max.bytes
  cairo.sql.sort.key.max.bytes
  cairo.sql.sort.light.value.max.bytes
  cairo.sql.sort.value.max.bytes

The legacy *.max.pages keys (tree, rowid, sort.key, sort.light.value,
sort.value) are marked deprecated and parsed for backward compat: if
the user has one set, pageSize * maxPages becomes the derived byte
default for the corresponding new key. The new *.max.bytes key wins
when both are set.

Breaking change: a long-running ORDER BY or windowed lookback over a
multi-billion-row table that previously consumed more than 4 GiB of
native memory in one of these operators will now throw
LimitOverflowException ("memory exceeded in RedBlackTree" / "in
LongTreeChain") instead of growing unbounded. Operators can raise
the byte cap explicitly via the new config keys.

cairo.sql.window.store.max.pages is intentionally untouched: it still
drives per-partition MemoryCARW buffers in roughly 30 window function
factories with their own page-based cap mechanism, and is out of
scope for this change.
@puzpuzpuz puzpuzpuz self-assigned this May 26, 2026
@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution DO NOT MERGE These changes should not be merged to main branch labels May 26, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c3eab13-dabf-4978-a7f4-776d9e9ca121

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch puzpuzpuz_proper_window_function_mem_cap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

puzpuzpuz added 5 commits May 26, 2026 13:43
The cap-exceeded message thrown by AbstractRedBlackTree, LongTreeChain,
and LimitedSizeLongTreeChain previously said only "limit of N memory
exceeded in RedBlackTree" / "in LongTreeChain". The same primitive
backs different operators -- a RedBlackTree may be a window function's
tree-key heap (cairo.sql.window.tree.max.bytes) or a sort's tree-key
heap (cairo.sql.sort.key.max.bytes) -- so an operator hitting the cap
could not tell which knob to raise.

Each ctor now also takes the property path of the config key that
owns its key heap (and value heap, for the chain variants). The
overflow message appends "(raise <config_key> to increase)" so the
user-facing error names the exact property to tune.

Affected callers updated:
  CachedWindowRecordCursorFactory     window.tree + window.rowid
  SortedLightRecordCursorFactory      sort.key + sort.light.value
  LimitedSizeSortedLightRecordCursorFactory  same
  AsyncTopKAtom                       same

RecordTreeChain (fat ORDER BY in SortedRecordCursorFactory) is not
covered here: its overflows surface from the inner MemoryPages /
RecordChain primitives, which carry their own messages and would
need a separate wrapping pass. Tracked as a follow-up.
Add registerDeprecated() calls for the five *.max.pages keys that PR
7157 marks as deprecated (sort.key, sort.light.value, sort.value,
window.rowid, window.tree). Without this the validator was silently
accepting the old keys, contradicting the deprecation comments in
PropertyKey. Repoint the two pre-existing analytic.* aliases at the
new *.max.bytes targets, since the previously-pointed-at *.max.pages
keys are now themselves deprecated.

Update ServerMainTest.testShowParameters to include the six new
*.max.bytes keys with their 4 GiB default value. The test asserts a
strict bidirectional set equality between actual and expected
property dumps, so the missing entries would have caused
"Extra properties" failures.
The first round of byte-cap work in this branch threaded the config
key path through AbstractRedBlackTree / LongTreeChain /
LimitedSizeLongTreeChain, so cap-overflow exceptions for ORDER BY
trees and window tree/rowid chains end with
"(raise <key> to increase)". The CachedWindow RecordArray path,
SortedRecordCursorFactory's RecordTreeChain, EncodedSortRecordCursor
and EncodedSortLightRecordCursor, and SortKeyMaterializingRecordCursor
still threw generic messages that did not name the new config keys
the user is supposed to raise.

Add an optional maxPagesConfigKey to MemoryCARWImpl and MemoryPages
via a new constructor overload; existing constructors delegate with
null and keep their old message verbatim. Add an Vm.getCARWInstance
overload that takes the key. RecordChain, RecordArray, and
RecordTreeChain gain config-key parameters that they forward to the
underlying MemoryCARW / MemoryPages. The relevant SQL cursor and
factory callsites pass the matching new *.max.bytes key:

  - CachedWindowRecordCursorFactory -> cairo.sql.window.cache.max.bytes
  - SortedRecordCursorFactory       -> cairo.sql.sort.key.max.bytes
                                     + cairo.sql.sort.value.max.bytes
  - EncodedSortRecordCursor's value RecordChain
                                    -> cairo.sql.sort.value.max.bytes
  - EncodedSortRecordCursor entryMem inline throw
                                    -> cairo.sql.sort.key.max.bytes
  - EncodedSortLightRecordCursor entryMem inline throw
                                    -> cairo.sql.sort.key.max.bytes
                                     + cairo.sql.sort.light.value.max.bytes
  - SortKeyMaterializingRecordCursor per-buffer MemoryCARW
                                    -> cairo.sql.sort.key.max.bytes

CachedWindowMemoryCapTest.testCacheCapFiresAndCleansUp now asserts on
the config-key-bearing message. Existing WindowFunctionTest and
SecurityTest assertions are unaffected: their substrings still appear
verbatim in the enriched messages, and the per-partition window
buffers (governed by cairo.sql.window.store.max.pages, intentionally
out of scope for this PR) keep the legacy generic message because
their callers do not pass a config key.
deriveMaxBytesDefault previously used Integer.MAX_VALUE as the
"unset" sentinel and returned the new 4 GiB default whenever the
deprecated *.max.pages key read back as Integer.MAX_VALUE. The
problem: the previously-documented default for those keys was
2^31 (= Integer.MAX_VALUE), and the published server.conf
template lists exactly that value. A user who copied the
documented value verbatim, or who deliberately pinned the
deprecated key to 2^31 to keep the historical effectively-
unbounded behavior, would silently end up with a 4 GiB cap.

Switch to explicit-presence detection: read both deprecated
keys (so they stay registered in the property tracker and
remain visible in (show parameters)), then check whether each
key was set in properties or the environment. If the main key
is explicitly set, its value wins; otherwise if the alias is
explicitly set, its value wins; otherwise fall through to the
new 4 GiB default. The main-over-alias ordering matches the
previous implementation.

testDeprecatedMaxPagesHonorsExplicitIntegerMaxValue covers the
2^31 case directly. testDeprecatedMaxPagesDerivesMaxBytes,
testNewMaxBytesWinsOverDeprecatedMaxPages, and ServerMainTest
testShowParameters continue to pass.
@puzpuzpuz puzpuzpuz added Compatibility Compatibility with third-party tools and services Bug Incorrect or unexpected behavior labels May 26, 2026
puzpuzpuz and others added 6 commits May 26, 2026 15:59
Test-only changes; no production behavior touched.

- testDeprecatedMaxPagesDerivesMaxBytes: swap the window.tree slot
  from the main cairo.sql.window.tree.max.pages key to the older
  cairo.sql.analytic.tree.max.pages alias, so the alias derivation
  path is actually exercised. Main-key derivation remains covered
  by the cairo.sql.sort.key.max.pages slot in the same test.
- Move testNewMaxBytesWinsOverDeprecatedMaxPages out of the
  testDeprecated* cluster and into its alphabetical home between
  testMinimum2SharedWorkers and testNotValidAllowedVolumePaths0.
- Rename testConcurrentCursorsHaveIndependentCaps to
  testRepeatedCursorsStayUnderCap and rewrite its header comment.
  The original name and comment promised concurrency and per-cursor
  cap isolation; the body only runs two sequential queries.
- Drop the AndCleansUp suffix from testCacheCapFiresAndCleansUp,
  testRowIdCapFiresAndCleansUp, and testTreeKeyCapFiresAndCleansUp.
  Each test asserts only that the cap exception fires; cleanup is
  implicit via assertMemoryLeak.
- LimitedSizeLongTreeChainTest.before: replace hardcoded
  "cairo.sql.sort.*.max.bytes" strings with
  PropertyKey.*.getPropertyPath() so a PropertyKey rename will not
  silently desync the test from production messages.
Three review follow-ups, no behavior change:

- Drop the redundant " to increase" suffix from cap-overflow
  exception messages. "(raise <key>)" alone already conveys the
  intent. Affected throw sites: AbstractRedBlackTree, LongTreeChain,
  LimitedSizeLongTreeChain, MemoryCARWImpl, MemoryPages, plus the
  two EncodedSort* cursor classes (two sites each). The literal
  assertions in CachedWindowMemoryCapTest are updated in lockstep.
  Other tests (SecurityTest, WindowFunctionTest) assert on the
  operator-name prefix only and need no change.
- Rename *MaxPagesByBytes locals to *MaxPagesFromBytes across the
  three derivation sites (CachedWindowRecordCursorFactory,
  EncodedSortRecordCursor, SortKeyMaterializingRecordCursor).
  "by bytes" reads as a possessive construction; "from bytes" is
  the actual relationship (max pages derived from a byte budget).
- Disclose the per-buffer multiplier in
  SortKeyMaterializingRecordCursor. The cursor allocates one
  MemoryCARW per materialized column, each capped at maxBytes, so
  the total budget for the operator is bufferCount * maxBytes.
  This was true under the previous max.pages naming too, but the
  byte denomination invites the misreading that the cap is the
  aggregate. Adds a constructor comment plus a note on the
  cairo.sql.sort.key.max.bytes entry in server.conf and the AMI
  copy.
- PropServerConfiguration: downgrade deriveMaxBytesDefault (both
  overloads) and isPropertyExplicitlySet from protected to private.
  No subclass in OSS overrides them. Relocate each into the private
  instance-method cluster in alphabetical order: deriveMaxBytesDefault
  between configureSharedThreadPool and getCommitMode,
  isPropertyExplicitlySet between initIlpTransport and pathEquals.
- CachedWindowMemoryCapTest: add testCacheCapRaisedUnblocksQuery.
  Mirrors the dataset and query in testCacheCapFires (50_000 rows,
  same lag-over-partition expression) but raises
  cairo.sql.window.cache.max.bytes to 16 MiB. The query runs to
  completion; the test asserts the first 3 result rows. Closes the
  loop on the negative tests by demonstrating that raising the new
  cap unblocks a previously-failing workload.
- Move testRepeatedCursorsStayUnderCap to its correct alphabetical
  position between testHappyPathUnchanged and testRowIdCapFires
  (the earlier rename had left it out of order).
CachedWindowRecordCursorFactory previously took min(cache.max.bytes /
page.size, window.store.max.pages) and always told the user to raise
cache.max.bytes when growth failed. When store.max.pages was the
smaller cap (e.g. on a deployment that explicitly pinned it low for
the legacy per-partition window buffers), raising cache.max.bytes had
no effect and the user was sent on a wild-goose chase.

Move the precedence decision into PropServerConfiguration so the
resolved page count and the property path to name in the error message
are computed once at config load. The new bytes key wins when it is
explicitly set; the legacy pages key (and its analytic.* alias) wins
when only it is explicit; the new bytes default wins otherwise. The
runtime then passes the resolved key string into RecordArray, so the
"(raise X)" hint always names the constraint a user must actually
change.

Add unit tests in PropServerConfigurationTest covering the four
precedence cases (default, both-explicit, legacy-only, legacy-alias)
and an end-to-end test in CachedWindowMemoryCapTest that drives the
factory to overflow and asserts the error names
cairo.sql.window.store.max.pages.
SortKeyMaterializingRecordCursor was passing the full
cairo.sql.sort.key.max.bytes budget to every per-column buffer, so the
true operator ceiling was bufferCount * maxBytes. A user setting "max
4 GiB" on a query that materializes 10 sort columns silently got a 40
GiB ceiling, and the documented per-operator contract did not hold.
The error message named cairo.sql.sort.key.max.bytes when one buffer
overflowed, with no hint that the budget had been multiplied.

Divide the budget evenly across buffers in the cursor ctor, floored at
PAGE_SIZE so each buffer can still allocate its initial page even with
tiny budgets or many columns. For realistic budgets the floor never
binds (a 4 GiB budget over 100 columns is still 40 MiB per buffer, well
above the 8 KiB floor); for pathological tiny configurations the floor
is a safety net rather than a leak in the contract.

Drop the "scales with the number of materialized columns" caveat from
server.conf since the contract is now what the docs already promised.

Add two tests in OrderBySortKeyMaterializationTest. With a 64 KiB
budget over 5000 deterministic DOUBLE rows, the single-column query
fits (the one buffer gets the full 64 KiB) while the two-column query
overflows (each buffer gets 32 KiB and needs ~40 KiB). The pair proves
the budget is now operator-wide instead of per-buffer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@puzpuzpuz puzpuzpuz removed the DO NOT MERGE These changes should not be merged to main branch label May 26, 2026
@puzpuzpuz puzpuzpuz added the DO NOT MERGE These changes should not be merged to main branch label May 26, 2026
The six new cairo.sql.*.max.bytes keys (sort.key, sort.light.value,
sort.value, window.cache, window.rowid, window.tree) previously defaulted
to 4 GiB. That changed the user-visible behavior of unset configs from
"unbounded" to "fails at 4 GiB", which is a silent regression for anyone
who never set the deprecated *.max.pages keys. Make the keys uncapped by
default and document them as opt-in; setting any of the six remains the
simple way to bound the operator's native memory.

deriveMaxBytesDefault now returns Long.MAX_VALUE when no deprecated key
is set, and sqlWindowCacheMaxBytes defaults to Long.MAX_VALUE.
DefaultCairoConfiguration follows suit so test-side defaults stay
aligned with production defaults.

EncodedSortLightRecordCursor sums two byte caps and then clamps to
MAX_HEAP_SIZE_LIMIT. With both operands at Long.MAX_VALUE the addition
overflows to a negative budget, and the subsequent min() returns that
negative value, so maxEntries goes negative and the operator silently
admits every row. Clamp each operand to MAX_HEAP_SIZE_LIMIT first so the
saturating sum stays well-defined when either key is uncapped.

Downstream operators already saturate Long.MAX_VALUE bytes correctly:
RecordTreeChain.derivePageBudget clamps to Integer.MAX_VALUE pages,
AbstractRedBlackTree and LongTreeChain clamp to MAX_*_HEAP_SIZE_LIMIT,
EncodedSortRecordCursor clamps to MAX_HEAP_SIZE_LIMIT, and
getSqlWindowCacheMaxPagesResolved clamps to Integer.MAX_VALUE.

Update server.conf in both core and pkg/ami/marketplace/assets to show
the keys as unset, and update PropServerConfigurationTest, ServerMainTest,
and CachedWindowMemoryCapTest to assert the new defaults.
@puzpuzpuz puzpuzpuz changed the title fix(sql): breaking change 💥 - cap CachedWindow and ORDER BY native memory to prevent OOM feat(sql): add byte caps for CachedWindow and ORDER BY native memory May 26, 2026
@puzpuzpuz puzpuzpuz added Enhancement Enhance existing functionality New feature Feature requests and removed Bug Incorrect or unexpected behavior Compatibility Compatibility with third-party tools and services DO NOT MERGE These changes should not be merged to main branch labels May 26, 2026
puzpuzpuz and others added 5 commits May 28, 2026 19:44
SortKeyMaterializingRecordCursor previously split sort.key.max.bytes
evenly across per-column buffers, so the wide buffer (e.g. DECIMAL256,
32B per row) capped many rows earlier than the narrow one (e.g. BYTE,
1B per row). The effective operator-wide row capacity dropped well
below what the byte budget should have allowed.

The constructor now derives each buffer's share from
ColumnType.sizeOf(colType) so wider columns receive proportionally
more pages. With BYTE+LONG at a 64 KiB budget, the weighted split
hands the LONG buffer 7 pages (56 KiB / 7168 rows) where the even
split gave it 4 pages (32 KiB / 4096 rows), letting the cursor absorb
5000 rows where the previous algorithm fired the cap.

Negative byte caps that propagate from a legacy *.max.pages=-1 through
deriveMaxBytesDefault now clamp to 0 so the PAGE_SIZE floor takes
over instead of multiplying through to a negative limit.

The PAGE_SIZE floor stays: each buffer must allocate its initial page,
so for sub-page budgets the operator-wide ceiling still overshoots by
at most bufferCount * PAGE_SIZE. ColumnType.sizeOf is positive for
every type the upstream codegen filter forwards to the materialiser,
so the weighted share is always meaningful in practice.

Adds testMaterializationCapWeightedByColumnSize covering the
BYTE+LONG mixed-width case at the 64 KiB budget where even split
fires and weighted split succeeds.
Six *.page.size keys (sort.key, sort.light.value, sort.value,
window.store, window.rowid, window.tree) accepted user-supplied 0
verbatim. The zero then flowed into divisions added by this PR:

- EncodedSortRecordCursor:82 (sort.value.page.size)
- RecordTreeChain.derivePageBudget (sort.key/value page sizes)
- PropServerConfiguration.sqlWindowCacheMaxPagesResolved (window.store)
- DefaultCairoConfiguration.getSqlWindowCacheMaxPagesResolved (window.store)

A misconfigured cairo.sql.sort.value.page.size=0 now throws
ArithmeticException from inside the factory constructor instead of a
useful ServerConfigurationException at startup. The three window-side
keys already pass through Numbers.ceilPow2 but ceilPow2(0)=0, so the
clamp is needed there too.

PropServerConfiguration now applies Math.max(1, ...) at the assignment
site so the resolved field is always at least one byte. Downstream code
keeps its own Math.max(1L, ...) on the page count, so the operator-wide
ceiling still saturates at Integer.MAX_VALUE pages when the cap is set
to Long.MAX_VALUE.

Adds testPageSizesClampedToOne pinning the clamp for all six keys plus
the CachedWindow resolved page count under the worst-case pageSize=1.
Five small follow-ups from the review thread:

- Null-skip the (raise X) suffix in AbstractRedBlackTree.checkKeyCapacity,
  LongTreeChain.checkValueCapacity, and LimitedSizeLongTreeChain.checkValueCapacity
  the same way MemoryCARWImpl.extend0 and MemoryPages.allocate0 already
  do, so a future caller that passes a null config key gets a plain
  "memory exceeded" message instead of "(raise null)".

- Collapse the four duplicated LimitOverflowException construction
  chains in EncodedSortRecordCursor.buildAndSort and
  EncodedSortLightRecordCursor.buildAndSort into per-class
  throwLimitOverflow helpers.

- Document the dual role of cairo.sql.window.store.max.pages in
  server.conf so a deployment tuning that key learns it also acts as
  the legacy CachedWindow record-store cap when
  cairo.sql.window.cache.max.bytes is unset.

- Add testDeprecatedMaxPagesAndAliasBothExplicitMainWins to pin the
  documented "modern key wins over alias" precedence under conflicting
  explicit values on cairo.sql.window.tree.max.pages and its
  cairo.sql.analytic.tree.max.pages alias.

- SampleByFillTest.testSortedRecordCursorFactoryHandlesKeyHeapOverflow
  now also asserts the "(raise cairo.sql.sort.key.max.bytes)" hint so
  renaming the key would surface as a test failure rather than silently
  breaking the user-facing remediation guidance.
PropServerConfiguration emits an advisory log line when an explicitly
configured *.max.bytes is below the matching *.page.size. The
AbstractRedBlackTree/LongTreeChain/CachedWindow path silently floors
the effective cap at one page, so the runtime LimitOverflowException
reports the floor rather than the operator's setting. Surfacing the
override at init lets the operator catch the silent raise before the
cap fires. Six call sites covered: sort.key, sort.light.value,
sort.value, window.cache, window.rowid, window.tree.

The marketplace server.conf template gains the three new window byte
keys (window.cache.max.bytes, window.rowid.max.bytes,
window.tree.max.bytes) alongside the existing analytic.* documentation;
the main core/.../server.conf already had them.

Both server.conf files said the sort.key.max.bytes budget is split
"evenly" across materialised column buffers, but
SortKeyMaterializingRecordCursor's constructor uses ColumnType.sizeOf
weighting. The docs now describe the actual proportional split.

testMaxBytesBelowPageSizeAccepted pins the contract: configuration with
max.bytes < page.size is accepted, the field stores the verbatim value,
and the resolved CachedWindow page count floors at one.
@puzpuzpuz puzpuzpuz changed the title feat(sql): add byte caps for CachedWindow and ORDER BY native memory fix(sql): fix wrong results and memory leaks in posting-index, join and group-by queries May 29, 2026
@puzpuzpuz puzpuzpuz added Performance Performance improvements Core Related to storage, data type, etc. labels May 29, 2026
@puzpuzpuz puzpuzpuz changed the title fix(sql): fix wrong results and memory leaks in posting-index, join and group-by queries feat(sql): add byte caps for CachedWindow and ORDER BY native memory May 29, 2026
puzpuzpuz added 2 commits May 29, 2026 12:26
Replace the clamp-to-1 on the sort/window page-size config keys with an
explicit startup rejection. Clamping a misconfigured *.page.size to 1
left tree- and heap-backed operators tripping an assertion or writing
past a sub-block native buffer at query time, instead of failing loudly
at config load as the clamp comment claimed.

validatePageSizeAtLeast now rejects a page size below the operator's
block minimum via ServerConfigurationException.forInvalidKey:
- sort.key >= 64: it also backs RecordTreeChain's MemoryPages, whose
  41-byte node cannot straddle a non-contiguous page (ceilPow2 >= 41).
- window.tree >= 24: AbstractRedBlackTree key node.
- sort.light.value / window.rowid >= 12: value-chain entry.
- window.store >= 64: sizes per-window-function buffers via
  store / RECORD_SIZE (widest 40) and the RecordArray index page
  (store >> 4).
sort.value stays clamped at 1 - a divisor-only RecordChain page with no
fixed block, where a 1-byte page is slow but safe.

Add defensive asserts mirroring AbstractRedBlackTree's existing key-heap
guard: LongTreeChain and LimitedSizeLongTreeChain assert the value page
holds one chain entry, and RecordTreeChain asserts its ceilPow2 page
holds one node. These catch any non-config caller under -ea.

Preserve each deprecated *.max.pages key's historical reader so no
previously-valid value fails to start: the sort.* keys keep getIntSize
(size suffixes such as 2k) and the window.* keys keep getInt (underscore
separators such as 1_000_000).

WindowFunctionTest reset window.store.page.size to 0 in a finally block,
which the new rejection catches on the test config rebuild; reset it to
the 1m default instead. Add config tests for the rejection band, the
exact-minimum boundary, and both legacy parsing formats.
Address the medium-severity review findings on the byte-cap change.

derivePageBudget() and EncodedSortRecordCursor divided the byte cap by
the raw page size, but MemoryPages and MemoryCARWImpl round the page up
to a power of two before allocating. A non-power-of-two sort page size
therefore let the heap overshoot the configured *.max.bytes by up to
~2x. Both sites now divide by Numbers.ceilPow2(pageSize) so the cap
holds. Power-of-two sizes (all shipped defaults) are unaffected.

Add coverage for the new error paths and (raise X) hints:
- LimitedSizeLongTreeChainTest pins the key-heap and value-heap
  overflow messages directly.
- SampleByFillTest pins the sort.value.max.bytes hint via the value
  chain.
- PropServerConfigurationTest restores the negative legacy *.max.pages
  derivation.
- OrderByMemoryCapTest exercises the serial top-N path and the parallel
  AsyncTopK per-worker path end to end.

Document the per-operator page-size minimums enforced at startup and
note that cairo.sql.window.store.max.pages also governs the bare
per-partition window-buffer overflow error, in both server.conf copies.
@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

Confirmed fresh-28 independently: RecordArray line 59 allocates auxMem with page pageSize >> 4 (= 4 bytes at the new 64-byte store minimum), and beginRecord() line 64 does
auxMem.putLong(...) (8 bytes) — so with a cache cap resolving to 1 page the aux side throws before storing any record, and the justification comment at line 153-154 ("store >> 4, so it
needs >= 64") is arithmetically wrong (64>>4 = 4 < 8).

I now have everything needed. Here is the consolidated, source-verified review.


Review: PR #7157 — feat(sql): add byte caps for CachedWindow and ORDER BY native memory (level 3)

29 raw findings → 26 verified, 3 dropped as false positives. Split: all confirmed code findings are in-diff; the only out-of-diff items are PR metadata/description. The cross-context sweep
of all ~30 unmigrated window-function callsites of getSqlWindowStoreMaxPages/getSqlWindowStorePageSize came back clean (no compile break, no overflow, no silently-dead overrides). No
resource-leak finding survived verification, and the adversarial-performance pass found nothing — the arithmetic is allocation-free and overflow-safe at Long.MAX_VALUE.

Critical

None. No confirmed correctness, safety, concurrency, resource-leak, or UB defect. Results stay correct on every path; the overflow clamps prevent negative/overflowed budgets; per-worker
tree chains are single-thread-owned under PerWorkerLocks then merged behind the dispatchAndAwait barrier.

Moderate

M1 — On the default ORDER BY path, sort.key.max.bytes alone does not bound memory, and the error message misdirects

(merges correctness-0, concurrency-1, resource-7, tests-10 — in-diff)

EncodedSortLightRecordCursor.java:75-77 (the default ORDER BY path, selected at SqlCodeGenerator.java:7449 whenever the sort key is fixed-width-encodable):

keyCap = min(getSqlSortKeyMaxBytes(), LIMIT); // LIMIT = (2^32-2)<<3 ≈ 32 GiB
valueCap = min(getSqlSortLightValueMaxBytes(), LIMIT);
maxEntryMemBytes = min(keyCap + valueCap, LIMIT);

Both caps default to Long.MAX_VALUE, so leaving one unset saturates its operand to LIMIT; the sum is then always ≥ LIMIT and the configured cap is fully masked. A user who sets only
cairo.sql.sort.key.max.bytes to bound ORDER BY gets no effective bound on this path (both keys must be lowered). throwLimitOverflow() (EncodedSortLightRecordCursor.java:229-237) compounds
this by saying "raise X or Y", implying either knob is the binding constraint. This is not a regression (master summed page budgets the same way) and not a correctness bug — but for a PR
whose stated purpose is opt-in memory bounding, the headline knob is ineffective in isolation on the most common path, and the remediation guidance is wrong. The sibling paths behave
correctly: EncodedSortRecordCursor (key cap drives entryMem, value cap a separate RecordChain) and LongTreeChain (two independent region caps) each bind on a single knob. No test exercises
a single-knob cap on the EncodedSortLight path.

Fix: either make each cap bind independently on this path (treat an unset cap as 0-contribution when the other is explicit), or document the combined-budget semantics in server.conf +
reword the error to "both X and Y bound this operator", and add an OrderByMemoryCapTest case that engages the encoded-light path with only the key cap set.

M2 — The new (raise ...) remediation hint on the EncodedSort paths is asserted by no test

(tests-9 — in-diff)

throwLimitOverflow() in both EncodedSortLightRecordCursor.java:229-237 and EncodedSortRecordCursor.java:243-249 appends the new (raise cairo.sql.sort.key.max.bytes[ or ...]) hint, but the
only tests hitting these paths (SecurityTest.java:560,809,1092,1156) assert only the pre-existing substring "memory exceeded in EncodedSort". The new user-facing hint is unverified and a
property rename would silently break it — exactly the failure mode the PR deliberately guards against elsewhere (SampleByFillTest, OrderByMemoryCapTest, CachedWindowMemoryCapTest all pin
their (raise ...) text). Fix: tighten one SecurityTest assertion (or add a dedicated test) to assert the full hint for both encoded paths.

M3 — PR description misstates page-size handling (says "clamp to 1" where the code rejects at startup)

(metadata-17 — PR body)

The body bullet claims all six page-size keys "clamp to a minimum of 1 byte at read time, so a misconfigured *.page.size=0 fails over to a tiny page." In reality, only
cairo.sql.sort.value.page.size clamps (Math.max(1, …)); the other five (sort.key≥64, sort.light.value≥12, window.store≥64, window.rowid≥16, window.tree≥32) are rejected at startup via
validatePageSizeAtLeast (PropServerConfiguration.java:2711) — the server won't boot. The PR's own server.conf text ("rejected at startup if smaller") and testPageSizeBelowBlockSizeRejected
contradict the body. Fix: rewrite the bullet to state heap-backed keys are rejected (naming the key) and only sort.value.page.size clamps; fix the matching test-plan line.

Minor

  • m1 — RecordArray aux index page can't hold one 8-byte slot at the new store minimum (fresh-28, in-diff). RecordArray.java:59 allocates auxMem with page pageSize >> 4; at the new
    MIN_WINDOW_STORE_PAGE_SIZE=64 that's 4 bytes, but beginRecord() writes an 8-byte putLong. With window.store.page.size in [64,127] and a cache cap resolving to ~1 page, the aux side throws
    "breached in VirtualMemory" before any record is stored (and the message names the cap key, not the page-size key). The justification comment at PropServerConfiguration.java:153-154
    ("store >> 4, so it needs >= 64") is arithmetically wrong — 64>>4=4<8. Fix: raise MIN_WINDOW_STORE_PAGE_SIZE to 128, or allocate auxMem with Math.max(pageSize >> 4, 8); correct the
    comment.
  • m2 — Migrated sort.key.max.pages gives materialization buffers a ~16× looser ceiling (perf-4 + resource-8 + crosscontext-23, in-diff). The derived byte cap is sort.key.page.size(128k) ×
    maxPages, but the materializer's internal unit is PAGE_SIZE=8192, so per-buffer caps scale by 128k/8k = 16× vs pre-PR for a single column (looser for <16 materialized columns, tighter for

16). Intentional/documented operator-wide semantics and only affects deployments that set the deprecated key; a relaxation that can't break a passing query. Fix: add an explicit migration
note ("raise sort.key.max.bytes if you relied on the old per-buffer ceiling").

  • m3 — DefaultCairoConfiguration sort caps changed from finite (1024 pages) to uncapped (crosscontext-22 + fresh-24, in-diff). Removes the implicit ~128 MB–16 GB test-time safety ceiling
    for every AbstractCairoTest. Production (PropServerConfiguration, already Integer.MAX_VALUE) is unchanged and no current test depends on the old default. Risk is only a future runaway-sort
    test silently OOMing instead of surfacing a clean LimitOverflowException. Fix (optional): return 1024L * getSqlSortKeyPageSize() to preserve the finite test default, or note the
    intentional removal.
  • m4 — Parallel ORDER BY ... LIMIT cap is per-worker; aggregate is (workers+1) × cap, undocumented (concurrency-2, in-diff). AsyncTopKAtom.java:106-132 gives the owner chain and every
    per-worker chain the full byte budget. Concurrency-safe; just a documentation gap in server.conf.
  • m5 — warnIfMaxBytesBelowPageSize compares raw page size; consumers floor by ceilPow2(pageSize) (fresh-26, in-diff). For non-power-of-two sort.value.page.size (and sort.key via
    RecordTreeChain.derivePageBudget), the advisory can disagree with the real flooring. Advisory-log accuracy only; default page sizes are powers of two. Note the suggested fix's mention of
    window.store is a no-op (those are already ceilPow2'd at config time) and must not be applied to sort.light.value (consumed raw).
  • m6 — New boolean locals miss the is/has prefix (quality-14, in-diff). PropServerConfiguration.java:1790-1792 cacheMaxBytesExplicit/storeMaxPagesExplicit violate the CLAUDE.md naming rule
    (the sibling hasAlias in the same PR follows it). Rename to isCacheMaxBytesExplicit/isStoreMaxPagesExplicit.
  • m7 — Resolved cache-cap error names the canonical key even when only the analytic.* alias was set (quality-15, in-diff). PropServerConfiguration.java:1795 hard-codes
    window.store.max.pages though storeMaxPagesExplicit is true for either key. Cosmetic (keys are deprecated-equivalent; raising the named key works).
  • m8 — Sub-minimum legacy alias page.size is rejected naming the new key, not the alias the user set (quality-16, in-diff). e.g. analytic.tree.page.size=16 is rejected as
    cairo.sql.window.tree.page.size. Cosmetic; the value is invalid regardless.
  • m9 — EncodedSortRecordCursor (full_encoded) value-chain byte cap is untested (tests-11, in-diff). The only sort.value overflow test uses full_recordchain → RecordTreeChain. The byte cap,
    page-rounding, message, and config-key hint are shared with the tested path, so the gap is the class wiring only.
  • m10 — testParallelTopKPerWorkerCapFires can't distinguish per-worker from owner-chain overflow (tests-12, in-diff). Both chains throw the identical message, so the assertion can't prove
    the per-worker path fired as the comment claims. Soften the comment or add a per-worker sentinel.
  • m11 — Negative legacy-cap floor is asserted only at the config layer (tests-13, in-diff). The old end-to-end max.pages=-1 probe was rewritten to a positive value; no query-level test now
    drives a negative cap through AbstractRedBlackTree/SortKeyMaterializingRecordCursor. The floor logic is verified correct by inspection; the gap is coverage only.
  • m12 — Performance label not justified (metadata-18, out-of-diff). feat(sql) memory-guard PR with no benchmark/perf claim; the only benchmark touched is cleanup from the API rename.
    Remove the label.
  • m13 — LIMIT 40000 missing thousands underscore (metadata-19, in-diff). OrderByMemoryCapTest.java:59,93 — use 40_000 to match the rest of the file and CLAUDE.md.
  • m14 — AMI server.conf omits # Deprecated: notes on analytic.{rowid,tree}.max.pages (metadata-20, in-diff). The main server.conf carries them; the AMI copy doesn't (though it did add them
    for the sort keys). analytic.store.max.pages is correctly left undeprecated in both.

Downgraded (false positives)

  • fresh-25 — divide-by-zero in DefaultCairoConfiguration.getSqlWindowCacheMaxPagesResolved(). Dropped: no subclass anywhere overrides getSqlWindowStorePageSize() to 0; the cited
    WindowFunctionTest =0 value flows through PropServerConfiguration, which ceilPow2s and rejects it (MIN=64), never reaching the divisor.
  • fresh-27 — window.cache.max.bytes doesn't bound the per-partition window buffers. Dropped: factually true but the in-diff server.conf already scopes the key to "the CachedWindow
    materialised record store" and explicitly ties per-partition buffers to window.store.max.pages. The finding's own suggested fix is already implemented; no contradiction with stated intent.
  • crosscontext-21 — "all callsites SAFE". Not a defect — a clean cross-context confirmation (verified: zero references to the 5 removed int getters, mvn -pl core test-compile green,
    RankFunctionFactory pageSize*maxPages/2 casts to long first so no overflow, SecurityTest translated 1:1).

Summary

Verdict: approve with changes recommended. This is a clean, well-tested PR with no blocking issues — no critical findings, no surviving resource-leak or correctness defects, and a
verified-safe blast radius across the ~30 unmigrated window-function callsites. The arithmetic (overflow-safe divide-then-multiply, clamps before sums, negative-cap flooring) is sound.

Address before merge: M1 (single-knob sort.key.max.bytes is a no-op on the default ORDER BY path + misleading error message — the most important, since it undercuts the feature's headline
use case) and M3 (PR description contradicts the shipped reject-at-startup behavior). M2 wants a migration callout. M2/M3/M4/m3 are the tradeoffs worth stating plainly in the description:
the cap is per-worker for parallel top-K, migrated max.pages shifts the materialization ceiling, and the test-time default is now uncapped. Everything else is minor (one concrete
edge-config wart in RecordArray/MIN_WINDOW_STORE_PAGE_SIZE, plus naming/test/doc polish).

Counts: 29 draft findings → 26 verified, 3 false positives removed. In-diff/out-of-diff: all confirmed code findings in-diff; 1 finding (label) out-of-diff PR metadata, 1 (description) in
the PR body. The non-trivial-but-zero-code-bug out-of-diff result reflects a genuinely clean caller sweep, not an underrun — the cross-context agent enumerated and cleared every unmigrated
callsite with grep evidence and a successful compile.

@mtopolnik
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 224 / 235 (95.32%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/orderby/LimitedSizeSortedLightRecordCursorFactory.java 4 8 50.00%
🔵 io/questdb/std/MemoryPages.java 7 9 77.78%
🔵 io/questdb/griffin/engine/orderby/EncodedSortLightRecordCursor.java 10 12 83.33%
🔵 io/questdb/griffin/engine/orderby/EncodedSortRecordCursor.java 15 17 88.24%
🔵 io/questdb/PropServerConfiguration.java 70 71 98.59%
🔵 io/questdb/cairo/vm/Vm.java 1 1 100.00%
🔵 io/questdb/griffin/engine/orderby/SortedLightRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/cairo/RecordChain.java 3 3 100.00%
🔵 io/questdb/griffin/engine/AbstractRedBlackTree.java 8 8 100.00%
🔵 io/questdb/cairo/DefaultCairoConfiguration.java 9 9 100.00%
🔵 io/questdb/PropertyKey.java 13 13 100.00%
🔵 io/questdb/griffin/engine/table/AsyncTopKAtom.java 8 8 100.00%
🔵 io/questdb/griffin/engine/orderby/SortKeyMaterializingRecordCursor.java 16 16 100.00%
🔵 io/questdb/cairo/vm/MemoryCARWImpl.java 9 9 100.00%
🔵 io/questdb/cairo/CairoConfigurationWrapper.java 8 8 100.00%
🔵 io/questdb/griffin/engine/orderby/LimitedSizeLongTreeChain.java 9 9 100.00%
🔵 io/questdb/griffin/engine/orderby/LongTreeChain.java 9 9 100.00%
🔵 io/questdb/griffin/engine/orderby/SortKeyMaterializingRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cairo/RecordArray.java 4 4 100.00%
🔵 io/questdb/griffin/engine/orderby/SortedRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/orderby/RecordTreeChain.java 6 6 100.00%
🔵 io/questdb/griffin/engine/window/CachedWindowRecordCursorFactory.java 6 6 100.00%

@bluestreak01 bluestreak01 merged commit 7ddece9 into master Jun 1, 2026
51 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_proper_window_function_mem_cap branch June 1, 2026 10:58
mtopolnik added a commit that referenced this pull request Jun 1, 2026
Resolve the SampleByFillTest conflict by keeping master's byte-cap
overflow semantics (#7157) expressed through the QueryAssertion builder
rather than the old sqlSortKeyMaxPages mechanism.

Master also added new tests that call the legacy assertQueryNoLeakCheck
overloads this branch had already removed from AbstractCairoTest. Migrate
those call sites to the builder so the merge compiles: the value-heap
overflow test in SampleByFillTest and all of the new
CachedWindowMemoryCapTest. Each migration preserves master's factory
properties (expectedTimestamp, supportsRandomAccess, expectSize).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
puzpuzpuz added a commit that referenced this pull request Jun 2, 2026
Resolve conflicts from upstream PRs #7182 (LIMIT -N over a covering
index), #7187 (fluent QueryAssertion builder) and #7157 (ORDER BY
native-memory byte caps).

CoveringIndexRecordCursorFactory: both this branch and #7182
independently reworked the multi-key merge to emit rows in row-id
(ts-ascending) order. Take upstream's version as the base -- its
min-scan merge plus the new single-key LIMIT -N backward scan -- then
re-apply this branch's additive fixes on top: getScanDirection(),
duplicate IN-key dedup, and the TablePageFrameCursor promotion (fixes
the parallel-keyed-group-by-under-SelectedRecord ClassCastException).
getScanDirection() reports OTHER only for multi-key latestBy; single-key
latestBy stays FORWARD (it returns one row, trivially ts-ordered) so the
designated-timestamp metadata upstream's tests expect is preserved.

EncodedSortRecordCursor: keep upstream's extracted throwLimitOverflow().

Tests: take upstream's fluent-builder assertions for shared cases and
keep this branch's new regression tests. #7187 removed the legacy
positional assertQuery()/assertQueryNoLeakCheck() helpers, so migrate
~35 remaining call sites across 10 test files to the fluent builder.

Verified: core test-compile clean; CoveringIndexTest (394),
ExplainPlanTest (534), AsOfJoinTest (115), CompiledFilterTest (50),
LimitTest (47) and the migrated group-by / distinct / IPv4 / array tests
all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Core Related to storage, data type, etc. Enhancement Enhance existing functionality New feature Feature requests Performance Performance improvements SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants