Skip to content

Add join fan-out prevention test suite#37

Merged
nicosuave merged 5 commits intomainfrom
nicosuave/measures-sql-research
Apr 12, 2026
Merged

Add join fan-out prevention test suite#37
nicosuave merged 5 commits intomainfrom
nicosuave/measures-sql-research

Conversation

@nicosuave
Copy link
Copy Markdown
Member

Summary

  • Adds 20 tests validating that measures evaluate against their defining table and are immune to row duplication caused by joins (paper sections 3.6 and 5.3)
  • Covers every DuckDB aggregate type: additive (SUM/AVG/COUNT), semi-additive (MIN/MAX), statistical (STDDEV/VARIANCE/MEDIAN/KURTOSIS/SKEWNESS/ENTROPY), string (STRING_AGG), collection (LIST), bitwise (BIT_XOR), boolean (BOOL_AND/BOOL_OR), non-decomposable (COUNT DISTINCT/MODE/PRODUCT), and non-additive (ratio)
  • Tests span 1:N, M:N, and LEFT join cardinalities with grouped and filtered query shapes
  • Updates parity matrix to mark fan-out prevention as Covered

20 tests validating that measures evaluate against their defining table
and are immune to row duplication from joins. Covers all aggregate types
(SUM, AVG, COUNT, MIN, MAX, STDDEV, VARIANCE, MEDIAN, STRING_AGG,
COUNT DISTINCT, MODE, PRODUCT, BIT_XOR, KURTOSIS, SKEWNESS, ENTROPY,
LIST, BOOL_AND, BOOL_OR) and ratio measures across 1:N, M:N, and LEFT
join cardinalities with grouped and filtered query shapes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 825b0b42d1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1346 to +1350
# Fan-out wouldn't change these with this data, but the measure must still
# evaluate against the defining table, not the joined rows.

query TT
SEMANTIC SELECT AGGREGATE(all_premium), AGGREGATE(any_trial)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use fan-out-sensitive data for BOOL_AND/BOOL_OR test

This test cannot detect a regression in fan-out prevention because the chosen boolean values produce the same false/true result even if the measure is incorrectly evaluated on duplicated join rows (or rows filtered by the join). As written, it validates function wiring but not the stated fan-out immunity, so a real duplication bug for boolean measures could slip through while this case still passes.

Useful? React with 👍 / 👎.

… name collision

Redesign boolean test data so Dave (no orders) is the only non-premium
and only trial user. If fan-out prevention fails, both results flip.
Rename avg_age to avg_cust_age in fan-out tests to avoid colliding with
the downstream customers_qualified_v measure of the same name.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db5f31dd09

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1086 to +1090
SEMANTIC SELECT AGGREGATE(youngest), AGGREGATE(oldest)
FROM fanout_customers_minmax_v c
JOIN fanout_orders o ON c.cust_id = o.cust_id;
----
25 40
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make MIN/MAX fan-out test observable

This MIN/MAX case cannot detect a fan-out regression because MIN(age) and MAX(age) are identical on both the base rows and the duplicated join rows in this dataset. If measure evaluation accidentally shifts to joined rows, this test would still pass, so it does not validate the claimed fan-out protection for semi-additive aggregates. Use values where the excluded or duplicated rows would change the min/max result.

Useful? React with 👍 / 👎.

Comment on lines +1151 to +1153
SEMANTIC SELECT AGGREGATE(distinct_products)
FROM fanout_orders_cd_v o
JOIN fanout_customers c ON o.cust_id = c.cust_id;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use a fan-out join in COUNT DISTINCT coverage

This COUNT(DISTINCT) test joins orders to customers on cust_id, which is many-to-one from fanout_orders and does not duplicate order rows, so the measure input set is unchanged. As a result, the assertion can pass even when fan-out prevention is broken, because no fan-out is actually introduced for the defining table of the measure.

Useful? React with 👍 / 👎.

Comment on lines +1233 to +1237
query I
SEMANTIC SELECT AGGREGATE(mode_age)
FROM fanout_ages_mode_v a
JOIN fanout_age_orders ao ON a.id = ao.id;
----
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make MODE test fail on joined-row evaluation

The MODE scenario is currently fan-out-insensitive: with this data, both the base table and the fanned-out join still produce mode 30, so a bug that evaluates the measure over joined rows would not be caught. This leaves non-decomposable fan-out behavior unverified despite the test passing.

Useful? React with 👍 / 👎.

MIN/MAX: add row with global-min age that has no join match, so MIN
changes if measure evaluates over joined rows only (15 vs 25).

COUNT DISTINCT: join orders to line_items (1:N from orders side) so
order rows are actually duplicated; previous join was M:1 and never
fanned out the defining table.

MODE: redesign data so base mode=20 but fanned-out mode=10 (id=1 has
3 orders, inflating age=10 to appear 3x vs age=20 appearing 2x).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20b5fe3a04

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1019 to +1022
SEMANTIC SELECT c.name, c.age, COUNT(*) AS order_rows, AGGREGATE(avg_cust_age)
FROM fanout_customers_v c
JOIN fanout_orders o ON c.cust_id = o.cust_id
GROUP BY c.name, c.age;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make grouped AVG fan-out check observable

This grouped case cannot detect a join fan-out regression because each group is a single customer with one constant age value; duplicating that customer's joined rows leaves AGGREGATE(avg_cust_age) unchanged at 30/25/40 either way. As written, the assertion can still pass when grouped measure evaluation accidentally runs over joined rows, so it does not verify fan-out immunity for grouped query shape.

Useful? React with 👍 / 👎.

Comment on lines +1145 to +1148
SEMANTIC SELECT p.product, AGGREGATE(profit_margin), COUNT(*) AS region_count
FROM fanout_products_v p
JOIN fanout_product_regions pr ON p.product = pr.product
GROUP BY p.product;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make grouped profit-margin test sensitive to fan-out

This grouped profit-margin query is also fan-out-insensitive with the current fixture: each product has only one base row, so duplicating Widget via the join scales both numerator and denominator equally and keeps the ratio at 0.6. That means the test can pass even if grouped measure computation incorrectly uses duplicated joined rows, leaving a regression path untested.

Useful? React with 👍 / 👎.

Grouped AVG: replace single-customer-per-group test with tier-based
grouping where gold tier has Alice(30, 3 orders) + Bob(25, 1 order).
Correct AVG=27.5; fan-out would weight Alice 3x giving 28.75.

Grouped ratio: remove per-product grouped profit margin test since
(SUM(rev)-SUM(cost))/SUM(rev) is scale-invariant under uniform row
duplication within a group, making it fundamentally unobservable.
The ungrouped ratio test already validates fan-out prevention.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7eec0f1142

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1183 to +1185
SEMANTIC SELECT AGGREGATE(distinct_products)
FROM fanout_orders_cd_v o
JOIN fanout_line_items li ON o.order_id = li.order_id;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make COUNT DISTINCT fan-out test fail on joined-row evaluation

This case is still fan-out-insensitive because COUNT(DISTINCT product) is unchanged by duplicating existing order rows; the joined rows contain the same three product values as the base table. If measure evaluation regresses to use joined rows, this query still returns 3, so the test would pass and miss the bug. Use fixture data where the join changes which distinct values are visible (for example, a unique product on a non-matching row) so incorrect joined-row evaluation produces a different result.

Useful? React with 👍 / 👎.

Comment on lines +1110 to +1111
# Correct: MIN=15 (id=4, no tasks), MAX=40
# Joined-only (wrong): MIN=25, MAX=40
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make MAX assertion observable in MIN/MAX fan-out test

The setup only makes MIN sensitive to joined-row evaluation; MAX is 40 in both the correct and incorrect (joined-only) path, so a regression affecting MAX specifically would still pass this test. Because this block is intended to validate both MIN and MAX fan-out immunity, the fixture should also force MAX to differ under joined-row execution (for example by putting the global max on an unmatched row).

Useful? React with 👍 / 👎.

MIN/MAX: put both the global min (age=15) and global max (age=50) on
unmatched rows so both values change if measure uses joined rows only
(25/35 vs correct 15/50).

COUNT DISTINCT: add an order with product 'Thingamajig' from cust_id=99
who has no match in the join target. Correct count is 3 distinct
products; joined-only evaluation would give 2.
@nicosuave nicosuave merged commit f620c77 into main Apr 12, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant