Implement symmetric aggregation for multi-model queries#28
Conversation
Fixes fan-out issue when metrics from different join levels are queried together. Pre-aggregates each metric separately to the dimension grain before joining. Fixes 6 bugs in Cube adapter through comprehensive kitchen sink tests.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41fd438d32
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| col_name = f"{dim_name}__{gran}" if gran else dim_name | ||
| # Use COALESCE to handle NULLs in join condition | ||
| # Actually for FULL OUTER JOIN, we need to compare the actual columns | ||
| # and handle NULLs with IS NOT DISTINCT FROM or COALESCE-based comparison | ||
| join_conditions.append(f"COALESCE({cte_names[0]}.{col_name}, '') = COALESCE({cte_name}.{col_name}, '')") |
There was a problem hiding this comment.
Use NULL-safe equality in preagg joins
The pre-aggregation join condition coalesces every dimension to '' regardless of type. This will either throw (e.g., numeric/date columns can’t coalesce with an empty string in many dialects) or silently coerce values, and it also conflates NULL with empty string for text dimensions, causing incorrect row matching. This breaks multi-model queries whenever a dimension isn’t a string or when NULLs should remain distinct; a NULL-safe equality like IS NOT DISTINCT FROM (or a type-appropriate sentinel) is needed to preserve correctness.
Useful? React with 👍 / 👎.
| # Wrap expression in CASE WHEN for filtering | ||
| # For count measures, return 1 if condition met, else NULL (not 0!) | ||
| # COUNT counts non-NULL values, so we need NULL to exclude non-matching rows | ||
| if measure.agg == "count": | ||
| base_expr = f"CASE WHEN {combined_filter} THEN 1 ELSE NULL END" |
There was a problem hiding this comment.
Preserve COUNT(column) semantics under filters
For filtered COUNT measures that specify a SQL column (e.g., count with sql: user_id to count non‑NULLs), the new CASE expression returns 1 when the filter matches. This changes semantics from “count non‑NULL column values” to “count all rows that match the filter,” even when the column is NULL. This regression appears only for COUNT metrics with a SQL expression and filters, but it will overcount in datasets where the counted column can be NULL; use the column expression inside the CASE (THEN {base_expr}) to retain COUNT(column) behavior.
Useful? React with 👍 / 👎.
- Resolved merge conflicts in generator.py keeping both upstream improvements (visited tracking for recursion, replace_model_placeholder) and local fixes (CASE WHEN for filtered measures, pre-aggregation for symmetric aggregation) - Added support for ratio type metrics in dependency collection - Fixed pre-aggregation to handle no dimensions (use CROSS JOIN) - Fixed metric name collision handling in pre-aggregation output
Summary
Fixes fan-out issue when metrics from different join levels are queried together by pre-aggregating each metric separately to the dimension grain before joining. Includes comprehensive kitchen sink tests that found and fixed 6 bugs in the Cube adapter and SQL generator.
Changes
_generate_with_preaggregation()method handles multi-model metric queries correctly