
feat: map field opt-in, ARRAY JOIN SQL fixes, field selection continuity #35

Merged: acmeguy merged 48 commits into main from feat/map-field-opt-in-and-array-join-fixes on Mar 31, 2026

@acmeguy commented on Mar 31, 2026

Summary

  • Map-expanded, nested (ARRAY JOIN), and AI-generated fields default to unchecked (opt-in) in change preview
  • ARRAY JOIN cube SQL uses explicit SELECT (no SELECT *) to prevent Array/scalar column ambiguity in ClickHouse subqueries
  • After field exclusion, ARRAY JOIN SQL is pruned to only include surviving columns (performance)
  • Non-ARRAY-JOINed nested groups (e.g. location.*) restored with FILTER_PARAMS lookup-index service
  • Removed paired filtered count measures (count_dimensions_* etc) and pre-aggregation granularity (Cube.js v1.6)
  • Skip LLM toggle, required fields (rewrite rules + filters), AI metrics empty-selection bug fix
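
The opt-in defaults described above could be derived from the source tag on each diff entry. The following is a minimal sketch, not the PR's actual code: the field shape (`name`, `source`) and the helper name `defaultChecked` are assumptions for illustration; only the tag values `map`, `nested`, and `ai` and the idea of required fields come from this description.

```javascript
// Source tags that should start unchecked (opt-in) in the change preview.
const OPT_IN_SOURCES = new Set(["map", "nested", "ai"]);

// Hypothetical helper: decides the initial checkbox state for one diff entry.
// `field` is assumed to look like { name: string, source?: string }.
function defaultChecked(field, requiredFields = []) {
  // Required fields (rewrite-rule and filter dimensions) are always selected.
  if (requiredFields.includes(field.name)) return true;
  // Regular columns carry no opt-in tag and default to checked.
  return !OPT_IN_SOURCES.has(field.source);
}
```

With this rule, a required field stays checked even when it carries a `nested` tag, which matches the "always selected" behavior listed for the count measure and rewrite-rule dimensions.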

Changes

| File | Change |
| --- | --- |
| smartGenerate.js | skip_llm param, required_fields in response, ARRAY JOIN SQL pruning after exclusion, summary recount, nested column preservation in selected_columns filter |
| cubeBuilder.js | Explicit SELECT builder, non-AJ nested group support, AJ group column exclusion, paired counts removed, granularity removed |
| diffModels.js | Source tagging (map, nested, ai) on diff field entries |
| fieldProcessors.js | Backtick-quote dotted names in generateSqlExpression |
| yamlGenerator.js | Remove granularity/partition_granularity from pre-agg JS output |

Test plan

  • Fresh generation on semantic_events with commerce.products ARRAY JOIN + entry_type = Line Item filter
  • Verify only selected nested fields appear in final model SQL
  • Verify location.* fields appear as FILTER_PARAMS dimensions
  • Verify map fields default unchecked, regular columns default checked
  • Verify "Skip LLM" toggle disables AI enrichment and advisory passes
  • Verify empty AI metrics selection produces model with no AI metrics
  • Run query against generated model — no Array comparison errors

🤖 Generated with Claude Code

acmeguy and others added 30 commits March 31, 2026 08:57
Adds a new POST endpoint that detects nested (GROUPED) column structures
in a ClickHouse table and returns discriminator columns with their
distinct values, enabling the frontend to show filter options in the
Smart Generate dialog.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, cleanup

- Use cubejs.options.driverFactory({ securityContext }) instead of cubejs.driverFactory()
- Add SAFE_IDENTIFIER regex validation on schema/table params to prevent SQL injection
- Add driver.release() cleanup in catch block
- Use { code, message } error response shape matching other routes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
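
A rough sketch of the identifier check described in the commit above. The commit names SAFE_IDENTIFIER but does not show its definition, so the pattern below is an assumption, as are the helper name `validateIdentifiers` and the error code string; only the `{ code, message }` response shape is taken from the commit message.

```javascript
// Assumed pattern: letters, digits, and underscores, not starting with a digit.
const SAFE_IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Hypothetical helper: returns a { code, message } error for the first
// invalid parameter, or null when every value is a safe identifier.
function validateIdentifiers(params) {
  for (const [key, value] of Object.entries(params)) {
    if (typeof value !== "string" || !SAFE_IDENTIFIER.test(value)) {
      return { code: "invalid_identifier", message: `Invalid ${key}` };
    }
  }
  return null;
}
```

Rejecting anything outside the allow-list is simpler to reason about than escaping, since schema and table names are interpolated into SQL as identifiers rather than bound as values.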
… cube names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Restore AS alias clause in legacy ARRAY JOIN path SQL with partition WHERE
- Use ClickHouse-standard doubled single quotes ('') instead of backslash escaping
- Remove redundant template literal wrapping in arrayJoinGroups map
- Add warning when groupColumns is empty but arrayJoinGroups were requested

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
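
For illustration, doubled-quote escaping of a filter value might look like the sketch below. The function name `quoteClickHouseString` is hypothetical. Note the labeled limitation: ClickHouse string literals also interpret backslash sequences, and this sketch does not handle backslashes in the input.

```javascript
// Hypothetical helper: wraps a value in single quotes and doubles any
// embedded single quote, as the commit describes.
// Limitation (assumption of this sketch): backslashes are passed through
// unchanged, so inputs containing "\" need separate handling.
function quoteClickHouseString(value) {
  return "'" + String(value).replace(/'/g, "''") + "'";
}
```

For example, a nested filter on entry_type with the value Line Item would be rendered as `entry_type = 'Line Item'`.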
Insert LLM polishing step after AI enrichment and before final JS code
generation. The polisher rewrites cube definitions per modeling principles
while preserving original SQL. Polish results are included in all response
payloads (dry-run, no-changes, and apply).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ndpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…selected

Without this, profiling ran against the base table and reported empty columns
for nested array sub-columns. Now the profiler uses LEFT ARRAY JOIN so column
stats reflect the expanded array-joined rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ClickHouse Nested columns (stored as parallel arrays with dotted names)
require enumerating each sub-column in the ARRAY JOIN clause:
  ARRAY JOIN `parent.child1` AS child1_alias, `parent.child2` AS child2_alias

Previously used `ARRAY JOIN parent` which is invalid for this column type.

Fixes both profiler (for accurate column stats on expanded rows) and
cubeBuilder (for correct cube SQL generation).

Non-array-join profiling path is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace dots with underscores in the full column name (e.g.
commerce.products.entry_type → commerce_products_entry_type) for both
the ARRAY JOIN alias and the nested WHERE filter clause.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The frontend sends nestedFilters in the profile-table POST body but
the route wasn't extracting or passing them to the profiler function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Profiler: restrict ARRAY JOIN to columns whose rawType starts with Array(;
  scalar dotted columns (e.g. commerce.details, Nullable(String)) are excluded
- Profiler + CubeBuilder: use full column name with dots→underscores as alias
  (e.g. commerce.products.entry_type → commerce_products_entry_type)
- CubeBuilder: dimension/measure SQL uses the aliased column name
- WHERE clauses use the aliased names consistently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
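
The three commits above (enumerating sub-columns, dots-to-underscores aliases, and the Array( type filter) can be combined into one sketch. This is not the code from profiler or cubeBuilder; the column shape `{ name, rawType }` and the function name `buildArrayJoinClause` are assumptions for illustration.

```javascript
// Hypothetical helper: builds the ARRAY JOIN clause for one nested group.
// `columns` is assumed to be an array of { name, rawType } objects.
function buildArrayJoinClause(columns, groupPrefix) {
  const parts = columns
    // Only array-typed sub-columns of the group can be array-joined;
    // scalar dotted columns such as Nullable(String) are skipped.
    .filter((c) => c.name.startsWith(groupPrefix + ".") && c.rawType.startsWith("Array("))
    // Alias is the full column name with dots replaced by underscores.
    .map((c) => `\`${c.name}\` AS ${c.name.replace(/\./g, "_")}`);
  return parts.length > 0 ? `ARRAY JOIN ${parts.join(", ")}` : "";
}
```

Dimension SQL and WHERE clauses would then refer to the alias (for example commerce_products_entry_type) rather than the dotted source name.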
Query rewrite rules (e.g. partition scoping by team properties) are now
loaded and translated to raw SQL filters before profiling. This ensures
the profiler respects the same row-level access controls as the Cube.js
query layer.

Applied in both profileTable and smartGenerate routes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alidate

After the LLM returns polished cubes, generates JS and runs
validateModelSyntax. If validation fails, sends errors back to the LLM
for correction, up to 2 cycles. Also mounts first-principles path
and checks multiple principle file locations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
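
The validate-and-retry cycle described above could be structured roughly as follows. Both callbacks are hypothetical stand-ins: `validate` plays the role of validateModelSyntax (assumed here to return an array of error strings) and `requestFix` stands in for the LLM correction call. Only the limit of 2 cycles comes from the commit message.

```javascript
// Hypothetical sketch of a bounded correction loop.
// validate(code) -> array of error strings (empty when valid).
// requestFix(code, errors) -> Promise of corrected code.
async function validateWithRetries(code, validate, requestFix, maxCycles = 2) {
  let current = code;
  let errors = validate(current);
  let cycles = 0;
  // Ask for a correction only while errors remain and the budget allows.
  while (errors.length > 0 && cycles < maxCycles) {
    current = await requestFix(current, errors);
    errors = validate(current);
    cycles += 1;
  }
  return { code: current, valid: errors.length === 0, errors, cycles };
}
```

The caller still has to decide what to do when `valid` is false after the last cycle, for example falling back to the unpolished model.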
Zod schemas are now built inside an async getSchemas() function that
imports zod dynamically, avoiding the undefined 'z' at module load time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…for Zod 4 compat

zodResponseFormat fails with z.any() as a record value type in Zod 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… compat

OpenAI structured output requires every field to have an explicit type.
Replaced z.any().nullable() for rollingWindow, timeShift, refresh_key,
and meta with fully typed schemas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Preview shows the raw generated model for fast feedback. Polishing
runs only when the user clicks Apply Changes, avoiding timeouts
during preview. Also increased polisher timeout to 180s for large models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Plan 1: Single-line fix for lcFrom missing arrayJoinClause (4 broken queries)
Plan 2: 6-task plan for principle-compliant cubeBuilder heuristics (titles,
meta, paired counts, format, public:false, drill members, pre-aggregations)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ries

Single-line fix: the lcFrom variable (used by 4 downstream queries for
Map numeric stats, Map string stats, and LC value probe) was missing
the arrayJoinClause. All nested filter WHERE conditions referenced
aliased column names that only exist after ARRAY JOIN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- titleFromName: snake_case → Title Case on all fields and cubes
- Partition-first dimension ordering
- Complete meta block: grain, grain_description, time_dimension, time_zone, refresh_cadence
- Paired filtered counts for LC dimensions (max 10 values)
- Drill members on primary count measure
- Format inference: currency/percent by column name pattern
- public: false on plumbing fields (GIDs, write_key, etc.)
- Default pre-aggregations: daily + monthly rollups with ClickHouse indexes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
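
As an illustration of the first heuristic, a snake_case to Title Case conversion might look like this. The function name titleFromName comes from the commit message; the body is a guess at the behavior, not the implementation in cubeBuilder.

```javascript
// Sketch of titleFromName: "entry_type" -> "Entry Type".
// Assumption: names are split on underscores only; empty segments
// (from leading or doubled underscores) are dropped.
function titleFromName(name) {
  return name
    .split("_")
    .filter(Boolean)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}
```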
…c, drill_members in yamlGenerator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
acmeguy and others added 18 commits March 31, 2026 08:57
6-task plan: fix Hasura timeout, create modelAdvisor with 4 focused
micro-prompts (descriptions, segments, metrics, pre-aggregations),
integrate into pipeline, update frontend, delete old polisher,
full end-to-end testing including Cube.js compiler validation and
Explore page query verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r debug logging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…enerate pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rays

Cube.js expects pre_aggregations as named keys with indexes as nested
named objects. The yamlGenerator was using JSON.stringify which produced
arrays with 'name' fields — invalid Cube.js syntax.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
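
The array-to-named-keys conversion described above can be sketched as below. The helper name `toNamedKeys` and the sample property names other than `name` and `indexes` are assumptions; the actual yamlGenerator change is not shown in this PR text.

```javascript
// Hypothetical helper: turns [{ name: "daily", ... }] into { daily: { ... } },
// applying the same conversion to a nested `indexes` array if present.
function toNamedKeys(items) {
  const out = {};
  for (const { name, ...rest } of items) {
    out[name] = Array.isArray(rest.indexes)
      ? { ...rest, indexes: toNamedKeys(rest.indexes) }
      : rest;
  }
  return out;
}
```

The resulting object can then be emitted as an object literal keyed by pre-aggregation name, instead of serializing the original array with JSON.stringify.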
Previously buildCubes always emitted the raw base-table cube AND the
array-joined cube. When the user selects an array join with filters,
only the filtered array-joined cube should be produced — one cube,
one file, one intent.

The raw cube is still built internally (for field processing and as a
base for the array join cube's inherited dimensions) but is not emitted.

All heuristics (partition-first, grain/meta, drill members, format
inference, public:false, pre-aggregations) are now applied to the
array-joined cube when it's the sole output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a new model is merged with an existing one, FILTER_PARAMS
expressions from the old model may reference the previous cube name.
This replaces all FILTER_PARAMS.old_cube_name references with the
actual cube name from the current generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
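
A minimal sketch of the cube-name rewrite described above, assuming references always take the form `FILTER_PARAMS.<cube>.<member>`. The function name `renameFilterParamsCube` is hypothetical.

```javascript
// Hypothetical helper: rewrites FILTER_PARAMS.<oldName>. to
// FILTER_PARAMS.<newName>. everywhere in a SQL expression.
function renameFilterParamsCube(sql, oldName, newName) {
  // Escape regex metacharacters in the old cube name before building the pattern.
  const escaped = oldName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`FILTER_PARAMS\\.${escaped}\\.`, "g");
  // A replacer function avoids "$" in newName being read as a replacement token.
  return sql.replace(pattern, () => `FILTER_PARAMS.${newName}.`);
}
```

Requiring the trailing dot in the pattern keeps a cube named old_events_v2 from being rewritten when only old_events is being renamed.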
…models

When nestedFilters are active:
1. Force mergeStrategy='replace' — FILTER_PARAMS from old model
   are incompatible with ARRAY JOIN (indexOf on scalar columns)
2. Use the cube name for the file name — ensures file name matches
   cube name for Cube.js resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. SQL now uses newlines + indentation for readability in model editor
2. Removed count_distinct_approx from pre-agg filters and advisor schema
   — not supported by ClickHouse driver

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FILTER_PARAMS dimensions use indexOf on array columns which become
scalars after ARRAY JOIN. These dimensions cause runtime ClickHouse
errors and must be excluded from the array-joined cube.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions

When FILTER_PARAMS dimensions are stripped from the array-joined cube,
paired count measures that reference those dimensions must also be
removed, and drill_members lists must be cleaned.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the user deselects columns in the profile preview, the filter
column (e.g. commerce.products.entry_type) might be removed from
the columns Map. But the WHERE clause still references it. Ensure
filter columns are always in the ARRAY JOIN regardless of selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… preview

Backend: smartGenerate strips excluded dimensions/measures/segments from
cubes before generating JS. excluded_fields flows through Hasura action
→ RPC handler → CubeJS route.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When user deselects fields in change preview, all references must be
cleaned: drill_members, paired counts, pre-aggregation measures/dimensions,
and derived metrics that reference excluded fields via {name} syntax.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. ARRAY JOIN SQL now uses SELECT *, alias1, alias2... instead of just
   SELECT *. ClickHouse doesn't project ARRAY JOIN aliases into outer
   subquery scope with SELECT * alone.

2. Segments that reference excluded dimensions via {CUBE}.field_name
   are now stripped during field exclusion cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs cleanup in a loop until stable — each pass may remove fields that
other fields depend on. Checks both {name} and {CUBE}.name reference
patterns. Handles cascading dependencies (metric A references metric B
which references excluded field C).

Also adds debug logging for excluded_fields receipt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
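
The run-until-stable cleanup above amounts to a fixpoint loop over field references. The sketch below assumes a flat `{ fieldName: sqlExpression }` map, which is a simplification of real cube definitions; the function name `pruneExcluded` is hypothetical. The two reference patterns, {name} and {CUBE}.name, are the ones named in the commit message.

```javascript
// Matches {CUBE}.name (group 1) or {name} (group 2).
const REF = /\{CUBE\}\.(\w+)|\{(\w+)\}/g;

// Hypothetical helper: removes excluded fields plus every field that
// references a removed field, directly or through a chain.
function pruneExcluded(fields, excluded) {
  const removed = new Set(excluded);
  const kept = new Map(Object.entries(fields).filter(([name]) => !removed.has(name)));
  let changed = true;
  // Repeat until a full pass removes nothing.
  while (changed) {
    changed = false;
    for (const [name, sql] of kept) {
      const refs = [...sql.matchAll(REF)].map((m) => m[1] || m[2]);
      if (refs.some((ref) => removed.has(ref))) {
        kept.delete(name);
        removed.add(name);
        changed = true;
      }
    }
  }
  return Object.fromEntries(kept);
}
```

A single pass is not enough in general because a dependent field can appear before the field it depends on; the outer loop handles that ordering problem.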
Smart Generation improvements:
- Map-expanded fields default to unchecked (opt-in) in change preview
- ARRAY JOIN nested fields default to unchecked (opt-in) in change preview
- AI-generated metrics default to unchecked (opt-in) in change preview
- Count measure and rewrite-rule dimensions always selected
- Source tagging in diffModels (map, nested, ai) for frontend selection logic
- Skip LLM toggle support (skip_llm parameter)
- Required fields (rewrite rules + filter dims) passed to frontend

ARRAY JOIN SQL generation:
- Replace SELECT * with explicit column list to prevent Array/scalar ambiguity
- ARRAY JOIN alias names projected in SELECT for Cube.js subquery visibility
- Non-AJ nested groups (location.*) excluded from SELECT (no corresponding dims)
- After excluded_fields, prune ARRAY JOIN SQL to only surviving columns
- Recompute summary counts after field exclusion

Field continuity fixes:
- Non-AJ nested groups (location.*) pass through processColumns despite no profiling
- FILTER_PARAMS dimensions for non-AJ groups preserved in AJ cube (indexOf still valid)
- AJ group FILTER_PARAMS dimensions correctly excluded (indexOf breaks on scalars)
- Backtick-quote dotted column names in NestedFieldProcessor SQL
- AI metrics empty selection sends empty array (not undefined) to prevent include-all
- SELECT pruning uses exact alias name tracking (not regex heuristic)

Removed:
- Paired filtered count measures (count_dimensions_* etc)
- granularity/partition_granularity from pre-aggregations (Cube.js v1.6 compat)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
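
To make the explicit SELECT and pruning behavior concrete, here is a rough sketch of building ARRAY JOIN SQL from only the surviving columns. The option names (`table`, `scalarColumns`, `joined`, `selected`, `filterColumns`) and the function name are assumptions, not the cubeBuilder API; the rule that filter columns stay in the ARRAY JOIN regardless of selection comes from an earlier commit in this PR.

```javascript
// Hypothetical helper: emits an explicit column list (no SELECT *) and
// array-joins only the columns that survive field exclusion.
function buildArrayJoinSelect({ table, scalarColumns, joined, selected, filterColumns = [] }) {
  // Filter columns are always kept because the WHERE clause references them.
  const keep = new Set([...selected, ...filterColumns]);
  const survivors = joined.filter((column) => keep.has(column));
  const alias = (column) => column.replace(/\./g, "_");
  const selectList = [...scalarColumns, ...survivors.map(alias)];
  const lines = [`SELECT ${selectList.join(", ")}`, `FROM ${table}`];
  // Omit the clause entirely when nothing survives, since an empty
  // ARRAY JOIN list would not be valid SQL.
  if (survivors.length > 0) {
    lines.push(`ARRAY JOIN ${survivors.map((c) => `\`${c}\` AS ${alias(c)}`).join(", ")}`);
  }
  return lines.join("\n");
}
```

Tracking survivors by exact name, rather than pattern-matching the generated SQL, mirrors the "exact alias name tracking" point in the commit above.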
@acmeguy merged commit c9897e9 into main on Mar 31, 2026
3 checks passed