[CLAUDE] [OPUS 4.7] fix(annotate): register typing for FirstValue and RegexpExtract #7577

Merged
georgesittas merged 1 commit into tobymao:main from RichardHughes-amp:register-firstvalue-regexpextract-typing
Apr 30, 2026

@RichardHughes-amp (Contributor) commented Apr 28, 2026

Register FirstValue in base EXPRESSION_METADATA and RegexpExtract in Hive EXPRESSION_METADATA so the type annotator returns the right type instead of UNKNOWN.

  • FirstValue: propagates the type of its this arg, which is correct for every dialect. Presumed harmless in dialects (like SQLite) where FirstValue is not supported.
  • RegexpExtract: VARCHAR in Hive (and descendants: Spark2, Spark, Databricks) and Snowflake. Left out of base because DuckDB's regexp_extract behaves differently from the other implementations.
  • Regression fixture pins DuckDB REGEXP_EXTRACT → UNKNOWN.
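
To make the two outcomes in the list above concrete, here is a minimal toy model of a metadata lookup with an UNKNOWN fallback. It is only a sketch: `BASE_METADATA`, `propagate_first_arg`, and `annotate` are hypothetical names invented for illustration and are not sqlglot's actual API or data layout.

```python
from typing import Callable, Dict, List

UNKNOWN = "UNKNOWN"

# An annotator maps the argument types of a call to its result type.
Annotator = Callable[[List[str]], str]


def propagate_first_arg(arg_types: List[str]) -> str:
    """Return the type of the first argument, or UNKNOWN if there is none."""
    return arg_types[0] if arg_types else UNKNOWN


# Hypothetical stand-in for a metadata table shared by every dialect.
BASE_METADATA: Dict[str, Annotator] = {
    "LastValue": propagate_first_arg,
    "FirstValue": propagate_first_arg,
}


def annotate(metadata: Dict[str, Annotator], func: str, arg_types: List[str]) -> str:
    """Look up the annotator for `func`; unregistered functions stay UNKNOWN."""
    annotator = metadata.get(func)
    return annotator(arg_types) if annotator else UNKNOWN


first_value_type = annotate(BASE_METADATA, "FirstValue", ["BIGINT"])
unregistered_type = annotate(BASE_METADATA, "RegexpExtract", ["VARCHAR", "VARCHAR"])
```

In this model a function with no entry falls through to UNKNOWN, which is the symptom the PR describes for FirstValue before registration.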

@RichardHughes-amp RichardHughes-amp marked this pull request as ready for review April 28, 2026 23:03
Comment thread sqlglot/typing/__init__.py Outdated
Comment thread sqlglot/typing/snowflake.py
@RichardHughes-amp RichardHughes-amp changed the title fix(annotate): register typing for FirstValue and RegexpExtract [CLAUDE] [OPUS 4.7] fix(annotate): register typing for FirstValue and RegexpExtract Apr 29, 2026
FirstValue had no entry in base EXPRESSION_METADATA, so all dialects
except BigQuery returned UNKNOWN — despite its sister LastValue being
registered. Move it next to LastValue in the first-arg propagation block
and drop the now-redundant BigQuery duplicate.

RegexpExtract had no base entry either, so only BigQuery and Snowflake
typed it. Register it as constant-VARCHAR in the Hive typing module
(covers Hive/Spark2/Spark/Databricks through the existing chain). Keep
BigQuery's _annotate_by_args override since BigQuery genuinely overloads
on STRING vs BYTES input. Snowflake's existing entry is preserved.

Scoping the registration to Hive (not base) avoids leaking VARCHAR onto
dialects with different semantics — most notably DuckDB, where
REGEXP_EXTRACT can return a STRUCT when group names are passed.

Adds fixture coverage in annotate_functions.sql:
- cross-dialect FIRST_VALUE on BIGINT and STRING
- spark/databricks REGEXP_EXTRACT on STRING and BINARY input (proves
  the dialect's constant-STRING behavior, distinct from BigQuery's
  input-type overload)
- snowflake REGEXP_SUBSTR on STRING
- duckdb REGEXP_EXTRACT pinned at UNKNOWN to lock in the Hive-only
  scoping
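
The difference between a constant result type and an input-type overload, as described in the commit message above, can be sketched as follows. Both functions are hypothetical illustrations written for this example; they are not the real Hive entry or BigQuery's `_annotate_by_args` implementation.

```python
from typing import List


def annotate_regexp_extract_constant(arg_types: List[str]) -> str:
    """Constant typing: the result is VARCHAR regardless of the input type."""
    return "VARCHAR"


def annotate_regexp_extract_by_args(arg_types: List[str]) -> str:
    """Overloaded typing: the result follows the subject argument's type."""
    subject = arg_types[0] if arg_types else "UNKNOWN"
    # Only STRING and BYTES subjects are modelled here; anything else is UNKNOWN.
    return subject if subject in ("STRING", "BYTES") else "UNKNOWN"


# Binary input: the constant rule still yields VARCHAR, the overload yields BYTES.
constant_on_binary = annotate_regexp_extract_constant(["BINARY", "STRING"])
overload_on_bytes = annotate_regexp_extract_by_args(["BYTES", "STRING"])
overload_on_string = annotate_regexp_extract_by_args(["STRING", "STRING"])
```

This is why the fixtures exercise both STRING and BINARY input for Spark and Databricks: the two rules only disagree on non-string input.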
@RichardHughes-amp RichardHughes-amp force-pushed the register-firstvalue-regexpextract-typing branch from 22fd6aa to ec241a4 Compare April 29, 2026 21:05
@RichardHughes-amp (Contributor, Author) commented Apr 29, 2026

I believe that the regexp_extract changes should only affect the Hive->Spark2->Spark stack.

first_value is still affecting every dialect, which I think is fine?
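
A rough way to picture the scoping described in this comment is a chain of tables where each dialect starts from its parent's entries. This is a sketch under the assumption that child dialects simply inherit the parent's entries; the dictionaries below are hypothetical and store plain description strings rather than real annotators.

```python
from typing import Dict

# Hypothetical per-dialect tables; values are descriptions, not real annotators.
BASE: Dict[str, str] = {"FirstValue": "type of first argument"}

# Hive adds RegexpExtract; its descendants inherit the entry unchanged.
HIVE = {**BASE, "RegexpExtract": "VARCHAR"}
SPARK2 = {**HIVE}
SPARK = {**SPARK2}
DATABRICKS = {**SPARK}

# DuckDB derives from the base only, so it never sees the Hive entry.
DUCKDB = {**BASE}

dialects = {
    "hive": HIVE,
    "spark2": SPARK2,
    "spark": SPARK,
    "databricks": DATABRICKS,
    "duckdb": DUCKDB,
}

# Which dialects type each function in this toy model.
typed_regexp_extract = sorted(n for n, t in dialects.items() if "RegexpExtract" in t)
typed_first_value = sorted(n for n, t in dialects.items() if "FirstValue" in t)
```

Registering in the base table reaches every dialect, while registering in Hive reaches only Hive and whatever derives from it.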

Comment on lines +6032 to +6034
# dialect: duckdb
REGEXP_EXTRACT(tbl.str_col, pattern, 0);
UNKNOWN;
Collaborator

This shouldn't be UNKNOWN, right? The right type is VARCHAR for DuckDB. We should either type it as such or omit this test altogether.
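
For readers unfamiliar with the fixture layout quoted in this thread, here is a small sketch of how such entries could be read. It assumes the format is an optional `# dialect:` comment, then a SQL line, then the expected type, each terminated by `;`. `Case` and `parse_fixtures` are hypothetical helpers, not sqlglot's test loader.

```python
from typing import Iterator, NamedTuple, Optional


class Case(NamedTuple):
    dialect: Optional[str]
    sql: str
    expected_type: str


def parse_fixtures(text: str) -> Iterator[Case]:
    """Yield (dialect, sql, expected_type) triples from fixture text."""
    dialect: Optional[str] = None
    pending_sql: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            comment = line.lstrip("#").strip()
            if comment.startswith("dialect:"):
                dialect = comment[len("dialect:"):].strip()
            continue
        statement = line.rstrip(";").strip()
        if pending_sql is None:
            # First statement of a pair is the SQL expression.
            pending_sql = statement
        else:
            # Second statement is the expected type; emit and reset state.
            yield Case(dialect, pending_sql, statement)
            dialect, pending_sql = None, None


EXAMPLE = """\
# dialect: duckdb
REGEXP_EXTRACT(tbl.str_col, pattern, 0);
UNKNOWN;
"""

cases = list(parse_fixtures(EXAMPLE))
```

Read this way, the entry under review asserts that the DuckDB expression annotates to UNKNOWN, which is the expectation the reviewer is questioning.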

@georgesittas (Collaborator)

I'll get this in and take it to the finish line. Thank you for the contribution!

@georgesittas georgesittas merged commit 8af462a into tobymao:main Apr 30, 2026
10 checks passed