
adapt filter expressions to file schema during parquet scan #16461


Merged
26 commits merged on Jun 26, 2025

Conversation

@adriangb (Contributor, PR author) commented Jun 19, 2025

The idea here is to move us one step closer to #15780, although there is still work to do (e.g. #15780 (comment)).

My goal is that this immediately unblocks other work (in particular #15057) while being a relatively small change that we can build upon with more optimizations, etc.

@github-actions bot added the datasource (Changes to the datasource crate) label — Jun 19, 2025
Comment on lines 553 to 556
return exec_err!(
"Non-nullable column '{}' is missing from the physical schema",
column.name()
);
adriangb (PR author):
Might be useful to include some sort of file identifier here?

Contributor:

agreed -- that might also be a nice usecase for a builder style struct

let new_predicate = PhysicalExprSchemaRewriter::new(physical_file_schema, logical_file_schema)
  .with_identifier(file_name)
  .convert(predicate)?;

🤔

let stream = opener.open(make_meta(), file.clone()).unwrap().await.unwrap();
let (num_batches, num_rows) = count_batches_and_rows(stream).await;
assert_eq!(num_batches, 1);
assert_eq!(num_rows, 1);
@adriangb (PR author) — Jun 19, 2025:

This assertion fails on main: all 3 rows are passed because the row filter cannot handle the partition columns. This PR somewhat coincidentally happens to allow the row filter to handle predicates that depend on partition and data columns!

@adriangb adriangb requested a review from alamb June 19, 2025 19:20
@alamb (Contributor) left a comment:

Thank you @adriangb -- other than the removed test I think this PR makes sense to me and could be merged

I think it would be useful to refactor this code a bit and invest in testing infrastructure as we proceed towards adding nicer features (like unwrapping casts on columns for example)

/// Preference is always given to casting literal values to the data type of the column
/// since casting the column to the literal value's data type can be significantly more expensive.
/// Given two columns the cast is applied arbitrarily to the first column.
pub fn cast_expr_to_schema(
Contributor:

I think this is more general than just parquet, so perhaps we could move it into the datafusion-physical-schema crate or perhaps somewhere near the physical planner

Contributor:

I also think it would really help (perhaps as a follow on PR) to add some more specific unit tests.

I wonder if an API like the following makes sense:

struct PhysicalExprSchemaRewriter { 
...
}


// rewrite a predicate
let new_predicate = PhysicalExprSchemaRewriter::new(physical_file_schema, logical_file_schema)
  // optionally provide partition values
  .with_partition_columns(partition_fields, partition_values)
  .convert(predicate)?;

Then I think writing unit tests would be easy and we could adapt / extend the code over time -- and it would set us up for adapting more sophisticated expressions like field extraction...
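
A rough sketch of the kind of unit test this builder would make easy (the PhysicalExprSchemaRewriter methods are the ones proposed above; the import path, exact signatures, and the Column display format asserted at the end are assumptions, not the final API):

use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::{col, lit, BinaryExpr};
use datafusion_physical_expr::PhysicalExpr;

#[test]
fn rewrites_partition_column_to_literal() -> Result<()> {
    // Logical (table) schema contains a data column and a partition column.
    let logical_file_schema = Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("part", DataType::Utf8, false),
    ]);
    // The file itself only contains the data column.
    let physical_file_schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);

    // Predicate over the partition column: part = 'x'
    let predicate: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        col("part", &logical_file_schema)?,
        Operator::Eq,
        lit(ScalarValue::Utf8(Some("x".to_string()))),
    ));

    let partition_fields = vec![Arc::new(Field::new("part", DataType::Utf8, false))];
    let partition_values = vec![ScalarValue::Utf8(Some("x".to_string()))];

    // Builder API as proposed above; ownership and signatures are assumed here.
    let rewritten = PhysicalExprSchemaRewriter::new(&physical_file_schema, &logical_file_schema)
        .with_partition_columns(partition_fields, partition_values)
        .convert(predicate)?;

    // The partition column reference should have been replaced by the literal
    // partition value, so the rewritten predicate no longer mentions `part`
    // (physical Column display is assumed to be "name@index").
    assert!(!rewritten.to_string().contains("part@"));
    Ok(())
}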

adriangb (PR author):

Sounds good to me, I will work on this next! Agreed on the unit tests 😄


assert!(candidate.is_none());
}

#[test]
fn test_filter_type_coercion() {
Contributor:

Is there any way to keep this test (or something like it) that shows reading with a predicate of a different schema is correctly coerced?

adriangb (PR author):

I think so - I'll have to make it more e2e since it is no longer specific to the row filter. We have some other similar tests; I'll work this in there.

Comment on lines 557 to 558
if let Some(column) = expr.as_any().downcast_ref::<expressions::Column>() {
let logical_field = match logical_file_schema.field_with_name(column.name()) {
Contributor:

nit: we could reduce a level of nesting with something like

Suggested change:
- if let Some(column) = expr.as_any().downcast_ref::<expressions::Column>() {
-     let logical_field = match logical_file_schema.field_with_name(column.name()) {
+ let Some(column) = expr.as_any().downcast_ref::<expressions::Column>() else {
+     return Ok(Transformed::no(expr));
+ };

// If the column is not found in the logical schema, return an error
// This should probably never be hit unless something upstream broke, but nonetheless it's better
// for us to return a handleable error than to panic / do something unexpected.
Contributor:

👍


// If the logical field and physical field are different, we need to cast
// the column to the logical field's data type.
// We will try later to move the cast to literal values if possible, which is computationally cheaper.
Contributor:

👍

@github-actions bot added the physical-expr (Changes to the physical-expr crates) label — Jun 21, 2025
@adriangb (PR author):

@alamb I've created the builder, moved the implementation and added some unit tests

@adriangb (PR author):

My hope with this work is that we can:

@adriangb (PR author):

@kosiew I'm curious what you think about this. Would it be possible to implement the nested struct imputation work you're doing with this approach?

@adriangb (PR author):

Just for fun I opened pydantic#31 to see how hard it would be to incorporate #15780 (comment) - not too bad! Most of the code could be shared with the logical layer with some refactoring (making things pub), etc.

@kosiew (Contributor) commented Jun 23, 2025

@adriangb
Thanks for the ping on this.

Would it be possible to implement the nested struct imputation work you're doing with this approach?

Do you mean reusing the PhysicalExprSchemaRewriter machinery to drive nested‐struct imputation?

Here are some disadvantages of the rewrite‐centric approach versus the more data‐centric adapter approach:

  • Expression-only, not data-level: This never actually transforms the underlying RecordBatch columns—if downstream logic (or the user) inspects a struct column directly, they won’t see the new null fields injected. We’d still need array-level imputation for correctness in the result batches.

  • Limited to predicate contexts: The rewriter hooks into filter and pruning, but our broader schema-evolution needs (e.g. reading all columns, SELECT *, writing out evolved nested structs) fall outside its scope.

  • Duplication risk: We end up reinventing part of the schema-adapter’s compatibility logic (matching fields by name, casting types) inside the rewriter, which can drift from the adapter’s rules over time.

  • Complexity with deep nesting: Recursively handling deeply nested structs inside an expression tree—and ensuring every nested‐field access gets rewritten with the right shape—quickly becomes more intricate than a simple tree visitor.

  • Performance implications: Constantly rewriting and reconstructing expression trees (and then evaluating those casts/lits) might be less efficient than bulk array‐level casts + struct builds, especially on wide tables.

So, could we bolt nested‐struct imputation onto this rewriter? Technically yes: we could extend rewrite_column so that, whenever we see a Column referring to foo.bar.baz that’s missing in the physical schema, we generate a Literal::Null of the full nested type (constructing the proper StructValue). But we’d still need an array-level counterpart to actually materialize those null nested fields in the RecordBatch when we call map_batch.

In practice, the two approaches complement each other:

  • Use the rewriter to handle predicate and projection expressions (so filters and column references don’t blow up).
  • Continue to rely on cast_struct_column + NestedStructSchemaAdapter to adapt the actual batch data, filling in null arrays and doing recursive casts.

That way we get the best of both worlds—clean, centralized expression rewriting for pushdown, and robust array-level marshalling for the final tables. 😊

Why the two-pronged approach makes sense

  1. Pushdown vs. Data Adaptation Are Different Concerns

The PhysicalExprSchemaRewriter is perfect for rewriting predicates and projections so they don’t blow up when the on-disk schema diverges.

But once you’ve read that Parquet row group into memory, you still need to reshape the StructArray itself—filling in null arrays for new nested fields, dropping old ones, recursively casting types.

  2. Keeping Pruning Code Lean

Swapping out the old SchemaMapper for the rewriter in your pruning path is a great win: much less boilerplate, better separation of concerns, and no more “fake record batches” just to evaluate a filter.

You can remove all of that pruning-specific adapter code and lean on the rewriter’s tree visitor.

  3. Deep Schema-Evolution Still Lives in the Adapter

Handling a top-level missing column is easy in the rewriter (you just rewrite col("foo") to lit(NULL)), but handling col("a.b.c") where b itself is a new struct, and c is a new field inside it… that’s far more natural in a recursive cast_struct_column that operates on StructArray children.

@kosiew (Contributor) commented Jun 23, 2025

Putting more words to how I understand pushdown and data adaptation:

  1. Pushdown — “Which rows or pages should I read?”
  • Input: your original predicate (e.g. col("foo.b") > 5) and the physical Parquet schema.

  • What the rewriter does:

    • Sees that foo.b doesn’t exist on disk → replaces col("foo.b") > 5 with lit(NULL) > 5.
    • Or if foo.a is stored as Int32 but the table expects Int64, it wraps col("foo.a") in a cast.
  • Result: you get a “safe” predicate that Parquet can evaluate against row‐group statistics or pages without error.

  • Outcome: you prune away unneeded row groups, or skip pages, based on that rewritten expression.

At the end of this step, no data has actually been materialized—you’ve only modified the expression you use to decide what to read (a short code sketch of this rewrite follows after this list).

  2. Data adaptation — “How do I shape the in-memory batch to match the logical schema?”
  • Input: a RecordBatch (or StructArray) that you read directly from Parquet.

    • This batch is laid out exactly as on disk: it only has the columns that existed in that file’s schema, and nested structs only contain the old fields.
  • What the adapter does (map_batch / cast_struct_column):

    • Field matching: for each field in your logical (table) schema, look it up by name in the batch’s arrays.
    • Missing fields → insert a new_null_array(...) of the right datatype and row count.
    • Extra fields (present on disk but dropped in the table) → ignore them.
    • Nested structs → recurse into child struct arrays, doing the same match/fill/ignore/cast logic at each level.
  • Result: a brand-new StructArray (and overall RecordBatch) whose columns exactly line up with your table schema—even for deeply nested new fields.
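
A minimal sketch of the step-1 rewrite for a single column reference, assuming the missing-column and type-mismatch rules described above (the function name rewrite_column_to_file_schema is illustrative, not the PR's actual internals):

use std::sync::Arc;

use arrow::datatypes::Schema;
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::{CastExpr, Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

/// Rewrite one column reference so it can be evaluated against the physical
/// (file) schema: a missing column becomes a typed NULL literal, and a type
/// mismatch is bridged with a cast back to the logical type.
fn rewrite_column_to_file_schema(
    column: &Column,
    logical_file_schema: &Schema,
    physical_file_schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    let logical_field = logical_file_schema.field_with_name(column.name())?;

    match physical_file_schema.index_of(column.name()) {
        // Column is not in the file: replace it with a NULL literal of the
        // logical type (mirrors what the default SchemaAdapter does with data).
        Err(_) => {
            let null = ScalarValue::try_from(logical_field.data_type())?;
            Ok(Arc::new(Literal::new(null)))
        }
        Ok(physical_index) => {
            let physical_field = physical_file_schema.field(physical_index);
            // Re-point the column at its index in the physical schema.
            let column = Arc::new(Column::new(column.name(), physical_index));
            if physical_field.data_type() == logical_field.data_type() {
                Ok(column)
            } else {
                // Types differ on disk: cast the column up to the logical type.
                // (The PR later tries to push such casts onto literals instead.)
                Ok(Arc::new(CastExpr::new(
                    column,
                    logical_field.data_type().clone(),
                    None,
                )))
            }
        }
    }
}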

@adriangb (PR author) commented Jun 23, 2025

Thanks for the thoughtful reply. An important point that I maybe should have mentioned: we're actively working on projection / selection pushdown which would involve pushing down expressions into the scan (#14993), thus we would evaluate the expressions and materialize the record batches in the output.

Basically what I see happening:

  • The scan gets a Vec<Arc<dyn PhysicalExpr>> representing the projection expression for each output column. One question that I don't know the answer to is what level of projection it should get, i.e. if the query is select a + b should it get [a+b] or [a,b] with a ProjectionExec above that does the a+b part. I think the answer is to actually pass down a+b but I'm not sure.
  • We do these rewrites to adapt to the file schema, calling the above referenced hooks in the process.
  • We call something like projection.iter().flat_map(collect_columns).map(|c| c.index()).sorted().dedup().collect_vec() to get a Vec<usize> projection (see the sketch below). Might need to do something fancier, I guess whatever ProjectionExec does.
  • We read those columns from the file producing a RecordBatch with all of the projected columns.
  • We evaluate each expression on the RecordBatch (or whatever ProjectionExec does) and that's our actual output.

So some questions for nested structs are: is it going to be a pain to adapt the nested structs via projection expressions? Or is it going to be ~ the same as doing it at the SchemaAdapter / physical column level?
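
A sketch of the last three bullets of that flow, under a couple of assumptions: the projection expressions have already been rewritten against the file schema, collect_columns is datafusion_physical_expr::utils::collect_columns, and ColumnarValue::into_array returns a Result as in recent DataFusion versions. The function names here are hypothetical:

use std::sync::Arc;

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion_common::Result;
use datafusion_physical_expr::utils::collect_columns;
use datafusion_physical_expr::PhysicalExpr;

/// Which column indices of the file schema actually need to be read.
fn required_file_columns(exprs: &[Arc<dyn PhysicalExpr>]) -> Vec<usize> {
    let mut indices: Vec<usize> = exprs
        .iter()
        .flat_map(collect_columns)
        .map(|c| c.index())
        .collect();
    indices.sort_unstable();
    indices.dedup();
    indices
}

/// Given already-rewritten per-column projection expressions and a batch read
/// from the file, produce the output batch by evaluating each expression.
fn project_batch(
    exprs: &[Arc<dyn PhysicalExpr>],
    output_schema: SchemaRef,
    file_batch: &RecordBatch,
) -> Result<RecordBatch> {
    let columns = exprs
        .iter()
        .map(|e| e.evaluate(file_batch)?.into_array(file_batch.num_rows()))
        .collect::<Result<Vec<_>>>()?;
    Ok(RecordBatch::try_new(output_schema, columns)?)
}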

@xudong963 self-requested a review — June 23, 2025 12:10
@alamb (Contributor) commented Jun 23, 2025

But we’d still need an array-level counterpart to actually materialize those null nested fields in the RecordBatch when we call map_batch.

FWIW I think this is one mechanism to turn the expression into an array (you just need to evaluate the PhysicalExpr against a RecordBatch):

@alamb (Contributor) commented Jun 23, 2025

I would personally recommend proceeding in parallel with the two approaches, ensuring there are good end to end tests (.slt) -- and then if we find that the projection pushdown / rewriting code can subsume the schema adapter code we could make a PR to do that 🤔

@alamb (Contributor) left a comment:

Looks great to me -- thank you @adriangb

@@ -248,10 +248,25 @@ impl FileOpener for ParquetOpener {
}
}

let predicate = predicate
.map(|p| {
PhysicalExprSchemaRewriter::new(
Contributor:

😍

);
}
// If the column is missing from the physical schema fill it in with nulls as `SchemaAdapter` would do.
// TODO: do we need to sync this with what the `SchemaAdapter` actually does?
Contributor:

I recommend ensuring this TODO is covered by a ticket and then adding a reference to the ticket here

adriangb (PR author):

For my own sake: the point here is that SchemaAdapter is a trait, so it can do arbitrary things, not just replace with nulls (that's what the default implementation does). It's super annoying to have to implement SchemaAdapterFactory and all three (IIRC) of the traits. I'd propose that here we just add hooks that follow the TreeNode APIs and make this a builder-style struct.


let partition_fields =
vec![Arc::new(Field::new("partition_col", DataType::Utf8, false))];
let partition_values = vec![ScalarValue::Utf8(Some("test_value".to_string()))];
Contributor:

You can make this less verbose like this if you want

Suggested change:
- let partition_values = vec![ScalarValue::Utf8(Some("test_value".to_string()))];
+ let partition_values = vec![ScalarValue::from("test_value")];

adriangb (PR author):

Hmm, I agree that is more concise, but in this case, since we are testing casting and such, I think it's better to be explicit. I can see someone being very confused in the future as to why a test is failing and not realizing that ScalarValue::from("foo") produces Utf8 and not e.g. Utf8View.

}

#[test]
fn test_rewrite_column_with_type_cast() -> Result<()> {
Contributor:

I recommend adding a test for a more sophisticated expression (like (a + 5.0::int64) || (c > 0.0::float)) to ensure that the recursive rewrite case is covered

@adriangb (PR author) commented Jun 24, 2025

I opened #16528 to track further ideas / steps. For now I'd like to merge this so that I can iterate on the rest of the work 😄

@adriangb (PR author):

#16530 will be easy to incorporate into this work, completely eliminating what are currently expensive casts.

@kosiew (Contributor) commented Jun 24, 2025

Adding notes for future reference:


Summary: Adapting Filter Expressions to File Schema During Parquet Scan


Background & Goal

The PR adapts filter (and eventually projection) expressions to each file's physical schema during the Parquet scan, moving one step closer to #15780 and immediately unblocking other work such as #15057.

Key Concepts

1. Expression Rewriting (Pushdown Adaptation)

  • Rewrites filter and projection expressions to align with the file’s physical schema.
  • Examples:
    • If an expression refers to a nested field foo.baz that is missing on disk → rewrite to lit(NULL).
    • If a field has different physical type on disk vs. logical schema → add casts.
  • This rewriting ensures that predicate pushdown logic and filters do not error out when the on-disk schema differs from the logical schema.
  • Expression rewriting happens before reading data and uses the physical schema to safely prune row groups.

2. Data Adaptation (Batch-Level Reshaping)

  • After reading a RecordBatch or arrays from Parquet, reshape them to match the logical table schema.
  • Actions include:
    • Adding null arrays for missing nested fields (nested struct imputation).
    • Dropping columns no longer part of the logical schema.
    • Recursively casting nested struct types to match the logical type.
  • This ensures downstream operators receive data shaped exactly as expected in the query, despite schema evolution.
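
A minimal sketch of this batch-level adaptation for flat schemas, using only stock arrow-rs APIs (the function name adapt_batch_to_logical_schema is illustrative; DataFusion's default SchemaAdapter does roughly this with extra bookkeeping, and nested structs would need a recursive variant like cast_struct_column):

use std::sync::Arc;

use arrow::array::{new_null_array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Reshape a batch read with the file schema so its columns line up with the
/// logical (table) schema: fill missing columns with nulls, cast mismatched
/// types, and drop columns the logical schema no longer contains.
fn adapt_batch_to_logical_schema(
    file_batch: &RecordBatch,
    logical_schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    let num_rows = file_batch.num_rows();
    let columns = logical_schema
        .fields()
        .iter()
        .map(|field| match file_batch.column_by_name(field.name()) {
            // Present in the file: cast if the on-disk type differs.
            Some(array) if array.data_type() != field.data_type() => {
                cast(array, field.data_type())
            }
            Some(array) => Ok(Arc::clone(array)),
            // Missing from the file: materialize a null column of the logical type.
            None => Ok(new_null_array(field.data_type(), num_rows)),
        })
        .collect::<Result<Vec<ArrayRef>, ArrowError>>()?;
    // Columns present on disk but absent from the logical schema are simply not copied.
    RecordBatch::try_new(logical_schema, columns)
}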

Main Discussion Points

  • Proposed Approach — Introduce a PhysicalExprSchemaRewriter builder to adapt expressions to the file schema during pruning/scanning.
  • Nested Struct Imputation — Expression-only rewrites are limited for nested structs since they do not modify the actual data arrays.
  • Data vs. Expression Adaptation — Expression rewriting is great for pushdown, but batch-level adapters are needed for correct, shaped data.
  • Complementary Approach — Use expression rewriting for filters/projections plus array-level adapters (e.g. cast_struct_column) to reshape in-memory data.
  • Projection Pushdown Scenario — A scan can receive full projection expressions (e.g. a + b), which get adapted and evaluated on the RecordBatch, producing the final output.
  • Risks of Expression-Only Rewrites — No effect on RecordBatch structure; limited scope (only predicates and pruning); risk of code duplication; complex handling for deeply nested types; possibly poorer performance due to repeated expression rewrites.
  • Potential Benefits of Expression-Based Rewrites — Cleaner pruning path, simpler code, no fake batches for evaluation, reusable visitor pattern.

Diagram: Data Adaptation vs. Expression Rewriting Flow

  Query logical schema (table)              File physical schema (Parquet file)
              |                                         |
              |      expression rewriting adapts to     |
              | <---------------------------------------+
              v
  Filter / projection expressions (e.g. col("foo.b") > 5)
              |
              |  read the projected columns from the file
              v
  Raw RecordBatch with the file schema (missing nested fields, etc.)
              |
              |  data adaptation (cast_struct_column, schema adapters)
              v
  Final RecordBatch shaped to the logical schema
  (null arrays filled in for missing columns, nested structs corrected)
              |
              v
  Query execution continues using the clean, corrected data

Takeaways

  • Expression rewriting and data adaptation are complementary:
    • Expression rewriting is for making predicate filtering safe and efficient at the file level.
    • Data adaptation ensures the actual data arrays match the logical schema exactly for query execution.
  • For nested structs, array-level reshaping remains essential to create proper null arrays for missing nested fields.
  • The current proposal introduces a reusable expression rewriting builder simplifying filter pushdown and helping unblock further features.
  • The discussion encourages a two-pronged approach:
    1. Refactor pruning/filter handling to use expression rewriter.
    2. Maintain and improve array-level schema adapters for final batch shaping.
  • Projection pushdown integration involves the scan receiving full projection expressions, rewriting them, reading the required columns, and then evaluating the expressions over the batch.

Future Directions

Follow-up ideas are tracked in #16528; #16530 should eliminate the remaining expensive casts, and projection pushdown (#14993) is expected to build on the same rewriting machinery.

Summary

The issue centers on improving how DataFusion adapts filter expressions and projections during Parquet scans when the on-disk file schema differs from the logical query schema — especially for nested structs. The key is to separate expression rewriting (for pushdown safety) from actual batch data adaptation (for correctness), using a new builder abstraction for expression rewrites. This approach unblocks further optimizations and schema evolution support while keeping code maintainable and extensible.


@adriangb (PR author):

@kosiew I'm not sure I agree with the conclusions there. Why can't we use expressions to do the schema adapting during the scan? It's very possible, as @alamb pointed out in #16461 (comment), to feed a RecordBatch into an expression and get back a new array. So unless I'm missing something I don't think these are correct:

Expression rewriting is great for pushdown but batch-level adapters are needed for correct, shaped data.
No effect on RecordBatch structure.
Limited scope (only predicates and pruning).

Possibly poorer performance due to repeated expression rewrites.
There are no more expression rewrites than there are SchemaAdapters created. Those aren't cached either and are created for each file.

I'll put together an example to show how predicate rewrites can be used to reshape data. But also FWIW that's exactly how ProjectionExec works.

@kosiew (Contributor) commented Jun 24, 2025

@adriangb ,
Sorry, it was not my intention to presume the conclusions.

I do look forward to a solution that handles schema adaptation in one pass.

@adriangb (PR author) commented Jun 24, 2025

@kosiew could you take a look at 32725dd? I think this fits in well with the idea of #14993: we are quite literally moving work from ProjectionExec into the scan phase (with the added benefit that we can do smart things with avoiding casts as well as using file-specific optimizations like shredded variant columns).

Complex handling for deeply nested types.

I do think this is a concern. I'm not sure how hard it would be to actually implement, but it's theoretically very possible, and I think we should be able to make it easy to implement with some elbow grease / the right helpers and abstractions.

@kosiew (Contributor) commented Jun 25, 2025

hi @adriangb

could you take a look at 32725dd?

👍👍👍
The new tests in PhysicalExprSchemaRewriter’s suite—including your “test_adapt_batches” example—do demonstrate that we can:

  1. Rewrite a projection (or filter) against a physical RecordBatch,

  2. Evaluate those rewritten PhysicalExprs on the old batch to

  3. Produce a brand-new RecordBatch that (a) injects nulls for missing top-level columns, (b) applies casts, and (c) drops extra columns—all in one go.

So yes, for flat schemas and simple projections/filters, we can entirely sidestep a separate map_batch / cast_struct_column step by:

  • Generating an expression per target column (either col(...) if present or lit(NULL) if absent),
  • Letting the engine “evaluate” those expressions into new arrays, and
  • Bundling them into a fresh RecordBatch.

Where the rewrite-only approach shines

  • Simplicity for top-level columns. We only need col + lit(NULL) + CastExpr.
  • Unified code path: predicates and projections both go through the same rewriter + evaluator.
  • Less bespoke iterator logic: no custom StructArray walks, no recursive field-matching loops.

⚠️ Where the schema-adapter approach still wins

  1. Deeply nested structs

    • There’s no built-in “struct constructor” expression in DataFusion’s evaluator that I know of.
      Our rewrite + batch_project hack only handles top-level arrays. We can’t easily say
      “build a StructArray whose fields are (col("a.b"), lit(NULL), cast(col("a.c"), …))” purely in expression form.
  2. Performance

    • Expression evaluation involves building ArrayRefs by walking millions of rows through the PhysicalExpr vtable.
    • The adapter’s cast_struct_column does one recursive scan through each StructArray's memory, which is far more cache-friendly for bulk columnar operations.
  3. Full schema fidelity

    • The rewrite test only demonstrates:

      • Drop “extra” columns,
      • Inject null for missing top-level columns,
      • Cast primitive types.
    • It doesn’t cover:

      • Adding a new nested struct (we’d need to build that StructArray via expressions we don’t have),
      • Recursively updating sub-children,
      • Preserving null bit-maps across nested levels.
> Complex handling for deeply nested types.
> I do think this is a concern, I'm not sure how hard it would be to actually implement, but it's theoretically very possible

Why it’s possible but a lot of work (and a performance risk)

  1. Engineering effort

    • We’ll be growing the rewriter from a column/cast replacer into a full recursive schema-walker + struct-node constructor.

    • We’ll have to handle every corner case: non-nullable nested fields, mixed present+missing children, ordering of fields, metadata, etc.

  2. Runtime performance

    • Every invocation in the evaluator will loop over rows to build child arrays, then pack them into a StructArray.

    • That’s orders of magnitude slower than a tight cast_struct_column implementation that does one bulk pass through the existing StructArray buffers.

I hope I don't sound like I am dead against the rewrite approach.
It is more like you have shown me a puzzle that I don't know how to solve.

What I would love to hear

Here's a simpler and faster approach .....

@adriangb (PR author):

Every invocation in the evaluator will loop over rows to build child arrays, then pack them into a StructArray

As far as I know a PhysicalExpr can operate at the array level. For example, lit(ScalarValue::Null).into_array(N) will end up calling new_null_array after a couple of function call hops:

ScalarValue::Null => new_null_array(&DataType::Null, size),

I think 3-4 function call hops would be an issue if they happened for every row, but this is happening at the array level - it's going to be inconsequential compared to the IO, etc.

There’s no built-in “struct constructor” expression in DataFusion

Isn't there https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct? I'm sure we can call that ourselves without SQL:

-- name the first field `field_a`
select struct(a as field_a, b) from t;
+--------------------------------------------------+
| named_struct(Utf8("field_a"),t.a,Utf8("c1"),t.b) |
+--------------------------------------------------+
| {field_a: 1, c1: 2}                              |
| {field_a: 3, c1: 4}                              |
+--------------------------------------------------+

(The example above is pasted from the doc attribute on DataFusion's struct scalar function, whose source continues:)

argument(
    name = "expression1, expression_n",
    description = "Expression to include in the output struct. Can be a constant, column, or function, any combination of arithmetic or string operators."
)
)]
#[derive(Debug)]
pub struct StructFunc {
    signature: Signature,
    aliases: Vec<String>,
}

@adriangb (PR author):

Thank you very much for the feedback @kosiew 🙏🏻! I don't mean to disregard it, you make great points, but I think they are surmountable! Let's move forward with this and keep iterating on the approaches in parallel. If it turns out that this approach won't work for projection evaluation, it's still clearly a win for predicate evaluation.

@kosiew (Contributor) commented Jun 26, 2025

@adriangb

@kosiew any objections to merging this?

Nope.
I am excited to see the solution to the puzzle.

@adriangb (PR author):

Great, I'll address #16461 (comment) and then I think this will be ready to merge!

@github-actions bot added the core (Core DataFusion crate) label — Jun 26, 2025
@adriangb (PR author):

I opened #16565 to track handling struct column sub-field adaptation specifically

Labels: core (Core DataFusion crate), datasource (Changes to the datasource crate), physical-expr (Changes to the physical-expr crates)