adapt filter expressions to file schema during parquet scan #16461
Conversation
return exec_err!(
    "Non-nullable column '{}' is missing from the physical schema",
    column.name()
);
Might be useful to include some sort of file identifier here?
agreed -- that might also be a nice use case for a builder-style struct
let new_predicate = PhysicalExprSchemaRewriter::new(physical_file_schema, logical_file_schema)
    .with_identifier(file_name)
    .convert(predicate)?;
🤔
let stream = opener.open(make_meta(), file.clone()).unwrap().await.unwrap();
let (num_batches, num_rows) = count_batches_and_rows(stream).await;
assert_eq!(num_batches, 1);
assert_eq!(num_rows, 1);
This assertion fails on main: all 3 rows pass because the row filter cannot handle the partition columns. This PR somewhat coincidentally happens to allow the row filter to handle predicates that depend on both partition and data columns!
Thank you @adriangb -- other than the removed test I think this PR makes sense to me and could be merged
I think it would be useful to refactor this code a bit and invest in testing infrastructure as we proceed towards adding nicer features (like unwrapping casts on columns for example)
/// Preference is always given to casting literal values to the data type of the column
/// since casting the column to the literal value's data type can be significantly more expensive.
/// Given two columns the cast is applied arbitrarily to the first column.
pub fn cast_expr_to_schema(
I think this is more general than just parquet, so perhaps we could move it into the datafusion-physical-schema crate, or perhaps somewhere near the physical planner
I also think it would really help (perhaps as a follow on PR) to add some more specific unit tests.
I wonder if an API like the following makes sense:
struct PhysicalExprSchemaRewriter {
    ...
}

// rewrite a predicate
let new_predicate = PhysicalExprSchemaRewriter::new(physical_file_schema, logical_file_schema)
    // optionally provide partition values
    .with_partition_columns(partition_fields, partition_values)
    .convert(predicate)?;
Then I think writing unit tests would be easy and we could adapt / extend the code over time -- and it would set us up for adapting more sophisticated expressions like field extraction...
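For illustration, a minimal sketch of what one such unit test could look like, assuming the builder lands with roughly the names proposed above (PhysicalExprSchemaRewriter and convert are the proposed, not-yet-existing API; the rest is standard DataFusion):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};
use datafusion_common::Result;
use datafusion_physical_expr::expressions::Column;
use datafusion_physical_expr::PhysicalExpr;
// use ...::PhysicalExprSchemaRewriter; // hypothetical builder from the comment above

#[test]
fn rewrite_missing_nullable_column_to_null() -> Result<()> {
    // Logical (table) schema has `a` and `b`, but the file only wrote `a`.
    let logical_file_schema = Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Int32, true),
    ]);
    let physical_file_schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);

    // Predicate references the missing (nullable) column `b`.
    let predicate: Arc<dyn PhysicalExpr> = Arc::new(Column::new("b", 1));

    let rewritten = PhysicalExprSchemaRewriter::new(&physical_file_schema, &logical_file_schema)
        .convert(predicate)?;

    // `b` is nullable and missing from the file, so it should be replaced by a NULL literal,
    // mirroring what the default SchemaAdapter does at the batch level.
    assert_eq!(rewritten.to_string(), "NULL");
    Ok(())
}
```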
Sounds good to me, I will work on this next! Agreed on the unit tests 😄
assert!(candidate.is_none());
}

#[test]
fn test_filter_type_coercion() {
Is there any way to keep this test (or something like it) that shows reading with a predicate of a different schema is correctly coerced?
I think so - I'll have to make it more e2e since it is no longer specific to the row filter. We have some other similar tests; I'll work this in with those
if let Some(column) = expr.as_any().downcast_ref::<expressions::Column>() {
    let logical_field = match logical_file_schema.field_with_name(column.name()) {
nit: we could reduce a level of nesting with something like
let Some(column) = expr.as_any().downcast_ref::<expressions::Column>() else {
    return Ok(Transformed::no(expr));
};
}
// If the column is not found in the logical schema, return an error
// This should probably never be hit unless something upstream broke, but nonetheless it's better
// for us to return a handleable error than to panic / do something unexpected.
👍
// If the logical field and physical field are different, we need to cast
// the column to the logical field's data type.
// We will try later to move the cast to literal values if possible, which is computationally cheaper.
👍
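To make the "move the cast onto the literal" point concrete: if the file stores a as Int32 but the logical schema says Int64, casting the literal once is much cheaper than casting the whole column for every batch. A rough sketch of the two shapes, built directly with the existing physical expression types (the exact output of the rewriter may differ):

```rust
use std::sync::Arc;
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::{BinaryExpr, CastExpr, Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

fn cast_placement_example() -> (Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>) {
    // Naive adaptation: cast the whole Int32 column up to Int64 for every row.
    //   CAST(a@Int32 AS Int64) > 5i64
    let naive: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        Arc::new(CastExpr::new(
            Arc::new(Column::new("a", 0)),
            DataType::Int64,
            None,
        )),
        Operator::Gt,
        Arc::new(Literal::new(ScalarValue::Int64(Some(5)))),
    ));

    // Preferred rewrite: cast the literal once, down to the file's physical type.
    //   a@Int32 > 5i32
    let preferred: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        Arc::new(Column::new("a", 0)),
        Operator::Gt,
        Arc::new(Literal::new(ScalarValue::Int32(Some(5)))),
    ));

    (naive, preferred)
}
```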
@alamb I've created the builder, moved the implementation and added some unit tests
My hope with this work is that we can:
@kosiew I'm curious what you think about this. Would it be possible to implement the nested struct imputation work you're doing with this approach?
Just for fun I opened pydantic#31 to see how hard it would be to incorporate #15780 (comment), not too bad! Most of the code could be shared with the logical layer with some refactoring
@adriangb
Do you mean reusing the PhysicalExprSchemaRewriter machinery to drive nested-struct imputation? Here are some disadvantages of the rewrite-centric approach versus the more data-centric adapter approach:
So, could we bolt nested-struct imputation onto this rewriter? Technically yes: we could extend rewrite_column so that, whenever we see a Column referring to foo.bar.baz that's missing in the physical schema, we generate a Literal::Null of the full nested type (constructing the proper StructValue). But we'd still need an array-level counterpart to actually materialize those null nested fields in the RecordBatch when we call map_batch.

In practice, the two approaches complement each other: use the rewriter to handle predicate and projection expressions (so filters and column references don't blow up), and continue to rely on cast_struct_column + NestedStructSchemaAdapter to adapt the actual batch data, filling in null arrays and doing recursive casts. That way we get the best of both worlds: clean, centralized expression rewriting for pushdown, and robust array-level marshalling for the final tables. 😊

Why the two-pronged approach makes sense

The PhysicalExprSchemaRewriter is perfect for rewriting predicates and projections so they don't blow up when the on-disk schema diverges. But once you've read that Parquet row group into memory, you still need to reshape the StructArray itself: filling in null arrays for new nested fields, dropping old ones, recursively casting types.

Swapping out the old SchemaMapper for the rewriter in your pruning path is a great win: much less boilerplate, better separation of concerns, and no more "fake record batches" just to evaluate a filter. You can remove all of that pruning-specific adapter code and lean on the rewriter's tree visitor.

Handling a top-level missing column is easy in the rewriter (you just rewrite col("foo") to lit(NULL)), but handling col("a.b.c") where b itself is a new struct, and c is a new field inside it, is far more natural in a recursive cast_struct_column that operates on StructArray children.
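For the array-level half of this, a minimal arrow-rs sketch of filling in a missing nested field with nulls (the function name pad_struct_with_null_child is made up; this is the kind of work cast_struct_column would do):

```rust
use std::sync::Arc;
use arrow::array::{new_null_array, Array, ArrayRef, Int64Array, StructArray};
use arrow::datatypes::{DataType, Field};

/// The file wrote struct `a` with only child `x`, but the table schema also expects `a.y: Utf8`.
/// Rebuild the struct with an all-null `y` child so the batch matches the logical schema.
fn pad_struct_with_null_child(x: ArrayRef) -> StructArray {
    let len = x.len();
    let null_y = new_null_array(&DataType::Utf8, len);
    StructArray::from(vec![
        (Arc::new(Field::new("x", DataType::Int64, true)), x),
        (Arc::new(Field::new("y", DataType::Utf8, true)), null_y),
    ])
}

fn main() {
    let x: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let padded = pad_struct_with_null_child(x);
    assert_eq!(padded.num_columns(), 2);
    assert_eq!(padded.column(1).null_count(), 3);
}
```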
Putting more words to how I understand pushdown and data adaptation:
At the end of this step, no data has actually been materialized—you’ve only modified the expression you use to decide what to read.
Thanks for the thoughtful reply. An important point that I maybe should have mentioned: we're actively working on projection / selection pushdown which would involve pushing down expressions into the scan (#14993), thus we would evaluate the expressions and materialize the record batches in the output. Basically what I see happening:
So some questions for nested structs are: is it going to be a pain to adapt the nested structs via projection expressions? Or is it going to be ~ the same as doing it at the SchemaAdapter / physical column level?
FWIW I think this is one mechanism to turn the expression into an array (you just need to evaluate the expression into a PhysicalExpr):
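Presumably this means evaluating the PhysicalExpr against a RecordBatch read from the file; a minimal sketch (note that older DataFusion versions had into_array return ArrayRef directly rather than a Result):

```rust
use std::sync::Arc;
use arrow::array::ArrayRef;
use arrow::record_batch::RecordBatch;
use datafusion_common::Result;
use datafusion_physical_expr::PhysicalExpr;

/// Evaluate a (rewritten) expression against a batch read from the file,
/// yielding the column shaped the way the logical schema expects it.
fn materialize_column(expr: &Arc<dyn PhysicalExpr>, batch: &RecordBatch) -> Result<ArrayRef> {
    // `evaluate` returns a ColumnarValue (array or scalar); expand scalars to a full array.
    expr.evaluate(batch)?.into_array(batch.num_rows())
}
```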
I would personally recommend proceeding in parallel with the two approaches, ensuring there are good end to end tests (.slt) -- and then if we find that the projection pushdown / rewriting code can subsume the schema adapter code we could make a PR to do that 🤔
Looks great to me -- thank you @adriangb
@@ -248,10 +248,25 @@ impl FileOpener for ParquetOpener {
        }
    }

    let predicate = predicate
        .map(|p| {
            PhysicalExprSchemaRewriter::new(
😍
);
}
// If the column is missing from the physical schema fill it in with nulls as `SchemaAdapter` would do.
// TODO: do we need to sync this with what the `SchemaAdapter` actually does?
I recommend ensuring this TODO is covered by a ticket and then adding a reference to the ticket here
For my own sake: the point here is that SchemaAdapter is a trait, so it can do arbitrary things, not just replace with nulls (that's what the default implementation does). It's super annoying to have to implement SchemaAdapterFactory and all 3 (IIRC) of the traits. I'd propose that what we do here is just add hooks that follow the TreeNode APIs and make it a builder-style struct.
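A purely hypothetical sketch of what such a hook on the builder could look like (with_missing_column_hook does not exist today; the closure just mirrors a TreeNode-style rewrite, and the snippet is not runnable as-is since the builder API is still the proposal above):

```rust
use std::sync::Arc;
use datafusion_common::ScalarValue;
use datafusion_physical_expr::expressions::{Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

// Hypothetical: called whenever a predicate column has no match in the physical file schema.
// Returning None falls back to the default behavior (a null literal for nullable columns).
let rewriter = PhysicalExprSchemaRewriter::new(physical_file_schema, logical_file_schema)
    .with_missing_column_hook(Arc::new(|column: &Column| -> Option<Arc<dyn PhysicalExpr>> {
        (column.name() == "row_id")
            .then(|| Arc::new(Literal::new(ScalarValue::UInt64(Some(0)))) as Arc<dyn PhysicalExpr>)
    }));
```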
let partition_fields =
    vec![Arc::new(Field::new("partition_col", DataType::Utf8, false))];
let partition_values = vec![ScalarValue::Utf8(Some("test_value".to_string()))];
You can make this less verbose like this if you want
// instead of:
let partition_values = vec![ScalarValue::Utf8(Some("test_value".to_string()))];
// suggested:
let partition_values = vec![ScalarValue::from("test_value")];
Hmm, I agree that is more concise, but in this case, since we are testing casting and such, I think it's better to be explicit. I can see someone being very confused in the future as to why a test is failing and not realizing that ScalarValue::from("foo") produces Utf8 and not e.g. Utf8View.
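A tiny test that pins down that subtlety (the test name is made up; ScalarValue::data_type and the From<&str> impl are existing APIs):

```rust
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

#[test]
fn scalar_from_str_is_utf8_not_utf8view() {
    // `From<&str>` builds a Utf8 scalar, not Utf8View, which is why the explicit
    // `ScalarValue::Utf8(Some(...))` spelling is arguably clearer in a casting test.
    assert_eq!(ScalarValue::from("test_value").data_type(), DataType::Utf8);
    assert_eq!(
        ScalarValue::Utf8(Some("test_value".to_string())).data_type(),
        DataType::Utf8
    );
}
```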
}

#[test]
fn test_rewrite_column_with_type_cast() -> Result<()> {
I recommend adding a test for a more sophisticated expression (like (a + 5.0::int64) || (c > 0.0::float)) to ensure that the recursive rewrite case is covered
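For reference, a sketch of how such a compound predicate could be assembled for a unit test from the existing physical expression types (column indices and literal types are illustrative, and the cast-laden expression from the comment is simplified here to (a + 5) OR (c > 0.0)):

```rust
use std::sync::Arc;
use datafusion_common::ScalarValue;
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::{BinaryExpr, Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

/// Builds `(a + 5) OR (c > 0.0)` so the rewriter's recursive descent gets exercised
/// on nested binary expressions, not just a bare column comparison.
fn sample_compound_predicate() -> Arc<dyn PhysicalExpr> {
    let a_plus_5: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        Arc::new(Column::new("a", 0)),
        Operator::Plus,
        Arc::new(Literal::new(ScalarValue::Int64(Some(5)))),
    ));
    let c_gt_0: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        Arc::new(Column::new("c", 2)),
        Operator::Gt,
        Arc::new(Literal::new(ScalarValue::Float64(Some(0.0)))),
    ));
    Arc::new(BinaryExpr::new(a_plus_5, Operator::Or, c_gt_0))
}
```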
I opened #16528 to track further ideas / steps. For now I'd like to merge this so that I can iterate on the rest of the work 😄
#16530 can easily be incorporated into this work, completely eliminating what are currently expensive casts
Adding notes for future reference:

Summary: Adapting Filter Expressions to File Schema During Parquet Scan

Background & Goal
Key Concepts

1. Expression Rewriting (Pushdown Adaptation)
2. Data Adaptation (Batch-Level Reshaping)
Main Discussion Points
Diagram: Data Adaptation vs. Expression Rewriting Flow
Takeaways
Future Directions
Summary

The issue centers on improving how DataFusion adapts filter expressions and projections during Parquet scans when the on-disk file schema differs from the logical query schema, especially for nested structs. The key is to separate expression rewriting (for pushdown safety) from actual batch data adaptation (for correctness), using a new builder abstraction for expression rewrites. This approach unblocks further optimizations and schema evolution support while keeping code maintainable and extensible.
@kosiew I'm not sure I agree with the conclusions there. Why can't we use expressions to do the schema adapting during the scan? It's very possible, as @alamb pointed out in #16461 (comment), to feed a RecordBatch into an expression and get back a new array. So unless I'm missing something I don't think these are correct:
I'll put together an example to show how predicate rewrites can be used to reshape data. But also FWIW that's exactly how ProjectionExec works.
@adriangb, I do look forward to a solution that handles schema adaptation in one pass.
@kosiew could you take a look at 32725dd? I think this fits in well with the idea of #14993: we are quite literally moving work from ProjectionExec into the scan phase (with the added benefit that we can do smart things with avoiding casts as well as using file-specific optimizations like shredded variant columns).
I do think this is a concern. I'm not sure how hard it would be to actually implement, but it's theoretically very possible, and I think we should be able to make it easy to implement with some elbow grease / the right helpers and abstractions.
hi @adriangb
👍👍👍
So yes, for flat schemas and simple projections/filters, we can entirely sidestep a separate SchemaAdapter.
✅ Where the rewrite-only approach shines
Why it’s possible but a lot of work (and a performance risk)
I hope I don't sound like I am dead against the rewrite approach.

What I would love to hear

Here's a simpler and faster approach .....
As far as I know a PhysicalExpr can operate at the array level. For example: datafusion/datafusion/common/src/scalar/mod.rs, line 2672 in e3d3302.
I think 3-4 function call hops would be an issue if it happened for every row, but it's happening at the array level, so it's going to be inconsequential compared to the IO happening, etc.
Isn't there https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct? I'm sure we can call that ourselves without SQL: datafusion/datafusion/functions/src/core/struct.rs, lines 53 to 71 in e3d3302.
Thank you very much for the feedback @kosiew 🙏🏻! I don't mean to disregard it, you make great points, but I think they are surmountable! Let's move forward with this and keep iterating on the approaches in parallel. If it turns out that this approach won't work for projection evaluation, it's still clearly a win for predicate evaluation.
This reverts commit 4cef38f.
Great, I'll address #16461 (comment) and then I think this will be ready to merge!
I opened #16565 to track handling struct column sub-field adaptation specifically
The idea here is to move us one step closer to #15780 although there is still work to do (e.g. #15780 (comment)).
My goal is that this immediately unblocks other work (in particular #15057) while being a relatively small change that we can build upon with more optimizations, etc.