[SPARK-51575][PYTHON] Combine Python Data Source pushdown & plan read workers #50340

wengh · 2025-03-20T20:16:48Z

Follow-up to #49961

What changes were proposed in this pull request?

As pointed out in #49961 (comment), by the time filter pushdown runs we already have enough information to plan the read partitions as well. This PR therefore changes the filter pushdown worker to also return the partitions, reducing the number of exchanges between Python and Scala.

Changes:

Extract the part of plan_data_source_read.py that sends the partitions and the read function to the JVM.
Reuse the extracted logic in data_source_pushdown_filters.py so that the filter pushdown worker also sends the partitions and the read function.
Update the Scala code accordingly.
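
For readers unfamiliar with the flow, the changes listed above can be pictured with a minimal, self-contained sketch. It is not the code in this PR: `ToyReader`, `EqualTo`, `send_partitions_and_read_func`, `pushdown_and_plan`, and the list-based `outbox` are hypothetical stand-ins, and only the method names `pushFilters`, `partitions`, and `read` are meant to mirror the shape of a Python Data Source reader. The point is that a single worker call answers the pushdown question and also returns what is needed to plan the read.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

Row = Tuple[int, str]
Message = Tuple[str, object]


@dataclass(frozen=True)
class EqualTo:
    """Hypothetical filter meaning `column == value`."""
    column: str
    value: object


class ToyReader:
    """Hypothetical reader; only the method names mirror a data source reader."""

    DATA: List[Row] = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]

    def __init__(self) -> None:
        self.pushed: List[EqualTo] = []

    def pushFilters(self, filters: List[EqualTo]) -> List[EqualTo]:
        # Keep filters on "id" for the scan; hand the rest back as unsupported.
        self.pushed = [f for f in filters if f.column == "id"]
        return [f for f in filters if f.column != "id"]

    def partitions(self) -> List[int]:
        # Two partitions: even ids (0) and odd ids (1).
        return [0, 1]

    def read(self, partition: int) -> Iterator[Row]:
        for row in self.DATA:
            if row[0] % 2 != partition:
                continue
            if all(row[0] == f.value for f in self.pushed):
                yield row


def send_partitions_and_read_func(reader: ToyReader, outbox: List[Message]) -> None:
    # Stand-in for the shared step extracted from the plan-read worker.
    outbox.append(("partitions", reader.partitions()))
    outbox.append(("read_func", reader.read))


def pushdown_and_plan(reader: ToyReader, filters: List[EqualTo]) -> List[Message]:
    # One round trip: report unsupported filters, then reuse the shared step.
    outbox: List[Message] = [("unsupported_filters", reader.pushFilters(filters))]
    send_partitions_and_read_func(reader, outbox)
    return outbox


reader = ToyReader()
reply = dict(pushdown_and_plan(reader, [EqualTo("id", 2), EqualTo("name", "c")]))
rows = [row for p in reply["partitions"] for row in reply["read_func"](p)]
```

In this toy setup the filter on `name` comes back as unsupported, while the filter on `id` is applied inside `read`, so no second worker call is needed to obtain the partitions and the read function.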

Why are the changes needed?

To improve Python Data Source performance when the filter pushdown configuration is enabled.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests in test_python_datasource.py

Was this patch authored or co-authored using generative AI tooling?

No

wengh · 2025-03-21T18:17:04Z

@cloud-fan @beliefer @allisonwang-db Please take a look at this follow-up to the Python Data Source filter pushdown PR #49961.

combine pushdown & plan read workers

a6d3590

github-actions bot added SQL PYTHON labels Mar 20, 2025

update docstring

60591cb

wengh changed the title ~~[WIP][SPARK-51575][PYTHON] Combine Python Data Source pushdown & plan read workers~~ [SPARK-51575][PYTHON] Combine Python Data Source pushdown & plan read workers Mar 21, 2025

wengh marked this pull request as ready for review March 21, 2025 00:59

wengh mentioned this pull request Mar 21, 2025

[SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources #49961

Closed
