
Iceberg BatchScan & SparkDistributedDataScan to support limit pushdown #13383

Open
@GPX99

Description

Feature Request / Improvement

Request to add limit pushdown to improve the performance of reading a large table by skipping the full batch scan (implemented in BatchScan and SparkDistributedDataScan).
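For context, Spark's DataSource V2 API already defines a hook for this: the SupportsPushDownLimit interface (Spark 3.3+). Below is a minimal sketch of how a scan builder could accept a pushed limit; the class name, the pushedLimit field, and the stub Scan are illustrative assumptions, not Iceberg's actual SparkScanBuilder code.

```java
// A minimal sketch, assuming Spark 3.3+ on the classpath. The class name,
// the pushedLimit field, and the stub Scan are hypothetical; Iceberg's real
// SparkScanBuilder would thread the limit into its task planning instead.
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.connector.read.SupportsPushDownLimit;
import org.apache.spark.sql.types.StructType;

class LimitAwareScanBuilder implements ScanBuilder, SupportsPushDownLimit {
  private int pushedLimit = -1; // -1 means no LIMIT was pushed down

  @Override
  public boolean pushLimit(int limit) {
    this.pushedLimit = limit;
    // Returning true accepts the limit; by default isPartiallyPushed()
    // is true, so Spark still applies LIMIT on top for correctness.
    return true;
  }

  @Override
  public Scan build() {
    int limit = pushedLimit;
    return new Scan() { // stub: real code would build the Iceberg scan
      @Override
      public StructType readSchema() {
        return new StructType(); // real code returns the table schema
      }

      @Override
      public String description() {
        return limit < 0 ? "scan" : "scan, pushed limit=" + limit;
      }
    };
  }
}
```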

How is this observed?
When running SELECT * FROM table_name LIMIT 1, Spark actually scans all the data in the table; the bigger the table, the longer the query takes.
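A minimal way to reproduce the observation, assuming a SparkSession already configured with the Iceberg Glue catalog (the table name is the example used in this report):

```java
// Minimal reproduction, assuming a SparkSession configured with the
// Iceberg Glue catalog; the table name is the example from this report.
import org.apache.spark.sql.SparkSession;

public class LimitScanRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("limit-pushdown-repro")
        .getOrCreate();

    // The formatted plan shows the BatchScan node with no limit in its
    // pushed-down info; the Spark UI then shows the full table as input.
    spark.sql("SELECT * FROM glue_catalog.lakehouse_bronze.table_name LIMIT 1")
        .explain("formatted");
  }
}
```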

For example,

(1) BatchScan glue_catalog.lakehouse_bronze.table_name
Output [51]: [ISTEST#69, LEADUUID#70, UPDATEDAT#71, ...etc]
glue_catalog.lakehouse_bronze.table_name (branch=null) [filters=, groupedBy=] <-- no limit pushdown is applied

Hence, the input size is large:
[Screenshot: query input size metrics showing the full table scanned]
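One possible shape for the improvement, sketched against Iceberg's public Table API: stop planning file scan tasks once the record counts of the files already planned cover the limit. planTasksForLimit is a hypothetical helper, not proposed code; note that recordCount() is a file's total row count, so delete files could make the live-row count lower than what is planned here.

```java
// A hedged sketch of the idea, not proposed Iceberg code: stop planning
// file scan tasks once the files already planned cover the limit.
// planTasksForLimit is a hypothetical helper.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class LimitPlanningSketch {
  static List<FileScanTask> planTasksForLimit(Table table, long limit) {
    List<FileScanTask> tasks = new ArrayList<>();
    long plannedRows = 0;
    try (CloseableIterable<FileScanTask> files = table.newScan().planFiles()) {
      for (FileScanTask task : files) {
        tasks.add(task);
        // recordCount() is the file's total rows; deletes may reduce
        // the live rows actually returned, so this is an upper bound.
        plannedRows += task.file().recordCount();
        if (plannedRows >= limit) {
          break; // enough rows planned; skip the rest of the table
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tasks;
  }
}
```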

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Labels: improvement (PR that improves existing functionality)