Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid multiple read requests on small parquet files #19127

Merged
merged 2 commits into from Sep 28, 2023

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Sep 22, 2023

Description

Read files below parquet.small-file-threshold size in a single file system request to avoid multiple small read requests

Additional context and related issues

This is similar to hive.orc.tiny-stripe-threshold in ORC reader

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Reduce read requests for scanning small parquet files. 
  The catalog configuration property `parquet.small-file-threshold` or the corresponding
  catalog session property can be used to change the default size of `3MB` below which files 
  will be read entirely. Setting this configuration to `0B` disables the feature. ({issue}`19127`)

@cla-bot cla-bot bot added the cla-signed label Sep 22, 2023
@github-actions github-actions bot added docs tests:hive hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Sep 22, 2023
@raunaqmorarka raunaqmorarka force-pushed the pqr-small-read branch 2 times, most recently from 56de0f8 to ac6e6e1 Compare September 23, 2023 01:48
@wendigo
Copy link
Contributor

wendigo commented Sep 23, 2023

Do you have any benchmarks @raunaqmorarka ?

@raunaqmorarka raunaqmorarka force-pushed the pqr-small-read branch 3 times, most recently from 5409456 to 7f7eed7 Compare September 25, 2023 02:18
@raunaqmorarka
Copy link
Member Author

@raunaqmorarka raunaqmorarka force-pushed the pqr-small-read branch 2 times, most recently from 04602b6 to 3b796e8 Compare September 25, 2023 06:28
@raunaqmorarka raunaqmorarka changed the title Avoid multiple reads of small parquet files Avoid multiple read requests on small parquet files Sep 25, 2023
@raunaqmorarka raunaqmorarka force-pushed the pqr-small-read branch 2 times, most recently from 5839229 to 1a4ad55 Compare September 27, 2023 07:08
Copy link
Contributor

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, within the tests of the object storage connectors, we use almost completely for reading Parquet content the memory implementation.
This doesn't reflect however the reality in production.
I'd recommend using the session without small file threshold by default (on the query runner setup) and only when explicitly needed make use of this feature.

Read files below parquet.small-file-threshold size
in a single file system request to avoid multiple small
read requests
@raunaqmorarka
Copy link
Member Author

raunaqmorarka commented Sep 28, 2023

As of now, within the tests of the object storage connectors, we use almost completely for reading Parquet content the memory implementation.
This doesn't reflect however the reality in production.
I'd recommend using the session without small file threshold by default (on the query runner setup) and only when explicitly needed make use of this feature.

The corresponding orc feature hive.orc.tiny-stripe-threshold has a higher threshold of 8MB and we still don't disable it for tests. The same buffering of small file can be found in LinePageSourceFactory as well and that's not even configurable.
We have unit test coverage for TrinoParquetDataSource in TestParquetDataSource and in other places, so it doesn't become unused in tests due to the change here.
I've also added parquet.small-file-threshold=100kB to the default hive product test config to improve coverage of typical case in TestParquet.

@raunaqmorarka raunaqmorarka merged commit 6941148 into trinodb:master Sep 28, 2023
86 checks passed
@raunaqmorarka raunaqmorarka deleted the pqr-small-read branch September 28, 2023 06:53
@github-actions github-actions bot added this to the 428 milestone Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector hudi Hudi connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

None yet

5 participants