[FEA] Control stripe size when writing ORC files #9261
rapids-bot bot
pushed a commit
that referenced
this issue
Oct 8, 2021
…ter (#9310)

Closes #9261

Adds the following API to the ORC writer:
- Set maximum stripe size, in bytes (minimum of 64KB);
- Set maximum stripe size, in rows (minimum of 512);
- Set the row index stride (minimum of 512).

Notes:
- If the stripe size is set lower than the row index stride, the row index stride is reduced to the stripe size.
- The row index stride is rounded down to a multiple of 8.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Yunsong Wang (https://github.com/PointKernel)
- Karthikeyan (https://github.com/karthikeyann)

URL: #9310
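The limits and notes in this commit message can be sketched in plain Python. This is an illustration of the stated rules only, not the cudf implementation: the function name `normalize_orc_options` and the constant names are hypothetical, "64KB" is assumed to mean 64 * 1024 bytes, the stripe size compared against the stride is assumed to be the one in rows, and rejecting values below the minimum with an error (rather than clamping them) is also an assumption.

```python
# Hypothetical constants mirroring the minimums listed in the commit message.
MIN_STRIPE_BYTES = 64 * 1024   # assumes "64KB" means 64 * 1024 bytes
MIN_STRIPE_ROWS = 512
MIN_ROW_INDEX_STRIDE = 512


def normalize_orc_options(stripe_size_bytes, stripe_size_rows, row_index_stride):
    """Validate the three settings and return them with the stride adjusted.

    Assumption: values below the stated minimums are rejected, not clamped.
    """
    if stripe_size_bytes < MIN_STRIPE_BYTES:
        raise ValueError("stripe size in bytes must be at least 64KB")
    if stripe_size_rows < MIN_STRIPE_ROWS:
        raise ValueError("stripe size in rows must be at least 512")
    if row_index_stride < MIN_ROW_INDEX_STRIDE:
        raise ValueError("row index stride must be at least 512")

    # Note 1: a stride larger than the stripe is reduced to the stripe size.
    stride = min(row_index_stride, stripe_size_rows)
    # Note 2: the stride is rounded down to a multiple of 8.
    stride -= stride % 8
    return stripe_size_bytes, stripe_size_rows, stride


if __name__ == "__main__":
    # A 515-row stripe caps the stride at 515, which then rounds down.
    print(normalize_orc_options(64 * 1024, 515, 10_000))
```

Because both minimums are 512, which is itself a multiple of 8, rounding down can never push a valid stride below the minimum in this sketch.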
This was referenced May 17, 2022
guiyanakuang
pushed a commit
to apache/orc
that referenced
this issue
May 24, 2022
### What changes were proposed in this pull request?

Add a row count limit config, "orc.stripe.row.count", to limit the number of rows in one stripe.

### Why are the changes needed?

For query engines like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split. In the current implementation of the ORC writer, the only config that can control the row count in a stripe is "orc.stripe.size". But a byte size is difficult to use for this purpose across different kinds of tables: for a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows; for a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows.

For Presto, a normal OLAP query reads only a subset of the table's columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may become too low. So, besides "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.

A similar config has been introduced to cudf (a GPU DataFrame library based on Apache Arrow): [rapidsai/cudf#9261](rapidsai/cudf#9261)

### How was this patch tested?

testStripeRowCountLimit was added. It can be run with the commands below:

```
cd java
./mvnw -Dtest=TestWriterImpl test
```
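The argument above (a byte limit alone yields very different row counts for wide and narrow tables) can be illustrated with a little arithmetic. The sketch below is a simplified model, not the ORC writer's logic: the function `rows_per_stripe` is hypothetical, it assumes every row has a fixed size, and it assumes a stripe is closed as soon as either limit is reached. The per-row sizes are made-up values chosen only to land near the 5000-row and 100000-row figures quoted in the description.

```python
def rows_per_stripe(bytes_per_row, max_stripe_bytes, max_stripe_rows=None):
    """Rows that fit in one stripe under a byte limit and an optional row limit.

    Simplified model: fixed-size rows, and a stripe is flushed once its
    buffered size reaches max_stripe_bytes or its row count reaches
    max_stripe_rows, whichever happens first.
    """
    # Ceiling division: the row that crosses the byte limit closes the stripe.
    by_bytes = max(1, -(-max_stripe_bytes // bytes_per_row))
    if max_stripe_rows is None:
        return by_bytes
    return min(by_bytes, max_stripe_rows)


if __name__ == "__main__":
    stripe_bytes = 64 * 1024 * 1024  # 64MB, as in the description
    wide_row = 13_422   # hypothetical bytes per row for a ~100-column table
    narrow_row = 671    # hypothetical bytes per row for a ~5-column table

    # Byte limit only: row counts differ by roughly 20x between the tables.
    print(rows_per_stripe(wide_row, stripe_bytes))
    print(rows_per_stripe(narrow_row, stripe_bytes))

    # Adding a row limit caps the narrow table without affecting the wide one.
    print(rows_per_stripe(wide_row, stripe_bytes, max_stripe_rows=10_000))
    print(rows_per_stripe(narrow_row, stripe_bytes, max_stripe_rows=10_000))
```

In this model the row limit only takes effect for the narrow table, which is the case the proposed "orc.stripe.row.count" config is meant to address.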
dongjoon-hyun
pushed a commit
to apache/orc
that referenced
this issue
May 24, 2022
(cherry picked from commit 7facf81)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cxzl25
pushed a commit
to cxzl25/orc
that referenced
this issue
Jan 11, 2024
Add a parameter to control the stripe size of the output ORC file. The size can potentially be specified in bytes or number of rows.