[FEA] Control stripe size when writing ORC files #9261

Closed
vuule opened this issue Sep 20, 2021 · 0 comments · Fixed by #9310
vuule commented Sep 20, 2021

Add a parameter to control the stripe size of the output ORC file. The size can potentially be specified in bytes or number of rows.

@vuule added the feature request, libcudf, and cuIO labels Sep 20, 2021
@vuule self-assigned this Sep 20, 2021
@vuule added this to Issue-Needs prioritizing in v21.12 Release via automation Sep 20, 2021
@vuule moved this from Issue-Needs prioritizing to Issue-P1 in v21.12 Release Sep 20, 2021
@rapids-bot closed this as completed in #9310 Oct 8, 2021
v21.12 Release automation moved this from Issue-P1 to Done Oct 8, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 8, 2021
…ter (#9310)

Closes #9261

Adds the following API to the ORC writer:
- Set the maximum stripe size, in bytes (minimum of 64KB)
- Set the maximum stripe size, in rows (minimum of 512)
- Set the row index stride (minimum of 512)

Notes:
- If the stripe size is set lower than the row index stride, the row index stride is reduced to the stripe size.
- The row index stride is rounded down to a multiple of 8.
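
The two notes above can be read as a small adjustment applied to the requested row index stride. The sketch below is only an illustration in plain Python, not the libcudf implementation: the helper name `effective_row_index_stride`, the order of the two adjustments, and raising `ValueError` for values below the stated minimums are all assumptions.

```python
MIN_ROWS = 512  # minimum stated above for both the row-based stripe size and the stride


def effective_row_index_stride(stripe_size_rows: int, row_index_stride: int) -> int:
    """Hypothetical helper: return the stride after the adjustments in the notes."""
    # Assumption: values below the documented minimums are rejected outright.
    if stripe_size_rows < MIN_ROWS:
        raise ValueError(f"stripe size in rows must be at least {MIN_ROWS}")
    if row_index_stride < MIN_ROWS:
        raise ValueError(f"row index stride must be at least {MIN_ROWS}")

    # Note 1: a stride larger than the stripe is reduced to the stripe size.
    stride = min(row_index_stride, stripe_size_rows)

    # Note 2: the stride is rounded down to a multiple of 8.
    return stride - stride % 8


if __name__ == "__main__":
    # A stripe of 1000 rows caps a requested stride of 10000 rows.
    print(effective_row_index_stride(1000, 10000))
    # A stride of 1001 rows is rounded down to the nearest multiple of 8.
    print(effective_row_index_stride(1_000_000, 1001))
```

Because 512 is itself a multiple of 8, the rounding step in this sketch can never push a valid stride below the stated minimum.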

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Yunsong Wang (https://github.com/PointKernel)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #9310
guiyanakuang pushed a commit to apache/orc that referenced this issue May 24, 2022
### What changes were proposed in this pull request?
Add a row count limit config, "orc.stripe.row.count", to limit the number of rows in one stripe.

### Why are the changes needed?
For query engines like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split.
In the current implementation of the ORC writer, the only config that can control the row count in a stripe is "orc.stripe.size".
But the row count it produces varies widely across different kinds of tables, so it is difficult to use for this purpose.

For a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows.
For a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows.
For Presto, a normal OLAP query reads only a subset of the table's columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may suffer.

So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
A similar config has been introduced to cudf (a GPU DataFrame library based on Apache Arrow): [rapidsai/cudf#9261](rapidsai/cudf#9261)
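
The arithmetic behind this argument can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not the ORC writer's logic: the function name `estimated_rows_per_stripe` is hypothetical, the average row widths are derived from the 5000-row and 100000-row figures above, and the sketch assumes a stripe is closed by whichever limit is reached first.

```python
from typing import Optional

STRIPE_SIZE_BYTES = 64 * 1024 * 1024  # the 64MB stripe size used in the examples above


def estimated_rows_per_stripe(
    stripe_size_bytes: int,
    avg_row_bytes: int,
    row_count_limit: Optional[int] = None,
) -> int:
    """Hypothetical estimate of how many rows end up in one stripe."""
    # The size limit alone: how many average-width rows fit in the stripe.
    rows_by_size = max(1, stripe_size_bytes // avg_row_bytes)
    if row_count_limit is None:
        return rows_by_size
    # Assumption: with a row count limit, the smaller of the two bounds wins.
    return min(rows_by_size, row_count_limit)


# Average row widths implied by the figures above (assumed, for illustration only).
wide_row_bytes = STRIPE_SIZE_BYTES // 5000      # table with ~100 columns
narrow_row_bytes = STRIPE_SIZE_BYTES // 100000  # table with ~5 columns

if __name__ == "__main__":
    for label, width in (("wide", wide_row_bytes), ("narrow", narrow_row_bytes)):
        size_only = estimated_rows_per_stripe(STRIPE_SIZE_BYTES, width)
        capped = estimated_rows_per_stripe(STRIPE_SIZE_BYTES, width, row_count_limit=10000)
        print(label, size_only, capped)
```

With only the size limit, the narrow table's stripes hold roughly twenty times as many rows as the wide table's. A row count limit bounds the narrow table without affecting the wide one.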

### How was this patch tested?
Added testStripeRowCountLimit.
It can be run with the command below:
```
cd java
./mvnw -Dtest=TestWriterImpl test
```
dongjoon-hyun pushed a commit to apache/orc that referenced this issue May 24, 2022

(cherry picked from commit 7facf81)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cxzl25 pushed a commit to cxzl25/orc that referenced this issue Jan 11, 2024