
Improve parquet dictionary encoding #10635

Merged

Conversation


@PointKernel PointKernel commented Apr 11, 2022

This PR includes several changes to improve parquet dictionary encoding:

  • API cleanups: remove unused arguments
  • Remove the minimum-blocks limit from __launch_bounds__
  • Simplify the grid-stride loop logic by using a while loop
  • All threads calculate start/end indices instead of one thread doing the calculation and broadcasting the result (no more shared memory or block-wide sync needed); see the sketch below
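
As a rough illustration of the last three bullets, here is a minimal sketch (a toy kernel, not the actual cuDF dictionary-encoding code; the kernel name and the scaling operation are placeholders):

__global__ __launch_bounds__(256)  // no minimum-blocks argument, per this PR
void scale_kernel(float const* in, float* out, int num_values)
{
  // Every thread computes its own index directly from blockIdx/threadIdx;
  // previously one thread computed the bounds and broadcast them through
  // shared memory, followed by a block-wide sync.
  int idx          = blockIdx.x * blockDim.x + threadIdx.x;
  int const stride = blockDim.x * gridDim.x;

  // The grid-stride loop expressed as a plain while loop.
  while (idx < num_values) {
    out[idx] = 2.0f * in[idx];
    idx += stride;
  }
}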

Other ideas that were tested but ultimately not included in this PR due to zero or negative performance impact:

  • Tuning hash map occupancy
  • cg::shfl instead of shared memory + sync
  • CG based insert/find
  • Relaxed atomics for num_dict_entries and uniq_data_size (see the sketch after this list)
  • cg::reduce instead of cub::BlockReduce
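
For reference, the relaxed-atomic experiment would look roughly like this (a sketch with a hypothetical helper function, assuming libcu++'s cuda::atomic_ref is available; it was dropped because relaxing the ordering showed no measurable gain):

#include <cuda/atomic>

// Hypothetical helper, not the cuDF code: simple tallies such as
// num_dict_entries and uniq_data_size only need relaxed memory ordering.
__device__ void bump_counters(unsigned long long* num_dict_entries,
                              unsigned long long* uniq_data_size,
                              unsigned long long elem_size)
{
  cuda::atomic_ref<unsigned long long, cuda::thread_scope_device>
    entries{*num_dict_entries};
  cuda::atomic_ref<unsigned long long, cuda::thread_scope_device>
    bytes{*uniq_data_size};

  entries.fetch_add(1ull, cuda::memory_order_relaxed);
  bytes.fetch_add(elem_size, cuda::memory_order_relaxed);
}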

Before:

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
ParquetWrite/integral_void_output/29/0/1/1/2/manual_time            734 ms          734 ms            1 bytes_per_second=697.128M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/1/1/2/manual_time         303 ms          303 ms            2 bytes_per_second=1.65131G/s encoded_file_size=397.998M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/1/2/manual_time           734 ms          734 ms            1 bytes_per_second=697.713M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/32/1/2/manual_time       61.9 ms         61.9 ms           11 bytes_per_second=8.07721G/s encoded_file_size=159.574M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/1/0/2/manual_time            690 ms          690 ms            1 bytes_per_second=742.205M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/1/0/2/manual_time         282 ms          282 ms            2 bytes_per_second=1.76991G/s encoded_file_size=398.712M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/0/2/manual_time           690 ms          690 ms            1 bytes_per_second=742.268M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/32/0/2/manual_time       59.5 ms         59.5 ms           12 bytes_per_second=8.40878G/s encoded_file_size=199.926M peak_memory_usage=1.49675G

Now:

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
ParquetWrite/integral_void_output/29/0/1/1/2/manual_time            733 ms          733 ms            1 bytes_per_second=698.24M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/1/1/2/manual_time         302 ms          302 ms            2 bytes_per_second=1.65496G/s encoded_file_size=397.998M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/1/2/manual_time           733 ms          733 ms            1 bytes_per_second=698.701M/s encoded_file_size=530.706M peak_memory_usage=1.7804G
ParquetWrite/integral_void_output/29/1000/32/1/2/manual_time       61.3 ms         61.3 ms           11 bytes_per_second=8.1533G/s encoded_file_size=159.572M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/1/0/2/manual_time            688 ms          688 ms            1 bytes_per_second=743.71M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/1/0/2/manual_time         282 ms          282 ms            2 bytes_per_second=1.7712G/s encoded_file_size=398.712M peak_memory_usage=1.49675G
ParquetWrite/integral_void_output/29/0/32/0/2/manual_time           688 ms          688 ms            1 bytes_per_second=743.658M/s encoded_file_size=531.066M peak_memory_usage=1.3148G
ParquetWrite/integral_void_output/29/1000/32/0/2/manual_time       58.9 ms         58.9 ms           12 bytes_per_second=8.49093G/s encoded_file_size=199.926M peak_memory_usage=1.49675G

@PointKernel PointKernel added labels on Apr 11, 2022: 2 - In Progress, libcudf, cuIO, tech debt, improvement, non-breaking
@PointKernel PointKernel self-assigned this Apr 11, 2022

codecov bot commented Apr 11, 2022

Codecov Report

Merging #10635 (48804f4) into branch-22.06 (b50ae82) will increase coverage by 0.20%.
The diff coverage is n/a.

❗ Current head 48804f4 differs from the pull request's most recent head 7e5d0bb. Consider uploading reports for the commit 7e5d0bb to get more accurate results.

@@               Coverage Diff                @@
##           branch-22.06   #10635      +/-   ##
================================================
+ Coverage         86.15%   86.36%   +0.20%     
================================================
  Files               141      140       -1     
  Lines             22510    22304     -206     
================================================
- Hits              19394    19263     -131     
+ Misses             3116     3041      -75     
Impacted Files Coverage Δ
python/cudf/cudf/core/mixins/binops.py 90.00% <0.00%> (-10.00%) ⬇️
python/cudf/cudf/testing/testing.py 81.69% <0.00%> (-2.82%) ⬇️
python/cudf/cudf/core/indexed_frame.py 91.77% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/core/udf/typing.py 96.72% <0.00%> (-0.82%) ⬇️
python/cudf/cudf/testing/_utils.py 93.85% <0.00%> (-0.56%) ⬇️
python/cudf/cudf/core/_base_index.py 85.45% <0.00%> (-0.47%) ⬇️
python/cudf/cudf/core/single_column_frame.py 96.52% <0.00%> (-0.33%) ⬇️
python/cudf/cudf/__init__.py 90.47% <0.00%> (-0.23%) ⬇️
python/cudf/cudf/core/column/decimal.py 91.09% <0.00%> (-0.21%) ⬇️
python/cudf/cudf/io/orc.py 92.77% <0.00%> (-0.09%) ⬇️
... and 47 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 018924f...7e5d0bb.

@PointKernel PointKernel added the 3 - Ready for Review label and removed the 2 - In Progress label on Apr 12, 2022
@PointKernel PointKernel marked this pull request as ready for review April 12, 2022 18:11
@PointKernel PointKernel requested a review from a team as a code owner April 12, 2022 18:11
@PointKernel (Member, Author):

rerun tests

Comment on lines +149 to +165
uniq_elem_size = [&]() -> size_type {
  if (not is_unique) { return 0; }
  switch (col->physical_type) {
    case Type::INT32: return 4;
    case Type::INT64: return 8;
    case Type::INT96: return 12;
    case Type::FLOAT: return 4;
    case Type::DOUBLE: return 8;
    case Type::BYTE_ARRAY:
      if (data_col.type().id() == type_id::STRING) {
        // Strings are stored as 4 byte length + string bytes
        return 4 + data_col.element<string_view>(val_idx).size_bytes();
      }
    case Type::FIXED_LEN_BYTE_ARRAY:
      if (data_col.type().id() == type_id::DECIMAL128) { return sizeof(__int128_t); }
    default: CUDF_UNREACHABLE("Unsupported type for dictionary encoding");
  }

Contributor:

This switch seems redundant with the type_dispatcher. Couldn't the map_insert_fn be made to return the same information and avoid the extra switch?

Contributor:

Specifically, it seems like this could be simplified to:

auto const [is_unique, element_size] = is_valid ? type_dispatcher(...) : {0, 0};
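
One wrinkle with the sketch as written: a braced initializer list cannot be an operand of the conditional operator, so the fallback needs an explicit pair type. A minimal compilable shape of the idea (lookup() is a hypothetical stand-in for the type_dispatcher call):

#include <utility>

std::pair<bool, int> lookup();  // stand-in for type_dispatcher(...)

void example(bool is_valid)
{
  auto const [is_unique, element_size] =
    is_valid ? lookup() : std::pair<bool, int>{false, 0};
  // ... use is_unique and element_size ...
}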

Contributor:

These are parquet types, not cudf types.

Contributor:

I assume the parquet type can be derived from the cudf type?

Contributor:

Even better would be to push the is_valid check inside map_insert_fn. Then I'd write this as:

while (val_idx - block_size < end_value_idx) {
  thrust::optional<size_type> unique_element_size = type_dispatcher(...);
  ...
  auto const num_unique = block_reduce(reduce_storage).Sum(unique_element_size.has_value());
  __syncthreads();
  auto const unique_data_size = block_reduce(reduce_storage).Sum(unique_element_size.value_or(0));
  ...
}
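
For context, a self-contained version of the double-reduction pattern in this suggestion (a hypothetical kernel, not the cuDF code): the __syncthreads() is required because both reductions reuse the same cub temporary storage.

#include <cub/block/block_reduce.cuh>

template <int block_size>
__global__ void two_reductions(int const* is_unique_flags,
                               int const* elem_sizes,
                               int n,
                               int* num_unique_out,
                               int* data_size_out)
{
  using block_reduce = cub::BlockReduce<int, block_size>;
  __shared__ typename block_reduce::TempStorage reduce_storage;

  int const tid  = blockIdx.x * block_size + threadIdx.x;
  int const flag = tid < n ? is_unique_flags[tid] : 0;
  int const size = tid < n ? elem_sizes[tid] : 0;

  // First reduction: count the unique entries seen by this block.
  int const num_unique = block_reduce(reduce_storage).Sum(flag);
  __syncthreads();  // temp storage is reused by the second reduction
  // Second reduction: total byte size of those unique entries.
  int const data_size = block_reduce(reduce_storage).Sum(size);

  if (threadIdx.x == 0) {
    atomicAdd(num_unique_out, num_unique);
    atomicAdd(data_size_out, data_size);
  }
}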

Contributor:

> I assume the parquet type can be derived from the cudf type?

Not trivially. There can be multiple parquet types associated with a cuDF type, and which one to use is determined by user-passed metadata, which makes its way down to here.

If the user wants timestamps to be encoded as INT96 instead of INT64, that might affect the decision to use a dictionary; a sketch of the relevant writer option follows.
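
For example, something along these lines selects INT96 timestamps at the API level (a sketch based on the libcudf writer options; the exact builder method name is from memory and may differ by version):

#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

void write_int96(cudf::table_view const& table, std::string const& path)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{path}, table)
                .int96_timestamps(true)  // encode timestamps as INT96
                .build();
  cudf::io::write_parquet(opts);
}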

@devavret (Contributor):

Don't benchmark using file output; use void output. The IO time dominates the kernel running time and hides any actual improvements.
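
With Google Benchmark this is just a filter on the void-output cases, e.g. ./PARQUET_WRITER_BENCH --benchmark_filter='ParquetWrite/integral_void_output.*' (the binary name here is assumed from the cuDF build tree; --benchmark_filter is a standard Google Benchmark flag).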

@hyperbolic2346 (Contributor) left a comment:

Looks good, thanks!

@PointKernel (Member, Author):

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 017d52a into rapidsai:branch-22.06 Apr 20, 2022
@PointKernel PointKernel deleted the optimize-parquet-dict-encoding branch May 26, 2022 17:43
Labels
3 - Ready for Review, cuIO, improvement, libcudf, non-breaking