
Write variable bit-width keys for Parquet dictionary encoded pages#22279

Open
mhaseeb123 wants to merge 27 commits into rapidsai:main from mhaseeb123:fea/pq-dict-encode-optimize

Conversation


@mhaseeb123 mhaseeb123 commented Apr 23, 2026

Description

Contributes to #13995

This PR enables the Parquet writer to assign lower dictionary indices to elements that appear earlier in each column chunk, so that earlier pages can be written with a reduced (variable) number of dictionary key bits.

New algorithm:

  1. map_insert_fn now inserts the pair {row_idx, frag_idx} into the static map instead of {row_idx, row_idx}. frag_idx is simply blockIdx.x and records which page fragment actually inserted the entry into the static map (CAS-race dependent, but earlier thread blocks generally win this race).
  2. map_insert_fn also writes the PageFragment::num_dict_vals field for each page fragment, recording the number of unique values it inserted.
  3. Then, in collect_map_entries_kernel (where we assign keys to dict values), we optionally assign spatially local keys to dictionary values inserted by the same page fragment. See algorithm details here
  4. Finally, we launch compute_page_dict_bits_kernel to compute the maximum bit width required to encode the dict keys of each parquet page. See algorithm here

See follow-up PR #22323, which deterministically writes the index of the first page fragment that sees each dictionary key.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Repurpose the unused mapped_type slot in populate_chunk_hash_maps_kernel
to carry the fragment index of the block that first inserts each value.
Rewrite collect_map_entries_kernel to bucket dict_ids by that fragment
index so pages that only reference earlier fragments' values see small
max dict_index.

No file-size delta in isolation; prerequisite for the per-page bit-width
change landing next.

Made-with: Cursor
…ai#13995)

Each data page now RLE-encodes dictionary indices using
ceil(log2(page_max_dict_index + 1)) bits instead of the chunk-wide
maximum. Combined with the first-appearance dict_id ordering from the
prior commit, this closes the ~30% file-size gap vs Spark on
moderate-cardinality INT64 and STRING workloads.

Page size estimation continues to use the chunk-wide bits as a
conservative upper bound; the dictionary page itself is unaffected.

Made-with: Cursor
@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Apr 23, 2026
@mhaseeb123 mhaseeb123 added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS 3 - Ready for Review Ready for review by team labels Apr 28, 2026
@mhaseeb123 mhaseeb123 changed the title 🚧 Write variable bit-width keys for Parquet dictionary encoded pages Write variable bit-width keys for Parquet dictionary encoded pages Apr 28, 2026
@mhaseeb123 mhaseeb123 marked this pull request as ready for review April 28, 2026 19:24
@mhaseeb123 mhaseeb123 requested review from a team as code owners April 28, 2026 19:24

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the Parquet writer’s dictionary encoding to support smaller (variable) RLE bit-widths on earlier data pages by biasing dictionary id assignment toward earlier page fragments, and adds validation/benchmarking around the new behavior.

Changes:

  • Assign dictionary ids using a fragment-locality hint so earlier fragments tend to get smaller ids.
  • Compute and store per-page dict_rle_bits and use it during page encoding.
  • Add/adjust tests and introduce an nvbench benchmark to measure the effect.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

  • cpp/src/io/parquet/chunk_dict.cu — Implements fragment-hinted dictionary id assignment and adds a per-page dict bit-width computation kernel.
  • cpp/src/io/parquet/page_enc.cu — Switches dict encoding to use the per-page dict_rle_bits for data pages.
  • cpp/src/io/parquet/writer_impl.cu — Wires page-boundary finalization to per-page bit-width computation.
  • cpp/src/io/parquet/parquet_gpu.hpp — Extends GPU structs with a fragment count and per-page dict_rle_bits.
  • cpp/src/io/parquet/parquet_gpu.cuh — Updates APIs for fragment mutation and adds the compute_per_page_dict_bits declaration.
  • cpp/tests/io/parquet_writer_test.cpp — Adds a test asserting variable per-page dictionary bit-width behavior.
  • cpp/tests/io/parquet_misc_test.cpp — Updates a dictionary test to validate the max bit width across pages (not only the first page).
  • cpp/benchmarks/io/parquet/parquet_writer_dict.cpp — Adds a benchmark to measure encoding perf and the resulting per-page dict bit stats.
  • cpp/benchmarks/CMakeLists.txt — Registers the new parquet dictionary writer benchmark.


Comment thread cpp/src/io/parquet/chunk_dict.cu
Comment on lines +139 to +142
// TODO(mh): Here we insert the fragment index of the CAS winner, which may not be the
// smallest one (relies on monotonic block scheduling). Switch to static_map's
// `insert_or_apply` with `cuco::op::min` for deterministic first-fragment semantics
is_unique = map_insert_ref.insert(slot_type{static_cast<key_type>(val_idx), frag_idx});
Member Author


Thanks, captain obvious 😄

Comment thread cpp/benchmarks/io/parquet/parquet_writer_dict.cpp
Comment thread cpp/tests/io/parquet_writer_test.cpp Outdated
auto const col = chunk->col_desc;
column_device_view const& data_col = *col->leaf_column;
__shared__ size_type total_num_dict_entries;
__shared__ size_type num_dict_vals;
Member Author


Here we store the number of unique values inserted into the dictionary by this fragment (thread block).

if (total_num_dict_entries > MAX_DICT_SIZE) { break; }
} // for loop
// Flush the number of unique values inserted by this fragment
if (t == 0) { frag->num_dict_vals = num_dict_vals; };
Member Author


Flush the number of unique values from this fragment to global mem

// TODO(mh): Here we insert the fragment index of the CAS winner, which may not be the
// smallest one (relies on monotonic block scheduling). Switch to static_map's
// `insert_or_apply` with `cuco::op::min` for deterministic first-fragment semantics
is_unique = map_insert_ref.insert(slot_type{static_cast<key_type>(val_idx), frag_idx});
Member Author


Insert the index of the fragment (aka blockIdx.x) that inserted this key into the dictionary, for now hoping that the lower blockIdx.x is scheduled first and wins the insert.

auto const frag = frags[col_idx][block_x];
auto chunk = frag.chunk;
auto col = chunk->col_desc;
auto const col_idx = blockIdx.y;
Member Author


Only cosmetic changes here. Variable names are slightly renamed for consistency.

auto const block_x = blockIdx.x;
auto const frag = frags[col_idx][block_x];
auto chunk = frag.chunk;
auto const col_idx = blockIdx.y;
Member Author


Cosmetic changes only

EncColumnChunk* const chunk;
template <typename T>
__device__ void operator()(size_type const s_start_value_idx,
__device__ void operator()(size_type const start_value_idx,
Member Author


Cosmetic changes only

* @param pages Pages span
*/
CUDF_KERNEL void __launch_bounds__(DEFAULT_BLOCK_SIZE)
compute_page_dict_bits_kernel(device_span<EncPage> pages)
Member Author

@mhaseeb123 mhaseeb123 Apr 28, 2026


Kernel logic:

  1. Launch one warp per parquet page
  2. Cooperatively find the largest dictionary key on the page, i.e. max(dict_index)
  3. One lane writes cuda::std::bit_width(max(dict_index)) to the page.dict_rle_bits field.

device_span<EncColumnChunk> chunks,
cudf::detail::device_2dspan<PageFragment const> frags)
{
auto& chunk = chunks[blockIdx.x];
Member Author

@mhaseeb123 mhaseeb123 Apr 28, 2026


Split this kernel into two parts now.

Part 1: If page fragments per column chunk are < 1024, run two phases:

  • Phase 1: Compute exclusive sum (offset) of number of unique values inserted by each page fragment (in map_insert_fn)
  • Phase 2: For each dictionary slot, assign a monotonically increasing dictionary index starting at the offset (computed in phase 1) of the fragment it was inserted from. After this, keys inserted by each fragment have spatial locality allowing dict bit width compaction if data distribution allows.

Part 2: If page fragments per column chunk are > 1024, run a single phase:

  • Same code as before (see the else branch): assign a dictionary index to each slot from a global atomic counter.

Comment thread cpp/tests/io/parquet_misc_test.cpp Outdated
Comment on lines +177 to +182
page_dict_bits.reserve(oi.page_locations.size());
std::transform(
oi.page_locations.begin(),
oi.page_locations.end(),
std::back_inserter(page_dict_bits),
[&source](auto const& page_location) { return read_dict_bits(source, page_location); });
Member Author


Pages no longer all carry the max bit width. Collect the bit widths of all pages and find the max element instead.

@mhaseeb123 mhaseeb123 marked this pull request as draft April 28, 2026 23:47
@mhaseeb123 mhaseeb123 marked this pull request as ready for review April 29, 2026 00:04
if (is_valid) {
// Insert the keys using a single thread for best performance for now.
is_unique = map_insert_ref.insert(cuco::pair{val_idx, val_idx});
// TODO(mh): Here we insert the fragment index of the CAS winner, which may not be the
Member Author

@mhaseeb123 mhaseeb123 Apr 29, 2026


Follow-up PR #22323 makes this deterministic, but then the number of unique elements inserted by each page fragment can only be computed with a loop and atomics in collect_map_entries_kernel, which adds potential cost.
