Skip to content

uint64 header support for cuvs_bench#2130

Open
jinsolp wants to merge 8 commits into
rapidsai:mainfrom
jinsolp:cuvs-bench-uint64-header
Open

uint64 header support for cuvs_bench#2130
jinsolp wants to merge 8 commits into
rapidsai:mainfrom
jinsolp:cuvs-bench-uint64-header

Conversation

@jinsolp
Copy link
Copy Markdown
Contributor

@jinsolp jinsolp commented May 26, 2026

Closes #2129

This PR adds a extended uint64 header support for bin data files. It automatically detects based on the filesize, and therefore does not break support for existing files.
This is useful for datasets that exceed uint32 max number of rows (4.2B).

@jinsolp jinsolp self-assigned this May 26, 2026
@jinsolp jinsolp requested a review from a team as a code owner May 26, 2026 20:27
@jinsolp jinsolp added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 26, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added support for extended binary file headers (uint64) alongside legacy format (uint32) with automatic detection.
    • Binary format now automatically validates file integrity and provides detailed error messages for corrupted or truncated files.
  • Tests

    • Expanded test coverage for binary header handling, including round-trip validation across both header formats and large-file scenarios.

Walkthrough

This PR introduces shared binary header helpers to enable cuvs-bench to support uint64 headers (16-byte) alongside legacy uint32 headers (8-byte). New functions auto-detect header layout by file-size matching and update vector loading, groundtruth utilities, and dataset conversion scripts to use standardized header I/O instead of manual parsing.

Changes

Binary Header Format Upgrade

Layer / File(s) Summary
Binary header read/write helpers
python/cuvs_bench/cuvs_bench/_bin_format.py
New module defines constants for legacy (8-byte) and extended (16-byte) header sizes. read_bin_header() auto-detects layout by unpacking candidate dimensions and verifying file size matches; write_bin_header() writes the appropriate header based on dimension fit and force_uint64 flag.
Vector loading with auto-detected headers
python/cuvs_bench/cuvs_bench/backends/_utils.py
load_vectors() now computes dtype itemsize, calls read_bin_header() to detect layout and dimensions, seeks past the header, reads payload bytes, and validates against truncation.
Groundtruth memory-mapping and binary writing
python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
memmap_bin_file() auto-detects header layout when reading, supports optional shape override, and writes headers via write_bin_header(). write_bin() also gains force_uint64 parameter to control extended-header output.
Dataset conversion scripts with standardized headers
python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py, python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
Both scripts now use read_bin_header() and write_bin_header() instead of manual header I/O. Functions expose force_uint64 parameter to control whether extended headers are written.
Comprehensive test suite for header handling
python/cuvs_bench/cuvs_bench/tests/test_utils.py
Expands coverage to validate header auto-detection, uint64-layout round-trips, overflow behavior, error dispatch for invalid files, and parameterized full and subset reads across both legacy and extended header modes. New TestBinHeaderHelpers class exercises the core helper functions directly.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding uint64 header support for cuvs_bench binary files.
Description check ✅ Passed The description explains the change (uint64 header support), its benefit (handling larger datasets), and mentions automatic detection to maintain backward compatibility.
Linked Issues check ✅ Passed The PR successfully implements uint64 header support for cuvs_bench, directly addressing issue #2129's requirement for billion-scale dataset support.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing uint64 header support with automatic detection and backward compatibility, with comprehensive test coverage.
Docstring Coverage ✅ Passed Docstring coverage is 81.48% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cuvs_bench/cuvs_bench/_bin_format.py`:
- Around line 78-90: The function read_bin_header currently uses itemsize in the
legacy size check without validating it, so callers passing itemsize=0 or
non-positive values can cause false positives; add an upfront validation in
read_bin_header to ensure itemsize is an integer > 0 (and optionally raise
TypeError for non-int types and ValueError for <=0), before computing file_size
== LEGACY_HEADER_BYTES + n_rows_32 * n_cols_32 * itemsize; reference the
parameters itemsize and the function read_bin_header and keep the existing
LEGACY_HEADER_BYTES logic after the validation.
- Around line 136-143: The current header writer truncates non-integral
dimensions by calling int(n_rows)/int(n_cols) which can silently corrupt data;
before packing in the branch that writes "<QQ" or "<II", validate that n_rows
and n_cols are integral (accept Python/NumPy integer types or floats that are
exact integers) and reject fractional values by raising a TypeError/ValueError;
use the existing negative check for <0 but add an explicit integrality check for
n_rows and n_cols (reference symbols: n_rows, n_cols, UINT32_MAX,
EXTENDED_HEADER_BYTES, struct.pack("<QQ"/"<II") ) and only cast to int after
passing the check.

In `@python/cuvs_bench/cuvs_bench/backends/_utils.py`:
- Around line 99-106: Validate that subset_size is an integer-like positive
value before using it: after computing itemsize (np.dtype(dtype).itemsize) and
before clamping n_rows, add a check that if subset_size is not None it must be
an integral type (e.g., isinstance(subset_size, numbers.Integral) and not a
bool) and >= 1, otherwise raise ValueError with a clear message including the
received value; keep the existing clamping of n_rows = min(n_rows, subset_size)
and then call read_bin_header/read/reshape as before so downstream code never
receives a float count.

In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py`:
- Around line 80-95: Validate the incoming shape parameter before building
final_shape: in the block using read_bin_header and later where final_shape is
constructed (variables header_dims, final_shape, shape, and the np.memmap call),
require that shape is either None or an iterable of exactly two entries; for
each provided entry ensure it is either None or an integer >= 1, otherwise raise
a ValueError that includes expected "2 dimensions" and the actual shape value;
additionally, when mode indicates writing, ensure provided shape is consistent
with header_dims (or vice versa) and raise a clear error if they conflict so you
never return a memmap with mismatched dimensions.
- Around line 44-46: The public function memmap_bin_file currently removed the
size_dtype kwarg; restore compatibility by accepting size_dtype as an optional
keyword argument (e.g., def memmap_bin_file(..., *, force_uint64=False,
size_dtype=None)) and map it to force_uint64 when provided (prefer explicit
precedence if both given), emitting a DeprecationWarning that size_dtype is
deprecated and will be removed in a future release; update any internal uses
within memmap_bin_file to use the resolved force_uint64 value and ensure callers
passing size_dtype continue to work for one release cycle.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f762686e-0e10-4f6d-8967-b37391a77504

📥 Commits

Reviewing files that changed from the base of the PR and between 4addf4e and c1df117.

📒 Files selected for processing (6)
  • python/cuvs_bench/cuvs_bench/_bin_format.py
  • python/cuvs_bench/cuvs_bench/backends/_utils.py
  • python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
  • python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py
  • python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
  • python/cuvs_bench/cuvs_bench/tests/test_utils.py

Comment thread python/cuvs_bench/cuvs_bench/_bin_format.py
Comment thread python/cuvs_bench/cuvs_bench/_bin_format.py
Comment thread python/cuvs_bench/cuvs_bench/backends/_utils.py
Comment thread python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
Comment thread python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
Copy link
Copy Markdown
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this end-to-end for the beyond uint32 rows goal, I think we may need to audit a few non-helper paths too. For example, cpp/bench/ann/src/common/blob.hpp still appears to treat .bin files as two uint32_t header values and uses an 8-byte data offset, so C++ benchmark consumers may not read these extended files correctly yet. Also, python/cuvs_bench/cuvs_bench/generate_groundtruth/main.py still writes ground-truth neighbor IDs as uint32, which may not be enough once base-row IDs can exceed 4B rows. Would it make sense to either update those paths here or call them out as explicit follow-up scope?


if mode[0] == "r":
a = np.memmap(bin_file, mode=mode, dtype=size_dtype, shape=(2,))
n_rows, n_cols, header_bytes = read_bin_header(bin_file, itemsize)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding direct tests for memmap_bin_file with both legacy and extended headers? This is one of the important large-file paths, and this PR changes the read offset detection as well as the write path, so a small round-trip test here would make breakages easy to spot

"""
path = tmp_path / "huge.fbin"
n_rows = UINT32_MAX + 17
n_cols = 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this use a positive column count and create the expected file size with truncate instead of using n_cols = 0? The zero-column case proves the parser can see a large uint64 value, but it doesn’t really exercise the large positive-shape case this feature is meant to support.

@pytest.mark.parametrize(
"ext, dtype, size_dtype",
[
(".fbin", np.float32, np.uint32),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we include the float16 extension in this parametrized coverage too? load_vectors supports .f16bin, np? So it would be nice to cover that path for both header layouts as well.

@jinsolp
Copy link
Copy Markdown
Contributor Author

jinsolp commented Jun 5, 2026

Hi @dantegd I've addressed the test-related feedback.

And thanks for mentioning GT! The default for groundtruth ibin. Now it also writes groundtruth files as u64bin if needed.

I'll open another PR for the cpp side!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA] uint64 type header support for cuvs bench

2 participants