[REVIEW] New Dataset API Clarifying Ownership by HowardHuang1 · Pull Request #1846 · rapidsai/cuvs

HowardHuang1 · 2026-02-24T23:21:21Z

Overview

Addressing #1574 and #1571.

Replaced strided_dataset with a padded_dataset class. Added support all the way up to the CAGRA code. The strided_dataset code is left in for backwards compatibility but can be deprecated later on.

Proposed class structure:

dataset and dataset_view are now 2 separate parent classes. The strided dataset path (strided_dataset, layout_stride, make_strided_dataset) is kept separate from both.

Parent Classes

Storage is still expressed in terms of RAFT mdspan / mdarray / device_matrix: padded types wrap row-major device views or raft::device_matrix<..., row_major>, and keep a logical dim_ that is separate from the leading dimension (row pitch in elements).
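
To illustrate the difference between the logical dim_ and the row pitch, here is a minimal host-only sketch. It does not use RAFT or cuVS: padded_row_stride, padded_view_sketch and copy_to_padded are hypothetical names, plain std::vector stands in for device memory, and the 16-byte row alignment is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuVS function): round the logical row length up
// so that every row starts on an `align_bytes` boundary (16 is an assumption).
template <typename T>
std::size_t padded_row_stride(std::size_t dim, std::size_t align_bytes = 16)
{
  assert(align_bytes % sizeof(T) == 0);
  std::size_t const elems_per_block = align_bytes / sizeof(T);
  return (dim + elems_per_block - 1) / elems_per_block * elems_per_block;
}

// Hypothetical non-owning view: `dim` is the logical row length and `stride`
// is the leading dimension (row pitch in elements), with stride >= dim.
template <typename T>
struct padded_view_sketch {
  T const* data;
  std::size_t n_rows;
  std::size_t dim;
  std::size_t stride;

  T const& operator()(std::size_t row, std::size_t col) const
  {
    assert(row < n_rows && col < dim);
    return data[row * stride + col];
  }
};

// Copy an unpadded row-major matrix into padded storage, zero-filling the pad.
template <typename T>
std::vector<T> copy_to_padded(std::vector<T> const& src, std::size_t n_rows, std::size_t dim)
{
  assert(src.size() == n_rows * dim);
  std::size_t const stride = padded_row_stride<T>(dim);
  std::vector<T> dst(n_rows * stride, T{});
  for (std::size_t r = 0; r < n_rows; ++r) {
    for (std::size_t c = 0; c < dim; ++c) {
      dst[r * stride + c] = src[r * dim + c];
    }
  }
  return dst;
}
```

With float rows of logical length 3, the sketch stores 4 elements per row, so the copy is where the extra memory goes when padding is done on the user's behalf.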

Ownership

The index and cagra::build / cagra::index do not own raw vector storage; they store std::unique_ptr<dataset_view<...>>, i.e. non-owning view handles, so callers (or the C merged holder) must keep backing memory alive for as long as the index is used.
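
The ownership rule can be sketched without any cuVS types. In the example below, dataset_view_sketch and index_sketch are hypothetical stand-ins for dataset_view<...> and cagra::index, and host memory stands in for device memory; the only point is that the std::unique_ptr owns the view handle, not the rows it points at.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for dataset_view<...>: a pointer plus a shape, no ownership.
struct dataset_view_sketch {
  float const* data;
  std::size_t n_rows;
  std::size_t dim;
};

// Hypothetical stand-in for cagra::index: it owns the view handle, not the rows.
class index_sketch {
 public:
  explicit index_sketch(dataset_view_sketch view)
    : view_(std::make_unique<dataset_view_sketch>(view))
  {
  }

  float at(std::size_t row, std::size_t col) const
  {
    assert(row < view_->n_rows && col < view_->dim);
    return view_->data[row * view_->dim + col];
  }

 private:
  std::unique_ptr<dataset_view_sketch> view_;
};

// The caller owns `storage` and keeps it alive for every use of the index.
inline float read_through_index()
{
  std::vector<float> storage{1.0f, 2.0f, 3.0f, 4.0f};  // 2 rows x 2 columns
  index_sketch idx{dataset_view_sketch{storage.data(), 2, 2}};
  return idx.at(1, 0);  // valid only because `storage` is still in scope
}
```

If storage were destroyed before idx.at() is called, the view would dangle; that is the responsibility this design hands to the caller.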

Backwards Compatibility

The strided_dataset / non_owning_dataset / owning_dataset path and make_strided_dataset / make_aligned_dataset are kept in for backward compatibility.

Device vs. Host

device_padded_dataset extends dataset
device_padded_dataset_view extends dataset_view

There are no host versions, as they are not needed.

ACE vs. non-ACE paths on Host

ACE path is the only one allowed on host.
It copies datasets that can't entirely fit in CPU memory in chunks onto GPU memory by calling make_padded_dataset. This is 1x memory on CPU and 1x memory on GPU.

Return types:

Used mainly to maintain lifetime of dataset.

merge_result
build_result
cuvs_cagra_c_api_lifetime_holder

unique_ptr<vpq_dataset> vpq_owner
unique_ptr padded_dataset_owner
raft::device_matrix dataset
cagra::index idx
It is a single C++ struct in cagra.cpp that groups the real cagra::index with any extra heap-owned things the C API had to create so the index’s non-owning views stay valid.

cuvs_cagra_c_api_lifetime_holder is a separate heap object from cagra::index. It is heap-allocated in cagra.cpp with new cuvs_cagra_c_api_lifetime_holder<...>. The C API keeps a raw pointer to it in cuvsCagraIndex.cuvs_cagra_c_api_lifetime_holder. It is not embedded in the index, which is why the C layer needs that second field to delete the holder on destroy.

Heap-allocated bundle for the C API: owns cagra::index and any co-owned device storage (VPQ, padded dataset copy, merge/de-serialize/extend buffers) when the index is not standalone. cuvsCagraIndex.c_api_lifetime_owner points at this. Used for merge, build, deserialize, from_args, extend.

The holder moves the owning device_padded_dataset (as unique_ptr<dataset<>> in padded_dataset_owner) to the heap, and cuvsCagraIndex.merged_owner points at the holder. Destroying the C index later destroys the holder first, so the dataset outlives the index’s use of the view, or the ordering is set up so the view is not used after free.
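
The holder pattern described above can be sketched in standalone C++. Everything here is hypothetical (owning_dataset_sketch, index_sketch, lifetime_holder_sketch, c_index_sketch and its two fields are illustration names, not the types in cagra.cpp). The sketch relies on the C++ rule that members are destroyed in reverse declaration order: because the index is declared last, it is destroyed before the dataset it views.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using event_log = std::vector<std::string>;

// Hypothetical owning dataset: records its destruction in `log`.
struct owning_dataset_sketch {
  owning_dataset_sketch(event_log* log_, std::vector<float> rows_)
    : log(log_), rows(std::move(rows_))
  {
  }
  ~owning_dataset_sketch() { log->push_back("dataset destroyed"); }
  event_log* log;
  std::vector<float> rows;
};

// Hypothetical index: holds only a raw pointer into the dataset rows.
struct index_sketch {
  index_sketch(event_log* log_, float const* rows_) : log(log_), rows(rows_) {}
  ~index_sketch() { log->push_back("index destroyed"); }
  event_log* log;
  float const* rows;
};

// Hypothetical holder: the owner is declared first, the index last, so the
// index is destroyed first and never outlives the rows it points at.
struct lifetime_holder_sketch {
  lifetime_holder_sketch(event_log* log, std::vector<float> rows)
    : padded_dataset_owner(std::make_unique<owning_dataset_sketch>(log, std::move(rows))),
      idx(log, padded_dataset_owner->rows.data())
  {
  }
  std::unique_ptr<owning_dataset_sketch> padded_dataset_owner;
  index_sketch idx;
};

// Hypothetical C-side handle with two fields.
struct c_index_sketch {
  void* addr   = nullptr;  // the index inside the holder
  void* holder = nullptr;  // the holder itself, needed to delete it on destroy
};

inline void create_index(c_index_sketch* out, event_log* log)
{
  assert(out != nullptr && log != nullptr);
  auto* holder = new lifetime_holder_sketch(log, std::vector<float>{1.0f, 2.0f});
  out->holder  = holder;
  out->addr    = &holder->idx;
}

inline void destroy_index(c_index_sketch* index)
{
  delete static_cast<lifetime_holder_sketch*>(index->holder);
  index->holder = nullptr;
  index->addr   = nullptr;
}
```

Deleting only the object behind addr would not work in this sketch, since the index is a member of the holder; the second pointer is what lets the C layer free the whole bundle.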

In cuvsCagraIndexFromArgs in cagra.cpp (C API) where callers are things like the Python cagra.from_graph (via Cython) and the Java CagraIndexImpl, and any C code that uses that function:

The flow is: caller → cuvsCagraIndexFromArgs → _from_args, which writes into the cuvsCagraIndex struct the user passed

The holder is not returned as a separate C return value. It is allocated on the heap and its address is stored in output_index->merged_owner, and output_index->addr points at the index inside that holder (or at a freestanding index when merged_owner == 0).

So when _from_args returns, the user’s cuvsCagraIndex already holds the pointers that describe where everything lives.

The unique_ptr to the copy of the dataset from make_padded_dataset is not local to _from_args—it is a member of the holder, which is on the heap and stays alive.

Miscellaneous: Extend Serialize Deserialize

Will fill in later

Factories:

make_padded_dataset
make_padded_dataset_view

Old (to be deprecated):

make_strided_dataset

Helpers:

device_row_stride_is_padded (cagra.cpp)
device_strided_matrix_has_cagra_row_pitch (cagra.cpp)
rebind_vpq_index (cagra.cpp)

makes call to update_dataset to rebind vpq index after build()

Places where make_padded_dataset/view are called internally (not by user):

Host non-ACE path

cpp/src/neighbors/cagra_build_inst.cu.in
cagra_from_host_padded in cpp/src/neighbors/iface/iface.hpp
c/src/neighbors/cagra.cpp

ACE internals

ACE is the only path that takes host mdspan as input
But internally, it implements a H2D copy of each partition with make_padded_dataset.
cpp/src/neighbors/detail/cagra/cagra_build.cuh

Attach Dataset

ACE attach_dataset_on_build calls make_padded_dataset on full host dataset to attach dataset to final index.
cpp/src/neighbors/detail/cagra/cagra_build.cuh

Tiered CAGRA

update_cagra_ann_dataset_for_stride
build_upstream_ann

To support Backwards Compatibility:

TLDR: for backwards compatibility, we would only need to bring back the build() function that accepts a non-padded dataset + host inputs and returns index(). Nothing else downstream needs to change.
Old program shape: build(…) → cagra::index → then search / serialize / deserialize / merge / … with that index (and associated methods on index they used).
New program shape: build(…) → build_result → then search / serialize / deserialize / merge / … with that br.idx passed in.
The old build() function in the public API had return type index() rather than things like build_result() and ace_build_result(). To maintain backwards compatibility, we would need to keep the individual overloads supporting the old index return type for just the build() function.
Previously, users were NOT expected to pad the dataset themselves; padding was done internally in build(). This means we must mark clearly that the old build() function pads internally and accepts a non-padded dataset. For the new build() function, however, the input must be a padded_dataset_view and no padding is done internally.
We do not need to maintain two sets of overloads belonging to 2 separate pipelines, one old and one new. search / serialize / deserialize / merge should be the same, since the only difference is that with the new dataset API they take br.idx instead of the index directly.
For the internal calling logic, we can do any one of the 3 options below. The downstream functions themselves search / serialize / deserialize / merge should stay the same.

There are different overloads with different third-argument types, so the return type is fixed at compile time:

build(res, params, host_matrix_view) or build(res, params, device_matrix_view) → cagra::index (convenience / legacy-style).
build(res, params, dataset_view) (and the thin device_padded_dataset_view overload for deducing T) → build_result (use .idx, and .vpq if present).
build_ace(res, params, host_matrix_view) → ace_build_result (use .idx and whatever else that struct carries for ACE).
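
A minimal sketch of how the argument type selects the return type at compile time, as described for the overloads above. All types here are hypothetical stand-ins (the _sketch suffix marks them), the res and params arguments are omitted, and the padded_internally flag exists only to make the two paths observable in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the unpadded and padded input views.
struct device_matrix_view_sketch {
  float const* data;
  std::size_t n_rows;
  std::size_t dim;
};
struct padded_dataset_view_sketch {
  float const* data;
  std::size_t n_rows;
  std::size_t dim;
  std::size_t stride;  // row pitch in elements, stride >= dim
};

// Hypothetical stand-ins for the two return types.
struct index_sketch {
  std::size_t n_rows;
  std::size_t dim;
  bool padded_internally;
};
struct build_result_sketch {
  index_sketch idx;
};

// Legacy-style overload: accepts unpadded input, pads on the user's behalf.
inline index_sketch build(device_matrix_view_sketch v)
{
  return index_sketch{v.n_rows, v.dim, true};
}

// New-style overload: requires already padded input, does no padding.
inline build_result_sketch build(padded_dataset_view_sketch v)
{
  assert(v.stride >= v.dim);
  return build_result_sketch{index_sketch{v.n_rows, v.dim, false}};
}

// The return type follows from the argument type alone.
static_assert(
  std::is_same<decltype(build(std::declval<device_matrix_view_sketch>())), index_sketch>::value,
  "unpadded input yields a bare index");
static_assert(std::is_same<decltype(build(std::declval<padded_dataset_view_sketch>())),
                           build_result_sketch>::value,
              "padded input yields a build_result");
```

Callers of the legacy-style overload keep using the returned index directly, while callers of the new-style overload pass br.idx downstream.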
Goal: every build() function needs to support host inputs. These functions don't necessarily need to take in mdspan but they still need to take in the host dataset (and do the same thing with the data that the mdspan version was doing).

For ACE: both build(…, host) (ACE branch) and build_ace(…, host) go through the same detail::build_ace; the former finalizes to index only, the latter returns the full ace_build_result

Bottom line: The backward-compatible surface is declared in cagra.hpp. The restored behavior is implemented in cagra_build_inst.cu.in (and templates in cagra.cuh) by calling the same internal dataset_view build + ACE utilities in cagra_build.cuh, then finalizing into a single index for downstream search / serialize like any other index.

Breaking Changes for Dataset API:

The following functions are removed since index no longer owns the dataset, index only takes views:

update_dataset(host_matrix_view)
update_dataset(resources res, DatasetT&& dataset)
update_dataset(resources res, unique_ptr&& dataset)

All other functions on old public API surface are preserved for backwards compatibility.
Notably, the 8 build() functions that take in device_matrix_view and return indexes are kept in. Their implementations are found in cpp/src/neighbors/cagra_build_inst.cu.in. Because they take in a device_matrix_view, which is not padded, we call make_padded_dataset/view FOR THE USER in cagra_build_inst.cu.in. This will later be deprecated, as the user is expected to call the make_padded_dataset/view factories themselves to avoid 2x memory spike surprises.

2 cases where index owns dataset [both deprecated paths]:

Both occur on an edge case path when attach_dataset_on_build == true and a successful dense attach:

Non-ACE / typical padded attach: rows live under index_owning_dataset_storage_ (type-erased owning wrapper, commonly device_padded_dataset).
ACE in-memory device_matrix attach: rows live under host_build_ace_device_store_ (optional raw device_matrix).

TODOs:

Bring back Host functions [DONE]
Mark any old functions that are no longer used as [deprecated]
Use templates wherever possible. Shift towards composition rather than inheritance

copy-pr-bot · 2026-02-24T23:21:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

aamijar · 2026-02-25T01:51:35Z

/ok to test 5447a4c

aamijar · 2026-02-25T02:12:45Z

/ok to test 17ab09d

achirkin · 2026-02-25T06:06:02Z

NB: I updated the label to breaking, since the description implies removal of a publicly visible class strided_dataset

cjnolet · 2026-02-25T22:22:05Z

Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?

@achirkin The problem w/ using mdspan/mdarray for this is that it's not carrying along the proper information to either the algorithms or the user (which is why we created this specialized class for this in the first place!).

Two immediate reasons why this API is necessary:

The user should not have to know that they need to pad a dataset in order to use cagra without the additional copy. They should not need to know how any of these algorithms work internally. They should, however, need to know that CAGRA expects a padded dataset, and they should have an API to construct one so that they can own the dataset class and not have cagra creating one under the hood.
APIs, especially the graph-based APIs, should be able to accept as inputs data which has been quantized using a method like PQ, which carries additional information. In the case of PQ, the codebooks are needed to compute the distances. This again decouples the quantization from the algorithm (CAGRA-Q does not need to do its own quantization. It should just accept the quantized vectors). We're being asked for the same behavior with Vamana.

This new API solves both of these problems while leaving the control over the memory ownership entirely in the user's hands. We've discussed this for a long time. We've known this is needed for a long time. It's time to prioritize this and get it done. I agree that an abstract class might make more sense, but ultimately we should not be moving any ownership over to the algorithm (the user should maintain ownership over the class and underlying memory the entire time).

…rt for cases where we DO need to own the dataset (in order to keep view alive for index). All cases where we build() from dataset already on device --> we don't need to own. Merge + All cases when data is on host --> we DO need to own the device copy we create. This includes within ACE build and C API build from host and from_args with host dataset

HowardHuang1 · 2026-03-04T23:24:04Z

The doc that outlines some of the API design choices can be found in slack. Let me know if there are any parts of the design that can be altered to better suit our users' needs.

The following files are test case files I've added and can be ignored for now. They will be removed before the final merge with upstream repo:

cagra_build_view_only.cu
cagra_padded_dataset.cu
cagra_vpq_build_result.cu
dataset_compression.cu
dataset_types.cu

cjnolet · 2026-05-13T01:43:24Z

I would propose we separate the implementation from the prototype of the methods.

…

On Tue, May 12, 2026 at 9:38 PM Howard Huang wrote (commenting on this pull request, in cpp/include/cuvs/neighbors/cagra.hpp):

  void update_dataset(raft::resources const& res,
- raft::device_matrix_view<const T, int64_t, raft::layout_stride> dataset)
+ cuvs::neighbors::any_owning_dataset<dataset_index_type>&& dataset)

The old update_dataset had these 2 owning signatures that both replace the old dataset with a new dataset:

update_dataset(resources res, DatasetT&& dataset)
update_dataset(resources res, unique_ptr&& dataset)

I agree since index() now only takes views and user is responsible for ownership of the dataset, we should no longer support owning datasets in update_dataset(). But don't we need this for backwards compatibility at least (it will be marked as deprecated)? Or can we just remove support for the 2 owning dataset instances of update_dataset() outright?

HowardHuang1 · 2026-05-13T04:30:08Z

/ok to test 88da190

HowardHuang1 · 2026-05-13T20:18:55Z

/ok to test d1c1dd4

HowardHuang1 · 2026-05-14T02:07:05Z

/ok to test 8c836ec

HowardHuang1 · 2026-05-14T06:30:39Z

/ok to test 11b0c61

HowardHuang1 · 2026-05-14T17:09:56Z

/ok to test 1de47f9

tarang-jain · 2026-05-14T21:49:33Z

I am adding some comments to propose how to express VQ / PQ codebooks.
We want to separate out the codebooks from the vpq_dataset struct. Currently the PQ quantizer is creating an empty quantized dataset everywhere, which is not needed because it just needs the codebooks. The same holds for vamana as well. Therefore I am proposing containers only for the codebooks:
(a) pq_codebooks_owning
(b) vpq_codebooks_owning
(c) pq_codebooks_view
(d) vpq_codebooks_view

Technically, the PQ containers above would just hold a single array (or view), but it still needs a container to express helpers such as pq_dim() and pq_bits().

Now for the vpq_dataset, I think we can use one of the regular owning dataset containers that you already have, separately from the codebooks. In other words, we can decouple the quantized dataset from the codebooks and current callers of vpq_dataset can own two objects: one "dataset" object to hold the quantized data and one codebooks object from the four I have mentioned above.

HowardHuang1 · 2026-05-14T23:17:35Z

/ok to test 99ab789

lowener

Add an example in the headers of how to use make_vpq_dataset along with CAGRA, there are currently none.

HowardHuang1 added 4 commits February 23, 2026 09:49

get build working

d78f459

add dataset compression test and basic constructor types test

4febf8b

add padded_dataset class along with test cases

b403473

add support for new padded_dataset classes all the way up to the CAGRA level

8d6833a

HowardHuang1 requested review from a team as code owners February 24, 2026 23:21

github-project-automation Bot added this to Unstructured Data Processing Feb 24, 2026

aamijar added the non-breaking and feature request labels Feb 25, 2026

aamijar assigned HowardHuang1 Feb 25, 2026

aamijar moved this to In Progress in Unstructured Data Processing Feb 25, 2026

Merge branch 'main' into HH-Dataset-API

5447a4c

fix style

17ab09d

achirkin added the breaking label and removed the non-breaking label Feb 25, 2026

achirkin reviewed Feb 25, 2026

Comment thread cpp/include/cuvs/neighbors/cagra.hpp

aamijar reviewed Feb 25, 2026

Comment thread cpp/CMakeLists.txt Outdated

seunghwak mentioned this pull request Feb 27, 2026

[WIP] Clarify dataset ownership and allocation semantics #1738

Closed

build() now only takes views and not unique ptrs + get rid of distinction between make host/device padded dataset in factory

fb556c9

HowardHuang1 requested a review from a team as a code owner February 28, 2026 03:34

HowardHuang1 added 4 commits March 2, 2026 15:27

clean up old overloads of build & index functions that take ownership of dataset + create build_result struct which returns both index and vpq_dataset to prevent automatic out of scope destruction of dataset for vpq case

37d28dc

fix failing mg tests that do build -> serialize -> deserialize -> search

26b46a2

Merge remote-tracking branch 'upstream' into HH-Dataset-API

a38fb18

HowardHuang1 added 2 commits May 12, 2026 17:41

Remove detail namespace around finalize_index_from_ace and finalize_index_from_padded

cebcfd4

move implementation of dispatch in index out into dispatch.hpp file

88da190

tarang-jain mentioned this pull request May 13, 2026

[FEA] View Type PQ Preprocessor #1764

Open

HowardHuang1 added 3 commits May 13, 2026 13:13

Merge remote-tracking branch 'upstream' into HH-Dataset-API

79d0fd8

remove indirect_dataset

f9adbc5

Merge remote-tracking branch 'upstream' into HH-Dataset-API

d1c1dd4

HowardHuang1 added 2 commits May 13, 2026 19:01

Merge remote-tracking branch 'upstream' into HH-Dataset-API

5223862

fix failing CI by having deserialize_strided() recover a padded dataset as opposed to a strided dataset. Recovering strided dataset can cause serialized logical dim and in-memory dim used by index to disagree which leads to bad recall

8c836ec

index was missing vpq_16_owning check when rebinding dataset during deserialization so deserialization fails. Also fixed doxygen

11b0c61

index logical element type and vpq codebook type do not need to be the same. Previously any_owning_dataset_to_index_view() was missing vpq codebook type branches f32 and f16 for some index logical element types.

1de47f9

HowardHuang1 added 2 commits May 14, 2026 16:13

Merge remote-tracking branch 'upstream' into HH-Dataset-API

c6b4901

pull out nested vpq_dataset creation from build(). Users should now call make_vpq_dataset() factory instead of relying on build() to create vpq_dataset for them. Remove vpq_dataset ownership storage from build_result and merge_result

99ab789

HowardHuang1 added 2 commits May 14, 2026 18:12

remove deferred_host_dataset from build_result. Deprecate old edge case path for build_from_host_matrix() call with host + attach_dataset_on_build + successful attach and have index own dataset for now for this edge case path only

3ea5f56

remove build_result completely and remove build() overloads that return build_result

ea4c1d6

lowener reviewed May 15, 2026

Comment thread cpp/src/neighbors/detail/cagra/cagra_build.cuh

Comment thread cpp/src/neighbors/detail/cagra/cagra_build.cuh Outdated

Comment thread cpp/src/preprocessing/quantize/pq.cu Outdated

HowardHuang1 added 7 commits May 15, 2026 09:29

make_vpq_dataset now takes any_dataset_view instead of just padded_dataset_view

68c3f6b

remove unused files and includes. Added code example for make_vpq_dataset() factory

f7f607d

Merge branch 'main' into HH-Dataset-API

34d2dc4

remove ace_build_result. build_ace() now returns index only. Deprecate edge case attach_dataset_on_build path for ACE analogous to the device build path we deprecated to remove build_result earlier. For that one path only, index owns dataset.

3298366

Merge remote-tracking branch 'upstream' into HH-Dataset-API

0af9527

remove all instances of build_ace() on public API surface. Now it's only build()

6459568

unify two index dataset storage variables into one

e3cf3d3

8 participants
