Add Multi-GPU neighbors #465

Merged
Intron7 merged 15 commits into main from neighbors-multi-GPU
Sep 26, 2025

Conversation

Member

@Intron7 Intron7 commented Sep 22, 2025

This will add Multi-GPU neighbors

Member Author

Intron7 commented Sep 22, 2025

ToDo:

  • Add Test for BBKNN
  • Fix Docstring neighbors
  • Check metrics and algorithm keywords
  • Check pqbit for ivfpq

@viclafargue viclafargue left a comment

Great work! It looks, though, like the indices are built and dropped immediately after the search. That might be what the use case needs, but it would probably be neat to split these wrappers into a build function and a search function, so that an index can be preserved across multiple search rounds.
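To illustrate the build/search split suggested here, below is a minimal CPU-only sketch in NumPy. `BruteForceIndex`, `build_index`, and `search_index` are hypothetical names used for illustration; they are not part of cuVS or of this PR. The point is only that the object returned by the build step can be kept and searched several times.

```python
import numpy as np


class BruteForceIndex:
    """Hypothetical stand-in for a GPU index: keeps the data it was built from."""

    def __init__(self, dataset: np.ndarray):
        self.dataset = np.ascontiguousarray(dataset, dtype=np.float32)
        # Squared norms are computed once at build time and reused by every search.
        self.sq_norms = (self.dataset ** 2).sum(axis=1)


def build_index(X: np.ndarray) -> BruteForceIndex:
    """Build step: done once per dataset."""
    return BruteForceIndex(X)


def search_index(index: BruteForceIndex, queries: np.ndarray, k: int):
    """Search step: can be called repeatedly against the same index."""
    queries = np.asarray(queries, dtype=np.float32)
    q_norms = (queries ** 2).sum(axis=1)[:, None]
    # Squared Euclidean distance via ||q||^2 - 2 q.x + ||x||^2.
    d2 = q_norms - 2.0 * queries @ index.dataset.T + index.sq_norms[None, :]
    np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives caused by rounding
    neighbors = np.argsort(d2, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(d2, neighbors, axis=1)
    return neighbors, distances
```

With this shape, a caller would run `index = build_index(X)` once and then call `search_index(index, queries, k)` for each round of queries.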

    "intermediate_graph_degree", None
)
nn_descent_params = nn_descent.IndexParams(
    graph_degree=k, intermediate_graph_degree=intermediate_graph_degree

graph_degree does not necessarily have to be equal to k. For instance, UMAP uses 64 as a default.
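One way to decouple the two values is sketched below. `resolve_graph_degree` is a hypothetical helper, not something in this PR, and it rests on two assumptions: that 64 is an acceptable default (taken from the UMAP example above) and that the graph needs at least k neighbors per row to return k results.

```python
from typing import Optional


def resolve_graph_degree(
    k: int, graph_degree: Optional[int] = None, default: int = 64
) -> int:
    """Pick a graph degree that is independent of k but never smaller than it."""
    if k < 1:
        raise ValueError(f"k must be positive, got {k}")
    if graph_degree is None:
        # Assumed default, borrowed from the UMAP example in the review.
        graph_degree = default
    # Assumption: fewer than k edges per node cannot yield k neighbors.
    return max(graph_degree, k)
```

A user-supplied `graph_degree` is respected as long as it can still produce k neighbors; otherwise it is raised to k.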

    )
    ivf_pq_params = None
else:
    raise ValueError(f"Invalid algorithm: {algo}")

All neighbors can also be launched in brute_force mode by setting ivf_pq_params and nn_descent_params to None.
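A sketch of how the dispatch could gain a brute-force branch is shown below. It uses plain dictionaries as stand-ins for the real parameter objects, so `select_sub_params` and the dictionary keys are illustrative assumptions rather than the cuVS API. The only behavior taken from the comment above is that brute force corresponds to both sub-parameter sets being `None`.

```python
from typing import Any, Dict, Optional


def select_sub_params(
    algo: str, k: int, intermediate_graph_degree: Optional[int] = None
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Return stand-in sub-parameters for each supported algorithm."""
    if algo == "nn_descent":
        nn_descent_params = {
            "graph_degree": k,
            "intermediate_graph_degree": intermediate_graph_degree,
        }
        ivf_pq_params = None
    elif algo == "ivf_pq":
        nn_descent_params = None
        ivf_pq_params = {}  # placeholder for IVF-PQ settings
    elif algo == "brute_force":
        # Per the review: brute force is selected by leaving both unset.
        nn_descent_params = None
        ivf_pq_params = None
    else:
        raise ValueError(f"Invalid algorithm: {algo}")
    return {
        "nn_descent_params": nn_descent_params,
        "ivf_pq_params": ivf_pq_params,
    }
```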

Comment on lines +71 to +81
neighbors = cp.zeros([X.shape[0], k], dtype=np.int64)
distances = cp.zeros([X.shape[0], k], dtype=np.float32)

neighbors, distances = all_neighbors.build(
    dataset=X,
    k=k,
    params=build_params,
    indices=neighbors,
    distances=distances,
    resources=res,
)

It looks like you are pre-allocating the neighbors and distances arrays and passing them to the function while also taking them from the return value. This will probably work, but it isn't how the function is intended to be used. There is no need to pre-allocate and pass them as arguments; the function will allocate and return the result on its own. Passing them as arguments allows one to reuse the same buffers across multiple operations.

Member Author

The distances need to be given as an array to be returned, right?


You're right, distances is actually optional and not returned if not provided.
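The calling convention discussed in this thread can be sketched on the CPU as follows. `knn_build` is a hypothetical NumPy function that only mirrors the behavior described here (indices allocated on demand, distances filled and returned only when a buffer is supplied); it makes no claim about how the real `all_neighbors.build` is implemented.

```python
import numpy as np


def knn_build(X, k, indices=None, distances=None):
    """Exact kNN over X with optional caller-provided output buffers."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)  # squared Euclidean distances
    order = np.argsort(d2, axis=1, kind="stable")[:, :k]

    if indices is None:
        # No buffer given: allocate the result here.
        indices = np.empty((n, k), dtype=np.int64)
    indices[...] = order

    if distances is None:
        # Mirrors the thread: distances are skipped unless a buffer is passed.
        return indices

    distances[...] = np.take_along_axis(d2, order, axis=1)
    return indices, distances
```

Passing the same `distances` buffer to several calls avoids reallocating it each time, which is the reuse case mentioned above.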

    distances=distances,
    resources=res,
)
neighbors = neighbors.astype(np.int32)

Won't it cause issues on larger datasets with more samples than int32 can handle?

Member Author

@Intron7 Intron7 Sep 23, 2025

Since this is only the index, it should work. We have to create a sparse matrix from this, and that's also limited to int32.
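If an explicit guard is wanted anyway, a check along these lines would make the limit visible to the user instead of silently wrapping. `to_int32_indices` is a hypothetical helper, shown here with NumPy; the same cast logic is assumed to carry over to CuPy arrays.

```python
import numpy as np


def to_int32_indices(neighbors: np.ndarray, n_obs: int) -> np.ndarray:
    """Cast neighbor indices to int32 after checking that every index fits."""
    max_int32 = np.iinfo(np.int32).max
    # The largest possible row index is n_obs - 1.
    if n_obs - 1 > max_int32:
        raise OverflowError(
            f"{n_obs} observations cannot be indexed with int32 "
            f"(maximum index is {max_int32})"
        )
    return neighbors.astype(np.int32)
```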

Comment on lines +83 to +84
if metric == "euclidean":
    distances = cp.sqrt(distances)

def _all_neighbors_knn(
    X: cp.ndarray,
    Y: cp.ndarray,

All neighbors is a pairwise operation. Is the Y argument for consistency with other functions?

Member Author

Yes. I also use them for bbknn, and there you build a small index and search the whole dataset.
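The bbknn pattern described here (a small per-batch index, queried with every observation) can be sketched with exact NumPy search as below. `batch_balanced_knn` and its argument names are hypothetical and only illustrate why a two-array signature is useful; they do not reproduce the PR's implementation.

```python
import numpy as np


def batch_balanced_knn(X: np.ndarray, batches: np.ndarray, k: int) -> np.ndarray:
    """For every row of X, find its k nearest neighbors inside each batch."""
    parts = []
    for b in np.unique(batches):
        members = np.flatnonzero(batches == b)
        index_data = X[members]  # the small per-batch index
        # Query the whole dataset against this batch (squared Euclidean).
        diff = X[:, None, :] - index_data[None, :, :]
        d2 = (diff ** 2).sum(axis=-1)
        local = np.argsort(d2, axis=1, kind="stable")[:, :k]
        # Map positions inside the batch back to global row indices.
        parts.append(members[local])
    # One block of k columns per batch.
    return np.concatenate(parts, axis=1)
```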

algo=algo,
overlap_factor=overlap_factor,
n_clusters=n_clusters,
metric="sqeuclidean",

Looks like the metric function argument is not used here. Also nn_descent_params and ivf_pq_params should be configured with the same metric.
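A sketch of one way to thread the metric through consistently is given below. It follows the pattern visible in the diff (build with squared Euclidean, then take the square root when the user asked for Euclidean). The helper names, the dictionary layout, and the set of pass-through metrics are assumptions for illustration, not the supported list of any backend.

```python
import numpy as np

# Hypothetical set of metrics forwarded unchanged; the real list depends on the backend.
_PASSTHROUGH_METRICS = {"sqeuclidean", "cosine", "inner_product"}


def resolve_metric(metric: str):
    """Map the user-facing metric to (build metric, whether to sqrt afterwards)."""
    if metric == "euclidean":
        return "sqeuclidean", True
    if metric in _PASSTHROUGH_METRICS:
        return metric, False
    raise ValueError(f"Invalid metric: {metric}")


def make_params(metric: str):
    """Build stand-in parameters in which every sub-config shares one metric."""
    build_metric, needs_sqrt = resolve_metric(metric)
    params = {
        "metric": build_metric,
        "nn_descent_params": {"metric": build_metric},
        "ivf_pq_params": {"metric": build_metric},
    }
    return params, needs_sqrt


def finalize_distances(distances: np.ndarray, needs_sqrt: bool) -> np.ndarray:
    """Convert squared distances back to Euclidean only when requested."""
    return np.sqrt(distances) if needs_sqrt else distances
```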

else:
    metric_to_use = metric
build_params = cagra.IndexParams(
    metric=metric_to_use, graph_degree=k, build_algo="nn_descent"

graph_degree is not necessarily k and build_algo can take other options. But, maybe good defaults?

Comment on lines +16 to +17
def _all_neighbors_knn(
    X: cp.ndarray,

All neighbors requires the dataset to be on host for a multi-GPU run. It would be important to have checks here so that the user is made aware of this.
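Such a check could look roughly like the sketch below. `ensure_host_array` is a hypothetical helper; it detects device arrays through the `__cuda_array_interface__` attribute that CuPy arrays expose, and whether to raise or to copy to host automatically is a design choice left open here.

```python
import numpy as np


def ensure_host_array(X, *, multi_gpu: bool):
    """Reject device arrays when a multi-GPU run needs host memory."""
    if not multi_gpu:
        return X
    # Device arrays (for example CuPy) advertise the CUDA Array Interface.
    if hasattr(X, "__cuda_array_interface__"):
        raise TypeError(
            "Multi-GPU all-neighbors expects the dataset in host memory; "
            "copy it to host first (for a CuPy array, X.get())."
        )
    return np.asarray(X)
```

Raising keeps the host copy explicit, which matters for large datasets where an implicit transfer would be expensive.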

@agemagician

Oh, this is awesome.
When will this PR be merged?

@Intron7 Intron7 merged commit ebe1e6a into main Sep 26, 2025
13 of 16 checks passed
@Intron7 Intron7 deleted the neighbors-multi-GPU branch September 26, 2025 09:26