Add Multi-GPU neighbors#465
Conversation
|
ToDo:
|
viclafargue
left a comment
There was a problem hiding this comment.
Great work! It looks though like indices are built and dropped immediately after search. Might be what is necessary for the use case, but it would probably be neat to have these wrappers split into a build and a search function to allow index preservation over multiple search rounds.
| "intermediate_graph_degree", None | ||
| ) | ||
| nn_descent_params = nn_descent.IndexParams( | ||
| graph_degree=k, intermediate_graph_degree=intermediate_graph_degree |
There was a problem hiding this comment.
graph_degree does not necessarily have to be equal to k. For instance UMAP uses 64 as a default.
| ) | ||
| ivf_pq_params = None | ||
| else: | ||
| raise ValueError(f"Invalid algorithm: {algo}") |
There was a problem hiding this comment.
All neighbors can also be launched in brute_force mode by setting ivf_pq_params and nn_descent_params to None.
| neighbors = cp.zeros([X.shape[0], k], dtype=np.int64) | ||
| distances = cp.zeros([X.shape[0], k], dtype=np.float32) | ||
|
|
||
| neighbors, distances = all_neighbors.build( | ||
| dataset=X, | ||
| k=k, | ||
| params=build_params, | ||
| indices=neighbors, | ||
| distances=distances, | ||
| resources=res, | ||
| ) |
There was a problem hiding this comment.
It looks like you are pre-allocating the neighbors and distances arrays and passing them to the function while also getting them as the return value. This will probably work, but isn't how things are intended to work. There is no need to pre-allocate and pass them as arguments. The function will allocate and return the result on its own. Passing them as argument allows one to re-use the same buffer for multiple operations.
There was a problem hiding this comment.
the distances need to be given as an array to be returned, right?
There was a problem hiding this comment.
You're right, distances is actually optional and not returned if not provided.
| distances=distances, | ||
| resources=res, | ||
| ) | ||
| neighbors = neighbors.astype(np.int32) |
There was a problem hiding this comment.
Won't it cause issues on larger datasets with more samples than int32 can handle?
There was a problem hiding this comment.
since this is only the index it should work. We have to create a sparse martrix from this and thats also limited to int32
| if metric == "euclidean": | ||
| distances = cp.sqrt(distances) |
There was a problem hiding this comment.
Hasn't the sqrt operation been applied already?https://github.com/rapidsai/cuvs/blob/8fc4a2ad134f2d2d7763b026d0965b328ecc8db8/python/cuvs/cuvs/distance/distance.pyx#L29-L31
|
|
||
| def _all_neighbors_knn( | ||
| X: cp.ndarray, | ||
| Y: cp.ndarray, |
There was a problem hiding this comment.
All neighbors is a pairwise operation. Is the Y argument for consistency with other functions?
There was a problem hiding this comment.
Yes. I also use them for bbknn and there you build a small index and search the whole dataset
| algo=algo, | ||
| overlap_factor=overlap_factor, | ||
| n_clusters=n_clusters, | ||
| metric="sqeuclidean", |
There was a problem hiding this comment.
Looks like the metric function argument is not used here. Also nn_descent_params and ivf_pq_params should be configured with the same metric.
| else: | ||
| metric_to_use = metric | ||
| build_params = cagra.IndexParams( | ||
| metric=metric_to_use, graph_degree=k, build_algo="nn_descent" |
There was a problem hiding this comment.
graph_degree is not necessarily k and build_algo can take other options. But, maybe good defaults?
| def _all_neighbors_knn( | ||
| X: cp.ndarray, |
There was a problem hiding this comment.
All neighbors requires the dataset to be on host for multi-GPU run. Would be important to have checks here to ensure that the user is made aware of this.
|
Oh, this is awesome . |
This will add Multi-GPU neighbors