Dispatch to use fp32 distance computation in NN Descent depending on data dimensions #1415
Conversation
We might need to have fp16 vs fp32 as an option (and default to fp16) instead of depending on data dimensions. WIP investigating.
```cpp
template <typename Index_t, typename Data_t, typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric(float* s_distances,
                                                 /* ...remaining parameters elided in the diff */)
```
This is factored out as a separate function from the original kernel because it is shared by the fp32 and fp16 paths.
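For orientation, a minimal, simplified sketch of what such a shared helper might do; the parameter list and body here are hypothetical, not the actual cuVS signature:

```cpp
// Hypothetical, simplified sketch of a shared metric helper: turn the inner
// products accumulated in s_distances into L2 distances and apply the
// epilogue. Both the fp16 and fp32 kernels would call this after their
// dtype-specific matmul.
template <typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric_sketch(float* s_distances,
                                                        const float* s_norms,
                                                        int list_size,
                                                        int stride,
                                                        DistEpilogue_t epilogue)
{
  for (int i = threadIdx.x; i < list_size * stride; i += blockDim.x) {
    int row = i / stride;
    int col = i % stride;
    if (row < list_size && col < list_size) {
      // ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * <x, y>
      float l2 = s_norms[row] + s_norms[col] - 2.0f * s_distances[i];
      s_distances[i] = epilogue(l2, row, col);
    }
  }
}
```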
```cpp
// this is much faster than a warp-collaborative multiplication because MAX_NUM_BI_SAMPLES is
// fixed and small (64)
for (int i = threadIdx.x; i < MAX_NUM_BI_SAMPLES * SKEWED_MAX_NUM_BI_SAMPLES;
     i += blockDim.x) {
  int tmp_row = i / SKEWED_MAX_NUM_BI_SAMPLES;
  int tmp_col = i % SKEWED_MAX_NUM_BI_SAMPLES;
  if (tmp_row < list_new_size && tmp_col < list_new_size) {
    float acc = 0.0f;
    for (int d = 0; d < num_load_elems; d++) {
      acc += s_nv[tmp_row][d] * s_nv[tmp_col][d];
    }
    s_distances[i] += acc;
  }
}
```
This matmul part is different from the fp16 kernel
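For contrast, the fp16 kernel can run this tile on the tensor cores via wmma. A minimal standalone sketch of that style of matmul (hypothetical helper name; assumes 16-aligned, row-major tiles), which is exactly what forces the multiplicands down to half precision:

```cpp
#include <mma.h>
using namespace nvcuda;

// Hypothetical sketch: one warp computes a 16x16 tile of A * B^T from
// row-major fp16 inputs, accumulating in fp32 on the tensor cores.
// Leading dimensions must be multiples of 16 for wmma loads.
__device__ void mma_tile_16x16(const __half* a_tile,  // 16 x 16, row-major
                               const __half* b_tile,  // 16 x 16, row-major (used as B^T)
                               float* c_tile,         // 16 x 16, row-major
                               int lda, int ldb, int ldc)
{
  wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, a_tile, lda);
  // loading row-major B with a col_major fragment gives B^T
  wmma::load_matrix_sync(b_frag, b_tile, ldb);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c_tile, c_frag, ldc, wmma::mem_row_major);
}
```

fp16 tensor-core MMA requires half-precision multiplicands, which is the rounding the manual fp32 loop above avoids for low-dimensional data.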
divyegala left a comment:
I would like to see a unit test, if possible, where you can demonstrate the distance-computation improvement of fp32 over fp16.
Thanks for the feedback @divyegala, changing this to target 26.02 for now because Corey suggested we do further investigation. Force-pushing after rebasing.
(force-pushed from 72bab3e to 5e8600d)
```cpp
static_assert(NUM_SAMPLES <= 32);

using input_t = typename std::remove_const<Data_t>::type;
if (std::is_same_v<input_t, float> && build_config.dataset_dim <= 16) {
  // ...fp32 path elided in the diff
```
dispatch logic based on dimensions
@divyegala based on the latest benchmarks (and confirming from HDBSCAN that we don't have quality issues from using fp16 for larger dimensions), the mechanism is back to dispatching based on the number of dimensions.

Not sure how to add this as a test. Any suggestions?
@jinsolp is it possible to generate a dataset that shows degraded recall for fp16 and good recall for fp32? The way I envision the assertion is …
We can (e.g. the blobs data with dim=2, which shows 0.4 recall for fp16 vs 0.99 recall for fp32). Should I hardwire the fp16 results and add them as a vector to the test?
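A sketch of how such an assertion could be structured; the names are illustrative, not the actual cuVS test harness, and the thresholds come from the blobs numbers quoted above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fraction of ground-truth neighbors recovered, averaged over all rows.
double recall(const std::vector<std::vector<int>>& result,
              const std::vector<std::vector<int>>& ground_truth)
{
  size_t hits = 0, total = 0;
  for (size_t row = 0; row < result.size(); ++row) {
    for (int idx : result[row]) {
      hits += std::count(ground_truth[row].begin(), ground_truth[row].end(), idx);
    }
    total += ground_truth[row].size();
  }
  return static_cast<double>(hits) / total;
}

// Hypothetical assertion on low-dimensional blobs data: fp16 recall degrades,
// fp32 recall stays high (0.4 vs 0.99 in the runs quoted above).
void check_fp32_beats_fp16(const std::vector<std::vector<int>>& fp16_graph,
                           const std::vector<std::vector<int>>& fp32_graph,
                           const std::vector<std::vector<int>>& brute_force)
{
  assert(recall(fp16_graph, brute_force) < 0.5);
  assert(recall(fp32_graph, brute_force) > 0.95);
}
```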
@jinsolp we should allow the user an override in case the heuristic fails. Let users choose fp32 explicitly if they want, and exporting that option will also let you test the separate paths.
Oh okay, so let the user choose an option, and if none is given fall back to the heuristic dim=16 threshold?
The other way around: use the heuristic eagerly, but let the user override.
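In other words, a resolution along these lines (a sketch; only AUTO is confirmed by the diff below, so the FP16/FP32 enum members are assumptions):

```cpp
// Assumed enum; the diff below only confirms AUTO.
enum class DIST_COMP_DTYPE { AUTO, FP16, FP32 };

// Sketch of eager-heuristic-with-override resolution.
bool use_fp32_distances(DIST_COMP_DTYPE requested, bool input_is_float, size_t dataset_dim)
{
  switch (requested) {
    case DIST_COMP_DTYPE::FP32: return true;   // explicit user override
    case DIST_COMP_DTYPE::FP16: return false;  // explicit user override
    case DIST_COMP_DTYPE::AUTO:
    default: return input_is_float && dataset_dim <= 16;  // the heuristic
  }
}
```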
@divyegala we now have a `dist_comp_dtype` option in the index params.
Please default initialize the index creation with this new parameter.
```cpp
size_t max_iterations = 20;
float termination_threshold = 0.0001;
bool return_distances = true;
DIST_COMP_DTYPE dist_comp_dtype = DIST_COMP_DTYPE::AUTO;
```
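A hypothetical caller-side usage of the new field, assuming the existing cuVS C++ `index_params` struct and an explicit `FP32` enum member (only `AUTO` is confirmed by the diff):

```cpp
#include <cuvs/neighbors/nn_descent.hpp>

namespace nnd = cuvs::neighbors::nn_descent;

nnd::index_params params;
// Leaving the default (DIST_COMP_DTYPE::AUTO) keeps the dim-based heuristic.
params.dist_comp_dtype = nnd::DIST_COMP_DTYPE::FP32;  // assumed member: force fp32 distances
```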
@divyegala It already defaults to AUTO here, which would then be picked up in nn_descent.cpp.
That file is the source of the C API, and this default is for the C++ API.
I believe all the C++ defaults are passed when we call `cuvsNNDescentIndexParamsCreate`:

cuvs/c/src/neighbors/nn_descent.cpp, lines 167 to 182 at 1959a0d
C doesn't support default struct initialization, so nothing is initialized in the C struct:

cuvs/c/include/cuvs/neighbors/nn_descent.h, lines 38 to 46 at 1959a0d
I haven't kept up with the changes to the C API, my apologies. We used to explicitly default-initialize in the create function.
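Presumably something along these lines (a sketch only; `CUVS_DIST_COMP_DTYPE_AUTO` and the struct layout are assumptions mirroring the C++ defaults shown earlier):

```cpp
#include <cuvs/neighbors/nn_descent.h>  // C API header referenced above

extern "C" cuvsError_t cuvsNNDescentIndexParamsCreate(cuvsNNDescentIndexParams_t* params)
{
  // Sketch: explicitly mirror the C++ defaults, since C has no default
  // member initialization. CUVS_DIST_COMP_DTYPE_AUTO is an assumed C enum name.
  *params = new cuvsNNDescentIndexParams{};
  (*params)->max_iterations        = 20;
  (*params)->termination_threshold = 0.0001f;
  (*params)->return_distances      = true;
  (*params)->dist_comp_dtype       = CUVS_DIST_COMP_DTYPE_AUTO;
  return CUVS_SUCCESS;
}
```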
/merge




Closes #1370
Closes #195
This PR adds an option to use fp32 distance computation.
(Outdated) From heuristics, we chose dim=16 as the threshold for dispatching to the fp32 distance kernel. We do the computation manually, but since we only target small dimensions, fp32 dispatching ends up being slightly faster end to end, with much better recall for small dimensions.
All numbers below were run on an L40 machine with an AMD EPYC CPU (128 cores). Perf and recall are averaged over 5 runs, and all times are in seconds. The baseline kNN graph is computed using `sklearn.neighbors.NearestNeighbors` with the brute-force method.

Max iters=20
For larger dimensions there is an inherent issue with the NN Descent algorithm itself that makes the recall low. This can be improved slightly with more iterations.
Also notice that the end-to-end time is similar or slightly lower when using fp32.
Max iters=100
Notice how, for the blue part, the recall doesn't improve compared to the table above even with more iterations (i.e. this is why we need the fp32 approach for this part).
Perf impact on different architectures
H100
L40