Dispatch to use fp32 distance computation in NN Descent depending on data dimensions#1415

Merged
rapids-bot[bot] merged 13 commits into rapidsai:main from jinsolp:fix-nnd-recall-fp32 on Dec 12, 2025
Conversation

@jinsolp (Contributor) commented Oct 8, 2025

Closes #1370
Closes #195

This PR adds an option to use fp32 distance computation.

(Outdated) Based on heuristics, dim=16 was chosen as the threshold for dispatching to an fp32 distance kernel.
The fp32 path does the multiplication manually (a plain per-thread loop rather than a warp-collaborative multiplication), but since it only targets small dimensions, it ends up slightly faster end to end with much better recall for small dimensions.

All numbers below were collected on an L40 GPU and an AMD EPYC CPU with 128 cores. Performance and recall are averaged over 5 runs, and all times are in seconds. The baseline kNN graph is computed with sklearn.neighbors.NearestNeighbors using the brute-force method.
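(Recall here presumably means recall@k against that brute-force graph: per query, the fraction of the k true nearest neighbors that NN Descent recovers, averaged over all queries.)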

Max iters=20

[Screenshot: benchmark table, max iters=20]

For larger dimensions there is an inherent issue with the NN Descent algorithm itself that keeps recall low; this can be improved slightly with more iterations.
Also notice that the e2e time is similar or slightly lower when using fp32.

Max iters=100

[Screenshot: benchmark table, max iters=100]

Notice how in the blue part the recall does not improve compared to the table above even with more iterations (i.e., this is why we need the fp32 approach for this part).
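One way to see why extra iterations cannot help, assuming the kernels use the expanded form ||x||^2 + ||y||^2 - 2<x,y>: fp16 carries ~11 mantissa bits, so each term picks up an absolute error on the order of ||x||^2 * 2^-11. For points of norm ~10 that is an error of roughly 100 * 2^-11 ≈ 0.05 on squared distances, while true nearest-neighbor squared distances inside a tight low-dimensional cluster can be orders of magnitude smaller, so the candidate ordering is already scrambled before NN Descent iterates.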

Perf impact on different architectures

H100

[Screenshot: H100 benchmark results]

L40

[Screenshot: L40 benchmark results]

@jinsolp jinsolp self-assigned this Oct 8, 2025
@jinsolp jinsolp requested a review from a team as a code owner October 8, 2025 17:59
@jinsolp jinsolp added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Oct 8, 2025
@jinsolp jinsolp changed the base branch from main to release/25.12 November 17, 2025 17:04
@jinsolp (Contributor, Author) commented Nov 18, 2025

We might need to offer fp16 vs fp32 as an option (defaulting to fp16) instead of dispatching on data dimensions. WIP, investigating.

@jinsolp jinsolp requested a review from a team as a code owner November 19, 2025 02:42
@jinsolp jinsolp changed the title from "Dispatch to use fp32 distance computation in NN Descent depending on data dimensions" to "Add fp32 distance computation option in NN Descent" Nov 19, 2025
}

template <typename Index_t, typename Data_t, typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric(float* s_distances,
jinsolp (Contributor, Author):
This is factored out of the original kernel into a separate function because it is shared by the fp32 and fp16 paths.

Comment on lines +659 to +672
// this is much faster than a warp-collaborative multiplication because MAX_NUM_BI_SAMPLES is
// fixed and small (64)
for (int i = threadIdx.x; i < MAX_NUM_BI_SAMPLES * SKEWED_MAX_NUM_BI_SAMPLES;
     i += blockDim.x) {
  // each thread owns one (row, col) cell of the skewed distance tile
  int tmp_row = i / SKEWED_MAX_NUM_BI_SAMPLES;
  int tmp_col = i % SKEWED_MAX_NUM_BI_SAMPLES;
  if (tmp_row < list_new_size && tmp_col < list_new_size) {
    float acc = 0.0f;
    // accumulate the partial dot product for the currently loaded chunk in fp32
    for (int d = 0; d < num_load_elems; d++) {
      acc += s_nv[tmp_row][d] * s_nv[tmp_col][d];
    }
    s_distances[i] += acc;
  }
}
jinsolp (Contributor, Author):
This matmul part is different from the fp16 kernel

@jinsolp jinsolp requested a review from divyegala November 20, 2025 01:07
@divyegala (Member) left a comment:
I would like to see a unit test, if possible, where you can demonstrably prove distance computation improvements in fp32 over fp16.

@jinsolp (Contributor, Author) commented Nov 21, 2025

Thanks for the feedback @divyegala. Changing this to target 26.02 for now because Corey suggested further investigation. Force-pushing after rebasing.

@jinsolp jinsolp changed the base branch from release/25.12 to main November 21, 2025 00:52
@jinsolp jinsolp requested a review from a team as a code owner November 21, 2025 00:52
@jinsolp jinsolp force-pushed the fix-nnd-recall-fp32 branch from 72bab3e to 5e8600d Compare November 21, 2025 02:23
static_assert(NUM_SAMPLES <= 32);

using input_t = typename std::remove_const<Data_t>::type;
if (std::is_same_v<input_t, float> && build_config.dataset_dim <= 16) {
jinsolp (Contributor, Author):
dispatch logic based on dimensions

@jinsolp jinsolp changed the title from "Add fp32 distance computation option in NN Descent" to "Dispatch to use fp32 distance computation in NN Descent depending on data dimensions" Dec 3, 2025
@jinsolp (Contributor, Author) commented Dec 3, 2025

Blue is better (e2e time or kernel time):

  • The fp32 kernel is usually faster for small dimensions.
  • The fp32 e2e time is sometimes slightly longer for small dimensions despite the shorter kernel time; I believe this small gap comes from extra data transfer (4 bytes vs 2 bytes per data element). See the rough arithmetic below.
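As a rough illustration of the transfer gap (back-of-the-envelope numbers, not from the benchmarks): a 1M-point dataset at dim=16 occupies 1M × 16 × 4 B = 64 MB in fp32 versus 32 MB in fp16, so every copy of the dataset moves twice the bytes; at ~25 GB/s over PCIe 4.0 that is roughly 2.6 ms versus 1.3 ms per transfer.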

L40 GPU

[Screenshots: L40 e2e and kernel-time benchmark results]

H100 GPU

[Screenshots: H100 e2e and kernel-time benchmark results]

@jinsolp (Contributor, Author) commented Dec 3, 2025

@divyegala based on the latest benchmarks (and after confirming via HDBSCAN that fp16 does not cause quality issues for larger dimensions), the mechanism is back to dispatching based on the number of dimensions (16 was chosen as the threshold).

> I would like to see a unit test, if possible, where you can demonstrably prove distance computation improvements in fp32 over fp16.

Not sure how to add this as a test. Any suggestions?

@divyegala (Member):

> Not sure how to add this as a test. Any suggestions?

@jinsolp is it possible to generate a dataset that shows degraded recall for fp16 and good recall for fp32? The way I envision the assertion is ASSERT(fp16_recall < <upperbound_value> && fp32_recall >= <lowerbound_value>).
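For reference, a self-contained sketch of that assertion shape, independent of the cuvs API: fp16 is emulated by rounding the expanded-form distance terms through __half (the host-callable conversions in cuda_fp16.h, so this compiles with plain nvcc and runs on the CPU), and the blobs layout, thresholds, and all helper names are made up for the sketch.

#include <cuda_fp16.h>  // host-callable __float2half / __half2float; compile with nvcc

#include <algorithm>
#include <cassert>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

constexpr int N = 1000, DIM = 2, K = 10;

// Squared L2 in the expanded form ||x||^2 + ||y||^2 - 2<x,y>; when use_fp16 is set, the inputs
// and the three accumulated terms are rounded through __half to mimic an fp16 pipeline.
float expanded_l2(const float* x, const float* y, bool use_fp16)
{
  float xx = 0.f, yy = 0.f, xy = 0.f;
  for (int d = 0; d < DIM; d++) {
    float xd = x[d], yd = y[d];
    if (use_fp16) {
      xd = __half2float(__float2half(xd));
      yd = __half2float(__float2half(yd));
    }
    xx += xd * xd;
    yy += yd * yd;
    xy += xd * yd;
  }
  if (use_fp16) {
    xx = __half2float(__float2half(xx));
    yy = __half2float(__float2half(yy));
    xy = __half2float(__float2half(xy));
  }
  return xx + yy - 2.f * xy;
}

// Ground-truth squared L2 in the direct form (no catastrophic cancellation)
float direct_l2(const float* x, const float* y)
{
  float acc = 0.f;
  for (int d = 0; d < DIM; d++) {
    float diff = x[d] - y[d];
    acc += diff * diff;
  }
  return acc;
}

// ids of the K nearest neighbors of point q under a distance functor, excluding q itself
template <typename Dist>
std::vector<int> topk(const std::vector<float>& data, int q, Dist dist)
{
  std::vector<int> ids(N);
  std::iota(ids.begin(), ids.end(), 0);
  std::partial_sort(ids.begin(), ids.begin() + K + 1, ids.end(), [&](int a, int b) {
    return dist(&data[q * DIM], &data[a * DIM]) < dist(&data[q * DIM], &data[b * DIM]);
  });
  std::vector<int> out;
  for (int i = 0; i <= K && (int)out.size() < K; i++)
    if (ids[i] != q) out.push_back(ids[i]);
  return out;
}

int main()
{
  // blobs-like data: four tight 2-D clusters far from the origin, so squared norms (~200)
  // dwarf nearest-neighbor squared distances and fp16's ~2^-11 precision loses the ordering
  std::mt19937 gen(42);
  std::normal_distribution<float> noise(0.f, 0.5f);
  const float centers[4][2] = {{10, 10}, {10, -10}, {-10, 10}, {-10, -10}};
  std::vector<float> data(N * DIM);
  for (int i = 0; i < N; i++)
    for (int d = 0; d < DIM; d++)
      data[i * DIM + d] = centers[i % 4][d] + noise(gen);

  double hits16 = 0, hits32 = 0;
  for (int q = 0; q < N; q++) {
    auto truth = topk(data, q, direct_l2);
    auto nn16  = topk(data, q, [](const float* a, const float* b) { return expanded_l2(a, b, true); });
    auto nn32  = topk(data, q, [](const float* a, const float* b) { return expanded_l2(a, b, false); });
    for (int id : nn16) hits16 += std::count(truth.begin(), truth.end(), id);
    for (int id : nn32) hits32 += std::count(truth.begin(), truth.end(), id);
  }
  double recall16 = hits16 / (N * K), recall32 = hits32 / (N * K);
  printf("fp16 recall %.3f, fp32 recall %.3f\n", recall16, recall32);
  assert(recall16 < 0.9 && recall32 >= 0.9);  // the suggested assertion shape
  return 0;
}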

@jinsolp (Contributor, Author) commented Dec 3, 2025

We can (e.g. the blobs data with dim=2, which shows 0.4 recall for fp16 vs 0.99 recall for fp32), but if we run it through the current code it will always take the fp32 path.

Should I hardwire the fp16 results and add them as a vector to the test?

@divyegala (Member):

@jinsolp we should allow the user an override in case the heuristic fails. Let users choose fp32 explicitly if they want; exporting that option will also let you test the separate paths.

@jinsolp (Contributor, Author) commented Dec 3, 2025

Oh okay, so let the user choose an option, and if none is given fall back to the heuristic dim=16 threshold?

@divyegala (Member):

The other way around. Use the heuristic eagerly, but let user override.

@jinsolp (Contributor, Author) commented Dec 3, 2025

@divyegala we now have a dist_comp_dtype option defaulting to auto; it can be overridden with fp32 or fp16.
Added a test as well.

divyegala (Member):

Please default-initialize the index creation with this new parameter.

size_t max_iterations = 20;
float termination_threshold = 0.0001;
bool return_distances = true;
DIST_COMP_DTYPE dist_comp_dtype = DIST_COMP_DTYPE::AUTO;
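For reference, a minimal sketch of how the AUTO default can resolve to a concrete dtype, combining the dim<=16 heuristic with the explicit override discussed above (only DIST_COMP_DTYPE and dist_comp_dtype appear in the PR; the enum body and helper name are assumptions):

// sketch only: the real declaration in the PR may differ
enum class DIST_COMP_DTYPE { AUTO, FP16, FP32 };

DIST_COMP_DTYPE resolve_dist_comp_dtype(DIST_COMP_DTYPE requested, size_t dataset_dim)
{
  // an explicit user choice always wins over the heuristic
  if (requested != DIST_COMP_DTYPE::AUTO) return requested;
  // heuristic from the benchmarks: small dimensions need fp32 to keep recall high
  return dataset_dim <= 16 ? DIST_COMP_DTYPE::FP32 : DIST_COMP_DTYPE::FP16;
}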
@jinsolp (Contributor, Author) commented Dec 10, 2025:
@divyegala It already defaults to AUTO here, which is then picked up in nn_descent.cpp

divyegala (Member):

That file is the source of the C API, and this default is for the C++ API.

jinsolp (Contributor, Author):

I believe all the C++ defaults are passed along when we call cuvsNNDescentIndexParamsCreate:

extern "C" cuvsError_t cuvsNNDescentIndexParamsCreate(cuvsNNDescentIndexParams_t* params)
{
return cuvs::core::translate_exceptions([=] {
// get defaults from cpp parameters struct
cuvs::neighbors::nn_descent::index_params cpp_params;
*params = new cuvsNNDescentIndexParams{
.metric = static_cast<cuvsDistanceType>((int)cpp_params.metric),
.metric_arg = cpp_params.metric_arg,
.graph_degree = cpp_params.graph_degree,
.intermediate_graph_degree = cpp_params.intermediate_graph_degree,
.max_iterations = cpp_params.max_iterations,
.termination_threshold = cpp_params.termination_threshold,
.return_distances = cpp_params.return_distances};
});
}

C doesn't support default struct initialization, so nothing is initialized in the C struct itself:

struct cuvsNNDescentIndexParams {
  cuvsDistanceType metric;
  float metric_arg;
  size_t graph_degree;
  size_t intermediate_graph_degree;
  size_t max_iterations;
  float termination_threshold;
  bool return_distances;
};
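If it follows the pattern of the other fields, the fix is presumably a new member in the C struct plus one more line in the create function's initializer; a sketch (the C-side enum and enumerator names are assumptions, not the merged code):

// sketch only: names below are hypothetical
typedef enum { CUVS_DIST_COMP_AUTO, CUVS_DIST_COMP_FP16, CUVS_DIST_COMP_FP32 } cuvsDistCompDtype;

struct cuvsNNDescentIndexParams {
  // ... existing fields ...
  bool return_distances;
  cuvsDistCompDtype dist_comp_dtype;  // new field
};

// and in cuvsNNDescentIndexParamsCreate's designated initializer:
//   .dist_comp_dtype = static_cast<cuvsDistCompDtype>((int)cpp_params.dist_comp_dtype),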

divyegala (Member):

My apologies, I haven't kept up with the changes to the C API. We used to default-initialize explicitly in the create function.

@jinsolp (Contributor, Author) commented Dec 12, 2025

/merge

@rapids-bot rapids-bot bot merged commit a2ddea4 into rapidsai:main Dec 12, 2025
262 of 268 checks passed
@jinsolp jinsolp deleted the fix-nnd-recall-fp32 branch December 12, 2025 22:49

Labels

improvement — Improves an existing functionality
non-breaking — Introduces a non-breaking change

Development

Successfully merging this pull request may close these issues.

[BUG] cuVS Nearest Neighbors recall lower than expected for some datasets
[FEA] Calculating distances with FP32 in NN Descent

2 participants