Dispatch to use fp32 distance computation in NN Descent depending on data dimensions #1415
Conversation
We might need to have fp16 vs fp32 as an option (and default to fp16) instead of depending on data dimensions. WIP investigating.
```cpp
template <typename Index_t, typename Data_t, typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric(float* s_distances,
                                                 /* ...remaining parameters elided in the diff */)
```
This is factored out as a separate function from the original kernel because it is shared by the fp32 and fp16 paths.
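For orientation, a minimal, simplified sketch of what such a shared helper might do; the parameter list and body here are hypothetical, not the actual cuVS signature:

```cpp
// Hypothetical, simplified sketch of a shared metric helper: turn the inner
// products accumulated in s_distances into L2 distances and apply the
// epilogue. Both the fp16 and fp32 kernels would call this after their
// dtype-specific matmul.
template <typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric_sketch(float* s_distances,
                                                        const float* s_norms,
                                                        int list_size,
                                                        int stride,
                                                        DistEpilogue_t epilogue)
{
  for (int i = threadIdx.x; i < list_size * stride; i += blockDim.x) {
    int row = i / stride;
    int col = i % stride;
    if (row < list_size && col < list_size) {
      // ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * <x, y>
      float l2 = s_norms[row] + s_norms[col] - 2.0f * s_distances[i];
      s_distances[i] = epilogue(l2, row, col);
    }
  }
}
```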
```cpp
// this is much faster than a warp-collaborative multiplication because MAX_NUM_BI_SAMPLES is
// fixed and small (64)
for (int i = threadIdx.x; i < MAX_NUM_BI_SAMPLES * SKEWED_MAX_NUM_BI_SAMPLES;
     i += blockDim.x) {
  int tmp_row = i / SKEWED_MAX_NUM_BI_SAMPLES;
  int tmp_col = i % SKEWED_MAX_NUM_BI_SAMPLES;
  if (tmp_row < list_new_size && tmp_col < list_new_size) {
    float acc = 0.0f;
    for (int d = 0; d < num_load_elems; d++) {
      acc += s_nv[tmp_row][d] * s_nv[tmp_col][d];
    }
    s_distances[i] += acc;
  }
}
```
This matmul part is different from the fp16 kernel
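For contrast, the fp16 kernel can run this tile on the tensor cores via wmma. A minimal standalone sketch of that style of matmul (hypothetical helper name; assumes 16-aligned, row-major tiles), which is exactly what forces the multiplicands down to half precision:

```cpp
#include <mma.h>
using namespace nvcuda;

// Hypothetical sketch: one warp computes a 16x16 tile of A * B^T from
// row-major fp16 inputs, accumulating in fp32 on the tensor cores.
// Leading dimensions must be multiples of 16 for wmma loads.
__device__ void mma_tile_16x16(const __half* a_tile,  // 16 x 16, row-major
                               const __half* b_tile,  // 16 x 16, row-major (used as B^T)
                               float* c_tile,         // 16 x 16, row-major
                               int lda, int ldb, int ldc)
{
  wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, a_tile, lda);
  // loading row-major B with a col_major fragment gives B^T
  wmma::load_matrix_sync(b_frag, b_tile, ldb);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c_tile, c_frag, ldc, wmma::mem_row_major);
}
```

fp16 tensor-core MMA requires half-precision multiplicands, which is the rounding the manual fp32 loop above avoids for low-dimensional data.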
divyegala left a comment:
I would like to see a unit test, if possible, where you can demonstrate the distance-computation improvement of fp32 over fp16.
Thanks for the feedback @divyegala, changing this to target 26.02 for now because Corey suggested we do further investigation. Force-pushing after rebasing.
(force-pushed from 72bab3e to 5e8600d)
```cpp
static_assert(NUM_SAMPLES <= 32);

using input_t = typename std::remove_const<Data_t>::type;
if (std::is_same_v<input_t, float> && build_config.dataset_dim <= 16) {
  // ...fp32 path elided in the diff
```
dispatch logic based on dimensions
@divyegala based on the latest benchmarks (and confirming from HDBSCAN that we don't have quality issues from using fp16 for larger dimensions), the mechanism is back to dispatching based on the number of dimensions.

Not sure how to add this as a test. Any suggestions?
@jinsolp is it possible to generate a dataset that shows degraded recall for fp16 and good recall for fp32? The way I envision the assertion is …
We can (e.g. the blobs data with dim=2, which shows 0.4 recall for fp16 vs 0.99 recall for fp32). Should I hardwire the fp16 results and add them as a vector to the test?
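A sketch of how such an assertion could be structured; the names are illustrative, not the actual cuVS test harness, and the thresholds come from the blobs numbers quoted above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fraction of ground-truth neighbors recovered, averaged over all rows.
double recall(const std::vector<std::vector<int>>& result,
              const std::vector<std::vector<int>>& ground_truth)
{
  size_t hits = 0, total = 0;
  for (size_t row = 0; row < result.size(); ++row) {
    for (int idx : result[row]) {
      hits += std::count(ground_truth[row].begin(), ground_truth[row].end(), idx);
    }
    total += ground_truth[row].size();
  }
  return static_cast<double>(hits) / total;
}

// Hypothetical assertion on low-dimensional blobs data: fp16 recall degrades,
// fp32 recall stays high (0.4 vs 0.99 in the runs quoted above).
void check_fp32_beats_fp16(const std::vector<std::vector<int>>& fp16_graph,
                           const std::vector<std::vector<int>>& fp32_graph,
                           const std::vector<std::vector<int>>& brute_force)
{
  assert(recall(fp16_graph, brute_force) < 0.5);
  assert(recall(fp32_graph, brute_force) > 0.95);
}
```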
@jinsolp we should allow the user an override in case the heuristic fails. Let users choose fp32 explicitly if they want, and exporting that option will also let you test the separate paths.
Oh okay, so let the user choose an option, and if none is given fall back to the heuristic dim=16 threshold?
The other way around: use the heuristic eagerly, but let the user override.
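In other words, a resolution along these lines (a sketch; only AUTO is confirmed by the diff below, so the FP16/FP32 enum members are assumptions):

```cpp
// Assumed enum; the diff below only confirms AUTO.
enum class DIST_COMP_DTYPE { AUTO, FP16, FP32 };

// Sketch of eager-heuristic-with-override resolution.
bool use_fp32_distances(DIST_COMP_DTYPE requested, bool input_is_float, size_t dataset_dim)
{
  switch (requested) {
    case DIST_COMP_DTYPE::FP32: return true;   // explicit user override
    case DIST_COMP_DTYPE::FP16: return false;  // explicit user override
    case DIST_COMP_DTYPE::AUTO:
    default: return input_is_float && dataset_dim <= 16;  // the heuristic
  }
}
```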
@divyegala we now have a `dist_comp_dtype` option in the index params.
Please default initialize the index creation with this new parameter.
```cpp
size_t max_iterations = 20;
float termination_threshold = 0.0001;
bool return_distances = true;
DIST_COMP_DTYPE dist_comp_dtype = DIST_COMP_DTYPE::AUTO;
```
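A hypothetical caller-side usage of the new field, assuming the existing cuVS C++ `index_params` struct and an explicit `FP32` enum member (only `AUTO` is confirmed by the diff):

```cpp
#include <cuvs/neighbors/nn_descent.hpp>

namespace nnd = cuvs::neighbors::nn_descent;

nnd::index_params params;
// Leaving the default (DIST_COMP_DTYPE::AUTO) keeps the dim-based heuristic.
params.dist_comp_dtype = nnd::DIST_COMP_DTYPE::FP32;  // assumed member: force fp32 distances
```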
@divyegala It already defaults to AUTO here, which would then be picked up in nn_descent.cpp.
That file is the source of the C API, and this default is for the C++ API.
I believe all the C++ defaults are passed when we call `cuvsNNDescentIndexParamsCreate`:

cuvs/c/src/neighbors/nn_descent.cpp, lines 167 to 182 at 1959a0d
C doesn't support default struct initialization, so nothing is initialized in the C struct:

cuvs/c/include/cuvs/neighbors/nn_descent.h, lines 38 to 46 at 1959a0d
I haven't kept up with the changes to the C API, my apologies. We used to explicitly default-initialize in the create function.
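Presumably something along these lines (a sketch only; `CUVS_DIST_COMP_DTYPE_AUTO` and the struct layout are assumptions mirroring the C++ defaults shown earlier):

```cpp
#include <cuvs/neighbors/nn_descent.h>  // C API header referenced above

extern "C" cuvsError_t cuvsNNDescentIndexParamsCreate(cuvsNNDescentIndexParams_t* params)
{
  // Sketch: explicitly mirror the C++ defaults, since C has no default
  // member initialization. CUVS_DIST_COMP_DTYPE_AUTO is an assumed C enum name.
  *params = new cuvsNNDescentIndexParams{};
  (*params)->max_iterations        = 20;
  (*params)->termination_threshold = 0.0001f;
  (*params)->return_distances      = true;
  (*params)->dist_comp_dtype       = CUVS_DIST_COMP_DTYPE_AUTO;
  return CUVS_SUCCESS;
}
```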
/merge




Closes #1370
Closes #195
This PR adds an option to use fp32 distance computation.
(Outdated) From heuristics, we chose dim=16 as the threshold for dispatching to the fp32 distance kernel. We do the computation manually, but since we only target small dimensions, fp32 dispatching ends up being slightly faster end to end, with much better recall for small dimensions.
All numbers below were run on an L40 machine with an AMD EPYC CPU (128 cores). Perf and recall are averaged over 5 runs, and all times are in seconds. The baseline kNN graph is computed using `sklearn.neighbors.NearestNeighbors` with the brute-force method.

Max iters=20
For larger dimensions there is an inherent issue with the NN Descent algorithm itself that makes the recall low. This can be improved slightly with more iterations.
Also notice that the end-to-end time is similar or slightly lower when using fp32.
Max iters=100
Notice how, for the blue part, the recall doesn't improve compared to the table above even with more iterations (i.e. this is why we need the fp32 approach for this part).
Perf impact on different architectures
H100
L40