
[BUG] (?) raft::neighbors::ivf_pq::search<float, int64_t> returns good neighbors indices, but zero distances #1480

Open
alexanderguzhva opened this issue May 1, 2023 · 13 comments

@alexanderguzhva
Contributor

Describe the bug
raft::neighbors::ivf_pq::search<float, int64_t> returns correct neighbor indices, but all-zero distances.

Steps/Code to reproduce bug

  • tried with both small and large values of topk
  • the effect is consistent
  • I've confirmed that actual zero values are written to the output distances (by commenting out the postprocess_distances() call and observing that, as expected, no values in the output distances array changed); a minimal host-side zero-count check is sketched below, after the snippet

The code looks like this:

  using data_t = float;
  using idx_t = int64_t;

  // raw pointer to the query matrix (n_rows x dim), already on device
  data_t* data_search_queries = query.template data_ptr<data_t>();

  // output buffers for the top-k neighbor indices and distances
  at::Tensor indices_ivfpq =
      at::empty({n_rows, topk}, c10::TensorOptions().dtype(at::kLong)).cuda();
  idx_t* data_indices_ivfpq = indices_ivfpq.template data_ptr<idx_t>();

  at::Tensor distances_ivfpq =
      at::empty({n_rows, topk}, c10::TensorOptions().dtype(at::kFloat)).cuda();
  data_t* data_distances_ivfpq = distances_ivfpq.template data_ptr<data_t>();

  cudaDeviceSynchronize();

  // wrap the raw pointers in mdspan views for the raft API
  auto query_view = raft::make_device_matrix_view<const data_t, idx_t>(
      data_search_queries, n_rows, dim);
  auto inds_view = raft::make_device_matrix_view<idx_t, idx_t>(
      data_indices_ivfpq, n_rows, topk);
  auto dists_view = raft::make_device_matrix_view<data_t, idx_t>(
      data_distances_ivfpq, n_rows, topk);

  raft::neighbors::ivf_pq::search_params search_params;
  search_params.n_probes = nprobe;

  raft::neighbors::ivf_pq::search<data_t, idx_t>(
      handle,
      search_params,
      raft_index.value(),
      query_view,
      inds_view,
      dists_view);
  handle.sync_stream();

All other settings are left at their defaults.
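
For completeness, a minimal host-side zero-count check of the kind mentioned above. This is a hypothetical sketch, not code from the report; it reuses n_rows, topk, and data_distances_ivfpq from the snippet and assumes <vector>, <algorithm>, <cstdio>, and raft/util/cudart_utils.hpp (for RAFT_CUDA_TRY) are available:

  // Copy the output distances back to the host and count exact zeros.
  std::vector<data_t> host_dists(n_rows * topk);
  RAFT_CUDA_TRY(cudaMemcpy(host_dists.data(),
                           data_distances_ivfpq,
                           sizeof(data_t) * n_rows * topk,
                           cudaMemcpyDeviceToHost));
  auto n_zeros = std::count(host_dists.begin(), host_dists.end(), data_t{0});
  std::printf("zero distances: %ld of %ld\n",
              static_cast<long>(n_zeros),
              static_cast<long>(host_dists.size()));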

Environment details (please complete the following information):
upstream hash is "a98295b516ef58bc855177077860bab2a2a76d77" (Apr 12 ?)

Additional context
Maybe I'm missing something obvious.

@alexanderguzhva alexanderguzhva added the bug Something isn't working label May 1, 2023
@alexanderguzhva
Contributor Author

The issue can be mitigated by patching ivf_pq_search.cuh in the following way (for <float, int64_t>): select_k writes directly into the output distances buffer, and the postprocess_distances() call that applies the metric-dependent scaling is commented out:

/*
  // Select topk vectors for each query
  rmm::device_uvector<ScoreT> topk_dists(n_queries * topK, stream, mr);
  matrix::detail::select_k<ScoreT, uint32_t>(distances_buf.data(),
                                             neighbors_ptr,
                                             n_queries,
                                             topk_len,
                                             topK,
                                             topk_dists.data(),
                                             neighbors_uint32,
                                             true,
                                             stream,
                                             mr);
*/

  matrix::detail::select_k<ScoreT, uint32_t>(distances_buf.data(),
                                             neighbors_ptr,
                                             n_queries,
                                             topk_len,
                                             topK,
                                             (ScoreT*)distances,
                                             neighbors_uint32,
                                             true,
                                             stream,
                                             mr);

  // Postprocessing
//  postprocess_distances(
//    distances, topk_dists.data(), index.metric(), n_queries, topK, scaling_factor, stream);

@alexanderguzhva
Contributor Author

Further update for ivf_pq_search.cuh.
The following change fixes the problem for the <float, int64_t> case, which is kind of weird, because the value of scaling_factor is 1:

    case distance::DistanceType::L2Expanded: {
      linalg::unaryOp(out,
                      in,
                      len,
//                      raft::compose_op(raft::mul_const_op<float>{scaling_factor * scaling_factor},
//                                       raft::cast_op<float>{}),
                      raft::cast_op<float>{},
                      stream);
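
For reference, here is a standalone sketch of what the composed operator above computes per element, on my reading of it (assuming raft::compose_op applies its ops right-to-left, i.e. cast first, then multiply by the squared factor). With scaling_factor == 1 the multiply should be an exact identity for float, which makes the zeroing even stranger:

  #include <cstdio>

  // Hypothetical per-element equivalent of
  // compose_op(mul_const_op<float>{scaling_factor * scaling_factor},
  //            cast_op<float>{})
  // applied to a raw score x.
  float postprocess_l2(float x, float scaling_factor)
  {
    return static_cast<float>(x) * (scaling_factor * scaling_factor);
  }

  int main()
  {
    const float scaling_factor = 1.0f;  // the value reported in this case
    std::printf("%f\n", postprocess_l2(0.42f, scaling_factor));  // expect 0.420000
    return 0;
  }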

@cjnolet
Member

cjnolet commented May 4, 2023

@alexanderguzhva just to clarify: did your last change fix the entire issue, or does it still require the change from your prior comment?

@tfeher (and @achirkin once you are back in the office), any ideas here? It almost seems like either something is going on in the mul_const_op or the scaling factor is getting into a bad numerical state somehow?
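
One way to check the latter, as a hedged sketch rather than actual RAFT code (assuming scaling_factor is still a host-side float right before the postprocessing call), is to log the value that reaches postprocess_distances with RAFT's logging macro:

  // Hypothetical debugging aid: print the scaling factor actually passed in.
  RAFT_LOG_INFO("postprocess_distances: scaling_factor = %f", scaling_factor);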

@alexanderguzhva
Contributor Author

The last change fixed the issue completely.

@tfeher tfeher assigned tfeher and achirkin and unassigned tfeher May 9, 2023
@tfeher
Contributor

tfeher commented May 9, 2023

Thanks @alexanderguzhva for the description and for providing a fix! @achirkin will investigate the issue further.

@achirkin
Contributor

achirkin commented May 10, 2023

Hi @alexanderguzhva, I've not yet been able to reproduce the issue in our tests on our synthetic data.
Would it be possible to share a full reproducer? If not, could you please give the values of the constants and parameters (especially n_rows, nprobe, dim, topk) and the approximate range/distribution of the inputs?
Also, do I correctly assume you're constructing the raft handle with the default constructor, with no custom memory allocator, etc.?

@alexanderguzhva
Contributor Author

alexanderguzhva commented May 10, 2023

@achirkin
https://gist.github.com/alexanderguzhva/cb2b9a08ec312e585b5ba11e3691ce36

I'll try to grab the two PTXes as well...

@achirkin
Contributor

Thanks for the full snippet! This is getting funny: I've just compiled it with the latest raft and the output looks fine (non-zero distances). I'll check with other GPUs and with the raft commit hash you shared, and come back later.

@alexanderguzhva
Contributor Author

I'm using an A100 with CUDA 11.4.2.
I can also try to replicate the issue by extracting the postprocess_distances() call into a standalone executable, if this helps.

@alexanderguzhva
Contributor Author

FYI, CUDA 11.8 seems to solve this problem.
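
For anyone double-checking which CUDA runtime and driver a binary actually picks up, a quick sketch using the standard CUDA runtime API (not from the original thread):

  #include <cstdio>
  #include <cuda_runtime_api.h>

  int main()
  {
    int runtime_ver = 0;
    int driver_ver  = 0;
    cudaRuntimeGetVersion(&runtime_ver);  // e.g. 11040 for CUDA 11.4
    cudaDriverGetVersion(&driver_ver);
    std::printf("runtime %d, driver %d\n", runtime_ver, driver_ver);
    return 0;
  }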

@achirkin
Contributor

achirkin commented May 11, 2023

Thanks for the update. I tried an A100 and an RTX 3090 on CUDA 11.4.2 (gcc 10.3) and CUDA 11.8 (gcc 11.3) this morning, and still do not see any zero distances.
What are the host compiler and build flags in your failing case? If you're using conda, could you try testing again in a fresh environment?

@alexanderguzhva
Contributor Author

well, clang is used for nvcc

Here is the gist with the PTX for the patched version of Raft on CUDA 11.4.2, CC=80: https://gist.github.com/alexanderguzhva/f3d86dc42d7a11ff85293c8804e304d4
Here is the gist with the PTX for the baseline version of Raft on CUDA 11.4.2, CC=80:
https://gist.github.com/alexanderguzhva/1f7dc7352e71824fd4a3b668ec5c9750
Basically, I made two huge dumps, ran them through a diff tool, and found the code blocks that differ in a meaningful way.
I did not study the disassembly of the host code.

No conda is used, pure C++ / CUDA only :)

@achirkin
Contributor

well, clang is used for nvcc

Oh, could you please then show the exact command you use to compile the file (or the corresponding line in compile_commands.json if you're using cmake), along with the clang version?
