Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support reconstruction of original vectors from IVF-Flat / IVF-PQ Index #1205

Open
cjnolet opened this issue Jan 30, 2023 · 3 comments
Assignees
Labels
FAISS feature request New feature or request

Comments

@cjnolet
Copy link
Member

cjnolet commented Jan 30, 2023

FAISS currently supports two functions (on CPU) for reconstructing original vectors from the IVF-Flat/IVF-PQ indexes: reconstruct_n and reconstruct_batch. FAISS' existing IVF-Flat GPU implementation does support reconstruct and reconstruct_n (a linear set of indices from a specific start to specific end index) but reconstruct_batch would be more efficient so that we can reconstruct a specific set of vectors given their indexes, which don't even need to be in order. It's been requested that we support this in our GPU implementations.

The reconstruct API is also very useful when a GPU coarse quantizer (of any index type) is combined w/ a CPU IVF index for very large indexes which may not fit into GPU memory.

@cjnolet cjnolet added feature request New feature or request FAISS labels Jan 30, 2023
@cjnolet
Copy link
Member Author

cjnolet commented Jan 31, 2023

Referencing the corresponding method in FAISS' CPU IVF index: https://github.com/facebookresearch/faiss/blob/main/faiss/IndexIVF.cpp#L918

@achirkin
Copy link
Contributor

achirkin commented Mar 20, 2023

Update on the reconstruction.

At this moment, we have two draft implementations of reconstruction. My IVF-PQ implementation takes the (cluster_id, [in_cluster_indices]) instead of the user-defined indices, and the IVF-Flat implementation seems to be sub-optimal in that it does not support non-contiguous user-defined indices and iterates through the whole index multiple times.

The ANN indices are one-directional by design. Thus, supporting the search in the opposite direction (find vectors by indices) requires either an exhaustive search through the whole DB for every query, or the introduction of a helper struct for a faster search (an attempt to make such is in ivf-flat PR). I suggest we do the latter, but decouple the helper struct from the model. Something like the following:

template <typename ListT>
struct index_mapping_t {
  device_vector<ListT::index_type> user_indices;
  device_vector<std::Tuple<uint32_t, uint32_t>> clusters_and_offsets;
  template <typename = ivf::enable_if_valid_list<ListT>>
  index_mapping_t(std::vector<std::shared_ptr<ListT>> lists) {
    ...
    cub::DeviceRadixSort::SortPairs(user_indices, clusters_and_offsets);
    ...
  }
};
template <typename ListT>
auto get_lists_ands_offsets(const index_mapping_t<ListT>& mapping, device_vector_view<ListT::index_type> queries) ->
std::vector<std::Tuple<uint32_t, device_vector<uint32_t>>> {
   // find the cluster & in-cluster-offsets, and group them by cluster ids
}

Then, we could calculate index_mapping_t in a reasonable time and then repeatedly convert the user-space queries into the (cluster, [indices]) format and perform the reconstruct functions on that.

@achirkin
Copy link
Contributor

In addition to that, we could make a separate version of ivf_*::search that would return (cluster, index) pairs instead of the user-space indices. Then, the user could use the results of the search directly to manipulate the lists (reconstruct or remove records).

rapids-bot bot pushed a commit that referenced this issue Apr 17, 2023
Add public functions for reading and writing into individual ivf-pq lists (clusters), in the input space (reconstructed data) and in flat PQ codes.

Partially solves (IVF-PQ) #1205

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1298
ahendriksen pushed a commit to ahendriksen/raft that referenced this issue Apr 18, 2023
Add public functions for reading and writing into individual ivf-pq lists (clusters), in the input space (reconstructed data) and in flat PQ codes.

Partially solves (IVF-PQ) rapidsai#1205

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1298
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FAISS feature request New feature or request
Projects
Status: In Progress
Development

No branches or pull requests

3 participants