Ability to index (hnsw) multiple vector points per document #15854
Excited to see this making its way to Vespa! (First author of the ColBERT paper.) Happy to help with ideas/etc. about doing this. FYI our implementation is here: https://github.com/stanford-futuredata/ColBERT/tree/v0.2
Thanks for reaching out @okhat, and excellent work on the paper and on making the source code for training available. I've used the main branch for now and I'm able to train and reproduce your re-ranking metrics using cosine similarity. We reduced dimensionality to 32 to save space and memory bandwidth during evaluation (Vespa is currently CPU only). Well done, it's not every paper whose released code makes it easy to reproduce the reported results! Metrics on the top1000.dev file (as provided by MS Marco):
Given the rather bad recall@1K in the original top1000, the MRR level is very impressive. We are able to represent your MaxSim operator using Vespa tensor functions (*1) and use it as a re-ranker. Adding support for indexing multiple document term representations for the same passage/document will make it easier to also use ColBERT for retrieval, though I think it can also be distilled into a single dense tensor representation with OK recall@k. We use the following document schema for MS Marco Passage ranking experiments (so far). The end work will be published as a sample application and probably an entry to the MS Marco Passage ranking leaderboard, where we will use multiple features in the re-ranking stage.
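The MaxSim operator mentioned above is simple to state: for each query token embedding, take the maximum dot product against all document token embeddings, then sum those maxima. A minimal pure-Python sketch with toy data (the actual Vespa implementation expresses this with tensor functions over mixed tensors, and real deployments use normalized BERT-derived embeddings):

```python
# MaxSim late-interaction scoring (ColBERT): for each query term embedding,
# take the max dot product over all document term embeddings, then sum.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def max_sim(query_terms, doc_terms):
    """query_terms/doc_terms: lists of token embeddings (lists of floats)."""
    return sum(max(dot(q, d) for d in doc_terms) for q in query_terms)

# Toy example: 2 query tokens, 3 document tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
score = max_sim(query, doc)  # 0.9 + 0.8, i.e. approximately 1.7
```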
Ranking profile:
Where query(qt) is the run time tensor with query term to embedding representation defined as
The query tensor is computed by the exported BERT query model (ONNX format for serving). From a latency perspective the encoder is the overall bottleneck, as the rest can be scaled easily while the single non-batched pass through BERT cannot, so it might be that we try to use a smaller BERT version (e.g. bert-medium) with quantization. It's a very nice use case for us, as it allows us to demonstrate our mixed tensor support with advanced tensor functions and representing BERT models in Vespa. (*1) https://docs.vespa.ai/documentation/reference/ranking-expressions.html#tensor-functions
Very interesting, thanks @jobergum! I see that you're using d=32. Another way to go about this is to use a larger dimension (say 128) but quantize the representations. This may perform better in terms of accuracy. It's also interesting that you think (on CPU) the main bottleneck would be the query encoder. I think it could be pretty fast (tens of milliseconds), even with a modest few-core CPU. There are some ONNX results in milliseconds reported in this article: for bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.
@okhat Yes, agree on dimensionality, but currently our tensor values are only float or double; we plan to add more tensor cell types with lower resolution (e.g. int8). Generally it seems like the number of parameters/dimensions is more important than weight resolution, so 32 is a reasonable tradeoff given our current tensor type limitations.
Yes, we will fully explore this (we also did a similar accuracy-versus-speed analysis for the DPR model: https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/). Our goal is below 100 ms overall, including ranking and query encoding, on a single node of modern hardware, and that is fully reachable without much accuracy loss. I'll update this ticket when we have more on this, but it's surely a very exciting model you have proposed and it fits perfectly with Vespa tensor functions and storage.
With a bert-medium-uncased model (https://huggingface.co/google/bert_uncased_L-8_H-512_A-8), batch size 300, dim 32, and training for 120K iterations, the accuracy is pretty good (0.354 MRR@10 on MS Marco Dev). Using just WAND(1000) BM25 as the retriever over text and re-ranking the top 1K using ColBERT: MS Marco Passage Ranking Dev:
TREC DL 2019 Eval (43 queries) with deep graded judgements
That's really awesome! The results with medium BERT and a small dimension are better than I expected. Also the latency is pretty great. Is this on CPU? I'm wondering how things would look with end-to-end retrieval, especially with the deeper relevance judgments!
Yes, this runs on a single node in our private Vespa cloud with 2 x Xeon Gold 2.60GHz and 512 GB memory. Pretty standard stuff. We'll explore more on the performance and number of threads per search, and see if we can optimize it further. We can also cut down the latency of the query encoder by using a quantized version (int8), which brings down the cost of encoding the query but with an accuracy loss (from 0.35-ish to 0.34-ish). My conclusion is that it's perfectly practical to serve this model in production using Vespa, at a fraction of the cost (and latency) of a full interaction-based model with almost the same accuracy.
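The int8 quantization mentioned here is, in its simplest symmetric per-tensor form, just scaling floats into the [-127, 127] integer range and dequantizing with the same scale. A minimal sketch of that idea (real toolchains such as ONNX Runtime quantize per-layer with calibration; this is only an illustration, not the actual encoder quantization used above):

```python
# Symmetric per-tensor int8 quantization sketch: map floats to [-127, 127]
# integers via a single scale factor, and map back by multiplying the scale.

def quantize(values):
    """Return (int8-range values, scale). Falls back to scale 1.0 for all-zero input."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01]
q, scale = quantize(weights)       # largest magnitude maps to -127
approx = dequantize(q, scale)      # close to the original weights
```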
Unrelated to this ticket directly, but an update on the ColBERT realization on Vespa: #16236 adds an optimization to the tensor expression which implements the ColBERT MaxSim operator, and we now reach 2.2K QPS end to end with low latency on a single node. End-to-end latency includes
Results:
This is also using only 1 thread per search, so for use cases which are not that throughput sensitive one might reduce latency further at the cost of throughput. We also did a submission to MS Marco Passage ranking with it. So thanks again @okhat for sharing the code to reproduce the training routine and for the excellent paper. We will write more about this on our blog real soon.
Thank you @jobergum and thanks to the team! This is actually much faster than I thought. I'm wondering how fast you guys could make the end-to-end retrieval version of ColBERT too, if you gave that one a shot! ;-) Especially with iterated training (gathering negatives on the training set based on the trained ColBERT model), this should have substantially higher accuracy (~37.5% MRR@10 on dev).
So by iterated you mean creating a new triplet set over the train queries, using the positives and a sample of the top negatives from the initial ColBERT model trained on the original BM25 triplet file? We will publish the weights of this model on Huggingface models, and also the pre-computed passage tensors, so people can reproduce. We will leave it to those interested to improve it and iterate over it; after all, we are not scientists, we just want to demonstrate how one can efficiently represent the excellent ColBERT model and have close to state-of-the-art ranking with great serving performance on CPU architectures. In this context we are rather happy with an MRR score of 0.35-ish on dev and eval, as it's 100% better than plain BM25, and similar on TREC DL 2019. Being able to represent ColBERT on Vespa has IMHO many benefits, especially CPU friendliness, and Vespa also allows the final inference to be a combination of multiple ranking features. Later this year we will also add support for other tensor data types so we can reduce the memory footprint.
Discussed this with @jobergum. We concluded this is less important. Setting to Milestone later. |
In the context of the ColBERT multi-representation model this is less important, as there are alternatives using a single dense representation for the retriever and using ColBERT as the re-ranking step. Now, with paged on-disk tensors with binarization and unpacking, the memory and storage footprint allows storing 100M passages with fixed length 256 tokens on 16-64-600 nodes.
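The binarization and unpacking mentioned here amounts to keeping only the sign of each embedding dimension and packing 8 dimensions per byte, then expanding back to ±1 values at ranking time. A minimal sketch of that idea (Vespa exposes a similar unpack_bits ranking function; this toy code is only illustrative):

```python
# Binarize a float vector to sign bits packed 8 per byte (MSB first),
# then unpack each byte back to eight +1/-1 values.

def pack_bits(vector):
    """Pack sign bits (1 if >= 0 else 0), 8 dimensions per byte."""
    assert len(vector) % 8 == 0
    packed = []
    for i in range(0, len(vector), 8):
        byte = 0
        for v in vector[i:i + 8]:
            byte = (byte << 1) | (1 if v >= 0 else 0)
        packed.append(byte)
    return packed

def unpack_bits(packed):
    """Expand each byte back to eight +1/-1 values, MSB first."""
    out = []
    for byte in packed:
        for shift in range(7, -1, -1):
            out.append(1 if (byte >> shift) & 1 else -1)
    return out

vec = [0.3, -0.1, 0.7, -0.9, 0.2, 0.4, -0.5, 0.6]
packed = pack_bits(vec)       # 8 float dims compressed into one byte
signs = unpack_bits(packed)   # [1, -1, 1, -1, 1, 1, -1, 1]
```

This is a 32x reduction versus float32 per dimension, which is what makes 100M passages with 256 token vectors each tractable.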
@geirst I think this has been completed now. |
Yes, this is part of Vespa 8.144.19. See blog post for details: https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/ |
With the growing interest in ANN with Vespa, there has also been increased demand for the ability to index multiple vectors per document. If any of the vectors assigned to a document is in the top-k closest neighbors, it should be retrieved and exposed to the first-phase ranking function. This is similar to the behavior with position type fields (array) and geo-search.
Current support:
Future support:
The ranking features, like `closeness` etc., return the value of the closest vector, i.e. the one that caused the document to be retrieved and exposed to the first-phase ranking function, where users can do more advanced computations over all vectors if they want to.
Query syntax is unchanged, but we might consider allowing passing N tensors with the query. This would then implement batch queries, e.g., where there are multiple query intents or token-based query embeddings like with ColBERT. Ideally, these could then share the filter execution so that the filters are only evaluated once and the resulting filter is shared for the batch of NN queries.
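The retrieval semantics described above can be sketched brute-force: a document is scored by its closest vector to the query, and the top-k documents by that score are retrieved. A toy illustration (Vespa's HNSW index approximates this rather than scanning all vectors, and the closeness formula here follows the common 1/(1+distance) convention; names are illustrative only):

```python
# Brute-force multi-vector nearest-neighbor semantics: if ANY of a document's
# vectors is close to the query, the document is retrieved, and its score
# reflects the closest vector.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closeness(distance):
    # Maps distance 0 -> 1.0; larger distances -> smaller closeness.
    return 1.0 / (1.0 + distance)

def retrieve(query, docs, k):
    """docs: {doc_id: [vector, ...]}; returns top-k (doc_id, closeness) pairs."""
    scored = []
    for doc_id, vectors in docs.items():
        best = min(euclidean(query, v) for v in vectors)
        scored.append((doc_id, closeness(best)))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]

docs = {"d1": [[0.0, 0.0], [5.0, 5.0]], "d2": [[2.0, 2.0]]}
top = retrieve([0.1, 0.0], docs, k=1)  # d1 wins via its first vector
```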
This would simplify expressing end-to-end retrieval using ColBERT, and also help handle longer documents, which are limited by the input length of transformer architectures.