Feedback on ColBERT #32
Hi Jo, thanks for the feedback! I think that with d=32 and float16, MS MARCO passages consume around 35 GiB, so one could use a much smaller machine than m5a.12xlarge. To your main points:
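For reference, a rough back-of-the-envelope check of that figure; the passage count and the average number of stored token embeddings per passage are assumptions, not numbers from this thread:

```python
# Rough memory estimate for a ColBERT passage index (all inputs are assumptions).
num_passages = 8_841_823         # approximate MS MARCO passage collection size
avg_tokens_per_passage = 67      # assumed average number of stored token embeddings
dim = 32                         # per-token embedding dimension (d=32)
bytes_per_value = 2              # float16

total_bytes = num_passages * avg_tokens_per_passage * dim * bytes_per_value
print(f"~{total_bytes / 2**30:.1f} GiB")  # roughly 35 GiB with these assumptions
```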
Please check the current code on master (now v0.2). I think it already does so.
Good advice! Will do this soon and let you know. I hope it's relatively straightforward. Let me know if there are other things I can do that would be helpful. I'm about to release a quantization flag for indexing in ColBERT that represents each vector with just 32 bytes and still gets >37% MRR@10 on MS MARCO passages (dev). At the same time, a lot of our users want to use all the Vespa optimizations for our late interaction (MaxSim) operator but don't want to miss out on some features in our repo (e.g., end-to-end retrieval). Is there any way we can make Vespa and ColBERT interoperate more directly, so people don't have to choose one or the other?
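For readers following along, the late interaction (MaxSim) operator mentioned above scores a document as the sum, over query tokens, of the maximum similarity to any document token. A minimal PyTorch sketch of that scoring, with illustrative shapes rather than the actual ColBERT or Vespa implementation:

```python
import torch

def maxsim_score(q_embs: torch.Tensor, d_embs: torch.Tensor) -> torch.Tensor:
    """Late interaction (MaxSim): for each query token, take the max dot product
    against all document tokens, then sum over query tokens.
    q_embs: [num_query_tokens, dim], d_embs: [num_doc_tokens, dim]."""
    sim = q_embs @ d_embs.T               # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()

# Illustrative usage with random embeddings (dim=32, as discussed above).
query_embs = torch.randn(32, 32)
doc_embs = torch.randn(180, 32)
print(maxsim_score(query_embs, doc_embs))
```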
Thank you for the quick feedback, Omar, I appreciate it!
Yes, you are right, and it makes it even more attractive compared to GPU serving. bfloat16/int8 support is coming soon to Vespa as well.
Then we are down to 18 GiB for the passage task, and it also makes the document ranking task more practical.
Yes, end-to-end retrieval using only ColBERT depends on Vespa allowing multiple tensors per document to be indexed in the HNSW graph; we don't want to duplicate the passage text across up to 80 sub-documents. That matters especially for document ranking, but also for passages, where we could additionally store the token_ids of the passage for another re-ranking stage that runs full cross-attention on, e.g., the top 10 hits from ColBERT MaxSim. Vespa allows efficient candidate retrieval using sparse signals (e.g., HDCT or docT5query with WAND), dense retrieval via ANN (HNSW), or a hybrid of the two in the same query. Personally, I think ColBERT shines as a re-ranker compared with a cross-attention model, but we do see the need to allow indexing multiple vectors for the same document, so I think we will get there.
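To make the staged setup described above concrete, here is a minimal sketch of such a pipeline: candidate retrieval, ColBERT MaxSim re-ranking, and full cross-attention on the top 10. Every function name here is a hypothetical placeholder, not a Vespa or ColBERT API:

```python
def staged_ranking(query, retrieve_candidates, colbert_maxsim, cross_encoder,
                   k_candidates=1000, k_cross=10):
    """Hypothetical three-stage ranking sketch (placeholder callables, not real APIs).
    1) candidate retrieval, e.g. WAND over sparse terms or HNSW ANN,
    2) re-rank the candidates with ColBERT MaxSim,
    3) re-rank the top few with a full cross-attention model."""
    candidates = retrieve_candidates(query, k_candidates)
    maxsim_ranked = sorted(candidates,
                           key=lambda doc: colbert_maxsim(query, doc),
                           reverse=True)
    finalists = maxsim_ranked[:k_cross]
    return sorted(finalists,
                  key=lambda doc: cross_encoder(query, doc),
                  reverse=True)
```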
I see, I'll check it out. I used an older version of this repo when training the linked snapshot weights, plus a small wrapper for the query forward pass. We used your repo to produce the passage representations, and we plan on releasing the pre-computed term vectors on a data hub soon.
I have to think about this. I think the first important part is that the ColBERT model can be exported to ONNX and that Vespa can index multiple vectors per document in the HNSW graph. The offline vectorization is best done outside of Vespa (with batch size > 1), which makes a GPU attractive.
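On the ONNX point, exporting a query encoder usually amounts to wrapping the forward pass in a torch.nn.Module and calling torch.onnx.export. The tiny stand-in module below is purely illustrative; a real export would wrap ColBERT's actual query encoder:

```python
import torch

class QueryEncoderStub(torch.nn.Module):
    """Hypothetical stand-in for ColBERT's query encoder, used only to show the export call."""
    def __init__(self, vocab_size=30522, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)

    def forward(self, input_ids, attention_mask):
        return self.emb(input_ids) * attention_mask.unsqueeze(-1)

model = QueryEncoderStub()
dummy_ids = torch.ones(1, 32, dtype=torch.long)
dummy_mask = torch.ones(1, 32, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_ids, dummy_mask),
    "colbert_query_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["query_embeddings"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
    opset_version=12,
)
```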
Awesome, thanks Jo!
Hey @jobergum! I thought you might be interested to know about our new quantization branch. By default, it represents each vector in just 32 bytes. I generally get very similar results with this as with the full 128-dim embeddings.
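One simple way to land at 32 bytes per vector with d=32 is per-vector symmetric int8 quantization (1 byte per dimension). The sketch below shows that idea only as an assumption; it is not necessarily what the quantization branch actually does:

```python
import numpy as np

def quantize_int8(vec: np.ndarray):
    """Symmetric scalar quantization of one embedding to int8 (1 byte per dim)."""
    scale = max(float(np.abs(vec).max()), 1e-8) / 127.0
    q = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vec = np.random.randn(32).astype(np.float32)  # one 32-dim token embedding
q, scale = quantize_int8(vec)
print(q.nbytes)                               # 32 bytes per vector
```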
@okhat thanks for the update. That is awesome! So we can use tensor<int8>(t{}, x[32]). That will also enable further runtime optimizations of the MaxSim operator in Vespa. The current version uses bfloat16, but that only saves memory; the evaluation is still in float, whereas having int8 could enable faster MaxSim evaluation. I will merge https://github.com/vespa-engine/sample-apps/blob/passage-ranking/msmarco-ranking/passage-ranking.md to master this week, just wrapping up the performance summary.
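To illustrate why int8 can speed up MaxSim: the dot products can be accumulated in integer arithmetic, which maps well to SIMD. A minimal NumPy sketch of that idea, not the Vespa implementation:

```python
import numpy as np

def maxsim_int8(q_int8: np.ndarray, d_int8: np.ndarray) -> int:
    """MaxSim over int8 embeddings with int32 accumulation.
    q_int8: [num_query_tokens, 32], d_int8: [num_doc_tokens, 32]."""
    sim = q_int8.astype(np.int32) @ d_int8.astype(np.int32).T
    return int(sim.max(axis=1).sum())

q = np.random.randint(-127, 128, size=(32, 32), dtype=np.int8)
d = np.random.randint(-127, 128, size=(180, 32), dtype=np.int8)
print(maxsim_int8(q, d))
```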
Hello, this is not an issue but feedback. As you know, we at the vespa.ai team have been working with the ColBERT model, as it is cost-effective on CPU with almost the same accuracy as full cross-attention models. By using 32 dims per contextualized token representation, the memory footprint is not a huge concern, since you get a lot of memory and vCPUs for free if you can avoid needing a GPU. Example pricing from AWS EC2 (on-demand/Linux):
m5a.12xlarge: 48 vCPUs, 192 GB RAM => $2.064 per hour
p3.2xlarge: 8 vCPUs, 1 NVIDIA Tesla V100 GPU (16 GB), 61 GB RAM => $3.06 per hour
Vespa currently supports float32, but will soon add bfloat16 for our tensor representation, which will reduce the memory footprint by 50%, from 32 bits per value to 16. (Some data from our work on ColBERT is documented in vespa-engine/vespa#15854.)
Now to the feedback on the modelling:
Thank you.