FAISS indexing parameters #44
Thank you for the kind words! `partitions` is very important: it is the number of centroids used by FAISS for indexing and search. Higher means slower FAISS indexing but faster retrieval. You can make the number 2x or 4x smaller and it would still be fine. The `sample` parameter dictates how much of the data is used for training the FAISS index, so here it's 30%. If you drop this parameter completely, the default will internally be 5%. More is better, but 5--30% is enough.
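To make the roles of the two parameters concrete, here is a minimal NumPy sketch of what happens conceptually during IVF training: a random `sample` fraction of the embeddings is used to fit `partitions` centroids via k-means. All names here (`train_centroids`, `sample_frac`) are hypothetical illustrations, not the ColBERT or FAISS API; FAISS does this internally with optimized code.

```python
import numpy as np

def train_centroids(embeddings, partitions, sample_frac=0.30, iters=10, seed=0):
    """Sketch: fit `partitions` centroids (IVF cells) on a random
    `sample_frac` of the embeddings, as FAISS does conceptually.
    Sampling only affects centroid-training cost; every embedding is
    still added to the index afterwards."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    size = max(partitions, int(sample_frac * n))
    sample = embeddings[rng.choice(n, size=size, replace=False)]
    # Plain k-means on the sample only.
    centroids = sample[rng.choice(len(sample), size=partitions, replace=False)]
    for _ in range(iters):
        # Assign each sampled vector to its nearest centroid.
        dists = ((sample[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # Move each centroid to the mean of its assigned vectors.
        for k in range(partitions):
            members = sample[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids

emb = np.random.default_rng(1).normal(size=(1000, 16)).astype("float32")
cents = train_centroids(emb, partitions=8)
print(cents.shape)  # (8, 16)
```

A larger `partitions` value means finer cells (slower training, faster search); a larger `sample_frac` means better-placed centroids at higher training cost.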
> so as far as I understand
Yes, all documents will be indexed! Irrespective of what you choose for `sample`, all embeddings are going to be stored. Sampling only dictates the not-so-critical aspect of how the internal representations are created without too much cost.
Another semi-related indexing parameter question: what does the `--doc_maxlen` parameter do? Also, was 180 the value used in the original work? I am planning to use ColBERT to index datasets other than MS MARCO passage, so I am not sure whether I would in fact have to retrain everything from scratch. After trying to index a collection using the checkpoint from the original work, transformers gave me a warning, which I guess should alarm me.
If I understand correctly, trying to index using a checkpoint created with a different `--doc_maxlen` could be a problem. Thank you again for your help! :)
By default it's FirstP. You'll have to split the documents up if you want to implement MaxP on top of this. You don't have to retrain. Just split up the passages to 100--150 words (with Python whitespace split) and select an appropriate --doc_maxlen in the range 180--256. It should work fine. |
Hello,
If that's right, then do we use only 30% of the data just to decrease the cost? Can't we just feed all the files to the function? Thank you! :) I would also like to know about the things below.
Dear authors,
Thank you for your nice work and for providing the code repository.
I would like to use your model to index a collection using FAISS. However, I see a few parameters in the FAISS indexing example command that I do not understand. One is `sample` and the other is `partitions`. My guess is that the second one splits the generated index file into different partitions, hence not so important (correct me if I'm wrong), but what about `sample`?
Also, is `--root /root/to/experiments/` expected to be the ColBERT code directory (this repo)?
Thank you