Adding XTR from Rethinking the Role of Token Retrieval in Multi-Vector Retrieval #30

Status: Open (3 commits to merge into base: main)

arthur-75
Contributor

No description provided.

Raphael Sourty and others added 3 commits April 18, 2024 00:41
I have implemented XTR from Google's paper "Rethinking the Role of Token Retrieval in Multi-Vector Retrieval". The code still needs to be optimized, and missing similarity imputation still needs to be added. Please let me know if you have any questions.
@arthur-75 changed the title from "Patch 2" to "Adding XTR from Rethinking the Role of Token Retrieval in Multi-Vector Retrieval" on May 4, 2024
@raphaelsty
Owner

Thank you @arthur-75 for this PR. I think the best approach would be to add an index directory containing a file named annoy.py.

The class would be Annoy(), taking the parameters needed to create the vector database: https://github.com/spotify/annoy

The Annoy index would have an add() method that takes the documents_embeddings parameter as input and loads those embeddings into the index.

It would also have a __call__ method that takes as input queries_embeddings: dict[str, torch.Tensor], k: int = 100, batch_size: int = 32 and retrieves, in batches, the top k documents_embeddings for the given queries_embeddings.
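
The index interface described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the proposed implementation: the class name TokenIndex and the helper _normalize are hypothetical, embeddings are plain lists of floats instead of torch tensors, and an exact brute-force cosine search stands in for Annoy's approximate search so that the example runs with the standard library only. The return shape (one list of (document id, similarity) pairs per query token) is also an assumption.

```python
from __future__ import annotations

import heapq
import math


def _normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that a dot product is a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class TokenIndex:
    """Hypothetical stand-in for the proposed Annoy class.

    Same add() / __call__ shape as described in the review, but the search
    is exact brute force rather than Annoy's approximate nearest neighbours.
    """

    def __init__(self) -> None:
        self.vectors: list[list[float]] = []  # one entry per document token
        self.owners: list[str] = []  # document id owning each token vector

    def add(self, documents_embeddings: dict[str, list[list[float]]]) -> "TokenIndex":
        """Store every token embedding together with the id of its document."""
        for doc_id, tokens in documents_embeddings.items():
            for token in tokens:
                self.vectors.append(_normalize(token))
                self.owners.append(doc_id)
        return self

    def __call__(
        self,
        queries_embeddings: dict[str, list[list[float]]],
        k: int = 100,
        batch_size: int = 32,
    ) -> dict[str, list[list[tuple[str, float]]]]:
        """For each query token, return the k most similar document tokens."""
        results = {}
        query_ids = list(queries_embeddings)
        # Queries are processed in slices of batch_size, mirroring the batched API.
        for start in range(0, len(query_ids), batch_size):
            for query_id in query_ids[start : start + batch_size]:
                results[query_id] = [
                    self._search(_normalize(token), k)
                    for token in queries_embeddings[query_id]
                ]
        return results

    def _search(self, query_token: list[float], k: int) -> list[tuple[str, float]]:
        scored = (
            (sum(a * b for a, b in zip(query_token, vector)), owner)
            for vector, owner in zip(self.vectors, self.owners)
        )
        top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return [(owner, score) for score, owner in top]


index = TokenIndex().add({"d1": [[1.0, 0.0], [0.0, 1.0]], "d2": [[1.0, 1.0]]})
hits = index({"q1": [[1.0, 0.0]]}, k=2)
print(hits["q1"][0])  # best match first: a token of d1, then the token of d2
```

Keeping the document id next to each token vector is what lets a later scoring step group token-level hits back into documents.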

Once we have the index class, we can create an XTR object that takes as input an index object such as Annoy, along with key, on, and model.

The XTR object will have an add method, which will simply call the add method of the index.

The XTR object should inherit from the ColBERT retriever.

The __call__ method of XTR will query the index and then post-process the embedding similarities in order to compute the XTR score.
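
The post-processing step could look roughly like the sketch below. It is an assumption-laden illustration, not the code in this PR: the function name xtr_scores and the input shape (one list of (document id, similarity) pairs per query token) are hypothetical, and the imputation rule used here (a document with no retrieved token for a given query token receives the lowest similarity retrieved for that token) is one reading of the missing similarity imputation mentioned above, so it should be checked against the paper.

```python
from __future__ import annotations


def xtr_scores(token_hits: list[list[tuple[str, float]]]) -> dict[str, float]:
    """Aggregate token-level hits into one score per candidate document.

    token_hits[i] holds the (doc_id, similarity) pairs retrieved for query token i.
    """
    n_tokens = len(token_hits)
    # Every document that appears in at least one token's hits is a candidate.
    candidates = {doc_id for hits in token_hits for doc_id, _ in hits}
    totals = dict.fromkeys(candidates, 0.0)

    for hits in token_hits:
        # Best retrieved similarity per document for this query token.
        best: dict[str, float] = {}
        for doc_id, similarity in hits:
            if similarity > best.get(doc_id, float("-inf")):
                best[doc_id] = similarity
        # Assumed imputation: documents missing from this token's hits
        # get the lowest similarity that was retrieved for the token.
        fallback = min((similarity for _, similarity in hits), default=0.0)
        for doc_id in candidates:
            totals[doc_id] += best.get(doc_id, fallback)

    # Average over query tokens so scores are comparable across query lengths.
    return {doc_id: total / n_tokens for doc_id, total in totals.items()}


example_hits = [
    [("d1", 0.9), ("d2", 0.5)],  # query token 0
    [("d1", 0.8), ("d1", 0.6)],  # query token 1: d2 is missing and gets imputed
]
scores = xtr_scores(example_hits)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this example d2 has no hit for the second query token, so it is credited with 0.6, the lowest similarity retrieved for that token, instead of being scored as zero.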

Also, you should set up ruff to format your code; it is really useful 👍
