ColBERT on Wikipedia corpus #54

Closed
shashankg7 opened this issue Jun 30, 2021 · 1 comment

@shashankg7

Hi,

Thanks for releasing this library.

I am planning to use ColBERT for a ranking task on a Wikipedia corpus (as part of the FAIR Ranking track: https://fair-trec.github.io/). Briefly, given a keyword query consisting of terms related to Wiki articles, the task is to generate a ranked list of Wiki docs. I have a couple of questions about using the model for this task:

  1. Wikipedia docs are typically very long. If I truncate each doc to fit the model (say, keep the first 500 words), will that hurt the model's performance?

  2. I want to use the query and document embeddings from ColBERT as features in another model. Is there a way to get the query and document embeddings after training?

Thanks.

@okhat
Collaborator

okhat commented Jul 4, 2021

Hi Shashank! Sorry for the late response.

I strongly recommend using a passage-level Wikipedia corpus. It's common in the Open-QA literature (e.g., our ColBERT-QA paper) to divide Wikipedia into 100-word or (say) 200-token passages, keeping the title of the page at the start of each passage.
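As a rough illustration of the passage-level split described above, here is a minimal sketch in plain Python. The function name `split_into_passages`, the ` | ` separator between title and text, and the whitespace-based word count are assumptions made for this example, not something ColBERT or ColBERT-QA prescribes.

```python
def split_into_passages(title, text, words_per_passage=100):
    """Split an article into consecutive word windows, each prefixed with the page title.

    Hypothetical helper: words are counted by whitespace splitting, and the
    " | " separator between title and passage text is an arbitrary choice.
    """
    words = text.split()
    passages = []
    for start in range(0, len(words), words_per_passage):
        chunk = " ".join(words[start:start + words_per_passage])
        passages.append(f"{title} | {chunk}")
    return passages


# Toy article with 250 placeholder words, standing in for a real Wikipedia page.
article_title = "Alan Turing"
article_text = " ".join(f"word{i}" for i in range(250))

passages = split_into_passages(article_title, article_text)
for pid, passage in enumerate(passages):
    print(pid, len(passage.split()), passage[:40])
```

Each passage can then be indexed as its own document, and passage-level scores can be aggregated back to the page (for example, by taking the best passage per page) if the task needs page-level rankings.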

For the second question, encoding the corpus (or the queries) with colbert.index gives you files containing all the embeddings. Alternatively, you can use the ModelInference class from colbert/modeling/inference.py, in particular its queryFromText and docFromText methods. See the existing uses in the code for how to do this; it's pretty simple!
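One thing to keep in mind when using these embeddings as features: ColBERT produces one embedding per token, so each query or passage is a matrix rather than a single vector. The sketch below uses random NumPy arrays as stand-ins for the matrices you would get from queryFromText and docFromText (it does not call ColBERT itself), and shows two possible ways to turn them into features: the MaxSim late-interaction score and a simple mean-pooled vector. The shapes, the 128-dimensional size, and the helper names are assumptions for this example.

```python
import numpy as np


def maxsim_score(query_emb, doc_emb):
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over the query tokens."""
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


def mean_pool(emb):
    """Collapse a (num_tokens, dim) matrix into one dim-sized vector (one option among many)."""
    return emb.mean(axis=0)


def random_unit_rows(rng, num_tokens, dim=128):
    """Stand-in for real embeddings: random rows normalized to unit length."""
    x = rng.normal(size=(num_tokens, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
query_emb = random_unit_rows(rng, 32)   # assumed: 32 query tokens
doc_emb = random_unit_rows(rng, 180)    # assumed: 180 passage tokens

score = maxsim_score(query_emb, doc_emb)
# Feature vector for a downstream model: pooled query, pooled doc, and the score.
features = np.concatenate([mean_pool(query_emb), mean_pool(doc_emb), [score]])
print(features.shape)
```

Mean pooling discards the token-level structure that ColBERT relies on, so the MaxSim score (or the full matrices) may be the more faithful feature; which works better for a downstream model is an empirical question.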

Let me know if you face any issues!

@okhat okhat closed this as completed Aug 20, 2021