ColBERT on Wikipedia corpus #54

shashankg7 · 2021-06-30T06:22:42Z

Hi,

Thanks for releasing this library.

I am planning to use ColBERT for a ranking task on the Wikipedia corpus (as part of the FAIR Ranking track: https://fair-trec.github.io/). Briefly, given a keyword query consisting of terms related to Wiki articles, the task is to generate a ranked list of Wiki docs. I have a couple of questions about using the model for this task:

1. Wikipedia docs are typically very long. To use them in the model, if I truncate each doc (say, keep the first 500 words), will it affect the performance of the model?
2. I want to use the query and document embeddings from ColBERT as features in another model. Is there a way to get the query and document embeddings after training?

Thanks.

okhat · 2021-07-04T22:12:15Z

Hi Shashank! Sorry for the late response.

I strongly recommend using a passage-level Wikipedia corpus. It's common in the Open-QA literature (e.g., our ColBERT-QA paper) to divide Wikipedia into 100-word or (say) 200-token passages, keeping the title of the page at the start of each passage.
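To make the passage-level idea concrete, here is a minimal sketch of splitting one page into fixed-size word windows with the title prepended. The function name `split_into_passages`, the whitespace-based word splitting, and the `" | "` separator between title and text are illustrative assumptions, not something prescribed by ColBERT or the ColBERT-QA paper.

```python
def split_into_passages(title, text, words_per_passage=100):
    """Split a page into consecutive passages of at most `words_per_passage` words.

    Each passage starts with the page title so it keeps its context.
    The " | " separator is an arbitrary choice made for this sketch.
    """
    words = text.split()  # naive whitespace split (assumption)
    passages = []
    for start in range(0, len(words), words_per_passage):
        chunk = " ".join(words[start:start + words_per_passage])
        passages.append(f"{title} | {chunk}")
    return passages


# Example: a 250-word page becomes three passages (100, 100 and 50 words).
page_text = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages("Example page", page_text)
```

Each resulting passage would then be one entry in the collection that gets indexed, instead of a truncated full article.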

For the second one, encoding the corpus (or the queries) with colbert.index can give you files with all the embeddings. Or you can use the ModelInference class from colbert/modeling/inference.py, and in particular queryFromText and docFromText. See existing uses in the code for how to do this; it's pretty simple!
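As a rough sketch of the second route, the helper below takes an already constructed `ModelInference` object and calls the two methods named above. It assumes both methods accept a list of strings; how the model checkpoint is loaded and how `ModelInference` is built are deliberately left out, so check the existing uses in the repository for that. The helper name `encode_texts` is hypothetical.

```python
def encode_texts(inference, queries, docs):
    """Return (query_embeddings, doc_embeddings) for lists of strings.

    `inference` is assumed to be a ModelInference instance from
    colbert/modeling/inference.py. Passing plain lists of strings to
    queryFromText and docFromText is an assumption of this sketch.
    """
    query_embeddings = inference.queryFromText(list(queries))
    doc_embeddings = inference.docFromText(list(docs))
    return query_embeddings, doc_embeddings
```

Note that ColBERT produces one embedding per token, not one per text, so a downstream model that expects a single feature vector per query or document would need some pooling step of its own.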

Let me know if you face any issues!

okhat closed this as completed Aug 20, 2021
