# PyTerrier ColBERT Demo Notebook - Vaswani

This notebook demonstrates the use of the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert) for dense passage retrieval.

[ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds. ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. Then at search time, it embeds every query into another matrix of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. 
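
The late-interaction scoring described above can be illustrated with a minimal, pure-Python sketch. It assumes dot-product similarity over toy 2-dimensional vectors; the function names `dot` and `maxsim_score` are illustrative and are not part of the ColBERT or PyTerrier APIs.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: Sequence[Vector], doc_embs: Sequence[Vector]) -> float:
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy example: two query tokens, two candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [-1.0, 0.0]]

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Here `doc_a` scores higher because each query token finds an exact match among its token embeddings, whereas `doc_b` only offers partial matches.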


[ColBERT](https://arxiv.org/abs/2004.12832) is built on top of [BERT](https://arxiv.org/abs/1810.04805). ColBERT surpasses the quality of single-vector representation models, while scaling efficiently to large corpora. 

The corpus used in this demo is the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts, with corresponding queries and relevance assessments.

## Installation 

We need to install [PyTerrier](https://github.com/terrier-org/pyterrier).

In [1]:
!pip install python-terrier

Collecting python-terrier
  Downloading https://files.pythonhosted.org/packages/10/0e/1756a1892b8b2aa0152ac532c7f85de802bda25772108ab8196259ea9d4f/python-terrier-0.5.0.tar.gz (74kB)
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval>=0.5
  Downloading https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c

This installs the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert). It supplies an indexer and a retrieval transformer.

In [2]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-v69a3sw_
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-v69a3sw_
Collecting ColBERT@ git+https://github.com/cmacdonald/ColBERT.git@v0.2#egg=ColBERT
  Cloning https://github.com/cmacdonald/ColBERT.git (to revision v0.2) to /tmp/pip-install-gvrh73e2/ColBERT
  Running command git clone -q https://github.com/cmacdonald/ColBERT.git /tmp/pip-install-gvrh73e2/ColBERT
  Running command git checkout -b v0.2 --track origin/v0.2
  Switched to a new branch 'v0.2'
  Branch 'v0.2' set up to track remote branch 'v0.2' from 'origin'.
Collecting transformers==3.0.2
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

This installs [FAISS](https://github.com/facebookresearch/faiss), a library for efficient similarity search and clustering of dense vectors.

In [3]:
import sys

COLAB='google.colab' in sys.modules

try:
  import faiss
  faiss.get_num_gpus()
except Exception:  # faiss is not importable, or the GPU probe above failed
  if COLAB:
    print('Installing faiss-gpu from pip ')
    !pip install faiss-gpu==1.6.3
  else:
    print('Installing faiss-gpu via Conda')
    !conda install -c pytorch faiss-gpu

import faiss
assert faiss.get_num_gpus() > 0

Installing faiss-gpu from pip 
Collecting faiss-gpu==1.6.3
  Downloading https://files.pythonhosted.org/packages/1c/43/f33a7d59e7b367f34e1ab61db70a5a75194c848ea2df940c49f7aedc95d9/faiss_gpu-1.6.3-cp37-cp37m-manylinux2010_x86_64.whl (35.5MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.6.3


# Setup

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [4]:
import pyterrier as pt
pt.init()

terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.5.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


This is the ColBERT checkpoint generated by Craig Macdonald and used in our TREC 2020 participation. It is downloaded the first time it is used; the download time varies.

In [5]:
checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

# Indexing

This indexes the [Vaswani dataset](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/). Indexing takes about 3 minutes using a Colab GPU.

In [6]:
!rm -rf /content/colbertindex

import pyterrier_colbert.indexing

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "/content", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

[INFO] [starting] building docstore
[INFO] If you have a local copy of http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/23e5607081191b153738e81fbd834680
[INFO] [starting] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz
[INFO] [finished] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz: [00:00] [2.13MB] [14.7MB/s]
docs_iter: 11429it [00:01, 7974.55it/s]
[INFO] [finished] docs_iter: [00:01] [11429it] [7969.76it/s]
[INFO] [finished] building docstore [1.44s]


[Mar 26, 13:28:07] [0] 		 #> Local args.bsize = 128
[Mar 26, 13:28:07] [0] 		 #> args.index_root = /content
[Mar 26, 13:28:07] [0] 		 #> self.possible_subset_sizes = [69905]


[INFO] Lock 140466134063632 acquired on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock
[INFO] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpvqzv286k


[INFO] storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json in cache at /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[INFO] creating metadata file for /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[INFO] Lock 140466134063632 released on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock
[INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[INFO] Model config BertConfig {
 

[INFO] storing https://cdn.huggingface.co/bert-base-uncased-pytorch_model.bin in cache at /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
[INFO] creating metadata file for /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
[INFO] Lock 140466134064080 released on /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157.lock
[INFO] loading weights file https://cdn.huggingface.co/bert-base-uncased-pytorch_model.bin from cache at /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157




- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Mar 26, 13:28:29] #> Loading model checkpoint.
[Mar 26, 13:28:29] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip


Downloading: "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip" to /root/.cache/torch/hub/checkpoints/colbert.dnn.zip


[Mar 26, 13:29:07] #> checkpoint['epoch'] = 0
[Mar 26, 13:29:07] #> checkpoint['batch'] = 44500




[INFO] Lock 140466131921552 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
[INFO] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpc0nnn0ma


[INFO] storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt in cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[INFO] creating metadata file for /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[INFO] Lock 140466131921552 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
[INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084




[Mar 26, 13:29:09] #> Note: Output directory /content already exists




[Mar 26, 13:29:09] #> Creating directory /content/colbertindex 



docFromText on 11429 documents


[INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


tokens doc 0: 180
total tokens 2057220
lenD = 11429 
11429/content/colbertindex/0.pt

[Mar 26, 13:30:11] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 10.9k (overall),  10.9k (this encoding),  1423.2M (this saving)
[Mar 26, 13:30:11] [0] 		 [NOTE] Done with local share.
[Mar 26, 13:30:12] [0] 		 #> Joining saver thread.
[Mar 26, 13:30:12] [0] 		 #> Saved batch #0 to /content/colbertindex/0.pt 		 Saving Throughput = 2.5M passages per minute.

#> num_embeddings = 581496
[Mar 26, 13:30:12] #> Starting..
[Mar 26, 13:30:12] #> Processing slice #1 of 1 (range 0..1).
[Mar 26, 13:30:12] #> Will write to /content/colbertindex/ivfpq.100.faiss.
[Mar 26, 13:30:12] #> Loading /content/colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Mar 26, 13:30:12] Preparing resources for 1 GPUs.
[Mar 26, 13:30:12] #> Training with the vectors...
[Mar 26, 13:30:12] #> Training now (using 1 GPUs)...
36.77404522895813
26.64493465423584
0.0008385181427001953
[Mar 26, 13:31:15] Done
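
To index a different collection, the iterator passed to `indexer.index()` could be swapped for your own. The sketch below assumes that the indexer consumes dictionaries with `docno` and `text` keys, as the `irds:vaswani` corpus iterator yields; `corpus_iter` and `my_docs` are hypothetical names introduced for illustration, and the final indexing call is left commented out because it was not run as part of this notebook.

```python
from typing import Dict, Iterable, Iterator, Tuple


def corpus_iter(docs: Iterable[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield one {'docno', 'text'} record per (identifier, text) pair."""
    for docno, text in docs:
        yield {"docno": str(docno), "text": text}


# Hypothetical in-memory collection; real data would come from files or a database.
my_docs = [
    ("d1", "compact memories have flexible capacities"),
    ("d2", "an electronic analogue computer for solving systems of equations"),
]

records = list(corpus_iter(my_docs))
print(records[0]["docno"], "->", records[0]["text"])

# Assumed usage (not executed here):
# indexer.index(corpus_iter(my_docs))
```

Using a generator keeps memory usage flat, since documents are produced one at a time rather than materialised as a full list.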

The indexing procedure generates the document embeddings index and a [FAISS](https://github.com/facebookresearch/faiss) index, together with some additional files.

In [7]:
!ls -ltrh /content/colbertindex

total 163M
-rw-r--r-- 1 root root 142M Mar 26 13:30 0.pt
-rw-r--r-- 1 root root 7.1M Mar 26 13:30 0.sample
-rw-r--r-- 1 root root  35K Mar 26 13:30 doclens.0.json
-rw-r--r-- 1 root root  36K Mar 26 13:30 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Mar 26 13:31 ivfpq.100.faiss


# Retrieval

Now that indexing has completed, we can load the index and the checkpoint model (which we need for encoding queries). Index loading can take some time, because both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings index must be loaded into main memory.

Since we indexed a collection from scratch and the data structures are already loaded in main memory, we reuse the data structures for retrieval.

If the indexing was done offline, the following ColBERT factory can be used instead:

```python
import pyterrier_colbert.ranking
pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(checkpoint, "/content", "colbertindex")
```


In [8]:
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

[INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


[Mar 26, 13:31:18] #> Loading the FAISS index from /content/colbertindex/ivfpq.100.faiss ..
[Mar 26, 13:31:18] #> Building the emb2pid mapping..
[Mar 26, 13:31:18] len(self.emb2pid) = 581496


[INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


Loading reranking index, memtype=mem


Here we can ask [PyTerrier](https://github.com/terrier-org/pyterrier) to search the [ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) index for `'chemical reactions'`, returning the top 10 scored documents.

In [9]:
(colbert_e2e % 10).search("chemical reactions")

[INFO] NumExpr defaulting to 2 threads.


| | qid | query | docid | query_toks | query_embs | docno | score | rank |
|---|---|---|---|---|---|---|---|---|
| 1956 | 1 | chemical reactions | 4911 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 4912 | 19.822016 | 0 |
| 2524 | 1 | chemical reactions | 7048 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 7049 | 19.053322 | 1 |
| 2387 | 1 | chemical reactions | 6479 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 6480 | 18.036819 | 2 |
| 555 | 1 | chemical reactions | 9373 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 9374 | 17.140938 | 3 |
| 2342 | 1 | chemical reactions | 6278 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 6279 | 16.79257 | 4 |
| 1134 | 1 | chemical reactions | 2420 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 2421 | 16.428358 | 5 |
| 1803 | 1 | chemical reactions | 4292 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 4293 | 16.194141 | 6 |
| 1189 | 1 | chemical reactions | 10702 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 10703 | 16.15019 | 7 |
| 2085 | 1 | chemical reactions | 5303 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 5304 | 16.008406 | 8 |
| 1473 | 1 | chemical reactions | 3100 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0680), tensor(-0.0085), tensor(0.11... | 3101 | 15.884938 | 9 |
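
The `%` in `colbert_e2e % 10` is PyTerrier's rank cutoff operator, which keeps only the highest-ranked documents for each query. The pure-Python sketch below mimics that effect on the first three result rows shown above; `rank_cutoff` is an illustrative helper, not PyTerrier's implementation, and it assumes 0-based ranks as in the `rank` column.

```python
from typing import Dict, List


def rank_cutoff(results: List[Dict], k: int) -> List[Dict]:
    """Keep the rows whose 0-based rank is below k."""
    return [row for row in results if row["rank"] < k]


# First three rows of the result table above (selected columns only).
results = [
    {"qid": "1", "docno": "4912", "score": 19.822016, "rank": 0},
    {"qid": "1", "docno": "7049", "score": 19.053322, "rank": 1},
    {"qid": "1", "docno": "6480", "score": 18.036819, "rank": 2},
]

top2 = rank_cutoff(results, 2)
print([row["docno"] for row in top2])  # ['4912', '7049']
```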


# Run an experiment

Let's prepare an experiment. First, let's create a BM25 baseline transformer.

In [10]:
dataset = pt.get_dataset("vaswani")

bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index


Finally, let's evaluate ColBERT against the BM25 baseline created above, which uses an index of the same corpus. We limit the experiment to the first 50 queries.

In [11]:
pt.Experiment(
    [bm25, colbert_e2e],
    dataset.get_topics().head(50),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "mrt"],
    names = ["BM25", "ColBERT"]
)

Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec


Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels


| | name | map | recip_rank | mrt |
|---|---|---|---|---|
| 0 | BM25 | 0.338941 | 0.808238 | 44.980815 |
| 1 | ColBERT | 0.333021 | 0.762643 | 780.633069 |
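
In the results above, `map` and `recip_rank` are averages over the 50 queries of per-query average precision and reciprocal rank, while `mrt` is the mean response time per query in milliseconds. The sketch below shows how the two effectiveness measures could be computed for a single query, assuming binary relevance; the function names and the toy ranking are illustrative, and this is not the evaluation code that PyTerrier itself runs.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision@i over the ranks i at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for i, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for i, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / i
    return 0.0


# Toy query: relevant documents appear at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(average_precision(ranking, relevant))  # 0.5
print(reciprocal_rank(ranking, relevant))    # 0.5
```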
