# PyTerrier ColBERT Demo Notebook - Vaswani

This notebook demonstrates the use of the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert) for dense passage retrieval.

[ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds. ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. Then at search time, it embeds every query into another matrix of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. 


[ColBERT](https://arxiv.org/abs/2004.12832) is built on top of [BERT](https://arxiv.org/abs/1810.04805). ColBERT surpasses the quality of single-vector representation models, while scaling efficiently to large corpora. 
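
As a rough illustration of the late interaction scoring described above, the following sketch computes a MaxSim score in plain Python: each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. The function name `maxsim_score` and the toy 2-dimensional vectors are made up for illustration; real ColBERT embeddings come from BERT and the scoring runs as batched tensor operations.

```python
def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the largest dot product with any passage token."""
    score = 0.0
    for q in query_embs:
        # Similarity of this query token to every passage token; keep the best match.
        score += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_embs)
    return score

# Toy, hand-made unit vectors (one row per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]    # tokens that align well with the query
doc_b = [[0.0, -1.0], [-1.0, 0.0]]  # tokens that do not align with the query

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)
```

Because every query token picks its own best-matching passage token, a passage can score well even when the matching terms appear in a different order or context than in the query.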

The corpus used in this demo is the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts with corresponding queries and relevance assessments.

## Installation 

We need to install [PyTerrier](https://github.com/terrier-org/pyterrier).

In [1]:
!pip install python-terrier

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-terrier
  Downloading python-terrier-0.9.1.tar.gz (102 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Collecting pyjnius>=1.4.2
  Downloading pyjnius-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting matchpy
  Downloading matchpy-0.5.5-py3-none-any.whl (69 kB)
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Collecting deprecated
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting chest
  Downloading chest-0.2.3.tar.gz (9.6 kB)
Collecting nptyping==1.4.4
  Downloading nptyping-1.4.4-py3-none-any.whl (31 kB)
Collecting ir_datasets>=0.3.2
  Downloading ir_datasets-0.5.4-py3-none-any.whl (311 kB)

This installs the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert). It supplies an indexer and a retrieval transformer.

In [2]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-8z4e5o9t
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-8z4e5o9t
Collecting ColBERT@ git+https://github.com/cmacdonald/ColBERT.git@v0.2#egg=ColBERT
  Cloning https://github.com/cmacdonald/ColBERT.git (to revision v0.2) to /tmp/pip-install-mple5szk/colbert_3bfad04156c94600bff61ff69c4ae4b6
  Running command git clone -q https://github.com/cmacdonald/ColBERT.git /tmp/pip-install-mple5szk/colbert_3bfad04156c94600bff61ff69c4ae4b6
  Running command git checkout -b v0.2 --track origin/v0.2
  Switched to a new branch 'v0.2'
  Branch 'v0.2' set up to track remote branch 'v0.2' from 'origin'.
Collecting transformers<5,>=3.0.2
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)

This installs [FAISS](https://github.com/facebookresearch/faiss), a library for efficient similarity search and clustering of dense vectors.

In [3]:
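import sys

# True when the notebook is running inside Google Colab
COLAB = 'google.colab' in sys.modules

try:
  # Check whether a GPU-enabled FAISS build is already importable
  import faiss
  faiss.get_num_gpus()
except Exception:
  if COLAB:
    print('Installing faiss-gpu from pip ')
    !pip install faiss-gpu==1.6.3
  else:
    print('Installing faiss-gpu via Conda')
    !conda install -c pytorch faiss-gpu

import faiss
# Fail early if FAISS cannot see at least one GPU
assert faiss.get_num_gpus() > 0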

Installing faiss-gpu from pip 
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu==1.6.3
  Downloading faiss_gpu-1.6.3-cp38-cp38-manylinux2010_x86_64.whl (35.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.6.3


## Setup

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [4]:
import pyterrier as pt
pt.init()

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



This is the ColBERT checkpoint generated by Craig Macdonald and used in our TREC 2020 participation. It is downloaded the first time it is used; download time varies.

In [5]:
checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

## Indexing

This indexes the [Vaswani dataset](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/). Indexing takes about 3 minutes using a Colab GPU.

In [6]:
!rm -rf /content/colbertindex

import pyterrier_colbert.indexing

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "/content", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Dec 10, 16:28:31] [0] 		 #> Local args.bsize = 128
[Dec 10, 16:28:31] [0] 		 #> args.index_root = /content
[Dec 10, 16:28:31] [0] 		 #> self.possible_subset_sizes = [69905]


Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Dec 10, 16:29:00] #> Loading model checkpoint.
[Dec 10, 16:29:00] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip


Downloading: "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip" to /root/.cache/torch/hub/checkpoints/colbert.dnn.zip


  0%|          | 0.00/1.11G [00:00<?, ?B/s]



[Dec 10, 16:30:54] #> checkpoint['epoch'] = 0
[Dec 10, 16:30:54] #> checkpoint['batch'] = 44500




Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]



[Dec 10, 16:30:56] #> Note: Output directory /content already exists




[Dec 10, 16:30:56] #> Creating directory /content/colbertindex 




[INFO] [starting] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz

[INFO] [finished] http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz: [00:00] [2.13MB] [2.88MB/s]

http://ir.dcs.gla.ac.uk/re

[Dec 10, 16:32:50] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 6.0k (overall),  6.1k (this encoding),  15218.0M (this saving)
[Dec 10, 16:32:51] [0] 		 [NOTE] Done with local share.
[Dec 10, 16:32:51] [0] 		 #> Joining saver thread.
[Dec 10, 16:32:51] [0] 		 #> Saved batch #0 to /content/colbertindex/0.pt 		 Saving Throughput = 2.3M passages per minute.

#> num_embeddings = 581496
[Dec 10, 16:32:51] #> Starting..
[Dec 10, 16:32:51] #> Processing slice #1 of 1 (range 0..1).
[Dec 10, 16:32:51] #> Will write to /content/colbertindex/ivfpq.100.faiss.
[Dec 10, 16:32:51] #> Loading /content/colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Dec 10, 16:32:51] Preparing resources for 1 GPUs.
[Dec 10, 16:32:51] #> Training with the vectors...
[Dec 10, 16:32:51] #> Training now (using 1 GPUs)...
108.56665992736816
27.515798807144165
0.0008912086486816406
[Dec 10, 16:35:07] Done training!

[Dec 10, 16:35:07] #> Indexing the vectors...
[Dec 10, 16:35:07] #> Loadi
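
The indexing cell above feeds the indexer an iterator of documents. To index your own text instead, a generator along the following lines could be passed in its place. This is a minimal sketch that assumes each document is a dict with `docno` and `text` keys, as is conventional for PyTerrier corpus iterators; the helper name `corpus_iter` and the example passages are hypothetical.

```python
def corpus_iter(texts):
    """Yield one dict per passage, with a string identifier and the passage text."""
    for i, text in enumerate(texts, start=1):
        yield {"docno": str(i), "text": text}

# Two made-up passages standing in for a real collection.
my_docs = [
    "compact memories have flexible capacities",
    "an electronic analogue computer for solving systems of equations",
]
records = list(corpus_iter(my_docs))
print(records[0])

# In the notebook, the generator would replace the Vaswani iterator, e.g.:
# indexer.index(corpus_iter(my_docs))
```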

The indexing procedure generates the document embeddings index and a [FAISS](https://github.com/facebookresearch/faiss) index, together with some additional files.

In [7]:
!ls -ltrh /content/colbertindex

total 168M
-rw-r--r-- 1 root root 142M Dec 10 16:32 0.pt
-rw-r--r-- 1 root root 4.5M Dec 10 16:32 0.tokenids
-rw-r--r-- 1 root root 7.1M Dec 10 16:32 0.sample
-rw-r--r-- 1 root root  35K Dec 10 16:32 doclens.0.json
-rw-r--r-- 1 root root  24K Dec 10 16:32 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Dec 10 16:35 ivfpq.100.faiss
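
The file sizes in this listing can be sanity-checked against the numbers printed in the indexing log. The sketch below assumes that each embedding has 128 dimensions (the sample shape in the log) and that values are stored as 16-bit floats; the second point is an assumption that happens to agree with the listed sizes, and it ignores any file-format overhead.

```python
NUM_DOCS = 11_429          # size of the Vaswani corpus
NUM_EMBEDDINGS = 581_496   # "#> num_embeddings" in the indexing log
SAMPLE_ROWS = 29_074       # "#> Sample has shape (29074, 128)"
DIM = 128
BYTES_PER_VALUE = 2        # assumption: 16-bit floats

def size_mib(rows, dim=DIM, bytes_per_value=BYTES_PER_VALUE):
    """Size in MiB of a dense rows x dim matrix."""
    return rows * dim * bytes_per_value / 2**20

embeddings_mib = size_mib(NUM_EMBEDDINGS)
sample_mib = size_mib(SAMPLE_ROWS)
avg_tokens = NUM_EMBEDDINGS / NUM_DOCS

print(f"0.pt     ~ {embeddings_mib:.1f} MiB")
print(f"0.sample ~ {sample_mib:.1f} MiB")
print(f"average embeddings per passage ~ {avg_tokens:.1f}")
```

Under these assumptions the estimates come out at roughly 142 MiB and 7.1 MiB, in line with `0.pt` and `0.sample` above, with about 51 token embeddings stored per abstract.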


## Retrieval

Now that indexing has completed, we can load the index and the checkpoint model (which we need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings index need to be loaded into main memory.

Since we indexed a collection from scratch and the data structures are already loaded in main memory, we reuse the data structures for retrieval.

If the indexing was done offline, the following ColBERT factory can be used instead:

```python
import pyterrier_colbert.ranking
pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(checkpoint, "/content", "colbertindex")
```


In [8]:
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

[Dec 10, 16:37:38] #> Loading the FAISS index from /content/colbertindex/ivfpq.100.faiss ..
[Dec 10, 16:37:38] #> Building the emb2pid mapping..
[Dec 10, 16:37:38] len(self.emb2pid) = 581496
Loading reranking index, memtype=mem


Loading index shards to memory:   0%|          | 0/1 [00:00<?, ?shard/s]
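
The `emb2pid` mapping mentioned in the log presumably links each stored token embedding back to the passage it came from, which is needed because the FAISS index holds individual token embeddings rather than whole passages. A minimal sketch of how such a mapping could be derived from per-passage lengths follows; the helper `build_emb2pid` and the toy lengths are illustrative and not the plugin's actual code.

```python
def build_emb2pid(doclens):
    """Map each embedding position to the id of the passage that produced it."""
    emb2pid = []
    for pid, length in enumerate(doclens):
        # A passage with `length` tokens occupies `length` consecutive embedding slots.
        emb2pid.extend([pid] * length)
    return emb2pid

doclens = [3, 1, 2]               # toy passage lengths, in tokens
emb2pid = build_emb2pid(doclens)
print(emb2pid)

# Embedding ids returned by a nearest-neighbour search collapse to candidate passages.
nearest_embedding_ids = [0, 2, 5]
candidate_pids = sorted({emb2pid[e] for e in nearest_embedding_ids})
print(candidate_pids)
```

The length of the mapping equals the total number of embeddings, which is consistent with `len(self.emb2pid) = 581496` matching `num_embeddings` in the indexing log.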

Here we can ask [PyTerrier](https://github.com/terrier-org/pyterrier) to search the [ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) index for `'chemical reactions'`, returning the top 10 scored documents.

In [9]:
(colbert_e2e % 10).search("chemical reactions")

| | qid | query | docid | query_toks | query_embs | score | docno | rank |
|---|---|---|---|---|---|---|---|---|
| 1861 | 1 | chemical reactions | 4911 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 19.824471 | 4912 | 0 |
| 2388 | 1 | chemical reactions | 7048 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 19.053652 | 7049 | 1 |
| 2262 | 1 | chemical reactions | 6479 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 18.034084 | 6480 | 2 |
| 532 | 1 | chemical reactions | 9373 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 17.139666 | 9374 | 3 |
| 2216 | 1 | chemical reactions | 6278 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 16.793507 | 6279 | 4 |
| 1103 | 1 | chemical reactions | 2420 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 16.426735 | 2421 | 5 |
| 1708 | 1 | chemical reactions | 4292 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 16.193594 | 4293 | 6 |
| 1147 | 1 | chemical reactions | 10702 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 16.152369 | 10703 | 7 |
| 1981 | 1 | chemical reactions | 5303 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 16.00952 | 5304 | 8 |
| 1419 | 1 | chemical reactions | 3100 | [tensor(101), tensor(1), tensor(5072), tensor(... | [[tensor(0.0681), tensor(-0.0084), tensor(0.11... | 15.882051 | 3101 | 9 |
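
The `% 10` in the search cell is PyTerrier's rank cutoff operator, which keeps the top-ranked results for each query. The behaviour can be approximated in plain Python as below; this is a simplified sketch on lists of dicts rather than the data frames PyTerrier actually uses, and `rank_cutoff` is a hypothetical helper. The scores and docnos are copied from the first three result rows above.

```python
def rank_cutoff(results, k):
    """Keep the k highest-scoring rows per query and number them from rank 0."""
    by_qid = {}
    for row in results:
        by_qid.setdefault(row["qid"], []).append(row)
    kept = []
    for rows in by_qid.values():
        top = sorted(rows, key=lambda r: r["score"], reverse=True)[:k]
        kept.extend({**row, "rank": rank} for rank, row in enumerate(top))
    return kept

results = [
    {"qid": "1", "docno": "6480", "score": 18.034084},
    {"qid": "1", "docno": "4912", "score": 19.824471},
    {"qid": "1", "docno": "7049", "score": 19.053652},
]
top2 = rank_cutoff(results, 2)
print([(r["docno"], r["rank"]) for r in top2])
```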


## Run an experiment

Let's prepare an experiment. First, let's create a BM25 baseline transformer.

In [None]:
dataset = pt.get_dataset("vaswani")

bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index



Finally, let's evaluate our performance, comparing ColBERT against the BM25 baseline on the same corpus. We limit the experiment to the first 50 queries.

In [None]:
pt.Experiment(
    [bm25, colbert_e2e],
    dataset.get_topics().head(50),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "mrt"],
    names = ["BM25", "ColBERT"]
)

Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec




Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels






| | name | map | recip_rank | mrt |
|---|---|---|---|---|
| 0 | BM25 | 0.338941 | 0.808238 | 44.980815 |
| 1 | ColBERT | 0.333021 | 0.762643 | 780.633069 |
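
To make the effectiveness columns concrete, here is a small sketch of how average precision and reciprocal rank are computed for a single query; `map` and `recip_rank` are these values averaged over all queries, while `mrt` is the mean response time per query, reported by PyTerrier in milliseconds. The helper functions and the toy ranking are illustrative; the experiment itself relies on PyTerrier's evaluation tooling, not on this code.

```python
def average_precision(ranked_docnos, relevant):
    """Mean of the precision values at the rank of each retrieved relevant document."""
    hits = 0
    precision_sum = 0.0
    for i, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / i
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked_docnos, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for i, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / i
    return 0.0

ranking = ["d3", "d1", "d7", "d2"]   # toy system output, best first
relevant = {"d1", "d2", "d9"}        # toy relevance judgements

ap = average_precision(ranking, relevant)
rr = reciprocal_rank(ranking, relevant)
print(ap, rr)
```

In this toy case the relevant documents appear at ranks 2 and 4, so the average precision is (1/2 + 2/4) / 3 and the reciprocal rank is 1/2.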
