# PyTerrier ECIR 2021 Tutorial Notebook - Part 4.3 - ColBERT

This notebook gives attendees hands-on experience of building transformer pipelines in [PyTerrier](https://github.com/terrier-org/pyterrier).

It demonstrates the use of [ColBERT dense retrieval](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) for end-to-end indexing and retrieval in PyTerrier, as provided by the [*PyTerrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) plugin.

In this notebook, you will try out indexing and retrieval using `pyterrier_colbert`.

NB: ColBERT is memory-hungry. For this reason, we are not able to demonstrate ColBERT on corpora larger than Vaswani (11k abstracts) within the tight constraints of a Google Colab environment.

# Setup

In the following, we will set up the libraries required to execute the notebook.

## Python packages installation

The following packages are installed to avoid warnings/errors during [PyTerrier](https://github.com/terrier-org/pyterrier) installation. Note that the current release of [*PyTerrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) works only with the following versions of these Python packages:

* `transformers`, version 3.0.2
* `faiss-gpu`, version 1.6.3

> You can safely ignore the message about runtime restart.

In [1]:
!apt install --upgrade libomp-dev

!pip install --upgrade transformers==3.0.2
!pip install --upgrade faiss-gpu==1.6.3

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libomp5
Suggested packages:
  libomp-doc
The following NEW packages will be installed:
  libomp-dev libomp5
0 upgraded, 2 newly installed, 0 to remove and 30 not upgraded.
Need to get 239 kB of archives.
After this operation, 804 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp5 amd64 5.0.1-1 [234 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp-dev amd64 5.0.1-1 [5,088 B]
Fetched 239 kB in 1s (415 kB/s)
Selecting previously unselected package libomp5:amd64.
(Reading database ... 160980 files and directories currently installed.)
Preparing to unpack .../libomp5_5.0.1-1_amd64.deb ...
Unpacking libomp5:amd64 (5.0.1-1) ...
Selecting previously unselected package libomp-dev.
Preparing to unpack .../libomp-dev_5.0.1-1_amd64.deb ...
Unpacking libomp-dev (5.0.1-
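
Since the plugin depends on these exact versions, it can help to confirm what is actually installed before continuing. The following is a minimal sketch using only the standard library (it assumes Python 3.8 or newer for `importlib.metadata`); the `PINS` dictionary and the `check_pins` helper are names introduced here for illustration and are not part of PyTerrier.

```python
from importlib import metadata

# Version pins stated in this notebook.
PINS = {"transformers": "3.0.2", "faiss-gpu": "1.6.3"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


# Report any package whose installed version differs from its pin.
for package, (expected, found) in check_pins(PINS).items():
    print(f"{package}: expected {expected}, found {found}")
```

If nothing is printed, both pins are satisfied in the current environment.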

## PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [2]:
!pip install python-terrier

Collecting python-terrier
  Downloading https://files.pythonhosted.org/packages/10/0e/1756a1892b8b2aa0152ac532c7f85de802bda25772108ab8196259ea9d4f/python-terrier-0.5.0.tar.gz (74kB)
     |████████████████████████████████| 81kB 7.9MB/s 
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval>=0.5
  Downloading https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c

## PyTerrier plugins installation

We install the official version of the [*PyTerrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) plugin. You can safely ignore the package versioning errors.

In [3]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-yqvdmli2
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-yqvdmli2
Collecting ColBERT@ git+https://github.com/cmacdonald/ColBERT.git@v0.2#egg=ColBERT
  Cloning https://github.com/cmacdonald/ColBERT.git (to revision v0.2) to /tmp/pip-install-psp3to_p/ColBERT
  Running command git clone -q https://github.com/cmacdonald/ColBERT.git /tmp/pip-install-psp3to_p/ColBERT
  Running command git checkout -b v0.2 --track origin/v0.2
  Switched to a new branch 'v0.2'
  Branch 'v0.2' set up to track remote branch 'v0.2' from 'origin'.
Collecting ujson
  Downloading https://files.pythonhosted.org/packages/17/4e/50e8e4cf5f00b537095711c2c86ac4d7191aed2b4fffd5a19f06898f6929/ujson-4.0.2-cp37-cp37m-manylinux1_x86_64.whl (179kB)
     |████████████████████████████████| 184kB 11.6MB/s 
Colle

## Trained model download

The following cell sets the URL of the [ColBERT](https://github.com/stanford-futuredata/ColBERT) model checkpoint. The checkpoint itself is downloaded the first time it is used; the download takes less than 1 minute in our experience.

In [4]:
checkpoint="http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip"

# Preliminary steps

## [PyTerrier](https://github.com/terrier-org/pyterrier) initialization

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [5]:
import pyterrier as pt

if not pt.started():
    pt.init()

terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...


  from pandas import Panel


Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.5.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


## [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) Dataset download

The following cell downloads the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) dataset that we will use in the remainder of the tutorial.

We limit the queries to just the first 50 topics to avoid RAM issues with ColBERT on Colab. ColBERT is **very** memory-hungry.

In [6]:
dataset = pt.get_dataset("vaswani")
topics = dataset.get_topics().head(50)
qrels = dataset.get_qrels()

index = dataset.get_index()

Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec


Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels


Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index



# [ColBERT](https://github.com/stanford-futuredata/ColBERT) indexing

We are going to index the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) collection with [ColBERT](https://github.com/stanford-futuredata/ColBERT).

The construction of this index takes some time. The following code:
* downloads some additional BERT models;
* processes the whole collection to compute the document embeddings, with at most 180 embeddings per document;
* performs the *training* of the IVFPQ [FAISS](https://github.com/facebookresearch/faiss) index supporting approximate nearest neighbour search.

For 11,429 documents, the code computes 581,496 document embeddings (~50.9 embeddings per document on average) in approximately **5 minutes**.
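
The average follows directly from the two counts. As a quick check, the short sketch below recomputes it and compares the number of embeddings with the upper bound implied by the cap of 180 embeddings per document. The variable names are chosen here for illustration; the figures are the ones reported in this notebook.

```python
# Figures reported in this notebook for the Vaswani collection.
num_documents = 11_429
num_embeddings = 581_496
max_embeddings_per_doc = 180

# Average number of embeddings actually stored per document.
avg_embeddings = num_embeddings / num_documents

# Number of embeddings there would be if every document hit the cap.
upper_bound = num_documents * max_embeddings_per_doc

print(f"average embeddings per document: {avg_embeddings:.2f}")
print(f"upper bound at {max_embeddings_per_doc} per document: {upper_bound:,}")
print(f"fraction of the upper bound used: {num_embeddings / upper_bound:.1%}")
```

Most Vaswani abstracts are far shorter than the cap, which is why the stored embeddings amount to well under a third of the upper bound.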

In [7]:
# Remove any index left over from a previous run.
!rm -rf /content/colbert_index

import pyterrier_colbert.indexing

colbert_indexer = pyterrier_colbert.indexing.ColBERTIndexer(
    checkpoint=checkpoint,
    index_root="/content",
    index_name="colbert_index",
    chunksize=3,
)
# Encode every Vaswani document and build the FAISS index.
colbert_indexer.index(dataset.get_corpus_iter())

Downloading vaswani corpus to /root/.pyterrier/corpora/vaswani/corpus


[Mar 26, 13:40:06] [0] 		 #> Local args.bsize = 128
[Mar 26, 13:40:06] [0] 		 #> args.index_root = /content
[Mar 26, 13:40:06] [0] 		 #> self.possible_subset_sizes = [69905]


Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Mar 26, 13:40:28] #> Loading model checkpoint.
[Mar 26, 13:40:28] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip


Downloading: "http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip" to /root/.cache/torch/hub/checkpoints/colbert_model_checkpoint.zip


[Mar 26, 13:41:16] #> checkpoint['epoch'] = 0
[Mar 26, 13:41:16] #> checkpoint['batch'] = 44500




[Mar 26, 13:41:16] #> Note: Output directory /content already exists




[Mar 26, 13:41:16] #> Creating directory /content/colbert_index 


docFromText on 11429 documents
tokens doc 0: 180
total tokens 2057220
lenD = 11429 
11429/content/colbert_index/0.pt

[Mar 26, 13:43:48] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 4.5k (overall),  4.6k (this encoding),  848.9M (this saving)
[Mar 26, 13:43:48] [0] 		 [NOTE] Done with local share.
[Mar 26, 13:43:48] [0] 		 #> Joining saver thread.
[Mar 26, 13:43:48] [0] 		 #> Saved batch #0 to /content/colbert_index/0.pt 		 Saving Throughput = 1.8M passages per minute.

#> num_embeddings = 581496
[Mar 26, 13:43:49] #> Starting..
[Mar 26, 13:43:49] #> Processing slice #1 of 1 (range 0..1).
[Mar 26, 13:43:49] #> Will write to /content/colbert_index/ivfpq.100.faiss.
[Mar 26, 13:43:49] #> Loading /content/colbert_index/0.sample ...
#> Sample has shape (29074, 128)
[Mar 26, 13:43:49] Preparing resources for 1 GPUs.
[Mar 26, 1

Let's take a look at the files created.

In [8]:
!ls -lh /content/colbert_index/

total 163M
-rw-r--r-- 1 root root 142M Mar 26 13:43 0.pt
-rw-r--r-- 1 root root 7.1M Mar 26 13:43 0.sample
-rw-r--r-- 1 root root  35K Mar 26 13:43 doclens.0.json
-rw-r--r-- 1 root root  36K Mar 26 13:43 docnos.pkl.gz
-rw-r--r-- 1 root root  14M Mar 26 13:46 ivfpq.100.faiss


So a few files have been generated. Firstly, note that ColBERT splits its index into chunks. Vaswani is small enough to need only a single chunk, so we have only `0.pt` and no `1.pt` etc. In the list below, $x$ stands for the chunk number:
 - $x$ `.pt` - the document embeddings for each chunk.
 - $x$ `.sample` - a sample of the document embeddings in that chunk, used for training FAISS and not needed at retrieval time.
 - `doclens.` $x$ `.json` - the number of document embeddings per document.
 - `ivfpq.` $y$ `.faiss` - the FAISS index for all document embeddings.
 - `docnos.pkl.gz` - the docno document metadata, used by PyTerrier_ColBERT to return docnos.
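
The naming scheme above can be sketched as a small lookup that maps a filename to its role. This is only an illustration: the `ROLES` table and `describe` helper are hypothetical names, and the patterns assume that $x$ and $y$ are always non-negative integers, as they are in the listing shown here.

```python
import re

# Filename patterns and their roles, following the list above.
# Assumption: the variable parts of the names are plain integers.
ROLES = [
    (re.compile(r"^\d+\.pt$"), "document embeddings for one chunk"),
    (re.compile(r"^\d+\.sample$"), "sample of embeddings used to train FAISS"),
    (re.compile(r"^doclens\.\d+\.json$"), "number of embeddings per document"),
    (re.compile(r"^ivfpq\.\d+\.faiss$"), "FAISS index over all embeddings"),
    (re.compile(r"^docnos\.pkl\.gz$"), "docno metadata"),
]


def describe(filename):
    """Return the role of an index file, or "unknown" if no pattern matches."""
    for pattern, role in ROLES:
        if pattern.match(filename):
            return role
    return "unknown"


# The five files from the directory listing in this notebook.
for name in ["0.pt", "0.sample", "doclens.0.json", "docnos.pkl.gz", "ivfpq.100.faiss"]:
    print(f"{name:18s} {describe(name)}")
```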
 

# Retrieval experiments

Now that indexing/downloading has completed, we can load the index and the learned model (which we will need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings need to be loaded into main memory.
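
Both structures are needed because end-to-end ColBERT retrieval, as described in the ColBERT paper, works in two stages: the FAISS index finds candidate documents by approximate nearest neighbour search, and the stored document embeddings are then used to score each candidate exactly. The exact score sums, over the query embeddings, the maximum similarity to any embedding of the document. The sketch below illustrates only that scoring rule in plain Python with toy two-dimensional vectors; the function names and data are made up for illustration, and this is not the code used by `pyterrier_colbert`.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: for each query embedding, take the best
    matching document embedding, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy unit-length embeddings (hypothetical data): two query tokens,
# one document with two tokens and one document with a single token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, 1.0]]

print("doc_a:", maxsim_score(query, doc_a))
print("doc_b:", maxsim_score(query, doc_b))
```

Here `doc_a` scores higher because it has a good match for both query embeddings, whereas `doc_b` matches only the second one.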

Let's prepare an experiment. Firstly, let's create a BM25 baseline transformer and the [ColBERT](https://github.com/stanford-futuredata/ColBERT) retrieval transformer.

In [9]:
bm25_retriever = pt.BatchRetrieve(index, wmodel="BM25")

colbert_retriever = colbert_indexer.ranking_factory().end_to_end()

[Mar 26, 13:46:20] #> Loading the FAISS index from /content/colbert_index/ivfpq.100.faiss ..
[Mar 26, 13:46:20] #> Building the emb2pid mapping..
[Mar 26, 13:46:20] len(self.emb2pid) = 581496
Loading reranking index, memtype=mem


Now we are ready to run the experiments. We are going to retrieve the top 10 ranked documents for each of the 50 topics selected earlier, and compute several effectiveness metrics along with the mean response time (`mrt`).

In [10]:
pt.Experiment(
    [bm25_retriever % 10, colbert_retriever % 10], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_10", "mrt"],
    names=['BM25', 'ColBERT'],
)

|   | name | map | recip_rank | P_10 | ndcg_cut_10 | mrt |
|---|------|-----|------------|------|-------------|-----|
| 0 | BM25 | 0.200356 | 0.806667 | 0.392 | 0.505645 | 44.90698 |
| 1 | ColBERT | 0.191802 | 0.757857 | 0.41 | 0.495308 | 837.903343 |


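So for this small dataset, ColBERT achieves a MAP similar to that of BM25 (0.192 vs. 0.200) and a marginally higher P@10 (0.410 vs. 0.392), but a lower MRR (0.758 vs. 0.807) and a much higher mean response time (about 838 ms vs. 45 ms per query).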
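
For readers less familiar with these measures, the sketch below shows how two of them, P@10 (`P_10`) and reciprocal rank (`recip_rank`), are computed for a single query. The functions and the toy ranking are hypothetical and written for illustration only; `pt.Experiment` computes the metrics itself and reports their mean over all topics.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for docno in ranking[:k] if docno in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


# Toy data (hypothetical): a ranking of ten docnos and the relevant set.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant = {"d7", "d4", "d5"}

print("P@10:", precision_at_k(ranking, relevant))
print("RR:  ", reciprocal_rank(ranking, relevant))
```

A system can therefore have a higher P@10 but a lower MRR than another, as seen above: it may retrieve more relevant documents in the top 10 overall while placing the first relevant one slightly lower.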

# That's all folks

Once you have finished all of the Part 4 notebooks, please don't forget to complete our exit quiz: https://forms.office.com/r/2WbpLiQmWV