# PyTerrier ECIR 2021 Tutorial Notebook - Part 4.2 - ColBERT

This notebook gives attendees hands-on experience of building transformer pipelines in [PyTerrier](https://github.com/terrier-org/pyterrier).

This notebook aims to demonstrate the use of [ColBERT dense retrieval](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) for end-to-end indexing and retrieval in PyTerrier, as provided by the [*Pyterrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) plugin.

In this notebook, you will index a collection and retrieve from it using pyterrier_colbert.

NB: ColBERT is memory-hungry. For this reason, we cannot demonstrate ColBERT on corpora larger than Vaswani (11k abstracts) within the tight constraints of a Google Colab environment.

# Setup

In the following, we will set up the libraries required to execute the notebook.

## Python packages installation

The following packages are installed to avoid warnings/errors during [PyTerrier](https://github.com/terrier-org/pyterrier) installation. Note that the current release of [*Pyterrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) works only with the following Python packages:

* `transformers`, version 3.0.2
* `faiss-gpu`, version 1.6.3

> You can safely ignore the message about runtime restart.

In [None]:
!apt install --upgrade libomp-dev

!pip install --upgrade transformers==3.0.2
!pip install --upgrade faiss-gpu==1.6.3
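
Because the plugin is sensitive to these versions, it can help to confirm what is actually installed before continuing. The following is a minimal sketch using only the standard library (it assumes Python 3.8 or later for `importlib.metadata`); `PINS` and `check_pins` are hypothetical names introduced here for illustration and are not part of PyTerrier.

```python
from importlib import metadata

# Versions pinned in the installation cell above.
PINS = {"transformers": "3.0.2", "faiss-gpu": "1.6.3"}

def check_pins(pins):
    """Return {package: (expected, installed)} for every package that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Report any package whose installed version differs from the pin.
for name, (expected, installed) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, both packages are installed at the pinned versions.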

## PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [None]:
!pip install python-terrier

## PyTerrier plugins installation

We install the official version of the [*Pyterrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) plugin. You can safely ignore the package versioning errors.

In [None]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

## Trained model download

This downloads the [ColBERT](https://github.com/stanford-futuredata/ColBERT) model checkpoint. Download time can vary, but was less than one minute in our experience.

In [None]:
import os
if not os.path.exists("colbert_model_checkpoint.zip"):
    !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip
    !unzip -j colbert_model_checkpoint.zip -d colbert_model_checkpoint
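
The cell above relies on the `wget` and `unzip` command-line tools, which are available on Colab but not everywhere. As a hedged alternative, the sketch below does roughly the same with the Python standard library only. `extract_flat` and `fetch_checkpoint` are hypothetical helper names; `extract_flat` mimics `unzip -j` by discarding any directory structure inside the archive. Unlike the cell above, `fetch_checkpoint` unpacks the archive every time it is called, not only after a fresh download.

```python
import os
import shutil
import urllib.request
import zipfile

CHECKPOINT_URL = "http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip"

def extract_flat(zip_path, target_dir):
    """Extract every file of the archive into target_dir, dropping subdirectories (like `unzip -j`)."""
    os.makedirs(target_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # skip directory entries, only files are copied
            name = os.path.basename(member.filename)
            with archive.open(member) as src, open(os.path.join(target_dir, name), "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(name)
    return extracted

def fetch_checkpoint(url=CHECKPOINT_URL,
                     zip_path="colbert_model_checkpoint.zip",
                     target_dir="colbert_model_checkpoint"):
    """Download the archive if it is missing, then unpack it into target_dir."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return extract_flat(zip_path, target_dir)
```

Calling `fetch_checkpoint()` with no arguments uses the same file and directory names as the shell commands above.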

# Preliminary steps

## [PyTerrier](https://github.com/terrier-org/pyterrier) initialization

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [None]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm='notebook')

## [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) Dataset download

The following cell downloads the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) dataset that we will use in the remainder of the tutorial.

We limit queries to just 50 topics to avoid RAM issues with ColBERT on Colab. ColBERT is **very** memory-hungry.

In [None]:
dataset = pt.get_dataset("vaswani")
topics = dataset.get_topics().head(50)
qrels = dataset.get_qrels()

index = dataset.get_index()

# [ColBERT](https://github.com/stanford-futuredata/ColBERT) indexing

We are going to index the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) collection with [ColBERT](https://github.com/stanford-futuredata/ColBERT).

The construction of this index takes some time. The following code:
* downloads some additional BERT models;
* processes the whole collection to compute the document embeddings (at most 180 embeddings per document);
* performs the *training* of the IVFPQ [FAISS](https://github.com/facebookresearch/faiss) index supporting approximate nearest neighbour search.

For 11,429 documents, the code computes 581,496 document embeddings, ~50.9 embeddings per document, in approximately 5 minutes.
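
The figures above can be checked with a little arithmetic, which also gives a feel for why ColBERT is memory-hungry. The sketch below recomputes the average number of embeddings per document and estimates the raw size of the embeddings. The embedding dimension (128) and the storage type (16-bit floats) are assumptions that are not stated in this notebook, so treat the size as a rough estimate only.

```python
num_docs = 11_429
num_embeddings = 581_496

# Average number of embeddings stored per document.
avg_per_doc = num_embeddings / num_docs
print(f"{avg_per_doc:.1f} embeddings per document")

# ASSUMPTIONS (not stated in this notebook): 128-dimensional embeddings,
# each value stored as a 16-bit (2-byte) float.
dim = 128
bytes_per_value = 2
size_mib = num_embeddings * dim * bytes_per_value / 2**20
print(f"~{size_mib:.0f} MiB of raw document embeddings")
```

Under these assumptions the raw embeddings grow linearly with the number of tokens in the collection, which is why much larger corpora do not fit comfortably within Colab's memory limits.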

In [None]:
!rm -rf /content/colbert_index

import pyterrier_colbert.indexing

colbert_indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint="/content/colbert_model_checkpoint/colbert.dnn", 
                                                            index_root="/content",
                                                            index_name="colbert_index",
                                                            chunksize=3)
colbert_indexer.index(dataset.get_corpus_iter())

Let's take a look at the files created.

In [None]:
!ls -lh /content/colbert_index/

So a few files have been generated. Firstly, note that ColBERT writes its index in chunks - Vaswani is small enough to need only a single chunk, so we have only `0.pt` and no `1.pt`, etc.:
 - $x$ `.pt` - the document embeddings for each chunk
 - $x$ `.sample` - a sample of the document embeddings in that chunk - used for training FAISS, not needed at retrieval time
 - `doclens.` $x$ `.json` - the number of document embeddings per document.
 - `ivfpq.` $y$ `.faiss` - the FAISS index for all document embeddings
 - `docnos.pkl.gz` - the docno document metadata, used by PyTerrier_ColBERT to return docnos.
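
To make the role of the `doclens` files concrete, here is a minimal sketch that sums them across chunks to recover the number of documents and the total number of embeddings. It assumes that each `doclens.` $x$ `.json` file holds a plain JSON list with one integer per document, which is not confirmed in this notebook; `summarise_doclens` is a hypothetical helper name.

```python
import glob
import json
import os

def summarise_doclens(index_dir):
    """Return (number of documents, total embeddings) across all chunks in index_dir."""
    num_docs = 0
    num_embeddings = 0
    for path in sorted(glob.glob(os.path.join(index_dir, "doclens.*.json"))):
        with open(path) as f:
            doclens = json.load(f)  # ASSUMED: a list with one integer per document
        num_docs += len(doclens)
        num_embeddings += sum(doclens)
    return num_docs, num_embeddings

# Only inspect the index if it has already been built at the path used above.
index_dir = "/content/colbert_index"
if os.path.isdir(index_dir):
    docs, embs = summarise_doclens(index_dir)
    print(f"{docs} documents, {embs} embeddings")
```

If the assumption about the file layout holds, the two numbers should match the document and embedding counts reported during indexing.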
 

# Retrieval experiments

Now that indexing/downloading has completed, we can load the index and the learned model (which we will need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings need to be loaded into main memory.

Let's prepare an experiment. Firstly, let's create a BM25 baseline transformer and the [ColBERT](https://github.com/stanford-futuredata/ColBERT) retrieval transformer.

In [None]:
bm25_retriever = pt.BatchRetrieve(index, wmodel="BM25")

colbert_retriever = colbert_indexer.ranking_factory().end_to_end()

Now we are ready to run the experiment. We retrieve the top 10 ranked documents for each of the 50 selected topics and compute several effectiveness metrics, along with the mean response time (`mrt`).

In [None]:
pt.Experiment(
    [bm25_retriever % 10, colbert_retriever % 10], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_10", "mrt"],
    names=['BM25', 'ColBERT'],
)
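
To make two of the reported metrics concrete, the sketch below computes precision at 10 (`P_10`) and reciprocal rank (`recip_rank`) for a single query in plain Python. It is only an illustration of the definitions, not the evaluation code that `pt.Experiment` runs; the function names and the toy docnos are made up for this example.

```python
def precision_at_k(ranked_docnos, relevant, k=10):
    """Fraction of the top k positions that hold a relevant document."""
    return sum(1 for d in ranked_docnos[:k] if d in relevant) / k

def reciprocal_rank(ranked_docnos, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked_docnos, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

# Toy ranking for one query; the docnos are invented for illustration.
ranking = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d5", "d6", "d0"]
relevant = {"d2", "d4", "d5"}

print(precision_at_k(ranking, relevant))   # 0.3 (three relevant documents in the top 10)
print(reciprocal_rank(ranking, relevant))  # 0.5 (first relevant document at rank 2)
```

The values in the experiment table are averages of such per-query scores over all the topics used.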

# That's all folks