# PyTerrier Search Solutions 2022 Tutorial Notebook - ColBERT

This notebook gives attendees hands-on experience of building transformer pipelines in [PyTerrier](https://github.com/terrier-org/pyterrier).

It demonstrates the use of [ColBERT dense retrieval](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) for end-to-end indexing and retrieval in PyTerrier, as provided by the [*Pyterrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) plugin.

You will work through both indexing and retrieval using `pyterrier_colbert`.

NB: ColBERT is memory hungry. For this reason, we are not able to demonstrate ColBERT on corpora larger than Vaswani (11k abstracts) within the tight constraints of a Google Colab environment.

# Setup

In the following, we will set up the libraries required to execute the notebook.

## Python packages installation

The following packages are installed to avoid warnings/errors during [PyTerrier](https://github.com/terrier-org/pyterrier) installation. Note that the current release of [*Pyterrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) works only with the following Python packages:

* `transformers`, version 3.0.2
* `faiss-gpu`, version 1.6.3

> You can safely ignore the message about runtime restart.

In [None]:
!apt install --upgrade --quiet libomp-dev

%pip install --upgrade --quiet transformers
%pip install --upgrade --quiet faiss-gpu==1.7.2

## PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [None]:
%pip install --quiet python-terrier

## PyTerrier plugins installation

We install the official version of the [*Pyterrier ColBERT*](https://github.com/terrierteam/pyterrier_colbert) plugin. You can safely ignore the package versioning errors.

In [None]:
%pip install --upgrade --quiet git+https://github.com/terrierteam/pyterrier_colbert.git

# Preliminary steps

In [None]:
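import pyterrier as pt

# Older PyTerrier releases need an explicit start-up; skip it if already started.
if not pt.started():
    pt.init()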


## [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) dataset download

The following cell downloads the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) dataset that we will use in the remainder of the tutorial.

 We limit queries to just 50 topics to avoid RAM issues with ColBERT on Colab. ColBERT is **very** memory-hungry.

In [None]:
dataset = pt.get_dataset("vaswani")
topics = dataset.get_topics().head(50)
qrels = dataset.get_qrels()

# [ColBERT](https://github.com/stanford-futuredata/ColBERT) indexing

We are going to index the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) collection with [ColBERT](https://github.com/stanford-futuredata/ColBERT).

The construction of this index takes some time. The following code:
* downloads some additional BERT models;
* processes the whole collection to compute the document embeddings, i.e., at most 180 embeddings per document;
* performs the *training* of the IVFPQ [FAISS](https://github.com/facebookresearch/faiss) index supporting approximate nearest neighbour search.

For 11,429 documents, the code computes 581,496 document embeddings, ~50.9 embeddings per document, in approximately **5 minutes**.
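To make these figures concrete, the short sketch below recomputes the average number of embeddings per document and gives a rough estimate of the raw storage they need. The embedding dimension (128) and the 16-bit storage format are assumptions made for illustration; neither is stated in this notebook.

```python
# Figures quoted in the text above.
num_documents = 11_429
num_embeddings = 581_496

# Assumptions (not stated in this notebook): 128-dimensional embeddings,
# each value stored as a 16-bit float.
embedding_dim = 128
bytes_per_value = 2

avg_embeddings_per_doc = num_embeddings / num_documents
raw_size_mib = num_embeddings * embedding_dim * bytes_per_value / 2**20

print(f"average embeddings per document: {avg_embeddings_per_doc:.2f}")
print(f"raw embedding storage: {raw_size_mib:.1f} MiB")
```

Under these assumptions even a small corpus needs over a hundred MiB for the embeddings alone, before the FAISS index is counted, which is one reason ColBERT is memory hungry.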

The following code can be used for indexing:

```python
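# Remove any index left over from a previous run (notebook shell command).
!rm -rf /content/colbert_index

import pyterrier_colbert.indexing

# NOTE: `checkpoint` is not defined anywhere in this notebook; set it to the
# path or URL of a trained ColBERT model checkpoint before running this code.
colbert_indexer = pyterrier_colbert.indexing.ColBERTIndexer(
    checkpoint=checkpoint,
    index_root="/content",       # parent directory for the index
    index_name="colbert_index",  # index is written to /content/colbert_index
    chunksize=3,
)
# Encode every document in the Vaswani corpus and build the index.
colbert_indexer.index(dataset.get_corpus_iter())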
```

# Retrieval experiments

Now we can load the index and the learned model (which we will need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings need to be loaded into main memory.

Let's prepare an experiment. Firstly, let's create a BM25 baseline transformer and the [ColBERT](https://github.com/stanford-futuredata/ColBERT) retrieval transformer.

In [None]:
from pyterrier_colbert.ranking import ColBERTFactory

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')

factory = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k')
colbert_e2e = factory.end_to_end()

Let's take a look at the files downloaded. Firstly, note that ColBERT splits its index into chunks. Vaswani is small enough to need only a single chunk, so we have only `0.pt` and no `1.pt` etc.:
 - $x$ `.pt` - the document embeddings for each chunk
 - $x$ `.sample` - a sample of the document embeddings in that chunk - used for training FAISS, not needed at retrieval time
 - `doclens.` $x$ `.json` - the number of document embeddings per document.
 - `ivfpq.` $y$ `.faiss` - the FAISS index for all document embeddings
 - `docnos.pkl.gz` - the docno document metadata, used by PyTerrier_ColBERT to return docnos.
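
As an illustration of why the per-document embedding counts are useful, the sketch below maps a flat embedding position back to the document that owns it. This is an assumption about how such counts can be used, not a description of the plugin's internals; the `doclens` values and the `embedding_to_doc` helper are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical doclens for one chunk: the number of embeddings per document,
# in index order (real values would come from a doclens JSON file).
doclens = [3, 5, 2, 4]

# Cumulative end offsets: document i owns the embedding positions
# from offsets[i] - doclens[i] (inclusive) to offsets[i] (exclusive).
offsets = list(accumulate(doclens))


def embedding_to_doc(position: int) -> int:
    """Return the index of the document that owns a flat embedding position."""
    if not 0 <= position < offsets[-1]:
        raise IndexError(f"embedding position {position} out of range")
    return bisect_right(offsets, position)


for position in (0, 3, 8, 13):
    print(position, "->", embedding_to_doc(position))
```

A lookup of this kind is what lets a nearest-neighbour hit on a single embedding be turned into a candidate document, whose docno can then be read from the document metadata.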
 


Now we are ready to run the experiments. We are going to retrieve the top 10 ranked documents for each of the selected topics, and compute several effectiveness metrics along with the mean response time (`mrt`).

In [None]:
pt.Experiment(
    [bm25_terrier_stemmed % 10, colbert_e2e % 10], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_10", "mrt"],
    names=['BM25', 'ColBERT'],
)

So for this small dataset, ColBERT achieves a MAP similar to that of BM25 and a marginally higher P@10, but a lower MRR.
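
For readers unfamiliar with these measures, the following sketch computes precision at k, reciprocal rank and average precision for a single toy query. It is a simplified illustration with binary relevance, not the evaluation code used by `pt.Experiment`; the docnos and relevance judgements are hypothetical. MAP and MRR are the means of average precision and reciprocal rank over all queries.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranking, relevant):
    """Sum of precision at each relevant retrieved rank, over all relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking for one query and its set of relevant docnos.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d7", "d4", "d8"}

print("P@5:", precision_at_k(ranking, relevant, 5))
print("RR: ", reciprocal_rank(ranking, relevant))
print("AP: ", average_precision(ranking, relevant))
```

Because reciprocal rank depends only on the first relevant document while P@10 counts every relevant document in the top 10, a system can score higher on one and lower on the other, as seen in the results above.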

# Visualising ColBERT

Let's dig a little deeper into which documents ColBERT retrieves and what it pays attention to in those documents.

In [None]:
query = 'what is the origin of covid 19'
document = 'Origin of the COVID-19 virus has been intensely debated in the scientific community since the first infected cases were detected in December 2019. The disease has caused a global pandemic, leading to deaths of thousands of people across the world and thus finding origin of this novel coronavirus is'

figure = factory.explain_text(query, document)

This interaction figure shows how a query and a document interact. In particular:

* The top sub-plot shows the contribution of each query wordpiece to the document's score.
* In the heatmap, darker colours indicate higher similarity between the query embedding and the document embedding.
* ColBERT uses a max_sim operator: for each query embedding, only the most similar document embedding contributes to the final score of the document. For each query embedding, we put an "X" mark in the row of the document embedding that is the source of that maximum similarity.
* [MASK] tokens are extra tokens added to the query by ColBERT. We can observe which document embeddings these match with.
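
The max_sim scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration on toy 2-dimensional vectors, assuming dot-product similarity; the `max_sim_score` helper and the example vectors are hypothetical and are not part of `pyterrier_colbert`.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_sim_score(query_embs, doc_embs):
    """Score a document by summing, over query embeddings, the maximum similarity.

    Returns (score, best), where best[i] is the index of the document
    embedding that is most similar to query embedding i (the "X" marks).
    """
    score = 0.0
    best = []
    for q in query_embs:
        sims = [dot(q, d) for d in doc_embs]
        j = max(range(len(sims)), key=sims.__getitem__)
        best.append(j)
        score += sims[j]
    return score, best


# Toy unit-length embeddings (hypothetical values).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]

score, best = max_sim_score(query_embs, doc_embs)
print("score:", score)
print("best-matching document embedding per query embedding:", best)
```

Here the third document embedding never wins a maximum, so it contributes nothing to the score, which mirrors the rows of the heatmap that carry no "X" mark.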