# PyTerrier ColBERT Demo Notebook - Vaswani

This notebook demonstrates the use of the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert) for dense passage retrieval.

[ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds. ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. Then at search time, it embeds every query into another matrix of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. 
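
To make the late-interaction idea concrete, here is a minimal pure-Python sketch of a MaxSim-style score. It is an illustration only, not ColBERT's actual implementation: the function names and the tiny 2-dimensional unit vectors are invented for this example, and real ColBERT embeddings are produced by BERT and scored in batches on the GPU.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """For each query token, keep its best match among the document tokens,
    then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token embeddings (unit length), one row per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_match = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains both query tokens
doc_partial = [[0.6, 0.8], [0.8, 0.6]]            # only loosely related tokens

score_match = maxsim_score(query, doc_match)
score_partial = maxsim_score(query, doc_partial)
print(score_match, score_partial)
```

The document that contains a close match for every query token receives the higher score, which is the behaviour the paragraph above describes.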


[ColBERT](https://arxiv.org/abs/2004.12832) is built on top of [BERT](https://arxiv.org/abs/1810.04805). ColBERT surpasses the quality of single-vector representation models, while scaling efficiently to large corpora. 

The corpus used in this demo is the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts, with corresponding queries and relevance assessments.

## Installation 

We need to install [PyTerrier](https://github.com/terrier-org/pyterrier).

In [3]:
!pip install python-terrier



This installs the [PyTerrier plugin for ColBERT](https://github.com/terrierteam/pyterrier_colbert). It supplies an indexer and a retrieval transformer.

In [4]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-leu2i9co
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-leu2i9co
Building wheels for collected packages: pyterrier-colbert
  Building wheel for pyterrier-colbert (setup.py) ... done
  Created wheel for pyterrier-colbert: filename=pyterrier_colbert-0.0.1-cp37-none-any.whl size=11942 sha256=b37992b59cc28b51b83bab8ff2e848132e5eb1c4bab4230acc3adeadc72846c0
  Stored in directory: /tmp/pip-ephem-wheel-cache-0itqb27i/wheels/7d/23/87/59bcb24958d35319315084fe0b193e9b0c15a1d384199dbaf3
Successfully built pyterrier-colbert
Installing collected packages: pyterrier-colbert
  Found existing installation: pyterrier-colbert 0.0.1
    Uninstalling pyterrier-colbert-0.0.1:
      Successfully uninstalled pyterrier-colbert-0.0.1
Successfully installed pyterrier-colbert-0.0.1


This installs [FAISS](https://github.com/facebookresearch/faiss), a library for efficient similarity search and clustering of dense vectors.

In [5]:
import sys

COLAB='google.colab' in sys.modules

try:
  import faiss
  faiss.get_num_gpus()
except Exception:  # faiss is missing or was built without GPU support
  if COLAB:
    print('Installing faiss-gpu from pip ')
    !pip install faiss-gpu==1.6.3
  else:
    print('Installing faiss-gpu via Conda')
    !conda install -c pytorch faiss-gpu

import faiss
assert faiss.get_num_gpus() > 0

## Setup

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [6]:
import pyterrier as pt
pt.init()

PyTerrier 0.4.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


This downloads the model checkpoint generated by Craig Macdonald. Download time varies; on average it takes 11-12 minutes.

In [7]:
import os
if not os.path.exists("colbert.dnn.zip"):
  !wget "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
  !unzip colbert.dnn.zip
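
Outside a notebook, where the `!wget` and `!unzip` shell escapes are unavailable, the same step could be sketched with the Python standard library. The helper name `fetch_and_extract` is hypothetical, and unlike the cell above it re-extracts the archive on every call, not only after a fresh download.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url: str, archive_path: str, extract_dir: str = ".") -> bool:
    """Download `url` to `archive_path` unless that file already exists,
    then unzip it into `extract_dir`.

    Returns True if a download happened, False if an existing archive was reused.
    """
    downloaded = False
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
        downloaded = True
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_dir)
    return downloaded


# Usage (commented out so the sketch runs without network access):
# fetch_and_extract("http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip", "colbert.dnn.zip")
```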

## Indexing

This indexes the [Vaswani dataset](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/). Indexing takes about 3 minutes using a Colab GPU.

In [None]:
!rm -rf /content/colbertindex

import pyterrier_colbert.indexing

indexer = pyterrier_colbert.indexing.ColBERTIndexer("colbert.dnn", "/content", "colbertindex", chunksize=3)
indexer.index(pt.get_dataset("irds:vaswani").get_corpus_iter())

[Mar 15, 15:50:54] [0] 		 #> Local args.bsize = 128
[Mar 15, 15:50:54] [0] 		 #> args.index_root = /content
[Mar 15, 15:50:54] [0] 		 #> self.possible_subset_sizes = [69905]


Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Mar 15, 15:51:20] #> Loading model checkpoint.
[Mar 15, 15:51:20] #> Loading checkpoint colbert.dnn
[Mar 15, 15:51:56] #> checkpoint['epoch'] = 0
[Mar 15, 15:51:56] #> checkpoint['batch'] = 44500




[Mar 15, 15:51:57] #> Note: Output directory /content already exists




[Mar 15, 15:51:57] #> Creating directory /content/colbertindex 




vaswani documents: 100%|██████████| 11429/11429 [00:00<00:00, 155318.92it/s]

docFromText on 11429 documents





tokens doc 0: 180
total tokens 2057220
lenD = 11429 
11429/content/colbertindex/0.pt

[Mar 15, 15:54:17] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 4.9k (overall),  4.9k (this encoding),  1042.1M (this saving)
[Mar 15, 15:54:17] [0] 		 [NOTE] Done with local share.
[Mar 15, 15:54:17] [0] 		 #> Joining saver thread.
[Mar 15, 15:54:18] [0] 		 #> Saved batch #0 to /content/colbertindex/0.pt 		 Saving Throughput = 2.2M passages per minute.

#> num_embeddings = 581496
[Mar 15, 15:54:18] #> Starting..
[Mar 15, 15:54:18] #> Processing slice #1 of 1 (range 0..1).
[Mar 15, 15:54:18] #> Will write to /content/colbertindex/ivfpq.100.faiss.
[Mar 15, 15:54:18] #> Loading /content/colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Mar 15, 15:54:18] Preparing resources for 1 GPUs.
[Mar 15, 15:54:18] #> Training with the vectors...
[Mar 15, 15:54:18] #> Training now (using 1 GPUs)...
2.825673818588257


The indexing procedure generates the document embeddings index and a [FAISS](https://github.com/facebookresearch/faiss) index, together with some additional files.

In [None]:
!ls -ltrh /content/colbertindex
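
To index a different collection, a custom iterator could stand in for `get_corpus_iter()`. The sketch below assumes that the indexer expects dictionaries with `docno` and `text` keys, as the `irds:vaswani` iterator provides; that assumption should be checked against the plugin's documentation. The helper name `corpus_iter` and the sample passages are made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def corpus_iter(passages: Iterable[str], prefix: str = "doc") -> Iterator[Dict[str, str]]:
    """Yield one {'docno', 'text'} dictionary per non-empty passage."""
    for i, text in enumerate(passages):
        text = " ".join(text.split())  # collapse runs of whitespace
        if not text:
            continue  # skip passages that are empty after cleaning
        yield {"docno": f"{prefix}{i}", "text": text}


docs = list(corpus_iter([
    "compact memories have  flexible capacities",
    "",
    "electron tube theory",
]))
print(docs)

# Assumed usage with the indexer created above:
# indexer.index(corpus_iter(my_passages))
```

Document numbers are derived from the position in the input, so skipped passages leave gaps in the numbering.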

## Retrieval

Now that indexing has completed, we can load the index and the checkpoint model (which we will need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings index need to be loaded into main memory.

Since we indexed a collection from scratch and the data structures are already loaded in main memory, we reuse the data structures for retrieval.

If indexing was done offline, the following ColBERT factory can be used instead:

```python
pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory("colbert.dnn", "/content", "colbertindex")
```


In [None]:
pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()

Here we can ask [PyTerrier](https://github.com/terrier-org/pyterrier) to search the [ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) index for `'chemical reactions'`, returning the top 10 scored documents.

In [None]:
(colbert_e2e % 10).search("chemical reactions")

## Run an experiment

Let's prepare an experiment. First, let's create a BM25 baseline transformer, loading a BM25 index for the same corpus.

In [None]:
dataset = pt.get_dataset("vaswani")

bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

Finally, let's evaluate ColBERT against the BM25 baseline on the same corpus. We limit the experiment to the first 50 queries.

In [32]:
pt.Experiment(
    [bm25, colbert_e2e],
    dataset.get_topics().head(50),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "mrt"],
    names = ["BM25", "ColBERT"]
)

lookups: 100%|██████████| 1000/1000 [00:00<00:00, 27377.07d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 28739.92d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 28735.39d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 27756.99d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 27629.37d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 29116.60d/s]
lookups: 100%|██████████| 64/64 [00:00<00:00, 17404.88d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 29091.76d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 22719.69d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 19248.76d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 21108.83d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 24005.17d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 28143.08d/s]
lookups: 100%|██████████| 796/796 [00:00<00:00, 29006.91d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 29644.66d/s]
lookups: 100%|██████████| 1000/1000 [00:00<00:00, 29071.79d/s

|   | name | map | recip_rank | mrt |
|---|------|-----|------------|-----|
| 0 | BM25 | 0.338941 | 0.808238 | 24.625619 |
| 1 | ColBERT | 0.332899 | 0.762643 | 839.961611 |
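
To help read the `map` and `recip_rank` columns, here is a minimal sketch of how average precision and reciprocal rank are conventionally computed for a single query. `map` is the mean of the per-query average precision, and `mrt` is the mean response time per query. The function names and the toy ranking are illustrative, and this is not the evaluation code PyTerrier itself runs.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant document,
    divided by the total number of relevant documents for the query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: relevant documents appear at ranks 2 and 4; "d9" is never retrieved.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(average_precision(ranked, relevant), reciprocal_rank(ranked, relevant))
```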
