# CIKM 2021 ColBERT Papers

This notebook demonstrates the use of techniques proposed in our CIKM 2021 papers:

 - [Macdonald21a]: On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval. Craig Macdonald and Nicola Tonellotto. In Proceedings of CIKM 2021. https://arxiv.org/abs/2108.11480 
 - [Tonellotto21]: Query Embedding Pruning for Dense Retrieval. Nicola Tonellotto and Craig Macdonald. In Proceedings of CIKM 2021. https://arxiv.org/abs/2108.10341

## Installation

Installing pyt_colbert also installs PyTerrier. You also need to have [FAISS installed](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md).

In [1]:
!/opt/conda/envs/colbert_cikm2021/bin/pip install --force-reinstall --no-deps git+https://github.com/terrierteam/pyterrier_colbert.git@cikm2021

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git@cikm2021
  Cloning https://github.com/terrierteam/pyterrier_colbert.git (to revision cikm2021) to /tmp/pip-req-build-lufimuww
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-lufimuww
  Running command git checkout -b cikm2021 --track origin/cikm2021
  Switched to a new branch 'cikm2021'
  Branch 'cikm2021' set up to track remote branch 'cikm2021' from 'origin'.
Building wheels for collected packages: pyterrier-colbert
  Building wheel for pyterrier-colbert (setup.py) ... done
  Created wheel for pyterrier-colbert: filename=pyterrier_colbert-0.0.1-py3-none-any.whl size=21094 sha256=e0d16832b0c075205ff59cb32875f1cede099f0e79ba1ad0ff095d3789c4c958
  Stored in directory: /tmp/pip-ephem-wheel-cache-83tdy0gu/wheels/c3/e5/5b/24fce6cf44d216004312001eeeb43826e86b074f7c5222ad90
Successfully built pyterrier-colbert
Installing collected packages: pyterrier-colbert


In [2]:
import pyterrier as pt
pt.init()

PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## Setup

We use an existing index of the MSMARCO v1 Passage corpus, previously built with pyt_colbert. Indexing with pyt_colbert also writes the tokenids file, which the pipelines below need.

In [3]:
from pyterrier_colbert.ranking import ColBERTFactory

factory = ColBERTFactory(
    "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip",
    "/nfs/indices/colbert_passage/","index_name3"
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Sep 30, 09:04:55] #> Loading model checkpoint.
[Sep 30, 09:04:55] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip




[Sep 30, 09:05:04] #> checkpoint['epoch'] = 0
[Sep 30, 09:05:04] #> checkpoint['batch'] = 44500


## Baseline

This is the default ColBERT dense retrieval setting: set retrieval using ANN search over the FAISS index, followed by exact scoring using the large ColBERT index.

In [4]:
e2e = factory.end_to_end()

[Sep 30, 09:05:05] #> Loading the FAISS index from /nfs/indices/colbert_passage/index_name3/ivfpq.faiss ..
[Sep 30, 09:05:33] #> Building the emb2pid mapping..
[Sep 30, 09:06:04] len(self.emb2pid) = 687989391
Loading reranking index, memtype=mem


Loading index shards to memory: 100%|██████████| 24/24 [03:26<00:00,  8.59s/shard]
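
To make the exact scoring stage of the baseline concrete, here is a minimal pure-Python sketch of ColBERT-style late interaction (MaxSim): each query embedding is matched to its most similar document embedding, and the maxima are summed. The function names and the toy 2-dimensional vectors are illustrative assumptions; this is not the code that `factory.end_to_end()` runs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_embs: List[Vector], doc_embs: List[Vector]) -> float:
    """Sum, over query embeddings, of the best similarity to any document embedding."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy embeddings (made-up values, not produced by any real model)
query = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0)]  # matches both query embeddings
doc_b = [(1.0, 0.0), (1.0, 0.0)]  # matches only the first query embedding

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In the real pipeline, the ANN stage first selects the candidate documents, so this exact score only has to be computed for that candidate set.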


## CIKM pipelines

In [5]:
import pyterrier_colbert.pruning

#CIKM 2021 Approximate Scoring paper: only retrieve 200 candidates for exact re-ranking
ann_pipe = (factory.ann_retrieve_score() % 200) >> factory.index_scorer(query_encoded=True)

#CIKM 2021 query embedding pruning paper: only keep the 5 (qep_pipe5) or 9 (qep_pipe9) query tokens with highest ICF
qep_pipe5 = (factory.query_encoder() 
            >> pyterrier_colbert.pruning.query_embedding_pruning(factory, 5) 
            >> factory.set_retrieve(query_encoded=True)
            >> factory.index_scorer(query_encoded=False)
)
qep_pipe9 = (factory.query_encoder() 
            >> pyterrier_colbert.pruning.query_embedding_pruning(factory, 9) 
            >> factory.set_retrieve(query_encoded=True)
            >> factory.index_scorer(query_encoded=False)
)

# a QEP baseline that suppresses [Q], [CLS] and [MASK] tokens in the query
nocls_nomask_noQ = (factory.query_encoder() 
            >> pyterrier_colbert.pruning.query_embedding_pruning_special(Q=True, CLS=True, MASK=True)
            >> factory.set_retrieve(query_encoded=True)
            >> factory.index_scorer(query_encoded=False)
)

[Sep 30, 09:10:01] #> Building the emb2tid mapping..
687989391
Computing collection frequencies
Done
Loading doclens
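
The idea behind the QEP pipelines can be illustrated with a small sketch: rank the query tokens by inverse collection frequency (ICF) and keep only the embeddings of the top k for set retrieval. The helper `prune_by_icf`, the formula ICF = log(N / cf), and the toy token ids and frequencies are all assumptions made for illustration; the actual logic in `pyterrier_colbert.pruning` may differ.

```python
import math
from typing import Dict, List


def prune_by_icf(
    token_ids: List[int],
    coll_freq: Dict[int, int],
    total_tokens: int,
    k: int,
) -> List[int]:
    """Return the positions (in query order) of the k tokens with the highest ICF."""

    def icf(tid: int) -> float:
        # Assumed definition: rarer tokens in the collection get a higher ICF
        return math.log(total_tokens / coll_freq[tid])

    by_icf = sorted(
        range(len(token_ids)),
        key=lambda i: icf(token_ids[i]),
        reverse=True,
    )
    return sorted(by_icf[:k])


# Hypothetical token ids and collection frequencies
query_token_ids = [10, 11, 12, 13]
coll_freq = {10: 1000, 11: 5, 12: 800, 13: 20}

kept = prune_by_icf(query_token_ids, coll_freq, total_tokens=10000, k=2)
print(kept)  # positions of the query embeddings kept for set retrieval
```

Only the embeddings at the kept positions would then be sent to the ANN index, which is what lowers the number of retrieved candidates.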


## Experiment on TREC 2019

In [6]:
from pyterrier.measures import *
pt.Experiment(
    [
        e2e,
        ann_pipe,
        nocls_nomask_noQ,
        qep_pipe5,
        qep_pipe9
    ],
    *pt.get_dataset("msmarco_passage").get_topicsqrels("test-2019"),
    eval_metrics=[RR(rel=2)@100, nDCG@10, nDCG@100, AP(rel=2)@100, NumRet, "mrt", "num_q"],
    names=["ColBERT E2E", "Approx", "NoMASK NoCLS NoQ", "QEP 5 embs", "QEP 9 embs"],
)

| name | RR(rel=2)@100 | nDCG@10 | nDCG@100 | AP(rel=2)@100 | NumRet | num_q | mrt |
|---|---|---|---|---|---|---|---|
| ColBERT E2E | 0.852713 | 0.693407 | 0.602398 | 0.386779 | 309698.0 | 43.0 | 671.341082 |
| Approx | 0.870155 | 0.684195 | 0.534308 | 0.349277 | 8600.0 | 43.0 | 180.961294 |
| NoMASK NoCLS NoQ | 0.853488 | 0.693194 | 0.602279 | 0.38551 | 187493.0 | 43.0 | 414.449386 |
| QEP 5 embs | 0.853488 | 0.695987 | 0.606343 | 0.389748 | 140232.0 | 43.0 | 342.59989 |
| QEP 9 embs | 0.853488 | 0.693194 | 0.602088 | 0.385421 | 209672.0 | 43.0 | 432.3805 |


Observations:
 - All four approaches achieve good effectiveness (e.g. MRR, nDCG@10) while reducing the number of retrieved documents.
 - In particular, Approx retrieves only about 3% of the documents that E2E does (8,600 vs 309,698), while improving MRR (0.85 -> 0.87) with only a very small reduction in nDCG@10 (0.69 -> 0.68).
 - Applying QEP to reduce the 32 query embeddings to just 5 results in no real difference in MRR, nDCG@10, nDCG@100 or even MAP, while reducing the number of retrieved documents by about 55% (140,232 vs 309,698).
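
The reductions in retrieved documents can be recomputed from the NumRet and num_q columns of the results table. The short script below does this; the numbers are copied from the table, while the variable names are our own.

```python
# NumRet values copied from the experiment output above
num_ret = {
    "ColBERT E2E": 309698,
    "Approx": 8600,
    "NoMASK NoCLS NoQ": 187493,
    "QEP 5 embs": 140232,
    "QEP 9 embs": 209672,
}
num_q = 43  # number of TREC 2019 queries in the experiment

baseline = num_ret["ColBERT E2E"]

# Average number of documents retrieved per query
per_query = {name: n / num_q for name, n in num_ret.items()}

# Retrieved documents as a percentage of the E2E baseline
share_of_e2e = {name: 100 * n / baseline for name, n in num_ret.items()}

for name in num_ret:
    print(f"{name}: {per_query[name]:.0f} docs/query, "
          f"{share_of_e2e[name]:.1f}% of E2E")
```

Approx returns exactly 200 documents per query, which matches the `% 200` rank cutoff applied in `ann_pipe`.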
 

## Summary

Both papers propose methods to adapt ColBERT's dense retrieval pipeline to be more efficient without markedly reducing effectiveness. Further results and significance tests are provided in the respective papers.