# Setup

This section installs the libraries required to run the notebook.

## Python packages installation

The following packages are installed to avoid warnings/errors during [PyTerrier](https://github.com/terrier-org/pyterrier) installation. Note that the current release of the [PyTerrier](https://github.com/terrier-org/pyterrier) [ANCE](https://github.com/microsoft/ANCE) plugin works only with the following Python packages:

* `transformers`, version 3.0.2
* `faiss-gpu`, version 1.6.3

> You can safely ignore the message about runtime restart.

In [1]:
!apt install --upgrade libomp-dev

!pip install --upgrade transformers==3.0.2
!pip install --upgrade faiss-gpu==1.6.3

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libomp5
Suggested packages:
  libomp-doc
The following NEW packages will be installed:
  libomp-dev libomp5
0 upgraded, 2 newly installed, 0 to remove and 30 not upgraded.
Need to get 239 kB of archives.
After this operation, 804 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp5 amd64 5.0.1-1 [234 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp-dev amd64 5.0.1-1 [5,088 B]
Fetched 239 kB in 0s (2,585 kB/s)
Selecting previously unselected package libomp5:amd64.
(Reading database ... 160980 files and directories currently installed.)
Preparing to unpack .../libomp5_5.0.1-1_amd64.deb ...
Unpacking libomp5:amd64 (5.0.1-1) ...
Selecting previously unselected package libomp-dev.
Preparing to unpack .../libomp-dev_5.0.1-1_amd64.deb ...
Unpacking libomp-dev (5.0.
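
Since the plugin is tied to exact package versions, it can help to confirm what actually ended up installed. The following is a minimal sketch, not part of the original notebook: the helper name `check_pins` is hypothetical, and it assumes Python 3.8 or later for `importlib.metadata` (on Python 3.7 the `importlib_metadata` backport would be needed instead).

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins taken from the list above.
PINS = {"transformers": "3.0.2", "faiss-gpu": "1.6.3"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None, matches)} (hypothetical helper)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (expected, installed, installed == expected)
    return report


for name, (expected, installed, ok) in check_pins(PINS).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A `MISMATCH` line suggests re-running the installation cell before continuing.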

## PyTerrier installation

The following cell installs the latest version of the [PyTerrier](https://github.com/terrier-org/pyterrier) package directly from its GitHub repository.

In [2]:
!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-df2vjnkv/python-terrier
  Running command git clone -q https://github.com/terrier-org/pyterrier.git /tmp/pip-install-df2vjnkv/python-terrier
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval>=0.5
  Downloading https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c22745aa77e0799bba471c0a3a19/pytrec_eval-0.5.tar.gz
Collecting tqdm>=4.57.0
  Downloading https://files.pythonhosted.org/packages/f8/3e/2730d0effc282960dbff3cf91599ad0d8f3faedc8e75720fdf224b31ab24/tqdm-4.59.0-py2.py3-none-any.whl (74kB)
     |████████████████████████████████| 81kB 6.2MB/s
Collecting pyjnius~=1.3.0
  Downloading https://files.pythonhosted.org/packages/ea/b1/e33db12a20efe28b20fbcf4efc9b95a934954587cd7aa5998987a22e8885/pyjnius-1.3.0-cp37-cp37m-many

## PyTerrier plugins installation

We install the official version of the [PyTerrier](https://github.com/terrier-org/pyterrier) [ColBERT](https://github.com/stanford-futuredata/ColBERT) plugin. You can safely ignore the package versioning errors.

In [3]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

Collecting git+https://github.com/terrierteam/pyterrier_colbert.git
  Cloning https://github.com/terrierteam/pyterrier_colbert.git to /tmp/pip-req-build-s3lufjq7
  Running command git clone -q https://github.com/terrierteam/pyterrier_colbert.git /tmp/pip-req-build-s3lufjq7
Collecting ColBERT@ git+https://github.com/cmacdonald/ColBERT.git@v0.2#egg=ColBERT
  Cloning https://github.com/cmacdonald/ColBERT.git (to revision v0.2) to /tmp/pip-install-hddebk7z/ColBERT
  Running command git clone -q https://github.com/cmacdonald/ColBERT.git /tmp/pip-install-hddebk7z/ColBERT
  Running command git checkout -b v0.2 --track origin/v0.2
  Switched to a new branch 'v0.2'
  Branch 'v0.2' set up to track remote branch 'v0.2' from 'origin'.
Collecting ujson
  Downloading https://files.pythonhosted.org/packages/17/4e/50e8e4cf5f00b537095711c2c86ac4d7191aed2b4fffd5a19f06898f6929/ujson-4.0.2-cp37-cp37m-manylinux1_x86_64.whl (179kB)
     |████████████████████████████████| 184kB 9.7MB/s
Collec

## Trained model download

The following cell downloads the [ColBERT](https://github.com/stanford-futuredata/ColBERT) model checkpoint. The archive is about 1.1 GB, so the download time can vary.

In [4]:
import os
if not os.path.exists("colbert_model_checkpoint.zip"):
    !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip
    !unzip -j colbert_model_checkpoint.zip -d colbert_model_checkpoint

--2021-03-23 13:31:08--  http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip
Resolving www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)... 130.209.240.1
Connecting to www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)|130.209.240.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1187618437 (1.1G) [application/zip]
Saving to: ‘colbert_model_checkpoint.zip’


2021-03-23 13:31:33 (45.8 MB/s) - ‘colbert_model_checkpoint.zip’ saved [1187618437/1187618437]

Archive:  colbert_model_checkpoint.zip
  inflating: colbert_model_checkpoint/colbert.dnn  
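
The download cell skips `wget` whenever the zip file already exists, even if an earlier download was interrupted. A minimal sketch of a size check is shown below. It is not part of the original notebook: the helper `download_is_complete` is hypothetical, and the expected size is simply the `Length` value that `wget` printed in the log above, which would change if the hosted file is ever updated.

```python
import os

# Size in bytes reported by wget in the log above ("Length: 1187618437").
EXPECTED_SIZE = 1187618437


def download_is_complete(path, expected_size=EXPECTED_SIZE):
    """Return True if `path` exists and has exactly `expected_size` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size


if not download_is_complete("colbert_model_checkpoint.zip"):
    print("Checkpoint archive is missing or incomplete; consider re-running the download.")
```

If the check fails, deleting the partial zip file lets the download cell fetch it again.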


# Preliminary steps

## [PyTerrier](https://github.com/terrier-org/pyterrier) initialization

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [5]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm='notebook')

terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.4.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


## [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) Dataset download

The following cell downloads the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) dataset that we will use in the remainder of the tutorial.

We limit the queries to just 50 topics to avoid RAM issues with ColBERT on Colab. ColBERT is **very** memory-hungry.

In [6]:
dataset = pt.get_dataset("vaswani")
topics = dataset.get_topics().head(50)
qrels = dataset.get_qrels()

index = dataset.get_index()

Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec


query-text.trec:   0%|          | 0.00/3.05k [00:00<?, ?iB/s]

Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels


qrels:   0%|          | 0.00/6.63k [00:00<?, ?iB/s]

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index


data.direct.bf:   0%|          | 0.00/327k [00:00<?, ?iB/s]

data.document.fsarrayfile:   0%|          | 0.00/190k [00:00<?, ?iB/s]

data.inverted.bf:   0%|          | 0.00/301k [00:00<?, ?iB/s]

data.lexicon.fsomapfile:   0%|          | 0.00/651k [00:00<?, ?iB/s]

data.lexicon.fsomaphash:   0%|          | 0.00/777 [00:00<?, ?iB/s]

data.lexicon.fsomapid:   0%|          | 0.00/30.3k [00:00<?, ?iB/s]

data.meta.idx:   0%|          | 0.00/89.3k [00:00<?, ?iB/s]

data.meta.zdata:   0%|          | 0.00/168k [00:00<?, ?iB/s]

data.properties:   0%|          | 0.00/882 [00:00<?, ?iB/s]

data.meta-0.fsomapfile:   0%|          | 0.00/725k [00:00<?, ?iB/s]

# [ColBERT](https://github.com/stanford-futuredata/ColBERT) Indexing

We are going to index the [Vaswani](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/) collection with [ColBERT](https://github.com/stanford-futuredata/ColBERT).

The construction of this index takes some time. The following code:
* downloads some additional BERT models;
* processes the whole collection to compute the document embeddings, with at most 180 embeddings per document;
* performs the *training* of the IVFPQ [FAISS](https://github.com/facebookresearch/faiss) index supporting approximate nearest neighbour search.
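
To give an intuition for the last step, here is a toy, pure-Python sketch of the inverted-file (IVF) idea behind approximate nearest neighbour search. It is an illustration under simplifying assumptions, not how FAISS is implemented: it does not use FAISS, it omits the product quantisation (PQ) part entirely, the 2-D vectors are made up, and the centroids are hand-picked, whereas a real index learns them during training. All function names are hypothetical.

```python
import math


def closest(points, target, n=1):
    """Indices of the n points nearest to target (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], target))
    return order[:n]


def build_inverted_lists(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        lists[closest(centroids, vec)[0]].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1, k=2):
    """Scan only the nprobe lists whose centroids are nearest to the query."""
    candidates = [vid for c in closest(centroids, query, nprobe) for vid in lists[c]]
    return sorted(candidates, key=lambda vid: math.dist(vectors[vid], query))[:k]


# Made-up 2-D "embeddings" forming two clusters, with hand-picked centroids.
vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centroids = [(0.1, 0.1), (5.0, 5.0)]
lists = build_inverted_lists(vectors, centroids)

print(lists)                                               # [[0, 1, 2], [3, 4, 5]]
print(ivf_search((5.0, 5.0), vectors, centroids, lists))   # [3, 5]
```

With `nprobe=1` only three of the six vectors are compared with the query. That saving is the point of the technique, at the cost of possibly missing a neighbour that was assigned to a list that is not probed.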

For 11,429 documents, the code computes 581,496 document embeddings, ~50.9 embeddings per document, in approximately 5 minutes.
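
The per-document average can be reproduced with a few lines of arithmetic. This is a small sketch using only the figures quoted in this section; the variable names are illustrative.

```python
num_documents = 11429          # documents in the Vaswani collection
num_embeddings = 581496        # embeddings stored in the ColBERT index
max_embeddings_per_doc = 180   # per-document cap mentioned above

average = num_embeddings / num_documents
upper_bound = num_documents * max_embeddings_per_doc

print(f"{average:.2f} embeddings per document on average")   # 50.88
print(f"{upper_bound} embeddings if every document hit the cap")
print(f"{num_embeddings / upper_bound:.1%} of that upper bound is actually used")
```

The upper bound, 2,057,220, equals the `total tokens` figure printed in the indexing log below, which presumably counts padded positions rather than stored embeddings.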

In [7]:
!rm -rf /content/colbert_index

import pyterrier_colbert.indexing

colbert_indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint="/content/colbert_model_checkpoint/colbert.dnn", 
                                                            index_root="/content",
                                                            index_name="colbert_index",
                                                            chunksize=3)
colbert_indexer.index(dataset.get_corpus_iter())

Downloading vaswani corpus to /root/.pyterrier/corpora/vaswani/corpus


doc-text.trec:   0%|          | 0.00/0.99M [00:00<?, ?iB/s]

[Mar 23, 13:32:02] [0] 		 #> Local args.bsize = 128
[Mar 23, 13:32:02] [0] 		 #> args.index_root = /content
[Mar 23, 13:32:02] [0] 		 #> self.possible_subset_sizes = [69905]


Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[Mar 23, 13:32:29] #> Loading model checkpoint.
[Mar 23, 13:32:29] #> Loading checkpoint /content/colbert_model_checkpoint/colbert.dnn
[Mar 23, 13:32:30] #> checkpoint['epoch'] = 0
[Mar 23, 13:32:30] #> checkpoint['batch'] = 44500




Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]



[Mar 23, 13:32:31] #> Note: Output directory /content already exists




[Mar 23, 13:32:31] #> Creating directory /content/colbert_index 


docFromText on 11429 documents
tokens doc 0: 180
total tokens 2057220
lenD = 11429 
11429/content/colbert_index/0.pt

[Mar 23, 13:36:40] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 2.8k (overall),  2.8k (this encoding),  933.5M (this saving)
[Mar 23, 13:36:40] [0] 		 [NOTE] Done with local share.
[Mar 23, 13:36:40] [0] 		 #> Joining saver thread.
[Mar 23, 13:36:41] [0] 		 #> Saved batch #0 to /content/colbert_index/0.pt 		 Saving Throughput = 1.6M passages per minute.

#> num_embeddings = 581496
[Mar 23, 13:36:41] #> Starting..
[Mar 23, 13:36:41] #> Processing slice #1 of 1 (range 0..1).
[Mar 23, 13:36:41] #> Will write to /content/colbert_index/ivfpq.100.faiss.
[Mar 23, 13:36:41] #> Loading /content/colbert_index/0.sample ...
#> Sample has shape (29074, 128)
[Mar 23, 13:36:41] Preparing resources for 1 GPUs.
[Mar 23, 13

# Retrieval experiments

Now that downloading and indexing have completed, we can load the index and the learned model (which we need for encoding queries). Index loading can take some time, as both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings index need to be loaded into main memory.

Let's prepare an experiment. First, let's create a BM25 baseline transformer and the [ColBERT](https://github.com/stanford-futuredata/ColBERT) retrieval transformer.

In [8]:
bm25_retriever = pt.BatchRetrieve(index, wmodel="BM25")

colbert_retriever = colbert_indexer.ranking_factory().end_to_end()

[Mar 23, 13:38:02] #> Loading the FAISS index from /content/colbert_index/ivfpq.100.faiss ..
[Mar 23, 13:38:02] #> Building the emb2pid mapping..
[Mar 23, 13:38:02] len(self.emb2pid) = 581496
Loading reranking index, memtype=mem


Loading index shards to memory:   0%|          | 0/1 [00:00<?, ?shard/s]

Now we are ready to run the experiment. We retrieve the top 10 ranked documents for each of the 50 selected topics and compute several effectiveness metrics, along with the mean response time (`mrt`).

In [9]:
pt.Experiment(
    [bm25_retriever % 10, colbert_retriever % 10], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_10", "mrt"],
    names=['BM25', 'ColBERT'],
)

| | name | map | recip_rank | P_5 | P_10 | P_15 | P_20 | P_30 | P_100 | P_200 | P_500 | P_1000 | ndcg_cut_5 | ndcg_cut_10 | ndcg_cut_15 | ndcg_cut_20 | ndcg_cut_30 | ndcg_cut_100 | ndcg_cut_200 | ndcg_cut_500 | ndcg_cut_1000 | mrt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.200356 | 0.806667 | 0.512 | 0.392 | 0.261333 | 0.196 | 0.130667 | 0.0392 | 0.0196 | 0.00784 | 0.00392 | 0.571956 | 0.505645 | 0.433819 | 0.396179 | 0.361263 | 0.339566 | 0.339566 | 0.339566 | 0.339566 | 61.353643 |
| 1 | ColBERT | 0.192729 | 0.757857 | 0.488 | 0.412 | 0.274667 | 0.206 | 0.137333 | 0.0412 | 0.0206 | 0.00824 | 0.00412 | 0.532899 | 0.496438 | 0.423486 | 0.385484 | 0.34967 | 0.327262 | 0.327262 | 0.327262 | 0.327262 | 1185.926183 |
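
To help read these numbers, the sketch below computes precision at k and reciprocal rank for a single made-up query. It is a toy illustration of what the metrics measure, not the implementation used by PyTerrier or `pytrec_eval`; the function names, document ids, and relevance judgements are hypothetical. It also shows why, with rankings cut at 10 documents, the `P_k` values for k > 10 in the results above are simply `P_10` scaled by 10/k.

```python
import math


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k positions holding a relevant document.

    Positions beyond the end of a short ranking count as non-relevant.
    """
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Made-up ranking cut at 10 documents (as with `% 10`) and made-up judgements.
ranking = [f"d{i}" for i in range(1, 11)]
relevant = {"d2", "d3", "d7", "d42"}

p10 = precision_at_k(ranking, relevant, 10)   # 3 relevant in 10 positions
p20 = precision_at_k(ranking, relevant, 20)   # same 3 relevant, 20 positions
rr = reciprocal_rank(ranking, relevant)       # first relevant document at rank 2

print(p10, p20, rr)

# No documents are retrieved past rank 10, so P@20 is P@10 scaled by 10/20.
assert math.isclose(p20, p10 * 10 / 20)
# The same relation holds for the averaged BM25 values reported above.
assert math.isclose(0.392 * 10 / 20, 0.196, abs_tol=1e-6)
```

For the same reason, the `ndcg_cut` values stop changing beyond a cutoff of 100 in both rows: a deeper cutoff cannot add gain when nothing is retrieved past rank 10.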
