# Setup

This section installs the libraries required to run the notebook.

## Python packages installation

We install the following packages first to avoid warnings and errors during the [PyTerrier](https://github.com/terrier-org/pyterrier) installation. Note that the current release of the [PyTerrier](https://github.com/terrier-org/pyterrier) [ANCE](https://github.com/microsoft/ANCE) plugin works only with these package versions:

* `transformers`, version 3.0.2
* `faiss-gpu`, version 1.6.3

> You can safely ignore the message about runtime restart.

In [1]:
!apt install --upgrade libomp-dev

!pip install --upgrade transformers==3.0.2
!pip install --upgrade faiss-gpu==1.6.3

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libomp5
Suggested packages:
  libomp-doc
The following NEW packages will be installed:
  libomp-dev libomp5
0 upgraded, 2 newly installed, 0 to remove and 30 not upgraded.
Need to get 239 kB of archives.
After this operation, 804 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp5 amd64 5.0.1-1 [234 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp-dev amd64 5.0.1-1 [5,088 B]
Fetched 239 kB in 1s (232 kB/s)
Selecting previously unselected package libomp5:amd64.
(Reading database ... 160980 files and directories currently installed.)
Preparing to unpack .../libomp5_5.0.1-1_amd64.deb ...
Unpacking libomp5:amd64 (5.0.1-1) ...
Selecting previously unselected package libomp-dev.
Preparing to unpack .../libomp-dev_5.0.1-1_amd64.deb ...
Unpacking libomp-dev (5.0.1-
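
Once the installation cell has finished, we can double-check that the pinned versions are the ones actually in use. The following is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library); the `check_pins` helper and the `PINNED` dictionary are names introduced here for illustration only.

```python
from importlib import metadata  # standard library from Python 3.8 onwards

# Versions pinned by this notebook (taken from the list above).
PINNED = {"transformers": "3.0.2", "faiss-gpu": "1.6.3"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    `get_version` is injectable so the helper can be tried without the packages.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


problems = check_pins(PINNED)
for name, (expected, found) in problems.items():
    print(f"{name}: expected {expected}, found {found}")
if not problems:
    print("All pinned packages match.")
```

An empty result means both pins are satisfied. Later installation cells may change these versions again, so the check is only meaningful right after this cell.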

## PyTerrier installation

The following cell installs the latest version of the [PyTerrier](https://github.com/terrier-org/pyterrier) package directly from its GitHub repository.

In [2]:
!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-mdw0479v/python-terrier
  Running command git clone -q https://github.com/terrier-org/pyterrier.git /tmp/pip-install-mdw0479v/python-terrier
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval>=0.5
  Downloading https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c22745aa77e0799bba471c0a3a19/pytrec_eval-0.5.tar.gz
Collecting tqdm>=4.57.0
  Downloading https://files.pythonhosted.org/packages/f8/3e/2730d0effc282960dbff3cf91599ad0d8f3faedc8e75720fdf224b31ab24/tqdm-4.59.0-py2.py3-none-any.whl (74kB)
     |████████████████████████████████| 81kB 5.5MB/s 
Collecting pyjnius~=1.3.0
  Downloading https://files.pythonhosted.org/packages/ea/b1/e33db12a20efe28b20fbcf4efc9b95a934954587cd7aa5998987a22e8885/pyjnius-1.3.0-cp37-cp37m-many

## PyTerrier plugins installation

We install the official version of the [PyTerrier](https://github.com/terrier-org/pyterrier) [ANCE](https://github.com/microsoft/ANCE) plugin. You can safely ignore the package versioning errors.

In [3]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

Collecting git+https://github.com/terrierteam/pyterrier_ance.git
  Cloning https://github.com/terrierteam/pyterrier_ance.git to /tmp/pip-req-build-7_8t5zpb
  Running command git clone -q https://github.com/terrierteam/pyterrier_ance.git /tmp/pip-req-build-7_8t5zpb
Collecting ANCE@ git+https://github.com/cmacdonald/ANCE.git
  Cloning https://github.com/cmacdonald/ANCE.git to /tmp/pip-install-v_takh_s/ANCE
  Running command git clone -q https://github.com/cmacdonald/ANCE.git /tmp/pip-install-v_takh_s/ANCE
Collecting transformers==2.3.0
  Downloading https://files.pythonhosted.org/packages/50/10/aeefced99c8a59d828a92cc11d213e2743212d3641c87c82d61b035a7d5c/transformers-2.3.0-py3-none-any.whl (447kB)
     |████████████████████████████████| 450kB 9.4MB/s 
Collecting boto3
  Downloading https://files.pythonhosted.org/packages/dd/14/6615878e5e9fcb116570ec438a241716614bfc83112430827ad26b1f5db0/boto3-1.17.34-py2.py3-none-any.whl (131kB)
     |████████████████████████████████| 1

## Trained model download

The following cell downloads the [ANCE](https://github.com/microsoft/ANCE) model checkpoint (about 1.2 GB) and unpacks it into the `ance_model_checkpoint` directory. Download time can vary.

In [4]:
import os
if not os.path.exists("ance_model_checkpoint.zip"):
    !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/ance_model_checkpoint.zip
    !unzip -j ance_model_checkpoint.zip -d ance_model_checkpoint

--2021-03-23 12:12:02--  http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/ance_model_checkpoint.zip
Resolving www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)... 130.209.240.1
Connecting to www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)|130.209.240.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1277112820 (1.2G) [application/zip]
Saving to: ‘ance_model_checkpoint.zip’


2021-03-23 12:13:14 (17.1 MB/s) - ‘ance_model_checkpoint.zip’ saved [1277112820/1277112820]

Archive:  ance_model_checkpoint.zip
  inflating: ance_model_checkpoint/config.json  
  inflating: ance_model_checkpoint/desktop.ini  
  inflating: ance_model_checkpoint/merges.txt  
  inflating: ance_model_checkpoint/optimizer.pt  
  inflating: ance_model_checkpoint/pytorch_model.bin  
  inflating: ance_model_checkpoint/scheduler.pt  
  inflating: ance_model_checkpoint/special_tokens_map.json  
  inflating: ance_model_checkpoint/tokenizer_config.json  
  inflating: ance_model_checkpoint/training_args.bin  
  inflating:
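
The download cell only tests whether the zip file exists, so an interrupted extraction would go unnoticed on a re-run. As a hedged sketch, the helper below mimics `unzip -j` (extract every file into one flat directory) with the standard library and then reports any expected file that is missing. The function names `extract_flat` and `missing_files` are introduced here for illustration, and skipping `._*` files is an extra choice of this sketch, not something `unzip -j` does.

```python
import os
import shutil
import zipfile


def extract_flat(zip_path, target_dir, skip_prefixes=("._",)):
    """Extract all files of an archive directly into target_dir.

    Directory structure inside the archive is dropped, like `unzip -j`.
    Files whose name starts with one of skip_prefixes are ignored
    (an assumption of this sketch, meant for macOS metadata files).
    Returns the list of file names written.
    """
    os.makedirs(target_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = os.path.basename(info.filename)
            if not name or name.startswith(skip_prefixes):
                continue  # directory entry or skipped metadata file
            with archive.open(info) as src, \
                    open(os.path.join(target_dir, name), "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(name)
    return written


def missing_files(target_dir, expected):
    """Return the expected file names that are not present in target_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(target_dir, name))]
```

For the checkpoint above, `missing_files("ance_model_checkpoint", ["config.json", "pytorch_model.bin"])` should return an empty list after a successful extraction; both file names appear in the `unzip` output.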

# Preliminary steps

## [PyTerrier](https://github.com/terrier-org/pyterrier) initialization

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [5]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm='notebook')

terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.4.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


## [TREC-COVID19](https://ir.nist.gov/covidSubmit/) Dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use in the remainder of the tutorial.

In [6]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='title')
qrels = dataset.get_qrels()

[INFO] If you have a local copy of https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/0307a37b6b9f1a5f233340a769d538ea
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [12.6MB/s]
[INFO] If you have a local copy of https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/8138424a59daea0aba751c8a891e5f54
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:00] [1.14MB] [1.19MB/s]
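
Both `topics` and `qrels` are pandas DataFrames. In PyTerrier's data model, topics carry `qid` and `query` columns, and qrels carry `qid`, `docno` and `label` columns. To make the structure of the relevance judgments concrete, the sketch below groups judgment rows into a nested `{qid: {docno: label}}` dictionary. The `qrels_to_dict` helper is hypothetical and the sample rows are made up; they are not taken from TREC-COVID.

```python
from collections import defaultdict


def qrels_to_dict(rows):
    """Group relevance judgments as {qid: {docno: label}}.

    rows: iterable of dicts with 'qid', 'docno' and 'label' keys.
    """
    grouped = defaultdict(dict)
    for row in rows:
        grouped[str(row["qid"])][row["docno"]] = int(row["label"])
    return dict(grouped)


# Made-up judgments, for illustration only.
sample_rows = [
    {"qid": "1", "docno": "doc_a", "label": 2},
    {"qid": "1", "docno": "doc_b", "label": 0},
    {"qid": "2", "docno": "doc_a", "label": 1},
]
sample_qrels = qrels_to_dict(sample_rows)
print(sample_qrels)
```

With the real data, the same helper could be fed with `qrels.to_dict("records")`.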


## [Terrier](http://terrier.org) Inverted Index Download

We are going to download a pre-built [Terrier](http://terrier.org) inverted index for the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) collection.
Download time can vary.

Building the inverted index yourself takes a few minutes. For reference, the code to do so is the following:

```python
import os

cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
    # create the index, using the IterDictIndexer indexer 
    indexer = pt.index.IterDictIndexer(pt_index_path)

    # we give the dataset get_corpus_iter() directly to the indexer
    # while specifying the fields to index and the metadata to record
    index_ref = indexer.index(cord19.get_corpus_iter(), 
                              fields=('abstract',), 
                              meta=('docno',))

else:
    # if you already have the index, use it.
    index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")

index = pt.IndexFactory.of(index_ref)
```

In [7]:
import os

if not os.path.exists("terrier_index.zip"):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/terrier_index.zip
  !unzip -j terrier_index.zip -d terrier_index

index_ref = pt.IndexRef.of("./terrier_index/data.properties")
index = pt.IndexFactory.of(index_ref)

--2021-03-23 12:13:33--  http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/terrier_index.zip
Resolving www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)... 130.209.240.1
Connecting to www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)|130.209.240.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42017186 (40M) [application/zip]
Saving to: ‘terrier_index.zip’


2021-03-23 12:13:36 (13.2 MB/s) - ‘terrier_index.zip’ saved [42017186/42017186]

Archive:  terrier_index.zip
  inflating: terrier_index/data.lexicon.fsomapfile  
  inflating: terrier_index/data.properties  
  inflating: terrier_index/data.lexicon.fsomaphash  
  inflating: terrier_index/data.lexicon.fsomapid  
  inflating: terrier_index/data.meta.idx  
  inflating: terrier_index/data.direct.bf  
  inflating: terrier_index/data.inverted.bf  
  inflating: terrier_index/data.meta.zdata  
  inflating: terrier_index/data.document.fsarrayfile  


## [ANCE](https://github.com/microsoft/ANCE) Dense Index Download

We are going to download a pre-built [ANCE](https://github.com/microsoft/ANCE) [FAISS](https://github.com/facebookresearch/faiss) index for the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) collection. Download time can vary.

Building this index yourself takes some time. For reference, the code to do so is the following:

```python
!rm -rf /content/anceindex

import pyterrier_ance

ance_indexer = pyterrier_ance.ANCEIndexer(
    checkpoint_path="./ance_model_checkpoint",
    index_path="/content/anceindex",
    num_docs=192509,
    text_attr='abstract' # COVID
)

ance_indexer.index(dataset.get_corpus_iter())
```

In [8]:
import os

if not os.path.exists("ance_index.zip"):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/ance_index.zip
  !unzip -j ance_index.zip -d ance_index

--2021-03-23 12:13:38--  http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/ance_index.zip
Resolving www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)... 130.209.240.1
Connecting to www.dcs.gla.ac.uk (www.dcs.gla.ac.uk)|130.209.240.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 389973243 (372M) [application/zip]
Saving to: ‘ance_index.zip’


2021-03-23 12:13:57 (19.4 MB/s) - ‘ance_index.zip’ saved [389973243/389973243]

Archive:  ance_index.zip
  inflating: ance_index/shards.pkl   
  inflating: ance_index/._shards.pkl  
  inflating: ance_index/0.faiss      
  inflating: ance_index/._0.faiss    
  inflating: ance_index/0.docids.pkl  
  inflating: ance_index/._0.docids.pkl  
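
Conceptually, a dense index stores one embedding vector per document, and retrieval ranks documents by their similarity to the query embedding; for ANCE this similarity is the dot product. The sketch below shows the idea with an exhaustive search in plain Python. It is only an illustration: the three-dimensional vectors are made up, `dense_search` is a hypothetical helper, and [FAISS](https://github.com/facebookresearch/faiss) performs this kind of search far more efficiently.

```python
import heapq


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_search(query_vec, doc_vecs, k=10):
    """Exhaustive inner-product search: score every document, keep the k best.

    doc_vecs: {docno: vector}. Returns [(docno, score)] by decreasing score.
    """
    scored = ((dot(query_vec, vec), docno) for docno, vec in doc_vecs.items())
    top = heapq.nlargest(k, scored)
    return [(docno, score) for score, docno in top]


# Made-up embeddings, for illustration only.
toy_doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.0, 0.2, 0.9],
    "d3": [0.5, 0.5, 0.1],
}
toy_query_vec = [1.0, 0.2, 0.0]
print(dense_search(toy_query_vec, toy_doc_vecs, k=2))
```

Unlike the inverted index, every document gets a score here, even if it shares no word with the query.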


# Retrieval experiments

Now that the downloads have completed, we can load the index and the trained model, which we need to encode the queries. Loading can take some time, because both the [FAISS](https://github.com/facebookresearch/faiss) index and the document embeddings index must be read into main memory.

Let's prepare an experiment. First, we create a BM25 baseline transformer and the [ANCE](https://github.com/microsoft/ANCE) retrieval transformer.

In [9]:
bm25_retriever = pt.BatchRetrieve(index, wmodel="BM25")

import pyterrier_ance

ance_retriever = pyterrier_ance.ANCERetrieval(checkpoint_path="/content/ance_model_checkpoint",
                                              index_path="/content/ance_index")

[INFO] Loading faiss with AVX2 support.
[INFO] Loading faiss.
[INFO] PyTorch version 1.8.0+cu101 available.
[INFO] TensorFlow version 2.4.1 available.
[INFO] loading configuration file /content/ance_model_checkpoint/config.json
[INFO] Model config {
  "_num_labels": 2,
  "architectures": [
    "RobertaDot_NLL_LN"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bad_words_ids": null,
  "bos_token_id": 0,
  "decoder_start_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": 2,
  "eos_token_ids": 0,
  "finetuning_task": "MSMarco",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 514,
  "min_length": 0,
  "model

Loading model
Using mean: False


[INFO] loading weights file /content/ance_model_checkpoint/pytorch_model.bin


Loading shard metadata


[INFO] Inference parameters <pyterrier_ance. object at 0x7f760c140890>


Loading shards:   0%|          | 0/1 [00:00<?, ?shard/s]

Now we are ready to run the experiments. We are going to retrieve the top 10 ranked documents for the official topics, and compute several effectiveness metrics. 
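
The `% 10` used in the next cell is PyTerrier's rank cutoff operator, which keeps only the top 10 results of each query. As a rough, library-free sketch of that behaviour (the `rank_cutoff` helper and the sample rows are made up for illustration and are not part of PyTerrier):

```python
from collections import defaultdict


def rank_cutoff(results, k):
    """Keep the k highest-scoring rows of each query.

    results: iterable of (qid, docno, score) tuples in any order.
    """
    by_query = defaultdict(list)
    for qid, docno, score in results:
        by_query[qid].append((docno, score))
    trimmed = []
    for qid, rows in by_query.items():
        rows.sort(key=lambda row: row[1], reverse=True)
        trimmed.extend((qid, docno, score) for docno, score in rows[:k])
    return trimmed


# Made-up results, for illustration only.
toy_results = [
    ("q1", "d1", 2.0), ("q1", "d2", 5.0), ("q1", "d3", 3.0),
    ("q2", "d1", 1.0),
]
print(rank_cutoff(toy_results, k=2))
```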

In [10]:
pt.Experiment(
    [bm25_retriever % 10, ance_retriever % 10], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_10", "mrt"],
    names=['BM25', 'ANCE'],
)

[INFO] NumExpr defaulting to 2 threads.
[INFO] ***** Running ANN Embedding Inference *****
[INFO]   Batch size = 128
Inferencing: 0it [00:00, ?it/s]

***** inference of 50 queries *****
Not running in distributed mode


Inferencing: 1it [00:00,  2.31it/s]

***** faiss search for 50 queries on 1 shards *****



[INFO] merging embeddings


  0%|          | 0/1 [00:00<?, ?shard/s]

| | name | map | recip_rank | P_5 | P_10 | P_15 | P_20 | P_30 | P_100 | P_200 | P_500 | P_1000 | ndcg_cut_5 | ndcg_cut_10 | ndcg_cut_15 | ndcg_cut_20 | ndcg_cut_30 | ndcg_cut_100 | ndcg_cut_200 | ndcg_cut_500 | ndcg_cut_1000 | mrt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.013127 | 0.808246 | 0.7 | 0.65 | 0.433333 | 0.325 | 0.216667 | 0.065 | 0.0325 | 0.013 | 0.0065 | 0.611724 | 0.582446 | 0.451496 | 0.375891 | 0.288856 | 0.127812 | 0.080397 | 0.053294 | 0.049722 | 55.097956 |
| 1 | ANCE | 0.006939 | 0.614 | 0.464 | 0.404 | 0.269333 | 0.202 | 0.134667 | 0.0404 | 0.0202 | 0.00808 | 0.00404 | 0.428917 | 0.392817 | 0.304502 | 0.253512 | 0.194812 | 0.085517 | 0.052678 | 0.033406 | 0.030891 | 22.818893 |
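
To make the reported measures more concrete, the sketch below computes precision at k, reciprocal rank and average precision for a single query with binary relevance. These are simplified, illustrative versions, not the code PyTerrier runs, and the ranking and judgments are made up. Note that average precision divides by the total number of relevant documents of the query, including those never retrieved; with rankings cut at 10 this keeps the value small for queries with many relevant documents, which probably explains the low `map` values above.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions holding a relevant document."""
    return sum(1 for docno in ranked[:k] if docno in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each relevant retrieved document,
    divided by the number of ALL relevant documents for the query."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Made-up ranking and judgments, for illustration only.
toy_ranked = ["d1", "d2", "d3", "d4"]
toy_relevant = {"d1", "d3", "d9", "d10"}  # d9 and d10 are never retrieved
print(precision_at_k(toy_ranked, toy_relevant, 4))
print(reciprocal_rank(toy_ranked, toy_relevant))
print(average_precision(toy_ranked, toy_relevant))
```

Averaging these per-query values over all topics gives the figures reported per system, with `map` being the mean of the average precision values.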
