# PyTerrier_ANCE Demo Notebook - Vaswani

This notebook demonstrates the use of PyTerrier_ANCE for dense passage retrieval. The corpus used is the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a collection of roughly 11,000 scientific abstracts with corresponding queries and relevance assessments.

## Installation 

First, we need to install PyTerrier.

In [1]:
!pip install python-terrier



ANCE requires FAISS - this is the setup for FAISS on Colab. YMMV outside of Colab.

In [2]:
!apt install libomp-dev
!pip install faiss

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libomp-dev is already the newest version (5.0.1-1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


This installs the [PyTerrier plugin for ANCE](https://github.com/terrierteam/pyterrier_ance). It supplies an indexer and a retrieval transformer. This also installs ANCE.

In [3]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

Collecting git+https://github.com/terrierteam/pyterrier_ance.git
  Cloning https://github.com/terrierteam/pyterrier_ance.git to /tmp/pip-req-build-734amq6r
  Running command git clone -q https://github.com/terrierteam/pyterrier_ance.git /tmp/pip-req-build-734amq6r
Building wheels for collected packages: pyterrier-ance
  Building wheel for pyterrier-ance (setup.py) ... done
  Created wheel for pyterrier-ance: filename=pyterrier_ance-0.0.1-cp37-none-any.whl size=4539 sha256=79461f7ed719fc04a2d9f3f57abd3593b2e24f1df1c13ca75ba0df6fcdac9308
  Stored in directory: /tmp/pip-ephem-wheel-cache-0w9tgies/wheels/26/dd/43/8ce2f9be56bd68ec751d264b83d1df7e9944b675efaf81aa2b
Successfully built pyterrier-ance
Installing collected packages: pyterrier-ance
  Found existing installation: pyterrier-ance 0.0.1
    Uninstalling pyterrier-ance-0.0.1:
      Successfully uninstalled pyterrier-ance-0.0.1
Successfully installed pyterrier-ance-0.0.1


## Setup

Let's get PyTerrier started.

In [4]:
import pyterrier as pt
pt.init(tqdm='notebook')

PyTerrier 0.4.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


We're using the Vaswani dataset, which provides the corpus along with the topics and qrels we will need later.

In [5]:
dataset = pt.get_dataset("irds:vaswani")

This downloads the model checkpoint listed on the [ANCE GitHub repository](https://github.com/microsoft/ANCE/#results), skipping the download if the zip file is already present.

In [6]:
import os
if not os.path.exists("Passage_ANCE_FirstP_Checkpoint.zip"):
  !wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
  !unzip Passage_ANCE_FirstP_Checkpoint.zip

## Indexing

This indexes the Vaswani corpus. Indexing takes about 2 minutes using a Colab GPU.

In [8]:
!rm -rf /content/anceindex

import pyterrier_ance
indexer = pyterrier_ance.ANCEIndexer("/content/Passage ANCE(FirstP) Checkpoint", "/content/anceindex", num_docs=12000)
indexer.index(dataset.get_corpus_iter())

Using mean: False


HBox(children=(FloatProgress(value=0.0, description='Indexing', max=12000.0, style=ProgressStyle(description_w…

HBox(children=(FloatProgress(value=0.0, description='vaswani documents', max=11429.0, style=ProgressStyle(desc…

Segment 0



Not running in distributed mode





'/content/anceindex'

In [9]:
del indexer
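
The indexer consumes an iterator of documents rather than a list held in memory. As a rough illustration, assuming each document is a dict with `docno` and `text` keys (that shape is an assumption here, since the notebook never prints a document), a custom corpus could be streamed from a generator in the same way. The function `iter_docs` and the sample texts are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_docs(records: Iterable[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Lazily yield one dict per document so the corpus never has to fit in memory."""
    for docno, text in records:
        cleaned = text.strip()
        # Skip empty documents: there is nothing to encode for them.
        if not cleaned:
            continue
        # Assumed document shape: {"docno": ..., "text": ...}
        yield {"docno": docno, "text": cleaned}


# Hypothetical sample records as (docno, text) pairs.
sample = [
    ("1", "compact memories have flexible capacities "),
    ("2", "   "),
    ("3", "an electronic analogue computer"),
]
docs = list(iter_docs(sample))
print(docs)
```

Under that assumption, an iterator like `iter_docs(...)` could stand in for `dataset.get_corpus_iter()` when indexing a different collection.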

## Retrieval

Now that indexing has completed, we can load in the index and the checkpoint model (which we will need for encoding queries).


In [19]:
anceretr = pyterrier_ance.ANCERetrieval("/content/Passage ANCE(FirstP) Checkpoint", "/content/anceindex")

Loading model
Using mean: False
Loading shard metadata






Here we can ask PyTerrier to search the ANCE index for `'chemical reactions'`.

In [20]:
(anceretr % 10).search("chemical reactions")

***** inference of 1 queries *****



Not running in distributed mode

***** faiss search for 1 queries on 1 shards *****






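|   | qid | docid | docno | score | rank |
|---|-----|-------|-------|-------|------|
| 0 | 1 | 7048 | 7049 | 709.171814 | 0 |
| 1 | 1 | 3451 | 3452 | 708.950317 | 1 |
| 2 | 1 | 1605 | 1606 | 708.893311 | 2 |
| 3 | 1 | 9373 | 9374 | 708.687378 | 3 |
| 4 | 1 | 5507 | 5508 | 708.424561 | 4 |
| 5 | 1 | 10059 | 10060 | 708.145752 | 5 |
| 6 | 1 | 7921 | 7922 | 708.093506 | 6 |
| 7 | 1 | 10540 | 10541 | 708.003906 | 7 |
| 8 | 1 | 8157 | 8158 | 707.991089 | 8 |
| 9 | 1 | 6285 | 6286 | 707.990051 | 9 |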
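
Conceptually, the search above encodes the query into a vector, scores it against every document vector, and keeps the highest-scoring documents, with `% 10` acting as a rank cutoff. The sketch below mimics this in plain Python, assuming inner-product similarity. The three-dimensional vectors and the helper `top_k` are made up for illustration; real embeddings come from the checkpoint model, and the actual search is delegated to FAISS.

```python
import heapq
from typing import Dict, List, Tuple


def top_k(query_vec: List[float],
          doc_vecs: Dict[str, List[float]],
          k: int) -> List[Tuple[str, float, int]]:
    """Return (docno, score, rank) for the k documents with the largest inner product."""
    scored = (
        (docno, sum(q * d for q, d in zip(query_vec, vec)))
        for docno, vec in doc_vecs.items()
    )
    best = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    # Ranks start at 0, matching the `rank` column in the results above.
    return [(docno, score, rank) for rank, (docno, score) in enumerate(best)]


# Toy document embeddings (hypothetical values, not ANCE output).
doc_vecs = {
    "d1": [1.0, 0.0, 2.0],
    "d2": [0.5, 1.0, 0.0],
    "d3": [2.0, 1.0, 1.0],
}
query_vec = [1.0, 2.0, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print(results)  # [('d3', 4.0, 0), ('d2', 2.5, 1)]
```

Exhaustive scoring like this is linear in the number of documents, which is why a dedicated vector index is used for the real corpus.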


## Running an Experiment

Let's prepare an experiment. Firstly, let's load in a BM25 baseline.

In [16]:
bm25 = pt.BatchRetrieve(pt.get_dataset("vaswani").get_index(), wmodel="BM25")

Finally, let's evaluate both systems on the Vaswani topics and qrels. BM25, run over a Terrier index of the same corpus, serves as the comparison point.

In [17]:
pt.Experiment(
    [bm25, anceretr],  # compare the BM25 baseline with the ANCE retriever loaded above
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "mrt"]
)

***** inference of 93 queries *****



Not running in distributed mode

***** faiss search for 93 queries on 1 shards *****






|   | name | map | recip_rank | mrt |
|---|------|-----|------------|-----|
| 0 | BR(BM25) | 0.296519 | 0.725665 | 24.777853 |
| 1 | ANCE | 0.150573 | 0.668049 | 14.022788 |


So on this collection, ANCE isn't as effective as BM25 under MAP (0.151 vs. 0.297) or MRR (0.668 vs. 0.726), but it does have a lower mean response time (14.0 vs. 24.8).
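
For reference, the two effectiveness measures reported above can be computed per query as sketched below; MAP and MRR are then the means of these values over all queries. This is a simplified illustration with binary relevance and hypothetical helper names (`average_precision`, `reciprocal_rank`), not the evaluation code that PyTerrier actually runs.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision@i over the ranks i at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for i, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / i
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for i, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / i
    return 0.0


# Hypothetical ranking and relevance judgements for a single query.
ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d7", "d9", "d5"}

ap = average_precision(ranking, relevant)
rr = reciprocal_rank(ranking, relevant)
print(round(ap, 4), rr)  # 0.3333 0.5
```

Reciprocal rank only rewards the first relevant hit, whereas average precision also depends on how many relevant documents are found and how early, which is one reason the gap between two systems can differ between the two measures.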