# PyTerrier Search Solutions 2022 Tutorial Notebook - ANCE

This notebook gives attendees hands-on experience with dense indexing and retrieval in [PyTerrier](https://github.com/terrier-org/pyterrier). All experiments are conducted using the [CORD19 corpus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7251955/) and the [TREC Covid test collection](https://ir.nist.gov/covidSubmit/).

This notebook demonstrates the use of [ANCE dense retrieval](https://github.com/microsoft/ANCE) for end-to-end indexing and retrieval in PyTerrier, as provided by the [*PyTerrier ANCE*](https://github.com/terrierteam/pyterrier_ance) plugin.

In this notebook, you will:
 - perform indexing and retrieval using [*PyTerrier ANCE*](https://github.com/terrierteam/pyterrier_ance);
 - carry out a failure analysis comparing ANCE with BM25 results.



# Setup

In the following, we will set up the libraries required to execute the notebook.

## Python packages installation

The following packages are installed first to avoid warnings/errors during [PyTerrier](https://github.com/terrier-org/pyterrier) installation. Note that the current release of the [PyTerrier](https://github.com/terrier-org/pyterrier) [ANCE](https://github.com/microsoft/ANCE) plugin works only with a specific version of the following Python package:

* `faiss-cpu` or `faiss-gpu`; the cell below pins `faiss-gpu` to version 1.7.2

> You can safely ignore the message about runtime restart.

In [None]:
!apt install --quiet --upgrade libomp-dev
%pip install --quiet --upgrade faiss-gpu==1.7.2

## PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [None]:
%pip install --quiet python-terrier

## PyTerrier plugins installation

We install the official version of the [*PyTerrier ANCE*](https://github.com/terrierteam/pyterrier_ance) plugin. You can safely ignore the package versioning errors.

In [None]:
%pip install --quiet git+https://github.com/terrierteam/pyterrier_ance.git

# Preliminary steps

In [None]:
import pyterrier as pt

## [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use in the remainder of this notebook.

In [None]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='title')
qrels = dataset.get_qrels()

## [Terrier](http://terrier.org) inverted index download

To save a few minutes, we use a pre-built [Terrier](http://terrier.org) inverted index for the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) collection. The download took a few seconds for us.

If you wanted to build the inverted index yourself, it would take a few minutes, using the following code:

```python
import os

cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
    # create the index, using the IterDictIndexer indexer 
    indexer = pt.index.IterDictIndexer(pt_index_path)

    # we give the dataset get_corpus_iter() directly to the indexer
    # while specifying the fields to index and the metadata to record
    index_ref = indexer.index(cord19.get_corpus_iter(), 
                              fields=('abstract',), 
                              meta=('docno',))

else:
    # if you already have the index, use it.
    index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")

index = pt.IndexFactory.of(index_ref)
```
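
To give an intuition for what an inverted index stores and how BM25 uses it, here is a minimal pure-Python sketch. The toy corpus, the whitespace tokenisation and the `bm25()` helper are assumptions made for illustration, and the scoring formula is a common textbook BM25 variant, not necessarily the exact formula Terrier implements.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; real CORD19 abstracts are much longer.
docs = {
    "d1": "coronavirus origin in bats",
    "d2": "anesthesia recommendations for therapy",
    "d3": "origin of the novel coronavirus outbreak",
}

# Inverted index: term -> {docno: term frequency}
inverted = defaultdict(dict)
doc_len = {}
for docno, text in docs.items():
    tokens = text.lower().split()
    doc_len[docno] = len(tokens)
    for term, tf in Counter(tokens).items():
        inverted[term][docno] = tf

avg_len = sum(doc_len.values()) / len(doc_len)

def bm25(query, k1=1.2, b=0.75):
    """Score documents with a textbook BM25 variant; returns (docno, score), best first."""
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query.lower().split():
        postings = inverted.get(term, {})
        if not postings:
            continue  # the term occurs in no document
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for docno, tf in postings.items():
            # longer documents are penalised via length normalisation
            norm = tf + k1 * (1 - b + b * doc_len[docno] / avg_len)
            scores[docno] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = bm25("coronavirus origin")
print(ranking)
```

Only documents that share at least one term with the query receive a score, which is the key difference from the dense retrieval used later in this notebook.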

In [None]:
index = pt.datasets.get_dataset('trec-covid').get_index('terrier_stemmed')

## [ANCE](https://github.com/microsoft/ANCE) dense index download

We are going to download a pre-built [ANCE](https://github.com/microsoft/ANCE) [FAISS](https://github.com/facebookresearch/faiss) index for the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) collection. The download took less than a minute for us.

The construction of this index takes some time. If you wanted to create the index yourself, the code to use is as follows:

```python
!rm -rf /content/anceindex

import pyterrier_ance

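# CORD19 documents store their text in 'abstract' (pt.text.sliding() defaults to 'body')
ance_indexer = pt.text.sliding(text_attr='abstract', length=150, stride=75, prepend_attr='title') >> pyterrier_ance.ANCEIndexer(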
    checkpoint_path="./ance_model_checkpoint",
    index_path="/content/anceindex",
    num_docs=192509,
    text_attr='abstract' # COVID
)

ance_indexer.index(dataset.get_corpus_iter())
```

In [None]:
ance_index = pt.datasets.get_dataset('trec-covid').get_index('ance_msmarco_psg')

The download has produced a few files. An ANCE index is broken into shards; CORD19 is small enough to need only a single shard, so we have only `0.*` files and no `1.*`, etc.

An ANCE index contains three kinds of files:
 - `shards.pkl`: how many documents are in each shard;
 - `x.docids.pkl`: the docid-to-docno mapping for the documents in shard `x`;
 - `x.faiss`: the FAISS index for the documents in shard `x`.
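
To make the roles of these files more concrete, the following sketch mimics a single shard with plain Python lists: one list of vectors standing in for the FAISS index, and one list standing in for the docid-to-docno mapping. All vectors, docnos and the `search()` helper are made up for illustration; FAISS performs the same kind of inner-product search far more efficiently, and the real file contents may be organised differently.

```python
# Toy stand-ins for one shard (all values are hypothetical).
docnos = ["doc_a", "doc_b", "doc_c"]   # plays the role of the docid -> docno mapping
vectors = [                            # plays the role of the FAISS index
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.6, 0.1],
]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def search(query_vec, k=2):
    """Brute-force inner-product search; returns (docno, score) pairs, best first."""
    scored = [(docid, dot(query_vec, vec)) for docid, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # translate internal integer docids back into external docnos
    return [(docnos[docid], score) for docid, score in scored[:k]]

query_vec = [1.0, 0.5, 0.0]  # in ANCE, this vector would come from the query encoder
results = search(query_vec)
print(results)
```

The integer position of a vector is its docid, which is why a separate mapping back to docnos is needed after the nearest-neighbour search.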


# Retrieval experiments

Now that indexing/downloading has completed, we can load the index and the learned model (which we need for encoding queries). Index loading can take some time, as the [FAISS](https://github.com/facebookresearch/faiss) index needs to be loaded into main memory, as does the docno lookup file.

Let's prepare an experiment. First, let's create a BM25 baseline transformer and the [ANCE](https://github.com/microsoft/ANCE) retrieval transformer. Since most documents exceed the maximum length supported by ANCE, a sliding window of 150 tokens (stride 75, prepending the title) was used to construct passages at indexing time. As such, passage scores need to be aggregated into document scores, e.g., using `pt.text.max_passage()`.
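
The passaging and aggregation steps can be sketched in a few lines of plain Python. This is a simplified illustration, not PyTerrier's implementation: the helper names, the whitespace tokenisation and the `docno%pN` passage identifier convention are assumptions made for the example.

```python
def sliding_passages(docno, title, body, length=150, stride=75):
    """Split body into overlapping windows of `length` tokens, prepending the title."""
    tokens = body.split()
    passages = []
    for i, start in enumerate(range(0, max(len(tokens), 1), stride)):
        window = tokens[start:start + length]
        passages.append((f"{docno}%p{i}", " ".join([title] + window)))
        if start + length >= len(tokens):
            break  # this window already reaches the end of the document
    return passages

def max_passage(passage_scores):
    """Aggregate passage scores into document scores by taking the maximum."""
    doc_scores = {}
    for passage_id, score in passage_scores.items():
        docno = passage_id.split("%p")[0]
        doc_scores[docno] = max(score, doc_scores.get(docno, float("-inf")))
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)

# Small windows so that the overlap is easy to see.
passages = sliding_passages("d1", "Title", "a b c d e f g h", length=4, stride=2)
print(passages)

# Hypothetical passage scores from a dense retriever.
doc_ranking = max_passage({"d1%p0": 0.2, "d1%p1": 0.9, "d2%p0": 0.5})
print(doc_ranking)
```

Taking the maximum means a document is ranked by its single best-matching passage; other aggregations (e.g. mean or sum of passage scores) would reward documents that match in several places.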

In [None]:
bm25_retriever = pt.BatchRetrieve(index, wmodel="BM25")

from pyterrier_ance import ANCERetrieval

ance_retriever = ANCERetrieval.from_dataset('trec-covid', 'ance_msmarco_psg') >> pt.text.max_passage()

Now we are ready to run the experiments. We are going to retrieve the top 10 ranked documents for the official topics, and compute several effectiveness metrics, as well as the mean response time (`mrt`).

In [None]:
pt.Experiment(
    [bm25_retriever % 10, ance_retriever % 10], 
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "P_10", "ndcg_cut_10", "mrt"],
    names=['BM25', 'ANCE'],
)
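
For reference, two of the measures above can be computed by hand for a single query. The sketch below uses a made-up ranking and made-up relevance judgements; the function names are chosen for this example, and `pt.Experiment` additionally averages such per-query values over all topics.

```python
def precision_at_k(ranked_docnos, relevant, k=10):
    """Fraction of the top-k positions filled by relevant documents."""
    return sum(1 for docno in ranked_docnos[:k] if docno in relevant) / k

def reciprocal_rank(ranked_docnos, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking and relevance judgements for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}

p_at_5 = precision_at_k(ranked, relevant, k=5)   # d1 and d4 are relevant: 2 of 5
rr = reciprocal_rank(ranked, relevant)           # first relevant document is at rank 3
print(p_at_5, rr)
```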

The weaker results of our ANCE retriever are due to the lack of fine-tuning of the underlying BERT-based model on COVID19 and medical-related documents, as done, for example, in [SLEDGE](https://arxiv.org/pdf/2005.02365.pdf).

## ANCE failure analysis

Let's take a look at the results of BM25 and ANCE for the query with qid "`1`". To do so, let's write a function that displays the results, including the DOI URL of each document.

In [None]:
import pandas as pd

def show_res_with_text_labels(system, qid):
    def make_doi_url(df):
        # turn each non-empty DOI into a resolvable URL
        df["doi"] = df["doi"].apply(lambda doi: "https://doi.org/" + doi if doi else doi)
        return df

    def make_clickable(val):
        # target _blank to open a new window
        return '<a target="_blank" href="{}">{}</a>'.format(val, val)

    # retrieve the top 10 results, then attach the title and DOI of each document
    pipe = (system % 10) >> pt.text.get_text(dataset, ["title", "doi"]) >> pt.apply.generic(make_doi_url)
    res = pipe.transform(topics[topics.qid == qid])
    # attach the relevance labels; unjudged documents get a NaN label
    res = res.merge(qrels, how='left')
    res = res.sort_values("rank", ascending=True)
    # return the styled table so that the notebook renders the DOI links
    return res.style.format({'doi': make_clickable})

pd.set_option("display.max_colwidth", 102)

In [None]:
show_res_with_text_labels(bm25_retriever, "1")

Judging by the titles of the retrieved documents, BM25 did quite well (NB: For documents with DOIs, you can look up the actual document).

Overall, BM25 achieves P@10 = 0.8 for this query.

Let's look at ANCE now.

In [None]:
show_res_with_text_labels(ance_retriever, "1")

Here, we see that ANCE has ranked three documents that have some relevance to the query (titled "The first two cases of 2019‐nCoV in Italy: Where they come from"); however, only the third one (the one with a DOI) was actually judged relevant. Several of the other retrieved documents appear to be non-relevant, e.g. "General Anesthesia Recommendations for Electroconvulsive Therapy".
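
A side-by-side comparison like this can also be summarised programmatically. The following sketch assumes each system's results are available as a ranked list of docnos plus a set of judged-relevant docnos; the `compare_runs()` helper and all docnos below are hypothetical.

```python
def compare_runs(run_a, run_b, relevant):
    """Summarise how two ranked lists of docnos differ for one query."""
    set_a, set_b = set(run_a), set(run_b)
    return {
        "overlap": sorted(set_a & set_b),
        "relevant_only_in_a": sorted((set_a - set_b) & relevant),
        "relevant_only_in_b": sorted((set_b - set_a) & relevant),
        "non_relevant_in_b": [d for d in run_b if d not in relevant],
    }

# Hypothetical top-4 results of two systems and the judged-relevant documents.
bm25_top = ["d1", "d2", "d3", "d4"]
ance_top = ["d3", "d9", "d1", "d8"]
judged_relevant = {"d1", "d2", "d3", "d8"}

summary = compare_runs(bm25_top, ance_top, judged_relevant)
print(summary)
```

Documents that are relevant but retrieved by only one system are good starting points for a failure analysis, since they show what each retrieval approach misses.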

## Practice task

Try a few other queries in `show_res_with_text_labels()`. Qids are numbered from "1" to "50". 

In [None]:
# Write your code here