# PyTerrier ECIR 2021 Tutorial Notebook - Part 3.2 - ColBERT

This is one of a series of Colab notebooks created for the PyTerrier ECIR 2021 tutorial. It gives attendees hands-on experience of building pipelines in PyTerrier. All experiments are conducted using the CORD19 corpus and the TREC Covid test collection.

This notebook demonstrates the use of [ColBERT](https://github.com/stanford-futuredata/ColBERT/tree/v0.2) as a re-ranker via the [PyTerrier ColBERT](https://github.com/terrierteam/pyterrier_colbert) plugin. Because ColBERT requires a different version of the Huggingface transformers library, this is a separate notebook from Part 3.1 (OpenNIR and monoT5).

In this notebook, you will experience:

 - re-ranking documents using ColBERT
 - analysis of how ColBERT scores a document using the max-sim operator.


## Setup

In the following, we will set up the libraries required to execute the notebook.

### PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [None]:
!pip install --upgrade python-terrier

### PyTerrier plugins installation

We install the official version of the [PyTerrier ColBERT](https://github.com/terrierteam/pyterrier_colbert) plugin. You can safely ignore any package versioning errors reported by pip.

In [None]:
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert

## Preliminary steps

### [PyTerrier](https://github.com/terrier-org/pyterrier) initialization 

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org/) IR platform.

In [None]:
import pyterrier as pt
if not pt.started():
    pt.init()

### [TREC-COVID19](https://ir.nist.gov/covidSubmit/) Dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use in the remainder of this notebook.

In [None]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='description')
qrels = dataset.get_qrels()

### Terrier inverted index download

To save a few minutes, we use a pre-built Terrier inverted index for the TREC-COVID19 collection. The download took only a few seconds for us.

Building the inverted index yourself would take a few minutes. For reference, the code to do so is the following:

```python
import os

cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
    # create the index, using the IterDictIndexer indexer 
    indexer = pt.index.IterDictIndexer(pt_index_path)

    # we give the dataset get_corpus_iter() directly to the indexer
    # while specifying the fields to index and the metadata to record
    index_ref = indexer.index(cord19.get_corpus_iter(), 
                              fields=('abstract',), 
                              meta=('docno',))

else:
    # if you already have the index, use it.
    index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")

index = pt.IndexFactory.of(index_ref)
```

In [None]:
import os

if not os.path.exists("terrier_index.zip"):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/terrier_index.zip
  !unzip -j terrier_index.zip -d terrier_index

index_ref = pt.IndexRef.of("./terrier_index/data.properties")
index = pt.IndexFactory.of(index_ref)

## ColBERT



In [None]:
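import pyterrier_colbert.ranking
# load the ColBERT checkpoint; the two None arguments are the index location
# and name, which are not needed when ColBERT is only used to re-rank text
colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(
    "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip", None, None)
# score each document's 'abstract' text against the query
colbert = colbert_factory.text_scorer(doc_attr='abstract')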

Let's look at how well it works at ranking! Here we create a ColBERT ranking pipeline,
which re-ranks the top 100 documents identified by DPH (the default weighting model of `pt.BatchRetrieve`).

In [None]:
br = pt.BatchRetrieve(index) % 100
pipeline = br >> pt.text.get_text(dataset, 'abstract') >> colbert
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> ColBERT'],
    eval_metrics=["map", "ndcg", 'ndcg_cut.10', 'P.10', 'mrt']
)
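The `% 100` in the pipeline above is PyTerrier's rank cutoff operator: it keeps only the highest-ranked documents for each query before they are passed on to the re-ranker. As a rough illustration of that idea (not PyTerrier's actual implementation), the sketch below applies a per-query cutoff to a toy results dataframe. The helper `top_k_per_query` and all the data are hypothetical and exist only for this example.

```python
import pandas as pd

def top_k_per_query(results, k):
    """Keep the k highest-scoring rows for each qid (hypothetical helper)."""
    ordered = results.sort_values(["qid", "score"], ascending=[True, False])
    return ordered.groupby("qid").head(k).reset_index(drop=True)

# toy results for two queries; the identifiers and scores are made up
toy_results = pd.DataFrame({
    "qid":   ["1", "1", "1", "2", "2"],
    "docno": ["a", "b", "c", "d", "e"],
    "score": [3.0, 1.0, 2.0, 5.0, 4.0],
})

cut = top_k_per_query(toy_results, 2)
print(cut)
```

A smaller cutoff makes the re-ranking stage faster, but the re-ranker can then only promote documents that the first stage already placed within the cutoff.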

## Visualizing ColBERT

That's not bad! Let's dig a little deeper into which documents it retrieved and what it was paying attention to in those documents.

In [None]:
res = pipeline(topics.iloc[:1])
res.merge(dataset.get_qrels(), how='left').head()
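The `merge(..., how='left')` call above attaches relevance labels to the ranked results. A left join keeps every retrieved document, so documents that were never judged appear with a missing label rather than being dropped. The minimal sketch below shows this behaviour on toy dataframes; the column names follow the usual `qid`/`docno`/`label` convention, and all values are invented for illustration.

```python
import pandas as pd

# toy ranked results for a single query (made-up identifiers and scores)
res_toy = pd.DataFrame({
    "qid":   ["1", "1", "1"],
    "docno": ["docA", "docB", "docC"],
    "score": [12.3, 11.8, 9.4],
})

# toy relevance judgements: docB has not been judged
qrels_toy = pd.DataFrame({
    "qid":   ["1", "1"],
    "docno": ["docA", "docC"],
    "label": [0, 2],
})

# with no `on` argument, pandas joins on the shared columns (qid and docno)
merged = res_toy.merge(qrels_toy, how="left")
print(merged)
```

In the merged output, a `NaN` label means "unjudged", which most evaluation measures treat as non-relevant.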

The top-ranked document for query 0 is non-relevant. Let's see what it's paying attention to in this document using ColBERT's `explain_text`:

In [None]:
text = dataset.irds_ref().docs_store().get('4dtk1kyh').abstract[:300] + '...' # truncate text
colbert_factory.explain_text('what is the origin of covid 19', text)

We see very strong matches at the start of the document, which contributed directly to the ranking scores (indicated by x's).
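To make the max-sim operator concrete, here is a minimal pure-Python sketch of the scoring idea: each query token embedding is compared with every document token embedding, the best match per query token is kept, and those maxima are summed. This is an illustration under simplifying assumptions, not the plugin's implementation: the 2-dimensional embeddings are invented, and the helpers `l2_normalize` and `maxsim_score` are hypothetical names.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_embs, doc_embs):
    """Return (total score, per-query-token maxima) under the max-sim operator."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    per_token_max = []
    for qv in q:
        # similarity of this query token to every document token
        sims = [sum(a * b for a, b in zip(qv, dv)) for dv in d]
        per_token_max.append(max(sims))
    return sum(per_token_max), per_token_max

# toy embeddings: 2 query tokens and 3 document tokens (made-up values)
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

score, per_token = maxsim_score(query_embs, doc_embs)
print(score, per_token)
```

The per-token maxima correspond to the strongest query-to-document matches, which is the kind of contribution the visualization above highlights.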

# That's all folks
If you aren't coming back for Part 4 of the tutorial, please don't forget to complete our exit quiz: https://forms.office.com/r/2WbpLiQmWV