# PyTerrier Tutorial Notebook - Part 1

This is one of a series of Colab notebooks created for the Tutorial entitled '**IR From Bag-of-words to BERT and Beyond through Practical Experiments**'. It demonstrates the use of [PyTerrier](https://github.com/terrier-org/pyterrier) on the [CORD19 test collection](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

In particular, this notebook has the following learning outcomes:
  - PyTerrier installation & configuration
  - indexing a collection
  - accessing an index
  - using the `BatchRetrieve` transformer for searching an index
  - conducting an `Experiment` 

Pre-requisites:
 - We assume that you are comfortable programming in Python, including [lambda functions](https://www.w3schools.com/python/python_lambda.asp).
 - We will **only be supporting notebooks on the Google Colab platform**.
  > *Explanation*: You can try this notebook on Linux/Mac/Windows, but we have verified these notebooks on Google Colab for the purposes of this tutorial.

Related Reading:
 - [Pandas documentation](https://pandas.pydata.org/docs/)
 - [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/)


PyTerrier is a Python framework that uses the underlying [Terrier information retrieval toolkit](http://terrier.org) for many indexing and retrieval operations. Although PyTerrier itself was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to perform IR experiments in Python while relying on the mature Terrier platform for the expensive indexing and retrieval operations.

In the following, we introduce everything you need to know about PyTerrier, and also provide appropriate links to relevant parts of the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/).


### Installation & Configuration

PyTerrier is installed as follows. This might take a few minutes, so you can read on.

In [None]:
%pip install -q python-terrier

import pyterrier as pt

### Documents, Indexing and Indexes

Much of PyTerrier's view of the world is wrapped up in Pandas dataframes. Let's consider some textual documents in a dataframe.


In [None]:
# we need to import pandas. We commonly rename it to pd, to make commands shorter
import pandas as pd

# let's not truncate output too much
pd.set_option('display.max_colwidth', 150)

docs_df = pd.DataFrame([
        ["d1", "this is the first document of many documents"],
        ["d2", "this is another document"],
        ["d3", "the topic of this document is unknown"]
    ], columns=["docno", "text"])

docs_df

Before any search engine can estimate which documents are most likely to be relevant for a given query, it must index the documents. 

In the following cell, we index the dataframe's documents. The index, with all its data structures, is written into a directory called `index_3docs`. 

In [None]:
indexer = pt.DFIndexer("./index_3docs", overwrite=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

An `IndexRef` is essentially a string saying where an index is stored. Indeed, we can look in the `index_3docs` directory and see that it has created various small files:

In [None]:
!ls -lh index_3docs/

Given an `IndexRef`, we can load the actual index. The method `pt.IndexFactory.of()` is the relevant factory.

In [None]:
index = pt.IndexFactory.of(index_ref)

# let's see what type index is
type(index)

Ok, so this object refers to Terrier's [`Index`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html) type. Check the linked Javadoc – you will see that this Java object has methods such as:
 - `getCollectionStatistics()`
 - `getInvertedIndex()`
 - `getLexicon()`

Let's see what is returned by the `getCollectionStatistics()` method:

In [None]:
print(index.getCollectionStatistics().toString())

Ok, that seems fair – we have 3 documents. But why only 4 terms? 
Let's check the [`Lexicon`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Lexicon.html), which is our vocabulary. Fortunately, the `Lexicon` can be iterated easily from Python:

In [None]:
for kv in index.getLexicon():
  print("%s (%s) -> %s (%s)" % (kv.getKey(), type(kv.getKey()), kv.getValue().toString(), type(kv.getValue()) ) )

Here, iterating over the `Lexicon` returns pairs of a `String` term and a [`LexiconEntry`](http://terrier.org/docs/current/javadoc/org/terrier/structures/LexiconEntry.html) object. The `LexiconEntry` is itself an [`EntryStatistics`](http://terrier.org/docs/current/javadoc/org/terrier/structures/EntryStatistics.html) and contains information including the statistics of that term.


So what did we find? Here are some observations:
 - we only have 4 unique terms, as stopwords were removed;
 - we have one term for `"document"`, even though `"documents"` occurred in document "`d1`". 
 
Both these observations make sense, as indeed Terrier removes standard stopwords and applies Porter's stemmer by default.

Further:
 - `Nt` is the number of unique documents that each term occurs in – this is useful for calculating IDF.
 - `TF` is the total number of occurrences – some weighting models use this instead of Nt.
 - The numbers in the `@{}` are a pointer – they tell Terrier where the postings are for that term in the inverted index data structure.
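
To make these observations concrete, here is a minimal plain-Python sketch that reproduces the four-term vocabulary and the `Nt`/`TF` statistics for our three documents. It does not use Terrier: the stopword set is hand-picked for this example, and `toy_stem` is a crude stand-in for Porter's stemmer, so both are assumptions made purely for illustration.

```python
from collections import Counter

# Hand-picked stopwords for this toy example only (an assumption);
# Terrier ships its own, much longer list.
STOPWORDS = {"this", "is", "the", "of", "many", "another"}

def toy_stem(token):
    # Crude stand-in for Porter's stemmer: strip a single trailing "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyse(text):
    # Lowercase, split on whitespace, drop stopwords, then stem.
    return [toy_stem(t) for t in text.lower().split() if t not in STOPWORDS]

docs = {
    "d1": "this is the first document of many documents",
    "d2": "this is another document",
    "d3": "the topic of this document is unknown",
}

nt = Counter()  # Nt: number of documents containing each term
tf = Counter()  # TF: total occurrences of each term in the collection
for text in docs.values():
    terms = analyse(text)
    tf.update(terms)
    nt.update(set(terms))

for term in sorted(tf):
    print(f"{term}: Nt={nt[term]} TF={tf[term]}")
```

Only `document` has `TF` greater than `Nt`, because it occurs twice in `d1` once `documents` has been stemmed.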

Finally, we can also use the square bracket notation to lookup terms in Terrier's lexicon:


In [None]:
index.getLexicon()["document"].toString()

Let's now think about the inverted index. Remember that the inverted index tells us in which *documents* each term occurs. The `LexiconEntry` is the pointer that tells us where to find the postings for that term in the inverted index.

In [None]:
pointer = index.getLexicon()["document"]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(posting.toString() + " doclen=%d" % posting.getDocumentLength())

Ok, so we can see that `"document"` occurs in all three documents: twice in `d1` (as `"document"` and `"documents"`, which share a stem) and once in each of `d2` and `d3`.

NB: Terrier counts documents as integers from 0 (called *docids*). It records the mapping back to *docnos* (the string form, i.e. "`d1`", "`d2`") in a separate data structure called the *metaindex*.
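
The relationship between postings, docids and the metaindex can be sketched in plain Python as follows. This is only a conceptual model using dictionaries and lists, not Terrier's on-disk format; the names `inverted_index`, `metaindex` and `doc_lengths` are chosen for this illustration.

```python
from collections import defaultdict

# Already-analysed documents (stopwords removed, terms stemmed), in indexing order.
analysed_docs = [
    ("d1", ["first", "document", "document"]),
    ("d2", ["document"]),
    ("d3", ["topic", "document", "unknown"]),
]

metaindex = []                      # docid -> docno
doc_lengths = []                    # docid -> number of indexed tokens
inverted_index = defaultdict(dict)  # term -> {docid: term frequency}

for docid, (docno, terms) in enumerate(analysed_docs):
    metaindex.append(docno)
    doc_lengths.append(len(terms))
    for term in terms:
        inverted_index[term][docid] = inverted_index[term].get(docid, 0) + 1

# Walk the postings of one term, mapping each integer docid back to its docno.
for docid, freq in sorted(inverted_index["document"].items()):
    print(f"docid={docid} docno={metaindex[docid]} tf={freq} doclen={doc_lengths[docid]}")
```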

### Searching an Index

Our way into search in PyTerrier is called `BatchRetrieve`. BatchRetrieve is configured by specifying an index and a weighting model (`Tf` in our example). We then search for a single-word query, `"document"`.

In [None]:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("document")

So the `search()` method returns a dataframe with columns:
 - `qid`: this is by default "1", since it's our first and only query
 - `docid`: Terrier's internal integer for each document
 - `docno`: the external (string) unique identifier for each document
 - `score`: since we use the `Tf` weighting model, this score corresponds to the total frequency of the query terms in each document
 - `rank`: A handy attribute showing the descending order by score
 - `query`: the input query

As expected, the `Tf` weighting model used here only counts the frequencies of the query terms in each document, i.e.:
$$
score(d,q) = \sum_{t \in q} tf_{t,d}
$$

Hence, it's clear that document `d1` should be the highest-scoring document, with two occurrences (i.e. `'document'` and `'documents'`).
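
As a rough illustration of the formula, the sketch below scores the three toy documents in plain Python and assigns ranks by descending score. The function `tf_search` and the `doc_tfs` dictionary are hypothetical names for this example, and starting ranks at 0 is an assumption of the sketch.

```python
# Term frequencies per document after stopword removal and stemming
# (the three toy documents indexed earlier).
doc_tfs = {
    "d1": {"first": 1, "document": 2},
    "d2": {"document": 1},
    "d3": {"topic": 1, "document": 1, "unknown": 1},
}

def tf_search(query_terms, qid="1"):
    # score(d, q) = sum over query terms t of tf_{t,d}
    scored = []
    for docno, tfs in doc_tfs.items():
        score = sum(tfs.get(t, 0) for t in query_terms)
        if score > 0:  # keep only documents matching at least one query term
            scored.append((docno, score))
    # Sort by descending score; ties keep insertion order because sort is stable.
    scored.sort(key=lambda pair: -pair[1])
    # Ranks start at 0 here (an assumption of this sketch).
    return [
        {"qid": qid, "docno": docno, "score": score, "rank": rank}
        for rank, (docno, score) in enumerate(scored)
    ]

results = tf_search(["document"])
for row in results:
    print(row)
```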

We can also pass a dataframe of one or more queries to the `transform()` method (rather than the `search()` method) of a transformer, with queries numbered "q1", "q2", etc.

In [None]:
import pandas as pd
queries = pd.DataFrame([["q1", "document"], ["q2", "first document"]], columns=["qid", "query"])
br.transform(queries)

In fact, we are usually calling `transform()`, so it's the default method – i.e. 
`br.transform(queries)` can be more succinctly written as `br(queries)`.
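
In Python terms, this shorthand works when an object defines `__call__` and forwards it to `transform()`. The class below is a hypothetical stand-in written for this illustration, not PyTerrier's own implementation:

```python
class ToyTransformer:
    """Hypothetical stand-in for a transformer; not part of PyTerrier."""

    def transform(self, queries):
        # Pretend retrieval: return one (qid, docno) pair per query.
        return [(qid, "d1") for qid, _query in queries]

    def __call__(self, queries):
        # Calling the object forwards to transform().
        return self.transform(queries)

toy = ToyTransformer()
toy_queries = [("q1", "document"), ("q2", "first document")]
print(toy(toy_queries) == toy.transform(toy_queries))
```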

In [None]:
br(queries)

### CORD19

OK, having 3 documents is quite trivial, so let's move up to a slightly larger corpus of documents. We'll be using the CORD19 dataset for the remainder of this tutorial. PyTerrier has a handy `get_dataset()` API, which allows us to download the corpus and index it.

In [None]:
import os

cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
  # create the index, using the IterDictIndexer indexer 
  indexer = pt.index.IterDictIndexer(pt_index_path, meta_reverse=[])

  # we give the dataset get_corpus_iter() directly to the indexer
  # while specifying the fields to index and the metadata to record
  index_ref = indexer.index(cord19.get_corpus_iter(), 
                            fields=('abstract',), 
                            meta=('docno',))

else:
  # if you already have the index, use it.
  index_ref = pt_index_path

index = pt.IndexFactory.of(index_ref)


#### Task 1
- What are the statistics of our index?

In [None]:
#YOUR SOLUTION

In [None]:
#@title View Solution
print(index.getCollectionStatistics().toString())

Next, CORD19 also has a corresponding set of queries and relevance assessments (aka qrels), thus forming a *test collection*.

We can easily access the topics and qrels from the dataset. Indeed these are expressed as dataframes as well (we use Pandas's [`head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method to show only the first 5 topics):

In [None]:
cord19.get_topics(variant='title').head(5)

In [None]:
cord19.get_qrels().head(5)

### Weighting Models

So far, we have been using the simple "`Tf`" as our ranking function for document retrieval in BatchRetrieve. However, we can use other models such as `"TF_IDF"` by simply changing the `wmodel="Tf"` keyword argument in the constructor of `BatchRetrieve`.


In [None]:
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
tfidf.search("chemical reactions")

You will note that, as expected, the scores of documents ranked by `TF_IDF` are no longer integers. You can see the exact formula used by Terrier from [the Github repo](https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/matching/models/TF_IDF.java#L79).
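
To see why such scores stop being integers, consider a textbook variant of TF-IDF, `tf * log(N / df)`, applied to the three toy documents from earlier. This is a simplified sketch and deliberately not Terrier's formula (see the linked source for that); the helper names are hypothetical.

```python
import math

# Term frequencies for the three toy documents (after stopword removal and stemming).
doc_tfs = {
    "d1": {"first": 1, "document": 2},
    "d2": {"document": 1},
    "d3": {"topic": 1, "document": 1, "unknown": 1},
}
num_docs = len(doc_tfs)

def doc_freq(term):
    # Number of documents that contain the term.
    return sum(1 for tfs in doc_tfs.values() if term in tfs)

def textbook_tfidf(docno, query_terms):
    # Textbook tf * log(N / df); a simplification, not Terrier's TF_IDF.
    score = 0.0
    for term in query_terms:
        tf = doc_tfs[docno].get(term, 0)
        if tf:
            score += tf * math.log(num_docs / doc_freq(term))
    return score

for docno in doc_tfs:
    print(docno, textbook_tfidf(docno, ["first", "document"]))
```

In this variant, `document` contributes nothing because it occurs in every document, so only the rarer term `first` separates `d1` from the rest.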

Terrier supports many weighting models – the documentation contains [a list of supported models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html) - some of which we will discover later in the tutorial.


### What is Success?

So far, we have been creating search engine models, but we haven't decided if any of them are actually any good. Let's investigate if we are getting a correct ("relevant") document at the first rank.

In [None]:
qrels = cord19.get_qrels()
def get_res_with_labels(ranker, df):
  # get the results for the query or queries
  results = ranker( df )
  # left outer join with the qrels
  with_labels = results.merge(qrels, on=["qid", "docno"], how="left").fillna(0)
  return with_labels

# let's get the TF_IDF results for the first query
get_res_with_labels(tfidf, cord19.get_topics(variant='title').head(1))
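
The left outer join with `fillna(0)` keeps every retrieved document and treats unjudged documents as non-relevant. The plain-Python sketch below mimics that behaviour on made-up results and qrels (the docnos and labels are invented for illustration and do not come from CORD19):

```python
# Made-up results and qrels for illustration only.
toy_results = [
    {"qid": "q1", "docno": "dA", "rank": 0},
    {"qid": "q1", "docno": "dB", "rank": 1},
    {"qid": "q1", "docno": "dC", "rank": 2},
]
toy_qrels = {("q1", "dB"): 2, ("q1", "dC"): 0}  # dA was never judged

# Left outer join: keep every result row, defaulting the label to 0 when unjudged.
toy_with_labels = [
    dict(row, label=toy_qrels.get((row["qid"], row["docno"]), 0))
    for row in toy_results
]

# Check whether the top-ranked document is relevant.
top = min(toy_with_labels, key=lambda row: row["rank"])
print("relevant at first rank:", top["label"] > 0)
```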

In [None]:
from pyterrier.measures import *

pt.Experiment(
    [tfidf],
    cord19.get_topics(variant='title'),
    cord19.get_qrels(),
    eval_metrics=["map", "ndcg", nDCG@10])
# we can use trec_eval (string) measure names, e.g. "map",
# or the ir_measures syntax, e.g. nDCG@10
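
For intuition about what the `"map"` measure reports, here is a minimal sketch of Average Precision (AP) for a single ranked list, averaged over queries to give MAP. The rankings are made up for illustration, and the function is a simplified stand-in rather than the evaluation code that `pt.Experiment` calls.

```python
def average_precision(ranked_labels, num_relevant):
    # ranked_labels: 1 for relevant, 0 for non-relevant, in rank order.
    # num_relevant: total relevant documents for the query (retrieved or not).
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / position  # precision at this relevant document
    return precision_sum / num_relevant if num_relevant else 0.0

# Made-up rankings for two queries: (labels in rank order, number of relevant documents).
toy_runs = {
    "q1": ([1, 0, 1, 0], 2),
    "q2": ([0, 1, 0, 0], 2),
}
ap = {qid: average_precision(labels, n) for qid, (labels, n) in toy_runs.items()}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```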

### Prebuilt Index from the Terrier Data Repository

In the rest of this tutorial, we'll be using a provided index from the [Terrier Data Repository](http://data.terrier.org/). You can view all of the index variants available for the TREC Covid test collection at: http://data.terrier.org/trec-covid.dataset.html. We will use the standard 'terrier_stemmed' index variant. 

You will see progress bars while the files are downloading. 

In [None]:
bm25 = pt.BatchRetrieve.from_dataset(
    'trec-covid', 
    'terrier_stemmed', 
    wmodel='BM25')

Let's see how BM25 compares to TF_IDF: how many queries have improved (M)AP?

In [None]:
#YOUR SOLUTION HERE


In [None]:
#@title See Solution
pt.Experiment( 
    [tfidf, bm25],
    cord19.get_topics(variant='title'),
    cord19.get_qrels(),
    baseline=0,
    eval_metrics=["map", "ndcg"])


You should see that 40 queries exhibit improvements in AP when using BM25, while 10 are degraded; this is a significant increase ($p < 0.05$).
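
Conceptually, the improved and degraded counts come from comparing per-query AP against the baseline, and significance comes from a paired test over the same per-query values. The sketch below shows the idea on made-up AP values for four hypothetical queries. It stops at the paired t statistic, since computing the p-value needs a t-distribution that the standard library does not provide.

```python
import math
import statistics

# Made-up per-query AP values, for illustration only.
baseline_ap = {"q1": 0.20, "q2": 0.35, "q3": 0.10, "q4": 0.50}
system_ap = {"q1": 0.25, "q2": 0.30, "q3": 0.10, "q4": 0.60}

improved = sum(1 for q in baseline_ap if system_ap[q] > baseline_ap[q])
degraded = sum(1 for q in baseline_ap if system_ap[q] < baseline_ap[q])

# Paired t statistic: mean difference divided by its standard error.
diffs = [system_ap[q] - baseline_ap[q] for q in baseline_ap]
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

print(f"improved={improved} degraded={degraded} t={t_stat:.3f}")
```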

## That's all folks

The following parts of the PyTerrier documentation may be useful references for this notebook:
 * [PyTerrier datasets](https://pyterrier.readthedocs.io/en/latest/datasets.html)
 * [Using Terrier for indexing](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html)
 * [Transformers in PyTerrier](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 * [Transformer Operators](https://pyterrier.readthedocs.io/en/latest/operators.html)

