# NIR 2022 - Lab 2: Introduction to PyTerrier

## 0: Libraries for Indexing and Ranking

There exist several software libraries that can help you develop a search engine.
They usually provide highly optimized indexes, scoring algorithms (e.g. BM25) and other features such as text pre-processing.

Popular libraries include:
- [INDRI](https://www.lemurproject.org/indri/)
- [Elasticsearch](https://github.com/elastic/elasticsearch)
- [Anserini](https://github.com/castorini/anserini)
- [Terrier](https://github.com/terrier-org/terrier-core)
- [Whoosh](https://pypi.org/project/Whoosh/)

As Python is becoming the standard programming language in Data Science and Deep Learning, Python interfaces have been added to several libraries:
- [Elasticsearch-py](https://github.com/elastic/elasticsearch-py) for Elasticsearch
- [Pyserini](https://github.com/castorini/pyserini/) for Anserini
- [PyTerrier](https://github.com/terrier-org/pyterrier) for Terrier
- [BEIR](https://github.com/beir-cellar/beir.git) for BEIR

The role of these Python interfaces is to translate your Python calls into operations on the underlying library or service.
For example, PyTerrier is a Python layer above Terrier:
![PyTerrier](figures/PyTerrier.png "PyTerrier")

## 1: PyTerrier

In our first labs, we will be using PyTerrier, a Python framework that makes it easy to run information retrieval experiments in Python while delegating the expensive indexing and retrieval operations to the Java-based Terrier platform.

In particular, the material of this lab is heavily based on the [ECIR 2021 tutorial](https://github.com/terrier-org/ecir2021tutorial) on the PyTerrier and [OpenNIR](https://github.com/Georgetown-IR-Lab/OpenNIR) search toolkits.

Another useful resource is the [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/).

NB. You can choose any library you prefer to develop your final project. 
Our labs aim to provide guidance on applying the theoretical content of the lectures in practice.
As such, we only use one library and a small dataset during the lab sessions.

<div class="alert alert-warning">
PyTerrier is not yet easy to install on Windows.

If you do not have a Linux or macOS device, one of the easiest options is to use <a href="https://colab.research.google.com/">Google Colab</a>.

Get in touch with the TA if you need help setting up Google Colab!

</div>

## 2: Pre-requisites

PyTerrier requires:
- Python 3.6 or newer
    + You can check your Python version by running `python --version` or `python3 --version` in a terminal
    + You can download a newer version from [here](https://www.python.org/downloads/)
    + Once you have a valid Python version installed, make sure your Jupyter notebook is using it (`Kernel -> Change Kernel`)
- Java 11 or newer
    + You can check your Java version by running `java --version` in a terminal (older Java releases, such as Java 8, only accept `java -version`)
    + You can download and install Java 11 from [JDK 11](https://www.oracle.com/java/technologies/javase-jdk11-downloads.html) or [OpenJDK 11](http://jdk.java.net/archive/). Several tutorials exist to help you in this task, such as [this one for Linux](https://computingforgeeks.com/how-to-install-java-11-on-ubuntu-debian-linux/) and [this one for macOS](https://mkyong.com/java/how-to-install-java-on-mac-osx/)

## 3: Installation

PyTerrier can be installed from the terminal using pip:
```bash
pip install python-terrier
```

In [None]:
# !pip install python-terrier

## 4: Configuration

To use PyTerrier, we need to both import it and initialize it.
The initialization with the `init()` method makes PyTerrier download Terrier's JAR file as well as start the Java virtual machine.
To avoid calling `init()` more than once, we can first check whether PyTerrier has already been initialized with the `started()` method.

In [None]:
import pyterrier as pt
if not pt.started():
    pt.init()

## 5: Data

In our labs, we will be using a subset of the small version of the English [WikIR](https://www.aclweb.org/anthology/2020.lrec-1.237.pdf) dataset.

The data is located inside the `data/` folder, and consists of:
- `lab_docs.csv`: CSV file of document number and document text
- `lab_topics.csv`: CSV file of query id and query text
- `lab_qrels.csv`: CSV file of annotations with schema `qid, docno, label, iteration`

In [None]:
import pandas as pd

In [None]:
docs_df = pd.read_csv('data/lab_docs.csv', dtype=str)
print(docs_df.shape)
docs_df.head()

In [None]:
topics_df = pd.read_csv('data/lab_topics.csv', dtype=str)
print(topics_df.shape)
topics_df.head()

In [None]:
qrels_df = pd.read_csv('data/lab_qrels.csv', dtype=str)
print(qrels_df.shape)
qrels_df.head()

## 6: Indexing and Indexes

To perform the task of retrieving relevant documents for a given query, a search engine needs to know which documents are available and index them to efficiently retrieve them.

In PyTerrier, we can create an index from a Pandas DataFrame with the `DFIndexer` class.
The index, with all its data structures, is written into a directory called `indexes/default`.

In [None]:
indexer = pt.DFIndexer("./indexes/default", overwrite=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

The returned `IndexRef` is basically a string saying where an index is stored.
A PyTerrier index contains several files:

In [None]:
!ls -lh indexes/default

These files represent several data structures:
- Lexicon: Records the list of all unique terms and their statistics
- Document index: Records the statistics of all documents (e.g. document length)
- Inverted index: Records the mapping between terms and documents
- Meta index: Records document metadata (e.g. document number, URL, raw text, etc., as passed to `indexer.index()`)
- Direct index: Records terms for each document
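
To make these roles concrete, here is a minimal pure-Python sketch of the five structures, built over a made-up three-document corpus. The docnos, variable names and dictionary layout are assumptions chosen for illustration; Terrier's actual on-disk formats are compressed and considerably more elaborate.

```python
from collections import Counter, defaultdict

# Toy corpus: (docno, text) pairs; the docnos are made up for illustration.
corpus = [("d10", "the red wall"), ("d20", "red dragon red"), ("d30", "wall of fire")]

meta_index = []                      # docid -> docno (external identifier)
document_index = []                  # docid -> document length in tokens
direct_index = []                    # docid -> {term: frequency in this document}
inverted_index = defaultdict(list)   # term -> [(docid, frequency), ...]

for docid, (docno, text) in enumerate(corpus):
    term_counts = Counter(text.split())
    meta_index.append(docno)
    document_index.append(sum(term_counts.values()))
    direct_index.append(dict(term_counts))
    for term, freq in term_counts.items():
        inverted_index[term].append((docid, freq))

# Lexicon: per-term statistics (number of documents, total occurrences).
lexicon = {
    term: {"Nt": len(postings), "TF": sum(freq for _, freq in postings)}
    for term, postings in inverted_index.items()
}

print(lexicon["red"])          # {'Nt': 2, 'TF': 3}
print(inverted_index["wall"])  # [(0, 1), (2, 1)]
print(meta_index[2])           # d30
```

Note how the inverted index only stores internal integer ids; the meta index is what maps them back to the external `docno`.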

Once we have an `IndexRef`, we can load it to an actual index:

In [None]:
index = pt.IndexFactory.of(index_ref)

# Let's see what type `index` is
type(index)

Ok, so this object refers to Terrier's [`Index`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html) type. 

Looking at the linked Javadoc, we can see that this Java object has methods such as:
 - `getCollectionStatistics()`
 - `getInvertedIndex()`
 - `getLexicon()`

Let's see what is returned by the `getCollectionStatistics()` method:

In [None]:
print(index.getCollectionStatistics().toString())

In [None]:
index.getEnd()
index.getStart()

In [None]:
index.getDocumentIndex().getDocumentLength(1)


In [None]:
index.getMetaIndex().getAllItems(0)

### Lexicon

What is our vocabulary of terms?

This is the [Lexicon](http://terrier.org/docs/current/javadoc/org/terrier/structures/Lexicon.html), which can be iterated easily from Python:

In [None]:
ix_range = range(1900, 1905)
for ix, kv in enumerate(index.getLexicon()):
    if ix in ix_range:
        print(f"{kv.getKey()} -> {kv.getValue().toString()}")
    elif ix > ix_range[-1]:
        break

Here, iterating over the Lexicon yields pairs of a String term and a [LexiconEntry](http://terrier.org/docs/current/javadoc/org/terrier/structures/LexiconEntry.html) object. A LexiconEntry is itself an [EntryStatistics](http://terrier.org/docs/current/javadoc/org/terrier/structures/EntryStatistics.html), and holds the statistics of that term:
- `Nt` is the number of unique documents that the term occurs in (this is useful for calculating IDF)
- `TF` is the total number of occurrences of the term in the collection; some weighting models use this instead of `Nt`
- The numbers in the `@{}` are pointers for Terrier to find that term in the inverted index
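
As a rough illustration of why `Nt` matters, the sketch below computes one common IDF variant, $\log(N / N_t)$, for a rare and a frequent term. The collection size and document frequencies are made-up numbers, and the exact IDF formula differs between weighting models, so treat this as an assumption rather than what Terrier computes.

```python
import math

def idf(num_docs: int, nt: int) -> float:
    """One common IDF variant: log(N / Nt). Weighting models differ in the exact form."""
    return math.log(num_docs / nt)

num_docs = 1000                      # hypothetical collection size
rare = idf(num_docs, nt=3)           # term occurring in 3 documents
common = idf(num_docs, nt=500)       # term occurring in half of the documents

print(f"rare={rare:.3f} common={common:.3f}")
```

The rarer the term, the larger its IDF, so matching it contributes more to a document's score.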

### Inverted Index

The inverted index tells us in which _documents_ each term occurs.

The LexiconEntry is also the pointer to find the postings (i.e. occurrences) for that term in the inverted index.

In [None]:
pointer = index.getLexicon()["agenda"]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(f"{posting.toString()} doclen={posting.getDocumentLength()}")

Ok, so we can see that `"agenda"` occurs once in documents with ids 520, 709 and 1052.

Note that these are internal document ids of Terrier.
We can map them back to the external identifiers (i.e. the string "docno" in the corpus DataFrame) using the meta index:

In [None]:
meta = index.getMetaIndex()
pointer = index.getLexicon()["agenda"]
for posting in index.getInvertedIndex().getPostings(pointer):
    docno = meta.getItem("docno", posting.getId())
    print(f"{posting.toString()} doclen={posting.getDocumentLength()} docno={docno}")

### Text Pre-processing

Looking at the terms in the Lexicon, do you think the index applied any text pre-processing?

What happens if we look up a very frequent term?

In [None]:
# "the" is a stopword, so this lookup should fail on the default index rather than return an entry
index.getLexicon()["the"].toString()

Indeed, Terrier removes standard stopwords and applies Porter's stemmer by default.
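
Conceptually, this pre-processing is a chain of stages that each term passes through before it reaches the index. The sketch below mimics that idea in plain Python. The stopword list is a tiny made-up one and `strip_plural` is a crude stand-in for a stemmer (it is not Porter's algorithm); all names are hypothetical.

```python
from typing import Callable, List, Optional

# Each stage takes a term and returns a (possibly modified) term, or None to drop it.
Stage = Callable[[str], Optional[str]]

STOPWORDS = {"the", "of", "a", "and", "in"}  # tiny illustrative list, not Terrier's

def remove_stopwords(term: str) -> Optional[str]:
    return None if term in STOPWORDS else term

def strip_plural(term: str) -> Optional[str]:
    # Crude stand-in for a stemmer: NOT Porter's algorithm.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def run_pipeline(tokens: List[str], stages: List[Stage]) -> List[str]:
    kept = []
    for term in tokens:
        current: Optional[str] = term
        for stage in stages:
            current = stage(current)
            if current is None:   # a stage dropped the term
                break
        if current is not None:
            kept.append(current)
    return kept

tokens = "the walls of the castle".split()
print(run_pipeline(tokens, [remove_stopwords, strip_plural]))  # ['wall', 'castle']
print(run_pipeline(tokens, []))  # empty pipeline: tokens are kept unchanged
```

An empty list of stages leaves every token untouched, which is why a stopword can only be found in an index built without pre-processing.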

### Index Variants

We can modify the pre-processing transformations applied by Terrier when creating an index by changing its `termpipelines` property.

In [None]:
# No pre-processing
indexer = pt.DFIndexer("./indexes/none", overwrite=True)
indexer.setProperty("termpipelines", "")
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

index.getLexicon()["the"].toString()

In [None]:
# Stopwords removal
indexer = pt.DFIndexer("./indexes/stopwords", overwrite=True)
indexer.setProperty("termpipelines", "Stopwords")
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

index.getLexicon()["agenda"].toString()

See the [org.terrier.terms](http://terrier.org/docs/current/javadoc/org/terrier/terms/package-summary.html) package for a list of the available term pipeline objects provided by Terrier.

Similarly, tokenization is controlled by the `tokeniser` property. For example:
```python
indexer.setProperty("tokeniser", "UTFTokeniser")
```

[EnglishTokeniser](http://terrier.org/docs/current/javadoc/org/terrier/indexing/tokenisation/EnglishTokeniser.html) is the default tokeniser. Other tokenisers are listed in the [org.terrier.indexing.tokenisation](http://terrier.org/docs/current/javadoc/org/terrier/indexing/tokenisation/package-summary.html) package.

Finally, we can also use the `blocks=True` argument for the index to store position information of every term in each document:

In [None]:
indexer = pt.DFIndexer("./indexes/default", overwrite=True, blocks=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())
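
Position information is what makes phrase and proximity matching possible. The following pure-Python sketch shows the idea on two made-up documents; the dictionary layout and the `phrase_match` helper are assumptions for illustration and say nothing about how Terrier encodes blocks internally.

```python
from collections import defaultdict

docs = {"d1": "white wall near the white gate", "d2": "wall of white stone"}

# term -> {docno: [positions at which the term occurs]}
positional_index = defaultdict(dict)
for docno, text in docs.items():
    for position, term in enumerate(text.split()):
        positional_index[term].setdefault(docno, []).append(position)

def phrase_match(first: str, second: str) -> list:
    """Return docnos where `second` occurs immediately after `first`."""
    matches = []
    for docno, first_positions in positional_index.get(first, {}).items():
        second_positions = set(positional_index.get(second, {}).get(docno, []))
        if any(p + 1 in second_positions for p in first_positions):
            matches.append(docno)
    return matches

print(positional_index["white"])      # {'d1': [0, 4], 'd2': [2]}
print(phrase_match("white", "wall"))  # ['d1']
```

Both documents contain "white" and "wall", but only `d1` contains them as an adjacent phrase, which an index without positions could not tell apart.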

### Loading an Index

Creating an index can take significant time for large document collections.
We can load a previously built index by passing the path to its `data.properties` file.

In [None]:
index_ref = pt.IndexRef.of("./indexes/default/data.properties")
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

## 7: Searching an Index

Now that we have an index, let's perform retrieval on it!

In PyTerrier, search is done through the `BatchRetrieve` class.
Its constructor takes two main arguments:
- an index
- a weighting model

For instance, we can search for the word `"wall"` with our index and a term frequency (`Tf`) model by:

In [None]:
tf = pt.BatchRetrieve(index, wmodel="Tf")
tf.search("wall")  # NB. This can also be a multi-word expression (e.g. "white wall")

The `search()` method returns a DataFrame with columns:
 - `qid`: This is equal to "1" here since we only have a single query
 - `docid`: This is Terrier's internal integer for each document
 - `docno`: This is the external (string) unique identifier for each document
 - `rank`: This shows the descending order by score of retrieved documents
 - `score`: Since we use the `Tf` weighting model, this score corresponds to the total frequency of the query terms in each document
 - `query`: The input query
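
To see how such rows come about, here is a small pure-Python sketch that scores a made-up collection by summed term frequency and emits one row per matching document with the same column names. The collection, the `tf_search` helper and the choice to start ranks at 0 are assumptions of this sketch, not a description of PyTerrier's implementation.

```python
from collections import Counter

# Hypothetical mini-collection: the internal docid is the list position.
collection = [("d10", "wall wall garden"), ("d20", "dragon"), ("d30", "great wall")]

def tf_search(query: str, qid: str = "1") -> list:
    """Score each document by the summed frequency of the query terms."""
    query_terms = query.split()
    scored = []
    for docid, (docno, text) in enumerate(collection):
        counts = Counter(text.split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:  # keep only documents matching at least one query term
            scored.append((docid, docno, float(score)))
    scored.sort(key=lambda row: row[2], reverse=True)  # highest score first
    return [
        {"qid": qid, "docid": docid, "docno": docno, "rank": rank, "score": score, "query": query}
        for rank, (docid, docno, score) in enumerate(scored)
    ]

for row in tf_search("wall"):
    print(row)
```

Document `d20` does not contain the query term, so it is not returned at all.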

We can also pass a DataFrame of one or more queries, with columns `qid` and `query`, to the `transform()` method (rather than the `search()` method). Here the queries are numbered "q1", "q2", etc.

In [None]:
queries = pd.DataFrame([["q1", "dragon"], ["q2", "wall"]], columns=["qid", "query"])
tf.transform(queries)

Moreover, since `transform()` is the default method of a BatchRetrieve object `br`, we can directly write `br(queries)`:

In [None]:
tf(queries)
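
In plain Python, this kind of shorthand is typically obtained by defining `__call__` so that it delegates to `transform()`. The class below is a hypothetical toy, unrelated to PyTerrier's own classes, that only illustrates the pattern.

```python
class UppercaseQueries:
    """Hypothetical transformer-like class, unrelated to PyTerrier's own classes."""

    def transform(self, queries):
        return [q.upper() for q in queries]

    def __call__(self, queries):
        # Calling the object is shorthand for calling transform().
        return self.transform(queries)

t = UppercaseQueries()
print(t.transform(["wall"]) == t(["wall"]))  # True
```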

Finally, while we have used the simple `"Tf"` ranking function in the example above, Terrier supports many other models that can be used by simply changing the `wmodel="Tf"` argument of `BatchRetrieve` (e.g. `wmodel="BM25"` for BM25 scoring).
A list of supported models is available in the [documentation](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html).

We can also tune internal Terrier configurations through the `properties` and `controls` arguments.
For example, we can tune the $b$, $k_1$ and $k_3$ parameters of [BM25](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/BM25.html) (cf. Equation 4 [here](http://ir.dcs.gla.ac.uk/smooth/he-ecir05.pdf)) as follows:

In [None]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")  # default parameters
bm25v2 = pt.BatchRetrieve(index, wmodel="BM25", controls={"c": 0.1, "bm25.k_1": 2.0, "bm25.k_3": 10})
bm25v3 = pt.BatchRetrieve(index, wmodel="BM25", controls={"c": 8, "bm25.k_1": 1.4, "bm25.k_3": 10})

Here, the $b$ parameter is set via the generic `"c"` control parameter.
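
To build some intuition for what these parameters do, the sketch below implements a textbook form of the BM25 contribution of a single query term. The function name, the default values and the exact IDF variant are assumptions of this sketch; Terrier's implementation may differ in its details.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, nt, num_docs, qtf=1, b=0.75, k_1=1.2, k_3=8.0):
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log((num_docs - nt + 0.5) / (nt + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)    # b controls length normalization
    doc_part = ((k_1 + 1.0) * tf) / (k_1 * length_norm + tf)  # k_1 controls tf saturation
    query_part = ((k_3 + 1.0) * qtf) / (k_3 + qtf)            # k_3 controls query-tf saturation
    return idf * doc_part * query_part

# Same term frequency in a short and a long document (made-up statistics).
stats = dict(tf=3, avg_doc_len=100, nt=10, num_docs=1000)
short_doc = bm25_term_score(doc_len=50, **stats)
long_doc = bm25_term_score(doc_len=500, **stats)
print(f"short={short_doc:.3f} long={long_doc:.3f}")

# With b=0, document length no longer influences the score.
print(bm25_term_score(doc_len=50, b=0.0, **stats) == bm25_term_score(doc_len=500, b=0.0, **stats))
```

With $b > 0$, the same term frequency counts for less in a longer document; with $b = 0$, length normalization is switched off entirely.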

## 8: Measuring Retrieval Performance

Ranking metrics allow us to decide which search engine models are better than others for our application.

While we will look into evaluation metrics in a future lab, we can already use PyTerrier's `Experiment` abstraction to evaluate multiple (BatchRetrieve) systems on a set of queries `Q` and relevance assessments `RA`:
```python
pt.Experiment([br1, br2], Q, RA, eval_metrics=["map", "ndcg"])
```

For instance, we can evaluate the MAP and NDCG metrics of the models we defined so far on the first three topics of our collection as follows:

In [None]:
qrels_df = qrels_df.astype({'label': 'int32'})
pt.Experiment(
    retr_systems=[tf, bm25, bm25v2, bm25v3],
    names=['TF', 'BM25', 'BM25 (0.1, 2.0, 10)', 'BM25 (8, 1.4, 10)'],
    topics=topics_df[:3],
    qrels=qrels_df,
    eval_metrics=["map", "ndcg"])
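
As a preview of what the `map` column measures, here is a minimal sketch of (mean) average precision over made-up rankings and relevance labels. The function names and data are hypothetical, and evaluation tools handle edge cases (such as queries without relevant documents) in their own ways, so small differences from reported values are possible.

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Mean of the precision values at the ranks where relevant documents appear."""
    relevant = set(relevant_docnos)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels):
    """runs: qid -> ranked docnos; qrels: qid -> set of relevant docnos."""
    return sum(average_precision(runs[qid], qrels.get(qid, set())) for qid in runs) / len(runs)

runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d9"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}

print(average_precision(runs["q1"], qrels["q1"]))  # 0.5
print(mean_average_precision(runs, qrels))         # 0.75
```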