# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 2: Indexing & retrieval**

In this notebook we'll learn how to

- create a simple searchable index of a document corpus in PyTerrier and
- retrieve documents based on a query from that index (_ad-hoc retrieval_).


In [1]:
pip install python-terrier==0.10.0

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm="notebook")

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## The data

For our simple example, we'll use a collection of works by William Shakespeare as our document corpus. They are available, collated in a single text file, [here](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt). We can download this file directly:


In [3]:
from urllib import request

with request.urlopen(
    "https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt"
) as u:
    shakespeare_complete = u.read().decode("utf-8")

Before we can index the works, we need to split the file into individual documents. Let's take a look at the beginning of the file:


In [4]:
print(shakespeare_complete[:15000])

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!

Shakespeare

*This Etext has certain copyright implications you should read!*

<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>

*Project Gutenberg is proud to cooperate with The World Library*
in the presentation of The Complete Works of William Shakespeare
for your reading for educatio

You can see that each work starts with

```
<YEAR>

<TITLE>

by William Shakespeare
```

and ends with `THE END`.

We can use a regular expression to extract each individual work (including year and title). We'll package the whole thing as a generator that yields a dictionary for each document, which contains

- a document ID,
- the year,
- the title,
- the document text.

The unique document ID (`docno`) lets us identify documents in the index later. Note that PyTerrier also assigns its own internal IDs (`docid`), which show up next to the `docno` in the search results.


In [9]:
import re


def shakespeare_generator():
    for i, item in enumerate(
        re.compile(
            r"((\d{4})\s*?([A-Z ]+)\s*?by William Shakespeare.*?THE END)",
            re.DOTALL,
        ).finditer(shakespeare_complete)
    ):
        yield {
            "docno": f"D{i}",
            "year": item.group(2),
            "title": item.group(3),
            "text": item.group(1),
        }
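To see how the pattern behaves, here is a minimal sketch that applies the same regular expression to a tiny made-up string (the `sample` text and its titles are hypothetical, not taken from the corpus). It also illustrates one limitation: since the title group only allows upper-case letters and spaces, a work whose title contains any other character, such as an apostrophe, is not matched.

```python
import re

# Same pattern as in the generator above.
WORK_PATTERN = re.compile(
    r"((\d{4})\s*?([A-Z ]+)\s*?by William Shakespeare.*?THE END)",
    re.DOTALL,
)

# Hypothetical miniature corpus, made up for illustration only.
sample = (
    "Some preamble text.\n\n"
    "1601\n\nFIRST PLAY\n\nby William Shakespeare\n\nLine one.\n\nTHE END\n\n"
    "1602\n\nSECOND PLAY\n\nby William Shakespeare\n\nLine two.\n\nTHE END\n\n"
    # The apostrophe is not covered by [A-Z ], so this work is skipped.
    "1603\n\nA KING'S TALE\n\nby William Shakespeare\n\nLine three.\n\nTHE END\n"
)

# Collect (year, title) for every work the pattern finds.
matches = [(m.group(2), m.group(3)) for m in WORK_PATTERN.finditer(sample)]
print(matches)
```

Only the first two works are extracted here. Whether this limitation matters for the real file depends on the titles it contains, so it can be worth comparing the number of extracted documents with the number of works you expect.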

Let's give it a spin and print the first document:


In [10]:
from pprint import pprint

for x in shakespeare_generator():
    pprint(x)
    break

{'docno': 'D0',
 'text': '1609\n'
         '\n'
         'THE SONNETS\n'
         '\n'
         'by William Shakespeare\n'
         '\n'
         '\n'
         '\n'
         '                     1\n'
         '  From fairest creatures we desire increase,\n'
         "  That thereby beauty's rose might never die,\n"
         '  But as the riper should by time decease,\n'
         '  His tender heir might bear his memory:\n'
         '  But thou contracted to thine own bright eyes,\n'
         "  Feed'st thy light's flame with self-substantial fuel,\n"
         '  Making a famine where abundance lies,\n'
         '  Thy self thy foe, to thy sweet self too cruel:\n'
         "  Thou that art now the world's fresh ornament,\n"
         '  And only herald to the gaudy spring,\n'
         '  Within thine own bud buriest thy content,\n'
         "  And tender churl mak'st waste in niggarding:\n"
         '    Pity the world, or else this glutton be,\n'
         "    To eat the world's due, b

## Indexing

We can use this generator to index our collection. `pyterrier.IterDictIndexer` will consume our iterator and build the index. We just need to tell it a path for our index (`shakespeare_index`) and the metadata we want to store (along with the corresponding maximum length).

Note that we also pass the arguments `stemmer="porter"` and `stopwords="terrier"`; this is optional, as PyTerrier applies Porter stemming and stopword removal by default, but these arguments can be used to customize that behaviour.


In [11]:
from pathlib import Path

indexer = pt.IterDictIndexer(
    str(Path("shakespeare_index").absolute()),
    meta={
        "docno": 4,
        "year": 4,
        "title": 32,
        "text": 131072,
    },
    stemmer="porter",
    stopwords="terrier",
)

Now we can index our collection. By default, only the field `text` will be indexed. Since the text contains both year and title in our case, we'll keep the default. To change this behavior, you can set, for example, `fields=("text", "some_other_field")` if you want `some_other_field` to be searchable as well.

The `index` method returns a _reference_ to our newly created index:


In [12]:
index_ref = indexer.index(shakespeare_generator())

There are many different indexers available. The PyTerrier documentation has a [complete list](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html#indexer-classes).
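To build some intuition for what an indexer produces, the following is a minimal pure-Python sketch of an inverted index. It is not how Terrier is implemented; the toy documents, the two-word stopword list, and the function name `build_inverted_index` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy documents, in the same dictionary format as above.
toy_docs = [
    {"docno": "T0", "text": "the king is dead"},
    {"docno": "T1", "text": "long live the king"},
    {"docno": "T2", "text": "the queen lives"},
]

# Tiny illustrative stopword list; a real list is much longer.
STOPWORDS = {"the", "is"}


def build_inverted_index(docs):
    """Map each term to a dictionary {docno: term frequency}."""
    index = defaultdict(dict)
    for doc in docs:
        # Lower-case the text and split it into alphabetic tokens.
        for token in re.findall(r"[a-z]+", doc["text"].lower()):
            if token in STOPWORDS:
                continue  # stopwords are not indexed
            postings = index[token]
            postings[doc["docno"]] = postings.get(doc["docno"], 0) + 1
    return dict(index)


toy_index = build_inverted_index(toy_docs)
print(toy_index["king"])
```

Note that this sketch does no stemming, so `live` and `lives` end up as two separate terms. A stemmer such as Porter's would typically map both to the same term.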

## Retrieval

In order to search in our index, we use `pyterrier.BatchRetrieve`. [Terrier supports lots of weighting models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html), and we can specify one using the `wmodel` parameter. For now, we'll use simple TF-IDF.

By setting the `metadata` argument, we can tell PyTerrier to retrieve any metadata that we added earlier (such as the titles) along with the document IDs.


In [13]:
tf_idf = pt.BatchRetrieve(
    index_ref, wmodel="TF_IDF", num_results=10, metadata=["docno", "title"]
)

This model can be used directly to search. The result is a `pandas.DataFrame`:


In [14]:
tf_idf.search("tragedy")

|    | qid | docid | docno | title | rank | score | query |
|---:|:----|------:|:------|:------|-----:|------:|:------|
| 0 | 1 | 28 | D28 | THE TRAGEDY OF TITUS ANDRONICUS | 0 | 1.600345 | tragedy |
| 1 | 1 | 11 | D11 | THE SECOND PART OF KING HENRY T | 1 | 1.317369 | tragedy |
| 2 | 1 | 17 | D17 | THE TRAGEDY OF MACBETH | 2 | 1.100422 | tragedy |
| 3 | 1 | 15 | D15 | THE TRAGEDY OF JULIUS CAESAR | 3 | 1.061067 | tragedy |
| 4 | 1 | 10 | D10 | THE FIRST PART OF HENRY THE SIX | 4 | 0.996237 | tragedy |
| 5 | 1 | 24 | D24 | THE TRAGEDY OF ROMEO AND JULIET | 5 | 0.959452 | tragedy |
| 6 | 1 | 2 | D2 | THE TRAGEDY OF ANTONY AND CLEOP | 6 | 0.951813 | tragedy |
| 7 | 1 | 12 | D12 | THE THIRD PART OF KING HENRY TH | 7 | 0.945801 | tragedy |
| 8 | 1 | 5 | D5 | THE TRAGEDY OF CORIOLANUS | 8 | 0.934525 | tragedy |
| 9 | 1 | 16 | D16 | THE TRAGEDY OF KING LEAR | 9 | 0.931893 | tragedy |
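To get a feeling for where such scores come from, here is a small self-contained sketch of textbook TF-IDF ranking, where a document's score is the sum of `tf * log(N / df)` over the query terms. This is an illustration under simplifying assumptions (hypothetical toy documents, no length normalization); Terrier's `TF_IDF` weighting model is defined differently in its details, so the numbers are not comparable to the ones above.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
toy_docs = {
    "T0": ["king", "dead", "king"],
    "T1": ["long", "live", "king"],
    "T2": ["queen", "lives"],
}


def tf_idf_rank(query_terms, docs):
    """Rank documents by the sum of tf * log(N / df) over the query terms."""
    n_docs = len(docs)
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for tokens in docs.values() if term in tokens)
        if df == 0:
            continue  # the term does not occur in the collection
        idf = math.log(n_docs / df)
        for docno, tokens in docs.items():
            tf = Counter(tokens)[term]
            if tf > 0:
                scores[docno] = scores.get(docno, 0.0) + tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = tf_idf_rank(["king"], toy_docs)
print(ranking)
```

In this toy collection, `T0` is ranked above `T1` because it contains the query term twice, while `T2` does not contain it and is not returned at all.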


As the name `BatchRetrieve` suggests, you can also retrieve documents for a batch of queries at once. The queries must be passed as a `pandas.DataFrame` with the columns `qid` and `query`:


In [15]:
import pandas as pd

tf_idf(
    pd.DataFrame(
        [
            ["Q1", "a public place"],
            ["Q2", "king henry"],
        ],
        columns=["qid", "query"],
    )
)

|    | qid | docid | docno | title | rank | score | query |
|---:|:----|------:|:------|:------|-----:|------:|:------|
| 0 | Q1 | 5 | D5 | THE TRAGEDY OF CORIOLANUS | 0 | 2.522378 | a public place |
| 1 | Q1 | 0 | D0 | THE SONNETS | 1 | 2.416732 | a public place |
| 2 | Q1 | 15 | D15 | THE TRAGEDY OF JULIUS CAESAR | 2 | 2.394695 | a public place |
| 3 | Q1 | 18 | D18 | MEASURE FOR MEASURE | 3 | 2.32852 | a public place |
| 4 | Q1 | 2 | D2 | THE TRAGEDY OF ANTONY AND CLEOP | 4 | 2.314671 | a public place |
| 5 | Q1 | 10 | D10 | THE FIRST PART OF HENRY THE SIX | 5 | 2.289707 | a public place |
| 6 | Q1 | 27 | D27 | THE LIFE OF TIMON OF ATHENS | 6 | 2.267233 | a public place |
| 7 | Q1 | 6 | D6 | CYMBELINE | 7 | 2.244214 | a public place |
| 8 | Q1 | 24 | D24 | THE TRAGEDY OF ROMEO AND JULIET | 8 | 2.22056 | a public place |
| 9 | Q1 | 3 | D3 | AS YOU LIKE IT | 9 | 2.169172 | a public place |
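When there are more than a handful of queries, typing out the rows by hand gets tedious. The following sketch builds the query frame from a plain list of strings instead. The query strings and the `Q<n>` naming scheme for the IDs are arbitrary choices made for this example.

```python
import pandas as pd

# Hypothetical query strings; any list of strings would do.
queries = ["a public place", "king henry", "tragedy"]

# One row per query, with a string ID in `qid` and the text in `query`.
topics = pd.DataFrame(
    {
        "qid": [f"Q{i}" for i in range(1, len(queries) + 1)],
        "query": queries,
    }
)
print(topics)
```

The resulting frame has the same two columns as the one above, so it can be passed to the retrieval model in the same way, for example `tf_idf(topics)`.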


## Loading an index

Once you have created your index on disk, you can always load it rather than re-indexing the collection every time. Let's delete our index reference and access the index directly from disk:


In [16]:
del index_ref

pt.BatchRetrieve(
    str(Path("shakespeare_index").absolute()),
    wmodel="TF_IDF",
    num_results=10,
    metadata=["docno", "title"],
).search("tragedy")

|    | qid | docid | docno | title | rank | score | query |
|---:|:----|------:|:------|:------|-----:|------:|:------|
| 0 | 1 | 28 | D28 | THE TRAGEDY OF TITUS ANDRONICUS | 0 | 1.600345 | tragedy |
| 1 | 1 | 11 | D11 | THE SECOND PART OF KING HENRY T | 1 | 1.317369 | tragedy |
| 2 | 1 | 17 | D17 | THE TRAGEDY OF MACBETH | 2 | 1.100422 | tragedy |
| 3 | 1 | 15 | D15 | THE TRAGEDY OF JULIUS CAESAR | 3 | 1.061067 | tragedy |
| 4 | 1 | 10 | D10 | THE FIRST PART OF HENRY THE SIX | 4 | 0.996237 | tragedy |
| 5 | 1 | 24 | D24 | THE TRAGEDY OF ROMEO AND JULIET | 5 | 0.959452 | tragedy |
| 6 | 1 | 2 | D2 | THE TRAGEDY OF ANTONY AND CLEOP | 6 | 0.951813 | tragedy |
| 7 | 1 | 12 | D12 | THE THIRD PART OF KING HENRY TH | 7 | 0.945801 | tragedy |
| 8 | 1 | 5 | D5 | THE TRAGEDY OF CORIOLANUS | 8 | 0.934525 | tragedy |
| 9 | 1 | 16 | D16 | THE TRAGEDY OF KING LEAR | 9 | 0.931893 | tragedy |
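To decide whether an index can be reused or has to be built first, we can check the index directory before doing anything else. The sketch below assumes that a finished on-disk Terrier index contains a file named `data.properties`; the helper name `index_exists` is made up for this example.

```python
from pathlib import Path


def index_exists(index_dir):
    """Return True if index_dir looks like a finished Terrier index.

    Assumption: a completed on-disk index contains a `data.properties` file.
    """
    return (Path(index_dir) / "data.properties").is_file()


index_dir = Path("shakespeare_index").absolute()
if index_exists(index_dir):
    print(f"Reusing existing index at {index_dir}")
else:
    print(f"No index found at {index_dir}; it needs to be built first")
```

With a check like this, the indexing cell only needs to run when the index is missing, and later runs of the notebook can go straight to retrieval.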


Note that whenever you share one index among multiple models, it is best practice to load the index once and pass the resulting index object to each model, rather than passing a reference or path every time:


In [17]:
index = pt.IndexFactory.of(str(Path("shakespeare_index").absolute()))
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
dirichlet_lm = pt.BatchRetrieve(index, wmodel="DirichletLM")

## Memory indexes

The index we created above is stored on disk. An alternative that can be useful for small corpora is a _memory index_, which is kept entirely in main memory and is therefore typically faster to access.

We can create a memory index by specifying `type=pyterrier.index.IndexingType.MEMORY`. Note that the index path must still be valid, even though it will be ignored. Hence, we can simply pass the current working directory:


In [18]:
memory_index = pt.index.IterDictIndexer(
    str(Path.cwd()),  # this will be ignored
    meta={
        "docno": 4,
        "year": 4,
        "title": 32,
        "text": 131072,
    },
    type=pt.index.IndexingType.MEMORY,
).index(shakespeare_generator())

Now we can use the index just as before:


In [19]:
pt.BatchRetrieve(memory_index, wmodel="TF_IDF").search("tragedy")



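|    | qid | docid | docno | rank | score | query |
|---:|:----|------:|:------|-----:|------:|:------|
| 0 | 1 | 28 | D28 | 0 | 1.600345 | tragedy |
| 1 | 1 | 11 | D11 | 1 | 1.317369 | tragedy |
| 2 | 1 | 17 | D17 | 2 | 1.100422 | tragedy |
| 3 | 1 | 15 | D15 | 3 | 1.061067 | tragedy |
| 4 | 1 | 10 | D10 | 4 | 0.996237 | tragedy |
| 5 | 1 | 24 | D24 | 5 | 0.959452 | tragedy |
| 6 | 1 | 2 | D2 | 6 | 0.951813 | tragedy |
| 7 | 1 | 12 | D12 | 7 | 0.945801 | tragedy |
| 8 | 1 | 5 | D5 | 8 | 0.934525 | tragedy |
| 9 | 1 | 16 | D16 | 9 | 0.931893 | tragedy |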


## Further reading

Check out the [indexing guide](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html) in the official documentation.
