# PyTerrier Indexing Demo

This notebook takes you through indexing using [PyTerrier](https://github.com/terrier-org/pyterrier).

## Prerequisites

You will need PyTerrier installed. PyTerrier also needs Java to be installed, and will find most installations.

In [1]:
%pip install -q python-terrier

Note: you may need to restart the kernel to use updated packages.


## Initialisation

PyTerrier needs Java 11 installed. If it cannot find your Java installation, you can set the `JAVA_HOME` environment variable.

(Since version 0.11, calling `pt.init()` is no longer required; many of its options are now available under the `pt.java` and `pt.terrier` packages.)
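
If Java is not found automatically, one option is to set `JAVA_HOME` from Python before PyTerrier starts the JVM. The sketch below assumes a hypothetical JDK location (`/usr/lib/jvm/java-11-openjdk`) and a hypothetical helper name, `ensure_java_home`; substitute the path of your own installation.

```python
import os

# Hypothetical JDK location, for illustration only; replace with your own path.
ASSUMED_JDK_PATH = "/usr/lib/jvm/java-11-openjdk"

def ensure_java_home(default_path):
    """Return JAVA_HOME, setting it to default_path only if it is not already set."""
    return os.environ.setdefault("JAVA_HOME", default_path)

java_home = ensure_java_home(ASSUMED_JDK_PATH)
print(f"JAVA_HOME = {java_home}")
```

Run this before the first PyTerrier call that starts Java; in this notebook that is the construction of the indexer.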

In [2]:
import pyterrier as pt

## TREC Indexing

Here, we use PyTerrier's dataset API to obtain the [vaswani_npl corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a very small information retrieval test collection.

In [3]:
dataset = pt.get_dataset("vaswani")

print("Files in vaswani corpus: %s " % dataset.get_corpus())

Files in vaswani corpus: ['/Users/craigm/.pyterrier/corpora/vaswani/corpus/doc-text.trec'] 
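
The corpus file listed above is in TREC format: each document is wrapped in `<DOC>` tags and carries its identifier in a `<DOCNO>` tag. As a minimal sketch, the snippet below writes two made-up documents in that layout to a temporary file. The file name, the helper `to_trec`, and the document text are illustrative and are not taken from the vaswani corpus.

```python
import os
import tempfile

# Made-up documents for illustration: (docno, text) pairs.
docs = [
    ("1", "the quick brown fox"),
    ("2", "jumps over the lazy dog"),
]

def to_trec(docno, text):
    """Format one document in the TREC <DOC>/<DOCNO> layout."""
    return f"<DOC>\n<DOCNO>{docno}</DOCNO>\n{text}\n</DOC>\n"

# Write all documents to a single file in a fresh temporary directory.
trec_path = os.path.join(tempfile.mkdtemp(), "toy.trec")
with open(trec_path, "w", encoding="utf-8") as f:
    for docno, text in docs:
        f.write(to_trec(docno, text))

with open(trec_path, encoding="utf-8") as f:
    print(f.read())
```

A file like this can be indexed in the same way as the vaswani file in the cells that follow.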


In [4]:
index_path = "./index"

Create a `pt.TRECCollectionIndexer` object:
 - the `index_path` argument specifies where to store the index
 - the `blocks` argument specifies whether term positions should be recorded in the index. Positions are needed for phrase (`""`) queries and for term proximity ranking models.
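
To see why positions matter, here is a toy sketch in plain Python of a positional inverted index and a phrase check that relies on it. It is not how Terrier stores its index; the function names and the two documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> docno -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][docno].append(pos)
    return index

def phrase_match(index, phrase):
    """Return docnos where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    matches = set()
    for docno, starts in index.get(terms[0], {}).items():
        for start in starts:
            # Term i of the phrase must appear exactly i positions after the start.
            if all(start + i in index.get(t, {}).get(docno, [])
                   for i, t in enumerate(terms)):
                matches.add(docno)
                break
    return matches

docs = {
    "d1": "term proximity ranking models",
    "d2": "ranking term models proximity",
}
index = build_positional_index(docs)

# Without positions we can only tell that both terms occur somewhere.
contains_both = set(index["term"]) & set(index["proximity"])
print(sorted(contains_both))
print(sorted(phrase_match(index, "term proximity")))
```

Both documents contain the two terms, but only `d1` has them next to each other, which is the distinction that recorded positions make possible.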

In [5]:
!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path, blocks=True)

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/craigm/.m2/repository/org/terrier/terrier-assemblies/5.9/terrier-assemblies-5.9-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/craigm/opt/anaconda3/lib/python3.9/site-packages/pyserini/resources/jars/anserini-0.22.0-fatjar.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
Java started (triggered by TerrierIndexer.__init__) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.9 (build: craigm 2024-05-02 17:40), helper_version=0.0.8 prf_version=-SNAPSHOT], pyterrier.anserini.java [version=0.22.0 (from pyserini package)]


Index the files by calling the `index()` method on the `TRECCollectionIndexer` object.

In [6]:
indexref = indexer.index(dataset.get_corpus())

# index() takes either a single filename (a string) or a list of filenames
# indexer.index(["/vaswani_corpus/doc-text.trec",])
# indexer.index("/vaswani_corpus/doc-text.trec")


Let's see what we got from the indexer.

`IndexRef` is a Python object representing a Terrier [IndexRef](http://terrier.org/docs/current/javadoc/org/terrier/querying/IndexRef.html) object. You can think of it as a pointer or a URI. In this case, it points to the location of the main index file.

In [7]:
indexref.toString()

'./index/data.properties'

We can use the `IndexRef` to get more information about the index. For instance, to see the statistics of the index, let's use `index.getCollectionStatistics().toString()`. You can see that we have indexed 11429 documents, containing a total of 7756 unique terms.

In [8]:
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 0
Number of tokens: 271581
Field names: []
Positions:   true
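
Some derived figures follow directly from these statistics. As a small worked example, the snippet below hard-codes the numbers printed above, so no index is needed to run it. It assumes that each posting corresponds to one (term, document) pair.

```python
# Values copied from the collection statistics printed above.
num_docs = 11429
num_tokens = 271581
num_postings = 224573

# Average document length, in indexed tokens.
avg_doc_len = num_tokens / num_docs

# Average number of distinct terms per document,
# assuming each posting is one (term, document) pair.
avg_unique_terms = num_postings / num_docs

print(f"average document length: {avg_doc_len:.2f} tokens")
print(f"average distinct terms per document: {avg_unique_terms:.2f}")
```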



To index TXT, PDF, Microsoft Word and similar files, use `pt.FilesIndexer` instead of `pt.TRECCollectionIndexer`.

## Indexing a Pandas dataframe

Sometimes we have the documents that we want to index in memory. Terrier makes it easy to index standard Python data structures, including [Pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).


In [9]:
import pandas as pd
!rm -rf ./pd_index
pd_indexer = pt.IterDictIndexer("./pd_index")

# optionally change how indexing occurs, for instance by recording positions
# pd_indexer = pt.IterDictIndexer("./pd_index", blocks=True)

In [10]:
df = pd.DataFrame({ 
'docno':
['1', '2', '3'],
'url': 
['url1', 'url2', 'url3'],
'text': 
['He ran out of money, so he had to stop playing',
'The waves were crashing on the shore; it was a',
'The body may perhaps compensates for the loss']
})
df

| | docno | url | text |
|---|---|---|---|
| 0 | 1 | url1 | He ran out of money, so he had to stop playing |
| 1 | 2 | url2 | The waves were crashing on the shore; it was a |
| 2 | 3 | url3 | The body may perhaps compensates for the loss |


`IterDictIndexer` consumes an iterable of dicts, so the dataframe is first converted to a list of row dicts with `df.to_dict(orient='records')`.
By default, the `text` field of each dict is indexed as the document body and `docno` is stored as metadata.


In [11]:
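# IterDictIndexer expects an iterable of dicts, so convert each dataframe row to a dict.
# By default, the 'text' field is indexed and 'docno' is stored as metadata.
indexref2 = pd_indexer.index(df.to_dict(orient='records'))

# The Series-based calls below come from the older dataframe indexing API and
# do not apply to IterDictIndexer, which needs dicts rather than pandas.Series objects:
# pd_indexer.index(df["text"])                          # no metadata
# pd_indexer.index(df["text"], df["docno"], df["url"])  # Series names become meta field names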
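
The `df.to_dict(orient='records')` call above turns each dataframe row into one dict keyed by column name. The sketch below writes out the equivalent list by hand and checks it with a hypothetical helper, `missing_keys`, before indexing. The helper is an illustration and is not part of PyTerrier.

```python
# Equivalent of df.to_dict(orient='records') for the dataframe above, written by hand.
records = [
    {'docno': '1', 'url': 'url1', 'text': 'He ran out of money, so he had to stop playing'},
    {'docno': '2', 'url': 'url2', 'text': 'The waves were crashing on the shore; it was a'},
    {'docno': '3', 'url': 'url3', 'text': 'The body may perhaps compensates for the loss'},
]

def missing_keys(rows, required=('docno', 'text')):
    """Return (row_number, missing_key) pairs for rows lacking a required key."""
    return [(i, key) for i, row in enumerate(rows)
            for key in required if key not in row]

problems = missing_keys(records)
print(problems or "all records have docno and text")
```

Checking the records up front can make a missing `docno` or `text` column easier to spot than an error raised part-way through indexing.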

## Indexing an iterable, generator, etc.

You may not want to load all documents into memory, particularly for large collections. Terrier can index iterable objects (e.g., generators) that yield `dict` objects.

To do this, we also use `pt.IterDictIndexer()`. By default, `text` will be indexed and `docno` will be stored in the meta index. These can be configured with the `fields` and `meta` parameters, respectively.

Indexing this corpus of 400k passages takes around 30 seconds.

In [12]:
# As an example, we will stream the ANTIQUE collection.
# It is formatted as "[docno] \t [text] \n"
import urllib.request  # the request submodule must be imported explicitly
import io
def antique_doc_iter():
    stream = urllib.request.urlopen('https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt')
    stream = io.TextIOWrapper(stream)
    for i, line in enumerate(stream):
        if i % 100000 == 0:
            print(f'processing document {i}')
        docno, text = line.rstrip().split('\t')
        yield {'docno': docno, 'text': text}

!rm -rf ./iter_index
iter_indexer = pt.IterDictIndexer("./iter_index")

doc_iter = antique_doc_iter()
indexref3 = iter_indexer.index(doc_iter)

# Additional fields can be added in the dict. You can configure which fields are
# indexed and which are used as metadata with the fields and meta parameters.
# yield {'docno': docno, 'title': title, 'text': text, 'url': url}
# iter_indexer.index(doc_iter, fields=['text', 'title'], meta=['docno', 'url'])

processing document 0
processing document 100000
processing document 200000
processing document 300000
processing document 400000
16:06:10.750 [ForkJoinPool-2-worker-1] WARN org.terrier.structures.indexing.Indexer - Indexed 2224 empty documents
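
The line parsing in the generator can be tried without downloading anything. The sketch below applies the same `docno \t text` split to an in-memory stream. `parse_tsv_docs` is a hypothetical name, and it uses `partition('\t')` so that a line with no text yields an empty document rather than raising an error, which is an assumption about how you might want to treat such lines.

```python
import io

def parse_tsv_docs(stream):
    """Yield {'docno', 'text'} dicts from lines formatted as 'docno<TAB>text'."""
    for line in stream:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # partition never raises, even if the tab or the text is missing
        docno, _, text = line.partition('\t')
        yield {'docno': docno, 'text': text.strip()}

sample = io.StringIO("d1\tfirst passage\nd2\tsecond passage\nd3\t\n")
docs = list(parse_tsv_docs(sample))
print(docs)
```

A generator like this could be passed to `iter_indexer.index(...)` in place of `antique_doc_iter()`.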


## Retrieval

Let's see how we can use one of these indexes for retrieval. Retrieval takes place using the `Retriever` object, by invoking its `transform()` method for one or more queries. For a quick test, you can pass a single query string to `search()`.

`Retriever` returns the results as a Pandas dataframe.


In [13]:
pt.terrier.Retriever(indexref).search("mathematical")

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 303 | 304 | 0 | 3.566201 | mathematical |
| 1 | 1 | 2444 | 2445 | 1 | 3.566201 | mathematical |
| 2 | 1 | 3534 | 3535 | 2 | 3.566201 | mathematical |
| 3 | 1 | 5040 | 5041 | 3 | 3.566201 | mathematical |
| 4 | 1 | 1169 | 1170 | 4 | 3.564534 | mathematical |
| ... | ... | ... | ... | ... | ... | ... |
| 147 | 1 | 7283 | 7284 | 147 | 2.834784 | mathematical |
| 148 | 1 | 6714 | 6715 | 148 | 2.811375 | mathematical |
| 149 | 1 | 4746 | 4747 | 149 | 2.790373 | mathematical |
| 150 | 1 | 8622 | 8623 | 150 | 2.759409 | mathematical |


However, most IR experiments use a set of queries. You can pass such a set as a dataframe with `qid` and `query` columns.

In [14]:
import pandas as pd
topics = pd.DataFrame([["2", "mathematical"]],columns=['qid','query'])
pt.terrier.Retriever(indexref).transform(topics)

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 2 | 303 | 304 | 0 | 3.566201 | mathematical |
| 1 | 2 | 2444 | 2445 | 1 | 3.566201 | mathematical |
| 2 | 2 | 3534 | 3535 | 2 | 3.566201 | mathematical |
| 3 | 2 | 5040 | 5041 | 3 | 3.566201 | mathematical |
| 4 | 2 | 1169 | 1170 | 4 | 3.564534 | mathematical |
| ... | ... | ... | ... | ... | ... | ... |
| 147 | 2 | 7283 | 7284 | 147 | 2.834784 | mathematical |
| 148 | 2 | 6714 | 6715 | 148 | 2.811375 | mathematical |
| 149 | 2 | 4746 | 4747 | 149 | 2.790373 | mathematical |
| 150 | 2 | 8622 | 8623 | 150 | 2.759409 | mathematical |


That's the end of the indexing tutorial. You can continue with the other example notebooks; good next steps are retrieval_and_evaluation.ipynb and experiment.ipynb.