# PyTerrier Indexing Demo

This notebook takes you through indexing using [PyTerrier](https://github.com/terrier-org/pyterrier).

## Prerequisites

You will need PyTerrier installed. PyTerrier also needs Java to be installed, and will find most installations.

In [1]:
!pip install python-terrier
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-1xbb7l2t/python-terrier
  Running command git clone -q https://github.com/terrier-org/pyterrier.git /tmp/pip-install-1xbb7l2t/python-terrier
Collecting pyjnius~=1.3.0
  Downloading https://files.pythonhosted.org/packages/d8/50/098cb5fb76fb7c7d99d403226a2a63dcbfb5c129b71b7d0f5200b05de1f0/pyjnius-1.3.0-cp36-cp36m-manylinux2010_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 2.8MB/s
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval
  Downloading https://files.pythonhosted.org/packages/36/0a/5809ba805e62c98f81e19d6007132712945c78e7612c11f61bac76a25ba3/pytrec_eval-0.4.tar.gz
Collecting matchpy
  Downloading https://files.pythonhosted.org/packages/47/95/d265b944ce391bb2fa9982d7506bbb197bb55c5088ea74448a5ffcaeefab/matchpy-0.5.1-py3-none-any

## Init 

You must run `pt.init()` before using any other PyTerrier functions or classes.

Optional arguments:
 - `version` - the Terrier version to use, e.g. "5.2"
 - `mem` - megabytes of memory allocated to Java, e.g. "4096"
 - `packages` - external Java packages for Terrier to load, e.g. ["org.terrier:terrier.prf"]
 - `logging` - logging level for Terrier. Defaults to "WARN"; use "INFO" or "DEBUG" for more output.

NB: PyTerrier needs Java 11 installed. If it cannot find your Java installation, you can set the `JAVA_HOME` environment variable.

In [2]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.2  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.2  jar not found, downloading to /root/.pyterrier...
Done
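
The optional arguments and the `JAVA_HOME` hint above can be combined as in the following minimal sketch. The helper names `configure_java_home` and `start_pyterrier` are hypothetical (they are not part of PyTerrier), and the argument values are only the illustrative ones from the list above. PyTerrier is imported lazily, so the first helper can be used even before PyTerrier is installed.

```python
import os


def configure_java_home(candidate):
    """Set JAVA_HOME to `candidate` if it is not already set and the directory exists.

    Hypothetical helper: returns True only when JAVA_HOME was changed.
    """
    if os.environ.get("JAVA_HOME"):
        return False  # respect an existing setting
    if not os.path.isdir(candidate):
        return False  # do not point JAVA_HOME at a missing directory
    os.environ["JAVA_HOME"] = candidate
    return True


def start_pyterrier():
    """Start PyTerrier once, using the optional arguments described above."""
    import pyterrier as pt  # imported here so this module loads without PyTerrier

    if not pt.started():
        # Illustrative values only; omit any argument to keep its default.
        pt.init(version="5.2", mem=4096, logging="INFO")
    return pt
```

A typical use would be to call `configure_java_home(...)` with the path of your own Java 11 installation, and then `start_pyterrier()`, before running any indexing code.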


## TREC Indexing

Here, we are going to make use of PyTerrier's dataset API. We will use the [vaswani_npl corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a very small information retrieval test collection.

In [5]:
dataset = pt.datasets.get_dataset("vaswani")

print("Files in vaswani corpus: %s " % dataset.get_corpus())

Files in vaswani corpus: ['/root/.pyterrier/corpora/vaswani/corpus/doc-text.trec'] 


In [0]:
index_path = "./index"

Create a `pt.TRECCollectionIndexer` object. The `index_path` argument specifies where to store the index.

In [8]:
!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path, blocks=True)

IndexingType.CLASSIC
IndexingType.CLASSIC


Index the files by calling the `index()` method on the `TRECCollectionIndexer` object.

In [0]:
indexref = indexer.index(dataset.get_corpus())

# The index() method takes either a single file name (a string) or a list of file names
# indexer.index(["/vaswani_corpus/doc-text.trec",])
# indexer.index("/vaswani_corpus/doc-text.trec")
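
Since `index()` accepts a list of file names, a corpus spread over several files can be collected first and passed in one call. The sketch below uses only the standard library; `find_corpus_files` is a hypothetical helper, and the `*.trec` pattern is an assumption about how your files are named.

```python
from pathlib import Path


def find_corpus_files(root, pattern="*.trec"):
    """Return a sorted list of paths (as strings) under `root` that match `pattern`."""
    return sorted(str(p) for p in Path(root).rglob(pattern) if p.is_file())


# Usage with the indexer created above (directory name is illustrative):
# indexref = indexer.index(find_corpus_files("/vaswani_corpus"))
```

Sorting the paths keeps the order in which files are indexed the same from run to run.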


Let's see what we got from the indexer.

`indexref` is a Python object representing a Terrier [IndexRef](http://terrier.org/docs/current/javadoc/org/terrier/querying/IndexRef.html) object. You can think of this like a pointer, or a URI. In this case, it points to the location of the main index file.

In [12]:
indexref.toString()

'./index/data.properties'

We can use that to get more information about the index. For instance, to see the statistics of the index, let's use `index.getCollectionStatistics().toString()`. You can see that we have indexed 11429 documents, containing a total of 7756 unique terms.

In [13]:
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

Number of documents: 11429
Number of terms: 7756
Number of fields: 0
Field names: []
Number of tokens: 271581
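
As a small worked example, the printed statistics can be turned into numbers and used to derive the average document length (tokens per document). This sketch parses the text output shown above rather than calling any Terrier API; `parse_stats` is a hypothetical helper.

```python
stats_text = """Number of documents: 11429
Number of terms: 7756
Number of fields: 0
Field names: []
Number of tokens: 271581"""


def parse_stats(text):
    """Parse 'Key: value' lines, keeping only the entries with integer values."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if value.isdigit():
            stats[key.strip()] = int(value)
    return stats


stats = parse_stats(stats_text)
avg_doc_length = stats["Number of tokens"] / stats["Number of documents"]
print(f"Average document length: {avg_doc_length:.2f} tokens")  # prints 23.76
```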



To index TXT, PDF, Microsoft Word and similar files, use `pt.FilesIndexer` instead of `pt.TRECCollectionIndexer`.

## Indexing a Pandas dataframe

Sometimes we have the documents that we want to index in memory. Terrier makes it easy to index standard Python data structures, particularly [Pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

To do this, we can use a `pt.DFIndexer()` object.

In [16]:
import pandas as pd
!rm -rf ./pd_index
pd_indexer = pt.DFIndexer("./pd_index")

# optionally modify the indexing properties
# index_properties = {"block.indexing":"true", "invertedfile.lexiconscanner":"pointers"}
# pd_indexer.setProperties(**index_properties)

IndexingType.CLASSIC


In [0]:
df = pd.DataFrame({ 
'docno':
['1', '2', '3'],
'url': 
['url1', 'url2', 'url3'],
'text': 
['He ran out of money, so he had to stop playing',
'The waves were crashing on the shore; it was a',
'The body may perhaps compensates for the loss']
})

In [18]:
df

Unnamed: 0,docno,url,text
0,1,url1,"He ran out of money, so he had to stop playing"
1,2,url2,The waves were crashing on the shore; it was a
2,3,url3,The body may perhaps compensates for the loss


There are a number of ways to index the dataframe.
The first argument should always be a `pandas.Series` object of strings, which specifies the body of each document.
Any arguments after that specify metadata.


In [0]:
# no metadata
# pd_indexer.index(df["text"])

# Add metadata fields as Pandas.Series objects, with the name of the Series object becoming the name of the meta field.
indexref2 = pd_indexer.index(df["text"], df["docno"])
# pd_indexer.index(df["text"], df["docno"], df["url"])

# Add metadata fields as lists passed as keyword arguments
# pd_indexer.index(df["text"], docno=["1","2","3"], url=["url1", "url2", "url3"])

# Add the metadata fields with a dictionary
# meta_fields={"docno":["1","2","3"],"url":["url1", "url2", "url3"]}
# pd_indexer.index(df["text"], **meta_fields)

# Add the entire dataframe as metadata
# pd_indexer.index(df["text"], df)
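
When metadata is passed as plain lists, a mismatch between the number of documents and the number of metadata values is an easy mistake to make. The following is a hypothetical pre-check in plain Python, not part of PyTerrier; it assumes every metadata list should contain exactly one value per document.

```python
def check_meta_lengths(texts, **meta_fields):
    """Raise ValueError unless every metadata list has one entry per document."""
    expected = len(texts)
    wrong = {name: len(values) for name, values in meta_fields.items()
             if len(values) != expected}
    if wrong:
        raise ValueError(f"expected {expected} entries per meta field, got {wrong}")
    return meta_fields


texts = [
    "He ran out of money, so he had to stop playing",
    "The waves were crashing on the shore; it was a",
    "The body may perhaps compensates for the loss",
]
meta_fields = {"docno": ["1", "2", "3"], "url": ["url1", "url2", "url3"]}
check_meta_lengths(texts, **meta_fields)

# The checked dictionary can then be unpacked as in the cell above:
# pd_indexer.index(df["text"], **meta_fields)
```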

## Indexing an iterable, generator, etc.

You may not want to load all documents into memory, particularly for large collections. Terrier can index iterable objects (e.g., generators) that yield `dict` objects.

To do this, we can use a `pt.IterDictIndexer()` object. By default, `text` will be indexed and `docno` will be stored in the meta index. These can be configured with the `fields` and `meta` parameters, respectively.

In [3]:
# As an example, we will stream the ANTIQUE collection.
# It is formatted as "[docno] \t [text] \n"
import urllib.request  # importing only `urllib` does not guarantee that `urllib.request` is loaded
import io
def antique_doc_iter():
    stream = urllib.request.urlopen('https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt')
    stream = io.TextIOWrapper(stream)
    for i, line in enumerate(stream):
        if i % 100000 == 0:
            print(f'processing document {i}')
        docno, text = line.rstrip().split('\t')
        yield {'docno': docno, 'text': text}

!rm -rf ./iter_index
iter_indexer = pt.IterDictIndexer("./iter_index")

doc_iter = antique_doc_iter()
indexref3 = iter_indexer.index(doc_iter)

# Additional fields can be added in the dict. You can configure which fields are
# indexed and which are used as metadata with the fields and meta parameters.
# yield {'docno': docno, 'title': title, 'text': text, 'url': url}
# iter_indexer.index(doc_iter, fields=['text', 'title'], meta=['docno', 'url'])

processing document 0
processing document 100000
processing document 200000
processing document 300000
processing document 400000
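
The same pattern works for any line-oriented source. Below is a minimal sketch of a generator for tab-separated input that tolerates malformed lines: it splits on the first tab only, so tabs inside the text are kept, and it skips lines without a `docno` or without text. `tsv_doc_iter` is a hypothetical name, and the sample data is made up for illustration.

```python
import io


def tsv_doc_iter(lines):
    """Yield {'docno': ..., 'text': ...} dicts from '[docno] \\t [text]' lines."""
    for line in lines:
        docno, sep, text = line.rstrip("\n").partition("\t")
        docno, text = docno.strip(), text.strip()
        if not sep or not docno or not text:
            continue  # skip lines with no tab, no docno, or no text
        yield {"docno": docno, "text": text}


# Any iterable of lines works: an open file, a network stream, or a StringIO.
sample = io.StringIO(
    "d1\tfirst document\n"
    "malformed line without a tab\n"
    "d2\ttext with\tan extra tab\n"
)
docs = list(tsv_doc_iter(sample))
print(docs)

# With a real indexer, the generator would be passed directly:
# indexref = pt.IterDictIndexer("./iter_index").index(tsv_doc_iter(open_file))
```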


## Retrieval

Let's see how we can use one of these indexes for retrieval. Retrieval takes place using the `BatchRetrieve` object, by invoking its `transform()` method for one or more queries. For a quick test, you can pass a single query string to `search()`.

BatchRetrieve will return the results as a Pandas dataframe.


In [21]:
pt.BatchRetrieve(indexref).search("mathematical")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,5040,5041,0,3.566201,mathematical
1,1,303,304,1,3.566201,mathematical
2,1,3534,3535,2,3.566201,mathematical
3,1,2444,2445,3,3.566201,mathematical
4,1,5011,5012,4,3.564534,mathematical
...,...,...,...,...,...,...
147,1,7283,7284,147,2.834784,mathematical
148,1,6714,6715,148,2.811375,mathematical
149,1,4746,4747,149,2.790373,mathematical
150,1,8622,8623,150,2.759409,mathematical


However, most IR experiments will use a set of queries. You can pass such a set to `transform()` as a dataframe with `qid` and `query` columns.

In [20]:
import pandas as pd
topics = pd.DataFrame([["2", "mathematical"]],columns=['qid','query'])
pt.BatchRetrieve(indexref).transform(topics)

Unnamed: 0,qid,docid,docno,rank,score,query
0,2,5040,5041,0,3.566201,mathematical
1,2,303,304,1,3.566201,mathematical
2,2,3534,3535,2,3.566201,mathematical
3,2,2444,2445,3,3.566201,mathematical
4,2,5011,5012,4,3.564534,mathematical
...,...,...,...,...,...,...
147,2,7283,7284,147,2.834784,mathematical
148,2,6714,6715,148,2.811375,mathematical
149,2,4746,4747,149,2.790373,mathematical
150,2,8622,8623,150,2.759409,mathematical
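
For more than one query, the topics dataframe can be built from an ordinary dictionary. This is a minimal sketch: `make_topics` is a hypothetical helper, the second query is invented for illustration, and converting `qid` to a string simply mirrors the string `qid` used in the cell above.

```python
import pandas as pd


def make_topics(queries):
    """Build a topics dataframe with 'qid' and 'query' columns from {qid: query}."""
    rows = [(str(qid), query) for qid, query in queries.items()]
    return pd.DataFrame(rows, columns=["qid", "query"])


topics = make_topics({1: "mathematical", 2: "chemical reactions"})
print(topics)

# The dataframe is then passed to transform(), as in the cell above:
# results = pt.BatchRetrieve(indexref).transform(topics)
# The `rank` column of the results can be used to keep the top 10 per query:
# top10 = results[results["rank"] < 10]
```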


That's the end of the indexing tutorial. You can continue with the other example tutorials.