# PyTerrier Index Analysis examples

This notebook shows how to access an index directly in [PyTerrier](https://github.com/terrier-org/pyterrier).

## Prerequisites

You will need PyTerrier installed. PyTerrier also needs Java to be installed, and will find most installations automatically.

In [1]:
!pip install python-terrier
# !pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-90ambft7/python-terrier
  Running command git clone -q https://github.com/terrier-org/pyterrier.git /tmp/pip-install-90ambft7/python-terrier
Collecting pyjnius~=1.3.0
  Downloading https://files.pythonhosted.org/packages/d8/50/098cb5fb76fb7c7d99d403226a2a63dcbfb5c129b71b7d0f5200b05de1f0/pyjnius-1.3.0-cp36-cp36m-manylinux2010_x86_64.whl (1.1MB)
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval
  Downloading https://files.pythonhosted.org/packages/36/0a/5809ba805e62c98f81e19d6007132712945c78e7612c11f61bac76a25ba3/pytrec_eval-0.4.tar.gz
Collecting matchpy
  Downloading https://files.pythonhosted.org/packages/47/95/d265b944ce391bb2fa9982d7506bbb197bb55c5088ea74448a5ffcaeefab/matchpy-0.5.1-py3-none-any

## Init 

You must run `pt.init()` before using any other PyTerrier functions or classes.

Optional arguments:
 - `version` - Terrier IR version, e.g. "5.2"
 - `mem` - megabytes of memory allocated to Java, e.g. "4096"
 - `packages` - external Java packages for Terrier to load, e.g. ["org.terrier:terrier.prf"]
 - `logging` - logging level for Terrier. Defaults to "WARN"; use "INFO" or "DEBUG" for more output.

NB: PyTerrier needs Java 11 installed. If it cannot find your Java installation, set the `JAVA_HOME` environment variable to point to it.
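If Java is not found automatically, `JAVA_HOME` can also be set from Python before `pt.init()` is called. The following is a minimal sketch: the helper name `ensure_java_home` and the example path are hypothetical and need adapting to your system.

```python
import os

def ensure_java_home(candidate: str) -> bool:
    """Set JAVA_HOME to `candidate` if it is not already set.

    Returns True if JAVA_HOME is set afterwards, False otherwise.
    `candidate` is a hypothetical path; adjust it to your installation.
    """
    if os.environ.get("JAVA_HOME"):
        return True  # respect an existing setting
    if os.path.isdir(candidate):
        os.environ["JAVA_HOME"] = candidate
        return True
    return False

# Example path only; the real location differs between systems.
ensure_java_home("/usr/lib/jvm/java-11-openjdk-amd64")
```

The helper only changes the environment of the current Python process, so it has to run before `pt.init()` starts the JVM.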

In [2]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.2  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.2  jar not found, downloading to /root/.pyterrier...
Done


## Loading an Index

Here, we make use of PyTerrier's dataset API. We will use the [vaswani_npl corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a very small information retrieval test collection.

In [3]:
dataset = pt.datasets.get_dataset("vaswani")

indexref = dataset.get_index()

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index


Let's have a look at the statistics of this index.

In [4]:
index = pt.IndexFactory.of(indexref)

print(index.getCollectionStatistics().toString())

Number of documents: 11429
Number of terms: 7756
Number of fields: 0
Field names: []
Number of tokens: 271581



## Using a Terrier index in your own code

### How many documents does term X occur in?

As our index is stemmed, we use the stemmed form of the word 'chemical', which is 'chemic'.

In [9]:
index.getLexicon()["chemic"].getDocumentFrequency()

20

### What is the un-smoothed probability of term Y occurring in the collection?

Here, we again use the [Lexicon](http://terrier.org/docs/current/javadoc/org/terrier/structures/Lexicon.html) of the underlying Terrier index. We check that the term occurs in the lexicon (to prevent a KeyError). The Lexicon returns a [LexiconEntry](http://terrier.org/docs/current/javadoc/org/terrier/structures/LexiconEntry.html), which allows us access to the number of occurrences of the term in the index.

Finally, we use the [CollectionStatistics](http://terrier.org/docs/current/javadoc/org/terrier/structures/CollectionStatistics.html) object to determine the total number of occurrences of all terms in the index.

In [10]:
index.getLexicon()["chemic"].getFrequency() / index.getCollectionStatistics().getNumberOfTokens() if "chemic" in index.getLexicon() else 0

7.732499696223226e-05
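The value above can be checked by hand. The sketch below uses only numbers printed in this notebook: the token count from the collection statistics and the per-document frequencies of "chemic" from the posting list shown in the section on which documents a term occurs in. It assumes that `getFrequency()` reports the sum of those per-document frequencies, which is consistent with the printed result.

```python
import math

# Per-document frequencies of "chemic", copied from the posting list
# printed in this notebook: 19 documents with frequency 1 and one
# document (docno 6279, the 13th entry) with frequency 2.
posting_frequencies = [1] * 12 + [2] + [1] * 7

number_of_tokens = 271581  # "Number of tokens" in the collection statistics

# Number of documents containing the term (cf. getDocumentFrequency()).
document_frequency = len(posting_frequencies)
# Total occurrences of the term (assumed to match getFrequency()).
collection_frequency = sum(posting_frequencies)

# Un-smoothed (maximum likelihood) probability of the term.
probability = collection_frequency / number_of_tokens

print(document_frequency, collection_frequency, probability)
```

Note the difference between the two counts: the document frequency counts documents, while the collection frequency counts occurrences, so they differ whenever a term appears more than once in a document.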

### What terms occur in the 11th document?


In [15]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 10  # docids are 0-based, so this is the 11th document
# NB: getPostings() returns None (Java null) if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
  termid = posting.getId()
  lee = lex.getLexiconEntry(termid)
  print("%s with frequency %d" % (lee.getKey(),posting.getFrequency()))

circuit with frequency 3
transistor with frequency 1
us with frequency 1
obtain with frequency 1
switch with frequency 2
design with frequency 1
affect with frequency 1
plot with frequency 1
junction with frequency 1
characterist with frequency 1
paramet with frequency 1
relat with frequency 1
theoret with frequency 1
load with frequency 1
bistabl with frequency 1
curv with frequency 1
mai with frequency 1
diagram with frequency 1
line with frequency 1
static with frequency 1
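The direct index used above maps a document to the terms it contains, whereas the inverted index used in the next section maps a term to the documents that contain it. The following toy model in plain Python illustrates how the two structures and the lexicon relate. It is a sketch for intuition only: the three documents are made up, and none of these names or data layouts are Terrier's own.

```python
from collections import Counter, defaultdict

# A made-up, already tokenised and stemmed corpus; list position is the docid.
docs = [
    ["circuit", "switch", "circuit"],
    ["chemic", "switch"],
    ["chemic", "chemic", "design"],
]

# Direct index: docid -> {term: frequency within that document}.
direct = {docid: Counter(terms) for docid, terms in enumerate(docs)}

# Inverted index: term -> list of (docid, frequency) postings, in docid order.
inverted = defaultdict(list)
for docid, counts in direct.items():
    for term, freq in counts.items():
        inverted[term].append((docid, freq))

# Lexicon: term -> (document frequency, collection frequency).
lexicon = {
    term: (len(postings), sum(freq for _, freq in postings))
    for term, postings in inverted.items()
}

print(dict(direct[0]))     # terms of document 0
print(inverted["chemic"])  # documents containing "chemic"
print(lexicon["chemic"])   # statistics for "chemic"
```

In this toy model, a lookup in `direct` answers "what terms occur in document N?", and a lookup in `inverted` answers "what documents does term Z occur in?".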


### What documents does term "Z" occur in?

In [18]:
meta = index.getMetaIndex()
inv = index.getInvertedIndex()

le = lex.getLexiconEntry( "chemic" )
# the lexicon entry is also our pointer to access the inverted index posting list
for posting in inv.getPostings( le ): 
  docno = meta.getItem("docno", posting.getId())
  print("%s with frequency %d " % (docno, posting.getFrequency()))

1056 with frequency 1 
1140 with frequency 1 
2050 with frequency 1 
2417 with frequency 1 
2520 with frequency 1 
2558 with frequency 1 
3320 with frequency 1 
4054 with frequency 1 
4687 with frequency 1 
4886 with frequency 1 
4912 with frequency 1 
6129 with frequency 1 
6279 with frequency 2 
7049 with frequency 1 
8416 with frequency 1 
8766 with frequency 1 
9374 with frequency 1 
10139 with frequency 1 
10445 with frequency 1 
10703 with frequency 1 


Our index does not have position information, but *if it did*, the above loop would look like:

```python
for posting in inv.getPostings(le):
  docno = meta.getItem("docno", posting.getId())
  # unlike in Java, we don't need to cast posting to be a BlockPosting
  positions = posting.getPositions()
  print("%s with frequency %d and positions %s" % (docno, posting.getFrequency(), str(positions)))
```

### What are the PL2 weighting model scores of documents that "Y" occurs in?

Use of a WeightingModel class needs some setup, namely the [EntryStatistics](http://terrier.org/docs/current/javadoc/org/terrier/structures/EntryStatistics.html) of the term (obtained from the Lexicon, in the form of the LexiconEntry), as well as the CollectionStatistics (obtained from the index).

In [20]:
inv = index.getInvertedIndex()
meta = index.getMetaIndex()
lex = index.getLexicon()
le = lex.getLexiconEntry( "chemic" )
wmodel = pt.autoclass("org.terrier.matching.models.PL2")()
wmodel.setCollectionStatistics(index.getCollectionStatistics())
wmodel.setEntryStatistics(le)
wmodel.setKeyFrequency(1)
wmodel.prepare()
for posting in inv.getPostings(le):
  docno = meta.getItem("docno", posting.getId())
  score = wmodel.score(posting)
  print("%s with score %0.4f"  % (docno, score))


1056 with score 6.3584
1140 with score 5.3378
2050 with score 4.5494
2417 with score 4.5494
2520 with score 5.1136
2558 with score 5.1136
3320 with score 1.5902
4054 with score 2.1297
4687 with score 5.0092
4886 with score 6.1814
4912 with score 4.2399
6129 with score 3.0708
6279 with score 5.6394
7049 with score 4.3891
8416 with score 1.9834
8766 with score 5.3378
9374 with score 4.4678
10139 with score 5.2230
10445 with score 3.6754
10703 with score 6.9992
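To see where such scores come from, the PL2 formula can be sketched in plain Python. This is a re-statement of the formula under stated assumptions, not Terrier's code: it assumes the default hyper-parameter `c = 1.0`, a collection frequency of 21 for "chemic" (the sum of the frequencies in the posting list above), and a hypothetical document length of 23 tokens, since the notebook does not print document lengths. The function name `pl2_score` is made up for this example.

```python
import math

def pl2_score(tf, doc_length, term_frequency, num_docs, avg_doc_length,
              c=1.0, key_frequency=1.0):
    """Sketch of the PL2 score of one term in one document.

    tf             -- frequency of the term in the document
    doc_length     -- length of the document in tokens
    term_frequency -- total occurrences of the term in the collection
    """
    log2_e = math.log2(math.e)
    # Normalisation 2: adjust tf for the document's length.
    tfn = tf * math.log2(1.0 + (c * avg_doc_length) / doc_length)
    # Laplace after-effect normalisation.
    norm = 1.0 / (tfn + 1.0)
    # Mean of the Poisson model: occurrences per document.
    f = term_frequency / num_docs
    return norm * key_frequency * (
        tfn * math.log2(1.0 / f)
        + f * log2_e
        + 0.5 * math.log2(2 * math.pi * tfn)
        + tfn * (math.log2(tfn) - log2_e)
    )

num_docs = 11429     # from the collection statistics
num_tokens = 271581  # from the collection statistics
avg_doc_length = num_tokens / num_docs

# tf=1 in a document assumed (hypothetically) to be 23 tokens long.
score = pl2_score(tf=1, doc_length=23, term_frequency=21,
                  num_docs=num_docs, avg_doc_length=avg_doc_length)
print("%0.4f" % score)
```

Under these assumptions the sketch gives a score of roughly 4.55, in line with several of the scores printed above. Longer documents with the same term frequency receive lower scores, which is why documents with frequency 1 do not all score the same.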
