# PyTerrier: Datasets, Indexes, Experiments

This jupyter notebook serves as playground for getting to know the dataset (classes and contents), 
how indexes work, and how to run and evaluate experiments


### Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_dataset](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

In [1]:
# If using Google Colab:
#!pip3 install tira ir-datasets python-terrier

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt

import pandas as pd
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
#pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 4)

In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [4]:
#TODO: what is this exactly (in comarison to pyterrier datasets)
#      when to use this, when pyterrier?
from tira.third_party_integrations import ir_datasets

### Helper functions

Here I define some helper functions that will be used later to analyse the dataset class.

In [417]:
import json

# Helper function for printing dicts
def pp(obj):
    print(json.dumps(obj, indent=2, ensure_ascii=False))

# For printing attributes of object
# exclude attributes that contain substrings specified in 'exclude'
# filter attributes that contain substrings specified in 'include'
def get_attrs(obj, exclude=None, include=None):
    exclude = [] if exclude is None else exclude
    if type(exclude) == str:
        exclude = [exclude]
    exclude += ["__"] # always exclude attrs like __str__ etc.

    if type(include) == str:
        include = [include]

    attrs = dir(obj)
    attrs = [a for a in attrs if 
                (all(e not in a for e in exclude)) and
                (any(i in a.lower() for i in include) if include else True)]
    return attrs

## Dataset  -  Corpus and Queries

The dataset we use in the shared task contains a corpus of scientific papers (title + abstracts) from the fields of IR and NLP (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). -> Dataset is the union of the IR Anthology and the ACL Anthology

It also contains queries on the corpus and the corresponding relevant documents.

The dataset is a pyterrier dataset (genauer: ```pyterrier.datasets.IRDSDataset```) [source code of pyterrier.datasets](https://pyterrier.readthedocs.io/en/latest/_modules/pyterrier/datasets.html#Dataset.get_topics)


The dataset has some methods that we might want to use:
- ```get_corpus_iter()```: The document corpus as an iterator (```get_corpus()``` is not available with this dataset)
- ```get_topics()```: The queries in a pandas DataFrame
- ```get_qrels()```: The relevant documents for the queries

In [215]:
# get the dataset
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

### the document corpus

In [216]:
# Iterator for corpus
corpus_iter = pt_dataset.get_corpus_iter(verbose=False) # verbose=True -> tqdm progress bar
# corpus iter is a GeneratorLen object, returns dicts
# has 2 attributes: length (number of documents in corpus) and gen (generator fn)

In [217]:
print("number of documents in corpus:", corpus_iter.length)

try:
    doc = next(corpus_iter.gen)  # doc is a dict with keys "docno" and "text"
except StopIteration:
    print("FEHLER: Keine Elemente im Iterator!")

print()
print("First element in corpus_iter:")
print(f"Document ID: ->{doc['docno']}<-")  # unique id string
print(f"Document Text: ->{doc['text']}<-") # text consisting of the title and (if available) the abstract of a paper
# NOTE: I use the arrows '->','<-' to better identify empty strings / newlines / blanks in prints
# NOTE: The generator returns a random document -> need to set seed in production?

number of documents in corpus: 126958

First element in corpus_iter:
Document ID: ->O02-2002<-
Document Text: ->A Study on Word Similarity using Context Vector Models


 There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilisti

In [218]:
doc_json = {"id": doc["docno"], "abstract":""}
for i, line in enumerate(doc["text"].split("\n")):
    if i == 0 and line:
        doc_json["title"] = line
        continue
    if not line: # empty line
        continue
    else: # Non empty lines belong to the abstract
        # In case the abstract contains newlines -> concatenate lines TODO: are there newlines some of the abstracts?
        doc_json["abstract"] += line #or line.strip() TODO: -> could be bad if one line ends on a blank and the next line begins with a word?
pp(doc_json)

{
  "id": "O02-2002",
  "abstract": " There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word , and all the context

Now I will (try to) iterate over the dataset and apply for corpus

In [219]:
def split_doctext(doctext):
    splitted = doctext.split("\n")
    title = splitted.pop(0) # first line is always title
    abstract = "".join(splitted) # following non-empty lines are abstract
    return title, abstract

In [260]:
corpus_iter = pt_dataset.get_corpus_iter(verbose=False) # verbose=True -> tqdm progress bar

docnos = set()
no_abstracts = {}
# iterate over corpus
for i, d in enumerate(corpus_iter):
    docnos.add(d["docno"])
    title, abstract = split_doctext(d["text"])
    if not title:
        raise Exception(f"Document does not have a title!!! docno: {d['docno']}")
    if not abstract:
        no_abstracts[d["docno"]] = title

print(f"Number of documents without an abstract: {len(no_abstracts)}/{corpus_iter.length}")

Number of documents without an abstract: 39701/126958


### the queries
Pyterrier Dataset has a function ```get_topics()``` which returns a dataframe of the topics aka queries.

In [515]:
queries = pt_dataset.get_topics()
print("columns:", list(queries.columns))
print("number of topics/queries:", len(queries))
print()

# get_topics() can be called passing the column-name/key-name you want to have:
queries = pt_dataset.get_topics(variant="title")
print(queries.head(10).to_string())
queries = pt_dataset.get_topics(variant="description")
print(queries.head(10).to_string())
print()

# Example of first topic
topic = pt_dataset.get_topics().iloc[0]
print("Example of the first topic/query:")
print(f" query id   : {topic['qid']}")
print(f" title      : {topic['title']}") # topic name (topic["text"] is the same as topic["title"])
print(f" description: {topic['narrative'][:90]}...") # the description of the query
print(f" query      : {topic['description']}") # is the user query


There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
columns: ['qid', 'text', 'title', 'query', 'description', 'narrative']
number of topics/queries: 68

  qid                                                query
0   1             retrieval system improving effectiveness
1   2             machine learning language identification
2   3                        social media detect self harm
3   4                        stemming for arabic languages
4   5                       audio based animal recognition
5   6                comparison different retrieval models
6   7                                   cache architecture
7   8                             document scoping formula
8   9                            pseudo relevance feedback
9  10  how to represent natural conversations in word nets
  qid                                                                  

: 

### The groundtruths

In [339]:
# Groundtruths - relevant documents for queries
qrels = pt_dataset.get_qrels()
qrels["label"] = qrels["label"].astype(int) # label in the dataframe is string ('0' or '1')

print("length:", len(qrels))
print("columns:", list(qrels.columns))
print(f"on average {round(len(qrels) / len(qrels['qid'].unique()))} labels per query id")
print()


print("Label Statistics: (per qid)")
results_df = qrels.groupby("qid").agg(
    n_labels=("label", "size"),
    n_relev=("label", "sum"),
    relev_ratio=("label", lambda x: x.sum() / x.size if x.size > 0 else 0)
).reset_index()
#print(results_df)
print(results_df[["n_labels", "n_relev", "relev_ratio"]].agg( ["min", "max", "mean", "std"]))
print()

print("Document Statistics: (per docno)")
results_df = qrels.groupby("docno").agg(
    n_qids=("qid", "size"),  # number of queries a document occurs in labels
    n_relev=("label", "sum") # number of times it was relevant
).reset_index()
#print(results_df)
print(results_df[["n_qids", "n_relev"]].agg( ["min", "max", "mean", "std"]))
print("number of documents in qrels with only label 0:", (results_df["n_relev"] == 0).sum())

length: 2623
columns: ['qid', 'docno', 'label', 'iteration']
on average 39 labels per query id

Label Statistics: (per qid)
       n_labels    n_relev  relev_ratio
min   13.000000   0.000000     0.000000
max   50.000000  43.000000     1.000000
mean  38.573529  19.911765     0.509098
std    7.016646  11.137854     0.258363

Document Statistics: (per docno)
        n_qids   n_relev
min   1.000000  0.000000
max   6.000000  3.000000
mean  1.137961  0.587419
std   0.435077  0.570786
number of documents in qrels with only label 0: 1037


Was bedeutet die spalte iteration? alle werte sind 0

**Label statistics**
- 1 or more labels have no relevant document.
- 1 or more labels have only relevant documents.

-> how to interpret those? 

filter out queries without relevant documents? but there are only 68 queries

<!--Aus den Statistiken geht hervor, dass es sowohl (1 oder mehr) queries gibt, zu dem kein einziges dokument relevant ist, und auch (1 oder mehr) queries gibt wo jedes document relevant ist.

Frage ist, sollte man diese aussortieren? bei nur 68 queries werden es dann noch weniger, ist das sinnvoll? -->

**Document statistics**
- every document occurs in the qrels
- there are documents that are not relevant for any query
- half of the 

In [337]:
topicsqrels = pt_dataset.get_topicsqrels() # returns tuple: (DataFrame, DataFrame)
# first Dataframe is the topics (just like get_topics())
# second Dataframe is the qrels (just like get_qrels())

df1, df2 = topicsqrels

print("First Dataset:")
print(" columns:", list(df1.columns))
print(" length: ", len(df1))

print("Second Dataset:")
print(" columns:", list(df2.columns))
print(" length: ", len(df2))

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
First Dataset:
 columns: ['qid', 'text', 'title', 'query', 'description', 'narrative']
 length:  68
Second Dataset:
 columns: ['qid', 'docno', 'label', 'iteration']
 length:  2623


### Comparison to ```ir_datasets```


There is another variant to get the dataset: from tira.ir_datasets

In [1]:
dataset = ir_datasets.load("ir-lab-sose-2024/ir-acl-anthology-20240504-training")
#print("IR Lab dataset attrs:\n", [i for i in dir(dataset) if "__" not in i])

#### the document corpus

In [None]:
# Document Corpus
docs_store = dataset.docs_store()

# WHAT IS THIS???
#lookup = docs.lookup("docid")
#lookup_iter = docs.lookup_iter("DOCID")

#print([i for i in dir(lookup_iter) if "__" not in i])

In [280]:
#print(dataset.docs_count()) # really slow?
#print(dataset.docs_cls()) #  <class 'ir_datasets.formats.base.GenericDoc'> 

docs_iter = dataset.docs_iter()  # this is almost the same as get_corpus iter of the pt_dataset
for doc in docs_iter: 
    print("id", doc.doc_id)
    print("text:", doc.text)
    break

# document corpus
docs_store = dataset.docs_store() # of type <class 'tira.ir_datasets_util.DictDocsstore'>

id O02-2002
text: A Study on Word Similarity using Context Vector Models


 There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environm

In [296]:
print("Attributes of docs_store:", [i for i in dir(docs_store) if "__" not in i])
# -> ['docs', 'get', 'get_many_iter']

# DOCS
# docs are the same docs like in pt_dataset as dict
docs = docs_store.docs # dict keys are doc_ids, values are GenercDoc objects
#for i, (k,v) in enumerate(docs.items()):
#    print("doc_id of first document:", k) 
#    print("document obj:", v)
#    break

# GET()
document = docs_store.get("O02-2002") # class GenericDoc (like above)
print("doc_id:", document.doc_id)
print("text:", document.text)

# GET_MANY_ITER()
many_ids = ["L02-1310", "R13-1042", "W05-0819"]
docs_iter = docs_store.get_many_iter(many_ids)
print("----------------------------------------")
for doc_id, doc_text in docs_iter:
    print(doc_id)
    print(doc_text)
    print("--------------------")


Attributes of docs_store: ['docs', 'get', 'get_many_iter']
doc_id: O02-2002
text: A Study on Word Similarity using Context Vector Models


 There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrenc

In [299]:
# Is there the same amount of missing abstracts??

docs_iter = dataset.docs_iter()

no_abstracts = {}
# almost same code like for pt_dataset corpus
for i, d in enumerate(docs_iter):
    docnos.add(d.doc_id)
    title, abstract = split_doctext(d.text)
    if not title:
        raise Exception(f"Document does not have a title!!! docno: {d.doc_id}")
    if not abstract:
        no_abstracts[d.doc_id] = title

print(f"Number of documents without an abstract: {len(no_abstracts)}/{len(corpus_iter)}")

# Answer: yes, the documents seem to be exaclty the same.

Number of documents without an abstract: 39701/126958


#### queries

In [321]:
#print(dataset.queries_cls()) # <class 'tira.ir_datasets_util.TirexQuery'>
# Attributes of TirexQuery: ['count', 'default_text', 'description', 'index', 'narrative', 'query', 'query_id', 'text', 'title']
queries_iter = dataset.queries_iter()

for i, q in enumerate(queries_iter): # q is TirexQuery
    for attr in ['description', 'narrative', 'query', 'query_id', 'text', 'title']:
        print(f"{attr}: {getattr(q, attr)}, type: {type(getattr(q, attr)).__name__}")
    break
# THis is exactly the same as above:
# - q.description is the user query
# - q.text, q.title, q.query are exactly the same -> the topic
# - q.narrative is the description/explanation of the query

# Extra methods:
# - default_text()   returns the title (aka q.text)
# - count( int )     I dont know what that does
# - index( arg )     I dont know what that does



description: What papers focus on improving the effectiveness of a retrieval system?, type: str
narrative: Relevant papers include research on what makes a retrieval system effective and what improves the effectiveness of a retrieval system. Papers that focus on improving something else or improving the effectiveness of a system that is not a retrieval system are not relevant., type: str
query: retrieval system improving effectiveness, type: str
query_id: 1, type: str
text: retrieval system improving effectiveness, type: str
title: retrieval system improving effectiveness, type: str


#### The Groundtruths (qrels - Relevant Documents)

In [327]:
#og_qrels = dataset.original_qrels # = None
#qrels = dataset.qrels   # Dataset object like dataset.docs, dataset.queries
#qrels_cls = dataset.qrels_cls() # <class 'ir_datasets.formats.trec.TrecQrel'>
#qrels_defs = dataset.qrels_defs() # empty dict/set

qrels_iter = dataset.qrels_iter()
for item in qrels_iter: # item of type TrecQrel with attributes [query_id, doc_id, relevance, iteration]
    print(item)
    break


TrecQrel(query_id='1', doc_id='2005.ipm_journal-ir0volumeA41A1.7', relevance=1, iteration='0')


### Corpus als JSON Speichern

In [None]:
tira_dataset = ir_datasets.load("ir-lab-sose-2024/ir-acl-anthology-20240504-training")
docs_store = tira_dataset.docs_store()

corpus = docs_store.docs
with open("dataset_corpus.json", "w") as f:
    json.dump(corpus, f, indent=2, ensure_ascii=False)

## Index and Retrieval

We will define a BM25 retrieval pipeline as baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

### The Index

The reference for the index: [Class Index](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html)

A good tutorial/introduction to pyterrier index: [ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial/blob/main/notebooks/notebook1.ipynb)

In [343]:
# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
print(type(index))

<class 'jnius.reflect.org.terrier.structures.Index'>


In [418]:
# just to have the list visually, (getStructure ... doesnt work)
for attr_name in get_attrs(index, include="get"):
    print(f"* {attr_name}:")
    #print(f"{getattr(index, attr_name)()}")  # attrs are all getters
print(index.getClass()) # org.terrier.structures.IndexOnDisk

* getClass:
* getCollectionStatistics:
* getDirectIndex:
* getDocumentIndex:
* getEnd:
* getIndexRef:
* getInvertedIndex:
* getLexicon:
* getMetaIndex:
* getStart:
Class: org.terrier.structures.IndexOnDisk


#### Collection Statistics

In [427]:

collection_statistics = index.getCollectionStatistics()
#print(get_attrs(collection_statistics))
#print(collection_statistics.toString())

print("Collection Statistics:")
print(" number of documents:", collection_statistics.getNumberOfDocuments())
print(" average document length:", collection_statistics.getAverageDocumentLength())

print(" number of tokens:", collection_statistics.getNumberOfTokens())
print(" number of unique terms:", collection_statistics.getNumberOfUniqueTerms())

# There is only one field "text" -> same numbers as above with documents and tokens.
print(" number of fields:", collection_statistics.getNumberOfDocuments())
#print(" average field lengths:", collection_statistics.getAverageFieldLengths())
print(" field names:", collection_statistics.getFieldNames())
#print(" field tokens:", collection_statistics.getFieldTokens())

# What is this??? 
print(" number of pointers:", collection_statistics.getNumberOfPointers())
print(" number of postings:", collection_statistics.getNumberOfPostings())
print(" has positions:", collection_statistics.hasPositions())


Collection Statistics:
 number of documents: 126958
 average document length: 64.02700893208778
 number of tokens: 8128741
 number of unique terms: 97223
 number of fields: 126958
 field names: ['text']
 number of pointers: 5471851
 number of postings: 5471851
 has positions: False


#### Direct Index

In [None]:
#direct_index = index.getDirectIndex() # Before I dont understand postings this is not useful/interesting
# direct_index.getPostings(  pointer  )  # -> needs pointer

In [396]:
document_index = index.getDocumentIndex()
#print(get_attrs(document_index))
#['getDocumentEntry', 'getDocumentLength', 'getNumberOfDocuments']

number_of_docs = document_index.getNumberOfDocuments()
print("n documents:", number_of_docs)
doc_entry = document_index.getDocumentEntry(0)
print("doc entry 0:", doc_entry)
doc_len = document_index.getDocumentLength(0)
print("doc len:", doc_len)

#print(get_attrs(doc_entry))
#['getDocumentLength', 'getFileNumber', 'getNumberOfEntries']
print("doc length:", doc_entry.getDocumentLength())
print("file number:", doc_entry.getFileNumber()) 
print("n entries:", doc_entry.getNumberOfEntries())

# I dont really understand how to work with this


n documents: 126958
doc entry 0: <org.terrier.structures.DocumentIndexEntry at 0x7fff49af04f0 jclass=org/terrier/structures/DocumentIndexEntry jself=<LocalRef obj=0x5555583b64e8 at 0x7fff4cc343b0>>
doc len: 111
doc length: 111
file number: 0
n entries: 66


In [401]:

index_ref = index.getIndexRef()
print(type(index_ref))
print(index_ref)

print(get_attrs(index_ref))

# I dont understand this either

<class 'jnius.reflect.org.terrier.querying.IndexRef'>
<org.terrier.querying.IndexRef at 0x7fffd388f740 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x5555583b6410 at 0x7fff4cc35fb0>>
['_class', 'clone', 'equals', 'finalize', 'getClass', 'hashCode', 'location', 'notify', 'notifyAll', 'of', 'registerNatives', 'serialVersionUID', 'size', 'toString', 'wait']


#### Lexicon

A lexicon (a.k.a. dictionary, vocabulary) typically represents the list of terms in the index, 
together with their statistics (EntryStatistics) and the pointer (Pointer) 
to the offset of that term's postings in the PostingIndex returned by Index.getInvertedIndex(). 
The EntryStatistics and Pointer are combined in a single LexiconEntry object. 

In [433]:
lexicon = index.getLexicon()

print("number of entries in lexicon:", lexicon.numberOfEntries())

# Iterate over lexicon
for i, kv in enumerate(index.getLexicon()):
    if i < 9000: continue # first entries are just numbers
    if i > 9010: break # print 10 entries
    print(f"{kv.getKey()} ({type(kv.getKey()).__name__}) -> {kv.getValue()} ({type(kv.getValue()).__name__})")

# Lexicon entries have statistics (Nt -> number of documents with term, TF total ocurrences, @{} are pointers)

# lookup terms with bracket notation
entry = lexicon["angular"].toString()
print(entry)

# lookup terms with getLexiconEntry
entry = lexicon.getLexiconEntry("tomato")
print(entry)

number of entries in lexicon: 97223
angora (str) -> term72311 Nt=1 TF=1 maxTF=2147483647 @{0 635219 0} TFf=1 (org.terrier.structures.LexiconEntry)
angorn (str) -> term37097 Nt=1 TF=1 maxTF=2147483647 @{0 635223 4} TFf=1 (org.terrier.structures.LexiconEntry)
angri (str) -> term13025 Nt=26 TF=33 maxTF=2147483647 @{0 635227 4} TFf=33 (org.terrier.structures.LexiconEntry)
angrier (str) -> term13916 Nt=1 TF=1 maxTF=2147483647 @{0 635304 0} TFf=1 (org.terrier.structures.LexiconEntry)
angstrom (str) -> term13140 Nt=1 TF=1 maxTF=2147483647 @{0 635307 4} TFf=1 (org.terrier.structures.LexiconEntry)
angu (str) -> term67919 Nt=1 TF=1 maxTF=2147483647 @{0 635311 0} TFf=1 (org.terrier.structures.LexiconEntry)
anguag (str) -> term526 Nt=140 TF=147 maxTF=2147483647 @{0 635315 2} TFf=147 (org.terrier.structures.LexiconEntry)
anguagc (str) -> term67574 Nt=1 TF=1 maxTF=2147483647 @{0 635640 2} TFf=1 (org.terrier.structures.LexiconEntry)
anguao (str) -> term62024 Nt=1 TF=1 maxTF=2147483647 @{0 635644 4} T

In [451]:
inverted_index = index.getInvertedIndex()
pointer = lexicon["tomato"]

for posting in inverted_index.getPostings(pointer):
    print(f"{posting.toString()} doclen={posting.getDocumentLength()} " \
          f"->  ID: {posting.getId()} frequency: {posting.getFrequency()})")

(21341,2,F[2]) doclen=231 ->  ID: 21341 frequency: 2)
(26729,1,F[1]) doclen=63 ->  ID: 26729 frequency: 1)
(35901,1,F[1]) doclen=144 ->  ID: 35901 frequency: 1)
(36834,1,F[1]) doclen=78 ->  ID: 36834 frequency: 1)
(39721,1,F[1]) doclen=129 ->  ID: 39721 frequency: 1)
(52494,1,F[1]) doclen=72 ->  ID: 52494 frequency: 1)
(53515,1,F[1]) doclen=167 ->  ID: 53515 frequency: 1)
(55481,2,F[2]) doclen=133 ->  ID: 55481 frequency: 2)
(88962,1,F[1]) doclen=70 ->  ID: 88962 frequency: 1)
(95102,1,F[1]) doclen=7 ->  ID: 95102 frequency: 1)


In [484]:
# Docnos for the index doc-ids available with MetaIndex
meta_index = index.getMetaIndex()
#metadata_keys = meta_index.getKeys()
#print(metadata_keys)

# LexiconEntry is the pointer for where to find postings for that term in the inverted index
entry = lexicon["tomato"]

postings = inverted_index.getPostings(entry)
docnos = []
for posting in postings:
    doc_id = posting.getId()
    docnos.append(meta_index.getItem("docno", doc_id))   #getItem
print(docnos)
print("-----")

# NOTE: postings object changes after iterating over it!! -> need to assign it again
postings = inverted_index.getPostings(entry)
doc_ids = [posting.getId() for posting in postings]
docnos = meta_index.getItems("docno", doc_ids)  # getItems - possible is also getItems(["docno", "other", ...], doc_ids)
print(docnos)


['E14-1071', '2020.acl-tutorials.6', 'W17-7515', 'D18-1474', 'D19-1596', 'D14-1059', '2021.eacl-main.229', '2021.emnlp-main.734', '2020.wsdm_conference-2020.112', '2020.cikm_conference-2020.181']
-----
['E14-1071', '2020.acl-tutorials.6', 'W17-7515', 'D18-1474', 'D19-1596', 'D14-1059', '2021.eacl-main.229', '2021.emnlp-main.734', '2020.wsdm_conference-2020.112', '2020.cikm_conference-2020.181']


#### Searching an Index

In [468]:
# Singe-word search:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("tomato")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,21341,E14-1071,0,2.0,tomato
1,1,55481,2021.emnlp-main.734,1,2.0,tomato
2,1,26729,2020.acl-tutorials.6,2,1.0,tomato
3,1,35901,W17-7515,3,1.0,tomato
4,1,36834,D18-1474,4,1.0,tomato
5,1,39721,D19-1596,5,1.0,tomato
6,1,52494,D14-1059,6,1.0,tomato
7,1,53515,2021.eacl-main.229,7,1.0,tomato
8,1,88962,2020.wsdm_conference-2020.112,8,1.0,tomato
9,1,95102,2020.cikm_conference-2020.181,9,1.0,tomato


In [483]:
# search with dataframe with transform()

test_queries = pd.DataFrame([["q1", "tomato"], ["q2", "angular"], ["q3", "sheep"]],
                            columns=["qid", "query"])

#br.transform(queries)
search_result = br(test_queries) # short form of br.transform(queries)
print(search_result)


   qid   docid                                      docno  rank  score  \
0   q1   21341                                   E14-1071     0    2.0   
1   q1   55481                        2021.emnlp-main.734     1    2.0   
2   q1   26729                       2020.acl-tutorials.6     2    1.0   
3   q1   35901                                   W17-7515     3    1.0   
4   q1   36834                                   D18-1474     4    1.0   
5   q1   39721                                   D19-1596     5    1.0   
6   q1   52494                                   D14-1059     6    1.0   
7   q1   53515                         2021.eacl-main.229     7    1.0   
8   q1   88962              2020.wsdm_conference-2020.112     8    1.0   
9   q1   95102              2020.cikm_conference-2020.181     9    1.0   
10  q2   67458                           2021.paclic-1.47     0    2.0   
11  q2   96877              2013.cikm_conference-2013.170     1    2.0   
12  q2    5040                        

### Retrieval - Searching the Index

Way to search in PyTerrier: ```BatchRetrieve```
-> configured by specifing an _index_ and a _weighting model_

#### Weighting Models

Weighting Models define the ranking function for document retrieval in BatchRetrieve.

List of [Supported Weighting Models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html)

In [486]:
# configure the BatchRetrieve
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")

# Do the retrieval
run = bm25(pt_dataset.get_topics("text"))

# Print the result
run.head(10)

# Save the run into runfile for later evaluation
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".


#### Evaluating the Retrieval

In [488]:
qrels = pt_dataset.get_qrels()
def get_res_with_labels(ranker, df):
    # get results for the query (or queries)
    results = ranker(df)
    with_labels = results.merge(qrels, on=["qid", "docno"], how="left").fillna(0)
    return with_labels

# bm25 results for the first query (for all queries -> without head(1))
get_res_with_labels(bm25, pt_dataset.get_topics(variant="title").head(1))

Unnamed: 0,qid,docid,docno,rank,score,query,label,iteration
0,1,94858,2004.cikm_conference-2004.47,0,15.6818,retrieval system improving effectiveness,1.0,0
1,1,125137,1989.ipm_journal-ir0volumeA25A4.2,1,15.0474,retrieval system improving effectiveness,1.0,0
2,1,125817,2005.ipm_journal-ir0volumeA41A5.11,2,14.1442,retrieval system improving effectiveness,1.0,0
3,1,5868,W05-0704,3,14.0257,retrieval system improving effectiveness,0.0,0
4,1,84876,2016.ntcir_conference-2016.90,4,13.9480,retrieval system improving effectiveness,1.0,0
...,...,...,...,...,...,...,...,...
995,1,74055,2004.ntcir_workshop-2004.7,995,9.0151,retrieval system improving effectiveness,0.0,0
996,1,86097,2014.clef_conference-2014w.26,996,9.0149,retrieval system improving effectiveness,0.0,0
997,1,81966,2007.sigirconf_conference-2007.76,997,9.0147,retrieval system improving effectiveness,0.0,0
998,1,75540,2009.clef_workshop-2009w.152,998,9.0124,retrieval system improving effectiveness,0.0,0


In [490]:
pt.Experiment(
    [tfidf],
    pt_dataset.get_topics(variant="title"),
    pt_dataset.get_qrels(),
    eval_metrics=["map", "ndcg"]
)

Unnamed: 0,name,map,ndcg
0,BR(TF_IDF),0.2642,0.5541


### Transformer Pipelines
These are not transformer models, in pyterrier transformer means SOMETHING ELSE.
And those can be connected to form a pipeline

## Experiments and Baselines

Baselines executed in TIRA

In [492]:
bm25_baseline = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 (tira-ir-starter-pyterrier)', pt_dataset)
sparse_cross_encoder = tira.pt.from_submission('ir-benchmarks/fschlatt/sparse-cross-encoder-4-512', pt_dataset)
rank_zephyr = tira.pt.from_submission('workshop-on-open-web-search/fschlatt/rank-zephyr', pt_dataset)

Download: 126kiB [00:00, 1.10MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/ir-acl-anthology-20240504-training/fschlatt


Download: 683kiB [00:00, 4.28MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/workshop-on-open-web-search/ir-acl-anthology-20240504-training/fschlatt


### Experiments

Conducting Experiments with PyTerrier
-> run transformer-pipeline* over set of queries and evaluate outcome using standart IR evaluation metrics, based on known relevant documants (qrels)

evaluation metrics calculated by pytrec_eval library

*(pyterrier transformers not the transformer models)


Reference for [PyTerrier Experiments](https://pyterrier.readthedocs.io/en/latest/experiments.html)

In [5]:
# the retireval systems specified in experiment (here [bm25, bm25_baseline, ... ]) can be either
# systems themselves or results dataframe (like below when read from file)

bm25 = pt.io.read_results('../runs/run.txt') # Read the run written above in Retrival Pipeline section
print(bm25)

#pt.Experiment(
#    [bm25, bm25_baseline, sparse_cross_encoder, rank_zephyr],
#    pt_dataset.get_topics(),
#    pt_dataset.get_qrels(),
#    ["ndcg_cut.10", "recip_rank", "recall_100"], # eval metrics
#    names=["BM25 (Own)", "BM 25 (Baseline)", "Sparse Cross Encoder", "RankZephyr"]
#)

      qid                               docno  rank    score           name
0       1        2004.cikm_conference-2004.47     1  15.6818  bm25-baseline
1       1   1989.ipm_journal-ir0volumeA25A4.2     2  15.0474  bm25-baseline
2       1  2005.ipm_journal-ir0volumeA41A5.11     3  14.1442  bm25-baseline
3       1                            W05-0704     4  14.0257  bm25-baseline
4       1       2016.ntcir_conference-2016.90     5  13.9480  bm25-baseline
...    ..                                 ...   ...      ...            ...
66278  68                            W18-6474   996   8.7468  bm25-baseline
66279  68        2007.cikm_conference-2007.37   997   8.7466  bm25-baseline
66280  68     1998.sigirconf_conference-98.61   998   8.7460  bm25-baseline
66281  68   2015.ipm_journal-ir0volumeA51A2.1   999   8.7444  bm25-baseline
66282  68       2015.ictir_conference-2015.24  1000   8.7436  bm25-baseline

[66283 rows x 5 columns]
