# Dataset Analysis

This Jupyter notebook serves as a playground for getting to know the dataset.

The dataset contains a corpus of scientific papers (titles and, where available, abstracts) from the fields of IR and NLP. It is the union of the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/).

It also contains queries on the corpus and the corresponding relevant documents.

### Libraries

We will use [tira](https://www.tira.io/), a shared-task platform for information retrieval, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the dataset. On top of these, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source Python framework for information retrieval experiments built on the Terrier search engine.

In [None]:
# If using Google Colab:
!pip3 install tira ir-datasets python-terrier

In [1]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt

### Helper functions

Here I define some helper functions that will be used later to analyse the dataset.

In [2]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [38]:
import json

# Helper function for printing dicts
def pp(obj):
    print(json.dumps(obj, indent=2, ensure_ascii=False))

## Dataset 

The dataset is a PyTerrier dataset (more precisely, a ```pyterrier.datasets.IRDSDataset```); see the [source code of pyterrier.datasets](https://pyterrier.readthedocs.io/en/latest/_modules/pyterrier/datasets.html#Dataset.get_topics).


The dataset has some methods that we might want to use:
- ```get_corpus_iter()```: The document corpus as an iterator (```get_corpus()``` is not available with this dataset)
- ```get_topics()```: The queries in a pandas DataFrame
- ```get_qrels()```: The relevant documents for the queries

In [85]:
# get the dataset
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

### The document corpus

In [86]:
# Iterator for corpus
corpus_iter = pt_dataset.get_corpus_iter(verbose=False) # verbose=True -> tqdm progress bar
# corpus iter is a GeneratorLen object, returns dicts
# has 2 attributes: length (number of documents in corpus) and gen (generator fn)

In [97]:
print("number of documents in corpus:", corpus_iter.length)

try:
    doc = next(corpus_iter.gen)  # doc is a dict with keys "docno" and "text"
except StopIteration:
    print("ERROR: No elements in the iterator!")

print()
print("First element in corpus_iter:")
print(f"Document ID: ->{doc['docno']}<-")  # unique id string
print(f"Document Text: ->{doc['text']}<-") # text consisting of the title and (if available) the abstract of a paper
# NOTE: I use the arrows '->','<-' to better identify empty strings / newlines / blanks in prints
# NOTE: The generator returns a random document -> need to set seed in production?

number of documents in corpus: 126958

First element in corpus_iter:
Document ID: ->R13-1044<-
Document Text: ->Recognizing semantic relations within {P}olish noun phrase: A rule-based approach


 The paper 1 presents a rule-based approach to semantic relation recognition within the Polish noun phrase. A set of semantic relations, including some thematic relations, has been determined for the need of experiments. The method consists in two steps: first the system recognizes word pairs and triples, and then it classifies the relations. Evaluation was performed on random samples from two balanced Polish corpora.<-


In [98]:
doc_json = {"id": doc["docno"], "abstract": ""}
for i, line in enumerate(doc["text"].split("\n")):
    if not line:  # skip empty lines
        continue
    if i == 0:  # the first line is the title, it must not end up in the abstract
        doc_json["title"] = line
        continue
    # All remaining non-empty lines belong to the abstract.
    # TODO: do any abstracts actually contain newlines? If so, the lines are concatenated here.
    # TODO: line.strip() could glue words together if one line ends on a blank and the next starts with a word.
    doc_json["abstract"] += line
pp(doc_json)

{
  "id": "R13-1044",
  "abstract": " The paper 1 presents a rule-based approach to semantic relation recognition within the Polish noun phrase. A set of semantic relations, including some thematic relations, has been determined for the need of experiments. The method consists in two steps: first the system recognizes word pairs and triples, and then it classifies the relations. Evaluation was performed on random samples from two balanced Polish corpora.",
  "title": "Recognizing semantic relations within {P}olish noun phrase: A rule-based approach"
}


Now I will (try to) iterate over the dataset, and check if the cell above works for all elements in the dataset.

In [99]:
def split_doctext(doctext):
    splitted = doctext.split("\n")
    title = splitted.pop(0) # first line is always title
    abstract = "".join(splitted) # following non-empty lines are abstract
    return title, abstract
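
The raw document text printed earlier shows that the abstract line starts with a blank, and `split_doctext` keeps such blanks. A slightly more defensive variant is sketched below. Like the function above, it assumes that the first line is the title; in addition it strips every line and joins the abstract lines with a single blank. The name `split_doctext_stripped` and the sample string are made up for illustration.

```python
def split_doctext_stripped(doctext):
    """Split a document text into (title, abstract).

    Assumption: the first line is the title, all later non-empty
    lines form the abstract.
    """
    lines = doctext.split("\n")
    title = lines[0].strip()
    # Strip every remaining line, drop the empty ones, and join with one
    # blank so that words at line boundaries are not glued together.
    abstract = " ".join(part.strip() for part in lines[1:] if part.strip())
    return title, abstract


# Made-up sample in the same layout as the corpus entry shown earlier
sample = "A title\n\n\n An abstract sentence. \nA second line."
title, abstract = split_doctext_stripped(sample)
print(f"->{title}<-")
print(f"->{abstract}<-")
```

With this variant, a document whose remaining lines contain only whitespace is treated as having no abstract, so counts based on it may differ slightly from counts based on `split_doctext`.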

In [100]:
corpus_iter = pt_dataset.get_corpus_iter(verbose=False) # verbose=True -> tqdm progress bar

no_abstracts = {}
# iterate over corpus
for i, d in enumerate(corpus_iter):
    title, abstract = split_doctext(d["text"])
    if not title:
        raise Exception(f"Document does not have a title!!! docno: {d['docno']}")
    if not abstract:
        no_abstracts[d["docno"]] = title

print(f"Number of documents without an abstract: {len(no_abstracts)}/{corpus_iter.length}")

Number of documents without an abstract: 39701/126958
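
To put this count into perspective, the share of documents without an abstract can be computed directly from the two numbers printed above. The variable names are only illustrative.

```python
# Numbers taken from the output above
n_without_abstract = 39701
n_documents = 126958

share = n_without_abstract / n_documents
print(f"Share of documents without an abstract: {share:.1%}")
```

Roughly 31% of the documents therefore have no abstract, so for them a retrieval system can only rely on the title.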


### Queries
A PyTerrier dataset has a method ```get_topics()``` that returns a DataFrame of the topics, i.e. the queries.

In [137]:
queries = pt_dataset.get_topics()
print("number of topics/queries:", len(queries))
print(queries.columns)

topic = queries.iloc[0]
# The field names are confusing:
query = topic['description']  # this is actually the user query (a natural-language question)
description = topic['narrative']  # this actually describes which documents count as relevant
print("Example of one topic/query:")
print(f" query id   : {topic['qid']}")
print(f" title      : {topic['title']}")
print(f" description: {description[:90]}...")
print(f" query      : {query}")


There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
number of topics/queries: 68
Index(['qid', 'text', 'title', 'query', 'description', 'narrative'], dtype='object')
Example of one topic/query:
 query id   : 1
 title      : retrieval system improving effectiveness
 description: Relevant papers include research on what makes a retrieval system effective and what impro...
 query      : What papers focus on improving the effectiveness of a retrieval system?


In [197]:
# Groundtruths - relevant documents for queries
qrels = pt_dataset.get_qrels()
qrels["label"] = qrels["label"].astype(int) # label is string

print("length:", len(qrels))
print("columns:", list(qrels.columns))
print(f"on average {round(len(qrels) / len(qrels['qid'].unique()))} labels per query id")
print()


print("Label Statistics: (per qid)")
results_df = qrels.groupby("qid").agg(
    n_labels=("label", "size"),
    n_relevant=("label", "sum"),
    relev_ratio=("label", lambda x: x.sum() / x.size if x.size > 0 else 0)
).reset_index()
#print(results_df)
print(results_df[["n_labels", "n_relevant", "relev_ratio"]].agg(
            ["min", "max", "mean", "std"]))
print()

print("Document Statistics: (per docno)")
results_df = qrels.groupby("docno").agg(
    n_qids=("qid", "size"),  # number of queries a document occurs in labels
    n_rel=("label", "sum")
).reset_index()
#print(results_df)
print(results_df[["n_qids", "n_rel"]].agg(
            ["min", "max", "mean", "std"]))


length: 2623
columns: ['qid', 'docno', 'label', 'iteration']
on average 39 labels per query id

Label Statistics: (per qid)


KeyError: "None of [Index(['number_of_labels', 'number_of_relevant', 'relevance_rate'], dtype='object')] are in the [columns]"

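The statistics show that there are (one or more) queries for which not a single judged document is relevant, and also queries for which every judged document is relevant.

The question is whether such queries should be filtered out. With only 68 queries, that would leave even fewer!

What does the `iteration` column mean? All values are 0. (In the standard TREC qrels format, this is a legacy field that is conventionally set to 0 and ignored during evaluation.)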
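
Before deciding whether to drop queries that have no relevant document (or only relevant ones), it helps to list them. The following is a minimal sketch on hypothetical toy data. In the notebook, the tuples would come from the `qid`, `docno` and `label` columns of `pt_dataset.get_qrels()`, and a label greater than 0 is assumed to mean "relevant".

```python
from collections import defaultdict

# Hypothetical toy qrels as (qid, docno, label) tuples
toy_qrels = [
    ("1", "docA", 1), ("1", "docB", 0),
    ("2", "docC", 0), ("2", "docD", 0),  # no relevant document
    ("3", "docE", 1), ("3", "docF", 1),  # only relevant documents
]

# Collect all labels per query id
labels_per_qid = defaultdict(list)
for qid, _docno, label in toy_qrels:
    labels_per_qid[qid].append(label)

# Queries whose judged documents are all non-relevant or all relevant
no_relevant = [q for q, labs in labels_per_qid.items() if not any(lab > 0 for lab in labs)]
all_relevant = [q for q, labs in labels_per_qid.items() if all(lab > 0 for lab in labs)]

print("queries without relevant documents:", no_relevant)
print("queries with only relevant documents:", all_relevant)
```

Listing the affected query ids first makes it possible to see how many of the 68 queries would be lost before filtering anything.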

In [176]:
# QUERIES

# pyterrier dataset (class IRDSDataset) builds queries dataframe with self.irds_ref()



In [114]:

queries = pt_dataset.get_topics()
print("number of queries:", len(queries))
print(queries.columns)
print(queries.index)

# Print the first query in full (raise the head() argument to inspect more)
for _, row in queries.head(1).iterrows():
    print("---------------------------------")
    print(f"query ID: {row['qid']}")
    print(f"topic: {row['text']}")
    print(f"query: {row['description']}")
    print(f"description: {row['narrative']}")


There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
number of queries: 68
Index(['qid', 'text', 'title', 'query', 'description', 'narrative'], dtype='object')
RangeIndex(start=0, stop=68, step=1)
---------------------------------
query ID: 1
topic: retrieval system improving effectiveness
query: What papers focus on improving the effectiveness of a retrieval system?
description: Relevant papers include research on what makes a retrieval system effective and what improves the effectiveness of a retrieval system. Papers that focus on improving something else or improving the effectiveness of a system that is not a retrieval system are not relevant.


In [23]:
# Q RELS

qrels = pt_dataset.get_qrels()

for row in qrels.iterrows():
    print(row)
    break


(0, qid                                          1
docno        2005.ipm_journal-ir0volumeA41A1.7
label                                        1
iteration                                    0
Name: 0, dtype: object)


### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [None]:

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

In [5]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
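
To give an intuition for what a BM25 baseline computes, here is a small, self-contained sketch of a textbook BM25 variant in plain Python. It is not Terrier's implementation: the whitespace tokenisation, the idf formula and the parameter values `k1=1.2` and `b=0.75` are assumptions made for illustration, and the function name `bm25_scores` and the mini corpus are made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` for `query` with a textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Smoothed idf that is never negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up mini corpus for illustration
toy_docs = [
    "improving retrieval effectiveness with query expansion",
    "a rule-based approach to semantic relation recognition",
    "retrieval system evaluation",
]
toy_scores = bm25_scores("retrieval system improving effectiveness", toy_docs)
print(toy_scores)
```

In this sketch, a document that shares no term with the query (the second one) gets a score of 0, while rarer query terms and shorter documents push the score up.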

### Step 4: Create the Run


In [9]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


| | qid | query |
|---|---|---|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |


In [11]:
print('Now we do the retrieval...')
run = bm25(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 94858 | 2004.cikm_conference-2004.47 | 0 | 15.681777 | retrieval system improving effectiveness |
| 1 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 1 | 15.04738 | retrieval system improving effectiveness |
| 2 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 2 | 14.144223 | retrieval system improving effectiveness |
| 3 | 1 | 5868 | W05-0704 | 3 | 14.025748 | retrieval system improving effectiveness |
| 4 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 4 | 13.947994 | retrieval system improving effectiveness |
| 5 | 1 | 82472 | 1998.sigirconf_conference-98.15 | 5 | 13.901647 | retrieval system improving effectiveness |
| 6 | 1 | 94415 | 2008.cikm_conference-2008.183 | 6 | 13.808208 | retrieval system improving effectiveness |
| 7 | 1 | 17496 | O01-2005 | 7 | 13.749449 | retrieval system improving effectiveness |
| 8 | 1 | 82490 | 1998.sigirconf_conference-98.33 | 8 | 13.735541 | retrieval system improving effectiveness |
| 9 | 1 | 124801 | 2006.ipm_journal-ir0volumeA42A3.2 | 9 | 13.569263 | retrieval system improving effectiveness |


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a separate notebook.

In [20]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
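
Run files are commonly exchanged in the six-column TREC format (`qid Q0 docno rank score system`). The sketch below shows how the first two rows of the run table shown earlier would look in that format. It assumes that this is the layout used for `run.txt`, which is not verified here, and the exact rank and score values written by `persist_and_normalize_run` may differ because the function normalizes the run. The helper `to_trec_line` is hypothetical and not part of tira or PyTerrier.

```python
def to_trec_line(qid, docno, rank, score, system):
    """Format one ranking entry as a line in the six-column TREC run format."""
    # "Q0" is a constant placeholder column kept for historical reasons
    return f"{qid} Q0 {docno} {rank} {score} {system}"


# First two rows of the run table shown earlier
rows = [
    ("1", "2004.cikm_conference-2004.47", 0, 15.681777),
    ("1", "1989.ipm_journal-ir0volumeA25A4.2", 1, 15.04738),
]
lines = [
    to_trec_line(qid, docno, rank, score, "bm25-baseline")
    for qid, docno, rank, score in rows
]
print("\n".join(lines))
```

Keeping the system name in the last column makes it easy to tell runs apart when several of them are evaluated together.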
