# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_datasets](https://ir-datasets.com/) for loading the data, and subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [1]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier

# install spacy
%pip install -U pip setuptools wheel
%pip install -U spacy
!python -m spacy download en_core_web_sm

Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import pandas as pd

In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

In [4]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
document_entity_recognition = tira.pt.transform_documents('ir-lab-sose-2024/ir-nfmj/entity-recognition', pt_dataset)

In [12]:
# Collect all docnos, then let the entity-recognition transformer add an 'entities' column.
corpus_with_entities = [{'docno': doc['docno']} for doc in pt_dataset.get_corpus_iter()]

corpus_with_entities = document_entity_recognition(pd.DataFrame(corpus_with_entities))

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:03<00:00, 36806.83it/s]


In [13]:
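def retain_only_entity_type_text(positive_entity_types):
    """Build documents whose text consists only of entities with one of the given labels."""
    ret = []
    for _, doc in corpus_with_entities.iterrows():
        text = ""
        for entity in doc["entities"]:
            if entity["label"] in positive_entity_types:
                text += " " + entity["text"]
        # Skip documents that contain no entity of the requested types.
        if len(text) > 0:
            ret += [{"docno": doc["docno"], "text": text}]
    return ret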
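
The filtering step can be tried without the TIRA data. The following is a minimal sketch on plain Python dictionaries; `keep_entity_text` and the toy documents are hypothetical and only mirror the shape of the `entities` column used above (a list of dictionaries with `text` and `label` keys). Unlike the notebook function, it joins entity texts without a leading space.

```python
def keep_entity_text(docs, positive_entity_types):
    """Keep, per document, only the text of entities with a wanted label."""
    kept = []
    for doc in docs:
        text = " ".join(
            ent["text"] for ent in doc["entities"]
            if ent["label"] in positive_entity_types
        )
        if text:  # documents without a matching entity are dropped
            kept.append({"docno": doc["docno"], "text": text})
    return kept


# Hypothetical toy documents; the docnos and entities are invented.
toy_docs = [
    {"docno": "d1", "entities": [
        {"text": "Google", "label": "ORG"},
        {"text": "2019", "label": "DATE"},
    ]},
    {"docno": "d2", "entities": [{"text": "three", "label": "CARDINAL"}]},
]

org_only = keep_entity_text(toy_docs, {"ORG"})
print(org_only)
```

Because documents without a matching entity are dropped entirely, an index built this way can contain far fewer documents than the full corpus.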

-----
All entities (full-text index)

In [14]:
indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_full_text = pt.IndexFactory.of(indexer.index(pt_dataset.get_corpus_iter()))
bm25 = pt.BatchRetrieve(index_full_text, wmodel="BM25")

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:18<00:00, 6846.64it/s] 


16:20:19.454 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents
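
All retrievers in this notebook use `wmodel="BM25"`. As a rough illustration of what such a weighting model computes, here is a sketch of the textbook BM25 formula in plain Python. It is not Terrier's implementation, which may differ in details such as the IDF variant and parameter defaults; the function name, the toy corpus, and the values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for one query (illustrative only)."""
    term_freqs = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = term_freqs[term]
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freqs[term]
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


# Hypothetical toy corpus of already tokenized documents.
docs = {
    "d1": ["retrieval", "system", "evaluation"],
    "d2": ["language", "identification"],
    "d3": ["retrieval", "retrieval", "model"],
}
doc_freqs = Counter(term for terms in docs.values() for term in set(terms))
avg_doc_len = sum(len(terms) for terms in docs.values()) / len(docs)

query = ["retrieval", "system"]
scores = {
    docno: bm25_score(query, terms, doc_freqs, len(docs), avg_doc_len)
    for docno, terms in docs.items()
}
print(sorted(scores, key=scores.get, reverse=True))
```

A document that shares no term with the query receives a score of zero, which is why entity-only indexes with very sparse text can return few or no results for many topics.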


----
ORG only


In [15]:
indexer = pt.IterDictIndexer("/tmp/index-ORG", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_org_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['ORG']))))
bm25_org = pt.BatchRetrieve(index_org_text, wmodel="BM25")

16:20:38.883 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 194 empty documents


----
CARDINAL only

In [16]:
indexer = pt.IterDictIndexer("/tmp/index-CARDINAL", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_card_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['CARDINAL']))))
bm25_card = pt.BatchRetrieve(index_card_text, wmodel="BM25")

16:21:06.824 [ForkJoinPool-3-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 21698 empty documents


----
PERSON only

In [17]:
indexer = pt.IterDictIndexer("/tmp/index-PERSON", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_pers_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['PERSON']))))
bm25_pers = pt.BatchRetrieve(index_pers_text, wmodel="BM25")

16:21:26.664 [ForkJoinPool-4-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 284 empty documents


----
LANGUAGE only

In [18]:
indexer = pt.IterDictIndexer("/tmp/index-LANGUAGE", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_lang_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['LANGUAGE']))))
bm25_lang = pt.BatchRetrieve(index_lang_text, wmodel="BM25")

16:22:00.904 [ForkJoinPool-5-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 24 empty documents


----
WORK_OF_ART only

In [19]:
indexer = pt.IterDictIndexer("/tmp/index-WORK_OF_ART", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_woa_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['WORK_OF_ART']))))
bm25_woa = pt.BatchRetrieve(index_woa_text, wmodel="BM25")

16:22:22.345 [ForkJoinPool-6-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 26 empty documents


----
NORP only

In [20]:
indexer = pt.IterDictIndexer("/tmp/index-NORP", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_norp_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['NORP']))))
bm25_norp = pt.BatchRetrieve(index_norp_text, wmodel="BM25")

16:22:52.059 [ForkJoinPool-7-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 116 empty documents


----
GPE only

In [21]:
indexer = pt.IterDictIndexer("/tmp/index-GPE", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_gpe_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['GPE']))))
bm25_gpe = pt.BatchRetrieve(index_gpe_text, wmodel="BM25")

16:23:08.873 [ForkJoinPool-8-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 371 empty documents


----
PERCENT only

In [22]:
indexer = pt.IterDictIndexer("/tmp/index-PERCENT", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_perc_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['PERCENT']))))
bm25_perc = pt.BatchRetrieve(index_perc_text, wmodel="BM25")

16:23:21.133 [ForkJoinPool-9-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 7 empty documents


----
ORDINAL only

In [23]:
indexer = pt.IterDictIndexer("/tmp/index-ORDINAL", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_ord_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['ORDINAL']))))
bm25_ord = pt.BatchRetrieve(index_ord_text, wmodel="BM25")

16:23:34.301 [ForkJoinPool-10-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents


----
DATE only

In [24]:
indexer = pt.IterDictIndexer("/tmp/index-DATE", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_date_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['DATE']))))
bm25_date = pt.BatchRetrieve(index_date_text, wmodel="BM25")

16:24:10.248 [ForkJoinPool-11-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 41 empty documents


----
QUANTITY only

In [25]:
indexer = pt.IterDictIndexer("/tmp/index-QUANTITY", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_quant_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['QUANTITY']))))
bm25_quant = pt.BatchRetrieve(index_quant_text, wmodel="BM25")

16:24:26.386 [ForkJoinPool-12-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents


----
PRODUCT only

In [26]:
indexer = pt.IterDictIndexer("/tmp/index-PRODUCT", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_prod_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['PRODUCT']))))
bm25_prod = pt.BatchRetrieve(index_prod_text, wmodel="BM25")

16:24:38.812 [ForkJoinPool-13-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 168 empty documents


----
FAC only

In [27]:
indexer = pt.IterDictIndexer("/tmp/index-FAC", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_fac_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['FAC']))))
bm25_fac = pt.BatchRetrieve(index_fac_text, wmodel="BM25")

16:24:50.512 [ForkJoinPool-14-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 16 empty documents


----
LOC only

In [28]:
indexer = pt.IterDictIndexer("/tmp/index-LOC", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_loc_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['LOC']))))
bm25_loc = pt.BatchRetrieve(index_loc_text, wmodel="BM25")

16:25:02.223 [ForkJoinPool-15-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 11 empty documents


----
MONEY only

In [29]:
indexer = pt.IterDictIndexer("/tmp/index-MONEY", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_money_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['MONEY']))))
bm25_money = pt.BatchRetrieve(index_money_text, wmodel="BM25")

16:25:14.597 [ForkJoinPool-16-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 22 empty documents


----
EVENT only

In [30]:
indexer = pt.IterDictIndexer("/tmp/index-EVENT", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_event_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['EVENT']))))
bm25_event = pt.BatchRetrieve(index_event_text, wmodel="BM25")

16:25:26.643 [ForkJoinPool-17-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents


----
TIME only

In [31]:
indexer = pt.IterDictIndexer("/tmp/index-TIME", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_time_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['TIME']))))
bm25_time = pt.BatchRetrieve(index_time_text, wmodel="BM25")

16:25:43.336 [ForkJoinPool-18-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 6 empty documents


----
LAW only

In [32]:
indexer = pt.IterDictIndexer("/tmp/index-LAW", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_law_text = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['LAW']))))
bm25_law = pt.BatchRetrieve(index_law_text, wmodel="BM25")

16:25:55.348 [ForkJoinPool-19-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 16 empty documents


----
Results

In [48]:
# Compare the full-text baseline against the entity-only indexes.
# Note: bm25_ord (ORDINAL) is not part of this list.
pt.Experiment(
    [bm25, bm25_org, bm25_card, bm25_pers, bm25_lang, bm25_woa, bm25_norp, bm25_gpe, bm25_perc,
     bm25_date, bm25_quant, bm25_prod, bm25_fac, bm25_loc, bm25_money, bm25_event, bm25_time, bm25_law],
    pt_dataset.get_topics("query"),
    pt_dataset.get_qrels(),
    eval_metrics=['ndcg_cut_10', 'P_10'],
)

Unnamed: 0,name,ndcg_cut_10,P_10
0,BR(BM25),0.374041,0.332353
1,BR(BM25),0.063756,0.044118
2,BR(BM25),0.0,0.0
3,BR(BM25),0.014584,0.011765
4,BR(BM25),0.0,0.0
5,BR(BM25),0.015978,0.011765
6,BR(BM25),0.004855,0.002941
7,BR(BM25),0.0,0.0
8,BR(BM25),0.0,0.0
9,BR(BM25),0.0,0.0
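
To make the two reported measures concrete, here is a small sketch of P@10 and nDCG@10 for a single topic in plain Python. It assumes linear gains and a log2 rank discount; the function names and the toy judgments are hypothetical, and the evaluation library used by `pt.Experiment` may handle edge cases differently.

```python
import math


def precision_at_k(ranked_docnos, relevance, k=10):
    """Fraction of the top-k positions filled with relevant documents."""
    top = ranked_docnos[:k]
    return sum(1 for docno in top if relevance.get(docno, 0) > 0) / k


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with a log2(rank + 1) discount (rank starts at 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_docnos, relevance, k=10):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(docno, 0) for docno in ranked_docnos]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and ranking for one topic.
relevance = {"a": 1, "b": 1, "c": 0}
ranking = ["a", "x", "b"]

p10 = precision_at_k(ranking, relevance)
ndcg10 = ndcg_at_k(ranking, relevance)
print(round(p10, 3), round(ndcg10, 3))
```

Note that P@10 always divides by 10, so a ranking with fewer than ten retrieved documents cannot reach a precision of 1.0.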


-----
Result:
-----
Effective: ORG, PERSON, WORK_OF_ART, NORP, PRODUCT, FAC, LOC, MONEY, EVENT, LAW


In [49]:
indexer = pt.IterDictIndexer("/tmp/index-MULTIPLE_TYPES", overwrite=True, stemmer='PorterStemmer', meta={'docno': 75, 'text': 4096})

index_with_entities = pt.IndexFactory.of(indexer.index(retain_only_entity_type_text(set(['ORG', 'PERSON', 'WORK_OF_ART', 'NORP', 'PRODUCT', 'FAC', 'LOC', 'MONEY', 'EVENT', 'LAW']))))
bm25_ents = pt.BatchRetrieve(index_with_entities, wmodel="BM25")

16:55:10.988 [ForkJoinPool-20-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 202 empty documents


In [50]:
pt.Experiment([bm25, bm25_ents], pt_dataset.get_topics("query"), pt_dataset.get_qrels(), eval_metrics=['ndcg_cut_10', 'P_10'])

Unnamed: 0,name,ndcg_cut_10,P_10
0,BR(BM25),0.374041,0.332353
1,BR(BM25),0.085677,0.060294


-----
Try Improvements
-----

In [56]:
pipe_1 = (bm25*0.99) + (bm25_ents*0.01)
pipe_2 = (bm25*0.98) + (bm25_ents*0.02)
pipe_3 = (bm25*0.97) + (bm25_ents*0.03)

In [57]:
pt.Experiment([bm25, pipe_1, pipe_2, pipe_3], pt_dataset.get_topics("query"), pt_dataset.get_qrels(), eval_metrics=['ndcg_cut_10', 'P_10'])

Unnamed: 0,name,ndcg_cut_10,P_10
0,BR(BM25),0.374041,0.332353
1,"Sum(ScalarProd(BR(BM25), 0.99), ScalarProd(BR(...",0.37553,0.335294
2,"Sum(ScalarProd(BR(BM25), 0.98), ScalarProd(BR(...",0.369844,0.330882
3,"Sum(ScalarProd(BR(BM25), 0.97), ScalarProd(BR(...",0.36204,0.326471
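
The pipelines above combine the scores of two retrievers with fixed weights. The following sketch shows the same idea on plain dictionaries keyed by `(qid, docno)`. It assumes that a document missing from one result list contributes a score of 0 for that list; `combine_runs` and all scores are hypothetical.

```python
def combine_runs(run_a, run_b, weight_a, weight_b):
    """Weighted sum of two score dictionaries keyed by (qid, docno)."""
    combined = {}
    for key in set(run_a) | set(run_b):
        # Assumption: a document missing from one run scores 0.0 there.
        combined[key] = weight_a * run_a.get(key, 0.0) + weight_b * run_b.get(key, 0.0)
    return combined


# Hypothetical scores for a single query with qid "1".
full_text = {("1", "d1"): 15.0, ("1", "d2"): 14.9}
entities = {("1", "d2"): 20.0, ("1", "d3"): 5.0}

fused = combine_runs(full_text, entities, 0.99, 0.01)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Even a weight of 0.01 can swap two documents whose full-text scores are close, which fits the small changes between the runs in the table above.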


----
NOT MINE
---

### Step 4: Create the Run


In [15]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


Unnamed: 0,qid,query
0,1,retrieval system improving effectiveness
1,2,machine learning language identification
2,3,social media detect self harm


In [20]:
print('Now we do the retrieval...')
run = bm25(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


Unnamed: 0,qid,docid,docno,rank,score,query
0,1,94858,2004.cikm_conference-2004.47,0,15.681777,retrieval system improving effectiveness
1,1,125137,1989.ipm_journal-ir0volumeA25A4.2,1,15.04738,retrieval system improving effectiveness
2,1,125817,2005.ipm_journal-ir0volumeA41A5.11,2,14.144223,retrieval system improving effectiveness
3,1,5868,W05-0704,3,14.025748,retrieval system improving effectiveness
4,1,84876,2016.ntcir_conference-2016.90,4,13.947994,retrieval system improving effectiveness
5,1,82472,1998.sigirconf_conference-98.15,5,13.901647,retrieval system improving effectiveness
6,1,94415,2008.cikm_conference-2008.183,6,13.808208,retrieval system improving effectiveness
7,1,17496,O01-2005,7,13.749449,retrieval system improving effectiveness
8,1,82490,1998.sigirconf_conference-98.33,8,13.735541,retrieval system improving effectiveness
9,1,124801,2006.ipm_journal-ir0volumeA42A3.2,9,13.569263,retrieval system improving effectiveness


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a different notebook.

In [21]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
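
Run files are commonly stored in the six-column TREC format (`qid Q0 docno rank score system`). The sketch below writes two rows of the run shown above in that layout. It is only an illustration of the format: `write_trec_run` is a hypothetical helper, and the exact normalization that `persist_and_normalize_run` applies (for example to ranks or ties) is not reproduced here.

```python
import io


def write_trec_run(rows, system_name, out):
    """Write one line per result in the layout: qid Q0 docno rank score system."""
    for row in rows:
        out.write(
            f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} {system_name}\n"
        )


# The first two entries of the run shown in Step 4.
rows = [
    {"qid": "1", "docno": "2004.cikm_conference-2004.47", "rank": 0, "score": 15.681777},
    {"qid": "1", "docno": "1989.ipm_journal-ir0volumeA25A4.2", "rank": 1, "score": 15.04738},
]

buffer = io.StringIO()  # stands in for a file opened for writing
write_trec_run(rows, "bm25-baseline", buffer)
print(buffer.getvalue(), end="")
```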
