# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new, dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the dataset. On top of these we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), a Python framework built on the open-source search engine Terrier.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [1]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier transformers torch nltk

Collecting tira
  Using cached tira-0.0.134-py3-none-any.whl.metadata (4.6 kB)
Collecting ir-datasets
  Using cached ir_datasets-0.5.8-py3-none-any.whl.metadata (12 kB)
Collecting python-terrier
  Using cached python_terrier-0.10.1-py3-none-any.whl
Collecting transformers
  Using cached transformers-4.42.3-py3-none-any.whl.metadata (43 kB)
Collecting torch
  Using cached torch-2.3.1-cp311-none-macosx_11_0_arm64.whl.metadata (26 kB)
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting requests==2.*,>=2.26 (from tira)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting docker==7.*,>=7.1.0 (from tira)
  Using cached docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting numpy==1.* (from tira)
  Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (114 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.8/114.8 kB 4.3 MB/s eta 0:00:00
Collecting pandas (from tira)

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
from transformers import BertTokenizer, BertForTokenClassification, pipeline
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/martinschlenk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

In [4]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as a baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [5]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
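
To make the ranking function behind `wmodel="BM25"` more concrete, here is a minimal pure-Python sketch of textbook BM25 on a toy corpus. It is an illustration under assumptions, not Terrier's implementation: the function name `bm25_scores`, the whitespace tokenization, the smoothed idf variant, and the parameter values `k1=1.2` and `b=0.75` are all choices made for this example, and Terrier's exact formula and preprocessing may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed idf; the "1 +" keeps the value non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, only for illustration.
toy_docs = [
    "improving retrieval effectiveness with query expansion",
    "language identification with machine learning",
    "a retrieval system for scientific papers",
]
toy_scores = bm25_scores("retrieval system improving effectiveness", toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

The second toy document shares no term with the query and therefore scores 0. Lowering `b` weakens the length normalization, and raising `k1` lets repeated terms contribute more; these are the kinds of parameters one could tune when trying to improve the baseline.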

### Step 4: Create the Run


In [6]:
print('First, we have a short look at the first three topics:')

topics = pt_dataset.get_topics('text')
print(topics.head(3))

First, we have a short look at the first three topics:
  qid                                     query
0   1  retrieval system improving effectiveness
1   2  machine learning language identification
2   3             social media detect self harm


In [7]:
# Query segmentation: load a BERT model fine-tuned for NER on CoNLL-2003 (English).
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = BertForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# def advanced_segment_query(query):
#     ner_results = nlp(query)
#     segments = [result['word'] for result in ner_results if result['entity'] in ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']]
#     if not segments:
#         segments = word_tokenize(query)
#     return " ".join(segments)

domain_specific_terms = [
    "natural language processing", "NLP", "information retrieval", "IR",
    "machine learning", "deep learning", "neural network", "text mining",
    "language model", "BERT", "transformer", "word embeddings", "semantic search",
    "question answering", "text classification", "entity recognition",
    "tokenization", "part-of-speech tagging", "POS tagging", "named entity recognition", "NER",
    "sentiment analysis", "topic modeling", "latent Dirichlet allocation", "LDA",
    "vector space model", "TF-IDF", "BM25", "relevance feedback",
    "information retrieval evaluation", "precision", "recall", "F1 score",
    "mean average precision", "MAP", "normalized discounted cumulative gain", "nDCG",
    "word2vec", "GloVe", "fastText", "attention mechanism",
    "sequence-to-sequence", "seq2seq", "encoder-decoder", "automatic summarization",
    "machine translation", "language generation", "dialogue systems", "chatbots",
    "cross-lingual information retrieval", "multilingual models", "transfer learning",
    "fine-tuning", "pre-trained models", "zero-shot learning",
    "few-shot learning", "domain adaptation", "semi-supervised learning",
    "unsupervised learning", "self-supervised learning", "contrastive learning",
    "contextual embeddings", "contextualized word representations",
    "transformer-based models", "convolutional neural networks", "CNNs",
    "recurrent neural networks", "RNNs", "long short-term memory", "LSTM",
    "gated recurrent units", "GRU", "sequence labeling", "dependency parsing",
    "constituency parsing", "syntactic parsing", "semantic parsing",
    "coreference resolution", "relation extraction", "information extraction",
    "knowledge graphs", "ontologies", "semantic role labeling", "SRL",
    "document retrieval", "passage retrieval", "question answering systems",
    "retrieval-augmented generation", "RAG", "open-domain QA", "closed-domain QA",
    "query expansion", "query reformulation", "interactive information retrieval",
    "user modeling", "personalized search", "context-aware retrieval",
    "query understanding", "query intent", "search engine optimization", "SEO",
    "click-through rate", "CTR", "session-based search", "search result diversification",
    "exploratory search", "faceted search", "enterprise search",
    "legal information retrieval", "medical information retrieval",
    "scientific information retrieval", "scholarly search", "academic search",
    "digital libraries", "citation analysis", "bibliometrics", "altmetrics",
    "author disambiguation", "document clustering", "document classification",
    "information visualization", "search interfaces", "human-computer interaction",
    "HCI", "recommendation systems", "collaborative filtering", "content-based filtering",
    "hybrid recommendation", "ranking algorithms", "learning to rank", "LTR",
    "pairwise ranking", "listwise ranking", "pointwise ranking", "click models",
    "user feedback", "implicit feedback", "explicit feedback", "active learning",
    "crowdsourcing", "data annotation", "evaluation metrics", "benchmark datasets"
]
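def advanced_segment_query(query):
    """Keep only named entities and known domain terms; fall back to all tokens."""
    entity_labels = {'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'}
    # Named-entity tokens found by the NER pipeline, de-duplicated in a stable order.
    segments = list(dict.fromkeys(
        result['word'] for result in nlp(query) if result['entity'] in entity_labels))
    # Add every domain term that occurs in the query (case-sensitive substring match).
    for term in domain_specific_terms:
        if term in query and term not in segments:
            segments.append(term)
    # Note: once anything matched, all other query terms are dropped.
    # If nothing matched, keep the full tokenized query.
    if not segments:
        segments = word_tokenize(query)
    return " ".join(segments)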

print('Segmenting the queries...')
segmented_topics = topics.copy()
segmented_topics['query'] = segmented_topics['query'].apply(advanced_segment_query)
print(segmented_topics.head(3))

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Segmenting the queries...
  qid                                     query
0   1  retrieval system improving effectiveness
1   2                          machine learning
2   3             social media detect self harm
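
The output above shows that topic 2 shrinks from "machine learning language identification" to "machine learning", because only the matched segments are kept. As a possible alternative, the following sketch segments a query by greedy longest-match against a phrase dictionary while keeping every remaining token. It is a dictionary-only illustration that does not use the BERT NER model; the function name `segment_with_dictionary` and the list `toy_terms` are hypothetical and introduced only for this example.

```python
def segment_with_dictionary(query, phrases):
    """Greedy longest-match segmentation of `query` against known phrases.

    Matched multi-word phrases stay together, every other token becomes its
    own segment, and the original token order is preserved.
    """
    tokens = query.split()
    # Store phrases as lowercase token tuples so matching respects word boundaries.
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    segments, i = [], 0
    while i < len(tokens):
        # Try the longest window first, then shrink it down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in phrase_set:
                segments.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word phrase starts here: keep the single token.
            segments.append(tokens[i])
            i += 1
    return segments


# Hypothetical mini dictionary, only for illustration.
toy_terms = ["machine learning", "information retrieval", "IR"]
segments = segment_with_dictionary("machine learning language identification", toy_terms)
print(segments)  # ['machine learning', 'language', 'identification']
```

Because matching works on whole tokens, an abbreviation such as "IR" cannot match inside a longer word. Joining the segments with spaces gives back the full query, so no query term is lost before retrieval.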


In [8]:
print('Now we do the retrieval...')
run = bm25(segmented_topics)

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 94858 | 2004.cikm_conference-2004.47 | 0 | 15.681777 | retrieval system improving effectiveness |
| 1 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 1 | 15.04738 | retrieval system improving effectiveness |
| 2 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 2 | 14.144223 | retrieval system improving effectiveness |
| 3 | 1 | 5868 | W05-0704 | 3 | 14.025748 | retrieval system improving effectiveness |
| 4 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 4 | 13.947994 | retrieval system improving effectiveness |
| 5 | 1 | 82472 | 1998.sigirconf_conference-98.15 | 5 | 13.901647 | retrieval system improving effectiveness |
| 6 | 1 | 94415 | 2008.cikm_conference-2008.183 | 6 | 13.808208 | retrieval system improving effectiveness |
| 7 | 1 | 17496 | O01-2005 | 7 | 13.749449 | retrieval system improving effectiveness |
| 8 | 1 | 82490 | 1998.sigirconf_conference-98.33 | 8 | 13.735541 | retrieval system improving effectiveness |
| 9 | 1 | 124801 | 2006.ipm_journal-ir0volumeA42A3.2 | 9 | 13.569263 | retrieval system improving effectiveness |


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a different notebook.

In [9]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
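
For readers who have not seen a run file before, the sketch below writes and re-reads a few entries in the standard TREC run layout (`qid Q0 docno rank score tag`). It assumes that `run.txt` follows this layout, which is not confirmed here, and it does not reproduce any normalization that `persist_and_normalize_run` may apply. The helper names `write_trec_run` and `read_trec_run` are hypothetical; the three rows are taken from the run shown in Step 4.

```python
import io


def write_trec_run(rows, system_name, out):
    """Write (qid, docno, rank, score) rows as whitespace-separated TREC run lines."""
    for qid, docno, rank, score in rows:
        out.write(f"{qid} Q0 {docno} {rank} {score} {system_name}\n")


def read_trec_run(lines):
    """Parse TREC run lines back into dictionaries, skipping empty lines."""
    parsed = []
    for line in lines:
        if not line.strip():
            continue
        qid, _q0, docno, rank, score, tag = line.split()
        parsed.append({"qid": qid, "docno": docno, "rank": int(rank),
                       "score": float(score), "tag": tag})
    return parsed


# First three entries of the run from Step 4.
rows = [
    ("1", "2004.cikm_conference-2004.47", 0, 15.681777),
    ("1", "1989.ipm_journal-ir0volumeA25A4.2", 1, 15.04738),
    ("1", "2005.ipm_journal-ir0volumeA41A5.11", 2, 14.144223),
]
buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_trec_run(rows, "bm25-baseline", buffer)
entries = read_trec_run(buffer.getvalue().splitlines())
print(entries[0])
```

Each line carries the topic id, a constant `Q0` column, the document id, the rank, the score, and a tag naming the system, which is the information an evaluation notebook needs in order to compare the run against relevance judgments.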
