# IR Lab SoSe 2024: Combined Retrieval System

This Jupyter notebook implements an improved retrieval system that combines components from both provided notebooks.
We use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). The notebook takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please run evaluations in a separate, dedicated notebook.

### Step 0: Install Required Packages and Set Up Logging

Execute this cell if you're using Google Colab or if you haven't installed these packages yet.

In [12]:
!pip install tira ir-datasets python-terrier transformers torch nltk

import logging
logging.basicConfig(level=logging.INFO)
logging.info("Logging initialized.")



INFO:root:Logging initialized.


### Step 1: Import Libraries

In [13]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import pandas as pd
import os
from transformers import BertTokenizer, BertForTokenClassification, pipeline
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK data
nltk.download('punkt')

# Initialize PyTerrier and TIRA client
ensure_pyterrier_is_loaded()
tira = Client()

logging.info("Libraries imported successfully.")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/martinschlenk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
INFO:root:Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
INFO:root:No settings given in /Users/martinschlenk/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /Users/martinschlenk/.tira/.tira-settings.json. I will use defaults.
INFO:root:Libraries imported successfully.


### Step 2: Load the Dataset and the Index

In [14]:
try:
    # The dataset: the union of the IR Anthology and the ACL Anthology
    pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
    logging.info("Dataset loaded successfully.")

    # A (pre-built) PyTerrier index loaded from TIRA
    index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
    logging.info("Index loaded successfully.")
except Exception as e:
    logging.error(f"An error occurred while loading the dataset or index: {str(e)}")
    raise

INFO:root:Dataset loaded successfully.
INFO:root:Index loaded successfully.


### Step 3: Define the Retrieval Pipeline

In [15]:
# Base retrieval model with BM25
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Query expansion with Bo1
# fb_docs: number of feedback documents, fb_terms: number of expansion terms
bo1_expansion = pt.rewrite.Bo1QueryExpansion(index, fb_docs=10, fb_terms=20)
bm25_bo1 = bm25 >> bo1_expansion >> bm25

# Additional retrieval models (they retrieve from the index; they do not rerank)
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
dirichletLM = pt.BatchRetrieve(index, wmodel="DirichletLM")

# Combined retrieval pipeline: a weighted linear combination of the retrieval scores
# TF-IDF and DirichletLM each get twice the weight of BM25 with Bo1 expansion
combined_pipeline = bm25_bo1 + 2 * tf_idf + 2 * dirichletLM

logging.info("Retrieval pipeline defined successfully.")

INFO:root:Retrieval pipeline defined successfully.
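
The `+` and `*` operators in the cell above build a weighted linear combination of retrieval scores. As a rough, library-independent sketch of that idea, the combination for a single query can be written as follows. The function `fuse_scores` and the toy scores are hypothetical and not part of PyTerrier, and treating a document that is missing from one run as contributing 0 is an assumption about the fusion behavior:

```python
from collections import defaultdict

def fuse_scores(weighted_runs):
    """Weighted linear combination of per-document scores for ONE query.

    weighted_runs: list of (weight, {docno: score}) pairs.
    Assumption: a document missing from a run contributes 0 for that run.
    """
    fused = defaultdict(float)
    for weight, scores in weighted_runs:
        for docno, score in scores.items():
            fused[docno] += weight * score
    # Sort by descending fused score; ties are broken by docno for determinism
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Toy scores, made up for illustration only
bm25_bo1_scores = {"d1": 10.0, "d2": 8.0}
tf_idf_scores = {"d1": 5.0, "d3": 6.0}
dirichlet_scores = {"d2": 1.0, "d3": 2.0}

ranking = fuse_scores([(1, bm25_bo1_scores), (2, tf_idf_scores), (2, dirichlet_scores)])
print(ranking)
```

Scores from BM25, TF-IDF, and DirichletLM are on different scales, so the weights are not directly comparable across models; normalizing the scores per query before combining them may be worth testing in the evaluation notebook.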


### Step 4: Create the Run

In [16]:
print('First, we have a short look at the first three topics:')
topics = pt_dataset.get_topics('text')
print(topics.head(3))

# Query Segmentation
try:
    print('\nInitializing BERT model for Named Entity Recognition...')
    tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
    model = BertForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    logging.info('BERT model loaded successfully')
except Exception as e:
    logging.error(f"Error loading BERT model: {str(e)}")
    raise

# Domain-specific terms for query segmentation
domain_specific_terms = [
    "natural language processing", "NLP", "information retrieval", "IR",
    "machine learning", "deep learning", "neural network", "text mining",
    "language model", "BERT", "transformer", "word embeddings", "semantic search",
    "question answering", "text classification", "entity recognition",
    "tokenization", "part-of-speech tagging", "POS tagging", "named entity recognition", "NER",
    "sentiment analysis", "topic modeling", "latent Dirichlet allocation", "LDA",
    "vector space model", "TF-IDF", "BM25", "relevance feedback",
    "information retrieval evaluation", "precision", "recall", "F1 score",
    "mean average precision", "MAP", "normalized discounted cumulative gain", "nDCG",
    "word2vec", "GloVe", "fastText", "attention mechanism",
    "sequence-to-sequence", "seq2seq", "encoder-decoder", "automatic summarization",
    "machine translation", "language generation", "dialogue systems", "chatbots",
    "cross-lingual information retrieval", "multilingual models", "transfer learning",
    "fine-tuning", "pre-trained models", "zero-shot learning",
    "few-shot learning", "domain adaptation", "semi-supervised learning",
    "unsupervised learning", "self-supervised learning", "contrastive learning",
    "contextual embeddings", "contextualized word representations",
    "transformer-based models", "convolutional neural networks", "CNNs",
    "recurrent neural networks", "RNNs", "long short-term memory", "LSTM",
    "gated recurrent units", "GRU", "sequence labeling", "dependency parsing",
    "constituency parsing", "syntactic parsing", "semantic parsing",
    "coreference resolution", "relation extraction", "information extraction",
    "knowledge graphs", "ontologies", "semantic role labeling", "SRL",
    "document retrieval", "passage retrieval", "question answering systems",
    "retrieval-augmented generation", "RAG", "open-domain QA", "closed-domain QA",
    "query expansion", "query reformulation", "interactive information retrieval",
    "user modeling", "personalized search", "context-aware retrieval",
    "query understanding", "query intent", "search engine optimization", "SEO",
    "click-through rate", "CTR", "session-based search", "search result diversification",
    "exploratory search", "faceted search", "enterprise search",
    "legal information retrieval", "medical information retrieval",
    "scientific information retrieval", "scholarly search", "academic search",
    "digital libraries", "citation analysis", "bibliometrics", "altmetrics",
    "author disambiguation", "document clustering", "document classification",
    "information visualization", "search interfaces", "human-computer interaction",
    "HCI", "recommendation systems", "collaborative filtering", "content-based filtering",
    "hybrid recommendation", "ranking algorithms", "learning to rank", "LTR",
    "pairwise ranking", "listwise ranking", "pointwise ranking", "click models",
    "user feedback", "implicit feedback", "explicit feedback", "active learning",
    "crowdsourcing", "data annotation", "evaluation metrics", "benchmark datasets"
]

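import re

def advanced_segment_query(query):
    """Keep named entities and known domain terms; fall back to all query tokens."""
    ner_results = nlp(query)
    # Collect tokens tagged as PER, ORG, LOC, or MISC; a list keeps the order deterministic
    segments = [result['word'] for result in ner_results if result['entity'] in ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']]
    for term in domain_specific_terms:
        # Whole-word, case-insensitive match, so that e.g. "NER" does not match "general"
        if re.search(r'(?<!\w)' + re.escape(term) + r'(?!\w)', query, flags=re.IGNORECASE):
            segments.append(term)
    # Drop duplicates while preserving order
    segments = list(dict.fromkeys(segments))
    if not segments:
        segments = word_tokenize(query)
    return " ".join(segments)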

print('\n Segmenting the queries...')
segmented_topics = topics.copy()
segmented_topics['query'] = segmented_topics['query'].apply(advanced_segment_query)
print(segmented_topics.head(3))

print('\n Now we do the retrieval...')
run = combined_pipeline.transform(segmented_topics)

print('\n Done. Here are the first 10 entries of the run')
print(run.head(10))

# Define candidate output directories
output_dirs = [
    os.environ.get('outputDir', '/output'),  # TIRA-specific directory
    '../runs',  # Local directory outside the sandbox
    '.'  # Current directory as a fallback
]

# Try to write to each directory in turn until one succeeds
for output_dir in output_dirs:
    try:
        if output_dir != '/output':  # We know that /output is read-only
            os.makedirs(output_dir, exist_ok=True)
        run_file_path = os.path.join(output_dir, 'run.txt')
        run.to_csv(run_file_path, sep='\t', index=False, header=False)
        logging.info(f"Results saved to {run_file_path}")
        break  # Stop the loop once writing has succeeded
    except OSError as e:
        logging.warning(f"Could not save to {output_dir}: {str(e)}")
else:
    logging.error("Failed to save results to any output directory")
    raise RuntimeError("No writable output directory found")

# Persist and normalize the run, if possible
try:
    persist_and_normalize_run(run, system_name='combined-bm25-bo1-tfidf-dirichlet', default_output='../runs')
except Exception as e:
    logging.warning(f"Could not persist and normalize run: {str(e)}")

First, we have a short look at the first three topics:
  qid                                     query
0   1  retrieval system improving effectiveness
1   2  machine learning language identification
2   3             social media detect self harm

Initializing BERT model for Named Entity Recognition...


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:root:BERT model loaded successfully



 Segmenting the queries...
  qid                                     query
0   1  retrieval system improving effectiveness
1   2                          machine learning
2   3             social media detect self harm

 Now we do the retrieval...





 Done. Here are the first 10 entries of the run
  qid     docid                                       docno      score  \
0   1   94858.0                2004.cikm_conference-2004.47  41.726132   
1   1   94415.0               2008.cikm_conference-2008.183  37.286952   
2   1  124801.0           2006.ipm_journal-ir0volumeA42A3.2  36.632425   
3   1   17496.0                                    O01-2005  36.026305   
4   1   82472.0             1998.sigirconf_conference-98.15  35.786096   
5   1   82490.0             1998.sigirconf_conference-98.33  35.791673   
6   1   74513.0                 2001.clef_workshop-2001w.24  34.150032   
7   1  125137.0           1989.ipm_journal-ir0volumeA25A4.2  34.147797   
8   1  125817.0          2005.ipm_journal-ir0volumeA41A5.11  35.869904   
9   1  114223.0  2014.wwwjournals_journal-ir0volumeA17A4.15  33.782619   

                                    query_0  \
0  retrieval system improving effectiveness   
1  retrieval system improving effectivenes

INFO:root:Results saved to ../runs/run.txt


The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
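
The segmented queries above depend on how the domain terms are matched against the query text. A plain substring check can fire on short acronyms that happen to occur inside longer words (for example, "ner" occurs inside "general"). A minimal sketch of whole-word, case-insensitive matching follows; the helper `find_domain_terms` and the short term list are hypothetical and only illustrate the idea:

```python
import re

def find_domain_terms(query, terms):
    """Return the terms that occur in the query as whole words (case-insensitive)."""
    found = []
    for term in terms:
        # (?<!\w) and (?!\w) require that the term is not embedded in a longer word
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, query, flags=re.IGNORECASE):
            found.append(term)
    return found

terms = ["machine learning", "NER", "IR", "TF-IDF"]  # illustrative subset
example_query = "general tf-idf weighting"

# Substring matching also reports "NER" because of the word "general"
substring_hits = [t for t in terms if t.lower() in example_query.lower()]
whole_word_hits = find_domain_terms(example_query, terms)
print(substring_hits)
print(whole_word_hits)
```

Note that any segmentation of this kind can also discard query terms: in the output above, topic 2 is reduced from "machine learning language identification" to "machine learning", which is worth checking in the evaluation.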


### Step 5: Persist the run file for subsequent evaluations

In [17]:
try:
    # Make sure the local fallback directory exists before persisting the run
    os.makedirs('../runs', exist_ok=True)
    persist_and_normalize_run(run, system_name='combined-bm25-bo1-tfidf-dirichlet', default_output='../runs')
except Exception as e:
    logging.error(f"An error occurred while saving the run file: {str(e)}")
    raise

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".


ERROR:root:An error occurred while saving the run file: Cannot save file into a non-existent directory: '/output'


Done. run file is stored under "../runs/run.txt".


OSError: Cannot save file into a non-existent directory: '/output'
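
For subsequent evaluations, run files are commonly stored in the six-column TREC format (`qid Q0 docno rank score tag`). A minimal sketch of producing such lines from (qid, docno, score) rows follows. The helper `to_trec_lines` and the document identifiers are hypothetical, and whether `persist_and_normalize_run` writes exactly this layout is an assumption, not something this notebook shows:

```python
from collections import defaultdict

def to_trec_lines(rows, system_name):
    """Convert (qid, docno, score) rows into TREC run lines, ranked per query."""
    by_query = defaultdict(list)
    for qid, docno, score in rows:
        by_query[qid].append((docno, score))
    lines = []
    for qid in sorted(by_query):
        # Rank by descending score; ties are broken by docno for determinism
        ranked = sorted(by_query[qid], key=lambda pair: (-pair[1], pair[0]))
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines

# Made-up rows for illustration only
rows = [("1", "docA", 41.7), ("1", "docB", 37.3), ("2", "docC", 12.0), ("1", "docD", 39.1)]
trec_lines = to_trec_lines(rows, "combined-bm25-bo1-tfidf-dirichlet")
for line in trec_lines:
    print(line)
```

Ranks are recomputed from the scores here, so the rows do not need to arrive in ranked order.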