# IR Lab SoSe 2024: Comparing BM25 and BM25+TF-IDF Retrieval Systems

This Jupyter notebook implements and compares two retrieval systems: a BM25 baseline, and a two-stage pipeline in which BM25 retrieves the candidate documents and TF-IDF re-scores them.
We use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)).

### Step 1: Import libraries

In [1]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
from pyterrier.pipelines import *
import pandas as pd



In [2]:
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load dataset and index

In [3]:
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define retrieval pipelines

In [4]:
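# BM25 baseline
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# BM25 + TF-IDF pipeline: `>>` feeds the BM25 results into the TF-IDF
# retriever, which re-scores those candidate documents
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
combined_pipeline = bm25 >> tfidf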
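
To make the two-stage idea concrete, here is a minimal, library-free sketch of "retrieve with BM25, then re-score the candidates with TF-IDF". It is an illustration only: the toy corpus, the helper names (`first_stage`, `rerank`) and the textbook scoring formulas are assumptions and do not reproduce Terrier's actual weighting models.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not taken from the anthology index).
DOCS = {
    "d1": "retrieval system effectiveness evaluation",
    "d2": "improving retrieval effectiveness with query expansion",
    "d3": "neural machine translation system",
    "d4": "language identification with machine learning",
}
TOKENS = {docno: text.split() for docno, text in DOCS.items()}
N = len(TOKENS)
AVG_LEN = sum(len(toks) for toks in TOKENS.values()) / N
# Document frequency: in how many documents each term occurs.
DF = Counter(term for toks in TOKENS.values() for term in set(toks))


def bm25_score(query, docno, k1=1.2, b=0.75):
    """Textbook BM25 (an assumption, not Terrier's exact variant)."""
    tf = Counter(TOKENS[docno])
    doc_len = len(TOKENS[docno])
    score = 0.0
    for term in query.split():
        if tf[term] == 0:
            continue
        idf = math.log(1 + (N - DF[term] + 0.5) / (DF[term] + 0.5))
        norm_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / AVG_LEN))
        score += idf * norm_tf
    return score


def tfidf_score(query, docno):
    """Plain tf * log(N / df), again a textbook simplification."""
    tf = Counter(TOKENS[docno])
    return sum(tf[t] * math.log(N / DF[t]) for t in query.split() if tf[t] > 0)


def first_stage(query, k):
    """Stage 1: rank the whole corpus with BM25 and keep the top k matches."""
    scored = [(docno, bm25_score(query, docno)) for docno in TOKENS]
    scored = [(docno, s) for docno, s in scored if s > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def rerank(query, candidates):
    """Stage 2: re-score only the stage-1 candidates with TF-IDF."""
    rescored = [(docno, tfidf_score(query, docno)) for docno, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


query = "retrieval system improving effectiveness"
candidates = first_stage(query, k=2)
reranked = rerank(query, candidates)
print("stage 1 (BM25):   ", candidates)
print("stage 2 (TF-IDF): ", reranked)
```

The second stage can reorder and re-score the candidates, but it never adds a document that the first stage did not return. In this sketch the candidate set is cut to `k=2` to make that visible.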

### Step 4: Create and compare runs

In [5]:
topics = pt_dataset.get_topics('text')
print('Example topics:')
print(topics.head(3))

print('\nRunning retrieval for both pipelines...')
run_bm25 = bm25(topics)
run_combined = combined_pipeline(topics)

print('\nComparing the top-5 results for the first query:')
first_query = topics.iloc[0]['qid']

results_bm25 = run_bm25[run_bm25['qid'] == first_query].head(5)
results_combined = run_combined[run_combined['qid'] == first_query].head(5)

print('\nBM25 Top-5:')
print(results_bm25[['docno', 'score']])

print('\nBM25 + TF-IDF Top-5:')
print(results_combined[['docno', 'score']])

# Compute the overlap of the top-10 results for the first query
top_10_bm25 = set(run_bm25[run_bm25['qid'] == first_query]['docno'].head(10))
top_10_combined = set(run_combined[run_combined['qid'] == first_query]['docno'].head(10))
overlap = len(top_10_bm25.intersection(top_10_combined))

print(f'\nOverlap of the top-10 results: {overlap} of 10')

Example topics:
  qid                                     query
0   1  retrieval system improving effectiveness
1   2  machine learning language identification
2   3             social media detect self harm

Running retrieval for both pipelines...

Comparing the top-5 results for the first query:

BM25 Top-5:
                                docno      score
0        2004.cikm_conference-2004.47  15.681777
1   1989.ipm_journal-ir0volumeA25A4.2  15.047380
2  2005.ipm_journal-ir0volumeA41A5.11  14.144223
3                            W05-0704  14.025748
4       2016.ntcir_conference-2016.90  13.947994

BM25 + TF-IDF Top-5:
                                docno      score
0        2004.cikm_conference-2004.47  10.432440
1   1989.ipm_journal-ir0volumeA25A4.2   9.971822
2  2005.ipm_journal-ir0volumeA41A5.11   9.255656
3     1998.sigirconf_conference-98.33   9.133236
4       2008.cikm_conference-2008.183   9.126181

Overlap of the top-10 results: 9 of 10
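
The overlap above covers only the first query. A possible generalization, sketched here under assumptions, computes the top-k overlap for every query and sorts by score explicitly instead of relying on row order. The helper names and the toy records are hypothetical; the records mimic the `qid`, `docno` and `score` columns of a run.

```python
from collections import defaultdict


def top_k_docnos(records, k):
    """Map each qid to the set of its k highest-scoring docnos."""
    by_qid = defaultdict(list)
    for rec in records:
        by_qid[rec["qid"]].append(rec)
    return {
        qid: {r["docno"] for r in sorted(recs, key=lambda r: r["score"], reverse=True)[:k]}
        for qid, recs in by_qid.items()
    }


def top_k_overlap(run_a, run_b, k=10):
    """Per-query size of the intersection of the two top-k sets."""
    top_a, top_b = top_k_docnos(run_a, k), top_k_docnos(run_b, k)
    return {qid: len(top_a[qid] & top_b[qid]) for qid in top_a.keys() & top_b.keys()}


# Toy runs with made-up docnos and scores, for illustration only.
toy_a = [
    {"qid": "1", "docno": "d1", "score": 3.0},
    {"qid": "1", "docno": "d2", "score": 2.0},
    {"qid": "1", "docno": "d3", "score": 1.0},
    {"qid": "2", "docno": "d4", "score": 2.0},
    {"qid": "2", "docno": "d5", "score": 1.0},
]
toy_b = [
    {"qid": "1", "docno": "d1", "score": 1.5},
    {"qid": "1", "docno": "d3", "score": 1.2},
    {"qid": "1", "docno": "d2", "score": 0.5},
    {"qid": "2", "docno": "d6", "score": 1.0},
    {"qid": "2", "docno": "d7", "score": 0.5},
]

overlaps = top_k_overlap(toy_a, toy_b, k=2)
mean_overlap = sum(overlaps.values()) / len(overlaps)
print(overlaps, mean_overlap)
```

With the notebook's DataFrames, `run_bm25.to_dict("records")` and `run_combined.to_dict("records")` should yield records of this shape. Averaging over all queries gives a less anecdotal picture than a single query.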


### Step 5: Persist run files

In [6]:
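# Note: both calls use the same default_output, and the log below shows both
# runs being stored as ../runs/run.txt, so the second call overwrites the first.
persist_and_normalize_run(run_bm25, system_name='bm25-baseline', default_output='../runs')
persist_and_normalize_run(run_combined, system_name='bm25-tfidf-combined', default_output='../runs')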

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
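
Both log messages name the same file, `../runs/run.txt`. One way to keep both runs is to give each system its own directory. The sketch below illustrates this with a hypothetical helper, `write_trec_run`, that writes a generic TREC-style run file (`qid Q0 docno rank score tag`). Whether `persist_and_normalize_run` produces exactly this column layout is an assumption. The docnos and scores are the first rows of the output shown in Step 4.

```python
import tempfile
from pathlib import Path


def write_trec_run(records, path, tag):
    """Hypothetical helper: write records as 'qid Q0 docno rank score tag' lines."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    by_qid = {}
    for rec in records:
        by_qid.setdefault(rec["qid"], []).append(rec)
    with path.open("w", encoding="utf-8") as out:
        for qid, recs in by_qid.items():
            # Rank 1 is the highest-scoring document of the query.
            ranked = sorted(recs, key=lambda r: r["score"], reverse=True)
            for rank, rec in enumerate(ranked, start=1):
                out.write(f'{qid} Q0 {rec["docno"]} {rank} {rec["score"]} {tag}\n')
    return path


toy_bm25 = [
    {"qid": "1", "docno": "2004.cikm_conference-2004.47", "score": 15.681777},
    {"qid": "1", "docno": "1989.ipm_journal-ir0volumeA25A4.2", "score": 15.047380},
]
toy_combined = [
    {"qid": "1", "docno": "2004.cikm_conference-2004.47", "score": 10.432440},
    {"qid": "1", "docno": "1989.ipm_journal-ir0volumeA25A4.2", "score": 9.971822},
]

# A temporary directory stands in for ../runs so the sketch has no side effects.
out_dir = Path(tempfile.mkdtemp())
bm25_path = write_trec_run(toy_bm25, out_dir / "bm25-baseline" / "run.txt", "bm25-baseline")
combined_path = write_trec_run(
    toy_combined, out_dir / "bm25-tfidf-combined" / "run.txt", "bm25-tfidf-combined"
)
print(bm25_path.read_text(encoding="utf-8"))
```

The same effect could presumably be achieved in the notebook by passing a different `default_output` directory to each `persist_and_normalize_run` call.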
