# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it gets a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to load the dataset. On top of these, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [1]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.

!apt install --upgrade libomp-dev
!pip install --upgrade transformers==3.0.2
!pip install --upgrade faiss-gpu==1.6.3

!pip3 install tira ir-datasets python-terrier

!pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git

checkpoint="http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip"


The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.
Please visit http://www.java.com for information on installing Java.

Collecting transformers==3.0.2
  Using cached transformers-3.0.2-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers==0.8.1.rc1 (from transformers==3.0.2)
  Using cached tokenizers-0.8.1rc1.tar.gz (97 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentencepiece!=0.1.92 (from transformers==3.0.2)
  Using cached sentencepiece-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl.metadata (7.7 kB)
Collecting sacremoses (from transformers==3.0.2)
  Using cached sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Using cached transformers-3.0.2-py3-none-any.whl (769 kB)
Using cached sentencepiece-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB)
Using cached sacremoses-0.1.1-py3-none-any.whl (897 kB)
Building wheels f

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

In [13]:
import pandas as pd
import pyterrier as pt
import torch
from pyterrier_colbert.indexing import ColBERTIndexer
from transformers import AutoModel, AutoTokenizer

# Initialize PyTerrier
if not pt.started():
    pt.init()

# Use the GPU if CUDA is available, otherwise fall back to the CPU
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hugging Face model name. Note: ColBERTIndexer appears to expect the path or URL
# of a trained ColBERT checkpoint file (such as the checkpoint URL in the install
# cell), so passing a bare model name leads to the FileNotFoundError shown below.
checkpoint = "bert-base-uncased"

# Ensure the model is downloaded
AutoModel.from_pretrained(checkpoint)
AutoTokenizer.from_pretrained(checkpoint)

# Initialize the indexer with the correct device
colbert_indexer = ColBERTIndexer(checkpoint=checkpoint,
                                 index_root="/content",
                                 index_name="colbert_index",
                                 chunksize=3,
                                 gpu=DEVICE.type == 'cuda')

# Define your dataset
data = {
    'docno': ['1', '2', '3'],
    'text': ['This is the first document.', 'This is the second document.', 'And this is the third one.']
}
documents = pd.DataFrame(data)

# Convert the DataFrame to an iterator of dictionaries, which ColBERT can use
def get_corpus_iter(documents):
    for row in documents.itertuples(index=False):
        yield {'docno': row.docno, 'text': row.text}

# Index the dataset
colbert_indexer.index(get_corpus_iter(documents))

[Jul 01, 16:00:46] [0] 		 #> Local args.bsize = 128
[Jul 01, 16:00:46] [0] 		 #> args.index_root = /content
[Jul 01, 16:00:46] [0] 		 #> self.possible_subset_sizes = [69905]


Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Jul 01, 16:00:46] #> Loading model checkpoint.
[Jul 01, 16:00:46] #> Loading checkpoint bert-base-uncased


FileNotFoundError: [Errno 2] No such file or directory: 'bert-base-uncased'

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as a baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [None]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
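
To make the ranking function behind `wmodel="BM25"` more concrete, here is a minimal pure-Python sketch of a textbook BM25 variant. It is an illustration under stated assumptions, not Terrier's implementation: the whitespace tokenizer, the toy documents, the function name `bm25_scores`, and the parameter values `k1=1.2` and `b=0.75` are all chosen for this example, and Terrier additionally applies stemming, stopword removal, and its own IDF formulation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (dict: docno -> text) against `query`."""
    # Hypothetical tokenizer: lowercase and split on whitespace
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for docno, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms receive a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


# Toy corpus, invented for this example
toy_docs = {
    "d1": "improving the effectiveness of a retrieval system",
    "d2": "language identification with machine learning",
    "d3": "a retrieval system for social media",
}
scores = bm25_scores("retrieval system improving effectiveness", toy_docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Varying `k1` and `b` in such a sketch gives a feel for what parameter tuning changes: `k1` controls how quickly repeated term occurrences stop adding to the score, and `b` controls how strongly long documents are penalized.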

### Step 4: Create the Run


In [None]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


Unnamed: 0,qid,query
0,1,retrieval system improving effectiveness
1,2,machine learning language identification
2,3,social media detect self harm


In [None]:
print('Now we do the retrieval...')
run = bm25(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


Unnamed: 0,qid,docid,docno,rank,score,query
0,1,94858,2004.cikm_conference-2004.47,0,15.681777,retrieval system improving effectiveness
1,1,125137,1989.ipm_journal-ir0volumeA25A4.2,1,15.04738,retrieval system improving effectiveness
2,1,125817,2005.ipm_journal-ir0volumeA41A5.11,2,14.144223,retrieval system improving effectiveness
3,1,5868,W05-0704,3,14.025748,retrieval system improving effectiveness
4,1,84876,2016.ntcir_conference-2016.90,4,13.947994,retrieval system improving effectiveness
5,1,82472,1998.sigirconf_conference-98.15,5,13.901647,retrieval system improving effectiveness
6,1,94415,2008.cikm_conference-2008.183,6,13.808208,retrieval system improving effectiveness
7,1,17496,O01-2005,7,13.749449,retrieval system improving effectiveness
8,1,82490,1998.sigirconf_conference-98.33,8,13.735541,retrieval system improving effectiveness
9,1,124801,2006.ipm_journal-ir0volumeA42A3.2,9,13.569263,retrieval system improving effectiveness


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a different notebook.

In [None]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
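
For orientation, run files are conventionally stored in the six-column TREC format (`qid Q0 docno rank score system_name`). The following sketch formats the first three result rows from Step 4 in that layout and parses them back. The helper names `to_trec_run` and `parse_trec_run` are hypothetical, and whether `persist_and_normalize_run` produces exactly this layout (for example, whether it renumbers ranks during normalization) is an assumption rather than something this notebook confirms.

```python
def to_trec_run(rows, system_name):
    """Format (qid, docno, rank, score) tuples as TREC-style run lines."""
    lines = []
    for qid, docno, rank, score in rows:
        # "Q0" is a constant placeholder column required by the format
        lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return "\n".join(lines)


def parse_trec_run(text):
    """Parse TREC-style run lines back into (qid, docno, rank, score) tuples."""
    parsed = []
    for line in text.splitlines():
        qid, _q0, docno, rank, score, _tag = line.split()
        parsed.append((qid, docno, int(rank), float(score)))
    return parsed


# First three rows of the BM25 run shown in Step 4
rows = [
    (1, "2004.cikm_conference-2004.47", 0, 15.681777),
    (1, "1989.ipm_journal-ir0volumeA25A4.2", 1, 15.04738),
    (1, "2005.ipm_journal-ir0volumeA41A5.11", 2, 14.144223),
]
run_text = to_trec_run(rows, "bm25-baseline")
print(run_text)
parsed = parse_trec_run(run_text)
```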


# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it gets a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to load the dataset. On top of these, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [None]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier

In [None]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt

In [None]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

In [None]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as a baseline and extend it with Bo1 query expansion. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [None]:
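# First-stage retrieval with BM25
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
# Expand each query with 20 terms taken from the top 10 BM25 results (Bo1 model)
bo1_expansion = bm25 >> pt.rewrite.Bo1QueryExpansion(index, fb_docs=10, fb_terms=20)
# Retrieve again with the expanded queries
bm25_bo1 = bo1_expansion >> bm25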
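
To illustrate how Bo1 chooses expansion terms, the sketch below computes the textbook Bo1 weight from the Divergence from Randomness framework, `w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n)` with `P_n = F / N`, where `tf_x` is the term's frequency in the feedback documents, `F` its frequency in the whole collection, and `N` the number of documents. This is a simplified illustration, not Terrier's implementation: the function name `bo1_weights`, the whitespace tokenizer, and all statistics below are invented for the example, and the reweighting of the original query terms is omitted.

```python
import math
from collections import Counter


def bo1_weights(feedback_docs, collection_tf, n_docs):
    """Weight every term of the feedback documents with the Bo1 formula."""
    # tf_x: term frequency within the pseudo-relevant (feedback) documents
    tf_x = Counter(term for doc in feedback_docs for term in doc.lower().split())
    weights = {}
    for term, freq in tf_x.items():
        # P_n: expected frequency of the term per document in the collection
        p_n = collection_tf[term] / n_docs
        weights[term] = freq * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    return weights


# Hypothetical feedback documents and collection statistics
feedback_docs = [
    "retrieval effectiveness evaluation",
    "retrieval system evaluation",
]
collection_tf = {"retrieval": 500, "effectiveness": 50, "evaluation": 100, "system": 800}

weights = bo1_weights(feedback_docs, collection_tf, n_docs=1000)
# Keep the highest-weighted terms, analogous to the fb_terms parameter
fb_terms = 3
expansion_terms = sorted(weights, key=weights.get, reverse=True)[:fb_terms]
print(expansion_terms)
```

Terms that are frequent in the feedback documents but rare in the collection receive the highest weights, which is why `fb_docs` and `fb_terms` are natural parameters to tune.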

### Step 4: Create the Run


In [None]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


Unnamed: 0,qid,query
0,1,retrieval system improving effectiveness
1,2,machine learning language identification
2,3,social media detect self harm


In [None]:
print('Now we do the retrieval...')
run = bm25_bo1(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


Unnamed: 0,qid,docid,docno,rank,score,query_0,query
0,1,94858,2004.cikm_conference-2004.47,0,16.524789,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
1,1,94415,2008.cikm_conference-2008.183,1,15.495605,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
2,1,124801,2006.ipm_journal-ir0volumeA42A3.2,2,15.230592,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
3,1,17496,O01-2005,3,15.161319,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
4,1,82472,1998.sigirconf_conference-98.15,4,14.723383,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
5,1,82490,1998.sigirconf_conference-98.33,5,14.557489,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
6,1,74513,2001.clef_workshop-2001w.24,6,14.479557,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
7,1,125137,1989.ipm_journal-ir0volumeA25A4.2,7,14.204153,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
8,1,125817,2005.ipm_journal-ir0volumeA41A5.11,8,13.992787,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...
9,1,114223,2014.wwwjournals_journal-ir0volumeA17A4.15,9,13.931153,retrieval system improving effectiveness,applypipeline:off retriev^1.221462101 system^1...


In [None]:
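# For comparison: the plain BM25 baseline. It is stored in a separate variable so that
# `run` still holds the BM25 + Bo1 results that are persisted in Step 5.
run_bm25 = bm25(pt_dataset.get_topics('text'))
run_bm25.head(10)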

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,94858,2004.cikm_conference-2004.47,0,15.681777,retrieval system improving effectiveness
1,1,125137,1989.ipm_journal-ir0volumeA25A4.2,1,15.04738,retrieval system improving effectiveness
2,1,125817,2005.ipm_journal-ir0volumeA41A5.11,2,14.144223,retrieval system improving effectiveness
3,1,5868,W05-0704,3,14.025748,retrieval system improving effectiveness
4,1,84876,2016.ntcir_conference-2016.90,4,13.947994,retrieval system improving effectiveness
5,1,82472,1998.sigirconf_conference-98.15,5,13.901647,retrieval system improving effectiveness
6,1,94415,2008.cikm_conference-2008.183,6,13.808208,retrieval system improving effectiveness
7,1,17496,O01-2005,7,13.749449,retrieval system improving effectiveness
8,1,82490,1998.sigirconf_conference-98.33,8,13.735541,retrieval system improving effectiveness
9,1,124801,2006.ipm_journal-ir0volumeA42A3.2,9,13.569263,retrieval system improving effectiveness
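
A quick way to see what query expansion changed is to compare the two top-10 lists for topic 1. The sketch below copies the `docno` values from the two result tables above into plain lists. It only measures overlap and says nothing about which ranking is more effective, since that requires relevance judgments.

```python
# Top-10 docnos for topic 1, copied from the plain BM25 table
bm25_top10 = [
    "2004.cikm_conference-2004.47",
    "1989.ipm_journal-ir0volumeA25A4.2",
    "2005.ipm_journal-ir0volumeA41A5.11",
    "W05-0704",
    "2016.ntcir_conference-2016.90",
    "1998.sigirconf_conference-98.15",
    "2008.cikm_conference-2008.183",
    "O01-2005",
    "1998.sigirconf_conference-98.33",
    "2006.ipm_journal-ir0volumeA42A3.2",
]
# Top-10 docnos for topic 1, copied from the BM25 + Bo1 table
bo1_top10 = [
    "2004.cikm_conference-2004.47",
    "2008.cikm_conference-2008.183",
    "2006.ipm_journal-ir0volumeA42A3.2",
    "O01-2005",
    "1998.sigirconf_conference-98.15",
    "1998.sigirconf_conference-98.33",
    "2001.clef_workshop-2001w.24",
    "1989.ipm_journal-ir0volumeA25A4.2",
    "2005.ipm_journal-ir0volumeA41A5.11",
    "2014.wwwjournals_journal-ir0volumeA17A4.15",
]

bm25_set, bo1_set = set(bm25_top10), set(bo1_top10)
shared = [docno for docno in bm25_top10 if docno in bo1_set]
only_bm25 = [docno for docno in bm25_top10 if docno not in bo1_set]
only_bo1 = [docno for docno in bo1_top10 if docno not in bm25_set]

print(f"shared: {len(shared)}")
print(f"dropped by Bo1: {only_bm25}")
print(f"introduced by Bo1: {only_bo1}")
```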


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a different notebook.

In [None]:
persist_and_normalize_run(run, system_name='bm25-bo1', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
