# PyTerrier Notebook for Full-Rank Submissions

This notebook is a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index and then creates BM25 runs, with and without position-based query term weights.

### Step 1: Ensure Libraries are Imported

In [3]:
import os
import math

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
else:
    print('We are in the TIRA sandbox.')





In [4]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# This loads and starts PyTerrier so that it also works inside the TIRA sandbox.
ensure_pyterrier_is_loaded()

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt
from pyterrier.measures import *

if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1', 'com.github.terrierteam:terrier-prf:-SNAPSHOT'])
    from jnius import autoclass


Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Data

In [5]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.


In [6]:
def linear_weight_function(value, max_length):
    """Linear weight function, which applies linearly decreasing weights."""
    return (-value + max_length) / max_length


def inverse_linear_weight_function(value, max_length):
    """Inverse linear weight function, which applies linearly increasing weights."""
    return value / max_length


def centered_parabola_weight_function(value, max_length):
    """Centered parabola weight function, which applies weights that first decrease and then increase."""
    return (((value - (max_length / 2)) ** 2 + 1)) / ((-max_length / 2) ** 2 + 1)


def inverse_centered_parabola_weight_function(value, max_length):
    """Inverse centered parabola weight function, which applies weights that first increase and then decrease."""
    return -centered_parabola_weight_function(value, max_length) + 1


def log_weight_function(value, max_length):
    """Logarithmic weight function, which applies decreasing weights."""
    dividend = math.log2((value + 0.1) / (value + 0.1 + max_length - 0.9))
    divisor = math.log2(0.1 / (0.1 + max_length - 0.9))
    return dividend / divisor


def inverse_log_weight_function(value, max_length):
    """Inverse logarithmic weight function, which applies increasing weights."""
    dividend = math.log2(
        (-value + max_length + 0.1) / (-value + max_length + 0.1 + max_length - 0.9)
    )
    divisor = math.log2(0.1 / (0.1 + max_length - 0.9))
    return dividend / divisor


def english_tokenizer(string):
    """Tokenizes the input string with Terrier's English tokeniser and returns the resulting tokens."""
    english_tokeniser = pt.TerrierTokeniser.english
    english_tokeniser = pt.TerrierTokeniser._to_obj(english_tokeniser)
    english_tokeniser = pt.TerrierTokeniser._to_class(english_tokeniser)

    tokeniser = "org.terrier.indexing.tokenisation." + english_tokeniser
    tokenobj = pt.autoclass(tokeniser)()
    _query_fn = tokenobj.getTokens
    return _query_fn(string)


def apply_query_term_weighing(query, weight_function):
    """Tokenizes the query and appends a position-dependent weight to each term as `term^weight`."""
    query_parts = english_tokenizer(query)
    query_length = len(query_parts)
    weights = [weight_function(x, query_length) for x in range(query_length)]

    return " ".join(
        [f"{query_part}^{weight}" for query_part, weight in zip(query_parts, weights)]
    )
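
As a rough illustration of what `apply_query_term_weighing` produces, the sketch below applies the linear and inverse linear weights to a four-term query. It assumes a plain whitespace split as a stand-in for the Terrier tokeniser (so no Java is needed), and `weight_query` is a hypothetical helper name used only here.

```python
def linear_weight_function(value, max_length):
    """Linearly decreasing weight: 1.0 for the first term, 1/max_length for the last."""
    return (-value + max_length) / max_length


def inverse_linear_weight_function(value, max_length):
    """Linearly increasing weight: 0.0 for the first term."""
    return value / max_length


def weight_query(query, weight_function):
    """Hypothetical stand-in for apply_query_term_weighing.

    Assumption: str.split() replaces the Terrier tokeniser.
    """
    terms = query.split()
    n = len(terms)
    return " ".join(f"{t}^{weight_function(i, n)}" for i, t in enumerate(terms))


decreasing = weight_query("how to cook rice", linear_weight_function)
increasing = weight_query("how to cook rice", inverse_linear_weight_function)
print(decreasing)  # how^1.0 to^0.75 cook^0.5 rice^0.25
print(increasing)  # how^0.0 to^0.25 cook^0.5 rice^0.75
```

Note that the inverse linear variant assigns the first term a weight of 0.0, which presumably removes that term's contribution to the score.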

In [7]:
solutions = []
for solution in [
    ("linear", linear_weight_function),
    ("inverse_linear", inverse_linear_weight_function),
    ("quadratic", centered_parabola_weight_function),
    ("inverse_quadratic", inverse_centered_parabola_weight_function),
    ("logarithmic", log_weight_function),
    ("inverse_logarithmic", inverse_log_weight_function),
]:
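    # Work on a copy so each weighting scheme starts from the original queries.
    topics = data.get_topics("title").copy()
    # Assign the column explicitly: writing to the rows yielded by iterrows()
    # is not guaranteed to modify the DataFrame.
    topics["query"] = topics["query"].apply(
        lambda query: apply_query_term_weighing(query, solution[1])
    )
    solutions.append((solution[0], topics))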

No settings given in /root/.tira/.tira-settings.json. I will use defaults.


### Step 3: Build the Index

In [8]:
print('Build index:')
# Both the indexer and BatchRetrieve use Terrier's default Porter stemmer and its default English stopword list.
iter_indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, blocks=True, meta={'docno': 100, 'text': 20480}, stemmer='PorterStemmer')
!rm -Rf /tmp/index
index_ref = iter_indexer.index(data.get_corpus_iter())

print('Done. Index is created')

Build index:




No settings given in /root/.tira/.tira-settings.json. I will use defaults.




ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [02:14<00:00, 456.40it/s]


Done. Index is created
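
`pt.IterDictIndexer` consumes an iterable of dictionaries, one per document. The sketch below shows the shape of such an iterator with two made-up documents. The field names `docno` and `text` match the `meta` keys used above, while `toy_corpus_iter` and its contents are hypothetical and only for illustration.

```python
from typing import Dict, Iterator


def toy_corpus_iter() -> Iterator[Dict[str, str]]:
    """Yield one dictionary per document (hypothetical toy data)."""
    docs = [
        ("d1", "BM25 is a bag-of-words ranking function."),
        ("d2", "Porter stemming reduces words to their stems."),
    ]
    for docno, text in docs:
        # 'docno' identifies the document; 'text' holds the content to index.
        yield {"docno": docno, "text": text}


corpus = list(toy_corpus_iter())
print(corpus[0]["docno"], len(corpus))
```

An iterator of this shape could be passed to `iter_indexer.index(...)` in place of `data.get_corpus_iter()` for a quick local smoke test.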


### Step 4: Create the Retrieval Pipeline

In [9]:
index = pt.IndexFactory.of(index_ref)

bm25 = pt.BatchRetrieve(index, wmodel="BM25", verbose=True)

#### Step 4.1: Add Query Expansion

In [10]:
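# Pipeline: plain BM25. No query expansion is added here; the term weights are already encoded in the query strings.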
pipe = bm25

### Step 5: Create the Run and Persist the Run

In [11]:
print('Create run')

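# Each entry is a result DataFrame (a run), one per weighting scheme.
pipes = [pipe(solution[1]) for solution in solutions]

# Prepend the baseline run, which uses the original, unweighted queries.
pipes.insert(0, pipe(data.get_topics("title")))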

print('Done, run was created')

Create run


BR(BM25): 100%|██████████| 882/882 [14:01<00:00,  1.05q/s]
BR(BM25): 100%|██████████| 882/882 [13:02<00:00,  1.13q/s]
BR(BM25): 100%|██████████| 882/882 [13:17<00:00,  1.11q/s]
BR(BM25): 100%|██████████| 882/882 [14:32<00:00,  1.01q/s]
BR(BM25): 100%|██████████| 882/882 [13:54<00:00,  1.06q/s]
BR(BM25): 100%|██████████| 882/882 [15:50<00:00,  1.08s/q]
BR(BM25): 100%|██████████| 882/882 [17:15<00:00,  1.17s/q]


Done, run was created


### Step 6: Run Experiments

In [21]:
# Doesn't work in TIRA, only for local testing
names = ["BM25"]
names.extend([solution[0] for solution in solutions])
pt.Experiment(
   pipes,
   data.get_topics()[:50],
   data.get_qrels(),
   eval_metrics=[nDCG@5],
   names=names,
   baseline=0
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


| | name | nDCG@5 | nDCG@5 + | nDCG@5 - | nDCG@5 p-value |
|---|---|---|---|---|---|
| 0 | BM25 | 0.133832 | | | |
| 1 | linear | 0.121555 | 2.0 | 5.0 | 0.181516 |
| 2 | inverse_linear | 0.049922 | 4.0 | 13.0 | 0.011909 |
| 3 | quadratic | 0.119084 | 1.0 | 5.0 | 0.093741 |
| 4 | inverse_quadratic | 0.048588 | 4.0 | 13.0 | 0.010434 |
| 5 | logarithmic | 0.119049 | 4.0 | 7.0 | 0.215296 |
| 6 | inverse_logarithmic | 0.143178 | 4.0 | 1.0 | 0.467332 |
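
To make the `nDCG@5` column concrete, here is a minimal sketch of the metric for a single query. It assumes the common formulation with linear gain and a `log2(rank + 1)` discount, with the ideal ranking built from all judged documents; the evaluation backend used by `pt.Experiment` may differ in details. The function names and the toy judgments are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_docnos, qrels, k=5):
    """nDCG@k for one query; unjudged documents count as gain 0."""
    gains = [qrels.get(docno, 0) for docno in ranked_docnos]
    # The ideal ranking lists all judged documents by descending relevance.
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


toy_qrels = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)
swapped = ndcg_at_k(["d3", "d2", "d1"], toy_qrels)
print(perfect, swapped)
```

A ranking that places the most relevant documents first scores 1.0, while pushing them down the list lowers the score. The table reports this value for each system over the evaluated queries.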


In [23]:
# pipes[0] is the unweighted BM25 baseline run (inserted at index 0 above).
persist_and_normalize_run(pipes[0], 'bm25-query-term-weighing')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
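
The run file is assumed here to follow the usual six-column TREC layout (`qid Q0 docno rank score tag`). The sketch below formats a few made-up result rows that way, ranking by descending score within each query. `to_trec_lines` and the rows are hypothetical, and this is not the code that `persist_and_normalize_run` itself runs.

```python
from collections import defaultdict


def to_trec_lines(rows, tag):
    """Format (qid, docno, score) tuples as TREC run lines, ranked per query."""
    by_query = defaultdict(list)
    for qid, docno, score in rows:
        by_query[qid].append((docno, score))

    lines = []
    for qid in sorted(by_query):
        # Highest score first; ranks start at 1 in this sketch.
        ranked = sorted(by_query[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    return lines


toy_rows = [("1", "d7", 3.5), ("1", "d2", 9.25), ("2", "d4", 1.0)]
trec_lines = to_trec_lines(toy_rows, "bm25-query-term-weighing")
print("\n".join(trec_lines))
```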
