# Tutorial on Tuning Hyperparameters of Retrieval Systems

This tutorial demonstrates hyperparameter tuning with PyTerrier and TrecTools.
To make the individual steps explicit, we exhaustively evaluate a small grid of possible parameters for BM25.
Once you understand the concepts of this tutorial, please consider switching to a dedicated API for tuning hyperparameters, e.g., [the official PyTerrier one](https://pyterrier.readthedocs.io/en/latest/tuning.html).

**Attention:** This tutorial comes in two parts: part 1 runs all configurations, and part 2 evaluates them to find the best one. If you are doing this tutorial for the first time, please only skim part 1 (and do not execute it), and come back to it later if needed. We provide the results of part 1 as a download, so you can skip directly to part 2.

## Preparation: Install dependencies

In [1]:
# This is only needed in Google Colab; in the dev container, everything should already be installed
!pip3 install tira trectools python-terrier

## Our Scenario

We want to tune the hyperparameters of BM25 on the training and validation data of the [IR Lab of the winter semester Jena/Leipzig](https://www.tira.io/task-overview/ir-lab-jena-leipzig-wise-2023).
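
To make the role of these hyperparameters more tangible, here is a minimal sketch of the textbook BM25 score of a single query term. It is not Terrier's actual implementation, and the function name `bm25_term_score` and its arguments are assumptions made for illustration. The defaults are the ones quoted later in this tutorial (`k_1 = 1.2`, `b = 0.75`, `k_3 = 8`).

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k_1=1.2, b=0.75, k_3=8.0, qtf=1):
    """Textbook BM25 contribution of one query term to one document (illustrative sketch)."""
    # Length normalisation: b=0 ignores the document length, b=1 normalises fully.
    norm = (1 - b) + b * doc_len / avg_doc_len
    # Term-frequency saturation: a larger k_1 lets repeated terms count for more.
    tf_part = (k_1 + 1) * tf / (k_1 * norm + tf)
    # Query-side factor: equals 1 whenever the term occurs only once in the query (qtf=1).
    query_part = (k_3 + 1) * qtf / (k_3 + qtf)
    return idf * tf_part * query_part


# The same term frequency in a short and in a long document (hypothetical numbers).
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, idf=1.0)
long_doc = bm25_term_score(tf=3, doc_len=500, avg_doc_len=100, idf=1.0)
print(short_doc > long_doc)
```

In this formulation, `k_3` only changes the score when a term is repeated within the query, which is why a grid over `b` and `k_1` alone is a reasonable starting point.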

First, we import all used dependencies:



In [2]:
from tira.third_party_integrations import ir_datasets, ensure_pyterrier_is_loaded, persist_and_normalize_run
import pyterrier as pt

ensure_pyterrier_is_loaded()

training_dataset = 'ir-lab-jena-leipzig-wise-2023/training-20231104-training'
validation_dataset = 'ir-lab-jena-leipzig-wise-2023/validation-20231104-training'

# Part 1: Run all Configurations of the Grid Search

Running all configurations below takes roughly three hours (one advantage of dedicated APIs is that they often offer parallelization). Therefore, please only skim this first part if you are doing the tutorial for the first time. You can download the outputs of this grid search at the beginning of part 2, so you can skip directly to part 2.

Next, we implement two methods: (1) one for indexing documents, and (2) one for the grid search, which exhaustively runs a small grid of possible parameters for BM25. We store all runs in the directory `grid-search/training` (for the training runs) or `grid-search/validation` (for the validation runs).

In [3]:
def create_index(documents):
    indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, meta={'docno': 100, 'text': 20480})
    index_ref = indexer.index(({'docno': i.doc_id, 'text': i.text} for i in documents))
    return pt.IndexFactory.of(index_ref)

In [4]:
def run_bm25_grid_search_run(index, output_dir, queries):
    """
        defaults: http://terrier.org/docs/current/javadoc/org/terrier/matching/models/BM25.html
        k_1 = 1.2d, k_3 = 8d, b = 0.75d
        We do not tune parameter k_3, as this parameter only impacts queries with redundant terms.
    """
    for b in [0.7, 0.75, 0.8]:
        for k_1 in [1.1, 1.2, 1.3]:
            system = f'bm25-b={b}-k_1={k_1}'
            configuration = {"bm25.b" : b, "bm25.k_1": k_1}
            run_output_dir = output_dir + '/' + system
            !rm -Rf {run_output_dir}
            !mkdir -p {run_output_dir}
            print(f'Run {system}')
            BM25 = pt.BatchRetrieve(index, wmodel="BM25", controls=configuration, verbose=True)
            run = BM25(queries)
            persist_and_normalize_run(run, system, run_output_dir)
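
The nested loops above enumerate every combination of `b` and `k_1`. As a small sketch independent of PyTerrier, the same grid can be listed with `itertools.product`. The variable names `b_values`, `k_1_values`, and `grid` are chosen here for illustration only.

```python
from itertools import product

# The same candidate values as in run_bm25_grid_search_run.
b_values = [0.7, 0.75, 0.8]
k_1_values = [1.1, 1.2, 1.3]

# One entry per configuration: the run name and the controls passed to BM25.
grid = [
    {"system": f"bm25-b={b}-k_1={k_1}", "controls": {"bm25.b": b, "bm25.k_1": k_1}}
    for b, k_1 in product(b_values, k_1_values)
]

print(len(grid))
for entry in grid:
    print(entry["system"])
```

The number of configurations is the product of the number of values per parameter, so every additional parameter or value multiplies the total runtime of an exhaustive search.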

## Run All Configurations on the Training Data

First, we load the training dataset and index the documents, then we run our `run_bm25_grid_search_run`.

In [5]:
dataset = ir_datasets.load(training_dataset)
queries = pt.io.read_topics(ir_datasets.topics_file(training_dataset), format='trecxml')

queries.head(3)

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/training-20231104-training" from tira.
No settings given in /root/.tira/.tira-settings.json. I will use defaults.


|   | qid | query |
|---|-----|-------|
| 0 | q06223196 | car shelter |
| 1 | q062228 | airport |
| 2 | q062287 | antivirus comparison |


In [6]:
index = create_index(dataset.docs_iter())

No settings given in /root/.tira/.tira-settings.json. I will use defaults.


In [7]:
run_bm25_grid_search_run(index, 'grid-search/training', queries)

Run bm25-b=0.7-k_1=1.1


BR(BM25): 100%|██████████| 672/672 [07:49<00:00,  1.43q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.7-k_1=1.1/run.txt".
Run bm25-b=0.7-k_1=1.2


BR(BM25): 100%|██████████| 672/672 [08:39<00:00,  1.29q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.7-k_1=1.2/run.txt".
Run bm25-b=0.7-k_1=1.3


BR(BM25): 100%|██████████| 672/672 [09:02<00:00,  1.24q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.7-k_1=1.3/run.txt".
Run bm25-b=0.75-k_1=1.1


BR(BM25): 100%|██████████| 672/672 [09:07<00:00,  1.23q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.75-k_1=1.1/run.txt".
Run bm25-b=0.75-k_1=1.2


BR(BM25): 100%|██████████| 672/672 [10:00<00:00,  1.12q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.75-k_1=1.2/run.txt".
Run bm25-b=0.75-k_1=1.3


BR(BM25): 100%|██████████| 672/672 [09:13<00:00,  1.21q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.75-k_1=1.3/run.txt".
Run bm25-b=0.8-k_1=1.1


BR(BM25): 100%|██████████| 672/672 [08:55<00:00,  1.26q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.8-k_1=1.1/run.txt".
Run bm25-b=0.8-k_1=1.2


BR(BM25): 100%|██████████| 672/672 [08:34<00:00,  1.31q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.8-k_1=1.2/run.txt".
Run bm25-b=0.8-k_1=1.3


BR(BM25): 100%|██████████| 672/672 [07:48<00:00,  1.44q/s]


Done. run file is stored under "grid-search/training/bm25-b=0.8-k_1=1.3/run.txt".


## Run All Configurations on the Validation Data

Second, we load the validation dataset and index the documents, then we run our `run_bm25_grid_search_run`.

In [5]:
dataset = ir_datasets.load(validation_dataset)
queries = pt.io.read_topics(ir_datasets.topics_file(validation_dataset), format='trecxml')

queries.head(3)

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.
No settings given in /root/.tira/.tira-settings.json. I will use defaults.


|   | qid | query |
|---|-----|-------|
| 0 | q072224 | purchase money |
| 1 | q072226 | purchase used car |
| 2 | q072232 | buy gold silver |


In [6]:
index = create_index(dataset.docs_iter())

No settings given in /root/.tira/.tira-settings.json. I will use defaults.


In [7]:
run_bm25_grid_search_run(index, 'grid-search/validation', queries)

Run bm25-b=0.7-k_1=1.1


BR(BM25): 100%|██████████| 882/882 [12:22<00:00,  1.19q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.7-k_1=1.1/run.txt".
Run bm25-b=0.7-k_1=1.2


BR(BM25): 100%|██████████| 882/882 [12:25<00:00,  1.18q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.7-k_1=1.2/run.txt".
Run bm25-b=0.7-k_1=1.3


BR(BM25): 100%|██████████| 882/882 [10:54<00:00,  1.35q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.7-k_1=1.3/run.txt".
Run bm25-b=0.75-k_1=1.1


BR(BM25): 100%|██████████| 882/882 [10:58<00:00,  1.34q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.75-k_1=1.1/run.txt".
Run bm25-b=0.75-k_1=1.2


BR(BM25): 100%|██████████| 882/882 [10:56<00:00,  1.34q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.75-k_1=1.2/run.txt".
Run bm25-b=0.75-k_1=1.3


BR(BM25): 100%|██████████| 882/882 [10:55<00:00,  1.34q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.75-k_1=1.3/run.txt".
Run bm25-b=0.8-k_1=1.1


BR(BM25): 100%|██████████| 882/882 [10:39<00:00,  1.38q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.8-k_1=1.1/run.txt".
Run bm25-b=0.8-k_1=1.2


BR(BM25): 100%|██████████| 882/882 [11:00<00:00,  1.33q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.8-k_1=1.2/run.txt".
Run bm25-b=0.8-k_1=1.3


BR(BM25): 100%|██████████| 882/882 [10:45<00:00,  1.37q/s]


Done. run file is stored under "grid-search/validation/bm25-b=0.8-k_1=1.3/run.txt".
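
As a quick sanity check of the runtime estimate, the retrieval times shown in the progress bars above can be summed up. The sketch below only adds the reported times. It ignores indexing and any other overhead, and the helper `to_seconds` is introduced here for illustration.

```python
def to_seconds(mm_ss):
    """Convert a 'MM:SS' string from a progress bar into seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


# Elapsed times copied from the progress bars of the 9 training and 9 validation runs.
training = ["07:49", "08:39", "09:02", "09:07", "10:00", "09:13", "08:55", "08:34", "07:48"]
validation = ["12:22", "12:25", "10:54", "10:58", "10:56", "10:55", "10:39", "11:00", "10:45"]

training_seconds = sum(to_seconds(t) for t in training)
validation_seconds = sum(to_seconds(t) for t in validation)
total_hours = (training_seconds + validation_seconds) / 3600

print(f"training: {training_seconds / 60:.1f} min")
print(f"validation: {validation_seconds / 60:.1f} min")
print(f"total: {total_hours:.1f} h")
```

The retrieval steps alone add up to about three hours, with the validation runs taking somewhat longer than the training runs.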


# Part 2: Evaluate all Configurations of the Grid Search

First, we import the dependencies and load the training and validation qrels.

In [2]:
from trectools import TrecRun, TrecQrel, TrecEval
from tira.rest_api_client import Client
from glob import glob
import pandas as pd
tira = Client()

def load_qrels(dataset):
    return TrecQrel(tira.download_dataset('ir-lab-jena-leipzig-wise-2023', dataset, truth_dataset=True) + '/qrels.txt')

training_qrels = load_qrels('training-20231104-training')
validation_qrels = load_qrels('validation-20231104-training')

No settings given in /root/.tira/.tira-settings.json. I will use defaults.


We download the runs of the grid search and evaluate them.

In [6]:
!wget https://files.webis.de/teaching/ir-wise-23/ir-lab-sose-grid-search.zip
!unzip ir-lab-sose-grid-search.zip

--2023-11-05 20:56:57--  https://files.webis.de/teaching/ir-wise-23/ir-lab-sose-grid-search.zip
Resolving files.webis.de (files.webis.de)... 141.54.132.200
Connecting to files.webis.de (files.webis.de)|141.54.132.200|:443... connected.


HTTP request sent, awaiting response... 200 OK
Length: 208266950 (199M) [application/zip]
Saving to: ‘ir-lab-sose-grid-search.zip’


2023-11-05 20:57:06 (23.3 MB/s) - ‘ir-lab-sose-grid-search.zip’ saved [208266950/208266950]

Archive:  ir-lab-sose-grid-search.zip
   creating: grid-search/
   creating: grid-search/training/
   creating: grid-search/training/bm25-b=0.7-k_1=1.1/
  inflating: grid-search/training/bm25-b=0.7-k_1=1.1/run.txt  
   creating: grid-search/training/bm25-b=0.7-k_1=1.2/
  inflating: grid-search/training/bm25-b=0.7-k_1=1.2/run.txt  
   creating: grid-search/training/bm25-b=0.7-k_1=1.3/
  inflating: grid-search/training/bm25-b=0.7-k_1=1.3/run.txt  
   creating: grid-search/training/bm25-b=0.75-k_1=1.1/
  inflating: grid-search/training/bm25-b=0.75-k_1=1.1/run.txt  
   creating: grid-search/training/bm25-b=0.75-k_1=1.2/
  inflating: grid-search/training/bm25-b=0.75-k_1=1.2/run.txt  
   creating: grid-search/training/bm25-b=0.75-k_1=1.3/
  inflating: grid-search/traini

In [7]:
def evaluate_run(run_dir, qrels):
    run = TrecRun(run_dir + '/run.txt')
    trec_eval = TrecEval(run, qrels)

    return {
        'run': run.get_runid(),
        'nDCG@10': trec_eval.get_ndcg(depth=10),
        'nDCG@10 (unjudgedRemoved)': trec_eval.get_ndcg(depth=10, removeUnjudged=True),
        'MAP': trec_eval.get_map(depth=10),
        'MRR': trec_eval.get_reciprocal_rank()
    }
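
To illustrate what the two nDCG@10 variants measure, here is a minimal sketch of one common nDCG formulation in plain Python. It is not the TrecTools implementation, which may differ in details such as the gain function or tie handling. The functions `dcg` and `ndcg` as well as the toy ranking and judgments are hypothetical.

```python
import math


def dcg(gains, k=10):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg(ranking, judgments, k=10, remove_unjudged=False):
    """nDCG@k for one query; unjudged documents count as non-relevant unless removed."""
    if remove_unjudged:
        # Drop unjudged documents first, so judged documents move up in the ranking.
        ranking = [doc for doc in ranking if doc in judgments]
    gains = [judgments.get(doc, 0) for doc in ranking]
    ideal_dcg = dcg(sorted(judgments.values(), reverse=True), k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments (document -> relevance) and a ranking with unjudged documents u1 and u2.
judgments = {"d1": 1, "d2": 0, "d3": 1}
ranking = ["u1", "d1", "u2", "d2", "d3"]

with_unjudged = ndcg(ranking, judgments)
without_unjudged = ndcg(ranking, judgments, remove_unjudged=True)
print(with_unjudged < without_unjudged)
```

In this sketch, removing unjudged documents can only move judged documents to better ranks, so the resulting score is never lower than the plain nDCG@10.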

In [14]:
df = []
for r in glob('grid-search/training/bm25*'):
    df += [evaluate_run(r, training_qrels)]
df = pd.DataFrame(df)
df.sort_values('nDCG@10', ascending=False)

|   | run | nDCG@10 | nDCG@10 (unjudgedRemoved) | MAP | MRR |
|---|-----|---------|---------------------------|-----|-----|
| 7 | bm25-b=0.8-k_1=1.3 | 0.180344 | 0.53824 | 0.121363 | 0.264709 |
| 5 | bm25-b=0.8-k_1=1.2 | 0.180077 | 0.538306 | 0.121138 | 0.264024 |
| 3 | bm25-b=0.8-k_1=1.1 | 0.179424 | 0.537341 | 0.120681 | 0.263756 |
| 2 | bm25-b=0.75-k_1=1.3 | 0.177931 | 0.536984 | 0.119467 | 0.264127 |
| 0 | bm25-b=0.75-k_1=1.2 | 0.17738 | 0.536716 | 0.118656 | 0.262814 |
| 6 | bm25-b=0.75-k_1=1.1 | 0.176958 | 0.536394 | 0.118267 | 0.262656 |
| 8 | bm25-b=0.7-k_1=1.3 | 0.176549 | 0.537078 | 0.118874 | 0.26242 |
| 4 | bm25-b=0.7-k_1=1.1 | 0.176017 | 0.536765 | 0.118172 | 0.261779 |
| 1 | bm25-b=0.7-k_1=1.2 | 0.175711 | 0.536624 | 0.117826 | 0.261377 |


In [12]:
df = []
for r in glob('grid-search/validation/bm25*'):
    df += [evaluate_run(r, validation_qrels)]
df = pd.DataFrame(df)
df.sort_values('nDCG@10', ascending=False)

|   | run | nDCG@10 | nDCG@10 (unjudgedRemoved) | MAP | MRR |
|---|-----|---------|---------------------------|-----|-----|
| 5 | bm25-b=0.8-k_1=1.2 | 0.181957 | 0.523761 | 0.125262 | 0.260425 |
| 3 | bm25-b=0.8-k_1=1.1 | 0.181566 | 0.523328 | 0.12505 | 0.2606 |
| 6 | bm25-b=0.75-k_1=1.1 | 0.181476 | 0.522154 | 0.125384 | 0.264391 |
| 7 | bm25-b=0.8-k_1=1.3 | 0.1813 | 0.524912 | 0.125203 | 0.261041 |
| 0 | bm25-b=0.75-k_1=1.2 | 0.181206 | 0.522771 | 0.12521 | 0.264076 |
| 2 | bm25-b=0.75-k_1=1.3 | 0.180932 | 0.522815 | 0.125112 | 0.263925 |
| 4 | bm25-b=0.7-k_1=1.1 | 0.180461 | 0.519431 | 0.123792 | 0.262504 |
| 1 | bm25-b=0.7-k_1=1.2 | 0.180256 | 0.52023 | 0.123787 | 0.262342 |
| 8 | bm25-b=0.7-k_1=1.3 | 0.179685 | 0.520998 | 0.123525 | 0.262351 |
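
One possible way to read the two result tables together is to select the configuration on the training data and then check how it fares on the validation data. The sketch below does this in plain Python with the nDCG@10 values copied from the tables above. The variable names are chosen for illustration.

```python
# nDCG@10 per configuration, copied from the training and validation tables.
training = {
    "bm25-b=0.8-k_1=1.3": 0.180344, "bm25-b=0.8-k_1=1.2": 0.180077, "bm25-b=0.8-k_1=1.1": 0.179424,
    "bm25-b=0.75-k_1=1.3": 0.177931, "bm25-b=0.75-k_1=1.2": 0.17738, "bm25-b=0.75-k_1=1.1": 0.176958,
    "bm25-b=0.7-k_1=1.3": 0.176549, "bm25-b=0.7-k_1=1.1": 0.176017, "bm25-b=0.7-k_1=1.2": 0.175711,
}
validation = {
    "bm25-b=0.8-k_1=1.2": 0.181957, "bm25-b=0.8-k_1=1.1": 0.181566, "bm25-b=0.75-k_1=1.1": 0.181476,
    "bm25-b=0.8-k_1=1.3": 0.1813, "bm25-b=0.75-k_1=1.2": 0.181206, "bm25-b=0.75-k_1=1.3": 0.180932,
    "bm25-b=0.7-k_1=1.1": 0.180461, "bm25-b=0.7-k_1=1.2": 0.180256, "bm25-b=0.7-k_1=1.3": 0.179685,
}

# Select the configuration on training, then look it up on validation.
best_on_training = max(training, key=training.get)
best_on_validation = max(validation, key=validation.get)
validation_ranking = sorted(validation, key=validation.get, reverse=True)
rank_of_training_winner = validation_ranking.index(best_on_training) + 1
gap = validation[best_on_validation] - validation[best_on_training]

print(f"best on training:   {best_on_training}")
print(f"best on validation: {best_on_validation}")
print(f"validation rank of the training winner: {rank_of_training_winner}")
print(f"nDCG@10 gap on validation: {gap:.6f}")
```

The configuration that wins on the training data is not the one that wins on the validation data, although the differences between all nine configurations are very small.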


# Summary

We conducted an exhaustive grid search over the `b` and `k_1` parameters of BM25.

To summarize everything, please answer the following three questions:


### Question 1:

What are the advantages of splitting the data into a training and a validation set?


### Question 2:

Are there scenarios where you would join the training and validation data?