# Sample Eval Pilot Study

This pilot study evaluates whether infNDCG, as computed by the original implementation (available at https://trec.nist.gov/data/medical/12/sample_eval.pl), should be included in our experiments.
From reading the original paper, I think infNDCG does not make much sense for our setting because it addresses a different problem: to "support fair comparison of retrieval results using relatively small amounts of judging effort" [[1](https://dl.acm.org/doi/pdf/10.1145/2600428.2609524)]. That is, it is a technique for constructing a judgment set, and it uses details of the document sampling applied during that construction (documents are partitioned into strata, and documents from different strata are judged with different probabilities) to calculate the inferred NDCG. In our situation, however, the judgment pool was constructed with pooling. Hence, the prerequisite for infNDCG (the strata) does not apply.


The `sample_eval.pl` script contains an implementation of infNDCG. I copied this script to `sample_eval_depth_10.pl`, where I changed `maxResultSize` to 10 so that we calculate infNDCG@10.

In [5]:
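# Convert the four-column qrels into a five-column file for the sample_eval script
# by inserting a constant stratum value of 1 as the fourth column.
with open('../../ipynb/beir-evaluation-data/incomplete-beir-trec-covid.txt', 'r') as input_file, open('incomplete-beir-trec-covid-for-sample-eval.txt', 'w') as output_file:
    for line in input_file:
        fields = line.split()
        # We only have one stratum, so every document is assigned to stratum 1
        output_file.write(f'{fields[0]} {fields[1]} {fields[2]} 1 {fields[3]}\n')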
    

In [6]:
!head -3 ../../ipynb/beir-evaluation-data/incomplete-beir-trec-covid.txt

1 0 005b2j4b 2
1 0 00fmeepz 1
1 0 g7dhmyyo 2


In [7]:
!head -3 incomplete-beir-trec-covid-for-sample-eval.txt

1 0 005b2j4b 1 2
1 0 00fmeepz 1 1
1 0 g7dhmyyo 1 2


In [9]:
!./sample_eval_depth_10.pl incomplete-beir-trec-covid-for-sample-eval.txt ../../ipynb/beir-evaluation-data/runs/ance-09-01-2023-run.txt |grep infNDCG

# This yields an infNDCG of 0.6435, which is even lower than the 0.652 obtained when all unjudged documents are assumed to be non-relevant

infNDCG		all		0.6435


In [10]:
!./sample_eval_depth_10.pl incomplete-beir-trec-covid-for-sample-eval.txt ../../ipynb/beir-evaluation-data/runs/tas-b-09-01-2023-run.txt | grep infNDCG

# This yields an infNDCG of 0.4746, which is even lower than the 0.481 obtained when all unjudged documents are assumed to be non-relevant

infNDCG		all		0.4746


In [11]:
!./sample_eval_depth_10.pl incomplete-beir-trec-covid-for-sample-eval.txt ../../ipynb/beir-evaluation-data/runs/colbert-ranking-26-12-2022-run.txt | grep infNDCG

# This yields an infNDCG of 0.6703, which is even lower than the 0.680 obtained when all unjudged documents are assumed to be non-relevant

infNDCG		all		0.6703


In all three cases, infNDCG is even lower than the nDCG obtained when all unjudged documents are assumed to be non-relevant. This confirms my initial line of thought: infNDCG is better suited to settings where the judgment set is constructed with stratified sampling, and it cannot conceptually be applied to our post-hoc evaluation on pooled relevance judgments.
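
For reference, the baseline that infNDCG is compared against (nDCG@10 with unjudged documents treated as non-relevant) can be sketched as follows. This is a minimal illustration and not the code that produced the numbers above: the function names are hypothetical, the gain is assumed to be linear in the relevance label with a log2 rank discount (some tools use an exponential gain instead), and the document id `unjudged-doc` is made up for the toy example.

```python
import math
from collections import defaultdict


def load_qrels(lines):
    """Parse 'topic iteration doc_id relevance' lines into {topic: {doc_id: rel}}."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.split()
        qrels[fields[0]][fields[2]] = int(fields[3])
    return qrels


def dcg(gains):
    # Assumption: linear gain with a log2(rank + 1) discount, ranks starting at 1.
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranking, judgments, k=10):
    """nDCG@k for one topic; documents missing from `judgments` get gain 0."""
    gains = [max(judgments.get(doc_id, 0), 0) for doc_id in ranking[:k]]
    ideal = sorted((max(rel, 0) for rel in judgments.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example using the three judged documents shown in the qrels excerpt above.
qrels = load_qrels([
    "1 0 005b2j4b 2",
    "1 0 00fmeepz 1",
    "1 0 g7dhmyyo 2",
])
# 'unjudged-doc' is a hypothetical id that does not occur in the qrels.
ranking = ["005b2j4b", "unjudged-doc", "00fmeepz"]
score = ndcg_at_k(ranking, qrels["1"], k=10)
print(f"nDCG@10 = {score:.4f}")
```

Averaging `ndcg_at_k` over all topics of a run would give a per-run score comparable in spirit to the reference values quoted in the cells above, although the exact numbers depend on the gain and discount conventions of the evaluation tool that was used.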