# Statistical Significance

This notebook shows how to test whether the difference between the results of two `trec_eval` runs is statistically significant.
It assumes that you have two files containing the results of the two runs:

* `es_result.qrels`
* `jina_result.qrels`

The example to test the statistical significance is adapted from https://github.com/cvangysel/pytrec_eval/blob/master/examples/statistical_significance.py

It uses `pytrec_eval`, a Python interface for trec_eval.
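For orientation, the files consumed by `trec_eval` are plain whitespace-separated text: a run file has six columns (query id, the literal `Q0`, document id, rank, score, run name) and a qrels file has four (query id, iteration, document id, relevance). The sketch below reads both layouts into nested dictionaries, which is the shape `pytrec_eval` works with. The helper names `read_run` and `read_qrels` and the sample rows are hypothetical and only illustrate the layout; the notebook itself relies on `pytrec_eval.parse_run` and `pytrec_eval.parse_qrel`.

```python
from collections import defaultdict

# Hypothetical sample rows in the six-column run layout:
# query_id  Q0  doc_id  rank  score  run_name
SAMPLE_RUN = """\
q1 Q0 doc1 1 12.5 bm25
q1 Q0 doc7 2 9.1 bm25
q2 Q0 doc3 1 4.0 bm25
"""

# Hypothetical sample rows in the four-column qrels layout:
# query_id  iteration  doc_id  relevance
SAMPLE_QRELS = """\
q1 0 doc1 2
q1 0 doc7 0
q2 0 doc3 1
"""


def read_run(text):
    """Illustrative helper: map query_id -> {doc_id: score}."""
    run = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query_id, _q0, doc_id, _rank, score, _run_name = line.split()
        run[query_id][doc_id] = float(score)
    return dict(run)


def read_qrels(text):
    """Illustrative helper: map query_id -> {doc_id: relevance}."""
    qrels = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


print(read_run(SAMPLE_RUN))
print(read_qrels(SAMPLE_QRELS))
```

If one of your files does not parse, comparing a few of its lines against these column layouts is usually the quickest check.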

## Install `pytrec_eval`

In [None]:
!pip install pytrec_eval

## Do the necessary imports

In [None]:
import pytrec_eval
import scipy.stats

## Parse the ratings & runs, calculate the t-score & p-value

In [None]:
# open the file with the ratings (relevance judgments)
with open('../data/ratings.qrels', 'r') as f_qrel:
    qrel = pytrec_eval.parse_qrel(f_qrel)
# open the file with the first run - Elasticsearch BM25 run
with open('../data/es_result', 'r') as f_run:
    first_run = pytrec_eval.parse_run(f_run)
# open the file with the second run - vector search run
with open('../data/jina_result', 'r') as f_run:
    second_run = pytrec_eval.parse_run(f_run)

# define an evaluator that computes only the NDCG cutoff measures (ndcg_cut_10 is used below) and evaluate the two runs
evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"ndcg_cut"})

first_results = evaluator.evaluate(first_run)
second_results = evaluator.evaluate(second_run)

# keep only the query ids present in both runs so that the scores are paired,
# then collect the per-query scores of the two runs and pass them to scipy for the p-value computation
query_ids = sorted(set(first_results.keys()) & set(second_results.keys()))

first_scores = [first_results[query_id]['ndcg_cut_10'] for query_id in query_ids]
second_scores = [second_results[query_id]['ndcg_cut_10'] for query_id in query_ids]

print(scipy.stats.ttest_rel(first_scores, second_scores))
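To make the printed statistic less of a black box, here is a minimal sketch of the paired t-statistic computed by hand with the standard library only. It takes the per-query differences, then divides their mean by their standard error. The function name `paired_t_statistic` and the score lists are hypothetical and chosen for illustration; they are not results from the runs above.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two paired score lists."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # per-query difference: first run minus second run
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired observations are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical NDCG@10 scores for four queries, for illustration only
example_first = [0.50, 0.60, 0.70, 0.40]
example_second = [0.40, 0.45, 0.50, 0.15]

t_value, dof = paired_t_statistic(example_first, example_second)
print(f"t = {t_value:.4f}, degrees of freedom = {dof}")
```

Because the differences are taken as first minus second, a negative t-score means the second run scored higher on average. `scipy.stats.ttest_rel` additionally derives the two-sided p-value from the t-distribution with `n - 1` degrees of freedom; a common convention is to call the difference significant when that p-value is below a threshold chosen in advance, such as 0.05.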