# Statistical Significance

This notebook shows how to test whether the difference between the results of two `trec_eval` runs is statistically significant.
It assumes that the relevance ratings (`ratings.qrels`) and the result files of the runs are available in `../data/`. The two runs compared here are:

* `default_result`: the default Elasticsearch BM25 run
* `boost_embimg_result`: the run boosted by image vector
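
For orientation, `pytrec_eval.parse_run` and `pytrec_eval.parse_qrel` read the whitespace-separated TREC formats: six columns per line for a run (query id, `Q0`, document id, rank, score, run name) and four for a qrels file (query id, iteration, document id, relevance). The sketch below mimics that parsing with the standard library only, to show the nested dictionaries that the later cells work with. The helper names and the sample lines are made up for illustration and are not part of `pytrec_eval` or of this notebook's data.

```python
# Illustrative only: parse_run_lines / parse_qrel_lines are hypothetical helpers,
# and the sample lines are invented, not taken from the notebook's data files.

def parse_run_lines(lines):
    """Map query id -> {document id -> score} from TREC run lines."""
    run = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        query_id, _q0, doc_id, _rank, score, _run_name = line.split()
        run.setdefault(query_id, {})[doc_id] = float(score)
    return run


def parse_qrel_lines(lines):
    """Map query id -> {document id -> relevance grade} from TREC qrels lines."""
    qrel = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        query_id, _iteration, doc_id, relevance = line.split()
        qrel.setdefault(query_id, {})[doc_id] = int(relevance)
    return qrel


sample_run = [
    "q1 Q0 doc3 1 12.7 default",
    "q1 Q0 doc1 2 9.4 default",
    "q2 Q0 doc5 1 3.2 default",
]
sample_qrel = [
    "q1 0 doc1 2",
    "q1 0 doc3 0",
    "q2 0 doc5 1",
]

run = parse_run_lines(sample_run)
qrel = parse_qrel_lines(sample_qrel)
print(run)
print(qrel)
```

Both structures are keyed by query id first, which is why the later cells can intersect the query ids of several runs and look up per-query scores.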



The significance test is adapted from the `pytrec_eval` example at https://github.com/cvangysel/pytrec_eval/blob/master/examples/statistical_significance.py

It uses `pytrec_eval`, a Python interface for `trec_eval`.

## Install `pytrec_eval`

In [6]:
!pip install pytrec_eval



## Do the necessary imports

In [7]:
import pytrec_eval
import scipy.stats

## Parse the ratings & runs, calculate the t-score & p-value
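
The cell below scores every query with NDCG at rank 10 (`ndcg_cut_10`). As a rough guide to what that number means, here is a minimal sketch of NDCG@k that uses the relevance grade as gain and a log2 rank discount. That formulation is an assumption made for intuition; `trec_eval` remains the reference for the exact values. `ndcg_at_k` and the grades are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: grade at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_grades, judged_grades, k=10):
    """NDCG@k for one query (hypothetical helper, not a pytrec_eval function).

    ranked_grades: grades of the retrieved documents in rank order (0 if unjudged).
    judged_grades: all grades judged for the query, in any order.
    """
    ideal_dcg = dcg(sorted(judged_grades, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg(ranked_grades[:k]) / ideal_dcg


# Made-up grades: a perfectly ordered result list, and one where the only
# relevant document is ranked second instead of first.
perfect = ndcg_at_k([3, 2, 1], [3, 2, 1])
relevant_at_rank_2 = ndcg_at_k([0, 1], [1])
print(perfect, relevant_at_rank_2)
```

A score of 1.0 means the top k documents are in the ideal order; pushing relevant documents further down the list lowers the score.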

In [8]:
# Open the file with the relevance ratings (qrels)
with open('../data/ratings.qrels', 'r') as f_qrel:
    qrel = pytrec_eval.parse_qrel(f_qrel)
# Open the file with the first run: default Elasticsearch BM25
with open('../data/default_result', 'r') as f_run:
    first_run = pytrec_eval.parse_run(f_run)
# Open the file with the second run: boost by image vector
with open('../data/boost_embimg_result', 'r') as f_run:
    second_run = pytrec_eval.parse_run(f_run)
# Open the file with the third run: match by image vector
with open('../data/match_embimg_result', 'r') as f_run:
    third_run = pytrec_eval.parse_run(f_run)
# Open the file with the fourth run: boost by text vector
with open('../data/boost_embtxt_result', 'r') as f_run:
    fourth_run = pytrec_eval.parse_run(f_run)
# Open the file with the fifth run: match by text vector
with open('../data/match_embtxt_result', 'r') as f_run:
    fifth_run = pytrec_eval.parse_run(f_run)

# Define an evaluator that computes only NDCG at the cutoff ranks, then evaluate all five runs
evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"ndcg_cut"})

first_results = evaluator.evaluate(first_run)
second_results = evaluator.evaluate(second_run)
third_results = evaluator.evaluate(third_run)
fourth_results = evaluator.evaluate(fourth_run)
fifth_results = evaluator.evaluate(fifth_run)

# Keep only the query ids that were evaluated in all five runs, so the score lists are paired by query
query_ids = list(set(first_results.keys()) & set(second_results.keys()) & set(third_results.keys()) & set(fourth_results.keys()) & set(fifth_results.keys()))

# Collect the NDCG@10 score per query for each run, in the same query order
first_scores = [first_results[query_id]['ndcg_cut_10'] for query_id in query_ids]
second_scores = [second_results[query_id]['ndcg_cut_10'] for query_id in query_ids]
third_scores = [third_results[query_id]['ndcg_cut_10'] for query_id in query_ids]
fourth_scores = [fourth_results[query_id]['ndcg_cut_10'] for query_id in query_ids]
fifth_scores = [fifth_results[query_id]['ndcg_cut_10'] for query_id in query_ids]

# Paired t-test between the default run and the boost-by-image-vector run
print(scipy.stats.ttest_rel(first_scores, second_scores))

Ttest_relResult(statistic=-1.2321974256501635, pvalue=0.22003306982392185)
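
To make the printed statistic less of a black box: a paired t-test works on the per-query score differences and computes t = mean(d) / (s_d / sqrt(n)), where s_d is the sample standard deviation of the differences and n is the number of queries. The p-value then comes from a t distribution with n - 1 degrees of freedom. The sketch below computes only the statistic, using the standard library. `paired_t_statistic` is a hypothetical helper and the scores are made-up numbers, not results from the runs above.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on two equally long score lists.

    Hypothetical helper for illustration; it does not compute a p-value and
    raises ZeroDivisionError if all differences are identical.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired, one entry per query")
    if len(scores_a) < 2:
        raise ValueError("at least two queries are needed")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Made-up per-query scores for two runs, paired by position.
toy_first = [0.50, 0.40, 0.70, 0.60]
toy_second = [0.60, 0.60, 0.70, 0.70]

t_stat = paired_t_statistic(toy_first, toy_second)
print(t_stat)
```

A negative statistic means the first list scores lower on average than the second, as in the notebook's output. There, the p-value of about 0.22 is above the conventional 0.05 threshold, so the difference between the two runs would not be called statistically significant at that level.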
