replicability-analysis-NLP

This code implements the methods described in Dror et al. (2017):

"Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets." Rotem Dror, Gili Baumer, Marina Bogomolov and Roi Reichart. Accepted to the Transactions of the Association for Computational Linguistics (TACL).

The implemented methods help researchers and engineers draw statistically sound conclusions about the difference in performance between two algorithms, based on multiple comparisons between them. In each comparison the two algorithms are applied to one dataset, and the researcher records the performance of each algorithm on that dataset (e.g., an accuracy measure) together with the p-value produced by a statistical significance test (e.g., a t-test or a bootstrap test) that estimates the robustness of the observed difference. The methods implemented here are the Bonferroni and Fisher tests for counting the number of datasets on which one algorithm is significantly better than the other, and the Holm procedure for identifying those datasets.

Please see the paper for an explanation of why simply retrieving the datasets on which one algorithm performs better than the other with a p-value below a desired threshold is not a statistically sound solution to this problem.
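
For intuition, here is a minimal Python sketch of the two count estimators, written from the description above. It tests the partial conjunction null hypotheses "fewer than u of the n datasets show an effect" for increasing u; the function name, signature, and use of NumPy/SciPy are illustrative assumptions, not this repository's actual API.

import numpy as np
from scipy import stats

def partial_conjunction_k(pvals, alpha, method="bonferroni"):
    # Illustrative sketch, not the repo's API: estimate k, the number of
    # datasets with a real effect, by step-down testing of the partial
    # conjunction nulls "fewer than u of the n datasets have an effect".
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    k = 0
    for u in range(1, n + 1):
        tail = p[u - 1:]  # the n - u + 1 largest p-values
        if method == "bonferroni":
            # Bonferroni combination; valid under arbitrary dependence.
            p_u = (n - u + 1) * tail[0]
        else:
            # Fisher combination; assumes the datasets are independent.
            fisher_stat = -2.0 * np.log(tail).sum()
            p_u = stats.chi2.sf(fisher_stat, df=2 * (n - u + 1))
        if min(p_u, 1.0) <= alpha:
            k = u  # at least u datasets show an effect
        else:
            break  # step-down: stop at the first level that is not rejected
    return k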

Getting Started

P-values

Our code requires a list of p-values from comparing the two algorithms on multiple datasets, one p-value per dataset. If you are unsure how to choose a significance test, please consult our paper, which discusses this issue for four representative NLP applications.

Running the tests

The input:

  1. A comma-separated list of the p-values.
  2. The desired significance level (alpha).

The algorithm computes two estimators:
  • B (Bonferroni), to be used when the datasets may be dependent.
  • F (Fisher), to be used when the datasets are independent.

The algorithm will output:

  1. An estimate (the K estimator) of the number of datasets with a significant effect according to the Bonferroni method.
  2. An estimate (the K estimator) of the number of datasets with a significant effect according to the Fisher method.
  3. The indices of the datasets identified by the Holm procedure (the rejection list); a sketch of this procedure follows below.
  • Note that the number of datasets identified by the Holm procedure is always exactly K-Bonferroni, while K-Fisher can be equal to, smaller than, or larger than that number (see the paper for more details).
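
Under the same illustrative assumptions as the sketch above, the Holm step-down procedure can be written as follows (again, the function name is not this repository's API):

def holm_rejections(pvals, alpha):
    # Illustrative sketch of the Holm step-down procedure: compare the
    # i-th smallest p-value against alpha / (n - i + 1) and stop at the
    # first failure; returns the 0-based indices of the rejected datasets.
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    rejected = []
    for rank, idx in enumerate(np.argsort(pvals)):  # rank = i - 1
        if pvals[idx] <= alpha / (n - rank):
            rejected.append(int(idx))
        else:
            break
    return sorted(rejected)

The number of indices this returns matches the K-Bonferroni estimate, which is the connection noted in the bullet above.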

Example

Enter p-values :
0.168,0.297,0.357,0.019,0.218,0.001

{'dataset1': 0.168, 'dataset2': 0.297, 'dataset3': 0.357, 'dataset4': 0.019, 'dataset5': 0.218, 'dataset6': 0.001}

Enter significance level: 
0.05

The Bonferroni-k estimator for the number of datasets with effect is: 1

The Fisher-k estimator for the number of datasets with effect is: 2

The rejections list according to the Holm procedure is:
dataset6
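
Running the illustrative sketches from the earlier sections on these p-values reproduces the numbers above:

pvals = [0.168, 0.297, 0.357, 0.019, 0.218, 0.001]
print(partial_conjunction_k(pvals, 0.05, "bonferroni"))  # 1
print(partial_conjunction_k(pvals, 0.05, "fisher"))      # 2
print([f"dataset{i + 1}" for i in holm_rejections(pvals, 0.05)])  # ['dataset6']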

Citation

If you make use of this software for research purposes, we would appreciate a citation of the following:

@article{Q17-1033,
  author  = "Dror, Rotem and Baumer, Gili and Bogomolov, Marina and Reichart, Roi",
  title   = "Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets",
  journal = "Transactions of the Association for Computational Linguistics",
  year    = "2017",
  volume  = "5",
  pages   = "471--486",
  url     = "http://aclweb.org/anthology/Q17-1033"
}

Release History

  • 0.1.0 The first proper release.
  • 0.2.0 Output both k estimators.

Contact Information

This file and the code were written by Rotem Dror. The methods are described in the paper above (Dror et al., 2017). For questions, please write to: rtmdrr@seas.upenn.edu
