A script for performing a statistical significance test between ASR (Automatic Speech Recognition) transcription hypotheses. It can be used to check whether a difference in WER (word error rate) between two systems, measured on the same test set, is statistically significant.
You will need the commands sclite and sc_stats from the NIST Scoring Toolkit, available here:
http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sctk.htm
RUN.sh contains an example script to (1) generate an SGML (XML-like) file for transcript hypotheses you want to compare, and (2) to compare the hypotheses using a statistical significance test of your choice. The repo contains:
- ref.trn : A reference (ground truth) transcript in the format of .
- hyp.A.trn and hyp.B.trn : Two transcript hypotheses, each generated by a different ASR setup.
Run the following commands to score hypotheses A and B against the reference and then compare them.
sclite -F -i wsj -r ref.trn -h hyp.A.trn -o sgml
sclite -F -i wsj -r ref.trn -h hyp.B.trn -o sgml
cat hyp.A.trn.sgml hyp.B.trn.sgml | sc_stats -p -t mapsswe -v -u -n result.A-B.mapsswe
result.A-B.mapsswe.stats.unified will be generated (output below), indicating a significant difference between the hypotheses at p < 0.01 (the table reports 0.007, marked **).
|------------------------------------------------------------------------------|
| Test     ||            |  hyp.A.trn  |       hyp.B.trn        || Test        |
| Abbrev.  ||            |             |                        || Abbrev.     |
|----------++------------+-------------+------------------------++-------------|
| MP       || hyp.A.trn  |             | hyp.B.trn 0.007 **     || MP          |
|----------++------------+-------------+------------------------++-------------|
| MP       || hyp.B.trn  |             |                        || MP          |
|------------------------------------------------------------------------------|
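The three commands above can be wrapped in a small shell function so that any pair of hypothesis files can be compared against the same reference. This is only a minimal sketch and is not taken from RUN.sh: the function name compare_hyps and the DRY_RUN switch are hypothetical, and the sclite and sc_stats flags are copied unchanged from the commands above. It assumes, as in the example, that sclite writes its SGML output to the hypothesis file name plus ".sgml".

```shell
#!/bin/sh
# Hypothetical helper (not part of RUN.sh): score two hypotheses against one
# reference with sclite, then compare them with the mapsswe test in sc_stats.
# Set DRY_RUN=1 to print the commands instead of executing them.

compare_hyps() {
    if [ "$#" -ne 4 ]; then
        echo "usage: compare_hyps REF HYP_A HYP_B OUT_PREFIX" >&2
        return 2
    fi
    ref=$1
    hyp_a=$2
    hyp_b=$3
    out=$4

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sclite -F -i wsj -r $ref -h $hyp_a -o sgml"
        echo "sclite -F -i wsj -r $ref -h $hyp_b -o sgml"
        echo "cat $hyp_a.sgml $hyp_b.sgml | sc_stats -p -t mapsswe -v -u -n $out"
        return 0
    fi

    # Fail early if the NIST Scoring Toolkit commands are not on PATH.
    for tool in sclite sc_stats; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found on PATH" >&2
            return 1
        fi
    done

    # Assumption: each sclite run writes <hypothesis file>.sgml, as in the example.
    sclite -F -i wsj -r "$ref" -h "$hyp_a" -o sgml || return 1
    sclite -F -i wsj -r "$ref" -h "$hyp_b" -o sgml || return 1
    cat "$hyp_a.sgml" "$hyp_b.sgml" | sc_stats -p -t mapsswe -v -u -n "$out"
}
```

With the files in this repo, the call would be: compare_hyps ref.trn hyp.A.trn hyp.B.trn result.A-B.mapsswe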
Instead of the mapsswe (Matched Pairs Sentence-Segment Word Error) option, you can use mcn (McNemar), sign (Sign test), wilc (Wilcoxon Signed Rank), anovar (Analysis of Variance), or std4 (the standard four: mcn, mapsswe, wilc, and sign).
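Because the SGML files only need to be generated once, several tests can be run over the same pair of files in a loop. The sketch below assumes that hyp.A.trn.sgml and hyp.B.trn.sgml already exist from the sclite step. The function name run_all_tests, the DRY_RUN switch, and the "prefix.test" output naming are illustrative choices, not part of the toolkit or of RUN.sh. Only the -t value changes between runs; the other sc_stats flags are copied from the command shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: run several sc_stats tests on one pair of SGML files.
# The list covers the four tests that std4 bundles; edit it to add anovar.
# Set DRY_RUN=1 to print the commands instead of executing them.

TESTS="mapsswe mcn sign wilc"

run_all_tests() {
    sgml_a=$1
    sgml_b=$2
    prefix=$3
    for t in $TESTS; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "cat $sgml_a $sgml_b | sc_stats -p -t $t -v -u -n $prefix.$t"
        else
            # Stop at the first test that sc_stats fails to run.
            cat "$sgml_a" "$sgml_b" | sc_stats -p -t "$t" -v -u -n "$prefix.$t" || return 1
        fi
    done
}
```

For example, run_all_tests hyp.A.trn.sgml hyp.B.trn.sgml result.A-B would request one report per test, each named after the test that produced it.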