qcri/e-wer

Word Error Rate Estimation Without ASR Output: e-WER2

This is the second version of e-WER (e-WER2).

New Features!

  • An end-to-end multistream architecture to predict the WER per sentence using language-independent phonotactic features.
  • Our novel system is able to learn acoustic-lexical embeddings.
  • We estimate the error rate directly, without access to either the ASR results or the ASR system – no-box WER estimation.

System               Pearson   RMSE   e-WER (reference WER = 28.5%)
e-WER Glass Box      0.82      0.17   27.3%
e-WER Black Box      0.68      0.19   35.8%
e-WER2 Glass Box     0.74      0.19   27.9%
e-WER2 Black Box     0.66      0.21   30.9%
e-WER No Box         0.56      0.24   30.9%
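
The Pearson and RMSE columns compare per-sentence WER estimates against the true per-sentence WER, and the last column is a single test-set figure. A minimal sketch of how such numbers could be computed is shown here. The function names are hypothetical, and weighting each sentence by its word count in `aggregate_wer` is an assumption, not necessarily the aggregation used to produce the table.

```python
import math


def pearson(estimated, reference):
    """Pearson correlation between estimated and reference per-sentence WER."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_r = sum(reference) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, reference))
    std_e = math.sqrt(sum((e - mean_e) ** 2 for e in estimated))
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in reference))
    return cov / (std_e * std_r)


def rmse(estimated, reference):
    """Root mean squared error between estimated and reference per-sentence WER."""
    squared = sum((e - r) ** 2 for e, r in zip(estimated, reference))
    return math.sqrt(squared / len(estimated))


def aggregate_wer(per_sentence_wer, weights):
    """Weighted mean of per-sentence WER (assumption: weights are word counts)."""
    return sum(w * e for e, w in zip(per_sentence_wer, weights)) / sum(weights)
```

For example, `aggregate_wer([0.5, 0.0], [10, 30])` weights the first sentence by 10 words and the second by 30, giving 0.125.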

Model definition

An end-to-end, multistream-based regression model that predicts the WER per sentence.

We combine four streams (lexical, phonotactic, acoustic and numerical features) into a single end-to-end network that estimates the word error rate directly. The multistream network is trained jointly to obtain a joint feature space, on top of which a further fully connected layer estimates the WER.
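
To make the data flow concrete, here is a toy forward pass in plain Python. It is only a sketch under stated assumptions: each stream is encoded by a single dense layer with ReLU (the actual model presumably uses richer encoders), the weights are random and untrained, and the class name, layer sizes and feature dimensions are all hypothetical.

```python
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultistreamRegressor:
    """Hypothetical sketch: one encoder per stream, a joint layer, a scalar head."""

    def __init__(self, stream_dims, hidden=8, seed=0):
        rng = random.Random(seed)
        self.encoders = {name: init_layer(dim, hidden, rng)
                         for name, dim in stream_dims.items()}
        self.joint = init_layer(hidden * len(stream_dims), hidden, rng)
        self.head = init_layer(hidden, 1, rng)

    def joint_features(self, streams):
        # Encode every stream separately, then concatenate into one vector.
        concatenated = []
        for name, (weights, biases) in self.encoders.items():
            concatenated.extend(relu(dense(streams[name], weights, biases)))
        return relu(dense(concatenated, *self.joint))

    def predict(self, streams):
        # A final fully connected layer maps the joint features to one WER value.
        (output,) = dense(self.joint_features(streams), *self.head)
        return max(0.0, output)  # assumption: a WER estimate cannot be negative


model = MultistreamRegressor(
    {"lexical": 4, "phonotactic": 3, "acoustic": 5, "numerical": 2})
example = {
    "lexical": [0.1, 0.2, 0.3, 0.4],
    "phonotactic": [0.5, 0.1, 0.0],
    "acoustic": [0.2, 0.2, 0.1, 0.9, 0.4],
    "numerical": [12.0, 3.5],
}
estimated_wer = model.predict(example)
```

In a trained system the encoders, the joint layer and the head would be optimised together against the true per-sentence WER, which is what joint training refers to above.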

Results

Cumulative WER on the test set over all sentences. The X-axis is duration in hours and the Y-axis is WER in %.
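
A curve of this kind can be built by accumulating errors and reference words sentence by sentence. The sketch below assumes each sentence comes with a duration in seconds, an error count and a reference word count. The function name and input layout are hypothetical, and for an estimated curve the error counts would have to be derived from the predicted per-sentence WER.

```python
def cumulative_wer(durations_sec, errors, ref_words):
    """Return (hours so far, cumulative WER in %) after each sentence."""
    points = []
    total_duration = 0.0
    total_errors = 0.0
    total_words = 0
    for duration, err, words in zip(durations_sec, errors, ref_words):
        total_duration += duration
        total_errors += err
        total_words += words
        # Guard against a leading run of empty references.
        wer = 100.0 * total_errors / total_words if total_words else 0.0
        points.append((total_duration / 3600.0, wer))
    return points
```

With two half-hour segments containing 2 errors in 10 words and 1 error in 20 words, the curve passes through 20% at 0.5 hours and 10% at 1 hour.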

Citation

More details about this work can be found in the INTERSPEECH 2020 and ACL 2018 papers:

@InProceedings{,
    author={Ali, Ahmed and Renals, Steve},
    title={Word Error Rate Estimation Without ASR Output: e-WER2},
    booktitle={INTERSPEECH},
    year={2020},
}

@InProceedings{,
    author={Ali, Ahmed and Renals, Steve},
    title={Word Error Rate Estimation for Speech Recognition: e-WER},
    booktitle={ACL},
    year={2018},
}
