<a href="https://colab.research.google.com/github/SlangLabs/asr-wer-bench/blob/main/asr_wer_bench.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ASR WER Benchmark

Infrastructure to measure Word Error Rate for an offline ASR engine on a given audio data set.
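
Word Error Rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the ASR hypothesis, divided by the number of words in the reference. The sketch below is only an illustration of that definition, not the code used by this bench or by `sclite`; the function name `word_error_rate` and the example sentences are made up for the purpose.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences: one substituted word out of seven reference words.
wer = word_error_rate(
    "the quick brown fox jumps over dogs",
    "the quick brown fox jumped over dogs",
)
print(round(100 * wer, 1))  # 14.3
```

A full scoring tool such as `sclite` additionally reports the substitution, deletion, and insertion counts separately, which this sketch does not track.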

## Setup

In [1]:
!python --version

Python 3.6.9


In [2]:
# For wav2vec
%load_ext cython

In [3]:
!apt-get -y install tree build-essential cmake sox libsox-dev \
    libfftw3-dev libatlas-base-dev liblzma-dev libbz2-dev libzstd-dev \
    apt-utils gcc libpq-dev libopenblas-dev \
    libsndfile-dev libsndfile1-dev libgflags-dev libgoogle-glog-dev \
    &> /dev/null

### SCTK

Install `sclite`.

In [4]:
!git clone https://github.com/usnistgov/SCTK.git

Cloning into 'SCTK'...
remote: Enumerating objects: 5115, done.
remote: Total 5115 (delta 0), reused 0 (delta 0), pack-reused 5115
Receiving objects: 100% (5115/5115), 7.26 MiB | 19.26 MiB/s, done.
Resolving deltas: 100% (3658/3658), done.


In [5]:
!ls -l SCTK

total 52
-rw-r--r--  1 root root 16498 Nov 12 08:39 CHANGELOG
-rw-r--r--  1 root root   788 Nov 12 08:39 DISCLAIMER
drwxr-xr-x  4 root root  4096 Nov 12 08:39 doc
-rw-r--r--  1 root root  2273 Nov 12 08:39 LICENSE.md
-rw-r--r--  1 root root  1673 Nov 12 08:39 makefile
-rw-r--r--  1 root root  6440 Nov 12 08:39 README.md
drwxr-xr-x 26 root root  4096 Nov 12 08:39 src
-rw-r--r--  1 root root  1484 Nov 12 08:39 TODO


In [6]:
!cd SCTK && make config &> /dev/null

In [7]:
!cd SCTK && make all &> /dev/null

In [8]:
!cd SCTK && make check &> /dev/null

In [9]:
!cd SCTK && make install &> /dev/null

In [10]:
!ls -la SCTK/bin/sclite

-rwxr-xr-x 1 root root 344296 Nov 12 08:41 SCTK/bin/sclite


### KenLM Language Model

Needed for wav2vec decoding.

#### Build From GitHub

In [11]:
!git clone https://github.com/kpu/kenlm.git

Cloning into 'kenlm'...
remote: Enumerating objects: 90, done.
remote: Counting objects: 100% (90/90), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 13672 (delta 41), reused 54 (delta 21), pack-reused 13582
Receiving objects: 100% (13672/13672), 5.53 MiB | 18.17 MiB/s, done.
Resolving deltas: 100% (7847/7847), done.


In [12]:
!cd kenlm && mkdir -p build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON && make -j16 &> /dev/null

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Eigen3 (missing: Eigen3_DIR)
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Boost version: 

#### Build from https://kheafield.com/code/kenlm/

In [13]:
#!mkdir -p kenlm/build
#!cd kenlm && curl -LO https://kheafield.com/code/kenlm.tar.gz
#!cd kenlm/build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON && make -j2

#### Set KENLM root

In [14]:
import os
os.environ['KENLM_ROOT_DIR'] = os.path.abspath('./kenlm')

In [15]:
!echo $KENLM_ROOT_DIR

/content/kenlm


#### Build LM

Instructions to build LM: https://kheafield.com/code/kenlm/structures/

In [16]:
!mkdir -p ./models/kenlm/en-US/

In [17]:
!cd ./models/kenlm/en-US/ && curl -LO http://www.openslr.org/resources/11/4-gram.arpa.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1292M  100 1292M    0     0  26.6M      0  0:00:48  0:00:48 --:--:-- 26.6M


In [18]:
!cd ./models/kenlm/en-US/ && gunzip 4-gram.arpa.gz

In [19]:
!ls -l ./models/kenlm/en-US/

total 4292612
-rw-r--r-- 1 root root 4395628122 Nov 12 08:43 4-gram.arpa


In [20]:
!cd ./models/kenlm/en-US/ && $KENLM_ROOT_DIR/build/bin/build_binary trie 4-gram.arpa 4-gram.bin

Reading 4-gram.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 1219518464 bytes == 0x559445e02000 @  0x7fe4b09eb1e7 0x55944334399e 0x559443338720 0x55944331594b 0x559443318514 0x55944330cf55 0x7fe4af22fbf7 0x55944330d6aa
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [21]:
!ls -l ./models/kenlm/en-US/

total 5812376
-rw-r--r-- 1 root root 4395628122 Nov 12 08:43 4-gram.arpa
-rw-r--r-- 1 root root 1556230938 Nov 12 08:48 4-gram.bin


### ASR WER Bench

Clone the repository and install its requirements.

In [22]:
!git clone https://github.com/SlangLabs/asr-wer-bench.git

Cloning into 'asr-wer-bench'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 33 (delta 4), reused 29 (delta 4), pack-reused 0
Unpacking objects: 100% (33/33), done.


In [23]:
# Run for CPU
!pip install -r asr-wer-bench/requirements.txt



In [24]:
# Run for GPU
#!pip install -r asr-wer-bench/requirements-gpu.txt

In [25]:
!ls -l asr-wer-bench/data/en-US/audio

total 272
-rw-r--r-- 1 root root    23 Nov 12 08:48 2830-3980-0043.txt
-rw-r--r-- 1 root root 63244 Nov 12 08:48 2830-3980-0043.wav
-rw-r--r-- 1 root root    31 Nov 12 08:48 4507-16021-0012.txt
-rw-r--r-- 1 root root 87564 Nov 12 08:48 4507-16021-0012.wav
-rw-r--r-- 1 root root    32 Nov 12 08:48 8455-210777-0068.txt
-rw-r--r-- 1 root root 82924 Nov 12 08:48 8455-210777-0068.wav
-rw-r--r-- 1 root root   340 Nov 12 08:48 Attribution.txt
-rw-r--r-- 1 root root 18652 Nov 12 08:48 License.txt


## DeepSpeech

### Download Models

In [26]:
!mkdir -p ./models/deepspeech/en-US/
!cd ./models/deepspeech/en-US/ && curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm
!cd ./models/deepspeech/en-US/ && curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.scorer

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   652  100   652    0     0   4024      0 --:--:-- --:--:-- --:--:--  4024
100  180M  100  180M    0     0  55.9M      0  0:00:03  0:00:03 --:--:-- 59.6M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   654  100   654    0     0   4087      0 --:--:-- --:--:-- --:--:--  4087
100  909M  100  909M    0     0  72.2M      0  0:00:12  0:00:12 --:--:-- 70.4M


In [27]:
!ls -l ./models/deepspeech/en-US/

total 1115516
-rw-r--r-- 1 root root 188915984 Nov 12 08:48 deepspeech-0.8.1-models.pbmm
-rw-r--r-- 1 root root 953363776 Nov 12 08:48 deepspeech-0.8.1-models.scorer


In [28]:
# Verify DeepSpeech

!deepspeech \
  --model models/deepspeech/en-US/deepspeech-0.8.1-models.pbmm \
  --scorer models/deepspeech/en-US/deepspeech-0.8.1-models.scorer \
  --audio asr-wer-bench/data/en-US/audio/2830-3980-0043.wav

Loading model from file models/deepspeech/en-US/deepspeech-0.8.1-models.pbmm
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.1-0-gfa883eb
2020-11-12 08:48:24.472793: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0592s.
Loading scorer from files models/deepspeech/en-US/deepspeech-0.8.1-models.scorer
Loaded scorer in 0.000458s.
Running inference.
experience proves this
Inference took 1.999s for 1.975s audio file.


In [29]:
# Expected transcript
!cat asr-wer-bench/data/en-US/audio/2830-3980-0043.txt

experience proves this


### Run Test Bench

In [30]:
!PYTHONPATH=asr-wer-bench python asr-wer-bench/werbench/asr/engine.py \
  --engine deepspeech \
  --model-path-prefix ./models/deepspeech/en-US/deepspeech-0.8.1-models \
  --input-dir ./asr-wer-bench/data/en-US/audio \
  --output-path-prefix ./deepspeech-out

TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.1-0-gfa883eb
2020-11-12 08:48:27.985978: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
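
The run leaves a `deepspeech-out.ref` and a `deepspeech-out.hyp` file, which `sclite` reads in its `trn` format: one utterance per line, with the transcript followed by the utterance id in parentheses. The sketch below shows how a reference file of that shape could be assembled from the `.txt` files that sit next to each `.wav`. It is an assumption about the general shape only, not the bench's actual implementation; `write_trn` is a hypothetical helper, and using the file stem as the utterance id is a choice made for illustration.

```python
from pathlib import Path


def write_trn(audio_dir: str, trn_path: str) -> int:
    """Write one 'transcript (utterance-id)' line per .txt that has a matching .wav."""
    lines = []
    for txt in sorted(Path(audio_dir).glob("*.txt")):
        if not txt.with_suffix(".wav").exists():
            continue  # skip files such as License.txt that are not transcripts
        # Collapse whitespace so each utterance stays on a single line.
        text = " ".join(txt.read_text(encoding="utf-8").split())
        lines.append(f"{text} ({txt.stem})")
    Path(trn_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A hypothesis file would follow the same layout, with the ASR output in place of the expected transcript and the same ids so that the two files can be aligned line by line.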


Compare using `sclite`:

In [31]:
!./SCTK/bin/sclite -r deepspeech-out.ref trn -h deepspeech-out.hyp trn -i rm

sclite: 2.10 TK Version 1.3
Begin alignment of Ref File: 'deepspeech-out.ref' and Hyp File: 'deepspeech-out.hyp'
    Alignment# 1 for speaker 4507          
    Alignment# 1 for speaker 8455          
    Alignment# 1 for speaker 2830          




                     SYSTEM SUMMARY PERCENTAGES by SPEAKER                      

       ,----------------------------------------------------------------.
       |                       deepspeech-out.hyp                       |
       |----------------------------------------------------------------|
       | SPKR   | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
       |--------+-------------+-----------------------------------------|
       | 4507   |    1      7 | 85.7   14.3    0.0    0.0   14.3  100.0 |
       |--------+-------------+-----------------------------------------|
       | 8455   |    1      6 |100.0    0.0    0.0    0.0    0.0    0.0 |
       |--------+-------------+-----------------------------------------|


## Facebook wav2vec 2.0

From an engineering standpoint, wav2vec is not as mature as Mozilla DeepSpeech:

- The fairseq [PyPI package is more than a year stale](https://github.com/pytorch/fairseq/issues/2737).
- The source code does not even include a `requirements.txt`.
- wav2vec has no simple API for batch or streaming ASR on a single wav file.
- Pieces from kenlm, wav2letter, and wav2vec have to be brought together by hand (as opposed to: here is a package, here is a model, and here is a straightforward API to use the two together on a wav file).

Beware that it is research-quality software.

In [32]:
!pip install soundfile torchaudio sentencepiece omegaconf hydra-core &> /dev/null

### Install wav2letter

In [33]:
!git clone -b v0.2 https://github.com/facebookresearch/wav2letter.git

Cloning into 'wav2letter'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 6215 (delta 11), reused 28 (delta 9), pack-reused 6174
Receiving objects: 100% (6215/6215), 5.98 MiB | 22.04 MiB/s, done.
Resolving deltas: 100% (3990/3990), done.


In [34]:
!cd wav2letter/bindings/python && pip install -e .

Obtaining file:///content/wav2letter/bindings/python
Installing collected packages: wav2letter
  Found existing installation: wav2letter 0.0.2
    Can't uninstall 'wav2letter'. No files were found to uninstall.
  Running setup.py develop for wav2letter
Successfully installed wav2letter


### Download Model

Pretrained models: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md

In [35]:
!mkdir -p ./models/wav2vec20/en-US/

!cd ./models/wav2vec20/en-US/ && curl -LO https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_960h_pl.pt
!cd ./models/wav2vec20/en-US/ && curl -LO https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt

!cd ./models/wav2vec20/en-US/ && curl -LO https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt

!cd ./models/wav2vec20/en-US/ && curl -LO https://dl.fbaipublicfiles.com/fairseq/wav2vec/librispeech_lexicon.lst

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3610M  100 3610M    0     0  23.6M      0  0:02:32  0:02:32 --:--:-- 23.6M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  900M  100  900M    0     0  23.0M      0  0:00:39  0:00:39 --:--:-- 23.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   207  100   207    0     0    400      0 --:--:-- --:--:-- --:--:--   400
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5093k  100 5093k    0     0  4491k      0  0:00:01  0:00:01 --:--:-- 4487k


In [36]:
!ls -l ./models/wav2vec20/en-US/

total 4624128
-rw-r--r-- 1 root root        207 Nov 12 08:54 dict.ltr.txt
-rw-r--r-- 1 root root    5215947 Nov 12 08:54 librispeech_lexicon.lst
-rw-r--r-- 1 root root  944029831 Nov 12 08:54 wav2vec_small_960h.pt
-rw-r--r-- 1 root root 3785840758 Nov 12 08:53 wav2vec_vox_960h_pl.pt


### Install fairseq

In [37]:
!git clone https://github.com/pytorch/fairseq.git

Cloning into 'fairseq'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 19672 (delta 15), reused 17 (delta 8), pack-reused 19629
Receiving objects: 100% (19672/19672), 8.88 MiB | 23.86 MiB/s, done.
Resolving deltas: 100% (14679/14679), done.


In [38]:
!pip install --editable ./fairseq/ &> /dev/null

In [39]:
# On MacOS
#!CFLAGS="-stdlib=libc++" pip install --editable ./fairseq/ &> /dev/null

In [40]:
!pip3 list | grep fairseq

fairseq                       1.0.0a0+e607911 /content/fairseq                   


In [41]:
!ls -l 

total 36
drwxr-xr-x  5 root root 4096 Nov 12 08:48 asr-wer-bench
-rw-r--r--  1 root root  140 Nov 12 08:48 deepspeech-out.hyp
-rw-r--r--  1 root root  140 Nov 12 08:48 deepspeech-out.ref
drwxr-xr-x 12 root root 4096 Nov 12 08:54 fairseq
drwxr-xr-x  9 root root 4096 Nov 12 08:41 kenlm
drwxr-xr-x  5 root root 4096 Nov 12 08:51 models
drwxr-xr-x  1 root root 4096 Nov  6 17:30 sample_data
drwxr-xr-x  6 root root 4096 Nov 12 08:39 SCTK
drwxr-xr-x 12 root root 4096 Nov 12 08:48 wav2letter


### Prepare data/manifest for wav2vec

In [42]:
!mkdir -p ./wav2vec-manifest
!cp ./models/wav2vec20/en-US/dict.ltr.txt ./wav2vec-manifest/

In [43]:
import soundfile as sf

In [44]:
!!PYTHONPATH="fairseq" python fairseq/examples/wav2vec/wav2vec_manifest.py ./asr-wer-bench/data/en-US/audio --dest ./wav2vec-manifest --ext wav --valid-percent 0

[]

In [45]:
!ls -l ./wav2vec-manifest/

total 12
-rw-r--r-- 1 root root 207 Nov 12 08:55 dict.ltr.txt
-rw-r--r-- 1 root root 118 Nov 12 08:55 train.tsv
-rw-r--r-- 1 root root  40 Nov 12 08:55 valid.tsv
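
For orientation, the `.tsv` manifest is a plain text file: the first line is the root audio directory, and each following line holds a path relative to that root and the number of audio frames, separated by a tab. The sketch below writes a file of that shape with only the standard library. It is an illustration, not a replacement for `wav2vec_manifest.py`: `write_manifest` is a hypothetical helper, and it assumes uncompressed PCM `.wav` files, since that is all the standard `wave` module reads.

```python
import wave
from pathlib import Path


def write_manifest(audio_dir: str, tsv_path: str) -> int:
    """Write a root-directory header plus one '<relative path>\t<frames>' row per wav."""
    root = Path(audio_dir).resolve()
    rows = []
    for wav_path in sorted(root.rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            frames = wav_file.getnframes()
        rows.append(f"{wav_path.relative_to(root)}\t{frames}")
    Path(tsv_path).write_text("\n".join([str(root)] + rows) + "\n", encoding="utf-8")
    return len(rows)
```

Reading the first few lines of the generated `train.tsv` is a quick way to confirm that every audio file in the data directory was picked up.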


In [46]:
# Doing no training here, just want to check that the fairseq/wav2vec setup works
!cp ./wav2vec-manifest/train.tsv ./wav2vec-manifest/test.tsv
!cp ./wav2vec-manifest/dict.ltr.txt ./wav2vec-manifest/test.ltr
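
Copying the dictionary is enough for a smoke test. When real reference labels are wanted, a `.ltr` file normally carries one line per manifest entry, with the transcript spelled out letter by letter and `|` marking word boundaries. The sketch below assumes the convention used by fairseq's example label script for LibriSpeech (uppercase letters, a trailing `|`); `to_letter_labels` is a hypothetical helper name.

```python
def to_letter_labels(transcript: str) -> str:
    """Spell a transcript as space-separated letters with '|' between words."""
    words = transcript.upper().split()
    # Join the words with '|', then put a space between every character.
    return " ".join("|".join(words)) + " |"


print(to_letter_labels("experience proves this"))
# E X P E R I E N C E | P R O V E S | T H I S |
```

The lines of such a file would need to be in the same order as the rows of the matching `.tsv` manifest.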

### Verify wav2vec

In [47]:
# Verify wav2vec

!PYTHONPATH="fairseq" python fairseq/examples/speech_recognition/infer.py \
./wav2vec-manifest \
--path ./models/wav2vec20/en-US/wav2vec_small_960h.pt \
--results-path ./models/wav2vec20/en-US/result-wav2vec \
--lexicon ./models/wav2vec20/en-US/librispeech_lexicon.lst \
--w2l-decoder kenlm --lm-model ./models/kenlm/en-US/4-gram.bin \
--task audio_pretraining \
--nbest 1 --gen-subset test \
--lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
--post-process letter --cpu --num-workers 1 --batch-size 8 --beam 1024

INFO:__main__:Namespace(all_gather_list_size=16384, autoregressive=False, batch_size=8, batch_size_valid=8, beam=1024, beam_size_token=100, beam_threshold=25.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=True, criterion='ctc', curriculum=0, data='./wav2vec-manifest', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, dump_emissions=None, dump_features=None, empty_cache_freq=0, enable_padding=False, eos=2, eval_wer=False, eval_wer_post_process='letter', eval_wer_tokenizer=None, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=N

### Run Test Bench

---
&copy; 2020 Slang Labs Private Limited. All rights reserved.