Speech Recognition

examples/speech_recognition implements the ASR task in fairseq, along with the features, datasets, models and loss functions needed to train and run inference for the model described in Transformers with convolutional context for ASR (Abdelrahman Mohamed et al., 2019).

Additional dependencies

On top of the main fairseq dependencies, there are a couple of additional requirements.

  1. Please follow the instructions to install torchaudio. This is required to compute audio fbank features (a short sketch follows this list).
  2. Sclite is used to measure WER. Sclite can be downloaded and built from source as part of the SCTK package. Training and inference do not require Sclite.
  3. sentencepiece is required in order to create datasets with word-piece targets.
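
For reference, here is a minimal sketch of how fbank features can be computed with torchaudio; the file name is a placeholder and the exact parameters used by the preprocessing scripts may differ.

import torchaudio

# Load an audio file and compute 80-dimensional log-mel filterbank (fbank) features.
waveform, sample_rate = torchaudio.load("sample.flac")  # placeholder path
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, 80)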

Preparing librispeech data

./examples/speech_recognition/datasets/prepare-librispeech.sh $DIR_TO_SAVE_RAW_DATA $DIR_FOR_PREPROCESSED_DATA

Training librispeech data

python train.py $DIR_FOR_PREPROCESSED_DATA --save-dir $MODEL_PATH --max-epoch 80 --task speech_recognition --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0  --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/

Inference for librispeech

$SET can be test_clean or test_other. Any checkpoint in $MODEL_PATH can be selected; in this example we use checkpoint_last.pt.

python examples/speech_recognition/infer.py $DIR_FOR_PREPROCESSED_DATA --task speech_recognition --max-tokens 25000 --nbest 1 --path $MODEL_PATH/checkpoint_last.pt --beam 20 --results-path $RES_DIR --batch-size 40 --gen-subset $SET --user-dir examples/speech_recognition/

Scoring the results with Sclite

sclite -r ${RES_DIR}/ref.word-checkpoint_last.pt-${SET}.txt -h ${RES_DIR}/hypo.word-checkpoint_last.pt-${SET}.txt -i rm -o all stdout > $RES_REPORT

The Sum/Avg row in the first table of the report contains the WER.
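
For reference, the WER in that row is the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal sketch of that computation (Sclite additionally breaks the errors down into substitutions, deletions and insertions):

# Word error rate as word-level edit distance normalized by reference length.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 reference words ≈ 0.33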

Using wav2letter components

wav2letter now has integration with fairseq. Currently this includes:

  • AutoSegmentationCriterion (ASG)
  • wav2letter-style Conv/GLU model
  • wav2letter's beam search decoder

To use these, follow the instructions below (summarized from the wav2letter repository) to install the python bindings. Note that the python bindings cover only a subset of wav2letter and do not require its full dependencies (notably, flashlight and ArrayFire are not required).

To quickly summarize the instructions: first, install CUDA. Then follow these steps:

# additional prerequisites - use equivalents for your distro
sudo apt-get install build-essential cmake libatlas-base-dev libfftw3-dev liblzma-dev libbz2-dev libzstd-dev
# install KenLM from source
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j16
cd ..
export KENLM_ROOT_DIR=$(pwd)
cd ..
# install wav2letter python bindings
git clone https://github.com/facebookresearch/wav2letter.git
cd wav2letter/bindings/python
# make sure your python environment is active at this point
pip install torch packaging
pip install -e .
# try some examples to verify installation succeeded
python ./examples/criterion_example.py
python ./examples/decoder_example.py ../../src/decoder/test
python ./examples/feature_example.py ../../src/feature/test/data

Training librispeech data (wav2letter style, Conv/GLU + ASG loss)

Training command:

python train.py $DIR_FOR_PREPROCESSED_DATA --save-dir $MODEL_PATH --max-epoch 100 --task speech_recognition --arch w2l_conv_glu_enc --batch-size 4 --optimizer sgd --lr 0.3,0.8 --momentum 0.8 --clip-norm 0.2 --max-tokens 50000 --log-format json --log-interval 100 --num-workers 0 --sentence-avg --criterion asg_loss --asg-transitions-init 5 --max-replabel 2 --linseg-updates 8789 --user-dir examples/speech_recognition

Note that the ASG loss currently does not work well with word-pieces. You should prepare a dataset with character targets by setting nbpe=31 in prepare-librispeech.sh (a rough sketch of the corresponding sentencepiece step follows).
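
If you want to build such character targets yourself, here is a rough sketch using sentencepiece; the file names are placeholders, and the exact options used by prepare-librispeech.sh may differ.

import sentencepiece as spm

# Train a character-level model with a 31-symbol vocabulary (what nbpe=31 with
# character units is meant to produce); input/output names are placeholders.
spm.SentencePieceTrainer.train(
    input="train_text.txt",
    model_prefix="spm_char_31",
    vocab_size=31,
    model_type="char",
)
sp = spm.SentencePieceProcessor(model_file="spm_char_31.model")
print(sp.encode("doorbell", out_type=str))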

Inference for librispeech (wav2letter decoder, n-gram LM)

Inference command:

python examples/speech_recognition/infer.py $DIR_FOR_PREPROCESSED_DATA --task speech_recognition --seed 1 --nbest 1 --path $MODEL_PATH/checkpoint_last.pt --gen-subset $SET --results-path $RES_DIR --w2l-decoder kenlm --kenlm-model $KENLM_MODEL_PATH --lexicon $LEXICON_PATH --beam 200 --beam-threshold 15 --lm-weight 1.5 --word-score 1.5 --sil-weight -0.3 --criterion asg_loss --max-replabel 2 --user-dir examples/speech_recognition

$KENLM_MODEL_PATH should be a standard n-gram language model file. $LEXICON_PATH should be a wav2letter-style lexicon (list of known words and their spellings). For ASG inference, a lexicon line should look like this (note the repetition labels):

doorbell  D O 1 R B E L 1 ▁
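
To illustrate the repetition labels: with --max-replabel 2, a run of identical letters is collapsed into the letter followed by the number of extra repetitions (so "OO" becomes "O 1"). The helper below is purely illustrative and not part of fairseq or wav2letter.

# Hypothetical helper showing how the replabel spelling above can be derived
# from plain characters.
def to_replabel_spelling(word, max_replabel=2):
    chars = list(word.upper()) + ["▁"]  # word-boundary marker, as in the example
    tokens = []
    i = 0
    while i < len(chars):
        run = 1
        while i + run < len(chars) and chars[i + run] == chars[i] and run <= max_replabel:
            run += 1
        tokens.append(chars[i])
        if run > 1:
            tokens.append(str(run - 1))  # number of extra repetitions
        i += run
    return " ".join(tokens)

print(to_replabel_spelling("doorbell"))  # D O 1 R B E L 1 ▁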

For CTC inference with word-pieces, repetition labels are not used and the lexicon should contain the most common spellings of each word (one can use sentencepiece's NBestEncodeAsPieces for this; see the sketch after the example below):

doorbell  ▁DOOR BE LL
doorbell  ▁DOOR B E LL
doorbell  ▁DO OR BE LL
doorbell  ▁DOOR B EL L
doorbell  ▁DOOR BE L L
doorbell  ▁DO OR B E LL
doorbell  ▁DOOR B E L L
doorbell  ▁DO OR B EL L
doorbell  ▁DO O R BE LL
doorbell  ▁DO OR BE L L
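
Such entries can be generated with sentencepiece, roughly as follows; the model path and n-best size are placeholders, and the casing of the output depends on how the word-piece model was trained.

import sentencepiece as spm

# Load the word-piece model used to create the targets (placeholder path).
sp = spm.SentencePieceProcessor(model_file="spm_wordpiece.model")

word = "doorbell"
# NBestEncodeAsPieces returns the n most likely segmentations of the word.
for pieces in sp.NBestEncodeAsPieces(word, 10):
    print(word, " ".join(pieces))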

Lowercase vs. uppercase matters: the word should match the case of the n-gram language model (i.e. $KENLM_MODEL_PATH), while the spelling should match the case of the token dictionary (i.e. $DIR_FOR_PREPROCESSED_DATA/dict.txt).

Inference for librispeech (wav2letter decoder, viterbi only)

Inference command:

python examples/speech_recognition/infer.py $DIR_FOR_PREPROCESSED_DATA --task speech_recognition --seed 1 --nbest 1 --path $MODEL_PATH/checkpoint_last.pt --gen-subset $SET --results-path $RES_DIR --w2l-decoder viterbi --criterion asg_loss --max-replabel 2 --user-dir examples/speech_recognition