add all files
Chao Weng committed Dec 25, 2020
1 parent 44fe0bc commit 23c4cdd
Showing 42 changed files with 7,997 additions and 3 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

- Copyright [yyyy] [name of copyright owner]
+ Copyright 2020 Tencent AI Lab

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
141 changes: 139 additions & 2 deletions README.md
@@ -1,2 +1,139 @@
- # pika
- a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi
--------------------------------------------------------------------------------

# PIKA: a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi #
PIKA is a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi. The first release focuses on end-to-end speech recognition. We use [Pytorch](https://pytorch.org) as the deep learning engine and [Kaldi](https://github.com/kaldi-asr/kaldi) for data formatting and feature extraction.

## Key Features ##

- On-the-fly data augmentation and feature extraction loader

- TDNN-Transformer encoder and convolution/transformer-based decoder model structures

- RNNT training and batch decoding

- RNNT decoding with external n-gram FSTs (on-the-fly rescoring, a.k.a. shallow fusion)

- RNNT Minimum Bayes Risk (MBR) training

- LAS forward and backward rescorer for RNNT

- Efficient BMUF (blockwise model-update filtering) based distributed training

## Installation and Dependencies ##

In general, we recommend [Anaconda](https://www.anaconda.com/) since it comes with most dependencies. Other major dependencies include:

### Pytorch ###

Please go to <https://pytorch.org/> for PyTorch installation. The code and scripts should run against PyTorch 0.4.0 and above, but we recommend 1.0.0 or above for compatibility with the RNNT loss module (see below).
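
As one illustrative route (a sketch only; the exact command depends on your OS and CUDA version, so use the selector on the PyTorch site):

```bash
# Example conda install for a CUDA 10.0 machine (illustrative; adjust to your setup).
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
```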

### Pykaldi and Kaldi ###

We use Kaldi (<https://github.com/kaldi-asr/kaldi>) and PyKaldi (a Python wrapper for Kaldi) for data processing, feature extraction, and FST manipulation. Please go to the PyKaldi website <https://github.com/pykaldi/pykaldi> for installation, and make sure to build PyKaldi with ninja for efficiency. After following the PyKaldi installation process, you will have both the Kaldi and PyKaldi dependencies ready.
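
A rough sketch of a from-source build, following the upstream PyKaldi instructions at the time of writing (script names may change, so defer to the PyKaldi README if they differ):

```bash
# From-source PyKaldi build sketch; the helper scripts live in pykaldi/tools.
git clone https://github.com/pykaldi/pykaldi.git
cd pykaldi/tools
./check_dependencies.sh   # verify system packages (ninja among them)
./install_protobuf.sh
./install_clif.sh
./install_kaldi.sh        # builds the Kaldi checkout that PyKaldi wraps
cd ..
python setup.py install
```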

### CUDA-Warp RNN-Transducer ###

For the RNNT loss module, we adopt the PyTorch binding at <https://github.com/1ytic/warp-rnnt>.
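
It can typically be installed from PyPI; note that the package builds a CUDA extension, so a CUDA-enabled PyTorch and the CUDA toolkit need to be in place first:

```bash
# Builds a CUDA extension at install time; requires nvcc and a CUDA-enabled PyTorch.
pip install warp_rnnt
```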

### Others ###

Check requirements.txt for other dependencies.
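
These can be installed in the usual way:

```bash
pip install -r requirements.txt
```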

## Get Started ##

To get started, check all the training and decoding scripts located in the egs directory.

### I. Data preparation and RNNT training ###

egs/train_transducer_bmuf_otfaug.sh contains data preparation and RNNT training. One needs to prepare the training data and specify the training data directory:

```bash
#training data dir must contain wav.scp and label.txt files
#wav.scp: standard Kaldi wav.scp file, see https://kaldi-asr.org/doc/data_prep.html
#label.txt: label text file; each line is: uttid sequence-of-integers, where each
#           integer is a one-based label index (zero is reserved for blank),
#           e.g., utt_id_1 3 5 7 10 23
train_data_dir=
```
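
For concreteness, a toy example of the two files (utterance IDs and paths are illustrative):

```
# wav.scp: <utt-id> <wav-path-or-pipe>, one utterance per line
utt_id_1 /data/audio/utt_id_1.wav
utt_id_2 /data/audio/utt_id_2.wav

# label.txt: <utt-id> followed by one-based label indices (zero is reserved for blank)
utt_id_1 3 5 7 10 23
utt_id_2 4 8 2
```

With train_data_dir pointing at such a directory, training is launched by running the script, e.g., `bash egs/train_transducer_bmuf_otfaug.sh`.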

### II. Continue with MBR training ###

With an RNNT-trained model, one can continue with MBR training using egs/train_transducer_mbr_bmuf_otfaug.sh (assuming the same training data is used, so data preparation is omitted). Make sure to specify the initial model:

```bash
--verbose \
--optim sgd \
--init_model $exp_dir/init.model \
--rnnt_scale 1.0 \
--sm_scale 0.8 \
```

### III. Training LAS forward and backward rescorer ###

One can train forward and backward LAS rescorers for the RNN-T model using egs/train_las_rescorer_bmuf_otfaug.sh. The LAS rescorer shares the encoder with the RNNT model and adds an extra two-layer LSTM as an additional encoder; make sure to specify the encoder sharing as:

```bash
--num_batches_per_epoch 526264 \
--shared_encoder_model $exp_dir/final.model \
--num_epochs 5 \
```

We support bi-directional LAS rescoring, i.e., forward and backward rescoring. Backward (right-to-left) rescoring is achieved by reversing the label sequences during LAS model training. One can train a backward LAS rescorer simply by specifying:
```bash
--reverse_labels
```

### IV. Decoding ###

egs/eval_transducer.sh is the main evaluation script and contains the whole decoding pipeline. Forward and backward LAS rescoring can be enabled by specifying the two rescorer models:

```bash
##########configs#############
#RNN transducer model
rnnt_model=
#forward and backward LAS rescorer models
lasrescorer_fw=
lasrescorer_bw=
```
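
As an illustrative filled-in config (the paths are hypothetical; substitute the models produced in steps I-III), after which the script can be run with `bash egs/eval_transducer.sh`:

```bash
# Hypothetical model paths; use your own experiment outputs.
rnnt_model=$exp_dir/final.model
lasrescorer_fw=$exp_dir/las_fw/final.model
lasrescorer_bw=$exp_dir/las_bw/final.model
```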

## Caveats ##

All the training and decoding hyper-parameters were adopted based on large-scale (e.g., 60k-hour) training and internal evaluation data. One might need to re-tune hyper-parameters to achieve optimal performance. Also, the WER (CER) scoring script is based on a Mandarin task, so we recommend that those working on other languages rewrite the scoring scripts.

## References ##


[1] [Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition](https://www.isca-speech.org/archive/Interspeech_2018/abstracts/1030.html), Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su, Dong Yu, InterSpeech 2018

[2] [Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition](https://www.isca-speech.org/archive/Interspeech_2020/abstracts/1221.html), Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu, InterSpeech 2020


## Citations ##

```
@inproceedings{Weng2020,
  author={Chao Weng and Chengzhu Yu and Jia Cui and Chunlei Zhang and Dong Yu},
  title={{Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={966--970},
  doi={10.21437/Interspeech.2020-1221},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1221}
}
@inproceedings{Weng2018,
  author={Chao Weng and Jia Cui and Guangsen Wang and Jun Wang and Chengzhu Yu and Dan Su and Dong Yu},
  title={{Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={761--765},
  doi={10.21437/Interspeech.2018-1030},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1030}
}
```

## Disclaimer ##

This is not an officially supported Tencent product.
