[WIP] Pruned-transducer-stateless5-for-WenetSpeech (offline and streaming) (facebookresearch#447)

* pruned-rnnt5-for-wenetspeech

* style check

* style check

* add streaming conformer

* add streaming decode

* changes codes for fast_beam_search and export cpu jit

* add modified-beam-search for streaming decoding

* add modified-beam-search for streaming decoding

* change for streaming_beam_search.py

* add README.md and RESULTS.md

* change for style_check.yml

* do some changes

* do some changes for export.py

* add some decode commands for usage

* add streaming results on README.md
luomingshuang committed Jul 28, 2022
1 parent 385645d commit f26b62a
Showing 22 changed files with 5,311 additions and 9 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/style_check.yml
@@ -29,7 +29,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-18.04, macos-10.15]
os: [ubuntu-18.04, macos-latest]
python-version: [3.7, 3.9]
fail-fast: false

15 changes: 12 additions & 3 deletions README.md
@@ -250,17 +250,25 @@ We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless mod

### WenetSpeech

We provide one model for this recipe: [Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss][WenetSpeech_pruned_transducer_stateless2].
We provide two models for this recipe: [Pruned stateless RNN-T_2: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss][WenetSpeech_pruned_transducer_stateless2] and [Pruned stateless RNN-T_5: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss][WenetSpeech_pruned_transducer_stateless5].

#### Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with L subset)
#### Pruned stateless RNN-T_2: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with L subset, offline ASR)

| | Dev | Test-Net | Test-Meeting |
|----------------------|-------|----------|--------------|
| greedy search | 7.80 | 8.75 | 13.49 |
| fast beam search | 7.94 | 8.74 | 13.80 |
| modified beam search | 7.76 | 8.71 | 13.41 |

We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EV4e1CHa1GZgEF-bZgizqI9RyFFehIiN?usp=sharing)
#### Pruned stateless RNN-T_5: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with L subset)
**Streaming**:

| | Dev | Test-Net | Test-Meeting |
|----------------------|-------|----------|--------------|
| greedy_search | 8.78 | 10.12 | 16.16 |
| modified_beam_search | 8.53 | 9.95 | 15.81 |
| fast_beam_search | 9.01 | 10.47 | 16.28 |

We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless2 model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EV4e1CHa1GZgEF-bZgizqI9RyFFehIiN?usp=sharing)

### Alimeeting

@@ -333,6 +341,7 @@ Please see: [![Open In Colab](https://colab.research.google.com/assets/colab-bad
[GigaSpeech_pruned_transducer_stateless2]: egs/gigaspeech/ASR/pruned_transducer_stateless2
[Aidatatang_200zh_pruned_transducer_stateless2]: egs/aidatatang_200zh/ASR/pruned_transducer_stateless2
[WenetSpeech_pruned_transducer_stateless2]: egs/wenetspeech/ASR/pruned_transducer_stateless2
[WenetSpeech_pruned_transducer_stateless5]: egs/wenetspeech/ASR/pruned_transducer_stateless5
[Alimeeting_pruned_transducer_stateless2]: egs/alimeeting/ASR/pruned_transducer_stateless2
[Aishell4_pruned_transducer_stateless5]: egs/aishell4/ASR/pruned_transducer_stateless5
[TAL_CSASR_pruned_transducer_stateless5]: egs/tal_csasr/ASR/pruned_transducer_stateless5
1 change: 1 addition & 0 deletions egs/wenetspeech/ASR/README.md
@@ -13,6 +13,7 @@ The following table lists the differences among them.
| | Encoder | Decoder | Comment |
|---------------------------------------|---------------------|--------------------|-----------------------------|
| `pruned_transducer_stateless2` | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss |
| `pruned_transducer_stateless5` | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss |

The decoder in `transducer_stateless` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
78 changes: 75 additions & 3 deletions egs/wenetspeech/ASR/RESULTS.md
@@ -1,12 +1,84 @@
## Results

### WenetSpeech char-based training results (offline and streaming) (Pruned Transducer 5)

#### 2022-07-22

Using the code from this PR: https://github.com/k2-fsa/icefall/pull/447.

When training with the L subset, the CERs are

**Offline**:
|decoding-method| epoch | avg | use-averaged-model | DEV | TEST-NET | TEST-MEETING|
|-- | -- | -- | -- | -- | -- | --|
|greedy_search | 4 | 1 | True | 8.22 | 9.03 | 14.54|
|modified_beam_search | 4 | 1 | True | **8.17** | **9.04** | **14.44**|
|fast_beam_search | 4 | 1 | True | 8.29 | 9.00 | 14.93|
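
The tables in this file report character error rate (CER). As a rough illustration of the metric, and not the scoring code used by the recipe, CER can be computed as the character-level edit distance between reference and hypothesis, divided by the number of reference characters. The helper names `edit_distance` and `cer` below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference character
                    curr[j - 1] + 1,     # insert a hypothesis character
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars
```

Because the errors are pooled over the whole test set before dividing, long utterances weigh more than short ones in this sketch.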

The offline training command for reproducing these results is given below:
```
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
./pruned_transducer_stateless5/train.py \
--lang-dir data/lang_char \
--exp-dir pruned_transducer_stateless5/exp_L_offline \
--world-size 8 \
--num-epochs 15 \
--start-epoch 2 \
--max-duration 120 \
--valid-interval 3000 \
--model-warm-step 3000 \
--save-every-n 8000 \
--average-period 1000 \
--training-subset L
```
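
The `--average-period 1000` flag and the `use-averaged-model` column refer to a running average of the model parameters kept during training. The sketch below shows one way such an average can be updated every `average_period` batches so that it stays equal to the mean of all snapshots taken so far. The update rule, the function name `update_average`, and the use of plain Python floats instead of model tensors are assumptions made for illustration; see `train.py` for the recipe's actual behaviour.

```python
def update_average(avg, current, batch_idx, average_period):
    """Blend the current parameters into the running average.

    Assumes this is called once every `average_period` batches, with
    `batch_idx` a positive multiple of `average_period`.
    """
    w_new = average_period / batch_idx
    w_old = (batch_idx - average_period) / batch_idx
    return [w_old * a + w_new * c for a, c in zip(avg, current)]


# Toy run: a single scalar "parameter" that takes the values 1.0, 2.0, 3.0
# at batches 1000, 2000, 3000.
avg = [0.0]
for step, value in enumerate([1.0, 2.0, 3.0], start=1):
    avg = update_average(avg, [value], batch_idx=step * 1000, average_period=1000)
```

In the toy run the first update replaces the initial value entirely (the old weight is zero), and later updates give each snapshot equal weight.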

The tensorboard training log can be found at https://tensorboard.dev/experiment/SvnN2jfyTB2Hjqu22Z7ZoQ/#scalars .


A pre-trained offline model and decoding logs can be found at <https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless5_offline>

**Streaming**:
|decoding-method| epoch | avg | use-averaged-model | DEV | TEST-NET | TEST-MEETING|
|--|--|--|--|--|--|--|
| greedy_search | 7| 1| True | 8.78 | 10.12 | 16.16 |
| modified_beam_search | 7| 1| True| **8.53**| **9.95** | **15.81** |
| fast_beam_search | 7 | 1| True | 9.01 | 10.47 | 16.28 |

The streaming training command for reproducing these results is given below:
```
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
./pruned_transducer_stateless5/train.py \
--lang-dir data/lang_char \
--exp-dir pruned_transducer_stateless5/exp_L_streaming \
--world-size 8 \
--num-epochs 15 \
--start-epoch 1 \
--max-duration 140 \
--valid-interval 3000 \
--model-warm-step 3000 \
--save-every-n 8000 \
--average-period 1000 \
--training-subset L \
--dynamic-chunk-training True \
--causal-convolution True \
--short-chunk-size 25 \
--num-left-chunks 4
```
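
The streaming flags limit how far the encoder may look. With chunk-wise attention, a frame attends only to frames in its own chunk and to a bounded number of chunks on its left. The sketch below builds such a mask in plain Python. It is an illustration under assumptions, not the recipe's implementation: the function name `chunk_attention_mask` is hypothetical, chunk sizes are given in frames, and how chunk sizes are sampled during dynamic chunk training is not modelled.

```python
def chunk_attention_mask(num_frames, chunk_size, num_left_chunks):
    """Return mask[i][j] == True if frame i may attend to frame j.

    Frame i sees its own chunk plus up to `num_left_chunks` chunks to its
    left; a negative `num_left_chunks` means unlimited left context.
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        end = min((chunk_idx + 1) * chunk_size, num_frames)  # exclusive
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


# Six frames, chunks of two frames, one chunk of left context.
mask = chunk_attention_mask(num_frames=6, chunk_size=2, num_left_chunks=1)
```

No frame can see past the end of its own chunk, which is what bounds the latency; the left-context limit bounds memory and compute per chunk.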

The tensorboard training log can be found at https://tensorboard.dev/experiment/E2NXPVflSOKWepzJ1a1uDQ/#scalars .


A pre-trained streaming model and decoding logs can be found at <https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming>

### WenetSpeech char-based training results (Pruned Transducer 2)

#### 2022-05-19

Using the code from this PR: https://github.com/k2-fsa/icefall/pull/349.

When training with the L subset, the WERs are
When training with the L subset, the CERs are

| | dev | test-net | test-meeting | comment |
|------------------------------------|-------|----------|--------------|------------------------------------------|
@@ -72,7 +144,7 @@ avg=2
--max-states 8
```

When training with the M subset, the WERs are
When training with the M subset, the CERs are

| | dev | test-net | test-meeting | comment |
|------------------------------------|--------|-----------|---------------|-------------------------------------------|
@@ -81,7 +153,7 @@ When training with the M subset, the WERs are
| fast beam search (set as default) | 10.18 | 11.10 | 19.32 | --epoch 29, --avg 11, --max-duration 1500 |


When training with the S subset, the WERs are
When training with the S subset, the CERs are

| | dev | test-net | test-meeting | comment |
|------------------------------------|--------|-----------|---------------|-------------------------------------------|
2 changes: 0 additions & 2 deletions egs/wenetspeech/ASR/pruned_transducer_stateless2/train.py
@@ -348,7 +348,6 @@ def get_params() -> AttributeDict:
epochs.
- log_interval: Print training loss if batch_idx % log_interval is 0
- reset_interval: Reset statistics if batch_idx % reset_interval is 0
- valid_interval: Run validation if batch_idx % valid_interval is 0
- feature_dim: The model input dim. It has to match the one used
in computing features.
- subsampling_factor: The subsampling factor for the model.
@@ -376,7 +375,6 @@ def get_params() -> AttributeDict:
"decoder_dim": 512,
# parameters for joiner
"joiner_dim": 512,
# parameters for Noam
"env_info": get_env_info(),
}
)
Empty file.