<a href="https://colab.research.google.com/github/vectominist/MiniASR/blob/main/example/example_librispeech_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MiniASR Tutorial: LibriSpeech Training**
This is a tutorial for training an end-to-end automatic speech recognition model with the toolkit [MiniASR](https://github.com/vectominist/MiniASR).  
You can run this notebook on [Google Colab](https://colab.research.google.com/), but fully training an ASR model requires a Pro account, since convergence takes several hours.

## **Download Code & Install Dependencies**
Ref: [MiniASR](https://github.com/vectominist/MiniASR)

In [None]:
! git clone https://github.com/vectominist/MiniASR.git
%cd MiniASR

Cloning into 'MiniASR'...
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 161 (delta 71), reused 140 (delta 54), pack-reused 0
Receiving objects: 100% (161/161), 134.92 KiB | 8.99 MiB/s, done.
Resolving deltas: 100% (71/71), done.
/content/MiniASR


In [None]:
! pip3 install -e ./

Obtaining file:///content/MiniASR
Collecting sentencepiece>=0.1.96
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 8.0 MB/s
Collecting pytorch-lightning>=1.3.8
  Downloading pytorch_lightning-1.4.9-py3-none-any.whl (925 kB)
     |████████████████████████████████| 925 kB 42.2 MB/s
Collecting numba==0.48
  Downloading numba-0.48.0-1-cp37-cp37m-manylinux2014_x86_64.whl (3.5 MB)
     |████████████████████████████████| 3.5 MB 56.4 MB/s
Collecting edit_distance
  Downloading edit_distance-1.0.4-py3-none-any.whl (11 kB)
Collecting torchaudio>=0.7.0
  Downloading torchaudio-0.9.1-cp37-cp37m-manylinux1_x86_64.whl (1.9 MB)
     |████████████████████████████████| 1.9 MB 50.7 MB/s
Collecting llvmlite<0.32.0,>=0.31.0dev0
  Downloading llvmlite-0.31.0-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
     |████████████████████████████████| 20.2 MB 1.4 MB/s
Collecting fsspec[h

## **Download Data**
- training set: [Libri-light](https://github.com/facebookresearch/libri-light) fine-tuning set (10 hours, 0.6 GB)
- development set: [LibriSpeech](https://www.openslr.org/12) `dev-clean` set
- testing set: [LibriSpeech](https://www.openslr.org/12) `test-clean` set

In [None]:
! mkdir -p data
%cd data
! wget https://dl.fbaipublicfiles.com/librilight/data/librispeech_finetuning.tgz
! tar zxf librispeech_finetuning.tgz
! rm librispeech_finetuning.tgz

/content/MiniASR/data
--2021-10-01 08:36:17--  https://dl.fbaipublicfiles.com/librilight/data/librispeech_finetuning.tgz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 597601132 (570M) [application/gzip]
Saving to: ‘librispeech_finetuning.tgz’


2021-10-01 08:36:35 (32.7 MB/s) - ‘librispeech_finetuning.tgz’ saved [597601132/597601132]



In [None]:
! wget https://www.openslr.org/resources/12/dev-clean.tar.gz
! wget https://www.openslr.org/resources/12/test-clean.tar.gz
! tar zxf dev-clean.tar.gz
! tar zxf test-clean.tar.gz
! rm dev-clean.tar.gz
! rm test-clean.tar.gz
%cd ..

--2021-10-01 08:36:41--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘dev-clean.tar.gz’


2021-10-01 08:37:00 (17.3 MB/s) - ‘dev-clean.tar.gz’ saved [337926286/337926286]

--2021-10-01 08:37:00--  https://www.openslr.org/resources/12/test-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 346663984 (331M) [application/x-gzip]
Saving to: ‘test-clean.tar.gz’


2021-10-01 08:37:20 (17.3 MB/s) - ‘test-clean.tar.gz’ saved [346663984/346663984]

/content/MiniASR
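
After extraction, a quick sanity check is to count the audio files in each directory. The sketch below is not part of MiniASR: `count_audio_files` is a hypothetical helper, and the directory names are the ones passed to `miniasr-preprocess` in the next section. The logs further down in this notebook report 2703 files for `dev-clean` and 2620 for `test-clean`, so different counts would suggest an incomplete download.

```python
from pathlib import Path


def count_audio_files(root, pattern="*.flac"):
    """Count files matching `pattern` anywhere below `root` (0 if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob(pattern))


# Directory names taken from the preprocessing commands in this notebook.
for name in [
    "data/librispeech_finetuning",
    "data/LibriSpeech/dev-clean",
    "data/LibriSpeech/test-clean",
]:
    print(name, count_audio_files(name))
```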


## **Preprocess Data**
Collect the audio files of each subset, sort them by length, and generate the vocabulary from the 1-hour training subset. Characters are used as text tokens because the dataset is small.

In [None]:
# Train set
! miniasr-preprocess \
        -c LibriSpeech \
        -p data/librispeech_finetuning \
        -s 1h \
        -o data/libri_train_1h \
        --gen-vocab \
        --char-vocab-size 40

! miniasr-preprocess \
        -c LibriSpeech \
        -p data/librispeech_finetuning \
        -s 9h \
        -o data/libri_train_9h

# Development set
! miniasr-preprocess \
        -c LibriSpeech \
        -p data/LibriSpeech \
        -s dev-clean \
        -o data/libri_dev

# Test set
! miniasr-preprocess \
        -c LibriSpeech \
        -p data/LibriSpeech \
        -s test-clean \
        -o data/libri_test

10-01 08:37 run_preprocess.py.main(72) Preprocessing LibriSpeech corpus.
10-01 08:37 run_preprocess.py.main(73) Subsets = ['1h']
10-01 08:37 run_preprocess.py.main(76) Results will be saved to data/libri_train_1h
10-01 08:37 run_preprocess.py.main(79) Reading data from data/librispeech_finetuning
10-01 08:37 run_preprocess.py.main(85) Found 286 audio files.
10-01 08:37 run_preprocess.py.main(89) Saving unsorted data dict to data/libri_train_1h/data_dict.json
10-01 08:37 run_preprocess.py.main(94) Sorting data by audio file length.
10-01 08:37 run_preprocess.py.main(103) Saving sorted data list to data/libri_train_1h/data_list_sorted.json
10-01 08:37 run_preprocess.py.main(109) Generating LM file.
10-01 08:37 run_preprocess.py.main(117) Generating vocabularies.
10-01 08:37 run_preprocess.py.main(120) Generating characters.
Found 28 chars.
Selected 28 vocabularies.
Saving char vocabularies to data/libri_train_1h/vocab_char.txt
10-01 08:37 run_preprocess.py.main(72) Preprocessing LibriSpe
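
To illustrate what generating a character vocabulary involves, here is a minimal sketch. It is not MiniASR's implementation: `parse_transcript_line` and `build_char_vocab` are hypothetical helpers, the example lines are made up, and treating `--char-vocab-size` as an upper bound is an assumption based on the log above (28 characters found, 28 selected, with a limit of 40). It relies only on the LibriSpeech convention that each line of a `*.trans.txt` file holds an utterance ID followed by the transcript.

```python
from collections import Counter


def parse_transcript_line(line):
    """Split a LibriSpeech transcript line into (utterance_id, text)."""
    utt_id, _, text = line.strip().partition(" ")
    return utt_id, text


def build_char_vocab(texts, max_size=40):
    """Return up to `max_size` characters, most frequent first."""
    counts = Counter(ch for text in texts for ch in text)
    return [ch for ch, _ in counts.most_common(max_size)]


# Made-up transcript lines, for illustration only.
lines = [
    "utt-0001 HELLO WORLD",
    "utt-0002 DON'T GO",
]
texts = [parse_transcript_line(line)[1] for line in lines]
vocab = build_char_vocab(texts, max_size=40)
print(len(vocab), vocab)
```

Note that the space and the apostrophe are counted as ordinary characters here, which is the usual choice for character-level ASR targets.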

## **Training**
- To change the training hyperparameters, edit `MiniASR/egs/librispeech/config/ctc_train_example.yaml`.
- The results will be saved to `MiniASR/model/ctc_libri-10h_char`.

In [None]:
! mkdir -p model

In [None]:
! minasr-asr --config egs/librispeech/config/ctc_train_example.yaml

# Resume training with this command:
# ! minasr-asr --ckpt model/ctc_libri-10h_char/epoch=4-step=429.ckpt

10-01 12:12 run_asr.py.main(99) Training mode.
10-01 12:12 dataloader.py.create_dataloader(86) Creating text tokenizer of character level.
10-01 12:12 dataloader.py.create_dataloader(91) Generating datasets and dataloaders. (mode = train)
10-01 12:12 dataset.py.__init__(28) Loading data from ['data/libri_train_1h/data_list_sorted.json', 'data/libri_train_9h/data_list_sorted.json']
100% 2763/2763 [00:00<00:00, 12494.93it/s]
10-01 12:12 dataset.py.__init__(49) 2763 audio files found (mode = train)
10-01 12:12 dataset.py.__init__(28) Loading data from ['data/libri_dev/data_list_sorted.json']
100% 2703/2703 [00:00<00:00, 21381.83it/s]
10-01 12:12 dataset.py.__init__(49) 2703 audio files found (mode = dev)
10-01 12:12 asr_trainer.py.create_asr_trainer(24) Creating ASR model (type = ctc_asr).
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

 

## **Testing**
- Specify your checkpoint with `--ckpt`.
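
The checkpoint files in this notebook follow the pattern `epoch=<N>-step=<M>.ckpt`. If you are unsure which file to pass, the sketch below picks the one with the highest epoch and step. `latest_checkpoint` is a hypothetical helper, not a MiniASR function, and the most recent checkpoint is not necessarily the one with the best validation score.

```python
import re
from pathlib import Path

# File name pattern as seen in this notebook, e.g. "epoch=44-step=3869.ckpt".
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def latest_checkpoint(model_dir):
    """Return the checkpoint path with the largest (epoch, step), or None."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return None
    found = []
    for path in model_dir.glob("*.ckpt"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            # Compare numerically so that epoch=10 ranks above epoch=9.
            found.append((int(match.group(1)), int(match.group(2)), path))
    if not found:
        return None
    return max(found, key=lambda item: item[:2])[2]


print(latest_checkpoint("model/ctc_libri-10h_char"))
```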

In [None]:
! minasr-asr \
    --config egs/librispeech/config/ctc_test_example.yaml \
    --test \
    --override "args.data.dev_paths=['data/libri_test/data_list_sorted.json']" \
    --ckpt model/ctc_libri-10h_char/epoch=44-step=3869.ckpt

10-01 13:28 basic_setups.py.override(89) Override: args.data.dev_paths = ['data/libri_test/data_list_sorted.json']
10-01 13:28 run_asr.py.main(105) Testing mode.
10-01 13:28 dataloader.py.create_dataloader(91) Generating datasets and dataloaders. (mode = dev)
10-01 13:28 dataset.py.__init__(28) Loading data from ['data/libri_test/data_list_sorted.json']
100% 2620/2620 [00:00<00:00, 21892.29it/s]
10-01 13:28 dataset.py.__init__(49) 2620 audio files found (mode = dev)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 100% 164/164 [00:39<00:00,  8.80it/s]

Character errors
| #Snt     #Tok     | Sub    Del    Ins    Err    SErr   |
| 2620     281530   | 16.7   21.1   1.5    39.2   100.0  |

Word errors
| #Snt     #Tok     | Sub    Del    Ins    Err    SErr   |
| 2620     52576    | 74.7   7.2    4.4    86.4   100.0  |

RTF:     0.0013
Latency: 9.3666 [ms/sentence]

----------------
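
RTF (real-time factor) is conventionally the processing time divided by the duration of the processed audio, so values far below 1 mean that decoding runs much faster than real time. The sketch below only illustrates that definition. How MiniASR measures these timings is not shown in this notebook, and the function names and numbers in the example are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def latency_ms_per_sentence(processing_seconds, num_sentences):
    """Average processing time per sentence, in milliseconds."""
    if num_sentences <= 0:
        raise ValueError("num_sentences must be positive")
    return 1000.0 * processing_seconds / num_sentences


# Hypothetical numbers: 2 s spent decoding 100 sentences totalling 800 s of audio.
print(real_time_factor(2.0, 800.0))
print(latency_ms_per_sentence(2.0, 100))
```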

## **Inference**

In [None]:
from miniasr.utils import load_from_checkpoint, sequence_distance
from miniasr.data.audio import load_waveform

model, args, tokenizer = load_from_checkpoint(
    'model/ctc_libri-10h_char/epoch=44-step=3869.ckpt', 'cuda')
waves = [load_waveform('data/LibriSpeech/dev-clean/6345/93302/6345-93302-0025.flac').to('cuda')]
hyps = model.recognize(waves)

In [None]:
print(hyps[0])
ref = 'ARE YOU REALLY GOING TO THROW ME OVER FOR A THING LIKE THIS'
res_cer = sequence_distance(ref, hyps[0], mode='char')
res_wer = sequence_distance(ref, hyps[0], mode='word')
print('CER = {:.2f}%'.format(100. * res_cer['distance'] / res_cer['length']))
print('WER = {:.2f}%'.format(100. * res_wer['distance'] / res_wer['length']))

Y WILY O T OE ME R THING MY FES
CER = 59.32%
WER = 84.62%
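
The percentages above are edit distances normalized by the reference length: the reference has 59 characters and 13 words, so 59.32% and 84.62% are consistent with 35 character edits and 11 word edits. `sequence_distance` itself is not shown in this notebook. The sketch below is a plain Levenshtein implementation of the same idea, with hypothetical function names, and it may differ from MiniASR in details such as text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, mode="char"):
    """Edit distance divided by the reference length, per character or per word."""
    if mode == "word":
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
    else:
        ref_tokens, hyp_tokens = list(ref), list(hyp)
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Small made-up example: one substituted word out of three.
print("WER = {:.2f}%".format(100.0 * error_rate("A B C", "A X C", mode="word")))
```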
