Merged
Changes from all commits
129 commits
b6592ca
example pipeline, initial commit.
vincentqb May 12, 2020
3e2f24a
removing notebook conversion artifacts.
vincentqb May 12, 2020
ff90ee9
remove extra comments. lint.
vincentqb May 12, 2020
30efb4a
addressing some feedback.
vincentqb May 13, 2020
a55b7cd
main function.
vincentqb May 13, 2020
b955815
defining args in function.
vincentqb May 13, 2020
7565d61
refactor.
vincentqb May 13, 2020
8fef171
lint.
vincentqb May 13, 2020
b27ec81
checkpoint.
vincentqb May 14, 2020
4c5e4de
clean version to start with.
vincentqb May 19, 2020
f246747
adding more parameters.
vincentqb May 20, 2020
26b327a
lint.
vincentqb May 20, 2020
8a768d2
cleaning full version.
vincentqb May 20, 2020
e0b1359
check for not None.
vincentqb May 20, 2020
be34e16
cleaning.
vincentqb May 21, 2020
bd7c9c9
back -l 160
vincentqb May 21, 2020
91528c9
black.
vincentqb May 21, 2020
3f28b75
fix runtime error.
vincentqb May 21, 2020
38d0cae
removing some print statements.
vincentqb May 21, 2020
797f0f9
add help to command line. add progress bar option.
vincentqb May 21, 2020
79b5daf
grouping librispeech-specific transform in subclass.
vincentqb May 21, 2020
401df9f
typo.
vincentqb May 21, 2020
78fd8f7
fix concatenation.
vincentqb May 21, 2020
516bdc8
typo.
vincentqb May 21, 2020
0b65e43
black. tqdm.
vincentqb May 21, 2020
c335fe4
missing transpose.
vincentqb May 22, 2020
bbc2a03
renaming variables.
vincentqb May 22, 2020
9ed1cce
sum cer and wer
vincentqb May 28, 2020
552a1a9
clip norm.
vincentqb May 28, 2020
c8cc7d7
second signal handler removed.
vincentqb May 28, 2020
cdb8f8e
cosmetic.
vincentqb May 28, 2020
2cd49b1
default to no checkpoint.
vincentqb May 28, 2020
db943fc
remove non_blocking.
vincentqb May 28, 2020
8ecdef1
adadelta works better than sgd.
vincentqb May 28, 2020
6b6cccb
anomaly detection.
vincentqb May 28, 2020
0e250b3
moving dataset to separate file.
vincentqb Jun 5, 2020
f791ec5
lint.
vincentqb Jun 5, 2020
2fb5097
move to separate module: languagemodel, decoder, metric.
vincentqb Jun 5, 2020
9ca6f1d
flush=True.
vincentqb Jun 5, 2020
f91f77f
renaming decoder.
vincentqb Jun 5, 2020
5b3ef99
CTC Decoders.
vincentqb Jun 5, 2020
620e65d
flush=True.
vincentqb Jun 5, 2020
8887c86
pass length for viterbi decoder.
vincentqb Jun 5, 2020
c53301d
progress bar. relative path.
vincentqb Jun 5, 2020
a0c144e
generalize transition matrix to n-gram. progress bar.
vincentqb Jun 8, 2020
50fc186
choice of decoder.
vincentqb Jun 8, 2020
5e6a44a
collate func.
vincentqb Jun 10, 2020
4c6d87b
remove signal handling.
vincentqb Jun 10, 2020
bbede94
adding distributed.
vincentqb Jun 10, 2020
6a0f12f
lint.
vincentqb Jun 11, 2020
afc9d32
normalize w/r to length of dataset, and w/r to total number characters.
vincentqb Jun 11, 2020
9dc45ca
relative cer/wer.
vincentqb Jun 12, 2020
0bfb559
clip grad parameter. momentum back but not yet used.
vincentqb Jun 24, 2020
28c905a
Switch to SGD.
vincentqb Jun 24, 2020
f99eef9
choice of optimizer.
vincentqb Jun 25, 2020
9431d55
scheduler.
vincentqb Jun 26, 2020
91e71c1
move to utils file.
vincentqb Jun 29, 2020
9472c22
metric log, and utils file.
vincentqb Jun 29, 2020
5d77b88
rename metric_logger.
vincentqb Jun 30, 2020
7529009
stderr and stdout. simpler metric logger.
vincentqb Jun 30, 2020
25cb8f3
replace by logging.
vincentqb Jun 30, 2020
660082c
adding time measurement in metric logger.
vincentqb Jul 1, 2020
dd03e37
fix duplicate name. remove tqdm. keep track of epoch instead and iter…
vincentqb Jul 1, 2020
358236a
rename main file. and add readme.
vincentqb Jul 1, 2020
490c222
refactor distributed.
vincentqb Jul 1, 2020
17a5999
swap example and output in readme.
vincentqb Jul 1, 2020
a188200
remove time from logger.
vincentqb Jul 2, 2020
bd5d4d9
check non-empty tensor input.
vincentqb Jul 2, 2020
d1183dc
typo in variable name and log update.
vincentqb Jul 2, 2020
26de948
typo.
vincentqb Jul 7, 2020
7d40304
compute cer/wer in training too.
vincentqb Jul 7, 2020
214ed96
typo.
vincentqb Jul 13, 2020
26fc391
add back slurm signal capture to resubmit job.
vincentqb Jul 13, 2020
8b3e156
update levinstein distance.
vincentqb Jul 14, 2020
16765be
adding tests for levenstein distance.
vincentqb Jul 14, 2020
61b61d8
record error rate during iteration.
vincentqb Jul 14, 2020
243f9c2
metric logger using setitem.
vincentqb Jul 15, 2020
4e34958
moving signal break to end of loop and return loss so far.
vincentqb Jul 15, 2020
84a15a3
typo.
vincentqb Jul 15, 2020
efb74f1
add citation.
vincentqb Jul 17, 2020
dbded0d
change default to best run.
vincentqb Jul 23, 2020
fb8324d
adding other experiment with decoders.
vincentqb Jul 23, 2020
5063d68
remove other decoders than greedy.
vincentqb Jul 23, 2020
61b7afc
Revert "remove other decoders than greedy."
vincentqb Jul 23, 2020
bc95fb5
changing name of folfder.
vincentqb Jul 23, 2020
d8ee1e9
remove other decoders, and unused dataset class.
vincentqb Jul 23, 2020
0503f65
rename functions to align with other pipeline.
vincentqb Jul 23, 2020
cef6c50
pick which parts to train with.
vincentqb Jul 24, 2020
0a90df5
adding specaugment to validation. note that caching prevents randomiz…
vincentqb Jul 24, 2020
1563288
updating readme.
vincentqb Jul 27, 2020
463a25c
typo in metric logging.
vincentqb Jul 27, 2020
18a18e6
Revert "typo in metric logging."
vincentqb Jul 27, 2020
c4545d2
Revert "Revert "typo in metric logging.""
vincentqb Jul 27, 2020
8e2d1f7
update metric logger.
vincentqb Jul 27, 2020
523e0e1
simplify metric logger implementation.
vincentqb Jul 27, 2020
7efc028
use json dumps instead.
vincentqb Jul 27, 2020
7780b26
group metric together.
vincentqb Jul 28, 2020
0006d89
move function.
vincentqb Jul 28, 2020
68d0ac1
lint.
vincentqb Jul 28, 2020
b087ff5
quick summary of files in folder.
vincentqb Jul 28, 2020
f5bcead
pass clip_grad explictly.
vincentqb Jul 28, 2020
91a06a6
typo in default dataset name.
vincentqb Jul 29, 2020
2167f27
option to disable logger.
vincentqb Jul 29, 2020
10ef47c
ergonomics for distributed.
vincentqb Jul 29, 2020
6e6b2ea
reminder about signal handler.
vincentqb Jul 29, 2020
8d6e27d
minor refactor of main in main.
vincentqb Jul 29, 2020
6f5f7cd
replace by not_main_rank.
vincentqb Jul 29, 2020
d7ebdb3
raising error if parameter not supported.
vincentqb Jul 29, 2020
b67ba51
move model before invoking DDP.
vincentqb Jul 29, 2020
ecd8d73
changing log level. using python 2 style string for logging.
vincentqb Jul 29, 2020
af2eb0c
dynamic augmentations.
vincentqb Jul 31, 2020
25524eb
update metric log.
vincentqb Jul 31, 2020
0143803
save learning rate even if function not available.
vincentqb Aug 1, 2020
406d2a3
add type option to model.
vincentqb Aug 4, 2020
3716e9d
add adamw.
vincentqb Aug 4, 2020
f30f713
reduce lr on validation step or training step.
vincentqb Aug 4, 2020
061dd40
specify hop-length and win-length.
vincentqb Aug 4, 2020
8d49a70
normalize option.
vincentqb Aug 4, 2020
4ea2596
rename parameter.
vincentqb Aug 6, 2020
340df0a
add dropout and tweak to number of channels.
vincentqb Aug 6, 2020
f5b1b1b
copy model in pipeline folder for experimentation.
vincentqb Aug 6, 2020
e5f733d
fix scheduler stepping.
vincentqb Aug 6, 2020
fe75249
fix input_type and num_features.
vincentqb Aug 7, 2020
4d2119a
waveform mode changes shape more.
vincentqb Aug 7, 2020
e63a616
adding best character error rate with current implementation of model…
vincentqb Aug 19, 2020
4a9381f
comment update.
vincentqb Aug 19, 2020
4795a72
remove signal. remove custom wav2letter model.
vincentqb Aug 19, 2020
cc4db15
remove comment.
vincentqb Aug 19, 2020
a2b6ad2
simpler import with pandas.
vincentqb Aug 19, 2020
45 changes: 45 additions & 0 deletions examples/pipeline_wav2letter/README.md
@@ -0,0 +1,45 @@
This is an example pipeline for speech recognition using a greedy or Viterbi CTC decoder, along with the Wav2Letter model trained on LibriSpeech; see [Wav2Letter: an End-to-End ConvNet-based Speech Recognition System](https://arxiv.org/pdf/1609.03193.pdf). Wav2Letter and LibriSpeech are available in torchaudio.

### Usage

More information about each command-line parameter is available with the `--help` option. An example invocation is shown below.
```
python main.py \
--reduce-lr-valid \
--dataset-train train-clean-100 train-clean-360 train-other-500 \
--dataset-valid dev-clean \
--batch-size 128 \
--learning-rate .6 \
--momentum .8 \
--weight-decay .00001 \
--clip-grad 0. \
--gamma .99 \
--hop-length 160 \
--n-hidden-channels 2000 \
--win-length 400 \
--n-bins 13 \
--normalize \
--optimizer adadelta \
--scheduler reduceonplateau \
--epochs 30
```
With these default parameters, we get a character error rate of 13.8% on dev-clean after 30 epochs.

### Output

The information reported at each iteration and epoch (e.g. loss, character error rate, word error rate) is printed to standard output in the form of one JSON object per line, e.g.
Contributor
I would write the log to a separate file alongside the saved model, otherwise users have to redirect all the time, which is not very convenient.

Contributor Author
i must admit i do like the standard output a lot -- but i can see users preferring writing to a file, so i'll add the option to choose :)

Contributor
Isn't it json?

Contributor Author
oops :) thx for pointing this out

```python
{"name": "train", "epoch": 0, "cer over target length": 1.0, "cumulative cer": 23317.0, "total chars": 23317.0, "cer": 0.0, "cumulative cer over target length": 0.0, "wer over target length": 1.0, "cumulative wer": 4446.0, "total words": 4446.0, "wer": 0.0, "cumulative wer over target length": 0.0, "lr": 0.6, "batch size": 128, "n_channel": 13, "n_time": 2453, "dataset length": 128.0, "iteration": 1.0, "loss": 8.712121963500977, "cumulative loss": 8.712121963500977, "average loss": 8.712121963500977, "iteration time": 41.46276903152466, "epoch time": 41.46276903152466}
{"name": "train", "epoch": 0, "cer over target length": 1.0, "cumulative cer": 46005.0, "total chars": 46005.0, "cer": 0.0, "cumulative cer over target length": 0.0, "wer over target length": 1.0, "cumulative wer": 8762.0, "total words": 8762.0, "wer": 0.0, "cumulative wer over target length": 0.0, "lr": 0.6, "batch size": 128, "n_channel": 13, "n_time": 1703, "dataset length": 256.0, "iteration": 2.0, "loss": 8.918599128723145, "cumulative loss": 17.63072109222412, "average loss": 8.81536054611206, "iteration time": 1.2905676364898682, "epoch time": 42.753336668014526}
{"name": "train", "epoch": 0, "cer over target length": 1.0, "cumulative cer": 70030.0, "total chars": 70030.0, "cer": 0.0, "cumulative cer over target length": 0.0, "wer over target length": 1.0, "cumulative wer": 13348.0, "total words": 13348.0, "wer": 0.0, "cumulative wer over target length": 0.0, "lr": 0.6, "batch size": 128, "n_channel": 13, "n_time": 1713, "dataset length": 384.0, "iteration": 3.0, "loss": 8.550191879272461, "cumulative loss": 26.180912971496582, "average loss": 8.726970990498861, "iteration time": 1.2109291553497314, "epoch time": 43.96426582336426}
```
One way to import the output into Python with pandas is to save the standard output to a file and then use `pandas.read_json(filename, lines=True)`.
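For instance, a minimal sketch of loading such a log with pandas (assuming the standard output was redirected to a hypothetical file named `train.log`):

```python
import pandas as pd

# Assumes stdout was redirected, e.g. `python main.py ... > train.log`.
log = pd.read_json("train.log", lines=True)

# Each JSON line becomes a row; column names follow the keys shown above.
print(log[["epoch", "iteration", "average loss", "cer over target length"]].tail())
```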

## Structure of pipeline

* `main.py` -- the entry point
* `ctc_decoders.py` -- the greedy CTC decoder
* `datasets.py` -- functions to split and process LibriSpeech, and a collate factory function
* `languagemodels.py` -- a class to encode and decode strings
* `metrics.py` -- the Levenshtein edit distance
* `utils.py` -- functions to log metrics, save checkpoint, and count parameters
15 changes: 15 additions & 0 deletions examples/pipeline_wav2letter/ctc_decoders.py
@@ -0,0 +1,15 @@
from torch import topk


class GreedyDecoder:
Contributor
You could generalize this file to be called "decoders.py" and also fold in things such as compute_error_rates

Contributor
This class is stateless. Can it be a function?

Contributor
It could be functional corresponding to a transform, but really it's a step towards our beamsearch work

    def __call__(self, outputs):
        """Greedy Decoder. Returns highest probability of class labels for each timestep

        Args:
            outputs (torch.Tensor): shape (input length, batch size, number of classes (including blank))

        Returns:
            torch.Tensor: class labels per time step.
        """
        _, indices = topk(outputs, k=1, dim=-1)
        return indices[..., 0]
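A minimal usage sketch of the decoder; the tensor shape below is an illustrative assumption, and the import assumes the script is run from the pipeline folder.

```python
import torch

from ctc_decoders import GreedyDecoder

# Hypothetical emission tensor: 5 time steps, batch of 2, 4 classes (including blank).
outputs = torch.randn(5, 2, 4)

decoder = GreedyDecoder()
labels = decoder(outputs)
print(labels.shape)  # torch.Size([5, 2]): one class index per time step and batch entry
```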
113 changes: 113 additions & 0 deletions examples/pipeline_wav2letter/datasets.py
@@ -0,0 +1,113 @@
import torch
from torchaudio.datasets import LIBRISPEECH


class MapMemoryCache(torch.utils.data.Dataset):
    """
    Wrap a dataset so that, whenever a new item is returned, it is saved to memory.
    """

    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = [None] * len(dataset)

    def __getitem__(self, n):
Contributor
This can be simplified:

    if self._cache[n] is None:
        self._cache[n] = self.dataset[n]
    return self._cache[n]

        if self._cache[n] is not None:
            return self._cache[n]

        item = self.dataset[n]
        self._cache[n] = item

        return item

    def __len__(self):
        return len(self.dataset)
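As a rough illustration of the caching behaviour, here is a sketch with a toy dataset (an assumption standing in for LIBRISPEECH): the first access loads from the wrapped dataset, later accesses come from memory.

```python
import torch


class ToyDataset(torch.utils.data.Dataset):  # hypothetical stand-in dataset
    def __len__(self):
        return 3

    def __getitem__(self, n):
        print(f"loading item {n}")
        return torch.full((2,), float(n))


cached = MapMemoryCache(ToyDataset())
cached[1]  # prints "loading item 1" and stores the result
cached[1]  # served from the in-memory cache, nothing printed
```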


class Processed(torch.utils.data.Dataset):
    def __init__(self, dataset, transforms, encode):
        self.dataset = dataset
        self.transforms = transforms
        self.encode = encode

    def __getitem__(self, key):
        item = self.dataset[key]
        return self.process_datapoint(item)

    def __len__(self):
        return len(self.dataset)

    def process_datapoint(self, item):
Contributor
This operation is not generic and requires specific item type, and since it uses index slicing it is very difficult to understand what it does. Please add a docstring.

        # A LIBRISPEECH item is (waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id);
        # keep the waveform and the lowercased transcript.
        transformed = item[0]
        target = item[2].lower()

        transformed = self.transforms(transformed)
        transformed = transformed[0, ...].transpose(0, -1)

        target = self.encode(target)
        target = torch.tensor(target, dtype=torch.long, device=transformed.device)

        return transformed, target


def split_process_librispeech(
    datasets, transforms, language_model, root, folder_in_archive,
):
    def create(tags, cache=True):

        if isinstance(tags, str):
            tags = [tags]
        if isinstance(transforms, list):
            transform_list = transforms
        else:
            transform_list = [transforms]
Contributor
Since this is example code and the helper functions exist to keep the main code simpler, making the helpers more specific helps with maintainability. Instead of allowing multiple types, it's simpler to allow only one type and do the equivalent type conversion in client code.


        data = torch.utils.data.ConcatDataset(
            [
                Processed(
                    LIBRISPEECH(
                        root, tag, folder_in_archive=folder_in_archive, download=False,
                    ),
                    transform,
                    language_model.encode,
                )
                for tag, transform in zip(tags, transform_list)
            ]
        )

        data = MapMemoryCache(data)
        return data

    # For performance, we cache all datasets
    return tuple(create(dataset) for dataset in datasets)
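A rough sketch of how this helper might be called from the main script; the transform, label set, and paths below are assumptions for illustration, not the exact values used by `main.py`.

```python
import torchaudio

from languagemodels import LanguageModel

# Hypothetical label set: blank, space, then lowercase letters and apostrophe.
labels = "*" + " " + "abcdefghijklmnopqrstuvwxyz" + "'"
language_model = LanguageModel(labels, char_blank="*", char_space=" ")

# A single transform applied to both splits (13 coefficients, matching --n-bins 13).
transform = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=13, melkwargs={"n_fft": 400, "hop_length": 160}
)

train, valid = split_process_librispeech(
    ["train-clean-100", "dev-clean"],
    transform,
    language_model,
    root="/path/to/datasets",        # hypothetical location of an existing download
    folder_in_archive="LibriSpeech",
)
```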


def collate_factory(model_length_function, transforms=None):

    if transforms is None:
        transforms = torch.nn.Sequential()
Contributor
I'm not a fan of the declarative "nn.Sequential" approach and would write a custom function whose pointer I'd pass around, but I can see it being nice to aggregate transforms based on a sequence of decisions.


    def collate_fn(batch):

        # b[0] is the transformed input from the (input, target) pair produced by Processed.
        tensors = [transforms(b[0]) for b in batch if b]
Contributor
It is very difficult to understand what is being transformed here.

  1. for b in batch if b -- why is there a case where one item in a batch (denoted as b) can be an invalid sample?
  2. What does b[0] represent?

Contributor Author
@vincentqb vincentqb Sep 16, 2020

  1. if b is no longer needed, removed :)
  2. b[0] is the waveform from the processed data point tuple. Added a comment.


        tensors_lengths = torch.tensor(
            [model_length_function(t) for t in tensors],
            dtype=torch.long,
            device=tensors[0].device,
        )

        tensors = torch.nn.utils.rnn.pad_sequence(tensors, batch_first=True)
Contributor
A wrapped / generalized version of this could form a useful torchaudio function.

Contributor Author
pad_sequence requires transposes as it is, since in torchaudio it is the last dimension that we want to pad. I re-implemented pad_sequence for this use case.

        tensors = tensors.transpose(1, -1)

        targets = [b[1] for b in batch if b]
        target_lengths = torch.tensor(
            [target.shape[0] for target in targets],
            dtype=torch.long,
            device=tensors.device,
        )
        targets = torch.nn.utils.rnn.pad_sequence(targets, batch_first=True)

        return tensors, targets, tensors_lengths, target_lengths

    return collate_fn
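A minimal sketch of the collate function in isolation; the length function and the fake datapoints below are assumptions (in the pipeline the datapoints come from `Processed` and the length function from the model).

```python
import torch

# Hypothetical length function: the model keeps the time dimension unchanged.
collate_fn = collate_factory(lambda t: t.shape[0])

# Two fake processed datapoints: (time, channel) features and encoded integer targets.
batch = [
    (torch.randn(100, 13), torch.randint(0, 28, (20,))),
    (torch.randn(80, 13), torch.randint(0, 28, (15,))),
]

tensors, targets, tensors_lengths, target_lengths = collate_fn(batch)
print(tensors.shape)    # torch.Size([2, 13, 100]): padded, then time moved to the last dimension
print(targets.shape)    # torch.Size([2, 20]): targets padded to the longest transcript
print(tensors_lengths)  # tensor([100, 80])
print(target_lengths)   # tensor([20, 15])
```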
38 changes: 38 additions & 0 deletions examples/pipeline_wav2letter/languagemodels.py
@@ -0,0 +1,38 @@
import collections
import itertools


class LanguageModel:
    def __init__(self, labels, char_blank, char_space):

        self.char_space = char_space
        self.char_blank = char_blank

        labels = [l for l in labels]
Contributor
Cannot be labels = list(labels)? What is the expected type of the input labels?

Contributor Author
Good catch. Yes, it's just a string.

        self.length = len(labels)
Contributor
Having __len__ function and length property to return the same value is confusing. Should prefer __len__.

        enumerated = list(enumerate(labels))
        flipped = [(sub[1], sub[0]) for sub in enumerated]

        d1 = collections.OrderedDict(enumerated)
        d2 = collections.OrderedDict(flipped)
        self.mapping = {**d1, **d2}

    def encode(self, iterable):
        if isinstance(iterable, list):
            return [self.encode(i) for i in iterable]
        else:
Contributor
What if I pass an iterable that yields lists? What's the basecase type here? Maybe that's an easier case to branch on. Also as a very minor nit, I actually like using returns to avoid "else". So you could write

if isinstance(iterable, list):
    return [self.encode(i) for i in iterable]
return [self.mapping[i] + self.mapping[self.char_blank] for i in iterable]

            return [self.mapping[i] + self.mapping[self.char_blank] for i in iterable]

    def decode(self, tensor):
        if len(tensor) > 0 and isinstance(tensor[0], list):
            return [self.decode(t) for t in tensor]
        else:
            # not idempotent, since clean string
            x = (self.mapping[i] for i in tensor)
            x = "".join(i for i, _ in itertools.groupby(x))
            x = x.replace(self.char_blank, "")
            # x = x.strip()
            return x

    def __len__(self):
        return self.length
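A short sketch of the encode/decode round trip; the label string below is an assumption.

```python
# Hypothetical label set: blank, space, then lowercase letters.
labels = "*" + " " + "abcdefghijklmnopqrstuvwxyz"
lm = LanguageModel(labels, char_blank="*", char_space=" ")

encoded = lm.encode("hi there")  # list of integer indices
decoded = lm.decode(encoded)
print(decoded)  # "hi there"

# Note: decode collapses repeated adjacent symbols (CTC-style), so it is not a strict
# inverse of encode; decoding the encoding of "hello" yields "helo".
```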