Added a librispeech data generator. #419
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here.

CLAs look good, thanks!
@vince62s -- I work with @wingsbr. I was collaborating with Archy de Berker via Gitter. It looks like a co-worker of his (Majid) ended up posting another LibriSpeech data generator (which I haven't looked at yet). To your question about MFCC, I thought it would be best for the problem to leave pre-processing/spectral featurization/MFCC up to the user, to allow for experimentation. We'll make sure to link up with Majid to avoid duplication if/when we code up the modality bottom to handle these transformations.

Thanks for your feedback. As it is now (and it was similar for WSJ), I have the impression it is trying to take frames directly as inputs, which is really disturbing...
Agreed. It is an important TODO. I haven't seen good results published from working with raw waveforms. However, there are many papers with good results starting from spectral features. Even here there is a lot of room for experimentation (e.g., mel-scale, number of bins, window size, etc.). I think going to MFCC is over-engineering the features; it would be better to let the NN derive its own features from the spectrum. In any case, different researchers and different domains may warrant different approaches. Hence, I would argue against pushing the encoding into the TFRecords.
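[Editor's note: for illustration only, here is a rough numpy sketch of the kind of spectral featurization being discussed, with window size and bin count exposed as the experimental knobs mentioned above. Nothing here is in the PR; the function name and defaults are hypothetical.]

    import numpy as np

    def log_spectrogram(waveform, window_size=400, hop=160):
      # 400/160 samples correspond to a 25 ms window and a 10 ms hop at
      # 16 kHz -- exactly the sort of parameters left open to experiment.
      window = np.hanning(window_size)
      frames = [waveform[i:i + window_size] * window
                for i in range(0, len(waveform) - window_size + 1, hop)]
      # Magnitude spectrum per frame; number of bins = window_size // 2 + 1.
      spectra = np.abs(np.fft.rfft(frames, axis=-1))
      return np.log(spectra + 1e-6)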
@mschonwe Sorry, I did not see your pull request before sending mine; I closed mine. Regarding MFCC, I suggest putting the pre-processing in the problem.Problem class so that, as a dataset is generated, it is saved in the correct format. Something similar to input_space_id.
@mjlaali No worries - let's chat on Gitter to coordinate incorporating the signal processing.
zh794390558 left a comment:
Why not register this Problem rather than putting it into _SUPPORTED_PROBLEM_GENERATORS?
@zh794390558 Good idea. I expanded the librispeech generator to include a problem and modality and registered those rather than using _SUPPORTED_PROBLEM_GENERATORS.
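[Editor's note: for reference, the t2t registration pattern being adopted looks roughly like this; the class name and body are illustrative, not the PR's actual code.]

    from tensor2tensor.data_generators import problem
    from tensor2tensor.utils import registry

    @registry.register_problem
    class AudioLibrispeechCharacters(problem.Problem):
      """Registered under a snake_cased version of the class name, so it is
      discoverable by name instead of living in _SUPPORTED_PROBLEM_GENERATORS."""

      def generator(self, data_dir, tmp_dir, is_training):
        # Yield {"inputs": ..., "targets": ...} examples here.
        raise NotImplementedError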
| "http://www.openslr.org/resources/12/dev-other.tar.gz", | ||
| "dev-other" | ||
| ], | ||
| ]''' |
Why is this code commented out?
    class LibrispeechTextEncoder(text_encoder.TextEncoder):

      def encode(self, s):
        return [ord[c] for c in s]
Should include self._num_reserved_ids.
ord[c] is wrong.
Why not encode like self._num_reserved_ids + i?
Ah, good catch, I'll fix that syntax. Regarding num_reserved_ids, I based it on the timit generator:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/audio.py#L150
which doesn't offset for num_reserved_ids, but you're right that it makes sense, so I will do so here.
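[Editor's note: put together, the corrected encoder would look roughly like the sketch below, including a matching decode; this is not necessarily the exact committed code.]

    class LibrispeechTextEncoder(text_encoder.TextEncoder):

      def encode(self, s):
        # ord(c), not ord[c]; offset by num_reserved_ids so character ids
        # don't collide with reserved ids such as PAD and EOS.
        return [self._num_reserved_ids + ord(c) for c in s]

      def decode(self, ids):
        # Invert the offset applied in encode().
        return "".join([chr(i - self._num_reserved_ids) for i in ids])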
    def example_reading_spec(self):
      data_fields = {
          "inputs": tf.VarLenFeature(tf.int64),
          #"audio/channel_count": tf.FixedLenFeature([], tf.int64),
this can be reserved!
I'm sorry, I don't understand. What are you suggesting?
    def generator(self, data_dir, tmp_dir, training, eos_list=None, start_from=0, how_many=0):
      eos_list = [1]
not good.
Good catch, I had meant to fix that but forgot. Doing so now.
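[Editor's note: the fix presumably amounts to honoring a caller-supplied eos_list and only defaulting when none is given, e.g.:]

    def generator(self, data_dir, tmp_dir, training,
                  eos_list=None, start_from=0, how_many=0):
      # Only fall back to EOS id 1 when the caller supplied no list.
      eos_list = [1] if eos_list is None else eos_list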
    def hparams(self, defaults, unused_model_hparams):
      p = defaults
      p.stop_at_eos = int(False)
      p.input_modality = {"inputs": ("audio:librispeech_modality", None)}
registry.Modalities.AUDIO
Wouldn't that result in the base Audio modality being used, and bypass the custom signal processing added to bottom()?
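[Editor's note: for context, registering a custom audio modality so that the "audio:librispeech_modality" string resolves to a class with its own bottom() looks roughly like this; the class body is an illustrative placeholder, not the PR's actual signal processing.]

    import tensorflow as tf

    from tensor2tensor.utils import modality
    from tensor2tensor.utils import registry

    @registry.register_audio_modality("librispeech_modality")
    class LibrispeechModality(modality.Modality):

      def bottom(self, x):
        # Custom signal processing would replace the base audio modality's
        # bottom() here; a dense projection stands in as a placeholder.
        with tf.variable_scope(self.name):
          return tf.layers.dense(tf.to_float(x),
                                 self._model_hparams.hidden_size)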
        return problem.SpaceID.EN_CHR

      @property
      def num_shards(self):
transformer_base_single_gpu, is this used?
I'm not sure I understand the question, but I used transformer_base_single_gpu as the basis for the hparams because that was what was referenced in all of the examples:
https://github.com/tensorflow/tensor2tensor/blob/master/docs/new_problem.md
https://github.com/tensorflow/tensor2tensor/blob/master/docs/walkthrough.md
https://github.com/tensorflow/tensor2tensor/blob/master/README.md
lukaszkaiser left a comment:
Great thanks guys! Let's get it in and see how it trains :).
@wingsbr which paper did you base your experiment on? Maybe we have something in common.
@zh794390558 most of our work has been on Listen, Attend and Spell with various enhancements. We were about to start working on implementing arxiv.org/pdf/1610.03022v1.pdf which, in part, uses convolutions rather than the pBLSTM of LAS. We wanted to give the 'all you need is attention' transformer model a try, and utilize the framework extensions that t2t offers. So far we haven't gotten the transformer model to do much more than learn the LM from the labels. If the transformer model isn't viable for this task, perhaps we can collaborate on implementing a t2t Problem for a convolutional+RNN model (like 1610.03022v1).
@mschonwe Sorry for the late response. I work on Mandarin and also use the Listen, Attend and Spell model, now ...
Added a generator for the LibriSpeech datasets and included it in the supported generators in t2t-datagen.
Like the audio/TIMIT generator, it depends on sox for WAV file generation.
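[Editor's note: for reference, the sox dependency amounts to a conversion step like the following, since LibriSpeech ships audio as .flac; the helper name is illustrative and sox with FLAC support must be on the PATH.]

    import os
    import subprocess

    def flac_to_wav(flac_path):
      # Convert a LibriSpeech .flac file to .wav via the sox CLI.
      wav_path = flac_path.replace(".flac", ".wav")
      if not os.path.exists(wav_path):
        subprocess.check_call(["sox", flac_path, wav_path])
      return wav_path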