This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Conversation

@wingsbr
Contributor

@wingsbr wingsbr commented Nov 14, 2017

Added a generator for the LibriSpeech datasets and included it in the supported generators in t2t-datagen.

Like the audio / TIMIT generator, this is dependent upon sox for WAV file generation.
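LibriSpeech audio ships as FLAC, so the sox dependency essentially boils down to a per-file conversion. A minimal sketch of that step (hypothetical helper name and paths, mirroring how the TIMIT generator shells out to sox):

import os
from subprocess import call

def _flac_to_wav(flac_path, wav_path):
  """Convert one LibriSpeech .flac file to .wav with the sox CLI (assumed on PATH)."""
  if not os.path.exists(wav_path):
    call(["sox", flac_path, wav_path])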

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again.

@googlebot

CLAs look good, thanks!

@mschonwe
Contributor

@vince62s -- I work with @wingsbr. I was collaborating with Archy de Berker via Gitter. It looks like a co-worker of his (Majid) ended up posting another LibriSpeech data generator (which I haven't looked at yet).

To your question about MFCC, I thought for the 'problem' it would be best to leave pre-processing/spectral featurization/MFCC up to the user, to allow for experimentation. We'll make sure to link up with Majid to avoid duplication if/when we code up the modality bottom to handle these transformations.

@vince62s
Contributor

Thanks for your feedback. As it is now (and it was similar for WSJ), I have the impression it is trying to take frames directly as inputs, which is really disturbing...

@mschonwe
Contributor

Agreed. It is an important TODO.

I haven't seen good results published from working on the raw waveform. However, there are many papers with good results starting from spectral features. But even here there is a lot of room for experimentation (e.g., mel scale, number of bins, window size, etc.).

I think going to MFCC is over-engineering the features, and it would be better to let the NN derive its own features from the spectrum. In any case, different researchers and different domains may warrant different approaches. Hence, I would argue against pushing the encoding into the TFRecords.

@mjlaali

mjlaali commented Nov 15, 2017

@mschonwe Sorry, I did not see your pull request before sending mine. I closed mine.

Regarding MFCC, I suggest putting the pre-processing in the problem.Problem class so that, as a dataset is generated, it is saved in the correct format. Something similar to the input space id.

@mschonwe
Contributor

@mjlaali No worries - let's chat on Gitter to coordinate incorporating the signal processing.
Thankfully the ops required are now part of TF1.4 (https://www.tensorflow.org/api_docs/python/tf/contrib/signal).
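For example, a log-magnitude spectrogram can be computed from a batch of waveforms with those ops roughly like this (frame and FFT sizes are illustrative placeholders, not values from this PR):

import tensorflow as tf

def log_magnitude_spectrogram(waveforms, frame_length=400, frame_step=160, fft_length=512):
  """waveforms: float32 Tensor of shape [batch, samples], e.g. 16 kHz audio.

  Returns [batch, frames, fft_length // 2 + 1] log-magnitude spectrograms.
  """
  stfts = tf.contrib.signal.stft(
      waveforms, frame_length=frame_length, frame_step=frame_step,
      fft_length=fft_length)
  return tf.log(tf.abs(stfts) + 1e-6)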


@zh794390558 zh794390558 left a comment


Why not register this Problem rather than putting it into _SUPPORTED_PROBLEM_GENERATORS?

@wingsbr
Contributor Author

wingsbr commented Nov 20, 2017

@zh794390558 Good idea. I expanded the librispeech generator to include a problem and modality and registered those rather than using _SUPPORTED_PROBLEM_GENERATORS.
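Roughly, registration replaces the _SUPPORTED_PROBLEM_GENERATORS entry with a decorated Problem subclass that t2t-datagen discovers by name (a sketch; the class name here is illustrative):

from tensor2tensor.data_generators import problem
from tensor2tensor.utils import registry

@registry.register_problem
class Librispeech(problem.Problem):
  """Discoverable via --problem=librispeech once registered (name derived from the class name)."""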

"http://www.openslr.org/resources/12/dev-other.tar.gz",
"dev-other"
],
]'''


Why is the code above commented out?

class LibrispeechTextEncoder(text_encoder.TextEncoder):

  def encode(self, s):
    return [ord[c] for c in s]


should include self._num_reserved_ids


ord[c] is wrong.


why not encode like self._num_reserved_ids + i ?

Contributor Author


Ah, good catch, I'll fix that syntax. Regarding num_reserved_ids, I based it on the TIMIT generator:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/audio.py#L150

which doesn't offset for num_reserved_ids, but you're right that it makes sense, so I will do so here.
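A minimal sketch of the fix under discussion (not the exact merged code), using ord(c) and shifting by the encoder's reserved ids, with decode mirroring the offset; text_encoder is the same import as in the diff above:

class LibrispeechTextEncoder(text_encoder.TextEncoder):

  def encode(self, s):
    # Shift character codes so ids 0..num_reserved_ids-1 stay free for PAD/EOS.
    return [self._num_reserved_ids + ord(c) for c in s]

  def decode(self, ids):
    return "".join([chr(i - self._num_reserved_ids) for i in ids])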

def example_reading_spec(self):
  data_fields = {
      "inputs": tf.VarLenFeature(tf.int64),
      # "audio/channel_count": tf.FixedLenFeature([], tf.int64),

@zh794390558 zh794390558 Nov 21, 2017


this can be reserved!

Contributor Author


I'm sorry, I don't understand. What are you suggesting?



def generator(self, data_dir, tmp_dir, training, eos_list=None, start_from=0, how_many=0):
  eos_list = [1]


not good.

Contributor Author


Good catch, I had meant to fix that but forgot. Doing so now.
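i.e., something along these lines, so a caller-supplied eos_list is no longer clobbered (sketch of the relevant lines only):

def generator(self, data_dir, tmp_dir, training,
              eos_list=None, start_from=0, how_many=0):
  # Only default to [1] (the EOS id) when the caller did not pass a list.
  eos_list = [1] if eos_list is None else eos_list
  # ... rest of the generator unchanged ...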

def hparams(self, defaults, unused_model_hparams):
  p = defaults
  p.stop_at_eos = int(False)
  p.input_modality = {"inputs": ("audio:librispeech_modality", None)}


registry.Modalities.AUDIO

Contributor Author


Wouldn't that result in the base Audio modality being used, and bypass the custom signal processing added to bottom()?
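For context, the string "audio:librispeech_modality" points at a registered custom modality; a rough sketch of that registration (the real bottom() would hold the signal processing discussed above):

from tensor2tensor.utils import modality
from tensor2tensor.utils import registry

@registry.register_audio_modality
class LibrispeechModality(modality.Modality):
  """Registered as "librispeech_modality" (name derived from the class name)."""

  def bottom(self, inputs):
    # Custom spectral feature extraction would go here; plain
    # registry.Modalities.AUDIO would fall back to the stock audio modality instead.
    return inputs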

return problem.SpaceID.EN_CHR

@property
def num_shards(self):


transformer_base_single_gpu, is this used?

Contributor Author


I'm not sure I understand the question, but I used transformer_base_single_gpu as the basis for the hparams because that was what was referenced in all of the examples:

https://github.com/tensorflow/tensor2tensor/blob/master/docs/new_problem.md
https://github.com/tensorflow/tensor2tensor/blob/master/docs/walkthrough.md
https://github.com/tensorflow/tensor2tensor/blob/master/README.md
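For illustration, an hparams set built on transformer_base_single_gpu would be registered roughly like this (the name and override here are hypothetical, not the PR's tuned values):

from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

@registry.register_hparams
def transformer_librispeech():
  """Hypothetical hparams set starting from transformer_base_single_gpu."""
  hparams = transformer.transformer_base_single_gpu()
  hparams.batch_size = 2048  # example override only
  return hparams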

Contributor

@lukaszkaiser lukaszkaiser left a comment


Great, thanks guys! Let's get it in and see how it trains :).

@lukaszkaiser lukaszkaiser merged commit 92983ea into tensorflow:master Nov 23, 2017
@zh794390558

@wingsbr which paper did you base your experiment on? Maybe we have something in common.

@mschonwe
Contributor

@zh794390558 most of our work has been on Listen, Attend and Spell with various enhancements. We were about to start working on implementing arxiv.org/pdf/1610.03022v1.pdf, which, in part, uses convolutions rather than the pBLSTM of LAS.

We wanted to give the 'Attention Is All You Need' Transformer model a try and utilize the framework extensions that t2t offers. So far we haven't gotten the Transformer model to do much more than learn the LM from the labels.

If the transformer model isn't viable for this task, perhaps we can collaborate implementing a t2t Problem for convolutional+rnn model (like 1610.03022v1).

@zh794390558

zh794390558 commented Dec 10, 2017

@mschonwe Sorry for the late response. I work on Mandarin and also use the Listen, Attend and Spell model. Now I want to try some tricks in T2T that have been used successfully in NMT, including 'Attention Is All You Need'. I think a convolutional+RNN model may be a good attempt.
