[Community] Add fairseq FastSpeech2 #15773

Open · wants to merge 63 commits into main

Conversation

@jaketae (Contributor) commented Feb 22, 2022

What does this PR do?

This PR adds FastSpeech2, a transformer-based text-to-speech model. Fixes #15166.

Motivation

While HF transformers has great support for ASR and other audio processing tasks, there is not much support for text-to-speech. There are many transformer-based TTS models that would be great to have in the library, and FastSpeech2 is perhaps the best-known and most popular of them. With TTS support, we could potentially have really cool demos that go from ASR -> NLU -> NLG -> TTS to build full-fledged voice assistants!

Who can review?

Anton, Patrick (will tag when PR is ready)

@HuggingFaceDocBuilder commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@jaketae (Contributor, Author) commented Feb 24, 2022

Here is a (tried-to-keep-short-but-long) update on this PR, summarizing some of the things discussed with @anton-l for public visibility.

  1. Checkpoint loading: The model weights load properly and produce roughly the same output as the fairseq model. In preliminary tests, I've found that torch.allclose passes with atol=1e-6. On this branch, the model can be loaded via
    from transformers import AutoModel
    model = AutoModel.from_pretrained("jaketae/fastspeech2-lj")
  2. Grapheme-to-phoneme: Fairseq uses g2p_en. I've added this as a dependency via requires_backends in the tokenizer. One thing I'm not too sure about is whether to enforce this from the get-go in __init__, or only require it in the phonemize function (see the sketch after this list, which also covers the EOS handling from point 3).
  3. Tokenizer: The tokenizer can be loaded as shown. For some reason, AutoTokenizer doesn't work at the moment, but this can likely be fixed by reviewing import statements. One consideration is that the fairseq code seems to add an EOS token at the end of every sequence, whereas HF's tokenizers do not by default (correct me if I'm wrong). In my code, I manually add the EOS token at the end for consistency.
    from transformers.models.fastspeech2 import FastSpeech2Tokenizer
    tokenizer = FastSpeech2Tokenizer.from_pretrained("jaketae/fastspeech2-lj")
  4. Vocoder: (Note: this is not directly related to the source code in this PR.) Using Griffin-Lim for vocoding yields very poor audio quality and is likely a bad idea for demos. An alternative is to use neural vocoders like HiFi-GAN. I've added HiFi-GAN to the model hub, which can be imported via vocoder = AutoModel.from_pretrained("jaketae/hifigan-lj-v1", trust_remote_code=True). At the moment the model weights do not load properly for some reason; debugging is required.
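
To make points 2 and 3 above concrete, here is a minimal sketch of how the lazy g2p_en requirement and the manually appended EOS token could behave. The class name, the default eos_token value, and the lazy-import approach are illustrative assumptions, not the code in this PR:

    class PhonemizerSketch:
        """Hypothetical helper, not the PR's FastSpeech2Tokenizer."""

        def __init__(self, eos_token="</s>"):
            self.eos_token = eos_token
            self._g2p = None  # created lazily so importing the class never needs g2p_en

        def phonemize(self, text):
            if self._g2p is None:
                try:
                    # only require g2p_en once phonemization is actually requested
                    from g2p_en import G2p
                except ImportError as exc:
                    raise ImportError("g2p_en is required for phonemization: pip install g2p_en") from exc
                self._g2p = G2p()
            phonemes = [p for p in self._g2p(text) if p != " "]
            # fairseq appends an EOS token to every sequence, so mirror that here
            return phonemes + [self.eos_token]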

I have some questions for discussion.

  1. Model output: What should the output of the model be? At the moment, it returns a tuple of mel-spectrograms, the mel-spectrogram after the postnet, output lengths (number of spectrogram frames), duration predictor output, predicted pitch, and predicted energy. I assume we need to subclass BaseModelOutput for something like TTSModelOutput (a sketch follows this list), and also optionally return a loss when duration, pitch, and energy ground-truth labels are supplied to forward.
  2. Mean and standard deviation: Fairseq's FastSpeech2 model technically does not predict a mel-spectrogram, but a normalized mean spectrogram according to mu and sigma values extracted from the training set. In fairseq's TTS pipeline, FS2 output is denormalized by some mu and sigma before being vocoded. I've added the mu and sigma values as register_buffer to the model, and created set_mean and set_std functions so that users can set these values according to their own training set. By default, mean and variance are set to 0 and 1 so that no rescaling is performed. Is this good design?
  3. TTS pipeline: At the moment, there is only FastSpeech2Model, which outputs mel-spectrograms as indicated above. While people might want to use acoustic feature generators for mel-spectrogram synthesis, more often they will want to obtain raw audio in waveform. As mentioned, this requires a vocoder, and we need to discuss ways to integrate this. Some API ideas I have include:
    • Introducing a FastSpeech2ForAudioSynthesis class with an integrated vocoder
    • Adding a pipeline that supports full-fledged TTS via something like pipeline("text-to-speech", model="fastspeech2", vocoder="hifigan"). But this is again contingent on how we want to integrate the vocoder.
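
As a reference point for question 1, a hypothetical output class could look like the following. The class name and field names are illustrative only; ModelOutput is the existing transformers base class:

    from dataclasses import dataclass
    from typing import Optional

    import torch

    from transformers.utils import ModelOutput


    @dataclass
    class FastSpeech2ModelOutputSketch(ModelOutput):
        """Hypothetical output class for question 1; field names are illustrative."""

        loss: Optional[torch.FloatTensor] = None                  # only when duration/pitch/energy labels are supplied
        spectrogram: Optional[torch.FloatTensor] = None           # (batch, frames, n_mels), before postnet
        postnet_spectrogram: Optional[torch.FloatTensor] = None   # (batch, frames, n_mels), after postnet
        output_lengths: Optional[torch.LongTensor] = None         # number of valid frames per batch element
        duration_predictions: Optional[torch.FloatTensor] = None
        pitch_predictions: Optional[torch.FloatTensor] = None
        energy_predictions: Optional[torch.FloatTensor] = None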

To-do's on my end (will be updated as I find more).

  • Writing tests
  • Documentation
  • Cleaning up imports
  • Debugging why AutoTokenizer fails
  • Debugging why HiFi-GAN weights do not load

cc @anton-l @patrickvonplaten


    if args.mean:
        self.register_buffer("mean", torch.zeros(self.out_dim))

@jaketae (Contributor, Author) replied:

Mean and standard deviation are 0 and 1 by default so that even if the user forgets to use set_mean and set_std, the output is not affected.

Member replied:

Great catch!

@patrickvonplaten (Contributor) replied Apr 14, 2022:

Wouldn't it be cleaner to let mean be of type Optional[float] and set it to 0.0 by default (same with variance)? The if config.mean() check is a bit weird to me.

@jaketae (Contributor, Author) replied:

I changed the flag name to config.use_mean. I also added an informative warning in case the user sets use_mean to True, pointing them to the set_mean() function (likewise for variance).

Do you think it makes more sense for the user to specify the mean and variance explicitly in the config object? If so, one problem is that the mean and variance are 80-dimensional vectors, so I'm afraid that would pollute config.json too much. I might have misunderstood your intent, though; I'm definitely open to suggestions, as I also think there might be a cleaner way to achieve this!
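
For illustration, the buffer-based design being discussed might look roughly like this; the module, method, and buffer names are assumptions for the sketch, not the PR's final API:

    import torch
    from torch import nn


    class SpectrogramDenormalizerSketch(nn.Module):
        def __init__(self, n_mels=80):
            super().__init__()
            # defaults of 0 and 1 leave the output untouched if set_mean/set_std are never called
            self.register_buffer("mean", torch.zeros(n_mels))
            self.register_buffer("std", torch.ones(n_mels))

        def set_mean(self, mean: torch.Tensor):
            self.mean.copy_(mean)

        def set_std(self, std: torch.Tensor):
            self.std.copy_(std)

        def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
            # denormalize a (batch, frames, n_mels) spectrogram with per-bin statistics
            return spectrogram * self.std + self.mean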

@anton-l (Member) left a review:

Hey @jaketae, great work on the port, looks very clean so far!

For the next steps:

  1. Upload the conversion script to models/fastspeech2/ for future checkpoint ports :)
  2. Try converting other checkpoints, e.g. one of each model size and one for each dataset (LJSpeech, Common Voice EN), and write integration tests for them, like you did for the first test (a rough sketch of such a test follows this list). With this we'll make sure that there won't be any missing modules or config options for different model variants down the road, when we start mass-converting the leftover checkpoints.
  3. Remove unsupported task-specific tests from FastSpeech2ModelTester and FastSpeech2ModelTest, e.g. create_and_check_for_question_answering. Then see if FastSpeech2ModelTest() is all green.
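
For point 2, an integration test might be structured roughly as below; the helper name is hypothetical, and the expected values would have to be exported from the corresponding fairseq checkpoint:

    import torch


    def check_fastspeech2_integration(model, tokenizer, text, expected_slice, atol=1e-6):
        # expected_slice: reference spectrogram values exported from the fairseq model
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        spectrogram = outputs[0]  # mel-spectrogram before postnet, per the output tuple above
        assert torch.allclose(spectrogram[0, :3, :3], expected_slice, atol=atol)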


    def forward(self, x):
        # Input: B x T x C; Output: B x T
        x = self.conv1(x.transpose(1, 2)).transpose(1, 2)

Member commented:

After verifying that everything works, feel free to make the modules more verbose, e.g. by replacing x->hidden_states, ln->layernorm, etc 🙂
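
For illustration, the excerpt above rewritten with more descriptive names might look like this simplified sketch (a single convolution block stands in for the full variance-predictor stack):

    import torch
    from torch import nn


    class VariancePredictorSketch(nn.Module):
        def __init__(self, hidden_size=256, kernel_size=3, dropout=0.5):
            super().__init__()
            self.conv1 = nn.Conv1d(hidden_size, hidden_size, kernel_size, padding=(kernel_size - 1) // 2)
            self.layernorm = nn.LayerNorm(hidden_size)
            self.dropout = nn.Dropout(dropout)
            self.proj = nn.Linear(hidden_size, 1)

        def forward(self, hidden_states):
            # hidden_states: (batch, time, channels) -> predictions: (batch, time)
            hidden_states = self.conv1(hidden_states.transpose(1, 2)).transpose(1, 2)
            hidden_states = self.dropout(self.layernorm(torch.relu(hidden_states)))
            return self.proj(hidden_states).squeeze(-1)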

Collaborator replied:

No reason to close this comment for now as it has not been addressed.

@jaketae (Contributor, Author) replied Apr 16, 2022:

Apologies for the hasty resolve. I fixed this and the other related variable naming issues in the latest commit; I'll keep this thread open for a reviewer to confirm!


@anton-l (Member) commented Apr 25, 2022

@patrickvonplaten after a brief discussion with @jaketae, I think we should keep the spectrogram2waveform generators as separate models on the Hub for several reasons:

  • the models for waveform generation are more or less interchangeable between different TTS models (maybe with some fine-tuning) and are trained separately from them
  • since the generators can be quite diverse, it would be cumbersome to support training for them, e.g. we probably don't have GANs or diffusion models in transformers for that very reason
  • if I understand correctly, the generators usually don't support batched inference on inputs of varying lengths, so they're more suitable for pipelines rather than an integrated module

Having said that, I suggest that we finish this PR as a first step without adding any waveform generators. After that we can work on a TTS pipeline that runs inference on FastSpeech2 and decodes the output using either a Griffin-Lim synthesizer or any of the GAN models from the Hub (e.g. HiFiGAN) that we'll port.
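
To illustrate the separation being proposed, the eventual flow would look roughly like this, with the vocoder loaded as a separate Hub model and only glued to FastSpeech2 at pipeline level (the function and the vocoder call signature are assumptions):

    import torch


    def synthesize_sketch(text, tokenizer, acoustic_model, vocoder):
        # text -> mel-spectrogram (FastSpeech2) -> waveform (separately trained vocoder)
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            spectrogram = acoustic_model(**inputs)[0]  # (batch, frames, n_mels)
            waveform = vocoder(spectrogram)            # vocoder interface is an assumption
        return waveform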

@arampacha (Contributor) commented:

Hi, may I ask a couple of questions:

  1. Is there some estimated time of merging for this PR?
  2. Are there any experiments/examples of training using this FastSpeech2 implementation?
    And a somewhat more general question:
  3. Is there work in progress related to text-to-speech pipeline?

I'm working on training/fine-tuning FastSpeech2 models and would like to deploy them using the HF Inference API.
I'd be happy to help out with testing, contribute to the pipeline and/or the HiFi-GAN port, or provide any other assistance I can to speed things up with TTS.

@patrickvonplaten (Contributor) commented:

@anton-l, I think it's not high priority, but once you have a free day let's maybe add the model :-)

@jaketae (Contributor, Author) commented Jun 9, 2022

Wish I could have iterated faster to finish it earlier. Let me know if you need anything from me @anton-l @patrickvonplaten!

@patrickvonplaten (Contributor) commented:

> Wish I could have iterated faster to finish it earlier. Let me know if you need anything from me @anton-l @patrickvonplaten!

@jaketae would you like to finish this PR, maybe? E.g. fix the failing tests and add all the necessary code (including the vocoders) to the modeling file?

@patrickvonplaten (Contributor) commented:

@anton-l I still think we should go ahead with this PR - it's been stale for a while now. Wdyt?

@patrickvonplaten requested a review from anton-l July 7, 2022
@anton-l (Member) commented Jul 7, 2022

@patrickvonplaten definitely! I'm trying to find some time alongside other tasks to work on FS2 😅

@jaketae (Contributor, Author) commented Sep 16, 2022

Hey @anton-l @patrickvonplaten, hope you're doing well! I was wondering where FS2 fits within the transformers roadmap, and whether it would make sense for me to update this branch and wrap up the work. I feel responsible for leaving this stale, and I still think it would be cool to have this feature pushed out. Thank you!

@patrickvonplaten (Contributor) commented:

Hey @jaketae,

We definitely still want to add the model - would you like to go ahead and update the branch?

@anton-l added the WIP label Oct 21, 2022
@patrickvonplaten (Contributor) commented Oct 24, 2022

I don't think anybody will have time to work on it soon. If anybody from the community wants to give it a shot, please feel free to do so!

@patrickvonplaten added the Good Second Issue and Good Difficult Issue labels Oct 24, 2022
@patrickvonplaten changed the title from "Add fairseq FastSpeech2" to "[Community] Add fairseq FastSpeech2" Oct 24, 2022
@JuheonChu (Contributor) commented:

> I don't think anybody will have time to work on it soon. If anybody from the community wants to give it a shot, please feel free to do so!

Can I try working on this?

@connor-henderson (Contributor) commented:

@JuheonChu I'm also interested in working on this, any interest in collaborating? I'm planning to get up to speed and then open a fresh PR.

@hollance (Contributor) commented May 1, 2023

Hi @JuheonChu @connor-henderson, internally at HF we've been discussing what to do with this model. This PR has a lot of good work done already and FastSpeech 2 is a popular model, so it makes a lot of sense to add this to Transformers. If you want to help with this, that's great. 😃

There are a few different implementations of FastSpeech 2 out there and we'd like to add the best one to Transformers. As far as we can tell, there is a Conformer-based FastSpeech 2 model in ESPNet that outperforms regular FastSpeech 2, so that might be an interesting candidate. You can find the Conformer FastSpeech 2 checkpoints here.

In addition, PortaSpeech is very similar to FastSpeech 2 but it replaces the post-net and possibly a few other parts of the model. So that might be an interesting variation of FS2 to add as well.

We're still trying to figure out which of these models we'd prefer to have, but since you've shown an interest in working on this, I wanted to point out the direction we were thinking of taking these models internally. And I'd like to hear your opinions on this as well. 😄

P.S. Make sure to check out the implementation of SpeechT5. It has many features in common with FastSpeech2 (such as the vocoder and the generation loop) and so we'd want to implement FS2 mostly in the same manner.
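
For reference, the SpeechT5 flow mentioned above looks roughly like this in current transformers; the zero vector below is only a stand-in for a real x-vector speaker embedding:

    import torch

    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text="FastSpeech2 could follow the same pattern.", return_tensors="pt")
    speaker_embeddings = torch.zeros(1, 512)  # stand-in; normally an x-vector for the target speaker
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)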

@connor-henderson mentioned this pull request May 17, 2023