[Community] Add fairseq FastSpeech2 #15773

Open · wants to merge 63 commits into main

Conversation

@jaketae (Contributor) commented Feb 22, 2022

What does this PR do?

This PR adds FastSpeech2, a transformer-based text-to-speech model. Fixes #15166.

Motivation

While HF transformers has great support for ASR and other audio processing tasks, there is not much support for text-to-speech. There are many transformer-based TTS models that would be great to have in the library, and FastSpeech2 is perhaps the best-known and most popular of them. With TTS support, we could potentially have really cool demos that go from ASR -> NLU -> NLG -> TTS to build full-fledged voice assistants!

Who can review?

Anton, Patrick (will tag when PR is ready)

@HuggingFaceDocBuilder commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@jaketae (Contributor, Author) commented Feb 24, 2022

Here is a (tried-to-keep-short-but-long) update on this PR, summarizing some of the things discussed with @anton-l for public visibility.

  1. Checkpoint loading: The model weights load properly and produce roughly the same output as the fairseq model. In preliminary tests, I've found that torch.allclose passes with atol=1e-6. On this branch, the model can be loaded via
    from transformers import AutoModel
    model = AutoModel.from_pretrained("jaketae/fastspeech2-lj")
  2. Grapheme-to-phoneme: Fairseq uses g2p_en. I've added this as a dependency via requires_backends in the tokenizer. One thing I'm not too sure about is whether to enforce this from the get-go in __init__, or only require it in the phonemize function (see the sketch after this list, which also covers the EOS handling from point 3).
  3. Tokenizer: The tokenizer can be loaded as shown. For some reason, AutoTokenizer doesn't work at the moment, but this can likely be fixed by reviewing import statements. One consideration is that the fairseq code seems to add an EOS token at the end of every sequence, whereas HF's tokenizers do not by default (correct me if I'm wrong). In my code, I manually add the EOS token at the end for consistency.
    from transformers.models.fastspeech2 import FastSpeech2Tokenizer
    tokenizer = FastSpeech2Tokenizer.from_pretrained("jaketae/fastspeech2-lj")
  4. Vocoder: (Note: this is not directly related to the source code in this PR.) Using Griffin-Lim for vocoding yields very poor audio quality and is likely a bad idea for demos. An alternative is to use neural vocoders like HiFi-GAN. I've added HiFi-GAN to the model hub, which can be imported via vocoder = AutoModel.from_pretrained("jaketae/hifigan-lj-v1", trust_remote_code=True). At the moment the model weights do not load properly for some reason; debugging is required.
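
To make points 2 and 3 above concrete, here is a minimal sketch of how the lazy g2p_en requirement and the manually appended EOS token could behave. The class name, the default eos_token value, and the lazy-import approach are illustrative assumptions, not the code in this PR:

    class PhonemizerSketch:
        """Hypothetical helper, not the PR's FastSpeech2Tokenizer."""

        def __init__(self, eos_token="</s>"):
            self.eos_token = eos_token
            self._g2p = None  # created lazily so importing the class never needs g2p_en

        def phonemize(self, text):
            if self._g2p is None:
                try:
                    # only require g2p_en once phonemization is actually requested
                    from g2p_en import G2p
                except ImportError as exc:
                    raise ImportError("g2p_en is required for phonemization: pip install g2p_en") from exc
                self._g2p = G2p()
            phonemes = [p for p in self._g2p(text) if p != " "]
            # fairseq appends an EOS token to every sequence, so mirror that here
            return phonemes + [self.eos_token]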

I have some questions for discussion.

  1. Model output: What should the output of the model be? At the moment, it returns a tuple of mel-spectrograms, the mel-spectrogram after the postnet, output lengths (number of spectrogram frames), duration predictor output, predicted pitch, and predicted energy. I assume we need to subclass BaseModelOutput for something like TTSModelOutput (a sketch follows this list), and also optionally return a loss when duration, pitch, and energy ground-truth labels are supplied to forward.
  2. Mean and standard deviation: Fairseq's FastSpeech2 model technically does not predict a mel-spectrogram, but a normalized mean spectrogram according to mu and sigma values extracted from the training set. In fairseq's TTS pipeline, FS2 output is denormalized by some mu and sigma before being vocoded. I've added the mu and sigma values as register_buffer to the model, and created set_mean and set_std functions so that users can set these values according to their own training set. By default, mean and variance are set to 0 and 1 so that no rescaling is performed. Is this good design?
  3. TTS pipeline: At the moment, there is only FastSpeech2Model, which outputs mel-spectrograms as indicated above. While people might want to use acoustic feature generators for mel-spectrogram synthesis, more often they will want to obtain raw audio in waveform. As mentioned, this requires a vocoder, and we need to discuss ways to integrate this. Some API ideas I have include:
    • Introducing a FastSpeech2ForAudioSynthesis class with an integrated vocoder
    • Adding a pipeline that supports full-fledged TTS via something like pipeline("text-to-speech", model="fastspeech2", vocoder="hifigan"). But this is again contingent on how we want to integrate the vocoder.
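
As a reference point for question 1, a hypothetical output class could look like the following. The class name and field names are illustrative only; ModelOutput is the existing transformers base class:

    from dataclasses import dataclass
    from typing import Optional

    import torch

    from transformers.utils import ModelOutput


    @dataclass
    class FastSpeech2ModelOutputSketch(ModelOutput):
        """Hypothetical output class for question 1; field names are illustrative."""

        loss: Optional[torch.FloatTensor] = None                  # only when duration/pitch/energy labels are supplied
        spectrogram: Optional[torch.FloatTensor] = None           # (batch, frames, n_mels), before postnet
        postnet_spectrogram: Optional[torch.FloatTensor] = None   # (batch, frames, n_mels), after postnet
        output_lengths: Optional[torch.LongTensor] = None         # number of valid frames per batch element
        duration_predictions: Optional[torch.FloatTensor] = None
        pitch_predictions: Optional[torch.FloatTensor] = None
        energy_predictions: Optional[torch.FloatTensor] = None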

To-do's on my end (will be updated as I find more).

  • Writing tests
  • Documentation
  • Cleaning up imports
  • Debugging why AutoTokenizer fails
  • Debugging why HiFi-GAN weights do not load

cc @anton-l @patrickvonplaten


    if args.mean:
        self.register_buffer("mean", torch.zeros(self.out_dim))

@jaketae (Contributor, Author) replied:

Mean and standard deviation are 0 and 1 by default so that even if the user forgets to use set_mean and set_std, the output is not affected.

Member replied:

Great catch!

@patrickvonplaten (Contributor) replied Apr 14, 2022:

Wouldn't it be cleaner to let mean be of type Optional[float] and set it to 0.0 by default (same with variance)? The if config.mean() check is a bit weird to me.

@jaketae (Contributor, Author) replied:

I changed the flag name to config.use_mean. I also added an informative warning in case the user sets use_mean to True, pointing them to the set_mean() function (likewise for variance).

Do you think it makes more sense for the user to specify the mean and variance explicitly in the config object? If so, one problem is that the mean and variance are 80-dimensional vectors, so I'm afraid that would pollute config.json too much. I might have misunderstood your intent, though; I'm definitely open to suggestions, as I also think there might be a cleaner way to achieve this!
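
For illustration, the buffer-based design being discussed might look roughly like this; the module, method, and buffer names are assumptions for the sketch, not the PR's final API:

    import torch
    from torch import nn


    class SpectrogramDenormalizerSketch(nn.Module):
        def __init__(self, n_mels=80):
            super().__init__()
            # defaults of 0 and 1 leave the output untouched if set_mean/set_std are never called
            self.register_buffer("mean", torch.zeros(n_mels))
            self.register_buffer("std", torch.ones(n_mels))

        def set_mean(self, mean: torch.Tensor):
            self.mean.copy_(mean)

        def set_std(self, std: torch.Tensor):
            self.std.copy_(std)

        def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
            # denormalize a (batch, frames, n_mels) spectrogram with per-bin statistics
            return spectrogram * self.std + self.mean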

@anton-l (Member) left a review:

Hey @jaketae, great work on the port, looks very clean so far!

For the next steps:

  1. Upload the conversion script to models/fastspeech2/ for future checkpoint ports :)
  2. Try converting other checkpoints, e.g. one of each model size and one for each dataset (LJSpeech, Common Voice EN), and write integration tests for them, like you did for the first test (a rough sketch of such a test follows this list). With this we'll make sure that there won't be any missing modules or config options for different model variants down the road, when we start mass-converting the leftover checkpoints.
  3. Remove unsupported task-specific tests from FastSpeech2ModelTester and FastSpeech2ModelTest, e.g. create_and_check_for_question_answering. Then see if FastSpeech2ModelTest() is all green.
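
For point 2, an integration test might be structured roughly as below; the helper name is hypothetical, and the expected values would have to be exported from the corresponding fairseq checkpoint:

    import torch


    def check_fastspeech2_integration(model, tokenizer, text, expected_slice, atol=1e-6):
        # expected_slice: reference spectrogram values exported from the fairseq model
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        spectrogram = outputs[0]  # mel-spectrogram before postnet, per the output tuple above
        assert torch.allclose(spectrogram[0, :3, :3], expected_slice, atol=atol)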


    def forward(self, x):
        # Input: B x T x C; Output: B x T
        x = self.conv1(x.transpose(1, 2)).transpose(1, 2)

Member commented:

After verifying that everything works, feel free to make the modules more verbose, e.g. by replacing x->hidden_states, ln->layernorm, etc 🙂
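
For illustration, the excerpt above rewritten with more descriptive names might look like this simplified sketch (a single convolution block stands in for the full variance-predictor stack):

    import torch
    from torch import nn


    class VariancePredictorSketch(nn.Module):
        def __init__(self, hidden_size=256, kernel_size=3, dropout=0.5):
            super().__init__()
            self.conv1 = nn.Conv1d(hidden_size, hidden_size, kernel_size, padding=(kernel_size - 1) // 2)
            self.layernorm = nn.LayerNorm(hidden_size)
            self.dropout = nn.Dropout(dropout)
            self.proj = nn.Linear(hidden_size, 1)

        def forward(self, hidden_states):
            # hidden_states: (batch, time, channels) -> predictions: (batch, time)
            hidden_states = self.conv1(hidden_states.transpose(1, 2)).transpose(1, 2)
            hidden_states = self.dropout(self.layernorm(torch.relu(hidden_states)))
            return self.proj(hidden_states).squeeze(-1)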

Collaborator replied:

No reason to close this comment for now as it has not been addressed.

@jaketae (Contributor, Author) replied Apr 16, 2022:

Apologies for the hasty resolve. I fixed this and the other related variable naming issues in the latest commit; I'll keep this thread open for a reviewer to confirm!


@anton-l (Member) commented Apr 25, 2022

@patrickvonplaten after a brief discussion with @jaketae, I think we should keep the spectrogram2waveform generators as separate models on the Hub for several reasons:

  • the models for waveform generation are more or less interchangeable between different TTS models (maybe with some fine-tuning) and are trained separately from them
  • since the generators can be quite diverse, it would be cumbersome to support training for them, e.g. we probably don't have GANs or diffusion models in transformers for that very reason
  • if I understand correctly, the generators usually don't support batched inference on inputs of varying lengths, so they're more suitable for pipelines rather than an integrated module

Having said that, I suggest that we finish this PR as a first step without adding any waveform generators. After that we can work on a TTS pipeline that runs inference on FastSpeech2 and decodes the output using either a Griffin-Lim synthesizer or any of the GAN models from the Hub (e.g. HiFiGAN) that we'll port.
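
To illustrate the separation being proposed, the eventual flow would look roughly like this, with the vocoder loaded as a separate Hub model and only glued to FastSpeech2 at pipeline level (the function and the vocoder call signature are assumptions):

    import torch


    def synthesize_sketch(text, tokenizer, acoustic_model, vocoder):
        # text -> mel-spectrogram (FastSpeech2) -> waveform (separately trained vocoder)
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            spectrogram = acoustic_model(**inputs)[0]  # (batch, frames, n_mels)
            waveform = vocoder(spectrogram)            # vocoder interface is an assumption
        return waveform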

@arampacha (Contributor) commented:

Hi, may I ask a couple of questions:

  1. Is there some estimated time of merging for this PR?
  2. Are there any experiments/examples of training using this FastSpeech2 implementation?
    And a somewhat more general question:
  3. Is there work in progress related to text-to-speech pipeline?

I'm working on training/fine-tuning FastSpeech2 models and would like to deploy them using the HF Inference API.
I'd be happy to help out with testing, contribute to the pipeline and/or the HiFi-GAN port, or provide any other assistance I can to speed things up with TTS.

@patrickvonplaten (Contributor) commented:

@anton-l, I think it's not high priority, but once you have a free day let's maybe add the model :-)

@jaketae (Contributor, Author) commented Jun 9, 2022

Wish I could have iterated faster to finish it earlier. Let me know if you need anything from me @anton-l @patrickvonplaten!

@patrickvonplaten (Contributor) commented:

> Wish I could have iterated faster to finish it earlier. Let me know if you need anything from me @anton-l @patrickvonplaten!

@jaketae would you like to finish this PR, maybe? E.g. fix the failing tests and add all the necessary code (including the vocoders) to the modeling file?

@patrickvonplaten (Contributor) commented:

@anton-l I still think we should go ahead with this PR - it's been stale for a while now. Wdyt?

@patrickvonplaten requested a review from anton-l July 7, 2022
@anton-l (Member) commented Jul 7, 2022

@patrickvonplaten definitely! I'm trying to find some time alongside other tasks to work on FS2 😅

@jaketae (Contributor, Author) commented Sep 16, 2022

Hey @anton-l @patrickvonplaten, hope you're doing well! I was wondering where FS2 fits within the transformers roadmap, and whether it would make sense for me to update this branch and wrap up the work. I feel responsible for leaving this stale, and I still think it would be cool to have this feature pushed out. Thank you!

@patrickvonplaten (Contributor) commented:

Hey @jaketae,

We definitely still want to add the model - would you like to go ahead and update the branch?

@anton-l added the WIP label Oct 21, 2022
@patrickvonplaten (Contributor) commented Oct 24, 2022

I don't think anybody will have time to work on it soon. If anybody from the community wants to give it a shot, please feel free to do so!

@patrickvonplaten added the Good Second Issue and Good Difficult Issue labels Oct 24, 2022
@patrickvonplaten changed the title from "Add fairseq FastSpeech2" to "[Community] Add fairseq FastSpeech2" Oct 24, 2022
@JuheonChu (Contributor) commented:

> I don't think anybody will have time to work on it soon. If anybody from the community wants to give it a shot, please feel free to do so!

Can I try working on this?

@connor-henderson (Contributor) commented:

@JuheonChu I'm also interested in working on this, any interest in collaborating? I'm planning to get up to speed and then open a fresh PR.

@hollance (Contributor) commented May 1, 2023

Hi @JuheonChu @connor-henderson, internally at HF we've been discussing what to do with this model. This PR has a lot of good work done already and FastSpeech 2 is a popular model, so it makes a lot of sense to add this to Transformers. If you want to help with this, that's great. 😃

There are a few different implementations of FastSpeech 2 out there and we'd like to add the best one to Transformers. As far as we can tell, there is a Conformer-based FastSpeech 2 model in ESPNet that outperforms regular FastSpeech 2, so that might be an interesting candidate. You can find the Conformer FastSpeech 2 checkpoints here.

In addition, PortaSpeech is very similar to FastSpeech 2 but it replaces the post-net and possibly a few other parts of the model. So that might be an interesting variation of FS2 to add as well.

We're still trying to figure out which of these models we'd prefer to have, but since you've shown an interest in working on this, I wanted to point out the direction we were thinking of taking these models internally. And I'd like to hear your opinions on this as well. 😄

P.S. Make sure to check out the implementation of SpeechT5. It has many features in common with FastSpeech2 (such as the vocoder and the generation loop) and so we'd want to implement FS2 mostly in the same manner.
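
For reference, the SpeechT5 flow mentioned above looks roughly like this in current transformers; the zero vector below is only a stand-in for a real x-vector speaker embedding:

    import torch

    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text="FastSpeech2 could follow the same pattern.", return_tensors="pt")
    speaker_embeddings = torch.zeros(1, 512)  # stand-in; normally an x-vector for the target speaker
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)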

@connor-henderson mentioned this pull request May 17, 2023