[Community] Add fairseq FastSpeech2 #15773
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Here is a (tried-to-keep-short-but-long) update on this PR, summarizing some of the things discussed with @anton-l for public visibility.
I have some questions for discussion.
To-do's on my end (will be updated as I find more).
```python
if args.mean:
    self.register_buffer("mean", torch.zeros(self.out_dim))
```
Mean and standard deviation are 0 and 1 by default, so that even if the user forgets to use `set_mean` and `set_std`, the output is not affected.
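As a minimal sketch of that pattern (the module and method names here are hypothetical, not the PR's actual code), registering zero/one buffers makes the de/normalization an identity until real statistics are set:

```python
import torch
from torch import nn


class MeanStdBuffers(nn.Module):
    """Hypothetical module sketching the default-buffer pattern."""

    def __init__(self, out_dim: int = 80):
        super().__init__()
        # Defaults of mean=0 and std=1 make the transform a no-op,
        # so forgetting set_mean/set_std does not change the output.
        self.register_buffer("mean", torch.zeros(out_dim))
        self.register_buffer("std", torch.ones(out_dim))

    def set_mean(self, mean: torch.Tensor) -> None:
        self.mean.copy_(mean)

    def set_std(self, std: torch.Tensor) -> None:
        self.std.copy_(std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity with the defaults: x * 1 + 0 == x.
        return x * self.std + self.mean
```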
Great catch!
Wouldn't it be cleaner to let `mean` be of type `Optional[float]` and set it to `0.0` by default (same with variance)? The `if config.mean()` is a bit weird to me.
I changed the flag name to `config.use_mean`. I also added an informative warning in case the user sets `use_mean` to `True`, pointing them to the `set_mean()` function (likewise for variance).

Do you think it makes more sense for the user to specify the mean and variance explicitly in the config object? If so, one problem is that the mean and variance are 80-d vectors, so I'm afraid that it would pollute `config.json` too much. I might have misunderstood your intent though; definitely open to suggestions, as I also think there might be a cleaner way to achieve this!
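A rough sketch of what such a warning could look like (the `config.use_mean` flag is from this PR; the helper function, logger call, and message are illustrative only):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_warn_about_mean(config) -> None:
    # Illustrative only: nudge the user toward set_mean() when the flag is
    # on, since the buffer still holds its all-zeros default at init time.
    if config.use_mean:
        logger.warning(
            "`use_mean=True`, but the `mean` buffer is initialized to zeros. "
            "Call `set_mean()` with your dataset statistics before inference."
        )
```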
Hey @jaketae, great work on the port, looks very clean so far!
For the next steps:
- Upload the conversion script to `models/fastspeech2/` for future checkpoint ports :)
- Try converting other checkpoints, e.g. one of each model size and one for each dataset (ljspeech, commonvoice en), and write integration tests for them, like you did for the first test (see the sketch after this list). With this we'll make sure that there won't be any missing modules or config options for different model variants down the road, when we start mass-converting the leftover checkpoints.
- Remove unsupported task-specific tests from `FastSpeech2ModelTester` and `FastSpeech2ModelTest`, e.g. `create_and_check_for_question_answering`. Then see if `FastSpeech2ModelTest()` is all green.
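An integration test could look roughly like this; only a sketch — `FastSpeech2Model` is the class this PR adds, but the checkpoint id, output indexing, and expected values are placeholders to be filled in from the fairseq model's output:

```python
import torch

from transformers import FastSpeech2Model  # class added in this PR


def test_inference_integration():
    # Placeholder checkpoint id; substitute the converted checkpoint.
    model = FastSpeech2Model.from_pretrained("fastspeech2-commonvoice-en")
    model.eval()

    input_ids = torch.tensor([[4, 20, 7, 12, 2]])  # dummy phoneme/token ids
    with torch.no_grad():
        spectrogram = model(input_ids)[0]

    # Placeholder values; copy the reference slice from the fairseq output.
    expected_slice = torch.tensor([-1.0, 0.0, 1.0])
    assert torch.allclose(spectrogram[0, 0, :3], expected_slice, atol=1e-4)
```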
```python
def forward(self, x):
    # Input: B x T x C; Output: B x T
    x = self.conv1(x.transpose(1, 2)).transpose(1, 2)
```
After verifying that everything works, feel free to make the modules more verbose, e.g. by replacing `x` -> `hidden_states`, `ln` -> `layernorm`, etc 🙂
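For instance, the snippet above could become something like this (a sketch of the rename only; the rest of the forward pass is unchanged):

```python
def forward(self, hidden_states):
    # hidden_states: (batch, time, channels); Conv1d expects
    # (batch, channels, time), hence the transposes around the convolution.
    hidden_states = self.conv1(hidden_states.transpose(1, 2)).transpose(1, 2)
    ...
```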
Keeping this comment open for now, as it has not been addressed.
Apologies for the hasty resolve. I fixed this and other related variable naming issues in the latest commit; will keep this thread open for a reviewer to confirm!
@patrickvonplaten after a brief discussion with @jaketae, I think we should keep the spectrogram2waveform generators as separate models on the Hub, for several reasons. Having said that, I suggest that we finish this PR as a first step, without adding any waveform generators. After that we can work on a TTS …
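To make the proposed split concrete, usage would look roughly like this two-stage pipeline (all class and checkpoint names below are hypothetical; the point is only that the acoustic model and the vocoder are loaded as separate Hub checkpoints):

```python
import torch

# Hypothetical classes/checkpoints illustrating the separation of concerns.
from transformers import FastSpeech2Model, HifiGanVocoder  # hypothetical

acoustic_model = FastSpeech2Model.from_pretrained("fastspeech2-ljspeech")
vocoder = HifiGanVocoder.from_pretrained("hifigan-ljspeech")

input_ids = torch.tensor([[4, 20, 7, 12, 2]])  # dummy token ids
with torch.no_grad():
    spectrogram = acoustic_model(input_ids)[0]  # text -> mel-spectrogram
    waveform = vocoder(spectrogram)             # mel-spectrogram -> waveform
```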
Hi, may I ask a couple of questions? I'm working on training/fine-tuning FastSpeech2 models and would like to deploy them using the HF inference API.
@anton-l, I think it's not high priority, but once you have a free day let's maybe add the model :-)
Wish I could have iterated faster to finish it earlier. Let me know if you need anything from me @anton-l @patrickvonplaten!
@jaketae would you like to finish this PR? E.g. fix the failing tests and add all the necessary code (including the vocoders) to the modeling file?
@anton-l I still think we should go ahead with this PR - it's been stale for a while now. Wdyt?
@patrickvonplaten definitely! I'm trying to find some time alongside other tasks to work on FS2 😅
Hey @anton-l @patrickvonplaten, hope you're doing well! I was wondering where FS2 fits within the transformers roadmap, and whether it would make sense for me to update this branch and wrap up the work. I feel responsible for leaving this stale, and I still think it would be cool to have this feature pushed out. Thank you!
Hey @jaketae, we definitely still want to add the model - would you like to go ahead and update the branch?
I don't think anybody will have time to work on it soon. If anybody from the community wants to give it a shot, please feel free to do so!
Can I try working on this?
@JuheonChu I'm also interested in working on this, any interest in collaborating? Planning to get up to speed and then open a fresh PR.
Hi @JuheonChu @connor-henderson, internally at HF we've been discussing what to do with this model. This PR has a lot of good work done already and FastSpeech 2 is a popular model, so it makes a lot of sense to add this to Transformers. If you want to help with this, that's great. 😃

There are a few different implementations of FastSpeech 2 out there and we'd like to add the best one to Transformers. As far as we can tell, there is a Conformer-based FastSpeech 2 model in ESPnet that outperforms regular FastSpeech 2, so that might be an interesting candidate. You can find the Conformer FastSpeech 2 checkpoints here.

In addition, PortaSpeech is very similar to FastSpeech 2, but it replaces the post-net and possibly a few other parts of the model. So that might be an interesting variation of FS2 to add as well.

We're still trying to figure out which of these models we'd prefer to have, but since you've shown an interest in working on this, I wanted to point out the direction we were thinking of taking these models internally. And I'd like to hear your opinions on this as well. 😄

P.S. Make sure to check out the implementation of SpeechT5. It has many features in common with FastSpeech2 (such as the vocoder and the generation loop), so we'd want to implement FS2 mostly in the same manner.
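For reference, the SpeechT5 vocoder-plus-generation-loop pattern mentioned above looks like this in Transformers (the zero speaker embedding is a placeholder; real usage loads x-vectors from a dataset):

```python
import torch

from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, this is a test.", return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))  # placeholder; use real x-vectors

# The vocoder is passed into the generation loop, which is the pattern an
# FS2 port would likely mirror.
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder
)
```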
What does this PR do?
This PR adds FastSpeech2, a transformer-based text-to-speech model. Fixes #15166.
Motivation
While HF transformers has great support for ASR and other audio processing tasks, there is not much support for text-to-speech. There are many transformer-based TTS models that would be great to have in the library; FastSpeech2 is perhaps the best-known and most popular of them. With TTS support, we could have really cool demos that go from ASR -> NLU -> NLG -> TTS to build full-fledged voice assistants!
Who can review?
Anton, Patrick (will tag when PR is ready)