How to disentangle style and speaker information? #74

Open
mozykhau opened this issue Jan 30, 2023 · 8 comments
Labels
discussion New research topic

Comments

@mozykhau

I would like to transfer the speech style of one speaker to another speaker while preserving the target speaker's identity.

Do you have any advice on how to use it for emotional cross-speaker style transfer? I thought about adding an additional discriminator to classify the speaker ID, but how should the domains be defined in that case?

Thanks

@yl4579
Owner

yl4579 commented Jan 31, 2023

You can define the domains in terms of emotions instead of speakers. This way you preserve the speaker identity and convert only the emotion.
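
As a minimal sketch of that relabeling, assuming the repo's `path|domain_index` training-list format and an ESD-style directory layout (the `emotion_for` helper is hypothetical and depends on how your corpus encodes emotion):

```python
# Rewrite a StarGANv2-VC training list so the domain index encodes
# emotion instead of speaker. Assumes the "path|index" list format;
# emotion_for() is a hypothetical lookup for your corpus.
EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised"]

def emotion_for(wav_path: str) -> str:
    # Hypothetical ESD-style layout: Data/<speaker>/<emotion>/<file>.wav
    return wav_path.split("/")[-2].lower()

with open("Data/train_list.txt") as src, \
     open("Data/train_list_emotion.txt", "w") as dst:
    for line in src:
        path = line.strip().split("|")[0]
        dst.write(f"{path}|{EMOTIONS.index(emotion_for(path))}\n")
```

`num_domains` in the model config would then be the number of emotions rather than the number of speakers.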

@yl4579 yl4579 added the discussion New research topic label Jan 31, 2023
@mozykhau
Author

Thanks. Defining the domains as emotions instead of speakers worked, but it sometimes corrupted the speaker identity for specific emotional domains.
I found interesting research on EVC based on StarGANv2-VC by Sony Research India: https://arxiv.org/pdf/2302.10536.pdf. They added a second encoder and a classifier for the speaker domain for better disentanglement.
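
Setting the paper's exact architecture aside, a common way to get this kind of disentanglement is an adversarial speaker classifier on the style embedding, trained through a gradient-reversal layer so the style encoder learns to drop speaker cues. A self-contained PyTorch sketch (dimensions, names, and loss weighting are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SpeakerAdversary(nn.Module):
    """Predicts speaker ID from the style embedding. The reversal layer
    flips its gradients, pushing the style encoder to remove speaker
    information while keeping emotion/style content."""
    def __init__(self, style_dim: int, num_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(style_dim, 256), nn.ReLU(),
            nn.Linear(256, num_speakers),
        )

    def forward(self, style: torch.Tensor) -> torch.Tensor:
        return self.net(GradReverse.apply(style, self.lambd))

# Hypothetical training usage: `s` is the style encoder output,
# `spk` the integer speaker labels of the batch.
# adv = SpeakerAdversary(style_dim=64, num_speakers=20)
# loss_adv = nn.functional.cross_entropy(adv(s), spk)
# (loss_gan + loss_adv).backward()  # reversal handles the sign flip
```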

@chiaki-luo

Maybe that's because the same speaker utters too many sentences with the same emotion?

@CONGLUONG12

@yl4579 Hi, thanks for this project.
I want to know whether a domain here is the emotion category of one speaker or of many speakers?

> You can define the domains in terms of emotions instead of speakers. This way you preserve the speaker identity and convert only the emotion.

@yl4579
Owner

yl4579 commented Apr 9, 2023

@CONGLUONG12 The domains should span multiple speakers. You can refer to https://arxiv.org/pdf/2302.10536.pdf for more details; it is a good example of how to modify StarGANv2-VC for emotion conversion.

@CONGLUONG12

@yl4579 Thank you very much.
In your demo, you chose a speaker with a specific emotion. If you instead pick another speaker from the training set (call them speaker A) as the source, will you get speech with this emotion and the timbre of A?

@yl4579
Owner

yl4579 commented Apr 16, 2023

@CONGLUONG12 Probably yes, provided speaker A has samples with similar emotions in the training set; otherwise it might not work.
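
The conversion flow would then look roughly like this; every name here (`load_mel`, `style_encoder`, `f0_model`, `generator`) is a placeholder rather than the repo's actual API, so treat it as pseudocode for the idea:

```python
import torch

# Placeholder names throughout; treat as pseudocode for the flow.
target_emotion = 3                          # e.g. domain index of "angry"
ref = load_mel("anyone_angry.wav")          # reference in the target emotion
src = load_mel("speakerA_neutral.wav")      # speaker A's source speech

with torch.no_grad():
    # The style vector comes from a reference utterance in the target
    # emotion domain; speaker A's timbre is carried by the source mel.
    style = style_encoder(ref.unsqueeze(0), torch.LongTensor([target_emotion]))
    f0 = f0_model(src.unsqueeze(0))         # F0 features of the source
    converted = generator(src.unsqueeze(0), style, f0)
```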

@gnekt

gnekt commented Sep 5, 2023

Hey there!
I made something similar for my MSc degree in AI, starting from the great implementation by @yl4579.
Take a look Here for some hints.
