Multi-lingual training #257

Open
nvadigauvce opened this issue Jun 23, 2024 · 27 comments

@nvadigauvce

Thanks for the wonderful work, which gives good, expressive TTS for English speakers. I am planning an Indian multilingual TTS and have a few questions:

  1. Do we need to change only the data and the PL-BERT model, or are other changes required?
  2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
  3. For multiple languages, do we need to add data from each language to OOD_data: "Data/OOD_texts.txt"?
  4. Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?

@SandyPanda-MLDL

SandyPanda-MLDL commented Jun 23, 2024

You have to train the PL-BERT model on a text dataset in the language you want. A text dataset of more than 30 MB is sufficient, though you can use a larger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you of course need a phonemizer and tokenizer that support each language. You also have to train StyleTTS2 (training stage 1 and stage 2) on the dataset for that language (train_list.txt, val_list.txt and OOD_texts.txt).

@nvadigauvce
Author

@SandyPanda-MLDL Thanks for the quick reply and for answering the first question. I understand the part about training the PL-BERT model on a multilingual dataset.

How about the other three questions?
2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
3. For multiple languages, do we need to add data from each language to OOD_data: "Data/OOD_texts.txt"?
4. Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?

@traderpedroso

According to the README, the ASR model performs well in other languages. I tested it, and indeed it works fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After this, I decided to train all the models on my own data, and the quality matched what the model delivers in English.

@nvadigauvce
Author

@traderpedroso Thanks for the reply.

  1. Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?

  2. Did you also try training the PL-BERT model with multiple languages? If so, can we combine multiple languages, and do we need an equal amount of training data for each language?

@traderpedroso

traderpedroso commented Jun 26, 2024

I used the PL-BERT recommended in the multilingual repository, https://huggingface.co/papercup-ai/multilingual-pl-bert, and it worked perfectly for ASR. I tested it with fine-tuning and also tried training from scratch; both approaches gave me the same result. To be clear, the ASR that I trained from scratch was for a single language.

From my experience training StyleTTS 2, it is only worthwhile because inference is very fast and consumes little VRAM; the training cost makes it somewhat impractical. Besides, you can only train the second stage on a single GPU. To be clear, I didn't train the model from scratch, which would be even more expensive, but I can vouch that the quality is sensational. Another advantage of StyleTTS 2 is that it doesn't hallucinate; the generated audio is extremely reliable, especially for real-time streaming applications that don't need monitoring. However, in terms of cost vs. benefit, I personally prefer Tortoise for the final outcome.

@nvadigauvce
Author

@traderpedroso Thanks, I understand the AuxiliaryASR part. I will train it from scratch if the quality is bad.

  1. My use case is multilingual TTS for Indian languages, but Indian languages are not part of the multilingual PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert). Do you think we can still use multilingual-pl-bert for unseen languages?
  2. For multiple languages, do we need to add data from each language to OOD_data: "Data/OOD_texts.txt"?
  3. Do we need to add a language ID during data preparation for the multilingual use case, similar to the speaker ID in train_list.txt/val_list.txt? Otherwise, how will the model know which language to select at inference time?

@traderpedroso

Ensure that the speaker IDs are numbers. I personally used large numbers for the IDs, such as 3000, 3001, etc. If your language is not listed, you need to fine-tune multilingual-pl-bert on it. You do not need to add a language ID. Keep the data in the same format as the example in the Data folder.

I added data in the language I was training to Data/OOD_texts.txt, but honestly, I believe it is not relevant, because for the first 20 epochs I trained with the original Data/OOD_texts.txt and the model was already generating quality audio.

At inference, you need a dropdown list to select the language for your G2P (in this case, phonemizer), or a library that detects the language and switches the language in the phonemizer, for example en-us, it, fr, etc.
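
A minimal sketch of the language switch described above, assuming the phonemizer package and an eSpeak NG install are available. The LANGUAGE_CODES table and the helper names (resolve_language, get_backend, phonemize_text) are illustrative, not part of StyleTTS2; the language codes are the ones mentioned in this thread.

```python
from functools import lru_cache

# Hypothetical mapping from dropdown labels to eSpeak language codes.
LANGUAGE_CODES = {
    "English (US)": "en-us",
    "Italian": "it",
    "Portuguese (Brazil)": "pt-br",
}


def resolve_language(choice, default="en-us"):
    """Map a dropdown label to an eSpeak code, falling back to a default."""
    return LANGUAGE_CODES.get(choice, default)


@lru_cache(maxsize=None)
def get_backend(code):
    """Build one phonemizer backend per language code and reuse it."""
    # Imported lazily so the mapping logic works without eSpeak installed.
    from phonemizer.backend import EspeakBackend

    return EspeakBackend(language=code, preserve_punctuation=True, with_stress=True)


def phonemize_text(text, choice):
    """Phonemize text with the backend for the selected language."""
    backend = get_backend(resolve_language(choice))
    return backend.phonemize([text])[0]
```

Caching the backend avoids rebuilding it on every request; only the language code passed to the phonemizer changes between languages, which matches the point that no language ID is needed in the training data.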

@nvadigauvce
Author

@traderpedroso Thanks for answering all my questions in detail. I will try to build a multilingual TTS model and will report back if it is successful.

@mc-marcocheng

@traderpedroso How many hours of audio data did you use for training?

@traderpedroso

traderpedroso commented Jul 2, 2024

6 hours of audio at 24000 Hz, batch size 4, on an A100 80GB, running for 10 hours over 10 epochs. The first time, I trained with len 300 for 30 epochs and the quality was bad. After that, I fine-tuned the same model for 10 epochs with len 800, and after the second epoch it was generating perfect audio.

@mc-marcocheng

That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config?

@traderpedroso

traderpedroso commented Jul 8, 2024

Yes, max_len of 800, but I found a more efficient way when training my fourth model. I followed this approach: first, I trained the model on audio clips of 2 to 4 seconds, with a max_len of 300. Of course, the final quality wasn't interesting, but it trained without problems for 50 epochs in less than 2 hours. Then I fine-tuned for 5 epochs on audio clips of the "same length", 8 seconds. The model turned out perfect, with zero noise in the end and smooth pronunciation. It sounded much more human, and I spent fewer resources on training. The 8-second clips can come from various speakers, with a maximum of 80 seconds each. In my case, I trained with 50 speakers, and the fine-tuning used only a one-hour dataset with max_len 800.
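
The two-phase split above can be sketched as a simple duration filter. This is only an illustration under stated assumptions: clips are represented as hypothetical (path, duration_seconds) pairs, and the 7.5 to 8.5 second window stands in for "same length of 8 seconds"; neither comes from the StyleTTS2 code.

```python
def select_by_duration(clips, min_seconds, max_seconds):
    """Keep (path, duration) pairs whose duration lies in [min_seconds, max_seconds]."""
    return [(path, dur) for path, dur in clips if min_seconds <= dur <= max_seconds]


# Hypothetical clip list; in practice the durations would be read from the audio files.
clips = [("a.wav", 2.5), ("b.wav", 3.9), ("c.wav", 8.0), ("d.wav", 12.4)]

short_phase = select_by_duration(clips, 2.0, 4.0)  # first pass: 2 to 4 second clips
long_phase = select_by_duration(clips, 7.5, 8.5)   # fine-tuning: clips of about 8 seconds

print([path for path, _ in short_phase])
print([path for path, _ in long_phase])
```

Each filtered list would then be written out as its own train list for the corresponding training run.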

@nvadigauvce
Author

@traderpedroso Thanks for your insights.

  1. I was able to fine-tune the model successfully with 4 hours of data from a single Indic speaker, but at the end of the audio I hear some noisy click sounds. Any pointers for solving this issue?
  2. I was able to use max_len=400 for initial training and max_len=300 for joint training. If I increase max_len, I get OOM errors. Did you use max_len=800 for joint training as well?
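
For a sense of scale, the max_len values in this thread can be converted to seconds. This sketch assumes max_len counts mel frames and that the hop size is 300 samples at 24 kHz; check both against your own config before relying on the numbers.

```python
def max_len_to_seconds(max_len, hop_length=300, sample_rate=24000):
    """Convert a frame count to seconds, assuming one frame per hop_length samples."""
    return max_len * hop_length / sample_rate


for frames in (300, 400, 800):
    print(frames, "frames is about", max_len_to_seconds(frames), "seconds")
```

Under these assumptions, 300, 400 and 800 frames correspond to 3.75, 5 and 10 seconds, which is consistent with pairing max_len 300 with 2 to 4 second clips and max_len 800 with 8-second clips.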

@tanishbajaj101

Hey! Were you able to build the PL-BERT model for Hindi? I seem to be in the same situation as you.

@traderpedroso

You need to add silence padding to your audio before training. I added 500 ms to the beginning and end of each audio file. Then, during inference, I implemented a workaround with the following code:

import numpy as np
import scipy.io.wavfile

# normalizer, split_sentence, styletts2importable and voices are not defined in
# this snippet; they come from the surrounding inference setup.

def trim_audio(audio_np_array, sample_rate=24000, trim_ms=350):
    """Cut trim_ms from both ends of the generated audio."""
    trim_samples = int(trim_ms * sample_rate / 1000)
    # Trim only when the clip is longer than the total amount to remove.
    if len(audio_np_array) > 2 * trim_samples:
        # Explicit end index, so trim_ms=0 returns the clip unchanged
        # (a [0:-0] slice would return an empty array).
        return audio_np_array[trim_samples:len(audio_np_array) - trim_samples]
    return audio_np_array

def tts(input: str, voice="Bia", output_sample_rate=24000, alpha=0.7, beta=0.7, diffusion_steps=5, embedding_scale=2, output_wav_file=None):
    text = normalizer(input)
    if text.strip() == "":
        raise ValueError("insert some text")
    # len(text) counts characters, not tokens.
    if len(text) > 50000:
        raise ValueError("max 50,000 characters")

    # Synthesize sentence by sentence, trim each chunk, then join the chunks.
    texts = split_sentence(text)
    audios = []
    for t in texts:
        audio = styletts2importable.inference(
            t,
            voices[voice],
            alpha=alpha,
            beta=beta,
            diffusion_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        )
        audios.append(trim_audio(audio))
    output_audio = np.concatenate(audios)
    if output_wav_file:
        scipy.io.wavfile.write(output_wav_file, rate=output_sample_rate, data=output_audio)
    return output_sample_rate, output_audio
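
The padding step before training is not shown above. A minimal sketch of it, assuming mono audio already loaded as a NumPy array at 24 kHz; pad_silence is a hypothetical helper name, and loading and saving the files is left out.

```python
import numpy as np


def pad_silence(audio, sample_rate=24000, pad_ms=500):
    """Return a copy of a mono clip with pad_ms of digital silence on each side."""
    pad_samples = int(pad_ms * sample_rate / 1000)
    silence = np.zeros(pad_samples, dtype=audio.dtype)
    return np.concatenate([silence, audio, silence])


clip = np.ones(24000, dtype=np.float32)  # one second of dummy audio
padded = pad_silence(clip)
print(len(padded))  # 48000 samples: 0.5 s + 1 s + 0.5 s
```

With 500 ms of padding in the training data and 350 ms trimmed at inference, roughly 150 ms of silence would remain on each side of a generated chunk, assuming the model reproduces the padding.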

@nvadigauvce
Author

@traderpedroso Thanks for the detailed answer and code; this is very helpful.

@nvadigauvce
Author

@tanishbajaj101 I have trained a Hindi StyleTTS2 model with the existing English BERT model, and it seems to work fine without any issues. So I have not yet explored a Hindi PL-BERT model.

@traderpedroso

I'm building a dataset creator for WebUI that uses models to recognize speakers, segment audio, and detect silence for cuts and padding. Using Whisper alone for cutting audio isn't ideal; it's hard to get good-quality cuts. Doing it manually is a lot of work! I found some models on Hugging Face that might help, so I'm hoping to develop something that makes fine-tuning easier for everyone. If I get something working well, I'll share it here. Thanks, and see you later!

@xujzouyyz

If I want to train a model for a new language, there are a few steps I need to follow:

  1. Train an ASR model for the new language and use it in TTS training;
  2. Train a PL-BERT model for the new language and use it in TTS training;
  3. Prepare audio-text-phoneme data;
  4. Train the TTS model.

Could you please tell me whether there are any other steps I need to do?

@martinambrus

martinambrus commented Aug 20, 2024

@traderpedroso thank you very much for all your detailed work, shortcuts and workarounds. I'm trying to train a good English model from scratch but since I'm using around 9000 WAV files with lengths from 1s to 20s, it's actually quite costly (although even at 20 epochs of stage 2, I'm already getting some good results).

If you'd like to put your ideas into a single place, I'd be more than willing to compile a Jupyter notebook that would incorporate your findings to help others train their models.

I'm going to try your shortcut approach on training a Slovak (my language) model next month and will report how that went.

EDIT: did you try training a model for a different language? I'm finding that the training code is not prepared for multilingual PL-BERT (which some other people also discovered), and I'm having trouble adjusting the code to cope with it. I commented with more details in this PL-BERT multilingual repo fork.

Thanks again!

@traderpedroso

Yes, I trained with Brazilian Portuguese and Italian, and I had a lot of success with the results using PL-BERT. You don't need to change anything in the code; just replace everything in the /Utils/PLBERT/ folder with the multilingual version.

Another thing I noticed is that it's better to have a few audio clips with perfect cuts than hundreds of clips with cuts that generate noise. So padding at the beginning and end of the audio is essential. I've been a bit short on time these days, but next week I'll make a Gradio app available to apply the cuts and make this easier. As I mentioned, training each time with clips of the same length gives better results. Fine-tuning the English model for only 5 epochs on 15 minutes of audio, with SLM already active from the start, gave undeniable results, always following the rule of same-length clips with a minimum of 4 seconds.

@mantrakp04

mantrakp04 commented Aug 23, 2024

Does anyone have an easy-to-follow Jupyter notebook or web UI?

@martinambrus

Not for multilingual, but for a single language I have created two notebooks here: #144

The training notebook is easily adapted to multilingual training by swapping the PL-BERT subfolder in the Utils folder for the multilingual one, or for one that you trained yourself. For example, I used https://huggingface.co/gerulata/slovakbert for the Slovak language.

@mantrakp04

mantrakp04 commented Aug 24, 2024

For a single-language, multi-speaker dataset of approximately 1k hours, primarily consisting of Hindi audiobook recordings, would you recommend training a model from scratch or fine-tuning?

@martinambrus

@traderpedroso would you be able to write down a couple of points on how to train my own ASR? I tried to clone your https://github.com/traderpedroso/AuxiliaryASR but in the example Jupyter notebook, there is some metadata.txt file that I don't have, so I couldn't progress - and since I'm still fairly new to all this, I'll be very grateful for any pointers here... I already successfully implemented a Slovak PL-BERT and this is the last step for me to perfect my training :)

@traderpedroso

traderpedroso commented Aug 28, 2024

https://github.com/yl4579/AuxiliaryASR

I suggest you use the official one. I made some workarounds to get mine working with phonemizer in Brazilian Portuguese. Normally you can create your train list and validation list already converted to phonemes, as I was doing for testing custom phonemes, but I ended up modifying mine a lot, and I believe it won't be useful in your case. Just to be clear, AuxiliaryASR training will improve pronunciation in the language you train it on, and it's not necessary for English.

However, if you want to use mine, simply modify meldataset.py: change global_phonemizer = phonemizer.backend.EspeakBackend(language='pt-br', preserve_punctuation=True, with_stress=True) to global_phonemizer = phonemizer.backend.EspeakBackend(language='your language iso', preserve_punctuation=True, with_stress=True). Remember that your dataset must not contain phonemes; it should contain plain text in this format: LJSpeech-1.1/wavs/LJ048-0203.wav|The three officers confirm that their primary concern was crowd and traffic control,|0.
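
A small sketch for checking lines in that path|text|speaker format before training. The helper name parse_train_line is hypothetical; the numeric speaker-ID check follows the advice earlier in this thread rather than anything enforced by the repository.

```python
def parse_train_line(line):
    """Split a 'path|text|speaker_id' line and check that the speaker ID is numeric."""
    path, rest = line.rstrip("\n").split("|", 1)
    # Split the speaker ID from the right, in case the text itself contains '|'.
    text, speaker = rest.rsplit("|", 1)
    if not speaker.strip().isdigit():
        raise ValueError(f"speaker ID must be a number, got {speaker!r}")
    return path, text, int(speaker)


example = (
    "LJSpeech-1.1/wavs/LJ048-0203.wav|The three officers confirm that their "
    "primary concern was crowd and traffic control,|0"
)
path, text, speaker_id = parse_train_line(example)
print(path, speaker_id)
```

Running this over every line of the train and validation lists catches malformed rows or non-numeric speaker IDs before a long training job starts.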
