<a href="https://colab.research.google.com/github/rmcpantoja/AHK-scripts-for-accessibility/blob/main/notebooks/piper_multilingual_training_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="ffc800"> **[Piper](https://github.com/rhasspy/piper) training notebook.**
## ![Piper logo](https://contribute.rhasspy.org/img/logo.png)

---

- Notebook made by [rmcpantoja](http://github.com/rmcpantoja)
- Collaborator: [Xx_Nessu_xX](http://github.com/Xx_Nessu_xX)

---

# Notes:

- <font color="orange">**Text in orange marks important information.**

# <font color="ffc800">🔧 ***First steps.*** 🔧

In [None]:
#@markdown ## <font color="ffc800"> **Google Colab Anti-Disconnect.** 🔌
#@markdown ---
#@markdown #### Avoid automatic disconnection. Still, it will disconnect after <font color="orange">**6 to 12 hours**</font>.

import IPython
js_code = '''
function ClickConnect(){
console.log("Working");
document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect,60000)
'''
display(IPython.display.Javascript(js_code))

In [None]:
#@markdown ## <font color="ffc800"> **Check GPU type.** 👁️
#@markdown ---
#@markdown #### A more capable GPU can train faster. By default, you will have a <font color="orange">**Tesla T4**</font>.
!nvidia-smi

In [None]:
#@markdown # <font color="ffc800"> **Mount Google Drive.** 📂
#@markdown ---
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
#@markdown # <font color="ffc800"> **Install software.** 📦
#@markdown ---
#@markdown #### This cell installs the synthesizer and the dependencies needed for training. (This may take a while.)

#@markdown #### <font color="orange">**Do you want to use the patch?**
#@markdown The patch provides the ability to export audio files to the output folder and save a single model while training.
usepatch = False #@param {type:"boolean"}
#@markdown ---
# clone:
!git clone -q https://github.com/rmcpantoja/piper
%cd /content/piper/src/python
!wget -q "https://raw.githubusercontent.com/coqui-ai/TTS/dev/TTS/bin/resample.py"
#!pip install -q -r requirements.txt
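# Version specifiers are quoted so the shell does not treat ">" as an output redirect.
!pip install -q "cython>=0.29.0" "piper-phonemize==1.1.0" "librosa>=0.9.2" "numpy==1.24" "onnxruntime>=1.11.0" "pytorch-lightning==1.7.7" "torch==1.13.0+cu117" --extra-index-url https://download.pytorch.org/whl/cu117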
!pip install -q torchtext==0.14.0 torchvision==0.14.0
# fixing recent compatibility issues:
!pip install -q torchaudio==0.13.0 torchmetrics==0.11.4
!bash build_monotonic_align.sh
# download patches:
if usepatch:
    print("\033[93mDownloading patch...")
    !gdown -q "1EWEb7amo1rgFGpBFfRD4BKX3pkjVK1I-" -O "/content/piper/src/python/patch.zip"
    !unzip -o -q "patch.zip"
    print("\033[93mPatch downloaded!")

# <font color="ffc800"> 🤖 ***Training.*** 🤖

In [None]:
#@markdown # <font color="ffc800"> **1. Extract dataset.** 📥
#@markdown ---
#@markdown #### Important: the audio files must be in <font color="orange">**WAV format (16000 or 22050 Hz, 16-bit, mono) and, for convenience, numbered. Example:**

#@markdown * <font color="orange">**1.wav**</font>
#@markdown * <font color="orange">**2.wav**</font>
#@markdown * <font color="orange">**3.wav**</font>
#@markdown * <font color="orange">**.....**</font>

#@markdown ---

%cd /content
# -p avoids an error when the cell is re-run and the folders already exist.
!mkdir -p /content/dataset
%cd /content/dataset
!mkdir -p /content/dataset/wavs
#@markdown ### Audio dataset path to unzip:
zip_path = "/content/drive/MyDrive/wavs.zip" #@param {type:"string"}
!unzip -q -j "{zip_path}" -d /content/dataset/wavs
#@markdown ---
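
Before moving on, it can help to confirm that the extracted files really match the format described above. The following is a minimal sketch using only Python's standard `wave` module; the function names `check_wav` and `check_dataset` are hypothetical helpers, not part of Piper or of this notebook.

```python
import wave
from pathlib import Path

# Sample rates accepted by this notebook, as listed in the dataset notes.
ALLOWED_RATES = {16000, 22050}

def check_wav(path):
    """Return a list of problems found in one WAV file (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels (expected mono)")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples (expected 16-bit)")
        if wav.getframerate() not in ALLOWED_RATES:
            problems.append(f"{wav.getframerate()} Hz (expected 16000 or 22050)")
    return problems

def check_dataset(wav_dir):
    """Map each problematic file name in wav_dir to its list of problems."""
    report = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        try:
            problems = check_wav(path)
        except (wave.Error, EOFError) as err:
            # Raised for files that are not plain PCM WAV or are truncated.
            problems = [f"unreadable: {err}"]
        if problems:
            report[path.name] = problems
    return report
```

Calling `check_dataset("/content/dataset/wavs")` returns an empty dictionary when every file passes. Files flagged only for their sample rate can be handled by the resampler option in the preprocessing step.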

In [None]:
#@markdown # <font color="ffc800"> **2. Upload the transcript file.** 📝
#@markdown ---
#@markdown #### <font color="orange">**Important: the transcript is the text of what the speaker says in each audio file, and it must have the following structure:**

#@markdown ##### <font color="orange">For a single-speaker dataset:
#@markdown * wavs/1.wav|This is what my character says in audio 1.
#@markdown * wavs/2.wav|This is the text that the character says in audio 2.
#@markdown * ...

#@markdown ##### <font color="orange">For a multi-speaker dataset:

#@markdown * wavs/speaker1audio1.wav|speaker1|This is what the first speaker says.
#@markdown * wavs/speaker1audio2.wav|speaker1|This is another audio of the first speaker.
#@markdown * wavs/speaker2audio1.wav|speaker2|This is what the second speaker says in the first audio.
#@markdown * wavs/speaker2audio2.wav|speaker2|This is another audio of the second speaker.
#@markdown * ...

#@markdown And so on. In addition, the transcript must be a <font color="orange">**.csv or .txt file (UTF-8 without BOM).**

#@markdown ---
%cd /content/dataset
from google.colab import files
# -f avoids an error on the first run, when no metadata.csv exists yet.
!rm -f /content/dataset/metadata.csv
listfn, length = files.upload().popitem()
if listfn != "metadata.csv":
  !mv "$listfn" metadata.csv
%cd ..
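
The transcript structure described above can be checked before preprocessing. This is a rough sketch that only verifies the encoding and the number of `|`-separated fields per line; `check_metadata` is a hypothetical helper, and it assumes the transcript text itself contains no `|` character.

```python
from pathlib import Path

def check_metadata(metadata_path, single_speaker=True):
    """Return (line_number, message) pairs for problems in a transcript file."""
    raw = Path(metadata_path).read_bytes()
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append((1, "file starts with a UTF-8 BOM"))
        raw = raw[3:]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Line 0 marks a whole-file problem.
        return problems + [(0, f"not valid UTF-8: {err}")]
    # path|text for one speaker, path|speaker|text for several.
    expected = 2 if single_speaker else 3
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) != expected:
            problems.append((number, f"expected {expected} fields, found {len(fields)}"))
        elif not all(field.strip() for field in fields):
            problems.append((number, "empty field"))
    return problems
```

For example, `check_metadata("/content/dataset/metadata.csv", single_speaker=True)` returns an empty list when every line has the `path|text` shape.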

In [None]:
#@markdown # <font color="ffc800"> **3. Preprocess dataset.** 🔄
#@markdown ---
import os
#@markdown ### First of all, select the language of your dataset.
language = "English (U.S.)" #@param ["ألعَرَبِي", "Català", "čeština", "Dansk", "Deutsch", "Ελληνικά", "English (British)", "English (U.S.)", "Español (Castellano)", "Español (Latinoamericano)", "Suomi", "Français", "Magyar", "Icelandic", "Italiano", "ქართული", "қазақша", "Lëtzebuergesch", "नेपाली", "Nederlands", "Norsk", "Polski", "Português (Brasil)", "Português (Portugal)", "Română", "Русский", "Српски", "Svenska", "Kiswahili", "Türkçe", "украї́нська", "Tiếng Việt", "简体中文"]
#@markdown ---
# language definition:
languages = {
    "ألعَرَبِي": "ar",
    "Català": "ca",
    "čeština": "cs",
    "Dansk": "da",
    "Deutsch": "de",
    "Ελληνικά": "el",
    "English (British)": "en",
    "English (U.S.)": "en-us",
    "Español (Castellano)": "es",
    "Español (Latinoamericano)": "es-419",
    "Suomi": "fi",
    "Français": "fr",
    "Magyar": "hu",
    "Icelandic": "is",
    "Italiano": "it",
    "ქართული": "ka",
    "қазақша": "kk",
    "Lëtzebuergesch": "lb",
    "नेपाली": "ne",
    "Nederlands": "nl",
    "Norsk": "nb",
    "Polski": "pl",
    "Português (Brasil)": "pt-br",
    "Português (Portugal)": "pt-pt",
    "Română": "ro",
    "Русский": "ru",
    "Српски": "sr",
    "Svenska": "sv",
    "Kiswahili": "sw",
    "Türkçe": "tr",
    "украї́нська": "uk",
    "Tiếng Việt": "vi",
    "简体中文": "zh"
}

def _get_language(code):
    return languages[code]

final_language = _get_language(language)
#@markdown ### Choose a name for your model:
model_name = "Test" #@param {type:"string"}
#@markdown ---
# output:
#@markdown ### Choose the working folder: (recommended to save to Drive)

#@markdown The working folder will be used in preprocessing, but also in training the model.
output_path = "/content/drive/MyDrive/colab/piper" #@param {type:"string"}
output_dir = output_path+"/"+model_name
if not os.path.exists(output_dir):
  os.makedirs(output_dir)
#@markdown ---
#@markdown ### Choose dataset format:
dataset_format = "ljspeech" #@param ["ljspeech", "mycroft"]
#@markdown ---
#@markdown ### Is this a single speaker dataset? Otherwise, uncheck:
single_speaker = True #@param {type:"boolean"}
if single_speaker:
  force_sp = " --single-speaker"
else:
  force_sp = ""
#@markdown ---
#@markdown ### Select the sample rate of the dataset:
sample_rate = "22050" #@param ["16000", "22050"]
#@markdown ---
!mkdir -p /content/audio_cache
%cd /content/piper/src/python
#@markdown ### Do you want to train at this sample rate, but your audio files use a different one?
#@markdown The resampler helps you do it quickly!
resample = False #@param {type:"boolean"}
if resample:
  !python resample.py --input_dir "/content/dataset/wavs" --output_dir "/content/dataset/wavs_resampled" --output_sr {sample_rate} --file_ext "wav"
  !mv /content/dataset/wavs_resampled/* /content/dataset/wavs
#@markdown ---

!python -m piper_train.preprocess \
  --language {final_language} \
  --input-dir /content/dataset \
  --cache-dir "/content/audio_cache" \
  --output-dir "{output_dir}" \
  --dataset-name "{model_name}" \
  --dataset-format {dataset_format} \
  --sample-rate {sample_rate} \
  {force_sp}

In [None]:
#@markdown # <font color="ffc800"> **4. Settings.** 🧰
#@markdown ---
import json
import ipywidgets as widgets
from IPython.display import display
from google.colab import output
import os
#@markdown ### <font color="orange">**Select the action to train this dataset: (READ CAREFULLY)**

#@markdown * The option to <font color="orange">continue training</font> resumes an earlier run. If you previously trained a model on free Colab, ran out of time, and want to train it further, this is the option for you. Use the same settings you chose when you first trained the model.
#@markdown * The option to <font color="orange">convert a single-speaker model to a multi-speaker model</font> starts from a single-speaker checkpoint. It requires a preprocessed dataset that contains text and audio from every speaker you want the model to support.
#@markdown * The <font color="orange">finetune</font> option trains on your dataset starting from a pretrained model. It is ideal for very small datasets (more than five minutes of audio is recommended).
#@markdown * The <font color="orange">train from scratch</font> option builds the model from scratch, with no pretrained checkpoint to start from, so it may take longer to converge. At least 8 hours of audio covering a wide range of phonemes are recommended.

action = "finetune" #@param ["Continue training", "convert single-speaker to multi-speaker model", "finetune", "train from scratch"]
#@markdown ---
if action == "Continue training":
    if os.path.exists(f"{output_dir}/lightning_logs/version_0/checkpoints/last.ckpt"):
        ft_command = f'--resume_from_checkpoint "{output_dir}/lightning_logs/version_0/checkpoints/last.ckpt" '
        print(f"\033[93mContinuing {model_name}'s training at: {output_dir}/lightning_logs/version_0/checkpoints/last.ckpt")
    else:
        raise Exception("Training cannot be continued as there is no checkpoint to continue at.")
elif action == "finetune":
    if os.path.exists(f"{output_dir}/lightning_logs/version_0/checkpoints/last.ckpt"):
        raise Exception("Oh no! You have already trained this model before, you cannot choose this option since your progress will be lost, and then your previous time will not count. Please select the option to continue a training.")
    else:
        ft_command = '--resume_from_checkpoint "/content/pretrained.ckpt" '
elif action == "convert single-speaker to multi-speaker model":
    if not single_speaker:
        ft_command = '--resume_from_single_speaker_checkpoint "/content/pretrained.ckpt" '
    else:
        raise Exception("This dataset is not a multi-speaker dataset!")
else:
    ft_command = ""
if action== "convert single-speaker to multi-speaker model" or action == "finetune":
    try:
        with open('/content/piper/notebooks/pretrained_models.json') as f:
            pretrained_models = json.load(f)
        if final_language in pretrained_models:
            models = pretrained_models[final_language]
            model_options = [(model_name, model_name) for model_name, model_url in models.items()]
            model_dropdown = widgets.Dropdown(description = "Choose pretrained model", options=model_options)
            download_button = widgets.Button(description="Download")
            def download_model(btn):
                model_name = model_dropdown.value
                model_url = pretrained_models[final_language][model_name]
                print("\033[93mDownloading pretrained model...")
                if model_url.startswith("1"):
                    !gdown -q "{model_url}" -O "/content/pretrained.ckpt"
                elif model_url.startswith("https://drive.google.com/file/d/"):
                    !gdown -q "{model_url}" -O "/content/pretrained.ckpt" --fuzzy
                else:
                    !wget -q "{model_url}" -O "/content/pretrained.ckpt"
                model_dropdown.close()
                download_button.close()
                output.clear()
                if os.path.exists("/content/pretrained.ckpt"):
                    print("\033[93mModel downloaded!")
                else:
                    raise Exception("Couldn't download the pretrained model!")
            download_button.on_click(download_model)
            display(model_dropdown, download_button)
        else:
            raise Exception(f"There are no pretrained models available for the language {final_language}")
    except FileNotFoundError:
        raise Exception("The pretrained_models.json file was not found.")
else:
    print("\033[93mWarning: this model will be trained from scratch. You need at least 8 hours of data for decent results. Good luck!")
#@markdown ### Choose batch size based on this dataset:
batch_size = 12 #@param {type:"integer"}
#@markdown ---

#@markdown ### Choose the quality for this model:

#@markdown * x-low - 16 kHz audio, 5-7M params
#@markdown * medium - 22.05 kHz audio, 15-20M params
#@markdown * high - 22.05 kHz audio, 28-32M params
quality = "medium" #@param ["high", "x-low", "medium"]
#@markdown ---
#@markdown ### Save a training checkpoint every how many epochs?
#@markdown The larger your dataset, the smaller this interval should be, since each epoch takes longer to complete.
checkpoint_epochs = 5 #@param {type:"integer"}
#@markdown ---
#@markdown ### Step interval to generate model samples:
log_every_n_steps = 1000 #@param {type:"integer"}
#@markdown ---
#@markdown ### Training epochs:
max_epochs = 10000 #@param {type:"integer"}
#@markdown ---
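
The quality list above pairs each level with an audio sample rate, so it may be worth comparing the chosen `quality` with the `sample_rate` used in preprocessing. The sketch below does only that comparison; `check_quality` is a hypothetical helper, and whether a mismatch actually breaks training is not stated in this notebook, so the result is treated as a warning rather than an error.

```python
# Sample rate listed for each quality level in the settings above.
QUALITY_SAMPLE_RATES = {"x-low": 16000, "medium": 22050, "high": 22050}

def check_quality(quality, sample_rate):
    """Return a warning string if the two settings disagree, otherwise None."""
    expected = QUALITY_SAMPLE_RATES[quality]
    if int(sample_rate) != expected:
        return (f"quality '{quality}' is listed as {expected} Hz audio, "
                f"but the dataset was preprocessed at {sample_rate} Hz")
    return None
```

In this notebook, `check_quality(quality, sample_rate)` could be called at the end of the settings cell and its result printed if it is not `None`.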

In [None]:
#@markdown # <font color="pink"> **5. Run the TensorBoard extension.** 📈

#@markdown TensorBoard visualizes training results, such as audio samples and losses, while the model is being trained.

%load_ext tensorboard
%tensorboard --logdir {output_dir}

In [None]:
#@markdown # <font color="ffc800"> **6. Train.** 🏋️‍♂️
#@markdown ---
#@markdown ### Run this cell to train your final model! If possible, some audio samples will be saved during training in the output folder, unless you disable validation.

#@markdown ---
#@markdown ### <font color="orange">**Use validation?**
#@markdown Unchecking this box trains on the full dataset, with no audio files held out as a validation set, so TensorBoard will not be able to generate audio samples during training. Disabling validation is recommended for extremely small datasets.
validation = True #@param {type:"boolean"}
if validation:
    validation_split = 0.01
    num_test_examples = 2
else:
    validation_split = 0
    num_test_examples = 0
get_ipython().system(f'''
python -m piper_train \
--dataset-dir "{output_dir}" \
--accelerator 'gpu' \
--devices 1 \
--batch-size {batch_size} \
--validation-split {validation_split} \
--num-test-examples {num_test_examples} \
--quality {quality} \
--checkpoint-epochs {checkpoint_epochs} \
--log_every_n_steps {log_every_n_steps} \
--max_epochs {max_epochs} \
{ft_command}\
--precision 32
''')
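
To get a feel for what a validation split of 0.01 means in practice, the arithmetic can be sketched as below. This assumes the trainer holds out the dataset size multiplied by the split fraction, rounded down; that rounding rule is an assumption, not something this notebook confirms, and `estimate_validation_size` is a hypothetical helper.

```python
def estimate_validation_size(num_utterances, validation_split=0.01):
    """Estimate how many utterances are held out for validation.

    Assumption: the count is the dataset size times the split, rounded down.
    """
    return int(num_utterances * validation_split)

# Under that assumption, a dataset needs at least 100 utterances
# before a 0.01 split holds out even one of them.
for size in (50, 100, 1000):
    print(size, "utterances ->", estimate_validation_size(size), "held out")
```

Under this assumption very small datasets end up with an empty validation set anyway, which is consistent with the advice to disable validation for them.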

#  <font color="orange">**Have you finished training and want to test the model?**

* If you want to run this model in Piper itself or in any software that integrates Piper, export your model using the [model exporter notebook](https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_model_exporter.ipynb)!
* Want to test it right away, before exporting it to the format Piper supports? Test your generated last.ckpt with [this notebook](https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_inference_(ckpt).ipynb)!
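
The settings cell looks for `last.ckpt` under `lightning_logs/version_0`, but later runs may write to `version_1`, `version_2`, and so on. A minimal sketch for locating the most recent one, assuming the `lightning_logs/version_N/checkpoints/last.ckpt` layout used above (`find_last_checkpoint` is a hypothetical helper, not part of Piper):

```python
from pathlib import Path

def find_last_checkpoint(output_dir):
    """Return last.ckpt from the highest-numbered version_N folder, or None."""
    candidates = []
    pattern = "lightning_logs/version_*/checkpoints/last.ckpt"
    for ckpt in Path(output_dir).glob(pattern):
        # ckpt.parent is "checkpoints"; its parent is the "version_N" folder.
        version = ckpt.parent.parent.name.split("_")[-1]
        if version.isdigit():
            candidates.append((int(version), ckpt))
    if not candidates:
        return None
    # Compare numerically so that version_10 ranks above version_2.
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `find_last_checkpoint("/content/drive/MyDrive/colab/piper/Test")` returns a `Path` to the newest `last.ckpt`, or `None` if training has not saved one yet.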