We refer to the repository https://github.com/Plachtaa/VALL-E-X, an unofficial reproduction of VALL-E X.
Pretrained models can be found on Google Drive.
```bash
# First, modify model_home in the following script to the location of the downloaded/pretrained models.
# Second, set prompt_txt, prompt_audio, and target_txt, together with the corresponding language id.
bash examples/vallex/scripts/inference.sh
```
VALL-E X is trained on a dataset containing discrete speech tokens and text tokens.
- Prepare an `info.tsv` file as follows (`path \t duration`), containing the path and duration of each speech:

```
SPEECH_PATH1    DURATION1
SPEECH_PATH2    DURATION2
SPEECH_PATH3    DURATION3
......
```
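For reference, a minimal Python sketch of how `info.tsv` could be generated; the `soundfile` dependency, the wav directory, and the seconds unit for duration are assumptions, so check `extract_codec.sh` for the unit it expects:

```python
import os
import soundfile as sf  # assumed dependency for reading audio metadata

wav_home = "/data/wavs"  # hypothetical directory holding the speech files

with open("info.tsv", "w") as f:
    for name in sorted(os.listdir(wav_home)):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(wav_home, name)
        meta = sf.info(path)
        duration = meta.frames / meta.samplerate  # duration in seconds (assumed unit)
        f.write(f"{path}\t{duration}\n")
```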
- Extract codec tokens according to `info.tsv`:

```bash
bash examples/vallex/data_pretreatment/extract_codec.sh
```
This produces 8 `codec[i].tsv` files: the tokens of the i-th codec layer (i = 0~7) are saved separately into `codec[i].tsv`:

```
304 123 453 255 256 345 124 666 543 ...
654 662 543 463 674 537 273 473 973 ...
355 345 766 255 234 768 275 785 102 ...
......
```
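As a rough illustration of what the extraction script computes, here is a sketch built on the open-source EnCodec codec; the codec choice, the 6 kbps bandwidth (which yields 8 codebooks), and the per-utterance append logic are assumptions, not the repository's actual implementation:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Assumed codec: EnCodec 24 kHz at 6 kbps, giving 8 codebooks that
# match the 8 codec[i].tsv files described above.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("SPEECH_PATH1")  # placeholder path from info.tsv
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale) tuples
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)  # (1, 8, T)

# Append one row per utterance to each codec[i].tsv: the i-th codebook stream.
for i in range(codes.shape[1]):
    row = " ".join(str(tok) for tok in codes[0, i].tolist())
    with open(f"codec{i}.tsv", "a") as f:
        f.write(row + "\n")
```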
- Prepare the transcript file `trans.tsv`, with each line corresponding to one speech:

```
Text for SPEECH1
Text for SPEECH2
Text for SPEECH3
......
```
Next, we need to convert the text into tokens with tools such as BPE/G2P/..., saving the result as `st.tsv`:

```
1521 467 885 2367 242 ...
2362 3261 356 167 1246 2364 ...
1246 123 432 134 53 13 ...
......
```
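For example, a minimal sketch with a SentencePiece BPE model; the `bpe.model` file is hypothetical, and the repository may use a different tokenizer (e.g., G2P):

```python
import sentencepiece as spm

# Hypothetical BPE model; substitute whichever tokenizer you actually use.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

with open("trans.tsv") as fin, open("st.tsv", "w") as fout:
    for line in fin:
        ids = sp.encode(line.strip(), out_type=int)  # text -> token ids
        fout.write(" ".join(map(str, ids)) + "\n")
```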
- Convert the data (`codec[i].tsv` and `st.tsv`) into binary files for fast reading:

```bash
# We use the fairseq preprocess tool to perform this conversion.
python /home/wangtianrui/codes/fairseq/fairseq_cli/preprocess.py \
    --only-source \
    --trainpref /home/wangtianrui/develop_dataset/st.tsv \
    --destdir /home/wangtianrui/develop_dataset/data_bin \
    --thresholdsrc 0 \
    --srcdict /home/wangtianrui/develop_dataset/dict.st.txt \
    --workers `cat /proc/cpuinfo | grep "processor" | wc -l`

# Output directory for the binarized codec files (set this to your own path).
outdir=/home/wangtianrui/develop_dataset/data_bin

for ((i=0;i<=7;i++))
do
    echo $i
    outname=train.at${i}.zh
    python /home/wangtianrui/codes/fairseq/fairseq_cli/preprocess.py \
        --only-source \
        --trainpref codec${i}.tsv \
        --destdir $outdir \
        --thresholdsrc 0 \
        --srcdict /home/wangtianrui/develop_dataset/dict.at.txt \
        --workers `cat /proc/cpuinfo | grep "processor" | wc -l`
done
```
where `dict.at.txt` and `dict.st.txt` are simple dictionaries whose rows map each index to itself, for the speech discrete tokens and text tokens respectively, as shown in `examples/vallex/data_pretreatment`.
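If you need to regenerate these dictionaries, here is a sketch under the assumption that they follow fairseq's `symbol count` format with each integer id mapped to itself; the vocabulary sizes below are illustrative guesses, so prefer the files shipped in `examples/vallex/data_pretreatment`:

```python
# fairseq dictionaries are "symbol count" rows; since the tokens are already
# integer ids, each row maps an id to itself with a dummy count.
with open("dict.at.txt", "w") as f:
    for idx in range(1024):  # assumed codec codebook size
        f.write(f"{idx} 1\n")

with open("dict.st.txt", "w") as f:
    for idx in range(4000):  # hypothetical text-token vocabulary size
        f.write(f"{idx} 1\n")
```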
In this way, we can train VALL-E X with `dataset_config.train_data_path` set to the home path of the binary files. We also release a tiny dataset for reference on Google Drive.
After preparing the dataset, modify the `train_data_path` in the following script, and you can start training or fine-tuning:

```bash
bash examples/vallex/scripts/vallex.sh
```