Additional detail on using preprocess.py with gentle phoneme data #96

Closed
nmstoker opened this issue Jun 27, 2018 · 9 comments

@nmstoker

Hello @r9y9

Would you mind giving a few more details on what is needed to use preprocess.py as mentioned in the last step of the section on using custom data?

https://github.com/r9y9/deepvoice3_pytorch#1-2-preprocessing-custom-english-datasets-with-long-silence-based-on-vctk_preprocess

Initially I managed to train using custom data without gentle, and the results were recognisably like my training data (recordings of my own voice), but I am hoping quality will improve if I use gentle with the training data. I have managed to process the data with gentle_web_align.py, but I am unsure what parameters to use for preprocess.py now, and also what format I need to put the files into.

Is there some similar format to the alignment.json file that should be created? And how would I incorporate the .lab files I got from gentle_web_align.py?

Sorry - I expect these may be obvious to you; I've been trying to figure it out from looking over the code, but to no avail! 😞


The project is really impressive - thank you for sharing your work!

Neil (@nmstoker)

@nmstoker
Author

I've been able to figure it out by more careful reading of the code and just trying things (repeatedly 🙂).

If it helps others, what I did was:

  • Ran gentle_web_align.py

  • The output from gentle_web_align.py is like this:

    datasets
    └── neil2
        ├── 1.wav
        ├── 1.lab
        ├── 1.txt
        ├── 2.wav
        ├── 2.txt
        ├── 2.lab
        ├── 3.wav
        ├── 3.lab
        ├── 3.txt
        └── ...
  • Renamed files to have speaker prefix "p225_" (chosen to match 1st one in vctk script)
for f in *.txt; do mv "$f" "p225_$f"; done
for f in *.lab; do mv "$f" "p225_$f"; done
for f in *.wav; do mv "$f" "p225_$f"; done
  • Re-arranged into this folder format
    datasets
    └── neil2
        ├── wav48
        │   ├── p225_1.wav
        │   ├── p225_2.wav
        │   ├── p225_3.wav
        │   └── ...
        │
        ├── lab
        │   ├── p225_1.lab
        │   ├── p225_2.lab
        │   ├── p225_3.lab
        │   └── ...
        │    
        └── txt
            ├── p225_1.txt
            ├── p225_2.txt
            ├── p225_3.txt
            └── ...
  • Added a speaker-info.txt file (based on the one for vctk dataset but just the first line for speaker 225)
    ID  AGE  GENDER  ACCENTS  REGION  
    225 42   M       English  Southern  England
  • Put speaker-info.txt in the top folder for my specific dataset (ie "neil2" here)
    datasets
    └── neil2
        ├── speaker-info.txt
        ├── wav48
        │    ├── p225_1.wav
        │    └── ...
        ├── ...
        └── ...
  • Adapted the script for vctk processing that's in my site-packages folder: nnmnkwii/datasets/vctk.py
    This probably isn't the smartest / right way to do it, but it was late! It was changed such that:
  1. it only looked for one individual speaker ("225")
  2. small adjustments so that various assertions were still okay (ie assert len(available_speakers) == 108
    --> assert len(available_speakers) == 1 )
  3. fixed it so it didn't clip the end of transcriptions (may well be something I'd configured wrongly myself) [line 214]
if not is_wav:
  #files = list(map(lambda s: open(s, "rb").read().decode("utf-8")[:-1], files))
  files = list(map(lambda s: open(s, "rb").read().decode("utf-8"), files))
  • Ran preprocess.py (which now uses the adapted vctk script) as follows:
    python preprocess.py vctk "./datasets/neil2" "./datasets/processed_neil2" --preset=presets/deepvoice3_ljspeech.json
  • Outputs this folder format:
    datasets
    └── processed_neil2
        ├── train.txt
        ├── vctk-mel-00001.npy
        ├── vctk-mel-00002.npy
        ├── vctk-mel-00003.npy
        ├── vctk-spec-00001.npy
        ├── vctk-spec-00002.npy
        ├── vctk-spec-00003.npy
        └── ...
  • Because vctk had worked as if it were a multi-speaker setup, it had appended |0 to all the lines in the train.txt file, so I cut that off every line (a short sketch of this step is further down this comment)

  • Then I was finally able to train, using a checkpoint I had initially trained with ljspeech data:

python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./datasets/processed_neil2 --restore-parts=checkpoints_intial_ljspeech/checkpoint_step000430000.pth

And, after waiting overnight it worked!! 🎉

I've got a small issue with the synthesised audio, but I'll open a separate issue for that (if I don't figure out the cause)
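
For reference, the "|0" cleanup step from the list above can be done with a few lines of Python; this is just my throwaway helper (not part of the repo), with the path matching my setup:

# Strip the trailing "|0" multi-speaker id from every line of train.txt.
path = "./datasets/processed_neil2/train.txt"
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]
with open(path, "w", encoding="utf-8") as f:
    for line in lines:
        if line.endswith("|0"):
            line = line[:-len("|0")]
        f.write(line + "\n")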

Two questions @r9y9

  1. If I wrote this up for the README, would you be interested in a PR?
  2. For the vctk hack (ie in nnmnkwii), is there a smarter way to handle it?

@r9y9
Owner

r9y9 commented Jun 30, 2018

You did a great job, but I think the better way to handle your own dataset is to follow https://github.com/r9y9/deepvoice3_pytorch#1-1-building-custom-dataset-using-json_meta. I just tried to build my own toy dataset and I understand that it could be explained more carefully (it took me more than ten minutes to figure out what exactly alignment.json should be; doc improvements are of course welcome!), but it's not that hard. I'll leave the steps I used to create alignment.json as an example:

> ls -l
total 936
-rw-rw-r-- 1 ryuichi ryuichi    152  6月 30 22:32 LJ001-0001.txt
-rw-r--r-- 1 ryuichi ryuichi 425830  6月 30 22:29 LJ001-0001.wav
-rw-rw-r-- 1 ryuichi ryuichi     31  6月 30 22:32 LJ001-0002.txt
-rw-r--r-- 1 ryuichi ryuichi  83814  6月 30 22:29 LJ001-0002.wav
-rw-rw-r-- 1 ryuichi ryuichi    156  6月 30 22:32 LJ001-0003.txt
-rw-r--r-- 1 ryuichi ryuichi 426342  6月 30 22:29 LJ001-0003.wav
-rw-rw-r-- 1 ryuichi ryuichi    417  6月 30 22:45 alignment.json

> echo "{" > alignment.json; for f in $(find $PWD -type f -name "*.wav"); do { g=${f/.wav/.txt}; echo \"${f}\": \"${g}\", >> alignment.json } done; echo "}" >> alignment.json
# remove last comma manually
> emacs -nw alignment.json 
# check that we have correct json format
> cat alignment.json | jq .
{
  "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0002.wav": "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0002.txt",
  "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0001.wav": "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0001.txt",
  "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0003.wav": "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0003.txt"
}
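
If you'd rather skip the manual comma edit, a minimal Python sketch, assuming the same foobar directory of paired .wav/.txt files as above, can write the mapping as valid JSON directly:

# Sketch: build alignment.json mapping each wav path to its .txt path,
# mirroring the shell one-liner above (assumes ./foobar with paired files).
import glob
import json
import os

wav_dir = os.path.abspath("foobar")
mapping = {}
for wav in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
    mapping[wav] = os.path.splitext(wav)[0] + ".txt"

with open(os.path.join(wav_dir, "alignment.json"), "w") as f:
    json.dump(mapping, f, indent=2)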

After you have alignment.json, do preprocessing as usual:

python preprocess.py json_meta "./foobar/alignment.json" output_dir

The steps you described after "Renamed files to have speaker prefix 'p225_' (chosen to match 1st one in vctk script)" are not actually necessary.

@r9y9
Owner

r9y9 commented Jun 30, 2018

If I wrote this up for the README, would you be interested in a PR?

Yes!

For the vctk hack (ie in nnmnkwii), is there a smarter way to handle it?

Yes. See my previous comment.

if not is_wav:
  #files = list(map(lambda s: open(s, "rb").read().decode("utf-8")[:-1], files))
  files = list(map(lambda s: open(s, "rb").read().decode("utf-8"), files))

If this is a real issue (sorry, I forgot why I put the [:-1] there), I'm happy if you would create a PR.

@nmstoker
Author

nmstoker commented Jul 1, 2018

Thanks a lot for the update.

I could be wrong, but I think your steps don't take account of the gentle processing. I'd already managed to get regular training working as per section 1.1, but I was trying to use the .lab files that gentle created (as I thought that better knowledge of the positions of the words within the .wav files might help with training), and so was trying to do the steps suggested by 1.2.

I tried to follow section 1.2 but was struggling due to not knowing the structure to aim for with the files/folders. My steps above did seem to manage to incorporate the gentle .lab files etc, but if there's a smarter way to handle that part (ie what is implied in 1.2), I'd be keen to discuss it.

Did you find using gentle helpful? It's a bit subjective, but I think my results were marginally improved. I'm now focusing on weeding out some bad quality data that I found in my dataset and then will do some more runs.

@r9y9
Owner

r9y9 commented Jul 2, 2018

While I didn't prepare label files myself, python preprocess.py json_meta "./foobar/alignment.json" output_dir should take label files into account if they exist:

lab_path = wav_path.replace("wav48/", "lab/").replace(".wav", ".lab")
if not exists(lab_path):
    lab_path = os.path.splitext(wav_path)[0] + '.lab'
# Trim silence from hts labels if available
if exists(lab_path):
    labels = hts.load(lab_path)
    b = int(start_at(labels) * 1e-7 * sr)
    e = int(end_at(labels) * 1e-7 * sr)
    wav = wav[b:e]
    wav, _ = librosa.effects.trim(wav, top_db=25)
else:
    if hparams.process_only_htk_aligned:
        return None
    wav, _ = librosa.effects.trim(wav, top_db=15)
# End added from the multispeaker version
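
(For reference, the .lab files here are HTK-style label files: one "start end label" line per segment, with times in units of 100 ns, which is where the 1e-7 factor above comes from. A rough, hypothetical sketch of inspecting one in plain Python, with an assumed file name:)

# Rough sketch, not from the repo: print each segment of an HTK-style .lab
# file in seconds. The file name is hypothetical; times are in 100 ns units.
with open("p225_1.lab") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        start, end = int(fields[0]), int(fields[1])
        label = " ".join(fields[2:])
        print("%.3f\t%.3f\t%s" % (start * 1e-7, end * 1e-7, label))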

@engiecat
Contributor

engiecat commented Jul 10, 2018

Hi.
As I had initially proposed the gentle-based preprocessing, as well as the custom dataset support, I am really sorry for the vague documentation. (English is not my first language :/)

For the json format, in principle the format should be like
{
  "path-to-wav-file": "transcript of wav file",
  (continued)
}
in accordance with the format of carpedm20/multi-speaker-tacotron-tensorflow. It differs from the explanation given above.

(Other formats, where a list of alignment candidates is placed instead of the transcript text, are also supported (because carpedm20's automatic alignment creates such a format), but the format mentioned above works well.)
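
As a quick, hypothetical illustration (not part of the repo), an alignment.json in the format above can be sanity-checked like this:

# Load alignment.json (keys: wav paths, values: transcripts or .txt paths)
# and report any wav files that do not exist on disk.
import json
import os

with open("alignment.json", encoding="utf-8") as f:
    alignment = json.load(f)

for wav_path in alignment:
    if not os.path.exists(wav_path):
        print("missing wav:", wav_path)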

For the effectiveness of Gentle, you can refer to #78, where I directly compared the performance. In summary, the merlin-based vctk_preprocess didn't work at all on a 'dirty' dataset, while gentle did work and did improve the TTS performance. It works well when dealing with datasets that include inconsistent silences (e.g. breathing) during speech.

I totally agree that the doc needs some improvement. Any suggestions will be welcome, and I will try to improve it!! :)
(At least, I should rewrite the json-based custom metadata part and explain the alignment json format explicitly.)

+) I just discovered that the gentle alignment step could be driven by alignment.json files alone (as they provide the path to each wav and its transcript). Would it be better if the gentle alignment step supported json input, instead of txt and wav file patterns?

@nmstoker
Author

Hi @engiecat, it's great that you suggested gentle and it was included 😀
Regarding your suggestion of the gentle alignment step being covered with alignment.json files, this sounds like it would slightly simplify the overall pipeline of steps needed and therefore would indeed be helpful! (although the extra steps aren't that hard)

@engiecat
Contributor

@nmstoker
okay! thanks for the input.

@stale

stale bot commented May 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 30, 2019
@stale stale bot closed this as completed Jun 6, 2019