Additional detail on using preprocess.py with gentle phoneme data #96

Closed
nmstoker opened this issue Jun 27, 2018 · 9 comments

@nmstoker

Hello @r9y9

Would you mind giving a few more details on what is needed to use preprocess.py as mentioned in the last step of the section on using custom data?

https://github.com/r9y9/deepvoice3_pytorch#1-2-preprocessing-custom-english-datasets-with-long-silence-based-on-vctk_preprocess

Initially I managed to train using custom data without gentle, and the results were recognisably like my training data (recordings of my own voice), but I am hoping quality will improve if I use gentle with the training data. I have managed to process the data with gentle_web_align.py, but I am unsure what parameters to use for preprocess.py now, and also what format I need to put the files into.

Is there some similar format to the alignment.json file that should be created? And how would I incorporate the .lab files I got from gentle_web_align.py?

Sorry - I expect these may be obvious to you; I've been trying to figure it out from looking over the code, but to no avail! 😞


The project is really impressive - thank you for sharing your work!

Neil (@nmstoker)

@nmstoker
Author

I've been able to figure it out by more careful reading of the code and just trying things (repeatedly 🙂).

If it helps others, what I did was:

  • Ran gentle_web_align.py

  • The output from gentle_web_align.py is like this:

    datasets
    └── neil2
        ├── 1.wav
        ├── 1.lab
        ├── 1.txt
        ├── 2.wav
        ├── 2.txt
        ├── 2.lab
        ├── 3.wav
        ├── 3.lab
        ├── 3.txt
        └── ...
  • Renamed files to have speaker prefix "p225_" (chosen to match 1st one in vctk script)
for f in *.txt; do mv "$f" "p225_$f"; done
for f in *.lab; do mv "$f" "p225_$f"; done
for f in *.wav; do mv "$f" "p225_$f"; done
  • Re-arranged into this folder format
    datasets
    └── neil2
        ├── wav48
        │   ├── p225_1.wav
        │   ├── p225_2.wav
        │   ├── p225_3.wav
        │   └── ...
        │
        ├── lab
        │   ├── p225_1.lab
        │   ├── p225_2.lab
        │   ├── p225_3.lab
        │   └── ...
        │    
        └── txt
            ├── p225_1.txt
            ├── p225_2.txt
            ├── p225_3.txt
            └── ...
  • Added a speaker-info.txt file (based on the one for vctk dataset but just the first line for speaker 225)
    ID  AGE  GENDER  ACCENTS  REGION  
    225 42   M       English  Southern  England
  • Put speaker-info.txt in the top folder for my specific dataset (ie "neil2" here)
    datasets
    └── neil2
        ├── speaker-info.txt
        ├── wav48
        │    ├── p225_1.wav
        │    └── ...
        ├── ...
        └── ...
  • Adapted the script for vctk processing that's in my site-packages folder: nnmnkwii/datasets/vctk.py
    This probably isn't the smartest / right way to do it, but it was late! It was changed such that:
  1. it only looked for one individual speaker ("225")
  2. small adjustments so that various assertions were still okay (ie assert len(available_speakers) == 108
    --> assert len(available_speakers) == 1 )
  3. fixed it so it didn't clip the end of transcriptions (may well be something I'd configured wrongly myself) [line 214]
if not is_wav:
  #files = list(map(lambda s: open(s, "rb").read().decode("utf-8")[:-1], files))
  files = list(map(lambda s: open(s, "rb").read().decode("utf-8"), files))
  • Ran preprocess.py (which now uses the adapted vctk script) as follows:
    python preprocess.py vctk "./datasets/neil2" "./datasets/processed_neil2" --preset=presets/deepvoice3_ljspeech.json
  • Outputs this folder format:
    datasets
    └── processed_neil2
        ├── train.txt
        ├── vctk-mel-00001.npy
        ├── vctk-mel-00002.npy
        ├── vctk-mel-00003.npy
        ├── vctk-spec-00001.npy
        ├── vctk-spec-00002.npy
        ├── vctk-spec-00003.npy
        └── ...
  • Because vctk had worked as if it were a multi-speaker setup, it had appended |0 to all the lines in the train.txt file, so I cut that off every line (a short sketch of this step is further down this comment)

  • Then I was finally able to train, using a checkpoint I had initially trained with ljspeech data:

python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./datasets/processed_neil2 --restore-parts=checkpoints_intial_ljspeech/checkpoint_step000430000.pth

And, after waiting overnight it worked!! 🎉

I've got a small issue with the synthesised audio, but I'll open a separate issue for that (if I don't figure out the cause)
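
For reference, the "|0" cleanup step from the list above can be done with a few lines of Python; this is just my throwaway helper (not part of the repo), with the path matching my setup:

# Strip the trailing "|0" multi-speaker id from every line of train.txt.
path = "./datasets/processed_neil2/train.txt"
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]
with open(path, "w", encoding="utf-8") as f:
    for line in lines:
        if line.endswith("|0"):
            line = line[:-len("|0")]
        f.write(line + "\n")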

Two questions @r9y9

  1. If I wrote this up for the README, would you be interested in a PR?
  2. For the vctk hack (ie in nnmnkwii), is there a smarter way to handle it?

@r9y9
Owner

r9y9 commented Jun 30, 2018

You did a great job, but I think the better way to handle your own dataset is to follow https://github.com/r9y9/deepvoice3_pytorch#1-1-building-custom-dataset-using-json_meta. I just tried to build my own toy dataset and I understand that it could be explained more carefully (it took me more than ten minutes to figure out what exactly alignment.json should be; doc improvements are of course welcome!), but it's not that hard. I'll leave the steps I used to create alignment.json as an example:

> ls -l
total 936
-rw-rw-r-- 1 ryuichi ryuichi    152  6月 30 22:32 LJ001-0001.txt
-rw-r--r-- 1 ryuichi ryuichi 425830  6月 30 22:29 LJ001-0001.wav
-rw-rw-r-- 1 ryuichi ryuichi     31  6月 30 22:32 LJ001-0002.txt
-rw-r--r-- 1 ryuichi ryuichi  83814  6月 30 22:29 LJ001-0002.wav
-rw-rw-r-- 1 ryuichi ryuichi    156  6月 30 22:32 LJ001-0003.txt
-rw-r--r-- 1 ryuichi ryuichi 426342  6月 30 22:29 LJ001-0003.wav
-rw-rw-r-- 1 ryuichi ryuichi    417  6月 30 22:45 alignment.json

> echo "{" > alignment.json; for f in $(find $PWD -type f -name "*.wav"); do { g=${f/.wav/.txt}; echo \"${f}\": \"${g}\", >> alignment.json } done; echo "}" >> alignment.json
# remove last comma manually
> emacs -nw alignment.json 
# check that we have correct json format
> cat alignment.json | jq .
{
  "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0002.wav": "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0002.txt",
  "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0001.wav": "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0001.txt",
  "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0003.wav": "/home/ryuichi/Dropbox/sp/deepvoice3_pytorch/foobar/LJ001-0003.txt"
}
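
If you'd rather skip the manual comma edit, a minimal Python sketch, assuming the same foobar directory of paired .wav/.txt files as above, can write the mapping as valid JSON directly:

# Sketch: build alignment.json mapping each wav path to its .txt path,
# mirroring the shell one-liner above (assumes ./foobar with paired files).
import glob
import json
import os

wav_dir = os.path.abspath("foobar")
mapping = {}
for wav in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
    mapping[wav] = os.path.splitext(wav)[0] + ".txt"

with open(os.path.join(wav_dir, "alignment.json"), "w") as f:
    json.dump(mapping, f, indent=2)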

After you have alignment.json, do preprocessing as usual:

python preprocess.py json_meta "./foobar/alignment.json" output_dir

The steps you described after "Renamed files to have speaker prefix 'p225_' (chosen to match 1st one in vctk script)" are not actually necessary.

@r9y9
Owner

r9y9 commented Jun 30, 2018

If I wrote this up for the README, would you be interested in a PR?

Yes!

For the vctk hack (ie in nnmnkwii), is there a smarter way to handle it?

Yes. See my previous comment.

if not is_wav:
  #files = list(map(lambda s: open(s, "rb").read().decode("utf-8")[:-1], files))
  files = list(map(lambda s: open(s, "rb").read().decode("utf-8"), files))

If this is a real issue (sorry, I forgot why I put the [:-1] there), I'm happy if you would create a PR.

@nmstoker
Author

nmstoker commented Jul 1, 2018

Thanks a lot for the update.

I could be wrong, but I think your steps don't take account of the gentle processing. I'd already managed to get regular training working as per section 1.1, but I was trying to use the .lab files that gentle created (as I thought that better knowledge of the positions of the words within the .wav files might help with training), and so was trying to do the steps suggested by 1.2.

I tried to follow section 1.2 but was struggling due to not knowing the structure to aim for with the files/folders. My steps above did seem to manage to incorporate the gentle .lab files etc, but if there's a smarter way to handle that part (ie what is implied in 1.2), I'd be keen to discuss it.

Did you find using gentle helpful? It's a bit subjective, but I think my results were marginally improved. I'm now focusing on weeding out some bad quality data that I found in my dataset and then will do some more runs.

@r9y9
Owner

r9y9 commented Jul 2, 2018

While I didn't prepare label files myself, python preprocess.py json_meta "./foobar/alignment.json" output_dir should take label files into account if they exist:

lab_path = wav_path.replace("wav48/", "lab/").replace(".wav", ".lab")
if not exists(lab_path):
    lab_path = os.path.splitext(wav_path)[0] + '.lab'
# Trim silence from hts labels if available
if exists(lab_path):
    labels = hts.load(lab_path)
    b = int(start_at(labels) * 1e-7 * sr)
    e = int(end_at(labels) * 1e-7 * sr)
    wav = wav[b:e]
    wav, _ = librosa.effects.trim(wav, top_db=25)
else:
    if hparams.process_only_htk_aligned:
        return None
    wav, _ = librosa.effects.trim(wav, top_db=15)
# End added from the multispeaker version
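
(For reference, the .lab files here are HTK-style label files: one "start end label" line per segment, with times in units of 100 ns, which is where the 1e-7 factor above comes from. A rough, hypothetical sketch of inspecting one in plain Python, with an assumed file name:)

# Rough sketch, not from the repo: print each segment of an HTK-style .lab
# file in seconds. The file name is hypothetical; times are in 100 ns units.
with open("p225_1.lab") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        start, end = int(fields[0]), int(fields[1])
        label = " ".join(fields[2:])
        print("%.3f\t%.3f\t%s" % (start * 1e-7, end * 1e-7, label))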

@engiecat
Contributor

engiecat commented Jul 10, 2018

Hi.
As I had initially proposed the gentle-based preprocessing, as well as the custom dataset support, I am really sorry for the vague documentation. (English is not my first language :/)

For the json format, in principle the format should be like
{
  "path-to-wav-file": "transcript of wav file",
  (continued)
}
in accordance with the format of carpedm20/multi-speaker-tacotron-tensorflow. It differs from the explanation given above.

(Other formats, where a list of alignment candidates is placed instead of the transcript text, are also supported (because carpedm20's automatic alignment creates such a format), but the format mentioned above works well.)
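
As a quick, hypothetical illustration (not part of the repo), an alignment.json in the format above can be sanity-checked like this:

# Load alignment.json (keys: wav paths, values: transcripts or .txt paths)
# and report any wav files that do not exist on disk.
import json
import os

with open("alignment.json", encoding="utf-8") as f:
    alignment = json.load(f)

for wav_path in alignment:
    if not os.path.exists(wav_path):
        print("missing wav:", wav_path)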

For the effectiveness of Gentle, you can refer to #78, where I directly compared the performance. In summary, the merlin-based vctk_preprocess didn't work at all on a 'dirty' dataset, while gentle did work and did improve the TTS performance. It works well when dealing with datasets that include inconsistent silences (e.g. breathing) during speech.

I totally agree that the doc needs some improvement. Any suggestions will be welcome, and I will try to improve it!! :)
(At least, I should rewrite the json-based custom metadata part and explain the alignment json format explicitly.)

+) I just discovered that the gentle alignment step could be driven by alignment.json files alone (as they provide the path to each wav and its transcript). Would it be better if the gentle alignment step supported json input, instead of txt and wav file patterns?

@nmstoker
Author

Hi @engiecat, it's great that you suggested gentle and it was included 😀
Regarding your suggestion of the gentle alignment step being covered with alignment.json files, this sounds like it would slightly simplify the overall pipeline of steps needed and therefore would indeed be helpful! (although the extra steps aren't that hard)

@engiecat
Contributor

@nmstoker
okay! thanks for the input.

@stale

stale bot commented May 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 30, 2019
@stale stale bot closed this as completed Jun 6, 2019