Feb 10th Meeting #2

realzza · 2023-02-11T16:07:56Z

Check: Does VAD change speech data in data prep (P1)

No. The VAD step only computes the VAD information and stores it in the dumpdir, in the file vad.scp. It marks the non-speech segments so that they can be excluded from training. The excluded silences could affect our reconstruction loss, but leaving them out can improve the quality of the synthesized audio. This is a tradeoff we need to be aware of.
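
To make the "VAD only marks, it does not modify" point concrete, here is a minimal sketch. It assumes the VAD information resolves to one 0/1 flag per feature frame (Kaldi-style); the actual layout behind vad.scp in our recipe is not confirmed here, and `drop_nonspeech` is a hypothetical helper, not an ESPnet function.

```python
def drop_nonspeech(frames, vad_flags):
    """Return only the frames flagged as speech (flag == 1).

    Assumption: vad_flags holds one 0/1 decision per frame.
    The input list is not modified; a new list is returned.
    """
    if len(frames) != len(vad_flags):
        raise ValueError("frames and vad_flags must have the same length")
    return [frame for frame, keep in zip(frames, vad_flags) if keep]


# Toy data: 5 feature frames, frames 0 and 3 marked as non-speech.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
vad_flags = [0, 1, 1, 0, 1]

speech_frames = drop_nonspeech(frames, vad_flags)
```

The original `frames` stay untouched, which mirrors the answer above: the speech data itself is unchanged, and only the training view skips the flagged segments.
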
Keep VITS with xvector and VAD training
~~Used trained decoder (with aligned sample rate) to re-decode, see if speaker information is perceptible (p2) #3~~
- No, the decoded wav sample rate is still 22050. Trying the following steps:
  - check the training process
  - check how the sample rate is used in tts_inference.py
- Inference jobs have not been eligible for submission since Feb 13th, so I couldn't decode to check whether the output meets the requirement.
- Applied retrained model. Speaker information is integrated! /ocean/projects/cis210027p/zzhou5/espnet/egs2/librispeech_100/tts_vits/exp/16k_xvector/tts_beta_lib100_vits_tts_all16k_char_xvector/decode_with_trained_16k_vocoder
If 3 does not work, consult Jiatong (p2)
Run inference w/o trained vocoder
Integrate VITS model in cyclic systems (p3)
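
For the sample rate check in the re-decode item above, a quick way to confirm what a decoded file actually contains is to read its header. This is a minimal sketch using Python's standard `wave` module, which only handles PCM WAV files; the function names and the 16000 default are assumptions for illustration (the default is taken from the 16k experiment name).

```python
import wave


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()


def check_sample_rate(path, expected=16000):
    """Return (matches, actual_rate) for the WAV file at `path`."""
    actual = wav_sample_rate(path)
    return actual == expected, actual
```

Running `check_sample_rate` on a file from the decode directory (substitute a real wav path) would return `(False, 22050)` for the mismatch described above.
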



realzza added the todo label Feb 11, 2023

realzza added this to the VITS-tts milestone Feb 11, 2023

realzza self-assigned this Feb 11, 2023

realzza pushed a commit that referenced this issue Oct 23, 2023

Merge pull request #2 from brianyan918/master

d3f9f10

merge master to md_pr

realzza pushed a commit that referenced this issue Nov 1, 2023

Merge pull request #2 from realzza/sm_s2st

fbaf306

Sm s2st
