
Incorrect training data #15

Closed

KlausBuchegger opened this issue May 9, 2018 · 24 comments

@KlausBuchegger

Hi,
so I trained Kaldi using your (old) s5 script, and as a sanity check I tried to decode the training data.
When I compared the text file from the training data to my results, I noticed that there seem to be quite a number of errors in the texts.
I checked the audio and xml files and saw that the sentences were wrong.

I added a screenshot of a partial vimdiff of the text and my results.

[screenshot: partial vimdiff of the reference texts vs. the decoded output]

This appears to be a mix-up, as those sentences do exist, but in other audio files.

@bmilde
Contributor

bmilde commented May 9, 2018

Hi,
thanks for opening the issue. I was aware of this and am currently investigating it.

By running
https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/run_cleanup_segmentation.sh
from the new set of scripts in s5_r2, a cleaned training directory is generated. I've compared it to the train set, and these seem to be the broken IDs:

https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/cleanup/problematic_wavs.txt

The cleanup script also decodes the training set (with a biased language model built from the reference), so that's similar to what you did. The IDs from s5_r2 are similar to those in s5; there is just a suffix added for the different microphones (a, b, c), since s5 didn't use all the available data.

Can you send me your diff, and/or would you be able to fix the corpus XMLs directly? Otherwise, since this doesn't seem to be a lot of the total data (about 1.5%), I'd simply remove the broken data from the next release (v3) of the corpus.

@svenha

svenha commented May 28, 2018

If fixing the XML files needs some helping hands and makes sense, let me know.

@bmilde
Contributor

bmilde commented May 28, 2018 via email

@bmilde
Contributor

bmilde commented May 30, 2018

Here is our planned v3 package where hopefully most of the bad utterances are moved into a separate folder: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v3.tar.gz

@KlausBuchegger can you upload your diff and decoded output?

@svenha

svenha commented May 30, 2018

Two questions about the file problematic_wavs.txt.

  • It lists 730 files, but the directory train_removed/ contains 363 recordings (5 files per recording). Why?
  • What is the role of the transcript in problematic_wavs.txt? Is it the incorrect transcript from the current .xml file?

I think the uploads from @KlausBuchegger might be helpful ...

@fbenites

fbenites commented Jul 2, 2018

Hello,
I am trying to train DeepSpeech, and two days ago I got the feeling that the reason the model does not converge well might lie in the data. I also checked the data and have a rather preliminary impression.
As far as I can see, some transcriptions were deleted, but some got mixed up, so the text exists but is (or was) assigned to a different recording (or, of course, the same text was spoken multiple times):
train/2014-03-17-13-03-33_Kinect-Beam.wav should have the same text as train/2014-05-08-11-48-47_Kinect-Beam.wav

What speaks for a mixed-up (wrong) assignment is that sometimes I have the impression the text was just shifted:

train/2014-03-17-13-03-49_Kinect-Beam.wav should have the same text as train/2014-03-17-13-03-33_Kinect-Beam.wav
and
train/2014-03-17-13-04-15_Kinect-Beam.wav and train/2014-03-17-13-04-43_Kinect-Beam.wav

The timestamps are pretty close (these are neighboring lines in my training file).
Should we coordinate the work of sorting this out?

@svenha

svenha commented Jul 2, 2018

@fbenites Yes, it would be good to distribute the correction work. Benjamin (bmilde) offered to produce a list of problematic transcripts by running the recognizer on the train set. When this is done, let us split the manual checking work into two parts.

@bmilde
Contributor

bmilde commented Jul 3, 2018

@fbenites @svenha thanks for offering to help. I'm running the decode on the train set right now and should have the results soon.

@svenha As for the number of problematic utterances in the proposed v3 tar: the numbers differ because I excluded the wav files of all microphones whenever the decode of at least one microphone failed. There are multiple microphone recordings of the same utterance; better safe than sorry. Still, only about 1.5% of all utterances are problematic, but the problematic files tend to cluster in the same recording session(s).

@fbenites Low RNN-CTC performance will probably remain even if we fix all the problematic files. There is probably not enough data for end-to-end RNN training (40h x number of microphones, but that is more like doing augmentation learning on 40h of data). What kind of WERs are you seeing with DeepSpeech at the moment? Are you training with or without a phonetic dictionary? Utterance variance is unfortunately also fairly low; there are only about 3000 distinct sentences in train/dev/test combined, so I suggest using a phoneme dictionary if possible. I also suggest adding German speech data from SWC (https://nats.gitlab.io/swc/); it worked very well in our s5_r2 scripts in this repository (18.39% WER dev / 19.60% WER test now).

@fbenites

fbenites commented Jul 3, 2018

@bmilde Thanks, I will have a look at the Wikipedia data (SWC).
I am not certain which files are affected; I already removed some problematic ones. I am also using github.com/ynop/audiomate for processing, which already covers some of the files, but I added the other 700 to its blacklist.
I will have some results tomorrow. WER is also complicated (see https://github.com/ynop/deepspeech-german): sometimes the text is just missing some characters or spaces, which hurts the WER a lot. I got useful results only when using VoxForge alone, which contradicts the results with VoxForge and Tuda combined. I will check the phoneme dictionary in DeepSpeech, thanks.

@bmilde
Contributor

bmilde commented Jul 6, 2018

@fbenites @svenha

I uploaded the decode of the tuda train set to: http://speech.tools/decode_tuda_train.tar.gz

This file might be interesting:

exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt

or alternatively you can also diff e.g.:

diff exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/penalty_0.5/10.txt exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/test_filt.txt

This doesn't look so pretty in standard diff, though, so a graphical diff tool like the one in Klaus's screenshot is probably a better idea.

@svenha

svenha commented Jul 6, 2018

Thanks @bmilde. I investigated the per_utt file with a little script that sums the S, I, and D parts of the "#csid" fields, counting all of them as errors of equal weight. Then I used an error threshold t. For t=4, there are 2490 affected files; for t=5, 1788 files; for t=6, 1511 files. If I ignore the microphone suffix (_a etc.), there remain 616 files to be checked for t=6. As we have only 2 annotators, I would suggest using t=6. (We can repeat the decoding of the train set with the improved corpus and do a second annotation round.) If you agree, I can produce the file list, sort it, and cut it in the middle. I would take the first half, @fbenites the second part?
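
For reference, a minimal Python sketch of such a counting script (a sketch only, assuming Kaldi's "#csid" summary lines, which list correct/substitution/insertion/deletion counts; the path and threshold are illustrative):

# Minimal sketch: sum S+I+D per utterance from a Kaldi per_utt file.
# Assumes lines of the form "<utt_id> #csid <C> <S> <I> <D>".
PER_UTT = "exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt"
T = 6  # absolute error threshold

errors = {}
with open(PER_UTT, encoding="utf-8") as f:
    for line in f:
        fields = line.split()
        if len(fields) == 6 and fields[1] == "#csid":
            correct, sub, ins, dele = map(int, fields[2:6])
            errors[fields[0]] = sub + ins + dele

flagged = [u for u, e in errors.items() if e >= T]
# Drop the microphone suffix (_a, _b, ...) to count distinct recordings.
recordings = {u.rsplit("_", 1)[0] for u in flagged}
print(f"{len(flagged)} affected files, {len(recordings)} distinct recordings (t={T})")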

@bmilde : How should we contribute the changes? We could edit the .xml files, collect all changed .xml files and send them to you. But I am open to other approaches, like a normal pull request (if the repo is not too large for this).

One final point: in per_utt, there are file names like 02dae8284f104451a8de85538da6fdec_20140317140355_a. Is it safe to assume that each corresponds 1:1 to an XML file derived from the date/time part, here 2014-03-17-14-03-55.xml?

@bmilde
Contributor

bmilde commented Jul 6, 2018

Many thanks, @svenha!

Yes, it is safe to assume that 02dae8284f104451a8de85538da6fdec_20140317140355_a belongs to 2014-03-17-14-03-55.xml

02dae8284f104451a8de85538da6fdec is the speaker hash, the last letter indicates the microphone, and in between the IDs contain the timestamp without the dashes. Note that it's also safe to assume that _a, _b, _c, etc. all belong to the same utterance and should contain the same transcription. If all of them decode to something else, it's safe to assume an incorrect transcription.
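
For illustration, a minimal sketch of this ID convention (the helper name is hypothetical):

# Hypothetical helper: map "<speaker_hash>_<YYYYMMDDHHMMSS>_<mic>" to the XML file name.
def utt_id_to_xml(utt_id):
    speaker_hash, t, mic = utt_id.split("_")
    return "-".join([t[0:4], t[4:6], t[6:8], t[8:10], t[10:12], t[12:14]]) + ".xml"

assert utt_id_to_xml("02dae8284f104451a8de85538da6fdec_20140317140355_a") == "2014-03-17-14-03-55.xml"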

Maybe it's also a good idea to make the threshold t dependent on the length of the transcription; there are some very short utterances, too. But I can also always rerun the decoding for you after getting the corrected XML files.

Since the corpus is not hosted on GitHub, it's probably easier if you send me the corrected XML files directly. But you can also send me a pull request containing only the corrected XML files, placed somewhere in a subfolder of the local directory. I can then also write a script that checks the corpus files and patches them if needed, so that it is not necessary to redownload the whole tar.gz file.
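
Such a patch script could be as simple as this sketch (the folder names are assumptions, not fixed paths):

# Minimal sketch: overwrite corpus XMLs with corrected versions of the same name.
import shutil
from pathlib import Path

corrections = Path("local/corrected_xml")      # assumed location of the fixed files
corpus = Path("german-speechdata-package-v2")  # assumed corpus root

for fixed in corrections.glob("*.xml"):
    for original in corpus.rglob(fixed.name):
        if original.read_bytes() != fixed.read_bytes():
            shutil.copyfile(fixed, original)
            print("patched", original)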

One final point: Note that the xml files have sentence IDs.

E.g. sentence_id 59 in:

<?xml version="1.0" encoding="utf-8"?><recording><speaker_id>02dae828-4f10-4451-a8de-85538da6fdec</speaker_id><rate>16000</rate><angle>0</angle><gender>male</gender><ageclass>21-30</ageclass><sentence_id>59</sentence_id><sentence>Rom wurde damit zur ‚De-Facto-Vormacht‘ im östlichen Mittelmeerraum.</sentence><cleaned_sentence>Rom wurde damit zur De Facto Vormacht im östlichen Mittelmeerraum</cleaned_sentence><corpus>WIKI</corpus><muttersprachler>Ja</muttersprachler><bundesland>Hessen</bundesland><sourceurls><url>https://de.wikipedia.org/wiki/Römisches_Reich</url></sourceurls></recording>

A text file with all of the sentence IDs and transcriptions is in the root of the corpus archive. Since there are multiple recordings per sentence ID, it is very unlikely that the correct sentence is not included. Maybe we can also just try to find the closest match automatically and check that it's correct manually?
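
A minimal sketch of that closest-match idea (the sentence file name and its tab-separated format are assumptions):

# Minimal sketch: fuzzy-match a decoded sentence against the corpus sentence list.
import difflib

def load_sentences(path="SentencesAndIds.txt"):  # assumed name and format
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sid, _, text = line.rstrip("\n").partition("\t")
            sentences[sid] = text
    return sentences

def closest_sentence(decoded, sentences):
    # Best fuzzy match over all reference sentences; returns (sentence_id, sentence).
    best = difflib.get_close_matches(decoded, list(sentences.values()), n=1, cutoff=0.0)[0]
    sid = next(i for i, s in sentences.items() if s == best)
    return sid, best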

@svenha

svenha commented Jul 9, 2018

I switched from an absolute threshold to a relative threshold, as suggested by @bmilde: number_of_errors / number_of_words >= 0.2.
I include an xml file only if all of its recordings (i.e. microphones a, b, c, and d) fulfill this criterion (see the sketch after the attached file lists).
This gave me 744 xml file names, which I attach below in two parts of 372 file names each. I will check the first part (files1) now and send corrected xml files. (We will see how these changes must be propagated to the SentencesAndIds files.) If the sentence-id was moved by 1 or similar (as noted by others), I will correct the sentence-id and delete the elements sentence and cleaned_sentence.

tuda-20perc-err.files1.txt
tuda-20perc-err.files2.txt
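
A minimal sketch of the selection criterion above (assuming per-utterance error and word counts collected as in the earlier per_utt sketch):

# Flag a recording only if every microphone version has error_rate >= 0.2.
from collections import defaultdict

def select_recordings(per_utt_stats, threshold=0.2):
    # per_utt_stats: {utt_id: (n_errors, n_ref_words)}; utt_id ends in _a/_b/...
    rates = defaultdict(list)
    for utt_id, (errs, words) in per_utt_stats.items():
        rates[utt_id.rsplit("_", 1)[0]].append(errs / max(words, 1))
    return sorted(rec for rec, rs in rates.items() if all(r >= threshold for r in rs))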

@svenha

svenha commented Jul 17, 2018

tuda-20perc-err.files1.txt is finished; my manual correction speed was around 2 minutes per recording. I will send the files to @bmilde. Any volunteers for tuda-20perc-err.files2.txt? (If not, I might have time in August.)

@svenha

svenha commented Aug 8, 2018

Just to avoid duplicate work, I'd like to let you know that I am working on files2.

@svenha

svenha commented Oct 5, 2018

The second half (i.e. files2) was finished some weeks ago. Is there a rough estimate of the release date of tuda v3?


@fbenites

fbenites commented Nov 9, 2018

Hi,

Sorry for the radio silence; I was caught up in other projects.
Where is the cleaned data? I would like to check it further. At http://speech.tools/ I get a 403.

Thanks again!

@akoehn
Collaborator

akoehn commented Nov 11, 2018

@silenterus: It would be nice to keep the discussion on-topic. This bug is about mix-ups and errors in the data files. Feel free to open a new bug about training DeepSpeech with our data.

@fbenites: speech.tools is hosted by @bmilde AFAIK; maybe he can fix that.

@silenterus

Sorry, you are absolutely right.
I will put the results on my Git repo.
Keep up the good work!

@svenha

svenha commented Dec 11, 2020

Are there any plans to integrate all the corrections from 2018 or later into a new corpus version?

@Alienmaster
Member

I created a repository with the whole ~20 GB dataset here: https://github.com/Alienmaster/TudaDataset
For revision v4 I removed the incorrect data mentioned here and added the corrections made by @svenha.
Currently I am training on this dataset without any errors.
Feel free to download and test the new revision.

@svenha

svenha commented Feb 18, 2022

@Alienmaster Thanks for picking this up, and for the clever issue template in the new TudaDataset repo.

If you have any new evaluation results, please let us know :-)

@bmilde
Contributor

bmilde commented Mar 26, 2022

Closing this; release 4 of the Tuda dataset contains the fixes: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v4.tar.gz

Our CV7 branch already uses this to train the models, together with the newest Common Voice data. See https://github.com/uhh-lt/kaldi-tuda-de/tree/CV7

@bmilde bmilde closed this as completed Mar 26, 2022