
Incorrect training data #15

Closed

KlausBuchegger opened this issue May 9, 2018 · 24 comments

@KlausBuchegger

Hi,
so I trained Kaldi using your (old) s5 script, and as a sanity check I tried to decode the training data.
When I compared the text file from the training data to my results, I noticed that there seem to be quite a number of errors in the texts.
I checked the audio and xml files and saw that the sentences were wrong.

I added a screenshot of a partial vimdiff of the text and my results.

[screenshot: partial vimdiff of the reference texts vs. the decoded output]

This appears to be a mix-up, as those sentences do exist, but in other audio files.

@bmilde
Contributor

bmilde commented May 9, 2018

Hi,
thanks for opening the issue. I was aware of this and am currently investigating it.

By running
https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/run_cleanup_segmentation.sh
from the new set of scripts in s5_r2, a cleaned training directory is generated. I've compared it to the train set, and these seem to be the broken IDs:

https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/cleanup/problematic_wavs.txt

The cleanup script also decodes the training set (with a biased language model built from the reference), so that's similar to what you did. The IDs from s5_r2 are similar to those in s5; there is just a suffix added for the different microphones (a, b, c), since s5 didn't use all the available data.

Can you send me your diff, and/or would you be able to fix the corpus XMLs directly? Otherwise, since this doesn't seem to be a lot of the total data (about 1.5%), I'd simply remove the broken data from the next release (v3) of the corpus.

@svenha

svenha commented May 28, 2018

If fixing the XML files needs some helping hands and makes sense, let me know.

@bmilde
Contributor

bmilde commented May 28, 2018 via email

@bmilde
Contributor

bmilde commented May 30, 2018

Here is our planned v3 package where hopefully most of the bad utterances are moved into a separate folder: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v3.tar.gz

@KlausBuchegger can you upload your diff and decoded output?

@svenha

svenha commented May 30, 2018

Two questions about the file problematic_wavs.txt.

  • It lists 730 files, but the directory train_removed/ contains 363 recordings (5 files per recording). Why?
  • What is the role of the transcript in problematic_wavs.txt? Is it the incorrect transcript from the current .xml file?

I think the uploads from @KlausBuchegger might be helpful ...

@fbenites

fbenites commented Jul 2, 2018

Hello,
I am trying to train DeepSpeech, and two days ago I got the feeling that the reason the model does not converge well might lie in the data. I also checked the data and have a rather preliminary impression.
As far as I can see, some transcriptions were deleted, but some got mixed up, so the text exists but is (or was) assigned to a different recording (or, of course, the same text was spoken multiple times):
train/2014-03-17-13-03-33_Kinect-Beam.wav should have the same text as train/2014-05-08-11-48-47_Kinect-Beam.wav

What speaks for a mixed-up (wrong) assignment is that sometimes I have the impression the text was just shifted:

train/2014-03-17-13-03-49_Kinect-Beam.wav should have the same text as train/2014-03-17-13-03-33_Kinect-Beam.wav
and
train/2014-03-17-13-04-15_Kinect-Beam.wav and train/2014-03-17-13-04-43_Kinect-Beam.wav

The timestamps are pretty close (these are neighboring lines in my training file).
Should we coordinate the work of sorting this out?

@svenha

svenha commented Jul 2, 2018

@fbenites Yes, it would be good to distribute the correction work. Benjamin (bmilde) offered to produce a list of problematic transcripts by running the recognizer on the train set. When this is done, let us split the manual checking work into two parts.

@bmilde
Contributor

bmilde commented Jul 3, 2018

@fbenites @svenha thanks for offering to help. I'm running the decode on the train set right now and should have the results soon.

@svenha As for the number of problematic utterances in the proposed v3 tar: the numbers differ because I excluded the wav files of all microphones whenever the decode of at least one microphone failed. There are multiple microphone recordings of the same utterance; better safe than sorry. Still, only about 1.5% of all utterances are problematic, but the problematic files tend to cluster in the same recording session(s).

@fbenites Low RNN-CTC performance will probably remain even if we fix all the problematic files. There is probably not enough data for end-to-end RNN training (40h x number of microphones, but that is more like doing augmentation learning on 40h of data). What kind of WERs are you seeing with DeepSpeech at the moment? Are you training with or without a phonetic dictionary? Utterance variance is unfortunately also fairly low; there are only about 3000 distinct sentences in train/dev/test combined, so I suggest using a phoneme dictionary if possible. I also suggest adding German speech data from SWC (https://nats.gitlab.io/swc/); it worked very well in our s5_r2 scripts in this repository (18.39% WER dev / 19.60% WER test now).

@fbenites

fbenites commented Jul 3, 2018

@bmilde Thanks, I will have a look at the Wikipedia data (SWC).
I am not certain which files are affected; I already removed some problematic ones. I am also using github.com/ynop/audiomate for processing, which already covers some of the files, but I added the other 700 to its blacklist.
I will have some results tomorrow. WER is also complicated (see https://github.com/ynop/deepspeech-german): sometimes the text is just missing some characters or spaces, which hurts the WER a lot. I got useful results only when using VoxForge alone, which contradicts the results with VoxForge and Tuda combined. I will check the phoneme dictionary in DeepSpeech, thanks.

@bmilde
Contributor

bmilde commented Jul 6, 2018

@fbenites @svenha

I uploaded the decode of the tuda train set to: http://speech.tools/decode_tuda_train.tar.gz

This file might be interesting:

exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt

or alternatively you can also diff e.g.:

diff exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/penalty_0.5/10.txt exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/test_filt.txt

This doesn't look so pretty in standard diff, though, so a graphical diff tool like the one in Klaus's screenshot is probably a better idea.

@svenha

svenha commented Jul 6, 2018

Thanks @bmilde. I investigated the per_utt file with a little script that sums the S, I, and D parts of the "#csid" fields, counting all of them as errors of equal weight. Then I used an error threshold t. For t=4, there are 2490 affected files; for t=5, 1788 files; for t=6, 1511 files. If I ignore the microphone suffix (_a etc.), there remain 616 files to be checked for t=6. As we have only 2 annotators, I would suggest using t=6. (We can repeat the decoding of the train set with the improved corpus and do a second annotation round.) If you agree, I can produce the file list, sort it, and cut it in the middle. I would take the first half, @fbenites the second part?
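
For reference, a minimal Python sketch of such a counting script (a sketch only, assuming Kaldi's "#csid" summary lines, which list correct/substitution/insertion/deletion counts; the path and threshold are illustrative):

# Minimal sketch: sum S+I+D per utterance from a Kaldi per_utt file.
# Assumes lines of the form "<utt_id> #csid <C> <S> <I> <D>".
PER_UTT = "exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt"
T = 6  # absolute error threshold

errors = {}
with open(PER_UTT, encoding="utf-8") as f:
    for line in f:
        fields = line.split()
        if len(fields) == 6 and fields[1] == "#csid":
            correct, sub, ins, dele = map(int, fields[2:6])
            errors[fields[0]] = sub + ins + dele

flagged = [u for u, e in errors.items() if e >= T]
# Drop the microphone suffix (_a, _b, ...) to count distinct recordings.
recordings = {u.rsplit("_", 1)[0] for u in flagged}
print(f"{len(flagged)} affected files, {len(recordings)} distinct recordings (t={T})")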

@bmilde : How should we contribute the changes? We could edit the .xml files, collect all changed .xml files and send them to you. But I am open to other approaches, like a normal pull request (if the repo is not too large for this).

One final point: in per_utt, there are file names like 02dae8284f104451a8de85538da6fdec_20140317140355_a. Is it safe to assume that each corresponds 1:1 to an XML file derived from the date/time part, here 2014-03-17-14-03-55.xml?

@bmilde
Contributor

bmilde commented Jul 6, 2018

Many thanks, @svenha!

Yes, it is safe to assume that 02dae8284f104451a8de85538da6fdec_20140317140355_a belongs to 2014-03-17-14-03-55.xml

02dae8284f104451a8de85538da6fdec is the speaker hash, the last letter indicates the microphone, and in between the IDs contain the timestamp without the dashes. Note that it's also safe to assume that _a, _b, _c, etc. all belong to the same utterance and should contain the same transcription. If all of them decode to something else, it's safe to assume an incorrect transcription.
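
For illustration, a minimal sketch of this ID convention (the helper name is hypothetical):

# Hypothetical helper: map "<speaker_hash>_<YYYYMMDDHHMMSS>_<mic>" to the XML file name.
def utt_id_to_xml(utt_id):
    speaker_hash, t, mic = utt_id.split("_")
    return "-".join([t[0:4], t[4:6], t[6:8], t[8:10], t[10:12], t[12:14]]) + ".xml"

assert utt_id_to_xml("02dae8284f104451a8de85538da6fdec_20140317140355_a") == "2014-03-17-14-03-55.xml"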

Maybe it's also a good idea to make the threshold t dependent on the length of the transcription; there are some very short utterances, too. But I can also always rerun the decoding for you after getting the corrected XML files.

Since the corpus is not hosted on GitHub, it's probably easier if you send me the corrected XML files directly. But you can also send me a pull request containing only the corrected XML files, placed somewhere in a subfolder of the local directory. I can then also write a script that checks the corpus files and patches them if needed, so that it is not necessary to redownload the whole tar.gz file.
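
Such a patch script could be as simple as this sketch (the folder names are assumptions, not fixed paths):

# Minimal sketch: overwrite corpus XMLs with corrected versions of the same name.
import shutil
from pathlib import Path

corrections = Path("local/corrected_xml")      # assumed location of the fixed files
corpus = Path("german-speechdata-package-v2")  # assumed corpus root

for fixed in corrections.glob("*.xml"):
    for original in corpus.rglob(fixed.name):
        if original.read_bytes() != fixed.read_bytes():
            shutil.copyfile(fixed, original)
            print("patched", original)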

One final point: Note that the xml files have sentence IDs.

E.g. sentence_id 59 in:

<?xml version="1.0" encoding="utf-8"?><recording><speaker_id>02dae828-4f10-4451-a8de-85538da6fdec</speaker_id><rate>16000</rate><angle>0</angle><gender>male</gender><ageclass>21-30</ageclass><sentence_id>59</sentence_id><sentence>Rom wurde damit zur ‚De-Facto-Vormacht‘ im östlichen Mittelmeerraum.</sentence><cleaned_sentence>Rom wurde damit zur De Facto Vormacht im östlichen Mittelmeerraum</cleaned_sentence><corpus>WIKI</corpus><muttersprachler>Ja</muttersprachler><bundesland>Hessen</bundesland><sourceurls><url>https://de.wikipedia.org/wiki/Römisches_Reich</url></sourceurls></recording>

A text file with all of the sentence IDs and transcriptions is in the root of the corpus archive. Since there are multiple recordings per sentence ID, it is very unlikely that the correct sentence is not included. Maybe we can also just try to find the closest match automatically and check that it's correct manually?
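
A minimal sketch of that closest-match idea (the sentence file name and its tab-separated format are assumptions):

# Minimal sketch: fuzzy-match a decoded sentence against the corpus sentence list.
import difflib

def load_sentences(path="SentencesAndIds.txt"):  # assumed name and format
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sid, _, text = line.rstrip("\n").partition("\t")
            sentences[sid] = text
    return sentences

def closest_sentence(decoded, sentences):
    # Best fuzzy match over all reference sentences; returns (sentence_id, sentence).
    best = difflib.get_close_matches(decoded, list(sentences.values()), n=1, cutoff=0.0)[0]
    sid = next(i for i, s in sentences.items() if s == best)
    return sid, best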

@svenha

svenha commented Jul 9, 2018

I switched from an absolute threshold to a relative threshold, as suggested by @bmilde: number_of_errors / number_of_words >= 0.2.
I include an xml file only if all of its recordings (i.e. microphones a, b, c, and d) fulfill this criterion (see the sketch after the attached file lists).
This gave me 744 xml file names, which I attach below in two parts of 372 file names each. I will check the first part (files1) now and send corrected xml files. (We will see how these changes must be propagated to the SentencesAndIds files.) If the sentence-id was moved by 1 or similar (as noted by others), I will correct the sentence-id and delete the elements sentence and cleaned_sentence.

tuda-20perc-err.files1.txt
tuda-20perc-err.files2.txt
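
A minimal sketch of the selection criterion above (assuming per-utterance error and word counts collected as in the earlier per_utt sketch):

# Flag a recording only if every microphone version has error_rate >= 0.2.
from collections import defaultdict

def select_recordings(per_utt_stats, threshold=0.2):
    # per_utt_stats: {utt_id: (n_errors, n_ref_words)}; utt_id ends in _a/_b/...
    rates = defaultdict(list)
    for utt_id, (errs, words) in per_utt_stats.items():
        rates[utt_id.rsplit("_", 1)[0]].append(errs / max(words, 1))
    return sorted(rec for rec, rs in rates.items() if all(r >= threshold for r in rs))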

@svenha

svenha commented Jul 17, 2018

tuda-20perc-err.files1.txt is finished; my manual correction speed was around 2 minutes per recording. I will send the files to @bmilde. Any volunteers for tuda-20perc-err.files2.txt? (If not, I might have time in August.)

@svenha

svenha commented Aug 8, 2018

Just to avoid duplicate work, I'd like to let you know that I am working on files2.

@svenha

svenha commented Oct 5, 2018

The second half (i.e. files2) was finished some weeks ago. Is there a rough estimate of the release date of tuda v3?


@fbenites

fbenites commented Nov 9, 2018

Hi,

Sorry for the radio silence; I was caught up in other projects.
Where is the cleaned data? I would like to check it further. At http://speech.tools/ I get a 403.

Thanks again!

@akoehn
Collaborator

akoehn commented Nov 11, 2018

@silenterus: It would be nice to keep the discussion on-topic. This bug is about mix-ups and errors in the data files. Feel free to open a new bug about training DeepSpeech with our data.

@fbenites: speech.tools is hosted by @bmilde AFAIK; maybe he can fix that.

@silenterus

Sorry, you are absolutely right.
I will put the results on my Git repo.
Keep up the good work!

@svenha

svenha commented Dec 11, 2020

Are there any plans to integrate all the corrections from 2018 or later into a new corpus version?

@Alienmaster
Member

I created a repository with the whole ~20 GB dataset here: https://github.com/Alienmaster/TudaDataset
For revision v4 I removed the incorrect data mentioned here and added the corrections made by @svenha.
Currently I am training on this dataset without any errors.
Feel free to download and test the new revision.

@svenha

svenha commented Feb 18, 2022

@Alienmaster Thanks for picking this up, and for the clever issue template in the new TudaDataset repo.

If you have any new evaluation results, please let us know :-)

@bmilde
Contributor

bmilde commented Mar 26, 2022

Closing this; release 4 of the Tuda dataset contains the fixes: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v4.tar.gz

Our CV7 branch already uses this to train the models, together with the newest Common Voice data. See https://github.com/uhh-lt/kaldi-tuda-de/tree/CV7

@bmilde bmilde closed this as completed Mar 26, 2022