Label Noise, Speech Noise, Cut Speech in the dataset #1

shangeth · 2022-11-24T08:15:46Z

I ran a Dataset Cartography analysis on the training set for the e2e SLU model based on a Whisper encoder. The analysis divides the dataset into three groups: easy, hard, and ambiguous samples.
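For readers unfamiliar with the method, here is a minimal sketch of how such a split can be computed. It assumes the per-epoch probability of the gold label was logged for every training sample. The function names, the `frac` cutoff, and the top-fraction selection rule are illustrative assumptions, not the exact code used for this analysis.

```python
import numpy as np

def data_map_stats(gold_probs):
    """Compute per-sample confidence and variability.

    gold_probs: array-like of shape (n_epochs, n_samples) holding the
    probability the model assigned to the gold label after each epoch
    (assumed to have been logged during training).
    """
    gold_probs = np.asarray(gold_probs, dtype=float)
    confidence = gold_probs.mean(axis=0)   # mean gold-label probability
    variability = gold_probs.std(axis=0)   # spread across epochs
    return confidence, variability

def split_regions(confidence, variability, frac=0.33):
    """Pick the top `frac` of samples for each region (assumed cutoff).

    Regions are selected independently, so they may overlap.
    """
    k = max(1, int(round(frac * len(confidence))))
    return {
        "easy": np.argsort(-confidence)[:k],        # highest confidence
        "hard": np.argsort(confidence)[:k],         # lowest confidence
        "ambiguous": np.argsort(-variability)[:k],  # highest variability
    }
```

The indices in `regions["hard"]` are the samples worth listening to first, since low confidence across all epochs often points to a labelling or audio problem rather than a genuinely difficult utterance.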

After the split, I examined the hard samples to understand why the model finds them harder to learn. Listening to these audio samples, I found that a few were mislabelled, a few contained only noise with no speech, and a few had speech that was cut off partway through. So far this analysis covers only the train set; it needs to be done on the test set too.

shangeth · 2022-11-24T09:34:35Z

I ran the same analysis on the test set and found similar errors there.

Some audio files look like the one in the plot below.

shangeth · 2022-11-28T05:52:29Z

@janaab11

shangeth · 2022-11-28T09:28:52Z

I am also adding a data-map analysis for each speaker, to find which speakers have more errors in their audio.

We can observe that speakers 5 and 1 have many hard samples, which may be incorrect, and speakers 4 and 7 have a few.
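A per-speaker comparison like this can be made quantitative by computing the share of each speaker's samples that fall in the hard region. The sketch below assumes a speaker ID and a hard/not-hard flag are available for every sample; `hard_rate_per_speaker` and its inputs are hypothetical names for illustration.

```python
from collections import Counter

def hard_rate_per_speaker(speaker_ids, is_hard):
    """Return {speaker_id: fraction of that speaker's samples flagged hard}.

    speaker_ids: one speaker ID per sample (assumed metadata).
    is_hard: one boolean per sample, True if it landed in the hard region.
    """
    total = Counter(speaker_ids)
    hard = Counter(s for s, h in zip(speaker_ids, is_hard) if h)
    # Counter returns 0 for speakers with no hard samples.
    return {s: hard[s] / total[s] for s in total}
```

Using a rate rather than a raw count avoids penalising speakers who simply contributed more recordings.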
