Label Noise, Speech Noise, Cut Speech in the dataset #1

Open
shangeth opened this issue Nov 24, 2022 · 3 comments
Comments

@shangeth
Contributor

I ran a Dataset Cartography analysis on the training set for the e2e SLU model based on a Whisper encoder. This analysis splits the dataset into three groups: easy, hard, and ambiguous samples.
[Image: datamaps]
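For readers unfamiliar with the technique, here is a minimal sketch of how such a split can be computed. It assumes you have logged the model's probability for the gold label of every training sample at the end of each epoch. The function names and both thresholds are hypothetical choices for illustration, not the settings used for the plots in this issue.

```python
import numpy as np

def cartography_stats(gold_probs):
    """gold_probs has shape (n_epochs, n_samples): the probability the model
    assigned to the gold label of each sample at the end of each epoch."""
    gold_probs = np.asarray(gold_probs, dtype=float)
    confidence = gold_probs.mean(axis=0)   # mean gold-label probability
    variability = gold_probs.std(axis=0)   # spread of that probability over epochs
    return confidence, variability

def assign_regions(confidence, variability, conf_thresh=0.5, var_thresh=0.2):
    """Assumed rule (thresholds are illustrative): high variability is
    'ambiguous'; otherwise high confidence is 'easy', low confidence is 'hard'."""
    regions = []
    for c, v in zip(confidence, variability):
        if v >= var_thresh:
            regions.append("ambiguous")
        elif c >= conf_thresh:
            regions.append("easy")
        else:
            regions.append("hard")
    return regions

# Toy example: 3 epochs, 3 samples.
probs = [[0.90, 0.10, 0.2],
         [0.95, 0.05, 0.8],
         [0.90, 0.10, 0.3]]
conf, var = cartography_stats(probs)
print(assign_regions(conf, var))
```

Samples that stay at low confidence with low variability are the ones worth listening to first, since label or audio problems tend to end up there.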

After the split, I went through the hard samples to understand why they are harder for the model to learn. Listening to the audio, I found that a few samples were mislabelled, a few contained only noise and no speech, and a few had the speech cut off partway through. So far this analysis covers only the train set; it needs to be done on the test set too.

@shangeth
Contributor Author

I ran the same analysis on the test set and found similar errors there.
[Image: datamaps_test]

Some audio files look like the plot below:
[Image: Screenshot 2022-11-24 at 3 04 08 PM]

@shangeth
Contributor Author

@janaab11

@shangeth
Contributor Author

I am also adding a data-maps analysis per speaker, to find which speakers have the most errors in their audio.
[Image: datamaps-speaker]

Speakers 5 and 1 have many hard samples, which may be incorrect, while speakers 4 and 7 have a few.
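A per-speaker summary like this can be produced with a small aggregation. The sketch below assumes each sample already carries a speaker ID and a region label ("easy", "hard", or "ambiguous"). The function name and the toy inputs are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def hard_fraction_per_speaker(speaker_ids, regions):
    """Return {speaker_id: fraction of that speaker's samples labelled 'hard'}."""
    total = Counter(speaker_ids)
    hard = Counter(s for s, r in zip(speaker_ids, regions) if r == "hard")
    # Counter returns 0 for speakers with no hard samples.
    return {s: hard[s] / total[s] for s in total}

# Toy example with made-up speaker IDs and region labels.
speakers = [1, 1, 5, 5, 4]
regions = ["hard", "easy", "hard", "hard", "easy"]
fractions = hard_fraction_per_speaker(speakers, regions)

# List speakers from most to least affected, to prioritise manual review.
for speaker, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(speaker, frac)
```

Using a fraction rather than a raw count keeps speakers with many recordings from dominating the ranking.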
