DOC: Better document when and how vak.split functions are used #599

Open
2 tasks
NickleDave opened this issue Dec 5, 2022 · 0 comments

Comments

@NickleDave (Collaborator)

Currently we don't explain clearly in the docs when and how the vak.split functions are used.
I was reminded of this when @fMizki asked in the vocalpy forum here:
https://forum.vocalpy.org/t/what-happens-when-train-dur-total-duration-of-files-prepared-for-training/50/4

I recently noticed that I might have been misusing some parameters: train_dur, val_dur, and test_dur. I was setting values much shorter than the total duration of the prepared data meant to serve as the training set. As a result, the ‘split’ column in the csv file output by vak prep was filled with ‘None’, though training still completed.
So, the questions I have are:
1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’, and ‘test_dur’ is shorter than the total duration of the files prepared for training? Is a small, randomly selected set of files used for each split?
2. What happens when the sum of those three values exceeds the total duration of the files prepared for training?

We should

  • at a bare minimum, say something in the tutorial about how datasets are created by vak.split
  • ideally, point to another page that says more. I'm not sure where this should live now; if we were sklearn, it would be under a model_selection module.

My reply is below; we should recycle some of this language:

1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’, and ‘test_dur’ is shorter than the total duration of the files prepared for training? Is a small, randomly selected set of files used for each split?

Yes. Internally, when you run vak prep config.toml, it calls the function vak.split.dataframe, which randomly selects files for each split from the total set of files in the dataset. (In case you’re wondering about the name: it randomly selects rows from the pandas.DataFrame that represents the dataset.)

If you do not specify any of the options {“train_dur”, “val_dur”, “test_dur”} when you run vak prep with a config that has a [TRAIN] section, then it will put all of the data in data_dir in the train split – maybe this is what you were expecting?
Similarly, when you run vak prep with a [PREDICT] config, it just puts all of the data in a predict split.
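
To make the behavior described above concrete, here is a minimal sketch of duration-based random splitting. It is not the code in vak.split.dataframe: the function name assign_splits, the "duration" and "split" keys, the seed argument, and the choice to label unselected files as "None" are all assumptions made for illustration, and plain dictionaries stand in for the rows of the pandas.DataFrame.

```python
import random


def assign_splits(rows, train_dur=None, val_dur=None, test_dur=None, seed=0):
    """Hypothetical helper (not vak's implementation): label rows with a split.

    Each row is a dict with a "duration" key, in seconds.
    """
    targets = {"train": train_dur, "val": val_dur, "test": test_dur}
    rows = [dict(row) for row in rows]  # copy so the caller's data is untouched

    # No durations specified: everything goes in the train split.
    if all(dur is None for dur in targets.values()):
        for row in rows:
            row["split"] = "train"
        return rows

    # Assumption: files that are not selected for any split are labeled "None".
    for row in rows:
        row["split"] = "None"

    # Visit rows in random order, filling each split until its target is met.
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    for split, target in targets.items():
        if target is None:
            continue
        total = 0.0
        while order and total < target:
            row = rows[order.pop()]
            row["split"] = split
            total += row["duration"]
    return rows
```

With ten one-second files and train_dur=3, val_dur=2, test_dur=1, this sketch assigns three files to train, two to val, one to test, and leaves the remaining four labeled "None".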

We probably need to document this better somewhere. Please feel free to raise an issue on the vak GitHub repository suggesting that we do so.

2. What happens when the sum of those three values exceeds the total duration of the files prepared for training?

You should get an error telling you that the sum of those three values exceeds the total duration of the files.
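
A check of this kind could look like the sketch below. The function name check_split_durs, the ValueError type, and the message text are assumptions for illustration; the actual error raised by vak may differ.

```python
def check_split_durs(total_dur, train_dur=None, val_dur=None, test_dur=None):
    """Hypothetical check (not vak's implementation).

    Raises ValueError if the requested split durations, in seconds,
    add up to more than the total duration of the dataset.
    """
    requested = sum(
        dur for dur in (train_dur, val_dur, test_dur) if dur is not None
    )
    if requested > total_dur:
        raise ValueError(
            f"Sum of split durations ({requested} s) exceeds "
            f"total duration of files ({total_dur} s)."
        )
    return requested
```

For example, check_split_durs(100, train_dur=50, val_dur=10, test_dur=20) returns 80, while requesting 60, 30, and 20 seconds from the same 100 seconds of data raises the error.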

And @fMizki should get credit for contributing to the docs. Thank you for the reminder that we need to make this clear.

@NickleDave NickleDave self-assigned this Dec 5, 2022
@NickleDave NickleDave added the DOC: documentation documentation label Dec 5, 2022