DOC: Better document when and how `vak.split` functions are used #599

NickleDave · 2022-12-05T02:21:28Z

Currently we don't explain clearly in the docs when and how the vak.split functions are used.
I was reminded of this when @fMizki asked in the vocalpy forum here:
https://forum.vocalpy.org/t/what-happens-when-train-dur-total-duration-of-files-prepared-for-training/50/4

I recently noticed that I may have been misusing some parameters: train_dur, val_dur and test_dur. I was setting values much shorter than the total duration of the prepared data that I meant to use as the training set. As a result, the ‘split’ column in a csv file that was an output of vak prep was filled with ‘None’, though training still completed.
So, the questions I have are:
1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’ and ‘test_dur’ is shorter than the total duration of files prepared for training? Is a small, randomly selected set of files used for each split?
2. What happens when the sum of those three values exceeds the total duration of files prepared for training?

We should

- at a bare minimum, say something in the tutorial about how datasets are created by vak.split
- ideally, point to another page that says more. I'm not sure where this should live now; if we were sklearn, it would be under a model_selection module.

My reply (we should recycle some of this language):

1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’ and ‘test_dur’ is shorter than the total duration of files prepared for training? Is a small, randomly selected set of files used for each split?

Yes. Internally, when you call vak prep config.toml, it calls the function vak.split.dataframe, which randomly selects files for each split from the total set of files in the dataset. (In case you’re wondering about the name: it randomly selects rows from the pandas.DataFrame that represents the dataset.)

If you do not specify any of the options {“train_dur”, “val_dur”, “test_dur”} when you run vak prep with a config that has a [TRAIN] section, then it will put all of the data in data_dir in the train split – maybe this is what you were expecting?
Similarly when you run vak prep for a [PREDICT] config, it just puts all the data in a predict split.
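
To make the behavior described above concrete, here is a minimal sketch of duration-based random splitting. It is not vak's implementation: the helper name `assign_splits`, the `duration` column, and the greedy selection strategy are all assumptions made for illustration. It only shows the general idea that rows are drawn at random until each split reaches its target duration, that rows never drawn keep a placeholder value, and that everything goes to the train split when no durations are given.

```python
import numpy as np
import pandas as pd


def assign_splits(df, durs=None, seed=None):
    """Hypothetical helper for illustration only, NOT vak's code.

    df:   one row per file, with a "duration" column in seconds (assumed name).
    durs: dict mapping split name -> target duration in seconds, or None.
    """
    df = df.copy()
    if not durs:
        # No durations specified: put every file in the train split.
        df["split"] = "train"
        return df

    rng = np.random.default_rng(seed)
    df["split"] = "None"  # rows that are never selected keep this placeholder
    remaining = list(rng.permutation(df.index.to_numpy()))  # random row order
    for name, target in durs.items():
        total = 0.0
        # Greedily add random rows until this split reaches its target duration.
        while total < target and remaining:
            idx = remaining.pop()
            df.loc[idx, "split"] = name
            total += df.loc[idx, "duration"]
    return df


# Ten files of 10 s each; only 50 s are requested across the three splits.
files = pd.DataFrame({"duration": [10.0] * 10})
print(assign_splits(files, {"train": 30, "val": 10, "test": 10}, seed=0))
```

In this sketch, requesting less than the total duration leaves the unselected rows marked ‘None’, which would be consistent with what was reported in the forum question.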

We probably need to document this better somewhere. Please feel free to raise an issue on the vak GitHub repository suggesting that we do so.

2. What happens when the sum of those three values exceeds the total duration of files prepared for training?

You should get an error telling you that the sum of those three values exceeds the total duration of the files.
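
The kind of check that produces such an error can be sketched as follows. This is an assumption about the general shape of the validation, not vak's actual code: the function name `check_split_durs`, its parameters, and the error message are hypothetical.

```python
def check_split_durs(total_dur, train_dur=0.0, val_dur=0.0, test_dur=0.0):
    """Hypothetical check for illustration only, not vak's code.

    Raises ValueError if the requested split durations (in seconds)
    add up to more than the total duration of the dataset.
    Returns the number of seconds that would be left unassigned.
    """
    requested = train_dur + val_dur + test_dur
    if requested > total_dur:
        raise ValueError(
            f"Requested split durations sum to {requested} s, "
            f"but the dataset only contains {total_dur} s of data."
        )
    return total_dur - requested
```

Calling `check_split_durs(100, train_dur=30, val_dur=10, test_dur=10)` passes and reports 50 s left unassigned, while `check_split_durs(100, train_dur=80, val_dur=20, test_dur=20)` raises a `ValueError`.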

And @fMizki should get credit for contributing to the docs. Thank you for the reminder that we need to make this clear.

NickleDave self-assigned this Dec 5, 2022

NickleDave added the DOC: documentation label Dec 5, 2022
