ENH: Have prep generate learncurve splits ahead of time #651
This is what's described in #556, so this should close that. I think for now at least we need to keep the `WindowDataset` abstraction (so we should close #555 also) -- the families of models we are dealing with expect to live in a world of windows, and having this abstraction is better than, e.g., literally preparing every window ahead of time, even though I kind of hate how it conflates the source data and a transformation applied to it.
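To make that tradeoff concrete, here is a minimal sketch of a window-indexing dataset (all names hypothetical; this is not vak's actual `WindowDataset`): it keeps the source spectrograms and maps a flat window index to a slice on demand, instead of literally preparing every window ahead of time.

```python
import numpy as np

class LazyWindowDataset:
    """Minimal sketch of a window-indexing dataset (hypothetical, not vak's class).

    Stores source arrays and computes a flat index over all valid
    window start positions, slicing out a window only when asked.
    """
    def __init__(self, spects, window_size):
        self.spects = spects  # list of 2-D arrays, shape (freq_bins, time_bins)
        self.window_size = window_size
        # number of valid windows per spectrogram
        n_windows = [s.shape[1] - window_size + 1 for s in spects]
        # cumulative offsets map a flat index back to (spectrogram, start bin)
        self.cum = np.cumsum([0] + n_windows)

    def __len__(self):
        return int(self.cum[-1])

    def __getitem__(self, idx):
        # find which source spectrogram this flat index falls in
        which = int(np.searchsorted(self.cum, idx, side="right")) - 1
        start = idx - self.cum[which]
        return self.spects[which][:, start:start + self.window_size]

spects = [np.arange(12.).reshape(2, 6), np.arange(8.).reshape(2, 4)]
ds = LazyWindowDataset(spects, window_size=3)
print(len(ds))      # 4 windows from the first spectrogram + 2 from the second = 6
print(ds[5].shape)  # (2, 3)
```

The flat index is exactly where source data and transformation meet: `__getitem__` both locates the source array and applies the windowing, which is the conflation the comment above grumbles about.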
* Remove `spect_output_dir` option, since we will now always generate spectrogram files in the specified dataset directory.
  - Remove `spect_output_dir` parameter from `vak.core.prep`
  - Remove `spect_output_dir` attribute from PrepConfig
  - Remove 'spect_output_dir' option from vak/config/valid.toml
  - Remove use of `spect_output_dir` parameter in cli.prep
  - Remove 'spect_output_dir' option from configs in tests/data_for_tests/configs/
  - Remove use of `spect_output_dir` from tests/test_core/test_prep.py
* Remove logging from cli/prep.py so we can configure the logger inside core.prep to save the log file in the dataset directory
* Change imports in vak/core/__init__.py so we don't shadow module names
* Add vak/datasets/metadata.py with a Metadata class that we will use when loading datasets in core/train, core/predict, etc.
  - Import metadata module in datasets/__init__.py
* TST: Add tests/test_datasets/test_metadata.py
* CLN/ENH: Make core/prep a sub-package, rewrite
  - Refactor core.prep into a sub-package
  - Rewrite core.prep to prepare the dataset as a directory
  - Import core module in vak/__init__.py so we can get `vak.core.prep.prep.prep` without extra imports
  - Rename `vak_df` -> `dataset_df` in `core.prep`
  - Add module `prep/prep_helper` and move two functions from io.dataframe into it: `add_split_col` and `validate_and_get_timebin_dur`
  - Import prep_helper in prep and use prep_helper.add_split_col there
  - Use datasets.Metadata class in vak.core.prep.prep
  - Remove constant METADATA_JSON_FILENAME from prep/__init__ since it became a Metadata class variable
  - In core.prep, use vak.timenow.get_timenow_as_str; add helper functions to get dataset_csv_path
  - Import prep and prep_helper modules in core/prep/__init__.py
* TST/CLN: Make test_core/test_prep a sub-package, add/rewrite tests
  - Move test_prep into its own sub-package
  - Rewrite tests/test_core/test_prep.py to test that we make the dataset dir correctly
  - Move unit test for `add_split_col` out of io.dataframe
  - Add test_prep/test_prep_helper.py with unit test for `add_split_col`
* TST/CLN: Rename fixture `specific_dataframe` -> `specific_dataset_df`
* Rearrange code blocks in core/prep/prep.py
* Fix where we import the function used to get timebin dur in WindowDataset
* Make WindowDataset attribute `duration` a property
* Remove whitespace in WindowDataset docstring
* Rewrite core/train.py to use dataset_path + Metadata
* Fix reference to attr.asdict in datasets/metadata.py
* Rewrite core/eval.py the same way as core/train.py
* Rewrite core/predict.py the same way as core/train.py
* Fix how prep_helper.move_files_into_split_subdirs handles annotation files
* Require crowsetta >=5.0.1 to get bugfix for generic-seq format
* Normalize birdsong-recognition-dataset annotation format in core/prep.py
* Don't copy annotation files into the dir if they're already there, in prep/prep_helper.py
* Add dataset_csv_path argument to core/train.py, defaulting to None
* WIP: Rewrite core/learncurve to use a dataset_path that's a directory
* Add src/vak/core/prep/learncurve.py with `make_learncurve_splits_from_dataset_df` function
* WIP: Rewrite learncurve to use splits generated by `prep.learncurve.make_learncurve_splits_from_dataset_df`
* WIP: Rewrite prep to generate learncurve splits
* Remove wrong return value from type hint for prep.learncurve.make_learncurve_splits_from_dataset_df
* WIP: Add tests/test_core/test_prep/test_learncurve.py with unit test for prep.learncurve.make_learncurve_splits_from_dataset_df
* Fix SPECT_LIST_NPZ glob in tests/fixtures/spect.py
* Remove breakpoint left in src/vak/core/train.py
* Rewrite prep.learncurve.make_learncurve_splits_from_dataset_df to save split metadata in a csv
* Fix unit test in test_prep/test_learncurve.py after rewriting make_learncurve_splits_from_dataset_df
* Fix how core/learncurve/learncurve uses the splits csv generated by prep
* Remove previous_run_path attribute from LearncurveConfig, no longer needed
* Remove previous_run_path option from config/valid.toml, no longer needed
* Fix prep.learncurve.make_learncurve_splits_from_dataset_df to need fewer parameters
* Rewrite core.prep to also prepare splits for learncurve
* Remove parameters that were removed from the learncurve docstring
* Remove args that were removed from core/learncurve from the call to it in cli/learncurve
* Add missing imports in src/vak/core/prep/prep.py
* Fix cli.prep to call core.prep with extra (needed) args when purpose == 'learncurve'
* Remove attributes train_set_durs and num_replicates from LearncurveConfig
* Add attributes train_set_durs and num_replicates to PrepConfig
* Move options train_set_durs and num_replicates from [LEARNCURVE] to [PREP] in valid.toml
* Fix cli.prep to pass window size argument from dataloader config into core.prep
* Move train_set_durs and num_replicates options to prep section in two learncurve configs
* Fix how we get values for train_set_durs and num_replicates from config in cli/prep.py
* Remove train_set_durs + num_replicates from LEARNCURVE list in REQUIRED_OPTIONS constant
* Make train_set_durs and num_replicates attributes of PrepConfig optional
* Remove unused imports in src/vak/core/eval.py
* Only save labelmap.json in core/prep if purpose is not 'predict'
* Make labelset an actual set in src/vak/core/prep/learncurve.py
* Pass dataset_path to eval in core/learncurve, not results_path_this_replicate
* Fix how we validate dataset_csv_path in core/train
* Remove src/vak/core/learncurve/splits.py, no longer used
* Remove tests/test_core/test_learncurve/test_splits.py, no longer used
* Rename vak_df -> dataset_df in src/vak/cli/prep.py
* Fix fixture in tests/test_datasets/test_window_dataset/conftest.py
* Fix import in tests/test_datasets/test_metadata.py
* Fix fixture in tests/fixtures/csv.py
* Fix mocking of core.eval.eval in tests/test_cli/test_eval.py
* Fix unit tests in tests/test_cli/test_learncurve.py
* Remove args / a unit test in tests/test_core/test_learncurve/test_learncurve.py
* Fix how we mock core.predict.predict in tests/test_cli/test_predict.py
* Fix how we mock vak.core.train.train in tests/test_cli/test_train.py
* Fix how we mock vak.core.prep.prep, remove unneeded asserts in tests/test_cli/test_prep.py
* Fix how we get timebin_dur for post_tfm_kwargs in eval -- use dataset_csv_path, not dataset_path
* Fix how we test that core.eval.eval raises expected errors in tests/test_core/test_eval.py
* Add missing pathlib import, change Path -> pathlib.Path in core/predict.py
* Fix how we load dataset_df in src/vak/core/predict.py
* Fix how we load dataset_df in an assert helper in tests/test_core/test_predict.py
* Fix how we test that core.predict.predict raises expected errors in tests/test_core/test_predict.py
* Fix how we test that core.train.train raises expected errors in tests/test_core/test_train.py
* Fix how we test that learncurve raises expected errors in test_learncurve.py
* Fix unit test in tests/test_core/test_prep/test_learncurve.py
* Fix unit tests and variable names in tests/test_core/test_prep/test_prep.py
* Call dropna before finding unique splits in move_files_into_split_subdirs
* Fix unit tests in tests/test_core/test_prep/test_prep_helper.py
* Fix unit test in tests/test_datasets/test_metadata.py
* Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/conftest.py
* Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/test_class_.py
* Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/test_helper.py
* Fix how we glob for files from spect_dir in vak.io.spect.to_dataframe
* Fix how we get dataset_csv_path in test_models/test_base.py
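The Metadata class mentioned in the list above is the glue between prep and the downstream commands. A rough sketch of the idea (field names are assumptions; the list says vak's actual class is attrs-based, while this sketch uses stdlib dataclasses): prep writes a `metadata.json` into the dataset directory, and train/eval/predict read it back instead of taking the same values as separate arguments.

```python
import dataclasses
import json
import pathlib

@dataclasses.dataclass
class Metadata:
    """Sketch of a dataset-metadata record; field names are hypothetical."""

    # class variable, not a dataclass field (no annotation), mirroring
    # how METADATA_JSON_FILENAME became a Metadata class variable
    METADATA_JSON_FILENAME = "metadata.json"

    dataset_csv_filename: str  # name of the csv inside the dataset dir
    timebin_dur: float         # duration of a spectrogram time bin, seconds

    def to_json(self, dataset_path):
        """Save this record as metadata.json inside the dataset directory."""
        path = pathlib.Path(dataset_path) / self.METADATA_JSON_FILENAME
        path.write_text(json.dumps(dataclasses.asdict(self)))

    @classmethod
    def from_dataset_path(cls, dataset_path):
        """Load the record back, given only the dataset directory."""
        path = pathlib.Path(dataset_path) / cls.METADATA_JSON_FILENAME
        return cls(**json.loads(path.read_text()))
```

With this shape, `core.train` and friends only need a `dataset_path`; everything else (csv location, time-bin duration) is recovered from the directory itself.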
Fixed by #658
Instead of generating the splits at the start of running learncurve, we should have vak prep generate them.
The first reason to do this is for better separation of concerns: `vak learncurve` should not be in the business of preparing datasets.

The second reason to do this is that it obviates the need for the `previous_run_path` option; if you want to re-run on the same splits, you just set `dataset_path` to point at the same pre-generated splits created by `vak prep`.

The third reason to do this is to explicitly incorporate whatever abstraction we use to represent the splits into the dataset. E.g., right now for WindowDatasets we represent valid windows with vectors, but those vectors are actually put in the results directory.
For now, have `vak prep` do this for `learncurve`; we should also go ahead and do it for `train`, allowing a user to specify a specific duration.

The generated splits should go into a dataset directory as described in "ENH: Have prep create a directory with standardized format for each prepared dataset" #650 (so that will need to be done first).
`vak learncurve` will need to change to get the vectors out of the dataset dir.
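The proposal above can be sketched roughly as follows (the function name echoes the PR's `make_learncurve_splits_from_dataset_df`, but the signature, column names, and sampling strategy here are assumptions, not vak's actual implementation): for each training-set duration and each replicate, sample files from the 'train' split until the target duration is reached, and record the resulting subsets as split metadata that prep can save in a csv inside the dataset directory.

```python
import random

import pandas as pd

def make_learncurve_splits(dataset_df, train_set_durs, num_replicates, seed=0):
    """Hypothetical sketch of generating learning-curve splits at prep time."""
    rng = random.Random(seed)
    train_df = dataset_df[dataset_df["split"] == "train"]
    records = []
    for train_dur in train_set_durs:
        for replicate_num in range(1, num_replicates + 1):
            # shuffle candidate files for this replicate
            shuffled = train_df.sample(frac=1, random_state=rng.randrange(2**32))
            total_dur, files = 0.0, []
            for row in shuffled.itertuples():
                if total_dur >= train_dur:
                    break
                files.append(row.audio_path)
                total_dur += row.duration
            records.append(
                {"train_dur": train_dur, "replicate_num": replicate_num,
                 "total_dur": total_dur, "files": files}
            )
    # the PR describes saving this split metadata as a csv in the dataset dir
    return pd.DataFrame.from_records(records)

dataset_df = pd.DataFrame({
    "audio_path": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "duration": [2.0, 2.0, 2.0, 1.0],
    "split": ["train", "train", "train", "val"],
})
splits = make_learncurve_splits(dataset_df, train_set_durs=[2.0, 4.0], num_replicates=2)
print(len(splits))  # 2 durations x 2 replicates = 4 split records
```

Because the splits (and the seed that generated them) live in the dataset directory, re-running `vak learncurve` on the same `dataset_path` reuses identical subsets, which is what makes `previous_run_path` unnecessary.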