
ENH: Have prep generate learncurve splits ahead of time #651

Closed
3 tasks
Tracked by #614
NickleDave opened this issue Mar 18, 2023 · 2 comments

NickleDave commented Mar 18, 2023

Instead of generating the splits at the start of running learncurve, we should have vak prep generate them.

  • The first reason to do this is better separation of concerns: vak learncurve should not be in the business of preparing datasets.

  • The second reason is that it obviates the need for the previous_run_path option; if you want to re-run on the same splits, you just set dataset_path to point at the same pre-generated splits created by vak prep.

  • The third reason is to explicitly incorporate whatever abstraction we use to represent the splits into the dataset. E.g., right now for WindowDatasets we represent valid windows with vectors, but those vectors are actually put in the results directory.

  • For now, have vak prep do this for learncurve; we should also go ahead and do it for train, allowing a user to specify a specific duration.

  • The generated splits should go into a dataset directory as described in ENH: Have prep create a directory with standardized format for each prepared dataset #650 (so that will need to be done first).

  • vak learncurve will need to change to get the vectors out of the dataset dir.
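The idea in the bullets above — prep draws the per-duration, per-replicate training subsets once, so learncurve can just read them back — can be sketched roughly as follows. This is a minimal illustration, not vak's actual API: the function name, the `(filename, duration)` tuples, and the parameters are all assumptions.

```python
import random

def make_learncurve_splits(samples, train_set_durs, num_replicates, seed=42):
    """Sketch: pre-generate learning-curve training splits at prep time.

    ``samples`` is a list of (filename, duration_in_seconds) tuples.
    For each target duration and each replicate, draw files without
    replacement until their cumulative duration reaches the target.
    """
    rng = random.Random(seed)
    splits = {}
    for dur in train_set_durs:
        for replicate in range(1, num_replicates + 1):
            pool = list(samples)
            rng.shuffle(pool)  # each replicate gets an independent draw
            chosen, total = [], 0.0
            for name, file_dur in pool:
                if total >= dur:
                    break
                chosen.append(name)
                total += file_dur
            splits[(dur, replicate)] = chosen
    return splits
```

Because the splits are a pure function of the prepared dataset plus a seed, saving them under the dataset directory (rather than the results directory) is what lets a later run reproduce them by pointing dataset_path at the same prep output.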

@NickleDave

This is what's described in #556, so this should close that.

I think for now at least we need to keep the WindowDataset abstraction (so we should close #555 also) -- the families of models we are dealing with expect to live in a world of windows, and having this abstraction is better than, e.g., literally preparing every window ahead of time, even though I kind of hate how it conflates the source data with a transformation applied to it.
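The trade-off described here — keeping a window abstraction instead of materializing every window — amounts to indexing windows lazily. A minimal sketch of that indexing scheme, with purely illustrative names (this is not the actual WindowDataset implementation):

```python
def valid_window_inds(spect_lengths, window_size):
    """Sketch of the WindowDataset idea: enumerate valid windows without
    materializing them.

    ``spect_lengths`` gives the number of time-bin columns per
    spectrogram; a window starting at column ``s`` is valid only if the
    whole window fits inside that spectrogram. The returned list maps a
    flat window index to a (spectrogram id, start column) pair, which is
    the kind of vector the issue says should live in the dataset dir.
    """
    inds = []
    for spect_id, n_cols in enumerate(spect_lengths):
        for s in range(max(0, n_cols - window_size + 1)):
            inds.append((spect_id, s))
    return inds
```

Storing only this index vector keeps the on-disk dataset small while still letting a DataLoader treat "all windows" as a flat, shufflable collection.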

@NickleDave NickleDave self-assigned this Mar 18, 2023
@NickleDave NickleDave added the ENH: enhancement label Mar 18, 2023
@NickleDave NickleDave added this to In progress in ENH Mar 19, 2023
NickleDave added a commit that referenced this issue May 30, 2023
* Remove `spect_output_dir` option

since we will now always generate spectrogram
files in the specified dataset directory.

- Remove `spect_output_dir` parameter from `vak.core.prep`
- Remove `spect_output_dir` attribute from PrepConfig
- Remove 'spect_output_dir' option from vak/config/valid.toml
- Remove use of `spect_output_dir` parameter in cli.prep
- Remove 'spect_output_dir' option from configs in tests/data_for_tests/configs/
- Remove use of `spect_output_dir` from tests/test_core/test_prep.py

* Remove logging from cli/prep.py

so we can configure logger inside core.prep
to save log file in dataset directory

* Change imports in vak/core/__init__.py

so we don't shadow module names

* Add vak/datasets/metadata.py with Metadata class

that we will use when loading datasets
in core/train, core/predict, etc.

- Import metadata module in datasets/__init__.py

* TST: Add tests/test_datasets/test_metadata.py

* CLN/ENH: Make core/prep a sub-package, rewrite

- Refactor core.prep into sub-package
- Rewrite core.prep to prepare dataset as a directory
- Import core module in vak/__init__.py
  so we can get `vak.core.prep.prep.prep` without extra imports
- Rename `vak_df` -> `dataset_df` in `core.prep`
- Add module `prep/prep_helper` and move 2 functions from io.dataframe
  into it: `add_split_col` and `validate_and_get_timebin_dur`
- Import prep_helper in prep and use prep_helper.add_split_col there
- Use datasets.Metadata class in vak.core.prep.prep
- Remove constant METADATA_JSON_FILENAME from prep/__init__
  since it became a Metadata class variable
- In core.prep, use vak.timenow.get_timenow_as_str;
  add helper functions to get dataset_csv_path
- Import prep and prep_helper modules in core/prep/__init__.py

* TST/CLN: Make test_core/test_prep a sub-package, add/rewrite tests

- Move test_prep into its own sub-package
- Rewrite tests/test_core/test_prep.py
  to test we make dataset dir correctly
- Move unit test for `add_split_col` out of io.dataframe
- Add test_prep/test_prep_helper.py
  with unit test for `add_split_col`

* TST/CLN: Rename fixture `specific_dataframe` -> `specific_dataset_df`

* Rearrange code blocks in core/prep/prep.py

* Fix where we import function from to get timebin dur in WindowDataset

* Make WindowDataset attribute `duration` a property

* Remove whitespace in WindowDataset docstring

* Rewrite core/train.py to use dataset_path + Metadata

* Fix reference to attr.asdict in datasets/metadata.py

* Rewrite core/eval.py the same way as core/train.py

* Rewrite core/predict.py the same way as core/train.py

* Fix how prep_helper.move_files_into_split_subdirs handles annotation files

* Require crowsetta >=5.0.1 to get bugfix for generic-seq format

* Normalize birdsong-recognition-dataset annotation format in core/prep.py

* Don't copy annotation files into dir if they're already there in prep/prep_helper.py

* Add dataset_csv_path argument to core/train.py, defaults to None

* WIP: Rewriting core/learncurve to use dataset_path that's a directory

* Add src/vak/core/prep/learncurve.py with `make_learncurve_splits_from_dataset_df` function

* WIP: Rewrite learncurve to use splits generated by `prep.learncurve.make_learncurve_splits_from_dataset_df`

* WIP: Rewrite prep to generate learncurve splits

* Remove wrong return value from type hint for prep.learncurve.make_learncurve_splits_from_dataset_df

* WIP: Add tests/test_core/test_prep/test_learncurve.py with unit test for prep.learncurve.make_learncurve_splits_from_dataset_df

* Fix SPECT_LIST_NPZ glob in tests/fixtures/spect.py

* Remove breakpoint left in src/vak/core/train.py

* Rewrite prep.learncurve.make_learncurve_splits_from_dataset_df to save split metadata in a csv

* Fix unit test in test_prep/test_learncurve.py after rewriting make_learncurve_splits_from_dataset_df

* Fix how core/learncurve/learncurve uses splits csv generated by prep

* Remove previous_run_path attribute from LearncurveConfig, no longer needed

* Remove previous_run_path option from config/valid.toml, no longer needed

* Fix prep.learncurve.make_learncurve_splits_from_dataset_df to need fewer parameters

* Rewrite core.prep to also prepare splits for learncurve

* Remove now-deleted parameters from the learncurve docstring

* Remove now-deleted core/learncurve args from the call to it in cli/learncurve

* Add missing imports in src/vak/core/prep/prep.py

* Fix cli.prep to call core.prep with extra (needed) args when purpose == 'learncurve'

* Remove attributes train_set_durs and num_replicates from LearnCurveConfig

* Add attributes train_set_durs and num_replicates to PrepConfig

* Move options train_set_durs and num_replicates from [LEARNCURVE] to [PREP] in valid.toml

* Fix cli.prep to pass window size argument from dataloader config into core.prep

* Move train_set_durs and num_replicates options to prep section in two learncurve configs

* Fix how we get values for train_set_durs and num_replicates from config in cli/prep.py

* Remove train_set_durs + num_replicates from LEARNCURVE list in REQUIRED_OPTIONS constant

* Make train_set_durs and num_replicates attributes of PrepConfig optional

* Remove unused imports in src/vak/core/eval.py

* Only save labelmap.json in core/prep if purpose is not 'predict'

* Make labelset be an actual set in src/vak/core/prep/learncurve.py

* Pass dataset_path to eval in core/learncurve, not results_path_this_replicate

* Fix how we validate dataset_csv_path in core/train

* Remove src/vak/core/learncurve/splits.py, no longer used

* Remove tests/test_core/test_learncurve/test_splits.py, no longer used

* Rename vak_df -> dataset_df in src/vak/cli/prep.py

* Fix fixture in tests/test_datasets/test_window_dataset/conftest.py

* Fix import in tests/test_datasets/test_metadata.py

* Fix fixture in tests/fixtures/csv.py

* Fix mocking of core.eval.eval in tests/test_cli/test_eval.py

* Fix unit tests in tests/test_cli/test_learncurve.py

* Remove args / a unit test in tests/test_core/test_learncurve/test_learncurve.py

* Fix how we mock core.predict.predict in tests/test_cli/test_predict.py

* Fix how we mock vak.core.train.train in tests/test_cli/test_train.py

* Fix how we mock vak.core.prep.prep, remove unneeded asserts in tests/test_cli/test_prep.py

* Fix how we get timebin_dur for post_tfm_kwargs in eval -- use dataset_csv_path not dataset_path

* Fix how we test core.eval.eval raises expected errors in tests/test_core/test_eval.py

* Add missing pathlib import, change Path -> pathlib.Path in core/predict.py

* Fix how we load dataset_df in src/vak/core/predict.py

* Fix how we load dataset_df in an assert helper in tests/test_core/test_predict.py

* Fix how we test core.predict.predict raises expected errors in tests/test_core/test_predict.py

* Fix how we test core.train.train raises expected errors in tests/test_core/test_train.py

* Fix how we test learncurve raises expected errors in test_learncurve.py

* Fix unit test in tests/test_core/test_prep/test_learncurve.py

* Fix unit tests, variable names in tests/test_core/test_prep/test_prep.py

* Call dropna before finding unique splits in move_files_into_split_subdirs

* Fix unit tests in tests/test_core/test_prep/test_prep_helper.py

* Fix unit test in tests/test_datasets/test_metadata.py

* Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/conftest.py

* Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/test_class_.py

* Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/test_helper.py

* Fix how we glob for files from spect_dir in vak.io.spect.to_dataframe

* Fix how we get dataset_csv_path in test_models/test_base.py
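Several commits above revolve around a Metadata class saved inside the dataset directory, so that train/eval/predict/learncurve can find everything from dataset_path alone. A hedged sketch of that pattern (field names, the json filename, and method names are assumptions for illustration, not vak's actual Metadata API, which the commits note is attrs-based):

```python
import json
import pathlib
import tempfile
from dataclasses import dataclass, asdict

@dataclass
class DatasetMetadata:
    """Illustrative stand-in for a dataset-directory metadata record.

    Saved as metadata.json inside the directory that prep creates, it
    lets downstream commands locate the dataset csv (and constants like
    the time-bin duration) given only dataset_path.
    """
    dataset_csv_filename: str
    timebin_dur: float

    METADATA_JSON_FILENAME = "metadata.json"  # class variable, not a field

    def to_json(self, dataset_path):
        path = pathlib.Path(dataset_path) / self.METADATA_JSON_FILENAME
        path.write_text(json.dumps(asdict(self)))

    @classmethod
    def from_dataset_path(cls, dataset_path):
        path = pathlib.Path(dataset_path) / cls.METADATA_JSON_FILENAME
        return cls(**json.loads(path.read_text()))
```

With this round trip in place, replacing previous_run_path with dataset_path is just a matter of each command calling something like `DatasetMetadata.from_dataset_path(dataset_path)` instead of re-deriving paths itself.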
@NickleDave

Fixed by #658

@NickleDave NickleDave moved this from In progress to Done in ENH May 30, 2023