Prep datasets as directories, fixes #649 #650 #651 (#658)

Merged: 83 commits into main from prep-dataset-as-directory, May 30, 2023

Conversation

NickleDave (Collaborator)

This mainly changes vak.prep to prepare datasets as directories.
It fixes #649 #650 #651

The main goals are:

  • make datasets portable
  • save datasets in a (somewhat) standardized format -- still work to be done there but this is a first step
  • save metadata we need that doesn't fit in a tabular data format, e.g. the timebin duration for spectrograms (no need to repeat it in every row of the csv, since it should always be the same) or the name of the csv representing the dataset (which can't be stored in the csv itself) -- see the sketch below
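A rough sketch of the kind of sidecar metadata file this enables is shown below; it is illustrative only -- the actual `vak.datasets.Metadata` class, its fields, and the metadata file name may differ:

```python
# Illustrative sketch only: the real vak.datasets.Metadata class, its fields,
# and the metadata file name are assumptions here.
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class DatasetMetadata:
    """Metadata that does not fit in the tabular dataset csv."""
    dataset_csv_filename: str  # name of the csv describing the dataset
    timebin_dur: float         # duration of a spectrogram time bin, in seconds

    def to_json(self, dataset_path: Path) -> None:
        # save the metadata inside the dataset directory, next to the csv
        with (dataset_path / "metadata.json").open("w") as fp:
            json.dump(asdict(self), fp, indent=2)

    @classmethod
    def from_json(cls, dataset_path: Path) -> "DatasetMetadata":
        with (dataset_path / "metadata.json").open() as fp:
            return cls(**json.load(fp))
```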

Additionally this does the following:

  • move the logic for creating learncurve splits from the learncurve function into prep -- a rough sketch of the idea follows this list
  • remove the previous_run_path option from learncurve, since preparing splits ahead of time and saving them in a standardized format obviates the need for it
  • remove related functions, e.g. the splits module that was in core.learncurve
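A minimal sketch of that pre-generated-splits idea, assuming splits are made by subsetting the training portion of the dataset DataFrame per training-set duration and replicate; the real `prep.learncurve.make_learncurve_splits_from_dataset_df` almost certainly differs, and the column names and csv naming here are assumptions:

```python
# Sketch only: column names ('split', 'duration') and the csv naming scheme
# are assumptions, not the actual vak implementation.
from pathlib import Path

import pandas as pd


def make_learncurve_splits(dataset_df: pd.DataFrame, dataset_path: Path,
                           train_set_durs: list[float], num_replicates: int) -> None:
    """Save one csv per (training-set duration, replicate) inside the dataset directory."""
    train_df = dataset_df[dataset_df["split"] == "train"]
    for dur in train_set_durs:
        for replicate in range(1, num_replicates + 1):
            # naive subsetting: shuffle, then take files until we reach `dur` seconds
            shuffled = train_df.sample(frac=1, random_state=replicate)
            subset = shuffled[shuffled["duration"].cumsum() <= dur]
            csv_path = dataset_path / f"train-dur-{dur}-replicate-{replicate}.csv"
            subset.to_csv(csv_path, index=False)
```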

Remove `spect_output_dir` everywhere, since we now always generate spectrogram files in the specified dataset directory:

- Remove `spect_output_dir` parameter from `vak.core.prep`
- Remove `spect_output_dir` attribute from PrepConfig
- Remove `spect_output_dir` option from vak/config/valid.toml
- Remove use of `spect_output_dir` parameter in cli.prep
- Remove `spect_output_dir` option from configs in tests/data_for_tests/configs/
- Remove use of `spect_output_dir` from tests/test_core/test_prep.py
…so we can configure the logger inside core.prep, to save the log file in the dataset directory

…so we don't shadow module names that we will use when loading datasets in core/train, core/predict, etc.

- Import metadata module in datasets/__init__.py
- Refactor core.prep into sub-package
- Rewrite core.prep to prepare dataset as a directory
- Import core module in vak/__init__.py
  so we can get `vak.core.prep.prep.prep` without extra imports
- Rename `vak_df` -> `dataset_df` in `core.prep`
- Add module `prep/prep_helper` and move 2 functions from io.dataframe into it: `add_split_col` and `validate_and_get_timebin_dur` (a rough sketch of `add_split_col` follows this list)
- Import prep_helper in prep and use prep_helper.add_split_col there
- Use datasets.Metadata class in vak.core.prep.prep
- Remove constant METADATA_JSON_FILENAME from prep/__init__ since it became a Metadata class variable
- In core.prep use vak.timenow.get_timenow_as_str, add helper functions to get dataset_csv_path
- Import prep and prep_helper modules in core/prep/__init__.py
- Move test_prep into its own sub-package
- Rewrite tests/test_core/test_prep.py
  to test we make dataset dir correctly
- Move unit test for `add_split_col` out of io.dataframe
- Add test_prep/test_prep_helper.py
  with unit test for `add_split_col`
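As referenced above, a minimal sketch of what a helper like `add_split_col` does -- assuming it simply tags every row of the dataset DataFrame with a split name; the actual function in vak may handle more cases:

```python
# Sketch only: the signature and behavior of vak's add_split_col are assumptions.
import pandas as pd


def add_split_col(df: pd.DataFrame, split: str) -> pd.DataFrame:
    """Return a copy of `df` with a 'split' column assigning every row to `split`."""
    df = df.copy()
    df["split"] = split
    return df
```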
…for prep.learncurve.make_learncurve_splits_from_dataset_df
@NickleDave NickleDave merged commit 2bfbaa4 into main May 30, 2023
0 of 3 checks passed
@NickleDave NickleDave deleted the prep-dataset-as-directory branch May 30, 2023 13:26
NickleDave added a commit that referenced this pull request Jun 5, 2023
Fixes some issues with #658

* Fix vak.prep.prep_helper.move_files_into_split_subdirs to save paths of moved files in the dataset csv as relative to the dataset directory root (see the sketch after this list)

* Add dataset_path parameter to vak.annotation.from_df, use to construct paths to annotations that are saved as relative to root

* Fix dataset.seq.validators.where_unlabeled to pass dataset_path into annotation.from_df

- Add type hinting, revise docstrings in dataset.seq.validators

- Rename `vak_df` -> `dataset_df` in dataset.seq.validators

* Add dataset_path parameter to labels.from_df, to pass to annotation.from_df

* Add dataset_path parameter to vak.split.dataframe, to pass to vak.labels.from_df

* Pass dataset_path arg to vak.split.dataframe inside core.prep.prep

* Pass dataset_path arg into vak.split.dataframe inside prep.learncurve

* Add dataset_path parameter to functions in src/vak/datasets/window_dataset/helper.py

* Pass dataset_path arg into window_dataset.helper.vectors_from_df inside prep.learncurve

* Rewrite StandardizeSpect.fit_df method as fit_csv_path, instead of adding dataset_path parameter

* Use StandardizeSpect.fit_csv_path in core/train.py

* Fix WindowDataset class so it can load samples from dataset root

* Fix VocalDataset class to load samples from dataset root

* Fix WindowDataset/VocalDataset arg names 'csv_path' -> 'dataset_csv_path' in core/train.py

* Fix VocalDataset arg name 'csv_path' -> 'dataset_csv_path' in core/eval.py

* Fix VocalDataset arg name 'csv_path' -> 'dataset_csv_path' in core/predict.py

* Make core/train.py load labelmap.json from dataset_path, remove those args

* Remove use of args `labelset` and `labelmap_path` from cli/train.py

* Remove labelmap_path attribute from TrainConfig

* Remove labelmap_config option from TRAIN section in valid.toml

* Remove if-else block in cli/prep.py that's not needed because we can just pass in default None args from config

* Have cli.prep copy config file to dataset directory after core.prep runs

* Get timebin_dur from metadata in core/learncurve.py

* Get timebin_dur from metadata in core/train.py

* Get timebin_dur from metadata in window_dataset/class_.py

* In core/prep/prep.py, save metadata before we generate learncurve splits, because learncurve function expects it to exist

* Get timebin_dur from metadata in prep/learncurve.py

* Remove use of `labelset` from `learning_curve` function

- Do not call `train` inside `learning_curve` with a
  `labelset` argument
- Remove labelset parameter from learning_curve function

* No longer pass config.prep.labelset into core.learning_curve inside cli.learncurve

* Get timebin_dur from metadata in core/predict.py

* Fix how we determine split_csv_path in src/vak/core/learncurve/learncurve.py -- use dataset_path, not dataset_learncurve_dir

* Save learncurve split csvs in dataset root, not learncurve sub-directory, so we don't break semantic of dataset_csv_path argument in other functions

* Fix how we validate dataset_csv_path passed in by learncurve inside core/train.py

* Fix unit test in test_labels to pass in dataset_path

* Remove unit tests that are no longer needed in test_cli/test_train.py -- train no longer has labelset or labelmap_path parameters

* Rename fixture `specific_prep_csv_path` -> `specific_dataset_csv_path`

* Add tests/fixtures/dataset.py with fixture `specific_dataset_path`

* Use fixture `specific_dataset_path` in test_labels.py

* Don't add labelmap_path option to train_continue configs in tests/scripts/generate_data_for_tests.py

* Remove labelmap_path option from train_continue configs

* Fix assert helper function for core.prep to test that paths in dataset csv are relative to dataset root

* Fix core/predict.py to construct spect path relative to dataset path

* Remove labelset/labelmap_path arguments from unit test for core.train, since those parameters were removed from the function

* Remove argument `labelmap_path` from unit test in core/train.py; the parameter was removed from the function

* Fix argument name in test_window_dataset/test_class_.py

* Fix argument name in test_window_dataset/conftest.py

* Remove unused function from window_dataset/helper.py, `vectors_from_csv_path`

* Add missing `dataset_path` arguments and remove a unit test for removed function in test_window_dataset/test_helper.py

* Rename annotation.from_df parameter `dataset_path` to `annot_root`

Make it optional but have all internal functions use it.

Needed to do this because a unit test calls this function to test the output of `io.dataframe.from_files`.

Passing in any value "worked" because the paths were absolute,
but really we should be able to get the annotations
from the dataframe at any point.

Minor detail but I don't want this to be confusing later.

* Fix argument name (by not using keyword arg) in datasets/seq/validators.py

* Fix arg name in test_models/test_base.py: csv_path -> dataset_csv_path

* Fix test for split.dataframe that now requires dataset_path arg

* Fix unit test for StandardizeSpect.fit_csv_path

* Remove argument `labelset` from test_learncurve.py, no longer exists

* Fix unit test in test_core/test_prep/test_learncurve.py

* Fix how we test paths in dataset_df in test_core/test_prep/test_prep.py

* Fix how we build paths for tests in test_core/test_prep/test_prep_helper.py
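The thread running through these commits is a path convention: at prep time, paths stored in the dataset csv are made relative to the dataset root, and at load time (e.g. in WindowDataset and VocalDataset) they are joined back onto `dataset_path`. A minimal sketch of that pattern in plain pandas/pathlib -- not the actual vak helpers, and the `spect_path` column name is an assumption:

```python
# Sketch of the relative-path convention, not the actual vak code.
from pathlib import Path

import pandas as pd


def make_paths_relative(dataset_df: pd.DataFrame, dataset_path: Path,
                        path_col: str = "spect_path") -> pd.DataFrame:
    """At prep time: store paths relative to the dataset root (column name assumed)."""
    dataset_df = dataset_df.copy()
    dataset_df[path_col] = dataset_df[path_col].map(
        lambda p: str(Path(p).relative_to(dataset_path))
    )
    return dataset_df


def resolve_paths(dataset_df: pd.DataFrame, dataset_path: Path,
                  path_col: str = "spect_path") -> list[Path]:
    """At load time: join relative paths back onto the dataset root."""
    return [dataset_path / p for p in dataset_df[path_col]]
```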
Successfully merging this pull request may close these issues.

ENH: Run validate at end of prep, not start of train/learncurve, and include as "metadata" in dataset