Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add dataset sub-table to config file, remove other dataset/transform param keys #748

Closed
Tracked by #614
NickleDave opened this issue May 4, 2024 · 1 comment
Assignees
Labels
ENH: enhancement enhancement; new feature or request

Comments

@NickleDave
Copy link
Collaborator

NickleDave commented May 4, 2024

Core idea: add a dataset sub-table to top-level tabels [vak.train], [vak.eval], etc., in the config file

It will look like this:

[vak.train.dataset]
name = "BuiltInDataset"  # optional. Should exactly match class name
path = "~/path/to/data/on/my/computer"  # required
splits_path = "~/path/to/splits/I/made"  # optional
params = { window_size = 2000 }  # "optional" but required for some DataPipes

This assumes we have already done #685 and #345 (which I have--in progress now)

It allows the following:

  • Replace train_transform_params + val_transform_params with the params option in the dataset table -- in practice we only ever use these parameters for specifying window_size; by enforcing the idea that these are parameters of the dataset we move towards ENH: Rename datasets to pipes, that have built-in transforms #724 where the built-in DataPipes have fixed transforms that they always use (and if you need more control then it's time to move to your own script)
  • specifying a splits_path that is different from the default path -- this way if you want to try a different split with your dataset, you don't need to remake the entire dataset; splits are just in the splits file (edit: as in ENH: Make it possible to specify different splits for datasets #749)
@NickleDave NickleDave self-assigned this May 4, 2024
@NickleDave NickleDave added the ENH: enhancement enhancement; new feature or request label May 4, 2024
NickleDave added a commit that referenced this issue May 5, 2024
…750)

* WIP: Add src/vak/config/dataset.py

* Add module-level docstring + type annotations in src/vak/config/parse.py

* WIP: Fix how cli.prep adds dataset path to toml config file

* Change table names in src/vak/config/valid.toml

* Rename section -> table in config/parse.py

* In cli/prep change 'section' -> 'table' and lowercase table names

* In config/config.py, change 'section' -> 'table' and lowercase table names

* Change '[PREP]' -> '[vak.prep]' in config/prep.py

* WIP: Change table names in config files in tests/data_for_tests/configs

* Make tomlkit a dependency in pyproject.toml, drop toml

* Change config/parse.py to use tomlkit

* Update example configs in doc/toml/

* Add link to example config files in docs, in error messages in config/validators.py

* Remove 'spect_params' from REQUIRED_OPTIONS in config/parse.py, this is not a top-level table and will be an attribute of prep instead

* Rename 'config_toml' -> 'config_dict' in config/parse.py

* Fix function _validate_tables_arg_convert_list in config/parse.py

* Fix error message formatting in src/vak/config/validators.py

* Add ModelConfig class to config/model.py, add type annotations, fix config_from_toml_dict to look in specific section

* Fixup fixing config_from_toml_dict to look in specific section

* Rewrite config/eval.py with 'modern' attrs

* Fixup rewrite config/eval with 'modern attrs

* Rewrite config/learncurve.py with 'modern' attrs

* Rewrite config/predict.py with 'modern' attrs

* Rewrite config/prep.py with 'modern' attrs

* Rewrite config/train.py with 'modern' attrs

* Rename Dataset -> DatasetConfig in config/dataset.py

* Add are_table_options_valid to config/validators.py, will be used by classmethods from_config_dict

* WIP: Add from_config_dict classmethod to EvalConfig

* WIP: Add tests/test_config/test_dataset.py

* Make fixes to ModelConfig class, fix circular imports in config/model.py module

* Write tests in tests/test_config/test_dataset.py

* Use tomlkit not toml in cli/prep.py

* Use tomlkit in tests/fixtures/annot.py

* Use tomlkit in tests/scripts/vaktestdata/configs.py

* Use tomlkit in tests/scripts/vaktestdata/source_files.py

* Use tomlkit in tests/test_config/test_validators.py

* Remove spect_params attribute from Config in config/config.py, fix class' docstring

* Reorder attributes, fix typo in docstring of DatasetConfig

* Rewrite config/parse.py assuming config classes have from_config_dict classmethod

* Rename `table` -> `table_name` in a couple validators in config/validators.py

* Remove use of config.model.config_from_toml_path in cli/eval.py

* Remove use of config.model.config_from_toml_path in cli/learncurve.py

* Remove use of config.model.config_from_toml_path in cli/predict.py

* Remove use of config.model.config_from_toml_path in cli/train.py

* Remove functions from config/model.py: config_from_toml_path and config_from_toml_dict

* Add `to_dict` method to ModelConfig

* Use to_dict() method of ModelConfig class in cli functions

* Fix how we get labelset from config in tests/fixtures/annot.py

* WIP: Clean up / rewrite tests/fixtures/config.py

* Fix model tables in tests/data_for_tests/configs

* Finish unit tests in tests/test_config/test_model.py

* Fix model tables in doc/toml

* Rename data_for_tests/configs/invalid_option_config.toml -> invalid_key_config.toml

* Rename are_options_valid/are_table_options_valid -> are_keys_valid/are_table_keys_valid in config/validators.py

* Rename two fixtures in fixtures/config.py: invalid_section_config_path -> invalid_table_config_path, invalid_option_config_path -> invalid_key_config_path

* Fix validator names in config/parse.py, rename TABLE_CLASSES constant -> TABLE_CLASSES_MAP

* Rename config/valid.toml -> valid-version-1.0.toml, fix how model table is declared

* Fix VALID_TOML_PATH in config/validators.py after renaming config/valid.toml -> config/valid-version-1.0.toml

* Import config classes in vak/config/__init__.py

* Add _tomlkit_to_popo to tests/fixtures/config.py so we operate on dicts not tomlkit.TOMLDocument

* Add _tomlkit_to_popo to config/parse.py so we operate on dicts not tomlkit.TOMLDocument

* Finish rewriting tests for tests/test_config/test_prep.py

* Rewrite EvalConfig with from_config_dict method

* Rewrite LearncurveConfig with from_config_dict method

* Rewrite PredictConfig with from_config_dict method

* Rewrite PrepConfig with from_config_dict method

* Rewrite TrainConfig with from_config_dict method

* Remove functions from config/parse.py

* Rename config/parse.py -> config/load.py

* Make functions in config/parse.py into classmethods on Config class

* Use config.Config.from_toml_path everywhere instead of config.parse.from_toml_path

* Make fixes in Config classmethods

* Change load._load_toml_from_path again so that it returns config_dict['vak'], to avoid writing ['vak'] everywhere in calling functions

* Add docstring to are_tables_valid in config/validators.py

* Lowercase config table names in tests/scripts/vaktestdata/configs.py

* In tests/scripts/vaktestdata/source_files.py, change cfg.spect_params -> cfg.prep.spect_params, fix how we change values in toml, add tables_to_parse arg to call to Config.from_toml_path

* in test_cli/test_prep.py, call vak.config.load not vak.config.parse

* Fix how we instantiate DatasetConfig and ModelConfig in EvalConfig.from_config_dict method

* Fix how we instantiate DatasetConfig and ModelConfig in PredictConfig.from_config_dict method

* Fix how we instantiate DatasetConfig and ModelConfig in TrainConfig.from_config_dict method

* Fix how we instantiate DatasetConfig and ModelConfig in LearncurveConfig.from_config_dict method

* Remove brekapoint in src/vak/config/model.py

* Fix wrong variable name so we save configs correctly in tests/scripts/vaktestdata/source_files.py, and add tables_to_parse arg to Config.from_toml_path, so we don't get 'missing dataset' errors

* Fix how we re-write configs, in tests/scripts/vaktestdata/configs.py

* Add model and dataset tables to get those keys in top-level tables, in src/vak/config/valid-version-1.0.toml

* Change cfg.table.dataset_path -> cfg.table.dataset.path in vak/cli modules (e.g., vak.train.dataset.path)

* Get tests passing for tests/test_config/test_eval.py

* Clean up tests/test_config/test_eval.py

* Get tests passing in tests/test_config/test_predict.py

* Fix how we access config_toml in tests/scripts/vaktestdata/configs.py -- missing 'vak' key

* Add pytest.mark.parametrize to tests/test_config/test_learncurve.py

* Rewrite tests in tests/test_config/test_train.py

* Rewrite tests in tests/test_config/test_config.py

* Add unit test to tests/test_config/test_model.py

* Add unit test for exceptions in tests/test_config/test_eval.py

* Fix 'cfg.spect_params' -> 'cfg.prep.spect_params' in src/vak/cli/predict.py

* Add unit test for exceptions in tests/test_config/test_learncurve.py

* Add unit test for exceptions in tests/test_config/test_train.py

* Add more test cases to TestEvalConfig.test_from_config_dict_raises

* Add more test cases to TestLearncurveConfig.test_from_config_dict_raises

* Add unit test for exceptions in tests/test_config/test_predict.py

* Add two unit tests that PrepConfig raises expected exceptions

* Fix/add unit tests in tests/test_config/test_config.py

* Change order of parameters for Config.from_config_dict, make toml_path last param

* Fix/add unit tests in tests/fixtures/config.py

* Fix/add unit tests in tests/fixtures/config.py

* Rename test_config/test_parse.py -> test_load.py, fix/rewrite tests

* Fix tests in tests/test_config/test_spect_params.py

* Make fixups in tests/test_config

* Apply fixes from linter

* Make more linting fixes

* Speed up install in nox session 'lint', only install linting tools

* Change names 'section'/'option' -> 'table'/'key' in tests

* Fix tests in tests/test_cli/test_eval.py

* Finish fixing cli tests, fix renaming

* Fix how we get 'path' from 'dataset' table in configs, in tests/fixtures/csv.py

* Fix how we get 'path' from 'dataset' table in configs, in tests/fixtures/dataset.py

* Change .dataset_path -> .dataset.path in tests/

* Fix how we get model config and rename config attribute .dataset_path -> .dataset.path throughout tests

* In tests/, fixup change .dataset_path -> .dataset.path, use model.name where we used to use just 'model' attribute of config

* Fix fixture specific_config_toml_path in fixtures/config.py to handle case where we need to access sub-table and change a key in it--right now this is just [
'dataset']['path']

* Fix how we change ['dataset']['path'] value in tests/test_eval/test_frame_classification.py

* Fix how we change ['dataset']['path'] value in config in several tests

* Use ModelConfig attribute name where needed in tests/test_learncurve/test_frame_classification.py

* In tests, replace calls to vak.config.model.config_from_toml_path with calls to ModelConfig method to_dict()

* Change cfg.spect_params -> cfg.prep.spect_params in tests

* Fix cfg.predict -> cfg.predict.dataset.path in tests/test_predict/test_frame_classification.py

* Fix constant LABELSET_NOTMAT in fixtures/annot.py so it is a list of str, not a Tomlkit.String class

* Fix cfg.learncurve -> cfg.learncurve.dataset.path in tests/test_prep/test_frame/test_learncurve.py

* Fix cfg.learncurve -> cfg.learncurve.dataset.path in tests/test_prep/test_frame/test_learncurve.py

* Cast pathlib to str before adding to tomldoc, in tests/test_train/

* Change transform/dataset params keys in data_for_tests/configs to a dataset table with a params key

* Add `params` attribute to DatasetConfig

* Change transform/dataset params keys in doc/toml/ to a dataset table with a params key

* Rewrite vak/config/model.py method 'to_dict' as 'asdict', using attrs asdict function. We now return 'name' and will just get it from the dict instead of having a separate 'model_name' parameter for functions that take 'model_config'

* Add asdict method to DatasetConfig class, like ModelConfig.asdict

* Fix calls to model.to_dict() -> model.asdict()

* Add unit tests for DatasetConfig.asdict

* Add unit tests for ModelConfig.asdict

* Add an assertion in tests/test_config/test_dataset.py

* Remove transform params and dataset_params from EvalConfig, will just use dataset attribute, a DatasetConfig, with its params attribute

* Remove dataset/transform_params key-value pairs in valid-version-1.0.toml, and add params key to dataset tables with in-line table params

* Remove train/val/dataset/transform_params from TrainConfig, will use DatasetConfig attribute params instead

* Remove train/val/dataset/transform_params from PredictConfig, will use DatasetConfig attribute params instead

* Revise transforms.defaults.frame_classification.TrainItemTransform and change get_default_frame_classification_transform to return an instance of the TrainItemTransform when 'mode' is 'train'

* Make vak.table.dataset.params into an in-line table in toml files in tests/data_for_tests/configs

* Fix attribute name in frame_classification.TrainItemTransform.__init__: source_transform -> frames_transform

* Rewrite datasets.frame_classification.WindowDataset to require item_transform, and assume that it is an instance of transforms.frame_classification.TrainItemTransform

* Rewrite datasets.frame_classification.FramesDataset to make item_transform required

* Rewrite src/vak/train/frame_classification.py: remove params model_name, train/val_transform_params, train/val_dataset_params, and dataset_path, replace with dataset_config and just have model_config contain name

* Rewrite src/vak/train/_train.py: remove params model_name, train/val_transform_params, train/val_dataset_params, and dataset_path, replace with dataset_config and just have model_config contain name

* Rewrite vak/cli/train.py to call train._train.train with just model_config and dataset_config, remove model_name, dataset_path, train/val_transform_params and train/val_dataset_params

* Fix how we unpack batch in training_step method of FrameClassificationModel

* Change transform_kwargs parameter of transforms.defaults.parametric_umap.get_default_parametric_umap_transform to default to None, and if None to be an empty dict

* Change transform_kwargs parameter of transforms.defaults.frame_classification.get_default_frame_classification_transform to default to None, and if None to be an empty dict

* Change DatasetConfig.params attribute to default to empty dict, so we can unpack with ** operator even when no params are specified

* Fix DatasetConfig.from_config_dict method to not use dict.get method, so we don't set attributes to None inadvertently

* Modify transforms.defaults.get so that transform_kwargs is None by default. Also revise docstring and type annotations

* Rewrite src/vak/train/parametric_umap.py to use model_config and dataset_config parameters, removing parameters val/train_transform_params + val/train_dataset_params and dataset_path

* Rewrite vak/eval/frame_classification.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite vak/eval/parametric_umap.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite vak/eval/eval_.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite cli.eval to pass model_config and dataset_config into eval_module.eval, remove dataset_path/transform_params/datset_params arguments

* Unpack dataset_config[params] with ** inside trak/frame_classification.py, instead of directly getting window_size from the params dict

* Rewrite vak/learncurve/frame_classification.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite vak/learncurve/learncurve.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite cli.learncurve to pass model_config and dataset_config into learning_curve.learncurve, remove dataset_path/transform_params/datset_params arguments

* Rewrite vak/predict/frame_classification.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite vak/predict/parametric_umap.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Rewrite vak/predict/predict.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Fix dataset_path -> dataset_config[path] and add missing variable model_name in src/vak/learncurve/learncurve.py

* Fix dataset_path -> dataset_config[path] and add missing variable model_name in src/vak/learncurve/frame_classification.py

* Rewrite vak/cli/predict.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Remove non-existent dataset_params variable in vak/predict/frame_classification.py

* Fix unit tests for DatasetConfig to test 'params' attribute gets handled correctly

* Remove train/val_dataset_params and train/val_transform_params from test cases we parametrize with in tests/test_config/

* Use DatasetConfig.params attribute where we need to in tests/test_datasets

* Fix method name ModelConfig.to_dict -> asdict in tests/

* In tests for eval/learncurve/predict/train, use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path

* Fix use of default transform and dataset.params attribute in test_models/test_base.py

* Fix config snippets in docs

* Apply linting to src/

* Raise 'from e' with errors in eval/predict/train/frame_classification modules
@NickleDave
Copy link
Collaborator Author

Fixed by #750

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ENH: enhancement enhancement; new feature or request
Projects
None yet
Development

No branches or pull requests

1 participant