ENH: Add `dataset` sub-table to config file, remove other dataset/transform param keys #748

NickleDave · 2024-05-04T13:25:36Z

Core idea: add a dataset sub-table to top-level tabels [vak.train], [vak.eval], etc., in the config file

It will look like this:

[vak.train.dataset]
name = "BuiltInDataset"  # optional. Should exactly match class name
path = "~/path/to/data/on/my/computer"  # required
splits_path = "~/path/to/splits/I/made"  # optional
params = { window_size = 2000 }  # "optional" but required for some DataPipes

This assumes we have already done #685 and #345 (which I have--in progress now)

It allows the following:

Replace train_transform_params + val_transform_params with the params option in the dataset table -- in practice we only ever use these parameters for specifying window_size; by enforcing the idea that these are parameters of the dataset we move towards ENH: Rename datasets to pipes, that have built-in transforms #724 where the built-in DataPipes have fixed transforms that they always use (and if you need more control then it's time to move to your own script)
specifying a splits_path that is different from the default path -- this way if you want to try a different split with your dataset, you don't need to remake the entire dataset; splits are just in the splits file (edit: as in ENH: Make it possible to specify different splits for datasets #749)

The text was updated successfully, but these errors were encountered:

…750) * WIP: Add src/vak/config/dataset.py * Add module-level docstring + type annotations in src/vak/config/parse.py * WIP: Fix how cli.prep adds dataset path to toml config file * Change table names in src/vak/config/valid.toml * Rename section -> table in config/parse.py * In cli/prep change 'section' -> 'table' and lowercase table names * In config/config.py, change 'section' -> 'table' and lowercase table names * Change '[PREP]' -> '[vak.prep]' in config/prep.py * WIP: Change table names in config files in tests/data_for_tests/configs * Make tomlkit a dependency in pyproject.toml, drop toml * Change config/parse.py to use tomlkit * Update example configs in doc/toml/ * Add link to example config files in docs, in error messages in config/validators.py * Remove 'spect_params' from REQUIRED_OPTIONS in config/parse.py, this is not a top-level table and will be an attribute of prep instead * Rename 'config_toml' -> 'config_dict' in config/parse.py * Fix function _validate_tables_arg_convert_list in config/parse.py * Fix error message formatting in src/vak/config/validators.py * Add ModelConfig class to config/model.py, add type annotations, fix config_from_toml_dict to look in specific section * Fixup fixing config_from_toml_dict to look in specific section * Rewrite config/eval.py with 'modern' attrs * Fixup rewrite config/eval with 'modern attrs * Rewrite config/learncurve.py with 'modern' attrs * Rewrite config/predict.py with 'modern' attrs * Rewrite config/prep.py with 'modern' attrs * Rewrite config/train.py with 'modern' attrs * Rename Dataset -> DatasetConfig in config/dataset.py * Add are_table_options_valid to config/validators.py, will be used by classmethods from_config_dict * WIP: Add from_config_dict classmethod to EvalConfig * WIP: Add tests/test_config/test_dataset.py * Make fixes to ModelConfig class, fix circular imports in config/model.py module * Write tests in tests/test_config/test_dataset.py * Use tomlkit not toml in cli/prep.py * Use tomlkit in tests/fixtures/annot.py * Use tomlkit in tests/scripts/vaktestdata/configs.py * Use tomlkit in tests/scripts/vaktestdata/source_files.py * Use tomlkit in tests/test_config/test_validators.py * Remove spect_params attribute from Config in config/config.py, fix class' docstring * Reorder attributes, fix typo in docstring of DatasetConfig * Rewrite config/parse.py assuming config classes have from_config_dict classmethod * Rename `table` -> `table_name` in a couple validators in config/validators.py * Remove use of config.model.config_from_toml_path in cli/eval.py * Remove use of config.model.config_from_toml_path in cli/learncurve.py * Remove use of config.model.config_from_toml_path in cli/predict.py * Remove use of config.model.config_from_toml_path in cli/train.py * Remove functions from config/model.py: config_from_toml_path and config_from_toml_dict * Add `to_dict` method to ModelConfig * Use to_dict() method of ModelConfig class in cli functions * Fix how we get labelset from config in tests/fixtures/annot.py * WIP: Clean up / rewrite tests/fixtures/config.py * Fix model tables in tests/data_for_tests/configs * Finish unit tests in tests/test_config/test_model.py * Fix model tables in doc/toml * Rename data_for_tests/configs/invalid_option_config.toml -> invalid_key_config.toml * Rename are_options_valid/are_table_options_valid -> are_keys_valid/are_table_keys_valid in config/validators.py * Rename two fixtures in fixtures/config.py: invalid_section_config_path -> invalid_table_config_path, invalid_option_config_path -> invalid_key_config_path * Fix validator names in config/parse.py, rename TABLE_CLASSES constant -> TABLE_CLASSES_MAP * Rename config/valid.toml -> valid-version-1.0.toml, fix how model table is declared * Fix VALID_TOML_PATH in config/validators.py after renaming config/valid.toml -> config/valid-version-1.0.toml * Import config classes in vak/config/__init__.py * Add _tomlkit_to_popo to tests/fixtures/config.py so we operate on dicts not tomlkit.TOMLDocument * Add _tomlkit_to_popo to config/parse.py so we operate on dicts not tomlkit.TOMLDocument * Finish rewriting tests for tests/test_config/test_prep.py * Rewrite EvalConfig with from_config_dict method * Rewrite LearncurveConfig with from_config_dict method * Rewrite PredictConfig with from_config_dict method * Rewrite PrepConfig with from_config_dict method * Rewrite TrainConfig with from_config_dict method * Remove functions from config/parse.py * Rename config/parse.py -> config/load.py * Make functions in config/parse.py into classmethods on Config class * Use config.Config.from_toml_path everywhere instead of config.parse.from_toml_path * Make fixes in Config classmethods * Change load._load_toml_from_path again so that it returns config_dict['vak'], to avoid writing ['vak'] everywhere in calling functions * Add docstring to are_tables_valid in config/validators.py * Lowercase config table names in tests/scripts/vaktestdata/configs.py * In tests/scripts/vaktestdata/source_files.py, change cfg.spect_params -> cfg.prep.spect_params, fix how we change values in toml, add tables_to_parse arg to call to Config.from_toml_path * in test_cli/test_prep.py, call vak.config.load not vak.config.parse * Fix how we instantiate DatasetConfig and ModelConfig in EvalConfig.from_config_dict method * Fix how we instantiate DatasetConfig and ModelConfig in PredictConfig.from_config_dict method * Fix how we instantiate DatasetConfig and ModelConfig in TrainConfig.from_config_dict method * Fix how we instantiate DatasetConfig and ModelConfig in LearncurveConfig.from_config_dict method * Remove brekapoint in src/vak/config/model.py * Fix wrong variable name so we save configs correctly in tests/scripts/vaktestdata/source_files.py, and add tables_to_parse arg to Config.from_toml_path, so we don't get 'missing dataset' errors * Fix how we re-write configs, in tests/scripts/vaktestdata/configs.py * Add model and dataset tables to get those keys in top-level tables, in src/vak/config/valid-version-1.0.toml * Change cfg.table.dataset_path -> cfg.table.dataset.path in vak/cli modules (e.g., vak.train.dataset.path) * Get tests passing for tests/test_config/test_eval.py * Clean up tests/test_config/test_eval.py * Get tests passing in tests/test_config/test_predict.py * Fix how we access config_toml in tests/scripts/vaktestdata/configs.py -- missing 'vak' key * Add pytest.mark.parametrize to tests/test_config/test_learncurve.py * Rewrite tests in tests/test_config/test_train.py * Rewrite tests in tests/test_config/test_config.py * Add unit test to tests/test_config/test_model.py * Add unit test for exceptions in tests/test_config/test_eval.py * Fix 'cfg.spect_params' -> 'cfg.prep.spect_params' in src/vak/cli/predict.py * Add unit test for exceptions in tests/test_config/test_learncurve.py * Add unit test for exceptions in tests/test_config/test_train.py * Add more test cases to TestEvalConfig.test_from_config_dict_raises * Add more test cases to TestLearncurveConfig.test_from_config_dict_raises * Add unit test for exceptions in tests/test_config/test_predict.py * Add two unit tests that PrepConfig raises expected exceptions * Fix/add unit tests in tests/test_config/test_config.py * Change order of parameters for Config.from_config_dict, make toml_path last param * Fix/add unit tests in tests/fixtures/config.py * Fix/add unit tests in tests/fixtures/config.py * Rename test_config/test_parse.py -> test_load.py, fix/rewrite tests * Fix tests in tests/test_config/test_spect_params.py * Make fixups in tests/test_config * Apply fixes from linter * Make more linting fixes * Speed up install in nox session 'lint', only install linting tools * Change names 'section'/'option' -> 'table'/'key' in tests * Fix tests in tests/test_cli/test_eval.py * Finish fixing cli tests, fix renaming * Fix how we get 'path' from 'dataset' table in configs, in tests/fixtures/csv.py * Fix how we get 'path' from 'dataset' table in configs, in tests/fixtures/dataset.py * Change .dataset_path -> .dataset.path in tests/ * Fix how we get model config and rename config attribute .dataset_path -> .dataset.path throughout tests * In tests/, fixup change .dataset_path -> .dataset.path, use model.name where we used to use just 'model' attribute of config * Fix fixture specific_config_toml_path in fixtures/config.py to handle case where we need to access sub-table and change a key in it--right now this is just [ 'dataset']['path'] * Fix how we change ['dataset']['path'] value in tests/test_eval/test_frame_classification.py * Fix how we change ['dataset']['path'] value in config in several tests * Use ModelConfig attribute name where needed in tests/test_learncurve/test_frame_classification.py * In tests, replace calls to vak.config.model.config_from_toml_path with calls to ModelConfig method to_dict() * Change cfg.spect_params -> cfg.prep.spect_params in tests * Fix cfg.predict -> cfg.predict.dataset.path in tests/test_predict/test_frame_classification.py * Fix constant LABELSET_NOTMAT in fixtures/annot.py so it is a list of str, not a Tomlkit.String class * Fix cfg.learncurve -> cfg.learncurve.dataset.path in tests/test_prep/test_frame/test_learncurve.py * Fix cfg.learncurve -> cfg.learncurve.dataset.path in tests/test_prep/test_frame/test_learncurve.py * Cast pathlib to str before adding to tomldoc, in tests/test_train/ * Change transform/dataset params keys in data_for_tests/configs to a dataset table with a params key * Add `params` attribute to DatasetConfig * Change transform/dataset params keys in doc/toml/ to a dataset table with a params key * Rewrite vak/config/model.py method 'to_dict' as 'asdict', using attrs asdict function. We now return 'name' and will just get it from the dict instead of having a separate 'model_name' parameter for functions that take 'model_config' * Add asdict method to DatasetConfig class, like ModelConfig.asdict * Fix calls to model.to_dict() -> model.asdict() * Add unit tests for DatasetConfig.asdict * Add unit tests for ModelConfig.asdict * Add an assertion in tests/test_config/test_dataset.py * Remove transform params and dataset_params from EvalConfig, will just use dataset attribute, a DatasetConfig, with its params attribute * Remove dataset/transform_params key-value pairs in valid-version-1.0.toml, and add params key to dataset tables with in-line table params * Remove train/val/dataset/transform_params from TrainConfig, will use DatasetConfig attribute params instead * Remove train/val/dataset/transform_params from PredictConfig, will use DatasetConfig attribute params instead * Revise transforms.defaults.frame_classification.TrainItemTransform and change get_default_frame_classification_transform to return an instance of the TrainItemTransform when 'mode' is 'train' * Make vak.table.dataset.params into an in-line table in toml files in tests/data_for_tests/configs * Fix attribute name in frame_classification.TrainItemTransform.__init__: source_transform -> frames_transform * Rewrite datasets.frame_classification.WindowDataset to require item_transform, and assume that it is an instance of transforms.frame_classification.TrainItemTransform * Rewrite datasets.frame_classification.FramesDataset to make item_transform required * Rewrite src/vak/train/frame_classification.py: remove params model_name, train/val_transform_params, train/val_dataset_params, and dataset_path, replace with dataset_config and just have model_config contain name * Rewrite src/vak/train/_train.py: remove params model_name, train/val_transform_params, train/val_dataset_params, and dataset_path, replace with dataset_config and just have model_config contain name * Rewrite vak/cli/train.py to call train._train.train with just model_config and dataset_config, remove model_name, dataset_path, train/val_transform_params and train/val_dataset_params * Fix how we unpack batch in training_step method of FrameClassificationModel * Change transform_kwargs parameter of transforms.defaults.parametric_umap.get_default_parametric_umap_transform to default to None, and if None to be an empty dict * Change transform_kwargs parameter of transforms.defaults.frame_classification.get_default_frame_classification_transform to default to None, and if None to be an empty dict * Change DatasetConfig.params attribute to default to empty dict, so we can unpack with ** operator even when no params are specified * Fix DatasetConfig.from_config_dict method to not use dict.get method, so we don't set attributes to None inadvertently * Modify transforms.defaults.get so that transform_kwargs is None by default. Also revise docstring and type annotations * Rewrite src/vak/train/parametric_umap.py to use model_config and dataset_config parameters, removing parameters val/train_transform_params + val/train_dataset_params and dataset_path * Rewrite vak/eval/frame_classification.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite vak/eval/parametric_umap.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite vak/eval/eval_.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite cli.eval to pass model_config and dataset_config into eval_module.eval, remove dataset_path/transform_params/datset_params arguments * Unpack dataset_config[params] with ** inside trak/frame_classification.py, instead of directly getting window_size from the params dict * Rewrite vak/learncurve/frame_classification.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite vak/learncurve/learncurve.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite cli.learncurve to pass model_config and dataset_config into learning_curve.learncurve, remove dataset_path/transform_params/datset_params arguments * Rewrite vak/predict/frame_classification.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite vak/predict/parametric_umap.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Rewrite vak/predict/predict.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Fix dataset_path -> dataset_config[path] and add missing variable model_name in src/vak/learncurve/learncurve.py * Fix dataset_path -> dataset_config[path] and add missing variable model_name in src/vak/learncurve/frame_classification.py * Rewrite vak/cli/predict.py to use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Remove non-existent dataset_params variable in vak/predict/frame_classification.py * Fix unit tests for DatasetConfig to test 'params' attribute gets handled correctly * Remove train/val_dataset_params and train/val_transform_params from test cases we parametrize with in tests/test_config/ * Use DatasetConfig.params attribute where we need to in tests/test_datasets * Fix method name ModelConfig.to_dict -> asdict in tests/ * In tests for eval/learncurve/predict/train, use model_config and dataset_config parameters, removing parameters transform_params + dataset_params and dataset_path * Fix use of default transform and dataset.params attribute in test_models/test_base.py * Fix config snippets in docs * Apply linting to src/ * Raise 'from e' with errors in eval/predict/train/frame_classification modules

NickleDave · 2024-05-05T03:24:27Z

Fixed by #750

NickleDave self-assigned this May 4, 2024

NickleDave added the ENH: enhancement enhancement; new feature or request label May 4, 2024

NickleDave closed this as completed May 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add `dataset` sub-table to config file, remove other dataset/transform param keys #748

ENH: Add `dataset` sub-table to config file, remove other dataset/transform param keys #748

NickleDave commented May 4, 2024 •

edited

Loading

NickleDave commented May 5, 2024

ENH: Add dataset sub-table to config file, remove other dataset/transform param keys #748

ENH: Add dataset sub-table to config file, remove other dataset/transform param keys #748

Comments

NickleDave commented May 4, 2024 • edited Loading

NickleDave commented May 5, 2024

ENH: Add `dataset` sub-table to config file, remove other dataset/transform param keys #748

ENH: Add `dataset` sub-table to config file, remove other dataset/transform param keys #748

NickleDave commented May 4, 2024 •

edited

Loading