
ENH: Minimize duplication of data when preparing datasets for frame classification models #717

Closed
6 tasks
NickleDave opened this issue Oct 4, 2023 · 1 comment · Fixed by #718

Comments

@NickleDave
Collaborator

Currently running vak prep results in a lot of duplicated data, especially for learning curves. For the TweetyNet canary dataset, this means prepared datasets can balloon to ~700 GB of data and quickly use up space on a drive.

There are a couple of ways that duplication happens:

  • We prepare the spectrograms as npz files that include the frequencies and times vectors
  • Then we extract just the spectrogram matrices from these files and save them as additional files in the split directories, with the frames.npy extension
  • We also leave the spectrogram files untouched, the logic being that they contain a kind of data we might want to use
  • Additionally when we create splits for learncurves, we duplicate the training data again, in a separate directory for each split
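To make the duplication concrete, here is a minimal sketch of the first two steps; the key names (`s`, `f`, `t`) and file names are illustrative assumptions, not necessarily vak's actual defaults:

```python
import pathlib
import tempfile

import numpy as np

tmp = pathlib.Path(tempfile.mkdtemp())

# a fake spectrogram: matrix plus frequencies and times vectors
spect = np.random.rand(256, 1000)    # (freq bins, time bins)
f = np.linspace(0.0, 10_000.0, 256)  # frequencies vector
t = np.linspace(0.0, 10.0, 1000)     # times vector

# step 1: prep saves everything in one npz file
npz_path = tmp / "bird1_001.spect.npz"
np.savez(npz_path, s=spect, f=f, t=t)

# step 2: just the spectrogram matrix is extracted and saved again
# as a .frames.npy file inside a split directory
split_dir = tmp / "train"
split_dir.mkdir()
frames_path = split_dir / "bird1_001.frames.npy"
np.save(frames_path, np.load(npz_path)["s"])

# the npz is left untouched, so the matrix now lives on disk twice
```

For a learning curve, step 2 is then repeated once per training-set subset, which is where the ~700 GB figure comes from.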

I think we should fix this in the following ways:

  • Get rid of the spectrogram files after we get the "frames" out of them.
    • I would prefer to keep intermediate steps of data processing for reproducibility, but I think we just need to conceive of "prep" as giving us the final transformed data from whatever the input is, so we don't end up using all of people's storage space
  • No longer make separate splits directories for each subset of the training set we use in learncurve
  • Instead we will achieve the learncurve functionality as follows:
    • Represent "splits" separately from "subsets" of a split, like the subsets of the training data split that we make for learning curves:
    • We will add the learning curve subset names (e.g. "duration 300s, replicate 9") to a 'subset' column in the csv representing the dataset -- so some samples will get repeated in multiple rows of the dataset csv, but we won't have multiple copies of the same file
    • We will add a subset parameter to dataset classes
    • If specified, subset takes precedence over split: instead of grabbing the rows of the dataframe (loaded from the csv) that correspond to the specified split, we grab all the rows that correspond to the subset. If subset is None, we use the whole split.
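A minimal sketch of the proposed selection logic; the column names and the helper below are hypothetical, written to match the description above:

```python
import pandas as pd


def select_rows(dataset_df: pd.DataFrame, split: str, subset=None) -> pd.DataFrame:
    """Pick the rows a dataset instance should use.

    `subset`, when given, takes precedence over `split`;
    when it is None, the whole split is used.
    """
    if subset is not None:
        return dataset_df[dataset_df["subset"] == subset]
    return dataset_df[dataset_df["split"] == split]


# samples can appear in multiple rows, once per subset,
# without multiple copies of the same file on disk
dataset_df = pd.DataFrame({
    "frames_path": ["a.frames.npy", "b.frames.npy", "a.frames.npy"],
    "split": ["train", "train", "train"],
    "subset": [None, None, "train-dur-300-replicate-9"],
})
```

With this dataframe, `select_rows(dataset_df, split="train")` returns all three rows, while passing `subset="train-dur-300-replicate-9"` returns only the one row tagged with that subset.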
@NickleDave
Collaborator Author

  • Add this subset parameter for ParametricUMAP datasets too even though we haven't implemented the learning curve yet
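The subset names themselves could come from a small helper like the `get_train_dur_replicate_subset_name` function renamed in the commit below; this sketch assumes a naming scheme for illustration, not vak's actual format:

```python
def get_train_dur_replicate_subset_name(train_dur: float, replicate_num: int) -> str:
    """Build a subset name from a training duration and replicate number.

    The "train-dur-{dur}s-replicate-{num}" scheme is an assumption
    made for this sketch.
    """
    return f"train-dur-{train_dur:g}s-replicate-{replicate_num}"
```

Such names go into the 'subset' column of the dataset csv, one row per sample per subset.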

NickleDave added a commit that referenced this issue Oct 10, 2023
* Change function vak.prep.frame_classification.dataset_arrays.make_npy_files_for_each_split to remove spectrogram/audio files from dataset path after making the npy files

* Modify prep_spectrogram_dataset so that it no longer makes a directory 'spectrogram_generated_{timenow}' -- that way we don't have to delete the directory when we remove the spectrograms after converting to npy files later

* Rename get_train_dur_replicate_split_name -> get_train_dur_replicate_subset_name in src/vak/common/learncurve.py

* Modify src/vak/prep/frame_classification/learncurve.py to no longer make duplicate npy files for each subset name, and to add subset names in a separate column from split so that we can specify subsets directly in learncurve

* Add subset parameter to src/vak/datasets/frame_classification/frames_dataset.py, that takes precedence over split parameter when selecting part of dataframe to use for grabbing samples

* Add subset parameter to src/vak/datasets/frame_classification/window_dataset.py, that takes precedence over split parameter when selecting part of dataframe to use for grabbing samples

* Rename split parameter of vak.train.frame_classification to subset, and use when making training dataset instance

* Use subset inside of src/vak/learncurve/frame_classification.py

* Have StandardizeSpect.fit_dataset_path take subset argument and have it take precedence over split when fitting, as with dataset classes

* Use split + subset when calling StandardizeSpect.fit_dataset_path in src/vak/train/frame_classification.py

* Use subset not split argument when calling training functions for model families in src/vak/train/train_.py

* WIP: Use subset with ParametricUMAPDataset (haven't added argument to dataset class yet)

* Add function `make_index_vectors_for_each_subset` to src/vak/prep/frame_classification/learncurve.py, rename `make_learncurve_splits` to `make_subsets_from_dataset_df` and have it call `make_index_vectors`

* Revise a couple things in docstring in src/vak/prep/frame_classification/dataset_arrays.py

* Have audio_format default to None in src/vak/prep/frame_classification/dataset_arrays.py and raise ValueError if input_type is audio but audio_format is None

* Fix parameter order of function in src/vak/prep/frame_classification/learncurve.py to match order of dataset_arrays so it's not confusing, and set default of audio_format to None, raise a ValueError if input_type is audio but audio_format is None

* In src/vak/prep/frame_classification/frame_classification.py, call make_subsets_from_dataset_df with correct arguments (now renamed from make_learncurve_splits_from_dataset_df)

* Add src/vak/datasets/frame_classification/helper.py with helper functions that return filenames of indexing vectors for subsets of (training) data

* Import helper in src/vak/datasets/frame_classification/__init__.py

* Use helper functions to load indexing vectors for subsets in classmethod of src/vak/datasets/frame_classification/window_dataset.py

* Use helper functions to load indexing vectors for subsets in classmethod of src/vak/datasets/frame_classification/frames_dataset.py

* Rewrite functions in src/vak/prep/frame_classification/frame_classification.py -- realize I can just use frame npy files to make indexing vectors, so I don't need input type, audio format, etc.

* Fix args to make_index_vectors_for_each_subset and fix how we concatenate dataset_df in src/vak/prep/frame_classification/learncurve.py

* Fix how we use subset in FramesDataset.__init__

* Fix how we use subset in WindowDataset.__init__

* Change word 'split' -> 'subset' in src/vak/learncurve/frame_classification.py

* Fix docstrings in src/vak/datasets/frame_classification/window_dataset.py

* Fix docstrings in src/vak/datasets/frame_classification/frames_dataset.py

* Fix a typo in a docstring in src/vak/datasets/frame_classification/window_dataset.py

* Fix subset parameter of classmethod for ParametricUMAPDataset class; move logic from classmethod into __init__ although I'm not sure this is a good idea

* Rename frame_classification/dataset_arrays.py to frame_classification/make_splits.py and rewrite 'make_npy_paths' as 'make_splits', have it move/copy/create audio or spectrogram files in split dirs, in addition to making npy files, and update the 'audio_path' or 'spect_path' columns with the files in the split dirs

* Remove constants from src/vak/datasets/frame_classification/constants.py that are no longer used for 'frames' files

* Use make_splits function in src/vak/prep/frame_classification/frame_classification.py

* Modify make_dataframe_of_spect_files function in src/vak/prep/spectrogram_dataset/spect_helper.py so it no longer converts mat files into npz files, instead it just finds/collates all the spect files and returns them in the dataframe; any converting is done by frame_classification.make_splits with the output of this function

* Fix typo in list comprehension and add info to docstring in src/vak/prep/frame_classification/make_splits.py

* Fix imports in src/vak/prep/frame_classification/__init__.py after renaming module to 'make_splits'

* Remove other occurrences of 'spect_output_dir' from src/vak/prep/spectrogram_dataset/spect_helper.py, no longer is a parameter and not used

* No longer pass 'spect_output_dir' into 'prep_spectrogram_dataset' in src/vak/prep/spectrogram_dataset/prep.py

* Remove unused import in src/vak/prep/spectrogram_dataset/spect_helper.py

* Add logger statement in src/vak/prep/frame_classification/make_splits.py

* Fix src/vak/prep/frame_classification/learncurve.py so functions use either spect or audio to get frames and make indexing vectors

* Fix src/vak/prep/frame_classification/frame_classification.py so we pass needed parameters into make_subsets_from_dataset_df

* Make x_path relative to dataset_path in src/vak/prep/frame_classification/frame_classification.py, since that's what downstream functions/classes expect

* Rename x_path -> source_path in src/vak/prep/frame_classification/make_splits.py

* Rename x_path -> source_path in src/vak/prep/frame_classification/learncurve.py

* Rewrite frame_classification.WindowDataset to load audio/spectrograms directly from 'frame_paths'

* Add FRAMES_PATH_COL_NAME to src/vak/datasets/frame_classification/constants.py

* Rewrite make_splits.py to add frames_path column to dataframe, and have frame_classification models use that column always; this way we keep the original 'audio_path' and 'spect_path' columns as metadata, and avoid if/else logic everywhere in dataset classes

* Fix WindowDataset to use constant to load frame paths column, and to validate input type, revise docstring

* Fix FramesDataset the same way as WindowDataset: load frame paths with constant, load inside __getitem__ with helper function _load_frames, validate input type, fix order of attributes in docstring

* Use self.dataset_path to build frames_path in WindowDataset

* Use self.dataset_path to build frames_path in FramesDataset, and pass into transform as 'frames_path', not 'source_path'

* Rename 'source_path' -> 'frames_path' inside src/vak/transforms/defaults/frame_classification.py

* Rename 'source_path' -> 'frames_path' in FrameClassificationModel methods, in src/vak/models/frame_classification_model.py

* Rename 'source_path' -> 'frames_path' in src/vak/predict/frame_classification.py

* Add SPECT_KEY to common.constants

* Fix how StandardizeSpect.from_dataset_path builds frames_path paths, and use constants.SPECT_KEY when loading from frames path

* Use common.constants.SPECT_KEY inside _load_frames method of WindowDataset

* Use common.constants.SPECT_KEY inside _load_frames method of FramesDataset

* Add newline at end of src/vak/common/constants.py

* Add FRAME_CLASSIFICATION_DATASET_AUDIO_FORMAT to src/vak/datasets/frame_classification/constants.py

* Add function load_frames to src/vak/datasets/frame_classification/helper.py

* Have WindowDataset._load_frames use helper.load_frames

* Have FramesDataset._load_frames use helper.load_frames

* Rename GENERATED_TEST_DATA -> GENERATED_TEST_DATA_ROOT in tests/scripts/vaktestdata/constants.py

* Rename GENERATED_TEST_DATA -> GENERATED_TEST_DATA_ROOT in tests/scripts/vaktestdata/dirs.py

* Add tests/scripts/vaktestdata/spect.py

* import spect module in tests/scripts/vaktestdata/__init__.py

* Call vaktestdata.spect.prep_spects in prep section of script tests/scripts/generate_data_for_tests.py

* Fix spect_dir_npz fixture in tests/fixtures/spect.py to use directory of just .spect.npz files that is now generated by the generate_test_data script

* Add SPECT_NPZ_EXTENSION to src/vak/common/constants.py

* Use common.SPECT_NPZ_EXTENSION in src/vak/prep/spectrogram_dataset/audio_helper.py

* Fix prep.frame_classification.make_splits to remove any .spect.npz files remaining in dataset_path, that were not moved into splits

* Fix vak.prep.frame_classification.learncurve.make_index_vectors_for_subsets to use frame_paths column instead of 'source' paths (audio_path or spect_path) -- so we are using files that definitely exist and are already assigned to splits

* WIP: Rewriting unit tests in tests/test_prep/test_frame_classification/test_learncurve.py

* WIP: Rewriting unit tests in tests/test_prep/test_frame_classification/test_make_splits.py

* WIP: Add tests/test_datasets/test_frame_classification/test_helper.py

* Rename specific_config -> specific_config_toml_path

* WIP: Rewriting tests/test_prep/test_frame_classification/test_make_splits.py

* Add src/vak/prep/frame_classification/get_or_make_source_files.py

* Add src/vak/prep/frame_classification/assign_samples_to_splits.py

* Rewrite 'prep_frame_classification_dataset' to use helper functions factored out into other modules: get_or_make_source_files and assign_samples_to_splits

* Capitalize in docstring in src/vak/prep/spectrogram_dataset/prep.py

* Add TIMEBINS_KEY to src/vak/common/constants.py

* Finish fixing unit test for vak.prep.frame_classification.make_splits

* Add imports in src/vak/prep/frame_classification/__init__.py

* Revise docstring of src/vak/prep/audio_dataset.py to refer to 'source_files_df'

* Revise docstring of src/vak/prep/spectrogram_dataset/spect_helper.py to refer to 'source_files_df'

* Revise docstring of src/vak/prep/spectrogram_dataset/prep.py to refer to 'source_files_df'

* Revise src/vak/prep/frame_classification/get_or_make_source_files.py to refer to 'source_files_df', in docstring and inside function

* In 'prep_frame_classification_dataset', differentiate between 'source_files_df' and 'dataset_df'

* Delete birdsong-recognition-dataset configs from tests/data_for_tests/configs

* Fix a docstring in noxfile.py

* Remove tests/scripts/vaktestdata/spect.py

* Add model_family field in tests/data_for_tests/configs/configs.json, remove configs for birdsong-recognition-dataset

* Add model_family field to ConfigMetadata dataclass in tests/scripts/vaktestdata/config_metadata.py

* Remove call to vaktestdata.spect.prep_spects() since we are going to call other functions that will make spectrograms

* Change parameters order of frame_classification.get_or_make_source_files, add pre-conditions/validators

* Fix order of args to get_or_make_source_files in src/vak/prep/frame_classification/frame_classification.py

* Add more to docstring of src/vak/prep/frame_classification/get_or_make_source_files.py

* Add 'spect_output_dir' and 'data_dir' fields to tests/data_for_tests/configs/configs.json

* Rewrite ConfigMetadata dataclass, add docstring and converters, add spect_output_dir and data_dir attributes

* Add functions to make more directories in tests/data_for_tests/generated in tests/scripts/vaktestdata/dirs.py

* Import get_or_make_source_files in tests/scripts/vaktestdata/__init__.py

* Add more constants with names of directories to make in tests/data_for_tests/generated in tests/scripts/vaktestdata/constants.py

* Add tests/scripts/vaktestdata/get_or_make_source_files.py

* Add 'spect-output-dir/' to data_dir paths in tests/data_for_tests/configs/configs.json

* Rename tests/scripts/vaktestdata/get_or_make_source_files.py -> tests/scripts/vaktestdata/source_files.py, rewrite function that makes source files + csv files we use with tests

* Fix tests/scripts/vaktestdata/__init__.py to import source_files module, remove import of get_or_make_source_files module that was renamed to source_files

* Import missing module constants and fix order of arguments to prep_spectrogram_dataset in src/vak/prep/frame_classification/get_or_make_source_files.py

* Change 3 configs to have spect_format option set to npz

* Remove import of module spect in tests/scripts/vaktestdata/__init__.py

* Flesh out function in tests/scripts/vaktestdata/source_files.py

* Add log statements in tests/scripts/generate_data_for_tests.py

* Fix typo in src/vak/prep/frame_classification/get_or_make_source_files.py

* Add SPECT_FORMAT_EXT_MAP to src/vak/common/constants.py

* Use vak.common.constants.SPECT_FORMAT_EXT_MAP in src/vak/prep/spectrogram_dataset/prep.py so that we correctly remove source file extension to pair with annotation file

* Fix attributes of ConfigMetadata so we don't convert None to 'None'

* Copy annotation files to spect_output_dir so we can prep from that dir, in tests/scripts/vaktestdata/source_files.py

* Change name of logger in tests/scripts/generate_data_for_tests.py

* Fix attributes in ConfigMetadata so we don't convert strings to bool

* Remove fixtures from tests/fixtures/annot.py after removing corresponding source data

* Fix import in src/vak/prep/frame_classification/__init__.py

* Fix import in src/vak/prep/frame_classification/frame_classification.py

* Add tests/fixtures/source_files with fixtures to get csv files

* Add fixtures that return dataframes directly in tests/fixtures/source_files.py

* Add tests/test_prep/test_frame_classification/test_get_or_make_source_files.py

* Add tests/test_prep/test_frame_classification/test_assign_samples_to_splits.py

* Fix factory functions in tests/fixtures/source_files.py

* Fix assembled path in tests/fixtures/source_files.py

* Fix unit test in tests/test_prep/test_frame_classification/test_make_splits.py to use fixture so it's faster and less verbose

* Remove fixtures that no longer exist from specific_annot_list fixture in tests/fixtures/annot.py

* Remove fixtures for data that doesn't exist in tests/fixtures/audio.py

* Remove birdsong-rec from parametrize in tests/test_cli/test_predict.py

* Remove birdsongrec from parametrize in tests/test_cli/test_prep.py

* Remove birdsongrec from parametrize in tests/test_cli/test_train.py

* Remove birdsongrec and other data no longer in source from parametrizes in tests/test_common/test_annotation.py

* Remove birdsongrec from parametrize in tests/test_predict/test_frame_classification.py

* Remove birdsongrec from parametrize in tests/test_prep/test_frame_classification/test_frame_classification.py

* Remove birdsongrec from parametrize in tests/test_prep/test_prep.py

* Remove birdsongrec from parametrize in tests/test_prep/test_sequence_dataset.py

* Remove birdsongrec from parametrize in tests/test_train/test_frame_classification.py

* Remove birdsongrec from parametrize in tests/test_train/test_train.py

* Remove unit tests from tests/test_common/test_files/test_files.py that test on data removed from source data

* Remove parametrize that uses wav/textgrid data removed from source data

* Fix fixture in tests/fixtures/spect.py

* Actually write unit tests in tests/test_datasets/test_frame_classification/test_helper.py

* Fix prep.frame_classification.make_splits to not convert frame labels npy paths to 'None' when they are None

* Fix assert helper in tests/test_prep/test_frame_classification/test_frame_classification.py

* Remove spect_key and audio_format parameters from functions in src/vak/prep/frame_classification/learncurve.py, no longer used

* Change order of params for make_subsets_from_dataset_df

* Change order of args in call to make_subsets_from_dataset_df inside prep_frame_classification_dataset

* Rename some variables to 'subset_df' in src/vak/prep/frame_classification/learncurve.py and revise docstrings

* Finish adding/fixing unit tests in tests/test_prep/test_frame_classification/test_learncurve.py

* Fix bug in unit test in tests/test_prep/test_frame_classification/test_make_splits.py

* Fix unit tests in tests/test_prep/test_spectrogram_dataset/test_prep.py

* Fix unit test in tests/test_prep/test_spectrogram_dataset/test_spect_helper.py

* Fix unit test in tests/test_transforms/test_transforms.py

* Use torch.testing.assert_close instead of assert_allclose in tests/test_nn/test_loss/test_dice.py