
BUG: Running vak 1.0.0a1 with device set to CPU crashes #687

Closed
NickleDave opened this issue Aug 9, 2023 · 7 comments
NickleDave commented Aug 9, 2023

Describe the bug
Running vak train with version 1.0.0a1 and the device set to 'cpu' causes a crash, as described in this forum post: https://forum.vocalpy.org/t/errors-in-training-and-predicting-when-the-newest-version-of-vak-is-installed/71

It produces the following traceback:

$ vak train data/configs/TeenyTweetyNet_train_audio_cbin_annot_notmat.toml
2023-08-09 10:37:15,262 - vak.cli.train - INFO - vak version: 1.0.0a1
2023-08-09 10:37:15,263 - vak.cli.train - INFO - Logging results to ../vak-vocalpy/tests/data_for_tests/generated/results/train/audio_cbin_annot_notmat/TeenyTweetyNet/results_230809_103715
2023-08-09 10:37:15,263 - vak.core.train - INFO - Loading dataset from .csv path: ../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv
2023-08-09 10:37:15,266 - vak.core.train - INFO - Size of timebin in spectrograms from dataset, in seconds: 0.002
2023-08-09 10:37:15,266 - vak.core.train - INFO - using training dataset from ../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv
2023-08-09 10:37:15,266 - vak.core.train - INFO - Total duration of training split from dataset (in s): 50.046
2023-08-09 10:37:15,380 - vak.core.train - INFO - number of classes in labelmap: 12
2023-08-09 10:37:15,380 - vak.core.train - INFO - no spect_scaler_path provided, not loading
2023-08-09 10:37:15,380 - vak.core.train - INFO - will normalize spectrograms
2023-08-09 10:37:15,449 - vak.core.train - INFO - Duration of WindowDataset used for training, in seconds: 50.046
2023-08-09 10:37:15,460 - vak.core.train - INFO - Total duration of validation split from dataset (in s): 15.948
2023-08-09 10:37:15,461 - vak.core.train - INFO - will measure error on validation set every 50 steps of training
2023-08-09 10:37:15,467 - vak.core.train - INFO - training TeenyTweetyNet
Traceback (most recent call last):
  File "/home/pimienta/miniconda3/envs/vak-env/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/cli.py", line 8, in train
    train(toml_path=toml_path)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/train.py", line 67, in train
    core.train(
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/core/train.py", line 358, in train
    trainer = get_default_trainer(
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/trainer.py", line 66, in get_default_trainer
    trainer = lightning.Trainer(
              ^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/utilities/argparse.py", line 69, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 398, in __init__
    self._accelerator_connector = _AcceleratorConnector(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 140, in __init__
    self._check_config_and_set_final_flags(
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 221, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: You selected an invalid accelerator name: `accelerator=None`. Available names are: auto, cpu, cuda, hpu, ipu, mps, tpu.

As reported in the forum post, this likely affects all CLI commands besides prep (predict, eval, and learncurve), though I have not verified this.

To Reproduce
Steps to reproduce the behavior:

  1. Install vak 1.0.0a1, e.g. with conda:
$ mamba create -n vak-env python vak -c pytorch -c conda-forge
  2. Create a TOML configuration file with a [TRAIN] table that specifies device = 'cpu' (full file is attached):
[TRAIN]
model = "TeenyTweetyNet"
normalize_spectrograms = true
batch_size = 4
num_epochs = 2
val_step = 50
ckpt_step = 200
patience = 3
num_workers = 2
device = "cpu"
root_results_dir = "../vak-vocalpy/tests/data_for_tests/generated/results/train/audio_cbin_annot_notmat/TeenyTweetyNet"
dataset_path = "../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv"
  3. Run vak train config.toml
  4. See error

Expected behavior
vak train should run without crashing.

Desktop (please complete the following information):

  • Verified on Linux
  • Version 1.0.0a1
  • The Veit lab is reportedly using macOS, so this is probably not platform-specific

Additional context
What's going on here is that the code maps device to accelerator like this:

    if device == 'cuda':
        accelerator = 'gpu'
    else:
        accelerator = None
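So any device other than 'cuda', including an explicit 'cpu', gets translated to accelerator=None, which Lightning rejects. One possible fix (a sketch, not the actual patch; the function name device_to_accelerator is hypothetical) would be to map 'cpu' through explicitly and fall back to 'auto':

```python
def device_to_accelerator(device: str) -> str:
    """Map a vak-style 'device' option to a Lightning accelerator name.

    Sketch of a possible fix: 'cuda' becomes 'gpu', 'cpu' is passed
    through unchanged (it is a valid accelerator per the traceback above),
    and anything else falls back to 'auto' so Lightning chooses.
    """
    if device == "cuda":
        return "gpu"
    if device == "cpu":
        return "cpu"
    return "auto"
```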

TeenyTweetyNet_train_audio_cbin_annot_notmat.zip

@NickleDave NickleDave added the BUG Something isn't working label Aug 9, 2023
@NickleDave NickleDave self-assigned this Aug 9, 2023
@NickleDave (Collaborator, Author)
Quoting myself from the forum post:

I need to think a little more about the right way to fix this. People should be able to say they want to use 'cpu' explicitly, so I don't think it should be either 'cuda' or 'auto'. It might be better to just let people specify the 'accelerator' option directly in the config file, although this forces people to learn about Lightning.

From the traceback above I can see that 'cpu' is actually a valid option for accelerator.
So I am now actually +1 on just changing device to accelerator and linking to the Lightning docs, e.g. in docstrings.
We should favor being a lightweight, transparent wrapper whenever possible instead of inventing new conventions.
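Under that approach, a config might look like this (a sketch only; the accelerator key shown here is hypothetical and was not part of vak's config format at the time of this issue):

```toml
[TRAIN]
model = "TeenyTweetyNet"
# pass Lightning's accelerator option through directly,
# instead of vak's own 'device' convention
accelerator = "cpu"
```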

@JacquelineGoe
Thank you for the nice solution, this will be very easy to implement.

@NickleDave (Collaborator, Author)

Thank you @JacquelineGoe for reporting this bug on the forum!
I will be sure to add you as a contributor when we fix it.
Will move this near the top of the to-do list; we need to make a new alpha release soon anyway.

@NickleDave (Collaborator, Author)

I put in a temporary fix for this in #693. I will need to publish a new release to PyPI before the fix is in the version installed by pip and conda. The long-term fix is described in #691.

@NickleDave (Collaborator, Author)

@all-contributors please add @JacquelineGoe for bug

@allcontributors (Contributor)
@NickleDave

I've put up a pull request to add @JacquelineGoe! 🎉

NickleDave added a commit that referenced this issue May 5, 2024
* WIP: Add config/trainer.py with TrainerConfig

* Rename common.device -> common.accelerator, return 'gpu' not 'cuda' if torch.cuda.is_available

* Fix config section in doc/api/index.rst

* Import trainer and TrainerConfig in src/vak/config/__init__.py, add to __all__

* Add pytorch-lightning to intersphinx in doc/conf.py

* Fix cross-ref in docstring in src/vak/prep/frame_classification/make_splits.py: :constant: -> :const:

* Make lightning a dependency, instead of pytorch_lightning; import lightning.pytorch everywhere instead of pytorch_lightning as lightning -- trying to make it so we can resolve API correctly in docstrings

* Fix in doc/api/index.rst: common.device -> common.accelerator

* Finish writing TrainerConfig class

* Add tests for TrainerConfig class

* Add trainer sub-table to all configs in tests/data_for_tests/configs

* Add trainer sub-table to all configs in doc/toml

* Add trainer sub-table in config/valid-version-1.0.toml, rename -> valid-version-1.1.toml

* Remove device key from top-level tables in config/valid-version-1.1.toml

* Remove device key from top-level tables in tests/data_for_tests/configs

* Remove 'device' key from configs in doc/toml

* Add 'trainer' attribute to EvalConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to PredictConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to TrainConfig, an instance of TrainerConfig; remove 'device' attribute

* Fix typo in docstring in src/vak/config/train.py

* Add 'trainer' attribute to LearncurveConfig, an instance of TrainerConfig; remove 'device' attribute. Also clean up docstring, removing attributes that no longer exist

* Remove device attribute from TrainConfig docstring

* Fix VALID_TOML_PATH in config/validators.py -> 'valid-version-1.1.toml'

* Fix how we instantiate TrainerConfig classes in from_config_dict method of EvalConfig/LearncurveConfig/PredictConfig/TrainConfig

* Fix typo in src/vak/config/valid-version-1.1.toml: predictor -> predict

* Fix unit tests after adding trainer attribute that is instance of TrainerConfig

* Change src/vak/train/frame_classification.py to take trainer_config argument

* Change src/vak/train/parametric_umap.py to take trainer_config argument

* Change src/vak/train/train_.py to take trainer_config argument

* Fix src/vak/cli/train.py to pass trainer_config.asdict() into vak.train.train_.train

* Replace 'device' with 'trainer_config' in vak/eval

* Fix cli.eval to pass trainer_config into eval.eval_.eval

* Replace 'device' with 'trainer_config' in vak/predict

* Fix cli.predict to pass trainer_config into predict.predict_.predict

* Replace 'device' with 'trainer_config' in vak/learncurve

* Fix cli.learncurve to pass trainer_config into learncurve.learncurve.learning_curve

* Rename/replace 'device' fixture with 'trainer' fixture in tests/

* Use config.table.trainer attribute throughout tests, remove config.table.device attribute that no longer exists

* Fix value for devices in fixtures/trainer.py: when device is 'cpu' trainer must be > 0

* Fix default devices value for when accelerator is cpu in TrainerConfig

* Fix unit tests for TrainerConfig after fixing default devices for accelerator=cpu

* Fix default value for 'devices' set to -1 in some unit tests where we over-ride config in toml file

* fixup use config.table.trainer attribute throughout tests -- missed one place in tests/test_eval/

* Add back 'device' fixture so we can use it to test Model class

* Fix unit tests in test_models/test_base.py that literally used device to put tensors on device, not to change a config

* Fix assertion in tests/test_models/test_tweetynet.py, from where we switched to using lightning as the dependency

* Fix test for DiceLoss, change trainer_type fixture back to device fixture
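The commits above describe a TrainerConfig that holds options passed through to lightning.Trainer, with a fix for the default devices value when the accelerator is 'cpu'. A minimal sketch of what such a class might look like (hypothetical; the field names, defaults, and validation here are illustrative, inferred from the commit messages, not copied from vak's source):

```python
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    """Sketch of a config class holding options passed to lightning.Trainer.

    Hypothetical illustration based on the commit messages above:
    'accelerator' replaces vak's old 'device' option, and 'devices'
    defaults to -1 (meaning "use all available" for GPU accelerators).
    """

    accelerator: str = "auto"
    devices: int = -1

    def __post_init__(self):
        valid = {"auto", "cpu", "cuda", "gpu", "mps", "tpu"}
        if self.accelerator not in valid:
            raise ValueError(f"invalid accelerator: {self.accelerator!r}")
        # per the commit "Fix default devices value for when accelerator
        # is cpu": on cpu, devices must be a positive int, not -1
        if self.accelerator == "cpu" and self.devices < 1:
            self.devices = 1

    def asdict(self) -> dict:
        """Return a dict suitable for passing as **kwargs to a trainer."""
        return {"accelerator": self.accelerator, "devices": self.devices}
```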
@NickleDave (Collaborator, Author)

Fixed by #752
