
BUG: Running vak 1.0.0a1 with device set to CPU crashes #687

Closed
NickleDave opened this issue Aug 9, 2023 · 7 comments
NickleDave commented Aug 9, 2023

Describe the bug
Running vak train with version 1.0.0a1 and the device set to 'cpu' causes a crash, as described in this forum post: https://forum.vocalpy.org/t/errors-in-training-and-predicting-when-the-newest-version-of-vak-is-installed/71

It produces the following traceback:

$ vak train data/configs/TeenyTweetyNet_train_audio_cbin_annot_notmat.toml
2023-08-09 10:37:15,262 - vak.cli.train - INFO - vak version: 1.0.0a1
2023-08-09 10:37:15,263 - vak.cli.train - INFO - Logging results to ../vak-vocalpy/tests/data_for_tests/generated/results/train/audio_cbin_annot_notmat/TeenyTweetyNet/results_230809_103715
2023-08-09 10:37:15,263 - vak.core.train - INFO - Loading dataset from .csv path: ../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv
2023-08-09 10:37:15,266 - vak.core.train - INFO - Size of timebin in spectrograms from dataset, in seconds: 0.002
2023-08-09 10:37:15,266 - vak.core.train - INFO - using training dataset from ../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv
2023-08-09 10:37:15,266 - vak.core.train - INFO - Total duration of training split from dataset (in s): 50.046
2023-08-09 10:37:15,380 - vak.core.train - INFO - number of classes in labelmap: 12
2023-08-09 10:37:15,380 - vak.core.train - INFO - no spect_scaler_path provided, not loading
2023-08-09 10:37:15,380 - vak.core.train - INFO - will normalize spectrograms
2023-08-09 10:37:15,449 - vak.core.train - INFO - Duration of WindowDataset used for training, in seconds: 50.046
2023-08-09 10:37:15,460 - vak.core.train - INFO - Total duration of validation split from dataset (in s): 15.948
2023-08-09 10:37:15,461 - vak.core.train - INFO - will measure error on validation set every 50 steps of training
2023-08-09 10:37:15,467 - vak.core.train - INFO - training TeenyTweetyNet
Traceback (most recent call last):
  File "/home/pimienta/miniconda3/envs/vak-env/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/cli.py", line 8, in train
    train(toml_path=toml_path)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/train.py", line 67, in train
    core.train(
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/core/train.py", line 358, in train
    trainer = get_default_trainer(
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/trainer.py", line 66, in get_default_trainer
    trainer = lightning.Trainer(
              ^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/utilities/argparse.py", line 69, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 398, in __init__
    self._accelerator_connector = _AcceleratorConnector(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 140, in __init__
    self._check_config_and_set_final_flags(
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 221, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: You selected an invalid accelerator name: `accelerator=None`. Available names are: auto, cpu, cuda, hpu, ipu, mps, tpu.

As reported in the forum post, this likely affects all CLI commands besides prep (predict, eval, and learncurve), though I have not verified this.

To Reproduce
Steps to reproduce the behavior:

  1. Install vak 1.0.0a1, e.g. with conda:
$ mamba create -n vak-env python vak -c pytorch -c conda-forge
  2. Create a TOML configuration file with a [TRAIN] table that specifies device = 'cpu' (full file is attached):
[TRAIN]
model = "TeenyTweetyNet"
normalize_spectrograms = true
batch_size = 4
num_epochs = 2
val_step = 50
ckpt_step = 200
patience = 3
num_workers = 2
device = "cpu"
root_results_dir = "../vak-vocalpy/tests/data_for_tests/generated/results/train/audio_cbin_annot_notmat/TeenyTweetyNet"
dataset_path = "../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv"
  3. Run vak train config.toml
  4. See error

Expected behavior
vak train should run without crashing.

Desktop (please complete the following information):

  • Verified on Linux
  • Version 1.0.0a1
  • The Veit lab is reportedly using macOS, so this is probably not platform-specific

Additional context
What's going on here is that the code maps device to accelerator like this:

    if device == 'cuda':
        accelerator = 'gpu'
    else:
        accelerator = None
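So any device other than 'cuda', including an explicit 'cpu', gets translated to accelerator=None, which Lightning rejects. One possible fix (a sketch, not the actual patch; the function name device_to_accelerator is hypothetical) would be to map 'cpu' through explicitly and fall back to 'auto':

```python
def device_to_accelerator(device: str) -> str:
    """Map a vak-style 'device' option to a Lightning accelerator name.

    Sketch of a possible fix: 'cuda' becomes 'gpu', 'cpu' is passed
    through unchanged (it is a valid accelerator per the traceback above),
    and anything else falls back to 'auto' so Lightning chooses.
    """
    if device == "cuda":
        return "gpu"
    if device == "cpu":
        return "cpu"
    return "auto"
```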

TeenyTweetyNet_train_audio_cbin_annot_notmat.zip

@NickleDave NickleDave added the BUG Something isn't working label Aug 9, 2023
@NickleDave NickleDave self-assigned this Aug 9, 2023
@NickleDave (Collaborator, Author)
Quoting myself from the forum post:

I need to think a little more about the right way to fix this. People should be able to say they want to use 'cpu' explicitly, so I don't think it should be either 'cuda' or 'auto'. It might be better to just let people specify the 'accelerator' option directly in the config file, although this forces people to learn about Lightning.

From the traceback above I can see that 'cpu' is actually a valid option for accelerator.
So I am now actually +1 on just changing device to accelerator and linking to the Lightning docs, e.g. in docstrings.
We should favor being a lightweight, transparent wrapper whenever possible instead of inventing new conventions.
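Under that approach, a config might look like this (a sketch only; the accelerator key shown here is hypothetical and was not part of vak's config format at the time of this issue):

```toml
[TRAIN]
model = "TeenyTweetyNet"
# pass Lightning's accelerator option through directly,
# instead of vak's own 'device' convention
accelerator = "cpu"
```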

@JacquelineGoe
Thank you for the nice solution, this will be very easy to implement.

@NickleDave (Collaborator, Author)

Thank you @JacquelineGoe for reporting this bug on the forum!
I will be sure to add you as a contributor when we fix it.
Will move this near the top of the to-do list; we need to make a new alpha release soon anyway.

@NickleDave (Collaborator, Author)

I put in a temporary fix for this in #693. I will need to publish a new release to PyPI before the fix is in the version installed by pip and conda. The long-term fix is described in #691.

@NickleDave (Collaborator, Author)

@all-contributors please add @JacquelineGoe for bug

@allcontributors (Contributor)
@NickleDave

I've put up a pull request to add @JacquelineGoe! 🎉

NickleDave added a commit that referenced this issue May 5, 2024
* WIP: Add config/trainer.py with TrainerConfig

* Rename common.device -> common.accelerator, return 'gpu' not 'cuda' if torch.cuda.is_available

* Fix config section in doc/api/index.rst

* Import trainer and TrainerConfig in src/vak/config/__init__.py, add to __all__

* Add pytorch-lightning to intersphinx in doc/conf.py

* Fix cross-ref in docstring in src/vak/prep/frame_classification/make_splits.py: :constant: -> :const:

* Make lightning a dependency, instead of pytorch_lightning; import lightning.pytorch everywhere instead of pytorch_lightning as lightning -- trying to make it so we can resolve API correctly in docstrings

* Fix in doc/api/index.rst: common.device -> common.accelerator

* Finish writing TrainerConfig class

* Add tests for TrainerConfig class

* Add trainer sub-table to all configs in tests/data_for_tests/configs

* Add trainer sub-table to all configs in doc/toml

* Add trainer sub-table in config/valid-version-1.0.toml, rename -> valid-version-1.1.toml

* Remove device key from top-level tables in config/valid-version-1.1.toml

* Remove device key from top-level tables in tests/data_for_tests/configs

* Remove 'device' key from configs in doc/toml

* Add 'trainer' attribute to EvalConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to PredictConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to TrainConfig, an instance of TrainerConfig; remove 'device' attribute

* Fix typo in docstring in src/vak/config/train.py

* Add 'trainer' attribute to LearncurveConfig, an instance of TrainerConfig; remove 'device' attribute. Also clean up docstring, removing attributes that no longer exist

* Remove device attribute from TrainConfig docstring

* Fix VALID_TOML_PATH in config/validators.py -> 'valid-version-1.1.toml'

* Fix how we instantiate TrainerConfig classes in from_config_dict method of EvalConfig/LearncurveConfig/PredictConfig/TrainConfig

* Fix typo in src/vak/config/valid-version-1.1.toml: predictor -> predict

* Fix unit tests after adding trainer attribute that is instance of TrainerConfig

* Change src/vak/train/frame_classification.py to take trainer_config argument

* Change src/vak/train/parametric_umap.py to take trainer_config argument

* Change src/vak/train/train_.py to take trainer_config argument

* Fix src/vak/cli/train.py to pass trainer_config.asdict() into vak.train.train_.train

* Replace 'device' with 'trainer_config' in vak/eval

* Fix cli.eval to pass trainer_config into eval.eval_.eval

* Replace 'device' with 'trainer_config' in vak/predict

* Fix cli.predict to pass trainer_config into predict.predict_.predict

* Replace 'device' with 'trainer_config' in vak/learncurve

* Fix cli.learncurve to pass trainer_config into learncurve.learncurve.learning_curve

* Rename/replace 'device' fixture with 'trainer' fixture in tests/

* Use config.table.trainer attribute throughout tests, remove config.table.device attribute that no longer exists

* Fix value for devices in fixtures/trainer.py: when device is 'cpu' trainer must be > 0

* Fix default devices value for when accelerator is cpu in TrainerConfig

* Fix unit tests for TrainerConfig after fixing default devices for accelerator=cpu

* Fix default value for 'devices' set to -1 in some unit tests where we over-ride config in toml file

* fixup use config.table.trainer attribute throughout tests -- missed one place in tests/test_eval/

* Add back 'device' fixture so we can use it to test Model class

* Fix unit tests in test_models/test_base.py that literally used device to put tensors on device, not to change a config

* Fix assertion in tests/test_models/test_tweetynet.py, from where we switched to using lightning as the dependency

* Fix test for DiceLoss, change trainer_type fixture back to device fixture
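The commits above describe a TrainerConfig that holds options passed through to lightning.Trainer, with a fix for the default devices value when the accelerator is 'cpu'. A minimal sketch of what such a class might look like (hypothetical; the field names, defaults, and validation here are illustrative, inferred from the commit messages, not copied from vak's source):

```python
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    """Sketch of a config class holding options passed to lightning.Trainer.

    Hypothetical illustration based on the commit messages above:
    'accelerator' replaces vak's old 'device' option, and 'devices'
    defaults to -1 (meaning "use all available" for GPU accelerators).
    """

    accelerator: str = "auto"
    devices: int = -1

    def __post_init__(self):
        valid = {"auto", "cpu", "cuda", "gpu", "mps", "tpu"}
        if self.accelerator not in valid:
            raise ValueError(f"invalid accelerator: {self.accelerator!r}")
        # per the commit "Fix default devices value for when accelerator
        # is cpu": on cpu, devices must be a positive int, not -1
        if self.accelerator == "cpu" and self.devices < 1:
            self.devices = 1

    def asdict(self) -> dict:
        """Return a dict suitable for passing as **kwargs to a trainer."""
        return {"accelerator": self.accelerator, "devices": self.devices}
```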
@NickleDave (Collaborator, Author)

Fixed by #752
