BUG: Predict throws KeyError #745

Closed
avanikop opened this issue Apr 17, 2024 · 2 comments

Labels
BUG Something isn't working

Comments

avanikop commented Apr 17, 2024

I'm facing problems with the vak predict step. I tried setting save_net_output = true as well as save_net_output = false and got the same error. I don't know if this belongs on vocalpy or here, but the vak predict predict.toml step always fails with the following error:

  Traceback (most recent call last):
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/bin/vak", line 10, in <module>
      sys.exit(main())
               ^^^^^^
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/__main__.py", line 48, in main
      cli.cli(command=args.command, config_file=args.configfile)
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 49, in cli
      COMMAND_FUNCTION_MAP[command](toml_path=config_file)
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 18, in predict
      predict(toml_path=toml_path)
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/predict.py", line 51, in predict
      core.predict(
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/core/predict.py", line 231, in predict
      y_pred = pred_dict[spect_path]
               ~~~~~~~~~^^^^^^^^^^^^
  KeyError: '/gpfs01/veit/data/thomas/avanitesting2/audio/tweetynet_testdata/spectrograms_generated_240415_152302/bu01bk01_240118_43355031_120323.wav.spect.npz'
  0%| | 0/16 [00:00<?, ?it/s]
  Traceback (most recent call last):
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/bin/vak", line 10, in <module>
      sys.exit(main())
               ^^^^^^
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/__main__.py", line 48, in main
      cli.cli(command=args.command, config_file=args.configfile)
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 49, in cli
      COMMAND_FUNCTION_MAP[command](toml_path=config_file)
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 18, in predict
      predict(toml_path=toml_path)
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/predict.py", line 51, in predict
      core.predict(
    File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/core/predict.py", line 231, in predict
      y_pred = pred_dict[spect_path]
               ~~~~~~~~~^^^^^^^^^^^^
  KeyError: '/gpfs01/veit/data/thomas/avanitesting2/audio/tweetynet_testdata/spectrograms_generated_240415_152302/bu01bk01_240118_43310843_120220.wav.spect.npz'

It is not file-specific: I changed the dataset and got the same result.

I am working on a cluster with multiple GPUs.
A possible solution has already been suggested: $ CUDA_VISIBLE_DEVICES=0 vak predict my_config
It seems to work for now.

@avanikop avanikop added the BUG Something isn't working label Apr 17, 2024
@NickleDave NickleDave self-assigned this Apr 17, 2024
@NickleDave NickleDave changed the title from "BUG:" to "BUG: Predict throws KeyError" Apr 17, 2024
NickleDave (Collaborator) commented Apr 17, 2024

Thank you @avanikop for catching this and helping me track down the source of the issue.

As you pointed out by email:

One (unrelated) thing I noticed is that it creates two "result_datestamp_timestamp" folders during the same run, and the folder with the later timestamp has no max val checkpoint file in the checkpoints folder.

I think you were talking about what happens when you run vak train, but you made me realize that the same issue I'm seeing in #742 might also be causing this error with predict: lightning is defaulting to a "distributed" strategy.

I can verify that lightning running in distributed mode is indeed the source of the bug 😩
If I run on a machine with multiple GPUs, I reproduce your bug with predict.

A workaround for now is to do the following before you run vak:

export CUDA_VISIBLE_DEVICES=0

Basically, you force lightning not to run in distributed mode by making it see only one GPU.
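
(A sketch of the same workaround from inside Python, for anyone launching vak programmatically rather than from the shell -- this is illustrative, and the variable has to be set before torch initializes CUDA:)

import os

# Hide all but one GPU *before* importing torch/vak, so lightning
# never sees multiple devices and never auto-selects a distributed strategy.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import vak  # imported after the env var is set, on purpose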

If I do this, then vak predict runs without this error, and the same workaround applies for vak learncurve and, presumably, vak train.

Thank you for pointing out you were seeing an extra folder get generated for train -- I thought that was only happening with learncurve. You got me to the root of the problem.

My guess for what's going on is that something about how lightning runs in distributed mode causes us to end up with some keys missing from the dictionary returned by the predict method.
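
As a toy sketch of that guess (all names here are made up for illustration, this is not vak's actual code): a distributed sampler hands each rank an interleaved slice of the dataset, so a dict built from a single rank's predictions cannot cover every spectrogram path:

# Toy simulation of the suspected failure mode (illustrative names only).
all_paths = [f"spect_{i}.npz" for i in range(16)]

world_size, rank = 2, 0                         # pretend: we are rank 0 of 2 GPUs
shard = all_paths[rank::world_size]             # roughly how a DistributedSampler splits data

pred_dict = {path: "y_pred" for path in shard}  # this rank only has its own shard

for path in all_paths:                          # but we index over *every* path
    y_pred = pred_dict[path]                    # KeyError on the other rank's paths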

Just so it's clear what I did:
you can do this in two lines:

$ export CUDA_VISIBLE_DEVICES=0
$ vak predict your_config.toml

or in one line (in a way that doesn't "export" the variable to the environment):

$ CUDA_VISIBLE_DEVICES=0 vak predict your_config.toml

Since we're seeing it in train + predict, I need to fix this bug sooner rather than later.
I've been needing to do this for my own experiments anyway.
The fix will be something like adding a gpus option that gets passed directly to lightning.Trainer, defaulting to a single device.
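
As a rough sketch of what that could look like (a hypothetical helper, not the final API):

import lightning

def get_default_trainer(accelerator="auto", devices=1, **kwargs):
    """Return a lightning.Trainer pinned to a single device by default,
    so we never fall into a distributed strategy by accident."""
    return lightning.Trainer(accelerator=accelerator, devices=devices, **kwargs)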

I will raise a separate issue with a fix.

NickleDave added a commit that referenced this issue May 5, 2024
* WIP: Add config/trainer.py with TrainerConfig

* Rename common.device -> common.accelerator, return 'gpu' not 'cuda' if torch.cuda.is_available

* Fix config section in doc/api/index.rst

* Import trainer and TrainerConfig in src/vak/config/__init__.py, add to __all__

* Add pytorch-lightning to intersphinx in doc/conf.py

* Fix cross-ref in docstring in src/vak/prep/frame_classification/make_splits.py: :constant: -> :const:

* Make lightning a dependency, instead of pytorch_lightning; import lightning.pytorch everywhere instead of pytorch_lightning as lightning -- trying to make it so we can resolve API correctly in docstrings

* Fix in doc/api/index.rst: common.device -> common.accelerator

* Finish writing TrainerConfig class

* Add tests for TrainerConfig class

* Add trainer sub-table to all configs in tests/data_for_tests/configs

* Add trainer sub-table to all configs in doc/toml

* Add trainer sub-table in config/valid-version-1.0.toml, rename -> valid-version-1.1.toml

* Remove device key from top-level tables in config/valid-version-1.1.toml

* Remove device key from top-level tables in tests/data_for_tests/configs

* Remove 'device' key from configs in doc/toml

* Add 'trainer' attribute to EvalConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to PredictConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to TrainConfig, an instance of TrainerConfig; remove 'device' attribute

* Fix typo in docstring in src/vak/config/train.py

* Add 'trainer' attribute to LearncurveConfig, an instance of TrainerConfig; remove 'device' attribute. Also clean up docstring, removing attributes that no longer exist

* Remove device attribute from TrainConfig docstring

* Fix VALID_TOML_PATH in config/validators.py -> 'valid-version-1.1.toml'

* Fix how we instantiate TrainerConfig classes in from_config_dict method of EvalConfig/LearncurveConfig/PredictConfig/TrainConfig

* Fix typo in src/vak/config/valid-version-1.1.toml: predictor -> predict

* Fix unit tests after adding trainer attribute that is instance of TrainerConfig

* Change src/vak/train/frame_classification.py to take trainer_config argument

* Change src/vak/train/parametric_umap.py to take trainer_config argument

* Change src/vak/train/train_.py to take trainer_config argument

* Fix src/vak/cli/train.py to pass trainer_config.asdict() into vak.train.train_.train

* Replace 'device' with 'trainer_config' in vak/eval

* Fix cli.eval to pass trainer_config into eval.eval_.eval

* Replace 'device' with 'trainer_config' in vak/predict

* Fix cli.predict to pass trainer_config into predict.predict_.predict

* Replace 'device' with 'trainer_config' in vak/learncurve

* Fix cli.learncurve to pass trainer_config into learncurve.learncurve.learning_curve

* Rename/replace 'device' fixture with 'trainer' fixture in tests/

* Use config.table.trainer attribute throughout tests, remove config.table.device attribute that no longer exists

* Fix value for devices in fixtures/trainer.py: when device is 'cpu', devices must be > 0

* Fix default devices value for when accelerator is cpu in TrainerConfig

* Fix unit tests for TrainerConfig after fixing default devices for accelerator=cpu

* Fix default value for 'devices' set to -1 in some unit tests where we over-ride config in toml file

* fixup use config.table.trainer attribute throughout tests -- missed one place in tests/test_eval/

* Add back 'device' fixture so we can use it to test Model class

* Fix unit tests in test_models/test_base.py that literally used device to put tensors on device, not to change a config

* Fix assertion in tests/test_models/test_tweetynet.py, from where we switched to using lightning as the dependency

* Fix test for DiceLoss, change trainer_type fixture back to device fixture
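
For reference, a minimal sketch of a TrainerConfig along the lines described in the commits above (the use of attrs and the exact fields beyond accelerator and devices are assumptions, not the merged code):

from attrs import asdict, define, field

@define
class TrainerConfig:
    # Default to cpu with a single device; a multi-GPU machine must opt in
    # explicitly instead of lightning silently going distributed.
    accelerator: str = field(default="cpu")
    devices: int = field(default=1)

    def asdict(self) -> dict:
        """Return kwargs that can be passed directly to lightning.Trainer."""
        return asdict(self)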
@NickleDave (Collaborator)

Fixed by #752
