TF/Keras 2.11 isn’t currently working with KerasEstimator in horovod 0.26.1 even using legacy optimizer #3810

Closed
wenfeiy-db opened this issue Jan 7, 2023 · 1 comment
@wenfeiy-db

Environment:

  1. Framework: keras
  2. Framework version: 2.11
  3. Horovod version: 0.26.1
  4. MPI version: 4.1.4
  5. CUDA version: 11.0.3-1
  6. NCCL version: 2.10.3-1
  7. Python version: 3.8

Bug report:
With Keras 2.11 and Horovod 0.26.1, horovod.spark.keras.KerasEstimator fails even when the model uses a legacy optimizer. Training aborts with the following error:

Traceback (most recent call last):
[1,2]<stderr>:  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[1,2]<stderr>:    return _run_code(code, main_globals, None,
[1,2]<stderr>:  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
[1,2]<stderr>:    exec(code, run_globals)
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module>
[1,2]<stderr>:    main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main
[1,2]<stderr>:    task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/task/__init__.py", line 61, in task_exec
[1,2]<stderr>:    result = fn(*args, **kwargs)
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 136, in train
[1,2]<stderr>:    model = deserialize_keras_model(
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 299, in deserialize_keras_model
[1,2]<stderr>:    return load_model_fn(f)
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 137, in <lambda>
[1,2]<stderr>:    serialized_model, lambda x: hvd.load_model(x))
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/tensorflow/keras/__init__.py", line 274, in load_model
[1,2]<stderr>:    return _impl.load_model(keras, wrap_optimizer, _OPTIMIZER_MODULES, filepath, custom_optimizers, custom_objects)
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/_keras/__init__.py", line 272, in load_model
[1,2]<stderr>:    return keras.models.load_model(filepath, custom_objects=horovod_objects)
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,2]<stderr>:    raise e.with_traceback(filtered_tb) from None
[1,2]<stderr>:  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/horovod/tensorflow/keras/__init__.py", line 273, in <lambda>
[1,2]<stderr>:    return lambda **kwargs: DistributedOptimizer(cls(**kwargs), compression=compression)
[1,2]<stderr>:ValueError[1,2]<stderr>:: decay is deprecated in the new Keras optimizer, pleasecheck the docstring for valid arguments, or use the legacy optimizer, e.g., tf.keras.optimizers.legacy.Adadelta.
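
The last two frames suggest what goes wrong: while loading the model, the optimizer is rebuilt via `cls(**kwargs)`, the saved config contains `decay`, and `cls` resolves to a new-style optimizer that rejects that argument. Here is a minimal sketch of that failure mode in plain Python. It does not import Horovod or Keras; `NewStyleAdadelta`, `LegacyAdadelta`, and `rebuild` are hypothetical stand-ins, and `wrap_optimizer` only mimics the shape of the lambda in the traceback.

```python
class NewStyleAdadelta:
    """Hypothetical stand-in for a Keras 2.11 new-style optimizer."""

    def __init__(self, learning_rate=0.001, **kwargs):
        # Assumption: mirrors the behaviour reported in the traceback,
        # where a `decay` argument is rejected at construction time.
        if "decay" in kwargs:
            raise ValueError("decay is deprecated in the new Keras optimizer")
        self.learning_rate = learning_rate


class LegacyAdadelta:
    """Hypothetical stand-in for a legacy optimizer that still accepts decay."""

    def __init__(self, learning_rate=0.001, decay=0.0):
        self.learning_rate = learning_rate
        self.decay = decay

    def get_config(self):
        # The serialized config carries `decay` along with it.
        return {"learning_rate": self.learning_rate, "decay": self.decay}


def wrap_optimizer(cls):
    # Same shape as the lambda in the last traceback frame,
    # without the DistributedOptimizer wrapper.
    return lambda **kwargs: cls(**kwargs)


def rebuild(saved_config, optimizer_cls):
    """Return the rebuilt optimizer, or the error message if construction fails."""
    try:
        return wrap_optimizer(optimizer_cls)(**saved_config)
    except ValueError as err:
        return str(err)


saved_config = LegacyAdadelta(learning_rate=0.01, decay=1e-4).get_config()

# Rebuilding against the new-style class fails on `decay` ...
failure = rebuild(saved_config, NewStyleAdadelta)
# ... while rebuilding against the legacy class succeeds.
success = rebuild(saved_config, LegacyAdadelta)

print(failure)
print(type(success).__name__)
```

If this reading is right, a legacy optimizer on the driver side is not enough: the loader also has to map the saved class name back to the legacy class.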

This PR seems to solve the issue: when we install Horovod from master, it works. Given this, could you make a patch release that includes the linked PR?

wenfeiy-db added the bug label Jan 7, 2023
@maxhgerlach
Collaborator

Hi @wenfeiy-db,

we've just released Horovod 0.27 including this fix for TF 2.11: https://github.com/horovod/horovod/releases/tag/v0.27.0

I hope this will work for you.
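
For anyone checking whether an environment already has the fix, here is a small sketch that compares the installed Horovod version against 0.27.0. The helper names (`parse_version`, `has_tf211_fix`, `installed_horovod_has_fix`) are hypothetical, and the parser is deliberately naive: it only reads the leading digits of the first three dot-separated parts, so a pre-release such as `0.27.0rc1` would count as fixed.

```python
from importlib import metadata

# First release that contains the fix, per the comment above.
FIXED_IN = (0, 27, 0)


def parse_version(text):
    """Turn '0.26.1' or '0.27' into a 3-tuple of ints (naive, digits only)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_tf211_fix(version_text):
    """True if the given Horovod version string is at least 0.27.0."""
    return parse_version(version_text) >= FIXED_IN


def installed_horovod_has_fix():
    """Check the installed package; returns None if Horovod is not installed."""
    try:
        return has_tf211_fix(metadata.version("horovod"))
    except metadata.PackageNotFoundError:
        return None


print(has_tf211_fix("0.26.1"))
print(has_tf211_fix("0.27.0"))
```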
