
Multi-GPU strategy scope causes save_weights error #62483

Open
moberweger opened this issue Nov 27, 2023 · 3 comments
Labels: comp:gpu (GPU related issues) · TF 2.13 (For issues related to Tensorflow 2.13) · type:bug (Bug)

@moberweger

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.13.1

Custom code

Yes

OS platform and distribution

Ubuntu 20.04

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.8

GPU model and memory

2x A6000 48G

Current behavior?

I have a custom Layer inside a Model that is created and saved within a multi-GPU MirroredStrategy scope. Everything works fine without the strategy scope, i.e. running on a single GPU. But when running on multiple GPUs inside the strategy scope, the weight names of the model get truncated after saving and reloading, so they essentially only read kernel and bias. This leads to duplicate entries in the save_weights function, which fails with an HDF5 error. Code to reproduce is below (you need >=2 GPUs), and the log output is attached as well.

Standalone code to reproduce the issue

import os
import shutil

import tensorflow as tf
from contextlib import nullcontext
from tensorflow.keras.layers import Input, Conv2D, Layer, GroupNormalization
from tensorflow.keras.models import Model, load_model


class MyLayer(Layer):

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.norm = GroupNormalization(epsilon=1e-5)
        self.proj1 = Conv2D(channels, 1)
        self.proj2 = Conv2D(channels, 1)

    def call(self, inputs):
        return self.proj2(self.proj1(self.norm(inputs)))


class ModelMock(Model):

    def __init__(self, img_height, img_width, name=None):
        x_input = Input((img_height, img_width, 3), name="x_input")
        x = Conv2D(320, kernel_size=3, padding="same")(x_input)
        x = MyLayer(320)(x)
        output = Conv2D(16, kernel_size=3, padding="same")(x)
        super().__init__([x_input], output, name=name)


if __name__ == '__main__':
    strategy = tf.distribute.MirroredStrategy()
    print("Number of devices: {}".format(strategy.num_replicas_in_sync))
    if os.path.exists("./test"):
        shutil.rmtree("./test/")
    os.mkdir("./test/")

    # Set fail=False to run without the MirroredStrategy scope (single GPU), where everything works.
    fail = True
    with strategy.scope() if fail else nullcontext():
        print("CREATING")
        model = ModelMock(256, 256)
        # Weight names are fully qualified here, e.g. my_layer/conv2d_1/kernel:0
        for i, w in enumerate(model.weights): print(i, w.name)
        model.save_weights("./test/model1.hdf5")
        model.save("./test/model1")
        # After reloading inside the strategy scope the names are truncated to kernel:0, bias:0, ...
        model = load_model("./test/model1", compile=False)
        for i, w in enumerate(model.weights): print(i, w.name)
        # The duplicate names collide as HDF5 dataset names -> ValueError
        model.save_weights("./test/model2.hdf5")
    print("DONE")

Relevant log output

Number of devices: 2
CREATING
0 conv2d/kernel:0
1 conv2d/bias:0
2 my_layer/group_normalization/gamma:0
3 my_layer/group_normalization/beta:0
4 my_layer/conv2d_1/kernel:0
5 my_layer/conv2d_1/bias:0
6 my_layer/conv2d_2/kernel:0
7 my_layer/conv2d_2/bias:0
8 conv2d_3/kernel:0
9 conv2d_3/bias:0
0 kernel:0
1 bias:0
2 gamma:0
3 beta:0
4 kernel:0
5 bias:0
6 kernel:0
7 bias:0
8 kernel:0
9 bias:0
Traceback (most recent call last):
  File "/home/naked/dev/meta-cv/nak3d-fusion/python/apps/ml/check_weight_names.py", line 48, in <module>
    model.save_weights("./test/model2.hdf5")
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 163, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl, dapl=dapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 138, in h5py.h5d.create
ValueError: Unable to create dataset (name already exists)
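
For reference, the HDF5 failure at the end is just a dataset name collision; a minimal h5py sketch (hypothetical file and dataset names, not part of the model code) triggers the same ValueError:

import numpy as np
import h5py

with h5py.File("./test/collision.h5", "w") as f:
    f.create_dataset("kernel:0", data=np.zeros((3, 3)))
    # A second dataset with the same name in the same group fails with
    # ValueError: Unable to create dataset (name already exists)
    f.create_dataset("kernel:0", data=np.ones((3, 3)))
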
@google-ml-butler google-ml-butler bot added the type:bug Bug label Nov 27, 2023
@Venkat6871 Venkat6871 added comp:gpu GPU related issues TF 2.13 For issues related to Tensorflow 2.13 labels Nov 30, 2023
@Venkat6871
Contributor

Hi @moberweger ,

I have replicated the reported behaviour in Colab using TF v2.14 and v2.15. Please find the gist here for reference.

Thank you!

@SuryanarayanaY
Collaborator

Hi @moberweger ,

I have made changes to the attached code so that it is compatible with saving and loading in the .keras format.

Since there are custom objects in the model, we need to explicitly override the get_config() and from_config() methods. Please find the gists that execute fine with keras-3 and keras-2.15 as well.
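
Roughly, the change is along the lines of the sketch below (a minimal illustration only, assuming register_keras_serializable from tf.keras.utils is available; the actual gists may differ in details):

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Layer, GroupNormalization
from tensorflow.keras.utils import register_keras_serializable


@register_keras_serializable()
class MyLayer(Layer):

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.channels = channels  # keep the constructor argument for serialization
        self.norm = GroupNormalization(epsilon=1e-5)
        self.proj1 = Conv2D(channels, 1)
        self.proj2 = Conv2D(channels, 1)

    def call(self, inputs):
        return self.proj2(self.proj1(self.norm(inputs)))

    def get_config(self):
        # Serialize the constructor arguments so the layer can be rebuilt on load.
        config = super().get_config()
        config.update({"channels": self.channels})
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

With this in place the model can be saved as model.save("./test/model1.keras") and reloaded with load_model("./test/model1.keras", compile=False).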

Thanks!

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label Jan 22, 2024
@moberweger
Author

Thanks @SuryanarayanaY for looking into this.
I checked, and the specific code works on my multi-GPU setup, but I observed some inconsistencies:

  1. The extension (.hdf5 vs .h5) seems to have an impact: it still fails for the .hdf5 extension but works with .h5 (one way to force the format explicitly regardless of extension is sketched below).
  2. The gist for keras-2.15 outputs 0 conv2d/kernel:0 whereas keras-3.0 outputs 0 kernel.
  3. Interestingly, it also works without the .keras format.
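
Regarding item 1, a possible way to take the file extension out of the equation (a sketch assuming the Keras 2.x save_weights API, which accepts a save_format argument) is:

# Force the HDF5 format regardless of the file extension
model.save_weights("./test/model2.hdf5", save_format="h5")

# Or avoid HDF5 altogether: the TF checkpoint format addresses weights by
# object path rather than by variable name, so duplicate names do not collide
model.save_weights("./test/model2_ckpt", save_format="tf")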

Thanks!

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jan 22, 2024