
Multi-GPU strategy scope causes save_weights error #62483

Open
moberweger opened this issue Nov 27, 2023 · 3 comments
Labels: comp:gpu (GPU related issues) · TF 2.13 (For issues related to Tensorflow 2.13) · type:bug (Bug)

@moberweger

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.13.1

Custom code

Yes

OS platform and distribution

Ubuntu 20.04

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.8

GPU model and memory

2x A6000 48G

Current behavior?

I have a custom Layer inside a Model that is created and saved within a multi-GPU MirroredStrategy scope. Everything works fine without the strategy scope, i.e. running on a single GPU. But when running on multiple GPUs inside the strategy scope, the weight names of the model get truncated after saving and reloading, so they essentially only read kernel and bias. This leads to duplicate entries in the save_weights function, which fails with an HDF5 error. Code to reproduce is below (you need >=2 GPUs), and the log output is attached as well.

Standalone code to reproduce the issue

import os
import shutil

import tensorflow as tf
from contextlib import nullcontext
from tensorflow.keras.layers import Input, Conv2D, Layer, GroupNormalization
from tensorflow.keras.models import Model, load_model


class MyLayer(Layer):

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.norm = GroupNormalization(epsilon=1e-5)
        self.proj1 = Conv2D(channels, 1)
        self.proj2 = Conv2D(channels, 1)

    def call(self, inputs):
        return self.proj2(self.proj1(self.norm(inputs)))


class ModelMock(Model):

    def __init__(self, img_height, img_width, name=None):
        x_input = Input((img_height, img_width, 3), name="x_input")
        x = Conv2D(320, kernel_size=3, padding="same")(x_input)
        x = MyLayer(320)(x)
        output = Conv2D(16, kernel_size=3, padding="same")(x)
        super().__init__([x_input], output, name=name)


if __name__ == '__main__':
    strategy = tf.distribute.MirroredStrategy()
    print("Number of devices: {}".format(strategy.num_replicas_in_sync))
    if os.path.exists("./test"):
        shutil.rmtree("./test/")
    os.mkdir("./test/")

    # Set fail=False to run without the MirroredStrategy scope (single GPU), where everything works.
    fail = True
    with strategy.scope() if fail else nullcontext():
        print("CREATING")
        model = ModelMock(256, 256)
        # Weight names are fully qualified here, e.g. my_layer/conv2d_1/kernel:0
        for i, w in enumerate(model.weights): print(i, w.name)
        model.save_weights("./test/model1.hdf5")
        model.save("./test/model1")
        # After reloading inside the strategy scope the names are truncated to kernel:0, bias:0, ...
        model = load_model("./test/model1", compile=False)
        for i, w in enumerate(model.weights): print(i, w.name)
        # The duplicate names collide as HDF5 dataset names -> ValueError
        model.save_weights("./test/model2.hdf5")
    print("DONE")

Relevant log output

Number of devices: 2
CREATING
0 conv2d/kernel:0
1 conv2d/bias:0
2 my_layer/group_normalization/gamma:0
3 my_layer/group_normalization/beta:0
4 my_layer/conv2d_1/kernel:0
5 my_layer/conv2d_1/bias:0
6 my_layer/conv2d_2/kernel:0
7 my_layer/conv2d_2/bias:0
8 conv2d_3/kernel:0
9 conv2d_3/bias:0
0 kernel:0
1 bias:0
2 gamma:0
3 beta:0
4 kernel:0
5 bias:0
6 kernel:0
7 bias:0
8 kernel:0
9 bias:0
Traceback (most recent call last):
  File "/home/naked/dev/meta-cv/nak3d-fusion/python/apps/ml/check_weight_names.py", line 48, in <module>
    model.save_weights("./test/model2.hdf5")
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/dataset.py", line 163, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl, dapl=dapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 138, in h5py.h5d.create
ValueError: Unable to create dataset (name already exists)
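
For reference, the HDF5 failure at the end is just a dataset name collision; a minimal h5py sketch (hypothetical file and dataset names, not part of the model code) triggers the same ValueError:

import numpy as np
import h5py

with h5py.File("./test/collision.h5", "w") as f:
    f.create_dataset("kernel:0", data=np.zeros((3, 3)))
    # A second dataset with the same name in the same group fails with
    # ValueError: Unable to create dataset (name already exists)
    f.create_dataset("kernel:0", data=np.ones((3, 3)))
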
@google-ml-butler google-ml-butler bot added the type:bug Bug label Nov 27, 2023
@Venkat6871 Venkat6871 added comp:gpu GPU related issues TF 2.13 For issues related to Tensorflow 2.13 labels Nov 30, 2023
@Venkat6871
Contributor

Hi @moberweger ,

I have replicated the reported behaviour in Colab using TF v2.14 and v2.15. Please find the gist here for reference.

Thank you!

@SuryanarayanaY
Collaborator

Hi @moberweger ,

I have made changes to the attached code so that it is compatible with saving and loading in the .keras format.

Since there are custom objects in the model, we need to explicitly override the get_config() and from_config() methods. Please find the gists that execute fine with keras-3 and keras-2.15 as well.
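
Roughly, the change is along the lines of the sketch below (a minimal illustration only, assuming register_keras_serializable from tf.keras.utils is available; the actual gists may differ in details):

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Layer, GroupNormalization
from tensorflow.keras.utils import register_keras_serializable


@register_keras_serializable()
class MyLayer(Layer):

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.channels = channels  # keep the constructor argument for serialization
        self.norm = GroupNormalization(epsilon=1e-5)
        self.proj1 = Conv2D(channels, 1)
        self.proj2 = Conv2D(channels, 1)

    def call(self, inputs):
        return self.proj2(self.proj1(self.norm(inputs)))

    def get_config(self):
        # Serialize the constructor arguments so the layer can be rebuilt on load.
        config = super().get_config()
        config.update({"channels": self.channels})
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

With this in place the model can be saved as model.save("./test/model1.keras") and reloaded with load_model("./test/model1.keras", compile=False).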

Thanks!

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label Jan 22, 2024
@moberweger
Author

Thanks @SuryanarayanaY for looking into this.
I checked, and the specific code works on my multi-GPU setup, but I observed some inconsistencies:

  1. The extension (.hdf5 vs .h5) seems to have an impact: it still fails for the .hdf5 extension but works with .h5 (one way to force the format explicitly regardless of extension is sketched below).
  2. The gist for keras-2.15 outputs 0 conv2d/kernel:0 whereas keras-3.0 outputs 0 kernel.
  3. Interestingly, it also works without the .keras format.
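
Regarding item 1, a possible way to take the file extension out of the equation (a sketch assuming the Keras 2.x save_weights API, which accepts a save_format argument) is:

# Force the HDF5 format regardless of the file extension
model.save_weights("./test/model2.hdf5", save_format="h5")

# Or avoid HDF5 altogether: the TF checkpoint format addresses weights by
# object path rather than by variable name, so duplicate names do not collide
model.save_weights("./test/model2_ckpt", save_format="tf")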

Thanks!

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jan 22, 2024