[Bug with Solution] "int not have assign_add" bug in parallel training #350

@davidMc0109

With TensorFlow==2.3.1 and kungfu==0.2.2

During parallel training, a failure is reported at
{PROJECT_ROOT}/hyperpose/Model/openpose/train.py:426.
At that point type(step) is int rather than a TensorFlow variable, so it has no assign_add and the call fails.

This happens because the for loop at line 420 reuses the name 'step' for its loop variable, which is an int, and so overwrites the tf.Variable defined earlier.

step=tf.Variable(1, trainable=False)  
...  
for step_idx,step in enumerate(lr_decay_steps):          # overwrite the previous step  
    lr_decay_steps[step_idx] = step // current_cluster_size() + 1  # KungFu  

To fix the bug, rename the loop variable of the for loop at line 420:

step=tf.Variable(1, trainable=False)
...
for step_idx,step_local in enumerate(lr_decay_steps):          # rename the local step
    lr_decay_steps[step_idx] = step_local // current_cluster_size() + 1  # KungFu
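
The shadowing can be reproduced without TensorFlow or KungFu. The following is a minimal sketch, not the project's code: `FakeVariable` is a hypothetical stand-in for `tf.Variable` that only provides `assign_add`, and `cluster_size()` is a placeholder for `current_cluster_size()` that is assumed to return 4.

```python
class FakeVariable:
    """Hypothetical stand-in for tf.Variable; only assign_add is modelled."""

    def __init__(self, value):
        self.value = value

    def assign_add(self, delta):
        self.value += delta
        return self


def cluster_size():
    # Placeholder for kungfu's current_cluster_size(); 4 workers assumed.
    return 4


def scale_buggy(lr_decay_steps):
    step = FakeVariable(1)
    # The loop variable rebinds `step`, so the counter object is lost.
    for step_idx, step in enumerate(lr_decay_steps):
        lr_decay_steps[step_idx] = step // cluster_size() + 1
    return step


def scale_fixed(lr_decay_steps):
    step = FakeVariable(1)
    # A distinct loop variable name leaves `step` untouched.
    for step_idx, decay_step in enumerate(lr_decay_steps):
        lr_decay_steps[step_idx] = decay_step // cluster_size() + 1
    return step


buggy = scale_buggy([200000, 300000])
fixed = scale_fixed([200000, 300000])

print(type(buggy).__name__)          # int
print(hasattr(buggy, "assign_add"))  # False
fixed.assign_add(1)
print(fixed.value)                   # 2
```

A Python for loop does not create a new scope, so the loop variable is still bound after the loop ends, which is why the int survives until the later `assign_add` call.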

Also, later parts of that code treat step as an int rather than a tf.Tensor, which may fail on some platforms (it does at least on my system), so please replace those accesses with step.numpy().
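
One way to make those accesses tolerant of both types is a small conversion helper. This is only a sketch: `as_int` is a hypothetical name that does not exist in the project, and `FakeStep` stands in for an eager `tf.Variable`, which exposes `.numpy()`.

```python
def as_int(step):
    """Return a plain int from an int or from an object exposing .numpy()."""
    if hasattr(step, "numpy"):
        return int(step.numpy())
    return int(step)


class FakeStep:
    """Hypothetical stand-in for an eager tf.Variable holding a scalar."""

    def __init__(self, value):
        self._value = value

    def numpy(self):
        return self._value


print(as_int(7))            # 7
print(as_int(FakeStep(7)))  # 7
```

With such a helper, expressions like `as_int(step) % 100 == 0` would behave the same whether `step` is a counter variable or a plain int.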

What's more, in newer versions of kungfu (maybe?), kungfu.current_cluster_size and kungfu.current_rank have been moved to the package kungfu.python, so the import at {PROJECT_ROOT}/hyperpose/Model/openpose/train.py:360 needs to change, too.

# from kungfu import current_cluster_size, current_rank        # old one
from kungfu.python import current_cluster_size, current_rank        # replaced
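
If the code has to work with both package layouts, the import could be guarded. This is a sketch under assumptions: only the two import paths come from this issue, while the final single-process fallback (cluster size 1, rank 0) is my own addition for machines without kungfu and is not part of the library.

```python
try:
    # Newer layout, as described in this issue.
    from kungfu.python import current_cluster_size, current_rank
except ImportError:
    try:
        # Older layout.
        from kungfu import current_cluster_size, current_rank
    except ImportError:
        # Assumed fallback for a single-process run without kungfu.
        def current_cluster_size():
            return 1

        def current_rank():
            return 0


print(current_cluster_size(), current_rank())
```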
