Description
Environment: TensorFlow==2.3.1 and kungfu==0.2.2.
Parallel training fails with an error at
{PROJECT_ROOT}/hyperpose/Model/openpose/train.py:426,
which shows that type(step) is int rather than tf.Tensor.
The cause is the for-loop at line 420: it introduces a loop variable with the same name, 'step', of type int, which overwrites the original one.
step = tf.Variable(1, trainable=False)
...
for step_idx, step in enumerate(lr_decay_steps):  # overwrites the previous step
    lr_decay_steps[step_idx] = step // current_cluster_size() + 1  # KungFu
To fix the bug, rename the loop variable of the for-loop at line 420:
step = tf.Variable(1, trainable=False)
...
for step_idx, step_local in enumerate(lr_decay_steps):  # rename the loop variable
    lr_decay_steps[step_idx] = step_local // current_cluster_size() + 1  # KungFu
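The shadowing can be reproduced without TensorFlow or KungFu. The sketch below is only an illustration, not the project's code: FakeVariable is a hypothetical stand-in for tf.Variable, and cluster_size stands in for the value returned by current_cluster_size(). It relies on the fact that a Python for-loop does not open a new scope, so the loop variable rebinds any existing name in the enclosing function.

```python
class FakeVariable:
    """Hypothetical stand-in for tf.Variable so the sketch runs without TensorFlow."""

    def __init__(self, value):
        self.value = value


def buggy(lr_decay_steps, cluster_size):
    step = FakeVariable(1)
    # The loop variable reuses the name `step`, so the variable above is lost.
    for step_idx, step in enumerate(lr_decay_steps):
        lr_decay_steps[step_idx] = step // cluster_size + 1
    return type(step).__name__


def fixed(lr_decay_steps, cluster_size):
    step = FakeVariable(1)
    # A distinct loop variable leaves `step` untouched.
    for step_idx, step_local in enumerate(lr_decay_steps):
        lr_decay_steps[step_idx] = step_local // cluster_size + 1
    return type(step).__name__


print(buggy([200000, 300000], 4))  # int
print(fixed([200000, 300000], 4))  # FakeVariable
```

After the buggy loop, `step` holds the last element of lr_decay_steps, which is why the later code sees an int.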
Also, later in that code, some accesses of step treat it as an int rather than a tf.Tensor. This may fail on some platforms (it does at least on my system), so please replace those accesses with step.numpy().
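One way to make such accesses tolerant of both types is a small helper. This is a sketch, not code from the project: step_as_int is a hypothetical name. It assumes that eager tensors and variables expose a .numpy() method and that plain ints do not.

```python
def step_as_int(step):
    """Return a plain int from an int or from an object exposing .numpy()."""
    numpy_fn = getattr(step, "numpy", None)
    if callable(numpy_fn):
        # Eager tf.Tensor / tf.Variable objects take this branch.
        return int(numpy_fn())
    return int(step)
```

With this helper, a check such as `step_as_int(step) % 100 == 0` would behave the same whether step is an int or an eager tensor.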
What's more, in a newer version of kungfu (maybe?), kungfu.current_cluster_size and kungfu.current_rank have been moved to the package kungfu.python, so the import at {PROJECT_ROOT}/hyperpose/Model/openpose/train.py:360 needs to change, too.
# from kungfu import current_cluster_size, current_rank # old one
from kungfu.python import current_cluster_size, current_rank # replaced
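If both kungfu layouts need to be supported, the import could fall back from one location to the other. The helper below is a sketch under that assumption. import_first is a hypothetical name, not part of kungfu, and I have not checked which kungfu versions use which layout.

```python
import importlib


def import_first(module_names, attr_names):
    """Return the named attributes from the first module that provides them all."""
    last_error = None
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
            return tuple(getattr(module, name) for name in attr_names)
        except (ImportError, AttributeError) as err:
            last_error = err  # remember the failure and try the next candidate
    raise ImportError(
        f"none of {list(module_names)} provide {list(attr_names)}"
    ) from last_error
```

In train.py it could be used as `current_cluster_size, current_rank = import_first(("kungfu.python", "kungfu"), ("current_cluster_size", "current_rank"))`, which tries the new location first and the old one second.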