[Bug with Solution] "int not have assign_add" bug in parallel training

Environment: TensorFlow==2.3.1 and kungfu==0.2.2

During parallel training, an error is raised at
{PROJECT_ROOT}/hyperpose/Model/openpose/train.py:426:
at that point `step` is a plain int rather than the expected TensorFlow object (a tf.Variable), so the call to `assign_add` fails.

The cause is the for-loop at line 420, which uses a loop variable with the same name, `step`. The loop variable is an int, and it overwrites the original `step`.
```
step=tf.Variable(1, trainable=False)  
...  
for step_idx,step in enumerate(lr_decay_steps):          # overwrite the previous step  
    lr_decay_steps[step_idx] = step // current_cluster_size() + 1  # KungFu  
```

To fix the bug, rename the loop variable of the for-loop at line 420:
```
step=tf.Variable(1, trainable=False)
...
for step_idx,step_local in enumerate(lr_decay_steps):          # rename the local step
    lr_decay_steps[step_idx] = step_local // current_cluster_size() + 1  # KungFu
```
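The shadowing can be reproduced without TensorFlow or KungFu. The sketch below is only an illustration: `FakeVariable`, `cluster_size`, `buggy`, and `fixed` are hypothetical stand-ins, not names from hyperpose, and the cluster size of 4 is an arbitrary assumption.

```python
class FakeVariable:
    """Hypothetical stand-in for tf.Variable; only assign_add is modelled."""

    def __init__(self, value):
        self.value = value

    def assign_add(self, delta):
        self.value += delta
        return self


def cluster_size():
    # Hypothetical stand-in for KungFu's current_cluster_size().
    return 4


def buggy(lr_decay_steps):
    step = FakeVariable(1)
    # A Python for-loop does not open a new scope, so the loop target
    # rebinds the outer `step` to each int in the list.
    for step_idx, step in enumerate(lr_decay_steps):
        lr_decay_steps[step_idx] = step // cluster_size() + 1
    return step  # the last int from the list, not the FakeVariable


def fixed(lr_decay_steps):
    step = FakeVariable(1)
    # A distinct loop variable leaves the outer `step` untouched.
    for step_idx, step_local in enumerate(lr_decay_steps):
        lr_decay_steps[step_idx] = step_local // cluster_size() + 1
    return step


if __name__ == "__main__":
    print(type(buggy([200000, 300000])).__name__)   # int
    print(type(fixed([200000, 300000])).__name__)   # FakeVariable
```

Calling `assign_add` on the value returned by `buggy` raises `AttributeError`, which matches the failure described above.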

Also, later in the same code, some uses of `step` treat it as an int rather than a TensorFlow object. This may fail on some platforms (it did on my system, at least), so please replace those uses with `step.numpy()`.
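A minimal sketch of that conversion, assuming TensorFlow 2.x in eager mode. The names `step_int` and `log_every` are made up for illustration; which lines in train.py actually need the change depends on the code there.

```python
import tensorflow as tf

step = tf.Variable(1, trainable=False)
step.assign_add(1)  # works because `step` is still a tf.Variable

# Convert to a plain Python int wherever int semantics are expected,
# for example in string formatting or modulo checks.
step_int = int(step.numpy())
log_every = 2  # hypothetical logging interval
if step_int % log_every == 0:
    print(f"step {step_int:06d}")
```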

What's more, in newer versions of kungfu (maybe?), `kungfu.current_cluster_size` and `kungfu.current_rank` have been moved to the package `kungfu.python`, so the import at {PROJECT_ROOT}/hyperpose/Model/openpose/train.py:360 needs to change, too.
```
# from kungfu import current_cluster_size, current_rank        # old one
from kungfu.python import current_cluster_size, current_rank        # replaced
```
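If the code has to run against both package layouts, one option is to try the candidate modules in order. This is only a sketch: `import_from_first` is a hypothetical helper, not part of kungfu or hyperpose, and the kungfu call is left commented out so the block runs without kungfu installed.

```python
import importlib


def import_from_first(module_names, attr_names):
    """Return the named attributes from the first module that provides all of them."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module (or its parent package) is missing, try the next one
        if all(hasattr(module, name) for name in attr_names):
            return tuple(getattr(module, name) for name in attr_names)
    raise ImportError(f"none of {module_names} provides {attr_names}")


# Intended use in train.py (requires kungfu to be installed):
# current_cluster_size, current_rank = import_from_first(
#     ["kungfu.python", "kungfu"],
#     ["current_cluster_size", "current_rank"],
# )

# Demonstration with the standard library only:
sqrt, floor = import_from_first(["no_such_module_xyz", "math"], ["sqrt", "floor"])
```

A plain `try: from kungfu.python import ...` / `except ImportError: from kungfu import ...` does the same job in fewer lines when only this one import is affected.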

[Bug with Solution] "int not have assign_add" bug in parallel training #350
