No distributed view in tensorboard #640
Comments
A new torch-tb-profiler v0.4.1 has been released. Could you please check if your issue is fixed there: https://pypi.org/project/torch-tb-profiler/0.4.1
The same thing is happening to me on 0.4.1: I do a DDP training run, and the output traces include calls to DistributedDataParallel.forward and nccl allReduce, so it seems like the distributed view should appear. But when I load the traces into tensorboard, all of the views appear except distributed. Since the trace shows the all reduce and DDP calls, I think the tool is seeing them...
It looks for DDP here: https://github.com/pytorch/kineto/blob/main/tb_plugin/torch_tb_profiler/profiler/event_parser.py#L199, but in my trace, the […] It also looks like […]
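Since the plugin detects a distributed run by matching event names in the trace, one quick way to debug a missing distributed view is to scan the exported trace JSON yourself. Below is a minimal sketch; the marker substrings (`DistributedDataParallel.forward`, `nccl:all_reduce`) are assumptions drawn from this thread, not the plugin's authoritative detection list — check `event_parser.py` for the real one.

```python
# Hedged sketch: scan a parsed Chrome-trace export (the .json files the
# profiler writes) for event names that hint at a DDP run. The marker
# substrings are assumptions from this thread, not the plugin's real list.
import json

def find_distributed_events(trace):
    """Return (category, name) pairs for events that look DDP-related."""
    markers = ("DistributedDataParallel.forward", "nccl:all_reduce")
    return [(ev.get("cat", ""), ev["name"])
            for ev in trace.get("traceEvents", [])
            if any(m in ev.get("name", "") for m in markers)]

# Example against an inline trace fragment; for a real file you would use
# something like: trace = json.load(open("worker0.pt.trace.json"))
sample = {"traceEvents": [
    {"name": "aten::mm", "cat": "cpu_op"},
    {"name": "nccl:all_reduce", "cat": "user_annotation"},
]}
print(find_distributed_events(sample))
# → [('user_annotation', 'nccl:all_reduce')]
```

If this returns nothing for your trace, the plugin likely has nothing to build the distributed view from; if it returns hits but the view is still missing, the problem is probably in how the parser matches categories (e.g. `user_annotation` vs. `cpu_op`).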
Without downgrading torch to 1.11.0, the distributed view doesn't work (even with the latest torch-tb-profiler v0.4.1).
Thank you for the issue report and updates. I fixed the issue in PR #717 but missed releasing a new torch-tb-profiler package. Let me create a PR to get that going: #732. cc @woolpeeker, @arthurfeeney, @srikanthmalla
I think the distributed view still has issues. I was able to hack around it a bit and at least got the distributed view to appear by making changes similar to the ones you made in #717 (basically just forcing it to look at user annotations for all reduce and […]
Does 1.13.1 work for the distributed view?
The python code is below. I use slurm sbatch to start it in the cluster; the backend is nccl.
The generated .json files seem normal.
The "Overview" and other views appear, but not "Distributed", which is exactly the one I need.
There are some error messages from tensorboard. I paste part of them below because the rest are repetitions.
Environment:
Python==3.8.13
torch==1.12.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
torch-tb-profiler==0.4.0
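As an aside, a version list like the one above can be collected programmatically, which makes it easy to paste an accurate environment into bug reports. This is a generic sketch using only the standard library (`importlib.metadata`, Python 3.8+); the package names are just the ones mentioned in this thread.

```python
# Hedged sketch: report installed versions of the packages relevant to
# this issue. Missing packages are reported rather than raising.
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages):
    """Map each distribution name to its installed version string."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = "not installed"
    return out

for name, ver in report_versions(
        ["torch", "tensorboard", "torch-tb-profiler"]).items():
    print(f"{name}=={ver}")
```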