Fix multi-gpu metrics report #4187

ymwangg · 2022-11-10T18:44:54Z

This PR fixes #2508. I'm not sure why the GPU device does not allow reusing the same ClusterSpec, but leaving it blank seems to work.

Tested with `GPU_NUM_DEVICES=8 python test_train_mp_mnist.py --metrics_debug --fake_data --num_epochs 1`.

Single-gpu metrics report: https://gist.github.com/ymwangg/609b79ed783a1797f87d55480d23bc8d
Multi-gpu metrics report: https://gist.github.com/ymwangg/8218ddf4ed8ce2d1ac35758effabdf6f
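
One rough way to check that a multi-GPU report carries the same entries as a single-GPU one is to compare the metric and counter names in the two texts. The sketch below assumes each entry starts with a `Metric: <name>` or `Counter: <name>` header line; the helper names and the sample excerpts are hypothetical and are not copied from the gists above.

```python
import re
from typing import Set

# Assumed header format: "Metric: <name>" or "Counter: <name>" on its own line.
HEADER = re.compile(r"^(Metric|Counter):\s*(\S+)\s*$")


def metric_names(report: str) -> Set[str]:
    """Collect 'Kind:Name' keys from the header lines of a report."""
    names = set()
    for line in report.splitlines():
        match = HEADER.match(line.strip())
        if match:
            names.add(f"{match.group(1)}:{match.group(2)}")
    return names


def missing_names(reference: str, candidate: str) -> Set[str]:
    """Names present in the reference report but absent from the candidate."""
    return metric_names(reference) - metric_names(candidate)


# Made-up excerpts for illustration only.
single_gpu = """\
Metric: CompileTime
  TotalSamples: 2
Counter: CachedCompile
  Value: 10
"""
multi_gpu = """\
Metric: CompileTime
  TotalSamples: 2
"""

print(sorted(missing_names(single_gpu, multi_gpu)))
```

An empty result would suggest that no metric or counter disappeared from the multi-GPU report; it says nothing about whether the recorded values themselves are comparable.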

cc @YangFei1990

YangFei1990 · 2022-11-10T18:56:44Z

Great, thanks a lot @ymwangg! A few questions for my understanding:
If we do not use the cached session config, will it generate some default config? How can we be sure that this change won't affect metrics generation?

ymwangg · 2022-11-10T19:21:53Z

@YangFei1990 My understanding is that the same graph will produce the same results regardless of the session config. I could be wrong, and the Google folks may have a better answer.

JackCaoG · 2022-11-11T01:27:47Z

Thanks! I think a rebase should fix the build issue.

JackCaoG

Thanks @ymwangg!

JackCaoG added the xla:gpu label Nov 10, 2022

Fix multi-gpu metrics report

1f61f3e

ymwangg force-pushed the fix_metrics_report branch from 56a04a1 to 1f61f3e on November 11, 2022 01:46

JackCaoG approved these changes Nov 11, 2022

JackCaoG merged commit 44362e6 into pytorch:master Nov 11, 2022
