@ymwangg
Contributor

@ymwangg ymwangg commented Nov 10, 2022

This PR fixes #2508. I'm not sure why the GPU device does not allow reusing the same ClusterSpec, but leaving it blank seems to work.
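
To illustrate the idea only, here is a minimal Python sketch of "leave the ClusterSpec blank on GPU, reuse it otherwise". Everything in it is an assumption made for illustration: `ClusterSpec`, `SessionConfig`, `build_session_config`, the job name, and the address are hypothetical stand-ins, not the types or functions touched by this PR, and whether the real change is conditioned on the device type is not stated in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


# Hypothetical stand-ins for illustration; NOT the torch_xla / XRT types.
@dataclass
class ClusterSpec:
    # Maps a job name to the list of task addresses in that job.
    jobs: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class SessionConfig:
    # None means "leave the cluster spec blank".
    cluster_spec: Optional[ClusterSpec] = None


def build_session_config(device_type: str, cached_spec: ClusterSpec) -> SessionConfig:
    """Reuse the cached ClusterSpec, except on GPU where it is left blank.

    The GPU-only condition is an assumption of this sketch.
    """
    if device_type.upper() == "GPU":
        return SessionConfig(cluster_spec=None)
    return SessionConfig(cluster_spec=cached_spec)


# Hypothetical job name and address, chosen only for the example.
spec = ClusterSpec(jobs={"localservice": ["localhost:40934"]})
gpu_config = build_session_config("GPU", spec)
tpu_config = build_session_config("TPU", spec)
```

In this sketch the GPU config carries no cluster spec at all, while other device types keep the cached one unchanged.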

Tested with `GPU_NUM_DEVICES=8 python test_train_mp_mnist.py --metrics_debug --fake_data --num_epochs 1`.

Single-gpu metrics report: https://gist.github.com/ymwangg/609b79ed783a1797f87d55480d23bc8d
Multi-gpu metrics report: https://gist.github.com/ymwangg/8218ddf4ed8ce2d1ac35758effabdf6f

cc @YangFei1990

@YangFei1990
Contributor

Great, thanks a lot @ymwangg! A few questions for my understanding:
If we do not use the cached session config, will it generate some default config? How can we be sure that such a change won't affect the metrics generation?

@ymwangg
Contributor Author

ymwangg commented Nov 10, 2022

@YangFei1990 My understanding is that the same graph will produce the same results regardless of the session config. I could be wrong, and the Google folks may have a better answer.

@JackCaoG
Collaborator

Thanks! I think a rebase should fix the build issue.

Collaborator

@JackCaoG JackCaoG left a comment

Thanks @ymwangg!

@JackCaoG JackCaoG merged commit 44362e6 into pytorch:master Nov 11, 2022
Successfully merging this pull request may close these issues.

Crash using XLA GPU

3 participants