@jonb377 jonb377 commented Apr 20, 2023

Removes the circular import introduced in #4872 and reverted in #4911.

Verified by running the MNIST and SPMD ResNet-50 training tests:

root@t1v-n-f494979e-w-0:/workspaces/work/pytorch/xla/test# PJRT_DEVICE=TPU python test_train_mp_mnist.py --num_epochs=1
...
/home/ptxla/.local/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
| Training Device=xla:0/1 Epoch=1 Step=0 Loss=2.43866 Rate=14.43 GlobalRate=14.43 Time=19:43:02
| Training Device=xla:0/0 Epoch=1 Step=0 Loss=2.39480 Rate=14.43 GlobalRate=14.43 Time=19:43:02
| Training Device=xla:0/3 Epoch=1 Step=0 Loss=2.38561 Rate=14.39 GlobalRate=14.39 Time=19:43:02
| Training Device=xla:0/2 Epoch=1 Step=0 Loss=2.40065 Rate=14.37 GlobalRate=14.37 Time=19:43:02
root@t1v-n-f494979e-w-0:/workspaces/work/pytorch/xla/test/spmd# XLA_USE_SPMD=1 PJRT_DEVICE=TPU python test_train_spmd_imagenet.py --fake_data --sharding batch --profile
/home/ptxla/.local/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
==> Preparing data..
Sharding input along batch dimension with mesh [[[[0]]]


 [[[1]]]


 [[[2]]]


 [[[3]]]]
Epoch 1 train begin 19:44:03
| Training Device=xla:0/0 Epoch=1 Step=0 Loss=6.89059 Rate=2.05 GlobalRate=2.05 Time=19:45:06
| Training Device=xla:0/0 Epoch=1 Step=20 Loss=6.79297 Rate=705.80 GlobalRate=41.58 Time=19:45:08

@jonb377 jonb377 added the distributed (SPMD and other distributed things) label Apr 20, 2023
@jonb377 jonb377 requested a review from alanwaketan April 20, 2023 19:50

@alanwaketan alanwaketan left a comment

LGTM. Thanks, Jon.

@jonb377 jonb377 merged commit a537020 into master Apr 20, 2023
@jonb377 jonb377 deleted the jonbolin-dataloader branch April 20, 2023 23:37
