
Conversation

@voznesenskym
Collaborator

voznesenskym commented Jan 5, 2023

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91779

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 664e89d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

voznesenskym added the topic: not user facing label on Jan 6, 2023
voznesenskym changed the title from "[WIP] Guard on at::Tensor device index" to "Guard on at::Tensor device index" on Jan 7, 2023
@voznesenskym
Collaborator Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jan 7, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / linux-focal-rocm5.3-py3.8 / test (default, 1, 2, linux.rocm.gpu)

Details for Dev Infra team: Raised by workflow job

albanD removed their request for review on January 7, 2023 at 10:52
@ezyang
Contributor

ezyang commented Jan 8, 2023

We should merge this as is, but I feel like it would be better if compiled subgraphs that don't involve DtoD transfers could be device agnostic. Certainly this is no problem for CUDA, nor for Triton, I assume.

@ngimel
Collaborator

ngimel commented Jan 9, 2023

It is kind of a problem for Triton, because currently it's set up to load the compiled code onto the current device on the first run, and that logic would need to be refactored if we may have to load the code onto other devices on subsequent runs. The logic for codegening device guards would also need to be redone, and I'm not quite sure what it should look like. It's also tricky for heterogeneous systems. So, given that we don't expect this situation to happen too often, I think recompiling is ok.

@ngimel
Collaborator

ngimel commented Jan 9, 2023

@voznesenskym is there a test forthcoming or should we land this?

@ezyang
Contributor

ezyang commented Jan 9, 2023

IMO, the most likely situation where this would happen is if someone tries to optimize (non-Distributed) DataParallel with torch.compile. I know we tell people not to use DataParallel, but honestly with torch.compile it's not clear to me that the reasoning behind this recommendation still stands. Using PT2 to get single-process multi-GPU working performantly would be pretty slick.
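
For reference, a minimal sketch of the single-process multi-GPU pattern described above, assuming a machine with at least two visible CUDA devices; the module and shapes are purely illustrative, not code from this PR:

import torch
import torch.nn as nn

# Illustrative only: wrap a torch.compile'd module in (non-Distributed) DataParallel.
model = nn.Linear(8, 8).cuda()
compiled = torch.compile(model)
dp = nn.DataParallel(compiled, device_ids=[0, 1])

# DataParallel scatters the batch, so the compiled module runs with inputs on both
# cuda:0 and cuda:1 - exactly the case where the device-index guard matters.
out = dp(torch.randn(16, 8, device="cuda"))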

@voznesenskym
Collaborator Author

@voznesenskym is there a test forthcoming or should we land this?

I was torn on it - I had a test, but it was ugly and felt like it was testing a bunch of other stuff. The merge got rejected for some failures anyway, so I might as well add a test.

@voznesenskym
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased voz/guards_index onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout voz/guards_index && git pull --rebase)

@ngimel
Collaborator

ngimel commented Jan 13, 2023

Why does it have to be ugly? It could be as simple as

import torch

def fn(x):
    return x / 3

opt_fn = torch.compile(fn)
x = torch.randn(4, device="cuda")
opt_fn(x)
x = torch.randn(4, device="cuda:1")
opt_fn(x)
torch.cuda.synchronize()

@voznesenskym
Collaborator Author

Why does it have to be ugly? It could be as simple as

import torch

def fn(x):
    return x / 3

opt_fn = torch.compile(fn)
x = torch.randn(4, device="cuda")
opt_fn(x)
x = torch.randn(4, device="cuda:1")
opt_fn(x)
torch.cuda.synchronize()

I wanted to assert which guard failed, but then I remembered that I added guard_failure_fn. NVM, it need not be ugly.
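
A rough sketch of what that assertion could look like, assuming the guard_fail_fn hook on torch._dynamo.optimize (the hook name and callback shape are recalled from memory, not verified against this branch):

import torch
import torch._dynamo as dynamo

guard_failures = []

def fn(x):
    return x / 3

# The second call comes in on a different device index, so the device guard should
# fail, the callback records the failure, and the function gets recompiled.
opt_fn = dynamo.optimize("eager", guard_fail_fn=guard_failures.append)(fn)
opt_fn(torch.randn(4, device="cuda"))
opt_fn(torch.randn(4, device="cuda:1"))
assert guard_failures, "expected the device-index guard to fail for cuda:1"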

@voznesenskym
Collaborator Author

Uhh, annoying - we don't have cuda:0/cuda:1 on this runner, so I'm getting invalid device errors. Might need to move this to a different suite.

@ngimel
Collaborator

ngimel commented Jan 15, 2023

cuda:1/cuda:0 needs @requires_multigpu() or some similar wrapper, and the test also has to go into a suite that actually runs in a multi-GPU config, e.g. inductor_distributed: https://github.com/pytorch/pytorch/actions/runs/3920059917/jobs/6701605948
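
For illustration, a hypothetical shape such a test could take; requires_multigpu here is a stand-in defined inline, not the helper referenced above, and the class and test names are made up:

import functools
import unittest

import torch

# Stand-in for the gating decorator mentioned above: skip unless two CUDA devices exist.
requires_multigpu = functools.partial(
    unittest.skipIf, torch.cuda.device_count() < 2, "requires at least two CUDA devices"
)

class MultiDeviceGuardTest(unittest.TestCase):
    @requires_multigpu()
    def test_device_index_guard(self):
        def fn(x):
            return x / 3

        opt_fn = torch.compile(fn)
        opt_fn(torch.randn(4, device="cuda:0"))
        # A tensor on the second device should trip the device-index guard and
        # trigger a recompile instead of reusing cuda:0-specialized code.
        opt_fn(torch.randn(4, device="cuda:1"))
        torch.cuda.synchronize()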

@voznesenskym
Collaborator Author

@pytorchbot merge -f "weird unrelated failure with pip install deps on windows jobs"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here


Development

Successfully merging this pull request may close these issues.

torch.compile burns-in device numbers, leading to errors when called with tensors on new device
