
Conversation

@voznesenskym
Collaborator

voznesenskym commented Jan 5, 2023

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91779

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 664e89d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

voznesenskym added the topic: not user facing label on Jan 6, 2023
voznesenskym changed the title from "[WIP] Guard on at::Tensor device index" to "Guard on at::Tensor device index" on Jan 7, 2023
@voznesenskym
Collaborator Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jan 7, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / linux-focal-rocm5.3-py3.8 / test (default, 1, 2, linux.rocm.gpu)

Details for Dev Infra team: Raised by workflow job

albanD removed their request for review on January 7, 2023 at 10:52
@ezyang
Contributor

ezyang commented Jan 8, 2023

We should merge this as is, but I feel like it would be better if compiled subgraphs that don't involve DtoD transfers could be device agnostic. Certainly this is no problem for CUDA, nor for Triton, I assume.

@ngimel
Collaborator

ngimel commented Jan 9, 2023

It is kind of a problem for Triton, because currently it's set up to load the compiled code onto the current device on the first run, and that logic would need to be refactored if we may have to load the code onto other devices on subsequent runs. The logic for codegening device guards would also need to be redone, and I'm not quite sure what it should look like. It's also tricky for heterogeneous systems. So, given that we don't expect this situation to happen too often, I think recompiling is ok.

@ngimel
Collaborator

ngimel commented Jan 9, 2023

@voznesenskym is there a test forthcoming or should we land this?

@ezyang
Contributor

ezyang commented Jan 9, 2023

IMO, the most likely situation where this would happen is if someone tries to optimize (non-Distributed) DataParallel with torch.compile. I know we tell people not to use DataParallel, but honestly with torch.compile it's not clear to me that the reasoning behind this recommendation still stands. Using PT2 to get single-process multi-GPU working performantly would be pretty slick.
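
For reference, a minimal sketch of the single-process multi-GPU pattern described above, assuming a machine with at least two visible CUDA devices; the module and shapes are purely illustrative, not code from this PR:

import torch
import torch.nn as nn

# Illustrative only: wrap a torch.compile'd module in (non-Distributed) DataParallel.
model = nn.Linear(8, 8).cuda()
compiled = torch.compile(model)
dp = nn.DataParallel(compiled, device_ids=[0, 1])

# DataParallel scatters the batch, so the compiled module runs with inputs on both
# cuda:0 and cuda:1 - exactly the case where the device-index guard matters.
out = dp(torch.randn(16, 8, device="cuda"))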

@voznesenskym
Collaborator Author

@voznesenskym is there a test forthcoming or should we land this?

I was torn on it - I had a test, but it was ugly and felt like it was testing a bunch of other stuff. The merge got rejected for some failures anyway, so I might as well add a test.

@voznesenskym
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased voz/guards_index onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout voz/guards_index && git pull --rebase)

@ngimel
Collaborator

ngimel commented Jan 13, 2023

Why does it have to be ugly? It could be as simple as

import torch

def fn(x):
    return x / 3

opt_fn = torch.compile(fn)
x = torch.randn(4, device="cuda")
opt_fn(x)
x = torch.randn(4, device="cuda:1")
opt_fn(x)
torch.cuda.synchronize()

@voznesenskym
Collaborator Author

Why does it have to be ugly? It could be as simple as

import torch

def fn(x):
    return x / 3

opt_fn = torch.compile(fn)
x = torch.randn(4, device="cuda")
opt_fn(x)
x = torch.randn(4, device="cuda:1")
opt_fn(x)
torch.cuda.synchronize()

I wanted to assert which guard failed, but then I remembered that I added guard_failure_fn. NVM, it need not be ugly.
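
A rough sketch of what that assertion could look like, assuming the guard_fail_fn hook on torch._dynamo.optimize (the hook name and callback shape are recalled from memory, not verified against this branch):

import torch
import torch._dynamo as dynamo

guard_failures = []

def fn(x):
    return x / 3

# The second call comes in on a different device index, so the device guard should
# fail, the callback records the failure, and the function gets recompiled.
opt_fn = dynamo.optimize("eager", guard_fail_fn=guard_failures.append)(fn)
opt_fn(torch.randn(4, device="cuda"))
opt_fn(torch.randn(4, device="cuda:1"))
assert guard_failures, "expected the device-index guard to fail for cuda:1"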

@voznesenskym
Collaborator Author

Uhh, annoying - we don't have cuda:0/cuda:1 on this runner, so I'm getting invalid device errors. Might need to move this to a different suite.

@ngimel
Collaborator

ngimel commented Jan 15, 2023

cuda:1/cuda:0 needs @requires_multigpu() or some similar wrapper, and the test also has to go into a suite that actually runs in a multi-GPU config, e.g. inductor_distributed: https://github.com/pytorch/pytorch/actions/runs/3920059917/jobs/6701605948
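
For illustration, a hypothetical shape such a test could take; requires_multigpu here is a stand-in defined inline, not the helper referenced above, and the class and test names are made up:

import functools
import unittest

import torch

# Stand-in for the gating decorator mentioned above: skip unless two CUDA devices exist.
requires_multigpu = functools.partial(
    unittest.skipIf, torch.cuda.device_count() < 2, "requires at least two CUDA devices"
)

class MultiDeviceGuardTest(unittest.TestCase):
    @requires_multigpu()
    def test_device_index_guard(self):
        def fn(x):
            return x / 3

        opt_fn = torch.compile(fn)
        opt_fn(torch.randn(4, device="cuda:0"))
        # A tensor on the second device should trip the device-index guard and
        # trigger a recompile instead of reusing cuda:0-specialized code.
        opt_fn(torch.randn(4, device="cuda:1"))
        torch.cuda.synchronize()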

@voznesenskym
Collaborator Author

@pytorchbot merge -f "weird unrelated failure with pip install deps on windows jobs"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here


Development

Successfully merging this pull request may close these issues.

torch.compile burns-in device numbers, leading to errors when called with tensors on new device
