Fake: fix conv_transpose2d striding #82846


Closed
eellison wants to merge 5 commits

Conversation

@eellison (Contributor) commented Aug 4, 2022

Stack from ghstack (oldest at bottom):

The channels-last preservation logic for output striding differs between CUDA and CPU. For the meta kernel, we can peek at the fake tensor's device and use that to determine whether to follow the CPU or the CUDA behavior.

You could argue there's a leaking of abstraction here, but this seems like a pretty minimal leak, and I'm not sure there's a much cleaner way forward for device-specific striding logic during tracing.
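To make the idea concrete, here is a minimal sketch of what device sniffing in a meta kernel can look like. This is illustrative only: the function name and the exact per-device channels-last rules are assumptions, not the code in this PR.

```python
import torch

def conv_output_memory_format(input_t: torch.Tensor, weight: torch.Tensor) -> torch.memory_format:
    # Illustrative sketch: a fake tensor records the device of the tensor it
    # models, so a meta kernel can branch on it to mirror the right backend.
    def is_channels_last(t: torch.Tensor) -> bool:
        return t.is_contiguous(memory_format=torch.channels_last)

    if input_t.device.type == "cuda":
        # Assumed CUDA-style rule: preserve channels-last if either the
        # input or the weight is channels-last.
        if is_channels_last(input_t) or is_channels_last(weight):
            return torch.channels_last
    else:
        # Assumed CPU-style rule: preserve channels-last only when the
        # input itself is channels-last.
        if is_channels_last(input_t):
            return torch.channels_last
    return torch.contiguous_format
```

The point is only that the branch keys off the device recorded on the fake tensor rather than where the traced computation actually runs.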

@facebook-github-bot (Contributor) commented Aug 4, 2022

❌ 2 New Failures

As of commit 658636d (more details on the Dr. CI page):

  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu) (1/2)

Step: "Test" (full log | diagnosis details)

2022-08-05T02:07:19.3798736Z RuntimeError: test_ops failed!
2022-08-05T02:07:15.7747154Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCommonCUDA-20220805004020.xml
2022-08-05T02:07:15.9923374Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCUDA-20220805004020.xml
2022-08-05T02:07:16.1054624Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCUDA-20220805004020.xml
2022-08-05T02:07:16.2623200Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCUDA-20220805004020.xml
2022-08-05T02:07:16.3574312Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCUDA-20220805004020.xml
2022-08-05T02:07:19.3794521Z Traceback (most recent call last):
2022-08-05T02:07:19.3795058Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 974, in <module>
2022-08-05T02:07:19.3795415Z     main()
2022-08-05T02:07:19.3795727Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 952, in main
2022-08-05T02:07:19.3798364Z     raise RuntimeError(err_message)
2022-08-05T02:07:19.3798736Z RuntimeError: test_ops failed!
2022-08-05T02:07:19.6723309Z 
2022-08-05T02:07:19.6724293Z real	87m6.068s
2022-08-05T02:07:19.6724607Z user	85m50.130s
2022-08-05T02:07:19.6724858Z sys	1m23.460s
2022-08-05T02:07:19.6778667Z ##[error]Process completed with exit code 1.
2022-08-05T02:07:19.6819575Z Prepare all required actions
2022-08-05T02:07:19.6819931Z Getting action download info
2022-08-05T02:07:19.9511978Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-05T02:07:19.9512304Z with:
2022-08-05T02:07:19.9512984Z   github-token: ***

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge) (2/2)

Step: "Test" (full log | diagnosis details)

2022-08-05T01:54:35.7565935Z FAIL [1.138s]: test_load_standalone (__main__.TestStandaloneCPPJIT)
2022-08-05T01:54:34.6181571Z AssertionError: String comparison failed: '/tmp/tmp6f_a5_ho/standalone_load_test_v2' != '/tmp/tmp6f_a5_ho/standalone_load_test'
2022-08-05T01:54:34.6181910Z - /tmp/tmp6f_a5_ho/standalone_load_test_v2
2022-08-05T01:54:34.6182158Z ?                                      ---
2022-08-05T01:54:34.6182358Z + /tmp/tmp6f_a5_ho/standalone_load_test
2022-08-05T01:54:34.6182491Z 
2022-08-05T01:54:34.6182496Z 
2022-08-05T01:54:35.7564375Z   test_load_standalone (__main__.TestStandaloneCPPJIT) ... FAIL (1.138s)
2022-08-05T01:54:35.7565105Z     test_load_standalone failed - num_retries_left: 0
2022-08-05T01:54:35.7565375Z 
2022-08-05T01:54:35.7565529Z ======================================================================
2022-08-05T01:54:35.7565935Z FAIL [1.138s]: test_load_standalone (__main__.TestStandaloneCPPJIT)
2022-08-05T01:54:35.7566524Z ----------------------------------------------------------------------
2022-08-05T01:54:35.7566925Z Traceback (most recent call last):
2022-08-05T01:54:35.7567351Z   File "test_utils.py", line 626, in test_load_standalone
2022-08-05T01:54:35.7567719Z     def test_load_standalone(self):
2022-08-05T01:54:35.7568275Z   File "test_utils.py", line 652, in test_load_standalone
2022-08-05T01:54:35.7568715Z     os.path.join(build_dir, f"standalone_load_test{ext}")
2022-08-05T01:54:35.7569458Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2393, in assertEqual
2022-08-05T01:54:35.7570096Z     msg=(lambda generated_msg: f"{generated_msg} : {msg}") if isinstance(msg, str) and self.longMessage else msg,
2022-08-05T01:54:35.7570909Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-05T01:54:35.7571416Z     raise error_metas[0].to_error(msg)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@eellison requested a review from ezyang August 4, 2022 21:43
@ezyang (Contributor) left a comment

input sniffing is cool, we do it in other places too.

```diff
-        return torch.channels_last
-    elif input_tensor.is_contiguous(memory_format=torch.contiguous_format):
+    def is_channels_last(ten):
+        return ten.is_contiguous(memory_format=torch.channels_last)
```
Collaborator
This is too permissive; e.g., weight tensors with a 1x1 kernel will be channels-last contiguous, but they won't be channels-last for the purposes of memory-format propagation.
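A quick demonstration of the ambiguity (not code from the PR): with singleton spatial dimensions, the same strides satisfy both layout checks.

```python
import torch

# Conv weight with a 1x1 kernel: (out_channels, in_channels, kH, kW).
w = torch.randn(8, 4, 1, 1)

# Size-1 dims are skipped by the stride check, so this tensor passes BOTH
# tests even though it was allocated plain-contiguous.
print(w.is_contiguous(memory_format=torch.contiguous_format))  # True
print(w.is_contiguous(memory_format=torch.channels_last))      # True
```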

@ngimel (Collaborator) left a comment

The right thing would be to expose suggest_memory_format to Python, because that is what's used internally to decide whether something is channels-last. But for now a TODO will probably suffice.
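For reference, a rough Python-side approximation of the decision at::suggest_memory_format makes is sketched below. This is an assumption-laden stand-in, not the actual ATen logic, which works from the tensor's computed stride properties in C++.

```python
import torch

def suggest_memory_format_py(t: torch.Tensor) -> torch.memory_format:
    # Rough stand-in for at::suggest_memory_format: prefer channels-last only
    # when the strides are channels-last and NOT also plain-contiguous, which
    # sidesteps the ambiguous 1x1-kernel case above.
    if (
        t.dim() == 4
        and t.is_contiguous(memory_format=torch.channels_last)
        and not t.is_contiguous(memory_format=torch.contiguous_format)
    ):
        return torch.channels_last
    return torch.contiguous_format
```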

eellison added a commit that referenced this pull request Aug 5, 2022
ghstack-source-id: 1d86fdd
Pull Request resolved: #82846
eellison added a commit that referenced this pull request Sep 20, 2022
ghstack-source-id: 892e86e
Pull Request resolved: #82846
@pytorch-bot bot commented Sep 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/82846

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 3 Pending

As of commit 20f8d2d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang (Contributor) commented Sep 20, 2022

This might be the root cause of pytorch/torchdynamo#1274; it would be good to land it.

@eellison (Contributor, Author)

Yup, that's the intention. But also setting up cross-ref testing now...

@eellison (Contributor, Author)

@pytorchbot help

@pytorch-bot bot commented Sep 20, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label')

usage: @pytorchbot [-h] {merge,revert,rebase,label} ...

Try @pytorchbot --help for more info.

@eellison (Contributor, Author)

@pytorchbot --help

@pytorch-bot bot commented Sep 20, 2022

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. 

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR

Merge

usage: @pytorchbot merge [-g | -f MESSAGE | -l | -r [REBASE]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -g, --green           Merge when all status checks running on the PR pass. To add status checks, use labels like `ciflow/trunk`.
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditing purposes, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
  -l, --land-checks     Merge with land time checks. This will create a new branch with your changes rebased on viable/strict and run a majority of trunk tests _before_ landing to increase trunk reliability and decrease risk of revert. The tests added are: pull, Lint and trunk. Note that periodic is excluded. (EXPERIMENTAL)
  -r [REBASE], --rebase [REBASE]
                        Rebase the PR to re-run checks before merging. It will accept a branch name and will default to master if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the default branch of pytorch (master).
You, along with any member of the pytorch organization, can rebase your PR.

optional arguments:
  -s, --stable          Rebase to viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR

positional arguments:
  labels  Labels to add to given Pull Request

@eellison (Contributor, Author)
@pytorchbot -h merge

@pytorch-bot bot commented Sep 20, 2022

PyTorchBot Help (same help output as above)

@eellison (Contributor, Author)
@pytorchbot merge --green

@pytorchmergebot (Collaborator)
@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@eellison (Contributor, Author)
@pytorchbot merge

@pytorchmergebot (Collaborator)
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@pytorchmergebot (Collaborator)
@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions (Contributor)
Hey @eellison.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@facebook-github-bot facebook-github-bot deleted the gh/eellison/305/head branch September 24, 2022 14:20
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
Pull Request resolved: #82846
Approved by: https://github.com/ezyang