
[compiled autograd] Fix LoggingTensor flaky test #126144

Closed · wants to merge 5 commits

Conversation

@xmfan (Member) commented on May 14, 2024

pytorch-bot bot commented May 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126144

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit 1f39ed7 with merge base 91bf952:

FLAKY - The following jobs failed, but the failures were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

xmfan added a commit that referenced this pull request May 14, 2024
ghstack-source-id: 867efe2ecf898f9c3d15819b46467dccbd2a38cc
Pull Request resolved: #126144
@xmfan changed the title from "[compiled autograd] Fix flaky tests" to "[compiled autograd] Fix LoggingTensor flaky test" on May 14, 2024
@xmfan added the "topic: not user facing" label on May 14, 2024
Review comments (now outdated and resolved) on:
- test/inductor/test_compiled_autograd.py
- torch/_dynamo/compiled_autograd.py
- torch/testing/_internal/logging_tensor.py
LoggingTensor fails consistently when the root logger level is INFO or lower.
By default, the root logger level should be WARNING, but triton driver initialization overwrites the root logger to INFO, which causes the flakiness: #126143


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
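For illustration, here is a minimal stdlib-only sketch of the failure mode; the `basicConfig` call stands in for whatever triton's driver initialization actually does to the root logger (that detail is an assumption):

```
import logging

# A module logger with no explicit level inherits the root logger's
# effective level, so anything that reconfigures the root logger changes
# what a log-capturing test observes.
log = logging.getLogger("torch.testing._internal.logging_tensor")

print(log.getEffectiveLevel())  # 30 (WARNING) under the default root config

# Stand-in for triton driver init reconfiguring logging (assumption):
logging.basicConfig(level=logging.INFO)
print(log.getEffectiveLevel())  # now 20 (INFO): the test's behavior changed

# One way to make a test deterministic: pin the logger's level and stop
# propagation instead of relying on the inherited root level.
log.setLevel(logging.WARNING)
log.propagate = False
```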
@xmfan marked this pull request as ready for review on May 14, 2024 15:31
@xmfan requested a review from jansel on May 14, 2024 15:32
@xmfan requested a review from r-barnes on May 15, 2024 19:40
xmfan added 2 commits May 15, 2024 17:42
xmfan added a commit that referenced this pull request May 16, 2024
ghstack-source-id: 9e999edf4e9a1e41c381fdf20063338a6eb2f313
Pull Request resolved: #126144
[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request May 16, 2024
FIXES #126128.

Right now, we only clear the cache on context manager enter, so the state stays stale unless fresh_inductor_cache is called again; this is usually fine in tests.

Case in point: the compiled autograd tests when going from TestCompiledAutograd to TestAutogradWithCompiledAutograd. TestCompiledAutograd uses the context manager, but TestAutogradWithCompiledAutograd doesn't.

Pull Request resolved: #126146
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #126144
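As a hypothetical reduction of that bug (not the real fresh_inductor_cache implementation): a context manager that clears shared state only on enter leaks whatever the wrapped block wrote to every later caller that doesn't re-enter it, while clearing on exit as well keeps non-wrapped code clean.

```
from contextlib import contextmanager

# Hypothetical stand-in for an inductor-style cache; not the real API.
_cache: dict = {}

@contextmanager
def fresh_cache():
    _cache.clear()      # clearing only on enter reproduces the bug:
    try:                # entries written inside the block survive exit
        yield           # and leak into code that never uses the manager
    finally:
        _cache.clear()  # clearing on exit too keeps later tests that
                        # skip the context manager (e.g. the next test
                        # class) from seeing stale entries
```

In the scenario above, TestCompiledAutograd runs inside the manager while TestAutogradWithCompiledAutograd does not, so only an exit-time (or per-entry) reset keeps the second suite isolated.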
bilal2vec pushed a commit to bilal2vec/pytorch that referenced this pull request May 16, 2024
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Pull Request resolved: pytorch#126144
Approved by: https://github.com/jansel
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
pytorchmergebot pushed a commit that referenced this pull request May 19, 2024
Internal infra may not preserve Python and C++ log ordering. For example, in MAST logs (https://fburl.com/mlhub/38576cxn), all of the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs for the entire run appear at the beginning of the file.

Pull Request resolved: #126483
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148
pytorchmergebot pushed a commit that referenced this pull request May 19, 2024
- log only the first node key cache miss
- log existing node key sizes
- log which node's collected sizes became dynamic

For example:
```
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6)
```

Pull Request resolved: #126602
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148, #126483
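To surface logs like the sample above, one option is PyTorch's artifact logging controls; a sketch, assuming the artifact name `compiled_autograd_verbose` matches the logger name in the sample output:

```
import torch._logging

# Enable the verbose compiled autograd artifact (assumed artifact name,
# inferred from the logger in the sample output:
# torch._dynamo.compiled_autograd.__compiled_autograd_verbose).
torch._logging.set_logs(compiled_autograd_verbose=True)

# Equivalent environment-variable form (assumption):
#   TORCH_LOGS="compiled_autograd_verbose" python your_script.py
```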
@github-actions bot deleted the gh/xmfan/49/head branch on June 16, 2024 01:59