
[compiled autograd] Fix LoggingTensor flaky test #126144

Closed · wants to merge 5 commits

Conversation

@xmfan (Member) commented on May 14, 2024

pytorch-bot bot commented May 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126144

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit 1f39ed7 with merge base 91bf952:

FLAKY - The following jobs failed, but the failures were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

xmfan added a commit that referenced this pull request May 14, 2024
ghstack-source-id: 867efe2ecf898f9c3d15819b46467dccbd2a38cc
Pull Request resolved: #126144
@xmfan changed the title from "[compiled autograd] Fix flaky tests" to "[compiled autograd] Fix LoggingTensor flaky test" on May 14, 2024
@xmfan added the "topic: not user facing" label on May 14, 2024
Review comments (now outdated and resolved) on:
- test/inductor/test_compiled_autograd.py
- torch/_dynamo/compiled_autograd.py
- torch/testing/_internal/logging_tensor.py
LoggingTensor fails consistently when the root logger level is INFO or lower.
By default, the root logger level should be WARNING, but triton driver initialization overwrites the root logger to INFO, which causes the flakiness: #126143


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
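For illustration, here is a minimal stdlib-only sketch of the failure mode; the `basicConfig` call stands in for whatever triton's driver initialization actually does to the root logger (that detail is an assumption):

```
import logging

# A module logger with no explicit level inherits the root logger's
# effective level, so anything that reconfigures the root logger changes
# what a log-capturing test observes.
log = logging.getLogger("torch.testing._internal.logging_tensor")

print(log.getEffectiveLevel())  # 30 (WARNING) under the default root config

# Stand-in for triton driver init reconfiguring logging (assumption):
logging.basicConfig(level=logging.INFO)
print(log.getEffectiveLevel())  # now 20 (INFO): the test's behavior changed

# One way to make a test deterministic: pin the logger's level and stop
# propagation instead of relying on the inherited root level.
log.setLevel(logging.WARNING)
log.propagate = False
```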
@xmfan marked this pull request as ready for review on May 14, 2024 15:31
@xmfan requested a review from jansel on May 14, 2024 15:32
@xmfan requested a review from r-barnes on May 15, 2024 19:40
xmfan added 2 commits May 15, 2024 17:42
xmfan added a commit that referenced this pull request May 16, 2024
ghstack-source-id: 9e999edf4e9a1e41c381fdf20063338a6eb2f313
Pull Request resolved: #126144
[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request May 16, 2024
FIXES #126128.

Right now, we only clear the cache on context manager enter, so the state stays stale unless fresh_inductor_cache is called again; this is usually fine in tests.

Case in point: the compiled autograd tests when going from TestCompiledAutograd to TestAutogradWithCompiledAutograd. TestCompiledAutograd uses the context manager, but TestAutogradWithCompiledAutograd doesn't.

Pull Request resolved: #126146
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #126144
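As a hypothetical reduction of that bug (not the real fresh_inductor_cache implementation): a context manager that clears shared state only on enter leaks whatever the wrapped block wrote to every later caller that doesn't re-enter it, while clearing on exit as well keeps non-wrapped code clean.

```
from contextlib import contextmanager

# Hypothetical stand-in for an inductor-style cache; not the real API.
_cache: dict = {}

@contextmanager
def fresh_cache():
    _cache.clear()      # clearing only on enter reproduces the bug:
    try:                # entries written inside the block survive exit
        yield           # and leak into code that never uses the manager
    finally:
        _cache.clear()  # clearing on exit too keeps later tests that
                        # skip the context manager (e.g. the next test
                        # class) from seeing stale entries
```

In the scenario above, TestCompiledAutograd runs inside the manager while TestAutogradWithCompiledAutograd does not, so only an exit-time (or per-entry) reset keeps the second suite isolated.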
bilal2vec pushed a commit to bilal2vec/pytorch that referenced this pull request May 16, 2024
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Pull Request resolved: pytorch#126144
Approved by: https://github.com/jansel
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
pytorchmergebot pushed a commit that referenced this pull request May 19, 2024
Internal infra may not preserve Python and C++ log ordering. For example, in MAST logs (https://fburl.com/mlhub/38576cxn), all of the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs for the entire run appear at the beginning of the file.

Pull Request resolved: #126483
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148
pytorchmergebot pushed a commit that referenced this pull request May 19, 2024
- log only the first node key cache miss
- log existing node key sizes
- log which node's collected sizes became dynamic

For example:
```
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6)
```

Pull Request resolved: #126602
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148, #126483
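To surface logs like the sample above, one option is PyTorch's artifact logging controls; a sketch, assuming the artifact name `compiled_autograd_verbose` matches the logger name in the sample output:

```
import torch._logging

# Enable the verbose compiled autograd artifact (assumed artifact name,
# inferred from the logger in the sample output:
# torch._dynamo.compiled_autograd.__compiled_autograd_verbose).
torch._logging.set_logs(compiled_autograd_verbose=True)

# Equivalent environment-variable form (assumption):
#   TORCH_LOGS="compiled_autograd_verbose" python your_script.py
```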
@github-actions bot deleted the gh/xmfan/49/head branch on June 16, 2024 01:59