@huydhn huydhn commented Mar 6, 2023

Fixes #91483

This uses a separate test class, so that the extra setup and teardown do not have to run for every test in TestJit. The root cause is that test_profiler can be flaky and fail midway, before it has a chance to restore torch._C._set_graph_executor_optimize to its original value (#81626). That breaks the tests that run after it, as shown in #91483.

I suspect this is also the root cause of several other flaky tests in the same file: https://github.com/search?q=repo%3Apytorch%2Fpytorch+DISABLED+test_jit.TestScript&type=issues. After this fix is merged, I will let the retry bot do its job and close these issues after 2 weeks.

Testing

Issue #91483 can now be reproduced locally by adding torch._C._set_graph_executor_optimize(False) at the start of test_stack and checking whether the test fails:

diff --git a/test/test_jit.py b/test/test_jit.py
index 2d1161d7466..17745d39182 100644
--- a/test/test_jit.py
+++ b/test/test_jit.py
@@ -5413,6 +5413,8 @@ a")
             FileCheck().check("int =").check("ListConstruct").check("aten::cat").run(str(g))

     def test_stack(self):
+        torch._C._set_graph_executor_optimize(False)
+
         with enable_profiling_mode_for_profiling_tests():
             @torch.jit.script
             def func(x):

It indeed fails:

======================================================================
FAIL [0.006s]: test_stack (test_jit.TestScript)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_jit.py", line 5437, in test_stack
    self.assertAutodiffNode(func2.graph_for(x, y), True, ['aten::stack'], [])
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_jit.py", line 282, in assertAutodiffNode
    self.assertEqual(should_autodiff_node,
##[endgroup]
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2975, in assertEqual
    raise error_metas[0].to_error(
AssertionError: Booleans mismatch: True is not False

Failure in testing nodes' autodifferentiation. One or more nodes were expected to be autodiffed, but were not found in specified fusible/nonfusible DifferentiableGraph groups.
Specifically:
  ['aten::stack'] were not in one of the DifferentiableGraphs when they were expected to be. Did you intend for these nodes to be autodiffed? If not, remove them from the list of nonfusible nodes.

----------------------------------------------------------------------
Ran 2677 tests in 84.596s

FAILED (failures=1, skipped=136, expected failures=13)

pytorch-bot bot commented Mar 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96135

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 73df80c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn huydhn requested review from a team and clee2000 March 7, 2023 01:05
@huydhn huydhn marked this pull request as ready for review March 7, 2023 01:07
huydhn commented Mar 7, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 7, 2023
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
Pull Request resolved: pytorch/pytorch#96135
Approved by: https://github.com/clee2000
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
…h#96135)

Development

Successfully merging this pull request may close these issues.

DISABLED test_stack (test_jit.TestScript)
