Skip to content

Conversation

ysiraichi
Copy link
Collaborator

@ysiraichi ysiraichi commented Mar 30, 2023

@pytorch-bot
Copy link

pytorch-bot bot commented Mar 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97964

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d98aa5e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: fx release notes category label Mar 30, 2023
ysiraichi added a commit that referenced this pull request Mar 30, 2023
@ysiraichi ysiraichi changed the title Value range refinement using multi-variate expressions. [WIP] Value range refinement using multi-variate expressions. Mar 30, 2023
@ysiraichi ysiraichi requested a review from ezyang March 30, 2023 07:17
@ysiraichi
Copy link
Collaborator Author

As is, the only difference (benchmark-wise) is that pytorch_unet goes from 9/73/73 (univariate only) to 9/70/70. Still have to run the whole benchmark set, though.

ysiraichi added a commit that referenced this pull request Mar 31, 2023
Copy link
Collaborator

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, when refining the bounds, it could be the case that rhs_vr.lower/upper are variables. In that case, the refinement could then be lower = max(lower, rhs_vr.lower), right?

Copy link
Collaborator

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At any rate, this needs some testing.

@github-actions
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

ysiraichi added a commit that referenced this pull request May 30, 2023
…lue range refinement using multi-variate expressions."

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jun 2, 2023
ysiraichi added a commit that referenced this pull request Jun 7, 2023
…nt using multi-variate expressions."

[ghstack-poisoned]
pytorchmergebot added a commit that referenced this pull request Jun 30, 2023
…)"

This reverts commit 2642412.

Reverted #97964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it is breaking an internal test ([comment](#97964 (comment)))
@ysiraichi
Copy link
Collaborator Author

@ezyang @huydhn
Best guess I have is: multi-variate range refinement is not sound. Could you try running it while setting torch._dynamo.config.translation_validation = True? Or, is there an example I can try to reproduce the error?

ysiraichi added a commit that referenced this pull request Jul 2, 2023
…dation off on timeouts."

Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 2, 2023
…meouts."

Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 2, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

ghstack-source-id: 0e731c4
Pull Request resolved: #104464
@lezcano
Copy link
Collaborator

lezcano commented Jul 2, 2023

This implementations looks good to me, so I don't think it's "not sound" by itself.

My gut tells me that the issue is lurking in the value range analysis. I already found a number of bugs in it, so I wouldn't be entirely surprised that there were more issues lurking there.

@facebook-github-bot facebook-github-bot deleted the gh/ysiraichi/54/head branch July 3, 2023 14:17
ysiraichi added a commit that referenced this pull request Jul 3, 2023
… timeouts."

Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 3, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 3, 2023
…uts."

Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 3, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

ghstack-source-id: f2f817d
Pull Request resolved: #104464
ysiraichi added a commit that referenced this pull request Jul 3, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
@ysiraichi
Copy link
Collaborator Author

@lezcano I think the translation validator should be able to catch that.

ysiraichi added a commit that referenced this pull request Jul 5, 2023
…uts."

Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 5, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78

[ghstack-poisoned]
ysiraichi added a commit that referenced this pull request Jul 5, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

ghstack-source-id: e8698a0
Pull Request resolved: #104464
pytorchmergebot pushed a commit that referenced this pull request Jul 5, 2023
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

Pull Request resolved: #104464
Approved by: https://github.com/malfet
@ysiraichi
Copy link
Collaborator Author

@ezyang @huydhn
This is a friendly reminder. If you have some time, could you try and run the failing test with TV on?

@ezyang
Copy link
Contributor

ezyang commented Jul 12, 2023

going to attempt this, might be tricky because i'm not sure we have z3 on our internal

@ezyang
Copy link
Contributor

ezyang commented Jul 13, 2023

Patching this

diff --git a/fbcode/caffe2/torch/_dynamo/config.py b/fbcode/caffe2/torch/_dynamo/config.py
--- a/fbcode/caffe2/torch/_dynamo/config.py
+++ b/fbcode/caffe2/torch/_dynamo/config.py
@@ -237,7 +237,7 @@
 numpy_ndarray_as_tensor = False
 
 # Uses z3 for validating the guard optimizations transformations.
-translation_validation = False
+translation_validation = True
 
 
 def is_fbcode():

and rerunning the test, I still get the same error as before:

  File "/data/users/ezyang/fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/pye/model_inventory/inside_out_tracking_model/__inside_out_tracking_model_test__/
inside_out_tracking_model_test#link-tree/torch/_dynamo/eval_frame.py", line 1012, in graph_with_interpreter                                                      return torch.fx.Interpreter(graph).run(*args)                                                                                                            
  File "/data/users/ezyang/fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/pye/model_inventory/inside_out_tracking_model/__inside_out_tracking_model_test__/
inside_out_tracking_model_test#link-tree/torch/fx/interpreter.py", line 138, in run                                                                              self.env[node] = self.run_node(node)                                                                                                                     
  File "/data/users/ezyang/fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/pye/model_inventory/inside_out_tracking_model/__inside_out_tracking_model_test__/
inside_out_tracking_model_test#link-tree/torch/fx/interpreter.py", line 195, in run_node                                                                         return getattr(self, n.op)(n.target, args, kwargs)                                                                                                       
  File "/data/users/ezyang/fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/pye/model_inventory/inside_out_tracking_model/__inside_out_tracking_model_test__/
inside_out_tracking_model_test#link-tree/torch/fx/interpreter.py", line 267, in call_function                                                                
    return target(*args, **kwargs)                                                                                                                           
IndexError: tuple index out of range                                                                                                                         
                                                                                                                                                             
While executing %getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%chunk, 2), kwargs = {})   

which suggests that we don't get to TV (which makes sense, TV only runs all the way at the end of execution, but we're erroring midway through.

I have a live session failing, lmk what kind of debugging instrumentation you want.

@ysiraichi
Copy link
Collaborator Author

@ezyang

This is really odd. AFAIU, this PR modifies var_to_range which affects what guards are issued. So, if there was a problem with it, I would expect it to be executing a wrong compiled function (i.e. we should not have eliminated some guards).

Just to be completely sure, could you try commenting these lines at the end of refine_ranges and check that it passes the test?

# Updates the range and the guards corresponding to each bound of the symbol.
self.var_to_range[symbol] = ValueRanges(lower, upper)
self.var_to_guards[symbol] = (lower_guard, upper_guard)

@ezyang
Copy link
Contributor

ezyang commented Jul 13, 2023

Applying this patch

diff --git a/fbcode/caffe2/torch/fx/experimental/symbolic_shapes.py b/fbcode/caffe2/torch/fx/experimental/symbolic_shapes.py
--- a/fbcode/caffe2/torch/fx/experimental/symbolic_shapes.py
+++ b/fbcode/caffe2/torch/fx/experimental/symbolic_shapes.py
@@ -3467,8 +3467,8 @@
                 continue
 
             # Updates the range and the guards corresponding to each bound of the symbol.
-            self.var_to_range[symbol] = ValueRanges(lower, upper)
-            self.var_to_guards[symbol] = (lower_guard, upper_guard)
+            #self.var_to_range[symbol] = ValueRanges(lower, upper)
+            #self.var_to_guards[symbol] = (lower_guard, upper_guard)
             # Clears the cache, since this update can change the result.
             self._maybe_evaluate_static.cache_clear()
 

and confirmed the test passes afterwards.

@lezcano lezcano restored the gh/ysiraichi/54/head branch July 18, 2023 14:23
ysiraichi added a commit that referenced this pull request Jul 18, 2023
ysiraichi added a commit that referenced this pull request Jul 18, 2023
Trying to re-land: #97964.

ghstack-source-id: 2a1b9dd
Pull Request resolved: #105491
@facebook-github-bot facebook-github-bot deleted the gh/ysiraichi/54/head branch July 19, 2023 14:16
pytorchmergebot pushed a commit that referenced this pull request Jul 20, 2023
Trying to re-land: #97964.

Test strategy:

```
buck2 test '@fbcode//mode/dev-nosan' fbcode//pye/model_inventory/inside_out_tracking_model:inside_out_tracking_model_test -- --exact 'pye/model_inventory/inside_out_tracking_model:inside_out_tracking_model_test - test_executorch_e2e_output_consistency_aten (pye.model_inventory.inside_out_tracking_model.InsideOutTrackingModelTest.InsideOutTrackingModelTest)'
```
Pull Request resolved: #105491
Approved by: https://github.com/ezyang
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: fx release notes category Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants