Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement CSE for dynamo guards. #98488

Closed
wants to merge 10 commits into from
Closed

Conversation

ysiraichi
Copy link
Collaborator

@ysiraichi ysiraichi commented Apr 6, 2023

@pytorch-bot
Copy link

pytorch-bot bot commented Apr 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98488

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 55b9e6c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ysiraichi
Copy link
Collaborator Author

In summary, it implements the CSE in 3 steps:

  1. Parse each guard expression into an AST
  2. Traverse each parsed AST, counting the number of occurrences at each level
    • Holds a map from the stringified node (using ast.unparse function) to a counter
  3. Traverse each parsed AST, replacing the nodes that occur more than a threshold (currently, 2) by a new variable

Returns:
- the assignments of each new variable to its corresponding expression
- the new guard expressions

@ysiraichi ysiraichi requested review from ezyang, voznesenskym and jansel and removed request for voznesenskym April 7, 2023 08:19
@ezyang ezyang added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 7, 2023
torch/_dynamo/guards.py Outdated Show resolved Hide resolved
# 'v' will be defined in the 'preface' list (output argument to
# 'NodeTransformer')
class PyExprCSEPass:
IGNORED_NODE_TYPES = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly did you decide to put nodes in the ignored types or not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was done in an ad-hoc way. Basically, I went through the list of node types and decided whether they were interesting for CSE purposes. Now that I think about it, maybe I should have created the a list of ALLOWED_NODE_TYPES.

@ezyang
Copy link
Contributor

ezyang commented Apr 7, 2023

Perf run kicked off on https://github.com/pytorch/pytorch/actions/runs/4638385442

@ezyang
Copy link
Contributor

ezyang commented Apr 7, 2023

This is a very nice improvement, worth 2-4%

image

torch/_dynamo/guards.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to preserve expression ordering

@ysiraichi
Copy link
Collaborator Author

So, I think I can make this PR better (I think I can finish the implementation tomorrow) by:

  1. We can make it work without ast.unparse (eliminating the quadratic time) by using the column information from parsing
  2. Use ALLOW_NODE_TYPES (the inverse of what we have now) so as to filter the nodes interesting to the CSE, namely: Subscript, Attribute, and Call (for getattr calls)
  3. Divide the guard function into non-ShapeEnv guards and ShapeEnv guards (which will free us from the try-catch)

@ezyang
What do you think?

@voznesenskym
Copy link
Contributor

  1. Divide the guard function into non-ShapeEnv guards and ShapeEnv guards (which will free us from the try-catch)

We have prior art for this already, in terms of treating shape_env code separately from other added guard code.

@ezyang
Copy link
Contributor

ezyang commented Apr 8, 2023

We can make it work without ast.unparse (eliminating the quadratic time) by using the column information from parsing

By using the original string? That's seems brittle. According to the perf run your change doesn't materially change compile time, so I think the quadratic term doesn't matter, leave it as is (with a comment saying that it's quadratic.)

Use ALLOW_NODE_TYPES (the inverse of what we have now) so as to filter the nodes interesting to the CSE, namely: Subscript, Attribute, and Call (for getattr calls)

Sure, though IMO it's not a big deal.

Divide the guard function into non-ShapeEnv guards and ShapeEnv guards (which will free us from the try-catch)

As Voz says, this is the minimum you will need to do, but I don't think it is enough. Imagine something like:

isinstance(x, list)
len(x) == 2
isinstance(x[0], list)
len(x[0]) == 2
isinstance(x[0][0], int)
isinstance(x[0][1], int)
isinstance(x[1], int)

Naive CSE will move x[0] access before we have confirmed that x is a list.

@ezyang
Copy link
Contributor

ezyang commented Apr 9, 2023

I think a useful thing to know is what the performance difference between:

return a and b and c

vs

if not a:
  return False
if not b:
  return False
if not c:
  return False
return True

is. If there is not much difference, then the easiest thing to do is to stop compiling guards as a big list of and statements and then you can CSE at the first point a subexpression is used, without requiring to reorder things.

@ysiraichi
Copy link
Collaborator Author

According to the perf run your change doesn't materially change compile time, so I think the quadratic term doesn't matter

Not sure what you mean by that. By compile time, do you mean after we introduce JIT to the guards?

@ysiraichi
Copy link
Collaborator Author

ysiraichi commented Apr 9, 2023

Find below the benchmark (torchbench) results for the 2 versions @ezyang mentioned. In summary:

  • They don't seem to be significantly different
  • There was 1 benchmark that got >1.1x speedup: resnext50_32x4d
  • There were 2 benchmarks that got <0.9x speedup: dlrm (0.88x) and hf_T5_large (0.81x)
  • There were 3 benchmarks that got <0.98x speedup: hf_T5, hf_T5_base, and mobilenet_v2
  • baseline refers to the commit 1d7133c542f (February 15)
  • if-chain refers to the second version (literally, a chain of ifs)
Raw Results
Name Latency (baseline) Latency (if-chain) Speedup
BERT_pytorch 38.2049 38.1404 1.0017
Background_Matting 148.4077 148.1106 1.0020
LearningToPaint 42.1799 42.1211 1.0014
Super_SloMo 319.2586 319.4182 0.9995
alexnet 34.7081 34.6411 1.0019
dcgan 25.2886 25.2632 1.0010
densenet121 190.7768 192.6175 0.9904
dlrm 2.0278 2.2835 0.8880
drq 0.7249 0.7233 1.0022
fastNLP_Bert 17.2611 17.1888 1.0042
functorch_dp_cifar10 20.6523 20.6971 0.9978
functorch_maml_omniglot 0.6388 0.6464 0.9882
hf_Albert 15.0872 15.0448 1.0028
hf_Bart 18.9939 18.8683 1.0067
hf_Bert 17.6168 17.9100 0.9836
hf_DistilBert 8.0332 8.0316 1.0002
hf_GPT2 37.0473 37.0166 1.0008
hf_GPT2_large 199.5633 199.7220 0.9992
hf_Reformer 15.5395 15.6129 0.9953
hf_T5 90.9850 98.2737 0.9258
hf_T5_base 305.4032 319.5407 0.9558
hf_T5_large 127.2385 156.5839 0.8126
lennard_jones 0.2202 0.2198 1.0018
maml 114.7211 116.2960 0.9865
maml_omniglot 0.8444 0.8205 1.0291
mnasnet1_0 19.5482 19.5491 1.0000
mobilenet_v2 12.3285 12.9194 0.9543
mobilenet_v3_large 18.2497 18.3144 0.9965
nvidia_deeprecommender 11.4864 11.5554 0.9940
opacus_cifar10 20.7520 20.9510 0.9905
pyhpc_equation_of_state 6.9967 6.9972 0.9999
pyhpc_isoneutral_mixing 27.9351 27.8914 1.0016
pyhpc_turbulent_kinetic_energy 22.3412 21.3129 1.0482
pytorch_CycleGAN_and_pix2pix 14.6733 14.6800 0.9995
pytorch_stargan 51.5640 51.6061 0.9992
pytorch_unet 113.0233 113.1834 0.9986
resnet152 172.9542 172.9537 1.0000
resnet18 7.1986 7.1692 1.0041
resnet50 66.2987 66.3271 0.9996
resnext50_32x4d 62.5773 55.4107 1.1293
shufflenet_v2_x1_0 33.0737 32.9855 1.0027
soft_actor_critic 0.2517 0.2538 0.9917
speech_transformer 21.2798 21.2074 1.0034
squeezenet1_1 6.3501 6.3379 1.0019
tacotron2 1095.2057 1095.4023 0.9998
timm_efficientnet 55.2628 55.8831 0.9889
timm_regnet 212.7725 213.2575 0.9977
timm_resnest 43.5706 43.5622 1.0002
timm_vision_transformer 10.7047 10.7046 1.0000
timm_vision_transformer_large 378.2258 377.8992 1.0009
timm_vovnet 81.9470 82.6849 0.9911
tts_angular 13.3451 13.3499 0.9996
vgg16 15.5082 15.5191 0.9993
yolov3 83.8152 84.1460 0.9961

After looking at the results, I guess it's best to go with the if-chain version. Let me know your thoughts.

(FYI: I will be off until May 1)

@ezyang
Copy link
Contributor

ezyang commented Apr 9, 2023

Alright let's if chain

@voznesenskym
Copy link
Contributor

(FYI: I will be off until May 1)

If need be, I can help finish this PR and land it under your attribution.

@voznesenskym
Copy link
Contributor

I think a useful thing to know is what the performance difference between:

return a and b and c

vs

if not a:
  return False
if not b:
  return False
if not c:
  return False
return True

is. If there is not much difference, then the easiest thing to do is to stop compiling guards as a big list of and statements and then you can CSE at the first point a subexpression is used, without requiring to reorder things.

#98374

This PR does that.

I did some simple local testing, and the perf seems to be identical.

@lezcano
Copy link
Collaborator

lezcano commented Apr 10, 2023

If need be, I can help finish this PR and land it under your attribution

Yukio, as discussed, will be out until the next of the month. If you want this in now now, I think it's fine for you to finish this one up and land it as a PR by yukio and you.

@ysiraichi
Copy link
Collaborator Author

Results are out!

speedup

@ezyang @anijain2305
Is it ok if I merge this once this PR turns green?

@lezcano
Copy link
Collaborator

lezcano commented May 17, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR needs a label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@ysiraichi
Copy link
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@malfet
Copy link
Contributor

malfet commented May 18, 2023

This regressed MacOS x86 tests, and in general, string comparison feels like a very fragile thing to do.
Looks like the code you've checked in is indeed correct if astunparse is installed, but if it isnt one needs different patterns, as in that case test_guard_function_builder_with_cse would generate the following:

def ___make_guard_fn():
    def guard(L):
        if not (x[0].a < x[1].a * (3 - x[2].a)):
            return False
        if not (a.b.c[0].d.e + a.b.c[1].d.e * a.b.c[2].d.e > 0):
            return False
        if not (f(m.n[0], '0').x.y.z * f(m.n[0], '1').x.y.z * f(m.n[0], '2').x.y.z < 512):
            return False
        if not (self.g(a, b).k + (1 - self.g(a, b).k) <= m[0].a + self.g(a, b).k):
            return False
        return True
    return guard

which is indential to the one one is trying to match, but without any intermediate vars

@ysiraichi
Copy link
Collaborator Author

Ah, right. Good catch. Thanks, @malfet .

@malfet
Copy link
Contributor

malfet commented May 18, 2023

@ysiraichi Hmm, looks like the way code is written right now, it would not really work without astunparse, should we just add it as mandatory dependency on python-3.8?

@ysiraichi
Copy link
Collaborator Author

How so? To me, it looks like it would work without astunparse.

@malfet
Copy link
Contributor

malfet commented May 18, 2023

@ysiraichi do you see where this call is guarded by HAS_UNPARSE_FUNCTIONS == True

self._config.expr_count[_ast_unparse(node)] += 1

See following run:

/python3.8 test_misc.py -v -k test_guards_cse_pass_multiple
Fail to import hypothesis in common_utils, tests are not derandomized
test_guards_cse_pass_multiple (__main__.MiscTests) ... ERROR

======================================================================
ERROR: test_guards_cse_pass_multiple (__main__.MiscTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_misc.py", line 5307, in test_guards_cse_pass_multiple
    csepass.count([t.expr for t in testcases])
  File "/Users/nshulga/Library/Python/3.8/lib/python/site-packages/torch/_dynamo/guards.py", line 789, in count
    counter.visit(ast.parse(e))
  File "/Users/nshulga/Library/Python/3.8/lib/python/site-packages/torch/_dynamo/guards.py", line 736, in visit
    super().visit(node)
  File "/opt/homebrew/Cellar/python38/3.8.3_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 363, in visit
    return visitor(node)
  File "/opt/homebrew/Cellar/python38/3.8.3_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 371, in generic_visit
    self.visit(item)
  File "/Users/nshulga/Library/Python/3.8/lib/python/site-packages/torch/_dynamo/guards.py", line 736, in visit
    super().visit(node)
  File "/opt/homebrew/Cellar/python38/3.8.3_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 363, in visit
    return visitor(node)
  File "/opt/homebrew/Cellar/python38/3.8.3_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 373, in generic_visit
    self.visit(value)
  File "/Users/nshulga/Library/Python/3.8/lib/python/site-packages/torch/_dynamo/guards.py", line 736, in visit
    super().visit(node)
  File "/opt/homebrew/Cellar/python38/3.8.3_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 363, in visit
    return visitor(node)
  File "/opt/homebrew/Cellar/python38/3.8.3_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 373, in generic_visit
    self.visit(value)
  File "/Users/nshulga/Library/Python/3.8/lib/python/site-packages/torch/_dynamo/guards.py", line 735, in visit
    self._config.expr_count[_ast_unparse(node)] += 1
NameError: name '_ast_unparse' is not defined

----------------------------------------------------------------------
Ran 1 test in 0.007s

FAILED (errors=1)

@ysiraichi
Copy link
Collaborator Author

The object instantiation is guarded.

if HAS_UNPARSE_FUNCTIONS:

@malfet
Copy link
Contributor

malfet commented May 18, 2023

Ok, so I'll just guard this test than...

pytorchmergebot pushed a commit that referenced this pull request May 18, 2023
If `astunparse` is not installed, following guard will be generated in `test_guard_function_builder_with_cse`:
```python
def ___make_guard_fn():
    def guard(L):
        if not (x[0].a < x[1].a * (3 - x[2].a)):
            return False
        if not (a.b.c[0].d.e + a.b.c[1].d.e * a.b.c[2].d.e > 0):
            return False
        if not (f(m.n[0], '0').x.y.z * f(m.n[0], '1').x.y.z * f(m.n[0], '2').x.y.z < 512):
            return False
        if not (self.g(a, b).k + (1 - self.g(a, b).k) <= m[0].a + self.g(a, b).k):
            return False
        return True
    return guard
```

Though, I have to say, hardcoding string comparison is pretty weird.

Also, skip `test_guards_cse_pass_[single|multiple]` if AST unparsing is missing.

Fixes failure in a test introduced by #98488

copilot:poem
Pull Request resolved: #101805
Approved by: https://github.com/atalman, https://github.com/ysiraichi
jcaip pushed a commit that referenced this pull request May 23, 2023
This PR extracted the CSE part of the code in #89707.

Pull Request resolved: #98488
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
desertfire added a commit to desertfire/pytorch that referenced this pull request May 23, 2023
Summary:
pytorch#98488 implements CSE for dynamo guards, and it relies on astunparse to perform the optimization.
`test_guards_cse_pass_single` was broken and later was fixed by introducing a check_and_skip_if_needed. This actually fixes the root cause on fbcode and should bring some perf gain internally.

Test Plan: `buck2 test @//mode/opt //caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::DynamicShapesMiscTests::test_guards_cse_pass_single' --run-disabled`

Reviewed By: malfet

Differential Revision: D46126742

fbshipit-source-id: 60ca99dd075e03ba458ebc4e3250ab0f9ebfb9d7
desertfire added a commit to desertfire/pytorch that referenced this pull request May 23, 2023
Summary:
Pull Request resolved: pytorch#102120

pytorch#98488 implements CSE for dynamo guards, and it relies on astunparse to perform the optimization.
`test_guards_cse_pass_single` was broken and later was fixed by introducing a check_and_skip_if_needed. This actually fixes the root cause on fbcode and should bring some perf gain internally.

Test Plan: `buck2 test @//mode/opt //caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::DynamicShapesMiscTests::test_guards_cse_pass_single' --run-disabled`

Reviewed By: malfet

Differential Revision: D46126742

fbshipit-source-id: 259f7c2c16bca111bc81ab6be04b0ce19ba47af9
pytorchmergebot pushed a commit that referenced this pull request May 24, 2023
Summary:
#98488 implements CSE for dynamo guards, and it relies on astunparse to perform the optimization.
`test_guards_cse_pass_single` was broken and later was fixed by introducing a check_and_skip_if_needed. This actually fixes the root cause on fbcode and should bring some perf gain internally.

Test Plan: `buck2 test @//mode/opt //caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::DynamicShapesMiscTests::test_guards_cse_pass_single' --run-disabled`

Reviewed By: malfet

Differential Revision: D46126742

Pull Request resolved: #102120
Approved by: https://github.com/malfet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: dynamo open source release notes: dynamo triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet