
Re-land precompile triton templates #124030

Closed
wants to merge 6 commits

Conversation

eellison
Contributor

@eellison eellison commented Apr 15, 2024

This reverts commit e0c9764.

[ghstack-poisoned]

pytorch-bot bot commented Apr 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124030

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f1e2cf9 with merge base bbb6e36 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This reverts commit e0c9764.

[ghstack-poisoned]
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

[ghstack-poisoned]
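As a rough illustration of the check described above (not the actual torch._inductor code; `autotune_cache`, `cache_key`, and `precompile` are placeholder names), the guard amounts to skipping precompilation whenever the autotuning cache already holds a result for a choice:

```python
# Hypothetical sketch only: skip precompiling a Triton template choice on a cache hit.
def maybe_precompile(choices, autotune_cache):
    """Precompile template choices only when autotuning has no cached result."""
    for choice in choices:
        key = choice.cache_key()       # placeholder: key identifying this template config
        if key in autotune_cache:      # cache hit: timing already known, skip compilation
            continue
        choice.precompile()            # cache miss: compile the Triton template now
```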
@eellison added the topic: not user facing and ciflow/trunk labels Apr 16, 2024
@eellison
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Apr 17, 2024
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this point in HF.
- Share a single precompilation function among matmuls with the same key (see the sketch below).

Pull Request resolved: #122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
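A minimal sketch of the second bullet, assuming a thread pool and a dict of futures (all names here are illustrative, not the real inductor implementation): matmuls that hash to the same key submit the compilation once and share the result.

```python
# Illustrative only: share one precompilation task among matmuls with the same key.
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
_precompile_futures: dict[str, Future] = {}

def shared_precompile(key: str, compile_fn) -> Future:
    """Return a single shared future per key so identical templates compile only once."""
    if key not in _precompile_futures:
        _precompile_futures[key] = _executor.submit(compile_fn)
    return _precompile_futures[key]
```

Callers would then wait on the returned future right before benchmarking, so duplicate matmuls never pay for a second compilation.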
@DanilBaibak
Contributor

@pytorchbot revert -m "Broken trunk" -c ignoredsignal

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: pytorch#124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this point in HF.
- Share a single precompilation function among matmuls with the same key.

Pull Request resolved: pytorch#122642
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: #124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: pytorch#124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this point in HF.
- Share a single precompilation function among matmuls with the same key.

Pull Request resolved: pytorch#122642
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Two small fixes:

- Preserve RNG state around compile_fx_inner (see the sketch below).
- Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function, because multiple sets of inputs would be allocated at the same time. Instead, allocate them when needed.

Pull Request resolved: pytorch#123229
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642
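The first fix can be pictured as forking the RNG state around the compile call; this is only a sketch under the assumption that `compile_fx_inner` takes a graph module and example inputs, not a reproduction of its real signature.

```python
# Sketch of "preserve rng around compile_fx_inner": fork RNG state so that benchmarking
# template choices during compilation does not perturb the user program's randomness.
import torch

def compile_with_preserved_rng(compile_fx_inner, gm, example_inputs):
    with torch.random.fork_rng(devices=range(torch.cuda.device_count())):
        return compile_fx_inner(gm, example_inputs)  # CPU/CUDA RNG state restored on exit
```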
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Two changes:
- Make the flag for multi-template buffers independent from benchmark fusion. While benchmark fusion can be useful, its compilation time/performance trade-offs are different from those of templates alone, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates which have custom input gen fns (which currently only exist internally). Threading the custom input gen fns through benchmark fusion is NYI (see the sketch below).

Pull Request resolved: pytorch#122825
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642, pytorch#123229
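As a hypothetical illustration of both bullets (the flag and attribute names below are placeholders, not the actual inductor config), the decision to use a multi-template buffer could look like:

```python
# Placeholder names only: decide whether a template gets a multi-template buffer.
def should_use_multi_template(template, config) -> bool:
    if not config.multi_template_buffers:        # flag is independent of benchmark fusion now
        return False
    if template.custom_input_gen_fn is not None:
        # Custom input generators are not yet threaded through benchmark fusion (NYI).
        return False
    return True
```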
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
The biggest movement is 4% on HF inference and 9% on TIMM inference. Note: this is max-autotune mode, so we are more tolerant of compilation time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: #124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: pytorch#124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this point in HF.
- Share a single precompilation function among matmuls with the same key.

Pull Request resolved: pytorch#122642
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
Two small fixes:

- Preserve RNG state around compile_fx_inner.
- Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function, because multiple sets of inputs would be allocated at the same time. Instead, allocate them when needed.

Pull Request resolved: pytorch#123229
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
Two changes:
- Make the flag for multi-template buffers independent from benchmark fusion. While benchmark fusion can be useful, its compilation time/performance trade-offs are different from those of templates alone, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates which have custom input gen fns (which currently only exist internally). Threading the custom input gen fns through benchmark fusion is NYI.

Pull Request resolved: pytorch#122825
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642, pytorch#123229
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
The biggest movement is 4% on HF inference and 9% on TIMM inference. Note: this is max-autotune mode, so we are more tolerant of compilation time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: pytorch#124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642, pytorch#123229, pytorch#122825
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: #124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this point in HF.
- Share a single precompilation function among matmuls with the same key.

Pull Request resolved: pytorch#122642
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
This reverts commit d68196e.

Reverted #124030 on behalf of https://github.com/DanilBaibak due to Broken trunk (comment)
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: pytorch#124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: pytorch#124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this point in HF.
- Share a single precompilation function among matmuls with the same key.

Pull Request resolved: #122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Two small fixes:

- Preserve RNG state around compile_fx_inner.
- Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function, because multiple sets of inputs would be allocated at the same time. Instead, allocate them when needed.

Pull Request resolved: pytorch#123229
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Two changes:
- Make the flag for multi-template buffers independent from benchmark fusion. While benchmark fusion can be useful, its compilation time/performance trade-offs are different from those of templates alone, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates which have custom input gen fns (which currently only exist internally). Threading the custom input gen fns through benchmark fusion is NYI.

Pull Request resolved: pytorch#122825
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642, pytorch#123229
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
The biggest movement is 4% on HF inference and 9% on TIMM inference. Note: this is max-autotune mode, so we are more tolerant of compilation time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: pytorch#124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642, pytorch#123229, pytorch#122825
@github-actions github-actions bot deleted the gh/eellison/629/head branch June 1, 2024 02:04