Conversation

@lezcano (Collaborator) commented Dec 22, 2022

Stack from ghstack (oldest at bottom):

This helps with kernels that make use of caching, like mid-range softmax, which reads the data three times.

Selecting `eviction_policy=evict_first` in the last loop of the softmax operation seems to give a 7-10% speed-up vs. selecting `evict_last`, which was the previous option. I'll put up some benchmarks soon™.
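
For context, here is a hand-written sketch (not the actual Inductor codegen; the kernel name and block size are illustrative) of the three-pass max/sum/normalize softmax this targets, with the eviction policies discussed above:

```python
import triton
import triton.language as tl

@triton.jit
def softmax_row(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    # One program per row; the row is read three times (max, sum, normalize).
    base = tl.program_id(0) * rnumel
    # Loop 1: running max. evict_last keeps the row resident for the next pass.
    row_max = float("-inf")
    for roff in range(0, rnumel, RBLOCK):
        r = roff + tl.arange(0, RBLOCK)
        x = tl.load(in_ptr + base + r, mask=r < rnumel, other=float("-inf"),
                    eviction_policy="evict_last")
        row_max = tl.maximum(row_max, tl.max(x, axis=0))
    # Loop 2: sum of exponentials. The row is read once more after this,
    # so it is still worth caching.
    denom = 0.0
    for roff in range(0, rnumel, RBLOCK):
        r = roff + tl.arange(0, RBLOCK)
        x = tl.load(in_ptr + base + r, mask=r < rnumel, other=float("-inf"),
                    eviction_policy="evict_last")
        denom += tl.sum(tl.exp(x - row_max), axis=0)
    # Loop 3: the last read of each cache line, so evict_first (this PR)
    # instead of evict_last frees the cache for other rows sooner.
    for roff in range(0, rnumel, RBLOCK):
        r = roff + tl.arange(0, RBLOCK)
        x = tl.load(in_ptr + base + r, mask=r < rnumel, other=float("-inf"),
                    eviction_policy="evict_first")
        tl.store(out_ptr + base + r, tl.exp(x - row_max) / denom, mask=r < rnumel)
```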

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @mlazos @soumith @yanboliang @anijain2305 @chunyuan-w @desertfire

@pytorch-bot bot commented Dec 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91316

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2d263b3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lezcano added a commit that referenced this pull request Dec 22, 2022

ghstack-source-id: 23b441d
Pull Request resolved: #91316
@lezcano requested review from @jansel and @ngimel, Dec 22, 2022 17:18
@lezcano added the `topic: not user facing` label Dec 22, 2022
lezcano added a commit that referenced this pull request Dec 22, 2022

ghstack-source-id: 79e029b
Pull Request resolved: #91316
@lezcano (Collaborator, Author) commented Dec 22, 2022

@pytorchbot merge

@pytorch-bot bot added the `ciflow/trunk` label (trigger trunk jobs on your pull request) Dec 22, 2022
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: Not merging any PRs at the moment because there is a merge-blocking https://github.com/pytorch/pytorch/labels/ci:%20sev issue open at #91332.

Raised by workflow job (details for Dev Infra team).

@eellison (Contributor)

Run torchbench? I tried a few ops with this and got mixed results. What was the benchmarking script?

Try this, before/after:

```
python /scratch/eellison/work/pytorch/benchmarks/dynamo/microbenchmarks/operatorbench.py --op=aten._softmax.default --dtype=float32 --max-samples=25 --accuracy-checking=False --suite=huggingface
```

@lezcano (Collaborator, Author) commented Dec 22, 2022

Sure, let me try that.

@lezcano (Collaborator, Author) commented Dec 23, 2022

It seems to get a consistent 1% speed-up running the command in #91316 (comment). I have to say that the benchmarks are not as stable as I would like them to be, so I ran them 3 times each:

```
master:
[1.0279319318750362, 1.0543131412257343, 1.1092436595025628]
[1.029289752609728, 1.0638297295929133, 1.1131844980910401]
[1.0230071700163474, 1.0672269765613274, 1.1079364778186145]
PR:
[1.0436695871330925, 1.0672269765613274, 1.1131844980910401]
[1.0323480717808713, 1.0672269765613274, 1.1182555177307507]
[1.035240162792983, 1.0583334151878585, 1.1057159259906555]
```

Edit: well, I needed to run it on float16 because some op complained with a hard error, but that shouldn't change much.

@ngimel (Collaborator) commented Dec 23, 2022

What was the error you encountered?

@lezcano (Collaborator, Author) commented Dec 23, 2022

There's this check that was triggered when I ran that command.
Edit: it's this one:

```cpp
Tensor host_softmax(const Tensor& input_, const int64_t dim_, const bool half_to_float, const Tensor& output) {
  if (half_to_float) {
    TORCH_CHECK(input_.scalar_type() == ScalarType::Half, "conversion is supported for Half type only");
  }
  // ...
```
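
For reference, a minimal sketch of how that check fires (an assumption about the failure mode: operatorbench presumably ends up calling `aten._softmax` with `half_to_float=True` on the float32 input):

```python
import torch

x = torch.randn(8, 128, device="cuda")  # float32 input
# aten::_softmax(Tensor self, int dim, bool half_to_float):
# half_to_float=True is only valid for Half inputs, so this trips the
# TORCH_CHECK above with "conversion is supported for Half type only".
torch.ops.aten._softmax(x, -1, True)
```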

@lezcano (Collaborator, Author) commented Dec 23, 2022

If `eviction_policy='evict_first'` seems a bit dicey, I can always set the last read to have no policy. That should be quite uncontroversial.

lezcano added a commit that referenced this pull request Dec 23, 2022

ghstack-source-id: 79e029b
Pull Request resolved: #91316
@ngimel (Collaborator) commented Dec 23, 2022

You mean `evict_first`? Yeah, we can leave it with no policy; that shouldn't matter much.
Can you please file an issue for the failure you encountered?
Edit: nm, it looks like the benchmark parameters are set wrong.

@lezcano (Collaborator, Author) commented Dec 25, 2022

Yes, I meant `evict_first`, sorry. I did see some perf differences between one and the other in isolated experiments. In sizes for which the cache hits are important, `evict_first` was more performant. In others, I saw that setting `evict_first` was a bit of a (rather tiny) pessimisation.
Leaving it with no eviction policy in the last case. In the future we could try to predict whether caching things may be useful and use that info to populate these fields. Even more, for large reductions, we could figure out whether it's hopeless to try to keep the whole input in cache, and set the policy to `no_allocate` (if we can ask Triton to support this one) for `roffset >= 64k` (or something more intelligent, of course).
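
To make that concrete, a sketch of the heuristic (a hypothetical helper, not Inductor code; `no_allocate` is not an eviction policy Triton supports today, and the 64k threshold is just the rough cut-off mentioned above):

```python
from typing import Optional

def pick_eviction_policy(is_last_read: bool, roffset: int) -> Optional[str]:
    # All names and thresholds are illustrative, per the comment above.
    if roffset >= 64 * 1024:
        # Hopeless to keep the whole input in cache: stream it through.
        return "no_allocate"  # assumption: would need new Triton support
    if is_last_read:
        return "evict_first"  # no reuse coming; let these lines go first
    return "evict_last"  # keep the data resident for the next read
```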

lezcano added 2 commits May 15, 2023 20:42
lezcano added a commit that referenced this pull request May 15, 2023

ghstack-source-id: 9ca4f96
Pull Request resolved: #91316
lezcano added a commit that referenced this pull request Jun 2, 2023

ghstack-source-id: f725e45
Pull Request resolved: #91316
@lezcano (Collaborator, Author) commented Jun 3, 2023

@ngimel This one is at last ready for review.
I ran the benchmark suite, and we're getting around a 1% speed-up on torchbench (I don't know if it's just noise though, I hope not!). See the benchmarks here.

Now, I reckon that this should be more helpful when dealing with smaller models.

@lezcano (Collaborator, Author) commented Jun 5, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot deleted the gh/Lezcano/166/head branch June 8, 2023 14:43
@desertfire (Contributor) commented Jun 9, 2023

@zou3519 (Contributor) commented Jun 9, 2023

Should we attempt to revert this or do a forward fix?

@desertfire (Contributor)

> pyhpc_equation_of_state

Given that `pyhpc_equation_of_state` is not a very typical model test and training perf looks fine, I think a forward fix is ok.

@lezcano (Collaborator, Author) commented Jun 10, 2023

I am going to be on PTO for the next 3 weeks, so I'd say it's best to revert, and I'll look into this when I'm back. Sorry about that; the benchmarks in #91316 (comment) looked alright!

@lezcano (Collaborator, Author) commented Jun 27, 2023

Note to self: perhaps we could get some extra speed-up by using `cache.cs` for data that will just be used once.
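
A rough sketch of what that could look like (assumption: Triton's `tl.load` grows support for the PTX `.cs` cache-streaming modifier; only `.ca`/`.cg`-style cache modifiers existed at the time):

```python
import triton
import triton.language as tl

@triton.jit
def copy_once(in_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Each element is read exactly once, so a streaming hint would avoid
    # displacing data that other loops do want cached.
    x = tl.load(in_ptr + offs, mask=mask, cache_modifier=".cs")  # hypothetical ".cs"
    tl.store(out_ptr + offs, x, mask=mask)
```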

@lezcano (Collaborator, Author) commented Jul 21, 2023

FWIW, I believe that the performance drop of pyhpc_equation_of_state was fixed in #103514.
