Conversation

@lezcano (Collaborator) commented Dec 22, 2022

Stack from ghstack (oldest at bottom):

This helps with kernels that make use of caching, like mid-range softmax, which reads the data three times.

Selecting `eviction_policy=evict_first` in the last loop of the softmax operation seems to give a 7-10% speed-up vs. selecting `evict_last`, which was the previous option. I'll put up some benchmarks soon™.
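
For context, here is a hand-written sketch (not the actual Inductor codegen; the kernel name and block size are illustrative) of the three-pass max/sum/normalize softmax this targets, with the eviction policies discussed above:

```python
import triton
import triton.language as tl

@triton.jit
def softmax_row(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    # One program per row; the row is read three times (max, sum, normalize).
    base = tl.program_id(0) * rnumel
    # Loop 1: running max. evict_last keeps the row resident for the next pass.
    row_max = float("-inf")
    for roff in range(0, rnumel, RBLOCK):
        r = roff + tl.arange(0, RBLOCK)
        x = tl.load(in_ptr + base + r, mask=r < rnumel, other=float("-inf"),
                    eviction_policy="evict_last")
        row_max = tl.maximum(row_max, tl.max(x, axis=0))
    # Loop 2: sum of exponentials. The row is read once more after this,
    # so it is still worth caching.
    denom = 0.0
    for roff in range(0, rnumel, RBLOCK):
        r = roff + tl.arange(0, RBLOCK)
        x = tl.load(in_ptr + base + r, mask=r < rnumel, other=float("-inf"),
                    eviction_policy="evict_last")
        denom += tl.sum(tl.exp(x - row_max), axis=0)
    # Loop 3: the last read of each cache line, so evict_first (this PR)
    # instead of evict_last frees the cache for other rows sooner.
    for roff in range(0, rnumel, RBLOCK):
        r = roff + tl.arange(0, RBLOCK)
        x = tl.load(in_ptr + base + r, mask=r < rnumel, other=float("-inf"),
                    eviction_policy="evict_first")
        tl.store(out_ptr + base + r, tl.exp(x - row_max) / denom, mask=r < rnumel)
```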

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @mlazos @soumith @yanboliang @anijain2305 @chunyuan-w @desertfire

@pytorch-bot bot commented Dec 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91316

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2d263b3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lezcano added a commit that referenced this pull request Dec 22, 2022

ghstack-source-id: 23b441d
Pull Request resolved: #91316
@lezcano requested review from @jansel and @ngimel, Dec 22, 2022 17:18
@lezcano added the `topic: not user facing` label Dec 22, 2022
lezcano added a commit that referenced this pull request Dec 22, 2022

ghstack-source-id: 79e029b
Pull Request resolved: #91316
@lezcano (Collaborator, Author) commented Dec 22, 2022

@pytorchbot merge

@pytorch-bot bot added the `ciflow/trunk` label (trigger trunk jobs on your pull request) Dec 22, 2022
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: Not merging any PRs at the moment because there is a merge-blocking https://github.com/pytorch/pytorch/labels/ci:%20sev issue open at #91332.

Raised by workflow job (details for Dev Infra team).

@eellison (Contributor)

Run torchbench? I tried a few ops with this and got mixed results. What was the benchmarking script?

Try this, before/after:

```
python /scratch/eellison/work/pytorch/benchmarks/dynamo/microbenchmarks/operatorbench.py --op=aten._softmax.default --dtype=float32 --max-samples=25 --accuracy-checking=False --suite=huggingface
```

@lezcano (Collaborator, Author) commented Dec 22, 2022

Sure, let me try that.

@lezcano (Collaborator, Author) commented Dec 23, 2022

It seems to get a consistent 1% speed-up running the command in #91316 (comment). I have to say that the benchmarks are not as stable as I would like them to be, so I ran them 3 times each:

```
master:
[1.0279319318750362, 1.0543131412257343, 1.1092436595025628]
[1.029289752609728, 1.0638297295929133, 1.1131844980910401]
[1.0230071700163474, 1.0672269765613274, 1.1079364778186145]
PR:
[1.0436695871330925, 1.0672269765613274, 1.1131844980910401]
[1.0323480717808713, 1.0672269765613274, 1.1182555177307507]
[1.035240162792983, 1.0583334151878585, 1.1057159259906555]
```

Edit: well, I needed to run it on float16 because some op complained with a hard error, but that shouldn't change much.

@ngimel (Collaborator) commented Dec 23, 2022

What was the error you encountered?

@lezcano (Collaborator, Author) commented Dec 23, 2022

There's this check that was triggered when I ran that command.
Edit: it's this one:

```cpp
Tensor host_softmax(const Tensor& input_, const int64_t dim_, const bool half_to_float, const Tensor& output) {
  if (half_to_float) {
    TORCH_CHECK(input_.scalar_type() == ScalarType::Half, "conversion is supported for Half type only");
  }
  // ...
```
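
For reference, a minimal sketch of how that check fires (an assumption about the failure mode: operatorbench presumably ends up calling `aten._softmax` with `half_to_float=True` on the float32 input):

```python
import torch

x = torch.randn(8, 128, device="cuda")  # float32 input
# aten::_softmax(Tensor self, int dim, bool half_to_float):
# half_to_float=True is only valid for Half inputs, so this trips the
# TORCH_CHECK above with "conversion is supported for Half type only".
torch.ops.aten._softmax(x, -1, True)
```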

@lezcano (Collaborator, Author) commented Dec 23, 2022

If `eviction_policy='evict_first'` seems a bit dicey, I can always set the last read to have no policy. That should be quite uncontroversial.

lezcano added a commit that referenced this pull request Dec 23, 2022

ghstack-source-id: 79e029b
Pull Request resolved: #91316
@ngimel (Collaborator) commented Dec 23, 2022

You mean `evict_first`? Yeah, we can leave it with no policy; that shouldn't matter much.
Can you please file an issue for the failure you encountered?
Edit: nm, it looks like the benchmark parameters are set wrong.

@lezcano (Collaborator, Author) commented Dec 25, 2022

Yes, I meant `evict_first`, sorry. I did see some perf differences between one and the other in isolated experiments. In sizes for which the cache hits are important, `evict_first` was more performant. In others, I saw that setting `evict_first` was a bit of a (rather tiny) pessimisation.
Leaving it with no eviction policy in the last case. In the future we could try to predict whether caching things may be useful and use that info to populate these fields. Even more, for large reductions, we could figure out whether it's hopeless to try to keep the whole input in cache, and set the policy to `no_allocate` (if we can ask Triton to support this one) for `roffset >= 64k` (or something more intelligent, of course).
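
To make that concrete, a sketch of the heuristic (a hypothetical helper, not Inductor code; `no_allocate` is not an eviction policy Triton supports today, and the 64k threshold is just the rough cut-off mentioned above):

```python
from typing import Optional

def pick_eviction_policy(is_last_read: bool, roffset: int) -> Optional[str]:
    # All names and thresholds are illustrative, per the comment above.
    if roffset >= 64 * 1024:
        # Hopeless to keep the whole input in cache: stream it through.
        return "no_allocate"  # assumption: would need new Triton support
    if is_last_read:
        return "evict_first"  # no reuse coming; let these lines go first
    return "evict_last"  # keep the data resident for the next read
```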

lezcano added 2 commits May 15, 2023 20:42
lezcano added a commit that referenced this pull request May 15, 2023

ghstack-source-id: 9ca4f96
Pull Request resolved: #91316
lezcano added a commit that referenced this pull request Jun 2, 2023

ghstack-source-id: f725e45
Pull Request resolved: #91316
@lezcano (Collaborator, Author) commented Jun 3, 2023

@ngimel This one is at last ready for review.
I ran the benchmark suite, and we're getting around a 1% speed-up on torchbench (I don't know if it's just noise though, I hope not!). See the benchmarks here.

Now, I reckon that this should be more helpful when dealing with smaller models.

@lezcano (Collaborator, Author) commented Jun 5, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot deleted the gh/Lezcano/166/head branch June 8, 2023 14:43
@desertfire (Contributor) commented Jun 9, 2023

@zou3519 (Contributor) commented Jun 9, 2023

Should we attempt to revert this or do a forward fix?

@desertfire (Contributor)

> pyhpc_equation_of_state

Given that `pyhpc_equation_of_state` is not a very typical model test and training perf looks fine, I think a forward fix is ok.

@lezcano (Collaborator, Author) commented Jun 10, 2023

I am going to be on PTO for the next 3 weeks, so I'd say it's best to revert, and I'll look into this when I'm back. Sorry about that; the benchmarks in #91316 (comment) looked alright!

@lezcano (Collaborator, Author) commented Jun 27, 2023

Note to self: perhaps we could get some extra speed-up by using `cache.cs` for data that will just be used once.
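
A rough sketch of what that could look like (assumption: Triton's `tl.load` grows support for the PTX `.cs` cache-streaming modifier; only `.ca`/`.cg`-style cache modifiers existed at the time):

```python
import triton
import triton.language as tl

@triton.jit
def copy_once(in_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Each element is read exactly once, so a streaming hint would avoid
    # displacing data that other loops do want cached.
    x = tl.load(in_ptr + offs, mask=mask, cache_modifier=".cs")  # hypothetical ".cs"
    tl.store(out_ptr + offs, x, mask=mask)
```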

@lezcano (Collaborator, Author) commented Jul 21, 2023

FWIW, I believe that the performance drop of pyhpc_equation_of_state was fixed in #103514.
