
Logcumsumexp for CPU #93153

Closed · wants to merge 4 commits

Conversation

@mfkasim1 (Contributor) commented Jan 27, 2023

Partial work from #90847, in the direction of solving #89205.
Most of the content is from #90847, but this one is CPU-only, so hopefully it does not increase the build time by much.

tag: @albanD, @malfet

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
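
For context, `logcumsumexp` computes `out[i] = log(sum_{j <= i} exp(x[j]))` along a dimension. Below is a minimal, real-valued C++ sketch of the numerically stable accumulation; the helper names are made up for illustration and this is not the kernel code added by this PR (the complex case handled here additionally has to deal with NaN and infinite real parts, which is what the `_logcumsumexp_minmax` helper discussed in the review thread below is about):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Numerically stable logaddexp: log(exp(a) + exp(b)) without overflow.
double logaddexp(double a, double b) {
  if (std::isinf(a) && a == b) {
    return a;  // both +inf or both -inf; avoids inf - inf below
  }
  double m = std::max(a, b);
  return m + std::log1p(std::exp(std::min(a, b) - m));
}

// logcumsumexp over a 1-D array: out[i] = log(exp(x[0]) + ... + exp(x[i])).
std::vector<double> logcumsumexp_1d(const std::vector<double>& x) {
  std::vector<double> out(x.size());
  double acc = -std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < x.size(); ++i) {
    acc = logaddexp(acc, x[i]);
    out[i] = acc;
  }
  return out;
}
```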

@pytorch-bot (bot) commented Jan 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93153

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e787da3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jan 27, 2023
Review thread on aten/src/ATen/native/cpu/ReduceOpsKernel.cpp:
c10::complex<scalar_t> _logcumsumexp_minmax(c10::complex<scalar_t> x, c10::complex<scalar_t> y, bool min) {
  scalar_t xr = std::real(x);
  scalar_t yr = std::real(y);
  if (std::isnan(yr) || (std::isnan(std::imag(y)))) {
Contributor:
I'm not an expert in the complex primitives that are already in c10, but wouldn't it be nicer to implement something like

namespace std {
template <typename scalar_t>
bool isnan(const c10::complex<scalar_t>& v) {
  return std::isnan(v.real()) || std::isnan(v.imag());
}
} // namespace std

and then use the following throughout the codebase:

Suggested change:
-if (std::isnan(yr) || (std::isnan(std::imag(y)))) {
+if (std::isnan(y)) {

@mfkasim1 (Contributor, Author):

I'm not an expert in c10 either, and I'm not sure if this function is already available. It's better to discuss it with the c10 experts.

@malfet (Contributor) commented Jan 27, 2023:

I see no harm in implementing such a template in https://github.com/pytorch/pytorch/blob/master/c10/util/complex_utils.h. But at the very least, create a convenience template in this code and use it instead of the explicit isnan(x.real()) || isnan(x.imag()).

@mfkasim1 (Contributor, Author):

I implemented it in the newest commit. I was hesitant at first because there is no agreed-upon behaviour of std::isnan for complex numbers, but torch.isnan defines a complex number to be NaN if either its real or imaginary part is NaN.
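
For reference, a minimal sketch of what such a convenience helper and its use could look like. The name `isnan_complex` is hypothetical; the actual overload added in c10/util/complex_utils.h may be named and placed differently:

```cpp
#include <cmath>
#include <c10/util/complex.h>

// Sketch only: follow torch.isnan semantics -- a complex value is NaN
// if either its real or its imaginary part is NaN.
template <typename T>
bool isnan_complex(const c10::complex<T>& v) {
  return std::isnan(v.real()) || std::isnan(v.imag());
}

// The check in the reviewed snippet would then reduce to:
//   if (isnan_complex(y)) { ... }
```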

Contributor:

I was tempted to suggest using at::_isnan, but the dispatch complexity would probably be overkill. Thank you for addressing the suggested change. Please fix the lint, though.

Contributor:

Pushed the change fixing the lint. If it compiles, we should ship it.

Further review threads on aten/src/ATen/native/cpu/ReduceOpsKernel.cpp and c10/util/complex_utils.h (outdated, resolved).
@Skylion007 (Collaborator) left a comment:

I hope we can add this to CUDA too with the help of Jiterator or such.

@Skylion007 Skylion007 added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 27, 2023
@mfkasim1 (Contributor, Author):

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@mfkasim1 (Contributor, Author):

I'd like to start a discussion about jiterator for logcumsumexp here. I've spent some time looking at the code that uses jiterator and at related functions such as cumprod, cumsum, etc., and here are my observations:

  1. None of the accumulating functions (cumprod, cumsum, etc.) use jiterator. They use the standard dispatcher and the scan_dim function.
  2. It seems that binary and unary functions use jitted_gpu_kernel and reduce functions use jitted_gpu_reduce_kernel, but I can't find anything like jitted_gpu_accum_kernel or jitted_gpu_scan_kernel.

Does that mean that, in order to use jiterator for logcumsumexp, we would need to write a jitted_gpu_accum_kernel?

@ngimel (Collaborator) commented Jan 30, 2023:

Yes, that's correct. Also, given that scan_dim uses cub underneath, to be honest I don't know how feasible it is to write jitted_gpu_accum_kernel, given that nvrtc needs all the headers at runtime and we don't ship the cub headers.
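
For readers unfamiliar with scan kernels: a scan (which is roughly what scan_dim provides) applies an associative binary operation cumulatively along a dimension. A CPU-side illustration using std::inclusive_scan with the logaddexp operation as the binary op — this is a sketch of the concept only, not PyTorch's scan_dim or a jiterator kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> x = {0.5, -1.0, 2.0, 0.0};
  std::vector<double> out(x.size());
  // Inclusive scan with a stable logaddexp as the binary op:
  // out[i] = log(exp(x[0]) + ... + exp(x[i])).
  std::inclusive_scan(x.begin(), x.end(), out.begin(), [](double a, double b) {
    double m = std::max(a, b);
    return m + std::log1p(std::exp(std::min(a, b) - m));
  });
  for (double v : out) {
    std::cout << v << ' ';
  }
  std::cout << '\n';
  return 0;
}
```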

pytorchmergebot pushed a commit that referenced this pull request Feb 13, 2023
Hopefully fixes #89205.
This is another version of #90847, which was reverted because it increased the compile time significantly.
From my discussion with @ngimel in #93153 (comment), it seems the jiterator option would be very tricky, if not impossible.
So what I did instead was to optimize the compile time on my computer.

To optimize the build time, I first compiled PyTorch as a whole, then changed only the `LogcumsumexpKernel.cu` file to see how it affects the compile time.
Here are my results for the compilation time of only the `LogcumsumexpKernel.cu` file on my computer:

- Original version (without any complex implementations): 56 s (about 1 minute)
- The previous PR (#90847): 13 min 57 s (about 14 minutes)
- This PR: 3 min 35 s (about 3.5 minutes)

If the previous PR increased the build time by 30 minutes on PyTorch's build machines, then this PR reduces that increase to about 6 minutes. Hopefully this is an acceptable level of build-time increase.

What I did (sorted from the most to the least significant reduction in build time):

- Substituting `log(x)` with `log1p(x - 1)`. This is applied in the infinite case, so we don't really care about precision there.
- Implementing the complex exponential manually (see the sketch below).

tag: @malfet, @albanD
Pull Request resolved: #94310
Approved by: https://github.com/Skylion007, https://github.com/malfet
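
As a side note on the second optimization in the commit message above: a manual complex exponential relies on the identity exp(a + bi) = e^a (cos b + i sin b). A minimal illustrative sketch with a hypothetical helper name, not the actual kernel code from #94310:

```cpp
#include <cmath>
#include <c10/util/complex.h>

// Illustrative only: exp(z) for a c10::complex value via
// exp(a + bi) = e^a * (cos b + i sin b).
template <typename T>
c10::complex<T> exp_manual(const c10::complex<T>& z) {
  T e = std::exp(z.real());
  return c10::complex<T>(e * std::cos(z.imag()), e * std::sin(z.imag()));
}
```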
pytorchmergebot pushed a commit that referenced this pull request Mar 1, 2023
Continuation of PR #93153, where I implemented logaddexp for complex numbers but didn't expose it through `torch.logaddexp`. This PR exposes the complex logaddexp through `torch.logaddexp`.

Pull Request resolved: #95717
Approved by: https://github.com/lezcano
Labels: ciflow/trunk, Merged, module: cpu, open source · 6 participants