
Use pypocketfft as FFT backend? #10175

Closed
mreineck opened this issue May 13, 2019 · 73 comments · Fixed by #10238

@mreineck commented May 13, 2019

Numpy 1.17 has adopted the C version of pocketfft (https://gitlab.mpcdf.mpg.de/mtr/pocketfft) as its FFT implementation. Since then I have continued working on the pocketfft code, converting it to C++ and adding a Python wrapper.

The new library (https://gitlab.mpcdf.mpg.de/mtr/pypocketfft)

  • supports complex and half-complex transforms, also Hartley transforms
  • supports float, double and long double precision
  • supports multi-dimensional FFTs
  • uses CPU vector instructions for ndim>1 (i.e. it computes several 1D FFTs simultaneously).
  • uses the Bluestein algorithm for FFTs with large prime factors

Performance is significantly better than FFTPACK's, and so is accuracy. For benchmarks and discussion see numpy/numpy#11888 and recent comments in numpy/numpy#11885.
I have also added an interface which is very similar to scipy.fftpack.

Potential problems for adoption:

  • written in C++, requires a C++11 compiler and pybind11
  • cosine/sine transforms are missing, and I probably won't have the time to implement them. They would have to be added on top of the existing machinery.

Would this be of interest for Scipy?

@ilayn (Member) commented May 13, 2019

It certainly is, but in other news we are also anticipating a GSoC project for this. I don't know the latest status exactly. This effort might be combined with that.

@rgommers (Member) commented May 13, 2019

@mreineck have you followed the GSoC proposal? It just got accepted, @peterbell10 will be working on adding a backend system for FFTs to SciPy and improving the module. Peter's proposal is linked at https://mail.python.org/pipermail/scipy-dev/2019-May/023506.html, and that thread also has some discussion.

We definitely would like to get rid of FFTPACK as much as possible. I actually don't know how portable C++11 is right now - would need to look into that. If it's portable enough, we can use the new version of pypocketfft as our default implementation. Otherwise, we can use the C version that's also in numpy as our default, and enable use of the C++ version via the new backend system (which can at the same time also support pyfftw and mkl-fft).

@peterbell10 (Member) commented May 13, 2019

My plan is that once the backend system is in place, pocketfft would be one of the first alternate backends. From there, we could make pocketfft the default and potentially remove the Fortran code altogether.

I actually don't know how portable C++11 is right now

The toolchain roadmap doesn't really say much about C++ but it sounds like some subset of C++11 is allowed. pybind11 only claims to support MSVC 2015 and newer though, which I suspect would be the limiting factor.

At a glance, it looks like the pocketfft code itself should work with MSVC 2013 though. We could always just bind it using cython instead of pybind11.

@ilayn (Member) commented May 13, 2019

Is there any reason why we don't directly use the C version?

@mreineck (Author) commented May 13, 2019

@rgommers I wasn't aware of the proposal, thanks for pointing it out!

One thing that could be extremely beneficial in my opinion would be some sort of unification of numpy's and scipy's FFT implementations. I'm aware that it's probably not possible to unify the interfaces (too many people are used to how things work now), but having a single implementation in the background might make maintenance a lot simpler. (This is why I provided the additional _scipy methods.)

@mreineck (Author) commented May 13, 2019

Is there any reason why we don't directly use the C version?

The C version only supports double-precision 1D transforms.
Multi-dimensional transforms, different data types and vectorization were relatively easy to add using C++ templates; doing this in C would be a nightmare, unfortunately ...

@mreineck (Author) commented May 13, 2019

At a glance, it looks like the pocketfft code itself should work with MSVC 2013 though. We could always just bind it using cython instead of pybind11.

Please let me know if you need any adjustments!

@rgommers (Member) commented May 13, 2019

One thing that could be extremely beneficial in my opinion would be some sort of unification of numpy's and scipy's FFT implementations. I'm aware that it's probably not possible to unify the interfaces (too many people are used to how things work now), but having a single implementation in the background might make maintenance a lot simpler. (This is why I provided the additional _scipy methods.)

Yes, this is definitely desirable. Making the interfaces the same would be awesome, but it's indeed hard due to backwards compatibility issues. In SciPy we need to support NumPy several versions back (right now back to 1.13.3), so we can't just rely on NumPy yet. But we can take over the implementation as it is in NumPy, and then sync changes as needed.

@rgommers (Member) commented May 13, 2019

My plan is that once the backend system is in place, pocketfft would be one of the first alternate backends. From there, we could make pocketfft the default and potentially remove the Fortran code altogether.

Why not make Pocketfft the default first, and then add the backend system?

The toolchain roadmap doesn't really say much about C++ but it sounds like some subset of C++11 is allowed. pybind11 only claims to support MSVC 2015 and newer though, which I suspect would be the limiting factor.

Yes, Windows is one issue. But it may just work as you say. The other things that need checking are more exotic platforms (ARM, AIX, etc.), and what the minimum versions of gcc/clang/icc are that work.

@peterbell10 perhaps this is something you could look into?

@mreineck (Author) commented May 13, 2019

The other things that need checking are more exotic platforms (ARM, AIX, etc.)

If anyone has access to those and runs tests, it would be great if you could keep me in the loop. It should be trivial to add vectorization support there, but I couldn't do it so far because I don't know the relevant preprocessor magic on these platforms.

@WarrenWeckesser (Member) commented May 13, 2019

making the interfaces the same would be awesome, but it's indeed hard due to backwards compatibility issues.

Could we leave the fftpack module unchanged as a long-term legacy module, and create a new module, say scipy.fft?

The interface of rfft in scipy.fftpack is really awkward to use. It would be nice to not have to support that in the future.
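
For readers who haven't run into the difference, a quick illustration of the two conventions (this is just the existing scipy.fftpack / numpy.fft behavior): for a real input of length 5, scipy.fftpack.rfft returns a real array of the same length laid out as [y(0), Re(y(1)), Im(y(1)), Re(y(2)), Im(y(2))], while numpy.fft.rfft returns a complex array of length n//2 + 1, i.e. [y(0), y(1), y(2)].

>>> import numpy as np
>>> from scipy import fftpack
>>> x = np.arange(5.0)
>>> fftpack.rfft(x).shape    # packed half-complex, real-valued output
(5,)
>>> np.fft.rfft(x).shape     # complex output of length n//2 + 1
(3,)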

@mreineck (Author) commented May 13, 2019

The interface of rfft in scipy.fftpack is really awkward to use.

I agree that the half-complex format is pretty strange, but it also has two massive advantages:

  • since the output is real-valued and has the same dimensions as the input, in-place transforms are possible
  • no obscure additional parameters are necessary for the inverse transform to make sure you get the original array size back (see the example below).

On the implementer's side, the numpy interface for real-valued transforms is a much greater pain than the scipy one ...

Of course, I prefer Hartley transforms over those two alternatives any day ;)
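
To illustrate the second point with the current APIs (standard scipy.fftpack / numpy.fft behavior, nothing new): the half-complex round trip needs no length argument, while numpy's complex rfft output loses the odd/even length information, so the inverse has to be told the original size.

>>> import numpy as np
>>> from scipy import fftpack
>>> x = np.arange(5.0)
>>> fftpack.irfft(fftpack.rfft(x)).shape
(5,)
>>> np.fft.irfft(np.fft.rfft(x)).shape      # wrong length by default
(4,)
>>> np.fft.irfft(np.fft.rfft(x), n=5).shape
(5,)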

@WarrenWeckesser (Member) commented May 13, 2019

With a new module, the API for rfft could match that of numpy, and the function that returns a real array like scipy.fftpack.rfft could be given a new name.

@peterbell10 (Member) commented May 13, 2019

Why not make Pocketfft the default first, and then add the backend system?

Wouldn't that mean duplicating effort? First I would have to implement the current interface with pocketfft, then I would have to reimplement it against the backend interface. Depending on the design this might not be much work, but it still seems like putting the cart before the horse.

Yes, Windows is one issue. But it may just work as you say. The other things that need checking are more exotic platforms (ARM, AIX, etc.), and what the minimum versions of gcc/clang/icc are that work.

I can check ARM, gcc and clang but I don't have access to AIX or icc so that would be difficult. Are minimum compiler versions documented anywhere at the moment, or at least what is used to build releases?

@rgommers (Member) commented May 13, 2019

I can check ARM, gcc and clang but I don't have access to AIX or icc so that would be difficult.

I have access to an AIX box, I can do that. ICC we can find out fairly easily I think. I may even have a license still.

@rgommers (Member) commented May 13, 2019

Wouldn't that mean duplicating effort? First I would have to implement the current interface with pocketfft, then I would have to reimplement it against the backend interface.

Can you clarify what you mean by "reimplement against the backend interface"? There should not be any actual implementations of anything in SciPy except for our default implementation (which is now FFTPACK, but we'd like to get rid of that because it performs badly and is hard-to-maintain Fortran code). So we have a new default (Pocketfft) and a backend system. The backend system will then allow users to use external libraries like pyfftw or pypocketfft if those are installed.

@rgommers (Member) commented May 13, 2019

Could we leave the fftpack module unchanged as a long-term legacy module, and create a new module, say scipy.fft?

That does sound like a good idea. It would solve the issue cleanly without any deprecations.

@larsoner (Member) commented May 13, 2019

Agreed that the scipy.fft namespace as the backend-switchable interface is a good way to go. We can even make scipy.fft.* be the same thing API-wise and underlying default implementation-wise (pocketfft) as numpy.fft.*. Then the difference between numpy.fft.* and scipy.fft.* is that the latter might use some other backend if you tell it to, otherwise it will do the same thing as numpy.fft.*.

@rgommers (Member) commented May 13, 2019

The other difference perhaps is that there are more transforms available, like the Hartley transform. And DCT/DST: if we're happy with the API for those, we can just expose them in the new scipy.fft.

@larsoner (Member) commented May 13, 2019

The other difference perhaps is that there are more transforms available, like the Hartley transform. And DCT/DST: if we're happy with the API for those, we can just expose them in the new scipy.fft.

So the proposed SciPy/NumPy delineation is then that scipy.fft is backend-able (falling back to the same pocketfft interface/performance) and offers more, related transforms -- fine by me.

Regarding the half-complex scipy.fftpack.rfft-style interface, which I forgot to reply about -- in my experience it brings more pain and confusion than it does benefits in terms of 1) in-place operation, 2) no ambiguity in irfft length, and 3) no extra bytes wasted on the DC imaginary component (or the imaginary Nyquist part for even-length signals). So I'd rather hold off, and add that to scipy.fft later only if people have compelling use cases for it.

Thoughts @grlee77 @endolith ?

@endolith (Contributor) commented May 13, 2019

Sounds like the same thing I proposed in #2487 (comment) :)

Don't forget that next_fast_len() has to adapt to the backend. (And pocketfft has different optimum lengths for complex vs real FFTs? Not sure how to handle that, or if it really matters too much.)

(I looked through pocketfft's code a little and saw a function that does the same thing as next_fast_len, but it didn't seem to use the 2**((quotient - 1).bit_length()) trick; I didn't look closely at whether that makes any difference. Optimizing the C function with that trick and then just calling it would be the fastest way to do next_fast_len.)
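
For concreteness, a minimal Python sketch of a 5-smooth next_fast_len using that trick (illustrative only; the actual scipy.fftpack and pocketfft routines differ in detail, e.g. in which prime factors they consider):

def next_fast_len(target):
    # Smallest 5-smooth number (2**a * 3**b * 5**c) >= target -- a sketch.
    if target <= 6:
        return target
    best = 2 ** (target - 1).bit_length()   # the next power of two always works
    p5 = 1
    while p5 < best:
        p35 = p5
        while p35 < best:
            quotient = -(-target // p35)    # ceil(target / p35)
            # jump straight to the needed power of two instead of doubling in a loop
            best = min(best, p35 * 2 ** ((quotient - 1).bit_length()))
            p35 *= 3
        p5 *= 5
    return best

next_fast_len(98)   # -> 100 (= 2**2 * 5**2)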

@peterbell10 (Member) commented May 13, 2019

Can you clarify what you mean by "reimplement against the backend interface"? There should not be any actual implementations of anything in SciPy [...]
The backend system will then allow users to use external libraries

In case it wasn't clear: I don't think there should be a default and backends; I think there should be a default backend. i.e. the default fft library would implement the same interface that a 3rd party backend would.

And when I say "implement", I'm strictly referring to the "glue code" that translates from the python call down to the native fft library. If the backend interface doesn't follow the exact same specification as the front end, that "glue code" might have to be rewritten. Depending on the final design this might not actually require any real effort.

Agreed that the scipy.fft namespace as the backend-switchable interface is a good way to go.

@larsoner do you mean as well as scipy.fftpack or instead of? I thought that part of the idea of adding backends was that existing code using scipy.fftpack could benefit.

@larsoner (Member) commented May 13, 2019

I'm starting to think "instead of" (i.e., keep scipy.fftpack just for backward compatibility and focus this new effort on a new scipy.fft) might be nicer in the long term from a maintainability and usability/naming/user expectation perspective.

@rgommers (Member) commented May 13, 2019

In case it wasn't clear: I don't think there should be a default and backends; I think there should be a default backend. i.e. the default fft library would implement the same interface that a 3rd party backend would.

Yes, that makes sense. I think we're saying the same thing in different words :)

@larsoner do you mean as well as scipy.fftpack or instead of?

I'm pretty sure "instead of".

I thought that part of the idea of adding backends was that existing code using scipy.fftpack could benefit.

For most existing code it's a one-line change (import fft instead of fftpack). Also, it cleans up the ugly fftpack name, which never made too much sense.
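
Assuming the new module mirrors the fftpack names (hypothetical at this point), the change for typical user code would look roughly like this:

import numpy as np
from scipy import fftpack

x = np.arange(8.0)
y = fftpack.fft(x)          # existing code

# after the switch (hypothetical scipy.fft module mirroring the names):
# from scipy import fft
# y = fft.fft(x)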

@ilayn (Member) commented May 13, 2019

Actually, scipy.fft would save me from a lot of trouble, so +1.

@endolith (Contributor) commented May 13, 2019

(How does it handle different backends supporting different operations, like DCT being supported by one but not another, for instance? Will it require that every backend support every operation, even if it's a Python-level DCT↔FFT translation, or will it fall back to the pocketfft implementation in that case?)

@rgommers (Member) commented May 14, 2019

Will it require that every backend support every operation,

This question is also raised by the design goals in Peter's design document. I'll copy the two relevant ones here:

  1. Any code that works with the current fftpack should work with all backends
  2. Conversely, any code that doesn’t work with the current fftpack should not work with any other backends.

I'm not sure (1) is tenable, in particular because of less common functionality like DCT. What if one library wants to provide DCT, but another one doesn't? I think that should be possible. Hence I'd answer @endolith's question with "no".

I would suggest that:

  • backends can provide support for any function in scipy.fft
  • if they support it, they are responsible for providing the backend interface (the "glue") in a way that's compatible in both API and behavior. Note that that responsibility is on them though.
  • using "partial implementations" of a function is probably not a good idea. conceptually a backend should say "hey I provide function X". and not "hey I don't have function X but please call X from within your Y". So regarding the Python-level DCT↔FFT translation question, I'd say that if a backend doesn't have DCT, then calling dct() should use the SciPy default for DCT and nothing else.
  • it's up for discussion what should happen if a user explicitly asks for a backend and that backend doesn't provide a function that the user calls. Raise an error, or switch to the default implementation with or without a warning?

The above are for me pretty recently formed ideas based on both playing with the NumPy __array_function__ protocol and with uarray. From design discussions around the former it became clear that allowing backends to provide partial implementations of the whole API is crucial to adoption. The same seems to apply here. Every backend will have fft and ifft, some may have rfft, and very few dct.
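
As a toy sketch of the "partial backends with per-function fallback" idea (all names here are hypothetical, not a proposed API):

import numpy as np

# default implementations for the public functions
_defaults = {'fft': np.fft.fft, 'ifft': np.fft.ifft, 'rfft': np.fft.rfft}

# the active backend registers only what it actually provides;
# anything it lacks falls back to the default implementation
_backend = {'fft': np.fft.fft}   # pretend this came from a third-party library

def _dispatch(name):
    def call(x, *args, **kwargs):
        impl = _backend.get(name, _defaults[name])
        return impl(x, *args, **kwargs)
    return call

fft, ifft, rfft = (_dispatch(n) for n in ('fft', 'ifft', 'rfft'))

# rfft uses the default because the backend does not provide it
y = rfft(np.arange(8.0))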

@mreineck (Author) commented May 14, 2019

So regarding the Python-level DCT↔FFT translation question, I'd say that if a backend doesn't have DCT, then calling dct() should use the SciPy default for DCT and nothing else.

I'm not necessarily disagreeing, but you have to be aware that this can be very surprising for the user. Assume someone has code that initially uses rfft; then they recognize a symmetry in their data and switch to dct, leading to potentially quadratic run times and worse accuracy -- exactly the opposite of what one would expect. This will be hard to explain in the documentation.

@mreineck (Author) commented Jun 28, 2019

Question for potential future development of pypocketfft: currently I'm making no assumptions about the strides of the passed arrays whatsoever. Would it be OK to assume that strides are always a multiple of the size of the underlying numpy data type? This will hold in all "reasonable" situations, but could be wrong if someone passes parts of a structured array, for example.

@charris (Member) commented Jun 28, 2019

Hmm, I think funny things can be engineered, although IIRC, we routinely divide strides by itemsize to get sizes. @seberg Thoughts?

@seberg (Contributor) commented Jun 28, 2019

Not sure that dividing strides by itemsize is much of a thing; dividing shape by itemsize for contiguous data is a thing, though.
Yes, it will hold for most reasonable data, but not for all supported data. It's similar to assuming the data is aligned, maybe. Likely you can get away with it by simply falling back to a slow "copy" path for the hopefully rare case where it cannot be done.

@mreineck (Author) commented Jun 28, 2019

It's similar to assuming the data is aligned, maybe.

That's an assumption I make implicitly. If array entries are not sufficiently aligned (I'm not talking about sufficient alignment for vectors here, just about the natural alignment required for the data type) the module will crash.
Unfortunately, requiring that strides are divisible by itemsize is a much stronger requirement: complex<double> normally only requires 8-byte alignment, although its size is 16. For long double, the discrepancies are probably even larger.
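
The mismatch is visible from numpy itself; the exact numbers are platform- and compiler-dependent, so take these as typical values rather than guarantees:

>>> import numpy as np
>>> np.dtype(np.complex128).itemsize
16
>>> np.dtype(np.complex128).alignment   # typically half the itemsize
8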

Likely you can get away with it by simply falling back to a slow "copy" path for the hopefully rare case where it cannot be done.

It's not about avoiding copying (I'm doing this already where possible). It's just about my multi-D array class on the C++ side, which does all offset calculations in bytes and casts to the correct data type every time I ask for an element. This is pretty awkward, and I don't know whether it inhibits some optimizations.

Anyway, if we cannot exclude weird strides, I'll just leave things as they are. Thanks for your feedback!

@aarchiba (Contributor) commented Jun 28, 2019

That's an assumption I make implicitly. If array entries are not sufficiently aligned (I'm not talking about sufficient alignment for vectors here, just about the natural alignment required for the data type) the module will crash.

Just to check: what alignment do you assume for 80-bit floats? Numpy offers both 12-byte and 16-byte alignment.

One can, of course, create arrays whose data items are not aligned; this is rare enough that a copy on input is fine, but a segfault is not.

@seberg (Contributor) commented Jun 28, 2019

@mreineck but you are then talking about purely numeric arrays, which would be a copy of an old array if necessary? In that case I think you are covered by contiguity (even if it may be along a different axis), so I think you are fine, although I am not quite sure I follow.

@peterbell10 (Member) commented Jun 28, 2019

Is it worth handling views into structured arrays that lack natural alignment:

>>> np.zeros(10, dtype='u1,f8')['f1'].strides
(9,)

It sounds like pypocketfft would crash on that currently.

@mreineck (Author) commented Jun 28, 2019

Just to check: what alignment do you assume for 80-bit floats? Numpy offers both 12-byte and 16-byte alignment.

I'm assuming that the data is aligned in such a way that the CPU can read an object of that type from the address, that's all.

One can, of course, create arrays whose data items are not aligned; this is rare enough that a copy on input is fine, but a segfault is not.

If I'm on a machine that requires doubles to be 8-byte aligned and I'm passed a double * pointing to an address that is not divisible by 8, then a segfault is perfectly reasonable. Such a pointer invokes undefined behaviour. Does numpy really do such things?

@mreineck (Author) commented Jun 28, 2019

>>> np.zeros(10, dtype='u1,f8')['f1'].strides
(9,)

It sounds like pypocketfft would crash on that currently.

It seems that this kind of structure works on x86 (at a performance penalty). It wouldn't work on ARM CPUs, as far as I understand. If numpy produces such strides there, that's a bug.

@peterbell10 (Member) commented Jun 28, 2019

this kind of structure [...] wouldn't work on ARM CPUs

It's not that it doesn't work on arm, it's just that you have to memcpy into aligned storage instead of just casting to double.

https://gcc.godbolt.org/z/TNnNWJ

@mreineck (Author) commented Jun 28, 2019

It's not that it doesn't work on arm, it's just that you have to memcpy into aligned storage instead of just casting to double.

Given the above example with 9-byte strides, I would have to call an 8-byte memcpy for every single entry in the array. That's not really nice performance-wise.
The problem is that I do not know a priori on which hardware this is necessary, so the only safe solution would be to do this copy every time I detect odd alignment and/or strides...

Could you please point me to where this is done for fftpack, for example?

@seberg (Contributor) commented Jun 28, 2019

@mreineck numpy arrays have an aligned flag (and a bit more involved handling in depth) to detect that this may be possible. And yes, we do fall back to slow paths in that case, which for example copy each element first (not typically with memcpy, but I think that is mostly historical). Numpy does support Sparc, for example, and such alignment issues cause bus errors there, so it is a real issue (if users create funny enough arrays). But I do feel that you should be able to just check if the numpy array is "behaved" (although that includes writeable), and if not create a copy.
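
A rough Python-side sketch of such a guard (the function name and exact criteria here are made up, just to illustrate the idea):

import numpy as np

def _as_behaved(a):
    # accept the array as-is if it is aligned and its strides are whole
    # multiples of the itemsize; otherwise make an aligned, contiguous copy
    a = np.asarray(a)
    if a.flags.aligned and all(s % a.itemsize == 0 for s in a.strides):
        return a
    return np.require(a, requirements=['ALIGNED', 'C'])

weird = np.zeros(10, dtype='u1,f8')['f1']   # strides == (9,)
_as_behaved(weird).strides                  # (8,) -- copied into a clean buffer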

@aarchiba (Contributor) commented Jun 28, 2019

Just to check: what alignment do you assume for 80-bit floats? Numpy offers both 12-byte and 16-byte alignment.

I'm assuming that the data is aligned in such a way that the CPU can read an object of that type from the address, that's all.

It sounds like on x86 this isn't a problem (except for speed), and by default numpy chooses to pack 10-byte floats into 16-byte spaces if on a 64-bit architecture, but users can specify float96 to get 12-byte floats on any architecture, and they might do so in an attempt to save space.

Numpy is sometimes used on memory-mapped files, so occasionally an array arises in some layout optimized for storage on disk rather than efficient CPU operations. It is important that numpy be able to read and write such arrays in place, but it is understood that operations on such arrays might be slower and/or require a copy compared to more conventional layouts. In fact many numpy operations (for example reshape) share memory only on a best-effort basis, copying when necessary, in particular when a library needs a particular kind of layout. So it seems to me perfectly reasonable to specify your layout requirements up front and call something like np.ascontiguousarray if they are not met. (In fact an FFT might be faster on the copy than the original even for some normally strided arrays depending on cache layout and operation order.) For something as basic as an FFT, it would be nice to avoid this copy for most use cases, but people with funny alignments are expecting slowness and copies. Segfaults are not okay.

@peterbell10 (Member) commented Jun 28, 2019

The problem is that I do not know a priori on which hardware this is necessary, so the only safe solution would be to do this copy every time I detect odd alignment and/or strides...

If you look at the godbolt link, gcc on intel turns the memcpy into a movsd even on -O0.

@mreineck (Author) commented Jun 29, 2019

That's great ... but on ARM this won't happen, so if I don't want to penalize users on that platform, I can't avoid having two code paths.

I'd like to address this outside of pocketfft (i.e. somewhere in pypocketfft.cc instead). Unfortunately this means (potentially) copying the entire multi-D arrays (out may also need to be copied at exit), which is a big memory overhead.

For the current fftpack implementation, is the aligning done somewhere in the automatically generated Fortran wrapper? I tried with the 9-byte-strided test array, and it was not copied before calling the wrapper, so it must be in there somewhere...

@mreineck (Author) commented Jul 2, 2019

OK, here is the dilemma I'm facing:

Assume someone passes an array to pypocketfft which doesn't have the aligned flag set. If I make an aligned copy of the array, I'm wasting space on platforms where this is not necessary (and it could lead to an out-of-memory condition); if I don't copy, the code will crash on some CPUs. So I would have to make a distinction based on the host CPU type, which is information I do not have.

I do not want to change pocketfft_hdronly.h itself, since by definition of its interface it cannot be passed misaligned pointers (I'm sure that a double * from which it is impossible to read a double is ill-formed C++). So something must be done in pypocketfft.cc, and I have no idea what :-/

BTW, pyfftw refuses to create plans for unaligned and strangely strided input/output arrays. That would also be an acceptable solution for me, but if I understood correctly this is not how scipy does things.

@peterbell10 (Member) commented Jul 2, 2019

What if you just used ndarr in the interface to pocketfft_hdronly.h? The non-contiguous case in general_c already does a copy so changing that to memcpy shouldn't be any more work.

@mreineck (Author) commented Jul 2, 2019

As you showed, using memcpy for small fixed sizes works fine on some platforms like x86, because the compiler knows it can simply replace the memcpy with an unaligned load/store. But your demo code on godbolt also showed that on ARM, memcpy is called unconditionally, which must be a big performance hit. This could only be avoided by another code path, which feels like overkill for a scenario that should never occur in a performant numerical library.
To be honest, even the existing char * hackery in ndarr feels scary to me now, and I'd prefer to go back to strides measured in sizes of the underlying data types instead of bytes.
If we fix this, please let's fix it in the wrapper ... perhaps even outside of pypocketfft.cc. I'm pretty sure that most other FFT backends won't be able to deal with weirdly aligned/strided arrays either, so the alignment code will certainly be re-used.

@peterbell10 (Member) commented Jul 2, 2019

I think that's fine too. I'd guess this is a very rare edge case so I wouldn't be too worried about just copying the whole array. Especially since the resulting transform could usually be done in place so it wouldn't necessarily have any memory overhead.

@mreineck (Author) commented Jul 2, 2019

Especially since the resulting transform could usually be done in place so it wouldn't necessarily have any memory overhead.

Hah, very good point! Now I don't feel so bad any more :)

@aarchiba (Contributor) commented Jul 2, 2019

Assume someone passes an array to pypocketfft which doesn't have the aligned flag set. If I make an aligned copy of the array, I'm wasting space on platforms where this is not necessary (and it could lead to an out-of-memory condition); if I don't copy, the code will crash on some CPUs. So I would have to make a distinction based on the host CPU type, which is information I do not have.

Because this is a rare case, the main concern is avoiding crashes or "I can't do that" exceptions; extra slowness or memory use are a tolerable cost. (People whose arrays are pushing the limits of virtual memory can be required to be aware of alignment issues.) So when given a non-aligned array, why not make the copy unless you know it's going to be okay? That is, the code path contains exceptions to "make a copy if non-aligned" for Intel and other architectures explicitly known to be okay?

I agree this should live in the wrapper. I guess the question is, can pypocketfft live with byte-based strides? If so, then the wrapper can skip the copy when it knows that's safe. (And when a copy is made, the code can be permitted to use in-place FFTs, if those are faster.)

@mreineck (Author) commented Jul 2, 2019

So when given a non-aligned array, why not make the copy unless you know it's going to be okay? That is, the code path contains exceptions to "make a copy if non-aligned" for Intel and other architectures explicitly known to be okay?

Just to be sure, where do you suggest these tests should happen - in Python or already on the C++ side? Personally, I'd prefer Python, but I don't know how well the computing environment can be queried from there.

I guess the question is, can pypocketfft live with byte-based strides?

Yes, I can keep this functionality in. "Natural" strides look a bit nicer, but it's not a dramatic difference.

(And when a copy is made, the code can be permitted to use in-place FFTs, if those are faster.)

They'll just be more economical in terms of memory, but probably not noticeably faster.

@aarchiba (Contributor) commented Jul 2, 2019

So when given a non-aligned array, why not make the copy unless you know it's going to be okay? That is, the code path contains exceptions to "make a copy if non-aligned" for Intel and other architectures explicitly known to be okay?

Just to be sure, where do you suggest these tests should happen - in Python or already on the C++ side? Personally, I'd prefer Python, but I don't know how well the computing environment can be queried from there.

I'd say put the tests wherever convenient. Since there are already python-level functions to produce array copies when necessary, python might be easier. I don't know how easy these architecture checks are from python either, but if absolutely necessary you could just add a "required_alignment" function to the C++ side that the python can call to check. Ideally numpy would have a function that took an array and copied it if necessary to produce an array where element access from C++ didn't cause a segfault, but numpy.require expects you to know whether you need aligned elements.
