VkFFT 1.3 #25

Closed · DTolm opened this issue Apr 2, 2023 · 70 comments
@DTolm commented Apr 2, 2023

Dear @vincefn,

I have made some substantial changes to VkFFT in version 1.3 (https://github.com/DTolm/VkFFT/tree/develop), so there will be a two-month period before it is merged into the main branch, giving people time to adjust dependent projects. Namely, VkFFT is no longer a single header file, but rather a collection of headers. This should not increase the complexity of usage - you still include the vkFFT.h file and it pulls in the other files. The main advantage is that the code base is now more structured and much easier for other developers to understand.

I have tested the code on some systems with the implemented benchmark scripts - RTX2080 (Vulkan, CUDA, OpenCL), MI250 (HIP), A100 (CUDA), UHD610 (Level Zero) and M1 Pro (Metal) - however, your suite is more thorough in this regard. You might also be interested in exploring the new design.

I suppose keeping this issue open during this period can be helpful for discussion.

Best regards,
Dmitrii

@vincefn (Owner) commented Apr 3, 2023

Hi @DTolm, thanks for the heads-up, I'll try to adapt pyvkfft in the next few days.
(PS: I was actually going to mail you because I found a performance regression on the V100 (but not only) for some sizes, e.g. the c2c 512**3 3D transform - almost 2x slower than in version 1.2.21.)

@vincefn (Owner) commented Apr 3, 2023

Dear @DTolm (Cc @picca)

Maybe one quick comment regarding the organisation of the headers: splitting the different parts is a good thing, it will be more usable that way.
However, I do not know how you intend the headers to be distributed and used. Previously, this just required adding one path and using #include "vkFFT.h". I see that for Debian the approach is similar, given the files distributed (https://packages.debian.org/sid/all/libvkfft-dev/filelist).

Now you are using #include "vkFFT_Structs/vkFFT_Structs.h", which means that you expect a directory organization like:

INCLUDE_DIR/vkFFT.h
INCLUDE_DIR/vkFFT_AppManagement/vkFFTDeleteApp.h
INCLUDE_DIR/vkFFT_AppManagement/...
INCLUDE_DIR/vkFFT_CodeGen/KernelsLevel0/vkFFT_KernUtils.h
INCLUDE_DIR/vkFFT_CodeGen/...
INCLUDE_DIR/vkFFT_PlanManagement/...
INCLUDE_DIR/vkFFT_Structs/...

where INCLUDE_DIR would be e.g. /usr/include under linux.

This is a little unusual, as it requires copying not one directory (vkFFT) with all the headers, but instead copying both vkFFT.h and the subdirectories vkFFT_AppManagement, vkFFT_CodeGen, vkFFT_PlanManagement and vkFFT_Structs into the distribution's include directory.

This is just a suggestion, but there would be two simpler ways of doing this:

  1. Do not change your directory structure, but include from one level up, i.e. use:
  • #include "vkFFT/vkFFT.h"
  • #include "vkFFT/vkFFT_AppManagement/vkFFTDeleteApp.h"

This has the advantage of only requiring a single copy or link of the vkFFT include folder.

  2. Simplify a bit more and use:
INCLUDE_DIR/vkFFT.h
INCLUDE_DIR/vkFFT/AppManagement/DeleteApp.h
INCLUDE_DIR/vkFFT/AppManagement/...
INCLUDE_DIR/vkFFT/CodeGen/KernelsLevel0/KernUtils.h
INCLUDE_DIR/vkFFT/CodeGen/...
INCLUDE_DIR/vkFFT/PlanManagement/...
INCLUDE_DIR/vkFFT/Structs/...

I think the latter is a much more standard way of organising the include directories - and more convenient, with one top include header and one directory.
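
As an illustration of option 2, the internal includes would then look something like this (a minimal sketch; the file names under vkFFT/ are hypothetical, simply mirroring the layout above):

/* in user code - unchanged: */
#include "vkFFT.h"

/* inside INCLUDE_DIR/vkFFT.h - everything else lives under the vkFFT/ directory: */
#include "vkFFT/Structs/Structs.h"
#include "vkFFT/AppManagement/DeleteApp.h"
#include "vkFFT/PlanManagement/PlanFFT.h"
#include "vkFFT/CodeGen/KernelsLevel0/KernUtils.h"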

I've also put @picca in Cc who maintains the VkFFT Debian package for further advice.

@DTolm (Author) commented Apr 3, 2023

Dear @vincefn,

I am happy to change the codebase directory layout to be more in line with other header-only libraries - so probably the second approach is the easiest?

As for the performance regressions - they are likely related to experiments with the dispatch threadblock size. They can improve some systems while making others worse, and it is hard to decide the best size automatically for all systems. Currently, this decision-maker is located on lines 447-773 of the vkFFT_Plan_FFT file and should probably be moved to a separate file as well. I will check what happens with the 512^3 system and try to make some improvements.

Best regards,
Dmitrii

@vincefn (Owner) commented Apr 3, 2023

I am happy to change the codebase directory layout to be more in line with other header-only libraries - so probably the second approach is the easiest?

I think this would be best. But please get another opinion before committing - I don't know if @picca has an opinion on this.

As for the performance regressions - they are likely related to experiments with the dispatch threadblock size. They can improve some systems while making others worse, and it is hard to decide the best size automatically for all systems. Currently, this decision-maker is located on lines 447-773 of the vkFFT_Plan_FFT file and should probably be moved to a separate file as well. I will check what happens with the 512^3 system and try to make some improvements.

Could it make sense to make this tunable - changing the threadblock size using an optional parameter? You could imagine using this like FFTW with its FFTW_ESTIMATE and FFTW_MEASURE - it would be the task of the calling library (e.g. pyvkfft) to test the performance of different sizes. I don't know how deep this is in the code, and so whether that would be possible.

Maybe it's already tunable through e.g. maxThreadNum - let me know.
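
For illustration, such a tuning hook could look like this through the C API (a minimal sketch using the VkFFTConfiguration fields discussed later in this thread; the values are examples, not recommendations):

#include "vkFFT.h"

VkFFTConfiguration configuration = {0};
configuration.FFTdim = 3;            /* 3D c2c transform */
configuration.size[0] = 512;
configuration.size[1] = 512;
configuration.size[2] = 512;
/* candidate tuning parameters - the caller (e.g. pyvkfft) would try several
   values and keep the fastest plan, FFTW_MEASURE-style: */
configuration.coalescedMemory = 64;  /* bytes assumed coalesced per access */
configuration.aimThreads = 128;      /* target threads per threadblock */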

@vincefn (Owner) commented Apr 4, 2023

Regarding the speed issue, here are some comparison benchmarks:

The A2000 has no issues in 2D (at least up to 1024) but shows a decrease in 3D:
(in the graphs below, sizes up to 1024 are 2D transforms and up to 660 are 3D transforms - all batched, so the actual arrays transformed are about 1 GB)

pyvkfft-benchmark-A2000-2D
pyvkfft-benchmark-A2000-3D

For Titan V and V100 the decrease is clearer:
pyvkfft-benchmark-TitanV-2D
pyvkfft-benchmark-TitanV-3D

pyvkfft-benchmark-V100-2D
pyvkfft-benchmark-V100-3D

@DTolm (Author) commented Apr 4, 2023

That's unexpected - I will investigate the reason today and add the update to the 1.3.0 branch. Thanks for finding this!

@picca commented Apr 4, 2023

Hello, just about the include files. I prefer the 2nd solution.

You have to decide for your users what the preferred include directive is:

#include <vkFFT.h>

Then you decide where the include files are installed by default. (For example with the autotools, it is under the includedir path, which is by default $DESTDIR/usr/include/, but you can also decide to install under a versioned directory for your library.)

Something like:

/usr/include/vkFFT-'X'/
/usr/include/vkFFT-'X'/vkFFT.h
/usr/include/vkFFT-'X'/vkFFT/...

This is the way the gtk libraries are installed. Usually a library comes with a pkg-config file which allows finding the -I{dir} where the include files are expected. This is particularly important if you select the versioned directory structure, because that directory is not part of the default include search paths.

Then using pkg-config we have:

pkg-config --cflags vkFFT
-I/usr/include/vkFFT-X/

pkg-config --libs vkFFT
(no output - this is a pure header library)

The install scripts should be compatible with the pkg-config file generated during the build.
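
For illustration, a minimal vkFFT.pc along these lines might look like this (prefix, description and version are assumptions):

prefix=/usr
includedir=${prefix}/include/vkFFT-X

Name: VkFFT
Description: Header-only GPU FFT library
Version: X
Cflags: -I${includedir}
Libs: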

I am not a specialist of cmake, but I think it works the same way there.

For now I would keep the 2nd solution proposed by Vincent, since using two versions of the library at the same time is not something expected (I think).

Do not hesitate to ask further questions if I was not clear enough.

Frederic

@vincefn (Owner) commented Apr 4, 2023

Hi @picca - thanks for the feedback. I don't think versioning would be needed for VkFFT, so I guess solution 2) above would be better - and it does not change anything for the end user; using #include <vkFFT.h> would still work.

@vincefn (Owner) commented Apr 5, 2023

Hi @DTolm - here are longer 2D benchmarks. The decrease in performance is largely localised in the length<1024 region:

Titan V:
pyvkfft-benchmark-TitanV-2D
For the V100 - I'm surprised at the spread compared to the Titan V:
pyvkfft-benchmark-V100-2D

The A2000 throughput is remarkably stable over a wide range:
pyvkfft-benchmark-A2000-2D

@DTolm (Author) commented Apr 6, 2023

@picca Sounds good, I will switch to solution 2 without versioning in one of the next develop branch updates.

@vincefn I have identified the issue with the 512^3 system - it is again related to distant coalesced memory accesses in the z-axis FFT. I changed the logic for it between 1.2.21 and 1.2.33 and it stopped coalescing as much as possible, which is the right approach to this problem. I will need to rethink the logic once again and maybe write a small autotuner for this.

As for the Titan V results: systems < 1024^2 take < 8 MB, are really dependent on the L2 cache and can be greatly affected by background tasks. I couldn't verify the discrepancy on an RTX2080. The Titan V chip is essentially the same as the V100 with some SMs disabled, so I am not sure why there is such a difference (VkFFT surely produces the same binaries for them). I will try to investigate the drop between 1024 and 1536, which also happens when the SM uses between 64 and 96 KB of shared memory for the y axis.

The A2000 has such low bandwidth for its chip that it is just never compute limited (the Ampere architecture also merged the fp32/int32 cores, which greatly helps in the case of FFTs).

@vincefn (Owner) commented Apr 10, 2023

Hi @DTolm, I have begun playing around with advanced VkFFT parameters to see if they could easily be used to bake more optimal plans.

There is now a pyvkfft-benchmark script which can be used to test those relatively easily.

Some preliminary interesting results (only using radix 2 & 3):

On my macbook's M1 with OpenCL, lowering aimThreads to 32 (maybe 64 would be enough; 16 gives similar results) gives improvements in 2D and 3D:
pyvkfft-benchmark-M1-2D-threads
pyvkfft-benchmark-M1-3D-threads

On a Titan V (using CUDA), increasing coalescedMemory to 128 or 256 (for 3D) gives very nice improvements:
pyvkfft-benchmark-TitanV-2D-coalmem

pyvkfft-benchmark-TitanV-3D-coalmem

And finally for a V100 (also CUDA), again tweaking coalescedMemory:
pyvkfft-benchmark-V100-2D-coalmem
pyvkfft-benchmark-V100-3D-coalmem

I like this very much - as you said, finding optimal parameters can be very tricky given the diversity of GPU configurations, so searching for the best options could easily be done by a higher-level library like pyvkfft, if the user chooses to do so (like the FFTW_MEASURE approach). In my case, where I use iterative algorithms, this can make a lot of difference.
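
To make the FFTW_MEASURE-style idea concrete, here is a rough sketch of such a search against the VkFFT C API with the CUDA backend. It is illustrative only: device/buffer setup and error handling are elided, and the helper name time_candidate is mine, not part of VkFFT or pyvkfft.

#define VKFFT_BACKEND 1   /* CUDA backend */
#include "vkFFT.h"
#include <cuda_runtime.h>
#include <cuda.h>
#include <stdint.h>

/* Time a few 512^3 c2c transforms for one candidate coalescedMemory value;
   the caller compares candidates (e.g. 32, 64, 128) and keeps the fastest. */
static float time_candidate(CUdevice* device, void** buffer,
                            uint64_t* bufferSize, uint64_t coalescedMemory)
{
    VkFFTConfiguration configuration = {0};
    VkFFTApplication app = {0};
    configuration.FFTdim = 3;
    configuration.size[0] = 512;
    configuration.size[1] = 512;
    configuration.size[2] = 512;
    configuration.device = device;
    configuration.bufferSize = bufferSize;
    configuration.coalescedMemory = coalescedMemory;  /* candidate value */
    if (initializeVkFFT(&app, configuration) != VKFFT_SUCCESS) return 1e30f;

    VkFFTLaunchParams launchParams = {0};
    launchParams.buffer = buffer;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < 10; i++)
        VkFFTAppend(&app, -1, &launchParams);  /* forward transform */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    deleteVkFFT(&app);
    return ms;
}

The plan with the smallest time would then be rebuilt once with the winning value and reused by the iterative algorithm, so the extra compilation cost is paid only once.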

@DTolm (Author) commented Apr 10, 2023

@vincefn This study is also very interesting from a GPU-understanding perspective. I have added access to the batching parameter - groupedBatch. Batching here means how many FFTs are done per kernel by a dedicated threadblock. It will try to force the user value for batching for each of the three axes. It has an effect on how many threads are allocated, the read/write pattern, coalescing, etc. I also attach a simple search routine that looks for an optimal parameter on a discrete grid of batches - it is slow and unoptimized, but still improves the results.

Below are the results for Nvidia A100:
optimization_A100

In the text file, you can see the batching parameters that the routine has found.
batching_test_A100.txt

sample_1003_benchmark_VkFFT_single_3d_2_512.zip

@DTolm (Author) commented Apr 10, 2023

As for the increase of coalescedMemory to 256 bytes - it solves the issues of big systems (as they benefit from more batching), but will be detrimental to small systems (as some compute units will have no work). It also shortens the single-upload size in VkFFT, which can increase the number of memory transfers. So there has to be a logical rule for when to batch more.

@vincefn (Owner) commented Apr 10, 2023

Thanks, I have added the groupedBatch parameter to the ones which can be tweaked. It will be more complicated to tune, as the range of values is large (compared to simpler parameters like aimThreads and coalescedMemory).

Just to understand - does this affect the batched FFTs, or the number of parallel 1D FFTs performed per block? I mean, if the array is 50x512x512 and the transform is 2D, does this affect how the 512 1D transforms are distributed, or the 50 (batch dimension) ones?

Here are the benchmark results tweaking just the coalescedMemory parameter:
pyvkfft-benchmark-NVIDIA_TITAN_V-cuda-2D-coalmem
pyvkfft-benchmark-NVIDIA_TITAN_V-cuda-3D-coalmem
pyvkfft-benchmark-Tesla_V100-SXM2-32GB-cuda-2D-coalmem
pyvkfft-benchmark-Tesla_V100-SXM2-32GB-cuda-3D-coalmem

@DTolm (Author) commented Apr 10, 2023

In the 50x512x512 case, for the first-axis length-512 FFT, it determines how many of the 50x512 total FFTs are executed per threadblock (usually 1-16). It is also indirectly affected by coalescedMemory and aimThreads (but forcing groupedBatch bypasses that).
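
Forcing it per axis then looks something like this (a sketch; this treats groupedBatch as an array indexed by axis, per the description above - the exact layout is an assumption):

configuration.groupedBatch[0] = 8;  /* FFTs per threadblock for the x-axis pass */
configuration.groupedBatch[1] = 4;  /* y-axis pass */
configuration.groupedBatch[2] = 4;  /* z-axis pass */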

@vincefn (Owner) commented Apr 10, 2023

Ok, so for a single (non-batched) 3D transform of size 512x512x512, the groupedBatch parameter would not be relevant then (that's the size I was working on when I noticed the speed discrepancy). Thanks!

@DTolm (Author) commented Apr 10, 2023

It will be relevant: by increasing the groupedBatch parameter for the y and z axes you will solve the speed issue.

@vincefn (Owner) commented Apr 11, 2023

OK, it is slow but it works - at least using the CUDA backend: with OpenCL (on an Nvidia card), I easily end up with a CUDA_ERROR_INVALID_VALUE and a program abort, so I can't really test different configurations.

Results do not seem to be very sensitive to the X-axis batch parameter (I guess that makes sense when the axis is not strided).

@DTolm (Author) commented Apr 11, 2023

I have not tested the groupedBatch option with backends other than CUDA; I will check what is wrong there. For the X axis it is indeed less noticeable (only for small systems - cubes up to ~100), as bigger systems already have good coalescing and thread assignment.

@vincefn (Owner) commented May 20, 2023

Dear @DTolm - regarding the CUDA_ERROR_INVALID_VALUE error with OpenCL, you can probably ignore that. I think this only happened with PoCL, which I normally don't test as it has problems handling all VkFFT kernels.

However, I've begun the systematic tests on the current VkFFT develop branch (including DTolm/VkFFT#112) and I have a few calculation errors (accuracy failures in my tests) with non-radix transforms - see for example on an A40 using CUDA or OpenCL:
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-19-a40cl/pyvkfft-test.html
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-19-a40cu/pyvkfft-test.html

There are also a number of VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM errors for R2C transforms.

Otherwise, I've introduced an auto-tuning option in pyvkfft to optimise parameters like aimThreads, coalescedMemory or warpSize, and it seems to work well when preparing the VkFFTApp.

Here's an example in 2D on a V100, and another in 3D on an A100 (benchmark images attached).

In those cases I just test coalescedMemory at 32, 64 or 128, so it's cheap but effective :-)

DTolm added a commit to DTolm/VkFFT that referenced this issue May 23, 2023
-Should be fixed: Bluestein algorithm reading data out of bounds and producing errors
-Reorganized and fixed push constants
@DTolm (Author) commented May 23, 2023

Dear @vincefn

I think I have fixed the C2C/R2C issue by changing how thread indexing works in one of the edge cases.

As for the coalescedMemory tuning - can you share the results as a text file so I can try to generalize the findings? It is good that the code can be made faster by tweaking one number, but runtime compilation is still an issue in some cases, so I am not sure an autotuner should be the default behavior.

Best regards,
Dmitrii

@vincefn (Owner) commented May 24, 2023

Thanks @DTolm - indeed the tests seem much better:

http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-24-a40cu/pyvkfft-test.html (cuda/A40)
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-24-a40cl/pyvkfft-test.html (opencl/A40)

Tests are ongoing, but the failures for C2C seem to be gone. However, the R2C transforms still give timeout errors for very specific sizes, which I don't understand. Maybe those transforms generate very large kernels which take a long time to compile - surprising, but I can try to increase the timeout from 10 to 20 s to see.

As for the coalescedMemory tuning - can you share the results as a text file so I can try to generalize the findings?

I can modify the benchmark script so that it also exports the tuned parameter values.

It is good that the code can be made faster by tweaking one number, but runtime compilation is still an issue in some cases, so I am not sure an autotuner should be the default behavior.

Sure - this will definitely remain an option. Not all GPUs seem to benefit.

PS: is there a simple formula to determine when the transform uses the Rader vs the Bluestein algorithm (based on the prime decomposition)? I guess it's related to app->configuration.primeSizes, but I am not sure how to use that.

@vincefn (Owner) commented May 24, 2023

Hmm, actually the timeout is after 120 s, not 10. So it's not a question of compilation time.

If I try without parallel processing (in this case on an A2000) with --serial, i.e.:

pyvkfft-test --systematic --backend pycuda --gpu a2000 --max-nb-tests 0 --ndim 1 --range 4198 4400 --r2c --bluestein --norm 1 --serial

Then I get a segmentation fault when it gets to the R2C transform of size 4202. That's with cuda driver 530.30.02 (cuda 12.1) and cuda toolkit 11.7.0 (but the error also appears with the A40 and cuda driver 11.7, same toolkit).

@DTolm (Author) commented May 24, 2023

I see - I checked these sizes on an RTX 2080, and you ran on an A40, which is Ampere and has more shared memory. These sizes are within the single-upload limit on it and behave differently compared to Turing; I will fix them later today.

As for the tuning results, it would be best to have all the results so I can compare the relative gains, thank you!

As for Rader's algorithm, the formula is: if the sequence is decomposable as a product of primes, with each prime P satisfying that P-1 is decomposable as a product of radices 2-13, or with P being 47, 59 or 83, then Rader's algorithm is used (these are the default values and can be tuned). Each such prime must also be less than the single-upload size. Otherwise, Bluestein's algorithm is used.
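
The stated rule can be sketched as a small check (this mirrors the description above, not VkFFT's actual code; it omits the single-upload-size condition, and sizes that factor entirely into radices 2-13 are plain radix FFTs that never need Rader in the first place):

#include <stdint.h>

/* does n (here, P-1) factor completely into the radices 2..13? */
static int factors_into_radix_2_to_13(uint64_t n)
{
    const uint64_t radix[] = {2, 3, 5, 7, 11, 13};
    for (int i = 0; i < 6; i++)
        while (n % radix[i] == 0) n /= radix[i];
    return n == 1;
}

/* 1 if every prime factor P of n satisfies the Rader condition, else 0 (Bluestein) */
static int uses_rader(uint64_t n)
{
    for (uint64_t p = 2; p * p <= n; p++) {
        while (n % p == 0) {
            if (!(factors_into_radix_2_to_13(p - 1) || p == 47 || p == 59 || p == 83))
                return 0;
            n /= p;
        }
    }
    if (n > 1 && !(factors_into_radix_2_to_13(n - 1) || n == 47 || n == 59 || n == 83))
        return 0; /* the remaining factor is prime and fails the condition */
    return 1;
}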

DTolm added a commit to DTolm/VkFFT that referenced this issue May 25, 2023
-Fixed mistake in calculation of used registers in C2R Bluestein
@DTolm (Author) commented May 25, 2023

I have identified the incorrect register calculation in C2R Bluestein and fixed it, but I did so only while emulating the Ampere architecture, so I have only checked that the code compiles.

@vincefn (Owner) commented May 25, 2023

Looks good so far:
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-25-a40cu/pyvkfft-test.html
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-25-a40cl/pyvkfft-test.html

Just one DCT2 failure (size 4) in OpenCL - but it may be a fluke, I'll retry it separately.

@vincefn (Owner) commented May 26, 2023

OK - the CUDA tests all passed on the A40, but there are some failures using OpenCL (also on the A40) with VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM.

I re-tested all failures separately with fewer parallel processes, and you can see the log:
http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-25-a40cl/log-6821099.txt

There are specific failures with the message: ptxas error : Entry function 'VkFFT_main' uses too much shared data

@DTolm (Author) commented Jul 7, 2023

this kind of functionality is best handled in a packaging layer

There are many people who use the C API directly, and what I was trying to say is that they would also like access to improved performance through kernel tuning. I am not sure yet how to provide them with such functionality (also because of the reasons you mentioned, like control over allocations).

As for the release, I will do limited tests with all the APIs and then make a release in a week if all goes smoothly.

@vincefn (Owner) commented Jul 10, 2023

It seems I found a very small corner case that doesn't work: on the AMD gfx900, the float64 DCT4 fails for 2D transforms of sizes (255, 255) and (799, 799), as well as for the 3D size (255, 255, 255)...

See http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-amdgfx900/pyvkfft-test.html

That's really a corner case - I don't place a lot of importance on DCT, so it's not a priority.

Incidentally, I'm having a lot of trouble finishing the test suite on the gfx900 card. Once the test suite reaches the non-radix tests, the card never runs for more than 2-3 hours before completely crashing the GPU (rocm-smi crashes) and needing a reboot... Maybe issues with the driver version.

@DTolm (Author) commented Jul 10, 2023

@vincefn I couldn't reproduce the DCT errors with the same setup I used for the R2C errors. May I ask you once again to send the generated code, so I can verify that we are running the same program?

As for the crashes, I also have no idea what could have happened. OpenCL has not been a priority for vendors for quite a while, so there is a non-zero chance of driver issues.

@vincefn (Owner) commented Jul 10, 2023

Attached is the code for the 255x255 DCT4 float64. Interestingly, the actual calculation seems to hang... I haven't tried the others.

dct4_255_255.zip

@DTolm (Author) commented Jul 10, 2023

Yes, the exact same code works on a Radeon Pro VII on Windows. Weird.
I guess this version is ready to be merged; I will run some backend checks on it now.

@DTolm (Author) commented Jul 21, 2023

@vincefn I had some time to refactor the code a bit more and added support for an arbitrary number of dimensions, set by defining VKFFT_MAX_FFT_DIMENSIONS. It is now possible to fully mimic the FFTW guru interface (apart from innermost batched R2C/C2R, which I guess can only be done out of place). The implementation should be compatible with all previous versions (I just replaced all [3] arrays in the interface with [VKFFT_MAX_FFT_DIMENSIONS] ones), but it would be good to rerun the tests to be sure.

I will need to update the documentation and then do a full release.

I have also added fixes for FFTs of length 1 and DCT-IV of length 2.

Best regards,
Dmitrii

@vincefn (Owner) commented Jul 21, 2023

Looks very interesting! I'll try to re-run the tests (I'm at a conference next week, then on vacation...).

I probably won't try to use the arbitrary number of dimensions for the next pyvkfft release - it will need some work and additional unit tests.

Cheers,
Vincent

@vincefn (Owner) commented Jul 22, 2023

Ok, I tried some quick tests, but it generates a bunch of errors (C2C, DCT, etc.).

Maybe some simple changes are needed - can you take a look? I don't know if you can also run the pyvkfft tests on your side - this is just the default basic test on a few dimensions.

Log attached:
logtest.txt

@DTolm (Author) commented Jul 27, 2023

Dear @vincefn,

I think I found the problem. Starting with this update, specifying configuration.size[1] as an alternative to configuration.numberBatches with configuration.FFTdim==1 no longer works, because the loop indexing for the kernel launch size is now tied to the specified configuration.FFTdim. I am not sure it is a good idea to have a workaround just for this, as the number of dimensions is now a compile-time constant. Is this the case in the pyvkfft test code? If so, can you change the initialization of these systems to use configuration.numberBatches?
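
A minimal sketch of the suggested initialization, using the configuration fields named above:

VkFFTConfiguration configuration = {0};
configuration.FFTdim = 1;             /* a true 1D transform */
configuration.size[0] = 512;          /* FFT length */
configuration.numberBatches = 1000;   /* instead of putting the batch in size[1] */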

Best regards,
Dmitrii

@vincefn (Owner) commented Jul 27, 2023

Ok - currently pyvkfft's calc_transform_axes will not use n_batch if there are <=3 dimensions. For example:

calc_transform_axes((2,36), ndim=1)

returns nx=36, ny=2, nz=1, n_batch=1

but it should rather return nx=36, ny=1, nz=1, n_batch=2.

@vincefn (Owner) commented Jul 28, 2023

Ok, by using n_batch the tests run better, but there are still some issues with F-ordered arrays: I'm getting a VKFFT_ERROR_UNSUPPORTED_FFT_OMIT when trying a 1D C2C transform of an array of shape (ny=2, nx=30) with F-ordering... I have to see if I can change something on my side to make that work.

@vincefn (Owner) commented Jul 28, 2023

Besides the F-ordered array issues, I get (at least in this quick test) another issue for standard C-ordered arrays:

pyopencl  C2C    (2,30,30,2) axes=   [-2,-3] ndim=None  complex64 lut=None inplace=1  norm=   1 C   FFT: n2=2.32e-07 ninf=3.05e-07 < 3.48e-06 (0.088) 0 iFFT: n2=2.02e-07 ninf=3.05e-07 < 3.48e-06 (0.088) 0   OK
pyopencl  C2C    (2,30,30,2) axes=   [-2,-3] ndim=None  complex64 lut=None inplace=0  norm=   0 C   FFT: n2=1.02e+00 ninf=1.02e+00 < 3.48e-06 (293600.756) 1 iFFT: n2=2.14e-07 ninf=2.94e-07 < 3.48e-06 (0.085) 1 FAIL

So in this case I have an array of shape (2,30,30,2) which is transformed along the two middle axes - the forward transform seems to fail for the out-of-place norm=0 case (?), while the other transforms work (the inverse for norm=0, and both forward and inverse for the in-place norm=1 case). Strange.

DTolm added a commit to DTolm/VkFFT that referenced this issue Jul 28, 2023
-Fixed double check of omitDimension[0]
@DTolm (Author) commented Jul 28, 2023

Whoops, my bad - I was incrementing the id of the first axis twice when counting configuration.omitDimension[0]. In the first case, the check that the first axis id is not bigger than the last one failed. In the second case, this id is also used for buffer ordering, not the execution decision, and I guess it read from the output buffer instead of the input one for the forward FFT, while passing the correct sequence for the iFFT.

@vincefn (Owner) commented Jul 28, 2023

Great, the quick test with all the dimensions/strides/axes/norms passes.

I'll launch the complete test suite later.

@vincefn (Owner) commented Jul 30, 2023

Hi @DTolm, the tests are looking good - do you want to merge and tag this for a release?

@DTolm (Author) commented Jul 31, 2023

Dear @vincefn,

I have uploaded the final snapshot of the develop branch. I renamed a bunch of things in the code - namely loose internal Vk references; there are no changes in the configuration struct. I also corrected some warnings, but there were no structural changes. I have verified that the code builds and works for all backends, and I will merge it into the main branch tomorrow - so if you can run the tests once again before then, that would be really good.

Best regards,
Dmitrii

@vincefn (Owner) commented Jul 31, 2023

OK, I'll relaunch the tests later.

Can you bump the version returned by VkFFTGetVersion?

@DTolm (Author) commented Jul 31, 2023

Sure, I will do that before the merge tomorrow.

@vincefn (Owner) commented Aug 1, 2023

Hi @DTolm - most tests are finished and the others seem well on their way to success.

I just have a single pycuda memory error, on the V100 and the GTX 1080, which I don't fully understand - but that is on my side.

Even better news: I managed to add support for the arbitrary number of dimensions (see the https://github.com/vincefn/pyvkfft/tree/max_fft_dims branch) :-)! The generic multi-dimensional test now includes up to 5 dimensions for C2C, with all possible permutations of the non-transformed axes (it takes a little while...).

I have not extended the systematic tests in the suite, as I don't think it would be useful (I assume that transforms with more than 3 axes rely on exactly the same code as the 3rd axis, but let me know if I'm wrong).

I've set the default VKFFT_MAX_FFT_DIMENSIONS to 8 for pyvkfft, configurable when installing.

@DTolm (Author) commented Aug 1, 2023

I have not extended the systematic tests in the suite, as I don't think it would be useful (I assume that transforms with more than 3 axes rely on exactly the same code as the 3rd axis, but let me know if I'm wrong).

Yes, this is correct - there are no more hard-coded 3s anywhere, and all loop indices depend on VKFFT_MAX_FFT_DIMENSIONS. So a small number of tests should be sufficient.

A configurable VKFFT_MAX_FFT_DIMENSIONS is mostly for innermost batching, but maybe someone will find other uses for it.
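
For reference, a sketch of both uses (assuming compilation with, e.g., -DVKFFT_MAX_FFT_DIMENSIONS=8 as in pyvkfft; the sizes are arbitrary examples):

/* a 5D c2c plan, guru-style: */
VkFFTConfiguration configuration = {0};
configuration.FFTdim = 5;
for (int i = 0; i < 5; i++) configuration.size[i] = 16;

/* innermost batching: declare an N+1 dim plan and omit the innermost axis,
   as described in the merge commit below */
configuration.FFTdim = 4;             /* 3D transform + 1 extra innermost axis */
configuration.size[0] = 4;            /* innermost batch (the stride-1 axis) */
configuration.size[1] = 256;
configuration.size[2] = 256;
configuration.size[3] = 256;
configuration.omitDimension[0] = 1;   /* do not transform the innermost axis */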

I will update the version number and merge the code later today. Big thanks for helping refine it!

DTolm added a commit to DTolm/VkFFT that referenced this issue Aug 1, 2023
-Major library design change - from a single-header to a multiple-header approach, which improves structure and maintainability. Now, instead of copying a single file, the user has to copy the vkFFT folder contents.
-VkFFT has been rewritten to follow the multiple-level platform structure described in the VkFFT whitepaper. All algorithms have been split into respective files, which should make the library design easier for everybody to understand. Multiple code-duplication sites have been restructured and unified (mainly the read/write part of the kernels and the pre/post-processing).
-All math operations and most variables have been abstracted into a union container that can hold either numbers or variable names. Not a full compiler, but the generated code is close to machine-like. There are no math sprintf calls in the actual code generator now. More details can be found here: https://youtu.be/lHlFPqlOezo
-VkFFT now supports an arbitrary number of dimensions. By defining VKFFT_MAX_FFT_DIMENSIONS, it is possible to mimic the FFTW guru interface. Default: 4. The innermost stride is always fixed to 1, but there can be an arbitrary number of outer strides. To achieve innermost batching, initialize an N+1-dimensional FFT and omit the innermost axis using omitDimension[0] = 1.
-Enabled fp16 for all backends.
-Accuracy verification of the new version can be found here: vincefn/pyvkfft#25
-The new code structure will facilitate the implementation of many new features and performance improvements, so stay tuned.
@DTolm (Author) commented Aug 1, 2023

Ok, I am still not that proficient with GitHub, so I am not sure why it closed this issue automatically. But the changes have been merged into the master branch, although the process did not go as smoothly as I would have wanted.

Best regards,
Dmitrii

@vincefn (Owner) commented Aug 1, 2023

Apparently it's your commit DTolm/VkFFT@cc410b1, whose title says that it fixes this issue (#25), which automatically triggered the close when you merged (https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue).

It can be surprising at times, but in this case it was the right moment!

Cheers,
