VkFFT 1.3 #25
Hi @DTolm, thanks for the heads-up, I'll try to adapt pyvkfft in a few days.
Maybe one quick comment regarding the organisation of the headers: splitting the different parts is a good thing, it will be more usable that way. Now you are using
where This is a little unusual as it requires copying not one directory (vkFFT) with all the headers, but instead copying both This is just a suggestion, but there would be two simpler ways of doing this:
This has the advantage of only requiring a single copy or link of the vkFFT include folder.
I think the latter is a much more standard way of organising the include directories - and more convenient, with one top include header and one directory. I've also put @picca in Cc, who maintains the VkFFT Debian package, for further advice.
Dear @vincefn, I am happy to change the codebase directory layout to be more in line with other header-only libraries - so probably the second approach is the easiest? As for the performance regressions - it is likely related to experiments with the dispatch threadblock size. They can improve some systems while making others worse, and it is hard to decide the best size automatically for all systems. Currently, this decision-maker is located on lines 447-773 of the vkFFT_Plan_FFT file and should probably be moved to a separate file as well. I will check what happens with the 512^3 system and try to make some improvements. Best regards,
I think this would be best. But please get another opinion before committing - I don't know if @picca has an opinion on this.
Could it make sense to make this tunable - changing the threadblock size using an optional parameter? You could imagine using this like FFTW with its FFTW_ESTIMATE and FFTW_MEASURE - it would be the task of the calling library (e.g. pyvkfft) to test the performance of different sizes. I don't know how deep this is in the code and so whether that would be possible. Maybe it's already tunable through e.g.
Regarding the speed issue, here are some comparison benchmarks: The A2000 has no issues in 2D (at least up to 1024) but a decrease in 3D:
That's unexpected, I will investigate the reason today and add the update to the 1.3.0 branch. Thanks for finding it out!
Hello, just about the include files. I prefer the 2nd solution. You have to decide for your users what the preferred include directive is
Then you decide where the include files are installed by default (for example, with the autotools it is under something like
This is the way the gtk libraries are installed. Usually a library comes with a pkg-config file which allows finding the -I{dir} where the include files are expected. This is particularly important if you select the versioned directory structure, because this directory is not part of the default include search paths. Then using pkg-config we have
The install script should be compatible with the pkg-config file generated during the build. I am not a cmake specialist, but I think it is identical there. For now I would keep the 2nd solution proposed by Vincent, since using two versions of the library at the same time is not something expected (I think). Do not hesitate to ask further questions if I was not clear enough. Frederic
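To make the layout discussion concrete, here is a small illustration; the package name (vkfft), the paths, and the version number are hypothetical placeholders rather than the actual install layout, and the only real assumption is the standard pkg-config --cflags/--libs mechanism described above.

```c
/* The include directive the end user writes stays the same in both cases: */
#include "vkFFT.h"

/* Unversioned layout (e.g. headers under /usr/include/vkfft/): the directory
 * is either on the default search path or added with a single -I flag.
 *
 * Versioned layout (e.g. headers under /usr/include/vkfft-1.3/): the directory
 * is NOT on the default search path, so the compiler flags come from the
 * library's pkg-config file, e.g.
 *     cc my_app.c $(pkg-config --cflags --libs vkfft)
 */
```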
Hi @picca - thanks for the feedback. I don't think versioning would be needed for VkFFT, so I guess solution 2) above would be better - and it does not change anything for the end user, just using
Hi @DTolm - here are longer 2D benchmarks. The decrease in performance is largely localised in the length < 1024 region (Titan V). The A2000 throughput is remarkably stable over a wide range.
@picca Sounds good, I will switch to solution 2 without versioning in one of the next develop branch updates. @vincefn I have identified the issue with the 512^3 system - it is again related to distant coalesced memory accesses in the z-axis FFT. I changed the logic for it between 1.2.21 and 1.2.33 and it no longer coalesces as much as possible, even though that is the better approach to this problem. I will need to rethink the logic once again and maybe make a small autotuner for this. As for the Titan V results, systems < 1024^2 take < 8 MB, are really dependent on the L2 cache and can be greatly affected by background tasks. I couldn't verify the discrepancy on an RTX 2080. The Titan V chip is essentially the same as the V100 with some SMs disabled, so I am not sure why there is such a difference (VkFFT surely produces the same binaries for them). I will try to investigate the drop between 1024 and 1536, which also happens when the SM uses between 64 and 96 KB of shared memory for the y axis. The A2000 has such low bandwidth for its chip that it is just never compute-limited (the Ampere architecture also switched to merged fp32/int cores, which greatly helps in the case of FFTs).
Hi @DTolm, I have begun playing around with advanced VkFFT parameters to see if this could easily be used to bake more optimal plans. There is now a Some preliminary interesting results (only using radix 2 & 3): On my MacBook's M1 - with OpenCL, lowering On a Titan V (using CUDA), increasing And finally for a V100 (also CUDA) also tweaking I like this very much - as you said, finding optimal parameters can be very tricky given the diversity of GPU configurations, so searching for the best options could easily be done by a higher-level library like pyvkfft.
@vincefn This study is also very interesting from the perspective of understanding the GPU. I have added access to the batching parameter - groupedBatch. Batching here means how many FFTs are done per single kernel by a dedicated threadblock. It will try to force the user-provided batching value for each of the three axes. It has an effect on how many threads are allocated, the read/write pattern, coalescing, etc. I also attach a simple search routine that looks for an optimal parameter on a discrete grid of batches - it is slow and unoptimized, but still improves the results. Below are the results for an Nvidia A100: In the text file, you can see the batching parameters that the routine has found.
As for the increase of coalesced memory to 256b - it solves the issues of big systems (as they benefit from more batching), but will be detrimental to small systems (as some compute units will have no work). Also, it shortens the single upload size in VkFFT, which can also increase the number of memory transfers. So there has to be a logical explanation for when to batch more.
In the 50x512x512 case, for the 1st-axis FFT of length 512, it will determine how many of the 50x512 total FFTs are executed per threadblock (usually 1-16). It is also indirectly affected by coalescedMemory and aimThreads (but forcing groupedBatch bypasses that).
Ok, so for a single (non-batched) 3D transform of size 512x512x512, the groupedBatch parameter would not be relevant then (it's the size I was working on when I noticed the speed discrepancy). Thanks!
It will be relevant, and by increasing the groupedBatch parameter for the y and z axes you will solve the speed issue.
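To make the suggestion above concrete, here is a minimal sketch of forcing groupedBatch on the strided axes of a 512^3 plan. Only the fields named in this thread (FFTdim, size, groupedBatch) are used; the batch values are placeholders meant to be found by timing, and all device, buffer, and initializeVkFFT setup is omitted.

```c
#include "vkFFT.h"   /* assumes a backend (VKFFT_BACKEND) is selected at build time */

/* Single 512^3 C2C transform with user-forced batching on the y and z axes. */
static void fill_512cube_config(VkFFTConfiguration* configuration)
{
    configuration->FFTdim  = 3;
    configuration->size[0] = 512;
    configuration->size[1] = 512;
    configuration->size[2] = 512;
    /* groupedBatch[i]: how many FFTs along axis i a single threadblock handles
     * (usually 1-16). The values below are placeholders - the best ones are
     * GPU-dependent and should be found by timing a few candidates. */
    configuration->groupedBatch[1] = 8;   /* y axis */
    configuration->groupedBatch[2] = 8;   /* z axis */
    /* groupedBatch[0] is left untouched: the x axis is far less sensitive. */
}
```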
OK, it is slow but it works - at least using the cuda backend: with OpenCL (on an nvidia card), I easily end up with a CUDA_ERROR_INVALID_VALUE and the program aborting, so I can't really test different configurations. Results do not seem to be very sensitive to the X-axis batch parameter (I guess it makes sense when not strided).
I have not tested other backends with the groupedBatch option, only the CUDA one; I will check what is wrong with it there. For the X axis it is indeed less noticeable (only for small systems - cubes up to 100), as bigger systems already have good coalescing and thread assignment.
Dear @DTolm - regarding the However, I've begun the systematic tests on the current VkFFT develop branch (including DTolm/VkFFT#112) and I have a few calculation errors (accuracy failures in my tests) with non-radix transforms, see for example on an A40 using cuda or OpenCL: There are also a number of Otherwise I've introduced an auto-tuning option in pyvkfft to optimise parameters like Here's an example in 2D on a V100: In those cases I just test coalescedMemory at 32, 64 or 128, so it's cheap but effective :-)
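As a rough sketch of the cheap search described above: build the same plan with coalescedMemory forced to 32, 64, or 128 and keep the fastest. The time_plan callback is a hypothetical stand-in for "create the plan, run a few transforms, return the average time", which depends on the backend and is not shown; coalescedMemory is the VkFFTConfiguration field already mentioned in this thread.

```c
#include <stdint.h>
#include "vkFFT.h"   /* assumes a backend (VKFFT_BACKEND) is selected at build time */

/* Try a few coalescedMemory values for an otherwise fixed configuration and
 * return the value with the lowest measured time. */
static uint64_t tune_coalesced_memory(VkFFTConfiguration base,
                                      double (*time_plan)(VkFFTConfiguration))
{
    const uint64_t candidates[] = {32, 64, 128};
    uint64_t best = candidates[0];
    double best_time = 1e300;
    for (int i = 0; i < 3; i++) {
        VkFFTConfiguration trial = base;        /* copy the base configuration    */
        trial.coalescedMemory = candidates[i];  /* force the candidate value      */
        double t = time_plan(trial);            /* plan + a few timed executions  */
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    return best;
}
```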
-Should be fixed: Bluestein algorithm reading data out of bounds and producing errors
-Reorganized and fixed push constants
Dear @vincefn, I think I have fixed the C2C/R2C issue by changing how thread indexing works in one of the edge cases. As for the coalescedMemory tuning - can you share the results as a text file so I can try to generalize the findings? It is good that the code can be made faster by tweaking one number, but the runtime compilation is still an issue in some cases, so I am still not sure if an autotuner should be the default behavior. Best regards,
Thanks @DTolm - indeed the tests seem much better: http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-05-24-a40cu/pyvkfft-test.html (cuda/A40) Tests are ongoing but the failures for the C2C seem to be gone. However the R2C transforms still give timeout errors for very specific R2C sizes, which I don't understand. But maybe those transforms generate very large kernels which take a long time to compile (surprising, but I can try to increase the timeout from 10 to 20 s to see).
I can modify the benchmark script so that it also exports the tuned parameter values.
Sure - this will definitely remain an option. Not all GPUs seem to benefit. PS: is there a simple formula to determine when the transform uses the Rader vs the Bluestein algorithm (based on the prime decomposition)? I guess that's related to
Hmm, actually the timeout is after 120s, not 10. So it's not a question of compilation time. If I try without parallel processing (in this case on an A2000) with --serial, i.e.:
Then I get a segmentation fault when it gets to the R2C transform with size 4202. That's with cuda driver 530.30.02 (cuda 12.1) and cuda toolkit 11.7.0 (but the error also appears with the A40 and cuda driver 11.7, same toolkit).
I see, I checked these sizes on an RTX 2080, and you are running on an A40, which is Ampere and has more shared memory. These sizes are within the single-upload limit on it and behave differently compared to Turing; I will fix them later today. As for the tuning results, it would be best to have all the results so I can compare the relative gains, thank you! As for Rader's transform, the formula is: if the sequence is decomposable as a product of primes, with each prime P satisfying that P-1 is decomposable as a radix 2-13 multiplication, or P being 47, 59 or 83 (these are default values and can be tuned), and each prime is less than a single upload size, then it is Rader's algorithm. Otherwise it is Bluestein's algorithm.
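For reference, here is a small, self-contained sketch of the rule as stated above - pure CPU logic, nothing VkFFT-specific. The 2-13 radix set and the 47/59/83 exceptions are the defaults quoted in the comment, while max_prime is a hypothetical stand-in for the single-upload limit, which in reality depends on the GPU's shared memory.

```c
#include <stdio.h>

/* Returns 1 if n is fully decomposable into factors 2..13 (the radix kernels). */
static int is_radix_decomposable(unsigned long long n)
{
    const unsigned long long radix[] = {2, 3, 5, 7, 11, 13};
    for (int i = 0; i < 6; i++)
        while (n % radix[i] == 0) n /= radix[i];
    return n == 1;
}

/* Classify a 1D length following the rule above:
 * 0 = plain radix kernels, 1 = Rader's algorithm, 2 = Bluestein's algorithm. */
static int classify_length(unsigned long long n, unsigned long long max_prime)
{
    if (is_radix_decomposable(n)) return 0;
    for (unsigned long long p = 2; n > 1; p++) {
        if (n % p) continue;
        while (n % p == 0) n /= p;              /* p is a prime factor of n    */
        if (is_radix_decomposable(p)) continue; /* handled by radix kernels    */
        int rader_ok = is_radix_decomposable(p - 1) || p == 47 || p == 59 || p == 83;
        if (!rader_ok || p > max_prime) return 2;
    }
    return 1; /* every non-radix prime factor is Rader-friendly */
}

int main(void)
{
    const unsigned long long sizes[] = {255, 799, 2101, 4202};
    for (int i = 0; i < 4; i++)
        printf("%llu -> %d\n", sizes[i], classify_length(sizes[i], 4096));
    return 0;
}
```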
-Fixed mistake in calculation of used registers in C2R Bluestein
I have identified the incorrect register calculation issue in C2R and fixed it, but I have done it only by emulating the Ampere architecture, so I have only checked that the code compiles.
Looks good so far: just 1 DCT2 failure of size 4 in OpenCL - but it may be a fluke, I'll retry it separately.
OK - the cuda tests all passed on the A40, but there are some failures using OpenCL (also A40) with I re-tested all failures separately with fewer parallel processes, and you can see the log: There are specific failures with messages:
There are many people who use the C API directly, and what I was trying to say is that they would also like to have access to improved performance through kernel tuning. And I am not sure yet how to provide them with such functionality (also because of the reasons you mentioned, like allocation control). As for the release, I will do the limited tests with all the APIs and then do a release in a week if all goes smoothly.
It seems I found a very small corner case not working: on the AMD gfx900, the float64 DCT4 fails for 2D transforms of sizes (255, 255) and (799, 799), as well as 3D size (255, 255, 255)... See http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2023-07-06-amdgfx900/pyvkfft-test.html That's really a corner case - I don't put a lot of importance on DCT, so not a priority. Incidentally, I'm having a lot of trouble finishing that test suite on the gfx900 card. Somehow, once the test suite reaches the non-radix tests, the card never runs for more than 2-3 hours before completely crashing the GPU (rocm-smi crashes) and needing a reboot... Maybe issues with the driver version.
@vincefn I couldn't reproduce the DCT errors with the same setup I used for the R2C errors. May I ask you once again to send the generated code, so I can verify that we run the same program? As for the crashes, I also have no idea what could have happened. OpenCL has not been a priority for vendors for quite a while, so there is a non-zero chance of driver issues appearing.
Attached is the code for the 255x255 DCT4 float64. Interestingly, the actual calculation seems to hang... Haven't tried the others.
Yes, the exact same code works on a Radeon Pro VII on Windows. Weird.
@vincefn I had some time to refactor the code a bit more and added support for an arbitrary number of dimensions by defining VKFFT_MAX_FFT_DIMENSIONS. Now it is possible to fully mimic the FFTW guru interface (apart from the innermost batched R2C/C2R, which I guess can only be done out of place). The implementation should be compatible with all previous versions (I just replaced all the [3] arrays and so on with [VKFFT_MAX_FFT_DIMENSIONS] ones in the interface), but it would be good to rerun the tests again to be sure. I will need to update the documentation and then do a full release. I have also added fixes for FFTs of length 1 and DCT-IV of length 2. Best regards,
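Below is a minimal sketch of what the extended interface might look like, based only on the description above and the release notes further down (VKFFT_MAX_FFT_DIMENSIONS, size[], omitDimension[0]). It assumes the macro can be set as a build-time definition and that a backend is selected; device, queue, and buffer setup plus the initializeVkFFT call are omitted.

```c
/* Assumption: VKFFT_MAX_FFT_DIMENSIONS can be overridden at build time,
 * e.g. -DVKFFT_MAX_FFT_DIMENSIONS=5, before vkFFT.h is included. */
#include "vkFFT.h"

static VkFFTConfiguration make_4d_config(void)
{
    VkFFTConfiguration configuration = {0};
    configuration.FFTdim  = 4;    /* 4D C2C transform, FFTW-guru style */
    configuration.size[0] = 32;
    configuration.size[1] = 32;
    configuration.size[2] = 32;
    configuration.size[3] = 32;
    /* Innermost batching, as described in the release notes: add one extra
     * innermost axis and skip its transform, e.g.
     *   configuration.FFTdim = 5;
     *   configuration.size[0] = nBatchInner;   (real sizes shift up by one)
     *   configuration.omitDimension[0] = 1;
     */
    return configuration;
}
```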
Looks very interesting! I'll try to re-run the tests (at a conference next week, then vacations...). I probably won't try to use the arbitrary number of dimensions for the next pyvkfft release - it will need some work and additional unit tests. Cheers,
Ok, I tried some quick tests, but it generates a bunch of errors (C2C, DCT, etc.). Maybe some simple changes are needed - can you take a look? I don't know if you can also run the pyvkfft test on your side - this is just the default basic test on a few dimensions. Log attached:
Dear @vincefn, I think I found the problem. Starting with this update, specifying configuration.size[1] as an alternative to configuration.numberBatches with configuration.FFTdim==1 no longer works, because the loop indexing for the kernel launch size is now tied to the specified configuration.FFTdim. I am not sure it is a good idea to have a workaround just for this, as the number of dimensions is now a compiler constant. Is this the case in the pyvkfft test code? If so, can you change the initialization of these systems to be done with configuration.numberBatches? Best regards,
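A minimal sketch of the change in plan setup, using only the fields named in the comment above (FFTdim, size, numberBatches); the rest of the plan creation is omitted.

```c
#include <stdint.h>
#include "vkFFT.h"   /* assumes a backend (VKFFT_BACKEND) is selected at build time */

/* Batched 1D FFT: many transforms of length 4096. */
static void fill_batched_1d(VkFFTConfiguration* configuration, uint64_t n_batch)
{
    configuration->FFTdim  = 1;
    configuration->size[0] = 4096;
    /* Old style (no longer supported with FFTdim == 1):
     *   configuration->size[1] = n_batch;                   */
    configuration->numberBatches = n_batch;   /* batching now goes here */
}
```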
Ok - currently
but it should rather return |
Ok, by using
Besides the F-ordered array issues, I get (at least in this quick test) another issue for standard C arrays:
So in that case I have an array of shape
-Fixed double check of omitDimension[0]
Whoops, my bad, I was increasing the id of the first axis twice when counting configuration.omitDimension[0]. The check that the first axis id is not bigger than the last one failed in the first comment's case. For the second one, this id is also used for buffer ordering, not just the execution decision, and, I guess, it read from the output buffer instead of the input one in the forward FFT case, while passing the correct sequence in the iFFT case.
Great, the quick test with all the dimensions/strides/axes/norms passes. I'll launch the complete test suite later.
Hi @DTolm, the tests are looking good - do you want to merge and tag this for a release?
Dear @vincefn, I have uploaded the final snapshot of the develop branch. I renamed a bunch of things in the code - namely loose internal Vk references - with no changes to the configuration struct. I also corrected some warnings, but there were no structural changes. I have verified that the code builds and works for all backends and will merge it to the main branch tomorrow - so if you can run the tests once again before then, it would be really good. Best regards,
OK, I'll relaunch the tests later. Can you bump the version returned by
Sure, I will do that before the merge tomorrow.
Hi @DTolm - most tests are finished and the others seem to be well on their way to success. I just have a single pycuda memory error for the V100 and the GTX 1080, which I don't fully understand, but that is on my side. Even better news: I managed to add support for the arbitrary number of dimensions (see the https://github.com/vincefn/pyvkfft/tree/max_fft_dims branch) :-)! The generic multi-dimensional test now includes up to 5 dimensions for c2c, with all possible permutations for the non-transformed axes (it takes a little while...). I have not extended the systematic tests in the suite as I don't think it would be useful (I assume that transforms with more than 3 axes rely on exactly the same code as for the 3rd axis, but let me know if I'm wrong). I've set the default for
Yes, this is correct - there are no more hard-coded 3s anywhere and all loop indices depend on VKFFT_MAX_FFT_DIMENSIONS, so a small number of tests should be sufficient. A configurable VKFFT_MAX_FFT_DIMENSIONS is mostly for innermost batching, but maybe someone will find it useful. I will update the version number and merge the code later today. Big thanks for helping to refine it!
-Major library design change - from a single-header to a multiple-header approach, which improves structure and maintainability. Now, instead of copying a single file, the user has to copy the vkFFT folder contents.
-VkFFT has been rewritten to follow the multiple-level platform structure described in the VkFFT whitepaper. All algorithms have been split into respective files, which should make the library design easier for everybody to understand. Multiple places with code duplication have been restructured and unified (mainly the read/write part of kernels and pre/post-processing).
-All math operations and most variables have been abstracted into a union container approach that can contain either numbers or variable names. Not a full compiler, but the generated code is close to machine-like. There are no math sprintf calls in the actual code generator now. More details can be found here: https://youtu.be/lHlFPqlOezo
-VkFFT now supports an arbitrary number of dimensions. By defining VKFFT_MAX_FFT_DIMENSIONS, it is possible to mimic the fftw guru interface. Default is 4. The innermost stride is always fixed to 1, but there can be an arbitrary number of outer strides. To achieve innermost batching, initialize an N+1 dim FFT and omit the innermost one using omitDimension[0] = 1.
-Enabled fp16 for all backends.
-Accuracy verification of the new version can be found here: vincefn/pyvkfft#25
-The new code structure will facilitate the implementation of many new features and performance improvements, so stay tuned.
Ok, I am still not that proficient with github, so I am not sure why it closed this issue automatically. But the changes have been merged to the master branch, although the process did not go as smoothly as I would have wanted. Best regards,
Apparently it's your commit DTolm/VkFFT@cc410b1, whose title says that it fixes this (#25), which automatically triggered the close when you merged (https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue). It can be surprising at times, but in this case it seems to be the right moment! Cheers,
Dear @vincefn,
I have made some substantial changes to VkFFT in version 1.3 (https://github.com/DTolm/VkFFT/tree/develop), so there will be a two-month period before it is merged into the main branch, for people to adjust the dependent projects. Namely, VkFFT is no longer a single header file, but rather a collection of headers. This should not increase the complexity of usage - you still include the vkFFT.h file and it includes the other files. The main advantage is that the code base is now more structured and much easier for other developers to understand.
I have tested the code on some systems with the implemented benchmark scripts - RTX2080 (Vulkan, CUDA, OpenCL), MI250 (HIP), A100 (CUDA), UHD610 (Level Zero) and M1 Pro (Metal) - however, your suite is more thorough in this regard. Also, you might be interested in exploring the new design.
I suppose keeping an issue open for this period can be helpful for discussion.
Best regards,
Dmitrii