I can't believe I missed this!
In `transeq`, at the very end, we do:
du_x = du_x + du_y + du_z
However, due to the restrictions on running GPU kernels with specific thread and block dimensions, we carry out this operation in two separate calls:
du_x = du_x + du_y
du_x = du_x + du_z
This gives an opportunity to move the y-to-x sum up to just below the `transeq_y` call. We then release `du_y`, `dv_y`, and `dw_y` right after adding them into their `_x` counterparts.
This basic fix reduced memory usage from 18 scalar fields to 15 without affecting performance at all for the CUDA backend (15 GiB for a $512^3$ simulation). The total excludes the Poisson solver's memory requirement, since the solver is not yet in the codebase.
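To illustrate why releasing the y-direction fields early lowers the peak, here is a toy sketch in Python. The `FieldPool` class and its `get`/`release` methods are hypothetical stand-ins for a backend field allocator, not the actual interface, and the x, y, z call order is an assumption. The sketch only tracks the nine derivative fields, so its counts are not the full 18 and 15 quoted above; only the difference of three is meant to match.

```python
class FieldPool:
    """Hypothetical allocator that only counts live scalar fields."""

    def __init__(self):
        self.live = set()
        self.peak = 0

    def get(self, *names):
        self.live.update(names)
        self.peak = max(self.peak, len(self.live))

    def release(self, *names):
        self.live.difference_update(names)


X = ("du_x", "dv_x", "dw_x")
Y = ("du_y", "dv_y", "dw_y")
Z = ("du_z", "dv_z", "dw_z")


def sum_at_end():
    """Old ordering: both sums happen after all three transeq calls."""
    pool = FieldPool()
    pool.get(*X)      # transeq_x (assumed to run first)
    pool.get(*Y)      # transeq_y
    pool.get(*Z)      # transeq_z
    pool.release(*Y)  # after du_x = du_x + du_y
    pool.release(*Z)  # after du_x = du_x + du_z
    return pool.peak


def sum_after_transeq_y():
    """New ordering: the y sum and release happen right after transeq_y."""
    pool = FieldPool()
    pool.get(*X)      # transeq_x
    pool.get(*Y)      # transeq_y
    pool.release(*Y)  # after du_x = du_x + du_y
    pool.get(*Z)      # transeq_z can now reuse the released storage
    pool.release(*Z)  # after du_x = du_x + du_z
    return pool.peak


print(sum_at_end(), sum_after_transeq_y())  # prints "9 6"
```

The peak drops by three fields because the `_y` and `_z` derivative fields are never live at the same time.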
Any further reduction in memory usage beyond this point would increase the runtime of the simulation. For example, we might be able to get down to 12 scalar fields, which shouldn't be that hard, but it would require some extra reordering operations, and I think it's better not to work on that at this stage.
Assuming that an FFT-based Poisson solver will require roughly 4 scalar fields' worth of memory, we should be able to fit a $1024^3$ simulation on a typical 4xA100 node!
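The arithmetic behind these figures can be checked with a short Python sketch. It assumes double precision (8 bytes per value) and 40 GiB of memory per A100, neither of which is stated above, and it ignores padding, halo regions, and any temporary buffers.

```python
BYTES_PER_VALUE = 8  # assumption: double precision
GIB = 2**30


def fields_gib(n, n_fields):
    """Memory in GiB for n_fields scalar fields on an n^3 grid."""
    return n**3 * BYTES_PER_VALUE * n_fields / GIB


# Current state: 15 scalar fields on a 512^3 grid.
print(fields_gib(512, 15))  # 15.0

# Projection: 15 fields plus ~4 for the Poisson solver on a 1024^3 grid.
projected = fields_gib(1024, 15 + 4)
node_capacity = 4 * 40  # assumption: 4 GPUs with 40 GiB each
print(projected, node_capacity, projected <= node_capacity)  # 152.0 160 True
```

Under these assumptions the $1024^3$ case leaves about 8 GiB of headroom across the node, so the estimate is tight on 40 GiB cards.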
We now have separate `sum_yintox`, `sum_zintox`, and `vecadd` subroutines in the backends, all similar at some level. These can all be combined into a single subroutine, as we did with the reorder subroutines, and that's what I'll do next. I'll create an issue to discuss this further and won't include that step in the current PR. I'm happy to merge this one as soon as someone approves.
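As a starting point for that discussion, a combined routine might look roughly like the Python sketch below. Everything here is hypothetical: the `sum_into_x` name, the `layout` argument, and the flat index formulas are toy assumptions chosen for illustration and do not reflect the actual data layouts or interfaces in the backends.

```python
def flat_index(i, j, k, dims, layout):
    """Toy layouts: the named direction varies fastest in memory."""
    nx, ny, nz = dims
    if layout == "x":
        return i + nx * (j + ny * k)
    if layout == "y":
        return j + ny * (i + nx * k)
    if layout == "z":
        return k + nz * (i + nx * j)
    raise ValueError(f"unknown layout: {layout}")


def sum_into_x(u_x, u_other, dims, layout):
    """Add u_other (stored in `layout`) into u_x (stored in x layout).

    layout="x" is a plain vector add; "y" and "z" fold the reorder into the sum.
    """
    nx, ny, nz = dims
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                dst = flat_index(i, j, k, dims, "x")
                src = flat_index(i, j, k, dims, layout)
                u_x[dst] += u_other[src]


def make_field(dims, layout, value):
    """Build a flat field holding value(i, j, k) in the given layout."""
    nx, ny, nz = dims
    data = [0] * (nx * ny * nz)
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                data[flat_index(i, j, k, dims, layout)] = value(i, j, k)
    return data


dims = (2, 3, 4)
value = lambda i, j, k: 100 * i + 10 * j + k

du_x = make_field(dims, "x", value)
du_y = make_field(dims, "y", value)
du_z = make_field(dims, "z", value)

sum_into_x(du_x, du_y, dims, "y")  # du_x = du_x + du_y
sum_into_x(du_x, du_z, dims, "z")  # du_x = du_x + du_z

expected = [3 * v for v in make_field(dims, "x", value)]
print(du_x == expected)  # True
```

The point of the sketch is only that one entry point with a layout selector can cover all three cases; how that maps onto thread and block dimensions on the GPU is left for the issue.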