
Horizontal domain decomposition for parametrizations #83

Closed
milankl opened this issue May 20, 2022 · 10 comments
Labels
parallel 🐎 Things being computed in parallel performance 🚀 Faster faster!

Comments

@milankl
Member

milankl commented May 20, 2022

One discussion we need to have is whether the parametrizations loop in the vert, lat, lon order in which the arrays are laid out in memory (as we do for the dynamics, where most dependencies are in the horizontal and only(?) the geopotential couples the vertical levels), or whether the parametrizations should loop lat, lon, vert: they act on columns, so we could decompose the domain in the horizontal, pass vertical views to the functions, and parallelise that way. I believe the get_parametrization_tendencies function could literally just do

@distributed for j in 1:nlat, i in 1:nlon
    temp_column = view(temp_grid, i, j, :)
    ...
    get_parametrization_tendencies(temp_column, ...)
end

and hence every grid point would get its own thread and we'd have just one place where the horizontal domain decomposition happens.

@white-alistair copying over from zulip
@samhatfield do you think this is a reasonable idea for speedy?

@milankl milankl added performance 🚀 Faster faster! parallel 🐎 Things being computed in parallel labels May 20, 2022
@milankl
Member Author

milankl commented May 20, 2022

I'd guess it would even make sense to drop the vertical view in favour of an actual deepcopy, so that the non-contiguous memory access happens only once. In the end we only have a few prognostic variables to copy that way, and given that we'll use far fewer processes than nlon*nlat we'll never crowd the memory with a full copy of all prognostic variables (assuming the garbage collector collects before the end of the for j=1:nlat, i=1:nlon loop).
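For example, the trade-off could be checked with a micro-benchmark along these lines (a sketch only, with hypothetical sizes matching T31 and 8 levels; BenchmarkTools assumed, none of this is SpeedyWeather code). Note that for a plain array of numbers an ordinary getindex copy already yields contiguous data, so a full deepcopy isn't strictly required.

using BenchmarkTools

nlon, nlat, nlev = 96, 48, 8
temp_grid = randn(nlon, nlat, nlev)           # stand-in for a prognostic variable

column_view(A, i, j) = sum(view(A, i, j, :))  # every access is strided
column_copy(A, i, j) = sum(A[i, j, :])        # one strided gather into a contiguous vector

@btime column_view($temp_grid, 45, 17)
@btime column_copy($temp_grid, 45, 17)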

@samhatfield
Contributor

I guess my first question is: why not change all loops, including in the dycore, to lat, lon, vert?

@milankl
Member Author

milankl commented May 24, 2022

Because we probably want the costly spherical transforms to act on arrays that are contiguously laid out in memory. For them the loop order should therefore be vert (outer loop) and then horizontal (inner), which means arrays of size nlon, nlat, nlev (column-major) in grid space or lmax, mmax, nlev in spectral space. Given that the dycore is costly, involves a lot of global communication, and nlon*nlat >> nlev, I'd rather have memory access and loop order optimized for that.
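To illustrate the layout argument (a sketch with hypothetical sizes; spectral_transform! and spec are placeholders, not the actual API): with an nlon, nlat, nlev array each horizontal layer is one contiguous block in memory, so the dycore can keep the vertical as the outer loop.

nlon, nlat, nlev = 96, 48, 8
grid = randn(nlon, nlat, nlev)              # column-major: lon varies fastest, lev slowest

for k in 1:nlev                             # vertical as the outer loop
    layer = view(grid, :, :, k)             # contiguous nlon x nlat block in memory
    # spectral_transform!(spec[k], layer)   # placeholder for the costly transform
end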

@white-alistair
Member

Based on what I've seen of the parameterisations so far, I think it could make a lot of sense to parallelise in the horizontal.

Regarding the view versus deepcopy thing, this section of the Julia performance tips would seem to support a deepcopy, but in any case we can just do some tests.

given that we'll use far fewer processes than nlon*nlat

Can you explain this please?

@milankl
Member Author

milankl commented May 26, 2022

At the default T31 resolution we have 96*48 = 4608 grid points in the horizontal and 8 levels in the vertical. If we attempted to parallelize across the vertical with 8 processors, then each processor would take care of one level. For the physics, however, we would tell 8 processors to share the load of calculating the parametrizations for the 4608 vertical columns, which are independent of each other. So we could put @distributed in front of the for i in 1:nlon, j in 1:nlat loop. A given processor would first copy the 8 vertical levels from the nlon x nlat x nlev prognostic variable arrays into a new array (for contiguous memory access) that's not shared with other processors, do all the calculations, and then write the total physics tendencies back into an nlon x nlat x nlev array (that's shared again).

Given that 8 << 4608 we'll never actually hold a full copy of the prognostic variables, as the next copy of a vertical column can simply overwrite the previous one.
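A minimal sketch of that plan, assuming Distributed plus SharedArrays for the shared arrays (column_physics! is a placeholder for the actual parametrizations, sizes as above):

using Distributed, SharedArrays
addprocs(8)

nlon, nlat, nlev = 96, 48, 8
temp_grid = SharedArray{Float64}(nlon, nlat, nlev)   # prognostic variable, shared across workers
temp_tend = SharedArray{Float64}(nlon, nlat, nlev)   # physics tendencies, shared again

@everywhere column_physics!(tend, col) = (tend .= -0.1 .* col)   # placeholder parametrization

@sync @distributed for ij in 1:nlon*nlat
    i, j = Tuple(CartesianIndices((nlon, nlat))[ij])
    temp_column = temp_grid[i, j, :]          # copy the column once: contiguous, worker-local
    tend_column = similar(temp_column)
    column_physics!(tend_column, temp_column)
    temp_tend[i, j, :] = tend_column          # write back into the shared tendency array
end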

@samhatfield
Contributor

As an archetypal single-column physics scheme, the convection scheme from SPEEDY might be a nice first target for porting to Julia. This produces just tendencies of specific humidity and (I think) enthalpy (i.e. temperature with a trivial conversion). As you can see from the code, there are two big loops over ngp so it fits exactly the pattern that we are talking about.

@milankl
Member Author

milankl commented May 31, 2022

@white-alistair is currently working on that in #82, which may also be a good testbed for the domain decomposition once we reach that stage.

@white-alistair
Member

Yeah, LSC would also work as a test-case.

Maybe we keep this issue separate (keen to avoid scope creep on that PR...) and then do some parallelisation experiments before proceeding with other parameterisations? In any case, once we've settled on an approach, I don't think it's a lot of work to refactor whatever parameterisations have already been implemented at that point.

@milankl
Member Author

milankl commented Aug 15, 2022

#123 implements this idea ☝🏼 roughly as

column = ColumnVariables{NF}(nlev=diagn.nlev)

for ij in eachgridpoint(diagn)      # loop over all horizontal grid points

    reset_column!(column)           # set accumulators back to zero for next grid point
    get_column!(column,diagn,ij,G)  # extract an atmospheric column for contiguous memory access

    # calculate parametrizations
    large_scale_condensation!(column, M)
    ...

    # write tendencies from parametrizations back into horizontal fields
    write_column_tendencies!(diagn,column,ij)
end

So this is the only loop over the horizontal that could be @distributed, as every iteration is independent of the others. However, every worker should have its own instance of column to avoid conflicts. Otherwise this is also independent of the grid (see #112) as eachgridpoint(diagn) just loops over all grid points (regardless of their arrangement) with a single index.
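One way to give every worker its own column is to chunk the grid points and allocate one ColumnVariables per chunk, e.g. with Base threads (a sketch only, reusing the names from the snippet above; the chunking is ad hoc):

using Base.Threads: @threads, nthreads

points = collect(eachgridpoint(diagn))                        # all horizontal grid points
chunks = Iterators.partition(points, cld(length(points), nthreads()))

@threads for chunk in collect(chunks)
    column = ColumnVariables{NF}(nlev=diagn.nlev)             # one column instance per chunk
    for ij in chunk
        reset_column!(column)
        get_column!(column, diagn, ij, G)
        large_scale_condensation!(column, M)
        write_column_tendencies!(diagn, column, ij)
    end
end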

@milankl milankl added this to the v0.5 milestone Aug 31, 2022
@milankl
Member Author

milankl commented Apr 5, 2023

We now use multithreading through @floop instead of @distributed, but otherwise the idea is as above ☝🏼
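Roughly, the @floop version of the loop above looks like this (a sketch; @init from FLoops.jl is assumed here for a task-local column and may differ from the actual implementation):

using FLoops

@floop ThreadedEx() for ij in eachgridpoint(diagn)
    @init column = ColumnVariables{NF}(nlev=diagn.nlev)   # task-local column, reused across iterations
    reset_column!(column)
    get_column!(column, diagn, ij, G)
    large_scale_condensation!(column, M)
    write_column_tendencies!(diagn, column, ij)
end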

Benchmarking this (T31, Float64, full F24 grid, 8 levels, Held-Suarez forcing as an example):

nthreads    time     speedup
1           2.6ms    1x
2           1.4ms    1.9x
4           755μs    3.4x
8           400μs    6.5x
16          255μs    10.2x

At higher resolution (T127, 31 levels)

nthreads    time      speedup
1           194ms     1x
2           100ms     1.94x
4           54.6ms    3.55x
8           28ms      6.9x
16          15.5ms    12.5x

and similar scaling for T511, 63 levels

nthreads    time     speedup
1           6.6s     1x
16          520ms    12.8x

@white-alistair not surprising, but good to see. I believe we don't reach 16x because reading from the same underlying data becomes the bottleneck.

As a comparison, using nthreads=16 and the OctaHEALPixGrid (12.5% fewer grid points, but cubic instead of quadratic truncation), the T127 15.5ms becomes 14ms, which is just the linear scaling with the number of grid points. With Float32 this drops again to

julia> p,d,m = initialize_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=31,Grid=OctaHEALPixGrid);

julia> @btime SpeedyWeather.parameterization_tendencies!($d,$t0,$m);
  9.951 ms (145 allocations: 83.84 KiB)

which is a speedup of 1.4x going from Float64 to Float32 for the parameterizations.

@milankl milankl closed this as completed May 13, 2023