
In fv3atm: convert GFS DDTs from blocked data structures to contiguous arrays #2183

Status: Open. Wants to merge 57 commits into base: develop.
Conversation

@climbfuji (Collaborator) commented Mar 11, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on Hera, Derecho, or Hercules
  • Commit 'test_changes.list' from the previous step

Description:

This PR updates the submodule pointers for fv3atm, gfdl_atmos_cubed_sphere, and ccpp-physics for the changes described in the associated PRs below: convert internal GFS DDTs from blocked data structures to contiguous arrays. This excludes the (external) GFS_extdiag and GFS_restart DDTs.
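For readers unfamiliar with the two layouts, here is a minimal, hypothetical Fortran sketch; the module, type, and field names (ddt_layout_sketch, blocked_ddt_t, contiguous_ddt_t, t2m, linear_index) are illustrative, not the actual fv3atm definitions. It contrasts the old blocked storage (one DDT instance per block) with the new contiguous storage (one array over all columns) and shows how a (block, in-block) index pair maps to a single linear index.

module ddt_layout_sketch
  implicit none

  ! Old layout: one DDT instance per block, each holding a short per-block array.
  type :: blocked_ddt_t
    real, allocatable :: t2m(:)              ! ncols_in_block entries
  end type blocked_ddt_t

  ! New layout: a single DDT instance holding one array over all columns on the PE.
  type :: contiguous_ddt_t
    real, allocatable :: t2m(:)              ! ncols_total entries
  end type contiguous_ddt_t

contains

  ! blocked(ib)%t2m(ix) corresponds to contiguous%t2m(block_offset(ib) + ix)
  pure integer function linear_index(block_offset, ib, ix) result(im)
    integer, intent(in) :: block_offset(:)   ! 0-based offset of each block
    integer, intent(in) :: ib, ix
    im = block_offset(ib) + ix
  end function linear_index

end module ddt_layout_sketch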

Commit Message:

* UFSWM - In fv3atm and submodules, convert internal GFS DDTs from blocked data structures to contiguous arrays. This excludes the (external) `GFS_extdiag` and `GFS_restart` DDTs.
  * AQM - 
  * CDEPS - 
  * CICE - 
  * CMEPS - 
  * CMakeModules - 
  * FV3 - Convert GFS DDTs from blocked data structures to contiguous arrays (not including GFS_restart and GFS_extdiag DDTs)
    * ccpp-physics - Convert GFS DDTs from blocked data structures to contiguous arrays (affects `GFS_debug.{F90,meta}` only)
    * atmos_cubed_sphere - Convert GFS DDTs from blocked data structures to contiguous arrays and remove IPD_Data super DDT
  * GOCART - 
  * HYCOM - 
  * MOM6 - 
  * NOAHMP - 
  * WW3 - 
  * stochastic_physics - 

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

n/a


Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

climbfuji and others added 30 commits December 22, 2023 10:32
@jkbk2004 (Collaborator)

@climbfuji can you sync up branches? I think we can start working on this PR, as debugging is underway on the #2290 side.

@@ -20,7 +20,8 @@ endif()

if(DEBUG)
add_definitions(-DDEBUG)
set(CMAKE_Fortran_FLAGS_DEBUG "${CMAKE_Fortran_FLAGS_DEBUG} -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays")
#set(CMAKE_Fortran_FLAGS_DEBUG "${CMAKE_Fortran_FLAGS_DEBUG} -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays")
Review comment (Collaborator):

@climbfuji May I ask why we removed the "-check nopointer" option? Can we check whether there are warning messages in the DEBUG compile log file? If I remember correctly, some diagnostic field computations in the dycore could throw warning messages in certain parallel configurations, even when those fields are not requested for output.

Reply (climbfuji, Collaborator, Author):

I can revert this; it's not needed for the PR. But it is nice that we can now turn pointer checking on without the code crashing, so I would consider it a plus.

@climbfuji (Collaborator, Author)

@climbfuji can you sync up branches? I think we can start working on this PR, as debugging is underway on the #2290 side.

Yes, in a few minutes. I'll also pull in the unit testing PR that @DusanJovic-NOAA asked me to (see NOAA-EMC/fv3atm#798 (comment)).

@climbfuji (Collaborator, Author)

@jkbk2004 @DusanJovic-NOAA @junwang-noaa Everything is updated; Alex's unit testing PR is pulled into my fv3 submodule PR as well.

Note that I didn't test compiling/running after pulling in the latest code; I just pulled it in and pushed.

@FernandoAndrade-NOAA added the "Waiting for Reviews" label (the PR is waiting for reviews from associated component PRs) on Jun 27, 2024
@DusanJovic-NOAA (Collaborator)

@DusanJovic-NOAA I updated my branches from the latest developmental code, and full regression tests pass on Hera. This PR is ready to go. It would be nice to have runtime/memory comparisons of the current head of develop and this branch on acorn or wcoss2 for a representative setup for the next operational implementation, though.

@climbfuji I ran c768 24h tests on wcoss2 using this branch and develop branch (3 runs each).

For develop branch (e784814) the following are wall clock times and memory usage:

develop/out.1:The total amount of wall time                        = 1226.909006
develop/out.1:The maximum resident set size (KB)                   = 2168396
develop/out.2:The total amount of wall time                        = 1205.529417
develop/out.2:The maximum resident set size (KB)                   = 2179608
develop/out.3:The total amount of wall time                        = 1210.080763
develop/out.3:The maximum resident set size (KB)                   = 2176196

This branch (f5059f0):

chunked/out.1:The total amount of wall time                        = 1315.654338
chunked/out.1:The maximum resident set size (KB)                   = 2076900
chunked/out.2:The total amount of wall time                        = 1304.446017
chunked/out.2:The maximum resident set size (KB)                   = 2089580
chunked/out.3:The total amount of wall time                        = 1290.245111
chunked/out.3:The maximum resident set size (KB)                   = 2093072

@climbfuji (Collaborator, Author)

(quoting @DusanJovic-NOAA's c768 wall-time and memory comparison above)

Thanks for doing that. So the new code is slightly slower but uses slightly less memory. Are we OK to go ahead, knowing that there are plenty of opportunities to speed up the code? I deliberately did not remove entire block loops, because I wanted to limit the changes (to the extent possible) and keep bit-for-bit (b4b) identical results with the current code.

@DusanJovic-NOAA (Collaborator)

(quoting the c768 timing comparison and @climbfuji's reply above)

I think we can go ahead.

@climbfuji (Collaborator, Author)

The CI test to check whether the repos are up to date fails, but I am quite certain I got everything updated. A bug? See https://github.com/ufs-community/ufs-weather-model/actions/runs/9699726693/job/26770514311

@jkbk2004 (Collaborator)

The CI test to check whether the repos are up to date fails, but I am quite certain I got everything updated. A bug? See https://github.com/ufs-community/ufs-weather-model/actions/runs/9699726693/job/26770514311

@DomHeinzeller We have seen this issue with picking up the Git runner id_file. I don't understand why we need this id_file; it is probably an old feature inherited from running cloud-based tests in GitHub Actions. I think we will clean it up later. We can start testing across machines once we get all approvals on the subcomponent side. @FernandoAndrade-NOAA @BrianCurtis-NOAA FYI

@junwang-noaa (Collaborator)

(quoting the c768 timing comparison and the follow-up discussion above)

@climbfuji Since the model now runs about 7% slower, would you elaborate on speeding up the code? Can we expect future PRs to bring the model back to the current speed? We have a 6.5 min/day operational window; this slowdown may require increased resources to run the model.

@climbfuji (Collaborator, Author)

(quoting the discussion above, ending with @junwang-noaa's question about the ~7% slowdown and the 6.5 min/day operational window)

The first and most straightforward change is to remove all block loops - you'll notice that I left them in place and added additional logic to set the correct array indices.

do ib=1,nb
   do ix=1,number_of_elements_in_block
      im = offset_for_block_ib + ix
      ! do something with ddt%array(im), which was previously ddt(ib)%array(ix)
   end do
end do

This isn't needed. You could simply iterate over the entire array

do im=1,size(ddt%array)
    ! do something with ddt%array(im)
end do

or whole-array manipulations

ddt%array(:) = ...

although I've heard in the past that the latter is slower than the former (which doesn't make sense to me, and I haven't actually tested it). This change can also be made in the dycore (see NOAA-GFDL/GFDL_atmos_cubed_sphere#345) and in the CCPP physics (where applicable).

The second change is more intrusive. As you know, the I/O component (restart, diagnostics) uses blocked data structures, and I didn't dare to touch any of that. Therefore, in this set of PRs, the non-blocked GFS physics DDTs interact with the blocked GFS restart/diagnostic DDTs for the I/O component. This could (no, should) be changed so that the I/O components use non-blocked data structures as well. I don't know enough about the I/O part of the ufs-weather-model to have a good idea of how much work that requires.
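For reference, here is a minimal, hypothetical sketch of the kind of per-block copy that currently bridges a contiguous physics array and the blocked restart/diagnostic structures at I/O time; the module, type, and variable names (io_bridge_sketch, blocked_io_ddt_t, field, block_offset) are illustrative, not the actual GFS_restart/GFS_extdiag code.

module io_bridge_sketch
  implicit none

  ! Blocked structure as still used by the restart/diagnostic (I/O) side.
  type :: blocked_io_ddt_t
    real, allocatable :: field(:)            ! one per-block slice
  end type blocked_io_ddt_t

contains

  ! Scatter a contiguous physics array back into the blocked I/O structures.
  subroutine copy_contiguous_to_blocked(src, dst, block_offset)
    real,                   intent(in)    :: src(:)          ! contiguous physics array
    type(blocked_io_ddt_t), intent(inout) :: dst(:)          ! blocked I/O structures
    integer,                intent(in)    :: block_offset(:) ! 0-based offset per block
    integer :: ib, ix

    do ib = 1, size(dst)
      do ix = 1, size(dst(ib)%field)
        dst(ib)%field(ix) = src(block_offset(ib) + ix)
      end do
    end do
  end subroutine copy_contiguous_to_blocked

end module io_bridge_sketch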

@SamuelTrahanNOAA (Collaborator)

Getting back 7% of the runtime can be a lot of work, and there's no guarantee we'll get it. Considering the number of FV3 runs we do, and how big they are, a 7% resource increase would amount to a massive amount of money.

I suggest we:

  1. Try this with cases of practical size on multiple machines to be certain it is slower. Regression test speed may not be representative of practical cases.
  2. Wait to merge it until it is at least resource neutral for cases of practical size.

@climbfuji (Collaborator, Author)

(quoting @SamuelTrahanNOAA's suggestions above)

I was going to suggest at least multiple runs of the same test case, since there can be large run-to-run differences for the same code. We need this development for the transition of the CCPP framework to capgen. We could keep this in a branch, keep updating it from develop, make successive changes to the code as I described above, and then merge an even larger set of changes with likely non-b4b results into develop. Also, I am no longer affiliated with NOAA and I am doing all this on the side; my resources are limited.

@SamuelTrahanNOAA (Collaborator)

@junwang-noaa Can someone run the global-workflow with this branch, using a full-resolution case, and report back on the effects on the runtime?

@climbfuji (Collaborator, Author)

FWIW, I am running "the top part of rt.conf" (about 80 tests or so, everything until the ATM debug section starts) on Hera for develop and this branch. I'll look into runtime differences for these runs. I know the setups are not representative of the implementation target, but looking at many tests might give us some answers, too. We should also check whether the sub-timers provide more information about where the differences come from.

@climbfuji (Collaborator, Author)

@dustinswales Didn't you also compare runtimes for the ORT runs?

@climbfuji (Collaborator, Author)


Here is the info from those 71 runs:
combined.pdf
combined.xlsx

In short, averaged across the 71 runs, there is a 1.3% increase in runtime and a 0.7% decrease in memory (max resident set size). One could repeat this exercise, because I ran the chunked code during working hours and the develop code after work, but ... at least it's something.

@jkbk2004 (Collaborator)

@climbfuji 1.3% looks OK to me, but I will compare on Gaea. I think we can schedule this PR to commit before July 4. We will let #2327 and the HR4 PRs go first, until Monday or Tuesday.

@junwang-noaa (Collaborator) commented Jun 28, 2024

@junwang-noaa Can someone run the global-workflow with this branch, using a full-resolution case, and report back on the effects on the runtime?

@SamuelTrahanNOAA Dusan was testing the branch with a prototype C768 atmosphere-only case (I believe from a run directory created by the workflow). He ran the cases three times. I think it's a good idea to run the same case on other platforms to confirm the running speed. The RTs run at lower resolution; most of them are at C96.

Once the code is committed, it will go into the prototype GFS HR4 at C1152 resolution; we haven't tested this branch at that resolution yet. I think Dom's approach of keeping a separate branch and making further changes to get comparable performance would be good.

@climbfuji (Collaborator, Author) commented Jun 28, 2024

(quoting @junwang-noaa's comment above)

Yes, but note that you will not have b4b identical results if you make performance improvements like removing block loops, etc. And someone needs to maintain these separate branches (and there are many: ufs-weather-model, fv3atm, atmos_cubed_sphere, ccpp-physics). I can't commit to that.

@DusanJovic-NOAA @junwang-noaa @SamuelTrahanNOAA I have two other suggestions, though, that wouldn't take too much time to test. I suggest we do those before we merge these PRs (if we decide to merge them).

  1. For chunking up the arrays for OpenMP parallel processing, I simply used the same block size as the blocked data structures did; I actually reuse the same namelist parameter. That size was determined to give the best performance for blocked data structures, but it may not be the optimal value for chunked arrays. It would be good to run a few experiments with different block sizes. I believe the current value is 32, so testing 16, (8?), 64, 128, (...?) would give us an indicator of whether there's a better value (see the sketch after this list for how the chunking interacts with OpenMP).
  2. We need to look into where the additional time is spent. If it is in the physics, then the blocksize tuning might help. If it is in the conversion of the chunked arrays back to the blocked data structures for restart/diagnostic I/O, then we know we need to work on the restart/I/O component to change those over to contiguous arrays, too.
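As referenced in item 1, here is a minimal, hypothetical sketch of processing a contiguous array in blocksize-sized chunks under OpenMP; the routine and variable names (process_in_chunks, blocksize) are illustrative and the loop body is a placeholder, not actual fv3atm code. It shows why the existing blocksize namelist value can still matter for performance even after the DDTs are contiguous.

subroutine process_in_chunks(arr, blocksize)
  implicit none
  real,    intent(inout) :: arr(:)
  integer, intent(in)    :: blocksize        ! reused blocksize namelist value
  integer :: istart, iend

!$omp parallel do private(iend) schedule(dynamic)
  do istart = 1, size(arr), blocksize
    iend = min(istart + blocksize - 1, size(arr))
    ! Each thread handles one contiguous chunk of up to 'blocksize' columns,
    ! analogous to one block in the old blocked layout.
    arr(istart:iend) = arr(istart:iend) + 1.0
  end do
!$omp end parallel do
end subroutine process_in_chunks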

@zach1221 (Collaborator)

@climbfuji 1.3% looks OK to me, but I will compare on Gaea. I think we can schedule this PR to commit before July 4. We will let #2327 and the HR4 PRs go first, until Monday or Tuesday.

I ran rt.conf on Gaea against this PR; the compile suite atm_debug_dyn32_intel is failing while trying to build the Fortran object FV3/ccpp/physics/CMakeFiles/ccpp_physics.dir/ccpp_FV3_HRRR_gf_physics_cap.F90.o. There are warnings complaining about UFS_SCM_NEPTUNE before the build fails with an oom_kill event.
/gpfs/f5/epic/scratch/Zachary.Shrader/RT_RUNDIRS/Zachary.Shrader/FV3_RT/rt_243900/compile_atm_debug_dyn32_intel

@DusanJovic-NOAA (Collaborator)

I ran the c1152 configuration on wcoss2, two 24-hour runs each for the develop and chunked_array_support_use_it branches, and got this:

develop:

develop/out.1:The total amount of wall time                        = 3612.852005
develop/out.1:The maximum resident set size (KB)                   = 4555024
develop/out.2:The total amount of wall time                        = 3575.031211
develop/out.2:The maximum resident set size (KB)                   = 4578148

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
Total runtime                            1   3335.083008   3335.084229   3335.083008      0.000293  1.000     0     0   767
Initialization                           1      0.000000      0.000000      0.000000      0.000000  0.000     0     0   767
FV dy-core                            1152   1417.159424   2231.315918   1931.180054    130.310120  0.579    11     0   767
FV subgrid_z                           576      8.700264      9.335741      9.048879      0.109144  0.003    11     0   767
FV Diag                                576     11.624299     12.455292     12.065112      0.154604  0.004    11     0   767
GFS Step Setup                        1152    276.685577    278.468109    277.904877      0.259084  0.083     1     0   767
GFS Radiation                          576     91.114044    180.401398    126.995758     14.943266  0.038     1     0   767
GFS Physics                           1152    543.373535   1302.708252    813.044128    119.443970  0.244     1     0   767
Dynamics get state                     576     41.268082     43.022018     41.708492      0.243502  0.013     1     0   767
Dynamics update state                  576    184.091141   1001.986206    700.140808    131.564636  0.210     1     0   767
FV3 Dycore                            1152   1449.884644   2267.879639   1966.171875    131.557663  0.590     1     0   767

this branch:

chunked/out.1:The total amount of wall time                        = 5420.323897
chunked/out.1:The maximum resident set size (KB)                   = 4493840
chunked/out.2:The total amount of wall time                        = 5432.124083
chunked/out.2:The maximum resident set size (KB)                   = 4497592

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
Total runtime                            1   5148.599121   5148.609863   5148.601562      0.001674  1.000     0     0   767
Initialization                           1      0.000000      0.000000      0.000000      0.000000  0.000     0     0   767
FV dy-core                            1152   1413.875000   2233.744873   1924.438965    133.241394  0.374    11     0   767
FV subgrid_z                           576      8.779434      9.477187      9.140765      0.107187  0.002    11     0   767
FV Diag                                576     12.773606     13.469611     13.119105      0.114862  0.003    11     0   767
GFS Step Setup                        1152   1952.668823   1953.963501   1953.395630      0.205041  0.379     1     0   767
GFS Radiation                          576     91.203415    179.604431    126.950539     14.749546  0.025     1     0   767
GFS Physics                           1152    680.066895   1447.218018    959.712585    122.683105  0.186     1     0   767
Dynamics get state                     576     40.147064     41.332420     40.555683      0.160260  0.008     1     0   767
Dynamics update state                  576    183.833923   1008.430359    697.616272    134.625107  0.135     1     0   767
FV3 Dycore                            1152   1451.608765   2276.054443   1965.396851    134.616547  0.382     1     0   767

The biggest difference is in 'GFS Step Setup'.

My run directory on Dogwood is: /lfs/h2/emc/eib/noscrub/dusan.jovic/ufs/c1152_gw_case

It would be nice if somebody else could take a look at these runs and try to reproduce the results, just to make sure I did not make some stupid mistake.

@DusanJovic-NOAA (Collaborator)

I also ran the same tests on Hera.

develop:

develop/out.1:   0: The total amount of wall time                        = 3630.872328
develop/out.1:   0: The maximum resident set size (KB)                   = 4719968
develop/out.2:   0: The total amount of wall time                        = 3609.285315
develop/out.2:   0: The maximum resident set size (KB)                   = 4722360

this branch:

chunked/out.1:   0: The total amount of wall time                        = 5027.371315
chunked/out.1:   0: The maximum resident set size (KB)                   = 4665600
chunked/out.2:   0: The total amount of wall time                        = 5030.425474
chunked/out.2:   0: The maximum resident set size (KB)                   = 4667944

Run directory on Hera: /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/c1152_gw_case

The only difference compared to the wcoss2 tests is that on Hera I turned off writing restart, history, and inline post files:

in model_configure:

restart_interval:        99
output_history:          .false.
write_dopost:            .false.

@jkbk2004 (Collaborator) commented Jul 1, 2024

I see quite a large timing difference on Gaea:

Chunked: PASS -- TEST 'cpld_bmark_p8_intel' [23:12, 12:51](4189 MB)
Develop: PASS -- TEST 'cpld_bmark_p8_intel' [17:27, 14:02](4188 MB)

I will compare other cases as well.

@climbfuji (Collaborator, Author)

This is all too slow for the proposed code. We need to find out what's going on; we definitely can't merge the PR in its current form.

@climbfuji (Collaborator, Author)

(quoting @DusanJovic-NOAA's c1152 wcoss2 results above)

I don't think you did. Your findings are very surprising to me, however, because I would have expected "GFS Step Setup" (that's the time_vary group, correct?) to benefit from this PR, not the other way around.

@DusanJovic-NOAA (Collaborator)

(quoting @climbfuji's suggestions above, in particular the one about testing different block sizes)

@climbfuji I made several c768 runs on wcoss2 using different block sizes (8, 16, 32, 64, 128, 256, ..., 768) and got these wall clock times and memory usage:

out.bs.8:  The total amount of wall time                        = 1395.894187
out.bs.8:  The maximum resident set size (KB)                   = 2092720
out.bs.16: The total amount of wall time                        = 1241.990803
out.bs.16: The maximum resident set size (KB)                   = 2082572
out.bs.32: The total amount of wall time                        = 1174.226261
out.bs.32: The maximum resident set size (KB)                   = 2085564
out.bs.64: The total amount of wall time                        = 1152.056636
out.bs.64: The maximum resident set size (KB)                   = 2098340
out.bs.128:The total amount of wall time                        = 1159.601259
out.bs.128:The maximum resident set size (KB)                   = 2182636
out.bs.256:The total amount of wall time                        = 1194.814813
out.bs.256:The maximum resident set size (KB)                   = 2351688
out.bs.384:The total amount of wall time                        = 1228.039325
out.bs.384:The maximum resident set size (KB)                   = 2523868
out.bs.512:The total amount of wall time                        = 1434.737620
out.bs.512:The maximum resident set size (KB)                   = 2676132
out.bs.768:The total amount of wall time                        = 1382.961159
out.bs.768:The maximum resident set size (KB)                   = 3030336

The runs were without restart and history outputs.

Labels
Waiting for Reviews: the PR is waiting for reviews from associated component PRs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert blocked data structures in FV3atm to contiguous arrays
9 participants