In fv3atm: convert GFS DDTs from blocked data structures to contiguous arrays #2183
base: develop
Conversation
@climbfuji Can you sync up the branches? I think we can start working on this PR, as debugging is underway on the #2290 side.
@@ -20,7 +20,8 @@ endif()

if(DEBUG)
  add_definitions(-DDEBUG)
  set(CMAKE_Fortran_FLAGS_DEBUG "${CMAKE_Fortran_FLAGS_DEBUG} -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays")
  #set(CMAKE_Fortran_FLAGS_DEBUG "${CMAKE_Fortran_FLAGS_DEBUG} -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays")
@climbfuji May I ask why we removed the "-check nopointer" option? Can we check whether there are warning messages in the DEBUG compile log file? If I remember correctly, some diagnostic field computations in the dycore could throw warning messages in a parallel configuration even though those fields are not required for output.
I can revert this; it's not needed for the PR. But it is a nice thing that we can turn pointer checking on without the code crashing, so I would consider that a plus.
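For context, here is a minimal, hypothetical sketch (not code from this PR) of the kind of bug that Intel's runtime pointer checking, i.e. building without passing "-check nopointer", is meant to trap: dereferencing a pointer that was never associated.

```fortran
! Hypothetical example only: with Intel pointer checking enabled in a DEBUG
! build, the invalid dereference below aborts with a runtime error instead of
! silently reading garbage memory.
program pointer_check_demo
  implicit none
  real, pointer :: p(:) => null()

  ! p is never allocated or associated with a target, so p(1) is invalid.
  print *, p(1)
end program pointer_check_demo
```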
Yes, in a few minutes. I'll also pull in the unit testing PR that @DusanJovic-NOAA asked me to (see NOAA-EMC/fv3atm#798 (comment)).
@jkbk2004 @DusanJovic-NOAA @junwang-noaa All updated; Alex's unit testing PR is pulled into my fv3 submodule PR as well. Note that I didn't test compiling/running after pulling in the latest code, I just pulled it in and pushed.
@climbfuji I ran c768 24h tests on wcoss2 using this branch and the develop branch (3 runs each). For the develop branch (e784814) the following are wall clock times and memory usage:
This branch (f5059f0):
Thanks for doing that. So the new code is slightly slower but uses slightly less memory. Are we OK to go ahead, knowing that there are plenty of opportunities to speed up the code? I deliberately did not remove entire block loops because I wanted to limit the changes (well, to the extent possible) and keep b4b identical results with the current code.
I think we can go ahead.
The CI test to check whether the repos are up to date fails, but I am quite certain I got everything updated. A bug? See https://github.com/ufs-community/ufs-weather-model/actions/runs/9699726693/job/26770514311 |
@DomHeinzeller We have seen this issue with picking up the git runner id_file. I don't get why we need this id_file; it is probably an old feature inherited from running cloud-based tests in GitHub Actions. I think we will clean it up later. We can start testing across machines once we get all approvals on the subcomponent side. @FernandoAndrade-NOAA @BrianCurtis-NOAA FYI
@climbfuji Since the model now runs about 7% slower, would you elaborate more on speeding up the code? Can we expect future PRs to bring the model back to the current speed? We have a 6.5 min/day operational window, and this slowdown may require increased resources to run the model.
The first and most straightforward change is to remove all block loops. You'll notice that I left them in place and added additional logic to set the correct array indices; this isn't needed. You could simply iterate over the entire array in a single loop, or use whole-array manipulations, although I've heard in the past that the latter is slower than the former (which doesn't make sense to me, and I haven't actually tested it). This change can also be made in the dycore (see NOAA-GFDL/GFDL_atmos_cubed_sphere#345) and in the CCPP physics (where applicable). The second change is more intrusive. The I/O component (restart, diagnostic) uses blocked data structures, and I didn't dare touch any of this. Therefore, in this set of PRs, the non-blocked GFS physics DDTs interact with the blocked GFS restart/diagnostic DDTs for the I/O component. This could - no, should - be changed so that the I/O components use non-blocked data structures as well. I don't know enough about the I/O part of the ufs-weather-model to have a good idea of how much work that would require.
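To make the two loop structures concrete, here is a minimal, self-contained Fortran sketch with hypothetical type and variable names (these are not the actual GFS DDTs, and a uniform block size is assumed): a blocked layout accessed through a retained block loop with explicit index bookkeeping, versus a single contiguous array updated with one loop or a whole-array operation.

```fortran
! Hypothetical sketch: blocked storage vs. one contiguous array.
program block_vs_contiguous
  implicit none

  ! One array per block (stands in for a blocked DDT component).
  type :: block_data_t
     real, allocatable :: t(:)
  end type block_data_t

  integer, parameter :: nblks = 4, blksz = 8, ncols = nblks*blksz
  type(block_data_t) :: blocked(nblks)
  real, allocatable  :: t_all(:)          ! contiguous replacement
  integer :: nb, ix, i

  allocate(t_all(ncols))
  do nb = 1, nblks
     allocate(blocked(nb)%t(blksz))
     blocked(nb)%t = real(nb)
  end do

  ! (1) Retained block loop: extra logic maps (block, index-in-block)
  !     to a position in the contiguous array.
  do nb = 1, nblks
     do ix = 1, blksz
        i = (nb-1)*blksz + ix
        t_all(i) = blocked(nb)%t(ix)
     end do
  end do

  ! (2) Single loop over the contiguous array: no block bookkeeping at all.
  do i = 1, ncols
     t_all(i) = t_all(i) + 1.0
  end do

  ! (3) Whole-array manipulation: same update without an explicit loop.
  t_all = t_all + 1.0

  print *, 'checksum:', sum(t_all)
end program block_vs_contiguous
```

Whether (2) or (3) is faster in practice depends on the compiler and on whether array temporaries get created; both remove the per-block indirection.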
Getting back 7% of the runtime can be a lot of work, and there's no guarantee we'll get it. Considering the number of FV3 runs we do, and how big they are, a 7% resource increase would amount to a massive amount of money. I suggest we:
I was going to suggest at least multiple runs of the same test case, since there can be large run-to-run differences for the same code. We need this development for the transition of the CCPP framework to capgen. We could keep this in a branch and keep updating it from develop, make successive changes to the code as I described above, and then merge an even larger set of changes, with likely non-b4b results, into develop. Also, I am no longer affiliated with NOAA and am doing all this on the side; my resources are limited.
@junwang-noaa Can someone run the global-workflow with this branch, using a full-resolution case, and report back on the effects on the runtime?
FWIW, I am running "the top part of rt.conf" (about 80 tests or so, everything until the ATM debug section starts) on Hera for develop and this branch. I'll look into runtime differences for these runs. I know the setups are not representative of the implementation target, but looking at many tests might give us some answers, too. We should also check whether the sub-timers provide more information on where the differences come from.
@dustinswales Didn't you also compare runtimes for the ORT runs?
Here is the info from those 71 runs. In short, averaged across the 71 runs: a 1.3% increase in runtime and a 0.7% decrease in memory (max resident size). One could repeat this exercise, since I ran the chunked code during working hours and the dev code after work, but ... it's at least something.
@climbfuji 1.3% looks OK to me, but I will compare on Gaea. I think we can schedule this PR to be committed before July 4. We will let #2327 and the HR4 PRs go first until Monday or Tuesday.
@SamuelTrahanNOAA Dusan was testing the branch with a prototype C768 atmosphere-only case (I believe from a run directory created by the workflow). He ran the cases three times. I think it's a good idea to run the same case on other platforms to confirm the run speed. The RTs run at lower resolution, most of them at C96. Once the code is committed, it will go into the prototype GFS HR4 at C1152 resolution; we haven't tested this branch at that resolution yet. I think Dom's approach of keeping a separate branch and making further changes to get comparable performance would be good.
Yes, but note that you will not have b4b identical results if you make performance improvements like removing block loops etc. And someone needs to maintain these separate branches (and there are many: ufs-weather-model, fv3atm, atmos_cubed_sphere, ccpp-physics); I can't commit to that. @DusanJovic-NOAA @junwang-noaa @SamuelTrahanNOAA I have two other suggestions, though, that wouldn't take too much time to test. I suggest we do those before we merge these PRs (if we decide to merge them).
I ran rt.conf on Gaea against this PR; the compile suite atm_debug_dyn32_intel is failing while trying to build the Fortran object FV3/ccpp/physics/CMakeFiles/ccpp_physics.dir/ccpp_FV3_HRRR_gf_physics_cap.F90.o. There are warnings complaining about UFS_SCM_NEPTUNE before the build fails with an oom_kill event.
I ran the c1152 configuration on wcoss2, two 24-hour runs each for the develop and chunked_array_support_use_it branches, and I got this: develop:
this branch:
The biggest difference is in 'GFS Step Setup'. My run directory on Dogwood is /lfs/h2/emc/eib/noscrub/dusan.jovic/ufs/c1152_gw_case. It would be nice if somebody else could take a look at these runs and try to reproduce the results, just to make sure I did not make some stupid mistake.
I also ran the same tests on Hera. develop:
this branch:
Run directory on Hera: /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/c1152_gw_case. The only difference compared to the wcoss2 tests is that on Hera I turned off writing restart, history, and inline post files in model_configure:
I see quite some timing difference on Gaea:
I will compare other cases as well.
This is all too slow for the proposed code. We need to find out what's going on; we definitely can't merge the PR in its current form.
I don't think you did. Your findings are very surprising to me, however, because I would have expected the "GFS Step Setup" (that's the time_vary group, correct?) to benefit from the current PR, not the other way round.
@climbfuji I made several c768 runs on wcoss2 using different block sizes (8, 16, 32, 64, 128, 256, ... 768), and got these wall clock times and memory usage:
The runs were without restart and history outputs.
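For reference, the block size varied in these runs is, to my knowledge, controlled by the blocksize entry of the &atmos_model_nml namelist in input.nml; a sketch only (the value shown is just an example):

```fortran
&atmos_model_nml
  blocksize = 32    ! swept over 8, 16, 32, ..., 768 in the runs above
/
```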
Commit Queue Requirements:
Description:
This PR updates the submodule pointers for fv3atm, gfdl_atmos_cubed_sphere, and ccpp-physics for the changes described in the associated PRs below: convert internal GFS DDTs from blocked data structures to contiguous arrays. This excludes the (external) GFS_extdiag and GFS_restart DDTs.
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
n/a
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
Testing Log: