Having trouble running with UCX on WCOSS2 #2231
Comments
@GeorgeVandenberghe-NOAA Jun Wang recommended that I reach out to you about this issue. My attempts to use UCX for the RRFS application fail when the model starts integrating. My hope is that there is something wrong with my setup, and since you have experience running it for the global application, maybe you could take a look? Thanks!
Do you have a WCOSS2 CWD with a testcase, a job to run it, and (possibly) the source code and build?
--
George W Vandenberghe
Lynker Technologies at NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
***@***.***
301-683-3769 (work), 301-775-1547 (cell)
Most details you need are described in the "To reproduce" part of the issue - I do have a test setup on dogwood. I've been pointing at RRFS model executables, but could point you at a source if needed.
You can repair this in the ucx job by loading a later level of cray-mpich. When I do this, the test job runs to timeout. The job previously had:
#module load cray-mpich-ucx/8.1.12
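A minimal sketch of what the module swap in the job card might look like. This is an illustration, not the actual job card: the `PrgEnv-intel` and `craype-network-ucx` module names are assumptions about the WCOSS2 Cray PE environment, and the cray-mpich-ucx version numbers are the ones named in this thread.

```shell
#!/bin/bash
# Hypothetical environment-setup fragment for the UCX job card.
# Module names/versions are assumptions; verify with `module avail` on your system.

module purge
module load PrgEnv-intel            # compiler environment (assumed)
module load craype-network-ucx      # select the UCX network transport (assumed)

# module load cray-mpich-ucx/8.1.12 # older level: fails when integration starts
module load cray-mpich-ucx/8.1.19   # later level reported to fix the failure

module list                         # confirm which cray-mpich-ucx level is loaded
```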
Thanks @GeorgeVandenberghe-NOAA, will give that a try!
Have confirmed that going to cray-mpich-ucx/8.1.19 solves my issue. Closing the issue.
@MatthewPyle-NOAA is there any issue with using UCX?
@junwang-noaa I'm still looking into something - it definitely initializes much more quickly, but seems a bit slower beyond that point.
I lost my testcase on dogwood after the problem was closed. Do you have a CWD and source on Cactus?
@GeorgeVandenberghe-NOAA I have things under /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307 on cactus. job_card.sh uses UCX, and job.card.sh_nonucx doesn't. I accidentally scrubbed some job log files from earlier today, but have seen for a 60 h forecast on 153 nodes that UCX saves about 7 minutes in time to f00 output being written, but is then about 9 minutes slower than non-UCX going from f00 to f60. So far I've just been pointing at an RRFS executable. Would you recommend recompiling code pointing at UCX modules?
The UCX stuff should be shared libraries, so recompiling won't affect it. Do you have a source and build in that directory? I'll go ahead and snag it. I had gotten rid of my testcases after the problem was closed.
Okay. I'm using cray-mpich/8.1.12 for the non-UCX test. Hopefully the level of cray-mpich doesn't explain the difference.
Description
Attempts to run using UCX rather than Slingshot for an RRFS configuration have led to failures when the model begins integrating. The failures are similar in appearance to model-instability failures, so it seems like NaNs are getting into the system somehow.
To Reproduce:
Use the /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/job_card.sh job card (on dogwood) to run the case (this requires copying the run_dir and config_parms directories to your own space). job_card.sh_nonucx is a job card that avoids ucx and works for me.
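A sketch of the reproduce steps above, assuming WCOSS2's PBS batch system (`qsub`); the destination directory is a hypothetical placeholder for your own space:

```shell
#!/bin/bash
# Sketch of the reproduction steps; paths exist only on WCOSS2 (dogwood),
# and DEST is a placeholder -- substitute your own working area.

SRC=/lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307
DEST=/lfs/h2/emc/lam/noscrub/$USER/ucx_test   # hypothetical destination

mkdir -p "$DEST"
cp -r "$SRC/run_dir" "$SRC/config_parms" "$DEST/"
cp "$SRC/job_card.sh" "$SRC/job_card.sh_nonucx" "$DEST/"

cd "$DEST"
qsub job_card.sh            # UCX run (the failing case)
# qsub job_card.sh_nonucx   # non-UCX job card that works
```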
Additional context
Very open to the idea that it is user error on my part, but could use help figuring out why it is failing the way it is.
Output
ucx failure log file on Dogwood: /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/OUTPUT_60h_41nodes_retry_newucxtest_v0.8.9