Having trouble running with UCX on WCOSS2 #2231

Open
MatthewPyle-NOAA opened this issue Apr 8, 2024 · 12 comments

Comments

@MatthewPyle-NOAA
Collaborator

MatthewPyle-NOAA commented Apr 8, 2024

Description

Attempts to run an RRFS configuration using UCX rather than Slingshot have led to failures as soon as the model starts integrating. The failures resemble model instability failures, so it seems that NaNs are getting into the system somehow.

To Reproduce:

Use the /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/job_card.sh job card (on Dogwood) to run the case; this requires copying the run_dir and config_parms directories to your own space. job_card.sh_nonucx is a job card that avoids UCX and works for me.

Additional context

I am very open to the idea that this is user error on my part, but I could use help figuring out why it fails the way it does.

Output

UCX failure log file on Dogwood: /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/OUTPUT_60h_41nodes_retry_newucxtest_v0.8.9
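
A quick way to test the NaN suspicion is to count the log lines that mention NaN. The following is only a sketch: the function name `count_nan_lines` is hypothetical, and the actual format of the model log is not shown in this issue, so the whole-word, case-insensitive match is an assumption.

```shell
#!/bin/bash
# Hypothetical helper (not from the issue): count the lines of a log file
# that contain "nan" as a whole word, ignoring case.
# Note: grep -c prints 0 and returns a non-zero status when nothing matches.
count_nan_lines() {
  grep -ciw 'nan' "$1"
}
```

It could be called as `count_nan_lines <path to the log file>`, using the log path given above.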

MatthewPyle-NOAA added the bug (Something isn't working) label on Apr 8, 2024
@MatthewPyle-NOAA
Collaborator Author

@GeorgeVandenberghe-NOAA Jun Wang recommended that I reach out to you about this issue. My attempts to use UCX for the RRFS application fail when the model starts integrating. My hope is that there is something wrong with my setup, and since you have experience running UCX for the global application, maybe you could take a look? Thanks!

@GeorgeVandenberghe-NOAA
Collaborator

@MatthewPyle-NOAA
Collaborator Author

Most of the details you need are described in the "To Reproduce" part of the issue, and I do have a test setup on Dogwood. I've been pointing at RRFS model executables, but could point you at the source if needed.

@GeorgeVandenberghe-NOAA
Collaborator

You can repair this in the UCX job by loading a later level of cray-mpich-ucx. When I do this, the test job runs to timeout.

#module load cray-mpich-ucx/8.1.12
module load cray-mpich-ucx/8.1.19
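
To guard against an older module being picked up again, a job card could check the loaded version before launching the model. This is a minimal sketch, not something from the thread: the function names are hypothetical, and it assumes the module system exports the colon-separated `LOADEDMODULES` variable and that GNU `sort -V` is available.

```shell
#!/bin/bash
# Hypothetical helpers (not from the thread): verify that the loaded
# cray-mpich-ucx module is at least the version reported to work here.

# Print the cray-mpich-ucx version found in a colon-separated module list
# (for example the value of $LOADEDMODULES); print nothing if it is absent.
ucx_mpich_version() {
  printf '%s\n' "$1" | tr ':' '\n' | sed -n 's|^cray-mpich-ucx/||p' | head -n 1
}

# Succeed if version $1 >= version $2, using version-aware sorting.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeed only if the module list contains cray-mpich-ucx 8.1.19 or later.
ucx_mpich_ok() {
  local v
  v=$(ucx_mpich_version "$1")
  [ -n "$v" ] && version_ge "$v" "8.1.19"
}
```

In a job card this might be used as `ucx_mpich_ok "$LOADEDMODULES" || exit 1`, placed after the `module load` lines.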

@MatthewPyle-NOAA
Collaborator Author

Thanks @GeorgeVandenberghe-NOAA, I will give that a try!

@MatthewPyle-NOAA
Collaborator Author

I have confirmed that switching to cray-mpich-ucx/8.1.19 solves my issue, so I am closing it.

MatthewPyle-NOAA removed the bug (Something isn't working) label on Apr 30, 2024
@junwang-noaa
Collaborator

@MatthewPyle-NOAA is there any issue with using UCX?

@MatthewPyle-NOAA
Collaborator Author

@junwang-noaa I'm still looking into something: the UCX run definitely initializes much more quickly, but it seems a bit slower beyond that point.

@GeorgeVandenberghe-NOAA
Collaborator

@MatthewPyle-NOAA
Collaborator Author

@GeorgeVandenberghe-NOAA I have things under /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307 on Cactus. job_card.sh uses UCX, and job_card.sh_nonucx doesn't. I accidentally scrubbed some job log files from earlier today, but I have seen that for a 60 h forecast on 153 nodes, UCX saves about 7 minutes in the time until f00 output is written, but is then about 9 minutes slower than non-UCX going from f00 to f60. So far I've just been pointing at an RRFS executable. Would you recommend recompiling the code against the UCX modules?
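
The net effect of those two figures can be worked out directly. The sketch below uses only the approximate numbers quoted in the comment above; the variable names are illustrative.

```shell
#!/bin/bash
# Approximate figures quoted in the comment above, in minutes.
ucx_startup_saving=7   # UCX reaches f00 output about 7 min sooner
ucx_forecast_loss=9    # UCX is about 9 min slower from f00 to f60

# A positive result means UCX is slower end to end.
net_ucx_penalty=$(( ucx_forecast_loss - ucx_startup_saving ))
echo "Net UCX penalty: ${net_ucx_penalty} min over the 60 h forecast"
```

On these numbers the UCX run ends up about 2 minutes slower overall for the 60 h forecast, despite the faster initialization.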

@GeorgeVandenberghe-NOAA
Collaborator

@MatthewPyle-NOAA
Collaborator Author

Okay. I'm using cray-mpich/8.1.12 for the non-UCX test. Hopefully the level of cray-mpich doesn't explain the difference.
