
Example simulation 201 fails using Intel compiler #82

Closed
bss116 opened this issue May 5, 2020 · 10 comments

bss116 commented May 5, 2020

The energybalance example simulation 201, which runs without issues locally, fails when running on the ICL cluster. Strangely, running in debug mode does not produce an error stack to show where the code terminates, but running in release does, pointing to the Poisson solver (modpois.f90 line 1045). This makes me think that the error may be at a different spot than indicated in the error stack.

A similar issue arises when running example simulation 501 with an extended vertical domain size, where debug again does not produce an error stack. However, the error (in release) comes from a different line, this time in the subgrid model (modsubgrid.f90 line 391). Again, this simulation works fine on my local machine. I am not sure where to start looking for the error. Log files are attached.
output.201-debug.txt
output.201-release.txt
output.501-debug.txt
output.501-release.txt

bss116 added the bug label May 5, 2020
bss116 added this to the 0.1.0 milestone May 5, 2020

dmey commented May 15, 2020

@bss116 this has no assignee -- are you looking into this already, or do you want me to have a look? If so, could you please assign me to it? I may have some time at the end of next week.


bss116 commented May 15, 2020

I'm not going to look into this, at least not for a while. Sure, if you have some time, please have a look! But don't spend all your time and nerves on it; it will probably involve going over lots of array-bound warnings, for which we will need to allocate some time, sometime...
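
For when we do get to it: one way to chase those array-bound warnings down is to build with runtime bounds checking enabled (how that plugs into our build setup is an assumption on my part, but the flags themselves are standard). With the Intel compiler, `-check bounds -g -traceback`, and with gfortran, `-fcheck=bounds -g -fbacktrace`, make an out-of-bounds access abort with the offending file and line, rather than failing later in an unrelated routine such as the Poisson solver.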


dmey commented May 15, 2020

OK, I must have misunderstood the issue then -- I thought this was something to do with the HPC specifically, but it seems to be a general issue where an Intel compiler flag catches a problem that is not caught when running with GNU -- I will un-assign myself then 🤭...

dmey removed their assignment May 15, 2020

bss116 commented May 15, 2020

Yeah I'm afraid so... 😄


dmey commented May 22, 2020

@bss116 cool, but in that case I would probably change the title to something a bit more specific, as it is not due to the HPC per se. This only happens because you use Intel on the HPC rather than GNU, so this issue is really about the Intel compiler.

bss116 changed the title from "Example simulation 201 fails on HPC" to "Example simulation 201 fails using Intel compiler" May 22, 2020
bss116 removed this from the 0.1.0 milestone Nov 10, 2020
dmey added this to the 0.2.0 milestone Nov 16, 2020

dmey commented Nov 27, 2020

@bss116 @samoliverowens I have checked this again, running 201 for 1000 s both with Intel and GNU on the ICL cluster. The first thing I noticed is that in both cases the simulation runs without errors when using 1 core. Simulations do then fail consistently across compilers, but this seems to be related to the number of cores used: e.g. with 32 cores the simulation runs, but with 48 it fails with error STOP 1 at the very first time step. I was therefore unable to reproduce the errors shown in your attached logs, and I wonder if this is because of some of the changes we made since this issue was opened.

So I think we should close this issue, as it no longer seems to apply -- at least for 201 -- and instead (1) clarify/add a check for the number of cores to use and (2) improve the error logs. For (1) we could simply clarify this in the docs for now, with the option to implement a check in the code later, and (2) could be targeted at a future milestone. What do you guys think?


samoliverowens commented Nov 27, 2020

> @bss116 @samoliverowens I have checked this again, running 201 for 1000 s both with Intel and GNU on the ICL cluster. The first thing I noticed is that in both cases the simulation runs without errors when using 1 core. Simulations do then fail consistently across compilers, but this seems to be related to the number of cores used: e.g. with 32 cores the simulation runs, but with 48 it fails with error STOP 1 at the very first time step. I was therefore unable to reproduce the errors shown in your attached logs, and I wonder if this is because of some of the changes we made since this issue was opened.

The number of cores has to divide jtot, which for 201 is 128, so 48 will error.
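
As a quick worked check of that: example 201 has jtot = 128, so 32 cores give 128 / 32 = 4 (an even split), whereas 128 / 48 leaves a remainder (mod(128, 48) = 32), which is why the 48-core run aborts at the startup check quoted further down.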


dmey commented Nov 27, 2020

> The number of cores has to divide jtot, which for 201 is 128, so 48 will error.

This is good then -- has this been documented? If so, I would close this and open an issue about adding a check that can be targeted in a future release.


bss116 commented Nov 27, 2020

> The number of cores has to divide jtot, which for 201 is 128, so 48 will error.

> This is good then -- has this been documented? If so, I would close this and open an issue about adding a check that can be targeted in a future release.

I thought we already mentioned it somewhere, but I cannot find it in the docs now. Where do you think would be a good place? In the getting started guide under Run, or in the simulation setup notes?
The code fails because of this check at startup, so you'd already get this information in the output.xxx file:

if (mod(jtot, nprocs) /= 0) then
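
For context, here is a minimal standalone sketch of what that kind of startup guard amounts to. It is not the actual uDALES code: only jtot = 128 and the mod(jtot, nprocs) condition come from this thread, while the program name, message text and MPI boilerplate are assumptions for illustration.

```fortran
! Minimal sketch of a jtot/nprocs divisibility guard (illustration only,
! not the uDALES source). jtot = 128 mirrors example simulation 201.
program check_nprocs
  use mpi
  implicit none
  integer, parameter :: jtot = 128
  integer :: nprocs, myid, ierr

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

  ! Same condition as the check quoted above: every MPI task must get an
  ! equal share of the domain along jtot, otherwise abort before time stepping.
  if (mod(jtot, nprocs) /= 0) then
    if (myid == 0) write (*, *) 'ERROR: nprocs (', nprocs, ') must divide jtot (', jtot, ')'
    call MPI_FINALIZE(ierr)
    stop 1
  end if

  if (myid == 0) write (*, *) 'OK: jtot / nprocs = ', jtot / nprocs
  call MPI_FINALIZE(ierr)
end program check_nprocs
```

Run with e.g. `mpirun -np 48 ./check_nprocs`, it prints the error and stops with STOP 1, much like the behaviour reported above, while `-np 32` passes.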

I'm happy that 201 runs now also on the ICL cluster. Let's close this issue.

dmey changed the milestone from 0.2.0 to 0.1.0 Nov 27, 2020

dmey commented Nov 27, 2020

@bss116 cool -- how about under https://github.com/uDALES/u-dales/blob/master/docs/udales-getting-started.md#run?

And sure -- will close this and open an issue for fixing that check. Will actually put this under 0.1 and I can take care of this.
