Random Cluster Bugs #56

Open
RiesBen opened this issue Sep 27, 2021 · 14 comments
Labels
question Further information is requested

Comments

@RiesBen
Collaborator

RiesBen commented Sep 27, 2021

Hi @candidechamp,
the new copying back from the cluster scratch failed for me!
The run did not copy all files back to the work folder (the cnf files are missing).
It looks like the copying after the run fails, or the scratch folder is not found anymore?

Does that also happen for you?

Here I attached the output file:
CHK1_nd5_enr3_complex_prod_1SS_21r_3_sopt4_rb3_max8_md.txt

RiesBen added the "bug" label Sep 27, 2021
@candidechamp
Collaborator

You probably did not specify that you want the calculation to occur in the scratch directory?
What does your input job_name.sh look like?

Another idea could be that you are not on the correct pygromos commit (but I checked the version I have and it doesn't have any major differences from pygromos v1)?
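
For reference, a minimal sketch of what such a job script could look like when the run is staged through the node-local scratch (this is an illustrative assumption, not the actual pygromos-generated script; the paths, resource requests, and md step are placeholders):

```bash
#!/usr/bin/env bash
#BSUB -n 8                 # placeholder resource request
#BSUB -W 24:00             # wall time must also cover the copy-back step

# Hypothetical staging pattern: run in node-local scratch, then copy results back.
WORKDIR=/cluster/work/myuser/myproject        # placeholder work folder
SCRATCH=${TMPDIR:-/scratch/$USER/$LSB_JOBID}  # node-local scratch directory

mkdir -p "$SCRATCH"
cp -r "$WORKDIR"/input/. "$SCRATCH"/
cd "$SCRATCH"

# ... run the simulation here (e.g. the GROMOS md binary) ...

# Copy everything (including the cnf files) back to the work folder.
cp -r "$SCRATCH"/. "$WORKDIR"/
```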

@RiesBen
Collaborator Author

RiesBen commented Sep 27, 2021

I don't pass the work_dir argument, so I should be using the default of the new branch.

The pygromos version is definitely correct, as it is the standard one for reeds.

I wonder if there was a cluster anomaly, or if the approach with the ssh-script is not robust?

@SalomeRonja
Collaborator

I'll also check out the branch and see if it works for me

RiesBen changed the title from "Copying back from temporary folders does not work for me" to "Random Cluster Bugs" Sep 27, 2021
@RiesBen
Collaborator Author

RiesBen commented Sep 27, 2021

We now have the impression that this might be related to a temporary communication problem between the nodes. So for now, let's collect all awkward bugs in the pipeline here and maybe we can make some sense of them. For me, the problems apparently occur only rarely.

RiesBen added the "question" label and removed the "bug" label Sep 27, 2021
@SalomeRonja
Collaborator

For me, the same thing happened: after checking out the newest version of the eoff rebalancing branch (which includes the minor rework of the submission pipeline), only the files from one node are copied back correctly; the rest are missing...

Does it (usually) work for you @candidechamp @schroederb even when the job is distributed among different nodes?

@candidechamp
Collaborator

@SalomeRonja I haven't had a single issue so far. I just diffed my local branch against /origin/main and I don't see anything wrong.

Are you 100% sure that the job_system_name.sh files which submit the calculation were generated by the new code?
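
A couple of quick sanity checks one could run (the paths are placeholders for your own setup, not the actual layout):

```bash
# Which pygromos commit is currently checked out? (placeholder path)
git -C /path/to/pygromos log -1 --oneline

# Does the generated submission script actually stage through the node-local scratch?
# (placeholder path to the generated job_system_name.sh)
grep -n "TMPDIR" /cluster/work/myuser/myproject/job_system_name.sh
```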

@SalomeRonja
Collaborator

Ah, after a closer look, the problem was that the job timed out - I didn't think to increase the duration_per_job now that the simulation and cleanup are done in one job. After I increased it, it worked fine :)
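
For anyone hitting the same symptom: since simulation and clean-up now run in a single job, the requested wall time has to cover both. A hedged sketch of what to look for (the exact directives and the mapping from duration_per_job depend on your submission settings):

```bash
# The generated script requests the wall time via LSF, e.g.:
#BSUB -W 24:00   # must now cover simulation + copy-back from scratch

# A job that ran out of wall time is reported by LSF, e.g.:
bjobs -l <jobid>   # look for TERM_RUNLIMIT / "job killed after reaching LSF run time limit"
```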

@candidechamp
Collaborator

@SalomeRonja Thanks for looking into it. That's unfortunately a drawback we can't really do anything about when running multi-node jobs. If the wall time is reached, we have no way of getting the data back; the only people who could fix this are the developers of the LSF queuing system.

@RiesBen
Collaborator Author

RiesBen commented Sep 27, 2021

@candidechamp but we can still fall back to the old work_dir flag if desired, right?

@candidechamp
Collaborator

@schroederb You can, but this makes the cluster slow for everyone.

@RiesBen
Collaborator Author

RiesBen commented Sep 27, 2021

Ugh, was that stated by the cluster support? I thought you told me it was not such a big deal?

@candidechamp
Collaborator

Oh no, sorry, actually the cluster people said something slightly different:

"""
If a program does "nice" and large writes/reads, then you won't notice the latency difference. When a program does "bad" I/O (a lot of small random reads/writes, only a few bytes per read/write), then the latency will kick in and make everything slow.

An advantage of the local scratch is that it is independent. If people do stupid stuff on /cluster/work, it will slow down the entire file system, i.e., your job could be negatively affected by the actions of other users, while you don't have this problem on the local scratch.

Copying the data from/to local scratch can even be optimized and parallelized (using gnu parallel to untar several tar archives in parallel to $TMPDIR, using multiple cores). In one test a user could copy 360 GB of data from /cluster/work to $TMPDIR within 3 or 4 minutes. When a job runs for several hours, a few minutes will not cause a lot of overhead compared to the total runtime.
"""

@RiesBen
Collaborator Author

RiesBen commented Sep 27, 2021

Ah ok, yes, I think the default should be the scratch solution on the node. Just in case we want to test or debug something, it is still nice to keep the option of opting out.

@candidechamp
Collaborator

@SalomeRonja @epbarros I think this issue may be closed now?
