
Process on GPU killed after long run and/or restart #348

Open
Qcellaris opened this issue May 5, 2022 · 5 comments

@Qcellaris

Dear developers,

I was using PySPH on my old GPU (GeForce GTX 680) for a while and I recently started using it on a newer GPU (NVIDIA TITAN V) as well. Unfortunately, there are some strange issues showing up, so hopefully someone can help me out here:

  1. When I run a simulation for a longer time, e.g. 8 hours or so, it suddenly gets killed. The resulting error log can be found in the attachment.
  2. When I try to restart the simulation from any of the previous restart files, it gets killed again right away while showing the same error messages.
  3. The issue also shows up when I want to restart a simulation that has run for a short time, say 30 minutes, and which finished correctly.

This problem doesn't appear on my GTX 680 but on that machine I am using an older version of PyOpenCL. I tried using this same older version of PyOpenCL on the TITAN V but it didn't solve the issue.

Best,

Stephan

gpu_err.log

@Qcellaris
Author

I have also tested the code on the HPC cluster where we are using Tesla V100s and ran into the same errors. I have attached the corresponding error log here as well, but I don't think it really gives us any new information.

slurm-76649.txt

Qcellaris changed the title from "Process on GPU killed after after long run and/or restart" to "Process on GPU killed after long run and/or restart" on May 17, 2022
@inducer

inducer commented May 17, 2022

See also inducer/pyopencl#562 (comment).

@prabhuramachandran
Contributor

Hi, sorry about the slow response. Is it possible that there is a blow-up of the particles and a large increase of the domain size? As regards the restart, it is possible that there is an issue with restarting on the GPU because some necessary state has not been saved. Do you have a small reproducible example? That would help debug this issue better.

@Qcellaris
Author

Qcellaris commented May 30, 2022

I don't see any blow-up or large increase in the domain size when I look at the results of the last data frame. The crash happens after hours of simulation on the GPU (approximately 70k iterations). If I restart from the last output file on a CPU, it just continues fine without blow-up or large increase in domain size. I have attached the files of the simulations that gave us these errors (I changed the .py extension to .txt, otherwise I couldn't include them in this message). I will check if I also encounter this issue with a smaller example.
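
For completeness, this is roughly how such a check of the last output file could be scripted. It is only a minimal sketch: it assumes the standard PySPH output format readable by pysph.solver.utils.load and that the particle arrays carry the usual x, y, z, u, v, w properties; the file name is a placeholder, not one of the actual output files.

```python
# Minimal sketch: inspect the last PySPH output file for particle blow-up.
# Assumes the standard output format readable by pysph.solver.utils.load and
# particle arrays with x, y, z, u, v, w; 'collision_70000.npz' is a placeholder.
import numpy as np
from pysph.solver.utils import load

data = load('collision_70000.npz')
for name, pa in data['arrays'].items():
    speed = np.sqrt(pa.u**2 + pa.v**2 + pa.w**2)
    print(name,
          'x-extent:', (pa.x.min(), pa.x.max()),
          'y-extent:', (pa.y.min(), pa.y.max()),
          'z-extent:', (pa.z.min(), pa.z.max()),
          'max speed:', speed.max())
```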

We have been looking into the issue ourselves for a while as well. Inducer mentioned in the comment above: "An unsigned integer underflow comes to mind as a possible reason." The only places where I found unsigned integers had to do with particle indices, used in e.g. neighbor lists. Since the code runs fine on a CPU and this error only occurs on a GPU, we thought it possibly had to do with the specific implementation of neighbor lists on the GPU. Might it be that neighbor-list memory on the GPU is not dynamic and that the length of the neighbor lists has a fixed maximum? Say the maximum number of neighbors in a neighbor list is 30; if at some moment during the simulation the number of neighbors exceeds this, we might end up with exactly this kind of unsigned integer underflow (a small sketch of the wrap-around we have in mind is shown below). If there is such a hard cap on the number of neighbors, I could try to change it to a larger number and see if that solves the issue, but I couldn't find anything on that matter.
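
To make the suspected failure mode concrete, here is a minimal sketch of that kind of unsigned-integer wrap-around, using NumPy's uint32 to mimic unsigned arithmetic on the device. The neighbor-count logic and the cap of 30 are purely illustrative assumptions, not PySPH's actual GPU neighbor-list code.

```python
# Illustrative only: how a fixed neighbor-list cap plus unsigned arithmetic
# could produce an enormous loop count. NumPy's uint32 mimics device-side
# unsigned ints; this is NOT PySPH's actual neighbor-list implementation.
import numpy as np

MAX_NBRS = np.uint32(30)                      # hypothetical per-particle cap
found = np.array([12, 34], dtype=np.uint32)   # second particle exceeds the cap

stored = np.minimum(found, MAX_NBRS)          # counts clamped to the cap
remaining = stored - found                    # uint32: wraps instead of going negative

print(remaining)  # [0 4294967292] -> a "neighbor loop" over ~4.29e9 entries
                  # would read far outside the buffer and get the process killed
```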

collision.txt
surface_tension_adami.txt

@Qcellaris
Author

> Hi, sorry about the slow response. Is it possible that there is a blow-up of the particles and a large increase of the domain size? As regards the restart, it is possible that there is an issue with restarting on the GPU because some necessary state has not been saved. Do you have a small reproducible example? That would help debug this issue better.

Dear Prabhu,

The simulations just continue perfectly fine on a CPU, but when I continue them on a GPU they crash. If it were a blow-up of particles, it would crash on the CPU as well as on the GPU, right?

I also don't think it is related to some necessary state not being saved for the restart, because when we run the simulation from the start it also crashes every time at the same point, and restarting from there gives the same error log as a run straight from the start.

Maybe we can look into this issue in more detail together?

Best,

Stephan
