
[core] Multi-process(?) / GPU processes do not seem to be freed after ctrl+c on cluster #31451

Closed
thoglu opened this issue Jan 4, 2023 · 26 comments · Fixed by #33976
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-worker, P0 (Issue that must be fixed in short order), Ray 2.4, release-blocker (P0 Issue that blocks the release), size:large

Comments

@thoglu

thoglu commented Jan 4, 2023

What happened + What you expected to happen

This has been discussed at the end of #30414, but it seems more appropriate to start a new issue because it is probably unrelated.

I run a cluster with a head node and worker nodes. The head node is started via the CLI with ray start ... options; the worker nodes are likewise started with ray start ... options and connect to the head. All the worker nodes have GPUs.

On the head node I then run a tune script with a GPU request. After ctrl+c, the GPU does not seem to be freed on the worker nodes, and IDLE or TRAIN processes remain that block the GPU memory. Only a full ray stop kills all the processes.

I also tested this in a much simpler setup (just a head node with a GPU): run the tune script, hit ctrl+c, and the memory on the GPU remains blocked and ray:TRAIN processes remain (see #30414).

EDIT: The issue was found to be related to num_workers>0 in the pytorch DataLoader, which leaves extra ray processes open after ctrl+c. Related: ray-project/ray_lightning#87, pytorch/pytorch#66482

EDIT 2: I could solve the issue by using 515.xxx NVIDIA drivers (but only for the head node); with 470.xxx and/or on the worker nodes the issue seems to remain.

EDIT 3: The issue persists, regardless of driver version.

Versions / Dependencies

Ray 2.2.0
Pytorch 1.12.1
NVIDIA driver 470.xxx

Reproduction script

Something like the following, where trainable, cfg, results_dir, and args come from my own code:

import os

from ray import tune
from ray.air import FailureConfig, RunConfig
from ray.tune import TuneConfig, Tuner

if args.resume:
    tuner = Tuner.restore(
        path=os.path.join(results_dir, "test")
    )
    tuner.fit()
else:
    # Request 4 CPUs and 1 GPU per trial.
    trainable = tune.with_resources(trainable, resources={"cpu": 4, "gpu": 1})

    failure_config = FailureConfig(max_failures=-1)

    run_config = RunConfig(
        name="test",
        local_dir=results_dir,
        failure_config=failure_config,
        log_to_file=True,
    )

    tune_config = TuneConfig(
        num_samples=1,
        reuse_actors=False,
    )

    tuner = Tuner(trainable, run_config=run_config, tune_config=tune_config, param_space=cfg)
    tuner.fit()

Any tune job that runs on a node previously started with ray start --num-cpus=4 --num-gpus=1 triggers it.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@thoglu thoglu added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage) labels Jan 4, 2023
@ericl ericl added the core (Issues that should be addressed in Ray Core) label Jan 4, 2023
@ericl
Collaborator

ericl commented Jan 4, 2023

@thoglu , can you confirm the following exhibits the same leak for you?

import ray
from ray import tune
from ray.tune import TuneConfig, Tuner

ray.init("auto")

def trainable(_):
    import torch
    torch.tensor(100000).cuda()
    import time
    time.sleep(9999)

trainable = tune.with_resources(trainable, resources={"cpu": 4, "gpu": 1})

tune_config = TuneConfig(
    num_samples=1,
    reuse_actors=False,
)

tuner = Tuner(trainable, tune_config=tune_config)

tuner.fit()

@thoglu
Author

thoglu commented Jan 5, 2023

Hmm, no, this one does not have the leak. I will try a few things out tomorrow and work my way toward my own setup (it's too late here right now).

@matthewdeng
Contributor

@thoglu could you share what your trainable definition looks like?

@thoglu
Author

thoglu commented Jan 5, 2023

Hmm, so it has to do with the trainable. I checked with my submission script and just replaced my trainable with the simple one from above, and there is no problem.

My actual trainable involves a pytorch-lightning training loop and many lines of code split across different files. Anything in particular I should be looking for? Could this be related to a garbage-collection issue, with some tensor that was not detached and a reference kept? I would imagine none of that should matter once you ctrl+c? Or could it be pytorch-lightning?

@matthewdeng
Contributor

Do you have any multiprocessing in your script, e.g. a DataLoader with num_workers>0? It could be that there is a process that is being spun up but the signal is not getting propagated to that process.

One way to verify this is to run py-spy dump --pid <pid> on the remaining ray:TRAIN process, which will show you what code the active process is running.

Related: ray-project/ray_lightning#87
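
For concreteness, a minimal sketch of that pattern: a Tune trainable whose DataLoader forks worker processes via num_workers>0. The toy dataset, batch size, and loop counts are made up; only the num_workers>0 part matters.

import torch
from torch.utils.data import DataLoader, TensorDataset

from ray import tune
from ray.tune import TuneConfig, Tuner

def trainable(_):
    # Toy dataset; any dataset works, the point is num_workers > 0 below.
    dataset = TensorDataset(torch.randn(1024, 8))
    # Iterating this loader forks worker processes that must be cleaned up on ctrl+c.
    loader = DataLoader(dataset, batch_size=32, num_workers=4)
    for _ in range(1000):
        for (batch,) in loader:
            batch.sum()  # stand-in for a training step

tuner = Tuner(
    # The GPU request mirrors the setup reported above.
    tune.with_resources(trainable, resources={"cpu": 4, "gpu": 1}),
    tune_config=TuneConfig(num_samples=1),
)
tuner.fit()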

@ericl ericl added the P1 (Issue that should be fixed within a few weeks) label and removed the triage (Needs triage) label Jan 5, 2023
@ericl ericl changed the title from "processes do not seem to be freed after ctrl+c on cluster and block GPU memory [Tune|Core]" to "[core] Multi-process(?) / GPU processes do not seem to be freed after ctrl+c on cluster" Jan 5, 2023
@ericl ericl added the triage (Needs triage) label Jan 5, 2023
@thoglu
Author

thoglu commented Jan 5, 2023

@matthewdeng @ericl
Indeed, it is related to num_workers>0. I changed num_workers to 0 and did not see the problem... I should have seen that earlier, many thanks for your quick help guys!

So it is the same issue as ray-project/ray_lightning#87. Is there any hope that this will get solved at all? The issue has been open for over a year already.

It is not even a lightning issue, but a "DataLoader in connection with Ray" issue, right? There must be other people seeing this already. I presume num_workers>0 is prevalent in many use cases; for me it speeds up training significantly.

@ericl
Collaborator

ericl commented Jan 5, 2023

Yeah, I think there is an underlying process management bug here when those workers are forked. I'll keep the P1 tag. @scv119 , is this something we can slot for 2.3-2.4?

@thoglu
Author

thoglu commented Jan 5, 2023

@ericl @matthewdeng
Ok, I could solve this issue by updating the NVIDIA driver to a newer version (515.85.01); the old driver was from the 470.xxx series. It seems that for some reason the driver impacts how tune, pytorch, and potentially lightning interact when num_workers>0. However, maybe there is a fix that works for older drivers as well?

EDIT: It actually did not solve the issue, I just ran the job too briefly. After starting the dataloader with num_workers>0, the same issue appears with the new driver as well.

@ericl
Collaborator

ericl commented Jan 5, 2023

Yeah, I think Ray should in principle be able to kill the process successfully even if num_workers>0 / there is an NVidia issue. Maybe there's some issue with the worker graceful shutdown and we need to force kill it.

@scv119 scv119 added the Ray 2.4 label and removed the triage (Needs triage) label Jan 5, 2023
@cadedaniel
Member

Hi @thoglu, could you run ps aux and report the state of the Ray processes when you experience the leak? Specifically I am looking to see whether any of the processes with GPU resources are stuck in the D state. That would indicate an NVIDIA kernel driver bug and makes the processes unkillable without force-unloading the driver first (IIRC; the last time I worked with this was 2021).
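
A convenience sketch for collecting that information in one pass, assuming psutil is installed (a status of "disk-sleep" corresponds to the D state mentioned above):

import psutil

# List every process whose command line contains "ray::" together with its state.
for proc in psutil.process_iter(["pid", "status", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "ray::" in cmdline:
        print(proc.info["pid"], proc.info["status"], cmdline)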

@ericl
Collaborator

ericl commented Jan 5, 2023

@cadedaniel the processes are in SNl state (from the prior thread)

@matthewdeng
Contributor

This can be reproduced without GPUs, though in practice I think this is more noticeable when using GPUs because the GRAM is held.

Minimal repro:

import ray
import time
from multiprocessing import Process

@ray.remote
class MyActor:
    def run(self):
        p = Process(target=lambda: time.sleep(1000), daemon=True)
        p.start()
        p.join()

actor = MyActor.remote()
ray.get(actor.run.remote())

When executing the script:

(base) ray@ip-172-31-242-77:~/default$ ps aux | grep ray::
ray         4240 17.8  0.2 11042908 90892 ?      SNl  13:20   0:01 ray::MyActor.run
ray         4287  0.0  0.1 10879052 54796 ?      SNl  13:20   0:00 ray::MyActor.run
ray         4342  0.0  0.0   5200   724 pts/5    S+   13:20   0:00 grep --color=auto ray::
  • 4240 is the original Actor process.
  • 4287 is the spawned process.

After terminating the script with ctrl+C:

(base) ray@ip-172-31-242-77:~/default$ ps aux | grep ray::
ray         4287  0.0  0.1 10879052 54796 ?      SNl  13:20   0:00 ray::MyActor.run
ray         4409  0.0  0.0   5200   716 pts/5    S+   13:21   0:00 grep --color=auto ray::

@matthewdeng
Contributor

cc @rkooo567 this is the issue we investigated before.

@cadedaniel cadedaniel assigned scv119 and unassigned cadedaniel Jan 5, 2023
@cadedaniel
Member

cadedaniel commented Jan 5, 2023

It is likely the same as what you list here, @matthewdeng; I guess I'm still wondering why the driver upgrade fixed the issue. It could be that 510 drivers free resources more aggressively. After fixing the issue we should loop back and verify that it works on 470 drivers (I think they still have a good bit of lifetime left in them, but I could be wrong).

@thoglu
Author

thoglu commented Jan 14, 2023

@matthewdeng @cadedaniel @ericl
It seems I was too quick with the solution. After updating all the nodes I let it run again and still encountered the issue.
It seems that previously I did not let it run long enough for the dataloader to start its multiple worker processes: the graphics card was already in use, but I hit ctrl+c before the worker processes had started.

So the driver update does not fix the issue for me.

Here is another example, with num_workers=10:

ray  196822  0.7  0.0 118858764 68528 pts/0 SNl+ 13:30   0:02 ray::IDLE
ray  196824  0.7  0.0 118858648 66548 pts/0 SNl+ 13:30   0:02 ray::IDLE
ray  196825  0.7  0.0 119007348 94880 pts/0 SNl+ 13:30   0:02 ray::IDLE
ray  196963 94.9  1.4 141864836 5676112 pts/0 SNl+ 13:30   4:31 ray::ImplicitFunc.train
ray  197182  2.0  1.1 139993224 4716880 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197240  1.9  1.1 139993236 4717052 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197296  1.9  1.1 139993248 4714232 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197352  1.9  1.1 139993260 4718376 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197408  1.9  1.1 139993272 4716756 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197464  1.8  1.1 139993284 4716016 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197520  1.8  1.1 139993296 4713048 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197576  1.7  1.1 139993308 4711392 pts/0 SNl+ 13:31   0:03 ray::ImplicitFunc.train
ray  197632  1.8  1.1 139993320 4714784 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train
ray  197688  1.8  1.1 139993332 4714380 pts/0 SNl+ 13:31   0:04 ray::ImplicitFunc.train

after ctrl+c

ray  197182  1.8  1.1 139993224 4716876 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197240  1.8  1.1 139993236 4717048 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197296  1.7  1.1 139993248 4714228 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197352  1.7  1.1 139993260 4718372 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197408  1.7  1.1 139993272 4716752 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197464  1.6  1.1 139993284 4716012 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197520  1.6  1.1 139993296 4713088 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197576  1.6  1.1 139993308 4711388 pts/0 SNl 13:31   0:03 ray::ImplicitFunc.train
ray  197632  1.7  1.1 139993320 4714780 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train
ray  197688  1.6  1.1 139993332 4714376 pts/0 SNl 13:31   0:04 ray::ImplicitFunc.train

The main process seems to be gone, but the 10 workers remain.
This was after a single ctrl+c. When hitting ctrl+c aggressively, a few more processes get killed, but never all of them.

@ezorita

ezorita commented Jan 20, 2023

I have the same issue in a different context. The workers end properly and all object references are cleared, but they stay in IDLE consuming resources. I suspect this is a consequence of process signal masking (#31805). This could happen whenever a library running in the worker spawns processes and relies on signal-based communication, assuming all signals are unblocked in the children.
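
To illustrate that suspicion with a standalone sketch (independent of Ray, using the default fork start method on Linux): a child forked from a parent that has blocked SIGTERM inherits that mask, so a plain SIGTERM sent to the child is never delivered.

import signal
import time
from multiprocessing import Process

def child():
    # The signal mask is inherited across fork, so SIGTERM is blocked here too
    # and `kill <child_pid>` appears to have no effect.
    print("blocked in child:", signal.pthread_sigmask(signal.SIG_BLOCK, []))
    time.sleep(1000)

if __name__ == "__main__":
    # Hypothetical stand-in for a parent that masks signals (what #31805 describes).
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    p = Process(target=child)
    p.start()
    p.join()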

@scv119
Contributor

scv119 commented Feb 22, 2023

It's interesting that PyTorch's dataloader subprocess does a health check against its parent (pytorch/pytorch#6606), so it is supposed to terminate itself if the parent dies.

https://discuss.pytorch.org/t/when-i-shut-down-the-pytorch-program-by-kill-i-encountered-the-problem-with-the-gpu/6315/2
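
Conceptually, that health check looks something like the sketch below; this is a simplification, not the actual torch.utils.data implementation.

import os
import threading
import time

def parent_alive_watchdog(original_ppid: int, poll_interval: float = 5.0) -> None:
    """Exit this worker process if its original parent disappears."""
    while True:
        if os.getppid() != original_ppid:
            # The parent died and the worker was re-parented (e.g. to PID 1).
            os._exit(1)
        time.sleep(poll_interval)

# Started inside a freshly forked worker, e.g.:
# threading.Thread(target=parent_alive_watchdog, args=(os.getppid(),), daemon=True).start()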

@rkooo567 rkooo567 added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issue that must be fixed in short order) label Feb 27, 2023
@rkooo567 rkooo567 assigned cadedaniel and unassigned scv119 Feb 27, 2023
@scv119 scv119 added the P0 (Issue that must be fixed in short order) label and removed the P1 (Issue that should be fixed within a few weeks) label Mar 6, 2023
@cadedaniel
Member

I've spent some time with Ray + Torch dataloader and can't reproduce the reported behavior. I think it's possible for Ray Lightning + Lightning + Torch Dataloader to have the issue when Ray + Torch dataloader doesn't, as the Ray Lightning integration overrides some cleanup logic in the default Lightning.

Things I've tried:

  • Killing the actor containing the dataloader with SIGTERM/SIGKILL.
    • With SIGKILL the forked processes are inherited by init; the aforementioned health check inside the dataloader worker processes kills them after a few seconds.
    • With SIGTERM the forked processes take some time to die (on the order of 5 seconds each), but they still die. I believe this is because the internal queues used by Torch have a timeout which allows the shutdown logic to progress. The Torch dataloader shutdown logic is quite complex and it's possible there's a bug (worker processes won't shut down until they get a None in their input queues), but I haven't seen any obvious case.
  • Killing the actor with Ctrl+C from a Ray Job. This appears to produce the same behavior as SIGTERM.
  • Starting the forking actor from a Ray Task / Ray Actor, killing the job. No leaks.
  • Using Ray Tune to start a trainable which has a Torch dataloader. I don't see any leaks.

I will try an end-to-end example on DataLoader + Ray Tune + Ray Lightning + Lightning tomorrow. I have also been trying exclusively in Ray Jobs and should also try in Ray Client.

@cadedaniel
Member

It would be really helpful to have a runnable reproduction script, which includes the Lightning components. There is a good chance I won't be able to reproduce without it.

@ericl
Collaborator

ericl commented Mar 10, 2023

What about @matthewdeng 's Jan 5th repro above with just multiprocessing?

@anhnami

anhnami commented Mar 10, 2023

I stopped a previous raytune run by pressing Ctrl-C multiple times, and the next tune run then hit OOM; maybe that was this issue. Now I always run "ray stop" after cancelling experiments with SIGINT, and everything is fine.

@cadedaniel
Member

What about @matthewdeng 's Jan 5th repro above with just multiprocessing?

Hmm, I don't think this is exactly what's happening here because the torch dataloaders should die when their ppid changes. When I run the repro, the spawned process ppid changes to 1.

That said, if the raylet polled each worker process for child processes, we could keep track of the extra processes to kill, which would likely fix this case. Is there a better way to track which processes to clean up? Overall it seems the root cause here is the Torch dataloader spawning processes that don't die when their parent process dies (although we're probably adding an edge case they didn't think of).
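
A rough sketch of that idea, assuming psutil: enumerate a worker's descendants and terminate them, escalating to SIGKILL for stragglers. This is only an illustration, not the actual raylet implementation.

import psutil

def kill_descendants(worker_pid: int, timeout_s: float = 5.0) -> None:
    """Terminate everything a worker process has spawned, then force-kill stragglers."""
    try:
        worker = psutil.Process(worker_pid)
    except psutil.NoSuchProcess:
        return
    children = worker.children(recursive=True)
    for child in children:
        child.terminate()  # polite SIGTERM first
    _, alive = psutil.wait_procs(children, timeout=timeout_s)
    for child in alive:
        child.kill()  # SIGKILL anything that ignored SIGTERM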

@ericl
Collaborator

ericl commented Mar 10, 2023 via email

@thoglu
Author

thoglu commented Mar 20, 2023

I stopped a previous raytune run by pressing Ctrl-C multiple times, and the next tune run then hit OOM; maybe that was this issue. Now I always run "ray stop" after cancelling experiments with SIGINT, and everything is fine.

This would not fix the issue in the case I describe above: when a remote worker dies for whatever reason, a ray stop might not be possible, and when the main process tries to resend a job to the worker node, that node's memory is still full.

@rkooo567 rkooo567 added the release-blocker (P0 Issue that blocks the release) label and removed the release-blocker (P0 Issue that blocks the release) label Mar 29, 2023
jjyao pushed a commit that referenced this issue Apr 7, 2023
… child processes in the CoreWorker shutdown sequence. (#33976)

We kill all child processes when a Ray worker process exits. This addresses process leaks that caused GPU OOM errors in #31451. There is some risk to this PR, particularly if Ray users rely on Ray's existing behavior of leaking processes. We don't know of any such user, but we provide a new flag RAY_kill_child_processes_on_worker_exit to provide a workaround in case someone is impacted.
@cadedaniel
Member

Hi all, I have an update for this issue:

We merged a partial fix into master and expect it to make it out in Ray 2.4. On Linux, in the case where the driver script is cancelled or exits normally, each Ray worker process will now kill its immediate child processes. Although we could not reproduce the Torch dataloader process leak described here, we believe this will fix the Torch issue and free the previously reserved GPU memory.
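
For anyone relying on the old behavior, the commit message above mentions an escape hatch, RAY_kill_child_processes_on_worker_exit. A hedged sketch of how such a RAY_* config variable is typically applied is below; the accepted value ("false") and the exact placement are assumptions here, and on a multi-node cluster the variable would need to be set in each node's environment before ray start rather than in the driver.

import os

# Assumption: set before Ray worker processes start so they inherit it, with
# "false" restoring the previous behavior of not killing child processes.
os.environ["RAY_kill_child_processes_on_worker_exit"] = "false"

import ray

ray.init()  # a locally started cluster inherits the driver's environment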

We have plans for a more holistic approach to handle cases where the worker processes crash and leak processes, and where child processes cause leaks by spawning child processes of their own. Please reach out if you are experiencing these issues.

Follow the below issues for updates. Thanks!
