Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interrupting npm install child process due to timeouts leaves orphaned npm processes #16197

Closed
rarkins opened this issue Jun 23, 2022 · 30 comments
Assignees
Labels
priority-2-high Bugs impacting wide number of users or very important features status:in-progress Someone is working on implementation type:bug Bug fix of existing functionality

Comments

@rarkins
Copy link
Collaborator

rarkins commented Jun 23, 2022

How are you running Renovate?

Self-hosted

If you're self-hosting Renovate, tell us what version of Renovate you run.

32.94.0

Please select which platform you are using if self-hosting.

github.com

If you're self-hosting Renovate, tell us what version of the platform you run.

No response

Was this something which used to work for you, and then stopped?

I never saw this working

Describe the bug

Renovate has a default executationTimeout of 15, i.e. 15 minutes. When binaryMode=install this means that the Node child_process interrupts the exec call after 15 minutes if it hasn't returned gracefully and we log a rawExec err. This seems to work, but we found that in a normal Renovate buildpack environment (i.e. ubuntu base image, running as a Docker container) then the underlying npm process remains. Worse, it does not eventually finish - instead it uses up CPU forever until the image is restarted. Over time, multiple orphaned npm processes can accumulate within the same container until eventually CPU hits maximum.

Relevant debug logs

Relevant logging was done using system commands like ps and top, I do not have a copy of those to share.

Have you created a minimal reproduction repository?

No reproduction repository

@rarkins rarkins added type:bug Bug fix of existing functionality priority-2-high Bugs impacting wide number of users or very important features status:requirements Full requirements are not yet known, so implementation should not be started labels Jun 23, 2022
@rarkins
Copy link
Collaborator Author

rarkins commented Jun 23, 2022

In short, the problem appears to be that when an exec child_process is killed due to a timeout, the npm process which is running is not killed too.

I observed that SIGTERM is used for this purpose and that this is configurable, but I'm not sure if changing that makes any difference.

We also need to be aware that this could potentially impact different managers in different ways.

@viceice
Copy link
Member

viceice commented Jun 23, 2022

does it happen on our oss docker images or on ws images?

@viceice
Copy link
Member

viceice commented Jun 23, 2022

@rarkins
Copy link
Collaborator Author

rarkins commented Jun 23, 2022

It was observed on a WS which was built using buildpack v2. When the problem occurred we could observe that dumb-init was running, as was node src/server.js, and then 1 or more npm processes. They did not show up as zombies and they also never "stopped" - they used up CPU forever (days).

@nabeelsaabnaTests
Copy link

working on a reproduction

@rarkins rarkins added status:in-progress Someone is working on implementation and removed status:requirements Full requirements are not yet known, so implementation should not be started labels Jun 23, 2022
@nabeelsaabnaTests
Copy link

minimal reproduction: https://github.com/nabeelys/renovate-16197-orphaned-processes

2 consecutive runs resulting in 2 npm install not terminated

ubuntu@d3f7fd88ad3b:/usr/src/app$ history
    1  time renovate --binary-source=install --execution-timeout=1 --allow-scripts=true
    2  ps -fu ubuntu
    3  time renovate --binary-source=install --execution-timeout=1 --allow-scripts=true
    4  ps -fu ubuntu
    5  history
ubuntu@d3f7fd88ad3b:/usr/src/app$
ubuntu@d3f7fd88ad3b:/usr/src/app$ ps -fu ubuntu
UID          PID    PPID  C STIME TTY          TIME CMD
ubuntu         1       0  0 11:03 ?        00:00:00 dumb-init -- bash
ubuntu         9       1  0 11:03 pts/0    00:00:00 bash
ubuntu       133       1  0 11:04 pts/0    00:00:00 npm install
ubuntu       144     133  0 11:04 pts/0    00:00:00 sh -c /tmp/prepare-1655982247632.sh
ubuntu       145     144  0 11:04 pts/0    00:00:00 /usr/bin/sh /tmp/prepare-1655982247632.sh
ubuntu       146     145  0 11:04 pts/0    00:00:00 sleep 900
ubuntu       217       1  0 11:06 pts/0    00:00:00 npm install
ubuntu       228     217  0 11:07 pts/0    00:00:00 sh -c /tmp/prepare-1655982420855.sh
ubuntu       229     228  0 11:07 pts/0    00:00:00 /usr/bin/sh /tmp/prepare-1655982420855.sh
ubuntu       230     229  0 11:07 pts/0    00:00:00 sleep 900
ubuntu       232       9  0 11:08 pts/0    00:00:00 ps -fu ubuntu
ubuntu@d3f7fd88ad3b:/usr/src/app$

@Gabriel-Ladzaretti
Copy link
Collaborator

reproduced on my end as well.

image

@viceice
Copy link
Member

viceice commented Jun 23, 2022

We should probably use https://azimi.me/2014/12/31/kill-child_process-node-js.html solution in renovate

@nabeelsaabnaTests
Copy link

We are exploring multiple options, the main problem is {detached: true} can't be combined with {detached: true, timeout: 10} and the example above showing 3 newly created processes after renovate calling the npm install command

@rarkins
Copy link
Collaborator Author

rarkins commented Jun 29, 2022

What if we narrow the scope to environments using buildpack? Or perhaps we could expand wider to even say containerized environments with Renovate as the exclusive task? Because when we run within a container, and we're the only application, the process list is pretty short. At the end of each child process we could essentially just check for "processes which shouldn't be there" and kill them all (e.g. PIDs above a certain number).

@nabeelsaabnaTests
Copy link

nabeelsaabnaTests commented Jun 29, 2022

We don't need to narrow the scope to specific environments.

the general approach is replacing:

export const rawExec: (
  cmd: string,
  opts: RawExecOptions
) => Promise<ExecResult> = promisify(exec);

with:

export const rawExec: (
  cmd: string,
  opts: RawExecOptions
) => Promise<ExecResult> = customPromisify(exec);

// a new wrapper that handles the "Promisify" logic 
// in addition to capturing the returned `ChildProcess` value by `exec` or `spawn` for handling the bad cases
async function customPromisify(..);

after all that being said, since switching to spawn for the detached functionality (which can't be combined with timeout) we'll need to manually implement the timeout if we want to run the new process in detached mode

@rarkins
Copy link
Collaborator Author

rarkins commented Jun 29, 2022

Switching from exec to spawn is a pretty big change. We have considered it before so that we can stream logs as we go, and found some challenges.

Why not just keep it as it is, and when we catch an err we kill all child processes such as the npm causing the problem?

@nabeelsaabnaTests
Copy link

first I'm going to test the fix using exec and killing the hanging process and validating it's children are also killed automatically.
if the first option doesn't work for the child processes of the hanging process, we'll most likely need the {detached: true} which will force us to use spawn.

In addition, using spawn does not seem to be a big challenge (we already have a working POC)

@rarkins
Copy link
Collaborator Author

rarkins commented Jun 29, 2022

I thought the problem was that the child process we create gets successfully killed, but all it's children are not. So then your challenge is identifying which are it's children.

@viceice
Copy link
Member

viceice commented Jun 29, 2022

if we use detach, we can kill the whole process group. pretty easy. we just need to follow the Blogpost linked above

@rarkins
Copy link
Collaborator Author

rarkins commented Jun 29, 2022

@viceice ..and also switch from exec to spawn, and implement our own timeout?

@viceice
Copy link
Member

viceice commented Jun 29, 2022

yes

@nabeelsaabnaTests
Copy link

nabeelsaabnaTests commented Jun 30, 2022

renovate calls npm command and the new process is PID 406 mapped to bin/sh -c npm install .....

ubuntu@cb9621e287e5:/usr/src/app$ ps -fu $USER
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu         8  0.0  0.1   6000  4076 pts/0    Ss   07:45   0:00 bash
ubuntu       337 45.9 11.7 1180036 357748 pts/0  Sl+  08:40   0:19  \_ node renovate.js --binary-source=install --execution-timeout=1 --allow-scripts=true
ubuntu       406  1.0  0.0   2616   592 pts/0    S+   08:40   0:00      \_ /bin/sh -c npm install --package-lock-only --no-audit
ubuntu       407 36.5  2.1 773056 65636 pts/0    Sl+  08:40   0:01          \_ npm install
ubuntu       418  0.0  0.0   2616   532 pts/0    S+   08:40   0:00              \_ sh -c /tmp/prepare656578442.sh
ubuntu       419  0.0  0.0   2616   532 pts/0    S+   08:40   0:00                  \_ sh /tmp/prepare656578442.sh
ubuntu       420  0.0  0.0   4268   580 pts/0    S+   08:40   0:00                      \_ sleep 900
ubuntu       249  0.0  0.1   6000  4004 pts/1    Ss   08:08   0:00 bash
ubuntu       421  0.0  0.1   7652  3372 pts/1    R+   08:40   0:00  \_ ps -fu

After the timeout is reached PID 406 is killed, but the sub-tree remains (this is a behavior related to sh)

ubuntu@cb9621e287e5:/usr/src/app$ ps -fu $USER
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu       407  2.4  1.9 774080 58928 pts/0    Sl+  08:40   0:01 npm install
ubuntu       418  0.0  0.0   2616   532 pts/0    S+   08:40   0:00  \_ sh -c /tmp/prepare656578442.sh
ubuntu       419  0.0  0.0   2616   532 pts/0    S+   08:40   0:00      \_ sh /tmp/prepare656578442.sh
ubuntu       420  0.0  0.0   4268   580 pts/0    S+   08:40   0:00          \_ sleep 900
ubuntu         8  0.0  0.1   6000  4076 pts/0    Ss   07:45   0:00 bash
ubuntu       337 20.7  6.2 1041592 190264 pts/0  Rl+  08:40   0:21  \_ node renovate.js --binary-source=install --execution-timeout=1 --allow-scripts=true
ubuntu       249  0.0  0.1   6000  4004 pts/1    Ss   08:08   0:00 bash
ubuntu       454  0.0  0.1   7652  3344 pts/1    R+   08:41   0:00  \_ ps -fu
ubuntu@cb9621e287e5:/usr/src/app$

if we for example change the shell to bash (using opts.shell = '/bin/bash') the npm install will also be killed we'll remain with whatever commands npm is invoking using sh
in our case it will be like:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu       419  0.0  0.0   2616   532 pts/0    S+   08:40   0:00 sh /tmp/prepare656578442.sh
ubuntu       420  0.0  0.0   4268   580 pts/0    S+   08:40   0:00  \_ sleep 900
ubuntu         8  0.0  0.1   6000  4076 pts/0    Ss   07:45   0:00 bash
ubuntu       337 20.7  6.2 1041592 190264 pts/0  Rl+  08:40   0:21  \_ node renovate.js --binary-source=install --execution-timeout=1 --allow-scripts=true
ubuntu       249  0.0  0.1   6000  4004 pts/1    Ss   08:08   0:00 bash
ubuntu       454  0.0  0.1   7652  3344 pts/1    R+   08:41   0:00  \_ ps -fu
ubuntu@cb9621e287e5:/usr/src/app$

@nabeelsaabnaTests
Copy link

nabeelsaabnaTests commented Jun 30, 2022

Should we start with simply adding opts.shell = '/bin/bash' ?

Switching to spawn will need to happen in 3 phases:

  1. keep same behavior when switching to spawn, this means with builtin timeout and without detached
  2. implement our own timeout (+ handling killing the process)
  3. implement the detached behavior + switching to process.kill(-child.pid); + handling io-streams

@viceice
Copy link
Member

viceice commented Jun 30, 2022

bash can only be used on linux, we should use default on windows/ macos

@viceice
Copy link
Member

viceice commented Jun 30, 2022

but setting she'll seems a good idea

@PhilipAbed
Copy link
Collaborator

Assigning to @Gabriel-Ladzaretti , already have the POC code implemented, will open PR as soon as he finishes.

@nabeelsaabnaTests
Copy link

Correction to:

We are exploring multiple options, the main problem is {detached: true} can't be combined with {detached: true, timeout: 10} and the example above showing 3 newly created processes after renovate calling the npm install command

{detached: true, timeout: 10000} works fine, only the child.unref() is the problematic behavior when using timeout and required implementing our own "waiting/timeout"

@Gabriel-Ladzaretti
Copy link
Collaborator

#16414 (comment) contains a POC for steps 1, 2 & 3 from #16197 (comment), while the PR itself is a refactor only.

@Gabriel-Ladzaretti
Copy link
Collaborator

@rarkins
Copy link
Collaborator Author

rarkins commented Aug 23, 2022

Let's try. Can it be controllable via config? e.g. env variable? RENOVATE_X_ for experimental

@Gabriel-Ladzaretti
Copy link
Collaborator

Gabriel-Ladzaretti commented Aug 23, 2022

shouldn't be a problem, env variable only or a full config option?

@rarkins
Copy link
Collaborator Author

rarkins commented Aug 23, 2022

Env variable opt in to begin with

@Gabriel-Ladzaretti
Copy link
Collaborator

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
priority-2-high Bugs impacting wide number of users or very important features status:in-progress Someone is working on implementation type:bug Bug fix of existing functionality
Projects
None yet
Development

No branches or pull requests

5 participants