
High number of processes of /next/dist/compiled/jest-worker/processChild.js still alive after next build #45508

Closed
1 task done
zqjimlove opened this issue Feb 2, 2023 · 129 comments · Fixed by #54500
Labels: bug (Issue was opened via the bug report template) · linear: next (Confirmed issue that is tracked by the Next.js team) · please add a complete reproduction

Comments

@zqjimlove

zqjimlove commented Feb 2, 2023

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
  Platform: darwin
  Arch: arm64
  Version: Darwin Kernel Version 22.3.0: Thu Jan  5 20:48:54 PST 2023; root:xnu-8792.81.2~2/RELEASE_ARM64_T6000
Binaries:
  Node: 18.13.0
  npm: 8.19.3
  Yarn: 1.22.19
  pnpm: 7.26.2
Relevant packages:
  next: 12.0.9
  react: 17.0.2
  react-dom: 17.0.2

Which area(s) of Next.js are affected? (leave empty if unsure)

CLI (create-next-app)

Link to the code that reproduces this issue

https://github.com/vercel/next.js/files/10565355/reproduce.zip

To Reproduce

reproduce.zip


This problem is reproducible on next@12.0.9 and above, but 12.0.8 was fine.

Alternatively, removing getInitialProps from _app.tsx avoids the problem on next@12.0.9 and above.

// GlobalApp.getInitialProps = async function getInitialProps(appContext) {
//   const appProps = await App.getInitialProps(appContext);

//   return {
//     ...appProps,
//   };
// };

Describe the Bug

High number of processes of /next/dist/compiled/jest-worker/processChild.js still alive after next build

Expected Behavior

Kill all child processes.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

No response

NEXT-1348

@zqjimlove added the bug label on Feb 2, 2023
@Francoois

This comment was marked as off-topic.

@yzubkov

yzubkov commented Apr 24, 2023

In my case I get several of these ".../node_modules/next/dist/compiled/jest-worker/processChild.js" processes taking up lots of memory. I see them appear after executing "npm run start", and they disappear when I terminate the app (Ctrl+C).
Not sure if or how this potentially relates to the build process.
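In case it helps others confirm the same thing, here is a small sketch for listing the lingering workers from Node (it assumes a POSIX ps; the list-workers.js file name is made up for illustration):

// list-workers.js — print any jest-worker child processes with their PID, parent PID and RSS (KB).
const { execSync } = require('child_process')

const out = execSync('ps -eo pid,ppid,rss,args').toString()
const workers = out
  .split('\n')
  .filter((line) => line.includes('jest-worker/processChild.js'))

console.log(workers.join('\n') || 'no processChild.js workers found')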

@schorfES

We have also observed this issue in production, where it consumes memory that is likely not used or needed. This behavior was introduced in version 13.4.0. There is an open discussion about this topic, which you can find at: #49238.

@nicosh

nicosh commented May 24, 2023

We have the same problem; after a few deployments the server runs out of memory. As a temporary fix I added the following script to the deployment pipeline:

#!/bin/bash

# Find the process IDs of all processes containing the string "processChild.js" in the command path
pids=$(pgrep -f "processChild.js")

# Iterate over each process ID and kill the corresponding process
for pid in $pids; do
    echo "Killing process: $pid"
    kill "$pid"
done

But even with this script, it seems the applications keep spawning zombie processes.
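For reference, a hypothetical Node equivalent of the script above (the cleanup-workers.js file name is made up), in case the pipeline runs Node rather than bash:

// cleanup-workers.js — same idea as the bash loop above: find and kill leftover workers.
const { execSync } = require('child_process')

let pids = []
try {
  pids = execSync('pgrep -f "processChild.js"')
    .toString()
    .trim()
    .split('\n')
    .filter(Boolean)
} catch {
  // pgrep exits non-zero when nothing matches; treat that as "no leftover workers".
}

for (const pid of pids) {
  console.log(`Killing process: ${pid}`)
  process.kill(Number(pid), 'SIGKILL')
}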

@switz

switz commented May 29, 2023

Seeing this as well in prod

@MonstraG
Contributor

Downgrading to <13.4.0 for now I guess

@leerob
Member

leerob commented May 30, 2023

Merged this discussion into here: #49238

This might be related: 83b774e#diff-90d1d5f446bdf243be25cc4ea2295a9c91508859d655e51d5ec4a3562d3a24d9L1930

@leerob
Member

leerob commented May 30, 2023

Small favor, could you include a reproduction as a CodeSandbox instead of a zip file?

@leerob added the please add a complete reproduction label on May 30, 2023
@github-actions
Contributor

We cannot recreate the issue with the provided information. Please add a reproduction in order for us to be able to investigate.

Why was this issue marked with the please add a complete reproduction label?

To be able to investigate, we need access to a reproduction to identify what triggered the issue. We prefer a link to a public GitHub repository (template for pages, template for App Router), but you can also use these templates: CodeSandbox: pages or CodeSandbox: App Router.

To make sure the issue is resolved as quickly as possible, please make sure that the reproduction is as minimal as possible. This means that you should remove unnecessary code, files, and dependencies that do not contribute to the issue.

Please test your reproduction against the latest version of Next.js (next@canary) to make sure your issue has not already been fixed.

I added a link, why was it still marked?

Ensure the link is pointing to a codebase that is accessible (e.g. not a private repository). "example.com", "n/a", "will add later", etc. are not acceptable links -- we need to see a public codebase. See the above section for accepted links.

What happens if I don't provide a sufficient minimal reproduction?

Issues with the please add a complete reproduction label that receive no meaningful activity (e.g. new comments with a reproduction link) are automatically closed and locked after 30 days.

If your issue has not been resolved in that time and it has been closed/locked, please open a new issue with the required reproduction.

I did not open this issue, but it is relevant to me, what can I do to help?

Anyone experiencing the same issue is welcome to provide a minimal reproduction following the above steps. Furthermore, you can upvote the issue using the 👍 reaction on the topmost comment (please do not comment "I have the same issue" without reproduction steps). Then, we can sort issues by votes to prioritize.

I think my reproduction is good enough, why aren't you looking into it quicker?

We look into every Next.js issue and constantly monitor open issues for new comments.

However, sometimes we might miss one or two due to the popularity/high traffic of the repository. We apologize, and kindly ask you to refrain from tagging core maintainers, as that will usually not result in increased priority.

Upvoting issues to show your interest will help us prioritize and address them as quickly as possible. That said, every issue is important to us, and if an issue gets closed by accident, we encourage you to open a new one linking to the old issue and we will look into it.


@bfife-bsci

bfife-bsci commented May 31, 2023

I am commenting as a +1 to #49238 which I think more accurately described our issue. We only have 2 processChild.js processes, but this is likely due to running on GKE nodes with 2 CPUs. We run a minimum of 3 pods behind a service/load balancer. We unfortunately do not have a reproduction.

We were running 13.4.1 on node v16.19.0 in our production environment, and discovered that after some volume of requests or perhaps even period of time (as short as a day and a half, as long as 5 days), some next.js servers were becoming slow to unresponsive. New requests would take at least 5 seconds to process a response. CPU usage in the pod was maxed out, divided roughly at 33% user and 66% system. We discovered that requests are being proxied to a processChild.js child process, which is listening on a different port (is this the new App Router?). We observed the following characteristics:

  • excessive CPU usage
  • increased memory usage overall
  • no extra logging was observed (we don't see any logs after server startup)
  • there were over 3100 TCP connections established between the parent and processChild.js process
  • the parent process appeared to be retrying/attempting to retransmit requests that were queued up inside of it

strace'ing showed the following signature over and over again with different sockets/URLs

...
write(1593, "GET /URL1"..., 2828) = -1 EAGAIN (Resource temporarily unavailable)
write(1600, "GET /URL2"..., 2833) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(13, [{EPOLLOUT, {u32=266, u64=266}}, {EPOLLOUT, {u32=276, u64=276}}, {EPOLLOUT, {u32=280, u64=280}}, {EPOLLOUT, {u32=267, u64=267}}, {EPOLLOUT, {u32=315, u64=315}}, {EPOLLOUT, {u32=20, u64=20}}, {EPOLLOUT, {u32=322, u64=322}}, {EPOLLOUT, {u32=275, u64=275}}, {EPOLLOUT, {u32=325, u64=325}}, {EPOLLOUT, {u32=279, u64=279}}, {EPOLLOUT, {u32=332, u64=332}}, {EPOLLOUT, {u32=336, u64=336}}, {EPOLLOUT, {u32=314, u64=314}}, {EPOLLOUT, {u32=358, u64=358}}, {EPOLLOUT, {u32=324, u64=324}}, {EPOLLOUT, {u32=360, u64=360}}, {EPOLLOUT, {u32=281, u64=281}}, {EPOLLOUT, {u32=335, u64=335}}, {EPOLLOUT, {u32=296, u64=296}}, {EPOLLOUT, {u32=343, u64=343}}, {EPOLLOUT, {u32=377, u64=377}}, {EPOLLOUT, {u32=379, u64=379}}, {EPOLLOUT, {u32=359, u64=359}}, {EPOLLOUT, {u32=285, u64=285}}, {EPOLLOUT, {u32=268, u64=268}}, {EPOLLOUT, {u32=392, u64=392}}, {EPOLLOUT, {u32=366, u64=366}}, {EPOLLOUT, {u32=378, u64=378}}, {EPOLLOUT, {u32=406, u64=406}}, {EPOLLOUT, {u32=326, u64=326}}, {EPOLLOUT, {u32=323, u64=323}}, {EPOLLOUT, {u32=420, u64=420}}, ...], 1024, 0) = 275
write(266, "GET /URL3"..., 2839) = -1 EAGAIN (Resource temporarily unavailable)
write(276, "GET /URL4"..., 2830) = -1 EAGAIN (Resource temporarily unavailable)
write(280, "GET /URL5"..., 2825) = -1 EAGAIN (Resource temporarily unavailable)
...

It looks like the parent process continuously retries sending requests which are not being serviced/read into the child process. We're not sure what puts the server into this state (new requests will still be accepted and responded to slowly), but due to the unresponsiveness we downgraded back to 13.2.3.

@billnbell

I get next/dist/compiled/jest-worker/processChild.js processes running with NODE_ENV=production when running next start??

Downgrading.

@csi-lk
Contributor

csi-lk commented Jun 2, 2023

Downgrading

Hmm, I don't know if downgrading helps, @billnbell. I've seen this in our traces going back a few versions now; let me know if you have a specific version where this isn't an issue. I'm worried about memory utilisation, as we're seeing it max out on our containers :)

Edit: just read above about < 13.4.0; I'll give this a go and report back.

@cjcheshire

cjcheshire commented Jun 2, 2023

Here to say me too. We’ve recently jumped on the 13.4 bandwagon and in the last two weeks have started to see memory maxing out.

(Apologies, just read the bot asking me not to say this)

@BuddhiAbeyratne

I just had a massive outage thanks to this. It creeps up on you and doesn't die, there's no easy way to kill the workers, and it also stops build systems once it hits max RAM.

@billnbell

I can confirm downgrading worked for me. 13.2.3

@BuddhiAbeyratne

Maybe this will help too:

    git pull
    npm ci || exit
    BUILD_DIR=.nexttemp npm run build || exit
    if [ ! -d ".nexttemp" ]; then
        echo -e '\033[31m .nexttemp directory does not exist!\033[0m'
        exit 1
    fi
    rm -rf .next
    mv .nexttemp .next
    pm2 reload all --update-env
    echo "Deployment done."

@BuddhiAbeyratne

It seems the jest-worker processes are required; otherwise pm2 can't serve the site in prod mode.

@BuddhiAbeyratne

The solution I'm using now is to kill everything and restart the service, so it only creates 2 workers.

@cjcheshire

I just had a massive outage thanks to this. It creeps up on you and doesn't die, there's no easy way to kill the workers, and it also stops build systems once it hits max RAM.

This is freaky. We just did too!

We have 800 pages, and some pages make more than two API requests to build. We had a 1 GB limit on our pods; upping it to 2 GB has helped us.

@BuddhiAbeyratne

I'm on 13.4.1 if that helps to debug

@billnbell

I'm on 13.4.1 if that helps to debug
That is why I switched to 13.2.3. I have not tried newer versions or canary yet.

@billnbell

Just to be clear - we get out of memory in PRODUCTION mode when serving the site. I know others are seeing it when using next build, but we are getting this over time when using next start. Downgrading worked for us.

I don't really know why jest is running and eating all the memory on the box. Can we add a parameter to turn off jest when running next start?

@cjcheshire

@billnbell it's not jest itself though, right? It's the jest-worker package.

We even prune dev dependencies in production!

@billnbell

What is a jest-worker?

@cjcheshire

cjcheshire commented Jun 2, 2023

It’s a package, which we presume is how the background tasks for building work: https://www.npmjs.com/package/jest-worker?activeTab=readme
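For anyone curious, here is a minimal sketch of how that package is typically used (the ./render-page module and its renderPage method are invented for illustration; this is not Next.js's actual code):

const { Worker } = require('jest-worker')

async function main() {
  // Forks a pool of child processes (these are the processChild.js entries seen in ps)
  // and proxies calls to the module's exported methods.
  const worker = new Worker(require.resolve('./render-page'), {
    numWorkers: 2,
    exposedMethods: ['renderPage'],
  })

  // Each call is dispatched to an idle child and returns a promise.
  const html = await worker.renderPage('/about')
  console.log(html)

  // Without end() (or an explicit kill), the child processes stay alive.
  await worker.end()
}

main()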

@S-YOU

S-YOU commented Jun 2, 2023

The name jest-worker is actually confusing (at least to me) because of the popular test framework jest. jest itself seems to be a huge repository with a lot of packages; this one should be called Facebook's web server / worker or something else.

@timneutkens
Member

I just checked with @ijjk and as it turns out he saw something similar and fixed it in a recent refactor:

const cleanup = () => {
  debug('router-server process cleanup')
  for (const curWorker of [
    ...((renderWorkers.app as any)?._workerPool?._workers || []),
    ...((renderWorkers.pages as any)?._workerPool?._workers || []),
  ] as {
    _child?: import('child_process').ChildProcess
  }[]) {
    curWorker._child?.kill('SIGKILL')
  }
}
process.on('exit', cleanup)
process.on('SIGINT', cleanup)
process.on('SIGTERM', cleanup)
process.on('uncaughtException', cleanup)
process.on('unhandledRejection', cleanup)

Could you try with next@canary?

@hanoii

hanoii commented Jul 28, 2023

@timneutkens I tried it locally, as I was able to reproduce it as well, and yes, next@canary at least doesn't leave the process behind on a straight-out start failure:

I am getting a different error:

[Error: ENOENT: no such file or directory, open '/var/www/html/next/.next/BUILD_ID'] {

but I guess that's ok.

Maybe this fixes it.

@sedlukha
Contributor

@timneutkens

Also please make it clear what you're running. I.e. @sedlukha is that development? I guess so?

No, this is prod. I run it for 17 apps.

And I've tried canary; memory usage is now even worse: 4.9 GB (13.4.13-canary.6) vs 2.4 GB (v13.2.4) vs 3.16 GB (v13.4.12).


@sedlukha
Contributor

sedlukha commented Jul 29, 2023

@timneutkens it seems that experimental.appDir: false might disable the next-render-worker-app process and solve the problem for those who use only pages routing.

I would be happy to test it, but I can't do it on my real apps because of Next.js issue #52875.
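For clarity, this is the flag I mean, as a sketch only (see the reply below: the option is unsupported and slated for removal):

// next.config.js
module.exports = {
  experimental: {
    appDir: false,
  },
}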

@timneutkens
Member

timneutkens commented Jul 29, 2023

@sedlukha It seems what you're reporting is exactly the same as #49929, in which I've already explained the memory usage: there is no leak, it's just using multiple processes, and we're working on optimizing that: #49929 (comment)

Setting appDir: false is not supported, and that option will go away in a future version; we just haven't gotten around to removing the feature flag.


@hanoii thanks for checking 👍

@Nirmal1992

Same here... my MacBook crashed when I used the latest Next.js with Turborepo; multiple child processes were running in the background even after terminating the server.

@S-YOU

S-YOU commented Aug 8, 2023

FYI: experimental: {appDir: false} does not work anymore on 13.4.13 for me (the page rendered, but URL changes failed to load JSON and triggered SSR), and it now spawns 3 processes apart from the main process.

  • next-router-worker
  • next-render-worker-app
  • next-render-worker-pages

@space1worm

I have the same issue as well, on version 13.4.8.

@timneutkens
Member

@Nirmal1992 @S-YOU @space1worm I'm surprised you did not read my previous comment. I thought it was clear that these types of comments are not constructive? #45508 (comment)

@space1worm I'm even more surprised you're posting "same issue" without trying the latest version of Next.js...

@space1worm

space1worm commented Aug 9, 2023

@timneutkens Hey, yeah, sorry I missed it. I've made my test repo public.

You can check this commit: tracer

I had a memory usage problem on version 13.4.8: after navigating to any page, my pod's memory would skyrocket for some reason... and after that the whole app was breaking and becoming unresponsive.

Not sure if this problem is related to my codebase or not; I would love to hear what the problem is!

One more thing: I tried to increase resources, but the application was still unresponsive after breaking.

Here are some screenshots as a reference:

[screenshots]

@timneutkens
Member

Application is still not using the latest version of Next.js, same in the commit linked: https://gitlab.cern.ch/nzurashv/tracer/-/blob/master/package-lock.json#L4673

@space1worm

space1worm commented Aug 9, 2023

@timneutkens I have updated to the latest version and created a new branch: tracer/test

The issue still persists.

Here you can check this link as well:

tracer-test.web.cern.ch


Additionally, I inquired with the support team regarding the cause of the failure, and they provided me with the following explanation.

[screenshot of the support team's explanation]

@glaustino

FYI: experimental: {appDir: false} does not work anymore on 13.4.13 for me (the page rendered, but URL changes failed to load JSON and triggered SSR), and it now spawns 3 processes apart from the main process.

  • next-router-worker
  • next-render-worker-app
  • next-render-worker-pages

I have a question about these child processes: currently it seems they open random ports, which broke my application behind a WAF in Azure because we only open certain ports. Is there any way for me to force which ports these child processes use? I am on the latest Next.js release.

@jrscholey

jrscholey commented Aug 14, 2023

FYI: With 13.4.11 we were unable to start our app in Kubernetes. We received a spawn E2BIG error at jest-worker (E2BIG is the OS error for an argument/environment list that is too long for a spawned process). This only happened when our rewrites (regex path matching) were above a certain length (although still below the max).

Downgrading back to 13.2.4 resolved the issue.

@S-YOU

S-YOU commented Aug 14, 2023

FYI: the main process started with node server.js is now gone in Next.js 13.4.15, and next-router-worker's parent PID becomes 1 (init). This probably uses less memory, since it is one fewer process.

1362416       1      00:00:02 next-router-worker
1362432 1362416      00:00:00 next-render-worker-app
1362433 1362416      00:00:05 next-render-worker-pages

@S-YOU

S-YOU commented Aug 14, 2023

@timneutkens, sorry, I probably misread it. I don't mean to make any claims; I'm just sharing what I've observed in the version I'm using (which is supposed to be the latest release).

@timneutkens
Member

In 13.4.15 (but really, upgrade to 13.4.16 instead) this PR landed, which indeed removes one of the processes: #53523

@timneutkens

This comment was marked as outdated.

@sedlukha
Contributor

@timneutkens I've tried v13.4.20-canary.2.

It was expected that #53523 and #54143 would reduce the number of processes, resulting in lower memory usage.

Yes, the number of processes has been reduced; after the update, I see only two processes. However, memory usage is still higher than it was with v.13.2.4.

node v.16.18.1 (if it matters)

v13.4.20-canary.2: [memory usage screenshot]

13.2.4: [memory usage screenshot]

@timneutkens
Member

It's entirely unclear what you're running / filtering by, e.g. you're filtering by next- but 13.2.4 doesn't set process.title to anything specific.

Sharing screenshots is really not useful, I keep having to repeat that in every single comment around these memory issues.

Please share code, I can't do anything to help you otherwise.

@billnbell

This comment was marked as off-topic.

@magalhas

magalhas commented Aug 24, 2023

I'm seeing this behaviour running next dev starting on 13.3 and newer versions (13.4 included). This isn't happening on 13.2. Somehow this looks like it's happening whenever files are being added/removed from the FS (not sure due to my current use case) while the dev script is running.

Even after closing next dev, orphaned jest-worker processes are left over.

@timneutkens
Member

I'm amazed by how often my comments are flat out ignored the past few weeks on various issues. We won't be able to investigate/help based on comments saying the equivalent of "It's happening". Please share code, I can't do anything to help you otherwise.

I'll have to close this issue when there is one more comment without a reproduction as I've checked multiple times now and the processes are cleaned up correctly in the latest version.

timneutkens added a commit to timneutkens/next.js that referenced this issue Aug 24, 2023
This implements the same cleanup logic used for start-server and render-workers for the workers used during build.

Fixes vercel#45508
@magalhas

magalhas commented Aug 24, 2023

By latest version, do you mean the latest RC, @timneutkens? Sorry, I can't help with steps to reproduce; this is happening inside a spawn call in a very specific use case, so the best I can do is confirm that it happens.

@kodiakhq kodiakhq bot closed this as completed in #54500 Aug 26, 2023
kodiakhq bot pushed a commit that referenced this issue Aug 26, 2023
This implements the same cleanup logic used for start-server and render-workers for the workers used during build.

It's more of a contingency as we do call `.end()` on the worker too.

Fixes #45508




Co-authored-by: Zack Tanner <1939140+ztanner@users.noreply.github.com>
@vercel vercel locked as resolved and limited conversation to collaborators Aug 28, 2023