No messages after worker crash #477

Open
neben opened this issue Aug 4, 2018 · 8 comments

Comments

@neben (Contributor) commented Aug 4, 2018

If an nginx worker process dies and gets respawned, some existing and future nchan subscriptions cease to work correctly (no more messages for certain channels). This can be reproduced easily by killing an nginx worker process. In contrast, if the nginx master process is signaled to reload the config and respawn the worker processes (SIGHUP on the master process), new subscriptions and messages will work fine.

Nchan version 1.2.0
Openresty 1.13.6.2

@slact (Owner) commented Aug 5, 2018

This is a known efficiency/robustness tradeoff in Nchan. I have in mind some ways to make it more resilient to worker crashes without sacrificing speed, but that is a nontrivial job and a few weeks of coding. I'd love it if someone could sponsor this work.

@neben (Contributor, Author) commented Aug 5, 2018

Ok, thanks! Is this documented somewhere? Unfortunately we don't have the funds to sponsor that (yet). Our current workaround is to SIGHUP the master process if we detect a worker crash. Alternatively, running with a single worker should solve it, too. For now, we'll focus on figuring out why the worker is crashing in the first place.
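The SIGHUP workaround above can be sketched as a small watchdog script. This is a hypothetical sketch, not part of Nchan: the pidfile path, the process-name pattern, and the expected worker count are all assumptions for illustration.

```shell
#!/bin/sh
# Watchdog sketch for the workaround described above: if a worker has
# died, SIGHUP the master so all workers are respawned and Nchan's
# inter-worker state is rebuilt. Paths and counts are assumptions.
PIDFILE=${NGINX_PIDFILE:-/run/nginx.pid}
EXPECTED_WORKERS=${EXPECTED_WORKERS:-1}

# Succeeds when at least $2 worker processes are children of master $1.
workers_ok() {
    live=$(pgrep -P "$1" -f 'nginx: worker process' | wc -l)
    [ "$live" -ge "$2" ]
}

if [ -r "$PIDFILE" ]; then
    master=$(cat "$PIDFILE")
    if ! workers_ok "$master" "$EXPECTED_WORKERS"; then
        # A worker crashed: reload, the master respawns the workers.
        kill -HUP "$master"
    fi
fi
```

Something like this could run from cron or a systemd timer; a one-shot check avoids keeping yet another daemon alive.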

@slact (Owner) commented Oct 1, 2018

I guess this isn't explicitly documented -- I figured it's common sense that if a worker crashes you should expect some data loss and some undefined behavior. It wouldn't hurt to document that though.

@kajmagnus commented Dec 30, 2018

@neben I have this problem too: Nchan in effect stops working after a worker crash, and can stay broken until the next restart, which might not be until weeks later (no live notifications until then). How do you detect a crash and send a SIGHUP? You don't happen to have a reusable script or something?

( @slact I suppose it'd be impossibly much work to do this, but anyway, there's Rust for Nginx: https://github.com/nginxinc/ngx-rust (only a proof of concept though). Maybe Rust could be a way to fix all crashes once and for all ... except that porting to Rust would be impossibly much work, I suppose.)

@ivanovv commented Jan 9, 2019

@kajmagnus the easiest fix for this is to have only one worker; then the nginx master process will auto-restart it and everything works again. I guess that needs to go into the README, as it is a non-obvious thing and many have had prod servers get stuck.
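The single-worker setup described above is just the stock nginx default; a minimal, illustrative nginx.conf fragment:

```nginx
# Illustrative fragment: with one worker, a crash is handled by the
# master respawning it, and there is no cross-worker Nchan state to lose.
worker_processes 1;
```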

@kajmagnus commented Jan 10, 2019

@ivanovv Thanks I'll try this. My VPS servers don't have more than 1 or 2 vCPUs anyway.

@ivanovv commented Jan 10, 2019

@kajmagnus well, nginx by default comes with worker_processes 1; and unless you are sure that nginx uses something close to 100% of one core, there is no good reason to set it to auto or any other number.

It is all a bit different for nchan as having more workers helps a lot with scaling (channels are spread between workers) as @slact has mentioned in one of the issues. But again, that all starts to play a role only when you have a lot of subscribers anyway.

@kajmagnus commented Jan 10, 2019

@ivanovv ok, thanks for the info. I had previously changed from worker_processes 1 to auto based on https://www.nginx.com/blog/tuning-nginx/ ("we recommend setting this directive to auto"), but yes, I agree, anything above 1 seems like overkill in most cases. Now back to 1. I did quick tests:

wrk -t 6 -c 100 -d 10 http://site-3.localhost/-/ping-nginx →
77 k req/sec with auto, and 35 k req/sec with 1, on my Core i7 laptop.

And 35 k is faster than fast enough in my case :- ) So hopefully all fine here then ... for the nearest ... years. I agree, btw, that it would be good to have this in the README. Now I'll just wait and see if the problem is gone :- )
