
Allowing EMFILE in acceptNewConnection #830

Closed
kazu-yamamoto opened this issue Dec 17, 2020 · 13 comments

Comments

@kazu-yamamoto
Contributor

As @domenkozar described in haskell/network#493, the following error can happen:

Network.Socket.accept: resource exhausted (Too many open files)

This is EMFILE.

@snoyberg I think eMFILE should also be allowed in Run.acceptNewConnection. What do you think?
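
To make the proposal concrete, here is a minimal, hypothetical sketch of what "allowing" eMFILE around accept could look like; it is not Warp's actual Run.acceptNewConnection code, and the function name is made up for illustration. The idea is to inspect the errno carried by the IOException and let the accept loop survive only that particular failure.

```haskell
import Control.Exception (try)
import Foreign.C.Error (Errno (..), eMFILE)
import GHC.IO.Exception (IOException, ioe_errno)
import Network.Socket (SockAddr, Socket, accept)

-- Hypothetical helper (illustration only): accept a connection, swallowing
-- EMFILE so the caller's loop can carry on, and rethrowing everything else.
acceptAllowingEMFILE :: Socket -> IO (Maybe (Socket, SockAddr))
acceptAllowingEMFILE lsock = do
    result <- try (accept lsock)
    case result of
        Right conn -> pure (Just conn)
        Left e
            | isEMFILE e -> pure Nothing  -- FD limit hit: report "no connection"
            | otherwise  -> ioError e     -- any other error: propagate
  where
    isEMFILE :: IOException -> Bool
    isEMFILE e = ioe_errno e == Just mfile
    Errno mfile = eMFILE
```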

@snoyberg
Member

This sounds like a bad idea to me; I disagree with @domenkozar here. Looking at the proposed patch, let me compare the current behavior with the proposed one:

Current: if the FD limit is set too low, the main warp thread will exit with an exception, which under normal circumstances will cause the entire process to exit, causing a discernible error. (If the process isn't exiting, I'd like to know why). This means that there will be evidence that something is wrong, and an operator can increase the FD limit, which is the right solution.

Proposed: instead of exiting, the Warp thread and process will run in a busy-loop hoping that an FD clears up so that it can accept a new connection. CPU usage will be high, the throughput of the application low (due to limited FDs), and there will be no evidence in the logs or alerts from automated systems to indicate something is wrong.

Basically, I don't see a problem that should be fixed in Warp. The program crashing because of insufficient FDs is a Good Thing IMO.

@domenkozar

Current: if the FD limit is set too low, the main warp thread will exit with an exception, which under normal circumstances will cause the entire process to exit, causing a discernible error. (If the process isn't exiting, I'd like to know why). This means that there will be evidence that something is wrong, and an operator can increase the FD limit, which is the right solution.

That's not the current behavior: according to my observation, the process will print the error and then stop accepting new connections.

If it did crash, that would be fine.

Proposed: instead of exiting, the Warp thread and process will run in a busy-loop hoping that an FD clears up so that it can accept a new connection. CPU usage will be high, the throughput of the application low (due to limited FDs), and there will be no evidence in the logs or alerts from automated systems to indicate something is wrong.

Why would it busy-loop? It should try to handle the new connection; if opening a socket fails, that failure should be handled but not retried.

@snoyberg
Member

That's not the current behavior: according to my observation, the process will print the error and then stop accepting new connections.

You're right. I think that's the bug, not the fact that it doesn't retry.

Why would it busy-loop? It should try to handle the new connection; if opening a socket fails, that failure should be handled but not retried.

I'm commenting on the implementation in @kazu-yamamoto's PR. I'm not sure what behavior you'd expect to see besides busy-looping. The server will repeatedly try to accept new connections regardless of how many open FDs are available until an existing connection closes and an FD is freed up.

@domenkozar

domenkozar commented Dec 17, 2020

The server will repeatedly try to accept new connections regardless of how many open FDs are available until an existing connection closes and an FD is freed up.

I'm not familiar with the code, so I can't comment on what's in the PR, but: if a new client connection comes in and the FD error pops up, the error is logged and that connection is closed. I don't see how that could busy-loop.

@kazu-yamamoto
Contributor Author

Related to #553.

@kazu-yamamoto
Contributor Author

@snoyberg Is the following the intended behavior?

  • Stop the accept loop when a listening socket is closed (for some reason).
  • Stop the accept loop in the case of unrecoverable errors (e.g. most ResourceExhausted errors).
  • Continue the accept loop in the case of recoverable errors (eCONNABORTED, per accept(2) error handling #553).

And do you categorize eMFILE as unrecoverable?

@snoyberg
Member

Yes, that sounds right @kazu-yamamoto. I think perhaps we should go to a whitelist of "recoverable errors" and default to treating errors as unrecoverable.
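
As an illustration of that whitelist idea, a predicate could look roughly like the sketch below. This is an assumption about the shape of the solution, not necessarily what Warp implements; eCONNABORTED is the one errno this thread has named as recoverable so far.

```haskell
import Foreign.C.Error (Errno (..), eCONNABORTED)
import GHC.IO.Exception (IOException, ioe_errno)

-- Hypothetical predicate: only whitelisted errnos keep the accept loop
-- alive; everything else (including EMFILE) is treated as unrecoverable.
isRecoverableAcceptError :: IOException -> Bool
isRecoverableAcceptError e =
    case ioe_errno e of
        Just errno -> errno `elem` [n | Errno n <- whitelist]
        Nothing    -> False
  where
    whitelist = [eCONNABORTED]  -- per #553; further errnos could be added here
```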

@domenkozar

I gave this a bit more thought, and I don't think stopping the accept loop is the best way to handle this. Consider the scenario of hitting the limit for just a fraction of the connections being handled: you'd then be giving up 99% of connections because the server couldn't handle 1%.

I see two solutions here:

  • sleep in case of EMFILE and retry (see the sketch after this list)
  • reserve one fd at the beginning and free it when hitting EMFILE (not sure how much cost locking would bring here)
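
A minimal sketch of the first option, assuming a fixed one-second back-off; the delay and the function name are illustrative, not taken from any PR in this thread:

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (try)
import Foreign.C.Error (Errno (..), eMFILE)
import GHC.IO.Exception (IOException, ioe_errno)
import Network.Socket (SockAddr, Socket, accept)

-- Hypothetical accept wrapper: on EMFILE, sleep and try again instead of
-- busy-looping or killing the accept loop; other errors propagate.
acceptWithEMFILEBackoff :: Socket -> IO (Socket, SockAddr)
acceptWithEMFILEBackoff lsock = do
    result <- try (accept lsock)
    case result of
        Right conn -> pure conn
        Left e
            | ioe_errno (e :: IOException) == Just mfile ->
                threadDelay 1000000 >> acceptWithEMFILEBackoff lsock  -- wait 1s, retry
            | otherwise -> ioError e
  where
    Errno mfile = eMFILE
```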

@ProofOfKeags

Running into this now. If Warp ever gets an exception for too many open files, it just refuses all future connections, irrespective of whether load has decreased. I have tried increasing the ulimit, but to no avail.

@ProofOfKeags

I rebased #831 and put in a PR: #870. I have verified that this actually works in the production case where we had this problem. Feedback appreciated.

@kazu-yamamoto
Contributor Author

kazu-yamamoto commented Feb 2, 2022

Thank you for the verification.

@kazu-yamamoto
Contributor Author

#831 has been merged and a new version has been released.
Thank you all for your contributions!

@nh2

nh2 commented Apr 22, 2023

This change created busy-looping problems; see #928.
