Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore ECONNABORTED in accept(2) #404

Closed
wants to merge 8 commits into from
Closed

Conversation

mherrb
Copy link
Contributor

@mherrb mherrb commented May 10, 2019

An aborted connection attempt does not affect listening socket's
ability to accept other connections. If the error is not ignored, Squid
gets stuck after logging an oldAccept error like this one:

oldAccept ...: (53) Software caused connection abort

This bug fix was motivated by accept(2) changes in OpenBSD v6.5 that
resulted in new ECONNABORTED errors under regular deployment conditions:
openbsd/src@c255b5a

The accept(2) documentation indicate that the syscall should
be restarted after this error.
@squid-prbot
Copy link
Collaborator

Can one of the admins verify this patch?

@mherrb
Copy link
Contributor Author

mherrb commented May 10, 2019

The accept(2) documentation indicate that the syscall should
be restarted after this error.

Hi,

I recently upgraded the squid cache at work to OpenBSD 6.5 (and thus
squid 4.6). After this, squid started locking up (and not accepting
new connections) every few hours with the error below logged in
cache.log:

oldAccept FD 17, [::] [ job1]: ignoring: (53) Software caused
connection abort

After digging a bit in the source code and reading about ECONNABORTED
and accept(), I came out with the patch above, which fixes the issue
for me.

Does this look ok ?

Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that Squid should ignore ECONNABORTED accept(2) errors. Thank you for fixing this!

FWIW, a similar fix was done in v2.6.STABLE13 (squid-cache/squid2@096e9e6) but it looks like the fix has not made it to Squid v3.

According to squid-cache/squid2@65d191b, you need to #define ECONNABORTED on Windows. You do not need to test on Windows, but please port that v2 change as a part of this PR unless you are sure it is not needed. See compat/os/mswindows.h.

Please see the inlined change requests for another simple adjustment.

I recently upgraded ... to OpenBSD 6.5 (and thus squid 4.6)

What OS/Squid versions were you running before? I wonder if this problem surfaced because of the OS rather than Squid upgrade. Knowing this may be useful for other admins.

After this, squid started locking up (and not accepting new connections) every few hours

Yes. That is expected in the official code that does not treat ECONNABORTED specially.

with the error below logged in cache.log:
oldAccept FD 17, [::] [ job1]: ignoring: (53) Software caused connection abort

Hmm... AFAICT, the above "ignoring" line is never printed by the official code. It is printed by your PR! Perhaps you copy/pasted it from a patched Squid log? Knowing the exact bug symptoms may be useful for other admins...

src/comm/TcpAcceptor.cc Outdated Show resolved Hide resolved
src/comm/TcpAcceptor.cc Outdated Show resolved Hide resolved
@mherrb
Copy link
Contributor Author

mherrb commented May 10, 2019

I agree that Squid should ignore ECONNABORTED accept(2) errors. Thank you for fixing this!

FWIW, a similar fix was done in v2.6.STABLE13 (squid-cache/squid2@096e9e6) but it looks like the fix has not made it to Squid v3.

According to squid-cache/squid2@65d191b, you need to #define ECONNABORTED on Windows. You do not need to test on Windows, but please port that v2 change as a part of this PR unless you are sure it is not needed. See compat/os/mswindows.h.

Please see the inlined change requests for another simple adjustment.

I recently upgraded ... to OpenBSD 6.5 (and thus squid 4.6)

What OS/Squid versions were you running before? I wonder if this problem surfaced because of the OS rather than Squid upgrade. Knowing this may be useful for other admins.

The cache was running OpenBSD 6.4, which packaged squid 3.5.28

After this, squid started locking up (and not accepting new connections) every few hours

Yes. That is expected in the official code that does not treat ECONNABORTED specially.

with the error below logged in cache.log:
oldAccept FD 17, [::] [ job1]: ignoring: (53) Software caused connection abort

Hmm... AFAICT, the above "ignoring" line is never printed by the official code. It is printed by your PR! Perhaps you copy/pasted it from a patched Squid log? Knowing the exact bug symptoms may be useful for other admins...

Yes I copied the wrong line from my cache.log.
For reference here's one from before my patch:
2019/05/06 19:58:01 kid1| oldAccept FD 17, [::] [ job1]: (53) Software caused connection abort

mherrb and others added 4 commits May 10, 2019 21:07
Copy link
Contributor

@yadij yadij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have now implicitly extended the scope to polish the existing errno/errcode checks it would be a good chance to fix the use of yoda style. IIRC the ancient compilers where it was a useful safety measure are no longer supported.

@yadij
Copy link
Contributor

yadij commented May 11, 2019

Also, the OpenBSD documentation and the Linux documentation differ a bit on how they describe accept() handling of errno. Linux has a fixed list of codes that may be relayed, whereas BSD manual specifically makes a point of saying that errno is passed through instead of being cleared.

So these errors may be a case of pollution from somewhere else (eg inside the prior select(2) or poll(2) call). You may fine it interesting to test whether the issue disappears with errno=0 before the accept(2) syscall (where errcode is reset).

@mherrb
Copy link
Contributor Author

mherrb commented May 11, 2019

Also, the OpenBSD documentation and the Linux documentation differ a bit on how they describe accept() handling of errno. Linux has a fixed list of codes that may be relayed, whereas BSD manual specifically makes a point of saying that errno is passed through instead of being cleared.

So these errors may be a case of pollution from somewhere else (eg inside the prior select(2) or poll(2) call). You may fine it interesting to test whether the issue disappears with errno=0 before the accept(2) syscall (where errcode is reset).

ECONNABORTED can only be generated by accept(2) (at least on OpenBSD, but the Linux manual pages seem to agree). So it's not possible that the value is a leftovers from a previous system call.
I have not been able to understand what kind of client activity triggers the error, all I'm sure is that it's not a file descriptor shortage (we're far below the limit when it happens).

@yadij
Copy link
Contributor

yadij commented May 11, 2019

The way I've understood is that accept() triggers the SYN+ACK step for TCP. Squid logic separating the select() and accept() calls can lead to a (relatively) long delay between the client SYN and that SYN+ACK. At any point during that delay the client (or TCP stack itself) may use FIN or timeout the connection.

@mherrb
Copy link
Contributor Author

mherrb commented May 11, 2019

Ok, so clients implementing the happy eyeballs algorithm and sending both v4 and v6 SYN paquets to the same squid process can cause this, especially since there are 2 different listening sockets for the 2 protocol versions there.

This was the only area in this file where this style was used
src/comm/TcpAcceptor.cc Outdated Show resolved Hide resolved
Co-Authored-By: Amos Jeffries <yadij@users.noreply.github.com>
bob-beck pushed a commit to openbsd/ports that referenced this pull request May 11, 2019
This was causing lockups on my squid cache at work.
Upstreem PR: squid-cache/squid#404
ok sthen@, jca@
Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing one of my two concerns. To avoid misunderstanding, I am repeating the other concern here:

According to squid-cache/squid2@65d191b, you need to #define ECONNABORTED on Windows. You do not need to test on Windows, but please port that v2 change as a part of this PR unless you are sure it is not needed. See compat/os/mswindows.h.

@mherrb
Copy link
Contributor Author

mherrb commented May 13, 2019

Thank you for addressing one of my two concerns. To avoid misunderstanding, I am repeating the other concern here:

According to squid-cache/squid2@65d191b, you need to #define ECONNABORTED on Windows. You do not need to test on Windows, but please port that v2 change as a part of this PR unless you are sure it is not needed. See compat/os/mswindows.h.

Done.

Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my remaining concern.

@rousskov rousskov changed the title Ignore ECONNABORTED in accept(2). Ignore ECONNABORTED in accept(2) May 13, 2019
@rousskov rousskov added S-waiting-for-author author action is expected (and usually required) S-waiting-for-committer privileged action is expected (and usually required) labels May 13, 2019
@rousskov
Copy link
Contributor

OK to test

@rousskov
Copy link
Contributor

rousskov commented May 13, 2019

@mherrb, I detailed the PR description (i.e. the future official commit message) a little. If you are OK with my changes, please let me know, and I will clear this PR for commit. Otherwise, please adjust to your liking and ping me.

@mherrb
Copy link
Contributor Author

mherrb commented May 14, 2019

@rousskov OK by me. Thanks.

@rousskov rousskov added M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels and removed S-waiting-for-author author action is expected (and usually required) S-waiting-for-committer privileged action is expected (and usually required) labels May 14, 2019
squid-anubis pushed a commit that referenced this pull request May 14, 2019
An aborted connection attempt does not affect listening socket's
ability to accept other connections. If the error is not ignored, Squid
gets stuck after logging an oldAccept error like this one:

    oldAccept ...: (53) Software caused connection abort

This bug fix was motivated by accept(2) changes in OpenBSD v6.5 that
resulted in new ECONNABORTED errors under regular deployment conditions:
openbsd/src@c255b5a
@squid-anubis squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label May 14, 2019
@squid-anubis squid-anubis added M-merged https://github.com/measurement-factory/anubis#pull-request-labels and removed M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels labels May 14, 2019
squidadm pushed a commit to squidadm/squid that referenced this pull request May 27, 2019
An aborted connection attempt does not affect listening socket's
ability to accept other connections. If the error is not ignored, Squid
gets stuck after logging an oldAccept error like this one:

    oldAccept ...: (53) Software caused connection abort

This bug fix was motivated by accept(2) changes in OpenBSD v6.5 that
resulted in new ECONNABORTED errors under regular deployment conditions:
openbsd/src@c255b5a
yadij pushed a commit that referenced this pull request Jun 12, 2019
An aborted connection attempt does not affect listening socket's
ability to accept other connections. If the error is not ignored, Squid
gets stuck after logging an oldAccept error like this one:

    oldAccept ...: (53) Software caused connection abort

This bug fix was motivated by accept(2) changes in OpenBSD v6.5 that
resulted in new ECONNABORTED errors under regular deployment conditions:
openbsd/src@c255b5a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
M-merged https://github.com/measurement-factory/anubis#pull-request-labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants