-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Fix the crash in tls shutdown #246
Conversation
This fix will warrant another patch |
9e417ed
to
cb340cc
Compare
The root cause was due to a socket being closed and destroyed inside ListenerInterface::RunSingleConnection and in parallel the listener was performing the shutdown sequence inside ListenerInterface::RunAcceptLoop(). In fact we had a comment saying it's working because socket.Shutdown did not preempt. In one of my recent PRs I implemented the logic where TlsSocket informs the peer about the shutdown which made the call preemptible. As a result the whole management code around this became unsafe. 1. I introduced an intrusive refcnt so that a connection would not be destroyed inside RunSingleConnection when RunAcceptLoop still uses it. 2. Changed the iteration loop in the latter tu support preemption. 3. Fixed the Shutdown code to support the case where the tcp socket is closed in the middle of shutdown. 4. Found unrelated migration bug that where we failed to Unlink a connection because the listener was destroyed. Now we try to unlink only if the connection indeed is linked inside a connection list. Signed-off-by: Roman Gershman <romange@gmail.com>
conn.Shutdown(); | ||
DVSOCK(1, conn) << "Shutdown"; | ||
} | ||
cur_conn_cnt.fetch_add(clist->list.size(), memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't clist->list
empty at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no necessarily. Shutdown does not close and removes connections. it just shuts them down and they still need to go through their closing sequence before they are removed from the list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you call pop_back()
in line 178?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, it's a bug and I am not sure how to fix it actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, provided an awkward fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's an intrusive list, meaning that if an element is not linked then the list was reduced by one element.
Also, Shutdown
by itself does not cause the element to be removed from the list. Shutdown triggers the finalization flow for the connection, that at the end will remove the connection from the list (at the end of RunSingleConnection function).
But if Shutdown preempts we may or may now have the connection to be unlinked depending how quickly RunSingleConnection finishes its run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the list will never have new elements so any element that is unlinked reduces the list by one, so no endless loop here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does it remove from the list? I see that in the original code we used to call pop_back()
here, but in the latest iteration we don't so how does that conform with:
it's an intrusive list, meaning that if an element is not linked then the list was reduced by one element.
?
Please excuse my picking, I'm just not sure what I'm missing :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's inside RunSingleConnection.
here: https://github.com/romange/helio/pull/246/files#diff-01d997cdbc4d2835ea398fa879f935f6dee87da00e4fdf7ae7264bdd5a98f7f8R255-R257
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that the goal of the loop is not to empty the list but trigger closing flows, so an element may or may not be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤓
I can explain the flow via VC if you want
…On Tue, Apr 9, 2024, 10:18 PM Shahar Mike ***@***.***> wrote:
***@***.**** approved this pull request.
🤓
—
Reply to this email directly, view it on GitHub
<#246 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4BFCHKC3PS62RNDOIDNWLY4Q5IPAVCNFSM6AAAAABF5LIIVOVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSOBZHE4DKOBRHA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
No need, it makes sense, I approved this PR :) |
The root cause was due to a socket being closed and destroyed inside ListenerInterface::RunSingleConnection
and in parallel the listener was performing the shutdown sequence inside ListenerInterface::RunAcceptLoop().
In fact we had a comment saying it's working because socket.Shutdown did not preempt.
In one of my recent PRs I implemented the logic where TlsSocket informs the peer about the shutdown
which made the call preemptible. As a result the whole management code around this became unsafe.
when RunAcceptLoop still uses it.
Now we try to unlink only if the connection indeed is linked inside a connection list.