Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Fix the crash in tls shutdown #246

Merged
merged 3 commits into from
Apr 9, 2024
Merged

chore: Fix the crash in tls shutdown #246

merged 3 commits into from
Apr 9, 2024

Conversation

romange
Copy link
Owner

@romange romange commented Apr 8, 2024

The root cause was due to a socket being closed and destroyed inside ListenerInterface::RunSingleConnection
and in parallel the listener was performing the shutdown sequence inside ListenerInterface::RunAcceptLoop().

In fact we had a comment saying it's working because socket.Shutdown did not preempt.
In one of my recent PRs I implemented the logic where TlsSocket informs the peer about the shutdown
which made the call preemptible. As a result the whole management code around this became unsafe.

  1. I introduced an intrusive refcnt so that a connection would not be destroyed inside RunSingleConnection
    when RunAcceptLoop still uses it.
  2. Changed the iteration loop in the latter tu support preemption.
  3. Fixed the Shutdown code to support the case where the tcp socket is closed in the middle of shutdown.
  4. Found unrelated migration bug that where we failed to Unlink a connection because the listener was destroyed.
    Now we try to unlink only if the connection indeed is linked inside a connection list.

@romange
Copy link
Owner Author

romange commented Apr 8, 2024

This fix will warrant another patch

@romange romange force-pushed the Pr1 branch 2 times, most recently from 9e417ed to cb340cc Compare April 8, 2024 20:45
The root cause was due to a socket being closed and destroyed inside ListenerInterface::RunSingleConnection
and in parallel the listener was performing the shutdown sequence inside ListenerInterface::RunAcceptLoop().

In fact we had a comment saying it's working because socket.Shutdown did not preempt.
In one of my recent PRs I implemented the logic where TlsSocket informs the peer about the shutdown
which made the call preemptible. As a result the whole management code around this became unsafe.

1. I introduced an intrusive refcnt so that a connection would not be destroyed inside RunSingleConnection
   when RunAcceptLoop still uses it.
2. Changed the iteration loop in the latter tu support preemption.
3. Fixed the Shutdown code to support the case where the tcp socket is closed in the middle of shutdown.
4. Found unrelated migration bug that where we failed to Unlink a connection because the listener was destroyed.
   Now we try to unlink only if the connection indeed is linked inside a connection list.

Signed-off-by: Roman Gershman <romange@gmail.com>
conn.Shutdown();
DVSOCK(1, conn) << "Shutdown";
}
cur_conn_cnt.fetch_add(clist->list.size(), memory_order_relaxed);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't clist->list empty at this point?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no necessarily. Shutdown does not close and removes connections. it just shuts them down and they still need to go through their closing sequence before they are removed from the list

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you call pop_back() in line 178?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it's a bug and I am not sure how to fix it actually.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, provided an awkward fix.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an intrusive list, meaning that if an element is not linked then the list was reduced by one element.
Also, Shutdown by itself does not cause the element to be removed from the list. Shutdown triggers the finalization flow for the connection, that at the end will remove the connection from the list (at the end of RunSingleConnection function).
But if Shutdown preempts we may or may now have the connection to be unlinked depending how quickly RunSingleConnection finishes its run.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the list will never have new elements so any element that is unlinked reduces the list by one, so no endless loop here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does it remove from the list? I see that in the original code we used to call pop_back() here, but in the latest iteration we don't so how does that conform with:

it's an intrusive list, meaning that if an element is not linked then the list was reduced by one element.

?

Please excuse my picking, I'm just not sure what I'm missing :)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please note that the goal of the loop is not to empty the list but trigger closing flows, so an element may or may not be removed.

util/tls/tls_socket.cc Outdated Show resolved Hide resolved
@romange romange requested a review from chakaz April 9, 2024 05:51
Copy link
Collaborator

@chakaz chakaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤓

@romange
Copy link
Owner Author

romange commented Apr 9, 2024 via email

@chakaz
Copy link
Collaborator

chakaz commented Apr 9, 2024

No need, it makes sense, I approved this PR :)

@romange romange merged commit d819bf4 into master Apr 9, 2024
5 checks passed
@romange romange deleted the Pr1 branch April 9, 2024 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants