Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Don't hold mutex while sending on connection #2435

Closed
wants to merge 12 commits into from

Conversation

lrettig
Copy link
Member

@lrettig lrettig commented May 26, 2021

Motivation

Partially addresses #1868, #2386 (and maybe #2405)

Changes

The underlying connection objects contain a mutex that's used to protect the closed struct member. They hold the mutex while trying to send on the socket. But the send operation may be delayed or time out, which has a ripple effect up the stack: it blocks the operation that enqueues outbound messages, which in turns blocks propagation of gossip messages (#1868) (since gossip waits for outbound messages to all peers to be enqueued before it processes the next outbound gossip message). There are various places in this stack where we could add more concurrency or queues, but ultimately this appears to be the bottleneck, and fixing it should address the gossip message bottleneck.

Test Plan

Adds a test that makes sure that a blocked Write operation does not block Send

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Update documentation as needed

DevOps Notes

  • This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
  • This PR does not affect public APIs
  • This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
  • This PR does not make changes to log messages (which monitoring infrastructure may rely on)

The mutex is used to protect the closed property, and shouldn't be held
while waiting for a socket send to complete. This causes enqueueing of
outgoing messages to get stuck too.
@lrettig lrettig added the P2P label May 26, 2021
@lrettig lrettig requested a review from y0sher May 26, 2021 16:37
@lrettig lrettig self-assigned this May 26, 2021
@lrettig lrettig mentioned this pull request May 26, 2021
Copy link
Contributor

@countvonzero countvonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems the underlying io.Writer is not thread-safe tho. i don't know enough to be sure there will be no concurrent reads/writes to this network connection.

@lrettig
Copy link
Member Author

lrettig commented May 27, 2021

Fair point, but writing on the connection is already synchronized with the messages channel.

@moshababo
Copy link
Contributor

moshababo commented May 30, 2021

net.Conn seems to be thread safe according to the documentation, so concurrent calls to the underlying io.Writer are expected to be serialized. This obviously doesn't cover the interface per se, but we need to optimize here for the concrete implementation, since this is a hot spot.

Also see this clarification regarding the documentation: golang/go#27203 (comment)

p2p/net/conn_test.go Show resolved Hide resolved
p2p/net/conn_test.go Show resolved Hide resolved
p2p/net/conn_test.go Outdated Show resolved Hide resolved
p2p/net/conn_test.go Outdated Show resolved Hide resolved
p2p/net/conn_test.go Outdated Show resolved Hide resolved
p2p/net/conn.go Outdated Show resolved Hide resolved
@y0sher
Copy link
Contributor

y0sher commented May 31, 2021

@lrettig so I understand that contention can happen around the mutex you moved, though you should know that what you wrote above about the whole p2p stack waiting for every message to be sent to all the peers is wrong. the Send method on conn holds the mutex only for checking closed and then sends it into messages queue meaning that the method is returned right away unless the messages chan is full (current size 250 per peer).
so the change you've done might get the queue free faster, but we could also try increasing its size..

@lrettig
Copy link
Member Author

lrettig commented Jun 1, 2021

the Send method on conn holds the mutex only for checking closed and then sends it into messages queue meaning that the method is returned right away unless the messages chan is full (current size 250 per peer).

The point is that, if SendSock holds the mutex, then Send gets blocked until SendSock finishes, or times out - regardless of the channel capacity. That's not the right design pattern. Send should never block, regardless of what SendSock is doing. The issue here has nothing to do with the channel capacity, and increasing it won't fix the issue. See the test I added - it fails in develop (i.e., it's a regression test). It should be clear what's going on from the test.

Rename SendSock to make it an internal method
@moshababo
Copy link
Contributor

Send should never block, regardless of what SendSock is doing. The issue here has nothing to do with the channel capacity, and increasing it won't fix the issue

Right, the fixed issue has nothing to do with it, but Send still depends on SendSock to consume the channel, and as a secondary potential issue it may block if full capacity was reached. So if the side-effects of a blocking Send are particularly detrimental, we should increase it, but i'd rather get some analytics info first before trying to find the optimal size.

p2p/net/conn.go Outdated Show resolved Hide resolved
@lrettig
Copy link
Member Author

lrettig commented Jun 10, 2021

Right, the fixed issue has nothing to do with it, but Send still depends on SendSock to consume the channel, and as a secondary potential issue it may block if full capacity was reached. So if the side-effects of a blocking Send are particularly detrimental, we should increase it, but i'd rather get some analytics info first before trying to find the optimal size.

You read my mind :) It turns out that the side-effects of a blocking Send are quite detrimental. See #2450 for a fix. It'll be included in the next testnet build as well.

Clean up handling of lock and closing connections
Clean up tests
@lrettig lrettig requested a review from moshababo June 10, 2021 17:40
@lrettig
Copy link
Member Author

lrettig commented Jun 10, 2021

Thanks for feedback. @moshababo @y0sher @countvonzero this is ready for final review.

Add timeout back to test to account for concurrency issue
@lrettig lrettig added this to the July 5 Sprint milestone Jul 5, 2021
@lrettig lrettig removed this from the July 5 Sprint milestone Jul 17, 2021
bors bot pushed a commit that referenced this pull request Aug 29, 2021
## Motivation
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
The underlying connection objects contain a mutex that's used to protect the `closed` struct member. They hold the mutex while trying to send on the socket. But the send operation may be delayed or time out, which has a ripple effect up the stack: it blocks the operation that enqueues outbound messages, which in turns blocks propagation of gossip messages (#1868) (since gossip waits for outbound messages to all peers to be enqueued before it processes the next outbound gossip message). There are various places in this stack where we could add more concurrency or queues, but ultimately this appears to be the bottleneck, and fixing it should address the gossip message bottleneck.

## Test Plan
Adds a test that makes sure that a blocked Write operation does not block Send

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@bors
Copy link

bors bot commented Aug 29, 2021

Build failed:

bors bot pushed a commit that referenced this pull request Aug 29, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@lrettig
Copy link
Member Author

lrettig commented Aug 29, 2021

bors merge

bors bot pushed a commit that referenced this pull request Aug 29, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this pull request Aug 29, 2021
## Motivation
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
The underlying connection objects contain a mutex that's used to protect the `closed` struct member. They hold the mutex while trying to send on the socket. But the send operation may be delayed or time out, which has a ripple effect up the stack: it blocks the operation that enqueues outbound messages, which in turns blocks propagation of gossip messages (#1868) (since gossip waits for outbound messages to all peers to be enqueued before it processes the next outbound gossip message). There are various places in this stack where we could add more concurrency or queues, but ultimately this appears to be the bottleneck, and fixing it should address the gossip message bottleneck.

## Test Plan
Adds a test that makes sure that a blocked Write operation does not block Send

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@bors
Copy link

bors bot commented Aug 29, 2021

Build failed:

bors bot pushed a commit that referenced this pull request Aug 29, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this pull request Aug 29, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@lrettig
Copy link
Member Author

lrettig commented Aug 30, 2021

bors merge

bors bot pushed a commit that referenced this pull request Aug 30, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this pull request Aug 30, 2021
## Motivation
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
The underlying connection objects contain a mutex that's used to protect the `closed` struct member. They hold the mutex while trying to send on the socket. But the send operation may be delayed or time out, which has a ripple effect up the stack: it blocks the operation that enqueues outbound messages, which in turns blocks propagation of gossip messages (#1868) (since gossip waits for outbound messages to all peers to be enqueued before it processes the next outbound gossip message). There are various places in this stack where we could add more concurrency or queues, but ultimately this appears to be the bottleneck, and fixing it should address the gossip message bottleneck.

## Test Plan
Adds a test that makes sure that a blocked Write operation does not block Send

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@bors
Copy link

bors bot commented Aug 30, 2021

Build failed:

bors bot pushed a commit that referenced this pull request Aug 30, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this pull request Aug 30, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@lrettig lrettig assigned antonlerner and unassigned lrettig Aug 30, 2021
@lrettig lrettig modified the milestones: Aug 16 Sprint, Aug 30 Sprint Aug 30, 2021
@lrettig
Copy link
Member Author

lrettig commented Aug 30, 2021

This is done and ready to be merged, but stuck on flaky tests. Assigning to @antonlerner to make sure it gets merged. Let me know if I've missed anything or if there are any questions. Thanks!

@lrettig
Copy link
Member Author

lrettig commented Sep 2, 2021

bors merge

bors bot pushed a commit that referenced this pull request Sep 2, 2021
## Motivation
Closely related to #2435
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
When the outbound message queue for a given connection is full, rather than blocking on `Send`, begin dropping messages and printing errors (rather than allowing the blocking to propagate up the gossip stack)

It would be good to go one step further and drop peers that are hanging, but this is a quick, simple fix for the acute issue we're seeing on the testnets. See #2385.

## Test Plan
Includes a new unit test for both `conn` and `msgconn` types

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)


Co-authored-by: Dmitry Shulyak <yashulyak@gmail.com>
@bors
Copy link

bors bot commented Sep 2, 2021

Merge conflict.

@lrettig
Copy link
Member Author

lrettig commented Sep 8, 2021

bors merge

bors bot pushed a commit that referenced this pull request Sep 8, 2021
## Motivation
Partially addresses #1868, #2386 (and maybe #2405)

## Changes
The underlying connection objects contain a mutex that's used to protect the `closed` struct member. They hold the mutex while trying to send on the socket. But the send operation may be delayed or time out, which has a ripple effect up the stack: it blocks the operation that enqueues outbound messages, which in turns blocks propagation of gossip messages (#1868) (since gossip waits for outbound messages to all peers to be enqueued before it processes the next outbound gossip message). There are various places in this stack where we could add more concurrency or queues, but ultimately this appears to be the bottleneck, and fixing it should address the gossip message bottleneck.

## Test Plan
Adds a test that makes sure that a blocked Write operation does not block Send

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [ ] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@bors
Copy link

bors bot commented Sep 8, 2021

Pull request successfully merged into develop.

Build succeeded:

@bors bors bot changed the title Don't hold mutex while sending on connection [Merged by Bors] - Don't hold mutex while sending on connection Sep 8, 2021
@bors bors bot closed this Sep 8, 2021
@bors bors bot deleted the send-mutex branch September 8, 2021 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants