Addnode fixes #10
Conversation
Make addnode rpc call synchronous
Code Review
This pull request implements synchronous connection handling for the addnode RPC by introducing an optional completion callback to the network manager and using std::promise and std::future to wait for the result. The feedback suggests adding an explicit default case to the ConnectionResult switch statement for better maintainability and replacing the non-portable <unistd.h> header in unit tests with standard C++ chrono and thread utilities to ensure cross-platform compatibility.
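For context on the mechanism the PR uses, here is a minimal, self-contained sketch of the promise/future bridge from an async completion callback to a blocking caller. `connect_async` and the enum values are illustrative stand-ins, not the PR's actual API:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>

enum class ConnectionResult { Success, TransportFailed };

// Hypothetical stand-in for the async connect path: fires the callback
// from another thread after a short delay.
void connect_async(std::function<void(ConnectionResult)> on_complete) {
    std::thread([cb = std::move(on_complete)] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        cb(ConnectionResult::Success);
    }).detach();
}

ConnectionResult connect_and_wait(std::chrono::seconds timeout) {
    auto promise = std::make_shared<std::promise<ConnectionResult>>();
    auto future = promise->get_future();

    // The callback runs on the network thread; it only fulfills the promise.
    connect_async([promise](ConnectionResult r) { promise->set_value(r); });

    // Bounded wait so the RPC caller can never hang forever.
    if (future.wait_for(timeout) != std::future_status::ready) {
        return ConnectionResult::TransportFailed;  // caller reports a timeout
    }
    return future.get();
}
```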
```cpp
switch (r) {
    case network::ConnectionResult::Success:
        return "Connected";
    case network::ConnectionResult::NotRunning:
        return "Network is not running";
    case network::ConnectionResult::AddressBanned:
        return "Address is banned";
    case network::ConnectionResult::AddressDiscouraged:
        return "Address is discouraged";
    case network::ConnectionResult::AlreadyConnected:
        return "Already connected (or connection attempt already in flight)";
    case network::ConnectionResult::NoSlotsAvailable:
        return "No outbound connection slots available";
    case network::ConnectionResult::TransportFailed:
        return "Transport connect failed";
}
return "Unknown connection error";
```
The switch statement on network::ConnectionResult handles all current enum members but lacks a default case; the return statement after the switch acts as an implicit default, which obscures the intended fallback. An explicit default case makes the fallback obvious to readers (note the trade-off: it also silences -Wswitch warnings when new enum members are added). For improved maintainability and robustness, it's better to include an explicit default case.
```cpp
switch (r) {
    case network::ConnectionResult::Success:
        return "Connected";
    case network::ConnectionResult::NotRunning:
        return "Network is not running";
    case network::ConnectionResult::AddressBanned:
        return "Address is banned";
    case network::ConnectionResult::AddressDiscouraged:
        return "Address is discouraged";
    case network::ConnectionResult::AlreadyConnected:
        return "Already connected (or connection attempt already in flight)";
    case network::ConnectionResult::NoSlotsAvailable:
        return "No outbound connection slots available";
    case network::ConnectionResult::TransportFailed:
        return "Transport connect failed";
    default:
        return "Unknown connection error";
}
```

```cpp
#include "util/uint.hpp"
#include "util/hash.hpp"
#include <memory>
#include <unistd.h>
```
Including <unistd.h> can limit the portability of this test code, as it's specific to POSIX-compliant systems. If this header is being used for functions like sleep() or usleep(), consider using the standard C++ equivalent std::this_thread::sleep_for from the <thread> and <chrono> headers. This will ensure the tests can run on a wider range of platforms, including Windows.
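For illustration, a portable stand-in for such a call (the 100 ms figure is arbitrary, not from the tests):

```cpp
#include <chrono>
#include <thread>

// Portable replacement for a POSIX usleep()/sleep() call in a test.
void wait_briefly() {
    // POSIX equivalent: usleep(100 * 1000);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```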
```cpp
#include <memory>
#include <vector>
#include <atomic>
#include <unistd.h>
```
Same portability concern as the previous file: replace <unistd.h> with std::this_thread::sleep_for from <thread> and <chrono> so the tests can run on non-POSIX platforms, including Windows.
Pull request overview
This PR improves the addnode RPC behavior by waiting for an outbound manual connection attempt to resolve and returning a more accurate success/error response, enabled by threading a new optional completion callback through the network connection stack.
Changes:
- Add a synchronous-wait wrapper in `RPCServer::HandleAddNode` that blocks (bounded) until the async transport connect outcome is known and returns a clearer RPC response.
- Extend `NetworkManager::connect_to` / `ConnectionManager::ConnectTo` to accept an optional `ConnectCompletion` callback and invoke it when the async connect path completes.
- Add `<unistd.h>` includes to a couple of unit test files.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/unit/chain/notifications_tests.cpp | Adds a POSIX header include for tests. |
| test/unit/chain/miner_tests.cpp | Adds a POSIX header include for tests. |
| src/network/rpc_server.cpp | Implements bounded synchronous waiting for addnode and improved error messaging. |
| src/network/network_manager.cpp | Plumbs a new optional completion callback through NetworkManager::connect_to. |
| src/network/connection_manager.cpp | Invokes the new completion callback from the async connect flow with specific ConnectionResults. |
| include/network/network_manager.hpp | Updates connect_to signature to accept an optional completion callback. |
| include/network/connection_manager.hpp | Introduces ConnectCompletion and updates ConnectTo signature accordingly. |
```cpp
        LOG_NET_WARN("outbound connect failed: {}:{} (conn_type={})", address, port,
                     ConnectionTypeAsString(effective_conn_type));
    }
    if (on_complete) {
        on_complete(ConnectionResult::TransportFailed);
```
on_complete is invoked while mutex_ is held (inside the asio::post handler). If a caller’s completion handler calls back into ConnectionManager/NetworkManager (or anything else that might take mutex_), this can deadlock or create hard-to-debug re-entrancy. Consider capturing the result under the lock, releasing the lock, and then invoking on_complete (or posting the callback separately) so user code never runs under mutex_.
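A minimal sketch of the fix this comment suggests, assuming a simplified `ConnectionManager` shape (the member names here are illustrative): capture the callback under the lock, release the lock, then invoke it:

```cpp
#include <functional>
#include <mutex>

enum class ConnectionResult { TransportFailed /* , ... */ };
using ConnectCompletion = std::function<void(ConnectionResult)>;

class ConnectionManager {
    std::mutex mutex_;

    void OnConnectFailed(ConnectCompletion on_complete) {
        ConnectCompletion to_invoke;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            // ... update connection state under the lock ...
            to_invoke = std::move(on_complete);  // capture, don't call yet
        }
        // User code runs outside mutex_, so a handler that re-enters
        // ConnectionManager/NetworkManager cannot deadlock on this lock.
        if (to_invoke) to_invoke(ConnectionResult::TransportFailed);
    }
};
```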
```cpp
if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
    return util::JsonError("Connection attempt timed out");
}
auto final_result = future.get();
```
future.get() can throw (e.g., std::future_error for a broken promise if the connect attempt is aborted and the completion callback is destroyed without setting the promise). Since this is on an RPC path, it would be better to catch std::future_error around wait_for/get and return a deterministic JsonError message instead of relying on the outer handler exception path.
Suggested change:

```cpp
network::ConnectionResult final_result;
try {
    if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
        return util::JsonError("Connection attempt timed out");
    }
    final_result = future.get();
} catch (const std::future_error&) {
    return util::JsonError("Connection attempt failed");
}
```
MastaP left a comment
Targeted review on top of the existing bot comments — focusing on the new bifurcated sync/async API surface, timeout coupling, and a few cosmetic/refactor notes. Big picture this is fine and useful; the issues are mostly around future-proofing the contract.
Substantive
1. `ConnectCompletion` API contract is undocumented. Callers must understand that `on_complete` is invoked iff `ConnectTo` returns `Success` synchronously; for `NotRunning`/`AddressBanned`/`AlreadyConnected`/`NoSlotsAvailable`/sync-`TransportFailed`, the early returns drop the callback on the floor. `AddNodeSyncConnect` happens to handle this correctly by branching on `sync_result` first, but the next caller will not. (Inline.)
2. `addNodeWaitTimeout` is coupled-by-comment to `DEFAULT_CONNECT_TIMEOUT`. Both constants live in unrelated translation units; tests can also override the transport timeout via `connect_timeout_override_ms_` to a value larger than 12s, which would make the RPC report a false-positive timeout while the transport is still working. (Inline.)
3. Late-completion divergence. If the async transport callback lands after the 12s wait expires, the RPC user sees "Connection attempt timed out" but the peer has actually been added (peer registry, `manual_addresses_`, outbound count). A retry returns `AlreadyConnected`, which is confusing. Worth either cancelling the in-flight transport on RPC timeout or documenting the race. (Inline.)
4. Shutdown amplification (pre-existing pattern, slightly worsened). `RPCServer::Stop()` drains active requests for at most 5s; `addNodeWaitTimeout` is 12s, so an in-flight `addnode` can outlive `Stop()` by ~7s. After Stop returns, `Application::shutdown()` proceeds to `network_manager_->stop()` and the surrounding state can tear down. Long-poll already has the same gap (30s vs 5s), so this isn't new, but the diff makes the entry point bigger. Consider raising the drain timeout to envelope all known handlers, or piping a cancellation token into `wait_for` (see the sketch after this list).
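For point 4, one possible shape for a cancellation-aware wait, sketched as a generic helper with an assumed `std::atomic<bool>` shutdown flag (not the project's actual API):

```cpp
#include <atomic>
#include <chrono>
#include <future>

// Wait for `fut` up to `timeout`, but give up early if `shutting_down`
// is set (e.g. by RPCServer::Stop()).
template <typename T>
std::future_status wait_cancellable(std::future<T>& fut,
                                    std::chrono::milliseconds timeout,
                                    const std::atomic<bool>& shutting_down) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + timeout;
    while (clock::now() < deadline) {
        if (shutting_down.load()) return std::future_status::timeout;
        // Poll in short slices so a shutdown request is observed promptly.
        auto status = fut.wait_for(std::chrono::milliseconds(50));
        if (status == std::future_status::ready) return status;
    }
    return std::future_status::timeout;
}
```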
Cosmetic / refactor
5. addNodeWaitTimeout uses lowerCamelCase; every other compile-time constant in rpc_server.cpp is UPPER_SNAKE_CASE (MAX_CONCURRENT_REQUESTS, MAX_SOCKET_PATH, SHUTDOWN_TIMEOUT_MS, POLL_INTERVAL_MS, STALE_TIP_THRESHOLD, DEFAULT_BANTIME_SEC, LONGPOLL_TIMEOUT_SECONDS). (Inline.)
6. Trailing em-dash with no continuation in the synchronous-rejection branch comment. (Inline.)
7. `connect_to(..., const ConnectCompletion& on_complete)` followed by `on_complete = on_complete` in the capture forces an extra copy. Idiomatic alternative: take `ConnectCompletion` by value and `std::move` it into the capture, saving one copy of the `std::function` (see the sketch after this list). (Picky.)
8. The <unistd.h> adds in the two test files are correct on macOS for mkdtemp, but the underlying mkdtemp boilerplate already exists as unicity::test::TempDir in test/common/test_util.hpp (which itself includes <unistd.h>). Replacing the open-coded blocks with TempDir would obviate the include and remove duplication. Not in this PR's scope, but worth a follow-up. (Inline on test file.)
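For point 7, a minimal illustration of the by-value-plus-move signature, with a simplified callback type rather than the project's actual `ConnectCompletion`:

```cpp
#include <functional>
#include <utility>

using Completion = std::function<void(int)>;

// Const-ref parameter: the lambda capture must copy the std::function.
void connect_to_copy(const Completion& on_complete) {
    auto task = [on_complete] { on_complete(0); };  // copy made here
    task();
}

// By-value parameter + std::move: a temporary passed by the caller is
// moved into the parameter and then into the capture, saving one copy.
void connect_to_move(Completion on_complete) {
    auto task = [cb = std::move(on_complete)] { cb(0); };  // moved, not copied
    task();
}
```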
Coverage
No new tests cover the synchronous-wait behavior (timeout path, success-after-handshake, NotRunning/Banned sync rejection). A small RPC-level integration test using a stub transport that defers the ConnectCallback would lock in the new contract and prevent regressions.
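As a starting point, a self-contained sketch of the wait contract itself, with a bare `std::promise` standing in for the stub transport; a real test would go through the RPC entry point and the project's transport-injection hooks:

```cpp
#include <chrono>
#include <future>
#include <gtest/gtest.h>

// The completion never fires: the bounded wait must report a timeout
// instead of hanging. (Short timeout chosen for test speed.)
TEST(AddNodeSyncWait, ReportsTimeoutWhenCompletionNeverFires) {
    std::promise<int> completion;  // never fulfilled
    auto future = completion.get_future();

    auto status = future.wait_for(std::chrono::milliseconds(100));
    EXPECT_EQ(status, std::future_status::timeout);
}

// The stub "transport" resolves: the waiter must observe the result.
TEST(AddNodeSyncWait, ReturnsResultWhenCompletionFires) {
    std::promise<int> completion;
    auto future = completion.get_future();

    completion.set_value(42);
    ASSERT_EQ(future.wait_for(std::chrono::seconds(1)),
              std::future_status::ready);
    EXPECT_EQ(future.get(), 42);
}
```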
```cpp
enum class ConnectionResult;  // From network_manager.hpp

// Optional connection completion callback fired from the io_context thread
using ConnectCompletion = std::function<void(ConnectionResult)>;
```
Please document the contract here, since it's non-obvious from the type alone:
- `on_complete` is invoked iff the corresponding `ConnectTo` call returned `ConnectionResult::Success` synchronously. On any synchronous-failure return (`NotRunning`, `AddressBanned`, `AddressDiscouraged`, `AlreadyConnected`, `NoSlotsAvailable`, sync `TransportFailed`), it is not invoked.
- It runs on the io_context thread while `mutex_` is held (also flagged elsewhere); callbacks must not re-enter `ConnectionManager`/`NetworkManager` and should keep work to a minimum (e.g. setting a promise).
- It fires at most once.
The current AddNodeSyncConnect happens to get this right because it branches on the sync result before touching the future, but a future caller without this context can easily wait on a promise that will never be set.
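One possible wording for that documentation, sketched against the alias shown above (the phrasing is a proposal, not existing code):

```cpp
#include <functional>

enum class ConnectionResult;  // from network_manager.hpp

// Completion callback for ConnectTo().
//
// Contract:
//  - Invoked iff the corresponding ConnectTo() call returned
//    ConnectionResult::Success synchronously; on any synchronous failure
//    (NotRunning, AddressBanned, AddressDiscouraged, AlreadyConnected,
//    NoSlotsAvailable, sync TransportFailed) it is never called, so
//    callers must branch on the sync result before waiting on it.
//  - Runs on the io_context thread, currently while mutex_ is held: do
//    not re-enter ConnectionManager/NetworkManager; keep work minimal,
//    e.g. fulfil a promise.
//  - Fires at most once.
using ConnectCompletion = std::function<void(ConnectionResult)>;
```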
```cpp
// timeout (see DEFAULT_CONNECT_TIMEOUT in include/network/real_transport.hpp,
// currently 10s) so the transport always reports a real result before this
// fires; this is a defensive outer bound in case the callback never fires.
constexpr auto addNodeWaitTimeout = std::chrono::seconds(12);
```
Two concerns about the timeout choice:

1. Naming convention. Every other compile-time constant in this file is `UPPER_SNAKE_CASE` (`MAX_CONCURRENT_REQUESTS`, `MAX_SOCKET_PATH`, `SHUTDOWN_TIMEOUT_MS`, `POLL_INTERVAL_MS`, `STALE_TIP_THRESHOLD`, `DEFAULT_BANTIME_SEC`, `LONGPOLL_TIMEOUT_SECONDS`). Suggest `ADDNODE_WAIT_TIMEOUT` for consistency.
2. Coupled-by-comment to `RealTransportConnection::DEFAULT_CONNECT_TIMEOUT`. The 12s value only "works" because it's > 10s, but that relationship lives only in this comment. Two failure modes:
   - If someone bumps `DEFAULT_CONNECT_TIMEOUT` in real_transport.hpp to e.g. 15s without grepping, the addnode RPC will start reporting false-positive timeouts.
   - Tests can already override the transport timeout via `RealTransportConnection::connect_timeout_override_ms_` (a public static atomic), which silently breaks the relationship in test scenarios.

Consider deriving the value programmatically, e.g.:

```cpp
constexpr auto ADDNODE_WAIT_TIMEOUT =
    network::RealTransportConnection::DEFAULT_CONNECT_TIMEOUT + std::chrono::seconds(2);
```

(Or expose a public accessor on the transport.) That removes the comment-only coupling and survives test overrides if `connect_timeout_ms()` is consulted instead.
```cpp
    [promise](network::ConnectionResult r) { promise->set_value(r); });

if (sync_result != network::ConnectionResult::Success) {
    // Synchronous rejection (NotRunning, AlreadyConnected, Banned etc) —
```
Trailing em-dash with no continuation — looks like an unfinished thought. Either complete it (... — return immediately, no async path will fire) or drop the dash.
```cpp
}

if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
    return util::JsonError("Connection attempt timed out");
```
Late-completion race: if the transport callback lands between 12s and ~12s+ε, this branch returns "Connection attempt timed out" but the peer is actually being added (peer registry, `manual_addresses_`, outbound count). The user thinks the addnode failed; a retry now returns `AlreadyConnected`, which is the worst of both worlds.
Two possible mitigations:
- On timeout, walk the connection back: ask `ConnectionManager` to cancel the pending entry and disconnect the peer if it lands. Requires a small new API on `ConnectionManager` (a cancel handle keyed on `AddressKey`).
- Document the race explicitly here and in the user-facing RPC docs, so operators know that a timeout error doesn't preclude a successful late connection.
Low severity in practice (you'd need transport timing > 10s and < 12s + scheduling slack), but it's a real divergence between RPC reply and node state.
```cpp
#include "util/uint.hpp"
#include "util/hash.hpp"
#include <memory>
#include <unistd.h>
```
Follow-up suggestion (out of scope for this PR's intent, but it would obviate the macOS portability concern raised by the bots): the `mkdtemp` block at L41–46 (and the matching block in notifications_tests.cpp) duplicates `unicity::test::TempDir` in test/common/test_util.hpp, which already encapsulates the call and already includes `<unistd.h>`. Replacing the open-coded boilerplate with `TempDir test_dir{"unicity_miner_test"}` would let you drop this `<unistd.h>` add entirely and centralize the POSIX dependency.
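A sketch of what that follow-up could look like; the "before" block is an illustrative shape, and `TempDir`'s constructor signature is inferred from the comment's own example:

```cpp
// Before (open-coded in each test file, illustrative shape):
//   #include <unistd.h>
//   char tmpl[] = "/tmp/unicity_miner_test.XXXXXX";
//   ASSERT_NE(mkdtemp(tmpl), nullptr);

// After: reuse the shared helper, which owns the mkdtemp call and the
// <unistd.h> include.
#include "test/common/test_util.hpp"

unicity::test::TempDir test_dir{"unicity_miner_test"};
```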
Also fixes open PR review feedback:
- default case in switch
- em-dash
- programmatic timeout derivation
- <unistd.h> removal via TempDir
- ConnectionManager contract clarified
- integration test for addnode
Increase listen backlog from 20 to SOMAXCONN
No description provided.