
Addnode fixes #10

Merged
lploom merged 5 commits into bft-snapshots from addnode-fixes on Apr 30, 2026

Conversation

Contributor

@lploom lploom commented Apr 28, 2026

No description provided.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements synchronous connection handling for the addnode RPC by introducing an optional completion callback to the network manager and using std::promise and std::future to wait for the result. The feedback suggests adding an explicit default case to the ConnectionResult switch statement for better maintainability and replacing the non-portable <unistd.h> header in unit tests with standard C++ chrono and thread utilities to ensure cross-platform compatibility.
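
For orientation, the synchronous-wait shape being reviewed is roughly the following. This is a minimal sketch only: the argument list of connect_to, the helper function name, and the plain-string return are assumptions, not the PR's actual code (the real handler returns a JSON RPC response).

  #include <chrono>
  #include <cstdint>
  #include <future>
  #include <memory>
  #include <string>

  // Sketch of the promise/future bridge: connect_to takes the new optional
  // completion callback, which fulfils the promise from the network thread.
  std::string AddNodeSync(network::NetworkManager& nm,
                          const std::string& address, uint16_t port) {
      auto promise = std::make_shared<std::promise<network::ConnectionResult>>();
      auto future = promise->get_future();

      auto sync_result = nm.connect_to(
          address, port,
          [promise](network::ConnectionResult r) { promise->set_value(r); });

      if (sync_result != network::ConnectionResult::Success) {
          return "rejected synchronously";        // no async callback will fire
      }
      if (future.wait_for(std::chrono::seconds(12)) != std::future_status::ready) {
          return "connection attempt timed out";  // bounded outer wait
      }
      return future.get() == network::ConnectionResult::Success
                 ? "connected"
                 : "transport connect failed";
  }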

Comment thread src/network/rpc_server.cpp Outdated
Comment on lines +66 to +82
  switch (r) {
    case network::ConnectionResult::Success:
      return "Connected";
    case network::ConnectionResult::NotRunning:
      return "Network is not running";
    case network::ConnectionResult::AddressBanned:
      return "Address is banned";
    case network::ConnectionResult::AddressDiscouraged:
      return "Address is discouraged";
    case network::ConnectionResult::AlreadyConnected:
      return "Already connected (or connection attempt already in flight)";
    case network::ConnectionResult::NoSlotsAvailable:
      return "No outbound connection slots available";
    case network::ConnectionResult::TransportFailed:
      return "Transport connect failed";
  }
  return "Unknown connection error";

medium

The switch statement on network::ConnectionResult handles all current enum members, but lacks a default case. The return statement after the switch acts as an implicit default, which can be less clear and may prevent compilers from warning about unhandled enum values if new members are added in the future. For improved maintainability and robustness, it's better to include an explicit default case.

  switch (r) {
    case network::ConnectionResult::Success:
      return "Connected";
    case network::ConnectionResult::NotRunning:
      return "Network is not running";
    case network::ConnectionResult::AddressBanned:
      return "Address is banned";
    case network::ConnectionResult::AddressDiscouraged:
      return "Address is discouraged";
    case network::ConnectionResult::AlreadyConnected:
      return "Already connected (or connection attempt already in flight)";
    case network::ConnectionResult::NoSlotsAvailable:
      return "No outbound connection slots available";
    case network::ConnectionResult::TransportFailed:
      return "Transport connect failed";
    default:
      return "Unknown connection error";
  }

Comment thread test/unit/chain/miner_tests.cpp Outdated
#include "util/uint.hpp"
#include "util/hash.hpp"
#include <memory>
#include <unistd.h>

medium

Including <unistd.h> can limit the portability of this test code, as it's specific to POSIX-compliant systems. If this header is being used for functions like sleep() or usleep(), consider using the standard C++ equivalent std::this_thread::sleep_for from the <thread> and <chrono> headers. This will ensure the tests can run on a wider range of platforms, including Windows.
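
If the header is there for sleep()/usleep() (an assumption), a portable drop-in looks like this:

  #include <chrono>
  #include <thread>

  void wait_briefly() {
      // Portable equivalents of usleep(100000) and sleep(1):
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::this_thread::sleep_for(std::chrono::seconds(1));
  }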

Comment thread test/unit/chain/notifications_tests.cpp Outdated
#include <memory>
#include <vector>
#include <atomic>
#include <unistd.h>

medium

Including <unistd.h> can limit the portability of this test code, as it's specific to POSIX-compliant systems. If this header is being used for functions like sleep() or usleep(), consider using the standard C++ equivalent std::this_thread::sleep_for from the <thread> and <chrono> headers. This will ensure the tests can run on a wider range of platforms, including Windows.


Copilot AI left a comment


Pull request overview

This PR improves the addnode RPC behavior by waiting for an outbound manual connection attempt to resolve and returning a more accurate success/error response, enabled by threading a new optional completion callback through the network connection stack.

Changes:

  • Add a synchronous-wait wrapper in RPCServer::HandleAddNode that blocks (bounded) until the async transport connect outcome is known and returns a clearer RPC response.
  • Extend NetworkManager::connect_to / ConnectionManager::ConnectTo to accept an optional ConnectCompletion callback and invoke it when the async connect path completes.
  • Add <unistd.h> includes to a couple of unit test files.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • test/unit/chain/notifications_tests.cpp: Adds a POSIX header include for tests.
  • test/unit/chain/miner_tests.cpp: Adds a POSIX header include for tests.
  • src/network/rpc_server.cpp: Implements bounded synchronous waiting for addnode and improved error messaging.
  • src/network/network_manager.cpp: Plumbs a new optional completion callback through NetworkManager::connect_to.
  • src/network/connection_manager.cpp: Invokes the new completion callback from the async connect flow with specific ConnectionResults.
  • include/network/network_manager.hpp: Updates the connect_to signature to accept an optional completion callback.
  • include/network/connection_manager.hpp: Introduces ConnectCompletion and updates the ConnectTo signature accordingly.


Comment thread src/network/connection_manager.cpp Outdated
Comment on lines +1200 to +1204
    LOG_NET_WARN("outbound connect failed: {}:{} (conn_type={})", address, port,
                 ConnectionTypeAsString(effective_conn_type));
  }
  if (on_complete) {
    on_complete(ConnectionResult::TransportFailed);

Copilot AI Apr 28, 2026


on_complete is invoked while mutex_ is held (inside the asio::post handler). If a caller’s completion handler calls back into ConnectionManager/NetworkManager (or anything else that might take mutex_), this can deadlock or create hard-to-debug re-entrancy. Consider capturing the result under the lock, releasing the lock, and then invoking on_complete (or posting the callback separately) so user code never runs under mutex_.
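
A sketch of the suggested restructuring, with illustrative names only (the surrounding asio::post handler is not shown):

  // Decide the outcome and take the callback while holding the lock, then invoke
  // it only after the lock is released, so user code never runs under mutex_.
  ConnectCompletion pending;
  {
      std::lock_guard<std::mutex> lock(mutex_);
      // ... update connection bookkeeping, determine the result ...
      pending = std::move(on_complete);
  }
  if (pending) {
      pending(ConnectionResult::TransportFailed);  // mutex_ no longer held here
  }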

Comment thread src/network/rpc_server.cpp Outdated
Comment on lines +112 to +115
if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
return util::JsonError("Connection attempt timed out");
}
auto final_result = future.get();

Copilot AI Apr 28, 2026


future.get() can throw (e.g., std::future_error for a broken promise if the connect attempt is aborted and the completion callback is destroyed without setting the promise). Since this is on an RPC path, it would be better to catch std::future_error around wait_for/get and return a deterministic JsonError message instead of relying on the outer handler exception path.

Suggested change

  // Before:
  if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
    return util::JsonError("Connection attempt timed out");
  }
  auto final_result = future.get();

  // After:
  network::ConnectionResult final_result;
  try {
    if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
      return util::JsonError("Connection attempt timed out");
    }
    final_result = future.get();
  } catch (const std::future_error&) {
    return util::JsonError("Connection attempt failed");
  }

Member

@MastaP MastaP left a comment


Targeted review on top of the existing bot comments — focusing on the new bifurcated sync/async API surface, timeout coupling, and a few cosmetic/refactor notes. Big picture this is fine and useful; the issues are mostly around future-proofing the contract.

Substantive

  1. ConnectCompletion API contract is undocumented. The lambda callers must understand that on_complete is invoked iff ConnectTo returns Success synchronously — for NotRunning / AddressBanned / AlreadyConnected / NoSlotsAvailable / sync-TransportFailed, the early returns drop the callback on the floor. AddNodeSyncConnect happens to handle this correctly by branching on sync_result first, but the next caller will not. (Inline.)
  2. addNodeWaitTimeout is coupled-by-comment to DEFAULT_CONNECT_TIMEOUT. Both constants live in unrelated translation units; tests can also override the transport timeout via connect_timeout_override_ms_ to a value larger than 12s, which would make the RPC report a false-positive timeout while the transport is still working. (Inline.)
  3. Late-completion divergence. If the async transport callback lands after the 12s wait expires, the RPC user sees "Connection attempt timed out" but the peer has actually been added (peer registry, manual_addresses_, outbound count). A retry returns AlreadyConnected, which is confusing. Worth either cancelling the in-flight transport on RPC timeout or documenting the race. (Inline.)
  4. Shutdown amplification (pre-existing pattern, slightly worsened). RPCServer::Stop() drains active requests for at most 5s; addNodeWaitTimeout is 12s, so an in-flight addnode can outlive Stop() by ~7s. After Stop returns, Application::shutdown() proceeds to network_manager_->stop() and the surrounding state can tear down. Long-poll already has the same gap (30s vs 5s), so this isn't new, but the diff makes the entry point bigger. Consider raising the drain timeout to envelop all known handlers, or piping a cancellation token into wait_for (a minimal polling sketch follows this list).
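
A minimal polling sketch for the cancellation idea in item 4; the shutting_down_ flag and the ADDNODE_WAIT_TIMEOUT name are assumptions:

  // Wait in short slices so RPCServer::Stop() can interrupt an in-flight addnode
  // instead of letting it outlive the 5s drain window.
  const auto deadline = std::chrono::steady_clock::now() + ADDNODE_WAIT_TIMEOUT;
  while (future.wait_for(std::chrono::milliseconds(100)) != std::future_status::ready) {
      if (shutting_down_.load()) {
          return util::JsonError("Node is shutting down");
      }
      if (std::chrono::steady_clock::now() >= deadline) {
          return util::JsonError("Connection attempt timed out");
      }
  }
  auto final_result = future.get();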

Cosmetic / refactor
5. addNodeWaitTimeout uses lowerCamelCase; every other compile-time constant in rpc_server.cpp is UPPER_SNAKE_CASE (MAX_CONCURRENT_REQUESTS, MAX_SOCKET_PATH, SHUTDOWN_TIMEOUT_MS, POLL_INTERVAL_MS, STALE_TIP_THRESHOLD, DEFAULT_BANTIME_SEC, LONGPOLL_TIMEOUT_SECONDS). (Inline.)
6. Trailing em-dash with no continuation in the synchronous-rejection branch comment. (Inline.)
7. Taking connect_to(..., const ConnectCompletion& on_complete) and then capturing on_complete = on_complete forces an extra copy of the std::function. Idiomatic alternative: take ConnectCompletion by value and std::move it into the capture, saving one copy. (Picky; see the sketch after this list.)
8. The <unistd.h> adds in the two test files are correct on macOS for mkdtemp, but the underlying mkdtemp boilerplate already exists as unicity::test::TempDir in test/common/test_util.hpp (which itself includes <unistd.h>). Replacing the open-coded blocks with TempDir would obviate the include and remove duplication. Not in this PR's scope, but worth a follow-up. (Inline on test file.)
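
For item 7, the by-value-and-move pattern in generic form (sketch only, assuming the existing ConnectCompletion alias and asio headers are in scope; not the project's actual signatures):

  // Take the std::function by value; callers may move into the parameter, and the
  // function moves it again into the capture, so at most one copy happens at the call site.
  void start_connect(asio::io_context& io, ConnectCompletion on_complete) {
      asio::post(io, [on_complete = std::move(on_complete)] {
          // ... perform the connect attempt, compute the real ConnectionResult ...
          if (on_complete) on_complete(ConnectionResult::TransportFailed);
      });
  }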

Coverage
No new tests cover the synchronous-wait behavior (timeout path, success-after-handshake, NotRunning/Banned sync rejection). A small RPC-level integration test using a stub transport that defers the ConnectCallback would lock in the new contract and prevent regressions.

Comment thread include/network/connection_manager.hpp Outdated
enum class ConnectionResult; // From network_manager.hpp

// Optional connection completion callback fired from the io_context thread
using ConnectCompletion = std::function<void(ConnectionResult)>;
Member


Please document the contract here, since it's non-obvious from the type alone:

  • on_complete is invoked iff the corresponding ConnectTo call returned ConnectionResult::Success synchronously. On any synchronous-failure return (NotRunning, AddressBanned, AddressDiscouraged, AlreadyConnected, NoSlotsAvailable, sync TransportFailed), it is not invoked.
  • It runs on the io_context thread while mutex_ is held (also flagged elsewhere) — callbacks must not re-enter ConnectionManager/NetworkManager and should keep work to a minimum (e.g. setting a promise).
  • It fires at most once.

The current AddNodeSyncConnect happens to get this right because it branches on the sync result before touching the future, but a future caller without this context can easily wait on a promise that will never be set.
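
Something along these lines, adjacent to the alias (wording to taste):

  // Optional completion callback for ConnectionManager::ConnectTo / NetworkManager::connect_to.
  //
  // Contract:
  //  - Invoked iff the corresponding ConnectTo call returned ConnectionResult::Success
  //    synchronously; it is never invoked after a synchronous-failure return.
  //  - Fires at most once, on the io_context thread (currently while mutex_ is held):
  //    do not re-enter ConnectionManager/NetworkManager from the callback, and keep
  //    the work minimal (e.g. fulfil a std::promise).
  using ConnectCompletion = std::function<void(ConnectionResult)>;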

Comment thread src/network/rpc_server.cpp Outdated
// timeout (see DEFAULT_CONNECT_TIMEOUT in include/network/real_transport.hpp,
// currently 10s) so the transport always reports a real result before this
// fires; this is a defensive outer bound in case the callback never fires.
constexpr auto addNodeWaitTimeout = std::chrono::seconds(12);
Member


Two concerns about the timeout choice:

  1. Naming convention. Every other compile-time constant in this file is UPPER_SNAKE_CASE (MAX_CONCURRENT_REQUESTS, MAX_SOCKET_PATH, SHUTDOWN_TIMEOUT_MS, POLL_INTERVAL_MS, STALE_TIP_THRESHOLD, DEFAULT_BANTIME_SEC, LONGPOLL_TIMEOUT_SECONDS). Suggest ADDNODE_WAIT_TIMEOUT for consistency.

  2. Coupled-by-comment to RealTransportConnection::DEFAULT_CONNECT_TIMEOUT. The 12s value only "works" because it's > 10s, but that relationship lives only in this comment. Two failure modes:

    • If someone bumps DEFAULT_CONNECT_TIMEOUT in real_transport.hpp to e.g. 15s without grepping, the addnode RPC will start reporting false-positive timeouts.
    • Tests can already override the transport timeout via RealTransportConnection::connect_timeout_override_ms_ (a public static atomic), which silently breaks the relationship in test scenarios.

Consider deriving programmatically, e.g.:

constexpr auto ADDNODE_WAIT_TIMEOUT =
    network::RealTransportConnection::DEFAULT_CONNECT_TIMEOUT + std::chrono::seconds(2);

(Or expose a public accessor on the transport.) That removes the comment-only coupling and survives test overrides if connect_timeout_ms() is consulted instead.

Comment thread src/network/rpc_server.cpp Outdated
[promise](network::ConnectionResult r) { promise->set_value(r); });

if (sync_result != network::ConnectionResult::Success) {
// Synchronous rejection (NotRunning, AlreadyConnected, Banned etc) —
Member


Trailing em-dash with no continuation — looks like an unfinished thought. Either complete it (... — return immediately, no async path will fire) or drop the dash.

Comment thread src/network/rpc_server.cpp Outdated
}

  if (future.wait_for(addNodeWaitTimeout) != std::future_status::ready) {
    return util::JsonError("Connection attempt timed out");
Member


Late-completion race: if the transport callback lands between 12s and ~12s+ε, this branch returns "Connection attempt timed out" but the peer is actually being added (peer registry, manual_addresses_, outbound count). The user thinks the addnode failed; a retry now returns AlreadyConnected, which is the worst-of-both.

Two possible mitigations:

  • On timeout, walk the connection back: ask ConnectionManager to cancel the pending entry and disconnect the peer if it lands. Requires a small new API on ConnectionManager (a cancel handle keyed on AddressKey); a hypothetical shape is sketched below.
  • Document the race explicitly here and in the user-facing RPC docs, so operators know that a timeout error doesn't preclude a successful late connection.

Low severity in practice (you'd need transport timing > 10s and < 12s + scheduling slack), but it's a real divergence between RPC reply and node state.
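
For the first mitigation, a hypothetical shape (none of these names exist in the codebase yet):

  // Hypothetical cancel handle on ConnectionManager, keyed on the pending address.
  // Returns true if the attempt was still pending and has been abandoned; a late
  // transport success would then be disconnected instead of registered as a peer.
  bool CancelPendingConnect(const AddressKey& key);

  // RPC side, on timeout (AddressKey construction is assumed):
  if (future.wait_for(ADDNODE_WAIT_TIMEOUT) != std::future_status::ready) {
      connection_manager.CancelPendingConnect(AddressKey{address, port});
      return util::JsonError("Connection attempt timed out");
  }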

Comment thread test/unit/chain/miner_tests.cpp Outdated
#include "util/uint.hpp"
#include "util/hash.hpp"
#include <memory>
#include <unistd.h>
Member


Follow-up suggestion (out of scope for this PR's intent, but it would obviate the macOS portability concern raised by the bots): the mkdtemp block at L41–46 (and the matching block in notifications_tests.cpp) duplicates unicity::test::TempDir in test/common/test_util.hpp, which already encapsulates the call and already includes <unistd.h>. Replacing the open-coded boilerplate with TempDir test_dir{"unicity_miner_test"} would let you drop this <unistd.h> add entirely and centralize the POSIX dependency.

Also fixes open PR review feedback:
- default-case in switch
- em-dash
- programmatic timeout derivation
- <unistd.h> removal via TempDir
- ConnectionManager contract clarified
- integration test for addnode
@lploom lploom merged commit 4e94471 into bft-snapshots Apr 30, 2026


Development

Successfully merging this pull request may close these issues.

Silent outbound connection failure — addnode returns success but connection never established

3 participants