Skip to content

Issue: disconnect with some peer because of not sync for a long time in p2p #427

@317787106

Description

@317787106

Simple Summary

We introduce a solution to reduce the log error "Peer xxx not sync for a long time" when fullnode syncs with peer.

Abstract

When one fullnode synchronizes with other peer, there is a blockLock in TronNetDelegate. There are three threads to apply for this lock. sometimes, when one peer canot get this blockLock in ChainInventoryMsgHandler,
this peer may be disconnected by thread PeerStatusCheck.check because it finds that this peer's blockBothHaveUpdateTime is blockUpdateTimeout seconds ago than system time. We Introduce a solution to decrease the time cost of acquire blockLock in fullnode that synchronizes with other peer.

Motivation

Sometimes we see the error in tron.log:

00:03:22.898 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /206.221.176.113 not sync for a long time.
00:12:14.924 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /104.194.8.221 not sync for a long time.
00:12:42.926 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /5.9.101.190 not sync for a long time.
00:16:40.980 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /109.238.14.182 not sync for a long time.
00:16:40.981 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /5.9.101.190 not sync for a long time.
00:55:55.277 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.104.153.187 not sync for a long time.
00:56:01.277 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /3.142.169.23 not sync for a long time.
01:07:29.314 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /3.142.169.23 not sync for a long time.
01:09:49.333 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /13.229.124.42 not sync for a long time.
01:17:39.358 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /103.85.253.22 not sync for a long time.
01:17:39.358 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /183.131.184.130 not sync for a long time.
02:05:57.891 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.252.13.200 not sync for a long time.
02:06:09.892 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /18.224.252.203 not sync for a long time.
02:06:09.892 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /51.81.109.193 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.252.6.86 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.237 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.33 not sync for a long time.
02:06:53.895 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /209.126.87.106 not sync for a long time.
02:07:27.897 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /18.232.41.41 not sync for a long time.
02:11:29.941 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /206.221.176.113 not sync for a long time.
02:11:29.941 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /13.214.179.230 not sync for a long time.
03:22:54.472 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /51.81.184.18 not sync for a long time.
03:22:58.472 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /13.38.128.252 not sync for a long time.
03:23:00.473 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /51.81.67.97 not sync for a long time.
13:26:09.850 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /47.90.250.85 not sync for a long time.
13:26:13.851 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.83 not sync for a long time.
13:26:15.851 WARN  [pool-53-thread-1] [net](PeerStatusCheck.java:49) Peer /10.60.100.33 not sync for a long time.

these peers are disconnected because unexpect reason because "not sync for a long time" by thread PeerStatusCheck.check.

public void statusCheck() {

  long now = System.currentTimeMillis();

  tronNetDelegate.getActivePeer().forEach(peer -> {

    boolean isDisconnected = false;

    if (peer.isNeedSyncFromPeer()
        && peer.getBlockBothHaveUpdateTime() < now - blockUpdateTimeout) {
      logger.warn("Peer {} not sync for a long time.", peer.getInetAddress());
      isDisconnected = true;
    }
  ...
  }
}

So why this peer has not been sync for a long time? It's the same question as why PeerConnection.blockBothHaveUpdateTime has not been updated for long time.
blockBothHaveUpdateTime is update in ChainInventoryMsgHandler.processMessage

synchronized (tronNetDelegate.getBlockLock()) {
  while (!peer.getSyncBlockToFetch().isEmpty() && tronNetDelegate
      .containBlock(peer.getSyncBlockToFetch().peek())) {
    try {
      BlockId blockId = peer.getSyncBlockToFetch().pop();
      peer.setBlockBothHave(blockId);
      logger.info("Block {} from {} is processed",
              blockId.getString(), peer.getNode().getHost());
    } catch (NoSuchElementException e) {
      logger.warn("Process ChainInventoryMessage failed, peer {}, isDisconnect:{}",
              peer.getNode().getHost(), peer.isDisconnect());
      return;
    }
  }

we can see if this peer cannot get the blockLock, blockBothHaveUpdateTime cannot be update. blockLock is also used in SyncService.handleSyncBlock

synchronized (tronNetDelegate.getBlockLock()) {
  blockWaitToProcess.forEach((msg, peerConnection) -> {
    ...
    final boolean[] isFound = {false};
    tronNetDelegate.getActivePeer().stream()
        .filter(peer -> msg.getBlockId().equals(peer.getSyncBlockToFetch().peek()))
        .forEach(peer -> {
          peer.getSyncBlockToFetch().pop();
          peer.getSyncBlockInProcess().add(msg.getBlockId());
          isFound[0] = true;
        });
    if (isFound[0]) {
      blockWaitToProcess.remove(msg);
      isProcessed[0] = true;
      processSyncBlock(msg.getBlockCapsule());
    }
  });
}

Suppose we get many blocks from one peer, as processSyncBlock need to write disk, this for cycle may cost a long time.

Implementation

Our solution is to swap the order of getBlockLock and the for cycle:

blockWaitToProcess.forEach((msg, peerConnection) -> {
  synchronized (tronNetDelegate.getBlockLock()) {
    ...
    final boolean[] isFound = {false};
    tronNetDelegate.getActivePeer().stream()
        .filter(peer -> msg.getBlockId().equals(peer.getSyncBlockToFetch().peek()))
        .forEach(peer -> {
          peer.getSyncBlockToFetch().pop();
          peer.getSyncBlockInProcess().add(msg.getBlockId());
          isFound[0] = true;
        });
    if (isFound[0]) {
      blockWaitToProcess.remove(msg);
      isProcessed[0] = true;
      processSyncBlock(msg.getBlockCapsule());
    }
  }
});

This can reduce the total time of lock blockLock in SyncService.handleSyncBlock, so other thread such as ChainInventoryMsgHandler.processMessage can get the lock more esaily.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions