Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize event subscribe exception handling #5719

Open
lurais opened this issue Feb 7, 2024 · 33 comments
Open

Optimize event subscribe exception handling #5719

lurais opened this issue Feb 7, 2024 · 33 comments
Assignees
Labels
topic: Event subscribe transaction trigger, block trigger, contract event, contract log type: Enhancement minor enhancement type:feature

Comments

@lurais
Copy link
Contributor

lurais commented Feb 7, 2024

Rationale

Why should this feature exist?
I notice that the codes of the event subscribe function are quite close to the applyBlock process codes in the pushBlock function, and any exception thrown by the event subscribe function will lead to normal blocks being removed from KhaosDB, which has been committed successfully.

In this case, the next block will be processed failed when putting into the KhaosDB since no parent block could be found in KhaosDB.

The detail code is here:

try (ISession tmpSession = revokingStore.buildSession()) {
    long oldSolidNum =
            chainBaseManager.getDynamicPropertiesStore().getLatestSolidifiedBlockNum();
    applyBlock(newBlock, txs);
    tmpSession.commit();
    // if event subscribe is enabled, post block trigger to queue
    postBlockTrigger(newBlock);
    // if event subscribe is enabled, post solidity trigger to queue
    postSolidityTrigger(oldSolidNum,
            getDynamicPropertiesStore().getLatestSolidifiedBlockNum());
  } catch (Throwable throwable) {
    logger.error(throwable.getMessage(), throwable);
    khaosDb.removeBlk(block.getBlockId());
    throw throwable;
  }

The block synchronization should work as we expected when any exception is thrown in the event subscribe function, in detail, it should work well if the event subscribe function is allowed to miss some data.

Implementation

Do you have ideas regarding the implementation of this feature?

Solution: The event subscribe logic is the subsequent processing logic of block processing, and the event subscribe logic is independent from the existing block processing logic. Therefore, exceptions related to the event subscribe logic should be handled separately. In addition, due to some users' high dependence on the stability of event subscribe functions, while others require synchronization services to remain stable and allow occasional exceptions to occur in the event subscribe logic, a new switch is added to control whether synchronization functions should be suspended when event subscribe services are abnormal.

The new added switch is located in config file path: event.subscribe.ignoreSubscribeError, and the node will stop when any exception is thrown in event subscribe services if the config is not set or set to false; and the node will ignore the exceptions thrown by the event subscribe services and continue to synchronize blocks only if it has set to true .

The related code will be like this:

    try (ISession tmpSession = revokingStore.buildSession()) {
      applyBlock(newBlock, txs);
      tmpSession.commit();
    } catch (Throwable throwable) {
      logger.error(throwable.getMessage(), throwable);
      khaosDb.removeBlk(block.getBlockId());
      throw throwable;
    }
    try {
      long oldSolidNum =
              chainBaseManager.getDynamicPropertiesStore().getLatestSolidifiedBlockNum();
      // if event subscribe is enabled, post block trigger to queue
      postBlockTrigger(newBlock);
      // if event subscribe is enabled, post solidity trigger to queue
      postSolidityTrigger(oldSolidNum,
              getDynamicPropertiesStore().getLatestSolidifiedBlockNum());
    } catch (Exception e) {
      logger.error("Post trigger error.", e);
      // as disscussed below, this switch will not be added 
      if (!ignoreSubscribeErrorSwitchOn()) {
        System.exit(1);
      }
    }

As discussed below, the config ignoreSubscribeError is unnecessary and will not be added. The next question is, if the event subscription function throws an exception, does it need to be retried? Or just stop the node if any exception is found in the event subscription:

     try (ISession tmpSession = revokingStore.buildSession()) {

        long oldSolidNum =
                chainBaseManager.getDynamicPropertiesStore().getLatestSolidifiedBlockNum();

        applyBlock(newBlock, txs);
        // if event subscribe is enabled, post block trigger to queue
        postBlockTrigger(newBlock);
        // if event subscribe is enabled, post solidity trigger to queue
        postSolidityTrigger(oldSolidNum,
                getDynamicPropertiesStore().getLatestSolidifiedBlockNum());
        tmpSession.commit();
      } catch (Throwable throwable) {
        logger.error(throwable.getMessage(), throwable);
        khaosDb.removeBlk(block.getBlockId());
        throw throwable;
      }

As discussed in the comments, the node will stop if an exception is found in the event subscription.

Are you willing to implement this feature?
Yes.

@lurais lurais closed this as completed Feb 7, 2024
@lurais lurais reopened this Feb 7, 2024
@halibobo1205 halibobo1205 added topic: Event subscribe transaction trigger, block trigger, contract event, contract log type: Enhancement minor enhancement labels Feb 7, 2024
@halibobo1205
Copy link
Contributor

why need ignoreSubscribeErrorSwitchOn ?

@lurais
Copy link
Contributor Author

lurais commented Feb 19, 2024

while others require synchronization services to remain stable and allow occasional exceptions to occur in the event subscribe logic

To offer users a choice, synchronization services can remain stable while allowing occasional exceptions to occur during subscribe logic.

@lurais
Copy link
Contributor Author

lurais commented Feb 26, 2024

@jwrct @317787106 @ss3344520 Do you have any ideas about this issue?

@ss334452
Copy link
Contributor

@lurais From your description, there are more than two exception handling strategies. It is recommended to use enumeration types for configuration parameters instead of boolean types.

@halibobo1205
Copy link
Contributor

I don't recommend adding more configurations for user purposes.

@lurais
Copy link
Contributor Author

lurais commented Feb 26, 2024

@lurais From your description, there are more than two exception handling strategies. It is recommended to use enumeration types for configuration parameters instead of boolean types.

@ss3344520 In my opinion, is there any situation in which it needs to use another strategy besides stopping or continuing to synchronize blocks?

@317787106
Copy link
Contributor

@lurais is System.exit(1) has any potential problems, such as resource release ?

@lurais
Copy link
Contributor Author

lurais commented Feb 26, 2024

@lurais is System.exit(1) has any potential problems, such as resource release ?

@317787106 When you use the command "System.exit", it will take effect as same as "kill -15" actually.

@lurais
Copy link
Contributor Author

lurais commented Feb 26, 2024

@halibobo1205 Why do you not recommend adding more configurations for user purposes? Do you have any more detail reasons?
@317787106 @ss3344520 @jwrct The issue that needs to be discussed now is whether to add more configurations to continue to synchronize blocks for user purposes when there is a problem with the event subscription service. Do you have any ideas about this?

@jwrct
Copy link
Contributor

jwrct commented Feb 27, 2024

Based on your plan and default configuration, if an exception is thrown causing the system to exit when publishing an event, how should the node user handle it to recover? Will there be any data loss issue after the node restarts? Will there be any impact on the event subscribers?

@halibobo1205
Copy link
Contributor

Event subscribers are not allowed to drop data in most cases.

@lurais
Copy link
Contributor Author

lurais commented Feb 27, 2024

Based on your plan and default configuration, if an exception is thrown causing the system to exit when publishing an event, how should the node user handle it to recover? Will there be any data loss issue after the node restarts? Will there be any impact on the event subscribers?

@jwrct We expect the event subscription mechanism to work without any issues and not lose any data. If you encounter any problems with the event subscription component itself, you should fix the problem and restart it as before. If the problem is caused by sporadic issues like network problems, restarting the node can generally return it to normal. As the currently processed block is not solidified, there will be no data loss. However, duplicate data may be generated due to the impact on the event subscription component.

@lurais
Copy link
Contributor Author

lurais commented Feb 27, 2024

Event subscribers are not allowed to drop data in most cases.

@halibobo1205 As we can see, if someone builds a website, the website provides external services by calling the RPC interface of the node, and also uses event subscription for subsequent on-chain tracking. If the node is stopped directly, the overall service will lose availability. Is this reasonable?

@lurais
Copy link
Contributor Author

lurais commented Feb 29, 2024

Event subscribers are not allowed to drop data in most cases.

@halibobo1205 As we can see, if someone builds a website, the website provides external services by calling the RPC interface of the node, and also uses event subscription for subsequent on-chain tracking. If the node is stopped directly, the overall service will lose availability. Is this reasonable?

@jwrct @ss3344520 @317787106 do you have any ideas about this scenario?

@jwrct
Copy link
Contributor

jwrct commented Mar 1, 2024

@lurais What scene? I don't understand what you are asking.

@lurais
Copy link
Contributor Author

lurais commented Mar 1, 2024

if someone builds a website, the website provides external services by calling the RPC interface of the node, and also uses event subscription for subsequent on-chain tracking. If the node is stopped directly, the overall service will lose availability. Is this reasonable?

@jwrct if someone builds a website, the website provides external services by calling the RPC interface of the node, and also uses event subscription for subsequent on-chain tracking. If the node is stopped directly, the overall service will lose availability. Is this reasonable?

@lxcmyf
Copy link
Contributor

lxcmyf commented Mar 1, 2024

Isn't that right? I see that when the modified code throws an exception in the event subscription service, if the switch is not turned on, the program will exit directly? There's nothing like that. How do users perceive when a node stops? Since we are concerned about the generation of duplicate data, we should focus on data deduplication.

@lurais
Copy link
Contributor Author

lurais commented Mar 1, 2024

Isn't that right? I see that when the modified code throws an exception in the event subscription service, if the switch is not turned on, the program will exit directly? There's nothing like that. How do users perceive when a node stops? Since we are concerned about the generation of duplicate data, we should focus on data deduplication.

@lxcmyf If the switch is not turned on, the program will terminate. This behavior is similar to the one before the modification. Previously, if there was an abnormal event subscription, synchronization would stop but the process would continue running. However, existing tools can easily detect when a node stops running. Since forks and other similar situations may arise, it is possible that the subscription service may receive duplicate data, this is another question.

@lxcmyf
Copy link
Contributor

lxcmyf commented Mar 1, 2024

Isn't that right? I see that when the modified code throws an exception in the event subscription service, if the switch is not turned on, the program will exit directly? There's nothing like that. How do users perceive when a node stops? Since we are concerned about the generation of duplicate data, we should focus on data deduplication.

@lxcmyf If the switch is not turned on, the program will terminate. This behavior is similar to the one before the modification. Previously, if there was an abnormal event subscription, synchronization would stop but the process would continue running. However, existing tools can easily detect when a node stops running. Since forks and other similar situations may arise, it is possible that the subscription service may receive duplicate data, this is another question.

I think the focus is on why the event subscription service threw an error, and then resolving the error and restarting it.
This switch gives a strange feeling. Since there is a mistake, it is necessary to locate and solve the problem, rather than temporarily ignoring it. tron provides underlying services and should not provide such a switch.

@halibobo1205
Copy link
Contributor

My point is to make sure that messages can't be lost if the event subscription is turned on .

@lurais
Copy link
Contributor Author

lurais commented Mar 1, 2024

tron provides underlying services and should not provide such a switch.

@lxcmyf Most people think that the switch is not needed. Therefore, it will be removed to avoid data loss in the event subscription service. However, users with specific requirements can modify the code to suit their needs.

@bladehan1
Copy link
Contributor

This switch doesn't do much for me, I don't really like using a service that clearly tells me there's a problem.

@lurais
Copy link
Contributor Author

lurais commented Mar 4, 2024

This switch doesn't do much for me, I don't really like using a service that clearly tells me there's a problem.

@bladehan1 Do you mean you would like to be told that there is a problem clearly without an option to ignore it?

@CarlChaoCarl
Copy link
Contributor

I've read all of the above and think the ignoreSubscribeErrorSwitchOn switch is unnecessary.

@lurais
Copy link
Contributor Author

lurais commented Mar 4, 2024

I've read all of the above and think the ignoreSubscribeErrorSwitchOn switch is unnecessary.

@CarlChaoCarl Ok, got it.

@lurais
Copy link
Contributor Author

lurais commented Mar 12, 2024

As discussed below, the config ignoreSubscribeError is unnecessary and will not be added.
The next question is, if the event subscription function throws an exception, does it need to be retried? Or just stop the node if any exception is found in the event subscription. @317787106 @lxcmyf @ss3344520 @jwrct @halibobo1205

@halibobo1205
Copy link
Contributor

event subscribe exception: You need to confirm when the exception occurs.

@lurais
Copy link
Contributor Author

lurais commented Mar 13, 2024

event subscribe exception: You need to confirm when the exception occurs.

@halibobo1205 From what I can understand, the event subscription logic seems to work fine without any exceptions. However, there may be cases where we end up throwing exceptions due to incorrect usage of third-party components. For instance, in this scenario issue, if two threads try to use the Zeromq client simultaneously, it might result in errors. Although it may work if we try pushBlock again, it is not advisable as some data might be lost, and it could lead to other issues in the future.

@317787106
Copy link
Contributor

317787106 commented Mar 13, 2024

@lurais This exception only happens when write to ZeroMQ fails. When event plugin enables, the data is write to a queue that exists in memory so exception may not happen in this scene generaly. ZeroMQ and event plugin cannot be used at the same time. ZeroMQ is a native queue, exception happens rarely. I prefer to stop the node as retrying to write data might not work if IO error occurs.

@lurais
Copy link
Contributor Author

lurais commented Mar 13, 2024

@lurais This exception only happens when write to ZeroMQ fails. When event plugin enables, the data is write to a queue that exists in memory so exception may not happen in this scene generaly. ZeroMQ and event plugin cannot be used at the same time. ZeroMQ is a native queue, exception happens rarely. I prefer to stop the node as retrying to write data might not work if IO error occurs.

@317787106 Yes , it may be wise to resolve a problem as soon as you find it, to avoid any further complications.

@lurais
Copy link
Contributor Author

lurais commented Mar 18, 2024

@lxcmyf @ss3344520 @jwrct @CarlChaoCarl As discussed earlier, the node will stop if an exception is found in the event subscription. Do you have any suggestions?

@tomatoishealthy
Copy link
Contributor

Based on the discussion above, I find that most of the views do not want to add a new switch to complicate the feature. Secondly, the scenario in which the exception occurs is so rare that a retry operation is not always guaranteed to succeed.

And for most Dapps, event loss is unacceptable.

So I suggest:

  1. When an exception occurs in the event service, throw the exact exception and stop the node
  2. The rest logic remains unchanged

I would like to continue this topic AllDevsMeeting

@tomatoishealthy
Copy link
Contributor

The optimize logic: tomatoishealthy@a9bc104

Will be submitted as a PR later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: Event subscribe transaction trigger, block trigger, contract event, contract log type: Enhancement minor enhancement type:feature
Projects
None yet
Development

No branches or pull requests

10 participants