bridge: Significant Delays in BurnTransactionReady/RefundTransactionReady Processing Leading to Accumulating Unhandled Transactions #1053

@sameh-farouk

Describe the bug

The TFChain bridge is experiencing a critical operational issue where BurnTransactionReady events (internally handled as WithdrawReadyEvents by the bridge daemon) are processed with significant delays, often exceeding the 2-minute window during which their associated Stellar signatures are available on TFChain. This leads to a growing backlog of unhandled transactions, impacting the bridge's efficiency and reliability.

Additional context

Stellar Multi-signature and TFChain's Role

  1. Stellar Multi-signature for TFT Burns
    When TFT is to be unlocked on the Stellar network (e.g., as part of a withdrawal from TFChain), a multi-signature Stellar transaction is required.
    This transaction needs to be signed by multiple designated validators (bridge daemons) to ensure security and decentralization.

  2. Signatures on TFChain
    To facilitate this, individual validators submit their signatures for a given burn transaction to TFChain. These signatures are stored in TFChain's state, making them accessible to the bridge daemon.

  3. Stellar Sequence Numbers
    Every Stellar account has a sequence number, which is a monotonically increasing integer. Each new transaction originating from that account must have a sequence number exactly one greater than the account's current sequence number.
    This mechanism prevents transaction replay attacks and ensures transaction ordering. If the Stellar bridge account's sequence number advances before a transaction is submitted, any signatures based on the old sequence number become invalid.
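The sequence-number rule can be illustrated with a minimal Go sketch. The `account` and `signedTx` types and the `isSubmittable` helper are hypothetical simplifications made up for this illustration; they are not the bridge daemon's or the Stellar SDK's actual types.

```go
package main

import "fmt"

// account is a hypothetical, simplified view of a Stellar account.
type account struct {
	Sequence int64 // sequence number of the last applied transaction
}

// signedTx is a hypothetical stand-in for a multi-signature transaction;
// its signatures cover the sequence number it was built with.
type signedTx struct {
	Sequence int64
}

// isSubmittable reports whether tx can still be applied to acc: its
// sequence number must be exactly one greater than the account's.
func isSubmittable(acc account, tx signedTx) bool {
	return tx.Sequence == acc.Sequence+1
}

func main() {
	acc := account{Sequence: 100}
	tx := signedTx{Sequence: 101} // built and signed while the account was at 100

	fmt.Println(isSubmittable(acc, tx)) // true

	acc.Sequence++ // another transaction from the same account is applied first
	fmt.Println(isSubmittable(acc, tx)) // false: the collected signatures are now stale
}
```

Once the account has moved on, the transaction has to be rebuilt with a new sequence number and signed again by the validators.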


The 2-Minute Signature Expiry Mechanism

TFChain incorporates a defensive mechanism within its pallet-tft-bridge to manage these Stellar signatures. Signatures associated with BurnTransactionReady events are periodically cleared from TFChain state after approximately 2 minutes.

  • Why the Expiry?
    This timeout is a safety measure. Stellar's sequence numbers are critical for transaction validity. If the Stellar bridge account's sequence number advances (e.g., due to another transaction being submitted), any pending signatures for older transactions become stale and unusable.
    Clearing these old signatures prevents the accumulation of invalid data on-chain and reduces the risk of being stuck attempting to submit transactions that are guaranteed to fail.

How the Expiry Causes the Issue

  1. Event Backlog
    The bridge daemon processes all incoming TFChain events sequentially. This includes:

    • WithdrawCreatedEvents
    • WithdrawExpiredEvents
    • RefundExpiredEvents
    • RefundReadyEvents
    • WithdrawReadyEvents (BurnTransactionReady events)
  2. Processing Delays

    • High Event Volume: During periods of high activity, or after bridge outages, a large volume of events can accumulate.
    • Legacy Unhandled Transactions: The bridge also faces a backlog of transactions that previous versions were unable to process.
    • These factors cause the bridge daemon to spend significant time processing other events before reaching WithdrawReadyEvents.
  3. Race Condition / Stale Signatures
    By the time the bridge daemon reaches a WithdrawReadyEvent emitted more than 2 minutes ago, the corresponding Stellar signatures have already been cleared from TFChain by the expiry mechanism.

  4. Failed Processing and Accumulation
    When the bridge attempts to process such an event, it finds no valid signatures. Consequently:

    • It cannot construct and submit the Stellar transaction.
    • The transaction remains in an unhandled state, contributing to the growing count of "stuck" transactions in TFChain storage.
    • This creates a continuous cycle where new BurnTransactionReady events are emitted, but processing is delayed until their signatures expire, leading to an ever-increasing backlog.

Proposed Solution: Batch processing of TFChain events

To alleviate the processing bottleneck and prevent BurnTransactionReady events from becoming unprocessable, I propose to optimize the handling of all TFChain events.

  1. Batching all TFChain calls

    • Instead of processing each event (WithdrawExpiredEvent, etc.) individually, the bridge daemon will collect these events and process them as a batch.
  2. Single Batched Transaction

    • A single Substrate extrinsic using a pallet-utility batch call (e.g. Utility.batch_all or Utility.force_batch) will be constructed and submitted to TFChain.
    • This batched extrinsic will contain multiple calls (TFTBridgeModule.propose_burn_transaction_or_add_sig, proposeOrVoteMintTransactionCall, etc.), each corresponding to a deposit, withdrawal, or refund operation.
  3. Benefits

    • Reduced Transaction Overhead: Submitting one batched transaction instead of many individual ones reduces overhead from signing, network propagation, and block inclusion.
    • Improved Throughput: Efficiently clears the backlog of WithdrawExpiredEvents, allowing the bridge daemon to reach and process critical WithdrawReadyEvents more quickly, ideally before their 2-minute signature expiry.

This approach should help the bridge catch up on the existing backlog of unhandled transactions and prevent future ones from accumulating due to signature expiry.
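A rough Go sketch of the collect-then-submit idea follows. The `call` type, the `submitter` hook, the `chunk` and `submitAll` helpers, and the batch size of 10 are all hypothetical; a real implementation would encode the calls and wrap each batch in one batched extrinsic through the daemon's Substrate client.

```go
package main

import "fmt"

// call is a hypothetical stand-in for an encoded TFChain call.
type call struct {
	Name string
	TxID uint64
}

// chunk splits calls into batches of at most size items, preserving order.
func chunk(calls []call, size int) [][]call {
	if size <= 0 {
		return nil
	}
	var batches [][]call
	for start := 0; start < len(calls); start += size {
		end := start + size
		if end > len(calls) {
			end = len(calls)
		}
		batches = append(batches, calls[start:end])
	}
	return batches
}

// submitter is a hypothetical hook that would wrap one batch of calls in a
// single batched extrinsic and submit it to TFChain.
type submitter func(batch []call) error

// submitAll submits the calls batch by batch and returns how many
// extrinsics were sent before finishing or hitting the first error.
func submitAll(calls []call, size int, submit submitter) (int, error) {
	extrinsics := 0
	for _, batch := range chunk(calls, size) {
		if err := submit(batch); err != nil {
			return extrinsics, err
		}
		extrinsics++
	}
	return extrinsics, nil
}

func main() {
	calls := make([]call, 0, 25)
	for i := 0; i < 25; i++ {
		calls = append(calls, call{Name: "propose_burn_transaction_or_add_sig", TxID: uint64(i)})
	}

	// Stand-in submitter: only logs what a real client would send.
	submit := func(batch []call) error {
		fmt.Printf("submitting 1 extrinsic wrapping %d calls\n", len(batch))
		return nil
	}

	n, err := submitAll(calls, 10, submit)
	if err != nil {
		fmt.Println("submit failed:", err)
		return
	}
	fmt.Printf("%d extrinsics for %d calls\n", n, len(calls)) // 3 extrinsics for 25 calls
}
```

Capping the batch size is a design assumption in this sketch: it keeps each extrinsic within block weight limits while still cutting the number of submissions from 25 to 3.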
