Node skips uptime reporting when it is actually connected to tfchain #2377

@scottyeager

Description

A farmer reported an unusual farmerbot violation on node 6694. While investigating, I found that an uptime report was not submitted during the node's wake up.

Here is a sampling of relevant log entries (newest first):

[+] powerd: 2024-07-16T07:52:21Z info setting node power state state=true
[+] powerd: 2024-07-16T07:52:20Z info listening for power events
[+] powerd: 2024-07-16T07:52:20Z info enabling wol on interface nic=enp1s0f0
[+] powerd: 2024-07-16T07:52:20Z info node uptime hash hash=0x0000000000000000000000000000000000000000000000000000000000000000
[+] powerd: 2024-07-16T07:52:20Z error node is not healthy skipping uptime reports
[+] powerd: 2024-07-16T07:52:20Z error node can not reach grid services
[+] powerd: 2024-07-16T07:52:20Z info node address address=5H4TnXjryWu8v7gNxTCHM3kwbbqDMf8KMM41XaCV5FprQWet
[+] noded: 2024-07-16T07:52:19Z info setting node public config
[+] noded: 2024-07-16T07:52:19Z info node address address=5H4TnXjryWu8v7gNxTCHM3kwbbqDMf8KMM41XaCV5FprQWet
[+] noded: 2024-07-16T07:52:19Z info node has been registered twin=12216
[-] noded: 2024/07/16 07:52:19 Connecting to wss://03.tfchain.grid.tf/...

The first troubling thing here is that Zos reports that the node is not healthy and cannot reach grid services, even though it is reading from and writing to tfchain in the seconds immediately preceding and following this message!

I am also seeing a zero-value hash associated with an uptime report for the first time. I can't tell whether this is a normal part of the error situation or a bug where Zos wrongly thinks an uptime report was submitted successfully.
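
If it helps with triage, here is a minimal Go sketch of a check that would flag a zero-value hash like the one in the log above. `isZeroHash` is a hypothetical helper, not an existing Zos function, and treating an all-zero hash as "nothing was actually submitted" is an assumption that still needs to be confirmed against the Zos code.

```go
package main

import (
	"fmt"
	"strings"
)

// isZeroHash reports whether a 0x-prefixed hex string consists only of
// zero digits. Hypothetical helper for illustration; it is not part of Zos.
func isZeroHash(h string) bool {
	digits := strings.TrimPrefix(h, "0x")
	// An empty string is not a hash at all, so it is not reported as zero.
	return digits != "" && strings.Trim(digits, "0") == ""
}

func main() {
	// Hash value copied from the powerd log line above.
	logged := "0x0000000000000000000000000000000000000000000000000000000000000000"
	fmt.Println(isZeroHash(logged)) // prints: true
}
```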

The node does not retry sending this uptime report. Later, the farmerbot sets its power target to Down. The node then sets its power state to Down, sends a final uptime report, and shuts down. From the perspective of minting, this node only appears to have woken up at the moment it is shutting down (when the first successful uptime report is registered). This confuses the normal flow of the minting process and eventually results in a violation for the node on a later boot, even though the node is booting up successfully when requested.

The bottom line is that a node must send an uptime report before it sets its power state to Down and powers off. If it can't do so immediately, it should retry until successful.
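
To illustrate the retry behavior being requested, here is a rough Go sketch. It is not the Zos implementation: `sendFunc`, `reportUntilSuccess`, and `simulate` are hypothetical names, and the doubling delay with a cap is just one reasonable choice of retry policy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendFunc is a hypothetical stand-in for whatever submits the uptime
// report to tfchain. It returns nil once the report has been accepted.
type sendFunc func(ctx context.Context) error

// reportUntilSuccess calls send until it succeeds or ctx is cancelled.
// The wait between attempts doubles each time, capped at maxDelay.
// It returns the number of attempts made.
func reportUntilSuccess(ctx context.Context, send sendFunc, delay, maxDelay time.Duration) (int, error) {
	for attempt := 1; ; attempt++ {
		err := send(ctx)
		if err == nil {
			return attempt, nil
		}
		fmt.Printf("uptime report attempt %d failed: %v\n", attempt, err)

		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(delay):
		}

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// simulate runs the retry loop against a fake sender that fails the given
// number of times before succeeding. The delays are tiny so it runs fast.
func simulate(failures int) (int, error) {
	calls := 0
	send := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("node can not reach grid services")
		}
		return nil
	}
	return reportUntilSuccess(context.Background(), send, time.Millisecond, 8*time.Millisecond)
}

func main() {
	attempts, err := simulate(2)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The point of the sketch is that a failed first attempt is not final: the loop only exits on success or when the caller cancels the context.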

Here's the full sequence of events:

  1. Node is in standby; power target and state are both Down
  2. Power target is changed to Up, node receives WoL packet and powers on
  3. Node submits an uptime report after booting to a sufficient state (this is the part that failed in this example and can't be skipped)
  4. Power target is changed to Down
  5. Node changes its power state to Down and powers off (don't proceed to this step if step 3 failed, wait and retry)
  6. Node sends final uptime report after changing power state to Down
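
The ordering constraint between steps 3 and 5 can be sketched as a simple gate in Go. This is only a model of the sequence above under assumed names (`Node`, `RecordUptimeReport`, `RequestPowerDown`, `errReportPending` are all hypothetical); it does not reflect how powerd is actually structured.

```go
package main

import (
	"errors"
	"fmt"
)

// errReportPending signals that the node has not yet submitted a
// successful uptime report since it woke up (step 3 is incomplete).
var errReportPending = errors.New("wake-up uptime report not yet submitted")

// Node is a hypothetical model of the power-related state of a node.
type Node struct {
	PowerStateUp       bool // the node's own power state
	WakeUptimeReported bool // true once step 3 has succeeded for this boot
}

// RecordUptimeReport marks step 3 as done for the current boot.
func (n *Node) RecordUptimeReport() {
	n.WakeUptimeReported = true
}

// RequestPowerDown handles a power target of Down (step 4). It refuses to
// change the power state (step 5) until step 3 has succeeded, so the
// caller can wait and retry instead of powering off.
func (n *Node) RequestPowerDown() error {
	if !n.WakeUptimeReported {
		return errReportPending
	}
	n.PowerStateUp = false
	n.WakeUptimeReported = false // the next boot needs a fresh report
	return nil
}

func main() {
	n := &Node{PowerStateUp: true}

	// Power target set to Down before the wake-up report went through.
	fmt.Println(n.RequestPowerDown(), n.PowerStateUp)

	// Once the report succeeds, the node may proceed to step 5.
	n.RecordUptimeReport()
	fmt.Println(n.RequestPowerDown(), n.PowerStateUp)
}
```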
