Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ship crashing on boot with "behn: timer failed, queue blocked" #6928

Closed
dosullivan opened this issue Feb 29, 2024 · 2 comments
Closed

Ship crashing on boot with "behn: timer failed, queue blocked" #6928

dosullivan opened this issue Feb 29, 2024 · 2 comments
Labels

Comments

@dosullivan
Copy link

Describe the bug
We have a ship on Tlon hosting that is repeatedly crashing with the following error:

~
urbit 2.12
boot: home is /pier/sampel-palnet
loom: mapped 2048MB
lite: arvo formula 2a2274c9
lite: core 4bb376f0
lite: final state 4bb376f0
loom: mapped 8192MB
boot: protected loom
live: mapped: MB/403.406.848
live: loaded: KB/16.384
boot: installed 661 jets
---------------- playback starting ----------------
pier: replaying events 2487466-2487478
pier: (2487478): play: done
---------------- playback complete ----------------
vere: checking version compatibility
ames: live on 37217 (localhost only)
conn: listening on /pier/sampel-palnet/.urb/conn.sock
http: web interface live on http://localhost:80
http: loopback live on http://localhost:12321
pier (2487490): live
eyre: [i=//http-server/0ve.8cp25/50/3 t=~] invalid outstanding connection
eyre: [i=//http-server/0ve.8cp25/50/3 t=~] invalid outstanding connection
behn: bail 1

[%poke %crud]
wire=/channel/heartbeat/1707376469-461ee2
bar-stack=[i=[i=//http-server/0ve.8cp25/50/3 t=~] t=[i=[i=//behn t=~] t=~]]
take: failed
/sys/vane/eyre/hoon:<[3.500 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.501 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.510 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.512 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.512 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.514 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.516 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.518 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.524 28].[3.524 35]>
/sys/vane/eyre/hoon:<[3.596 5].[3.637 7]>
/sys/vane/eyre/hoon:<[3.599 5].[3.637 7]>
/sys/vane/eyre/hoon:<[3.601 5].[3.637 7]>
/sys/vane/eyre/hoon:<[3.616 7].[3.620 31]>
/sys/vane/eyre/hoon:<[3.618 7].[3.620 31]>
/sys/vane/eyre/hoon:<[3.619 9].[3.619 42]>
/sys/vane/eyre/hoon:<[2.786 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.788 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.794 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.795 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.796 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.797 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.797 11].[2.802 13]>
/sys/vane/eyre/hoon:<[2.799 13].[2.801 60]>
/sys/vane/eyre/hoon:<[2.799 17].[2.799 54]>
behn: bail: %exit
behn: bail 2

[%poke %wake]
wire=/channel/heartbeat/1707376469-461ee2
bar-stack=[i=[i=//http-server/0ve.8cp25/50/3 t=~] t=[i=[i=//behn t=~] t=~]]
take: failed
/sys/vane/eyre/hoon:<[3.500 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.501 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.510 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.512 3].[3.754 5]>
/sys/vane/eyre/hoon:<[3.512 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.514 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.516 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.518 7].[3.527 9]>
/sys/vane/eyre/hoon:<[3.524 28].[3.524 35]>
/sys/vane/eyre/hoon:<[3.596 5].[3.637 7]>
/sys/vane/eyre/hoon:<[3.599 5].[3.637 7]>
/sys/vane/eyre/hoon:<[3.601 5].[3.637 7]>
/sys/vane/eyre/hoon:<[3.616 7].[3.620 31]>
/sys/vane/eyre/hoon:<[3.618 7].[3.620 31]>
/sys/vane/eyre/hoon:<[3.619 9].[3.619 42]>
/sys/vane/eyre/hoon:<[2.786 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.788 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.794 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.795 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.796 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.797 7].[2.803 71]>
/sys/vane/eyre/hoon:<[2.797 11].[2.802 13]>
/sys/vane/eyre/hoon:<[2.799 13].[2.801 60]>
/sys/vane/eyre/hoon:<[2.799 17].[2.799 54]>
behn: bail: %exit
behn: %wake event on /behn failed

behn: timer failed; queue blocked
~sampel-palnet:dojo>

In the above example, I've tried starting the ship with an 8gb loom, and with local networking only. Basically just kept increasing the loom and then tried local networking to see if it made any difference, but the result was the same each time.

To Reproduce
Steps to reproduce the behaviour:

  1. Boot the ship with any amount of loom.

Expected behaviour
Ship should successfully boot.

System (please supply the following information, if relevant):

  • OS: linux (fedora 39)
  • vere 2.12, presumably using 412
  • Unclear on base hash, since I can't boot it
@dosullivan dosullivan added the bug label Feb 29, 2024
@joemfb
Copy link
Member

joemfb commented Feb 29, 2024

%behn timers are globally ordered within a ship. If a timer event crashes, it's replaced with an error notification event routed to the same source app/vane/&c. If that timer error notification also crashes, the ship is not able to make progress through the timer queue. So if that happens 3x in a row, vere prints this message and shuts down.

The stack trace points to a data consistency assertion in %eyre (+jab:by updates a value in a map, asserting that it's present):

%+ ~(jab by session.channel-state.state) channel-id
|= =channel
channel(heartbeat (some [heartbeat-time duct]))

That code is running on both the original timer and the error notification event. The callsite shows a more typical pattern:

%timeout
?> ?=([%behn %wake *] sign)
?^ error.sign
[[duct %slip %d %flog %crud %wake u.error.sign]~ http-server-gate]
=* id i.t.t.wire
%- %+ trace:(per-server-event event-args) 1
|.("{(trip id)} cancelling channel due to timeout")
=^ moves server-state.ax
(discard-channel:by-channel:(per-server-event event-args) id &)
[moves http-server-gate]
::
%heartbeat
=/ on-channel-heartbeat
on-channel-heartbeat:by-channel:(per-server-event event-args)
=^ moves server-state.ax
(on-channel-heartbeat i.t.t.wire)
[moves http-server-gate]

For the /timeout wire, %eyre checks if the timer included an error notification, and merely logs it if it did. But for /heartbeat, it just runs the same code either way. Looks like it's been this way for a long time.

For this ship, you'll either need to inject a new version of %eyre that logs or no-ops in this case.

/cc @pkova

@ashelkovnykov
Copy link
Contributor

@joemfb We used this advice to inject a new Eyre with error printing logic for %heartbeat and were able to successfully boot both ships.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants