Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ames: crash on %strange-current #6912

Open
yosoyubik opened this issue Feb 15, 2024 · 0 comments
Open

ames: crash on %strange-current #6912

yosoyubik opened this issue Feb 15, 2024 · 0 comments
Labels

Comments

@yosoyubik
Copy link
Contributor

yosoyubik commented Feb 15, 2024

~nattyv is crashing on a message ack coming from ~roller-foster:

Screenshot 2024-02-13 at 2 48 51 PM

The state of the flow that ~nattyv has with ~roller-foster looks like this:

Screenshot 2024-02-15 at 10 33 52 AM

Not shown on the debug dashboard but ~nattyv has unsent fragments from message-num, in this case the last one (num-fragment=2).

The crash happens because when we receive the message ack, we clear the live packets (num-fragment=0, num-fragment=1) for that message from the live queue, increase current since it's been acked, and take the unsent last fragment, to be enqueued for sending it, putting it into the live packets. This is what causes the crash, since we have just acked message-num=2, but we still have live packets.

The main question is why are we receiving a message ack when we have not sent the last fragment of the message—this would be a violation of %ames semantics since only whole messages are acked at that level.

This could be ~roller-foster's %ames not working properly but there's another data point on ~nattyv:

Screenshot 2024-02-15 at 10 50 44 AM

This is a flow coming from ~bitter-bitduc that shows that no packets have been acked or heard, but somehow we have enqueued message number 2. That message has not been acked, so ~bitter-bitduc is trying to resend it but ~nattyv no-ops because the fragment has been already heard.

We have the flow from ~bitter-bitduc that shows that message-num=1 has been acked (current=2):

Screenshot 2024-02-15 at 10 50 36 AM

This should be reflected on ~nattyv with a last-acked/last-heard = 1 but this is not the case.

Something that could point to the reason why ~nattyv seems to have lost state is a double boot, or at least booting an older version of the event log before the arrival and ack of message number 1, but after it was formally acked. ~bitter-bitduc is acting normally but ~nattyv is not, when message-num=2 comes into ~nattyv it creates a new flow from scratch (last-acked/last-heard = 0) and enqueues the fragment into partial-rcv-message and becuase its sequence number = 2 is not the one we expect +(last-heard.state) == 1 , we no-op.

As a last datapoint, ~nattyv has several of these (last-acked/last-heard = 0) flows with a number ships.

PS: and one more, ~nattyv has a lot of %ping ducts, which happen when the binary restarts, which creates a new flow for pinging its sponsor.

@yosoyubik yosoyubik added the bug label Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant