Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geyser drops finalized notifications #31124

Closed
linuskendall opened this issue Apr 10, 2023 · 9 comments
Closed

Geyser drops finalized notifications #31124

linuskendall opened this issue Apr 10, 2023 · 9 comments
Assignees
Labels
community Community contribution

Comments

@linuskendall
Copy link

linuskendall commented Apr 10, 2023

Problem

It seems that quite often, Geyser will drop the 'Finalized' slot notification. It seems more commonly to happen around skipped slots, but we have found examples where it was not. An example is illustrated below:

$ journalctl --user-unit solana-rpc |grep geyser_pg | grep -E '(18759483[2-5]|187594840)'
Apr 10 08:41:07 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:07.482695590Z DEBUG geyser_pg::postgres] Message: Account(false 187594832 SysvarS1otHashes111111111111111111111111111 667609360988)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.140092059Z DEBUG geyser_pg::postgres] Message: Slot(187594832 Processed)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.140123559Z INFO  geyser_pg::postgres] Processing new slot 187594832, with 60 accounts
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.159428777Z DEBUG geyser_pg::postgres] Message: Account(false 187594833 SysvarS1otHashes111111111111111111111111111 667609370319)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.490537309Z DEBUG geyser_pg::postgres] Message: Slot(187594833 Processed)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.490563639Z INFO  geyser_pg::postgres] Processing new slot 187594833, with 38 accounts
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.517442426Z DEBUG geyser_pg::postgres] Message: Account(false 187594834 SysvarS1otHashes111111111111111111111111111 667609374246)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.537672355Z DEBUG geyser_pg::postgres] Message: Slot(187594832 Confirmed)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.852584298Z DEBUG geyser_pg::postgres] Message: Slot(187594833 Confirmed)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.887478755Z DEBUG geyser_pg::postgres] Message: Slot(187594834 Processed)
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.887516387Z INFO  geyser_pg::postgres] Processing new slot 187594834, with 45 accounts
Apr 10 08:41:08 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:08.911527730Z DEBUG geyser_pg::postgres] Message: Account(false 187594835 SysvarS1otHashes111111111111111111111111111 667609380818)
Apr 10 08:41:09 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:09.236842372Z DEBUG geyser_pg::postgres] Message: Slot(187594835 Processed)
Apr 10 08:41:09 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:09.236876858Z INFO  geyser_pg::postgres] Processing new slot 187594835, with 61 accounts
Apr 10 08:41:10 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:10.667922211Z DEBUG geyser_pg::postgres] Message: Account(false 187594840 SysvarS1otHashes111111111111111111111111111 667609383434)
Apr 10 08:41:10 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:10.971665249Z DEBUG geyser_pg::postgres] Message: Account(false 187594840 PinYvHqMTZVrRTpwK9x3dB9vL7tsGtGedSz8EqeynuA 667609384855)
Apr 10 08:41:11 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:11.262569677Z DEBUG geyser_pg::postgres] Message: Account(false 187594840 AuQkCWHMUxPfM3tTGv8JySTaSSqcsN73vu5kYyidoUuh 667609388919)
Apr 10 08:41:11 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:11.305351858Z DEBUG geyser_pg::postgres] Message: Slot(187594840 Processed)
Apr 10 08:41:11 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:11.305382768Z INFO  geyser_pg::postgres] Processing new slot 187594840, with 123 accounts
Apr 10 08:41:11 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:11.389888160Z DEBUG geyser_pg::postgres] Message: Slot(187594834 Confirmed)
Apr 10 08:41:11 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:11.389904822Z DEBUG geyser_pg::postgres] Message: Slot(187594835 Confirmed)
Apr 10 08:41:12 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:12.159304081Z DEBUG geyser_pg::postgres] Message: Slot(187594840 Confirmed)
Apr 10 08:41:23 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:23.822562681Z DEBUG geyser_pg::postgres] Message: Slot(187594832 Finalized)
Apr 10 08:41:23 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:23.848824468Z DEBUG geyser_pg::postgres] Update finalized to 187594832 in 1.942ms
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.230460632Z DEBUG geyser_pg::postgres] Message: Slot(187594833 Finalized)
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.249380547Z DEBUG geyser_pg::postgres] Update finalized to 187594833 in 1.236ms
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.544539330Z DEBUG geyser_pg::postgres] Message: Slot(187594840 Finalized)
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.590291031Z DEBUG geyser_pg::postgres] Update finalized to 187594834 in 1.541ms
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.591646081Z DEBUG geyser_pg::postgres] Update finalized to 187594835 in 1.337ms
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.595147365Z DEBUG geyser_pg::postgres] Update finalized to 187594840 in 3.379ms

Slot 187594840 has parent 187594835. However, 187594835 and 187594834 never saw slot notifications for finalization. Only for processed and confirmed. However, slot 187594840 saw the finalized slot notification correctly.

This is different from #28871 since it happens during general runtime, not just at start up.

Proposed Solution

This is possible to work around in the geyser plugin (just assume that the 'tree' below the last finalized slot is also finalized), but it leads to delays in ingesting state since we cannot know that the tree is finalized until 187594840 is finalized. Therefore it would be good to identify why geyser doesn't notify the finalized state of the slots in between.

@linuskendall linuskendall added the community Community contribution label Apr 10, 2023
@lijunwangs
Copy link
Contributor

@linuskendall Have you seen any messages in the log files like the following?

"bank_notification_sender failed: ..."

@linuskendall
Copy link
Author

linuskendall commented Apr 10, 2023 via email

@lijunwangs
Copy link
Contributor

That message is from

https://github.com/solana-labs/solana/blob/master/core/src/replay_stage.rs#L2062

When we failed to send a notification. It is logged at WRN level.

Do you actually see the following message in the log file for the slot?

"new root "

This is logged at INFO level.

@linuskendall
Copy link
Author

linuskendall commented Apr 11, 2023

In the example above we did not get "new root" for the missing finalized slot notifications (34/35):

Apr 10 08:41:23 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:23.822550678Z INFO  solana_core::replay_stage] new root 187594832
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.230445973Z INFO  solana_core::replay_stage] new root 187594833
Apr 10 08:41:24 pit54 solana-rpc.sh[25574]: [2023-04-10T08:41:24.544527427Z INFO  solana_core::replay_stage] new root 187594840

@lijunwangs
Copy link
Contributor

Okay -- that is why geyser did not see the notifications because the replay_stage did not see them when replaying the banks. Did you see in your slot table that the slot 187594840 has a parent 187594835? That can be used to auto mark 187594835 as root.

@linuskendall
Copy link
Author

linuskendall commented Apr 11, 2023 via email

@lijunwangs
Copy link
Contributor

I am investigating why replay_stage is missing the root events for these slots and see if any workaround in the plugin framework.

@lijunwangs lijunwangs self-assigned this Apr 12, 2023
@lijunwangs
Copy link
Contributor

This is due to the same root cause as #19040 and will be addressed together

@lijunwangs
Copy link
Contributor

This is fixed by #31124

lijunwangs pushed a commit that referenced this issue May 22, 2023
…backport of #31180) (#31650)

Problem

It is reported that Geyser is missing some Root notifications for slots. #31124
The Root notification is sent from replay_stage's code in handle_votable_bank. https://github.com/solana-labs/solana/blob/master/core/src/replay_stage.rs#L1981. However, the validator does not necessarily vote on every slot on the rooted chain. From @carllin

For instance if the rooted chain is 1->2->3->4

You might only vote on 1 and 4
But when 4 is rooted, 2 is also rooted
But handle_votavle_bank is not called on 2

As result of this, we may miss notifications for slot 2 and 3.

Summary of Changes

Enhanced BankNotification to add NewRootedChain enum to send the chains of parent roots.
Renamed BankNotification::Root -> BankNotification::NewRootBank
Introduced SlotNotification for SlotStatusObserver interfaces to send slot status without Bank.
In the OptimisticallyConfirmedBankTracker notify parents of a new root if these parents were not notified.
Modified and added unit test cases to verify the logic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Community contribution
Projects
None yet
Development

No branches or pull requests

2 participants