Mainnet stuck on 2020-12-04 #13958

Closed

leoluk opened this issue Dec 4, 2020 · 11 comments

@leoluk
Contributor

leoluk commented Dec 4, 2020

Edit: People are linking to this issue claiming that "Solana is not BFT" and a single faulty leader can take it down, or that this is somehow a design flaw. This is false - Solana has leaders, but 1/3+ of them have to be faulty to halt the network. People have even been running malicious fuzzer nodes on the Tour de SOL network that flipped random bits in blocks they produced. What happened here is that a single misbehaving leader triggered a bug that broke consensus, nothing more, nothing less.

(Not an official statement, and I'm not a Solana team member - Certus One runs a Solana validator and we were part of this incident investigation - yay decentralization!)


Mainnet is stuck on root slot 53180903, at 2020-12-04 13:45:45 UTC.

Snapshot of delinquency graph: https://snapshot.raintank.io/dashboard/snapshot/KqeO9cF1pcC5VAfutkGrE4oJ2rroDKx7?orgId=2

Please upload your logs to help with the investigation:

ADDRESS=<your node ID>
TZ=UTC journalctl -u solana.service -o cat --utc --since '2020-12-04 07:00' --until '2020-12-04 16:00' | xz | curl --data-binary @- "http://logbin.certus.one:8080/solana-github-13958/$ADDRESS.log.xz"

For plain logfiles:

ADDRESS=<your node ID>
xz < your_logfile.log | curl --data-binary @- "http://logbin.certus.one:8080/solana-github-13958/$ADDRESS.log.xz"

Validator snapshot: https://gist.github.com/leoluk/9bd2a3a2eb2ec9f0ceb9ba3476ada154

Please do not copy the commands to Discord - refer to this issue instead.

Upload destination: http://logbin.certus.one:8080/Qui9iehu/solana-github-13958

(no need to reply on this issue once you have uploaded your logs)

@ryoqun
Member

ryoqun commented Dec 4, 2020

Current most suspicious log:

[2020-12-04T13:45:02.411236229Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
[2020-12-04T13:45:02.411270019Z WARN  solana_ledger::blockstore] Slot: 53180936, shred index: 44, set_index: 44, is_duplicate: true, stored config: ErasureConfig {
        num_data: 2,
        num_coding: 2,
    }, new config: ErasureConfig {
        num_data: 1,
        num_coding: 1,
    }
[2020-12-04T13:45:02.413192505Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
[2020-12-04T13:45:02.413213360Z WARN  solana_ledger::blockstore] Slot: 53180936, shred index: 35, set_index: 34, is_duplicate: true, stored config: ErasureConfig {
        num_data: 1,
        num_coding: 1,
    }, new config: ErasureConfig {
        num_data: 3,
        num_coding: 3,
    }
[2020-12-04T13:45:02.421624025Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
[2020-12-04T13:45:02.421647243Z WARN  solana_ledger::blockstore] Slot: 53180936, shred index: 47, set_index: 47, is_duplicate: true, stored config: ErasureConfig {
        num_data: 2,
        num_coding: 2,
    }, new config: ErasureConfig {
        num_data: 1,
        num_coding: 1,
    }
$ grep 'Received multiple erasure configs for the same erasure set' solana-validator.log.2 solana-validator.log.1 solana-validator.log
solana-validator.log:[2020-12-04T13:45:02.411236229Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.413192505Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.421624025Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.421973566Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.443032907Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.456852518Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.463331555Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.470160359Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.470242892Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.470260008Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.470292109Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.470521196Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.473408783Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.478121091Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.484564531Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.486686495Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.486980328Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
solana-validator.log:[2020-12-04T13:45:02.490208229Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
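
For anyone triaging their own uploads, a rough extension of the grep above that also pulls the slot and erasure set out of the detail lines (illustrative only; it assumes the same plain logfile layout as the excerpt, and the sed pattern is mine rather than part of any Solana tooling):

# illustrative: count conflicting erasure configs per (slot, set_index)
grep -h 'is_duplicate: true' solana-validator.log* \
  | sed 's/.*Slot: \([0-9]*\), shred index: [0-9]*, set_index: \([0-9]*\),.*/\1 \2/' \
  | sort | uniq -c | sort -rn | head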

@jstarry
Member

jstarry commented Dec 4, 2020

Pulled from ba1:

[2020-12-04T13:45:02.494678140Z WARN  solana_core::replay_stage] Fatal replay error in slot: 53180936, err: InvalidBlock(InvalidTickHashCount)

EDIT: expanded logs

[2020-12-04T13:45:02.494602065Z WARN  solana_ledger::entry] invalid tick hash count!: entry: Entry {
        num_hashes: 1304,
        hash: Bgwhb2D8DsTCAbMDbuQ1asZVDRFSaR4b41MeXrCu9eet,
        transactions: [],
    }, tick_hash_count: 11817, hashes_per_tick: 12500
[2020-12-04T13:45:02.494628514Z WARN  solana_ledger::blockstore_processor] Tick with invalid number of hashes found in slot: 53180936
[2020-12-04T13:45:02.494633360Z WARN  solana_ledger::blockstore_processor] InvalidTickHashCount, slot: 53180936, entry len: 22, tick_height: 3403579908, last entry: 6a6mww8CSxmPHTnUR4G1WBvorqpFj7JJ3Log33uSzxmf, last_blockhash: GqkGnC5248jhTQGK7frgQTFrbacpSyFrbHMZ1yB7xx9j, shred_index: 17, slot_full: false
[2020-12-04T13:45:02.494678140Z WARN  solana_core::replay_stage] Fatal replay error in slot: 53180936, err: InvalidBlock(InvalidTickHashCount)
[2020-12-04T13:45:02.494717213Z ERROR solana_metrics::metrics] datapoint: replay-stage-mark_dead_slot error="error: InvalidBlock(InvalidTickHashCount)" slot=53180936i
[2020-12-04T13:45:02.496736479Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
[2020-12-04T13:45:02.496751754Z WARN  solana_ledger::blockstore] Slot: 53180936, shred index: 69, set_index: 69, is_duplicate: true, stored config: ErasureConfig {
        num_data: 1,
        num_coding: 1,
    }, new config: ErasureConfig {
        num_data: 3,
        num_coding: 3,
    }
[2020-12-04T13:45:02.502411230Z WARN  solana_ledger::blockstore] Received multiple erasure configs for the same erasure set!!!
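
Reading the warning above: the tick entry accounts for 11817 hashes where hashes_per_tick is 12500, so replay flags InvalidTickHashCount and marks slot 53180936 dead. A rough one-liner (illustrative only, not official tooling) to pull the same detail out of other uploaded logs:

# illustrative: show tick_hash_count vs hashes_per_tick for each invalid-tick warning
grep -h -A 5 'invalid tick hash count' solana-validator.log* | grep 'tick_hash_count'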

@leoluk
Contributor Author

leoluk commented Dec 4, 2020

solana_core::replay_stage] LEADER CHANGE at slot: 53180928 leader: 5XKJwdKB2Hs7pkEXzifAysjSk6q7Rt6k5KfHwmAMPtoQ
solana_core::replay_stage] LEADER CHANGE at slot: 53180948 leader: DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP

5XKJwdKB2Hs7pkEXzifAysjSk6q7Rt6k5KfHwmAMPtoQ is Moonlet.io

@bennofs
Contributor

bennofs commented Dec 4, 2020

$ solana -u https://api.mainnet-beta.solana.com leader-schedule | grep 53180936
  53180936        HoXANZnWTGeePertqWkMEnnhgXjTjzpfWaT2kja2ZgVU
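
Illustrative follow-up, assuming the epoch containing that slot is still the current one and the usual four consecutive slots per leader: show the rest of that leader rotation too.

solana -u https://api.mainnet-beta.solana.com leader-schedule | grep -A 3 '53180936'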

@leoluk
Contributor Author

leoluk commented Dec 4, 2020

HoXANZnWTGeePertqWkMEnnhgXjTjzpfWaT2kja2ZgVU is GRom81. Perhaps the leader change never showed up in the log because of the dead slot?

@sakridge
Member

sakridge commented Dec 4, 2020

Graph from one of our nodes that marked 53180936 as dead:

graph.pdf

89% stake on 53180935

@jstarry
Member

jstarry commented Dec 4, 2020

I decided to investigate slot 53183440 because we have logs from dokia.

Is it normal to send repair requests for your own slot?

[2020-12-04T14:06:57.909529823Z INFO  solana_core::replay_stage] new fork:53183440 parent:53183439 (leader) root:53180905
[2020-12-04T14:06:57.909535094Z INFO  solana_metrics::metrics] datapoint: replay_stage-new_leader slot=53183440i leader="Dokia75SVtetShgapUBoVFfYjL99fQyr1twxKKyTZKa3"
[2020-12-04T14:06:57.909545197Z INFO  solana_metrics::metrics] datapoint: replay_stage-my_leader_slot slot=53183440i
[2020-12-04T14:06:57.909701848Z INFO  solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot_height=53183440i block_height=50247787i
[2020-12-04T14:06:57.911065494Z INFO  solana_core::replay_stage] LEADER CHANGE at slot: 53183440 leader: Dokia75SVtetShgapUBoVFfYjL99fQyr1twxKKyTZKa3. I am now the leader
[2020-12-04T14:06:58.153860311Z INFO  solana_core::repair_service] repair_stats: RepairStats { shred: RepairStatsGroup { count: 8, min: 0, max: 53184027, slot_pubkeys: { <SHORTENED> highest_shred: RepairStatsGroup { count: 30, min: 0, max: 53184027, slot_pubkeys: {53183440: SlotRepairs { highest_shred_index: 40, pubkey_repairs: {5p8qKVyKthA9DUb1rwQDzjcmTkaZdwN97J3LiaEywUjd: 1, Awes4Tr6TX8JDzEhCZY2QVNimT6iD1zWHzf1vNyGvpLM: 1} },
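
A rough way to check the same thing in other uploaded logs (illustrative only; the pattern just matches the repair_stats format shown above):

# illustrative: count repair_stats lines mentioning the node's own leader slot
grep -h 'repair_stats' solana-validator.log* | grep -c '53183440: SlotRepairs'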

mvines added this to the v1.3.22 milestone Dec 4, 2020
solana-labs deleted a comment from jstarry Dec 4, 2020
@mvines
Member

mvines commented Dec 14, 2020

This issue has now run its course, with the resulting fixes and improvements tracked elsewhere.

mvines closed this as completed Dec 14, 2020
@mankinskin

Is there any info about what the bug was?

@t-nelson
Contributor

Hi @mankinskin! The postmortem blog post is here: https://medium.com/solana-labs/mainnet-beta-stall-postmortem-ba0c6064e3

@mankinskin

Thanks!

github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2022