Make gossip synchronization on bootstrap more robust #2866

Closed
tgrabiec opened this issue Oct 5, 2017 · 4 comments

Comments

@tgrabiec
Contributor

commented Oct 5, 2017

Currently, bootstrap simply waits for ring_delay (30 seconds).

If processing the gossip state takes longer than that, e.g. because of the number of nodes in the cluster, the new node will start bootstrapping before it has finished processing all the nodes, leading to data loss. Refs #2855.

The wait could be improved by waiting, in addition to the fixed sleep, for:

  • the endpoint state map to stop growing
  • all received but not yet processed gossip messages to finish processing
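
The two conditions above can be sketched roughly as follows. This is a minimal illustration in Python, not Scylla's actual implementation: `wait_for_gossip_to_settle_sketch`, `endpoint_count`, `pending_messages`, and all parameters other than the 30-second default are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable


def wait_for_gossip_to_settle_sketch(
    endpoint_count: Callable[[], int],    # hypothetical: size of the endpoint state map
    pending_messages: Callable[[], int],  # hypothetical: received but unprocessed gossip messages
    min_wait: float = 30.0,               # the fixed sleep (ring_delay in the issue)
    poll_interval: float = 1.0,           # assumed polling period
    required_stable_polls: int = 3,       # assumed number of quiet polls before proceeding
    max_wait: float = 300.0,              # assumed upper bound so the wait cannot hang forever
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once gossip looks settled, False if max_wait elapsed first."""
    start = clock()
    sleep(min_wait)  # keep the existing fixed wait
    last_count = endpoint_count()
    stable_polls = 0
    while clock() - start < max_wait:
        sleep(poll_interval)
        count = endpoint_count()
        # Settled means: the map did not grow since the last poll
        # and no received message is still waiting to be processed.
        if count == last_count and pending_messages() == 0:
            stable_polls += 1
            if stable_polls >= required_stable_polls:
                return True
        else:
            stable_polls = 0
        last_count = count
    return False
```

Requiring several consecutive quiet polls, rather than one, is an assumption meant to avoid proceeding during a brief pause between gossip rounds.
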
@tzach

Contributor

commented Oct 5, 2017

@tgrabiec maybe a new node should ask, and receive, more info on its first round of gossip.
If it gets the full cluster data (like in a shadow round), it will know there are nonresponsive nodes.

@tgrabiec

Contributor Author

commented Oct 5, 2017

We already do a shadow round with the seeds, so the new node will learn all they know.

@tgrabiec tgrabiec self-assigned this Oct 5, 2017

@tgrabiec tgrabiec added the gossip label Oct 16, 2017

@slivne slivne assigned elcallio and unassigned tgrabiec Oct 29, 2017

@tzach tzach added this to the 2.x milestone Nov 13, 2017

@elcallio

Contributor

commented Nov 20, 2017

Is that not basically asking for having "wait_for_gossip_to_settle()" moved to the storage_service::bootstrap() code (or called there as well mayhap, since the waiting part is conditional in bootstrap)? Plus maybe adding some message state awareness to it?

@tgrabiec

Contributor Author

commented Nov 20, 2017

Something like that, yes.

@avikivity avikivity closed this in 8af0b50 Dec 12, 2017

avikivity added a commit that referenced this issue Dec 12, 2017

Merge "Gossip propagation and stabilization" from Calle
"Fixes #2866
Fixes #2894

Changes gossip propagation to allow "atomic" grouping of values to ensure
their respective order.
Modifies gossip bootstrap startup to potentially wait longer in cases
where stabilization (messages done) takes time, to avoid data loss
in repair."

* 'calle/gossip' of github.com:scylladb/seastar-dev:
  gossip: wait for stabilized gossip on bootstrap
  gossiper: Prevent race condition in propagation
  utils::to_string: Add printers for pairs+maps
  utils::in: Add helper type for perfect forwarding initializer lists