Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Follower Snapshot #135

Open
BusyJay opened this issue Oct 30, 2018 · 7 comments
Open

RFC: Follower Snapshot #135

BusyJay opened this issue Oct 30, 2018 · 7 comments
Labels
Feature Related to a major feature. Optimization Performance related optimizations. Request for Comment A proposal to be considered. Analogous to an RFC in TiKV/Rust.

Comments

@BusyJay
Copy link
Member

BusyJay commented Oct 30, 2018

For now snapshot is always sent from leader to follower, which is not always sufficient. For example, consider there are 5 nodes in two data center, (1, 2) and (3, 4, 5). If 5 is leader and 1 needs a snapshot, then data have to be transferred across data center.

But in fact any nodes in cluster can send a snapshot once requested logs are applied. So it's possible that 1 requests a snapshot aggressively from 2, so that data can be transferred internally.

Support requesting snapshot aggressively is also useful for recovery from snapshot files corruption.

@siddontang
Copy link
Contributor

We can name this feature Follower snapshot :-)

Maybe we can even support Follower replication 🤔

@Hoverbear Hoverbear added Feature Related to a major feature. Optimization Performance related optimizations. Request for Comment A proposal to be considered. Analogous to an RFC in TiKV/Rust. labels Oct 31, 2018
@Hoverbear Hoverbear changed the title Support requesting snapshot aggressively RFC: Support requesting snapshot aggressively Oct 31, 2018
@Hoverbear
Copy link
Contributor

We've previously talked about how this feature would be a desirable feature for our use case in TiKV, and I think it can be useful for others as well.

I'd definitely love to do this, but if anyone else wants to tackle it please feel encouraged to let us know and start it!

It sounds like the generalized, simplest definition of the feature is:

The ability to request snapshots from any node to any other node, so long as the snapshot only contains committed data.

I think doing this would open the door for a second, future feature to support follower replication. However this feature will require some more consideration. Let's plan for that after this one? I can open another issue for it, so we can separate the discussion

@Hoverbear Hoverbear changed the title RFC: Support requesting snapshot aggressively RFC: Follower Snapshot Nov 7, 2018
@Fullstop000
Copy link
Member

I'd like to work on this and I came up with a few questions which may need to be discussed probably:

  • If a node tries to get snapshot from an another node, should the node ignore msgs from leader (even heartbeat) to avoid log replication ?
  • Do we need a timeout mechanism for this kind of snapshot request if the target node can not deliver message to the requester ?

@siddontang
Copy link
Contributor

Thanks @Fullstop000

  1. In the current Raft implementation, if the leader is sending the snapshot to the follower, it can still send heartbeat messages to it to keep alive, so I think we still need to let leader do this.
  2. Now there is no timeout mechanism for snapshot sending, we will handle this outside the Raft library.

@siddontang
Copy link
Contributor

@Fullstop000
You can send your Wechat account to my email tl@pingcap.com if you want to a real-time discussion.

@Fullstop000
Copy link
Member

@siddontang Thanks for the quick reply.

Considering about the point of efficiency @BusyJay mentioned here, in the situation of 2 IDCs, once leader has applied the AddNode , it'll start communication with the new node which can be less efficient to send snapshot between IDC than internal transporting and will increase working load to the leader.

We can save 1.5 roundtrip and crossing IDC data sending from leader if node ignores the msgs but it may introduce some extra complexity into raft algorithm.

I prefer to keep raft layer clean and let third party ( such as pd ? ) to do the control generally.

@betwins
Copy link

betwins commented Mar 5, 2022

I agree with siddontang, it is better to keep raft layer clean. raft is complicated and is critical to protect data consistency across nodes. one new mechanism introduced should be ensured not to break raft protocol.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Related to a major feature. Optimization Performance related optimizations. Request for Comment A proposal to be considered. Analogous to an RFC in TiKV/Rust.
Projects
None yet
Development

No branches or pull requests

5 participants