Add specification for a new cluster implementation. #10875

Open · wants to merge 10 commits into base: unstable

Conversation

@ushachar (Contributor) commented Jun 17, 2022

The new approach (Fully outlined in CLUSTER.md) has been reviewed by the core Redis team and is now ready for a wider community review.
It is proposed in order to:

  • Provide predictable behavior & correctness guarantees across the various possible failure modes
  • Create a minimal, well-defined internal interface to the clustering code so alternative approaches can easily replace the
    entire clustering functionality at compile time.
  • Support a significantly higher number of shards
  • Significantly reduce failover time to under 10 seconds (at worst) and under 5 seconds in some configurations
  • Allow for alternative consistency mode of operation
  • Long term: Support a clustered, but un-sharded deployment mode with no cross-slot limitations, which will allow us to deprecate all Sentinel code
  • Long term: Enable centralized management for global configuration settings & Redis Functions

This new approach:

  • Is based on a two-tier, strongly consistent, consensus-based control plane
  • Maintains full backwards compatibility with data path Cluster commands (MOVED, NODES, SLOTS, ...), but not with
    administrative commands (SETSLOT, FORGET, MEET, ...)
  • Has a consistent source of truth for all cluster metadata
  • Requires the shards to reconcile their state according to the above source
  • Maintains the ability to scale down to a bare minimum of 3 redis-server processes
  • Is called Flotilla :)

And it is intended to fully replace the current clustering implementation after a deprecation period in which the two
implementations will live side by side, with a compile-time flag to choose between the two.
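
For illustration only, a minimal sketch of what the compile-time selection could look like; the USE_FLOTILLA flag, header names and init functions below are hypothetical, not part of this spec:

```c
/* Illustrative only: one possible shape for the compile-time switch.
 * The USE_FLOTILLA flag and the *ClusterInit() names are hypothetical. */
#ifdef USE_FLOTILLA
#include "cluster_flotilla.h"   /* TD/FC based control plane */
#else
#include "cluster_legacy.h"     /* current gossip-based implementation */
#endif

void clusterInit(void) {
#ifdef USE_FLOTILLA
    flotillaClusterInit();      /* register with the Failover Coordinator, etc. */
#else
    legacyClusterInit();        /* start the gossip bus, load nodes.conf */
#endif
}
```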

Currently, drawings are placed under /images (both source and rendered JPGs).

@madolson (Contributor) left a comment:

HLD still seems good to me, just commenting about the details. I haven't had much time to really think about it until today.

- Has a consistent source of truth for all cluster metadata
- Requires the shards to reconcile their state according to the above source
- Maintains the ability to scale down to a bare minimum 3 redis-server processes
- Is called Flotilla :)
Contributor:

Can we drop the flotilla name now? I think it will just cause confusion moving forward.

@@ -0,0 +1,450 @@
# Redis Cluster v2 Proposal
@madolson (Contributor) commented Jun 18, 2022:

Still no conversation about how to keep Functions consistent across the cluster. I think that is still an important item to try to solve. It doesn't sound all that difficult to store it in the TDs, but it will require interleaving functions with replication traffic. (Mixing CP and DP functionality, which isn't great)

@ushachar (Contributor, Author):

This is a difficult one.
If we perceive functions (and probably Lua as well) as configuration, then it makes sense to use the TD/FC layers to distribute them.

It might be worth getting some actual usage stats on Lua (I doubt there's enough on Functions atm) - would current usage easily lend itself to a 'configuration' pattern of infrequent changes?

@PingXie (Contributor) commented Jun 21, 2022:

I don't think Functions are an essential part of the cluster V2 design. Given that Functions are non-sharded user data, I feel they warrant their own host servers so they can scale independently. A container-registry-like design could be an option IMO.

Member:

Unlike EVAL (which is the user's app responsibility), we should think of Functions as lightweight Modules.
They extend the API of the database and should be propagated to all nodes at function registration time, and later when new nodes join.
Unlike EVAL, the user's app shouldn't be the one that loads them, but rather the admin, so either the admin tools or the cluster itself must be responsible for making sure they exist on all nodes.
I suppose the difference between Functions and Modules in that respect is that for Modules we don't want to offer any code to distribute the binaries to different hosts and load them (we should leave that to some orchestration system that deploys the software and creates the config files), but for Functions we want a lower entry barrier.
We need to think about this more, and when we do, let's consider getting ACL into that mix too.

Contributor:

> Given that Functions are non-sharded user data, I feel they warrant their own host servers so they can scale independently.

I think this was just a choice made out of practicality. We wanted Functions in 7, but we didn't really have a good way to expose "non-sharded" Redis data outside of asking users to apply them onto all nodes. I think if we had time, we would have tried to solve this problem, but it's hard. I think cluster v2 should at least be able to solve this problem.

Contributor:

I think propagating by the control path, in this case, is probably what we want. In a lot of ways it is like an external operator going in and calling a command on every node in the cluster and making sure it applied correctly. The expectation should be that functions are low throughput. I would even posit we could be aggressive and say you can only propagate one function at a time, just to artificially slow down the rate of propagation.
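
A rough sketch of that operator-style, one-at-a-time propagation, assuming hypothetical apply/verify callbacks supplied by the control plane (nothing here is an existing Redis API):

```c
/* Illustrative only: sequential, control-path propagation of functions. */
typedef struct { const char *name; const char *code; } functionDef;

/* Callbacks the control plane would supply (hypothetical); each returns 0 on success. */
typedef int (*applyFn)(int node_id, const functionDef *fn);
typedef int (*verifyFn)(int node_id, const functionDef *fn);

/* One function at a time, verified on every node before moving on. This
 * deliberately throttles propagation and keeps failure handling simple:
 * stop at the first node that cannot apply or confirm the function. */
int propagateFunctions(const functionDef *fns, int num_fns,
                       const int *node_ids, int num_nodes,
                       applyFn apply, verifyFn verify) {
    for (int f = 0; f < num_fns; f++) {
        for (int n = 0; n < num_nodes; n++) {
            if (apply(node_ids[n], &fns[f]) != 0) return -1;
            if (verify(node_ids[n], &fns[f]) != 0) return -1;
        }
    }
    return 0;
}
```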

@ushachar (Contributor, Author):

Treating them as configuration makes a lot of sense to me, but since we don't have sufficient knowledge on how functions will be used in the wild - maybe defer the decision?
In the next couple of months we may see a more 'lua-ish' usage which would then completely break once we GA Flotilla.

Member:

Maybe we could defer that decision, but I'd like to point out two things:

First, we should be the ones to decide how Functions are used in the wild, and adding mechanisms around their propagation could assist. Standing aside to see what users would do is likely to promote misuse.

Of course, in the context of Flotilla the above statement is not relevant since it's not yet available to users, so indeed we can take the time. (I just like to avoid taking a similar approach elsewhere.)

Secondly, arguably sooner or later we'll need to propagate some (non-cluster-related) configuration (be it ACL, Functions or something else) through the cluster, so I think it may be better to include this topic in the design at an early stage.

@ushachar (Contributor, Author):

Agree wrt ACLs and other configs - will add to the design.

@ushachar (Contributor, Author):

+1 on adding a section for propagating configuration in general.

@zuiderkwast (Contributor) left a comment:

Very interesting!!!

I read the rendered doc (... -> View file) and spotted some Markdown formatting issues.

ushachar and others added 3 commits on June 19, 2022 at 21:44:

  • Elaborate on fencing behavior (Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>)
  • Fix typo (Co-authored-by: Viktor Söderqvist <viktor@zuiderkwast.se>)
  • Typo fix (Co-authored-by: Tian <skylypig@gmail.com>)
@hwware (Collaborator) commented Jun 20, 2022:

I just took a quick look at the new design for cluster V2, and I need to take some time to digest it. But I have one question: it looks like no concept or plan for supporting multi-tenancy or multiple users is mentioned. Could someone confirm this? Or is there no plan to support multi-tenancy in the open source version? Thanks

@ushachar (Contributor, Author):

> I just took a quick look at the new design for cluster V2, and I need to take some time to digest it. But I have one question: it looks like no concept or plan for supporting multi-tenancy or multiple users is mentioned. Could someone confirm this? Or is there no plan to support multi-tenancy in the open source version? Thanks

Redis already supports multiple users, and the new design does not alter that.

If by multi-tenancy you mean multi-database (with the SELECT command), that's only supported with unclustered Redis, and this new spec does not talk about adding it to clustered mode at the moment.

@hwware (Collaborator) commented Jun 20, 2022:

Thanks for the clarification.

@oranagra (Member):

> If by multi-tenancy you mean multi-database (with the SELECT command), that's only supported with unclustered Redis, and this new spec does not talk about adding it to clustered mode at the moment.

We do intend for this system to support un-sharded configurations though, in which we could / would support multiple databases and SELECT (and terminate Sentinel some day).
@ushachar please ack, and see if the text needs any addition to make it clear.

@uvletter (Contributor):

After a rough read, the point that concerns me most is that the two-tier architecture seems too flexible: users may find it hard to understand and may abuse the deployment options. Five deployment modes are provided, but users may have difficulty choosing which one is best for their scenario. It also adds complexity to the implementation; for instance, we would have to support transitions between the five modes, which means 4*5=20 cases to consider. Maybe some cases can be eliminated, but it is still a substantial number. Besides, some optimizations may only apply to certain deployment modes; for example, when the FC and TD are deployed together, the Raft group could be shared between the two modules, but not otherwise. So whether we should do this optimization is unclear, since it is only available for some deployments.

If we put the FC and TD in one deployment unit by design, there would be far fewer variants. Only two deployment modes would then be exposed to users: Native and Midrange. For beginners, for a fast setup, or for small clusters, Native is sufficient, while for advanced users Midrange could be the better choice; there is no puzzle about which one to take. With Raft membership and configuration changes, we can easily implement the transition between the two modes, as well as add or remove a control plane node.

@hwware (Collaborator) commented Aug 3, 2022:

@ushachar I reviewed all the discussions and the description of the cluster v2 spec. I have one concern (or, you could see it as a potential bonus for this design).
In Redis non-cluster mode, multiple databases are supported, so at the business level multiple users can read and write the same Redis node instance through separate databases (db0 for user1, db1 for user2, db2 for user3, ...).
But in the current Redis cluster design, and in the new cluster design, it looks like only one database is still supported. Multiple users cannot save their business data in the same Redis instance, or even if they could, there is no way to distinguish each user's data. (You can think of this as a form of multi-tenancy within a single-database cluster.)

Thus, one blocker arises when clients want to migrate data from non-cluster mode to cluster mode: for keys with the same name belonging to two different users, there is no clean way to store them in a cluster-mode Redis instance. It creates trouble.

I think one way to support multi-tenancy is by using namespaces or a similar concept.
Do you think the new cluster could add this concept, or do you expect clients or third-party developers to implement this feature themselves in the future?
Thanks

@ushachar (Contributor, Author) commented Aug 4, 2022:


I consider this out of scope for this design - it requires changes in additional areas and this is a big enough task as-is...
Flotilla won't make it harder to add later in the game.


### Roles

While the Flotilla spec mostly discusses data nodes (and not Failover Coordinator & Topology Director implementation), in our reference implementation
Collaborator:

I agree that we can discuss control plane implementation separately. But I fear there might be some influence on the data plane design.
For example, handling of FC and TD failovers: will we maintain a replica for the FC and TD? If so, can we face a condition of epoch downgrading?
Administrators often place replicas in separate geo-locations for survivability - maybe we should consider having a replica-FC to better support these cases? (Same goes for the TD.)

@ushachar (Contributor, Author):

This PR does not address this aspect of the TD/FC internal implementation other than specifying that they need to be CP entities (so epoch downgrades cannot happen).
In the upcoming reference implementation we'll have a module which incorporates Raft with 3+ instances to achieve the required guarantees and provide the ability for an admin to use different AZs for HA.

- The heartbeat results from all the nodes in the same shard (and their 'tick' value)
- If needed, the updated cluster topology.

If a replica node detects that its primary has not sent a heartbeat for more than N ticks it checks if it is the most
Collaborator:

In case the replica-primary keepalive is broken BUT the primary-FC heartbeat is OK - will this also be validated by the FC?
There are cases where the replication stream is broken (bugs/routing issues, etc...) but that does not mean we should fail over...

@ushachar (Contributor, Author):

The FC does not validate anything in the heartbeat beyond epoch correctness -- it's the responsibility of the individual nodes to trigger failovers.
In the case you outlined:

  • If the replica <-> FC connectivity is OK, the replica will see that the primary is still alive (since the FC shares that info in the heartbeat response) and will not initiate a failover (see the sketch below)
  • If the replica <-> FC connectivity is down, the replica can't promote itself even if it wanted to
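
A minimal sketch of the replica-side decision described above, assuming the FC echoes back when it last heard from each shard member; all names and fields are illustrative, not the actual Flotilla interface:

```c
/* Illustrative only: replica-side check run on each heartbeat response
 * from the FC. Types, fields and thresholds are hypothetical. */
typedef struct {
    int node_id;
    long long last_seen_tick;   /* FC tick at which this node was last heard from */
} shardMemberStatus;

int shouldStartFailover(const shardMemberStatus *members, int num_members,
                        int primary_id, long long current_tick,
                        long long max_missed_ticks, int fc_reachable) {
    /* If we cannot reach the FC we cannot be promoted anyway. */
    if (!fc_reachable) return 0;

    for (int i = 0; i < num_members; i++) {
        if (members[i].node_id != primary_id) continue;
        /* The FC tells us when it last heard from the primary; act only if
         * the primary has been silent for more than N ticks. Per the spec
         * excerpt above, the replica would then also check that it is the
         * most suitable candidate before requesting promotion. */
        return (current_tick - members[i].last_seen_tick) > max_missed_ticks;
    }
    return 0;   /* primary not found in the FC's view: don't act */
}
```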


### Failover

Every K milliseconds, every node sends a heartbeat to the Failover Coordinator which owns it. This heartbeat includes:
Collaborator:

There are many issues related to cluster failover due to long-running commands (Lua scripts, large set joins, etc.)... I wonder if we want to improve the heartbeat mechanism to overcome these issues. Maybe this is a separate discussion, but I think that redesigning the cluster should allow users to run long commands without having to keep adjusting the timeout values.

Contributor:

This is something I've mentioned before and want to address. I want the heartbeating mechanism to be pulled off the main thread of Redis so we can heartbeat to the FCs from inside commands that take single-digit seconds.

@ushachar (Contributor, Author):

It's possible to do, and won't impact the rest of the design -- but is it always desirable? I'd argue that after a certain threshold we do want to fail over a primary with a non-responsive main thread.

The ideal solution would probably be to have a heartbeat thread which also monitors the main one and can decide to stop heartbeating if the main thread is unresponsive for a long enough time.

@madolson (Contributor) commented Aug 16, 2022:

> It's possible to do, and won't impact the rest of the design

100% correct, it's an implementation decision.

> but is it always desirable?

I think so, we've had quite a bit of trouble reliably determining the state of the execution thread in Redis since it is synchronous.

> I'd argue that after a certain threshold we do want to fail over a primary with a non-responsive main thread.

This is mostly what I'm expecting. I suppose we would have two thresholds.

  • No heartbeat received after X seconds: The process is bricked for some reason and we should declare it unhealthy so a failover can happen.
  • Too many heartbeats of the same type received for Y seconds: Something is stuck and not making progress fast enough; maybe a sync flush was executed. We should still fail over eventually, but give some more room here. (See the sketch below.)
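
A rough sketch of what such a two-threshold check might look like on the monitoring side; the names and thresholds are made up for illustration:

```c
/* Illustrative only: a two-threshold liveness check, where X and Y map to
 * the "no heartbeat" and "no progress" cases described above. */
typedef enum { NODE_HEALTHY, NODE_STUCK, NODE_DEAD } nodeHealth;

typedef struct {
    long long last_heartbeat_ms;   /* when the last heartbeat of any kind arrived */
    long long last_progress_ms;    /* when the heartbeat content last changed */
} nodeLiveness;

nodeHealth checkNodeHealth(const nodeLiveness *n, long long now_ms,
                           long long no_heartbeat_after_ms,    /* X */
                           long long no_progress_after_ms) {   /* Y, with Y > X */
    if (now_ms - n->last_heartbeat_ms > no_heartbeat_after_ms)
        return NODE_DEAD;    /* process is bricked: declare unhealthy, allow failover */
    if (now_ms - n->last_progress_ms > no_progress_after_ms)
        return NODE_STUCK;   /* heartbeats arrive but nothing advances: give more room,
                              * but still fail over eventually */
    return NODE_HEALTHY;
}
```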

Contributor:

Conceptually speaking, failure detection is orthogonal to the cluster V2 spec discussion so I agree it can wait or be explored in parallel.

I think the key question in the domain of failure detection is "progress", i.e., what do we mean by "progress", especially when a script is involved? It is not a trivial problem IMO to determine with 100% reliability whether an arbitrary command is running in an infinite loop or, in other words, whether it is making "progress". Therefore, I think there should always be an upper bound on how long a command can run. Whether or not we need a separate thread to pump h/b is a secondary question to me, as I think the main thread is the best place to implement the logic to determine "progress". The separate thread is more like the IO threads, which can pick up the "progress" breadcrumb left by the main thread and report it with the h/b. It is just there to help reduce the work on the main thread, but the main thread could just do it on its own as well.
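
A small sketch of the breadcrumb idea, under the assumption that the main thread bumps an atomic counter at safe points and the heartbeat thread merely samples it (purely illustrative, not an existing Redis mechanism):

```c
#include <stdatomic.h>

/* Illustrative only: the main thread leaves a "progress" breadcrumb that a
 * separate heartbeat thread can sample and attach to the h/b, without the
 * main thread having to send anything itself. */
static atomic_llong progress_breadcrumb;

/* Called by the main thread at safe progress points, e.g. between elements
 * processed by a long-running command or script. */
void markProgress(void) {
    atomic_fetch_add_explicit(&progress_breadcrumb, 1, memory_order_relaxed);
}

/* Called by the heartbeat thread. If the value is unchanged since the last
 * sample, the main thread is not making observable progress. */
long long sampleProgress(void) {
    return atomic_load_explicit(&progress_breadcrumb, memory_order_relaxed);
}
```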

* Describe how ClusterV2 will be implemented based on that.
-->

#### Potential Enhancements
Contributor:

I vaguely thought we discussed this at some point, but what is the authentication between data nodes and the TD/FC?

@zuiderkwast (Contributor):

Why not merge this spec in the redis-doc repo? That's where we have the legacy cluster spec and other specs and it'd show up on the redis.io website.

@oranagra (Member):

We actually have https://github.com/redis/redis-specifications for this purpose.

@madolson (Contributor):

Yeah, I don't think we should merge this document into the main redis repository.

While Flotilla can be implemented using any sort of strongly consistent system, the reference implementation we propose
uses Redis to host the Topology Director & Failover Coordinator logic. This allows for different deployment modes depending on
the needs and orchestrating abilities of the administrator:
- Full Blown: Dedicated nodes holding the Topology Directors and multiple Failover Coordinators dividing the data nodes among them.
@lsn1994 (Contributor) commented Nov 14, 2022:

For Full Blown mode, do we have an option where one TD can support multiple clusters? That way we could divide the system into two layers: computing and storage. @ushachar

@ushachar (Contributor, Author):

Apologies for the (very) late reply - missed this comment.
It's possible to enhance the spec to support this mode, but it's not something that's actually planned at this time due to the additional complexity it introduces.
(In any case, it does not allow dividing the compute/storage requirements - it only reduces the control plane overhead when orchestrating lots of small DBs.)


#### Failover Coordinator Tick

In order to avoid relying on multiple clocks when making failover decisions, the node relies on a monotonically increasing
Collaborator:

@ushachar I'm not sure I understand the tick correctly here. Suppose we have a shard with three nodes A, B, and C.

A, B, and C will send heartbeat packets to the FC every k milliseconds, including:

  • Replication offset (for replicas)
  • Last known cluster topology epoch
  • Last known shard topology epoch
  • Role (primary/replica)

At this point the FC will record a map node->tick, such as {A:4, B:4, C:1}, assuming the tick at this time is 4 but C has not reported a heartbeat for a long time.

Then the FC will add the {A:4, B:4, C:1} map to the reply of the heartbeat packet to let the other nodes know that C has not been updated for k*(4-1) = k*3 ms.

Does this mean the tick will only be incremented on the FC and assigned to the different nodes?

@ushachar (Contributor, Author):

It's an implementation detail, but the tick gets saved as part of the heartbeat, and not in a separate map.

If A & B both send a heartbeat within the relevant period, but C hasn't updated in a while, then yes - the heartbeat response will contain the information that A & B were heard from in tick 4, and C was heard from in tick 1.
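
A minimal sketch of that bookkeeping, assuming the FC simply stamps each stored heartbeat with its current tick and echoes the per-node ticks back in the response; structure and names are illustrative:

```c
/* Illustrative only: FC-side tick bookkeeping for one shard. Names and
 * structure are made up to show the mechanism, not the real implementation. */
#define SHARD_MAX_NODES 16

typedef struct {
    int node_id;
    long long repl_offset;       /* from the node's last heartbeat */
    long long cluster_epoch;
    long long shard_epoch;
    int is_primary;
    long long received_at_tick;  /* FC tick at which this heartbeat arrived */
} storedHeartbeat;

typedef struct {
    long long current_tick;      /* incremented only on the FC */
    storedHeartbeat members[SHARD_MAX_NODES];
    int num_members;
} fcShardState;

/* On every incoming heartbeat, stamp it with the FC's current tick. */
void fcRecordHeartbeat(fcShardState *s, storedHeartbeat hb) {
    hb.received_at_tick = s->current_tick;
    for (int i = 0; i < s->num_members; i++) {
        if (s->members[i].node_id == hb.node_id) { s->members[i] = hb; return; }
    }
    if (s->num_members < SHARD_MAX_NODES) s->members[s->num_members++] = hb;
}

/* The heartbeat reply echoes each member's last-seen tick, so a response
 * like {A:4, B:4, C:1} tells the shard that C has been silent for 3 ticks. */
long long fcLastSeenTick(const fcShardState *s, int node_id) {
    for (int i = 0; i < s->num_members; i++)
        if (s->members[i].node_id == node_id) return s->members[i].received_at_tick;
    return -1;   /* never heard from */
}
```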

oranagra added a commit that referenced this pull request Nov 22, 2023
This PR reworks the clustering code to support multiple clustering
implementations, specifically, the current "legacy" clustering
implementation or, although not part of this PR, flotilla (see #10875).
Which implementation is used could be a compile-time flag (will be added
later). Legacy clustering functionality remains unchanged.

The basic idea is as follows. The header cluster.h now contains function
declarations that define the "Cluster API." These are the contract and
interface between any clustering implementation and the rest of the
Redis source code. Some of the function definitions are shared between
all clustering implementations. These functions are in cluster.c. The
functions and data structures specific to legacy clustering are in
cluster-legacy.c/h. One consequence of this is that the structs
clusterNode and clusterState which were previously "public" to the rest
of Redis are now hidden behind the Cluster API.

The PR is divided up into commits, each with a commit message explaining
the changes. Some are just mass renames or moving code between files (which may
not require close inspection/review); others are manual changes.

One other, related change is:
- The "failover" command is now plugged into the Cluster API so that the
clustering implementation can (a) enable/disable the command to begin
with and if enabled (b) perform the actual failover. The "failover"
command remains disabled for legacy clustering.
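
To make the shape of that contract concrete, here is an illustrative sketch of the kind of declarations such a cluster.h could expose; the names and signatures are examples, not the verbatim API introduced by the commit:

```c
/* cluster.h -- illustrative sketch of the kind of contract described above.
 * Names and signatures are examples only, not the verbatim Redis API. */
#ifndef CLUSTER_API_SKETCH_H
#define CLUSTER_API_SKETCH_H

/* Opaque handles: clusterNode and clusterState are no longer visible to the
 * rest of the code base; only the chosen implementation defines them. */
typedef struct clusterNode clusterNode;

/* Lifecycle hooks every clustering implementation must provide. */
void clusterInit(void);
void clusterCron(void);

/* Topology queries used by the command layer, e.g. for MOVED redirection. */
clusterNode *clusterNodeForSlot(int slot);
int clusterNodeIsMyself(const clusterNode *n);
const char *clusterNodeGetName(const clusterNode *n);

/* Manual failover: the implementation decides whether the FAILOVER command
 * is enabled at all, and if so, how it is carried out. */
int clusterFailoverCmdEnabled(void);
void clusterDoManualFailover(void);

#endif
```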
