The clustering megathread #1911

coffeemug · 2014-01-27T23:27:06Z

Our clustering implementation has a lot of limitations, both in performance/scalability, and operations. We have lots of open issues about various problems. Many of them are related on the product level, many are related on the code/architecture level, many are unrelated. We'll have to do significant reworking and I'd like to start a discussion about the overall refactor/rearchitecture here.

Here are the issues we'll need to fix:

Improve blueprint suggester #1900 -- improve blueprint suggester to make sure automatic node allocations are (more) balanced in various scenarios.
Master election / node discovery backends #1876, Automatic failover/rebalance and lossless rebalance. #223 -- automatic failover.
Sharding while keeping docs available. #1774 -- sharding without losing availability.
100 machines/100k tables #1861 -- directory scalability.
Replace range sharding with consistent hash sharding #364, Function in the web interface to create evenly-spaced shards #1679 -- consistent hash sharding.
Support custom shard keys #1862 -- custom sharding functions. That is, I want primaries for my US users to be in datacenter A, and primaries for my EU users to be in datacenter B.
Flexible routing of outdated reads #1858, use outdated allow for positive outdated. #1859 -- more flexible outdated reads routing. E.g. "run outdated, but make sure it's not older than token X", "run outdated and route to datacenter A", "run outdated if there is no primary in the DC I'm connected to, otherwise run up-to-date".
Replication of Tables without having to manually specify? #1792, Summarize statistics on the server #1392 -- programmatic access to cluster/statistics via ReQL.
Revisit multidatacenter replication #1906 -- make networking more datacenter-aware (e.g. can servers within one datacenter communicate only through private IPs/internal network interfaces?)
Adding a new machine to cluster results in "duplicate database" issue #83, Do something smarter with the default database ('test') #301, Connecting two different clusters together #1905 -- cluster setup flow. Specifically duplicate database "test" issues, and connecting two disjoint clusters by accident.

EDIT:

Better issue resolution. A problem with one table shouldn't make administration of the whole cluster impossible until the issue is resolved. A machine down that doesn't result in unsatisfiable goals shouldn't be treated as an emergency. A netsplit between datacenters shouldn't inundate the web ui in a hundred issues. We need to treat machine failure and nesplits like regular events rather than something out of the ordinary.

EDIT2:

Proposal for clustering - Avoid routing all queries to the master #2119 -- better/different presentation of acks/replicas, and possibly some latency minimization query routing strategies.

We could make this a bazaar or a cathedral or anything in between. Pinging @Tryneus and @timmaxw. I'd like your thoughts and proposal on how to go about this.

EDIT3:

Bad performance after adding a replica #2083 -- our query routing algorithm (basically we pick a random node) is too simple and causes latency issues.

timmaxw · 2014-01-28T01:59:23Z

A lot of these suggestions involve adding features to the cluster administration code. We should consider making it possible for external programs to bypass/replace the C++ cluster administration code. I made a separate issue to keep the discussion organized. See #1913.

Tryneus · 2014-02-13T19:52:22Z

So I went through and identified the biggest problems, prioritized them, and have a rough outline of a proposal in 6 phases:

Phase 1: Directory and Blueprint optimization

Remove the NOTHING role from the Blueprint
- Will drastically reduce the size of the blueprint in large clusters
Remove the nothing role from the Directory
- Will drastically reduce the size of the directory in large clusters
- Leave the nothing_when_safe and nothing_when_done_erasing roles
Need to know when to spawn nothing roles to perform backfilling and deletion

Phase 2: Auto-Failover

Integrate a Paxos or Raft (or some other consensus) library as a new service in the connectivity layer
- As opposed to a daemon, which would add more dependencies and more complicated cluster management for users
- BSD or otherwise free library we can build straight in or staticly link would be preferred
  - libpaxos3 (https://bitbucket.org/sciascid/libpaxos) - should be able to work this into our architecture without much trouble
  - CRaft (https://github.com/willemt/CRaft) - simpler interface, but missing some key features, we could contribute
- Alternative is to implement a consensus algorithm ourselves, but that is a last resort
Consensus library will be used to synchronize a Master Failover Table
- When the master failover table changes, reactor roles need to be reevaluated
There appears to be a window for data divergence in how writes are committed
- When recovering from a failover, we need the new master to eclipse the old master, if divergence is detected.
- This should not be an error from the user's perspective, as any rollbacked writes will not have been acked to any client

Process to elect a new master:
On peer disconnect

For each shard the peer was master of (known by checking the blueprint and master failover table):

If the table has a number of write acks greater than half the number of replicas (rounded up), we can attempt failover without worrying about data divergence

Propose a new master from the list of connected peers hosting a replica for that shard

Leader polls each replica for the shard to ensure that contact to the peer is gone

If at least half the replicas (rounded up) cannot contact the peer, commit the change to the master failover table

On peer reconnect

For each shard the peer should be master of, but isn't (known by checking the blueprint and master failover table):

Propose that the peer becomes master

Leader polls each replica for the shard to ensure that contact to the peer is restored

If at least half the replicas (rounded up) can contact the peer, commit the change to the master failover table

Phase 3: ReQL Cluster API

r.cluster() operations to change semilattice metadata
Ideally, treat the semilattice metadata as a virtual table or tables
Change the admin CLI and web UI to use this interface

Phase 4: Minimal downtime on blueprint change

Change reactor shutdown such that a machine will continue serving queries until a new master is available
Could probably reuse a lot of the auto failover logic

Phase 5: Blueprint generation without full connectivity

Not sure if this is possible, but we need it

Phase 6: More Directory and Blueprint optimization

Use hash maps instead of trees for table and mailbox lookup
- Important if we want to scale to hundreds of thousands of tables
Consider using a single mailbox per core for all tables, rather than one for each table
- This would make the directory much smaller, at the cost of complicating reactor transitions
Deterministic blueprints
- Move the blueprint out of the cluster semilattice metadata
- Save time distributing blueprints, which could feasibly be dozens of MB in size
- Instead, pass around blueprint hashes, and only send the entire blueprint if there is a desync

As you can see, there are a number of open questions. The biggest are:
Phase 2: Which consensus library to use and how to integrate it into our clustering system
Phase 3: ReQL Cluster API description
Phase 3: How to interface a virtual table with the rest of the ReQL protocol
Phase 5: Feasibility and architecture design

Phases 1 and 6 are optimization changes, but phase 1 is there because it should be relatively easy and hits one of the biggest bottlenecks in the current state of things. Phases 2 and 3 are necessary for a production-ready product. I wouldn't say phases 4 and 5 are necessary for production-readiness, but I would strongly recommend them.

danielmewes · 2014-02-13T21:43:05Z

@Tryneus This sounds like a great plan.

Depending on how difficult phase 5 is, we should probably do it earlier because it impacts users a lot.

timmaxw · 2014-02-14T00:04:29Z

Phase 1 sounds like a really good optimization. When removing "nothing" from the directory, there is a small complication to watch out for. When the current system is switching between two roles, it temporarily has no directory entry. For example, when it is switching from "primary" to "secondary", it first removes the "primary" entry and then adds the "secondary" entry, so it briefly has no entry at all. It's important to distinguish between this case and a true "nothing". This should be easy to fix by inserting a placeholder during the switch-over, or perhaps by leaving the old entry until the new entry is ready.

Your description for Phase 2 is not very clear, and a lot of details are missing. Would you mind writing it up in more detail? Perhaps it should be a separate issue, to keep the discussion organized.

You propose to move the web UI into a separate process (presumably Python) in Phase 3. I suggest you consider moving the suggester and auto-failover logic into the separate process as well. Then, if you do Phase 3 before Phase 2, the consensus system could be integrated in Python rather than C++, which might be nicer.

coffeemug · 2014-02-14T00:15:44Z

You propose to move the web UI into a separate process (presumably Python) in Phase 3

I don't think @Tryneus proposed that. I think he meant implementing a ReQL API to conveniently manage the cluster via client drivers, and then switching the WebUI and the Admin UI to use this API instead of writing to semilattices directly. That doesn't require moving the WebUI out.

timmaxw · 2014-02-14T00:18:49Z

Oh! I thought he was proposing to partially implement the idea from #1913. Oops.

Tryneus · 2014-02-14T00:21:42Z

I think #1913 is a good idea, and it could even fit into one of the phases I proposed above, but it is also orthogonal to a lot of these features. It would be really nice if we could use some higher-level constructs for dealing with the goals, blueprints, and even consensus (the libraries available for C/C++ are pretty sparse), but at the same time, it wouldn't necessarily affect the user experience for some time.

timmaxw · 2014-02-14T00:29:07Z

I agree. The advantage of doing it soon is that the longer we wait, the more code we have to write in C++ and later port to Python. The disadvantage is that it will take a lot of time, and there is no direct payoff in terms of user experience or making any particular issue easier to solve. So the correct answer depends on the development schedule, which I don't know very much about. I just wanted to suggest that you keep it in mind. 😃

mlucy · 2014-02-20T23:33:00Z

I'm extremely skeptical of outside paxos libraries.

libpaxos3 looks like it was written by some dude (http://atelier.inf.usi.ch/~sciascid/), has like 4 simple tests (https://bitbucket.org/sciascid/libpaxos/src/20414d195443e9fe82973f0a0be8c5a3bd24e954/unit/?at=master), doesn't appear to have a bug tracker, had a super-low-traffic mailing list (http://sourceforge.net/mailarchive/forum.php?forum_name=libpaxos-general), and has bug reports in said mailing list that don't seem to all be resolved.

mlucy · 2014-02-21T06:42:44Z

If we are forced to choose an external library, we should also look at https://github.com/logcabin/logcabin . It claims it isn't ready for production use, which I honestly take as a positive signal in this case, and it's written by Diego Ongaro, who's one of the authors on https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf .

josephglanville · 2014-03-13T09:16:06Z

@mlucy I would also vote for Logcabin. It's the reference implementation of Raft.
You wouldn't need to use the entire implementation, it would be feasible to just extract the Raft algorithm (1200LOC or so) and implement your own log + snapshotting.

coffeemug · 2014-04-14T04:36:08Z

Also note #2083 is a part of this.

neumino · 2014-07-07T19:43:55Z

I talked a little with @timmaxw last Monday and asked him a few questions. I don't remember all of them, but here are a few (feel free to open an issue if they are relevant)

How are we going to relieve the master of a shard (now it takes all the writes+read) -- slightly related to Proposal for clustering - Avoid routing all queries to the master #2119
Can we remove the state of each hash shard and instead send back one value (whether the shard is ready or not). That should decreaze the size of the directory roughly by 8.
Can we clean the content of ajax/stat. They are pretty big, and the web interface filter them with a regex, but from what I know, the whole thing is sent between servers, which probably create a lot of intra-cluster traffic.
And actually, same question for the content in ajax

I'm pretty sure there was other stuff, but I somehow can't remember. I'll add another comment if something comes up.

danielmewes · 2015-03-24T02:26:45Z

The ideas in here have pretty much been translated into the ReQL admin interface that we shipped with 1.16 and the Raft rework that @timmaxw has been working on for the past months.

There are some separate issues (query routing, hash sharding etc.) that are already tracked elsewhere.

I think this thread has outlived its usefulness.

timmaxw mentioned this issue Jan 28, 2014

Proposal: allow external programs to bypass/replace cluster administration code #1913

Closed

neumino mentioned this issue Mar 25, 2014

Turning primary datacenter option on and off in the web UI causes an off by one replica count error #1815

Closed

This was referenced Apr 1, 2014

Better cluster management and monitoring API #2169

Closed

Archive shards? #2216

Open

coffeemug modified the milestones: 2.x, subsequent Sep 29, 2014

danielmewes closed this as completed Mar 24, 2015

danielmewes modified the milestones: outdated, subsequent Mar 24, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The clustering megathread #1911

The clustering megathread #1911

coffeemug commented Jan 27, 2014

timmaxw commented Jan 28, 2014

Tryneus commented Feb 13, 2014

danielmewes commented Feb 13, 2014

timmaxw commented Feb 14, 2014

coffeemug commented Feb 14, 2014

timmaxw commented Feb 14, 2014

Tryneus commented Feb 14, 2014

timmaxw commented Feb 14, 2014

mlucy commented Feb 20, 2014

mlucy commented Feb 21, 2014

josephglanville commented Mar 13, 2014

coffeemug commented Apr 14, 2014

neumino commented Jul 7, 2014

danielmewes commented Mar 24, 2015

The clustering megathread #1911

The clustering megathread #1911

Comments

coffeemug commented Jan 27, 2014

timmaxw commented Jan 28, 2014

Tryneus commented Feb 13, 2014

danielmewes commented Feb 13, 2014

timmaxw commented Feb 14, 2014

coffeemug commented Feb 14, 2014

timmaxw commented Feb 14, 2014

Tryneus commented Feb 14, 2014

timmaxw commented Feb 14, 2014

mlucy commented Feb 20, 2014

mlucy commented Feb 21, 2014

josephglanville commented Mar 13, 2014

coffeemug commented Apr 14, 2014

neumino commented Jul 7, 2014

danielmewes commented Mar 24, 2015