
The clustering megathread #1911

Closed
coffeemug opened this issue Jan 27, 2014 · 14 comments

Our clustering implementation has a lot of limitations, both in performance/scalability and in operations. We have many open issues about various problems: some are related at the product level, some at the code/architecture level, and some are unrelated. We'll have to do significant reworking, and I'd like to start a discussion here about the overall refactor/rearchitecture.

Here are the issues we'll need to fix:

EDIT:

  • Better issue resolution. A problem with one table shouldn't make administration of the whole cluster impossible until the issue is resolved. A machine being down without causing unsatisfiable goals shouldn't be treated as an emergency. A netsplit between datacenters shouldn't flood the web UI with a hundred issues. We need to treat machine failures and netsplits as regular events rather than something out of the ordinary.

EDIT2:

We could make this a bazaar or a cathedral or anything in between. Pinging @Tryneus and @timmaxw. I'd like your thoughts and proposal on how to go about this.

EDIT3:

@timmaxw commented Jan 28, 2014

A lot of these suggestions involve adding features to the cluster administration code. We should consider making it possible for external programs to bypass/replace the C++ cluster administration code. I made a separate issue to keep the discussion organized. See #1913.

@Tryneus commented Feb 13, 2014

So I went through and identified the biggest problems, prioritized them, and have a rough outline of a proposal in 6 phases:


Phase 1: Directory and Blueprint optimization

  • Remove the NOTHING role from the Blueprint
    • Will drastically reduce the size of the blueprint in large clusters
  • Remove the nothing role from the Directory
    • Will drastically reduce the size of the directory in large clusters
    • Leave the nothing_when_safe and nothing_when_done_erasing roles
  • Need to know when to spawn nothing roles to perform backfilling and deletion
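For illustration, here is a minimal Python sketch of the sparse-representation idea behind Phase 1. All names here (Role, SparseBlueprint) are hypothetical, not the actual blueprint/directory types: instead of storing an explicit NOTHING entry for every (machine, shard) pair, only primary and secondary entries are kept, and absence is interpreted as nothing.

```python
from enum import Enum

class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    # No NOTHING member: a (machine, shard) pair that hosts nothing has no entry.

class SparseBlueprint:
    """Hypothetical stand-in for the blueprint; stores only non-nothing roles."""

    def __init__(self):
        self.roles = {}  # (machine_id, shard_id) -> Role

    def set_role(self, machine_id, shard_id, role):
        self.roles[(machine_id, shard_id)] = role

    def role_for(self, machine_id, shard_id):
        # A missing entry plays the part of the old explicit "nothing" role.
        return self.roles.get((machine_id, shard_id))
```

In a large cluster where each table lives on only a few machines, this drops the number of entries from machines × shards to roughly the number of actual replicas, which is where the size reduction in the bullets above comes from.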

Phase 2: Auto-Failover

  • Integrate a Paxos or Raft (or some other consensus) library as a new service in the connectivity layer
    • As opposed to a daemon, which would add more dependencies and more complicated cluster management for users
    • A BSD-licensed or otherwise free library that we can build straight in or statically link would be preferred
    • The alternative is to implement a consensus algorithm ourselves, but that is a last resort
  • Consensus library will be used to synchronize a Master Failover Table
    • When the master failover table changes, reactor roles need to be reevaluated
  • There appears to be a window for data divergence in how writes are committed
    • When recovering from a failover, we need the new master to eclipse the old master, if divergence is detected.
    • This should not be an error from the user's perspective, as any rolled-back writes will not have been acked to any client

Process to elect a new master:
On peer disconnect

For each shard the peer was master of (known by checking the blueprint and master failover table):

  1. If the table has a number of write acks greater than half the number of replicas (rounded up), we can attempt failover without worrying about data divergence
  2. Propose a new master from the list of connected peers hosting a replica for that shard
  3. Leader polls each replica for the shard to ensure that contact to the peer is gone
  4. If at least half the replicas (rounded up) cannot contact the peer, commit the change to the master failover table

On peer reconnect

For each shard the peer should be master of, but isn't (known by checking the blueprint and master failover table):

  1. Propose that the peer becomes master
  2. Leader polls each replica for the shard to ensure that contact to the peer is restored
  3. If at least half the replicas (rounded up) can contact the peer, commit the change to the master failover table
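To make the disconnect half of this procedure concrete, here is a hedged Python sketch. Every helper used here (shards_mastered_by, replicas_of, write_acks_for, is_connected, replica_can_reach, consensus.propose) is a hypothetical name; the real implementation would live in the C++ connectivity layer on top of whichever consensus library gets chosen.

```python
import math

def handle_peer_disconnect(peer, cluster):
    """Run by the consensus leader when `peer` drops out of the cluster."""
    for shard in cluster.shards_mastered_by(peer):
        replicas = cluster.replicas_of(shard)
        majority = math.ceil(len(replicas) / 2)

        # 1. Only attempt failover when the table's write-ack requirement is
        #    high enough that every acked write must survive on some replica.
        if cluster.write_acks_for(shard) < majority:
            continue

        # 2. Propose a new master from the connected peers hosting a replica.
        candidates = [r for r in replicas if cluster.is_connected(r)]
        if not candidates:
            continue
        new_master = candidates[0]

        # 3. The leader polls each reachable replica to check whether it can
        #    still contact the old master.
        unreachable = sum(1 for r in candidates
                          if not cluster.replica_can_reach(r, peer))

        # 4. Commit the change to the master failover table only if at least
        #    half the replicas (rounded up) agree the old master is gone.
        if unreachable >= majority:
            cluster.consensus.propose({"shard": shard, "new_master": new_master})
```

The reconnect path would be the mirror image: propose the blueprint's original master again and commit once at least half the replicas (rounded up) confirm they can reach it.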

Phase 3: ReQL Cluster API

  • r.cluster() operations to change semilattice metadata
  • Ideally, treat the semilattice metadata as a virtual table or tables
  • Change the admin CLI and web UI to use this interface
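To make the virtual-table idea concrete, here is a short sketch using the Python driver. The `rethinkdb` system database and the `reconfigure()` helper shown are from the ReQL admin interface this proposal eventually turned into (see the closing comment in this thread); they are used here purely as an illustration, not as the exact API being proposed above.

```python
import rethinkdb as r

conn = r.connect(host="localhost", port=28015)

# Cluster metadata is readable like any other table.
table_configs = list(r.db("rethinkdb").table("table_config").run(conn))

# Sharding/replication changes go through ordinary ReQL writes or helpers
# such as reconfigure(), so the admin CLI and web UI can share this interface.
r.db("test").table("users").reconfigure(shards=2, replicas=3).run(conn)
```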

Phase 4: Minimal downtime on blueprint change

  • Change reactor shutdown such that a machine will continue serving queries until a new master is available
  • Could probably reuse a lot of the auto failover logic

Phase 5: Blueprint generation without full connectivity

  • Not sure if this is possible, but we need it

Phase 6: More Directory and Blueprint optimization

  • Use hash maps instead of trees for table and mailbox lookup
    • Important if we want to scale to hundreds of thousands of tables
  • Consider using a single mailbox per core for all tables, rather than one for each table
    • This would make the directory much smaller, at the cost of complicating reactor transitions
  • Deterministic blueprints
    • Move the blueprint out of the cluster semilattice metadata
    • Save time distributing blueprints, which could feasibly be dozens of MB in size
    • Instead, pass around blueprint hashes, and only send the entire blueprint if there is a desync
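As a sketch of the deterministic-blueprint idea: peers could gossip a hash of the blueprint they generated locally and only ship the full blueprint when the hashes disagree. The peer object and its two methods below are hypothetical, and the blueprint is assumed to be plain JSON-serializable data for the purposes of hashing.

```python
import hashlib
import json

def blueprint_hash(blueprint):
    # Serialize deterministically (sorted keys) so equal blueprints hash equally.
    encoded = json.dumps(blueprint, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def sync_blueprint(local_blueprint, peer):
    """Send our hash; fetch the peer's full blueprint only on a mismatch."""
    remote_hash = peer.exchange_hash(blueprint_hash(local_blueprint))
    if remote_hash == blueprint_hash(local_blueprint):
        return local_blueprint            # In sync: nothing to transfer.
    return peer.fetch_full_blueprint()    # Desync: fall back to the full copy.
```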

As you can see, there are a number of open questions. The biggest are:
Phase 2: Which consensus library to use and how to integrate it into our clustering system
Phase 3: ReQL Cluster API description
Phase 3: How to interface a virtual table with the rest of the ReQL protocol
Phase 5: Feasibility and architecture design

Phases 1 and 6 are optimization changes, but phase 1 comes first because it should be relatively easy and addresses one of the biggest bottlenecks in the current implementation. Phases 2 and 3 are necessary for a production-ready product. I wouldn't say phases 4 and 5 are necessary for production-readiness, but I would strongly recommend them.

@danielmewes

@Tryneus This sounds like a great plan.

Depending on how difficult phase 5 is, we should probably do it earlier because it impacts users a lot.

@timmaxw commented Feb 14, 2014

Phase 1 sounds like a really good optimization. When removing "nothing" from the directory, there is a small complication to watch out for. When the current system is switching between two roles, it temporarily has no directory entry. For example, when it is switching from "primary" to "secondary", it first removes the "primary" entry and then adds the "secondary" entry, so it briefly has no entry at all. It's important to distinguish between this case and a true "nothing". This should be easy to fix by inserting a placeholder during the switch-over, or perhaps by leaving the old entry until the new entry is ready.
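A tiny sketch of that placeholder idea, with hypothetical names: instead of deleting the "primary" entry and then adding "secondary" (leaving a window with no entry at all), the machine publishes a transitional entry so that a missing entry can always be read as a true "nothing".

```python
def switch_role(directory, machine_id, shard_id, new_role):
    key = (machine_id, shard_id)
    directory[key] = "transitioning"   # Placeholder visible to other peers.
    # ... tear down the old role and bring up the new one here ...
    directory[key] = new_role          # e.g. "secondary"
```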

Your description for Phase 2 is not very clear, and a lot of details are missing. Would you mind writing it up in more detail? Perhaps it should be a separate issue, to keep the discussion organized.

You propose to move the web UI into a separate process (presumably Python) in Phase 3. I suggest you consider moving the suggester and auto-failover logic into the separate process as well. Then, if you do Phase 3 before Phase 2, the consensus system could be integrated in Python rather than C++, which might be nicer.

@coffeemug

> You propose to move the web UI into a separate process (presumably Python) in Phase 3

I don't think @Tryneus proposed that. I think he meant implementing a ReQL API to conveniently manage the cluster via client drivers, and then switching the WebUI and the Admin UI to use this API instead of writing to semilattices directly. That doesn't require moving the WebUI out.

@timmaxw commented Feb 14, 2014

Oh! I thought he was proposing to partially implement the idea from #1913. Oops.

@Tryneus commented Feb 14, 2014

I think #1913 is a good idea, and it could even fit into one of the phases I proposed above, but it is also orthogonal to a lot of these features. It would be really nice if we could use some higher-level constructs for dealing with the goals, blueprints, and even consensus (the libraries available for C/C++ are pretty sparse), but at the same time, it wouldn't necessarily affect the user experience for some time.

@timmaxw commented Feb 14, 2014

I agree. The advantage of doing it soon is that the longer we wait, the more code we have to write in C++ and later port to Python. The disadvantage is that it will take a lot of time, and there is no direct payoff in terms of user experience or making any particular issue easier to solve. So the correct answer depends on the development schedule, which I don't know very much about. I just wanted to suggest that you keep it in mind. 😃

@mlucy commented Feb 20, 2014

I'm extremely skeptical of outside paxos libraries.

libpaxos3 looks like it was written by some dude (http://atelier.inf.usi.ch/~sciascid/), has like 4 simple tests (https://bitbucket.org/sciascid/libpaxos/src/20414d195443e9fe82973f0a0be8c5a3bd24e954/unit/?at=master), doesn't appear to have a bug tracker, has a super-low-traffic mailing list (http://sourceforge.net/mailarchive/forum.php?forum_name=libpaxos-general), and has bug reports in said mailing list that don't all seem to be resolved.

@mlucy commented Feb 21, 2014

If we are forced to choose an external library, we should also look at https://github.com/logcabin/logcabin . It claims it isn't ready for production use, which I honestly take as a positive signal in this case, and it's written by Diego Ongaro, who's one of the authors on https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf .

@josephglanville

@mlucy I would also vote for LogCabin. It's the reference implementation of Raft.
You wouldn't need to use the entire implementation; it would be feasible to extract just the Raft algorithm (about 1200 LOC) and implement your own log and snapshotting.

@coffeemug

Also note #2083 is a part of this.

@neumino commented Jul 7, 2014

I talked a little with @timmaxw last Monday and asked him a few questions. I don't remember all of them, but here are a few (feel free to open an issue if they are relevant)

  • How are we going to relieve the master of a shard (right now it takes all the writes and reads)? -- slightly related to #2119 (Proposal for clustering - Avoid routing all queries to the master)
  • Can we remove the per-hash-shard state and instead send back a single value (whether the shard is ready or not)? That should decrease the size of the directory roughly by a factor of 8.
  • Can we clean up the content of ajax/stat? It is pretty big, and the web interface filters it with a regex, but as far as I know the whole thing is sent between servers, which probably creates a lot of intra-cluster traffic.
    The same question applies to the rest of the content in ajax.

I'm pretty sure there was other stuff, but I somehow can't remember. I'll add another comment if something comes up.

@coffeemug coffeemug modified the milestones: 2.x, subsequent Sep 29, 2014
@danielmewes

The ideas in here have pretty much been translated into the ReQL admin interface that we shipped with 1.16 and the Raft rework that @timmaxw has been working on for the past months.

There are some separate issues (query routing, hash sharding etc.) that are already tracked elsewhere.

I think this thread has outlived its usefulness.

@danielmewes danielmewes modified the milestones: outdated, subsequent Mar 24, 2015