Write Jepsen tests for RethinkDB #1493

Closed
novabyte opened this issue Sep 29, 2013 · 44 comments

@novabyte

@novabyte novabyte commented Sep 29, 2013

Jepsen is a tool for simulating network partitions in databases. It's written in Clojure.

You can find out more about how to write Jepsen tests here:

http://aphyr.com/posts/289-automating-jepsen

These tests would be very useful in determining how RethinkDB behaves during certain kinds of network failures.

@AtnNn

Member

@AtnNn AtnNn commented Sep 30, 2013

This seems like a good idea. However I don't think we can give it a high priority.

I cannot find any documentation on Jepsen. Is it still a work in progress?

@novabyte

Author

@novabyte novabyte commented Oct 1, 2013

Yes, unfortunately Jepsen is very much a work in progress, but Kyle (Aphyr) has been producing a lot of testing for different databases:

and: http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions

Jepsen's tests are written in Clojure, an example is here:

https://github.com/aphyr/jepsen/blob/master/src/jepsen/mongo.clj

I agree it's probably worth marking this as low priority for the moment. It also depends on whether enough of RethinkDB's consistency controls are exposed in the Java client library.

I mentioned the project because it's really worth making it part of the internal benchmarking/test suite used to verify the performance and safety of new RethinkDB releases.

Hope this helps. :)

@coffeemug

Contributor

@coffeemug coffeemug commented Oct 4, 2013

Moving to backlog for now. FYI @novabyte -- we do this via internal tools, but we'll definitely get a publicly reproducible version of this (hopefully with widely available tools) at some point.

@MrJoy

@MrJoy MrJoy commented Apr 15, 2014

Any update on this?

@coffeemug

Contributor

@coffeemug coffeemug commented Apr 15, 2014

There hasn't been so far, sorry.

@mlucy mlucy modified the milestones: tests, backlog Apr 25, 2014
@dminkovsky

@dminkovsky dminkovsky commented Jul 1, 2014

@MrJoy Wouldn't it be best for @aphyr to just write up a RethinkDB Jepsen entry himself? He's been adding new ones lately, so maybe poke him to see if he's interested in Rethink?

@wamatt

@wamatt wamatt commented Jul 1, 2014

👍 gave @aphyrr a shout on twitter.

@aphyr

@aphyr aphyr commented Jul 1, 2014

Y'all have funding, right? Been trying to drum up interest for Jepsen as a nonprofit org, cuz this stuff takes months to do.

@dminkovsky

@dminkovsky dminkovsky commented Jul 1, 2014

@aphyr: FWIW, they do: http://rethinkdb.com/blog/funding/. But I hope they'd only shell out if Jepsen 503 would ensure all truly confirmed writes and CAP-defying availability.

@dminkovsky

@dminkovsky dminkovsky commented Jul 1, 2014

@aphyr But for cereal, RethinkDB is pretty tight if you've not messed with it. Regardless of funding, check it out if you haven't!

@MrJoy

@MrJoy MrJoy commented Jul 1, 2014

@dminkovsky Having him do it would serve as a one-time independent verification of RethinkDB's ability to handle partitioning events. They probably want to either incorporate a Jepsen suite that he writes, or write their own, to use as part of their regression testing suite though. It's easy to make changes to a complex system that can have disastrous implications for subtle semantics and if those semantics aren't continuously under test, one stands a good chance of having such breakage go unnoticed.

@MrJoy

@MrJoy MrJoy commented Jul 1, 2014

@dminkovsky Not sure what you mean by this:

But I hope they'd only shell out if Jepsen 503 would ensure all truly confirmed writes and CAP-defying availability.

Not sure what you mean by "ensure". If you mean "thoroughly validates", then well, Jepsen is quite brutal and capable of simulating virtually any network partition event -- and even if it isn't comprehensive, coverage of some aspect of the problem domain is better than no coverage. If you mean "corrects the behavior of", that isn't Jepsen's role. I ask what you mean because bringing up the potential cost makes me think you mean the second interpretation, but the first is really the only one that makes sense to me.

@dminkovsky

@dminkovsky dminkovsky commented Jul 1, 2014

Not sure what you mean by "ensure".

I was just kidding, suggesting, jokingly, that money and incentives are often misaligned.

@dminkovsky

@dminkovsky dminkovsky commented Jul 1, 2014

Oh and by "Jepsen 503" I meant "Jepsen 501(c)(3)", the hypothetical Jepsen non-profit @aphyr mentioned. I think I must have had the HTTP status code on my mind.

Anyway, as @coffeemug said earlier:

Moving to backlog for now. FYI @novabyte -- we do this via internal tools, but we'll definitely get a publicly reproducible version of this (hopefully with widely available tools) at some point.

The Jepsen series has obviously revealed that internal testing can sometimes be limited, but for what it's worth the internal testing does exist.

@jdunck

@jdunck jdunck commented Jan 16, 2015

When I hear about a new distributed datastore, I google "$DBNAME jepsen", and this is what got me here.

RethinkDB, if you agree that good Jepsen results would be good marketing (as I do), I think you should see if you can contract @aphyr (or a colleague) and, if not, you do it yourself, preferably with community input that You're Doing It Right.

@danielmewes

Member

@danielmewes danielmewes commented Jan 17, 2015

@jdunck We definitely agree that Jepsen results are extremely useful.
However, we are currently redesigning a lot of our clustering logic for one of the upcoming releases, so we would like to wait until then before putting resources into this.

@donspaulding

@donspaulding donspaulding commented Jan 30, 2015

@danielmewes Is 1.16 the release you were talking about? If so, does that mean a discussion of how RethinkDB holds up against Jepsen is forthcoming?

If so... yay!

@deontologician

Contributor

@deontologician deontologician commented Jan 30, 2015

@donspaulding We're moving to Raft for automatic failover; that's the change he was talking about. We'll probably have a milestone for it soon, but #223 is an issue you can follow if you're interested.

@timmaxw

Member

@timmaxw timmaxw commented Jan 30, 2015

To be completely clear: RethinkDB 1.16 does not have automatic failover. RethinkDB 2.0 will not have automatic failover. We're actively working on automatic failover, but it requires some big changes under the hood, and we're expecting to ship sometime around April or May. We're planning to test RethinkDB against Jepsen once we've implemented automatic failover.

@wamatt

@wamatt wamatt commented Jan 30, 2015

@timmaxw that's fantastic. Good luck! :)

@timmaxw

Member

@timmaxw timmaxw commented Mar 4, 2015

The new raft milestone tracks issues that need to be fixed before we can ship auto-failover.

I'm moving this issue to the raft-polish milestone; it would be nice to do it before we ship Raft, but it's not necessary.

@timmaxw timmaxw modified the milestones: raft-polish, tests Mar 4, 2015
@jwr

@jwr jwr commented Mar 31, 2015

I'll pitch in with a non-technical comment. I'm looking for a DB for a new project. RethinkDB looks tempting, but there is the trust issue: databases are hard, distributed databases way more so. Jepsen has become the reference: a reality check for various claims being made. After reading a Jepsen article carefully, I know whether I can trust a distributed DB to get things relatively right.

I'm writing this to suggest that once you do get auto-failover working, you should ensure that Jepsen results are published.

@danielmewes danielmewes modified the milestones: raft-polish, raft May 4, 2015
@danielmewes

Member

@danielmewes danielmewes commented Jul 24, 2015

@mlucy has implemented a few test scenarios with Jepsen and they're passing on the development branch.

The remaining item in this issue I think is to prepare a pull request for the Clojure driver, and if possible one for our Jepsen adaptation as well.

The code is currently in https://github.com/rethinkdb/jepsen in the branches rdb and rdb_mongo.

@danielmewes

Member

@danielmewes danielmewes commented Jul 24, 2015

@mlucy Anything else you wanted to do for this?

@danielcompton

Contributor

@danielcompton danielcompton commented Jul 24, 2015

Happy to help with the Clojure driver, I'm a maintainer of it.

@wamatt

@wamatt wamatt commented Jul 24, 2015

@danielmewes this is great news

@aphyr

@aphyr aphyr commented Jul 24, 2015

In an attempt to make Jepsen tests more modular and less prone to breaking as core changes, I've made Jepsen into a library and split out the tests into their own projects. Take a look at, say, https://github.com/aphyr/jepsen/tree/master/disque for an example of how to structure your tests. :)

@aphyr

@aphyr aphyr commented Jul 24, 2015

Also I'm a little confused by some of the commit messages--for instance, rethinkdb/jepsen@c0062c3 says "etcd tests" but seems mostly related to rethinkdb, not etcd...

@danielmewes

Member

@danielmewes danielmewes commented Jul 24, 2015

Hi @aphyr , great news about the modularization. Assuming we can find the time, we'll try to bring the RethinkDB tests over to the new structure.
The explanation for the commit messages is that we weren't completely sure how things were working and which tests were possible when we started out, so we went to your blog and started by implementing tests from specific articles of yours. So the "etcd tests" are similar to the tests that you describe in https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul , but ported to RethinkDB. Those were the first we tried, and I'm not sure if their currently committed configuration is actually very useful for testing RethinkDB at this point. The rdb_mongo branch has the mongo in it for the same reason. They try to mirror your tests from https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads .

@danielmewes

Member

@danielmewes danielmewes commented Jul 24, 2015

@mlucy will know better which commit is the most recent one and if there's anything else to keep in mind.

@mlucy

Member

@mlucy mlucy commented Jul 29, 2015

@aphyr -- I opened a pull request with what I have at jepsen-io/jepsen#70 . It pulls the current 2.1 beta release (which is known to have bugs) from the Internet, and requires a slightly-patched version of clj-rethinkdb to run (I link to it in the description of the merge). If you want to wait until the actual release to merge it I can update it to pull the right version and use the mainline clj-rethinkdb (which will probably support error types by then).

@danielcompton -- my changes to the clojure driver are at https://github.com/mlucy/clj-rethinkdb/tree/mlucy_error_types . Basically we added fine-grained error types in 2.1. The error hierarchy is changing a little before the final release (see #4559), but once that's in I'll update the branch and open a pull request.

@danielmewes danielmewes modified the milestones: 2.1, 2.1.x Aug 11, 2015
@mlucy

Member

@mlucy mlucy commented Aug 28, 2015

Marking as closed since the main task associated with this issue (running the Jepsen tests against RethinkDB 2.1 prior to release) is done. If anyone has questions about anything in this thread, though, feel free to continue commenting here.

@mlucy mlucy closed this Aug 28, 2015
@lamielle

@lamielle lamielle commented Aug 28, 2015

@mlucy one small question that we could take somewhere else if necessary: what were the results of the testing? I'm really curious to see how the tests worked, what (if anything) was found, and if anything was learned from the process. Maybe a blog post is more appropriate?

@mlucy

Member

@mlucy mlucy commented Aug 28, 2015

@lamielle -- they were definitely helpful to run. I'd have to go back and look at the issue history to remember what they actually found, but I remember them being useful. They also spurred progress on a new feature (typed errors) since the Jepsen tests distinguish between failed and indeterminate operations, and previously the only way to tell those apart was to look at the error messages. Writing up a blog post on the process and what we found would probably be a good project.
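
To make the failed-versus-indeterminate distinction concrete, here is a minimal Python sketch. The exception classes and the `classify` helper are hypothetical names invented for illustration; they are not the actual error hierarchy of RethinkDB 2.1 or of any driver. The labels `ok`, `fail`, and `info` follow the convention Jepsen histories use for completed operations.

```python
class DefiniteFailure(Exception):
    """Hypothetical error type: the write is known NOT to have taken effect."""


class IndeterminateFailure(Exception):
    """Hypothetical error type: the write may or may not have taken effect
    (for example, the connection dropped before an acknowledgement arrived)."""


def classify(operation):
    """Run `operation` and label its outcome as a Jepsen-style history would.

    Returns "ok", "fail", or "info".
    """
    try:
        operation()
    except DefiniteFailure:
        # The checker may assume this write never happened.
        return "fail"
    except IndeterminateFailure:
        # The checker must consider both "happened" and "did not happen".
        return "info"
    # Acknowledged: the write must be visible to later reads.
    return "ok"
```

With typed errors like these, a test harness can branch on the exception class rather than parsing error message strings.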

@pmbauer

@pmbauer pmbauer commented Sep 1, 2015

+1; we are evaluating RethinkDB at Udacity and need a full understanding of the write guarantees and failure modes; looking forward to your Jepsen blog.

@danielmewes

Member

@danielmewes danielmewes commented Sep 1, 2015

@pmbauer In case you haven't seen it yet, this page might be useful for you: http://rethinkdb.com/docs/consistency/

@coffeemug

Contributor

@coffeemug coffeemug commented Sep 1, 2015

@pmbauer -- shoot me an email at slava@rethinkdb.com. I'd love to learn more about your use case, answer any questions, and help with the evaluation in any way I can.

@pmbauer

@pmbauer pmbauer commented Sep 2, 2015

@danielmewes @coffeemug thank you! We are just starting out but like what we see so far ... will reach out later this week.

@danielmewes danielmewes modified the milestones: 2.1, 2.1.x Sep 3, 2015
@roynasser

@roynasser roynasser commented Oct 29, 2015

Sorry if I missed the point, but are there published results or an interpretation? Thanks!

@coffeemug

Contributor

@coffeemug coffeemug commented Oct 29, 2015

@RVN-BR -- not at the moment, but they should be available in a month or two.

@roynasser

@roynasser roynasser commented Oct 29, 2015

Great, I'll wait for that :) I assume it'll be posted here, so I'll keep watching the issue.

thanks

@mafrosis

This comment has been minimized.

@mglukhovsky

Member

@mglukhovsky mglukhovsky commented Jan 4, 2016

Hey guys, as @mafrosis pointed out, the Jepsen analysis went live today.

From the article:

As far as I can ascertain, RethinkDB’s safety claims are accurate. You can lose updates if you write with anything less than majority, and see assorted read anomalies with single or outdated reads, but majority/majority appears linearizable.
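
One intuition for why the majority/majority combination stands apart is that any two majorities of the same replica set share at least one member, so a majority read cannot entirely miss a majority-acknowledged write. The Python sketch below checks that overlap property by brute force. It is a simplified quorum model with hypothetical function names, not a description of how RethinkDB actually implements its read and write modes.

```python
from itertools import combinations


def majority(n):
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def quorums_always_overlap(n, write_size, read_size):
    """True if every write set of `write_size` replicas shares at least one
    replica with every read set of `read_size` replicas, out of n total."""
    replicas = range(n)
    return all(
        set(w) & set(r)
        for w in combinations(replicas, write_size)
        for r in combinations(replicas, read_size)
    )


if __name__ == "__main__":
    n = 5
    m = majority(n)
    print(quorums_always_overlap(n, m, m))  # majority writes, majority reads
    print(quorums_always_overlap(n, 1, m))  # single-replica acks, majority reads
```

In this toy model, a write acknowledged by only one replica can sit entirely outside the set a reader consults, which is consistent with the lost updates the article reports for weaker settings.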
