
May 2019 Edition


This is the May 2019 update of this paper.

  1. Moved to ACM draft paper format, which offers much better readability.
  2. A review of the AWS EMR Spark committer, based on the details in their blog post.

It's nice to see the EMR team describe some of the algorithm, and to give our work credit. It's disappointing to see that, despite the prior art by ourselves and others, the commit algorithm they implemented will leave the output directory in an unknown state if a single task attempt fails during its commit operation:

  1. If the first attempt created files with different names from a second attempt's, the destination would hold the results of both tasks.
  2. Even if both attempts created files with exactly the same names, a network partition could cause the first attempt to complete one or more of its uploads after a second attempt had already committed its work. This could even happen after the entire job has completed "successfully". The sketch below illustrates both hazards.
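To make the hazard concrete, here is a minimal sketch of a task commit which materializes output by completing its pending multipart uploads in place. It assumes the AWS SDK for Java v1; `PendingUpload` is a hypothetical bookkeeping type, and the code is an illustration of the failure modes, not the actual EMR implementation.

```java
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;

// Illustration only: each completed upload is immediately visible at the
// destination, so the commit is neither atomic nor revocable.
class MaterializeInTaskCommitSketch {

  // Hypothetical bookkeeping for one pending multipart upload.
  static class PendingUpload {
    String key;
    String uploadId;
    List<PartETag> partETags;
  }

  static void commitTask(AmazonS3 s3, String bucket,
      List<PendingUpload> pending) {
    for (PendingUpload u : pending) {
      // A failure or long pause here leaves earlier files manifest and
      // later ones absent; a slow POST from a "dead" attempt can still
      // complete after another attempt (or the whole job) has finished.
      s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
          bucket, u.key, u.uploadId, u.partETags));
    }
  }
}
```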

This is dangerous. It's a simpler algorithm to implement (no need to pass information from task attempts to the job committer) and, by removing the need to issue POST operations in job commit, it makes job commit O(1) irrespective of the amount of data. But it does not meet the expectations which Spark has of commit protocols. Unless the EMR version of Spark has been modified to fail fast on a timeout or failure during a task attempt commit, this is risking data.

This is why we explicitly chose not to do this in our committer. Doing it in the worker would have given better benchmark numbers, but remember: correctness is more important than speed.

To be fair: the MRv2 commit algorithm has precisely the same flaws, and, because the time for a task's non-atomic commit operation is O(data), a potentially large window of vulnerability. The presence of that algorithm as an option, and its tangible speedups, means people use it, probably unaware of its failings. Which is precisely why we didn't provide a "materialize output in task commit" option: people would be drawn to using it.
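For comparison, here is a hedged sketch of what the v2 task commit amounts to, heavily simplified from Hadoop's `FileOutputCommitter`: files are renamed straight from the task attempt directory into the final destination, so each one becomes visible immediately, and a mid-commit failure leaves partial output behind.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified illustration of the v2 algorithm's task commit: output is
// moved directly into the destination, one file at a time.
class V2TaskCommitSketch {

  static void commitTask(FileSystem fs, Path taskAttemptDir, Path destDir)
      throws IOException {
    for (FileStatus stat : fs.listStatus(taskAttemptDir)) {
      // On HDFS each rename is O(1); on S3 a "rename" is a copy, so the
      // whole commit is O(data) and the window in which a failure leaves
      // the destination half-updated grows with the amount of output.
      fs.rename(stat.getPath(), new Path(destDir, stat.getPath().getName()));
    }
  }
}
```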

A fix to the Spark driver and the MR job committers to fail fast in such a situation would benefit both: they should be able to interrogate the committers for confirmation that a failure during task commit is recoverable, just as they already do for job commit. Yes, some jobs may fail visibly, but the alternative is to retain the risk of silent data corruption.
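A minimal sketch of what that interrogation might look like on the driver side, assuming a hypothetical `isTaskCommitRepeatable()` probe modelled on the existing `OutputCommitter.isCommitJobRepeatable()`; none of this is current Hadoop or Spark behaviour.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical fail-fast handling of a failed or timed-out task commit.
class FailFastSketch {

  static void onTaskCommitFailure(OutputCommitter committer,
      TaskAttemptContext context) throws IOException {
    // isTaskCommitRepeatable() is hypothetical; the idea mirrors the
    // real isCommitJobRepeatable(JobContext) probe used for job commit.
    if (!isTaskCommitRepeatable(committer, context)) {
      // The destination may already hold part of the attempt's output:
      // fail the whole job rather than risk silent data corruption.
      throw new IOException(
          "Task commit failed and cannot be safely repeated; failing job");
    }
    // Otherwise it is safe to schedule another task attempt.
  }

  // Stand-in for the proposed committer predicate.
  static boolean isTaskCommitRepeatable(OutputCommitter committer,
      TaskAttemptContext context) {
    return false; // conservative default in this sketch
  }
}
```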