
Releases: steveloughran/zero-rename-committer

May 2021 Release

17 May 15:50
fe31fbf

The May 2021 update looks at the issues which surface in production and identifies common causes. While no flaws were found in the algorithm itself, the implementation encountered concurrency, scale and resilience issues.

June 2020 release

16 Jun 17:15
fe31fbf
Pre-release

Noticed an error in the magic task commit algorithm: the path referenced the task attempt rather than the task.

May 2019 Edition

10 May 18:47
54b3139
Pre-release

This is the May 2019 update of this paper:

  1. Moved to ACM draft paper format, which offers much better readability.
  2. Review of AWS EMR Spark committer, based on the details in their blog post.

It's nice to see the EMR team describe some of the algorithm, and also to give our work credit. It's disappointing to see that despite the prior art by ourselves and others, the commit algorithm they implemented will leave the output directory in an unknown state if a single task attempt fails during its commit operation.

  1. If the first attempt created files with different names than the second, the destination would contain the results of both attempts.
  2. Even if both attempts created files with exactly the same names, a network partition could cause the first attempt to complete one or more of its uploads after a second attempt has already committed its work. This could even happen after the entire job has completed "successfully".
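The two failure modes above can be sketched with a toy Python model. The `run_job` helper and the file names are purely illustrative; this is not the EMR code, which is closed source.

```python
# Toy model of why materializing output during task commit breaks
# exclusivity of output: two attempts at the same task write straight to
# the destination, and a delayed upload from the "failed" first attempt
# can land after the second attempt, or the whole job, has committed.

def run_job(attempt1, attempt2, delayed=None):
    """Return the destination directory, modelled as {filename: writer}.

    attempt1: files attempt 1 uploaded before being declared failed.
    attempt2: files the successful attempt 2 uploaded.
    delayed:  uploads from the partitioned attempt 1 completing late.
    """
    destination = {}
    destination.update(attempt1)     # attempt 1 materializes in task commit
    destination.update(attempt2)     # attempt 2 is the one the job records
    if delayed:
        destination.update(delayed)  # late POST from the "dead" attempt
    return destination

# Failure mode 1: different file names, so BOTH attempts' output is visible.
mixed = run_job({"part-0001-a1": "attempt1"}, {"part-0001-a2": "attempt2"})
assert set(mixed) == {"part-0001-a1", "part-0001-a2"}

# Failure mode 2: same names, but the partitioned attempt finishes an
# upload after attempt 2 committed: committed data is silently replaced.
late = run_job({}, {"part-0001": "attempt2"},
               delayed={"part-0001": "attempt1"})
assert late["part-0001"] == "attempt1"
```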

This is dangerous. It is a simpler algorithm to implement, since no information need be passed to the job committer, and by removing the need to issue POST operations in job commit, it makes job commit O(1) irrespective of data size. But it does not meet the expectations which Spark has of commit protocols. Unless the EMR version of Spark has been modified to fail fast on a timeout/failure during a task attempt commit, this risks data.

This is why we explicitly chose not to do this in our committer. Doing it in the worker would have given better benchmark numbers, but remember: correctness is more important than speed.

To be fair: the MRv2 commit algorithm has precisely the same flaws and, because the task's non-atomic commit operation takes O(data) time, a potentially large window of vulnerability. The presence of the option to use that algorithm, and its tangible speedups, means people use it, probably unaware of its failings. Which is precisely why we didn't provide a "materialize output in task commit" option: people would be drawn to using it.

A fix to the Spark driver and MR job committers to fail fast in such a situation would benefit both: they should be able to interrogate the committers for confirmation that a failure in task commit is recoverable, just as they already do for job commit. Yes, some jobs may fail visibly, but it is that or retain the risk of silent data corruption.
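The proposed fail-fast check can be sketched as follows. The class and method names here are illustrative, not the actual Hadoop or Spark APIs (Hadoop's `OutputCommitter` does expose recovery probes such as `isRecoverySupported()`, but this sketch is a model, not that interface).

```python
# Hypothetical sketch: a driver asks the committer whether a failure during
# task commit is safe to retry, and fails the whole job rather than
# rescheduling the task when it is not.

class Committer:
    def is_task_commit_repeatable(self) -> bool:
        raise NotImplementedError

class DirectToDestCommitter(Committer):
    """Materializes output in task commit (the risky design discussed above)."""
    def is_task_commit_repeatable(self) -> bool:
        return False   # output may already be visible; a retry risks duplicates

class StagingCommitter(Committer):
    """Nothing visible until job commit, so a task commit retry is safe."""
    def is_task_commit_repeatable(self) -> bool:
        return True

def on_task_commit_failure(committer: Committer) -> str:
    """Decide what the driver should do after a task commit timeout/failure."""
    if committer.is_task_commit_repeatable():
        return "reschedule-task"
    return "fail-job"   # fail fast: a visible failure beats silent corruption

assert on_task_commit_failure(StagingCommitter()) == "reschedule-task"
assert on_task_commit_failure(DirectToDestCommitter()) == "fail-job"
```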

Paper Draft 004: post-ASF-release edition

08 Jan 22:58

Review of the paper after experience in production: the HTTP connection pool can be the bottleneck in the commit process. We should consider enlarging it in general, though that increases the resource consumption of the connector in any shared process (Spark, Hive LLAP, Impala). A larger pool does benefit the commit, and a small one may be slowing down other operations, unknown to the users.
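A back-of-envelope model shows why the pool matters: if job commit must issue one POST per pending upload and the connector's pool holds a fixed number of connections (in S3A this is `fs.s3a.connection.maximum`), the POSTs proceed in waves. The numbers below are illustrative, not measurements.

```python
import math

def commit_rounds(pending_posts: int, pool_size: int) -> int:
    """Waves of POSTs needed when each POST holds one pooled connection."""
    return math.ceil(pending_posts / pool_size)

# Hypothetical job with 1000 uploads to complete in job commit:
assert commit_rounds(pending_posts=1000, pool_size=15) == 67
assert commit_rounds(pending_posts=1000, pool_size=96) == 11
```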

  • Update Sanjay & Steve's email addresses to be @apache.org.
  • Add Thomas Demoor as a co-author.
  • Add a mention of the EMR committer; this appears to be an uncredited
    re-implementation of our work, without any detail on its correctness.

Commentary

It's disappointing that the EMR S3 Committer Documentation fails to credit our work. Both the Netflix and the separate Hadoop committer work were under public development for some time, as was the joint development, and a JOIN of the watchers of HADOOP-13786 with the LinkedIn graph easily identifies which AWS team members were watching the work.

Personal disappointment about credit aside, a bigger concern is that the absence of any information on the specific nature of the algorithm or its correctness means that there is no evidence that the closed-source committer meets our criteria for a "correct" committer:

Completeness of job output

After a successful invocation of commitJob(): the destination directory tree will contain all files written under the output directory of all task attempts which successfully returned from an invocation of commitTask(). The contents of these files will contain exactly the data written by the user code.

You get what was committed

Exclusivity of output.

After a successful invocation of commitJob(): the destination directory tree must only contain the output of successfully committed tasks.

And not what wasn't.

Consistency of the commit.

The task or job must be able to reliably commit the work, even in the presence of inconsistent listings.
This could be addressed, for example, by using a consistent store for some operations, or a manifest mechanism and a reliance on create consistency.

Consistency with subsequent queries in a workflow is encouraged, else a "sufficient"
delay is needed for the listings to become consistent.

Addresses store inconsistencies, somehow

Concurrent.

Multiple tasks in the same job must be able to commit concurrently.

A job must be able to commit its work while other jobs are committing
their work to different destinations in the store.

Ability to abort.

If a job attempt is aborted before commitJob() is invoked, and cleanupJob() is called, then the output of the attempt will not appear in the destination directory at any point in the future.

An aborted/cleaned up job no longer exists

Continuity of correctness.

After a job has been successfully committed, no outstanding task may promote
output into the destination directory.

That is: if a task attempt has not "failed" mid-commit, merely proceeded at a slow rate,
its output will not contaminate the directory of the already-successful job.

A dead task attempt stays dead
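Several of the criteria above can be expressed as executable checks against a simulated destination. This is a model for illustration only, with made-up file and task names; it is not the committer's own test suite.

```python
# Completeness, exclusivity and continuity of correctness as assertions
# over a simulated destination directory (a set of file names).

def check_commit(destination, committed_task_output, uncommitted_output):
    """Verify the destination against the commit-correctness criteria.

    destination:           files visible after commitJob().
    committed_task_output: {task: set-of-files} for committed tasks.
    uncommitted_output:    files written by failed/straggler attempts.
    """
    committed = set().union(*committed_task_output.values()) \
        if committed_task_output else set()
    # Completeness of job output: all committed files are present.
    assert committed <= destination, "missing committed output"
    # Exclusivity of output: nothing but committed output is present.
    assert destination <= committed, "uncommitted output leaked"
    # Continuity of correctness: a dead attempt's files never resurface.
    assert destination.isdisjoint(uncommitted_output), "dead attempt resurfaced"

# A run that satisfies all three criteria:
check_commit(
    destination={"part-0000", "part-0001"},
    committed_task_output={"task0": {"part-0000"}, "task1": {"part-0001"}},
    uncommitted_output={"part-0001-failed-attempt"},
)
```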

We hope that the EMR team's omission of credit will be rectified, along with the proof that their implementation meets these correctness requirements.

(ps: note this is a personal opinion of one of the authors)

Paper draft 003

19 Feb 20:19
Pre-release

This is the paper in sync with HADOOP-15107 patch 001

Paper Draft 002

12 Jan 23:16
Paper Draft 002 Pre-release
Pre-release

Draft 002: algorithms fixed up further, text tightened, spelling corrected.

Algorithm change: time spent looking at MR job recovery, with more detail on the v1 vs v2 differences there, particularly how the job history is used to rebuild job manager state.

Draft 001: A zero rename committer

04 Jan 22:18
2cc7629
Pre-release

First public draft.

a_zero_rename_committer.pdf

Too verbose and under-proofread; the algorithms render but have not been reviewed, the proofs are badly argued, and there are no benchmarks.

One thing I'd like early comments on is my descriptions/illustrations of both the Hadoop and Spark commit protocols. I've stepped through a lot of the Hadoop one (it's how you come to understand the FileOutputCommitter algorithms, after all), but the Spark one comes more from IDE archaeology. Jackson versions are stopping me bringing up the new committers in the IDE there, at least for now.