
Releases: steveloughran/zero-rename-committer

May 2021 Release

17 May 15:50
fe31fbf

The May 2021 update looks at the issues which surface in production and identifies common causes. While no flaws were found in the algorithm itself, the implementation encountered concurrency, scale and resilience issues.

June 2020 release

16 Jun 17:15
fe31fbf
Pre-release

Noticed an error in the magic task commit algorithm: the path referenced the task attempt rather than the task.

May 2019 Edition

10 May 18:47
54b3139
Pre-release

This is the May 2019 update of this paper:

  1. Moved to ACM draft paper format, which offers much better readability.
  2. Review of AWS EMR Spark committer, based on the details in their blog post.

It's nice to see the EMR team describe some of the algorithm, and also to give our work credit. It's disappointing to see that despite the prior art by ourselves and others, the commit algorithm they implemented will leave the output directory in an unknown state if a single task attempt fails during its commit operation.

  1. If the first attempt created files with different names than the second, the destination would contain the results of both attempts.
  2. Even if both attempts created files with exactly the same names, a network partition could cause the first attempt to complete one or more of its uploads after a second attempt has already committed its work. This could even happen after the entire job has completed "successfully".
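The two failure modes above can be sketched with a toy Python model. The `run_job` helper and the file names are purely illustrative; this is not the EMR code, which is closed source.

```python
# Toy model of why materializing output during task commit breaks
# exclusivity of output: two attempts at the same task write straight to
# the destination, and a delayed upload from the "failed" first attempt
# can land after the second attempt, or the whole job, has committed.

def run_job(attempt1, attempt2, delayed=None):
    """Return the destination directory, modelled as {filename: writer}.

    attempt1: files attempt 1 uploaded before being declared failed.
    attempt2: files the successful attempt 2 uploaded.
    delayed:  uploads from the partitioned attempt 1 completing late.
    """
    destination = {}
    destination.update(attempt1)     # attempt 1 materializes in task commit
    destination.update(attempt2)     # attempt 2 is the one the job records
    if delayed:
        destination.update(delayed)  # late POST from the "dead" attempt
    return destination

# Failure mode 1: different file names, so BOTH attempts' output is visible.
mixed = run_job({"part-0001-a1": "attempt1"}, {"part-0001-a2": "attempt2"})
assert set(mixed) == {"part-0001-a1", "part-0001-a2"}

# Failure mode 2: same names, but the partitioned attempt finishes an
# upload after attempt 2 committed: committed data is silently replaced.
late = run_job({}, {"part-0001": "attempt2"},
               delayed={"part-0001": "attempt1"})
assert late["part-0001"] == "attempt1"
```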

This is dangerous. It is a simpler algorithm to implement, since no information need be passed to the job committer, and by removing the need to issue POST operations in job commit, it makes job commit O(1) irrespective of data size. But it does not meet the expectations which Spark has of commit protocols. Unless the EMR version of Spark has been modified to fail fast on a timeout/failure during a task attempt commit, this risks data.

This is why we explicitly chose not to do this in our committer. Doing it in the worker would have given better benchmark numbers, but remember: correctness is more important than speed.

To be fair: the MRv2 commit algorithm has precisely the same flaws and, because the task's non-atomic commit operation takes O(data) time, a potentially large window of vulnerability. The presence of the option to use that algorithm, and its tangible speedups, means people use it, probably unaware of its failings. Which is precisely why we didn't provide a "materialize output in task commit" option: people would be drawn to using it.

A fix to the Spark driver and MR job committers to fail fast in such a situation would benefit both: they should be able to interrogate the committers for confirmation that a failure in task commit is recoverable, just as they already do for job commit. Yes, some jobs may fail visibly, but it is that or retain the risk of silent data corruption.
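The proposed fail-fast check can be sketched as follows. The class and method names here are illustrative, not the actual Hadoop or Spark APIs (Hadoop's `OutputCommitter` does expose recovery probes such as `isRecoverySupported()`, but this sketch is a model, not that interface).

```python
# Hypothetical sketch: a driver asks the committer whether a failure during
# task commit is safe to retry, and fails the whole job rather than
# rescheduling the task when it is not.

class Committer:
    def is_task_commit_repeatable(self) -> bool:
        raise NotImplementedError

class DirectToDestCommitter(Committer):
    """Materializes output in task commit (the risky design discussed above)."""
    def is_task_commit_repeatable(self) -> bool:
        return False   # output may already be visible; a retry risks duplicates

class StagingCommitter(Committer):
    """Nothing visible until job commit, so a task commit retry is safe."""
    def is_task_commit_repeatable(self) -> bool:
        return True

def on_task_commit_failure(committer: Committer) -> str:
    """Decide what the driver should do after a task commit timeout/failure."""
    if committer.is_task_commit_repeatable():
        return "reschedule-task"
    return "fail-job"   # fail fast: a visible failure beats silent corruption

assert on_task_commit_failure(StagingCommitter()) == "reschedule-task"
assert on_task_commit_failure(DirectToDestCommitter()) == "fail-job"
```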

Paper Draft 004: post-ASF-release edition

08 Jan 22:58

Review of the paper after experience in production: the HTTP connection pool can be the bottleneck in the commit process. We should consider enlarging it in general, though that increases the resource consumption of the connector in any shared process (Spark, Hive LLAP, Impala). A larger pool does benefit the commit, and a small one may be slowing down other operations, unknown to the users.
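A back-of-envelope model shows why the pool matters: if job commit must issue one POST per pending upload and the connector's pool holds a fixed number of connections (in S3A this is `fs.s3a.connection.maximum`), the POSTs proceed in waves. The numbers below are illustrative, not measurements.

```python
import math

def commit_rounds(pending_posts: int, pool_size: int) -> int:
    """Waves of POSTs needed when each POST holds one pooled connection."""
    return math.ceil(pending_posts / pool_size)

# Hypothetical job with 1000 uploads to complete in job commit:
assert commit_rounds(pending_posts=1000, pool_size=15) == 67
assert commit_rounds(pending_posts=1000, pool_size=96) == 11
```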

  • Update Sanjay & Steve's email addresses to be @apache.org.
  • Add Thomas Demoor as a co-author.
  • Add a mention of the EMR committer; this appears to be an uncredited
    re-implementation of our work, without any detail on its correctness.

Commentary

It's disappointing that the EMR S3 Committer Documentation fails to credit our work. Both the Netflix and the separate Hadoop committer work were under public development for some time, as was the joint development, and a JOIN of the watchers of HADOOP-13786 with the LinkedIn graph easily identifies which AWS team members were watching the work.

Personal disappointment about credit aside, a bigger concern is that the absence of any information on the specific nature of the algorithm or its correctness means that there is no evidence that the closed-source committer meets our criteria for a "correct" committer:

Completeness of job output

After a successful invocation of commitJob(): the destination directory tree will contain all files written under the output directory of all task attempts which successfully returned from an invocation of commitTask(). The contents of these files will contain exactly the data written by the user code.

You get what was committed

Exclusivity of output.

After a successful invocation of commitJob(): the destination directory tree must only contain the output of successfully committed tasks.

And not what wasn't.

Consistency of the commit.

The task or job must be able to reliably commit the work, even in the presence of inconsistent listings.
This could be addressed, for example, by using a consistent store for some operations, or a manifest mechanism and a reliance on create consistency.

Consistency with subsequent queries in a workflow is encouraged, else a "sufficient"
delay is needed for the listings to become consistent.

Addresses store inconsistencies, somehow

Concurrent.

Multiple tasks in the same job must be able to commit concurrently.

A job must be able to commit its work while other jobs are committing
their work to different destinations in the store.

Ability to abort.

If a job attempt is aborted before commitJob() is invoked, and cleanupJob() is called, then the output of the attempt will not appear in the destination directory at any point in the future.

An aborted/cleaned up job no longer exists

Continuity of correctness.

After a job has been successfully committed, no outstanding task may promote
output into the destination directory.

That is: if a task attempt has not "failed" mid-commit, merely proceeded at a slow rate,
its output will not contaminate the directory of the already-successful job.

A dead task attempt stays dead
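Several of the criteria above can be expressed as executable checks against a simulated destination. This is a model for illustration only, with made-up file and task names; it is not the committer's own test suite.

```python
# Completeness, exclusivity and continuity of correctness as assertions
# over a simulated destination directory (a set of file names).

def check_commit(destination, committed_task_output, uncommitted_output):
    """Verify the destination against the commit-correctness criteria.

    destination:           files visible after commitJob().
    committed_task_output: {task: set-of-files} for committed tasks.
    uncommitted_output:    files written by failed/straggler attempts.
    """
    committed = set().union(*committed_task_output.values()) \
        if committed_task_output else set()
    # Completeness of job output: all committed files are present.
    assert committed <= destination, "missing committed output"
    # Exclusivity of output: nothing but committed output is present.
    assert destination <= committed, "uncommitted output leaked"
    # Continuity of correctness: a dead attempt's files never resurface.
    assert destination.isdisjoint(uncommitted_output), "dead attempt resurfaced"

# A run that satisfies all three criteria:
check_commit(
    destination={"part-0000", "part-0001"},
    committed_task_output={"task0": {"part-0000"}, "task1": {"part-0001"}},
    uncommitted_output={"part-0001-failed-attempt"},
)
```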

We hope that the EMR team's omission of credit will be rectified, along with the proof that their implementation meets these correctness requirements.

(ps: note this is a personal opinion of one of the authors)

Paper draft 003

19 Feb 20:19
Pre-release

This is the paper in sync with HADOOP-15107 patch 001

Paper Draft 002

12 Jan 23:16
Paper Draft 002 Pre-release
Pre-release

Draft 002: algorithms fixed up further, text tightened, spelling corrected.

Algorithm change: time spent looking at MR job recovery, with more detail on the v1 vs v2 differences there, particularly how the job history is used to rebuild job manager state.

Draft 001: A zero rename committer

04 Jan 22:18
2cc7629
Pre-release

First public draft.

a_zero_rename_committer.pdf

Too verbose and under-proofread; the algorithms render but have not been reviewed, the proofs are badly argued, and there are no benchmarks.

One thing I'd like early comments on is my descriptions/illustrations of both the Hadoop and Spark commit protocols. I've stepped through a lot of the Hadoop one (it's how you come to understand the FileOutputCommitter algorithms, after all), but the Spark one comes more from IDE archaeology. Jackson versions are stopping me bringing up the new committers in the IDE there, at least for now.