Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alg] Reuse ranges of contended merges to speed up concurrent merges #4057

Merged
merged 2 commits into from
Oct 23, 2022

Conversation

arielshaqed
Copy link
Contributor

No description provided.

@arielshaqed arielshaqed added area/cataloger Improvements or additions to the cataloger exclude-changelog PR description should not be included in next release changelog team/versioning-engine Team versioning engine area/KV Improvements to the KV store implementation labels Aug 31, 2022
@arielshaqed arielshaqed marked this pull request as ready for review August 31, 2022 11:31
@arielshaqed
Copy link
Contributor Author

I asked DALL-E to draw this. Here's

A cyborg reusing ranges to speed up concurrent merges

DALL-E drawing of "A cyborg reusing ranges to speed up concurrent merges" shows a cyborg doing something with 2 targets and meaningless Latin-character text"

Copy link
Contributor

@eden-ohana eden-ohana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea!

merged metarange_ as source and the original destination as the base! (See
"Correctness" below for why this yields the correct result.)

The performance gain occurs during the retry: We expect the new source to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To gain performance improvement the assumption here is that the two merges don't change the same ranges, need to make sure that this is the case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right: this will be true for some but not all merges. With our emerging focus on merging as the basic tool for handling multi-object file formats (Spark outputs, probably Iceberg, maybe Delta too), I would expect ranges to separate if there are repeated racing merges -- simply because large multi-objects will split off into their range soon enough :-)

Adding words to that effect.

| A | B | A | B | C | conflict | Conflict with updated dst |
| A | A | B | B | C | C | |


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a correction issue when using the conflict resolution flag. The example uses one file and the letter represents a different version of it.

Two concurrent merges

  1. Source wins in case of conflict
    Base A
    Source B
    Dest C
    Res B

  2. Dest wins in case of conflict
    Base A
    Source D
    Dest C
    Res C

The first one finish first so the second one will run with the result metarange - C and the dest wins flag:

Base C
Source C
Dest B
Res B

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think conflict resolution also works here, actually. You just need to interpret the new base and source.

Think of this as doing exactly what you would do during manual conflict resolution after losing a race:

  • First you merge destination HEAD into your source. This uses the reverse conflict resolution strategy.
  • Now you merge the result back to your destination HEAD.

Our multi-object writing strategy is going to do this, and will benefit.
Copy link
Contributor Author

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

merged metarange_ as source and the original destination as the base! (See
"Correctness" below for why this yields the correct result.)

The performance gain occurs during the retry: We expect the new source to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right: this will be true for some but not all merges. With our emerging focus on merging as the basic tool for handling multi-object file formats (Spark outputs, probably Iceberg, maybe Delta too), I would expect ranges to separate if there are repeated racing merges -- simply because large multi-objects will split off into their range soon enough :-)

Adding words to that effect.

| A | B | A | B | C | conflict | Conflict with updated dst |
| A | A | B | B | C | C | |


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think conflict resolution also works here, actually. You just need to interpret the new base and source.

Think of this as doing exactly what you would do during manual conflict resolution after losing a race:

  • First you merge destination HEAD into your source. This uses the reverse conflict resolution strategy.
  • Now you merge the result back to your destination HEAD.

@arielshaqed
Copy link
Contributor Author

@eden-ohana PTAL :-)

Copy link
Contributor

@eden-ohana eden-ohana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@arielshaqed
Copy link
Contributor Author

THANKS!

Pulling by force because docs don't run tests 🤷🏽

@arielshaqed arielshaqed merged commit 12eae11 into master Oct 23, 2022
@arielshaqed arielshaqed deleted the design/reuse-contended-merges branch October 23, 2022 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cataloger Improvements or additions to the cataloger area/KV Improvements to the KV store implementation exclude-changelog PR description should not be included in next release changelog proposal team/versioning-engine Team versioning engine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Design how to reuse ranges of contended merges to speed up concurrent merges
3 participants