Add concurrent writes reconciliation for OPTIMIZE in Delta Lake #22443

Open
wants to merge 4 commits into
base: master

Conversation

findinpath
Contributor

@findinpath findinpath commented Jun 19, 2024

Description

Allow OPTIMIZE operations to commit in the presence of concurrent writes
by placing the OPTIMIZE commit right after any write operations
that completed concurrently.
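
To illustrate the idea in the description above, here is a minimal sketch of how such a reconciliation loop could look. It is not the code in this PR: `TransactionLogSketch`, `tryWrite`, `commitsAfter`, and `commitOptimize` are hypothetical names, the transaction log is modeled in memory, and the conflict rule (OPTIMIZE fails only if a concurrent commit removed a file it compacted) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a transaction log; not the connector's API.
final class TransactionLogSketch
{
    record Commit(long version, String operation, List<String> removedFiles) {}

    private final List<Commit> commits = new ArrayList<>();

    // Commits with a version strictly greater than the given one
    synchronized List<Commit> commitsAfter(long version)
    {
        return new ArrayList<>(commits.subList((int) (version + 1), commits.size()));
    }

    // Writing version N succeeds only if N is exactly the next free version
    synchronized boolean tryWrite(long version, String operation, List<String> removedFiles)
    {
        if (version != commits.size()) {
            return false;
        }
        commits.add(new Commit(version, operation, List.copyOf(removedFiles)));
        return true;
    }

    // Commit OPTIMIZE right after every write that completed since readVersion.
    // Assumed conflict rule: fail if a concurrent commit removed a compacted file.
    static long commitOptimize(TransactionLogSketch log, long readVersion, List<String> filesToCompact)
    {
        long checkedVersion = readVersion;
        while (true) {
            for (Commit concurrent : log.commitsAfter(checkedVersion)) {
                for (String removed : concurrent.removedFiles()) {
                    if (filesToCompact.contains(removed)) {
                        throw new IllegalStateException(
                                "Conflict: " + removed + " was removed by version " + concurrent.version());
                    }
                }
                checkedVersion = concurrent.version();
            }
            long targetVersion = checkedVersion + 1;
            if (log.tryWrite(targetVersion, "OPTIMIZE", filesToCompact)) {
                return targetVersion;
            }
            // Another writer took targetVersion first; re-check the new commits and retry
        }
    }
}
```

In this sketch, an OPTIMIZE that read the table at version 0 while an INSERT committed version 1 lands as version 2 instead of failing.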

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta Lake 
* Add support for concurrent execution of `OPTIMIZE`. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jun 19, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Jun 19, 2024
@findinpath findinpath force-pushed the findinpath/delta-lake-optimize-reconciliation branch 2 times, most recently from c3fe9e8 to 9264397 Compare June 19, 2024 15:09
@findinpath findinpath marked this pull request as ready for review June 19, 2024 15:09
@findinpath findinpath force-pushed the findinpath/delta-lake-optimize-reconciliation branch from 9264397 to f4be087 Compare June 19, 2024 15:12
@findinpath findinpath force-pushed the findinpath/delta-lake-optimize-reconciliation branch from f4be087 to e695b50 Compare June 19, 2024 15:14
@findinpath findinpath requested review from ebyhr and pajaks June 19, 2024 15:14
Comment on lines +2538 to +2542
// Note: during writes we want to preserve original case of partition columns
List<String> partitionColumns = getPartitionColumns(
        optimizeHandle.getMetadataEntry().getOriginalPartitionColumns(),
        optimizeHandle.getTableColumns(),
        getColumnMappingMode(optimizeHandle.getMetadataEntry(), optimizeHandle.getProtocolEntry()));
Member

Can we move this into the commitOptimizeOperation function?

.build())
.forEach(MoreFutures::getDone);

assertThat(query("SELECT * FROM " + tableName)).matches("VALUES (1, 10), (2, 10), (11, 20), (12, 20), (21, 30), (22, 30)");
Member

Shouldn't we check the file count to verify that OPTIMIZE actually happened?

ExecutorService executor = newFixedThreadPool(threads);
String tableName = "test_concurrent_optimize_and_inserts_table_" + randomNameSuffix();

assertUpdate("CREATE TABLE " + tableName + " (a, part) WITH (partitioned_by = ARRAY['part']) AS VALUES (1, 10), (11, 20)", 2);
Member

Can you add more files, so that OPTIMIZE can rewrite something?

.build())
.forEach(MoreFutures::getDone);

assertThat(query("SELECT * FROM " + tableName)).matches("VALUES (1, 10), (8, 10), (11, 20), (21, 30)");
Member

Can we check that OPTIMIZE actually changed something?
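
One possible way to address the review comments above about verifying that OPTIMIZE rewrote something is to compare the set of active data files before and after the concurrent run. The sketch below models that check in plain Java; `OptimizeEffectCheck` and `filesRemovedByRewrite` are hypothetical names, and in a connector test the two path sets could plausibly be collected with `SELECT DISTINCT "$path" FROM <table>`. It assumes the only concurrent operations are inserts, which add files but never remove them.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for a test; not part of this PR.
final class OptimizeEffectCheck
{
    // Files that were active before the run but are no longer active afterwards.
    // Assumption: concurrent inserts only add files, so every file returned here
    // must have been rewritten by OPTIMIZE.
    static Set<String> filesRemovedByRewrite(Set<String> pathsBefore, Set<String> pathsAfter)
    {
        Set<String> removed = new HashSet<>(pathsBefore);
        removed.removeAll(pathsAfter);
        return removed;
    }

    public static void main(String[] args)
    {
        Set<String> before = Set.of("part=10/a.parquet", "part=10/b.parquet", "part=20/c.parquet");
        // After the run: a and b were compacted into m, and a concurrent insert added d
        Set<String> after = Set.of("part=10/m.parquet", "part=20/c.parquet", "part=30/d.parquet");

        Set<String> rewritten = filesRemovedByRewrite(before, after);
        if (rewritten.isEmpty()) {
            throw new AssertionError("OPTIMIZE did not rewrite any file");
        }
        System.out.println("Files rewritten by OPTIMIZE: " + rewritten.size());
    }
}
```

A plain file count comparison can be misleading here, because concurrent inserts add files while OPTIMIZE removes them. Checking for files that disappeared avoids that ambiguity.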

Labels
cla-signed delta-lake Delta Lake connector
2 participants