enhancement(aws_s3 source): batch SQS deletes #7992

Merged
tobz merged 5 commits into master from tobz/s3-source-batch-sqs-deletes on Jun 23, 2021

Contributor

@tobz tobz commented Jun 22, 2021

This PR introduces batching to the delete phase of the core S3 source loop. Previously, after reading a batch of messages from SQS, we would process them one by one and then, if deletion was enabled, delete each one from SQS to mark it as completed. Now we collect the message IDs to delete as we retrieve and process the files from S3, and delete them from SQS only once the whole batch of SQS messages has been processed.

In a synthetic benchmark of many small files, where the time spent reading the files from S3 and sending them to the pipeline is small, we observe a 40-50% increase in overall source throughput.
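The change described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: `Message`, `MockQueue`, `process`, and `handle_batch` are hypothetical names, and the mock queue stands in for a real SQS client so that the number of delete requests can be counted.

```rust
// Hypothetical stand-in for an SQS message; not the real SDK or Vector type.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub receipt_handle: String,
    pub body: String,
}

/// Mock queue client that records how many delete requests it receives.
#[derive(Default)]
pub struct MockQueue {
    pub delete_calls: usize,
    pub deleted: Vec<String>,
}

impl MockQueue {
    /// Stands in for a batched delete request: one call, many receipt handles.
    pub fn delete_batch(&mut self, receipt_handles: Vec<String>) {
        self.delete_calls += 1;
        self.deleted.extend(receipt_handles);
    }
}

/// Pretends to fetch and forward the S3 object that a message refers to.
/// An empty body is treated as a processing failure for illustration.
fn process(message: &Message) -> Result<(), String> {
    if message.body.is_empty() {
        Err("empty notification body".to_string())
    } else {
        Ok(())
    }
}

/// Processes every message in the batch, collecting the receipt handles of
/// the ones that succeeded, then issues a single delete request for all of
/// them. Returns the number of successfully processed messages.
pub fn handle_batch(queue: &mut MockQueue, messages: &[Message], delete_enabled: bool) -> usize {
    let mut to_delete = Vec::with_capacity(messages.len());
    for message in messages {
        if process(message).is_ok() {
            to_delete.push(message.receipt_handle.clone());
        }
    }
    let processed = to_delete.len();
    if delete_enabled && !to_delete.is_empty() {
        queue.delete_batch(to_delete);
    }
    processed
}
```

With three messages of which one fails, `handle_batch` makes one delete call covering the two successes instead of two separate calls. The sketch does not model the SQS limit of 10 entries per `DeleteMessageBatch` request or per-entry failures in the response.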

Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>

@tobz tobz requested a review from a team June 22, 2021 20:47
@tobz tobz changed the title from "chore(s3 source): batch SQS deletes" to "enhancement(aws_s3 source): batch SQS deletes" Jun 22, 2021
Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
@tobz tobz force-pushed the tobz/s3-source-batch-sqs-deletes branch from c2b7e1a to cbb1975 Compare June 22, 2021 20:50
Member

@jszwedko jszwedko left a comment

Nice! 🎉

src/internal_events/aws_s3.rs (outdated)

#[derive(Debug)]
pub(crate) struct SqsMessageDeleteFailed {
    state: MessageDeleteFailureState,
Member

The code paths for Complete and Partial below seem to be completely disjoint. What is the benefit of making this an enum instead of two separate event structs?

Contributor Author

Continuity in downstream systems that currently look at these events was the thought... although maybe that doesn't really matter?

Member

I'm not sure how it would. There are already two separate code paths that emit the same warning and metric (albeit in the same function). I don't think splitting them into separate methods/structs would be that big a deal (assuming I understand your concern correctly).
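The two designs under discussion can be compared in a small sketch. Only the `SqsMessageDeleteFailed` struct, its `state` field, and the `Complete` and `Partial` variant names come from this thread. The `Event` trait, the variant fields, the log wording, and the names `SqsMessageDeleteBatchFailed` and `SqsMessageDeletePartialFailure` are assumptions made for illustration, and Vector's real internal-event trait looks different.

```rust
// Hypothetical, simplified event trait; not Vector's actual trait.
pub trait Event {
    fn log_line(&self) -> String;
}

// Shared formatting helpers so both options emit identical output.
fn complete_line(error: &str) -> String {
    format!("Deletion of SQS messages failed: {}", error)
}

fn partial_line(failed_ids: &[String]) -> String {
    format!(
        "Deletion of {} SQS message(s) failed: {}",
        failed_ids.len(),
        failed_ids.join(", ")
    )
}

// Option A: one event struct carrying a state enum.
// The variant fields are assumptions; the thread only names the variants.
#[derive(Debug)]
pub enum MessageDeleteFailureState {
    Complete { error: String },
    Partial { failed_ids: Vec<String> },
}

#[derive(Debug)]
pub struct SqsMessageDeleteFailed {
    pub state: MessageDeleteFailureState,
}

impl Event for SqsMessageDeleteFailed {
    fn log_line(&self) -> String {
        // The two arms share no data, which is the reviewer's point.
        match &self.state {
            MessageDeleteFailureState::Complete { error } => complete_line(error),
            MessageDeleteFailureState::Partial { failed_ids } => partial_line(failed_ids),
        }
    }
}

// Option B: two separate event structs, one per failure mode.
#[derive(Debug)]
pub struct SqsMessageDeleteBatchFailed {
    pub error: String,
}

#[derive(Debug)]
pub struct SqsMessageDeletePartialFailure {
    pub failed_ids: Vec<String>,
}

impl Event for SqsMessageDeleteBatchFailed {
    fn log_line(&self) -> String {
        complete_line(&self.error)
    }
}

impl Event for SqsMessageDeletePartialFailure {
    fn log_line(&self) -> String {
        partial_line(&self.failed_ids)
    }
}
```

Because both options can emit the same log line (and, by extension, the same metric), downstream consumers should not see a difference. The choice mostly affects how the emitting code reads.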

tobz added 2 commits June 22, 2021 19:43
Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
Contributor Author

tobz commented Jun 23, 2021

> I think you want to drop the % here

You're totally right. I swear I ran cargo check before committing that code... 🤔 Anyways, fixed!


src/internal_events/aws_s3.rs (outdated)
Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
Comment on lines +288 to +289
let cloned_entries = delete_entries.clone();
match self.delete_messages(delete_entries).await {
Member

Too bad you can't pass this by reference. This clone is otherwise pretty pointless (since the internal events can be written to use a reference and not need a clone).

Contributor Author

Yeah, it's annoying. :(
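A small sketch of why the clone ends up being needed, under the assumption that the delete call takes ownership of its entries. Everything here is hypothetical and synchronous: `DeleteEntry`, this `delete_messages`, `DeleteSucceeded`, and `delete_and_report` are illustrative names and not the code under review.

```rust
// Hypothetical delete entry; not the real SDK type.
#[derive(Debug, Clone, PartialEq)]
pub struct DeleteEntry {
    pub id: String,
    pub receipt_handle: String,
}

/// Hypothetical client call that consumes its entries (assumed here to
/// mirror a request type that owns its data). Fails on an empty batch.
pub fn delete_messages(entries: Vec<DeleteEntry>) -> Result<usize, String> {
    if entries.is_empty() {
        Err("no entries".to_string())
    } else {
        Ok(entries.len())
    }
}

/// Event that only borrows the entries, so it needs no copy of its own.
pub struct DeleteSucceeded<'a> {
    pub entries: &'a [DeleteEntry],
}

impl<'a> DeleteSucceeded<'a> {
    pub fn summary(&self) -> String {
        let ids: Vec<&str> = self.entries.iter().map(|e| e.id.as_str()).collect();
        format!("deleted {} message(s): {}", self.entries.len(), ids.join(", "))
    }
}

pub fn delete_and_report(delete_entries: Vec<DeleteEntry>) -> Result<String, String> {
    // The clone exists only because `delete_messages` consumes its argument;
    // the events below are satisfied with a borrow.
    let cloned_entries = delete_entries.clone();
    match delete_messages(delete_entries) {
        Ok(_) => Ok(DeleteSucceeded { entries: &cloned_entries }.summary()),
        Err(error) => Err(format!(
            "failed to delete {} message(s): {}",
            cloned_entries.len(),
            error
        )),
    }
}
```

If the delete call accepted `&[DeleteEntry]` instead, `cloned_entries` could be dropped and the events could borrow `delete_entries` directly.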

@tobz tobz merged commit ea77790 into master Jun 23, 2021
@tobz tobz deleted the tobz/s3-source-batch-sqs-deletes branch June 23, 2021 17:49
6 participants