
Retry bulk chunk requests to Elasticsearch 2, 5, and 6 #1626

Closed
wants to merge 7 commits

Conversation


@soundofjw soundofjw commented Jan 23, 2019

Implements retries when saving bulk documents to Elasticsearch.
This is useful for transient connection problems and other failures which are temporary in nature.

The current behavior will cause the entire "Save to Elasticsearch" step to fail and retry, which creates redundant saves to the cluster. There is no existing method to enable a single failed chunk to retry.

This behavior is implemented with two new options, withMaxRetries and withRetryPause, which can be passed in the configuration as "maxRetries" and "retryPause" respectively.

For completeness, this also patches a typo in the name of the "Write to Elasticesarch" transform ;)


TODO:

  • Implement FluentBackoff. See comments.

Here we see the logs for the re-attempts in action:

[Screenshot: worker logs showing the retry attempts]


codecov bot commented Jan 23, 2019

Codecov Report

Merging #1626 into master will decrease coverage by 3.8%.
The diff coverage is 0%.


@@            Coverage Diff            @@
##           master   #1626      +/-   ##
=========================================
- Coverage    72.7%   68.9%   -3.81%     
=========================================
  Files         186     174      -12     
  Lines        5764    5268     -496     
  Branches      329     321       -8     
=========================================
- Hits         4191    3630     -561     
- Misses       1573    1638      +65
Impacted Files Coverage Δ
...m/spotify/scio/elasticsearch/ElasticsearchIO.scala 0% <0%> (ø) ⬆️
...m/spotify/scio/elasticsearch/ElasticsearchIO.scala 0% <0%> (ø) ⬆️
...m/spotify/scio/elasticsearch/ElasticsearchIO.scala 0% <0%> (ø) ⬆️
.../spotify/scio/bigquery/client/BigQueryConfig.scala 0% <0%> (-70.59%) ⬇️
...scala/com/spotify/scio/bigquery/client/Cache.scala 0% <0%> (-66.67%) ⬇️
.../spotify/scio/bigquery/BigQueryPartitionUtil.scala 0% <0%> (-58.98%) ⬇️
...scala/com/spotify/scio/bigquery/BigQueryUtil.scala 50% <0%> (-50%) ⬇️
...la/com/spotify/scio/bigquery/client/QueryOps.scala 0.79% <0%> (-47.15%) ⬇️
...com/spotify/scio/bigquery/types/BigQueryType.scala 37.5% <0%> (-25%) ⬇️
...la/com/spotify/scio/bigquery/client/TableOps.scala 0% <0%> (-22.37%) ⬇️
... and 85 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b0d8103...d8c1724.


regadas commented Jan 23, 2019

Hi @soundofjw, thanks for your PR! 😄 When I added the ES6 module this was one of the things on my todo list! Thank you!

At first look this seems OK; however, I would like to see this using the FluentBackoff already provided by Apache Beam. WDYT? Do you think it's feasible for you to reflect this in your PR?
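For context, FluentBackoff is Apache Beam's builder for exponential backoff policies (org.apache.beam.sdk.util.FluentBackoff). A minimal, self-contained sketch of how it is typically used; the retry count and durations below are illustrative, not values from this PR:

// Sketch only: shows the FluentBackoff / BackOff / BackOffUtils trio from Beam.
import java.io.IOException;

import org.apache.beam.sdk.util.BackOff;
import org.apache.beam.sdk.util.BackOffUtils;
import org.apache.beam.sdk.util.FluentBackoff;
import org.apache.beam.sdk.util.Sleeper;
import org.joda.time.Duration;

public class FluentBackoffSketch {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Describe the policy once: up to 3 retries, starting at 5s and growing exponentially.
    final FluentBackoff backoffConfig = FluentBackoff.DEFAULT
        .withMaxRetries(3)
        .withInitialBackoff(Duration.standardSeconds(5));

    // Each unit of work gets its own BackOff instance.
    final BackOff backoff = backoffConfig.backoff();

    // next() sleeps for the next interval and returns false once the policy says stop.
    while (BackOffUtils.next(Sleeper.DEFAULT, backoff)) {
      System.out.println("retrying...");
    }
  }
}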

Josh Whelchel added 2 commits January 23, 2019 16:04
@soundofjw (Author)

@regadas I had no prior knowledge of this - I think it's a great idea.
I'll add this to the PR proper in the coming week. Can you put a "WIP" label on this sucker?

@regadas regadas added the WIP label Jan 23, 2019
Commit: …rch method
This exceeds the max number of params in Scala style, so rely on it coming as a cmdLine arg.
This reverts commit 00c299f.

@regadas regadas left a comment


@soundofjw cool! 👍 let's use FluentBackoff

@soundofjw (Author)

@regadas FluentBackoff changes are in - I did not go through the trouble of adding every FluentBackoff argument. Let me know if you think this is desired - my experience suggests the defaults may be OK.
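For reference, the remaining FluentBackoff knobs that stay at their Beam defaults would look roughly like this. The values are illustrative, and mapping retryPause onto withInitialBackoff in milliseconds is an assumption, not something settled in this PR:

// Requires org.apache.beam.sdk.util.FluentBackoff and org.joda.time.Duration.
this.backoffConfig = FluentBackoff.DEFAULT
    .withMaxRetries(maxRetries)                              // retries beyond the first attempt
    .withInitialBackoff(Duration.millis(retryPause))         // first pause (unit assumed to be ms)
    .withExponent(1.5)                                       // growth factor between pauses
    .withMaxBackoff(Duration.standardMinutes(1))             // cap on any single pause
    .withMaxCumulativeBackoff(Duration.standardMinutes(5));  // cap on total time spent waiting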

this.maxBulkRequestSize = maxBulkRequestSize;
this.clientSupplier = new ClientSupplier(clusterName, servers);
this.toActionRequests = toActionRequests;
this.error = error;

this.backoffConfig = FluentBackoff.DEFAULT
.withMaxRetries(maxRetries)
Contributor:

Suggested change:
- .withMaxRetries(maxRetries)
+ .withMaxRetries(maxRetries - 1)

Author:

If the param were maxAttempts, this would make sense - but when we say "3 retries," we are effectively saying 4 attempts. (FluentBackoff's withMaxRetries counts retries after the initial attempt, so withMaxRetries(3) allows up to 4 attempts.) I'd resolve to keep this as is.


do {
try {
final BulkRequest bulkRequest = new BulkRequest().add(chunk).refresh(false);
Contributor:

@soundofjw no need for .refresh(false) it's already the default.

Author:

Yes, though it should be noted that when we are indexing with a refresh interval of -1, we're getting refreshes anyway (after a certain point, probably hitting the maximum amount of docs that can hang in the bulk queue) - it's really hard to know why this is happening. But to your point, changing this had no effect.

@regadas commented Feb 22, 2019:

Yeah, refresh at indexing time can happen due to several things ... document sizes, number of documents, queues... I guess at some point we can offer support, but let's not do it in the same PR, next one! 😄

final InetSocketAddress[] servers,
final long numOfShard,
final int maxBulkRequestSize,
final ThrowingConsumer<BulkExecutionException> error) {
Contributor:

@soundofjw can you undo the unrelated changes, keep the format?

this.maxBulkRequestSize = maxBulkRequestSize;
this.clientSupplier = new ClientSupplier(clusterName, servers);
this.toActionRequests = toActionRequests;
this.error = error;

this.backoffConfig = FluentBackoff.DEFAULT
Contributor:

@soundofjw This actually needs to be inside a

@Setup 
public void setup() throws Exception {
...
}
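A minimal sketch of that suggestion, building the backoff configuration in the DoFn lifecycle hook instead of the constructor. The field names come from the diff above; wiring retryPause into withInitialBackoff, and its millisecond unit, are assumptions:

// Requires org.apache.beam.sdk.transforms.DoFn.Setup, org.apache.beam.sdk.util.FluentBackoff
// and org.joda.time.Duration.
@Setup
public void setup() throws Exception {
  this.backoffConfig = FluentBackoff.DEFAULT
      .withMaxRetries(maxRetries)
      .withInitialBackoff(Duration.millis(retryPause));
}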


final BackOff backoff = backoffConfig.backoff();

do {
Contributor:

@soundofjw WDYT of removing this do { ... } while (true) in favor of a

while (BackOffUtils.next(Sleeper.DEFAULT, backoff)) {
 ...
}

removes the need for all of this

try {
  final long sleepTime = backoff.nextBackOffMillis();
  if (sleepTime == BackOff.STOP) {
    break;
  }
  Thread.sleep(sleepTime);
} catch (InterruptedException | IOException e1) {
  LOG.error("Interrupt during backoff", e1);
  break;
}

Author:

Your proposal will introduce a sleep at the beginning of every attempt:

  public static boolean next(Sleeper sleeper, BackOff backOff)
      throws InterruptedException, IOException {
    long backOffTime = backOff.nextBackOffMillis();
    if (backOffTime == BackOff.STOP) {
      return false;
    }
    sleeper.sleep(backOffTime);
    return true;
  }

Contributor:

Yeah, sorry for the poor explanation; you're correct. What I was trying to say is that I think we can refactor this block by decoupling the logic a little bit while still leveraging that util for the retry logic ... I think we get better semantics and readability.
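One possible shape of that refactor, attempting the bulk request first and only calling BackOffUtils.next after a failure so no pause happens before the first attempt. The clientSupplier.get() call, LOG, chunk, and the failure handling are assumptions based on the surrounding diff, not the PR's final code:

// Assumes org.apache.beam.sdk.util.{BackOff, BackOffUtils, Sleeper} and
// org.elasticsearch.action.bulk.{BulkRequest, BulkResponse} are imported.
final BackOff backoff = backoffConfig.backoff();
while (true) {
  try {
    final BulkRequest bulkRequest = new BulkRequest().add(chunk);
    final BulkResponse response = clientSupplier.get().bulk(bulkRequest).actionGet();
    if (!response.hasFailures()) {
      break; // chunk written successfully
    }
    LOG.warn("Bulk request had failures: {}", response.buildFailureMessage());
  } catch (Exception e) {
    LOG.warn("Bulk request threw", e);
  }
  // Sleeps for the next interval; returns false once maxRetries has been exhausted.
  if (!BackOffUtils.next(Sleeper.DEFAULT, backoff)) {
    throw new IOException("Bulk request still failing after all retries");
  }
}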



do {
try {
final BulkRequest bulkRequest = new BulkRequest().add(chunk)
.setRefreshPolicy(WriteRequest.RefreshPolicy.NONE);
Contributor:

Same here ... this policy is already the default.
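So the chunk can simply be added without setting any refresh policy at all (sketch):

final BulkRequest bulkRequest = new BulkRequest().add(chunk); // RefreshPolicy.NONE is already the default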



@jto jto added this to the 0.8.0 milestone Mar 22, 2019
@nevillelyh nevillelyh force-pushed the master branch 2 times, most recently from 9bc5e21 to b0d8103 Compare March 29, 2019 15:06
@jto jto requested a review from regadas April 9, 2019 09:47

jto commented Apr 9, 2019

@regadas what's the status of this?


regadas commented Apr 9, 2019

Closed via #1826
