Elasticsearch partial bulk failures #140

Open
binarylogic opened this issue Mar 14, 2019 · 15 comments
Labels

  * `domain: networking` - Anything related to Vector's networking
  * `domain: reliability` - Anything related to Vector's reliability
  * `domain: sinks` - Anything related to Vector's sinks
  * `have: should` - We should have this feature, but it is not required. It is medium priority.
  * `needs: requirements` - Needs a list of requirements before work can begin.
  * `needs: rfc` - Needs an RFC before work can begin.
  * `sink: elasticsearch` - Anything `elasticsearch` sink related
  * `type: enhancement` - A value-adding code change that enhances its existing functionality.

Comments

@binarylogic
Contributor

I would like to think about partial failures when ingesting data into Elasticsearch, and whether we even want to handle this scenario. Options include:

  1. Don't do anything.
  2. Collect individual failed items and retry. This will probably require error code matching (sketched below).
  3. Provide an option to dead-letter the items.

This is not urgent but I wanted to get it up for discussion and posterity.
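
For reference, options 2 and 3 would both need to walk the per-item results of the Elasticsearch `_bulk` response, since a partial failure still comes back as an HTTP 2xx with a root-level `"errors": true` flag. A minimal sketch of that inspection, assuming `serde_json` and purely illustrative status-code rules (this is not Vector's actual code):

```rust
use serde_json::Value;

/// Sketch of options 2/3: split the per-item results of a `_bulk` response
/// into retriable failures and dead-letter candidates by status code.
/// The status-code buckets below are illustrative, not a settled policy.
fn partition_bulk_failures(body: &[u8]) -> serde_json::Result<(Vec<usize>, Vec<usize>)> {
    let response: Value = serde_json::from_slice(body)?;
    let mut retriable = Vec::new();   // e.g. 429 es_rejected_execution_exception
    let mut dead_letter = Vec::new(); // e.g. 400 mapper_parsing_exception

    if response["errors"].as_bool() == Some(true) {
        for (i, item) in response["items"].as_array().into_iter().flatten().enumerate() {
            // Each item is keyed by its op type ("index", "create", ...).
            let status = item
                .as_object()
                .and_then(|op| op.values().next())
                .and_then(|result| result["status"].as_u64())
                .unwrap_or(200);
            match status {
                200..=299 => {}                       // item succeeded
                429 | 500..=599 => retriable.push(i), // transient, retry later
                _ => dead_letter.push(i),             // permanent, don't retry
            }
        }
    }
    Ok((retriable, dead_letter))
}
```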

@binarylogic binarylogic added the sink: elasticsearch Anything `elasticsearch` sink related label Mar 14, 2019
@lukesteensen
Member

I spent some time on this as part of #139, and I think doing it right will require some rework of the util HttpSink. Right now, we don't have a good way to parse the body of an HTTP response and inspect it in our retry logic. One option would be to introduce some kind of response mapper we pass in. Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks. The latter would mesh well with making those things configurable per-sink, so that might be the most promising route to explore.
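
To make the response-mapper idea concrete, a rough sketch of what such a hook could look like is below. The names (`ResponseMapper`, `RetryAction`) and signatures are hypothetical, not the existing `sinks::util` API:

```rust
/// Hypothetical shape of the "response mapper" hook: the outer sink supplies
/// something that can look at the response body, so retry decisions are no
/// longer limited to the HTTP status code.
pub enum RetryAction {
    /// The request succeeded.
    Successful,
    /// Transient failure: re-enqueue the request, with a reason.
    Retry(String),
    /// Permanent failure: log, drop, or dead-letter, with a reason.
    DontRetry(String),
}

pub trait ResponseMapper {
    /// Inspect the full response, including its body, and decide
    /// what the retry logic should do with it.
    fn map(&self, status: u16, body: &[u8]) -> RetryAction;
}
```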

@binarylogic
Contributor Author

binarylogic commented Mar 21, 2019

> Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks.

I like that and agree.

To add more context, this is definitely a later-version feature and I don't think we should get cute with it. The other issue is that the response body can be very large depending on how much data was flushed, which can also cause performance and memory issues. I think a good first step is to check the root-level "errors" key and then just retry the entire request. To prevent duplicate data, users can generate their own deterministic IDs, which ES will reject on duplicate requests.
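
That first step could be as small as the check below: a sketch, not the shipped code, that only reads the root-level `errors` flag of the bulk response and leaves deduplication to user-supplied IDs (`serde_json` assumed):

```rust
use serde_json::Value;

/// Sketch of the "good first step": ignore the per-item results entirely and
/// only check the root-level `errors` flag. If it is `true`, the whole
/// request is treated as failed and retried as one unit.
fn bulk_had_failures(body: &[u8]) -> bool {
    serde_json::from_slice::<Value>(body)
        .map(|v| v["errors"].as_bool() == Some(true))
        // An unparsable body is a different failure mode, handled elsewhere.
        .unwrap_or(false)
}
```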

@lukesteensen lukesteensen added this to the 0.1.1 milestone Apr 22, 2019
@binarylogic binarylogic removed this from the 0.1.1 milestone Jun 7, 2019
@binarylogic binarylogic added the type: enhancement A value-adding code change that enhances its existing functionality. label Aug 7, 2019
@binarylogic binarylogic added have: nice This feature is nice to have. It is low priority. meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. labels Apr 8, 2020
@binarylogic binarylogic added have: should We should have this feature, but is not required. It is medium priority. and removed have: nice This feature is nice to have. It is low priority. meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. labels Jul 19, 2020
@binarylogic binarylogic added domain: reliability Anything related to Vector's reliability domain: sinks Anything related to the Vector's sinks domain: networking Anything related to Vector's networking labels Aug 7, 2020
@jszwedko
Member

jszwedko commented Nov 9, 2020

Noting that a user ran into this again today: https://discord.com/channels/742820443487993987/746070591097798688/775478035738525707

Also noting that @fanatid attempted something like what @lukesteensen mentioned, retrying partial failures, in the #2755 sink, though it looks like we'll be reverting and reintroducing that handling later.

@binarylogic I'd tag this as a have: must unless I'm missing something. I feel like it is a fairly common failure mode for ES to reject individual records and, right now, these events just fall on the floor without even any indication in the logs. I agree that just retrying the whole request would be better than nothing, given we are using the index action and id conflicts would be handled.

@binarylogic
Contributor Author

Yeah, there are a few reasons I marked it as should:

  1. Logstash does not retry partial failures.
  2. The status code returned by Elasticsearch is 201.
  3. The most common reason for partial failures is events that violate the schema, which can be ignored via the `ignore_malformed` Elasticsearch setting.
  4. We can't retry the entire request unless we also set the _id field to avoid inserting duplicate data.

Given the above, I'm wondering if partial retries are the best path forward as opposed to a dead-letter sink. There will be circumstances where partial retries will never succeed.
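
On point 4: with an explicit `_id`, retrying the whole request is effectively idempotent, since an `index` op on an existing id overwrites the same document and a `create` op on an existing id is rejected with a 409. An illustrative (hypothetical) way to derive such an id from the event payload:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: derive a deterministic document id from the serialized
/// event so a retried bulk request cannot duplicate data. A real setup would
/// use a stable content hash (e.g. SHA-256) or an id already present in the
/// event; `DefaultHasher` output is not guaranteed to be stable across Rust
/// releases.
fn deterministic_id(serialized_event: &str) -> String {
    let mut hasher = DefaultHasher::new();
    serialized_event.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```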

@jszwedko
Member

Those are fair reasons; there are certainly cases where partial retries will never succeed, and it would be good to have dead-letter support.

However, in the case mentioned in Discord, none of the inserts in the request were successful, and they would never succeed on a retry, yet we were continuing to process events without note. Depending on the source the events are coming from, this could require some work on the user's part to replay messages once they noticed that none were making it into ES.

At the least, I think we should be logging and reporting a metric for failed inserts into ES.
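
A sketch of what that could look like, assuming `serde_json` and `tracing` (hypothetical function, not the code that later landed in #2185):

```rust
use serde_json::Value;
use tracing::error;

/// Sketch: count the rejected items in a `_bulk` response and surface them
/// in the logs, so dropped events are at least visible to operators.
fn report_partial_failures(body: &[u8]) {
    let Ok(response) = serde_json::from_slice::<Value>(body) else { return };
    if response["errors"].as_bool() != Some(true) {
        return;
    }
    let failed = response["items"]
        .as_array()
        .into_iter()
        .flatten()
        // An item that failed carries an "error" object under its op type.
        .filter(|item| {
            item.as_object()
                .and_then(|op| op.values().next())
                .map_or(false, |result| result.get("error").is_some())
        })
        .count();
    // A real implementation would also increment a failure counter metric here.
    error!(failed_events = failed as u64, "Elasticsearch rejected events in a bulk request.");
}
```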

@binarylogic
Contributor Author

> At the least, I think we should be logging and reporting a metric for failed inserts into ES.

Agree, and @bruceg implemented logging for partial failures in #2185.

@jszwedko
Member

jszwedko commented Nov 10, 2020

It seems like that was dropped along the way because it isn't in master:

https://github.com/timberio/vector/blob/a789e42159cc8a617da0c2f8f7df8f5de91fca06/src/sinks/elasticsearch.rs#L300-L326

EDIT: never mind, I see it was just moved 😄

I'll try this out as it seemed like the user did not see any errors.

@binarylogic
Contributor Author

Yeah, and clearly it would be nice to have a test for this if there's a way.

@jszwedko
Member

Indeed, I was mistaken; apologies for the goose chase. I do see errors:

Nov 10 16:11:52.532  INFO vector::app: Log level is enabled. level="info"
Nov 10 16:11:52.534  INFO vector::app: Loading configs. path=["/home/jesse/workspace/vector-configs/elasticsearch.toml"]
Nov 10 16:11:52.560  INFO vector::sources::stdin: Capturing STDIN.
Nov 10 16:11:52.593  INFO vector::topology: Running healthchecks.
Nov 10 16:11:52.596  INFO vector::topology: Starting source. name="stdin"
Nov 10 16:11:52.596  INFO vector::topology: Starting sink. name="elasticsearch"
Nov 10 16:11:52.596  INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-1298-g76bb25c" released="Tue, 10 Nov 2020 16:02:23 +0000" arch="x86_64"
Nov 10 16:11:52.602  INFO vector::topology::builder: Healthcheck: Passed.
Nov 10 16:11:53.816 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:54.754 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=1}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:55.873 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=2}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:56.801 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=3}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:57.895 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=4}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"

@binarylogic binarylogic added this to the 2020-12-21 Kryptek Yeti milestone Dec 6, 2020
@binarylogic binarylogic added needs: requirements Needs a a list of requirements before work can be begin needs: rfc Needs an RFC before work can begin. labels Dec 6, 2020
@jamtur01 jamtur01 removed this from the 2020-12-21 Kryptek Yeti milestone Dec 21, 2020
@narendraingale2

narendraingale2 commented Jul 5, 2021

We were trying to use Vector as a server that reads data from Kafka and pushes it into Elasticsearch, but right now Vector does not support handling mapping conflicts/errors. We don't want to lose the errored-out events. There should be a config option for a DLQ, so that errored-out events can be manipulated with the help of transforms and reprocessed.

@Thor77

Thor77 commented Mar 1, 2022

If partial retries are too hard to implement for now, I would like to have an option to just retry the full request (which is completely safe when setting the `_id` field), if that would help get this implemented more quickly.

@ojh3636

ojh3636 commented May 16, 2022

Is there any progress here?

@jszwedko
Member

> Is there any progress here?

Not yet, unfortunately, but it is on our radar.

@zamazan4ik
Contributor

@jszwedko what is the current location of this issue on your radar? :)

@jszwedko
Member

> @jszwedko what is the current location of this issue on your radar? :)

This may be something we tackle in Q4. We'll likely start with the suggested approach of just retrying the whole payload at first and do something more sophisticated in the future to only retry failed events.
