Elasticsearch partial bulk failures #140

Open
binarylogic opened this issue Mar 14, 2019 · 15 comments
Labels

  * `domain: networking` - Anything related to Vector's networking
  * `domain: reliability` - Anything related to Vector's reliability
  * `domain: sinks` - Anything related to Vector's sinks
  * `have: should` - We should have this feature, but it is not required. It is medium priority.
  * `needs: requirements` - Needs a list of requirements before work can begin.
  * `needs: rfc` - Needs an RFC before work can begin.
  * `sink: elasticsearch` - Anything `elasticsearch` sink related
  * `type: enhancement` - A value-adding code change that enhances its existing functionality.

Comments

@binarylogic
Contributor

I would like to think about partial failures when ingesting data into Elasticsearch, and whether we even want to handle this scenario. Options include:

  1. Don't do anything.
  2. Collect individual failed items and retry. This will probably require error code matching (sketched below).
  3. Provide an option to dead-letter the items.

This is not urgent but I wanted to get it up for discussion and posterity.
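
For reference, options 2 and 3 would both need to walk the per-item results of the Elasticsearch `_bulk` response, since a partial failure still comes back as an HTTP 2xx with a root-level `"errors": true` flag. A minimal sketch of that inspection, assuming `serde_json` and purely illustrative status-code rules (this is not Vector's actual code):

```rust
use serde_json::Value;

/// Sketch of options 2/3: split the per-item results of a `_bulk` response
/// into retriable failures and dead-letter candidates by status code.
/// The status-code buckets below are illustrative, not a settled policy.
fn partition_bulk_failures(body: &[u8]) -> serde_json::Result<(Vec<usize>, Vec<usize>)> {
    let response: Value = serde_json::from_slice(body)?;
    let mut retriable = Vec::new();   // e.g. 429 es_rejected_execution_exception
    let mut dead_letter = Vec::new(); // e.g. 400 mapper_parsing_exception

    if response["errors"].as_bool() == Some(true) {
        for (i, item) in response["items"].as_array().into_iter().flatten().enumerate() {
            // Each item is keyed by its op type ("index", "create", ...).
            let status = item
                .as_object()
                .and_then(|op| op.values().next())
                .and_then(|result| result["status"].as_u64())
                .unwrap_or(200);
            match status {
                200..=299 => {}                       // item succeeded
                429 | 500..=599 => retriable.push(i), // transient, retry later
                _ => dead_letter.push(i),             // permanent, don't retry
            }
        }
    }
    Ok((retriable, dead_letter))
}
```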

@binarylogic binarylogic added the sink: elasticsearch Anything `elasticsearch` sink related label Mar 14, 2019
@lukesteensen
Member

I spent some time on this as part of #139, and I think doing it right will require some rework of the util HttpSink. Right now, we don't have a good way to parse the body of an HTTP response and inspect it in our retry logic. One option would be to introduce some kind of response mapper we pass in. Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks. The latter would mesh well with making those things configurable per-sink, so that might be the most promising route to explore.
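
To make the response-mapper idea concrete, a rough sketch of what such a hook could look like is below. The names (`ResponseMapper`, `RetryAction`) and signatures are hypothetical, not the existing `sinks::util` API:

```rust
/// Hypothetical shape of the "response mapper" hook: the outer sink supplies
/// something that can look at the response body, so retry decisions are no
/// longer limited to the HTTP status code.
pub enum RetryAction {
    /// The request succeeded.
    Successful,
    /// Transient failure: re-enqueue the request, with a reason.
    Retry(String),
    /// Permanent failure: log, drop, or dead-letter, with a reason.
    DontRetry(String),
}

pub trait ResponseMapper {
    /// Inspect the full response, including its body, and decide
    /// what the retry logic should do with it.
    fn map(&self, status: u16, body: &[u8]) -> RetryAction;
}
```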

@binarylogic
Contributor Author

binarylogic commented Mar 21, 2019

> Another would be to strip down the inner sink, just using it as a service and pulling up things like timeouts and retries to the level of the outer sinks.

I like that and agree.

To add more context, this is definitely a later-version feature and I don't think we should get cute with it. The other issue is that the response body can be very large depending on how much data was flushed, which can also cause performance and memory issues. I think a good first step is to check the root-level "errors" key and then just retry the entire request. To prevent duplicate data, users can generate their own deterministic IDs, which ES will reject on duplicate requests.
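
That first step could be as small as the check below: a sketch, not the shipped code, that only reads the root-level `errors` flag of the bulk response and leaves deduplication to user-supplied IDs (`serde_json` assumed):

```rust
use serde_json::Value;

/// Sketch of the "good first step": ignore the per-item results entirely and
/// only check the root-level `errors` flag. If it is `true`, the whole
/// request is treated as failed and retried as one unit.
fn bulk_had_failures(body: &[u8]) -> bool {
    serde_json::from_slice::<Value>(body)
        .map(|v| v["errors"].as_bool() == Some(true))
        // An unparsable body is a different failure mode, handled elsewhere.
        .unwrap_or(false)
}
```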

@lukesteensen lukesteensen added this to the 0.1.1 milestone Apr 22, 2019
@binarylogic binarylogic removed this from the 0.1.1 milestone Jun 7, 2019
@binarylogic binarylogic added the type: enhancement A value-adding code change that enhances its existing functionality. label Aug 7, 2019
@binarylogic binarylogic added have: nice This feature is nice to have. It is low priority. meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. labels Apr 8, 2020
@binarylogic binarylogic added have: should We should have this feature, but is not required. It is medium priority. and removed have: nice This feature is nice to have. It is low priority. meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. labels Jul 19, 2020
@binarylogic binarylogic added domain: reliability Anything related to Vector's reliability domain: sinks Anything related to the Vector's sinks domain: networking Anything related to Vector's networking labels Aug 7, 2020
@jszwedko
Member

jszwedko commented Nov 9, 2020

Noting that a user ran into this again today: https://discord.com/channels/742820443487993987/746070591097798688/775478035738525707

Also noting that @fanatid attempted something like what @lukesteensen mentioned, retrying partial failures, in the #2755 sink, though it looks like we'll be reverting and reintroducing that handling later.

@binarylogic I'd tag this as a have: must unless I'm missing something. I feel like it is a fairly common failure mode for ES to reject individual records and, right now, these events just fall on the floor without even any indication in the logs. I agree that just retrying the whole request would be better than nothing, given we are using the index action and id conflicts would be handled.

@binarylogic
Contributor Author

Yeah, there are a few reasons I marked it as should:

  1. Logstash does not retry partial failures.
  2. The status code returned by Elasticsearch is 201.
  3. The most common reason for partial failures is events that violate the schema, which can be ignored via the `ignore_malformed` Elasticsearch setting.
  4. We can't retry the entire request unless we also set the _id field to avoid inserting duplicate data.

Given the above, I'm wondering if partial retries are the best path forward as opposed to a dead-letter sink. There will be circumstances where partial retries will never succeed.
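
On point 4: with an explicit `_id`, retrying the whole request is effectively idempotent, since an `index` op on an existing id overwrites the same document and a `create` op on an existing id is rejected with a 409. An illustrative (hypothetical) way to derive such an id from the event payload:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: derive a deterministic document id from the serialized
/// event so a retried bulk request cannot duplicate data. A real setup would
/// use a stable content hash (e.g. SHA-256) or an id already present in the
/// event; `DefaultHasher` output is not guaranteed to be stable across Rust
/// releases.
fn deterministic_id(serialized_event: &str) -> String {
    let mut hasher = DefaultHasher::new();
    serialized_event.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```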

@jszwedko
Member

Those are fair reasons; there are certainly cases where partial retries will never succeed, and it would be good to have dead-letter support.

However, in the case mentioned in Discord, none of the inserts in the request were successful, and they would never succeed on a retry, yet we were continuing to process events without note. Depending on the source the events are coming from, this could require some work on the user's part to replay messages once they noticed that none were making it into ES.

At the least, I think we should be logging and reporting a metric for failed inserts into ES.
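
A sketch of what that could look like, assuming `serde_json` and `tracing` (hypothetical function, not the code that later landed in #2185):

```rust
use serde_json::Value;
use tracing::error;

/// Sketch: count the rejected items in a `_bulk` response and surface them
/// in the logs, so dropped events are at least visible to operators.
fn report_partial_failures(body: &[u8]) {
    let Ok(response) = serde_json::from_slice::<Value>(body) else { return };
    if response["errors"].as_bool() != Some(true) {
        return;
    }
    let failed = response["items"]
        .as_array()
        .into_iter()
        .flatten()
        // An item that failed carries an "error" object under its op type.
        .filter(|item| {
            item.as_object()
                .and_then(|op| op.values().next())
                .map_or(false, |result| result.get("error").is_some())
        })
        .count();
    // A real implementation would also increment a failure counter metric here.
    error!(failed_events = failed as u64, "Elasticsearch rejected events in a bulk request.");
}
```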

@binarylogic
Contributor Author

> At the least, I think we should be logging and reporting a metric for failed inserts into ES.

Agree, and @bruceg implemented logging for partial failures in #2185.

@jszwedko
Member

jszwedko commented Nov 10, 2020

It seems like that was dropped along the way because it isn't in master:

https://github.com/timberio/vector/blob/a789e42159cc8a617da0c2f8f7df8f5de91fca06/src/sinks/elasticsearch.rs#L300-L326

EDIT: never mind, I see it was just moved 😄

I'll try this out as it seemed like the user did not see any errors.

@binarylogic
Contributor Author

Yeah, and clearly it would be nice to have a test for this if there's a way.

@jszwedko
Member

Indeed, I was mistaken; apologies for the goose chase. I do see errors:

Nov 10 16:11:52.532  INFO vector::app: Log level is enabled. level="info"
Nov 10 16:11:52.534  INFO vector::app: Loading configs. path=["/home/jesse/workspace/vector-configs/elasticsearch.toml"]
Nov 10 16:11:52.560  INFO vector::sources::stdin: Capturing STDIN.
Nov 10 16:11:52.593  INFO vector::topology: Running healthchecks.
Nov 10 16:11:52.596  INFO vector::topology: Starting source. name="stdin"
Nov 10 16:11:52.596  INFO vector::topology: Starting sink. name="elasticsearch"
Nov 10 16:11:52.596  INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-1298-g76bb25c" released="Tue, 10 Nov 2020 16:02:23 +0000" arch="x86_64"
Nov 10 16:11:52.602  INFO vector::topology::builder: Healthcheck: Passed.
Nov 10 16:11:53.816 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:54.754 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=1}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:55.873 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=2}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:56.801 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=3}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"
Nov 10 16:11:57.895 ERROR sink{component_kind="sink" component_name=elasticsearch component_type=elasticsearch}:request{request_id=4}: vector::sinks::util::retries: Not retriable; dropping the request. reason="error type: illegal_argument_exception, reason: only write ops with an op_type of create are allowed in data streams"

@binarylogic binarylogic added this to the 2020-12-21 Kryptek Yeti milestone Dec 6, 2020
@binarylogic binarylogic added needs: requirements Needs a a list of requirements before work can be begin needs: rfc Needs an RFC before work can begin. labels Dec 6, 2020
@jamtur01 jamtur01 removed this from the 2020-12-21 Kryptek Yeti milestone Dec 21, 2020
@narendraingale2

narendraingale2 commented Jul 5, 2021

We were trying to use Vector as a server that reads data from Kafka and pushes it into Elasticsearch, but right now Vector does not support handling mapping conflicts/errors. We don't want to lose the errored-out events. There should be a config option for a DLQ, so that errored-out events can be manipulated with the help of transforms and reprocessed.

@Thor77

Thor77 commented Mar 1, 2022

If partial retries are too hard to implement for now, I would like to have an option to just retry the full request (which is completely safe when setting the `_id` field), if that would help get this implemented more quickly.

@ojh3636

ojh3636 commented May 16, 2022

Is there any progress here?

@jszwedko
Member

> Is there any progress here?

Not yet, unfortunately, but it is on our radar.

@zamazan4ik
Contributor

@jszwedko what is the current location of this issue on your radar? :)

@jszwedko
Member

> @jszwedko what is the current location of this issue on your radar? :)

This may be something we tackle in Q4. We'll likely start with the suggested approach of just retrying the whole payload at first and do something more sophisticated in the future to only retry failed events.
