When creating a new Index, ZDB needs to wait for ES health status of "yellow" before returning control #79

Closed
eeeebbbbrrrr opened this issue Jan 29, 2016 · 5 comments

@eeeebbbbrrrr
Collaborator

The travis-ci tests occasionally fail the Postgres regression test for issue #58. After quite a bit of debugging, it turns out the failure isn't related to the changes introduced in issue #58; instead, the timing of the test is such that we sometimes call ZDB's _pgcount endpoint before the newly created ES index has finished moving all of its shards to the STARTED state.

After some head scratching, chatting with @nz (thanks!), and reading the documentation, it looks like ZDB needs to call the /_cluster/health/<index_name> endpoint with ?wait_for_status=yellow before returning control back to Postgres. This should ensure that (at least) all the primary shards of a newly created index are actually available for use.
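
For anyone following along, here's a rough sketch of what I mean, using the ES Java Client API. This is not the actual ZDB code; the class/method names and the 30-second timeout are just placeholders:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class IndexReadiness {

    /**
     * Block until the named index reaches (at least) "yellow" health, i.e. all of
     * its primary shards are STARTED.  The REST equivalent is:
     *   GET /_cluster/health/<index_name>?wait_for_status=yellow
     */
    public static void waitForYellow(Client client, String indexName) {
        ClusterHealthResponse health = client.admin().cluster()
                .prepareHealth(indexName)
                .setWaitForYellowStatus()
                .setTimeout(TimeValue.timeValueSeconds(30)) // arbitrary timeout for the sketch
                .execute().actionGet();

        if (health.isTimedOut())
            throw new RuntimeException("index [" + indexName + "] did not reach yellow status in time");
    }
}
```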

@eeeebbbbrrrr
Collaborator Author

For reference, this gist (https://gist.github.com/eeeebbbbrrrr/968edf5941c654f240ca) is a little shell script that can re-create the problem from the command line.

@eeeebbbbrrrr
Collaborator Author

I wonder if doing this only when creating a new index is enough. I have a feeling it could be necessary in any situation where ZDB runs a SearchRequest. Basically, if the SearchResponse indicates that total shards != successful shards while failed shards is zero, then wait on ?wait_for_status=yellow and try the search again.

This might be necessary in cases where a node (or the entire cluster) has been restarted and ZDB tries to query before all the indexes are at least yellow.

I'm not going to do anything about this case right now, but I wanted to note that I've at least considered it as a potential problem; there are no reports of it yet.
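
If it ever does become a problem, the shape of the fix would be something like the sketch below (again, not actual ZDB code; it assumes the ES Java client, and searchWithYellowRetry() is a made-up helper that retries at most once):

```java
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;

public class ShardAwareSearch {

    /**
     * Run a search; if some shards simply didn't respond (successful < total)
     * but nothing actually failed, wait for "yellow" on the index and retry once.
     */
    public static SearchResponse searchWithYellowRetry(Client client, String indexName, SearchRequestBuilder search) {
        SearchResponse response = search.execute().actionGet();

        boolean incomplete = response.getSuccessfulShards() != response.getTotalShards();
        boolean noFailures = response.getFailedShards() == 0;

        if (incomplete && noFailures) {
            // the missing shards are most likely still INITIALIZING, so wait for
            // the primaries to start and run the search one more time
            client.admin().cluster()
                    .prepareHealth(indexName)
                    .setWaitForYellowStatus()
                    .execute().actionGet();
            response = search.execute().actionGet();
        }

        return response;
    }
}
```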

@nz commented Jan 29, 2016

There are other cases where some shards report failure. Shard corruption is one; another is syntax/query errors when searching across multiple indexes with mapping mismatches. Individual shard timeouts are also plausible. You might end up stuck if you check the health too often :-)

@eeeebbbbrrrr
Collaborator Author

Those are good points, and all the more reason to hold off on doing something like this everywhere. ZDB is very good at detecting (and re-throwing) actual failures, which I suspect corruption/timeout issues would cause.

In this case, where "successful shards" is not the same as "total shards", there's no actual indication of failure (i.e., .getFailedShards() is zero). ES just doesn't seem to consider it a failure when not all shards respond, as long as the missing ones are (at least) in an INITIALIZING state.

eeeebbbbrrrr added a commit that referenced this issue Jan 30, 2016
eeeebbbbrrrr self-assigned this Feb 3, 2016
eeeebbbbrrrr mentioned this issue Feb 3, 2016
eeeebbbbrrrr added a commit that referenced this issue Feb 3, 2016
@eeeebbbbrrrr
Collaborator Author

to be released in v2.6.4
