cartridge: expose operation last error to issues #74

Closed

Conversation

@DifferentialOrange (Member) commented Apr 12, 2024

Expose last operation error to Cartridge issues.

[screenshot: issues list]
(Issues are also exposed to the default Grafana dashboard, as well as the default alerts.)

[screenshot: issue message]
(The error message could be improved, but it has always been like this: I haven't changed anything here in this patch.)

The original issue was about exposing migrations inconsistency from the new migrations tab to Cartridge issues as well. But a straightforward approach is rather bad: checking inconsistency is a full cluster map-reduce operation, and, if exposed to get_issues, it would emit N^2 network requests, since issues are collected from each instance, there is no way to check whether migrations are consistent without a cluster map-reduce, and there is no distinct migrator provider -- any instance is a migration provider. And, since get_issues may trigger rather often, having such a feature may make the cluster unhealthy (we already had similar things with metrics [1]). The last error is reset on each operation call.

  1. tnt_cartridge_issues gather only local issues (tarantool/metrics#243)

Closes #73

Expose last operation error to Cartridge issues.

The original issue was about exposing migrations inconsistency from
the new `migrations` tab to Cartridge issues as well. But a
straightforward approach is rather bad: checking inconsistency is
a full cluster map-reduce operation, and, if exposed to `get_issues`,
it would emit N^2 network requests, since issues are collected from
each instance, there is no way to check whether migrations are
consistent without a cluster map-reduce, and there is no distinct
migrator provider -- any instance is a migration provider. And, since
`get_issues` may trigger rather often, having such a feature may make
the cluster unhealthy (we already had similar things with metrics
[1]). The last error is reset on each operation call.

1. tarantool/metrics#243

Closes #73
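
As a rough illustration of the approach described above, here is a minimal sketch. It assumes a role-level `get_issues` callback that is collected per instance; all names (`apply_migrations`, `last_error`, the migration name) are hypothetical and do not necessarily match the actual patch.

```lua
-- Minimal sketch, not the actual patch: cache the last operation error
-- per instance and report it through an issues callback.

-- Placeholder for the real migration apply step.
local function apply_migrations()
    error('migration 0001_init failed', 0)
end

local last_error = nil

local function up()
    last_error = nil  -- the cached error is reset on each operation call
    local ok, err = pcall(apply_migrations)
    if not ok then
        last_error = err
    end
    return ok, err
end

-- Only local state is inspected here, so collecting issues from every
-- instance does not trigger any extra cluster-wide map-reduce.
local function get_issues()
    if last_error == nil then
        return {}
    end
    return {{
        level = 'warning',
        topic = 'migrations',
        message = ('Migrations failed: %s'):format(tostring(last_error)),
    }}
end

return {up = up, get_issues = get_issues}
```
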
@DifferentialOrange requested review from psergee, better0fdead and filonenko-mikhail and removed the request for psergee · Apr 12, 2024, 11:57
@DifferentialOrange (Member, Author) commented Apr 15, 2024

The problem @filonenko-mikhail pointed out: the error is lost on restart, but the inconsistency is not.

@DifferentialOrange (Member, Author) commented

For now, I don't see any perfect solution to this one, for two reasons:

  • checking for inconsistency is always a full cluster operation,
  • the module does not have a single entrypoint in terms of Cartridge roles -- a user can trigger migrator.up from any node.

If we start to check for inconsistencies on instance start, it may break the cluster in case of a new cluster start, a full cluster restart, a half cluster restart, etc., since it would be N^2 requests again.

@DifferentialOrange (Member, Author) commented

Persisting the error on the up caller also doesn't seem like a good solution, since one may start migrations from an RO instance.

@DifferentialOrange (Member, Author) commented

Nonetheless, this solution is broken even without restarts: errors are reset on each up, but if an error has been caught on instance 1 and then one calls up on instance 2, everything will be consistent after the second up, yet the issue will still be there, since it is cached per instance.

@yngvar-antonsson commented

> The problem @filonenko-mikhail pointed out: the error is lost on restart, but the inconsistency is not.

We have several similar issues in Cartridge. I propose just adding a note that the issue stays until restart.
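
For illustration, such a note could be as small as extending the cached issue text (hypothetical wording, not part of this patch):

```lua
-- Hypothetical: hint in the issue text that it is cached per instance and
-- stays until the next migrations call or an instance restart.
local function format_issue_message(err)
    local note = 'This issue is cached on the instance and stays until ' ..
        'the next migrations call or an instance restart.'
    return ('Migrations failed: %s. %s'):format(tostring(err), note)
end
```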

@yngvar-antonsson commented

> Nonetheless, this solution is broken even without restarts: errors are reset on each up, but if an error has been caught on instance 1 and then one calls up on instance 2, everything will be consistent after the second up, yet the issue will still be there, since it is cached per instance.

Maybe we could add a "clear cached issues" button in Cartridge? Users can check the actual status of migrations on the migrations tab, can't they?
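
Continuing the sketch from the description, such a reset could be a trivial helper exposed to the UI (hypothetical; clearing the issue cluster-wide would still require calling it on every instance):

```lua
-- Hypothetical helper: clear the per-instance cached error so the issue
-- disappears from get_issues() without restarting the instance.
local function clear_last_error()
    last_error = nil
end
```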

Development

Successfully merging this pull request may close these issues.

Display migrations inconsistency on issues