-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA-395174: Try to unarchive VM's metrics when they aren't running #5855
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some minor comments.
Non-running VMs' metrics are stored in the coordinator. When the coordinator is asked about the metrics try to unarchive them instead of failing while trying to fetch the coordinator's IP address. This needs to force the HTTP method of the query to be POST Also returns a Service Unavailable when the host is marked as Broken. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Forces users to use an address, instead of being implicit, this avoid the underlying cause for the issue fixed in the previous commit: it allowed a coordinator to call Pool_role.get_master_address, which always fails. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This makes the selection of the action obvious, previously the two booleans made it hazy to understand the decision, and was part of the error why the coordinator tried to get the coordinator address from the pool_role file (and failed badly) Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Currently a List.assoc is used, which raises an unhandled exception. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without being familiar with the domain, I can't really review this. I did not spot anything, though.
ocaml/xapi/rrdd_proxy.ml
Outdated
unarchive () | ||
else | ||
| Slave _, true, _ | Slave _, _, false -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the current host is a slave, and the owner is a the local host (which is the Slave _, true, _
case), why do we need to ask the coordinator here? Although it seems that line 69 already checks that the metrics is not available locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a host reboot and starts running a VM again, it needs to fetch the VM's metrics, leading to this convoluted flow:
member ----get_metrics---> coordinator ----get_metrics---> member ---unarchive---> coordinator
Non-running VMs' metrics are stored in the coordinator. When the coordinator is
asked about the metrics try to unarchive them instead of failing while trying
to fetch the coordinator's IP address.
This needs to force the HTTP method of the query to be POST
Also returns a Service Unavailable when the host is marked as Broken.
A build with these changes has passed the automated tests that discovered the issue: Job 4055228