Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-395174: Try to unarchive VM's metrics when they aren't running #5855

Merged
merged 5 commits into from
Jul 22, 2024

Conversation

psafont
Copy link
Member

@psafont psafont commented Jul 18, 2024

Non-running VMs' metrics are stored in the coordinator. When the coordinator is
asked about the metrics try to unarchive them instead of failing while trying
to fetch the coordinator's IP address.

This needs to force the HTTP method of the query to be POST

Also returns a Service Unavailable when the host is marked as Broken.

A build with these changes has passed the automated tests that discovered the issue: Job 4055228

Copy link
Contributor

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor comments.

Non-running VMs' metrics are stored in the coordinator. When the coordinator is
asked about the metrics try to unarchive them instead of failing while trying
to fetch the coordinator's IP address.

This needs to force the HTTP method of the query to be POST

Also returns a Service Unavailable when the host is marked as Broken.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Forces users to use an address, instead of being implicit, this avoid the
underlying cause for the issue fixed in the previous commit: it allowed a
coordinator to call Pool_role.get_master_address, which always fails.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This makes the selection of the action obvious, previously the two booleans
made it hazy to understand the decision, and was part of the error why the
coordinator tried to get the coordinator address from the pool_role file (and
failed badly)

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Currently a List.assoc is used, which raises an unhandled exception.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Copy link
Contributor

@lindig lindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without being familiar with the domain, I can't really review this. I did not spot anything, though.

unarchive ()
else
| Slave _, true, _ | Slave _, _, false ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the current host is a slave, and the owner is a the local host (which is the Slave _, true, _ case), why do we need to ask the coordinator here? Although it seems that line 69 already checks that the metrics is not available locally

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a host reboot and starts running a VM again, it needs to fetch the VM's metrics, leading to this convoluted flow:
member ----get_metrics---> coordinator ----get_metrics---> member ---unarchive---> coordinator

@psafont psafont merged commit 54abab8 into xapi-project:master Jul 22, 2024
15 checks passed
@psafont psafont deleted the rrd-unpack branch July 22, 2024 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants