
feat(ironic): Add node state metrics from versioned notifications#1963

Merged
nidzrai merged 13 commits into
mainfrom
feat/ironic-hardware-exporter-states
May 6, 2026
Conversation

@nidzrai
Contributor

@nidzrai nidzrai commented Apr 22, 2026

Added power_state, provision_state, maintenance, and fault metrics by consuming Ironic versioned notifications from a dedicated queue.
Added a second consumer for the ironic-hardware-exporter-states queue, bound to the ironic_versioned_notifications.info routing key.
Added a parser for baremetal.node.power_set / provision_set events.
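For reference, the event parsing described above can be sketched roughly like this. The envelope field names (`payload`, `ironic_object.data`) follow the oslo.versionedobjects convention, and the struct names (`StateMessage`, `notification`) are hypothetical stand-ins, not the PR's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// StateMessage is a hypothetical reduced form of the parsed state event.
type StateMessage struct {
	NodeUUID       string
	NodeName       string
	PowerState     *string
	ProvisionState *string
}

// notification mirrors the rough shape of an Ironic versioned notification;
// field names are assumptions to be verified against real bus traffic.
type notification struct {
	EventType string `json:"event_type"`
	Payload   struct {
		Data struct {
			UUID           string  `json:"uuid"`
			Name           string  `json:"name"`
			PowerState     *string `json:"power_state"`
			ProvisionState *string `json:"provision_state"`
		} `json:"ironic_object.data"`
	} `json:"payload"`
}

// parseStateEvent extracts a StateMessage from power_set/provision_set
// events and reports false for any other event type.
func parseStateEvent(body []byte) (StateMessage, bool) {
	var n notification
	if err := json.Unmarshal(body, &n); err != nil {
		return StateMessage{}, false
	}
	if !strings.HasPrefix(n.EventType, "baremetal.node.power_set") &&
		!strings.HasPrefix(n.EventType, "baremetal.node.provision_set") {
		return StateMessage{}, false
	}
	d := n.Payload.Data
	return StateMessage{
		NodeUUID:       d.UUID,
		NodeName:       d.Name,
		PowerState:     d.PowerState,
		ProvisionState: d.ProvisionState,
	}, true
}

func main() {
	raw := `{"event_type":"baremetal.node.power_set.end","payload":{"ironic_object.data":{"uuid":"a8a8548c","name":"Dell-24GSW04","power_state":"power off"}}}`
	if msg, ok := parseStateEvent([]byte(raw)); ok {
		fmt.Printf("node=%s power=%s\n", msg.NodeName, *msg.PowerState)
	}
}
```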

Log:

2026/04/22 18:26:54 HTTP server listening on :9608
2026/04/22 18:26:55 waiting for messages from queue: ironic-hardware-exporter-states
2026/04/22 18:26:55 waiting for messages from queue: ironic-hardware-exporter
2026/04/22 18:26:56 cached sensors node=Dell-93GSW04
2026/04/22 18:26:56 cached sensors node=Dell-F3GSW04
2026/04/22 18:26:56 cached sensors node=Dell-J3GSW04
2026/04/22 18:26:56 cached sensors node=Dell-24GSW04
2026/04/22 18:26:56 cached sensors node=Dell-53GSW04
2026/04/22 18:26:56 cached sensors node=Dell-B3GSW04
2026/04/22 18:26:56 cached sensors node=Dell-JBTSW04
2026/04/22 18:26:56 cached sensors node=Dell-D3GSW04
2026/04/22 18:27:02 GET /metrics — served 22 nodes
2026/04/22 21:42:11 cached state node=Dell-24GSW04 power=0x239aed3481d0 provision=0x239aed3481e0

❯ openstack baremetal node power off Dell-24GSW04
From /metrics:
ironic_node_power_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",power_state="power off"} 1
ironic_node_provision_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",provision_state="available"} 1
ironic_node_maintenance{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3"} 0
ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0

❯ openstack baremetal node power on Dell-24GSW04
2026/04/22 21:44:28 cached state node=Dell-24GSW04 power=0x239aed2ab400 provision=0x239aed2ab410
ironic_node_power_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",power_state="power on"} 1
ironic_node_provision_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",provision_state="available"} 1
ironic_node_maintenance{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3"} 0
ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0

Missing: the conductor_host label on sensor metrics, because it is not present in the sensor data payload.
WIP: deployment artifact and TLS.

The Helm chart follows the nautobotop Go project layout.

@nidzrai nidzrai requested review from cardoe and skrobul April 22, 2026 13:23
@nidzrai nidzrai force-pushed the feat/ironic-hardware-exporter-states branch from 6d790ad to ed18e1b Compare April 24, 2026 15:05
@ctria
Contributor

ctria commented Apr 24, 2026

in that metric:

ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0

shouldn't it be 1 instead of 0? There is one entity with node_uuid=XXX, node_name=YYY, conductor_host="ZZZ", and fault "none".

My assumption here is that if that node has a fault, all the labels will remain the same except "fault", which will now get the fault description. At THAT point fault="none" should drop to 0 (or disappear).

Do I read it wrongly?

@nidzrai
Contributor Author

nidzrai commented Apr 25, 2026

in that metric:

ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0

shouldn't it be 1 instead of 0? There is one entity with node_uuid=XXX, node_name=YYY, conductor_host="ZZZ", and fault "none".

My assumption here is that if that node has a fault, all the labels will remain the same except "fault", which will now get the fault description. At THAT point fault="none" should drop to 0 (or disappear).

Do I read it wrongly?

I treated fault more like a status/condition metric, where the numeric value indicates whether the node is faulted (0 = no, 1 = yes), rather than a pure state-label metric where the active label always has value 1.

@nidzrai
Contributor Author

nidzrai commented Apr 25, 2026

in that metric:

ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0

shouldn't it be 1 instead of 0? There is one entity with node_uuid=XXX, node_name=YYY, conductor_host="ZZZ", and fault "none".

My assumption here is that if that node has a fault, all the labels will remain the same except "fault", which will now get the fault description. At THAT point fault="none" should drop to 0 (or disappear).

Do I read it wrongly?

So I modeled ironic_node_fault as a boolean/status-style metric rather than a pure state-label metric.
The thought process was that fault is closer to a condition, "is the node faulted right now?", than to a normal lifecycle state like power_state or provision_state. With the current shape:

ironic_node_fault{...,fault="none"} 0 — the node is not faulted
ironic_node_fault{...,fault="<reason>"} 1 — the node is faulted, and the label carries the reason

The main reason I chose that shape is query behavior: with this model, expressions like sum(ironic_node_fault) directly answer "how many nodes are faulted right now?". If fault="none" were emitted as 1, that kind of query would count healthy nodes as well, which felt misleading to me.
There also isn't a direct upstream fault metric in ironic-prometheus-exporter to mirror exactly, but the closest upstream precedent for status/health-style metrics is to encode the status in the metric value rather than using an always-1 state-label pattern.
That said, I agree your reading is reasonable, especially since power_state / provision_state in this exporter do use the always-1 value with the state in a label.

So if we want fault to follow the same pattern for consistency, I’m happy to change it. My choice here was mainly to avoid the “healthy nodes count as 1” downside of the alternative.

@nidzrai nidzrai requested a review from a team April 27, 2026 16:04
@nidzrai nidzrai force-pushed the feat/ironic-hardware-exporter-states branch from ed18e1b to 22db994 Compare April 28, 2026 10:46
@cardoe
Contributor

cardoe commented May 4, 2026

Can we add some documentation to docs about the design of this code, how to build it, and how to validate it? We should also have, under the deployment docs, some information on how to deploy it.

store.UpdateNodeState(stateMsg)
log.Printf("cached state node=%s power=%v provision=%v",
stateMsg.NodeName, stateMsg.PowerState, stateMsg.ProvisionState)
}); err != nil {
Contributor

Blocking issue: if the states consumer exits here, the process keeps running and /health remains 200. bothReady will flip /ready to 503, but Kubernetes will not restart the pod from readiness alone and the exporter will continue serving stale state metrics. The sensor consumer path exits fatally on failure; the states path should do the same or share a supervisor/reconnect loop so state collection recovers.

Contributor Author

done


// nodeStateEventPrefixes are the baremetal.node events we consume.
// We use the .end and .success variants so we capture the final state.
var nodeStateEventPrefixes = []string{
Contributor

Because this parser also drives ironic_node_maintenance and ironic_node_fault, limiting the allow-list to power/provision events leaves those metrics stale when maintenance or fault changes through maintenance/update notifications. It also misses power_state_corrected, which the existing Nautobot sync path treats as a power-state event. Please include all Ironic node events that can change the fields exported here, or do not export fields this consumer cannot keep current.
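The widened allow-list could look roughly like this. power_state_corrected is named in this review; maintenance_set and update are assumed names for the maintenance/update notifications and should be verified against Ironic's notification documentation:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeStateEventPrefixes, widened to cover every node event that can change
// the exported fields. power_state_corrected is treated as a power-state
// event by the existing Nautobot sync path; maintenance_set and update are
// assumptions to be checked against actual bus traffic.
var nodeStateEventPrefixes = []string{
	"baremetal.node.power_set",
	"baremetal.node.provision_set",
	"baremetal.node.power_state_corrected",
	"baremetal.node.maintenance_set",
	"baremetal.node.update",
}

// isNodeStateEvent reports whether an event type matches the allow-list.
func isNodeStateEvent(eventType string) bool {
	for _, p := range nodeStateEventPrefixes {
		if strings.HasPrefix(eventType, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isNodeStateEvent("baremetal.node.power_state_corrected.success"))
	fmt.Println(isNodeStateEvent("baremetal.port.create.end"))
}
```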

Contributor Author

Will do these in a follow-up PR.

fmt.Fprintf(b, "# TYPE ironic_node_last_seen_timestamp_seconds gauge\n")
for _, n := range nodes {
fmt.Fprintf(&b, "ironic_node_last_seen_timestamp_seconds{node_uuid=%q,node_name=%q} %d\n",
fmt.Fprintf(b, "ironic_node_last_seen_timestamp_seconds{node_uuid=%q,node_name=%q} %d\n",
Contributor

State events can create a cache entry before any hardware metrics arrive because UpdateNodeState does that. For those entries LastSeen is the Go zero time, so this loop emits ironic_node_last_seen_timestamp_seconds with -62135596800, which is a bogus hardware metrics timestamp. Skip zero LastSeen values or track a separate state timestamp instead.

Contributor Author

Will do this in a follow-up PR.

if n.PowerState == nil {
continue
}
fmt.Fprintf(b, "ironic_node_power_state{node_uuid=%q,node_name=%q,conductor_host=%q,power_state=%q} 1\n",
Contributor

This current-state-only shape leaves the previous label series active in Prometheus until staleness kicks in. For example, after power off changes to power on, the old power_state="power off" sample can still match queries for several minutes. State metrics are safer if the exporter emits the complete enum as 0/1 for each node, or uses a numeric gauge without state labels.

Contributor Author

Will do these in a follow-up PR.

@cardoe
Contributor

cardoe commented May 4, 2026

Suggestion: switch the exporter release tag namespace to ironic-hardware-exporter/vX.Y.Z and make the CI/release path target that format consistently.

Concrete shape I would suggest:

  • Change the workflow tag trigger from ironic-hardware-exporter-v* to ironic-hardware-exporter/v*.
  • Extract the release version from GITHUB_REF_NAME with ironic-hardware-exporter/v stripped, so image and chart versions remain X.Y.Z.
  • Validate the ref early so manual dispatches or accidental tags fail unless they use refs/tags/ironic-hardware-exporter/vX.Y.Z.
  • Rename the GoReleaser env var from CUSTOM_TAG to something like RELEASE_VERSION, since the value passed into image tags and OCI labels is the cleaned version, not the full git tag.
  • Consider setting a GoReleaser release.name_template such as ironic-hardware-exporter v{{ .Env.RELEASE_VERSION }} so the GitHub release title stays clean while the git tag remains namespaced.

This aligns the exporter with the component-style tag format already used elsewhere, for example ironic-ipxe/v*, while keeping published image tags and Helm chart versions as plain semver (X.Y.Z).

Collaborator

@skrobul skrobul left a comment

lgtm with minor suggestions

}
}

// todo: circle back here for parsing
Collaborator

leftover todo

for _, n := range nodes {
for key, d := range n.Sensors.Drive {
val := 0.0
if d.State != nil && *d.State == "Enabled" {
Collaborator

Suggested change
if d.State != nil && *d.State == "Enabled" {
if d.State != nil && strings.EqualFold(*d.State, "Enabled") {

Comment on lines +127 to +129
if n.PowerState == nil && n.ProvisionState == nil {
continue
}
Collaborator

What is this check for? Does the maintenance status change always arrive only with power info?
I feel like there is some story behind this, but I couldn't figure out what it is.

Contributor Author

The old check was replaced by HasStateData (set once on the first UpdateNodeState() call). The old PowerState == nil && ProvisionState == nil check would have silently dropped maintenance-only events: maintenance.set arrives with both fields nil, since Ironic only sends what changed. HasStateData correctly gates on "has any state event arrived for this node" rather than "does this node have both specific fields populated".

Comment on lines +141 to +143
if n.PowerState == nil && n.ProvisionState == nil {
continue
}
Collaborator

similar as above - why is this one gated?

Contributor Author

This was guarding maintenance and fault emission, but maintenance.set events (which only set maintenance=true) arrive with both power_state and provision_state nil. This check was also replaced by HasStateData.

Contributor Author

Both checks are gone. transformer.go now has a single gate at the top of the stateFamilies loop: if !n.HasStateData { continue }. HasStateData is set by the first UpdateNodeState() call regardless of which fields the event carried, so maintenance-only events (nil power/provision state) are still covered.
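For readers following along, the gate described here can be sketched as follows (a reduced stand-in for the real cache entry, not the PR's actual struct):

```go
package main

import "fmt"

// node is a reduced stand-in for the exporter's cache entry.
type node struct {
	HasStateData   bool
	PowerState     *string
	ProvisionState *string
	Maintenance    *bool
}

// UpdateNodeState sets HasStateData regardless of which fields the event
// carried, so maintenance-only events (nil power/provision state) still
// enable metric emission; nil fields leave the cached values untouched.
func (n *node) UpdateNodeState(power, provision *string, maintenance *bool) {
	n.HasStateData = true
	if power != nil {
		n.PowerState = power
	}
	if provision != nil {
		n.ProvisionState = provision
	}
	if maintenance != nil {
		n.Maintenance = maintenance
	}
}

func main() {
	var n node
	m := true
	n.UpdateNodeState(nil, nil, &m) // maintenance.set style event
	fmt.Println(n.HasStateData, n.PowerState == nil, *n.Maintenance)
}
```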

@nidzrai nidzrai force-pushed the feat/ironic-hardware-exporter-states branch from 9a7f3b7 to 4f4110e Compare May 5, 2026 23:35
@nidzrai nidzrai added this pull request to the merge queue May 6, 2026
Merged via the queue into main with commit 95d6c06 May 6, 2026
24 checks passed
@nidzrai nidzrai deleted the feat/ironic-hardware-exporter-states branch May 6, 2026 12:23