feat(ironic): Add node state metrics from versioned notifications#1963
Conversation
6d790ad to
ed18e1b
Compare
|
in that metric: shouldn't it be 1 instead of 0? there are 1 entities with node_uuid=XXX, node_name=YYY, conductor_host="ZZZ" and fault "none". My assumption here is that if that has a fault all the labels will remain same except "fault" which will now get the fault description. at THAT point fault="none" should drop to 0 (or disappear). Do I read it wrongly? |
i treated fault more like a status/condition metric, where the numeric value indicates whether the node is faulted (0 = no, 1 = yes), rather than a pure state-label metric where the active label always has value 1. |
so I modeled ironic_node_fault as a boolean/status-style metric rather than a pure state-label metric. ironic_node_fault{...,fault="none"} 0 the node is not faulted The main reason to chose that shape is query behavior with this model, expressions like sum(ironic_node_fault) directly answer “how many nodes are faulted right now?”. If fault="none" were emitted as 1, that kind of query would count healthy nodes as well, idk but this felt misleading to me. So if we want fault to follow the same pattern for consistency, I’m happy to change it. My choice here was mainly to avoid the “healthy nodes count as 1” downside of the alternative. |
ed18e1b to
22db994
Compare
|
Can we add some documentation to docs about the design of this code, how to build it and how to validate it. We should also have under the deployment docs some information on how to deploy it. |
| store.UpdateNodeState(stateMsg) | ||
| log.Printf("cached state node=%s power=%v provision=%v", | ||
| stateMsg.NodeName, stateMsg.PowerState, stateMsg.ProvisionState) | ||
| }); err != nil { |
There was a problem hiding this comment.
Blocking issue: if the states consumer exits here, the process keeps running and /health remains 200. bothReady will flip /ready to 503, but Kubernetes will not restart the pod from readiness alone and the exporter will continue serving stale state metrics. The sensor consumer path exits fatally on failure; the states path should do the same or share a supervisor/reconnect loop so state collection recovers.
|
|
||
| // nodeStateEventPrefixes are the baremetal.node events . | ||
| // we use .end and .success variants so we capture final state | ||
| var nodeStateEventPrefixes = []string{ |
There was a problem hiding this comment.
Because this parser also drives ironic_node_maintenance and ironic_node_fault, limiting the allow-list to power/provision events leaves those metrics stale when maintenance or fault changes through maintenance/update notifications. It also misses power_state_corrected, which the existing Nautobot sync path treats as a power-state event. Please include all Ironic node events that can change the fields exported here, or do not export fields this consumer cannot keep current.
There was a problem hiding this comment.
will do these in follow up pr
| fmt.Fprintf(b, "# TYPE ironic_node_last_seen_timestamp_seconds gauge\n") | ||
| for _, n := range nodes { | ||
| fmt.Fprintf(&b, "ironic_node_last_seen_timestamp_seconds{node_uuid=%q,node_name=%q} %d\n", | ||
| fmt.Fprintf(b, "ironic_node_last_seen_timestamp_seconds{node_uuid=%q,node_name=%q} %d\n", |
There was a problem hiding this comment.
State events can create a cache entry before any hardware metrics arrive because UpdateNodeState does that. For those entries LastSeen is the Go zero time, so this loop emits ironic_node_last_seen_timestamp_seconds with -62135596800, which is a bogus hardware metrics timestamp. Skip zero LastSeen values or track a separate state timestamp instead.
There was a problem hiding this comment.
will do this in follow up pr
| if n.PowerState == nil { | ||
| continue | ||
| } | ||
| fmt.Fprintf(b, "ironic_node_power_state{node_uuid=%q,node_name=%q,conductor_host=%q,power_state=%q} 1\n", |
There was a problem hiding this comment.
This current-state-only shape leaves the previous label series active in Prometheus until staleness kicks in. For example, after power off changes to power on, the old power_state="power off" sample can still match queries for several minutes. State metrics are safer if the exporter emits the complete enum as 0/1 for each node, or uses a numeric gauge without state labels.
There was a problem hiding this comment.
will do these in follow up pr
|
Suggestion: switch the exporter release tag namespace to Concrete shape I would suggest:
This aligns the exporter with the component-style tag format already used elsewhere, for example |
skrobul
left a comment
There was a problem hiding this comment.
lgtm with minor suggestions
| } | ||
| } | ||
|
|
||
| // todo: circle back here for parsing |
| for _, n := range nodes { | ||
| for key, d := range n.Sensors.Drive { | ||
| val := 0.0 | ||
| if d.State != nil && *d.State == "Enabled" { |
There was a problem hiding this comment.
| if d.State != nil && *d.State == "Enabled" { | |
| if d.State != nil && strings.EqualFold(*d.State, "Enabled") { |
| if n.PowerState == nil && n.ProvisionState == nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
what is this check for? does the maintenance status change always arrive only with power info?
I feel like there is some story behind this, but couldn't figure out what it is.
There was a problem hiding this comment.
The old check was replaced by HasStateData (set once on the first UpdateNodeState() call). The old PowerState == nil && ProvisionState == nil check would have silently dropped maintenance-only events maintenance.set arrives with both fields nil since Ironic only sends what changed. HasStateData correctly gates on "has any state event arrived for this node" rather than "does this node have both specific fields populated".
| if n.PowerState == nil && n.ProvisionState == nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
similar as above - why is this one gated?
There was a problem hiding this comment.
was guarding maintenance and fault emission, but maintenance.set events (which only set maintenance=true) arrive with both power_state and provision_state nil. but replaced it with this was also replaced by HasStateData
There was a problem hiding this comment.
Both checks are gone. transformer.go now has a single gate at the top of the stateFamilies loop: if !n.HasStateData { continue }. HasStateData is set by the first UpdateNodeState() call regardless of which fields the event carried, so maintenance-only events (nil power/provision state) too stays
9a7f3b7 to
4f4110e
Compare
Added power_state, provision_state, maintenance, and fault metrics by consuming
Ironic versioned notifications from a dedicated queue.
Added second consumer for ironic-hardware-exporter-states queue bound to
ironic_versioned_notifications.info routing key
Add parser for baremetal.node.power_set/provision_set events
log :
❯ openstack baremetal node power off Dell-24GSW04
from /metrics
ironic_node_power_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",power_state="power off"} 1 ironic_node_provision_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",provision_state="available"} 1 ironic_node_maintenance{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3"} 0 ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0❯ openstack baremetal node power on Dell-24GSW04
2026/04/22 21:44:28 cached state node=Dell-24GSW04 power=0x239aed2ab400 provision=0x239aed2ab410ironic_node_power_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",power_state="power on"} 1 ironic_node_provision_state{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",provision_state="available"} 1 ironic_node_maintenance{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3"} 0 ironic_node_fault{node_uuid="a8a8548c-fc07-4d9c-a5f2-5f2c6fe7992c",node_name="Dell-24GSW04",conductor_host="ironic-conductor.1327175-hp3",fault="none"} 0missing : conductor_host label on sensor metrics coz it does not exist with sensor data
WIP - deployment artifact and TLS
helm chart follows nautobotop go project