@cardoe
Contributor

@cardoe cardoe commented Oct 9, 2025

Introduces a declarative hardware categorization system for UnderStack that separates physical hardware definitions (device-types) from workload matching criteria (flavors), enabling GitOps-driven hardware management with JSON schema validation and CLI tooling.

Architecture

Device-Types define physical hardware models with resource class variants:

  • JSON schema for hardware specs (manufacturer, model, CPU, memory, drives, interfaces)
  • Multiple resource classes per device (e.g., m1.small/medium/large for different configs)
  • Ironic inspection hook performs exact matching against specs and sets node.resource_class
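The exact-matching step can be sketched roughly as follows. This is only an illustration, not the inspection hook itself: `DEVICE_TYPE`, `match_resource_class`, the dictionary layout, and all spec values are hypothetical and chosen to mirror the fields listed above.

```python
from typing import Optional

# Hypothetical device-type definition. The keys mirror the fields named in the
# description (manufacturer, model, cpu, memory, drives); the values and the
# exact nesting are illustrative assumptions.
DEVICE_TYPE = {
    "manufacturer": "Dell",
    "model": "PowerEdge R7615",
    "resource_class": [
        {"name": "m1.small", "cpu": {"cores": 16}, "memory": {"size": 128},
         "drives": [{"size": 480}]},
        {"name": "m1.medium", "cpu": {"cores": 32}, "memory": {"size": 256},
         "drives": [{"size": 960}]},
    ],
}


def match_resource_class(inspected: dict, device_type: dict) -> Optional[str]:
    """Return the resource class whose specs equal the inspected specs exactly."""
    if (inspected["manufacturer"], inspected["model"]) != (
        device_type["manufacturer"], device_type["model"]
    ):
        return None
    for rc in device_type.get("resource_class", []):
        if all(rc[key] == inspected[key] for key in ("cpu", "memory", "drives")):
            return rc["name"]
    return None  # no exact match: the node's resource_class stays unset


node = {"manufacturer": "Dell", "model": "PowerEdge R7615",
        "cpu": {"cores": 16}, "memory": {"size": 128}, "drives": [{"size": 480}]}
print(match_resource_class(node, DEVICE_TYPE))  # m1.small
```

Because the comparison is exact, a node with a non-standard memory size matches no resource class at all rather than falling into the nearest one.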

Flavors define node matching criteria using resource classes and hardware traits:

  • Reference device-type resource classes (Nova properties derived from device-type specs)
  • Optional trait requirements (required/absent) for hardware filtering
  • Trait names without CUSTOM_ prefix (added automatically)

Hardware Traits use a hybrid approach with standard and custom traits:

  • Standard traits: NVME, GPU_NVIDIA, NIC_MELLANOX, CPU_AVX512, etc.
  • Custom traits: site-specific traits that follow the naming conventions
  • Discovered during inspection and added to Ironic nodes

Flow:

  1. Ironic inspection discovers hardware specs and capabilities
  2. Inspection hook matches device-type and sets resource_class via exact spec matching
  3. Inspection hook discovers and adds hardware traits
  4. Flavor definitions filter nodes by resource_class + trait requirements
  5. Nova flavors created with properties from device-type resource class

CLI Tools

Both understackctl device-type and understackctl flavor provide:

  • add: Validate and add definition, auto-update kustomization
  • validate: Standalone JSON schema validation
  • delete: Remove definition, update kustomization
  • list: Show all definitions
  • show: Display details

Examples

Device-Types:

  • dell-poweredge-r7615.yaml: Server with 3 resource classes
  • cisco-nexus-9336c-fx2.yaml: 1U ToR switch
  • palo-alto-pa-5220.yaml: 3U firewall

Flavors:

  • m1.small.yaml: Generic flavor matching all m1.small nodes
  • m1.small.nicX.yaml: Specialized flavor requiring NICX trait

Documentation

Design guides: device-types.md, hardware-traits.md, flavors.md
Operator guides: device-types.md, flavors.md

Follow-On Work

  • Update python/ironic-understack/ironic_understack/resource_class.py inspection hook to read device-types ConfigMap instead of legacy flavor-matcher directory
  • Implement trait discovery inspection hooks for standard traits
  • Consider cross-reference validation (flavors → device-type resource classes)
  • Add verification commands for ConfigMap status checking

@cardoe cardoe force-pushed the hardware-categorization branch 2 times, most recently from b52c6de to a054d6a Compare October 9, 2025 17:44

### Nova Flavor Property Derivation

Nova flavor properties are derived from the device-type resource class:
Collaborator

nit: for baremetal, the Nova resource properties (resources:VCPU, resources:MEMORY_MB, resources:DISK_GB) are always set to 0. We only set the real values in the flavor fields (vcpus, ram, disk):

❯ openstack flavor show gp2.small
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| Field                      | Value                                                                                                        |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                                        |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                                            |
| access_project_ids         | None                                                                                                         |
| description                | None                                                                                                         |
| disk                       | 480                                                                                                          |
| id                         | 4bf13946-1ea4-4ccd-9d66-2f0de50cd99d                                                                         |
| name                       | gp2.small                                                                                                    |
| os-flavor-access:is_public | True                                                                                                         |
| properties                 | resources:CUSTOM_BAREMETAL_GP2_SMALL='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0' |
| ram                        | 98304                                                                                                        |
| rxtx_factor                | 1.0                                                                                                          |
| swap                       | 0                                                                                                            |
| vcpus                      | 16                                                                                                           |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
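The `properties` row above can be reproduced with a short sketch. It assumes the usual Ironic/Nova convention for bare metal flavors (upper-case the Ironic resource class, replace punctuation with underscores, prefix `CUSTOM_`, and zero out the standard resources). The function names and the input `baremetal.gp2.small` are hypothetical; only the resulting property keys come from the output above.

```python
import re


def placement_resource_name(ironic_resource_class: str) -> str:
    """Upper-case, replace non-alphanumerics with '_', and prefix CUSTOM_."""
    normalized = re.sub(r"[^A-Z0-9]", "_", ironic_resource_class.upper())
    return f"CUSTOM_{normalized}"


def baremetal_flavor_properties(ironic_resource_class: str) -> dict:
    """Build flavor properties that request one whole node and no VM resources."""
    return {
        f"resources:{placement_resource_name(ironic_resource_class)}": "1",
        "resources:DISK_GB": "0",
        "resources:MEMORY_MB": "0",
        "resources:VCPU": "0",
    }


# "baremetal.gp2.small" is an assumed Ironic resource class name.
for key, value in baremetal_flavor_properties("baremetal.gp2.small").items():
    print(f"{key}='{value}'")
```

The `vcpus`, `ram`, and `disk` fields stay informational for users, while scheduling is driven only by the custom resource class.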

Contributor Author

I believe this is addressed. Agree?


* Define traits at appropriate granularity (model-specific vs. category)
* Document trait meanings and discovery logic
* Use consistent trait naming across the organization
Collaborator

how can we achieve that? What are the names used elsewhere in the organization?

yeah traits will require governance guidelines - in fact I would call this section "trait governance"

Contributor Author

I meant across the deployment of UnderStack between different groups using it.

Contributor Author

I've added a hardware trait doc as a follow-on.

* Always validate device types before committing
* Include descriptive commit messages explaining what hardware is being added
* Submit changes via pull requests for team review
* Tag releases when updating device type definitions for production deployments
Collaborator

I think this needs more explanation of how it can be done, especially since other sections go into very low-level detail, like listing exact git commands.

resource_class: m1.medium
traits:
- trait: NVME
requirement: required
Collaborator

what do you think about:

Suggested change
- requirement: required
+ state: required

instead?

Comment on lines +471 to +478

1. Reads the flavor definition
2. Queries Ironic for nodes with `resource_class=m1.small`
3. Looks up the device-type `m1.small` resource class
4. Creates a Nova flavor with:
* vcpus: 16 (from `cpu.cores`)
* ram: 131072 MB (from `memory.size` * 1024)
* disk: 480 GB (from `drives[0].size`)
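The derivation in step 4 amounts to three lookups and one unit conversion. A minimal sketch, assuming `memory.size` is expressed in GB and drive sizes in GB; `derive_nova_properties` and the `m1_small` dictionary are hypothetical names, not the flavor matching library's API:

```python
def derive_nova_properties(resource_class: dict) -> dict:
    """Map a device-type resource class onto Nova flavor fields."""
    return {
        "vcpus": resource_class["cpu"]["cores"],
        "ram": resource_class["memory"]["size"] * 1024,  # GB -> MB
        "disk": resource_class["drives"][0]["size"],     # first drive only
    }


# Values chosen to reproduce the numbers quoted in the steps above.
m1_small = {
    "name": "m1.small",
    "cpu": {"cores": 16},
    "memory": {"size": 128},
    "drives": [{"size": 480}, {"size": 480}],
}
print(derive_nova_properties(m1_small))
# {'vcpus': 16, 'ram': 131072, 'disk': 480}
```

Only `drives[0]` contributes to `disk`, so any additional drives in the resource class do not change the flavor's reported disk size.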
Collaborator

not sure I understand this section - which flavor matcher service are we talking about here?

Contributor Author

The flavor matching lib you had previously made.

@grizzlydev grizzlydev left a comment

this is great @cardoe - really well explained.

one high-level comment is that for the short and immediate term, a small leadership team needs to wholly own the governance of device types and resource classes while we figure out if/how the model fits reality. There are too many examples of frameworks being provided without enough governance, which are then (mis)used in many difficult-to-predict ways.

* **resource_class**: Array of resource class configurations (required for
`class: server`)

### Resource Classes

should we consider calling this "Server resource classes", as it's specific to servers?

Contributor

or maybe baremetal_node_resource_class?

Contributor Author

Yep we can. I thought this could also be used for physical firewalls. But we can revisit that when we mess with that.


Contributor

@stevekeay stevekeay left a comment

I threw in a load of comments, feel free to ignore them, these were just the questions that came to mind as I was reading.


## Management Workflow

Flavor definitions are managed using the `understackctl flavor` CLI:
Contributor

Does this mean "CAN be", "SHOULD be", or "MUST be" managed using that thing?

Contributor Author

Folks have complained about modifying YAML files without guidance and about YAML being hard to read. So this validates what you supply and pretty-prints the output. It also ensures that your files are merged into the right ConfigMap.

deployment configurations, providing versioning, review, and audit capabilities
* **Cross-Platform Integration**: Device types integrate with Nautobot,
Ironic, and Nova to provide consistent hardware metadata throughout the stack

Contributor

Can we elaborate at what point we create these different things?

It sounds like we make a new device type for each new vendor product name (like Dell "PowerEdge R7615").

For modular chassis like Dell servers, it sounds like we don't make a device type for each different configuration.

Then resource classes are a kind of bucket where we might put a number of different configurations. This is a one-dimensional attribute used to select a suitable server.

Outside of the device type, each individual node has a set of traits to describe properties that we think might be important to server selection/scheduling, but don't fit easily into the resource_class bucket.

Can we describe some of the thinking that led to this design? Like, all the servers could have been flattened into a single device type called "x86_64_server", or they could be way more granular. Similarly, instead of using lots of different resource classes, we could do node selection based purely on traits?

Should we also mention what device types are not intended to be? E.g. a device type is not designed to tell us:

  • how to order one from the supplier?
  • monetary value?
  • how much electricity it will use?
  • which spare parts are compatible?

Contributor Author

They're needed inside of Nautobot and for detection of the hardware. That's all this is trying to define.

@cardoe cardoe changed the title Hardware categorization feat: Add hardware categorization design Oct 10, 2025
@cardoe
Contributor Author

cardoe commented Oct 10, 2025

> I threw in a load of comments, feel free to ignore them, these were just the questions that came to mind as I was reading.

I tried to address them all.

@cardoe
Contributor Author

cardoe commented Oct 10, 2025

@stevekeay @skrobul @grizzlydev lemme know if I didn't address your feedback well enough in the updates.

@cardoe cardoe marked this pull request as ready for review October 10, 2025 18:20
cardoe added 11 commits October 13, 2025 09:30
Introduce a JSON Schema (draft 2020-12) that defines hardware device
types for bare metal infrastructure. Device types describe physical
hardware models with their specifications and resource class variants.

Key schema features:
- Required fields: class (server/switch/firewall), manufacturer, model,
  u_height, is_full_depth
- Optional interfaces array with named ports (BMC, management)
- Optional resource_class array defining hardware configuration profiles
- Resource classes include CPU (cores, model), memory (size), drives,
  and nic_count
- Conditional validation: server class requires resource_class array
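The required fields and the conditional rule can be illustrated with a hand-rolled check. This is only a sketch of the rules listed above, not the JSON Schema itself; `validate_device_type` and the sample documents are hypothetical.

```python
from typing import List

REQUIRED_FIELDS = ("class", "manufacturer", "model", "u_height", "is_full_depth")
DEVICE_CLASSES = ("server", "switch", "firewall")


def validate_device_type(doc: dict) -> List[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in doc:
            errors.append(f"missing required field: {field}")
    if "class" in doc and doc["class"] not in DEVICE_CLASSES:
        errors.append(f"unknown class: {doc['class']}")
    # Conditional rule: only servers must carry a resource_class array.
    if doc.get("class") == "server" and "resource_class" not in doc:
        errors.append("class 'server' requires a resource_class array")
    return errors


# Illustrative documents; the field values are assumptions.
switch = {"class": "switch", "manufacturer": "Cisco", "model": "Nexus 9336C-FX2",
          "u_height": 1, "is_full_depth": True}
server = {"class": "server", "manufacturer": "Dell", "model": "PowerEdge R7615",
          "u_height": 2, "is_full_depth": True}
print(validate_device_type(switch))  # []
print(validate_device_type(server))  # ["class 'server' requires a resource_class array"]
```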

Resource classes represent common build variations of the same chassis
model (e.g., different CPU/RAM configurations). The resource class name
is later set on Ironic nodes and referenced by flavor definitions for
Nova flavor creation.

Add three example device-type definitions covering all supported device
classes, along with kustomization configuration for ConfigMap generation.

Examples added:
- dell-poweredge-r7615.yaml: Server with 3 resource classes (m1.small,
  m1.medium, m1.large) demonstrating different CPU/RAM/drive configs
- cisco-nexus-9336c-fx2.yaml: 1U top-of-rack switch with 36x 100G ports
- palo-alto-pa-5220.yaml: 3U firewall with mixed interface types

Also creates hardware/base/kustomization.yaml with device-types
ConfigMapGenerator that includes all three examples. The ConfigMap is
consumed by flavor-matching and hardware discovery workflows.

Implement understackctl device-type subcommand with five operations for
managing hardware device-type definitions in GitOps deployments.

Commands:
- add: Validate and copy device-type YAML to deployment repo, auto-update
  kustomization.yaml with new entry
- validate: Standalone validation against JSON schema without deployment
- delete: Remove device-type file and update kustomization.yaml
- list: Display all device-types in hardware/device-types directory
- show: Pretty-print device-type details including resource classes

Implementation details:
- Full JSON schema validation using jsonschema/v5 library
- Type-safe structs (DeviceType, ResourceClass, Interface, CPU, Memory,
  Drive) with yaml/json tags
- Auto-generates filenames from manufacturer-model in lowercase with
  hyphens
- UC_DEPLOY environment variable required for deployment repo path
- Automatic kustomization.yaml updates for both add and delete operations

Add comprehensive documentation for device-type feature covering both
architectural design and operational usage.

Design guide (design-guide/device-types.md):
- Purpose and architecture of device-type system
- Schema structure with detailed field descriptions
- Resource class concept and Nova flavor integration
- GitOps deployment via ConfigMap generation
- Relationship between device-types and flavor matching
- File organization in deployment repository

Operator guide (operator-guide/device-types.md):
- Prerequisites and command overview (add/validate/delete/list/show)
- Step-by-step workflow for creating device-type definitions
- Distinction between interfaces (named physical ports) and nic_count
  (user-usable network interfaces for workloads)
- Multiple resource classes use case (common build variations)
- Update workflow (edit-in-place with validation)
- Examples for servers, switches, and firewalls
- Best practices for naming, resource class design, and version control
- Troubleshooting common validation failures and ConfigMap issues

Replace the existing flavor.schema.json (which defined hardware
inspection matching criteria) with a new schema for hardware flavor
definitions that match Ironic nodes to resource classes with trait
filtering.

Key changes:
- Remove hardware-specific fields (manufacturer, model, cpu_cores,
  cpu_model, memory_gb, memory_modules, drives, pci)
- Add resource_class field (required) to reference device-type resource
  classes
- Add optional traits array with trait name and requirement (required or
  absent)
- Remove Nova properties (vcpus, ram, disk, ephemeral, swap) - these are
  now derived from device-type resource class definitions
- Simplify to minimal schema: only name and resource_class required

Trait requirements allow filtering nodes within a resource class:
- "required": node must have the trait
- "absent": node must NOT have the trait
- Trait names specified without CUSTOM_ prefix (added automatically)
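A minimal sketch of how these requirements could filter nodes. The `trait` and `requirement` keys follow the flavor YAML quoted earlier in this review; `placement_trait`, `node_matches`, and the node dictionaries are hypothetical, and the prefix handling assumes `CUSTOM_` is added to every trait name that does not already carry it.

```python
def placement_trait(name: str) -> str:
    """Add the CUSTOM_ prefix unless the name already has it (assumption)."""
    return name if name.startswith("CUSTOM_") else f"CUSTOM_{name}"


def node_matches(flavor: dict, node: dict) -> bool:
    """Check resource class first, then every required/absent trait."""
    if node["resource_class"] != flavor["resource_class"]:
        return False
    node_traits = set(node.get("traits", []))
    for entry in flavor.get("traits", []):
        present = placement_trait(entry["trait"]) in node_traits
        if entry["requirement"] == "required" and not present:
            return False
        if entry["requirement"] == "absent" and present:
            return False
    return True


generic = {"name": "m1.small", "resource_class": "m1.small"}
nicx = {"name": "m1.small.nicX", "resource_class": "m1.small",
        "traits": [{"trait": "NICX", "requirement": "required"}]}

plain_node = {"resource_class": "m1.small", "traits": []}
nicx_node = {"resource_class": "m1.small", "traits": ["CUSTOM_NICX"]}

print(node_matches(generic, plain_node), node_matches(generic, nicx_node))  # True True
print(node_matches(nicx, plain_node), node_matches(nicx, nicx_node))        # False True
```

A flavor with no traits matches every node in its resource class, which is why the generic flavor accepts both nodes.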

This aligns with the device-type architecture where Nova flavor
properties (vCPUs, RAM, disk) come from device-type resource class specs,
while flavor definitions only control which nodes are eligible based on
traits.

Add two example flavor definitions demonstrating generic and
trait-specific matching, along with kustomization updates for ConfigMap
generation.

Examples added:
- m1.small.yaml: Generic flavor matching all nodes in m1.small resource
  class without trait filtering
- m1.small.nicX.yaml: Specialized flavor requiring CUSTOM_NICX trait,
  matches subset of m1.small nodes with specific NIC hardware

Both flavors reference the m1.small resource class from device-type
definitions. Nova flavor properties (vCPUs, RAM, disk) are automatically
derived from the device-type resource class specification, not defined
in flavor files.

Updates hardware/base/kustomization.yaml to add flavors ConfigMapGenerator
alongside existing device-types ConfigMap. The flavors ConfigMap is
consumed by flavor-matching workflows to create Nova flavors.

Implement understackctl flavor subcommand with five operations for
managing hardware flavor definitions that control Ironic node matching
and Nova flavor creation.

Commands:
- add: Validate and copy flavor YAML to deployment repo, auto-update
  kustomization.yaml
- validate: Standalone validation against JSON schema
- delete: Remove flavor file and update kustomization.yaml
- list: Display all flavors in hardware/flavors directory
- show: Display flavor with resource class and trait requirements

Implementation details:
- Type-safe structs (Flavor, Trait) with yaml/json tags
- Full JSON schema validation using jsonschema/v5
- Auto-generates filenames from flavor name (e.g., m1.small.yaml)
- UC_DEPLOY environment variable required for deployment repo path
- Automatic kustomization.yaml updates for flavors ConfigMap
- Trait name pattern validation (uppercase alphanumeric with underscores)
- Requirement enum validation (required or absent)
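The two trait checks can be sketched as below. The regular expression is an assumption based on the "uppercase alphanumeric with underscores" wording, not the pattern from the actual schema, and `validate_trait_entry` is a hypothetical helper.

```python
import re
from typing import List

TRAIT_NAME = re.compile(r"[A-Z0-9_]+")  # assumed pattern, see note above
REQUIREMENTS = ("required", "absent")


def validate_trait_entry(entry: dict) -> List[str]:
    """Return validation errors for one entry of a flavor's traits array."""
    errors = []
    name = entry.get("trait", "")
    if not TRAIT_NAME.fullmatch(name):
        errors.append(f"invalid trait name: {name!r}")
    if entry.get("requirement") not in REQUIREMENTS:
        errors.append(f"invalid requirement: {entry.get('requirement')!r}")
    return errors


print(validate_trait_entry({"trait": "NVME", "requirement": "required"}))  # []
print(validate_trait_entry({"trait": "nvme", "requirement": "optional"}))
```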

Show command displays resource class with note that Nova properties are
derived from device-type resource class definition, plus any trait
requirements for hardware filtering.

Add comprehensive documentation for flavor feature covering architecture,
integration points, and operational usage.

Design guide (design-guide/flavors.md):
- Purpose: flavors as filters for Ironic node matching with trait
  requirements
- Four-phase workflow: inspection adds traits, device-type sets
  resource_class, flavor matches nodes, Nova flavor created
- Schema structure with required (name, resource_class) and optional
  (traits) fields
- Trait system: CUSTOM_ prefix added automatically, pattern validation
- Flavor-matcher integration: queries nodes by resource_class, filters by
  traits, derives Nova properties from device-type resource class
- Nova flavor property derivation: vcpus from cpu.cores, ram from
  memory.size, disk from drives[0].size
- Use cases: generic pools, specialized workloads, legacy exclusion
- Best practices for naming, trait design, resource class alignment
- Relationship to device-types: tightly coupled, flavors reference
  resource classes defined in device-types

Operator guide (operator-guide/flavors.md):
- Prerequisites and command overview (add/validate/delete/list/show)
- Step-by-step workflow for creating flavor definitions
- Trait requirements syntax and CUSTOM_ prefix handling
- Update workflow (edit-in-place with validation)
- Common use cases with examples (generic, specialized, exclusion,
  multi-trait)
- Best practices for naming, resource class references, trait management
- Troubleshooting validation failures, ConfigMap updates, resource class
  references, trait matching
- Integration with device-types: shows how Nova properties are derived
  from device-type resource class definitions

Further expand the design guide for device-types by explaining how
matches are made to the hardware as it is inspected.

Attempt to explain hardware traits and how we should standardize them
along with how custom traits can be used as well.

There were some questions around properties so this attempts to clarify
them and link to the upstream docs.
@cardoe cardoe force-pushed the hardware-categorization branch from 644a9cd to 2b018d7 Compare October 13, 2025 14:30
cardoe and others added 2 commits October 13, 2025 09:36
@cardoe cardoe added this pull request to the merge queue Oct 13, 2025
Merged via the queue into main with commit c104a88 Oct 13, 2025
23 checks passed
@cardoe cardoe deleted the hardware-categorization branch October 13, 2025 14:56
6 participants