OCPBUGS-56648: fixes systemd unit creation for empty units #5086

Open · wants to merge 1 commit into main
Conversation

@cheesesashimi (Member) commented May 28, 2025

- What I did

If a MachineConfig is applied with empty systemd unit contents, the MCD will degrade because it skips writing the file in that particular situation. For parity with the CoreOS Ignition implementation, we should not attempt to enable or disable any systemd units where the unit file does not have any contents.

This PR also extends the TestIgn3Cfg e2e test to include systemd units, along with assertions verifying that the on-disk state and systemd state are as expected.
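The guard can be sketched as a standalone program; this is a simplified version of the `unitHasContent` check this PR adds (the real code operates on Ignition `Unit` types, so the struct fields here are an assumption):

```go
package main

import "fmt"

// unit mirrors the relevant fields of an Ignition systemd unit
// (simplified; the exact field shapes are an assumption).
type unit struct {
	Name     string
	Contents *string
	Enabled  *bool
}

// unitHasContent reports whether the unit file has any contents.
// Units without contents are skipped entirely, matching the CoreOS
// Ignition behavior described above.
func unitHasContent(u unit) bool {
	return u.Contents != nil && *u.Contents != ""
}

func main() {
	empty := ""
	body := "[Unit]\nDescription=demo\n"
	fmt.Println(unitHasContent(unit{Name: "empty.service", Contents: &empty})) // false
	fmt.Println(unitHasContent(unit{Name: "real.service", Contents: &body}))   // true
}
```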

- How to verify it

  1. Bring up a cluster.
  2. Apply a MachineConfig which adds an empty systemd unit.
  3. The node(s) should roll out the config as usual. However, the file will not be created on the node(s) nor will systemd be aware of the unit.

The e2e tests have been modified to perform this automatically.
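For step 2, a MachineConfig along these lines would exercise the fix; the pool (`worker`), object name, unit name, and Ignition version below are placeholders chosen for illustration, not taken from the PR:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-empty-unit-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
        - name: empty-test.service
          enabled: true
          # no contents: with this fix, the MCD skips writing and
          # enabling this unit instead of degrading the node
```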

- Description for the changelog
Fixes systemd unit creation for empty units

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 28, 2025
@openshift-ci-robot

@cheesesashimi: This pull request references Jira Issue OCPBUGS-56648, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from djoshy and yuqi-zhang May 28, 2025 21:46
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2025
@@ -745,7 +745,7 @@ func TestDontDeleteRPMFiles(t *testing.T) {
func TestIgn3Cfg(t *testing.T) {
cs := framework.NewClientSet("")

-	delete := helpers.CreateMCP(t, cs, "infra")
+	deleteFunc := helpers.CreateMCP(t, cs, "infra")
cheesesashimi (Member, Author):
Note to reviewer: I changed this because delete is a built-in function in Go. While shadowing it was not causing any problems, I opted to fix it proactively.
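For illustration (this example is not from the PR diff), a declaration named `delete` shadows the builtin, making the usual map-deletion form unusable in that scope:

```go
package main

import "fmt"

// A package-level delete shadows Go's builtin for the whole package.
// A local variable named delete, as in the original test code, has the
// same effect within its own scope; hence the rename to deleteFunc.
func delete() string { return "cleanup" }

func main() {
	fmt.Println(delete()) // prints "cleanup"

	// With the builtin shadowed, the map form no longer compiles:
	// m := map[string]int{"a": 1}
	// delete(m, "a") // error: too many arguments in call to delete
}
```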

@isabella-janssen (Member) left a comment:
lgtm, I will leave final tagging to someone else on the team with more context.

Side note: Good catch with changing the delete var name to deleteFunc.

@yuqi-zhang (Contributor) left a comment:
Overall lgtm, some thoughts inline on logging and erroring. Thanks for adding the comprehensive testing! Bonus points for adding it to the existing general ignition test so we don't have to add any extra time/reboots :)

disabledUnits = append(disabledUnits, u.Name)
// Only when a unit has contents should we attempt to enable or disable it.
// See: https://issues.redhat.com/browse/OCPBUGS-56648
if unitHasContent(u) {
Inline comment (Contributor):

Thought: maybe we should log it like we do for the preset below, just to have something to reference that we attempted to enable/disable but we could not find contents.

If we do, we should be careful about the message since some bug reporters have thought that the preset failure is fatal (presumably since it says Error msg), might be a good time to rephrase that as well.
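A sketch of what an informational skip message could look like; the wording, use of `fmt` for demonstration, and the helper name are assumptions (the MCD has its own logging, and the final phrasing would be the author's call):

```go
package main

import "fmt"

// unit is a simplified stand-in for an Ignition systemd unit.
type unit struct {
	Name     string
	Contents *string
}

func unitHasContent(u unit) bool {
	return u.Contents != nil && *u.Contents != ""
}

// skipMessage phrases the skip as informational rather than as an
// error, to avoid bug reports that mistake it for a failure.
func skipMessage(u unit) string {
	return fmt.Sprintf("systemd unit %q has no contents; skipping enable/disable (see OCPBUGS-56648)", u.Name)
}

func main() {
	u := unit{Name: "empty.service"}
	if !unitHasContent(u) {
		fmt.Println(skipMessage(u))
	}
}
```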

// Only when a unit has contents should we attempt to enable or disable it.
// See: https://issues.redhat.com/browse/OCPBUGS-56648
if unitHasContent(u) {
if *u.Enabled {
Inline comment (Contributor):
I am +1 on going with the match-ignition approach. One very minor concern is that the few times this bug has popped up, it was because the user was trying to use a layered image build with a service built in, but enabling the service via a MachineConfig.

Previously, if they applied both changes in the same update, the update failed because the MCD could not find the service, which at least let the user try to figure out what happened.

Now, if they apply both changes in the same update, the service enablement is skipped on this update, since the MCD can't find it, but the actual service is being staged as part of the OS update. Upon reboot, the MCD doesn't error, but the service that the user might expect to be enabled is not there. Upon a second reboot sometime in the future, it actually gets enabled properly since now the service exists, which might catch some users off guard.

So then the options I can see are:

  1. say that's fine and direct users to either apply the image first and then enable the service via a second update, or try to build that into the image directly
  2. have the post-reboot MCD run the list again and try to make sure everything is enabled/started
  3. leave it as is for now and eventually, once layering is the default mechanism, we should probably be building the service and enablement into the update image directly instead of the hybrid management we have now

I'm happy with 1/3 but just wanted to write that out in case someone feels differently

cheesesashimi (Member, Author) replied:
Overall, I'm in agreement. But something I want to point out is the OCL reboot work which will effectively stop baking the MachineConfigs into the OS image. The advice to give at that point is that they should not use a MachineConfig for enabling the service in that case.

If a MachineConfig is applied with empty systemd unit contents, the MCD
will degrade because it skips writing the file in that particular
situation. For parity with the CoreOS Ignition implementation, we should
not attempt to enable or disable any systemd units where the unit file
does not have any contents.
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-56648 branch from 41489d7 to 713bbd7 Compare June 12, 2025 19:12

openshift-ci bot commented Jun 12, 2025

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-hypershift-techpreview | 41489d7 | link | false | /test e2e-hypershift-techpreview |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 713bbd7 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-aws-ovn | 713bbd7 | link | true | /test e2e-aws-ovn |
| ci/prow/verify-deps | 713bbd7 | link | true | /test verify-deps |
| ci/prow/e2e-gcp-op | 713bbd7 | link | true | /test e2e-gcp-op |
| ci/prow/e2e-gcp-op-techpreview | 713bbd7 | link | false | /test e2e-gcp-op-techpreview |
| ci/prow/e2e-gcp-op-ocl | 713bbd7 | link | false | /test e2e-gcp-op-ocl |
| ci/prow/e2e-gcp-op-single-node | 713bbd7 | link | true | /test e2e-gcp-op-single-node |
| ci/prow/bootstrap-unit | 713bbd7 | link | false | /test bootstrap-unit |
| ci/prow/e2e-agent-compact-ipv4 | 713bbd7 | link | false | /test e2e-agent-compact-ipv4 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yuqi-zhang (Contributor) left a comment:
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2025

openshift-ci bot commented Jun 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cheesesashimi,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
