Skip to content

OCPBUGS-57426: OCPBUGS-57506: Boot image controller should correctly handle marketplace boot images #5122

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 23, 2025

Conversation

djoshy
Copy link
Contributor

@djoshy djoshy commented Jun 13, 2025

Closes OCPBUGS-57348 and OCPBUGS-57426.

The MCO will now only update boot images that have been published previously in the installer.

  • On AWS, we validate against a hardcoded AMI list.
  • On GCP, we validate against a URL header.

When an unexpected boot image is found, the controller will skip the MachineSet for updates.

@openshift-ci-robot
Copy link
Contributor

@djoshy: An error was encountered querying GitHub for users with public email (sregidor@redhat.com) for bug OCPBUGS-57426 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message. Post "http://ghproxy/graphql": dial tcp 172.30.229.2:80: connect: connection refused

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

In response to this:

The MCO will now only update boot images that have been published previously in the installer.

  • On AWS, we validate against a hardcoded AMI list.
  • On GCP, we validate against a URL header.

When an unexpected boot image is found, the controller will degrade. This can be cleared by disabling the feature by applying:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
 name: cluster
 namespace: openshift-machine-config-operator
spec:
 logLevel: Normal
 operatorLogLevel: Normal
 managedBootImages:
   machineManagers:
     - resource: machinesets
       apiGroup: machine.openshift.io
       selection:
         mode: None

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
@djoshy djoshy force-pushed the use-known-boot-image branch from ca83084 to 6e1afd9 Compare June 13, 2025 19:42
@djoshy
Copy link
Contributor Author

djoshy commented Jun 13, 2025

/retest-required

/payload-job-with-prs openshift/origin#29919 periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-serial

if newami != currentAMI {
// Validate that we're allowed to update from the current AMI
if !AllowedAMIs.Has(currentAMI) {
return false, nil, fmt.Errorf("current AMI %s is unknown, boot image management is not supported on this cluster. Please disable this feature to remove the degrade", currentAMI)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, since we're keeping bootimage opt-out for AWS, this means that a marketplace cluster will automatically opt in once upgraded to 4.19, and then immediately degrade?

Copy link
Contributor Author

@djoshy djoshy Jun 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I think our current plan is to handle the upgrade awareness via admin-ack docs(openshift/openshift-docs#94706). I suppose we could skip the MachineSet and not degrade(maybe degrade in a later y version?)

@djoshy
Copy link
Contributor Author

djoshy commented Jun 15, 2025

/retest-required

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 15, 2025
@openshift-ci-robot
Copy link
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-57426, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/retest-required

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sergiordlr June 15, 2025 11:17
@djoshy djoshy force-pushed the use-known-boot-image branch from 6e1afd9 to 79fc2e6 Compare June 16, 2025 13:04

// AllowedAMIs contains the set of AMI IDs that are allowed to be updated from
var AllowedAMIs = sets.New(
"ami-000145e5a91e9ac22", "ami-0001c9b25b9ca6731", "ami-000277c401a2db73a", "ami-0002c5529c6fff5a4", "ami-00033b241162023da",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't dug in to see how this is generated, but you might want to wrap at one-per-line, just so new-AMI-insertion generates a single-line diff, and not a many-line-re-wrapping diff.

@djoshy djoshy force-pushed the use-known-boot-image branch 2 times, most recently from aa94349 to abeb602 Compare June 17, 2025 20:04
@djoshy djoshy force-pushed the use-known-boot-image branch from abeb602 to 5e332d5 Compare June 17, 2025 20:22
@jlebon
Copy link
Member

jlebon commented Jun 17, 2025

This is unfortunate but yeah, I think a list like this is probably the safest/most comprehensive approach. Two general comments though:

  1. The list needs provenance. We should ideally even have the script which generated it committed in-tree. Can we do that?
  2. If we're treating this as a historical list that will never change, then it seems fine to have that live in the MCO, assuming we have a strategy to catch new installs from this point on. Is that the case? E.g. will the installer add some metadata to make this less ambiguous?

If not, and this list is meant to be ongoing, my concern is that it'll be extremely easy for this list to become out of date here. I would propose instead having it live in openshift/installer, since that's where bootimages are bumped (and we could even have CI checks that verify the list is updated). Note that bootimages are bumped multiple times for a minor lifecycle.

@sergiordlr
Copy link
Contributor

sergiordlr commented Jun 18, 2025

Verified using IPI on AWS and IPI on GCP

We created a new test case to test this scenario in polarion
OCP-82747

Manually tested scenarios

Positive:

  1. Configure Machineconfiguration resource to update only images with a label
  2. Create a Machineset with a boot image that is in the updateable list (or with the right url in gcp) and a 2.2.0 ignition secret
  3. Label the new Machineset so that MCO updates its boot image
  4. Check that the boot image was updated
  5. Check that the user-data secret using 2.2.0 ignition version was updated to 3.5.0
  6. Check that the new machineconfig can be scaled up and create new nodes

Negative:

  1. Configure Machineconfiguration resource to update only images with a label
  2. Create a Machineset with a boot image that is not considered updateable and a 2.2.0 ignition secret
  3. Label the new Machineset so that MCO updates its boot image
  4. Check that the boot image was NOT updated
  5. Check that the user-data secret using 2.2.0 ignition version is still using a 2.2.0 ignition version

We also updated the existing test cases automation so that they can handle this scenario too. All automated e2e test cases passed in both platforms.

We can qe-approve this PR. Nevertheless, since it seems that there is still an ongoing conversation about how to create/store/maintain the list of images we will wait until everything is decided, just in case we need extra testing.

@djoshy
Copy link
Contributor Author

djoshy commented Jun 18, 2025

Thanks for testing, @sergiordlr !

This is unfortunate but yeah, I think a list like this is probably the safest/most comprehensive approach. Two general comments though:

  1. The list needs provenance. We should ideally even have the script which generated it committed in-tree. Can we do that?

So, this list was originally compiled by @sdodson and he mentioned that he'll work on automating it. This was quickly put together targeting a fix for 4.19.2, so I'm happy to consider a different solution going forward in 4.20+

If not, and this list is meant to be ongoing, my concern is that it'll be extremely easy for this list to become out of date here. I would propose instead having it live in openshift/installer, since that's where bootimages are bumped (and we could even have CI checks that verify the list is updated). Note that bootimages are bumped multiple times for a minor lifecycle.

IMO this sounds the best approach to me, I'll be happy to assist as needed.

@jlebon
Copy link
Member

jlebon commented Jun 18, 2025

I guess one thing that helps here is that (and let me know if this is incorrect) it's fine if e.g. we add an AMI to the installer and don't immediately update the list here. Until the list is updated, the MCO will avoid managing the image, but once we add it, it will. And e.g. 4.19 bootimages will be good for a while so it doesn't have to be managed immediately. And so adding to the list doesn't have to be that closely in sync but could be done e.g. once a release or so.

But that said, another way to do this which could be interesting is to have the script run at MCO build time and dynamically generate the list. That way it's self-maintained. Probably saner to have it generate JSON that gets embedded in the image and then the bootimage controller can just read that. But what's nice is that this script then also gives us the opportunity to gather other things that may help with bootimage updates, like the associated OCP minor release (based on the oldest git branch that contains the AMI ID).

Anyway, not blocking this to be clear.

/approve

@djoshy
Copy link
Contributor Author

djoshy commented Jun 18, 2025

I guess one thing that helps here is that (and let me know if this is incorrect) it's fine if e.g. we add an AMI to the installer and don't immediately update the list here. Until the list is updated, the MCO will avoid managing the image, but once we add it, it will. And e.g. 4.19 bootimages will be good for a while so it doesn't have to be managed immediately. And so adding to the list doesn't have to be that closely in sync but could be done e.g. once a release or so.

This is correct.

But that said, another way to do this which could be interesting is to have the script run at MCO build time...

This is an interesting approach, let me see if I can work something up 👍

CI is broken at the moment and we should also block this on openshift/origin#29919 so we don't break component readiness; throwing in a preemptive hold.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2025
@yuqi-zhang
Copy link
Contributor

If not, and this list is meant to be ongoing, my concern is that it'll be extremely easy for this list to become out of date here. I would propose instead having it live in openshift/installer, since that's where bootimages are bumped (and we could even have CI checks that verify the list is updated). Note that bootimages are bumped multiple times for a minor lifecycle.

I think long term a hardcoded MCO list is not feasible, but we'd like to have this as a temporary fix to mitigate any 4.19 clusters upgrading. Having something in installer that we can consume should work, I think we floated some other ideas, happy to discuss via an image mode sync or similar.

@djoshy djoshy changed the title OCPBUGS-57426: Boot image controller should correctly handle marketplace boot images OCPBUGS-57426: OCPBUGS-57506: Boot image controller should correctly handle marketplace boot images Jun 20, 2025
@djoshy
Copy link
Contributor Author

djoshy commented Jun 22, 2025

/unhold

/retest-required

/cherrypick release-4.19

@openshift-cherrypick-robot

@djoshy: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

In response to this:

/unhold

/retest-required

/cherrypick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2025
@djoshy
Copy link
Contributor Author

djoshy commented Jun 22, 2025

/test all

@djoshy
Copy link
Contributor Author

djoshy commented Jun 22, 2025

/retest-required

Copy link
Contributor

openshift-ci bot commented Jun 22, 2025

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-ovn-upgrade-out-of-change 5e332d5 link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/okd-scos-e2e-aws-ovn 5e332d5 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@isabella-janssen
Copy link
Member

/lgtm

@sergiordlr
Copy link
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jun 23, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2025
Copy link
Contributor

openshift-ci bot commented Jun 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, isabella-janssen, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [djoshy,isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 0fd8575 into openshift:main Jun 23, 2025
17 of 19 checks passed
@openshift-ci-robot
Copy link
Contributor

@djoshy: Jira Issue OCPBUGS-57426: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-57426 has been moved to the MODIFIED state.

In response to this:

Closes OCPBUGS-57348 and OCPBUGS-57426.

The MCO will now only update boot images that have been published previously in the installer.

  • On AWS, we validate against a hardcoded AMI list.
  • On GCP, we validate against a URL header.

When an unexpected boot image is found, the controller will skip the MachineSet for updates.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@djoshy: new pull request created: #5135

In response to this:

/unhold

/retest-required

/cherrypick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.20.0-202506231443.p0.g0fd8575.assembly.stream.el9.
All builds following this will include this PR.

@djoshy djoshy deleted the use-known-boot-image branch July 9, 2025 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants