[BUG] [Instability] ACI-CNI variables not configurable using Rancher server UI #44980

Closed
akhilesh-oc opened this issue Mar 29, 2024 · 5 comments
Labels: kind/bug, QA/S, team/hostbusters

akhilesh-oc commented Mar 29, 2024

Rancher Server Setup

  • Rancher version: v2.8.2, v2.7.6, v2.8.3
  • Installation option: Docker

Information about the Cluster

  • Kubernetes version: v1.26.11-rancher2-2 (all the k8s versions mapped to ACI-CNI versions 6.0.3.2, 6.0.3.1 and 5.2.7.1), v1.26.14-rancher1-1, v1.27.11-rancher1-1 (all the k8s versions mapped to ACI-CNI version 6.0.4.1)
  • Cluster Type: Downstream
  • Custom = Running a docker command on a node

User Information
What is the role of the user logged in?

  • Admin

Describe the bug
We are experiencing an issue where the values we enter for the ACI fields are discarded and the default values are used instead. The problem appears to be on the Rancher server side. When we update the cluster configuration in the UI by entering fields through the "Edit as YAML" option, the values we provide disappear as soon as we click the Save button.
At the API level, the POST request is sent with our fields intact, but the response does not include them.
In addition, the variables we added for 6.0.3.1 (RKE 1.4.9, successfully tested on Rancher v2.7.9) are not being picked up.
The variables introduced with ACI-CNI 6.0.4.1 are not being picked up by Rancher server v2.8.3.

To Reproduce
1. Go to the Rancher server UI and create a custom RKE1 cluster.
2. Give it a name and select the k8s version you want (v1.26.11-rancher2-2).
3. Go to the top and select the "Edit as YAML" option.
4. Copy the generated network provider contents into the network section and add the custom values for ACI CNI as below:

network:
  plugin: "aci"
  aci_network_provider:
    opflex_device_delete_timeout: "101"                    # old field
    opflex_device_reconnect_wait_timeout: "15"             # new field added with 6.0.3.1
    toleration_seconds: "111"                              # new field added with 6.0.3.2

Edit: for v2.8.3, add the following fields:

network:
  plugin: "aci"
  aci_network_provider:
    opflex_device_delete_timeout: "101"                    # field added with 5.2.7.1
    opflex_device_reconnect_wait_timeout: "15"             # field added with 6.0.3.1
    toleration_seconds: "111"                              # field added with 6.0.3.2
    taint_not_ready_node: 'true'                           # new field added with 6.0.4.1
    apic_connection_retry_limit: "6"                       # new field added with 6.0.4.1
5. Click Next to go to the node registration command page, check the roles you want, copy the command, and run it on the node you want to register. Repeat for all the nodes you want in the cluster.

Result
After creating the cluster, we can verify that the ACI values are missing as follows:
a. Go to 'Edit Config'
b. Select 'Edit as YAML'
c. Check whether the ACI variables added above are still present.
We now see that the ACI values are missing:

  network:
    aci_network_provider:
      opflex_device_delete_timeout: '101'
    mtu: 0
    plugin: aci

Also, if everything works fine, we should be able to see these values with kubectl commands like:
kubectl describe deploy -n aci-containers-system aci-containers-controller | grep toleration

Edit: for v2.8.3, only the 6.0.4.1 variables are being discarded:

  network:
    aci_network_provider:
      opflex_device_delete_timeout: '101'
      opflex_device_reconnect_wait_timeout: '15'
      toleration_seconds: '111'
    mtu: 0
    plugin: aci

Expected Result
Once the ACI CNI fields are added in the Network section using 'Edit as YAML', they should be reflected in the Kubernetes resources and in the UI's 'Edit as YAML' view as well.

(Edited to include details pertaining to a similar issue on Rancher v2.8.3; a separate issue was created here: #45200)

@akhilesh-oc akhilesh-oc added the kind/bug Issues that are defects reported by users or that we know have reached a real release label Mar 29, 2024
@kinarashah kinarashah self-assigned this Mar 29, 2024
@kinarashah

@rancher/rancher-team-2-hostbusters-qa Need help with reproducing this issue. I am trying to reproduce with kinarashah/rancher:v2.8.2-linux, which has additional logs that might help us debug further, but I have been unable to reproduce so far. @akhilesh-oc is able to see it with the rancher/rancher image but not with mine, so I am trying to figure out how to get more info.


kinarashah commented May 1, 2024

Able to reproduce the issue.

Steps to reproduce on Rancher v2.8.3:

  • Create RKE1 custom cluster
  • Under 'Edit as YAML', replace the network section with:
  network:
    aci_network_provider:
      toleration_seconds: '111'
      taint_not_ready_node: 'true'
    plugin: aci
  • Save the cluster, no need to actually add nodes
  • Field taint_not_ready_node disappears from the config

Notes:

  • The same behavior is observed when editing cluster.management.cattle.io via kubectl
  • Logged marshaled objects in norman; the fields don't disappear until the final client PUT request for the update goes through, which is consistent with the kubectl behavior mentioned above
  • Only the latest new fields seem to be affected
  • Reproducible with just cluster creation, but sometimes the cluster also needed to be updated after creation to see the issue

@kinarashah

Debugging notes:

  • Looks like it's not connected to KDM, RKE, or any ACI version, but is more of a Rancher issue. To verify this, I tried the following on the latest upstream versions of Rancher and RKE for v2.6, v2.7 and v2.8:
  • The field always shows up when checked under v3/schemas/acinetworkprovider and v3/schemas/cluster.

Rancher v2.6 / RKE release/v1.3

  • Add FooName to types/rke_types.go under the AciNetworkProvider struct (see the sketch after this list):
FooName string `yaml:"foo_name,omitempty" json:"fooName,omitempty"`
  • Vendor RKE into Rancher (update go.mod and pkg/apis/go.mod, then run go mod tidy)
  • Run go generate and confirm the zz_generated* file for ACI gets updated
  • Repeat the same steps as above, with the following ACI config:
  network:
    aci_network_provider:
      apicUserName: testuser
      fooName: helloworld
    plugin: aci
  • fooName persists in the config with both the API and kubectl.
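
A minimal sketch of the field addition from the first step above, shown in context; the struct comes from types/rke_types.go in RKE, and the surrounding content here is illustrative only:

  // AciNetworkProvider carries the ACI-CNI options in the RKE cluster config.
  type AciNetworkProvider struct {
      // ... existing ACI fields ...

      // FooName is a throwaway test field used only to check whether newly
      // added fields survive the cluster save round trip.
      FooName string `yaml:"foo_name,omitempty" json:"fooName,omitempty"`
  }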

Rancher v2.7 / RKE release/v1.4

  • Same steps as above; fooName disappears after saving the cluster.

@kinarashah

We've confirmed the root cause of this issue: rancher/webhook vendors an older version of RKE, so the fields disappear when webhook doesn't have the matching version of RKE. This also explains why editing cluster CRs via kubectl doesn't work, because those requests also go through webhook.

Rancher v2.8.3 vendors RKE v1.5.7 and runs webhook v0.4.3, which vendors RKE v1.5.7-rc2. The new ACI field taint_not_ready_node doesn't exist in RKE v1.5.7-rc2 but does in RKE v1.5.7.
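
To illustrate the mechanism with a standalone sketch (not actual webhook code): when the submitted spec is decoded into a struct built from the older vendored RKE and then re-encoded, any key without a matching struct field is silently dropped. The struct and JSON key names below are trimmed and partly hypothetical; only the round-trip behavior is the point.

  package main

  import (
      "encoding/json"
      "fmt"
  )

  // Stand-in for the ACI config struct as vendored by the older RKE release:
  // it has no field for the new 6.0.4.1 option.
  type aciNetworkProviderOld struct {
      OpflexDeviceDeleteTimeout string `json:"opflexDeviceDeleteTimeout,omitempty"`
      TolerationSeconds         string `json:"tolerationSeconds,omitempty"`
  }

  func main() {
      // Spec as submitted through the UI/API, including the new field.
      in := []byte(`{"opflexDeviceDeleteTimeout":"101","tolerationSeconds":"111","taintNotReadyNode":"true"}`)

      var cfg aciNetworkProviderOld
      _ = json.Unmarshal(in, &cfg) // taintNotReadyNode has no matching field, so it is discarded

      out, _ := json.Marshal(cfg)
      fmt.Println(string(out)) // {"opflexDeviceDeleteTimeout":"101","tolerationSeconds":"111"}
  }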

@susesgartner susesgartner self-assigned this May 7, 2024
@snasovich snasovich added this to the v2.8-Next1 milestone May 7, 2024
@snasovich snasovich added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label May 8, 2024

susesgartner commented May 8, 2024

Validated that the ACI values do not disappear after provisioning an RKE1 cluster on v2.8.4-rc4.
