
A potential bug in NewElementalPartitionsFromList which caused upgrade error code 33 #1827

Closed · w13915984028 opened this issue Sep 15, 2023 · 10 comments · Fixed by #1836
Labels: kind/bug (Something isn't working), kind/question (Further information is requested)

@w13915984028

elemental-toolkit version: Latest

CPU architecture, OS, and Version:

Describe the bug

Error code 33 was encountered during a Harvester upgrade.

ERRO[2023-09-13T12:02:42Z] Invalid upgrade command setup undefined state partition 
elemental upgrade failed with return code: 33
+ ret=33


Code path (the return here is what yields exit code 33, i.e. elementalError.ReadingSpecConfig):
			spec, err := config.ReadUpgradeSpec(cfg, cmd.Flags()) // builds the upgrade spec from config and flags
			if err != nil {
				cfg.Logger.Errorf("Invalid upgrade command setup %v", err) // the error message seen in the log above
				return elementalError.NewFromError(err, elementalError.ReadingSpecConfig) // yields exit code 33
			}

Please refer to harvester/harvester#4526 (comment).

// NewElementalPartitionsFromList fills an ElementalPartitions instance from given
// partitions list. First tries to match partitions by partition label, if not,
// it tries to match partitions by filesystem label
func NewElementalPartitionsFromList(pl PartitionList, state *InstallState) ElementalPartitions {
	ep := ElementalPartitions{}

	lm := map[string]string{
		constants.EfiPartName:        constants.EfiLabel,
		constants.OEMPartName:        constants.OEMLabel,
		constants.RecoveryPartName:   constants.RecoveryLabel,
		constants.StatePartName:      constants.StateLabel,
		constants.PersistentPartName: constants.PersistentLabel,
	}
	if state != nil {
		for k := range lm {
			if state.Partitions[k] != nil {
				lm[k] = state.Partitions[k].FSLabel // state.Partitions[k].FSLabel may be an empty string, silently replacing the default label
			}
		}
	}
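
To illustrate the failure mode: if state.yaml contains a partition entry whose FSLabel is empty, the loop above replaces the default label with the empty string, and the state partition can no longer be matched later. A minimal, self-contained sketch (with simplified stand-in types, not the toolkit's actual ones):

package main

import "fmt"

// Simplified stand-ins for the toolkit types, for illustration only.
type PartitionState struct{ FSLabel string }
type InstallState struct{ Partitions map[string]*PartitionState }

func main() {
	// Default labels, as in the lm map above (trimmed to one entry).
	lm := map[string]string{"state": "COS_STATE"}

	// A state.yaml entry that exists but carries no filesystem label.
	state := &InstallState{Partitions: map[string]*PartitionState{
		"state": {FSLabel: ""},
	}}

	for k := range lm {
		if state.Partitions[k] != nil {
			lm[k] = state.Partitions[k].FSLabel // overwrites "COS_STATE" with ""
		}
	}

	fmt.Printf("state partition label: %q\n", lm["state"]) // prints: ""
	// With an empty label, the later partition lookup fails and the
	// upgrade spec sanitation reports "undefined state partition".
}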

To Reproduce

Harvester upgrade.

For reasons not yet clear, the /host/run/initramfs/cos-state/state.yaml is missing the FSLabel, which in turn caused error code 33:

pre-harvester01-01:/ # cat /host/run/initramfs/cos-state/state.yaml 
# Autogenerated file by elemental client, do not edit

date: "2023-09-13T08:31:42Z"
state:
    active:
        source: dir:///tmp/tmp.01deNrXNEC
        label: COS_ACTIVE
        fs: ext2
    passive: null
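
For comparison, a complete state.yaml would also record the partition-level filesystem label, roughly along these lines (the key name here is illustrative, not the toolkit's exact schema; the point is that the partition label is absent from the file above):

date: "2023-09-13T08:31:42Z"
state:
    label: COS_STATE      # partition filesystem label, missing in the file above
    active:
        source: dir:///tmp/tmp.01deNrXNEC
        label: COS_ACTIVE
        fs: ext2
    passive: null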

Expected behavior

Logs

Additional context

@kkaempf added the kind/bug (Something isn't working) and kind/question (Further information is requested) labels on Sep 15, 2023
@kkaempf (Contributor) commented Sep 15, 2023

@w13915984028 can you give us access to a system where the error happens?

@frelon (Contributor) commented Sep 15, 2023

Hi @w13915984028, I'm having trouble reproducing this. Could you share the logs (including the --debug flag) from installing the system, such that it produces that state file?

@w13915984028 (Author)

@kkaempf @frelon thanks.
The issue is in a community user's environment; we can only see the current content.

The cluster's node1 is upgrading from Harvester v1.1.2 (old elemental-cli) to v1.2.0 (elemental-toolkit), and it is stuck now. Maybe the old version generated the malformed YAML. (I have not checked the old code yet.)

If you need any information, please follow up at harvester/harvester#4526 (comment) :)

@davidcassany (Contributor)

Ok, I finally went through the upgrade path of previous versions starting from v0.0.14. I found there are three relevant commits, in chronological order:

  • 9285843 changed the configuration to a better structured YAML configuration that is prefilled with defaults, so config priority goes (low -> high): defaults -> yaml config -> cmd line flags -> env variables. This commit also renamed GPT partition names (p.state to state); at that stage, partition names were not used or mandatory. This is part of v0.0.16.
  • 02bb111 introduced the state.yaml file with the purpose of tracking installed container images and collecting some other installation details, like the filesystem labels of images and partitions. This allows future changes of default values without breaking upgrades. This was part of v0.0.16.
  • bc21b9c started using state.yaml in upgrades, which allows changes of default filesystem labels without breaking upgrades. This is part of v0.1.3.

Unless I missed some detail, my understanding is that a safe upgrade path would be v0.0.14 -> v0.0.16 ~ v0.1.2 -> v0.1.3 ~ HEAD, always using default filesystem labels. Note I don't have these old environments at hand to verify it. For non-default setups it might be slightly different, though I think no custom values were allowed back in v0.0.14.

So in short, the issue comes from the fact that the Harvester deployment included an incomplete state.yaml file and the upgrade action relies on it. Was this state.yaml created manually or by Harvester somehow?

@Vicente-Cheng suggested an improvement which is reasonable to my eyes: filling missing state.yaml values with defaults. Currently default values are only used if no state.yaml is provided at all; we can expand this to cover missing values.

From a Harvester POV, I think the solution is to upgrade from a system having a complete state.yaml file or a system completely missing it.
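
A minimal sketch of that improvement applied to the loop quoted earlier, falling back to the default label when the recorded value is empty (a sketch of the idea only, not necessarily the shape of the actual fix in #1836):

	if state != nil {
		for k := range lm {
			// Only honor the recorded label when it is actually set;
			// otherwise keep the default from the lm map.
			if state.Partitions[k] != nil && state.Partitions[k].FSLabel != "" {
				lm[k] = state.Partitions[k].FSLabel
			}
		}
	}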

@davidcassany self-assigned this on Sep 22, 2023
@Vicente-Cheng

So in short, the issue comes from the fact that the Harvester deployment included an incomplete state.yaml file and the upgrade action relies on it. Was this state.yaml created manually or by Harvester somehow?

I checked with an early version like Harvester v1.1.1; there is no state.yaml under /run/initramfs/cos-state/.
But after an upgrade (with branch v0.0.14-harvester1 of the elemental-cli repo), this file was generated.

Harvester v1.1.2 and later versions use the newer elemental-cli, so the format of state.yaml is correct.

@davidcassany (Contributor)

But after an upgrade (with branch v0.0.14-harvester1 of the elemental-cli repo), this file was generated.

I'd like to know the exact toolkit versions for which the upgrade caused the creation of the state.yaml with missing information; if it was generated by the toolkit, I might be missing some corner case. I am wondering whether some elemental-toolkit version between 02bb111 (v0.0.16) and 5ae3220 (v0.1.3) was used for the upgrade.

@Vicente-Cheng

But after an upgrade (with branch v0.0.14-harvester1 of the elemental-cli repo), this file was generated.

I'd like to know the exact toolkit versions for which the upgrade caused the creation of the state.yaml with missing information; if it was generated by the toolkit, I might be missing some corner case. I am wondering whether some elemental-toolkit version between 02bb111 (v0.0.16) and 5ae3220 (v0.1.3) was used for the upgrade.

Let me reproduce it again; will update later.

@Vicente-Cheng

TL;DR: the elemental toolkit version used for the upgrade (for the command elemental upgrade) is below:

harvester-node-0:~ # curl -sfL https://github.com/rancher/elemental-cli/releases/download/v0.1.1/elemental-v0.1.1-Linux-x86_64.tar.gz | tar -xz -C /tmp/elemental_cli
harvester-node-0:~ # /tmp/elemental_cli/elemental version
v0.1.1+gb918db3

Related commit: rancher/elemental-cli@b918db3
Related branch: v0.1.1


Details are described below (you can see that state.yaml was generated after the upgrade).

Harvester v1.1.1 partition list:

harvester-node-0:~ # lsblk -f
NAME   FSTYPE FSVER LABEL           UUID                                 FSAVAIL FSUSE% MOUNTPOINT
loop0  ext2   1.0   COS_ACTIVE      2ea58bad-b868-4186-a8a5-5a057ddcee7a    1.5G    43% /
vda
├─vda1
├─vda2 ext4   1.0   COS_OEM         3ef518d6-4f0e-4fcd-b5fb-8aa5769150af   39.7M     0% /oem
├─vda3 ext4   1.0   COS_STATE       494048a8-948e-4d31-8447-7e49c669b0c2    9.6G    29% /run/initramfs/cos-state
├─vda4 ext4   1.0   COS_RECOVERY    9b248a98-313a-4daa-a188-2fa652c1e84b
├─vda5 ext4   1.0   COS_PERSISTENT  446810ff-a489-4aa4-bbce-05907c699a58   56.1G    31% /usr/local
└─vda6 ext4   1.0   HARV_LH_DEFAULT 92574930-1669-4572-bec6-17b7e12e7d6c  359.1G     0% /var/lib/harvester/defaultdisk

harvester-node-0:~ # ls -al /run/initramfs/cos-state/
total 36
drwxr-xr-x 5 root root  4096 Sep 22 15:19 .
drwxr-xr-x 6 root root   140 Sep 22 15:20 ..
drwxr-xr-x 2 root root  4096 Sep 22 15:16 cOS
drwxr-xr-x 5 root root  4096 Sep 22 15:19 grub2
-rw-r--r-- 1 root root  1024 Sep 22 15:16 grub_oem_env
-rw-r--r-- 1 root root   356 Sep 22 15:19 grubmenu
drwx------ 2 root root 16384 Sep 22 15:15 lost+found

The elemental-cli version (for the command elemental install):

harvester-node-0:/oem/install # elemental version
v0.0.14+gedc412b

The elemental-cli version (for the command elemental upgrade):

harvester-node-0:~ # curl -sfL https://github.com/rancher/elemental-cli/releases/download/v0.1.1/elemental-v0.1.1-Linux-x86_64.tar.gz | tar -xz -C /tmp/elemental_cli
harvester-node-0:~ # /tmp/elemental_cli/elemental version
v0.1.1+gb918db3

After the upgrade:

harvester-node-0:~ # /tmp/elemental_cli/elemental upgrade --debug --directory /tmp/new_root/ --logfile upgrade.log
... logs for upgrade

harvester-node-0:~ # ls /run/initramfs/cos-state/ -al
total 40
drwxr-xr-x 5 root root  4096 Sep 22 16:02 .
drwxr-xr-x 6 root root   140 Sep 22 15:20 ..
drwxr-xr-x 2 root root  4096 Sep 22 16:02 cOS
drwxr-xr-x 5 root root  4096 Sep 22 15:19 grub2
-rw-r--r-- 1 root root  1024 Sep 22 16:02 grub_oem_env
-rw-r--r-- 1 root root   356 Sep 22 15:19 grubmenu
drwx------ 2 root root 16384 Sep 22 15:15 lost+found
-rw-r----- 1 root root   200 Sep 22 16:02 state.yaml

harvester-node-0:~ # cat /run/initramfs/cos-state/state.yaml
# Autogenerated file by elemental client, do not edit

date: "2023-09-22T16:02:07Z"
state:
    active:
        source: dir:///tmp/new_root
        label: COS_ACTIVE
        fs: ext2
    passive: null

@w13915984028 changed the title from "A potential bug in NewElementalPartitionsFromList which cuased upgrade error code 33" to "A potential bug in NewElementalPartitionsFromList which caused upgrade error code 33" on Sep 22, 2023
@davidcassany (Contributor)

@Vicente-Cheng thanks for all the details. Yes, this explains the issue: v0.1.1 was missing rancher/elemental-cli#388, which was included as part of v0.1.3.

We need to better align Elemental and Harvester releases.

@Vicente-Cheng

@davidcassany,
Nice fix! Thanks for your detailed investigation and quick fix!

I think Harvester will have better alignment from v1.2.0 onward by relying on the OBS elemental RPM!
