
"Failed creating partition: exit status 1" - log missing essential information #50

Closed
kkaempf opened this issue Jul 15, 2022 · 21 comments

kkaempf commented Jul 15, 2022

Output from journalctl -u elemental-operator after booting an ISO

-- Logs begin at Fri 2022-07-15 06:15:20 UTC, end at Fri 2022-07-15 06:18:02 UTC. --
Jul 15 06:15:40 localhost systemd[1]: Starting Elemental Operator Registration...
Jul 15 06:15:40 localhost elemental-operator[1600]: time="2022-07-15T06:15:40Z" level=info msg="Using TPMHash 1c9b6e73145c651a5cb73ee0eed13c2300effb030f4b7dbfefe5e7a63db96308 to dial wss://172.17.0.2/elemental/registration/b5msj66rsjm9twbnjnz2vsswpsr76w4hvl2mbn58jp9xn8cc94txsr"
Jul 15 06:15:40 localhost elemental-operator[1600]: Install environment:
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_CLOUD_INIT=/tmp/2415592138.yip
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_DEBUG=true
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_REBOOT=true
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_PASSWORD=<removed>
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_TARGET=/dev/sda
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_REGISTRATION_URL=https://172.17.0.2/elemental/registration/b5msj66rsjm9twbnjnz2vsswpsr76w4hvl2mbn58jp9xn8cc94txsr
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_REGISTRATION_CA_CERT=-----BEGIN CERTIFICATE-----
Jul 15 06:15:40 localhost elemental-operator[1600]: MIIBpzCCAU2gAwIBAgIBADAKBggqhkjOPQQDAjA7MRwwGgYDVQQKExNkeW5hbWlj
Jul 15 06:15:40 localhost elemental-operator[1600]: bGlzdGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwHhcNMjIw
Jul 15 06:15:40 localhost elemental-operator[1600]: NzA1MDcwOTU2WhcNMzIwNzAyMDcwOTU2WjA7MRwwGgYDVQQKExNkeW5hbWljbGlz
Jul 15 06:15:40 localhost elemental-operator[1600]: dGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwWTATBgcqhkjO
Jul 15 06:15:40 localhost elemental-operator[1600]: PQIBBggqhkjOPQMBBwNCAAS2wzKIZ5azTvfTMHc4+Zrmy9efM3ndy2ocgycwTCVB
Jul 15 06:15:40 localhost elemental-operator[1600]: rBdbRGH7t1wjGHtpaYh9CR2AflK34+CQ/nibHJLGzH8Go0IwQDAOBgNVHQ8BAf8E
Jul 15 06:15:40 localhost elemental-operator[1600]: BAMCAqQwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUa11N1nJQbJ8YX6MtGjva
Jul 15 06:15:40 localhost elemental-operator[1600]: X4KvRIUwCgYIKoZIzj0EAwIDSAAwRQIgNHzToz74FitImN5T4uCTdmOfYOKVwX0+
Jul 15 06:15:40 localhost elemental-operator[1600]: WyDGp78hxx8CIQDFThQWNi1j1JJTOiIJKXpqjrr4ndzWzsQHqRjveFqdBg==
Jul 15 06:15:40 localhost elemental-operator[1600]: -----END CERTIFICATE-----
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_URL=https://172.17.0.2/k8s/clusters/local
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_TOKEN=eyJhbGciOiJSUzI1NiIsImtpZCI6Ik1BU1FFdkhiVzI0ZFFKbF9TeDFNZzVtVHdlRHZURTZ0YmY4MHJCLWtxcjgifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJmbGVldC1kZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im15LW5vZGUtdG9rZW4tazhxeGIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoibXktbm9kZSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImQ4MTIxOTA5LTJhNWEtNGNkZi05YjQzLWQyYTc1ZTgzZjZjZSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpmbGVldC1kZWZhdWx0Om15LW5vZGUifQ.esXZgq5-6vRPwjnXP3g9L-RzXQKrj0LPTc8oqQN5OUCYtr6PU7lbg-43zsN83_J-I89nmVvH2nwvD1BbucZ9B3U7aq_B2wPZ-_rRveUj4Xsbp-7DenXbS5rqXPjra8ieEhJ5Nr62sNPrPfDkm-lCtiu5olHFgroK2KWHfgrU3DizHqVEJfa2eJJ6ZA8hWMiYZ-JzAuDaoB2Q67FbAJhYRDtSXF7ei5KsfbOXb-FBdm94TDunN6qL2I-kgT06xR7_FWlvqLB5dpU7RFA0xN_aWF5dEj5UknZBenGLrgJRRdaQeDz_tHZmKSVBbviC9z7FoQde4ZHmrGaUplTzmN1F6g
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_SECRET_NAME=m-qemu-standard-pc-q35-ich9-2009-not-specified
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_SECRET_NAMESPACE=fleet-default
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Starting elemental version 0.0.15
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Install called
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Running before-install hook
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Partitioning device...
Jul 15 06:15:40 localhost elemental-operator[1651]: ERRO[2022-07-15T06:15:40Z] Failed creating partition: exit status 1
Jul 15 06:15:40 localhost elemental-operator[1651]: ERRO[2022-07-15T06:15:40Z] Failed creating bios partition
Jul 15 06:15:40 localhost elemental-operator[1651]: Error: 1 error occurred:
Jul 15 06:15:40 localhost elemental-operator[1651]:         * exit status 1
Jul 15 06:15:40 localhost systemd[1]: elemental-operator.service: Main process exited, code=exited, status=1/FAILURE
Jul 15 06:15:40 localhost elemental-operator[1600]: time="2022-07-15T06:15:40Z" level=fatal msg="failed to write elemental config: exit status 1"
Jul 15 06:15:40 localhost systemd[1]: elemental-operator.service: Failed with result 'exit-code'.
Jul 15 06:15:40 localhost systemd[1]: Failed to start Elemental Operator Registration.

There are several things missing from this log:

  • "Partitioning device ..." should also print the actual device it's partitioning
  • "Failed creating partition" should print the partition number
  • How is the partition created? Is parted (or fdisk) being called? What is the full output of the partitioning tool?
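For illustration, a minimal Go sketch (hypothetical, not the actual elemental-cli code; the device, partition name, and sector values are taken from the log above) of the kind of error reporting being requested: the target device, the partition being created, and the partitioning tool's full output.

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// mkPart creates a single partition with parted and surfaces the tool's full
// output on failure instead of a bare "exit status 1".
func mkPart(device, name string, startSector, endSector int) error {
	args := []string{
		"--script", "--machine", "--", device, "unit", "s",
		"mkpart", name, fmt.Sprint(startSector), fmt.Sprint(endSector),
	}
	out, err := exec.Command("parted", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("creating partition %q on %s failed: %w\nparted output:\n%s",
			name, device, err, out)
	}
	return nil
}

func main() {
	// Values matching the failing bios partition on /dev/sda from the log above.
	if err := mkPart("/dev/sda", "bios", 2048, 4095); err != nil {
		log.Fatal(err)
	}
}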

kkaempf commented Jul 15, 2022

/proc/partitions shows that sda1 was created.


kkaempf commented Jul 15, 2022

Manually creating another partition (by calling fdisk) is successful.


kkaempf commented Jul 15, 2022

I could not reproduce the error :-/


kkaempf commented Jul 16, 2022

This just happened again. A systemctl restart elemental-operator fixed it. So it's a timing issue.


kkaempf commented Jul 16, 2022

udev needs several seconds to settle. 😉

@davidcassany

I made several attempts and could not reproduce it; I'd like to actually get a failure with the output from the debug flag enabled. #58 ensures the debug flag on elemental-cli is used if the install spec includes a debug: true. I can see in the logs above that ELEMENTAL_DEBUG=true did not have any effect; #58 fixes it.

@davidcassany

I am not fully convinced this is a timing issue between udev and parted, since this happened when creating the first partition of the disk, right after creating the partition table.

The command sequence should be equivalent to:

  • parted --script --machine -- /dev/vda unit s mklabel gpt
  • parted --script --machine -- /dev/vda unit s print
  • parted --script --machine -- /dev/vda unit s mkpart bios 2048 4095 set 1 bios_grub on

According to the logs above, I guess elemental-cli failed on the third call. Between the first and the second calls we could add a udevadm settle execution; however, I can't see how to verify that some relevant event has not been synced. The second call did not fail, and I do not feel like adding a retry mechanism for failures that happen on the mkpart parted command (third call).
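As an illustration of that idea only (a sketch, not the elemental-cli partitioner code), a udevadm settle call between the calls could look like this:

// Sketch: run "udevadm settle" after writing the new partition table so udev
// has processed the resulting events before parted touches the disk again.
// The parted arguments mirror the sequence listed above; error handling is
// kept minimal for brevity.
package main

import (
	"log"
	"os/exec"
)

func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	dev := "/dev/vda"
	run("parted", "--script", "--machine", "--", dev, "unit", "s", "mklabel", "gpt")
	// Wait for udev to process the new partition table before touching the disk again.
	run("udevadm", "settle")
	run("parted", "--script", "--machine", "--", dev, "unit", "s", "print")
	run("parted", "--script", "--machine", "--", dev, "unit", "s",
		"mkpart", "bios", "2048", "4095", "set", "1", "bios_grub", "on")
}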

I have some doubts that the root cause of this is a timing issue. I'd appreciate it if anyone could reproduce it with debug messages enabled (requires #58) and share the logs before merging rancher/elemental-cli#286.


kkaempf commented Jul 19, 2022

Happened again, now with more detail:

> cat operator.log 
-- Logs begin at Tue 2022-07-19 06:43:24 UTC, end at Tue 2022-07-19 06:44:59 UTC. --
Jul 19 06:43:45 localhost systemd[1]: Starting Elemental Operator Registration...
Jul 19 06:43:45 localhost elemental-operator[1612]: time="2022-07-19T06:43:45Z" level=info msg="Operator version 0.2.1, commit 3530dc5, commit date git20220718"
Jul 19 06:43:45 localhost elemental-operator[1612]: time="2022-07-19T06:43:45Z" level=info msg="Using TPMHash 1c9b6e73145c651a5cb73ee0eed13c2300effb030f4b7dbfefe5e7a63db96308 to dial wss://172.17.0.2/elemental/registration/g9mscdw8f7m9q4l59rgldx459t9s2ths9gvtqx4z422mm2p6hrv462"
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Starting elemental version 0.0.15 on commit 605346d
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Full config loaded: &v1.RunConfig{
Jul 19 06:43:46 localhost elemental-operator[1656]:   Strict: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   Reboot: true,
Jul 19 06:43:46 localhost elemental-operator[1656]:   PowerOff: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   CloudInitPaths: nil,
Jul 19 06:43:46 localhost elemental-operator[1656]:   EjectCD: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   Config: v1.Config{
Jul 19 06:43:46 localhost elemental-operator[1656]:     Logger: &v1.logrusWrapper{ // p0
Jul 19 06:43:46 localhost elemental-operator[1656]:       Logger: &logrus.Logger{
Jul 19 06:43:46 localhost elemental-operator[1656]:         Out: &os.File{},
Jul 19 06:43:46 localhost elemental-operator[1656]:         Hooks: logrus.LevelHooks{},
Jul 19 06:43:46 localhost elemental-operator[1656]:         Formatter: &logrus.TextFormatter{
Jul 19 06:43:46 localhost elemental-operator[1656]:           ForceColors: true,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableColors: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           ForceQuote: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableQuote: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           EnvironmentOverrideColors: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableTimestamp: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           FullTimestamp: true,
Jul 19 06:43:46 localhost elemental-operator[1656]:           TimestampFormat: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableSorting: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           SortingFunc: ,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableLevelTruncation: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           PadLevelText: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           QuoteEmptyFields: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           FieldMap: logrus.FieldMap(nil), // p1
Jul 19 06:43:46 localhost elemental-operator[1656]:           CallerPrettyfier: ,
Jul 19 06:43:46 localhost elemental-operator[1656]:         },
Jul 19 06:43:46 localhost elemental-operator[1656]:         ReportCaller: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:         Level: 5,
Jul 19 06:43:46 localhost elemental-operator[1656]:         ExitFunc: os.Exit,
Jul 19 06:43:46 localhost elemental-operator[1656]:       },
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Fs: &vfs.osfs{}, // p2
Jul 19 06:43:46 localhost elemental-operator[1656]:     Mounter: &mount.Mounter{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Runner: &v1.RealRunner{ // p3
Jul 19 06:43:46 localhost elemental-operator[1656]:       Logger: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Syscall: &v1.RealSyscall{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     CloudInitRunner: &cloudinit.YipCloudInitRunner{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Luet: &luet.Luet{
Jul 19 06:43:46 localhost elemental-operator[1656]:       VerifyImageUnpack: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:       TmpDir: "/tmp/elemental-1621924831",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Client: &http.Client{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Cosign: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:     Verify: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:     CosignPubKey: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LocalImage: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:     Repos: []v1.Repository{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Arch: "x86_64",
Jul 19 06:43:46 localhost elemental-operator[1656]:     SquashFsCompressionConfig: []string{
Jul 19 06:43:46 localhost elemental-operator[1656]:       "-comp",
Jul 19 06:43:46 localhost elemental-operator[1656]:       "xz",
Jul 19 06:43:46 localhost elemental-operator[1656]:       "-Xbcj",
Jul 19 06:43:46 localhost elemental-operator[1656]:       "x86",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     SquashFsNoCompression: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]: }
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Loaded install spec: &v1.InstallSpec{
Jul 19 06:43:46 localhost elemental-operator[1656]:   Target: "/dev/sda",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Firmware: "bios",
Jul 19 06:43:46 localhost elemental-operator[1656]:   PartTable: "gpt",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Partitions: v1.ElementalPartitions{
Jul 19 06:43:46 localhost elemental-operator[1656]:     BIOS: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "bios",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 1,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: []string{
Jul 19 06:43:46 localhost elemental-operator[1656]:         "bios_grub",
Jul 19 06:43:46 localhost elemental-operator[1656]:       },
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     EFI: nil,
Jul 19 06:43:46 localhost elemental-operator[1656]:     OEM: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "oem",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_OEM",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 64,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: []string{}, // p0
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/oem",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Recovery: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "recovery",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_RECOVERY",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 8192,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/recovery",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     State: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "state",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_STATE",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 15360,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/state",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Persistent: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "persistent",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_PERSISTENT",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/persistent",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   NoFormat: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   Force: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   CloudInit: "/tmp/767908735.yip",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Iso: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   GrubDefEntry: "cOS",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Tty: "tty1",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Active: v1.Image{
Jul 19 06:43:46 localhost elemental-operator[1656]:     File: "/run/cos/state/cOS/active.img",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Label: "COS_ACTIVE",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Size: 3072,
Jul 19 06:43:46 localhost elemental-operator[1656]:     FS: "ext2",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Source: &v1.ImageSource{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     MountPoint: "/run/cos/active",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LoopDevice: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   Recovery: v1.Image{
Jul 19 06:43:46 localhost elemental-operator[1656]:     File: "/run/cos/recovery/cOS/recovery.img",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Label: "COS_SYSTEM",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Size: 0,
Jul 19 06:43:46 localhost elemental-operator[1656]:     FS: "ext2",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Source: &v1.ImageSource{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     MountPoint: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LoopDevice: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   Passive: v1.Image{
Jul 19 06:43:46 localhost elemental-operator[1656]:     File: "/run/cos/state/cOS/passive.img",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Label: "COS_PASSIVE",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Size: 0,
Jul 19 06:43:46 localhost elemental-operator[1656]:     FS: "ext2",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Source: &v1.ImageSource{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     MountPoint: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LoopDevice: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   GrubConf: "/etc/cos/grub.cfg",
Jul 19 06:43:46 localhost elemental-operator[1656]: }
Jul 19 06:43:46 localhost elemental-operator[1656]: INFO[2022-07-19T06:43:46Z] Install called
Jul 19 06:43:46 localhost elemental-operator[1656]: INFO[2022-07-19T06:43:46Z] Running before-install hook
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'blkdeactivate --lvmoptions retry,wholevg --dmoptions force,retry --errors'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] blkdeactivate command output: Deactivating block devices:
Jul 19 06:43:46 localhost elemental-operator[1656]: INFO[2022-07-19T06:43:46Z] Partitioning device...
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'parted --script --machine -- /dev/sda unit s mklabel gpt'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'parted --script --machine -- /dev/sda unit s print'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Adding partition bios
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'parted --script --machine -- /dev/sda unit s mkpart bios  2048 4095 set 1 bios_grub on'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] partitioner output: Error: Partition(s) 1 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Jul 19 06:43:46 localhost elemental-operator[1656]: ERRO[2022-07-19T06:43:46Z] Failed creating partition: exit status 1
Jul 19 06:43:46 localhost elemental-operator[1656]: ERRO[2022-07-19T06:43:46Z] Failed creating bios partition
Jul 19 06:43:46 localhost elemental-operator[1656]: Error: 1 error occurred:
Jul 19 06:43:46 localhost elemental-operator[1656]:         * exit status 1
Jul 19 06:43:46 localhost systemd[1]: elemental-operator.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 06:43:46 localhost elemental-operator[1612]: time="2022-07-19T06:43:46Z" level=fatal msg="failed calling elemental client: exit status 1"
Jul 19 06:43:46 localhost systemd[1]: elemental-operator.service: Failed with result 'exit-code'.
Jul 19 06:43:46 localhost systemd[1]: Failed to start Elemental Operator Registration.


kkaempf commented Jul 19, 2022

Partition got created

# cat /proc/partitions 
major minor  #blocks  name

   7        0     384448 loop0
   8        0   41943040 sda
   8        1       1024 sda1
  11        0     467968 sr0


davidcassany commented Jul 19, 2022

Awesome, thanks for the report including the logs... 🤔 I actually managed to reproduce something similar. To me it feels like some sort of race condition during boot; in my case it fails on the parted --script --machine -- /dev/vda unit s mklabel gpt call with the same error message. It only fails if there was a previous installation on the disk. Still debugging, but closer to the root cause.

At least for the issue I experienced, I can assess that the changes in rancher/elemental-cli#286 have no effect.

@davidcassany

@kkaempf here is a request to update the elemental-operator.service: https://build.opensuse.org/request/show/990105. From my tests it looks like elemental-operator.service has some sort of bad interaction with cos-setup-network.service; both are executed in parallel.

This issue raises the question of whether the service unit file should belong to this repository instead of OBS, or, going further, whether we actually need the service unit file at all. Wouldn't it be more consistent to call elemental-operator register as part of one of our boot stages? More specifically, we could call it as part of a cloud-init config network.after stage, so it would be the last thing done as part of cos-setup-network.service. This way we concentrate our boot procedure within the cloud-init stages and their config files.
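For illustration, a rough yip/cloud-init sketch of that idea; the file name, stage name, and exact schema here are assumptions, not the shipped configuration:

# /system/oem/99_elemental-operator.yaml (hypothetical path)
stages:
  network.after:
    - name: "Elemental operator bootstrap"
      commands:
        - elemental-operator register --debug /run/initramfs/live/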


kkaempf commented Jul 19, 2022

I'm fine to drop the .service file at any time 😉

@davidcassany

The provided solution in https://build.opensuse.org/request/show/990105 was incomplete, my bad... now the service does not start automatically. In my local tests I just changed the service unit file, but I did not realize that the symlinks to multi-user.target were already created, hence new builds including https://build.opensuse.org/request/show/990105 do not start elemental-operator at boot... my bad, I am about to provide a new SR.


kkaempf commented Jul 19, 2022

Ah, that explains it 😆

@davidcassany

Ah, that explains it 😆

My bad, sorry... 🙏 In my local env I was just placing the new unit file on top, and I did not realize that there was a spurious symlink already created during the image build while running systemctl enable elemental-operator.service. Here is https://build.opensuse.org/request/show/990272, which should fix the regression and hopefully the former issue.

Gonna add a card to drop this service and call elemental-operator based on our cloud-init config files. I think this approach is way more explicit and easier to maintain; however, it requires a few changes regarding /etc/motd and finding a proper way to store and read elemental-operator logs. That's why a separate specific card makes sense.


kkaempf commented Jul 19, 2022

Gonna add a card to drop this service and call elemental-operator based on our cloud-init config files,

👍🏻


kkaempf commented Jul 20, 2022

After doing more than half a dozen installs without a problem, I'd consider this done.

@davidcassany

After the change to running elemental-operator as part of cloud-init, this issue appeared again. Not very often, but still...

These are the relevant logs I get from time to time:


Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Done executing stage 'network'
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Running stage: network.after
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/01_elemental-rootfs.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/04_accounting.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/05_motd_and_autologin.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/05_network.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/06_recovery.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/07_live.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/08_boot_assessment.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/09_services.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/99_elemental-operator.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Applying 'Elemental operator bootstrap' for stage 'network.after'. Total stages: 1
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Processing stage step ''. ( commands: 1, files: 0, ... )
Jul 22 15:24:18 localhost elemental[1680]: time="2022-07-22T15:24:18Z" level=info msg="Operator version 0.3.0, commit f2ab68c, commit date git20220722"
Jul 22 15:24:18 localhost elemental[1680]: time="2022-07-22T15:24:18Z" level=info msg="Using TPMHash f6c611b8f2310511b1f1337e3b930754f9d9cf2863771f178636f3ce3de91da5 to dial wss://ranchersuse/elemental/registration/blz4scw6dxwvmgdpgx8grctf8tlm4wbcp5ntctjf5tjgz668rjw6fk"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="Fetched configuration from manager cluster:\nelemental:\n  install:\n    device: /dev/vda\n    debug: true\n    reboot: true\n  registration:\n    url: https://ranchersuse/elemental/registration/blz4scw6dxwvmgdpgx8grctf8tlm4wbcp5ntctjf5tjgz668rjw6fk\n    cacert: |-\n      -----BEGIN CERTIFICATE-----\n      MIIBpzCCAU2gAwIBAgIBADAKBggqhkjOPQQDAjA7MRwwGgYDVQQKExNkeW5hbWlj\n      bGlzdGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwHhcNMjIw\n      NzE0MDgzMTM3WhcNMzIwNzExMDgzMTM3WjA7MRwwGgYDVQQKExNkeW5hbWljbGlz\n      dGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwWTATBgcqhkjO\n      PQIBBggqhkjOPQMBBwNCAARALzVe/qVHEhX4doODreuKydYKvLHRTJ+eG85jIimV\n      VN001MshCh3KusuYBuGH8GSY+Zz4ieJDICEnBOrdCVp2o0IwQDAOBgNVHQ8BAf8E\n      BAMCAqQwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUa+46+mWZVkp/r/S9wZdo\n      KS7ZNsswCgYIKoZIzj0EAwIDSAAwRQIhAMKAV3P9mLAu0Esrcg0iDUq+Q46+H4/r\n      XXseh20+NnXHAiABBrj0T6CiLScUxxRr3m9ciP4bIHDKQPMJPRtOJIK9BA==\n      -----END CERTIFICATE-----\n    emulatetpm: false\n    emulatedtpmseed: 0\n    nosmbios: false\n    labels: {}\n  systemagent:\n    url: https://ranchersuse/k8s/clusters/local\n    token: eyJhbGciOiJSUzI1NiIsImtpZCI6IjBpQkxscFdnbGFhaVk3LVU4NmlkcjUxQkJCdFZPTXpibFdqWmtNVlVubDQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJmbGVldC1kZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6InRlc3Qtbm9kZXMtdG9rZW4tNng1MnciLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoidGVzdC1ub2RlcyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjQ4ZmZiNTcyLTFmNTAtNDI3Ny04ZWFiLWMxN2E3NzRlYTQ1ZiIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpmbGVldC1kZWZhdWx0OnRlc3Qtbm9kZXMifQ.ZzehNpPW1AnWMQZLkquDjUdsry9jdoJIBzbR-IZ1gJ9czhoIjTpYSMmONx8m-Or7413XB0hHGwYKSrIYCrR6VTHpTpgLR5E9chvZ9E0tDIRkRnO-ERKt5QWOP5vwf-cb5QnMwL0MZNeCQRd2pIK90HaDCvUsVpbFFvnM4LL6rFMpQCR8HlB_5_DdSHwtMLfF55EAVeTopuiglQAX126CBfrLekT-7OmNofFT9IssCedzWGq0eDMPhmyVE46dP86ecN4ZNLEzLWls66GT65jtTq26M83x4OG1dUkbtzTh1SnH5WgIe-iEALNmLejKp2GnXULPtB88vbMQHKSBOL8e2Q\n    secretname: m-qemu-standard-pc-q35-ich9-2009-not-specified\n    secretnamespace: fleet-default\ncloud-config:\n  users:\n  - name: root\n    passwd: password\n  - name: dcassany\n    passwd: $6$z7WFaY8420.r91h3$io2HBMke9VNW6uodX/Ci4VotzWk89eZ8AvseLHekUSHPo55Qz3rv4jdrA9M6ONRAvjzGTJe1Wg.TC8YEDL0yU1\n\n\n"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="Computed environment variables:"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="ELEMENTAL_DEBUG=true\n"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="ELEMENTAL_REBOOT=true\n"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="ELEMENTAL_INSTALL_TARGET=/dev/vda\n"
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Starting elemental version 0.0.15 on commit 605346d
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Full config loaded: &v1.RunConfig{
Jul 22 15:24:19 localhost elemental[1687]:   Strict: false,
Jul 22 15:24:19 localhost elemental[1687]:   Reboot: true,
Jul 22 15:24:19 localhost elemental[1687]:   PowerOff: false,
Jul 22 15:24:19 localhost elemental[1687]:   CloudInitPaths: nil,
Jul 22 15:24:19 localhost elemental[1687]:   EjectCD: false,
Jul 22 15:24:19 localhost elemental[1687]:   Config: v1.Config{
Jul 22 15:24:19 localhost elemental[1687]:     Logger: &v1.logrusWrapper{ // p0
Jul 22 15:24:19 localhost elemental[1687]:       Logger: &logrus.Logger{
Jul 22 15:24:19 localhost elemental[1687]:         Out: &os.File{},
Jul 22 15:24:19 localhost elemental[1687]:         Hooks: logrus.LevelHooks{},
Jul 22 15:24:19 localhost elemental[1687]:         Formatter: &logrus.TextFormatter{
Jul 22 15:24:19 localhost elemental[1687]:           ForceColors: true,
Jul 22 15:24:19 localhost elemental[1687]:           DisableColors: false,
Jul 22 15:24:19 localhost elemental[1687]:           ForceQuote: false,
Jul 22 15:24:19 localhost elemental[1687]:           DisableQuote: false,
Jul 22 15:24:19 localhost elemental[1687]:           EnvironmentOverrideColors: false,
Jul 22 15:24:19 localhost elemental[1687]:           DisableTimestamp: false,
Jul 22 15:24:19 localhost elemental[1687]:           FullTimestamp: true,
Jul 22 15:24:19 localhost elemental[1687]:           TimestampFormat: "",
Jul 22 15:24:19 localhost elemental[1687]:           DisableSorting: false,
Jul 22 15:24:19 localhost elemental[1687]:           SortingFunc: ,
Jul 22 15:24:19 localhost elemental[1687]:           DisableLevelTruncation: false,
Jul 22 15:24:19 localhost elemental[1687]:           PadLevelText: false,
Jul 22 15:24:19 localhost elemental[1687]:           QuoteEmptyFields: false,
Jul 22 15:24:19 localhost elemental[1687]:           FieldMap: logrus.FieldMap(nil), // p1
Jul 22 15:24:19 localhost elemental[1687]:           CallerPrettyfier: ,
Jul 22 15:24:19 localhost elemental[1687]:         },
Jul 22 15:24:19 localhost elemental[1687]:         ReportCaller: false,
Jul 22 15:24:19 localhost elemental[1687]:         Level: 5,
Jul 22 15:24:19 localhost elemental[1687]:         ExitFunc: os.Exit,
Jul 22 15:24:19 localhost elemental[1687]:       },
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Fs: &vfs.osfs{}, // p2
Jul 22 15:24:19 localhost elemental[1687]:     Mounter: &mount.Mounter{},
Jul 22 15:24:19 localhost elemental[1687]:     Runner: &v1.RealRunner{ // p3
Jul 22 15:24:19 localhost elemental[1687]:       Logger: p0,
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Syscall: &v1.RealSyscall{},
Jul 22 15:24:19 localhost elemental[1687]:     CloudInitRunner: &cloudinit.YipCloudInitRunner{},
Jul 22 15:24:19 localhost elemental[1687]:     Luet: &luet.Luet{
Jul 22 15:24:19 localhost elemental[1687]:       VerifyImageUnpack: false,
Jul 22 15:24:19 localhost elemental[1687]:       TmpDir: "/tmp/elemental-4140224633",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Client: &http.Client{},
Jul 22 15:24:19 localhost elemental[1687]:     Cosign: false,
Jul 22 15:24:19 localhost elemental[1687]:     Verify: false,
Jul 22 15:24:19 localhost elemental[1687]:     CosignPubKey: "",
Jul 22 15:24:19 localhost elemental[1687]:     LocalImage: false,
Jul 22 15:24:19 localhost elemental[1687]:     Repos: []v1.Repository{},
Jul 22 15:24:19 localhost elemental[1687]:     Arch: "x86_64",
Jul 22 15:24:19 localhost elemental[1687]:     SquashFsCompressionConfig: []string{
Jul 22 15:24:19 localhost elemental[1687]:       "-comp",
Jul 22 15:24:19 localhost elemental[1687]:       "xz",
Jul 22 15:24:19 localhost elemental[1687]:       "-Xbcj",
Jul 22 15:24:19 localhost elemental[1687]:       "x86",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     SquashFsNoCompression: false,
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]: }
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Loaded install spec: &v1.InstallSpec{
Jul 22 15:24:19 localhost elemental[1687]:   Target: "/dev/vda",
Jul 22 15:24:19 localhost elemental[1687]:   Firmware: "bios",
Jul 22 15:24:19 localhost elemental[1687]:   PartTable: "gpt",
Jul 22 15:24:19 localhost elemental[1687]:   Partitions: v1.ElementalPartitions{
Jul 22 15:24:19 localhost elemental[1687]:     BIOS: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "bios",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 1,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: []string{
Jul 22 15:24:19 localhost elemental[1687]:         "bios_grub",
Jul 22 15:24:19 localhost elemental[1687]:       },
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     EFI: nil,
Jul 22 15:24:19 localhost elemental[1687]:     OEM: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "oem",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_OEM",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 64,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: []string{}, // p0
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/oem",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Recovery: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "recovery",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_RECOVERY",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 8192,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: p0,
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/recovery",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     State: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "state",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_STATE",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 15360,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: p0,
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/state",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Persistent: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "persistent",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_PERSISTENT",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 0,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: p0,
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/persistent",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   NoFormat: false,
Jul 22 15:24:19 localhost elemental[1687]:   Force: false,
Jul 22 15:24:19 localhost elemental[1687]:   CloudInit: []string{
Jul 22 15:24:19 localhost elemental[1687]:     "/tmp/3429870971.yaml",
Jul 22 15:24:19 localhost elemental[1687]:     "/tmp/825229707.yaml",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   Iso: "",
Jul 22 15:24:19 localhost elemental[1687]:   GrubDefEntry: "cOS",
Jul 22 15:24:19 localhost elemental[1687]:   Tty: "tty1",
Jul 22 15:24:19 localhost elemental[1687]:   Active: v1.Image{
Jul 22 15:24:19 localhost elemental[1687]:     File: "/run/cos/state/cOS/active.img",
Jul 22 15:24:19 localhost elemental[1687]:     Label: "COS_ACTIVE",
Jul 22 15:24:19 localhost elemental[1687]:     Size: 3072,
Jul 22 15:24:19 localhost elemental[1687]:     FS: "ext2",
Jul 22 15:24:19 localhost elemental[1687]:     Source: &v1.ImageSource{},
Jul 22 15:24:19 localhost elemental[1687]:     MountPoint: "/run/cos/active",
Jul 22 15:24:19 localhost elemental[1687]:     LoopDevice: "",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   Recovery: v1.Image{
Jul 22 15:24:19 localhost elemental[1687]:     File: "/run/cos/recovery/cOS/recovery.img",
Jul 22 15:24:19 localhost elemental[1687]:     Label: "COS_SYSTEM",
Jul 22 15:24:19 localhost elemental[1687]:     Size: 0,
Jul 22 15:24:19 localhost elemental[1687]:     FS: "ext2",
Jul 22 15:24:19 localhost elemental[1687]:     Source: &v1.ImageSource{},
Jul 22 15:24:19 localhost elemental[1687]:     MountPoint: "",
Jul 22 15:24:19 localhost elemental[1687]:     LoopDevice: "",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   Passive: v1.Image{
Jul 22 15:24:19 localhost elemental[1687]:     File: "/run/cos/state/cOS/passive.img",
Jul 22 15:24:19 localhost elemental[1687]:     Label: "COS_PASSIVE",
Jul 22 15:24:19 localhost elemental[1687]:     Size: 0,
Jul 22 15:24:19 localhost elemental[1687]:     FS: "ext2",
Jul 22 15:24:19 localhost elemental[1687]:     Source: &v1.ImageSource{},
Jul 22 15:24:19 localhost elemental[1687]:     MountPoint: "",
Jul 22 15:24:19 localhost elemental[1687]:     LoopDevice: "",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   GrubConf: "/etc/cos/grub.cfg",
Jul 22 15:24:19 localhost elemental[1687]: }
Jul 22 15:24:19 localhost elemental[1687]: INFO[2022-07-22T15:24:19Z] Install called
Jul 22 15:24:19 localhost elemental[1687]: INFO[2022-07-22T15:24:19Z] Running before-install hook
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Running cmd: 'blkdeactivate --lvmoptions retry,wholevg --dmoptions force,retry --errors'
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] blkdeactivate command output: Deactivating block devices:
Jul 22 15:24:19 localhost elemental[1687]: INFO[2022-07-22T15:24:19Z] Partitioning device...
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Running cmd: 'parted --script --machine -- /dev/vda unit s mklabel gpt'
Jul 22 15:24:29 localhost elemental[1687]: ERRO[2022-07-22T15:24:29Z] Failed creating new partition table: Error: Partition(s) 1, 2, 3, 4, 5 on /dev/vda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Jul 22 15:24:29 localhost elemental[1687]: Error: 1 error occurred:
Jul 22 15:24:29 localhost elemental[1687]:         * exit status 1
Jul 22 15:24:29 localhost elemental[1680]: time="2022-07-22T15:24:29Z" level=fatal msg="failed calling elemental client: exit status 1"
Jul 22 15:24:29 localhost elemental[1558]: ERRO[2022-07-22T15:24:29Z] : failed to run systemd-cat -t elemental elemental-operator register --debug /run/initramfs/live/: exit status 1
Jul 22 15:24:29 localhost elemental[1558]: ERRO[2022-07-22T15:24:29Z] 1 error occurred:
Jul 22 15:24:29 localhost elemental[1558]:         * failed to run systemd-cat -t elemental elemental-operator register --debug /run/initramfs/live/: exit status 1
Jul 22 15:24:29 localhost elemental[1558]:  
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Stage 'network.after'. Defined stages: 1. Errors: true
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing /oem/install-hook.yaml
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network.after'
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Running stage: network.before
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing BOOT_IMAGE=(cd)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network.before'
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Running stage: network
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing BOOT_IMAGE=(cd)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network'
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Running stage: network.after
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing BOOT_IMAGE=(cd)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network.after'

IMHO this also raises something we should probably consider: what is the expected behavior if the installation fails? Certainly we need to understand and fix the former issue; however, installation failures will certainly happen once this goes up to scale at some point. So I was wondering whether some sort of reboot-and-retry mechanism is desirable for the registration and provisioning process.


kkaempf commented Jul 22, 2022

A systemctl restart elemental-operator helped in all such cases I encountered. Going back to a systemd service (with restart-on-failure, if this is even possible 🤔) would be one option.
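Restart-on-failure is possible with plain systemd directives; a minimal sketch of a drop-in (path and values are illustrative, not an existing file):

# /etc/systemd/system/elemental-operator.service.d/10-retry.conf (illustrative)
[Service]
Restart=on-failure
RestartSec=10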

@davidcassany

I am pretty much convinced the issue is with the datasource plugin, which is executed at the network stage. More specifically, I believe the problem is related to the cdrom provider, which attempts to read all devices and leaves the files open. In fact, right after booting and before elemental-operator kicks in, a call to lsof | grep vda results in something like:

elemental 1558                            root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558                            root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558                            root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558                            root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558                            root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1559 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1559 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1559 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1559 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1559 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1560 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1560 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1560 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1560 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1560 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1561 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1561 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1561 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1561 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1561 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1562 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1562 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1562 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1562 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1562 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1563 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1563 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1563 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1563 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1563 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1564 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1564 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1564 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1564 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1564 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5

I added davidcassany/linuxkit#5 as an attempt to solve it; however, I can't understand why all devices seem to be opened by multiple threads... 🤔 In any case, these traces disappear as soon as the cdrom datasource is removed, so we are certain the cdrom provider is responsible for opening all these devices.
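A proper fix would presumably make the probing code close every device it opens; a generic Go sketch of that pattern (not the actual linuxkit or go-diskfs code):

// Generic sketch of probing block devices without leaving them open:
// every opened device is closed before moving on, so a later repartitioning
// step does not find the device busy.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// probe opens a candidate device read-only, reads a little data to inspect it,
// and always releases the file descriptor again.
func probe(dev string) error {
	f, err := os.Open(dev)
	if err != nil {
		return err
	}
	defer f.Close() // release the device even if inspection fails

	buf := make([]byte, 512)
	if _, err := f.Read(buf); err != nil {
		return err
	}
	// ... inspect buf for a known label or filesystem signature here ...
	return nil
}

func main() {
	devices, _ := filepath.Glob("/dev/vda*")
	for _, dev := range devices {
		if err := probe(dev); err != nil {
			fmt.Printf("skipping %s: %v\n", dev, err)
		}
	}
}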

davidcassany added a commit to rancher/elemental that referenced this issue Jul 26, 2022
Avoid the use of 'cdrom' provider for datasource plugin. This provider
opens all available block devices to probe them, however an `lsof` call
shows they are not closed afterwards. This might cause issues if within
the same process we intend to repartition a disk for installation.

This is a workaround for rancher/elemental-operator#50

Signed-off-by: David Cassany <dcassany@suse.com>
davidcassany added a commit to rancher/elemental that referenced this issue Jul 26, 2022
* Do not use cdrom provider datasource

Avoid the use of 'cdrom' provider for datasource plugin. This provider
opens all available block devices to probe them, however an `lsof` call
shows they are not closed afterwards. This might cause issues if within
the same process we intend to repartition a disk for installation.

This is a workaround for rancher/elemental-operator#50

Signed-off-by: David Cassany <dcassany@suse.com>

* Completely remove datasource plugin

We are not targeting public cloud infrastructure for now, hence there
is no need to include such a stage.

Signed-off-by: David Cassany <dcassany@suse.com>
@davidcassany

Closing again since there is already a workaround in place in rancher/elemental#212. Note this is not solved; it is just a workaround. A proper fix will likely require a dive into the linuxkit and go-diskfs libraries to properly release devices after probing them.
