"Failed creating partition: exit status 1" - log missing essential information #50
Comments
/proc/partitions shows that sda1 was created.
Manually creating another partition (by calling …
I could not reproduce the error :-/
This just happened again. A …
udev needs several seconds to settle. 😉
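For reference, "settling" here means waiting for udev to finish processing its event queue before touching the disk again; a minimal sketch of such a wait (the device name is just an example):

```bash
# Block until udev has processed all pending events (e.g. the new partition node),
# giving up after 30 seconds instead of waiting forever.
udevadm settle --timeout=30

# The freshly created partition node should now be visible.
ls -l /dev/sda1
cat /proc/partitions
```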
I made several attempts and could not reproduce it; I'd like to actually get a failure with the output while the debug flag is enabled. #58 ensures the debug flag on elemental-cli is used if the install spec includes a …
I am not fully convinced this is a timing issue between udev and parted, as this happened when creating the first partition of the disk, right after creating the partition table. The command sequence should be equivalent to:
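The actual snippet did not make it into this thread; the following is only an assumed reconstruction of such a sequence, with the device, label type and partition boundaries picked purely for illustration:

```bash
# 1) Create a new partition table on the target disk.
parted --script /dev/sda mklabel gpt

# 2) Create the first partition, right after the partition table.
parted --script /dev/sda mkpart part1 ext4 1MiB 65MiB

# 3) Create the following partition.
parted --script /dev/sda mkpart part2 ext4 65MiB 4GiB
```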
According to the logs above, I guess elemental-cli failed on the third call. Between the first and the second calls we could add a … I have some doubts that the root issue here is a timing one. I'd appreciate it if anyone can reproduce it with debug messages enabled (requires #58) and share the logs before merging rancher/elemental-cli#286
Happened again, now with more detail: …
Partition got created: …
Awesome, thanks for the report including the logs... 🤔 I actually managed to reproduce something similar; to me it feels like some sort of race condition during boot. In my case it fails on the … At least for the issue I experienced, I can assess that the changes in rancher/elemental-cli#286 have no effect.
@kkaempf here is a request to update elemental-operator.service: https://build.opensuse.org/request/show/990105 From my tests it looks like elemental-operator.service has some sort of bad interaction with cos-setup-network.service; both are executed in parallel. This brings me to the question of whether the service unit file should belong to this repository instead of OBS, or, going further, whether we actually need the service unit file at all. Wouldn't it be more consistent to call …
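As an illustration of the parallel-execution point above: systemd only serializes units that declare an ordering dependency on each other, so this can be checked on a booted node with standard systemd tooling (unit names taken from the comment above):

```bash
# Ordering constraints systemd knows about for each unit; if neither lists the
# other under After=/Before=, the two services are started in parallel.
systemctl show -p After -p Before elemental-operator.service
systemctl show -p After -p Before cos-setup-network.service

# Chain of units this service actually waited for during boot.
systemd-analyze critical-chain elemental-operator.service
```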
I'm fine to drop the …
The solution provided in https://build.opensuse.org/request/show/990105 was incomplete, my bad... Now the service does not start automatically. In my local tests I just changed the service unit file but did not realize that the symlinks to multi-user.target were already created, hence new builds including https://build.opensuse.org/request/show/990105 do not start elemental-operator at boot... My bad, I am about to provide a new SR.
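For context, whether a unit starts at boot is governed by the symlinks in the target's .wants directory, which `systemctl enable` creates from the unit's [Install] section; a quick way to check their state (paths are the usual systemd defaults):

```bash
# Is the unit wired into a target at all?
systemctl is-enabled elemental-operator.service

# The symlink that makes the service start with multi-user.target; if it was
# already baked into the image, the service starts at boot regardless of the
# unit file placed on top later.
ls -l /etc/systemd/system/multi-user.target.wants/elemental-operator.service
```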
Ah, that explains it 😆
My bad, sorry.... 🙏 In my local env I was just placing the new unit file on top, and I did not realize that there was some spurious symlink already created during the image build while running … Gonna add a card to drop this service and call elemental-operator based on our cloud-init config files; I think this approach is way more explicit and easier to maintain. However, it requires a few changes regarding /etc/motd and finding a proper way to store and read elemental-operator logs. That's why a separate, specific card makes sense.
👍🏻
After doing more than half a dozen installs without problems, I'd consider this done.
After the change to running elemental-operator as part of cloud-init, this issue appeared again. Not very often, but still... These are the relevant logs I get from time to time: …
IMHO this also raises something we should probably consider: what is the expected behavior if the installation fails? Certainly we need to understand and fix the former issue; however, installation failures will certainly happen once this goes up to scale at some point. So I was wondering if some sort of reboot-and-retry mechanism is desirable for the registration and provisioning process.
I am pretty much convinced the issue is with the datasource plugin, which is executed at …
I added davidcassany/linuxkit#5 as an attempt to solve it, however I can't understand why all devices seem to be opened by multiple threads.... 🤔 In any case, the traces below disappear as soon as the …
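For illustration, the kind of check referred to below can be done with `lsof` against the target disk; the device name and PID are just examples:

```bash
# Processes that still hold the target disk or any of its partitions open;
# leftover descriptors here are what was suspected to interfere with
# repartitioning the same disk from the installer process.
lsof /dev/sda*

# Or inspect the descriptors of one specific process directly.
PID=1234
ls -l /proc/$PID/fd | grep '/dev/sd'
```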
* Do not use cdrom provider datasource
Avoid the use of the 'cdrom' provider for the datasource plugin. This provider opens all available block devices to probe them; however, an `lsof` call shows they are not closed afterwards. This might cause issues if, within the same process, we intend to repartition a disk for installation. This is a workaround for rancher/elemental-operator#50 Signed-off-by: David Cassany <dcassany@suse.com>
* Completely remove datasource plugin
We are not targeting public cloud infrastructure for now, hence there is no need to include such a stage. Signed-off-by: David Cassany <dcassany@suse.com>
Closing again since there is already a workaround in place in rancher/elemental#212. Note this is not solved; it is just a workaround. A proper fix will likely require a dive into the linuxkit and go-diskfs libraries to properly release devices after probing them.
Output from `journalctl -u elemental-operator` after booting an ISO. There are several things missing from this log: …
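For reference, a more complete log for a report like this can usually be captured with standard journalctl options (nothing project-specific):

```bash
# Full elemental-operator log for the current boot, with precise timestamps and
# no pager, so it can be attached to the issue as-is.
journalctl -u elemental-operator -b --no-pager -o short-precise > elemental-operator.log
```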