Do NOT install Rook in a Kind/k3os cluster (it'll destroy your volumes) #9470
Comments
Hi @withernet, sorry about this; indeed, we had similar reports in the past with Kind. Our dev env guide reflects this: https://rook.io/docs/rook/v1.8/development-environment.html#minikube. By any chance, do you have the logs from the prepare job, so that we could try to understand why the root disk was picked?
Unfortunately I do not. I do remember seeing in the web UI that there was 3.9TB available and I thought that was weird. But... both the volumes I had added up to about 3.9TB. I also want to say that it was all attached disks. It reformatted my NVMe root volume and my USB drive.
Is Kind aggregating the storage and presenting it differently somehow?
Yes, that makes sense with
😿 Same problem last year on my desktop. I lost all ext4 drives while the NTFS drives remained untouched.
Finally, independent confirmation ;)
Leaving my mark here. Environment: Bare Metal + k3s.
Went to grab a drink from the fridge and chat with the fiancée. Came back to the console (iDRAC) spitting errors that k3s couldn't log to /var/log/. Rebooted and got the grub screen.

@withernet Can you modify the title a bit to be specific to k3s as well? @leseb Mind if I make a PR to change the sample?
@crdnl sorry to hear that.
Thankfully, it was a dev cluster so there wasn't anything on it (yet). I did add a `deviceFilter` afterwards. I was more so referencing modifying the provided sample file (in deploy/examples/) to not use `useAllDevices: true`.

As for why it happened, I did some digging through my prepare logs on a new cluster, and paid attention to the preparer this time. Logs are at https://gist.github.com/crdnl/f63bc578e5108f72f4bfa0b405564368. On line #53 of that gist, it does note that the device was picked. By tweaking the toolbox container (setting privileged, mounting the host's /dev), I was able to run lsblk against the device.

To add on, here is the output of lsblk for the device:
@crdnl Thanks for the detailed analysis. It's surprising that even if
@leseb Just ran `lsblk`:
@crdnl thanks but this isn't the full output, is it?
Just ran into this as well on k3os, will try to provide a log after I reinstall the affected hosts; a couple hosts were unaffected, probably because they used GPT.
Thanks for the report. If we don't even have Python it will be hard to simulate an
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
This definitely shouldn't be closed.
We really need to find more checks to ensure unintended devices aren't picked up...
As a quick "fix" to improve the docs, we could look into adding a red warning box to the Rook Ceph prerequisites and quickstart doc pages. Still, as @travisn mentioned in a quick chat, we should find out why the disks look empty in these environments.
Since this issue was opened, we added a sentence in bold text to the opening section of the Quickstart guide. I wonder if it would be sufficient just to add a "WARNING:" prefix to make it more obvious, or move it to be the first thing in the prerequisites section.
I think this error occurs when you use a tool to play with Rook that isn't minikube.
Hi, are there any updates? This sounds like a big problem.
Would the clarification in the docs suggested here be sufficient from your view?
@travisn Well, not really... Without reading this issue, I had almost forgotten that sentence (because, you know, we have to read a lot of documentation and compare multiple storage engines etc. in a short time). If it were in a big red box it might be better - but that might make people think Rook is very dangerous.
Thus, IMHO this bug should be fixed (and with high priority - it is quite dangerous). It could either be fixed at the Rook level, or by adding some protection in the Helm chart, etc.
Fixing it of course is desired, but detecting when it's an invalid environment is the challenge. Do you have a suggestion on how Rook can detect when to prevent installation? Or we need to gather more logging, as mentioned here.
@travisn I am not an expert here. But I guess Rook should not destroy volumes that originally have data inside them?
Rook does check for existing filesystems or partitions on a device, and will not allow using such devices. So the remaining question is how we can detect other cases of disks that should be skipped where those checks are not sufficient.
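For illustration, here is a minimal sketch in Go of the kind of check being described: ask `lsblk` whether a disk or any of its children already carries a partition or a filesystem signature, and skip the disk if so. This assumes `lsblk` is available on the node; the helper name is hypothetical and this is a sketch, not Rook's actual implementation.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// deviceLooksUsed asks lsblk whether the given disk, or any of its
// children, already carries a partition or a filesystem signature.
// Hypothetical helper for illustration, not Rook's actual code.
func deviceLooksUsed(device string) (bool, error) {
	// --noheadings: no header row; --raw: plain space-separated fields;
	// --output TYPE,FSTYPE: one "TYPE FSTYPE" line per disk/partition.
	out, err := exec.Command("lsblk", "--noheadings", "--raw",
		"--output", "TYPE,FSTYPE", device).Output()
	if err != nil {
		return false, fmt.Errorf("lsblk %s: %w", device, err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		// Any child of TYPE "part" means the disk is partitioned; a
		// non-empty FSTYPE field means a filesystem signature exists.
		if fields[0] == "part" || len(fields) > 1 {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	used, err := deviceLooksUsed("/dev/nvme0n1")
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("skip as OSD candidate:", used)
}
```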
@travisn Well, I am a bit confused. This bug report seems to say the files were destroyed. So there was an existing filesystem or partitions, and Rook should not have allowed using them.
@fzyzcjy I'll try to clear up the confusion. The problem seems to be that when using Kind/k3os (and similar environments where the host's disks can be seen from the node), the disks look like they are "empty" for some reason and are therefore used by Rook Ceph as OSDs, e.g., #9470 (comment).

@leseb To get the
Wherever the CLI runs
Rook has always ignored existing ext4 filesystems while using 'useAllDevices: true'. |
So... the good news is I was able to reproduce the issue on the first try; the bad news is I need to reinstall my laptop and wasn't able to save the logs... I'll probably have the

@leseb Are you available today? I can gladly give you SSH access for you to get more debug info, as my laptop seemed quite responsive even with the disk being "re-organized" into multiple Ceph OSDs.
Logs should be enough for now :), please post them here.
@leseb Here you go: rook-operator-pod.log

A quick look from my side shows that `ceph osd prepare` reports the disk as empty (no partitions). `rook-discover`, on the other hand, correctly detects that nvme0n1 has partitions.
The issue is still in ceph-volume not detecting the device property correctly. It looks like this will be fixed by ceph/ceph@9f4b830. There is a pending backport, so the next Pacific/Quincy releases should get the fix. Also, Rook should have been capable of handling this too; I found a bug, so I'm fixing that as well.
When detecting the device property the filesystem was not passed but used later to validate if the device should be taken or not. Now we read the filesystem info from "lsblk" and populate it in the device type. Closes: rook#9470 Signed-off-by: Sébastien Han <seb@redhat.com>
When detecting the device property the filesystem was not passed but used later to validate if the device should be taken or not. Now we read the filesystem info from "lsblk" and populate it in the device type. Closes: #9470 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit 8c1bf0f)
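To make the commit message above concrete, here is a minimal sketch of the bug pattern it describes: the filesystem string was read during discovery but never attached to the device record, so the later validation always saw an empty field and accepted the disk. All names here (`Device`, `populateFilesystem`, `isSafeToConsume`) are illustrative stand-ins, not Rook's actual types.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Device is an illustrative stand-in for the "device type" the commit
// message refers to; it is not Rook's actual struct.
type Device struct {
	Name       string
	Filesystem string // the field that was previously never populated
}

// populateFilesystem reads the filesystem info from lsblk and stores it
// on the device, mirroring the fix described above. For a partitioned
// disk, lsblk prints one FSTYPE line per child, so a non-empty result
// means some filesystem signature exists somewhere on the disk.
func populateFilesystem(d *Device) error {
	out, err := exec.Command("lsblk", "--noheadings",
		"--output", "FSTYPE", "/dev/"+d.Name).Output()
	if err != nil {
		return fmt.Errorf("lsblk /dev/%s: %w", d.Name, err)
	}
	d.Filesystem = strings.TrimSpace(string(out))
	return nil
}

// isSafeToConsume is the later validation step the commit message
// mentions: with Filesystem always left empty, it used to accept disks
// that actually carried data.
func isSafeToConsume(d Device) bool {
	return d.Filesystem == ""
}

func main() {
	d := Device{Name: "nvme0n1"}
	if err := populateFilesystem(&d); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s: fs=%q, safe to consume: %v\n",
		d.Name, d.Filesystem, isSafeToConsume(d))
}
```

With the field actually populated, a disk carrying ext4 or LUKS signatures is rejected as an OSD candidate instead of being wiped.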
I was testing out rook in a `kind` cluster on my Linux laptop. I set up the CRDs and installed `cluster-test.yaml` last night. Today, I noticed that my drives started to behave strangely (couldn't decrypt anymore, couldn't mount). I decided to reboot. My root volume and my other volume had been converted into `ceph_bluestore` volumes, and were therefore unmountable. Literally destroyed over 3TB of data.

A wild guess is that since my user is part of the `docker` group, the underlying system saw these volumes and decided to convert them all to Ceph volumes. This is either a major bug or a feature.

I have had to re-install my OS and restore a volume from backup due to this.
Probably here: https://github.com/rook/rook/blob/master/deploy/examples/cluster-test.yaml#L43-L45