Manager prometheus module failed to construct - ImportError: Module not found #13527
Strange, because the logs suggest the rook module is not enabled:
Did you try to enable the module from the toolbox?
Running `ceph mgr module enable rook && ceph orch set backend rook && ceph orch status` outputs: `module 'rook' is already enabled`. There was nothing about that in the manager logs, so I restarted the mgr pod; the startup log is the same as above.
The mgr container is running this image: `quay.io/ceph/ceph:v18.2.1`
The rook module seems to be working, because I can see Cluster -> Physical Disks.
I am also able to reproduce this. My config only explicitly configures the … Something interesting to note is that Prometheus works on a standby mgr, but not on the active one. Logs from the active mgr (some lines are omitted):
Environment:
Thank you all for the feedback. I'll try to reproduce the issue locally 👍
@matthewpi please, can you post the full logs of the mgr startup, including …?
Here are the logs from startup.
Thank you very much @matthewpi. FYI, I tried several times to reproduce the issue locally but with no success. Analyzing these logs and the ones provided originally by @pasztorl, the common pattern is the order in which the modules load: prometheus is loaded first, then rook (we can see this in the following snippet from your active mgr logs):
I think the problem is a dependency between the prometheus and orchestrator (rook, in this case) modules introduced recently by the changes from PR ceph/ceph#52191, specifically the orch status check. In the error stack trace we can see how the mgr fails to load prometheus (…).
That also explains why I wasn't able to reproduce it in my local environment: in my setup, rook is always loaded before prometheus.
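On a live cluster, the symptom can be narrowed down from the toolbox with standard ceph commands (a sketch; it assumes the rook-ceph toolbox pod is deployed):

```shell
# List mgr modules and their enabled/always-on state
ceph mgr module ls

# Check whether the orchestrator backend is set and reachable
ceph orch status

# Show recent daemon crashes (a failed prometheus load shows up here)
ceph crash ls
```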
This issue also affects our setup. It started occurring immediately after upgrading from 1.12 to 1.13, and the mgr pod has been constantly crashing since then. `[rook@rook-ceph-tools-68644848b9-flk4x /]$ ceph crash info 2024-01-10T09:17:44.340271Z_c16a026b-2df4-4857-b64a-ea73c41a5493`
@kpoos I have confirmed that the issue is related to the changes from the PR I posted above. I have already opened a ticket on the ceph project to track it: https://tracker.ceph.com/issues/63992. Unfortunately, at this moment there's no workaround for this problem.
Thanks for the info!
Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:
Following these steps the mgr seems to start without issues, but I can't guarantee the same will happen in your env. Besides, please keep in mind that any restart of the mgr can trigger the problem again.
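The exact setup steps did not survive here; below is a sketch of one plausible ordering workaround, based on the race described later in this thread (prometheus must load after rook). The commands are standard ceph CLI, but the ordering trick itself is an assumption:

```shell
# From the toolbox: make sure prometheus is off first
ceph mgr module disable prometheus

# Enable rook and point the orchestrator at it
ceph mgr module enable rook
ceph orch set backend rook

# Only now re-enable prometheus, so it loads after rook
ceph mgr module enable prometheus
```

Note that, as mentioned above, a mgr restart reshuffles the load order, so the race can reappear.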
@rkachach It looks like the workaround works in our setup as well, thanks for that. Please keep us informed when a permanent fix is available. Thanks.
I'm unable to achieve good behaviour using this workaround.
@xavi-clovr Unfortunately the loading process is not deterministic, and right now there's a race condition between loading prometheus and the orchestrator (rook) which can crash the prometheus module if it's loaded before rook. So the proposed workaround is not guaranteed to work.
I created the following image, which contains … IMPORTANT:
Good news: the prometheus module doesn't fail on startup. Bad news: metric collection doesn't work. Hitting the /metrics endpoint results in...
Still getting the same error with the new image as well.
@rspier thanks for deploying and testing the first part of the fix (the one related to the prometheus module crash). As for the new issue, I'm sure it has to do with some mismatch between how … As for building an image, you basically have to pick the changes from my PR https://github.com/ceph/ceph/pull/55149/files and use them to update the base image of v18.2.1, with a simple Dockerfile like:
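The Dockerfile body was lost here; the following is a minimal sketch, assuming the PR only changes mgr Python sources and that they live under /usr/share/ceph/mgr in the image (verify the exact file list against the PR diff):

```dockerfile
FROM quay.io/ceph/ceph:v18.2.1
# Overlay the patched mgr module files taken from ceph/ceph#55149.
# The source directory below is a placeholder; copy whichever files the PR touches.
COPY patched/mgr/ /usr/share/ceph/mgr/
```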
Thank you!
You're welcome! Thanks to you for helping with the testing 👍 Please let me know if you observe any new issues. I'll be working on the definitive fix; as I said before, that could take more time as it depends on the next ceph release.
I can also confirm that the fix works.
I can confirm that v1 fixes the issue for me as well (using the rook helm charts).
Was having this issue in the discussion area. Hopefully this makes it into the rook-ceph release. Nice work! |
Running the v1 patch, and of course it fixed the issue <3
@barrettMCW please, if you really think there's a potential bug with metrics, I'd recommend opening a new issue and providing the details to reproduce it.
Unfortunately, this version did not start for me. I am running it on ARM though, if that makes any difference.
@dcplaya please, test the image mentioned above.
@rkachach can you build an arm image? |
@rkachach I did test that image; the error I got while running that version is in the screenshot I posted above.
All: just wanted to report that I had this issue and the patched version fixed it for me. @dcplaya that image probably won't work on ARM unless an ARM version was built for it. Maybe somebody will volunteer to build one, but if not you could follow the suggestions to build your own. Containers are typically arch-specific since they contain binaries.
@bhuism I have built a multi-arch image; a public GitLab repository is available at https://gitlab.onlineterroir.com/ceph/ceph with operational CI. The multi-arch image is available at … I can confirm that v1 fixes the issue for me as well (using rook without helm charts).
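For anyone producing their own multi-arch image from a patched Dockerfile, a sketch using docker buildx (image tag and registry are placeholders):

```shell
# Create and select a buildx builder, then build and push an
# amd64 + arm64 image in one step
docker buildx create --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/ceph/ceph:v18.2.1-prom-fix \
  --push .
```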
@guillaumetorresani thank you for providing the multi-arch image. |
@guillaumetorresani's multi-arch image works for my ARM setup! |
@guillaumetorresani thanks for the arm image! @rkachach works like a charm! |
Is there a PR in progress or a timeline for this to get merged into main? |
@ahgraber It's a ceph problem. |
Thank You! |
v18.2.2 was released last week. Does anyone know how long it usually takes for a new image to show up on quay.io? The latest image is still v18.2.1.
I don't think it has actually been released. It is tagged, but it seems to be going through the QA process; there is a thread on the mailing list. I'm not sure what the normal release process is, but I've seen those sorts of threads go on for a while. I'm guessing this one will be done relatively quickly as it is a hotfix. In any case, once it is released and the release page is updated, I'm guessing the quay image will be updated.
The new image tag should now be available on quay.io.
Hi! I've tested with the new image and there is no error about the prometheus module in the mgr log. The mgr serves on the metrics port, thanks!
This issue is fixed by updating the Ceph version to v18.2.2; see the Ceph upgrade guide to make this change. This will become the default in Rook v1.13.7 when released in the next few days, but there is no need to wait for that release before applying v18.2.2.
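For a Rook-managed cluster, the upgrade amounts to bumping spec.cephVersion.image on the CephCluster resource (a sketch; the cluster name and namespace assume the default rook-ceph deployment):

```shell
# Point the cluster at the fixed Ceph image
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec": {"cephVersion": {"image": "quay.io/ceph/ceph:v18.2.2"}}}'

# Watch the operator roll the daemons onto the new version
kubectl -n rook-ceph get cephcluster rook-ceph -w
```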
Hi,
After modifying the cluster spec with these settings:
the prometheus module does not start.
The CRD is in Ready status; everything seems to work except the dashboard's "Performance" tabs.
What can I check to debug this issue?
Thanks!