Manager prometheus module failed to construct - ImportError: Module not found #13527

Closed
pasztorl opened this issue Jan 8, 2024 · 44 comments · Fixed by #13913

@pasztorl

pasztorl commented Jan 8, 2024

Hi,

After modifying the cluster spec with these settings:

mgr:
  ...
  modules:
    - name: rook
      enabled: true
    - name: pg_autoscaler
      enabled: true
    - name: prometheus
      enabled: true
...
monitoring:
  enabled: true

The prometheus module does not start; the mgr log shows:

debug 2024-01-08T10:02:03.658+0000 7fb850722200  0 set uid:gid to 167:167 (ceph:ceph)
debug 2024-01-08T10:02:03.658+0000 7fb850722200  0 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable), process ceph-mgr, pid 13
debug 2024-01-08T10:02:03.658+0000 7fb850722200  0 pidfile_write: ignore empty --pid-file
debug 2024-01-08T10:02:03.694+0000 7fb850722200  1 mgr[py] Loading python module 'restful'
debug 2024-01-08T10:02:04.142+0000 7fb850722200  1 mgr[py] Loading python module 'progress'
debug 2024-01-08T10:02:04.282+0000 7fb850722200 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:04.282+0000 7fb850722200  1 mgr[py] Loading python module 'telegraf'
debug 2024-01-08T10:02:04.394+0000 7fb850722200 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:04.394+0000 7fb850722200  1 mgr[py] Loading python module 'cephadm'
debug 2024-01-08T10:02:05.262+0000 7fb850722200  1 mgr[py] Loading python module 'osd_support'
debug 2024-01-08T10:02:05.458+0000 7fb850722200 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:05.458+0000 7fb850722200  1 mgr[py] Loading python module 'pg_autoscaler'
debug 2024-01-08T10:02:05.606+0000 7fb850722200 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:05.606+0000 7fb850722200  1 mgr[py] Loading python module 'selftest'
debug 2024-01-08T10:02:05.754+0000 7fb850722200 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:05.754+0000 7fb850722200  1 mgr[py] Loading python module 'prometheus'
debug 2024-01-08T10:02:06.326+0000 7fb850722200 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.326+0000 7fb850722200  1 mgr[py] Loading python module 'localpool'
debug 2024-01-08T10:02:06.446+0000 7fb850722200  1 mgr[py] Loading python module 'balancer'
debug 2024-01-08T10:02:06.666+0000 7fb850722200 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.666+0000 7fb850722200  1 mgr[py] Loading python module 'iostat'
debug 2024-01-08T10:02:06.802+0000 7fb850722200 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.802+0000 7fb850722200  1 mgr[py] Loading python module 'snap_schedule'
debug 2024-01-08T10:02:06.982+0000 7fb850722200 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.982+0000 7fb850722200  1 mgr[py] Loading python module 'orchestrator'
debug 2024-01-08T10:02:07.250+0000 7fb850722200 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:07.250+0000 7fb850722200  1 mgr[py] Loading python module 'rook'
debug 2024-01-08T10:02:08.538+0000 7fb850722200 -1 mgr[py] Module rook has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:08.538+0000 7fb850722200  1 mgr[py] Loading python module 'crash'
debug 2024-01-08T10:02:08.814+0000 7fb850722200 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:08.814+0000 7fb850722200  1 mgr[py] Loading python module 'k8sevents'
debug 2024-01-08T10:02:09.298+0000 7fb850722200  1 mgr[py] Loading python module 'nfs'
debug 2024-01-08T10:02:09.638+0000 7fb850722200 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:09.638+0000 7fb850722200  1 mgr[py] Loading python module 'status'
debug 2024-01-08T10:02:09.938+0000 7fb850722200 -1 mgr[py] Module status has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:09.938+0000 7fb850722200  1 mgr[py] Loading python module 'stats'
debug 2024-01-08T10:02:10.082+0000 7fb850722200  1 mgr[py] Loading python module 'mirroring'
debug 2024-01-08T10:02:10.234+0000 7fb850722200  1 mgr[py] Loading python module 'rbd_support'
debug 2024-01-08T10:02:10.398+0000 7fb850722200 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:10.398+0000 7fb850722200  1 mgr[py] Loading python module 'alerts'
debug 2024-01-08T10:02:10.546+0000 7fb850722200 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:10.546+0000 7fb850722200  1 mgr[py] Loading python module 'diskprediction_local'
debug 2024-01-08T10:02:10.754+0000 7fb850722200 -1 mgr[py] Module diskprediction_local has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:10.754+0000 7fb850722200  1 mgr[py] Loading python module 'volumes'
debug 2024-01-08T10:02:11.098+0000 7fb850722200 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:11.098+0000 7fb850722200  1 mgr[py] Loading python module 'mds_autoscaler'
debug 2024-01-08T10:02:11.578+0000 7fb850722200  1 mgr[py] Loading python module 'telemetry'
debug 2024-01-08T10:02:11.794+0000 7fb850722200 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:11.794+0000 7fb850722200  1 mgr[py] Loading python module 'osd_perf_query'
debug 2024-01-08T10:02:11.942+0000 7fb850722200 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:11.942+0000 7fb850722200  1 mgr[py] Loading python module 'test_orchestrator'
debug 2024-01-08T10:02:12.222+0000 7fb850722200 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.222+0000 7fb850722200  1 mgr[py] Loading python module 'zabbix'
debug 2024-01-08T10:02:12.354+0000 7fb850722200 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.354+0000 7fb850722200  1 mgr[py] Loading python module 'rgw'
debug 2024-01-08T10:02:12.654+0000 7fb850722200 -1 mgr[py] Module rgw has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.654+0000 7fb850722200  1 mgr[py] Loading python module 'devicehealth'
debug 2024-01-08T10:02:12.842+0000 7fb850722200 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.842+0000 7fb850722200  1 mgr[py] Loading python module 'insights'
debug 2024-01-08T10:02:13.030+0000 7fb850722200  1 mgr[py] Loading python module 'dashboard'
debug 2024-01-08T10:02:14.134+0000 7fb850722200  1 mgr[py] Loading python module 'influx'
debug 2024-01-08T10:02:14.282+0000 7fb850722200 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:14.286+0000 7fb843dd0700  0 ms_deliver_dispatch: unhandled message 0x557908acb600 mon_map magic: 0 v1 from mon.0 v2:172.31.16.198:3300/0
debug 2024-01-08T10:02:14.294+0000 7fb843dd0700  1 mgr handle_mgr_map Activating!
debug 2024-01-08T10:02:14.294+0000 7fb843dd0700  1 mgr handle_mgr_map I am now activating
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  0 [balancer DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  1 mgr load Constructed class from module: balancer
debug 2024-01-08T10:02:14.310+0000 7fb7e045d700  0 [balancer INFO root] Starting
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  0 [crash DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  1 mgr load Constructed class from module: crash
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  0 [dashboard DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  1 mgr load Constructed class from module: dashboard
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  1 mgr load Constructed class from module: devicehealth
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO access_control] Loading user roles DB version=2
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO sso] Loading SSO DB version=1
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO root] server: ssl=no host=:: port=8080
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO root] Configured CherryPy, starting engine...
debug 2024-01-08T10:02:14.318+0000 7fb7dd457700  0 [devicehealth INFO root] Starting
debug 2024-01-08T10:02:14.318+0000 7fb7ff998700  0 [iostat DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.318+0000 7fb7ff998700  1 mgr load Constructed class from module: iostat
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  0 [nfs DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  1 mgr load Constructed class from module: nfs
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  1 mgr load Constructed class from module: orchestrator
debug 2024-01-08T10:02:14.326+0000 7fb7e045d700  0 [balancer INFO root] Optimize plan auto_2024-01-08_10:02:14
debug 2024-01-08T10:02:14.326+0000 7fb7e045d700  0 [balancer INFO root] Mode upmap, max misplaced 0.050000
debug 2024-01-08T10:02:14.326+0000 7fb7e045d700  0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
debug 2024-01-08T10:02:14.326+0000 7fb7ff998700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.326+0000 7fb7ff998700  1 mgr load Constructed class from module: pg_autoscaler
debug 2024-01-08T10:02:14.330+0000 7fb7ff998700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.330+0000 7fb7ff998700  1 mgr load Constructed class from module: progress
debug 2024-01-08T10:02:14.334+0000 7fb7d944f700  0 [pg_autoscaler INFO root] _maybe_adjust
debug 2024-01-08T10:02:14.338+0000 7fb7d844d700  0 [progress INFO root] Loading...
debug 2024-01-08T10:02:14.338+0000 7fb7ff998700  0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.346+0000 7fb7d844d700  0 [progress INFO root] Loaded [<progress.module.GhostEvent object at 0x7fb7e460bbe0>, ... 31 more GhostEvent objects ...] historic events
debug 2024-01-08T10:02:14.346+0000 7fb7d844d700  0 [progress INFO root] Loaded OSDMap, ready.
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 mgr load Failed to construct class in 'prometheus'
debug 2024-01-08T10:02:14.358+0000 7fb81919b700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 649, in __init__
    self.modify_instance_id = self.get_orch_status() and self.get_module_option(
  File "/usr/share/ceph/mgr/prometheus/module.py", line 869, in get_orch_status
    return self.available()[0]
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1661, in _oremote
    f_set = self.get_feature_set()
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

debug 2024-01-08T10:02:14.358+0000 7fb7ff998700 -1 mgr operator() Failed to run module in active mode ('prometheus')
ceph mgr module ls
MODULE                              
balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
dashboard             on            
iostat                on            
nfs                   on            
prometheus            on            
restful               on            
rook                  on       
...

The CRD is in Ready status; everything seems to work except the dashboard "performance" tabs.

What can I check to debug this issue?
Thanks!

@pasztorl pasztorl added the bug label Jan 8, 2024
@rkachach
Contributor

rkachach commented Jan 8, 2024

Strange, because in the logs it seems that the rook module is not enabled:

debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'

Did you try to enable the module from the toolbox?

ceph mgr module enable rook && ceph orch set backend rook && ceph orch status
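
For reference, a minimal sketch of how to reach the toolbox shell (assuming the default rook-ceph namespace and the standard toolbox deployment name), from which the commands above can be run:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash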

@pasztorl
Author

pasztorl commented Jan 8, 2024

ceph mgr module enable rook && ceph orch set backend rook && ceph orch status

output:

module 'rook' is already enabled
Backend: rook
Available: Yes

There was nothing about that in the manager logs, so I restarted the mgr pod; the startup log is the same as above.

@pasztorl
Author

pasztorl commented Jan 8, 2024

The mgr container is running this image: quay.io/ceph/ceph:v18.2.1

@pasztorl
Author

pasztorl commented Jan 8, 2024

The rook module seems to be working, because I can see Cluster -> Physical Disks.

@matthewpi
Contributor

matthewpi commented Jan 9, 2024

I am also able to reproduce this. My config only explicitly configures the rook module, though, and monitoring is enabled.

Something interesting to note is that Prometheus works on a standby mgr, but not the active one.

Logs from active mgr: (some lines are omitted, indicated by ...)

debug 2024-01-09T01:03:37.231+0000 7f6def5d3200  1 mgr[py] Loading python module 'rook'
debug 2024-01-09T01:03:38.140+0000 7f6def5d3200 -1 mgr[py] Module rook has missing NOTIFY_TYPES member
...
debug 2024-01-09T01:03:40.096+0000 7f6de2c81700  1 mgr handle_mgr_map Activating!
debug 2024-01-09T01:03:40.098+0000 7f6de2c81700  1 mgr handle_mgr_map I am now activating
...
debug 2024-01-09T01:03:40.194+0000 7f6c98670700  0 [dashboard INFO root] Configured CherryPy, starting engine...
debug 2024-01-09T01:03:40.196+0000 7f6c8766e700  0 [devicehealth INFO root] Starting
debug 2024-01-09T01:03:40.206+0000 7f6b7fe4f700  1 mgr load Constructed class from module: pg_autoscaler
debug 2024-01-09T01:03:40.209+0000 7f6b7fe4f700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.209+0000 7f6b7fe4f700  1 mgr load Constructed class from module: progress
debug 2024-01-09T01:03:40.218+0000 7f6b7fe4f700  0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.226+0000 7f6c43666700  0 [pg_autoscaler INFO root] _maybe_adjust
debug 2024-01-09T01:03:40.226+0000 7f6c3a664700  0 [progress INFO root] Loading...
debug 2024-01-09T01:03:40.233+0000 7f6c3a664700  0 [progress INFO root] Loaded [...] historic events
debug 2024-01-09T01:03:40.233+0000 7f6c3a664700  0 [progress INFO root] Loaded OSDMap, ready.
debug 2024-01-09T01:03:40.247+0000 7f6b7fe4f700 -1 no module 'rook'
debug 2024-01-09T01:03:40.255+0000 7f6b7fe4f700 -1 no module 'rook'
debug 2024-01-09T01:03:40.256+0000 7f6b7fe4f700 -1 mgr load Failed to construct class in 'prometheus'
debug 2024-01-09T01:03:40.256+0000 7f6b7fe4f700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 649, in __init__
    self.modify_instance_id = self.get_orch_status() and self.get_module_option(
  File "/usr/share/ceph/mgr/prometheus/module.py", line 869, in get_orch_status
    return self.available()[0]
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1661, in _oremote
    f_set = self.get_feature_set()
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

debug 2024-01-09T01:03:40.261+0000 7f6b7fe4f700 -1 mgr operator() Failed to run module in active mode ('prometheus')
debug 2024-01-09T01:03:40.262+0000 7f6b7fe4f700  0 [rbd_support DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.278+0000 7f6be5e5b700  0 [rbd_support INFO root] recovery thread starting
debug 2024-01-09T01:03:40.278+0000 7f6be5e5b700  0 [rbd_support INFO root] starting setup
debug 2024-01-09T01:03:40.278+0000 7f6b7fe4f700  1 mgr load Constructed class from module: rbd_support
debug 2024-01-09T01:03:40.285+0000 7f6b7fe4f700  0 [restful DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.285+0000 7f6b7fe4f700  1 mgr load Constructed class from module: restful
debug 2024-01-09T01:03:40.286+0000 7f6be5e5b700  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-01-09T01:03:40.300+0000 7f6be5e5b700  0 [rbd_support INFO root] load_schedules: blocks, start_after=
debug 2024-01-09T01:03:40.302+0000 7f6b7fe4f700  0 [rook DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.303+0000 7f6bcc658700  0 [restful INFO root] server_addr: :: server_port: 8003
debug 2024-01-09T01:03:40.306+0000 7f6b7fe4f700  1 mgr load Constructed class from module: rook
...

Environment:

  • OS: NixOS 24.05 (Uakari)
  • Kernel: Linux 6.1.71 #1-NixOS SMP PREEMPT_DYNAMIC Fri Jan 5 14:18:41 UTC 2024 x86_64 GNU/Linux
  • Rook version: v1.13.1
  • Storage backend version: quay.io/ceph/ceph:v18.2.1-20231215
  • Kubernetes version: v1.28.5

@rkachach
Contributor

rkachach commented Jan 9, 2024

Thank you all for the feedback. I'll try to reproduce the issue locally 👍

@rkachach rkachach self-assigned this Jan 9, 2024
@rkachach
Contributor

rkachach commented Jan 9, 2024

@matthewpi please, can you post the full logs of the mgr startup including ALL the modules?

@matthewpi
Contributor

@matthewpi please, can you post the full logs of the mgr startup including ALL the modules?

Here are the logs from startup.
mgr-active.log
mgr-standby.log

@rkachach
Contributor

rkachach commented Jan 10, 2024

Thank you very much @matthewpi. Just FYI, I tried several times to reproduce the issue locally, but with no success.

Analyzing these logs and the ones originally provided by @pasztorl, the common pattern is the module loading order: prometheus is loaded first, then rook (we can see this in the following snippet from your active mgr logs):

      4:debug 2024-01-09T18:30:54.532+0000 7ff82be94200  1 mgr[py] Loading python module 'alerts'
      6:debug 2024-01-09T18:30:54.698+0000 7ff82be94200  1 mgr[py] Loading python module 'balancer'
      ....
      ....
     35:debug 2024-01-09T18:31:00.365+0000 7ff82be94200  1 mgr[py] Loading python module 'progress'
     37:debug 2024-01-09T18:31:00.485+0000 7ff82be94200  1 mgr[py] Loading python module 'prometheus'
     39:debug 2024-01-09T18:31:01.159+0000 7ff82be94200  1 mgr[py] Loading python module 'rbd_support'
     41:debug 2024-01-09T18:31:01.311+0000 7ff82be94200  1 mgr[py] Loading python module 'restful'
     42:debug 2024-01-09T18:31:01.591+0000 7ff82be94200  1 mgr[py] Loading python module 'rgw'
     44:debug 2024-01-09T18:31:01.875+0000 7ff82be94200  1 mgr[py] Loading python module 'rook'
     46:debug 2024-01-09T18:31:02.844+0000 7ff82be94200  1 mgr[py] Loading python module 'selftest'

I think the problem is a dependency between the prometheus and orchestrator (rook, in this case) modules introduced recently by the changes from PR ceph/ceph#52191, specifically the orch status check.

In the error stack trace we can see how the mgr fails to load prometheus (Module not found) because rook has not yet been loaded:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 649, in __init__
    self.modify_instance_id = self.get_orch_status() and self.get_module_option(
  File "/usr/share/ceph/mgr/prometheus/module.py", line 869, in get_orch_status
    return self.available()[0]
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1661, in _oremote
    f_set = self.get_feature_set()
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

That also explains why I wasn't able to reproduce it in my local environment: in my setup rook is always loaded before prometheus.
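
To make the ordering dependency concrete, here is a minimal illustrative sketch (simplified from the traceback above, not the real Ceph sources): the prometheus constructor asks the orchestrator layer for its status, which ends up doing a remote call into the configured backend module ('rook'); if that module has not been constructed yet, the call raises ImportError and prometheus fails to construct.

# Illustrative sketch only -- simplified from the traceback, not the actual Ceph code.
loaded_modules = {}  # modules that have already been constructed, keyed by name

def remote(module_name, method, *args, **kwargs):
    """Simplified stand-in for MgrModule.remote()."""
    if module_name not in loaded_modules:
        # Corresponds to the "ImportError: Module not found" seen in the logs above.
        raise ImportError("Module not found")
    return getattr(loaded_modules[module_name], method)(*args, **kwargs)

class PrometheusSketch:
    def __init__(self):
        # Mirrors prometheus/module.py calling get_orch_status() from __init__:
        # if 'rook' is constructed after prometheus, this raises and the mgr reports
        # "Failed to construct class in 'prometheus'".
        self.orch_available = remote("rook", "available")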

@kpoos

kpoos commented Jan 10, 2024

This issue also affects our setup. It started immediately after upgrading from 1.12 to 1.13, and the MGR pod has been constantly crashing since then.
[rook@rook-ceph-tools-68644848b9-flk4x /]$ ceph status
  cluster:
    id:     e64af4e2-a50d-408b-943a-89cc8539a3ca
    health: HEALTH_WARN
            125 mgr modules have recently crashed

[rook@rook-ceph-tools-68644848b9-flk4x /]$ ceph crash info 2024-01-10T09:17:44.340271Z_c16a026b-2df4-4857-b64a-ea73c41a5493
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found",
        "\nDuring handling of the above exception, another exception occurred:\n",
        "Traceback (most recent call last):",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 649, in __init__\n    self.modify_instance_id = self.get_orch_status() and self.get_module_option(",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 869, in get_orch_status\n    return self.available()[0]",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1661, in _oremote\n    f_set = self.get_feature_set()",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found"
    ],
    "ceph_version": "18.2.1",
    "crash_id": "2024-01-10T09:17:44.340271Z_c16a026b-2df4-4857-b64a-ea73c41a5493",
    "entity_name": "mgr.b",
    "mgr_module": "prometheus",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "ImportError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "6f4033c8739c625d3935380854c7945f8bd28267f6a4d03fc2017bc8815c257c",
    "timestamp": "2024-01-10T09:17:44.340271Z",
    "utsname_hostname": "rook-ceph-mgr-b-6d58d88fd9-j5bw6",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-53-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#59~20.04.1-Ubuntu SMP Thu Oct 20 15:10:22 UTC 2022"
}
[rook@rook-ceph-tools-68644848b9-flk4x /]$

@rkachach
Contributor

rkachach commented Jan 10, 2024

@kpoos I have confirmed that the issue is related to the changes from the PR I posted above. I have already opened a ticket on the ceph project to track the issue: https://tracker.ceph.com/issues/63992. Unfortunately, at the moment there is no workaround for this problem.

@pasztorl
Author

Thanks for the info!

@rkachach
Contributor

rkachach commented Jan 10, 2024

Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:

  • remove modules section from the cluster yaml
  • once the cluster is up & running, enable rook as follows:
 ceph mgr module enable rook
 ceph orch set backend rook
 ceph orch status

Following these steps the mgr seems to start without issues, but I can't guarantee that the same will happen in your environment. Also, please keep in mind that any restart of the mgr can lead to the problem again.
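
In terms of the spec from the top of this issue, that means keeping monitoring enabled but dropping the mgr.modules list, roughly like this (a sketch of the relevant fragment only):

mgr:
  ...
  # no modules list here; rook is enabled manually from the toolbox afterwards
...
monitoring:
  enabled: true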

@kpoos

kpoos commented Jan 11, 2024

@rkachach Looks that the workaround works in our setup as well. Thanks for that. Please keep us informed when a permanent fix is available. Thanks.

@xavi-clovr

Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:

  • remove modules section from the cluster yaml
  • once the cluster is up & running, enable rook as follows:
 ceph mgr module enable rook
 ceph orch set backend rook
 ceph orch status

Following these steps the mgr seems to start without issues, but I can't guarantee that the same will happen in your environment. Also, please keep in mind that any restart of the mgr can lead to the problem again.

I'm unable to achieve good behaviour using this workaround.
I also tried disabling the rook and prometheus mgr modules via Ceph commands and enabling them in the correct order; I guess that should also work? (For me it is still failing.)

@rkachach
Contributor

rkachach commented Jan 11, 2024

Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:

  • remove modules section from the cluster yaml
  • once the cluster is up & running, enable rook as follows:
 ceph mgr module enable rook
 ceph orch set backend rook
 ceph orch status

Following these steps the mgr seems to start without issues, but I can't guarantee that the same will happen in your environment. Also, please keep in mind that any restart of the mgr can lead to the problem again.

I'm unable to achieve good behaviour using this workaround. I also tried disabling the rook and prometheus mgr modules via Ceph commands and enabling them in the correct order; I guess that should also work? (For me it is still failing.)

@xavi-clovr Unfortunately the loading process is not deterministic, and right now there is a race condition between loading prometheus and the orchestrator (rook) which can lead to a crash of the prometheus module if it is loaded before rook. So the proposed workaround is not 100% guaranteed to work.

@rkachach
Contributor

rkachach commented Jan 12, 2024

I created the following image, which contains v18.2.1 plus a fix candidate: docker.io/rkachach/ceph:v18.2.1_patched_v1
It would be great if somebody could test the fix in a test environment and report the results.

IMPORTANT:

This Docker image includes a patch aimed at addressing a specific issue. It is intended for testing purposes only and should not be used in a production environment. The changes made in this image are experimental and may not have undergone thorough testing or received reviewers' approval.

Usage Guidelines:

  • This image is provided as-is, without any warranties or guarantees of any kind.
  • Do not use this image in a production environment or any critical system.
  • Use this image solely for testing, validation, and feedback purposes.
  • The changes included in this image may be subject to further modifications, and compatibility with future releases is not guaranteed.

@rspier
Contributor

rspier commented Jan 12, 2024

Good news: The prometheus module doesn't fail on startup.

Bad news: Metric collection doesn't work.

Hitting the /metrics endpoint results in...

debug 2024-01-12T17:02:59.686+0000 7fcf92f16700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 514, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1715, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1304, in get_metadata_and_osd_status
    str(daemon.daemon_id).split(".")[2]))
IndexError: list index out of range

@jimmy-ungerman

Still getting the same error with the new image as well.

@rkachach
Contributor

Good news: The prometheus module doesn't fail on startup.

Bad news: Metric collection doesn't work.

Hitting the /metrics endpoint results in...

debug 2024-01-12T17:02:59.686+0000 7fcf92f16700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 514, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1715, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1304, in get_metadata_and_osd_status
    str(daemon.daemon_id).split(".")[2]))
IndexError: list index out of range

@rspier thanks for deploying and testing the first part of the fix (the one related to the prometheus module crash). As for the new issue, I'm sure it has to do with some mismatch between how daemon_ids are named in cephadm-based deployments vs rook. I created a new image docker.io/rkachach/ceph:v18.2.1_patched_v1 with a small change, hoping it fixes this new issue.
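
The failing line in the traceback above splits daemon.daemon_id on "." and takes the third field, which only works when the id actually has at least three dot-separated parts; a tiny illustration of the resulting IndexError (the id values here are hypothetical, purely for illustration):

# Hypothetical id values, purely to illustrate the IndexError from the traceback.
for daemon_id in ("a.b.c", "a.b"):
    try:
        print(daemon_id, "->", daemon_id.split(".")[2])
    except IndexError as exc:
        print(daemon_id, "->", exc)  # the shorter id has no third field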

As for building an image, you basically have to pick the changes from my PR https://github.com/ceph/ceph/pull/55149/files and use them to update the v18.2.1 base image, with a simple Dockerfile like:

FROM quay.io/ceph/ceph:v18.2.1
COPY ./your_patched_prometheus_module.py /usr/share/ceph/mgr/prometheus/module.py
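
Once module.py is patched, a hedged sketch of how the image could be built and used (the registry name and tag below are placeholders, not an official image):

docker build -t myregistry.example.com/ceph:v18.2.1-patched .
docker push myregistry.example.com/ceph:v18.2.1-patched

Then point the CephCluster spec's cephVersion.image at the pushed tag so Rook deploys the mgr with the patched module.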

@rspier
Contributor

rspier commented Jan 12, 2024

The v1 patch fixes the issue for me. podip:9223/metrics returns metrics again!

Thank you!

@rkachach
Contributor

rkachach commented Jan 12, 2024

Welcome! Thanks to you for helping with the testing 👍

Please let me know if you observe any new issues. I'll be working on the definitive fix; as I said before, that could take more time, as it depends on the next ceph release.

@jimmy-ungerman

Also can confirm that the v1 patch is working for me

@arichard42

I confirm the v1 patch fixes the issue for me as well (using the rook helm charts).

@reefland

I was having this issue too (reported in the discussions area). Hopefully this makes it into the next rook-ceph release. Nice work!

@rkachach rkachach pinned this issue Jan 19, 2024
@barrettMCW

Running the v1 patch, and of course it fixed the issue <3
But I've noticed in my dashboard that the recovery throughput does not match the information from "ceph -s".
Any ideas? Would that be the patch, or something from the Ceph 18 update in general? The weird part is that the recovery throughput seems to be getting some metric, but it doesn't look like network throughput to me (it tells me I have 0-0.5 bytes of throughput, while ceph -s says I'm doing 150 MiB of recovery ops).
Thanks!

@rkachach
Contributor

rkachach commented Jan 23, 2024

@barrettMCW please, if you really think there is a potential BUG with the metrics, I'd recommend opening a new issue and providing the details to reproduce it.

@dcplaya

dcplaya commented Jan 29, 2024

Good news: The prometheus module doesn't fail on startup.
Bad news: Metric collection doesn't work.
Hitting the /metrics endpoint results in...

debug 2024-01-12T17:02:59.686+0000 7fcf92f16700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 514, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1715, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1304, in get_metadata_and_osd_status
    str(daemon.daemon_id).split(".")[2]))
IndexError: list index out of range

@rspier thanks for deploying and testing the first part of the fix (the one related to the prometheus module crash). As for the new issue, I'm sure it has to do with some mismatch between how daemon_ids are named in cephadm-based deployments vs rook. I created a new image docker.io/rkachach/ceph:v18.2.1_patched_v1 with a small change, hoping it fixes this new issue.

As for building an image, you basically have to pick the changes from my PR https://github.com/ceph/ceph/pull/55149/files and use them to update the v18.2.1 base image, with a simple Dockerfile like:

FROM quay.io/ceph/ceph:v18.2.1
COPY ./your_patched_prometheus_module.py /usr/share/ceph/mgr/prometheus/module.py

Unfortunately, this version did not start for me. I am running it on ARM though, if that makes any difference.

[screenshot of the error]

@rkachach
Contributor

@dcplaya please test the image docker.io/rkachach/ceph:v18.2.1_patched_v1; so far all the users have reported that it's working without issues.

@bhuism

bhuism commented Jan 30, 2024

@rkachach can you build an arm image?

@dcplaya

dcplaya commented Jan 30, 2024

@rkachach I did test that image; the error I got while running that version is in the screenshot I posted above.

@rich0

rich0 commented Jan 30, 2024

All: Just wanted to report that I had this issue and the patched version fixed it for me.

@dcplaya that image probably won't work on ARM, unless an ARM version was built for it. Maybe somebody will volunteer to build one, but if not you could follow the suggestions to build your own. Containers are typically arch-specific since they will contain binaries.

@guillaumetorresani

@bhuism I have built a multi-arch image; a public GitLab repository with operational CI is available at https://gitlab.onlineterroir.com/ceph/ceph. The multi-arch image is available at registry.gitlab.onlineterroir.com/ceph/ceph:v18.2.1_patched_v1

I confirm the v1 patch fixes the issue for me as well (using rook without helm charts).
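
For anyone who wants to reproduce a multi-arch build themselves, something along these lines should work with Docker Buildx on top of the Dockerfile sketched earlier (the tag is a placeholder, and this is not necessarily how the linked CI builds it):

docker buildx build --platform linux/amd64,linux/arm64 \
  -t myregistry.example.com/ceph:v18.2.1_patched_v1 --push .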

@rkachach
Contributor

@bhuism I have built a multi-arch image; a public GitLab repository with operational CI is available at https://gitlab.onlineterroir.com/ceph/ceph. The multi-arch image is available at registry.gitlab.onlineterroir.com/ceph/ceph:v18.2.1_patched_v1

I confirm the v1 patch fixes the issue for me as well (using rook without helm charts).

@guillaumetorresani thank you for providing the multi-arch image.

@dcplaya

dcplaya commented Jan 30, 2024

@guillaumetorresani's multi-arch image works for my ARM setup!
Metrics in Prometheus have also returned.

@bhuism

bhuism commented Jan 30, 2024

@guillaumetorresani thanks for the arm image!

@rkachach works like a charm!

@ahgraber
Contributor

Is there a PR in progress or a timeline for this to get merged into main?

@barrettMCW

@ahgraber It's a Ceph problem.
The PR has been merged: ceph/ceph#55149
It will be part of 18.2.2: https://github.com/ceph/ceph/milestone/19

@esomore

esomore commented Mar 7, 2024

@bhuism I have built a multi-arch image; a public GitLab repository with operational CI is available at https://gitlab.onlineterroir.com/ceph/ceph. The multi-arch image is available at registry.gitlab.onlineterroir.com/ceph/ceph:v18.2.1_patched_v1

I confirm the v1 patch fixes the issue for me as well (using rook without helm charts).

Thank You!

@yurirocha15

v18.2.2 was released last week. Does anyone know how long it usually takes for a new image to show up on quay.io? The latest image is still v18.2.1.

@rich0

rich0 commented Mar 11, 2024

v18.2.2 was released last week. Does anyone know how long it usually takes for a new image to show up on quay.io? The latest image is still v18.2.1.

I don't think it is actually released. It is tagged, but it seems to be going through the QA process - there is a thread on the list. I'm not sure what the normal release process is, but I've seen those sorts of threads go on for a while. I'm guessing this one will be done relatively quickly as it is a hotfix.

In any case, once it is released and the release page is updated/etc, I'm guessing the quay image will be updated.

@rkachach
Contributor

rkachach commented Mar 11, 2024

The new image tag should now be available: quay.io/ceph/ceph:v18.2.2

@pasztorl
Author

pasztorl commented Mar 11, 2024

Hi! I've tested with the new image; there is no error about the prometheus module in the mgr log, and the mgr serves on the metrics port. Thanks!

@travisn
Member

travisn commented Mar 12, 2024

This issue is fixed by updating the Ceph version to v18.2.2. See the Ceph upgrade guide to make this change. This will become the default in Rook v1.13.7 when released in the next few days, but there is no need to wait for that release before applying v18.2.2.
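
For reference, in the CephCluster CR the upgrade is just a bump of the image field, e.g. (fragment only):

spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2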
