Manager prometheus module failed to construct - ImportError: Module not found #13527

Closed
pasztorl opened this issue Jan 8, 2024 · 44 comments · Fixed by #13913

@pasztorl

pasztorl commented Jan 8, 2024

Hi,

After modifying the cluster spec with these settings:

mgr:
  ...
  modules:
    - name: rook
      enabled: true
    - name: pg_autoscaler
      enabled: true
    - name: prometheus
      enabled: true
...
monitoring:
  enabled: true

The prometheus module does not start; the mgr log shows:

debug 2024-01-08T10:02:03.658+0000 7fb850722200  0 set uid:gid to 167:167 (ceph:ceph)
debug 2024-01-08T10:02:03.658+0000 7fb850722200  0 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable), process ceph-mgr, pid 13
debug 2024-01-08T10:02:03.658+0000 7fb850722200  0 pidfile_write: ignore empty --pid-file
debug 2024-01-08T10:02:03.694+0000 7fb850722200  1 mgr[py] Loading python module 'restful'
debug 2024-01-08T10:02:04.142+0000 7fb850722200  1 mgr[py] Loading python module 'progress'
debug 2024-01-08T10:02:04.282+0000 7fb850722200 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:04.282+0000 7fb850722200  1 mgr[py] Loading python module 'telegraf'
debug 2024-01-08T10:02:04.394+0000 7fb850722200 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:04.394+0000 7fb850722200  1 mgr[py] Loading python module 'cephadm'
debug 2024-01-08T10:02:05.262+0000 7fb850722200  1 mgr[py] Loading python module 'osd_support'
debug 2024-01-08T10:02:05.458+0000 7fb850722200 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:05.458+0000 7fb850722200  1 mgr[py] Loading python module 'pg_autoscaler'
debug 2024-01-08T10:02:05.606+0000 7fb850722200 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:05.606+0000 7fb850722200  1 mgr[py] Loading python module 'selftest'
debug 2024-01-08T10:02:05.754+0000 7fb850722200 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:05.754+0000 7fb850722200  1 mgr[py] Loading python module 'prometheus'
debug 2024-01-08T10:02:06.326+0000 7fb850722200 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.326+0000 7fb850722200  1 mgr[py] Loading python module 'localpool'
debug 2024-01-08T10:02:06.446+0000 7fb850722200  1 mgr[py] Loading python module 'balancer'
debug 2024-01-08T10:02:06.666+0000 7fb850722200 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.666+0000 7fb850722200  1 mgr[py] Loading python module 'iostat'
debug 2024-01-08T10:02:06.802+0000 7fb850722200 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.802+0000 7fb850722200  1 mgr[py] Loading python module 'snap_schedule'
debug 2024-01-08T10:02:06.982+0000 7fb850722200 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:06.982+0000 7fb850722200  1 mgr[py] Loading python module 'orchestrator'
debug 2024-01-08T10:02:07.250+0000 7fb850722200 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:07.250+0000 7fb850722200  1 mgr[py] Loading python module 'rook'
debug 2024-01-08T10:02:08.538+0000 7fb850722200 -1 mgr[py] Module rook has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:08.538+0000 7fb850722200  1 mgr[py] Loading python module 'crash'
debug 2024-01-08T10:02:08.814+0000 7fb850722200 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:08.814+0000 7fb850722200  1 mgr[py] Loading python module 'k8sevents'
debug 2024-01-08T10:02:09.298+0000 7fb850722200  1 mgr[py] Loading python module 'nfs'
debug 2024-01-08T10:02:09.638+0000 7fb850722200 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:09.638+0000 7fb850722200  1 mgr[py] Loading python module 'status'
debug 2024-01-08T10:02:09.938+0000 7fb850722200 -1 mgr[py] Module status has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:09.938+0000 7fb850722200  1 mgr[py] Loading python module 'stats'
debug 2024-01-08T10:02:10.082+0000 7fb850722200  1 mgr[py] Loading python module 'mirroring'
debug 2024-01-08T10:02:10.234+0000 7fb850722200  1 mgr[py] Loading python module 'rbd_support'
debug 2024-01-08T10:02:10.398+0000 7fb850722200 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:10.398+0000 7fb850722200  1 mgr[py] Loading python module 'alerts'
debug 2024-01-08T10:02:10.546+0000 7fb850722200 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:10.546+0000 7fb850722200  1 mgr[py] Loading python module 'diskprediction_local'
debug 2024-01-08T10:02:10.754+0000 7fb850722200 -1 mgr[py] Module diskprediction_local has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:10.754+0000 7fb850722200  1 mgr[py] Loading python module 'volumes'
debug 2024-01-08T10:02:11.098+0000 7fb850722200 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:11.098+0000 7fb850722200  1 mgr[py] Loading python module 'mds_autoscaler'
debug 2024-01-08T10:02:11.578+0000 7fb850722200  1 mgr[py] Loading python module 'telemetry'
debug 2024-01-08T10:02:11.794+0000 7fb850722200 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:11.794+0000 7fb850722200  1 mgr[py] Loading python module 'osd_perf_query'
debug 2024-01-08T10:02:11.942+0000 7fb850722200 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:11.942+0000 7fb850722200  1 mgr[py] Loading python module 'test_orchestrator'
debug 2024-01-08T10:02:12.222+0000 7fb850722200 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.222+0000 7fb850722200  1 mgr[py] Loading python module 'zabbix'
debug 2024-01-08T10:02:12.354+0000 7fb850722200 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.354+0000 7fb850722200  1 mgr[py] Loading python module 'rgw'
debug 2024-01-08T10:02:12.654+0000 7fb850722200 -1 mgr[py] Module rgw has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.654+0000 7fb850722200  1 mgr[py] Loading python module 'devicehealth'
debug 2024-01-08T10:02:12.842+0000 7fb850722200 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:12.842+0000 7fb850722200  1 mgr[py] Loading python module 'insights'
debug 2024-01-08T10:02:13.030+0000 7fb850722200  1 mgr[py] Loading python module 'dashboard'
debug 2024-01-08T10:02:14.134+0000 7fb850722200  1 mgr[py] Loading python module 'influx'
debug 2024-01-08T10:02:14.282+0000 7fb850722200 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
debug 2024-01-08T10:02:14.286+0000 7fb843dd0700  0 ms_deliver_dispatch: unhandled message 0x557908acb600 mon_map magic: 0 v1 from mon.0 v2:172.31.16.198:3300/0
debug 2024-01-08T10:02:14.294+0000 7fb843dd0700  1 mgr handle_mgr_map Activating!
debug 2024-01-08T10:02:14.294+0000 7fb843dd0700  1 mgr handle_mgr_map I am now activating
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  0 [balancer DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  1 mgr load Constructed class from module: balancer
debug 2024-01-08T10:02:14.310+0000 7fb7e045d700  0 [balancer INFO root] Starting
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  0 [crash DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.310+0000 7fb7ff998700  1 mgr load Constructed class from module: crash
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  0 [dashboard DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  1 mgr load Constructed class from module: dashboard
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.314+0000 7fb7ff998700  1 mgr load Constructed class from module: devicehealth
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO access_control] Loading user roles DB version=2
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO sso] Loading SSO DB version=1
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO root] server: ssl=no host=:: port=8080
debug 2024-01-08T10:02:14.314+0000 7fb7de459700  0 [dashboard INFO root] Configured CherryPy, starting engine...
debug 2024-01-08T10:02:14.318+0000 7fb7dd457700  0 [devicehealth INFO root] Starting
debug 2024-01-08T10:02:14.318+0000 7fb7ff998700  0 [iostat DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.318+0000 7fb7ff998700  1 mgr load Constructed class from module: iostat
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  0 [nfs DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  1 mgr load Constructed class from module: nfs
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.322+0000 7fb7ff998700  1 mgr load Constructed class from module: orchestrator
debug 2024-01-08T10:02:14.326+0000 7fb7e045d700  0 [balancer INFO root] Optimize plan auto_2024-01-08_10:02:14
debug 2024-01-08T10:02:14.326+0000 7fb7e045d700  0 [balancer INFO root] Mode upmap, max misplaced 0.050000
debug 2024-01-08T10:02:14.326+0000 7fb7e045d700  0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
debug 2024-01-08T10:02:14.326+0000 7fb7ff998700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.326+0000 7fb7ff998700  1 mgr load Constructed class from module: pg_autoscaler
debug 2024-01-08T10:02:14.330+0000 7fb7ff998700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.330+0000 7fb7ff998700  1 mgr load Constructed class from module: progress
debug 2024-01-08T10:02:14.334+0000 7fb7d944f700  0 [pg_autoscaler INFO root] _maybe_adjust
debug 2024-01-08T10:02:14.338+0000 7fb7d844d700  0 [progress INFO root] Loading...
debug 2024-01-08T10:02:14.338+0000 7fb7ff998700  0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-08T10:02:14.346+0000 7fb7d844d700  0 [progress INFO root] Loaded [<progress.module.GhostEvent object at 0x7fb7e460bbe0>, ... 31 more GhostEvent objects ...] historic events
debug 2024-01-08T10:02:14.346+0000 7fb7d844d700  0 [progress INFO root] Loaded OSDMap, ready.
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 mgr load Failed to construct class in 'prometheus'
debug 2024-01-08T10:02:14.358+0000 7fb81919b700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 649, in __init__
    self.modify_instance_id = self.get_orch_status() and self.get_module_option(
  File "/usr/share/ceph/mgr/prometheus/module.py", line 869, in get_orch_status
    return self.available()[0]
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1661, in _oremote
    f_set = self.get_feature_set()
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

debug 2024-01-08T10:02:14.358+0000 7fb7ff998700 -1 mgr operator() Failed to run module in active mode ('prometheus')
ceph mgr module ls
MODULE                              
balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
dashboard             on            
iostat                on            
nfs                   on            
prometheus            on            
restful               on            
rook                  on       
...

The CRD is in Ready status; everything seems to work except the dashboard "performance" tabs.

What can I check to debug this issue?
Thanks!

@pasztorl pasztorl added the bug label Jan 8, 2024
@rkachach
Contributor

rkachach commented Jan 8, 2024

Strange, because in the logs it seems that the rook module is not enabled:

debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'
debug 2024-01-08T10:02:14.346+0000 7fb7ff998700 -1 no module 'rook'

Did you try to enable the module from the toolbox?

ceph mgr module enable rook && ceph orch set backend rook && ceph orch status
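
For reference, a minimal sketch of how to reach the toolbox shell (assuming the default rook-ceph namespace and the standard toolbox deployment name), from which the commands above can be run:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash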

@pasztorl
Author

pasztorl commented Jan 8, 2024

ceph mgr module enable rook && ceph orch set backend rook && ceph orch status

output:

module 'rook' is already enabled
Backend: rook
Available: Yes

There was nothing about that in the manager logs, so I restarted the mgr pod; the startup log is the same as above.

@pasztorl
Author

pasztorl commented Jan 8, 2024

The mgr container is running this image: quay.io/ceph/ceph:v18.2.1

@pasztorl
Author

pasztorl commented Jan 8, 2024

The rook module seems to be working, because I can see Cluster -> Physical Disks.

@matthewpi
Contributor

matthewpi commented Jan 9, 2024

I am also able to reproduce this. My config only explicitly configures the rook module, though, and monitoring is enabled.

Something interesting to note is that Prometheus works on a standby mgr, but not the active one.

Logs from active mgr: (some lines are omitted, indicated by ...)

debug 2024-01-09T01:03:37.231+0000 7f6def5d3200  1 mgr[py] Loading python module 'rook'
debug 2024-01-09T01:03:38.140+0000 7f6def5d3200 -1 mgr[py] Module rook has missing NOTIFY_TYPES member
...
debug 2024-01-09T01:03:40.096+0000 7f6de2c81700  1 mgr handle_mgr_map Activating!
debug 2024-01-09T01:03:40.098+0000 7f6de2c81700  1 mgr handle_mgr_map I am now activating
...
debug 2024-01-09T01:03:40.194+0000 7f6c98670700  0 [dashboard INFO root] Configured CherryPy, starting engine...
debug 2024-01-09T01:03:40.196+0000 7f6c8766e700  0 [devicehealth INFO root] Starting
debug 2024-01-09T01:03:40.206+0000 7f6b7fe4f700  1 mgr load Constructed class from module: pg_autoscaler
debug 2024-01-09T01:03:40.209+0000 7f6b7fe4f700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.209+0000 7f6b7fe4f700  1 mgr load Constructed class from module: progress
debug 2024-01-09T01:03:40.218+0000 7f6b7fe4f700  0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.226+0000 7f6c43666700  0 [pg_autoscaler INFO root] _maybe_adjust
debug 2024-01-09T01:03:40.226+0000 7f6c3a664700  0 [progress INFO root] Loading...
debug 2024-01-09T01:03:40.233+0000 7f6c3a664700  0 [progress INFO root] Loaded [...] historic events
debug 2024-01-09T01:03:40.233+0000 7f6c3a664700  0 [progress INFO root] Loaded OSDMap, ready.
debug 2024-01-09T01:03:40.247+0000 7f6b7fe4f700 -1 no module 'rook'
debug 2024-01-09T01:03:40.255+0000 7f6b7fe4f700 -1 no module 'rook'
debug 2024-01-09T01:03:40.256+0000 7f6b7fe4f700 -1 mgr load Failed to construct class in 'prometheus'
debug 2024-01-09T01:03:40.256+0000 7f6b7fe4f700 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 649, in __init__
    self.modify_instance_id = self.get_orch_status() and self.get_module_option(
  File "/usr/share/ceph/mgr/prometheus/module.py", line 869, in get_orch_status
    return self.available()[0]
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1661, in _oremote
    f_set = self.get_feature_set()
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

debug 2024-01-09T01:03:40.261+0000 7f6b7fe4f700 -1 mgr operator() Failed to run module in active mode ('prometheus')
debug 2024-01-09T01:03:40.262+0000 7f6b7fe4f700  0 [rbd_support DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.278+0000 7f6be5e5b700  0 [rbd_support INFO root] recovery thread starting
debug 2024-01-09T01:03:40.278+0000 7f6be5e5b700  0 [rbd_support INFO root] starting setup
debug 2024-01-09T01:03:40.278+0000 7f6b7fe4f700  1 mgr load Constructed class from module: rbd_support
debug 2024-01-09T01:03:40.285+0000 7f6b7fe4f700  0 [restful DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.285+0000 7f6b7fe4f700  1 mgr load Constructed class from module: restful
debug 2024-01-09T01:03:40.286+0000 7f6be5e5b700  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-01-09T01:03:40.300+0000 7f6be5e5b700  0 [rbd_support INFO root] load_schedules: blocks, start_after=
debug 2024-01-09T01:03:40.302+0000 7f6b7fe4f700  0 [rook DEBUG root] setting log level based on debug_mgr: INFO (2/5)
debug 2024-01-09T01:03:40.303+0000 7f6bcc658700  0 [restful INFO root] server_addr: :: server_port: 8003
debug 2024-01-09T01:03:40.306+0000 7f6b7fe4f700  1 mgr load Constructed class from module: rook
...

Environment:

  • OS: NixOS 24.05 (Uakari)
  • Kernel: Linux 6.1.71 #1-NixOS SMP PREEMPT_DYNAMIC Fri Jan 5 14:18:41 UTC 2024 x86_64 GNU/Linux
  • Rook version: v1.13.1
  • Storage backend version: quay.io/ceph/ceph:v18.2.1-20231215
  • Kubernetes version: v1.28.5

@rkachach
Contributor

rkachach commented Jan 9, 2024

Thank you all for the feedback. I'll try to reproduce the issue locally 👍

@rkachach rkachach self-assigned this Jan 9, 2024
@rkachach
Contributor

rkachach commented Jan 9, 2024

@matthewpi please, can you post the full logs of the mgr startup including ALL the modules?

@matthewpi
Contributor

@matthewpi please, can you post the full logs of the mgr startup including ALL the modules?

Here are the logs from startup.
mgr-active.log
mgr-standby.log

@rkachach
Contributor

rkachach commented Jan 10, 2024

Thank you very much @matthewpi. Just FYI, I tried several times to reproduce the issue locally, but with no success.

Analyzing these logs and the ones originally provided by @pasztorl, the common pattern is the module loading order: prometheus is loaded first, then rook (we can see this in the following snippet from your active mgr logs):

      4:debug 2024-01-09T18:30:54.532+0000 7ff82be94200  1 mgr[py] Loading python module 'alerts'
      6:debug 2024-01-09T18:30:54.698+0000 7ff82be94200  1 mgr[py] Loading python module 'balancer'
      ....
      ....
     35:debug 2024-01-09T18:31:00.365+0000 7ff82be94200  1 mgr[py] Loading python module 'progress'
     37:debug 2024-01-09T18:31:00.485+0000 7ff82be94200  1 mgr[py] Loading python module 'prometheus'
     39:debug 2024-01-09T18:31:01.159+0000 7ff82be94200  1 mgr[py] Loading python module 'rbd_support'
     41:debug 2024-01-09T18:31:01.311+0000 7ff82be94200  1 mgr[py] Loading python module 'restful'
     42:debug 2024-01-09T18:31:01.591+0000 7ff82be94200  1 mgr[py] Loading python module 'rgw'
     44:debug 2024-01-09T18:31:01.875+0000 7ff82be94200  1 mgr[py] Loading python module 'rook'
     46:debug 2024-01-09T18:31:02.844+0000 7ff82be94200  1 mgr[py] Loading python module 'selftest'

I think the problem is a dependency between the prometheus and orchestrator (rook, in this case) modules introduced recently by the changes from PR ceph/ceph#52191, specifically the orch status check.

In the error stack trace we can see how the mgr fails to load prometheus (Module not found) because rook has not yet been loaded:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 649, in __init__
    self.modify_instance_id = self.get_orch_status() and self.get_module_option(
  File "/usr/share/ceph/mgr/prometheus/module.py", line 869, in get_orch_status
    return self.available()[0]
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1661, in _oremote
    f_set = self.get_feature_set()
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1586, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1657, in _oremote
    return mgr.remote(o, meth, *args, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 2228, in remote
    args, kwargs)
ImportError: Module not found

That also explains why I wasn't able to reproduce it in my local environment: in my setup rook is always loaded before prometheus.
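
To make the ordering dependency concrete, here is a minimal illustrative sketch (simplified from the traceback above, not the real Ceph sources): the prometheus constructor asks the orchestrator layer for its status, which ends up doing a remote call into the configured backend module ('rook'); if that module has not been constructed yet, the call raises ImportError and prometheus fails to construct.

# Illustrative sketch only -- simplified from the traceback, not the actual Ceph code.
loaded_modules = {}  # modules that have already been constructed, keyed by name

def remote(module_name, method, *args, **kwargs):
    """Simplified stand-in for MgrModule.remote()."""
    if module_name not in loaded_modules:
        # Corresponds to the "ImportError: Module not found" seen in the logs above.
        raise ImportError("Module not found")
    return getattr(loaded_modules[module_name], method)(*args, **kwargs)

class PrometheusSketch:
    def __init__(self):
        # Mirrors prometheus/module.py calling get_orch_status() from __init__:
        # if 'rook' is constructed after prometheus, this raises and the mgr reports
        # "Failed to construct class in 'prometheus'".
        self.orch_available = remote("rook", "available")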

@kpoos

kpoos commented Jan 10, 2024

This issue also affects our setup. It started immediately after upgrading from 1.12 to 1.13, and the MGR pod has been constantly crashing since then.
[rook@rook-ceph-tools-68644848b9-flk4x /]$ ceph status
  cluster:
    id:     e64af4e2-a50d-408b-943a-89cc8539a3ca
    health: HEALTH_WARN
            125 mgr modules have recently crashed

[rook@rook-ceph-tools-68644848b9-flk4x /]$ ceph crash info 2024-01-10T09:17:44.340271Z_c16a026b-2df4-4857-b64a-ea73c41a5493
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found",
        "\nDuring handling of the above exception, another exception occurred:\n",
        "Traceback (most recent call last):",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 649, in __init__\n    self.modify_instance_id = self.get_orch_status() and self.get_module_option(",
        "  File \"/usr/share/ceph/mgr/prometheus/module.py\", line 869, in get_orch_status\n    return self.available()[0]",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1661, in _oremote\n    f_set = self.get_feature_set()",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1586, in inner\n    completion = self._oremote(method_name, args, kwargs)",
        "  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1657, in _oremote\n    return mgr.remote(o, meth, *args, **kwargs)",
        "  File \"/usr/share/ceph/mgr/mgr_module.py\", line 2228, in remote\n    args, kwargs)",
        "ImportError: Module not found"
    ],
    "ceph_version": "18.2.1",
    "crash_id": "2024-01-10T09:17:44.340271Z_c16a026b-2df4-4857-b64a-ea73c41a5493",
    "entity_name": "mgr.b",
    "mgr_module": "prometheus",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "ImportError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "6f4033c8739c625d3935380854c7945f8bd28267f6a4d03fc2017bc8815c257c",
    "timestamp": "2024-01-10T09:17:44.340271Z",
    "utsname_hostname": "rook-ceph-mgr-b-6d58d88fd9-j5bw6",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-53-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#59~20.04.1-Ubuntu SMP Thu Oct 20 15:10:22 UTC 2022"
}
[rook@rook-ceph-tools-68644848b9-flk4x /]$

@rkachach
Contributor

rkachach commented Jan 10, 2024

@kpoos I have confirmed that the issue is related to the changes from the PR I posted above. I have already opened a ticket on the ceph project to track the issue: https://tracker.ceph.com/issues/63992. Unfortunately, at the moment there is no workaround for this problem.

@pasztorl
Author

Thanks for the info!

@rkachach
Contributor

rkachach commented Jan 10, 2024

Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:

  • remove modules section from the cluster yaml
  • once the cluster is up & running, enable rook as follows:
 ceph mgr module enable rook
 ceph orch set backend rook
 ceph orch status

Following these steps the mgr seems to start without issues, but I can't guarantee that the same will happen in your environment. Also, please keep in mind that any restart of the mgr can lead to the problem again.
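
In terms of the spec from the top of this issue, that means keeping monitoring enabled but dropping the mgr.modules list, roughly like this (a sketch of the relevant fragment only):

mgr:
  ...
  # no modules list here; rook is enabled manually from the toolbox afterwards
...
monitoring:
  enabled: true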

@kpoos

kpoos commented Jan 11, 2024

@rkachach Looks that the workaround works in our setup as well. Thanks for that. Please keep us informed when a permanent fix is available. Thanks.

@xavi-clovr

Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:

  • remove modules section from the cluster yaml
  • once the cluster is up & running, enable rook as follows:
 ceph mgr module enable rook
 ceph orch set backend rook
 ceph orch status

Following these steps the mgr seems to start without issues, but I can't guarantee that the same will happen in your environment. Also, please keep in mind that any restart of the mgr can lead to the problem again.

I'm unable to achieve good behaviour using this workaround.
I also tried disabling the rook and prometheus mgr modules via Ceph commands and enabling them in the correct order; I guess that should also work? (For me it is still failing.)

@rkachach
Contributor

rkachach commented Jan 11, 2024

Not sure if this could help or not, but on my local env I was able to enable rook and prometheus by using the following setup:

  • remove modules section from the cluster yaml
  • once the cluster is up & running, enable rook as follows:
 ceph mgr module enable rook
 ceph orch set backend rook
 ceph orch status

Following these steps the mgr seems to start without issues, but I can't guarantee that the same will happen in your environment. Also, please keep in mind that any restart of the mgr can lead to the problem again.

I'm unable to achieve good behaviour using this workaround. I also tried disabling the rook and prometheus mgr modules via Ceph commands and enabling them in the correct order; I guess that should also work? (For me it is still failing.)

@xavi-clovr Unfortunately the loading process is not deterministic, and right now there is a race condition between loading prometheus and the orchestrator (rook) which can lead to a crash of the prometheus module if it is loaded before rook. So the proposed workaround is not 100% guaranteed to work.

@rkachach
Contributor

rkachach commented Jan 12, 2024

I created the following image, which contains v18.2.1 plus a fix candidate: docker.io/rkachach/ceph:v18.2.1_patched_v1
It would be great if somebody could test the fix in a test environment and report the results.

IMPORTANT:

This Docker image includes a patch aimed at addressing a specific issue. It is intended for testing purposes only and should not be used in a production environment. The changes made in this image are experimental and may not have undergone thorough testing or received reviewers' approval.

Usage Guidelines:

  • This image is provided as-is, without any warranties or guarantees of any kind.
  • Do not use this image in a production environment or any critical system.
  • Use this image solely for testing, validation, and feedback purposes.
  • The changes included in this image may be subject to further modifications, and compatibility with future releases is not guaranteed.

@rspier
Contributor

rspier commented Jan 12, 2024

Good news: The prometheus module doesn't fail on startup.

Bad news: Metric collection doesn't work.

Hitting the /metrics endpoint results in...

debug 2024-01-12T17:02:59.686+0000 7fcf92f16700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 514, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1715, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1304, in get_metadata_and_osd_status
    str(daemon.daemon_id).split(".")[2]))
IndexError: list index out of range

@jimmy-ungerman

Still getting the same error with the new image as well.

@rkachach
Contributor

Good news: The prometheus module doesn't fail on startup.

Bad news: Metric collection doesn't work.

Hitting the /metrics endpoint results in...

debug 2024-01-12T17:02:59.686+0000 7fcf92f16700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 514, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1715, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1304, in get_metadata_and_osd_status
    str(daemon.daemon_id).split(".")[2]))
IndexError: list index out of range

@rspier thanks for deploying and testing the first part of the fix (the one related to the prometheus module crash). As for the new issue, I'm sure it has to do with some mismatch between how daemon_ids are named in cephadm-based deployments vs rook. I created a new image docker.io/rkachach/ceph:v18.2.1_patched_v1 with a small change, hoping it fixes this new issue.
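
The failing line in the traceback above splits daemon.daemon_id on "." and takes the third field, which only works when the id actually has at least three dot-separated parts; a tiny illustration of the resulting IndexError (the id values here are hypothetical, purely for illustration):

# Hypothetical id values, purely to illustrate the IndexError from the traceback.
for daemon_id in ("a.b.c", "a.b"):
    try:
        print(daemon_id, "->", daemon_id.split(".")[2])
    except IndexError as exc:
        print(daemon_id, "->", exc)  # the shorter id has no third field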

As for building an image, you basically have to pick the changes from my PR https://github.com/ceph/ceph/pull/55149/files and use them to update the v18.2.1 base image, with a simple Dockerfile like:

FROM quay.io/ceph/ceph:v18.2.1
COPY ./your_patched_prometheus_module.py /usr/share/ceph/mgr/prometheus/module.py
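
Once module.py is patched, a hedged sketch of how the image could be built and used (the registry name and tag below are placeholders, not an official image):

docker build -t myregistry.example.com/ceph:v18.2.1-patched .
docker push myregistry.example.com/ceph:v18.2.1-patched

Then point the CephCluster spec's cephVersion.image at the pushed tag so Rook deploys the mgr with the patched module.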

@rspier
Contributor

rspier commented Jan 12, 2024

The v1 patch fixes the issue for me. podip:9223/metrics returns metrics again!

Thank you!

@rkachach
Contributor

rkachach commented Jan 12, 2024

Welcome! Thanks to you for helping with the testing 👍

Please let me know if you observe any new issues. I'll be working on the definitive fix; as I said before, that could take more time, as it depends on the next ceph release.

@jimmy-ungerman

Also can confirm that the v1 patch is working for me

@arichard42

I confirm the v1 patch fixes the issue for me as well (using the rook helm charts).

@reefland

I was having this issue too (reported in the discussions area). Hopefully this makes it into the next rook-ceph release. Nice work!

@rkachach rkachach pinned this issue Jan 19, 2024
@barrettMCW

Running the v1 patch, and of course it fixed the issue <3
But I've noticed in my dashboard that the recovery throughput does not match the information from "ceph -s".
Any ideas? Would that be the patch, or something from the Ceph 18 update in general? The weird part is that the recovery throughput seems to be getting some metric, but it doesn't look like network throughput to me (it tells me I have 0-0.5 bytes of throughput, while ceph -s says I'm doing 150 MiB of recovery ops).
Thanks!

@rkachach
Contributor

rkachach commented Jan 23, 2024

@barrettMCW please, if you really think there is a potential BUG with the metrics, I'd recommend opening a new issue and providing the details to reproduce it.

@dcplaya

dcplaya commented Jan 29, 2024

Good news: The prometheus module doesn't fail on startup.
Bad news: Metric collection doesn't work.
Hitting the /metrics endpoint results in...

debug 2024-01-12T17:02:59.686+0000 7fcf92f16700  0 [prometheus ERROR root] failed to collect metrics:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 514, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1715, in collect
    self.get_metadata_and_osd_status()
  File "/usr/share/ceph/mgr/mgr_util.py", line 859, in wrapper
    result = f(*args, **kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1304, in get_metadata_and_osd_status
    str(daemon.daemon_id).split(".")[2]))
IndexError: list index out of range

@rspier thanks for deploying and testing the first part of the fix (the one related to the prometheus module crash). As for the new issue, I'm sure it has to do with some mismatch between how daemon_ids are named in cephadm-based deployments vs rook. I created a new image docker.io/rkachach/ceph:v18.2.1_patched_v1 with a small change, hoping it fixes this new issue.

As for building an image, you basically have to pick the changes from my PR https://github.com/ceph/ceph/pull/55149/files and use them to update the v18.2.1 base image, with a simple Dockerfile like:

FROM quay.io/ceph/ceph:v18.2.1
COPY ./your_patched_prometheus_module.py /usr/share/ceph/mgr/prometheus/module.py

Unfortunately, this version did not start for me. I am running it on ARM though, if that makes any difference.

[screenshot of the error]

@rkachach
Contributor

@dcplaya please test the image docker.io/rkachach/ceph:v18.2.1_patched_v1; so far all the users have reported that it's working without issues.

@bhuism

bhuism commented Jan 30, 2024

@rkachach can you build an arm image?

@dcplaya

dcplaya commented Jan 30, 2024

@rkachach I did test that image; the error I got while running that version is in the screenshot I posted above.

@rich0

rich0 commented Jan 30, 2024

All: Just wanted to report that I had this issue and the patched version fixed it for me.

@dcplaya that image probably won't work on ARM, unless an ARM version was built for it. Maybe somebody will volunteer to build one, but if not you could follow the suggestions to build your own. Containers are typically arch-specific since they will contain binaries.

@guillaumetorresani

@bhuism I have built a multi-arch image; a public GitLab repository with operational CI is available at https://gitlab.onlineterroir.com/ceph/ceph. The multi-arch image is available at registry.gitlab.onlineterroir.com/ceph/ceph:v18.2.1_patched_v1

I confirm the v1 patch fixes the issue for me as well (using rook without helm charts).
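
For anyone who wants to reproduce a multi-arch build themselves, something along these lines should work with Docker Buildx on top of the Dockerfile sketched earlier (the tag is a placeholder, and this is not necessarily how the linked CI builds it):

docker buildx build --platform linux/amd64,linux/arm64 \
  -t myregistry.example.com/ceph:v18.2.1_patched_v1 --push .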

@rkachach
Contributor

@bhuism I have built a multi-arch image; a public GitLab repository with operational CI is available at https://gitlab.onlineterroir.com/ceph/ceph. The multi-arch image is available at registry.gitlab.onlineterroir.com/ceph/ceph:v18.2.1_patched_v1

I confirm the v1 patch fixes the issue for me as well (using rook without helm charts).

@guillaumetorresani thank you for providing the multi-arch image.

@dcplaya

dcplaya commented Jan 30, 2024

@guillaumetorresani's multi-arch image works for my ARM setup!
Metrics in Prometheus have also returned.

@bhuism

bhuism commented Jan 30, 2024

@guillaumetorresani thanks for the arm image!

@rkachach works like a charm!

@ahgraber
Contributor

Is there a PR in progress or a timeline for this to get merged into main?

@barrettMCW

@ahgraber It's a Ceph problem.
The PR has been merged: ceph/ceph#55149
It will be part of 18.2.2: https://github.com/ceph/ceph/milestone/19

@esomore

esomore commented Mar 7, 2024

@bhuism I have built a multi-arch image; a public GitLab repository with operational CI is available at https://gitlab.onlineterroir.com/ceph/ceph. The multi-arch image is available at registry.gitlab.onlineterroir.com/ceph/ceph:v18.2.1_patched_v1

I confirm the v1 patch fixes the issue for me as well (using rook without helm charts).

Thank You!

@yurirocha15

v18.2.2 was released last week. Does anyone know how long it usually takes for a new image to show up on quay.io? The latest image is still v18.2.1.

@rich0

rich0 commented Mar 11, 2024

v18.2.2 was released last week. Does anyone know how long it usually takes for a new image to show up on quay.io? The latest image is still v18.2.1.

I don't think it is actually released. It is tagged, but it seems to be going through the QA process - there is a thread on the list. I'm not sure what the normal release process is, but I've seen those sorts of threads go on for a while. I'm guessing this one will be done relatively quickly as it is a hotfix.

In any case, once it is released and the release page is updated/etc, I'm guessing the quay image will be updated.

@rkachach
Contributor

rkachach commented Mar 11, 2024

The new image tag should now be available: quay.io/ceph/ceph:v18.2.2

@pasztorl
Author

pasztorl commented Mar 11, 2024

Hi! I've tested with the new image; there is no error about the prometheus module in the mgr log, and the mgr serves on the metrics port. Thanks!

@travisn
Member

travisn commented Mar 12, 2024

This issue is fixed by updating the Ceph version to v18.2.2. See the Ceph upgrade guide to make this change. This will become the default in Rook v1.13.7 when released in the next few days, but there is no need to wait for that release before applying v18.2.2.
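
For reference, in the CephCluster CR the upgrade is just a bump of the image field, e.g. (fragment only):

spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2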
