Ceph MGR: 2 modules failed on default install #2335

Closed
galexrt opened this Issue Dec 5, 2018 · 25 comments

galexrt (Member) commented Dec 5, 2018

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
Two Ceph MGR modules failed to come up, causing the Ceph cluster to report a HEALTH_ERR state.

See logs: https://gist.github.com/galexrt/3626102e96dddcef071060b71d94e280

Expected behavior:
The dashboard and prometheus modules come up and work fine.

How to reproduce it (minimal and precise):

  1. Use the example cluster.yaml, in my case in a minikube environment on K8S 1.11.4.
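
For reference, a minimal command sketch of that repro (file paths assumed from the Rook repository layout at the time; adjust to your checkout):

```
# Assumed repro sketch, not verbatim from the report:
minikube start --kubernetes-version v1.11.4
kubectl create -f cluster/examples/kubernetes/ceph/operator.yaml
kubectl create -f cluster/examples/kubernetes/ceph/cluster.yaml
kubectl create -f cluster/examples/kubernetes/ceph/toolbox.yaml   # for checking ceph health later
```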

Environment:

  • OS (e.g. from /etc/os-release):
    ```
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"
    ```
  • Kernel (e.g. `uname -a`): `Linux minikube 4.15.0 #1 SMP Fri Oct 5 20:44:14 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux`
  • Cloud provider or hardware configuration:
  • Rook version (use `rook version` inside of a Rook Pod): `rook: v0.8.0-350.g18b2da5f` (freshly built from latest `master` this morning, https://github.com/rook/rook/commit/18b2da5fc5d7a303b9a48119ce55108b55af7f0e)
  • Kubernetes version (use `kubectl version`):
    ```
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.4", GitCommit:"bf9a868e8ea3d3a8fa53cbb22f566771b3f8068b", GitTreeState:"clean", BuildDate:"2018-10-25T19:06:30Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
    ```
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): minikube
  • Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_ERR - 2 modules have failed

@galexrt galexrt added the ceph label Dec 5, 2018

@travisn travisn added this to the 0.9 milestone Dec 5, 2018

@travisn travisn added this to To do in v0.9 via automation Dec 5, 2018

NickUfer commented Dec 5, 2018

Same problem for me. The ports are bound and reachable via curl on 127.0.0.1 and the pod IP.

sh-4.2# ss -lntu
Netid State      Recv-Q Send-Q   Local Address:Port                  Peer Address:Port              
tcp   LISTEN     0      128         10.0.16.95:6800                             *:*                  
tcp   LISTEN     0      5                   :::9283                            :::*                  
tcp   LISTEN     0      5                   :::8443                            :::*                  
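
For reference, the curl checks mentioned above would presumably look something like this (the dashboard serves HTTPS on 8443 with a self-signed certificate, the prometheus module plain HTTP on 9283):

```
# Hypothetical spot checks from inside the mgr pod:
curl -k https://127.0.0.1:8443/        # dashboard (self-signed cert, hence -k)
curl http://127.0.0.1:9283/metrics     # prometheus exporter endpoint
```
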
stucki commented Dec 7, 2018

Had the same problem multiple times after starting over with a fresh setup.
Try this as a workaround:

kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph status     # reports HEALTH_ERR
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module disable prometheus
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module disable dashboard
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph status     # reports HEALTH_OK
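
Once the underlying bind issue is fixed, the modules can presumably be re-enabled the same way:

```
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module enable prometheus
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module enable dashboard
```
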
lastessa commented Dec 10, 2018

The same problem on a fresh Ubuntu machine with a Kubernetes cluster and Rook install.

galexrt (Member) commented Dec 10, 2018

@lastessa @NickUfer @stucki Could you please post your Rook Ceph MGR logs as an attachment or gist?

lastessa commented Dec 10, 2018

https://gist.github.com/lastessa/784153cfb132e8701250be7441e59387

[root@krs-1 /]# ceph status
  cluster:
    id:     476b51c2-9edc-4831-8b3a-a7655997c5fd
    health: HEALTH_ERR
            2 modules have failed

  services:
    mon: 4 daemons, quorum b,d,c,a
    mgr: a(active)
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   1 pools, 100 pgs
    objects: 57 objects, 140 MiB
    usage:   82 GiB used, 1.1 TiB / 1.2 TiB avail
    pgs:     100 active+clean

based64god commented Dec 11, 2018

Same issue here. I've managed to nail down some repro steps, but feel free to ask if anything else is needed that I may have missed.

My cluster is currently running Ubuntu 18.04.1 with kernel 4.15.0-42-generic and docker.io/weaveworks/weave-npc:2.5.0 for CNI on all machines. The cluster also runs the latest Prometheus Operator v0.26.0 and the prometheus, prometheus-service, and service-monitor YAML from commit fb557b0.

Installing the Rook operator and cluster with the operator and cluster YAML, also from commit fb557b0, causes the prometheus and dashboard modules to report errors binding to ports 9283 and 8443, with logs nearly verbatim to https://gist.github.com/galexrt/3626102e96dddcef071060b71d94e280.

Interesting note: disabling and re-enabling the modules as described by @stucki temporarily fixes the problem for ~30 seconds. I get the same output as @NickUfer's ss both during those 30 seconds of success and after the subsequent failure.

sinqinc commented Dec 12, 2018

Same problem here after upgrading to 0.9 and Ceph v13.2.2-20181023:

2018-12-12 03:03:00.673 7f8235749700  1 mgr send_beacon active
[12/Dec/2018:03:03:01] ENGINE Error in 'start' listener <bound method Server.start of <cherrypy._cpserver.Server object at 0x7f822f9d35d0>>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 197, in publish
    output.append(listener(*args, **kwargs))
  File "/usr/lib/python2.7/site-packages/cherrypy/_cpserver.py", line 151, in start
    ServerAdapter.start(self)
  File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 174, in start
    self.wait()
  File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 214, in wait
    wait_for_occupied_port(host, port)
  File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 427, in wait_for_occupied_port
    raise IOError("Port %r not bound on %r" % (port, host))
IOError: Port 9283 not bound on '::'

[12/Dec/2018:03:03:01] ENGINE Shutting down due to error in start listener:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 235, in start
    self.publish('start')
  File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 215, in publish
    raise exc
ChannelFailures: IOError("Port 9283 not bound on '::'",)

[12/Dec/2018:03:03:01] ENGINE Bus STOPPING
[12/Dec/2018:03:03:01] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) already shut down
[12/Dec/2018:03:03:01] ENGINE Stopped thread '_TimeoutMonitor'.
[12/Dec/2018:03:03:01] ENGINE Bus STOPPED
[12/Dec/2018:03:03:01] ENGINE Bus EXITING
[12/Dec/2018:03:03:01] ENGINE Bus EXITED
travisn (Member) commented Dec 12, 2018

OK, I was able to repro this in minikube simply by using the default cluster yaml (which configures Mimic) and then looking at the cluster health in the toolbox. Within the first 30-60 seconds the mgr modules are fine, but then the prometheus and dashboard modules fail and ceph status reports HEALTH_ERR with the two module failures. This does not repro when launching Luminous. It does repro with Nautilus.

@sebastian-philipp Could you take a look at this issue? I'm still trying to narrow down if there was some recent change in rook that could have caused this failure, but your input would be very helpful, thanks.

jbw976 (Member) commented Dec 12, 2018

We saw this also today during the demo at the Rook intro session at Kubecon Seattle :)

sebastian-philipp (Member) commented Dec 13, 2018

Ok, this seems to be closely related to CherryPy.

  1. Which exact CherryPy version is installed in the containers?
  2. Do you see the same issue with a Nautilus MGR?
sinqinc commented Dec 13, 2018

root@rook-ceph-mgr-a-55bb9c6474-9ddpn /]# python -c "import cherrypy;print cherrypy.version"
3.2.2

sebastian-philipp (Member) commented Dec 13, 2018

> root@rook-ceph-mgr-a-55bb9c6474-9ddpn /]# python -c "import cherrypy;print cherrypy.version"
> 3.2.2

This version is kind of old. Do you get the same error with 3.5.0 or newer?

sebastian-philipp (Member) commented Dec 13, 2018

As this is a CentOS based image, @epuertat have you seen this in your RH downstream testing?

sebastian-philipp (Member) commented Dec 13, 2018

And finally, ceph/ceph#24734 was merged into 13.2.3. Do you see the same behavior with 13.2.3?

pennpeng commented Dec 14, 2018

+1

epuertat commented Dec 14, 2018

> As this is a CentOS based image, @epuertat have you seen this in your RH downstream testing?

I've seen that, but my common setup has been, you know, CentOS 7 + a custom Luminous backport of the dashboard.

I just ran a search and found that this was also happening in Luminous 12.2.5's Prometheus module and dashboard too, but it was mostly fixed with ceph/ceph#15588.

In the past I was able to work around this issue by setting the listening IP to a specific local address, instead of the default 0.0.0.0 or ::/128.
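
For the two modules discussed in this issue, that kind of workaround would presumably look like this (the IP below is only an example; substitute the actual local address of the mgr host or pod):

```
# Hypothetical example; 10.0.16.95 stands in for the real local address
ceph config set mgr mgr/dashboard/server_addr  10.0.16.95
ceph config set mgr mgr/prometheus/server_addr 10.0.16.95
```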

With master I can see a similar error with the restful module (I forced that by immediately disabling and enabling the dashboard module):

2018-12-14 10:33:08.550 7fab244ef700  0 mgr[restful] Traceback (most recent call last):
  File "/ceph/src/pybind/mgr/restful/module.py", line 255, in serve
    self._serve()
  File "/ceph/src/pybind/mgr/restful/module.py", line 330, in _serve
    ssl_context=(cert_fname, pkey_fname),
  File "/usr/lib/python2.7/site-packages/werkzeug/serving.py", line 486, in make_server
    passthrough_errors, ssl_context)
  File "/usr/lib/python2.7/site-packages/werkzeug/serving.py", line 410, in __init__
    HTTPServer.__init__(self, (host, int(port)), handler)
  File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
    self.server_bind()
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use
Amos-85 commented Dec 14, 2018

I have the same issue and the same ceph mgr errors as @galexrt published in the gist:
https://gist.github.com/galexrt/3626102e96dddcef071060b71d94e280
+1

liejuntao001 commented Dec 14, 2018

ceph status

  cluster:
    id:     5abf65ad-ed84-4719-ac62-22ef3d9fe836
    health: HEALTH_ERR
            Module 'prometheus' has failed: IOError("Port 9283 not bound on '::'",)

Amos-85 commented Dec 15, 2018

@liejuntao001 As far as I understand from the community, we'll need to wait for Ceph to publish the 13.2.3 Docker image in order to fix the prometheus & dashboard bugs in Mimic.
They are supposed to publish it around Jan 2019.

sebastian-philipp (Member) commented Dec 17, 2018

@epuertat: @travisn was able to reproduce this with Nautilus.

@BlaineEXE BlaineEXE changed the title Ceph MGR 2 modules failed on default install Ceph MGR: 2 modules failed on default install Dec 17, 2018

@jbw976 jbw976 moved this from To do to In progress in v0.9 Dec 18, 2018

sebastian-philipp (Member) commented Dec 18, 2018

@LenzGr fyi.

@travisn travisn self-assigned this Dec 19, 2018

travisn (Member) commented Dec 19, 2018

OK, I believe I have prototyped the fix... The server_addr setting on the mgr modules needs to be set to the pod IP. By default the dashboard and prometheus modules bind to :: (all interfaces) as seen here, which is what causes the issues in these k8s clusters.

The fix can be tested by running the following commands from the toolbox:

# Get the IP of the mgr pod:
kubectl -n rook-ceph get pod -l app=rook-ceph-mgr -o wide

# Set the server_addr for the two modules (replacing the pod ip queried above)
ceph config set mgr.a mgr/prometheus/server_addr <podIP>
ceph config set mgr.a mgr/dashboard/server_addr  <podIP>

# Restart the mgr pod in one of these two ways:
# 1) The easy way is to delete the pod, however depending on your env it may get 
# a new pod ip
kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

# 2) Alternatively, exec into the mgr pod and kill the ceph-mgr process so the same pod 
# will simply restart
kubectl -n rook-ceph exec -it <pod> bash
# ceph-mgr is running as pid 1
kill 1
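
After the mgr restarts, the result can presumably be verified from the toolbox:

```
ceph status          # should return to HEALTH_OK once both modules start cleanly
ceph mgr services    # should list the dashboard and prometheus endpoints bound to the pod IP
```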

Now to automate this when the mgr pod starts up...
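
One way this could be automated (a sketch only, under the assumption that the pod IP is injected into the mgr container via the Kubernetes downward API, e.g. an env var populated from status.podIP; this is not necessarily how Rook ends up implementing it):

```
# Hypothetical mgr startup wrapper:
ceph config set mgr.a mgr/dashboard/server_addr  "${POD_IP}"
ceph config set mgr.a mgr/prometheus/server_addr "${POD_IP}"
exec ceph-mgr -i a --foreground
```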

lastessa commented Dec 19, 2018

After restarting the mgr pod with
kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

it gets a different IP address from the one used in the ceph config set commands.

travisn (Member) commented Dec 19, 2018

@lastessa I updated the instructions to restart the pod without deleting it. Does that work for you?

@travisn travisn moved this from In progress to In Review in v0.9 Dec 20, 2018

v0.9 automation moved this from In Review to Done Dec 21, 2018

@travisn travisn moved this from Done to In Review in v0.9 Dec 21, 2018

@travisn travisn moved this from In Review to Done in v0.9 Dec 21, 2018
