Operator cannot connect to mons after Ceph upgrade to Nautilus #2973

Closed
travisn opened this issue Apr 12, 2019 · 6 comments · Fixed by #3037

@travisn
Member

travisn commented Apr 12, 2019

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
After upgrading from Mimic to Nautilus, the operator is not able to connect to the mons.

The operator log is full of these messages about msgr1:

2019-04-12 03:28:31.891 7fd687fff700 -1 --2-  >> [v2:10.107.197.153:3300/0,v1:10.107.197.153:6789/0] conn(0x7fd67400e6f0 0x7fd674010af0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer [v2:10.107.197.153:3300/0,v1:10.107.197.153:6789/0] is using msgr V1 protocol

Here is the connection found in the ceph.conf in the operator pod:

mon host = [v2:10.107.197.153:3300,v1:10.107.197.153:6789],[v2:10.103.185.75:3300,v1:10.103.185.75:6789],[v2:10.109.181.220:3300,v1:10.109.181.220:6789]

The mon service endpoints are updated properly to have ports for both msgr1 and 2. For example:

$ kubectl get svc
rook-ceph-mon-a           ClusterIP   10.109.181.220   <none>        6789/TCP,3300/TCP   2m51s

Ceph commands from the toolbox are working fine, but they are still based on the msgr1 endpoints:

mon_host = 10.109.181.220:6789,10.107.197.153:6789,10.103.185.75:6789
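To make the difference between the two `mon host` forms concrete, here is a minimal Python sketch that parses the bracketed v2/v1 value from the operator's ceph.conf and renders the v1-only form the toolbox uses. The helper names (`parse_mon_host`, `to_v1_mon_host`) are made up for illustration and are not part of Rook or Ceph; the parsing assumes IPv4 addresses like the ones in this report.

```python
import re

# Matches one bracketed entry such as "[v2:10.0.0.1:3300,v1:10.0.0.1:6789]".
# Assumption: IPv4 only, so there are no nested brackets inside an entry.
_ENTRY = re.compile(r"\[([^\]]+)\]")


def parse_mon_host(value):
    """Return one dict per mon, e.g. {"v2": "ip:3300", "v1": "ip:6789"}."""
    mons = []
    for entry in _ENTRY.findall(value):
        addrs = {}
        for part in entry.split(","):
            # Split "v2:10.0.0.1:3300" into the protocol tag and the address.
            proto, _, addr = part.strip().partition(":")
            addrs[proto] = addr
        mons.append(addrs)
    return mons


def to_v1_mon_host(value):
    """Render the v1-only, comma-separated form of a bracketed value."""
    return ",".join(m["v1"] for m in parse_mon_host(value) if "v1" in m)


# The value reported from the operator pod in this issue.
operator_value = (
    "[v2:10.107.197.153:3300,v1:10.107.197.153:6789],"
    "[v2:10.103.185.75:3300,v1:10.103.185.75:6789],"
    "[v2:10.109.181.220:3300,v1:10.109.181.220:6789]"
)
print(to_v1_mon_host(operator_value))
```

The result lists the same three mons as the toolbox config, only on port 6789 and without the protocol tags (the order of the mons may differ).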

@leseb @liewegas Anything stand out for why msgr2 wouldn't be working?

Expected behavior:
The operator should be able to connect to the mons after upgrade to Nautilus.

How to reproduce it (minimal and precise):

  • Create the cluster based on ceph/ceph:v13
  • Update the cephcluster CR with the cephVersion.image: ceph/ceph:v14
  • The operator log shows that the mons are updated, but the operator cannot connect to them to confirm they are in quorum.
@travisn travisn added bug ceph main ceph tag labels Apr 12, 2019
@travisn travisn added this to the 1.0 milestone Apr 12, 2019
@liewegas
Member

"6789/TCP,3300/TCP" ... does 6789 forward to the container's 6789, and 3300 forward to the container's 3300? That error means that there is a connection to 3300 that is speaking the v1 (6789) protocol.

What does ceph mon dump report?
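As a rough aid for spotting this condition, the following Python sketch scans operator log lines for the "is using msgr V1 protocol" message and collects the v2 endpoints that triggered it. The function name `find_v1_on_v2_peers` is hypothetical, and the regular expression is fitted only to the single log line quoted in this issue, so other Ceph versions may format the message differently.

```python
import re

# Fitted to the log line in this report:
#   ... peer [v2:IP:3300/0,v1:IP:6789/0] is using msgr V1 protocol
_PEER = re.compile(
    r"peer \[v2:(?P<v2>[^/,\]]+)/\d+,v1:(?P<v1>[^/,\]]+)/\d+\] "
    r"is using msgr V1 protocol"
)


def find_v1_on_v2_peers(log_lines):
    """Return the sorted v2 endpoints whose peer answered with the v1 protocol."""
    peers = set()
    for line in log_lines:
        match = _PEER.search(line)
        if match:
            peers.add(match.group("v2"))
    return sorted(peers)


# The log line from the operator pod, split only for readability.
sample = (
    "2019-04-12 03:28:31.891 7fd687fff700 -1 --2-  >> "
    "[v2:10.107.197.153:3300/0,v1:10.107.197.153:6789/0] "
    "conn(0x7fd67400e6f0 0x7fd674010af0 unknown :-1 s=BANNER_CONNECTING "
    "pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer "
    "[v2:10.107.197.153:3300/0,v1:10.107.197.153:6789/0] "
    "is using msgr V1 protocol"
)
print(find_v1_on_v2_peers([sample]))
```

Any endpoint this returns is a candidate for checking the service port mapping and the `ceph mon dump` output.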

@travisn
Member Author

travisn commented Apr 12, 2019

After running this command, the mons immediately started responding:

ceph mon enable-msgr2

How about the following procedure for updating to nautilus?

  • The user updates the ceph version to v14
  • The operator starts the orchestration
  • The operator updates each mon to v14, but continues using msgr1 while the mons are being updated
  • After all the mons are updated and in quorum, the operator runs the above command to enable msgr2
  • The operator regenerates its config to use msgr2
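The ordering proposed above can be sketched in Python as follows. This is only an illustration of the sequence, not Rook's actual code (the operator is written in Go). Every callback (`update_mon`, `all_in_quorum`, `run_ceph`, `write_config`) is a hypothetical hook invented for this sketch; the only real command referenced is `ceph mon enable-msgr2` from the comment above.

```python
def upgrade_mons_to_nautilus(mons, update_mon, all_in_quorum, run_ceph, write_config):
    """Sketch of the proposed ordering; every callback is a hypothetical hook."""
    # Update each mon to v14 while clients keep talking msgr1.
    for mon in mons:
        update_mon(mon, "ceph/ceph:v14")

    # Only switch protocols once every updated mon is in quorum.
    if not all_in_quorum(mons):
        raise RuntimeError("mons are not in quorum; leaving msgr2 disabled")
    run_ceph(["mon", "enable-msgr2"])

    # Regenerate the operator config so that it uses msgr2.
    write_config(use_msgr2=True)


# Dry run with stand-in callbacks that only record what would happen.
events = []
upgrade_mons_to_nautilus(
    mons=["a", "b", "c"],
    update_mon=lambda mon, image: events.append(("update", mon, image)),
    all_in_quorum=lambda mons: True,
    run_ceph=lambda args: events.append(("ceph", *args)),
    write_config=lambda use_msgr2: events.append(("config", use_msgr2)),
)
print(events)
```

Keeping the quorum check ahead of `enable-msgr2` means a failed mon update leaves the cluster on msgr1, which the mons keep serving.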

@leseb, can you take a look?

@leseb
Member

leseb commented Apr 15, 2019

@travisn I know what's going on, working on a fix, perhaps as part of #2901.

@dimm0
Contributor

dimm0 commented Apr 17, 2019

While everyone is looking, is there a way to fix this in 0.9 and ceph v13?
I'm getting a bunch of errors about using v1 protocol in toolkit... If I run ceph mon enable-msgr2, will it break anything?

@travisn
Member Author

travisn commented Apr 17, 2019

@dimm0 If you're using 0.9 and ceph v13 everything should be on v1. The enable-msgr2 only applies to ceph v14. Did you launch the toolbox from master perhaps?

@dimm0
Contributor

dimm0 commented Apr 17, 2019

Ah!! That explains it 😁 thanks

Did you launch the toolbox from master perhaps?

Indeed..

leseb added a commit to leseb/rook that referenced this issue Apr 24, 2019
This commit allows the upgrade from Mimic to Nautilus to work by:

* removing the v2 brackets from the operator's generated Ceph config file.
We stick with v1 and can revert this once we deprecate Mimic. There
is no rush, since mons keep listening on v1.

* enabling msgr2 (messenger v2) when the cluster runs on Nautilus

Fixes: rook#2973
Signed-off-by: Sébastien Han <seb@redhat.com>