
Failed mon assert when mds standby replay is assigned with multiple file systems #1027

Closed
travisn opened this issue Sep 29, 2017 · 13 comments · Fixed by #1935
Assignees: travisn
Labels: ceph, filesystem
travisn (Member) commented Sep 29, 2017

While testing multiple file systems with varying standby and replay settings, two of the mons core dumped with the following assert:

2017-09-29 21:55:06.978169 I | rook-ceph-mon0:      0> 2017-09-29 21:55:06.961413 7f55aba29700 -1 /build/ceph/src/mds/FSMap.cc: In function 'void FSMap::assign_standby_replay(mds_gid_t, fs_cluster_id_t, mds_rank_t)' thread 7f55aba29700 time 2017-09-29 21:55:06.957486
2017-09-29 21:55:06.978179 I | rook-ceph-mon0: /build/ceph/src/mds/FSMap.cc: 870: FAILED assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE)

It would appear there is an issue with the standby being assigned by the mon after adding a third file system. The configuration of the file systems in the cluster was:

  • myfs: two mds active, two mds on standby-replay
  • yourfs: three mds active, three mds on standby
  • jaredsfs: one mds active, one mds on standby-replay
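For context, a setup like the one above can be reproduced with the plain Ceph CLI roughly as follows. Pool names and PG counts are illustrative, and Rook normally drives these steps from its filesystem CRD rather than having you run them by hand; the exact standby-replay knobs also vary by Ceph release.

```shell
# Multiple file systems must be explicitly enabled (experimental in Luminous):
ceph fs flag set enable_multiple true --yes-i-really-mean-it

# Create metadata and data pools, then the file system (names/PG counts illustrative):
ceph osd pool create myfs-metadata 32
ceph osd pool create myfs-data 32
ceph fs new myfs myfs-metadata myfs-data

# Allow and set multiple active MDS daemons (two active for myfs).
# On Luminous the allow_multimds flag must be set before raising max_mds:
ceph fs set myfs allow_multimds true
ceph fs set myfs max_mds 2

# Standby-replay on Luminous is configured per-daemon via mds_standby_replay
# in the MDS config, not via a fs-level setting.
```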

After the first two were created, ceph status showed the following mds status:

    mds: myfs-2/2/2 up yourfs-3/3/3 up  {[myfs:0]=msdfdx=up:active,[myfs:1]=m88104=up:active,[yourfs:0]=m739m0=up:active,[yourfs:1]=mdv8k2=up:active,[yourfs:2]=m6ktsw=up:active}, 2 up:standby-replay, 3 up:standby

The pod status after the crash was:

NAME                                      READY     STATUS             RESTARTS   AGE
rook-api-1435667874-qfj5g                 1/1       Running            0          1h
rook-ceph-mds-jaredsfs-1477892749-1jntq   1/1       Running            0          38m
rook-ceph-mds-jaredsfs-1477892749-qdtvt   1/1       Running            0          38m
rook-ceph-mds-myfs-3914257510-88104       1/1       Running            0          1h
rook-ceph-mds-myfs-3914257510-bpw4g       1/1       Running            0          1h
rook-ceph-mds-myfs-3914257510-kndbh       1/1       Running            0          1h
rook-ceph-mds-myfs-3914257510-sdfdx       1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-1xtsg     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-6ktsw     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-739m0     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-dv8k2     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-vgctz     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-xfbl2     1/1       Running            0          1h
rook-ceph-mgr0-2792268363-gq4gp           1/1       Running            0          1h
rook-ceph-mon0-rhbzj                      0/1       CrashLoopBackOff   12         1h
rook-ceph-mon1-lp7j0                      1/1       Running            0          1h
rook-ceph-mon2-q3nx9                      0/1       CrashLoopBackOff   12         1h
rook-ceph-osd-zvmpf                       1/1       Running            0          1h
rook-tools                                1/1       Running            0          1h
travisn self-assigned this Sep 29, 2017
travisn (Member, Author) commented Feb 1, 2018

Note that multiple file systems are still considered an experimental feature in Ceph. Perhaps Rook makes it too easy to go down the path of using an experimental feature.

dimm0 (Contributor) commented Feb 17, 2018

Any way to fix this?

galexrt added the ceph label Mar 7, 2018
galexrt (Member) commented Mar 7, 2018

This seems like a Ceph issue with the FS Map when MDSs are running in standby-replay mode.

abh commented Mar 7, 2018

I opened #1382; I had created a second file system not long before.

I'd second the suggestion of limiting the system to one file system until this is fixed. The documentation and tools make it seem like creating any number of them is fine.

galexrt (Member) commented Mar 17, 2018

@travisn We need to limit clusters to a single file system, as more and more users are currently hitting this.
Additionally, the docs should link to the path parameter for volumes (see https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/kube-registry.yaml#L49-L50). This would tell users that they can use one file system for multiple applications at the same time, but it should also mention the pitfall that paths need to be created manually, since this isn't currently done automatically by Ceph or Rook.
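As a sketch of that manual step: the sub-path a volume's path parameter points at has to exist in the file system before it can be used. The mount point, monitor address, file system name, and path below are all illustrative, and the exact mount invocation depends on your environment (it could be run from the rook-tools pod or any node with cluster access).

```shell
# Mount the CephFS root with the kernel client (admin credentials assumed),
# then create the sub-path that the volume's "path" parameter will reference.
mkdir -p /mnt/myfs
mount -t ceph <mon-ip>:6789:/ /mnt/myfs -o name=admin,secret=<admin-key>

mkdir -p /mnt/myfs/some/app/path   # not created automatically by Ceph or Rook

umount /mnt/myfs
```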

travisn added this to the 0.8 milestone Apr 10, 2018
travisn changed the title from "Failed mon assert when mds standby replay is assigned" to "Failed mon assert when mds standby replay is assigned with multiple file systems" Apr 11, 2018
batrick (Contributor) commented Apr 11, 2018

http://tracker.ceph.com/issues/23658

Coolfeather2 commented
Any way to recover?

galexrt (Member) commented Jun 3, 2018

@Coolfeather2 I think someone on Slack said that you can potentially remove the second file system by modifying the fsmap directly, but I can't tell you the exact steps.
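For anyone needing to recover, removing an extra file system with the standard CLI looks roughly like this. The file system and pool names are illustrative, the exact sequence varies by Ceph release, and this permanently removes the file system entry, so be certain you have the right one.

```shell
# Stop the MDS for each rank of the file system (rank 0 shown),
# then remove the file system entry from the FSMap:
ceph mds fail jaredsfs:0
ceph fs rm jaredsfs --yes-i-really-mean-it

# The backing pools are left in place; delete them separately if desired
# (requires mon_allow_pool_delete to be enabled):
ceph osd pool delete jaredsfs-metadata jaredsfs-metadata --yes-i-really-really-mean-it
```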

jbw976 (Member) commented Jul 20, 2018

@travisn, it looks like Luminous 12.2.7 is out now. Does it include this fix? http://docs.ceph.com/docs/master/releases/luminous/#v12-2-7-luminous

travisn (Member, Author) commented Jul 20, 2018

Yes, we can pick up this fix now.

travisn (Member, Author) commented Jul 20, 2018

@leseb When should we expect a ceph-container v3.0.7 release that includes Luminous 12.2.7? My previous comment missed that we would need that release first.

leseb (Member) commented Jul 24, 2018

@travisn it's here!

$ docker run -ti --entrypoint=ceph ceph/daemon:v3.0.7-stable-3.0-luminous-centos-7 -v
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)

travisn (Member, Author) commented Jul 24, 2018

@leseb thanks!

travisn mentioned this issue Jul 24, 2018