
Failed mon assert when mds standby replay is assigned with multiple file systems #1027

Closed
travisn opened this issue Sep 29, 2017 · 13 comments · Fixed by #1935
Assignees: travisn
Labels: ceph, filesystem
travisn (Member) commented Sep 29, 2017

While testing multiple file systems with varying standby and replay settings, two of the mons core dumped with the following assert:

2017-09-29 21:55:06.978169 I | rook-ceph-mon0:      0> 2017-09-29 21:55:06.961413 7f55aba29700 -1 /build/ceph/src/mds/FSMap.cc: In function 'void FSMap::assign_standby_replay(mds_gid_t, fs_cluster_id_t, mds_rank_t)' thread 7f55aba29700 time 2017-09-29 21:55:06.957486
2017-09-29 21:55:06.978179 I | rook-ceph-mon0: /build/ceph/src/mds/FSMap.cc: 870: FAILED assert(mds_roles.at(standby_gid) == FS_CLUSTER_ID_NONE)

It would appear there is an issue with the standby being assigned by the mon after adding a third file system. The configuration of the file systems in the cluster was:

  • myfs: two mds active, two mds on standby-replay
  • yourfs: three mds active, three mds on standby
  • jaredsfs: one mds active, one mds on standby-replay
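For context, a setup like the one above can be reproduced with the plain Ceph CLI roughly as follows. Pool names and PG counts are illustrative, and Rook normally drives these steps from its filesystem CRD rather than having you run them by hand; the exact standby-replay knobs also vary by Ceph release.

```shell
# Multiple file systems must be explicitly enabled (experimental in Luminous):
ceph fs flag set enable_multiple true --yes-i-really-mean-it

# Create metadata and data pools, then the file system (names/PG counts illustrative):
ceph osd pool create myfs-metadata 32
ceph osd pool create myfs-data 32
ceph fs new myfs myfs-metadata myfs-data

# Allow and set multiple active MDS daemons (two active for myfs).
# On Luminous the allow_multimds flag must be set before raising max_mds:
ceph fs set myfs allow_multimds true
ceph fs set myfs max_mds 2

# Standby-replay on Luminous is configured per-daemon via mds_standby_replay
# in the MDS config, not via a fs-level setting.
```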

After the first two were created, ceph status showed the following mds status:

    mds: myfs-2/2/2 up yourfs-3/3/3 up  {[myfs:0]=msdfdx=up:active,[myfs:1]=m88104=up:active,[yourfs:0]=m739m0=up:active,[yourfs:1]=mdv8k2=up:active,[yourfs:2]=m6ktsw=up:active}, 2 up:standby-replay, 3 up:standby

The pod status after the crash was:

NAME                                      READY     STATUS             RESTARTS   AGE
rook-api-1435667874-qfj5g                 1/1       Running            0          1h
rook-ceph-mds-jaredsfs-1477892749-1jntq   1/1       Running            0          38m
rook-ceph-mds-jaredsfs-1477892749-qdtvt   1/1       Running            0          38m
rook-ceph-mds-myfs-3914257510-88104       1/1       Running            0          1h
rook-ceph-mds-myfs-3914257510-bpw4g       1/1       Running            0          1h
rook-ceph-mds-myfs-3914257510-kndbh       1/1       Running            0          1h
rook-ceph-mds-myfs-3914257510-sdfdx       1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-1xtsg     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-6ktsw     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-739m0     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-dv8k2     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-vgctz     1/1       Running            0          1h
rook-ceph-mds-yourfs-1824654297-xfbl2     1/1       Running            0          1h
rook-ceph-mgr0-2792268363-gq4gp           1/1       Running            0          1h
rook-ceph-mon0-rhbzj                      0/1       CrashLoopBackOff   12         1h
rook-ceph-mon1-lp7j0                      1/1       Running            0          1h
rook-ceph-mon2-q3nx9                      0/1       CrashLoopBackOff   12         1h
rook-ceph-osd-zvmpf                       1/1       Running            0          1h
rook-tools                                1/1       Running            0          1h
travisn self-assigned this Sep 29, 2017
travisn (Member, Author) commented Feb 1, 2018

Note that multiple file systems are still considered an experimental feature in Ceph. Perhaps Rook makes it too easy to go down the path of using an experimental feature.

dimm0 (Contributor) commented Feb 17, 2018

Any way to fix this?

galexrt added the ceph label Mar 7, 2018
galexrt (Member) commented Mar 7, 2018

This seems like a Ceph issue with the FS Map when MDSs are running in standby-replay mode.

abh commented Mar 7, 2018

I opened #1382; I had created a second file system not long before.

I'd second the suggestion of limiting the system to one file system until this is fixed. The documentation and tools make it seem like creating any number of them is fine.

galexrt (Member) commented Mar 17, 2018

@travisn We need to limit clusters to a single file system, as more and more users are currently hitting this.
Additionally, the docs should link to the path parameter for volumes (see https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/kube-registry.yaml#L49-L50). This would tell users that they can use one file system for multiple applications at the same time, but it should also mention the pitfall that paths need to be created manually, since this isn't currently done automatically by Ceph or Rook.
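As a sketch of that manual step: the sub-path a volume's path parameter points at has to exist in the file system before it can be used. The mount point, monitor address, file system name, and path below are all illustrative, and the exact mount invocation depends on your environment (it could be run from the rook-tools pod or any node with cluster access).

```shell
# Mount the CephFS root with the kernel client (admin credentials assumed),
# then create the sub-path that the volume's "path" parameter will reference.
mkdir -p /mnt/myfs
mount -t ceph <mon-ip>:6789:/ /mnt/myfs -o name=admin,secret=<admin-key>

mkdir -p /mnt/myfs/some/app/path   # not created automatically by Ceph or Rook

umount /mnt/myfs
```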

travisn added this to the 0.8 milestone Apr 10, 2018
travisn changed the title from "Failed mon assert when mds standby replay is assigned" to "Failed mon assert when mds standby replay is assigned with multiple file systems" Apr 11, 2018
batrick (Contributor) commented Apr 11, 2018

http://tracker.ceph.com/issues/23658

Coolfeather2 commented
Any way to recover?

galexrt (Member) commented Jun 3, 2018

@Coolfeather2 I think someone on Slack said that you can potentially remove the second file system by modifying the fsmap directly, but I can't tell you the exact steps.
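For anyone needing to recover, removing an extra file system with the standard CLI looks roughly like this. The file system and pool names are illustrative, the exact sequence varies by Ceph release, and this permanently removes the file system entry, so be certain you have the right one.

```shell
# Stop the MDS for each rank of the file system (rank 0 shown),
# then remove the file system entry from the FSMap:
ceph mds fail jaredsfs:0
ceph fs rm jaredsfs --yes-i-really-mean-it

# The backing pools are left in place; delete them separately if desired
# (requires mon_allow_pool_delete to be enabled):
ceph osd pool delete jaredsfs-metadata jaredsfs-metadata --yes-i-really-really-mean-it
```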

jbw976 (Member) commented Jul 20, 2018

@travisn, it looks like Luminous 12.2.7 is out now. Does it include this fix? http://docs.ceph.com/docs/master/releases/luminous/#v12-2-7-luminous

travisn (Member, Author) commented Jul 20, 2018

Yes, we can pick up this fix now.

travisn (Member, Author) commented Jul 20, 2018

@leseb When should we expect a ceph-container v3.0.7 release that includes Luminous 12.2.7? My previous comment missed that we would need that release first.

leseb (Member) commented Jul 24, 2018

@travisn it's here!

$ docker run -ti --entrypoint=ceph ceph/daemon:v3.0.7-stable-3.0-luminous-centos-7 -v
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)

travisn (Member, Author) commented Jul 24, 2018

@leseb thanks!

travisn mentioned this issue Jul 24, 2018