ceph: Set the location on the mon daemon for stretch clusters #7535

Merged
merged 3 commits into from
Jul 20, 2021

Conversation

Member

@travisn travisn commented Apr 6, 2021

Description of your changes:
The mon daemon in a stretch cluster can now have its location set as a CLI param instead of setting it with a separate command. This lets mon failover set the location of a mon immediately when it joins quorum, rather than relying on a delayed command to set the location.

The mon is not yet joining quorum with these changes; still testing.
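As a rough illustration of the change described above, here is a minimal standalone sketch of how the location argument could be built. The `--set-crush-location` flag and the `<failure domain>=<zone>` format come from this PR; the helper name `monLocationArgs`, the base arguments, and the zone value are assumptions made only for the example.

```go
package main

import "fmt"

// monLocationArgs is a hypothetical helper (not part of Rook). It returns the
// extra mon arguments for a stretch cluster, or nil when the mon has no zone.
func monLocationArgs(failureDomain, zone string) []string {
	if zone == "" {
		// Not a stretch cluster mon: nothing to add.
		return nil
	}
	location := fmt.Sprintf("%s=%s", failureDomain, zone)
	return []string{"--set-crush-location", location}
}

func main() {
	// Illustrative base arguments only; the real mon container has more.
	args := []string{"--id", "a"}
	args = append(args, monLocationArgs("zone", "us-east-1a")...)
	fmt.Println(args)
}
```

Because the location travels with the daemon's own arguments, a mon started during failover carries it from the moment it starts, with no follow-up command needed.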

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@travisn travisn added the do-not-merge DO NOT MERGE :) label Apr 6, 2021
@mergify mergify bot added the ceph main ceph tag label Apr 6, 2021
Member

@leseb leseb left a comment

until testing works

mergify bot commented Apr 14, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @travisn please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork

@travisn travisn force-pushed the mon-stretch-location branch 2 times, most recently from c88623a to 21b9fe2 Compare April 14, 2021 16:56
Member Author

travisn commented May 12, 2021

These changes are working when testing against the latest ceph master tag: ceph/daemon-base:latest-master-devel
There is still a pending backport to the pacific branch: ceph/ceph#41131

After we get a new Pacific release with that merged, we can get this PR merged.

@travisn travisn force-pushed the mon-stretch-location branch 3 times, most recently from 8eda3eb to 63e0432 Compare June 7, 2021 22:13
Member Author

travisn commented Jun 7, 2021

This is confirmed working with ceph/daemon-base:latest-pacific-devel, so we will be able to merge after ceph/ceph:v16.2.5 is available.

Comment on lines +320 to +328
	if monConfig.Zone != "" {
		desiredLocation := fmt.Sprintf("%s=%s", c.stretchFailureDomainName(), monConfig.Zone)
		container.Args = append(container.Args, []string{"--set-crush-location", desiredLocation}...)
		if monConfig.Zone == c.getArbiterZone() {
			// remember the arbiter mon to be set later in the reconcile after the OSDs are configured
			c.arbiterMon = monConfig.DaemonName
		}
	}

Member

Do we need to do a check and enforce that the user is using v16.2.5 or higher here? That might be good.

Member Author

Actually I'll add a note in the pending release notes that stretch requires 16.2.5 or higher. I can't add a code check because it will block downstream from running on a custom build of nautilus for stretch clusters.

Member

@BlaineEXE BlaineEXE Jul 16, 2021

We could do the check but have an env var option that would disable the check for downstream. I think providing clear feedback for upstream users who try to use the wrong version is a good experience for them.
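One way this suggestion could look, as a rough standalone sketch rather than actual Rook code: `CephVersion` below is a simplified local stand-in for `cephver.CephVersion`, and the environment variable name `ROOK_SKIP_STRETCH_VERSION_CHECK` and the function `validateStretchVersion` are hypothetical names chosen only for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// CephVersion is a simplified local stand-in for Rook's cephver.CephVersion.
type CephVersion struct {
	Major, Minor, Build int
}

func (v CephVersion) String() string {
	return fmt.Sprintf("v%d.%d.%d", v.Major, v.Minor, v.Build)
}

// IsAtLeast reports whether v is the same as or newer than other.
func (v CephVersion) IsAtLeast(other CephVersion) bool {
	if v.Major != other.Major {
		return v.Major > other.Major
	}
	if v.Minor != other.Minor {
		return v.Minor > other.Minor
	}
	return v.Build >= other.Build
}

// skipCheckEnvVar is a hypothetical name; this PR does not define it.
const skipCheckEnvVar = "ROOK_SKIP_STRETCH_VERSION_CHECK"

var minStretchVersion = CephVersion{Major: 16, Minor: 2, Build: 5}

// validateStretchVersion rejects Ceph versions older than the minimum
// unless the check has been explicitly disabled (e.g. for downstream builds).
func validateStretchVersion(running CephVersion, skipCheck bool) error {
	if skipCheck {
		return nil
	}
	if !running.IsAtLeast(minStretchVersion) {
		return fmt.Errorf("stretch clusters require ceph %s or newer, but the cluster is running %s",
			minStretchVersion, running)
	}
	return nil
}

func main() {
	skip := os.Getenv(skipCheckEnvVar) == "true"
	versions := []CephVersion{
		{Major: 16, Minor: 2, Build: 4},
		{Major: 16, Minor: 2, Build: 5},
	}
	for _, v := range versions {
		if err := validateStretchVersion(v, skip); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", v)
	}
}
```

Reading the environment variable only in `main` keeps the check itself a pure function, so upstream and downstream behavior can each be unit tested without touching the process environment.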

Member Author

Stretch clusters are a much less common scenario where I haven't seen upstream feedback, so I think we'll be fine upstream without a check.

Member

I think it's acceptable to have different checks for upstream and downstream. I agree stretch mode is a downstream focus, but I don't want downstream to drive upstream code, especially if the check brings clarity to users. Also, it's now considered stable. We already do this for rbd and cephfs mirror, which caused some small issues downstream; we just need to track this so that during the resync we change the version for downstream Nautilus.

Member Author

Agreed downstream shouldn't drive the upstream since upstream is first. I'll go ahead and add the check, then we can remove it downstream as needed.

Member Author

Added as a new commit

The mon daemon in a stretch cluster now can have its location set
as a CLI param instead of setting it with a separate command.
This enables mon failover to set the location of a mon immediately
when it is joining quorum instead of having a delayed command
to set the location.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>

Show an example of configuring a stretch cluster in AWS.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
Member

@leseb leseb left a comment

nit

@@ -240,6 +240,12 @@ func (c *ClusterController) configureLocalCephCluster(cluster *cluster) error {
		return errors.Wrap(err, "failed the ceph version check")
	}

	if cluster.Spec.IsStretchCluster() {
		if !cephVersion.IsAtLeast(cephver.CephVersion{Major: 16, Minor: 2, Build: 5}) {
			return fmt.Errorf("stretch clusters minimum ceph version is v16.2.5, but is running %s", cephVersion.String())
Member

use errors.Errorf()

The stretch clusters pass a new parameter to the mon daemons which is
only available in v16.2.5 and newer. Older versions of Ceph will fail
to run with stretch clusters in rook.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
@travisn travisn merged commit 1422760 into rook:master Jul 20, 2021
@travisn travisn deleted the mon-stretch-location branch September 1, 2021 21:19