
ceph: Allow using lvm batch for ceph 14.2.15 #6831

Merged: 3 commits merged into rook:master from use-14-2-15 on Dec 18, 2020

Conversation

@jshen28 (Contributor) commented Dec 15, 2020

In 14.2.15, the `lvm batch --prepare --report` command changes its output format.
This commit skips the md check if the Ceph version is greater than 14.2.13.

Signed-off-by: shenjiatong yshxxsjt715@gmail.com

Description of your changes:

In 14.2.15, `lvm batch --prepare --report` changes the report format, which causes OSD creation to fail.

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

[test full]

@jshen28 jshen28 changed the title Use ceph 14.2.15 Allow using lvm batch for ceph 14.2.15 Dec 15, 2020
@jshen28 jshen28 force-pushed the use-14-2-15 branch 2 times, most recently from fc44b3c to dcaf3c7 Compare December 15, 2020 06:42
@travisn (Member) commented Dec 15, 2020

@jshen28 Please also enable the latest Nautilus in the tests so we can test with v14.2.15; you could cherry-pick the fix from #6827.

@jshen28 (Contributor, author) commented Dec 16, 2020

From the test results, it seems the CephCluster is created successfully, but creating a snapshot failed.

[2020-12-16T00:47:15.422Z] 2020-12-16 00:47:15.103511 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-mon in namespace helm-ns
[2020-12-16T00:47:20.719Z] 2020-12-16 00:47:20.109582 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-mon in namespace helm-ns
[2020-12-16T00:47:25.999Z] 2020-12-16 00:47:25.114419 I | testutil: found 1 pods with label app=rook-ceph-mon
[2020-12-16T00:47:25.999Z] 2020-12-16 00:47:25.116550 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-mgr in namespace helm-ns
[2020-12-16T00:47:30.196Z] 2020-12-16 00:47:30.120203 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-mgr in namespace helm-ns
[2020-12-16T00:47:35.497Z] 2020-12-16 00:47:35.171223 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-mgr in namespace helm-ns
[2020-12-16T00:47:40.783Z] 2020-12-16 00:47:40.191825 I | testutil: found 1 pods with label app=rook-ceph-mgr
[2020-12-16T00:47:40.783Z] 2020-12-16 00:47:40.202970 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:47:46.080Z] 2020-12-16 00:47:45.206222 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:47:50.295Z] 2020-12-16 00:47:50.230342 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:47:55.574Z] 2020-12-16 00:47:55.250600 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:00.860Z] 2020-12-16 00:48:00.255400 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:06.139Z] 2020-12-16 00:48:05.280416 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:10.338Z] 2020-12-16 00:48:10.286407 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:15.619Z] 2020-12-16 00:48:15.289826 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:20.944Z] 2020-12-16 00:48:20.296872 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:26.239Z] 2020-12-16 00:48:25.301663 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:30.443Z] 2020-12-16 00:48:30.308096 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:35.720Z] 2020-12-16 00:48:35.312448 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:41.030Z] 2020-12-16 00:48:40.316119 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:46.302Z] 2020-12-16 00:48:45.319349 I | testutil: waiting for 1 pods (found 0) with label app=rook-ceph-osd in namespace helm-ns
[2020-12-16T00:48:50.515Z] 2020-12-16 00:48:50.329658 I | testutil: found 1 pods with label app=rook-ceph-osd
[2020-12-16T00:48:50.516Z] 2020-12-16 00:48:50.346121 I | testutil: found CronJob with name rook-ceph-crashcollector-pruner in namespace helm-ns
[2020-12-16T00:48:50.516Z] 2020-12-16 00:48:50.346177 I | installer: Rook Cluster started
[2020-12-16T00:58:48.334Z] --- FAIL: TestCephHelmSuite (722.71s)
[2020-12-16T00:58:48.334Z]     --- PASS: TestCephHelmSuite/TestARookInstallViaHelm (0.10s)
[2020-12-16T00:58:48.334Z]     --- FAIL: TestCephHelmSuite/TestBlockStoreOnRookInstalledViaHelm (260.86s)
[2020-12-16T00:58:48.334Z]         ceph_base_block_test.go:194: 
[2020-12-16T00:58:48.334Z]             	Error Trace:	ceph_base_block_test.go:194
[2020-12-16T00:58:48.334Z]             	            				ceph_base_block_test.go:433
[2020-12-16T00:58:48.334Z]             	            				ceph_helm_test.go:99
[2020-12-16T00:58:48.334Z]             	Error:      	Received unexpected error:
[2020-12-16T00:58:48.334Z]             	            	giving up waiting for "rbd-pvc-snapshot" snapshot in namespace "default"
[2020-12-16T00:58:48.334Z]             	Test:       	TestCephHelmSuite/TestBlockStoreOnRookInstalledViaHelm
[2020-12-16T00:58:48.334Z]         ceph_base_block_test.go:473: 

@jshen28 jshen28 force-pushed the use-14-2-15 branch 5 times, most recently from e78fc2c to aa1343c Compare December 16, 2020 12:07
@travisn (Member) commented Dec 16, 2020

The snapshot tests are already failing in master; I opened #6837 for this, so it should be unrelated to your changes.

@BlaineEXE BlaineEXE requested review from travisn, leseb and satoru-takeuchi and removed request for leseb December 16, 2020 18:36
@BlaineEXE BlaineEXE added bug ceph main ceph tag labels Dec 16, 2020

logger.Debugf("ceph-volume report: %+v", cvOut)
if !cephVersion.IsNautilus() || cephver.IsInferior(cephVersion, cephver.CephVersion{Major: 14, Minor: 2, Extra: 13}) {
Member:
Could you add a comment to this line explaining why it's needed for when we come back to it in 6-12 months?

Member:
To clarify, we want to run the report for the following?

  • Octopus or newer
  • Nautilus if less than 14.2.13

Is the report being fixed in Nautilus? Seems like we should expect a fix in 14.2.16 instead of blocking the report in all future Nautilus releases

Member:
Instead of checking for the version and skipping it if a broken version, what if we always run the report, but just log and ignore any errors that it produces? Seems like a good pattern so we can always attempt to create the OSDs even if the report fails.
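
As a rough Go sketch of that "log and ignore" pattern, using only the names visible in this diff (context.Executor, baseCommand, reportArgs, logger.Debugf); logger.Warningf is assumed to exist alongside the Debugf call shown above, and this is illustrative rather than the PR's final code:

// Always attempt the report, but only warn on failure so OSD creation can still proceed.
cvOut, err := context.Executor.ExecuteCommandWithCombinedOutput(baseCommand, reportArgs...)
if err != nil {
	logger.Warningf("ignoring failed ceph-volume report. %v", err)
} else {
	logger.Debugf("ceph-volume report: %+v", cvOut)
}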

Contributor (author):
I think it's possible, but maybe a new type should be introduced to reflect the new return structure...

Member:
What new type are you referring to?

@jshen28 (Contributor, author) Dec 17, 2020:

Before 14.2.13, it looks like

{"changed": true, "vg": {"devices": "xxx"}}

For 14.2.15, it looks like

2020-12-14 02:15:28.771476 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 --crush-device-class sjt /dev/vdc --db-devices /dev/vdd --report --format json
2020-12-14 02:15:37.640274 D | cephosd: ceph-volume report: --> passed data devices: 1 physical, 0 LVM
--> relative data size: 1.0
--> passed block_db devices: 1 physical, 0 LVM
[{"block_db": "/dev/vdd", "encryption": "None", "data": "/dev/vdc", "data_size": "100.00 GB", "block_db_size": "100.00 GB"}]
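
For illustration, a rough Go sketch of how the two report shapes could be unmarshalled; the struct and field names below simply mirror the JSON keys shown above and are not the PR's actual types:

package main

import (
	"encoding/json"
	"fmt"
)

// legacyReport models the pre-14.2.13 shape: a single JSON object keyed by volume group.
type legacyReport struct {
	Changed bool `json:"changed"`
	VG      struct {
		Devices string `json:"devices"`
	} `json:"vg"`
}

// newReportEntry models one element of the 14.2.15 shape: a flat JSON array
// with one entry per OSD that ceph-volume would create.
type newReportEntry struct {
	BlockDB     string `json:"block_db"`
	Encryption  string `json:"encryption"`
	Data        string `json:"data"`
	DataSize    string `json:"data_size"`
	BlockDBSize string `json:"block_db_size"`
}

func main() {
	oldOut := `{"changed": true, "vg": {"devices": "xxx"}}`
	newOut := `[{"block_db": "/dev/vdd", "encryption": "None", "data": "/dev/vdc", "data_size": "100.00 GB", "block_db_size": "100.00 GB"}]`

	var legacy legacyReport
	if err := json.Unmarshal([]byte(oldOut), &legacy); err != nil {
		panic(err)
	}

	var entries []newReportEntry
	if err := json.Unmarshal([]byte(newOut), &entries); err != nil {
		panic(err)
	}

	fmt.Println(legacy.VG.Devices) // "xxx"
	fmt.Println(entries[0].Data)   // "/dev/vdc"
}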

Member:

Octopus v15.2.8 was just released and the release notes indicated it also received an update to ceph-volume batch mode, so I assume we also need the same change on v15.2.8 and newer.

@@ -39,6 +39,12 @@ const (
// CheckSnapshotISReadyToUse checks snapshot is ready to use
func (k8sh *K8sHelper) CheckSnapshotISReadyToUse(name, namespace string, retries int) (bool, error) {
for i := 0; i < retries; i++ {

output, err := k8sh.executor.ExecuteCommandWithOutput("kubectl", "get", "volumesnapshot", name, "--namespace", namespace)
Member:
Since #6840 is merged, you can revert this as the snapshot tests are fixed now


@travisn travisn added this to In review in v1.5 via automation Dec 16, 2020
@travisn travisn changed the title Allow using lvm batch for ceph 14.2.15 ceph: Allow using lvm batch for ceph 14.2.15 Dec 16, 2020
@jshen28 (Contributor, author) commented Dec 17, 2020

#6824

@jshen28 jshen28 force-pushed the use-14-2-15 branch 8 times, most recently from e0f5ff1 to 1ea2c77 Compare December 17, 2020 04:24
@jshen28 (Contributor, author) commented Dec 17, 2020

2020-12-17T04:48:27.4147048Z 2020-12-17 04:48:27.409037 I | cephosd: immediateExecuteArgs - [-oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 /dev/sda1]
2020-12-17T04:48:27.4149021Z 2020-12-17 04:48:27.409053 I | cephosd: immediateReportArgs - [-oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 /dev/sda1 --report]
2020-12-17T04:48:27.4150848Z 2020-12-17 04:48:27.409170 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm batch --prepare --bluestore --yes --osds-per-device 1 /dev/sda1 --report
2020-12-17T04:48:27.9506610Z 2020-12-17 04:48:27.949320 D | exec: usage: ceph-volume lvm batch [-h] [--db-devices [DB_DEVICES [DB_DEVICES ...]]]
2020-12-17T04:48:27.9547242Z 2020-12-17 04:48:27.949413 D | exec:                              [--wal-devices [WAL_DEVICES [WAL_DEVICES ...]]]
2020-12-17T04:48:27.9578911Z 2020-12-17 04:48:27.949429 D | exec:                              [--journal-devices [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
2020-12-17T04:48:27.9579990Z 2020-12-17 04:48:27.949441 D | exec:                              [--auto] [--no-auto] [--bluestore] [--filestore]
2020-12-17T04:48:27.9580863Z 2020-12-17 04:48:27.949452 D | exec:                              [--report] [--yes]
2020-12-17T04:48:27.9581813Z 2020-12-17 04:48:27.949462 D | exec:                              [--format {json,json-pretty,pretty}] [--dmcrypt]
2020-12-17T04:48:27.9582892Z 2020-12-17 04:48:27.950322 D | exec:                              [--crush-device-class CRUSH_DEVICE_CLASS]
2020-12-17T04:48:27.9583749Z 2020-12-17 04:48:27.950362 D | exec:                              [--no-systemd]
2020-12-17T04:48:27.9584605Z 2020-12-17 04:48:27.950405 D | exec:                              [--osds-per-device OSDS_PER_DEVICE]
2020-12-17T04:48:27.9585841Z 2020-12-17 04:48:27.950424 D | exec:                              [--data-slots DATA_SLOTS]
2020-12-17T04:48:27.9586708Z 2020-12-17 04:48:27.950435 D | exec:                              [--block-db-size BLOCK_DB_SIZE]
2020-12-17T04:48:27.9587588Z 2020-12-17 04:48:27.950445 D | exec:                              [--block-db-slots BLOCK_DB_SLOTS]
2020-12-17T04:48:27.9588470Z 2020-12-17 04:48:27.950855 D | exec:                              [--block-wal-size BLOCK_WAL_SIZE]
2020-12-17T04:48:27.9589378Z 2020-12-17 04:48:27.950891 D | exec:                              [--block-wal-slots BLOCK_WAL_SLOTS]
2020-12-17T04:48:27.9590245Z 2020-12-17 04:48:27.950917 D | exec:                              [--journal-size JOURNAL_SIZE]
2020-12-17T04:48:27.9591185Z 2020-12-17 04:48:27.950931 D | exec:                              [--journal-slots JOURNAL_SLOTS] [--prepare]
2020-12-17T04:48:27.9592067Z 2020-12-17 04:48:27.950956 D | exec:                              [--osd-ids [OSD_IDS [OSD_IDS ...]]]
2020-12-17T04:48:27.9592861Z 2020-12-17 04:48:27.950995 D | exec:                              [DEVICES [DEVICES ...]]
2020-12-17T04:48:27.9593900Z 2020-12-17 04:48:27.951006 D | exec: ceph-volume lvm batch: error: /dev/sda1 is a partition, please pass LVs or raw block devices
2020-12-17T04:48:27.9924305Z failed to configure devices: failed to initialize devices: failed ceph-volume report: exit status 2

Looks like ceph-volume will reject creating an OSD on a partition...

@jshen28 jshen28 force-pushed the use-14-2-15 branch 4 times, most recently from a76d2b1 to 6969f42 Compare December 18, 2020 01:10
@travisn (Member) commented Dec 18, 2020

Looks like ceph-volume will reject creating an OSD on a partition...

Right, it looks like there is an issue with partitions; I opened #6849.

The latest changes are looking good. I am about to merge #6847, which will take care of the tests failing on partitions. After that, you can rebase and won't need the commit you cherry-picked from me.

@jshen28 jshen28 force-pushed the use-14-2-15 branch 4 times, most recently from fe74444 to 6506cea Compare December 18, 2020 02:39
mergify bot commented Dec 18, 2020

This pull request has merge conflicts that must be resolved before it can be merged. @jshen28 please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork

return true
}

return false
Member:
If Pacific or newer we should return true, right?

@jshen28 (Contributor, author) Dec 18, 2020:
Seems like 16.0.0 still uses the legacy format...

Member:
That's unexpected; the changes should be in master first. Oh well, we can revisit separately.
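
A minimal sketch of the version gate being discussed, based only on this thread: Nautilus from 14.2.13 and Octopus from 15.2.8 use the new report format, while Pacific 16.0.0 apparently still emits the legacy one. Only IsNautilus, IsInferior, and CephVersion appear in the diff above; the IsOctopus helper and the import path are assumptions.

// Sketch only; assumes Rook's cephver package (imported from
// "github.com/rook/rook/pkg/operator/ceph/version") provides IsOctopus in
// addition to the IsNautilus and IsInferior helpers visible in the diff.
func isNewStyledLvmBatch(version cephver.CephVersion) bool {
	// Nautilus adopted the new `lvm batch --prepare --report` output in 14.2.13.
	if version.IsNautilus() && !cephver.IsInferior(version, cephver.CephVersion{Major: 14, Minor: 2, Extra: 13}) {
		return true
	}
	// Octopus adopted it in 15.2.8.
	if version.IsOctopus() && !cephver.IsInferior(version, cephver.CephVersion{Major: 15, Minor: 2, Extra: 8}) {
		return true
	}
	// Per the discussion above, Pacific (16.0.0) still emits the legacy format.
	return false
}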

In 14.2.15, the `lvm batch --prepare --report` command changes its output format.
This commit skips the md check if the Ceph version is greater than 14.2.13.

Signed-off-by: shenjiatong <yshxxsjt715@gmail.com>
Signed-off-by: shenjiatong <yshxxsjt715@gmail.com>
@travisn (Member) left a comment:

A few more small suggestions...

logger.Debugf("ceph-volume report: %+v", cvOut)
// ceph version v14.2.13 ~ v14.2.16 changes output of `lvm batch --prepare --report`
// use previous logic if ceph version does not fall into this range
if !isNewStyledLvmBatch(cephVersion) {
Member:
Minor suggestion, instead of declaring a separate variable:

Suggested change
if !isNewStyledLvmBatch(cephVersion) {
if !isNewStyledLvmBatch(a.clusterInfo.CephVersion) {


logger.Debugf("ceph-volume report: %+v", cvOut)
// ceph version v14.2.13 ~ v14.2.16 changes output of `lvm batch --prepare --report`
Member:
Suggested change
// ceph version v14.2.13 ~ v14.2.16 changes output of `lvm batch --prepare --report`
// ceph version v14.2.13 and v15.2.8 changed the output format of `lvm batch --prepare --report`

if err = json.Unmarshal([]byte(cvOut), &cvReport); err != nil {
return errors.Wrap(err, "failed to unmarshal ceph-volume report json")
}
cvOut, err := context.Executor.ExecuteCommandWithCombinedOutput(baseCommand, reportArgs...)
Member:
This command is the same for both types of reports; the output is just different, right? Can you factor out this command before the version check?

@jshen28 (Contributor, author) Dec 18, 2020:
Sadly, it is not the same... the new format prints something to stderr, which is problematic when using ExecuteCommandWithCombinedOutput, since that seems to collect stdout + stderr.

get_deployment_layout will call debug, which prints some extra stuff to stderr.

Contributor (author):
But actually, it might be OK to change both of them to ExecuteCommandWithOutput...
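
As a rough illustration of that idea (a sketch, not the PR's final code): capture stdout only, so the "-->" progress lines on stderr don't end up mixed into the JSON. The cvReports element type here is a placeholder for the array-style report.

// Same executor call shape as the existing code, but ExecuteCommandWithOutput
// returns stdout only, so stderr noise from get_deployment_layout does not
// corrupt the JSON report.
cvOut, err := context.Executor.ExecuteCommandWithOutput(baseCommand, reportArgs...)
if err != nil {
	return errors.Wrap(err, "failed ceph-volume report")
}
var cvReports []newStyledReportEntry // hypothetical struct for the array-style report
if err := json.Unmarshal([]byte(cvOut), &cvReports); err != nil {
	return errors.Wrap(err, "failed to unmarshal ceph-volume report json")
}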


if len(strings.Split(conf["devices"], " ")) != len(cvReports) {
return fmt.Errorf("failed to create enough required devices, required: %s, actual: %v", cvOut, cvReports)
} else {
Member:
nit: you don't need this else. For example:

if len(strings.Split(conf["devices"], " ")) != len(cvReports) {
  return ...
}
for _, report := range cvReports {
   ...
}

return errors.Wrap(err, "failed to unmarshal ceph-volume report json")
}

if len(strings.Split(conf["devices"], " ")) != len(cvReports) {
Member:
Do we really need to check for the length of reports? Just wondering what issues it will really find in the device configuration, or if it might just be a ceph-volume bug?

@jshen28 (Contributor, author) Dec 18, 2020:
Maybe not. I am just wondering: if the devices are '/dev/sda1' and '/dev/sdb', will ceph-volume ignore /dev/sda1 and use only /dev/sdb?

Also, is it possible that cvReports could be an empty list if ceph-volume rejects all inputs?
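
A minimal guard sketch covering both questions (illustrative only, not necessarily what the PR ends up doing), reusing the names from the diff above:

// Fail fast if ceph-volume returned no OSD entries at all (e.g. it rejected
// every input device), then keep the existing count check, which would also
// catch a silently dropped device such as /dev/sda1.
if len(cvReports) == 0 {
	return fmt.Errorf("ceph-volume did not report any OSDs for devices %q", conf["devices"])
}
if len(strings.Split(conf["devices"], " ")) != len(cvReports) {
	return fmt.Errorf("failed to create enough required devices, required: %s, actual: %v", cvOut, cvReports)
}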

@jshen28 jshen28 force-pushed the use-14-2-15 branch 2 times, most recently from 56b23bf to 47d354c Compare December 18, 2020 09:31
Signed-off-by: shenjiatong <yshxxsjt715@gmail.com>
@travisn (Member) left a comment:

LGTM!

@travisn travisn merged commit f13d1f1 into rook:master Dec 18, 2020
v1.5 automation moved this from In review to Done Dec 18, 2020
@jshen28 jshen28 deleted the use-14-2-15 branch December 21, 2020 00:19
mergify bot added a commit that referenced this pull request Jan 4, 2021
ceph: Allow using lvm batch for ceph 14.2.15 (bp #6831)