osd: handle device name change and device removal correctly #11567

Merged
1 commit merged on Feb 27, 2023


satoru-takeuchi
Member

@satoru-takeuchi satoru-takeuchi commented Jan 19, 2023

Description of your changes:

If a kernel device name change happens and a block device file in the OSD directory becomes a dangling symlink, the OSD repeatedly fails to start. This problem can be resolved by checking that the device file is valid and recreating it if necessary.
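
The check-and-recreate idea can be sketched as follows. This is a minimal illustration under stated assumptions, not Rook's actual code: the helper names `is_dangling` and `ensure_link` are hypothetical, and the demo uses regular files in a temporary directory in place of real block devices. According to the review thread on this PR, the real re-creation of the path is done by ceph-volume.

```python
import os
import tempfile


def is_dangling(link_path):
    """Return True if link_path is a symlink whose target no longer exists."""
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return os.path.islink(link_path) and not os.path.exists(link_path)


def ensure_link(link_path, current_target):
    """Point link_path at current_target; return True if it was (re)created."""
    link_ok = (
        os.path.islink(link_path)
        and os.path.exists(link_path)
        and os.path.realpath(link_path) == os.path.realpath(current_target)
    )
    if link_ok:
        return False  # the link is already valid, nothing to do
    # Create the new link under a temporary name, then rename it over the
    # old one so that link_path is never left missing in between.
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(current_target, tmp)
    os.replace(tmp, link_path)
    return True


def demo():
    """Simulate a device rename that leaves the OSD's block link dangling."""
    with tempfile.TemporaryDirectory() as d:
        old_dev = os.path.join(d, "sdc")   # stand-in for the old device node
        new_dev = os.path.join(d, "sdb")   # stand-in for the new device node
        link = os.path.join(d, "block")    # stand-in for the link in the OSD directory
        open(old_dev, "w").close()
        os.symlink(old_dev, link)
        os.remove(old_dev)                 # the old name disappears ...
        open(new_dev, "w").close()         # ... and the disk shows up under a new name
        dangling_before = is_dangling(link)
        recreated = ensure_link(link, new_dev)
        return dangling_before, recreated, is_dangling(link)


print(demo())  # (True, True, False)
```

Calling `ensure_link` again with the same arguments returns `False`, so the check is safe to run on every OSD start.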

Which issue is resolved by this Pull Request:
Resolves #10860

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@satoru-takeuchi
Member Author

This PR will be ready for review once I finish verifying that it covers the following matrix.

  • device name change and device removal
  • device specified by kernel device name, or by a symlink such as a udev persistent device name

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@satoru-takeuchi
Member Author

I finished testing, so this PR is ready for review.

test detail

  • There are two extra disks, B (sdb) and C (sdc).
  • test matrix:
    • device name
      • kernel name (e.g. sdb, sdc, and so on)
      • udev persistent name (e.g. /dev/disk/by-id/XXX)
    • the type of device name change
      • device becomes missing: remove disk B. Then disk C becomes sdb
      • flip device names: exchange the slots of disks B and C. Then disk C becomes sdb and disk B becomes sdc
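
The two kinds of name change in the matrix can be modeled with a small sketch. It assumes, as a simplification, that kernel names are handed out in slot order starting at sdb and that empty slots are skipped. That matches what was observed in this test, but real kernel enumeration order is not guaranteed. The function name `kernel_names` is hypothetical.

```python
def kernel_names(slots):
    """Map each disk label to a kernel name, in slot order, skipping empty slots.

    `slots` lists the disk in each extra slot (None for an empty slot).
    Naming starts at sdb because sda is assumed to be the system disk.
    """
    names = {}
    letter = ord("b")
    for disk in slots:
        if disk is None:
            continue  # an empty slot does not consume a device name
        names[disk] = "sd" + chr(letter)
        letter += 1
    return names


print(kernel_names(["B", "C"]))   # initial layout: {'B': 'sdb', 'C': 'sdc'}
print(kernel_names([None, "C"]))  # device becomes missing: {'C': 'sdb'}
print(kernel_names(["C", "B"]))   # flip device names: {'C': 'sdb', 'B': 'sdc'}
```

In both changed layouts disk C ends up as sdb, which is why an OSD that recorded the old kernel name for its disk would find that name gone or pointing at a different disk.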

environment

  • HW
    • a Hyper-V guest
    • two extra disks
      • device B: inserted to slot2: sdb in guest
      • device C: inserted to slot3: sdc in guest
  • SW
    • ubuntu 20.04
    • k8s: 1.25.6
    • rook: master(d7f84c9) + my PR
    • ceph: v17.2.5

device name: kernel name

device becomes missing

  1. shutdown the VM
  2. remove device B as follows and restart the VM
    • slot2: missing
    • slot3: device C: sdb in guest
  3. verify OSD on device C is running

=> OK

flip device names

  1. shutdown the VM
  2. exchange the slot of device B and C as follows and restart the VM.
    • slot2: device C: sdb in guest
    • slot3: device B: sdc in guest
  3. verify OSD on device C is running

=> OK

NOTE
After the restart, a new OSD is created on the new /dev/sdc, which points to scratch1.img.
This is by design, and it can be avoided by specifying the udev persistent name.
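
To illustrate why a persistent name avoids this, the sketch below resolves a by-id style symlink to the device node it currently points to. It is only a simulation under stated assumptions: it builds a fake device tree in a temporary directory, the id `wwn-0xEXAMPLE` and the helper `kernel_device` are hypothetical, and the re-pointing of the link stands in for what udev does at boot.

```python
import os
import tempfile


def kernel_device(persistent_path):
    """Resolve a persistent symlink to the device node it currently points to."""
    return os.path.realpath(persistent_path)


def demo():
    with tempfile.TemporaryDirectory() as dev:
        by_id = os.path.join(dev, "disk", "by-id")
        os.makedirs(by_id)
        for name in ("sdb", "sdc"):
            open(os.path.join(dev, name), "w").close()  # stand-ins for device nodes
        link = os.path.join(by_id, "wwn-0xEXAMPLE")      # hypothetical persistent name
        os.symlink("../../sdb", link)                    # before the flip
        before = os.path.basename(kernel_device(link))
        os.remove(link)
        os.symlink("../../sdc", link)                    # after the flip, re-pointed
        after = os.path.basename(kernel_device(link))
        return before, after


print(demo())  # ('sdb', 'sdc')
```

The persistent path stays the same while the kernel name behind it changes, so a configuration that refers to the persistent path keeps following the same physical disk.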

device name: persistent device name (/dev/disk/by-id/wwn-0x60022480e20a2701d91006b946030922)

device becomes missing

  1. shutdown the VM
  2. remove device B as follows and restart the VM
    • slot2: missing
    • slot3: device C: sdb in guest
  3. verify OSD on device C is running

=> OK

flip device names

  1. shutdown
  2. exchange the slot of device B and C as follows.
    • slot2: scratch0.img: sdb in guest
    • slot3: scratch1.img: sdc in guest
  3. Then sdc points to device B and vice versa.
  4. verify OSD is running

=> OK

@satoru-takeuchi
Member Author

The original problem happened in host-based clusters. A similar problem exists in PVC-based clusters. In this case, if a PV that corresponds to an existing OSD points to a missing block device file, the OSD pod fails to consume this PV. Although this behavior is undesirable, I don't think this problem should be handled in Rook. Doing so would require re-creating the existing PV, and Rook shouldn't do such work.

If a kernel device name change happens and a block device file
in the OSD directory becomes dangling link, this OSD fails
to start continuously. This problem can be resolved by confirming
the validity of the device file and recreating it if necessary.

The original problem happened in host-based clusters. A similar
problem exists in PVC-based clusters. In this case, if a PV,
corresponds to an existing OSD, and points to a missing block
device file, the OSD pod fails to consume this PV. Although this
behavior is undesirable, I don't think this problem should be
handled in Rook. If doing so, we must re-create the existing PV.
Rook shouldn't do such work.

Closes: rook#10860

Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
@satoru-takeuchi
Member Author

multi-cluster-mirroring test fails consistently as described in #11742

# If a kernel device name change happens and a block device file
# in the OSD directory becomes missing, this OSD fails to start
# continuously. This problem can be resolved by confirming
# the validity of the device file and recreating it if necessary.
Member

The re-creating of the path is done by ceph-volume, correct?

Member Author

That's correct.

@travisn travisn merged commit 6bcf900 into rook:master Feb 27, 2023
mergify bot added a commit that referenced this pull request Feb 27, 2023
osd: handle device name change and device removel correctly (backport #11567)
travisn added a commit that referenced this pull request Feb 27, 2023
osd: handle device name change and device removel correctly (backport #11567)
@travisn travisn changed the title osd: handle device name change and device removel correctly osd: handle device name change and device removal correctly Feb 28, 2023
Successfully merging this pull request may close these issues.

Cluster unavailable after node reboot, symlink already exist