ceph: ignore certain partition changes in device hotplug #3131

Closed
dotnwat opened this issue May 7, 2019 · 0 comments · Fixed by #3256

@dotnwat dotnwat commented May 7, 2019

Device hotplug detection should ignore certain attributes of a device's partitions. In the two snapshots below, the only thing that changes is the Filesystem field of sdd1 (zfs_member vs. ""):

"Partitions":[{"Name":"sdd9","Size":8388608,"Label":"","Filesystem":""},
{"Name":"sdd1","Size":1799715028992,"Label":"zfs-173d8f0eabba522f","Filesystem":"zfs_member"}]

"Partitions":[{"Name":"sdd9","Size":8388608,"Label":"","Filesystem":""},
{"Name":"sdd1","Size":1799715028992,"Label":"zfs-173d8f0eabba522f","Filesystem":""}]

Perhaps we should simply ignore the contents of the partitions and compare only the number of partitions. My first reaction is to ignore partitions completely, but I'm wondering about a scenario like this: (1) a device with old partitions is inserted, (2) time elapses and orchestration runs, (3) an admin zaps the device, and (4) we would probably want orchestration to run again now that there is a fresh device. The only difference in that scenario is that the partition set changes. @travisn
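A minimal sketch of the count-only comparison proposed above. The `Partition` type mirrors the JSON fields in the snapshots, but the type itself and the `partitionsChanged` helper are hypothetical names for illustration, not Rook's actual code.

```go
package main

import "fmt"

// Partition mirrors the JSON fields shown in the snapshots above.
// The Go type is an assumption made for this sketch.
type Partition struct {
	Name       string
	Size       uint64
	Label      string
	Filesystem string
}

// partitionsChanged (hypothetical helper) reports a change only when the
// number of partitions differs; names, sizes, labels, and filesystems
// are deliberately not compared.
func partitionsChanged(prev, curr []Partition) bool {
	return len(prev) != len(curr)
}

func main() {
	before := []Partition{
		{Name: "sdd9", Size: 8388608},
		{Name: "sdd1", Size: 1799715028992, Label: "zfs-173d8f0eabba522f", Filesystem: "zfs_member"},
	}
	after := []Partition{
		{Name: "sdd9", Size: 8388608},
		{Name: "sdd1", Size: 1799715028992, Label: "zfs-173d8f0eabba522f"},
	}

	// Only the filesystem differs, so no change is reported.
	fmt.Println(partitionsChanged(before, after)) // false

	// A zapped device has no partitions left, so a change is reported.
	fmt.Println(partitionsChanged(before, nil)) // true
}
```

With this rule the filesystem-only flip in the snapshots is ignored, while the zap scenario still registers as a change because the partition count drops.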

@dotnwat dotnwat added the feature label May 7, 2019
@dotnwat dotnwat self-assigned this May 7, 2019
dotnwat added a commit to dotnwat/rook that referenced this issue Jun 3, 2019
the only exception to a naive device list comparison had been to ignore
drive UUID information, which was unreliable when a device wasn't
formatted / partitioned. however, various users have reported different
types of false positives that resulted in orchestration being run
continuously due to the incorrect observation that devices were changing.

this patch fixes the cases we have observed and attempts to be slightly
more conservative in the calculation.

1. devlinks are ignored. when a device is set up for lvm, for example,
the devlinks will be updated with different paths that point to the
device in addition to its standard paths addressable by pci address.

2. in the lvm case, the "model" field and "filesystem" field may also
change.

3. we ignore devices with devlinks that contain "usb" to avoid issues
when using usb drives.

4. be smart about detecting device availability. if a device transitions
from a non-empty (or has-partitions) state to an empty (or unpartitioned)
state, then orchestration is triggered. this is like observing that a
device is now available (e.g. in the allDevices case). however, when a
device transitions from empty to non-empty, the change is ignored: while
it is a change, it's generally associated with the new consumption of
the device.

fixes: rook#3059
fixes: rook#3185
fixes: rook#3131

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
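
The four rules in the commit message above can be sketched for a single device as follows. This is an illustration under stated assumptions, not the code from the patch: the `Device` struct, its fields, and the helpers `isUSB`, `inUse`, and `triggersOrchestration` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Device is a hypothetical, simplified view of a discovered disk.
type Device struct {
	Name       string
	DevLinks   string   // space-separated udev symlinks (assumed layout)
	Model      string   // may change in the lvm case
	Filesystem string   // may change in the lvm case
	Empty      bool     // true when the device holds no data (assumed meaning)
	Partitions []string // partition names
}

// isUSB implements rule 3: devices whose devlinks contain "usb" are ignored.
func isUSB(d Device) bool {
	return strings.Contains(d.DevLinks, "usb")
}

// inUse treats a device as consumed when it is non-empty or has partitions.
func inUse(d Device) bool {
	return !d.Empty || len(d.Partitions) > 0
}

// triggersOrchestration compares two observations of the same device.
// Rules 1 and 2: DevLinks, Model, and Filesystem are never compared.
// Rule 4: only a transition from in-use to available counts as a change.
func triggersOrchestration(prev, curr Device) bool {
	if isUSB(prev) || isUSB(curr) {
		return false
	}
	return inUse(prev) && !inUse(curr)
}

func main() {
	used := Device{Name: "sdd", Partitions: []string{"sdd9", "sdd1"}}
	zapped := Device{Name: "sdd", Empty: true}

	fmt.Println(triggersOrchestration(used, zapped)) // true: device became available
	fmt.Println(triggersOrchestration(zapped, used)) // false: device was consumed
}
```

Making the check directional is what stops the continuous re-orchestration described above: the changes a device undergoes while being consumed no longer count as hotplug events.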
dotnwat added a commit to dotnwat/rook that referenced this issue Jun 4, 2019
dotnwat added a commit to dotnwat/rook that referenced this issue Jun 7, 2019
@travisn travisn closed this in #3256 Jun 7, 2019
travisn added a commit to travisn/rook that referenced this issue Jun 7, 2019
(cherry picked from commit 3966f16)
@travisn travisn mentioned this issue Jun 7, 2019
1 of 5 tasks complete