
failure to remove detached disks via pool resize. Fixes #2099 #2184

Conversation

phillxnet
Member

Fixes #2099
"failure to remove detached disks via pool resize"
Fixes #2077
"remove/resize pool for detached single dev pool fails"

Additional unrelated trivial fix included:
Fixes #2182
"Usage displayed instead of available free space Issue"

Detached disk management changes proposed:

  • When a pool has no missing members, remove db-held pool associations from all prior ‘detached’ members. These disks then show up as regular detached devices and can be removed via their ‘bin’ icons. This allows us to keep surfacing the prior attached status of these devices while returning the source of truth on detached members to the pool itself. Issue #2099 "failure to remove detached disks via pool resize".
  • Special-case the removal of a detached device from a pool that has only detached members. Issue #2077 "remove/resize pool for detached single dev pool fails".
  • Restrict detached device removal selection: one at a time and with no detached/attached mixing (see the sketch after this list). This reduces our problem space and so eases sanity checking. An attempted multi detached device selection informs the user that, on pools with only detached members, all detached disks will be removed anyway.
  • Improve the error message and log entry on a missing device removal failure: reiterate the degraded,rw advice from our Maintenance required text and suggest that a reboot may be needed.
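
The selection restriction could be sketched roughly as follows (a minimal illustration only, with a hypothetical helper name; the real check lives in the Web-UI/back-end removal path and may differ):

def validate_removal_selection(selected_names):
    # Hypothetical sketch: split the selection on the 'detached-' name prefix.
    detached = [n for n in selected_names if n.startswith("detached-")]
    attached = [n for n in selected_names if not n.startswith("detached-")]
    if detached and attached:
        raise ValueError("Detached and attached disks cannot be removed in one operation.")
    if len(detached) > 1:
        raise ValueError(
            "Remove detached disks one at a time; on pools with only "
            "detached members all detached disks are removed anyway."
        )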

Note on the interplay between the first two elements above.

For pools with only detached members, we preserve the members' pool associations (notwithstanding the first element above), even though we normally assess missing device status via a pool model property (has_missing_dev), which cannot be known when there are no attached members. This works because of a prior check: with zero attached members we proceed no further, as no mount is then possible, and so we do not inadvertently remove that pool's single disk association. This preserves the ability to detach all of a pool's members and later re-attach them wholesale (in a powered-off state), with all associated configs (shares, exports, etc.) re-established automatically. If we were instead to remove all disk pool associations automatically in the all-detached case, all of the pool's associated shares/exports would also be removed. (TEST THIS).
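
A minimal sketch of that interplay, with hypothetical names (this is not the actual view code): the detached-member cleanup only runs once at least one attached member exists, so a fully detached pool keeps its disk associations and hence its shares/exports:

def refresh_pool_state(p):
    # Sketch only: p is a Pool model instance with a disk_set relation.
    attached = p.disk_set.exclude(name__startswith="detached-")
    if attached.count() == 0:
        # Zero attached members: no mount is possible, proceed no further,
        # preserving the remaining disk/pool associations for a possible
        # wholesale re-attach.
        return
    if not p.has_missing_dev:
        # Pool reports no missing devices: drop db-held pool associations
        # from prior 'detached' members (see the code excerpt further down).
        for disk in p.disk_set.filter(name__startswith="detached-"):
            disk.pool = None
            disk.save()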

Testing:

Changing all serials for all drives, during power down in KVM, results in none of the ‘detached’ former members having a pool association. These previously pool-associated, now detached, members are then removable via their bin icon, and as such we still surface their prior attached status. #2099

Removing a single detached member from pools that have only detached members results, by design, in all detached members having their pool association removed, and on next pool state refresh, the pool itself being removed. See #2077 detailed caveat below.

Caveats:

N.B. as is common with a few of our Web-UI ‘operations’, two pool details page refreshes are often required for the final pool state to be displayed. I.e. a removed detached disk entry will linger, post completion of the operation, until this double refresh has been performed. And in the case of detached disk removal from pools that have only detached members, we have the ‘Ragged Edge’ issue detailed in my comments on issue #2077, which can be addressed in a future pull request.

Ready for review, requesting @FroggyFlox .
I have separated out the various main elements, as detailed above, into their own commits to aid review, and have also applied our project goals of string.format() and black formatting to all modified files.

In some known cases a misleading/contradictory error message results
when attempting to remove a missing device: that of there being no
missing devices when btrfs fi show indicates otherwise. Handle this case
by re-asserting the degraded,rw mount options suggested in our
Maintenance required contextual text and further suggest a reboot may be
required. It is suggested that we aim to provide an unmount/remount
capability in the future, which could then also be suggested here.
…kstor#2099
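
As a rough illustration of that failure path (a hedged sketch: the helper, exception type and exact wording here are assumptions, not the committed code):

import logging

logger = logging.getLogger(__name__)


def remove_missing_member(pool, do_remove):
    # Sketch only: do_remove stands in for the low-level missing device removal.
    try:
        do_remove(pool)
    except Exception as e:
        msg = (
            "Failed to remove a missing device from pool ({}). A degraded,rw "
            "mount (see the Maintenance required text) and possibly a reboot "
            "may be required first. Low level error: {}".format(pool.name, e)
        )
        logger.error(msg)
        raise Exception(msg)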

When a pool has no missing members remove db held pool associations from
all prior ‘detached’ members. These disks then show up as regular
detached devices and can be removed by their ‘bin’ icons. This allows
for continued surfacing of prior attached status for these devices while
returning source of truth to the pool itself re detached members.
…2077 rockstor#2099

- Special-case the removal of a detached device from a pool that has
only detached members. Issue rockstor#2077
- Restrict detached device removal selection: one-at-a-time and with no
detached/attached mixing. This reduces our problem space and so eases
sanity checking. Attempted multi detached device selection informs that
all detached will be removed on pools with only detached members anyway.
- In combination with other rockstor#2099 related changes, auto pool deletion is
facilitated once all detached disks have been removed.
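
A hedged sketch of that special case (hypothetical names; the real logic sits in the pool resize/disk removal view): when every member is detached, removing any single detached member drops the pool association from all of them, which then lets the now member-less pool be deleted on the next pool state refresh:

def remove_detached_member(p, disk):
    # Sketch only: p is a Pool, disk is one of its 'detached-' members.
    members = p.disk_set.all()
    if not members.exclude(name__startswith="detached-").exists():
        # Only detached members remain: remove the association from all of
        # them so the pool itself can be deleted on the next state refresh.
        for d in members:
            d.pool = None
            d.save()
        return
    # Otherwise drop just this one detached member's association.
    disk.pool = None
    disk.save()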

@FroggyFlox left a comment


@phillxnet ,

I just tested (to the best of my understanding, at least), and it works great!

The steps I followed were:

  1. in KVM, create two new disks with serial aaa and bbb.
  2. Boot Rockstor, and create a pool test_pool_03 with these two disks in raid1.
  3. Shutdown Rockstor.
  4. Change each disk's serial number to a2a2a2 and b2b2b2.
  5. Reboot Rockstor.
  6. Inspect Disks and Pools pages.

Prior to your PR, this resulted in the Pools page showing 4 disk members: the two old ones labeled as detached-disk_name plus the two "new" ones (with the new serial numbers). No "bin" icon was surfaced on the Disks page, and I had to use Resize/ReRaid to remove the disks and then delete.

After your PR, this resulted in the logs showing the following at boot:

[19/Jun/2020 07:59:36] INFO [storageadmin.views.command:100] Removing detached disk from Pool test_pool_03: no missing devices found.
[19/Jun/2020 07:59:36] INFO [storageadmin.views.command:100] Removing detached disk from Pool test_pool_03: no missing devices found.

The Pools page now shows only the two new disks as members of test_pool_03.
The Disks page now shows the two old detached disks with a "bin" icon that I can successfully use to remove them from the list.

Nice!

# pool info as source of truth re missing.
if not p.has_missing_dev:
    for disk in p.disk_set.filter(name__startswith="detached-"):
        logger.info(
            "Removing detached disk from Pool {}: no missing "
            "devices found.".format(p.name)
        )
Member


Quick thought: could we further include the disk name (or any other meaningful info) for the disk in question here? While testing, two detached disks needed to be removed, which results in the following:

[19/Jun/2020 07:59:36] INFO [storageadmin.views.command:100] Removing detached disk from Pool test_pool_03: no missing devices found.
[19/Jun/2020 07:59:36] INFO [storageadmin.views.command:100] Removing detached disk from Pool test_pool_03: no missing devices found.

It thus results in duplicates (or triplicates, etc...) when that happens. Tiny detail, but if there's some sort of identification readily available for the detached disk being removed, it could be helpful.

Member Author


Good point, except that the name of a detached disk is essentially random: the bit after 'detached-' is just a uuid (for long historical reasons), so the name wouldn't help, and that random element also changes on every disk refresh!! Other than that, I guess the best candidate would be the serial. It needn't hold up the PR merge though, as almost all removals will be of single devices anyway: the only meaningful multi-device removal (of a viable pool) would be on raid6, currently the only raid level that can handle more than one disk failure (bar upstream btrfs raid1c3 and raid1c4).

I say we earmark this as a future logging improvement as we will be back here before long anyway. Nice idea though.
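
For reference, that future logging improvement might look something like this (a sketch only, in the context of the excerpt quoted above; it assumes the detached Disk record still carries a usable serial field at that point):

logger.info(
    "Removing detached disk (serial={}) from Pool {}: no missing "
    "devices found.".format(disk.serial, p.name)
)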

@phillxnet
Member Author

@FroggyFlox Thanks for the review and double check. I think we are fairly safe with this one as it stands.

Much appreciated. I'll let it sit for a bit and see if anything more dawns on us; if not, I'll pop it in ready for the next testing release rpm.

@phillxnet phillxnet changed the title failure to remove detached disks via pool resize.Fixes #2099 failure to remove detached disks via pool resize. Fixes #2099 Jun 24, 2020
@phillxnet phillxnet merged commit f23bef5 into rockstor:master Jun 24, 2020
@phillxnet phillxnet deleted the 2099_failure_to_remove_detached_disks_via_pool_resize branch June 24, 2020 15:56