Skip to content

docs/fix: disk replacement workflow and nmdctl corrections#91

Closed
tim000x3 wants to merge 2 commits intoqvr:mainfrom
theweebcoders:main
Closed

docs/fix: disk replacement workflow and nmdctl corrections#91
tim000x3 wants to merge 2 commits intoqvr:mainfrom
theweebcoders:main

Conversation

@tim000x3
Copy link

Summary

  • Document complete disk replacement procedure with critical mount-before-rebuild warning
  • Fix nmdctl size display (was using binary math with decimal labels)
  • Fix reload warning to include required unmount step

Changes

Documentation (README.md)

Complete 7-step disk replacement workflow:

  1. Unmount filesystems
  2. Stop array
  3. Unassign failed disk
  4. Replace with new disk
  5. Start array
  6. Mount filesystems (CRITICAL: before rebuild)
  7. Start rebuild

Added CAUTION block explaining that mounting after rebuild starts will corrupt parity.

nmdctl fixes

  • Size display: Was calculating sizes with binary math (1024) but labeling as decimal units (TB/GB). Now uses decimal math (1000) to match the labels and unRAID's display.
  • Reload warning: After rebuild completes, the warning now correctly recommends nmdctl unmount && nmdctl reload && nmdctl start instead of just reload && start (which fails if filesystems are mounted).

Testing

These changes were discovered and validated during hands-on disk replacement testing on bare metal Arch Linux (kernel 6.12.1) with NonRAID DKMS 1.3.2 / kernel module 2.9.35.

Complete workflow for replacing failed disks with critical warning
about mount order: filesystems MUST be mounted before rebuild starts
or parity will be corrupted.
- Size display was using binary math (1024) with decimal labels (TB/GB).
  Now uses decimal math to match the labels and unRAID's display.
- Reload warning now includes required unmount step.
@qvr
Copy link
Owner

qvr commented Nov 29, 2025

Sorry, this doesn't make sense:

Added CAUTION block explaining that mounting after rebuild starts will corrupt parity.

Mounting before or after rebuild starts should not make any difference. The driver does not know about mounted filesystems.

@qvr
Copy link
Owner

qvr commented Nov 29, 2025

Regarding the other changes, I originally decided to go with base-2 units, and while it is probably "wrong" to then use gigabytes as units instead of gibibytes, I'm not sure I want to change it right away.

Reload warning, the reload_module could be changed to call stop_array instead of directly trying to stop the driver. stop_array will then handle the unmounting first (as long as it was not called in unattended mode).

@tim000x3
Copy link
Author

Sorry, this doesn't make sense:

Added CAUTION block explaining that mounting after rebuild starts will corrupt parity.

Mounting before or after rebuild starts should not make any difference. The driver does not know about mounted filesystems.

I understand the driver doesn't explicitly track mount state, this is more about what came up in my testing. So a couple of clarifications:

  • The drive I was replacing was XFS, not ZFS (Although, this doesn't matter based on what you're saying. I figured I'd mention it anyway.)
  • I never ran zpool import manually, so there was no ZFS cache to auto-import
  • When I started the rebuild before mounting, I got checksum mismatches after the rebuild completed
  • When I mounted first, then started the rebuild, everything verified correctly

Also, this is the order of operation Unraid uses. I know you mentioned you don't use unraid, so I figured I'd let you know.

How about as a middle ground, because this did come up for me in testing and maybe it's just an edge case.

So how about instead of "will corrupt parity," we could say "may cause parity inconsistency in some configurations" or "some users have reported issues when mounting after rebuild starts."

Either way I'd like to see I’d like some kind of documentation for a recommended workflow. If your issue is specifically on the caution block being too definitive, hopefully this middle ground makes sense to you.

@tim000x3
Copy link
Author

tim000x3 commented Nov 29, 2025

Regarding the other changes, I originally decided to go with base-2 units, and while it is probably "wrong" to then use gigabytes as units instead of gibibytes, I'm not sure I want to change it right away.

Reload warning, the reload_module could be changed to call stop_array instead of directly trying to stop the driver. stop_array will then handle the unmounting first (as long as it was not called in unattended mode).

I apologize for not being clear with my explanation. I honestly don't have a strong opinion on whether to use GB or GiB. My issue was that the code was doing the math for one thing but labeling it the other.

So the way you had it before, you were doing

elif [ "$force_unit" = "mb" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1048576 ]; }; then  # 1024^2
    local mb=$((kb / 1024))
    local remainder=$((kb % 1024))
    [ "$add_unit" -eq 1 ] && unit=" MB"
    format_with_decimals "$mb" "$remainder" "1024" "$unit" "$decimal_places"
elif [ "$force_unit" = "gb" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1073741824 ]; }; then  # 1024^3
    local gb=$((kb / 1048576))
    local remainder=$((kb % 1048576))
    [ "$add_unit" -eq 1 ] && unit=" GB"
    format_with_decimals "$gb" "$remainder" "1048576" "$unit" "$decimal_places"
elif [ "$force_unit" = "tb" ] || [ -z "$force_unit" ]; then
    local tb=$((kb / 1073741824))
    local remainder=$((kb % 1073741824))
    [ "$add_unit" -eq 1 ] && unit=" TB"
    format_with_decimals "$tb" "$remainder" "1073741824" "$unit" "$decimal_places"
fi

which would give you GiB/TiB but labeling it GB/TB.

If you want to stay with GiB/TiB That's fine. We just need to label it that way. So instead what we should do is

elif [ "$force_unit" = "mib" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1048576 ]; }; then  # 1024^2
    local mib=$((kb / 1024))
    local remainder=$((kb % 1024))
    [ "$add_unit" -eq 1 ] && unit=" MiB"
    format_with_decimals "$mib" "$remainder" "1024" "$unit" "$decimal_places"
elif [ "$force_unit" = "gib" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1073741824 ]; }; then  # 1024^3
    local gib=$((kb / 1048576))
    local remainder=$((kb % 1048576))
    [ "$add_unit" -eq 1 ] && unit=" GiB"
    format_with_decimals "$gib" "$remainder" "1048576" "$unit" "$decimal_places"
elif [ "$force_unit" = "tib" ] || [ -z "$force_unit" ]; then
    local tib=$((kb / 1073741824))
    local remainder=$((kb % 1073741824))
    [ "$add_unit" -eq 1 ] && unit=" TiB"
    format_with_decimals "$tib" "$remainder" "1073741824" "$unit" "$decimal_places"
fi

As I said, I don't really have strong opinions either way as long as we're consistent and accurate

As for your second point about the reload module being changed to call stop array, I 100% agree that's probably the better way of doing it. The reason I didn't do that is because when making my first pull request on any new code base, I like to do the least intrusive changes and I figured just changing the warning was less intrusive.

@tim000x3
Copy link
Author

Just let me know how you want to proceed and I'll modify my pull request accordingly.

@qvr
Copy link
Owner

qvr commented Nov 30, 2025

Can you open an issue detailing how to reproduce the parity corruption issue?

Please list all the commands you run, including nmdctl outputs at each step, and especially the mount commands you are running, and how you the come to the conclusion that parity is invalid after a successful rebuild.

I am unable to reproduce such an issue, and it would go against how the driver actually works. It is most likely you have accidentally mounted the raw disk device at some stage instead of the nonraid block device.

As for the unit changes, you can also open an issue about it, but there are many examples of CLI tools using base-2 and GB as symbols, and switching to gibibyte symbols might make the outputs look busy, so I'm inclined to leave it as-is. Even ZFS skirts around the "problem" by not including the byte symbol in their human-readable output at all, just using T/G/M suffixes...

And finally for the reload warning, you can open a new PR if you want to change the reload function to call stop_array internally. (You can also just open an issue and I can take a look at it later.)

I'll close this PR, in the future it would be best if you could open separate PRs for separate changes - and in most cases an issue before PR is good, so we can first discuss why a change is needed. (I also added this into the contributing guidelines just now, so no worries and thanks for your efforts so far!)

@qvr qvr closed this Nov 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants