docs/fix: disk replacement workflow and nmdctl corrections by tim000x3 · Pull Request #91 · qvr/nonraid

tim000x3 · 2025-11-29T10:46:37Z

Summary

Document complete disk replacement procedure with critical mount-before-rebuild warning
Fix nmdctl size display (was using binary math with decimal labels)
Fix reload warning to include required unmount step

Changes

Documentation (README.md)

Complete 7-step disk replacement workflow:

Unmount filesystems
Stop array
Unassign failed disk
Replace with new disk
Start array
Mount filesystems (CRITICAL: before rebuild)
Start rebuild

Added CAUTION block explaining that mounting after rebuild starts will corrupt parity.

nmdctl fixes

Size display: Was calculating sizes with binary math (1024) but labeling as decimal units (TB/GB). Now uses decimal math (1000) to match the labels and unRAID's display.
Reload warning: After rebuild completes, the warning now correctly recommends nmdctl unmount && nmdctl reload && nmdctl start instead of just reload && start (which fails if filesystems are mounted).

Testing

These changes were discovered and validated during hands-on disk replacement testing on bare metal Arch Linux (kernel 6.12.1) with NonRAID DKMS 1.3.2 / kernel module 2.9.35.

Complete workflow for replacing failed disks with critical warning about mount order: filesystems MUST be mounted before rebuild starts or parity will be corrupted.

- Size display was using binary math (1024) with decimal labels (TB/GB). Now uses decimal math to match the labels and unRAID's display. - Reload warning now includes required unmount step.

qvr · 2025-11-29T12:03:48Z

Sorry, this doesn't make sense:

Added CAUTION block explaining that mounting after rebuild starts will corrupt parity.

Mounting before or after rebuild starts should not make any difference. The driver does not know about mounted filesystems.

qvr · 2025-11-29T12:36:37Z

Regarding the other changes, I originally decided to go with base-2 units, and while it is probably "wrong" to then use gigabytes as units instead of gibibytes, I'm not sure I want to change it right away.

Reload warning, the reload_module could be changed to call stop_array instead of directly trying to stop the driver. stop_array will then handle the unmounting first (as long as it was not called in unattended mode).

tim000x3 · 2025-11-29T23:31:28Z

Sorry, this doesn't make sense:

Added CAUTION block explaining that mounting after rebuild starts will corrupt parity.

Mounting before or after rebuild starts should not make any difference. The driver does not know about mounted filesystems.

I understand the driver doesn't explicitly track mount state, this is more about what came up in my testing. So a couple of clarifications:

The drive I was replacing was XFS, not ZFS (Although, this doesn't matter based on what you're saying. I figured I'd mention it anyway.)
I never ran zpool import manually, so there was no ZFS cache to auto-import
When I started the rebuild before mounting, I got checksum mismatches after the rebuild completed
When I mounted first, then started the rebuild, everything verified correctly

Also, this is the order of operation Unraid uses. I know you mentioned you don't use unraid, so I figured I'd let you know.

How about as a middle ground, because this did come up for me in testing and maybe it's just an edge case.

So how about instead of "will corrupt parity," we could say "may cause parity inconsistency in some configurations" or "some users have reported issues when mounting after rebuild starts."

Either way I'd like to see I’d like some kind of documentation for a recommended workflow. If your issue is specifically on the caution block being too definitive, hopefully this middle ground makes sense to you.

tim000x3 · 2025-11-29T23:40:19Z

Regarding the other changes, I originally decided to go with base-2 units, and while it is probably "wrong" to then use gigabytes as units instead of gibibytes, I'm not sure I want to change it right away.

Reload warning, the reload_module could be changed to call stop_array instead of directly trying to stop the driver. stop_array will then handle the unmounting first (as long as it was not called in unattended mode).

I apologize for not being clear with my explanation. I honestly don't have a strong opinion on whether to use GB or GiB. My issue was that the code was doing the math for one thing but labeling it the other.

So the way you had it before, you were doing

elif [ "$force_unit" = "mb" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1048576 ]; }; then  # 1024^2
    local mb=$((kb / 1024))
    local remainder=$((kb % 1024))
    [ "$add_unit" -eq 1 ] && unit=" MB"
    format_with_decimals "$mb" "$remainder" "1024" "$unit" "$decimal_places"
elif [ "$force_unit" = "gb" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1073741824 ]; }; then  # 1024^3
    local gb=$((kb / 1048576))
    local remainder=$((kb % 1048576))
    [ "$add_unit" -eq 1 ] && unit=" GB"
    format_with_decimals "$gb" "$remainder" "1048576" "$unit" "$decimal_places"
elif [ "$force_unit" = "tb" ] || [ -z "$force_unit" ]; then
    local tb=$((kb / 1073741824))
    local remainder=$((kb % 1073741824))
    [ "$add_unit" -eq 1 ] && unit=" TB"
    format_with_decimals "$tb" "$remainder" "1073741824" "$unit" "$decimal_places"
fi

which would give you GiB/TiB but labeling it GB/TB.

If you want to stay with GiB/TiB That's fine. We just need to label it that way. So instead what we should do is

elif [ "$force_unit" = "mib" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1048576 ]; }; then  # 1024^2
    local mib=$((kb / 1024))
    local remainder=$((kb % 1024))
    [ "$add_unit" -eq 1 ] && unit=" MiB"
    format_with_decimals "$mib" "$remainder" "1024" "$unit" "$decimal_places"
elif [ "$force_unit" = "gib" ] || { [ -z "$force_unit" ] && [ "$kb" -lt 1073741824 ]; }; then  # 1024^3
    local gib=$((kb / 1048576))
    local remainder=$((kb % 1048576))
    [ "$add_unit" -eq 1 ] && unit=" GiB"
    format_with_decimals "$gib" "$remainder" "1048576" "$unit" "$decimal_places"
elif [ "$force_unit" = "tib" ] || [ -z "$force_unit" ]; then
    local tib=$((kb / 1073741824))
    local remainder=$((kb % 1073741824))
    [ "$add_unit" -eq 1 ] && unit=" TiB"
    format_with_decimals "$tib" "$remainder" "1073741824" "$unit" "$decimal_places"
fi

As I said, I don't really have strong opinions either way as long as we're consistent and accurate

—

As for your second point about the reload module being changed to call stop array, I 100% agree that's probably the better way of doing it. The reason I didn't do that is because when making my first pull request on any new code base, I like to do the least intrusive changes and I figured just changing the warning was less intrusive.

tim000x3 · 2025-11-29T23:57:30Z

Just let me know how you want to proceed and I'll modify my pull request accordingly.

qvr · 2025-11-30T07:24:01Z

Can you open an issue detailing how to reproduce the parity corruption issue?

Please list all the commands you run, including nmdctl outputs at each step, and especially the mount commands you are running, and how you the come to the conclusion that parity is invalid after a successful rebuild.

I am unable to reproduce such an issue, and it would go against how the driver actually works. It is most likely you have accidentally mounted the raw disk device at some stage instead of the nonraid block device.

As for the unit changes, you can also open an issue about it, but there are many examples of CLI tools using base-2 and GB as symbols, and switching to gibibyte symbols might make the outputs look busy, so I'm inclined to leave it as-is. Even ZFS skirts around the "problem" by not including the byte symbol in their human-readable output at all, just using T/G/M suffixes...

And finally for the reload warning, you can open a new PR if you want to change the reload function to call stop_array internally. (You can also just open an issue and I can take a look at it later.)

I'll close this PR, in the future it would be best if you could open separate PRs for separate changes - and in most cases an issue before PR is good, so we can first discuss why a change is needed. (I also added this into the contributing guidelines just now, so no worries and thanks for your efforts so far!)

tim000x3 added 2 commits November 29, 2025 05:39

docs: document disk replacement procedure

3387193

Complete workflow for replacing failed disks with critical warning about mount order: filesystems MUST be mounted before rebuild starts or parity will be corrupted.

fix: correct nmdctl size display and reload warning

b491564

- Size display was using binary math (1024) with decimal labels (TB/GB). Now uses decimal math to match the labels and unRAID's display. - Reload warning now includes required unmount step.

qvr closed this Nov 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs/fix: disk replacement workflow and nmdctl corrections#91

docs/fix: disk replacement workflow and nmdctl corrections#91
tim000x3 wants to merge 2 commits intoqvr:mainfrom
theweebcoders:main

tim000x3 commented Nov 29, 2025

Uh oh!

qvr commented Nov 29, 2025

Uh oh!

qvr commented Nov 29, 2025

Uh oh!

tim000x3 commented Nov 29, 2025

Uh oh!

tim000x3 commented Nov 29, 2025 •

edited

Loading

Uh oh!

tim000x3 commented Nov 29, 2025

Uh oh!

qvr commented Nov 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tim000x3 commented Nov 29, 2025

Summary

Changes

Documentation (README.md)

nmdctl fixes

Testing

Uh oh!

qvr commented Nov 29, 2025

Uh oh!

qvr commented Nov 29, 2025

Uh oh!

tim000x3 commented Nov 29, 2025

Uh oh!

tim000x3 commented Nov 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tim000x3 commented Nov 29, 2025

Uh oh!

qvr commented Nov 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tim000x3 commented Nov 29, 2025 •

edited

Loading