write_protected_candidate_device called for '/sys/block/nvme0c0n1' but '/dev/nvme0c0n1' is no block device #3085

Closed
hitrikrtek opened this issue Nov 17, 2023 · 11 comments

hitrikrtek commented Nov 17, 2023

  • ReaR version ("/usr/sbin/rear -V"):
    Relax-and-Recover 2.7 / 2022-07-13

  • If your ReaR version is not the current version, explain why you can't upgrade:
    Latest version with distribution

  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):

NAME="SLES"
VERSION="15-SP5"
VERSION_ID="15.5"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp5"
  • ReaR configuration files ("cat /etc/rear/local.conf"):
BACKUP=NETFS
OUTPUT=ISO
BACKUP_URL=nfs://172.17.3.1/rear
BACKUP_OPTIONS=
NETFS_KEEP_OLD_BACKUP_COPY=yes
USE_DHCLIENT=
MODULES_LOAD=( )
#BACKUP_PROG_INCLUDE=("${BACKUP_PROG_INCLUDE[@]}" '/boot/grub2/i386-pc' '/boot/grub2/x86_64-efi' '/home' '/opt' '/root' '/srv' '/hana/shared' '/usr' '/var')
BACKUP_PROG_EXCLUDE=("${BACKUP_PROG_EXCLUDE[@]}" '/hana/backup' '/hana/data' '/hana/log' '/hana/shared/HSP/HDB05/backup' '/hana/shared/NSP/HDB50/backup' '/opt/dpsapps/agentsvc/dbs' '/opt/dpsapps/agentsvc/logs' '/tmp')
POST_RECOVERY_SCRIPT=('if snapper --no-dbus -r $TARGET_FS_ROOT get-config | grep -q "^QGROUP.*[0-9]/[0-9]" ; then snapper --no-dbus -r $TARGET_FS_ROOT set-config QGROUP= ; snapper --no-dbus -r $TARGET_FS_ROOT setup-quota && echo snapper setup-quota done || echo snapper setup-quota failed ; else echo snapper setup-quota not used ; fi')
REQUIRED_PROGS=(snapper chattr lsattr ${REQUIRED_PROGS[@]})
COPY_AS_IS=(/usr/lib/snapper/installation-helper /etc/snapper/config-templates/default ${COPY_AS_IS[@]})
EXCLUDE_MOUNTPOINTS=("/mnt")
ISO_MKISOFS_BIN=/usr/bin/ebiso
  • Hardware vendor/product (PC or PowerNV BareMetal or ARM) or VM (KVM guest or PowerVM LPAR):
    DELL PowerEdge R860

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
    x86_64

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
    BIOS Version 1.5.6
    grub2-x86_64-efi-2.06-150500.29.8.1

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
    NVMe in RAID configuration

Backplane 1 on Connector 0 of RAID Controller in SL 1
DeviceDescription	Backplane 1 on Connector 0 of RAID Controller in SL 1
DeviceType	PCIeSSDBackPlane
FirmwareVersion	7.10
FQDD	Enclosure.Internal.0-1:RAID.SL.1-1
InstanceID	Enclosure.Internal.0-1:RAID.SL.1-1
MediaType	Solid State Drive
PCIExpressGeneration	Gen 4
ProductName	BP_PSV 0:1
RollupStatus	OK
SlotCount	8
WiredOrder	1

Backplane 2 on Connector 0 of RAID Controller in SL 4
DeviceDescription	Backplane 2 on Connector 0 of RAID Controller in SL 4
DeviceType	PCIeSSDBackPlane
FirmwareVersion	7.10
FQDD	Enclosure.Internal.0-2:RAID.SL.4-1
InstanceID	Enclosure.Internal.0-2:RAID.SL.4-1
MediaType	Solid State Drive
PCIExpressGeneration	Gen 4
ProductName	BP_PSV 0:2
RollupStatus	OK
SlotCount	8
WiredOrder	2
  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT"):
NAME                          KNAME          PKNAME         TRAN   TYPE FSTYPE      LABEL   SIZE MOUNTPOINT
/dev/sda                      /dev/sda                             disk LVM2_member        23.3T
|-/dev/mapper/vghana-lvusrsap /dev/dm-2      /dev/sda              lvm  xfs                 150G /usr/sap
|-/dev/mapper/vghana-lvdata   /dev/dm-3      /dev/sda              lvm  xfs                   6T /hana/data
|-/dev/mapper/vghana-lvshared /dev/dm-4      /dev/sda              lvm  xfs                   2T /hana/shared
|-/dev/mapper/vghana-lvlog    /dev/dm-5      /dev/sda              lvm  xfs                   4T /hana/log
'-/dev/mapper/vghana-lvbackup /dev/dm-6      /dev/sda              lvm  xfs                   2T /hana/backup
/dev/sdb                      /dev/sdb                      iscsi  disk                     100M
/dev/sdc                      /dev/sdc                      iscsi  disk                     100M
/dev/sdd                      /dev/sdd                      iscsi  disk                       1G
/dev/nvme0n1                  /dev/nvme0n1                  nvme   disk                   447.1G
|-/dev/nvme0n1p1              /dev/nvme0n1p1 /dev/nvme0n1   nvme   part vfat        ESP     250M /boot/efi
|-/dev/nvme0n1p2              /dev/nvme0n1p2 /dev/nvme0n1   nvme   part vfat        OS        2G /boot
'-/dev/nvme0n1p3              /dev/nvme0n1p3 /dev/nvme0n1   nvme   part LVM2_member       444.8G
  |-/dev/mapper/vgroot-root   /dev/dm-0      /dev/nvme0n1p3        lvm  btrfs             442.8G /var
  '-/dev/mapper/vgroot-swap   /dev/dm-1      /dev/nvme0n1p3        lvm  swap                  2G [SWAP]
  • Description of the issue (ideally so that others can reproduce it):

When running rear recover we get this error:
[screenshot of the error message attached]

  • Workaround, if any:
    None

  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):
    See attached debug log: rear-srv-lx2013.log

jsmeix commented Nov 23, 2023

@hitrikrtek
what is the output of each of the following commands

# echo /sys/block/*

# ls -l /dev/nvme0c0n1

# ls -l /sys/block/nvme0c0n1

# ls -l $( readlink -e /sys/block/nvme0c0n1 )

?

Details:

The following excerpt from your
https://github.com/rear/rear/files/13389016/rear-srv-lx2013.log
shows how it fails:

++ for current_device_path in /sys/block/*
++ current_disk_name=nvme0c0n1
++ is_multipath_path nvme0c0n1
++ test nvme0c0n1
++ is_multipath_used
++ type multipath
++ return 1
++ return 1
++ test -d /sys/block/nvme0c0n1/queue
++ test 0 = 1
++ is_write_protected /sys/block/nvme0c0n1
+++ write_protected_candidate_device /sys/block/nvme0c0n1
+++ local device=/sys/block/nvme0c0n1
+++ [[ /sys/block/nvme0c0n1 == /sys/block/* ]]
++++ get_device_name /sys/block/nvme0c0n1
++++ local name=/sys/block/nvme0c0n1
++++ name=nvme0c0n1
++++ contains_visible_char nvme0c0n1
+++++ tr -d -c '[:graph:]'
++++ test nvme0c0n1
++++ [[ nvme0c0n1 =~ ^mapper/ ]]
++++ [[ -L /dev/nvme0c0n1 ]]
++++ [[ nvme0c0n1 =~ ^dm- ]]
++++ name=nvme0c0n1
++++ echo /dev/nvme0c0n1
++++ [[ -r /dev/nvme0c0n1 ]]
++++ return 1
+++ device=/dev/nvme0c0n1
+++ test -b /dev/nvme0c0n1
+++ BugError 'write_protected_candidate_device called for '\''/sys/block/nvme0c0n1'\'' but '\''/dev/nvme0c0n1'\'' is no block device'

This looks strange because it seems in your case
there is "nvme0c0n1" in /sys/block/ but

test -b /dev/nvme0c0n1

fails, so "nvme0c0n1" appears to be a block device
(because "nvme0c0n1" is listed in /sys/block/) that
either has no matching /dev/nvme0c0n1 device node
or has a matching /dev/nvme0c0n1 device node
which itself is no block device.

By "googling" for 'nvme0c0n1' I found
google/cadvisor#3340

Not all /sys/block devices will have a "dev" file

which seems to explain the root cause.
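
A quick way to check that on an affected system is to look for
the "dev" file of the /sys/block entry (a minimal sketch, assuming
the nvme0c0n1 name from this report; it may differ elsewhere):

# The "dev" file holds the major:minor numbers that udev uses to create
# the matching /dev node; when it is missing, no /dev/nvme0c0n1 device node
# is expected to exist:
if test -e /sys/block/nvme0c0n1/dev ; then
    cat /sys/block/nvme0c0n1/dev
else
    echo "no 'dev' file, so no matching device node expected"
fi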

For comparison:
On my laptop with a NVMe disk I have /sys/block/nvme0n1
and I have a matching device node /dev/nvme0n1
which is a block device.

The matching code is in
usr/share/rear/layout/prepare/default/250_compare_disks.sh
which is online for ReaR 2.7 starting at
https://github.com/rear/rear/blob/rear-2.7/usr/share/rear/layout/prepare/default/250_compare_disks.sh#L94

The is_write_protected and write_protected_candidate_device
functions are in
usr/share/rear/lib/write-protect-functions.sh
which is online for ReaR 2.7
https://github.com/rear/rear/blob/rear-2.7/usr/share/rear/lib/write-protect-functions.sh

So it seems we have to enhance the
write_protected_candidate_device function
so that it skips a device for which
no matching /dev/... device node exists.
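
The enhancement could look roughly like this inside
write_protected_candidate_device (an untested sketch of the idea only,
not the actual ReaR code; DebugPrint is ReaR's existing debug message function):

# Untested sketch: when the candidate from /sys/block/* has no matching
# block device node in /dev/, skip it so that it is simply treated
# as "not write-protected" instead of aborting with a BugError:
device="/dev/${1##*/}"
if ! test -b "$device" ; then
    DebugPrint "Skipping write protection check for '$1' ('$device' is no block device)"
    return 1
fi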

jsmeix self-assigned this Nov 23, 2023
jsmeix added this to the ReaR v2.8 milestone Nov 23, 2023
hitrikrtek commented Nov 23, 2023

Hi @jsmeix
First of all thank you for your time looking into this.
I've run the commands you requested and here's the output:

# echo /sys/block/*
/sys/block/dm-0 /sys/block/dm-1 /sys/block/dm-2 /sys/block/dm-3 /sys/block/dm-4 /sys/block/dm-5 /sys/block/dm-6 /sys/block/nvme0c0n1 /sys/block/nvme0n1 /sys/block/sda /sys/block/sdb /sys/block/sdc /sys/block/sdd

# ls -l /dev/nvme0c0n1
ls: cannot access '/dev/nvme0c0n1': No such file or directory

# ls -l /sys/block/nvme0c0n1
lrwxrwxrwx 1 root root 0 Nov 21 13:38 /sys/block/nvme0c0n1 -> ../devices/pci0000:00/0000:00:0a.0/0000:01:00.0/nvme/nvme0/nvme0c0n1

# ls -l $( readlink -e /sys/block/nvme0c0n1 )
total 0
-r--r--r-- 1 root root 4096 Nov 23 09:26 alignment_offset
-r--r--r-- 1 root root 4096 Nov 23 09:26 capability
lrwxrwxrwx 1 root root    0 Nov 21 13:38 device -> ../../nvme0
-r--r--r-- 1 root root 4096 Nov 23 09:26 discard_alignment
-r--r--r-- 1 root root 4096 Nov 23 09:26 diskseq
-r--r--r-- 1 root root 4096 Nov 23 09:26 eui
-r--r--r-- 1 root root 4096 Nov 23 09:26 events
-r--r--r-- 1 root root 4096 Nov 23 09:26 events_async
-rw-r--r-- 1 root root 4096 Nov 23 09:26 events_poll_msecs
-r--r--r-- 1 root root 4096 Nov 23 09:26 ext_range
-r--r--r-- 1 root root 4096 Nov 23 09:26 hidden
drwxr-xr-x 2 root root    0 Nov 23 09:26 holders
-r--r--r-- 1 root root 4096 Nov 23 09:26 inflight
drwxr-xr-x 2 root root    0 Nov 23 09:26 integrity
-rw-r--r-- 1 root root 4096 Nov 23 09:26 make-it-fail
drwxr-xr-x 5 root root    0 Nov 23 09:26 mq
-r--r--r-- 1 root root 4096 Nov 23 09:26 nsid
drwxr-xr-x 2 root root    0 Nov 23 09:26 power
drwxr-xr-x 2 root root    0 Nov 23 09:26 queue
-r--r--r-- 1 root root 4096 Nov 23 09:26 range
-r--r--r-- 1 root root 4096 Nov 23 09:26 removable
-r--r--r-- 1 root root 4096 Nov 23 09:26 ro
-r--r--r-- 1 root root 4096 Nov 23 09:26 size
drwxr-xr-x 2 root root    0 Nov 23 09:26 slaves
-r--r--r-- 1 root root 4096 Nov 23 09:26 stat
lrwxrwxrwx 1 root root    0 Nov 21 13:38 subsystem -> ../../../../../../../class/block
drwxr-xr-x 2 root root    0 Nov 23 09:26 trace
-rw-r--r-- 1 root root 4096 Nov 21 13:38 uevent
-r--r--r-- 1 root root 4096 Nov 23 09:26 wwid

jsmeix commented Nov 23, 2023

# echo /sys/block/*
... /sys/block/nvme0c0n1 /sys/block/nvme0n1 ...

# ls -l /dev/nvme0c0n1
ls: cannot access '/dev/nvme0c0n1': No such file or directory

proves that the root cause is that

Not all /sys/block devices will have a "dev" file

so we have to enhance the
write_protected_candidate_device function
so that it skips a device for which
no matching /dev/... device node exists.

jsmeix added a commit that referenced this issue Nov 23, 2023
Let the is_write_protected function
report candidate devices without device node
as not write protected
because not all /sys/block devices have a "dev" file
e.g. /sys/block/nvme0c0n1 has no /dev/nvme0c0n1 device node
see #3085
Because the is_write_protected function is meant
to identify write-protected disks and partitions
only candidate devices with a device node
are considered for write protection
jsmeix added the bug label and removed the waiting for info and support / question labels Nov 23, 2023
jsmeix commented Nov 23, 2023

@hitrikrtek

I cannot reproduce your issue
because I don't have a system with /sys/block/nvme0c0n1
or something similar - i.e. where a /sys/block/device
does not have a matching /dev/device.

So as an offhand, untested workaround for now
you may try out how things behave when you replace in
usr/share/rear/lib/write-protect-functions.sh

test -b "$device" || BugError "write_protected_candidate_device called for '$1' but '$device' is no block device"

with

test -b "$device" || DebugPrint "write_protected_candidate_device called for '$1' but '$device' is no block device"

Perhaps with that you get other errors in other functions
in usr/share/rear/lib/write-protect-functions.sh

Alternatively (and preferred by me if you can) you could test
if my overhauled usr/share/rear/lib/write-protect-functions.sh
https://raw.githubusercontent.com/rear/rear/a403b5fe1c2c58420ba1b77db52283c041e4f7d4/usr/share/rear/lib/write-protect-functions.sh
in
#3091
makes things work well for your case.
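
For example, testing it could look like this (a sketch, assuming ReaR's
scripts are installed under /usr/share/rear and curl is available;
adapt paths as needed):

# Keep a copy of the installed file, replace it with the overhauled one
# from the raw URL above, then re-create the rescue media so that
# the recovery system contains the new file:
cp /usr/share/rear/lib/write-protect-functions.sh /usr/share/rear/lib/write-protect-functions.sh.orig
curl -o /usr/share/rear/lib/write-protect-functions.sh https://raw.githubusercontent.com/rear/rear/a403b5fe1c2c58420ba1b77db52283c041e4f7d4/usr/share/rear/lib/write-protect-functions.sh
rear -D mkrescue    # or "rear -D mkbackup" with BACKUP=NETFS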

hitrikrtek (Author) commented:

Alright! I've grabbed your updated function and am creating a backup now. Will report back the result.
Thank you

jsmeix commented Nov 23, 2023

@hitrikrtek
with

OUTPUT=ISO

you should get nothing automatically set for
WRITE_PROTECTED_IDS and/or WRITE_PROTECTED_FS_LABEL_PATTERNS
after "rear mkrescue/mkbackup" in your
/var/tmp/rear.XXX/rootfs/etc/rear/rescue.conf
When you also have
neither WRITE_PROTECTED_IDS
nor WRITE_PROTECTED_FS_LABEL_PATTERNS
specified in your etc/rear/local.conf file,
then all is_write_protected function calls
would always result in nothing being write-protected,
so the is_write_protected function could be
simplified and made more robust against errors
by checking whether WRITE_PROTECTED_IDS
and WRITE_PROTECTED_FS_LABEL_PATTERNS are both empty
and in that case directly returning 1 (i.e. "not write-protected").
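
Such an early return could look roughly like this (an untested sketch
of the idea only, not the actual ReaR code):

function is_write_protected () {
    # Untested sketch: when neither WRITE_PROTECTED_IDS nor
    # WRITE_PROTECTED_FS_LABEL_PATTERNS is specified, nothing can be
    # write-protected, so directly report "not write-protected":
    if ! test "${WRITE_PROTECTED_IDS[*]:-}" && ! test "${WRITE_PROTECTED_FS_LABEL_PATTERNS[*]:-}" ; then
        return 1
    fi
    # ... the actual checks against IDs and filesystem label patterns would follow here ...
}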

hitrikrtek (Author) commented:

Regarding the updated function: this seems to work now, we were able to do the restore. But now we're experiencing some other issues - after the restore our network broke somehow... All disks and configs seem just fine, we're investigating, but I don't think it's related to this device. I'll post my findings here if we figure it out.

jsmeix commented Nov 23, 2023

@hitrikrtek
please do not post things that do not belong to this issue
here, but instead create a new separate issue
for each separate problem that you have.
Otherwise everything would get messed up.

schlomo commented Jan 13, 2024

@hitrikrtek just going over some stuff, a quick question: Your setup has NVMe multipath, correct? I'm asking because https://lore.kernel.org/all/3c725e5deaabaaf145f48f2f6fcfdae9f6d41e2e.camel@suse.de/t/ suggests that the specific sysfs path you encounter is related to the multipath setup, where indeed I understand that there are more sysfs block devices listed than actual /dev block devices, because multipathing means that there are multiple "backend" block devices for what is actually the same disk.

I'm asking this because I'm wondering whether ReaR actually correctly recognizes NVMe multipath setups as multipath setups, or whether it tries to handle them as "regular" disks.

Of course, a proper fix would start from access to such an NVMe multipath device...

hitrikrtek commented Jan 15, 2024

@schlomo it is NVMe in a RAID1 configuration (two disks, but it is a HW RAID controller) so I guess it is multipath, yes. We've also tried a restore and it works, so it seems the current fix works OK for this configuration. If you want me to send you some more info from the system, just let me know what to send.

jsmeix commented Jan 15, 2024

I need more explanatory information
to be able to correctly imagine things
because I am not a multipath user at all -
neither of "traditional" multipath with non-NVMe disks
nor of current multipath with NVMe.

Therefore I have some questions:

(1)
First about the wording
"NVMe native multipathing" versus "NVMe multipathing":

https://lore.kernel.org/all/3c725e5deaabaaf145f48f2f6fcfdae9f6d41e2e.camel@suse.de/t/
and
https://documentation.suse.com/de-de/sles/15-SP1/html/SLES-all/cha-nvmeof.html
talk about "NVMe native multipathing".
But here we talk about "NVMe multipathing" in particular
#3091 (comment)
Are "NVMe native multipathing" and "NVMe multipathing"
the same thing or is there a difference?

(2)
Second about kernel device nodes
versus "other devices" like "device" smylinks to "whatever":

As far as I know, in traditional multipath with a non-NVMe disk
that is accessible via two separate ways (called "paths")
all involved devices have kernel device nodes, for example
the path devices /dev/sda and /dev/sdb
for the two paths to the disk,
plus, via device mapper, the multipath map device
/dev/dm-1 for the disk (reachable via either of the two paths).

What does this look like with
"NVMe native multipathing" and/or "NVMe multipathing"?

https://lore.kernel.org/all/3c725e5deaabaaf145f48f2f6fcfdae9f6d41e2e.camel@suse.de/t/
only talks about "devices" in /sys/
but I would like to understand the relationship
between kernel device nodes (i.e. things in /dev/)
and "other devices" like things in /sys/ for
"NVMe native multipathing" and/or "NVMe multipathing"
(see also the probe commands sketched after these questions).

(3)
Third about multipathing versus RAID1:

As far as I know multipathing is about accessing
one and the same disk via several separate ways/paths.
RAID1 is about mirroring data on two different disks,
so each of the two RAID1 disks is accessible via
its own specific way of access, so RAID1 implies
multipath-like behaviour.
But (as far as I know) "RAID1 implicit multipath"
is not what is normally meant when one talks about multipath.

(4)
Fourth about hardware RAID1 versus software RAID1:

When "traditional" non-NVMe hardware RAID1 is used
there is only one disk device node (e.g. /dev/sda)
which behaves the same as a normal single disk.

When "traditional" non-NVMe software RAID1 is used
two normal single disks (e.g. /dev/sda and /dev/sdb)
are combined by the Linux kernel into one
software RAID1 kernel device node (e.g. /dev/md127)
where the RAID1 kernel device node behaves
the same as a normal single disk.

Here it seems that what is called "NVMe hardware RAID1"
does not behave the same as a normal single NVMe disk.

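One way to probe how NVMe native multipathing presents devices on a
given system, which may help with questions (1) and (2) above, is a
quick look at the kernel's NVMe sysfs entries (a sketch, assuming a
reasonably recent kernel with the nvme_core module and, for the last
command, the nvme-cli package):

# 'Y' means the kernel's built-in ("native") NVMe multipathing is enabled:
cat /sys/module/nvme_core/parameters/multipath
# With native multipathing the per-controller path devices (like nvme0c0n1)
# are grouped under an NVMe subsystem, while I/O goes to /dev/nvmeXnY:
ls -l /sys/class/nvme-subsystem/
nvme list-subsys
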
jsmeix added a commit that referenced this issue Jan 22, 2024
…ite protected (#3091)

In lib/write-protect-functions.sh 
let the is_write_protected function report candidate devices
without device node as not write protected
because not all /sys/block devices have a "dev" file
e.g. /sys/block/nvme0c0n1 has no /dev/nvme0c0n1 device node
see #3085
Because the is_write_protected function is meant to identify
write-protected disks and partitions, only candidate devices
with a device node are considered for write protection.
Additionally completely overhauled lib/write-protect-functions.sh
which now contains only the is_write_protected function
(the only one intended to be called by other scripts).
The former helper functions are not needed as functions because
all that happens is a single straightforward sequence of actions.