Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovery system libraries not consistent (e.g. when /lib64/noelision/libpthread is used kernel panics because init fails) #1494

Closed
linuxstar2017 opened this issue Sep 15, 2017 · 26 comments
Assignees
Labels
bug The code does not do what it is meant to do fixed / solved / done support / question
Milestone

Comments

@linuxstar2017
Copy link

linuxstar2017 commented Sep 15, 2017

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • rear version (/usr/sbin/rear -V):
l9995988:~ # rpm -qa | grep -i rear
rear-1.18-3.x86_64
  • OS version (cat /etc/rear/os.conf or lsb_release -a):
    SLES12
9995988:~ # uname -a
Linux l9995988 4.4.49-92.11.1.12564.2.PTF.1027273-default #1 SMP Thu Mar 9 10:34:40 UTC 2017 (6122209) x86_64 x86_64 x86_64 GNU/Linux
  • rear configuration files (cat /etc/rear/site.conf or cat /etc/rear/local.conf):
l9995988:~ # cat /etc/rear/local.conf
MODULES=af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock coretemp crct10dif_pclmul crc32_pclmul drbg ansi_cprng aesni_intel aes_x86_64 ppdev lrw gf128mul glue_helper ablk_helper cryptd i2c_piix4 joydev pcspkr shpchp mptctl vmw_balloon e1000 vmw_vmci battery parport_pc parport acpi_cpufreq fjes processor ac xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc32c_intel serio_raw drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi drm scsi_transport_spi mptscsih mptbase floppy ata_piix ahci libahci libata button dm_mirror dm_region_hash dm_log sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 
BACKUP=NSR
OUTPUT=ISO
ISO_DIR=/var/lib/rear/output
ISO_PREFIX=t32-rescue-l9995988
ISO_VOLID=t32-rescue-l9995988
AUTOEXCLUDE_PATH=( /media /tmp )
AUTOEXCLUDE_AUTOFS=y
AUTOEXCLUDE_MULTIPATH=y
AUTOEXCLUDE_DISKS=y
NSR_RETENTION_TIME="1 month"
NSR_POOL_NAME=REAR
COPY_AS_IS=(${COPY_AS_IS[@]} /lib64/noelision/)
  • Are you using legacy BIOS or UEFI boot?
    legacy BIOS
  • Brief description of the issue:
    ReaR ISO recovery image leads to a kernel kernel panic.
    /init: error while loading shared library: libpthread.s0.0: cannot open shared object file
    Problem can be reproduced on physical and virtual servers and also
    with ReaR version 2.x, found on OBS
    Even we adapted the rear conf to include /lib64/noelision (it is actually copied over),
    libpthread is not found
  • Work-around, if any:
    Remove /lib64/noelision from /etc/ld.so.conf
    Execute ldconfig
    Create rescue ISO
@jsmeix jsmeix self-assigned this Sep 15, 2017
@jsmeix
Copy link
Member

jsmeix commented Sep 15, 2017

@linuxstar2017
I don't know about noelision stuff (and quick Googling did not
show something explanatory - only various bug reports).

Nevertheless I try some guesswork:
On my SLES12 system I have

e205:~/rear.master # find /lib64 | wc -l
234

but in the matching ReaR recovery system I have only

RESCUE e205:~ # find /lib64 | wc -l
71

In general the ReaR recovery system is by default very minimal
so that this or that additional stuff might be missing which
leads to this error.

@linuxstar2017
could you try if

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/ )

helps?
With that I get in my recovery system:

RESCUE e205:~ # find /lib64 | wc -l
230

That looks better but it is still 4 files less
that in the original system which are those links

/lib64/libfuse.so.2 -> /usr/lib64/libfuse.so.2
/lib64/libfuse.so.2.9.3 -> /usr/lib64/libfuse.so.2.9.3
/lib64/libss.so.2 -> /usr/lib64/libss.so.2
/lib64/libss.so.2.0 -> /usr/lib64/libss.so.2.0

It seems there is a general problem
with links and COPY_AS_IS ...
but I guess that is not related to this issue here.

@linuxstar2017 onyl FYI
you might have a possibly severe syntax issue in your

COPY_AS_IS=(${COPY_AS_IS[@]} /lib64/noelision/)

In almost all cases one must use quoted "${arr[@]}", cf.
#1068
and see in default.conf

# Some variables are actually bash arrays and should be treated with care.
# Use VAR=() to set an empty array.
# Use VAR=( "${VAR[@]}" 'value' ) to add a fixed value to an array.
# Use VAR=( "${VAR[@]}" "$var" ) to add a variable value to an array.
# Whether or not the latter case works as intended depends on when and
# how "$var" is set and evaluated by the Relax-and-Recover scripts.
# In general using ${VAR[*]} is problematic and using ${VAR[@]} without
# double-quotes is also problematic, see 'Arrays' in "man bash" and
# see https://github.com/rear/rear/issues/1068 for some examples.

@jsmeix
Copy link
Member

jsmeix commented Sep 15, 2017

@linuxstar2017
can you explain to me how I could reproduce it
on a SLES12 virtual QEMU/KVM machine?
On my current SLES12 virtual QEMU/KVM machine
I do not have any noelision in my /etc/ld.so.conf.

@linuxstar2017
Copy link
Author

@jsmeix

Sure ...
(on my lab server)

denbvslnxs1:/lib64/noelision # ls
libpthread-2.22.so libpthread.so.0
denbvslnxs1:/lib64/noelision # rpm -q --whatprovides /lib64/noelision/libpthread-2.22.so
glibc-2.22-61.3.x86_64

Normally, you don't need it !

But in our case ..

/etc/ld.so.conf need to have the "noelision" library path at first to prevent
application crashes, which are caused by bad coded applications (like Legato Networker's
nsrexecd) and lead to a double-unlock.

I sent you in parallel a mail (in German) with all the background ...

@jsmeix
Copy link
Member

jsmeix commented Sep 15, 2017

@linuxstar2017
I also have the noelision library in /lib64/noelision/
but I do not have any noelision in my /etc/ld.so.conf
so that I like to know what exact "noelision" library path
you have in your /etc/ld.so.conf that might help me
to reproduce it - preferably just post your whole
/etc/ld.so.conf here so that I do not need to guess.

Additionally to really reproduce it it seems one needs
"a badly coded application" which requires the noelision library.
Do you know a free software that requires the noelision library?

@linuxstar2017
Copy link
Author

linuxstar2017 commented Sep 15, 2017

@jsmeix

l9995988:~ # cat /etc/ld.so.conf
/lib64/noelision
/usr/local/lib64
/usr/local/lib
include /etc/ld.so.conf.d/*.conf
# /lib64, /lib, /usr/lib64 and /usr/lib gets added
# automatically by ldconfig after parsing this file.
# So, they do not need to be listed.

I have no clue of any free sw, which requires
this .. but for me, you don't need that step
of verification ... the original issue need just
to be fixed ... to do no kernel panic anymore
and boot from recovery ISO.

Tests for the EMC Networker Recovery need
to be done anyway later by the customer,
as he owns the license and EMC is the original
cause of the issues.

Just IMHO

@jsmeix
Copy link
Member

jsmeix commented Sep 15, 2017

@linuxstar2017
thanks - and have a nice weekend!

@jsmeix
Copy link
Member

jsmeix commented Sep 18, 2017

I can reproduce it and
I know a working workaround with the code as is and
I know the solution (which needs a fix in a script):

I use always

KEEP_BUILD_DIR="yes"

so that I can 'chroot' into the ReaR recovery system
directly on the system where I made it with "rear mkrescue"
without the need to boot the ReaR recovery system
on another (virtual) machine only to do some tests
in the ReaR recovery system which is impossible
when the ReaR recovery system fails to come up
like here when the kernel panics because init fails.

I can reproduce it with

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/noelision/ )

I make a ReaR recovery system and
'chroot' into it and do some tests therein:

e205:~/rear.master # usr/sbin/rear -d -D mkrescue
...
Creating recovery/rescue system initramfs/initrd
...
You should also rm -Rf /tmp/rear.mqtc7xtv52uuRgu

e205:~/rear.master # chroot /tmp/rear.mqtc7xtv52uuRgu/rootfs/

bash-4.3# export LC_ALL=C LANG=C

bash-4.3# ldd /bin/sort | grep pthread
grep: error while loading shared libraries: libpthread.so.0: cannot open shared object file: No such file or directory

bash-4.3# echo -e 'foo\nbar\nbaz' | sort
sort: error while loading shared libraries: libpthread.so.0: cannot open shared object file: No such file or directory

The working workaround without code changes is

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/ )

cf. "could you try if ... helps" in my above
#1494 (comment)

With that I get:

e205:~/rear.master # usr/sbin/rear -d -D mkrescue
...
Creating recovery/rescue system initramfs/initrd
...
You should also rm -Rf /tmp/rear.ICBQgUUoYcvDCap

e205:~/rear.master # chroot /tmp/rear.ICBQgUUoYcvDCap/rootfs/

bash-4.3# export LC_ALL=C LANG=C

bash-4.3# ldd /bin/sort | grep pthread
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc6a7267000)

bash-4.3# echo -e 'foo\nbar\nbaz' | sort
bar
baz
foo

The solution (which needs a fix in a ReaR script) is
to run 'ldconfig' inside the ReaR recovery system before
"Creating recovery/rescue system initramfs/initrd ..."

e205:~/rear.master # chroot /tmp/rear.mqtc7xtv52uuRgu/rootfs/

bash-4.3# export LC_ALL=C LANG=C

bash-4.3# ldconfig -v | egrep 'noelision|pthread'
...
/lib64/noelision:
        libpthread.so.0 -> libpthread-2.22.so
bash-4.3# ldd /bin/sort | grep pthread
        libpthread.so.0 => /lib64/noelision/libpthread.so.0 (0x00007f295567a000)

bash-4.3# echo -e 'foo\nbar\nbaz' | sort
bar
baz
foo

@linuxstar2017
verify with your customer that

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/ )

also works for him with his particular backup program
(I cannot test if his particular backup program also works).

@jsmeix jsmeix added waiting for info bug The code does not do what it is meant to do labels Sep 18, 2017
@jsmeix jsmeix changed the title /lib64/noelision/libpthread not found on recovery boot, even directory was added ReaR recovery system libraries not consistent (e.g. when /lib64/noelision/libpthread is used kernel panics when booting recovery system because init fails) Sep 18, 2017
@jsmeix jsmeix changed the title ReaR recovery system libraries not consistent (e.g. when /lib64/noelision/libpthread is used kernel panics when booting recovery system because init fails) Recovery system libraries not consistent (e.g. when /lib64/noelision/libpthread is used kernel panics when booting recovery system because init fails) Sep 18, 2017
@jsmeix jsmeix changed the title Recovery system libraries not consistent (e.g. when /lib64/noelision/libpthread is used kernel panics when booting recovery system because init fails) Recovery system libraries not consistent (e.g. when /lib64/noelision/libpthread is used kernel panics because init fails) Sep 18, 2017
@jsmeix jsmeix added this to the ReaR v2.3 milestone Sep 18, 2017
@jsmeix
Copy link
Member

jsmeix commented Sep 18, 2017

At the end of
usr/share/rear/build/GNU/Linux/390_copy_binaries_libraries.sh
there is

#ldconfig $v -r "$ROOTFS_DIR" >&2 || Error "Could not configure libraries with ldconfig"

which would have helped in this case if it was enabled.

With this line enabled and

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/noelision/ )

everything works for me.

@linuxstar2017
verify with your customer that enabling the "#ldconfig ..."
line at the end of build/GNU/Linux/390_copy_binaries_libraries.sh
is sufficient so that it also works with only

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/noelision/ )

Because - as ususal :-( - there is not any comment that explains
why the 'ldconfig' is disabled in 390_copy_binaries_libraries.sh
I need to do forensics - but fortunately we have Git where

git log -p --follow usr/share/rear/build/GNU/Linux/390_copy_binaries_libraries.sh

shows that it was disabled via
d62a555
which points to
#772
and therein the reason behind is described in
#772 (comment)
and the commit
d62a555
shows that the intent was to run 'ldconfig' in the booted
recovery system but now as 'ldconfig -X' (WHY now the '-X'?)
which is wrong because now it tries to boot the
recovery system with inconsistent libraries configuration
which can badly fail as in this case.

As far as I understand it the actually right solution seems
to create a consistent libraries configuration at build-time
of the recovery system with right confing files in the
$ROOTFS_DIR/etc/ld.so.conf.d/ directory and
Error out if that fails because that could mean
the recovery system may fail to boot.

jsmeix referenced this issue Sep 18, 2017
  commenting out ldconfig line
* added new script rescue/GNU/Linux/55_copy_ldconfig.sh:
  copy the /etc/ld.so.conf* stuff to ROOTFS
* added new script skel/default/etc/scripts/system-setup.d/01-run-ldconfig.sh:
  run ldconfig -X before dhclient gets started at boot time
Fix for issue #772
@schlomo
Copy link
Member

schlomo commented Sep 18, 2017

We actually have usr/share/rear/build/default/980_verify_rootfs.sh which does a chroot test.

It seems to me that we should improve this test to be more thorough. At the moment it only checks bash which apparently does not have this problem. Can you maybe expand the test there to be more thorough and to also check for such problems?

@linuxstar2017
Copy link
Author

Feedback:

Perfect, that change solved the issue ! Server is now recovering ....

Many thanks for your help, will now update the SR.....

@jsmeix
Copy link
Member

jsmeix commented Sep 19, 2017

@linuxstar2017
please be explicit so that I do not need to guess.
If you do not invest your time to write explicit information
others may not spend their time to investigate your issues.
What change solved the issue?

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/ )

or

COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/noelision/ )

plus in build/GNU/Linux/390_copy_binaries_libraries.sh

ldconfig $v -r "$ROOTFS_DIR" >&2 || Error " ..."

or did even both changes solve the issue?
In the latter case what was the preferred way and why?
I need such feedback to better understand user expectations.

@jsmeix
Copy link
Member

jsmeix commented Sep 19, 2017

I reopen this issue because we still have a bug in ReaR
that needs to be fixed.

@schlomo
I will enhance build/default/980_verify_rootfs.sh
but that is of secondary importance.

First and foremost I need some help how to implement correctly
that the libraries in the recovery system are consistent because
plain 'ldconfig $v -r "$ROOTFS_DIR"' seems to have
unwanted side-effects #772

@jsmeix jsmeix reopened this Sep 19, 2017
@linuxstar2017
Copy link
Author

@jsmeix

Sorry, I was explicit ... but agreed, only in the SR solution comment:

ReaR adapted as per below:

File:
/usr/share/rear/build/GNU/Linux/39_copy_binaries_libraries.sh: ldconfig $v -r "$ROOTFS_DIR" >&2 || Error "Could not configure libraries with ldconfig"
==> (line uncommented)

File:
/etc/rear/local.conf: COPY_AS_IS=( "${COPY_AS_IS[@]}" /lib64/noelision/ )
==> (/lib64/noelision added)

@schlomo
Copy link
Member

schlomo commented Sep 19, 2017

@jsmeix thanks. ATM I only can offer a wild guess: The way how ReaR copies libraries to the standard paths can mess up some setups to put "override" libraries in special paths. To be 100% correct we might have to:

  • Take ld.so.conf* as it is
  • replicate the library directories when we copy libs
  • Have special code to handle known special cases

@gdha
Copy link
Member

gdha commented Sep 19, 2017

@linuxstar2017 are you really sure it did not work with /usr/share/rear/build/GNU/Linux/39_copy_binaries_libraries.sh and leaving #ldconfig $v -r "$ROOTFS_DIR" >&2 || Error "Could not configure libraries with ldconfig" ==> (line commented)
It worked like that for 1,5 year already with commented line. I suspect the missing /lib64/noelision/ was the real reason.
However, we need to be sure.

@linuxstar2017
Copy link
Author

@gdha
At least this is , what we tried/tested .... after un-commenting the ldconfig line, it worked,
before not.

The addition of the library path was days before ... it was copied over (verified),
but failed nevertheless....

@jsmeix
Copy link
Member

jsmeix commented Sep 19, 2017

@linuxstar2017
I cannot read a "SR solution comment" and even more important:
Nobody else from ReaR upstream can.

@jsmeix
Copy link
Member

jsmeix commented Sep 19, 2017

@gdha
you can reproduce it when you have
/lib64/noelision/libpthread* libraries
on your system.

Add a topmost line
/lib64/noelision
to your /etc/ld.so.conf cf.
#1494 (comment)
and then do a 'ldconfig' so that you get on your system

# ldconfig -v | egrep 'libpthread|noelision'
...
/lib64/noelision:
        libpthread.so.0 -> libpthread-2.22.so
        libpthread.so.0 -> libpthread-2.22.so

# ldd /bin/sort | grep pthread
        libpthread.so.0 => /lib64/noelision/libpthread.so.0

I use /bin/sort as an example for a binary that is
linked with libpthread which is now linked with
the noelision version of libpthread.

Then reproduce it as I described in
#1494 (comment)

It worked like that for 1,5 year already with commented line
because for 1,5 year already no ReaR user uses the
noelision version of libpthread.

Simply put:
As soon as a special libraries configuration is used,
the libraries in the recovery system are likely inconsistent
and when a library is involved where init is linked with,
the recovery system fails to boot with kernel panic
because init fails.

@gdha
Copy link
Member

gdha commented Sep 19, 2017

@jsmeix @linuxstar2017 Thank you both for the clear explanation of the issue - now I'm following

jsmeix added a commit that referenced this issue Sep 19, 2017
…y_at_the_end_of_copy_binaries_libraries

Run ldconfig non mandatory at the end
of 390_copy_binaries_liraries.sh
to get a consitent libraries configuration
in the recovery system to avoid issues like
#1494
but do not treat a ldconfig failure as fatal
(only report the failure but do not error out)
so that it could still work for special cases as
#772
@jsmeix
Copy link
Member

jsmeix commented Sep 19, 2017

With #1502 merged
issues like this one should now be avoided.

@jsmeix jsmeix closed this as completed Sep 19, 2017
jsmeix added a commit that referenced this issue Sep 26, 2017
…overy_system_related_to_issue_1494

Verify required libraries in recovery system, cf.
#1494 (comment)
Now the recovery system tests in build/default/980_verify_rootfs.sh
are mandatory. It is an Error if the tests cannot be run because
the filesystem of the ROOTFS_DIR is mounted 'noexec'.
For programs (i.e. files in a .../bin/... or .../sbin/... directory)
a missing library in the recovery system is an Error.
@jsmeix
Copy link
Member

jsmeix commented Sep 26, 2017

The recovery system libraries are still not fully consistent,
see #1518 but
since #1514 is merged
for programs (i.e. files in a .../bin/... or .../sbin/... directory)
a missing library in the recovery system is now an Error.

@linuxstar2017
Copy link
Author

O.K. let me ask more concrete ... what my customer need to know ....

When is the SUSE SLES-supported ReaR package able to work
non-modfied ? As a consequence, customer expects a PTF or
fix going upstream.

If no ETA or PTF (customer would challenge, why ?), what according
to your experience, need the customer to implement to have current, non-working
situation corrected ?

For the given scenario ... including /lib64/noelision, a bootable recovery image and
working Networker recovery at the end .. pls. keep in mind, this is the customers
recovery environment, they rely on !

Many thanks ....

@jsmeix
Copy link
Member

jsmeix commented Sep 27, 2017

@linuxstar2017

The fix is upstream:
#1502
#1494 (comment)
6afc9a0

I do not understand why
"customer ... have current, non-working situation"
regardless of your
#1494 (comment)
"that change solved the issue ! Server is now recovering"
Why cannot the customer adapt his ReaR scripts?
That's why ReaR is implemented as bash scripts, read
"Disaster recovery with Relax-and-Recover (ReaR)" in
https://en.opensuse.org/SDB:Disaster_Recovery
Regarding the upstream fix
6afc9a0
When the customer uses an older ReaR version
there is no LogPrintError function.
In this case use the LogPrint function instead.

In general:

The public ReaR upstream GitHub is
neither the right place for questions about SUSE's
(or any company) official customer support handling
nor for lecturing what people should "keep in mind"
when working as ReaR upstream maintainers or contributors
(where most of them do that even on a voluntary base).

The public ReaR upstream GitHub is the right place
to report the plain facts about a ReaR issue
to get the issue analyzed and ideally even fixed
via direct communication with people who are here
as ReaR upstream maintainers and contributors.
This way an issue could get solved in a shorter time
than via an indirectly working official company support.

The public ReaR upstream GitHub does not replace
an official company support but it can complement it
by providing an efficient direct communication with the
ReaR upstream maintainers and contributors plus
either fixes so that all official company support teams
could get the fixes from the public ReaR upstream GitHub
or alternatively by providing a public visible explanation
why a particular issue cannot be fixed in ReaR
(e.g. when the cause of an issue is not "inside" ReaR).

@gdha
Copy link
Member

gdha commented Sep 27, 2017

@linuxstar2017 As @jsmeix said for ReaR packages from SLES itself you should open a software case at SuSE support center, and do not bother us with these kind of questions as we have nothing to say what SuSE does or has in the pipe-line with ReaR in their products.
We are volunteers working on public ReaR upstream project and cannot speak for RHEL, SuSE, Ubuntu or what-ever Linux distro.

@linuxstar2017
Copy link
Author

@gdha @jsmeix
Sorry (as explained already to jsmeix) my fault ....
Just forget it ... it's already channeled again the right way ...

@jsmeix
Copy link
Member

jsmeix commented Oct 4, 2017

With #1521 merged
the whole binaries and libraries copying code is now
cleaned up and simplified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The code does not do what it is meant to do fixed / solved / done support / question
Projects
None yet
Development

No branches or pull requests

4 participants