Cross arm*-musl hangs when running g-ir-scanner-qemuwrapper #11426

Closed
pullmoll opened this issue May 1, 2019 · 34 comments
Labels
bug Something isn't working

Comments

@pullmoll
Member

pullmoll commented May 1, 2019

As you may have noticed in the commit logs we have a problem with cross arm*-musl trying to run the cross g-ir-scanner with g-ir-scanner-qemuwrapper. It hangs for (literally) hours and does not finish. It's seemingly not a problem of the builders or age of (re-)built packages because it happens for me locally as well and I built all of the dependencies here.

What I know from looking at top when the build hangs is that it seems as if two processes in parallel are running g-ir-scanner-qemuwrapper to scan the introspection info. It does not happen with arm*-gnu cross builds so it has to be something specific to Musl libc or the cross environment for Musl. I wanted to try if it also happens when using a 32bit environment (i686) for cross compiling webkit2gtk for e.g. armv7l-musl to see if it's perhaps a 64 vs. 32 bit build environment issue.

For now I disabled gir for cross webkit2gtk to arm*-musl but of course this makes packages depending on the introspection files of webkit2gtk incomplete or even fail.

In case you find time to test this on your local box, you'd have to add gir to build_options_default in line 43 of srcpkgs/webkit2gtk/template and take a look at the processes running after the binaries were built (i.e. after cmake prints 100%).

I have no idea yet why it works for arm*-gnu but not for arm*-musl as for the most part we solved the cross gobject-introspection problems with *-musl and cross webkit2gtk to aarch64-musl or ppc64-musl work just fine.

@pullmoll pullmoll added the bug Something isn't working label May 1, 2019
@pullmoll
Member Author

pullmoll commented May 2, 2019

There we have the problem: 64 bit to 32 bit cross compiling.
I could successfully build webkit2gtk for armv7l-musl using an i686 environment and qemu-arm-static.
Time again to think about changing the arm* builders to use i686 instead of x86_64.
It would avoid a mess of disabling packages depending on WebKit2-4.0.gir for arm*-musl.
What do @void-linux/void-ops think about this?

@the-maldridge
Member

It's not out of the question, but I'd like to better understand why this is the case. I'm calling it a night here, but I will read any reply in the morning. Can you explain why this makes things better? Also worth remembering that there is no i686-musl, so you'd be crossing from i686 to armv{6,7}l-musl.

@pullmoll
Member Author

pullmoll commented May 2, 2019

I think one explanation why cross compiling to a 32 bit arch from another 32 bit arch works (better than 64 to 32) is that sizes of pointers and long are equal. We will every now and then have some check in gnu-configure or other build styles picking the host's value for these instead of the one for the target as it should be.

Of course in theory it should work for both cases. Somewhere, probably deeply hidden and hard to find, we still have a problem with cross __WORDSIZE == 64 vs. 32. It may be hidden in common/environment/configure/autoconf_cache missing a case or setting it wrong, or ... well, I have no idea.
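
A minimal sketch of the word-size hypothesis above, not a confirmed cause. The helper names wordsize_of and same_wordsize are made up for illustration and are not part of xbps-src; the table only lists architectures mentioned in this thread.

```shell
# Hypothetical helpers: map an architecture name to its word size and
# compare host against target. A "-musl" suffix is ignored.
wordsize_of() {
    case "${1%-musl}" in
        x86_64|aarch64|ppc64) echo 64 ;;
        i686|armv6l|armv7l|ppc) echo 32 ;;
        *) echo unknown; return 1 ;;
    esac
}

# Succeeds when host and target agree on word size, returns 2 when
# either architecture is not in the table above.
same_wordsize() {
    host_bits=$(wordsize_of "$1") || return 2
    target_bits=$(wordsize_of "$2") || return 2
    [ "$host_bits" = "$target_bits" ]
}
```

With this table, same_wordsize i686 armv7l-musl succeeds while same_wordsize x86_64 armv7l-musl fails, which matches the pairs reported as working and hanging so far.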

Cross i686-gnu to e.g. armv7l-musl is no problem. That's what I did and what works for me. I know there is no official i686-musl - I regularly build it here and always did and can test both ways when we hit this kind of problems.

@pullmoll
Member Author

pullmoll commented May 2, 2019

Looking at autoconf_cache we have two types of partially contradicting definitions between arm-common, arm-linux, and musl-linux.

For arm*-gnu the value should be ac_cv_sizeof_off_t=4, but for musl, where there are no 32 bit lseek and similar functions but only lseek64, the value should be ac_cv_sizeof_off_t=8. Which one wins depends on the order in which these definitions are included in common/environment/configure/gnu-configure-args.sh, and I believe it is done in a way that allows musl-linux to override the <arch>-common and <arch>-linux definitions.

But what happens for sizeof(type) where we don't define the value in musl-linux with ac_cv_something=4, or ac_cv_something=8 or whatever the size is? A configure script may or rather will then take the value of the host's sizeof(type) which, in case of x86_64, may be different from the sizeof(type) for the arm* target. I think it's one or more of these cases why it works (better) to cross build from i686 to arm*.
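
The override order and the "not defined anywhere" case can be sketched as follows. This is an illustration only: resolve_cache_var is a hypothetical helper, and it assumes each cache fragment is a plain list of name=value lines that can be read in order. It is not how gnu-configure-args.sh is actually implemented.

```shell
# Hypothetical helper: read cache fragments in the order given, so a
# later fragment overrides an earlier one, then print the final value.
# Pass fragment paths containing a slash so "." does not search PATH.
resolve_cache_var() (
    var=$1
    shift
    for fragment in "$@"; do
        [ -r "$fragment" ] && . "$fragment"
    done
    # "unset" is the risky case: configure has to determine the value
    # itself and may end up with the host's answer.
    eval "printf '%s\n' \"\${$var:-unset}\""
)
```

Called with an arm fragment first and a musl fragment second, a variable defined in both takes the musl value, one defined only in the arm fragment keeps that value, and one defined in neither comes back as unset.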

I see no easy way to make sure that ./configure does not pick the host's values for sizeof(type) where it is supposed to take that of the target. We even do depend on header files being installed for the target but also the host to make some build styles (namely qmake, others?) find some headers they expect, and we hope they will finally use the ones for the target.

@pullmoll
Member Author

pullmoll commented May 2, 2019

I fear that changing to the i686 environment for arm*-musl alone may introduce new problems.
Perhaps we can tag templates which don't cross build from x86_64 to <arch>-musl in some way to tell the builder to use an alternate (i686) chroot instead of the regular (x86_64) one.

I think of something like crossenv=32 with the default being crossenv=64. Then we could for now set this only in webkit2gtk/template for arm*-musl and perhaps also ppc-musl and work around the underlying problem.
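
As a rough sketch of that idea, the builder side could look something like this. Both crossenv and pick_chroot_arch are hypothetical; neither exists in xbps-src, and the mapping to i686 and x86_64 is just the one proposed above.

```shell
# Hypothetical: choose the chroot architecture from a template's
# crossenv setting, defaulting to the regular 64 bit chroot.
pick_chroot_arch() {
    case "${crossenv:-64}" in
        32) echo i686 ;;
        64) echo x86_64 ;;
        *)  echo "unknown crossenv: $crossenv" >&2; return 1 ;;
    esac
}
```

A template that sets crossenv=32 would then get the i686 chroot, and every other template keeps today's behaviour.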

@pullmoll
Member Author

pullmoll commented May 2, 2019

@xtraeme but how should we anticipate which types any of the configure scripts wants to know the sizeof for? This may be sizeof(struct xyz) where struct xyz contains a single item or even an array of the integral, simple types like off_t and then having ac_cv_sizeof_off_t correctly defined is not sufficient. This seems like opening a can of worms, or actually trying to get the worms back in the can.

@ackalker
Contributor

ackalker commented May 2, 2019

If you can get Bear (packaged in Void) to run in the chroot (host arch), you can get a full trace of all commands (with all command arguments) executed during build. Perhaps you can compare a trace from a working build to a failing one, along with the contents of any auto generated files.
To get more than only the compilation (cc/gcc, etc.) commands in the file compile_commands.json, you will need to run bear with verbosity turned up like so:

$ bear -vv make # or other build command

To get the interesting stuff out of the log, do:

$ grep "input was" <logfile> | tac # Yes, really. `bear`'s debug output is in reverse chronological order. Don't ask why, I don't know.

Hope this helps.

@pullmoll
Member Author

pullmoll commented May 2, 2019

For now, to keep the arm*-musl builders going, perhaps we can cross build webkit2gtk manually from i686 and put it into the repo? Otherwise I would (for now) mark webkit2gtk as broken for arm*-musl again to keep the dependent packages (like devhelp now) from erroring out and blocking valid updates.

@pullmoll
Member Author

pullmoll commented May 4, 2019

It seems the underlying problem is fixed. I could now cross build webkit2gtk using the official repo from x86_64 to armv7l-musl. If anyone can confirm this we can remove the broken=.

@newbluemoon
Contributor

I’m currently building it (x86_64-musl --> armv7l-musl). Will test it on my Raspberry Pi when done to make sure that there are no runtime issues and report back.

@newbluemoon
Contributor

newbluemoon commented May 4, 2019

Sorry to report, but the x86_64-musl --> armv7l-musl build doesn’t finish here (with everything up-to-date). Same issue as above: qemu-arm-static just seems to be stuck... :(

@pullmoll
Member Author

pullmoll commented May 4, 2019

Hmm.. I'll verify again later or tomorrow. It worked here with a freshly zapped and binary-bootstrapped environment for x86_64 using packages from alpha.de.repo.voidlinux.org/current. Maybe your repocache is different? Mine was empty before I tried.

@newbluemoon
Contributor

I zapped the masterdir first, too, and binary-bootstrapped from the same repo to have a clean environment. But I’m on a musl-system, not just a musl-chroot, if that’s of any relevance.

@newbluemoon
Contributor

Hmm, maybe I’m not getting something, but in /usr/bin/g-ir-scanner

#!/usr/bin/env bash
# Check if we are running in an xbps-src environment and run the wrapper if that
# is the case.
if [ -n "$XBPS_CROSS_BASE" -a -n "$XBPS_TARGET_MACHINE" -a -n "$XBPS_VERSION" ]; then
	# This prevents g-ir-scanner from writing cache data to $HOME
	export GI_SCANNER_DISABLE_CACHE=1
	
	exec /usr/bin/g-ir-scanner.wrapped \
				 --use-binary-wrapper=/usr/bin/g-ir-scanner-qemuwrapper \
				 --use-ldd-wrapper=/usr/bin/g-ir-scanner-lddwrapper \
				 --add-include-path=${XBPS_CROSS_BASE}/usr/share/gir-1.0 \
				 --add-include-path=${XBPS_CROSS_BASE}/usr/lib/gir-1.0 \
				 "${@//-I\/usr\/include/-I${XBPS_CROSS_BASE}\/usr\/include}"
fi
	
exec /usr/bin/g-ir-scanner.wrapped "$@"

shouldn’t the last line belong to the ‘else’ branch of the if clause? Else there are two execs in the cross case.
Same with g-ir-compiler.

@pullmoll
Member Author

pullmoll commented May 5, 2019

Oh, our builders are x86_64 with glibc AFAIK. And that was what I tried.
I haven't tried cross compiling from x86_64-musl to armv7l-musl yet.

For the g-ir-scanner wrapper I think you are basically right, but then exec won't return anyway and thus it's okay to have a default path not inside an else. AFAIK exec is implicitly noreturn.
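
The control flow can be tried out with harmless commands. This is only a sketch: wrapper_flow is a made-up name, and echo stands in for the real g-ir-scanner.wrapped calls.

```shell
# Mimic the wrapper: an exec inside the if, and a second exec after it.
# A successful exec replaces the shell process, so the second exec is
# only reached when the condition is false.
wrapper_flow() {
    sh -c '
        if [ -n "$1" ]; then
            exec echo "cross path"
        fi
        exec echo "native path"
    ' wrapper_flow "$1"
}
```

Calling wrapper_flow with a non-empty argument prints only "cross path", and with an empty argument only "native path", so the missing else does not lead to two commands being run.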

@newbluemoon
Contributor

I’ll try again with x86_64 --> armv7l-musl today. But my hardware isn’t that new so it’ll take some time. ;)
But I assume you’re using a x86_64-musl chroot, right?

As for the exec, now that you mention noreturn something made “click”. :) I was trying to figure out why there are two qemu instances running, but it seems they’re invoked in the makefile. With -j1 there’s only one instance.

@pullmoll
Member Author

pullmoll commented May 5, 2019

No, I used an x86_64 chroot and I am on a x86_64 glibc system. Just now trying to cross compile from a x86_64-musl chroot with official repo packages. It already succeeded with my local packages.

@newbluemoon
Contributor

Hmm, I tried x86_64 glibc host --> x86_64-musl chroot --> armv7l-musl target with a freshly git-cloned void-packages repo (everything from scratch) and qemu-arm-static just got stuck again.
Same with x86_64-musl host --> x86_64 chroot --> armv7l-musl target (though I think that combination doesn’t really make sense, but just for the sake of it ;)).
Still in progress is x86_64 host --> x86_64 chroot --> armv7l-musl target.

@pullmoll
Member Author

pullmoll commented May 5, 2019

Here also the final test succeeded: x86_64 host, x86_64-musl chroot, cross to armv7l-musl, all with official repo packages. Whatever the difference between your and my environment is, it's probably not in the packages per se. It could be something like a wrong -march=native somewhere and your host's CPU being different from mine (AMD Ryzen 1950X).
I wonder if I should give the builders another try at compiling webkit2gtk...

@newbluemoon
Contributor

I did some more digging:

There is a temporary directory /builddir/webkitgtk-2.24.1/build/Source/WebKit/tmp-<something> (it gets deleted when the build is aborted) with an executable WebKit2-4.0 which is executed basically as

./WebKit2-4.0 --introspect-dump=functions.txt,dump.xml

which yields the introspection data.

Copied to my Raspberry Pi and executed everything is fine. However, when run with qemu-arm-static the output file dump.xml is truncated, or to be more precise, running ./WebKit2-4.0 just doesn’t continue and hangs.

So I chrooted into the masterdir and ran

qemu-arm-static -d unimp -E LD_LIBRARY_PATH="/usr/armv7l-linux-musleabihf/usr/lib" -L "/usr/armv7l-linux-musleabihf" ./WebKit2-4.0 --introspect-dump=functions.txt,dump.xml

manually and got

Unsupported syscall: 389

which is the membarrier syscall which if I recall correctly musl started using with 1.1.22.

So could this be the culprit? But then I have no idea why it’s working for @pullmoll (maybe it’s random?).
For me x86_64 host --> x86_64 chroot --> armv7l-musl target also doesn’t work.
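
To see whether other syscalls are affected as well, the output of a -d unimp run can be reduced to the distinct syscall numbers. A small sketch, assuming every message has the same "Unsupported syscall: N" form as the one quoted above; unsupported_syscalls is a made-up helper name.

```shell
# Read qemu's -d unimp output on stdin and print each unsupported
# syscall number once, in numeric order.
unsupported_syscalls() {
    sed -n 's/^Unsupported syscall: \([0-9][0-9]*\).*$/\1/p' | sort -n -u
}
```

Feed it the captured log of the qemu-arm-static run, for example by redirecting the saved output file into the function.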

@pullmoll
Member Author

pullmoll commented May 5, 2019

It probably is the culprit, however I don't understand why you get to see this Unsupported syscall: 389 message. Do you have a kernel where this is enabled (grep MEMBARRIER /boot/config*)?

@newbluemoon
Contributor

I’m using the latest standard kernel (5.0.12), nothing fancy, and yes it’s enabled.

@pullmoll
Member Author

pullmoll commented May 5, 2019

🤷‍♂️ now I'm out of ideas.

@newbluemoon
Contributor

Maybe we should stop for today and do something entirely different. I found that solutions tend to present themselves when doing nothing. :)

@jnbr
Contributor

jnbr commented May 5, 2019

Our qemu is built without membarrier support. Maybe build qemu with --enable-membarrier and try again.

@newbluemoon
Contributor

Thanks @jnbr, compiling right now... :)

@newbluemoon
Contributor

No, sadly a membarrier-enabled qemu doesn’t work for me either.
When I run the qemu command from above with option qemu-arm-static -strace ... it endlessly prints lines like

mremap(-1277952,4096,8192,0,0,0) = -1 errno=12 (Out of memory)

(though my machine is very far from memory exhaustion).
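
Since the log is endless, a summary per failing syscall is easier to read than the raw output. A sketch, assuming lines of the form shown above; failing_syscalls is a made-up helper, and the optional leading process id it tolerates is an assumption about the -strace format, not something visible in the excerpt.

```shell
# Read -strace output on stdin and count failed calls (= -1 errno=N)
# per syscall name and errno value.
failing_syscalls() {
    sed -n 's/^\([0-9]* \)\{0,1\}\([a-z_0-9]*\)(.*) = -1 errno=\([0-9]*\).*/\2 errno=\3/p' |
        sort | uniq -c
}
```

If mremap with errno=12 dominates the summary, that supports the suspicion that the hang is a retry loop around a failing mremap rather than membarrier alone.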

So maybe it’s not (solely) membarrier. I found this thread with similar errors on the musl mailing list:
https://www.openwall.com/lists/musl/2017/06/21/2
which suggests a bug in qemu. And maybe it is also subject to some features of the hardware used, because it’s working sometimes, apparently.

@newbluemoon
Contributor

@jnbr I think it would also work for you with the standard qemu, because it did for @pullmoll (also Ryzen). I tried on a ~10 year old Athlon II X2. So I guess it is indeed some hardware feature which the builders are probably also lacking.

@pullmoll
Member Author

pullmoll commented May 5, 2019

I somehow suspect a connection to the recently changed transparent hugepage handling in the kernels. I have transparent_hugepage=madvise in /etc/default/grub. I see no usage of madvise(2) in the qemu linux-user code, though.

If it is really an issue with qemu's target_mremap() function when new_size > old_size it could perhaps be the function mmap_reserve() in linux-user/mmap.c:580 which is failing because it cannot find a consecutive range of pages. Just a wild guess, though.

@jnbr
Contributor

jnbr commented May 5, 2019

Forget what I've said before, stupid me forgot the -a Flag ...

@pullmoll
Member Author

pullmoll commented May 5, 2019

Hmm.. this code seems wrong to me, linux-user/mmap.c:726…:

    726         int prot = 0;
    727         if (reserved_va && old_size < new_size) {
    728             abi_ulong addr;
    729             for (addr = old_addr + old_size;
    730                  addr < old_addr + new_size;
    731                  addr++) {
    732                 prot |= page_get_flags(addr);
    733             }
    734         }

The loop with addr is for every byte of the range and not for every page using addr += TARGET_PAGE_SIZE like it is done in other places in the same source. If you mremap() and increase size by several MiB that will create millions of calls to page_get_flags(addr). Or is this really intended here?

Thinking about this and looking at the implementation of page_get_flags() it should suffice to step addr in TARGET_PAGE_SIZE increments and in addition test the highest address old_addr + new_size - 1.
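
To get a feeling for the difference, here is a sketch that lists the addresses such a page-stepped scan would visit. It only illustrates the suggestion above with shell arithmetic and is not a patch for QEMU; page_scan_addrs is a made-up name and the page size is passed in as a parameter.

```shell
# Arguments: old_addr old_size new_size page_size
# Prints one address per page in [old_addr+old_size, old_addr+new_size),
# followed by the highest address of the range.
page_scan_addrs() (
    start=$(( $1 + $2 ))
    end=$(( $1 + $3 ))
    page=$4
    addr=$start
    while [ "$addr" -lt "$end" ]; do
        echo "$addr"
        addr=$(( addr + page ))
    done
    # Also test the last byte, in case stepping from an unaligned start
    # skipped the page it lives on.
    if [ "$end" -gt "$start" ]; then
        echo $(( end - 1 ))
    fi
)
```

Assuming 4096-byte pages, growing a mapping by 4 MiB means 4194304 calls to page_get_flags() with the per-byte loop, while page_scan_addrs 0 4096 4198400 4096 lists 1025 addresses.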

But then all this does not seem to be a possible cause for the ENOMEM error.

@pullmoll
Member Author

pullmoll commented May 7, 2019

To simplify debugging (cross) packages there's now GIR_EXTRA_OPTIONS, which can be specified in a template, e.g. GIR_EXTRA_OPTIONS="-strace", to see what happens in the regular build log.

What is missing, also for GIR_EXTRA_LIBS_PATH, is an unset GIR_EXTRA_… in the right place; otherwise dependencies of a package are built with the same settings for these environment variables. This unset should perhaps be done in common/xbps-src/shutils/common.sh function setup_pkg() somewhere around # Start with a sane environment. What do you think?
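
A minimal sketch of what such a reset could look like. The function name reset_gir_env is hypothetical and this is not the actual setup_pkg() code; it only covers the two variables named in this comment.

```shell
# Hypothetical helper: clear the per-template GIR settings so they do
# not leak into the builds of a package's dependencies.
reset_gir_env() {
    unset GIR_EXTRA_OPTIONS GIR_EXTRA_LIBS_PATH
}
```

Calling it before a template is sourced would mean each package only sees the values its own template sets.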

6 participants
@ackalker @the-maldridge @pullmoll @jnbr @newbluemoon and others