Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

weak memcmp symbol from compiler_rt is not overridden when building Python 3.11.0 on Linux and MacOS #13303

Open
williamstein opened this issue Oct 25, 2022 · 11 comments
Labels
bug Observed behavior contradicts documented or intended behavior
Milestone

Comments

@williamstein
Copy link
Contributor

Zig Version

0.10.0-dev.4562+b3cd38ea4

Steps to Reproduce

Building Python from source worked well using the dev versions of zig until a few days ago.

This bug is very easy to reproduce in about 1 minute. I've tested this on x86_64 Linux, aarch64 linux, and aarch64 macos, and the behavior is identical in all cases.

wget https://www.python.org/ftp/python/3.11.0/Python-3.11.0.tar.xz
tar xf Python-3.11.0.tar.xz
cd Python-3.11.0
AR="zig ar" CC="zig cc" ./configure
make
# wait one minute and BOOM:
...
Fatal Python error: pycore_interp_init: failed to initialize importlib

This fails to build with zig 4562+b3cd38ea4.tar.xz (also with 4560) but builds fine with zig 4476+0f0076666 and many previous versions.

Expected Behavior

Build Python successfully, like I did with 4476+0f0076666 and many previous versions.

Actual Behavior

The step that behaves differently in the build is creating _bootstrap_python. You can see what happens as follows:

rm -f _bootstrap_python && make _bootstrap_python

# zig cc     -o _bootstrap_python Modules/getbuildinfo.o Parser/token.o  Parser/pegen.o Parser/pegen_errors.o Parser/action_helpers.o Parser/parser.o Parser/string_parser.o Parser/peg_api.o Parser/myreadline.o Parser/tokenizer.o ...

It is a big linking of all the objects together to make a basic python
interpreter. _bootsrap_python is then run to create Python/deepfreeze/deepfreeze.c, and that file
is autogenerated incorrectly with the _bootsrap_python coming from the latest version of zig. It's
like some C library functions changed to be broken, which is breaking the behavior of the bootstrap_python
program badly.

If I install zig 4476+0f0076666, then do

rm -f _bootstrap_python && make _bootstrap_python

then use this _bootstrap_python to create Python/deepfreeze/deepfreeze.c it is correct and works and the build even finishes. Thus my best guess is there is a change in zig's compiler-rt that is linked in, but I don't know. The problem definitely doesn't involve the compilation stage (which just involves clang anyways), but what is linked in.

Also, I want to emphasize again that this problem is exactly the same on a whole bunch of different OS's and CPU architectures, which is probably helpful for debugging this.

@williamstein williamstein added the bug Observed behavior contradicts documented or intended behavior label Oct 25, 2022
@williamstein
Copy link
Contributor Author

I just checked (it's very easy) and building Python is still broken in exactly the same way with 0.10.0-dev.4587+710e2e7f1

I did an experiment with 0.10.0-dev.4587+710e2e7f1. I took the exact ld line (output by zig -v) and replaced the path to libcompiler_rt.a with the one for zig version 4476+0f0076666. Everything works fine. This proves that a change to libcompiler_rt.a between 4476+0f0076666 and 4562+b3cd38ea4 contains a new bug that significantly changes the behavior of (and breaks) Python.

@williamstein
Copy link
Contributor Author

This bug is caused by @Luukdegram 's change here: cdf7e7d

I reverted exactly that one line change to lib/compiler_rt.zig
then Python builds fine again.

I'm going to patch my local version(s) of Zig so I can continue using and testing the latest version. In the meantime, @Luukdegram can you please reconsider your commit cdf7e7d
more closely, since it significantly breaks vanilla CPython on many platforms, so there is definitely something subtly wrong about it.

@williamstein
Copy link
Contributor Author

💯 fan mail -- the fact that I can just change compiler_rt.zig and immediately work with the modified version is mind blowingly awesome. Zig rocks.

@Luukdegram
Copy link
Sponsor Member

Luukdegram commented Oct 27, 2022

I'm going to patch my local version(s) of Zig so I can continue using and testing the latest version. In the meantime, @Luukdegram can you please reconsider your commit cdf7e7d more closely, since it significantly breaks vanilla CPython on many platforms, so there is definitely something subtly wrong about it.

While I don't mind reconsidering the commit, it has to be for the right reason. This commit simply exposes our version of memcmp in compiler-rt as a weak symbol. I've successfully reproduced the bug as described by you on aarch64 macos, but I believe this is caused by a miscompilation in the memcmp implementation. While theoretically, it should use libc's version, I have confirmed it uses our version instead by changing the implementation slightly. (It using ours, regardless of us linking with libc in this case is a different subject). I'd like to resolve this bug for you, but I'm afraid I do not have any free time to fix this before the 0.10 release. Perhaps @andrewrk has an opinion on whether to revert the commit or not.

@andrewrk andrewrk added zig cc Zig as a drop-in C compiler feature regression It worked in a previous version of Zig, but stopped working. labels Oct 27, 2022
@williamstein
Copy link
Contributor Author

Thanks for looking into this and reproducing the bug!! You guys of course have a vastly deeper understanding of zig than I do.

I'd like to resolve this bug for you, but I'm afraid I do not have any free time to fix this before the 0.10 release.

Understood. In my project I'm patching the zig source code to resolve this bug, and that's working 100% fine for me for now, so at this point there is no urgency from me. Of course, longterm I hope zig can be used to build Python for other people, since Python is a very popular language, and zig's packaging of clang is simply amazing.

@andrewrk andrewrk added this to the 0.10.0 milestone Oct 27, 2022
@andrewrk
Copy link
Member

There are two things I want to investigate here:

  1. I suspect there is a bug in memcmp:

pub fn memcmp(vl: ?[*]const u8, vr: ?[*]const u8, n: usize) callconv(.C) c_int {
@setRuntimeSafety(false);
var index: usize = 0;
while (index != n) : (index += 1) {
const compare_val = @bitCast(i8, vl.?[index] -% vr.?[index]);
if (compare_val != 0) {
return compare_val;
}
}
return 0;
}

The following implementation is nicer and has different machine code than the above version, which happens to match musl libc's machine code for x86_64 when compiled with clang:

pub fn memcmp(vl: [*]const u8, vr: [*]const u8, n: usize) callconv(.C) c_int {
    var i: usize = 0;
    while (i < n) : (i += 1) {
        const l_big: c_int = vl[i];
        const r_big: c_int = vr[i];
        if (l_big != r_big) return l_big -% r_big;
    }
    return 0;
}

So I suspect it will solve the problem in this issue. However we should come up with a test case that triggers the faulty behavior before considering the case closed.

  1. The memcmp symbol that zig provides in compiler_rt has weak linkage, and therefore supposed to be replaced by the memcmp symbol from libc during linking. So why wasn't it replaced by libc when building Python? Another mystery to solve before closing this issue.

andrewrk added a commit that referenced this issue Oct 27, 2022
See the new test case - this fails in the previous implementation.

See #13303
andrewrk added a commit that referenced this issue Oct 28, 2022
See the new test case - this fails in the previous implementation.

See #13303
@andrewrk
Copy link
Member

andrewrk commented Oct 28, 2022

Alright I verified that this did indeed solve the original problem of building Python with zig cc. So now only the second mystery remains. However that mystery is not as urgent, so I will change the milestone.

In order to verify that this issue can be closed, one should be able to edit compiler_rt, replacing zig's memcmp implementation with @breakpoint(); unreachable; and then proceed to build Python with zig cc as indicated above. The libc being used should have a memcmp symbol that is strong and thereby overrides the weak one that has the trap.

@andrewrk andrewrk removed zig cc Zig as a drop-in C compiler feature regression It worked in a previous version of Zig, but stopped working. labels Oct 28, 2022
@andrewrk andrewrk changed the title building Python 3.11.0 is broken on Linux and MacOS with the latest version of zig 4562+b3cd38ea4, but worked with 4476+0f0076666 weak memcmp symbol from compiler_rt is not overridden when building Python 3.11.0 on Linux and MacOS Oct 28, 2022
@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Oct 28, 2022
@mikdusan
Copy link
Member

So now only the second mystery remains.

yeesh. It looks like libc (at least on archlinux, x86_64) doesn't define strong symbols.

(output trimmed)

nm -D /usr/lib/libc.so.6 | egrep 'memcpy|sleep'
00000000000a0d30 T memcpy@GLIBC_2.2.5
0000000000099c70 i memcpy@@GLIBC_2.14
00000000000d21d0 W sleep@@GLIBC_2.2.5

From ld man page:

"I" The symbol is an indirect reference to another symbol.
"i" ... For ELF format files this indicates that the symbol is an indirect function.

and from my tests, it seems that during build-time linking the symbol resolution for i is treated similar to W - that is, order is important amongst objects on link line. Eg. a FIRST.a symbol before a SECOND.so of equivalent linkage andFIRST.a is linked in.

*tidbit: apparently if you have no strong linkage of a symbol, but multiple weak linkages, the tie breaker is the SIZE, and presumably if SIZE are all same, then first one found on command line.

I hazard a guess the system libc did this approach to allow for certain overrides using strong symbols, regardless of object ordering and this precludes zig from using weak linkage in the absence of libc.

As an aside, I did verify that FIRST.a with weak symbol and SECOND.so with strong symbol (emphasis on archive vs dylib) the strong one is dynamically linked.

@williamstein
Copy link
Contributor Author

Tiny note: I only reported this due to aarch64-linux binaries of master being at https://ziglang.org/download/, but they seem to be discontinued now. In case somebody is curious, there's a lot of value in those binaries, and thanks to whoever was making them before.

@Vexu
Copy link
Member

Vexu commented Nov 9, 2022

The aarch64 CI makes those tarballs automatically but it mysteriously died three days ago a65ba6c

@williamstein
Copy link
Contributor Author

Thanks for the update. I tried signing up for their service to see if I could understand what is going on and got this very not-encouraging message:

image

I'll post a message here if I end up setting up something on my own hardware. Evidently, it's possible to run that same drone.io service on your own hardware according to https://docs.drone.io/server/overview/, but you have to put everything on the public internet, which complicates things substantially for me at least.

Evidently, Github still doesn't have aarch64 linux support yet: actions/runner-images#5631

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior
Projects
None yet
Development

No branches or pull requests

5 participants