Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behaviour in main and 3.11 #96678

Closed
matthiasgoergens opened this issue Sep 8, 2022 · 36 comments
Closed

Undefined behaviour in main and 3.11 #96678

matthiasgoergens opened this issue Sep 8, 2022 · 36 comments
Labels
3.11 only security fixes 3.12 bugs and security fixes type-bug An unexpected behavior, bug, or error type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@matthiasgoergens
Copy link
Contributor

matthiasgoergens commented Sep 8, 2022

I ran the sanitizers again, and found a few more instances of undefined behaviour, mostly around bit-shifting of signed integers and arithmetic with NULL pointers.

export CC="clang"
export ASAN_OPTIONS=detect_leaks=0
configure --with-assertions --with-address-sanitizer --with-trace-refs --with-undefined-behavior-sanitizer --with-pydebug
nice make -j8
make test

I put some asserts to demonstrate the undefined behaviour into pull requests for main (matthiasgoergens#18) and 3.11 (matthiasgoergens#19).

More information about my environment:

$ clang --version
clang version 14.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
@matthiasgoergens matthiasgoergens added the type-bug An unexpected behavior, bug, or error label Sep 8, 2022
@matthiasgoergens matthiasgoergens changed the title Undefined behaviours in main and 3.11 Undefined behaviours in main and 3.11 Sep 8, 2022
@matthiasgoergens matthiasgoergens changed the title Undefined behaviours in main and 3.11 Undefined behaviour in main and 3.11 Sep 8, 2022
@kumaraditya303
Copy link
Contributor

Can you post a summary of "arithmetic with NULL pointers" here? Thanks

@kumaraditya303 kumaraditya303 added 3.11 only security fixes 3.12 bugs and security fixes type-crash A hard crash of the interpreter, possibly with a core dump labels Sep 8, 2022
@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 8, 2022

@kumaraditya303 In ceval.c, the assert is my addition.

    /* Pack other positional arguments into the *args argument */
    if (co->co_flags & CO_VARARGS) {
        PyObject *u = NULL;
        assert(args != NULL);
        u = _PyTuple_FromArraySteal(args + n, argcount - n);
        if (u == NULL) {
            goto fail_post_positional;
        }
        assert(localsplus[total_args] == NULL);
        localsplus[total_args] = u;
    }

There's also an example in Modules/_testcapimodule.c but that's only a test, I guess.

@matthiasgoergens
Copy link
Contributor Author

#96672 is also vaguely related.

@kumaraditya303
Copy link
Contributor

cc @pablogsal This seems to be introduced in #89419

@pablogsal
Copy link
Member

I am a bit surprised by this because we have a USAN buildbot that has not detected anything:

https://buildbot.python.org/all/#/builders/964

@pablogsal
Copy link
Member

@matthiasgoergens
Copy link
Contributor Author

@pablogsal I added some more info about what compiler I am using.

@pablogsal
Copy link
Member

@pablogsal I added some more info about what compiler I am using.

Can you see if you reproduce these warnings using the exact flags the buildbot is using?

@matthiasgoergens
Copy link
Contributor Author

I'll try that later today.

@matthiasgoergens
Copy link
Contributor Author

@pablogsal Is https://buildbot.python.org/all/#/builders/964/builds/319/steps/5/logs/stdio supposed to show lots of stuff like this?:

UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: SEGV (/usr/lib/libpthread.so.0+0x13702) in raise
==2684733==ABORTING
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==2684777==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x03e80028f769 (pc 0x7f5fc8c93702 bp 0x55ef552eed10 sp 0x7ffcf1f5d2d0 T2684777)
==2684777==The signal is caused by a READ memory access.
    #0 0x7f5fc8c93702 in raise (/usr/lib/libpthread.so.0+0x13702)
    #1 0x55ef552eedcd in faulthandler_raise_sigsegv /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/./Modules/faulthandler.c:1056:5
    #2 0x55ef552eedcd in faulthandler_sigsegv /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/./Modules/faulthandler.c:1072:9
    #3 0x55ef5514c78e in cfunction_call /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Objects/methodobject.c:553:18
    #4 0x55ef550d410a in _PyObject_MakeTpCall /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Objects/call.c:214:18
    #5 0x55ef55239c95 in _PyEval_EvalFrameDefault /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Python/ceval.c
    #6 0x55ef55228221 in _PyEval_EvalFrame /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/./Include/internal/pycore_ceval.h:73:16
    #7 0x55ef55228221 in _PyEval_Vector /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Python/ceval.c:6424:24
    #8 0x55ef550d4952 in _PyVectorcall_Call /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Objects/call.c:257:24

@pablogsal
Copy link
Member

@pablogsal Is https://buildbot.python.org/all/#/builders/964/builds/319/steps/5/logs/stdio supposed to show lots of stuff like this?:

UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: SEGV (/usr/lib/libpthread.so.0+0x13702) in raise
==2684733==ABORTING
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==2684777==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x03e80028f769 (pc 0x7f5fc8c93702 bp 0x55ef552eed10 sp 0x7ffcf1f5d2d0 T2684777)
==2684777==The signal is caused by a READ memory access.
    #0 0x7f5fc8c93702 in raise (/usr/lib/libpthread.so.0+0x13702)
    #1 0x55ef552eedcd in faulthandler_raise_sigsegv /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/./Modules/faulthandler.c:1056:5
    #2 0x55ef552eedcd in faulthandler_sigsegv /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/./Modules/faulthandler.c:1072:9
    #3 0x55ef5514c78e in cfunction_call /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Objects/methodobject.c:553:18
    #4 0x55ef550d410a in _PyObject_MakeTpCall /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Objects/call.c:214:18
    #5 0x55ef55239c95 in _PyEval_EvalFrameDefault /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Python/ceval.c
    #6 0x55ef55228221 in _PyEval_EvalFrame /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/./Include/internal/pycore_ceval.h:73:16
    #7 0x55ef55228221 in _PyEval_Vector /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Python/ceval.c:6424:24
    #8 0x55ef550d4952 in _PyVectorcall_Call /buildbot/buildarea/3.11.pablogsal-arch-x86_64.clang-ubsan/build/Objects/call.c:257:24

Most of these are expected because some tests segfault on purpose (like faulthandler) and other are incompatible with sanitizer (like some ctypes ones) Those are always in subprocesses so as long as they don't happen on the main process that won't make the test suite fail.

@pablogsal
Copy link
Member

Notice that some of the frames involve faulthandler_sigsegv and friends

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 8, 2022

When I run with the same config as the build bot, I also seem to be getting only the SEGV. I also tried adding --with-pymalloc=no to the bot's config, but that doesn't seem to make a difference.

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 8, 2022

#!/bin/bash
set -e
set -u
set -o pipefail
set -x

dir="builds/debug-main-no-pymalloc"
here="$(realpath .)"

mkdir --parent "${dir}"
pushd "${dir}"

export CC="clang"
export LD="clang"
export CFLAGS="-fno-sanitize-recover"
nice "${here}/configure" \
    --with-undefined-behavior-sanitizer \
    --with-pymalloc=no \
    --with-assertions \
    --with-address-sanitizer \
    --with-trace-refs \
    --with-pydebug \

nice make -j8
nice make test

When I run the above build-script on 3.11, I don't even get to make test, because make itself already complains:

./_bootstrap_python /home/matthias/prog/python/cpython-main/Programs/_freeze_module.py abc /home/matthias/prog/python/cpython-main/Lib/abc.py Python/frozen_modules/abc.h
/home/matthias/prog/python/cpython-main/Python/pystate.c:2183:27: runtime error: applying non-zero offset 112 to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/matthias/prog/python/cpython-main/Python/pystate.c:2183:27 in 
make: *** [Makefile:1215: Python/frozen_modules/abc.h] Error 1

(See #96569 for that problem.)

@pablogsal
Copy link
Member

When I run with the same config as the build bot, I also seem to be getting only the SEGV. I also tried adding --with-pymalloc=no to the bot's config, but that doesn't seem to make a difference.

That hints that this only happens when we added either --with-trace-refs or --with-pydebug

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 8, 2022

Here's my full build script that I run from the root of the repository and it gives all the UB warnings I made assertions for:

#!/bin/bash
set -e
set -u
set -o pipefail
set -x

dir="builds/debug-main-no-pymalloc-do-recover-no-leaks"
here="$(realpath .)"

mkdir --parent "${dir}"
pushd "${dir}"

export CC="clang"
export LD="clang"
export ASAN_OPTIONS=detect_leaks=0
export CFLAGS="-fsanitize-recover"
nice "${here}/configure" \
    --with-undefined-behavior-sanitizer \
    --with-pymalloc=no \
    --with-assertions \
    --with-address-sanitizer \
    --with-trace-refs \
    --with-pydebug \

nice make -j8
nice make test

Curiously, export ASAN_OPTIONS=detect_leaks=0 seems to be required to give me all the warnings. Otherwise it just reports some leaks, and not much else.

(I only put export ASAN_OPTIONS=detect_leaks=0 in because I wanted to focus on other problem first, because I suspected some tests deliberately leak memory.)

Btw, this config doesn't seem to be producing the SEGV that we are seeing from the other one. I have no idea why that is the case.

@kumaraditya303
Copy link
Contributor

I think the C compiler optimizes away the undefined behavior in non debug builds hence the non debug ubsan is passing.
It should probably be changed to build in debug mode.

@gpshead
Copy link
Member

gpshead commented Sep 8, 2022

Are the things this points out new, or just stuff we haven't ever noticed until now because we don't do our sanitizer runs --with-pydebug or --with-trace-refs? If these aren't new, I'd defer this blocker to a bugfix.

@pablogsal
Copy link
Member

Are the things this points out new, or just stuff we haven't ever noticed until now because we don't do our sanitizer runs --with-pydebug or --with-trace-refs?

That's what I am trying to understand. It seems that this only triggers with --with-pydebug or --with-trace-refs but this could point to a real problem that just happens to not be there in release mode by chance or because we are lucky. In any case this doesn't seem to happen on 3.10 .

@matthiasgoergens
Copy link
Contributor Author

@gpshead It's a few different instances of UB. I suspect some have been around for a bit, and the one in ceval.c might be new.

(But I'd need to check the history for that, when I'm back at my pc.)

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 10, 2022
(cherry picked from commit 50a70a0)

Co-authored-by: Mark Shannon <mark@hotpy.org>
@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 10, 2022

Btw, I suspect the problem in ceval.c was the most important one of the bunch.

Please check whether the other instances of UB still left deserve to be release blockers?

miss-islington added a commit that referenced this issue Sep 10, 2022
(cherry picked from commit 50a70a0)

Co-authored-by: Mark Shannon <mark@hotpy.org>
@pablogsal
Copy link
Member

I am removing the release-blocker tag for now was the error in ceval.c is fixed. As the buildbot doesn't detect this in release mode this should not block the release anymore.

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 10, 2022

@pablogsal The conclusion that the release should not be blocked might very well be the right course of action, but I am a bit confused by your reasoning?

Release mode wouldn't detect any undefined behaviour at all, as it doesn't run with any sanitizer nor assertions? Or would it?

At most you would get a failed test, if you are really lucky?

Eg left shifting a negative number doesn't suddenly become defined in C, just because we are building with optimizations turned on?

(Nor does left shifting a signed long 1 by 63 places?)

If I remember right from my cursory look at the git history, the remaining instances of UB that I detected are in files that haven't changed recently. So we can just pray that all the compilers for all the platforms we release for keep producing code that works alright in the new release, too.

(And that might sound a bit dismissive, but I think it's probably actually an ok strategy, and that's why I suggested reviewing the release-blocker status.

I just want us to be aware that our strategy is 'hope' in this case.)

@pablogsal
Copy link
Member

but I am a bit confused by your reasoning?

My reasoning is that we have an undefined behaviour sanitiser buildbot that checks release mode that has not found any problem so the risk of any of this leaking into release mode, although non-zero, is likely low. The builder is here:

https://buildbot.python.org/all/#/builders/964

So unless someone explains to me why the failures that only are detected on debug mode can also happen on release mode and why the buildbot is not picking them up, I am going to delay any fixes to 3.11.1.

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 11, 2022

As a practical test, we can take the asserts I added in the PRs mentioned in the text of the issue, convert them into something that triggers in release mode (eg a normal if and a printf or abort) and run them under the same conditions as release mode.

I don't know why the sanitizer doesn't pick any of this up. At least it doesn't pick any of this up when the build bot runs it, neither in release mode nor in the debug mode.

I ran my sanitizer locally with different flags. See above.

(I'm only on my phone right now, so can't run these experiments until later today.)

@pablogsal
Copy link
Member

pablogsal commented Sep 11, 2022

I don't know why the sanitizer doesn't pick any of this up. At least it doesn't pick any of this up when the build bot runs it, neither in release mode nor in the debug mode.

Answering that question is very important, as it may reveal very interesting things such as problems in the running builder or what is the difference in behaviour between your builds and the builder and if the differences matte or not.

neither in release mode nor in the debug mode.

The sanitizer only runs in release mode in the build bots

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 11, 2022

The sanitizer only runs in release mode in the build bots

Oh, thanks. I must have misunderstood something.

Of course, our general point about not actually picking up many instances of UB still stands.

Btw, the config that you linked to that had the sanitizer enabled didn't enable optimisations as far as I can tell, hence I didn't see it as release mode.

@matthiasgoergens
Copy link
Contributor Author

matthiasgoergens commented Sep 12, 2022

@pablogsal @markshannon @kumaraditya303

Good news! I found the rootcause of why my debug build with all sanitisers turned up showed undefined behaviour, but they didn't show up in the build bot.

Most invocations of configure add -fwrapv to the compiler options. But --with-pydebug removes that.

From clang's documentation about -fwrapv

Treat signed integer overflow as two’s complement

GCC is a bit more verbose, but still rather vague:

This option instructs the compiler to assume that signed arithmetic overflow of addition, subtraction and multiplication wraps around using twos-complement representation. This flag enables some optimizations and disables others.

It turns out that arithmetic on the null-pointer also becomes defined with -fwrapv. At least my experiments point in that direction. The only thing I could find is some discussion on the netbsd kernel mailing list that seems to confirm this.

If the above is true, then all the instances of UB that I found are not actually UB in the effective C-dialect we are using for release builds.

Now we have (at least) two options:

  • fix --with-pydebug to also add -fwrapv for consistency.
  • remove -fwrapv from the release builds.

The former is the minimal change and safe. The latter enables a lot of C compiler optimizations, and my experiments running the sanitisers suggests that we are already nearly -fno-wrapv-ready. In general, we might want to look into -fstrict-overflow, too.

I say we should do the former straight away, and in the longer term consider implementing the latter (after running benchmarks and weighing the pros and cons etc).

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 13, 2022
(cherry picked from commit 6ba686d)

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 13, 2022
Automerge-Triggered-By: GH:pablogsal
(cherry picked from commit 81e36f3)

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>
miss-islington pushed a commit that referenced this issue Sep 13, 2022
Automerge-Triggered-By: GH:pablogsal
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 13, 2022
Automerge-Triggered-By: GH:pablogsal
(cherry picked from commit 81e36f3)

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>
pablogsal pushed a commit that referenced this issue Sep 13, 2022
(cherry picked from commit 6ba686d)

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>
miss-islington added a commit that referenced this issue Sep 13, 2022
Automerge-Triggered-By: GH:pablogsal
(cherry picked from commit 81e36f3)

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>
miss-islington added a commit that referenced this issue Sep 13, 2022
Automerge-Triggered-By: GH:pablogsal
(cherry picked from commit 81e36f3)

Co-authored-by: Matthias Görgens <matthias.goergens@gmail.com>
@matthiasgoergens
Copy link
Contributor Author

I am closing this issue for now, because thanks to -fwrapv we don't actually have any UB in the release builds. Only when building --with-pydebug or with an explicit optimization option on the command line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes type-bug An unexpected behavior, bug, or error type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Development

No branches or pull requests

5 participants