@jberg5 jberg5 commented Oct 1, 2025

See #139462 for more context.

Short summary: prior to this change, if a child process segfaulted when running in a concurrent.futures.ProcessPoolExecutor, the user would get a BrokenProcessPool exception with no information about which child process terminated or why.

In order to improve the debugging experience, this change attempts to report which child process terminated and with what exit code. For instance, if I have a worker process that segfaults, I'll now see something like:

Lib.concurrent.futures.process._RemoteTraceback: Process 48939 terminated abruptly with exit code -11

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
...
Lib.concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
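
For readers unfamiliar with the exit code in the message above: multiprocessing reports a negative exit code -N when the child was terminated by signal N, so -11 corresponds to a segfault (SIGSEGV) on common platforms. A minimal sketch of decoding such a value follows; describe_exitcode is a hypothetical helper for illustration and is not part of this change.

```python
import signal


def describe_exitcode(exitcode):
    """Return a human-readable description of a multiprocessing exit code.

    Hypothetical helper for illustration; not part of concurrent.futures.
    """
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # A negative value -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"


print(describe_exitcode(-11))
print(describe_exitcode(99))
```

A helper like this could be applied to the exit code quoted in the new message when triaging a BrokenProcessPool.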

bedevere-app bot commented Oct 1, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

python-cla-bot bot commented Oct 1, 2025

All commit authors signed the Contributor License Agreement.

    if p.exitcode:  # Report any nonzero exit codes
        errors.append(f"Process {p.pid} terminated abruptly with exit code {p.exitcode}")
if errors:
    bpe.__cause__ = _RemoteTraceback("\n".join(errors))
Member

Without having swapped in my context on this code... the "cause is not None" case above surrounds the joined value with \n''' joined_value ''' - should we do the same here?

Member

another way to think about this... can this logic just set cause to be this errors list so there's only a single _RemoteTraceback construction?

Author

Good call - initially I was just aiming to modify as little code as possible 😄 but you're right that this is worth a small refactor. Let me know what you think.

Only possible downside is that now the traceback looks a little more funky. Do you know why we have the ''' and the newlines this way?

Lib.concurrent.futures.process._RemoteTraceback: 
'''
Process 54534 terminated abruptly with exit code 99'''
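
To make the layout under discussion concrete, here is a minimal sketch of how a single cause object could produce the output above. It assumes the message is wrapped as a newline, an opening ''', the joined lines, and a closing '''. RemoteTracebackSketch and build_cause are hypothetical stand-ins for illustration, not the code in this PR.

```python
class RemoteTracebackSketch(Exception):
    """Hypothetical stand-in for the private _RemoteTraceback helper."""

    def __init__(self, tb):
        self.tb = tb

    def __str__(self):
        return self.tb


def build_cause(lines):
    """Wrap already-formatted lines in the triple-quote layout.

    Returns None when there is nothing to report, so the caller
    constructs at most one cause object.
    """
    if not lines:
        return None
    return RemoteTracebackSketch("\n'''\n" + "\n".join(lines) + "'''")


cause = build_cause(["Process 54534 terminated abruptly with exit code 99"])
print(f"{type(cause).__name__}: {cause}")
```

Building the text first and constructing the exception once keeps both branches (a known remote cause, or synthesized exit-code messages) on the same formatting path.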

Author

I should add that I'm more than happy to keep the formatting like this. Consistency is good, and I'm sure that there is at least one user out there directly parsing _RemoteTraceback strings whose use case would break if we made any modifications :)

Just want to confirm that this all looks good to you.

Member

@ZeroIntensity ZeroIntensity left a comment

Per the bot, please add a news entry, and please also update the "What's New in Python 3.15" document.

@jberg5 jberg5 requested a review from AA-Turner as a code owner October 2, 2025 21:39
Member

@ZeroIntensity ZeroIntensity left a comment

Just down to docs nitpicks for me. The rest looks good.

Author

jberg5 commented Oct 2, 2025

There was a fun issue where the precommit merge conflict checker thought that a perfectly legitimate string was a problem:

check for merge conflicts................................................Failed
Doc/whatsnew/3.15.rst:680: Merge conflict string '=======' found

I removed one of the = and it was happy again :) (and then added it back after)

Author

jberg5 commented Oct 2, 2025

Thanks @ZeroIntensity, appreciate all the feedback!

Member

@ZeroIntensity ZeroIntensity left a comment

Thanks, this looks good to me. I'll wait for @gpshead as the multiprocessing expert to decide on the weird triple-quotes in the message.

@ZeroIntensity
Member

Hmm, something seems to be causing the tests to hang.

Author

jberg5 commented Oct 6, 2025

Hi @ZeroIntensity and @gpshead! Tests are passing, let me know if there are any other changes you'd like me to make, or if this is good to go.

    cause_str = ''.join(cause)
else:
    # No cause known, synthesize from child process exitcodes
    errors = []
Contributor

@YvesDup YvesDup Oct 7, 2025

Are there really cases where multiple processes fail together? If not, the errors list does not seem necessary. Otherwise, a test would be welcome.

Author

It's possible! Doing

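    import os
    import traceback
    from concurrent.futures import ProcessPoolExecutor, as_completed
    from concurrent.futures.process import BrokenProcessPool

    if __name__ == "__main__":  # guard needed with spawn/forkserver start methods
        with ProcessPoolExecutor(max_workers=2) as executor:
            # Both tasks kill their worker process with a distinct exit code.
            futures = [
                executor.submit(os._exit, 99),
                executor.submit(os._exit, 100),
            ]
            for future in as_completed(futures):
                try:
                    future.result()
                except BrokenProcessPool as e:
                    traceback.print_exception(e)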

sometimes gives me

Lib.concurrent.futures.process._RemoteTraceback: 
'''
Process 84477 terminated abruptly with exit code 100
Process 84478 terminated abruptly with exit code 99'''

But this is basically a race between the subprocesses terminating and when we build the traceback. Testing this in a non-flaky way would be tricky at best.
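
The race can be illustrated without real processes. The sketch below assumes the reporting logic walks a snapshot of the worker processes and keeps those with a truthy exitcode. collect_exit_errors and the SimpleNamespace objects are hypothetical stand-ins for illustration, not the executor's internals.

```python
from types import SimpleNamespace


def collect_exit_errors(processes):
    """Report every process in the snapshot that has a nonzero exit code."""
    errors = []
    for p in processes:
        # exitcode is None while a process is still running and 0 on a
        # clean exit; both are falsy, so only abnormal exits are reported.
        if p.exitcode:
            errors.append(
                f"Process {p.pid} terminated abruptly with exit code {p.exitcode}"
            )
    return errors


# Hypothetical snapshots of the same two workers taken at different moments.
early = [
    SimpleNamespace(pid=84477, exitcode=100),
    SimpleNamespace(pid=84478, exitcode=None),  # not yet terminated
]
late = [
    SimpleNamespace(pid=84477, exitcode=100),
    SimpleNamespace(pid=84478, exitcode=99),
]

print(collect_exit_errors(early))
print(collect_exit_errors(late))
```

Depending on when the snapshot is taken, one or both terminations show up, which is why the number of reported processes varies between runs.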

Author

My original concern when deciding to report every known termination was that we could potentially end up with a "real" failure and a "red herring" failure, and since we can't say for sure which happened first or which was more important, it would be safer to just dump all known failures into the traceback. And the total size of the traceback would be bounded by the number of processes that could terminate at the same time, i.e.

On Windows, max_workers must be less than or equal to 61. If it is not then ValueError will be raised. If max_workers is None, then the default chosen will be at most 61

Contributor

I can confirm that the number of failed processes caught is flaky. Sometimes there is only one.
I wonder whether we should add a brief comment about this. It's up to you.
