@ryv-odoo ryv-odoo commented Oct 30, 2025

When a new thread is started (Thread.start()), the current thread waits for the new thread's _started signal (self._started.wait()). If the new thread doesn't have enough memory, it might crash before signaling its start to the parent thread, causing the parent thread to wait indefinitely (still in Thread.start()).
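A minimal sketch of the handshake described above, not the actual `threading` internals: a plain `threading.Event` stands in for `_started`, and an early `return` stands in for the crash. The helper name `start_handshake` and the timeout are assumptions added only so the example terminates; the real `start()` waits without a timeout.

```python
import threading


def start_handshake(child_crashes: bool, timeout: float = 0.2) -> bool:
    """Return True if the simulated child signaled its start in time."""
    started = threading.Event()  # stands in for Thread._started

    def child() -> None:
        if child_crashes:
            # Stands in for a crash (e.g. MemoryError) before the signal.
            return
        started.set()

    worker = threading.Thread(target=child)
    worker.start()
    worker.join()
    # The real start() waits with no timeout, so an event that is never
    # set would block the parent forever. The timeout here is illustrative.
    return started.wait(timeout)
```

With `child_crashes=True` the wait can only end through the timeout, which is the situation the parent thread is stuck in when no timeout exists.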

To fix the issue, remove the _started Python attribute from the Thread class and move the logic to the C level (PyEvent). A flaw of this method is that the threading module will still keep the zombie thread in the _limbo dictionary.

We also change Thread._delete() to use pop to remove the thread from _active, as there is no guarantee that the thread exists in _active[get_ident()], thus avoiding a potential KeyError. This can happen if _bootstrap_inner crashes before _active[self._ident] = self executes. We use self._ident because we know set_ident() has already been called.

Moreover, remove the old comment in _delete because _active_limbo_lock became reentrant in commit 243fd01.

Not sure whether this fix needs to be (or can be) backported.

@bedevere-app

bedevere-app bot commented Oct 30, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot

python-cla-bot bot commented Oct 30, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@YvesDup
Contributor

YvesDup commented Nov 2, 2025

IMO, there is a simpler fix in the finally part of the _bootstrap_inner method as below:

diff --git a/Lib/threading.py b/Lib/threading.py
index 4ebceae702..b5fc035b7c 100644
--- a/Lib/threading.py
+++ b/Lib/threading.py
@@ -1076,12 +1076,17 @@ def _bootstrap_inner(self):
             except:
                 self._invoke_excepthook(self)
         finally:
+            # The code before the `self._started.set()` instruction
+            # could raise an exception. We have to check here and
+            # set the Event if necessary.
+            if not self._started.is_set():
+                self._started.set()
             self._delete()

Your change in the _delete method looks good to me.
I have run dedicated tests including your new one with this fix, all are green.
It is up to you to modify your fix if you agree.
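For readers who want to try the idea in isolation, a rough sketch of the `finally` guard on a simulated bootstrap follows. This is not the real `_bootstrap_inner`; the function name, the `fail_before_signal` flag, and the return value are assumptions made for illustration.

```python
import threading


def bootstrap_inner(started: threading.Event, fail_before_signal: bool) -> bool:
    """Simulated bootstrap; returns True only if the body really ran."""
    ran = False
    try:
        if fail_before_signal:
            raise MemoryError("simulated allocation failure")
        started.set()
        ran = True
    except MemoryError:
        pass  # the real method would invoke the excepthook here
    finally:
        # The guard from the diff: release any waiter even on failure.
        if not started.is_set():
            started.set()
    return ran
```

In this sketch the event ends up set in both cases, so a waiter is released either way, but the event alone no longer says whether the body actually ran.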

@ryv-odoo
Author

ryv-odoo commented Nov 3, 2025

Hello @YvesDup , thank you for your suggestion.

IMHO, your proposal is actually less robust: Python can raise a MemoryError before _bootstrap_inner is called at all, which leads to the same issue. That's why the "recovery" code is added on the parent thread. In fact, my script (the one that decreases the heap memory limit) shows that the issue still happens with your suggestion.

My test isn't perfect and doesn't cover all cases: a MemoryError can happen before _bootstrap is called from the PyObject_Call C function, for example. Do you have a better idea of how to test this case too?

I am aware that my fix is not super clean and contains an arbitrary timeout (I'm not a fan of that).
I also have another idea that I need to try: move the _started information to the C level (and remove it from the Python code), together with the waiting and the signaling. I still don't have a clue how to do it properly, or what all the repercussions would be. It seems more disruptive, but cleaner and more efficient, than this ugly but pragmatic solution. What do you think?
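As a rough illustration of what a parent-side wait with a timeout can look like (this is not the code from the PR), the sketch below polls the start event and gives up once a second, hypothetical `child_exited` event says the child is gone. The names `wait_for_start`, `child_exited`, and `poll_interval` are all assumptions.

```python
import threading


def wait_for_start(started: threading.Event,
                   child_exited: threading.Event,
                   poll_interval: float = 0.05) -> bool:
    """Return True once `started` is set, False if the child exited first."""
    while not started.wait(poll_interval):
        if child_exited.is_set():
            # Check once more: the child may have signaled just before exiting.
            return started.is_set()
    return True
```

The polling interval is exactly the kind of arbitrary value mentioned above: too long delays recovery, too short wastes wakeups.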

@YvesDup
Contributor

YvesDup commented Nov 4, 2025

My test isn't perfect and doesn't cover all cases: a MemoryError can happen before _bootstrap is called from the PyObject_Call C function, for example. Do you have a better idea of how to test this case too?

If I understand your case correctly, a (memory) error can happen in the _start_joinable_thread function of the threading module without an error being returned. That surprised me.
EDIT: It surprised me because this function raises several exceptions, and there is an except clause in the start method. This issue puzzles me.

Perhaps you should take a look at the implementation of _start_joinable_thread in Modules/_threadmodule.c.

To make sure _bootstrap is never called, I suggest testing your case by redefining the Thread._bootstrap method as below:

import threading

def nop():
    return 1  # or simulate a MemoryError

def nothing():
    print("nothing".center(90, '-'))

t = threading.Thread(target=nothing)
old_bootstrap = t._bootstrap
t._bootstrap = nop  # the real bootstrap is never called
try:
    # may block forever here if the start event is never set
    t.start()
except Exception as e:
    print(f"start down..... {e = }")
...
t._bootstrap = old_bootstrap

I am aware that my fix is not super clean and contains an arbitrary timeout (not fan of that).

Me too, so would you please apply this new diff and run your scripts again:
EDIT: This proposal fails in the CI. Sometimes, the self._ident is still None even though the _bootstrap(_inner) methods have already started. Sorry for the noise.

diff --git a/Lib/threading.py b/Lib/threading.py
index 4ebceae702..8f31ce5c7b 100644
--- a/Lib/threading.py
+++ b/Lib/threading.py
@@ -1001,7 +1001,12 @@ def start(self):
             with _active_limbo_lock:
                 del _limbo[self]
             raise
-        self._started.wait()  # Will set ident and native_id
+        # Wait until the thread is really started.
+        # It must have at least one ident, optionally a name.
+        # Testing this ID (_ident attribute) as a real value
+        # should be acceptable.
+        if self._ident is not None:   
+            self._started.wait()

My previous suggestion has to be withdrawn: forcing the event to be set would mean that the thread was started, which is false.

I also have another idea that I need to try: move the _started information to the C level (and remove it from the Python code), together with the waiting and the signaling. I still don't have a clue how to do it properly, or what all the repercussions would be. It seems more disruptive, but cleaner and more efficient, than this ugly but pragmatic solution. What do you think?

Modifying this at the C level seems very risky to me. If my last suggestion or another simple fix is not acceptable, I suggest creating your own Thread class by inheriting and adapting methods as you wish.

@ryv-odoo ryv-odoo force-pushed the main-fix-140746-threading-start-waiting branch from 469849e to f34cae7 Compare November 10, 2025 12:29
Member

@ZeroIntensity ZeroIntensity left a comment

This seems like it's going to be too hacky. I think we want a more robust fix in C.

I can walk you through implementing that if you'd like. Otherwise, I can work on fixing it myself.

@bedevere-app

bedevere-app bot commented Nov 12, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@ryv-odoo ryv-odoo force-pushed the main-fix-140746-threading-start-waiting branch from 2149546 to d631f2a Compare November 12, 2025 08:38
@ryv-odoo ryv-odoo force-pushed the main-fix-140746-threading-start-waiting branch from 7dcb5b5 to 89d8bbe Compare November 12, 2025 09:11
serving_thread.join(0.1)
self.assertFalse(serving_thread.is_alive())


Contributor

Is it worth testing a use case where the _bootstrap method is never called, just to check that the thread is not running?
Is it worth testing a use case where an exception is raised in _start_joinable_thread, just to check that the thread is not running? This last one could simulate your initial issue.

Author

My initial issue is that the new thread crashes inside _bootstrap/_bootstrap_inner before signaling that it has started.

Is it worth testing a use case where the _bootstrap method is never called, just to check that the thread is not running?
Is it worth testing a use case where an exception is raised in _start_joinable_thread, just to check that the thread is not running? This last one could simulate your initial issue.

There is already test_start_new_thread_failed and that sounds a bit out of scope IMO.

Contributor

My initial issue is when the new thread crashes inside the _bootstrap/_bootstrap_inner before signaling that it starts.
There is already test_start_new_thread_failed and that sounds a bit out of scope IMO.

I agree, sorry for the misunderstanding.

Contributor

@YvesDup YvesDup Nov 13, 2025

From one comment: IMHO, your proposal is actually less robust since it is possible that Python raises a memory error before calling _bootstrap_inner at all and then it will lead to the same issue. That's why the "recovery" code added is on the parent thread.
My test isn't perfect and it doesn't cover all case: Memory can happen before calling _bootstrap from the PyObject_Call C method, for example

From the previous: My initial issue is when the new thread crashes inside the _bootstrap/_bootstrap_inner before signaling that it starts.

Where is this issue actually located? Is your example with little memory (ulimit -v 1000000) the only one that often fails?
It bothers me that we don't have a reproducible example, even though I understand that this is a complex issue.
If you agree, I will try to work on a reproducible example.

Author

Where is this issue actually located ?

It can be located anywhere that allocates memory before self._started.set() runs in the new thread. PyObject_Call in thread_run is only one example; it can also happen when calling _bootstrap_inner, self._set_ident(), self._set_native_id(), and so on. In my test, I chose one of them to demonstrate the problem and test it.
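A small fault-injection sketch of that point: whichever step fails before the signal, the event stays unset. The step names only loosely mirror the ones mentioned above, and `simulated_bootstrap`, `signaled`, and `fail_at` are hypothetical helpers rather than anything from CPython.

```python
import threading

# Illustrative step names; each one may allocate memory in the real code.
STEPS = ("call_bootstrap", "set_ident", "set_native_id", "signal_started")


def simulated_bootstrap(started: threading.Event, fail_at=None) -> None:
    for step in STEPS:
        if step == fail_at:
            raise MemoryError(f"simulated failure in {step}")
        if step == "signal_started":
            started.set()


def signaled(fail_at=None) -> bool:
    """Run the simulated bootstrap and report whether the signal was sent."""
    started = threading.Event()
    try:
        simulated_bootstrap(started, fail_at)
    except MemoryError:
        pass
    return started.is_set()
```

Looping over `STEPS` and passing each name as `fail_at` gives a cheap way to check every failure point instead of a single hand-picked one.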

@YvesDup
Contributor

YvesDup commented Nov 12, 2025

@ZeroIntensity in _threadmodule.c I read that _PyEvent_Notify and _PyEvent_Clear have an underscore as first character while PyEvent_Notify does not. Is there a reason for this?

@ZeroIntensity
Member

The _Py prefix is supposed to denote that an API is private, so I'd guess that there was originally some plan to make PyEvent public, and it just never happened.

@ryv-odoo ryv-odoo force-pushed the main-fix-140746-threading-start-waiting branch from 89d8bbe to 9ffbbd4 Compare November 12, 2025 16:17
When a new thread is started (`Thread.start()`), the current thread
waits for the new thread's `_started` signal (`self._started.wait()`).
If the new thread doesn't have enough memory, it might crash before
signaling its start to the parent thread, causing the parent thread
to wait indefinitely (still in `Thread.start()`).

To fix the issue, remove the `_started` Python attribute from the
`Thread` class and move the logic to the `_os_thread_handle`.
WIP

We also change `Thread._delete()` to use `pop` to remove the thread from
`_active`, as there is no guarantee that the thread exists in
`_active[get_ident()]`, thus avoiding a potential `KeyError`.
This can happen if `_bootstrap_inner` crashes before
`_active[self._ident] = self` executes. We use `self._ident` because we
know `set_ident()` has already been called.

Moreover, remove the old comment in `_delete` because
`_active_limbo_lock` became reentrant in commit 243fd01.
@ryv-odoo ryv-odoo force-pushed the main-fix-140746-threading-start-waiting branch from 9ffbbd4 to 9286aa9 Compare November 12, 2025 16:17
@ZeroIntensity
Member

FYI, please don't force push. We squash at the end, so it just makes reviewing harder.

- Redo the change done in test_various_ops because I don't want to change the semantics.
- Add a check in test_memory_error_bootstrap to ensure no dangling thread remains.
- Add a new type of ThreadHandleState for failing bootstrap (avoid
using THREAD_HANDLE_DONE for two different cases, which is incorrect
for join())
- is_bootstraped instead of is_running, with the same behavior as
before (True when the thread is in the `THREAD_HANDLE_DONE` state)
- Renaming stuff
- Remove useless change
- Inline method ThreadHandle_set_bootstraped
- Useless signaling
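
The state split described in the list above can be pictured with a small Python model. This is only a sketch: the real states live in C, the name `FAILED` is a hypothetical placeholder for the new state, and the rule in `is_bootstrapped` is an assumption about the intended behavior, not the PR's code.

```python
from enum import Enum, auto


class HandleState(Enum):
    NOT_STARTED = auto()
    STARTING = auto()
    RUNNING = auto()
    DONE = auto()
    FAILED = auto()  # hypothetical name for the new "bootstrap failed" state


def is_bootstrapped(state: HandleState) -> bool:
    """Assumed rule: True once bootstrap succeeded, including after exit."""
    return state in (HandleState.RUNNING, HandleState.DONE)
```

Keeping a separate failure state means `DONE` keeps a single meaning, so code that inspects the state after a join does not have to guess which of two cases it is looking at.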
@ryv-odoo
Author

Hello @ZeroIntensity / @colesbury , thank you for your answers.

As suggested, I've tried to make a less hacky fix. Sorry for the delay, I was (still am) uncomfortable with this code.

FYI, please don't force push. We squash at the end, so it just makes reviewing harder.

Sorry about that, I wasn't aware of the process. My bad.

I would lean towards changing Thread.start and _start_joinable_thread so that we can do the waiting in C

AFAIK, if we wait for the bootstrap signal in C, start_joinable_thread cannot be used outside of the python Thread class since the signal is sent inside the _bootstrap_inner (in case of success and in this version).

The idea remains the same: the parent thread waits until the bootstrap initialisation of the child thread is completed, and if the latter crashes before that (e.g. MemoryError), the parent thread attempts to clean _limbo/_active itself (to recover as cleanly as possible).

I moved the signaling (self._os_thread_handle.set_bootstraped(), replacing self._started.set()) later in _bootstrap_inner to ensure that the parent thread cleans _limbo/_active if the bootstrapping fails on the line _active[self._ident] = self.

The "Sanitizers / UBSan" check is still failing and I haven't been able to reproduce it so far. I still need to look into that.

Note that I am not confident about these changes as this is my first try to contribute.

Have a nice day.

@ZeroIntensity
Member

The sanitizer check is our fault, it should be fixed on main.

@ryv-odoo
Author

ryv-odoo commented Nov 18, 2025

Hello @ZeroIntensity ,

The sanitizer check is our fault, it should be fixed on main.

Oh, thank you for the info and for updating the PR.
I have made the requested changes; please review again

@bedevere-app

bedevere-app bot commented Nov 18, 2025

Thanks for making the requested changes!

@ZeroIntensity: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from ZeroIntensity November 18, 2025 14:22