-
-
Notifications
You must be signed in to change notification settings - Fork 33.5k
gh-140746: Fix Thread.start() that can hang indefinitely #140799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
gh-140746: Fix Thread.start() that can hang indefinitely #140799
Conversation
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
IMO, there is a simpler fix in the diff --git a/Lib/threading.py b/Lib/threading.py
index 4ebceae702..b5fc035b7c 100644
--- a/Lib/threading.py
+++ b/Lib/threading.py
@@ -1076,12 +1076,17 @@ def _bootstrap_inner(self):
except:
self._invoke_excepthook(self)
finally:
+ # The code before the `self._started.set()` instruction
+ # could raise an exception. We have to check here and
+ # set the Event if necessary.
+ if not self._started.is_set():
+ self._started.set()
self._delete()Your change in the |
|
Hello @YvesDup , thank you for your suggestion. IMHO, your proposal is actually less robust since it is possible that Python raises a memory error before calling My test isn't perfect and it doesn't cover all case: Memory can happen before calling I am aware that my fix is not super clean and contains an arbitrary timeout (not fan of that). |
If I correctly understand your case, an (memory) error can happen in the Perhaps you have to take a look to the implementation of the In order to never call
diff --git a/Lib/threading.py b/Lib/threading.py
index 4ebceae702..8f31ce5c7b 100644
--- a/Lib/threading.py
+++ b/Lib/threading.py
@@ -1001,7 +1001,12 @@ def start(self):
with _active_limbo_lock:
del _limbo[self]
raise
- self._started.wait() # Will set ident and native_id
+ # Wait until the thread is really started.
+ # It must have at least one ident, optionaly a name.
+ # Testing this ID (_ident attribute) as a real value
+ # should be acceptable.
+ if self._ident is not None:
+ self._started.wait()My last suggestion has to be removed because forcing the set on the event means that thread was started. And it is false.
Modify at C level seems very risky to me . If my last suggestion or a another simple fix is not acceptable , I suggest to create your own |
469849e to
f34cae7
Compare
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
ZeroIntensity
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it's going to be too hacky. I think we want a more robust fix in C.
I can walk you through implementing that if you'd like. Otherwise, I can work on fixing it myself.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
2149546 to
d631f2a
Compare
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
7dcb5b5 to
89d8bbe
Compare
| serving_thread.join(0.1) | ||
| self.assertFalse(serving_thread.is_alive()) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that worth testing a use case where the _bootstrap method will be never called just to check that the thread is not running ?
Is that worth testing a use case where the an exception is raised in the _start_joinable_thread just to check that the thread is not running ? This last one could simulate your initial issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial issue is when the new thread crashes inside the _bootstrap/_bootstrap_inner before signaling that it starts.
Is that worth testing a use case where the _bootstrap method will be never called just to check that the thread is not running ?
Is that worth testing a use case where the an exception is raised in the _start_joinable_thread just to check that the thread is not running ? This last one could simulate your initial issue.
There is already test_start_new_thread_failed and that sounds a bit out of scope IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial issue is when the new thread crashes inside the _bootstrap/_bootstrap_inner before signaling that it starts.
There is already test_start_new_thread_failed and that sounds a bit out of scope IMO.
I agree, sorry for the misunderstood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From one comment: IMHO, your proposal is actually less robust since it is possible that Python raises a memory error before calling
_bootstrap_innerat all and then it will lead to the same issue. That's why the "recovery" code added is on the parent thread.
My test isn't perfect and it doesn't cover all case: Memory can happen before calling_bootstrapfrom thePyObject_CallC method, for example
From the previous: My initial issue is when the new thread crashes inside the _bootstrap/_bootstrap_inner before signaling that it starts.
Where is this issue actually located ? Is your example with few memory (ulimit -v 1000000) the only one that often fails ?
It bothers me that we don't have a reproducible example, even though I understand that this is a complex issue.
If you agree, I will try to work on a reproductible example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this issue actually located ?
It can be located at every place that allocates memory before self._started.is_set() in the new Thread. PyObject_Call in thread_run is only one example; it can happen calling _bootstrap_inner, self._set_ident(), self._set_native_id(), and so on. In my test, I chose one of them to demonstrate the problem and test it.
|
@ZeroIntensity in the |
|
The |
89d8bbe to
9ffbbd4
Compare
When a new thread is started (`Thread.start()`), the current thread waits for the new thread's `_started` signal (`self._started.wait()`). If the new thread doesn't have enough memory, it might crash before signaling its start to the parent thread, causing the parent thread to wait indefinitely (still in `Thread.start()`). To fix the issue, remove the _started python attribute from the Thread class and moved the logic on the _os_thread_handle. WIP We also change `Thread._delete()` to use `pop` to remove the thread from `_active`, as there is no guarantee that the thread exists in `_active[get_ident()]`, thus avoiding a potential `KeyError`. This can happen if `_bootstrap_inner` crashes before `_active[self._ident] = self` executes. We use `self._ident` because we know `set_ident()` has already been called. Moreover, remove the old comment in `_delete` because `_active_limbo_lock` became reentrant in commit 243fd01.
9ffbbd4 to
9286aa9
Compare
|
FYI, please don't force push. We squash at the end, so it just makes reviewing harder. |
- Add a new type of ThreadHandleState for failing bootstrap (avoid using THREAD_HANDLE_DONE for two different case, which is incorrect for join()) - is_bootstraped instead of is_running having the same behavior than before (True when the thread is `THREAD_HANDLE_DONE` state) - Renamming stuff
- Useless signaling
|
Hello @ZeroIntensity / @colesbury , thank you for your answers. As suggested, I've tried to make a less hacky fix. Sorry for the delay, I was (still am) uncomfortable with this code.
Sorry about that, I wasn't aware of the process. My bad.
AFAIK, if we wait for the bootstrap signal in C, The idea remains the same: the parent thread waits for the bootstrap initialisation of the child thread is completed, and If the latter crashes before (e.g. MemoryError), the parent Thread attempts to clean _limbo/_active itself by it-self (to recover the most properly). I moved the place of signaling ( The check "Sanitizers / UBSan" is still failing and I wasn't able to reproduce for now. I still need to check that. Note that I am not confident about these changes as this is my first try to contribute. Have a nice day. |
|
The sanitizer check is our fault, it should be fixed on main. |
|
Hello @ZeroIntensity ,
Ho, thank you for the info and for updating the PR. |
|
Thanks for making the requested changes! @ZeroIntensity: please review the changes made to this pull request. |
When a new thread is started (
Thread.start()), the current thread waits for the new thread's_startedsignal (self._started.wait()). If the new thread doesn't have enough memory, it might crash before signaling its start to the parent thread, causing the parent thread to wait indefinitely (still inThread.start()).To fix the issue, remove the _started python attribute from the Thread class and converted the logic at C level (PyEvent). A flaw of this method, it that the threading module will still contains the zombie thread into the _limbo dictionnary.
We also change
Thread._delete()to usepopto remove the thread from_active, as there is no guarantee that the thread exists in_active[get_ident()], thus avoiding a potentialKeyError. This can happen if_bootstrap_innercrashes before_active[self._ident] = selfexecutes. We useself._identbecause we knowset_ident()has already been called.Moreover, remove the old comment in
_deletebecause_active_limbo_lockbecame reentrant in commit 243fd01.Not sure if this fix need/can to be backported