-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synchronize entity dynamic imports #1794
Synchronize entity dynamic imports #1794
Conversation
Ok, I've managed to get what I believe is a minimal reproduction, but on my M2 with Python 3.12.3 the deadlock detection in importlib seems to be working correctly, so I don't reproduce a "true" deadlock, but I suspect that this is the same deadlock that is affecting us in production. My code also does some benchmarking, but because the code that loads the module caches it, it doesn't really seem like it's likely to affect anyone ... and on top of it, it's only a degradation of about 8 µs ( Here's the deadlock detection in action:
Alarmingly I see two other bugs that seem to be race conditions due to the dynamic import:
And:
These errors seem to happen more frequently on the countries that import other countries (Isle of Man, American Samoa, etc.). I'm not surprised to see these errors, as while one thread is importing one module, another may be trying to access the partially initialized module and end up with either this KeyError or an ImportError due to using directly a partially initialized module. We've also been able to confirm that this PR is working for us in our app when running locally, and doesn't degrade performance to any measurable extent. Here's the code for the minimal-ish reproduction. I've done some extra tuning with conditions to try to get the reproduction to be more reliable, but of course it's inherently probabilistic. It's kind of messy, but it was super helpful for me to confirm the theory of our bug, and test this fix. #!/usr/bin/env python
from typing import Optional
from types import TracebackType, FrameType
import time
import sys
import traceback
from threading import Thread, Condition, Lock
import importlib
import random
def bench():
from holidays.registry import EntityLoader
iters = 10_000
start = time.monotonic_ns()
for _ in range(iters):
# Raw entity loader performance, but because the `HolidayBase` get cached in `holidays`
# module, this ultimately should get called only once per country, pre-process, unless this
# API is used directly.
EntityLoader("holidays.countries.australia").get_entity()
end = time.monotonic_ns()
avg_duration = (end - start) / iters
print(f"Average duration {round(avg_duration, 2):.2f} ns")
class Counter:
cond: Condition
count: int
def __init__(self):
self.cond = Condition(Lock())
self.count = 0
def inc(self):
with self.cond:
self.count += 1
self.cond.notify_all()
def wait(self, expected_count: int):
while True:
with self.cond:
if self.count >= expected_count:
return
self.cond.wait()
class Loader(Thread):
counter: Counter
cond: Condition
error: Optional[Exception]
def __init__(self, counter: Counter, cond: Condition, *args, **kwargs):
super().__init__(*args, **kwargs)
self.counter = counter
self.cond = cond
self.error = None
def run(self):
try:
self._run_inner()
except Exception as error:
self.error = error
def _run_inner(self):
# Reload the things we need to test for a deadlock
from holidays.utils import country_holidays
from holidays.registry import COUNTRIES
# Sync all threads to get them to execute the entity loader at the same time
with self.cond:
self.counter.inc()
self.cond.wait()
for vals in random.choices(list(COUNTRIES.values()), k=100):
country_holidays(vals[1])
def join(self, *args, **kwargs):
super().join(*args, **kwargs)
if not self.is_alive() and self.error is not None:
raise self.error
# Fake exception class used to format the traceback for a "true" deadlock condition.
class FakeException(BaseException):
pass
def main():
bench()
print("Reproducing deadlock / race conditions (this will take awhile) ...")
reprod_attempts = 1_000
reprod_threads = 100
error_types_seen = set()
for _ in range(reprod_attempts):
# Clear the cached modules between each reproduction attempt
for k in list(sys.modules.keys()):
if k.split(".")[0] == "holidays":
del sys.modules[k]
importlib.invalidate_caches()
counter = Counter()
cond = Condition()
ts: list[Thread] = []
for _ in range(reprod_threads):
t = Loader(counter, cond)
t.start()
ts.append(t)
counter.wait(reprod_threads)
with cond:
cond.notify_all()
for t in ts:
try:
t.join(30.0)
except Exception as error:
error_type = type(error)
if error_type not in error_types_seen:
traceback.print_exception(error)
error_types_seen |= {error_type}
if t.is_alive():
frame: FrameType = sys._current_frames()[t.ident]
tb = None
while frame is not None:
tb = TracebackType(
tb, frame, frame.f_lasti, frame.f_lineno
)
frame = frame.f_back
error = FakeException().with_traceback(tb)
raise TimeoutError("Deadlock occurred") from error
if __name__ == "__main__":
main() Based on the above reproduction code, it seems that this PR fixes all 3 bugs caused by racy-ness in dynamic imports when using multithreading. |
Pull Request Test Coverage Report for Build 9187711103Details
💛 - Coveralls |
I've updated the code to use a re-entrant lock instead of a thread pool. It has one main advantage over the thread pool strategy: it's re-entry safe. With a thread pool, if the code were ever to be updated so that importing a module using the lazy loader could lazy load another module (I'm not sure why we'd do this), it could deadlock depending on import order when waiting on the future for the work to be executed on the same thread. I don't think this will add any other deadlock potential, and it still passes the minimal reproduction check I shared. For bonus points it reduces the already negligible performance impact of the thread pool strategy to undetectably negilgable (I got |
Hi @Izzette, Thanks for going above and beyond in order to identify and fix the issue. I have to admit this isn't a kind of PRs we normally receive. I'm going to look into it this weekend and do my best to squeeze the changes into our May 20th release. Again, I really appreciate the detailed description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks stable, let's see how it goes in the wild.
I had just a minor addition to the PR -- a comment describing RLock
.
@Izzette sorry I couldn't process this earlier and include into v0.49
Co-authored-by: Arkadii Yakovets <ark@cho.red>
231ebb1
to
b747c69
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@KJhellico @PPsyrius any comments/concerns on this?
LGTM. |
LGTM |
Alright, merging it then. @Izzette thanks for your contribution! |
Proposed change
Perform dynamic imports in registry in dedicated worker thread to avoid deadlocks from concurrent calls to
importlib.import_module
for the same module from different threads. This may not be sufficient if other code outside of python-holdays is importing these moduels directly. I believe this cpython issue is related https://bugs.python.org/issue38884. This should resolve (through a different implementation) #1514.Here is a flamegraph of the deadlock in action on one thread:
![image](https://private-user-images.githubusercontent.com/7987724/331135820-2eadd8a0-146a-4763-8c57-af0193e0ed22.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk4Mjg0NDgsIm5iZiI6MTcxOTgyODE0OCwicGF0aCI6Ii83OTg3NzI0LzMzMTEzNTgyMC0yZWFkZDhhMC0xNDZhLTQ3NjMtOGM1Ny1hZjAxOTNlMGVkMjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDFUMTAwMjI4WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDc4YTk5MTYwOTgyYzkwNzQ0MWYxNDEwZWYzN2QxOGQ1YzYzYjBkMWU3ODA3YjI1ZmMwMTcyMmYwM2M1ZWQ5NSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.bx6w-YyMOdDnQt4MdoonJZJ_4H9_iurjoeZA6LhBbSg)
My main concern with this PR is that it may introduce a slight performance degradation. I'm not sure adding a test will be feasible, as the deadlock is inherently racy. Making a reliable production in the python-holidays code would require adding a lot of synchronization points within python-holidays to instrument such a test and hammering the
importlib.import_module
with many threads and than callingimportlib.invalidate_caches
in a loop until the issue is reproduced.Type of change
Checklist
make pre-commit
, it didn't generate any changesmake test
, all tests passed locally