Potential regression in Python 3.11 (multiprocess shutdown?) #97641
Comments
Based on |
I don't have access to a PC with Windows 11, so unfortunately I am unable to debug this :( If someone with Windows 11 could do a bisect, that would be super helpful. A lot of changes happened between when 3.10 entered the beta phase and when the first 3.11 alpha was released: $ git log --oneline --until=2021-10-05 --since=2021-05-03 Modules/_sqlite | wc -l
56 |
Hi @erlend-aasland, I've given this a try, but I'm hitting various tool-chain errors (A, for earlier commits) and linker errors (B, for later commits) when trying to build previous versions. I'm not familiar enough with Windows, or with building CPython for that matter, to make progress on them in a reasonable timeframe. I was able to reproduce the issue at 0474d06, but I'm getting build errors by a3c11ce. A Windows expert would likely have more success. I'm happy to keep trying, but I've reached a bit of a roadblock with the Python build failing. (Sorry to not be of more help. Hopefully I'm a few steps along the road for the future.) |
Can someone add |
I'm not sure if it's related 🤔 but after forcing
|
Just checked on the following PC specifications: with Python 3.11.0rc2, Django-4.2.dev20221005120449. Output:
PS C:\Users\TK\Desktop> cd django\tests
PS C:\Users\TK\Desktop\django\tests> python runtests.py --parallel
|
Couldn't replicate on auto (16-core laptop). So confirmed on latest 3.11.0rc2. I've disabled Windows Defender on the folder under test (django) as well as %TEMP% to make sure Windows Defender is not causing issues. That's due to the fact that I've got quite a few issues with pickling (i.e. cleaning up after that), e.g.
Not 100% sure if that's a good troubleshooting approach, but a few observations: |
FTR, I'm in the process of installing a Windows 11 development environment now. Will hopefully be up and running by this evening, or at least tomorrow. I'll need to delve into how to debug CPython on Windows using Visual Studio, since I have little experience with both Windows and Visual Studio. |
cc. release-manager @pablogsal |
Ditto (Win 11, 16 cores)
Ditto. Bumped to 40, hit the PermissionError for |
Thanks so much for your input Eryk. AFAICS, this is not an sqlite3 bug; it is an issue with the test runner itself. |
I'm for some reason unable to install Visual Studio 2015, so I'm unable to compile 3.10.0 beta1 at the moment. Thus, I'm once again unable to bisect this :( |
This still looks to be failing with 3.11.0, so still looks like a regression from 3.10. |
I was able to fix this by adding an explicit connection |
Sorry, that comment wasn't meant for this issue. I cannot reproduce this issue to diagnose it. For me, there's a significant window of time between when the worker processes access the "*.sqlite3" files and when the main process deletes them at shutdown. If something else in the main process sometimes has the file open without delete sharing, I haven't been able to reproduce it in 3.11. (Note that any locking/unlocking of the database file that may be observed in monitoring tools has nothing to do with a sharing violation. Byte-range locks can lead to locking violations for read and write operations, but not to a sharing violation.) That said, for projects that are using multiprocessing for parallel testing, I think it would help if multiprocessing introduced optional support for process groups and, in Windows, job objects, in order to do a better job ensuring that the entire process tree has been given a chance to gracefully exit, and subsequently to forcefully kill any remaining processes. |
3.11 switched to using |
Good point. I remember the Windows CI was unhappy in the first draft implementations of this enhancement. I'm not sure it is a bug that the connection object might live longer. Explicit resource control is advocated in the sqlite3 docs; closing connections explicitly is good practice. |
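As a minimal sketch of the explicit resource control mentioned above: `contextlib.closing` guarantees that `Connection.close()` runs when the block exits. Note that using the connection itself as a context manager (`with conn:`) only commits or rolls back a transaction; it does not close the connection. The table name `t` and the in-memory database are illustrative assumptions, not taken from the Django test suite.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() on exit, even if the body raises.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    rows = conn.execute("SELECT x FROM t").fetchall()

# The connection is now closed; any further use raises ProgrammingError.
try:
    conn.execute("SELECT 1")
    still_usable = True
except sqlite3.ProgrammingError:
    still_usable = False
```

With this pattern the file handle is released at a known point rather than whenever the connection object happens to be deallocated.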
From what I can tell, the old implementation avoids the circular reference by manually decrementing the reference count of the connection object: cpython/Modules/_sqlite/connection.c Lines 155 to 161 in 6a1d165
I suppose you could have kept the statement cache type as a skeleton that uses an |
True, it used that hack. Part of my motivation for getting rid of the (duplicate) LRU cache implementation in the _sqlite extension module was getting rid of this hack.
That would have been possible, but you'd keep the GC hack. I'm not sure that is a very good idea. |
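To illustrate why a circular reference matters here, the following sketch uses two hypothetical pure-Python classes, `Connection` and `Cache`, which are stand-ins and not the actual `_sqlite` C implementation. An object that takes part in a reference cycle is not freed when its last external reference goes away; it survives until the cyclic garbage collector runs.

```python
import gc
import weakref


class Cache:
    """Hypothetical cache that keeps a strong reference to its owner."""

    def __init__(self, owner):
        self.owner = owner  # back-reference: creates a cycle with the owner


class Connection:
    """Hypothetical connection-like object that owns a cache."""

    def __init__(self):
        self.cache = Cache(self)


gc.disable()  # keep the automatic collector from running during the demo
try:
    conn = Connection()
    ref = weakref.ref(conn)

    del conn  # last external reference is gone, but the cycle remains
    alive_before_collect = ref() is not None

    gc.collect()  # the cycle collector breaks the cycle and frees the object
    alive_after_collect = ref() is not None
finally:
    gc.enable()
```

If the owner wraps an open file or database handle, that handle stays open for as long as the object survives, which is the window in which a Windows sharing violation can occur.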
There seems to be a regression in Python 3.11, where sqlite connections are not deallocated promptly because of internal changes: the module now uses an LRU cache. The connections are not deallocated until `gc.collect()` is called. See python/cpython#97641. This affects only Windows, because when we try to remove the tempdir for the exp run, the sqlite connection is still open, which prevents us from deleting that folder. Although this may happen in a real scenario in `exp run`, I am only fixing the tests by mocking `dvc.close()` and extending it to call `gc.collect()` afterwards. We could also mock `State.close()`, but I did not want to mock something that is not in dvc itself. `diskcache` uses thread-local storage for connections, so they are expected to be garbage collected, and it therefore does not provide a good way to close them. The only API it offers is `self.close()`, and that only closes the main thread's connection. If we had access to the connection, an easier way would have been to explicitly call `conn.close()`. But we don't have such an option at the moment. Related: iterative#8404 (comment) GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
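The workaround described above can be sketched roughly as follows. `Holder` is a hypothetical stand-in for a library object that owns a connection the caller cannot close directly; it is not the dvc or `diskcache` code. The sketch assumes the connection is only reachable through a reference cycle, so `gc.collect()` is what finally releases it before the directory is removed.

```python
import gc
import os
import shutil
import sqlite3
import tempfile


class Holder:
    """Hypothetical library object that owns a connection (assumption)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.me = self  # self-reference: only the cycle collector frees this


tmpdir = tempfile.mkdtemp()
holder = Holder(os.path.join(tmpdir, "state.sqlite3"))
holder.conn.execute("CREATE TABLE t (x INTEGER)")
holder.conn.commit()

del holder     # unreachable now, but the cycle keeps the connection open
gc.collect()   # frees the cycle, which closes the connection implicitly
shutil.rmtree(tmpdir)  # on Windows this step is the one that fails if a handle is still open
removed = not os.path.exists(tmpdir)
```

When the connection is reachable, calling `conn.close()` explicitly is the simpler and more predictable choice; forcing a collection is a fallback for code that does not expose the connection.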
Python 3.11 tweaked how the sqlite module works, changing the implicit cleanup behavior; this can cause problems on Windows when temporary files are involved. See the discussion in python/cpython#97641. The cursor-juggling code in the database upgrade test implicitly relied on the old behavior, so we now close the connection explicitly so that the tests pass on Windows again.
Following gh-108015, sqlite3 will now emit a ResourceWarning if connection objects are closed implicitly. Explicit resource management is recommended. cc. @eryksun |
gh-108015 has now landed; Python 3.13 will emit a ResourceWarning if a connection object is closed implicitly. Explicit resource management is now recommended. @felixxm has reported that Django already started to clean up their test suite (django/django#17178). It seems to me the problems mentioned in this issue can all be handled by improved resource management. Suggesting to close this. |
Closing as per #97641 (comment). |
Running the Django test suite against the Python 3.11 pre-releases, we have hit
a potential regression.
Steps to reproduce
Bug report
On Python 3.8, 3.9, and 3.10 this runs without problem.
On Python 3.11 the following error is seen after the test suite completes,
during shutdown:
This looked similar to us to open issue #95027 but we were asked to report it
separately.
I've tested all the way back to a1,
where (along with other issues now resolved) this error still occurs:
This is somewhat frustrating as we've tried to test on all platforms since the
first releases.
Our test suite would only run on Windows with Python 3.11 very recently as
there was a third-party dependency that was not compatible with Windows. We
will try to adjust to test without dependencies as well on Windows for future
versions. (Sorry about that.)
Please do let us know if we can provide further info. I imagine though the
easiest thing is for you to run this yourself.
Thanks.
//cc @felixxm
Your environment