CI, MAINT: Windows 3.11 CI failure with file access issue #19852

Closed
tylerjereddy opened this issue Jan 10, 2024 · 9 comments · Fixed by #19859

@tylerjereddy
Contributor

This is affecting both the maintenance/1.12.x branch (discussed at #19797 (comment); sample log: https://github.com/scipy/scipy/actions/runs/7467279245/job/20323369865?pr=19797) and the main branch in #19849 (log: https://github.com/scipy/scipy/actions/runs/7477725340/job/20351079843?pr=19849).

From a release management standpoint, this actually gives me some confidence that the matter is not a blocker to proceed with 1.12.0 RC2 (i.e., not specific to that branch).

@tylerjereddy added the maintenance and CI labels Jan 10, 2024
@rgommers
Member

Adding a partial log:

SciPy 1.12.0rc2

  User defined options
    Native files: D:\a\scipy\scipy\.mesonpy-sgr84eaw\meson-python-native-file.ini
    buildtype   : release
    b_ndebug    : if-release
    b_vscrt     : md
    use-pythran : false

Found ninja.EXE-1.11.1.git.kitware.jobserver-1 at C:\hostedtoolcache\windows\Python\3.11.7\x64\Scripts\ninja.EXE
+ meson dist --allow-dirty --no-tests --formats gztar
WARNING: Repository has uncommitted changes that will not be included in the dist tarball
Created D:\a\scipy\scipy\.mesonpy-sgr84eaw\meson-dist\SciPy-1.12.0rc2.tar.gz
Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.11.7\x64\Lib\shutil.py", line 624, in _rmtree_unsafe
    os.rmdir(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '.\\.mesonpy-sgr84eaw\\meson-private\\cmake_scipy-openblas\\CMakeFiles\\CMakeScratch'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.11.7\x64\Lib\tempfile.py", line 878, in onerror
    _os.unlink(path)
PermissionError: [WinError 5] Access is denied: '.\\.mesonpy-sgr84eaw\\meson-private\\cmake_scipy-openblas\\CMakeFiles\\CMakeScratch'

... (the above is repeated many times) ...

   File "C:\hostedtoolcache\windows\Python\3.11.7\x64\Lib\shutil.py", line 752, in rmtree
    if _rmtree_islink(path):
       ^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

ERROR Backend subprocess exited when trying to invoke build_sdist

Doesn't look familiar I'm afraid.
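
For readers puzzling over how a locked directory turns into a `RecursionError`: the traceback alternates between a failed removal and an error handler that tries the removal again. The toy model below reproduces only that shape. It is a sketch under assumptions, not CPython's `shutil` or `tempfile` code, and `locked_rmdir`, `cleanup`, and `try_cleanup` are hypothetical names.

```python
def locked_rmdir(path):
    # Stand-in for os.rmdir on a directory that another process still holds
    # open; it fails every time, like the WinError 32 in the log.
    raise PermissionError(32, "in use by another process", path)


def cleanup(path):
    """Toy cleanup whose error handler retries by calling itself."""
    try:
        locked_rmdir(path)
    except PermissionError:
        # Retrying immediately and without a limit gives the lock no chance
        # to clear, so every failure adds another stack frame.
        cleanup(path)


def try_cleanup(path):
    """Report how the toy cleanup ends instead of letting it propagate."""
    try:
        cleanup(path)
    except RecursionError:
        return "RecursionError"
    return "ok"
```

In this model the original `PermissionError` is never reported on its own; the interpreter's recursion limit is what finally stops the loop, which matches the last line of the partial log.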

@h-vetinari
Member

Quoting myself from #19797

I'm not sure what to make of the Python 3.11 Windows sdist job failure in the absence of Pythran.

This looks pretty unrelated to pythran, or perhaps it's triggering a bug somewhere else (e.g. somehow, a race or duplication for creating scratch space for scipy-openblas).

+ meson dist --allow-dirty --no-tests --formats gztar
WARNING: Repository has uncommitted changes that will not be included in the dist tarball
Created D:\a\scipy\scipy\.mesonpy-sgr84eaw\meson-dist\SciPy-1.12.0rc2.tar.gz
Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.11.7\x64\Lib\shutil.py", line 624, in _rmtree_unsafe
    os.rmdir(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '.\\.mesonpy-sgr84eaw\\meson-private\\cmake_scipy-openblas\\CMakeFiles\\CMakeScratch'

Basically, there seems to be a combination of factors such that several paths in our Meson build try to independently and simultaneously build scipy-openblas. Or perhaps it's an outright Meson bug.

@lucascolley added the Meson label Jan 10, 2024
@rgommers
Member

rgommers commented Jan 10, 2024

Basically, there seems to be a combination of factors such that several paths in our meson build try to independently and simultaneously build scipy-openblas.

There's nothing to build there; there's detection but no building. The paths don't really make sense either: why would CMake be used here all of a sudden? And this version is old and used to work fine:

 cp /c/opt/64/bin/libopenblas_v0.3.20-571-g3dec11c6-gcc_10_3_0.dll /c/opt/openblas/openblas_dll

This is the one regular CI job that still uses the openblas tarballs rather than the wheels. Detection actually looks fine:

Run-time dependency scipy-openblas found: NO (tried pkgconfig and cmake)
Run-time dependency openblas found: YES 0.3.21.dev
Dependency openblas found: YES 0.3.21.dev (cached)

What seems to be happening is that the sdist gets created successfully, but on cleaning up the build dir that meson-python creates, something is holding on to a file handle and then things go haywire. Maybe a Meson 1.3.1 issue, or due to a change in GHA CI setup for Windows, or ....

What changed here most recently is that gh-19724 disabled build isolation for this job 3 days ago. But CI passed on that PR - if it's the cause, then it's intermittent. EDIT: the failures on this job only started yesterday.
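
If the cause really is a file handle that is released a moment too late, a bounded retry around the directory removal is one common workaround. The sketch below illustrates that idea only; `rmtree_with_retry` is a hypothetical helper, not something meson-python or Meson provides, and the attempt count and delay are arbitrary assumptions.

```python
import shutil
import time


def rmtree_with_retry(path, attempts=5, delay=0.5):
    """Remove a directory tree, retrying while a file is still locked.

    Hypothetical helper for illustration. Returns True once the tree is
    gone and False if every attempt hit a PermissionError.
    """
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            # Already removed, possibly by an earlier partial attempt.
            return True
        except PermissionError:
            # Both WinError 5 and WinError 32 in the log above surface as
            # PermissionError. Wait a little longer before each new attempt.
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))
    return False
```

The loop is bounded on purpose: if the handle is never released, the caller gets `False` back after a few attempts rather than an ever-deeper chain of cleanup calls.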

@rgommers
Member

rgommers commented Jan 10, 2024

I'll note that GHA did have a significant outage earlier today, so perhaps it was that. I re-ran the failed job linked above (https://github.com/scipy/scipy/actions/runs/7477725340/job/20364680808), and it's past the point now where it failed last time around.

Yet another suspect: new Windows GHA runner image: https://github.com/actions/runner-images/blob/main/images/windows/Windows2019-Readme.md. That's the newest change; none of our build deps had a release recent enough.

@eli-schwartz
Contributor

ERROR Backend subprocess exited when trying to invoke build_sdist
usage: delvewheel repair [-h] [--add-path PATHS] [--add-dll DLLS]
                         [--no-dll DLLS] [--ignore-in-wheel] [-v] [-w TARGET]
                         [--no-mangle DLLS] [--no-mangle-all] [--strip]
                         [-L LIB_SDIR] [--namespace-pkg PKGS]
                         [--include-symbols]
                         wheel [wheel ...]
delvewheel repair: error: the following arguments are required: wheel
ERROR: You must give at least one requirement to install (see "pip help install")

As an aside, the CI definition seems a bit... broken. Well, really, PowerShell is a bit broken. If a command fails that other commands rely on, the job shouldn't continue by trying to delvewheel + install the result.
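
The halt-on-error behaviour asked for here can be sketched in a few lines of Python, as one possible way to drive dependent steps from a script instead of the shell. This is an illustration only: `run_steps` is a hypothetical helper, and the commands passed to it would be whatever the job actually runs (the real CI definition is not reproduced here).

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order and stop at the first failure.

    `steps` is a list of argument lists. Hypothetical helper for
    illustration; it is not part of the SciPy CI configuration.
    """
    for cmd in steps:
        print("+", " ".join(cmd), flush=True)
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a missing build artifact.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    try:
        run_steps([
            [sys.executable, "-c", "print('step 1: build')"],
            [sys.executable, "-c", "print('step 2: repair and install')"],
        ])
    except subprocess.CalledProcessError as exc:
        # Propagate the failing step's exit code to the CI runner.
        sys.exit(exc.returncode)
```

With this structure, a failed sdist build would end the job with that step's exit code, and the `delvewheel repair` and `pip install` usage errors shown above would not appear.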

@rgommers
Member

Well, really, PowerShell is a bit broken. If a command fails that other commands rely on, the job shouldn't continue

Yes indeed. I can never remember the bad PowerShell syntax for anything, and that includes halt-on-error. When we had CI jobs on Azure it was figured out at some point, but the whole thing was unreadable.

@rgommers
Member

Still happening intermittently, so frequency is probably related to both the size of the build dir/definitions and the details of the Windows image (responsiveness etc.). Fix in mesonbuild/meson#12726 looks promising, thanks @thalassemia!

I'll see if I can change the job to avoid this problem in the first place in the meantime.

rgommers added a commit to rgommers/scipy that referenced this issue Jan 11, 2024
This avoids looking for scipy-openblas with CMake (which we never want),
and avoids looking for it twice (that was an oversight). As a result,
we should be robust to whatever the underlying problem behind
the CI failures reported in gh-19852 is.

Closes gh-19852

[skip cirrus] [skip circle]
@thalassemia
Contributor

If anyone is curious, the underlying issue was probably mesonbuild/meson-python#559.

@tylerjereddy added this to the 1.12.0 milestone Jan 11, 2024
@rgommers
Member

Awesome, thanks for getting to the bottom of that.

tylerjereddy pushed a commit to tylerjereddy/scipy that referenced this issue Jan 12, 2024
This avoids looking for scipy-openblas with CMake (which we never want),
and avoids looking for it twice (that was an oversight). As a result,
we should be robust to whatever the underlying problem behind
the CI failures reported in gh-19852 is.

Closes gh-19852

[skip cirrus] [skip circle]