Occasional errors on AppVeyor due to writing .mypy_cache #3215

Closed
gvanrossum opened this issue Apr 21, 2017 · 23 comments

Comments

@gvanrossum
Member

gvanrossum commented Apr 21, 2017

Since #3133 was merged I see occasional test failures in AppVeyor (i.e. on Windows) with the traceback below. I suspect this is due to some background process on Windows (e.g. a virus checker or another container?) still having the file open, but a simple (?) solution would be to add --cache-dir=nul to the mypy invocations on Windows. (Though that will need to be done somewhere in the test framework I suspect.)

For now I'm just going to ignore such failures.

(UPDATE: example taken from https://ci.appveyor.com/project/gvanrossum/mypy/build/1.0.1227/job/9q6qgru4f6ig6ar8 )

Traceback (most recent call last):
  File "C:\projects\mypy\scripts\mypy", line 6, in <module>
    main(__file__)
  File "C:\projects\mypy\mypy\main.py", line 46, in main
    res = type_check_only(sources, bin_dir, options)
  File "C:\projects\mypy\mypy\main.py", line 93, in type_check_only
    options=options)
  File "C:\projects\mypy\mypy\build.py", line 188, in build
    graph = dispatch(sources, manager)
  File "C:\projects\mypy\mypy\build.py", line 1570, in dispatch
    process_graph(graph, manager)
  File "C:\projects\mypy\mypy\build.py", line 1813, in process_graph
    process_stale_scc(graph, scc, manager)
  File "C:\projects\mypy\mypy\build.py", line 1921, in process_stale_scc
    graph[id].write_cache()
  File "C:\projects\mypy\mypy\build.py", line 1551, in write_cache
    self.manager)
  File "C:\projects\mypy\mypy\build.py", line 904, in write_cache
    os.replace(data_json_tmp, data_json)
PermissionError: [WinError 5] Access is denied: '.mypy_cache\\3.6\\enum.data.json.d2782ac765091066' -> '.mypy_cache\\3.6\\enum.data.json'
@pkch
Contributor

pkch commented Apr 21, 2017

This probably has the same root cause as this; TL;DR: the Windows file system inherently leads to intermittent race conditions on write access attempts, even without any other processes, antivirus, etc. running.

If this is a risk in production, retrying the os.replace is the right solution.

For AppVeyor, the simplest solution seems to be to make this change in mypy.defaults:

CACHE_DIR = 'nul' if 'APPVEYOR' in os.environ else '.mypy_cache'

@gvanrossum
Member Author

Fun. I followed some older issues to https://mail.python.org/pipermail/python-dev/2008-April/078333.html and my recollection is that it plagued us even in the 2000-2003 era (when I shared an office at PythonLabs with Tim Peters).

But "Windows file system inherently leads to intermittent race conditions on write access attempts, even without any other processes, antivirus, etc." sounds awfully vague. Is there a Microsoft KB issue about this? Is it really the FS or is it Microsoft's libc emulation?

Anywho, I don't like making an exception in mypy itself just for AppVeyor -- if it happens to AppVeyor it may happen to others (also it may interfere with mypy runs by others on AppVeyor). Maybe the whole always-write-cache thing needs to be disabled by the test framework. Or maybe we need a separate "full-run-but-write-cache" flag instead of the default behavior, then Dropbox can enable that (we never run mypy on Windows) and everyone else is freed of this.

Ref: @JukkaL #3133.

@JukkaL
Collaborator

JukkaL commented Apr 22, 2017

I think that we should understand the root cause before making a decision. If the correct fix is to add a retry loop on Windows, then we should probably just do the right thing here, always, even if it sounds like an ugly hack.

We could have a utility function that does the retry thing (e.g. perform_os_operation_with_retry(lambda: os.replace(source, target)) or something), and then wrap that in a helper such as safe_replace that would do the retry dance (on Windows only).
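
A minimal sketch of what such helpers might look like (the names perform_os_operation_with_retry and safe_replace come from the comment above; the retry count and delay are arbitrary placeholders, not settled values):

import os
import sys
import time

def perform_os_operation_with_retry(operation, attempts=5, delay=0.05):
    # On Windows another process may briefly hold the target file open,
    # so retry a few times before giving up; elsewhere just run it once.
    if sys.platform != 'win32':
        return operation()
    for attempt in range(attempts):
        try:
            return operation()
        except PermissionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

def safe_replace(source, target):
    perform_os_operation_with_retry(lambda: os.replace(source, target))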

@pkch
Contributor

pkch commented Apr 22, 2017

without any other processes

I meant without any user processes, sorry. Yes, of course, ultimately it's another process holding the handle to the Windows file that prevents it from being deleted; I was just saying that Windows system-level processes (file indexer, analyzer of file type, etc.) could be to blame, and there's almost no way to reliably turn all of them off.

A couple more references.

Apart from the solution with a wait/retry loop that I mentioned, I found nothing else recommended by people who have spent time on this issue. Should we try to implement it and see if it solves the problem without too much extra delay?

@JukkaL
Collaborator

JukkaL commented Apr 22, 2017

Apart from the solution with a wait/retry loop that I mentioned, I found nothing else recommended by people who have spent time on this issue. Should we try to implement it and see if it solves the problem without too much extra delay?

Yes, I think that we should at least see if a wait/retry loop seems to fix the issue. The only major drawback to such a workaround that I can see would be slowdown when things are failing consistently, but that should be rare. Discussion at the first referenced article suggests that we may have to wait 2 seconds or more. It's likely that anything much longer than that could be disruptive, though.

@gvanrossum
Member Author

OK, now I understand. It is another process, just nothing we can control.

I agree we should try a retry loop. This should be implemented as a set of utility functions wrapping standard os functions that do the retry loop only on Windows. That way they can be called from various places and we don't have to uglify the call sites (I suspect there are a few different call sites and os functions that may have this problem).

Nevertheless I would also like to turn off cache writing when the test suite runs.

Finally I would like to make the cache-writing code more robust, so that if at any point the cache write fails (maybe because I replaced .mypy_cache or one of its subdirectories with a plain file or with a symlink to a non-writable directory) the mypy run doesn't crash like it does now.

(In fact this may be a substitute for the retry loop? If the cache has problems we silently don't write it. The next run will regenerate the cache. That's how Python writes .pyc files.)

@pkch
Contributor

pkch commented Apr 22, 2017

Windows race conditions only affect file deletion / renaming, not file writing.
Can we just get rid of os.replace (see the PR)? That is, instead of writing to a temp file first and then renaming it, can we write directly to the original file?

Separately, I noticed that os.remove is wrapped in a try/except. What if it fails? Wouldn't the cache be corrupt in that case?

@pkch
Contributor

pkch commented Apr 22, 2017

My understanding so far:

  • The reasons the blocking delete may fail are that:
    • some processes open the file without a special sharing mode (FILE_SHARE_DELETE, which allows others to delete it); arguably it's a bug in them
    • Windows doesn't make the deleted filename available to others for a short but unspecified period of time
  • Windows does not provide a non-blocking delete ("delete when no one is using it any more"); FILE_FLAG_DELETE_ON_CLOSE seemed like it could do it, but it will fail if the file is opened without FILE_SHARE_DELETE anyway, so it's not solving this particular problem. Of course, a user program can create a separate thread that repeatedly tries to do it.

The best solution I can think of for atomic write is based on this:

  1. let's say we want to write data to original_filename
  2. create random filename1, and write the new data there
  3. rename original_filename to a new random tmpdir/filename2 (make sure filename2 doesn't exist)
  4. rename filename1 to original_filename
  5. lazily delete everything in tmpdir/ at the end of the program run

This has no race conditions. Should I implement it?
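
A rough Python sketch of those steps (original_filename, data and tmpdir are placeholders; as later comments in this thread point out, the rename in step 3 still needs DELETE access on Windows, so this illustrates the proposal rather than a working fix):

import os
import uuid

def proposed_atomic_write(original_filename, data, tmpdir):
    # Step 2: write the new data under a random name next to the target.
    filename1 = original_filename + '.' + uuid.uuid4().hex
    with open(filename1, 'w') as f:
        f.write(data)
    # Step 3: move the existing file out of the way under a fresh random name.
    filename2 = os.path.join(tmpdir, uuid.uuid4().hex)
    if os.path.exists(original_filename):
        os.rename(original_filename, filename2)
    # Step 4: move the new data into place.
    os.rename(filename1, original_filename)
    # Step 5: everything left in tmpdir gets deleted lazily at the end of the run (not shown).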

@gvanrossum
Member Author

I think that looks reasonable. Steps 1-2 are exactly as we do them today. Steps 3-5 are the Windows version of os.replace(random_filename1, original_filename).

Are you proposing to do this just on Windows or also on Unix? At least on Unix I'd worry that tmpdir is on a different fs than original_filename -- that's why the current code doesn't use the tmpdir module but instead uses its own random string that's appended to the filenames.

I still think it makes sense to have a replace() function in mypy/util.py that just calls os.replace() on Unix but does the other dance on Windows (and it can arrange for the lazy deletions using atexit).

@pkch
Contributor

pkch commented Apr 23, 2017

Yup, just on Windows. On Unix, os.replace is already perfect (it's atomic and there are never problems with deletions).

@pkch
Contributor

pkch commented Apr 23, 2017

It turns out that in order to rename a Windows file that's open in another process, one has to use a 3-argument version of ReplaceFile; see this and this for more details.

We don't have a function in the Python standard library that uses the 3-argument version of ReplaceFile from the Windows API (os.rename and os.replace are the only two that could have, but they chose a different implementation). As a result, the entire solution I proposed is impossible in pure Python, and in fact there is no way to implement atomic writes without a retry loop.

Note that this is completely unrelated to issues of whether os.rename should or should not replace destination files when they are present.

I can implement a retry loop, unless someone who's done that before already has ready-to-use code.

@refi64
Contributor

refi64 commented Apr 23, 2017

Uhh...can't you use ctypes.windll.kernel32.ReplaceFileA?

@pkch
Contributor

pkch commented Apr 23, 2017

Ah, I didn't know you can call the Windows API from pure Python 😲 Yes, I managed to call it, so I'm testing out whether I can achieve the desired behavior.
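
For reference, calling the three-argument ReplaceFile through ctypes might look roughly like this (Windows only; the file names are placeholders and error handling is minimal):

import ctypes

# BOOL ReplaceFileW(lpReplacedFileName, lpReplacementFileName,
#                   lpBackupFileName, dwReplaceFlags, lpExclude, lpReserved)
ReplaceFileW = ctypes.windll.kernel32.ReplaceFileW
ok = ReplaceFileW('realfile.json', 'tmpfile.json', None, 0, None, None)
if not ok:
    raise ctypes.WinError()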

@JelleZijlstra
Member

I recently learned about https://github.com/untitaker/python-atomicwrites (because it has a stub in typeshed :) ). Is their approach useful?

@pkch
Contributor

pkch commented Apr 24, 2017

@JelleZijlstra I think not. Their support for Windows is based on a function (MoveFileEx) that works if and only if os.replace works, because both require the same DELETE access to the source and destination files.

@gvanrossum
Member Author

But why couldn't we assume the same access?

@pkch
Contributor

pkch commented Apr 24, 2017

Well, supposedly there was a function that allows renaming even if delete isn't allowed. I thought maybe they had figured out how to use it properly.

But actually they are doing precisely what we are already doing - and since they don't do the wait/retry loop, the behavior of their approach should be identical to ours (in that it will intermittently fail).

Sadly, my tests show that the holy grail of Windows rename is just a myth. So I'll implement the wait/retry loop after all.

@gvanrossum
Member Author

Thanks for the thorough research!

@pkch
Contributor

pkch commented Apr 24, 2017

For the wait/retry loop, I'll borrow the solution from the CPython test.support module; see the discussion about it here.

Unfortunately, I can't just from test.support import unlink because it was never intended as a public API, so it's not very flexible; in particular, the maximum wait time is 1 sec, which is too long (in case a bunch of cache files can't be deleted for some permanent reason, we'll be waiting for 1 sec * number_of_files > angry_user_threshold sec).

Another consideration is whether we actually need to block until the atomic write has completed. At present, the answer is that we don't need to block right away, since we don't need to read from those files for a while (we can wait at least until the end of process_stale_scc). So we can postpone all the failed writes until some checkpoint in the future, greatly reducing the worst-case delay since we can wait for all the "stuck" files at once (a sketch of this idea follows below). However, this means that the wait/retry will be run from a faraway location in the code. This makes the approach dangerous: who knows where else write_cache() might be used in the future? In fact, there's already a TODO item to add it to update.build_incremental_step(). (Note that write_cache saves a single module, not the entire cache.)

The cleanest solution would be to refactor write_cache to take a list of files to be written and wait for completion at the end of that function. This would give the best of both worlds: not waiting for each file, and preventing future contributors from accidentally writing subtly buggy code that relies on writes that haven't happened yet.

An alternative is to wait for each file, but if the total wait time exceeds a threshold, exit and warn the user that something is really wrong with access rights.

Finally, I can simply wait for each file, and not worry about the case that multiple files are permanently stuck.

Does anyone have an opinion?
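
To make the "postpone the failed writes until a checkpoint" idea concrete, here is a purely hypothetical sketch (none of these names exist in mypy, and the constants are placeholders):

import os
import time

class PendingReplaces:
    # Collect replaces that failed with PermissionError and retry them later,
    # so a single stuck file doesn't stall the whole run.
    def __init__(self):
        self.pending = []

    def replace(self, source, target):
        try:
            os.replace(source, target)
        except PermissionError:
            self.pending.append((source, target))

    def flush(self, timeout=1.0):
        # Called at a checkpoint (e.g. after process_stale_scc): retry everything
        # that is still stuck, waiting at most `timeout` seconds in total.
        deadline = time.time() + timeout
        while self.pending and time.time() < deadline:
            still_stuck = []
            for source, target in self.pending:
                try:
                    os.replace(source, target)
                except PermissionError:
                    still_stuck.append((source, target))
            self.pending = still_stuck
            if self.pending:
                time.sleep(0.05)
        return not self.pending  # False if some files could not be written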

@JukkaL
Collaborator

JukkaL commented Apr 25, 2017

If something fails to get deleted, we can just fail the entire build and not try anything else? If the write succeeds but the replace fails, the user has a configuration problem -- we have partial write access to the cache directory. We can detect this and just fail immediately instead of trying to write additional cache files. There will be at most a one-second delay, which is reasonable.

Here are the most likely scenarios if I've understood this correctly:

  • Write access to all cache files: we retry the replace call, fixing the original issue. The write will succeed since we use a unique file name that nobody will have open.
  • No write access to cache files: the initial write will fail (which we won't retry) so no change to current semantics -- can't hit the wait/retry code path.
  • Some write access: the initial write succeeds but the replace call fails consistently. This is a user configuration problem and we can report it as such.

@pkch
Contributor

pkch commented Apr 25, 2017

@JukkaL

I'm not 100% sure I understood you, so I'll summarize what we know and then respond to your bullet points.

The original problem is the intermittent PermissionError raised by our implementation of atomic writes on Windows. Currently there are only two atomic writes in the codebase, both in build.write_cache. They are implemented like this:

with open(tmpfile, 'w') as f:
    f.write(stuff)
os.replace(tmpfile, realfile)

They work perfectly on Linux. The problem on Windows is caused by the combination of two things:

  • Unlike on Linux, os.replace (as well as os.rename) on Windows requires DELETE permission to both files
  • Various programs (backup, indexing, antivirus, etc.) intermittently open random files for short periods of time, with a lock that does not allow others to DELETE

If os.replace fails, it most likely means that something like a backup service recently opened an exclusive handle to realfile and still holds it. (It might have been a handle to tmpfile, of course, but that's less likely since there's such a short time window between when we close our own exclusive handle to it and when we call os.replace.) There's nothing we can do to realfile until the handle is closed by the external process; any delete / rename / replace that involves it will fail with PermissionError.

It's a well-known problem. Among others, CPython test suite failures were caused by the same issue, and they were addressed by a solution (undocumented but visible in test.support) that waits and retries, exponentially increasing the wait time and giving up entirely after 1 sec.

Write access to all cache files: we retry the replace call, fixing the original issue. The write will succeed since we use a unique file name that nobody will have open.

No, the second write may fail too. The lock is on realfile, and it may still be present.

No write access to cache files: the initial write will fail (which we won't retry) so no change to current semantics -- can't hit the wait/retry code path.

Yes.

Some write access: the initial write succeeds but the replace call fails consistently. This is a user configuration problem and we can report it as such.

Unfortunately, this is likely not a user configuration problem, but just that the lock on realfile is still held.

@gvanrossum
Member Author

OK, so the solution really is just to retry with exponential backoff and a 1 sec timeout; hitting the timeout should be rare, so we won't have to worry about all writes taking an extra 1 sec per file.

Then all I recommend is simplifying the wait function to the bare minimum, per my previous comment.

gvanrossum pushed a commit that referenced this issue Apr 30, 2017
This is an alternative attempt at fixing issue #3215, given the
complexity of PR #3239.
gvanrossum added a commit that referenced this issue May 2, 2017
This is an alternative attempt at fixing issue #3215, given the
complexity of PR #3239.
@pkch
Contributor

pkch commented May 4, 2017

I think this can be closed. No more crashes on AppVeyor as far as I can tell.

JukkaL closed this as completed May 4, 2017