
race condition for… file metadata? #66
Closed · niacdoial opened this issue Jun 14, 2021 · 4 comments

@niacdoial
Look, I know how weird this looks, but hear me out.
I somehow managed to get a race condition(?) where portalocker.Lock('file/path', 'a') would not actually put me at the end of the file I opened.
There were a lot of moving parts in this, with potential culprits such as:

  • an NFS filesystem
  • accessing a file from different machines within a cluster
  • an EOL'd kernel on every machine of said cluster (!)

Here are a few small working examples.

script1.py

import portalocker
with portalocker.Lock('file/path','a') as f:
    f.write('prelude,')
    print(f.tell())
    input('start script2 and continue')
    f.write('AAAA')

script 2, variant a:

import portalocker
with portalocker.Lock('file/path','a') as f:
    print(f.tell())
    f.write('BBBB')

script 2, variant b:

import portalocker
from time import sleep
with portalocker.Lock('file/path','a') as f:
    print(f.tell())
    print(f.seek(0,2))
    sleep(5)
    print(f.tell())
    print(f.seek(0,2))

I emptied the file between all tests.

test 1: running script1 on node1, then quickly script2a on node2:
script1 prints 8\nstart script2 and continue, script2 prints 0, and the file contains BBBBude,AAAA

test 2: running script1 on node1, then quickly script2b on node2:
script2 prints 0\n0\n0\n12. (although the results can change with the delay set in sleep())

test 3: running script1 on node1, waiting, and running script2b on node2:
script2 prints 8\n8\n8\n12 or 8\n8\n12\n12.

test 4: running script1 on node1, waiting, and running script2b on node2:
script2 prints 8\n12\n12\n12?

I have yet to run tests where one or more of those scripts runs on the controller node (which is the target of the NFS mount involved).

I am aware this might not be an issue on your side specifically, but like… yeah.
I hope I'm not wasting your time with this.

@wolph (Owner) commented Jun 15, 2021

With NFS... that doesn't sound too weird; NFS and locking are rarely friends in my experience ;)

The thing with locks on unix systems is that they're (by default, at least) advisory locks. Both clients need to choose to lock, the underlying filesystem needs to support it, and the filesystem needs to be mounted with locking enabled. With NFS, that last one is often the culprit. I should note that with NFS the locking can be enforced instead of being advisory.
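To make the "advisory" part concrete, here is a minimal sketch (not from the thread; it uses plain fcntl rather than portalocker, and /tmp/demo.txt is just a placeholder path). A process that takes an exclusive flock() only blocks other processes that also call flock(); a process that simply opens and writes is never stopped:

import fcntl

with open('/tmp/demo.txt', 'a') as fh:
    # Advisory exclusive lock: only cooperating processes that also call
    # flock() will block on this; a plain open()+write() elsewhere is
    # completely unaffected.
    fcntl.flock(fh, fcntl.LOCK_EX)
    fh.write('locked writer\n')
    fcntl.flock(fh, fcntl.LOCK_UN)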

So... common pitfalls with NFS locking:

  • The mount flags: many people disable NFS locking for performance reasons (it can be really slow). You can check whether the mount options include the nolock parameter
  • The NFS version of all clients needs to be the same. The locking systems in NFS 3 and NFS 4 are different and don't interact with each other
  • It could simply be a bug because of old kernels or other things... look at D10 in the NFS FAQ: http://nfs.sourceforge.net/
  • Perhaps this library is not using the correct locking flags for NFS. I honestly never tried, because NFS locks are generally too slow to be usable in my experience.

In any case, I would start by looking at the server version, the client versions, and the mount flags. Check that they all match up and that nolock is not set as a mount flag.
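As a rough way to do that check from the clients themselves, here is a small sketch (assuming Linux clients; not from the thread) that prints the NFS mounts and their options from /proc/mounts. Look at the filesystem type (nfs vs nfs4), the vers= option, and whether nolock or local_lock show up:

with open('/proc/mounts') as mounts:
    for line in mounts:
        # each line: device mountpoint fstype options dump pass
        device, mountpoint, fstype, options, *_ = line.split()
        if fstype.startswith('nfs'):
            print(mountpoint, fstype, options)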

@niacdoial (Author)

Yep, everything matches up, and the closest thing to a nolock option is local_lock=none, which… actually has the opposite effect for my purposes?

Also, the locking itself seems to work, because the various script2s always hang until I press enter on script1.

Also also, I tried adding fh.flush() and os.fsync(fh.fileno()) to script1, without any effect. I have no idea how I didn't see this part of the docs before; kinda disappointed in myself right now.
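For reference, the flush/fsync variant of script1 would look roughly like this (a sketch; same placeholder path as the original scripts):

import os
import portalocker

with portalocker.Lock('file/path', 'a') as f:
    f.write('prelude,')
    f.flush()
    os.fsync(f.fileno())  # push the write through to the NFS server
    print(f.tell())
    input('start script2 and continue')
    f.write('AAAA')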

@wolph (Owner) commented Jun 16, 2021

Yes, local_lock=none should be the right flag to have.

I'm not really sure what else to try honestly... it sounds like you're doing everything right.

github-actions bot added the Stale label on Aug 21, 2023
github-actions bot closed this as not planned on Aug 29, 2023
@wolph (Owner) commented Aug 29, 2023

I'm not sure if it's entirely relevant for you, but for cross-system locking I've also implemented a Redis-based locking system: https://github.com/wolph/portalocker#redis-locks

It works across multiple threads, systems, etc... in a completely safe manner
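Going by the linked README section, usage looks roughly like this (a sketch, assuming a reachable Redis server and portalocker installed with its redis extra; the channel name is just a placeholder):

import portalocker

lock = portalocker.RedisLock('some_lock_channel_name')

with lock:
    print('object is locked')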
