Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-28041: Inconsistent behavior: Get st_nlink from os.stat() and os.scandir() #795

Closed

Conversation

thebecwar
Copy link
Contributor

  • Unify stat approach between POSIX and Windows hosts. The data that comes out of the WIN32_FIND_DATAW is not the same information that is available in BY_HANDLE_FILE_INFORMATION.
  • Add unit test to verify fix
  • Update deleted file/directory tests to reflect new consistent behavior.

…scandir()

- Unify stat approach between POSIX and Windows hosts. The data that comes out of the WIN32_FIND_DATAW is not the same information that is available in BY_HANDLE_FILE_INFORMATION.
- Add unit test to verify fix
- Update deleted file/directory tests to reflect new consistent behavior.
Copy link
Contributor

@MojoVampire MojoVampire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explicitly undoes the performance boosts available on Windows by requiring each stat to cache on access, instead of using the free information provided by FindFirstFile/FindNextFile.

@thebecwar
Copy link
Contributor Author

The free information is incomplete. The missing information means that calling stat() on a DirEntry produces a different result than calling os.stat(). It also means that the POSIX and Windows behavior is different.

@MojoVampire
Copy link
Contributor

MojoVampire commented Mar 24, 2017

And those differences are documented. Making scandir + stat take significantly longer (much worse on if it's a high latency network share, as the round trip time ends up paid for each file, a situation I've personally encountered) for fields that are comparatively unlikely to be needed is not a reasonable compromise.

If it's possible, you might consider trying to make the three problematic fields implemented as properties that will lazily stat on first access to those specific fields, to normalize behavior without causing performance regressions. Not sure if StructSequences are flexible enough to support that though...

@thebecwar
Copy link
Contributor Author

Here's the options as I see it. I'm new to the codebase here, so I'm not really the best one to decide which way to go. I'm happy to hear opinions. I'd also be happy to POC/demo any of the options.

Do nothing

  • Close this pull request without merging
  • The documentation does state that the information is missing and how to obtain it. Potentially add a note to os.stat pointing back to the os.DirEntry.stat entry to make it clear that the implementation is different. I can see where someone reading the documentation may assume that the two methods are the same internally, since they are named identically and return the same data type.
  • Close 28041 as "By Design"

Merge this PR as is.

Benefits:

  • Increases the consistency of the POSIX and Windows implementations.
  • Fixes the non-intuitive behavior where a DirEntry can reference a deleted file, and not cause an exception to be raised when stat is called

Drawbacks:

  • Slows down the scandir + stat operation for Windows hosts.
  • Limited case scenario. Most people using stat are not looking for st_ino, st_dev or st_nlink.

Add a method to DirEntry (something like update_stat)

Fetches the latest stat information for all OSes, replacing the cached information.

Benefits:

  • Keeps the current implementation's efficiency for those that don't need the missing fields
  • Allows those that do need the fields to get the information explicitly
  • Provides a means to update the cached information for cases where its needed.

Drawbacks:

  • Adds complexity to the interface, with everything that entails. Adds to the cost of use, and maintenance.
  • Requires an explicit step to obtain complete information.

Promote the stat structure to a full type

Benefits:

  • Allows the missing fields to be lazily initialized on demand
  • Completely transparent to the end user. (ie- You don't have to do anything special to get the correct information)
  • Keeps the optimization of the standard case, taking a performance hit only when necessary

Drawbacks:

  • Property accessors shouldn't really be doing anything heavy internally. Taking the time penalty on the accessor is not ideal. There is also the potential that there could be an exception raised.
  • Adds another type that needs to be managed going forward. This is somewhat mitigated since it is a relatively simple type.

Make DirEntry.stat return a separate type

On Windows return a StructSequence from DirEntry.stat that doesn't have the missing properties.

Benefits:

  • Missing members aren't just zero, they would raise an AttributeError
  • Makes it more clear that the properties aren't available

Drawbacks:

  • Again, adds complexity
  • Widens the divide between Windows and POSIX versions of Python, potentially reducing portability

@@ -11569,7 +11565,7 @@ DirEntry_from_find_data(path_t *path, WIN32_FIND_DATAW *dataW)
if (!entry->path)
goto error;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extraneous whitespace.

@brettcannon brettcannon added the type-feature A feature request or enhancement label Mar 27, 2017
@vstinner
Copy link
Member

You misunderstood scandir () design. See the issue.

@vstinner vstinner closed this Apr 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants