Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-89812: Add pathlib._PathBase #106337

Merged
merged 37 commits into from Sep 30, 2023
Merged

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Jul 2, 2023

Add private pathlib._PathBase class. This will be used by an experimental PyPI package to incubate a tarfile.TarPath class.

Implementation notes

In pathlib.py:

  • Split Path into _PathBase and Path
  • Implement _PathBase.is_junction() and is_mount() using stat(). In the latter case, restore the implementation deleted in gh-86943: implement pathlib.WindowsPath.is_mount() #31458
  • Implement _PathBase.resolve() using absolute() and readlink()
  • Implement _PathBase._scandir() using iterdir()

In test_pathlib.py:

  • Add PathBaseTest to exercise the _PathBase class directly
  • Split PathTest into DummyPathTest and PathTest. The former exercises a representative subclass of _PathBase.
  • Add DummyPathWithSymlinksTest to exercise a similar subclass, this time with symlink support.

Questionable bits

In pathlib.py, the _PathBase.is_junction() and _PathBase.resolve() methods are new code. The latter is very performance-sensitive and may need more work.

Future work

  • Make pathlib._PathBase public. There's still at least a couple of things to do first:
    • Make PurePath._flavour public
    • Figure out what to do with _PathBase.__hash__(), __eq__(), and other comparison methods.

@barneygale
Copy link
Contributor Author

barneygale commented Jul 7, 2023

@brettcannon a long time ago in a PR far, far away, you said:

My current thinking (while I work through my PR review backlog to reach this PR), is this should be used for zipfile.Path. I also wouldn't be opposed to that happening this PR if @barneygale wanted to give that a shot while he waits for a review from me.

I couldn't make this work for zipfile.Path due to a couple of fundamental differences in the interface, e.g. the root attribute stores a string in pathlib.PurePath, but a ZipFile instance in zipfile.Path. It's possible to make it work with a couple of deprecations/removals, or an entirely new ZipPath class.

So in this PR I've chosen tarfile.TarPath as the initial target. I think it does a good job of showing the viability of the pathlib._VirtualPath interface.

However, I have a bunch of open questions in the PR description about how TarPath itself should work. It's not as straightforward as .zip files, because tarballs support symlinks and hardlinks, and because tarfile.TarFile doesn't support reading and writing the same file.

These questions make me think that tarfile.TarPath should first be gestated in a PyPI package before we add it to the standard library. But that leaves a bit of a chicken-and-egg problem with getting pathlib._VirtualPath in.

Do you have any advice on the best way forward?

Thanks!

@brettcannon
Copy link
Member

These questions make me think that tarfile.TarPath should first be gestated in a PyPI package before we add it to the standard library. But that leaves a bit of a chicken-and-egg problem with getting pathlib._VirtualPath in.

You have a couple of options:

  1. Start a discussion on discuss.python.org for tarfile and see if there's clear consensus on what to do.
  2. Make _VirtualPath as private and do your experimental PyPI project knowing it might be perpetually tied to a specific Python version (and make that clear to users of the package).
  3. Go with your gut and hope for the best (and err on the side of leaving things out, especially if they can be added in later). 😅

@merwok
Copy link
Member

merwok commented Jul 8, 2023

I am unsure about the name VirtualPath!

The way I’ve seen it used, a TarPath class, or SshPath, or S3Path would be examples of virtual filesystems: I give them paths that look like regular filesystem paths and they do custom operations to return info or data.

Their base class is not itself a virtual path, it’s an (abstract?) base class!

@barneygale
Copy link
Contributor Author

Fair point! How about PathBase? It squares with the names of base classes in io, I think.

@barneygale
Copy link
Contributor Author

Alternatively, pathlib.abc.VirtualPath?

@merwok
Copy link
Member

merwok commented Jul 10, 2023

Not a fan of deep modules, and even .abc.VirtualPath doesn’t seem great to me.
I have posted on discuss about VirtualPath vs PathBase!

@barneygale barneygale changed the title GH-89812: Add tarfile.TarPath GH-89812: Add pathlib._PathBase Jul 12, 2023
@barneygale barneygale removed the request for review from ethanfurman July 12, 2023 19:04
@RonnyPfannschmidt
Copy link

just stumbled uppon this and im wondering if a concept of a public Accessor of a path might make sense
where an accessor provides the implementation, then for example tarfile/zipfile paths could just have the accessor

additionally it would be trivial to have "roots" at a path or a "root" with a directory file descriptor/network protocol

@barneygale
Copy link
Contributor Author

just stumbled uppon this and im wondering if a concept of a public Accessor of a path might make sense where an accessor provides the implementation, then for example tarfile/zipfile paths could just have the accessor

We actually used to have accessor objects, but their interface was much the same as Path objects, and so they amounted to nothing more than a layer of indirection. They were removed in #25701.

If we add accessors back in, most users would need to subclass two classes to implement their own path objects: the accessor class and the path class. The latter would be necessary for any user who wants to add their own path methods, for example.

@RonnyPfannschmidt
Copy link

those "accessors" where nothing more than aliases, after all the implementation as done had no power over what the pure path portion can do with the real world

a actualy "accessor" would be able to be tied to something like a directory file descriptor, or a path object, or a entirely different thing thats not a filesystem

and yes there is some need for synchronization between concrete path capabilities and the accessors,

however there is certain equivalence classes to keep in mind (Posix, Windows), so in a lot of cases my impression is that there will be more "accessors" than paths

@barneygale
Copy link
Contributor Author

Intriguing. Could you sketch out the accessor API?

@merwok
Copy link
Member

merwok commented Sep 11, 2023

(in the forums rather than a PR discussion please)

@RonnyPfannschmidt
Copy link

i can dedicate some time to it later

a basic concept of the accessor version ought to consider something like:

  • windows paths

  • posix paths

  • unc paths

  • directory file descriptor paths (both secured and unsecured)

  • paths that pretend to be a chroot on top of another path/pathlike

  • archives

  • ssh/scp

  • urls/http

  • distinctions between readonly and readwrite accessors

  • capabilities of accessors (like users/perms/xattrs)

  • wrappers (like union fs) to support interesting testing
    (imagine a path that maps onto the real fs but pretends certain files are deleted/replaced)

    additionally for purely virtual accessors it might be interesting if parts of the path could be objects (like dates), but thats currently out of scope

@barneygale
Copy link
Contributor Author

barneygale commented Sep 13, 2023

To summarise my thoughts on the above: in many cases this PR helps bring those things about, and I don't think it hinders any of those things. But let's chat more in a forum thread.

What's left in this PR:

On the latter point, there are still two subtle and related problems:

  • _PathBase.resolve(strict=True) can raise OSError with the wrong values (errno, path). For example, resolve('/etc/hosts/foo') should raise NotADirectoryError(path='/etc/hosts'), but instead raises FileNotFoundError(path='/etc/hosts/foo')
  • To avoid an infinite loop, the implementation of _PathBase.readlink() must not call self.resolve(), and so users must supply a path with no symlinks, except in the final position. This is a bit different to Path.readlink().

Both of these issues could be solved if we can provide some way to stat() a path without resolving any symlinks. The resolve() method would call this stat() method first, and only call readlink() if the stat result indicates a symlink. A couple of options spring to mind:

  • We could allow _PathBase.stat(follow_symlinks=None), meaning "where possible, don't follow any symlinks in the path; the path I've provided is the direct physical path". But it's not particularly obvious, and follow_symlinks=None means something different in glob() and rglob().
  • We could add some other keyword-only argument to _PathBase.stat() with the same effect.
  • We could add a new method -- direct_stat(), stat1(), dstat()?

@barneygale
Copy link
Contributor Author

Bah, pushed a commit labelled "WIP". In fact it contains a solution to the above problem: stash a _resolving flag on path objects, which means resolve() can safely call stat() and readlink() without hitting infinite recursion. No weird APIs for users.

This will require further refactoring in another PR.
@barneygale
Copy link
Contributor Author

Hey @AA-Turner and @gst, thank you for your earlier reviews, very much appreciated. Please could you give it another pass, if you can spare the time?

I feel OK about _PathBase.resolve(): it's entirely private for the timebeing, and I intend to add further tests and optimization before we drop the underscore from _PathBase, so if we can't spot any clear defects I suspect it's fine to land? Grateful for your guidance!

@AA-Turner AA-Turner self-requested a review September 24, 2023 03:04
Lib/pathlib.py Outdated Show resolved Hide resolved
Lib/pathlib.py Outdated Show resolved Hide resolved
Lib/pathlib.py Outdated Show resolved Hide resolved
Lib/pathlib.py Outdated Show resolved Hide resolved
Copy link
Member

@AA-Turner AA-Turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just two notes. I think _split_stack makes sense a method.

A

Lib/pathlib.py Outdated Show resolved Hide resolved
Lib/pathlib.py Outdated Show resolved Hide resolved
@barneygale barneygale merged commit 89966a6 into python:main Sep 30, 2023
23 checks passed
@barneygale
Copy link
Contributor Author

Thank you so much for the reviews, everyone :-)

@AA-Turner
Copy link
Member

Congratulations Barney, a lot of hard work on your part paid off!

A

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants