Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize pathlib path construction #101362

Closed
barneygale opened this issue Jan 27, 2023 · 10 comments
Closed

Optimize pathlib path construction #101362

barneygale opened this issue Jan 27, 2023 · 10 comments
Labels
performance Performance or resource usage topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented Jan 27, 2023

Pathlib is slow. One of the most obvious symptoms is that pathlib.PurePath objects are slow to construct. We should be able to speed construction up without making other parts of pathlib slower.

Two possible approaches:

  1. Optimize the existing machinary of path construction: __new__(), _from_parts(), _parse_parts(), _parse_args().
  2. Perform less work in the constructor: defer parsing, joining and normalization until needed.

Linked PRs

@barneygale barneygale added type-feature A feature request or enhancement performance Performance or resource usage topic-pathlib labels Jan 27, 2023
@barneygale
Copy link
Contributor Author

I'd like to land #101363 before I put the first PR up for this issue.

barneygale added a commit to barneygale/cpython that referenced this issue Feb 4, 2023
`PurePath` now normalises and splits paths only when necessary, e.g. when
`.name` or `.parent` is accessed. The result is cached. This speeds up path
object construction by around 4x.

`PurePath.__fspath__()` now returns an unnormalised path, which should be
transparent to filesystem APIs (else pathlib's normalisation is broken!).
This extends the earlier performance improvement to most impure `Path`
methods, and also speeds up pickling, `p.joinpath('bar')` and `p / 'bar'`.

This also fixes pythonGH-76846 and pythonGH-85281 by unifying path constructors and
adding an `__init__()` method.
barneygale added a commit to barneygale/cpython that referenced this issue Feb 7, 2023
This saves a comparison in `pathlib.Path.__new__()` and reduces the time
taken to run `Path()` by ~5%
barneygale added a commit to barneygale/cpython that referenced this issue Feb 7, 2023
This saves a comparison in `pathlib.Path.__new__()` and reduces the time
taken to run `Path()` by ~5%
barneygale added a commit to barneygale/cpython that referenced this issue Feb 7, 2023
…b.PurePath

This reduces the time taken to run `PurePath("foo")` by ~15%
barneygale added a commit to barneygale/cpython that referenced this issue Feb 7, 2023
The previous `_parse_args()` method pulled the `_parts` out of any
supplied `PurePath` objects; these were subsequently joined in
`_from_parts()` using `os.path.join()`. This is actually a slower form of
joining than calling `fspath()` on the path object, because it doesn't take
advantage of the fact that the contents of `_parts` is normalized!

This reduces the time taken to run `PurePath("foo", "bar") by ~20%, and
the time taken to run `PurePath(p, "cheese")`, where
`p = PurePath("/foo", "bar", "baz")`, by ~40%.
@zooba
Copy link
Member

zooba commented Feb 22, 2023

Does the PR cope with WindowsPurePath and PosixPurePath and converting between them?

@barneygale
Copy link
Contributor Author

Could you clarify? The PRs maintain the behaviour that attempting to instantiate pathlib.PurePath will give you a PureWindowsPath or a PurePosixPath depending on your platform, but pathlib doesn't support converting between Windows and POSIX paths as I understand it.

@zooba
Copy link
Member

zooba commented Feb 24, 2023

This behaviour should be preserved:

Python 3.10.10 (tags/v3.10.10:aad5f6a, Feb  7 2023, 17:20:36) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pathlib import *
>>> p = PureWindowsPath("a/b/c")
>>> p
PureWindowsPath('a/b/c')
>>> PurePosixPath(p)
PurePosixPath('a/b/c')
>>> PurePath(p)
PureWindowsPath('a/b/c')
>>> PosixPath(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__3847v3x7pw1km\lib\pathlib.py", line 962, in __new__
    raise NotImplementedError("cannot instantiate %r on your system"
NotImplementedError: cannot instantiate 'PosixPath' on your system
>>> WindowsPath(p)
WindowsPath('a/b/c')
>>>

@barneygale
Copy link
Contributor Author

Right! That will be broken by #101667 as things stand:

>>> from os import fspath
>>> from pathlib import *
>>> p = PureWindowsPath("a/b/c")
>>> p
PureWindowsPath('a/b/c')
>>> fspath(p)
'a\\b\\c'
>>> PurePosixPath(fspath(p))
PurePosixPath('a\\b\\c')
>>> PurePosixPath(p)
PurePosixPath('a\\b\\c')

It doesn't appear to be documented or tested behaviour, and it feels odd to me that PurePosixPath(p) can give a different result to PurePosixPath(fspath(p)). I'll try to find a good way forwards!

@barneygale
Copy link
Contributor Author

barneygale commented Feb 24, 2023

This feature doesn't really work with drives or roots:

>>> PurePosixPath(PureWindowsPath('//server/share/dir'))
PurePosixPath('\\\\server\\share\\/dir')
>>> PurePosixPath(PureWindowsPath('c:/dir'))
PurePosixPath('c:\\/dir')
>>> PurePosixPath(PureWindowsPath('/dir'))
PurePosixPath('\\/dir')

As far as I can tell, no one has ever logged a bug about it.

However, using PurePosixPath(PureWindowsPath(...).as_posix()) works everywhere:

>>> PurePosixPath(PureWindowsPath('//server/share/dir').as_posix())
PurePosixPath('//server/share/dir')
>>> PurePosixPath(PureWindowsPath('c:/dir').as_posix())
PurePosixPath('c:/dir')
>>> PurePosixPath(PureWindowsPath('/dir').as_posix())
PurePosixPath('/dir')

So I'm tempted to conclude that converting with PurePosixPath(PureWindowsPath(...)) alone is not supported, and that it only happened to work in some limited circumstances. Users can add .as_posix() to signal their intent and make it work reliably. What do you reckon?

@zooba
Copy link
Member

zooba commented Feb 24, 2023

I think the direct conversion should either be consistent with fspath(p) or p.parts, but no strong preference which. As you say, neither makes a huge amount of sense, and the workaround (.as_posix()) has been there for a long time.

Might be worth adding a short note to the docs warning that the constructor can't reliably convert from different pathlib objects, and maybe specify whether it will fspath or .parts them.

On another perf note, is it possible that parsing the path up front isn't necessary? Obviously it'll save the most time to keep a single string literal around and parse it later, but I don't personally have a good feel for whether that's common or not. (Obviously if it's available pre-parsed then keep it.)

@barneygale
Copy link
Contributor Author

On another perf note, is it possible that parsing the path up front isn't necessary? Obviously it'll save the most time to keep a single string literal around and parse it later, but I don't personally have a good feel for whether that's common or not. (Obviously if it's available pre-parsed then keep it.)

That's my plan! We can return the unnormalized path from PurePath.__fspath__() and only normalize/parse on demand, e.g. when users call .name or .parent. This would speed up construction, joining, and most methods defined in the Path class.

AlexWaygood pushed a commit that referenced this issue Mar 5, 2023
…Path() (#101665)

GH-101362: Call join() only when >1 argument supplied to pathlib.PurePath

This reduces the time taken to run `PurePath("foo")` by ~15%
AlexWaygood added a commit to barneygale/cpython that referenced this issue Mar 5, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Mar 5, 2023
miss-islington pushed a commit that referenced this issue Mar 5, 2023
…H-101664)

This saves a comparison in `pathlib.Path.__new__()` and reduces the time taken to run `Path()` by ~5%.

Automerge-Triggered-By: GH:AlexWaygood
miss-islington pushed a commit that referenced this issue Mar 5, 2023
The previous `_parse_args()` method pulled the `_parts` out of any supplied `PurePath` objects; these were subsequently joined in `_from_parts()` using `os.path.join()`. This is actually a slower form of joining than calling `fspath()` on the path object, because it doesn't take advantage of the fact that the contents of `_parts` is normalized!

This reduces the time taken to run `PurePath("foo", "bar")` by ~20%, and the time taken to run `PurePath(p, "cheese")`, where `p = PurePath("/foo", "bar", "baz")`, by ~40%.

Automerge-Triggered-By: GH:AlexWaygood
hugovk pushed a commit to hugovk/cpython that referenced this issue Mar 6, 2023
…b.PurePath() (python#101665)

pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath

This reduces the time taken to run `PurePath("foo")` by ~15%
hugovk pushed a commit to hugovk/cpython that referenced this issue Mar 6, 2023
…ime (pythonGH-101664)

This saves a comparison in `pathlib.Path.__new__()` and reduces the time taken to run `Path()` by ~5%.

Automerge-Triggered-By: GH:AlexWaygood
hugovk pushed a commit to hugovk/cpython that referenced this issue Mar 6, 2023
The previous `_parse_args()` method pulled the `_parts` out of any supplied `PurePath` objects; these were subsequently joined in `_from_parts()` using `os.path.join()`. This is actually a slower form of joining than calling `fspath()` on the path object, because it doesn't take advantage of the fact that the contents of `_parts` is normalized!

This reduces the time taken to run `PurePath("foo", "bar")` by ~20%, and the time taken to run `PurePath(p, "cheese")`, where `p = PurePath("/foo", "bar", "baz")`, by ~40%.

Automerge-Triggered-By: GH:AlexWaygood
barneygale added a commit to barneygale/cpython that referenced this issue Mar 6, 2023
Improve performance of path construction by skipping the addition of the
path anchor (`drive + root`) to the internal `_parts` list. This change
allows us to simplify the implementations of `joinpath()`, `name`,
`parent`, and `parents` a little. The public `parts` tuple is unaffected.
carljm added a commit to carljm/cpython that referenced this issue Mar 6, 2023
* main: (21 commits)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472)
  pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455)
  pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467)
  pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667)
  pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664)
  pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665)
  pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445)
  pythonGH-102341: Improve the test function for pow (python#102342)
  Fix unused classes in a typing test (pythonGH-102437)
  pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318)
  pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426)
  Move around example in to_bytes() to avoid confusion (python#101595)
  pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421)
  pythongh-96821: Add config option `--with-strict-overflow` (python#96823)
  pythongh-101992: update pstlib module documentation (python#102133)
  pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699)
  pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417)
  pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303)
  pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180)
  pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896)
  ...
carljm added a commit to carljm/cpython that referenced this issue Mar 7, 2023
* main: (37 commits)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472)
  pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455)
  pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467)
  pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667)
  pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664)
  pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665)
  pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445)
  pythonGH-102341: Improve the test function for pow (python#102342)
  Fix unused classes in a typing test (pythonGH-102437)
  pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318)
  pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426)
  Move around example in to_bytes() to avoid confusion (python#101595)
  pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421)
  pythongh-96821: Add config option `--with-strict-overflow` (python#96823)
  pythongh-101992: update pstlib module documentation (python#102133)
  pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699)
  pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417)
  pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303)
  pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180)
  pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896)
  ...
barneygale added a commit to barneygale/cpython that referenced this issue Mar 10, 2023
@barneygale
Copy link
Contributor Author

For anyone following along, I think this ticket can be resolved if/when this PR lands:

barneygale added a commit to barneygale/cpython that referenced this issue Apr 3, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Apr 9, 2023
barneygale added a commit that referenced this issue Apr 9, 2023
Improve performance of path construction by skipping the addition of the path anchor (`drive + root`) to the internal `_parts` list. Rename this attribute to `_tail` for clarity.
@barneygale
Copy link
Contributor Author

Resolving this issue. It's now much cheaper to construct pathlib.PurePath and Path objects. Some operations like str() and joinpath() are also cheaper.

Future work:

warsaw pushed a commit to warsaw/cpython that referenced this issue Apr 11, 2023
…ythonGH-102476)

Improve performance of path construction by skipping the addition of the path anchor (`drive + root`) to the internal `_parts` list. Rename this attribute to `_tail` for clarity.
aisk pushed a commit to aisk/cpython that referenced this issue Apr 18, 2023
…ythonGH-102476)

Improve performance of path construction by skipping the addition of the path anchor (`drive + root`) to the internal `_parts` list. Rename this attribute to `_tail` for clarity.
mih added a commit to datalad/datalad-next that referenced this issue Nov 23, 2023
In #540 I found about ~10% of the runtime to be attributable to the
regex-matching. This patch replaces the regex with two split()
calls. In local benchmarks I see a ~10% speedup.

One major other source of slow-down was/is the construction of
`PurePosixPath` objects. It seems this is being taken care of by
python/cpython#101362 and will be resolved
eventually.

Closes #540
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants