GH-117586: Speed up pathlib.Path.glob() by working with strings #117589

Merged
barneygale merged 8 commits into python:main on Apr 10, 2024

barneygale (Contributor) commented Apr 6, 2024

Move pathlib globbing implementation into a new private class: glob._Globber. This class implements fast string-based globbing. It's called by pathlib.Path.glob(), which then converts strings back to path objects.
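As a rough illustration of the idea (not the actual `glob._Globber` code), a string-based matcher for a single wildcard segment could look like the sketch below. The names `glob_strings` and `glob_paths` are hypothetical; only `os.scandir()`, `fnmatch.translate()` and `pathlib.Path` are real APIs here.

```python
import fnmatch
import os
import re
from pathlib import Path


def glob_strings(root, pattern):
    """Yield entries of directory `root` whose names match `pattern`, as strings."""
    match = re.compile(fnmatch.translate(pattern)).match
    with os.scandir(root) as entries:
        for entry in entries:
            if match(entry.name):
                # Results are built by joining strings; no Path objects
                # are created while scanning.
                yield os.path.join(root, entry.name)


def glob_paths(root, pattern):
    """Convert to Path objects once, at the API boundary."""
    return (Path(s) for s in glob_strings(str(root), pattern))
```

The point of the split is that the inner loop only touches strings, and path objects are created only for the results handed back to the caller.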

In the private pathlib ABCs, add a pathlib._abc.Globber subclass that works with PathBase objects rather than strings, and calls user-defined path methods like PathBase.stat() rather than os.stat().
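One way to picture this arrangement is a base class whose filesystem access goes through a few overridable hooks, with a subclass that routes those hooks to the path object. This is a toy sketch under that assumption: `StringGlobber`, `PathGlobber`, `FakePath` and the hook names are all hypothetical and are not the names used in CPython.

```python
import fnmatch
import os
import re


class StringGlobber:
    """Matches one wildcard segment; operates on plain string paths."""

    # Hooks a subclass may replace.
    @staticmethod
    def scandir(path):
        with os.scandir(path) as entries:
            return [entry.name for entry in entries]

    @staticmethod
    def concat(path, name):
        return os.path.join(path, name)

    def select(self, root, pattern):
        match = re.compile(fnmatch.translate(pattern)).match
        for name in self.scandir(root):
            if match(name):
                yield self.concat(root, name)


class FakePath:
    """Toy in-memory path type standing in for a user-defined path class."""

    tree = {"/": ["a.py", "b.txt"]}

    def __init__(self, s):
        self.s = s

    def child_names(self):
        return self.tree.get(self.s, [])

    def joinpath(self, name):
        return FakePath(self.s.rstrip("/") + "/" + name)


class PathGlobber(StringGlobber):
    """Same algorithm, but every query goes through the path object's methods."""

    @staticmethod
    def scandir(path):
        return path.child_names()

    @staticmethod
    def concat(path, name):
        return path.joinpath(name)
```

With this shape, `PathGlobber().select(FakePath("/"), "*.py")` never touches `os` at all, which is what lets a virtual filesystem reuse the matching logic.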

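This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with lstat()
- GH-116380: Unify pathlib and glob implementations of globbing.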

Timings (for each command, the first result is before this change and the second is after):

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*'))"
1000 loops, best of 5: 392 usec per loop
1000 loops, best of 5: 365 usec per loop
# --> 1.07x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/*.py'))"
1000 loops, best of 5: 393 usec per loop
1000 loops, best of 5: 371 usec per loop
# --> 1.06x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**'))"
50 loops, best of 5: 9.46 msec per loop
50 loops, best of 5: 9.06 msec per loop
# --> 1.04x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/'))"
50 loops, best of 5: 4.98 msec per loop
50 loops, best of 5: 5.15 msec per loop
# --> 1.03x slower (!)

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*'))"
20 loops, best of 5: 14 msec per loop
20 loops, best of 5: 12.9 msec per loop
# --> 1.09x faster

$ ./python -m timeit -s "from pathlib import Path; p = Path.cwd()" "list(p.glob('Lib/**/*.py'))"
20 loops, best of 5: 12.2 msec per loop
20 loops, best of 5: 11.4 msec per loop
# --> 1.07x faster
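For anyone wanting to script similar measurements rather than use the command line, a minimal sketch with the `timeit` module follows. The helper names `best_usec` and `ratio` are made up for this example, and the script globs the current directory with `'*'` rather than assuming a CPython checkout.

```python
import timeit
from pathlib import Path


def best_usec(pattern, root=None, number=20, repeat=5):
    """Best-of-`repeat` time per call of list(root.glob(pattern)), in microseconds."""
    root = Path.cwd() if root is None else root
    timings = timeit.repeat(
        lambda: list(root.glob(pattern)), number=number, repeat=repeat
    )
    return min(timings) / number * 1e6


def ratio(before, after):
    """How many times faster `after` is than `before` (below 1 means slower)."""
    return before / after


if __name__ == "__main__":
    print(f"{best_usec('*'):.0f} usec per loop")
    # Figures taken from the 'Lib/*' timing above.
    print(f"{ratio(392, 365):.2f}x")
```

Comparing two interpreter builds still means running the script once under each build and feeding the two results to `ratio()`.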

Move pathlib globbing implementation to a new module and class:
`pathlib._glob.Globber`. This class implements fast string-based globbing.
It's called by `pathlib.Path.glob()`, which then converts strings back to
path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that
works with `PathBase` objects rather than strings, and calls user-defined
path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
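To give a sense of why the first follow-up could help: a pattern segment with no wildcards names exactly one candidate, so its existence can be checked with a single `lstat()` call instead of listing the parent directory. The sketch below is an assumption about the general technique, not the code from GH-115060; `select_literal` is a hypothetical name.

```python
import os


def select_literal(parent, name):
    """Yield parent/name if it exists, using one lstat() call.

    No directory scan is needed because `name` contains no wildcards.
    """
    path = os.path.join(parent, name)
    try:
        # lstat() does not follow symlinks, so a dangling link still counts.
        os.lstat(path)
    except OSError:
        return
    yield path
```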

barneygale (Contributor, Author) commented

This is the first PR in a series that will hopefully unify the globbing implementations in the pathlib and glob modules, and speed both up in the process.

barneygale (Contributor, Author) commented Apr 7, 2024

Hey @serhiy-storchaka, does this PR look alright to you? Not requesting a detailed review, more of a sanity check, given you've looked after the glob module for the last few years.

This PR doesn't affect glob.[i]glob(), but it does move pathlib's globbing implementation into glob.py.

Thank you.

barneygale (Contributor, Author) commented

I'll merge this now as it's important for #115060, which I'm hoping to get done in time for 3.13 beta 1.

But I'll leave glob.glob() and glob.iglob() unchanged in 3.13; any PRs I make will target 3.14.

@barneygale barneygale merged commit 6258844 into python:main Apr 10, 2024
33 checks passed
barneygale added a commit to barneygale/cpython that referenced this pull request Apr 10, 2024
Move `pathlib.Path.walk()` implementation into `glob._Globber`. The new
`glob._Globber.walk()` classmethod works with strings internally, which is
a little faster than generating `Path` objects and keeping them normalized.
The `pathlib.Path.walk()` method converts the strings back to path objects.

In the private pathlib ABCs, our existing subclass of `_Globber` ensures
that `PathBase` instances are used throughout.

Follow-up to python#117589.
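The string-first approach described in this commit message can be sketched roughly as follows. This is a simplified stand-in, not the `glob._Globber.walk()` implementation: `walk_strings` and `walk_paths` are hypothetical names, and details such as symlink handling and error callbacks are reduced to the bare minimum.

```python
import os
from pathlib import Path


def walk_strings(top):
    """Top-down walk yielding (dirpath, dirnames, filenames) with string paths."""
    stack = [top]
    while stack:
        dirpath = stack.pop()
        dirnames, filenames = [], []
        try:
            with os.scandir(dirpath) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        dirnames.append(entry.name)
                    else:
                        filenames.append(entry.name)
        except OSError:
            continue  # unreadable directory: skip it
        yield dirpath, dirnames, filenames
        # Child paths are produced by plain string joins.
        stack.extend(os.path.join(dirpath, d) for d in reversed(dirnames))


def walk_paths(top):
    """Wrap only the yielded directory path in a Path object."""
    for dirpath, dirnames, filenames in walk_strings(str(top)):
        yield Path(dirpath), dirnames, filenames
```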
barneygale added a commit that referenced this pull request Apr 11, 2024
…17726)

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
…gs (python#117726)

Labels: performance, topic-pathlib