Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-114847: Speed up posixpath.realpath() #114848

Merged
merged 23 commits into from
Apr 5, 2024
Merged

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Feb 1, 2024

Apply the following optimizations to posixpath.realpath():

  • Remove use of recursion
  • Construct child paths directly rather than using join()
  • Use os.getcwd[b]() rather than abspath()
  • Use startswith(sep) rather than isabs()
  • Use slicing rather than split()

$ ./python -m timeit -s "from os.path import realpath" "realpath('.')"
100000 loops, best of 5: 2.74 usec per loop  # before
200000 loops, best of 5: 1.53 usec per loop  # after
# --> 1.79x faster

$ ./python -m timeit -s "from os.path import realpath" "realpath('..')"
100000 loops, best of 5: 2.9  usec per loop  # before
200000 loops, best of 5: 1.71 usec per loop  # after
# --> 1.7x faster

$ ./python -m timeit -s "from os.path import realpath" "realpath('../..')"
 50000 loops, best of 5: 4.1  usec per loop  # before
200000 loops, best of 5: 1.99 usec per loop  # after
# --> 2.06x faster

$ ./python -m timeit -s "from os.path import realpath" "realpath('Lib/test')"
50000 loops, best of 5: 8.06 usec per loop  # before
50000 loops, best of 5: 6.19 usec per loop  # after
# --> 1.3x faster

$ ./python -m timeit -s "from os.path import realpath" "realpath('Lib/test/nonexistent/..')"
20000 loops, best of 5: 12.6 usec per loop  # before
50000 loops, best of 5:  9.4 usec per loop  # after
# --> 1.34x faster

$ ./python -m timeit -s "from os.path import realpath" "realpath('/home/barney')"
50000 loops, best of 5: 6.41 usec per loop  # before
50000 loops, best of 5: 4.62 usec per loop  # after
# --> 1.39x faster

Apply the following optimizations to `posixpath.realpath()`:

- Remove use of recursion
- Directly construct child paths rather than using `join()`
- Use `os.getcwd[b]()` rather than `abspath()`
- Use `startswith(sep)` rather than `isabs()`
@barneygale barneygale added the performance Performance or resource usage label Feb 1, 2024
@barneygale
Copy link
Contributor Author

I've added a few wordy comments because:

  1. When I first read the code, it took me some time to really understand what seen is doing
  2. Applying tail call optimisation (this PR) further obscures the purpose of seen

Hopefully it's a bit easier for the next person to grok with the comments in place.

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that this gets rid of recursion.

Lib/posixpath.py Show resolved Hide resolved
Lib/posixpath.py Show resolved Hide resolved
Lib/posixpath.py Show resolved Hide resolved
Lib/posixpath.py Outdated Show resolved Hide resolved
@barneygale
Copy link
Contributor Author

Given that symlinks are rare, perhaps I've actually slowed things down by checking in seen prior to running lstat(). I'll see if reverting that change is faster.

@encukou
Copy link
Member

encukou commented Apr 4, 2024

The introduction of a querying flag that's reset once changes behaviour, since rest can return back into an existing path.

Given a directory and a link to it:

$ mkdir a_dir
$ ln -s a_dir a_link

with Python 3.12 I get:

>>> os.path.realpath('foo/../a_link')
'/tmp/links/a_dir'

With this PR:

>>> os.path.realpath('foo/../a_link')
'/tmp/links/a_link'

The previous behaviour matches coreutils:

$ realpath -m foo/../a_link
/tmp/links/a_dir

@encukou
Copy link
Member

encukou commented Apr 5, 2024

Thanks. I see the other use of querying -- a symlink loop -- retains previous behaviour (which is tested), even though it doesn't match coreutils' realpath -m :/

$ mkdir a_dir
$ ln -s a_dir a_link
$ ln -s loop1 loop2
$ ln -s loop2 loop1
$ realpath -m loop1/../a_link/
/tmp/links/a_dir
>>> os.path.realpath('foo/../loop1/../a_link/')
'/tmp/links/a_link'

Guess that's one bug we need to let live.

Copy link
Member

@encukou encukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Lib/posixpath.py Outdated Show resolved Hide resolved
Co-authored-by: Petr Viktorin <encukou@gmail.com>
@barneygale
Copy link
Contributor Author

Thank you for the reviews!

@barneygale barneygale enabled auto-merge (squash) April 5, 2024 12:06
@barneygale barneygale merged commit abfa16b into python:main Apr 5, 2024
33 checks passed
@barneygale
Copy link
Contributor Author

Thanks. I see the other use of querying -- a symlink loop -- retains previous behaviour (which is tested), even though it doesn't match coreutils' realpath -m :/

(For posterity:) I opened an issue about this: #117546

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
Apply the following optimizations to `posixpath.realpath()`:

- Remove use of recursion
- Construct child paths directly rather than using `join()`
- Use `os.getcwd[b]()` rather than `abspath()`
- Use `startswith(sep)` rather than `isabs()`
- Use slicing rather than `split()`

Co-authored-by: Petr Viktorin <encukou@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants