Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-36305: Fixes to path handling and parsing in pathlib #12361

Closed
wants to merge 3 commits into from

Conversation

kmaork
Copy link
Contributor

@kmaork kmaork commented Mar 15, 2019

The bugs
This PR fixes three bugs with path parsing in pathlib.
The following examples show these bugs (when the cwd is C:\d):

  1. WindowsPath('C:a').absolute() should return WindowsPath('C:\\d\\a') but returns WindowsPath('C:a').
    This is caused by flawed logic in the parse_parts method of the _Flavour class.
  2. WindowsPath('./b:a').absolute() should return WindowsPath('C:\\d\\b:a') but returns WindowsPath('b:a').
    This is caused by the limited interface of parse_parts, and affects the Path.absolute, Path.expanduser and Path.__rtruediv__ methods.
  3. WindowsPath('./b:a').resolve() should return WindowsPath('C:\\d\\b:a') but returns WindowsPath('b:a').
    This is caused by missing logic in the resolve method and in Path.__str__

The fixes

  1. To fix the first bug, I fixed a flaw in the parse_parts method.

  2. The second one was more complicated, as with the current interface of parse_parts (called by _parse_args), the bug can't be fixed. Let's take a simple example: WindowsPath(WindowsPath('./a:b')) and WindowsPath('a:b') - before the bugfix, they are equal. That happens because in both cases, parse_parts is called with ['a:b']. This part can be interpreted in two ways - either as the relative path 'b' with the drive 'a:', or as a file 'a' with the NTFS data-stream 'b'.

    That means that in some cases, passing the flattened _parts of a path to parse_parts is lossy. Therefore we have to a modify parse_parts's interface to enable passing more detailed information about the given parts. What I decided to do was allow passing tuples in addition to strings, thus supporting the old interface. The tuples would contain the drive, root and path parts, enough information to determine the correct parsing of path parts with data-streams, and maybe more future edge cases.

    After modifying parse_parts's interface, I changed _parse_args to use it and made Path.absolute, Path.expanduser and Path.__rtruediv__ pass Path objects to _parse_args instead of path parts, to preserve the path information.

  3. To solve the third bug I had to make small changes to both the resolve method and to Path.__str__.

Notes
In addition to the changes in the code, I've added regression tests and modified old incorrect tests.
Details about drive-relative paths can be found here.

https://bugs.python.org/issue36305

Lib/pathlib.py Outdated
@@ -171,7 +171,7 @@ def splitroot(self, part, sep=sep):
else:
return part[:index2], sep, part[index2+1:]
drv = root = ''
if second == ':' and first in self.drive_letters:
if second == ':' and first in self.drive_letters and third == sep:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is not the solution. "C:" is a drive-relative path. Of course it has a drive. We need to fix absolute to properly support drive-relative paths. See the FIXME comment in the code. It's not necessarily as simple as calling ntpath._getfullpatname, as suggested in the comment, unless we give up on preserving ".." components in Windows. (That wouldn't be a problem really since preserving ".." components serves no point in Windows, unlike POSIX.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I will modify the PR.

@eryksun
Copy link
Contributor

eryksun commented Mar 16, 2019

WindowsPath('c:') / 'a' returns WindowsPath('c:a'), which is a bug.

This is not a bug. The resolved value of "c:" depends on its working directory, so joining it with "a" has to remain as "c:a". If we call absolute(), however, it should use the drive's current working directory to create a fully-qualified path.

# NTFS file-stream paths
check(['c:'], ('', '', ['c:']))
check(['c:a'], ('', '', ['c:a']))
check(['a', 'Z:b', 'c'], ('', '', ['a', 'Z:b', 'c']))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that "c:" is not a file stream path, and neither is "c:a". The system interprets "[A-Za-z]:" as a drive. If we want to create a file stream named "a" in a file named "c" in the working directory, we have to use a qualified path such as "./c:a".

Copy link
Contributor

@eryksun eryksun Mar 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which by the way is a separate bug in pathlib, if you want to look into it. For example, it's wrong for it to remove the explicit reference to the working directory in the following:

>>> p = Path('./c:a')
>>> str(p)
'c:a'

We gave it the path of a file stream, but it mistakenly handles it as a drive-relative path when getting the path as a string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and yet,

>>> p = Path('c:a')
>>> p
WindowsPath('c:a')
>>> Path('.') / p
WindowsPath('c:a')
>>> Path('./c:a')
WindowsPath('c:a')

I don't think it's reasonable for Path('.') / Path('c:a') to give a different result than Path('./c:a'). I'm pretty sure I've seen it documented somewhere (although not in the official docs) that using Path('a/b/c') is a portable and shorter way of writing Path('a')/Path('b')/Path('c'), and the proposed behaviour violates that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does make sense, and does seem to have been the assumption in pathlib (seeing the implementation for parse_parts).

But this is a wrong assumption, as sometimes a path part can be interpreted in more than one way (for example, c:a that can either be a drive-relative path or a file with a stream). That means that separating a path to its parts may cause a loss of information about the original path and lead to incorrect results.

Assuming that this assumption can't be true, I think that the current behavior -
WindowsPath('a') / WindowsPath('./c:b') returning WindowsPath('c:b') is surely worse than
WindowsPath('a') / WindowsPath('./c:b') returning WindowsPath('a/c:b') (which would be the case if my fixes are applied)

@kmaork kmaork changed the title bpo-36305: Fixed bug with colon-suffixed paths in pathlib bpo-36305: Fixes to path handling and parsing in pathlib Mar 16, 2019
@serhiy-storchaka serhiy-storchaka self-requested a review May 28, 2019 14:43
@csabella
Copy link
Contributor

csabella commented Jan 16, 2020

Part of this was fixed with bpo-34775, causing the merge conflict.

@pitrou
Copy link
Member

pitrou commented Apr 17, 2020

Hmm, I'm out of my depth here. @pfmoore @zooba Would you like to opine on the behaviour change in this PR? It's the first time I hear about NTFS data streams, myself...

@kmaork It's taken some time, but would you like to rebase / merge this PR with git master to have a fresh CI run?

@kmaork
Copy link
Contributor Author

kmaork commented May 2, 2020

I think this PR has gotten stuck because the wanted behavior was not agreed upon. If we can settle it, someone (maybe me) will be able to implement it.
An possible way to resolve this is to say that pathlib should behave exactly like os.path in all cases. What do you think? @pitrou @pfmoore @zooba

@barneygale
Copy link
Contributor

An possible way to resolve this is to say that pathlib should behave exactly like os.path in all cases. What do you think? @pitrou @pfmoore @zooba

Pathlib's internal representation of paths is more complex than a single string. This allows operations like .parent to be very fast, but it costs some flexibility in how path normalisation works. To make pathlib behave exactly like os.path, I think we'd need to perform serious surgery on pathlib's data model, which this PR doesn't do. I also reckon some of the original bugs have been fixed in recent versions.

Copy link
Contributor

@barneygale barneygale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once #101667 lands, you should only need to change _format_parsed_parts() to fix the original bug.

Comment on lines +136 to +137
check(['a', 'Z:b', 'c'], ('Z:', '', ['Z:', 'a', 'b', 'c']))
check(['Y:a', 'Z:b', 'c'], ('Z:', '', ['Z:', 'a', 'b', 'c']))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right - if we switch drives mid-way through we should discard everything before. This is what os.path.join() does.

@@ -746,7 +756,9 @@ class PureWindowsPathTest(_BasePurePathTest, unittest.TestCase):

equivalences = _BasePurePathTest.equivalences.copy()
equivalences.update({
'c:a': [ ('c:', 'a'), ('c:', 'a/'), ('/', 'c:', 'a') ],
'./a:b': [ ('./a:b',) ],
'a:b:c': [ ('./b:c', 'a:'), ('b:', 'a:b:c') ],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above, I don't see why ('./b:c', 'a:') shouldn't resolve to a:.

@barneygale
Copy link
Contributor

A refreshed version of this PR has been merged into main in #102454, so I'm going to close this PR. Thanks very much for working on this, @kmaork!

@barneygale barneygale closed this Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants