Skip to content

Conversation

@jonburdo
Copy link
Contributor

@jonburdo jonburdo commented Jan 6, 2022

Fixes #7228

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@jonburdo jonburdo requested a review from a team as a code owner January 6, 2022 02:44
@jonburdo jonburdo requested a review from dtrifiro January 6, 2022 02:44
@jonburdo
Copy link
Contributor Author

jonburdo commented Jan 6, 2022

@efiop

self.flavour = ntpath
else:
raise ValueError(f"unsupported separator '{sep}'")
self.sep_re = re.compile(f"{re.escape(sep)}+")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switching to regexp is a significant change, we would need to at the very least check the performance (e.g. split into parts 1M paths or smth like that).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a gist with both versions (original with rstrip as parts and re version as parts_re) https://gist.github.com/jonburdo/d45db255710152704951bcdcf9503a16

I ran a little ipython test on my machine on almost 2.5M paths:

In [1]: from path_parts_comparison import Path

In [2]: import os

In [3]: from itertools import chain

In [4]: paths = list(chain(*[[d + os.sep] + [os.path.join(d, f) for f in files] for d, _, files in os.walk(os.path.expanduser('~'))]))

In [5]: ppath = Path('/')

In [6]: %time a = [ppath.parts(p) for p in paths]
CPU times: user 9.02 s, sys: 168 ms, total: 9.19 s
Wall time: 9.19 s

In [7]: %time b = [ppath.parts_re(p) for p in paths]
CPU times: user 2.8 s, sys: 222 ms, total: 3.03 s
Wall time: 3.03 s

In [8]: %time a = [ppath.parts(p) for p in paths]
CPU times: user 8.86 s, sys: 242 ms, total: 9.1 s
Wall time: 9.1 s

In [9]: %time b = [ppath.parts_re(p) for p in paths]
CPU times: user 2.97 s, sys: 190 ms, total: 3.16 s
Wall time: 3.16 s

In [10]: a == b
Out[10]: True

In [11]: a[:6]
Out[11]: 
[('/', 'home', 'jon'),
 ('/', 'home', 'jon', '.bash_history'),
 ('/', 'home', 'jon', '.bash_aliases'),
 ('/', 'home', 'jon', '.bash_logout'),
 ('/', 'home', 'jon', '.yarnrc'),
 ('/', 'home', 'jon', '.bash_profile')]

In [12]: paths[:6]
Out[12]: 
['/home/jon/',
 '/home/jon/.bash_history',
 '/home/jon/.bash_aliases',
 '/home/jon/.bash_logout',
 '/home/jon/.yarnrc',
 '/home/jon/.bash_profile']

In [13]: len(paths)
Out[13]: 2469677

Looks to me like regex is about 3 times faster. Obviously not a perfect test, but the regex is pretty simple and I think the repeated function calls in the current implementation add up.

This probably needs some more testing and debugging for the windows-style paths, but could be worth working off at some point.

Comment on lines +20 to +22
ret = self.sep_re.split(path.strip(self.flavour.sep))
if path.startswith(self.flavour.sep):
ret = [self.flavour.sep] + ret
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and looks like it broke windows tests πŸ™ Looks like the minimal change is to just rstrip the sep from the path, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's go with the minimal fix for now. I don't want to spend a lot of time debugging the windows tests now. I made a separate PR since I still think this might be worth debugging and working off of at some point. Let me know if you don't think so, but I think the current implementation still doesn't work as expected for windows paths.

Here's a PR with the minimal fix. Looks like it passes the tests: #7234

@dtrifiro
Copy link
Contributor

Fixed with #7234

@dtrifiro dtrifiro closed this Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dvc.fs.Path.parts wrong results

3 participants