-
Notifications
You must be signed in to change notification settings - Fork 1.3k
handle paths ending with sep in dvc.fs.path.Path.parts #7229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| self.flavour = ntpath | ||
| else: | ||
| raise ValueError(f"unsupported separator '{sep}'") | ||
| self.sep_re = re.compile(f"{re.escape(sep)}+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switching to regexp is a significant change, we would need to at the very least check the performance (e.g. split into parts 1M paths or smth like that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a gist with both versions (original with rstrip as parts and re version as parts_re) https://gist.github.com/jonburdo/d45db255710152704951bcdcf9503a16
I ran a little ipython test on my machine on almost 2.5M paths:
In [1]: from path_parts_comparison import Path
In [2]: import os
In [3]: from itertools import chain
In [4]: paths = list(chain(*[[d + os.sep] + [os.path.join(d, f) for f in files] for d, _, files in os.walk(os.path.expanduser('~'))]))
In [5]: ppath = Path('/')
In [6]: %time a = [ppath.parts(p) for p in paths]
CPU times: user 9.02 s, sys: 168 ms, total: 9.19 s
Wall time: 9.19 s
In [7]: %time b = [ppath.parts_re(p) for p in paths]
CPU times: user 2.8 s, sys: 222 ms, total: 3.03 s
Wall time: 3.03 s
In [8]: %time a = [ppath.parts(p) for p in paths]
CPU times: user 8.86 s, sys: 242 ms, total: 9.1 s
Wall time: 9.1 s
In [9]: %time b = [ppath.parts_re(p) for p in paths]
CPU times: user 2.97 s, sys: 190 ms, total: 3.16 s
Wall time: 3.16 s
In [10]: a == b
Out[10]: True
In [11]: a[:6]
Out[11]:
[('/', 'home', 'jon'),
('/', 'home', 'jon', '.bash_history'),
('/', 'home', 'jon', '.bash_aliases'),
('/', 'home', 'jon', '.bash_logout'),
('/', 'home', 'jon', '.yarnrc'),
('/', 'home', 'jon', '.bash_profile')]
In [12]: paths[:6]
Out[12]:
['/home/jon/',
'/home/jon/.bash_history',
'/home/jon/.bash_aliases',
'/home/jon/.bash_logout',
'/home/jon/.yarnrc',
'/home/jon/.bash_profile']
In [13]: len(paths)
Out[13]: 2469677Looks to me like regex is about 3 times faster. Obviously not a perfect test, but the regex is pretty simple and I think the repeated function calls in the current implementation add up.
This probably needs some more testing and debugging for the windows-style paths, but could be worth working off at some point.
| ret = self.sep_re.split(path.strip(self.flavour.sep)) | ||
| if path.startswith(self.flavour.sep): | ||
| ret = [self.flavour.sep] + ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and looks like it broke windows tests π Looks like the minimal change is to just rstrip the sep from the path, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's go with the minimal fix for now. I don't want to spend a lot of time debugging the windows tests now. I made a separate PR since I still think this might be worth debugging and working off of at some point. Let me know if you don't think so, but I think the current implementation still doesn't work as expected for windows paths.
Here's a PR with the minimal fix. Looks like it passes the tests: #7234
|
Fixed with #7234 |
Fixes #7228
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. π