Tracking Issue for Path::file_prefix #86319

Open
1 of 3 tasks
mbhall88 opened this issue Jun 15, 2021 · 22 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@mbhall88
Contributor

mbhall88 commented Jun 15, 2021

Feature gate: #![feature(path_file_prefix)]

This is a tracking issue for an implementation of the method std::path::Path::file_prefix. It is effectively a "left" variant of the existing file_stem method.

Public API

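use std::ffi::OsStr;
use std::path::Path;

let path = Path::new("foo.tar.gz");
// file_prefix returns Option<&OsStr>, so compare against an OsStr.
assert_eq!(path.file_prefix(), Some(OsStr::new("foo")));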
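
On a toolchain where the feature gate is unavailable, the behaviour can be approximated with stable APIs. The following is a minimal sketch, not the standard library implementation: `file_prefix_sketch` is a hypothetical helper name, and it assumes the file name is valid UTF-8 so that it can work on `&str` instead of `&OsStr`.

```rust
use std::path::Path;

/// Hypothetical stand-in for `Path::file_prefix` (assumption: UTF-8 names only).
/// Returns the part of the file name before the first `.`, ignoring a dot
/// in the very first position so that names like `.bashrc` stay whole.
fn file_prefix_sketch(path: &Path) -> Option<&str> {
    let name = path.file_name()?.to_str()?;
    // Skip the first character so a leading dot is never treated as a separator.
    match name.char_indices().skip(1).find(|&(_, c)| c == '.') {
        Some((i, _)) => Some(&name[..i]),
        None => Some(name),
    }
}
```

The stable `file_stem` splits at the last dot instead, which is why it yields `foo.tar` for the same input.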

Steps / History

Unresolved Questions

  • None yet.
@schwern

schwern commented Dec 16, 2021

How about a corresponding file_suffix? That completes the set. file_stem/extension and file_prefix/file_suffix.

    pub fn file_suffix(&self) -> Option<&OsStr> {
        self.file_name().map(split_file_at_dot).and_then(|(_before, after)| after)
    }

use std::path::Path;

let path = Path::new("foo.tar.gz");
assert_eq!(path.file_prefix(), Some("foo"));
assert_eq!(path.file_suffix(), Some("tar.gz"));

assert_eq!(path.file_stem(), Some("foo.tar"));
assert_eq!(path.extension(), Some("gz"));
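
The proposed pairing can be tried out on stable Rust with a small string-based helper. This is only a sketch under assumptions: `split_at_first_dot` is a hypothetical name, it works on `&str` instead of `&OsStr`, and it drops the separating dot the same way `extension` does.

```rust
/// Hypothetical helper: split a file name at its first dot.
/// A dot in the first position is not treated as a separator.
/// Returns (prefix, suffix); the separating dot is in neither half.
fn split_at_first_dot(name: &str) -> (&str, Option<&str>) {
    match name.char_indices().skip(1).find(|&(_, c)| c == '.') {
        Some((i, _)) => (&name[..i], Some(&name[i + 1..])),
        None => (name, None),
    }
}
```

With this shape, joining the prefix and suffix does not reproduce the original name unless the dot is added back.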

@mbhall88
Contributor Author

mbhall88 commented Dec 16, 2021

@schwern that is indeed a nice idea. I guess it would be best to get the go ahead from someone in the libs team?

I don't know how these tracking issues work. The PR for this feature was merged a while ago, but I have no idea if I'm supposed to do something with this issue? @dtolnay as you were the reviewer/merger of that PR are you able to answer this?

@cyqsimon
Contributor

cyqsimon commented Feb 4, 2022

Any progress on this? It's a very useful API when working with files like tarballs.

@mbhall88
Contributor Author

mbhall88 commented Feb 5, 2022

@cyqsimon it is currently on nightly I believe. Not sure how long it takes to get onto stable though.

As for @schwern's suggestion, I guess that would have to be raised as a separate PR/issue and approved by someone on the libs team.

@hustcer

hustcer commented May 9, 2022

How about a corresponding file_suffix? That completes the set. file_stem/extension and file_prefix/file_suffix.

Good point !

@yescallop
Contributor

yescallop commented May 18, 2022

One of my concerns is that a file_prefix or file_suffix method that simply slices the path wouldn't be very useful, because users commonly put dots in the meaningful part of file names, such as File Version 3.5.4 (Copy).tar.gz (from Jacob Lifshay on Zulip Chat). The current file_stem and extension methods exist mainly because of File Associations on Windows and common practice on Linux, while it is relatively rare that one would need the exact part of file name before or after the first dot.

I'd suggest adding this to the unresolved questions, if possible.

@lmapii

lmapii commented May 18, 2022

While it might be impossible to have a 100% correct implementation, in my opinion it would already be helpful simply to have all Rust crates behave in a known manner. There are now hundreds of Rust crates that deal with files (matching file names, for example), all behaving a little differently. It is impossible to know how a crate that you include will behave.

Even having a method that might change over time and for which it is documented how it behaves in certain corner cases would be very, very useful. Compare it to the pathlib in python: It is not perfect but it is a beautiful piece of software to use.

@EvanCarroll

EvanCarroll commented May 18, 2022

while it is relatively rare that one would need the exact part of file name before or after the first dot.

It's not actually relatively rare. We went over this in voice chat. I gave you multiple reasons why someone would want to do this. In practice,

  • .tar.gz
  • .spec.ts
  • .openapi.yaml
  • .conf.d (for an example in a directory)

It's very popular with JavaScript (where there is no magic byte and everything is js or ts): .d.ts, .conf.json, .conf.js, .rc.json, rc.js, .esm.js, .cjs.js. It also shows up with translations: .ru.md etc. These are all common practices that we have to contend with right now, many of them in the same project.


The current suggestion to get the functionality is,

.file_name().and_then(OsStr::to_str).map_or(false, |name| name.ends_with(".tar.gz"))

Which I think is a bit convoluted. I am down for doing this other ways, but I don't think something like this is "rare" as far as path manipulation goes, or that it should require converting to and from str. I think the simplest question here is

  • Do we want file_suffix to return the . that file_prefix leaves? My own preference, I think file = file_prefix + file_suffix if we go down that route.

We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz could be (list monster in graphic below),

  • .next().collect::<OsStr>(), what I think of as extension (.bar.tar.gz), and I'm patching to be called .file_suffix() (List Monster tail)
  • .last(), what we currently call .extension() (List Monster last)
  • .first() what we currently call .file_prefix() (List Monster head)
  • file_stem would be all but the last element (not sure if there is a nice way to do that in Rust).. (List Monster init)

The bottom line here is that when it comes to the file name, people will logically think of splitting on . and want names for all possible sets/strings of that list, (from the List Monster in Learn You a Haskell)

[Image: the List Monster from Learn You a Haskell, showing the head, tail, init, and last of a list]

Something about that API of returning an Iterator just seems simpler than arguing about whether an "extension" in foo.tar.gz is .tar.gz, tar.gz, .gz or gz.
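
The head/tail/init/last naming can be illustrated with a few lines of stable Rust. This is a rough sketch, not a proposed API: `file_parts`, `tail`, and `init` are hypothetical helpers, the input is assumed to be a UTF-8 file name, and a leading dot (as in `.bashrc`) is not special-cased here.

```rust
/// Hypothetical helper: split a file name into its dot-separated parts.
fn file_parts(name: &str) -> Vec<&str> {
    name.split('.').collect()
}

/// Everything after the first part, re-joined with dots (List Monster "tail").
fn tail(parts: &[&str]) -> String {
    parts.get(1..).unwrap_or_default().join(".")
}

/// Everything except the last part, re-joined with dots (List Monster "init").
fn init(parts: &[&str]) -> String {
    parts[..parts.len().saturating_sub(1)].join(".")
}
```

For `foo.bar.tar.gz`, `parts.first()` is `foo`, `parts.last()` is `gz`, `tail` gives `bar.tar.gz`, and `init` gives `foo.bar.tar`. Note that the two joined forms allocate a new `String`.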

@yescallop
Contributor

There might've been some misunderstandings here. Let's take this file name File Version 3.5.4 (Copy).tar.gz (from Jacob Lifshay) as an example.
In this case, .file_prefix() returns File Version 3 while .file_suffix() returns .5.4 (Copy).tar.gz. If you need to determine whether this file name ends with .tar.gz or not, you still need to do something like

.and_then(OsStr::to_str).map_or(false, |name| name.ends_with(".tar.gz"))

IMO this is not optimal. Maybe we could add methods ends_with and strip_suffix on OsStr if this is sufficient for most use cases. But things are getting too complicated in here.
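
Until such methods exist on OsStr, the suffix check can be written without the round trip through str. The sketch below assumes a toolchain that has `OsStr::as_encoded_bytes` (Rust 1.74 or later); `file_name_ends_with` is a hypothetical helper name, not an existing std function.

```rust
use std::path::Path;

/// Hypothetical helper: does the final path component end with `suffix`?
/// Compares the platform-encoded bytes directly, so the name does not
/// have to be valid UTF-8.
fn file_name_ends_with(path: &Path, suffix: &str) -> bool {
    path.file_name()
        .is_some_and(|name| name.as_encoded_bytes().ends_with(suffix.as_bytes()))
}
```

Because only the final component is inspected, a directory named `x.tar.gz` earlier in the path does not produce a match.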

@EvanCarroll

EvanCarroll commented May 18, 2022

I can see the utility in better processing for File Version 3.5.4 (Copy).tar.gz. We could define a filename as

filename            = ( file_stem, extension? );
file_stem           = ( dotfile | non_dots );
dotfile             = ( "." + non_dots );
extension           = extension_segment+;
extension_segment   = ("." , valid_extension );
non_dots            = [^.]+?;
valid_extension     = [^\s.]+;

I believe the above would make it such that

File Version 3.5.4 (Copy).tar.gz

Gets parsed as

file_stem = "File Version 3.5.4 (Copy)"
extension = ".tar.gz"

But it's probably still not sufficient because if the filename was instead File Version 3.5.4.tar.gz there would literally be nothing you could do if you desired file_name to be File Version 3.5.4. And I think that's fine. That's why I advocate returning an iterator for extension in the above comment, which is what Python does too. That makes processing easier.
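
One way to read the grammar is to peel extension segments off the right-hand side for as long as each segment is non-empty and free of whitespace. The sketch below follows that reading because it reproduces the parse given above; `split_by_grammar` is a hypothetical name, and the function works on `&str` only.

```rust
/// Hypothetical helper: split a file name into (file_stem, extension),
/// where the extension is the longest run of trailing `.segment` pieces
/// whose segments are non-empty and contain no whitespace.
/// A dot in the first position belongs to the stem (dotfile).
fn split_by_grammar(name: &str) -> (&str, &str) {
    let mut cut = name.len();
    while let Some(dot) = name[..cut].rfind('.') {
        if dot == 0 {
            break; // leading dot: part of a dotfile stem
        }
        let segment = &name[dot + 1..cut];
        if segment.is_empty() || segment.chars().any(char::is_whitespace) {
            break; // not a valid extension segment, so the stem ends here
        }
        cut = dot;
    }
    (&name[..cut], &name[cut..])
}
```

For File Version 3.5.4.tar.gz the same function returns the stem File Version 3 and the extension .5.4.tar.gz, which is the limitation described above.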


IMO this is not optimal. Maybe we could add methods ends_with and strip_suffix on OsStr if this is sufficient for most use cases. But things are getting too complicated in here.

I agree with the OsStr suggestion. It doesn't make sense anyway not to have strip_suffix, strip_prefix, starts_with and ends_with on OsStr. But I still think having good pathing functionality in the library should be a goal, rather than to throw up our hands and say it's "too complex" or to pretend like it's rare to want to know an extension on a file. (Discussion topic on this suggestion opened https://internals.rust-lang.org/t/feature-request-starts-with-and-ends-with-on-osstr/16652)

@mbhall88
Contributor Author

mbhall88 commented May 19, 2022

We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz

I think this is probably the "correct" way to approach this. With such an iterator method we can easily implement any/all flavours of splitting up file names.

In borrowing from python's pathlib (which is very ergonomic), it may be as simple as implementing something akin to suffixes. Because with the method this issue tracks (file_prefix), and a new method suffixes, you have all parts.

Using the running example of the corner case file name

use std::path::Path;

let path = Path::new("path/to/File Version 3.5.4 (Copy).tar.gz");
let mut exts = path.suffixes();

assert_eq!(path.file_prefix(), Some("File Version 3"));
assert_eq!(exts.next(), Some(".5"));
assert_eq!(exts.next(), Some(".4 (Copy)"));
assert_eq!(exts.next(), Some(".tar"));
assert_eq!(exts.next(), Some(".gz"));
assert_eq!(exts.next(), None);

Although, I can also understand the argument that the following is more "complete"

use std::path::Path;

let path = Path::new("path/to/File Version 3.5.4 (Copy).tar.gz");
let mut parts = path.file_parts();

assert_eq!(parts.next(), Some("File Version 3"));
assert_eq!(parts.next(), Some(".5"));
assert_eq!(parts.next(), Some(".4 (Copy)"));
assert_eq!(parts.next(), Some(".tar"));
assert_eq!(parts.next(), Some(".gz"));
assert_eq!(parts.next(), None);
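
A forward iterator of this kind is easy to prototype on stable Rust. The following is a sketch under assumptions, not the proposed std method: `suffixes` here is a free function over a UTF-8 file name, it yields each suffix with its leading dot as in the examples above, and a dot in the first position is not counted as a separator.

```rust
use std::iter::once;

/// Hypothetical helper: yield every dot-led suffix of a file name, left to right.
fn suffixes(name: &str) -> impl Iterator<Item = &str> {
    // Byte offsets of each dot that starts a suffix (position 0 is skipped).
    let starts: Vec<usize> = name
        .char_indices()
        .skip(1)
        .filter(|&(_, c)| c == '.')
        .map(|(i, _)| i)
        .collect();
    // Each suffix ends where the next one starts, or at the end of the name.
    let ends: Vec<usize> = starts
        .iter()
        .copied()
        .skip(1)
        .chain(once(name.len()))
        .collect();
    starts.into_iter().zip(ends).map(move |(s, e)| &name[s..e])
}
```

Prepending the prefix (everything before the first yielded suffix) gives the file_parts flavour.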

@yescallop
Contributor

yescallop commented May 19, 2022

FYI, here's my attempt to implement Path::extensions which is akin to suffixes in python. Your comments are very welcome.

let mut exts = Path::new("foo.tar.gz").extensions().unwrap();
assert_eq!(exts.next(), Some(OsStr::new("gz")));
assert_eq!(exts.next(), Some(OsStr::new("tar")));
assert_eq!(exts.next(), None);
assert_eq!(exts.visited(), Some(OsStr::new("tar.gz")));
assert_eq!(exts.remaining(), "foo");

With this implementation, we have .file_prefix() == .extensions().map(|exts| exts.skip_all().remaining()) and .file_suffix() == .extensions().and_then(|exts| exts.skip_all().visited()).
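
The shape of such an iterator can be sketched in a few lines. This is not the linked implementation: it is a simplified illustration that assumes a UTF-8 file name, constructs the iterator directly from a `&str`, and omits `skip_all`; only the `visited`/`remaining` idea is taken from the comment above.

```rust
/// Simplified sketch of a right-to-left extension iterator.
struct Extensions<'a> {
    name: &'a str,
    /// Byte offset of the dot that ends the unvisited part
    /// (or `name.len()` if nothing has been visited yet).
    cut: usize,
}

impl<'a> Extensions<'a> {
    fn new(name: &'a str) -> Self {
        Self { name, cut: name.len() }
    }

    /// The part of the file name not yet consumed by iteration.
    fn remaining(&self) -> &'a str {
        &self.name[..self.cut]
    }

    /// The extensions visited so far, as one contiguous slice.
    fn visited(&self) -> Option<&'a str> {
        if self.cut == self.name.len() {
            None
        } else {
            Some(&self.name[self.cut + 1..])
        }
    }
}

impl<'a> Iterator for Extensions<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let rest = &self.name[..self.cut];
        let dot = rest.rfind('.')?;
        if dot == 0 {
            return None; // a leading dot is not an extension separator
        }
        self.cut = dot;
        Some(&rest[dot + 1..])
    }
}
```

Because `visited` and `remaining` return slices of the original name, no allocation is needed to recover the multi-part extension.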

@yescallop
Contributor

yescallop commented May 20, 2022

Correct me if I'm wrong, but I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name. However, there are actually a bunch of densely-dotted file names in the wild that go against this design, for example:

pyroute2.nslink-0.6.9.tar.gz, api.admin.users.service.spec.ts, Demo.Sales.APISvc.openapi.yaml, sap.ui.webc.common.d.ts, proxy.remote.docker.eximee.conf.json, webpack.dist.components.px.conf.js, org.freebsd.rc.json, and so on.

IMO, none of these file names benefits much from a split at the first dot unless it is guaranteed that we deny dots in the non-extension part of a file name. I'd prefer to go with the iterator solution suggested above and I'm willing to open a PR for this.

@EvanCarroll

I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name.

Agreed, that's why I named the original suggestion file_parts and not extension.

We could potentially find better ways to do all of this though. Like simply returning an iterator for file_parts where given foo.bar.tar.gz could be (list monster in graphic below),

@yescallop
Contributor

@EvanCarroll That might be a solution, but I think there are some disadvantages to it:

The file_parts method returns a forward iterator over the parts of file name. I think this is against the fact that it is often the case that a program considers the non-extension part of file name opaque to path processing. As long as the extensions match, the program may extract the remaining part of file name and pass it to another function for further parsing, whether there are dots in it or not.

For this reason, a forward iterator seems not very practical. Also, I'm afraid it is not allowed to do a .collect::<&OsStr>() on an iterator that yields OsStr slices, because that requires &OsStr: FromIterator<&OsStr>. Doing a .collect::<OsString>() is possible, but it does unnecessary allocation.

This is why I implemented Path::extensions and two methods on the resulting Extensions struct:

  • .visited() to return the visited extensions as a consecutive OsStr.
  • .remaining() to return the part of file name remaining for iteration.

If it is really needed to do what .file_prefix() does right now, we may have:

  • .skip_all() to skip all the remaining extensions. Then you call .remaining() to get file_prefix or .visited() to get file_suffix.

And if it is sometimes useful to return the extensions with preceding dots, we could have:

  • .with_preceding_dots() to make the iterator do that.

@mbhall88
Contributor Author

Correct me if I'm wrong, but I believe that having a method that extracts the part of file name before or after the first dot is not a good idea. By splitting a file name at the first dot, one asserts that they won't have dots in the non-extension part of file name. However, there are actually a bunch of densely-dotted file names in the wild that go against this design, for example:

pyroute2.nslink-0.6.9.tar.gz, api.admin.users.service.spec.ts, Demo.Sales.APISvc.openapi.yaml, sap.ui.webc.common.d.ts, proxy.remote.docker.eximee.conf.json, webpack.dist.components.px.conf.js, org.freebsd.rc.json, and so on.

IMO, none of these file names benefits much from a split at the first dot unless it is guaranteed that we deny dots in the non-extension part of a file name. I'd prefer to go with the iterator solution suggested above and I'm willing to open a PR for this.

You are right, none of those listed file names benefit from a split at the first dot. But there are many more file names "in the wild" that do. Solution: anyone working with those file names uses some other method to extract the parts they want...
Saying a method is not needed because some corner cases make it useless doesn't negate it being very useful in a whole bunch of other circumstances.

@robmv

robmv commented Dec 19, 2023

Instead of creating new methods with new semantics that cannot handle file names with dots in parts unrelated to the extension, why not extend the current ones like:

let path = Path::new("foo 1.2.3.tar.gz");
assert_eq!(path.file_stem_multi(2), Some("foo 1.2.3"));
assert_eq!(path.extension_multi(2), Some("tar.gz"));

This will only have a different contract than the current ones, where:

Otherwise, the portion of the file name before the final `.`

is changed to;

Otherwise, the portion of the file name before the nth `.` from the end.

Where file_stem() and extension() are just a simple case of these new methods with the argument 1. Other helper methods like with_extension(..) could be extended to indicate how many segments of the extension should be replaced.

Note: these should be named with a better suffix than _multi.
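
The contract described above can be prototyped as a single split function. This is only a sketch: `split_multi` is a hypothetical name, it works on a UTF-8 file name, and returning no extension when there are fewer than n dots is an assumption that the proposal does not spell out.

```rust
/// Hypothetical helper: split a file name at the nth `.` counted from the end.
/// Returns (stem, extension); the separating dot is in neither half.
fn split_multi(name: &str, n: usize) -> (&str, Option<&str>) {
    if n == 0 {
        return (name, None);
    }
    match name.rmatch_indices('.').nth(n - 1) {
        // A dot in the first position is not treated as a separator.
        Some((i, _)) if i > 0 => (&name[..i], Some(&name[i + 1..])),
        _ => (name, None),
    }
}
```

With n = 1 this mirrors the split that file_stem() and extension() perform today.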

@mbhall88
Contributor Author

@thomcc apologies for pinging you directly but I couldn't find a tag for the library team anywhere. @dtolnay was originally managing this feature but I don't know if he is on the library team anymore? I don't really know how long these features normally take to progress through to stable but this issue has been open for two and a half years which seems like longer than I would expect?
Can you suggest any next steps to get this stabilised?

@tgross35
Contributor

tgross35 commented Mar 5, 2024

@thomcc apologies for pinging you directly but I couldn't find a tag for the library team anywhere. @dtolnay was originally managing this feature but I don't know if he is on the library team anymore? I don't really know how long these features normally take to progress through to stable but this issue has been open for two and a half years which seems like longer than I would expect? Can you suggest any next steps to get this stabilised?

Opening a stabilization PR is trivial, you just need to change the unstable gate to stable as described here https://rustc-dev-guide.rust-lang.org/stability.html#stabilizing-a-library-feature.

That being said, I have to agree with the concerns that the algorithm here can be a bit bikeshed-y and this could be better served by pattern functions on OsStr, which have been proposed in rust-lang/libs-team#311.

Opening a stabilization PR to start libs-api discussion probably isn't a bad idea in any case.

@mbhall88
Contributor Author

@tgross35 according to the stabilization docs you linked the first step is

Ask a @T-libs-api member to start an FCP on the tracking issue and wait for the FCP to complete (with disposition-merge).

Are you a member of said team? If I try and use that identifier in this comment it doesn't come up with anything...

@tgross35
Contributor

tgross35 commented Aug 15, 2024

I am not, but team members can always be found in the relevant files on the teams repo https://github.com/rust-lang/team/blob/master/teams/libs-api.toml. Pinging team members isn't really the best way to get something on the radar though, so opening the stabilization PR first and requesting libs-api review is usually easier for trivial things like this (FCP just happens there).

If you have an idea for an iterator-based API it may be worth proposing that too as an alternative. New API proposals go through the ACP process, which is an issue template at https://github.com/rust-lang/libs-team/issues.

(also feel free to drop in on Zulip if you have any more specific questions)

@mbhall88
Contributor Author

Thanks very much for your help @tgross35. I'll open the PR and we can discuss the FCP there and whether this implementation or an iterator API is preferable.
