Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Equivalent to Python's os.path.normpath #2208

Open
linclelinkpart5 opened this issue Nov 5, 2017 · 37 comments
Open

Equivalent to Python's os.path.normpath #2208

linclelinkpart5 opened this issue Nov 5, 2017 · 37 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@linclelinkpart5
Copy link

In Python, os.path.normpath performs a simple lexical normalization of a path, removing redundant dir separators and cur-dir ('.') parts, and resolving up-dir references ('..'). In addition, on Windows, any forward slashes are converted to backslashes.

It seems that Rust's std::fs::canonicalize is functionally similar to Python's os.path.realpath (with the addition of a check to see if the target exists), and is a superset of normpath. However, I have a case where I only want to normalize the path, and not resolve symlinks.

I'm quite new to Rust, but I'd be happy to help contribute a new method to do this to the library if I can.

@ExpHP
Copy link

ExpHP commented Nov 7, 2017

Heavens, yes! I can't believe rust doesn't have something like this!

I think it ought to be a method on Path (since this is path functionality and not filesystem functionality), and should return just a plain old PathBuf--no io::Result--not just because it isn't needed, but also to better communicate its nature as a pure function.

@eddyb
Copy link
Member

eddyb commented Nov 7, 2017

cc @rust-lang/libs

@linclelinkpart5
Copy link
Author

linclelinkpart5 commented Nov 8, 2017

I think I've successfully written a good first draft of this method. My aim was to replicate the version in Python (linked above) and Go. The documentation for the Go version in turn links to the Plan 9 document detailing the high-level algorithm of this sort of normalization, handy!

Right now, it's a free function, not in the std::path crate yet. As such, the method takes in an explicit &Path as opposed to a &self, but that can easily be fixed.

use std::path::Path;
use std::path::PathBuf;
use std::path::Component;

fn normalize(p: &Path) -> PathBuf {
    let mut stack: Vec<Component> = vec![];

    // We assume .components() removes redundant consecutive path separators.
    // Note that .components() also does some normalization of '.' on its own anyways.
    // This '.' normalization happens to be compatible with the approach below.
    for component in p.components() {
        match component {
            // Drop CurDir components, do not even push onto the stack.
            Component::CurDir => {},

            // For ParentDir components, we need to use the contents of the stack.
            Component::ParentDir => {
                // Look at the top element of stack, if any.
                let top = stack.last().cloned();

                match top {
                    // A component is on the stack, need more pattern matching.
                    Some(c) => {
                        match c {
                            // Push the ParentDir on the stack.
                            Component::Prefix(_) => { stack.push(component); },

                            // The parent of a RootDir is itself, so drop the ParentDir (no-op).
                            Component::RootDir => {},

                            // A CurDir should never be found on the stack, since they are dropped when seen.
                            Component::CurDir => { unreachable!(); },

                            // If a ParentDir is found, it must be due to it piling up at the start of a path.
                            // Push the new ParentDir onto the stack.
                            Component::ParentDir => { stack.push(component); },

                            // If a Normal is found, pop it off.
                            Component::Normal(_) => { let _ = stack.pop(); }
                        }
                    },

                    // Stack is empty, so path is empty, just push.
                    None => { stack.push(component); }
                }
            },

            // All others, simply push onto the stack.
            _ => { stack.push(component); },
        }
    }

    // If an empty PathBuf would be return, instead return CurDir ('.').
    if stack.is_empty() {
        return PathBuf::from(Component::CurDir.as_ref());
    }

    let mut norm_path = PathBuf::new();

    for item in &stack {
        norm_path.push(item.as_ref());
    }

    norm_path
}

Some sample (Unix) file path tests:

fn main() {
    let mut paths = vec![];

    paths.push(Path::new("../../home/thatsgobbles/././music/../code/.."));
    paths.push(Path::new("/home//thatsgobbles/music/"));
    paths.push(Path::new("/../../home/thatsgobbles/././code/../music/.."));
    paths.push(Path::new(".."));
    paths.push(Path::new("/.."));
    paths.push(Path::new("../"));
    paths.push(Path::new("/"));
    paths.push(Path::new(""));
    // More tests for Windows (especially with drive letters and UNC paths) needed.

    for p in &paths {
        let np = normalize(&p);
        println!("{:?} ==> {:?}", &p, &np);
    }
}

@ExpHP
Copy link

ExpHP commented Nov 8, 2017

Brute force comparison:

https://gist.github.com/ExpHP/3f7d8c03be1a45ebe5abd3ad5a517d73

fn gen_strings(alphabet: &[&str], s: String, depth: u8, emit: fn(&str)) {
    emit(&s);
    if let Some(deeper) = depth.checked_sub(1) {
        for &ch in alphabet {
            gen_strings(alphabet, s.clone() + ch, deeper, emit);
        }
    }
}

fn main() {
    gen_strings(&["a", ".", "/"], Default::default(), 12, |s| {
        let p = Path::new(&s);
        println!("{:>15} -> {:>15}", p.display(), normalize(&p).display());
    });
}
import os

def gen_strings(alphabet, s, depth):
    yield s
    if depth > 0:
        for ch in alphabet:
            yield from gen_strings(alphabet, s + ch, depth - 1)

for s in gen_strings("a./", '', 12):
    print("{:>15} -> {:>15}".format(s, os.path.normpath(s)))

Generating all strings of "a", ".", and "/" up to length 12 on Arch, there is one behavioral difference between your implementation and normpath, which is that, if a path begins with // but does not begin with ///, then normpath keeps the leading //. Interestingly, it appears that this is a thing in POSIX. (Makes me wonder if Python still does that on windows?)

It'd be nice to survey more languages and know what decisions and tradeoffs are on the table.

@linclelinkpart5
Copy link
Author

@ExpHP Thank you for the extended tests! I don't think I would have found that corner case without your help.

The double forward slash case is quite strange, I admit. I'm surprised that it ends up resolving to just /. I'm even newer at Go than Rust, but I can see if I can whip up a comparable test using the previously linked .Clean() method. I believe there's also such a routine in Node.js.

I'm going to look into installing Rust on my Windows machine tonight, and trying out some tests on there.

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Dec 6, 2017
@Centril
Copy link
Contributor

Centril commented Oct 8, 2018

Closing in favor of rust-lang/rust#47402.

@Centril Centril closed this as completed Oct 8, 2018
@akavel
Copy link

akavel commented Jan 23, 2019

@Centril is this correct to close this issue, given that the rust-lang/rust#47402 is (just?) a "tracking issue", which actually points back to this very issue? I'm really sorry if this is something normal, but as a Rust newbie, I really don't understand what's the status of this in such a situation :( Is it open? closed? worked on? not worked on? what could I do if I wanted to, say, submit a solution proposal?

@ExpHP
Copy link

ExpHP commented Jan 23, 2019

I recall that this was associated with a PR that was never finished after a long period of back-and-forth. Looking at the linkbacks, this is it:

rust-lang/rust#47363

Since we haven't heard from the author for a long time, I'd reckon it's fair game for anybody who wants to rebase that PR and address the outstanding feedback.


Issues on the RFC repository generally aren't regarded as actionable items (people for some reason have been using them as feature requests or like a discussion forum), so their open/close state is largely immaterial beyond the potential to confuse people. I'd always look at the linkbacks.

@linclelinkpart5
Copy link
Author

Original author here, apologies for the radio silence. It's been a long time since I've had time to work on this, not to mention my inexperience with contributing to open source in general.

Aside from that, the one issue I encountered with this was that Windows paths weren't all getting normalized as expected, as per a recommended Windows path test suite (seen in the discussion on the PR rust-lang/rust#47363).

My hope was that using Path.components() would be sufficient to do the normalization, but it seems that .components() isn't working as expected for my case.

@Mark-Simulacrum
Copy link
Member

I'm going to reopen this instead of rust-lang/rust#47402 as the associated PR never landed so unfortunately we don't actually have anything to track at this point (beyond the desire for the feature itself, which is betters suited to the rfcs repo).

@nic-hartley
Copy link

Is this still looking for someone to implement it? I'm happy to give it a go, as long as someone familiar with Windows paths can give me some edge cases to test against.

Also, given that this won't access the filesystem (that's canonicalize's job), how do we want to handle paths like foo/../../bar which go "out of bounds"? Error? Ignore extra .. (as happens when you reach / in *nix)? Something else?

@retep998
Copy link
Member

On Windows foo/../../bar is equivalent to ..\bar so that is how I would handle it there.

@gdzx
Copy link

gdzx commented Nov 4, 2019

I wrote a port of some functions I needed from the Go's path/filepath module and some discussions even lead to a pre-RFC:

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

Has anyone poked the author of the path-clean crate to see if he'd be interested in getting it into std?

It looks to be an MIT/Apache-2.0 pure-Rust implementation of the same Plan 9 algorithm Go implemented for doing this.

(It does need a little work but it'd need that either way so, if he's willing to work on it...)

@gdzx
Copy link

gdzx commented Feb 1, 2021

@ssokolow Can you take a look at npath? It is tested on Unix and Windows, more rusty than the Go implementation (I also started from it), and with more methods around normalization.

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

It uses unsafe. Use of unsafe in a place where I don't consider it necessary is an instant disqualification.

@gdzx
Copy link

gdzx commented Feb 1, 2021

Then I can point you to Path::new, which uses unsafe in the same way each time you call it, and surely is an instant disqualification of the whole Rust standard library.

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

@gdzx I said "where I don't consider it necessary". As far as I'm concerned, this algorithm can be implemented perfectly well, relying only on the unsafe that already exists in the much more thoroughly audited standard library functions, and I strongly believe that unsafe should only be used when it's either necessary for FFI (not for the kind of normalization I want) or empirically proven that the performance benefits are significant enough to justify it. (And I still want a good test suite, being run on CI under Miri.)

If I run out of projects that don't need this functionality before someone contributes code that meets my standards, I'm perfectly willing to implement it myself, with #![forbid(unsafe_code)] on top for good measure.

@BurntSushi
Copy link
Member

Whether unsafe is necessary or not in this case is a red herring. Your use of unsafe relies on implementation details and is not a stable part of Path's API: https://doc.rust-lang.org/src/std/path.rs.html#1679

So I personally would consider it to be incorrect use of unsafe.

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

That too. I'll freely admit that, as soon as I saw unsafe in something, my "unsafe? That doesn't need to be here!" reaction stopped me from looking into what it was being used for.

From my perspective, it didn't matter what it was being used for if the algorithm "wasn't supposed to be" using it in the first place.

@BurntSushi
Copy link
Member

BurntSushi commented Feb 1, 2021

FWIW, it can be a pretty severe pain point to deal with paths in some cases because its internal wtf-8 representation isn't exposed. Without exposing it, there will always be a strong temptation to do exactly this sort of thing to get access to it.

What I've done in the past to simplify is to use the Unix specific APIs to get the underlying &[u8] directly and safely. Then on Windows, give up 100% correctness and lossily convert the path to UTF-8. But that trade off may not always be appropriate. The alternatives are either a lot more complex or a lot more costly. (But also, exposing the internal wtf-8 comes with its own costs.)

@gdzx
Copy link

gdzx commented Feb 1, 2021

@ssokolow There is a beautiful thing on GitHub, it's called "Create an issue". You can call it "Remove uses of unsafe when unnecessary". That would be a great start instead of talking nonsense about "disqualification", on a project that isn't even released yet. Also, it is quite an inconsequential issue compared to actually implementing a lexical path processing algorithm, checking its correctness, and its properties.

@BurntSushi
Copy link
Member

@ssokolow @gdzx All right, let's dial the temperature down a bit please. Thanks.

@gdzx
Copy link

gdzx commented Feb 1, 2021

Then let me reestablish the truth: my implementation of normalized does not use unsafe. And the only uses of unsafe is to get the inner Vec<u8> when updating a PathBuf in place, and getting the path separator as a &str. That shouldn't be a deal breaker...

@BurntSushi
Copy link
Member

That shouldn't be a deal breaker...

Maybe @ssokolow was a little harsh, but yes, it would also be a deal breaker for me too. It's not just unnecessary. It's incorrect. See my previous comment.

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

@gdzx I was about to apologize for being snippy when @BurntSushi beat me to acknowledging it (I tend to not notice when I'm emotionally compromised until someone points it out). That said, I am still curious why you would think I would spontaneously choose to reply in a communication channel different than the one that we began the interaction in without being explicitly prompted to.

As for unsafe, unnecessary, incorrect, or not, it's a matter of the auditing burden a dependency imposes on me. The crate uses unsafe. I don't want to have to re-audit that every time I bump the version. (I keep a close eye on my cargo-geiger output.)

@gdzx
Copy link

gdzx commented Feb 1, 2021

@BurntSushi I'll take "a little harsh" as an apology :D.

@ssokolow I don't understand why you would want to "reply in a communication channel different than the one that we began the interaction in".

Also I will repeat that you can just take normalized out my crate and enjoy no unsafe code at all. As for the use of unsafe, I implemented lexical_push just like I would do if I was writing code in std::path::PathBuf, using the equivalent of PathBuf::as_mut_vec which is private.

Anyway, if you have any idea for improvement, feel free to open an issue (just like you did on the other crate you linked to which also uses unsafe).

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

I don't understand why you would want to "reply in a communication channel different than the one that we began the interaction in".

You said...

There is a beautiful thing on GitHub, it's called "Create an issue". You can call it "Remove uses of unsafe when unnecessary". That would be a great start instead of talking nonsense about "disqualification"

...which, to me, felt like "You should have replied in the bug tracker instead of here. Because you didn't read my mind and understand that, you have done wrong."

Also I will repeat that you can just take normalized out my crate and enjoy no unsafe code at all.

If I'd have to copy your code into my project and take responsibility for maintaining a copy of it to avoid having to re-audit the crate every so often, I think I'd prefer to just rewrite the algorithm in a style more intuitive to me: my own.

Honestly, thanks to what the .components() iterator already does, it doesn't seem like it'd be that difficult. Annoying to support both POSIX and Windows to my standards, given that the WTF-8 representation isn't part of the public API, but not difficult.

As for the use of unsafe, I implemented lexical_push just like I would do if I was writing code in std::path::PathBuf, using the equivalent of PathBuf::as_mut_vec which is private.

That's the problem. You wrote it as if it were part of std, when it's not. That disregard for the concepts of encapsulation and public vs. private APIs is why I wouldn't want to rely on your code. I place a very high value on API stability and I wouldn't want to discover, 10 years down the road, that one of my old projects can't be built without extra work because a dependency unnecesarily tied itself to internal implementation details of an old std version that are exempt from the Rust API stability promise.

If your code were part of std then there'd be a whole testing infrastructure to catch breakages and it'd be the responsibility of whoever else was working on std to update it to fix what their patch broke.

@BurntSushi
Copy link
Member

As for the use of unsafe, I implemented lexical_push just like I would do if I was writing code in std::path::PathBuf, using the equivalent of PathBuf::as_mut_vec which is private.

Right, you can't do that. That's what is incorrect because it relies on an implementation detail that is not part of the stable API. The next release of Rust could silently break your code. And since you're using unsafe, that breakage could result in memory unsafety.

@gdzx
Copy link

gdzx commented Feb 1, 2021

@ssokolow

...which, to me, felt like "You should have replied in the bug tracker instead of here. Because you didn't read my mind and understand that, you have done wrong."

You really have a disrespectful tone, I don't understand why. I didn't ask you to read my mind, I asked you to do the same as what you did on the other crate to remove the use of unsafe, which is in my crate in a method that is not in any way related to the very title of this issue: "Equivalent to Python's os.path.normpath".

If I'd have to copy your code into my project and take responsibility for maintaining a copy of it to avoid having to re-audit the crate every so often, I think I'd prefer to just rewrite the algorithm in a style more intuitive to me: my own.

Of course, that's why it is nice to contribute to existing project that already support the use case, have CI, and work on Unix and Windows, so they can improve and eventually release.

@BurntSushi

Right, you can't do that. That's what is incorrect because it relies on an implementation detail that is not part of the stable API. The next release of Rust could silently break your code. And since you're using unsafe, that breakage could result in memory unsafety.

Well, I can do it, but not without the limitations you outlined. However, I have the intention to make it part of the standard library, and it will serve as a good-enough PoC (which it still is now, since it doesn't have any release yet). But if you want to propose an improvement to make this crate suit your use-case, remove all use of unsafe, and work toward a release, then please make a pull request or open an issue!

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

You really have a disrespectful tone,

How is it disrespectful to be honest about how I perceived what you'd said? You said something here, I tactlessly replied here, and you responded with what I, rightly or wrongly, perceived as an attack for having responded here rather than in the bug tracker.

Well, I can do it, but not without the limitations you outlined. However, I have the intention to make it part of the standard library, and it will serve as a good-enough PoC (which it still is now, since it doesn't have any release yet). But if you want to propose an improvement to make this crate suit your use-case, remove all use of unsafe, and work toward a release, then please make a pull request!

I think that might be the point of miscommunication. It wasn't made clear enough that you're intentionally writing it in a form that is not recommended for use outside of std, as a proof of concept for how you feel it would best be implemented as an internal component.

@gdzx
Copy link

gdzx commented Feb 1, 2021

@ssokolow I can agree on your last point. I will update the README to make that clear, but it's not on crates.io and the v0.1.0 is not released, that's not exactly production-ready. And that doesn't mean it is unsalvageable in the meantime to make it more appropriate for your use-case. Unless of course you want to improve on the Go port, but that is exactly what I did to make this crate.

@ssokolow
Copy link

ssokolow commented Feb 1, 2021

@gdzx I have to go shower and get to sleep, but I'll see if I can fit in a look at it tomorrow.

@gdzx
Copy link

gdzx commented Feb 1, 2021

@ssokolow Thank you ;).

@Ekleog
Copy link

Ekleog commented Feb 1, 2021

TL;DR for people like me who came back with 20 new notifications of diverse tones: here is a draft of what an implementation of this lifted to libstd might look like, but is currently not usable as a crate due to relying on PathBuf implementation details and thus being unstable. It would have to be adjusted at least a bit to be usable as a crate on regular Rust.

I don't have any kind of authority here, but maybe let's move the discussion about the pros and cons of this draft implementation to the issues of the linked repo, and try not to drown this issue in messages unrelated to the API design and desirability of this idea? :)

@gdzx
Copy link

gdzx commented Feb 2, 2021

In that case, I just pushed a very early RFC draft for this API. There are some links to implementations in other languages and some explanations. But as I said, it's a first draft.

@ssokolow
Copy link

ssokolow commented Feb 2, 2021

I don't have time to look at the code yet, but I've left some inline comments on the commit for the RFC draft.

Aside from that, I like what is proposed. It has a bunch of functions, like is_inside where I found myself thinking "Yeah. Why did Python's standard library force me to keep reinventing that in my programs if I need it so often?"

Heck, I have some ugly hackery for want of rooted_join in one of my recent hobby projects. (To work around the fact that the Url crate only exposes portable conversion from OS paths to URL path components via from_file_path and from_directory_path as far as I can tell, but "This returns Err if the given path is not absolute".)

marvin-bitterlich added a commit to marvin-bitterlich/coreutils that referenced this issue Mar 31, 2021
Uses the normalize_path used in cargo to strip duplicate slashes
With a link to a std rfc rust-lang/rfcs#2208

This fixes uutils#1829

This also touches uutils#1768 
but does not attempt to fully solve it
sylvestre pushed a commit to uutils/coreutils that referenced this issue Apr 5, 2021
* rm: add verbose output and trim multiple slashes

Uses the normalize_path used in cargo to strip duplicate slashes
With a link to a std rfc rust-lang/rfcs#2208

This fixes #1829

This also touches #1768 
but does not attempt to fully solve it
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests