RFC: path reform #474

Merged
merged 6 commits into from Dec 19, 2014

@aturon
Contributor
aturon commented Nov 19, 2014

This RFC reforms the design of the std::path module in preparation for API
stabilization. The path API must deal with many competing demands, and the
current design handles many of them, but suffers from some significant problems
given in "Motivation" below. The RFC proposes a redesign modeled loosely on the
current API that addresses these problems while maintaining the advantages of
the current design.

Thanks to @kballard, who helped spark some of the initial ideas over Ike's sandwiches!

Rendered

@huonw huonw commented on the diff Nov 19, 2014
text/0000-path-reform.md
+
+impl<Sized? P> PathBuf where P: AsPath {
+ pub fn new<T: IntoString>(path: T) -> PathBuf;
+
+ pub fn push(&mut self, path: &P);
+ pub fn pop(&mut self) -> bool;
+
+ pub fn set_file_name(&mut self, file_name: &P);
+ pub fn set_extension(&mut self, extension: &P);
+}
+
+// These will ultimately replace the need for `push_many`
+impl<Sized? P> FromIterator<P> for PathBuf where P: AsPath { .. }
+impl<Sized? P> Extend<P> for PathBuf where P: AsPath { .. }
+
+impl<Sized? P> Path where P: AsPath {
@huonw
huonw Nov 19, 2014 Member

Nit/implementation detail: will putting the type parameter here cause inference problems with e.g. Path::new?

@aturon
aturon Nov 19, 2014 Contributor

@huonw Given #447, this will probably have to move inward, but at the moment this seemed the most concise way to present the API.
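For comparison, here is a minimal sketch of what "moving the parameter inward" looks like. It uses the present-day `std::path` rather than the API as drafted in this diff: each method carries its own type parameter (bounded by `AsRef<Path>`, which fills the role sketched here for `AsPath`), so the `impl` block itself is not generic and `Path::new` has nothing extra to infer. The helper name `build` is hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: pushes three differently typed arguments onto one buffer.
/// `PathBuf::push` is generic per call, so each call picks its own `P`.
pub fn build() -> PathBuf {
    let mut buf = PathBuf::from("a");
    buf.push("b"); // P = &str
    buf.push(String::from("c")); // P = String
    buf.push(Path::new("d")); // P = &Path
    buf
}
```

Because the parameter is chosen at each call site, mixing argument types on a single `PathBuf` needs no annotations.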

@huonw huonw commented on the diff Nov 19, 2014
text/0000-path-reform.md
+
+pub const SEP: char = ..
+pub const ALT_SEPS: &'static [char] = ..
+
+pub fn is_sep(c: char) -> bool { .. }
+```
+
+There is plenty of overlap with today's API, and the methods being retained here
+largely have the same semantics.
+
+But there are also a few potentially surprising aspects of this design that merit
+comment:
+
+* **Why does `PathBuf::new` take `IntoString`?** It needs an owned buffer
+ internally, and taking a string means that Unicode input is guaranteed, which
+ works on all platforms. (In general, the assumption is that non-Unicode paths
@huonw
huonw Nov 19, 2014 Member

Some platforms have illegal-in-paths code points (e.g. \0), so an arbitrary string isn't necessarily a valid path?

@aturon
aturon Nov 19, 2014 Contributor

Right, I should've talked about this. Right now we generally panic in that case, and I think we should continue with that approach. But it's not entirely clear whether to do the inspection at the path level or only when crossing into fs. (In any case, I expect the signature to stand.)
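As an illustration of the "only when crossing into fs" option, the sketch below shows how the present-day standard library behaves, which is not necessarily what this RFC settled on at the time: constructing the path does no inspection, and the filesystem call reports the embedded NUL as an error. The function name `probe_nul_path` is hypothetical.

```rust
use std::io;
use std::path::Path;

/// Hypothetical probe: builds a path containing an interior NUL and hands it
/// to a filesystem call. Construction itself performs no validation.
pub fn probe_nul_path() -> io::Result<()> {
    let p = Path::new("a\0b"); // accepted: a `Path` is just a view over the string
    std::fs::metadata(p).map(|_| ()) // the check happens at the fs boundary
}
```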

@alexcrichton alexcrichton commented on the diff Nov 19, 2014
text/0000-path-reform.md
+
+ // The "non-root" part of the path
+ pub fn relative_path(&self) -> &path;
+
+ // The "directory" portion of the path
+ pub fn dir_path(&self) -> &path;
+
+ pub fn file_name(&self) -> Option<&path>;
+ pub fn file_stem(&self) -> Option<&path>;
+ pub fn extension(&self) -> Option<&path>;
+
+ pub fn join(&self, path: &P) -> PathBuf;
+
+ pub fn with_file_name(&self, file_name: &P) -> PathBuf;
+ pub fn with_extension(&self, extension: &P) -> PathBuf;
+}
@alexcrichton
alexcrichton Nov 19, 2014 Member

Should this also have an as_bytes method? (for converting to &[u8])

@alexcrichton
alexcrichton Nov 19, 2014 Member

Also, should normalization methods be exposed? (either one for looking at the filesystem or one just looking at the path)

@alexcrichton alexcrichton commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+ pub fn path_relative_from(&self, base: &P) -> Option<Path>;
+ pub fn starts_with(&self, base: &P) -> bool;
+ pub fn ends_with(&self, child: &P) -> bool;
+
+ // The "root" part of the path, if absolute
+ pub fn root_path(&self) -> Option<&path>;
+
+ // The "non-root" part of the path
+ pub fn relative_path(&self) -> &path;
+
+ // The "directory" portion of the path
+ pub fn dir_path(&self) -> &path;
+
+ pub fn file_name(&self) -> Option<&path>;
+ pub fn file_stem(&self) -> Option<&path>;
+ pub fn extension(&self) -> Option<&path>;
@alexcrichton
alexcrichton Nov 19, 2014 Member

This and a few other locations here need path => Path I think

@alexcrichton alexcrichton commented on the diff Nov 19, 2014
text/0000-path-reform.md
+```rust
+p == p.join(dot)
+p == dot.join(p)
+
+p == p.root_path().unwrap_or(dot)
+ .join(p.relative_path())
+
+p.relative_path() == match p.root_path() {
+ None => p,
+ Some(root) => p.path_relative_from(root).unwrap()
+}
+
+p == p.dir_path()
+ .join(p.file_name().unwrap_or(dot))
+
+p == p.iter().collect()
@alexcrichton
alexcrichton Nov 19, 2014 Member

Just clarifying, this means that to collect into a relative path the first component must be .? Otherwise it's collected into an absolute path?

@alexcrichton
alexcrichton Nov 19, 2014 Member

Or rather, let me rephrase: If I have a path a/b/c, what does .iter() return? I would probably expect "a", "b", "c", and if so, how does path know to collect into an absolute or relative path?

@kballard
kballard Nov 20, 2014 Contributor

In our original discussion, the thought was that for /a/b/c .iter() would return "/", "a", "b", "c", so collecting an iterator is semantically equivalent to taking the first component and pushing the rest of the components onto the end. I assume that's still the intention.
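A small sketch of the behaviour described above, written against the present-day `std::path` (where `Path::iter` yields `&OsStr` items) and assuming a Unix-style separator. The helper names `parts` and `rebuild` are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: the components of `p`, rendered as strings for inspection.
pub fn parts(p: &Path) -> Vec<String> {
    p.iter().map(|c| c.to_string_lossy().into_owned()).collect()
}

/// Hypothetical helper: rebuilds a path by collecting its own components.
/// Collecting pushes each item in turn, so a leading root component makes
/// the result absolute and its absence leaves the result relative.
pub fn rebuild(p: &Path) -> PathBuf {
    p.iter().collect()
}
```

With this scheme `a/b/c` yields `a`, `b`, `c` and collects back into a relative path, while `/a/b/c` yields the root first and collects back into an absolute one.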

@sfackler sfackler commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+```rust
+pub trait WindowsPathBufExt {
+ fn from_ucs2(path: &[u16]) -> Self;
+ fn make_non_verbatim(&mut self) -> bool;
+}
+
+pub trait WindowsPathExt {
+ fn is_cwd_relative(&self) -> bool;
+ fn is_vol_relative(&self) -> bool;
+ fn is_verbatim(&self) -> bool;
+ fn prefix(&self) -> PathPrefix;
+ fn to_ucs2(&self) -> Vec<u16>;
+}
+
+enum PathPrefix<'a> {
+ VerbatimPrefix(&'a Path),
@sfackler
sfackler Nov 19, 2014 Member

These don't need the Prefix suffix.

@sfackler sfackler commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+
+impl<Sized? P> Path where P: AsPath {
+ pub fn new(path: &str) -> &Path;
+
+ pub fn as_str(&self) -> Option<&str>
+ pub fn as_str_lossy(&self) -> Cow<String, str>; // Cow will replace MaybeOwned
+ pub fn to_owned(&self) -> PathBuf;
+
+ // iterate over the components of a path
+ pub fn iter(&self) -> Iter;
+
+ pub fn is_absolute(&self) -> bool;
+ pub fn is_relative(&self) -> bool;
+ pub fn is_ancestor_of(&self, other: &P) -> bool;
+
+ pub fn path_relative_from(&self, base: &P) -> Option<Path>;
@sfackler
sfackler Nov 19, 2014 Member

Should this be Option<PathBuf>?

@alexcrichton alexcrichton and 1 other commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+
+This is acceptable because the platform supports arbitrary byte sequences
+(usually interpreted as UTF-8).
+
+### Windows
+
+On Windows, the additional APIs allow you to convert to/from UCS-2 (roughly,
+arbitrary `u16` sequences interpreted as UTF-16 when applicable). They also
+provide the remaining Windows-specific path decomposition functionality that
+today's path module supports.
+
+```rust
+pub trait WindowsPathBufExt {
+ fn from_ucs2(path: &[u16]) -> Self;
+ fn make_non_verbatim(&mut self) -> bool;
+}
@alexcrichton
alexcrichton Nov 19, 2014 Member

This is an interesting conundrum that I hadn't really thought about before, but I think that it makes sense. In general, if the command line gives you a path, you're no longer able to pass the bytes into Path without looking at them. I think that's a good thing, however! (just something to point out).

@nikomatsakis
nikomatsakis Nov 19, 2014 Contributor

It might be useful to have a cross-platform way to create a path from bytes, for just this scenario. I guess people can mock up their own thing, but everyone will have to.

@nikomatsakis
nikomatsakis Nov 19, 2014 Contributor

Well, everyone who cares about writing command line utilities that work both on unix and windows. Which is...probably a small intersection. And they're using MinGW or something already.

@nikomatsakis
Contributor

I don't work much with paths, but I figure if the API itself gets a sign-off from @alexcrichton, @wycats, and @kballard, it's got to be fine. And we can always grow it over time if needed.

The general approach of having a borrowed type (Path) and a mutable, owning variant (PathBuf) seems like a great idea and I have a feeling this is going to become "a thing". One thing I was curious about was the decision to make Path unsized. Is this necessary for some reason? (One reason I can imagine might be BorrowFrom, if we were to change that to use associated types.)

In general, I imagine that many of the "view" types we will create will not be unsized, so it'd be good to know where unsizedness is helpful. Of course, one could imagine allowing types to "opt in" to being unsized, perhaps using a negative impl. (And the more unsized types we have, the more Sized? annotations are required on traits and so forth.)

Anyway, this is not criticism of this RFC, which looks great, more just thinking aloud about the trend that it signals.

@japaric japaric commented on the diff Nov 19, 2014
text/0000-path-reform.md
+impl<Sized? P> PathBuf where P: AsPath {
+ pub fn new<T: IntoString>(path: T) -> PathBuf;
+
+ pub fn push(&mut self, path: &P);
+ pub fn pop(&mut self) -> bool;
+
+ pub fn set_file_name(&mut self, file_name: &P);
+ pub fn set_extension(&mut self, extension: &P);
+}
+
+// These will ultimately replace the need for `push_many`
+impl<Sized? P> FromIterator<P> for PathBuf where P: AsPath { .. }
+impl<Sized? P> Extend<P> for PathBuf where P: AsPath { .. }
+
+impl<Sized? P> Path where P: AsPath {
+ pub fn new(path: &str) -> &Path;
@japaric
japaric Nov 19, 2014 Member

The current API has a new_opt constructor that returns Option<Path>. It seems that such option has been removed in this new API, is it intentional? See rust-lang/rust#14048

@japaric japaric commented on the diff Nov 19, 2014
text/0000-path-reform.md
+useful ramifications for the rest of the API, described below.
+
+The proposed API deals with the other problems mentioned above, and also brings
+the module in line with current Rust patterns and conventions. These details
+will be discussed after getting a first look at the core API.
+
+## The cross-platform API
+
+The proposed core, cross-platform API provided by the new `std::path` is as follows:
+
+```rust
+// A sized, owned type akin to String:
+pub struct PathBuf { .. }
+
+// An unsized slice type akin to str:
+pub struct Path { .. }
@japaric
japaric Nov 19, 2014 Member

Could we get more details about how the underlying data structure may look? If it's going to be an unsized struct other than a newtype wrapper around [u8]/str, then you must consider that any extra field in the struct won't be stored inline (i.e. in the stack). For example: struct Path { sepidx: Option<uint>, data: [u8] }, then &Path is two-word long (just like &[u8]), and it's non-obvious how to correctly create a &Path from &[u8].

@huonw
huonw Nov 19, 2014 Member

Having an embedded field like that makes it impossible to construct from a &[u8] and makes it impossible to move the start position when slicing (i.e. Path has to be isomorphic to pub struct Path([u8])).

@kballard
kballard Nov 20, 2014 Contributor

From my recollection during our meeting, I believe Path is expected to contain a single [u8] field, with no header.

@aturon
aturon Nov 20, 2014 Contributor

@kballard

From my recollection during our meeting, I believe Path is expected to contain a single [u8] field, with no header.

Exactly. I've clarified this with a comment below, and will update the RFC to be more clear about this shortly.
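To make the "single [u8] field, no header" layout concrete, here is a minimal sketch of an unsized newtype over `[u8]`. The type `BytePath` and its methods are hypothetical and stand in for the proposed `Path`; the pointer cast relies on `#[repr(transparent)]`, an attribute of present-day Rust, to guarantee that the wrapper has the same layout as the slice.

```rust
/// Hypothetical stand-in for an unsized path type: one `[u8]` field, no header.
#[repr(transparent)]
pub struct BytePath([u8]);

impl BytePath {
    /// Reinterprets a byte slice as a `BytePath` without allocating.
    pub fn new(bytes: &[u8]) -> &BytePath {
        // SAFETY: `BytePath` is `repr(transparent)` over `[u8]`, so the two
        // fat pointers have the same layout and the same length metadata.
        unsafe { &*(bytes as *const [u8] as *const BytePath) }
    }

    /// The underlying bytes.
    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    /// A sub-view starting at byte offset `start`. Moving the start position
    /// is only possible because no header precedes the data.
    pub fn tail(&self, start: usize) -> &BytePath {
        BytePath::new(&self.0[start..])
    }
}
```

A `&BytePath` is the same two words as a `&[u8]`, which is what allows both the zero-cost constructor and re-slicing.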

@steveklabnik steveklabnik commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+- RFC PR: (leave this empty)
+- Rust Issue: (leave this empty)
+
+# Summary
+
+This RFC reforms the design of the `std::path` module in preparation for API
+stabilization. The path API must deal with many competing demands, and the
+current design handles many of them, but suffers from some significant problems
+given in "Motivation" below. The RFC proposes a redesign modeled loosely on the
+current API that addresses these problems while maintaining the advantages of
+the current design.
+
+# Motivation
+
+The design of a path abstraction is surprisingly hard. Paths work radically
+differently on different platform, so providing a cross-platform abstraction is
@steveklabnik
steveklabnik Nov 19, 2014 Contributor

"platforms"

@michaelsproul

Some of the bare functions in std::io::fs look very C-like. Would there be any interest in extending the PathExtensions API to include the same functionality while providing more Rustic names? We could have .contents instead of readdir, .make_dir instead of mkdir, etc. The C-style names could still be accessible through libc.

@alexcrichton alexcrichton referenced this pull request in rust-lang/rust Nov 19, 2014
Closed

Rename some methods in the os module #19110

@SimonSapin SimonSapin commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+ pub fn push(&mut self, path: &P);
+ pub fn pop(&mut self) -> bool;
+
+ pub fn set_file_name(&mut self, file_name: &P);
+ pub fn set_extension(&mut self, extension: &P);
+}
+
+// These will ultimately replace the need for `push_many`
+impl<Sized? P> FromIterator<P> for PathBuf where P: AsPath { .. }
+impl<Sized? P> Extend<P> for PathBuf where P: AsPath { .. }
+
+impl<Sized? P> Path where P: AsPath {
+ pub fn new(path: &str) -> &Path;
+
+ pub fn as_str(&self) -> Option<&str>
+ pub fn as_str_lossy(&self) -> Cow<String, str>; // Cow will replace MaybeOwned
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

Should this be to_str_lossy? Not sure which of the as_ or to_ naming conventions should “win” for Cow.
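For reference, a sketch of how a `Cow`-returning lossy conversion behaves, written against the present-day `std::path`, where the method ended up named `to_string_lossy` (a `to_` prefix). The wrapper name `display_name` is hypothetical.

```rust
use std::borrow::Cow;
use std::path::Path;

/// Hypothetical wrapper around the lossy conversion. For a path that is
/// already valid Unicode the result borrows from the path; only invalid
/// data forces an owned, repaired copy.
pub fn display_name(p: &Path) -> Cow<'_, str> {
    p.to_string_lossy()
}
```

The borrowed case is the common one, so the conversion is usually free, which is one reason the `as_` versus `to_` question is not clear-cut for `Cow`.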

@SimonSapin SimonSapin commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+ be valid Unicode, so the various methods going to and from `str` will
+ suffice. But as with paths in general, there are platform-specific ways of
+ working with non-Unicode data, explained below.
+
+* **Where did `push_many` and friends go?** They're replaced by implementing
+ `FromIterator` and `Extend`, following a similar pattern with the `Vec`
+ type. (Some work will be needed to retain full efficiency when doing so.)
+
+* **How does `Path::new` work?** The ability to directly get a `&Path` from an
+ `&str` (i.e., with no allocation or other work) is a key part of the
+ representation choices, which are described below.
+
+## Important semantic rules
+
+The path API is designed to satisfy several semantic rules described below.
+**Note that `==` here is *lazily* normalizing**.
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

What normalization does it do? Should there be a .normalize() method?
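As a point of reference, the sketch below shows what component-wise equality looks like in the present-day `std::path`, which may differ in detail from the rules in this draft: repeated separators, interior `.` components, and a trailing separator are ignored when comparing, while `..` is not resolved. The helper `same_path` is hypothetical and assumes a Unix-style separator.

```rust
use std::path::Path;

/// Hypothetical helper: compares two path strings the way `Path`'s `==` does,
/// that is, component by component rather than byte by byte.
pub fn same_path(a: &str, b: &str) -> bool {
    Path::new(a) == Path::new(b)
}
```

No new buffer is built for the comparison, which is the sense in which the normalization is lazy.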

@SimonSapin SimonSapin commented on the diff Nov 19, 2014
text/0000-path-reform.md
+ None => p,
+ Some(ext) => p.with_extension(ext)
+}
+
+p == match (p.file_stem(), p.extension()) {
+ (Some(stem), Some(ext)) => p.with_file_name(stem).with_extension(ext),
+ _ => p
+}
+```
+
+## Representation choices, Unicode, and normalization
+
+A lot of the design in this RFC depends on a key property: both Unix and Windows
+paths can be easily represented as a flat byte sequence "compatible" with
+UTF-8. For Unix platforms, this is trivial: they accept any byte sequence, and
+will generally interpret the byte sequences as UTF-8 when valid to do so. For
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

So we make no effort to support (for display and conversion to str) non-UTF-8 filesystems on legacy Unix systems? Interesting. +1

@SimonSapin SimonSapin commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+* `std::path::windows::Path` works with Windows-style paths
+* `std::path::unix::Path` works with Unix-style paths
+* `std::path::Path` is a thin newtype wrapper around the current platform's path implementation
+
+This organization makes it possible to manipulate foreign paths by working with
+the appropriate submodule.
+
+In addition, each submodule defines some extension traits, explained below, that
+supplement the path API with functionality relevant to its variant of path.
+
+But what if you're writing a platform-specific application and wish to use the
+extended functionality directly on `std::path::Path`? In this case, you will be
+able to import the appropriate extension trait via `os::unix` or `os::windows`,
+depending on your platform. This is part of a new, general strategy for
+explicitly "opting-in" to platform-specific features by importing from
+`os::some_platform`.
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

Will os::some_platform only be available on that platform? (This description of what it does sounds like it should.)

@SimonSapin SimonSapin commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+}
+```
+
+This is acceptable because the platform supports arbitrary byte sequences
+(usually interpreted as UTF-8).
+
+### Windows
+
+On Windows, the additional APIs allow you to convert to/from UCS-2 (roughly,
+arbitrary `u16` sequences interpreted as UTF-16 when applicable). They also
+provide the remaining Windows-specific path decomposition functionality that
+today's path module supports.
+
+```rust
+pub trait WindowsPathBufExt {
+ fn from_ucs2(path: &[u16]) -> Self;
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

I don’t mind “UCS-2” semi-informally in text, but I’d rather not have ucs2 in API names. Also, Microsoft uses “UTF-16” in its documentation, never “UCS-2”. The most accurate is “potentially ill-formed UTF-16”, but that’s too long for a method name. So I’d rather have the method names be from_utf16 and to_utf16 with the documentation noting that it can be ill-formed / invalid UTF-16.
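To illustrate what "potentially ill-formed UTF-16" means in practice, here is a platform-independent sketch using `String::from_utf16` and `String::from_utf16_lossy` rather than any path API. The function names `decode_strict` and `decode_lossy` are hypothetical.

```rust
/// Hypothetical helper: strict decoding, which fails on an unpaired surrogate.
pub fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Hypothetical helper: lossy decoding, which replaces each unpaired
/// surrogate with U+FFFD.
pub fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

A sequence such as `[0x0061, 0xD800, 0x0062]` is a perfectly good `&[u16]` yet cannot be represented as a `String`, which is the gap a path type on Windows has to bridge.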

@SimonSapin SimonSapin added a commit to SimonSapin/rust-wtf8 that referenced this pull request Nov 19, 2014
@SimonSapin SimonSapin Rename Wtf8Slice to Wtf8 and Wtf8String to Wtf8Buf
... per Path reform RFC convention.
rust-lang/rfcs#474
a643e7a
@SimonSapin SimonSapin commented on the diff Nov 19, 2014
text/0000-path-reform.md
+
+Due to the known semantic problems, it is not really an option to retain the
+current path implementation. As explained above, supporting UCS-2 also means
+that the various byte-slice methods in the current API are untenable, so the API
+also needs to change.
+
+Probably the main alternative to the proposed API would be to *not* use
+DST/slices, and instead use owned paths everywhere (probably doing some
+normalization of `.` at the same time). While the resulting API would be simpler
+in some respects, it would also be substantially less efficient for common operations.
+
+# Unresolved questions
+
+It is not clear how best to incorporate the
+[WTF-8 implementation](https://github.com/SimonSapin/rust-wtf8) (or how much to
+incorporate) into `libstd`.
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

rust-wtf8 currently duplicates some standard library code, changing the char, str, and String types to their respective supersets CodePoint, Wtf8, and Wtf8Buf. To avoid the duplication, libstd could maybe have private functions that are generic over these types, so that optimization on monomorphized functions could still take advantage e.g. of the LLVM range asserts on char.

@SimonSapin SimonSapin commented on the diff Nov 19, 2014
text/0000-path-reform.md
+that the various byte-slice methods in the current API are untenable, so the API
+also needs to change.
+
+Probably the main alternative to the proposed API would be to *not* use
+DST/slices, and instead use owned paths everywhere (probably doing some
+normalization of `.` at the same time). While the resulting API would be simpler
+in some respects, it would also be substantially less efficient for common operations.
+
+# Unresolved questions
+
+It is not clear how best to incorporate the
+[WTF-8 implementation](https://github.com/SimonSapin/rust-wtf8) (or how much to
+incorporate) into `libstd`.
+
+There has been a long debate over whether paths should implement `Show` given
+that they may contain non-UTF-8 data. This RFC does not take a stance on that
@SimonSapin
SimonSapin Nov 19, 2014 Contributor

Only semi-related ranting: The current Show mechanism seems to kinda have the expectation that only UTF-8 should be emitted, but Formatter::write still takes any &[u8] bytes. Then format! has to do a UTF-8 check (which can panic!!) that shouldn’t be necessary.

I think Show should be based on Unicode all the way, with Formatter::write only accepting &str. Maybe I’ll write an RFC.

@Valloric

It might be useful to look at pathlib from Python 3 as a source of inspiration; it also has to deal with posix/windows paths etc.

@mahkoh mahkoh commented on the diff Nov 19, 2014
text/0000-path-reform.md
+ works on all platforms. (In general, the assumption is that non-Unicode paths
+ are most commonly produced by *reading* a path from the filesystem, rather
+ than creating new ones. As we'll see below, there are *platform-specific* ways
+ to create non-Unicode paths.)
+
+* **Why do `file_name` and `extension` operations work with `Path` rather than
+ some other type?** In particular, it may seem strange to view an extension as
+ a path. But doing so allows us to not reveal platform differences about the
+ various character sets used in paths. By and large, extensions in practice will
+ be valid Unicode, so the various methods going to and from `str` will
+ suffice. But as with paths in general, there are platform-specific ways of
+ working with non-Unicode data, explained below.
+
+* **Where did `push_many` and friends go?** They're replaced by implementing
+ `FromIterator` and `Extend`, following a similar pattern with the `Vec`
+ type. (Some work will be needed to retain full efficiency when doing so.)
@mahkoh
mahkoh Nov 19, 2014 Contributor

Just a reminder that Vec::push_all has not been deprecated because of ergonomics and performance problems and that, at this point, using extend is much, much slower than using push_all.

test extend   ... bench:      2868 ns/iter (+/- 9)
test push_all ... bench:        76 ns/iter (+/- 0)

Have any concrete plans been made to fix this?

@huonw
huonw Nov 19, 2014 Member

test extend   ... bench:        43 ns/iter (+/- 4)
test push_all ... bench:        33 ns/iter (+/- 4)

extern crate test;

use test::Bencher as B;

#[bench]
fn push_all(b: &mut B) {
    let mut vec = vec![];
    let values = [1i, .. 10];

    b.iter(|| {
        vec.push_all(&values)
    })
}

#[bench]
fn extend(b: &mut B) {
    let mut vec = vec![];
    let values = [1i, .. 10];

    b.iter(|| {
        vec.extend(values.iter().map(|x| *x))
    })
}
@mahkoh
mahkoh Nov 19, 2014 Contributor

The memcpy overhead will be significant at N=10.

extern crate test;

#[inline(never)]
fn prepare() -> (Vec<u8>, Vec<u8>) {
    (vec!(), Vec::from_elem(1024, 0))
}

#[bench]
pub fn extend(b: &mut test::Bencher) {
    let (mut dst, src) = prepare();
    b.iter(|| {
        dst.clear();
        dst.extend(src.iter().map(|v| *v));
        test::black_box(&dst);
    });
}

#[bench]
pub fn push_all(b: &mut test::Bencher) {
    let (mut dst, src) = prepare();
    b.iter(|| {
        dst.clear();
        dst.push_all(src.as_slice());
        test::black_box(&dst);
    });
}
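For readers on a current toolchain, the two strategies being benchmarked can be sketched as below. This assumes present-day Rust, where `Vec::extend_from_slice` plays the role that `push_all` plays in this thread; the function names are hypothetical and no timing claim is made here.

```rust
/// Hypothetical helper: slice-based append, the analogue of `push_all`.
pub fn append_by_slice(dst: &mut Vec<u8>, src: &[u8]) {
    dst.extend_from_slice(src);
}

/// Hypothetical helper: iterator-based append through the `Extend` trait.
pub fn append_by_iter(dst: &mut Vec<u8>, src: &[u8]) {
    dst.extend(src.iter().copied());
}
```

Both produce the same contents; the discussion here is only about whether the iterator route can be made as fast as the slice route.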
@Valloric Valloric and 1 other commented on an outdated diff Nov 19, 2014
text/0000-path-reform.md
+ pub fn extension(&self) -> Option<&Path>;
+
+ pub fn join(&self, path: &P) -> PathBuf;
+
+ pub fn with_file_name(&self, file_name: &P) -> PathBuf;
+ pub fn with_extension(&self, extension: &P) -> PathBuf;
+}
+
+pub struct Iter<'a> { .. }
+
+impl<'a> Iterator<&'a Path> for Iter<'a> { .. }
+
+pub const SEP: char = ..
+pub const ALT_SEPS: &'static [char] = ..
+
+pub fn is_sep(c: char) -> bool { .. }
@Valloric
Valloric Nov 19, 2014

Why not just is_separator? is_sep could also be read as is_separate or something else entirely.

Always err on the side of readability. More good reading.

@blaenk
blaenk Nov 19, 2014 Contributor

I agree that it should be is_separator.

@SimonSapin
Contributor

I commented inline on some details, but +1 on the overall design.

@aturon aturon was assigned by nrc Nov 20, 2014
@aturon
Contributor
aturon commented Nov 20, 2014

I've updated the RFC to address most of the questions that have been
raised, but I'll repeat some of the answers below for convenience.

@huonw

Some platforms have illegal-in-paths code points (e.g. \0), so an
arbitrary string isn't necessarily a valid path?

Right. The proposed API will continue to panic on embedded nulls. The
RFC now reflects this.

@alexcrichton

Should this also have an as_bytes method? (for converting to &[u8])

Nope! The problem is that it's not clear what this would mean on
Windows unless we expose the WTF-8 encoding (which I'm trying not to
do).

This is an interesting conundrum that I hadn't really thought about
before, but I think that it makes sense. In general, if the command
line gives you a path, you're no longer able to pass the bytes into
Path without looking at them. I think that's a good thing,
however! (just something to point out).

Basically, if you have a pile of bytes and want to get a path, you
have to decide how to interpret those bytes first, because if they're
not UTF-8, it's not at all clear what they should mean on Windows.

Also, should normalization methods be exposed? (either one for
looking at the filesystem or one just looking at the path)

Syntactic normalization is available via p.iter().collect(), as the
RFC now clarifies. I do think we should also provide filesystem-based
canonicalization ("realpath"), but am leaving that design out of the
RFC since it's basically orthogonal to the rest of the design.

Or rather, let me rephrase: If I have a path a/b/c, what does
.iter() return? I would probably expect "a", "b", "c", and if
so, how does path know to collect into an absolute or relative path?

I've now clarified this: the iterator will yield an element for the
root_path if one exists. This follows the Boost design.

@nikomatsakis

It might be useful to have a cross-platform way to create a path
from bytes, for just this scenario. I guess people can mock up their
own thing, but everyone will have to.

I don't think this is possible, except if you interpret the bytes as
UTF-8 (in which case you can get a &str); see above, and the updated RFC.
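The "interpret the bytes as UTF-8 first" route can be sketched in a few lines. The function name `path_from_utf8_bytes` is hypothetical, and the sketch uses the present-day `Path::new`, which accepts a `&str` without allocating.

```rust
use std::path::Path;

/// Hypothetical helper: treats `bytes` as UTF-8 and, if they are valid,
/// views them as a path. Returns `None` for anything that is not UTF-8,
/// leaving the caller to decide how else to interpret the data.
pub fn path_from_utf8_bytes(bytes: &[u8]) -> Option<&Path> {
    std::str::from_utf8(bytes).ok().map(|s| Path::new(s))
}
```

The decision about how to read the bytes is made explicitly by the caller, and the only cross-platform answer on offer is UTF-8.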

One thing I was curious about was the decision to make Path
unsized. Is this necessary for some reason? (One reason I can
imagine might be BorrowFrom, if we were to change that to use
associated types.)

This is not a necessary choice, I don't think, but still a desirable
one: given that we can make Path unsized, I prefer &'a Path to
Path<'a>, and this also allows Path to fit into all of the DST
rollout work we've been doing throughout libstd. More generally,
I've tried to model PathBuf/Path carefully after String/str, in
part to make it very easy to understand/use once you've grokked Rust's
strings. We've got DST; let's use it!

@japaric

The current API has a new_opt constructor that returns
Option<Path>. It seems that such option has been removed in this
new API, is it intentional? See rust-lang/rust#14048

This is intentional, and follows our current error guidelines: passing
a string with interior nulls is considered to be a contract violation
that will lead to a panic.

Could we get more details about how the underlying data structure
may look? If it's going to be an unsized struct other than a newtype
wrapper around [u8]/str, then you must consider that any extra
field in the struct won't be stored inline (i.e. in the stack). For
example: struct Path { sepidx: Option<uint>, data: [u8] }, then
&Path is two-word long (just like &[u8]), and it's non obvious
how to correctly create a &Path from &[u8].

I need to update the RFC with these details, but roughly you should
think of PathBuf as a newtype wrapper for Vec<u8> and Path as a
newtype wrapper for [u8] -- with no other fields in either
case. This is in part based on conversation with @kballard (author of
the current module) who believes that the extra fields currently used
don't provide much benefit.

@SimonSapin

What normalization does it do? Should there be a .normalize() method?

Clarified in the updated RFC.

So we make no effort to support (for display and conversion to
str) non-UTF-8 filesystems on legacy Unix systems? Interesting. +1

Not quite: you can get it in a platform-specific way.

Will os::some_platform only be available on that platform? (This
description of what it does sounds like it should.)

Yes.

I don’t mind “UCS-2” semi-informally in text, but I’d rather not
have ucs2 in API names. Also, Microsoft uses “UTF-16” in its
documentation, never “UCS-2”. The most accurate is “potentially
ill-formed UTF-16”, but that’s too long for a method name. So I’d
rather have the method names be from_utf16 and to_utf16 with the
documentation noting that it can be ill-formed / invalid UTF-16.

I changed to from_u16_slice etc. Let me know what you think.

@mahkoh

Just a reminder that Vec::push_all has not been deprecated because
of ergonomics and performance problems and that, at this point,
using extend is much, much slower than using push_all.

Yes, indeed, and we can keep push_all here as #[unstable] for the
same reason. But ultimately we can and should get the performance of
Extend on par.

@kballard kballard commented on the diff Nov 20, 2014
text/0000-path-reform.md
+pub struct PathBuf { .. }
+
+// An unsized slice type akin to str:
+pub struct Path { .. }
+
+// Some ergonomics and generics, following the pattern in String/str and Vec<T>/[T]
+impl Deref<Path> for PathBuf { ... }
+impl BorrowFrom<PathBuf> for Path { ... }
+
+// A replacement for BytesContainer; used to cut down on explicit coercions
+pub trait AsPath for Sized? {
+ fn as_path(&self) -> &Path;
+}
+
+impl<Sized? P> PathBuf where P: AsPath {
+ pub fn new<T: IntoString>(path: T) -> PathBuf;
@kballard
kballard Nov 20, 2014 Contributor

This should not be using IntoString, a path is not necessarily valid UTF-8.

Edit: I see you addressed this below. Carry on.

@kballard
kballard Nov 20, 2014 Contributor

Similarly it probably needs to return an Option<PathBuf>, since a path cannot contain NUL on Linux (and probably not on Windows either, but I'm not sure about that), or perhaps Result<PathBuf, T>.

You also addressed this below, explicitly opting to panic.

@kballard kballard commented on the diff Nov 20, 2014
text/0000-path-reform.md
+}
+
+pub trait UnixPathExt {
+ fn from_bytes(path: &[u8]) -> &Self;
+ fn as_bytes(&self) -> &[u8];
+}
+```
+
+This is acceptable because the platform supports arbitrary byte sequences
+(usually interpreted as UTF-8).
+
+### Windows
+
+On Windows, the additional APIs allow you to convert to/from UCS-2 (roughly,
+arbitrary `u16` sequences interpreted as UTF-16 when applicable); because the
+name "UCS-2" does not have a clear meaning, these APIs use `u16_slice` and will
@kballard
kballard Nov 20, 2014 Contributor

Nitpick: UCS-2 does have a clear meaning, but it just may not be instantly recognizable by people. I'm fine with calling it u16_slice() regardless.

@kballard
Contributor

@aturon Thanks for writing this up. I was surprised to see that Path can now be created from &str directly, without doing any normalization at all, since we had discussed the idea of still doing things like a//b -> a/b and a/./b -> a/b. However, I do see the benefit of being able to go from &str to &Path (and back to &str) without any allocation.

I do think we need to expose some sort of normalization method, because that's something we can expect people may want to do. I would propose .standardize() to do all the normalization operations that are safe to do without changing the meaning of the path (e.g. a//b -> a/b, a/./b -> a/b, and "" -> "."), and .normalize() would add in the a/../b -> b step. Then we can potentially have .realpath() that performs the actual realpath operation, where it uses the fs to traverse symlinks.

We may want to have .to_standardized() instead of .standardize(), and have it return Cow<PathBuf, Path>, i.e. skipping the allocation if the path was already standardized. Same with .normalize() and .realpath(). I also think it's important to have .normalize() because Windows verbatim paths means we can't expect users to be able to trivially reimplement it on top of the iterator; a verbatim path does not treat .. specially, so the routine has to be aware of that.
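
To make the difference between the two proposed operations concrete, here is a minimal sketch of the lexical `a/../b -> b` step written against today's `std::path` components iterator rather than the API under discussion in this RFC. The name `normalize_lexically` is hypothetical, and the sketch deliberately ignores the Windows verbatim-path caveat mentioned above (it treats `..` as special everywhere), so it illustrates the idea without being a proposed implementation.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper: resolve `.` and `..` purely lexically,
/// without touching the filesystem (so symlinks are NOT followed).
fn normalize_lexically(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for comp in path.components() {
        match comp {
            // `a/./b` -> `a/b`
            Component::CurDir => {}
            Component::ParentDir => {
                let ends_with_name =
                    matches!(out.components().next_back(), Some(Component::Normal(_)));
                let at_root = matches!(
                    out.components().next_back(),
                    Some(Component::RootDir | Component::Prefix(_))
                );
                if ends_with_name {
                    // `a/../b` -> `b`
                    out.pop();
                } else if !at_root {
                    // Nothing left to cancel: keep leading `..` of relative paths.
                    out.push("..");
                }
                // At the root, `..` is dropped (`/..` is `/`).
            }
            other => out.push(other.as_os_str()),
        }
    }
    if out.as_os_str().is_empty() {
        // "" -> "."
        out.push(".");
    }
    out
}
```

With this sketch, `a/../b` becomes `b`, `a/b/../..` becomes `.`, and a leading `..` in a relative path is preserved. The result is only equivalent to the input when no component is a symlink, which is the case for keeping it separate from a filesystem-backed `realpath`.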

Right. The proposed API will continue to panic on embedded nulls. The
RFC now reflects this.

It seems reasonable to me to continue to expose a new_opt() method that avoids the panic (and from_bytes_opt() / from_vec_opt()).

@mahkoh
Contributor
mahkoh commented Nov 20, 2014

@aturon

But ultimately we can [...] get the performance of Extend on par.

I'd love to hear more about this.

@SimonSapin
Contributor

[…] I’d rather have the method names be from_utf16 and to_utf16 with the documentation noting that it can be ill-formed / invalid UTF-16.

I changed to from_u16_slice etc. Let me know what you think.

I feel that u16_slice is not enough information. Even though it can rarely be ill-formed, it’s still supposed to be UTF-16. I’d still prefer from_utf16 and to_utf16.

@blaenk
Contributor
blaenk commented Nov 21, 2014

From @kballard:

I would propose .standardize() to do all the normalization operations that are safe to do without changing the meaning of the path (e.g. a//b -> a/b, a/./b -> a/b, and "" -> "."), and .normalize() would add in the a/../b -> b step. Then we can potentially have .realpath() that performs the actual realpath operation, where it uses the fs to traverse symlinks.

This sounds good to me, but I'm not entirely sold on there being standardize and normalize with the only differing aspect being a/../b to b, I think that has the potential to be confusing to users. Perhaps we should keep your standardize as standardize and make realpath() be normalize(), which would include the a/../b -> b step depending on the actual file system, but perhaps I'm missing something.

@kballard
Contributor

@SimonSapin Except it's not UTF-16. That's the root of the issue with WindowsPath in the current implementation. It's UTF-16 for "modern" Windows APIs, but it's UCS-2 for older APIs. Calling it from_utf16 and to_utf16 would be entirely incorrect, since it cannot actually require that it be valid UTF-16 without breaking compatibility with real paths that exist in the wild (we have an issue somewhere where someone actually had a file that was UCS-2 instead of UTF-16 and it was causing glob to panic, so we worked around it by making glob silently ignore any files that aren't UTF-16).

@kballard
Contributor

@blaenk There are legitimate reasons to want to normalize a path without ever hitting the filesystem (including, but not limited to, the path not actually representing something on the current filesystem). For example, if you already know your path has no symlinks in it, then normalizing it without touching the filesystem is safe and is more performant than running realpath.

@blaenk
Contributor
blaenk commented Nov 21, 2014

Yeah that was the only thing I could think of. I still feel like having two separate methods based on that single difference alone might be too confusing, but it may be for the best.

@SimonSapin
Contributor

@kballard Except that, if we’re splitting hairs, and depending on what you call “UCS-2”, I’d argue that Windows filenames are not UCS-2 either, since UCS-2 does not support supplementary code points at all and surrogate code points are not special, whereas, as far as I understand, a surrogate pair in a filename is still supposed to be displayed as a single character, even if the string is overall ill-formed.

There are plenty of systems out there that say they use UTF-16 for strings but don’t enforce well-formedness. They still call it UTF-16. Again, I think “potentially ill-formed UTF-16” is most accurate, but I don’t think to_potentially_ill_formed_utf16 and from_potentially_ill_formed_utf16 are good method names.
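
As a small illustration of what "potentially ill-formed UTF-16" means in practice, the sketch below feeds a `u16` sequence containing an unpaired surrogate to the standard library's strict and lossy decoders. The helper names `decode_strict` and `decode_lossy` are made up for this example; only `String::from_utf16` and `String::from_utf16_lossy` are real APIs.

```rust
/// Hypothetical helper: strict decoding fails on any unpaired surrogate.
fn decode_strict(wide: &[u16]) -> Option<String> {
    String::from_utf16(wide).ok()
}

/// Hypothetical helper: lossy decoding replaces unpaired surrogates
/// with U+FFFD, so the original `u16` sequence cannot be recovered.
fn decode_lossy(wide: &[u16]) -> String {
    String::from_utf16_lossy(wide)
}
```

For the input `[0x0061, 0xD800, 0x0062]` (an `a`, a lone high surrogate, a `b`), strict decoding returns `None` and lossy decoding yields `a`, U+FFFD, `b`, while a well-formed pair such as `[0xD83D, 0xDE00]` decodes to a single character. A path API that required valid UTF-16 would therefore either reject such a filename or fail to round-trip it.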

@telotortium

@SimonSapin Perhaps use the Windows API names (e.g., to_wchar and from_wchar or to_lpwstr and from_lpwstr)? It's only relevant for Windows, and most systems programmers there should have an idea of what is meant.

@SimonSapin
Contributor

But libc::types::os::arch::c95::wchar_t exists in Rust and (as far as I understand) is not always 16 bits.

@telotortium

However, will it be misleading on Windows? As far as I understand,
wchar_t is always 16-bit on Windows (and it's actually referred to via
typedef wchar_t WCHAR). I know that it's sometimes 32-bit on Unix, but
this is an API that deals with Windows paths. We should ask a real Windows
expert what they think.


@retep998
Member

Usage of DST with PathBuf and Path sounds great to me.
Support for Windows' UCS-2 paths also looks good.
My concern is with regards to \\?\ paths which disable all path normalization, including converting / to \. If .push() uses / on Windows, then the API effectively becomes unusable for \\?\ paths and I have to do everything manually. My recommendation is that .standardize() and friends should convert / to \ on Windows and that all methods adding separators should use \ as well.

@vadimcn
Contributor
vadimcn commented Nov 23, 2014

@SimonSapin, @aturon: MSDN usually refers to these as "wide" strings, so perhaps from_wide / to_wide ?

@SimonSapin
Contributor

@vadimcn, that works, I guess. Then the documentation can go into all the details, if necessary.

@retep998
Member
retep998 commented Dec 2, 2014

In fact, I think it should be very much a priority to normalize paths on Windows to be \\?\ paths, simply because it allows us to bypass the MAX_PATH limitations which are a major source of problems with a large number of programs on Windows. Nobody can be reasonably expected to manually ensure they are using \\?\ paths on Windows, especially when developing a cross platform application. These kinds of things need to be taken care of by the standard library.

@brendanzab
Member

Is there a way to defer the Windows questions to a later date so that we can get this RFC merged?

@erickt
erickt commented Dec 11, 2014

Does this imply changing how BorrowFrom is implemented? Right now it's:

pub trait BorrowFrom<Sized? Owned> for Sized? {
    fn borrow_from(owned: &Owned) -> &Self;
}

Which means that we'd implement this for Path with:

impl BorrowFrom<PathBuf> for Path {
    fn borrow_from(owned: &PathBuf) -> &Path {
        &Path { ... }
    }
}

But then we'd be returning a borrow of a local stack value.

@SimonSapin
Contributor

@erickt Indeed, that impl wouldn’t compile. Assuming the definitions are struct Path { data: [u8] } and struct PathBuf { data: Vec<u8> }, you have to use transmute<&[u8], &Path>(owned.data.as_slice()): rust-lang/rust#18806
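
A minimal sketch of that reinterpretation, written in present-day Rust, may help. The types `MyPath` and `MyPathBuf` are hypothetical stand-ins using the definitions assumed above, `std::borrow::Borrow` stands in for the `BorrowFrom` trait of the time, and a pointer cast is used where the comment above says `transmute`. This is not the actual standard library code.

```rust
use std::borrow::Borrow;

/// Hypothetical unsized path type: a thin wrapper around a byte slice.
/// `repr(transparent)` guarantees the same layout as `[u8]`.
#[repr(transparent)]
struct MyPath {
    data: [u8],
}

/// Hypothetical owned counterpart.
struct MyPathBuf {
    data: Vec<u8>,
}

impl MyPath {
    fn from_bytes(bytes: &[u8]) -> &MyPath {
        // SAFETY: `MyPath` is `repr(transparent)` over `[u8]`, so the two
        // reference types have identical layout and the lifetime is preserved.
        unsafe { &*(bytes as *const [u8] as *const MyPath) }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data
    }
}

impl Borrow<MyPath> for MyPathBuf {
    fn borrow(&self) -> &MyPath {
        // No new `MyPath` value is created on the stack; the returned
        // reference points into the buffer owned by `self`.
        MyPath::from_bytes(&self.data)
    }
}
```

The key point is that the borrowed reference is a reinterpretation of memory the owner already holds, so nothing local is returned by reference.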

@aturon
Contributor
aturon commented Dec 18, 2014

Sorry for the slow response to the last few comments; here's some final follow-up.

@kballard

I do think we need to expose some sort of normalization method, because that's something we can expect people may want to do. I would propose .standardize() to do all the normalization operations that are safe to do without changing the meaning of the path (e.g. a//b -> a/b, a/./b -> a/b, and "" -> "."), and .normalize() would add in the a/../b -> b step. Then we can potentially have .realpath() that performs the actual realpath operation, where it uses the fs to traverse symlinks.

I agree that we should provide normalization (probably in both flavors) and realpath in some form. The plan I had in mind was for the normal iterator to essentially produce a "standardized" path in this terminology, so if you did .iter().collect() you could recover a standardized path. But this will depend a bit on the overall implementation strategy.

For now, let's say that we will plan to offer tools to do all three kinds of normalization, but the exact details are left to emerge later.
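
For illustration, this is roughly how the "iterate and collect" idea looks with today's `std::path`, where the iterator is spelled `components()` rather than the `.iter()` used in this thread. The function name `standardize` is hypothetical; treat this as a sketch of the idea rather than the design this RFC settled on.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: rebuild a path from its components.
/// The iterator already skips repeated separators and interior `.`
/// components, so `a//b/./c` comes back as `a/b/c`.
fn standardize(path: &Path) -> PathBuf {
    path.components().collect()
}
```

Note that this round trip does not map `""` to `"."` (an empty path stays empty) and never removes `..`, so it covers only part of the "standardize" behavior proposed earlier in the thread.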

@vadimcn

@SimonSapin, @aturon: MSDN usually refers to these as "wide" strings, so perhaps from_wide / to_wide ?

I'm happy to use that terminology -- seems like everyone is OK with that?

@retep998

My concern is with regards to \\?\ paths which disable all path normalization, including converting / to \. If .push() uses / on Windows, then the API effectively becomes unusable for \\?\ paths and I have to do everything manually. My recommendation is that .standardize() and friends should convert / to \ on Windows and that all methods adding separators should use \ as well.

In fact, I think it should be very much a priority to normalize paths on Windows to be \\?\ paths, simply because it allows us to bypass the MAX_PATH limitations which are a major source of problems with a large number of programs on Windows. Nobody can be reasonably expected to manually ensure they are using \\?\ paths on Windows, especially when developing a cross platform application. These kinds of things need to be taken care of by the standard library.

The above seems plausible to provide, hopefully as part of the same normalization infrastructure mentioned above. Again, I'd like to leave this part as a bit of an open question, as it's basically an (important!) add-on to the main API.

@kballard
Contributor

I always figured any automatic conversion to \\?\ would be done at the FFI boundary (the same place where you'll be converting it into a "wide" format). But for paths that are explicitly represented with \\?\ in Rust, it should preserve the semantics that \\?\ expects.

I do agree that on Windows, .push() should be using \ and .standardize() should convert any / into \ (as long as the path isn't using \\?\ because in that case / isn't a path separator).

@SimonSapin
Contributor

@SimonSapin, @aturon: MSDN usually refers to these as "wide" strings, so perhaps from_wide / to_wide ?

I'm happy to use that terminology -- seems like everyone is OK with that?

Sounds fine. It should be consistent with the corresponding API on OsStrBuf and OsStr in #517.

@alexcrichton
Member

Overall it seems like there's large support for this RFC, and various details can always be left to the implementation to smooth out over time. In light of this, I'm going to merge this RFC. Thanks again for assembling all this @aturon and @kballard!

@alexcrichton alexcrichton merged commit 3e2ed00 into rust-lang:master Dec 19, 2014
@l0kod
l0kod commented Dec 24, 2014

A bit late, but the Path comparison should take care if the file system is case-sensitive.

This should prevent security bugs like the CVE-2014-9390.

@SimonSapin
Contributor

A bit late, but the Path comparison should take care if the file system is case-sensitive.

If so, should it also account for Apple’s very own modified variant of Unicode Normalization Form D, on HFS+?

Regardless of HFS+ details, determining which filesystem a given path is on requires some system calls, which might be more expensive than a simple string comparison.

@l0kod
l0kod commented Dec 24, 2014

Regardless of HFS+ details, determining which filesystem a given path is on requires some system calls, which might be more expensive than a simple string comparison.

Yes if it's automatic, but a Path optional flag could tell if the path should be considered case-sensitive or not.
This flag could be automatically set with an exists()-like method call (cf. PathExtensions trait), which should have a negligible impact.

@retep998
Member

Regarding case sensitivity, on Windows using normal paths is case insensitive but if you use a \\?\ path there is no normalization and everything is case sensitive.
Perhaps we could have two forms of comparison, one that compares the strings, the other that compares the real paths and does system calls?
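
A rough sketch of what those two forms could look like with today's standard library is shown below. The names `same_spelling` and `same_location` are hypothetical, and whether canonicalization also folds case on a case-insensitive filesystem is platform-dependent and not something this sketch guarantees.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: cheap comparison of the path strings only.
fn same_spelling(a: &Path, b: &Path) -> bool {
    a.as_os_str() == b.as_os_str()
}

/// Hypothetical helper: asks the OS to resolve both paths (system calls,
/// symlink traversal) and compares the results. Fails if either path
/// does not exist.
fn same_location(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(fs::canonicalize(a)? == fs::canonicalize(b)?)
}
```

For example, `.` and `./.` differ under `same_spelling` but resolve to the same directory under `same_location`, at the cost of hitting the filesystem and of returning a `Result`.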

@erickt
erickt commented Jan 3, 2015

For reference, the c++ standards committee just accepted a filesystem rfc.

@blaenk
Contributor
blaenk commented Jan 3, 2015

Yeah, I think it would be very useful and informative to look into the combined efforts of the committee to see the motivations behind the choices they made. It could allow us to recognize something we may have overlooked in this paths reform RFC.

@aturon
Contributor
aturon commented Jan 3, 2015

FWIW as I've been working on the implementation I've been moving steadily closer to what Boost did, which is presumably close to this. I will take a look ASAP.

@retep998
Member

So I think what should happen is that whenever we call a system function with a path, we should check the length of the path. If it is safely within MAX_PATH then we go ahead and call the function. If the path is too large then we first check whether it is absolute or relative. If the path is relative we return an error. If the path is absolute then we check whether it is a \\?\ path. If it isn't then standardize the path by prepending \\?\ (if it is a UNC path that starts with \\ but not \\?\ then the path should be prefixed by \\?\UNC\), converting / to \, and accounting for . and .. components. We can then pass this path to the function.

EDIT: We should also probably convert all absolute paths to \\?\ paths when turning a relative path into an absolute path or normalizing a path. Also when appending a path onto a \\?\ path, we should normalize the appended path.
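
The decision procedure described above can be sketched as a pure string transformation, which keeps it runnable on any platform. Everything here is an assumption made for illustration: the names `Prepared`, `prepare_for_syscall` and `is_drive_absolute` are hypothetical, the length is counted in UTF-16 code units, and only drive-letter and UNC absolute paths are recognized.

```rust
/// Windows' traditional path length limit (including the trailing NUL).
const MAX_PATH: usize = 260;

#[derive(Debug, PartialEq)]
enum Prepared {
    /// Short enough, or already a `\\?\` path: pass through untouched.
    Unchanged(String),
    /// Rewritten into a `\\?\` (or `\\?\UNC\`) path.
    Verbatim(String),
    /// Too long and relative: report an error.
    RelativeTooLong,
}

/// Hypothetical check for `X:\...` style absolute paths.
fn is_drive_absolute(p: &str) -> bool {
    let b = p.as_bytes();
    b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && b[2] == b'\\'
}

fn prepare_for_syscall(path: &str) -> Prepared {
    let wide_len = path.encode_utf16().count();
    if wide_len < MAX_PATH || path.starts_with(r"\\?\") {
        return Prepared::Unchanged(path.to_string());
    }
    // Verbatim paths do not treat `/` as a separator, so convert first.
    let unified = path.replace('/', r"\");
    // `fixed` is the number of leading components that `..` may not remove.
    let (prefix, rest, fixed) = if let Some(rest) = unified.strip_prefix(r"\\") {
        (r"\\?\UNC\", rest, 2) // server and share
    } else if is_drive_absolute(&unified) {
        (r"\\?\", unified.as_str(), 1) // drive letter
    } else {
        return Prepared::RelativeTooLong;
    };
    // Verbatim paths disable normalization, so resolve `.` and `..` here.
    let mut parts: Vec<&str> = Vec::new();
    for part in rest.split('\\') {
        match part {
            "" | "." => {}
            ".." => {
                if parts.len() > fixed {
                    parts.pop();
                }
            }
            other => parts.push(other),
        }
    }
    let mut result = format!("{}{}", prefix, parts.join(r"\"));
    if parts.len() <= fixed {
        // Keep the root separator, e.g. `\\?\C:\` rather than `\\?\C:`.
        result.push('\\');
    }
    Prepared::Verbatim(result)
}
```

Short paths and existing `\\?\` paths pass through unchanged, long relative paths are rejected, and long absolute paths come back with `/` converted, `.` and `..` resolved, and the appropriate verbatim prefix added.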

@aturon aturon referenced this pull request in rust-lang/rust Jan 28, 2015
Closed

Stabilization for 1.0-alpha2 #20761

29 of 38 tasks complete
@aturon aturon added a commit to aturon/rust that referenced this pull request Jan 29, 2015
@aturon aturon Rename std::path to std::old_path
As part of [RFC 474](rust-lang/rfcs#474), this
commit renames `std::path` to `std::old_path`, leaving the existing path
API in place to ease migration to the new one. Updating should be as
simple as adjusting imports, and the prelude still maps to the old path
APIs for now.

[breaking-change]
ca63e68
@aturon aturon added a commit to aturon/rust that referenced this pull request Jan 29, 2015
@aturon aturon Add new path module
Implements [RFC 474](rust-lang/rfcs#474); see
that RFC for details/motivation for this change.

This initial commit does not include additional normalization or
platform-specific path extensions. These will be done in follow up
commits or PRs.
43681cc
@bors bors added a commit to rust-lang/rust that referenced this pull request Feb 3, 2015
@bors bors Auto merge of #21759 - aturon:new-path, r=alexcrichton
This PR implements [path reform](rust-lang/rfcs#474), and motivation and details for the change can be found there.

For convenience, the old path API is being kept as `old_path` for the time being. Updating after this PR is just a matter of changing imports to `old_path` (which is likely not needed, since the prelude entries still export the old path API).

This initial PR does not include additional normalization or platform-specific path extensions. These will be done in follow up commits or PRs.

[breaking-change]

Closes #20034
Closes #12056
Closes #11594
Closes #14028
Closes #14049
Closes #10035
449cb73