Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for os_str_slice #118485

Open
2 of 4 tasks
blyxxyz opened this issue Nov 30, 2023 · 8 comments
Open
2 of 4 tasks

Tracking Issue for os_str_slice #118485

blyxxyz opened this issue Nov 30, 2023 · 8 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@blyxxyz
Copy link
Contributor

blyxxyz commented Nov 30, 2023

Feature gate: #![feature(os_str_slice)]

This is a tracking issue for an API for taking substrings of OsStr, which in combination with OsStr::as_encoded_bytes() would make it possible to implement most string operations in (portable) safe code.

Public API

impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}

Steps / History

Unresolved Questions

  • Currently this enforces the same requirements on all platforms, but on Unix (and some other platforms) the internal encoding of OsStr is already fully specified to be arbitrary bytes by means of the OsStrExt trait. Should we:
    • Relax the requirements on these platforms,
    • Keep enforcing the lowest common denominator invariants, or
    • Use strict checks now, but leave the possibility of relaxing them later?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@blyxxyz blyxxyz added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Nov 30, 2023
@epage
Copy link
Contributor

epage commented Nov 30, 2023

Currently this enforces the same requirements on all platforms, but on Unix (and some other platforms) the internal encoding of OsStr is already fully specified to be arbitrary bytes by means of the OsStrExt trait. Should we:

This was discussed somewhat in the ACP (rust-lang/libs-team#306). Copying over the parts I find relevant :)

Starting with "lowest common denominator invariants" can always be relaxed later, as we'd be switching cases from asserting to not-asserting.

The other encoded_bytes functions are only documented for their cross-platform nature and the conversions docs point people to OsStrExt for platform-specific assumptions. I think being consistent with those would be good

@epage
Copy link
Contributor

epage commented Nov 30, 2023

Looking over #118484, this looks like it'll have a lot of overhead to enforce the invariants when building higher level operations on top that would go away with native support for those operations. This is mostly an observation as I'm not sure what else we can do for now while pattern API support is at a stand-still.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Nov 30, 2023

There's a fast path for splitting on ASCII, which you'll always take when doing traditional options parsing1. My hunch is that that's the common case in general, that even if part of the string is Unicode or raw bytes you'll typically be hunting for some bit of ASCII syntax.

It can be made a lot faster on Windows (see #118484 (comment)), and I want to try that if I can get a local Windows environment set up. (EDIT: the wtf8 tests do run on Linux, maybe that's enough to work it out.)

If we relax the requirements for Unix then it just becomes a normal slicing operation there. That's one reason that feels attractive to me. Otherwise we have to keep doing something like the current validation.

The ASCII fast path can be made faster by skipping bounds checks and giving it its own function. Here's what I get by mucking around in compiler explorer:
image
(https://rust.godbolt.org/z/hx9r7nvhc)

But I don't know how far it's worth going, and if we choose relaxed checks then we can just get rid of it.

Footnotes

  1. Though PowerShell does support Unicode dashes.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Dec 1, 2023

With strict checks it would technically be optimal for user code to have a function like this:

use std::ffi::OsStr;

fn slice_os_str(s: &OsStr, start: usize, end: usize) -> &OsStr {
    #[cfg(all(target_vendor = "fortanix", target_env = "sgx"))]
    use std::os::fortanix_sgx::ffi::OsStrExt;
    #[cfg(target_os = "hermit")]
    use std::os::hermit::ffi::OsStrExt;
    #[cfg(target_os = "solid")]
    use std::os::solid::ffi::OsStrExt;
    #[cfg(unix)]
    use std::os::unix::ffi::OsStrExt;
    #[cfg(target_os = "wasi")]
    use std::os::wasi::ffi::OsStrExt;
    #[cfg(target_os = "xous")]
    use std::os::xous::ffi::OsStrExt;

    #[cfg(any(
        unix,
        target_os = "wasi",
        target_os = "hermit",
        all(target_vendor = "fortanix", target_env = "sgx"),
        target_os = "solid",
        target_os = "xous"
    ))]
    return OsStr::from_bytes(&s.as_bytes()[start..end]);

    #[cfg(not(any(
        unix,
        target_os = "wasi",
        target_os = "hermit",
        all(target_vendor = "fortanix", target_env = "sgx"),
        target_os = "solid",
        target_os = "xous"
    )))]
    return s.slice_encoded_bytes(start..end);
}

And that's a little sad.

from_encoded_bytes_unchecked() is unsafe and not even Miri can check that you're using it correctly, so it makes sense to be as straightforward and restrictive as possible there. But I don't think we need to have as many worries for this function.

So all in all I'm back to preferring to skip the check on these platforms.

@epage
Copy link
Contributor

epage commented Dec 1, 2023

Regarding performance, the question I asked myself is why we can't have a subset of the Pattern API that doesn't take OsStr patterns. Most of the times slice_encoded_bytes would be needed would be reduced by the Pattern API. From what I can tell from the Tracking Issue is that OsStr patterns is what held things up. I noticed that the ACP linked to #109350 which does exactly this.

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Dec 1, 2023

My thought is that we should have such an API but that this method will be easier to get stabilized. It's a small and unopinionated MVP.

@epage
Copy link
Contributor

epage commented Dec 1, 2023

That doesn't make them mutually exclusive. My point was that for more performance critical code, we can look to #109350 while we can have slice_encoded_bytes uphold invariants (and users can always create your example slice_os_str above).

#109350 mirrors an API on one type into another, is related to an approved RFC, and trims down the biggest, most contentious part of that RFC. My hope is that it can be a relatively quick to stabilize API. It hasn't gotten much attention but I'm looking into that.

@BurntSushi
Copy link
Member

With respect to the restrictions, I am generally inclined toward @blyxxyz's point of view here where we shouldn't need them on Unix because its representation is already set in stone. With that said, keeping uniform restrictions does make the behavior easier to explain and reason about. (I think y'all mentioned that.) But I think most importantly for me anyway, if we start with uniform restrictions, we can always relax them later when we've got some solid use cases motivating us to do that. For std, I am generally inclined to the conservative posture because of our unique stability constraints.

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 2, 2023
Add substring API for `OsStr`

This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code.

API:
```rust
impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}
```
Motivation, examples and research at rust-lang/libs-team#306.

Tracking issue: rust-lang#118485

cc `@epage`
r? libs-api
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 18, 2024
…ark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage,` `@BurntSushi`
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Feb 18, 2024
Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage,` `@BurntSushi`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants