Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for os_str_bytes #111544

Closed
2 of 3 tasks
epage opened this issue May 13, 2023 · 23 comments · Fixed by #115443
Closed
2 of 3 tasks

Tracking Issue for os_str_bytes #111544

epage opened this issue May 13, 2023 · 23 comments · Fixed by #115443
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@epage
Copy link
Contributor

epage commented May 13, 2023

Feature gate: #![feature(os_str_bytes)]

This is a tracking issue for cross-platform access to the underling bytes (&[u8]) for &OsStr by defining it as an unspecified superset of UTF-8.

Assumptions:

  • Owned variants of this API are not a blocker for stablization
  • Panicking and erroring variants of from_os_str_bytes_unchecked are not a blocker for stablization

Public API

impl OsStr {
    pub unsafe fn from_os_str_bytes_unchecked(bytes: &[u8]) -> &Self;
    pub fn as_os_str_bytes(&self) -> &[u8];
}

impl OsString {
    pub unsafe fn from_os_str_bytes_unchecked(bytes: Vec<u8>) -> Self;
    pub fn into_os_str_bytes(self) -> Vec<u8>;
}

Steps / History

Unresolved Questions

  • What should we refer to this as?
    • Currently os_str_bytes
    • encoded_bytes was another name that was brought up

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@epage epage added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels May 13, 2023
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this issue May 30, 2023
Allow limited access to `OsStr` bytes

`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).

Tracking issue: rust-lang#111544
bors added a commit to rust-lang-ci/rust that referenced this issue May 30, 2023
Allow limited access to `OsStr` bytes

`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).

Tracking issue: rust-lang#111544
bors added a commit to rust-lang/miri that referenced this issue Jun 3, 2023
use as_os_str_bytes

Make use of the new operations recently added (tracking issue: rust-lang/rust#111544). At least the "host OsStr to target bytes" direction now works even for non-utf-8 strings on all hosts!
@epage
Copy link
Contributor Author

epage commented Jun 27, 2023

@RalfJung since you are one of the few to have used this in practice, any feedback? Any thoughts on how well the name works?

@RalfJung
Copy link
Member

🤷 no strong opinion on the name. It would have been nice to also have a safe function in the other direction; Miri still needs cfg to convert bytes to OsStr. Having the function return Option is fine as long as it guarantees that it will always succeed on slices produced by as_os_str_bytes and on valid UTF-8. Having only one direction safely feels asymmetrical.

@marcospb19
Copy link
Contributor

marcospb19 commented Jun 27, 2023

About naming, I opened #113106 and it contains this snippet:

let self_len = self.as_os_str().len();
let self_bytes = self.as_os_str().as_os_str_bytes();

Having the type name inside of the method usually leads to repetition, I'd suggest .as_slice() and .from_slice_unchecked since there's no such thing as "os_str_bytes", but I'm not convinced it's better, it reminds me of Vec::as_slice.

Well, assuming that .as_os_str_bytes should deprecate std::os::unix::ffi::OsStrExt::as_bytes anyways, can't we take the breaking change and call this .as_bytes()? If not, I'm also fine with the current name :).


I agree with @RalfJung, a from_os_str_bytes -> Option<_> here would be very convenient, compared with the unsafe version.

@epage
Copy link
Contributor Author

epage commented Jun 28, 2023

Having the type name inside of the method usually leads to repetition, I'd suggest .as_slice() and .from_slice_unchecked since there's no such thing as "os_str_bytes", but I'm not convinced it's better, it reminds me of Vec::as_slice.

The intention is to specify the encoding within the function name, much like utf8 and utf16 functions have the encoding in their name. The main difference is the encoding is opaque. One idea was as_encoded_bytes.

Well, assuming that .as_os_str_bytes should deprecate std::os::unix::ffi::OsStrExt::as_bytes anyways, can't we take the breaking change and call this `.as_bytes()´? If not, I'm also fine with the current name :).

We can't take breaking changes :). Unnsure if there are workarounds to avoid it being a breaking change.

However, we would want things to be parallel with both directions.

I agree with @RalfJung, a from_os_str_bytes -> Option<_> here would be very convenient, compared with the unsafe version.

The question is if stabilization of what exists should be blocked on this?

First, we'd need the ability to validating WTF-8 encoding which I don't think we have atm.

Second, we'd have to decide on what the error strategy is, whether None or something more like the error from encoding UTF8.

@programmerjake
Copy link
Member

Well, assuming that .as_os_str_bytes should deprecate std::os::unix::ffi::OsStrExt::as_bytes anyways, can't we take the breaking change and call this `.as_bytes()´? If not, I'm also fine with the current name :).

We can't take breaking changes :). Unnsure if there are workarounds to avoid it being a breaking change.

is it actually a breaking change if code calls a different function but with the exact same signature and the exact same effect? seems to me that it wouldn't be breaking since code won't behave differently at all.

@marcospb19
Copy link
Contributor

The question is if stabilization of what exists should be blocked on this?

I vote for not blocking because I'd love to see this stabilized, and the checked encoding would certainly delay it.

RalfJung pushed a commit to RalfJung/rust that referenced this issue Jun 29, 2023
use as_os_str_bytes

Make use of the new operations recently added (tracking issue: rust-lang#111544). At least the "host OsStr to target bytes" direction now works even for non-utf-8 strings on all hosts!
@dylni
Copy link
Contributor

dylni commented Jul 4, 2023

Based on my work to integrate this into the "os_str_bytes" crate, these are my observations:

  • Equivalents for OsString and Vec<u8> would be helpful for efficiency.
  • I don't see anything that guarantees OsStr::as_os_str_bytes and OsStr::as_bytes will return the same value, so it's unclear if differentiating them is required for correctness.
  • As mentioned in earlier comments, error detection isn't possible. However, the standard explicitly doesn't define it, so this may be debatable. I don't think it's required for stabilization, since the unchecked methods should exist anyway.
    • However, "os_str_bytes" has assertion methods for safe conversions, so something similar would clearly my preference. 😃

epage added a commit to epage/rust that referenced this issue Jul 7, 2023
This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString`
as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
@epage
Copy link
Contributor Author

epage commented Jul 14, 2023

@dylni I went ahead and created #113442 for OsString support.

The reason for differentiating OsStrExt::as_bytes from OsStr::as_os_str_bytes is that OsStrExt::as_bytes is for systems where an OsStr can be arbitrary bytes. When dealing with cross-platform code in OsStr::as_os_str_bytes, we are communicating that there is a platform-specific encoding with its own invariants that need to be upheld, like the fact that if you compared to the result of that call across two versions of Rust, in theory, you could get different results.

@dylni
Copy link
Contributor

dylni commented Jul 15, 2023

@epage Thanks!

For the second point, the problem comes when implementing checked indexing. On Windows, this requires finding a UTF-8 sequence before or after the index, which is slightly inefficient. This work could be avoided on platforms that support OsStrExt::as_bytes, since any index will be valid, but then there's no guarantee that the same index is valid for the result of OsStr::as_os_str_bytes.

However, it may be better to share the implementation between platforms, even if it doesn't cause a safety issue on Unix. I'll start with that implementation for now, since it does make the crate more consistent.

@epage
Copy link
Contributor Author

epage commented Jul 18, 2023

imo it would be fine for OsStrExt::as_bytes to have a comment that it is the same data as OsStr::as_os_str_bytes

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 22, 2023
Allow limited access to `OsString` bytes

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 22, 2023
Allow limited access to `OsString` bytes

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 22, 2023
Allow limited access to `OsString` bytes

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
@epage
Copy link
Contributor Author

epage commented Jul 28, 2023

@dylni OsString support should be in nightly now in case you want to play with that and provide any feedback.

@dylni
Copy link
Contributor

dylni commented Aug 2, 2023

@epage Thanks for the note! I started using it in the crate, and I'll be publishing it under a feature once I complete some more testing. I'll update this ticket if I see any other issues.

@epage
Copy link
Contributor Author

epage commented Aug 2, 2023

With @dylni (author of os_str_bytes crate) giving this a try and feedback being addressed, do we expect any other feedback to come in or should we move towards stabilization?

The only open question is the name we give this (e.g. as_os_str_bytes vs as_encoded_bytes). I lean towards whats present (as_os_str_bytes) but do not have strong feelings on this besides agreeing that we should use names that connote there are invariants / structure beyond being a sequence of arbitrary bytes.

@epage epage added the I-libs-api-nominated The issue / PR has been nominated for discussion during a libs-api team meeting. label Aug 2, 2023
@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2023

Should the OsString version be called as_os_string_bytes? That would better reflect its type.

@epage
Copy link
Contributor Author

epage commented Aug 2, 2023

(I assume you mean into_os_str_bytes)

I went with os_str to refer to the style of encoding being used, rather than focusing on the type itself, making it consistent with other places we are referring to the encoding.

That said, I do not feel strongly about this.

@Scripter17
Copy link

We can't take breaking changes :). Unnsure if there are workarounds to avoid it being a breaking change.

Silly idea: Make std::os::unix::ffi::OsStrExt an empty trait and move the functions to OsStr so legacy code can still use the trait without causing issues

@programmerjake
Copy link
Member

We can't take breaking changes :). Unnsure if there are workarounds to avoid it being a breaking change.

Silly idea: Make std::os::unix::ffi::OsStrExt an empty trait and move the functions to OsStr so legacy code can still use the trait without causing issues

that's a breaking change since people could be calling it like OsStrExt::as_bytes which would fail to compile.

@Scripter17
Copy link

Scripter17 commented Aug 2, 2023

trait OsStrExt {
    fn as_bytes(&self) -> &[u8] {
        // something that gets specifically the as_bytes from the inherent implementation
        // I have absolutely no idea how to do that. `OsStr::as_bytes` just calls this function
    }
    ...
}

impl OsStrExt for OsStr{}

impl OsStr {
    fn as_bytes(&self) -> &[u8] {
        // The actual function
    }
}

Assuming this is at all possible, it should fix that

Edit: I should clarify this doesn't seem to produce compile-time name conflicts or whatever it's called. It just also seems to be impossible to use the as_bytes not attached to a trait

@dylni
Copy link
Contributor

dylni commented Aug 4, 2023

Regarding the name, shortening it is probably undesirable, as the bytes need special handling. For example, combining two byte sequences would create another invalid sequence in some cases. It would ideally be rare that non-libraries use the methods.

I agree with @epage that using os_str consistently as the encoding makes sense. For the "os_str_bytes" crate, it's called into_os_str_vec, for consistency with other methods and OsStringExt::into_vec.

@Amanieu
Copy link
Member

Amanieu commented Aug 9, 2023

We discussed this in the libs-api meeting yesterday. We prefer the as_encoded_bytes name since as_os_str_bytes is confusing (it's not clear what os_str means in this context, it's already a method on OsStr). encoded will at least encourage people to read the documentation of this method to find out what the encoding is.

We're also happy to stabilize this along with the name change.

@rfcbot fcp merge

@rfcbot
Copy link

rfcbot commented Aug 9, 2023

Team member @Amanieu has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Aug 9, 2023
@Amanieu Amanieu removed the I-libs-api-nominated The issue / PR has been nominated for discussion during a libs-api team meeting. label Aug 9, 2023
@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Aug 22, 2023
@rfcbot
Copy link

rfcbot commented Aug 22, 2023

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Sep 1, 2023
@rfcbot
Copy link

rfcbot commented Sep 1, 2023

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

@rfcbot rfcbot added the to-announce Announce this issue on triage meeting label Sep 1, 2023
fmease added a commit to fmease/rust that referenced this issue Sep 2, 2023
feat(std): Stabilize 'os_str_bytes' feature

Closes rust-lang#111544
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Sep 2, 2023
feat(std): Stabilize 'os_str_bytes' feature

Closes rust-lang#111544
@bors bors closed this as completed in 9aee1de Sep 2, 2023
@apiraino apiraino removed the to-announce Announce this issue on triage meeting label Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants