Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert \r\n -> \n in include_str! macro #63681

Open
wants to merge 1 commit into
base: master
from

Conversation

@matklad
Copy link
Member

commented Aug 18, 2019

Ideally, the meaning of the program should be independent of the line
endings used, because, for example, git can change line endings during
a checkout.

We currently do line-ending conversion in almost all cases, with
include_str being an exception. This commit removes this exception,
bringing include_str closer in behavior to string literals.

Note that this is technically a breaking change.

In case that you really mean to include a string with DOS line
endings, you can use include_bytes! macro which is guaranteed to not
do any translation, like this

pub fn my_text() -> &'static str {
    unsafe {
        std::str::from_utf8_unchecked(MY_TEXT_BYTES);
    }
}

const MY_TEXT_BYTES: &[u8] = include_bytes("my_text.bin");

#[test]
fn test_encoding() {
    std::str::from_utf8(MY_TEXT_BYTES)
    .unwrap();
}

cc @petrochenkov, @rust-lang/lang

Some preliminary discussion in #63525 (comment)

@rust-highfive

This comment has been minimized.

Copy link
Collaborator

commented Aug 18, 2019

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@Centril

This comment has been minimized.

Copy link
Member

commented Aug 18, 2019

Seems like a reasonable change to me and it removes exceptions (yay!).

Note that this is technically a breaking change.

Currently the crater queue is quite long, but do you want to crater it in any case?

I think this is small enough a change that we don't need an FCP but it seems good to discuss it a bit at least so I've nominated for the T-Lang meeting on Thursday (21:00 Stockholm/Paris time). Feel free to join that meeting.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2019

So, the primary alternatives / extensions are:

  • Do the change, but simultaneously provide a new macro include_bytes_as_str that re-packages the read bytes as a str without any normalization, but with a static UTF-8 validity check.
  • Do the change, but add Python-style optional arguments to include_str - include_str!("path/to/file", normalize_eol=false, normalize_bom=true) to opt out of the default.

The choice between these extensions and doing nothing probably depends on what crater finds.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2019

This change is very easily revertable and backportable if necessary, so we may get the necessary experimental results faster and better by landing it on nightly, rather than by running crater.

@rust-highfive

This comment has been minimized.

Copy link
Collaborator

commented Aug 18, 2019

The job x86_64-gnu-llvm-6.0 of your PR failed (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2019-08-18T09:30:01.1517412Z ##[command]git remote add origin https://github.com/rust-lang/rust
2019-08-18T09:30:01.1717434Z ##[command]git config gc.auto 0
2019-08-18T09:30:01.1784073Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2019-08-18T09:30:01.1842123Z ##[command]git config --get-all http.proxy
2019-08-18T09:30:01.1987271Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/63681/merge:refs/remotes/pull/63681/merge
---
2019-08-18T09:30:36.9770129Z do so (now or later) by using -b with the checkout command again. Example:
2019-08-18T09:30:36.9770178Z 
2019-08-18T09:30:36.9770375Z   git checkout -b <new-branch-name>
2019-08-18T09:30:36.9770414Z 
2019-08-18T09:30:36.9770459Z HEAD is now at d04a5bea1 Merge dffb9884e1479a09380ad1df23ce4fd3f458ba9d into ef1ecbefb8719e408150738664d443a73f757ffd
2019-08-18T09:30:36.9939413Z ##[section]Starting: Collect CPU-usage statistics in the background
2019-08-18T09:30:36.9942351Z ==============================================================================
2019-08-18T09:30:36.9942403Z Task         : Bash
2019-08-18T09:30:36.9942445Z Description  : Run a Bash script on macOS, Linux, or Windows
---
2019-08-18T10:34:30.1714909Z .................................................................................................... 1500/8928
2019-08-18T10:34:35.9478948Z .................................................................................................... 1600/8928
2019-08-18T10:34:49.2419425Z ................................i...............i................................................... 1700/8928
2019-08-18T10:34:57.1132015Z .................................................................................................... 1800/8928
2019-08-18T10:35:11.9174829Z ........................iiiii....................................................................... 1900/8928
2019-08-18T10:35:22.8933106Z .................................................................................................... 2100/8928
2019-08-18T10:35:25.5655650Z .................................................................................................... 2200/8928
2019-08-18T10:35:30.6364664Z .................................................................................................... 2300/8928
2019-08-18T10:35:37.6717122Z .................................................................................................... 2400/8928
---
2019-08-18T10:38:36.2171931Z .................................................................................................... 4600/8928
2019-08-18T10:38:43.5325442Z ........i...............i........................................................................... 4700/8928
2019-08-18T10:38:55.1457957Z .................................................................................................... 4800/8928
2019-08-18T10:39:01.2755225Z .................................................................................................... 4900/8928
2019-08-18T10:39:13.8157587Z .........................................................................................ii.ii...... 5000/8928
2019-08-18T10:39:23.7348839Z .................................................................................................... 5200/8928
2019-08-18T10:39:33.7394165Z .................................................................................................... 5300/8928
2019-08-18T10:39:40.8118861Z .............................................i...................................................... 5400/8928
2019-08-18T10:39:47.6553871Z .................................................................................................... 5500/8928
2019-08-18T10:39:47.6553871Z .................................................................................................... 5500/8928
2019-08-18T10:39:58.6695592Z .................................................................................................... 5600/8928
2019-08-18T10:40:11.6789346Z ......................................ii...i..ii...........i........................................ 5700/8928
2019-08-18T10:40:30.6828926Z .................................................................................................... 5900/8928
2019-08-18T10:40:35.7549078Z .................................................................................................... 6000/8928
2019-08-18T10:40:35.7549078Z .................................................................................................... 6000/8928
2019-08-18T10:40:49.4047455Z .......................................i..ii........................................................ 6100/8928
2019-08-18T10:41:11.1586565Z .................................................................................i.................. 6300/8928
2019-08-18T10:41:13.4629637Z .................................................................................................... 6400/8928
2019-08-18T10:41:15.8812170Z .....................................................i.............................................. 6500/8928
2019-08-18T10:41:19.1679921Z .................................................................................................... 6600/8928
---
2019-08-18T10:42:21.2634053Z .................................................................................................... 7200/8928
2019-08-18T10:42:28.3697969Z .................................................................................................... 7300/8928
2019-08-18T10:42:38.8921258Z .................................................................................................... 7400/8928
2019-08-18T10:42:48.7613303Z .................................................................................................... 7500/8928
2019-08-18T10:42:55.8088406Z .ii......i.......................................................................................... 7600/8928
2019-08-18T10:43:17.5411503Z .................................................................................................... 7800/8928
2019-08-18T10:43:26.6807945Z .................................................................................................... 7900/8928
2019-08-18T10:43:36.5100485Z .................................................................................................... 8000/8928
2019-08-18T10:44:16.9993790Z .................................................................................................... 8100/8928
---
2019-08-18T10:45:22.7236156Z 
2019-08-18T10:45:22.7236605Z ------------------------------------------
2019-08-18T10:45:22.7236862Z stderr:
2019-08-18T10:45:22.7237282Z ------------------------------------------
2019-08-18T10:45:22.7237897Z thread 'main' panicked at 'assertion failed: source.contains("string\r\nliteral")', /checkout/src/test/ui/lexer-crlf-line-endings-string-literal-doc-comment.rs:40:5
2019-08-18T10:45:22.7238640Z 
2019-08-18T10:45:22.7239745Z ------------------------------------------
2019-08-18T10:45:22.7240025Z 
2019-08-18T10:45:22.7240217Z 
---
2019-08-18T10:45:22.7269787Z thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:536:22
2019-08-18T10:45:22.7270218Z note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
2019-08-18T10:45:22.7288298Z 
2019-08-18T10:45:22.7289091Z 
2019-08-18T10:45:22.7292716Z command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/ui" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "ui" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Cdebuginfo=0 -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Cdebuginfo=0 -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
2019-08-18T10:45:22.7294723Z 
2019-08-18T10:45:22.7295274Z 
2019-08-18T10:45:22.7305813Z failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
2019-08-18T10:45:22.7305948Z Build completed unsuccessfully in 1:08:12
2019-08-18T10:45:22.7305948Z Build completed unsuccessfully in 1:08:12
2019-08-18T10:45:22.7360682Z == clock drift check ==
2019-08-18T10:45:22.7376830Z   local time: Sun Aug 18 10:45:22 UTC 2019
2019-08-18T10:45:23.0155082Z   network time: Sun, 18 Aug 2019 10:45:23 GMT
2019-08-18T10:45:23.0155192Z == end clock drift check ==
2019-08-18T10:45:24.1354669Z ##[error]Bash exited with code '1'.
2019-08-18T10:45:24.1420490Z ##[section]Starting: Checkout
2019-08-18T10:45:24.1422281Z ==============================================================================
2019-08-18T10:45:24.1422337Z Task         : Get sources
2019-08-18T10:45:24.1422385Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@Centril

This comment has been minimized.

Copy link
Member

commented Aug 18, 2019

convert \r\n -> \n in include_str! macro
Ideally, the meaning of the program should be independent of the line
endings used, because, for example, git can change line endings during
a checkout.

We currently do line-ending conversion in almost all cases, with
`include_str` being an exception. This commit removes this exception,
bringing `include_str` closer in behavior to string literals.

Note that this is technically a breaking change.

In case that you really mean to include a string with DOS line
endings, you can use `include_bytes!` macro which is guaranteed to not
do any translation, like this

    pub fn my_text() -> &'static str {
        unsafe {
            std::str::from_utf8_unchecked(MY_TEXT_BYTES);
        }
    }

    const MY_TEXT_BYTES: &[u8] = include_bytes("my_text.bin");

    #[test]
    fn test_encoding() {
        std::str::from_utf8(MY_TEXT_BYTES)
	    .unwrap();
    }

@matklad matklad force-pushed the matklad:normalized-include-str branch from dffb988 to df73b26 Aug 18, 2019

@eddyb

This comment has been minimized.

Copy link
Member

commented Aug 18, 2019

for example, git can change line endings during a checkout

Yay bad defaults strike again... even if there are ways to ensure the right behavior in git (convert \r\n -> \n when committing only on windows, and only for certain files, and don't do anything else).

I don't have much against this change, although I will say we should make str::from_utf8(b"...").unwrap() work in constants already... (but also, isn't the unchecked version const fn? if so, why?)

@@ -7,6 +7,6 @@ fn main() {
);
assert_eq!(
include_str!("data.bin"),
"\u{FEFF}This file starts with BOM.\r\nLines are separated by \\r\\n.\r\n",
"This file starts with BOM.\nLines are separated by \\r\\n.\n",

This comment has been minimized.

Copy link
@lzutao

lzutao Aug 18, 2019

Contributor

"Lines are separated by \\n."

This comment has been minimized.

Copy link
@matklad

matklad Aug 18, 2019

Author Member

Nope, it should be \\r\\n, escaped \r is not changed in any way.

@nagisa

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2019

This is a breaking change. Unlike with include!, it is trivial to observe "normalization" happening in include_str and include_bytes.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2019

For include! it's pretty trivial as well, just include something with a multi-line string literal.

@matklad

This comment has been minimized.

Copy link
Member Author

commented Aug 20, 2019

@petrochenkov I don't think so: line endings in multiline string literals are not observable, regardless of the way you get the literal.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2019

I mean the normalization is observable, the literals have \r\ns in the included file and \n in AST, same case as with include_str after this PR.

@Centril

This comment has been minimized.

Copy link
Member

commented Aug 29, 2019

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Aug 30, 2019

Regarding prior art on Windows, MSVC normalizes \r\n to \n, even in raw strings.

Rust always normalizes too (including raw strings), with include_str being the only exception.
(include_str is also a raw string, just loaded from a separate file.)

@joshtriplett

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

Another alternative: we could add a new include_text! or similar that does normalize, and continue to not normalize include_str!.

@retep998

This comment has been minimized.

Copy link
Member

commented Sep 6, 2019

I personally am not opposed to changing include_str! to normalize newlines, especially since we normalize them everywhere else anyway.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2019

Hey all,

we discussed this in the @rust-lang/lang meeting. It seemed to us that it would be a good topic for an RFC, as the best decision is not entirely obvious. In particular it'd be good to get a survey of how similar features work in other languages.

I think there was some general discomfort with changing the behavior here -- I guess it comes down to whether one considers the existing behavior a "bug" or a "feature". Going with another name (and probably deprecating the existing function) might obviate that concern. A stamp of approval from RFC commentors might also help.

If you think this is unreasonable, feel free to complain.

(Side question that occurs to me now: this seems like one of those "kind of lang kind of libs" questions, so I'd also like to hear what @rust-lang/libs members think about this question.)

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2019

Actually, I'll keep the nomination ticket just so we review any responses.

@sfackler

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

I'd also be a bit nervous about changing behavior here.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2019

I'd be unhappy with introduction of some new macro and deprecation of include_str, because pretty much all uses of include_str would have to migrate.

IMO, the best way to proceed here is to land the change and see what happens (possibly revert before the stable release).

Precedent: the same change was done for raw string literals a few releases ago without much noise, and we still haven't seen any complaints.
So, I secretly wish we landed this sneakily as well, then nobody would ever notice the change. So, this PR receives the amount of attention it doesn't deserve.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2019

@matklad
Would you like to write a mini-RFC describing the flavors of file loading and BOM/EOL normalization?
That would also be pretty useful as a documentation piece for our "pre-lexer" behavior.

@Centril

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

I mainly wished for the documentation aspect in an RFC. Otherwise, @petrochenkov makes good points overall re. that this would have potentially gone unnoticed if not for me flagging it (sorry? 😅). It seems @retep998 is fine with this from a Windows POV and that carries weight as well.

I think deprecation of things in the language (as opposed to library-encoded things) should be done with care so if we were to add a new macro I'd like to not deprecate the existing one.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2019

A crater run would certainly be useful data.

@Centril

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

Yea. Since T-infra doesn't want to give us more instances (budget reasons & stuff...), maybe we can bump a crater run a bit in the queue ahead of perhaps "less important" things.

@joshtriplett

This comment has been minimized.

Copy link
Member

commented Sep 12, 2019

@rfcbot merge

We discussed this in the language team meeting. Given the existence of include_bytes! for raw byte-for-byte inclusion of a file, we felt this change seems reasonable.

However, as @petrochenkov suggested, we'd also like to see documentation explaining the behavior, for line endings and for BOMs. (This could just take the form of updating the include_str! documentation.)

@rfcbot concern document-behavior-including-eol-and-bom

@rfcbot

This comment has been minimized.

Copy link

commented Sep 12, 2019

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@matklad

This comment has been minimized.

Copy link
Member Author

commented Sep 12, 2019

(This could just take the form of updating the include_str! documentation.)

This is exactly what this PR does: https://github.com/rust-lang/rust/pull/63681/files#diff-a5a0682c18717c9e06e867e1cf2abeb8R981 :)

@joshtriplett

This comment has been minimized.

Copy link
Member

commented Sep 12, 2019

@matklad Thank you, We should have read more carefully; none of us saw the note in the comment about the BOM.

@rfcbot resolve document-behavior-including-eol-and-bom

@matklad

This comment has been minimized.

Copy link
Member Author

commented Sep 12, 2019

See also rust-lang-nursery/reference#676, which properly specifies the current behavior of string literals in the reference.

@jhpratt

This comment has been minimized.

Copy link

commented Sep 13, 2019

I am opposed to this change. include_str! should do just that — include the string and nothing more. It is, at least to me, expected to include the file as-is.

@matklad

This comment has been minimized.

Copy link
Member Author

commented Sep 13, 2019

@jhpratt would you prefer the usual string literals, like

let xs = "hello
world
";

to also preserve line-endings as is?

If you prefer to do translation in one case, but not in the other, could you explain what makes the two cases feel different for you?

@jhpratt

This comment has been minimized.

Copy link

commented Sep 14, 2019

@matklad I do not believe your example should preserve line endings. The difference, to me, is that it's an external file. It seems logical (again, to me) that an external file would be included as-is.

In a situation like your example, the user could still explicitly provide for a \r if they wanted the Windows-style newline. From the top of my head, there would be no equivalent way to do that with this proposal, except for a second macro (which makes things even more confusing).

As with anything else, if I'm outnumbered (and it seems I am), then concensus should prevail.

@matklad

This comment has been minimized.

Copy link
Member Author

commented Sep 14, 2019

Aha, I see, thanks. Not being able to express \r\n line endings is a good point! However, I feel that with status quo you can’t express \r\n reliably either, and that is arguably worse.

Specifically, in a string literal one has to use \r escape, which makes it clear to the human that there’s an intended carriage return.

With include_str however, if you include a literal \r\n byte, here’s a chance that VCS translates it to \n, so you can get \r\n on one machine and \n on another! Moreover, to the human it is not obvious that the file contains \r.

There is a way to get \r\n reliably (in the PR description). It’s wordy, but that is ok for a pretty niche use-case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.