Splitting text line by line #902

bobi6666 · 2022-08-17T01:21:58Z

bobi6666
Aug 17, 2022

Hello. In the playground linked below, I tried to write code that takes text with different line endings and splits it with a regex. But when I loop over the result with a for loop and break after the first item, println prints the entire text without splitting it.
Is there anything that I am doing wrong? If so, how do I fix it?
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=809c68065b3e15efcb3a03ea79818251

BurntSushi · 2022-08-17T02:30:45Z

BurntSushi
Aug 17, 2022
Maintainer

The output shows that the string is split at each line ending and prints test1, test2 and test3. This is what I would expect. What specific output are you looking for?

0 replies

bobi6666 · 2022-08-17T07:20:35Z

bobi6666
Aug 17, 2022
Author

I thought that when I use split, set up so that it works line by line, and put a break after the first iteration, it would print test1 and stop there. Instead, it prints test1, test2 and test3 even though the loop breaks after the first iteration.

0 replies

bobi6666 · 2022-08-17T07:59:15Z

bobi6666
Aug 17, 2022
Author

I'm building this as an example. If it works, I will later need to load a file via regex::bytes, because the file probably won't be UTF-8, and then split the text on line endings, which can be \r\n or \n. Once the file is loaded and split, I pull out a random string according to a number generated by the rand crate. But when I tried the code I sent you, it looked like the strings were not split correctly on line endings: if they were, it should print test1 and stop once I added the break.

0 replies

BurntSushi · 2022-08-17T11:44:07Z

BurntSushi
Aug 17, 2022
Maintainer

$ only matches at the end of a string. It sounds like you want it to match at line endings too. You need to enable multi line mode for that, just like you have to do in most other regex engines.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3e4ac6e8bf10babeb8937d5c333134e3

Note also that the only recognized line ending is \n currently. There is an issue somewhere tracking the addition of \r\n.

0 replies

bobi6666 · 2022-08-17T11:50:16Z

bobi6666
Aug 17, 2022
Author

Hello, you understood my question very well, and your suggestion looks like a solution. I still have three questions:

* Do you think the issue with handling \r\n will ever be solved in the near future?
* When I load a file in which there is no guarantee that the file is all UTF-8, do I have to add something else to your example so that there is no problem with non-UTF-8 characters?
* In case the first problem has not been solved in the near future, do you think I could use bstring and its lines handler to handle \r\n?

1 reply

BurntSushi Aug 17, 2022
Maintainer

Here's the issue tracking CRLF support for $: #244

> do you think the issue with handling \r\n will ever be
> solved in the near future?

I don't give estimates for projects I work on in my free time.

> when I load a file in which there is no guarantee that the file is all
> utf8, do I have to add something else to your example so that there is
> no problem with non utf8 characters?

This is kind of a complicated and nuanced topic. If you expect your input to be conventionally UTF-8 but might have the odd latin-1 byte somewhere, then you can:

* Assume it's close enough to UTF-8 and operate on &[u8] directly. That's what regex::bytes is for.
* Lossily decode your data to UTF-8 such that invalid UTF-8 gets replaced with U+FFFD (the replacement codepoint). This way, you get a &str.
* Return an error to the end user if it isn't valid UTF-8.

Any one of those might be reasonable. I can't tell you which is right for your use case because I don't know what problem you're trying to solve.
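
As a rough illustration of those three options, here is a minimal sketch that uses only the standard library. The helper names (count_newlines, decode_lossy, decode_strict) are made up for this example and are not part of the regex crate. The first helper only stands in for "work on the bytes directly"; real matching on &[u8] would go through regex::bytes.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Option 1: stay on raw bytes and never decode.
/// (Hypothetical helper; it just counts `\n` bytes.)
fn count_newlines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}

/// Option 2: lossy decoding. Invalid UTF-8 is replaced with U+FFFD.
/// The result borrows the input when it was already valid UTF-8.
fn decode_lossy(data: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(data)
}

/// Option 3: strict decoding. Invalid UTF-8 is reported as an error.
fn decode_strict(data: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(data)
}

fn main() {
    // 0xE9 is "é" in Latin-1, but a lone 0xE9 is not valid UTF-8.
    let data: &[u8] = b"caf\xE9\nok\n";

    println!("newlines: {}", count_newlines(data));
    println!("lossy: {:?}", decode_lossy(data));
    match decode_strict(data) {
        Ok(text) => println!("valid UTF-8: {:?}", text),
        Err(err) => println!("not valid UTF-8: {}", err),
    }
}
```

Note that the lossy route changes the data, so it is only suitable when the text does not have to be handed back byte for byte.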

Now, if your input might be UTF-8, or UTF-16 or something else entirely, then that's a different problem and you likely need something like the encoding_rs_io crate to help you there.

> in case the first problem has not been solved in the near future, do
> you think I could use bstring and its lines handler to handle \r\n?

I don't see how I could answer this without knowing the problem you're trying to solve. Like... are you just trying to iterate over lines? Then yeah, umm, don't use a regex for that... You can either use the standard library's line iterator (for &str) or bstr's iterator (for &[u8]) or just roll your own. If you're trying to do more complicated matching and do specifically want a regex, then just use $. And if your match ends with a \r, remove it.
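
For the plain "iterate over lines" case, a sketch along these lines might be enough. It uses the standard library's str::lines for &str and a hand-rolled splitter for &[u8]. The function name byte_lines is hypothetical, and the splitter assumes that only \n and \r\n count as line endings.

```rust
/// Split raw bytes into lines without decoding them.
/// A line ends at `\n`; one `\r` directly before it is dropped too,
/// so both `\n` and `\r\n` endings are handled.
fn byte_lines(data: &[u8]) -> Vec<&[u8]> {
    if data.is_empty() {
        return Vec::new();
    }
    // A final `\n` terminates the last line; it does not start a new one.
    let body = data.strip_suffix(b"\n").unwrap_or(data);
    body.split(|&b| b == b'\n')
        .map(|line| line.strip_suffix(b"\r").unwrap_or(line))
        .collect()
}

fn main() {
    // Valid UTF-8: the standard library iterator handles `\n` and `\r\n`.
    let text = "test1\r\ntest2\ntest3";
    for line in text.lines() {
        println!("{}", line);
    }

    // Possibly non-UTF-8 input: the bytes inside each line stay untouched.
    let raw: &[u8] = b"test1\r\ntest2\n\xE9t\xE9";
    for line in byte_lines(raw) {
        println!("{:?}", line);
    }
}
```

Because the byte version only looks for the line-ending bytes, each returned slice can be passed on exactly as it appeared in the input.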

But like I said, you haven't actually explained the problem you're trying to solve. So I don't really know the answers to your questions.

bobi6666 · 2022-08-17T13:29:21Z

bobi6666
Aug 17, 2022
Author

The point is that I would use this in an older game that only supports ANSI. So I just need to split the file on line endings and then return each string to the player exactly as it was, so that its ANSI structure is not broken while I process it. In other words, I would split on line endings but would not check what is in the string before the player gets it back in the game.

0 replies
