regex::bytes::escape #451

RReverser · 2018-02-23T11:31:37Z

Would it make sense to provide regex::bytes::escape similar to regex::escape but so that it could accept any &[u8] not &str?

Currently the quick workaround is to manually collect bytes into a regex string using write!(s, "\\x{:02X}", b) but it would be nice to have a built-in method for that.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2018-02-23T11:37:13Z

Could you please describe your use case? Both bytes::Regex::new and Regex::new accept an &str, so regardless of which one you use, your pattern needs to be &str, not a &[u8].

RReverser · 2018-02-23T11:41:19Z

Yeah, exactly. I'm using regex::bytes::Regex to find arbitrary byte sequence in a larger byte sequence, and so I need to transform raw byte slice into a regex like (?-u:\x01\x02...) first - all I'm suggesting is to provide escape(&[u8]) -> String function that would do exactly that (I can send PR myself).

In my benchmarks such precompiled regex turns out to be ~4x faster than native libc memmem on macOS and ~2x faster than https://docs.rs/memmem/0.1.1/memmem/ crate, so, given that I already have regex as a dependency anyway, it makes sense to use it instead of these alternative solutions :)

RReverser · 2018-02-23T11:45:32Z

FWIW this is what my current implementation looks like:

fn escape(b: &[u8]) -> String {
	let mut regex_str = String::with_capacity(6 + b.len() * 4);
	regex_str += "(?-u:";
	for b in b {
		write!(regex_str, "\\x{:02X}", b).unwrap();
	}
	regex_str += ")";
	return regex_str;
}

It could be somewhat optimised to not escape valid ASCII chars that can appear in regex, but such representation is even better for {:?} output, and it doesn't make any difference in execution speed.

BurntSushi · 2018-02-23T11:47:51Z

Oh I see. Generally escape is meant for escaping regex meta characters, but I can see how a natural extension of that is to escape raw bytes as well. It's a good use case. I'm not quite convinced it deserves a public API item in this crate, and moreover, its semantics need to be clearer. Should it also escape meta characters? Seems like it should. Overall, my answer is "I don't know," so perhaps we should just let this idea bake.

In my benchmarks such precompiled regex turns out to be ~4x faster than native libc memmem on macOS and ~2x faster than https://docs.rs/memmem/0.1.1/memmem/ crate, so, given that I already have regex as a dependency anyway, it makes sense to use it instead of these alternative solutions :)

That is interesting. Typically the speed comes from careful selection of a "rare" byte to give to memchr. Make sure your benchmarks give you sufficient coverage, because if they don't, you might have cases where a regex is slower than memmem. That is, memmem might be slower in some cases than regex, but it is likely much more consistent overall. Alas, there are no general principles here, since it depends a lot on your pattern and corpus.

RReverser · 2018-02-23T11:52:47Z

Make sure your benchmarks give you sufficient coverage, because if they don't, you might have cases where a regex is slower than memmem.

I tried on several small (<20 bytes) as well as large (~6.2MB) inputs, although in both cases my needle is relatively small compared to input, and seen pretty much consistent 4x win. Maybe part of it comes from calling into extern function as opposed to inlinable in libc case, but it's still a win.

Should it also escape meta characters?

Do you mean same as what I said in

It could be somewhat optimised to not escape valid ASCII chars that can appear in regex

or something different? \xHH escape feels natural for bytes, but I'm not strictly opposed to other representation.

Do you mind me sending a PR with such generic byte escape for now, and documenting that exact escaping rules are subject to change? Then you'd have flexibility to change it if needed.

BurntSushi · 2018-02-23T11:56:54Z

I think these questions are a good reason why this function might not be a good fit. You're right, of course, because if I have the pattern b"a*" and I call bytes::escape, then one valid output is \x61\x2A, and that would accomplish the goal of escaping all meta characters as well. But I can easily see someone wanting the pattern "a\*" as well.

I guess I don't mind if you submit a PR, but I generally don't like having long standing open PRs, so I may close it eventually. If that's OK with you, then feel free to submit it!

RReverser · 2018-02-23T11:58:19Z

I meant sending a PR that would be accepted as a solution for now until someone says they need a different representation 😄

BurntSushi · 2018-02-23T12:06:57Z

@RReverser I'd prefer to hold off for now.

RReverser · 2018-02-23T12:07:54Z

Okay. Do you want to keep the issue open to see if others have thoughts on this?

BurntSushi · 2018-02-23T12:14:41Z

Definitely.

alecmocatta · 2018-06-21T12:03:41Z

FWIW, regex::bytes::escape would be useful for my use case also. As said above, escaping every byte as \xHH is perfectly workable. Given my use case exposes some such strings to the programmer in debug output, it would be nice to unescape where possible.

i.e. desired behaviour is regex::escape(&String::from_utf8_lossy(bytes)) but rather than the unicode replacement char in the output, an \xHH escape code.

BurntSushi · 2018-06-21T12:16:53Z

i.e. desired behaviour is regex::escape(&String::from_utf8_lossy(bytes)) but rather than the unicode replacement char in the output, an \xHH escape code.

I think I agree with this, in that, this seems like the kind of behavior I might expect. @RReverser What do you think?

Also, I'm not quite sure I want to call this escape, since it has slightly different semantics than the existing escape method. Namely, in addition to escaping meta characters, it's goal is to also create a human readable string out of arbitrary bytes. Maybe, escape_visible? Or escape_hex?

The other concern I have here is that we're trying to bundle two distinct and orthogonal concerns into a single function:

I want to escape the meta characters in my pattern such that it becomes a literal.
I want to escape non-UTF-8 bytes in my pattern such that I can print them in a way that isn't visually unappealing.

So, what happens if someone wants (2) but not (1)? Should we expose yet another function?

alecmocatta · 2018-06-21T12:48:31Z

For avoidance of doubt in my previous post, this is simplified but representative of my use case:

let (a,b,c): (Vec<u8>,Vec<u8>,Vec<u8>) = ...;
let (a,b,c) = (bytes::escape(a),bytes::escape(b),bytes::escape(c));
bytes::RegexBuilder(format!("{} ({}|{})", a, b, c).unicode(false).build()

Namely, in addition to escaping meta characters, it's goal is to also create a human readable string out of arbitrary bytes.

I'm not sure I see the difference from plain regex::escape? That could hex escape every character, but it doesn't as it doesn't need to and it would be less readable.

Similarly bytes::escape could hex escape every character, but again it shouldn't as it doesn't need to and it would be less readable.

So, what happens if someone wants (2) but not (1)? Should we expose yet another function?

This introduces "meta bytes" in [u8] patterns, as opposed to the "meta characters" that exist in str patterns currently. If there's desire for that, I think a bytes::Regex::new for &[u8] is the more appropriate place for that, rather than shoehorning it into an "escape" function.

BurntSushi · 2018-06-21T12:57:15Z

@alecmocatta We're talking way past each other. I'm not sure how to reconcile it, so I'll be brief:

A Regex::new for &[u8] patterns is completely unrelated.
Someone might want the raw bytes \xFF+ to be translated to r"\xFF+", but the semantics of bytes::escape you've proscribed will do r"\xFF\+".
You haven't acknowledged my claim about two orthogonal concerns. There is absolutely nothing about the existing escape function that does anything related to readability. This issue is asking for a new function called escape that does what the existing function does, but also optimizes for readability.

RReverser · 2018-06-22T12:49:51Z

I'm fine with either proposed solution; for generic bytes I do think that hex representation makes more sense, but I can also see cases where a pretty-printed output might be useful for debugging ASCII strings etc.

RReverser · 2018-07-24T19:39:47Z

In my benchmarks such precompiled regex turns out to be ~4x faster than native libc memmem on macOS and ~2x faster than https://docs.rs/memmem/0.1.1/memmem/ crate

FWIW I found the reason of that difference later and it's coming from the fact that TwoWaySearcher needs some initialisation which can be cached and reused between runs. This is not something provided by OS-provided memmem, but it is provided by the Rust crate, and after doing just that I got same matching number as I was getting with regex, except the initialisation required is, of course, much faster now.

That said, I still think regex::bytes::escape would make sense from API parity perspective.

bennofs · 2022-04-21T14:28:37Z

What about naming this function bytes::literal? Not using escape sequences for unicode would be fine, but not required for my usecase. I think it is quite common (and also matches the usecase described here) that one wants to embed a byte sequence as literal in a regex. In that case, not using escape sequences for unicode characters is a nice-to-have (makes the regex more readable if you want to debug) but not necessary feature.

BurntSushi · 2022-04-21T14:34:28Z

I don't think this is a common use case.

It still seems to me like this function is pretty easy to write if you need it, and you can give it exactly the semantics you want by doing so.

BurntSushi · 2023-03-06T18:38:33Z

I am going to close this issue for now. It feels to me like the design space is a little too big and the use case is a little too niche. I'd rather see folks write their own version of this function if they need it.

BurntSushi added the enhancement label Feb 23, 2018

BurntSushi mentioned this issue Dec 5, 2019

implement matching on byte strings #85

Closed

BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2023

BurntSushi added the question label Mar 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regex::bytes::escape #451

regex::bytes::escape #451

RReverser commented Feb 23, 2018

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018

RReverser commented Feb 23, 2018 •

edited

Loading

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018 •

edited

Loading

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018

BurntSushi commented Feb 23, 2018

alecmocatta commented Jun 21, 2018

BurntSushi commented Jun 21, 2018

alecmocatta commented Jun 21, 2018

BurntSushi commented Jun 21, 2018

RReverser commented Jun 22, 2018

RReverser commented Jul 24, 2018

bennofs commented Apr 21, 2022

BurntSushi commented Apr 21, 2022

BurntSushi commented Mar 6, 2023

regex::bytes::escape #451

regex::bytes::escape #451

Comments

RReverser commented Feb 23, 2018

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018

RReverser commented Feb 23, 2018 • edited Loading

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018 • edited Loading

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018

BurntSushi commented Feb 23, 2018

RReverser commented Feb 23, 2018

BurntSushi commented Feb 23, 2018

alecmocatta commented Jun 21, 2018

BurntSushi commented Jun 21, 2018

alecmocatta commented Jun 21, 2018

BurntSushi commented Jun 21, 2018

RReverser commented Jun 22, 2018

RReverser commented Jul 24, 2018

bennofs commented Apr 21, 2022

BurntSushi commented Apr 21, 2022

BurntSushi commented Mar 6, 2023

RReverser commented Feb 23, 2018 •

edited

Loading

RReverser commented Feb 23, 2018 •

edited

Loading