Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex::bytes::escape #451

Closed
RReverser opened this issue Feb 23, 2018 · 19 comments
Closed

regex::bytes::escape #451

RReverser opened this issue Feb 23, 2018 · 19 comments

Comments

@RReverser
Copy link
Contributor

Would it make sense to provide regex::bytes::escape similar to regex::escape but so that it could accept any &[u8] not &str?

Currently the quick workaround is to manually collect bytes into a regex string using write!(s, "\\x{:02X}", b) but it would be nice to have a built-in method for that.

@BurntSushi
Copy link
Member

Could you please describe your use case? Both bytes::Regex::new and Regex::new accept an &str, so regardless of which one you use, your pattern needs to be &str, not a &[u8].

@RReverser
Copy link
Contributor Author

Yeah, exactly. I'm using regex::bytes::Regex to find arbitrary byte sequence in a larger byte sequence, and so I need to transform raw byte slice into a regex like (?-u:\x01\x02...) first - all I'm suggesting is to provide escape(&[u8]) -> String function that would do exactly that (I can send PR myself).

In my benchmarks such precompiled regex turns out to be ~4x faster than native libc memmem on macOS and ~2x faster than https://docs.rs/memmem/0.1.1/memmem/ crate, so, given that I already have regex as a dependency anyway, it makes sense to use it instead of these alternative solutions :)

@RReverser
Copy link
Contributor Author

RReverser commented Feb 23, 2018

FWIW this is what my current implementation looks like:

fn escape(b: &[u8]) -> String {
	let mut regex_str = String::with_capacity(6 + b.len() * 4);
	regex_str += "(?-u:";
	for b in b {
		write!(regex_str, "\\x{:02X}", b).unwrap();
	}
	regex_str += ")";
	return regex_str;
}

It could be somewhat optimised to not escape valid ASCII chars that can appear in regex, but such representation is even better for {:?} output, and it doesn't make any difference in execution speed.

@BurntSushi
Copy link
Member

Oh I see. Generally escape is meant for escaping regex meta characters, but I can see how a natural extension of that is to escape raw bytes as well. It's a good use case. I'm not quite convinced it deserves a public API item in this crate, and moreover, its semantics need to be clearer. Should it also escape meta characters? Seems like it should. Overall, my answer is "I don't know," so perhaps we should just let this idea bake.

In my benchmarks such precompiled regex turns out to be ~4x faster than native libc memmem on macOS and ~2x faster than https://docs.rs/memmem/0.1.1/memmem/ crate, so, given that I already have regex as a dependency anyway, it makes sense to use it instead of these alternative solutions :)

That is interesting. Typically the speed comes from careful selection of a "rare" byte to give to memchr. Make sure your benchmarks give you sufficient coverage, because if they don't, you might have cases where a regex is slower than memmem. That is, memmem might be slower in some cases than regex, but it is likely much more consistent overall. Alas, there are no general principles here, since it depends a lot on your pattern and corpus.

@RReverser
Copy link
Contributor Author

RReverser commented Feb 23, 2018

Make sure your benchmarks give you sufficient coverage, because if they don't, you might have cases where a regex is slower than memmem.

I tried on several small (<20 bytes) as well as large (~6.2MB) inputs, although in both cases my needle is relatively small compared to input, and seen pretty much consistent 4x win. Maybe part of it comes from calling into extern function as opposed to inlinable in libc case, but it's still a win.


Should it also escape meta characters?

Do you mean same as what I said in

It could be somewhat optimised to not escape valid ASCII chars that can appear in regex

or something different? \xHH escape feels natural for bytes, but I'm not strictly opposed to other representation.


Do you mind me sending a PR with such generic byte escape for now, and documenting that exact escaping rules are subject to change? Then you'd have flexibility to change it if needed.

@BurntSushi
Copy link
Member

I think these questions are a good reason why this function might not be a good fit. You're right, of course, because if I have the pattern b"a*" and I call bytes::escape, then one valid output is \x61\x2A, and that would accomplish the goal of escaping all meta characters as well. But I can easily see someone wanting the pattern "a\*" as well.

I guess I don't mind if you submit a PR, but I generally don't like having long standing open PRs, so I may close it eventually. If that's OK with you, then feel free to submit it!

@RReverser
Copy link
Contributor Author

I meant sending a PR that would be accepted as a solution for now until someone says they need a different representation 😄

@BurntSushi
Copy link
Member

@RReverser I'd prefer to hold off for now.

@RReverser
Copy link
Contributor Author

Okay. Do you want to keep the issue open to see if others have thoughts on this?

@BurntSushi
Copy link
Member

Definitely.

@alecmocatta
Copy link

FWIW, regex::bytes::escape would be useful for my use case also. As said above, escaping every byte as \xHH is perfectly workable. Given my use case exposes some such strings to the programmer in debug output, it would be nice to unescape where possible.

i.e. desired behaviour is regex::escape(&String::from_utf8_lossy(bytes)) but rather than the unicode replacement char in the output, an \xHH escape code.

@BurntSushi
Copy link
Member

i.e. desired behaviour is regex::escape(&String::from_utf8_lossy(bytes)) but rather than the unicode replacement char in the output, an \xHH escape code.

I think I agree with this, in that, this seems like the kind of behavior I might expect. @RReverser What do you think?

Also, I'm not quite sure I want to call this escape, since it has slightly different semantics than the existing escape method. Namely, in addition to escaping meta characters, it's goal is to also create a human readable string out of arbitrary bytes. Maybe, escape_visible? Or escape_hex?

The other concern I have here is that we're trying to bundle two distinct and orthogonal concerns into a single function:

  1. I want to escape the meta characters in my pattern such that it becomes a literal.
  2. I want to escape non-UTF-8 bytes in my pattern such that I can print them in a way that isn't visually unappealing.

So, what happens if someone wants (2) but not (1)? Should we expose yet another function?

@alecmocatta
Copy link

For avoidance of doubt in my previous post, this is simplified but representative of my use case:

let (a,b,c): (Vec<u8>,Vec<u8>,Vec<u8>) = ...;
let (a,b,c) = (bytes::escape(a),bytes::escape(b),bytes::escape(c));
bytes::RegexBuilder(format!("{} ({}|{})", a, b, c).unicode(false).build()

Namely, in addition to escaping meta characters, it's goal is to also create a human readable string out of arbitrary bytes.

I'm not sure I see the difference from plain regex::escape? That could hex escape every character, but it doesn't as it doesn't need to and it would be less readable.

Similarly bytes::escape could hex escape every character, but again it shouldn't as it doesn't need to and it would be less readable.

So, what happens if someone wants (2) but not (1)? Should we expose yet another function?

This introduces "meta bytes" in [u8] patterns, as opposed to the "meta characters" that exist in str patterns currently. If there's desire for that, I think a bytes::Regex::new for &[u8] is the more appropriate place for that, rather than shoehorning it into an "escape" function.

@BurntSushi
Copy link
Member

@alecmocatta We're talking way past each other. I'm not sure how to reconcile it, so I'll be brief:

  • A Regex::new for &[u8] patterns is completely unrelated.
  • Someone might want the raw bytes \xFF+ to be translated to r"\xFF+", but the semantics of bytes::escape you've proscribed will do r"\xFF\+".
  • You haven't acknowledged my claim about two orthogonal concerns. There is absolutely nothing about the existing escape function that does anything related to readability. This issue is asking for a new function called escape that does what the existing function does, but also optimizes for readability.

@RReverser
Copy link
Contributor Author

I'm fine with either proposed solution; for generic bytes I do think that hex representation makes more sense, but I can also see cases where a pretty-printed output might be useful for debugging ASCII strings etc.

@RReverser
Copy link
Contributor Author

In my benchmarks such precompiled regex turns out to be ~4x faster than native libc memmem on macOS and ~2x faster than https://docs.rs/memmem/0.1.1/memmem/ crate

FWIW I found the reason of that difference later and it's coming from the fact that TwoWaySearcher needs some initialisation which can be cached and reused between runs. This is not something provided by OS-provided memmem, but it is provided by the Rust crate, and after doing just that I got same matching number as I was getting with regex, except the initialisation required is, of course, much faster now.

That said, I still think regex::bytes::escape would make sense from API parity perspective.

@bennofs
Copy link

bennofs commented Apr 21, 2022

What about naming this function bytes::literal? Not using escape sequences for unicode would be fine, but not required for my usecase. I think it is quite common (and also matches the usecase described here) that one wants to embed a byte sequence as literal in a regex. In that case, not using escape sequences for unicode characters is a nice-to-have (makes the regex more readable if you want to debug) but not necessary feature.

@BurntSushi
Copy link
Member

I don't think this is a common use case.

It still seems to me like this function is pretty easy to write if you need it, and you can give it exactly the semantics you want by doing so.

@BurntSushi
Copy link
Member

I am going to close this issue for now. It feels to me like the design space is a little too big and the use case is a little too niche. I'd rather see folks write their own version of this function if they need it.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants