execute a regex on text streams #425
Comments
Thanks for taking the time to write this up! I'm afraid the issue tracker (and the issue you linked) isn't entirely up to date with my opinion on this particular feature. Basically, I agree that this would be useful to have. API design is certainly one aspect of this that needs to be solved, and I thank you for spending time on that. However, the API design is just the surface. The far more difficult aspect of providing a feature like this is the actual implementation work required. My estimate of the level of difficulty is roughly "a rewrite of every single regex engine." That's about as bad as it gets, and it's why this particular feature hasn't happened yet.

The problem is that the assumption that a regex engine has access to the entire buffer on which it can match is thoroughly baked in across all of the matching routines. For example, as a trivially simple thing to understand (but far from the only issue), consider that none of the existing engines have the ability to pause execution of a match at an arbitrary point and then pick it back up again. Such a thing is necessary for a streaming match mode. (There is an interesting compromise point here. For example, the regex crate could accept an …)

Getting back to API design, I think we would want to take a very strong look at Hyperscan, whose documentation can be found here. This particular regex library is probably the fastest in the world, is built on finite automata, and specifically specializes in matching regexes on streams. You'll notice that an interesting knob on their API is the setting of whether to report the start of a match or not. This is important and particularly relevant to the regex crate. In particular, when running the DFA (the fastest regex engine), it is not possible to determine both the start and end of a match in a single pass in the general case. The first pass searches forwards through the string for the end of the match.
The second pass then uses a reversed DFA and searches backwards in the text from the end of the match to find the start of the match. Cool, right? It unfortunately requires that the entire match be buffered in memory. That throws a wrench in things, because now you need to handle the case of "what happens when a match is 2GB?" There's no universal right answer, so you need to expose knobs like Hyperscan does.

The start-of-match stuff throws a wrench in my idea above as well. Even if the regex crate provided an API to search an …
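The two-pass technique can be illustrated with a toy matcher. This is a hand-rolled sketch for the single pattern `a+b`, not the regex crate's actual DFA: the forward pass only reports where the leftmost match *ends*, and a second pass with the reversed pattern `ba+` walks backwards from that end to recover the start.

```rust
/// Forward DFA for `a+b`: returns the end offset (exclusive) of the
/// leftmost match, but says nothing about where it started.
fn find_end_forward(text: &[u8]) -> Option<usize> {
    // States: 0 = start, 1 = inside a run of 'a's.
    let mut state = 0u8;
    for (i, &b) in text.iter().enumerate() {
        state = match (state, b) {
            (_, b'a') => 1,
            (1, b'b') => return Some(i + 1),
            _ => 0,
        };
    }
    None
}

/// Reverse DFA for the reversed pattern `ba+`: scans backwards from `end`
/// to find where the match started.
fn find_start_backward(text: &[u8], end: usize) -> usize {
    debug_assert_eq!(text[end - 1], b'b');
    let mut start = end - 1;
    while start > 0 && text[start - 1] == b'a' {
        start -= 1;
    }
    start
}

fn main() {
    let text = b"xxaaab yy";
    let end = find_end_forward(text).expect("no match");
    // Note: this pass indexes freely back into `text`, which is exactly
    // the "entire match must be buffered" constraint described above.
    let start = find_start_backward(text, end);
    assert_eq!((start, end), (2, 6)); // the match is "aaab"
}
```

The point of the sketch is the buffering constraint: both passes need random access to `text[start..end]`, so in a streaming setting everything between the start and end of a match would have to be held in memory.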
You can also see that RE2 doesn't have such an API either, mostly for the reasons I've already outlined here. That issue does note another problem, in particular managing the state of the DFA, but that just falls under the bigger problem of "how do you pause/resume each regex engine" that I mentioned above. For Go's regexp engine, there is someone working on adding a DFA, and even in their work, they defer to the NFA when given a rune reader to use with the DFA. In other words, even with the DFA, that particular implementation chooses (2).

What am I trying to say here? What I'm saying is that a number of people who specialize in this area (including myself) have come to the same conclusion: the feature you're asking for is freaking hard to implement at multiple levels, at least in a fast finite automaton based regex engine. So this isn't really something we can just think ourselves out of. The paths available are known; they are just a ton of work.

With that said, this is something that I think would be very cool to have. My plans for the future of regex revolve around one goal: "make it easier to add and maintain new optimizations." This leaves me yearning for a more refined internal architecture, which also incidentally amounts to a rewrite. In the course of doing this, my plan was to re-evaluate the streaming feature and see if I could figure out how to do it. However, unfortunately, the time scale on this project is probably best measured in years, so it won't help you any time soon. If you need this feature yesterday, then this is the best thing I can come up with, if you're intent on sticking with pure Rust:

Basically, this amounts to "make the same trade-off as Go." You could lobby me to make this a feature of the regex crate (with the warning that it will always run slowly), but I'm not particularly inclined to do that because it is still quite a bit of work, and I'd rather hold off on adding such things until I more thoroughly understand the problem domain. It's basically a matter of priorities. I don't want to spend a couple months of my time adding a feature that has known bad performance, with no easy route to fixing it. Apologies for the bad news!
cessen commented Dec 4, 2017
@BurntSushi I think I was also perhaps misled by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex (As an aside, it seems slightly strange to me that the Rust stdlib doesn't have a bidirectional iterator trait. It does have double-ended, but that's different. Maybe I should write up a Rust RFC.)

As I said before, my own use-case is not urgent. It's only for a toy project anyway. No one is going to die or lose their job. ;-) But I appreciate your "if you need this yesterday" explanation. I may end up doing that at some point, if/when I get around to adding regex to my editor.
Ah! I feel quite silly now! I didn't notice that ticket. Apologies. Having said that, I think this ticket actually covers a superset of #386. #386 is essentially just API 2 in this write-up, and doesn't cover the streaming case (API 1). And although my personal use-case only needs API 2, I was intentionally trying to cover the wider range of use-cases covered in #25. So I guess it depends on how relevant you feel the streaming use-case is? In any case, I won't be at all offended if you close this as a duplicate. :-)
BurntSushi referenced this issue on Dec 4, 2017: Feature request: support searching an Iterator<u8> #386 (Closed)
Sounds good to me! I closed out #386. :-)
BurntSushi changed the title from "Regex on text streams - Take 2" to "execute a regex on text streams" on Dec 4, 2017
BurntSushi added the enhancement and question labels on Dec 4, 2017
AlbertoGP commented Dec 6, 2017 (edited)
That was a greatly useful explanation for me. I'm writing a tool that needs string matching, although not general regexes, and have been looking at three alternatives:

The ideal for me would be a RegexSet that would give me not just the indexes of the matched patterns but also their (start, end) indexes, or rather the (start, end) of the first match, and that could be paused/restarted to feed it the input in blocks, not all at once. I see now why all that is not possible at the moment, and even if implemented it would be slower than what we already have. Thanks!
I didn't respond to this before, but to be clear: C++'s standard library regex engine is backtracking AFAIK, which has a completely different set of trade-offs in this space. If you asked me to do this feature request in a backtracking engine, my first thought to you would be "forget everything I just said." :)
cessen commented Jan 14, 2018
@BurntSushi Also, looking at …

Although I don't think I have the time/energy for it right now (I'm putting most of my spare time into Ropey), if I get around to it, would you be offended if I took a crack at creating a separate crate to accommodate the use-cases outlined in this issue, using just the slower NFA engine from this crate? My intent in doing so would be twofold:

This would all be with the idea that these use-cases would be folded back into this crate whenever you get the time to tackle your bigger plans for it, and my fork would then be deprecated. But it would give a stop-gap solution in the meantime.
@cessen I would not be offended. :) In fact, I suggested that exact idea above, I think. You are right, of course, that the PikeVM is more amenable to this change. The bounded backtracker might also be capable of handling it, but I haven't thought about it.
cessen commented Jan 14, 2018
Yup! I was double-checking that you were okay with it being a published crate, rather than just an in-project-repo sort of thing. Thanks much! If I get started on this at some point, I'll post here.
cessen commented Jan 18, 2018
I've started poking at this in https://github.com/cessen/streaming_regex

So far I've ripped out everything except the PikeVM engine, and also ripped out the byte regex. My goal is to get the simple case working, and then I can start adding things back in (like the literal matcher, byte regex, etc.). Now that I've gotten to this point, it's really obvious what you meant, @BurntSushi, regarding everything needing to be re-done to make this possible!

The first thing I'm going to try is to rework the PikeVM engine so that it incrementally takes a byte at a time as input. I think this can work even in the Unicode case by having a small four-byte buffer to collect the bytes of Unicode scalar values, and only execute regex instructions once a full value has been collected. Hopefully that can be done relatively efficiently. Once that's done, building on that for incremental input will hopefully be relatively straightforward (fingers crossed).

Does that sound like a reasonable approach, @BurntSushi?
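The four-byte-buffer idea can be sketched in isolation. This is a simplified illustration, not code from the fork: bytes arrive one at a time, and a codepoint is only emitted (the point at which the PikeVM would execute one step) once a complete UTF-8 scalar value has accumulated. It assumes well-formed UTF-8 input.

```rust
/// How many bytes a UTF-8 sequence occupies, judged from its first byte.
/// (Assumes well-formed input; continuation bytes never appear first.)
fn utf8_len(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        _ => 4,
    }
}

/// Accumulates incoming bytes until they form one Unicode scalar value.
struct ScalarAssembler {
    buf: [u8; 4],
    len: usize,
}

impl ScalarAssembler {
    fn new() -> Self {
        ScalarAssembler { buf: [0; 4], len: 0 }
    }

    /// Push one input byte; returns a char once a full scalar is buffered.
    fn push(&mut self, b: u8) -> Option<char> {
        self.buf[self.len] = b;
        self.len += 1;
        if self.len == utf8_len(self.buf[0]) {
            let c = std::str::from_utf8(&self.buf[..self.len])
                .ok()
                .and_then(|s| s.chars().next());
            self.len = 0; // ready for the next scalar
            c
        } else {
            None // mid-sequence; wait for more bytes
        }
    }
}

fn main() {
    let mut asm = ScalarAssembler::new();
    let mut out = String::new();
    for &b in "aé☃".as_bytes() {
        if let Some(c) = asm.push(b) {
            out.push(c); // here the engine would run one step on `c`
        }
    }
    assert_eq!(out, "aé☃");
}
```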
@cessen Indeed it does! Exciting. :-)
cessen commented Jan 21, 2018
@BurntSushi …
I wouldn't bother honestly. I would do whatever is natural for you. I expect the current internals to be completely redone before something like streaming functionality could get merged. I'm happy to be a sounding board, of course!
cessen commented Jan 22, 2018 (edited)
Oh sure, I'm not expecting it to be mergeable. But I'm hoping that the architecture of this fork will be relevant to your rewrite in the future, so that any lessons learned can be applied. So I'd rather not go off in some direction that's completely tangential to (or fundamentally clashing with) what you have in mind for the future. If that makes sense?

(Edited to add the "not" before "expecting" above. Really bad typo...)
Yeah, I think so. I'm happy to give feedback as time permits. :)
cessen commented Jan 22, 2018
Awesome, thanks! And, of course, I don't mean to squeeze you for time. Apologies for coming off a bit presumptuous earlier; I didn't mean it that way. I think I'll post here, if that's okay. Let me know if you'd prefer elsewhere!
cessen commented Jan 22, 2018 (edited)
So, the first thing is a bit bigger-picture: I think we'll have a lot more wiggle-room to experiment with fast incremental scanning approaches if we keep the incremental APIs chunk-based rather than byte-based. This basically amounts to changing API 2 in my original proposal above to take an iterator over byte slices instead of an iterator over bytes. I think that still covers all of the real use-cases. (And if someone really needs to use a byte iterator with the API, they can do their own buffering to pass larger chunks.)

The relevance to what I'm doing right now is that I'm thinking of changing the `RegularExpression` trait to something like this:

```rust
pub trait RegularExpression {
    fn is_match_at(
        &self,
        chunk: &[u8],           // Analogous to `text` in your code
        offset_in_chunk: usize, // Analogous to `start` in your code
        is_last_chunk: bool,
    ) -> bool;

    fn find_at(
        &self,
        chunk: &[u8],
        offset_in_chunk: usize,
        is_last_chunk: bool,
    ) -> Option<(usize, usize)>;

    // etc...
}
```

Calling code would pass a chunk at a time, indicating via `is_last_chunk` when it has reached the end of the input. The iterator-based convenience methods would then be built on top of that:

```rust
pub trait RegularExpression {
    // The above stuff

    fn find_iter<I: Iterator<Item = &[u8]>>(
        self,
        text: I,
    ) -> Matches<Self> {
        // ...
    }

    // etc...
}
```

One of the nice things about this approach is that a single contiguous text is just a special case: you pass the whole text as one chunk, with `is_last_chunk` set to true. Using the DFA internally on contiguous text falls naturally out of this as well, since we can switch to the DFA when we know we're on the last chunk. And this gives us plenty of room to experiment with approaches to faster incremental scanning.

(Incidentally, the reason I'm focusing on the …)
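To make the chunk-at-a-time calling convention concrete, here is a hedged, std-only sketch (hypothetical types, not the proposed trait itself) of a matcher that keeps state across `feed` calls. It matches a fixed literal rather than a full regex, carrying over the last `needle.len() - 1` bytes so that matches straddling a chunk boundary are still found:

```rust
/// Streaming search for a fixed byte literal. `feed` is called once per
/// chunk, and match offsets are reported relative to the whole stream.
struct ChunkSearcher {
    needle: Vec<u8>,
    carry: Vec<u8>,  // trailing bytes that might begin a cross-chunk match
    consumed: usize, // stream offset of the first byte held in `carry`
}

impl ChunkSearcher {
    fn new(needle: &[u8]) -> Self {
        assert!(!needle.is_empty());
        ChunkSearcher { needle: needle.to_vec(), carry: Vec::new(), consumed: 0 }
    }

    /// Feed one chunk; returns (start, end) of the first match, if any.
    fn feed(&mut self, chunk: &[u8]) -> Option<(usize, usize)> {
        let n = self.needle.len();
        let mut buf = self.carry.clone();
        buf.extend_from_slice(chunk);
        if let Some(i) = buf.windows(n).position(|w| w == &self.needle[..]) {
            let start = self.consumed + i;
            return Some((start, start + n));
        }
        // Keep only the last n-1 bytes: nothing earlier can start a match.
        let keep_from = buf.len().saturating_sub(n - 1);
        self.consumed += keep_from;
        self.carry = buf.split_off(keep_from);
        None
    }
}

fn main() {
    // The needle straddles the chunk boundary, which a slice-only API
    // could only find after copying both chunks into one buffer.
    let mut s = ChunkSearcher::new(b"needle");
    assert_eq!(s.feed(b"hay nee"), None);
    assert_eq!(s.feed(b"dle hay"), Some((4, 10)));
}
```

A real engine would resume automaton state instead of re-scanning a carry buffer, but the interface shape (repeated `feed` calls, stream-relative offsets) is the same.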
@cessen Everything you said sounds pretty sensible to me. Here are some random thoughts:
cessen commented Jan 24, 2018
Thanks for the feedback!
cessen commented Jan 24, 2018 (edited)
Yes, that was my understanding as well. I didn't mean to imply that …
cessen commented Jan 24, 2018
Or, more specifically, if we reach the upper bound (if one is given), we can set `is_last_chunk` to true.

We would still have to do something like pass an empty chunk at the end when the upper bound isn't given, or when we reach the end before the upper bound.
cessen commented Jan 25, 2018
So, regardless of all of this, I think your suggestion to use a peekable iterator is a better idea. For some reason I thought not all iterators could be made peekable, but apparently they can. So that would be a great approach!
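The peekable approach is easy to sketch with std alone: `Peekable::peek` reveals whether another chunk follows, which is exactly what an `is_last_chunk`-style parameter needs (the function names here are hypothetical):

```rust
/// Drive a per-chunk callback, flagging the final chunk. `peek` looks one
/// item ahead without consuming it, so the last chunk is the one with
/// nothing left behind it.
fn for_each_chunk<'a, I>(chunks: I, mut f: impl FnMut(&'a [u8], bool))
where
    I: Iterator<Item = &'a [u8]>,
{
    let mut it = chunks.peekable();
    while let Some(chunk) = it.next() {
        let is_last_chunk = it.peek().is_none();
        f(chunk, is_last_chunk); // a streaming engine would match here
    }
}

fn main() {
    let chunks: Vec<&[u8]> = vec![b"ab", b"cd", b"ef"];
    let mut flags = Vec::new();
    for_each_chunk(chunks.into_iter(), |_, last| flags.push(last));
    assert_eq!(flags, vec![false, false, true]);
}
```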
hxtk referenced this issue on Feb 3, 2018: Fails to recognize delimiters that fall on the boundary between two buffers #4 (Open)
cessen commented Feb 5, 2018 (edited)
Quick status report: the PikeVM engine now takes a single token at a time, and doesn't access the … Next step is to get it to take just a single byte at a time.

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte-data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction). I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.

Edit: …
It does indeed. Regexes like …

The DFA is always byte-at-a-time.

:-)
cessen commented Feb 5, 2018

Oh, interesting. Is the …
Yes. And probably the backtracker. The Unicode instructions also use less space, but that is a specious argument, because if we removed them, we might be able to reduce the total number of copies of the program bytecode from 3 to 2. (I'm kind of just riffing here based off memory.)
benaryorg referenced this issue on Apr 11, 2018: performance optimisation of EOL matching using ReverseSearcher #463 (Closed)
BurntSushi referenced this issue on Jul 26, 2018: add -z/--null-data flag for reading large binary files #993 (Closed)
sanmai-NL commented Jul 27, 2018
@cessen: thanks for your great work! I was wondering how your project is progressing? Have you perhaps planned towards some milestone?
cessen commented Aug 1, 2018
@sanmai-NL I don't know when I'll pick it back up again, so feel free to take the work I've done so far and build on it. I think I stopped at a pretty reasonable point, having gotten the PikeVM to work incrementally one token at a time. The next steps are basically:

In fact, if/when I pick it back up again, I might reverse those steps anyway. It might be more satisfying to get something fully working first, and then generalize to byte regex after.
cessen commented Dec 3, 2017 (edited)
This is more-or-less a continuation of issue #25 (most of which is actually here).
Preface
I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.
More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex, or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work" than anything else.
And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such a staple of the Rust ecosystem. It's a great piece of software!
My use-case
I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)
Proposal
In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.
The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.
Layered API
I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:
1. Feed chunks of bytes manually, handling matches as we go
2. Give regex an iterator that yields bytes
3. Give regex a slice, just like the current API
I'm of course not suggesting naming schemes here, or even the precise way that these APIs should work. I'm just trying to illustrate the idea. :-)
Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.
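To illustrate how the three layers could stack, here is a hedged sketch with hypothetical names (the "regex" is just the literal `ab`, to keep the engine trivial): layer 1 is a feed-style matcher that holds its own state, layer 2 drives it from any iterator of chunks, and layer 3 treats a contiguous slice as a one-chunk stream.

```rust
// Layer 1: feed chunks manually; a stub engine that finds the literal "ab".
struct Matcher {
    pos: usize,        // absolute offset of the next input byte
    prev_was_a: bool,  // state carried across chunk boundaries
}

impl Matcher {
    fn new() -> Self {
        Matcher { pos: 0, prev_was_a: false }
    }

    fn feed(&mut self, chunk: &[u8]) -> Option<(usize, usize)> {
        for &b in chunk {
            self.pos += 1;
            if self.prev_was_a && b == b'b' {
                return Some((self.pos - 2, self.pos));
            }
            self.prev_was_a = b == b'a';
        }
        None
    }
}

// Layer 2: drive the matcher from an iterator of byte slices.
fn find_in_chunks<'a, I>(chunks: I) -> Option<(usize, usize)>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut m = Matcher::new();
    chunks.into_iter().find_map(|c| m.feed(c))
}

// Layer 3: a contiguous slice is just a one-chunk stream.
fn find(text: &[u8]) -> Option<(usize, usize)> {
    find_in_chunks(std::iter::once(text))
}

fn main() {
    // A match straddling two chunks (layer 2) and the slice API (layer 3).
    assert_eq!(find_in_chunks([&b"xa"[..], b"bx"]), Some((1, 3)));
    assert_eq!(find(b"xabx"), Some((1, 3)));
}
```

Only byte offsets come back, matching the proposal: the caller uses those offsets against its own storage (rope, mmap, etc.) to retrieve the matched text.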
Things this doesn't address
BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):
This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather is a fundamental problem with unbounded streaming searches.
IMO, it doesn't make sense to keep this functionality out of the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.
As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:
That would allow client code to hold onto only the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regexes that match unbounded amounts of data.
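The original example here did not survive the transcript, so as a hedged sketch of the same idea (hypothetical API; the "engine" below is hard-wired to the literal pattern `abc`): after each chunk the engine reports how many trailing bytes could still be the prefix of a future match, and the caller discards everything earlier.

```rust
/// Stand-in for a streaming engine, hard-wired to the pattern "abc".
struct Engine;

impl Engine {
    /// Scan the caller's buffer. Returns (matched, keep): whether a full
    /// match is present, and how many trailing bytes might still be the
    /// prefix of a future match (here at most 2: "a" or "ab").
    fn feed(&mut self, buffered: &[u8]) -> (bool, usize) {
        let matched = buffered.windows(3).any(|w| w == b"abc");
        let keep = if buffered.ends_with(b"ab") {
            2
        } else if buffered.ends_with(b"a") {
            1
        } else {
            0
        };
        (matched, keep)
    }
}

fn main() {
    let mut engine = Engine;
    let mut buf: Vec<u8> = Vec::new();
    let mut found = false;
    for chunk in [&b"xxa"[..], b"bxa", b"bcx"] {
        buf.extend_from_slice(chunk);
        let (matched, keep) = engine.feed(&buf);
        found |= matched;
        // Trim: bytes before the longest potential match can't matter again.
        buf.drain(..buf.len() - keep);
        assert!(buf.len() <= 2); // the caller's memory use stays bounded
    }
    assert!(found); // "abc" straddled the second and third chunks
}
```

This is the mitigation described above: the buffer stays as small as the longest *potential* match, but a pattern that can match unbounded input still forces unbounded buffering.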