
execute a regex on text streams #425

Open
cessen opened this Issue Dec 3, 2017 · 29 comments

@cessen

cessen commented Dec 3, 2017

This is more-or-less a continuation of issue #25 (most of which is actually here).

Preface

I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.

More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex--or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work", than anything else.

And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such a staple of the Rust ecosystem. It's a great piece of software!

My use-case

I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)

Proposal

In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.

The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.

Layered API

I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:

1. Feed chunks of bytes manually, handling matches as we go

let re = Regex::new("...").unwrap();
let mut matcher = re.streaming_matcher();

for m in matcher.consume("Some bytes from a larger stream of data that") {
    my_data_source.do_something(m.start(), m.end());
    // Note: there is no as_str() method for this API.
}

for m in matcher.consume(" we don't know how much there might be in total.") {
    // ...
}

2. Give regex an iterator that yields bytes

let re = Regex::new("...").unwrap();

for m in re.find_iter_from_bytes_iter("Non-contiguous bytes that can be iterated over.".bytes()) {
    // Again, only m.start() and m.end() for this API.
    // ...
}

3. Give regex a slice, just like the current API

let re = Regex::new("...").unwrap();

for m in re.find_iter("A directly passed slice of data.") {
    m.as_str(); // In this API you can get the str slice of the match.
    // ...
}

I'm of course not suggesting naming schemes here, or even the precise way that these API's should work. I'm just trying to illustrate the idea. :-)

Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.

Things this doesn't address

BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):

The most interesting bit there is that the regexp functions on readers find precisely one match and return byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as it's being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again.

The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream).

This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather is a fundamental problem with unbounded streaming searches.

IMO, it doesn't make sense to keep this functionality out of the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.

As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:

// ...
let mut matcher = re.streaming_matcher();
let mut buffer = String::new();
let mut offset = 0;

loop {
    // Get next chunk of streaming data
    buffer.push_str(streaming_data_source.get_data());

    // Handle matches
    for m in matcher.consume(&buffer) {
        let match_contents = &buffer[(m.start() - offset)..(m.end() - offset)];
        // ...
    }

    // Shrink buffer to the smallest size that is still guaranteed to contain
    // all data for potential future matches. (`truncate_from_front` is a
    // hypothetical method here.)
    let pot_match_len = matcher.longest_potential_match_len();
    offset += buffer.len() - pot_match_len;
    buffer.truncate_from_front(pot_match_len);
}

That would allow client code to at least only hold onto the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regexes that match unbounded amounts of data.

@BurntSushi

Member

BurntSushi commented Dec 3, 2017

Thanks for taking the time to write this up!

I'm afraid the issue tracker (and the issue you linked) isn't entirely up to date with my opinion on this particular feature. Basically, I agree that this would be useful to have. API design is certainly one aspect of this that needs to be solved, and I thank you for spending time on that. However, the API design is just the surface. The far more difficult aspect of providing a feature like this is the actual implementation work required. My estimation of the level of difficulty for this is roughly "a rewrite of every single regex engine." That's about as bad as it gets, and it's why this particular feature hasn't happened yet.

The problem is that the assumption that a regex engine has access to the entire buffer on which it can match is thoroughly baked in across all of the matching routines. For example, as a trivially simple thing to understand (but far from the only issue), consider that none of the existing engines have the ability to pause execution of a match at an arbitrary point, and then pick it back up again. Such a thing is necessary for a streaming match mode.

(There is an interesting compromise point here. For example, the regex crate could accept an R: io::Read and simply search that until it finds a match. This would avoid needing the aforementioned pause/resume functionality, and I think it could be added without rewriting the regex crate, but it would be a significant complication IMO. However, there are problems with this approach; see below.)

Getting back to API design, I think we would want to take a very strong look at Hyperscan, whose documentation can be found here. This particular regex library is probably the fastest in the world, is built on finite automata, and specifically specializes in matching regexes on streams. You'll notice that an interesting knob on their API is the setting of whether to report the start of a match or not. This is important and particularly relevant to the regex crate.

In particular, when running the DFA (the fastest regex engine), it is not possible to determine both the start and end of a match in a single pass in the general case. The first pass searches forwards through the string for the end of the match. The second pass then uses a reversed DFA and searches backwards in the text from the end of the match to find the start of the match. Cool, right? And it unfortunately requires that the entire match be buffered in memory. That throws a wrench in things, because now you need to handle the case of "what happens when a match is 2GB?" There's no universal right answer, so you need to expose knobs like Hyperscan does.
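To make the forward/reverse two-pass idea concrete, here is a hand-rolled sketch for the fixed pattern `a+b`. This is purely illustrative: the real engine compiles an arbitrary regex to a DFA and a reversed DFA, but the shape of the two passes is the same.

```rust
// Forward pass: a tiny DFA that reports where a match of `a+b` ends.
// State 0 = no progress, state 1 = inside a run of 'a's.
fn find_match_end(text: &[u8]) -> Option<usize> {
    let mut state = 0;
    for (i, &byte) in text.iter().enumerate() {
        state = match (state, byte) {
            (1, b'b') => return Some(i + 1), // match ends just past this byte
            (_, b'a') => 1,
            _ => 0,
        };
    }
    None
}

// Reverse pass: run the reversed pattern `ba+` backwards from the end
// position to recover where the match started.
fn find_match_start(text: &[u8], end: usize) -> usize {
    let mut start = end - 1; // the 'b'
    while start > 0 && text[start - 1] == b'a' {
        start -= 1;
    }
    start
}

fn main() {
    let text = b"xxaaab";
    let end = find_match_end(text).unwrap();
    let start = find_match_start(text, end);
    assert_eq!((start, end), (2, 6)); // the match is "aaab"
}
```

Note how the reverse pass needs random access to everything between `start` and `end`, which is exactly why the whole match must stay buffered in a streaming setting.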

The start-of-match stuff throws a wrench in my idea above as well. Even if the regex crate provided an API to search an R: io::Read, it would in turn need to do one of the following things with respect to reporting match locations:

  1. Only provide end of match indices.
  2. Always provide start of match indices, at the cost of running a slower regex engine. (The NFA based engines can find the start and end of match in a single pass.) Note that this is the choice that Go's regex engine makes. Also note that "slow" here approximately means "an order of magnitude slower."
  3. Always provide start of match indices, at the cost of buffering the entire R: io::Read into memory. (Which kind of defeats the purpose of providing this API at all.)

You can also see that RE2 doesn't have such an API either, mostly for the reasons I've already outlined here. That issue does note another problem, in particular, managing the state of the DFA, but that just falls under the bigger problem of "how do you pause/resume each regex engine" that I mentioned above.

For Go's regexp engine, there is someone working on adding a DFA, and even in their work, they defer to the NFA when given a rune reader to use with the DFA. In other words, even with the DFA, that particular implementation chooses (2).

What am I trying to say here? What I'm saying is that a number of people who specialize in this area (including myself) have come to the same conclusion: the feature you're asking for is freaking hard to implement at multiple levels, at least in a fast finite automaton based regex engine. So this isn't really something that we can just think ourselves out of here. The paths available are known, they are just a ton of work.

With that said, this is something that I think would be very cool to have. My plans for the future of regex revolve around one goal: "make it easier to add and maintain new optimizations." This leaves me yearning for a more refined internal architecture, which also incidentally amounts to a rewrite. In the course of doing this, my plan was to re-evaluate the streaming feature and see if I could figure out how to do it. However, unfortunately, the time scale on this project is probably best measured in years, so it won't help you any time soon.

If you need this feature yesterday, then this is the best thing I can come up with, if you're intent on sticking with pure Rust:

  1. Convince yourself that you're OK with a slow regexp implementation.
  2. Fork the regex crate and remove everything except for the Pike VM. Remove the DFA, the literal optimizations and the backtracker. Make it pass the test suite.
  3. Modify the Pike VM to accept an R: io::Read.

Basically, this amounts to "make the same trade off as Go." You could lobby me to make this a feature of the regex crate (with the warning that it will always run slowly), but I'm not particularly inclined to do that because it is still quite a bit of work and I'd rather hold off on adding such things until I more thoroughly understand the problem domain. It's basically a matter of priorities. I don't want to spend a couple months of my time adding a feature that has known bad performance, with no easy route to fixing it.

Apologies for the bad news!

@BurntSushi

BurntSushi commented Dec 3, 2017

@cessen What do you think about closing this ticket as a duplicate of #386?

@cessen

cessen commented Dec 4, 2017

@BurntSushi
Wow, thanks so much for the thorough explanation! Indeed, I didn't realize there were so many other issues involved. The older discussion I looked at made it seem like it would mostly be easy, but it just wasn't clear how the API could work. So I appreciate you clarifying all this!

I think I was also perhaps misled by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex
It doesn't (as far as I can tell) support streaming, but it does take an iterator rather than a slice to do the search. But indeed the iterator is bidirectional.

(As an aside, it seems slightly strange to me that the Rust stdlib doesn't have a bidirectional iterator trait. It does have double-ended, but that's different. Maybe I should write up a Rust RFC.)

As I said before, my own use-case is not urgent. It's only for a toy project anyway. No one is going to die or lose their job. ;-) But I appreciate your "if you need this yesterday" explanation. I may end up doing that at some point, if/when I get around to adding regex to my editor.

@cessen What do you think about closing this ticket as a duplicate of #386?

Ah! I feel quite silly, now! I didn't notice that ticket. Apologies.

Having said that, I think this ticket actually covers a super-set of #386. #386 is essentially just API 2 in this write-up, and doesn't cover the streaming case (API 1). And although my personal use-case only needs API 2, I was intentionally trying to cover the wider range of use-cases covered in #25.

So I guess it depends on how relevant you feel the streaming use-case is?

In any case, I won't be at all offended if you close this as duplicate. :-)

@BurntSushi

BurntSushi commented Dec 4, 2017

Sounds good to me! I closed out #386. :-)

@BurntSushi BurntSushi changed the title Regex on text streams - Take 2 execute a regex on text streams Dec 4, 2017

@AlbertoGP

AlbertoGP commented Dec 6, 2017

That was a very useful explanation for me. I'm writing a tool that needs string matching, though not general regexes, and I have been looking at three alternatives:

  • use the regex crate
  • build on the aho_corasick crate (I know it's used inside regex)
  • fully custom pattern matcher (have done this before, which is why I'm first checking whether I could use yours instead for this specific application)

The ideal for me would be a RegexSet that would give me not just the indexes of matched patterns but also the (start,end) indexes, or rather the (start,end) of the first match, and that could be paused/restarted to feed it the input in blocks, not all at once.
[EDIT: just noticed that the "match first" RegexSet feature is already requested at #259]

I see now why all that is not possible at the moment, and even if implemented it would be slower than what we already have.

Thanks!

@BurntSushi

BurntSushi commented Dec 31, 2017

I think I was also perhaps misled by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex
It doesn't (as far as I can tell) support streaming, but it does take an iterator rather than a slice to do the search. But indeed the iterator is bidirectional.

I didn't respond to this before, but to be clear, C++'s standard library regex engine is backtracking AFAIK, which has a completely different set of trade offs in this space. If you asked me to do this feature request in a backtracking engine, my first thought to you would be "forget everything I just said." :)

@cessen

cessen commented Jan 14, 2018

@BurntSushi
Yeah, that totally makes sense. Thanks!

Also, looking at pikevm.rs, my initial reaction (though I may be misunderstanding some things) is that it seems like turning the main matching loop into something that can be paused/resumed wouldn't be too much work.

Although I don't think I have the time/energy for it right now (I'm putting most of my spare time into Ropey), if I get around to it would you be offended if I took a crack at creating a separate crate to accommodate the use-cases outlined in this issue, using just the slower NFA engine from this crate? My intent in doing so would be twofold:

  1. Give an immediate (even if not optimal) solution for people with these use-cases.
  2. Better explore the API design space.

This would all be with the idea that these use-cases would be folded back into this crate whenever you get the time to tackle your bigger plans for it, and my fork would be then deprecated. But it would give a stop-gap solution in the mean time.

@BurntSushi

BurntSushi commented Jan 14, 2018

@cessen I would not be offended. :) In fact, I suggested that exact idea above, I think. You are right, of course, that the pikevm is more amenable to this change. The bounded backtracker might also be capable of handling it, but I haven't thought about it.

@cessen

cessen commented Jan 14, 2018

Yup! I was double-checking that you were okay with it being a published crate, rather than just an in-project-repo sort of thing. Thanks much!

If I get started on this at some point, I'll post here.

@cessen

cessen commented Jan 18, 2018

I've started poking at this in https://github.com/cessen/streaming_regex

So far I've ripped out everything except the PikeVM engine, and also ripped out the byte regex. My goal is to get the simple case working, and then I can start adding things back in (like the literal matcher, byte regex, etc.).

Now that I've gotten to this point, it's really obvious what you meant @BurntSushi , regarding everything needing to be re-done to make this possible!

The first thing I'm going to try is to rework the PikeVM engine so that it incrementally takes a byte at a time as input. I think this can work even in the unicode case by having a small four-byte buffer to collect the bytes of unicode scalar values, and only execute regex instructions once a full value has been collected. Hopefully that can be done relatively efficiently.
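As a sketch of that four-byte buffering idea (the `ScalarBuf` name and shape are made up for illustration; this is not code from the fork):

```rust
/// Collects input one byte at a time and yields a `char` once a complete
/// UTF-8 scalar value has been accumulated. (Illustrative sketch only.)
struct ScalarBuf {
    buf: [u8; 4],
    len: usize,
}

impl ScalarBuf {
    fn new() -> Self {
        ScalarBuf { buf: [0; 4], len: 0 }
    }

    /// Push one byte; returns `Some(char)` when the buffered bytes form a
    /// complete scalar value, `None` if more bytes are needed.
    fn push(&mut self, byte: u8) -> Option<char> {
        self.buf[self.len] = byte;
        self.len += 1;
        match std::str::from_utf8(&self.buf[..self.len]) {
            Ok(s) => {
                self.len = 0;
                s.chars().next() // exactly one scalar value
            }
            // `error_len() == None` means "incomplete but possibly valid",
            // so keep waiting for more bytes.
            Err(e) if e.error_len().is_none() && self.len < 4 => None,
            // Invalid UTF-8: reset and surface a replacement character.
            Err(_) => {
                self.len = 0;
                Some(char::REPLACEMENT_CHARACTER)
            }
        }
    }
}

fn main() {
    let mut sb = ScalarBuf::new();
    let mut out = String::new();
    for &b in "héllo".as_bytes() {
        if let Some(c) = sb.push(b) {
            out.push(c);
        }
    }
    assert_eq!(out, "héllo");
}
```

The regex instructions would then only execute inside the `Some(char)` branch, so chunk boundaries falling mid-scalar-value are handled transparently.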

Once that's done, then building on that for incremental input will hopefully be relatively straight-forward (fingers crossed).

Does that sound like a reasonable approach, @BurntSushi ?

@BurntSushi

BurntSushi commented Jan 18, 2018

@cessen Indeed it does! Exciting. :-)

@cessen

cessen commented Jan 21, 2018

@BurntSushi
I'm getting to the point where I have some more solid API proposals, largely for internal API's at this point (e.g. the RegularExpression trait). Since I would like this work to be applicable to this crate at some point in the future (even if not directly pulled), I would love to discuss it with you. Should I do that here, or make an issue on my fork and ping you from there?

@BurntSushi

BurntSushi commented Jan 22, 2018

I wouldn't bother honestly. I would do whatever is natural for you. I expect the current internals to be completely redone before something like streaming functionality could get merged.

I'm happy to be a sounding board of course!

@cessen

cessen commented Jan 22, 2018

Oh sure, I'm not expecting it to be mergeable. But I'm hoping that the architecture of this fork will be relevant to your rewrite in the future, so that any lessons learned can be applied. So I'd rather not go off in some direction that's completely tangential to (or fundamentally clashing with) what you have in mind for the future. If that makes sense?

(Edited to add the "not" before "expecting" above. Really bad typo...)

@BurntSushi

BurntSushi commented Jan 22, 2018

Yeah I think so. I'm happy to give feedback as time permits. :)

@cessen

cessen commented Jan 22, 2018

Awesome, thanks! And, of course, I don't mean to squeeze you for time. Apologies for coming off a bit presumptuous earlier--I didn't mean it that way.

I think I'll post here, if that's okay. Let me know if you'd prefer elsewhere!

@cessen

cessen commented Jan 22, 2018

So, the first thing is a bit bigger-picture: I think we'll have a lot more wiggle-room to experiment with fast incremental scanning approaches if we keep the incremental APIs chunk-based rather than byte-based.

This basically amounts to changing API 2 in my original proposal above to take an iterator over byte slices instead of an iterator over bytes. I think that still covers all of the real use-cases. (And if someone really needs to use a byte iterator with the API, they can do their own buffering to pass larger chunks.)

The relevance to what I'm doing right now is that I'm thinking of changing the RegularExpression trait in re_trait.rs to be oriented around sequential chunk input. This mostly involves adding something like a is_last_chunk parameter to the various *_at() methods. It might look something like this:

pub trait RegularExpression {
    fn is_match_at(
        &self,
        chunk: &[u8],           // Analogous to `text` in your code
        offset_in_chunk: usize, // Analogous to `start` in your code
        is_last_chunk: bool,
    ) -> bool;

    fn find_at(
        &self,
        chunk: &[u8],
        offset_in_chunk: usize,
        is_last_chunk: bool,
    ) -> Option<(usize, usize)>;

    // etc...
}

Calling code would pass a chunk at a time, and indicate via is_last_chunk if it's the end of input. This approach puts the onus on implementors of RegularExpression to keep track of how many bytes from start-of-input have already been consumed, but I think that's pretty reasonable...?

The *_iter() methods would be passed iterators over text chunks, and Iterator::size_hint() would be used internally to know when we're at the last chunk:

pub trait RegularExpression {
    // The above stuff

    fn find_iter<'a, I: Iterator<Item = &'a [u8]>>(
        self,
        text: I
    ) -> Matches<Self> {
        // ...
    }

    // etc...
}

One of the nice things about this approach is that a single contiguous text is just a special case: for find_at() et al. you just pass the text as a byte slice with is_last_chunk = true. And for the iterator methods, you just pass [text].iter().

Using DFA internally on contiguous text falls naturally out of this as well, since we can switch to DFA when we know we're on the last chunk. And this gives us plenty of room to experiment with approaches to faster incremental scanning.

(Incidentally, the reason I'm focusing on the RegularExpression trait here is that it seems to be the glue between the higher-level and lower-level APIs, so I'm thinking it will have a lot of influence over the implementations of both. But I could be wrong about that!)

@BurntSushi

BurntSushi commented Jan 24, 2018

@cessen Everything you said sounds pretty sensible to me. Here are some random thoughts:

  1. Using size_hint on iterators can't be assumed to be correct. You can use a peekable iterator (or equivalent).
  2. You might consider alternatives from the is_last_chunk construction. For example, you could state that "passing an empty chunk indicates end of input."
  3. I think consideration should be given to io::Read as well. I believe it's possible to write an adapter on top of your proposed API that takes an arbitrary io::Read and executes a search, right? If so, that's good. Just wanted to put it on your radar. :)
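For what it's worth, such an io::Read adapter does seem writable on top of a chunk-based API; here is a sketch against a hypothetical chunk-based interface (all names here are illustrative, not the actual proposal's API):

```rust
use std::io::Read;

// Hypothetical chunk-based matcher interface, loosely following the
// `is_last_chunk` idea above. Names are illustrative only.
trait ChunkedMatcher {
    /// Feed one chunk; returns (start, end) byte offsets into the whole
    /// stream if a match is found in this chunk.
    fn feed(&mut self, chunk: &[u8], is_last_chunk: bool) -> Option<(usize, usize)>;
}

/// Adapter: drive a chunk-based matcher from any `io::Read`, detecting
/// the final chunk by reading one buffer ahead.
fn find_in_reader<M: ChunkedMatcher, R: Read>(
    matcher: &mut M,
    mut reader: R,
) -> std::io::Result<Option<(usize, usize)>> {
    let mut cur = vec![0u8; 8192];
    let mut next = vec![0u8; 8192];
    let mut cur_len = reader.read(&mut cur)?;
    loop {
        // A zero-byte read is treated as end of stream here (a
        // simplification; a robust version would loop on short reads).
        let next_len = reader.read(&mut next)?;
        let is_last = next_len == 0;
        if let Some(span) = matcher.feed(&cur[..cur_len], is_last) {
            return Ok(Some(span));
        }
        if is_last {
            return Ok(None);
        }
        std::mem::swap(&mut cur, &mut next);
        cur_len = next_len;
    }
}

// Toy matcher standing in for a real regex engine: finds the first
// occurrence of a single byte, tracking the stream-wide offset.
struct FindByte {
    target: u8,
    offset: usize,
}

impl ChunkedMatcher for FindByte {
    fn feed(&mut self, chunk: &[u8], _is_last_chunk: bool) -> Option<(usize, usize)> {
        if let Some(i) = chunk.iter().position(|&b| b == self.target) {
            return Some((self.offset + i, self.offset + i + 1));
        }
        self.offset += chunk.len();
        None
    }
}

fn main() {
    let mut m = FindByte { target: b'x', offset: 0 };
    let found = find_in_reader(&mut m, &b"abcxyz"[..]).unwrap();
    assert_eq!(found, Some((3, 4)));
}
```

Reading one buffer ahead is what lets the adapter pass a truthful `is_last_chunk`, at the cost of holding two buffers.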
@cessen

cessen commented Jan 24, 2018

Thanks for the feedback!

  1. As far as I understand from the documentation, size_hint can't be assumed correct for purposes of memory safety, but can be assumed correct for other purposes. Specifically, "[...]That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait's protocol." Of course, I may be interpreting that incorrectly.
  2. Yeah, I was thinking about that too. The down-side to that approach is then we don't know when we're on the last chunk while processing it, which means we lose the ability to e.g. switch to DFA for the last chunk. I agree that the is_last_chunk parameter is awkward and clunky, but I think the benefit may be worth it if it lets other things "just work". However, I'm certainly open to other approaches.
  3. Absolutely! I think that's another benefit of the chunk-based approach, actually: it maps nicely to how io::Read::read() works. It should be pretty trivial to build an io::Read based API on top of API 1 in the OP proposal. In any case, to put you at ease: I also agree it's important to support that, and I'm definitely approaching things with that in mind.
@BurntSushi

BurntSushi commented Jan 24, 2018

@cessen

  1. The key part of the docs is this line: "The default implementation returns (0, None) which is correct for any iterator." The part you quoted is when the reported lower/upper bounds are wrong. But a 0 lower bound is never wrong and an unknown upper bound is also never wrong. An example of something that's wrong is returning 1 as a lower bound for an iterator over an empty slice.
@cessen

cessen commented Jan 24, 2018

Yes, that was my understanding as well. I didn't mean to imply that size_hint guarantees that we know when we're on the last chunk, but rather that when it's indicated that we can know, it's correct and we can take advantage of it.

@cessen

cessen commented Jan 24, 2018

Or, more specifically, if we reach the upper bound (if one is given) we can set is_last_chunk to true.

We would still have to do something like pass an empty chunk at the end when the upper bound isn't given or when we reach the end before the upper bound.

@cessen

cessen commented Jan 25, 2018

So, regardless of all of this, I think your suggestion to use the peekable iterator is a better idea. For some reason I thought not all iterators could be made peekable, but apparently they can. So that would be a great approach!
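A sketch of the peekable approach (the helper name is made up): the iterator is wrapped with `.peekable()`, and `is_last_chunk` is computed by checking whether anything remains after the current chunk, with no reliance on `size_hint` at all.

```rust
/// Call `f` on each chunk with an `is_last_chunk` flag, computed by
/// peeking one item ahead rather than trusting `size_hint`.
fn for_each_chunk<'a, I, F>(chunks: I, mut f: F)
where
    I: Iterator<Item = &'a [u8]>,
    F: FnMut(&'a [u8], bool),
{
    let mut it = chunks.peekable();
    while let Some(chunk) = it.next() {
        let is_last_chunk = it.peek().is_none();
        f(chunk, is_last_chunk);
    }
}

fn main() {
    let chunks: Vec<&[u8]> = vec![b"ab", b"cd", b"ef"];
    let mut flags = Vec::new();
    for_each_chunk(chunks.into_iter(), |_, last| flags.push(last));
    assert_eq!(flags, vec![false, false, true]);
}
```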

@cessen

cessen commented Feb 5, 2018

Quick status report:

The PikeVM engine now takes a single token at a time, and doesn't access the Input anymore. Essentially, that means that it can be built off of to do streaming input now, though for utf8 it has to be passed an entire scalar value at a time.

Next step is to get it to take just a single byte at a time.

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction). I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.

Edit:
@BurntSushi
Also, gotta say, all the unit tests you wrote are a life-saver! Would never have been able to get to this point without them!

@BurntSushi

BurntSushi commented Feb 5, 2018

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction)

It does indeed. Regexes like .+(?-u:.+) (which matches any sequence of Unicode scalar values followed by any sequence bytes) are perfectly valid for the bytes::Regex type. All of the regex engines can handle this case. Right now, the opcodes include both Unicode specific opcodes and byte-specific opcodes. Technically, the Unicode specific opcodes are not necessary, but I believe they might be faster in some cases. Or at the very least, the full performance profile isn't well understood by me, so I've left them there.

I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.

The DFA is always byte-at-a-time.

Also, gotta say, all the unit tests you wrote are a life-saver! Would never have been able to get to this point without them!

:-)

@cessen

cessen commented Feb 5, 2018

The DFA is always byte-at-a-time.

Oh, interesting. Is the Regex/bytes::Regex divide purely for the benefit of the PikeVM, then? In which case, once I make it byte-based, I could merge the two?

@BurntSushi

BurntSushi commented Feb 5, 2018

Yes. And probably the backtracker. The Unicode instructions also use less space, but that is a specious argument because if we removed them, we might be able to reduce the total number of copies of the program bytecode from 3 to 2. (I'm kind of just riffing here based off memory.)

@sanmai-NL

sanmai-NL commented Jul 27, 2018

@cessen: thanks for your great work! I was wondering how your project is progressing? Have you perhaps planned towards some milestone?

@cessen

cessen commented Aug 1, 2018

@sanmai-NL
Sorry for my lack of reports! I haven't worked on it since my last comment in this thread, so that's still the current status. I didn't stop for any particular reason, other than getting distracted by other things in my life.

I don't know when I'll pick it back up again, so feel free to take the work I've done so far and build on it. I think I stopped at a pretty reasonable point, having gotten the PikeVM to work incrementally one token at a time. The next steps are basically:

  1. Get the PikeVM to work one byte at a time (optional if you only want unicode regex, IIRC).
  2. Build an incremental regex API on top of the PikeVM.

In fact, if/when I pick it back up again, I might reverse those steps anyway. Might be more satisfying to get something fully working first, and then generalize to byte regex after.
