allow starting DFA in noncontinuous bytes #1031

pascalkuthe · 2023-07-09T16:31:55Z

regex-automtaton already supports transversing the DFA one byte at a time with
next_state. This is potentially very useful when scanning noncontinuous data
like network stream or a rope data structures as commonly used in editors.

However, to start the DFA with start_state_forward/start_state_reverse
currently requires an Input and will look ahead/look one byte behind the
span boundaries. To support that (especially when using prefilters/literal
optimization) a streaming use case can not provide such a haystack easily (it
can be worked around with a temporary array and copying one byte over but its
extremely brittle/hacky).

This PR adds the start_state_forward_with/start_state_reverse_with
function which allow passing the information extracted from the Input directly.

This is currently a draft because I haven't added proper documentation yet. I wanted to wait whether you were open to this change in general before writing the docs.

regex-automtaton already supports transversing the DFA one byte at a time with `next_state`. This is potentially very useful when scanning noncontinuous data like network stream or a rope data structures as commonly used in editors. However, to start the DFA with `start_state_forward`/`start_state_reverse` currently requires an `Input` and will look ahead/look one byte behind the span boundaries. To support that (especially when using prefilters/literal optimization) a streaming use case can not provide such a haystack easily (it can be worked around with a temporary array and copying one byte over but its extremely brittle/hacky). This commit adds the `start_state_forward_with`/`start_state_reverse_with` function which allow passing the information extracted from the Input directly.

BurntSushi · 2023-07-10T01:35:02Z

Can you show how you would use the existing APIs for the streaming use case? You mention it would be extremely hacky, but in my head it doesn't seem that way?

The problem with this addition is that it complicates the API even more than it already is, and more importantly, makes it so the start state can only be computed from the span and/or look-behind. The status quo is that the start state can be computed from anything on an Input, which is far more flexible. If anything, this API addition seems like a hack to me. Because of that, it's important to be able to see what the actual alternative is.

pascalkuthe · 2023-07-10T02:30:13Z

So with this PR starting a DFA looks as follows

  let sid = dfa.start_state_forward_with(
      cache,
      input.get_anchored(),
      input.look_behind(),
      input.get_span(),
  )?;

whereas a workaround would look as follows:

let fake_haystack;
let fake_input = match input.look_behind() {
    Some(byte) => {
        //The second byte is just so that the span is not empty.
        // I am not sure that matters otherwise it could be a single-element array
        fake_haystack = [byte, byte];
        regex_automata::Input::new(&fake_haystack).range(1..)
    }
    None => regex_automata::Input::new(input.curr_haystack()),
};
let sid = dfa.start_state_forward(
    cache,
    &fake_input
        .anchored(input.get_anchored())
        .earliest(input.get_earliest()),
)?;

Just to provide some background I am implementing a regex engine for a rope data structure so I have a cursor of byte chunks. That means I am not using Input at all (and reimplementing everything attached to it). I have my equivalent of the input struct that smoothly handles the transversal of the rope. Importantly I am never allocating some kind of temporary buffer. The goal is to use the internal buffers of the rope without copying them.

In the workaround I am creating a fake input just with the look behind the byte. I can't use the current chunk/haystack because the lookahead byte might be in the previous chunk (if the new DFA starts exactly at a new chunk so the look_behind byte is the last byte of the previous chunk).

Doing it this way kind of works but feels pretty hacky. Especially if you plan on using more information from the input in the future (for example looking at the actual content of the haystack inside the span) this solution might break down the road. I am not sure how you would treat something like that but depending on your definition that wouldn't be a breaking change either so I would have to pin a patch version. I was looking for an API that explicitly considers this kind of streaming application a valid use-case and not "I passed an input that happens to behave the same thanks to my knowledge of this function implementation details".

Of course, if your plan is to actually evolve that function then providing such a function would be pretty limiting. Perhaps the parameter set could be extended or I could just construct an input with my first chunk and only provide the look_behind byte (or lookahead for reverse scan) byte seperatly?

BurntSushi · 2023-07-10T02:43:15Z

I'm respond more later when I have my hands on a keyboard, but the entire point of Input was to avoid functions exactly like the ones proposed here where you have long parameter lists with options and other various things. Believe me, I tried that approach. It doesn't scale.

I'll respond with more detail and thoughts about your code sample tomorrow.

BurntSushi · 2023-07-10T15:55:59Z

So I slept on this, and I basically think the start_state_forward (and related APIs) was a minor gaffe on my part. Thinking more holistically, it just doesn't make a ton of sense for the start state computation to depend on the entire haystack contents. It just so happens that an Input is pretty close to what you want, but actually kind of pushes you toward thinking that computing the start state requires more stuff than what is actually sensible. Like, it should be fine to have an API guarantee that computing the start state shouldn't need anything more than very local information with respect to where the caller wants to start a search.

In the immediate term, I think it's safe to just use a haystack consisting of a single byte and setting the search span offsets accordingly to get the behavior you desire.

In the less immediate term, I think a new API that lets one build up a StartConfig would be best. That is, my comments about routines with these parameter lists still apply, and the solution to that is to encapsulate the parameters in a type. So it's probably the case that something like your Input would be roughly equivalent to a StartConfig.

So if you wanted to work on this, here's what I'm thinking:

Put StartConfig in crate::util::start name it Config to avoid stuttering.
Export the start module.
Add new start_state_forward_with routines, like you have here, but ones that accept a crate::util::start::Config instead. (I don't like the name start_state_forward_with personally. Ideally we'd just replace the existing start_state_forward routines, but I'm not doing a semver breaking change release already just for this.)
Make start_state_forward just be a thin wrapper that calls start_state_forward_with. Perhaps adding to_start_config as a new public method on Input would be nice.
Do the above for both the lazy DFA and full DFAs. (Of course, both would use the same start::Config type.)

How's that sound?

pascalkuthe · 2023-07-10T21:01:53Z

Thanks for the detailed thoughts! I will try to finish my prototype implementation first and use the workaround. It sounds like that it should be reasonably safe to rely on that for now.

I think exposing an extra type makes a lot of sense and would feel clearer. I will try to take a stab at that in the (hopefully) not to distant future. I will close this PR and open a new PR then 👍

BurntSushi · 2023-07-10T21:15:29Z

Aye, and I might take a stab at it too if you don't get to it. In particular, there are interesting API design challenges here. For example, there is probably an invariant where if span.start > 0, then a look-behind byte must be specified. We could uphold that by panicking (perhaps an easier API design) or by requiring it by virtue of the types (requires more thought). Or if we choose not to uphold that invariant, then we need to define the behavior when that occurs (that is, when a start configuration is given where span.start > 0 and no look-behind byte is given). In today's world, if you give a span that doesn't reflect the haystack inside the Input, then you get panics.

BurntSushi · 2023-08-26T03:16:28Z

I'm going to re-open this PR because it's hard to track the important API addition I outlined above otherwise.

BurntSushi · 2023-10-03T20:36:41Z

I am tackling this with a rough plan to just fix the existing API and put out a regex-automata 0.4 release. (Because if I choose to do #469, then I'll need to put out a 0.4 release anyway. Normally I wouldn't do a breaking change release for something like this and instead just add a new API.)

pascalkuthe · 2023-10-03T20:46:57Z

sorry for nog teggint back to this sooner, I had a contract comeup that is sucking up all my time.

BurntSushi · 2023-10-03T20:48:20Z

No worries! I wasn't planning on getting to it myself any time soon, but with the possibility of a breaking release it makes sense to try and get it in.

It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031

BurntSushi · 2023-10-06T15:52:06Z

@pascalkuthe What do you think?

New start::Config type: https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/util/start/struct.Config.html
New hybrid::dfa::DFA::start_state method: https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/hybrid/dfa/struct.DFA.html#method.start_state
New dfa::Automaton::start_state method: https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/dfa/trait.Automaton.html#tymethod.start_state

There is technically a breaking change here due to a new required method (start_state) on the Automaton trait, but otherwise, the existing start_state_forward and start_state_reverse routines continue to work as they did.

It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031

pascalkuthe · 2023-10-07T00:02:04Z

I read trough the docs, and that looks great! I really like the builder style API it makes a lot of sense here (especially if you do endup introducing more complex start state it could avoid future breaking changes). Also great docs, the examples are great! Awesome that you picked this up, this is gonna really help make my streaming DFA search implementation less hacky:)

I haven't had the chance to read the code too deeply yet as I am away this weekend. If that would be helpful to you I can try to read trough/review the changes next week.

BurntSushi · 2023-10-07T00:05:21Z

I haven't had the chance to read the code too deeply yet as I am away this weekend. If that would be helpful to you I can try to read trough/review the changes next week.

Absolutely! I have dreams of having the code merged by then, but in practice I have a lot of other stuff to do, specifically #469. Wouldn't be surprised if that took me into next week.

(especially if you do endup introducing more complex start state it could avoid future breaking changes)

Yeah exactly. That's part of why I created Input too. (But also because, before Input, I had huge parameter lists and a combinatorial explosion of functions. It was gross.)

It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031

pascalkuthe · 2023-10-10T16:32:11Z

sorry for the late feedback. It seems you have already merged that commit. Nonetheless I took a look trough the changes in more detail. Looks great to me! Thank you for working on this yourself. I really appreciate that you put effort to add these APIs for more niche (streaming) usecases of regex-automaton

BurntSushi · 2023-10-10T16:34:25Z

Sweet! Good to hear. :)

pascalkuthe force-pushed the master branch from bbf274e to 1d98af1 Compare July 9, 2023 16:34

pascalkuthe closed this Jul 10, 2023

BurntSushi reopened this Aug 26, 2023

pascalkuthe mentioned this pull request Sep 22, 2023

execute a regex on text streams #425

Open

BurntSushi mentioned this pull request Oct 3, 2023

consider providing \< and \> #469

Closed

BurntSushi added the fix-incoming label Oct 6, 2023

BurntSushi closed this in 7a7ce83 Oct 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow starting DFA in noncontinuous bytes #1031

allow starting DFA in noncontinuous bytes #1031

pascalkuthe commented Jul 9, 2023

BurntSushi commented Jul 10, 2023 •

edited

Loading

pascalkuthe commented Jul 10, 2023

BurntSushi commented Jul 10, 2023

BurntSushi commented Jul 10, 2023

pascalkuthe commented Jul 10, 2023

BurntSushi commented Jul 10, 2023

BurntSushi commented Aug 26, 2023

BurntSushi commented Oct 3, 2023 •

edited

Loading

pascalkuthe commented Oct 3, 2023

BurntSushi commented Oct 3, 2023

BurntSushi commented Oct 6, 2023

pascalkuthe commented Oct 7, 2023

BurntSushi commented Oct 7, 2023

pascalkuthe commented Oct 10, 2023

BurntSushi commented Oct 10, 2023

allow starting DFA in noncontinuous bytes #1031

allow starting DFA in noncontinuous bytes #1031

Conversation

pascalkuthe commented Jul 9, 2023

BurntSushi commented Jul 10, 2023 • edited Loading

pascalkuthe commented Jul 10, 2023

BurntSushi commented Jul 10, 2023

BurntSushi commented Jul 10, 2023

pascalkuthe commented Jul 10, 2023

BurntSushi commented Jul 10, 2023

BurntSushi commented Aug 26, 2023

BurntSushi commented Oct 3, 2023 • edited Loading

pascalkuthe commented Oct 3, 2023

BurntSushi commented Oct 3, 2023

BurntSushi commented Oct 6, 2023

pascalkuthe commented Oct 7, 2023

BurntSushi commented Oct 7, 2023

pascalkuthe commented Oct 10, 2023

BurntSushi commented Oct 10, 2023

BurntSushi commented Jul 10, 2023 •

edited

Loading

BurntSushi commented Oct 3, 2023 •

edited

Loading