Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow starting DFA in noncontinuous bytes #1031

Closed
wants to merge 1 commit into from

Conversation

pascalkuthe
Copy link
Contributor

regex-automtaton already supports transversing the DFA one byte at a time with
next_state. This is potentially very useful when scanning noncontinuous data
like network stream or a rope data structures as commonly used in editors.

However, to start the DFA with start_state_forward/start_state_reverse
currently requires an Input and will look ahead/look one byte behind the
span boundaries. To support that (especially when using prefilters/literal
optimization) a streaming use case can not provide such a haystack easily (it
can be worked around with a temporary array and copying one byte over but its
extremely brittle/hacky).

This PR adds the start_state_forward_with/start_state_reverse_with
function which allow passing the information extracted from the Input directly.

This is currently a draft because I haven't added proper documentation yet. I wanted to wait whether you were open to this change in general before writing the docs.

regex-automtaton already supports transversing the DFA one byte at a time with
`next_state`. This is potentially very useful when scanning noncontinuous data
like network stream or a rope data structures as commonly used in editors.

However, to start the DFA with `start_state_forward`/`start_state_reverse`
currently requires an `Input` and will look ahead/look one byte behind the
span boundaries. To support that (especially when using prefilters/literal
optimization) a streaming use case can not provide such a haystack easily (it
can be worked around with a temporary array and copying one byte over but its
extremely brittle/hacky).

This commit adds the `start_state_forward_with`/`start_state_reverse_with`
function which allow passing the information extracted from the Input directly.
@BurntSushi
Copy link
Member

BurntSushi commented Jul 10, 2023

Can you show how you would use the existing APIs for the streaming use case? You mention it would be extremely hacky, but in my head it doesn't seem that way?

The problem with this addition is that it complicates the API even more than it already is, and more importantly, makes it so the start state can only be computed from the span and/or look-behind. The status quo is that the start state can be computed from anything on an Input, which is far more flexible. If anything, this API addition seems like a hack to me. Because of that, it's important to be able to see what the actual alternative is.

@pascalkuthe
Copy link
Contributor Author

So with this PR starting a DFA looks as follows

  let sid = dfa.start_state_forward_with(
      cache,
      input.get_anchored(),
      input.look_behind(),
      input.get_span(),
  )?;

whereas a workaround would look as follows:

let fake_haystack;
let fake_input = match input.look_behind() {
    Some(byte) => {
        //The second byte is just so that the span is not empty.
        // I am not sure that matters otherwise it could be a single-element array
        fake_haystack = [byte, byte];
        regex_automata::Input::new(&fake_haystack).range(1..)
    }
    None => regex_automata::Input::new(input.curr_haystack()),
};
let sid = dfa.start_state_forward(
    cache,
    &fake_input
        .anchored(input.get_anchored())
        .earliest(input.get_earliest()),
)?;

Just to provide some background I am implementing a regex engine for a rope data structure so I have a cursor of byte chunks. That means I am not using Input at all (and reimplementing everything attached to it). I have my equivalent of the input struct that smoothly handles the transversal of the rope. Importantly I am never allocating some kind of temporary buffer. The goal is to use the internal buffers of the rope without copying them.

In the workaround I am creating a fake input just with the look behind the byte. I can't use the current chunk/haystack because the lookahead byte might be in the previous chunk (if the new DFA starts exactly at a new chunk so the look_behind byte is the last byte of the previous chunk).

Doing it this way kind of works but feels pretty hacky. Especially if you plan on using more information from the input in the future (for example looking at the actual content of the haystack inside the span) this solution might break down the road. I am not sure how you would treat something like that but depending on your definition that wouldn't be a breaking change either so I would have to pin a patch version. I was looking for an API that explicitly considers this kind of streaming application a valid use-case and not "I passed an input that happens to behave the same thanks to my knowledge of this function implementation details".

Of course, if your plan is to actually evolve that function then providing such a function would be pretty limiting. Perhaps the parameter set could be extended or I could just construct an input with my first chunk and only provide the look_behind byte (or lookahead for reverse scan) byte seperatly?

@BurntSushi
Copy link
Member

I'm respond more later when I have my hands on a keyboard, but the entire point of Input was to avoid functions exactly like the ones proposed here where you have long parameter lists with options and other various things. Believe me, I tried that approach. It doesn't scale.

I'll respond with more detail and thoughts about your code sample tomorrow.

@BurntSushi
Copy link
Member

So I slept on this, and I basically think the start_state_forward (and related APIs) was a minor gaffe on my part. Thinking more holistically, it just doesn't make a ton of sense for the start state computation to depend on the entire haystack contents. It just so happens that an Input is pretty close to what you want, but actually kind of pushes you toward thinking that computing the start state requires more stuff than what is actually sensible. Like, it should be fine to have an API guarantee that computing the start state shouldn't need anything more than very local information with respect to where the caller wants to start a search.

In the immediate term, I think it's safe to just use a haystack consisting of a single byte and setting the search span offsets accordingly to get the behavior you desire.

In the less immediate term, I think a new API that lets one build up a StartConfig would be best. That is, my comments about routines with these parameter lists still apply, and the solution to that is to encapsulate the parameters in a type. So it's probably the case that something like your Input would be roughly equivalent to a StartConfig.

So if you wanted to work on this, here's what I'm thinking:

  • Put StartConfig in crate::util::start name it Config to avoid stuttering.
  • Export the start module.
  • Add new start_state_forward_with routines, like you have here, but ones that accept a crate::util::start::Config instead. (I don't like the name start_state_forward_with personally. Ideally we'd just replace the existing start_state_forward routines, but I'm not doing a semver breaking change release already just for this.)
  • Make start_state_forward just be a thin wrapper that calls start_state_forward_with. Perhaps adding to_start_config as a new public method on Input would be nice.
  • Do the above for both the lazy DFA and full DFAs. (Of course, both would use the same start::Config type.)

How's that sound?

@pascalkuthe
Copy link
Contributor Author

Thanks for the detailed thoughts! I will try to finish my prototype implementation first and use the workaround. It sounds like that it should be reasonably safe to rely on that for now.

I think exposing an extra type makes a lot of sense and would feel clearer. I will try to take a stab at that in the (hopefully) not to distant future. I will close this PR and open a new PR then 👍

@BurntSushi
Copy link
Member

Aye, and I might take a stab at it too if you don't get to it. In particular, there are interesting API design challenges here. For example, there is probably an invariant where if span.start > 0, then a look-behind byte must be specified. We could uphold that by panicking (perhaps an easier API design) or by requiring it by virtue of the types (requires more thought). Or if we choose not to uphold that invariant, then we need to define the behavior when that occurs (that is, when a start configuration is given where span.start > 0 and no look-behind byte is given). In today's world, if you give a span that doesn't reflect the haystack inside the Input, then you get panics.

@BurntSushi
Copy link
Member

I'm going to re-open this PR because it's hard to track the important API addition I outlined above otherwise.

@BurntSushi
Copy link
Member

BurntSushi commented Oct 3, 2023

I am tackling this with a rough plan to just fix the existing API and put out a regex-automata 0.4 release. (Because if I choose to do #469, then I'll need to put out a 0.4 release anyway. Normally I wouldn't do a breaking change release for something like this and instead just add a new API.)

@pascalkuthe
Copy link
Contributor Author

sorry for nog teggint back to this sooner, I had a contract comeup that is sucking up all my time.

@BurntSushi
Copy link
Member

No worries! I wasn't planning on getting to it myself any time soon, but with the possibility of a breaking release it makes sense to try and get it in.

BurntSushi added a commit that referenced this pull request Oct 6, 2023
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for all cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
@BurntSushi
Copy link
Member

@pascalkuthe What do you think?

There is technically a breaking change here due to a new required method (start_state) on the Automaton trait, but otherwise, the existing start_state_forward and start_state_reverse routines continue to work as they did.

BurntSushi added a commit that referenced this pull request Oct 6, 2023
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for all cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
BurntSushi added a commit that referenced this pull request Oct 6, 2023
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for all cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
@pascalkuthe
Copy link
Contributor Author

I read trough the docs, and that looks great! I really like the builder style API it makes a lot of sense here (especially if you do endup introducing more complex start state it could avoid future breaking changes). Also great docs, the examples are great! Awesome that you picked this up, this is gonna really help make my streaming DFA search implementation less hacky:)

I haven't had the chance to read the code too deeply yet as I am away this weekend. If that would be helpful to you I can try to read trough/review the changes next week.

@BurntSushi
Copy link
Member

I haven't had the chance to read the code too deeply yet as I am away this weekend. If that would be helpful to you I can try to read trough/review the changes next week.

Absolutely! I have dreams of having the code merged by then, but in practice I have a lot of other stuff to do, specifically #469. Wouldn't be surprised if that took me into next week.

(especially if you do endup introducing more complex start state it could avoid future breaking changes)

Yeah exactly. That's part of why I created Input too. (But also because, before Input, I had huge parameter lists and a combinatorial explosion of functions. It was gross.)

BurntSushi added a commit that referenced this pull request Oct 8, 2023
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for all cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
BurntSushi added a commit that referenced this pull request Oct 9, 2023
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for all cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
@BurntSushi BurntSushi closed this in 7a7ce83 Oct 9, 2023
@pascalkuthe
Copy link
Contributor Author

sorry for the late feedback. It seems you have already merged that commit. Nonetheless I took a look trough the changes in more detail. Looks great to me! Thank you for working on this yourself. I really appreciate that you put effort to add these APIs for more niche (streaming) usecases of regex-automaton

@BurntSushi
Copy link
Member

Sweet! Good to hear. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants