automata: rejigger DFA start state computation
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is onerous in some cases. Namely, part of the point
of `regex-automata` was to expose enough guts to make it tractable to
write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward, and it was unclear what the implementation
would actually do with the haystack.

This commit fixes that by exposing a lower-level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.
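
As a rough illustration of why a haystack slice isn't needed, here is a
minimal, self-contained sketch. It does not use `regex-automata`:
`ToyConfig`, `ToyStart`, `classify`, `start_index` and `start_for_chunk`
are hypothetical names, and the classification is deliberately simplified
(ASCII word bytes only, `\n` as the only line terminator). The point is
only that the anchored mode plus an optional look-behind byte is enough
to select a start state.

```rust
/// Hypothetical stand-in for a start configuration: everything needed to
/// pick a start state, with no haystack slice involved.
#[derive(Clone, Copy, Debug)]
struct ToyConfig {
    /// The byte immediately before the search position, if one exists.
    look_behind: Option<u8>,
    /// Whether the search is anchored (simplified to a bool here).
    anchored: bool,
}

/// Hypothetical classification of the position where a search begins.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ToyStart {
    /// No look-behind byte, so `^` can match here.
    Text,
    /// The look-behind byte is `\n`, so `(?m:^)` can match here.
    Line,
    /// The look-behind byte is an ASCII word byte, which matters for `\b`.
    WordByte,
    /// Any other look-behind byte.
    NonWordByte,
}

/// Classify the start position from the look-behind byte alone.
fn classify(config: &ToyConfig) -> ToyStart {
    match config.look_behind {
        None => ToyStart::Text,
        Some(b'\n') => ToyStart::Line,
        Some(b) if b.is_ascii_alphanumeric() || b == b'_' => ToyStart::WordByte,
        Some(_) => ToyStart::NonWordByte,
    }
}

/// Index into a toy start table laid out as [unanchored x 4, anchored x 4].
fn start_index(config: &ToyConfig) -> usize {
    let stride = 4;
    (config.anchored as usize) * stride + classify(config) as usize
}

/// A streaming caller only has to remember the last byte it has seen.
fn start_for_chunk(prev_chunk: &[u8], anchored: bool) -> usize {
    let config = ToyConfig { look_behind: prev_chunk.last().copied(), anchored };
    start_index(&config)
}

fn main() {
    println!("{}", start_for_chunk(b"", false));
    println!("{}", start_for_chunk(b"foo\n", false));
    println!("{}", start_for_chunk(b"foo", true));
}
```

A byte-at-a-time engine built this way never has to hand over the bytes
that came before the search position, only the single byte (if any) that
immediately precedes it.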

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. Callers aren't really expected to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can make such changes in the future without them being breaking),
but still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.
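
The shape of the change can be sketched with a toy trait. This is not the
real `Automaton` trait: `ToyAutomaton`, `ToyStartError`, `ToyMatchError`
and `QuitOnNul` are hypothetical, and the signatures are simplified to
plain slices and integers. It shows a required low-level method, a
provided convenience method built on top of it, and the error conversion
that attaches an offset the low-level routine cannot know.

```rust
/// Toy error from the low-level routine: it knows the byte, not the offset.
#[derive(Debug, PartialEq, Eq)]
enum ToyStartError {
    Quit { byte: u8 },
}

/// Toy error from the convenience routine: it also carries a haystack offset.
#[derive(Debug, PartialEq, Eq)]
enum ToyMatchError {
    Quit { byte: u8, offset: usize },
}

trait ToyAutomaton {
    /// Required: compute a start state from the look-behind byte alone.
    fn start_state(&self, look_behind: Option<u8>) -> Result<u32, ToyStartError>;

    /// Provided: derive the look-behind byte from a haystack and a start
    /// position, then attach the offset of that byte to any error.
    fn start_state_forward(
        &self,
        haystack: &[u8],
        start: usize,
    ) -> Result<u32, ToyMatchError> {
        let look_behind =
            start.checked_sub(1).and_then(|i| haystack.get(i).copied());
        self.start_state(look_behind).map_err(|err| match err {
            ToyStartError::Quit { byte } => {
                // A quit can only be triggered by a look-behind byte, so
                // a start position of 0 is not expected to reach this arm.
                let offset = start
                    .checked_sub(1)
                    .expect("no quit in start without look-behind");
                ToyMatchError::Quit { byte, offset }
            }
        })
    }
}

/// An implementor supplies only the required method.
struct QuitOnNul;

impl ToyAutomaton for QuitOnNul {
    fn start_state(&self, look_behind: Option<u8>) -> Result<u32, ToyStartError> {
        match look_behind {
            Some(0) => Err(ToyStartError::Quit { byte: 0 }),
            Some(_) => Ok(1),
            None => Ok(0),
        }
    }
}

fn main() {
    println!("{:?}", QuitOnNul.start_state_forward(b"abc", 0));
    println!("{:?}", QuitOnNul.start_state_forward(b"a\0b", 2));
}
```

In this toy version, code that only ever called `start_state_forward`
keeps compiling unchanged, while any type implementing the trait must now
supply `start_state`. That mirrors why the commit is technically breaking
for implementors but not for callers.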

Closes #1031
BurntSushi committed Oct 9, 2023
1 parent ad2cfd6 commit f0147f8
Showing 10 changed files with 662 additions and 230 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,13 @@
TBD
===

New features:

* [FEATURE(regex-automata) #1031](https://github.com/rust-lang/regex/pull/1031):
DFAs now have a `start_state` method that doesn't use an `Input`.

Bug fixes:

* [BUG #1046](https://github.com/rust-lang/regex/issues/1046):
Fix a bug that could result in incorrect match spans when using a Unicode word
boundary and searching non-ASCII strings.
188 changes: 164 additions & 24 deletions regex-automata/src/dfa/automaton.rs
@@ -7,6 +7,7 @@ use crate::{
prefilter::Prefilter,
primitives::{PatternID, StateID},
search::{Anchored, HalfMatch, Input, MatchError},
start,
},
};

@@ -226,21 +227,50 @@ pub unsafe trait Automaton {
/// ```
fn next_eoi_state(&self, current: StateID) -> StateID;

/// Return the ID of the start state for this lazy DFA when executing a
/// forward search.
/// Return the ID of the start state for this DFA for the given starting
/// configuration.
///
/// Unlike typical DFA implementations, the start state for DFAs in this
/// crate is dependent on a few different factors:
///
/// * The [`Anchored`] mode of the search. Unanchored, anchored and
/// anchored searches for a specific [`PatternID`] all use different start
/// states.
/// * The position at which the search begins, via [`Input::start`]. This
/// and the byte immediately preceding the start of the search (if one
/// exists) influence which look-behind assertions are true at the start
/// of the search. This in turn influences which start state is selected.
/// * Whether the search is a forward or reverse search. This routine can
/// only be used for forward searches.
/// * Whether a "look-behind" byte exists. For example, the `^` anchor
/// matches if and only if there is no look-behind byte.
/// * The specific value of that look-behind byte. For example, a `(?m:^)`
/// assertion only matches when there is either no look-behind byte, or
/// when the look-behind byte is a line terminator.
///
/// The [starting configuration](start::Config) provides the above
/// information.
///
/// This routine can be used for either forward or reverse searches.
/// Although, as a convenience, if you have an [`Input`], then it may
/// be more succinct to use [`Automaton::start_state_forward`] or
/// [`Automaton::start_state_reverse`]. Note, for example, that the
/// convenience routines return a [`MatchError`] on failure whereas this
/// routine returns a [`StartError`].
///
/// # Errors
///
/// This may return a [`StartError`] if the search needs to give up when
/// determining the start state (for example, if it sees a "quit" byte).
/// This can also return an error if the given configuration contains an
/// unsupported [`Anchored`] configuration.
fn start_state(
&self,
config: &start::Config,
) -> Result<StateID, StartError>;

/// Return the ID of the start state for this DFA when executing a forward
/// search.
///
/// This is a convenience routine for calling [`Automaton::start_state`]
/// that converts the given [`Input`] to a [start
/// configuration](start::Config). Additionally, if an error occurs, it is
/// converted from a [`StartError`] to a [`MatchError`] using the offset
/// information in the given [`Input`].
///
/// # Errors
///
@@ -251,23 +281,30 @@ pub unsafe trait Automaton {
fn start_state_forward(
&self,
input: &Input<'_>,
) -> Result<StateID, MatchError>;
) -> Result<StateID, MatchError> {
let config = start::Config::from_input_forward(input);
self.start_state(&config).map_err(|err| match err {
StartError::Quit { byte } => {
let offset = input
.start()
.checked_sub(1)
.expect("no quit in start without look-behind");
MatchError::quit(byte, offset)
}
StartError::UnsupportedAnchored { mode } => {
MatchError::unsupported_anchored(mode)
}
})
}

/// Return the ID of the start state for this lazy DFA when executing a
/// reverse search.
/// Return the ID of the start state for this DFA when executing a reverse
/// search.
///
/// Unlike typical DFA implementations, the start state for DFAs in this
/// crate is dependent on a few different factors:
///
/// * The [`Anchored`] mode of the search. Unanchored, anchored and
/// anchored searches for a specific [`PatternID`] all use different start
/// states.
/// * The position at which the search begins, via [`Input::start`]. This
/// and the byte immediately preceding the start of the search (if one
/// exists) influence which look-behind assertions are true at the start
/// of the search. This in turn influences which start state is selected.
/// * Whether the search is a forward or reverse search. This routine can
/// only be used for reverse searches.
/// This is a convenience routine for calling [`Automaton::start_state`]
/// that converts the given [`Input`] to a [start
/// configuration](start::Config). Additionally, if an error occurs, it is
/// converted from a [`StartError`] to a [`MatchError`] using the offset
/// information in the given [`Input`].
///
/// # Errors
///
@@ -278,7 +315,18 @@ pub unsafe trait Automaton {
fn start_state_reverse(
&self,
input: &Input<'_>,
) -> Result<StateID, MatchError>;
) -> Result<StateID, MatchError> {
let config = start::Config::from_input_reverse(input);
self.start_state(&config).map_err(|err| match err {
StartError::Quit { byte } => {
let offset = input.end();
MatchError::quit(byte, offset)
}
StartError::UnsupportedAnchored { mode } => {
MatchError::unsupported_anchored(mode)
}
})
}

/// If this DFA has a universal starting state for the given anchor mode
/// and the DFA supports universal starting states, then this returns that
@@ -1798,6 +1846,14 @@ unsafe impl<'a, A: Automaton + ?Sized> Automaton for &'a A {
(**self).next_eoi_state(current)
}

#[inline]
fn start_state(
&self,
config: &start::Config,
) -> Result<StateID, StartError> {
(**self).start_state(config)
}

#[inline]
fn start_state_forward(
&self,
@@ -2015,6 +2071,90 @@ impl OverlappingState {
}
}

/// An error that can occur when computing the start state for a search.
///
/// Computing a start state can fail for a few reasons, either based on
/// incorrect configuration or even based on whether the look-behind byte
/// triggers a quit state. Typically one does not need to handle this error
/// if you're using [`Automaton::start_state_forward`] (or its reverse
/// counterpart), as that routine automatically converts `StartError` to a
/// [`MatchError`] for you.
///
/// This error may be returned by the [`Automaton::start_state`] routine.
///
/// This error implements the `std::error::Error` trait when the `std` feature
/// is enabled.
///
/// This error is marked as non-exhaustive. New variants may be added in a
/// semver compatible release.
#[non_exhaustive]
#[derive(Clone, Debug)]
pub enum StartError {
/// An error that occurs when a starting configuration's look-behind byte
/// is in this DFA's quit set.
Quit {
/// The quit byte that was found.
byte: u8,
},
/// An error that occurs when the caller requests an anchored mode that
/// isn't supported by the DFA.
UnsupportedAnchored {
/// The anchored mode given that is unsupported.
mode: Anchored,
},
}

impl StartError {
pub(crate) fn quit(byte: u8) -> StartError {
StartError::Quit { byte }
}

pub(crate) fn unsupported_anchored(mode: Anchored) -> StartError {
StartError::UnsupportedAnchored { mode }
}
}

#[cfg(feature = "std")]
impl std::error::Error for StartError {}

impl core::fmt::Display for StartError {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
match *self {
StartError::Quit { byte } => write!(
f,
"error computing start state because the look-behind byte \
{:?} triggered a quit state",
crate::util::escape::DebugByte(byte),
),
StartError::UnsupportedAnchored { mode: Anchored::Yes } => {
write!(
f,
"error computing start state because \
anchored searches are not supported or enabled"
)
}
StartError::UnsupportedAnchored { mode: Anchored::No } => {
write!(
f,
"error computing start state because \
unanchored searches are not supported or enabled"
)
}
StartError::UnsupportedAnchored {
mode: Anchored::Pattern(pid),
} => {
write!(
f,
"error computing start state because \
anchored searches for a specific pattern ({}) \
are not supported or enabled",
pid.as_usize(),
)
}
}
}
}

/// Runs the given overlapping `search` function (forwards or backwards) until
/// a match is found whose offset does not split a codepoint.
///