I've been thinking more about this, and I think it would be helpful to write down some core data types. That should help structure thought, discussion and work on this.

I think the two primary data types are an Hir and State, where State is a single state in an NFA. I'll start by writing down what my initial instinct for these two types is, and then explain the problems. For Hir:

enum Hir {
  Empty,
  Char(char),
  Class(Vec<(char, char)>),
  Repetition {
    min: u32,
    max: Option<u32>,
    greedy: bool,
    child: Box<Hir>,
  },
  Capture {
    index: u32,
    name: Option<Box<str>>,
    child: Box<Hir>,
  },
  Concat(Vec<Hir>),
  Alternation(Vec<Hir>),
}

And now for State:

enum State {
  Range {
    start: char,
    end: char,
  },
  Split {
    alt1: StateID,
    alt2: StateID,
  },
  Goto {
    target: StateID,
  },
  Fail,
  Match,
}

And that's pretty much it. The core matching loop would then proceed by decoding a single codepoint from the haystack and applying it to the NFA state graph above. This should all work fine for &str APIs because everything is guaranteed to be valid UTF-8.
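To make the matching loop a bit more concrete, here is a minimal sketch of an NFA simulation over a &str haystack, without captures. It makes a few assumptions that are not in the types above: StateID is a plain usize index into a slice of states, a Range state that matches moves to the state at the next index (since Range carries no explicit target), and the helper names add_state, is_match and example_nfa are made up for illustration.

```rust
// Assumption: a StateID is just an index into the slice of states.
type StateID = usize;

#[derive(Clone, Copy)]
enum State {
    Range { start: char, end: char },
    Split { alt1: StateID, alt2: StateID },
    Goto { target: StateID },
    Fail,
    Match,
}

/// Follow epsilon transitions (Split/Goto) from `id` and record every
/// Range or Match state reached. `seen` prevents visiting a state twice.
fn add_state(nfa: &[State], id: StateID, set: &mut Vec<StateID>, seen: &mut [bool]) {
    if seen[id] {
        return;
    }
    seen[id] = true;
    match nfa[id] {
        State::Split { alt1, alt2 } => {
            add_state(nfa, alt1, set, seen);
            add_state(nfa, alt2, set, seen);
        }
        State::Goto { target } => add_state(nfa, target, set, seen),
        State::Fail => {}
        State::Range { .. } | State::Match => set.push(id),
    }
}

/// Unanchored search: returns true if a match starts at any position.
fn is_match(nfa: &[State], start: StateID, haystack: &str) -> bool {
    let mut current: Vec<StateID> = Vec::new();
    let mut seen = vec![false; nfa.len()];
    let mut chars = haystack.chars();
    loop {
        // A match may begin at every position, so re-add the start state.
        add_state(nfa, start, &mut current, &mut seen);
        if current.iter().any(|&id| matches!(nfa[id], State::Match)) {
            return true;
        }
        // Decode one codepoint; a &str is guaranteed to be valid UTF-8.
        let Some(ch) = chars.next() else { return false };
        let mut next = Vec::new();
        let mut next_seen = vec![false; nfa.len()];
        for &id in &current {
            if let State::Range { start: lo, end: hi } = nfa[id] {
                if lo <= ch && ch <= hi {
                    // Assumption: a matching Range falls through to `id + 1`.
                    add_state(nfa, id + 1, &mut next, &mut next_seen);
                }
            }
        }
        current = next;
        seen = next_seen;
    }
}

/// Hand-assembled NFA for the regex `ab|c`, with state 0 as the start.
fn example_nfa() -> Vec<State> {
    vec![
        State::Split { alt1: 1, alt2: 4 },
        State::Range { start: 'a', end: 'a' },
        State::Range { start: 'b', end: 'b' },
        State::Goto { target: 5 },
        State::Range { start: 'c', end: 'c' },
        State::Match,
    ]
}
```

With this sketch, is_match(&example_nfa(), 0, "xxab") finds the match at offset 2, while "xa" does not match. The fall-through convention means a Range must never be the last state in the slice.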

The problem comes up with the &[u8] APIs. Ideally, we would want it to be possible for things like . to match through invalid UTF-8. So for example, compile(".").is_match(b"\xFF") should return true. Another thing that seems worth supporting is matching arbitrary bytes directly, so that compile(r"\xFF").is_match(b"\xFF") also returns true. The question is how to do this. In the status quo, we complicate the above representation by adding things like State::ByteRange { start: u8, end: u8 }. And then in order to compile such things, you usually need to build UTF-8 decoding into the automaton. Which kind of stinks.

I do wonder if we might be able to take Go's approach to this problem. Go's regexp engine doesn't support the compile(r"\xFF").is_match(b"\xFF") use case, but it does permit . to match through invalid UTF-8. Basically, it does this by using lossy UTF-8 decoding in its core matching loop. Any byte that is not valid UTF-8 gets treated as indistinguishable from U+FFFD. Thus, regexes like . and things like [^a] will match invalid UTF-8 because they match U+FFFD, but they also simultaneously will never split a codepoint.
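As a rough illustration of why lossy decoding lets . and [^a] match invalid UTF-8, here is a small sketch built on the Vec<(char, char)> class representation from Hir::Class. The helpers class_contains, class_is_match, dot and not_a are hypothetical names, and the sketch assumes that . excludes only \n. It leans on the standard library's String::from_utf8_lossy for the decoding step.

```rust
/// Returns true if `ch` falls inside any inclusive range of the class.
fn class_contains(ranges: &[(char, char)], ch: char) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= ch && ch <= hi)
}

/// Hypothetical helper: does a regex consisting of a single class match
/// anywhere in `haystack`, when invalid UTF-8 is decoded as U+FFFD?
fn class_is_match(ranges: &[(char, char)], haystack: &[u8]) -> bool {
    String::from_utf8_lossy(haystack)
        .chars()
        .any(|ch| class_contains(ranges, ch))
}

/// Assumption: `.` is every codepoint except `\n`.
fn dot() -> Vec<(char, char)> {
    vec![('\0', '\x09'), ('\x0B', '\u{10FFFF}')]
}

/// The class `[^a]`: everything below 'a' and everything above it.
fn not_a() -> Vec<(char, char)> {
    vec![('\0', '`'), ('b', '\u{10FFFF}')]
}
```

Both classes contain U+FFFD, so class_is_match(&dot(), b"\xFF") and class_is_match(&not_a(), b"\xFF") are true, while class_is_match(&not_a(), b"a") is false. No byte-level state is needed for any of this.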

This feels like an optimal place to land for me. It keeps the implementation very simple, makes searching &str sensible and permits quite a bit of flexibility with respect to searching on &[u8] as well.

One difference from Go is that we should probably use "substitution of maximal subparts" as our lossy UTF-8 decoding strategy. So for example, a\xF0\x9F\x87z would decode as [a, U+FFFD, z], whereas in Go, that would decode as [a, U+FFFD, U+FFFD, U+FFFD, z].
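The two strategies can be compared side by side with a small sketch. The Strategy enum and decode_lossy function are names invented here for illustration. The sketch uses only std::str::from_utf8 and the valid_up_to and error_len methods on its error type, and it models the Go-style behavior as "one U+FFFD per invalid byte".

```rust
#[derive(Clone, Copy, PartialEq)]
enum Strategy {
    /// One U+FFFD per maximal invalid subpart.
    MaximalSubparts,
    /// One U+FFFD per invalid byte (the Go-style behavior described above).
    PerByte,
}

/// Decode `bytes` into codepoints, replacing invalid UTF-8 with U+FFFD.
fn decode_lossy(bytes: &[u8], strategy: Strategy) -> Vec<char> {
    let mut out = Vec::new();
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.extend(s.chars());
                break;
            }
            Err(err) => {
                let valid = err.valid_up_to();
                // The prefix before the error was already validated.
                out.extend(std::str::from_utf8(&rest[..valid]).unwrap().chars());
                out.push('\u{FFFD}');
                let skip = match strategy {
                    // `error_len` is None when the input ends mid-sequence;
                    // in that case the whole tail is one invalid subpart.
                    Strategy::MaximalSubparts => {
                        err.error_len().unwrap_or(rest.len() - valid)
                    }
                    Strategy::PerByte => 1,
                };
                rest = &rest[valid + skip..];
            }
        }
    }
    out
}
```

For the input b"a\xF0\x9F\x87z", Strategy::MaximalSubparts yields [a, U+FFFD, z] and Strategy::PerByte yields [a, U+FFFD, U+FFFD, U+FFFD, z], which lines up with the two decodings described above.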

Please ask questions if any of this doesn't make sense!

Ironing out regex-lite #961
