Performance: Optimise HIR case folding #893

Closed

addisoncrump
Contributor

In the HIR translation stage, when both Unicode and case-insensitive modes are enabled, case folding interval sets can take an exceedingly long time. This PR optimises case folding through the following techniques:

  1. Determining if case folding is necessary.
    • If we have already case folded, then we do not need to do so again.
    • If we have already case folded and are performing a set operation with another interval set that has already been case folded, we do not need to do so again. (Short-hand evidence: if two interval sets are already case folded, then we can consider their set operation to not be operating on individual characters, but instead sets of case folded characters, thus the result of any set operation is necessarily also a set of sets of case folded characters)
    • If we perform a set operation and the result is an interval set which we already know is case folded, then we do not need to do so again.
  2. Optimising case folding itself.
    • Most case folding operations occur over ranges of characters which are in order and adjacent. By keeping track of the previous location of a character, we can remove the need for additional binary searches for the next character.
    • Removes unnecessary iterations when we already know how far away the next foldable character is by changing the range based on the error response of the case folding operation.
  3. Hashing interval sets so that we can determine if an interval set has changed after an operation (necessary for technique 1).
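
As a rough illustration of technique 1 (not the code from this PR), the sketch below tracks a `folded` flag on a toy character set. `CharSet`, `count_folds`, and the ASCII-only folding are hypothetical stand-ins for the crate's interval sets and Unicode simple case folding. The only point is that an operation between two already-folded sets keeps the flag set, so a later fold request can be skipped.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for an interval set (hypothetical, not the crate's type):
/// a set of chars plus a flag recording that it is closed under case folding.
struct CharSet {
    chars: BTreeSet<char>,
    folded: bool,
}

impl CharSet {
    fn new(chars: &[char]) -> CharSet {
        // An empty set is trivially closed under case folding.
        CharSet { chars: chars.iter().copied().collect(), folded: chars.is_empty() }
    }

    /// Adds ASCII case variants. Returns true if work was done, false if
    /// the flag allowed the fold to be skipped.
    fn case_fold(&mut self) -> bool {
        if self.folded {
            return false;
        }
        let mut extra = Vec::new();
        for &c in &self.chars {
            extra.push(c.to_ascii_lowercase());
            extra.push(c.to_ascii_uppercase());
        }
        self.chars.extend(extra);
        self.folded = true;
        true
    }

    /// Set difference. The result is known to be folded only when both
    /// operands were.
    fn difference(&mut self, other: &CharSet) {
        self.chars.retain(|c| !other.chars.contains(c));
        self.folded = self.folded && other.folded;
    }
}

/// Folds a class once, then subtracts an empty class `n` times, requesting a
/// fold after each subtraction. Returns how many folds actually ran.
fn count_folds(n: usize) -> usize {
    let mut class = CharSet::new(&['a', 'B', '1']);
    let empty = CharSet::new(&[]);
    let mut folds = usize::from(class.case_fold());
    for _ in 0..n {
        class.difference(&empty);
        folds += usize::from(class.case_fold());
    }
    folds
}

fn main() {
    println!("folds performed: {}", count_folds(36));
}
```

If the other operand has not been folded, the flag is cleared and the next fold request does real work again, which is the conservative direction to err in.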

These changes dramatically improve performance in these conditions. To demonstrate this, consider the following regex and test program:

(?i:[[:^space:]------------------------------------------------------------------------])

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for pattern in std::env::args().skip(1) {
        const ITERS: u32 = 10;
        for _ in 0..ITERS {
            let _ = regex_syntax::Parser::new().parse(&pattern);
        }
    }
    Ok(())
}

Placing the test program in examples/test-pattern.rs, we can observe the runtime of the program before the changes:

[addisoncrump@addisoncrump-main regex]$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real	0m8,983s
user	0m8,934s
sys	0m0,014s

And after the changes:

[addisoncrump@addisoncrump-main regex]$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real	0m0,088s
user	0m0,069s
sys	0m0,018s

@BurntSushi
Member

Nice find and nice work! So I think what I'd like to do is figure out to what extent each of these optimizations actually helps things. The hashing and the case table lookups in particular represent a fairly big boost in complexity that I would really like to make sure is worth it. Additionally, the hashing trick in particular poses a somewhat more practical issue for the medium term where I plan to add an alloc-only mode to regex-syntax, and the hashing routines aren't available in alloc. So for the alloc-only mode, we'd either need to figure out an alternative strategy, not do it at all, implement our own hashing or bring in another crate. None are particularly appetizing to me...

So with that said, I think the thing I'd like to know here is how much (1) helps all on its own. You mention that (3) is required for (1), but I'm not actually convinced of that? I think it's probably required if you want to optimize out every last bit of redundant work, but my suspicion is that we might be able to get away without hashing. That is, if you're performing a set operation on any two classes, your patch here already leverages the fact that if they're already both case folded then the result must be case folded too. It only falls back to hashing in the case where one wasn't already case folded. I wonder, how often does that happen in practice? And if it does happen often, perhaps we can change the caller patterns such that it happens less.
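
For readers unfamiliar with the hashing fallback being discussed, here is a minimal sketch of the general idea, under the assumption that ranges are plain `(char, char)` pairs. `fingerprint` and `changed_by` are hypothetical names, not functions from the patch. It relies on `DefaultHasher`, which lives in `std` rather than `alloc`, and that is the practical concern raised above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes a list of ranges into a single 64-bit fingerprint.
fn fingerprint(ranges: &[(char, char)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    ranges.hash(&mut hasher);
    hasher.finish()
}

/// Runs `op` on `ranges` and reports whether the fingerprint differs
/// afterwards. Equal fingerprints only suggest "unchanged": a collision is
/// unlikely but possible, so this is a heuristic, not a proof.
fn changed_by<F>(ranges: &mut Vec<(char, char)>, op: F) -> bool
where
    F: FnOnce(&mut Vec<(char, char)>),
{
    let before = fingerprint(ranges.as_slice());
    op(ranges);
    before != fingerprint(ranges.as_slice())
}

fn main() {
    let mut ranges = vec![('a', 'z')];
    // Subtracting an empty class leaves the ranges untouched.
    let after_noop = changed_by(&mut ranges, |_r| {});
    // Adding a range alters the contents.
    let after_push = changed_by(&mut ranges, |r| r.push(('A', 'Z')));
    println!("no-op changed: {}, push changed: {}", after_noop, after_push);
}
```

When the fingerprint is unchanged and the set was folded before the operation, the fold could be skipped. The flag-only approach avoids this machinery at the cost of occasionally folding a set that did not need it.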

Another thing I'd like to do in this space is to introduce a separate character class nesting limit, but I think that's out of scope for this PR here.

@addisoncrump
Contributor Author

Hm, okay -- I'll try feature-gating various sections, then re-using the fuzzer I used to find this particular input to find troublemaking inputs with various feature flags.

@BurntSushi
Member

I ended up investigating this a bit, and I was able to fix the perf problem with your regex by just using technique (1) (with a tweak or two). So I've stuck with just that for now.

I'm also going to look into a nest limit for character classes.

@BurntSushi
Member

I ended up thinking more about your case folding optimizations as well and decided to take a crack at it. But I did not want to make the type signature more unruly than it already was. So I ended up just creating a stateful type that knows how to do the right thing, and in the process, greatly simplified the caller. So the caller now looks like this:

    fn case_fold_simple(
        &self,
        ranges: &mut Vec<ClassUnicodeRange>,
    ) -> Result<(), unicode::CaseFoldError> {
        let mut folder = unicode::SimpleCaseFolder::new()?;
        if !folder.overlaps(self.start, self.end) {
            return Ok(());
        }
        let (start, end) = (u32::from(self.start), u32::from(self.end));
        for cp in (start..=end).filter_map(char::from_u32) {
            for &cp_folded in folder.mapping(cp) {
                ranges.push(ClassUnicodeRange::new(cp_folded, cp_folded));
            }
        }
        Ok(())
    }

And here's the case folder:

/// A state oriented traverser of the simple case folding table.
///
/// A case folder can be constructed via `SimpleCaseFolder::new()`, which will
/// return an error if the underlying case folding table is unavailable.
///
/// After construction, it is expected that callers will use
/// `SimpleCaseFolder::mapping` by calling it with codepoints in strictly
/// increasing order. For example, calling it on `b` and then on `a` is illegal
/// and will result in a panic.
///
/// The main idea of this type is that it tries hard to make mapping lookups
/// fast by exploiting the structure of the underlying table, and the ordering
/// assumption enables this.
#[derive(Debug)]
pub struct SimpleCaseFolder {
    /// The simple case fold table. It's a sorted association list, where the
    /// keys are Unicode scalar values and the values are the corresponding
    /// equivalence class (not including the key) of the "simple" case folded
    /// Unicode scalar values.
    table: &'static [(char, &'static [char])],
    /// The last codepoint that was used for a lookup.
    last: Option<char>,
    /// The index to the entry in `table` corresponding to the smallest key `k`
    /// such that `k > k0`, where `k0` is the most recent key lookup. Note that
    /// in particular, `k0` may not be in the table!
    next: usize,
}

impl SimpleCaseFolder {
    /// Create a new simple case folder, returning an error if the underlying
    /// case folding table is unavailable.
    pub fn new() -> Result<SimpleCaseFolder, CaseFoldError> {
        #[cfg(not(feature = "unicode-case"))]
        {
            Err(CaseFoldError(()))
        }
        #[cfg(feature = "unicode-case")]
        {
            Ok(SimpleCaseFolder {
                table: crate::unicode_tables::case_folding_simple::CASE_FOLDING_SIMPLE,
                last: None,
                next: 0,
            })
        }
    }

    /// Return the equivalence class of case folded codepoints for the given
    /// codepoint. The equivalence class returned never includes the codepoint
    /// given. If the given codepoint has no case folded codepoints (i.e.,
    /// no entry in the underlying case folding table), then this returns an
    /// empty slice.
    ///
    /// # Panics
    ///
    /// This panics when called with a `c` that is less than or equal to the
    /// previous call. In other words, callers need to use this method with
    /// strictly increasing values of `c`.
    pub fn mapping(&mut self, c: char) -> &'static [char] {
        if let Some(last) = self.last {
            assert!(
                last < c,
                "got codepoint U+{:X} which occurs before \
                 last codepoint U+{:X}",
                u32::from(c),
                u32::from(last),
            );
        }
        self.last = Some(c);
        if self.next >= self.table.len() {
            return &[];
        }
        let (k, v) = self.table[self.next];
        if k == c {
            self.next += 1;
            return v;
        }
        match self.get(c) {
            Err(i) => {
                self.next = i;
                &[]
            }
            Ok(i) => {
                // Since we require lookups to proceed
                // in order, anything we find should be
                // after whatever we thought might be
                // next. Otherwise, the caller is either
                // going out of order or we would have
                // found our next key at 'self.next'.
                assert!(i > self.next);
                self.next = i + 1;
                self.table[i].1
            }
        }
    }

    /// Returns true if and only if the given range overlaps with any region
    /// of the underlying case folding table. That is, when true, there exists
    /// at least one codepoint in the inclusive range `[start, end]` that has
    /// a non-trivial equivalence class of case folded codepoints. Conversely,
    /// when this returns false, all codepoints in the range `[start, end]`
    /// correspond to the trivial equivalence class of case folded codepoints,
    /// i.e., itself.
    ///
    /// This is useful to call before iterating over the codepoints in the
    /// range and looking up the mapping for each. If you know none of the
    /// mappings will return anything, then you might be able to skip doing it
    /// altogether.
    ///
    /// # Panics
    ///
    /// This panics when `end < start`.
    pub fn overlaps(&self, start: char, end: char) -> bool {
        use core::cmp::Ordering;

        assert!(start <= end);
        self.table
            .binary_search_by(|&(c, _)| {
                if start <= c && c <= end {
                    Ordering::Equal
                } else if c > end {
                    Ordering::Greater
                } else {
                    Ordering::Less
                }
            })
            .is_ok()
    }

    /// Returns the index at which `c` occurs in the simple case fold table. If
    /// `c` does not occur, then this returns an `i` such that `table[i-1].0 <
    /// c` and `table[i].0 > c`.
    fn get(&self, c: char) -> Result<usize, usize> {
        self.table.binary_search_by_key(&c, |&(c1, _)| c1)
    }
}

I think this is still the same fundamental idea in your patch, where basically it keeps track of the last position and looks for an adjacent match before launching into binary search.
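
To make the "adjacent match before binary search" idea concrete, here is a stripped-down sketch over a six-entry table. `TABLE`, `Folder`, and `fold_all` are hypothetical and exist only for this illustration, the ordering assertions from the type above are left out, and a `searches` counter is added to show when the binary search fallback actually runs.

```rust
/// Hypothetical miniature fold table: sorted keys, each with its partners.
const TABLE: &[(char, &[char])] = &[
    ('A', &['a']),
    ('B', &['b']),
    ('C', &['c']),
    ('a', &['A']),
    ('b', &['B']),
    ('c', &['C']),
];

struct Folder {
    /// Index of the table entry expected to match the next lookup.
    next: usize,
    /// Number of binary searches performed so far (for illustration only).
    searches: usize,
}

impl Folder {
    fn new() -> Folder {
        Folder { next: 0, searches: 0 }
    }

    fn mapping(&mut self, c: char) -> &'static [char] {
        if self.next >= TABLE.len() {
            return &[];
        }
        // Fast path: the entry right after the previous hit matches.
        let (k, v) = TABLE[self.next];
        if k == c {
            self.next += 1;
            return v;
        }
        // Slow path: fall back to binary search and remember the position.
        self.searches += 1;
        match TABLE.binary_search_by_key(&c, |&(key, _)| key) {
            Ok(i) => {
                self.next = i + 1;
                TABLE[i].1
            }
            Err(i) => {
                self.next = i;
                &[]
            }
        }
    }
}

/// Looks up every char of `input` in order and returns the collected fold
/// partners along with the number of binary searches that were needed.
fn fold_all(input: &str) -> (Vec<char>, usize) {
    let mut folder = Folder::new();
    let mut out = Vec::new();
    for c in input.chars() {
        out.extend_from_slice(folder.mapping(c));
    }
    (out, folder.searches)
}

fn main() {
    let (folded, searches) = fold_all("ABCabc");
    println!("{:?} found with {} binary searches", folded, searches);
}
```

In this sketch a run of consecutive table keys never reaches the slow path, while each codepoint that is missing from the table still costs one binary search.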

I didn't do the "Removes unnecessary iterations" optimization though.
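
One possible reading of that remaining optimization, sketched under assumptions: when a lookup misses, the `Err(i)` index from the binary search already names the next key in the table, so the loop can jump straight to it instead of visiting every codepoint in between. `TABLE` and `fold_range` are hypothetical, and for brevity this version does a binary search on every step rather than combining the jump with the adjacency check.

```rust
/// Hypothetical miniature fold table: sorted keys, each with its partners.
const TABLE: &[(char, &[char])] = &[
    ('A', &['a']),
    ('B', &['b']),
    ('C', &['c']),
    ('a', &['A']),
    ('b', &['B']),
    ('c', &['C']),
];

/// Collects the fold partners of every codepoint in `[start, end]` and
/// returns them with the number of lookups performed.
fn fold_range(start: char, end: char) -> (Vec<char>, usize) {
    let mut out = Vec::new();
    let mut lookups = 0;
    let mut cp = u32::from(start);
    let end = u32::from(end);
    while cp <= end {
        // Skip values that are not Unicode scalar values (surrogates).
        let c = match char::from_u32(cp) {
            Some(c) => c,
            None => {
                cp += 1;
                continue;
            }
        };
        lookups += 1;
        match TABLE.binary_search_by_key(&c, |&(key, _)| key) {
            Ok(i) => {
                out.extend_from_slice(TABLE[i].1);
                cp += 1;
            }
            Err(i) => match TABLE.get(i) {
                // Miss: jump directly to the next key in the table.
                Some(&(next_key, _)) => cp = u32::from(next_key),
                // Miss past the last key: nothing left to fold.
                None => break,
            },
        }
    }
    (out, lookups)
}

fn main() {
    // 'A'..='c' spans 35 codepoints, but only the keys and one miss are visited.
    let (folded, lookups) = fold_range('A', 'c');
    println!("{:?} found with {} lookups", folded, lookups);
}
```

Because the key returned on a miss is always greater than the codepoint that missed, the loop strictly advances and terminates.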

BurntSushi added a commit that referenced this pull request Mar 6, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class, despite not having changed, goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Mar 6, 2023
This rewrites how Unicode simple case folding works. Instead of just
defining a single function and expecting callers to deal with the
fallout, we now define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers look up codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
@addisoncrump
Contributor Author

addisoncrump commented Mar 6, 2023

Awesome! I will close this PR in favour of these changes. It appears that the changes do dramatically improve the performance, but not quite at the ratio presented originally:

$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real    0m7,437s
user    0m7,416s
sys     0m0,021s

Becomes:

$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real    0m0,087s
user    0m0,060s
sys     0m0,024s

(perhaps not the most scientific result, but I do not have the original machine I tested on anymore 🙂)

@BurntSushi
Member

Ah no I'd like to leave this PR open. It should get closed automatically once my own changes land.

That new ratio looks good to me.

We can continue to improve here too. With the new case folder, I think there's more room to optimize it, including with your additional trick.

BurntSushi added a commit that referenced this pull request Mar 15, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Mar 15, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Mar 15, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Mar 15, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Mar 15, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Mar 15, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Mar 20, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Mar 20, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Mar 21, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Mar 21, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Apr 15, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Apr 15, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Apr 15, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Apr 15, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Apr 17, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Apr 17, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Apr 17, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
BurntSushi added a commit that referenced this pull request Apr 17, 2023
This rewrites how Unicode simple case folding worked. Instead of just
defining a single function and expecting callers to deal with the
fallout, we know define a stateful type that "knows" about the structure
of the case folding table. For example, it now knows enough to avoid
binary search lookups in most cases. All we really have to do is require
that callers lookup codepoints in sequence, which is perfectly fine for
our use case.

Ref #893
BurntSushi added a commit that referenced this pull request Apr 17, 2023
It turns out that it's not too hard to get HIR translation to run pretty
slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class
subtracted from it 36 times. For each subtraction, the resulting
class--despite it not having changed---goes through Unicode case folding
again. This in turn slows things way down.

We introduce a fairly basic optimization that basically keeps track of
whether an interval set has been folded or not. The idea was taken from
PR #893, but was tweaked slightly. The magic of how it works is that if
two interval sets have already been folded, then they retain that
property after any of the set operations: negation, union, difference,
intersection and symmetric difference. So case folding should generally
only need to be run once for each "base" class, but then not again as
operations are performed.

Some benchmarks were added to rebar (which isn't public yet at time of
writing).

Closes #893
@addisoncrump addisoncrump deleted the hir-folding-opts branch April 18, 2023 14:29
BurntSushi added a commit that referenced this pull request Apr 20, 2023
1.8.0 (2023-04-20)
==================
This is a sizeable release that will soon be followed by another sizeable
release. Combined, the two will close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work
for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently released 1.0
version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released
`0.7` version. The changes to `regex-syntax` principally revolve around a
rewrite of its literal extraction code and a number of simplifications and
optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

* [FEATURE #501](#501):
Permit many more characters to be escaped, even if they have no significance.
More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
escaped. Also, a new routine, `is_escapeable_character`, has been added to
`regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547):
Add `Regex::captures_at`. This fills a hole in the API, but doesn't otherwise
introduce any new expressive power.
* [FEATURE #595](#595):
Capture group names are now Unicode-aware. They can now begin with either a `_`
or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
`]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810):
Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905):
Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908):
A new method, `Regex::static_captures_len`, has been added which returns the
number of capture groups in the pattern if and only if every possible match
always contains the same number of matching groups.
* [FEATURE #955](#955):
Named captures can now be written as `(?<name>re)` in addition to
`(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never
matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely
re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
cases. It's difficult to characterize exactly which patterns this might impact,
but if there are a small number of longish (>= 4 bytes) prefix literals, then
it might be faster than before.

Bug fixes:

* [BUG #514](#514):
Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516),
[#731](#731):
Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610):
Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625):
Clarify that `SetMatches::len` does not (regrettably) refer to the number of
matches in the set.
* [BUG #660](#660):
Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738),
[#950](#950):
Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747):
Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835):
Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846):
Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854):
Clarify that `regex::Regex` searches as if the haystack is a sequence of
Unicode scalar values.
* [BUG #884](#884):
Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893):
Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895):
Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942):
Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965):
Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975):
Clarify documentation for `\pX` syntax.
@BurntSushi BurntSushi mentioned this pull request Apr 20, 2023
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this pull request May 2, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.7.3` -> `1.8.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#181-2023-04-21)

This is a patch release that fixes a bug where a regex match could be reported
where none was found. Specifically, the bug occurs when a pattern contains some
literal prefixes that could be extracted *and* an optional word boundary in the
prefix.

Bug fixes:

-   [BUG #981](rust-lang/regex#981):
    Fix a bug where a word boundary could interact with prefix literal
    optimizations and lead to a false positive match.

</details>


Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1874
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>