expose more knobs #166

Closed
BurntSushi opened this issue Feb 15, 2016 · 1 comment

@BurntSushi
Member

Regex currently has two constructors, Regex::new and Regex::with_size_limit. The latter is the same as the former, except that it allows one to bound the size of the compiled program. The idea here is to control how much memory is used when a regex is compiled from an untrusted source.

There are other knobs that seem useful for exposing to the user:

  • The recursion limit used in regex-syntax for recursively simplifying a regex. (It'd be nice to move this to a stack on the heap, but it seems tricky.)
  • The cache size used in the lazy DFA. For regexes that create lots of distinct states, it's possible to realize big gains if you're willing to spend the memory to do it. Currently, it is set to a constant of ~2MB.

There are other knobs that maybe shouldn't be exposed, but could:

  • Control the budget for extracting literal prefixes.
  • Control whether the DFA is used. (There exist regexes and inputs where avoiding the lazy DFA is actually faster, but it's probably hard to know when this is the case without some experience with the internals of this crate.)
  • Control whether common suffixes are factored out of compiled programs. This could reduce compilation times at the expense of bigger programs.

It's not clear what, if any, of these things should be exposed. I feel like the knobs that control memory bounds should be accessible to callers, because when you need them, you really need them.

@BurntSushi BurntSushi added this to the 1.0 milestone Feb 15, 2016
@BurntSushi
Member Author

I've given this some thought, and this is where I'm currently at:

  1. There should be a struct with no public fields that permits setting various knobs that can be applied to the construction of a Regex. Knobs can be set using methods. This permits expanding the set of knobs in the future in an API-compatible way.
  2. We should probably start with just two knobs: the existing compile size limit and a new knob that controls how much cache the lazy DFA is allowed to use.
  3. This probably means that Regex will continue to need an alternate constructor, much like how Vec or HashMap have alternate constructors. For example, Regex::with_config or something.
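
The three points above can be sketched with the standard library alone. This is only an illustration of the proposed shape, not the crate's actual API: `RegexConfig`, `Pattern`, `BuildError`, and `dfa_cache_size` are hypothetical names, the default values are placeholders (only the ~2MB cache figure comes from this issue), and the "compiled size" is faked with the length of the pattern string.

```rust
/// Hypothetical config type. Its fields are private, so more knobs can be
/// added later without breaking existing callers.
#[derive(Clone, Debug)]
pub struct RegexConfig {
    size_limit: usize,
    dfa_cache_size: usize,
}

impl RegexConfig {
    /// Placeholder defaults; the ~2MB cache figure is the one quoted above.
    pub fn new() -> RegexConfig {
        RegexConfig {
            size_limit: 10 * (1 << 20),
            dfa_cache_size: 2 * (1 << 20),
        }
    }

    /// Knob 1: bound the size of the compiled program, in bytes.
    pub fn size_limit(mut self, bytes: usize) -> RegexConfig {
        self.size_limit = bytes;
        self
    }

    /// Knob 2: bound the memory the lazy DFA cache may use, in bytes.
    pub fn dfa_cache_size(mut self, bytes: usize) -> RegexConfig {
        self.dfa_cache_size = bytes;
        self
    }
}

/// Stand-in for a compiled regex; no real compilation happens here.
#[derive(Debug)]
pub struct Pattern {
    source: String,
    config: RegexConfig,
}

#[derive(Debug, PartialEq)]
pub enum BuildError {
    TooBig { limit: usize },
}

impl Pattern {
    /// The plain constructor uses the default configuration.
    pub fn new(source: &str) -> Result<Pattern, BuildError> {
        Pattern::with_config(source, RegexConfig::new())
    }

    /// The alternate constructor, in the spirit of `Vec::with_capacity`.
    pub fn with_config(source: &str, config: RegexConfig) -> Result<Pattern, BuildError> {
        // Assumption: the pattern length stands in for the compiled size.
        if source.len() > config.size_limit {
            return Err(BuildError::TooBig { limit: config.size_limit });
        }
        Ok(Pattern { source: source.to_string(), config })
    }

    pub fn as_str(&self) -> &str {
        &self.source
    }

    pub fn dfa_cache_size(&self) -> usize {
        self.config.dfa_cache_size
    }
}
```

A caller would write something like `Pattern::with_config("a+b", RegexConfig::new().size_limit(1 << 20))`. Adding a third knob later only means adding another method to `RegexConfig`.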

BurntSushi added a commit that referenced this issue Apr 29, 2016
This commit contains a new sub-crate called `regex-capi` which provides
a C library called `rure`.

A new `RegexBuilder` type was also added to the Rust API proper, which
permits both users of C and Rust to tweak various knobs on a `Regex`.
This fixes issue #166.

Since it's likely that this API will be used to provide bindings to
other languages, I've created bindings to Go as a proof of concept:
https://github.com/BurntSushi/rure-go --- to my knowledge, the wrapper
has as little overhead as it can. It was in particular important for the
C library to not store any pointers provided by the caller, as this can
be problematic in languages with managed runtimes and a moving GC.

The C API doesn't expose `RegexSet` and a few other convenience functions
such as splitting or replacing. That can be future work.

Note that the regex-capi crate requires Rust 1.9, since it uses
`panic::catch_unwind`.

This also includes tests of basic API functionality and a commented
example. Both should now run as part of CI.
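
The commit message does not say how `panic::catch_unwind` is used. A common reason to use it in a C-facing library is to stop a Rust panic from unwinding into the C caller. The sketch below shows that general pattern with a hypothetical function, `demo_checked_div`, which is not part of `rure`. A real library would also export the symbol under an unmangled name, which is omitted here.

```rust
use std::panic;

/// Hypothetical C-callable function: writes `a / b` to `out` and returns
/// `true` on success. Returns `false` if `out` is null or if the division
/// panics (for example, on division by zero).
pub extern "C" fn demo_checked_div(a: i32, b: i32, out: *mut i32) -> bool {
    if out.is_null() {
        return false;
    }
    // Run the code that may panic inside `catch_unwind`, so that the panic
    // is turned into an `Err` instead of unwinding across the C boundary.
    let result = panic::catch_unwind(|| a / b);
    match result {
        Ok(value) => {
            // The caller's pointer is written to during the call but never
            // stored, in line with the design goal described above.
            unsafe { *out = value };
            true
        }
        Err(_) => false,
    }
}
```

The caller only ever sees a status flag, so a failure inside the library is reported like any other error. Note that `catch_unwind` only helps when panics unwind; it does nothing if the crate is built with `panic = "abort"`.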
BurntSushi added a commit that referenced this issue May 7, 2016
This is replaced by using RegexBuilder.

Fixes #166
BurntSushi added a commit that referenced this issue May 18, 2016
This is replaced by using RegexBuilder.

Fixes #166
BurntSushi added a commit that referenced this issue Aug 5, 2016
This is replaced by using RegexBuilder.

Fixes #166
@BurntSushi BurntSushi mentioned this issue Dec 31, 2016
bors added a commit that referenced this issue Dec 31, 2016
regex 0.2

0.2.0
=====
This is a new major release of the regex crate, and is an implementation of the
[regex 1.0 RFC](https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md).
We are releasing a `0.2` first, and if there are no major problems, we will
release a `1.0` shortly. For `0.2`, the minimum *supported* Rust version is
1.12.

There are a number of **breaking changes** in `0.2`. They are split into two
types. The first type corresponds to breaking changes in regular expression
syntax. The second type corresponds to breaking changes in the API.

Breaking changes for regex syntax:

* POSIX character classes now require double bracketing. Previously, the regex
  `[:upper:]` would parse as the `upper` POSIX character class. Now it parses
  as the character class containing the characters `:upper:`. The fix to this
  change is to use `[[:upper:]]` instead. Note that variants like
  `[[:upper:][:blank:]]` continue to work.
* The character `[` must always be escaped inside a character class.
* The characters `&`, `-` and `~` must be escaped if any one of them is
  repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all
  equivalent while `[&&]` is illegal. (The motivation for this and the prior
  change is to provide a backwards compatible path for adding character class
  set notation.)
* A `bytes::Regex` now has Unicode mode enabled by default (like the main
  `Regex` type). This means regexes compiled with `bytes::Regex::new` that
  don't have the Unicode flag set should add `(?-u)` to recover the original
  behavior.

Breaking changes for the regex API:

* `find` and `find_iter` now **return `Match` values instead of
  `(usize, usize)`.** `Match` values have `start` and `end` methods, which
  return the match offsets. `Match` values also have an `as_str` method,
  which returns the text of the match itself.
* The `Captures` type now only provides a single iterator over all capturing
  matches, which should replace uses of `iter` and `iter_pos`. Uses of
  `iter_named` should use the `capture_names` method on `Regex`.
* The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant
  is returned when no replacements are made.
* The `Replacer` trait has been completely overhauled. This should only
  impact clients that implement this trait explicitly. Standard uses of
  the `replace` methods should continue to work unchanged.
* The `quote` free function has been renamed to `escape`.
* The `Regex::with_size_limit` method has been removed. It is replaced by
  `RegexBuilder::size_limit`.
* The `RegexBuilder` type has switched from owned `self` method receivers to
  `&mut self` method receivers. Most uses will continue to work unchanged, but
  some code may require naming an intermediate variable to hold the builder.
* The free `is_match` function has been removed. It is replaced by compiling
  a `Regex` and calling its `is_match` method.
* The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied
  on these impls, the fix is to define a wrapper type around `Regex`, impl
  `Deref` on it and provide the necessary impls.
* The `is_empty` method on `Captures` has been removed. This always returns
  `false`, so its use is superfluous.
* The `Syntax` variant of the `Error` type now contains a string instead of
  a `regex_syntax::Error`. If you were examining syntax errors more closely,
  you'll need to explicitly use the `regex_syntax` crate to re-parse the regex.
* The `InvalidSet` variant of the `Error` type has been removed since it is
  no longer used.
* Most of the iterator types have been renamed to match conventions. If you
  were using these iterator types explicitly, please consult the documentation
  for their new names. For example, `RegexSplits` has been renamed to `Split`.
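
The `RegexBuilder` receiver change in the list above is easier to see with a small example. The sketch below uses a hypothetical stand-in (`Builder`, `Settings`) rather than the real `RegexBuilder`, and only assumes the shape described there: setters take `&mut self` and return `&mut Builder`.

```rust
/// Hypothetical stand-in for a builder whose setters take `&mut self`.
#[derive(Debug, Default)]
pub struct Builder {
    size_limit: Option<usize>,
    case_insensitive: bool,
}

/// What the builder produces in this sketch (a real one would compile a regex).
#[derive(Debug, PartialEq)]
pub struct Settings {
    pub size_limit: Option<usize>,
    pub case_insensitive: bool,
}

impl Builder {
    pub fn new() -> Builder {
        Builder::default()
    }

    pub fn size_limit(&mut self, bytes: usize) -> &mut Builder {
        self.size_limit = Some(bytes);
        self
    }

    pub fn case_insensitive(&mut self, yes: bool) -> &mut Builder {
        self.case_insensitive = yes;
        self
    }

    pub fn build(&self) -> Settings {
        Settings {
            size_limit: self.size_limit,
            case_insensitive: self.case_insensitive,
        }
    }
}

/// A single chained expression keeps working: the temporary builder lives
/// until the end of the statement.
pub fn one_expression() -> Settings {
    Builder::new().size_limit(1 << 20).case_insensitive(true).build()
}

/// Setting knobs conditionally needs a named variable. Writing
/// `let b = Builder::new().case_insensitive(true);` and using `b` afterwards
/// is rejected, because `b` would borrow a temporary that is dropped at the
/// end of that statement.
pub fn conditional(limit: Option<usize>) -> Settings {
    let mut builder = Builder::new();
    builder.case_insensitive(true);
    if let Some(bytes) = limit {
        builder.size_limit(bytes);
    }
    builder.build()
}
```

With owned `self` receivers the second function would have to reassign the builder on every call (`builder = builder.size_limit(bytes)`). With `&mut self` the conditional call is a plain statement, at the cost of naming the builder first.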

A number of bugs have been fixed:

* [BUG #151](#151):
  The `Replacer` trait has been changed to permit the caller to control
  allocation.
* [BUG #165](#165):
  Remove the free `is_match` function.
* [BUG #166](#166):
  Expose more knobs (available in `0.1`) and remove `with_size_limit`.
* [BUG #168](#168):
  Iterators produced by `Captures` now have the correct lifetime parameters.
* [BUG #175](#175):
  Fix a corner case in the parsing of POSIX character classes.
* [BUG #178](#178):
  Drop the `PartialEq` and `Eq` impls on `Regex`.
* [BUG #179](#179):
  Remove `is_empty` from `Captures` since it always returns false.
* [BUG #276](#276):
  Position of named capture can now be retrieved from a `Captures`.
* [BUG #296](#296):
  Remove winapi/kernel32-sys dependency on UNIX.
* [BUG #307](#307):
  Fix error on emscripten.
@bors bors closed this as completed in e1a94bb Dec 31, 2016