regex 0.2 #310

BurntSushi · 2016-12-30T06:10:02Z

0.2.0

This is a new major release of the regex crate, and is an implementation of the
regex 1.0 RFC (https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md).
We are releasing a 0.2 first, and if there are no major problems, we will
release a 1.0 shortly. For 0.2, the minimum supported Rust version is 1.12.

There are a number of breaking changes in 0.2. They are split into two
types. The first type corresponds to breaking changes in regular expression
syntax. The second type corresponds to breaking changes in the API.

Breaking changes for regex syntax:

* POSIX character classes now require double bracketing. Previously, the regex
  `[:upper:]` would parse as the `upper` POSIX character class. Now it parses
  as the character class containing the characters `:upper:`. The fix for this
  change is to use `[[:upper:]]` instead. Note that variants like
  `[[:upper:][:blank:]]` continue to work.
* The character `[` must always be escaped inside a character class.
* The characters `&`, `-` and `~` must be escaped if any one of them is
  repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all
  equivalent while `[&&]` is illegal. (The motivation for this and the prior
  change is to provide a backwards compatible path for adding character class
  set notation.)
* A `bytes::Regex` now has Unicode mode enabled by default (like the main
  `Regex` type). This means regexes compiled with `bytes::Regex::new` that
  don't have the Unicode flag set should add `(?-u)` to recover the original
  behavior.

Breaking changes for the regex API:

* `find` and `find_iter` now return `Match` values instead of
  `(usize, usize)`. `Match` values have `start` and `end` methods, which
  return the match offsets. `Match` values also have an `as_str` method,
  which returns the text of the match itself.
* The `Captures` type now only provides a single iterator over all capturing
  matches, which should replace uses of `iter` and `iter_pos`. Uses of
  `iter_named` should use the `capture_names` method on `Regex`.
* The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant
  is returned when no replacements are made.
* The `Replacer` trait has been completely overhauled. This should only
  impact clients that implement this trait explicitly. Standard uses of
  the `replace` methods should continue to work unchanged.
* The `quote` free function has been renamed to `escape`.
* The `Regex::with_size_limit` method has been removed. It is replaced by
  `RegexBuilder::size_limit`.
* The `RegexBuilder` type has switched from owned `self` method receivers to
  `&mut self` method receivers. Most uses will continue to work unchanged, but
  some code may require naming an intermediate variable to hold the builder.
* The free `is_match` function has been removed. It is replaced by compiling
  a `Regex` and calling its `is_match` method.
* The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied
  on these impls, the fix is to define a wrapper type around `Regex`, impl
  `Deref` on it and provide the necessary impls.
* The `is_empty` method on `Captures` has been removed. It always returned
  `false`, so its use was superfluous.
* The `Syntax` variant of the `Error` type now contains a string instead of
  a `regex_syntax::Error`. If you were examining syntax errors more closely,
  you'll need to explicitly use the `regex_syntax` crate to re-parse the regex.
* The `InvalidSet` variant of the `Error` type has been removed since it is
  no longer used.
* Most of the iterator types have been renamed to match conventions. If you
  were using these iterator types explicitly, please consult the documentation
  for their new names. For example, `RegexSplits` has been renamed to `Split`.

A number of bugs have been fixed:

* BUG #151: The `Replacer` trait has been changed to permit the caller to
  control allocation.
* BUG #165: Remove the free `is_match` function.
* BUG #166: Expose more knobs (available in 0.1) and remove `with_size_limit`.
* BUG #168: Iterators produced by `Captures` now have the correct lifetime
  parameters.
* BUG #175: Fix a corner case in the parsing of POSIX character classes.
* BUG #178: Drop the `PartialEq` and `Eq` impls on `Regex`.
* BUG #179: Remove `is_empty` from `Captures` since it always returns false.
* BUG #276: Position of named capture can now be retrieved from a `Captures`.
* BUG #296: Remove winapi/kernel32-sys dependency on UNIX.
* BUG #307: Fix error on emscripten.

It is misleading to suggest that Regex implements equality, since equality is a well defined operation on regular expressions and this particular implementation doesn't correspond to that definition at all. Moreover, I suspect the actual use cases for such an impl are rather niche. A simple newtype+deref should resolve any such use cases. Fixes rust-lang#178
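
The newtype-plus-`Deref` workaround mentioned above could look roughly like the sketch below. `Pattern` is a hypothetical stand-in for `regex::Regex` so that the snippet builds without external crates, and `ComparablePattern` is an illustrative name. Comparing pattern strings is only one possible definition of equality: it is purely syntactic and says nothing about whether two regexes match the same language.

```rust
use std::ops::Deref;

/// Hypothetical stand-in for `regex::Regex` (assumption: only the pattern
/// string is modeled here). Like the real type in 0.2, it has no `PartialEq`.
pub struct Pattern {
    source: String,
}

impl Pattern {
    pub fn new(source: &str) -> Pattern {
        Pattern { source: source.to_string() }
    }

    /// Returns the original pattern string.
    pub fn as_str(&self) -> &str {
        &self.source
    }
}

/// Newtype that opts in to equality on the caller's own terms.
pub struct ComparablePattern(pub Pattern);

impl Deref for ComparablePattern {
    type Target = Pattern;

    // Lets callers use `Pattern` methods directly on the wrapper.
    fn deref(&self) -> &Pattern {
        &self.0
    }
}

impl PartialEq for ComparablePattern {
    // Syntactic equality: two wrappers are equal when their pattern
    // strings are byte-for-byte identical.
    fn eq(&self, other: &ComparablePattern) -> bool {
        self.0.as_str() == other.0.as_str()
    }
}

impl Eq for ComparablePattern {}
```

Because of the `Deref` impl, code that only calls methods on the wrapped value should not need to change when the wrapper is introduced.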

This corrects a gaffe of mine. In particular, both iterator types contain references to a `Captures` *and* the text that was searched, but name only one lifetime. In practice, this means that the shortest lifetime is used, which can be problematic when one is trying to extract submatch text. This also fixes the lifetime annotation on `iter_pos`, which should be tied to the Captures and not the text. It was always possible to work around this by using indices. Fixes rust-lang#168

For example, the regex `[:upper:]` used to correspond to the `upper` ASCII character class, but it now corresponds to the character class containing the characters `:upper:`. Forms like `[[:upper:][:blank:]]` are still accepted. Fixes rust-lang#175

The escaping of &, - and ~ is only required when the characters are repeated adjacently, which should be quite rare. Escaping of [ is always required, unless it appears in the second position of a range. These rules enable us to add character class sets as described in UTS#18 RL1.3 in a backward compatible way.

BurntSushi · 2016-12-31T22:06:57Z

@bors r+

bors · 2016-12-31T22:06:59Z

📌 Commit ac3ab6d has been approved by BurntSushi

bors · 2016-12-31T22:07:06Z

⌛ Testing commit ac3ab6d with merge 52fdae7...

bors · 2016-12-31T22:48:11Z

☀️ Test successful - status-appveyor, status-travis
Approved by: BurntSushi
Pushing 52fdae7 to master...

BurntSushi and others added 20 commits December 30, 2016 01:05

Switch bytes::Regex to using Unicode mode by default.

d44a9f9

Update Replacer trait for Unicode regexes.

ebd26e9

This uses the new Replacer trait essentially as defined in the `bytes` sub-module and described in rust-lang#151. Fixes rust-lang#151

Remove the is_empty method on Captures.

f98219b

It is useless because it will always return false (since every regex has at least one capture group corresponding to the full match). Fixes rust-lang#179

Remove Regex::with_size_limit.

e1a94bb

This is replaced by using RegexBuilder. Fixes rust-lang#166

Remove free is_match function.

cfd887d

It encourages compiling a regex for every use, which can be convenient in some circumstances but deadly for performance. Fixes rust-lang#165

Rename RegexSplits to Splits.

24f86b0

Similarly, rename RegexSplitsN to SplitsN. This follows the convention of all other iterator types. In general, we shouldn't namespace our type names.

Reorganize capture slot handling, but don't make any public API changes.

a6722a3

Rename many of the iterator types.

2632c2f

Mostly, this adds an `Iter` suffix to all of the names.

Use Cow for replacements.

52165d6

If `replace` doesn't find any matches, then it can return the original string unchanged.
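
The idea behind returning `Cow` can be sketched without the regex crate. In the snippet below, `replace_all` is a hypothetical helper in which plain substring search stands in for regex matching; it is not the crate's implementation. The point is only that the no-match path can return the input as `Cow::Borrowed` and skip the allocation.

```rust
use std::borrow::Cow;

/// Replaces every occurrence of `needle` in `text` with `replacement`.
/// Hypothetical helper: substring search stands in for a regex search.
pub fn replace_all<'t>(text: &'t str, needle: &str, replacement: &str) -> Cow<'t, str> {
    // An empty needle is treated as "no match" in this sketch.
    if needle.is_empty() || !text.contains(needle) {
        // Nothing to replace: hand back the input without allocating.
        return Cow::Borrowed(text);
    }
    // At least one match: build a new string.
    Cow::Owned(text.replace(needle, replacement))
}
```

Callers that need a `String` either way can call `into_owned()` on the result; callers that only read the text can use it through `Deref<Target = str>` and never pay for a copy when nothing matched.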

Update the Error type.

2805811

This removes the InvalidSet variant, which is no longer used, and no longer exposes the `regex_syntax::Error` type, instead exposing it as a string.

find/find_iter now return a Match instead of (usize, usize).

384e937

This also removes Captures.{at,pos} and replaces them with Captures.get, which now returns a Match. Similarly, Captures.name returns a Match as well. Fixes rust-lang#276

Remove the submatch iterators.

fab4069

All use cases can be replaced with Regex::capture_names.

Fix tests.

1f7f5c9

Switch to more idiomatic builder definition.

403b27a

Specifically, use mutable references instead of passing ownership.
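
A minimal sketch of what `&mut self` receivers mean for callers, using a hypothetical `PatternBuilder` and `PatternConfig` (both names, the fields, and the default limit are assumptions made for illustration; this is not `RegexBuilder` itself). A one-expression chain still works, but configuration spread over several statements needs the builder bound to a variable.

```rust
/// Hypothetical result of building; stands in for a compiled regex.
#[derive(Debug, PartialEq)]
pub struct PatternConfig {
    pub pattern: String,
    pub size_limit: usize,
    pub case_insensitive: bool,
}

/// Hypothetical builder whose setters take `&mut self` and return `&mut Self`.
pub struct PatternBuilder {
    pattern: String,
    size_limit: usize,
    case_insensitive: bool,
}

impl PatternBuilder {
    pub fn new(pattern: &str) -> PatternBuilder {
        PatternBuilder {
            pattern: pattern.to_string(),
            size_limit: 4096, // arbitrary default for this sketch
            case_insensitive: false,
        }
    }

    pub fn size_limit(&mut self, bytes: usize) -> &mut PatternBuilder {
        self.size_limit = bytes;
        self
    }

    pub fn case_insensitive(&mut self, yes: bool) -> &mut PatternBuilder {
        self.case_insensitive = yes;
        self
    }

    /// Borrows the builder, so it can be reused after building.
    pub fn build(&self) -> PatternConfig {
        PatternConfig {
            pattern: self.pattern.clone(),
            size_limit: self.size_limit,
            case_insensitive: self.case_insensitive,
        }
    }
}

/// Conditional configuration: the builder must live in a named variable
/// because the setters only borrow it.
pub fn configure(pattern: &str, limit: Option<usize>) -> PatternConfig {
    let mut builder = PatternBuilder::new(pattern);
    builder.case_insensitive(true);
    if let Some(bytes) = limit {
        builder.size_limit(bytes);
    }
    builder.build()
}
```

With owned `self` receivers, the `if let` branch would have to reassign the builder; with `&mut self` it just calls the setter.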

Rename iterator types to match std conventions.

3f1fde5

Changed the name of quote to escape.

8ee9262

This was referenced Dec 30, 2016

regex 1.0 #230

Closed

Changed quote to escape #294

Closed

BurntSushi added 2 commits December 30, 2016 01:45

Add SubCaptureMatches iterator on Captures.

374f139

Remove custom extend_from_slice implementation.

c4faddf

This was added because regex 0.1 supports Rust 1.3+. But we can now assume Rust 1.12+, which has Vec::extend_from_slice. Yay for less unsafe!

BurntSushi force-pushed the rfc branch 2 times, most recently from ca60bf9 to f8903d9 Compare December 30, 2016 21:46

Fix performance bug with Match.

66c6ddf

When building a Match, we should avoid storing a subslice and instead store the full string. We can punt subslicing to access. This seems to get LLVM to optimize tight loops better when the subslice isn't needed.
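
The layout described here can be illustrated with a small model. The `Match` struct and `find` function below are hypothetical stand-ins written only from the commit message (whether the crate's internal fields look like this is an assumption), with substring search in place of a regex search. The struct keeps the full haystack plus two offsets and slices only when `as_str` is called.

```rust
/// Hypothetical model of a match: the whole searched text plus offsets.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Match<'t> {
    text: &'t str,
    start: usize,
    end: usize,
}

impl<'t> Match<'t> {
    pub fn new(text: &'t str, start: usize, end: usize) -> Match<'t> {
        Match { text, start, end }
    }

    /// Byte offset where the match begins.
    pub fn start(&self) -> usize {
        self.start
    }

    /// Byte offset just past the end of the match.
    pub fn end(&self) -> usize {
        self.end
    }

    /// Subslicing is deferred to this accessor, so loops that only need
    /// offsets never compute the slice.
    pub fn as_str(&self) -> &'t str {
        &self.text[self.start..self.end]
    }
}

/// Substring search standing in for a regex search.
pub fn find<'t>(text: &'t str, needle: &str) -> Option<Match<'t>> {
    text.find(needle)
        .map(|start| Match::new(text, start, start + needle.len()))
}
```

Code that previously destructured a `(usize, usize)` tuple would call `start()` and `end()` instead, and can use `as_str()` rather than slicing the haystack by hand.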

BurntSushi force-pushed the rfc branch 2 times, most recently from e818f7e to cc56d60 Compare December 31, 2016 17:57

Add RegexSetBuilder.

0c59d41

This API mirrors RegexBuilder, but for multiple patterns. Also, modify regex-capi to use RegexSetBuilder internally.

Documentation updates and clean ups.

63132b5

BurntSushi force-pushed the rfc branch from cc56d60 to a52ede5 Compare December 31, 2016 21:39

BurntSushi self-assigned this Dec 31, 2016

BurntSushi force-pushed the rfc branch from a52ede5 to d558d3b Compare December 31, 2016 21:43

Update github links.

f094d15

BurntSushi force-pushed the rfc branch from d558d3b to 3edef85 Compare December 31, 2016 21:50

Bump versions everywhere and update CHANGELOG.

ac3ab6d

Fixes rust-lang#296, Fixes rust-lang#307

BurntSushi force-pushed the rfc branch from 3edef85 to ac3ab6d Compare December 31, 2016 22:02

bors merged commit ac3ab6d into rust-lang:master Dec 31, 2016

BurntSushi deleted the rfc branch December 31, 2016 23:47
