syntax: loosen ASCII compatible rules
Previously, patterns like `(?-u:☃)` were banned under the logic that
Unicode scalar values shouldn't be available unless Unicode mode is
enabled. But since patterns are required to be UTF-8, there really isn't
any difficulty in just interpreting Unicode literals as their
corresponding UTF-8 encoding.
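
As a rough illustration of what "interpreting a Unicode literal as its UTF-8 encoding" means, here is a std-only sketch. It does not use regex-syntax at all; the helper names `literal_bytes` and `find_bytes` are hypothetical and exist only for this example.

```rust
/// Hypothetical helper (not a regex-syntax API): the byte sequence a
/// literal `char` stands for when matching is byte oriented, i.e. its
/// UTF-8 encoding.
fn literal_bytes(ch: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    ch.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: naive search for `needle` in `haystack`,
/// returning the byte offset of the first match. The haystack does not
/// need to be valid UTF-8.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, so handle the empty needle up front.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

With these helpers, `literal_bytes('☃')` is the three bytes `E2 98 83`, and searching for them in `b"\xFFa\xE2\x98\x83"` finds offset 2 even though that haystack is not valid UTF-8.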

Note though that Unicode character classes, even things like
`(?-u:[☃])`, remain banned. We probably could make character classes
work too, but it's unclear how that plays with ASCII compatible mode
requiring that a single byte is the fundamental atom of matching
(whereas Unicode mode requires that Unicode scalar values are the
fundamental atom of matching).
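
The single-byte constraint on classes can be sketched as follows. This is a minimal std-only sketch, assuming a byte class may only hold individual bytes; `class_literal_byte` and `NotSingleByte` are hypothetical names, not regex-syntax APIs, and the real translator reports its own error kind.

```rust
use std::convert::TryFrom;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
struct NotSingleByte(char);

/// A byte oriented class matches exactly one byte at a time, so a
/// literal is accepted only if it is ASCII (one byte in UTF-8).
fn class_literal_byte(ch: char) -> Result<u8, NotSingleByte> {
    if ch.is_ascii() {
        // Cannot fail: every ASCII scalar value fits in a u8.
        Ok(u8::try_from(ch).unwrap())
    } else {
        // Note that `u8::try_from` alone is not a sufficient check: it
        // also succeeds for U+0080..=U+00FF, which are two bytes in UTF-8.
        Err(NotSingleByte(ch))
    }
}
```

Under this sketch, `a` maps to the byte `0x61`, while both `☃` and `ÿ` are rejected because their UTF-8 encodings span more than one byte.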
BurntSushi committed Oct 14, 2023
1 parent cfd0ca2 commit 04f5d7b
Showing 2 changed files with 12 additions and 38 deletions.
46 changes: 10 additions & 36 deletions regex-syntax/src/hir/translate.rs
@@ -388,17 +388,10 @@ impl<'t, 'p> Visitor for TranslatorI<'t, 'p> {
             }
             Ast::Literal(ref x) => match self.ast_literal_to_scalar(x)? {
                 Either::Right(byte) => self.push_byte(byte),
-                Either::Left(ch) => {
-                    if !self.flags().unicode() && ch.len_utf8() > 1 {
-                        return Err(
-                            self.error(x.span, ErrorKind::UnicodeNotAllowed)
-                        );
-                    }
-                    match self.case_fold_char(x.span, ch)? {
-                        None => self.push_char(ch),
-                        Some(expr) => self.push(HirFrame::Expr(expr)),
-                    }
-                }
+                Either::Left(ch) => match self.case_fold_char(x.span, ch)? {
+                    None => self.push_char(ch),
+                    Some(expr) => self.push(HirFrame::Expr(expr)),
+                },
             },
             Ast::Dot(ref span) => {
                 self.push(HirFrame::Expr(self.hir_dot(**span)?));
@@ -872,8 +865,8 @@ impl<'t, 'p> TranslatorI<'t, 'p> {
             })?;
             Ok(Some(Hir::class(hir::Class::Unicode(cls))))
         } else {
-            if c.len_utf8() > 1 {
-                return Err(self.error(span, ErrorKind::UnicodeNotAllowed));
+            if !c.is_ascii() {
+                return Ok(None);
             }
             // If case folding won't do anything, then don't bother trying.
             match c {
@@ -1211,9 +1204,8 @@ impl<'t, 'p> TranslatorI<'t, 'p> {
                 match self.ast_literal_to_scalar(ast)? {
                     Either::Right(byte) => Ok(byte),
                     Either::Left(ch) => {
-                        let cp = u32::from(ch);
-                        if cp <= 0x7F {
-                            Ok(u8::try_from(cp).unwrap())
+                        if ch.is_ascii() {
+                            Ok(u8::try_from(ch).unwrap())
                         } else {
                             // We can't feasibly support Unicode in
                             // byte oriented classes. Byte classes don't
@@ -1661,16 +1653,7 @@ mod tests {
         assert_eq!(t_bytes(r"(?-u)\x61"), hir_lit("a"));
         assert_eq!(t_bytes(r"(?-u)\xFF"), hir_blit(b"\xFF"));
 
-        assert_eq!(
-            t_err("(?-u)☃"),
-            TestError {
-                kind: hir::ErrorKind::UnicodeNotAllowed,
-                span: Span::new(
-                    Position::new(5, 1, 6),
-                    Position::new(8, 1, 7)
-                ),
-            }
-        );
+        assert_eq!(t("(?-u)☃"), hir_lit("☃"));
         assert_eq!(
             t_err(r"(?-u)\xFF"),
             TestError {
@@ -1748,16 +1731,7 @@ mod tests {
         );
         assert_eq!(t_bytes(r"(?i-u)\xFF"), hir_blit(b"\xFF"));
 
-        assert_eq!(
-            t_err("(?i-u)β"),
-            TestError {
-                kind: hir::ErrorKind::UnicodeNotAllowed,
-                span: Span::new(
-                    Position::new(6, 1, 7),
-                    Position::new(8, 1, 8),
-                ),
-            }
-        );
+        assert_eq!(t("(?i-u)β"), hir_lit("β"),);
     }
 
     #[test]
4 changes: 2 additions & 2 deletions src/bytes.rs
@@ -68,8 +68,8 @@ bytes:
 1. The `u` flag can be disabled even when disabling it might cause the regex to
 match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
 "ASCII compatible" mode.
-2. In ASCII compatible mode, neither Unicode scalar values nor Unicode
-character classes are allowed.
+2. In ASCII compatible mode, Unicode character classes are not allowed. Literal
+Unicode scalar values outside of character classes are allowed.
 3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
 revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
 to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
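
For readers unfamiliar with the ASCII definitions mentioned in item 3, they can be sketched as plain byte predicates. This is a std-only illustration, assuming the usual ASCII classes `[0-9A-Za-z_]`, `[0-9]` and `[\t\n\v\f\r ]`; the function names are hypothetical and are not part of the regex crate.

```rust
/// `\w` in ASCII compatible mode, assumed to be `[0-9A-Za-z_]`.
fn is_ascii_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// `\d` in ASCII compatible mode, assumed to be `[0-9]`.
fn is_ascii_digit_byte(b: u8) -> bool {
    b.is_ascii_digit()
}

/// `\s` in ASCII compatible mode, assumed to be `[\t\n\v\f\r ]`.
/// `u8::is_ascii_whitespace` is not used because it excludes
/// vertical tab (0x0B).
fn is_ascii_space(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ')
}
```

None of these predicates accept a byte above `0x7F`, so the individual bytes of a multi-byte literal such as `β` are never word, digit or space bytes in this mode.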