Set Regexp encoding flags #2486

nirvdrum · 2024-02-22T22:05:46Z

This PR sets the encoding flags on Regexp instances much like we do for String and Symbol. Unlike String and Symbol, there's a third way to set a Regexp encoding: encoding modifiers. There are four encoding modifiers:

Modifier	Encoding
n	ASCII-8BIT
u	UTF-8
e	EUC-JP
s	Windows-31J

Background

The meaning of these flags appears to have shifted over time. In my copy of "The Ruby Programming Language", which covers Ruby 1.8 and the soon-to-be-released 1.9, the n modifier meant "ASCII" and the s modifier meant "Shift_JIS". I think the n flag still has some of its heritage as it will adjust the regex encoding to be US-ASCII, rather than ASCII-8BIT, in certain circumstances.

String, Symbol, and Regexp all have a flag to force the object's encoding to US-ASCII, ASCII-8BIT, or UTF-8. Rather than introduce new flag values to force a Regexp encoding to either EUC-JP or Windows-31J, I opted to use the encoding modifiers to indicate the Regexp encoding. To avoid any confusion, if an encoding modifier option is used, then the "force encoding" flags are only set if they would result in a different encoding. I.e., the "/u" option would never result in the "force UTF-8" flag being set. But, the "/n" option may result in the "force US-ASCII" flag being set.

Client Usage

To determine the Regexp encoding would require flag checks like the following (and used in encoding_test.rb):

if regexp.forced_utf8_encoding?
  Encoding::UTF_8
elsif regexp.forced_binary_encoding?
  Encoding::ASCII_8BIT
elsif regexp.forced_us_ascii_encoding?
  Encoding::US_ASCII
elsif regexp.ascii_8bit?
  Encoding::ASCII_8BIT
elsif regexp.utf_8?
  Encoding::UTF_8
elsif regexp.euc_jp?
  Encoding::EUC_JP
elsif regexp.windows_31j?
  Encoding::Windows_31J
else
  encoding
end

The order here matters. The "forced encoding" flags must be checked before the encoding modifiers. The encoding modifiers may differ, but must be kept around for Ruby semantics. The order of the "force encoding" flags and the encoding modifiers is irrelevant, however, since each flag value is mutually exclusive within that class.

Design Notes

Regexp encoding validation is rather complex. I'm not sure that there's a unifying rule coming from CRuby. I'm sure there's a philosophy on what should be an error but some of the details appear to be happenstance. While I did consult the existing implementations of CRuby, JRuby, and TruffleRuby for what needs to be done, I mostly arrived at the rules embodied in this code by extensive testing. This PR adds rudimentary Regexp validation. To be 100% accurate would require scanning the bytes of the source string and performing validation against the associated encoding. Given Prism doesn't currently do that at large, I decided not to add it for this PR. However, there are other validation issues trivially detectable by mismatched combination of encoding modifiers and source file or Regexp source string encoding. In cases where I could I added that validation.

In several places I added placeholder code. These sections use intent-hinting variable names but set the value conservatively to match Prism's current semantics. This was done largely to help me keep track of the complex set of rules involved in validation. However, I retained the code because I think it'll make subsequent improvements much easier to slot into the overarching design. If this is controversial I can safely remove that code. It would necessitate the removal of some error messages that are otherwise unused in Prism but used by CRuby during Regexp validation.

I had initially planned on deferring validation to a subsequent pass. However, I found it very confusing to have flags set to nonsense values that should be syntax errors that we weren't yet handling. Adding this rudimentary validation avoids those situations. I've tried to avoid adjusting any of the flags if there is a syntax error, but given validation can happen in stages I think Prism as a whole should not make any guarantees about flag values in the presence of syntax errors.

I also combined the computation of a Regexp's encoding flags and the Regexp validation into a joint function. While this may not be architecturally pure, I think it's much easier to follow the convoluted set of rules that determine the resulting encoding. This is in contrast to how we handle String and Symbol. In both of those other cases the validation is much simpler: does this byte array map into the associate encoding? While we could split the two processes for Regexp, we would end up in a situation where we could have to process some values multiple times or set flags and have to clear them later on.

Finally, while this PR does not compute the validity of a Regexp source string in a given encoding, it does scan the string to determine if it is ASCII-only. In some error cases this may happen multiple times. I expect all of this will clean up when we add the code range calculation to strings in Prism. If it's required I can compute the value once and pass it around, but we should retain the laziness and that complicates things when passing the value across function boundaries.

enebo · 2024-02-23T14:23:17Z

@nirvdrum The flags are needed for clients to know how an encoding was arrived at but I feel like we should consider storing the arrived at encoding. Checking n state flags to figure out encoding feels like something is missing like a value representing the encoding it is. Or can we be more clever here and use flag value with a lookup table (for Ruby lib if these conditionals add clarity I am not against them staying there but I am asking about how JRuby can do this without that many conditionals)?

(not for this PR but for when CR is calculated):
If we can't table lookup then we should consider how to make encoding a single lookup value to remove the need for the if/else chain. We cannot CR until we know encoding so it is probably the right time to add encoding value (since parser already arrived at actual encoding).

nirvdrum · 2024-02-23T15:58:42Z

The flags are needed for clients to know how an encoding was arrived at but I feel like we should consider storing the arrived at encoding.

@enebo I agree, but that's a larger design change. If we go that route, I think it makes sense to do it for String and Symbol as well. This PR adds missing functionality adhering to the current API philosophy and doesn't add any new flags. I had previously spoken with @kddnewton about the API for determining the Regexp encoding and he was in favor of cutting that out for this PR to help keep it scoped. With that said, I think this PR helps to frame that discussion better since we can clearly see what is required of clients if we don't have an encoding field on RegularExpressionNode.

Another little quirk is the RegularExpressioNode#unescaped field is different in semantics from String#unescaped and Symbol#unescaped. It's basically what you'd get with Regexp#source, which may translate the original string in ways not done for either String or Symbol. I think it's worth renaming the field for clarity, but I also deferred that as a future work item to help keep this PR free of API design changes.

kddnewton

Overall this is really great, I'm excited to merge it and the approach looks good.

For the test-all failures, I think you're accidentally parsing strings instead of regexps. it looks like the source variable in those methods looks like:

"# encoding: ASCII-8BIT\n\"/abc/n\""

For the memcheck failure, I think it's because you're using %s in the error messages but not necessarily giving it null-terminated strings. For other errors I've typically been using %.*s and explicitly giving it the string length first.

Those are the only pressing things. Other than that, right now this PR is modifying all token buffers and doing a bunch of unnecessary work in the case of strings, heredocs, and lists. I would like to eliminate that work. Here are a couple of ways around that:

Duplicate the pm_token_buffer_* functions with a new variant that also accepts a regexp buffer.
Add a new kind of token buffer that has a pm_token_buffer_t as its first member so that you can cast between them. Add a bool to pm_buffer_token_t so that it knows if it's actually a pm_regexp_buffer_t and can upcast.
Create two token buffers, and within the regexp lexing call it for both.

I'm fine with any of these solutions, but definitely want to avoid that extra work.

In terms of stuff on top of this, the big API change I would like to make is the thing that @enebo mentions here of having a new kind of field that is an encoding. But I would prefer that not to be in this PR because it's just going to make reviewing that much harder.

kddnewton

Just marking this as "request changes"

nirvdrum · 2024-03-01T19:21:49Z

I've fixed the make test-all issues. It turns out one of the variables I was using for a define_method loop was being used as something of a global variable elsewhere. Depending on execution order the tests could end up working with incorrect data. I've renamed the variable to avoid that.

The memory leaks are taken care of now, as well. I hadn't realized that parser->current_string isn't cleaned up on the next pass. I'm reasonably certain that parser->current_regular_expression_source as well as the associated token buffer are only used for regular expressions now. The valgrind tests would raise an issue if parser->current_regular_expression_source were being populated and unused by String and Symbol parsing.

src/prism.c

…sion so we can accurately set its encoding flags.

nirvdrum force-pushed the regexp-encoding-fix branch 3 times, most recently from 854f3f3 to dfdb1fd Compare February 22, 2024 22:30

kddnewton requested changes Feb 27, 2024

View reviewed changes

nirvdrum force-pushed the regexp-encoding-fix branch 3 times, most recently from d12816a to dee9289 Compare March 1, 2024 08:28

nirvdrum marked this pull request as ready for review March 1, 2024 19:21

kddnewton requested changes Mar 7, 2024

View reviewed changes

src/prism.c Outdated Show resolved Hide resolved

kddnewton force-pushed the regexp-encoding-fix branch from 62620fb to a2b7571 Compare March 7, 2024 21:53

nirvdrum added 3 commits March 7, 2024 18:15

Track both the unescaped bytes and source string for a regular expres…

dc6dd3a

…sion so we can accurately set its encoding flags.

Extract a magic value to a named constant.

54d14a3

Add initial implementation of Regexp validation.

6bf1b8e

kddnewton force-pushed the regexp-encoding-fix branch from a2b7571 to 6bf1b8e Compare March 7, 2024 23:15

Extract out pm_regexp_token_buffer_t

7bbe422

kddnewton force-pushed the regexp-encoding-fix branch 3 times, most recently from 932161a to 235f4dd Compare March 8, 2024 16:59

Fix up regexp memory leaks

4dc58a5

kddnewton force-pushed the regexp-encoding-fix branch from 235f4dd to 4dc58a5 Compare March 8, 2024 18:12

kddnewton approved these changes Mar 8, 2024

View reviewed changes

kddnewton merged commit 5f6854c into ruby:main Mar 8, 2024
54 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set Regexp encoding flags #2486

Set Regexp encoding flags #2486

nirvdrum commented Feb 22, 2024

enebo commented Feb 23, 2024

nirvdrum commented Feb 23, 2024 •

edited

Loading

kddnewton left a comment

kddnewton left a comment

nirvdrum commented Mar 1, 2024

Set Regexp encoding flags #2486

Set Regexp encoding flags #2486

Conversation

nirvdrum commented Feb 22, 2024

Background

Client Usage

Design Notes

enebo commented Feb 23, 2024

nirvdrum commented Feb 23, 2024 • edited Loading

kddnewton left a comment

Choose a reason for hiding this comment

kddnewton left a comment

Choose a reason for hiding this comment

nirvdrum commented Mar 1, 2024

nirvdrum commented Feb 23, 2024 •

edited

Loading