Set Regexp encoding flags #2486
854f3f3 to dfdb1fd
@nirvdrum The flags are needed for clients to know how an encoding was arrived at, but I feel like we should consider storing the arrived-at encoding. Checking n state flags to figure out the encoding feels like something is missing, like a value representing the encoding itself. Or can we be more clever here and use the flag value with a lookup table? (For the Ruby lib, if these conditionals add clarity I am not against them staying there, but I am asking how JRuby can do this without that many conditionals.) (not for this PR but for when CR is calculated):
@enebo I agree, but that's a larger design change. If we go that route, I think it makes sense to do it for Another little quirk is the |
Overall this is really great, I'm excited to merge it and the approach looks good.
- For the `test-all` failures, I think you're accidentally parsing strings instead of regexps. It looks like the `source` variable in those methods looks like: `"# encoding: ASCII-8BIT\n\"/abc/n\""`
- For the `memcheck` failure, I think it's because you're using `%s` in the error messages but not necessarily giving it null-terminated strings. For other errors I've typically been using `%.*s` and explicitly giving it the string length first.
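To illustrate the `%s` versus `%.*s` point, here is a minimal sketch. The helper name `format_slice_error` and the message text are hypothetical and not taken from Prism; the only assumption is that the source slice is given as a pointer plus a length and is not null-terminated.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a Prism function): format an error message for a
 * source slice given as pointer + length. The slice is NOT assumed to be
 * null-terminated, so "%.*s" is used with an explicit length instead of "%s",
 * which would keep reading until it happened to find a 0 byte.
 */
static int format_slice_error(char *out, size_t out_size, const char *start, size_t length) {
    /* The precision argument consumed by "%.*s" must be an int. */
    return snprintf(out, out_size, "invalid regexp source: %.*s", (int) length, start);
}
```

With `%.*s`, at most `length` bytes are read from `start`, so a slice taken from the middle of a larger source buffer is safe to print.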
Those are the only pressing things. Other than that, right now this PR is modifying all token buffers and doing a bunch of unnecessary work in the case of strings, heredocs, and lists. I would like to eliminate that work. Here are a couple of ways around that:
- Duplicate the `pm_token_buffer_*` functions with a new variant that also accepts a regexp buffer.
- Add a new kind of token buffer that has a `pm_token_buffer_t` as its first member so that you can cast between them. Add a `bool` to `pm_buffer_token_t` so that it knows if it's actually a `pm_regexp_buffer_t` and can upcast.
- Create two token buffers, and within the regexp lexing call it for both.
I'm fine with any of these solutions, but definitely want to avoid that extra work.
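The second option (a struct embedding the base buffer as its first member) could look roughly like the sketch below. All `sketch_*` types and fields are hypothetical stand-ins; the real Prism structs have different contents.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the base token buffer. */
typedef struct {
    size_t length;          /* bytes pushed so far */
    bool is_regexp_buffer;  /* true when embedded in a sketch_regexp_buffer_t */
} sketch_token_buffer_t;

/* Hypothetical regexp variant: the base buffer must be the FIRST member. */
typedef struct {
    sketch_token_buffer_t base;
    size_t regexp_length;   /* extra bookkeeping only regexps need */
} sketch_regexp_buffer_t;

static void sketch_push_byte(sketch_token_buffer_t *buffer) {
    buffer->length++;

    if (buffer->is_regexp_buffer) {
        /* C guarantees that a pointer to a struct and a pointer to its first
         * member convert to each other, so this cast is valid whenever the
         * flag was set truthfully. */
        sketch_regexp_buffer_t *regexp = (sketch_regexp_buffer_t *) buffer;
        regexp->regexp_length++;
    }
}
```

Strings, heredocs, and lists would pass a plain buffer with the flag cleared and pay only for one branch, while regexp lexing passes `&regexp.base`.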
In terms of stuff on top of this, the big API change I would like to make is the thing that @enebo mentions here of having a new kind of field that is an encoding. But I would prefer that not to be in this PR because it's just going to make reviewing that much harder.
Just marking this as "request changes"
d12816a to dee9289
I've fixed the The memory leaks are taken care of now, as well. I hadn't realized that |
62620fb to a2b7571
…sion so we can accurately set its encoding flags.
a2b7571 to 6bf1b8e
932161a to 235f4dd
235f4dd to 4dc58a5
This PR sets the encoding flags on `Regexp` instances much like we do for `String` and `Symbol`. Unlike `String` and `Symbol`, there's a third way to set a `Regexp` encoding: encoding modifiers. There are four encoding modifiers:
Background
The meaning of these flags appears to have shifted over time. In my copy of "The Ruby Programming Language", which covers Ruby 1.8 and the soon-to-be-released 1.9, the `n` modifier meant "ASCII" and the `s` modifier meant "Shift_JIS". I think the `n` flag still has some of its heritage as it will adjust the regex encoding to be US-ASCII, rather than ASCII-8BIT, in certain circumstances.
`String`, `Symbol`, and `Regexp` all have a flag to force the object's encoding to US-ASCII, ASCII-8BIT, or UTF-8. Rather than introduce new flag values to force a `Regexp` encoding to either EUC-JP or Windows-31J, I opted to use the encoding modifiers to indicate the `Regexp` encoding. To avoid any confusion, if an encoding modifier option is used, then the "force encoding" flags are only set if they would result in a different encoding. I.e., the "/u" option would never result in the "force UTF-8" flag being set. But, the "/n" option may result in the "force US-ASCII" flag being set.
Client Usage
Determining the `Regexp` encoding requires flag checks like the following (also used in encoding_test.rb):
The order here matters. The "forced encoding" flags must be checked before the encoding modifiers. The encoding modifiers may differ from the resulting encoding, but must be kept around for Ruby semantics. The order of checks within the "force encoding" flags, and within the encoding modifiers, is irrelevant, however, since the flag values are mutually exclusive within each class.
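The check order described above can be sketched in C as follows. This is not the snippet from encoding_test.rb; the `SKETCH_*` flag values and the `sketch_encoding_t` enum are hypothetical and do not correspond to Prism's actual constants. The modifier-to-encoding mapping follows Ruby's `/e`, `/n`, `/s`, and `/u` regexp options.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration only. */
enum {
    SKETCH_MOD_EUC_JP      = 1 << 0, /* /e */
    SKETCH_MOD_ASCII_8BIT  = 1 << 1, /* /n */
    SKETCH_MOD_WINDOWS_31J = 1 << 2, /* /s */
    SKETCH_MOD_UTF_8       = 1 << 3, /* /u */
    SKETCH_FORCED_UTF_8    = 1 << 4,
    SKETCH_FORCED_BINARY   = 1 << 5,
    SKETCH_FORCED_US_ASCII = 1 << 6
};

typedef enum {
    SKETCH_ENC_SOURCE, /* fall back to the source file's encoding */
    SKETCH_ENC_UTF_8,
    SKETCH_ENC_ASCII_8BIT,
    SKETCH_ENC_US_ASCII,
    SKETCH_ENC_EUC_JP,
    SKETCH_ENC_WINDOWS_31J
} sketch_encoding_t;

static sketch_encoding_t sketch_regexp_encoding(uint32_t flags) {
    /* 1. "Forced encoding" flags win, so they are checked first. */
    if (flags & SKETCH_FORCED_UTF_8) return SKETCH_ENC_UTF_8;
    if (flags & SKETCH_FORCED_BINARY) return SKETCH_ENC_ASCII_8BIT;
    if (flags & SKETCH_FORCED_US_ASCII) return SKETCH_ENC_US_ASCII;

    /* 2. Otherwise the encoding modifier, if any, names the encoding. */
    if (flags & SKETCH_MOD_EUC_JP) return SKETCH_ENC_EUC_JP;
    if (flags & SKETCH_MOD_ASCII_8BIT) return SKETCH_ENC_ASCII_8BIT;
    if (flags & SKETCH_MOD_WINDOWS_31J) return SKETCH_ENC_WINDOWS_31J;
    if (flags & SKETCH_MOD_UTF_8) return SKETCH_ENC_UTF_8;

    /* 3. No flags set: use the source file's encoding. */
    return SKETCH_ENC_SOURCE;
}
```

For example, a regexp carrying both the `/n` modifier bit and the "force US-ASCII" bit resolves to US-ASCII, because the forced flag is consulted before the modifier.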
Design Notes
`Regexp` encoding validation is rather complex. I'm not sure that there's a unifying rule coming from CRuby. I'm sure there's a philosophy on what should be an error but some of the details appear to be happenstance. While I did consult the existing implementations of CRuby, JRuby, and TruffleRuby for what needs to be done, I mostly arrived at the rules embodied in this code by extensive testing. This PR adds rudimentary `Regexp` validation. To be 100% accurate would require scanning the bytes of the source string and performing validation against the associated encoding. Given Prism doesn't currently do that at large, I decided not to add it for this PR. However, there are other validation issues trivially detectable by mismatched combination of encoding modifiers and source file or `Regexp` source string encoding. In cases where I could I added that validation.
In several places I added placeholder code. These sections use intent-hinting variable names but set the value conservatively to match Prism's current semantics. This was done largely to help me keep track of the complex set of rules involved in validation. However, I retained the code because I think it'll make subsequent improvements much easier to slot into the overarching design. If this is controversial I can safely remove that code. It would necessitate the removal of some error messages that are otherwise unused in Prism but used by CRuby during `Regexp` validation.
I had initially planned on deferring validation to a subsequent pass. However, I found it very confusing to have flags set to nonsense values that should be syntax errors that we weren't yet handling. Adding this rudimentary validation avoids those situations. I've tried to avoid adjusting any of the flags if there is a syntax error, but given validation can happen in stages I think Prism as a whole should not make any guarantees about flag values in the presence of syntax errors.
I also combined the computation of a `Regexp`'s encoding flags and the `Regexp` validation into a joint function. While this may not be architecturally pure, I think it's much easier to follow the convoluted set of rules that determine the resulting encoding. This is in contrast to how we handle `String` and `Symbol`. In both of those other cases the validation is much simpler: does this byte array map into the associated encoding? While we could split the two processes for `Regexp`, we would end up in a situation where we might have to process some values multiple times or set flags and have to clear them later on.
Finally, while this PR does not compute the validity of a `Regexp` source string in a given encoding, it does scan the string to determine if it is ASCII-only. In some error cases this may happen multiple times. I expect all of this will clean up when we add the code range calculation to strings in Prism. If it's required I can compute the value once and pass it around, but we should retain the laziness and that complicates things when passing the value across function boundaries.
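As a rough illustration of the ASCII-only scan and of computing it lazily at most once, consider the sketch below. The `sketch_*` names are hypothetical and this is not the code in the PR; it only shows the general shape of caching the result while deferring the scan until it is first needed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when every byte is in the 7-bit ASCII range. */
static bool sketch_ascii_only(const uint8_t *bytes, size_t length) {
    for (size_t index = 0; index < length; index++) {
        if (bytes[index] >= 0x80) return false;
    }
    return true;
}

/* Hypothetical lazy wrapper: the scan runs on first use and is then cached. */
typedef struct {
    const uint8_t *bytes;
    size_t length;
    bool computed;    /* has the scan run yet? */
    bool ascii_only;  /* cached result, meaningful only when computed is true */
} sketch_lazy_ascii_t;

static bool sketch_lazy_ascii_only(sketch_lazy_ascii_t *lazy) {
    if (!lazy->computed) {
        lazy->ascii_only = sketch_ascii_only(lazy->bytes, lazy->length);
        lazy->computed = true;
    }
    return lazy->ascii_only;
}
```

Passing a pointer to such a struct across function boundaries keeps the laziness, at the cost of threading one more parameter through each call.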