Add a "repeated flag" to parameter nodes by tenderlove · Pull Request #2144 · ruby/prism

tenderlove · 2024-01-09T18:24:30Z

It's possible to repeat parameters in method definitions like so:

def foo(_a, _a)
end

The compiler needs to know to adjust the local table size to account for these duplicate names. We'll use the repeated parameter flag to account for the extra stack space required

It's possible to repeat parameters in method definitions like so: ```ruby def foo(_a, _a) end ``` The compiler needs to know to adjust the local table size to account for these duplicate names. We'll use the repeated parameter flag to account for the extra stack space required Co-Authored-By: Kevin Newton <kddnewton@gmail.com> Co-Authored-By: Jemma Issroff <jemmaissroff@gmail.com>

kddnewton

Two small optional things, otherwise looks good

kddnewton · 2024-01-09T18:40:59Z

+            pm_node_flag_set(node, PM_PARAMETER_FLAGS_REPEATED_PARAMETER);
+            break;
+        default:
+            fprintf(stderr, "unhandled type %d\n", PM_NODE_TYPE(node));


We could potentially change this function to do an assertion at the top of the function on the type and then just set the flag. But I doubt this is a very common path so I'm fine with leaving it as is.

I think using assert rather than the printf / abort is better. I'll change it.

kddnewton · 2024-01-09T18:41:13Z

 * parameter is unique or not.
 */
-static void
+static bool


Could you document the return type in the description of the function?

tenderlove · 2024-01-09T19:54:35Z

@kddnewton mind reviewing again? I think I addressed the feedback.

eregon · 2024-01-10T11:07:40Z

I wonder what is the right approach here.
@andrykonchin and I were thinking to open an issue or discussion about this case:

class P
  def m(*a) = p a
end

class C<P
  def m(_,_,_)
    _ = 4
    super
  end
end

C.new.m(1,2,3) # => [4, 2, 3]

So the 1st _ is really a different variable than the 2nd _ and 3rd _.
And assigning _ only assigns the first one.

Notably ruby --dump=insns -e 'def m(_,_); [_,_]; end' shows a local table of size 2 (and not 1).
But Prism currently only declares one local for that case.

The JRuby parser defines multiple variables in that case:

Source:
def m(_,_); [_,_]; end

AST:
RootParseNode
  DefnParseNode:m
    ArgsParseNode
      ArrayParseNode
        ArgumentParseNode:_
        ArgumentParseNode:_$0
      ArrayParseNode
      ArrayParseNode
      ArrayParseNode
    ArrayParseNode
      LocalVarParseNode:_
      LocalVarParseNode:_

So I'm thinking it would be nice if Prism would really consider these _ variables to be different because they are obviously separate semantically.
Then the Prism locals would also reflect how many variables there are.

Regarding naming, one trick is to use an invalid local variable name like _$0 in JRuby parser or _%0 or %_0.
Then they can easily be filtered and won't appear as accessible as Ruby local variables.
Starting with an invalid character seems best to ease filtering so e.g. the 2nd _ becomes %_2 and the 2nd _a becomes %_a2.
(binding.local_variable_get :"NAME" can be used to check whether it's a valid local var name)

I think the repeated flag in this PR is harder to process and error-prone, i.e., if anything uses the Prism AST and see two _ variables they will just think it's the same variable (incorrect semantically) vs if they are actually different variable names then there is no ambiguity and no need for every usage of Prism to actually expand duplicated variables into separate variables. The trade off is to go back to Ruby syntax for the cases needing it then a /^%(_.*)\d+$/ -> '\1' substitution would be needed.
OTOH it is enough to deal with this case without manually checking duplicates (so better than current).

As a note, while the added flag on *ParameterNode does not increase memory for C structs (they always have flags), and only increase if the flag is set in serialized form, it does increase the memory for all Java classes for *ParameterNode, because those only have a short flags; field if there is any flag.
EDIT: although currently it wouldn't increase the effective size in memory due to private boolean newLineFlag; in Java Node. But maybe that flag could be stored as a bit on startOffset or length, as it's currently increasing lots of Java Node size potentially by 8 bytes just to store one bit.

eregon · 2024-01-10T11:20:25Z

For comparison to the JRuby AST above, this is the current Prism AST for def m(_,_); [_,_]; end:

$ bin/parse -e 'def m(_,_); [_,_]; end'
...
        └── @ DefNode (location: (1,0)-(1,22))
            ...
            ├── parameters:
            │   @ ParametersNode (location: (1,6)-(1,9))
            │   ├── requireds: (length: 2)
            │   │   ├── @ RequiredParameterNode (location: (1,6)-(1,7))
            │   │   │   └── name: :_
            │   │   └── @ RequiredParameterNode (location: (1,8)-(1,9))
            │   │       └── name: :_
            │   ├...
            ├── body:
            │   @ StatementsNode (location: (1,12)-(1,17))
            │   └── body: (length: 1)
            │       └── @ ArrayNode (location: (1,12)-(1,17))
            │           ├── flags: ∅
            │           ├── elements: (length: 2)
            │           │   ├── @ LocalVariableReadNode (location: (1,13)-(1,14))
            │           │   │   ├── name: :_
            │           │   │   └── depth: 0
            │           │   └── @ LocalVariableReadNode (location: (1,15)-(1,16))
            │           │       ├── name: :_
            │           │       └── depth: 0
            │           ├...
            ├── locals: [:_]
            ├...

Looking at it it all seems too clear that there is a single _ variable, but there are 2 semantically.
The repeated flag addresses it for the RequiredParameterNodes, but not for LocalVariableReadNode, which is then from the AST not clear to which variable it refers.

enebo · 2024-01-10T15:30:58Z

It is funny because I was discussing this variable to someone this weekend and it appears '' as a user visible variable is the first '' encountered while parsing. So:

def foo(,); p _; end; foo(1, 2); # "1"

def foo(_, (a, _); p _; end; foo(1, [2,3]) # "3"

For this second case either first _ depth-first parse use of _ is used OR first is bound and the destructuring assigns to first _ (I believe it is the second case but it is confusing) The first example is interesting because why is that 1 and not 2? I don't personally think people should depend on being able to access this but Ruby is what it does and we seem to have this single variable name with weird value assignment behavior that also does not warn. I have a theory on why this exists but I won't bore anyone.

That aside the semantics are only one _ is actually a valid variable for use. The others are specifying we do not care about them as user-visible values (and honestly they should not have allowed _ to be accessed at all). The fact that the legacy parser (in JRuby) assigns is just an implementation detail (in fact it is a vestigial artifact from AST interpreter days and has not been used in years). In Prism this is how we do it:

    /* '_' can be seen as a variable only by its first assignment as a local variable.  For any additional
     * '_' we create temporary variables in the case the scope has a zsuper in it.  If so, then the zsuper
     * call will slurp those temps up as it's parameters so it can properly set up the call.
     */
    Variable argumentResult(RubySymbol name) {
        // |a,| case
        if (name == null) return temp();

        boolean isUnderscore = name.getBytes().realSize() == 1 && name.getBytes().charAt(0) == '_';

        if (isUnderscore && underscoreVariableSeen) {
            return temp();
        } else {
            if (isUnderscore) underscoreVariableSeen = true;
            return getNewLocalVariable(name, 0);
        }
    }

In this new parsing world we just assign a temp and do not pollute the variable table (obviously JRuby using register based design makes this simple). The main point here though is there is no semantic reason why _ should have extra local variables. Users cannot seem them and implementers (at least JRuby) are not using them to implement zsuper. HOWEVER

Marking the real _ would have two benefits:

The above code would be simpler because I would not have to examine the name of ever parameter. I would check type or field or flag.
If you were to write a Ruby script to evaluate variable uses it would be handy to know which _ was the user-visible one. Having consumers of the tree figure this out is probably beyond most people's general knowledge. As we can see above it was above mine and I implemented it a few times.

@kddnewton I think we should consider annotating the real _ for item 2 but it would not hurt for number 1 but obviously it is not a big burden to do what I did.

@eregon I don't see value in adding these extra _ as variables. It feels like impl detail and not semantics.

eregon · 2024-01-10T15:59:21Z

def foo(_, (a, _); p _; end; foo(1, [2,3]) # "3"

Wow that is nasty.
I think it's worth reporting at https://bugs.ruby-lang.org/, at least to discuss it there if intended or bug.
BTW this PR labels the _ after a, as repeated_parameter so that's inconsistent in Prism vs CRuby.

@eregon I don't see value in adding these extra _ as variables. It feels like impl detail and not semantics.

That's the part we disagree. They are semantics because they affect the behavior of Ruby and it is observable (an implementation detail OTOH would not be observable from Ruby code).
Any correct semantic usage of Prism will require to deal with this edge case, if such _ code is ever provided as input.
I think it's easier to deal with the edge case if Prism recognizes them as separate variables (notably it would not require state in the translator/compiler to deal with that, in general we avoid state as much as possible in TruffleRuby translator as it works poorly with lazy translation).
But in the end it's 2 different representations for the same thing. For the flag one still needs to manually number the various _ though.

I wish multiple _ would just be the same variable and writes to such parameters would just overwrite each other (or just multi _ be a SyntaxError), but it's not the semantics of Ruby 3.3 and before.

tenderlove · 2024-01-10T18:29:47Z

AFAIK matz says that assignments to repeated _ prefixed variables has undefined behavior. I need these flags just because the locals table that prism provides is actually a set so we can't know the right stack size on method entry.

For example:

def foo(a, _a, _a); end
foo(0, 1, 2)

The caller will push 3 items on the stack, but if we only look at the local table provided by prism, then we'll assume the stack size is only 2. This flag just helps CRuby do stack size book keeping, it's not a statement about actual behavior in the language.

duerst · 2024-01-11T03:45:04Z

@eregon wrote:

I wish multiple _ would just be the same variable and writes to such parameters would just overwrite each other (or just multi _ be a SyntaxError), but it's not the semantics of Ruby 3.3 and before.

I think some people use variable names that start with '_' to indicate that these are ignored. That would suggest that such variables are not accessible at all. This would solve the problem with duplication automatically.

eregon · 2024-01-11T10:42:55Z

I need these flags just because the locals table that prism provides is actually a set so we can't know the right stack size on method entry.

Right, it's very similar to the issue enebo and I are mentioning.

undefined behavior does not exist in Ruby, it's what CRuby does or doesn't.

the locals table that prism provides is actually a set

I think that's the part we could fix, we could have Prism list all the semantically-different variables in its local tables (by renaming duplicates as suggested above).
Then it would be a "set/table of all semantics local variables in that scope" vs currently it's a "set of names in the source, not necessarily corresponding to one variable but in some cases to multiple variables".
I think no one expect a "locals table" to not include all semantic/required local variables, otherwise it's incomplete and cannot be relied upon, isn't it?

I can see the renaming is not ideal (introducing variables with a different name than in the source) but I think it's the cleanest solution here.

It is possible to workaround this issue with the new flag, but it's less convenient and I think every semantic usage of Prism will have to manually workaround this.

eregon · 2024-01-18T14:29:35Z

In #2124 it uses a 0it variable name.
While in this PR the approach is to keep the name in the source but differentiate via a flag.
Maybe it's worth unifying?
Or maybe variables should have 2 names, the source name and the semantic/compiler/internal name? (not sure where we would store & serialize these 2 names though).

FWIW we have adopted the repeated flag in TruffleRuby, it's convenient. Not as good as if we didn't even have to think about it because it would already be separate variables, but simple enough in practice.

tenderlove requested a review from kddnewton January 9, 2024 18:26

kddnewton approved these changes Jan 9, 2024

View reviewed changes

address feedback

ed183ad

tenderlove requested a review from kddnewton January 9, 2024 19:54

kddnewton approved these changes Jan 10, 2024

View reviewed changes

kddnewton merged commit ced3b8d into ruby:main Jan 10, 2024

tenderlove deleted the dup-param-flag branch January 10, 2024 18:16

Conversation

tenderlove commented Jan 9, 2024

Uh oh!

kddnewton left a comment

Choose a reason for hiding this comment

Uh oh!

kddnewton Jan 9, 2024

Choose a reason for hiding this comment

Uh oh!

tenderlove Jan 9, 2024

Choose a reason for hiding this comment

Uh oh!

kddnewton Jan 9, 2024

Choose a reason for hiding this comment

Uh oh!

tenderlove commented Jan 9, 2024

Uh oh!

eregon commented Jan 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eregon commented Jan 10, 2024

Uh oh!

enebo commented Jan 10, 2024

Uh oh!

eregon commented Jan 10, 2024

Uh oh!

tenderlove commented Jan 10, 2024

Uh oh!

duerst commented Jan 11, 2024

Uh oh!

eregon commented Jan 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eregon commented Jan 18, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

eregon commented Jan 10, 2024 •

edited

Loading

eregon commented Jan 11, 2024 •

edited

Loading