Skip to content

Add a "repeated flag" to parameter nodes#2144

Merged
kddnewton merged 2 commits intoruby:mainfrom
tenderlove:dup-param-flag
Jan 10, 2024
Merged

Add a "repeated flag" to parameter nodes#2144
kddnewton merged 2 commits intoruby:mainfrom
tenderlove:dup-param-flag

Conversation

@tenderlove
Copy link
Copy Markdown
Member

It's possible to repeat parameters in method definitions like so:

def foo(_a, _a)
end

The compiler needs to know to adjust the local table size to account for these duplicate names. We'll use the repeated parameter flag to account for the extra stack space required

It's possible to repeat parameters in method definitions like so:

```ruby
def foo(_a, _a)
end
```

The compiler needs to know to adjust the local table size to account for
these duplicate names.  We'll use the repeated parameter flag to account
for the extra stack space required

Co-Authored-By: Kevin Newton <kddnewton@gmail.com>
Co-Authored-By: Jemma Issroff <jemmaissroff@gmail.com>
@tenderlove tenderlove requested a review from kddnewton January 9, 2024 18:26
Copy link
Copy Markdown
Collaborator

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small optional things, otherwise looks good

Comment thread src/prism.c Outdated
pm_node_flag_set(node, PM_PARAMETER_FLAGS_REPEATED_PARAMETER);
break;
default:
fprintf(stderr, "unhandled type %d\n", PM_NODE_TYPE(node));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially change this function to do an assertion at the top of the function on the type and then just set the flag. But I doubt this is a very common path so I'm fine with leaving it as is.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using assert rather than the printf / abort is better. I'll change it.

Comment thread src/prism.c
* parameter is unique or not.
*/
static void
static bool
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you document the return type in the description of the function?

@tenderlove tenderlove requested a review from kddnewton January 9, 2024 19:54
@tenderlove
Copy link
Copy Markdown
Member Author

@kddnewton mind reviewing again? I think I addressed the feedback.

@eregon
Copy link
Copy Markdown
Member

eregon commented Jan 10, 2024

I wonder what is the right approach here.
@andrykonchin and I were thinking to open an issue or discussion about this case:

class P
  def m(*a) = p a
end

class C<P
  def m(_,_,_)
    _ = 4
    super
  end
end

C.new.m(1,2,3) # => [4, 2, 3]

So the 1st _ is really a different variable than the 2nd _ and 3rd _.
And assigning _ only assigns the first one.

Notably ruby --dump=insns -e 'def m(_,_); [_,_]; end' shows a local table of size 2 (and not 1).
But Prism currently only declares one local for that case.

The JRuby parser defines multiple variables in that case:

Source:
def m(_,_); [_,_]; end

AST:
RootParseNode
  DefnParseNode:m
    ArgsParseNode
      ArrayParseNode
        ArgumentParseNode:_
        ArgumentParseNode:_$0
      ArrayParseNode
      ArrayParseNode
      ArrayParseNode
    ArrayParseNode
      LocalVarParseNode:_
      LocalVarParseNode:_

So I'm thinking it would be nice if Prism would really consider these _ variables to be different because they are obviously separate semantically.
Then the Prism locals would also reflect how many variables there are.

Regarding naming, one trick is to use an invalid local variable name like _$0 in JRuby parser or _%0 or %_0.
Then they can easily be filtered and won't appear as accessible as Ruby local variables.
Starting with an invalid character seems best to ease filtering so e.g. the 2nd _ becomes %_2 and the 2nd _a becomes %_a2.
(binding.local_variable_get :"NAME" can be used to check whether it's a valid local var name)

I think the repeated flag in this PR is harder to process and error-prone, i.e., if anything uses the Prism AST and see two _ variables they will just think it's the same variable (incorrect semantically) vs if they are actually different variable names then there is no ambiguity and no need for every usage of Prism to actually expand duplicated variables into separate variables. The trade off is to go back to Ruby syntax for the cases needing it then a /^%(_.*)\d+$/ -> '\1' substitution would be needed.
OTOH it is enough to deal with this case without manually checking duplicates (so better than current).

As a note, while the added flag on *ParameterNode does not increase memory for C structs (they always have flags), and only increase if the flag is set in serialized form, it does increase the memory for all Java classes for *ParameterNode, because those only have a short flags; field if there is any flag.
EDIT: although currently it wouldn't increase the effective size in memory due to private boolean newLineFlag; in Java Node. But maybe that flag could be stored as a bit on startOffset or length, as it's currently increasing lots of Java Node size potentially by 8 bytes just to store one bit.

@eregon
Copy link
Copy Markdown
Member

eregon commented Jan 10, 2024

For comparison to the JRuby AST above, this is the current Prism AST for def m(_,_); [_,_]; end:

$ bin/parse -e 'def m(_,_); [_,_]; end'
...
        └── @ DefNode (location: (1,0)-(1,22))
            ...
            ├── parameters:
            │   @ ParametersNode (location: (1,6)-(1,9))
            │   ├── requireds: (length: 2)
            │   │   ├── @ RequiredParameterNode (location: (1,6)-(1,7))
            │   │   │   └── name: :_
            │   │   └── @ RequiredParameterNode (location: (1,8)-(1,9))
            │   │       └── name: :_
            │   ├...
            ├── body:
            │   @ StatementsNode (location: (1,12)-(1,17))
            │   └── body: (length: 1)
            │       └── @ ArrayNode (location: (1,12)-(1,17))
            │           ├── flags: ∅
            │           ├── elements: (length: 2)
            │           │   ├── @ LocalVariableReadNode (location: (1,13)-(1,14))
            │           │   │   ├── name: :_
            │           │   │   └── depth: 0
            │           │   └── @ LocalVariableReadNode (location: (1,15)-(1,16))
            │           │       ├── name: :_
            │           │       └── depth: 0
            │           ├...
            ├── locals: [:_]
            ├...

Looking at it it all seems too clear that there is a single _ variable, but there are 2 semantically.
The repeated flag addresses it for the RequiredParameterNodes, but not for LocalVariableReadNode, which is then from the AST not clear to which variable it refers.

@kddnewton kddnewton merged commit ced3b8d into ruby:main Jan 10, 2024
@enebo
Copy link
Copy Markdown
Collaborator

enebo commented Jan 10, 2024

It is funny because I was discussing this variable to someone this weekend and it appears '' as a user visible variable is the first '' encountered while parsing. So:

def foo(,); p _; end; foo(1, 2); # "1"

def foo(_, (a, _); p _; end; foo(1, [2,3]) # "3"

For this second case either first _ depth-first parse use of _ is used OR first is bound and the destructuring assigns to first _ (I believe it is the second case but it is confusing) The first example is interesting because why is that 1 and not 2? I don't personally think people should depend on being able to access this but Ruby is what it does and we seem to have this single variable name with weird value assignment behavior that also does not warn. I have a theory on why this exists but I won't bore anyone.

That aside the semantics are only one _ is actually a valid variable for use. The others are specifying we do not care about them as user-visible values (and honestly they should not have allowed _ to be accessed at all). The fact that the legacy parser (in JRuby) assigns is just an implementation detail (in fact it is a vestigial artifact from AST interpreter days and has not been used in years). In Prism this is how we do it:

    /* '_' can be seen as a variable only by its first assignment as a local variable.  For any additional
     * '_' we create temporary variables in the case the scope has a zsuper in it.  If so, then the zsuper
     * call will slurp those temps up as it's parameters so it can properly set up the call.
     */
    Variable argumentResult(RubySymbol name) {
        // |a,| case
        if (name == null) return temp();

        boolean isUnderscore = name.getBytes().realSize() == 1 && name.getBytes().charAt(0) == '_';

        if (isUnderscore && underscoreVariableSeen) {
            return temp();
        } else {
            if (isUnderscore) underscoreVariableSeen = true;
            return getNewLocalVariable(name, 0);
        }
    }

In this new parsing world we just assign a temp and do not pollute the variable table (obviously JRuby using register based design makes this simple). The main point here though is there is no semantic reason why _ should have extra local variables. Users cannot seem them and implementers (at least JRuby) are not using them to implement zsuper. HOWEVER

Marking the real _ would have two benefits:

  1. The above code would be simpler because I would not have to examine the name of ever parameter. I would check type or field or flag.
  2. If you were to write a Ruby script to evaluate variable uses it would be handy to know which _ was the user-visible one. Having consumers of the tree figure this out is probably beyond most people's general knowledge. As we can see above it was above mine and I implemented it a few times.

@kddnewton I think we should consider annotating the real _ for item 2 but it would not hurt for number 1 but obviously it is not a big burden to do what I did.

@eregon I don't see value in adding these extra _ as variables. It feels like impl detail and not semantics.

@eregon
Copy link
Copy Markdown
Member

eregon commented Jan 10, 2024

def foo(_, (a, _); p _; end; foo(1, [2,3]) # "3"

Wow that is nasty.
I think it's worth reporting at https://bugs.ruby-lang.org/, at least to discuss it there if intended or bug.
BTW this PR labels the _ after a, as repeated_parameter so that's inconsistent in Prism vs CRuby.

@eregon I don't see value in adding these extra _ as variables. It feels like impl detail and not semantics.

That's the part we disagree. They are semantics because they affect the behavior of Ruby and it is observable (an implementation detail OTOH would not be observable from Ruby code).
Any correct semantic usage of Prism will require to deal with this edge case, if such _ code is ever provided as input.
I think it's easier to deal with the edge case if Prism recognizes them as separate variables (notably it would not require state in the translator/compiler to deal with that, in general we avoid state as much as possible in TruffleRuby translator as it works poorly with lazy translation).
But in the end it's 2 different representations for the same thing. For the flag one still needs to manually number the various _ though.

I wish multiple _ would just be the same variable and writes to such parameters would just overwrite each other (or just multi _ be a SyntaxError), but it's not the semantics of Ruby 3.3 and before.

@tenderlove tenderlove deleted the dup-param-flag branch January 10, 2024 18:16
@tenderlove
Copy link
Copy Markdown
Member Author

AFAIK matz says that assignments to repeated _ prefixed variables has undefined behavior. I need these flags just because the locals table that prism provides is actually a set so we can't know the right stack size on method entry.

For example:

def foo(a, _a, _a); end
foo(0, 1, 2)

The caller will push 3 items on the stack, but if we only look at the local table provided by prism, then we'll assume the stack size is only 2. This flag just helps CRuby do stack size book keeping, it's not a statement about actual behavior in the language.

@duerst
Copy link
Copy Markdown
Member

duerst commented Jan 11, 2024

@eregon wrote:

I wish multiple _ would just be the same variable and writes to such parameters would just overwrite each other (or just multi _ be a SyntaxError), but it's not the semantics of Ruby 3.3 and before.

I think some people use variable names that start with '_' to indicate that these are ignored. That would suggest that such variables are not accessible at all. This would solve the problem with duplication automatically.

@eregon
Copy link
Copy Markdown
Member

eregon commented Jan 11, 2024

I need these flags just because the locals table that prism provides is actually a set so we can't know the right stack size on method entry.

Right, it's very similar to the issue enebo and I are mentioning.

undefined behavior does not exist in Ruby, it's what CRuby does or doesn't.

the locals table that prism provides is actually a set

I think that's the part we could fix, we could have Prism list all the semantically-different variables in its local tables (by renaming duplicates as suggested above).
Then it would be a "set/table of all semantics local variables in that scope" vs currently it's a "set of names in the source, not necessarily corresponding to one variable but in some cases to multiple variables".
I think no one expect a "locals table" to not include all semantic/required local variables, otherwise it's incomplete and cannot be relied upon, isn't it?

I can see the renaming is not ideal (introducing variables with a different name than in the source) but I think it's the cleanest solution here.

It is possible to workaround this issue with the new flag, but it's less convenient and I think every semantic usage of Prism will have to manually workaround this.

@eregon
Copy link
Copy Markdown
Member

eregon commented Jan 18, 2024

In #2124 it uses a 0it variable name.
While in this PR the approach is to keep the name in the source but differentiate via a flag.
Maybe it's worth unifying?
Or maybe variables should have 2 names, the source name and the semantic/compiler/internal name? (not sure where we would store & serialize these 2 names though).

FWIW we have adopted the repeated flag in TruffleRuby, it's convenient. Not as good as if we didn't even have to think about it because it would already be separate variables, but simple enough in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants