Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relate AST nodes back to parse data #15

Open
machow opened this issue Apr 24, 2018 · 6 comments
Open

Relate AST nodes back to parse data #15

machow opened this issue Apr 24, 2018 · 6 comments
Labels
feature a feature request or enhancement

Comments

@machow
Copy link

machow commented Apr 24, 2018

(I'm happy to take a stab at this if it seems useful! cc @filipsch who is interested in this issue)

Really enjoyed the NYC R conference talk on lobstr! This might be outside the scope of lobstr, but it seems like being able to enhance the AST with line / column info could lead to useful ways of allowing people to explore the AST, and its relationship back to their code.

I worked briefly on connecting AST and parse data while flying somewhere, but never finished. I think the key points are...

  1. getParseData returns a data.frame, but there is a handy function in https://github.com/halpo/parser to turn it into a tree (and plot it)
  2. In the AST, a node's first child is always the function called, but in the parse data their order corresponds to their position in the code (shown below).
  3. As far as I can tell, the layers of the AST and parse data graph match, so relating them requires...
  • filtering parse data nodes corresponding to "expr" tokens
  • potentially finding a non-"expr" token corresponding to leftmost AST child node in other tokens (e.g. in 1 + 2, the + is not an expr token)

There may be a much easier way to do this. Here's is a comparison of the parse data for a binary operator.

image

Code used to graph parsed data

# taken from https://github.com/halpo/parser/blob/master/R/plot.parser.R
library(igraph)
plot.parser <- function(x, ...){
  y = getParseData(x)
  stopifnot(require(igraph))
  y$new.id <- seq_along(y$id)
  h <- graph.tree(0) + vertices(id = y$id, label= y$text)
  for(i in 1:nrow(y)){
    if(y[i, 'parent'])
      h <- h + edge(c(y[y$id == y[i, 'parent'], 'new.id'], y[i, 'new.id']))
  }
  plot(h, layout=layout.reingold.tilford, ...)   
}

create_parse_graph <- function(pd) {
  pd$new.id <- seq_along(pd$id)
  h <- graph.tree(0) + vertices(id = pd$id, label= pd$text)
  for(i in 1:nrow(pd)){
    if(pd[i, 'parent'])
      h <- h + edge(c(pd[pd$id == pd[i, 'parent'], 'new.id'], pd[i, 'new.id']))
  }
  
  h
}

Some notes on matching nodes

# Steps to map to AST from parse data
#   1. start at root AST node (call ast), and PD node (call pd)
#
#   2a. if is.atomic(ast), build atomic node by returning ast[1]. Otherwise...
#   2b. if is.call(ast) and is.name(ast[[1]]), then ast[[1]] is child token of pd
#     i. standard call will make it the first child token
#    ii. binary will be in the middle (so easier to search for ast[[1]])
#   2c. if is.call(ast) and is.call(ast[[1]]), then it is first child of pd
#
#   3. Get all child exprs of pd (in order), move one matching ast[[1]] to front
#
#   4. You have matched up ast to pd! Recurse
@hadley
Copy link
Member

hadley commented Aug 23, 2018

This would be really cool! Do you have any thoughts on what the resulting data structure would be?

@hadley hadley added the feature a feature request or enhancement label Dec 20, 2018
@machow
Copy link
Author

machow commented Sep 9, 2020

Hey, @nischalshrestha brought up a quirk with the AST that reminded me of this! Apparently, I made an almost working prototype 2 years ago, so I made some small changes / pushed it to github. No worries if it's not an active area of interest anymore, but wanted to describe some possible directions!

tl;dr

Do you have any thoughts on what the resulting data structure would be?

What do you think about something similar to the srcref approach, where an attribute is attached to each entry in the expression object (that I've been calling AST)? Say an int vector with 5 entries--corresponding getParseData data.frame row number, line start/end, column start/end?

As I understand, the result of parse already contains a src ref, that gives line and column information, but this is only for entire statements (e.g. 1 + 1; 2 + 2 would have 2 entries for srcref).

It seems like a big advantage here is that tools operating over the AST could identify where they are in the source code (e.g. if they hit an error, or for feedback). For example, identifying all assignments to a certain variable name, or function calls. Right now AFAICT they would need to operate on the parse tree from getParseData(), but could be totally wrong!

background

https://github.com/machow/straw

straw

(parse tree is on the left, AST on the right)

The implementation is pretty rough, but returns something similar to getParseData right now...

library(straw)
enhance_ast("1 + 1")
# A tibble: 5 x 12
  node     id parent text  token row_num children line1  col1 line2  col2 parse_row_num
  <lis> <dbl>  <dbl> <chr> <lgl>   <int> <list>   <int> <int> <int> <int>         <int>
1 <exp…     1     NA ""    NA          1 <int [1…    NA    NA    NA    NA             1
2 <lan…     2      1 ""    NA          2 <int [3…     1     1     1     5             2
3 <sym>     3      2 "+"   NA          3 <int [0…     1     3     1     3             5
4 <dbl…     4      2 "1"   NA          4 <int [0…     1     1     1     1             4
5 <dbl…     5      2 "1"   NA          5 <int [0…     1     5     1     5             7
# note, need to add prepend a row to top of this data.frame
# to represent the ASTs top-level "expr" node. 
# as is, parse_row_num - 1 from above would match this
getParseData(parse(text = "1 + 1", keep.source = TRUE))
  line1 col1 line2 col2 id parent     token terminal text
7     1    1     1    5  7      0      expr    FALSE     
1     1    1     1    1  1      2 NUM_CONST     TRUE    1
2     1    1     1    1  2      7      expr    FALSE     
3     1    3     1    3  3      7       '+'     TRUE    +
4     1    5     1    5  4      5 NUM_CONST     TRUE    1
5     1    5     1    5  5      7      expr    FALSE     

@hadley
Copy link
Member

hadley commented Sep 10, 2020

I'm still very much interested in this 😄 The problem with using attributes to map the AST to the parse data is that you can't put attributes on symbols, so I think that makes that approach basically impossible.

I wonder if instead it would make sense to just return the data frame and then provide some helper function for recursing over it like you normally would with the AST?

@machow
Copy link
Author

machow commented Sep 11, 2020

Ah, shoot--that makes sense!

Part of my interest in picking this back up is looking at how the recursion is usually done in R. I have an implementation of a tree visitor as an R6 class that I can toss into a gist, but have been curious about using something like case statements to define visitors in a more functional way (based on this post).

Are there good places to look in different libraries to see how recursion / visiting is often done? For example, I know replace_expr() in dbplyr does simple visiting. Really curious what some of the more complex situations out there look like (or whether most the time if simple recursion is all that's needed).

@hadley
Copy link
Member

hadley commented Sep 11, 2020

@machow I usually just whip up a recursive tree-walker by hand, tailored for the specific situation, so I don't have much sense for what you want in a generic implementation.

@machow
Copy link
Author

machow commented Sep 11, 2020

Okay--thanks, helpful to hear. Will try out a simple helper function, since it seems like that approach is working for people

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants