Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting comments from source code #31

Closed
bbatsov opened this issue May 15, 2013 · 15 comments
Closed

Extracting comments from source code #31

bbatsov opened this issue May 15, 2013 · 15 comments
Assignees
Milestone

Comments

@bbatsov
Copy link
Contributor

bbatsov commented May 15, 2013

Not really a issue, just a question I wasn't sure where to ask.

rubocop has some comment style checks that need access to the tokens generated by the lexer, since comments for obvious reasons are not part of the parser AST. Basically I need some equivalent of Ripper.lex. I noticed the Lexer class in Parser's source code and its use to generate the output of ruby-parser -E, but I'm quite certain how can I interact with it to simply get a list of tokens with their text and locations.

@whitequark
Copy link
Owner

Not possible. The way Ruby works, you can only simultaneously lex and parse it. For example, compare these code fragments:

division and comment:

a = 1
a / 10 # /

call and regexp:

a / 10 # /

Ruby's (>=1.9.2) grammar is context-sensitive for all intents and purposes, and with the current LALR-parse+context-sensitive-lexer implementation it only ever makes sense to perform parsing and lexing simultaneously.

I'm not sure which interface for storing comments would be the most sane.

@bbatsov
Copy link
Contributor Author

bbatsov commented May 16, 2013

I see. Maybe comments should be tracked during lex/parse and the be stored in some structure that's finally attached to the source map of the top level AST node.

@whitequark
Copy link
Owner

Might be.

One of the problems here is that people seem to like comments attached to the def/class/module nodes, and these two approaches do not compose well. (I don't want to have the same data in two disjoint places in my AST.)

@whitequark
Copy link
Owner

What I'm currently thinking about is to make the #parse API look like this:

ast, comments = parser.parse(buffer)

comments would be an array of Parser::Source::Comment, complete with the Range and text.

Questions:

  • Do we include leading # (or =begin/=end)? Currently, (unexported) lexer interface includes them, as that's how ruby_parser works.
  • Do we merge distinct inline comments? Probably no, but there are use cases which benefit from it (see below).
  • How do we satisfy the users who just want preceding comments for various nodes? Def/class at least, but YARD-like consumers would want to get preceding comments for arbitrary nodes.
    • Maybe make a post-processor? Such a post-processor could associate comments with nodes based on location info.
    • How would such a post-processor handle comments which appear "inside" other nodes? E.g. `a + #foo\nb".

@bbatsov
Copy link
Contributor Author

bbatsov commented May 16, 2013

The API change seems reasonable to me.

I think we should include #/=begin. I don't think we should merge distinct inline comments.

The post-processor idea sounds good; we can probably just discard comments that appear inside nodes - seems unlikely that someone would like to map those to nodes anyways. I'm reasonably sure we should map only class/module/def/constant comments to nodes. The other comments would likely be needed just for bulk comment processing and having them only in the comments array should suffice.

@whitequark
Copy link
Owner

What is your reasoning for including #/=begin?

I believe that YARD allows documenting attr_accessor clauses and similar ones, that's why we must support attaching them to arbitrary nodes. Well, once you have constant assignments, everything else is also there, so it's not much of a problem.

What would an algorithm for associating comments to nodes look like? I'm thinking of this:

  1. Let C be a view into lexically sorted set of comments of length 1.
  2. Traverse the AST in depth-first order. While traversing each node, perform:
    1. Let P (previous) be the current node.
    2. Let N (next) be the first child node.
    3. While C is contained lexically between P and N (that is, begin of C is contained after end of P's expression map, and end of C is contained before N's expression map):
      1. Associate C with N.
      2. Advance C.
    4. Traverse N.
    5. If there is a child node after N:
      1. Let P be N.
      2. Let N be the child node after N.
      3. Goto 2.3.
    6. Otherwise, the current node is traversed.

This way, comments in weird places will get associated to the next nearest node, which sounds reasonable to me.

I'm also interested in another opinions... @mbj, @yorickpeterse, @judofyr?

@yorickpeterse
Copy link
Contributor

Wouldn't a map/reduce like system (or just reduce in this case) work as well? Maybe I'm just not entirely understanding the description but consider the following:

  1. You have two datasets: "comments" and "ast".
  2. Iterate over each node in the "ast", do the following for each node:
    1. Check if the next item in the "comments" dataset has its position set before the current "ast" node
    2. If this is the case shift the item from the "comments" dataset and somehow associate it with the AST node.
  3. Rinse, repeat

In terms of code that would look something like the following:

class Container
  attr_reader :type, :value, :line

  def initialize(type, value, line)
    @type  = type
    @value = value
    @line  = line
  end

  def inspect
    return "(#{type} #{value} #{line})"
  end
end

comments = [
  Container.new(:comment, 'hello', 1),
  Container.new(:comment, 'world', 5)
]

ast = [
  Container.new(:node, 'a', 2),
  Container.new(:node, 'b', 3),
  Container.new(:node, 'c', 4),
  Container.new(:node, 'd', 6)
]

associations = {}

ast.each do |node|
  comment = comments[0]

  if comment and comment.line < node.line
    associations[node] = comments.shift
  end
end

p associations

This would result in the following:

{(node a 2)=>(comment hello 1), (node d 6)=>(comment world 5)}

Note that we might be talking about the same algorithm here.

@whitequark
Copy link
Owner

@yorickpeterse Kinda the same.

@bbatsov
Copy link
Contributor Author

bbatsov commented May 17, 2013

@whitequark My reasoning is simply that I need to be able to differentiate between =begin and # comments in RuboCop.

@mbj
Copy link
Collaborator

mbj commented May 17, 2013

@whitequark, @yorickpeterse Dont have a better / different idea than yours.

@ghost ghost assigned whitequark May 17, 2013
This was referenced May 19, 2013
@whitequark
Copy link
Owner

Hm, I made #parse return [ast, comments] but it's kind of ugly. Maybe make separate #parse and #parse_with_comments? Not sure here.

@bbatsov
Copy link
Contributor Author

bbatsov commented May 22, 2013

@whitequark Why not change the method signature of parse to accept an optional hash of additional params :with_comments and :with_tokens. If they are both false parse would simply return the ast (as it does now), instead of an array.

@whitequark
Copy link
Owner

Um, no, I don't like such return value polymorphism at all. It means I don't know what is returned unless I have a hash literal at the call site.

@judofyr
Copy link
Contributor

judofyr commented May 22, 2013

+1 for #parse_with_comments

@whitequark
Copy link
Owner

Done.

Code:

# Class comment
# another class comment
class Foo
  # attr_accessor comment
  attr_accessor :foo

  # method comment
  def bar
    # expr comment
    1 + # intermediate comment
      2
    # stray comment
  end
end

Associations:

{(class
  (const nil :Foo) nil
  (begin
    (send nil :attr_accessor
      (sym :foo))
    (def :bar
      (args)
      (send
        (int 1) :+
        (int 2)))))=>
  [#<Parser::Source::Comment:0x007fda223bbad8
    @location=#<Source::Range (comments) 0...15>,
    @text="# Class comment">,
   #<Parser::Source::Comment:0x007fda225a03d0
    @location=#<Source::Range (comments) 16...39>,
    @text="# another class comment">],
 (send nil :attr_accessor
  (sym :foo))=>
  [#<Parser::Source::Comment:0x007fda225fc270
    @location=#<Source::Range (comments) 52...75>,
    @text="# attr_accessor comment">],
 (def :bar
  (args)
  (send
    (int 1) :+
    (int 2)))=>
  [#<Parser::Source::Comment:0x007fda22685ca0
    @location=#<Source::Range (comments) 100...116>,
    @text="# method comment">],
 (send
  (int 1) :+
  (int 2))=>
  [#<Parser::Source::Comment:0x007fda226ba8d8
    @location=#<Source::Range (comments) 131...145>,
    @text="# expr comment">],
 (int 2)=>
  [#<Parser::Source::Comment:0x007fda226e7338
    @location=#<Source::Range (comments) 154...176>,
    @text="# intermediate comment">]}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants