Extracting comments from source code #31

bbatsov · 2013-05-15T09:43:06Z

Not really a issue, just a question I wasn't sure where to ask.

rubocop has some comment style checks that need access to the tokens generated by the lexer, since comments for obvious reasons are not part of the parser AST. Basically I need some equivalent of Ripper.lex. I noticed the Lexer class in Parser's source code and its use to generate the output of ruby-parser -E, but I'm quite certain how can I interact with it to simply get a list of tokens with their text and locations.

The text was updated successfully, but these errors were encountered:

whitequark · 2013-05-15T09:52:53Z

Not possible. The way Ruby works, you can only simultaneously lex and parse it. For example, compare these code fragments:

division and comment:

a = 1
a / 10 # /

call and regexp:

a / 10 # /

Ruby's (>=1.9.2) grammar is context-sensitive for all intents and purposes, and with the current LALR-parse+context-sensitive-lexer implementation it only ever makes sense to perform parsing and lexing simultaneously.

I'm not sure which interface for storing comments would be the most sane.

bbatsov · 2013-05-16T13:08:23Z

I see. Maybe comments should be tracked during lex/parse and the be stored in some structure that's finally attached to the source map of the top level AST node.

whitequark · 2013-05-16T13:17:08Z

Might be.

One of the problems here is that people seem to like comments attached to the def/class/module nodes, and these two approaches do not compose well. (I don't want to have the same data in two disjoint places in my AST.)

whitequark · 2013-05-16T14:41:46Z

What I'm currently thinking about is to make the #parse API look like this:

ast, comments = parser.parse(buffer)

comments would be an array of Parser::Source::Comment, complete with the Range and text.

Questions:

Do we include leading # (or =begin/=end)? Currently, (unexported) lexer interface includes them, as that's how ruby_parser works.
Do we merge distinct inline comments? Probably no, but there are use cases which benefit from it (see below).
How do we satisfy the users who just want preceding comments for various nodes? Def/class at least, but YARD-like consumers would want to get preceding comments for arbitrary nodes.
- Maybe make a post-processor? Such a post-processor could associate comments with nodes based on location info.
- How would such a post-processor handle comments which appear "inside" other nodes? E.g. `a + #foo\nb".

bbatsov · 2013-05-16T15:04:18Z

The API change seems reasonable to me.

I think we should include #/=begin. I don't think we should merge distinct inline comments.

The post-processor idea sounds good; we can probably just discard comments that appear inside nodes - seems unlikely that someone would like to map those to nodes anyways. I'm reasonably sure we should map only class/module/def/constant comments to nodes. The other comments would likely be needed just for bulk comment processing and having them only in the comments array should suffice.

whitequark · 2013-05-16T15:27:48Z

What is your reasoning for including #/=begin?

I believe that YARD allows documenting attr_accessor clauses and similar ones, that's why we must support attaching them to arbitrary nodes. Well, once you have constant assignments, everything else is also there, so it's not much of a problem.

What would an algorithm for associating comments to nodes look like? I'm thinking of this:

Let C be a view into lexically sorted set of comments of length 1.
Traverse the AST in depth-first order. While traversing each node, perform:
1. Let P (previous) be the current node.
2. Let N (next) be the first child node.
3. While C is contained lexically between P and N (that is, begin of C is contained after end of P's expression map, and end of C is contained before N's expression map):
  1. Associate C with N.
  2. Advance C.
4. Traverse N.
5. If there is a child node after N:
  1. Let P be N.
  2. Let N be the child node after N.
  3. Goto 2.3.
6. Otherwise, the current node is traversed.

This way, comments in weird places will get associated to the next nearest node, which sounds reasonable to me.

I'm also interested in another opinions... @mbj, @yorickpeterse, @judofyr?

yorickpeterse · 2013-05-16T19:02:38Z

Wouldn't a map/reduce like system (or just reduce in this case) work as well? Maybe I'm just not entirely understanding the description but consider the following:

You have two datasets: "comments" and "ast".
Iterate over each node in the "ast", do the following for each node:
1. Check if the next item in the "comments" dataset has its position set before the current "ast" node
2. If this is the case shift the item from the "comments" dataset and somehow associate it with the AST node.
Rinse, repeat

In terms of code that would look something like the following:

class Container
  attr_reader :type, :value, :line

  def initialize(type, value, line)
    @type  = type
    @value = value
    @line  = line
  end

  def inspect
    return "(#{type} #{value} #{line})"
  end
end

comments = [
  Container.new(:comment, 'hello', 1),
  Container.new(:comment, 'world', 5)
]

ast = [
  Container.new(:node, 'a', 2),
  Container.new(:node, 'b', 3),
  Container.new(:node, 'c', 4),
  Container.new(:node, 'd', 6)
]

associations = {}

ast.each do |node|
  comment = comments[0]

  if comment and comment.line < node.line
    associations[node] = comments.shift
  end
end

p associations

This would result in the following:

{(node a 2)=>(comment hello 1), (node d 6)=>(comment world 5)}

Note that we might be talking about the same algorithm here.

whitequark · 2013-05-16T20:12:06Z

@yorickpeterse Kinda the same.

bbatsov · 2013-05-17T14:55:22Z

@whitequark My reasoning is simply that I need to be able to differentiate between =begin and # comments in RuboCop.

mbj · 2013-05-17T15:02:25Z

@whitequark, @yorickpeterse Dont have a better / different idea than yours.

whitequark · 2013-05-22T07:18:32Z

Hm, I made #parse return [ast, comments] but it's kind of ugly. Maybe make separate #parse and #parse_with_comments? Not sure here.

bbatsov · 2013-05-22T07:42:02Z

@whitequark Why not change the method signature of parse to accept an optional hash of additional params :with_comments and :with_tokens. If they are both false parse would simply return the ast (as it does now), instead of an array.

whitequark · 2013-05-22T07:45:34Z

Um, no, I don't like such return value polymorphism at all. It means I don't know what is returned unless I have a hash literal at the call site.

judofyr · 2013-05-22T07:55:27Z

+1 for #parse_with_comments

whitequark · 2013-05-25T07:08:34Z

Done.

Code:

# Class comment
# another class comment
class Foo
  # attr_accessor comment
  attr_accessor :foo

  # method comment
  def bar
    # expr comment
    1 + # intermediate comment
      2
    # stray comment
  end
end

Associations:

{(class
  (const nil :Foo) nil
  (begin
    (send nil :attr_accessor
      (sym :foo))
    (def :bar
      (args)
      (send
        (int 1) :+
        (int 2)))))=>
  [#<Parser::Source::Comment:0x007fda223bbad8
    @location=#<Source::Range (comments) 0...15>,
    @text="# Class comment">,
   #<Parser::Source::Comment:0x007fda225a03d0
    @location=#<Source::Range (comments) 16...39>,
    @text="# another class comment">],
 (send nil :attr_accessor
  (sym :foo))=>
  [#<Parser::Source::Comment:0x007fda225fc270
    @location=#<Source::Range (comments) 52...75>,
    @text="# attr_accessor comment">],
 (def :bar
  (args)
  (send
    (int 1) :+
    (int 2)))=>
  [#<Parser::Source::Comment:0x007fda22685ca0
    @location=#<Source::Range (comments) 100...116>,
    @text="# method comment">],
 (send
  (int 1) :+
  (int 2))=>
  [#<Parser::Source::Comment:0x007fda226ba8d8
    @location=#<Source::Range (comments) 131...145>,
    @text="# expr comment">],
 (int 2)=>
  [#<Parser::Source::Comment:0x007fda226e7338
    @location=#<Source::Range (comments) 154...176>,
    @text="# intermediate comment">]}

ghost assigned whitequark May 17, 2013

This was referenced May 19, 2013

Lexing support #46

Closed

2.0 release schedule #51

Closed

whitequark closed this as completed in d84eb7c May 25, 2013

whitequark mentioned this issue May 25, 2013

Migrate to Parser 2.0 rubocop/rubocop#195

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extracting comments from source code #31

Extracting comments from source code #31

bbatsov commented May 15, 2013

whitequark commented May 15, 2013

bbatsov commented May 16, 2013

whitequark commented May 16, 2013

whitequark commented May 16, 2013

bbatsov commented May 16, 2013

whitequark commented May 16, 2013

yorickpeterse commented May 16, 2013

whitequark commented May 16, 2013

bbatsov commented May 17, 2013

mbj commented May 17, 2013

whitequark commented May 22, 2013

bbatsov commented May 22, 2013

whitequark commented May 22, 2013

judofyr commented May 22, 2013

whitequark commented May 25, 2013

Extracting comments from source code #31

Extracting comments from source code #31

Comments

bbatsov commented May 15, 2013

whitequark commented May 15, 2013

bbatsov commented May 16, 2013

whitequark commented May 16, 2013

whitequark commented May 16, 2013

bbatsov commented May 16, 2013

whitequark commented May 16, 2013

yorickpeterse commented May 16, 2013

whitequark commented May 16, 2013

bbatsov commented May 17, 2013

mbj commented May 17, 2013

whitequark commented May 22, 2013

bbatsov commented May 22, 2013

whitequark commented May 22, 2013

judofyr commented May 22, 2013

whitequark commented May 25, 2013