Alternative parser technology #62

Closed
MasseGuillaume opened this issue Nov 25, 2014 · 27 comments
Comments

@MasseGuillaume
Collaborator

Hmm, the parser code is impressively long. Why isn't it using something like parboiled2?

https://github.com/sirthias/parboiled2

@densh
Member

densh commented Nov 25, 2014

The parser is derived from Scala 2.11's parser. Because the Scala spec doesn't fully correspond to the current implementation, we can't just reimplement the parser from scratch without breaking plenty of subtle, hidden behaviors that are allowed right now. Refactoring from the reference codebase is far less likely to cause major (unintended) breakage.

Apart from that, we're not certain that parboiled2 is expressive enough for our use cases to justify investing heavily in it. For example, as far as I know, it has no lexer phase. That will cause problems for us, since we want to expose a token-level view of the source code for rewriting tools, pretty printers, etc.
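The token-level view mentioned here can be illustrated with a small stdlib-only sketch (the names are hypothetical, not scala.meta's actual API). The key property is that the token stream, concatenated, reproduces the source exactly, including whitespace, which is what rewriting tools and pretty printers rely on:

```scala
// Minimal sketch of a lexer that preserves every character, including
// whitespace, so the token stream can reproduce the source exactly.
// All names here are illustrative, not scala.meta's real API.
case class Token(kind: String, text: String, start: Int, end: Int)

def tokenize(src: String): Vector[Token] = {
  val pattern = """(\s+)|([A-Za-z_][A-Za-z0-9_]*)|(\d+)|(.)""".r
  pattern.findAllMatchIn(src).map { m =>
    val kind =
      if (m.group(1) != null) "ws"
      else if (m.group(2) != null) "ident"
      else if (m.group(3) != null) "int"
      else "punct"
    Token(kind, m.matched, m.start, m.end)
  }.toVector
}

val src = "val x = 42"
val toks = tokenize(src)
// Round-trip: concatenating all token texts recovers the original source.
assert(toks.map(_.text).mkString == src)
```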

@MasseGuillaume
Collaborator Author

One goal of ScalaMeta is to decouple from scalac. Maybe it would be worthwhile to work on a parboiled2 version just for that purpose.

@densh
Member

densh commented Nov 25, 2014

We are decoupled, in fact. We've forked the 2.11 parser, completely decoupled it, and are working on it independently. We hope that once we're done, Scala implementations can just use it as a library to do parsing.

@xeno-by
Member

xeno-by commented Nov 25, 2014

@sirthias, could you comment on this? Do you think it would be possible to parse the entire Scala grammar with parboiled efficiently? Would it be possible to also generate Tree => Token mappings while doing that?

@sirthias

Yes, I think it is very much possible.
In fact, the grammar itself already appears to be largely done:
sirthias/parboiled2#98

The fact that pb2 is a PEG parser and doesn't require a lexer phase is a non-issue.
And, while it might be true that coming up with a correct PEG grammar for Scala could be more work than distilling a hand-rolled parser from scalac, the discussion on the ticket referenced above shows that much of the work appears to be done already.
Of course, a parser of this size and complexity is bound to surface a few points where pb2 can be improved. But so far we have not discovered any serious roadblocks that would prevent pb2 from being suited to the endeavor.

@MasseGuillaume
Collaborator Author

@lihaoyi

lihaoyi commented Nov 25, 2014

@densh is right in that the Scala lexical grammar is all wrong (to a first approximation) or incomplete (hi, whitespace sensitivity/semicolon insertion), so directly porting the spec over to a PEG doesn't work at all.

The parser I have (linked above) works pretty well for 20 hours of work, and is only 400 lines long. It's able to fully parse and identify its own source code, and the source code of the project it's embedded in. I bet that with a relatively small amount of work I can make it spit out a nice AST with source positions too. That may be less work than trying to clean up the hand-rolled 5,000-line scalac parser.

@xeno-by
Member

xeno-by commented Nov 26, 2014

@lihaoyi How interesting! Do you plan to continue your work on the parser?

Speaking of positions, do you think it would be possible to emit precise range positions? Also, how hard would it be to adapt the parser to also remember the list of tokens corresponding to each tree (or something analogous)?

@lihaoyi

lihaoyi commented Nov 26, 2014

Yeah, I'm going to continue working on it. I'm relying on it for the templates I use at http://lihaoyi.github.io/hands-on-scala-js, so even though I'm not really stretching it (you don't do that fancy stuff in templates), it's not going to go away any time soon.

Making it emit precise positions would be trivial. Making it emit a nice AST would be pretty trivial too; I'd guess doing both would be another 20 hours of work, tops. I don't think you'd be able to do it without forking the source, but at 400 lines, forking the source isn't a big deal anyway.
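One hypothetical shape for "an AST with source positions" is to have every node carry the offsets of the text it covers, so a position is just a slice of the original source (these types are illustrative, not the parser's actual output):

```scala
// Sketch: AST nodes that carry start/end offsets into the source text.
// Types and names are hypothetical, invented for this illustration.
sealed trait Tree { def start: Int; def end: Int }
case class Ident(name: String, start: Int, end: Int) extends Tree
case class Apply(fn: Tree, arg: Tree, start: Int, end: Int) extends Tree

val src = "f(x)"
// Hand-built tree for the expression above.
val ast = Apply(Ident("f", 0, 1), Ident("x", 2, 3), 0, 4)
// A node's position is just a slice of the original source.
assert(src.substring(ast.start, ast.end) == "f(x)")
assert(src.substring(ast.arg.start, ast.arg.end) == "x")
```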

@lihaoyi

lihaoyi commented Nov 28, 2014

So I've put in another 10 hours or so, and the linked parser now parses every .scala file in https://github.com/scala-js/scala-js, every .scala file in https://github.com/scala/scala under /src, and makes it about halfway through the /tests folder. By this point the code that it's failing to parse is stuff that confuses the heck out of IntelliJ too =P So I think it's in pretty good shape given the small amount of time I've put into it.

I'll probably put in a few more hours at some point to make it parse everything in scala/scala, then I'd consider the parser itself "done". Making it build whatever AST you want should be pretty easy after that.

@MasseGuillaume
Collaborator Author

@paulp what's your take on this ?

@xeno-by
Member

xeno-by commented Nov 29, 2014

@paulp

paulp commented Nov 29, 2014

I think "great", but also that it might be a little early to count those chickens. Error positioning and performance both need attention. There's no way to avoid dealing with XML without severely constraining the uses. I also suspect it needs work on the negative side (that is, failing to parse when it's supposed to fail).

Mostly though it's extremely promising.

@lihaoyi

lihaoyi commented Nov 29, 2014

@paulp XML is easy to add; I just haven't done it, but it won't muck with the existing code at all. PEGs are nice like that.

It is much more lenient than the current parser, but I don't think that's a bad thing. There are a bunch of things that should probably move from the parsing phase into a post-parsing sanity-checking phase, which would greatly simplify the grammar while keeping everything else constant.
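The "lenient parse + post-parse check" split can be sketched in a few lines of stdlib Scala (names are invented for illustration): the "parser" accepts any sequence of modifiers without judging them, and a separate validation pass flags problems such as duplicates.

```scala
// Sketch of separating lenient parsing from post-parse sanity checks.
// Names and the toy "grammar" are hypothetical.
case class Defn(mods: List[String], name: String)

// "Parsing": split a header like "lazy lazy x" into modifiers + name,
// accepting anything -- no legality checks happen here.
def parseHeader(s: String): Defn = {
  val words = s.split("\\s+").toList
  Defn(words.init, words.last)
}

// Post-parse check: reject repeated modifiers, separately from the grammar.
def check(d: Defn): List[String] =
  d.mods.groupBy(identity).collect {
    case (m, xs) if xs.size > 1 => s"repeated modifier: $m"
  }.toList

assert(check(parseHeader("lazy val x")).isEmpty)
assert(check(parseHeader("lazy lazy x")) == List("repeated modifier: lazy"))
```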

@paulp

paulp commented Nov 29, 2014

You are absolutely correct on both points.

@paulp

paulp commented Nov 30, 2014

Since this parser appears to have overcome the two major difficulties I've encountered with combinators in the past (performance and the handling of left recursion), I'm adopting it. I've started the blender at https://github.com/paulp/scala-parser . As soon as I get the aesthetics where I want them I'll incorporate an AST; that's something I've already implemented a few times.

@sirthias

Haoyi requested a number of pb2 improvements that appear to be somewhat crucial for working with pb2 parsers of this size, most importantly:

I'll try to get these tickets cleared tomorrow and publish a snapshot release.
Also I'll make a review pass over the current state of the parser and put in a PR with any potential improvement that I can see (parser-implementation-wise).

@dragos

dragos commented Dec 1, 2014

This looks indeed promising.

My 2c is a "me too" to what @paulp said: the hard part about parsers is not parsing correct code (and relatively simple code, like scala/scala), but producing clear (and well-positioned) error messages. For that purpose I think nothing can beat a hand-rolled recursive-descent parser, but I'm willing to be proven wrong.

One more thing to be aware of: pb2 has a number of dependencies, including shapeless. That might be a problem if this project becomes (at some point) a dependency of scalac or another platform-level project (think bootstrapping and breaking releases).

@sirthias

sirthias commented Dec 1, 2014

@dragos while it's true that parboiled currently relies on shapeless's HLists, this is not really a "hard" dependency. We could switch to a tuple-based approach relatively easily should that become a requirement. (We have already successfully moved away from HLists in akka-http's routing DSL.) Apart from shapeless there are no real dependencies.

The error-reporting point is important, though, and there is more work to be done there, yes. But here too I don't see any fundamental issues.

On the contrary, there are even interesting opportunities in a PEG-parser-based approach. For example, once we've re-implemented the error recovery from pb1 (sirthias/parboiled2#42), a PEG parser can do things that a "traditional" parser can't, like recovering from typos in keywords. If you write objet instead of object, the parser could recognize this as a single-character typo and recover accordingly, rather than failing completely the way a token-based approach would have to.
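The keyword-recovery idea described above can be sketched with a plain edit-distance check (this is an illustration of the concept, not parboiled2 code):

```scala
// Sketch of single-char-typo keyword recovery: if no keyword matches
// exactly, accept the nearest one at edit distance 1. Illustration only.
val keywords = Set("object", "class", "trait", "def", "val")

// Classic Levenshtein distance via dynamic programming.
def editDistance(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length)
    dp(i)(j) =
      if (a(i - 1) == b(j - 1)) dp(i - 1)(j - 1)
      else 1 + math.min(dp(i - 1)(j - 1), math.min(dp(i - 1)(j), dp(i)(j - 1)))
  dp(a.length)(b.length)
}

def recoverKeyword(word: String): Option[String] =
  if (keywords(word)) Some(word)
  else keywords.find(k => editDistance(word, k) == 1)

assert(recoverKeyword("objet") == Some("object")) // the typo from the comment
assert(recoverKeyword("banana") == None)          // too far from any keyword
```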

@MasseGuillaume
Collaborator Author

@paulp @lihaoyi hijacking this thread.

What about splitting out the XML part so we can test the parser against http://www.w3.org/XML/Test/ ?

Of course there is some intersection with the Scala parser: http://scala-lang.org/files/archive/spec/2.11/10-xml-expressions-and-patterns.html

They're really two separate concerns.

I'm working on a Scala fix tool with scalameta (like go fix), and one good thing it could do would be to remove XML literals from projects.

@xeno-by xeno-by changed the title parboiled2 Alternative parser technology Jun 6, 2015
@xeno-by xeno-by added this to the Paradise milestone Jun 6, 2015
@xeno-by
Member

xeno-by commented Jun 6, 2015

We are very excited to witness rapid development of fastparse (https://github.com/lihaoyi/fastparse), and it would be really interesting to have a fastparse-based parser that emits scala.meta trees with filled in tokens!

In the near future, I'll be addressing #150 and #154, which should make this issue approachable for contributions.

@xeno-by xeno-by removed this from the Paradise milestone Jun 6, 2015
@sirinath

Maybe this is also a good contender to include: https://github.com/begeric/FastParsers

@densh
Member

densh commented Jun 15, 2015

@sirinath This ticket is not just about picking a parser-combinator library; there also has to be an implementation of a full-blown parser for the Scala grammar, including all of its quirks.

@sirinath

For someone who attempts this, the above might help the cause, as it claims to have a similar interface to the Scala parser combinators.

@lihaoyi

lihaoyi commented Jul 1, 2015

My 2c is a "me too" to what @paulp said: the hard part about parsers is not parsing correct code (and relatively simple code, like scala/scala), but having clear (and well positioned) error messages

https://github.com/lihaoyi/fastparse/blob/master/scalaparse/shared/src/test/scala/scalaparse/unit/FailureTests.scala

@dragos I don't know what it'd take for you to be proven wrong, but IMHO this basically trumps scalac's error messages in terms of parser correctness: it shows all options that could make progress at the point of failure, with exact error positions.

Furthermore, apart from having a nice toString, the failure is actually a data structure. This means it's trivial to programmatically/interactively expand, e.g.

SmallerExprOrLambda -> ParenedLambda | PostfixLambda

LocalMod            -> `abstract` | `final` | `sealed` | `implicit` | `lazy`

BindPattern         -> Binding ~ InfixPattern | InfixPattern | VarId

and so on recursively as deeply as you would like. These aren't strings anymore.
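The "failure as data" idea can be sketched with a tiny rule tree in stdlib Scala; the rule names mirror the comment, but the types are illustrative, not fastparse's actual classes:

```scala
// Sketch of a failure trace as data rather than a string. Because the
// alternatives are a tree of values, tooling can expand, filter, or
// render them however it likes. Types here are hypothetical.
sealed trait Rule { def name: String }
case class Terminal(name: String) extends Rule
case class Alt(name: String, choices: List[Rule]) extends Rule

val localMod = Alt("LocalMod",
  List("abstract", "final", "sealed", "implicit", "lazy").map(Terminal.apply))

// One possible rendering, matching the notation quoted above.
def expand(r: Rule): String = r match {
  case Terminal(n) => n
  case Alt(n, cs)  => s"$n -> ${cs.map(_.name).mkString(" | ")}"
}

assert(expand(localMod) ==
  "LocalMod -> abstract | final | sealed | implicit | lazy")
```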


Furthermore, because these aren't strings, you can easily tag parts of the grammar with identity extra-metadata parsers such that, if the failure trace contains these parsers, the metadata will be present in the stack trace and can be used, e.g., for a custom explanation. This would be as simple as replacing

val ObjDef = P(...)

with

val ObjDef = P(...)(FuncName("ObjDef", "The definition of a Scala `object` or singleton"))

A similar technique can be used to tag arbitrary sub-sections of parsers so that their failure results in custom error messages; e.g., making procedure syntax produce a custom error that can be extracted and shown to the user would be as simple as swapping

`=`          -> `=`.failureMsg("Procedure syntax has been removed since 2.11.x; please use `def f: Unit = ...` if you want a function returning `Unit`")

And then extracting the message from the Failure object via

failure.traced.fullStack.collectFirst{case Tagged(_, msg) => msg }

Lastly, this has no dependencies, and it parses at 40,000 LOC/s, only ~10x slower than scalac's own parser.

Is there anything else you'd want from this? ^_^

@xeno-by
Member

xeno-by commented May 6, 2016

Starting with 0.1.0-RC5, scala.meta's infrastructure no longer requires Tree.tokens to be assigned for every parsed AST. Now the only thing the parser is expected to do is assign Tree.origin, which is equivalent to providing a position, i.e. a start-to-end range. Our tokenizer can then calculate the tokens on its own from the provided position.
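The idea of deriving tokens from a position alone can be illustrated with a stdlib-only sketch: given the full source and a tree's start/end offsets, re-lex just that slice (Origin and tokensOf are invented names, not scala.meta's API):

```scala
// Sketch: given a source string and a tree's offsets, the tokenizer can
// recompute the tokens for that tree by re-lexing the slice.
// Names are hypothetical, not scala.meta's actual API.
case class Origin(source: String, start: Int, end: Int)

def tokensOf(o: Origin): Vector[String] =
  """\S+""".r.findAllIn(o.source.substring(o.start, o.end)).toVector

val source = "object Main { val x = 1 }"
val rhsOrigin = Origin(source, source.indexOf("val"), source.indexOf("1") + 1)
assert(tokensOf(rhsOrigin) == Vector("val", "x", "=", "1"))
```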

I hope that this will remove unnecessary barriers to providing an alternative parser for scala.meta trees.

The only additional complication I can think of at the moment is the need to support the Quasiquote dialect of Scala, i.e. accepting things like ..$xs or foo(...$argss) and transforming them into a scala.meta AST enriched with nodes representing ellipses and unquotes.

/cc @lihaoyi

@xeno-by
Member

xeno-by commented Jul 5, 2017

Our parser has been serving us well for 3+ years, which suggests that we're not in the market for alternative parser technologies. Therefore I'm closing this issue; if something groundbreaking comes up, we can always reopen it.

@xeno-by xeno-by closed this as completed Jul 5, 2017