Alternative parser technology #62

Closed
MasseGuillaume opened this issue Nov 25, 2014 · 27 comments
Comments

@MasseGuillaume
Collaborator

Hmm, the parser code is impressively long. Why isn't it using something like parboiled2?

https://github.com/sirthias/parboiled2

@densh
Member

densh commented Nov 25, 2014

The parser is derived from Scala 2.11's parser. Because the Scala spec doesn't fully correspond to the current implementation, we can't just reimplement the parser from scratch without breaking plenty of subtle, hidden behaviors that are allowed right now. Refactoring from the reference codebase is far less likely to cause major (unintended) breakage.

Apart from that, we're not certain that parboiled2 is expressive enough for our use cases to justify investing heavily in it. For example, as far as I know, it has no lexer phase. That will cause problems for us, since we want to expose a token-level view of the source code for rewriting tools, pretty printers, etc.
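The token-level view mentioned here can be illustrated with a small stdlib-only sketch (the names are hypothetical, not scala.meta's actual API). The key property is that the token stream, concatenated, reproduces the source exactly, including whitespace, which is what rewriting tools and pretty printers rely on:

```scala
// Minimal sketch of a lexer that preserves every character, including
// whitespace, so the token stream can reproduce the source exactly.
// All names here are illustrative, not scala.meta's real API.
case class Token(kind: String, text: String, start: Int, end: Int)

def tokenize(src: String): Vector[Token] = {
  val pattern = """(\s+)|([A-Za-z_][A-Za-z0-9_]*)|(\d+)|(.)""".r
  pattern.findAllMatchIn(src).map { m =>
    val kind =
      if (m.group(1) != null) "ws"
      else if (m.group(2) != null) "ident"
      else if (m.group(3) != null) "int"
      else "punct"
    Token(kind, m.matched, m.start, m.end)
  }.toVector
}

val src = "val x = 42"
val toks = tokenize(src)
// Round-trip: concatenating all token texts recovers the original source.
assert(toks.map(_.text).mkString == src)
```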

@MasseGuillaume
Collaborator Author

One goal of ScalaMeta is to decouple from scalac. Maybe it would be worthwhile to work on a parboiled2 version just for that purpose.

@densh
Member

densh commented Nov 25, 2014

We are decoupled, in fact. We've forked the 2.11 parser, completely decoupled it, and are working on it independently. We hope that once we're done, Scala implementations can just use it as a library to do parsing.

@xeno-by
Member

xeno-by commented Nov 25, 2014

@sirthias, could you comment on this? Do you think it would be possible to parse the entire Scala grammar with parboiled efficiently? Would it be possible to also generate Tree => Token mappings while doing that?

@sirthias

Yes, I think it is very much possible.
In fact, the grammar itself already appears to be largely done:
sirthias/parboiled2#98

The fact that pb2 is a PEG parser and doesn't require a lexer phase is a non-issue.
And, while it might be true that coming up with a correct PEG grammar for Scala could be more work than distilling a hand-rolled parser from scalac, the discussion on the ticket referenced above shows that much of the work appears to be done already.
Of course, a parser of this size and complexity is bound to surface a few points where pb2 can be improved. But so far we have not discovered any serious roadblocks that would prevent pb2 from being suited to the endeavor.

@MasseGuillaume
Collaborator Author

@lihaoyi

lihaoyi commented Nov 25, 2014

@densh is right in that the Scala lexical grammar is all wrong (to a first approximation) or incomplete (hi, whitespace sensitivity/semicolon insertion), so directly porting the spec over to a PEG doesn't work at all.

The parser I have (linked above) works pretty well for 20 hours of work, and is only 400 lines long. It's able to fully parse and identify its own source code, and the source code of the project it's embedded in. I bet that with a relatively small amount of work I can make it spit out a nice AST with source positions too. That may be less work than trying to clean up the hand-rolled 5,000-line scalac parser.

@xeno-by
Member

xeno-by commented Nov 26, 2014

@lihaoyi How interesting! Do you plan to continue your work on the parser?

Speaking of positions, do you think it would be possible to emit precise range positions? Also, how hard would it be to adapt the parser to also remember the list of tokens corresponding to each tree (or something analogous)?

@lihaoyi

lihaoyi commented Nov 26, 2014

Yeah, I'm going to continue working on it. I'm relying on it for the templates I use at http://lihaoyi.github.io/hands-on-scala-js, so even though I'm not really stretching it (you don't do that fancy stuff in templates), it's not going to go away any time soon.

Making it emit precise positions would be trivial. Making it emit a nice AST would be pretty trivial too; I'd guess doing both would be another 20 hours of work, tops. I don't think you'd be able to do it without forking the source, but at 400 lines, forking the source isn't a big deal anyway.
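One hypothetical shape for "an AST with source positions" is to have every node carry the offsets of the text it covers, so a position is just a slice of the original source (these types are illustrative, not the parser's actual output):

```scala
// Sketch: AST nodes that carry start/end offsets into the source text.
// Types and names are hypothetical, invented for this illustration.
sealed trait Tree { def start: Int; def end: Int }
case class Ident(name: String, start: Int, end: Int) extends Tree
case class Apply(fn: Tree, arg: Tree, start: Int, end: Int) extends Tree

val src = "f(x)"
// Hand-built tree for the expression above.
val ast = Apply(Ident("f", 0, 1), Ident("x", 2, 3), 0, 4)
// A node's position is just a slice of the original source.
assert(src.substring(ast.start, ast.end) == "f(x)")
assert(src.substring(ast.arg.start, ast.arg.end) == "x")
```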

@lihaoyi

lihaoyi commented Nov 28, 2014

So I've put in another 10 hours or so, and the linked parser now parses every .scala file in https://github.com/scala-js/scala-js, every .scala file in https://github.com/scala/scala under /src, and makes it about halfway through the /tests folder. By this point the code that it's failing to parse is stuff that confuses the heck out of IntelliJ too =P So I think it's in pretty good shape given the small amount of time I've put into it.

I'll probably put in a few more hours at some point to make it parse everything in scala/scala, then I'd consider the parser itself "done". Making it build whatever AST you want should be pretty easy after that.

@MasseGuillaume
Collaborator Author

@paulp what's your take on this ?

@xeno-by
Member

xeno-by commented Nov 29, 2014

@paulp

paulp commented Nov 29, 2014

I think "great", but also that it might be a little early to count those chickens. Error positioning and performance both need attention. There's no way to avoid dealing with XML without severely constraining the uses. I also suspect it needs work on the negative side (that is, failing to parse when it's supposed to fail).

Mostly though it's extremely promising.

@lihaoyi

lihaoyi commented Nov 29, 2014

@paulp XML is easy to add; I just haven't done it, but it won't muck with the existing code at all. PEGs are nice like that.

It is much more lenient than the current parser, but I don't think that's a bad thing. There are a bunch of things that should probably move from the parsing phase into a post-parsing sanity-checking phase, which would greatly simplify the grammar while keeping everything else constant.
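The "lenient parse + post-parse check" split can be sketched in a few lines of stdlib Scala (names are invented for illustration): the "parser" accepts any sequence of modifiers without judging them, and a separate validation pass flags problems such as duplicates.

```scala
// Sketch of separating lenient parsing from post-parse sanity checks.
// Names and the toy "grammar" are hypothetical.
case class Defn(mods: List[String], name: String)

// "Parsing": split a header like "lazy lazy x" into modifiers + name,
// accepting anything -- no legality checks happen here.
def parseHeader(s: String): Defn = {
  val words = s.split("\\s+").toList
  Defn(words.init, words.last)
}

// Post-parse check: reject repeated modifiers, separately from the grammar.
def check(d: Defn): List[String] =
  d.mods.groupBy(identity).collect {
    case (m, xs) if xs.size > 1 => s"repeated modifier: $m"
  }.toList

assert(check(parseHeader("lazy val x")).isEmpty)
assert(check(parseHeader("lazy lazy x")) == List("repeated modifier: lazy"))
```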

@paulp

paulp commented Nov 29, 2014

You are absolutely correct on both points.

@paulp

paulp commented Nov 30, 2014

Since this parser appears to have overcome the two major difficulties I've encountered with combinators in the past (performance and the handling of left recursion), I'm adopting it. I've started the blender at https://github.com/paulp/scala-parser . As soon as I get the aesthetics where I want them I'll incorporate an AST; that's something I've already implemented a few times.

@sirthias

Haoyi requested a number of pb2 improvements that appear to be somewhat crucial for working with pb2 parsers of this size, most importantly:

I'll try to get these tickets cleared tomorrow and publish a snapshot release.
Also I'll make a review pass over the current state of the parser and put in a PR with any potential improvement that I can see (parser-implementation-wise).

@dragos

dragos commented Dec 1, 2014

This looks indeed promising.

My 2c is a "me too" to what @paulp said: the hard part about parsers is not parsing correct code (and relatively simple code, like scala/scala), but producing clear (and well-positioned) error messages. For that purpose I think nothing can beat a hand-rolled recursive-descent parser, but I'm willing to be proven wrong.

One more thing to be aware of: pb2 has a number of dependencies, including shapeless. That might be a problem if this project becomes (at some point) a dependency of scalac or another platform-level project (think bootstrapping and breaking releases).

@sirthias

sirthias commented Dec 1, 2014

@dragos while it's true that parboiled currently relies on shapeless's HLists, this is not really a "hard" dependency. We could switch to a tuple-based approach relatively easily should that become a requirement. (We have already successfully moved away from HLists in akka-http's routing DSL.) Apart from shapeless there are no real dependencies.

The error-reporting point is important, though, and there is more work to be done there, yes. But here too I don't see any fundamental issues.

On the contrary, there are even interesting opportunities in a PEG-parser-based approach. For example, once we've re-implemented the error recovery from pb1 (sirthias/parboiled2#42), a PEG parser can do things that a "traditional" parser can't, like recovering from typos in keywords. If you write objet instead of object, the parser could recognize this as a single-character typo and recover accordingly, rather than failing completely the way a token-based approach would have to.
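The keyword-recovery idea described above can be sketched with a plain edit-distance check (this is an illustration of the concept, not parboiled2 code):

```scala
// Sketch of single-char-typo keyword recovery: if no keyword matches
// exactly, accept the nearest one at edit distance 1. Illustration only.
val keywords = Set("object", "class", "trait", "def", "val")

// Classic Levenshtein distance via dynamic programming.
def editDistance(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length)
    dp(i)(j) =
      if (a(i - 1) == b(j - 1)) dp(i - 1)(j - 1)
      else 1 + math.min(dp(i - 1)(j - 1), math.min(dp(i - 1)(j), dp(i)(j - 1)))
  dp(a.length)(b.length)
}

def recoverKeyword(word: String): Option[String] =
  if (keywords(word)) Some(word)
  else keywords.find(k => editDistance(word, k) == 1)

assert(recoverKeyword("objet") == Some("object")) // the typo from the comment
assert(recoverKeyword("banana") == None)          // too far from any keyword
```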

@MasseGuillaume
Collaborator Author

@paulp @lihaoyi hijacking this thread.

What about splitting out the XML part so we can test the parser against http://www.w3.org/XML/Test/ ?

Of course there is some intersection with the Scala parser: http://scala-lang.org/files/archive/spec/2.11/10-xml-expressions-and-patterns.html

They're really two separate concerns.

I'm working on a Scala fix tool with scalameta (like go fix), and one good thing it could do would be to remove XML literals from projects.

@xeno-by xeno-by changed the title parboiled2 Alternative parser technology Jun 6, 2015
@xeno-by xeno-by added this to the Paradise milestone Jun 6, 2015
@xeno-by
Member

xeno-by commented Jun 6, 2015

We are very excited to witness rapid development of fastparse (https://github.com/lihaoyi/fastparse), and it would be really interesting to have a fastparse-based parser that emits scala.meta trees with filled in tokens!

In the near future, I'll be addressing #150 and #154, which should make this issue approachable for contributions.

@xeno-by xeno-by removed this from the Paradise milestone Jun 6, 2015
@sirinath

Maybe this is also a good contender to include: https://github.com/begeric/FastParsers

@densh
Member

densh commented Jun 15, 2015

@sirinath This ticket is not just about picking a parser-combinator library; there also has to be an implementation of a full-blown parser for the Scala grammar, including all of its quirks.

@sirinath

For someone who attempts this, the above might help the cause, as it claims to have a similar interface to the Scala parser combinators.

@lihaoyi

lihaoyi commented Jul 1, 2015

My 2c is a "me too" to what @paulp said: the hard part about parsers is not parsing correct code (and relatively simple code, like scala/scala), but having clear (and well positioned) error messages

https://github.com/lihaoyi/fastparse/blob/master/scalaparse/shared/src/test/scala/scalaparse/unit/FailureTests.scala

@dragos I don't know what it'd take for you to be proven wrong, but IMHO this basically trumps scalac's error messages in terms of parser correctness: it shows all options that could make progress at the point of failure, with exact error positions.

Furthermore, apart from having a nice toString, the failure is actually a data structure. This means it's trivial to programmatically/interactively expand, e.g.

SmallerExprOrLambda -> ParenedLambda | PostfixLambda

LocalMod            -> `abstract` | `final` | `sealed` | `implicit` | `lazy`

BindPattern         -> Binding ~ InfixPattern | InfixPattern | VarId

and so on recursively as deeply as you would like. These aren't strings anymore.
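The "failure as data" idea can be sketched with a tiny rule tree in stdlib Scala; the rule names mirror the comment, but the types are illustrative, not fastparse's actual classes:

```scala
// Sketch of a failure trace as data rather than a string. Because the
// alternatives are a tree of values, tooling can expand, filter, or
// render them however it likes. Types here are hypothetical.
sealed trait Rule { def name: String }
case class Terminal(name: String) extends Rule
case class Alt(name: String, choices: List[Rule]) extends Rule

val localMod = Alt("LocalMod",
  List("abstract", "final", "sealed", "implicit", "lazy").map(Terminal.apply))

// One possible rendering, matching the notation quoted above.
def expand(r: Rule): String = r match {
  case Terminal(n) => n
  case Alt(n, cs)  => s"$n -> ${cs.map(_.name).mkString(" | ")}"
}

assert(expand(localMod) ==
  "LocalMod -> abstract | final | sealed | implicit | lazy")
```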


Furthermore, because these aren't strings, you can easily tag parts of the grammar with identity extra-metadata parsers such that, if the failure trace contains these parsers, the metadata will be present in the stack trace and can be used, e.g., for a custom explanation. This would be as simple as replacing

val ObjDef = P(...)

with

val ObjDef = P(...)(FuncName("ObjDef", "The definition of a Scala `object` or singleton"))

A similar technique can be used to tag arbitrary sub-sections of parsers so that their failure results in custom error messages; e.g., making procedure syntax produce a custom error that can be extracted and shown to the user would be as simple as swapping

`=`          -> `=`.failureMsg("Procedure syntax has been removed since 2.11.x; please use `def f: Unit = ...` if you want a function returning `Unit`")

And then extracting the message from the Failure object via

failure.traced.fullStack.collectFirst{case Tagged(_, msg) => msg }

Lastly, this has no dependencies, and it parses at 40,000 LOC/s, only ~10x slower than scalac's own parser.

Is there anything else you'd want from this? ^_^

@xeno-by
Member

xeno-by commented May 6, 2016

Starting with 0.1.0-RC5, scala.meta's infrastructure no longer requires Tree.tokens to be assigned for every parsed AST. Now the only thing the parser is expected to do is assign Tree.origin, which is equivalent to providing a position, i.e. a start-to-end range. Our tokenizer can then calculate the tokens on its own from the provided position.
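The idea of deriving tokens from a position alone can be illustrated with a stdlib-only sketch: given the full source and a tree's start/end offsets, re-lex just that slice (Origin and tokensOf are invented names, not scala.meta's API):

```scala
// Sketch: given a source string and a tree's offsets, the tokenizer can
// recompute the tokens for that tree by re-lexing the slice.
// Names are hypothetical, not scala.meta's actual API.
case class Origin(source: String, start: Int, end: Int)

def tokensOf(o: Origin): Vector[String] =
  """\S+""".r.findAllIn(o.source.substring(o.start, o.end)).toVector

val source = "object Main { val x = 1 }"
val rhsOrigin = Origin(source, source.indexOf("val"), source.indexOf("1") + 1)
assert(tokensOf(rhsOrigin) == Vector("val", "x", "=", "1"))
```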

I hope that this will remove unnecessary barriers to providing an alternative parser for scala.meta trees.

The only additional complication I can think of at the moment is the need to support the Quasiquote dialect of Scala, i.e. accepting things like ..$xs or foo(...$argss) and transforming them into a scala.meta AST enriched with nodes representing ellipses and unquotes.

/cc @lihaoyi

@xeno-by
Member

xeno-by commented Jul 5, 2017

Our parser has been serving us well for 3+ years, which suggests that we're not in the market for alternative parser technologies. Therefore I'm closing this issue; if something groundbreaking comes up, we can always reopen it.

@xeno-by xeno-by closed this as completed Jul 5, 2017