Alternative parser technology #62
The parser is derived from Scala 2.11's parser. Because the Scala spec doesn't fully correspond to the current implementation, we can't just reimplement the parser from scratch without breaking plenty of hidden, subtle variations that are allowed right now. Refactoring starting from the reference codebase is far less likely to cause major (unintended) breakage. Apart from that, we're not certain that parboiled2 is expressive enough for our use cases to invest heavily in it. For example, as far as I know there is no lexer phase. That will cause problems for us, as we want to expose a token-level view of the source code for rewriting tools, pretty printers, etc.
One goal of ScalaMeta is to decouple from scalac. Maybe it would be a good thing to work on a parboiled2 version just for that purpose.
We are decoupled, in fact. We've forked and completely decoupled the 2.11 parser and are working on it independently. We hope that once we are done, Scala implementations can just use it as a library to do parsing.
@sirthias, could you comment on this? Do you think it would be possible to parse the entire Scala grammar with parboiled efficiently? Would it be possible to also generate Tree => Token mappings while doing that?
Yes, I think it is very much possible. The fact that pb2 is a PEG parser and doesn't require a lexer phase is a non-issue.
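[Editorial illustration] The scannerless approach described here — PEG rules consuming raw characters directly, with whitespace handled inside the rules, so no separate lexer phase is needed — can be sketched in plain Scala. All names below are illustrative; this is not parboiled2's actual API.

```scala
// A minimal sketch of scannerless PEG-style parsing in plain Scala.
// Rules consume raw characters, so no separate lexer phase is required.
object ScannerlessPeg {
  // A parser takes the input and a position, and either fails (None)
  // or succeeds with a value and the next position.
  type P[A] = (String, Int) => Option[(A, Int)]

  // Consume zero or more whitespace characters (never fails).
  val ws: P[Unit] = (s, i) => {
    var j = i
    while (j < s.length && s.charAt(j).isWhitespace) j += 1
    Some(((), j))
  }

  // An identifier: skip whitespace, then a letter followed by letters/digits.
  val ident: P[String] = (s, i) =>
    ws(s, i).flatMap { case (_, j) =>
      if (j < s.length && s.charAt(j).isLetter) {
        var k = j + 1
        while (k < s.length && s.charAt(k).isLetterOrDigit) k += 1
        Some((s.substring(j, k), k))
      } else None
    }

  // Ordered choice, PEG-style: try p, and only if it fails try q.
  def or[A](p: P[A], q: P[A]): P[A] = (s, i) => p(s, i).orElse(q(s, i))
}
```

Because choice is ordered and backtracking is local, the "tokens" are simply whatever each rule chooses to consume, which is why a PEG can dispense with a lexer.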
Cool, we've got some WIP: https://github.com/rcano/sps. Also props to @lihaoyi for the Scala parser for scalajs.
@densh is right in that the Scala lexical grammar is all wrong (to a first approximation) or incomplete (hi, whitespace sensitivity/semicolon insertion), so directly porting the spec over to a PEG doesn't work at all. The parser I have (linked above) works pretty well for 20 hours of work, and is only 400 lines long. It's able to fully parse and identify its own source code, and the source code of the project it's embedded in. I bet that with a relatively small amount of work I can make it spit out a nice AST too, with source positions. It may be less work than trying to clean up the hand-rolled 5000-LOC scalac parser.
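[Editorial illustration] The semicolon-insertion wrinkle mentioned above can be shown with a toy version of the rule Scala actually uses: a newline acts as a statement separator only when the token before it can legally end a statement and the token after it can legally begin one. The token names and sets below are invented for illustration; the real sets are larger.

```scala
// A toy sketch of Scala-style semicolon inference. This is the kind of
// whitespace sensitivity a plain context-free spec does not capture.
object SemiInference {
  // Tokens that can legally end a statement (tiny illustrative subset).
  private val canEnd = Set("IDENT", "LITERAL", "RPAREN", "RBRACE")

  // Tokens that can legally begin a statement (tiny subset; e.g. `else`
  // can never begin one, which is why `}` + newline + `else` still parses).
  private val canBegin = Set("IDENT", "LITERAL", "LPAREN", "IF", "VAL")

  // A newline between `prev` and `next` counts as a semicolon only when
  // both conditions hold.
  def newlineIsSemi(prev: String, next: String): Boolean =
    canEnd(prev) && canBegin(next)
}
```

A lexerless PEG has to encode this decision directly in its rules, which is part of what makes porting the spec non-trivial.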
@lihaoyi How interesting! Do you plan to continue your work on the parser? Speaking of positions, do you think it would be possible to emit precise range positions? Also, how hard would it be to adapt the parser to also remember the list of tokens corresponding to each tree (or something analogous)?
Yeah, I'm going to continue working on it. I'm relying on it for the templates I'm using at http://lihaoyi.github.io/hands-on-scala-js, and though I'm not really stretching it (you don't do that fancy stuff in templates), it's not going to go away any time soon. Making it emit precise positions would be trivial. Making it emit a nice AST would be pretty trivial too. I'd guess doing both of those would be another 20 hours of work, tops. I don't think you'd be able to do it without forking the source, but with 400 LOC, forking the source isn't a big deal anyway.
So I've put in another 10 hours or so, and the linked parser now parses every… I'll probably put in a few more hours at some point to make it parse everything in scala/scala, and then I'd consider the parser itself "done". Making it build whatever AST you want should be pretty easy after that.
@paulp What's your take on this?
I think "great", but also that it might be a little early to count those chickens. Error positioning and performance both need attention. There's no way to avoid dealing with xml without severely constraining the uses. I also suspect it needs work on the negative side (that is, failing to parse when it's supposed to fail). Mostly, though, it's extremely promising.
@paulp XML is easy to add; I just haven't done it, but it won't muck with the existing code at all. PEGs are nice like that. It is much more lenient than the current parser, but I don't think that's a bad thing. There are a bunch of things that should probably move from the parsing phase into some post-parsing sanity-checking phase, which would greatly simplify the grammar while keeping all other things constant.
You are absolutely correct on both points.
Since this parser appears to have overcome the two major difficulties I've encountered with combinators in the past (performance and the handling of left recursion), I'm adopting it. I've started the blender at https://github.com/paulp/scala-parser. As soon as I get the aesthetics where I want them, I'll incorporate an AST — that's something I've already implemented a few times.
Haoyi requested a number of pb2 improvements that appear to be somewhat crucial for working with pb2 parsers of this size, most importantly:
I'll try to get these tickets cleared tomorrow and publish a snapshot release.
This looks indeed promising. My 2c is a "me too" to what @paulp said: the hard part about parsers is not parsing correct code (and relatively simple code, like scala/scala), but having clear (and well-positioned) error messages. For that purpose I think nothing can beat hand-rolled recursive descent parsers, but I'm willing to be proven wrong. One more thing to be aware of: pb2 has a number of dependencies, including shapeless. That might be a problem if this project becomes (at some point) a dependency of…
@dragos While it's true that parboiled currently relies on shapeless's HLists, this is not really a "hard" dependency. We could switch to a tuple-based approach relatively easily should that become a requirement. (We have already successfully moved away from HLists in akka-http's routing DSL.) Apart from shapeless there are no real dependencies. The error-reporting point is important, though, and there is some more work to be done there, yes. On the contrary, there are even interesting opportunities in a PEG-parser-based approach.
@paulp @lihaoyi Hijacking this thread: what about splitting out the XML part so we can test the parser against http://www.w3.org/XML/Test/? Of course there is some intersection with the Scala parser (http://scala-lang.org/files/archive/spec/2.11/10-xml-expressions-and-patterns.html), but these are really two separate concerns. I'm working on a Scala fix tool with scalameta (like go fix), and one good thing would be to remove XML literals from projects.
We are very excited to witness the rapid development of fastparse (https://github.com/lihaoyi/fastparse), and it would be really interesting to have a fastparse-based parser that emits scala.meta trees with filled-in tokens! In the near future, I'll be addressing #150 and #154, which should make this issue approachable for contributions.
Maybe this is also a good contender to be included: https://github.com/begeric/FastParsers
@sirinath This ticket is not just about picking a parser combinator library; there also has to be an implementation of a full-blown parser for the Scala grammar, including all of its quirks.
For someone who attempts this, the above might help the cause, as it claims to have a similar interface to the Scala parser combinators.
@dragos I don't know what it'd take for you to be proven wrong, but IMHO this basically trumps scalac's error messages in terms of correctness: it shows all the alternatives that could make progress at the point of failure, with exact error positions. Furthermore, apart from having a nice toString, the failure is actually a data structure. This means it's trivial to programmatically/interactively expand e.g.

```
SmallerExprOrLambda -> ParenedLambda | PostfixLambda
LocalMod            -> `abstract` | `final` | `sealed` | `implicit` | `lazy`
BindPattern         -> Binding ~ InfixPattern | InfixPattern | VarId
```

and so on, recursively, as deeply as you would like. These aren't strings anymore. Furthermore, because these aren't strings, you can easily tag parts of the grammar with identity extra-metadata parsers, such that if the failure trace contains these parsers, the metadata will be present in the stack trace and can be used e.g. for a custom explanation. This would be as simple as replacing

```scala
val ObjDef = P(...)
```

with

```scala
val ObjDef = P(...)(FuncName("ObjDef", "The definition of a Scala `object` or singleton"))
```

A similar technique can be used to tag arbitrary sub-sections of parsers so that their failure results in custom error messages. For example, making procedure syntax produce a custom error that can be shown to the user would be as simple as swapping `=` for

```scala
`=`.failureMsg("Procedure syntax has been removed since 2.11.x; please use `def f: Unit = ...` if you want a function returning `Unit`")
```

and then extracting the message from

```scala
failure.traced.fullStack.collectFirst { case Tagged(_, msg) => msg }
```

Lastly, this has no dependencies and parses at 40,000 LOC/s, ~10x slower than scalac's own parser. Is there anything else you'd want from this? ^_^
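[Editorial illustration] The "failure is a data structure" idea does not depend on any particular library. A minimal sketch in plain Scala (all names here are invented, not fastparse's API): each named rule pushes its name onto the failure's stack, so tools can pattern-match the trace instead of parsing an error string.

```scala
// A sketch of structured parse failures: a failure records the input
// offset and the stack of named rules active at the failure point.
object FailureAsData {
  final case class Failure(index: Int, stack: List[String])
  type Result[A] = Either[Failure, (A, Int)]
  type Rule[A]   = (String, Int) => Result[A]

  // Wrap a rule with a name; on failure, push the name onto the stack.
  def named[A](name: String)(r: Rule[A]): Rule[A] = (s, i) =>
    r(s, i) match {
      case Left(f) => Left(f.copy(stack = name :: f.stack))
      case ok      => ok
    }

  // A primitive rule: a single decimal digit.
  val digit: Rule[Char] = (s, i) =>
    if (i < s.length && s.charAt(i).isDigit) Right((s.charAt(i), i + 1))
    else Left(Failure(i, Nil))

  val number: Rule[Char] = named("Number")(digit)
  val expr:   Rule[Char] = named("Expr")(number)
}
```

Because the stack is plain data, attaching a human-readable explanation to a rule name, or expanding one level of alternatives interactively, is just a map lookup away.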
Starting from 0.1.0-RC5, scala.meta's infrastructure no longer requires Tree.tokens to be assigned for every parsed AST. Now the only thing the parser is expected to do is assign Tree.origin, which is equivalent to providing a position, i.e. a start/end range. Our tokenizer is then able to calculate tokens on its own based on the provided position. I hope that this will remove unnecessary barriers to providing an alternative parser for scala.meta trees. The only additional complication I can think of at the moment is the necessity to support the Quasiquote dialect of Scala, i.e. being able to accept stuff like …

/cc @lihaoyi
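[Editorial illustration] The origin-based scheme described here can be sketched in a few lines: given only a tree's start/end offsets into the source, the tokens are recoverable by slicing the source and re-tokenizing the slice. The names and the whitespace-splitting "tokenizer" below are toys for illustration, not scala.meta's actual API.

```scala
// A sketch of recovering a tree's tokens from its origin (start/end
// offsets) alone, instead of storing tokens on every tree node.
object TokensFromOrigin {
  final case class Origin(start: Int, end: Int)

  def tokens(source: String, origin: Origin): List[String] =
    source.substring(origin.start, origin.end)
      .split("\\s+").toList.filter(_.nonEmpty) // toy whitespace tokenizer
}
```

The payoff is that an alternative parser only needs to produce correct offsets; token bookkeeping stays in one place.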
Our parser has been serving us well for 3+ years, which shows that we're not in the market for alternative parser technologies. Therefore I'm closing this issue; if something groundbreaking comes up, we can always reopen it.
Hmm, the parser code is quite impressively long. Why isn't it using something like parboiled2?
https://github.com/sirthias/parboiled2