kotlin: parser based on peg grammar #2866
The comment I deleted may be wrong. Great challenge! I tried to rewrite the Java parser in peg. The result was very slow: 15 to 30 times slower than the original parser ctags had. You can use https://github.com/universal-ctags/codebase to evaluate the performance of your parser.
The original .g4 files are distributed under the Apache2 license. Though you have put a lot of effort into converting them to peg, I wonder whether we can merge this change into Universal ctags. I ran into the same situation when I worked on "varlink.peg". I explicitly got permission to use the original peg file under the MIT license.
Codecov Report
@@ Coverage Diff @@
## master #2866 +/- ##
==========================================
- Coverage 87.10% 87.09% -0.02%
==========================================
Files 194 194
Lines 44387 44399 +12
==========================================
+ Hits 38664 38669 +5
- Misses 5723 5730 +7
Hi @masatake, First of all, thanks for your fast review. You are right, this is a tricky situation. I have asked about the license in the kotlin-spec repository: Kotlin/kotlin-spec#70 I have tested the speed using the codebase repository. However, I didn't notice until today that there is a 10s timeout for the test, so I was under the false impression that it works quite fast, which is not really true 🙂 I wrote a simple C parser with the same features as the current regexp-based one and compared both with the peg-generated parser. The results for 60934 Kotlin files (2269833 lines, 77611 kB) are as follows:
Note, however, that the comparison is not entirely fair, because the peg implementation covers the entire Kotlin grammar and tracks scope. The tests were all done on a machine with an i5-6300U CPU and an NVMe drive. The differences are big, but considering how many files were parsed, it's not that bad. I think 9ms per file is OK; my biggest Kotlin project has about 250 files and is parsed in under 3s, which is not great, but not terrible either. I will also try to look at other languages (probably Java and C++, which have similar complexity to Kotlin) to see how bad the numbers really are. If I were to make it faster, there are probably two different approaches:
Good news: the Kotlin guys were surprisingly fast and decided to dual-license the code under GPLv2. So the legal obstacles should be out of the way. Now to solve the rest: add support for reference tags and (maybe) make it faster...
Fabulous! Could you write the following things to "peg/kotlin_post.h" as notes? Based on this success story as a model, I want to document the actions we should take when encountering the same situation for supporting a new language. The performance comparison is fascinating. Thank you for taking the time. First of all, I myself don't care much about performance. If you, the original author and the first user, are o.k. with it, I will merge the new parser. The new peg-based parser is slower than the existing regex-based parser but provides rich information like scope fields.
The biggest question I got is whether the slowness comes from the peg itself or the code generated by
Shouldn't it be in the kotlin.peg file? Because that one was derived from the original Kotlin files. I did some profiling and found some interesting things, but nothing that would give us a really big performance boost from a simple change. Here is what I've found:
Problem number one could maybe be optimised a bit by simplifying the rules. I'll try to avoid parsing every letter and digit separately, by "inlining" The second issue can be attacked in two ways: I've also made a "flamegraph", generated from perf data; you can find it here. Note that it is interactive: you can click it to zoom in on a particular function, or search using the (almost invisible) search button in the top right corner. It's far from perfect: some function calls are not where they are supposed to be, since I didn't have time to tweak the compilation parameters to make the binary contain all the required info. But it gave me a general idea of what is happening.
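The idea of not handling every letter and digit through its own rule can be illustrated outside the parser generator. The following is only a rough sketch in plain C, not the PR's grammar and not generated parser code: it matches an ASCII-only identifier once with one helper call per character (the way separate letter and digit rules would be evaluated) and once with a single loop over one character class. All function names are hypothetical, and real Kotlin identifiers are not limited to ASCII.

```c
#include <ctype.h>
#include <stddef.h>

/* Per-rule style: one helper call per character, mimicking separate
 * "letter" and "digit" rules (hypothetical names, ASCII only). */
int sketch_is_letter(char c)
{
	return isalpha((unsigned char)c) || c == '_';
}

int sketch_is_digit(char c)
{
	return isdigit((unsigned char)c) != 0;
}

/* Returns the length of the identifier at the start of s, or 0 if none. */
size_t sketch_ident_len_per_rule(const char *s)
{
	size_t n;

	if (!sketch_is_letter(s[0]))
		return 0;
	n = 1;
	while (sketch_is_letter(s[n]) || sketch_is_digit(s[n]))
		n++;
	return n;
}

/* "Inlined" style: the same match expressed as one loop over a single
 * character class, without going through per-character helpers. */
size_t sketch_ident_len_inlined(const char *s)
{
	size_t n;

	if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
		return 0;
	for (n = 1; isalnum((unsigned char)s[n]) || s[n] == '_'; n++)
		;
	return n;
}
```

Both functions accept the same input; the second only removes the per-character indirection. Whether the equivalent change in the grammar yields a measurable speed-up would have to be checked by profiling.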
Oh, and you were right about the timeout in ctags-codebase. It was only in my local copy. I must have added it there a while back and then totally forgotten about it. I think it was because at the beginning I ran into some issues where my code would get stuck in an infinite loop and I wanted to kill it in such a case...
Update: I tried to inline the |
Yes, you are correct.
Yes. I studied it when I rewrote the Java parser. I found that packcc allocates memory objects having the same statically calculated size. But ... let's suspend the discussion about packcc here. Now, this discussion becomes very simple. I proposed that I would work on writing a test infrastructure for packcc as I did in u-ctags. However, I have no time now.
Some comments:
|
Just tried it; it shows 0 valgrind errors. Thank you for pointing this out, I didn't even know about the option. I also updated news.rst, as you suggested. If you don't see any other problems, I think this can be merged. I'll try to look into some optimizations in a separate PR in the near future.
peg/kotlin_post.h
	unsigned long endLine = getInputLineNumberForFileOffset(offset-1);
	if (startLine == endLine)
	{
		fprintf(stderr, "Failed to parse '%s' at line %lu!\n", getInputFileName(), startLine);
Just a question. Is this message for users or for debugging purposes?
In my opinion, ctags should not report syntax errors in the input source files to USERS.
For debugging purposes, we can use the macros defined in main/trace.h.
TRACE_PRINT is a suitable replacement for this fprintf.
To use trace, you have to run ./configure with --enable-debugging.
In addition, when you run ctags, you have to pass --_trace=Kotlin.
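A minimal sketch of this suggestion, assuming a simplified stand-in for the tracing facility: the failure message is emitted only when tracing is switched on, so ordinary users never see it. `SKETCH_TRACE_PRINT`, `sketch_trace_enabled` and `sketch_report_failure` are hypothetical names made up for this illustration and are not the ctags API; the real `TRACE_PRINT` is defined in main/trace.h, and its exact signature should be checked there.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the switch that a debugging build plus
 * --_trace=Kotlin would turn on. */
bool sketch_trace_enabled = false;

/* Hypothetical stand-in for TRACE_PRINT: prints a printf-style message
 * followed by a newline, but only while tracing is enabled. */
#define SKETCH_TRACE_PRINT(...) \
	do { \
		if (sketch_trace_enabled) { \
			fprintf(stderr, __VA_ARGS__); \
			fputc('\n', stderr); \
		} \
	} while (0)

/* Reports a parse failure to developers only.
 * Returns true if a message was actually emitted. */
bool sketch_report_failure(const char *fileName, unsigned long line)
{
	SKETCH_TRACE_PRINT("Failed to parse '%s' at line %lu!", fileName, line);
	return sketch_trace_enabled;
}
```

With `sketch_trace_enabled` left at `false`, calling `sketch_report_failure("a.kt", 3)` prints nothing; after setting it to `true`, the same call writes the message to stderr.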
It was meant for the user. But if that is not common behavior in ctags, I'm okay with it.
I changed it to only track and print the parsing failures in debug mode.
I must reconsider this topic in the future.
- correctly allocate and free memory
- fix indentation
- do not report syntax errors, unless in trace mode
Thank you very much.
I had to rewrite the Kotlin parser from scratch in order to implement scope tracking correctly. The new implementation is based on the official Kotlin grammar. I have just rewritten the grammar from ANTLR to peg, trying to modify it only where necessary to make it work with a different type of parser generator (and also to make it possible to recover from parsing failures).
This is still a work in progress, because I have yet to test whether the new implementation is compatible with Geany (which is the main reason I'm doing this).