kotlin: parser based on peg grammar #2866

dolik-rce · 2021-02-14T14:12:46Z

I had to rewrite the Kotlin parser from scratch in order to be able to correctly implement scope tracking. The new implementation is based on official Kotlin grammar. I have just rewrote the grammar from ANTLR to peg, trying to modify it only where necessary to make it work with different type of parser generator (and also to make it possible to recover from parsing failures).

This is still work in progress, because I have yet to test if the new implementation is compatible with geany (which is the main reason I'm doing this).

peg/kotlin.peg

peg/kotlin_post.h

masatake · 2021-02-14T14:37:40Z

The comment I deleted may be wrong.
ANTLR grammar is not written in peg.
You are writing the new code in peg.

Great challenge!

I tried to rewrite the Java parser in peg. The result was that it was very slow; 15 ~ 30 times slower than the original parser ctags had.

You can use https://github.com/universal-ctags/codebase to evaluate your parser in the performance aspect.

peg/kotlin_pre.h

masatake · 2021-02-14T15:17:30Z

The original .g4 files are distributed under the Apache2 license. Though you have made many efforts to convert it to peg, I wonder we can merge this change to Universal ctags or not.

I met the same situation when I worked on "varlink.peg". I explicitly got permission I can use the original peg file in the MIT license.
See the header of "varlink.peg" and varlink/rust#20. I would like you to take the same action for your parser.

codecov · 2021-02-14T18:32:53Z

Codecov Report

Merging #2866 (2b224bd) into master (4b54332) will decrease coverage by 0.01%.
The diff coverage is 89.23%.

@@            Coverage Diff             @@
##           master    #2866      +/-   ##
==========================================
- Coverage   87.10%   87.09%   -0.02%     
==========================================
  Files         194      194              
  Lines       44387    44399      +12     
==========================================
+ Hits        38664    38669       +5     
- Misses       5723     5730       +7

Impacted Files	Coverage Δ
peg/kotlin_post.h	`89.23% <89.23%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b54332...7dfe4f4. Read the comment docs.

dolik-rce · 2021-02-15T07:56:59Z

Hi @masatake,

First of all, thanks for your fast review.

You are right, this is a tricky situation. I have asked about the license in kotlin-spec repository: Kotlin/kotlin-spec#70

I have tested the speed using codebase repository. However, I didn't notice that there is a 10s timeout for the test until today, so I was under false impression that it works quite fast, which is not really true 🙂 I did write a simple C parser, with the same features as the current regexp based one and compared them with the peg generated parser. The results for 60934 kotlin files (2269833 lines, 77611 kB) are following:

Implementation	Time	Avg. per file	Tags added
C	4 s	66 us	567695
RegExp	155 s	2.5 ms	527908
Peg	523s	8.6 ms	782500

Note however, that the comparison is not entirely fair, because the Peg implementation covers entire Kotlin grammar and tracks scope. The tests were all done on a machine with i5-6300U cpu and NVMe drive.

The differences are big, but considering how much files was parsed, it's not that bad. I think 9ms per file is ok, my biggest Kotlin project has about 250 files and is parsed under 3s, which is not great, but not terrible either.

I will also try to look at other languages (probably Java and C++, which have similar complexity as Kotlin) to see how bad the numbers really are.

If I were to make it faster, there are probably two different approaches:

Simplify the peg grammar as much as possible. There are probably some parts that are not useful to extract, so they could be omitted, without losing quality. It might also help to reorder some rules, to make it try more common variants first, to avoid some of the backtracking.
Write a full C parser. This would be much harder to do (and to maintain). As you can see the grammar is really complex and I'd have to use the kotlin-spec anyway to parse everything correctly.

dolik-rce · 2021-02-15T18:15:05Z

Good news, the Kotlin guys are surprisingly fast and decided to dual-license the code to GPLv2. So the legal obstructions should be out of the way.

Now to just solve the rest: Add support for reference tags and (maybe) make it faster...

masatake · 2021-02-16T16:14:14Z

Good news, the Kotlin guys are surprisingly fast and decided to dual-license the code to GPLv2. So the legal obstructions should be out of the way.

Fabulous!

Could you write the following things to "peg/kotlin_post.h" as notes?
A. kotolin_post.h was derived from files (reference files) in ,https://github.com/Kotlin/kotlin-spec
B. Kotlin / kotlin-spec is distributed under Apache2 license,
C. you requested the original developer "re-license or dual-license" at Kotlin/kotlin-spec#70,
D. The request was accepted, and the original developer made a new repository https://github.com/Kotlin/kotlin-grammar-gpl2 for redistributing the reference files in GPLv2.
E. this file is derived from Kotlin/kotlin-grammar-gpl2 repo.
F. Unlike many files in Universal Ctags, "This source code is released for free distribution under the terms of the GNU General Public License version 2." "or (at your option) any later version." is not included.

Based on this success story as a model, I want to document the actions we should take when encountering the same situation for supporting a new language.

The performance comparison is fascinating. Thank you for taking the time.

First of all, I myself don't care much about performance. If you, the original author and the first user, are o.k., I will merge the new parser. The new peg-based parser is slower than the existing regex-based parser but provides rich information like scope fields.
This can be done because kind-design was done in your regex-based parser. As far as the kind-design is compatible, we can replace the implementation as we want.
As I wrote a peg-based Java parser, but it is much slower than the existing parser. I rejected my peg-based parser because I, a ctags user, cannot accept the performance regression.

I didn't notice that there is a 10s timeout for the test until today,
Are you talking about the behavior of the codebase command?
make units has a time limit, but codebase has no time limit.

The biggest I got is whether the slowness comes from the peg itself or the code generated by packcc.
We can evaluate your peg rule with another code generator like https://github.com/pointlander/peg .

dolik-rce · 2021-02-17T06:22:26Z

Could you write the following things to "peg/kotlin_post.h" as notes?
...

Shouldn't it be in the kotlin.peg file? Because that one was derived from the original kotlin files.

I did some profiling and found some interesting things, but nothing that would give us really big performance boost for simple change. Here is what I've found:

20% of time is spent in pcc_apply_rule. This is no big surprise, since it is called for every considered rule.
60% of time is spent in malloc/realloc/free. This is surprising and quite alarming.
The remaining time is spread among pcc_evaluate_rule_* functions, string operations and some data maintanance (working with arrays in packcc etc.).

Problem number one could be maybe optimised a bit by simplifying the rules. I'll try to avoid parsing every letter and digit separately, by "inlining" Letter and UnicodeDigit rules. This might save quite a few function calls and some caching. It might also hlep with the "other functions" mentioned in 3, since less rules would be evaluated.

The second issue can be attacked in two ways:
a. Optimizing the packcc to generate code with less allocations and deallocations.
b. Using better allocator.
So far I have tried b, because it was simpler. Running the ctags with jemalloc simply be preloading it yields about 10% speed-up for Kotlin parsing. If you want to try, you can download/instlall jemalloc and call ctags with LD_PREOAD variable: LD_PRELOAD=/usr/lib/libjemalloc.so ctags ...

I've also made a "flamegraph", generated from perf data, you can find it here. Note that it is interactive: you can click it to zoom in particular function, or searched using the (almost invisible) search buton in top right corner. It's far from perfect, some function calls are not where they supposed to be, since I didn't have time to tweak the compilation parameters to make the binary contain all required info. But it gave me a general idea about what is happening.

dolik-rce · 2021-02-17T06:23:44Z

Oh, and you were right about the timeout in ctags-codebase. It was only in my local copy. I must have added it there a while back and then totally forget about it. I think it was because at the beginning I ran into some issues where my code would be stuck in infinite loop and I wanted to kill it in such case...

dolik-rce · 2021-02-17T06:34:06Z

Update: I tried to inline the Letter and UnicodeDigit rules and it works surprisingly well. It runs about 8% faster. I'll try to write a simple preprocessor to inline more of the rules before compilation. If it proves to work well, we might try to suggest (or implement) such optimalization directly in packcc.

masatake · 2021-02-17T16:47:17Z

Shouldn't it be in the kotlin.peg file? Because that one was derived from the original kotlin files.

Yes, you are correct.

60% of time is spent in malloc/realloc/free. This is surprising and quite alarming.

Yes. I studied it when I rewrote the Java parser. I found packcc allocates memory objects having the same statically calculated size.
We can introduce a memory allocator specialized to the sizes as you wrote as . b. Using better allocator.
e.g objpool.

But ... let's suspend the discussion about packcc here.
Currently, ctags uses packcc maintained at https://github.com/universal-ctags/packcc.
However, I would like to switch it to the real upstream project, https://github.com/arithy/packcc .
I would like you to see universal-ctags/packcc#5 and arithy/packcc#13...
Oh, you are the person I discussed the test infrastructure of packcc at arithy/packcc#14 !!!

Now, this discussion becomes very simple.
If you think the peg-based kotlin parser is better than the mtable-regex-based one, let's merge the new one.
Let's discuss the way to improve packcc at the upstream project. The new kotlin parser we will merge can be the reference for meaning the performance of code generated by packcc in the discussion. I felt packcc had so few examples. I think therefore making a serious discussion about performance is hard. The new kotlin parser can fill the gap of discussion.

I proposed I will work on writing a test infrastructure for packcc as I did in u-ctags. However, I have no time now.

masatake · 2021-02-17T16:55:33Z

Some comments:

running test cases under valgrind
Did you try
```
$ make units LANGUAGES=Kotlin VG=1
```
?

If you installed valgrind, the units target runs test cases under valgrind. It is quite useful.

update docs/NEWS.rst like:

diff --git a/docs/news.rst b/docs/news.rst
index d4ab610c1..a6913f455 100644
--- a/docs/news.rst
+++ b/docs/news.rst
@@ -415,7 +415,7 @@ The following parsers have been added:
 * JSON
 * Julia
 * Kconfig *optlib*
-* Kotlin *optlib*
+* Kotlin *peg/packcc*
 * GNU linker script(LdScript)
 * Man page *optlib*
 * Markdown *optlib*

dolik-rce · 2021-02-17T20:23:06Z

running test cases under valgrind

Just tried, shows 0 valgrind errors. Thank you for pointing this out, I didn't even know about the option.

I also updated the news.rst, as you suggested.

If you don't see any other problems, I think this can be merged. I'll try to look into some optimizations in separate PR in near future.

peg/kotlin_post.h

main/parsers_p.h

peg/kotlin_post.h

masatake · 2021-02-18T12:53:54Z

peg/kotlin_post.h

+        unsigned long endLine = getInputLineNumberForFileOffset(offset-1);
+        if (startLine == endLine)
+        {
+            fprintf(stderr, "Failed to parse '%s' at line %lu!\n", getInputFileName(), startLine);


Just a question. Is this message for users or for debugging purposes?

In my idea, ctags should not report syntax errors in the input source files to USERS.
For debugging purposes, we can use macros defined in main/trace.h.
TRACE_PRINT is a suitable replacement for this fprintf.
To use trace, you have to run ./configure with --enable-debugging.
In addition, when you run ctags, you have to pass --_trace=Kotlin.

It was meant for user. But if it is not common behavior in ctags, I'm okay with it.

I changed it to only track and print the parsing failures in debug mode.

I must reconsider this topic in the future.

- correctly allocate and free memory - fix indentation - do not report syntax errors, unless in trace mode

masatake · 2021-02-18T18:55:25Z

Thank you very much.

dolik-rce force-pushed the kotlin-full-grammar branch from 698e304 to 31f78fd Compare February 14, 2021 14:19

masatake reviewed Feb 14, 2021

View reviewed changes

peg/kotlin.peg Show resolved Hide resolved

masatake reviewed Feb 14, 2021

View reviewed changes

peg/kotlin_post.h Show resolved Hide resolved

masatake reviewed Feb 14, 2021

View reviewed changes

peg/kotlin_pre.h Outdated Show resolved Hide resolved

dolik-rce force-pushed the kotlin-full-grammar branch from 31f78fd to 17508ec Compare February 14, 2021 18:29

dolik-rce force-pushed the kotlin-full-grammar branch from 17508ec to ab0d9af Compare February 15, 2021 18:09

dolik-rce force-pushed the kotlin-full-grammar branch 2 times, most recently from 02bf826 to 5f2cf95 Compare February 16, 2021 05:17

dolik-rce added 4 commits February 17, 2021 19:42

kotlin: parser based on peg grammar

7ca9a0e

drop support for imports

de91af1

fix win32 build

f74c897

add licensing notes

73846d8

dolik-rce force-pushed the kotlin-full-grammar branch from 5f2cf95 to 73846d8 Compare February 17, 2021 18:42

update news file

e97f58c

dolik-rce marked this pull request as ready for review February 17, 2021 20:24

masatake requested changes Feb 18, 2021

View reviewed changes

various small fixes

7dfe4f4

- correctly allocate and free memory - fix indentation - do not report syntax errors, unless in trace mode

masatake merged commit d532b5c into universal-ctags:master Feb 18, 2021

dolik-rce mentioned this pull request Feb 19, 2021

Peg parser optimizations #2878

Closed

dolik-rce mentioned this pull request Dec 7, 2021

Add support for Kotlin tags geany/geany#3034

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kotlin: parser based on peg grammar #2866

kotlin: parser based on peg grammar #2866

dolik-rce commented Feb 14, 2021

masatake commented Feb 14, 2021

masatake commented Feb 14, 2021

codecov bot commented Feb 14, 2021 •

edited

dolik-rce commented Feb 15, 2021

dolik-rce commented Feb 15, 2021 •

edited

masatake commented Feb 16, 2021

dolik-rce commented Feb 17, 2021

dolik-rce commented Feb 17, 2021 •

edited

dolik-rce commented Feb 17, 2021

masatake commented Feb 17, 2021

masatake commented Feb 17, 2021

dolik-rce commented Feb 17, 2021

masatake Feb 18, 2021

dolik-rce Feb 18, 2021

masatake Feb 18, 2021

masatake commented Feb 18, 2021

kotlin: parser based on peg grammar #2866

kotlin: parser based on peg grammar #2866

Conversation

dolik-rce commented Feb 14, 2021

masatake commented Feb 14, 2021

masatake commented Feb 14, 2021

codecov bot commented Feb 14, 2021 • edited

Codecov Report

dolik-rce commented Feb 15, 2021

dolik-rce commented Feb 15, 2021 • edited

masatake commented Feb 16, 2021

dolik-rce commented Feb 17, 2021

dolik-rce commented Feb 17, 2021 • edited

dolik-rce commented Feb 17, 2021

masatake commented Feb 17, 2021

masatake commented Feb 17, 2021

dolik-rce commented Feb 17, 2021

masatake Feb 18, 2021

Choose a reason for hiding this comment

dolik-rce Feb 18, 2021

Choose a reason for hiding this comment

masatake Feb 18, 2021

Choose a reason for hiding this comment

masatake commented Feb 18, 2021

codecov bot commented Feb 14, 2021 •

edited

dolik-rce commented Feb 15, 2021 •

edited

dolik-rce commented Feb 17, 2021 •

edited