Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kotlin: parser based on peg grammar #2866

Merged
merged 6 commits into from Feb 18, 2021

Conversation

dolik-rce
Copy link
Contributor

I had to rewrite the Kotlin parser from scratch in order to be able to correctly implement scope tracking. The new implementation is based on official Kotlin grammar. I have just rewrote the grammar from ANTLR to peg, trying to modify it only where necessary to make it work with different type of parser generator (and also to make it possible to recover from parsing failures).

This is still work in progress, because I have yet to test if the new implementation is compatible with geany (which is the main reason I'm doing this).

@masatake
Copy link
Member

The comment I deleted may be wrong.
ANTLR grammar is not written in peg.
You are writing the new code in peg.


Great challenge!

I tried to rewrite the Java parser in peg. The result was that it was very slow; 15 ~ 30 times slower than the original parser ctags had.

You can use https://github.com/universal-ctags/codebase to evaluate your parser in the performance aspect.

peg/kotlin_pre.h Outdated Show resolved Hide resolved
@masatake
Copy link
Member

The original .g4 files are distributed under the Apache2 license. Though you have made many efforts to convert it to peg, I wonder we can merge this change to Universal ctags or not.

I met the same situation when I worked on "varlink.peg". I explicitly got permission I can use the original peg file in the MIT license.
See the header of "varlink.peg" and varlink/rust#20. I would like you to take the same action for your parser.

@codecov
Copy link

codecov bot commented Feb 14, 2021

Codecov Report

Merging #2866 (2b224bd) into master (4b54332) will decrease coverage by 0.01%.
The diff coverage is 89.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2866      +/-   ##
==========================================
- Coverage   87.10%   87.09%   -0.02%     
==========================================
  Files         194      194              
  Lines       44387    44399      +12     
==========================================
+ Hits        38664    38669       +5     
- Misses       5723     5730       +7     
Impacted Files Coverage Δ
peg/kotlin_post.h 89.23% <89.23%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b54332...7dfe4f4. Read the comment docs.

@dolik-rce
Copy link
Contributor Author

Hi @masatake,

First of all, thanks for your fast review.

You are right, this is a tricky situation. I have asked about the license in kotlin-spec repository: Kotlin/kotlin-spec#70

I have tested the speed using codebase repository. However, I didn't notice that there is a 10s timeout for the test until today, so I was under false impression that it works quite fast, which is not really true 🙂 I did write a simple C parser, with the same features as the current regexp based one and compared them with the peg generated parser. The results for 60934 kotlin files (2269833 lines, 77611 kB) are following:

Implementation Time Avg. per file Tags added
C 4 s 66 us 567695
RegExp 155 s 2.5 ms 527908
Peg 523s 8.6 ms 782500

Note however, that the comparison is not entirely fair, because the Peg implementation covers entire Kotlin grammar and tracks scope. The tests were all done on a machine with i5-6300U cpu and NVMe drive.

The differences are big, but considering how much files was parsed, it's not that bad. I think 9ms per file is ok, my biggest Kotlin project has about 250 files and is parsed under 3s, which is not great, but not terrible either.

I will also try to look at other languages (probably Java and C++, which have similar complexity as Kotlin) to see how bad the numbers really are.

If I were to make it faster, there are probably two different approaches:

  1. Simplify the peg grammar as much as possible. There are probably some parts that are not useful to extract, so they could be omitted, without losing quality. It might also help to reorder some rules, to make it try more common variants first, to avoid some of the backtracking.
  2. Write a full C parser. This would be much harder to do (and to maintain). As you can see the grammar is really complex and I'd have to use the kotlin-spec anyway to parse everything correctly.

@dolik-rce
Copy link
Contributor Author

dolik-rce commented Feb 15, 2021

Good news, the Kotlin guys are surprisingly fast and decided to dual-license the code to GPLv2. So the legal obstructions should be out of the way.

Now to just solve the rest: Add support for reference tags and (maybe) make it faster...

@dolik-rce dolik-rce force-pushed the kotlin-full-grammar branch 2 times, most recently from 02bf826 to 5f2cf95 Compare February 16, 2021 05:17
@masatake
Copy link
Member

Good news, the Kotlin guys are surprisingly fast and decided to dual-license the code to GPLv2. So the legal obstructions should be out of the way.

Fabulous!

Could you write the following things to "peg/kotlin_post.h" as notes?
A. kotolin_post.h was derived from files (reference files) in ,https://github.com/Kotlin/kotlin-spec
B. Kotlin / kotlin-spec is distributed under Apache2 license,
C. you requested the original developer "re-license or dual-license" at Kotlin/kotlin-spec#70,
D. The request was accepted, and the original developer made a new repository https://github.com/Kotlin/kotlin-grammar-gpl2 for redistributing the reference files in GPLv2.
E. this file is derived from Kotlin/kotlin-grammar-gpl2 repo.
F. Unlike many files in Universal Ctags, "This source code is released for free distribution under the terms of the GNU General Public License version 2." "or (at your option) any later version." is not included.

Based on this success story as a model, I want to document the actions we should take when encountering the same situation for supporting a new language.


The performance comparison is fascinating. Thank you for taking the time.

First of all, I myself don't care much about performance. If you, the original author and the first user, are o.k., I will merge the new parser. The new peg-based parser is slower than the existing regex-based parser but provides rich information like scope fields.
This can be done because kind-design was done in your regex-based parser. As far as the kind-design is compatible, we can replace the implementation as we want.
As I wrote a peg-based Java parser, but it is much slower than the existing parser. I rejected my peg-based parser because I, a ctags user, cannot accept the performance regression.

I didn't notice that there is a 10s timeout for the test until today,
Are you talking about the behavior of the codebase command?
make units has a time limit, but codebase has no time limit.

The biggest I got is whether the slowness comes from the peg itself or the code generated by packcc.
We can evaluate your peg rule with another code generator like https://github.com/pointlander/peg .

@dolik-rce
Copy link
Contributor Author

Could you write the following things to "peg/kotlin_post.h" as notes?
...

Shouldn't it be in the kotlin.peg file? Because that one was derived from the original kotlin files.


I did some profiling and found some interesting things, but nothing that would give us really big performance boost for simple change. Here is what I've found:

  1. 20% of time is spent in pcc_apply_rule. This is no big surprise, since it is called for every considered rule.
  2. 60% of time is spent in malloc/realloc/free. This is surprising and quite alarming.
  3. The remaining time is spread among pcc_evaluate_rule_* functions, string operations and some data maintanance (working with arrays in packcc etc.).

Problem number one could be maybe optimised a bit by simplifying the rules. I'll try to avoid parsing every letter and digit separately, by "inlining" Letter and UnicodeDigit rules. This might save quite a few function calls and some caching. It might also hlep with the "other functions" mentioned in 3, since less rules would be evaluated.

The second issue can be attacked in two ways:
a. Optimizing the packcc to generate code with less allocations and deallocations.
b. Using better allocator.
So far I have tried b, because it was simpler. Running the ctags with jemalloc simply be preloading it yields about 10% speed-up for Kotlin parsing. If you want to try, you can download/instlall jemalloc and call ctags with LD_PREOAD variable: LD_PRELOAD=/usr/lib/libjemalloc.so ctags ...

I've also made a "flamegraph", generated from perf data, you can find it here. Note that it is interactive: you can click it to zoom in particular function, or searched using the (almost invisible) search buton in top right corner. It's far from perfect, some function calls are not where they supposed to be, since I didn't have time to tweak the compilation parameters to make the binary contain all required info. But it gave me a general idea about what is happening.

@dolik-rce
Copy link
Contributor Author

dolik-rce commented Feb 17, 2021

Oh, and you were right about the timeout in ctags-codebase. It was only in my local copy. I must have added it there a while back and then totally forget about it. I think it was because at the beginning I ran into some issues where my code would be stuck in infinite loop and I wanted to kill it in such case...

@dolik-rce
Copy link
Contributor Author

Update: I tried to inline the Letter and UnicodeDigit rules and it works surprisingly well. It runs about 8% faster. I'll try to write a simple preprocessor to inline more of the rules before compilation. If it proves to work well, we might try to suggest (or implement) such optimalization directly in packcc.

@masatake
Copy link
Member

Shouldn't it be in the kotlin.peg file? Because that one was derived from the original kotlin files.

Yes, you are correct.

60% of time is spent in malloc/realloc/free. This is surprising and quite alarming.

Yes. I studied it when I rewrote the Java parser. I found packcc allocates memory objects having the same statically calculated size.
We can introduce a memory allocator specialized to the sizes as you wrote as . b. Using better allocator.
e.g objpool.

But ... let's suspend the discussion about packcc here.
Currently, ctags uses packcc maintained at https://github.com/universal-ctags/packcc.
However, I would like to switch it to the real upstream project, https://github.com/arithy/packcc .
I would like you to see universal-ctags/packcc#5 and arithy/packcc#13...
Oh, you are the person I discussed the test infrastructure of packcc at arithy/packcc#14 !!!

Now, this discussion becomes very simple.
If you think the peg-based kotlin parser is better than the mtable-regex-based one, let's merge the new one.
Let's discuss the way to improve packcc at the upstream project. The new kotlin parser we will merge can be the reference for meaning the performance of code generated by packcc in the discussion. I felt packcc had so few examples. I think therefore making a serious discussion about performance is hard. The new kotlin parser can fill the gap of discussion.

I proposed I will work on writing a test infrastructure for packcc as I did in u-ctags. However, I have no time now.

@masatake
Copy link
Member

Some comments:

  • running test cases under valgrind
    Did you try

    $ make units LANGUAGES=Kotlin VG=1
    

    ?

    If you installed valgrind, the units target runs test cases under valgrind. It is quite useful.

  • update docs/NEWS.rst like:

    diff --git a/docs/news.rst b/docs/news.rst
    index d4ab610c1..a6913f455 100644
    --- a/docs/news.rst
    +++ b/docs/news.rst
    @@ -415,7 +415,7 @@ The following parsers have been added:
     * JSON
     * Julia
     * Kconfig *optlib*
    -* Kotlin *optlib*
    +* Kotlin *peg/packcc*
     * GNU linker script(LdScript)
     * Man page *optlib*
     * Markdown *optlib*

@dolik-rce
Copy link
Contributor Author

running test cases under valgrind

Just tried, shows 0 valgrind errors. Thank you for pointing this out, I didn't even know about the option.

I also updated the news.rst, as you suggested.

If you don't see any other problems, I think this can be merged. I'll try to look into some optimizations in separate PR in near future.

@dolik-rce dolik-rce marked this pull request as ready for review February 17, 2021 20:24
peg/kotlin_post.h Outdated Show resolved Hide resolved
main/parsers_p.h Outdated Show resolved Hide resolved
peg/kotlin_post.h Outdated Show resolved Hide resolved
unsigned long endLine = getInputLineNumberForFileOffset(offset-1);
if (startLine == endLine)
{
fprintf(stderr, "Failed to parse '%s' at line %lu!\n", getInputFileName(), startLine);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question. Is this message for users or for debugging purposes?

In my idea, ctags should not report syntax errors in the input source files to USERS.
For debugging purposes, we can use macros defined in main/trace.h.
TRACE_PRINT is a suitable replacement for this fprintf.
To use trace, you have to run ./configure with --enable-debugging.
In addition, when you run ctags, you have to pass --_trace=Kotlin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was meant for user. But if it is not common behavior in ctags, I'm okay with it.

I changed it to only track and print the parsing failures in debug mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must reconsider this topic in the future.

 - correctly allocate and free memory
 - fix indentation
 - do not report syntax errors, unless in trace mode
@masatake masatake merged commit d532b5c into universal-ctags:master Feb 18, 2021
@masatake
Copy link
Member

Thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants