[YAML] Add entirely new YAML syntax #90

Merged (12 commits) on Mar 15, 2016


FichteFoll
Collaborator

I spent the last couple of weeks (every now and then) writing this from scratch, as I promised. Based on http://www.yaml.org/spec/1.2/spec.html.

Package now also includes a preview file (for "good measure"), so you can test your color scheme against it or something.

Also has a lot of tests, of course.

I have decided to not assign any special scopes to the punctuation characters, e.g. :[]{},?|>% because everything else is already colored.

Notable differences:

  • Properties are now properly highlighted
  • Directives are highlighted
  • Ending and beginning of plain scalars are correctly highlighted, in hopefully all situations. This is context-aware (block vs flow context)
  • More accurate highlighting of implicit plain scalar types (int, float, bool ...)
  • Explicit keys are matched, even though they don't receive special highlighting in most cases
  • Probably more things that I can't remember

Screenshot: with new syntax

2015-08-16_01 25 19

(unfortunately constant.numeric and string have a very similar color in my color scheme, so you hardly see the difference here)

Screenshot with old syntax:

2015-08-16_01 11 37


Fixes jskinner/DefaultPackages#41
Fixes jskinner/DefaultPackages#167

@FichteFoll
Collaborator Author

As a side note, I discovered a couple of inconveniences while working on this, which I reported here: https://github.com/SublimeTextIssues/Core/labels/C%3A%20Syntax%20Highlighting

@aziz
Contributor

aziz commented Aug 16, 2015

👍 This is awesome man. YAML is a tricky one! Thanks @FichteFoll. Next stop Markdown 😉

@FichteFoll
Collaborator Author

By the way, I would really like to use this as a base for a .sublime-syntax definition, but I am still unsure of how to tackle it best since I want to wait for this to be merged first. It would be cool if I could somehow inject patterns (i.e. prototypes) only into certain named contexts, such as key names, so that I could highlight the 'special' keys without rewriting half of the yaml def.

This would also be interesting for other Sublime Text resource file types based on JSON.

Don't know if this would be a feasible addition.

@FichteFoll
Collaborator Author

When you are reviewing this, I'd like to hear about your opinion on coloring the punctuation characters, i.e. -|<[]{},?:%. I currently do highlight &* in anchors and references.

Edit: Just saw that convert_syntax.py is included in this PR. This shouldn't be there. Will squash soon™.

@FichteFoll
Collaborator Author

Done.

Regarding punctuation characters, I came to the following stance: for the same reason that JSON punctuation is not highlighted, I believe that YAML punctuation should not be highlighted either. If users prefer to have it highlighted, they can easily do so by editing their color scheme, which they could also do for other languages that follow the same punctuation scope naming, which will hopefully be standardized at some point.

The other way around (edit to override colorization of punctuation because of using scope names like keyword.flow) is not preferred here.

@wbond
Member

wbond commented Jan 29, 2016

In 3098 we added the "Performance" variant of the Syntax Tests build suite.

Since this is a complete rewrite, could you take a couple of minutes and run some tests on some decently large YAML files with this and the existing YAML syntax? This can help ensure that, in addition to the excellent coverage of different syntax you already have, there aren't any performance issues with the regular expressions.

@FichteFoll
Collaborator Author

I ran the performance test a couple of times and recorded the minimum and maximum averages.

The new YAML.sublime-syntax file itself (550 lines):

Syntax "Packages/YAML/YAML.sublime-syntax" took an average of 6-8ms over 10 runs
Syntax "Packages/YAML 1.2/YAML 1.2.sublime-syntax" took an average of 33-34ms over 10 runs

The biggest YAML I could find (you hardly find anything >50 lines with google) was ... PHP Source.sublime-syntax, which has 1193 lines:

Syntax "Packages/YAML/YAML.sublime-syntax" took an average of 30-32ms over 10 runs
Syntax "Packages/YAML 1.2/YAML 1.2.sublime-syntax" took an average of 744-749ms over 10 runs

That's sadly a 25x slowdown (only 5x for the first). To be expected, since YAML is really not easy to parse for computers, but the current one is just wrong in many situations and we can't have that, can we?
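For reference, the ratios behind these figures can be recomputed from the timings quoted above. This is only a rough sketch: taking the midpoint of each reported min-max range is an assumption made here, and the helper names are made up for illustration.

```python
# Timings quoted above, as (min, max) averages in milliseconds.
# Using the midpoint of each range is an assumption of this sketch.
def midpoint(low, high):
    return (low + high) / 2

timings = {
    "YAML.sublime-syntax": {
        "lines": 550,
        "old": midpoint(6, 8),
        "new": midpoint(33, 34),
    },
    "PHP Source.sublime-syntax": {
        "lines": 1193,
        "old": midpoint(30, 32),
        "new": midpoint(744, 749),
    },
}

def slowdown(entry):
    """How many times slower the new syntax is than the old one."""
    return entry["new"] / entry["old"]

for name, entry in timings.items():
    print(f"{name}: {slowdown(entry):.1f}x slower with the new syntax")

small = timings["YAML.sublime-syntax"]
large = timings["PHP Source.sublime-syntax"]
line_growth = large["lines"] / small["lines"]
time_growth = large["new"] / small["new"]
print(f"lines grew {line_growth:.1f}x, new-syntax time grew {time_growth:.1f}x")
```

With these midpoints, the new syntax's time grows much faster than the line count between the two files, which hints that something other than file length drives the cost.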

@wbond
Member

wbond commented Feb 3, 2016

@FichteFoll Hmm, 750ms for a 1200 line file is definitely less than ideal. Considering that the second example is only about twice as long as the first, the 20x slowdown suggests there is probably something we can optimize in there.

I'll take a look and see if there is anything I can identify.

@wbond
Member

wbond commented Feb 3, 2016

Quoting the long strings in PHP Source.sublime-syntax brings the average (on my machine) to 125ms. This leads me to believe a lot of the inefficiency right now is in unquoted string processing.

With all of the variables and includes, and being unfamiliar with all of the YAML terminology, I have not yet identified what is causing the issue. I see there are a number of patterns with multiple options that are part of negative lookaheads. My hunch is it may be related to this.

The other option is that I may be able to instrument the regex engine for the performance test to help identify the regex pattern performance.

@FichteFoll
Collaborator Author

It is very likely that unquoted scalars are the performance bottleneck in this, just because of how "expensive" they are in a computer parsing sense, since multiple checks have to be performed for each character. The way I check this currently is, as you mentioned, by using negative look-aheads that tell us to terminate a plain scalar, but I do think there is room for improvement.

I don't have it in my head exactly right now since it's been a while since I worked on this, though, and I'm rather busy at the moment with deadlines. I will be able to take a look at this again after next week (2016-02-15). Maybe earlier, but unlikely.

@FichteFoll
Collaborator Author

Adding the results of performance tests without {} repetitions, which the sregex engine does not support at this moment, for reference:

Highlighting time of PHP Source.sublime-syntax is reduced by ~100ms to ~650ms, which is a 13% improvement.
For YAML.sublime-syntax the improvement is 40%. (~20ms)

@haad

haad commented Feb 22, 2016

+1

(?x)
(?=
{{ns_plain_first_plain_out}}
((?!{{_flow_scalar_end_plain_out}}).)*
Member


Here you have a lookahead, containing a negative lookahead, containing a positive lookahead.

Proposed change: ([^ :#]|\:[^ ]| [^#])*
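To illustrate the idea behind this proposal, here is a small sketch in Python's `re` module (not Sublime Text's regex engines). The expansion of `{{_flow_scalar_end_plain_out}}` is not shown in this thread, so the sketch assumes `": "` and `" #"` as stand-in terminators; the names `LOOKAHEAD`, `CHAR_CLASS` and `scalar_prefix` are hypothetical.

```python
import re

# Assumption: ": " and " #" stand in for whatever
# {{_flow_scalar_end_plain_out}} expands to; the real variable is not shown.
# This form runs a negative lookahead before every single character.
LOOKAHEAD = re.compile(r"(?:(?!: | #).)*")

# Same alternatives as the proposed change, written as a non-capturing group.
# Each step is a plain positive match against a character class.
CHAR_CLASS = re.compile(r"(?:[^ :#]|:[^ ]| [^#])*")

def scalar_prefix(pattern, text):
    """Return the leading part of text that pattern consumes."""
    return pattern.match(text).group(0)

samples = ["key: value", "value # comment", "http://example.com", "a b c"]
for text in samples:
    print(
        repr(text),
        repr(scalar_prefix(LOOKAHEAD, text)),
        repr(scalar_prefix(CHAR_CLASS, text)),
    )
```

Both patterns consume the same prefix for these four samples, but they are not interchangeable in general: for `a#b` the stand-in lookahead consumes the whole string, while the character-class version stops after `a`.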

@wbond
Member

wbond commented Mar 10, 2016

With the two proposed changes I just made (removing the nested lookahead, negative lookahead, lookahead patterns), the syntax is an order of magnitude faster.

It processes PHP Source.sublime-syntax in 55ms on my machine.

@wbond
Member

wbond commented Mar 10, 2016

I tweaked a few regexes further. With the unreleased dev build of Sublime Text I am seeing full-file highlighting of PHP Source.sublime-syntax in under 20ms. This makes it possible to edit line 91 of PHP Source.sublime-syntax without any lag.

In short, I replaced a number of negative lookaheads that were being applied to every character with patterns that use character classes to do positive matching. It is possible I introduced a change in behavior, although I tried not to. It would be good to have you review this, @FichteFoll.

https://gist.github.com/wbond/c2a846b92c873f5b7153

@jcberquist
Contributor

@wbond: while you are waiting for feedback, I thought this might interest you. I was curious and so I diffed the scope names applied by the two versions of the syntax to the PHP Source.sublime-syntax file and it looks like your new version assigns the string.unquoted.plain.out.yaml scope to newlines. You can easily see this by placing the cursor at the end of the name: PHP Source line and viewing the scopes. In your version you will see source.yaml string.unquoted.plain.out.yaml whereas in @FichteFoll's version it just says source.yaml. I should note that I did this with build 3107 since maybe the dev version you used is different.
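One plausible explanation for this, not confirmed anywhere in this thread, is that a negated character class such as `[^ :#]` also matches a newline, whereas `.` does not by default. The sketch below shows that effect in Python's `re` module with stand-in patterns; the actual patterns in the gist may differ, and all names are hypothetical.

```python
import re

line = "PHP Source\n"

# "." stops before the newline (no DOTALL flag), so the match excludes it.
with_dot = re.match(r"(?:(?!: | #).)*", line).group(0)

# A negated character class happily consumes the trailing newline.
with_class = re.match(r"(?:[^ :#]|:[^ ]| [^#])*", line).group(0)

# Excluding "\n" from each class restores the old behavior in this sketch.
with_class_no_newline = re.match(
    r"(?:[^ :#\n]|:[^ \n]| [^#\n])*", line
).group(0)

print(repr(with_dot))
print(repr(with_class))
print(repr(with_class_no_newline))
```

If the scoped match includes the newline, the string scope would extend to the end-of-line position, which would match the observation above.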

sregex doesn't support them but does not need them either, since they
are essentially a hack for backtracking regex engines.

Courtesy of @wbond.
The nested look-aheads were a huge bottleneck for plain scalar key-value
pair parsing. By utilizing linear matches in the single look-ahead,
parsing speed for `PHP Source.sublime-syntax` is improved by 80%.

Partly courtesy of @wbond.
@FichteFoll
Collaborator Author

@wbond
All right, I think I'm done with this now. I hand-reviewed all your changes and adjusted them before packing into commits. The issue pointed out by @jcberquist does not appear in this version.

PHP Source.sublime-syntax gets parsed in ~130ms on my machine with 3107 (which is expected to be better for you since I still have {} repetitions). Please let me know how this one compares to your version on your machine (and your ST build).

I also tweaked the scoping slightly here and there.

There was one thing I did not incorporate: the addition of some match patterns before matches like - match: '{{_flow_scalar_end_plain_out}}'. They seem intended to improve performance by not causing this look-ahead pattern to be run against each character, but in practice I didn't get any performance improvements at all. Please tell me if results are different on your ST build.

PS: I got my hands on some pretty large YAML test files but they are so large that performance testing with my current ST build would not be feasible.

@wbond
Member

wbond commented Mar 15, 2016

This is merged in, thanks for all of your work @FichteFoll!

@javiercr

Awesome! Works like a charm with Rails i18n files. Thanks!

@FichteFoll
Collaborator Author

Thanks for the merge. Glad we finally have sane YAML highlighting. :)

Here are some performance tests on big files with the slower 3107 build:

~11600 lines (400kB): 840ms
~161000 lines (4MB): 7900ms (a bit laggy when editing)
~980000 lines (60MB): 57,800ms (also laggy when editing, but only about twice as bad as the previous one)
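As a rough sanity check on how these timings scale, the per-line cost can be computed from the three measurements. The line counts are the approximate figures quoted above, and the helper name is made up for this sketch.

```python
# (approximate line count, highlighting time in ms) from the list above.
measurements = [
    (11_600, 840),
    (161_000, 7_900),
    (980_000, 57_800),
]

def ms_per_1000_lines(lines, ms):
    """Highlighting time normalized to 1000 lines."""
    return ms / lines * 1000

for lines, ms in measurements:
    print(f"{lines:>7} lines: {ms_per_1000_lines(lines, ms):.1f} ms per 1000 lines")
```

All three files land in the same range of roughly 50 to 70 ms per 1000 lines, so for these inputs the time looks close to linear in the number of lines.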

Can't wait to compare against the next build.


I noticed that syntax highlighting, or especially the syntax tests, only use a single core. Maybe some concurrency optimization wrt sregex could speed this up a little bit, but it'd be quite some work I suppose.
I also noticed that RAM usage grows very linearly, but that will likely not be an issue and is unavoidable in order to store the tokens. I think it should drop between test iterations however, which it does not.

What I'm curious about is RAM usage of plugin_host.exe however, which grew significantly over the course of performance testing but eventually dropped before the tests were finished. Maybe it's unrelated?
See screenshot below:

2016-03-16_15 02 51

@wbond
Member

wbond commented Mar 16, 2016

Remember that with 3107, this syntax still uses the Oniguruma engine. So all of the performance, memory characteristics, etc. will likely be affected by not utilizing patterns that require that engine.

I'm thinking the memory usage of plugin_host is unrelated. Perhaps another package you have installed? It happens after the performance test on the 2nd file, and starts before the performance test on the 3rd file.

I don't know off of the top of my head how allocations are happening related to lexing files in a buffer. My guess is that some detail of that is why the memory usage increases until the end of the performance test.

@FichteFoll
Collaborator Author

Did more tests with 3110 (and removal of the possessive quantifiers):

  • 300kB: 340ms
  • 4MB: 3,117.5ms (old syntax: 1,880.7ms)

The 4MB file still seems editable. It does lag behind slightly but it's not as bad as it was previously.
I'd say this is probably as fast as we can get while maintaining accuracy.

If tokenizing could somehow be multithreaded, then performance would probably improve greatly (I have 4 cores with HT).

@FichteFoll FichteFoll mentioned this pull request Jul 26, 2018