[WIP] Kinda-working fancy-regex support #34

Closed
wants to merge 10 commits into from

Conversation

Owner

@trishume trishume commented Feb 9, 2017

This branch switches the regex engine to fancy-regex, or more specifically, to my fork of it.

Currently it only works for a few syntaxes because of a few different features fancy-regex doesn't support:

  • The \n escape (Everything, but fixed it in my fork)
  • Unnecessary escapes in character classes like [\<]
  • The \h escape in character classes (Rust)
  • Fix #76 so nonewlines mode doesn't produce weird regexes.
  • Named backrefs \k<marker> (Markdown)
  • Fancy character class syntax [a-w&&[^c-g]z]
  • Add support for match limit to fancy-regex: https://github.com/google/fancy-regex/issues/44

The jQuery highlighting benchmark now takes 1s instead of 0.66s, which is super unfortunate given that I'd hoped it would be faster than Oniguruma. I have no idea why it is substantially slower.

@raphlinus @robinst

Contributor

@raphlinus raphlinus commented Feb 9, 2017

I took a quick look at this, profiling the highlighting of jquery. It's promising but clearly not compelling yet. It seems to be spending most of its time delegating to regex, but in the VM. This suggests that it's doing backtracking, and might not even be using the NFA (it delegates just to get classes). I have a bunch of ideas on how to optimize more, but don't have insight into specifically what's slow now. The best case would be something like a(?=b), which currently throws a into fancy mode, but could be optimized to just (a)b with fixup of the captures.
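The a(?=b) rewrite can be sketched at the pattern-string level. This is only an illustration under stated assumptions, not fancy-regex's implementation (a real one would more likely operate on the parsed expression). The helper names `split_trailing_lookahead` and `rewrite_trailing_lookahead` are hypothetical, the scanner only understands escapes, bracketed classes and parentheses, and the lookahead body is wrapped in a non-capturing group so that an alternation inside it stays contained.

```rust
/// Hypothetical helper: if `pattern` ends with a top-level `(?=...)`,
/// return (everything before it, the lookahead body).
pub fn split_trailing_lookahead(pattern: &str) -> Option<(&str, &str)> {
    let mut depth = 0usize; // nesting depth of (...) groups
    let mut class_depth = 0usize; // nesting depth of [...] classes
    let mut top_open = None; // byte index where the current top-level group opened
    let mut last_top: Option<(usize, usize)> = None; // last closed top-level group
    let mut chars = pattern.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            '\\' => {
                chars.next(); // skip the escaped character
            }
            '[' => class_depth += 1,
            ']' if class_depth > 0 => class_depth -= 1,
            '(' if class_depth == 0 => {
                if depth == 0 {
                    top_open = Some(i);
                }
                depth += 1;
            }
            ')' if class_depth == 0 => {
                depth = depth.checked_sub(1)?; // unbalanced pattern: give up
                if depth == 0 {
                    last_top = Some((top_open?, i));
                }
            }
            _ => {}
        }
    }
    let (open, close) = last_top?;
    if depth != 0 || close + 1 != pattern.len() || !pattern[open..].starts_with("(?=") {
        return None;
    }
    Some((&pattern[..open], &pattern[open + 3..close]))
}

/// Rewrite `body(?=la)` to `(body)(?:la)`. The match bounds are then those of
/// group 1, and every original capture group index is shifted up by one.
pub fn rewrite_trailing_lookahead(pattern: &str) -> Option<String> {
    let (body, lookahead) = split_trailing_lookahead(pattern)?;
    Some(format!("({})(?:{})", body, lookahead))
}
```

With this sketch, `a(?=b)` becomes `(a)(?:b)`. A caller would report group 1 as the match, resume searching from the end of group 1 rather than the end of the whole match, and renumber the original groups, which is the "fixup of the captures" mentioned above.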

The way to make progress here is to capture which regexes are consuming the most time. I'd add some profiling, something like a lazy_static hash table on the side, so that every time the VM runs it increments a count for that regex, and accumulates the time. Then just go down the list in terms of which regexes burn the most time.
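A minimal sketch of such a side table follows. It uses `std::sync::OnceLock` in place of the `lazy_static` crate so that it has no dependencies; the names `RegexStats`, `timed` and `report` are hypothetical and not part of syntect or fancy-regex.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};
use std::time::{Duration, Instant};

/// Per-regex counters: number of runs and accumulated wall-clock time.
#[derive(Default, Clone, Copy, Debug)]
pub struct RegexStats {
    pub calls: u64,
    pub total: Duration,
}

impl RegexStats {
    /// Average time per call, or zero if the regex never ran.
    pub fn average(&self) -> Duration {
        if self.calls == 0 {
            Duration::ZERO
        } else {
            self.total.div_f64(self.calls as f64)
        }
    }
}

/// Global table keyed by the regex source string.
fn table() -> &'static Mutex<HashMap<String, RegexStats>> {
    static TABLE: OnceLock<Mutex<HashMap<String, RegexStats>>> = OnceLock::new();
    TABLE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Run `search` and charge its elapsed time to `pattern`.
pub fn timed<T>(pattern: &str, search: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = search();
    let elapsed = start.elapsed();
    let mut map = table().lock().unwrap();
    let stats = map.entry(pattern.to_owned()).or_default();
    stats.calls += 1;
    stats.total += elapsed;
    result
}

/// Snapshot of the table, most expensive regex (by cumulative time) first.
pub fn report() -> Vec<(String, RegexStats)> {
    let mut rows: Vec<_> = table()
        .lock()
        .unwrap()
        .iter()
        .map(|(pattern, stats)| (pattern.clone(), *stats))
        .collect();
    rows.sort_by(|a, b| b.1.total.cmp(&a.1.total));
    rows
}
```

Each VM invocation would be wrapped as `timed(regex_source, || run_the_search())`, and `report()` printed at the end of the benchmark. Taking the lock on every call adds overhead of its own, so the absolute numbers are only useful for ranking regexes against each other.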

I'd be tempted to investigate myself, but am currently trying to really focus on incremental update in xi. Thanks for pushing this forward!

Owner Author

@trishume trishume commented Feb 9, 2017

@raphlinus Yeah, that was my thought on what to investigate as well. Since it's single-threaded I can do even better than a count for each regex: I can measure the total elapsed time and the count per regex to figure out which ones are slow. Then I can run it again with Oniguruma and see which regexes are faster with fancy-regex and which are slower.

Unfortunately, I'm back to being busy with school work and I'm not sure when/if I'll have time to do this. The perf regression combined with missing features means it's going to be a bunch of work. Not an undoable amount, but still substantial.


@TimNN TimNN commented Feb 10, 2017

@trishume: I'm trying to collect some per-regex timings, but running the jquery highlighting benchmark fails because of the "Ability to use \z in a character class" problem. How did you get around that?


@TimNN TimNN commented Feb 10, 2017

So I did some initial benchmarking of the jquery benchmark (measuring how long each regex match took); the results are in this gist: https://gist.github.com/b6bb756f96b58e52b3299b709fa785dd

CUM is the cumulative time spent on a regex and AVG is the average time; the lists are sorted by average time (times are in seconds).

The code is available in the respective branches in the TimNN/syntect repo.

The worst offenders by far (based on both AVG and CUM) are the following:

CUM: PT7.310148847S AVG: PT0.000004021S REGEX: [_$[:alpha:]][_$[:alnum:]]*(?=\s*[\[.])
CUM: PT17.180939801S AVG: PT0.000004806S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)(?=\s*\()
CUM: PT21.358124652S AVG: PT0.000008019S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)\s*(\.)\s*(prototype)\s*(\.)\s*(?=[_$[:alpha:]][_$[:alnum:]]*\s*=\s*(\s*\b(async\s+)?function\b|\s*(\basync\s*)?([_$[:alpha:]][_$[:alnum:]]*|\(([^()]|\([^()]*\))*\))\s*=>))
CUM: PT22.522137730S AVG: PT0.000008456S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)\s*(\.)\s*(prototype)(?=\s*=\s*(\s*\b(async\s+)?function\b|\s*(\basync\s*)?([_$[:alpha:]][_$[:alnum:]]*|\(([^()]|\([^()]*\))*\))\s*=>))

They seem to match the "best case" mentioned by @raphlinus, which I guess is a good thing?

Owner Author

@trishume trishume commented Feb 10, 2017

@TimNN awesome thank you! That's definitely useful information since it does indeed match up with the case @raphlinus said could be optimized without too much difficulty. Thanks for the help.

And yes the jQuery benchmark breaks because of a substitution I perform for nonewlines mode. I fixed the benchmark to use line strings with newline characters, but didn't end up committing it, sorry.


@TimNN TimNN commented Feb 11, 2017

So I've been hacking a bit on fancy-regex and managed to get the optimisation mentioned above working (at least I think so -- the code is very hacky and desperately needs cleaning up).

The results however look very promising so far: On my machine, highlighting jquery went from 1,228,132,968 ns/iter (+/- 63,891,621) down to 858,410,622 ns/iter (+/- 83,053,993), thus an improvement of about 30%.

The code is in the trailing-la-opt branch in my fancy-regex fork, if you want to give it a go. I'll try to clean it up a bit over the weekend and send a PR to get some feedback from @raphlinus.

Edit: It's probably going to take a bit longer, until I find the time to clean up / send a PR.

Owner Author

@trishume trishume commented Feb 11, 2017

@TimNN that's awesome! 858ms is still more time than it takes Oniguruma to highlight jQuery on my computer, but my computer also takes less than 1,228ms to highlight with fancy-regex so it is possible that on my computer fancy-regex will be just as fast. I'll try and test your branch on my machine at some point.

I was hoping it would eventually lead to a significant performance increase, but merely matching the performance of Oniguruma is enough for me to make it the default once the compatibility issues are fixed, since it will fix #33 and make all dependencies pure Rust.

I may be able to find the time to fix some of the smaller compatibility issues I listed. Specifically the first two unfinished ones listed (I sorted by estimated difficulty). Some of the issues look difficult though, specifically a full expression parser/rewriter for the character class operators.


@TimNN TimNN commented Feb 11, 2017

I ran the jquery benchmark again with the oniguruma version and got 834,134,774 ns/iter (+/- 101,552,378), so the patch brings us at least to the same level as oniguruma on my machine for this benchmark.
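The comparison can be checked with a little arithmetic. The sketch below uses the ns/iter figures quoted in this thread and treats the reported +/- value as a rough spread around the central value, which is an assumption made for illustration; the type and function names are hypothetical.

```rust
/// A benchmark figure as quoted in the thread: central value and +/- spread, in ns/iter.
#[derive(Clone, Copy)]
pub struct Sample {
    pub ns: f64,
    pub spread: f64,
}

pub const FANCY_BEFORE: Sample = Sample { ns: 1_228_132_968.0, spread: 63_891_621.0 };
pub const FANCY_OPTIMISED: Sample = Sample { ns: 858_410_622.0, spread: 83_053_993.0 };
pub const ONIG: Sample = Sample { ns: 834_134_774.0, spread: 101_552_378.0 };

/// Fraction of the original run time that was saved.
pub fn relative_change(before: Sample, after: Sample) -> f64 {
    (before.ns - after.ns) / before.ns
}

/// True if the two [ns - spread, ns + spread] ranges intersect.
pub fn ranges_overlap(a: Sample, b: Sample) -> bool {
    a.ns - a.spread <= b.ns + b.spread && b.ns - b.spread <= a.ns + a.spread
}
```

`relative_change(FANCY_BEFORE, FANCY_OPTIMISED)` comes out at roughly 0.30, matching the "about 30%" figure, and the optimised fancy-regex range overlaps the Oniguruma range while the unoptimised one does not.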

(Note that my per regex benchmarking code is currently not very efficient since I had planned on collecting more stats than average & total time, so this may slow everything down a bit).

Also, using RUSTFLAGS=-Ctarget-cpu=native improved the runtime of the optimised fancy-regex version by about 50ms on my machine.

Owner Author

@trishume trishume commented Feb 11, 2017

@TimNN Excellent. There's probably more optimization possible but that's great for now.

It should be theoretically faster than Oniguruma at least on syntaxes which have been optimized for Sublime's sregex engine to not use many/any fancy regex features, which should allow the rust regex crate to do everything, and that engine should be faster than Oniguruma.

I'm not sure even that would match Sublime Text's performance though. I think it uses something like https://doc.rust-lang.org/regex/regex/struct.RegexSet.html to match many regexes at once in time proportional only to the number of characters, but with support for extracting captures and match positions (unlike the regex crate).

At the moment fancy-regex supports substantially more regexes than Sublime's sregex, which I think only supports the things in the regex crate plus the optimization you just implemented for translating lookaheads and possibly lookbehinds. Otherwise it falls back to Oniguruma. But the things that sregex supports should end up almost entirely delegating to regex under fancy-regex, so the syntaxes optimized for it should be fast under fancy-regex.

Collaborator

@robinst robinst commented Feb 16, 2017

Should we create issues in fancy-regex for the unsupported syntax? Seems like the better place to discuss these.

For fancy character classes ([a-w&&[^c-g]z]), it looks like that should be added to the regex crate. There's a comment in the source hinting that adding support for these is planned/welcome:

https://github.com/rust-lang/regex/blob/52fdae7169ec619530985a019184319ac4bbee5a/regex-syntax/src/lib.rs#L1408-L1410

(see UTS#18 RL1.3)

I think that would also help with implementing things such as \H in character classes (which is [^0-9A-Fa-f]).

Owner Author

@trishume trishume commented Feb 16, 2017

@robinst Good point. I guess fancy-regex would be a better place.

I created an issue in regex (rust-lang/regex#341), haven't created any in fancy-regex yet though.


@BurntSushi BurntSushi commented Feb 18, 2017

I would be interested in lending some insight here if you folks wind up seeing bottlenecks inside the regex crate. The regex crate is fast in a lot of cases, but that doesn't mean it's fast in every case, so don't assume that the regex crate will always bail you out. :-) I'll get the ball rolling by throwing some things against the wall and seeing what sticks.

If a regex is particularly large, then it's possible that the DFA will be forced to bail out because it's thrashing its cache. When the DFA bails, it falls back to a much slower (by roughly an order of magnitude) regex engine. You can tweak how much cache space is available with the dfa_size_limit option. By default, it's 2MB. The surefire indicators of regexes that might thrash the cache are large counted repetitions (e.g., (foo){1000}) or lots of Unicode classes. A few Unicode classes here and there aren't going to hurt, but if you combine large classes with counted repetitions, e.g., \pL{100}, then you're in for some pain.

RegexSet may be a little tricky to use since it is very limited in what it can do. All it can really tell you is which regex matches, but it doesn't give you any position information, which means you need to re-run the regexes in the set that matched to get that position information. It is worth thinking about and possibly even trying a RegexSet, but I wouldn't get your hopes up. :-)

Finally, have you folks seen any problems with performance for compiling the regexes? Does it add any noticeable overhead?

Owner Author

@trishume trishume commented Feb 18, 2017

@BurntSushi cool thank you. If we get around to optimizing it more than @TimNN already has, we could probably use your advice.

From the very basic benchmarks I did on this branch I didn't see anything noticeable from Regex compilation. It didn't seem to be very different from Oniguruma. I do make sure to compile each regex at most once and only compile them if they are actually needed.

Collaborator

@keith-hall keith-hall commented Mar 1, 2017

Hi, I just wanted to make a quick note that I've submitted a PR to the ST Packages repo that removes the named backrefs compatibility issue from the Markdown syntax, so, depending whether any other syntaxes use named backreferences, you may get away without this support.

Contributor

@raphlinus raphlinus commented Mar 1, 2017

Some comments.

First, I took a look at @TimNN's optimization. It's definitely the optimization I had in mind, but is not quite suitable for merging yet (it changes the output, specifically adding more capture groups than were originally present). I am a bit surprised it's only a 30% gain; I would have expected more.

The instrumentation for total time spent, number of invocations, etc., sounds extremely useful, and I recommend that gets checked in. We'll want to track performance on an ongoing basis (assuming we go ahead with fancy-regex, and even then it's extremely useful for making that decision). Where is the time going after the optimization is in place?

In my (admittedly not very thorough) testing, the time impact of regex compilation was minimal. For large files, it's spending seconds computing the highlighting.

What's the secret to super-fast performance? Is it running multiple regexes in parallel? Is it using RegexSet? I think the latter is promising. One idea that might be worth pursuing is the idea of an approximate (conservative) pure regex, which is guaranteed to match if the source regex does, but not conversely. This would be useful primarily in pruning the regexes that need to be evaluated. One downside, though: if there's a lot of pushing and popping, then it will run multiple sets over (almost) the entire source line.
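The pruning idea can be illustrated without a regex engine at all. In the sketch below the conservative approximation is just a literal substring that every match must contain, which is a much cruder stand-in for the approximate pure regex described above; the `Rule` type, its `required_literal` field and the `candidates` function are hypothetical.

```rust
/// Hypothetical rule description. `required_literal` is a substring that must
/// appear in the line for `pattern` to match anywhere in it, or `None` if no
/// such literal is known (the rule is then never pruned).
pub struct Rule {
    pub pattern: &'static str,
    pub required_literal: Option<&'static str>,
}

/// Indices of the rules that can possibly match somewhere in `line`.
/// A rule missing from the result is guaranteed not to match; a rule in the
/// result still has to be run through the real regex engine.
pub fn candidates(rules: &[Rule], line: &str) -> Vec<usize> {
    rules
        .iter()
        .enumerate()
        .filter(|(_, rule)| rule.required_literal.map_or(true, |lit| line.contains(lit)))
        .map(|(index, _)| index)
        .collect()
}
```

For example, a rule built around `(prototype)` can be skipped on any line that does not contain the text "prototype". The filter may let through rules that then fail to match, but it never drops one that would have matched, which is the property the approximation needs.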


@trishume trishume mentioned this pull request Mar 2, 2017
5 tasks
Owner Author

@trishume trishume commented Mar 16, 2017

I just learned something from this conversation on Reddit with @BurntSushi and @raphlinus that may be part of the cause of the lack of performance gains.

syntect currently always does a regex search with captures even when it doesn't actually need them. Given what @BurntSushi said on Reddit, with the regex crate (which fancy-regex will call into a lot) this has a substantial performance penalty over just finding the bounds of the match.

I think an optimization where it keeps track of if a certain rule needs the captures or not may help performance with fancy-regex and possibly with Oniguruma as well, although I'm not sure there's much of a penalty to getting captures with Oniguruma, especially since it lets you re-use the capture regions struct between calls.
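A rough sketch of that bookkeeping is shown below. The `RuleInfo` fields are a simplified, hypothetical view of what might make a rule need capture positions; they are not syntect's actual data structures.

```rust
/// Which kind of search the matcher needs to run for a rule.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum SearchKind {
    /// Only the start and end of the whole match are needed.
    BoundsOnly,
    /// Positions of individual capture groups are needed as well.
    WithCaptures,
}

/// Hypothetical, simplified description of a match rule.
pub struct RuleInfo {
    /// The rule assigns scopes to individual capture groups.
    pub has_capture_scopes: bool,
    /// A context pushed by this rule refers back to the text of its groups.
    pub pushes_with_backrefs: bool,
}

/// Decide once, when the rule is loaded, so the hot loop only checks an enum.
pub fn search_kind(rule: &RuleInfo) -> SearchKind {
    if rule.has_capture_scopes || rule.pushes_with_backrefs {
        SearchKind::WithCaptures
    } else {
        SearchKind::BoundsOnly
    }
}
```

The matcher would then call the engine's bounds-only search for `BoundsOnly` rules and fall back to the capture-returning search otherwise.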

Not sure exactly how much difference it would make, but it would probably help a bit.


@BurntSushi BurntSushi commented Mar 16, 2017

> syntect currently always does a regex search with captures even when it doesn't actually need them. Given what @BurntSushi said on Reddit, with the regex crate (which fancy-regex will call into a lot) this has a substantial performance penalty over just finding the bounds of the match.

There's a lot of subtlety here. There are various factors at play:

  1. Even if you use the captures method call, if the regex itself has no captures, then the regex engine notices this and doesn't run the NFA. See: https://github.com/rust-lang/regex/blob/d813518e2a199884cd38a4e32497a7453db79697/src/exec.rs#L523-L535
  2. Does fancy-regex ever call the captures method? Or does it handle captures on its own?
  3. What do your regexes actually look like? (I tried looking once but I got lost. :-()
  4. The performance guide for the regex crate is something you should definitely check out.

> although I'm not sure there's much of a penalty to getting captures with Oniguruma

Right. Classical backtracking engines, IME, typically don't impose a penalty for extracting captures.


@BurntSushi BurntSushi commented Mar 16, 2017

FWIW, there are plans (in my head, anyway) to make capture extraction faster, but I can't commit to a timeline.

Collaborator

@robinst robinst commented May 22, 2017

I rebased this branch and have been implementing missing features in fancy-regex, see my pull requests.

Also, the just released regex 0.2.2 now supports nested character classes and intersections, which means the "Fancy character class syntax [a-w&&[^c-g]z]" task is mostly done.

So I think the only task that doesn't have a pull request yet is "Ability to use \z in a character class (Using the nonewlines option)", which I haven't looked at yet.

I also found a regex that fancy-regex currently has trouble with (haven't investigated why yet): https://github.com/google/fancy-regex/issues/14

Collaborator

@robinst robinst commented Jun 9, 2017

I looked into "Ability to use \z in a character class (Using the nonewlines option)" now. I'm pretty sure it doesn't work as expected in the current implementation, but it wasn't detected because onig doesn't complain.

With the following:

let re = onig::Regex::new(r"^a[\z]").unwrap();
println!("{}", re.is_match("a"));
println!("{}", re.is_match("az"));

The regex is compiled, but it prints false, then true. So it's equivalent to ^az, not ^a\z (which is what syntect would want).


@trishume trishume mentioned this pull request Jun 9, 2017
2 tasks
Owner Author

@trishume trishume commented Jun 9, 2017

@robinst good catch, I created an issue: #76 and updated the to-do list in this issue.

Collaborator

@robinst robinst commented Jul 27, 2017

Update: Fixed #76 now. With that, all of the check boxes in the description are done.

I've rebased @trishume's fancy-regex branch here: https://github.com/robinst/syntect/tree/fancy-regex

Now the following failing tests are left, and they fail on the assertions (instead of panicking while compiling regexes):

html::tests::strings
html::tests::tokens
parsing::parser::tests::can_parse_yaml

The last one I added here and shows that the YAML syntax is not working yet:

 thread 'parsing::parser::tests::can_parse_yaml' panicked at 'assertion failed: `(left == right)`
-  left: `[(0, Push(<source.yaml>)), (0, Push(<string.unquoted.plain.out.yaml>)), (1, Pop(1)), (1, Push(<string.unquoted.plain.out.yaml>)), (2, Pop(1)), (2, Push(<constant.language.boolean.yaml>)), (3, Pop(1)), (3, Push(<punctuation.separator.key-value.mapping.yaml>)), (4, Pop(1)), (5, Push(<string.unquoted.plain.out.yaml>)), (10, Pop(1))]`,
+ right: `[(0, Push(<source.yaml>)), (0, Push(<string.unquoted.plain.out.yaml>)), (0, Push(<entity.name.tag.yaml>)), (3, Pop(2)), (3, Push(<punctuation.separator.key-value.mapping.yaml>)), (4, Pop(1)), (5, Push(<string.unquoted.plain.out.yaml>)), (10, Pop(1))]`', src/parsing/parser.rs:482:8

I had a look at the syntax but it's pretty complex. If someone who knows the syntax wants to track down the problem, that would be cool. (I guess I should learn how to use a debugger for Rust :)).

Collaborator

@keith-hall keith-hall commented Jul 27, 2017

I don't know YAML syntax very well, but I cut down the syntax definition to the following, and I think it should still have the same behavior with the key: value\n example (it still gets the same scopes in ST as the full YAML syntax def - untested in syntect though, sorry!), so it may help with debugging.

%YAML 1.2
---
# See http://www.sublimetext.com/docs/3/syntax.html
scope: source.yaml-test
name: YAML-Test
variables:
  c_indicator: '[-?:,\[\]{}#&*!|>''"%@`]'
  # plain scalar begin and end patterns
  ns_plain_first_plain_out: |- # c=plain-out
    (?x:
        [^\s{{c_indicator}}]
      | [?:-] \S
    )

  _flow_scalar_end_plain_out: |- # kind of the negation of nb-ns-plain-in-line(c) c=plain-out
    (?x:
      (?=
          \s* $
        | \s+ \#
        | \s* : (\s|$)
      )
    )
contexts:
  main:
    - include: block-mapping
    - include: flow-scalar-plain-out

  block-mapping:
    - match: |
        (?x)
        (?=
          {{ns_plain_first_plain_out}}
          (
              [^\s:]
            | : \S
            | \s+ (?![#\s])
          )*
          \s*
          :
          (\s|$)
        )
      push:
        #- include: flow-scalar-plain-out-implicit-type
        - match: '{{_flow_scalar_end_plain_out}}'
          pop: true
        - match: '{{ns_plain_first_plain_out}}'
          set:
            - meta_scope: string.unquoted.plain.out.yaml entity.name.tag.yaml
              meta_include_prototype: false
            - match: '{{_flow_scalar_end_plain_out}}'
              pop: true
    - match: :(?=\s|$)
      scope: punctuation.separator.key-value.mapping.yaml

  flow-scalar-plain-out:
    # http://yaml.org/spec/1.2/spec.html#style/flow/plain
    # ns-plain(n,c) (c=flow-out, c=block-key)
    #- include: flow-scalar-plain-out-implicit-type
    - match: '{{ns_plain_first_plain_out}}'
      push:
        - meta_scope: string.unquoted.plain.out.yaml
          meta_include_prototype: false
        - match: '{{_flow_scalar_end_plain_out}}'
          pop: true

Collaborator

@robinst robinst commented Jul 28, 2017

Thanks @keith-hall! That helped, I've noticed a difference with this pattern (narrowed down):

let regex = r"(?=\s*$|\s*:(\s|$))";
let s = "key: value";
println!("{:?}", onig::Regex::new(regex).unwrap().find(s));
println!("{:?}", fancy_regex::Regex::new(regex).unwrap().find(s));

onig returns Some((3, 3)), whereas fancy-regex returns Some((0, 0)). Removing the \s*$| part of the pattern makes it work. Ran out of time now, but looks like fancy-regex doesn't handle the $ in the positive lookahead correctly?

Collaborator

@robinst robinst commented Aug 3, 2017

Ok, tracked down the problem and have a fix here: google/fancy-regex#21

With that fix, cargo test works! syntest panics compiling one of the regexes though :).


robinst added 5 commits May 1, 2018
Some of the regexes include `$` and expect it to match end of line. In
fancy-regex, `$` means end of text by default. Adding `(?m)` activates
multi-line mode which changes `$` to match end of line.

This fixes a large number of the failed assertions with syntest.

In fancy-regex, POSIX character classes only match ASCII characters.
Sublime's syntaxes expect them to match Unicode characters as well, so
transform them to corresponding Unicode character classes.

With the regex crate and fancy-regex, `^` in multi-line mode also
matches at the end of a string like "test\n". There are some regexes in
the syntax definitions like `^\s*$`, which are intended to match a blank
line only. So change `^` to `\A` which only matches at the beginning of
text.

Note that this wasn't a problem with Oniguruma because it works on UTF-8
bytes, but fancy-regex works on characters.
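The POSIX class transformation described in these commit messages could look roughly like the sketch below. The specific mapping to Unicode properties is an assumption chosen for illustration and may differ from what the branch does, and plain string replacement does not check whether the bracket expression sits inside a character class or has been escaped.

```rust
/// Replace POSIX bracket expressions with Unicode property classes so that
/// they also match non-ASCII characters.
/// The mapping below is an illustrative assumption, not a confirmed copy of
/// the branch's table.
pub fn replace_posix_char_classes(regex: &str) -> String {
    regex
        .replace("[:alpha:]", r"\p{L}")
        .replace("[:alnum:]", r"\p{L}\p{N}")
        .replace("[:lower:]", r"\p{Ll}")
        .replace("[:upper:]", r"\p{Lu}")
        .replace("[:digit:]", r"\p{Nd}")
}
```

With this mapping, the identifier pattern `[_$[:alpha:]][_$[:alnum:]]*` from the JavaScript syntax becomes `[_$\p{L}][_$\p{L}\p{N}]*`.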
Collaborator

@robinst robinst commented May 1, 2018

Done! Note that you might have to cargo update locally before trying to run this (because Cargo.lock is not checked in). Updates:

  • A small change to make fancy-regex work with a newer regex crate.
  • A bug fix in syntect for code that I wrote a couple of days ago that fails with fancy-regex (but is fine with Oniguruma). += 1 on a string index is almost never the right thing to do. I walked into this one before, but apparently forgot about it again :).
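The string index pitfall is easy to reproduce with the standard library alone. The helper below is a hypothetical illustration, not the code from the bug fix: it advances a byte index by the width of the character at that position rather than by 1.

```rust
/// Byte index of the character boundary following the character at `i`,
/// or `None` if `i` is already at the end of `s`.
/// Panics if `i` is not on a character boundary.
pub fn next_boundary(s: &str, i: usize) -> Option<usize> {
    s[i..].chars().next().map(|c| i + c.len_utf8())
}
```

For a string starting with "é" (two bytes in UTF-8), index 0 + 1 is not a character boundary, so slicing there panics, whereas `next_boundary` returns 2. An engine that hands back byte offsets but expects the next search to start on a character boundary will trip over `+= 1` as soon as the input contains non-ASCII text.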

Collaborator

@keith-hall keith-hall commented May 2, 2018

Small note for people struggling to get syntect to build on Windows: using the branch from this PR, you can edit Cargo.toml to remove the onig dependency, and everything will work fine.

Side note: probably this PR should be updated so that fancy-regex is an optional dependency as part of the parsing feature like onig is, no?

--- Cargo.toml
+++ [new] Cargo.toml
@@ -15,7 +15,7 @@
 
 [dependencies]
 yaml-rust = { version = "0.4", optional = true }
-onig = { version = "3.2.1", optional = true }
+#onig = { version = "3.2.1", optional = true }
 walkdir = "2.0"
 regex-syntax = { version = "0.4", optional = true }
 lazy_static = "1.0"
@@ -25,7 +25,7 @@
 flate2 = { version = "1.0", optional = true, default-features = false }
 fnv = { version = "1.0", optional = true }
 regex = "*"
-fancy-regex = { git = "https://github.com/google/fancy-regex.git" }
+fancy-regex = { git = "https://github.com/google/fancy-regex.git", optional = true }
 serde = { version = "1.0", features = ["rc"] }
 serde_derive = "1.0"
 serde_json = "1.0"
@@ -51,7 +51,7 @@
 # Pure Rust dump creation, worse compressor so produces larger dumps than dump-create
 dump-create-rs = ["flate2/rust_backend", "bincode"]
 
-parsing = ["onig", "regex-syntax", "fnv"]
+parsing = ["fancy-regex", "regex-syntax", "fnv"]
 # The `assets` feature enables inclusion of the default theme and syntax packages.
 # For `assets` to do anything, it requires one of `dump-load-rs` or `dump-load` to be set.
 assets = []

Collaborator

@robinst robinst commented May 2, 2018

@keith-hall Pushed a commit with those changes, thanks!


RegexOptions::REGEX_OPTION_CAPTURE_GROUP,
Syntax::default())
.unwrap();
println!("compiling {:?}", self.regex_str);
Collaborator

@keith-hall keith-hall May 3, 2018

Should this println be here? It generates a lot of noise. If it's useful for debugging, maybe it would be best to hide it behind a feature flag as discussed at #146 (comment)

Owner Author

@trishume trishume May 3, 2018

Just commenting it out is fine, see my reply on the comment

Collaborator

@robinst robinst May 6, 2018

@keith-hall I pushed a commit that changes this to only print in case it fails.


I think the println! was only there to see the regex that failed to
compile.
Contributor

@kornelski kornelski commented Jul 24, 2018

It would be great if that change landed, because libgit2 crashes in binaries linking oniguruma, and I'd like to use both libgit2 and syntect together.

Contributor

@Keats Keats commented Aug 23, 2018

Is https://github.com/google/fancy-regex/issues/44 the only blocker left for this?

Collaborator

@keith-hall keith-hall commented Aug 23, 2018

I was also wondering about this. If we don't want to default to fancy-regex yet, maybe it would be worth maintaining both the oniguruma and fancy-regex code paths for now and consumers can choose which one to use from a feature flag. It may then be easier to do performance comparisons etc.


@@ -5,12 +5,13 @@
//! into this data structure?
use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;
use onig::{Regex, RegexOptions, Region, Syntax};
use fancy_regex;

Why not continue using fancy_regex::Regex here?

Collaborator

@robinst robinst Aug 27, 2018

You mean why not write use fancy_regex::Regex; here? Yeah there's not really a good reason. I will change it next time I work on this.

Collaborator

@robinst robinst commented Aug 27, 2018

> Is google/fancy-regex#44 the only blocker left for this?

@Keats Yes, unless we find another one.

@keith-hall A feature sounds like a good idea, yeah. Maybe we can abstract the regex compilation and matching parts a bit to make the feature less painful to maintain (so that it's just in one module and not all over the place).
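One possible shape for such an abstraction is sketched below. All names (`Backend`, `Literal`, `Engine`, `Regex`) are hypothetical, and `Literal` is a stand-in that treats the pattern as a plain substring so the example runs without any regex crate; in a real build the `Engine` alias would be selected with `#[cfg(feature = "...")]` and point at a wrapper around onig or fancy-regex.

```rust
use std::cell::OnceCell;

/// What the rest of the crate needs from a regex engine.
pub trait Backend: Sized {
    fn compile(pattern: &str) -> Result<Self, String>;
    /// Find the first match at or after byte offset `start`.
    fn find(&self, text: &str, start: usize) -> Option<(usize, usize)>;
}

/// Stand-in backend for illustration only: literal substring search.
pub struct Literal(String);

impl Backend for Literal {
    fn compile(pattern: &str) -> Result<Self, String> {
        if pattern.is_empty() {
            Err("empty pattern".to_owned())
        } else {
            Ok(Literal(pattern.to_owned()))
        }
    }

    fn find(&self, text: &str, start: usize) -> Option<(usize, usize)> {
        let offset = text.get(start..)?.find(self.0.as_str())?;
        Some((start + offset, start + offset + self.0.len()))
    }
}

/// The single place where the engine is chosen.
pub type Engine = Literal;

/// Regex wrapper that compiles lazily, at most once, on first use.
pub struct Regex {
    source: String,
    compiled: OnceCell<Result<Engine, String>>,
}

impl Regex {
    pub fn new(source: &str) -> Self {
        Regex { source: source.to_owned(), compiled: OnceCell::new() }
    }

    pub fn find(&self, text: &str, start: usize) -> Result<Option<(usize, usize)>, String> {
        let engine = self
            .compiled
            .get_or_init(|| Engine::compile(&self.source))
            .as_ref()
            .map_err(|e| e.clone())?;
        Ok(engine.find(text, start))
    }
}
```

Keeping the engine choice behind one alias means the parser code only ever sees `Regex::find`, so swapping engines does not touch it. The lazy `OnceCell` also preserves the behaviour described earlier in the thread of compiling each regex at most once and only when it is needed.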

I might have some time this week to work on this.


/// In fancy-regex, POSIX character classes only match ASCII characters.
/// Sublime's syntaxes expect them to match Unicode characters as well, so transform them to
/// corresponding Unicode character classes.
fn replace_posix_char_classes(regex: String) -> String {

Is there any chance we could do the Sublime syntax replacement before run time?


@OptimisticPeach OptimisticPeach commented Jun 23, 2019

Sorry to poke at an old issue, but is there any news on this? I'd really like to use syntect in a WASM project, but according to #135 this needs to land first.

Owner Author

@trishume trishume commented Jun 23, 2019

Unfortunately this is not quite complete and based on an older release of syntect. It would probably take a fair amount of work to complete. I'm not personally interested in doing that work unfortunately, so unless someone else steps up to do it, it's not on the roadmap.


@OptimisticPeach OptimisticPeach commented Jun 24, 2019

Ah that's a bummer, I'm sorry for the letdown but I don't think I'd be able to commit to making it happen right now.


@trishume trishume mentioned this pull request Aug 10, 2019
Collaborator

@robinst robinst commented Nov 19, 2019

Hey, just an update. I'm working on this again, but with a different approach.

I'm moving all the regex usage to a module first, so then we can swap out the implementation using a cargo feature.

I'll have a pull request next week :).

Collaborator

@robinst robinst commented Nov 25, 2019

Ok, see #270. I think we can close this PR and move discussion to #270.


10 participants