Use arena for contexts #182

Merged 24 commits into master on Aug 1, 2018

Conversation

robinst (Collaborator) commented Jun 28, 2018

Changes context references from Rc<RefCell<Context>> to a Vec index, which makes it possible to make SyntaxSet both Send and Sync, see #83.

This still needs more work and there are tons of TODOs to clean up, but I have some questions that will affect the direction.

The high-level overview of changes so far (see the sketch after this list):

  • contexts on SyntaxDefinition changed from a HashMap<String, ContextPtr> to a Vec<Context>
  • Context has a new name field, and the contexts are sorted by that. The idea is that during linking, we can do a binary search to find contexts. After linking, we have a usize index to look up the context quickly.
  • Nested contexts (e.g. anonymous contexts that are pushed) are no longer directly included in the parent context, but instead they are pulled out into the flat context list. This means the contexts are no longer a tree, which simplified satisfying the borrow checker a bit. Having said that, I'm not sure it's strictly needed; we can revisit that.
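
To make the shape of the change concrete, here is a rough sketch of the before/after data structures (simplified; field names follow this description, not necessarily the final code):

// Before: contexts referenced each other through shared mutable pointers.
// type ContextPtr = Rc<RefCell<Context>>;
// contexts: HashMap<String, ContextPtr>

// After: contexts live in a flat, name-sorted Vec and are referenced by index.
struct SyntaxDefinition {
    contexts: Vec<Context>, // sorted by name; binary search while linking
    // ...other fields unchanged...
}

struct Context {
    name: String, // new field, used to find contexts during linking
    // ...patterns, meta scopes, etc....
}

// After linking, a reference is just an index that resolves in O(1).
struct ContextId {
    context_index: usize,
}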

High-level TODOs:

robinst (Collaborator, Author) left a comment:

Added some inline comments for the things that are unclear :).

Another thing is that the can_parse_issue120 test fails after these changes. If you look at what the test does, you can see that it clones a SyntaxDefinition and then puts it into a new SyntaxSet. This works with the current syntect because of ContextPtr, but is not something that should be allowed.

Note that even though the code calls link_syntaxes, it can't fix the reference because at that point it's just a ContextId.

To prevent that, I think we have two options:

  1. Make the SyntaxDefinition API type not clonable. The content still needs to be clonable somehow because we need that to make a whole SyntaxSet clonable.
  2. Allow cloning and make it possible to relink

I think option 2 is nicer, and we could do it like this: Instead of having a ContextReference::Direct variant, change the structure to this:

enum ContextReferenceThing {
    Unlinked(ContextReference),
    Linked(ContextReference, ContextId),
}

That way we could relink because we still have the original references.

To ensure that the callers never forget to link, we could remove the add_syntax from SyntaxSet and instead make it go via a builder, and make build do the linking. And then, the serialized syntax set could already be linked too :). What do you think?

/// Create a state from a syntax, keeps its own reference counted
/// pointer to the main context of the syntax.
pub fn new(syntax: &SyntaxDefinition) -> ParseState {
pub fn new(syntax_set: &'a SyntaxSet, syntax: &'a SyntaxDefinition) -> ParseState<'a> {

robinst (Collaborator, Author):

Note that there's nothing here preventing the caller from passing in a SyntaxDefinition that doesn't actually come from the passed-in SyntaxSet. I'm not sure if that's a big problem, but yeah.

trishume (Owner):

One way would be to return some opaque SyntaxRef type, containing a ref to the SyntaxSet and the starting context, from SyntaxSet. This can be done while still satisfying my desire to have ParseState not store the SyntaxSet ref, by using a PhantomData type. It can't be reconciled with my wish to have no lifetime parameter in ParseState, though again I'm not sure if that's necessary or not.

I'm also not sure if this would be very clean API-wise. Right now the methods that let you look up syntax definitions in a set do double duty of letting you figure out info about the syntax like its name (for display in a text editor or whatever) as well as letting you highlight with it.

It may not be worth trying to make this strict, but if we can come up with a nice way that doesn't cause borrow checker hell we should do it.
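
A hypothetical sketch of that SyntaxRef idea (nothing like this exists in the PR; the SyntaxSet and ContextId below are stand-ins for the real types so the sketch is self-contained):

use std::marker::PhantomData;

// Stand-ins for the types from this PR.
struct SyntaxSet;
#[derive(Clone, Copy)]
struct ContextId(usize);

// Opaque handle returned by SyntaxSet lookups: it carries the starting
// context and ties its lifetime to the set it came from.
struct SyntaxRef<'a> {
    set: &'a SyntaxSet,
    start_context: ContextId,
}

// ParseState would not store the &SyntaxSet pointer itself, but would still
// borrow the set for its lifetime via PhantomData, so it can't outlive it.
struct ParseState<'a> {
    stack: Vec<ContextId>,
    _set: PhantomData<&'a SyntaxSet>,
}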

pub regex: Option<Regex>,
// #[serde(skip_serializing, skip_deserializing)]
// TODO: cache these somewhere?
// pub regex: Option<Regex>,

robinst (Collaborator, Author):

Need to look at how we want to cache regexes. I was going to look at what the regex crate does for its internal cache and then do something similar.

trishume (Owner):

Lazy regex compilation is definitely a necessary feature before merging, without it either parsing or startup is really slow.

Something like the regex crate is a decent idea. Ideally this refactor should make multithreaded use easier.

robinst (Collaborator, Author):

I looked at this and noticed that onig's Regex is actually Send and Sync now: rust-onig/rust-onig#47

So it looks like all we need is lazy initialization, not a thread-local cache. I've raised a PR here: #186
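
As a minimal sketch, lazy thread-safe compilation could look roughly like this (written with std::sync::OnceLock for brevity; at the time of this PR a crate like lazycell or lazy_static would have played the same role, and syntect's actual implementation may differ):

use std::sync::OnceLock;

// Hypothetical pattern type; regex_str holds the uncompiled pattern source.
struct MatchPattern {
    regex_str: String,
    regex: OnceLock<onig::Regex>,
}

impl MatchPattern {
    // Compile on first use; later calls reuse the cached Regex. Sharing across
    // threads is fine because onig::Regex is Send + Sync.
    fn regex(&self) -> &onig::Regex {
        self.regex
            .get_or_init(|| onig::Regex::new(&self.regex_str).expect("invalid regex"))
    }
}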

Inline(ContextPtr),
Direct(LinkerLink),
// Inline(Context),
Inline(String),

robinst (Collaborator, Author):

I initially left the distinction between Named and Inline because recursively_mark_no_prototype only looked at Inline. But after #180, this would not be necessary anymore.

Having said that, I don't know if we want SyntaxDefinition to stay as close as possible to the YAML structure or not. In other words, are all the structures within SyntaxDefinition meant to be part of the API surface for users of this library to be able to inspect?

@@ -25,6 +25,7 @@ use onig::Regex;
///
/// Re-linking— linking, adding more unlinked syntaxes with `load_syntaxes`,
/// and then linking again—is allowed.
// TODO: clone?

robinst (Collaborator, Author):

The reason this is not clone yet is because of the FirstLineCache. I imagine we'd use a similar solution to caching match regexes for that.

trishume (Owner):

Yah good idea. Either one pool of compiled regexes that has both sets of regexes in it, or just one RegexCache type with two instances.

.map(|x| SyntaxDefinition::parse_reference(x, state))
.enumerate()
.map(|(i, x)| SyntaxDefinition::parse_reference(
&format!("{}.{}", name, i), x, state, contexts))

robinst (Collaborator, Author):

@keith-hall you mentioned that Sublime already gives names to anonymous contexts. Do you have a reference or a way to check how Sublime assigns the names? Then we could do it the same way here.

keith-hall (Collaborator):

It used to be possible to find how ST names these contexts by checking the ST console when an anonymous context included a meta_scope and was subject to a with_prototype - see sublimehq/sublime_text#1683. This is the closest I have to a reference or a way to check the names (with ST build 3153 or older).

It seems like the naming convention is #anon_ followed by the named context, followed by _ and the index of the anonymous context within that named context (regardless of nesting level, in an XML-like "document order").
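
In code, that convention would amount to something like this (a hypothetical helper, named here only for illustration):

// "#anon_" + enclosing named context + "_" + document-order index of the
// anonymous context within that named context (regardless of nesting level).
fn anon_context_name(named_context: &str, index: usize) -> String {
    format!("#anon_{}_{}", named_context, index)
}

// e.g. anon_context_name("d", 0) == "#anon_d_0"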

This was my test syntax definition:

%YAML 1.2
---
scope: source.example-anon
contexts:
  main:
    - include: c
  c:
    - meta_scope: test_c
    - match: c_test
      scope: c_test
  d:
    - meta_scope: test_d
    - match: d_example
      scope: d_example
      push: #anonymous context 0 inside `d` context
        - meta_scope: test_e
        - match: e
          scope: e
          push: # anonymous context 1 inside `d` context
            - meta_scope: testing123
            - match: f
              push: # anonymous context 2 inside `d` context
                - match: g
              with_prototype:
                - match: stop_already...
          with_prototype:
            - match: nope
      with_prototype:
        - match: this
    - match: another
      push: # anonymous context 3 inside `d` context
        - meta_scope: blah
        - match: something
          push: # anonymous context 4 inside `d` context
            - meta_scope: foobar
            - match: hello
          with_prototype:
            - match: me

and the ST console output:

generating syntax summary
rule 719396458c7046797910f3c1990b81d6::#anon_d_1 has a scope name, but is unreachable, so the name will never be used
rule c has a scope name, but is unreachable, so the name will never be used
rule d has a scope name, but is unreachable, so the name will never be used
rule #anon_d_0 has a scope name, but is unreachable, so the name will never be used
rule #anon_d_1 has a scope name, but is unreachable, so the name will never be used
rule #anon_d_4 has a scope name, but is unreachable, so the name will never be used

I believe that 719396458c7046797910f3c1990b81d6 can be ignored for our purposes; I think it's some kind of unique identifier ST gave when creating new anonymous contexts with the with_prototype applied, which is not something we do in syntect.

trishume (Owner) left a comment:

Looks good in general. I think there might be some clever ways to fiddle with things so that linking is cleaner though.

Right now this PR adds noticeably more code than it deletes, and I feel like there may be a better design that doesn't. But if adding code is necessary that's okay; this refactor would still buy us context names for debugging as well as likely better multithreading support.

// TODO: removed PartialEq, Eq from this because SyntaxSet is not
#[derive(Debug, Clone)]
pub struct ParseState<'a> {
syntax_set: &'a SyntaxSet,

trishume (Owner):

I'm not sure if this is where the SyntaxSet should go; I think it should maybe be passed into parse_line.

The reason is that ParseState is designed for caching in text editors, and it would be kinda sad to redundantly cache a ton of pointers to the same SyntaxSet.

I also don't have a good idea of how much borrow checker hell having ParseState store a reference will cause. Might be really easy, might be hard. It could be worth looking at how this might fit into Xi as a case study. If it is hard then the StateLevels could store a context index instead.

robinst (Collaborator, Author):

I'll try out the changes in xi-editor.

Passing it into parse_line means you could accidentally pass in a different one and get all kinds of messed-up results.

The way it's currently written with the lifetime specifier reflects the reality nicely: The ParseState is valid for as long as the corresponding SyntaxSet exists. If that goes away or is changed, a ParseState instance should no longer be used.

fn eq(&self, other: &StateLevel) -> bool {
context_ptr_eq(&self.context, &other.context) &&
self.context == other.context &&

trishume (Owner):

I think this might have performance implications since I think this does a deep compare. The PartialEq implementation for Context is really only intended for unit tests. Probably want to cast to pointers and compare those.

I think this implementation exists to support fancy caching patterns, and I'm not sure this compare is valuable there. I don't think anyone uses it, so the cost won't show up until someone tries to implement a fancy caching pattern and finds it's super slow.
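
For reference, the pointer-identity compare being suggested is roughly this (a generic sketch, not code from the PR):

// Compare by identity (same object in memory) instead of deep equality.
fn same_object<T>(a: &T, b: &T) -> bool {
    std::ptr::eq(a, b)
}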

ContextId { syntax_index, context_index }
}

// TODO: maybe this should be on SyntaxSet instead?

trishume (Owner):

I'm unsure about this too. On the one hand it seems like it might fit better on the SyntaxSet in terms of who's doing the resolving of what.

On the other hand the SyntaxSet already has a lot of methods and it's something people read docs for, and so we'd probably need to give it a more descriptive name. Probably better to keep it tucked away here where it more specifically applies.

#[derive(Clone, Debug, Eq, PartialEq, Serialize, Deserialize)]
pub struct ContextId {

trishume (Owner):

It would be great if we can figure out a way to cleanly put all the Contexts in one big Vec in the SyntaxSet. That would make ContextId half the size, as well as saving a bounds check + pointer lookup on every ID resolve.

I think such a design would also make whole-syntax-set passes like linking cleaner to implement, possibly through nice visitor methods that take a closure that's called on every link or something. Also future cache data structures with a slot for every context would be easier to index.

if let Some(context_index) = syntax.find_context_index_by_name("prototype") {
// let prototype = &mut syntax.contexts[context_index];
// TODO: this works but is it nice? Would it be nicer to make
// recursively_mark_no_prototype a method on SyntaxDefinition?

trishume (Owner):

This makes me wonder if this is different than how Sublime does it. Now I'm wondering if in Sublime I include a named context in the prototype, and then push that named context from the main section, will it use the prototype or not. This also makes linking not idempotent since if a prototype includes something from another syntax (don't think this ever happens in practice) the first time it'll work fine, but on second link it can interfere with the prototype of another syntax.

I think Sublime may use a model of duplicating contexts when they have a prototype. Although I'm not sure if this ever comes up in practice.

Anyhow yah maybe this would be better on SyntaxDefinition, although I think ideally all contexts get moved into a single Vec in the SyntaxSet and then there's some kind of recursive context walking helper method that this can use or something.

robinst (Collaborator, Author):

Now I'm wondering if in Sublime I include a named context in the prototype, and then push that named context from the main section, will it use the prototype or not.

I tested that as part of #180 and the answer is: No, it also won't use the prototype in that case. So the way we currently do it seems to be correct.

@@ -128,6 +126,21 @@ impl SyntaxDefinition {
SyntaxDefinition::parse_top_level(doc, scope_repo.deref_mut(), lines_include_newline, fallback_name)
}

// TODO: visibility. also, is this the right file to have these?

trishume (Owner):

No, note that this whole module isn't included if the yaml loading feature isn't included, and I think you use this in a test outside this module.
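
For context, the gating in question looks roughly like this (a sketch; the yaml-load feature name is taken from syntect's Cargo features):

// None of this module is compiled when the feature is off, so helpers defined
// here are invisible to code (including tests) built without it.
#[cfg(feature = "yaml-load")]
mod yaml_load {
    // SyntaxDefinition::load_from_str and the rest of the YAML loading code live here.
}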

@@ -57,6 +58,34 @@ impl Default for SyntaxSet {
}
}

struct ContextMap {

trishume (Owner):

I feel like there must be a better way of doing linking, although it may require restructuring some data.

Actually I just realized I don't even understand why this is a thing: why not just iterate over the list of syntaxes directly to find the right index, instead of constructing all of these extra vectors with the same indexes and information?

robinst (Collaborator, Author):

Because at the point we want to find a syntax/context, we're within a method that has a mutable borrow of a Context or some component of it.

I tried what you suggested but I couldn't get it past the borrow checker. If you have any ideas, that would be good. We could probably do something with unsafe, because all we want to do is mutate ContextReference, not change any of the structure apart from that.

trishume (Owner):

Ah okay makes sense.

One way around this might be to switch to having a flat Vec<Context> with SyntaxDefinitions just having references into it. The YAML loader would take a mutable ref to a SyntaxSet (or the underlying Vec) that it would shove new contexts into. Then SyntaxDefinition would have something like a HashMap<String, ContextId> to represent the names.

That would accomplish moving to a flat system (which I suspect will work better/faster in other ways, see other comments) and also separate the lookup data (and move back to an efficient map lookup for names).

It would sacrifice being able to pick up syntaxes and drop them in different sets, but I don't think that's that useful. The one test that does that can just ask the SyntaxSet to load the file it wants directly instead of loading the entire directory and picking one out.

robinst (Collaborator, Author):

This is an interesting idea. I think it's a bigger change than what I have here, but I'd like to try it out.

It would sacrifice being able to pick up syntaxes and drop them in different sets

Depending on how we implement it, I think we could still make that possible. We could retrieve the original contexts based on the HashMap<String, ContextId>.
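
A rough sketch of the flat layout being discussed (names are assumptions based on this thread; the final code differs in details):

use std::collections::HashMap;

struct Context { /* match patterns, meta scopes, ... */ }

// Single index into SyntaxSet::contexts: half the size of a
// (syntax_index, context_index) pair, and one lookup to resolve.
#[derive(Clone, Copy)]
struct ContextId(usize);

struct SyntaxReference {
    name: String,
    contexts: HashMap<String, ContextId>, // context name -> arena index
}

struct SyntaxSet {
    syntaxes: Vec<SyntaxReference>,
    contexts: Vec<Context>, // every context of every syntax, in one arena
}

impl SyntaxSet {
    fn get_context(&self, id: ContextId) -> &Context {
        &self.contexts[id.0]
    }
}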

robinst (Collaborator, Author):

Ok, so I've:

trishume (Owner) left a comment:

New builder and benchmark stuff looks good

@@ -91,58 +95,24 @@ impl SyntaxSet {
SyntaxSet::default()
}

/// Create a builder from this set to load more syntax definitions.
pub fn builder(self) -> SyntaxSetBuilder {

trishume (Owner):

👍 Thanks for thinking of this. It's good to have this so fancy text editors can load new syntax definitions easily.
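
A hedged sketch of the round trip this enables (method names as they appear in this diff; later commits rename builder to into_builder, and a follow-up PR was planned to rename add_syntax):

// Hypothetical: a text editor loading one more syntax definition at runtime.
fn add_syntax_at_runtime(set: SyntaxSet, extra: SyntaxDefinition) -> SyntaxSet {
    let mut builder = set.builder(); // consume the existing immutable set
    builder.add_syntax(extra);       // add the new (unlinked) definition
    builder.build()                  // link everything and get a SyntaxSet back
}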

robinst added 10 commits July 5, 2018 16:51
Contexts (e.g. inline contexts) are no longer nested within parent
contexts, but instead are now in the flat list like top-level contexts.

This makes it possible to work with indexes in
recursively_mark_no_prototype.

Context linking does not yet borrow check because of how references are
resolved.
This pleases the borrow checker.
The previous code didn't have this problem because it used
try_borrow_mut and so a recursive attempt to borrow the same context
would enter the block.
Previously, linking had to be done manually using `link_syntaxes` in
order for the syntaxes to work.

This change introduces a SyntaxSetBuilder that contains all the methods
to add/load syntaxes. When calling `build`, the syntaxes are linked and
you get a `SyntaxSet`.

This has the nice side effect of separating all the methods that mutate
to a separate type and making SyntaxSet immutable.
This means resolving a ContextId is now a single lookup.

Having the SyntaxSetBuilder (introduced in a previous commit) helped
make this change.
Now that we have a separate SyntaxReference type which just contains the
context IDs, there's no point in having the Vec in SyntaxDefinition
anymore.
The order determines the order of the flat Vec<Context> in the
SyntaxSet, so sort them to make dumps deterministic.

Another option would be to switch from HashMap to BTreeMap or to
IndexMap from the indexmap crate.
robinst added a commit to robinst/xi-editor that referenced this pull request Jul 19, 2018

robinst (Collaborator, Author) commented Jul 19, 2018

@trishume I've ported the syntect-plugin, see here: xi-editor/xi-editor@master...robinst:syntect-plugin-arena

The lifetimes are not that nice, but it works. I had to re-add the implementation of PartialEq and Eq for ParseState in this PR.

The plugin using == to compare ParseStates might be a good reason to switch it to use ContextId instead of keeping &Context and &SyntaxSet, as you mentioned earlier. It would mean old contexts would have to be looked up via the id every time they're used, but not sure how much slower that would be. It's your call, I don't mind changing it.

I'll look into the other remaining things for now :).

trishume (Owner) commented:

Sorry for leaving this to sit longer than usual; I've been really busy lately with a school project. I'll try and take a look at what the plugin rewrite looks like this weekend, but if you have time to test switching to ContextId and the benchmarks don't regress much (as I suspect they may not, due to things staying in cache), then that may be a good idea.

This means ParseState doesn't contain a reference to SyntaxSet anymore
and doesn't need a lifetime.

This makes it simpler for code that wants to cache it, but it also means
parse_line needs to be called with the SyntaxSet as an argument so that
it can look up contexts.
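
The resulting call pattern is roughly this (a sketch against the API as described here; exact signatures may differ):

// Hypothetical caller: the SyntaxSet is passed per call instead of being
// stored (and redundantly cached) inside every ParseState.
fn parse_all_lines(syntax_set: &SyntaxSet, syntax: &SyntaxReference, text: &str) {
    let mut state = ParseState::new(syntax);
    for line in text.lines() {
        let _ops = state.parse_line(line, syntax_set);
        // ...feed the ops to a highlighter or scope stack here...
    }
}
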
robinst (Collaborator, Author) commented Jul 27, 2018

No worries! I've now finished all the big outstanding TODOs. Now the only thing that's left is documentation and some organizational things such as method visibility/naming and where to put them.

I've also added some tests for SyntaxSet and the builder for adding syntaxes, cloning, and multi-threaded usage :).

The change for ParseState is in a separate PR for now: #190

trishume (Owner) left a comment:

Thanks, the latest into_builder stuff looks good!

robinst changed the title from "[WIP] Use arena for contexts" to "Use arena for contexts" on Jul 30, 2018

robinst (Collaborator, Author) commented Jul 30, 2018

So I've merged the other PR into this one, and updated the docs and the README.

The only thing left that I'd like to look at is:

  1. I've kept the ContextId APIs pub(crate) for now. Do you think e.g. the contexts field of SyntaxReference should be public, and get_context as well?
  2. While we're breaking the API, do you want to clean up the methods in SyntaxSetBuilder a bit? I'm thinking load_syntaxes should be renamed to load_from_folder maybe. I'm also not sure we need load_from_folder on SyntaxSet anymore. And maybe we should rename add_syntax to just add, or rename load_from_folder to load_syntaxes_from_folder to make the names consistent.

trishume (Owner) left a comment:

Nice new docs 👍

trishume (Owner) commented:

Yup merging soon sounds good to me. And re your questions:

Leaving the context stuff hidden is fine; I can't think of any use cases, even for pro text editors, that need them, and if anyone does they can ask.

And yah your suggested renames sound good. I like load_from_folder and changing to add.

robinst (Collaborator, Author) commented Jul 31, 2018

Actually, let's do the API renames in a new PR, which makes it possible to review them separately.

Should I click the Merge button? :)

robinst (Collaborator, Author) commented Jul 31, 2018

Ah btw, cargo bench -- load_internal_dump compared to master shaves a bit of time off loading dumps (turns out linking is not a big part of that but nice anyway):

load_internal_dump      time:   [26.601 ms 26.866 ms 27.131 ms]
                        change: [-11.629% -9.5751% -7.6640%] (p = 0.00 < 0.05)
                        Performance has improved.

trishume (Owner) commented:

Nice, a separate PR for renames sounds good.

I normally merge but you should indeed have the satisfaction of hitting merge on this enormous amount of awesome work, go ahead!

After the renames I'll look at drafting a 3.0 release with notes on the breaking and other changes.

robinst merged commit 731bdae into master on Aug 1, 2018

robinst (Collaborator, Author) commented Aug 1, 2018

Merged! 🎉

I'll prepare the other PR in the next few days.

robinst deleted the use-arena-for-contexts branch on October 30, 2018