Initial attempt at adding RecoveryKind to yacc %grmtools section#547
Initial attempt at adding RecoveryKind to yacc %grmtools section#547ratmice wants to merge 24 commits into
Conversation
👍 |
This patch implements (but does not yet use) a wrapper around hashmap. * Adds an API for marking keys as `used`. * Provides a mechanism for getting an iterator over `unused` keys. * A query mechanism that checks whether value for a given key matches the specified variant of the of the `Value` type. Otherwise producing an error.
|
I added basic implementation of what I was thinking in 96d721d It doesn't get used yet anywhere, so is largely untested. Anyhow that is the basic idea of what I had in mind. |
| pub fn query_key( | ||
| &self, | ||
| key: &str, | ||
| query_mask: u16, |
There was a problem hiding this comment.
Of course there is some borrow checker issues with the combination of query_key/mark_key_unused,
So this interface seems to require some changes. I haven't gotten the entry API to work (since we need to mutate the header not the HashMap. So might have to resort to contains_key -> mutate -> query.
There was a problem hiding this comment.
Should be fixed in 06458b1
I'm not sure it's the best fix, the "mark as used" even on error behavior
is intended to avoid the following situation:
- specify
%grmtools{yacckind: Foo} - Get an "invalid yacc kind" for
Fooerror. - also get
unused yacckind
|
That at least seems like a first try at fixing all 3 of those issues, and I haven't thought of any more either. |
| } | ||
| } | ||
| let recoverykind = recoverykind.unwrap_or(RecoveryKind::CPCTPlus); | ||
| ast_validation.header_mut().mark_key_used("recoverer"); |
There was a problem hiding this comment.
It's worth pointing out a difference here between nimbleparse and CTParserBuilder.
In nimbleparse we mark_key_used("recoverer") even when the value is ignored because the value was set on the command line.
But in CTParserBuilder it will complain if you set recoverykind(CPCTPlus) but also have the key in the %grmtools section. The thinking for allowing it to error in CTParserBuilder was it might be confusing if one of the build.rs or %grmtools section went unused.
So how it is, if you specify it in both places you should get an unused in %grmtools section error.
Edit: There is kind of an interaction here with the forcing behavior of lrlex, we should probably consider how to handle it generally? Anyhow I'm not entirely certain what to think about it yet.
There was a problem hiding this comment.
There is kind of an interaction here with the forcing behavior of lrlex, we should probably consider how to handle it generally? Anyhow I'm not entirely certain what to think about it yet.
In the sense that it's not clear which setting should be forcing which?
There was a problem hiding this comment.
There is kind of an interaction here with the forcing behavior of lrlex, we should probably consider how to handle it generally? Anyhow I'm not entirely certain what to think about it yet.
In the sense that it's not clear which setting should be forcing which?
I think a good way to explain it is that currently, we have Force, Default, NoDefault, but we lack XOR.
- Force overrides input text.
- Default is overridden by input text.
- NoDefault forces input text to specify.
So we don't have a way to specify "one of the builder, or the input text should specify, but not both"
That is actually the behavior which CTParserBuilder kind of implements for RecoveryKind because RecoveryKind only gets marked used if it is None the CTParserBuilder, so if it is specified in both,
we complain that RecoveryKind is unused in the input file.
This wasn't really an intended thing, I only noticed it when I added my comment, but the behavior as implemented does have a correlation to an XOR forcing mechanism
There was a problem hiding this comment.
I guess the other thing is we originally added YaccKindResolver, there was only one value yacckind for cfgrammar::YaccParser, but when we got to lrlex there were multiple flags. Now that we integrate CTParserBuilder the forcing mechanism specific to cfgrammar, and it isn't clear how to expand it across these multiple crates. Neither of the forcing mechanisms the way we did it in cfgrammar nor the way we did it in lrlex works when crossing crate boundaries.
I guess that is where my head is most fogged up.
There was a problem hiding this comment.
Part of me wonders if I shouldn't be looking towards the mark_key_used mechanism, and adding a flag to it
to make it less specific to used/unused, and then reusing the structure as an input parameter, and doing the whole merge thing we did with lexflags but with all these string arguments.
so:
enum Mark {
Used: 1 << 0,
Default: 1 << 1,
Force: 1 << 3,
XOR: 1 << 4
}
/// You *may* mark a key that may or may not be in the header itself.
fn mark_key(self: Header, key, mark: Mark)
Edits: added XOR, removed NoDefault (I think it would be just an absent key from a header arg)
I feel like this is probably at least worth experimenting with...
There was a problem hiding this comment.
Fwiw, here is the start of alternative to hashmap I was playing with...
Under the assumption that this has so few keys it can afford to be less than optimal.
Edit: By no means is this well tested though, e.g. get_mark has a bug in that link
There was a problem hiding this comment.
FWIW, here is a better tested version that implements the merge, unused checking and adds a required mark.
That is to say, it is approaching the merging behavior or LexFlags except in a stringly typed fashion, and on a per-key basis unlike lex_flags where we had different collections for defaults and forcing.
https://gist.github.com/ratmice/4081fec2d6118682ff7b433cb413943b
Anyhow I still can't tell if it is a terrible idea or not...
There is probably quite a ways to go before it actually feels like the HashMap API (if we even want that).
But essentially it tries to mimic HashMap api which includes these merge, missing, and unused functions.
Which specifies on a per-key basis how to resolve conflicts, or whether they are required but missing, or present but unused.
Edit: Still some issues with it's logic, but it is getting there...
There was a problem hiding this comment.
I went ahead and integrated the thing in cdccd47 just to see how far away from getting it working in place of the HashMap we were.
I had to add the Borrow thing to a few places, and add contains_key, and change some method signatures for the used function.
But overall it did drop into place pretty well, this doesn't actually utilize the merge behavior, but I guess that would be next.
There was a problem hiding this comment.
So the next step making CTParserBuilder use these marks and merge behaviors instead of things YaccKindResolver, is sort of being a painful task, and showing it's warts.
I added things like impl Into<header::Value> for YaccKind and then insert them into a separate header than the one read from the grammar source.
One of the issues is that Header assumes it's going to be read from a file, so It has things like Span.
while the one constructed from the CTParserBuilder doesn't have span data.
Ideally I guess we'd have headers with/without spans and be able to merge them into one with optional spans
Anyhow, I'll press on and maybe this just my distaste for stringly typed nature of it, but I certainly don't have high hopes for this particular experiment.
There was a problem hiding this comment.
So c16876c is the first functioning attempt which replaces YaccKindResolver with this whole merge_from behavior, there is quite a few warts in there.
But it does appear to function.
| pub fn new() -> Self { | ||
| Header { | ||
| contents: HashMap::new(), | ||
| used: vec![], |
There was a problem hiding this comment.
Given the thoughts on extending the mark behavior to include more than used, but also forcing flags.
I've been thinking about ways to get rid of the used value entirely,
For instance whether we can use hashbrown::HashTable to store the mark flags along side the key in a way that allows them to be mutated separately without affecting the Hash.
I don't yet know whether this HashTable API though allows us to add to keys in a way that doesn't also require us to add an element. At the very least relatively popular crates like SlotMap unfortunately don't fall under such a use case as slots are created during insertion.
The other option is just using a sorted vec instead contents: Vec<(String, Mark, Option<(Span, Value)>>
which honestly seems like the simplest option to me.
It contains a lot of hacks, like `unwrap()` and made up `Span::new()`. This is because this transitions the `Header` structure from being something that was always parsed, into something that gets constructed by `CTParserBuilder` and then merged with what is parsed. This is largely just an experiment with this approach to find out if it is actually viable.
| .contents_mut() | ||
| .set_merge_behavior(&"yacckind".to_string(), MergeBehavior::Ours); | ||
| header.contents_mut().mark_required(&"yacckind".to_string()); | ||
| // FIXME seems like this this should be Eco! |
There was a problem hiding this comment.
Searching through this patch for eco, it appears that there are
a couple more instances besides this FIXME where the yacc kind seems like it should be Eco
but at some point most likely was accidentally changed to GenericParseTree.
There was a problem hiding this comment.
Both this and the copy in firsts.rs both seem to have been using Original since they were added
|
|
||
| #[test] | ||
| fn test_first_no_subsequent_rules() { | ||
| let mut header = Header::new(); |
There was a problem hiding this comment.
It would be really nice to simplify a lot of this testsuite stuff with a macro.
I think I'll try to make a test_support.rs module, with a macro header_for_yacckind! or something to that effect.
| /// then unused to construct a `YaccGrammar`, which will either produce an | ||
| /// `Ok(YaccGrammar)` or an `Err` which includes these errors. | ||
| pub fn new(yacc_kind_resolver: YaccKindResolver, s: &str) -> Self { | ||
| pub fn new(header: Header, s: &str) -> Self { |
There was a problem hiding this comment.
So, I wonder here if we can take a header: &mut Header here,
and then avoid returning it in YaccParser::build(), it seems possible that would simplify things.
But I'm not sure whether the compiler will be able to tell that it isn't capture by the GrammarAST return value.
There was a problem hiding this comment.
It appears that we can, this was done in 4c59bb9.
It required adding a lifetime to (now) YaccParser<'a>, but I think that should be a private type.
Seemed like a small win to me.
| self.header = header; | ||
| result | ||
| Ok((parsed_header, i)) => { | ||
| self.header.merge_from(parsed_header).unwrap(); |
There was a problem hiding this comment.
This is unwrap() is just a temporary hack that was done just to get things working,
These errors should somehow be converted into a grammar error.
There was a problem hiding this comment.
I fixed that by trying to map this error to one of the Duplicate errors DuplicateGrmtoolsSectionEntry.
06390fa
This subsequently required me to add a new spans kind for none-or-more spans in 6a7f5c5
I'm not certain that reusing DuplicateGrmtoolsSectionEntry is the right thing, e.g. usually we see duplicate because of something silly like:
%grmtools{yacckind: Grmtools, yacckind: Grmtools}
however this is like
cp_builder.yacckind(YaccKind::Grmtools)
Parsing a source file with:
%grmtools{yacckind: Grmtools}
|
So, I think this is at a good place to decide if we should put any more effort into it, or cut our losses here. Edit: I did start grinding down some of the sharper corners at least. I think most of what is left is just refining the tree API But from a big picture perspective of having this header kind of tree, it's basic purpose of merging things CTParserBuilder or nimbleparse and the input sources. How it gets passed around and reused across crates because it just holds settings as a bunch of I at least find the tree structure moderately interesting, and I can't remember seeing anything generalized of it's sort (Although it isn't really generalized, and not having seen anything like it might be a reason enough to avoid it!)... Perhaps though it would help if I spent some time trying to get the commit history in a better state first? |
Since this series now uses `From` impls to build headers from the command line, and builder values, header's may no longer have a notion of source location. This makes these spans `Option<Span>`, and adds a `SpansKind` that may be empty.
| out.push('\n'); | ||
| let s = match e.spanskind() { | ||
| SpansKind::DuplicationError => { | ||
| SpansKind::DuplicationError | SpansKind::OptionalDuplicationSpan => { |
There was a problem hiding this comment.
This is a bit awkward, we can only count the duplicate occurances by their spans, so for OptionalDuplicationSpan, we can't accurately know how many duplicates there are, just how many we see.
This is kind of akin to the old C compiler problem where you can say get C preprocessor variables
cc -DFOO where their source location is on <command-line>.
Except here because we're dealing with Option<Span> ↝ Vec<Span>, these None values end up dropped on the floor.
There was a problem hiding this comment.
Anyhow it seems likely to me that these should be something else besides Option<Span>, perhaps something to the effect of:
enum Location {
Span(Span),
CommandLine,
Other(String),
}
Allowing Other("From CTParserBuilder".to_string()) or some such.
Edit: I guess the main issue with this idea is we can't actually put one of those in a spans: Vec<Span>.
So, I'd suppose the right thing would be to keep HeaderError and friends and YaccGrammarError separate, making parse return an enum
enum ParseError {
HeaderError(...),
GrammarError(YaccGrammarError),
}
then change HeaderError to use Location instead of Optional<Span>.
That keeps Location isolated to only those places where command line things can be present
which seems workable.
When comparing command-line derived, or values from `CTParserBuilder`, to elements which are parsed from an input source we may or may not have span information. This includes an enum `Location` which contains a span, but can specify `CommandLine`, or `Other` location. The parts of `YaccParser::parse` which compare elements that are lacking spans, return errors that contain `Location`. Other parts of `YaccParser` *only* deal with elements that derive from input sources. which still return `YaccGrammarError`. So the top-level enum changed to one which can contain *either* `YaccGrammarError` *or* these other error kinds.
Previously we stored an `Option<u16>` where a key/value was inserted but it's mark could be `None`. This seems like a bad idea. we often did mark.map when we intended `mark.unwrap_or(0)`, So remove the `Option` and default to `0`.
|
So, For now I feel like I should probably just close this, while it is finally in a state where it is both pretty much complete and I can't think of any glaring holes, It isn't exactly in a state where I feel it could reasonably be reviewed. So I'll spend some time reordering the commits, so that free standing things can be committed separately |
|
Thanks! |
Here is an initial attempt at adding
recovererto the grmtools section for yacc.Things I don't like:
%grmtools{unknown: Foo}I have some thoughts on these last two, for grmtools but lex is perhaps more complicated.
In the same way that ctparserbuilder calls into
self.yacckind = ast_validation.yacc_kind();we could use a similar path to communicate the key/values that went unused byYaccParser.On the first one, I'll try and think of helper functions, but we have things like YaccKind which can be either a
Unitaryor aConstructor, so we need to be able to check that it isoneOfthose.I'll try and fiddle with it more, but this is at least a naive draft implementation.
Edit: I have some general thoughts on things to attempt,
largely replacing the
HashMapwith something that tracks key consumption,and a small mechanism for querying values based on a
reprenum.Anyhow the feeling I get is that a key query API that checks the value in the hashmap against one of those in bitmask form. That along with, marking keys as "used", doesn't seem to fall into any complexity traps.
I'll elaborate tomorrow in code form...