Add a span field to LexBuildError #299
Conversation
Well, I did a first pass in 21f1d2c of my suggested "next step", so we could at least see the ramifications...
lrpar/src/lib/ctbuilder.rs
Outdated
```rust
let grm = YaccGrammar::<StorageT>::new_with_storaget(yk, &inc).map_err(|e| match e {
    YaccGrammarError::YaccParserError(e) => {
        let (column, line) = off_to_line_col(&inc, e.span.start());
        format!("{} at column {column} line {line}", e)
    }
    e => e.to_string(),
})?;
```
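The `off_to_line_col` helper called above isn't shown in this thread. Here is a minimal sketch of what such a helper might look like, assuming `off` is a byte offset into `src`; the `(column, line)` return order is inferred from the destructuring at the call site, not confirmed:

```rust
/// Hypothetical sketch: map a byte offset into 1-indexed (column, line).
/// Columns count chars (not bytes) since the last newline.
fn off_to_line_col(src: &str, off: usize) -> (usize, usize) {
    debug_assert!(off <= src.len());
    // Line = 1 + number of '\n' bytes strictly before `off`.
    let line = src[..off].matches('\n').count() + 1;
    // The current line starts just after the previous '\n' (or at byte 0).
    let line_start = src[..off].rfind('\n').map(|i| i + 1).unwrap_or(0);
    let col = src[line_start..off].chars().count() + 1;
    (col, line)
}
```

This walks the whole prefix on every call, which is exactly the cost the newline cache discussed below is meant to avoid.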
Going from `YaccGrammarError` -> `String` -> `Box<dyn Error>` is a bit hacky, but I'm not so sure what we could actually do otherwise.
@ltratt So, I think this at least covers the user-facing components (though I'll keep poking around to make sure I haven't missed anything). This will have inferior test suite output at the many ... So it's probably a good point where we can decide if this seems like a reasonable approach, as I think it is relatively complete and hits the points of interest.
So, I believe that I've resolved all the things from your previous review now.
cfgrammar/src/lib/span.rs
Outdated
```rust
@@ -43,3 +43,56 @@ impl Span {
        self.len() == 0
    }
}

pub struct NewlineToLineColCache<'a> {
```
Hmm, here's something interesting. When I first (naively) thought about this, I assumed that the cache wouldn't need a copy of the source. That's wrong, of course, because `offset_to_line_col` needs that to calculate column numbers.
But now I'm wondering if/how this could support incrementally built-up newline caches (e.g. for a streaming parser). Any thoughts?
I had considered it, though not in the streaming context (I'm just not sure what type the streaming context would/should give for `src` exactly). The way I was considering was moving the building of `newlines` into `offset_to_line_col`, checking the `off` parameter against `newlines.last()` in a branch (after the `self.src.len()` check but before the line/col calculation), then using `extend` on `newlines` with `self.src[last..off].char_indices().map(|cidx| cidx.0 + last)`, or something to that effect. The danger here is that `off` needs to be on a `str::is_char_boundary` for that slice to work, I believe. I don't see anywhere that the current mechanism relies on that, but it seems like it might be a reasonable requirement.
If that isn't a reasonable requirement, I think it should be possible to implement `floor_char_boundary` and `ceil_char_boundary` outside of std, so as not to require nightly.
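As a sketch of that idea, stable-Rust stand-ins for the nightly-only methods can be written in terms of `str::is_char_boundary`; these are free functions that merely borrow the std names, not the actual std implementations:

```rust
/// Largest char boundary <= i (or the string length if i is past the end).
fn floor_char_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(i) {
        i -= 1; // UTF-8 guarantees a boundary within at most 3 steps back.
    }
    i
}

/// Smallest char boundary >= i (or the string length if i is past the end).
fn ceil_char_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}
```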
Well, I've been thinking about this this morning and noticed one issue with the lazy building of the newlines vector: specifically, in cases where a file ends without a newline, it'll try to lazily find more newlines for any offset on the last line.
This could be an issue for code like minified JavaScript and JSON, where you have excessively long lines.
In the non-streaming case we could just add an index into newlines at EOF (which, besides the variable name, should be harmless)... but that seems like a poor approximation for just keeping a specific `usize` around for the furthest byte at which we've looked for a newline.
I went ahead and pushed an implementation of this; it takes the "assume char boundary" approach.
One thing that hadn't occurred to me beforehand is the change in mutability requirements for the line cache.
It still definitely isn't streaming, but I think it should be adaptable to the streaming case with the right type for `src`.
Anyhow, let me know what you think...
Sorry for the delay! I started chewing on this and then hit a mental dead end, but now I think I've got a handle on things. Summary: e7da5fd is quite cool, but I thought we wanted incremental building of the newline cache. What I've now realised is that we need to get rid of the stored source. Something like:

```rust
struct NewLinesCache {
    newlines: Vec<usize>
}

impl NewLinesCache {
    /// Feed more input into the cache, calculating newlines data from it.
    fn feed(&mut self, s: &str) { ... }

    /// Convert a byte offset in the input to a logical line number. Returns None if
    /// the byte offset exceeds the known input length.
    fn byte_to_line_num(&self, byte: usize) -> Option<usize> { ... }
}
```

Importantly, we don't store the input. The obvious downside is that because we don't have it, we can't calculate column numbers ourselves; but we can give the user enough information to do so:

```rust
impl NewLinesCache {
    ... // As above

    /// Convert a byte offset in the input to the byte offset of the beginning of its line.
    /// Returns None if the byte offset exceeds the known input length.
    fn byte_to_line_byte(&self, byte: usize) -> Option<usize> { ... }
}
```

Assuming the user has their own cache of the input, they can then use the byte of the beginning of the line to easily calculate the column offset. Of course we could add a convenience function along the lines of the following to do this on their behalf:

```rust
impl NewLinesCache {
    ... // As above

    /// Convert a byte offset in the input to logical line and column numbers.
    /// Returns None if the byte offset exceeds the known input length.
    fn byte_to_line_and_col(&self, s: &str, byte: usize) -> Option<(usize, usize)> { ... }
}
```

I am now very aware that we are dragging this PR ever further away from its original intent. Do you think it's perhaps now sensible to strip this PR back so it just does the original "add a span field" and then do a second PR with a newline cache along the lines of the above?
Don't worry about the delay, sometimes things just need to simmer for a bit. In ropey ... I just mention it because I've found the ... My thinking is that with these changes we are still within the realm of reasonable; while 'add a span field' isn't a regression, it really feels incomplete.
So I did a first pass at this; perhaps I misunderstood something, but I had to make two minor changes (both are probably untested and need some test cases).
I had gone with the latter meaning of `src`; did you mean the first? Perhaps some of this additional ...
```rust
newlines: Vec<usize>,
trailing_bytes: usize,
```
That's very clever. I hadn't even thought of this!
Hopefully not too clever. There may be some more places we need to use it beyond where we currently do; I'll try to think about this tonight, and come up with some tests which deal with multiple strings, including ones needing the `trailing_bytes`.
cfgrammar/src/lib/span.rs
Outdated
```rust
@@ -103,7 +103,7 @@ impl NewlineToLineColCache {
         if line_num > self.newlines.len() {
             None
         } else {
-            Some(self.newlines[line_num - 1])
+            Some(self.newlines[line_num.saturating_sub(1)])
```
Not sure if we need this `saturating_sub`, since this fn isn't `pub`, so we don't need to worry about being called with `0`. I suppose this is a good reason to keep it private.
An `assert` would be better IMHO: `saturating_sub` is worse than just `-` here because in debug mode it will still do weird things but won't crash.
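A tiny sketch of the assert-based version being suggested; the function and field names are assumed from the surrounding diff context, not taken from the actual patch:

```rust
/// Hypothetical private helper: map a 1-indexed line number to the byte
/// offset of that line's start. Callers must pass line_num >= 1, so an
/// assert (which fails loudly) is preferable to silently saturating.
fn line_num_to_byte(newlines: &[usize], line_num: usize) -> Option<usize> {
    assert!(line_num >= 1, "line numbers are 1-indexed");
    newlines.get(line_num - 1).copied()
}
```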
Fixed by a243de6
cfgrammar/src/lib/span.rs
Outdated
```rust
-if src_pos + byte == src.len() {
-    let line_off = self.newlines.last().unwrap();
-    return (self.newlines.len(), src[*line_off..].chars().count() + 1);
+let line_byte = self.line_num_to_byte(line_num);
```
It strikes me that this function isn't taking `trailing_bytes` into account, and I feel like it probably needs to; I'll have to think it over.
I think being driven by tests would be a good thing here!
cfgrammar/src/lib/span.rs
Outdated
```rust
    self.trailing_bytes = 0;
    Some(start_pos + offset + 1)
}
(_, '\r') => {
```
Without multiple tests to tell what this actually does, I'm not super happy with this: it looks like it's going to have some weird consequences. But I might be wrong! So let's add tests please :)
Yeah, you are correct here: this is going to mess up byte offsets, and if we aren't dealing with `\r` by itself, the corner case I was imagining won't actually be a case we're going to handle.
The thing I was worried about is `\r` contributing to the column calculation. In particular, if we have input like:

```rust
cache.feed("aaaa\nbbbb");
cache.feed("mnop");
cache.byte_to_line_and_column(.., "mnop", ...);
```

we need the characters "bbbb" to contribute to the column count, but currently they don't.
To deal with this case we need both a `trailing_bytes` and a `trailing_chars`, and `\r` should contribute to `trailing_bytes` so as to avoid messing up byte offsets, but it seems like it should not contribute to `trailing_chars`. If we were dealing with `\r` by itself as well as `\r\n`, this would seem much more problematic, as then we get cases like `cache.feed("\r"); cache.feed("\n");`.
At least I feel like I understand the case that led me to make this change now, so I'll get to work on codifying it as tests...
So it's actually worse than just needing `trailing_chars`: I believe the tests I've added need a `Vec<usize>`, i.e. a vector of trailing chars, one for each call to `feed`, or perhaps a `HashMap<usize, usize>` to map `src_pos` to `trailing_chars`.
This is seeming to get a bit hairy, so I am going to wait to proceed until you have a chance to weigh in.
Well, I just went ahead and implemented the above in 0f8f14f so we could at least look at the damage before deciding...
Edit: I think this is still a very special case, and we need some tests for normal cases; it probably only applies when the byte is on the first line of a string. On other lines we shouldn't take `trailing_bytes` and `trailing_chars` into account.
I'll work on those tests in the morning...
cfgrammar/src/lib/span.rs
Outdated
```rust
}

impl Default for NewlineToLineColCache {
    fn default() -> NewlineToLineColCache {
        NewlineToLineColCache {
            newlines: vec![0],
            trailing_bytes: 0,
            trailing_chars: 0,
            trailing: HashMap::new(),
```
What are `trailing_chars` and (in particular) `trailing`?
`trailing_chars` and `trailing_bytes` are accumulators, while `trailing` is a "snapshot" of `(trailing_bytes, trailing_chars)` at each point where `feed` was called; the key of `trailing` is the `src_pos` of the next call to `feed` beyond the current one.
`trailing_chars` is just the number of characters trailing the previous newline; we can't calculate it from bytes because of unicode characters. The copy in `trailing` is used to calculate the column in cases where the newline is in an earlier string than the one `byte_to_line_and_col` is passed as `src`. The copy of `trailing_bytes` in `trailing` is used similarly, to slice into the `src` string passed into `byte_to_line_and_col`. We need a snapshot of `trailing_bytes` and `trailing_chars` for each potential call to `byte_to_line_and_col`. For example, given:

```rust
cache.feed("\nabc");
cache.feed("mnop");
cache.feed("xyz");
```

`trailing` should (I believe) be `{4: (3, 3), 8: (7, 7)}`. We could add the first entry `0: (0, 0)`, but it is unnecessary: the start of the input never has trailing bytes or chars.
With 4 being the `src_pos` for `"mnop"` and 8 for `"xyz"`, we can then calculate that the `'n'` of `"mnop"` is at the 5th column, even though `"abc"` was never passed to `byte_to_line_and_col`.
In these ASCII cases the two values will be equal, but they should differ for unicode (another test).
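The worked example above can be checked mechanically. Here is a hedged sketch of just the snapshotting part; the field names follow the discussion, but the reset-on-newline logic is my reading of the description, not the PR's actual code:

```rust
use std::collections::HashMap;

/// Sketch: track (trailing_bytes, trailing_chars) after the last seen
/// newline, snapshotting them at each feed boundary.
struct TrailingTracker {
    src_pos: usize,
    trailing_bytes: usize,
    trailing_chars: usize,
    // Keyed by the byte position where the *next* feed would start.
    trailing: HashMap<usize, (usize, usize)>,
}

impl TrailingTracker {
    fn new() -> Self {
        TrailingTracker {
            src_pos: 0,
            trailing_bytes: 0,
            trailing_chars: 0,
            trailing: HashMap::new(),
        }
    }

    fn feed(&mut self, s: &str) {
        for c in s.chars() {
            if c == '\n' {
                // A newline resets both accumulators.
                self.trailing_bytes = 0;
                self.trailing_chars = 0;
            } else {
                self.trailing_bytes += c.len_utf8();
                self.trailing_chars += 1;
            }
        }
        self.src_pos += s.len();
        self.trailing
            .insert(self.src_pos, (self.trailing_bytes, self.trailing_chars));
    }
}
```

Feeding `"\nabc"`, `"mnop"`, `"xyz"` yields exactly the `{4: (3, 3), 8: (7, 7)}` snapshots described above (plus an entry for the final feed), and a multibyte character makes the byte and char counts diverge.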
Is the idea here that we want to cope with strings that end in non-valid-unicode characters?
No, just valid multibyte characters, each of which only adds 1 to the column offset. The `trailing_bytes` part of this is an artifact of the way we know we are slicing at a unicode `char_boundary`: by slicing just after the newline we know it is a valid `char_boundary`.
Your questions trigger me to think that we can perhaps get rid of this usage of `trailing_bytes`, though. In the `if line_off < src_pos` case, my feeling is I'm probably doing a bunch of extra work to calculate `src[0..]`, aka `src`, given a `line_off`, by subtracting out stuff...
I suggest that instead of trying to write/edit code, let's create all possible test cases (including edge cases) and be guided by that.
Indeed. Now that I have that one major edge case under wraps, I will add unicode tests, tests of the `None` cases, some more general testing, and tests of behaviour at `EOF`. I don't know of any more major edge cases lurking, but there's no proof they don't exist...
cfgrammar/src/lib/span.rs
Outdated
```rust
use std::ops::Range;

#[test]
fn newline_cache_splits() {
    let same_line_col_1 = [
```
I think we can simplify this and also create multiple tests (which we definitely need). Something like:
```rust
fn newline_test_helper(feed: Vec<&str>, tests: Vec<(usize, usize)>) {
    assert_eq!(feed.len(), tests.len());
    let cache = ...;
    for (f, (byte, line)) in feed.iter().zip(tests) {
        cache.feed(f);
        assert_eq!(cache.line_num(byte), line);
    }
}
```
where you would use it along the lines of:
```rust
newline_test_helper(vec!["a", "\nb"], vec![(0, 1), (2, 2)]);
```
It's absolutely vital to have a test for every edge case, as my experience with this sort of function is that the edge cases are easy to get wrong. The helper function might need generalising a bit (e.g. should `tests` be `Vec<Vec<(usize, usize)>>`?), but maybe it's a useful idea.
I'll have to digest this in the morning. Sorry for adding more copy/pastes of the test you quote above, but I wanted to fix/check the case I just added so I could sleep 🤣
I've added `newline_test_helper`. I went with `&[Vec<(usize, usize)>]`, but I didn't have a strong preference.
It shows the difference between the `["a", "b"]` cases, but maybe flattening these is preferred: we wouldn't have to repeat so much, but it wouldn't show the difference anymore 🤷.
I also called `feed` with all the arguments first, rather than doing it in a single loop, because I think it would catch cases where we use `trailing_bytes` when it is out of date because another string has been fed.
Anyhow, I've just added a couple of cases until we've decided.
Edit: I guess now I am leaning towards a flattened `Vec<(usize, usize)>` rather than `tests: &[Vec<(usize, usize)>]`. I think once we get a lot of tests, it'll be much less noisy.
cfgrammar/src/lib/span.rs
Outdated
```rust
    &["a", "b", "c"],
    &[vec![(1, 1)], vec![(1, 2)], vec![(1, 3)]],
);
newline_test_helper(&["abc"], &[vec![(1, 1), (1, 2), (1, 3)]]);
```
We also need an empty test `&[""]` or similar, and also probably "negative" tests that check we don't get results for e.g. `(1, 4)` or similar.
I did the first part in f995d7f; I'm still thinking through the latter "negative" tests part.
I believe that I covered all the cases where we should return `None` in `newline_cache_negative` of fb14a8d.
I'm not sure, though, that I understand the case about `(1, 4)`; i.e. the above test should fail with `assert_eq left: [..., (1, 3)] right: [..., (1, 4)]`. It's possible I am missing something.
cfgrammar/src/lib/span.rs
Outdated
```rust
let cr = &[(1, 1), (1, 2), (1, 3), (2, 1)];
newline_test_helper(&["a\r\n", "b"], cr);
newline_test_helper(&["a", "\r\nb"], cr);
newline_test_helper(&["a\r", "\nb"], cr);
```
So, the incrementing of trailing chars was removed in this patch because I kind of feel like this test should expect `&[(1, 1), (1, 2), (1, 2), (2, 1)]` rather than `..., (1, 2), (1, 3), ...`. But doing that uniformly across split strings is a bit of work, so we should decide what we want the output to be here.
With special-casing of `\r` in `feed`, it was giving `&[..., (1, 2), (1, 2), ...]` for the 3rd `cr` case, but `(1, 2), (1, 3)` for the first two, so I removed the special casing for now so it would at least be uniform.
Any thoughts?
Personally I'm fine with just ignoring `\r` when calculating newlines and treating it as a zero-width character for column positions. It's simple and has the same effect (AFAICT) as "complicated" solutions.
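A sketch of what "zero-width `\r`" means for column calculation; this is a hypothetical helper to illustrate the suggestion, not code from the PR:

```rust
/// Column (1-indexed, counted in chars) of `byte_off` within `line`,
/// treating '\r' as zero-width so CRLF and LF inputs report the same
/// column for corresponding characters.
fn col_ignoring_cr(line: &str, byte_off: usize) -> usize {
    line[..byte_off].chars().filter(|&c| c != '\r').count() + 1
}
```

For example, the `\n` in `"a\r\n"` lands at column 2, exactly where it would in `"a\n"`.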
Alright, that is going to take some work though.
Well, I'd tried ignoring `\r`, but ended up ignoring `\n` when it is preceded by `\r`, primarily because if `\r` appears at the beginning of a line, we could end up with a column of 0, which didn't seem right.
I believe this is fixed in 21d2f1f.
I'm still really unsure what the ...
I believe it is necessary, though it should probably be renamed. At the point of column calculation we don't have the string containing the `\n`, so we need to know how many bytes of characters there are after the newline, and how many characters are in them. I agree it doesn't feel necessary; it took me a while to come up with something that works, and I certainly don't think it is obvious. But I currently don't see any other way around it.
I understand that we might need to keep some of the previous count (though do we need to keep character counts? I think we just need byte counts), but why a ...?
I think we need both, for the unicode tests, because a multibyte character only contributes a single character to the column count.
Edit: ...
I wonder if we are differing in our thinking due to sequential vs non-sequential lookups. We wouldn't need the ... once we have ... We need to look back at the preceding bytes/characters from when x was fed.
Nope, I still don't get it! I think you should be able to call any interleaving you want of ...
Sure; when you look up a byte in ...
I should probably continue: so now, looking into ... The output we want looking up ... When it comes to looking up ...
I tried to rename the variables in a way which better reflects their usage, and added a large comment explaining their purpose and layout. I think/hope I do better in that format than in a comment thread, even if we decide to move that comment elsewhere. I hope that either the above or the last commit helps convey the need for it.
So, I did manage in 11bc6c9 to remove the ... Do you mind if I reorder and squash some commits out, to get rid of a bunch of interim changes?
Please go ahead.
Getting started on that. I'm not going to change anything during the rebase, but here are some thoughts/ideas on ...
Would something to that effect still incur the wrath of the lifetimes you were trying to avoid? Note: given the ...
(force-pushed from 11bc6c9 to bb105f1)
I wonder if the difference in our approaches comes in part from an earlier comment: my original suggestion in #299 (comment) was that we don't try to do anything about column numbers, because the only way to do that in UTF-8 is to store the entire string (or something close to it). We don't want to do that because we'll roughly double the amount of memory a given input takes. I suggested a convenience function ...
Part of the issue is that the ... So without something there is no way to know whether a given column number is accurate. So the problem is that ...
The idea of the ...
Aha, now I see what you mean, and that would work. Yeah, the only way I can think of which doesn't involve caching the entire string is the approach I've been undertaking, or some variation on it... Now I also finally understand why you didn't need the ...
FWIW, once we're done here I'll probably finish that other implementation in its own freestanding crate for fun... I just wanted to mention that I don't think the space overhead is that bad if we compare it to the ... So I think, in the scheme of things, it is really tiny compared to `newlines`. I am not sure if you were comparing to the size of the reference or the size of the string contents; if comparing to the string contents it seems negligible. That said, the additional complexity and additional arguments are non-negligible, and it seems perfectly acceptable for it to be in its own freestanding crate.
Please squash. [FWIW, I think having an external version of this with more features, but a slight performance penalty, is a good idea. Choice for users is always a good thing!]
(force-pushed from 21c7ae4 to bd01b1c)
Squashed.
Can we expand the commit message a bit? "A struct NewlineToLineColCache is added to convert the byte offsets of spans into line and column values." is sorta true, but also sorta misleading :) I think it would be nice to get across that, as it stands, it only helps with line numbers, but we provide a convenience method that also helps with column numbers for cases where the user has access to the complete input. Please force-push an update.
(force-pushed from bd01b1c to 38dad75)
This adds span fields to LexBuildError and YaccParserError, for when .l or .y files fail to parse. A helper struct NewlineToLineColCache has been added which can be used to convert byte offsets of spans into line numbers, and additionally into column numbers when the entire input is available.
(force-pushed from 38dad75 to 945cb2b)
That is a good point, updated.
bors r+
Build succeeded.
This currently uses the offset given to build an empty span at (off, off). I've commented in the tests some spans that I believe might be the right thing, but without really digging into the parser code for each case and checking out the results, I'm not certain of the accuracy of those comments.
I mainly wanted to push this before starting any work on that, because that will probably take a bit of work, increase the size of the patch, and that work should be isolated to private API.
My feeling is the next step would be to remove the line/col fields from the structure and the error text, moving that to callers using the span. In this patch we still need those, because by the time we format the message we no longer have the text to count newlines. Does that sound reasonable?