
Add a span field to LexBuildError #299

Merged · merged 1 commit on Jun 16, 2022

Conversation

@ratmice (Collaborator) commented Jun 2, 2022

This currently uses the offset given to build an empty span at (off, off).
In the tests I've commented on some spans that I believe might be the right values, but without really digging into the parser code for each case and checking the results, I'm not certain those comments are accurate.

I mainly wanted to push this before starting any work on that, because that will probably take a bit of work, increase the size of the patch, and that work should be isolated to private API.

My feeling is the next step would be to remove the line/col fields from the structure and the error text, moving that to callers using the span. In this patch we still need those fields, because by the time we format the message we no longer have the text to count newlines. Does that sound reasonable?

@ratmice (Collaborator, Author) commented Jun 3, 2022

Well, I did a first pass in 21f1d2c at my suggested "next step", so we can at least see the ramifications...
I believe that covers all the non-example code, but trying to figure out how Rust errors propagate can feel like whack-a-mole.

Comment on lines 387 to 374
let grm = YaccGrammar::<StorageT>::new_with_storaget(yk, &inc).map_err(|e| match e {
    YaccGrammarError::YaccParserError(e) => {
        let (column, line) = off_to_line_col(&inc, e.span.start());
        format!("{} at column {column} line {line}", e)
    }
    e => e.to_string(),
})?;
ratmice (Collaborator, Author):

This is a bit hacky, going from YaccGrammarError -> String -> Box<dyn Error>, but I'm not sure what else we could actually do.
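For what it's worth, one alternative to the String round trip (a hedged sketch; SpannedError is a hypothetical name, not a grmtools type) is to compute the line/col while the input is still in scope and wrap it in a type implementing std::error::Error, which can then be boxed directly:

```rust
use std::error::Error;
use std::fmt;

// Hypothetical wrapper: the line/col are captured while the source text is
// still available, so the final error can live as a Box<dyn Error> without
// an intermediate String-typed error.
#[derive(Debug)]
struct SpannedError {
    msg: String,
    line: usize,
    column: usize,
}

impl fmt::Display for SpannedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} at column {} line {}", self.msg, self.column, self.line)
    }
}

impl Error for SpannedError {}

fn main() -> Result<(), Box<dyn Error>> {
    // Stand-in for the map_err above: box the wrapper directly instead of
    // formatting to a String first.
    let e: Box<dyn Error> = Box::new(SpannedError {
        msg: "parse error".to_string(),
        line: 3,
        column: 7,
    });
    assert_eq!(e.to_string(), "parse error at column 7 line 3");
    Ok(())
}
```

The Display impl mirrors the "at column {column} line {line}" format used in the snippet above.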

@ratmice (Collaborator, Author) commented Jun 4, 2022

@ltratt So, I think this at least covers the user-facing components (though I'll keep poking around to make sure I haven't missed anything). It will have inferior test-suite output at the many YaccGrammar::new(...).unwrap() call sites, though (at least this patch doesn't try to add any niceties there).

So it's probably a good point at which to decide whether this seems like a reasonable approach, as I think it is relatively complete and hits the points of interest.

lrpar/src/lib/parser.rs (review thread resolved)
lrlex/src/lib/parser.rs (review thread resolved)
lrlex/src/main.rs (review thread resolved)
cfgrammar/src/lib/span.rs (review thread resolved)
@ratmice mentioned this pull request on Jun 5, 2022
@ratmice (Collaborator, Author) commented Jun 6, 2022

So, I believe that I've now resolved all the things from your previous review.

@@ -43,3 +43,56 @@ impl Span {
        self.len() == 0
    }
}

pub struct NewlineToLineColCache<'a> {
ltratt (Member):
Hmm, here's something interesting. When I first (naively) thought about this, I assumed that the cache wouldn't need a copy of the source. That's wrong, of course, because offset_to_line_col needs that to calculate column numbers.

But now I'm wondering if/how this could support incrementally built-up newline caches (e.g. for a streaming parser). Any thoughts?

ratmice (Collaborator, Author) Jun 7, 2022:

I had considered it, though not in the streaming context (I'm just not sure what type the streaming context would/should use for src exactly).

The way I was considering it was to move the building of newlines into offset_to_line_col, checking the off parameter against newlines.last() in a branch (after the self.src.len() check but before the line/col calculation), then extending newlines using self.src[last..off].char_indices().map(|cidx| cidx.0 + last), or something to that effect.

The danger here is that off needs to be on a str::is_char_boundary for that slice to work, I believe. I don't see anywhere that the current mechanism relies on that, but it seems like it might be a reasonable requirement.
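The lazy-building idea described here might be sketched roughly like this (my own illustration, not the PR's code; the names LazyNewlines and scanned are made up, and off is required to be a char boundary as discussed):

```rust
// Sketch of lazily extending the newline index only as far as each query
// needs. `scanned` records the furthest byte already searched for '\n', so a
// final line without a trailing newline is not rescanned on every query.
struct LazyNewlines<'a> {
    src: &'a str,
    newlines: Vec<usize>, // byte offset of the start of each line
    scanned: usize,       // furthest byte we've already searched
}

impl<'a> LazyNewlines<'a> {
    fn new(src: &'a str) -> Self {
        LazyNewlines { src, newlines: vec![0], scanned: 0 }
    }

    /// Extend the newline index up to `off`, then return 1-indexed (line, col).
    /// Returns None if `off` is out of range or not on a char boundary.
    fn offset_to_line_col(&mut self, off: usize) -> Option<(usize, usize)> {
        if !self.src.is_char_boundary(off) {
            return None;
        }
        if off > self.scanned {
            // Extend the index over the newly requested region only.
            for (i, c) in self.src[self.scanned..off].char_indices() {
                if c == '\n' {
                    self.newlines.push(self.scanned + i + 1);
                }
            }
            self.scanned = off;
        }
        // Number of line starts <= off is the 1-indexed line number.
        let line = self.newlines.partition_point(|&n| n <= off);
        let start = self.newlines[line - 1];
        Some((line, self.src[start..off].chars().count() + 1))
    }
}

fn main() {
    let mut lc = LazyNewlines::new("ab\ncd");
    assert_eq!(lc.offset_to_line_col(4), Some((2, 2))); // 'd'
    assert_eq!(lc.offset_to_line_col(0), Some((1, 1))); // 'a'
    assert_eq!(lc.offset_to_line_col(99), None);
}
```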

ratmice (Collaborator, Author):

If that isn't a reasonable requirement, I think it should be possible to implement floor_char_boundary and ceil_char_boundary outside of std, so as not to require nightly.

ratmice (Collaborator, Author):

Well, I've been thinking about this this morning and noticed one issue with this sort of lazy building of the newlines vector: in cases where a file ends without a newline, it will try to lazily find more newlines for any offset on the last line.

This seems like it could be an issue for code like minified JavaScript and JSON, where you can have excessively long lines.

In the non-streaming case we could just add an index into newlines at EOF (which, variable name aside, should be harmless)... but that seems like a poor approximation of just keeping a specific usize around recording the furthest byte at which we've looked for a newline.

ratmice (Collaborator, Author):

I went ahead and pushed an implementation of this; it takes the assume-char-boundary approach.
One thing that hadn't occurred to me beforehand is the change in mutability requirements for the line cache.

It still definitely isn't streaming, but I think it should be adaptable to the streaming case with the right type for src.
Anyhow, let me know what you think...

cfgrammar/src/lib/span.rs (review thread resolved)
@ltratt (Member) commented Jun 12, 2022

Sorry for the delay! I started chewing on this and then hit a mental dead end, but now I think I've got a handle on things.

Summary: e7da5fd is quite cool, but I thought we wanted incremental building of src (incremental building of newlines is a pleasant optimisation, but not fundamental). But that isn't going to work with lifetimes in a sensible way and we don't want to cache some sort of Vec<String> as that will probably duplicate what the user already has.

What I've now realised is that we need to get rid of src in the newlines cache entirely. I think the minimal API should be, roughly, something like:

struct NewLinesCache {
    newlines: Vec<usize>
}

impl NewLinesCache {
    /// Feed more input into the cache, calculating newlines data from it.
    fn feed(&mut self, s: &str) { ... }
    /// Convert a byte offset in the input to a logical line number. Returns None if
    /// the byte offset exceeds the known input length.
    fn byte_to_line_num(&self, byte: usize) -> Option<usize> { ... }
}

Importantly we don't store src: the user just keeps pushing strings into the cache, we keep adding newline data, but don't store the user's string.

The obvious downside is that because we don't have src we can no longer calculate column offsets. We can offer a function that makes it easier for the user to do so:

impl NewLinesCache {
    ... // As above

    /// Convert a byte offset in the input to the byte offset of the beginning of its line.
    /// Returns None if the byte offset exceeds the known input length.
    fn byte_to_line_byte(&self, byte: usize) -> Option<usize> { ... }
}

Assuming the user has their own cache of the input, they can then use the byte of the beginning of the line to easily calculate the column offset.

Of course we could add a convenience function along the following lines to do this on their behalf:

impl NewLinesCache {
    ... // As above

    /// Convert a byte offset in the input to logical line and column numbers.
    /// Returns None if the byte offset exceeds the known input length.
    fn byte_to_line_and_col(&self, s: &str, byte: usize) -> Option<(usize, usize)> { ... }
}
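Pulling the three proposed functions together, a minimal sketch of this API could look like the following (my own illustration of the proposal in this comment; the eventual grmtools type may differ in names and details):

```rust
// Sketch of the proposed cache: it stores only newline offsets, never the
// user's input. `newlines[0] == 0` marks the start of line 1.
struct NewLinesCache {
    newlines: Vec<usize>, // byte offset of the start of each line
    len: usize,           // total bytes fed so far
}

impl NewLinesCache {
    fn new() -> Self {
        NewLinesCache { newlines: vec![0], len: 0 }
    }

    /// Feed more input into the cache, calculating newline data from it.
    fn feed(&mut self, s: &str) {
        for (i, c) in s.char_indices() {
            if c == '\n' {
                self.newlines.push(self.len + i + 1);
            }
        }
        self.len += s.len();
    }

    /// 1-indexed line number, or None if `byte` exceeds the known input length.
    fn byte_to_line_num(&self, byte: usize) -> Option<usize> {
        if byte > self.len {
            None
        } else {
            // Number of line starts <= byte is the 1-indexed line number.
            Some(self.newlines.partition_point(|&off| off <= byte))
        }
    }

    /// Byte offset of the beginning of the line containing `byte`.
    fn byte_to_line_byte(&self, byte: usize) -> Option<usize> {
        self.byte_to_line_num(byte).map(|l| self.newlines[l - 1])
    }

    /// Convenience: given the whole input back as `src`, also compute a
    /// 1-indexed column, counted in chars.
    fn byte_to_line_and_col(&self, src: &str, byte: usize) -> Option<(usize, usize)> {
        self.byte_to_line_num(byte).map(|line| {
            let start = self.newlines[line - 1];
            (line, src[start..byte].chars().count() + 1)
        })
    }
}

fn main() {
    let mut cache = NewLinesCache::new();
    cache.feed("ab\nc");
    cache.feed("d\ne"); // whole input is "ab\ncd\ne"
    assert_eq!(cache.byte_to_line_num(3), Some(2));  // 'c'
    assert_eq!(cache.byte_to_line_byte(4), Some(3)); // line 2 starts at byte 3
    assert_eq!(cache.byte_to_line_and_col("ab\ncd\ne", 4), Some((2, 2)));
    assert_eq!(cache.byte_to_line_num(99), None);
}
```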

I am now very aware that we are dragging this PR ever further away from its original intent. Do you think it's perhaps now sensible to strip this PR back so it just does the original "add a span field" and then do a second PR with a newline cache along the lines of the above?

@ratmice (Collaborator, Author) commented Jun 12, 2022

Don't worry about the delay; sometimes things just need to simmer for a bit.
One thing I definitely notice here is the similarity to the API of ropey, which I really like.

In ropey, byte_to_line_byte is achieved by composing two separate functions, line_to_byte and byte_to_line. I bring it up because byte_to_line_num is (modulo panics vs. Option) similar to byte_to_line, and we could also easily implement line_num_to_byte, which is a bit more primitive than the functions proposed above.

I mention it because I've found the line_to_byte naming to be quite self-evident, whereas byte_to_line_byte requires a bit more thought. It might also be more efficient: with line_to_byte we can just do random access into the newlines Vec, whereas byte_to_line_byte would require a second search of the newlines vector.

My thinking is that with these changes we are still within the realm of the reasonable: while 'add a span field' isn't a regression, it really feels incomplete. And I think the changes we're discussing are still pretty localized, and won't impact anything beyond what this series is already changing. So I lean towards reaching parity with the existing code; once squashed, it seems like it should still be on the cusp of being a reasonable patch...

@ratmice (Collaborator, Author) commented Jun 13, 2022

So I did a first pass at this; perhaps I misunderstood something, but I had to make two minor changes (both are probably untested and need some test cases).

  • I needed a way to keep track of the number of bytes in a fed string trailing the last newline; this is trailing_bytes.
  • I wasn't certain how to deal with the src parameter to bytes_to_line_and_col: whether src is meant to be the cumulative string of all values passed via feed concatenated together, or one of the individual strings passed to feed, in which case I needed a byte position reflecting the start of src.

I went with the latter meaning of src; did you mean the former?
I haven't yet tested this outside of the single-string usage where src_pos is 0, though.

Perhaps some of this additional Option handling in callers could be turned into unwrapping, particularly in the tests.
Anyhow, it is a start 💤

cfgrammar/src/lib/span.rs (review thread resolved)
newlines: Vec<usize>,
trailing_bytes: usize,
ltratt (Member):

That's very clever. I hadn't even thought of this!

ratmice (Collaborator, Author) Jun 13, 2022:

Hopefully not too clever; there may be more places we need to use it beyond where we currently do. I'll try to think about this tonight and come up with some tests dealing with multiple strings, including ones that need trailing_bytes.

cfgrammar/src/lib/span.rs (review thread resolved)
cfgrammar/src/lib/span.rs (review thread resolved)
@@ -103,7 +103,7 @@ impl NewlineToLineColCache {
        if line_num > self.newlines.len() {
            None
        } else {
-           Some(self.newlines[line_num - 1])
+           Some(self.newlines[line_num.saturating_sub(1)])
ratmice (Collaborator, Author):

Not sure if we need this saturating_sub: since this fn isn't pub, we don't need to worry about being called with 0. I suppose that is a good reason to keep it private.

ltratt (Member):

An assert would be better IMHO: saturating_sub is worse than just - here because in debug mode it will still do weird things but won't crash.

ratmice (Collaborator, Author):

Fixed by a243de6

if src_pos + byte == src.len() {
let line_off = self.newlines.last().unwrap();
return (self.newlines.len(), src[*line_off..].chars().count() + 1);
let line_byte = self.line_num_to_byte(line_num);
ratmice (Collaborator, Author):

It strikes me that this function isn't taking trailing_bytes into account, and I feel like it probably needs to; I'll have to think it over.

ltratt (Member):

I think being driven by tests would be a good thing here!

self.trailing_bytes = 0;
Some(start_pos + offset + 1)
}
(_, '\r') => {
ltratt (Member):

Without multiple tests to tell what this actually does, I'm not super happy with this: it looks like it's going to have some weird consequences. But I might be wrong! So let's add tests please :)

ratmice (Collaborator, Author):

Yeah, you are correct here: this is going to mess up byte offsets, and if we aren't dealing with \r by itself, the corner case I was imagining isn't actually a case we're going to handle.

The thing I was worried about is '\r' contributing to the column calculation. In particular, if we have input like:
cache.feed("aaaa\nbbbb"); cache.feed("mnop"); cache.byte_to_line_and_column(.., "mnop", ...);
the characters "bbbb" need to contribute to the column count, but currently don't.

To deal with this case we need both trailing_bytes and trailing_chars, and \r should contribute to trailing_bytes so as to avoid messing up byte offsets, but it seems like it should not contribute to trailing_chars. If we were dealing with \r by itself as well as \r\n, this would seem much more problematic, as then we get cases like cache.feed("\r"); cache.feed("\n");.

At least I feel like I understand the case that led me to make this change now, so I'll get to work on codifying it as tests...

ratmice (Collaborator, Author) Jun 14, 2022:

So it's actually worse than needing trailing_chars: I believe the tests I've added need a Vec<usize>, i.e. a vector of trailing chars, one for each call to feed, or perhaps a HashMap<usize, usize> to map src_pos to trailing_chars.

This is getting a bit hairy, so I am going to wait to proceed until you have a chance to weigh in.

ratmice (Collaborator, Author) Jun 14, 2022:

Well, I just went ahead and implemented the above in 0f8f14f so we could at least look at the damage before deciding...

Edit: I think this is still a very special case, and we need some tests for normal cases; this probably only applies when byte is on the first line of a string. On other lines we shouldn't take trailing_bytes and trailing_chars into account.

I'll work on those tests in the morning...

}

impl Default for NewlineToLineColCache {
    fn default() -> NewlineToLineColCache {
        NewlineToLineColCache {
            newlines: vec![0],
            trailing_bytes: 0,
            trailing_chars: 0,
            trailing: HashMap::new(),
ltratt (Member):

What are trailing_chars and (in particular) trailing?

ratmice (Collaborator, Author):

trailing_chars and trailing_bytes are accumulators, while trailing is a "snapshot" of (trailing_bytes, trailing_chars) at the point where feed was called; the key of trailing is the src_pos of the next call to feed beyond the current one.

trailing_chars is just the number of characters trailing the previous newline; we can't calculate it from bytes because of multibyte Unicode characters. The copy in trailing is used to calculate the column in cases where the newline is in an earlier string than the one passed to byte_to_line_and_col as src. The copy of trailing_bytes in trailing is used similarly, to slice into the src string passed into bytes_to_line_and_col. We need a snapshot of trailing_bytes and trailing_chars for each potential call to bytes_to_line_and_col.

For example:
cache.feed("\nabc"); cache.feed("mnop"); cache.feed("xyz");

trailing should (I believe) be { 4: (3, 3), 8: (7, 7) }. We could add an initial 0: (0, 0), but it is unnecessary: the first string never has preceding trailing bytes or chars.
With 4 being the src_pos for "mnop" and 8 for "xyz", we can then calculate that the 'n' of "mnop" is at the 5th column, even though "abc" was never passed to bytes_to_line_and_col.

In these ASCII cases the byte and char counts are equal, but they should differ for Unicode (another test).
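To make the snapshot idea concrete, here is a rough, self-contained sketch (field names are taken from this discussion, but Cache and col_in are illustrative, not the PR's actual code; col_in is only meaningful when the line containing the byte began in an earlier fed string):

```rust
use std::collections::HashMap;

// Sketch of the `trailing` snapshot: after each `feed`, the (bytes, chars)
// seen since the last '\n' are recorded, keyed by the next string's starting
// offset (the src_pos of the *next* feed).
struct Cache {
    newlines: Vec<usize>,
    trailing_bytes: usize,
    trailing_chars: usize,
    trailing: HashMap<usize, (usize, usize)>,
    len: usize, // total bytes fed so far
}

impl Cache {
    fn new() -> Self {
        Cache {
            newlines: vec![0],
            trailing_bytes: 0,
            trailing_chars: 0,
            trailing: HashMap::new(),
            len: 0,
        }
    }

    fn feed(&mut self, s: &str) {
        for (i, c) in s.char_indices() {
            if c == '\n' {
                self.newlines.push(self.len + i + 1);
                self.trailing_bytes = 0;
                self.trailing_chars = 0;
            } else {
                self.trailing_bytes += c.len_utf8();
                self.trailing_chars += 1;
            }
        }
        self.len += s.len();
        // Snapshot for whichever string is fed next.
        self.trailing
            .insert(self.len, (self.trailing_bytes, self.trailing_chars));
    }

    /// Column of `byte` within the string fed starting at `src_pos`, assuming
    /// the line began in an earlier string (and `src` holds no '\n' before
    /// `byte`). The snapshot supplies the chars preceding this string.
    fn col_in(&self, src_pos: usize, src: &str, byte: usize) -> usize {
        let (_, preceding_chars) = self.trailing.get(&src_pos).copied().unwrap_or((0, 0));
        preceding_chars + src[..byte].chars().count() + 1
    }
}

fn main() {
    let mut cache = Cache::new();
    cache.feed("\nabc");
    cache.feed("mnop");
    cache.feed("xyz");
    // Matches the { 4: (3, 3), 8: (7, 7) } example above.
    assert_eq!(cache.trailing.get(&4), Some(&(3, 3)));
    assert_eq!(cache.trailing.get(&8), Some(&(7, 7)));
    // 'n' is byte 1 of "mnop" (src_pos 4): column 5, as described above.
    assert_eq!(cache.col_in(4, "mnop", 1), 5);
}
```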

ltratt (Member):

Is the idea here that we want to cope with strings that end in non-valid-unicode characters?

ratmice (Collaborator, Author):

No, just valid multibyte characters, each of which adds only 1 to the column offset. The trailing_bytes part of this is an artifact of how we know we are slicing at a Unicode char_boundary: by slicing just after the newline, we know it is a valid char_boundary.

Your questions make me think that we can perhaps get rid of this usage of trailing_bytes, though. If line_off < src_pos, my feeling is that I'm probably doing a bunch of extra work to calculate src[0..] (a.k.a. src) given a line_off, by subtracting stuff out...

ltratt (Member):

I suggest that instead of trying to write/edit code, let's create all possible test cases (including edge cases) and be guided by that.

ratmice (Collaborator, Author):

Indeed. Now that I have that one major edge case under wraps, I will add Unicode tests, test the None cases, some more general testing, and stuff at EOF. I don't know of any more major edge cases lurking, but there's no proof they don't exist...

use std::ops::Range;

#[test]
fn newline_cache_splits() {
    let same_line_col_1 = [
ltratt (Member):

I think we can simplify this and also create multiple tests (which we definitely need). Something like:

fn newline_test_helper(feed: Vec<&str>, tests: Vec<(usize, usize)>) {
    assert_eq!(feed.len(), tests.len());
    let mut cache = ...;
    for (f, (byte, line)) in feed.iter().zip(tests) {
        cache.feed(f);
        assert_eq!(cache.line_num(byte), Some(line));
    }
}

where you would use it along the lines of:

newline_test_helper(vec!["a", "\nb"], vec![(0, 1), (2, 2)]);

It's absolutely vital to have a test for every edge case, as my experience with this sort of function is that the edge cases are easy to get wrong. The helper function might need generalising a bit (e.g. should tests be Vec<Vec<(usize, usize)>>?), but maybe it's a useful idea.

ratmice (Collaborator, Author):

I'll have to digest this in the morning. Sorry for adding more copy/pastes of the test you quote above, but I wanted to fix/check the case I'd just added so I could sleep 🤣

ratmice (Collaborator, Author) Jun 14, 2022:

I've added newline_test_helper; I went with &[Vec<(usize, usize)>], but I didn't have a strong preference.
It shows the difference between these ["a", "b"] cases, but maybe flattening them is preferred: we wouldn't have to repeat so much, but it wouldn't show the difference anymore 🤷.

I also called feed with all the arguments first, rather than feeding inside a single loop, because I think it would catch cases where we use trailing_bytes when it is out of date because another string has been fed.

Anyhow, I've just added a couple of cases until we've decided.

Edit: I guess now I am leaning towards a flattened Vec<(usize, usize)> rather than tests: &[Vec<(usize, usize)>]. Once we get a lot of tests, it'll be much less noisy.

    &["a", "b", "c"],
    &[vec![(1, 1)], vec![(1, 2)], vec![(1, 3)]],
);
newline_test_helper(&["abc"], &[vec![(1, 1), (1, 2), (1, 3)]]);
ltratt (Member):

We also need an empty test &[""] or similar and also probably "negative" tests that check we don't get results for e.g. (1, 4) or similar.

ratmice (Collaborator, Author):

I did the first part in f995d7f; still thinking through the latter "negative" tests part.

ratmice (Collaborator, Author):

I believe that I covered all the cases where we should return None in newline_cache_negative of fb14a8d.

I'm not sure, though, that I understand the case about (1, 4); i.e. the above test should fail with assert_eq left: [..., (1, 3)] right: [..., (1, 4)]. It's possible I am missing something.

Comment on lines 266 to 200
let cr = &[(1, 1), (1, 2), (1, 3), (2, 1)];
newline_test_helper(&["a\r\n", "b"], cr);
newline_test_helper(&["a", "\r\nb"], cr);
newline_test_helper(&["a\r", "\nb"], cr);
ratmice (Collaborator, Author) Jun 14, 2022:

So, the incrementing of trailing chars was removed in this patch because I kind of feel like this test should be for &[(1, 1), (1, 2), (1, 2), (2, 1)] rather than ..., (1, 2), (1, 3), ...

But doing that uniformly across split strings is a bit of work, and we should decide what we want the output to be here.
With special-casing of \r in feed, it was giving &[.., (1, 2), (1, 2), ...] for the 3rd cr case, but (1, 2), (1, 3) for the first two, so I removed the special casing for now so it would at least be uniform.

Any thoughts?

ltratt (Member):

Personally I'm fine with just ignoring \r when calculating newlines and treating it as a zero width character for column positions. It's simple and has the same effect (AFAICT) as "complicated" solutions.
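As a standalone illustration of this suggestion (not the cache implementation itself), a column counter that treats \r as zero-width produces the &[(1, 1), (1, 2), (1, 2), (2, 1)] results discussed above for "a\r\nb":

```rust
// '\n' alone starts a new line; '\r' is a zero-width character for column
// purposes. `byte` is assumed to be a char boundary at or before src's end.
fn line_col(src: &str, byte: usize) -> (usize, usize) {
    let (mut line, mut col) = (1, 1);
    for (i, c) in src.char_indices() {
        if i == byte {
            break;
        }
        match c {
            '\n' => {
                line += 1;
                col = 1;
            }
            '\r' => {} // zero width: doesn't advance the column
            _ => col += 1,
        }
    }
    (line, col)
}

fn main() {
    let src = "a\r\nb";
    assert_eq!(line_col(src, 0), (1, 1)); // 'a'
    assert_eq!(line_col(src, 1), (1, 2)); // '\r'
    assert_eq!(line_col(src, 2), (1, 2)); // '\n': col 2, not 3
    assert_eq!(line_col(src, 3), (2, 1)); // 'b'
}
```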

ratmice (Collaborator, Author):

Alright, that is going to take some work though.

ratmice (Collaborator, Author):

Well, I'd tried ignoring \r, but ended up ignoring '\n' when it is preceded by '\r', primarily because if '\r' appears at the beginning of a line, we could end up with a column of 0, which didn't seem right.

ratmice (Collaborator, Author):

I believe this is fixed in 21d2f1f.

@ltratt (Member) commented Jun 14, 2022

I'm still really unsure what the trailing hashmap is doing or why it's needed. That might be me being stupid, but it doesn't feel like it should be necessary.

@ratmice (Collaborator, Author) commented Jun 14, 2022

I believe it is necessary (though it should probably be renamed preceding, or preceding_bytes_and_characters). When two strings are fed across a single line and we go to calculate columns, we only have the offset of the line we're on. When that line starts in a previous string, there are potentially a bunch of bytes and characters after the newline.

At the point of column calculation we don't have the string containing '\n', so for the column calculation we need to know how many bytes the run of characters after the newline occupies, and how many characters are in it.

I agree it doesn't feel necessary; it took me a while to come up with something that works, though, and I certainly don't think it is obvious. But I currently don't see any other way around it.

@ltratt (Member) commented Jun 14, 2022

I understand that we might need to keep some of the previous count (though do we need to keep character counts? I think we just need byte counts) but why a HashMap?

@ratmice (Collaborator, Author) commented Jun 14, 2022

I think we need both for the Unicode tests, because a multibyte character contributes only a single character to the column count, and we can't recover the number of characters from the bytes, nor the other way around.

HashMap is definitely debatable and was picked by default; I was thinking a BTreeMap would be better.
We could perhaps use a Vec<(usize, usize)>; we'd need to return some kind of SrcIdx from feed, to be passed back into bytes_to_line_and_col. I believe that would work; I'm not sure if it would allow us to avoid passing in src_pos too.

Edit:
But src_pos in this case is sort of acting as an index which uniquely identifies the string, as well as a way to make bytes relative to the start of the string slice we're dealing with...

@ratmice (Collaborator, Author) commented Jun 14, 2022

I wonder if where we are differing in our thinking is sequential vs. non-sequential use. I'd agree that if we have

cache.feed(x);
cache.line_num(.., x, ...);

we wouldn't need trailing, if it was only ever called sequentially like this. But we usually don't get spans until multiple strings have been fed, so we need the state at a previous call to feed (Edit: this was my assumption, at least!).

Once we have

cache.feed(x);
cache.feed(y);
cache.feed(z);
cache.line_num(..., y, ...);

we need to look back at the preceding bytes/characters from when x was fed.
Anyhow, it would help if I could pinpoint where we seem to be out of sync.

@ltratt (Member) commented Jun 14, 2022

Nope, I still don't get it! I think you should be able to call any interleaving you want of feed and line_num and I don't really see why keeping previous state is necessary. Can you maybe spell out (bearing in mind that I'm an idiot!) what happens with/without the hashmap?

@ratmice (Collaborator, Author) commented Jun 14, 2022

Sure, let's say

let x = "a";
let y = "b";
let z = "c";
// after this call cache.trailing_bytes == 1
cache.feed(x);
// after this call cache.trailing_bytes == 2
cache.feed(y);
// after this call cache.trailing_bytes == 3
cache.feed(z);

let (line, col) = cache.byte_to_line_and_col(1, y, 1);

When you look up a byte in y, you want a window into the past where trailing_bytes == 1, but by this time it has moved beyond that, due to the call to feed(z).

@ratmice (Collaborator, Author) commented Jun 14, 2022

I should probably continue:

So now, looking into byte_to_line_and_col, we have as parameters:
src_pos = 1, y = "b", byte = 1.

Looking up 'b', from cache.newlines we know that line_byte = 0; given our src_pos = 1, we know that the current line started in a preceding string. The column we expect is 2, so we want to add the trailing character 'a' to the column count.

When it comes to looking up 'c' in z, though, we want the trailing characters to reflect the trailing characters in both x and y: the previous newline is two strings ago, and we calculate column 3 for 'c' with two of those characters preceding it.

@ratmice (Collaborator, Author) commented Jun 15, 2022

I tried to rename the variables in a way which better reflects their usage, and added a large comment explaining their purpose and layout. I think/hope I do better in that format than in a comment thread, even if we decide to move that comment elsewhere; I hope that either the above or the last commit helps convey the need for it.

@ratmice (Collaborator, Author) commented Jun 15, 2022

So, in 11bc6c9 I did manage to remove the trailing_bytes that was in trailing. Your intuition that something was unnecessary there was right on; sorry I didn't see it earlier, too much thinking relative to line offsets.

Do you mind if I reorder and squash some commits, to get rid of a bunch of interim changes?

@ltratt (Member) commented Jun 15, 2022

> Do you mind if I reorder and squash some commits out, to get rid of a bunch of interim changes?

Please go ahead.

@ratmice (Collaborator, Author) commented Jun 15, 2022

Getting started on that.

I'm not going to change anything during the rebase, but here are some thoughts/ideas on getting rid of the HashMap. I see two options:

  1. Return trailing_characters from feed, and pass it back into bytes_to_line_and_col;
     i.e. the first call to feed returns 0, the second returns the trailing_chars accumulated by the first call.
  2. A variation of 1, where we change feed to something like
     feed<'a>(&mut self, &'a str) -> NewlineCacheEntry<'a>, and pass that back to bytes_to_line_and_col:

struct NewlineCacheEntry<'a> {
    src_pos: usize,
    src: &'a str,
    preceeding_cols: usize,
}

Would something to that effect still incur the wrath of the lifetimes you were trying to avoid?
It would at least reduce the number of arguments: bytes_to_line_and_col(&self, entry: &NewlineCacheEntry<'a>, byte: usize).

Note: given that \r contributes 0 to the column count, a better name for trailing_chars would be trailing_cols, so I should probably rename that.

@ltratt (Member) commented Jun 15, 2022

I wonder if the difference in our approaches shows in part of one of your comments:

> We need the preceding characters to calculate column numbers.

My original suggestion in #299 (comment) was that we don't try to do anything about column numbers, because the only way to do that in UTF-8 is to store the entire string (or something close to it). We don't want to do that, because it would roughly double the amount of memory a given input takes. I suggested a convenience function byte_to_line_and_col where the user can pass a &str in to get column numbers; but if they don't do that, all they get by default is line numbers.

@ratmice
Collaborator Author

ratmice commented Jun 15, 2022

Part of the issue is that the &str alone isn't enough to calculate valid column numbers in cases where lines span multiple strings.
Given a src: &str, columns and character positions are relative to that &str, so we can calculate that "a" is at 1, but if that string is preceded by "foo", the actual column for "a" is 4.

So without something extra there is no way to know whether a given column number is accurate. The problem is that byte_to_line_and_col needs a bit of information about the very end of the previous string fed before it to actually produce a valid column number.

@ltratt
Member

ltratt commented Jun 15, 2022

The idea of the byte_to_line_and_col function is that you have to pass the whole string in (i.e. if you did feed("a"); feed("b") you need to do byte_to_line_and_col("ab")). I can't see any way around this that doesn't involve caching the entire string.
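In code, that suggestion amounts to something like the following sketch (illustrative names again): the cached newline offsets determine the line, and the caller-supplied complete input determines the column.

```rust
/// `newlines` holds the byte offsets of every '\n' in `src`, as accumulated
/// while feeding; `src` must be the complete input, i.e. the concatenation
/// of everything fed so far. Returns a 1-based (line, col) pair.
pub fn byte_to_line_and_col(newlines: &[usize], src: &str, byte: usize) -> (usize, usize) {
    let line = newlines.iter().take_while(|&&nl| nl < byte).count() + 1;
    // The line starts one byte past the previous newline (or at 0 on line 1).
    let line_start = if line == 1 { 0 } else { newlines[line - 2] + 1 };
    // Count chars rather than bytes so multi-byte UTF-8 sequences occupy a
    // single column.
    let col = src[line_start..byte].chars().count() + 1;
    (line, col)
}
```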

@ratmice
Collaborator Author

ratmice commented Jun 15, 2022

Aha, now I see what you mean, and that would work.

Yeah, the only way I can think of which doesn't involve caching the entire string is the approach I've currently been taking, or some variation on it... Now I also finally understand why you didn't need the src_pos argument denoting its offset either.

@ratmice
Collaborator Author

ratmice commented Jun 16, 2022

FWIW, once we're done here I'll probably finish that other implementation in its own freestanding crate for fun...

I just wanted to mention that I don't think the space overhead is that bad if we compare it to the newlines vector, except maybe with very short strings. This is primarily because the overhead needed for newlines is one usize per newline, while the overhead for preceding_columns is less than one usize per string: the map only needs to contain one usize per string whose end-of-string column position is non-zero, so you may assume that anything not in the map is zero.

So in the scheme of things I think it is really tiny compared to newlines. I am not sure if you were comparing to the size of the reference or the size of the string contents; compared to the string contents it seems negligible.

That said, the additional complexity and additional arguments are non-negligible, and it seems perfectly acceptable for it to live in its own freestanding crate.

@ltratt
Member

ltratt commented Jun 16, 2022

Please squash.

[FWIW, I think having an external version of this with more features, but a slight performance penalty, is a good idea. Choice for users is always a good thing!]

@ratmice
Collaborator Author

ratmice commented Jun 16, 2022

Squashed.

@ltratt
Member

ltratt commented Jun 16, 2022

Can we expand the commit message a bit because "A struct NewlineToLineColCache is added to convert the byte offsets of spans into line and column values." is sorta true, but also sorta misleading :) I think it would be nice to get across that, as it stands, it only helps with line numbers, but we provide a convenience method that also helps with column numbers for cases where the user has access to the complete input. Please force push an update.

This adds span fields to LexBuildError and YaccParserError,
for when a .l or .y file fails to parse.

A helper struct NewlineToLineColCache has been added
which can be used to convert byte offsets of spans into
line numbers, and additionally column numbers when the entire
input is available.
@ratmice
Collaborator Author

ratmice commented Jun 16, 2022

That is a good point, updated.

@ltratt
Member

ltratt commented Jun 16, 2022

bors r+

@bors
Contributor

bors bot commented Jun 16, 2022

Build succeeded:

@bors bors bot merged commit 6da01e9 into softdevteam:master Jun 16, 2022
@ratmice ratmice deleted the lexbuilderr_span branch June 23, 2022 16:52