Get the span as a result of a match #242

charles-paperman · 2023-09-04T16:47:27Z

Given a string in RAM, it would be nice to collect all the span of the result so that they could be parsed and loaded with an external solution.

github-actions · 2023-09-04T16:47:46Z

Tagging @V0ldek for notifications

V0ldek · 2023-09-05T12:18:28Z

We already have matches which returns a Match that includes the MatchSpan, but it includes the copied bytes as well. I understand the request is about a zero-copy endpoint.

Unfortunately, we can't exactly get zero-copy, because of the additional work required to get the end of a match – starting indices are easy (hence indices exists), but to get the end we need to store the matched bytes and then trim the trailing whitespace; so the copy would happen anyway, even if we didn't return it to the caller.

However, if the purpose is for external to parse the results, then trimming the whitespace is probably not required. I'd expect any parser to just discard the trailing whitespace. Maybe we can expose this and explicitly document that it's an approximate span, where the ends might point to some whitespace after the actual match. Call it approx_spans? That would be significantly faster than the full matches function now.

V0ldek · 2023-09-05T12:32:22Z

To make the issue concrete, there are currently two real cases where a match includes whitespace that needs to be trimmed. One, if the JSON is weird and has spaces before commas delimiting items in an object or list:

{
  "a": 42   ,
  "b": 43
}

with query $.a would match 42␣␣␣. or

[ 42  , 43 ]

where $[0] would match 42␣␣␣.

This is a corner-case, since no sane person or formatter writes JSONs like that. The other happens all the time, however, and it's when the match happens at the end of an object/list and so it will include all whitespace until the parent's closing brace/bracket:

{
  "a": {
    "b": 42
  }
}

where $.a.b would match 42⏎␣␣.

In the last case, the correct span to report is 20-21, but to get that we need to trim the whitespace after the match. In the fast approximate span endpoint the report would be 20-25, which would include all the whitespace.

V0ldek · 2023-09-05T12:38:36Z

@charles-paperman Is there value in adding this to rq proper as an output mode or is the use case purely for the library?

charles-paperman · 2023-09-05T13:46:28Z

I believe it is purely for the lib. I don't think reporting the span in rq makes a lot of sense.

V0ldek · 2023-09-06T10:57:32Z

So the API here would be:

fn approx_spans<I, S>(&self, input: &I, sink: &mut S) -> Result<(), EngineError>
       where I: Input,
             S: Sink<MatchSpan>;

Internally, a different recorder that would eschew all the byte-copying and whitespace trimming and work with reported offsets only. It'd still need to use the stack to make sure the responses are ordered properly, we should follow a similar design to NodesRecorder.

The documentation MUST explicitly and prominently state that the spans will be inaccurate at the ends, but with the guarantee that only trailing JSON whitespace characters are included.

charles-paperman · 2023-09-07T19:04:56Z

Efficient approximate copy-free span would be neat!

- Engine can return an approximate span of the match, where "approximate" means the start index is correct, but the end index might include trailing whitespace after the match. - This mode is much faster that full `matches`, close to the performance of `count`, especially for large result sets. - This is a library-only feature. Ref: #242

V0ldek · 2023-09-20T18:09:44Z

Implemented in v0.8.1 as approximate_spans.

github-actions bot added the acceptance: triage Waiting for owner's input label Sep 4, 2023

V0ldek added mod: engine area: result Improvements in query result reporting type: feature New feature or request labels Sep 5, 2023

V0ldek added this to the v1.0.0 milestone Sep 6, 2023

github-actions bot added acceptance: go ahead Reviewed, implementation can start and removed acceptance: triage Waiting for owner's input labels Sep 6, 2023

V0ldek added help wanted External contributions welcome good first issue Good for newcomers labels Sep 6, 2023

V0ldek self-assigned this Sep 18, 2023

V0ldek mentioned this issue Sep 20, 2023

feat: Added approximate_spans result mode. #275

Merged

3 tasks

V0ldek added a commit that referenced this issue Sep 20, 2023

Merge branch 'main' into v0ldek/#242-approximate-spans

1c27b5a

V0ldek closed this as completed in #275 Sep 20, 2023

github-actions bot removed the acceptance: go ahead Reviewed, implementation can start label Sep 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get the span as a result of a match #242

Get the span as a result of a match #242

charles-paperman commented Sep 4, 2023

github-actions bot commented Sep 4, 2023

V0ldek commented Sep 5, 2023

V0ldek commented Sep 5, 2023

V0ldek commented Sep 5, 2023

charles-paperman commented Sep 5, 2023

V0ldek commented Sep 6, 2023

charles-paperman commented Sep 7, 2023

V0ldek commented Sep 20, 2023

Get the span as a result of a match #242

Get the span as a result of a match #242

Comments

charles-paperman commented Sep 4, 2023

github-actions bot commented Sep 4, 2023

V0ldek commented Sep 5, 2023

V0ldek commented Sep 5, 2023

V0ldek commented Sep 5, 2023

charles-paperman commented Sep 5, 2023

V0ldek commented Sep 6, 2023

charles-paperman commented Sep 7, 2023

V0ldek commented Sep 20, 2023