Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect stream statistics and offer new hook to check in on them #1767

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

rsmmr
Copy link
Member

@rsmmr rsmmr commented Jun 21, 2024

This extends the stream type to track statistics about their data, and it adds
new %sync_advance hook that can be used to check statistics during
synchronization. For more information, see the individual commits.

Closes #1768.

@rsmmr rsmmr force-pushed the topic/robin/gh-3779-stream-stats branch from 733b5a4 to 3cbad6b Compare June 24, 2024 08:54
@rsmmr
Copy link
Member Author

rsmmr commented Jun 24, 2024

Pushed a new version that introduces a property %sync-advance-block-size with the discussed semantics, and a default of 4K. I left the hook at %sync_advance. Staying with the $stream for now. Still open for more ideas on naming.

We implement this by storing the current input stream inside the
internal parse struct as a new field `__stream`. Using `stream()` then
turns into an access to that field (i.e,  `*(self.__stream)`). If that
field never gets accessed, the optimizer will remove it, so there's no
overhead if nobody ever uses `stream()` with a unit.
@rsmmr rsmmr force-pushed the topic/robin/gh-3779-stream-stats branch from 3cbad6b to eb76180 Compare June 24, 2024 10:15
@rsmmr rsmmr marked this pull request as ready for review June 24, 2024 10:16
@rsmmr
Copy link
Member Author

rsmmr commented Jun 24, 2024

Ready for review now, now with self.stream() instead of $stream.

bbannier
bbannier previously approved these changes Jun 24, 2024
Copy link
Member

@bbannier bbannier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked that self.stream() does not do something crazy for &parse-from (e.g., __stream becoming invalid in a subunit); since the code is so simple not sure we necessarily need dedicated tests for that (but might still be nice).

hilti/runtime/include/types/stream.h Outdated Show resolved Hide resolved
hilti/runtime/include/types/stream.h Outdated Show resolved Hide resolved
tests/hilti/types/stream/ops.hlt Outdated Show resolved Hide resolved
tests/spicy/types/unit/stream-statistics.spicy Outdated Show resolved Hide resolved
hilti/toolchain/src/ast/operators/stream.cc Outdated Show resolved Hide resolved
hilti/toolchain/include/ast/forward.h Show resolved Hide resolved
doc/programming/parsing.rst Outdated Show resolved Hide resolved
This returns a struct of the following type, reflecting the input
seen so far:

```
type StreamStatistics = struct {
    num_data_bytes: uint64;     ## number of data bytes processed
    num_data_chunks: uint64;    ## number of data chunks processed, excluding empty chunks
    num_gap_bytes: uint64;      ## number of gap bytes processed
    num_gap_chunks: uint64;     ## number of gap chunks processed, excluding empty chunks
};
```
This adds support for a new unit hook:

```
on %sync_advance(offset: uint64) {
    ...
}
```

This hook is called regularly (see below) during error recovery when
synchronization skips over data or gaps while searching for a valid
synchronization point. It can be used to check in on the
synchronization to, e.g., abort further processing if it just keeps
failing. `offset` is the current position inside the input stream that
synchronization just skipped to.

By default, "called regularly" means that it's called every 4KB of
input skipped over while searching for a synchronization point. That
value can be changed by setting a unit property
`%sync-advance-block-size = <number of bytes>`.

As an additional minor tweak, this also changes the name of what used
to be the `__gap__` profiler to now be called `__sync_advance` because
it's profiling the time spent in skipping data, not just gaps.

Implementation notes:

1. I considered reusing `on %gap()` instead of adding a new hook, but
   it turns out the semantics are different than what I think we want
   here: (1) it doesn't trigger for non-gap data (by definition), so
   couldn't check in on long intervals of searching through payload
   data; and (2) it's called when a gap added to the input, not when
   it's encountered.

2. The new `ASTInfo` inside the Spicy code generator might seem
   overkill for just the one field it currently contains. However, I'm
   planing to use this for other information in the future as well
   (specifically, tracking the need for `&on-heap`).

Closes #3779.
@rsmmr rsmmr force-pushed the topic/robin/gh-3779-stream-stats branch from e575b7c to fe832f2 Compare June 24, 2024 15:40
@rsmmr rsmmr merged commit 3626654 into main Jun 25, 2024
19 checks passed
@rsmmr rsmmr deleted the topic/robin/gh-3779-stream-stats branch June 25, 2024 07:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add tooling or guidance on how a Spicy parser with error recovery should deal with huge packet loss
2 participants