feat(vrl): Adds `chunks` function #13794
Conversation
Soak Test Results (Baseline: e5c435f)

Explanation: A soak test is an integrated performance test for Vector in a repeatable rig, with varying configurations for Vector. What follows is a statistical summary of a brief Vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether Vector performance is changed, and to what degree, by a pull request. Where appropriate, units are scaled per core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their throughput performance between the baseline and comparison SHAs, with 90.0% confidence, or that have been detected as newly erratic. Negative values mean that the baseline is faster, positive that the comparison is. Results that do not exhibit more than a ±8.87% change in mean throughput are discarded. An experiment is erratic if its coefficient of variation is greater than 0.3. The abbreviated table will be omitted if no interesting changes are observed.

Changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%:
Fine details of change detection per experiment.
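The erraticism criterion described above (coefficient of variation greater than 0.3) can be sketched as follows. This is an illustrative reimplementation, not Vector's actual soak-harness code; the function names and sample data are made up.

```rust
// Sketch of the erraticism check described above: an experiment is
// flagged erratic when its coefficient of variation (stddev / mean)
// exceeds 0.3.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Population variance, then stddev, normalized by the mean.
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

fn is_erratic(samples: &[f64]) -> bool {
    coefficient_of_variation(samples) > 0.3
}

fn main() {
    // A steady throughput series is not erratic; a wildly varying one is.
    assert!(!is_erratic(&[100.0, 101.0, 99.0]));
    assert!(is_erratic(&[10.0, 100.0]));
}
```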
Force-pushed from 00ef727 to 4dbdd76.
Soak Test Results (Baseline: 3fd0754): no interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%. (Explanation as in the first soak test comment above.) Fine details of change detection per experiment.
Looks like I just had to rebase.
That would indeed be the way to do it, but it's surprising that building Vector itself didn't already get you all of the dependencies needed to build the

@tobz seems like it's working now, whatever it was! Maybe I just had a connectivity issue. Being able to run the tests helps a lot ^_^
Force-pushed from 2dded92 to 84d348c.
Soak Test Results (Baseline: 4b3fe0a): no interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%. (Explanation as in the first soak test comment above.) Fine details of change detection per experiment.
Thanks for this contribution @briankung!

It prompted some discussion on the team about the inconsistencies in how VRL deals with bytes and UTF-8 strings (some of which are described in https://github.com/vectordotdev/vector/issues/12369). We think the best plan is to follow up with an RFC document to lay out a direction in the future, but, in the absence of that, we think the simplest thing to do for this function would be to have `chunks` operate on the number of bytes without any specific UTF-8 handling (that is, remove the `utf8` flag). Then we can follow up later, once we have defined some guidelines for how functions in VRL deal with UTF-8 strings and bytes, to make sure all VRL functions handle them in a consistent fashion.

Other than that, what you have here generally looks good to me. I left one comment below about adding benchmarks.
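The byte-only behavior suggested here is essentially what Rust's `slice::chunks` already provides. A minimal sketch (not the PR's actual implementation; `chunk_bytes` is a hypothetical name):

```rust
// Chunk purely by byte count, with no UTF-8 boundary handling:
// a multi-byte character may be split across two chunks.
fn chunk_bytes(input: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    input
        .chunks(chunk_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // "héllo" is 6 bytes in UTF-8 ('é' is 2 bytes), so a chunk size of 2
    // yields three chunks, and 'é' may straddle a chunk boundary.
    let chunks = chunk_bytes("héllo".as_bytes(), 2);
    assert_eq!(chunks.len(), 3);
}
```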
Soak Test Results (Baseline: 44cfd90): changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87% were detected. (Explanation as in the first soak test comment above.) Fine details of change detection per experiment.
Soak Test Results (Baseline: 33625e7): no interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%. (Explanation as in the first soak test comment above.) Fine details of change detection per experiment.
I think this is ready for final review. I have two minor questions regarding testing:
A few small things / merge conflicts. Just need to resolve those and fix some of the CI issues and this should be good to go.
- Check that the `chunk_size` is at least 1 byte
- Split out the chunk loop from the unicode code point backtracking so that the bounds check is not performed on every loop. Also labeled the loops because it's prettier that way.
- Remove redundant vectorization of iterator
- Change error messages to be less confusing

Still looking for ways to reorganize this code to be a little easier on the eyes.
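The two-phase structure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code; `chunk_utf8` is a hypothetical name:

```rust
// Outer loop advances in chunk_size steps; an inner backtracking step
// retreats to the nearest UTF-8 char boundary so no code point is split.
fn chunk_utf8(input: &str, chunk_size: usize) -> Vec<&str> {
    assert!(chunk_size >= 1, "chunk_size must be at least 1 byte");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < input.len() {
        let mut end = (start + chunk_size).min(input.len());
        // Backtrack (at most 3 bytes, since UTF-8 code points are <= 4
        // bytes) until we land on a char boundary.
        while !input.is_char_boundary(end) {
            end -= 1;
        }
        if end == start {
            // A single code point is larger than chunk_size; emit it whole
            // rather than looping forever.
            end = start + 1;
            while !input.is_char_boundary(end) {
                end += 1;
            }
        }
        chunks.push(&input[start..end]);
        start = end;
    }
    chunks
}

fn main() {
    // 'é' is 2 bytes, so a 2-byte chunk starting at 'h' backtracks to 1 byte.
    assert_eq!(chunk_utf8("héllo", 2), vec!["h", "é", "ll", "o"]);
}
```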
Force-pushed from ce0e9b6 to 71ad389.
Rebased and updated!
Soak Test Results (Baseline: 90c99a6): no interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%. (Explanation as in the first soak test comment above.) Fine details of change detection per experiment.
I can't tell why the cue docs are failing in CI. Locally, I get:
Soak Test Results (Baseline: 27663f3): no interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%. (Explanation as in the first soak test comment above.) Fine details of change detection per experiment.
@briankung Thanks for the contribution!
Thanks for the multiple rounds of reviews! Excited to see this in a new release.
Issue: #13329

This adds a `chunks` method (`chunks` rather than `chunk`, following the naming convention of slice's `chunks` method). It chunks data based on a given `chunk_size` in bytes. The result is an array of byte chunks, each up to `chunk_size` in length.

👇 The below was true for the initial PR, which attempted to respect unicode code point boundaries. After this comment that is no longer the aim of this PR.

It attempts to respect unicode code point boundaries, but makes no guarantees about grapheme cluster boundaries: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

A quick overview: valid UTF-8 code points are up to 4 bytes in length, the length of which is denoted by the first few bits of the first byte (see https://en.wikipedia.org/wiki/UTF-8#Encoding). However, user-perceived characters can consist of several UTF-8 code points (see the above link on grapheme cluster boundaries). There is no simple algorithm for splitting text on grapheme cluster boundaries because grapheme clusters are of arbitrary lengths. Therefore, the algorithm used here only attempts to split text on code point boundaries, which are at most 4 bytes in length, and not on grapheme cluster boundaries.

Additional subtasks that I would like to complete:
- `chunk_size` `fn examples` tests
- Consider refactoring the main loop
- `utf8` configuration flag to treat it as utf or not (default: true)
- Add tests for data size (i.e. check that a 1MB limit actually does limit it to 1MB chunks). I don't want to add 1MB+ to the repo, that would be silly.
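The byte-length facts cited in the description can be confirmed with a few lines of Rust. This example is illustrative only, and not part of the PR:

```rust
// UTF-8 encodes a code point in 1-4 bytes, and user-perceived characters
// (grapheme clusters) may span several code points.
fn main() {
    assert_eq!('a'.len_utf8(), 1);  // ASCII: 1 byte
    assert_eq!('é'.len_utf8(), 2);  // Latin-1 supplement: 2 bytes
    assert_eq!('漢'.len_utf8(), 3); // CJK: 3 bytes
    assert_eq!('🦀'.len_utf8(), 4); // emoji: 4 bytes

    // A grapheme cluster of two code points: 'e' + combining acute accent.
    // Splitting on code point boundaries could separate them, which is why
    // the PR makes no guarantees about grapheme clusters.
    let combined = "e\u{0301}";
    assert_eq!(combined.chars().count(), 2);
    assert_eq!(combined.len(), 3); // 1 byte + 2 bytes
}
```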