enhancement(reduce transform): Add additional merge strategies #8559
✔️ Deploy Preview for vector-project canceled. 🔨 Explore the source changes: 67b5013 🔍 Inspect the deploy log: https://app.netlify.com/sites/vector-project/deploys/6131327067085d00072ffd62
Thanks for the contribution! I've left a couple of initial comments, and it also appears to need a `make fmt`.
👍
Nice! Thanks for these additions @dbcfd . I think the added merge strategies make sense.
I left a couple of comments. `make check-clippy` also has some linter complaints as highlighted by CI.
lib/vector-core/src/event/value.rs
Outdated
    }
    Value::Float(v) => {
        if v.is_finite() {
            format!("{}", v).hash(state);
I think it's reasonable to expect that `Hash` for `Value` will be called in hot paths. Ignoring concerns about the hashability of floats, I am opposed to having `format!` here, considering it implies allocation.
impl UniqueMerger {
    fn new(v: Value) -> Self {
        let mut h = HashSet::default();
Small note that you might want to consider parameterizing over a different hash function than default.
Any suggestion on which hash function?
IIRC we already have twox-hash in the project but the exact choice would need to be guided by a benchmark. I'd be satisfied with an issue to follow up on that and you're welcome to assign it to me.
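To illustrate the suggestion in this thread, here is a minimal sketch of a set-backed merger that is generic over the hasher builder. The `UniqueMerger<T, S>` shape, the `add` and `len` methods, and the `FixedState` alias are assumptions made for illustration, not the PR's actual code. `FixedState` uses only the standard library and stands in for whichever faster hasher (for example one from `twox-hash`) a benchmark would favor.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashSet;
use std::hash::{BuildHasher, BuildHasherDefault, Hash};

/// Hypothetical stand-in for the PR's `UniqueMerger`: collects distinct
/// values, generic over the hasher builder `S` so the hash function can be
/// swapped without touching the merge logic.
pub struct UniqueMerger<T, S = RandomState> {
    seen: HashSet<T, S>,
}

impl<T: Hash + Eq, S: BuildHasher + Default> UniqueMerger<T, S> {
    /// Starts the set with the first value observed.
    pub fn new(first: T) -> Self {
        let mut seen = HashSet::with_hasher(S::default());
        seen.insert(first);
        Self { seen }
    }

    /// Records another value; duplicates are ignored.
    pub fn add(&mut self, v: T) {
        self.seen.insert(v);
    }

    /// Number of distinct values seen so far.
    pub fn len(&self) -> usize {
        self.seen.len()
    }
}

/// Placeholder hasher builder from std (SipHash with fixed keys). A
/// benchmark-selected hasher would be substituted here.
pub type FixedState = BuildHasherDefault<DefaultHasher>;
```

With this shape, switching hashers is a one-line change at the type alias, which keeps a follow-up benchmark cheap to run.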
Thanks for the updates @dbcfd ! I left a few last comments, but nothing major. Thanks for this work!
lib/vector-core/src/event/value.rs
Outdated
if v.is_finite() {
    let trunc: u64 = unsafe { std::mem::transmute(v.trunc()) };
    if trunc == 0 {
        v.is_sign_negative().hash(state);
    }
    trunc.hash(state);
Could we add a comment to this match arm with how we are doing the hashing? Something like:
// This hashes floats with the following rules:
// * NaNs hash as equal
// * -0 and +0 hash to different values
// * otherwise transmute to u64 and hash
This one, in particular, makes me think we should have some tests around this hash function. Would you mind adding a few unit tests here?
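As a rough sketch of the three rules listed in the comment above, the function below hashes an `f64` without allocating and without `unsafe`. It is an illustration under assumptions, not the code that was merged: `hash_f64` and `hash_of` are hypothetical names, and unlike the quoted snippet it hashes the full bit pattern rather than a truncated value. Any real `Hash` implementation also has to agree with the type's equality, which this sketch does not address.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Feeds an `f64` into `state` following the rules from the review comment.
pub fn hash_f64<H: Hasher>(v: f64, state: &mut H) {
    if v.is_nan() {
        // Every NaN payload collapses to one marker, so all NaNs hash alike.
        0u8.hash(state);
    } else {
        // `to_bits` is the safe equivalent of transmuting to `u64`. The sign
        // bit is part of the pattern, so -0.0 and +0.0 hash differently, as
        // do negative and positive infinity.
        1u8.hash(state);
        v.to_bits().hash(state);
    }
}

/// Hypothetical test helper: hashes a single float with std's `DefaultHasher`.
pub fn hash_of(v: f64) -> u64 {
    let mut h = DefaultHasher::new();
    hash_f64(v, &mut h);
    h.finish()
}
```

Unit tests in the spirit of the request could then compare `hash_of` results for NaN, signed zeros, and ordinary values.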
Cargo.toml
Outdated
nom = { version = "6.1.2", default-features = false, optional = true }
#https://github.com/wyyerd/pulsar-rs/issues/167
nom = { version="=7.0.0-alpha1", default-features = false, features=["alloc"], optional = true}
#nom = { version = "6.1.2", default-features = false, optional = true }
I'm a little surprised this brought in a `nom` update. Did you mean to?
Also, is the `indexmap` change needed? I do know we have another PR that pins it too due to the circular dependency, but it doesn't seem like this PR does?
I can probably get rid of it by reverting the lock and not running cargo update.
@jszwedko Should be good for review again
Thanks for the updates @dbcfd! I left a couple last comments.
lib/vector-core/src/event/value.rs
Outdated
assert_ne!(hash(Value::Boolean(true)), hash(Value::Integer(2)));
assert_eq!(hash(Value::Float(1.2)), hash(Value::Float(1.4)));
assert_ne!(hash(Value::Float(-0.0)), hash(Value::Float(0.0)));
assert_eq!(hash(Value::Float(f64::NEG_INFINITY)), hash(Value::Float(f64::INFINITY)));
I neglected infinity 😄 . I think maybe we should have these hash to different values?
longest: "Retains the longest array seen"
shortest: "Retains the shortest array seen"
This may be a bit pedantic, but could we call these `longest_array` and `shortest_array`? I could also see people thinking they could be used with strings otherwise.
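The naming concern can be made concrete with a small sketch of a merger that retains the longest array seen and rejects anything that is not an array. The `Value` enum here is a cut-down stand-in for Vector's type, and `LongestArrayMerger`, `add`, and `into_value` are hypothetical names chosen for illustration. Returning an error for non-array input follows the commit titled "Error for shortest/longest when not array", but the exact behavior in the PR may differ.

```rust
/// Minimal stand-in for Vector's `Value`; only the variants the sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Integer(i64),
    Bytes(String),
    Array(Vec<Value>),
}

/// Hypothetical merger that retains the longest array seen.
#[derive(Debug)]
pub struct LongestArrayMerger {
    longest: Vec<Value>,
}

impl LongestArrayMerger {
    /// Starts with the first array observed.
    pub fn new(first: Vec<Value>) -> Self {
        Self { longest: first }
    }

    /// Keeps `v` if it is a strictly longer array than the current one.
    /// Non-array values (strings included) are rejected with an error.
    pub fn add(&mut self, v: Value) -> Result<(), String> {
        match v {
            Value::Array(a) => {
                if a.len() > self.longest.len() {
                    self.longest = a;
                }
                Ok(())
            }
            other => Err(format!("expected array, got {:?}", other)),
        }
    }

    /// Consumes the merger and returns the retained array.
    pub fn into_value(self) -> Value {
        Value::Array(self.longest)
    }
}
```

A string passed to `add` returns an error rather than being compared by length, which is the behavior the `_array` suffix is meant to signal.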
@dbcfd also, could you satisfy the DCO check here? https://github.com/timberio/vector/pull/8559/checks?check_run_id=3362512094 It looks like some of the commits are unsigned. There's more info about this requirement here: https://github.com/timberio/vector/blob/master/docs/CONTRIBUTING.md#dco
3b824e7 to 6a57a5c
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
6ee94a7 to 42a8693
Awesome, thanks for making these final changes. This looks good to me!
There's one clippy ding that needs resolved and/or ignored but I'm happy with this.
@blt There's 2 clippy dings, one for deriving `PartialEq`, and one for the transmute. I'll do the transmute one, but should I also implement `PartialEq` (seems weird since it's derivable), or just do an allow on it?
I think we should implement |
Oh, I missed one and only caught the |
Head branch was pushed to by a user without write access
This looks good to me, thanks for this contribution @dbcfd !
@dbcfd sorry, I realized the DCO check is failing again 😅 . Do you mind signing the unsigned commits?
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
6bbb2d5 to e846eee
@jszwedko signed now :D Maybe someday I'll remember to do that.
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
@dbcfd yeah, that appears to be a false positive due to |
* Add additional merge strategies: longest, shortest, retain, unique
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Fix tests, fmt
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Fix tests, adjust float hashing
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Dependency fixes
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Error for shortest/longest when not array
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Revert dependency changes
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Review changes
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Revert nom change
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Revert build pin
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Rename strategy to indicate it is for array, handle infinity properly
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* fmt
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Fix clippy warnings by implementing PartialEq and using bits
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
* Another clippy fix
  Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Adds additional merge strategies: longest, shortest, retain, unique.
This issue provides additional support to reduce in lieu of #8036 and #4258 being closed.