Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement(reduce transform): Add additional merge strategies #8559

Merged
merged 14 commits into from Sep 2, 2021

Conversation

dbcfd
Copy link
Contributor

@dbcfd dbcfd commented Aug 2, 2021

Adds additional merge strategies: longest, shortest, retain, unique.

This issue provides additional support to reduce in lieu of #8036 and #4258 being closed.

@netlify
Copy link

netlify bot commented Aug 2, 2021

✔️ Deploy Preview for vector-project canceled.

🔨 Explore the source changes: 67b5013

🔍 Inspect the deploy log: https://app.netlify.com/sites/vector-project/deploys/6131327067085d00072ffd62

@dbcfd dbcfd mentioned this pull request Aug 2, 2021
3 tasks
@spencergilbert spencergilbert changed the title enhancement (reduce): Add additional merge strategies enhancement(reduce transform): Add additional merge strategies Aug 3, 2021
@spencergilbert spencergilbert added the transform: reduce Anything `reduce` transform related label Aug 3, 2021
Copy link
Contributor

@spencergilbert spencergilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I've left a couple initial comments, and it also appears to need a make fmt 👍

src/transforms/reduce/merge_strategy.rs Outdated Show resolved Hide resolved
src/transforms/reduce/merge_strategy.rs Outdated Show resolved Hide resolved
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for these additions @dbcfd . I think the added merge strategies make sense.

I left a couple of comments. make check-clippy also has some linter complaints as highlighted by CI.

src/transforms/reduce/merge_strategy.rs Show resolved Hide resolved
src/transforms/reduce/merge_strategy.rs Show resolved Hide resolved
lib/vector-core/src/event/value.rs Outdated Show resolved Hide resolved
lib/vector-core/src/event/value.rs Show resolved Hide resolved
lib/vector-core/src/event/value.rs Outdated Show resolved Hide resolved
}
Value::Float(v) => {
if v.is_finite() {
format!("{}", v).hash(state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to expect that Hash for Value will be called in hot paths. Ignoring concerns about the hashability of floats I am opposed to having format! here, considering it implies allocation.


impl UniqueMerger {
fn new(v: Value) -> Self {
let mut h = HashSet::default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small note that you might want to consider parameterizing over a different hash function than default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestion on which hash function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we already have twox-hash in the project but the exact choice would need to be guided by a benchmark. I'd be satisfied with an issue to follow up on that and you're welcome to assign it to me.

@dbcfd
Copy link
Contributor Author

dbcfd commented Aug 17, 2021

@jszwedko @blt I've adjusted the hashing functions, let me know if that looks correct. The dependency issues are also pulled out into a single commit to be pulled to another PR if need be.

Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @dbcfd ! I left a few last comments, but nothing major. Thanks for this work!

Comment on lines 42 to 50
if v.is_finite() {
let trunc: u64 = unsafe { std::mem::transmute(v.trunc()) };
if trunc == 0 {
v.is_sign_negative().hash(state);
}
trunc.hash(state);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment to this match arm with how we are doing the hashing? Something like:

// This hashes floats with the following rules:
// * NaNs hash as equal
// * -0 and +0 hash to different values
// * otherwise transmute to u64 and hash

This one, in particular, makes me think we should have some tests around this hash function. Would you mind adding a few unit tests here?

src/transforms/reduce/merge_strategy.rs Show resolved Hide resolved
Cargo.toml Outdated
nom = { version = "6.1.2", default-features = false, optional = true }
#https://github.com/wyyerd/pulsar-rs/issues/167
nom = { version="=7.0.0-alpha1", default-features = false, features=["alloc"], optional = true}
#nom = { version = "6.1.2", default-features = false, optional = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little surprised this brought in a nom update. Did you mean to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is the indexmap change needed? I do know we have another PR that pins it too due to the circular dependency, but it doesn't seem like this PR does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably get rid of it by reverting the lock and not running cargo update.

@dbcfd
Copy link
Contributor Author

dbcfd commented Aug 18, 2021

@jszwedko Should be good fore review again

Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @dbcfd! I left a couple last comments.

assert_ne!(hash(Value::Boolean(true)), hash(Value::Integer(2)));
assert_eq!(hash(Value::Float(1.2)), hash(Value::Float(1.4)));
assert_ne!(hash(Value::Float(-0.0)), hash(Value::Float(0.0)));
assert_eq!(hash(Value::Float(f64::NEG_INFINITY)), hash(Value::Float(f64::INFINITY)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I neglected infinity 😄 . I think maybe we should have these hash to different values?

Comment on lines 123 to 124
longest: "Retains the longest array seen"
shortest: "Retains the shortest array seen"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a bit pedantic, but could we call these longest_array and shortest_array? I could also see people thinking they could be used with strings otherwise.

@jszwedko
Copy link
Member

@dbcfd also, could you satisfy the DCO check here? https://github.com/timberio/vector/pull/8559/checks?check_run_id=3362512094 It looks like some of the commits are unsigned. There's more info about this requirement here: https://github.com/timberio/vector/blob/master/docs/CONTRIBUTING.md#dco

@dbcfd dbcfd force-pushed the more-merge-strategies branch 3 times, most recently from 3b824e7 to 6a57a5c Compare August 18, 2021 16:44
@dbcfd
Copy link
Contributor Author

dbcfd commented Aug 18, 2021

@jszwedko Fixing DCO went a little weird, but it should be fixed now. Most recent commit for review feedback was 6a57a5c. Did a final fmt after that.

Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for making these final changes. This looks good to me!

@jszwedko jszwedko enabled auto-merge (squash) August 18, 2021 19:55
Copy link
Contributor

@blt blt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one clippy ding that needs resolved and/or ignored but I'm happy with this.

@dbcfd
Copy link
Contributor Author

dbcfd commented Aug 18, 2021

@blt There's 2 clippy dings, one for deriving PartialEq, and one for the transmute. I'll do the transmute one, but should I also implement PartialEq (seems weird since it's derivable), or just do an allow on it?

@jszwedko
Copy link
Member

@blt There's 2 clippy dings, one for deriving PartialEq, and one for the transmute. I'll do the transmute one, but should I also implement PartialEq (seems weird since it's derivable), or just do an allow on it?

I think we should implement PartialEq here too. I agree with clippy that it would be confusing if they don't match. Things that hash to the same value are not considered "partially equal" (admitting there may be hash collisions).

@blt
Copy link
Contributor

blt commented Aug 18, 2021

@blt There's 2 clippy dings, one for deriving PartialEq, and one for the transmute. I'll do the transmute one, but should I also implement PartialEq (seems weird since it's derivable), or just do an allow on it?

Oh, I missed one and only caught the PartialEq. The ding is this one https://rust-lang.github.io/rust-clippy/v0.0.212/#derive_hash_xor_eq. It's probably wise to go ahead and implement PartialEq since someone could accidentally break k1 == k2 -> hash(k1) == hash(k2) without being aware of it, if there's not an explicit implementation to look at.

auto-merge was automatically disabled August 24, 2021 16:41

Head branch was pushed to by a user without write access

@dbcfd
Copy link
Contributor Author

dbcfd commented Aug 24, 2021

@jszwedko @blt Implemented partial eq which should match the hashing logic. Mind taking a look again?

Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks for this contribution @dbcfd !

@jszwedko
Copy link
Member

@dbcfd sorry, I realized the DCO check is failing again 😅 . Do you mind signing the unsigned commits?

Signed-off-by: dbcfd <bdbrowning2@gmail.com>
@dbcfd
Copy link
Contributor Author

dbcfd commented Aug 24, 2021

@jszwedko signed now :D Maybe someday I'll remember to that.

Signed-off-by: dbcfd <bdbrowning2@gmail.com>
@jszwedko jszwedko enabled auto-merge (squash) August 25, 2021 00:12
@dbcfd
Copy link
Contributor Author

dbcfd commented Sep 2, 2021

@jszwedko @blt There's another clippy error, but it looks to be spurious (mutable key type). Is this good to set an ignore on?

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
@jszwedko
Copy link
Member

jszwedko commented Sep 2, 2021

@dbcfd yeah, that appears to be a false positive due to Value::Bytes(bytes::Bytes). I pushed a commit adding the clippy allow. Thanks again for this work! It should get merged soon when CI passes.

@jszwedko jszwedko merged commit 0f95401 into vectordotdev:master Sep 2, 2021
jdrouet pushed a commit that referenced this pull request Sep 6, 2021
* Add additional merge strategies: longest, shortest, retain, unique

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Fix tests, fmt

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Fix tests, adjust float hashing

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Dependency fixes

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Error for shortest/longest when not array

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Revert dependency changes

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Review changes

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Revert nom change

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Revert build pin

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Rename strategy to indicate it is for array, handle infinity properly

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* fmt

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Fix clippy warnings by implementing PartialEq and using bits

Signed-off-by: dbcfd <bdbrowning2@gmail.com>

* Another clippy fix

Signed-off-by: dbcfd <bdbrowning2@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
transform: reduce Anything `reduce` transform related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants