Add sparseIntersectByKey/hashIntersectByKey (#1354) #1393
val numHashes = BloomFilter.optimalNumHashes(thatNumKeys, width)
val rhsBf = that
  .aggregate(BloomFilterAggregator[K](numHashes, width))
  .asIterableSideInput
Any reason for not using a singleton side input here?
As a singleton, in the case of an empty `that` it'll fail with "Empty PCollection accessed as a singleton view" when we try to access `rhsBf` through the side input context. I based it off of the SparseJoin implementations, which use an IterableSideInput in the same way. wdyt?
Actually, there's now a `def asSingletonSideInput(defaultValue: T): SideInput[T]` that takes a default. Maybe use that for both?
Sounds good! Added a unit test for that case in SparseOuterJoins too. Also added an `exact` boolean to the sparse intersection.
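For readers following the thread, here is a minimal plain-Scala sketch of the empty-input behaviour being discussed. The helper names (`asIterable`, `asSingleton`, `asSingletonWithDefault`) are hypothetical stand-ins, not the Scio or Beam API, and a `Set[String]` stands in for the Bloom filter; the point is only to show why `headOption` or a default value survives an empty `that` while a bare singleton does not.

```scala
object SideInputSketch {
  // Hypothetical models of the three side-input flavours from the thread.
  // These are NOT Scio/Beam classes, only plain-Scala approximations.
  def asIterable[T](values: Seq[T]): Iterable[T] = values

  // A bare singleton view has nothing to return for an empty input.
  def asSingleton[T](values: Seq[T]): T =
    values.headOption.getOrElse(
      throw new NoSuchElementException("Empty PCollection accessed as a singleton view"))

  // A singleton view with a default falls back instead of failing.
  def asSingletonWithDefault[T](values: Seq[T], default: T): T =
    values.headOption.getOrElse(default)

  // An empty right-hand side produces no aggregated filter at all.
  val emptyRhs: Seq[Set[String]] = Seq.empty

  // Iterable view: headOption is None, so every element is filtered out.
  val viaIterable: Boolean =
    asIterable(emptyRhs).headOption.exists(_.contains("a"))

  // Singleton with default: an empty filter also rejects every element.
  val viaDefault: Boolean =
    asSingletonWithDefault(emptyRhs, Set.empty[String]).contains("a")

  // Bare singleton: accessing the view fails.
  val bareSingletonFails: Boolean =
    scala.util.Try(asSingleton(emptyRhs)).isFailure
}
```

Under these assumptions the iterable view and the defaulted singleton behave the same way on an empty input, which is why either works for the intersection.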
.withSideInputs(rhsBf)
.filter { case (e, c) => c(rhsBf).headOption.exists(_.maybeContains(e._1)) }
.toSCollection
.cogroup(that.map((_, ())))
We don't need the `cogroup`+`flatMap` anymore since it's covered by `filter`?
I guess it depends on whether we expect this function to be approximate or exact, since the Bloom filter can have false positives.
Maybe add an `exact: Boolean` option that defaults to false? If the user joins the result with another data set, it defeats the purpose of `cogroup`+`flatMap`.
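To make the approximate-versus-exact trade-off concrete, here is a rough sketch in plain Scala. `ToyBloom` and `intersectByKey` are hypothetical names invented for illustration; the toy filter is not the Algebird `BloomFilter` the PR uses, and plain `Seq`s stand in for `SCollection`s. The second `filter` plays the role of the `cogroup`+`flatMap` step.

```scala
import scala.util.hashing.MurmurHash3

object SparseIntersectSketch {
  // Toy Bloom filter (hypothetical, for illustration only).
  final class ToyBloom(width: Int, numHashes: Int) {
    private val bits = new java.util.BitSet(width)

    // One bit position per hash seed, mapped into [0, width).
    private def slots(key: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        Math.floorMod(MurmurHash3.stringHash(key, seed), width)
      }

    def add(key: String): Unit = slots(key).foreach(i => bits.set(i))

    // True for every added key; may also be true for keys never added.
    def maybeContains(key: String): Boolean = slots(key).forall(i => bits.get(i))
  }

  def intersectByKey[V](lhs: Seq[(String, V)],
                        rhsKeys: Seq[String],
                        exact: Boolean = false,
                        width: Int = 1024,
                        numHashes: Int = 3): Seq[(String, V)] = {
    val bf = new ToyBloom(width, numHashes)
    rhsKeys.foreach(k => bf.add(k))

    // Approximate pass: keeps every real match, may keep false positives.
    val candidates = lhs.filter { case (k, _) => bf.maybeContains(k) }

    if (exact) {
      // Exact pass: removes any false positives the filter let through.
      val rhs = rhsKeys.toSet
      candidates.filter { case (k, _) => rhs.contains(k) }
    } else {
      candidates
    }
  }
}
```

With `exact = false` the result is a superset of the true intersection; a caller who joins the result with another data set afterwards gets the exact filtering from that join anyway.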
}
}

protected def sparseIntersectByKeyImpl(that: SCollection[K],
This can be `private`?
Weirdly, when I make it private it's inaccessible from:
val joined = (thisParts zip thatParts).map {
case (lhs, rhs) =>
lhs.sparseIntersectByKeyImpl(rhs, bfSettings.capacity, computeExact, fpProb)
You're right, probably because it's an implicit class.
val side = combineAsMapSideInput(that.map((_, ())))
self
  .withSideInputs(side)
  .flatMap {
This can be a `filter`?
Good catch! Thanks.
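A small sketch of why the suggestion holds, using plain Scala collections in place of `SCollection` and a `Map[String, Unit]` in place of the map side input (all names here are hypothetical): a `flatMap` that emits either the element itself or nothing is just a `filter`.

```scala
object FilterVsFlatMap {
  // Stand-in for the map side input built from that.map((_, ())).
  val side: Map[String, Unit] = Seq("b", "c").map((_, ())).toMap

  // Stand-in for the keyed left-hand collection.
  val self: Seq[(String, Int)] = Seq("a" -> 1, "b" -> 2, "c" -> 3)

  // flatMap form: emit the pair when the key is present, nothing otherwise.
  val viaFlatMap: Seq[(String, Int)] = self.flatMap { case (k, v) =>
    if (side.contains(k)) Seq((k, v)) else Seq.empty
  }

  // Equivalent filter form: same output, states the intent directly.
  val viaFilter: Seq[(String, Int)] = self.filter { case (k, _) => side.contains(k) }
}
```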
c02031e to 78b0c97
Codecov Report
@@            Coverage Diff            @@
##           master    #1393    +/-   ##
=========================================
+ Coverage   79.24%   79.35%   +0.1%
=========================================
  Files         160      160
  Lines        4899     4925    +26
  Branches      300      340    +40
=========================================
+ Hits         3882     3908    +26
  Misses       1017     1017
Continue to review full report at Codecov.
No description provided.