Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse lookups #1398

Merged
merged 7 commits into from Oct 1, 2018

Conversation

@anish749
Copy link
Member

commented Sep 20, 2018

Tried to add some more sparse join functions which we wanted to use.
I mentioned this in #1397

The idea here is this being opposite of hashLookUp. that would be considerably larger and this is also large enough to not fit in memory.

The left side here ends up being smaller than the right side. Is this way of naming a good idea?

Our use case is mostly joining smaller feature specific datasets keyed on some id with larger company wide datasets which are also keyed on the same id. Also we mostly end up joining with at least 2 company wide datasets in our pipelines.

anish749 added some commits Sep 20, 2018

Merge branch 'master_new' into sparse-lookups
# Conflicts:
#	scio-core/src/main/scala/com/spotify/scio/values/PairSCollectionFunctions.scala
#	scio-test/src/test/scala/com/spotify/scio/values/PairSCollectionFunctionsTest.scala
@codecov-io

This comment has been minimized.

Copy link

commented Sep 20, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@b6d4fd7). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1398   +/-   ##
=========================================
  Coverage          ?   69.19%           
=========================================
  Files             ?      160           
  Lines             ?     4964           
  Branches          ?      312           
=========================================
  Hits              ?     3435           
  Misses            ?     1529           
  Partials          ?        0
Impacted Files Coverage Δ
...spotify/scio/values/PairSCollectionFunctions.scala 95.97% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6d4fd7...5262cdb. Read the comment docs.

Show resolved Hide resolved ...rc/test/scala/com/spotify/scio/values/PairSCollectionFunctionsTest.scala
voder: Coder[V]): SCollection[(K, (V, Iterable[A], Iterable[B]))] =
sparseLookup(that1, that2, thisNumKeys, 0.01)

private[values] def getOptimalKeysBloomFiltersAsSideInputs(thisNumKeys: Long, fpProb: Double)(

This comment has been minimized.

Copy link
@regadas

regadas Sep 21, 2018

Contributor

@anish749 just optimalKeysBloomFiltersAsSideInputs?

This comment has been minimized.

Copy link
@anish749

anish749 Sep 21, 2018

Author Member

sure, this sounds good. I'll make the change..

@regadas

This comment has been minimized.

Copy link
Contributor

commented Sep 21, 2018

@anish749 Thx for the PR 👍

@anish749

This comment has been minimized.

Copy link
Member Author

commented Sep 21, 2018

@regadas Thanks for the review :) I've made the changes!!!

@anish749

This comment has been minimized.

Copy link
Member Author

commented Sep 26, 2018

@regadas any thoughts? I was wondering if these should be sparseLookupByKey or have it as a function as input to sparseLookupBy(...) WDYT?

@regadas

This comment has been minimized.

Copy link
Contributor

commented Sep 28, 2018

@anish749 I think it's fine as is 😄

val thatParts = that.partition(n, _._1.hashCode() % n)

SCollection.unionAll(
(thisParts zip selfBfSideInputs zip thatParts).map {

This comment has been minimized.

Copy link
@regadas

regadas Sep 28, 2018

Contributor

@anish749 sorry I missed this! this kind of notation is discouraged.

This comment has been minimized.

Copy link
@anish749

anish749 Oct 4, 2018

Author Member

i'll keep this in mind 👍 Was away the last few days :)

@regadas

regadas approved these changes Oct 1, 2018

@regadas regadas merged commit e0a865e into spotify:master Oct 1, 2018

4 checks passed

ci/circleci: build_211 Your tests passed on CircleCI!
Details
ci/circleci: build_212 Your tests passed on CircleCI!
Details
codecov/patch No report found to compare against
Details
codecov/project No report found to compare against
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.