Sparse left and right outer joins #1386

Merged
2 commits merged into spotify:master on Sep 19, 2018

@anish749
Member

commented Sep 19, 2018

A version of sparse left and right outer joins, adapted from the original sparse outer join implementation.

In some of our pipelines we have a number of right outer joins where the RHS of the join is around 5% of the LHS, but the RHS still doesn't fit in memory. For now, we extract the keys from the RHS as a side input, which barely fits in memory, and use them to filter the LHS before doing the right outer join. Using a Bloom filter (BF) with sparse joins would help us scale, which we will need to do very soon, but that requires a right outer join variant. So I thought of adding the right and left versions.
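
To illustrate the idea, here is a minimal sketch of a BF-filtered right outer join using plain Scala collections rather than Scio. `ToyBloomFilter` and `SparseRightOuterJoinSketch` are hypothetical names made up for this example and are not part of this PR or the Scio API; the bit-array size and hash count are arbitrary assumptions.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Toy Bloom filter over String keys (hypothetical, for illustration only).
final class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new BitSet(numBits)

  // One bit position per hash seed; floorMod keeps the index non-negative.
  private def positions(key: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      Math.floorMod(MurmurHash3.stringHash(key, seed), numBits)
    }

  def add(key: String): Unit = positions(key).foreach(i => bits.set(i))

  // May return true for a key that was never added (false positive),
  // but never returns false for a key that was added.
  def mightContain(key: String): Boolean = positions(key).forall(i => bits.get(i))
}

object SparseRightOuterJoinSketch {
  def sparseRightOuterJoin[V, W](
    lhs: Seq[(String, V)],
    rhs: Seq[(String, W)]
  ): Seq[(String, (Option[V], W))] = {
    // Build the filter from the (small) RHS keys.
    val bf = new ToyBloomFilter(numBits = 1 << 16, numHashes = 3)
    rhs.foreach { case (k, _) => bf.add(k) }

    // Drop LHS rows whose key is definitely absent from the RHS.
    val lhsOverlap = lhs.filter { case (k, _) => bf.mightContain(k) }
    val lhsByKey: Map[String, Seq[V]] =
      lhsOverlap.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

    // Ordinary right outer join on what is left; false positives that
    // slipped through the filter simply find no exact match here.
    rhs.flatMap { case (k, w) =>
      lhsByKey.get(k) match {
        case Some(vs) => vs.map(v => (k, (Some(v): Option[V], w)))
        case None     => Seq((k, (Option.empty[V], w)))
      }
    }
  }
}
```

The filter only shrinks the LHS before the join. Because the join itself still matches on exact keys, false positives cost some extra work but do not change the result.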

Also, can we have sparse multi-joins?
The use case is a small (but not tiny) dataset that is left-joined with multiple very large right-hand sides.

splitSelfUsing(that, thatNumKeys, fpProb).map {
  case (lhsUnique, lhsOverlap, rhs) =>
    val unique =
      lhsUnique.map(kv => (kv._1, (kv._2, Option.empty[W])))(Coder.gen[(K, (V, Option[W]))])
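
For context, the unique/overlap split in the snippet above can be sketched with plain Scala collections. This is not the Scio implementation: `SparseLeftOuterJoinSketch` is a hypothetical name, and the `mightContain` parameter is an assumed stand-in for the Bloom filter built from the right-hand side keys.

```scala
object SparseLeftOuterJoinSketch {
  def sparseLeftOuterJoin[K, V, W](
    lhs: Seq[(K, V)],
    rhs: Seq[(K, W)],
    mightContain: K => Boolean // stand-in for a Bloom filter over RHS keys
  ): Seq[(K, (V, Option[W]))] = {
    // Keys the filter rejects are definitely not in the RHS.
    val (lhsOverlap, lhsUnique) = lhs.partition { case (k, _) => mightContain(k) }

    // Unique rows need no join at all: emit them with an empty right side.
    val unique = lhsUnique.map { case (k, v) => (k, (v, Option.empty[W])) }

    // Only the (possibly) overlapping rows go through a real left outer join.
    val rhsByKey: Map[K, Seq[W]] =
      rhs.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
    val overlap = lhsOverlap.flatMap { case (k, v) =>
      rhsByKey.get(k) match {
        case Some(ws) => ws.map(w => (k, (v, Some(w): Option[W])))
        case None     => Seq((k, (v, Option.empty[W]))) // filter false positive
      }
    }

    unique ++ overlap
  }
}
```

A false positive from the filter only sends a row through the join unnecessarily; it still comes out with `None` on the right, the same as a row in `lhsUnique`.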

anish749 Sep 19, 2018

Author Member

I was running into diverging implicit expansion errors here and had to pass Coder.gen[(K, (V, Option[W]))] explicitly.
Can this be done in a better way?

jto Sep 19, 2018

Member

Probably just a case of Scalac being an idiot. I'll have a look at it.

jto Sep 19, 2018

Member

Fixed it in #1389 @anish749

anish749 Sep 19, 2018

Author Member

Awesome..!! I'll rebase after that is merged :)

regadas Sep 19, 2018

Contributor

@anish749 can you rebase this? 😄

@codecov-io

commented Sep 19, 2018

Codecov Report

Merging #1386 into master will decrease coverage by 10.9%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #1386       +/-   ##
===========================================
- Coverage   83.42%   72.52%   -10.91%     
===========================================
  Files         161      161               
  Lines        4627     4633        +6     
  Branches      368      303       -65     
===========================================
- Hits         3860     3360      -500     
- Misses        767     1273      +506

Impacted Files | Coverage Δ |
...ify/scio/values/PairHashSCollectionFunctions.scala | 97.82% <ø> (ø) | ⬆️
...in/scala/com/spotify/scio/values/SCollection.scala | 91.62% <100%> (+0.12%) | ⬆️
...spotify/scio/values/PairSCollectionFunctions.scala | 94.11% <100%> (+0.25%) | ⬆️
.../spotify/scio/bigquery/BigQueryPartitionUtil.scala | 0% <0%> (-100%) | ⬇️
...scala/com/spotify/scio/bigquery/MockBigQuery.scala | 0% <0%> (-94.74%) | ⬇️
...la/com/spotify/scio/bigquery/dynamic/package.scala | 0% <0%> (-94.12%) | ⬇️
...potify/scio/runners/dataflow/DataflowContext.scala | 0% <0%> (-92.86%) | ⬇️
...n/scala/com/spotify/scio/bigtable/TableAdmin.scala | 0% <0%> (-80%) | ⬇️
...cio/bigquery/dynamic/DynamicDestinationsUtil.scala | 0% <0%> (-80%) | ⬇️
...ala/com/spotify/scio/bigquery/BigQueryClient.scala | 9.32% <0%> (-71.73%) | ⬇️
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9053e3...09f5f9f.

@nevillelyh nevillelyh merged commit a3982ca into spotify:master Sep 19, 2018

4 checks passed

ci/circleci: build_211 Your tests passed on CircleCI!
ci/circleci: build_212 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 83.42%)
codecov/project Absolute coverage decreased by -10.9% but relative coverage increased by +16.57% compared to c9053e3

nevillelyh added a commit that referenced this pull request Sep 19, 2018

Sparse left and right outer joins (#1386)
* Sparse left and right outer joins

* scala format