
Records per partition #2791

Closed
sbottelli opened this issue Nov 10, 2020 · 6 comments · Fixed by #2806

Comments

@sbottelli commented Nov 10, 2020

Hi!
I have a question: how can I see the number of records in each partition?
I want to know how the records are distributed so I can detect skew in the data and, if needed, apply some workarounds to improve performance.
Thanks for the hard work you put into maintaining this outstanding package!
Regards,
Stefano

@yitao-li (Contributor) commented Nov 10, 2020

@sbottelli Good question!

At the moment you can run spark_apply(sdf, function(x) nrow(x)) to get the partition sizes (probably not very efficient, but it gets the job done). Example:

library(sparklyr)
sc <- spark_connect(master = "local")

sdf <- sdf_len(sc, 100, repartition = 10L)
print(spark_apply(sdf, function(x) nrow(x)))  # implementation-dependent, but most likely 10 partitions of 10 rows each

@yitao-li (Contributor) commented Nov 10, 2020

Also note that if you just want to know whether a given partitioning strategy will produce skew, it probably makes sense to apply that strategy to a subset of the data first, and then run the spark_apply(...) from my previous comment on that subset. spark_apply can be slow when run on a large number of rows.
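A minimal sketch of that subset-first check, assuming a local connection; sdf_sample and sdf_repartition here stand in for whatever sampling and partitioning strategy you are actually testing:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Sample a small fraction of the data first, then apply the candidate
# partitioning, and count rows per partition with spark_apply().
sample_sdf <- sdf_len(sc, 1e5) %>%
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 42) %>%
  sdf_repartition(partitions = 10L)

spark_apply(sample_sdf, nrow)
```

Running the check on ~1% of the rows should reveal the same relative skew pattern at a fraction of the cost.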

@sbottelli (Author) commented Nov 11, 2020

Thanks @yitao-li ! I will test it!

@yitao-li (Contributor) commented Nov 11, 2020

@sbottelli I think it may also be a good idea to first select only the column(s) you are partitioning by into a smaller dataframe -- otherwise it looks like Spark might start computing all the other columns, which are irrelevant to how rows are distributed across partitions. (I could be wrong about this, but it never hurts to guard against unnecessary computation.)
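A hedged sketch of that idea; the column name id and the toy data frame are hypothetical, so substitute your actual partitioning column:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf <- copy_to(sc, data.frame(id = sample(1:5, 100, replace = TRUE),
                              payload = rnorm(100)))

# Select only the partitioning column before repartitioning, so Spark
# does not need to compute the unrelated columns.
sdf %>%
  select(id) %>%
  sdf_repartition(partitions = 4L, partition_by = "id") %>%
  spark_apply(nrow)
```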

@sbottelli (Author) commented Nov 11, 2020

Yes, I partition the table by an ID column and then calculate the distribution of rows in each partition to examine skew.

Thanks for the code, the tricks, and the clear answer!

@benmwhite commented Nov 11, 2020

spark_apply(sdf, function(x) nrow(x))

@yitao-li this is one of the hackiest uses of spark_apply() I've ever seen, I love it
