Records per partition #2791
Comments
---
@sbottelli Good question! At the moment you can run a short snippet that counts the number of rows in each partition.
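The original snippet is elided above. A minimal sketch of one common way to do this in sparklyr (an assumption on my part, not necessarily the exact code from the thread): `spark_apply()` runs an R function once per partition, so returning `nrow()` yields each partition's record count. The local connection and `mtcars` data here are purely illustrative.

```r
library(sparklyr)
library(dplyr)

# Illustrative local connection; replace with your own cluster config
sc <- spark_connect(master = "local")

# Copy a sample dataframe into Spark and split it into 4 partitions
sdf <- sdf_copy_to(sc, mtcars, repartition = 4L)

# spark_apply() invokes the function once per partition, so nrow()
# returns the number of records in each partition
sdf %>%
  spark_apply(function(df) nrow(df)) %>%
  collect()

spark_disconnect(sc)
```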
---
Also note that if you just want to know whether some partitioning strategy will produce skew, it probably makes sense to apply that strategy to a subset of the data first and then run the same per-partition count on the result.
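A sketch of that subset-first idea, under the same assumptions as above (illustrative connection and data; `sdf_sample()` and `sdf_repartition()` are the standard sparklyr helpers for sampling and repartitioning, though the thread's exact code is elided):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, mtcars)

# Try the partitioning strategy on a 20% sample first
sample_sdf <- sdf %>%
  sdf_sample(fraction = 0.2, replacement = FALSE, seed = 42) %>%
  sdf_repartition(partitions = 4L, partition_by = "cyl")

# Count rows per partition on the sample to check for skew
sample_sdf %>%
  spark_apply(function(df) nrow(df)) %>%
  collect()

spark_disconnect(sc)
```

If the sample already shows heavily unbalanced counts, the full dataset will almost certainly be skewed under the same strategy.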
---
Thanks @yitao-li! I will test it!
---
@sbottelli I think it may also be a good idea to first select only the subset of column(s) you are partitioning on into a smaller dataframe -- otherwise Spark might start computing all the other columns, which are irrelevant to the distribution of rows across partitions (I could be wrong about this, but it never hurts to guard against unnecessary computation).
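The column-pruning suggestion above can be sketched as follows (again an illustrative assumption -- local connection, `mtcars`, and `cyl` as the hypothetical partitioning column):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, mtcars)

# Keep only the partitioning column so Spark avoids computing
# unrelated columns while we inspect the row distribution
sdf %>%
  select(cyl) %>%
  sdf_repartition(partitions = 4L, partition_by = "cyl") %>%
  spark_apply(function(df) nrow(df)) %>%
  collect()

spark_disconnect(sc)
```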
---
Yes, I partition the table using an ID column and then calculate the distribution of rows in each partition to examine skew. Thanks for the code, the tricks, and the clear answer!
---
@yitao-li this is one of the hackiest uses of that function I've come across!
Hi!
I have a question: how can I see the number of records per partition?
I want to know how the records are distributed, in order to spot skew in the data and eventually adjust it with some workarounds (to improve performance).
Thanks for all the hard work you do to maintain this outstanding package!
Regards
Stefano