Error for the weight argument in dplyr::sample_frac #2592
Comments
|
@ajing Thanks for reporting this! Given that the query works on an R data frame (i.e., |
|
@ajing Further to my comment above: I think one reason the dplyr Because dplyr relies on a translation layer that converts data manipulation verbs in R into SQL queries that Apache Spark can process, it is limited by what Spark SQL can actually support. However, while weighted sampling from a Spark dataframe is not directly feasible with dplyr, it could still be supported by a separate helper method in sparklyr. The only distinctions are that such a helper method would be under a different name and would most likely have to create a temporary view with some additional columns, so that weighted random sampling can be expressed as a Spark SQL select query. It could be a fun exercise and could potentially be part of the next sparklyr release. Let me know what you think. |
|
@yl790 Sounds good to me. We need a view with a weight column and a column of random numbers. Will you work on that, or should I look into it? |
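
For readers following along, here is a minimal sketch of how a weight column plus a column of random numbers can drive weighted sampling without replacement. It is not the sparklyr implementation, and nothing in this thread confirms which algorithm the eventual helper uses. It assumes the random-key approach (Efraimidis-Spirakis): give each row the key `u^(1/w)` with `u` uniform on (0, 1], then keep the `k` rows with the largest keys. The function name `weighted_sample_without_replacement` and its parameters are hypothetical, and plain Python stands in for Spark SQL so the sketch runs anywhere.

```python
import math
import random


def weighted_sample_without_replacement(rows, weight_key, k, seed=None):
    """Pick up to k rows; rows with larger weights are more likely to be kept.

    Hypothetical helper for illustration only (not a sparklyr or dplyr API).
    """
    rng = random.Random(seed)
    keyed = []
    for row in rows:
        w = row[weight_key]
        if w <= 0:
            # A row with zero (or negative) weight can never be selected.
            continue
        # u is uniform on (0, 1], so log(u) is always defined.
        u = 1.0 - rng.random()
        # log(u) / w orders rows exactly like u ** (1 / w) but avoids underflow.
        keyed.append((math.log(u) / w, row))
    # Keep the k rows with the largest keys.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in keyed[:k]]


if __name__ == "__main__":
    data = [{"id": i, "weight": float(i)} for i in range(10)]
    print(weighted_sample_without_replacement(data, "weight", 3, seed=42))
```

In SQL terms the same idea amounts to adding a computed key column to a temporary view, ordering by it in descending order, and applying a limit, which is why an extra random-number column is needed alongside the weight column.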
|
@ajing I'll definitely aim to have that as part of sparklyr 1.4 -- It feels like something that should be part of sparklyr asap. |
|
@yl790 Thanks for giving it a high priority, Yitao! What is the expected release date for sparklyr 1.4? I would like to try it. |
|
@ajing It could be 1-2 months away (given that sparklyr 1.3 was released only recently). Meanwhile, instead of waiting for a release, there is always the option of |
|
@yl790 cool, thanks! definitely will try it when you have a working version. |
|
@ajing You can now run
Example usage: |
The weight argument in dplyr::sample_frac does not seem to work. Is there any way to resolve that?