Allow reading multiple files with `spark_read_` #2118

Merged: 4 commits, Aug 30, 2019

@jozefhajnala (Contributor) commented Aug 23, 2019

Some of the org.apache.spark.sql.DataFrameReader methods accept paths: String* in addition to path: String. We can use this to let the spark_read_ suite of functions accept multiple paths rather than a single one, so that several files can be read in one call (inspired in part by an SO question).
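
As a rough sketch of what this could look like from the R side, assuming the change proposed here is applied: `path` becomes a character vector rather than a single string. The helper name `read_parquet_parts` and the file locations in the usage comment are made up for illustration, and the call needs a live Spark connection `sc`.

```r
# Hypothetical helper: read several Parquet locations into one Spark DataFrame.
# Assumes sparklyr with this PR applied, so that `path` may have length > 1.
read_parquet_parts <- function(sc, paths, name = "parts") {
  stopifnot(is.character(paths), length(paths) >= 1)
  # All paths are handed over in a single call instead of one read per file.
  sparklyr::spark_read_parquet(sc, name = name, path = paths)
}

# Usage (requires a running connection, e.g.
# sc <- sparklyr::spark_connect(master = "local")):
# events <- read_parquet_parts(sc, c("data/2019-07.parquet", "data/2019-08.parquet"))
```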

This PR proposes to add this support to:

  • spark_read_parquet()
  • spark_read_json()
  • spark_read_text() - with whole=FALSE only; it stops with a meaningful error message if multiple paths are provided with whole=TRUE
  • spark_read_orc()

Unit tests are also added for the above reading functions.
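
The whole=TRUE restriction for spark_read_text() could be enforced with a check along these lines. This is a simplified sketch and not the code in this PR: the function name `check_text_paths` and the error wording are placeholders.

```r
# Hypothetical validation helper; name and message are illustrative only.
check_text_paths <- function(path, whole = FALSE) {
  if (!is.character(path) || length(path) == 0) {
    stop("`path` must be a non-empty character vector")
  }
  # Multiple paths only make sense when files are read line by line.
  if (isTRUE(whole) && length(path) > 1) {
    stop("Multiple paths are only supported with `whole = FALSE`; got ",
         length(path), " paths with `whole = TRUE`")
  }
  invisible(path)
}

check_text_paths(c("a.txt", "b.txt"))      # passes
check_text_paths("a.txt", whole = TRUE)    # passes
# check_text_paths(c("a.txt", "b.txt"), whole = TRUE) would stop with an error
```

Failing early in R keeps the error readable, instead of surfacing a JVM exception from the reader.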

Compatibility notes on DataFrameReader supporting paths: String*:

  • .parquet - since 1.4.0
  • .json - since 2.0.0
  • .text - since 1.6.0
  • .orc - since 2.0.0
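
These minimum versions can be written down as a small lookup, for example to skip a multi-path test on an older Spark. A sketch only: `multi_path_since` and `supports_multi_path` are hypothetical names, and the versions are copied from the list above.

```r
# Minimum Spark version in which each DataFrameReader method takes paths: String*
# (values taken from the compatibility notes in this PR description).
multi_path_since <- c(parquet = "1.4.0", json = "2.0.0", text = "1.6.0", orc = "2.0.0")

# Hypothetical helper: TRUE if `spark_version` is at least the listed minimum.
supports_multi_path <- function(format, spark_version) {
  if (!format %in% names(multi_path_since)) {
    stop("Unknown format: ", format)
  }
  numeric_version(spark_version) >= numeric_version(multi_path_since[[format]])
}

supports_multi_path("parquet", "1.6.3")  # TRUE
supports_multi_path("json", "1.6.3")     # FALSE
supports_multi_path("orc", "2.4.3")      # TRUE
```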

R/data_interface.R: review comment (outdated, resolved)

@javierluraschi (Member) commented Aug 23, 2019

@jozefhajnala this is pretty great! Thank you! Left a few comments; if you are low on time, let us know and we can iterate on your PR.

@kevinykuo (Collaborator) commented Aug 23, 2019

@jozefhajnala do you mind adding tests for reading multiple files with globbing expressions to make sure we don't break those?

@jozefhajnala (Contributor, Author) commented Aug 25, 2019

> @jozefhajnala do you mind adding tests for reading multiple files with globbing expressions to make sure we don't break those?

Added some in 7b0ccce.

@javierluraschi (Member) commented Aug 30, 2019

Thanks @jozefhajnala!

@javierluraschi merged commit 5a419fe into rstudio:master on Aug 30, 2019

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed