Closes #8
Add a Spark DataSource (experimental). See `examples/spark_datasource.py` (the docstrings contain documentation and a configuration example).
It creates the connection and the Spark connection object/context for you based on the config.
Use the method `get_spark_dataframe()` to get a Spark DataFrame, and `get_dataframe()` to get a Pandas DataFrame. NOTE: only use the Pandas variant if you want the data downloaded locally for non-Spark processing. Stick to the Spark-specific method to be able to use Spark in further processing and/or ML.
Once you have a Spark DataFrame, you usually don't need to deal with the Spark connection/context explicitly. If you do need it, you can access it through this DataSource's `spark` property. If you don't need to query data but do need the connection object, you can create a dummy data source (and never call its `get_(spark_)dataframe()` methods).

Caveats: this being experimental, there is currently no Spark data sink. Let me know if you need one, and I'll take a look at it.
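To make the intended flow concrete, a rough usage sketch (the import path, class name `SparkDataSource`, and config shape here are placeholders, not the actual module layout; see `examples/spark_datasource.py` and its docstrings for the real names and configuration):

```python
# Illustrative sketch only -- identifiers below are assumptions based on
# this PR's description, not the actual API surface.
from mypackage.datasources import SparkDataSource  # hypothetical import path

ds = SparkDataSource(config)  # connection + Spark context are created from the config

# Stay in Spark for further processing and/or ML:
sdf = ds.get_spark_dataframe()

# Only if you want the data downloaded locally for non-Spark processing:
pdf = ds.get_dataframe()

# The underlying Spark connection/context, if you ever need it directly:
spark = ds.spark
```

The same `spark` property is what you would reach for via a dummy data source when you need the connection object without querying any data.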