

Add spark support #145

Merged
merged 4 commits into master on Oct 18, 2021

Conversation

schuderer
Owner

@schuderer schuderer commented Jun 22, 2021

Closes #8

Adds a Spark DataSource (experimental). See examples/spark_datasource.py (the docstrings contain documentation and a configuration example).

It will create the connection and the Spark session/context object for you based on the config.

Use the method get_spark_dataframe() to get a Spark DataFrame, and get_dataframe() to get a Pandas DataFrame.

NOTE: Only use the Pandas variant if you want the data downloaded locally for non-Spark processing. Stick to the Spark-specific method if you want to keep using Spark for further processing and/or ML.

Once you get your hands on a Spark DataFrame, you usually don't need to deal with the spark connection/context explicitly. But if you need it, you can access it through this DataSource's spark property. If you don't need to query data, but need the connection object, you can create a dummy data source (and never call its get_(spark_)dataframe() methods).
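The usage described above can be sketched roughly as follows. This is an illustrative mock, not the PR's actual implementation (see examples/spark_datasource.py for that): Spark itself is stubbed out so the snippet runs without a cluster, and any names not mentioned in this PR (the stub classes, the `query` config key) are assumptions.

```python
# Illustrative sketch of the DataSource contract described above -- NOT the
# actual implementation from examples/spark_datasource.py. Spark is stubbed
# out so this runs without a cluster; in real use, `spark` would be a
# pyspark SparkSession and get_spark_dataframe() a pyspark DataFrame.

class _StubSparkSession:
    """Stand-in for a Spark session/context (assumption for illustration)."""
    def sql(self, query):
        # A real session would execute the query on the cluster.
        return _StubSparkDataFrame([{"id": 1}, {"id": 2}])

class _StubSparkDataFrame:
    """Stand-in for a Spark DataFrame (assumption for illustration)."""
    def __init__(self, rows):
        self._rows = rows
    def collect(self):
        # A real Spark DataFrame would pull the data to the driver here.
        return self._rows

class SparkDataSource:
    """Mimics the interface described in the PR: a config-driven data
    source that owns the Spark connection/context for you."""
    def __init__(self, config):
        self.config = config
        # The real DataSource builds the session from the config.
        self._spark = _StubSparkSession()

    @property
    def spark(self):
        # Access the underlying Spark session/context when you need it,
        # e.g. via a dummy data source whose dataframe methods you never call.
        return self._spark

    def get_spark_dataframe(self):
        # Preferred: stay in Spark for further processing and/or ML.
        return self._spark.sql(self.config["query"])

    def get_dataframe(self):
        # Only for downloading the data locally for non-Spark processing.
        return self.get_spark_dataframe().collect()

# Hypothetical usage; the "query" config key is an assumption.
ds = SparkDataSource({"query": "SELECT id FROM my_table"})
sdf = ds.get_spark_dataframe()   # stays in Spark
local_rows = ds.get_dataframe()  # pulled to the driver for local use
```

The point of the two methods is the boundary they mark: everything returned by get_spark_dataframe() keeps running on the cluster, while get_dataframe() is the one deliberate step that moves data to the local process.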

Caveats: This being experimental, there is currently no Spark data sink. Let me know if you need it, and I'll take a look at it.

@schuderer schuderer added the enhancement New feature or request label Sep 7, 2021
@schuderer schuderer merged commit a51a0b7 into master Oct 18, 2021
@schuderer schuderer deleted the add_spark_support branch October 18, 2021 16:37
Successfully merging this pull request may close these issues.

Add pyspark model base class