Closes #8
Add a Spark DataSource (experimental). See `examples/spark_datasource.py` (the docstrings contain documentation and a configuration example).
It creates the connection and the Spark connection object/context for you based on the config.
Use the method `get_spark_dataframe()` to get a Spark DataFrame, and `get_dataframe()` to get a Pandas DataFrame. NOTE: only use the Pandas variant if you want the data downloaded locally for non-Spark processing. Stick to the Spark-specific method to be able to use Spark in further processing and/or ML.
Once you have a Spark DataFrame, you usually don't need to deal with the Spark connection/context explicitly. If you do need it, you can access it through this DataSource's `spark` property. If you don't need to query data but do need the connection object, you can create a dummy data source (and never call its `get_(spark_)dataframe()` methods).

Caveats: this being experimental, there is currently no Spark data sink. Let me know if you need one, and I'll take a look at it.
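To make the intended flow concrete, a rough usage sketch (the import path, class name `SparkDataSource`, and config shape here are placeholders, not the actual module layout; see `examples/spark_datasource.py` and its docstrings for the real names and configuration):

```python
# Illustrative sketch only -- identifiers below are assumptions based on
# this PR's description, not the actual API surface.
from mypackage.datasources import SparkDataSource  # hypothetical import path

ds = SparkDataSource(config)  # connection + Spark context are created from the config

# Stay in Spark for further processing and/or ML:
sdf = ds.get_spark_dataframe()

# Only if you want the data downloaded locally for non-Spark processing:
pdf = ds.get_dataframe()

# The underlying Spark connection/context, if you ever need it directly:
spark = ds.spark
```

The same `spark` property is what you would reach for via a dummy data source when you need the connection object without querying any data.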