
Add pyspark model base class #8

Closed
schuderer opened this issue Jun 16, 2019 · 1 comment · Fixed by #145
Labels: enhancement (New feature or request)

Comments

schuderer (Owner) commented Jun 16, 2019

Using the plugin system from #7, add support for pyspark. Possibly very little user-model-related work is necessary, except maybe a convenience base class to deal with the spark context.

@schuderer schuderer added the enhancement New feature or request label Jun 16, 2019
@schuderer schuderer modified the milestone: Plugin System Jun 16, 2019
@schuderer schuderer added this to To do in Prioritized User Issues via automation May 17, 2021
schuderer (Owner, Author) commented:

Actually, I found that it makes much more sense to just use the (currently experimental) Spark DataSource (#145). It creates the connection and the spark object/context for you based on the config. Furthermore, once you have a spark dataframe in hand (which the Spark DataSource provides through get_spark_dataframe()), you usually don't need to handle the spark context explicitly any more anyway (unless you want to store spark DFs; ideally, a data sink should be created for that purpose. Let me know if you need this, and I'll take a look at it).
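For illustration, a minimal sketch of what this could look like from the model code's side. Only get_spark_dataframe() is taken from the description above; the datasource name, the config keys, and the exact ModelMakerInterface signature are assumptions:

```python
# Hypothetical config excerpt (the keys under the datasource are assumptions
# about the experimental Spark DataSource):
#
# datasources:
#   my_spark_data:
#     type: spark
#     query: SELECT col_a, col_b FROM some_table

import mllaunchpad


class MyModelMaker(mllaunchpad.ModelMakerInterface):
    def create_trained_model(self, model_conf, data_sources, data_sinks, old_model=None):
        # The Spark DataSource sets up the connection and spark context from
        # the config; the model code only asks it for the dataframe.
        df = data_sources["my_spark_data"].get_spark_dataframe()
        # From here on it is plain pyspark, with no explicit context handling.
        counts = df.groupBy("col_a").count()
        model = counts.collect()  # placeholder for real training logic
        return model
```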

So, in essence, handling the spark context yourself runs against the gist of what ML Launchpad tries to achieve: getting the I/O out of your hair as much as possible. That's what the Spark DataSource should provide.

If you absolutely need a spark context (and are not querying a spark dataframe), you can configure a dummy spark datasource (e.g. with the query 'select 1 as dummy'). That SQL query does not even get executed unless you call get_spark_dataframe() or get_dataframe(), but you still have access to the connection through data_source.spark.
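A sketch of that workaround; the query string and the .spark attribute come from the paragraph above, while the datasource name, the type value, and the function signature are hypothetical:

```python
# Hypothetical config excerpt:
#
# datasources:
#   spark_context_only:
#     type: spark
#     query: select 1 as dummy

def train_with_spark_context(model_conf, data_sources, data_sinks):
    ds = data_sources["spark_context_only"]
    # The dummy query is never executed as long as we don't call
    # get_spark_dataframe() or get_dataframe(); we only borrow the session.
    spark = ds.spark
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.show()
```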

Prioritized User Issues automation moved this from To do to Done Oct 18, 2021