Registering functions for use with SparkSQL in pyspark #9
Comments
@colinglaes we don't use pyspark, so it's best to look for insights from the Spark codebase or from people with pyspark + Scala experience. @MrPowers, you have exposed functions to pyspark. Are any of them native ones?
Of course, you can always go with Python wrappers for SparkSQL expressions, as I outlined in #4 (comment).
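A minimal sketch of what such a Python wrapper could look like, assuming the native functions have already been registered. The helper names (`hll_sql`, `hll_init`) and the single-argument signature are assumptions for illustration, not the project's actual API:

```python
def hll_sql(fn, *args):
    """Render a SparkSQL call such as 'hll_init(id)' as a string.
    Pure string building, so it needs no SparkSession."""
    return f"{fn}({', '.join(str(a) for a in args)})"

def hll_init(col_name):
    """Hypothetical wrapper returning a pyspark Column for spark-alchemy's
    hll_init (assumes the native functions were registered first)."""
    # Imported lazily so the string helper above works without pyspark installed.
    from pyspark.sql import functions as F
    return F.expr(hll_sql("hll_init", col_name))
```

On a cluster with the functions registered, `df.select(hll_init("id"))` would then behave like the raw `expr("hll_init(id)")` form.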
Thanks for the quick response @ssimeonov. I tried registering the functions from pyspark, but ran into a UDF registration error.
We run on Databricks, where it's easy to have a notebook with both a Scala and a Python context: hence the easy workaround of registering in Scala and then using the SparkSQL versions from Python. The UDF registration error makes sense, as these are Spark-native functions, not UDFs. Basically, this isn't a UDF registration problem.
@colinglaes While struggling with the same problem, I found a solution by creating a wrapper around the JVM registration call.
@djo10 thank you. You've cracked the code on the exact incantation necessary to use in Python, but I am not sure why the wrapper is needed at all... Why not simply: `sc._jvm.com.swoop.alchemy.spark.expressions.hll.HLLFunctionRegistration.registerFunctions(spark._jsparkSession)` One can then do something like: `spark.range(5).toDF("id").select(expr("hll_cardinality(hll_init(id)) as cd"))` And, of course, higher-level wrappers around these calls are easy to add.
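The one-liner from this thread can be packaged as a small helper so the py4j internals (`_jvm`, `_jsparkSession`) appear in only one place. The function name is my own; the registration call itself is the one quoted above, and it requires the spark-alchemy JAR (and its dependencies) on the classpath:

```python
def register_hll_functions(spark):
    """Register spark-alchemy's native HLL functions with a SparkSession
    via py4j, using the call quoted in this thread. Requires the
    spark-alchemy JAR and its dependencies on the driver classpath."""
    jvm = spark.sparkContext._jvm
    jvm.com.swoop.alchemy.spark.expressions.hll.HLLFunctionRegistration \
        .registerFunctions(spark._jsparkSession)
```

After `register_hll_functions(spark)` runs, expressions such as `hll_cardinality(hll_init(id))` become available both in `spark.sql(...)` statements and via `pyspark.sql.functions.expr`.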
Yeah, but I had to add another JAR to make it work (the Java HyperLogLog implementation), plus some configuration. Finally, I think the solution is here; it's just a matter of choice between providing three JARs or one (wrapper) JAR.
As for dependency JARs, they are always needed unless you produce a fat JAR.
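One common way to avoid shipping the dependency JARs by hand is to let Spark resolve them from Maven via `spark.jars.packages`. A sketch, assuming the artifact coordinates below (check the project README for the correct group, Scala suffix, and version):

```python
def build_hll_session(packages="com.swoop:spark-alchemy_2.12:1.2.1"):
    """Hypothetical helper: build a SparkSession that pulls spark-alchemy
    and its transitive dependencies from Maven via spark.jars.packages,
    so no fat JAR is needed. Coordinates are illustrative."""
    # Imported lazily so defining this helper does not require pyspark.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .appName("hll-demo")
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )
```

The same coordinates can also be passed on the command line with `--packages` when launching `pyspark` or `spark-submit`.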
Thanks @djo10 & @ssimeonov, using the direct registration call worked for me.
* Bump Spark version to 3.2.0
* Bump alchemy version for dependency change
* Code updates for changes to Spark APIs
I'm interested in using your HLL implementation in a pyspark project, but I'm having trouble figuring out how to properly register the functions for use in SQL execution. I'm unsure how I'd execute
`com.swoop.alchemy.spark.expressions.hll.HLLFunctionRegistration.registerFunctions(spark)`
in Scala from pyspark (from the documentation under "Using HLL functions" > "From SparkSQL"). I've tried the following without any luck.