FromDDL rework #158
Conversation
    StructField, StructType
)
# DataType
from pysparkling.sql.utils import ParseException
StructType does not expose fromDDL in Python (maybe we should not expose it either), and the Scala version does not show the behavior expected by these tests:
scala> StructType.fromDDL("some_str string, some_int integer, some_date date")
res2: org.apache.spark.sql.types.StructType = StructType(StructField(some_str,StringType,true), StructField(some_int,IntegerType,true), StructField(some_date,DateType,true))
scala> StructType.fromDDL("int")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '<EOF>'
(Spark 3)
Multiple syntaxes are supported; fromDDL only handles one of them in Spark:
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L821-L841
Maybe we should not rely on StructType.fromDDL for this parsing?
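To make the "multiple syntaxes" point concrete, here is a toy sketch (all names and the type table are assumptions, not pyspark's actual code) of the fallback order that pyspark's `_parse_datatype_string` documents: first try to parse the string as a field list, and only then as a bare data type.

```python
# Toy resolver standing in for Spark's real DDL parser; the type names
# returned here are plain strings rather than pysparkling DataType objects.
ATOMIC = {"int": "IntegerType", "integer": "IntegerType",
          "string": "StringType", "date": "DateType"}

def parse_ddl_datatype(s):
    """Parse a single bare data type like 'int' (toy version)."""
    name = s.strip().lower()
    if name not in ATOMIC:
        raise ValueError("mismatched input: %r" % s)
    return ATOMIC[name]

def parse_ddl_schema(s):
    """Parse a field list like 'a int, b string' (toy version)."""
    fields = []
    for part in s.split(","):
        tokens = part.split()
        if len(tokens) != 2:
            raise ValueError("cannot parse field: %r" % part)
        name, typ = tokens
        fields.append((name, parse_ddl_datatype(typ)))
    return fields

def parse_datatype_string(s):
    # Same fallback order the pyspark docstring describes:
    # try a schema (field list) first, then a bare data type.
    try:
        return parse_ddl_schema(s)
    except ValueError:
        return parse_ddl_datatype(s)
```

Under this fallback order, `"a int, b string"` parses as a schema while a bare `"int"` still resolves as a data type, which is exactly the case `StructType.fromDDL` rejects in the Scala snippet above.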
I think this method is wrongly named. We shouldn't call it fromDDL: it is actually _parse_datatype_string. And it's supported by Spark ;-). It is mainly used when you create a dataframe with a string as the field description.
Just check the implementation in pyspark: it supports multiple syntaxes, and I do agree that we should drop fromDDL (as it's not in pyspark at all).
I saw you posted the link to the sources... Yep, that's the one. That's the one I implemented with the lexer.
How should I proceed here?
def test_void():
    assert parser.parse("void") == NullType()
Where is this from? It fails in pyspark:
>>> spark.createDataFrame([], "id void")
[...]
pyspark.sql.utils.ParseException:
DataType void is not supported.(line 1, pos 3)
== SQL ==
id void
---^^^
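The divergence under discussion can be shown with a toy sketch (these classes and the `allow_void` flag are illustrative assumptions, not the PR's or pyspark's actual API): the PR's parser maps `"void"` to a null type, while pyspark's parser at the time rejected it with a ParseException.

```python
class NullType:
    """Toy stand-in for pysparkling's NullType."""
    def __eq__(self, other):
        return isinstance(other, NullType)

class ParseException(Exception):
    """Toy stand-in for pyspark.sql.utils.ParseException."""

KNOWN = {"int", "string", "date"}

def parse_type(name, allow_void=False):
    # allow_void=True mimics the PR's behavior; allow_void=False mimics
    # the pyspark error quoted above.
    name = name.lower()
    if name == "void":
        if allow_void:
            return NullType()
        raise ParseException("DataType void is not supported.")
    if name in KNOWN:
        return name
    raise ParseException("mismatched input %r" % name)
```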
* Added @since and @until decorators. This is very useful when implementing methods in pysparkling. Example: "weekday" exists since 2.4.0, so you can decorate the method/class with @since('2.4.0'). In your pysparkling program you can set `pysparkling.config.spark_version = 2.3.2`, and if "weekday" were used anywhere in the code, it would raise the `NotSupportedByThisSparkVersion` exception, thus correctly failing your pysparkling program.
* Improved tests.
* Fixed import order.
I basically took the tests from spark itself, and wrote a lex-parser around it (as I couldn't find an easy way to extract the one from scala).
It supports every test that was in the Scala program, including:
- `a+b`
- `int: int` (here `int` is both the name of the column and the datatype! But "int int" doesn't work the same way; that errors out.)

@tools4origins: could you check if this can be integrated into your PR? I believe my implementation is more feature-complete.
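A minimal lexer sketch (entirely assumed: this is not the PR's actual implementation) can illustrate why `int: int` is parseable while `int int` is not: the colon makes the name/type boundary explicit, whereas without it a leading type keyword is ambiguous.

```python
import re

# One token at a time: a colon, a comma, or a word. The word class
# includes '+' so that a column named 'a+b' tokenizes as one word.
TOKEN_RE = re.compile(r"\s*(?:(?P<colon>:)|(?P<comma>,)|(?P<word>[A-Za-z_][A-Za-z0-9_+]*))")

TYPE_KEYWORDS = {"int", "string", "date"}

def tokenize(s):
    pos, tokens = 0, []
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if not m or m.end() == pos:
            raise ValueError("cannot tokenize at %d: %r" % (pos, s[pos:]))
        pos = m.end()
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
    return tokens

def parse_fields(s):
    """Parse 'name type' or 'name: type' fields separated by commas."""
    tokens = tokenize(s)
    fields, i = [], 0
    while i < len(tokens):
        kind, name = tokens[i]
        if kind != "word":
            raise ValueError("expected a column name")
        i += 1
        if i < len(tokens) and tokens[i][0] == "colon":
            # The colon disambiguates: "int: int" is a column named int.
            i += 1
        elif i < len(tokens) and tokens[i][0] == "word" and name.lower() in TYPE_KEYWORDS:
            # No colon and the first word is itself a type keyword:
            # "int int" is ambiguous, so error out.
            raise ValueError("ambiguous field: %r" % name)
        if i >= len(tokens) or tokens[i][0] != "word":
            raise ValueError("expected a type after %r" % name)
        fields.append((name, tokens[i][1]))
        i += 1
        if i < len(tokens):
            if tokens[i][0] != "comma":
                raise ValueError("expected ','")
            i += 1
    return fields
```

This is only a sketch of the design choice (an explicit separator resolves the name/type ambiguity), not a full DDL grammar: complex types like `struct<...>` and backtick-quoted names are out of scope here.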