Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FromDDL rework #158

Closed
wants to merge 4 commits into from
Closed

FromDDL rework #158

wants to merge 4 commits into from

Conversation

svaningelgem
Copy link
Contributor

@svaningelgem svaningelgem commented Feb 27, 2021

I basically took the tests from spark itself, and wrote a lex-parser around it (as I couldn't find an easy way to extract the one from scala).
It supports every test that was in the scala program, including:

  • basic datatypes
  • maps
  • structs
  • arrays
  • weird names (like a+b)
  • comments
  • reserved names-names (like int: int -- int is both the name of the column, and the datatype! -- But "int int" doesn't work the same way, that errors out.)

@tools4origins : could you check if this can be integrated into your PR because I believe my implementation is more feature complete?

StructField, StructType
)
# DataType
from pysparkling.sql.utils import ParseException
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructType does not expose fromDDL in Python, (maybe we should not either), but the one in Scala does not show the behavior expected in these tests:

scala> StructType.fromDDL("some_str string, some_int integer, some_date date")
res2: org.apache.spark.sql.types.StructType = StructType(StructField(some_str,StringType,true), StructField(some_int,IntegerType,true), StructField(some_date,DateType,true))
scala> StructType.fromDDL("int")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '<EOF>'

(Spark 3)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple types of syntaxes are supported, fromDDL only handles one of them in Spark:
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L821-L841

Maybe we want not to rely on StructType.fromDDL to do this parsing?

Copy link
Contributor Author

@svaningelgem svaningelgem Feb 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is wrongly called. We shouldn't call this fromDDL. This is actually _parse_datatype_string. And it's supported by Spark ;-). This is mainly used when you create a dataframe with a string as the field description.

Just check the implementation in pyspark, it supports multiple ways and I do agree that we should drop fromDDL (as it's not in pyspark at all).
I saw you posted the link to the sources... Yep, that's the one. That's the one I implemented with the lexer.

How should I proceed here?



def test_void():
assert parser.parse("void") == NullType()
Copy link
Collaborator

@tools4origins tools4origins Feb 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this from? It fails in pyspark:

>>> spark.createDataFrame([], "id void")
[...[
pyspark.sql.utils.ParseException: 
DataType void is not supported.(line 1, pos 3)

== SQL ==
id void
---^^^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* Added @SInCE and @until decorators. This is very useful when going to implement methods in pysparkling.
Example: "weekday" exists since 2.4.0.

This way you can decorate the method/class with @SInCE('2.4.0').

In your pysparkling program you can say `pysparkling.config.spark_version = 2.3.2`. And if you would be using SOMEWHERE "weekday" in the code, it would fire the `NotSupportedByThisSparkVersion` exception. Thus correctly failing your pysparkling program.

* Improved tests.

* Fix import order.
@svaningelgem svaningelgem deleted the fromDDL_rework branch March 5, 2021 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants