This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

DataFrameReader.load parameters incorrectly expected all to be strings #275

Closed
ghost opened this issue Dec 31, 2019 · 7 comments · Fixed by #363

Comments

@ghost

ghost commented Dec 31, 2019

Using 2.4.0.post6

spark.read.load(folders, inferSchema=True, header=False)

mypy reports "Expected type 'str', got 'bool' instead" for both inferSchema and header.

Looks like the issue is in third_party/3/pyspark/sql/readwriter.pyi line 23, where the definition of load() has **options: str. For csv support this needs to be **options: Optional[Union[bool, str, int]], but to handle the general case it probably needs to be **options: Any.
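
For context, a simplified sketch of the annotation at issue and the kind of loosening described above. The function names and surrounding parameter types here are illustrative, not the exact stub.

from typing import Optional, Union

from pyspark.sql import DataFrame

# current shape, roughly: every keyword option must be a str
def load_current(path: Optional[str] = ..., **options: str) -> DataFrame: ...

# the loosening suggested above for the csv case
def load_loosened(path: Optional[str] = ...,
                  **options: Optional[Union[bool, str, int]]) -> DataFrame: ...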

@zero323
Owner

zero323 commented Dec 31, 2019

This is actually intentional, though I am open to discussion. Please note that the options method assumes the same.

A caveat is that the Scala DataFrame(Reader|Writer).options takes Map[String, String]. To handle that, **options are passed to the options method, which in turn invokes the internal to_str on each item and delegates the job to the option method.
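
For reference, a rough sketch of what that conversion does, as I understand it (paraphrased, not quoted from PySpark; details may differ):

def to_str(value):
    # None is passed through, bools become the lowercase "true"/"false"
    # the JVM side expects, everything else goes through str()
    if value is None:
        return value
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)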

Technically speaking, the actual type bound would be something like Optional[Union[SupportsString, SupportsRepr]] with the following Protocols:

from typing import Protocol  # typing_extensions on Python < 3.8

class SupportsString(Protocol):
    def __str__(self) -> str: ...

class SupportsRepr(Protocol):
    def __repr__(self) -> str: ...

This might matter where either __repr__ or __str__ provides an unusual representation. I admit that is an edge case.
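
A small sketch of that edge case. The names set_option and Verbose are made up for illustration, not taken from the stubs:

from typing import Protocol

class SupportsString(Protocol):
    def __str__(self) -> str: ...

class Verbose:
    def __str__(self) -> str:
        return "<Verbose instance>"  # unusual, non-parseable representation

def set_option(key: str, value: SupportsString) -> None:
    # stand-in that just shows the string that would be sent
    print(f"{key}={value}")

set_option("samplingRatio", 0.5)        # fine: samplingRatio=0.5
set_option("samplingRatio", Verbose())  # type-checks, but the value is unusable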

You could point out that the same options, when passed in unambiguous contexts like csv

def csv(self, path: str, mode: Optional[str] = ..., compression: Optional[str] = ...,
        sep: Optional[str] = ..., quote: Optional[str] = ..., escape: Optional[str] = ...,
        header: Optional[Union[bool, str]] = ..., nullValue: Optional[str] = ...,
        escapeQuotes: Optional[Union[bool, str]] = ..., quoteAll: Optional[Union[bool, str]] = ...,
        dateFormat: Optional[str] = ..., timestampFormat: Optional[str] = ...,
        ignoreLeadingWhiteSpace: Optional[Union[bool, str]] = ...,
        ignoreTrailingWhiteSpace: Optional[Union[bool, str]] = ...,
        charToEscapeQuoteEscaping: Optional[str] = ..., encoding: Optional[str] = ...,
        emptyValue: Optional[str] = ..., lineSep: Optional[str] = ...) -> None: ...

are typed more leniently. That's intentional as well, as these are normally passed to option directly, without additional conversion.
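
A hedged usage sketch of the difference, assuming pyspark is installed; the path is made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.csv("data.csv", header=True)                  # bool accepted by the csv() stub
spark.read.load("data.csv", format="csv", header=True)   # rejected by **options: str
spark.read.load("data.csv", format="csv", header="true") # what the current annotation expects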

@mark-oppenheim

mark-oppenheim commented Dec 31, 2019

@zero323 Given that, as you say:

def load(self, path=None, format=None, schema=None, **options):
    ...
    self.options(**options)
    ...

def options(self, **options):
    ...
    for k in options:
        self._jreader = self._jreader.option(k, to_str(options[k]))

the code clearly expects non-string options to be passable to load().

And, despite the documentation first saying:

options – all other string options

it then gives as an example two non-string options!

df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
        opt1=True, opt2=1, opt3='str')

I would suggest that real usage requires non-string options, and the example backs this up, but I see why you say what you do.

@zero323
Owner

zero323 commented Jan 1, 2020

the code clearly expects non-string options to be passable to load().

We should distinguish between what is accepted (literally everything) and what is valid (a tiny subset of the universe). The path used here is for the convenience of developers, not end users. It is brittle (it depends on an implementation detail that can easily be overridden) and can result in all kinds of undesired outcomes.

It also rejects inputs that would otherwise be perfectly sensible. Let's imagine a made-up class:

import decimal

class BoundedDecimal(decimal.Decimal):
    def __init__(self, v):
        super().__init__()
        assert 0 <= self <= 1

    def __str__(self):
        return f"BoundedDecimal({super().__str__()})"

It should be a valid choice for, let's say, samplingRatio, but it isn't.
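
To make the failure concrete (reusing the BoundedDecimal class defined above, which is assumed to be in scope):

# The value itself is a perfectly sensible sampling ratio, but the string
# that the str()-based conversion would send to the JVM option is not a
# parseable number.
ratio = BoundedDecimal("0.5")
print(str(ratio))  # BoundedDecimal(0.5)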

That's true about almost all interfaces to Java classes, and the general attitude is that a JVM exception is good enough.

In general I am open to discussion about using the same set of known types with known representations (bool, float, int, str) for option, options and load ‒ there is definitely a good case to be made for that resolution.
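
For concreteness, a sketch of what that could look like; the alias names here are made up for illustration, not taken from the stubs:

from typing import Optional, Union

PrimitiveType = Union[bool, float, int, str]
OptionalPrimitiveType = Optional[PrimitiveType]

def option(key: str, value: OptionalPrimitiveType) -> None: ...
def load(path: Optional[str] = ..., **options: OptionalPrimitiveType) -> None: ...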

I am strictly against using Any, as it admits a possibly infinite number of invalid options for no good reason.

I realize that many choices here are more restrictive than the actual implementation ‒ that is intentional most of the time, even if it goes against the general Python attitude that it is easier to ask for forgiveness than permission.

@ghost
Author

ghost commented Jan 2, 2020

OK, makes sense. I generally don't like using Any, and the issue you raise with Java representations makes that case stronger. Your suggestion of types with a known representation should cover all, or at least all common, real use cases, whereas handling only str definitely rules out common (and useful) use cases.

@zero323
Owner

zero323 commented Jan 2, 2020

Your suggestion of types with a known representation should cover all, or at least all common, real use cases, whereas handling only str definitely rules out common (and useful) use cases.

So I guess we could go with the same set of known types with known representations (bool, float, int, str) for option, options and load.

Is it something you'd like to work on?

@ghost
Author

ghost commented Jan 6, 2020

Great. I'll be happy to do so, but will do so via my personal account.

@zero323
Owner

zero323 commented Jan 7, 2020

Sounds good. Let's continue this discussion under #276
