This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

DataFrameReader.load parameters incorrectly expected all to be strings #275

Closed
ghost opened this issue Dec 31, 2019 · 7 comments · Fixed by #363

Comments

@ghost

ghost commented Dec 31, 2019

Using 2.4.0.post6

spark.read.load(folders, inferSchema=True, header=False)

mypy reports "Expected type 'str', got 'bool' instead" for both inferSchema and header.

Looks like the issue is in third_party/3/pyspark/sql/readwriter.pyi line 23, where the definition of load() has **options: str. For csv support this needs to be **options: Optional[Union[bool, str, int]], but to handle the general case it probably needs to be **options: Any.
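
For context, a simplified sketch of the annotation at issue and the kind of loosening described above. The function names and surrounding parameter types here are illustrative, not the exact stub.

from typing import Optional, Union

from pyspark.sql import DataFrame

# current shape, roughly: every keyword option must be a str
def load_current(path: Optional[str] = ..., **options: str) -> DataFrame: ...

# the loosening suggested above for the csv case
def load_loosened(path: Optional[str] = ...,
                  **options: Optional[Union[bool, str, int]]) -> DataFrame: ...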

@zero323
Owner

zero323 commented Dec 31, 2019

This is actually intentional, though I am open to discussion. Please note that the options method assumes the same.

A caveat is that the Scala DataFrame(Reader|Writer).options takes Map[String, String]. To handle that, **options are passed to the options method, which in turn invokes the internal to_str on each item and delegates the job to the option method.
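
For reference, a rough sketch of what that conversion does, as I understand it (paraphrased, not quoted from PySpark; details may differ):

def to_str(value):
    # None is passed through, bools become the lowercase "true"/"false"
    # the JVM side expects, everything else goes through str()
    if value is None:
        return value
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)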

Technically speaking, the actual type bound would be something like Optional[Union[SupportsString, SupportsRepr]] with the following Protocols:

from typing import Protocol  # typing_extensions on Python < 3.8

class SupportsString(Protocol):
    def __str__(self) -> str: ...

class SupportsRepr(Protocol):
    def __repr__(self) -> str: ...

This might matter where either __repr__ or __str__ provides an unusual representation. I admit that is an edge case.
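
A small sketch of that edge case. The names set_option and Verbose are made up for illustration, not taken from the stubs:

from typing import Protocol

class SupportsString(Protocol):
    def __str__(self) -> str: ...

class Verbose:
    def __str__(self) -> str:
        return "<Verbose instance>"  # unusual, non-parseable representation

def set_option(key: str, value: SupportsString) -> None:
    # stand-in that just shows the string that would be sent
    print(f"{key}={value}")

set_option("samplingRatio", 0.5)        # fine: samplingRatio=0.5
set_option("samplingRatio", Verbose())  # type-checks, but the value is unusable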

You could point out that the same options, when passed in unambiguous contexts like csv

def csv(self, path: str, mode: Optional[str] = ..., compression: Optional[str] = ...,
        sep: Optional[str] = ..., quote: Optional[str] = ..., escape: Optional[str] = ...,
        header: Optional[Union[bool, str]] = ..., nullValue: Optional[str] = ...,
        escapeQuotes: Optional[Union[bool, str]] = ..., quoteAll: Optional[Union[bool, str]] = ...,
        dateFormat: Optional[str] = ..., timestampFormat: Optional[str] = ...,
        ignoreLeadingWhiteSpace: Optional[Union[bool, str]] = ...,
        ignoreTrailingWhiteSpace: Optional[Union[bool, str]] = ...,
        charToEscapeQuoteEscaping: Optional[str] = ..., encoding: Optional[str] = ...,
        emptyValue: Optional[str] = ..., lineSep: Optional[str] = ...) -> None: ...

are typed more leniently. That's intentional as well, as these are normally passed to option directly, without additional conversion.
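
A hedged usage sketch of the difference, assuming pyspark is installed; the path is made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.csv("data.csv", header=True)                  # bool accepted by the csv() stub
spark.read.load("data.csv", format="csv", header=True)   # rejected by **options: str
spark.read.load("data.csv", format="csv", header="true") # what the current annotation expects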

@mark-oppenheim

mark-oppenheim commented Dec 31, 2019

@zero323 Given that, as you say:

def load(self, path=None, format=None, schema=None, **options):
    ...
    self.options(**options)
    ...

def options(self, **options):
    ...
    for k in options:
        self._jreader = self._jreader.option(k, to_str(options[k]))

the code clearly expects non-string options to be passable to load().

And, despite the documentation first saying:

options – all other string options

it then gives as an example two non-string options!

df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
        opt1=True, opt2=1, opt3='str')

I would suggest that real usage requires non-string options, and the example backs this up, but I see why you say what you do.

@zero323
Owner

zero323 commented Jan 1, 2020

the code clearly expects non-string options to be passable to load().

We should distinguish between what is accepted (literally everything) and what is valid (a tiny subset of the universe). The path used here is for the convenience of developers, not end users. It is brittle (it depends on an implementation detail that can easily be overridden) and can result in all kinds of undesired outcomes.

It also rejects inputs that would otherwise be perfectly sensible. Let's imagine a made-up class:

import decimal

class BoundedDecimal(decimal.Decimal):
    def __init__(self, v):
        super().__init__()
        assert 0 <= self <= 1

    def __str__(self):
        return f"BoundedDecimal({super().__str__()})"

It should be a valid choice for, let's say, samplingRatio, but it isn't.
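
To make the failure concrete (reusing the BoundedDecimal class defined above, which is assumed to be in scope):

# The value itself is a perfectly sensible sampling ratio, but the string
# that the str()-based conversion would send to the JVM option is not a
# parseable number.
ratio = BoundedDecimal("0.5")
print(str(ratio))  # BoundedDecimal(0.5)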

That's true about almost all interfaces to Java classes, and the general attitude is that a JVM exception is good enough.

In general I am open to discussion about using the same set of known types with known representations (bool, float, int, str) for option, options and load ‒ there is definitely a good case to be made for that resolution.
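
For concreteness, a sketch of what that could look like; the alias names here are made up for illustration, not taken from the stubs:

from typing import Optional, Union

PrimitiveType = Union[bool, float, int, str]
OptionalPrimitiveType = Optional[PrimitiveType]

def option(key: str, value: OptionalPrimitiveType) -> None: ...
def load(path: Optional[str] = ..., **options: OptionalPrimitiveType) -> None: ...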

I am strictly against using Any, as it admits a possibly infinite number of invalid options for no good reason.

I realize that many choices here are more restrictive than the actual implementation ‒ that is intentional most of the time, even if it goes against the general Python attitude that it is easier to ask for forgiveness than permission.

@ghost
Author

ghost commented Jan 2, 2020

OK, makes sense. I generally don't like using Any, and the issue you raise with Java representations makes that case stronger. Your suggestion of types with a known representation should cover all, or at least all common, real use cases, whereas handling only str definitely rules out common (and useful) use cases.

@zero323
Owner

zero323 commented Jan 2, 2020

Your suggestion of types with a known representation should cover all, or at least all common, real use cases, whereas handling only str definitely rules out common (and useful) use cases.

So I guess we could go with the same set of known types with known representations (bool, float, int, str) for option, options and load.

Is it something you'd like to work on?

@ghost
Author

ghost commented Jan 6, 2020

Great. I'll be happy to do so, but will do so via my personal account.

@zero323
Owner

zero323 commented Jan 7, 2020

Sounds good. Let's continue this discussion under #276
