DataFrameReader.load parameters incorrectly expected all to be strings #275
Comments
This is actually intentional, though I am open to discussion. Please note that [...] A caveat is that Scala [...] Technically speaking, the actual type bound would be something like

    from typing import Protocol

    class SupportsString(Protocol):
        def __str__(self) -> str: ...

    class SupportsRepr(Protocol):
        def __repr__(self) -> str: ...

This might matter where either [...] You could point out that the same options passed in unambiguous contexts, like [...], are more lenient. That's intentional as well, as these are normally passed to [...]
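As a rough illustration of why a `__str__`-based bound is so permissive, here is a minimal sketch. It assumes Python 3.8+ (for `typing.Protocol` and `runtime_checkable`); the `samples` list and the `all_match` name are made up for this example and are not part of the stubs.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsString(Protocol):
    def __str__(self) -> str: ...


# `object` itself defines __str__, so every value satisfies the protocol
# structurally, including values that make no sense as reader options.
samples = [True, 1, "str", None, object(), [1, 2]]
all_match = all(isinstance(v, SupportsString) for v in samples)
print(all_match)
```

In other words, such a bound documents the conversion mechanism but rules almost nothing out.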
@zero323 Given that, as you say:

    def load(self, path=None, format=None, schema=None, **options):
        ...
        self.options(**options)
        ...

    def options(self, **options):
        ...
        for k in options:
            self._jreader = self._jreader.option(k, to_str(options[k]))

the code clearly expects non-string options to be passable to [...] And, despite the documentation first saying: [...] it then gives as an example two non-string options!

    df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
                                           opt1=True, opt2=1, opt3='str')

I would suggest that real usage requires non-string options, and the example backs this up, but I see why you say what you do.
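The flow described above (`load` forwards to `options`, which stringifies each value) can be sketched without a JVM. This is only an approximation: `FakeReader` is a hypothetical stand-in for the real reader, and the body of `to_str` is an assumption about its behaviour (lower-casing booleans, passing `None` through), not a copy of the PySpark source.

```python
from typing import Any, Dict, Optional


def to_str(value: Any) -> Optional[str]:
    """Assumed behaviour of the to_str helper quoted above."""
    if isinstance(value, bool):
        return str(value).lower()  # JVM-style "true" / "false"
    if value is None:
        return None
    return str(value)


class FakeReader:
    """Hypothetical stand-in for DataFrameReader that records options."""

    def __init__(self) -> None:
        self.recorded: Dict[str, Optional[str]] = {}

    def options(self, **options: Any) -> "FakeReader":
        for k in options:
            self.recorded[k] = to_str(options[k])
        return self

    def load(self, path: Optional[str] = None, **options: Any) -> "FakeReader":
        self.options(**options)
        return self


reader = FakeReader().load("some/path", opt1=True, opt2=1, opt3="str")
print(reader.recorded)
```

Under these assumptions all three options reach the reader as strings, whatever type they had at the call site.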
We should distinguish between what is accepted (literally everything) and what is valid (a tiny subset of the universe). The path that is used here is for the convenience of developers, not end users. It is brittle (it depends on an implementation detail that can easily be overridden) and can result in all kinds of undesired outcomes. It also rejects inputs that would otherwise be perfectly sensible. Let's imagine a made-up class

    import decimal

    class BoundedDecimal(decimal.Decimal):
        def __init__(self, v):
            super().__init__()
            assert 0 <= self <= 1

        def __str__(self):
            return f"BoundedDecimal({super().__str__()})"

It should be a valid choice for, let's say, [...] That's true about almost all interfaces for Java classes, and the general attitude is that a JVM exception is good enough. In general I am open to discussion about using the same set of known types with known representation ([...]). I am strictly against using [...] I realize that many choices here are more restrictive than the actual implementation; this is intentional most of the time, even if it goes against the general Python attitude that it is easier to ask for forgiveness.
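To make the brittleness concrete, the made-up class can be run as-is. The sketch below repeats the class so it is self-contained; the `value`, `as_option`, and `parses_as_number` names are introduced only for this illustration.

```python
import decimal


class BoundedDecimal(decimal.Decimal):
    def __init__(self, v):
        super().__init__()
        assert 0 <= self <= 1

    def __str__(self):
        return f"BoundedDecimal({super().__str__()})"


value = BoundedDecimal("0.5")

# This is what a str()-based conversion would hand to the JVM side.
as_option = str(value)
print(as_option)

# The numeric value is sensible, but its string form is not a number.
try:
    float(as_option)
    parses_as_number = True
except ValueError:
    parses_as_number = False
print(parses_as_number)
```

The value itself is a perfectly reasonable number between 0 and 1, yet relying on `__str__` turns it into text that no numeric option parser would accept.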
OK, makes sense. I generally don't like using [...]
So I guess we could: [...]
Is this something you'd like to work on?
Great. I'll be happy to do so, but will do so via my personal account.
Sounds good. Let's continue this discussion under #276
Using 2.4.0.post6
mypy reports

    Expected type 'str', got 'bool' instead

for both `inferSchema` and `header`.

Looks like the issue is in `third_party/3/pyspark/sql/readwriter.pyi`, line 23, where the definition of `load()` has `**options: str`. For CSV support this needs to be `**options: Optional[Union[bool, str, int]]`, but to handle the general case it probably needs to be `**options: Any`.
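A minimal sketch of the wider annotation proposed above, for illustration only. The `OptionValue` alias and the stand-in function body are assumptions made for this example; the real stub declares a method on the reader class and returns a DataFrame.

```python
from typing import Dict, Optional, Union

# Hypothetical alias for the union suggested in the report.
OptionValue = Optional[Union[bool, str, int]]


def load(path: Optional[str] = None, **options: OptionValue) -> Dict[str, OptionValue]:
    # Stand-in body: return the options so the call can be inspected.
    return dict(options)


# With `**options: str` a type checker would flag the two booleans;
# with the union above the call type-checks.
opts = load("data.csv", inferSchema=True, header=True, sep=",")
print(opts)
```

Annotating with `Any` instead would also accept this call, at the cost of no longer flagging values that have no sensible string form.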