feat(spark): add support for DelimiterSeparatedTextReadOptions #323
Blizzara merged 1 commit into substrait-io:main
Conversation
Force-pushed from eec36ab to 3405b3e
  indent_size = 2

- [{**/*.sql,**/OuterReferenceResolver.md,**gradlew.bat}]
+ [{**/*.sql,**/OuterReferenceResolver.md,**gradlew.bat,**/*.parquet,**/*.orc}]
Is this needed? No one should be editing .parquet or .orc files by hand, afaik.
I'm not sure, but the editorconfig-checker stage failed before this was added (at least for .orc; I didn't run a separate check for .parquet):
https://github.com/substrait-io/substrait-java/actions/runs/13032936614/job/36356268259
      s"Cannot configure CSV reader to skip ${csv.getHeaderLinesToSkip} rows")
  }),
"escape" -> csv.getEscape,
"maxColumns" -> csv.getMaxLineSize.toString
This sounds wrong: maxColumns in Spark is a column count, while maxLineSize in Substrait is a byte count.
Good catch. There doesn't seem to be an equivalent Spark option for this, so I'll just remove the mapping and ignore it.
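To make the fix concrete, here is a minimal sketch of the corrected direction of the mapping: Substrait CSV read options translated into a Spark CSV option map, with maxLineSize deliberately left unmapped. The `CsvReadOptions` case class and `toSparkCsvOptions` helper are hypothetical stand-ins for the generated proto accessors, not the actual code in this PR.

```scala
// Hypothetical stand-in for the Substrait DelimiterSeparatedTextReadOptions
// proto accessors (getFieldDelimiter, getHeaderLinesToSkip, etc.).
case class CsvReadOptions(
    fieldDelimiter: String,
    headerLinesToSkip: Long,
    escape: String,
    maxLineSize: Long) // carried in the proto, but ignored on the Spark side

def toSparkCsvOptions(csv: CsvReadOptions): Map[String, String] = {
  // Spark's "header" option can only express skipping 0 or 1 header lines.
  require(
    csv.headerLinesToSkip <= 1,
    s"Cannot configure CSV reader to skip ${csv.headerLinesToSkip} rows")
  Map(
    "delimiter" -> csv.fieldDelimiter,
    "header" -> (csv.headerLinesToSkip == 1).toString,
    "escape" -> csv.escape)
  // Note: no "maxColumns" entry. Spark's maxColumns counts columns,
  // Substrait's maxLineSize counts bytes, so the mapping is dropped.
}
```

The key design point is the omission: rather than mapping maxLineSize onto a Spark option with different semantics, the field is simply not translated.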
// default values for options specified here:
// https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
override def getFieldDelimiter: String = fsRelation.options.getOrElse("delimiter", ",")
override def getMaxLineSize: Long =
Same as above, but I need to assign something, so I'll use the proto default value of 0.
The handling of CSV in the spark module works only in a very limited way, with many hard-coded assumptions. This commit adds full support for delimited text as defined in the `FileOrFiles` proto message.

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
// default values for options specified here:
// https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
override def getFieldDelimiter: String = fsRelation.options.getOrElse("delimiter", ",")
override def getMaxLineSize: Long = 0 // No Spark equivalent. Assign proto default of 0
This is still a bit weird, but I see that it's part of the proto, so I don't have better ideas either. I guess the only problem would appear if someone uses Spark-produced plans in another system that supports and implements this; but if that happens, we can figure it out then, I think.
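For the cross-system concern raised above, a consuming system could treat the proto default of 0 as "no limit configured" rather than a literal zero-byte line limit. This is only a sketch of one possible consumer-side convention, not anything specified by Substrait or implemented in this PR; the helper name is invented.

```scala
// Hypothetical consumer-side interpretation of maxLineSize from a
// Spark-produced plan: 0 is the proto default, meaning the producer
// set no limit, so it maps to None rather than to a 0-byte limit.
def effectiveMaxLineSize(maxLineSize: Long): Option[Long] =
  if (maxLineSize == 0L) None
  else Some(maxLineSize)
```

Under this convention, plans emitted by the Spark module (which always writes 0) would simply be read as "unlimited" by such a consumer.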