feat(spark): add support for DelimiterSeparatedTextReadOptions by andrew-coleman · Pull Request #323 · substrait-io/substrait-java

andrew-coleman · 2025-01-29T13:59:36Z

The handling of CSV in the spark module works only in a very limited way, with many hard-coded assumptions.
This commit adds full support for delimited text as defined in the `FileOrFiles` proto message.

Blizzara

Looks good to me!

andrew-coleman · 2025-01-29T15:07:01Z

Not sure why osv-scanner is failing. Is that related to this PR? It works locally with no issues. 🤔

Blizzara · 2025-02-03T09:32:15Z

 indent_size = 2

-[{**/*.sql,**/OuterReferenceResolver.md,**gradlew.bat}]
+[{**/*.sql,**/OuterReferenceResolver.md,**gradlew.bat,**/*.parquet,**/*.orc}]


Is this needed? No one should be editing .parquet or .orc files by hand, afaik.

I'm not sure, but the editorconfig-checker stage failed prior to adding this (at least for .orc, didn't do a separate check for .parquet):
https://github.com/substrait-io/substrait-java/actions/runs/13032936614/job/36356268259

Weird but okay :D

Blizzara · 2025-02-03T09:35:33Z

+                s"Cannot configure CSV reader to skip ${csv.getHeaderLinesToSkip} rows")
+          }),
+          "escape" -> csv.getEscape,
+          "maxColumns" -> csv.getMaxLineSize.toString


This sounds wrong: maxColumns in Spark counts columns, while maxLineSize in Substrait counts bytes.

Good catch. There doesn't seem to be a Spark equivalent option for this, so I'll just remove the mapping and ignore.
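The resolution above (drop the `maxColumns` mapping and leave `maxLineSize` unmapped) can be sketched as follows. This is a minimal illustration, not the code in the PR: `DelimitedTextOptions` and `CsvOptionMapping` are hypothetical stand-ins for the Substrait read-options message and the converter, and the header handling is only inferred from the error message visible in the diff. The Spark option names (`delimiter`, `quote`, `header`, `escape`, `nullValue`) are taken from the Spark CSV data source documentation.

```scala
// Hypothetical stand-in for the Substrait DelimiterSeparatedTextReadOptions message.
final case class DelimitedTextOptions(
    fieldDelimiter: String,
    maxLineSize: Long,
    quote: String,
    headerLinesToSkip: Long,
    escape: String,
    valueTreatedAsNull: Option[String])

object CsvOptionMapping {

  // Translate the Substrait-side options into Spark CSV reader options.
  def toSparkOptions(csv: DelimitedTextOptions): Map[String, String] = {
    // Spark can only skip zero rows or a single header row.
    val header = csv.headerLinesToSkip match {
      case 0L => "false"
      case 1L => "true"
      case n =>
        throw new UnsupportedOperationException(
          s"Cannot configure CSV reader to skip $n rows")
    }

    // maxLineSize is intentionally not mapped: it is a byte limit, whereas
    // Spark's maxColumns limits the number of columns, so the two are not equivalent.
    Map(
      "delimiter" -> csv.fieldDelimiter,
      "quote" -> csv.quote,
      "header" -> header,
      "escape" -> csv.escape
    ) ++ csv.valueTreatedAsNull.map("nullValue" -> _)
  }
}
```

With this shape, a plan that carries a non-zero `maxLineSize` is still readable by Spark; the limit is simply not enforced.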

Blizzara · 2025-02-03T09:35:44Z

+          // default values for options specified here:
+          // https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
+          override def getFieldDelimiter: String = fsRelation.options.getOrElse("delimiter", ",")
+          override def getMaxLineSize: Long =


Same as above, but I need to assign something, so I'll use the proto default value of 0.

The handling of CSV in the spark module works only in a very limited way, with many hard-coded assumptions. This commit adds full support for delimited text as defined in the `FileOrFiles` proto message.

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>

Blizzara · 2025-02-04T10:03:16Z

+          // default values for options specified here:
+          // https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
+          override def getFieldDelimiter: String = fsRelation.options.getOrElse("delimiter", ",")
+          override def getMaxLineSize: Long = 0 // No Spark equivalent. Assign proto default of 0


This is still a bit weird, but I see that it's part of the proto, so I don't have any better ideas either. I guess the only problem would appear if someone uses Spark-produced plans in some other system that supports and implements this, but if that happens we can figure it out then, I think.
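For the Spark-to-Substrait direction discussed here, a rough sketch of the idea is to read each value from the relation's options, fall back to Spark's documented CSV defaults, and assign 0 to `maxLineSize` because Spark has no equivalent. `CsvReadOptions` and `fromSparkOptions` are hypothetical names for illustration, and a plain `Map[String, String]` stands in for `fsRelation.options`; only the `"delimiter"` lookup with a `","` default appears in the diff itself.

```scala
// Hypothetical stand-in for the values written into the Substrait read options.
final case class CsvReadOptions(
    fieldDelimiter: String,
    maxLineSize: Long,
    quote: String,
    headerLinesToSkip: Long,
    escape: String)

object SparkCsvDefaults {

  // Build the Substrait-side values from Spark reader options.
  // Fallbacks follow the defaults listed in the Spark CSV data source docs.
  def fromSparkOptions(options: Map[String, String]): CsvReadOptions = {
    val hasHeader = options.getOrElse("header", "false").equalsIgnoreCase("true")
    CsvReadOptions(
      fieldDelimiter = options.getOrElse("delimiter", ","),
      maxLineSize = 0L, // No Spark equivalent, so use the proto default of 0
      quote = options.getOrElse("quote", "\""),
      headerLinesToSkip = if (hasHeader) 1L else 0L,
      escape = options.getOrElse("escape", "\\"))
  }
}
```

A consumer in another system would have to decide how to interpret a `maxLineSize` of 0 in a Spark-produced plan; the thread leaves that question open.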

Blizzara approved these changes Jan 29, 2025

andrew-coleman force-pushed the csv branch from f24a75f to d830544 Compare January 29, 2025 14:23

andrew-coleman force-pushed the csv branch 3 times, most recently from eec36ab to 3405b3e Compare February 3, 2025 07:32

Blizzara reviewed Feb 3, 2025

Blizzara requested changes Feb 3, 2025

andrew-coleman force-pushed the csv branch from 3405b3e to 2a6488c Compare February 3, 2025 11:05

andrew-coleman requested a review from Blizzara February 3, 2025 13:29

Blizzara reviewed Feb 4, 2025

Blizzara approved these changes Feb 4, 2025

Blizzara merged commit 13da183 into substrait-io:main Feb 4, 2025

andrew-coleman deleted the csv branch February 4, 2025 10:05
