Skip to content

feat(spark): add support for DelimiterSeparatedTextReadOptions#323

Merged
Blizzara merged 1 commit intosubstrait-io:mainfrom
andrew-coleman:csv
Feb 4, 2025
Merged

feat(spark): add support for DelimiterSeparatedTextReadOptions#323
Blizzara merged 1 commit intosubstrait-io:mainfrom
andrew-coleman:csv

Conversation

@andrew-coleman
Copy link
Copy Markdown
Member

The handling of CSV in the spark module works only in a very limited way with many hard-coded assumptions.
This commit adds full support for delimited text support as defined in the FileOrFiles proto message

Copy link
Copy Markdown
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@andrew-coleman
Copy link
Copy Markdown
Member Author

Not sure why osv-scanner is failing. Is that related to this PR? It works locally with no issues. 🤔

@andrew-coleman andrew-coleman force-pushed the csv branch 3 times, most recently from eec36ab to 3405b3e Compare February 3, 2025 07:32
Comment thread .editorconfig
indent_size = 2

[{**/*.sql,**/OuterReferenceResolver.md,**gradlew.bat}]
[{**/*.sql,**/OuterReferenceResolver.md,**gradlew.bat,**/*.parquet,**/*.orc}]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed, no-one should be editing .parquet or .orc files by hand afaik?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but the editorconfig-checker stage failed prior to adding this (at least for .orc, didn't do a separate check for .parquet):
https://github.com/substrait-io/substrait-java/actions/runs/13032936614/job/36356268259

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird but okay :D

s"Cannot configure CSV reader to skip ${csv.getHeaderLinesToSkip} rows")
}),
"escape" -> csv.getEscape,
"maxColumns" -> csv.getMaxLineSize.toString
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds wrong, maxColumns in spark is for columns while maxLineSize in Substrait is for bytes

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. There doesn't seem to be a Spark equivalent option for this, so I'll just remove the mapping and ignore.

// default values for options specified here:
// https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
override def getFieldDelimiter: String = fsRelation.options.getOrElse("delimiter", ",")
override def getMaxLineSize: Long =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, but I need to assign something, so I'll use the proto default value of 0.

The handling of CSV in the spark module works only in a very
limited way with many hard-coded assumptions.
This commit adds full support for delimited text support as
defined in the `FileOrFiles` proto message

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
// default values for options specified here:
// https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
override def getFieldDelimiter: String = fsRelation.options.getOrElse("delimiter", ",")
override def getMaxLineSize: Long = 0 // No Spark equivalent. Assign proto default of 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit weird still, but I see that it's a part of the proto, so I don't have better ideas either. I guess only problem would appear if someone uses spark-produced plans in some other system which would support & implement this, but if that happens we can figure it out then, I think.

@Blizzara Blizzara merged commit 13da183 into substrait-io:main Feb 4, 2025
@andrew-coleman andrew-coleman deleted the csv branch February 4, 2025 10:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants