Add support for vectoredRead-compatible stream in scio-parquet #5732
clairemcginty merged 11 commits into main
Conversation
Codecov Report ❌ Patch coverage is

@@            Coverage Diff             @@
##             main    #5732      +/-   ##
==========================================
- Coverage   61.46%   61.45%   -0.02%
==========================================
  Files         314      314
  Lines       11326    11329       +3
  Branches      792      794       +2
==========================================
  Hits         6962     6962
- Misses       4364     4367       +3
val bigdataOssVersion = "3.1.3" // Check Maven for latest
val hadoopVersion = "3.3.6" // Check Maven for latest
Does bigdataOssVersion determine what the hadoopVersion should be?
Not formally, but gcs-connector 3.x uses a method that was removed in Hadoop 3.4, so it ties us to 3.3 :(
"org.apache.hadoop" % "hadoop-common" % "3.3.6",
"org.apache.hadoop" % "hadoop-auth" % "3.3.6",
"org.apache.hadoop" % "hadoop-client" % "3.3.6"
nit: maybe move these hardcoded versions to some variable
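The suggested refactor could look something like the following sketch; the variable name is illustrative, not from the PR:

```scala
// Hypothetical sbt sketch: hoist the repeated hardcoded version into a val.
// "vectoredHadoopVersion" is an assumed name for illustration only.
val vectoredHadoopVersion = "3.3.6"

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % vectoredHadoopVersion,
  "org.apache.hadoop" % "hadoop-auth"   % vectoredHadoopVersion,
  "org.apache.hadoop" % "hadoop-client" % vectoredHadoopVersion
)
```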
wait, I just realized we already have a hadoopVersion variable set to 3.4.1, which is used for org.apache.hadoop:hadoop-common and org.apache.hadoop:hadoop-client. Will this be an issue? Can we just bring everything down to 3.3.6, or does that cause issues with Beam's Hadoop version?
I was on the fence about that a bit!! Ultimately I left the default version as 3.4 since it does contain some (mostly) transitive security fixes.
jto left a comment
Maybe add basic doc. I think the setup is a bit complex. It'd be great if you could just pass --useVectoredParquetReads.
| ) | ||
|
|
||
| def vectoredReadSettings: Seq[Setting[_]] = sys.props | ||
| .get("vectoredReadsEnabled") |
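The diff is truncated here; one plausible completion of this pattern, assuming the setting toggles extra settings when the JVM property is present (the body below is an illustrative guess, not the PR's actual code), would be:

```scala
// Hypothetical sbt sketch: settings activated via `sbt -DvectoredReadsEnabled=true ...`,
// mirroring the sys.props lookup shown in the diff. The dependency override is illustrative.
def vectoredReadSettings: Seq[Setting[_]] =
  sys.props
    .get("vectoredReadsEnabled")
    .filter(_.equalsIgnoreCase("true"))
    .map { _ =>
      Seq(
        libraryDependencies ++= Seq(
          "org.apache.hadoop" % "hadoop-common" % "3.3.6"
        )
      )
    }
    .getOrElse(Seq.empty)
```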
So if I understand things correctly, to use vectored reads you need to have vectoredReadsEnabled set at build time and also have parquet.hadoop.vectored.io.enabled=true in the conf?
Could you add doc in this PR ?
the vectoredReadsEnabled setting isn't user-facing - it's only for enabling the overrides to run scio-examples with (similar to how we support sbt -DbeamRunners=DataflowRunner scio-examples/runMain ...)
I realize I did not include the config setting in the doc though, will add!
My hope is that eventually the dependency stuff gets resolved (in Scio 0.15, when we officially drop Java 8, we can just upgrade the project gcs-connector version) so that the eventual user experience is just adding the boolean flag to their config 🙏 Maybe until 0.14 we just consider it an experimental feature (I kept the new classes package-private so the public-facing API isn't changed)
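Enabling the boolean flag mentioned above could look like this minimal sketch, which sets the parquet.hadoop.vectored.io.enabled key from the discussion on a Hadoop Configuration; how the Configuration is threaded into a scio-parquet read is assumed, not shown in this PR:

```scala
// Illustrative sketch only: enable Parquet's vectored IO via Hadoop Configuration.
// The config key comes from the discussion above; passing this conf into a
// scio-parquet read is an assumed usage pattern.
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.setBoolean("parquet.hadoop.vectored.io.enabled", true)
```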
Not ready for merge!!
Glue code to support readVectored-compatible streams for scio-parquet.