Skip to content

Add support for vectoredRead-compatible stream in scio-parquet#5732

Merged
clairemcginty merged 11 commits intomainfrom
support-vectored-reads
Jul 30, 2025
Merged

Add support for vectoredRead-compatible stream in scio-parquet#5732
clairemcginty merged 11 commits intomainfrom
support-vectored-reads

Conversation

@clairemcginty
Copy link
Copy Markdown
Contributor

Not ready for merge!!

glue code to support readVectored-compatible streams for scio-parquet.

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 28, 2025

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.45%. Comparing base (17fcd8c) to head (c2b1838).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...sdk/extensions/smb/ParquetTypeFileOperations.scala 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5732      +/-   ##
==========================================
- Coverage   61.46%   61.45%   -0.02%     
==========================================
  Files         314      314              
  Lines       11326    11329       +3     
  Branches      792      794       +2     
==========================================
  Hits         6962     6962              
- Misses       4364     4367       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clairemcginty clairemcginty force-pushed the support-vectored-reads branch from 2f2818f to aef6215 Compare July 28, 2025 21:24
@clairemcginty clairemcginty changed the title [WIP] Add support for vectoredRead-compatible stream in scio-parquet Add support for vectoredRead-compatible stream in scio-parquet Jul 28, 2025
@clairemcginty clairemcginty marked this pull request as ready for review July 28, 2025 23:05
Comment on lines 306 to +307
val bigdataOssVersion = "3.1.3" // Check Maven for latest
val hadoopVersion = "3.3.6" // Check Maven for latest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does bigdataOssVersion determine what the hadoopVersion should be?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not formally, but it gcs-connector 3.x uses a method that was removed in Hadoop 3.4, so it ties us to 3.3 :(

Comment thread build.sbt
Comment on lines +719 to +721
"org.apache.hadoop" % "hadoop-common" % "3.3.6",
"org.apache.hadoop" % "hadoop-auth" % "3.3.6",
"org.apache.hadoop" % "hadoop-client" % "3.3.6"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe move these hardcoded versions to some variable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait I just realized we already have hadoopVersion variable set to 3.4.1 which is used for org.apache.hadoop:hadoop-common and org.apache.hadoop:hadoop-client. Will this be an issue? can we just bring everything down to 3.3.6 or does this cause issue with beam's hadoop version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the fence about that a bit!! Ultimately I left the default version as 3.4 since it does contain some (mostly) transitive security fixes.

Copy link
Copy Markdown
Contributor

@jto jto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add basic doc. I think the setup is a bit complex. It'd be great if you could just pass --useVectoredParquetReads.

Comment thread build.sbt
)

def vectoredReadSettings: Seq[Setting[_]] = sys.props
.get("vectoredReadsEnabled")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand things correctly to use vectored reads you need to have the vectoredReadsEnabled set at build time and also have parquet.hadoop.vectored.io.enabled=true in conf ?
Could you add doc in this PR ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the vectoredReadsEnabled setting isn't user-facing - it's only for enabling the overrides to run scio-examples with (similar to how we support sbt -DbeamRunners=DataflowRunner scio-examples/runMain ...

I realize I did not include the config setting in the doc though, will add!

@clairemcginty
Copy link
Copy Markdown
Contributor Author

Maybe add basic doc. I think the setup is a bit complex. It'd be great if you could just pass --useVectoredParquetReads.

My hope is that eventually the dependency stuff gets resolved (in Scio 0.15, when we officially drop Java 8, we can just upgrade the project gcs-connector version) so that the eventual user experience is just adding the boolean flag to their config 🙏 Maybe until 0.14 we just consider it as an experimental feature (I kept the new classes package-private so the public facing API isn't changed)

@clairemcginty clairemcginty merged commit 22816a9 into main Jul 30, 2025
12 checks passed
@clairemcginty clairemcginty deleted the support-vectored-reads branch July 30, 2025 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants