Add ParquetMetadataDoFn #5820

Merged
kellen merged 12 commits into main from kellen/parquet-md
Dec 8, 2025

@kellen kellen commented Nov 24, 2025

Reads parquet metadata, when possible, from the file footer.
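
For readers unfamiliar with the layout this relies on: a Parquet file ends with the serialized file metadata, followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`. The sketch below only locates that footer in an in-memory byte array, using the JDK alone. It is not the code from this PR, the object and method names are hypothetical, and it does not decode the Thrift-encoded metadata or handle encrypted footers.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only; not part of scio-parquet.
object ParquetFooterSketch {
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** Returns (offset, length) of the serialized footer metadata, or None if the tail is not valid. */
  def footerRange(file: Array[Byte]): Option[(Int, Int)] = {
    val tailSize = 4 + Magic.length // 4-byte footer length + trailing magic
    if (file.length < Magic.length + tailSize) None
    else {
      val trailingMagic = file.slice(file.length - Magic.length, file.length)
      if (!java.util.Arrays.equals(trailingMagic, Magic)) None
      else {
        // The footer length is stored little-endian just before the trailing magic.
        val footerLength = ByteBuffer
          .wrap(file, file.length - tailSize, 4)
          .order(ByteOrder.LITTLE_ENDIAN)
          .getInt
        val footerStart = file.length - tailSize - footerLength
        // Reject lengths that would overlap the leading magic or run past the start.
        if (footerLength < 0 || footerStart < Magic.length) None
        else Some((footerStart, footerLength))
      }
    }
  }
}
```

Because everything needed sits in the last few bytes, a reader can seek to the end of the file and fetch only the footer rather than scanning row groups.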

Comment thread scio-parquet/src/main/scala/com/spotify/scio/parquet/ParquetMetadataDoFn.scala Outdated

codecov bot commented Nov 24, 2025

Codecov Report

❌ Patch coverage is 96.87500% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 61.59%. Comparing base (b566764) to head (c40b13d).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...potify/scio/parquet/syntax/SCollectionSyntax.scala | 96.87% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5820      +/-   ##
==========================================
+ Coverage   61.49%   61.59%   +0.09%     
==========================================
  Files         314      315       +1     
  Lines       11437    11469      +32     
  Branches      830      828       -2     
==========================================
+ Hits         7033     7064      +31     
- Misses       4404     4405       +1     

kellen commented Nov 25, 2025

TODO: add better syntax for this, going from String -> metadata and/or glob -> ResourceId -> metadata.

@kellen kellen added this to the 0.15.x milestone Dec 8, 2025
}
}

class ParquetStringSCollectionSyntax(self: SCollection[String]) {

@clairemcginty clairemcginty Dec 8, 2025

strong opinion loosely held:

IMO, this function is niche enough that instead of adding an extra SCollection[String] helper, we should just add a function to [FileSCollectionFunctions], like readFiles, that just transforms SCollection[String] -> SCollection[ReadableFile]. Then the user can just do

sc
  .parallelize(paths)
  .readFiles
  .parquetMetadata

(approving anyway because I don't feel super strongly about this.)

Contributor Author

I guess there is already a ReadableFile => A version of readFiles, in which case

sc
  .parallelize(paths)
  .readFilesParquetMetadata() // or readParquetMetadata, or, even better than either of these, parquetMetadata

Where readFilesParquetMetadata still needs to be in the parquet project, with an implicit/syntax, and we need to separately provide a ReadableFile => ParquetMetadata function
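
The layering discussed in this thread can be sketched with stand-in types. This is a rough illustration, not the merged implementation: `Seq` stands in for `SCollection`, the `ReadableFile` and `ParquetMetadata` case classes are hypothetical placeholders rather than the real Beam and Parquet classes, and all method names are assumptions taken from the suggestions above.

```scala
object ParquetSyntaxSketch {
  // Hypothetical stand-ins; NOT the real Beam ReadableFile or Parquet ParquetMetadata.
  final case class ReadableFile(path: String)
  final case class ParquetMetadata(path: String)

  // Stand-in for the separately provided ReadableFile => ParquetMetadata function.
  val toParquetMetadata: ReadableFile => ParquetMetadata =
    file => ParquetMetadata(file.path)

  // Stand-in for a generic, core-side helper: paths -> files -> A.
  implicit class FileSyntax(private val self: Seq[String]) extends AnyVal {
    def readFiles[A](f: ReadableFile => A): Seq[A] =
      self.map(ReadableFile(_)).map(f)
  }

  // Stand-in for the parquet-side syntax, built on top of the generic helper.
  implicit class ParquetSyntax(private val self: Seq[String]) extends AnyVal {
    def parquetMetadata: Seq[ParquetMetadata] =
      self.readFiles(toParquetMetadata)
  }
}
```

With both implicits in scope, `Seq("a.parquet").parquetMetadata` and `Seq("a.parquet").readFiles(ParquetSyntaxSketch.toParquetMetadata)` produce the same result. The trade-off under discussion is whether the dedicated method is worth the extra syntax class.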

@kellen kellen merged commit 1d7f744 into main Dec 8, 2025
17 of 23 checks passed
@kellen kellen deleted the kellen/parquet-md branch December 8, 2025 20:52