Feature Request: Seek to RowGroup #461

Open
zolstein opened this issue Apr 30, 2022 · 3 comments

Comments

@zolstein

In theory, one of the advantages of the parquet format is the ability to use metadata in the footer to avoid processing the entire file in order to locate specific records of interest. Specifically, one wants to use the RowGroup's min/max values per column to avoid processing RowGroups that don't contain records with particular values.

In practice, I can't see a way to do that using this library. SkipRows does almost what is needed, but the API doesn't make it possible (or at least easy) to navigate between row groups, and it needs to process every page so it doesn't provide the performance benefit.

I propose a new method on the Reader and ColumnReader types: SeekRowGroup(index int64) error that logically moves the reader to the start of the row group. This, in conjunction with the metadata in the footer, can be used to efficiently skip RowGroups that are known not to contain desired records.

If you have any interest in including a feature like this, I have a proof-of-concept that seems to work and that I can flesh out.

@hangxie
Contributor

hangxie commented May 19, 2022

Something like this (note that it lacks a lot of nil/empty checks), maybe? My personal opinion is that this is kind of "easy":

	for rgIndex, rg := range reader.Footer.RowGroups {
		for _, col := range rg.Columns {
			// TODO check full path
			path := col.MetaData.PathInSchema
			// the last element is at index len(path)-1; indexing with len(path) panics
			if len(path) == 0 || path[len(path)-1] != "FieldToCheck" {
				continue
			}
			// check col.MetaData.Statistics.MaxValue and col.MetaData.Statistics.MinValue
			// and return rgIndex that matches criteria
		}
	}

There are definitely valid use cases for this, though I have never encountered one. Note that min and max are not mandatory, so this functionality only works for a certain number of parquet files.
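To make the pruning step concrete, here is a minimal, self-contained sketch of the min/max check. The `columnStats` and `rowGroup` types and the function names are hypothetical stand-ins for the footer metadata, not this library's types. The sketch also assumes a byte-array (string) column where plain byte-wise comparison matches the column's sort order; numeric columns would need their values decoded before comparing. Row groups without statistics are kept, since they cannot be ruled out.

```go
package main

import (
	"bytes"
	"fmt"
)

// columnStats is a hypothetical stand-in for one column chunk's footer metadata.
type columnStats struct {
	Path     []string // full path of the column in the schema
	MinValue []byte   // nil when the writer recorded no statistics
	MaxValue []byte
}

// rowGroup is a hypothetical stand-in for one row group's footer metadata.
type rowGroup struct {
	Columns []columnStats
}

// samePath reports whether two schema paths are identical.
func samePath(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// mayContain reports whether rg could hold target in the column at path.
// It answers true whenever the statistics cannot rule the row group out.
func mayContain(rg rowGroup, path []string, target []byte) bool {
	for _, col := range rg.Columns {
		if !samePath(col.Path, path) {
			continue
		}
		if col.MinValue == nil || col.MaxValue == nil {
			return true // statistics are optional, so stay conservative
		}
		return bytes.Compare(col.MinValue, target) <= 0 &&
			bytes.Compare(target, col.MaxValue) <= 0
	}
	return true // column not found: do not skip the row group
}

// candidateRowGroups returns the indexes of row groups that may contain target.
func candidateRowGroups(groups []rowGroup, path []string, target []byte) []int {
	var out []int
	for i, rg := range groups {
		if mayContain(rg, path, target) {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	name := []string{"name"}
	groups := []rowGroup{
		{Columns: []columnStats{{Path: name, MinValue: []byte("apple"), MaxValue: []byte("fig")}}},
		{Columns: []columnStats{{Path: name, MinValue: []byte("grape"), MaxValue: []byte("melon")}}},
		{Columns: []columnStats{{Path: name}}}, // no statistics recorded
	}
	fmt.Println(candidateRowGroups(groups, name, []byte("kiwi"))) // prints [1 2]
}
```

The first row group is skipped because "kiwi" sorts after its maximum; the third is kept only because it has no statistics to test against.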

@zolstein
Author

Something like this (note that it lacks a lot of nil/empty checks), maybe?

Yeah, that is (more or less) how you'd identify row groups you care about. To clarify, though, the issue is that having done those checks there's no (easy, non-super-invasive) way to seek the ParquetReader into the right spot to consume from the beginning of the row-group. That's what the SeekRowGroup method solves.
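For context, the closest workaround with the existing row-skipping approach is to add up the row counts of the preceding row groups and skip that many rows from a freshly opened reader. A rough sketch of that arithmetic follows; `rowsBefore` is a hypothetical helper, and the slice of row counts is assumed to come from each row group's NumRows field in the footer. This positions the reader logically but, as described above, skipping rows still processes every page, which is why it is not a substitute for a real seek.

```go
package main

import "fmt"

// rowsBefore returns how many rows are stored in the row groups that precede
// the row group at index. numRows holds one row count per row group, in file order.
func rowsBefore(numRows []int64, index int) (int64, error) {
	if index < 0 || index >= len(numRows) {
		return 0, fmt.Errorf("row group index %d out of range [0, %d)", index, len(numRows))
	}
	var total int64
	for _, n := range numRows[:index] {
		total += n
	}
	return total, nil
}

func main() {
	// Three row groups with 100, 250 and 75 rows; row group 2 starts at row 350.
	offset, err := rowsBefore([]int64{100, 250, 75}, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(offset) // prints 350
}
```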

note that min and max are not mandatory so this functionality only works for a certain number of parquet files.

True, but the files being consumed are most likely generated using this library, and it does set the Min/MaxValue fields.

@zolstein
Author

I posted a draft PR of my PoC here: #469
