Skip to content
This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Add support for caching parquet metadata #180

Conversation

annanay25
Copy link
Contributor

@annanay25 annanay25 commented May 16, 2022

As per our discussion in #135, opening this draft PR to support caching parquet metadata.

The PR includes some plumbing to ensure the right options are supplied to the parquet reader/writers, allowing them to use secondary io.ReaderAts to fetch column chunk data.

Marking draft to ensure we are aligned on the general design and direction of implementation Dropped draft, updated README and tests.

…or column chunk data

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
config.go Outdated
SkipBloomFilters bool
SkipPageIndex bool
SkipBloomFilters bool
GetIOReaderFromPath func(filepath string) io.ReaderAt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using an interface type here instead of a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I'm not sure about what the initial signature should look like. The application should have enough information about the object to provide a secondary reader, so I'm not even sure we need to pass the column chunk metadata

config.go Outdated
@@ -134,6 +136,7 @@ func (c *ReaderConfig) Validate() error {
//
type WriterConfig struct {
CreatedBy string
ColumnChunkFilePath string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be an interface which defines both the file path and the location where the writer would output the metadata? (e.g. a Name method and io.Writer so we could use a os.File in the simplest cases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I removed the option of writing metadata to a different location. I realise we don't need this if we are only looking at caching metadata/indexes. We can make use of secondary (cache-free) io readers while reading column chunk data

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
README.md Show resolved Hide resolved
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
@annanay25 annanay25 marked this pull request as ready for review June 20, 2022 11:30
@annanay25
Copy link
Contributor Author

@achille-roussel pinging on this one!

@achille-roussel
Copy link
Contributor

Ok I removed the option of writing metadata to a different location. I realise we don't need this if we are only looking at caching metadata/indexes. We can make use of secondary (cache-free) io readers while reading column chunk data

I'm curious how the metadata end up being cached if the writer doesn't produce them, are you using a read-through cache then?

@annanay25
Copy link
Contributor Author

Yes that's right, we are using a read-through. Are there any concerns with regard to that?

@achille-roussel
Copy link
Contributor

No concerns, just wanted to get a good understanding of the use case :)

@annanay25
Copy link
Contributor Author

I'm also thinking if it is good design to add an override for reading columnChunk data through a secondary reader... Would it be better design to add an override for reading metadata through the cached reader? That way we can incrementally add caching support for metadata objects to improve speed vis-a-vis the current situation where we could potentially end up caching large objects unintentionally

@achille-roussel
Copy link
Contributor

Regarding reading column data, I wrote down some thoughts in #239; when you get a chance to read through, let me know what you think about it!

@annanay25
Copy link
Contributor Author

annanay25 commented Oct 11, 2022 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants