Add support for caching parquet metadata #180

annanay25 · 2022-05-16T21:07:46Z

As per our discussion in #135, opening this draft PR to support caching parquet metadata.

The PR includes some plumbing to ensure the right options are supplied to the parquet reader/writers, allowing them to use secondary io.ReaderAts to fetch column chunk data.

~~Marking draft to ensure we are aligned on the general design and direction of implementation~~ Dropped draft, updated README and tests.


achille-roussel · 2022-06-06T06:37:00Z

config.go

-	SkipBloomFilters bool
+	SkipPageIndex       bool
+	SkipBloomFilters    bool
+	GetIOReaderFromPath func(filepath string) io.ReaderAt


What do you think about using an interface type here instead of a function?

Done, but I'm not sure what the initial signature should look like. The application should have enough information about the object to provide a secondary reader, so I'm not even sure we need to pass the column chunk metadata.
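To make the function-versus-interface question concrete, here is a minimal sketch of what an interface-based option could look like. The names `ColumnChunkReaders`, `ReaderAtFor`, `ReaderAtFunc`, and `inMemoryFiles` are hypothetical and chosen only for illustration; they are not taken from the PR or from the library.

```go
package main

import (
	"io"
	"strings"
)

// ColumnChunkReaders is a hypothetical interface standing in for the
// GetIOReaderFromPath function field shown in the diff above.
type ColumnChunkReaders interface {
	// ReaderAtFor returns the reader used to fetch column chunk data
	// for the file at filePath.
	ReaderAtFor(filePath string) io.ReaderAt
}

// ReaderAtFunc adapts a plain function to the interface, in the style of
// http.HandlerFunc, so callers who prefer a function can keep using one.
type ReaderAtFunc func(filePath string) io.ReaderAt

// ReaderAtFor calls the underlying function.
func (f ReaderAtFunc) ReaderAtFor(filePath string) io.ReaderAt { return f(filePath) }

// inMemoryFiles is a toy implementation that maps paths to file contents.
type inMemoryFiles map[string]string

// ReaderAtFor returns a reader over the stored contents; an unknown path
// yields a reader over an empty string.
func (m inMemoryFiles) ReaderAtFor(filePath string) io.ReaderAt {
	return strings.NewReader(m[filePath])
}
```

With an adapter like `ReaderAtFunc`, an interface field would accept both struct implementations and bare functions, so the choice mostly affects how the option reads in the config rather than what callers can express.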

achille-roussel · 2022-06-06T06:40:30Z

config.go

@@ -134,6 +136,7 @@ func (c *ReaderConfig) Validate() error {
 //
 type WriterConfig struct {
 	CreatedBy            string
+	ColumnChunkFilePath  string


Maybe this could be an interface which defines both the file path and the location where the writer would output the metadata? (e.g. a Name method and io.Writer, so we could use an os.File in the simplest cases)

Ok, I removed the option of writing metadata to a different location. I realise we don't need this if we are only looking at caching metadata/indexes. We can make use of secondary (cache-free) io readers while reading column chunk data.
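For reference, the interface suggested in the review comment (a `Name` method plus `io.Writer`) might look roughly like the sketch below, even though the option was dropped from this PR. `NamedWriter`, `writeMetadata`, and `memFile` are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"io"
	"os"
)

// NamedWriter is a hypothetical interface combining a destination name
// with an io.Writer, as suggested in the review.
type NamedWriter interface {
	io.Writer
	Name() string
}

// Compile-time check that *os.File already satisfies the interface,
// which is what makes it usable "in the simplest cases".
var _ NamedWriter = (*os.File)(nil)

// memFile is an in-memory stand-in for a file, useful in tests.
type memFile struct {
	name string
	bytes.Buffer
}

// Name returns the name the in-memory file was created with.
func (m *memFile) Name() string { return m.name }

// writeMetadata writes b to w and reports the name it was written under.
func writeMetadata(w NamedWriter, b []byte) (string, error) {
	_, err := w.Write(b)
	return w.Name(), err
}
```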


annanay25 · 2022-06-21T16:54:53Z

@achille-roussel pinging on this one!

achille-roussel · 2022-06-21T17:39:37Z

> Ok, I removed the option of writing metadata to a different location. I realise we don't need this if we are only looking at caching metadata/indexes. We can make use of secondary (cache-free) io readers while reading column chunk data.

I'm curious how the metadata ends up being cached if the writer doesn't produce it. Are you using a read-through cache, then?

annanay25 · 2022-06-21T17:58:25Z

Yes, that's right, we are using a read-through cache. Are there any concerns with regard to that?

achille-roussel · 2022-06-21T18:02:24Z

No concerns, just wanted to get a good understanding of the use case :)
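As an illustration of the read-through approach discussed here, the following is a minimal sketch of an `io.ReaderAt` wrapper that serves repeated reads of the same range from memory and falls through to the backing reader on a miss. All names (`readThroughReaderAt`, `newReadThrough`, `readTwice`) are hypothetical, and keying the cache on exact (offset, length) pairs is a simplifying assumption; this is not the implementation used in the PR.

```go
package main

import (
	"io"
	"strings"
	"sync"
)

// rangeKey identifies one cached read by its offset and length.
type rangeKey struct {
	off int64
	n   int
}

// readThroughReaderAt is a hypothetical read-through cache in front of
// a backing io.ReaderAt. It caches exact (offset, length) ranges only.
type readThroughReaderAt struct {
	backing      io.ReaderAt
	mu           sync.Mutex
	cache        map[rangeKey][]byte
	backingReads int // number of reads that reached the backing reader
}

func newReadThrough(backing io.ReaderAt) *readThroughReaderAt {
	return &readThroughReaderAt{backing: backing, cache: make(map[rangeKey][]byte)}
}

// ReadAt serves the range from the cache when present; otherwise it reads
// from the backing reader and stores a copy of the result.
func (c *readThroughReaderAt) ReadAt(p []byte, off int64) (int, error) {
	key := rangeKey{off: off, n: len(p)}

	c.mu.Lock()
	cached, ok := c.cache[key]
	if !ok {
		c.backingReads++
	}
	c.mu.Unlock()
	if ok {
		return copy(p, cached), nil
	}

	n, err := c.backing.ReadAt(p, off)
	if err != nil {
		// Failed or short reads are passed through and never cached.
		return n, err
	}

	c.mu.Lock()
	c.cache[key] = append([]byte(nil), p[:n]...)
	c.mu.Unlock()
	return n, nil
}

// readTwice reads the same range twice and reports the bytes read and
// how many of the two reads reached the backing reader.
func readTwice(data string, off int64, n int) (string, int, error) {
	r := newReadThrough(strings.NewReader(data))
	buf := make([]byte, n)
	for i := 0; i < 2; i++ {
		if _, err := r.ReadAt(buf, off); err != nil {
			return "", r.backingReads, err
		}
	}
	return string(buf), r.backingReads, nil
}
```

Because nothing bounds the cache, wrapping a reader used for column chunk data in something like this would also cache large pages, which is the concern raised in the next comment.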

annanay25 · 2022-06-21T18:02:53Z

I'm also wondering whether it is good design to add an override for reading column chunk data through a secondary reader. Would it be better to add an override for reading metadata through the cached reader instead? That way we can incrementally add caching support for metadata objects to improve speed, rather than the current situation where we could end up caching large objects unintentionally.
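One way to picture the alternative proposed here is a reader that sends only metadata reads to the cached reader and everything else to the uncached one. The sketch below routes by offset, assuming for simplicity that all metadata lives at or after a single known offset; that assumption, and the names `splitReaderAt`, `countingReaderAt`, and `newCounting`, are illustrative only and do not describe the design that was eventually merged.

```go
package main

import (
	"io"
	"strings"
)

// countingReaderAt counts calls so the routing can be observed.
type countingReaderAt struct {
	r     io.ReaderAt
	calls int
}

func newCounting(data string) *countingReaderAt {
	return &countingReaderAt{r: strings.NewReader(data)}
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	return c.r.ReadAt(p, off)
}

// splitReaderAt is a hypothetical router: reads starting at or after
// metaStart go to meta (intended to be a cached reader); all other reads
// go to data (intended to be uncached).
type splitReaderAt struct {
	data      io.ReaderAt
	meta      io.ReaderAt
	metaStart int64 // assumed offset where the metadata region begins
}

func (s splitReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off >= s.metaStart {
		return s.meta.ReadAt(p, off)
	}
	return s.data.ReadAt(p, off)
}
```

Routing this way makes caching opt-in per region, so large column chunk reads never reach the cache unless the caller decides they should.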

achille-roussel · 2022-06-21T18:15:21Z

Regarding reading column data, I wrote down some thoughts in #239; when you get a chance to read through, let me know what you think about it!

annanay25 · 2022-10-11T09:24:13Z

👍 will take a look, thanks! Do you think we need something more to proceed with this PR?


annanay25 added 3 commits April 19, 2022 20:57

Checkpoint: Initial additions to allow using secondary io.ReaderAts for column chunk data

e90537c

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>

Merge branch 'main' into add-support-for-secondary-io-readerAts

25d7e0c

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>

Add docs

e0d8244

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>

achille-roussel reviewed Jun 6, 2022


annanay25 added 2 commits June 17, 2022 14:59

Merge branch 'main' into add-support-for-secondary-io-readerAts

dc7dde7

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>

address feedback, refactor after rebasing over latest main

3bc75a8

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>

annanay25 commented Jun 17, 2022


Add test and update README

140bf37

Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>

annanay25 marked this pull request as ready for review June 20, 2022 11:30

achille-roussel added the feature label Jun 21, 2022

annanay25 mentioned this pull request Jun 22, 2022

Pass optional readers to override and cache metadata objects #249

Merged

annanay25 closed this Jun 22, 2022
