-
Notifications
You must be signed in to change notification settings - Fork 98
Add support for caching parquet metadata #180
Add support for caching parquet metadata #180
Conversation
…or column chunk data Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
config.go
Outdated
SkipBloomFilters bool | ||
SkipPageIndex bool | ||
SkipBloomFilters bool | ||
GetIOReaderFromPath func(filepath string) io.ReaderAt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using an interface type here instead of a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I'm not sure about what the initial signature should look like. The application should have enough information about the object to provide a secondary reader, so I'm not even sure we need to pass the column chunk metadata
config.go
Outdated
@@ -134,6 +136,7 @@ func (c *ReaderConfig) Validate() error { | |||
// | |||
type WriterConfig struct { | |||
CreatedBy string | |||
ColumnChunkFilePath string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be an interface which defines both the file path and the location where the writer would output the metadata? (e.g. a Name method and io.Writer so we could use a os.File in the simplest cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I removed the option of writing metadata to a different location. I realise we don't need this if we are only looking at caching metadata/indexes. We can make use of secondary (cache-free) io readers while reading column chunk data
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
Signed-off-by: Annanay Agarwal <annanay.agarwal@grafana.com>
@achille-roussel pinging on this one! |
I'm curious how the metadata end up being cached if the writer doesn't produce them, are you using a read-through cache then? |
Yes that's right, we are using a read-through. Are there any concerns with regard to that? |
No concerns, just wanted to get a good understanding of the use case :) |
I'm also thinking if it is good design to add an override for reading columnChunk data through a secondary reader... Would it be better design to add an override for reading metadata through the cached reader? That way we can incrementally add caching support for metadata objects to improve speed vis-a-vis the current situation where we could potentially end up caching large objects unintentionally |
Regarding reading column data, I wrote down some thoughts in #239; when you get a chance to read through, let me know what you think about it! |
👍 will take a look, thanks! Do you think we need something more to proceed
with this PR?
…On Tue, Jun 21, 2022, 11:45 PM Achille ***@***.***> wrote:
Regarding reading column data, I wrote down some thoughts in #239
<#239>; when you get a
chance to read through, let me know what you think about it!
—
Reply to this email directly, view it on GitHub
<#180 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACTZMS4LEWDJQSM3OEFEQODVQIBEHANCNFSM5WCYP2MQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
As per our discussion in #135, opening this draft PR to support caching parquet metadata.
The PR includes some plumbing to ensure the right options are supplied to the parquet reader/writers, allowing them to use secondary
io.ReaderAts
to fetch column chunk data.Marking draft to ensure we are aligned on the general design and direction of implementationDropped draft, updated README and tests.