Caching parquet metadata in an off-process cache #135
Adding a few additional thoughts:
Do you have more details to share about what metadata you would like to optimize access to?
I believe we would need to cache the entire metadata in order to avoid hitting the object storage to fetch anything other than data? Going through the code I think we use Schema and RowGroup information from the reader, but even so we might want to cache the entire metadata.
OK, I was wondering if you intended to also cache other parts of the file like column indexes or bloom filters.
This would be hard to do without duplicating the logic to decode footers of parquet files. A lower-level caching solution for the underlying `io.ReaderAt` might be a better fit.
This is really large indeed; it's almost questionable whether caching the metadata would improve access here: assuming the data is on some long-term storage medium, loading large blocks will be bound by I/O bandwidth, not latency, so it may be just as efficient to read the original files.
I'm curious if you have ideas of what that could look like. For reference, parquet already attempts to offer a solution to these problems with file references in the file metadata, where one file may contain a schema only and a reference to a remote file for the data. This is currently not supported in parquet-go but could be something to explore https://github.com/segmentio/parquet-go/blob/main/format/parquet.go#L763-L765
We could easily expose the `FilePath` field of column chunks.
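For context, this is roughly the shape of the fields in question in the parquet Thrift specification, paraphrased here as a Go struct (the declaration in format/parquet.go is the authoritative version):

```go
package sketch

// ColumnChunk paraphrases the relevant fields from the parquet Thrift
// specification; this is not the exact parquet-go declaration.
type ColumnChunk struct {
	// FilePath, when set, names another file that holds this column chunk's
	// data, which is what lets a metadata-only file reference remote data files.
	FilePath string
	// FileOffset is the byte offset within that file where the column chunk's
	// metadata is located.
	FileOffset int64
	// ... column metadata, column/offset index locations, etc.
}
```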
My bad, I forgot to mention that. We do intend to cache the dictionaries/bloom filters as well. The metadata definition seems to only store offsets for the indexes, but in a conversation with @mdisibio I think we discussed that the indexes are read as part of the metadata read as well, so I will dig deeper into that.
TIL! This is really interesting. Will give this some thought.
The string is an identifier in some file-system-like interface; we could have something similar to the standard library's `fs.FS`.

This model would not allow for caching of the dictionary pages though: those are part of the column chunks, so they wouldn't be available in the file containing metadata only. This might call for having two caching mechanisms, one for the parquet metadata (via the column chunk's `FilePath`) and one for the column chunk data.

I'm also curious how effective it would be to add a caching layer. Latency of reads from an object store would definitely be higher, but considering the large volumes of metadata to read from the storage layer, bandwidth would likely be the bottleneck. Would the performance then be roughly equivalent regardless of the remote backend serving the data?
Nice. It makes the read pattern slightly complex, but it should nevertheless do the trick with the least effort. I believe it will require an addition of the following form:
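A minimal sketch of what such an addition could look like, assuming a hypothetical hook passed in the file options (names are illustrative, not the actual parquet-go API):

```go
package sketch

import "io"

// OpenSecondaryFile is a hypothetical option the application would provide
// when opening a parquet file: given the FilePath recorded in a column chunk,
// it returns a reader over that secondary file along with its size.
type OpenSecondaryFile func(path string) (io.ReaderAt, int64, error)
```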
to get the secondary file handlers for a given filepath. These can then be used to read data from the object store.
@achille-roussel I've started making some changes in this branch over here: main...annanay25:add-support-for-secondary-io-readerAts. It seemed easy enough to edit the readers/writers to support this new interface.
So, with the above changes I was able to use two different `io.ReaderAt` implementations. But now I'm beginning to get concerned about the small reads happening as part of building the metadata. This results in the following access pattern:
Now we absolutely want to cache all of these, but it's making me wonder whether it's worth having two `io.ReaderAt` implementations. One solution, if we want to continue down this path, is that our caching strategy will somehow need to encode the file path along with the byte ranges being read.
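As an illustration of that kind of keying, here is a sketch of a caching `io.ReaderAt` whose cache keys combine the file path with the byte range being read (the cache interface and key format are hypothetical):

```go
package sketch

import (
	"fmt"
	"io"
)

// byteRangeCache is a stand-in for an off-process cache such as memcached or
// redis; only the operations needed by the sketch are included.
type byteRangeCache interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

// cachingReaderAt serves reads from the cache when possible and falls back to
// the object-store-backed reader otherwise.
type cachingReaderAt struct {
	path  string      // identifies the parquet file in the object store
	inner io.ReaderAt // reader backed by the object store
	cache byteRangeCache
}

func (r *cachingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	// Encode file path, offset and length so distinct ranges never collide.
	key := fmt.Sprintf("%s:%d:%d", r.path, off, len(p))
	if b, ok := r.cache.Get(key); ok {
		return copy(p, b), nil
	}
	n, err := r.inner.ReadAt(p, off)
	if err == nil {
		// ReadAt only returns err == nil for full reads, so len(p) bytes are cached.
		r.cache.Set(key, append([]byte(nil), p[:n]...))
	}
	return n, err
}
```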
Opening a file is quite a busy operation, lots of reads happen indeed.
This is reading the 4 bytes "PAR1" file header; it's just a sanity check, we could make it optional since it provides low value for the common case where the program knows that it's dealing with parquet files.
This one looks like the reading of the footer and page index, which is somewhat buffered? The first 8-byte read is for the "PAR1" footer magic + the length of the file metadata. We can't really do without it, but we could optimistically buffer a few KiB of footer data to avoid making small reads here.
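As a rough illustration of that idea (buffer size and error handling are arbitrary): fetch the tail of the file in one call, decode the 8-byte trailer from it, and only issue a second read if the metadata did not fit in the speculative buffer.

```go
package sketch

import (
	"encoding/binary"
	"errors"
	"io"
)

// readFooter speculatively reads the last few KiB of the file so that the
// 8-byte trailer (4-byte little-endian metadata length + "PAR1" magic) and,
// often, the metadata itself come back in a single call to the storage layer.
func readFooter(r io.ReaderAt, size int64) ([]byte, error) {
	const guess = 64 * 1024 // speculative tail size, tune for the storage layer
	if size < 8 {
		return nil, errors.New("file too small to be a parquet file")
	}
	n := int64(guess)
	if size < n {
		n = size
	}
	buf := make([]byte, n)
	if m, err := r.ReadAt(buf, size-n); int64(m) < n {
		return nil, err
	}
	if string(buf[n-4:]) != "PAR1" {
		return nil, errors.New("missing PAR1 footer magic")
	}
	metaLen := int64(binary.LittleEndian.Uint32(buf[n-8 : n-4]))
	if metaLen+8 <= n {
		return buf[n-8-metaLen : n-8], nil // metadata covered by the speculative read
	}
	// The metadata was larger than the guess: fall back to one exact read.
	meta := make([]byte, metaLen)
	_, err := r.ReadAt(meta, size-8-metaLen)
	return meta, err
}
```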
This looks like reading of a page header? The thrift decoder will consume bytes one by one, and offset 4 should have a page header. We could also reduce the number of small sequential reads happening with a small buffering mechanism (e.g. reading 4 KiB pages through a `bufio.Reader`).
I would look at having better buffering first, so we read data in pages of a few KiB instead of having the small read pattern we see here.
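A minimal sketch of that kind of buffering, assuming a wrapper around `io.ReaderAt` that serves small reads out of the last few-KiB page it fetched (sizes and names are illustrative, and the sketch is not safe for concurrent use):

```go
package sketch

import "io"

// bufferedReaderAt turns runs of small sequential reads (e.g. byte-at-a-time
// thrift decoding) into one larger read per page of the underlying reader.
type bufferedReaderAt struct {
	inner    io.ReaderAt
	size     int64  // total size of the underlying file
	pageSize int64  // e.g. 4 KiB
	pageOff  int64  // offset of the buffered page, -1 when nothing is buffered
	page     []byte // last page fetched from inner
}

func newBufferedReaderAt(inner io.ReaderAt, size, pageSize int64) *bufferedReaderAt {
	return &bufferedReaderAt{inner: inner, size: size, pageSize: pageSize, pageOff: -1}
}

func (b *bufferedReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off >= b.size {
		return 0, io.EOF
	}
	base := off - off%b.pageSize
	// Large reads, or reads crossing a page boundary, bypass the buffer.
	if off+int64(len(p)) > base+b.pageSize {
		return b.inner.ReadAt(p, off)
	}
	if b.pageOff != base {
		n := b.pageSize
		if base+n > b.size {
			n = b.size - base
		}
		buf := make([]byte, n)
		if m, err := b.inner.ReadAt(buf, base); int64(m) < n {
			return 0, err
		}
		b.page, b.pageOff = buf, base
	}
	start := off - base
	n := copy(p, b.page[start:])
	if n < len(p) {
		return n, io.EOF // the request extended past the end of the file
	}
	return n, nil
}
```

Wrapping the object-store reader this way would keep the byte-at-a-time thrift decoding pattern from reaching the storage layer, at the cost of one page-sized read per buffer miss.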
Thanks, I like the idea of using a small buffer. I'm on PTO for the next few days now, will continue to experiment once I'm back.
Hi all, I've been following along; great discussion. I definitely want to find a way to benefit from memcache/redis eventually, or sooner if there is something straightforward, but for now I agree that basic buffering is a good solution to investigate. I spent some time on that and have some thoughts.
I made this code public not too long ago, which serves a similar purpose to what you described, providing a caching layer to some underlying `io.ReaderAt`.
If more than one column is being read, it seems like read patterns would be somewhat random-ish: data for column chunk A would be located before column chunk B, but reading both A and B would cause reads at interleaving positions. I don't have much context on the test you ran, so I'm speculating, but generally there are plenty of reasons why read access may be somewhat random.
This is going to depend on the physical properties of the underlying storage layer; if latency is high we may prefer larger buffers. The size of contiguous reads is also an important metric to look at; if scans are very short there are not many gains from using large buffers. For S3, I wouldn't be shy about using a 1 MiB buffer size: from what you showed it causes a 2x size amplification but a 5x drop in the number of reads, which is an interesting ratio in my opinion. If you are reading from a local file system though, a buffer size in the 64-256 KiB range seems like a better trade-off: small size amplification, and it can bring a 2-3x reduction in I/O operations.
We could definitely make it configurable, though maybe the existing buffering strategy isn't very effective? For example, I'm not sure why the logs shared by @annanay25 showed a one-byte-at-a-time read pattern. The sequential reads should have been buffered and only require a single read from the underlying `io.ReaderAt`.
Addressed in #249
As discussed earlier, in Grafana Tempo we are planning to build an object-store-backed parquet database. At higher volumes, querying could mean retrieving metadata from a large number of parquet files, and we would like to optimise by caching it in an off-process cache (redis/memcached/...).
There are two strategies we have considered to implement this:
Thoughts?
Using this ticket as a placeholder to start discussions around this.