Hi! Can I somehow specify `row_groups` or other loading options, choose a different data format, or give certain specs during Parquet file creation in order to speed up reading when I include the uneven array? I am really only interested in one particular row (let's say row 533 of 1000 rows) and a subset of columns for that row, and one of those columns has an uneven array in it, as I said above. Any help appreciated.
Replies: 1 comment, 6 replies
I need to follow up on this when I have time to look things up, but I can provide some pointers in the meantime.

There's another function, `ak.metadata_from_parquet`, which reads the (small) metadata of a Parquet file but not the (large) data. In this metadata, there are fields for `num_entries`, `num_row_groups`, and also row-group by row-group information about exactly which entries (rows) are in each row group.

If you have a specific entry/row to read, or a specific range, `entry_start:entry_stop`, this can be expanded to `row_group_start:row_group_stop` by rounding down the start index and rounding up the stop index. (There is no way to read one entry; row groups are the smallest granularity that can be read from a Parquet file, so you read the row groups containing your entries and then slice out what you want.)

That trimming, to produce a given entry range by reading as few row groups as possible, could be automated, but it hasn't been (yet). It would be a good feature for us to add. But for now, you can get that information for yourself from `ak.metadata_from_parquet`.

For columns, it looks like you've already found the `columns` argument.
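The "round the start index down, round the stop index up" step described above can be sketched in a few lines of pure Python. This is a sketch, not the library's API: the helper name is hypothetical, and the exact metadata field that holds per-row-group row counts may differ between Awkward Array versions, so check `ak.metadata_from_parquet`'s output for your version.

```python
import bisect
from itertools import accumulate

def entry_range_to_row_groups(row_group_counts, entry_start, entry_stop):
    """Map an entry (row) range to the smallest range of row groups
    containing it.

    row_group_counts: number of rows in each row group, taken from the
    Parquet metadata (hypothetical source; verify the field name in
    ak.metadata_from_parquet's result for your version).
    Returns (rg_start, rg_stop, offset), where offset is the global
    entry index of the first row of row group rg_start, for trimming.
    """
    # Cumulative stop-offsets of each row group, e.g. [100, 200, 300, ...]
    stops = list(accumulate(row_group_counts))
    # Round down: first row group whose stop is strictly past entry_start.
    rg_start = bisect.bisect_right(stops, entry_start)
    # Round up: row group containing the last entry, +1 for a slice stop.
    rg_stop = bisect.bisect_right(stops, entry_stop - 1) + 1
    # Global offset of the first selected row group.
    offset = stops[rg_start - 1] if rg_start > 0 else 0
    return rg_start, rg_stop, offset

# Hypothetical usage (not executed here; file name and column names
# are placeholders):
#   meta = ak.metadata_from_parquet("file.parquet")
#   rg_start, rg_stop, offset = entry_range_to_row_groups(
#       per_row_group_counts_from(meta), 533, 534)
#   arr = ak.from_parquet("file.parquet",
#                         row_groups=range(rg_start, rg_stop),
#                         columns=["x", "y"])
#   row_533 = arr[533 - offset]
```

For example, with ten row groups of 100 rows each, entry 533 falls in row group 5 (entries 500–599), so you would read only that row group and then take local index 33.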