
Question/possible enhancement: Relationship to N5 + Arrow/Parquet? #515

Closed
wsphillips opened this issue Nov 15, 2019 · 11 comments

Comments

@wsphillips

A lot of the goals of Zarr overlap with N5, and I'm aware of the Zarr backend for N5. Is closer convergence between the two projects planned?

Also, and possibly more importantly: Has there been any communication with the Apache Arrow/Parquet project? Although their primary focus is on tabular data, support for tensors has been raised and noted on their GitHub/JIRA/mailing list--but simply shelved as a lower priority. Could/will Zarr eventually develop into a natural extension there, since many of the file spec characteristics (at least on the macro level) seem similar?

@alimanfoo
Member

alimanfoo commented Nov 16, 2019 via email

@wsphillips
Author

Excellent to hear--and congrats on the CZI funding. I wasn't aware of the development going on with the v3 spec.

As a follow-up: What are the team's opinions on TileDB? Is Zarr meant to target different use cases?

Also, I apologize for the many project comparisons. It's fantastic that there are multiple groups working on this problem; I'm just trying to get a sense of where everyone is aiming, and why, to better inform decision making about future data management at my own institution.

@jakirkham
Member

cc @jorisvandenbossche (in case you have thoughts on ways Arrow and Zarr could work together 🙂)

@alimanfoo
Member

alimanfoo commented Dec 9, 2019

As a follow-up: What are the team's opinions on TileDB? Is Zarr meant to target different use cases?

Short answer is that tiledb and zarr do target very similar use cases but take a different technical approach, particularly regarding how to avoid synchronisation problems when there are multiple writers to a single array. The essence of it (as I understand it) is that tiledb organises data into files or cloud objects based on write batches, whereas zarr takes a simpler approach and organises data into files or cloud objects based on chunks.

So, e.g., if you write some data to a region of an array, in tiledb the data you write will get stored as a single object, regardless of whether the region you're writing to is contained within a single chunk or overlaps multiple chunks. If you write multiple times to the same chunk, that will get stored as multiple objects. Reading the current state of the data in a chunk then requires traversing all overlapping writes in time order. Zarr stores each chunk in a separate object, i.e., there is only ever one object per chunk. So if the region you are writing overlaps multiple chunks then multiple objects will be updated (or created if they didn't already exist).
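
To make that concrete, here's a minimal sketch with zarr-python v2 and an in-memory dict as the store (shapes and chunking are purely illustrative): a write that spans several chunks creates or updates one store object per chunk, keyed by the chunk's grid coordinates.

import zarr

store = {}  # any MutableMapping can act as a zarr store
z = zarr.open_array(store=store, mode='w', shape=(100, 100),
                    chunks=(50, 50), dtype='f8')

# This region overlaps all four chunks, so four chunk objects get written.
z[0:75, 0:75] = 1.0

# Chunk objects are keyed by grid coordinates; '.zarray' holds the metadata.
print(sorted(k for k in store if not k.startswith('.')))
# ['0.0', '0.1', '1.0', '1.1']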

Because of this, tiledb is completely lock-free, i.e., you can have multiple writers writing to any region of an array in parallel. In zarr you can have multiple writers working in parallel, but you have to think about avoiding two writers ending up in contention for the same chunk, which most people do by aligning their write operations to chunk boundaries, but which can also be managed via chunk-level locks.
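
For example (a rough sketch with zarr-python v2; paths and shapes here are illustrative), the two strategies look like this:

import zarr

# Strategy 1: each writer updates a disjoint, chunk-aligned region,
# so no two writers ever touch the same chunk object.
z = zarr.open_array('example.zarr', mode='a', shape=(1000,),
                    chunks=(100,), dtype='i4')
z[0:100] = 1    # writer A: exactly chunk 0
z[100:200] = 2  # writer B: exactly chunk 1

# Strategy 2: chunk-level locks, so overlapping writes from multiple
# processes are serialised per chunk rather than per array.
sync = zarr.ProcessSynchronizer('example.sync')
z = zarr.open_array('example.zarr', mode='a', synchronizer=sync)
z[50:150] = 3   # spans chunks 0 and 1; each chunk update takes its lock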

The tradeoff is that the tiledb storage format is more complicated, and understanding how to optimise stored data for performance may not be so straightforward, because it may require steps like consolidating arrays. The zarr default storage layout is simpler: you can hack around with the data more easily using generic tools like filesystem utilities or a cloud object browser, and it's a bit more obvious how to tune performance by adjusting chunk size and shape.

(In fact zarr abstracts over storage in a way that would allow someone to write a storage backend for zarr that follows something like the tiledb approach, i.e., zarr is not tied to the one-chunk-one-object storage layout. But that's more about what is possible rather than what is currently implemented.)
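
To illustrate that seam (a small sketch with zarr-python v2; LoggingStore is a made-up example, not part of zarr), any MutableMapping of keys to bytes can serve as a store, and that mapping interface is where an alternative storage layout would plug in:

from collections.abc import MutableMapping

import numpy as np
import zarr

class LoggingStore(MutableMapping):
    """A made-up store: wraps a dict and prints every key zarr touches."""
    def __init__(self):
        self._d = {}
    def __getitem__(self, key):
        print('get', key)
        return self._d[key]
    def __setitem__(self, key, value):
        print('set', key)
        self._d[key] = value
    def __delitem__(self, key):
        del self._d[key]
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)

store = LoggingStore()
z = zarr.open_array(store=store, mode='w', shape=(10,), chunks=(5,), dtype='i8')
z[:] = np.arange(10)   # chunk writes show up as 'set 0' and 'set 1'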

There's lots more I haven't touched on; a comparison of zarr vs tiledb should probably become an issue of its own over on the zarr-community repo so others can chime in and we can flesh out the differences a bit more. I'm sure it would be useful for others.

Btw apologies for slow response on this.

@wsphillips
Author

This is a fantastic response, and thank you very much for taking the time to roughly summarize things.

And yes, the information you have here makes the different fundamental design choices made in the two projects much clearer. I can definitely imagine scenarios where either approach might be more or less advantageous depending on the type of data, access patterns, how many concurrent users, etc. Again, thanks!

@alimanfoo
Member

And yes, the information you have here makes the different fundamental design choices made in the two projects much clearer. I can definitely imagine scenarios where either approach might be more or less advantageous depending on the type of data, access patterns, how many concurrent users, etc. Again, thanks!

No problem at all. You put it very well; in general it's probably a good idea to try out both approaches with some real data in some realistic usage scenarios. And if people end up choosing tiledb then please feel free to tell us about it and why; it's technically very interesting to compare the two approaches and we're always interested to learn more about what people need :-)

@stavrospapadopoulos

@wsphillips thanks for pointing us to this discussion via the Julia forum.

@alimanfoo, thanks for these fair points on TileDB vs Zarr. If you will forgive the lengthy response, I’d like to give some background on the design goals motivating those technical decisions, and how TileDB has evolved to meet the needs of our heaviest users (primarily in genomics):

  1. allowing sample additions to sparse arrays over time, with rapid updates -- for genomic variant calling specifically, this is the "N+1" problem
  2. allowing space-, time-, and I/O-efficient queries over 10M+ ranges on a 100TB sparse array (a realistic dataset size for our largest genomics users)
  3. allowing versioning (“time traveling”) of un-consolidated datasets, which is important for auditability, database interaction, etc.

The tile/chunk-per-file approach does not work for sparse arrays, because a tile contains an ordered collection of non-empty cells rather than relying on some fixed space partitioning (which would risk the storage/memory explosion of densified arrays -- see 1, 2): for certain distributions, some space tiles may be full while others are almost empty.

Updating the sparse cells in a way that destroys that order, while keeping a fixed capacity in each tile, would require regrouping a massive collection of sparse cells into new tiles stored in new files. This was not feasible given our first goal, hence the decision to create new tiles for the new sparse cells and store them in a new immutable file.

With the addition of timestamps, the above design of writing separate "fragments" (i.e., update batches) very naturally enables time travel, i.e., reading snapshots of the array at some point in the past, without requiring a separate log (which can lead to consistency issues for parallel writes). Thus with TileDB, you are able to open an array at a timestamp and efficiently slice data as if no update had happened after that timestamp, or even undo an update (again, for un-consolidated arrays).
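
Roughly, with TileDB-Py that looks like the following (the array URI and domain are placeholders for an existing dense 1-D array, and the millisecond-timestamp convention is an assumption in this sketch):

import time
import numpy as np
import tiledb

uri = "my_dense_array"   # placeholder: an existing dense 1-D array, domain [0, 9]

# First write: stored as its own immutable fragment.
with tiledb.open(uri, mode="w") as A:
    A[0:10] = np.arange(10)

time.sleep(0.01)
ts_between = int(time.time() * 1000)   # assuming millisecond timestamps
time.sleep(0.01)

# Second write: a new fragment; the earlier one is left untouched.
with tiledb.open(uri, mode="w") as A:
    A[0:10] = np.arange(10) * 100

# Reading at the in-between timestamp returns a snapshot as if the
# second write had never happened; a normal open sees the latest state.
with tiledb.open(uri, mode="r", timestamp=ts_between) as A:
    snapshot = A[0:10]
with tiledb.open(uri, mode="r") as A:
    latest = A[0:10]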

There were some additional constraints with the inode count of a tile-per-file approach when using smaller tile sizes on some file systems (we had originally implemented a tile-per-file solution in 2015), but this is not much of a problem on cloud object stores like AWS S3, especially if you perform parallel reads (in fact, increasing object prefix count can reduce enforced slow-downs!).

Some additional notes:

  1. We are currently working on making the batch size configurable, so that the user can choose to store even a single tile per file.
  2. We are currently working on providing automatic consolidation and simplifying the parameters of consolidation where necessary, which indeed is currently rather involved. In an upcoming version, the user will not need to explicitly consolidate anything (although, a power user will still be able to tune consolidation if necessary).
  3. We have recently published and open-sourced TileDB-VCF, which models VCF data in TileDB, meeting the constraints above, and provides both Python (+Dask) and Spark dataframe integrations.
  4. Several of the constraints above are related to the issues described by @ryan-williams in "Database sources where each array element is a separate database row" (#438) -- the sparse model is a good fit for storing dataframes.

cc @ihnorton

@alimanfoo
Member

Hi @stavrospapadopoulos, just to briefly say thank you for adding this comment, I think it's hugely valuable to unpack these different technical approaches and to have a chance to discuss and understand them.

@jpivarski

I'd like to point out that the project I'll be working on to add Awkward Array as a Zarr v3 extension would be a way of accessing Arrow data through Zarr, though via a less direct route:

  • Apache Arrow <--> Awkward Array <--> Zarr v3

I've been talking about this as a possibility on zarr-developers/zarr-specs#62 and @martindurant developed an end-to-end demonstration in Zarr v2: https://github.com/martindurant/awkward_extras/blob/main/awkward_zarr/core.py

To show what I mean, suppose you have pyarrow>=2.0.0 and awkward>=1.0.2rc1 and the attached file, complicated-example.parquet.txt, which contains a simple Arrow array and a complicated one that shows off Arrow's rich data types.

>>> import pyarrow.parquet as pq
>>> import awkward as ak
>>> pyarrow_table = pq.read_table("complicated-example.parquet.txt")
>>> pyarrow_table["simple"].to_pylist()
[1, 2, 3]
>>> pyarrow_table["complicated"].to_pylist()
[
    [{'x': 0.0, 'y': []}, {'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': None}],
    [],
    [{'x': 3.3, 'y': [1, 2, 3]}, None, {'x': 4.4, 'y': [1, 2, 3, 4]}]
]

The Awkward Array library has an equivalent for each of the Arrow types, so even the complicated Arrow array can be viewed as an Awkward Array.

>>> awkward_array = ak.from_arrow(pyarrow_table["complicated"])
>>> awkward_array
<Array [[{x: 0, y: []}, ... 1, 2, 3, 4]}]] type='3 * var * ?{"x": float64, "y": ...'>
>>> awkward_array.tolist()
[
    [{'x': 0.0, 'y': []}, {'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': None}],
    [],
    [{'x': 3.3, 'y': [1, 2, 3]}, None, {'x': 4.4, 'y': [1, 2, 3, 4]}]
]
>>> awkward_array.type   # DataShape notation; see https://datashape.readthedocs.io/
3 * var * ?{"x": float64, "y": option[var * int64]}

The "Awkward Array <--> Zarr v3" project would use the ak.to_buffers function to decompose the data structure into one-dimensional buffers (shown here as a dict of NumPy arrays). The idea is to replace this dict with the Zarr group, as @martindurant has done, such that one Zarr group corresponds to one complicated array.

>>> form, length, container = ak.to_buffers(awkward_array)
>>> container
{
    'part0-node0-offsets': array([0, 3, 3, 6], dtype=int64),
    'part0-node1-mask': array([47], dtype=uint8),
    'part0-node3-data': array([0. , 1.1, 2.2, 3.3, 0. , 4.4]),
    'part0-node4-mask': array([1, 1, 0, 1, 0, 1], dtype=int8),
    'part0-node5-offsets': array([0, 0, 1, 1, 4, 4, 8], dtype=int64),
    'part0-node6-data': array([1, 1, 2, 3, 1, 2, 3, 4])
}

The data needed to reconstitute it are (a) the one-dimensional buffers, (b) this JSON "form", and (c) the length of the array. The JSON form could be added to the Zarr group as one more buffer, containing serialized JSON, as could the length, unless there's a better place for that in metadata.

>>> length
3
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "BitMaskedArray",
        "mask": "u8",
        "content": {
            "class": "RecordArray",
            "contents": {
                "x": {
                    "class": "NumpyArray",
                    "itemsize": 8,
                    "format": "d",
                    "primitive": "float64",
                    "form_key": "node3"
                },
                "y": {
                    "class": "ByteMaskedArray",
                    "mask": "i8",
                    "content": {
                        "class": "ListOffsetArray64",
                        "offsets": "i64",
                        "content": {
                            "class": "NumpyArray",
                            "itemsize": 8,
                            "format": "l",
                            "primitive": "int64",
                            "form_key": "node6"
                        },
                        "form_key": "node5"
                    },
                    "valid_when": true,
                    "form_key": "node4"
                }
            },
            "form_key": "node2"
        },
        "valid_when": true,
        "lsb_order": true,
        "form_key": "node1"
    },
    "form_key": "node0"
}

The v3 extension to the Zarr client would read it back with ak.from_buffers.

>>> reconstituted = ak.from_buffers(form, length, container)
>>> reconstituted
<Array [[{x: 0, y: []}, ... 1, 2, 3, 4]}]] type='3 * var * ?{"x": float64, "y": ...'>
>>> reconstituted.tolist()
[
    [{'x': 0.0, 'y': []}, {'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': None}],
    [],
    [{'x': 3.3, 'y': [1, 2, 3]}, None, {'x': 4.4, 'y': [1, 2, 3, 4]}]
]
>>> reconstituted.type
3 * var * ?{"x": float64, "y": option[var * int64]}

This may be different from what you're thinking about for direct Arrow <--> Zarr because this has Zarr treating the six buffers in this example group as opaque objects, and a v3 extension would present a whole Zarr group as one Awkward Array (complaining if awkward can't be loaded, probably). An Arrow array has a complex physical layout that would have to be navigated byte by byte in its contiguous form (the form used for IPC, for instance).

By exploding the data out into a set of non-contiguous buffers that are navigated by name, we can read fields of nested records (and partitions of rows, not shown here) lazily, on demand:

>>> class VerboseMapping:
...     def __init__(self, container):
...         self.container = container
...     def __getitem__(self, key):
...         print(key)
...         return self.container[key]
...
>>> lazy = ak.from_buffers(form, length, VerboseMapping(container), lazy=True)
>>> lazy.x
part0-node0-offsets
part0-node1-mask
part0-node3-data
<Array [[0, 1.1, 2.2], ... [3.3, None, 4.4]] type='3 * var * ?float64'>
>>> lazy.y
part0-node4-mask
part0-node5-offsets
part0-node6-data
<Array [[[], [1], None], ... [1, 2, 3, 4]]] type='3 * var * ?option[var * int64]'>

The x fields of all the records can be supplied by reading only the buffers in the Zarr group above them in the tree, and the y fields can be supplied by a different set of buffers. The same could be said of Arrow data in its non-contiguous buffer form (the buffers method in pyarrow, which is exactly what the Arrow <--> Awkward transformation takes advantage of).
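
As a rough sketch of the storage side (mine, not the linked demo from @martindurant), the container dict from above could be swapped for a Zarr v2 group: one 1-D zarr array per named buffer, with the form and length stashed in the group attrs. The path, the Form.tojson() call, the assumption that ak.from_buffers accepts the JSON form string, and the small ZarrMapping wrapper are all mine here, not part of either library.

import zarr
import awkward as ak

# Write: one 1-D zarr array per buffer, plus the form/length as group attrs.
group = zarr.open_group("awkward-demo.zarr", mode="w")   # illustrative path
for key, buf in container.items():
    group.array(key, buf)
group.attrs["form"] = form.tojson()   # assuming awkward 1.x Form.tojson()
group.attrs["length"] = length

# Read: a tiny mapping that materialises one buffer at a time, so lazy
# access to e.g. `.x` only touches the zarr arrays that field needs.
class ZarrMapping:
    def __init__(self, grp):
        self.grp = grp
    def __getitem__(self, key):
        return self.grp[key][:]

group2 = zarr.open_group("awkward-demo.zarr", mode="r")
roundtrip = ak.from_buffers(group2.attrs["form"], group2.attrs["length"],
                            ZarrMapping(group2), lazy=True)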

Some of the early comments in this thread were about Arrow's "tensor" type, which is included in the to_buffers/from_buffers transformation but not shown in the demo.

On today's Zarr call (2020-12-16), we talked about partial reading of buffers, which could come in handy here. Each of these buffers is a simple array, indexed by the one above it in the tree.

@martindurant
Member

@jpivarski, you might want to start a new thread, since this is rather old.

I will add, though, that I would not be surprised if the "bunch of 1D arrays" in zarr, as a storage format for potentially deeply nested arrow datasets, were as performant in size, speed, and chunkability as parquet.

@jpivarski

Thanks for the suggestion! I thought this was the best place because it already has everyone who's interested in Arrow <--> Zarr.

This issue was moved to a discussion. You can continue the conversation there.