Some thoughts about data tree design #9

@c42f

Hi Shashi, it was nice to chat about this!

I had some thoughts about the design and how it relates to what I've been thinking about:

  • I feel like the data structure you have here isn't about files (though "file" is an evocative name); rather it's a general, hierarchically path-indexed and in-memory tree data structure. So perhaps DataTree would be a more descriptive name. I almost called one of my own types this in my prototype! But in the end I settled on FileTree because the thing I've written so far is a lazy view which is explicitly backed by the filesystem rather than being reflected in memory.
  • In general it should be possible to lazily or eagerly initialize an in-memory "DataTree" from any of various data storage backends, not limited to the filesystem. I think this is already on your mind :)
  • I think what I want for DataSets.jl is a general lazy interface for hierarchically-indexed trees of data which storage/deserialization backend code can implement. The idea for DataSets.jl is that one can name such data storage locations declaratively and open them, yielding a Julia type which can read them lazily. Then we would have a whole family (S3Tree, ZipTree, etc.) with the same basic interface.
  • A lazy loading interface would be similar in spirit to how Tables.jl provides an interface which tabular data sources can satisfy. I feel like the fundamental parts of this interface are pretty basic: iteration over children, selecting a child by name, traversing into a child tree node, and possibly attribute/metadata access per node (considering how HDF5 attributes and filesystem stat info could be represented).
  • The idea of having data parallel operations on the tree able to use Dagger.jl seems cool as a high level API for particular workloads where the tree structure happens to match the desired partitioning of work. I think this is a common case, but not always what you want. In any case, processing is definitely out of scope for DataSets.jl!
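
To make the interface idea above a bit more concrete, here is a minimal sketch of what the "basic parts" might look like. All names here (`AbstractDataTree`, `childnames`, `child`, `metadata`, `MemTree`, `lookup`, `leafpaths`) are hypothetical and made up for illustration; they are not the API of DataSets.jl or of your package. The in-memory backend exists only to exercise the interface, and a lazy backend would implement the same three functions.

```julia
# Hypothetical interface; none of these names come from an existing package.
abstract type AbstractDataTree end

# A storage backend would implement these three functions for its tree type.
function childnames end   # names of the immediate children of a node
function child end        # select one child (subtree or leaf data) by name
function metadata end     # per-node attributes, e.g. filesystem stat info

# An eager in-memory backend, used here only to exercise the interface.
struct MemTree <: AbstractDataTree
    children::Dict{String,Any}   # values are MemTree nodes or leaf data
    attrs::Dict{Symbol,Any}
end
MemTree(children) = MemTree(children, Dict{Symbol,Any}())

childnames(t::MemTree) = sort!(collect(keys(t.children)))
child(t::MemTree, name::AbstractString) = t.children[String(name)]
metadata(t::MemTree) = t.attrs

# Generic code below is written only against the interface functions.

# Traverse into child nodes following a '/'-separated path.
function lookup(t::AbstractDataTree, path::AbstractString)
    node = t
    for part in split(path, '/'; keepempty=false)
        node = child(node, String(part))
    end
    return node
end

# Collect the paths of all leaves by iterating over children recursively.
function leafpaths(t::AbstractDataTree, prefix="")
    paths = String[]
    for name in childnames(t)
        c = child(t, name)
        p = isempty(prefix) ? name : prefix * "/" * name
        if c isa AbstractDataTree
            append!(paths, leafpaths(c, p))
        else
            push!(paths, p)
        end
    end
    return paths
end

tree = MemTree(Dict{String,Any}(
    "a.csv" => "x,y\n1,2\n",
    "sub"   => MemTree(Dict{String,Any}("b.txt" => "hello")),
))

leafpaths(tree)             # ["a.csv", "sub/b.txt"]
lookup(tree, "sub/b.txt")   # "hello"
```

Since `lookup` and `leafpaths` only call `childnames` and `child`, the same generic code should work unchanged for a filesystem-, zip- or S3-backed tree that defines those methods lazily.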

In general, I think we're building something related but largely complementary: in DataSets.jl I'm focusing on how one lazily reads the data index and data "from disk" (or another static location). I want to declaratively define such data locations and systematically turn that config into Julia objects the user can work with in their program, and to have a system for moving such data between storage backends, etc. (Of course, DataSets.jl isn't restricted to trees. In principle the same ideas apply to the many tabular data formats, and to data we'd often consider a "single file", e.g. large images or other multidimensional arrays.)
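
As a rough illustration of "turning config into Julia objects", something like the following is what I have in mind. This is only a sketch under assumptions: the config keys (`"name"`, `"type"`, `"path"`), the `BACKENDS` registry, `open_dataset`, and the two location types are all hypothetical names invented for this example, not an existing API.

```julia
# Hypothetical sketch; config keys and type names are illustrative only.
struct FileTreeLocation
    path::String
end

struct ZipTreeLocation
    path::String
end

# Registry mapping a declared "type" string to a constructor taking the config.
const BACKENDS = Dict{String,Function}(
    "FileTree" => cfg -> FileTreeLocation(cfg["path"]),
    "ZipTree"  => cfg -> ZipTreeLocation(cfg["path"]),
)

# Turn one declarative entry into a Julia object the program can work with.
function open_dataset(cfg::AbstractDict)
    make = get(BACKENDS, cfg["type"]) do
        error("unknown storage type: $(cfg["type"])")
    end
    return make(cfg)
end

# A declarative description of a data location (this could come from a config file).
cfg = Dict("name" => "raw_logs", "type" => "ZipTree", "path" => "data/logs.zip")
loc = open_dataset(cfg)   # a ZipTreeLocation wrapping "data/logs.zip"
```

The point is that user code only names the data location, and the choice of backend type is driven entirely by the config.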
