Hi Shashi, it was nice to chat about this!
I had some thoughts about the design and how it relates to what I've been thinking about:
- I feel like the data structure you have here isn't about files (though "file" is an evocative name); rather it's a general, hierarchically path-indexed, in-memory tree data structure. So perhaps `DataTree` would be a more descriptive name. I almost called one of my own types this in my prototype! But in the end I settled on `FileTree` because the thing I've written so far is a lazy view which is explicitly backed by the filesystem rather than being reflected in memory.
- In general it should be possible to lazily or eagerly initialize an in-memory "DataTree" from any of various data storage backends, not limited to the filesystem. I think this is already on your mind :)
- I think what I want for DataSets.jl is a general lazy interface for hierarchically-indexed trees of data which storage/deserialization backend code can implement. The idea for DataSets.jl is that one can name such data storage locations declaratively and `open` them, yielding a Julia type which can read them lazily. Then we would have a whole family of types (`S3Tree`, `ZipTree`, etc.) with the same basic interface.
- A lazy loading interface would be similar in spirit to how Tables.jl provides an interface which tabular data sources can satisfy. I feel like the fundamental parts of this interface are pretty basic: iteration over children, selecting a child by name, traversing into a child tree node, and possibly attributes/metadata access per node (considering how HDF5 attributes and filesystem `stat` info could be represented).
- The idea of having data parallel operations on the tree able to use Dagger.jl seems cool as a high level API for particular workloads where the tree structure happens to match the desired partitioning of work. I think this is a common case, but not always what you want. In any case, processing is definitely out of scope for DataSets.jl!
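To make the interface idea above a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: `AbstractDataTree`, `childnames`, `child`, `metadata` and `leafpaths` are names invented for illustration, and the `DataTree` and `FileTree` structs are minimal stand-ins rather than the actual types in either of our packages. It only assumes Julia 1.4 or later (for sorted `readdir` output).

```julia
# Hypothetical interface sketch; none of these names are an existing API.
abstract type AbstractDataTree end

# The three functions a storage backend would implement:
function childnames end   # childnames(tree) -> names of the children
function child end        # child(tree, name) -> subtree or leaf data
function metadata end     # metadata(tree) -> per-node attributes

# Generic behavior defined once in terms of the interface.
Base.getindex(t::AbstractDataTree, name::AbstractString) = child(t, name)
children(t::AbstractDataTree) = (child(t, n) for n in childnames(t))

# Backend 1: an eager, in-memory tree.
struct DataTree <: AbstractDataTree
    children::Dict{String,Any}
    attrs::Dict{Symbol,Any}
end
DataTree(children) = DataTree(children, Dict{Symbol,Any}())

childnames(t::DataTree) = sort!(collect(keys(t.children)))
child(t::DataTree, name) = t.children[name]
metadata(t::DataTree) = t.attrs

# Backend 2: a lazy view backed by the filesystem.
# Nothing is read from disk until a child is requested.
struct FileTree <: AbstractDataTree
    path::String
end

childnames(t::FileTree) = readdir(t.path)   # sorted by default
function child(t::FileTree, name)
    p = joinpath(t.path, name)
    # Directories become subtrees; files are read as text on access.
    isdir(p) ? FileTree(p) : read(p, String)
end
metadata(t::FileTree) = stat(t.path)

# A generic traversal which works with any backend.
# Note: it calls `child` on every leaf, so for FileTree it reads each file;
# a real design would probably hand back a lazy leaf handle instead.
function leafpaths(t::AbstractDataTree, prefix="")
    paths = String[]
    for name in childnames(t)
        c = child(t, name)
        p = isempty(prefix) ? name : prefix * "/" * name
        if c isa AbstractDataTree
            append!(paths, leafpaths(c, p))
        else
            push!(paths, p)
        end
    end
    return paths
end

# Usage with the in-memory backend:
mem = DataTree(Dict("a.txt" => "hello",
                    "sub"   => DataTree(Dict("b.txt" => "world"))))
leafpaths(mem)
mem["sub"]["b.txt"]
```

The point is that `leafpaths` never mentions a concrete backend, so an `S3Tree` or `ZipTree` would only need to supply its own `childnames`, `child` and `metadata` methods.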
In general, I think we're building something related but largely complementary: in DataSets.jl I'm focusing on how one lazily reads the data index and data "from disk", or from some other static location. I want to declaratively define such data locations and systematically turn that config into Julia objects the user can work with in their program. I'd also like to have a system to move such data between storage backends, etc. (Of course, DataSets.jl isn't restricted to trees. In principle the same ideas apply to the many tabular data formats, and to data we'd often consider as a "single file", e.g. large images or other multidimensional arrays.)