bundler/partitioner middleware #2708

@jorgeorpinel

See treeverse/dvc.org/issues/682 for context.

It seems that large datasets (in the TBs) tend to get bundled and/or partitioned in different ways, using storage systems and formats such as HDFS, HDF5, or TFRecord files. This poses a challenge for DVC data versioning, which calculates checksums at the file (or directory) level.
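To make the problem concrete, here is a minimal sketch (not DVC's actual code) of why a file-level checksum is a poor fit for bundled data. It assumes MD5 as a stand-in for the file checksum, and `CHUNK_SIZE`, `file_checksum`, `chunk_checksums`, and `changed_chunks` are hypothetical names made up for illustration. A small edit inside one bundle changes the whole-file checksum, while a per-chunk comparison shows that only one part actually changed.

```python
import hashlib
from itertools import zip_longest
from typing import List

CHUNK_SIZE = 4  # hypothetical partition size, tiny on purpose for the demo


def file_checksum(data: bytes) -> str:
    """Checksum of the whole bundle (MD5 assumed here as a stand-in)."""
    return hashlib.md5(data).hexdigest()


def chunk_checksums(data: bytes, size: int = CHUNK_SIZE) -> List[str]:
    """One checksum per fixed-size chunk of the bundle."""
    return [
        hashlib.md5(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def changed_chunks(old: bytes, new: bytes, size: int = CHUNK_SIZE) -> List[int]:
    """Indexes of chunks whose checksum differs (or that exist on one side only)."""
    pairs = zip_longest(chunk_checksums(old, size), chunk_checksums(new, size))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


old = b"aaaabbbbccccdddd"
new = b"aaaabbbbcXccdddd"  # one byte edited inside the third chunk

print(file_checksum(old) == file_checksum(new))  # False: the whole file looks new
print(changed_chunks(old, new))                  # only chunk index 2 differs
```

With file-level versioning, the edited bundle would be stored and transferred again in full even though most of its content is unchanged.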

What would be the easiest way to extend DVC to support this kind of dataset storage practice? Perhaps even a tool separate from DVC itself: some sort of middleware that provides transparency between DVC commands and the actual dataset, however it's organized into bundles and partitions.
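One possible shape for such a middleware, purely as a sketch under stated assumptions: split a bundle into chunk files named by their content hash, plus a manifest that records the order, so the result is a plain directory that could be tracked like any other (for example with `dvc add`). The functions `explode` and `reassemble`, the `manifest.json` file name, and the fixed-size chunking are all hypothetical choices for illustration, not an existing DVC feature.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def explode(bundle: Path, out_dir: Path, chunk_size: int = 1 << 20) -> List[str]:
    """Split a bundle into content-addressed chunk files plus a manifest."""
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    with bundle.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            name = hashlib.md5(chunk).hexdigest()
            # Identical chunks map to the same file name, so they are stored once.
            (out_dir / name).write_bytes(chunk)
            manifest.append(name)
    (out_dir / "manifest.json").write_text(json.dumps(manifest))
    return manifest


def reassemble(out_dir: Path, target: Path) -> None:
    """Rebuild the original bundle by concatenating chunks in manifest order."""
    manifest = json.loads((out_dir / "manifest.json").read_text())
    with target.open("wb") as f:
        for name in manifest:
            f.write((out_dir / name).read_bytes())
```

A limitation of fixed-size chunking is that inserting bytes near the start of a bundle shifts every later chunk boundary, so a real tool would more likely split along the format's own record or partition boundaries.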
