Alternative hashings for data #1676

@dunstantom

Since md5 is sensitive to the order and format of the data, simple changes to the schema (e.g. swapping two columns) or to a column's type (e.g. integer to float) lead to new hash values and duplicated datasets. Some alternatives attempt to address this, such as UNF (http://guides.dataverse.org/en/latest/developers/unf/index.html).
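To make the problem concrete, here is a minimal sketch using only Python's standard `hashlib`. The three CSV strings are made-up sample data, and `md5_hex` is a helper named here for illustration; this is not how DVC computes its hashes internally, only a demonstration that md5 over the serialized bytes changes when the logical table does not.

```python
import hashlib

# Made-up sample data: the same two-row table serialized three ways.
csv_a = "id,value\n1,10\n2,20\n"        # baseline
csv_b = "value,id\n10,1\n20,2\n"        # columns swapped
csv_c = "id,value\n1,10.0\n2,20.0\n"    # integer column written as floats


def md5_hex(text: str) -> str:
    """Return the md5 hex digest of the UTF-8 bytes of `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


digests = {
    "baseline": md5_hex(csv_a),
    "swapped columns": md5_hex(csv_b),
    "int as float": md5_hex(csv_c),
}

# Each serialization gets its own digest, so each would be
# stored as a separate dataset.
for name, digest in digests.items():
    print(f"{name:16s} {digest}")
```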

It would be great to be able to specify an alternative hash function in DVC, and in particular to provide a user-defined one.
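As a rough idea of what such a user-defined function could look like, here is a sketch of a hash that ignores column order and integer/float formatting for CSV data. Everything in it is an assumption for illustration: the names `normalized_table_hash` and `_canonical_cell` are hypothetical, the canonicalization is ad hoc and is not the UNF algorithm, and nothing here is an existing DVC API. How such a function would be registered with DVC is left open.

```python
import csv
import hashlib
import io


def _canonical_cell(raw):
    """Hypothetical helper: render numeric cells uniformly, leave the rest as-is."""
    if raw is None:  # short row: DictReader fills missing cells with None
        return ""
    try:
        # "10" and "10.0" both become "10.0". Caveats: very large integers
        # lose precision, and strings like "nan" or "inf" parse as floats.
        return repr(float(raw))
    except ValueError:
        return raw


def normalized_table_hash(csv_text: str) -> str:
    """Hypothetical user-defined hash over a CSV table.

    Insensitive to column order and to int/float formatting.
    Still sensitive to row order.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = sorted(reader.fieldnames or [])  # fixed column order
    h = hashlib.sha256()
    h.update("\x1f".join(columns).encode("utf-8") + b"\n")
    for row in reader:
        cells = [_canonical_cell(row[c]) for c in columns]
        h.update("\x1f".join(cells).encode("utf-8") + b"\n")
    return h.hexdigest()


# Made-up sample data: one logical table, three serializations.
a = "id,value\n1,10\n2,20\n"
b = "value,id\n10,1\n20,2\n"        # columns swapped
c = "id,value\n1,10.0\n2,20.0\n"    # integers written as floats
changed = "id,value\n1,10\n2,21\n"  # genuinely different data

print(normalized_table_hash(a) == normalized_table_hash(b))        # same table
print(normalized_table_hash(a) == normalized_table_hash(c))        # same table
print(normalized_table_hash(a) == normalized_table_hash(changed))  # different data
```

The trade-off is that the function has to parse the data, so it is slower than md5 over raw bytes and only works for formats it understands, which is one reason a pluggable, per-dataset hash would be more practical than a single built-in replacement.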

Labels: feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)
