Skip to content

Misclassification of binary files as text files #3364

@kevin-hanselman

Description

@kevin-hanselman

In the case where binary files have a large text header, the current
checksum routine will treat said files as text files and normalize
line-endings before performing the checksum. Not only is it dangerous to
manipulate binary files like this, it also doubles the runtime of the
checksum routine, as every block of data must be read twice.

As noted in #3264, at the minimum, DVC should probably match
Git's text file detection routine, which interrogates the first 8 kilobytes
(and doesn't do heuristics on ratio of printable characters,
as DVC currently does).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugDid we break something?p3-nice-to-haveIt should be done this or next sprint

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions