Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up re-signing via per-file hash #83

Closed
laurentsimon opened this issue Dec 15, 2023 · 3 comments
Closed

Speed up re-signing via per-file hash #83

laurentsimon opened this issue Dec 15, 2023 · 3 comments

Comments

@laurentsimon
Copy link
Collaborator

laurentsimon commented Dec 15, 2023

Certain models like huggingface repositories currently do not support versioning. This means that signing multiple times with small diffs can be expensive. It may be useful to have a format listing per-file hashes, in order to speed up re-hashing. Example:

path/a/b ...hash1...
file2       ...hash2...

This file would be kept under the model folder or in another configuration folder (probably the latter?). When signing, the caller would give us a list of files that need to be re-computed (from git diff), and our library would only re-hash the necessary files. We'd also pass back the recomputed file to the caller, so that they can store it where they want. The user would need a git hook that does all this for them.

This file could be the final file we sign, or a helper file that is only stored on a dev machine to speed up signing when making changes to the repo. (probably the latter?)

The main danger (?) is that:

  1. If someone edits the file (by mistake), we'd be re-computing the hash wrong.
  2. Folks start sharing this file. It would only take one bad actor to backdoor the model for multiple users (?)

@McPatate any thoughts?

@McPatate
Copy link

McPatate commented Dec 18, 2023

Certain models like huggingface repositories currently do not support versioning.

Do you mean per-file versioning? You have access to a git log, which can be interpreted as a versioning system since you know if a file changed across commits & the commit id being the version.

Regarding 1., we could encode the file in a binary format that is not human readable.

Regarding 2., could we not sign the file each time it's changed?

@laurentsimon
Copy link
Collaborator Author

laurentsimon commented Dec 18, 2023

Certain models like huggingface repositories currently do not support versioning.

Do you mean per-file versioning?

No, just repository versioning.

You have access to a git log, which can be interpreted as a versioning system since you know if a file changed across commits & the commit id being the version.

Correct. I thought the huggingface API did not support git versioning and it could only download at head. I was wrong. from_pretrained supports git tag, called revision (https://huggingface.co/docs/transformers/main_classes/model). So effectively users can tag upon release. I don't think the feature is commonly used by users today.. but probably not a problem for us.

Regarding 1., we could encode the file in a binary format that is not human readable.

Regarding 2., could we not sign the file each time it's changed?

+1, I had a similar idea. I think that should work.

@McPatate do you think we need this optimization at all? The motivation was to improve signing speed because we thought that model versioning was not supported by the huggingface API, but that assumption was wrong. Do you think it'd be enough to:

  1. Nudge users to version their models (git tag)
  2. Provide a CLI they can run when they push a tag. (the CLI in this repo when it's v1 ready)

If users version their models, they'd only need to sign when the model actually changes.. which means they need to re-sign large files and cannot use the per-file hash anyway... Wdut?

@laurentsimon
Copy link
Collaborator Author

Closing in favor of #111

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants