
Manifest file #111

Closed · laurentsimon opened this issue Jan 18, 2024 · 37 comments · Fixed by #267

@laurentsimon (Collaborator)

laurentsimon commented Jan 18, 2024

Our current code signs / serializes folders using a custom hash built on sha256. It works well but has 3 disadvantages:

  1. We need to update rekor hash support (a very small change for rekord).
  2. AI frameworks that need to verify a subset of files (e.g., a tf file in a Hugging Face repo) won't be able to take advantage of existing signatures.
  3. When small files change, it's expensive to re-hash the entire model (see "Speed up re-signing via per-file hash" #83).

A workaround is to create and sign a dedicated "manifest" file that lists all files within a directory with their corresponding hashes (similar to SHA256SUMS output), rather than signing the output of the folder serialization. With our current code, however, this creates 2 files: the manifest and a signature (on the manifest), which is bad UX-wise, especially for single-file models.

What we want is a single file. I think a solution to this problem is to use a DSSE envelope (supported by Sigstore and widely used by tools like cosign). DSSE lets us define a payload which is stored in the envelope, alongside the signature. This payload is the content of our manifest. We can use intoto / json as the format since it's widely adopted.

sigstore-python added support for signing intoto statements in their main branch (yet to be released), so I think we can use that. All we need is to define our own predicate and its format, something like:

{
    "_type": "https://in-toto.io/Statement/v1",
    "predicateType": "https://github.com/openssf/model-signing/manifest/v1",
    "predicate": [
        {
            "path": "path/to/file1",
            "digest": {
                "sha256-p1": "..."
            },
            "bla": "..."
        },
        {
            "path": "path/to/file2",
            "digest": {
                "sha256-p1": "..."
            }
        }
    ]
}
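A minimal sketch of how such a statement could be packed into a DSSE envelope, using only the standard library (the field names mirror the draft above and are illustrative, not final; real signatures would come from a signing tool such as sigstore-python):

```python
import base64
import json

# Hypothetical statement following the draft predicate above.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "predicateType": "https://github.com/openssf/model-signing/manifest/v1",
    "predicate": [
        {"path": "path/to/file1", "digest": {"sha256-p1": "..."}},
        {"path": "path/to/file2", "digest": {"sha256-p1": "..."}},
    ],
}

# DSSE stores the payload base64-encoded next to the signature(s),
# so the manifest and its signature travel in a single file.
envelope = {
    "payload": base64.b64encode(json.dumps(statement).encode()).decode(),
    "payloadType": "application/vnd.in-toto+json",
    "signatures": [],  # filled in by the signing tool
}

# The payload round-trips losslessly out of the envelope.
decoded = json.loads(base64.b64decode(envelope["payload"]))
print(decoded["_type"])  # prints https://in-toto.io/Statement/v1
```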
@laurentsimon (Collaborator, Author)

I've tested the signing part and it works. Next we need verification, blocked on sigstore/sigstore-python#628

@laurentsimon (Collaborator, Author)

laurentsimon commented Jan 22, 2024

Note to myself: https://github.com/in-toto/attestation/blob/main/spec/v1/statement.md says: "Set of software artifacts that the attestation applies to. Each element represents a single software artifact. Each element MUST have digest set."

So either we use a dummy subject, or we move the predicate content to the subject. I'm leaning towards the former, because the subject format is limited, whereas a custom predicate gives us room to evolve / version its format.

@McPatate

Btw regarding your question in #83 :

If users version their models, they'd only need to sign when the model actually changes... which means they need to re-sign large files and cannot use the per-file hash anyway. Wdyt?

We can get a blob id from git, which is basically a SHA1 that lets you know if the file has changed in a given revision. We also have a git diff endpoint to check changes between two revisions. Not sure we'd need an extra tool to create tags, as all our endpoints should be accessible and usable with https://github.com/huggingface/huggingface_hub
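For reference, git's blob id can be recomputed outside git with just hashlib; this is a sketch of the git object hashing scheme, not a huggingface_hub API:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute git's blob object id: sha1 over 'blob <size>\\0' + content."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

# Matches `git hash-object` for the same content:
print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

This is why the blob id is a cheap change detector: if the bytes are identical, the id is identical, regardless of where or when the file was committed.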

@laurentsimon (Collaborator, Author)

Thanks for the info. Do you think we'll need to build a specific tool / integration like a git hook to improve UX? (I can create a separate issue for this)

@McPatate

McPatate commented Feb 13, 2024

a git hook

Do you mean a pre-commit hook? I'm not sure people will have their git repos checked out, afaik people upload files via huggingface_hub which abstracts the whole git logic by creating commits via our REST API.

I guess we could think of integrating directly in huggingface_hub, cc @Wauplin for visibility

@mihaimaruseac (Collaborator)

Integrating with huggingface_hub seems like a strong approach to me.

@laurentsimon (Collaborator, Author)

laurentsimon commented Mar 14, 2024

@TomHennen mentioned we could put the manifest content in the subject's resource descriptor's content field.
Yet another possibility is hashing the manifest / predicate, but we'd need to canonicalize the JSON predicate.
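To illustrate the canonicalization concern, here is a rough approximation with the standard library; a real implementation would follow RFC 8785 (JCS), which additionally pins down number and string encodings:

```python
import hashlib
import json

def naive_canonical_digest(obj) -> str:
    # Sort keys and strip whitespace so the same logical predicate
    # always serializes to the same bytes. NOTE: this is only a sketch;
    # RFC 8785 (JCS) also specifies number and string serialization.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two serializations of the same logical predicate hash identically:
a = {"path": "f1", "digest": {"sha256": "abcd"}}
b = {"digest": {"sha256": "abcd"}, "path": "f1"}
assert naive_canonical_digest(a) == naive_canonical_digest(b)
```

Without a canonical form, semantically identical predicates can produce different digests, which is exactly the problem being raised here.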

@TomHennen

Right that's definitely something that could work (and I'm happy to discuss it more if anyone is interested).

However! After reading the initial description of this issue I'm pretty sure this can be accomplished with the existing in-toto statement type.

AI frameworks that need to verify a subset of files (e.g., a tf file in a hugging face repo) won't be able to take advantage of existing signatures.

I think this can be supported easily and without any changes using the 'name' field as described here.

AI model verifiers would check that the filename that changed matches the name field and corresponding digests in the attestation and could ignore the subjects that don't apply. In this way I don't think you'd need a separate manifest (but perhaps I'm missing something).

Either way, I think this is definitely a use case that https://github.com/in-toto/attestation wants to support, so if it's missing anything let's chat about it? CC @marcelamelara, @pxp928, @mikhailswift.

@mihaimaruseac (Collaborator)

I think @SantiagoTorres was also suggesting to use in-toto

@laurentsimon (Collaborator, Author)

laurentsimon commented Mar 15, 2024

AI frameworks that need to verify a subset of files (e.g., a tf file in a hugging face repo) won't be able to take advantage of existing signatures.

I think this can be supported easily and without any changes using the 'name' field as described here.

Nope, that does not work. If you do that, each file will be represented as an individual artifact. We don't want arbitrary tools to miss verifying some of the files. The way it needs to be interpreted for verification (for this specific case) is that the caller needs to give us the list of files that may be present. Say, PyTorch has config files, a list of model_*.bin, etc. The caller must give us this list, otherwise an attacker can remove files. Imagine if the config file is removed, and it's the one that says to run unpickling (pickle.load()) in a sandbox: an attacker would bypass signature verification entirely and get a shell on the machine.
So we should not list artifacts in the subject, because the verification does not match the semantics of intoto verification, i.e., that each subject is an independent artifact. That would lead to vulnerabilities, so we should avoid it.
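The "exact set" semantics described above can be sketched as a plain map comparison (a hypothetical helper, not the project's actual verifier):

```python
def verify_exact_set(expected: dict[str, str], on_disk: dict[str, str]) -> bool:
    """Both maps go from file path to hex digest.

    Verification succeeds only if the sets of paths are identical AND
    every digest matches: a removed, added, or renamed file must fail.
    """
    return expected == on_disk

# Hypothetical manifest for a two-file model:
manifest = {"config.json": "aaaa", "model.bin": "bbbb"}
assert verify_exact_set(manifest, {"config.json": "aaaa", "model.bin": "bbbb"})
# An attacker removing config.json must cause verification to fail:
assert not verify_exact_set(manifest, {"model.bin": "bbbb"})
```

Contrast this with per-subject matching, where each file passes or fails independently and a removed file is simply never checked.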

@TomHennen mentioned we could put the manifest content in the subject's resource descriptor's content field.

Although this technically works, I don't think it's a good approach either. It effectively adds an intoto layer that provides no benefit. Instead of being able to access the manifest content as dsse.payload, we need to access it by going through another layer, dsse.subject.content. Adding the intoto layer does not help interoperability either, because the verification logic is entirely dependent on interpreting the manifest content, so existing tools that support intoto won't be able to make sense of the content. At this point, imo it's actually better to use DSSE with a non-intoto statement.

One approach that could work is waiving the requirement of having a subject in the intoto attestation based on the predicateType. Would this be possible or would it go against the intoto framework?

@TomHennen

This doesn't sound like something that's specific to ML models but could occur in many other SW ecosystems.

I'm not sure I entirely understand the workflow though. Is there anyplace I can learn more?

@laurentsimon (Collaborator, Author)

laurentsimon commented Mar 18, 2024

This doesn't sound like something that's specific to ML models but could occur in many other SW ecosystems.

I'm not sure. In the sense that we want to support both "full repository signing" (to start with) and "AI framework signing", which requires verifying a subset of the files signed for the "full repository signing". An example of this is a Hugging Face repository, where users sign all files under the repo. The repo often contains the same model in different formats. You can use Keras or PyTorch or another framework to load a model (using a subset of files; once we have integrated verification into these frameworks' APIs), or you can verify the signature for the entire repo before loading it. So TL;DR: we want the existing signature for the full repo to be usable for verification by ML framework Load() APIs.

However, full repository signing is not ML-specific, i.e., anyone who wants to sign a folder will need to verify the exact set of files. What's specific to ML is the size of these files (several hundred GB), so it's still kinda ML-specific in that sense.

@TomHennen

Is there anyplace I can find out more about what you're looking for? Any existing design docs?

It sounds like the idea is "We have a lot of files with different paths and digests. Sometimes we want to check all of them, and sometimes we want to check a subset of them." Is there more to it than that?

@laurentsimon (Collaborator, Author)

It sounds like the idea is "We have a lot of files with different paths and digests. Sometimes we want to check all of them, and sometimes we want to check a subset of them." Is there more to it than that?

I think that's pretty much all there is to it, with specific use cases I mentioned above. Let me know if you have further questions.

@TomHennen

So it sounds like it's totally possible to represent a model in the in-toto Statement 'subject' field.

Is the issue that the in-toto Statement rules are perceived as too restrictive and don't allow for the all/subset use case? If so, can you say more about what that perceived limitation is? Something in https://github.com/in-toto/attestation/blob/main/docs/validation.md ?

@mihaimaruseac (Collaborator)

Chatted with Tom right now, and it seems in-toto can be used to model all of our scenarios. I'll do a deeper dig and add a comment with how things could work, at which point we can switch to in-toto, if it turns out there are no other blockers.

@laurentsimon (Collaborator, Author)

laurentsimon commented Mar 22, 2024

So it sounds like it's totally possible to represent a model in the in-toto Statement 'subject' field.

Constraining ourselves to subject fields is not ideal imo. It basically adds technical debt early in the project, because the subject field is not flexible. If we ever want to evolve the format, we need to use annotations for that. It's also awkward to pack into the subject what ought to live in the predicate: a subject is for the artifacts the attestation describes, not the attestation itself (it defeats the purpose of having a predicate).

I want to stress that putting things in the subject means that each subject is considered an artifact, and can be verified independently of the others as per the intoto specs. That means other tools following the intoto specs (cosign) would verify the attestation in an unsafe way, i.e., they would pass verification if they are given the wrong subset of files to verify. This problem is due to semantic differences, and unless the intoto specs change, it can't be fixed, iiuc.

It's "possible" to express something via intoto, but that does not mean it's the right design if it does not benefit users and security.

Looking forward to the proposal.

@marcelamelara

marcelamelara commented Mar 22, 2024

Jumping into the discussion after following along so far.

@laurentsimon based on your latest comment, what you really want to get out of a manifest predicate is to be able to make a claim along the lines of "this is the expected set of files for model X". Is that a reasonable interpretation of your use case?

If so, could your use case be addressed by something like an existing BOM format? I also know that C2PA has the notion of manifests, is that something you've looked into as a possible predicate? I'm mostly trying to avoid us reinventing the wheel when it comes to developing predicates.

I completely agree that the in-toto Statement subject is not designed to represent claims about artifacts. That said, what would the subject for this manifest predicate be? The model as a logical artifact still needs to be identified. What representation of the model would be used, would a reference to the hosting repo of the model files be sufficient?

tools following intoto specs (cosign) would verify the attestation in an unsafe way, ie they would pass verification if they are given the wrong subset of files to verify.

To confirm my understanding here, your concern is that an attacker could omit legitimate files that are needed for a model to operate correctly? Wouldn't a manifest predicate be susceptible to the same problem?

@SantiagoTorres

Hi all, dropping by. I spent a little bit of time thinking about this before jumping in. My understanding is that this is "manageable" using a combination of ITE-4 + a custom hasher (as per the spec). For example:

"subject": {"name": "inline+aipredicate://", "sha256sum": "whatever"}

Where whatever can be any specified way to "hash" such a predicate. If we wanted to be cute about it, we could use a MHT of the file directory. Even cuter, it could follow the git tree object structure to avoid reinventing the wheel.

I'm not fully sure going through a fully qualified SBOM is a reasonable way to go, given that they are in quite a flux, and I believe here we are trying to focus on flow integrity rather than down-the-line transparency.

I completely agree that the in-toto Statement subject is not designed to represent claims about artifacts. That said, what would the subject for this manifest predicate be? The model as a logical artifact still needs to be identified. What representation of the model would be used, would a reference to the hosting repo of the model files be sufficient?

I'm not entirely sure I understand this part; you could certainly have a predicate that says "this was reviewed by a legal team" or "the dataset used is considered un-biased". Am I missing something here?

@laurentsimon (Collaborator, Author)

laurentsimon commented Mar 25, 2024

@laurentsimon based on your latest comment, what you really want to get out of a manifest predicate is to be able to make a claim along the lines of "this is the expected set of files for model X". Is that a reasonable interpretation of your use case?

Yes. E.g., when we sign all files in a huggingface repo, we want to treat all files "together".

If so, could your use case be addressed by something like an existing BOM format? I also know that C2PA has the notion of manifests, is that something you've looked into as a possible predicate? I'm mostly trying to avoid us reinventing the wheel when it comes to developing predicates.

We have not. Thanks for the idea. We can explore these for inspiration.

To confirm my understanding here, your concern is that an attacker could omit legitimate files that are needed for a model to operate correctly?

Correct

I completely agree that the in-toto Statement subject is not designed to represent claims about artifacts. That said, what would the subject for this manifest predicate be?

There would be no subject. In effect we don't even need to wrap the manifest into an intoto statement. We'd simply use DSSE with this manifest schema inside. The semantics would be different from the intoto model. The hack I was alluding to in the issue description was to either use an empty subject or a dummy hash value, neither of which is particularly appealing :/

The model as a logical artifact still needs to be identified. What representation of the model would be used, would a reference to the hosting repo of the model files be sufficient?

The manifest effectively contains the list of files that comprise the model, so the manifest is the representation of the model.

Wouldn't a manifest predicate be susceptible to the same problem?

It depends how we define the semantics for this manifest schema. If the schema lists all files and the verification semantics say that the exact same files listed in the manifest MUST match for verification to succeed, then we don't have this problem. (Here I'm assuming we don't use intoto at all. If we do use intoto, we still have this problem I think).

@laurentsimon (Collaborator, Author)

laurentsimon commented Mar 25, 2024

Hi all, dropping by. I spent a little bit of time thinking about this before jumping in. My understanding is that this is "manageable" using a combination of ITE-4 + a custom hasher (as per the spec). For example:

"subject": {"name": "inline+aipredicate://", "sha256sum": "whatever"}

What would you hash in this case? If we hash the predicate, we have to canonicalize the JSON predicate, which we should avoid.

Where whatever can be any specified way to "hash" such a predicate. If we wanted to be cute about it, we could use a MHT of the file directory. Even cuter, it could follow the git tree object structure to avoid reinventing the wheel.

If we hash the directory to identify the model, why use a manifest that lists the files? We'd end up hashing twice: once for the manifest and once for the model identification. We're trying to avoid serializing the directory because we want to support fast re-hashing, e.g., when a README file changes. Serializing the directory is what the current implementation in this repo does, and we've decided (with feedback from folks at huggingface) that it's not the right approach.

@TomHennen

I want to stress that putting things in the subject means that each subject is considered an artifact, and can be verified independently of the others as per the intoto specs. That means other tools following the intoto specs (cosign) would verify the attestation in an unsafe way, i.e., they would pass verification if they are given the wrong subset of files to verify. This problem is due to semantic differences, and unless the intoto specs change, it can't be fixed, iiuc.

So, I think the root of the semantic problem boils down to "Reject if matchedSubjects is empty", and what you need is more flexibility?

To me it seems like one thing that might be tripping us up is that policy is getting mixed up with how we make statements about things.

Is it correct to say that, if we ignore the opinionated validation model, it is possible to express all the files in a model in an in-toto Statement and to express the properties of that model in the predicate?

Could this be resolved by relaxing the validation model and giving verifiers (and policy owners) more freedom to say "all files must be verified and present" (among other things)?

@MarkLodato

MarkLodato commented Apr 10, 2024

I think this should use Statement and not invent some new format.

I want to stress that putting things in the subject means that each subject is considered an artifact, and can be verified independently of the others as per the intoto specs. That means other tools following the intoto specs (cosign) would verify the attestation in an unsafe way, i.e., they would pass verification if they are given the wrong subset of files to verify. This problem is due to semantic differences, and unless the intoto specs change, it can't be fixed, iiuc.

This is false. I think you are misreading the specification.

The following pseudocode shows how to verify and extract metadata about a single artifact from a single attestation

TODO: Explain how to process multiple artifacts and/or multiple attestations.

It is unspecified how to verify a collection of artifacts because that will depend on each use case. In this particular case, the verifier should simply check all files. It would be crazy to accept a collection of files {A, B, C} if any one of them matches. I can't imagine a scenario where that would be acceptable.

@laurentsimon (Collaborator, Author)

laurentsimon commented Apr 10, 2024

I think this should use Statement and not invent some new format.

If the use case works for Statement, for sure. Can you please address the various concerns we highlighted in the previous comments?

It is unspecified how to verify a collection of artifacts because that will depend on each use case. In this particular case, the verifier should simply check all files. It would be crazy to accept a collection of files {A, B, C} if any one of them matches. I can't imagine a scenario where that would be acceptable.

Here's what the specs say:

there exists at least one (alg, value) in s.digest where:
   - alg is in acceptableDigestAlgorithms AND
   - hash(alg, artifactToVerify) == hexDecode(value)

@MarkLodato

Can you please address the various concerns we highlighted in the previous comments?

Can you enumerate those concerns? The only one I can find is the misconception that we are discussing now.

Here's what the specs say:

there exists at least one (alg, value) in s.digest where:
   - alg is in acceptableDigestAlgorithms AND
   - hash(alg, artifactToVerify) == hexDecode(value)

Right, that is what you do for each artifact. Again, I'll quote:

The following pseudocode shows how to verify and extract metadata about a single artifact from a single attestation

Perhaps you are confused about matchedSubjects being a list rather than a scalar? This is because a single artifact can match multiple subject entries. For example, a single artifact could match both A and B because they list two different hash algorithms:

"subject": [
    {"digest": {"sha256": "abcd1234"}, "name": "A"},
    {"digest": {"sha512": "98765432"}, "name": "B"}
]

or you could have a subject that contains file copies with identical hashes:

"subject": [
    {"digest": {"sha256": "abcd1234"}, "name": "A"},
    {"digest": {"sha256": "abcd1234"}, "name": "copy-of-A"}
]

We should update the spec to avoid this confusion, but to be clear, nothing in the spec says how to verify a collection of artifacts. It's unspecified.


Let's use a concrete example: TensorFlow SavedModel. Suppose the attestation were this:

"subject": [
    {"digest": {"sha256": "aaaa"}, "name": "assets/foo"},
    {"digest": {"sha256": "bbbb"}, "name": "variables/variables.data-00000-of-00002"},
    {"digest": {"sha256": "cccc"}, "name": "variables/variables.data-00001-of-00002"},
    {"digest": {"sha256": "dddd"}, "name": "variables/variables.index"},
    {"digest": {"sha256": "eeee"}, "name": "saved_model.pb"}
]

If you want to use the logic that all of the file must match, with no renames, additions, or deletions, then you'd do something like this:

def verify_all(artifactsToVerify: Struct[name, digest], attestation, *args):
  # Verify that all of the files on disk have a corresponding entry in the subject.
  for a in artifactsToVerify:
    matchedSubjects, ... = verify_single(a.digest, *args)
    if a.name not in [s.name for s in matchedSubjects]:
      error
  # Verify that there are no missing files.
  for s in attestation.statement.subject:
    if s.name not in [a.name for a in artifactsToVerify]:
      error

Now if you want to be less strict, you'd need to figure out what policy you want to apply. Do you allow renames? Added files? Removed files? That's all OK - it's up to the verifier logic.

For example:

aaaa  assets/name-other-than-foo
6666  assets/another-file
bbbb  variables/variables.data-00000-of-00002
cccc  variables/variables.data-00001-of-00002
eeee  saved_model.pb

Note that assets/foo was renamed assets/name-other-than-foo, assets/another-file was added, and variables/variables.index was deleted.

@laurentsimon (Collaborator, Author)

laurentsimon commented Apr 10, 2024

Thanks @MarkLodato . That is the crux of the problem.

We should update the spec to avoid this confusion, but to be clear, nothing in the spec says how to verify a collection of artifacts. It's unspecified.

That would be nice, yes. So what you're saying is that it's up to the predicateType to dictate the verification semantics? I was looking for a claim like this but could not find it.

Another problem is https://github.com/in-toto/attestation/blob/main/spec/v1/statement.md Set of software artifacts that the attestation applies to. Each element represents a single software artifact.

In our case, each file is not a software artifact. The model artifact is the set of files. A (published) package is made up of multiple artifacts (like in a release attestation); but the ML artifact is made up of a collection of files.

@MarkLodato

Another problem is https://github.com/in-toto/attestation/blob/main/spec/v1/statement.md Set of software artifacts that the attestation applies to. Each element represents a single software artifact.

In our case, each file is not a software artifact. The model artifact is the set of files. A (published) package is made up of multiple artifacts (like in a release attestation); but the ML artifact is made up of a collection of files.

That's just a terminology nitpick. Our definition of "artifact" is "an immutable blob of data". You could represent an ML model as either a collection of artifacts (one per file) or as a single artifact (one hash over all the files). Either way works.

Another problem in the previous comments is #111 (comment), i.e., early technical debt / difficulty evolving the format if we try to pack information into subjects. This information, given that we consider the model the artifact, should really be inside the predicate, imo. Otherwise we need to split "global" info into the predicate and file info into the subjects.

What specifically do you want to put in there? Could you give a concrete example?

@laurentsimon (Collaborator, Author)

laurentsimon commented Apr 11, 2024

Another problem is https://github.com/in-toto/attestation/blob/main/spec/v1/statement.md Set of software artifacts that the attestation applies to. Each element represents a single software artifact.
In our case, each file is not a software artifact. The model artifact is the set of files. A (published) package is made up of multiple artifacts (like in a release attestation); but the ML artifact is made up of a collection of files.

That's just a terminology nitpick. Our definition of "artifact" is "an immutable blob of data". You could represent an ML model as either a collection of artifacts (one per file) or as a single artifact (one hash over all the files). Either way works.

If that works it's fine. I would encourage updating the terminology, because I did not see this defined. All the use cases I've seen in intoto involve the subject being an independent artifact.

Other intoto maintainers on the thread, can you confirm that the predicateType dictates how to verify the subject list?

Another problem in the previous comments is #111 (comment), i.e., early technical debt / difficulty evolving the format if we try to pack information into subjects. This information, given that we consider the model the artifact, should really be inside the predicate, imo. Otherwise we need to split "global" info into the predicate and file info into the subjects.

What specifically do you want to put in there? Could you give a concrete example?

You can't know the unknown, and we have to plan for it (that's what technical debt is about). But to give an example: we don't really need the intoto Statement to make things work. We don't need an additional intoto wrap / level of indirection. Instead, we can use:

DSSE:

{
  "payload": "...",
  "payloadType": "model/signing",
  "signatures": [...]
}

And manifest:

{
  // Some global fields.
  "version": x,
  "some-other-property": "...",
  // The files in this model.
  "files": [
    {
      "path": "...",
      // Any field we want, not tied to the ResourceDescriptor.
      "whatever_we_want": "..."
    }
  ]
}

That's the simple alternative to using an intoto statement for our specific use case, and it's the solution other ecosystems use (Java, npm). We don't have to separate the files from the rest of the manifest / predicate. What are the advantages of using intoto in this use case?

Again, we're not against using intoto (we created this issue in the first place mentioning it!). We're only trying to gather pros and cons for each approach to help with the final decision.

One argument in favor of intoto for this use case is that it may simplify upgrading tooling to support SLSA provenance (since provenance uses intoto format).

@TomHennen

Other intoto maintainers on the thread, can you confirm that the predicateType dictates how to verify the subject list?

I'm not sure it's even up to the predicateType? We have existing use cases where it's the user (or the tooling they're using) that would know best what to do. E.g. it might be a matter of policy.

@MarkLodato

What are the advantages of using intoto in this use case?

So that you don't have to roll your own format and tooling. It's the advantage of any standard. Since there are no extra fields that you anticipate needing, and there is a way to add them should you ever need them, I strongly recommend using the in-toto Statement here. It does everything you need.

@laurentsimon (Collaborator, Author)

laurentsimon commented Apr 12, 2024

What are the advantages of using intoto in this use case?

So that you don't have to roll your own format and tooling. It's the advantage of any standard. Since there are no extra fields that you anticipate needing, and there is a way to add them should you ever need them, I strongly recommend using the in-toto Statement here. It does everything you need.

The problem is that there is no tooling that supports what we're trying to achieve, i.e., we're rolling our own. The verification semantics are also our own. I can argue that the JSON format we use is an implementation detail. Existing tooling is a source of risk, since it won't understand our intoto predicate and will probably screw up verification by interpreting the subject field as "any one must match" instead of "the exact set must match". Take cosign as an example: anyone who'd use it to verify would probably end up shooting themselves in the foot. Many large projects already made mistakes when using it to verify slsa-github-generator provenance. This problem is coming right at us if we use intoto "because it's a standard". I'd rather have existing tools fail than silently pass verification.

@marcelamelara

Our definition of "artifact" is "an immutable blob of data". You could represent an ML model as either a collection of artifacts (one per file) or as a single artifact (one hash over all the files). Either way works.

Seconded. The in-toto attestation spec does not dictate that the subject artifact(s) MUST be executable code. It's meant to be quite generic and apply to any immutable blob of data (e.g., files, binaries, packages, even other attestations).

Perhaps you are confused about matchedSubjects being a list rather than a scalar? This is because a single artifact can match multiple subject entries. We should update the spec to avoid this confusion.

We'd definitely like to edit the spec to address the concerns here. So we're clear on what to change: Are you looking for more explicit language on validation based on subject artifacts, more clarity around the subject field in a Statement? Both?

@laurentsimon (Collaborator, Author)

laurentsimon commented Apr 14, 2024

That's just a terminology nitpick.

I'm not sure it is. Take the example of SLSA provenance. If users were to set one subject entry per model file, how would an end-user know they need to verify all of them? The description of the verification only says that the statement's subject matches the digest of the artifact in question. There seems to be an implicit assumption that an artifact is "self-contained". Do you envisage creating different predicates for SLSA provenance?

We'd definitely like to edit the spec to address the concerns here. So we're clear on what to change: Are you looking for more explicit language on validation based on subject artifacts, more clarity around the subject field in a Statement? Both?

I suppose giving concrete examples for different use cases, in particular the one that requires all subjects to match. Is it dictated by the predicate type (my question) or by the policy (the verification mentions a "policy engine", and @TomHennen also mentioned that)? Could you give an example for SLSA? Providing some real code would help too. How about taking this discussion to a tracking issue on the intoto repo?

Note: I've not looked at whether existing tooling like cosign outputs matchedSubjects or the entire intoto statement as-is.

@laurentsimon
Copy link
Collaborator Author

That's just a terminology nitpick.

I'm not sure it is. Take the example of SLSA provenance. If users were to set subject entry per model file, how would an end-user know they need to verify all of them? The description of the verification only says that statement’s subject matches the digest of the artifact in question. There seems to be an implicit assumption that an artifact is "self-contained". Do you envisage creating different predicates for SLSA provenance?

To give a concrete example. Say we have recipients RSet and ROneOf. RSet needs to verify a set of files. ROneOf only has a use case with single-file artifacts. An attacker gives ROneOf a file with a provenance intended for RSet. ROneOf accepts it, because it's unaware that verification requires set verification. I think that's an attack we want to protect against, correct?

We'd definitely like to edit the spec to address the concerns here. So we're clear on what to change: Are you looking for more explicit language on validation based on subject artifacts, more clarity around the subject field in a Statement? Both?

I suppose giving concrete examples for different use cases, in particular the one that requires all subjects to match. Is it dictated by the predicate type (my question) or by the policy (the verification mentions a "policy engine", and @TomHennen also mentioned that)? Could you give an example for SLSA? Providing some real code would help too. How about taking this discussion to a tracking issue on the intoto repo?

Note: I've not looked at whether existing tooling like cosign outputs matchedSubjects or the entire intoto statement as-is.

To add context: the manifest we created in this issue can be signed and added to a Sigstore bundle using messageSignature. This works just fine. There is a UX drawback though: we need 2 files, the manifest and the signature. For single-file models, we decided that was not acceptable. So what we're after is a way to attach the signature to the manifest in a single file, and that's exactly what DSSE gives us.
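To illustrate how DSSE keeps the manifest and its signature in one file, here is a minimal sketch. The PAE (Pre-Authentication Encoding) follows the DSSE spec; the HMAC is only a stand-in for the real Sigstore signing flow, and the manifest shape is illustrative, not our actual predicate:

```python
import base64
import hashlib
import hmac
import json


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE Pre-Authentication Encoding: the byte string that gets signed."""
    return b" ".join([
        b"DSSEv1",
        str(len(payload_type)).encode(), payload_type.encode(),
        str(len(payload)).encode(), payload,
    ])


def make_envelope(manifest: dict, key: bytes) -> dict:
    """Wrap a manifest in a DSSE envelope.

    HMAC stands in for the real signer; Sigstore would use a
    keyless certificate instead.
    """
    payload_type = "application/vnd.in-toto+json"
    payload = json.dumps(manifest).encode()
    sig = hmac.new(key, pae(payload_type, payload), hashlib.sha256).digest()
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode(),
        "signatures": [{"sig": base64.b64encode(sig).decode()}],
    }
```

The envelope is a single JSON document carrying both the payload (our manifest) and the signature over it, which is the single-file property we want.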

@mihaimaruseac mihaimaruseac mentioned this issue May 14, 2024
@mihaimaruseac mihaimaruseac added this to the V1 release milestone May 14, 2024
mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jun 4, 2024
This is the middle layer of the API design work (sigstore#172). We add a manifest abstract class to represent various manifests (sigstore#111 sigstore#112) and also ways to serialize a model directory into manifests and ways to verify the manifests.

For now, this only does what was formerly known as `serialize_v0`. The v1 and the manifest versions will come soon.

Note: This has a lot of inspiration from sigstore#112, but makes the API work with all the usecases we need to consider right now.

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>
mihaimaruseac added a commit that referenced this issue Jun 5, 2024

* Migrate `serialize_v0` to new API.

This is the middle layer of the API design work (#172). We add a manifest abstract class to represent various manifests (#111 #112) and also ways to serialize a model directory into manifests and ways to verify the manifests.

For now, this only does what was formerly known as `serialize_v0`. The v1 and the manifest versions will come soon.

Note: This has a lot of inspiration from #112, but makes the API work with all the usecases we need to consider right now.

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>

* Clarify some comments

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>

* Encode name with base64

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>

* Add another test case

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>

* Empty commit to retrigger DCO check.

See dcoapp/app#211 (comment)

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>

---------

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>
@font
Copy link
Member

font commented Jun 29, 2024

Have we made a final decision on the format of this manifest file? It appears we're settling on a DSSE envelope consisting of intoto statements. However, I've also seen reference to leveraging the sigstore-python signing APIs, which returns a Sigstore Bundle that contains the cert, DSSE envelop content, and the HashedRekor entry data. That seems to suggest that the Sigstore Bundle would be the content of the manifest file i.e. the way it currently works in this repo right now where model.sig contains the Sigstore Bundle.

@laurentsimon
Copy link
Collaborator Author

We agreed on using sigstore bundle as the "wire format". The "manifest" refers to the DSSE payload, which is inside the sigstore bundle.
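With the bundle as the wire format, a consumer recovers the manifest by unwrapping the DSSE payload. A minimal sketch, assuming the bundle's JSON form carries the envelope under `dsseEnvelope` with a base64-encoded `payload` (per the Sigstore bundle protobuf's JSON encoding); real code would verify the signature first via sigstore-python rather than trusting the payload:

```python
import base64
import json


def manifest_from_bundle(bundle_json: str) -> dict:
    """Extract the manifest (the DSSE payload) from a Sigstore bundle.

    NOTE: for illustration only -- this skips signature verification,
    which must happen before the payload is trusted.
    """
    bundle = json.loads(bundle_json)
    envelope = bundle["dsseEnvelope"]
    return json.loads(base64.b64decode(envelope["payload"]))
```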

mihaimaruseac added a commit to mihaimaruseac/model-transparency that referenced this issue Jul 24, 2024
THIS IS DRAFT, WIP. Will split into separate PRs once it works. But
posting publicly to show what the plans are (sigstore#224, sigstore#248, sigstore#240, sigstore#111).

Signed-off-by: Mihai Maruseac <mihaimaruseac@google.com>