
Multiple significant security vulnerabilities in the design of data integrity #272

Open
tplooker opened this issue Jun 21, 2024 · 76 comments
Labels
CR1 This item was processed during the first Candidate Recommendation phase.

Comments

@tplooker

tplooker commented Jun 21, 2024

The following issue outlines two significant security vulnerabilities in data integrity.

For convenience in reviewing the content below, here is a Google Slides version outlining the same information.

At a high level, both vulnerabilities exploit the "Transform Data" phase of data integrity in different ways, a process that is unique to cryptographic representation formats involving processes such as canonicalisation/normalisation.

In effect, both vulnerabilities allow a malicious party to swap the key and value of arbitrary attributes in a credential without invalidating the signature. For example, as the worked examples in the attached presentation show, an attacker could swap their first and middle name, or their employment and over18 status, without invalidating the issuer's signature.

The first vulnerability is called the unprotected term redefinition vulnerability. It exploits a design issue in JSON-LD where the term protection feature offered by the @protected keyword does not cover terms that are defined using the @vocab and @base keywords. This means any terms defined using @vocab and @base are vulnerable to term redefinition.

The second vulnerability exploits the fact that a document signed with data integrity leaves critical portions of the document unsigned, namely the @context element of the JSON-LD document. That the @context element is unsigned, combined with the fact that it plays a critical part in the proof generation and proof verification procedures, is a serious flaw that leaves data integrity documents open to many forms of manipulation that cannot be detected by validating the issuer's signature.

Please see the attached presentation for the resolutions to this issue that we have explored.

In my opinion, the only solution that will provide adequate protection against these forms of attack is to fundamentally change the design of data integrity so that the @context element is integrity protected. I recognise this would be a significant design change; however, I do not see an alternative that would prevent variants of this attack from continuing to appear over time.

I'm also happy to present this analysis to the WG if required.

@msporny msporny added the CR1 This item was processed during the first Candidate Recommendation phase. label Jun 21, 2024
@dlongley
Contributor

I believe that the core of the issue highlighted above is in a lack of validation on the information that is to be verified. Any protected information or data must be validated and understood prior to consumption, no matter the protection mechanism. However, when a protection mechanism allows multiple expressions of the same information (a powerful tool), it may be important to better highlight this need. This is especially true in the three party model, where there is no simple two-party agreement and known context between issuers and verifiers, i.e., the scale or scope of the VC ecosystem is much larger when parties totally unknown to the issuer can consume their VCs.

Certainly not understanding the context in which a message is expressed (or meant to be consumed) can lead to mistakes, even when that message is authentic. For example, a message that expresses "i authorize you to act on item 1", even if verified to be authentically from a particular source, can be misapplied in the wrong context (e.g., "item 1" was supposed to mean X, when it was misinterpreted as Y). In short, the context under which data is consumed must be well known and trusted by the consumer, no matter the protection mechanism.

We might want to add some examples to the specification that show that the information in documents can be expressed in one context and transformed into another. This could include showing an incoming document that is expressed using one or more contexts that the consumer does not understand, which can then be transformed using the JSON-LD API to a context that is trusted and understood. This would also help highlight the power of protection mechanisms that enable this kind of transformation.

For example, a VC that includes terms that are commonly consumed across many countries and some that are region specific. By using the JSON-LD API, a consumer that only understands the global-only terms can apply such a context to ensure that the terms they understood will appear as desired and other region-specific terms are expressed as full URLs, even when they do not understand or trust the regional context. All of this can happen without losing the ability to check the authenticity of the document.

We can also highlight that simpler consumers continue to be free to outright reject documents that are not already presented in the context that they trust and understand, no matter their authenticity.

@tplooker
Author

tplooker commented Jun 21, 2024

I believe that the core of the issue highlighted above is in a lack of validation on the information that is to be verified. Any protected information or data must be validated and understood prior to consumption, no matter the protection mechanism. However, when a protection mechanism allows multiple expressions of the same information (a powerful tool), it may be important to better highlight this need. This is especially true in the three party model, where there is no simple two-party agreement and known context between issuers and verifiers, i.e., the scale or scope of the VC ecosystem is much larger when parties totally unknown to the issuer can consume their VCs.

The fundamental point of digital signatures is to reduce the information that needs to be trusted prior to verification. Most modern technologies, e.g. SD-JWT, mDoc, JWT, and JOSE/COSE at large, do this successfully, meaning a relying party only needs to trust a public key before attempting to verify the signature of an otherwise untrusted payload. If the signature check fails, the payload can be safely discarded without undue expense.

The problem with data integrity is that this assumption does not hold. In essence, the relying party doesn't just need the public key of the issuer/signer, but also all possible JSON-LD context entries that the issuer may or may not use; if any of these are corrupted or manipulated, or untrusted ones are injected, the attacks highlighted in this issue become possible. Whether it is even possible to share these contexts appropriately at scale is another question, but these attacks demonstrate at a minimum that an entirely unique class of vulnerabilities exists because of this design choice.

Certainly not understanding the context in which a message is expressed (or meant to be consumed) can lead to mistakes, even when that message is authentic. For example, a message that expresses "i authorize you to act on item 1", even if verified to be authentically from a particular source, can be misapplied in the wrong context (e.g., "item 1" was supposed to mean X, when it was misinterpreted as Y). In short, the context under which data is consumed must be well known and trusted by the consumer, no matter the protection mechanism.

The point I'm making is not about whether a relying party should understand the context of a message it has received; it's about when it should attempt to establish this context. Doing so prior to validating the signature is dangerous and leads to these vulnerabilities.

For instance, a JSON-LD document can be signed with a plain old JWS signature (as in JOSE/COSE); once the signature is validated, one can then process it as JSON-LD to understand the full context, if one so wishes. The benefit of this approach is that if the JSON-LD contexts have been manipulated (i.e. the context of the message), the relying party will have safely discarded the message before even reaching this point, because the signature check will have failed. Data integrity, on the other hand, requires this context processing to happen as part of signature verification, thus leading to these issues.
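
As a rough illustration of that ordering (a minimal sketch only; verifyJwsDetached and resolveIssuerPublicKey are hypothetical helpers, while jsonld.expand is the standard jsonld.js API):

const jsonld = require('jsonld');

// 1. Verify the signature over the raw bytes first (hypothetical JWS helpers).
const issuerKey = await resolveIssuerPublicKey(jws);           // hypothetical key discovery
const payloadBytes = await verifyJwsDetached(jws, issuerKey);  // throws if the signature is invalid

// 2. Only after verification, parse and (optionally) process the payload as JSON-LD.
const document = JSON.parse(new TextDecoder().decode(payloadBytes));
const expanded = await jsonld.expand(document, {documentLoader: trustedDocumentLoader});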

@selfissued
Contributor

Another take on this is that Data Integrity signing methods that sign the canonicalized RDF derived from JSON-LD, rather than the JSON-LD itself, enable multiple different JSON-LD inputs to canonicalize to the same RDF. The JSON-LD itself isn't secured - only RDF values derived from it. If only the derived RDF values were used by code, it might not be a problem, but in practice, code uses the unsecured JSON-LD values - hence the vulnerabilities.

@ottonomy

In the example where the firstName and middleName plaintext properties are swapped, what should the verifier's behavior be? I don't think it should just be to verify the credential, whatever type it might be, and then look at the plaintext properties within it that use @vocab-based IRIs. If I were writing this verifier, I would also ensure the @context matched my expectations; otherwise I wouldn't be sure that the properties of credentialSubject I was looking for actually meant what I expected them to mean.

If a verifier were trying to depend on a credential of a certain type that expressed a holder's first name and middle name, it would not be a good idea to miss a check like this. Don't accept properties that aren't well @protected in expected contexts. This is an additional cost that comes with processing JSON-LD documents like VCDM credentials, but it's not a step that should be skipped, because you're right that skipping it might open an implementer up to certain vulnerabilities.

Approaches that work:

  1. Verify that the @context matches your expectations, such as ensuring it includes only known context URLs for contexts that are appropriate for the credential type, use @protected, and define every term explicitly.
  2. OR, use the JSON-LD tools to compact the credential into the context you expect before relying on plaintext property names (a sketch of both approaches follows below).
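
A minimal JavaScript sketch of both approaches above, assuming the jsonld.js API; TRUSTED_CONTEXTS and the document loader stand in for application-specific configuration:

const jsonld = require('jsonld');

// Approach 1: accept only exact, known context URLs (no inline contexts).
const TRUSTED_CONTEXTS = [
  'https://www.w3.org/ns/credentials/v2',
  'https://example.org/my-credential-type/v1' // placeholder for your credential type's context
];

function hasOnlyTrustedContexts(credential) {
  const ctx = Array.isArray(credential['@context'])
    ? credential['@context'] : [credential['@context']];
  return ctx.every(c => typeof c === 'string' && TRUSTED_CONTEXTS.includes(c));
}

// Approach 2: compact into the context you expect before reading plaintext keys.
async function compactToTrustedContext(credential, documentLoader) {
  return jsonld.compact(credential, TRUSTED_CONTEXTS, {documentLoader});
}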

Communities developing and using new credential type specifications benefit from defining a good @context with appropriately @protected terms. @vocab is OK for experimentation but not so great for production use cases. We don't really have a huge number of credential types yet, but hopefully as the list grows, the example contexts established for each of the good ones make for an easy-to-follow pattern.

@OR13
Contributor

OR13 commented Jun 22, 2024

schema.org and google knowledge graph both use @vocab.

https://developers.google.com/knowledge-graph

The problem is not JSON-LD keywords in contexts, the problem is insecure processing of attacker controlled data.

If you want to secure RDF, or JSON-LD, it is better to sign bytes and use media types.

You can sign and verify application/n-quads and application/ld+json, in ways that are faster and safer.

W3C is responsible for making the web safer, more accessible and more sustainable.

Data integrity proofs are less safe, harder to understand, and require more CPU cycles and memory to produce and consume.

They also create a culture problem for RDF and JSON-LD by coupling a valuable property which many people care deeply about (semantic precision and shared global vocabularies) with a security approach that is known to be problematic and difficult to execute safely.

These flaws cannot be corrected, and they don't need to be, because better alternatives already exist.

W3C, please consider not publishing this document as a technical recommendation.

@msporny
Member

msporny commented Jun 22, 2024

2024-05-08 MATTR Responsible Disclosure Analysis

On May 8th 2024, MATTR provided a responsible security disclosure to the Editors of the W3C Data Integrity specifications. A private discussion ensued, with this analysis of the disclosure provided shortly afterwards and a public release date agreed to (after everyone was done with the conferences they were attending through May and June). The original response, without modification, is included below (so language that speaks to the "VC Data Model" could be interpreted as "VC Data Integrity", as the original intent was to file this issue against the VC Data Model specification).

The disclosure suggested two separate flaws in the Data Integrity specification:

  • "The unprotected term redefinition vulnerability"
  • "The @context substitution vulnerability"

The Editors of the W3C Data Integrity specification have performed an analysis of the responsible security disclosure and provide the following preliminary finding:

Both attacks are fundamentally the same attack, and the attack only appears successful because the attack model provided by MATTR presumes that verifiers will allow fields to be read from documents that use unrecognized @context values. Two documents with different @context values are different documents. All processors (whether utilizing JSON-LD processing or not) should treat the inbound documents as distinct; the software provided by MATTR failed to do that. Secure software, by design, does not treat unknown identifiers as equivalent.

That said, given that a credential technology company such as MATTR has gone so far as to report this as a vulnerability, further explanatory text could be added to the VC Data Model specification that normatively states that all processors should limit processing to known and trusted context identifiers and values, such that developers do not make the same mistake of treating documents with differing @context values as identical prior to verification.

The rest of this document contains a more detailed preliminary analysis of the responsible disclosure. We thank MATTR for the time and attention put into describing their concerns via a responsible security disclosure. The thorough explanation made analysis of the concerns a fairly straightforward process. If we have made a mistake in our analysis, we invite MATTR and others to identify the flaws in our analysis such that we may revise our findings.

Detailed Analysis

A JSON-LD consumer cannot presume to understand the meaning of fields in a JSON-LD document that uses a context that the consumer does not understand. The cases presented suggest the consumer is determining the meaning of fields based on their natural language names, but this is not how JSON-LD works; rather, each field is mapped to an unambiguous URL using the JSON-LD context. This context MUST be understood by the consumer; it cannot be ignored.

A verifier of a Verifiable Credential MUST ensure that the context used matches an exact well-known @context value or MUST compact the document using the JSON-LD API to a well-known @context value before further processing the data.

Suggested Mitigation 1
Add a paragraph to the Data Integrity specification that mentions this and links to the same section in the Verifiable Credentials specification to help readers who are not familiar with JSON-LD, or did not read the JSON-LD specification, to understand that `@context` cannot be ignored when trying to understand the meaning of each JSON key. Additional analogies could be drawn to versioning to help developers unfamiliar with JSON-LD, e.g., "The `@context` field is similar to a `version` for a JSON document. You must understand the version field of a document before you read its other fields."

The former can be done by using JSON schema to require a specific JSON-LD shape and specific context values. This can be done prior to passing a document to a data integrity implementation. If contexts are provided by reference, a document loader can be used that resolves each one as "already dereferenced" by returning the content based on installed context values instead of retrieving them from the Web. Alternatively, well-known cryptographic hashes for each context can be used and compared against documents retrieved by the document loader over the Web. For this approach, all other JSON-LD documents MUST be rejected if they do not abide by these rules. See Type-Specific Credential Processing for more details on this:

https://www.w3.org/TR/vc-data-model-2.0/#type-specific-credential-processing.
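
As a rough sketch of the "already dereferenced" document loader described above (following the jsonld.js documentLoader convention; the context URL and file path are placeholders):

const INSTALLED_CONTEXTS = new Map([
  ['https://www.w3.org/ns/credentials/v2', require('./contexts/credentials-v2.json')] // locally installed copy
]);

async function staticDocumentLoader(url) {
  if (INSTALLED_CONTEXTS.has(url)) {
    // return the locally installed context instead of retrieving it from the Web
    return {contextUrl: null, document: INSTALLED_CONTEXTS.get(url), documentUrl: url};
  }
  // all other JSON-LD documents are rejected
  throw new Error(`Context "${url}" is not installed/trusted`);
}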

This former approach is less powerful than using the JSON-LD Compaction API because it requires more domain-specific knowledge to profile down. However, it is still in support of decentralized extensibility through use of the JSON-LD @context field as a decentralized registry, instead of relying on a centralized registry. Decentralized approaches are expected to involve a spectrum of interoperability and feature use precisely because they do not require a one-size fits all approach.

Applying these rules to each case presented, for case 1:

A verifier that does not use the JSON-LD API and does not recognize the context URL, https://my-example-context.com/, will reject the document.

A verifier that does not use the JSON-LD API and does recognize the context URL, https://my-example-context.com/, will not conflate the natural language used for the JSON keys with their semantics. Instead, the verifier will appropriately use the semantics (that happens to be the opposite of the natural language used in the JSON keys) that the issuer intended, even though the JSON keys have changed.

A verifier that does use the JSON-LD API will compact the document to a well-known context, for example, the base VC v2 context, and the values in the JSON will be restored to what they were at signing time, resulting in semantics that the issuer intended.

For case 2:

A verifier that does not use the JSON-LD API and does not recognize the attacker-provided context URL, https://my-malicious-modified-context.com/, will reject the document.

A verifier that does not use the JSON-LD API and does recognize the attacker-provided context URL, https://my-malicious-modified-context.com/, will not conflate the natural language used for the JSON keys with their semantics. Instead, the verifier will appropriately use the semantics (that happens to be the opposite of the natural language used in the JSON keys) that the issuer intended, even though the JSON keys have changed.

A verifier that does use the JSON-LD API will compact the document to a well-known context, for example, the base VC v2 context (and optionally, https://my-original-context.com), and the values in the JSON will be restored to what they were at signing time, resulting in semantics that the issuer intended.

Note: While the disclosure suggests that the JSON-LD @protected feature is critical to this vulnerability, whether it is used, or whether a Data Integrity proof is used to secure the Verifiable Credential, is orthogonal to ensuring that the entire @context value is understood by the verifier. For clarity, this requirement stands even if an envelope-based securing mechanism focused on syntax protection were used to ensure authenticity of a document. Misunderstanding the semantics of an authentic message by ignoring its context is always a mistake and can lead to unexpected outcomes.

Comparison to JSON Schema

The scenarios described are identical in processing systems such as JSON Schema where document identifiers are used to express that two documents are different. A JSON document with differing $schema values would be treated as differing documents even if the contained data appeared identical.

Original document

{"$schema": "https://example.com/original-meaning",
 "firstName": "John"}

New or modified document

{"$schema": "https://example.com/new-meaning",
 "firstName": "John"}

Any document processor, whether utilizing JSON Schema processing or not, would rightly treat these two documents as distinct values and would seek to understand their equivalence (or lack of it) prior to processing their contents. Even consuming a document that is recognized as authentic would be problematic if the $schema values were not understood.

Original meaning/schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/original-meaning",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "description": "The name by which a person is generally called: 'given name'",
      "type": "string"
    }
  }
}

New meaning/schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/new-meaning",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "description": "The name spoken first in Japan: typically a surname",
      "type": "string"
    }
  }
}

Demonstration of Proper Implementation

The attack demonstration code provided adds the unknown modified/malicious contexts to the application code's trusted document loader. A valid application should not do this, and removing these lines will cause the attack demonstrations to no longer pass:

documentLoader.addStatic("https://my-example-context.com/", modifiedContext)

https://gist.github.com/tplooker/95ab5af54a141b69b55d0c2af0bc156a#file-protected-term-redefinition-attack-js-L38

To see "Proof failed" when this line is commented out and the failure result is logged, see: https://gist.github.com/dlongley/93c0ba17b25e500d72c1ad131fe7e869

documentLoader.addStatic("https://my-malicious-modified-context.com/", modifiedContext)

https://gist.github.com/tplooker/4864ffa2403ace5637b619620ce0c556#file-context-substitution-attack-js-L48

To see "Proof failed" when this line is commented out and the failure result is logged, see:

https://gist.github.com/dlongley/4fb032c422b77085ba550708b3615efe

Conclusion

While the mitigation for the misimplementation identified above is fairly straightforward, the more concerning thing, given that MATTR is knowledgeable in this area, is that they put together software that resulted in this sort of implementation failure. It demonstrates a gap between the text in the specification and the care that needs to be taken when building software to verify Verifiable Credentials. Additional text to the specification is needed, but may not result in preventing this sort of misimplementation in the future. As a result, the VCWG should probably add normative implementation text that will test for this form of mis-implementation via the test suite, such as injecting malicious contexts into certain VCs to ensure that verifiers detect and reject general malicious context usage.

Suggested Mitigation 2
Add tests to the Data Integrity test suites that are designed to cause verifiers to abort when an unknown context is detected to exercise type-specific credential processing.
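
A sketch of the kind of negative test described above (hypothetical helper and fixture names; any JavaScript test framework with an assertion library would do):

const assert = require('node:assert');

it('rejects a credential whose @context was swapped after signing', async () => {
  const tampered = structuredClone(signedCredential); // fixture: a validly signed VC
  tampered['@context'][1] = 'https://my-malicious-modified-context.com/'; // injected unknown context
  const result = await verifyCredential(tampered);    // hypothetical verifier under test
  assert.strictEqual(result.verified, false);
});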

@OR13
Contributor

OR13 commented Jun 22, 2024

If you consider the contexts part of source code, then this sort of attack requires source code access or misconfiguration.

Validating the attacker-controlled content prior to running the data integrity suite might provide mitigation, but at the cost of further implementation complexity.

Which increases the probability of misconfiguration.

A better solution is to verify the content before performing any JSON-LD (or other application specific) processing.

After verifying, schema checks or additional business validation can be performed as needed with assurance that the information the issuer intended to secure has been authenticated.

At a high level, this is what you want:

  1. minimal validation of hints
  2. key discovery & resolution
  3. verification
  4. validation
  5. deeper application processing

Most data integrity suites I have seen do this instead:

  1. deep application processing (JSON-LD / JSON Schema)
  2. canonicalization (RDF)
  3. verification
  4. validation
  5. deeper application processing

The proposed mitigations highlight that these security issues are the result of a fundamental disagreement regarding authentication and integrity of data.

Adding additional application processing prior to verification gives the attacker even more attack surface to exploit, including regular expression attacks, denial of service, schema reference tampering, and schema version mismatching, etc...

Any application processing that occurs prior to verification is a design flaw, doubling down on a design flaw is not an effective mitigation strategy.

@filip26

filip26 commented Jun 22, 2024

@OR13

adding additional application processing prior to verification gives the attacker even more attack surface to exploit, including regular expression attacks, denial of service, schema reference tampering, and schema version mismatching, etc...

we are speaking about this pseudo-code

  if (!ACCEPTED_CONTEXTS.includesAll(VC.@context)) {
     terminate
  }

which is a loop and a simple string comparison. I don't see a reason for any of the exploits you have listed here, except an implementer's incompetence.

Can you please elaborate on how those exploits could be performed, and provide a calculation or estimate of how much this adds to the complexity?

Thank you!

@tplooker
Author

@filip26, setting aside your apparent labelling of multiple community members who have participated in this community for several years as "incompetent".

Your specific pseudo-code is insufficient for at least the following reasons:

  1. What is actually input into the cryptographic verification procedure for data integrity isn't the URLs, it's the contents behind those URLs. So, to put it plainly, data integrity cannot by design ensure that the contents of the @context entries used to actually verify a credential are those that the issuer used, because they are not integrity protected by the issuer's signature.
  2. You have disregarded inline contexts; the @context array is not guaranteed to be an array of strings, it may also include objects, or "inline contexts" (see the fragment below).
  3. Your check appears to imply ACCEPTED_CONTEXTS is a flat list of contexts acceptable for any issuer. This means that if contexts from different issuers collide in unexpected ways and a malicious party knows this, they can exploit @context values the relying party already trusts without even having to inject or modify an existing @context. If I'm mistaken and you meant that ACCEPTED_CONTEXTS is an array of issuer-specific accepted contexts, then please explain how this is accomplished in an interoperable manner and/or how it would scale.
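
To illustrate point 2, an @context array may legally mix context URLs with an inline context object that (re)defines terms, e.g. (a contrived fragment):

"@context": [
  "https://www.w3.org/ns/credentials/v2",
  { "firstName": "https://example.org/vocab#middleName" }
]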

@filip26

filip26 commented Jun 22, 2024

@tplooker setting aside that you are putting words in my mouth that I have not said, which is quite rude and disrespectful ...

Re 1. You are wrong; by ensuring data is processed with a context you accept (the URLs), you know what is behind those URLs and how much you trust them, and perhaps you have a static copy of the contexts. If you follow untrusted URLs then it's an implementer's fault. Use a browser analogy.
Re 2. Yeah, I've simplified that; an inline context is a bad practice.
Re 3. You trust the URLs or not, and based on that trust you proceed or not.

@PatStLouis

I was browsing through past issues related to this. This specific issue was raised to suggest adding @vocab to the base VCDM 2.0 context. It's my understanding that the authors of the Data Integrity spec were opposed to this. This is now being pointed to as a direct security concern.

@tplooker given these new findings, would you revise your support, since according to your disclosure this was a bad recommendation that introduced a security concern?

@OR13
Contributor

OR13 commented Jun 22, 2024

The URL for a context doesn't actually matter... In fact some document loaders will follow redirects when resolving contexts over a network (technically another misconfiguration).

Depending on the claims you sign, you may only detect a mismatch in the signature when you attempt to sign a document that actually uses the differing part of the context.

Contexts are just like any other part of source code... Every single line of source code is a potential problem.

You often don't control what 3rd parties will consider the bytes of a context to be... It's a feature that's been turned into a defect by where it was placed.

"It verified for me, must be a problem in your document loader."

"I thought I would be able to fix it in only a few hours, but it took me 2 days and delayed our release"

"I finally figured out how data integrity proofs work, thanks for letting me spend all week on them"

I've paired with devs and shown them how to step through data integrity proofs, dumping intermediate hex values and comparing against a "known good implementation", only later to learn the implementation had a bug...

Misconfiguration is common in complex systems.

I'm arguing that security experts who have evaluated data integrity proofs against alternatives should never recommend them, because every problem they exist to solve is already solved for better by other technologies used in the correct order.

Authentication of json -> json web signatures
Specification of json structure -> json schemas
Integrity protection of files -> hashes
Semantic mapping for json -> JSON-LD

The essence of a recommendation, is that you believe there isn't a better alternative.

@filip26

filip26 commented Jun 22, 2024

@OR13 I'm sorry, but I don't see it. You mention two issues: misconfiguration and bugs. Well, we have tests, certification, etc. Those issues are endemic to any software application, but we don't call software vulnerable merely because of the possibility that there might be a bug; we do so only after we find a bug.

Misconfiguration is common in complex systems.

I would really like to see the complexity estimated. I guess we are seeing a very different picture.

I'm arguing that security experts who have evaluated data integrity proofs against alternatives should never recommend them, because every problem they exist to solve is already solved for better by other technologies used in the correct order.

Please, let's be factual: which experts, what was recommended, etc. In the EU, when a press article starts with the title "American scientists have ...", everyone stops reading it (they add the "American" to make it credible ;)

@OR13
Contributor

OR13 commented Jun 22, 2024

@PatStLouis you raise an excellent point regarding default vocabularies.

It's never too late to change what's in a context (joke).

This working group cannot prevent anyone else from adding a context that includes a vocab.

You are reporting an architectural flaw that was "solved for" by making it explicit in the base context, but it's not fixed by removing it from that context.

If json compatibility isn't a requirement, the working group can drop the vc-jose-cose spec and remove the vocab from the default context... This might even improve adoption of data integrity while clarifying that RDF is the claims format that W3C secures.

I've argued this point previously.

@tplooker
Author

I was browsing through past issues related to this. w3c/vc-data-model#953 was raised to suggest adding @vocab to the base VCDM 2.0 context. It's my understanding that the authors of the Data Integrity spec were opposed to this. This is now being pointed to as a direct security concern.

@PatStLouis I agree this issue is relevant to the conversation; however, the opinions I shared in that issue have not changed. @vocab is a broadly useful feature, and that has not changed through the disclosure of this vulnerability; what has become apparent is that JSON-LD is broken with regard to how this feature works. Simply removing @vocab from the vocabulary doesn't fix this issue, it would be a band-aid; what needs to be fixed is 1) JSON-LD, with regard to how @vocab works with @protected, and 2) more generally, the @context entry needs to be integrity protected to prevent manipulation.

@tplooker
Author

Just to add some additional colour here @PatStLouis, I don't believe the recommendation of putting @vocab in the base vocabulary was a "bad recommendation". In reality, it was also necessitated to fix an even worse issue with data integrity, documented in digitalbazaar/jsonld.js#199, which lay unpatched since 2017 until we discovered it and started a contribution for a fix in 2021 (digitalbazaar/jsonld.js#452). Personally, I believe removing @vocab from the core context will likely re-introduce this issue for JSON-LD processors that aren't handling these relative IRIs correctly.

Furthermore, if others in the WG knew about this issue, specifically that @vocab didn't work with @protected, and chose not to disclose it when discussing this proposal, then that is even more troublesome.

@msporny
Member

msporny commented Jun 23, 2024

@tplooker wrote:

setting aside your apparent labelling of multiple community members who have participated in this community for several years as "incompetent".

@tplooker wrote:

Furthermore, if others in the WG knew about this issue, specifically that @vocab didn't work with @protected, and chose not to disclose it when discussing this proposal, then that is even more troublesome.

Please stop insinuating that people are acting in bad faith.

Now might be a good time to remind everyone in this thread that W3C operates under a Code of Ethics and Professional Conduct that outlines unacceptable behaviour. Everyone engaging in this thread is expected to heed that advice in order to have a productive discussion that can bring this issue to a close.

@veikkoeeva

veikkoeeva commented Jun 23, 2024

From an implementer perspective, maybe adding an example that "should fail" could be a good thing. Something like at #272 (comment).

As an implementation "case experience", I implemented in .NET something that produces a proof like at https://www.w3.org/community/reports/credentials/CG-FINAL-di-eddsa-2020-20220724/#example-6 the university crendetial and then also verifies it. It felt a bit tedious to find out what to canonicalize, hash and and sign to get a similar result. The code is or less private code still, but now that https://github.com/dotnetrdf/dotnetrdf/releases/tag/v3.2.0 and the canonicalization is publicly released, I might make something more public too. I still feel I need to go through this thread with more thought so I completely understand the issue at hand.

@msporny
Member

msporny commented Jun 23, 2024

@veikkoeeva wrote:

From an implementer perspective maybe adding an example that "should fail" could be a good thing.

Yes, that is already the plan for the test suite, in order to make sure that no conformant implementation can get through without refusing to generate a proof for something that drops terms and/or, depending on the outcome of this thread, uses @vocab to expand a term.

That's a fairly easy thing that this WG could do to ensure that this sort of implementation mistake isn't made by implementers. Again, we'll need to see how this thread resolves to see what actions we can take with spec language and test suites to further clarify the protections that we expect implementations to perform by default.

@PatStLouis

@OR13

It's never too late to change what's in a context (joke).

I'm not suggesting a change; my goal is to understand why this recommendation was made in the first place, and why removing it is now listed as a remediation step for a security concern raised by the very same parties who suggested it.

This working group cannot prevent anyone else from adding a context that includes a vocab.

Correct, @vocab is a great feature for some use cases. I enjoy the feature for learning about JSON-LD, development, and prototyping, until I publish a proper context. I wouldn't use it in a production system (or at least I haven't found a use case that requires it).

Many protocols have features that can be insecure depending on how you use them. This doesn't make the protocol inherently flawed.

You are reporting an architectural flaw that was "solved for" by making it explicit in the base context, but it's not fixed by removing it from that context.

Apologies if you misunderstood my statement, but my intention was not to report an architectural flaw.

@tplooker

@PatStLouis I agree this issue is relevant to the conversation; however, the opinions I shared in that issue have not changed. @vocab is a broadly useful feature, and that has not changed through the disclosure of this vulnerability; what has become apparent is that JSON-LD is broken with regard to how this feature works. Simply removing @vocab from the vocabulary doesn't fix this issue, it would be a band-aid; what needs to be fixed is 1) JSON-LD, with regard to how @vocab works with @protected, and 2) more generally, the @context entry needs to be integrity protected to prevent manipulation.

Yes, @vocab is a useful feature, but should it always be present? Nothing is stopping someone from using it; it's a feature (but it shouldn't be a default behaviour). I would likewise argue that the decision to add @vocab to the base context of the VCDM 2.0 is a band-aid solution in itself, derived from a desire for an easier development process.

The Data Integrity spec provides hashes for its context entries that verifiers can leverage while caching the content. AFAIK this is already a thing.

Just to add some additional colour here @PatStLouis, I don't believe the recommendation of putting @vocab in the base vocabulary was a "bad recommendation". In reality, it was also necessitated to fix an even worse issue with data integrity, documented in digitalbazaar/jsonld.js#199, which lay unpatched since 2017 until we discovered it and started a contribution for a fix in 2021 (digitalbazaar/jsonld.js#452). Personally, I believe removing @vocab from the core context will likely re-introduce this issue for JSON-LD processors that aren't handling these relative IRIs correctly.

Thank you for pointing out these issues; I enjoy looking back at historical data from before my time in the space. As pointed out earlier, some of the parties that made that recommendation are now recommending removing it as a remediation for a security concern that they raised.

The use cases listed for this recommendation were for development purposes, as described in #953. Furthermore, the private claims section of the JWT RFC reads as follows:

Private Claim Names are subject to collision and should be used with caution.

Enabling this by default does not sound like a good recommendation to me.

It's easy to set up a context file; it takes 5 minutes and a GitHub account. If you are doing development, you can just include an @vocab entry in your base context for the short term, so why the recommendation to make it part of the VCDM 2.0 context?

Regardless, this was already discussed by the group and the decision has been made.

OWASP defines a class of vulnerabilities called Security Misconfigurations. This is where I would see this landing. While valid, it's ultimately the implementer's responsibility to properly configure their system, and sufficient information is provided for them to do so. If I expose an unsecured SSH service to the internet and then claim that SSH is insecure because I can gain unauthorized access to my server, that doesn't hold up, since the security flaw is not in the protocol itself but in my security configuration. Yes, it's a vulnerability; no, it shouldn't be addressed by the underlying protocol.

To conclude, I find this disclosure valuable, as I got to learn a bit more about JSON-LD, and it provides a great resource for showing implementers how to properly verify credentials and issuers how to properly design a VC.

@awoie

awoie commented Jun 24, 2024

OWASP defines a class of vulnerabilities called Security Misconfigurations. This is where I would see this landing. While valid, it's ultimately the implementer's responsibility to properly configure their system, and sufficient information is provided for them to do so. If I expose an unsecured SSH service to the internet and then claim that SSH is insecure because I can gain unauthorized access to my server, that doesn't hold up, since the security flaw is not in the protocol itself but in my security configuration. Yes, it's a vulnerability; no, it shouldn't be addressed by the underlying protocol.

I would actually classify those attacks as "Data Integrity Signature Wrapping" (DISW) attacks. They share many similarities with XML Signature Wrapping Attacks (XSW) that occurred in the past. Also, note that it is possible to use XML Signatures securely if appropriate mitigations are implemented correctly. The same holds true for DI. The question is where we would add requirements for those additional mitigations for Data Integrity Proofs (DI).

The VCDM uses relatedResource to protect the integrity of specific external resources, such as @context values referenced by URIs. While DI is primarily used with the W3C VCDM 2.0, other JSON-LD data models might be secured by DI in the future, such as ZCaps, so we cannot just rely on mitigations defined in the W3C VCDM 2.0. For this reason, I believe this mechanism for protecting the integrity of the @context definitions is actually the responsibility of DI itself, since those @context definitions are part of the canonicalization and signature creation/verification. It would mitigate DISW attacks by making @context definition integrity checking part of the signature verification process. In that case, a similar mechanism to relatedResource has to be defined in the DI specification, and making it mandatory would help verifiers and issuers avoid skipping certain checks when issuing and verifying DIs.
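
For illustration, the relatedResource mechanism referred to above looks roughly like this in a VC (the extension context URL is taken from the examples in this thread; the credential type and digest value are placeholders, not real values):

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://my-original-context.com/"
  ],
  "type": ["VerifiableCredential", "ExampleCredential"],
  "relatedResource": [{
    "id": "https://my-original-context.com/",
    "digestSRI": "sha384-..."
  }],
  "credentialSubject": {
    "firstName": "John"
  }
}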

@awoie

awoie commented Jun 24, 2024

we are speaking about this pseudo-code

  if (!ACCEPTED_CONTEXTS.includesAll(VC.@context)) {
     terminate
  }

It's not that simple if the goal is to retain the open-world data model and extensibility model that the W3C VCDM promises. There might be instances where a verifier does not recognize all of the values in a credential's @context array. Consider the following simplified VC examples:

Example: VC using a base data model for all driving licenses

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3id.org/dl/v1"
  ],
  "type": [ "VerifiableCredential", "DrivingLicense" ],
  "credentialSubject": {
    "canDrive": true
  }
}

Example: VC issued by DMV of Foo

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3id.org/dl/v1",
    "https://foo.com/ns/dl/v1"
  ],
  "type": [ "VerifiableCredential", "DrivingLicense", "FooLicense" ],
  "credentialSubject": {
    "canDrive": true,
    "foo": true
  }
}

Example: VC issued by DMV of Bar

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3id.org/dl/v1",
    "https://bar.com/ns/dl/v1"
  ],
  "type": [ "VerifiableCredential", "DrivingLicense", "BarLicense" ],
  "credentialSubject": {
    "canDrive": true,
    "bar": true
  }
}

When crossing realms, verifiers in the realms of Foo and Bar may have agreed on using the base data model but not on the specific properties unique to Foo and Bar. Verifiers in the realm of Foo are primarily interested in the base properties of the DrivingLicense and occasionally in the specific properties of the FooLicense. The same situation applies to the realm of Bar, but with a focus on their respective properties.

Adopting the ACCEPTED_CONTEXTS approach would require Foo, Bar, and all other realms to continually distribute and update their individual context definitions. This approach just does not scale very well, and it sacrifices the open-world data model, since all @context URLs and/or definitions have to be statically configured.

@veikkoeeva

veikkoeeva commented Jun 24, 2024

we are speaking about this pseudo-code

  if (!ACCEPTED_CONTEXTS.includesAll(VC.@context)) {
     terminate
  }

It's not that simple if the goal is to retain the open-world data model and extensibility model that the W3C VCDM promises. There might be instances where a verifier does not recognize all of the values in a credential's @context array. Consider the following simplified VC examples:
[...]

Great examples! Thanks!

Some context on why I think a test for what "should not happen" matters, plus a less-mentioned issue of having good examples.

Related to #272 (comment): I'm not completely alien to this sort of work, and indeed, when I implemented the "first pass sketch" of the code, I struggled a bit with implications of this sort, since I'm not so familiar with JSON-LD. So I thought to get back to it with more time and just not release anything before things are clearer (plus the library change was not public, though there's something already in the tests about this).

Part of that was: if I have a document like the one at https://www.w3.org/community/reports/credentials/CG-FINAL-di-eddsa-2020-20220724/#example-6, how do I pick apart the pieces for canonicalization, turning into bytes, hashing, signing and so on? For this "sketch" I was quite happy to get the same results with the keys as the example document, but I know I gave only passing thought to these sorts of things, partially because there had been related discussion earlier.

I mention this example piece because I think good examples are perhaps more important than has been implied here. I naturally also think that tests for what should not happen are important -- and maybe some notes of the sort could be added to an example or two as well. They're already something I (we) have been codifying into some tests. It's also a great way to document things.

@filip26

filip26 commented Jun 24, 2024

@awoie a verifier should not guess what's inside a context, nor try to anticipate whether there is some agreement between context providers.

When crossing realms, verifiers in the realms of Foo and Bar may have agreed on using the base data model but not on the specific properties unique to Foo and Bar.

If a verifier recognizes both https://foo.com/ns/dl/v1 and https://bar.com/ns/dl/v1 then there is no issue. It simply means that the verifier accepts both DMV departments' vocabularies, no matter that there are shared parts. A situation in which a verifier accepts something just because it uses well-known terms is a risk, not the other way around.

The ability to understand well-known terms, e.g. those defined by schema.org, is a great feature, but not in the VC ecosystem, where we don't want to guess but to be sure.

This approach just does not scale very well and it sacrifices the open-world data model since all @context

It scales the same way the web does. Nothing prevents you from using other contexts, well-known terms, etc. and including them all in your context.

If there is a need, a good reason, to share parts between some parties, then the easiest, transparent, and scalable solution is this:

 "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3id.org/dl/v1",
    "https://dmv-vocab/ns/dl/v1"
    "https://foo.com/ns/dl/v1"
  ],
 "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3id.org/dl/v1",
    "https://dmv-vocab/ns/dl/v1"
    "https://bar.com/ns/dl/v1"
  ],

@awoie

awoie commented Jun 24, 2024

@filip26 wrote:

If a verifier recognizes both https://foo.com/ns/dl/v1 and https://bar.com/ns/dl/v1 then there is no issue. It simply means that the verifier accepts both DMV departments' vocabularies, no matter that there are shared parts. A situation in which a verifier accepts something just because it uses well-known terms is a risk, not the other way around.

I didn't say it is not a solution. My point was that it is a solution which does not scale. A verifier from Foo might never have seen a @context from Bar, but it shouldn't matter, because they agreed on a common vocabulary defined by https://www.w3id.org/dl/v1. Forcing all verifiers or issuers from different realms to continuously reach out to each other to keep @context URLs and definitions up-to-date and well-known does not scale for a lot of use cases.

@filip26 wrote:

It scales the same way the web does. Nothing prevents you from using other contexts, well-known terms, etc. and including them all in your context.

No, it doesn't, because the assumption with ACCEPTED_CONTEXTS is that they are statically configured. The web is not static, and not all actors monitor each other continuously.

@filip26

filip26 commented Jun 24, 2024

@awoie

No, it doesn't, because the assumption with ACCEPTED_CONTEXTS is that they are statically configured. The web is not static, and not all actors monitor each other continuously.

It's up to an implementer how to allow a verifier to be configured. A static configuration has nothing to do with scalability. But I guess you meant that a verifier would not be able to accept a context which is not known - that's exactly what we want, and it does not mean that VCs do not scale, or that there cannot be an infinite number of different VC types, issuers, verifiers, etc.

@awoie

awoie commented Jun 24, 2024

@filip26 wrote:

It's up to an implementer how to allow a verifier to be configured. A static configuration has nothing to do with scalability. But I guess you meant that a verifier would not be able to accept a context which is not known -

My point on scalability refers to an increase in operational costs, not necessarily performance. Performance might be another point but I cannot comment on that.

@filip26 wrote:

that's exactly what we want, and it does not mean that VCs do not scale, or that there cannot be an infinite number of different VC types, issuers, verifiers, etc.

If this is what we want, it sacrifices the open-world data model the VCDM promises, as mentioned here.

@filip26

filip26 commented Jun 24, 2024

@awoie I'm sorry, I don't think we are on the same page, and I'll let others explain that it does not affect the scalability of the VC ecosystem nor the open-world data model.

@PatStLouis

@awoie
I like your extensibility example a lot since it's similar to the context in which I'm evaluating the impact of this.

My question is: if a verifier has no prior knowledge of Foo or Bar, why would they consider the extended data provided by those entities, and how would this data lead to an exploit in their system?
The base information contained in the dl context is, by design, sufficient for the verification needs of third parties.

Verifiers will know what information they want to verify, they are not blindly verifying abstract data.

As for the classification of this disclosure, while I can't really argue with your labeling, this is not a formal classification.

If we take 2 examples of vulnerabilities disclosed around XML Signature Wrapping Attacks:

Both of these affect specific software and lead to 2 distinct CWEs:

  • Improper Verification of Cryptographic Signature
  • Incorrect Authorization

They are not addressed by a change to XML, but by a security mitigation in the affected software. This is an important distinction to make, and it loops back to a Security Misconfiguration.

It's hard for me to understand what exactly this disclosure tries to underline as the vulnerability:

  • JSON-LD?
  • Verifiable Credentials?
  • Data integrity?
  • A specific cryptosuite?
  • A specific implementation of a cryptosuite?

It seems the target of the vulnerability is being shifted around depending on the questions asked/comments made.

@awoie

awoie commented Jun 26, 2024

jsonwebtoken

SD-JWT does not allow alg=none. So this will be detected by a validation algorithm that validates any VC secured using SD-JWT, because SD-JWT validation requires checking for and rejecting alg=none.

An SD-JWT has a JWT component that MUST be signed using the Issuer's private key. It MUST NOT use the none algorithm.

@msporny this means that alg=none validation is covered at the proof-specification level, which in that case is SD-JWT. In the same way, the vulnerability of Data Integrity demonstrated in this thread should also be covered at the spec level.

@aniltj

aniltj commented Jun 26, 2024

First and foremost, I wanted to thank both @tplooker for bringing this to the VCWG, and the Data Integrity editors for their analysis. Given the global interest in W3C VCDM, I am glad that this discussion is happening so that the right guidance can end up in the specifications going forward.

Some input into both the discussion and the proposed changes to make the specifications stronger:

From @peacekeeper:

... the traditional narrative that JSON-LD VCs can be "processed as plain JSON" can be problematic.

+1

The "traditional narrative", as Markus notes, was grounded in a desire to have a "big tent". The ecosystem has moved on from when this narrative was articulated to the reality that post-VCDM 1.1, the data model is and remains JSON-LD compact form, which has been a global standard. So there is fully an expectation by anyone using VCDM 2.0, they need to understand that data model.

What that means in particular is that, if you are NOT using a JSON-LD-aware mechanism to process a VCDM 2.0 payload (Data Integrity being a JSON-LD-aware option), you have an obligation to build in the "processing logic" to check for the things that are expected when using JSON-LD compact form (similar to how you need to be aware of the particulars when checking a CSV, JSON, or XML document).

I think this needs to be emphasized.

There are other options for those who do not wish to leverage JSON-LD (and its power and flexibility) but if you are using VCDM 2.0, you can't pretend it is not JSON-LD.

From @kimdhamilton:

Remove @vocab from the base context

+1

I personally think that this earlier choice was a mistake, that makes many other mistakes possible.

At the same time, I fully see the value of @vocab when it comes to development and refinement of attribute bundles.

So, I would recommend that, in addition to removing @vocab from the base context, @vocab be provided via an optional secondary context that developers can manually insert into the payload during development and that, as such, becomes explicitly visible when it is in use.

From @msporny:

we might want to get more forceful/normative with the language that exists in the specification today ...

+1

Very much so. Particularly when it comes to defining all terms concretely for production use, and a MUST NOT (instead of a SHOULD NOT as it currently stands) for using @vocab in production use.

From @tplooker:

Data integrity changes the proof generation and verification procedures to include a hash of the @context entries in the document ensuring no manipulation of the @context entry can be done without detection.

This feels right, but I don't know enough about the down-stream impacts of this, so would like to learn more.

@filip26

filip26 commented Jun 26, 2024

Data integrity changes the proof generation and verification procedures to include a hash of the @context entries in the document ensuring no manipulation of the @context entry can be done without detection.

This creates a direct link between an issuer and a context set. It locks a holder into asking for a new credential every time a different context is needed for some reason, even in cases where it could be translated one-to-one. Please note that many verification use cases require only a few claims, especially in the context of selective disclosure.

There might be a risk of clustering holders based on a context set (requested at issuance / presented at verification), which would be hardwired. At this very early stage of VC adoption we can expect many custom contexts to be around.

Making this change without a deep analysis could potentially end up with a similar discussion to this one a few months later ...

some use cases:

@aniltj

aniltj commented Jun 26, 2024

Making this change without a deep analysis could potentially end up with a similar discussion to this one a few months later ...

Thank you @filip26. If I understand correctly, this is about how best to securely distribute @context files. If so, I agree that a deeper analysis would be helpful to understand both the options on the table and the associated trade-offs an implementer needs to consider before making a particular choice.

@dlongley
Contributor

dlongley commented Jun 26, 2024

Checking context is something that needs to happen at the application level and, if it is not checked properly, adding content-integrity checks will not help solve that problem, but it will harm use cases and decentralization.

In sticking with the "human name swapping" scenarios we've been using, take for example an application that will accept either a "Name VC" from an issuer from Japan or from an issuer from the US. In fact, these VCs are protected by some JWT-based mechanism that will ensure that the context cannot be changed without losing protection over the documents.

Now, suppose that the issuer from Japan issues their VC using a "Japanese context" that expresses the first and last name term names in the exact reverse way from the "US context". The issuer from the US issues their VC using the "US context".

The application sees this and is written using pseudo code like this to consume the VCs after verification checks are performed (that would weed out any unacceptable issuers and ensure no changes to the expression of the documents):

if(issuer === 'US') {
  // run consumption code of US document, ignoring context
} else if(issuer === 'Japan') {
  // run consumption code of Japan document, ignoring context
}

All is well here, for the time being, but it is actually only by chance that this is true in an open world setting. Because then, asynchronously, the issuer from Japan sees that a number of customers in Japan want to be able to use their "Name VC" at US-context-only consuming applications. So, seeing as they weren't using Data-Integrity-protected VCs, they decide they have to also start issuing duplicate "Name VCs" to every customer that wants one, using the "US context".

But now our application has a problem. You see, the application will happily accept these new "US context"-based VCs signed by the issuer in Japan, but the wrong code block will run! Depending on the scenario, this could crash the application or actually swap the data and perhaps produce a worse problem, like the concern here in this thread.

Remember, this is true even though JWT-based protections are used that force a particular context to be used by the holder.

The problem is, fundamentally, that checking the context is an application-level protection that must be performed by the consumer of the information. No basic JWT-verifier is going to check your custom claims or acceptable context combinations, just like no basic data integrity middleware would either. This is a validation responsibility of the application.

We can see that if the application had used this code instead:

// check the context!
if(context === 'US context') {
  // run consumption code of US document, after checking context
} else if(context === 'Japanese context') {
  // run consumption code of Japan document, after checking context
}

Now, the application would have continued to function just fine after the issuer from Japan made their asynchronous and decentralized decision to enable some of their customers to use the "US context".

But, we can take this a step further. If, instead, the issuer from Japan uses Data Integrity to protect their VCs, they don't even need to issue new VCs to allow their customers to use the "US context". Any party can change the context of the VC without losing the protection. And note that if the application continues to use the second block, which they need to use anyway to properly consume JSON-LD, everything will work properly, no matter whether the context was set the way it was by the issuer or by the holder (or by the verifier themselves). This enhances decentralization, scalability, and open world participation.

@msporny
Copy link
Member

msporny commented Jun 26, 2024

@tplooker wrote:

One must ask if whitelisting contexts is such a simple and effective measure alone to mitigate this issue, why doesn't the software highlighted follow this recommendation?

The software you provided specifically allowed the problematic contexts to be used, explicitly bypassing the protections you are criticizing other software in the ecosystem for not supporting. I know we (@awoie, @tplooker, and @msporny) keep talking past each other on this, so I'll keep asserting this in different ways until one of us sees the other's point. :P

The VC Playground software highlighted is playground software that specifically does not implement validation rules.

That is, we specifically do not enforce semantics in the VC Playground because one of its features is to allow developers to add arbitrary VCs and move them through the entire issue/hold/verify process. We did consider adding a "validation" feature to some of the examples, but even if we did that, your complaint would remain. That is, if a developer came along and used their own VC to do a full issue/hold/verify process, there is no way we could know what the validation rules are for their VC... should we reject all 3rd party VCs used in the VC Playground (limiting its use greatly)? Or should we require developers to provide validation rules for each VC (creating a higher burden to add arbitrary VCs to the playground)? In the end, we decided to focus on enabling the issue/hold/verify cycle and to come back to validation later. IOW, validation is out of scope for the playground, but we might add it in the future.

The digital wallet software highlighted does not attempt to validate VCs because that is (arguably) not its primary purpose in the ecosystem; that's the verifier's job. We could build validation into the digital wallet, but we're hesitant to do so because of the broad range of VCs people can put into a wallet and the likelihood of us getting validation wrong for arbitrary VCs is high. What do we display if we don't know of a particular VC type? A warning? An error? Both seem wrong, and the UX would make issuers annoyed at the wallet software for marking their VC as "questionable" when it's not.

Enforcing application-specific @context values is the verifier application's job, and in the case of the VC Playground, we chose to NOT implement that for the reasons outlined above.

In my opinion it is because you lose the open world extensibility that the VC data model promises in the process, which is why it is an inadequate mitigation strategy and hence why I've suggested alternative solutions.

Hmm, disagree, but I see that this particular point hasn't been responded to yet (or I missed it). Will try to specifically respond to this point when I get some cycles later this week.

In the meantime, I suggest we open new issues for each of the 9 proposals above and focus on each proposal separately. I know that is asking A LOT of those participating, but I'm also concerned that trying to evaluate 9 proposals in a single thread is going to result in a conversational flow that's going to be hard for everyone to follow. Would anyone object to translating this issue into 9 different issue/proposals and focusing on each proposal in a separate issue?

@PatStLouis
Copy link

Some closed ecosystem wallets might have specific validation rules, others might not. Regardless, a verifier should always have validation rules (unless it's a public utility tool made available for experimenting/discovering, such as the vc playground, uniresolver, etc.; having validation in these environments would simply ruin their purpose). If I set up an agent that will simply verify the proof on a VC, I still need to have some controller apply business logic. I don't want my barebones library to come with rigid validations; this is the developer's job to implement. If I want to check VDLs, I will cache the VDL context and verify its integrity.
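To illustrate that caching approach, here is a minimal sketch (not from this thread; it assumes the jsonld.js documentLoader hook, and the file path and digest values are placeholders) of a loader that only serves pre-cached, hash-checked contexts and refuses everything else:

const crypto = require('crypto');
const jsonld = require('jsonld');

// Pre-vetted contexts pinned by SHA-256 digest (file path and digest are
// placeholders, not real values).
const CACHED_CONTEXTS = {
  'https://www.w3.org/ns/credentials/v2': {
    document: require('./contexts/credentials-v2.json'),
    sha256: '24a1...90fe'
  }
};

// A document loader that never goes to the network and checks integrity.
async function strictDocumentLoader(url) {
  const entry = CACHED_CONTEXTS[url];
  if (!entry) {
    throw new Error(`Context not allow-listed: ${url}`);
  }
  const digest = crypto.createHash('sha256')
    .update(JSON.stringify(entry.document))
    .digest('hex');
  if (digest !== entry.sha256) {
    throw new Error(`Cached context failed integrity check: ${url}`);
  }
  return {contextUrl: null, document: entry.document, documentUrl: url};
}

// e.g. await jsonld.expand(vc, {documentLoader: strictDocumentLoader});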

@tplooker If this isn't a misconfiguration error, how come proper software configuration will prevent this from being exploited? The myth that one single unconfigured verifier software must be able to verify and process every imaginable VC issued is a fallacy. The COVID passport verifier will verify COVID passports, the age verification software will verify age according to its jurisdiction's regulations. And these verifications will not happen with some arbitrary unknown/unheard-of context/VC as input. If they do, then you can claim a vulnerability in the software, since it was poorly implemented. There have been many vulnerabilities in software, even some leveraging JWT, believe it or not. Here's a list of known attacks.

This being said, I enjoyed these demonstrations, and they should be documented in a lab somewhere, maybe even classified in the specification. They highlight risks associated with not properly reading/implementing the specification. Kudos to the MATTR team for putting these together.

My suggestion as action items:

  • (9) remove @vocab from the base VCDM 2.0 context, breaking native interoperability with private JWT claims
  • (6) strongly discourage the use of @vocab for production use cases, unless the implementer is well aware of what this entails. Banning it seems slightly extreme.
  • (8) strengthen the spec text, emphasizing that a VC MUST NOT be treated as plain JSON
  • add this as (10): document these 2 attack vectors somewhere so they can be included in a pen-tester's auditing toolbox
  • add this as (11): improve test-suites to include asserting bad contexts in the conformance stage

@mavarley
Copy link

mavarley commented Jun 26, 2024

Hello all, as an organization who will be supporting DI signatures in our product as we look to engage with a wide audience in the credential landscape, I would support the following recommendations (with some suggestions for consideration, given what I have grok'd from the above...)

  1. Data integrity changes the proof generation and verification procedures to include a hash of the @context entries in the document ensuring no manipulation of the @context entry can be done without detection. (Tobias)
  • recommended for highly secure applications; and describing a normative way of generating and including these hashes in the signature-protected document - so it is clear when the creator of the document intends for this level of protection (and non-extensibility) to apply.
  2. We replace all terms with full URLs in all VCs (DavidC)
  • recommended for 'production' systems or secure applications (develop/demo/poc with @vocab, lock it down when it matters)
  3. We more strongly discourage the use of @vocab or @base, possibly banning its usage. (DaveL)
  • again, for 'production' or secure applications (so not banning its usage outright, but identifying the security risks involved if the door is left open). I'm not sure if this can be achieved given the base VCDM model...
  4. We strengthen the language around ensuring that the values in @context MUST be understood by verifiers during verification, potentially by modifying the verification algorithm to perform normative checks. (Manu)
  • also guidance on what "understanding" means please? I understand a hard-coded, static set of context values, but I do get lost when dynamic loading of contexts becomes a feature... like the German Lange example above... Do I have to "understand" German or not?
  5. Remove @vocab from the base context. (DaveL/Kim)
  • recommended for secure applications - but again - can we do this for certain scenarios, or is it inherent in the "base context" VCDM?
  6. Document these 2 attack vectors somewhere so they can be included in a pen-tester's auditing toolbox. (Patrick)
  • +1 ; allowing security teams to detect, and businesses to evaluate their own risk appetite (to a degree)
  7. Improve test-suites to include asserting bad contexts in the conformance stage. (Patrick)
  • +1 ; allowing security teams to detect, and businesses to evaluate their own risk appetite (to a degree)

Although I fully support the fully-qualified names approach for ensuring there is no ambiguity in a secured document, I am concerned about the development overhead and lack of flexibility if this is required in all scenarios - but I am happy to learn more about the cost/benefit.

In general I focused on the above because they seem to properly address the described vulnerability when securing DI protected documents, and not focus on alternatives. Business and engineering teams are free to examine alternative methods for securing data and their cost/benefit analysis. But if a choice is made and a solution calls for DI -- how do we protect it as best we can? No solution is perfect, but clearly acknowledging the risks and providing clear guidance to mitigate these risks will help organizations make the right decisions for their needs. (If the mitigations are still insufficient for the use case, consider an alternate solution/technology).

@dlongley
Copy link
Contributor

dlongley commented Jun 26, 2024

@mavarley,

As explained in the example in my comment above, locking down the context does not solve the problem, but it does create new ones. The fundamental problem is that an application is not performing validation on @context prior to consuming the document. You MUST do this, no matter the protection mechanism.

also guidance on what "understanding" means please? I understand a hard-coded, static set of context values, but I do get lost when dynamic loaded of contexts becomes a feature... like the German Lange example above... Do I have to "understand" German or not?

Your application must only run against the context(s) it has been coded against. So if there is some context that uses German terms (or Japanese terms, or Frank McRandom's terms) and your application code wasn't natively written against that context, then your application MUST NOT try to consume the document.

When you see the property "foo" in a JSON-LD document, it should be understood as a localized name -- and its real name is the combination of "the context + foo". If you ignore "the context", that is not ok. That is the source of the problem here.

Notably, this actually isn't different from reading so-called "plain JSON" either, it's just that JSON-LD documents are self-describing, so "the context" is announced via @context. For so-called "plain JSON", you guess "the context" based on the interaction you're having, e.g., which service you think you're talking to, who you think authored the data, the purpose you think they authored it for, things of this sort. Whenever those guesses need to change, you call up / email / text the appropriate parties and figure out how the rollout will work. This is the closed-world, two-party model. In the open world, three-party model, many decisions are made independently, without consultation of all parties, asynchronously, and not everyone knows each other nor can assume "the context".

So, what are your options when your application, written entirely in let's say, English, gets in a document that uses a context with German terms? You can either:

  1. Notice the context isn't well-known to your application and reject the document outright. You may tell the sender that if they sent it in another specific context you'd accept it -- or you don't. This is ok and many applications will be able to function this way without forcing everyone to work this way. Protocols can be written that build on the VC and DI primitives that can allow parties to request documents in certain contexts. This can even allow holders to selectively disclose global-only properties, leaving out region-specific ones, and compact to the global-only context for the verifier!
  2. Call the JSON-LD compaction API to convert the document's context, no matter what it is, to a context that is well-known to your application, prior to consuming it in your application. After compaction, the document will be expressed in a context that your application is natively coded against, so it can be consumed.

Note that this is very similar to "content negotiation". Some servers will accept application/json (compare to context A) and others will accept application/xml (compare to context B). Some will accept both.

Using the JSON-LD API, anyone can translate from context A to context B. Using Data Integrity, this can be done without losing protection on the information in the document.
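As a minimal sketch of that translation step (the application context URL below is a placeholder, and the jsonld.js compaction API is assumed):

const jsonld = require('jsonld');

// The one context this application's consumption code is written against
// (placeholder URL).
const APP_CONTEXT = 'https://example.org/contexts/app-v1';

async function consume(receivedDoc) {
  // Translate from whatever context the document arrived with ("context A")
  // into the context the application natively understands ("context B").
  const compacted = await jsonld.compact(receivedDoc, APP_CONTEXT);
  // From here on, the short property names mean what the app code assumes.
  return compacted;
}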

@awoie
Copy link

awoie commented Jun 27, 2024

Checking context is something that needs to happen at the application level and, if it is not checked properly, adding content-integrity checks will not help solve that problem, but it will harm use cases and decentralization.

@dlongley In your mental model, what @context is used in the DI verification process and is it the same @context that is provided to the business logic of JSON-LD-enabled (processing) verifiers? I thought verifiers have to know and trust @context, at least for DI verification but it appears that you are also saying that there might be other @context values that can be applied.

It sounds to me, that in your mental model, the issuer/holder provided @context is primarily used for DI verification purposes, but you also have the requirement to apply additional @context, e.g., for translation. These other @context entries seem to be not related to the DI verification logic, and it appears they are primarily used for JSON-LD processing in the business logic. Where does the verifier get these other @context values from? Is your assumption that these can be any trusted third-parties, or are they provided by the issuer? Wouldn't you still be able to inject these new @context entries for the business logic after DI verification with and without integrity protecting the @context in DI?

Perhaps I'm not following correctly, but in your mental model, who determines what @context to apply at what stage (verifying proof, data model processing), i.e., issuer, verifier, holder, any (trusted) party; and for which layer (DI verification vs JSON-LD business logic processing)?

It would also really help if we could always keep an eye on a holistic solution when evaluating the proposals made in this thread, i.e.,

  1. DI verifier (RDF/N-Quads signed) + JSON payload processor;
  2. DI verifier (RDF/N-Quads signed) + JSON-LD processor;
  3. DI verifier (JCS/JSON signed) + JSON payload processor;
  4. DI verifier (JCS/JSON signed) + JSON-LD processor;
  5. COSE-JOSE verifier + JSON-LD processor;
  6. COSE-JOSE verifier + JSON processor.

Is there any combination that is not valid, e.g., DI verifier + JSON processor seems to be odd although this is probably what most people are doing, i.e., using the compact form.

I guess JSON-LD processors can rely on the expanded terms (IRIs) but I haven't seen many implementations that do. It was probably not helpful to have a polyglot approach to VCDM with all the different combinations of JSON-LD/JSON across the data model and securing mechanism layer which is why we ended up here.

Irrespective of the solution we land on, I'd hope to be as explicit as possible in the spec and explain how this relates to options 1-4 above, and probably also 5-6.

@tplooker
Copy link
Author

tplooker commented Jun 27, 2024

@tplooker If this isn't a misconfiguration error, how come proper software configuration will prevent this from being exploited? The myth that one single unconfigured verifier software must be able to verify and process every imaginable VC issued is a fallacy. The COVID passport verifier will verify covid passports, the age verification software will verify age according to it's jurisdiction's regulations. And these verifications will not happen with some arbitrary unknown/unheard of context/vc as input. If it does, then you can claim a vulnerability in the software since it was poorly implemented. There has been many vulnerabilities in software, even some leveraging JWT, believe it or not. Here's a list of known attacks.

If this were just a misconfiguration issue, then why are the vcplayground, the three connected wallet applications, and the ~12 VC API backends connected to the vcplayground all "misconfigured"? Surely if this is an obvious misconfiguration issue with no tradeoff, as you suggest, then these software packages should have no issue being configured correctly?

Of course in reality it's not because these aren't valid "applications", as has been previously argued by @dlongley (they are); it's because adding in this configuration means they can't easily scale to new credential types without painful, careful reconfiguration. That is why the VC playground and all connected software today doesn't follow this advice, and why it isn't a practical solution to this problem.

The software you provided specifically allowed the problematic contexts to be used, explicitly bypassing the protections you are criticizing other software in the ecosystem for not supporting. I know we (@awoie, @tplooker, and @msporny) keep talking past each other on this, so I'll keep asserting this in different ways until one of us sees the other's point. :P

Understood, assert away :P and I will continue to make my point, which I don't believe is being understood. As I've said before, the evidence in this community speaks for itself: we have plenty of examples of software that is "misconfigured", to use your terminology, and little evidence of software that actually even follows this recommendation, and that's because this isn't a configuration issue.

@tplooker
Copy link
Author

tplooker commented Jun 27, 2024

Your application must only run against the context(s) it has been coded against. So if there is some context that uses German terms (or Japanese terms, or Frank McRandom's terms) and your application code wasn't natively written against that context, then your application MUST NOT try to consume the document.

This approach as a mitigation, which is to perform hard validation against every context in a presented credential (effectively whitelisting every context), simply doesn't scale, and below I outline a use case which demonstrates exactly why.

Note, this isn't a theoretical use case either, we have lived this through real deployments of LDP and DI.

At MATTR several years ago we decided to extend the VC 1.0 data model to include our own early attempt at credential branding. This involved defining our own company-based @context value to extend the base data model with these terms. Every credential we issued then had to include this context value so that the branding terms we used were defined. Because we wanted to prevent our document loader from resolving contexts over the network, we had several significant deployment issues where some downstream applications of ours didn't have the required @context entries resolved, meaning credentials failed verification until we did a redeploy. The pain was that these @context values defined terms that weren't even being processed by the wallet and verification software, as the software didn't understand the branding by design! It simply wanted to ignore this portion of the VCs and couldn't without being redeployed with a redundant @context value. Then, when we updated this context multiple times over the deployment, we had to pre-distribute the new @context values into the wallet and verification apps and wait for them to propagate before we could safely issue new VCs using the new context values. This required heavy coordination that was only possible because we controlled the issuer, wallet, and verifier software; it simply wouldn't have been possible in a scaled and open ecosystem.

So in short, @dlongley @filip26 @msporny and others, we have lived experience with your proposed solution here and it just does not work. It assumes all context values in an issued credential are critical to process, when there are many cases (like the above) where some @context entries are totally irrelevant to a downstream wallet or verifier, and forcing these applications to explicitly trust these redundant @context entries is a brittle, error-prone, unscalable solution.

@awoie
Copy link

awoie commented Jun 27, 2024

I had to update the permutations in my previous post because I figured there is also DI + JCS + JSON, but it contains JSON-LD, so there might be JSON-LD and JSON processors. So, here are the updated permutations a solution should cater for:

  1. DI verifier (RDF/N-Quads signed) + JSON payload processor;
  2. DI verifier (RDF/N-Quads signed) + JSON-LD processor;
  3. DI verifier (JCS/JSON signed) + JSON payload processor;
  4. DI verifier (JCS/JSON signed) + JSON-LD processor;
  5. COSE-JOSE verifier + JSON-LD processor;
  6. COSE-JOSE verifier + JSON processor.

@filip26
Copy link

filip26 commented Jun 27, 2024

@mavarley

Data integrity changes the proof generation and verification procedures to include a hash of the @context entries in the document ensuring no manipulation of the @context entry can be done without detection. (Tobias)

recommended for highly secure applications; and describing a normative way of generating and including these hashes in the signature-protected document - so it is clear when the creator of the document intends for this level of protection (and non-extensibility) to apply.

There is nothing like "less secure apps" (perhaps you meant a profile or something like that?). As it stands, it looks like a euphemism for saying "make it mandatory".

@tplooker

So in short, @dlongley @filip26 @msporny and others, we have lived experience with your proposed solution here and it just does not work. It assumes all context values in an issued credential are critical to process, when there are many cases (like the above) where some @context entries are totally irrelevant to a downstream wallet or verifier, and forcing these applications to explicitly trust these redundant @context entries is a brittle, error-prone, unscalable solution.

I'm sorry, I don't believe the "lived experience" was with the solution (e.g. the one sketched here: #272 (comment)), because then you would not have security issues like the one reported and demonstrated.

Regarding unused @context entries, etc.: would not locking a context into a signature make it much worse?
The issues have been explained here several times already. Locking @context into a signature threatens decentralization, scalability and even privacy.

@PatStLouis
Copy link

PatStLouis commented Jun 27, 2024

If this were just a misconfiguration issue, then why are the vcplayground, the three connected wallet applications, and the ~12 VC API backends connected to the vcplayground all "misconfigured"? Surely if this is an obvious misconfiguration issue with no tradeoff, as you suggest, then these software packages should have no issue being configured correctly?

If you are willing to die on the hill that the vcplayground is representative of production software deployed to verify sensitive information and should be configured the same way, so be it. I can't take an exploit demonstrated in a public demo environment as empirical evidence that every piece of deployed software is vulnerable in the same way.

@msporny
Copy link
Member

msporny commented Jun 27, 2024

If you are willing to die on the hill that the vcplayground is representative of production software deployed to verify sensitive information and should be configured the same way, so be it.

I guess another way to put it, @tplooker, is: If we implemented strict checking of @context in the VC Playground, and removed any ability for a developer to add and test their own VC in the Playground, would you agree that the issue you reported is a "configuration issue"?

To be clear, Digital Bazaar's production deployments do strict checking of @context by not loading @context values from the network and by using pre-cached, vetted contexts. So, yes, there is production software out there that takes this approach, which is recommended in the VCDM specification. We do reject contexts we know nothing about by default, because that's the safest thing to do (again, we'll get to your "does not scale" argument, which it does, later).

@msporny
Copy link
Member

msporny commented Jun 27, 2024

@tplooker wrote:

PROPOSAL: Data integrity changes the proof generation and verification procedures to include a hash of the @context entries in the document ensuring no manipulation of the @context entry can be done without detection.

but then you say:

performing a hard validation against every context in a presented credential (effectively whitelist every context) simply doesn't scale

Those two statements seem logically contradictory, please help me understand them.

In order to accomplish "including a hash of the context entries in the document", you have to have a hash of each context entry when you issue AND the verifier needs to be able to independently verify the hashes of each context entry when they verify. IOW, the issuer needs to understand the contents of each context used in the VC and the verifier needs to understand the contents of each context used in the VC (or, at least, be provided with a list of trusted hashes for each context they are verifying).

You then go on to say that allowlisting contexts in that way is not scalable.

The specification insists that a verifier needs to check to make sure that they recognize every context in a VC before they take any significant action.

What is the difference between the verifier knowing the hashes of every context and the verifier checking the URLs of every context (which is vetted "by contents" or "by hash")? What am I missing?

@awoie
Copy link

awoie commented Jun 27, 2024

What is the difference between the verifier knowing the hashes of every context and the verifier checking the URLs of every context (which vetted by contents or by hash)? What am I missing?

@msporny why does the verifier need to know the hashes? Wouldn't it be possible to sign over the hashes and include the hashes in the proof object? Verifying the proofs would also include the verifier computing those hashes and checking them against the included hashes. I'm not saying this is my preferred solution but just asking whether I'm missing something here.

@dlongley
Copy link
Contributor

dlongley commented Jun 27, 2024

@awoie,

why does the verifier need to know the hashes? Wouldn't it be possible to sign over the hashes and include the hashes in the proof object? Verifying the proofs would also include the verifier computing those hashes and checking them against the included hashes. I'm not saying this is my preferred solution but just asking whether I'm missing something here.

All this would do is prove the document still expresses "something" in the same way it did when it was issued. But, as a verifier, you still don't know what that "something" is. You have to understand the context to actually consume the information. You don't have to understand that to confirm that the underlying information hasn't changed or to transform it from one expression to another (that you might understand).

So, the verifier will have to know the contexts (they can know them by hash or by content, as these are equivalent), such that they have written their applications against them, if they are to consume any terms that are defined by those contexts. This is why it does not matter whether the context is different from what the issuer used -- it doesn't help. Adding signed hashes doesn't help. In fact, if you lock the context down to a context that the verifier does not understand, it hurts.

If there's a context that a non-compacting verifier could use to consume the document, but the holder isn't free to compact to that context, then the verifier will not be able to accept the document. The holder would be forced to go back to the issuer and leak to them that they'd like to present to a verifier that only accepts documents in another context, asking for them to please issue a duplicate VC expressed in that other context.

If you have some special auxiliary terms that you want to consume in your own application, that you think many verifiers might reject based on a context they don't recognize:

  1. Express the credential using a context they will accept before sending it to them. Requests from verifiers can include the contexts they accept to help facilitate this. You might even be able to use selective disclosure to remove the special terms, if the verifier allows it.
  2. Realize they might reject your special data no matter what context you use (e.g., JSON schema "additionalProperties": false). Not everyone wants to accept something with a random https://example.com#meow property containing a huge list of favorite cat names, even if there's a unique and fantastic cat application that benefits from it. If your special terms have more utility than that, consider working with the community to provide a feature that everyone can benefit from (e.g., maybe a "render method"?), increasing the likelihood of acceptance by others.
  3. Compact to your special context only when using your special terms, and don't expose others to them in a way that could cause a conflict at global scale. Some verifiers are unlikely to accept anything like that anyway, and you don't get to make that decision for them.

@msporny
Copy link
Member

msporny commented Jun 27, 2024

@msporny why does the verifier need to know the hashes? Wouldn't it be possible to sign over the hashes and include the hashes in the proof object?

Ok, let's presume that's what we do... let's say we do something like this in the proof property:

"proof" : {
  ...
  "contextDigest": [
    ["https://www.w3.org/ns/credentials/v2", "0xfb83...43ad"],
    ["https://www.w3id.org/vdl/v1", "0x83a1...bc7a"],
    ["https://dmv-vocab/ns/dl/v1", "0x9b3b...24d4"],
  ]
  ...
]

When DI generates the proof, that content is signed over (both in RDFC and JCS). Alright, now the issuer has explicitly committed to cryptographic hashes for all context URLs, and wallets and verifiers can check against those context hashes.

Verifying the proofs would also include the verifier computing those hashes and checking them against the included hashes.

Yes, and for the verifier to compute those hashes, they need to fetch and digest each context URL listed above (which means they now have the entire content for each context)... or they need to have a list that they, or someone they trust, has previously vetted that contains the context URL to hash mappings.
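For concreteness, a hedged sketch of the verifier-side check under that hypothetical contextDigest design (the digest values are placeholders, and the digest input is assumed to be the raw JSON serialization of each context, which a real design would need to fix normatively):

const crypto = require('crypto');

// A previously vetted mapping of context URL -> expected SHA-256 digest
// (digest values are placeholders).
const VETTED_CONTEXT_HASHES = new Map([
  ['https://www.w3.org/ns/credentials/v2', 'fb83...43ad'],
  ['https://www.w3id.org/vdl/v1', '83a1...bc7a']
]);

// Alternatively, fetch each context and digest its content directly.
function digestContext(contextDocument) {
  return crypto.createHash('sha256')
    .update(JSON.stringify(contextDocument))
    .digest('hex');
}

// Compare the signed-over contextDigest entries against the vetted list.
function checkContextDigests(contextDigest) {
  for (const [url, digest] of contextDigest) {
    const expected = VETTED_CONTEXT_HASHES.get(url);
    if (!expected || expected !== digest.replace(/^0x/, '')) {
      throw new Error(`Unvetted or mismatched context: ${url}`);
    }
  }
}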

Having that information, however, is only part of what they need to safely process that document (and I'm going to avoid going into the use cases that we make impossible if we take that approach just for the sake of brevity for now --EDIT: Nevermind, turns out Dave and I were answering in parallel, see his post right above this one for some downsides of locking down the context hashes at the issuer). IF (for example) we continue to allow @vocab, that means that any context can come along and override it, which means that if the verifier wants to continue to be safe, they need to ensure that they are ok with what each context does, which means that they need to trust that each context used does things that they're ok with (like not override @vocab or overriding @vocab in a way that they're ok with or overriding unprotected terms in a way that is suitable for that use case or not attempting to protect terms that are already protected, etc.).

The point is that the contents of each context need to be known by the issuer (in order to hash them and generate the proof) and by the verifier (in order to verify that the contexts have not changed from when the issuer used them)... and if each party knows that information, then they have to know about each context and its contents (either by value or by cryptographic hash)... and if you know that information, you can verify the signature (and it'll either work if nothing the verifier is depending on has changed, or it'll fail if the contexts don't line up for the information that has been protected, which is what matters).

Did that answer your question, @awoie?

PS: As a related aside, I'm pretty sure we're using the words "known (context)", "understands (the context)", "trusts (the context)" in different ways that are leading to some of the miscommunication in this thread. I don't know what to do about it yet (other than keep talking), but just noting that we probably don't mean the same things when we use those words.

@tplooker
Copy link
Author

tplooker commented Jun 28, 2024

Those two statements seem logically contradictory, please help me understand them.

They aren't contradictory, but happy to explain.

Fundamentally including a hash of all the @context entries as a part of the signed payload accomplishes the following

It provides assurance to the issuer that, in order for a relying party to be able to successfully verify their signature, they MUST have the exact same context as the issuer who produced the credential. This universally ensures context manipulation cannot happen after issuance without detection. And I might add there are more ways to mess with the context beyond the vulnerabilities I described at the start of this issue, so this just solves all of that outright.

Because these @context values are integrity protected, a relying party could in certain situations safely download them over a network if they don't already have them, because if they get corrupted or tampered with in any way, signature validation will then fail, and this is the key to solving the scalability challenge. The use case I described above gets somewhat more bearable: if I as a verifier encounter a VC with a context I don't understand and that isn't actually critical for me to understand, I can safely resolve it over a network, cache it, and be confident it hasn't been messed with when I validate the signature. This isn't a perfect solution, but it is much better than the current state of play and likely the best we can do with data integrity without simply just signing the whole document with JWS instead, which of course would be much easier.

The important difference between your proposal and mine @msporny et al, is your solution

  1. Relies on guidance that developers can ignore. The VC playground, and all software connected to it, despite what is being insisted about what this means, is at a minimum clear evidence that implementations CAN and WILL ignore the advice to pin contexts currently in the spec, leaving them entirely open to these and other vulnerabilities.
  2. Provides the issuer with no enforceable way of knowing that the only way their signature will verify is if the verifier has the contexts the issuer used.
  3. Won't work at scale because of the need to have a preprogrammed awareness of all possible @context values that issuers are using ahead of verification, even ones that aren't critical for an application to understand.

@tplooker
Copy link
Author

tplooker commented Jun 28, 2024

IF (for example) we continue to allow @vocab, that means that any context can come along and override it, which means that if the verifier wants to continue to be safe, they need to ensure that they are ok with what each context does, which means that they need to trust that each context used does things that they're ok with (like not override @vocab or overriding @vocab in a way that they're ok with or overriding unprotected terms in a way that is suitable for that use case or not attempting to protect terms that are already protected, etc.).

Only if @vocab isn't fixed like it should be in JSON-LD to respect the @protected keyword.

@msporny
Copy link
Member

msporny commented Jun 29, 2024

@tplooker wrote:

Only if @vocab isn't fixed like it should be in JSON-LD to respect the @protected keyword.

@dlongley explains in this comment why what you are requesting is a logical impossibility.

To summarize:

  1. The VCDM v2 context defines @vocab to be https://www.w3.org/ns/credentials/v2.
  2. If JSON-LD is "fixed", and @protected is applied to @vocab in the VCDM v2 context, then that would mean that all contexts other than the VCDM v2 context would throw an error. This would happen because those other contexts would be trying to re-define undefined terms that have already been expanded by @vocab in the VCDM v2 context.

That might, understandably, seem counter-intuitive to some, but it does make logical sense once you think about it. So, let's walk through an example:

In year 1, I use just the VCDM v2 context, which I'm going to ship to production (the reason doesn't matter, I'm just going to do that). In that VC, I use MySpecialCredential for the type and it has a website property in credentialSubject to express my website. The base context has @vocab and so those terms are "issuer-defined" and are therefore expanded to https://www.w3.org/ns/credentials/issuer-dependent#MySpecialCredential and https://www.w3.org/ns/credentials/issuer-dependent#website.

In year 2, I decide that I want to define those more formally, so I create a new context that I'll append after the VCDM v2 context and in that context I define MySpecialCredential and the website property. When I run a "fixed" JSON-LD processor that protects @vocab on my document, which includes the new context, it throws an error. But, why did it throw an error?

It throws an error because @vocab was protected in year 1, which catches ALL undefined properties in the VCDM v2 context. MySpecialCredential and website are not in the base context, they're undefined, so they're caught by the VCDM v2 protected @vocab statement. Now, in year 2, the second context comes into play and tries to re-define MySpecialCredential... but it can't do that, because MySpecialCredential is already mapped via the base VCDM v2 @vocab statement... which is protected, so the processor throws an error because I'm trying to re-define something that is already defined by the base VCDM v2 context. If we added the ability to protect a @vocab assertion in a context, it necessarily causes ALL subsequent contexts to throw an error.
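To make the walk-through concrete, here is a hedged sketch of the two stages (the year-2 term IRIs are hypothetical placeholders):

// Year 1: base context only; "MySpecialCredential" and "website" are
// undefined terms caught by the base context's @vocab.
{
  "@context": ["https://www.w3.org/ns/credentials/v2"],
  "type": ["VerifiableCredential", "MySpecialCredential"],
  "credentialSubject": {"website": "https://example.com"}
}

// Year 2: a second context tries to define those same terms formally.
// Under a hypothetical "protected @vocab", this redefinition would be an
// error, because the terms were already caught by the protected catch-all.
{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    {
      "MySpecialCredential": "https://example.com/vocab#MySpecialCredential",
      "website": "https://example.com/vocab#website"
    }
  ],
  "type": ["VerifiableCredential", "MySpecialCredential"],
  "credentialSubject": {"website": "https://example.com"}
}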

Again, I know it sounds like a "nice to have" when said out loud, but when you think through the logical implementation of it, it doesn't work. I hope it's clear at this point that proposal 2 is unworkable.

If there was some other way you were expecting it to be implemented, please let us know; perhaps we don't see what you see.

@msporny
Copy link
Member

msporny commented Jun 29, 2024

@tplooker wrote:

This isn't a perfect solution, but it is much better than the current state of play and likely the best we can do with data integrity without simply just signing the whole document with JWS instead, which of course would be much easier.

Just "simply signing the whole document with JWS" does not:

  1. Hash the contexts (as you've asserted is a requirement to know the issuer's intent), nor
  2. Digitally sign N-Quads (as Markus has previously documented, which provides protection against the statements changing), nor
  3. Ensure that a verifier understands the contexts used.

It is a red herring; it is not a solution to the concerns that you have raised. A verifier still has to ensure that a VC secured with any enveloping signature contains the semantics that they expect. They cannot just blindly accept any list of contexts and start executing business rules, even if they trust the issuer.

@dlongley
Copy link
Contributor

dlongley commented Jun 29, 2024

@tplooker,

Only if @vocab isn't fixed like it should be in JSON-LD to respect the @protected keyword.

As mentioned above, if I understand your ask properly, I think it is a logical impossibility.

The purpose of @protected is to allow consumption of specific terms in JSON-LD documents when only the contexts that define those terms are known (and other contexts are not, but they can be present). So, for example, if you have an array of contexts: [A, B], then an application can consume @protected terms defined in context A, without knowing context B (and importantly, not consuming any of the terms defined by context B). Again, note that a consumer MUST always understand the context that defines the terms it consumes -- and this holds true here.

Now, the way that @vocab is being used today in the VC v2 core context is as a "catch all" for any JSON keys not defined in that context. I believe you're asking that we apply "protection" to this value, with the aim of supporting consumption of terms in JSON-LD documents with contexts such as: [<core vc v2 context>, <unknown>]. However, @vocab "defines all terms" when it is used as a "catch all". By defining all terms in a protected way, it necessarily means that no further terms can be defined -- in any subsequent context.

It would not be possible to ever have the core VC v2 context be followed by any other meaningful contexts. Clearly this is not desirable and would prevent every other common use of VCs. If a consumer desires the definition of any other terms after a "catch all" @vocab to be prohibited, they can require that the context with this definition be the last context in documents they accept -- or they can use the JSON-LD compaction API.
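For illustration, a small sketch of that last-context check (the context URL is a placeholder):

// Accept a document only if the context defining the terms we consume is the
// final @context entry, so no later context could have defined anything else.
function lastContextIs(doc, expectedUrl) {
  const contexts = [].concat(doc['@context'] || []);
  return contexts.length > 0 && contexts[contexts.length - 1] === expectedUrl;
}

// e.g. if (!lastContextIs(vc, 'https://example.org/my-terms/v1')) { /* reject */ }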

@msporny
Copy link
Member

msporny commented Jun 29, 2024

@tplooker wrote:

They aren't contradictory, but happy to explain.

You didn't address the point of contention. The point of contention was that you (and @awoie, I presume) assert two things in your solution:

  1. An issuer must hash all contexts used in DI-protected content and protect them with the signature, and
  2. A verifier must know that these hashes are legitimate for their use (by either having the contexts or a hash of the contexts).

But then both of you state that distributing contexts in this way doesn't scale.

It sounds like you're saying that "even if we do 1 and 2, the solution won't work anyway, because there is no scalable way to distribute contexts".

It may be that you and @awoie think that the /only/ way for the verifier instance to operate is by having a completely fixed and static list of contexts they accept (and that that doesn't scale). It might be that you think that @filip26's example, which was just the simplest example that could be provided to demonstrate how easy it is to protect against the attack you describe (which is a minimum bar that the specification suggests), is the "one and only way" we're proposing? If that's the misunderstanding, then I can understand why you and @awoie are saying what you're saying. If it isn't, then I'm still seeing a contradiction.

Please clarify what you mean by "does not scale", because it's a misunderstanding we could focus on and clean up before continuing with analyzing solutions.

@msporny
Copy link
Member

msporny commented Jun 29, 2024

In an attempt to address the assertions you made above, which are beside one of the points of contention above:

@tplooker wrote:

Relies on guidance that developers can ignore

I already covered this point above.

Developers can ignore any guidance in the specification. We call that "doing a bad job" or, at worst, a non-conforming implementation. We can write algorithm language and tests that make it far less likely for a conforming implementation to misimplement in the way that you are concerned about.

I think we will get consensus to "do something" here, we're just debating what that "something" needs to be. At present, there is contention over at least two approaches:

  1. Include the context hashes in the signature.
  2. Ensure that a "trusted list of context URLs" is passed into the verification algorithm and checked (while still allowing @vocab to be used by those that want to do so).

The VC playground, and all software connected to it, despite what is being insisted about what this means, is at a minimum clear evidence that implementations CAN and WILL ignore the advice to pin contexts currently in the spec, leaving them entirely open to these and other vulnerabilities.

The playground does not pin to context hashes because many of the contexts used are changing regularly. Data Integrity (using RDFC) gets its security from the signed statements, which cryptographically hash only the values in the context that are used. Verifiers must check context values that are used in messages sent to them in production.

Developer software and playgrounds are NOT to be confused with production software.
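For readers following along, a minimal sketch of what "signed statements" refers to here (assuming the jsonld.js canonicalization API; the algorithm identifier may differ by library version):

const jsonld = require('jsonld');

async function statementsToSign(credential) {
  // RDF Dataset Canonicalization produces the N-Quads that Data Integrity
  // hashes and signs; context term definitions that are never used by the
  // document contribute nothing to this output.
  return jsonld.canonize(credential, {
    algorithm: 'URDNA2015',
    format: 'application/n-quads'
  });
}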

Provides the issuer with no enforceable way of knowing that the only way their signature will verify is if the verifier has the contexts the issuer used.

IF an issuer and a verifier follow the rules and guidance in the specification today, they are guaranteed (in an enforceable way) that the number of statements, the protected terms, and the information they expressed will not change when the verifier checks them.

If the issuer is sloppy in production and uses @vocab and the verifier is equally sloppy and doesn't check incoming contexts (which are the "vulnerabilities" disclosed in this issue)... then that's where things can go wrong and we should improve the specification text to make that a non-conforming implementation.

Won't work at scale

I cover this point in a previous comment.

@msporny
Copy link
Member

msporny commented Jun 29, 2024

I have raised w3c/vc-data-model#1514 to evaluate what to do about @vocab in the VCDM v2 specification and context. Please provide input over there, on that item specifically, while we process other parts of this issue here.

/cc @dlongley @kimdhamilton @aniltj @ottonomy @PatStLouis @mavarley @peacekeeper
