Data Module #108

Closed
aaronc opened this issue Aug 17, 2020 · 8 comments

Comments

@aaronc
Member

aaronc commented Aug 17, 2020

Summary


This proposes a module for tracking data related to ecological claims that lives both on and off-chain.

Problem Definition

The data module aims to satisfy the following use cases:

  • provide secure timestamps and signatures for data used to verify ecological state to allow for both programmatic verification and auditing
  • provide signed data to on-chain programmatic smart contracts
  • allow the blockchain to be used as an index for off-chain data
  • store on-chain metadata for ecosystem service credit fractional NFTs (Ecocredit Module on Mainnet #78) using a generic data system  

Can you say any more about this @clevinson ?

Proposal

Anchoring Data

Anchoring data refers to storing a hash of a piece of data on-chain that allows for proof of existence of that data at some block height. This effectively creates a “secure timestamp” for the data which proves that the data was created no later than the block height where it was included.

We propose a simple message MsgAnchorData for anchoring data on chain which uses the IPFS CID Specification as the expected format for data hashes:

message MsgAnchorData {
  // sender is the address of the party submitting the transaction
  bytes sender = 1;
  // cid is a binary IPFS CID identifier without a multibase prefix
  bytes cid = 2; 
}
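As a rough illustration of what a client might compute before submitting MsgAnchorData, the sketch below builds a binary CIDv1 (raw codec, sha2-256) with no multibase prefix and records the first block height at which it was seen. The `AnchorStore` class and its method names are hypothetical stand-ins for on-chain state, not part of the proposal.

```python
import hashlib

# Codes taken from the public multiformats tables.
CID_VERSION_1 = 0x01
CODEC_RAW = 0x55
HASH_SHA2_256 = 0x12


def make_cid(content: bytes) -> bytes:
    """Build a binary CIDv1 (raw codec, sha2-256) without a multibase prefix."""
    digest = hashlib.sha256(content).digest()
    # Every value here is below 0x80, so each varint fits in a single byte.
    return bytes([CID_VERSION_1, CODEC_RAW, HASH_SHA2_256, len(digest)]) + digest


class AnchorStore:
    """Hypothetical in-memory stand-in for on-chain anchor state."""

    def __init__(self):
        self._anchored_at = {}  # cid -> first block height seen

    def anchor(self, cid: bytes, block_height: int) -> int:
        # Keep the earliest height: re-anchoring must not move the timestamp.
        return self._anchored_at.setdefault(cid, block_height)


store = AnchorStore()
cid = make_cid(b"soil sample report")
first = store.anchor(cid, block_height=100)
again = store.anchor(cid, block_height=250)  # still reports height 100
```

Keeping only the earliest height is what makes the anchor usable as a "no later than" timestamp.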

Signing Data

Signing data refers to creating a claim that the signer of the piece of data attests to its veracity. What veracity means may depend somewhat on the context of the document being signed, but for simplicity we draw an analogy to a legal document. If a party puts their signature on a legal document, it's pretty clear from the contents of the legal document what the signature implies.

We propose a simple message MsgSignData for making on-chain signatures to data:

message MsgSignData {
  // signers is the addresses of the signers of the document
  repeated bytes signers = 1;
  // cid is the binary IPFS CID identifier of the document
  bytes cid = 2;
}

A few notes:

  • signers is distinct from sender in MsgAnchorData. The sender in MsgAnchorData could be a third-party relayer like a “postman”. Just delivering a document does not mean that one has signed the document in the legal sense. So MsgAnchorData implies that the sender simply delivered the document to the registry for timestamping. MsgSignData implies that the signers signed the document.
  • submitting MsgSignData automatically “anchors” the document on-chain. A call to MsgAnchorData is not needed if this is the first time the data has appeared on-chain. Two different messages are proposed because the data may be anchored on-chain before it is signed and different signers may sign the same document at different points in time.
  • on-chain signatures have some special value compared to off-chain PGP-like signatures. For instance, on-chain mechanisms that support key rotation like the group module allow public keys to be securely associated with an identity at one point in time and not another. The on-chain signature captures when the signature was created to verify that it is valid at that point in time. Off-chain mechanisms can include the concept of key revocation but there is no way of enforcing this without the type of secure timestamps and account sequences that a blockchain provides.
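The notes above can be sketched as a small in-memory model. This is only an illustration: `DataRegistry` and its fields are hypothetical, and keeping the first signature height when a signer signs the same CID twice is an assumption rather than something the proposal specifies.

```python
class DataRegistry:
    """Hypothetical in-memory model of anchor and signature state."""

    def __init__(self):
        self.anchored_at = {}  # cid -> first block height seen
        self.signatures = {}   # cid -> {signer: block height of signature}

    def anchor_data(self, sender, cid, height):
        # The sender only delivers the data; no attestation is recorded.
        self.anchored_at.setdefault(cid, height)

    def sign_data(self, signers, cid, height):
        # Signing implicitly anchors data that has not appeared on-chain yet.
        self.anchored_at.setdefault(cid, height)
        by_signer = self.signatures.setdefault(cid, {})
        for signer in signers:
            # Assumption: a repeated signature keeps its original height.
            by_signer.setdefault(signer, height)


registry = DataRegistry()
cid = b"\x01\x55\x12\x20" + bytes(32)  # placeholder CID bytes
registry.anchor_data(sender=b"postman", cid=cid, height=10)
registry.sign_data(signers=[b"alice"], cid=cid, height=12)
registry.sign_data(signers=[b"alice", b"bob"], cid=cid, height=20)
```

Here the relayer anchors the document at height 10 without appearing among the signers, and the two signers attach their signatures at different heights.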

Storing Data

If desired, we can also store data directly on the blockchain. Off-chain storage will generally be cheaper, but on-chain data storage provides the following benefits:

  • high availability guarantee
  • data can be made available to programmatic smart contracts

We propose a simple message MsgStoreData for storing data on-chain:

message MsgStoreData {
  bytes sender = 1;
  // cid is the binary IPFS CID
  bytes cid = 2;
  // content must match the provided CID
  bytes content = 3;
}

The provided content bytes should be verified against the provided CID. While MsgAnchorData and MsgSignData should support any valid CID, MsgStoreData should support only an approved list of formats and hashes with gas priced appropriately for each supported hash. 
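A minimal sketch of that check follows, assuming a binary CIDv1 whose version, codec, hash code and digest length each fit in a single varint byte. The approved list and the gas-per-byte figures are invented for illustration; the proposal does not fix either.

```python
import hashlib

# Hypothetical approved list: multihash code -> (hash constructor, gas per byte).
APPROVED_HASHES = {
    0x12: (hashlib.sha256, 10),  # sha2-256
    0x13: (hashlib.sha512, 15),  # sha2-512
}
APPROVED_CODECS = {0x55}  # raw binary


def verify_store_data(cid: bytes, content: bytes) -> int:
    """Check content against a binary CIDv1 and return the gas to charge."""
    if len(cid) < 4 or cid[0] != 0x01:
        raise ValueError("only binary CIDv1 is handled in this sketch")
    codec, hash_code, digest_len = cid[1], cid[2], cid[3]
    if codec not in APPROVED_CODECS or hash_code not in APPROVED_HASHES:
        raise ValueError("unsupported codec or hash function")
    hasher, gas_per_byte = APPROVED_HASHES[hash_code]
    digest = hasher(content).digest()
    if len(digest) != digest_len or cid[4:] != digest:
        raise ValueError("content does not match CID")
    return gas_per_byte * len(content)


content = b"monitoring data"
cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(content).digest()
gas = verify_store_data(cid, content)
```

Rejecting unknown hash codes before hashing keeps the cost of a failed MsgStoreData bounded, which is one reason to restrict the list.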

Future Improvements

Off-chain Data URL Index

One simple improvement to the above design is to allow for URLs to be stored for any CID. This effectively creates an index of

Partial Data Storage

Using a merkle-tree based canonicalization/hashing algorithm as described in #64, we could allow for part of a piece of data to be stored on-chain using merkle proofs. This would allow for:

  • partial privacy without needing more complex mechanisms like zero knowledge proofs
  • metadata parts of a document to be stored on-chain without storing a full dataset
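To make the idea concrete, here is a generic binary Merkle tree over the fields of a document, with a proof that one field belongs to the anchored root. It is not the canonicalization algorithm from #64; the leaf and node prefixes, the duplicate-last-node padding, and the sample fields are all assumptions chosen for the sketch.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level):
    return [h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    """Root of a binary Merkle tree; an odd node is paired with itself."""
    level = [h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, tagged True when the sibling is on the left."""
    level = [h(b"\x00" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((sibling < index, level[sibling]))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf, proof, root):
    node = h(b"\x00" + leaf)
    for sibling_is_left, sibling in proof:
        node = h(b"\x01" + sibling + node) if sibling_is_left else h(b"\x01" + node + sibling)
    return node == root


fields = [b"project=grassland-7", b"area_ha=120", b"owner=private"]
root = merkle_root(fields)              # this is what would be anchored
proof = merkle_proof(fields, 1)         # reveal only the area field
ok = verify(b"area_ha=120", proof, root)
```

Only the revealed field and its sibling hashes need to go on-chain; the other fields stay private while remaining bound to the anchored root.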

Secondary indexes

We could allow certain data properties within well-defined document structures to be automatically indexed on-chain so that they are searchable in smart contracts. For instance, we could define a property that describes a geographic polygon, and whenever a piece of data containing that property is stored, it is indexed in an on-chain geospatial data store. This requires more in-depth research. See #87.

Schema Validation

On-chain data could be validated against some schemas for conformity. This may or may not be the responsibility of on-chain consensus and would depend heavily on the use case. Schema validation, if it were to be incorporated, would likely relate to some on-chain schema registry, which is out of the scope of this proposal.

@aaronc aaronc changed the title Data Module On and Off-chain Data Support Aug 17, 2020
@aaronc aaronc changed the title On and Off-chain Data Support Data Module Aug 17, 2020
@aaronc
Member Author

aaronc commented Aug 17, 2020

Issues in this epic:

Title Milestone Assignees Stage State
Create merkle-tree based RDF canonicalization spec #64 N/A N/A Open

@aaronc aaronc added backlog and removed backlog labels Oct 21, 2020
@clevinson
Member

clevinson commented Oct 22, 2020

As far as the problem definition goes, I'd like to flesh out 2 interrelated use cases that will be good to orient this work around:

Regen Registry & Credit Module Implementation

As part of our issuing of CarbonPlus Grasslands credits, the credit module keeps its use of metadata quite minimal (see credit rfc). This metadata is meant to link to claims about ecological data. While the credit RFC in its current form only requires arbitrary metadata blobs that may link to off-chain data, we should also consider the use case of metadata from a credit linking to a previous claim (or baseline monitoring assessment) that is also represented as an on-chain asset. In this case, it would be good to have an easy way for the credit module to point to a dataset that was stored or signed using the operations described here.

OpenTEAM SurveyStack Digital Signatures Work Package

Some of the first technical partners that will want to make use of the data anchoring, storage, and signing capabilities are from OpenTEAM.

This year we had an approved paired work session with SurveyStack (a generic survey form application) as an initial partner for adding digital signature capabilities to tools in the OpenTEAM ecosystem. This is currently a proposed item for SurveyStack's "Year 2 Research Farm Roadmap" proposal (see Blockchain / Ledger integration section).

This work package actually connects quite closely with the previous Regen Registry use case, as the end-use case for this SurveyStack integration is the monitoring surveys that are being completed by Regen's science team as part of the monitoring tasks for issuing a CarbonPlus Grasslands credit.

@clevinson
Member

In response to the actual proposal, I think this is a great starting place. Anchoring data, Signing data, and Storing data all being treated as separate operations makes sense to me - and by use of the same CID standard, they can actually connect to each other.

2 questions I have though:

  • what kind of querying functionality would we like to support?
    • the ability to index by CID ? (give me all signatures, and/or content corresponding to some CID) ?
    • will there be any deduplication efforts to prevent storing multiple anchorings of the same dataset, what about multiple sign records and/or overlapping groups of signers? How will these be resolved when querying?
  • Even at this initial step, it would be really great to have some optional parameters that provide geoJSON / polygon support. Is that something that can be easily added to this scope? I think that even the most basic structured polygon data alongside CIDs would be a huge win and make early adopters a lot more excited to experiment with the module.

@aaronc
Member Author

aaronc commented Oct 22, 2020

In response to the actual proposal, I think this is a great starting place. Anchoring data, Signing data, and Storing data all being treated as separate operations makes sense to me - and by use of the same CID standard, they can actually connect to each other.

2 questions I have though:

  • what kind of querying functionality would we like to support?

    • the ability to index by CID ? (give me all signatures, and/or content corresponding to some CID) ?

Yes

  • will there be any deduplication efforts to prevent storing multiple anchorings of the same dataset, what about multiple sign records and/or overlapping groups of signers? How will these be resolved when querying?

Yes deduplication of data and yes multiple signers.

  • Even at this initial step, it would be really great to have some optional parameters that provide geoJSON / polygon support. Is that something that can be easily added to this scope? I think that even the most basic structured polygon data alongside CIDs would be a huge win and make early adopters a lot more excited to experiment with the module.

Content can include polygon data if it chooses. I'm resistant to adding some geo capability on top of this just because. I believe we will eventually want to specify more structured data, ideally with a Merkle tree based CID hash (#64) that includes some geo polygon spec. It's still only really relevant on chain if it's indexed IMHO.

@robert-zaremba
Collaborator

Thoughts

The idea of using a blockchain for timestamping data is one of the canonical use cases. It has been done on Bitcoin, Ethereum and even Ripple. There is also a lot of academic research (e.g. Stampery Blockchain Timestamping Architecture).

  1. It seems that x/data duplicates the blockchain (block data): we copy the data from transactions and wrap it in more meaningful structures without performing any additional operations.
  2. We claim that this data won't be manipulated. TBH, I'm not sure this is easy without a code audit and a proper module authorization mechanism (which we started to address in the SDK).

If we don't perform any operation on the data, then a more natural way would be to wrap the transaction data in Events and use an off-chain DB for indexing. The verification process would be as follows: query the index for a specific piece of data. This returns a transaction_id or an event. You can then use it to query Tendermint to validate that the event / transaction indeed happened at a particular time. This would solve the issues above and keep the blockchain slim.

@aaronc
Member Author

aaronc commented Nov 5, 2020

That's true @robert-zaremba. I guess one complication is we don't have access to the tendermint block history in the SDK state. So if we wanted to use the data or signers in a smart contract we couldn't... But yes it is duplication

@robert-zaremba
Collaborator

Wouldn't it be better to allow smart contracts to select Events? A smart-contract method argument would then require an event id.

@clevinson
Member

Closing this as the basic implementation was completed in #118 and #124

@aaronc aaronc removed the backlog label Jan 12, 2021