Context and property optimization/compression #6

Open
msporny opened this issue Mar 31, 2020 · 0 comments

msporny commented Mar 31, 2020

I've run a few tests of this locally over the years, resulting in some pretty great outcomes. I'll start with a few statements and work from there:

  • It is possible to use cryptographic hashes to represent URLs.
  • The BLAKE2 and KangarooTwelve algorithms support variable-length outputs, sized according to the desired collision resistance.
  • A JSON-LD Context specifies how to interpret the semantics of a document, and JSON-LD Contexts, like terms, are expressed as URLs.
  • It is possible to use integers as CBOR keys and values.
  • It is possible to create a 16-bit lookup table that stores all well-known JSON-LD Contexts associated with standards (a rough sketch of the hashing and lookup-table ideas follows this list).
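
As a rough sketch of the hashing and lookup-table ideas above (assuming Python's standard hashlib; the registry contents and digest lengths are purely illustrative, not anything specified by CBOR-LD):

```python
import hashlib

# A hypothetical 16-bit registry of well-known, standards-related JSON-LD
# Contexts; the table entries here are illustrative only.
WELL_KNOWN_CONTEXTS = {
    0x0001: "https://www.w3.org/2018/credentials/v1",
    0x0002: "https://w3id.org/security/v2",
}

def context_hash(url: str, length: int = 8) -> bytes:
    """Return a variable-length BLAKE2b digest of a JSON-LD Context URL.

    `length` is chosen to match the desired collision resistance; a few
    bytes is often enough when the set of contexts in play is small.
    """
    return hashlib.blake2b(url.encode("utf-8"), digest_size=length).digest()

# A registered context compresses to a 2-byte integer key; an unregistered
# one can still be reduced to a short hash of its URL.
print(context_hash("https://example.org/my-context/v1", length=6).hex())
```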

What this means is that, in certain cases, we can:

  • Compress every JSON-LD Context used in a document down to a variable-length cryptographic hash... that is, down to a few bytes... and use that hash as a "base URL" for all terms used in a CBOR-LD document.
  • Compress all expanded terms and RDF Class URLs used in a document down to a few bytes using the same approach as the previous step, but with even fewer bytes, because the JSON-LD Context hash at the start of the CBOR payload already gives us a global identification mechanism (a sketch of such a compression pass follows this list).
  • Tag these documents as "compressed CBOR-LD" documents.
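
A minimal sketch of such a compression pass, assuming the third-party cbor2 package for encoding; the tag value and the term table are hypothetical and invented for the example, not part of any specification:

```python
import hashlib

import cbor2  # third-party CBOR package, assumed here for encoding

# Hypothetical semantic tag marking a "compressed CBOR-LD" document.
COMPRESSED_CBORLD_TAG = 0x0501

def compress(doc: dict, term_table: dict) -> bytes:
    """Encode a JSON-LD document as tagged CBOR with integer term keys.

    `term_table` maps term strings (defined by the document's context)
    to small integers; key 0 is reserved for the context hash.
    """
    ctx_hash = hashlib.blake2b(
        doc["@context"].encode("utf-8"), digest_size=6).digest()
    compressed = {0: ctx_hash}  # key 0: short hash standing in for the context URL
    for term, value in doc.items():
        if term != "@context":
            compressed[term_table[term]] = value
    return cbor2.dumps(cbor2.CBORTag(COMPRESSED_CBORLD_TAG, compressed))

# Example use; the term integers are invented for illustration.
doc = {
    "@context": "https://example.org/my-context/v1",
    "name": "Alice",
    "knows": "did:example:bob",
}
print(len(compress(doc, {"name": 1, "knows": 2})), "bytes")
```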

If we do all of those things, in certain cases, we get:

  • Single-byte to sub-byte values for terms and classes in a CBOR-LD document.
  • Global uniqueness (read: excellent collision resistance) for all terms in a CBOR-LD document without sacrificing storage size.
  • An efficient, semantically meaningful normalization mechanism that relies on byte comparisons (similar to JCS, but without tons of string comparisons) -- we could replace RDF Dataset Normalization in certain scenarios (a short sketch of the byte-compare idea follows this list).
  • An efficient, semantically meaningful binary template format.
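
On the byte-compare normalization point: once every key is a small integer, a canonical CBOR encoding of two semantically equivalent maps yields identical byte strings, so equality checks, hashing, and signing reduce to byte comparisons. A minimal sketch, again assuming the cbor2 package and its canonical encoding option:

```python
import cbor2  # third-party CBOR package; canonical=True sorts map keys deterministically

# Two maps with the same integer-keyed content, built in different orders.
a = {2: "did:example:bob", 1: "Alice", 0: bytes.fromhex("a1b2c3d4e5f6")}
b = {0: bytes.fromhex("a1b2c3d4e5f6"), 1: "Alice", 2: "did:example:bob"}

# Canonical encoding yields identical bytes for both, so comparing, hashing,
# or signing the documents needs no string-based normalization step.
assert cbor2.dumps(a, canonical=True) == cbor2.dumps(b, canonical=True)
```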

In short, we could achieve compression rates up to 75% for small documents.
