[VM] Compress contracts before storing them and decompress on load, and bill the user only for the number of bytes in the compressed representation #2926

Open
LNow opened this issue Nov 13, 2021 · 8 comments
Labels
consensus-critical icebox Issues that are not being worked on L1 Working Group Issue or PR related to improving L1 ship future

Comments

@LNow

LNow commented Nov 13, 2021

Is your feature request related to a problem? Please describe.
Contracts with comments (especially very verbose and descriptive ones) are more expensive to use than contracts that do not have any comments at all. The difference in execution cost can be easily explained by the difference in size.

Over time we will see more and more complex contracts, and to make them readable and somewhat understandable to ordinary users they will need more and more comments.
If developers have to choose between contract readability and lower execution costs, they will start choosing the latter. As a result we will lose the most important feature of Clarity.

But what if we would store contracts in 2 versions?

  1. Original one, just like they have been written and deployed by developers.
  2. "Minified" one, stripped of all comments and extra whitespace

The first one would be stored on-chain just like it is done right now, to keep readability. The second one could be stored on the side and used as the "executable" version, to reduce execution costs.
There is no point in loading contracts with comments into memory every single time they are called if comments play no role in execution.

Developers could pay more for contract deployment (2x storage + additional processing), but execution should be cheaper and faster.
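
To make the "minified" version concrete, here is a minimal sketch in Python (not the node's language, and not anything taken from the codebase). It assumes that Clarity comments start with `;;` and run to the end of the line, and that string literals do not span lines. The name `strip_comments` is hypothetical.

```python
def strip_comments(source: str) -> str:
    """Drop `;;` comments, indentation, and blank lines.

    Assumptions (not verified against the real Clarity lexer): comments
    start with `;;`, and string literals stay on a single line.
    """
    out_lines = []
    for line in source.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if in_string:
                # Inside a string literal: only look for its closing quote.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif line[i:i + 2] == ";;":
                # A comment marker outside any string: cut the line here.
                cut = i
                break
        stripped = line[:cut].strip()
        if stripped:
            out_lines.append(stripped)
    return "\n".join(out_lines)
```

Calling `strip_comments` on a deployed source would give the candidate "executable" text, while the original text stays untouched for readers.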

@project-bot project-bot bot added this to New Issues in Stacks Blockchain Board Nov 13, 2021
@jcnelson
Member

Yes, probably. We can have the DB store a compressed representation of the contract, and only bill the user for loading/storing the compressed representation. No need to limit ourselves to a minified representation -- we can lz4 it for example.

@LNow
Author

LNow commented Nov 15, 2021

Great! With such a change we could push stacksgov/sips#32 forward without thinking much about how comments affect contract size and execution costs.
As a developer I would pay more to deploy 100% readable and nicely documented code if I were sure that my comments won't have a big negative impact on contract execution costs.

@jcnelson jcnelson added this to To do in Stacks 2.1 via automation Nov 15, 2021
@jcnelson jcnelson changed the title from [Clarity VM/DB] store and use "minified" contracts during execution. to [Clarity] Compress contracts before storing them and decompress on load, and bill the user only for the number of bytes in the compressed representation Nov 15, 2021
@jcnelson
Member

I updated the issue name to reflect the change that will be carried out here. It's pretty straightforward:

  • compress the contract before storing it
  • decompress it while loading it
  • bill the caller by the number of bytes in the compressed representation
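
The three steps above can be sketched roughly as follows. This is an illustration in Python, not the node's implementation: `ContractStore` and its methods are hypothetical names, the store is an in-memory dict, and `zlib` from the standard library stands in for whichever codec (lz4, for example) would actually be chosen.

```python
import zlib
from typing import Dict, Tuple


class ContractStore:
    """Toy store: contracts are kept compressed and billed by compressed size."""

    def __init__(self) -> None:
        self._db: Dict[str, bytes] = {}

    def store(self, contract_id: str, source: str) -> int:
        """Compress before storing; return the number of bytes to bill."""
        compressed = zlib.compress(source.encode("utf-8"), 9)
        self._db[contract_id] = compressed
        return len(compressed)

    def load(self, contract_id: str) -> Tuple[str, int]:
        """Decompress on load; return the source and the bytes to bill."""
        compressed = self._db[contract_id]
        source = zlib.decompress(compressed).decode("utf-8")
        return source, len(compressed)
```

Because the billed amount is the length of the compressor's output, every node would have to produce exactly the same bytes for the same source, which is why the choice and pinning of the codec matters.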

@cylewitruk
Member

cylewitruk commented Nov 14, 2022

I think this issue sounds interesting, if only to learn more of the codebase (I also like saving storage space). Is this still relevant? If so I'd be happy to take an (educational) stab at it... @jcnelson

A few random thoughts:

  • It would be interesting to explore the separation of storage of a minimized & execution-optimized version of the contract, vs. the contract in its original form (either completely separately, or diff'ed). Rationale being that the contract, once published, will mostly be read for execution. The stacks API could read from endpoints that specifically retrieve the originally-formatted version. This would imply that the costs for storing and subsequently reading a contract would be different.
  • On top of the above, would it be possible to further optimize this to only retrieve called function(s) + variables?
  • Given the above point, could nodes in some way opt-out from storing the "original" (non-minimized) versions of contracts? If so, how to ensure that a contract's original format could never be lost? (maybe some sort of minimum-number-of-nodes that has a full copy..?)
  • Benchmarking a (larger) sample of existing Clarity contracts against different compression algorithms. LZ4 is across-the-board a winner. But perhaps for read-heavy workloads there is an algorithm that is slower but more efficient on the compression-side, but cheaper on the read-side? Maybe zstd?
  • Some sort of header likely needs to be introduced for storing/retrieving (on a node-level) so that the node can determine upon read how to handle the data.

And then exploring other databases than SQLite, for example RocksDB (since the majority of operations are KV in nature). RocksDB particularly because it supports e.g. LZ4 out-of-the-box. A change here would likely be a separate issue, pending its relevance. Never mind, now I found the other usages :)

@jcnelson
Member

It's definitely relevant! Compressing the clarity contract text could save ~50% of the bytes loaded. In fact, changing the on-disk representation of the Clarity code and analysis metadata could be done at any time, without a consensus-breaking change or a SIP. However, in order to pass the savings on to users (e.g. by changing the amount of block space it requires), we'd need to calculate a new cost function for contract-loads. This could be done with the voting procedure described in SIP-006, or it could be done in the next hard fork -- whichever happens sooner.

> It would be interesting to explore the separation of storage of a minimized & execution-optimized version of the contract, vs. the contract in its original form (either completely separately, or diff'ed). Rationale being that the contract, once published, will mostly be read for execution. The stacks API could read from endpoints that specifically retrieve the originally-formatted version. This would imply that the costs for storing and subsequently reading a contract would be different.

I'm not sure minification gets you anything special here? If we store the code compressed, we'd get better storage savings than minification. Also, minification won't improve execution speed nearly as well as something like byte-compiling the Clarity code. So if either of these are goals -- reduced storage and execution time -- we'd probably want to explore other tactics besides minification.

> On top of the above, would it be possible to further optimize this to only retrieve called function(s) + variables?

Yes, I think this could be done. Again, changing the associated cost functions will be an involved process, but the node implementation could be changed to do this without breaking anything.

> Given the above point, could nodes in some way opt-out from storing the "original" (non-minimized) versions of contracts?

No, this is neither possible nor desirable. Contracts are part of the blocks, and all nodes must store all blocks in order to ensure that the system remains resilient to unpredictable node churn and network partitions.

> Benchmarking a (larger) sample of existing Clarity contracts against different compression algorithms. LZ4 is across-the-board a winner. But perhaps for read-heavy workloads there is an algorithm that is slower but more efficient on the compression-side, but cheaper on the read-side? Maybe zstd?

Yeah, we'd want to do this before picking a default compression algorithm. However, the choice of compression algorithm is only necessary once the cost of loading the contract from source is reduced to the cost of loading the compressed representation (i.e. by changing the cost function). The compression algorithm implementation would need to be deterministic and would almost certainly need to be vendored into the codebase to ensure that all nodes compress contracts to the exact same number of bytes.

> Some sort of header likely needs to be introduced for storing/retrieving (on a node-level) so that the node can determine upon read how to handle the data.

This is kinda-sorta done with the analysis DB, but as you can see from the code comments, it's very coarse-grained at this time.
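
For illustration only, a finer-grained header could be as small as a one-byte tag in front of the stored payload. The sketch below is in Python and does not describe how the analysis DB works today: the tag values `RAW` and `ZLIB` and the function names are hypothetical, and `zlib` is a standard-library stand-in for the real codec.

```python
import zlib

RAW = 0x00   # hypothetical tag: payload is uncompressed UTF-8
ZLIB = 0x01  # hypothetical tag: payload is zlib-compressed UTF-8


def encode_record(source: str, codec: int = ZLIB) -> bytes:
    """Prefix the payload with a one-byte tag naming its encoding."""
    data = source.encode("utf-8")
    if codec == RAW:
        return bytes([RAW]) + data
    if codec == ZLIB:
        return bytes([ZLIB]) + zlib.compress(data)
    raise ValueError(f"unknown codec tag: {codec}")


def decode_record(record: bytes) -> str:
    """Read the tag, then decode the payload accordingly."""
    if not record:
        raise ValueError("empty record")
    tag, payload = record[0], record[1:]
    if tag == RAW:
        return payload.decode("utf-8")
    if tag == ZLIB:
        return zlib.decompress(payload).decode("utf-8")
    raise ValueError(f"unknown codec tag: {tag}")
```

With a tag like this, rows written before compression was introduced and rows written after it could coexist in the same table, and a new codec could be added later without rewriting old rows.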

@cylewitruk
Member

cylewitruk commented Nov 21, 2022

> > It would be interesting to explore the separation of storage of a minimized & execution-optimized version of the contract, vs. the contract in its original form (either completely separately, or diff'ed). Rationale being that the contract, once published, will mostly be read for execution. The stacks API could read from endpoints that specifically retrieve the originally-formatted version. This would imply that the costs for storing and subsequently reading a contract would be different.

> I'm not sure minification gets you anything special here? If we store the code compressed, we'd get better storage savings than minification. Also, minification won't improve execution speed nearly as well as something like byte-compiling the Clarity code. So if either of these are goals -- reduced storage and execution time -- we'd probably want to explore other tactics besides minification.

I had written this before I had a better understanding of how things worked - I had thought the contracts were loaded as plain-text and parsed again when pulled out, but now I see that's not the case :) So this point can be ignored.

> Yeah, we'd want to do this before picking a default compression algorithm. However, the choice of compression algorithm is only necessary once the cost of loading the contract from source is reduced to the cost of loading the compressed representation (i.e. by changing the cost function). The compression algorithm implementation would need to be deterministic and would almost certainly need to be vendored into the codebase to ensure that all nodes compress contracts to the exact same number of bytes.

My quick local (and unscientific) tests on both lz4 and zstd, looking only at compression efficiency, were:

  • lz4: down to 8% of uncompressed size (using defaults)
  • zstd: down to 3% of uncompressed size (using defaults)
    (for uncompressed contract sources indexed by boomcrypto)
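
A comparison along these lines could be sketched as below. This is not the script used for the numbers above. It is written in Python, the name `compression_ratios` is hypothetical, and because lz4 and zstd bindings are third-party packages that may not be installed, the standard-library codecs `zlib`, `bz2` and `lzma` are used as stand-ins. Each contract is compressed on its own, which is an assumption about how per-contract storage would behave.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Iterable

# Stand-in codecs, each with its default settings.
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(sources: Iterable[str]) -> Dict[str, float]:
    """Return compressed size / uncompressed size per codec over all sources."""
    blobs = [s.encode("utf-8") for s in sources]
    raw_total = sum(len(b) for b in blobs)
    if raw_total == 0:
        raise ValueError("need at least one non-empty source")
    return {
        name: sum(len(compress(b)) for b in blobs) / raw_total
        for name, compress in CODECS.items()
    }
```

A ratio of 0.08 would correspond to "down to 8% of uncompressed size". Measuring decompression time as well would be needed to answer the read-side question raised earlier in the thread.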

@LNow LNow closed this as completed Jan 19, 2023
Stacks Blockchain Board automation moved this from New Issues to Done Jan 19, 2023
@jcnelson jcnelson reopened this Jan 19, 2023
Stacks Blockchain Board automation moved this from Done to In progress Jan 19, 2023
@jcnelson
Member

This is something we'd like to do in the near future. @cylewitruk has graciously taken on the implementation effort.

@jcnelson jcnelson changed the title from [Clarity] Compress contracts before storing them and decompress on load, and bill the user only for the number of bytes in the compressed representation to [VM] Compress contracts before storing them and decompress on load, and bill the user only for the number of bytes in the compressed representation Feb 22, 2023
@jcnelson jcnelson added the L1 Working Group Issue or PR related to improving L1 label Feb 22, 2023
@jcnelson
Member

Assigning to @obycode for now. Please feel free to re-assign.

@obycode obycode added the icebox Issues that are not being worked on label Jun 26, 2023
Projects
Stacks Blockchain Board
  
In progress
Status: Status: 🆕 New
Status: No status
Development

No branches or pull requests

5 participants