Roadmap to production-ready phase 0 launch #484

Open
zah opened this issue Oct 14, 2019 · 0 comments

This roadmap attempts to cover all areas where we'll have to spend time before being ready to publish a release of the beacon node intended for use on mainnet with staked ETH.

Profiling and Optimization

Shipping a production-ready phase 0 client will require us to code-freeze large parts of our codebase while the components are subjected to fuzzing and a security audit. This forces us to prioritize our optimization efforts, so the code can get closer to its final state earlier. As an initial goal, we need to develop a better understanding of the optimization gains that can be achieved in each subsystem, as this knowledge will help us develop a better Fuzzing and Auditing roadmap.

  • Obtain initial profiling results for state sim and network sim that will guide our optimization efforts. Coz and Perf's flame graphs have been identified as the most promising initial reports to generate.

We have identified a number of possible optimizations. Their relative importance is to be determined through the profiling efforts:

  • Use hardware-accelerated SHA256 whenever possible.

  • Optimize SSZ hashing for multi-chunk objects of known size.

  • Switch to an optimized BLS implementation once the final specification is ready.
    https://github.com/herumi/bls is emerging as the fastest implementation.

  • State transition refactoring.
    Ideas may be taken from protolambda's optimized Go implementation and implementation notes. Breaking up the state objects will depend on the support for inlined objects in nim-serialization.

  • Avoid excessive state copying by implementing a copy-on-write scheme for some of the consensus objects.
    All mutations can be handled in a way that mimics the behavior of persistent data structures such as the Trie. The syntax for accessing the fields will remain the same through the use of automatically generated template setters and getters. Under the hood, the shared objects will be allocated on the heap and will be reference counted in an acyclic fashion.

  • Implement memory accounting.
    We should be able to track the origin of all allocated memory and to measure the memory footprint of each subsystem. This can be achieved through an algorithm similar to the GC's marking phase that counts all transitively reachable allocations from sets of root objects that have been explicitly marked as belonging to a particular subsystem.

  • Explore different strategies for invalid transition roll-back.
    When the state transition function fails (e.g. on an invalid block), it may leave the BeaconState object in a partially modified state. In such situations, we need to be able to quickly roll back all modifications. The copy-on-write scheme will allow us to solve this problem while also allowing us to have multiple "cached" copies of the state sharing portions of their memory. Another possible approach is to introduce a BeaconStateMutation object as an additional parameter to all state transition functions. The code will have to be refactored to execute all writes over this object and to perform reads from it where appropriate. After the state transition is validated, we'll be able to apply the final mutation to the starting state.

  • Implement the optimized LMD-GHOST fork-choice.

  • Implement the optimized shuffling.
    The shuffling should be executed only once per epoch and cached in an appropriate way. This approach can exploit a significantly faster shuffling implementation developed by protolambda.

  • Implement an optimized slasher (stretch goal).
    Optimized algorithms have been explored in protolambda's eth2-surround.

  • Implement profit maximisation (stretch goal)
    More details about this problem were given in Lighthouse's presentation during the Eth2 clients summit at Devcon 5.
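
The copy-on-write and roll-back ideas above can be illustrated with a small Python sketch (Python is used only for illustration; the real code would be Nim). `State`, `Validator`, `set_balance`, `apply_block` and `try_transition` are hypothetical names and not part of the codebase. The sketch assumes immutable objects: a "mutation" creates a new state that shares every untouched validator with the old one, so rolling back a failed transition is just a matter of keeping the old reference.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Validator:
    # Hypothetical, heavily reduced stand-in for a consensus object.
    pubkey: str
    balance: int


@dataclass(frozen=True)
class State:
    # Hypothetical stand-in for BeaconState. Fields are immutable, so a
    # "mutation" returns a new State that shares the untouched parts.
    slot: int
    validators: Tuple[Validator, ...]


def set_balance(state: State, index: int, balance: int) -> State:
    """Return a new state in which only validator `index` is replaced."""
    updated = replace(state.validators[index], balance=balance)
    validators = (
        state.validators[:index] + (updated,) + state.validators[index + 1:]
    )
    return replace(state, validators=validators)


def apply_block(state: State, rewards: Dict[int, int]) -> State:
    """Toy transition: advance the slot and credit rewards.

    Raises ValueError on invalid input, possibly after some writes have
    already been applied to the new (not yet published) state.
    """
    new_state = replace(state, slot=state.slot + 1)
    for index, amount in rewards.items():
        if amount < 0:
            raise ValueError("invalid block: negative reward")
        current = new_state.validators[index].balance
        new_state = set_balance(new_state, index, current + amount)
    return new_state


def try_transition(state: State, rewards: Dict[int, int]) -> State:
    """Roll back by keeping the old reference when the transition fails."""
    try:
        return apply_block(state, rewards)
    except ValueError:
        return state
```

The tuple of references is still copied on every write here; a tree-shaped persistent structure would be needed to make that step cheap for a large validator set.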

Over time, we should start profiling on all of our target platforms:

  • Prepare guides intended for all contributors (core and external) explaining how our code can be profiled on each targeted platform.

ETH1 Integration

  • Implement an HTTPS client to allow nim-web3 to interact with the public Gorli end-points.

  • Refactor the monitor to function as an isolated loop storing the latest ETH1 data in an easily accessible variable (offering non-async reading).

  • Support creating a genesis state from a pre-determined set of validators and a particular ETH1 head block.

  • Allow the monitoring to be started from any genesis state and any ETH1 validator contract (the monitor will follow the events starting from the ETH1 block referenced in the genesis state).
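
As a rough illustration of the "isolated loop" idea for the monitor, here is a minimal Python sketch (illustration only; the real monitor is Nim code). `Eth1Monitor`, `Eth1Data` and the `fetch` callback are hypothetical names, and the polling interval is an arbitrary assumption. The loop is the only writer of the latest value, and readers access it through a plain synchronous property.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class Eth1Data:
    # Hypothetical snapshot of what the monitor has observed so far.
    block_number: int
    deposit_count: int


class Eth1Monitor:
    def __init__(
        self,
        fetch: Callable[[], Awaitable[Eth1Data]],
        interval: float = 1.0,
    ) -> None:
        self._fetch = fetch  # hypothetical coroutine that polls the ETH1 node
        self._interval = interval
        self._latest: Optional[Eth1Data] = None

    @property
    def latest(self) -> Optional[Eth1Data]:
        """Plain synchronous read; never awaits and never blocks."""
        return self._latest

    async def run(self) -> None:
        """Isolated polling loop; the only writer of `_latest`."""
        while True:
            try:
                self._latest = await self._fetch()
            except ConnectionError:
                pass  # keep serving the last known value
            await asyncio.sleep(self._interval)
```

Because the snapshot object is immutable and replaced as a whole, a reader always sees a consistent value, even if it is slightly stale.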

Discovery V5

LibP2P

Our progress towards implementing a native Nim LibP2P implementation is tracked in the LibP2P roadmap.

  • Integrate the native LibP2P implementation in the beacon node.

Spec compliance

  • Implement a proper SszList type.

Attestation aggregation and fork choice

  • Introduce per-shard GossipSub topics.

  • Address the plethora of to-do items in attestation_pool and block_pool.

  • Validate blocks and attestations before re-transmitting them.

  • Implement the slashing conditions.

  • Prune obsolete information from the database after finalization.

  • Create a stand-alone slasher?
    This may be desired by people and institutions trying to secure the network.

Database

  • Implement a fast random-accessible memory-mappable append-only database to store all finalized blocks.

  • Add support for all SSZ field types in the SSZ navigator.

  • Refactor the code accessing the database to use an SSZ navigator created over the memory-mapped data.

  • Keep the incoming blocks and attestations as byte blobs and use SSZ navigators to access their contents.

  • Persist the hot block pool data to RocksDB?
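
To make the append-only, memory-mappable store more concrete, here is a small Python sketch (illustration only, not a proposed design). The `AppendOnlyStore` name, the two-file layout and the `(offset, length)` index entries are all assumptions. Records are only ever appended, and a read is one lookup in the mapped index followed by one slice of the mapped data file.

```python
import mmap
import os
import struct

# One fixed-size index entry per record: (offset, length). Assumed layout.
ENTRY = struct.Struct("<QQ")


class AppendOnlyStore:
    def __init__(self, path: str) -> None:
        self._data_path = path + ".data"
        self._index_path = path + ".index"
        for p in (self._data_path, self._index_path):
            open(p, "ab").close()  # create if missing, never truncate

    def append(self, blob: bytes) -> int:
        """Append a non-empty record and return its sequence number."""
        offset = os.path.getsize(self._data_path)
        with open(self._data_path, "ab") as data:
            data.write(blob)
        with open(self._index_path, "ab") as index:
            index.write(ENTRY.pack(offset, len(blob)))
        return os.path.getsize(self._index_path) // ENTRY.size - 1

    def get(self, seq: int) -> bytes:
        """Random access through read-only memory maps of both files."""
        with open(self._index_path, "rb") as index_file, \
                open(self._data_path, "rb") as data_file:
            with mmap.mmap(index_file.fileno(), 0,
                           access=mmap.ACCESS_READ) as index, \
                    mmap.mmap(data_file.fileno(), 0,
                              access=mmap.ACCESS_READ) as data:
                offset, length = ENTRY.unpack_from(index, seq * ENTRY.size)
                return data[offset:offset + length]
```

A real implementation would keep the maps open between reads and hand out views into the mapped memory instead of copying, which is what would let an SSZ navigator work directly over the stored bytes.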

Beacon chain syncing

  • Implement a request manager.

    • Keeps track of the observed latency and throughput for each peer. Maintains a persistent peer score.
    • Performs load balancing based on the peer score.
    • Merges multiple requests asking for the same data.
  • Don't initiate sync with all peers in onPeerConnected.
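
The request manager responsibilities listed above could look roughly like the following Python sketch (illustration only). `RequestManager`, its methods and the `send` transport callback are hypothetical names. The sketch assumes the peer score is simply a smoothed latency; throughput tracking and persistence of the score are left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Hashable


class RequestManager:
    def __init__(
        self, send: Callable[[str, Hashable], Awaitable[bytes]]
    ) -> None:
        self._send = send  # hypothetical transport: (peer, key) -> data
        self._latency: Dict[str, float] = {}  # smoothed seconds per peer
        self._in_flight: Dict[Hashable, "asyncio.Future[bytes]"] = {}

    def add_peer(self, peer: str) -> None:
        # Unknown peers start at 0.0 so that they get tried at least once.
        self._latency.setdefault(peer, 0.0)

    def record(self, peer: str, seconds: float, alpha: float = 0.2) -> None:
        """Update the exponentially weighted moving average of latency."""
        old = self._latency.get(peer, seconds)
        self._latency[peer] = (1 - alpha) * old + alpha * seconds

    def best_peer(self) -> str:
        """Pick the peer with the lowest smoothed latency (needs >= 1 peer)."""
        return min(self._latency, key=self._latency.__getitem__)

    async def request(self, key: Hashable) -> bytes:
        """Concurrent callers asking for the same key share one request."""
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
        # shield() keeps the shared request alive if one caller is cancelled.
        return await asyncio.shield(task)

    async def _fetch(self, key: Hashable) -> bytes:
        peer = self.best_peer()
        loop = asyncio.get_running_loop()
        started = loop.time()
        try:
            return await self._send(peer, key)
        finally:
            self.record(peer, loop.time() - started)
            self._in_flight.pop(key, None)
```

Always choosing the single best peer is the simplest possible policy; weighted random selection over the scores would spread the load more evenly.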

RPC interface

  • Implement a production-ready HTTP server.

  • Implement a GraphQL query engine?

  • Update nim-json-rpc to use the new HTTP server.

  • Implement a REST server?
    This depends on the RPC standard that will be accepted by all clients. A REST requirement would involve adapting Jester or implementing something similar from scratch.

CLI and config files

  • Implement TOML or YAML support in nim-serialization (bountied)

  • Implement support for config files in Confutils

  • Production-ready help screens in Confutils

  • Validate and finish the shell auto-completion support supplied by Confutils

  • Develop setup procedures for installing the shell auto-completion

  • Windows registry support in Confutils?

Logging

  • Run-time log format choice in Chronicles.
  • Multi-line print outs of the exception stack traces in the textlines format
  • Better formatting properties in the human-readable formats
  • Logging to Windows event log?

Release management

The initial version of the beacon node will target Windows, macOS and Linux.

  • Develop a Windows installer/uninstaller.
  • Develop a macOS installer.
  • Provide Linux/Unix packages (for all popular distros plus Raspbian and OpenWRT).
    A tool like FPM can help us create packages in multiple formats such as: deb, rpm, pacman, snap, macOS's pkg, freebsd, solaris and plain binaries.
  • Publish a homebrew recipe.
  • Publish a nixpkgs package.
  • Publish Gentoo build files.
  • Publish a Chocolatey package.

All published binaries should be archived together with their debug symbols, so we are able to receive and investigate crash reports featuring memory dumps.

  • Create a debug symbol repository for handling minidumps
  • Add an opt-in crash reporter (Google breakpad)

Testnet improvements

  • Enable metrics collection in the testnet docker builds.

  • Update the Ansible cookbooks for gathering metrics.

  • Define useful alerts.

  • Remove the curl dependency.
    This requires a functioning HTTPS client.

  • Guide the user through downloading a trusted state snapshot when their client hasn't been connected to the network since the beginning of the weak subjectivity period.
    May require a functioning HTTPS client.

  • Implement a testnet health page.

  • Develop a basic block explorer (may be outsourced to another team).

  • Develop opt-in telemetry?

  • Develop analytics for multi-client testnet and mainnet.
    These can be based on tracking the graffiti data in the network, which we may populate in a certain way by default.

Beacon node web UI

TBD

Validator client

  • Implement the validator client API in the beacon node.
    Requires a production-ready HTTP server.

  • Create a validator client binary and define its CLI.

  • Implement a keystore.
    A keystore proposal has been submitted here, although we've been advocating for the reuse of the keystore of ETH1.

  • Develop a validator client web UI.
    Requires a production-ready HTTPS server.

    • Authentication screens.
    • Status screen (showing number of validators, balance, etc).
    • Event log.
    • Deposit more / withdraw.
    • Slashing alerts.
  • Implement a validator on-boarding process.

    • Select existing beacon node or launch a new one connected to a particular ETH network
    • Select the ETH amount to stake.
    • Create one or more validator keys (matching the staked amount). Write them to a keystore.
    • Create deposit contract transactions.
    • Switch to validator status screen.
  • Create documentation for end-users.

Operational safety

  • Implement time synchronization.
    May use NTP or network adjusted timestamps.

  • Consider redundancy features.

  • Support graceful degradation in adverse scenarios.

    • failing hard-drive
    • failing network connections
    • lost connection to ETH1 client
    • lost connection to validator client
    • low-memory pressure
    • system-time modified
    • resuming from hibernation
  • Implement zero-downtime upgrade procedures.

  • Implement canary deployments and quick roll-back procedures.

  • Develop and document back-up and restore procedures.

  • Develop contingency plans.
    For potentially discovered vulnerabilities, instabilities, etc.

Security Audit

Before we are ready to ship, we must undergo a security audit. Please see our Fuzzing and Auditing roadmap for more details.

Market research and business development

  • Learn more about the future users of the beacon node.
    We can benefit from a better "client profile" for our potential users. Who are the people interested in becoming early adopters as validators? What are their primary concerns and how do they plan to select the software being used? The answers can affect which features should be prioritized and what content should be highlighted on the Nimbus web-site and elsewhere. Any established connections with our future users will also be a valuable resource for improving the usability of the product.

  • Identify and engage potential hardware vendor partners.
    The beacon node can be shipped in routers and other appliances as an attractive value add for consumers looking to earn money as validators. A Nimbus-based light client could be used in the future in POS terminals. If we provide enough value in managing the development and the necessary upgrade procedures for such partners, this can provide an ongoing revenue stream for our team.

  • Explore partnerships with hosting providers.
    We can seek out hosting companies that can offer ideal setups for running a beacon node. This can include everything we recommend in our Redundancy setup as well as custom hardware that better secures the validator keys that must be kept in memory.

Web-site improvements

Before the official launch we'll need to prepare a significant amount of new content for the Nimbus web-site. The content should achieve the following goals (wip, more to be added):

  • Promote the product to potential validators

    • Prepare benchmarks and comparative studies against the other clients (e.g. the minimum will be a table listing various features).
  • Help you get started

    • Implement host OS detection for correct download and installation instructions.
    • Write a quick-start tutorial.
      Depends on Release management
  • Help you learn how to operate the software optimally.

    • Reference documentation for the CLI and the config files.
    • Create a set of documents outlining the best practices for monitoring, backups, security considerations, redundancy setups, etc.
  • Help you solve problems

    • Set up a dedicated discussion board or Q/A site.

    • Develop FAQ / knowledge base section.

    • Provide search (Google may be enough)

  • Promote the industrial applications of Nimbus

  • Promote commercial support plans for validators?

  • Promote the suitability of Nimbus for custom development:

    • Provide documentation for all Nim stand-alone libraries

Localisation

Market research can demonstrate that implementing a localized client can significantly increase our popularity in certain markets.

  • Develop a GetText-based Nim localization library gathering strings for localization at compile-time.
  • Add localization support in Chronicles.
  • Create localization projects in Transifex or a similar system.
  • Add localization support in the documentation generator.
  • Add localization support in the Nimbus web-site.
@zah zah pinned this issue Oct 14, 2019